
Commit fc2a86a

update links
1 parent 32534c2 commit fc2a86a

6 files changed: +14 / -14 lines changed

docs/README.md

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ The current version of PyTorch is ${executorch_version:pytorch}.
 
 This will result in the following output:
 
-<img src="./source/_static/img/s_custom_variables_extension.png" width="300">
+<img src="source/_static/img/s_custom_variables_extension.png" width="300">
 
 Right now we only support PyTorch version as custom variable, but will support others in the future.
 
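The `${executorch_version:pytorch}` placeholder in the hunk header above is expanded by the docs build's custom-variables extension. As a rough illustration of the substitution step only (the variable table, value, and function name here are hypothetical, not the extension's real API):

```python
import re

# Hypothetical variable table; the real extension pulls these from the docs build config.
VARIABLES = {"executorch_version:pytorch": "2.5.0"}  # placeholder value

def resolve_custom_variables(text: str) -> str:
    """Replace ${name} placeholders with configured values, leaving unknown names intact."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: VARIABLES.get(m.group(1), m.group(0)), text)

print(resolve_custom_variables("The current version of PyTorch is ${executorch_version:pytorch}."))
# -> The current version of PyTorch is 2.5.0.
```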

docs/source/compiler-delegate-and-partitioner.md

Lines changed: 5 additions & 5 deletions
@@ -28,7 +28,7 @@ A delegate backend implementation is composed of:
 
 The diagram looks like following
 
-<img src="./_static/img/backend_interface.png" alt="drawing" style="width:600px;"/>
+<img src="_static/img/backend_interface.png" alt="drawing" style="width:600px;"/>
 
 **Figure 1.** A high level of entry points for backend interfaces, including both ahead-of-time and runtime.
 

@@ -71,7 +71,7 @@ parsed and executed at runtime.
 
 The diagram looks like following
 
-<img src="./_static/img/backend_interface_aot.png" alt="drawing" style="width:800px;"/>
+<img src="_static/img/backend_interface_aot.png" alt="drawing" style="width:800px;"/>
 
 **Figure 2.** The graph goes through partition and each subgraph will be sent to the preprocess part.
 
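Since this hunk's context mentions the preprocess entry point, here is a minimal ahead-of-time backend sketch against ExecuTorch's `BackendDetails` interface; treat the exact import paths and the payload format as assumptions that may differ across versions:

```python
from typing import List

from torch.export import ExportedProgram

# Import paths follow ExecuTorch's AoT backend interface; they may vary by version.
from executorch.exir.backend.backend_details import BackendDetails, PreprocessResult
from executorch.exir.backend.compile_spec_schema import CompileSpec


class MyBackend(BackendDetails):
    @staticmethod
    def preprocess(
        edge_program: ExportedProgram, compile_specs: List[CompileSpec]
    ) -> PreprocessResult:
        # Serialize the partitioned subgraph into the blob that the runtime
        # backend will parse and execute. The payload below is a placeholder,
        # not a real compiled format.
        ops = [
            str(node.target)
            for node in edge_program.graph.nodes
            if node.op == "call_function"
        ]
        return PreprocessResult(processed_bytes=";".join(ops).encode("utf-8"))
```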

@@ -107,7 +107,7 @@ virtual void destroy(ET_UNUSED DelegateHandle* handle);
 
 The diagram looks like following
 
-<img src="./_static/img/backend_interface_runtime.png" alt="drawing" style="width:600px;"/>
+<img src="_static/img/backend_interface_runtime.png" alt="drawing" style="width:600px;"/>
 
 **Figure 3.** The relationship between standard ExecuTorch Runtime and backend entry point.
 
@@ -137,12 +137,12 @@ In order to provide consistent debugging experience to users, regardless of the
 
 By leveraging the debug identifier, backend developer can embed the debug as part of the delegated blob
 
-<img src="./_static/img/backend_debug_handle.png" alt="drawing" style="width:600px;"/>
+<img src="_static/img/backend_debug_handle.png" alt="drawing" style="width:600px;"/>
 
 In this way, during execute stage, with the debug identifier, backend developer can associate the failed instruction inside the delegate
 back to the exact line of Python code.
 
-<img src="./_static/img/backend_debug_handle_example.png" alt="drawing" style="width:700px;"/>
+<img src="_static/img/backend_debug_handle_example.png" alt="drawing" style="width:700px;"/>
 
 ## Common Questions
 
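As a sketch of the idea in this hunk (the blob layout below is invented for illustration; the real format is backend-specific): the backend can serialize a map from its own instruction indices to debug handles next to its compiled code, then consult it at execute time to report which original instruction failed.

```python
import json

def pack_blob(instructions: list[bytes], debug_handles: list[int]) -> bytes:
    # Store the instruction-index -> debug-handle map in a small header
    # ahead of the compiled instructions.
    header = json.dumps({"debug_handles": debug_handles}).encode("utf-8")
    return len(header).to_bytes(4, "little") + header + b"".join(instructions)

def debug_handle_for_failure(blob: bytes, failed_instruction: int) -> int:
    header_len = int.from_bytes(blob[:4], "little")
    header = json.loads(blob[4 : 4 + header_len])
    # The returned debug handle is what the tooling maps back to the
    # originating line of Python code.
    return header["debug_handles"][failed_instruction]
```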

examples/demo-apps/react-native/rnllama/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # React Native Llama
 
 <p align="center">
-<img src="./assets/images/rnllama.png" width="200" alt="rnllama Logo">
+<img src="assets/images/rnllama.png" width="200" alt="rnllama Logo">
 </p>
 
 A React Native mobile application for running LLaMA language models using ExecuTorch. This example is for iOS only for now.

examples/models/llama/README.md

Lines changed: 3 additions & 3 deletions
@@ -80,12 +80,12 @@ Llama 3.2 1B and 3B performance was measured on Android OnePlus 12 device. The p
 <table>
 <tr>
 <td>
-<img src="./Android3_2_1B_bf16.gif" width="300">
+<img src="Android3_2_1B_bf16.gif" width="300">
 <br>
 <em> Llama3.2 1B, unquantized, BF16 on Android phone. </em>
 </td>
 <td>
-<img src="./Android3_2_3B_SpinQuant.gif" width="300">
+<img src="Android3_2_3B_SpinQuant.gif" width="300">
 <br>
 <em>
 Llama3.2 3B, 4bit quantized (SpinQuant) on Android phone
@@ -129,7 +129,7 @@ Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus
 
 <p align="center">
 <br>
-<img src="./llama_via_xnnpack.gif" width=300>
+<img src="llama_via_xnnpack.gif" width=300>
 <br>
 <em>
 Llama3.1 8B, 4bit quantized on Android phone

examples/models/llava/README.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ huggingface page [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llav
 
 
 <p align="center">
-<img src="./llava_via_xnnpack.gif" width=300>
+<img src="llava_via_xnnpack.gif" width=300>
 <br>
 <em>
 Running Llava1.5 7B on Android phone

examples/qualcomm/oss_scripts/llama/README.md

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ Hybrid Mode: Hybrid mode leverages the strengths of both AR-N model and KV cache
 - AR-N model: The auto-regression (AR) length determines the number of tokens to consume and the number of logits to produce. Use it to process the prompt and generate the key-value (kv) cache, which serves as a prompt processor in hybrid mode.
 - Prompt processing with AR-N model:
 <figure>
-<img src="./assets/PromptProcessingWithARN.png" alt="Prompt Processing With AR-N Model">
+<img src="assets/PromptProcessingWithARN.png" alt="Prompt Processing With AR-N Model">
 <figcaption>Prompt processing is done using a for-loop. An N-token block is taken, and the KV cache is updated for that block. This process is repeated until all tokens are consumed, with the last block potentially requiring padding. For flexibility, the AR-N model can handle any input length less than the maximum sequence length. For TTFT, the input length (or number of blocks) will vary depending on the actual input length, rather than always being the same.
 </figcaption>
 </figure>
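The for-loop described in that caption can be sketched in a few lines of Python; `run_block` stands in for the real AR-N forward pass and `PAD_TOKEN` for the model's actual padding id:

```python
PAD_TOKEN = 0  # hypothetical padding id

def process_prompt(tokens: list[int], ar_len: int, run_block) -> None:
    """Consume the prompt in AR-length blocks, updating the KV cache per block."""
    for start in range(0, len(tokens), ar_len):
        block = tokens[start : start + ar_len]
        block = block + [PAD_TOKEN] * (ar_len - len(block))  # pad only the last block
        run_block(block)  # one forward pass: consumes ar_len tokens, updates the KV cache

# A 10-token prompt with AR length 4 runs ceil(10 / 4) = 3 blocks.
process_prompt(list(range(10)), 4, print)
```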
@@ -70,14 +70,14 @@ We have two distinct mechanisms for updating the key-value (KV) cache, which can
 #### Shift Pointer mechanism
 
 <figure>
-<img src="./assets/ShiftPointer.png" alt="Shift Pointer mechanism"> <figcaption>
+<img src="assets/ShiftPointer.png" alt="Shift Pointer mechanism"> <figcaption>
 The figure illustrates the process of updating the key and value caches during each inference step. In key cache update process, we initially allocate memory for each layer with <code>num_head</code> size of <code>(head_dim + 1) * (seq_len - 1)</code>. After a single inference, the new key cache is copied from the key output pointer <code>k_out</code> and appended to the key cache. Subsequently, the buffer start pointer of the key cache <code>k_in</code> moves to the next token, making the previous position of the buffer start pointer unused. This process is repeated for each subsequent inference step.
 For the value cache update process, we first allocate a contiguous memory of size <code>(num_head + 1) * head_dim * (seq_len - 1)</code> for each layer, with the last head reserved for I/O shifting. After the first inference, the cache is updated by simply shifting the pointers of all heads to the next token position, making only the previous <code>head_dim * 1</code> section of the buffer start pointer <code>v_in</code> of the first head unused. This process is repeated for each subsequent inference step.</figcaption>
 </figure>
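A much-simplified sketch of the key-cache side of this mechanism, with a moving start index standing in for the buffer pointer <code>k_in</code> (the real QNN buffer layout and strides are more involved):

```python
class ShiftPointerKeyCache:
    """One head's key cache: append the new key, then shift the start pointer."""

    def __init__(self, head_dim: int, seq_len: int):
        # Per the caption: (head_dim + 1) * (seq_len - 1) elements per head,
        # i.e. head_dim per token plus one element of shift slack per step.
        self.buf = [0.0] * ((head_dim + 1) * (seq_len - 1))
        self.head_dim = head_dim
        self.k_in = 0      # stand-in for the buffer start pointer
        self.n_tokens = 0

    def append(self, k_out: list[float]) -> None:
        end = self.k_in + self.n_tokens * self.head_dim
        self.buf[end : end + self.head_dim] = k_out  # copy from the key output
        self.n_tokens += 1
        self.k_in += 1     # shift forward, leaving the old start position unused
```

After `t` steps this uses `t * (head_dim + 1)` elements, which is why the allocation above suffices for up to `seq_len - 1` tokens.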

 #### Smart Mask mechanism:
 <figure>
-<img src="./assets/SmartMask.png" alt="Smart Mask mechanism">
+<img src="assets/SmartMask.png" alt="Smart Mask mechanism">
 <figcaption>The Smart Mask mechanism streamlines the process of updating tokens in the cache. Unlike the Shift Pointer mechanism, which requires moving the buffer start pointer <code>k_in</code>/<code>v_in</code> of the cache, the Smart Mask mechanism updates only the new token at the specified position. This approach eliminates the need to adjust the buffer start pointer. This mechanism is beneficial for shared buffers but requires CPU memory copying. </figcaption>
 </figure>
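In code, the Smart Mask update is just an in-place write at a fixed position; the shapes and mask convention below are simplified placeholders:

```python
def smart_mask_update(k_cache, v_cache, attn_mask, pos, k_new, v_new):
    k_cache[pos] = k_new      # write the new key at its fixed slot
    v_cache[pos] = v_new      # write the new value at its fixed slot
    attn_mask[pos] = 0.0      # unmask the slot; no start-pointer adjustment needed

# Usage: all cache slots start masked out until their token arrives.
seq_len, head_dim = 8, 4
k_cache = [[0.0] * head_dim for _ in range(seq_len - 1)]
v_cache = [[0.0] * head_dim for _ in range(seq_len - 1)]
attn_mask = [float("-inf")] * (seq_len - 1)
smart_mask_update(k_cache, v_cache, attn_mask, 0, [1.0] * head_dim, [2.0] * head_dim)
```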
