examples/qualcomm/oss_scripts/llama/README.md
@@ -14,7 +14,7 @@ Hybrid Mode: Hybrid mode leverages the strengths of both AR-N model and KV cache
- AR-N model: The auto-regression (AR) length N determines the number of tokens consumed and the number of logits produced in one inference. The model processes the prompt and generates the key-value (KV) cache, serving as the prompt processor in hybrid mode.
- Prompt processing with AR-N model:
<figure>
-    <img src="./assets/PromptProcessingWithARN.png" alt="Prompt Processing With AR-N Model">
+    <img src="assets/PromptProcessingWithARN.png" alt="Prompt Processing With AR-N Model">
<figcaption>Prompt processing is done using a for-loop: an N-token block is taken and the KV cache is updated for that block, repeating until all tokens are consumed, with the last block potentially requiring padding. For flexibility, the AR-N model can handle any input length less than the maximum sequence length. For time to first token (TTFT), the input length (and thus the number of blocks) varies with the actual prompt length rather than always being the same (see the sketch below the figure).
</figcaption>
</figure>
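
To make the for-loop concrete, here is a minimal Python sketch of the block-wise prompt processing described in the caption. `run_block` is a hypothetical stand-in for one AR-N inference (the actual runner invokes the compiled model); the names are illustrative, not the real API:

```python
def process_prompt(prompt_tokens, ar_len, pad_token, run_block):
    """Consume the prompt in N-token blocks, updating the KV cache per block.

    run_block: one AR-N inference; it consumes ar_len tokens, returns their
    logits, and appends the block's keys/values to the KV cache as a side
    effect. (Hypothetical callback standing in for the runner call.)
    """
    logits = None
    for start in range(0, len(prompt_tokens), ar_len):
        block = prompt_tokens[start:start + ar_len]
        if len(block) < ar_len:
            # The last block may be short: pad it up to the AR length.
            block = block + [pad_token] * (ar_len - len(block))
        logits = run_block(block)
    # Logits of the last real prompt token seed KV-cache-mode decoding.
    return logits
```

Because the loop runs `ceil(len(prompt_tokens) / ar_len)` times, a shorter prompt means fewer blocks and a lower TTFT, which is the flexibility the caption describes.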
@@ -70,14 +70,14 @@ We have two distinct mechanisms for updating the key-value (KV) cache, which can
The figure illustrates the process of updating the key and value caches during each inference step. In the key cache update process, we initially allocate <code>num_head</code> buffers of size <code>(head_dim + 1) * (seq_len - 1)</code> for each layer. After a single inference, the new key cache is copied from the key output pointer <code>k_out</code> and appended to the key cache. The buffer start pointer of the key cache, <code>k_in</code>, then moves to the next token, leaving its previous position in the buffer unused. This process is repeated for each subsequent inference step.
For the value cache update process, we first allocate a contiguous memory block of size <code>(num_head + 1) * head_dim * (seq_len - 1)</code> for each layer, with the last head reserved for I/O shifting. After the first inference, the cache is updated by simply shifting the pointers of all heads to the next token position, leaving only the previous <code>head_dim * 1</code> section at the buffer start pointer <code>v_in</code> of the first head unused. This process is repeated for each subsequent inference step (see the sketch below).</figcaption>
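
The pointer bookkeeping above can be sketched with flat NumPy buffers, where plain integer offsets play the role of <code>k_in</code>/<code>v_in</code>. This is a toy model of the layout just described, not the runner's actual C++ code, and the exact element indexing here is an assumption made for illustration:

```python
import numpy as np

num_head, head_dim, seq_len = 2, 4, 8  # toy sizes
win = seq_len - 1                      # token slots held by the cache

# Key cache: per layer, num_head buffers of (head_dim + 1) * win elements.
# The extra row of slack is what lets k_in advance one element per step.
k_buf = [np.zeros((head_dim + 1) * win, np.float32) for _ in range(num_head)]
k_in = [0] * num_head                  # buffer start pointers (offsets)

# Value cache: per layer, one contiguous (num_head + 1) * head_dim * win
# block, with the last head's worth of space reserved for I/O shifting.
v_buf = np.zeros((num_head + 1) * head_dim * win, np.float32)
v_in = [h * head_dim * win for h in range(num_head)]

def shift_pointer_update(k_out, v_out):
    """One decode step: append the new key/value, then shift the pointers."""
    for h in range(num_head):
        # Key: copy the new column from k_out into the tail of the next
        # transposed [head_dim, win] view, then advance k_in; the element
        # at its previous position becomes unused.
        for r in range(head_dim):
            k_buf[h][k_in[h] + 1 + r * win + (win - 1)] = k_out[h][r]
        k_in[h] += 1
        # Value: in the runner the model writes v_out at the tail of the
        # shifted window during inference, so the update is purely the
        # pointer shift; the write below just stands in for that.
        tail = v_in[h] + win * head_dim
        v_buf[tail:tail + head_dim] = v_out[h]
        v_in[h] += head_dim
```

Note the asymmetry the caption points out: the key update involves a copy from <code>k_out</code>, while the value update moves no data at all; only the offsets advance, at the cost of the extra head of slack memory.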
<figcaption>The Smart Mask mechanism streamlines updating tokens in the cache. Unlike the Shift Pointer mechanism, which moves the buffer start pointers <code>k_in</code>/<code>v_in</code>, Smart Mask updates only the new token at its specified position, so the buffer start pointers never change. This works well with shared buffers but requires CPU memory copying.</figcaption>
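
For contrast, here is the Smart Mask counterpart under the same toy shapes: the buffer start pointers never move; instead the new token is copied (on the CPU) into its fixed position and a mask slot is opened. The mask convention is assumed for illustration, not taken from the runner:

```python
import numpy as np

num_head, head_dim, win = 2, 4, 7      # same toy sizes as above

k_cache = np.zeros((num_head, head_dim, win), np.float32)  # fixed buffers,
v_cache = np.zeros((num_head, win, head_dim), np.float32)  # never shifted
valid = np.zeros(win, np.float32)      # assumed mask: 1.0 = slot is filled

def smart_mask_update(k_out, v_out, pos):
    """Write the new token at its position; only the mask advances."""
    for h in range(num_head):
        k_cache[h, :, pos] = k_out[h]  # CPU copy into the shared buffer
        v_cache[h, pos] = v_out[h]
    valid[pos] = 1.0                   # attention may now see position pos
```

Because nothing in the buffers moves, the cache can live in a shared buffer whose addresses stay stable across steps; the price is the per-step CPU copy mentioned in the caption.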