-# Exporting popular LLMs out of the box
+# Exporting LLMs

Instead of requiring you to manually write code to call `torch.export()`, use ExecuTorch's assortment of lowering APIs, or even interact with TorchAO `quantize_` APIs for quantization, we provide an out-of-the-box experience that performantly exports a selection of supported models to ExecuTorch.

@@ -42,6 +42,9 @@ We only require manually specifying a checkpoint path for the Llama model family |

For the other supported LLMs, the checkpoint will be downloaded from HuggingFace automatically, and the param files can be found in their respective directories under `executorch/examples/models`, for instance `executorch/examples/models/qwen3/config/0_6b_config.json`.

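+For instance, a minimal sketch of the `base` section of a YAML config for Qwen3 0.6B (the `model_class` value is assumed for illustration; check the directories under `executorch/examples/models` for the exact supported names):
+
+```yaml
+base:
+  model_class: qwen3_0_6b  # assumed name, for illustration
+  params: executorch/examples/models/qwen3/config/0_6b_config.json
+```
+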
+## Export settings
+[ExportConfig](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py) contains settings for the exported `.pte` file, such as `max_seq_length` (the maximum prompt length) and `max_context_length` (the maximum length of the model's memory/cache).
+
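+For example, a sketch of these settings, assuming they sit under the `export:` section of the YAML config (mirroring `ExportConfig`):
+
+```yaml
+export:
+  max_seq_length: 128       # maximum prompt length, in tokens
+  max_context_length: 1024  # size of the model's memory/cache, in tokens
+```
+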
## Adding optimizations
`export_llm` performs a variety of optimizations to the model before export, during export, and during lowering. Quantization and delegation to accelerator backends are the main ones and will be covered in the next two sections. All other optimizations can be found under [`ModelConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L120). We will go ahead and add a few, as sketched below.

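+For illustration, a hedged sketch of enabling a couple of `ModelConfig` optimizations in the YAML config (field names taken from `llm_config.py`; values are illustrative):
+
+```yaml
+model:
+  use_kv_cache: True            # cache attention keys/values across decode steps
+  use_sdpa_with_kv_cache: True  # custom SDPA op that reads from the KV cache
+  dtype_override: fp32          # dtype to run the model in
+```
+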
@@ -81,6 +84,9 @@ Common ones to use are: |
- `8da4w`: short for int8 dynamic activation + int4 weight quantization.
- `int8`: int8 weight-only quantization.

+Group size is specified with `group_size`, e.g. 8, 32, 64.
+
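+For example, a sketch of the `quantization` section of the YAML config (field names from `llm_config.py`; values are illustrative):
+
+```yaml
+quantization:
+  qmode: 8da4w   # int8 dynamic activation + int4 weight
+  group_size: 32 # quantize weights in groups of 32
+```
+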
For Arm CPUs, there are also [low-bit kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for int8 dynamic activation + int[1-8] weight quantization. Note that these should not be used alongside XNNPACK, and experimentally we have found that they can sometimes even outperform the equivalent `8da4w`. To use them, set `qmode` to one of the following (see the sketch after this list):
- `torchao:8da(\d+)w`: int8 dynamic activation + int[1-8] weights, for example `torchao:8da5w`
- `torchao:fpa(\d+)w`: int[1-8] weight only, for example `torchao:fpa4w`
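+
+A sketch of selecting the low-bit kernels instead, with XNNPACK left disabled per the caveat above (again assuming the `quantization`/`backend` field layout from `llm_config.py`):
+
+```yaml
+quantization:
+  qmode: "torchao:8da5w"  # int8 dynamic activation + int5 weights
+backend:
+  xnnpack:
+    enabled: False  # low-bit kernels should not be combined with XNNPACK
+```
+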
@@ -156,7 +162,10 @@ python -m extension.llm.export.export_llm \ |
  --config path/to/config.yaml
```

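+As a reference point for what delegation settings might look like in that config, a hedged sketch of the XNNPACK `backend` section (field names from `llm_config.py`):
+
+```yaml
+backend:
+  xnnpack:
+    enabled: True
+    extended_ops: True  # delegate a wider set of ops to XNNPACK
+```
+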
-In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated. Here is an example:
+In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated.
+
+Here is an example:
+<details>
```
Total delegated subgraphs: 368
Number of delegated nodes: 2588
@@ -251,6 +260,8 @@ Number of non-delegated nodes: 2513 |
│ 42 │ Total │ 2588 │ 2513 │
╘════╧═══════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛
```
+</details>
+<br/>

To do further performance analysis, you may opt to use [ExecuTorch's Inspector APIs](https://docs.pytorch.org/executorch/stable/llm/getting-started.html#performance-analysis) to do things such as trace individual operator performance back to source code, view memory planning, and debug intermediate activations. To generate the ETRecord necessary for the Inspector APIs to link back to source code, you can use:
