From d0bbcc6cff615d90530892d295e4596db0f502af Mon Sep 17 00:00:00 2001
From: Jack Zhang
Date: Fri, 4 Oct 2024 18:27:42 -0700
Subject: [PATCH] Release docs proofreading

---
 examples/models/phi-3-mini-lora/README.md |  2 +-
 extension/llm/README.md                   | 20 +++++++++++---------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/examples/models/phi-3-mini-lora/README.md b/examples/models/phi-3-mini-lora/README.md
index 92f23f137b4..987052dbf24 100644
--- a/examples/models/phi-3-mini-lora/README.md
+++ b/examples/models/phi-3-mini-lora/README.md
@@ -11,7 +11,7 @@ To see how you can use the model exported for training in a fully involved finet
 - `./examples/models/phi-3-mini-lora/install_requirements.sh`
 
 ### Step 3: Export and run the model
-1. Export the inferenace and training models to ExecuTorch.
+1. Export the inference and training models to ExecuTorch.
 ```
 python export_model.py
 ```
diff --git a/extension/llm/README.md b/extension/llm/README.md
index dfc193e41e1..ddcf4c727d2 100644
--- a/extension/llm/README.md
+++ b/extension/llm/README.md
@@ -2,8 +2,9 @@ This subtree contains libraries and utils of running generative AI, including La
 Below is a list of sub folders.
 ## export
 Model preparation codes are in _export_ folder. The main entry point is the _LLMEdgeManager_ class. It hosts a _torch.nn.Module_, with a list of methods that can be used to prepare the LLM model for ExecuTorch runtime.
-Note that ExecuTorch supports two [quantization APIs](https://pytorch.org/docs/stable/quantization.html#quantization-api-summary): eager mode quantization (aka source transform based quantization), and PyTorch 2 Export based quantization (aka pt2e quantization).
-Typical methods include:
+Note that ExecuTorch supports two [quantization APIs](https://pytorch.org/docs/stable/quantization.html#quantization-api-summary): eager mode quantization (aka source transform based quantization) and PyTorch 2 Export based quantization (aka pt2e quantization).
+
+Commonly used methods in this class include:
 - _set_output_dir_: where users want to save the exported .pte file.
 - _to_dtype_: override the data type of the module.
 - _source_transform_: execute a series of source transform passes. Some transform passes include
@@ -19,7 +20,7 @@ Typical methods include:
 
 Some usage of LLMEdgeManager can be found in executorch/examples/models/llama2, and executorch/examples/models/llava.
 
-When the .pte file is exported and saved, we can prepare a load and run it in a runner.
+When the .pte file is exported and saved, we can load and run it in a runner (see below).
 
 ## tokenizer
 Currently, we support two types of tokenizers: sentencepiece and Tiktoken.
@@ -28,20 +29,21 @@ Currently, we support two types of tokenizers: sentencepiece and Tiktoken.
 - _tokenizer.py_: rewrite a sentencepiece tokenizer model to a serialization format that the runtime can load.
 - In C++:
 - _tokenizer.h_: a simple tokenizer interface. Actual tokenizer classes can be implemented based on this. In this folder, we provide two tokenizer implementations:
-  - _bpe_tokenizer_. We need the rewritten version of tokenizer artifact (refer to _tokenizer.py_ above), for bpe tokenizer to work.
-  - _tiktokern_. It's for llama3 and llama3.1.
+  - _bpe_tokenizer_. Note: the rewritten tokenizer artifact (see _tokenizer.py_ above) is required for the bpe tokenizer to work.
+  - _tiktoken_. For llama3 and llama3.1.
 
 ## sampler
 A sampler class in C++ to sample the logistics given some hyperparameters.
 
 ## custom_ops
-It hosts a custom sdpa operator. This sdpa operator implements CPU flash attention, it avoids copies by taking the kv cache as one of the arguments to this custom operator.
-- _sdpa_with_kv_cache.py_, _op_sdpa_aot.cpp_: custom op definition in PyTorch with C++ registration.
-- _op_sdpa.cpp_: the optimized operator implementation and registration of _sdpa_with_kv_cache.out_.
+Contains custom ops, such as:
+- custom sdpa: implements CPU flash attention and avoids copies by taking the kv cache as one of its arguments.
+  - _sdpa_with_kv_cache.py_, _op_sdpa_aot.cpp_: custom op definition in PyTorch with C++ registration.
+  - _op_sdpa.cpp_: the optimized operator implementation and registration of _sdpa_with_kv_cache.out_.
 
 ## runner
 It hosts the libary components used in a C++ llm runner. Currently, it hosts _stats.h_ on runtime status like token numbers and latency.
-With the components above, an actual runner can be built for a model or a series of models. An exmaple is in //executorch/examples/models/llama2/runner, where a C++ runner code is built to run Llama 2, 3, 3.1 and other models using the same architecture.
+With the components above, an actual runner can be built for a model or a series of models. An example is in //executorch/examples/models/llama2/runner, where a C++ runner code is built to run Llama 2, 3, 3.1 and other models using the same architecture.
 
 Usages can also be found in the [torchchat repo](https://github.com/pytorch/torchchat/tree/main/runner).
 
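For reference (not part of the patch itself), a sketch of how a single-hunk fix like the `inferenace` → `inference` change above is applied with plain git. The temporary repo, file contents, and patch file are illustrative scaffolding, not the real executorch tree:

```shell
# Sketch: apply a one-line docs fix with `git apply` in a throwaway repo.
# The repo contents below are hypothetical; only the hunk mirrors the patch above.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
mkdir -p examples/models/phi-3-mini-lora
printf '1. Export the inferenace and training models to ExecuTorch.\n' \
  > examples/models/phi-3-mini-lora/README.md
cat > fix.patch <<'EOF'
--- a/examples/models/phi-3-mini-lora/README.md
+++ b/examples/models/phi-3-mini-lora/README.md
@@ -1 +1 @@
-1. Export the inferenace and training models to ExecuTorch.
+1. Export the inference and training models to ExecuTorch.
EOF
git apply fix.patch
# prints the corrected line
grep 'inference and training' examples/models/phi-3-mini-lora/README.md
```

A full mailbox-format patch like this one would normally be applied with `git am <file>.patch` instead, which also restores the commit message, author, and date from the mail header.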