Commit d2a38cc

lucylq authored and facebook-github-bot committed

Update stories cmd to use kv cache (#5460)

Summary:
Pull Request resolved: #5460
Reviewed By: dvorjackz
Differential Revision: D62925331
Pulled By: lucylq
fbshipit-source-id: a5c977055fe208cd8f1db20f147247a5a0f6fdbf

1 parent: 0648a8a

File tree

1 file changed: +2 −2 lines


examples/models/llama2/README.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -66,7 +66,7 @@ Note that since Llama3's vocabulary size is 4x that of Llama2, we had to quantiz
 |OnePlus 12 | 10.85 tokens/second | 11.02 tokens/second |

 ### Llama3.1
-> :warning: **use the main branch**: Llama3.1 is supported on the ExecuTorch main branch (not release 0.3).
+Llama3.1 is supported on the ExecuTorch main branch and release/0.4

 # Instructions
```

````diff
@@ -117,7 +117,7 @@ If you want to deploy and run a smaller model for educational purposes. From `ex
 ```
 3. Export model and generate `.pte` file.
 ```
-python -m examples.models.llama2.export_llama -c stories110M.pt -p params.json -X
+python -m examples.models.llama2.export_llama -c stories110M.pt -p params.json -X -kv
 ```
 4. Create tokenizer.bin.
````
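The change above adds the `-kv` flag so the stories export uses a key/value cache. As background, here is a minimal NumPy sketch of the idea behind a KV cache in autoregressive decoding; this is an illustrative toy, not ExecuTorch's implementation, and all names and shapes are made up for the example:

```python
import numpy as np

def attend(q, k_cache, v_cache):
    # Attend the single new query over every cached key/value pair.
    keys = np.stack(k_cache)          # (t, d)
    vals = np.stack(v_cache)          # (t, d)
    scores = keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over past positions
    return weights @ vals             # (d,)

def decode_step(x, k_cache, v_cache, wq, wk, wv):
    # Project only the NEW token; keys/values for earlier tokens are
    # reused from the cache, so each step costs O(t) instead of
    # recomputing the whole prefix's projections every step.
    k_cache.append(x @ wk)
    v_cache.append(x @ wv)
    return attend(x @ wq, k_cache, v_cache)

d = 8
rng = np.random.default_rng(0)
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []
for _ in range(4):                    # four autoregressive decode steps
    out = decode_step(rng.standard_normal(d), k_cache, v_cache, wq, wk, wv)

assert len(k_cache) == 4
assert out.shape == (d,)
```

The cache grows by one entry per generated token, which is why enabling it speeds up token-by-token generation in the stories example.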

Comments (0)