pytorch · facebook-github-bot · Jul 25, 2025 · Jul 25, 2025
@@ -168,7 +168,7 @@ LLAMA_CHECKPOINT=path/to/consolidated.00.pth
 LLAMA_PARAMS=path/to/params.json
 
 python -m extension.llm.export.export_llm \
-  --config examples/models/llama/config/llama_bf16.yaml
+  --config examples/models/llama/config/llama_bf16.yaml \
   +base.model_class="llama3_2" \
   +base.checkpoint="${LLAMA_CHECKPOINT:?}" \
   +base.params="${LLAMA_PARAMS:?}" \
@@ -186,7 +186,7 @@ LLAMA_QUANTIZED_CHECKPOINT=path/to/spinquant/consolidated.00.pth.pth
 LLAMA_PARAMS=path/to/spinquant/params.json
 
 python -m extension.llm.export.export_llm \
-  --config examples/models/llama/config/llama_xnnpack_spinquant.yaml
+  --config examples/models/llama/config/llama_xnnpack_spinquant.yaml \
   +base.model_class="llama3_2" \
   +base.checkpoint="${LLAMA_QUANTIZED_CHECKPOINT:?}" \
   +base.params="${LLAMA_PARAMS:?}"
@@ -203,7 +203,7 @@ LLAMA_QUANTIZED_CHECKPOINT=path/to/qlora/consolidated.00.pth.pth
 LLAMA_PARAMS=path/to/qlora/params.json
 
 python -m extension.llm.export.export_llm \
-    --config examples/models/llama/config/llama_xnnpack_qat.yaml
+    --config examples/models/llama/config/llama_xnnpack_qat.yaml \
     +base.model_class="llama3_2" \
     +base.checkpoint="${LLAMA_QUANTIZED_CHECKPOINT:?}" \
     +base.params="${LLAMA_PARAMS:?}" \
@@ -219,15 +219,16 @@ You can export and run the original Llama 3 8B instruct model.
 2. Export model and generate `.pte` file
 ```
 python -m extension.llm.export.export_llm \
-    --config examples/models/llama/config/llama_q8da4w.yaml
-    +base.model_clas="llama3"
+    --config examples/models/llama/config/llama_q8da4w.yaml \
+    +base.model_class="llama3" \
     +base.checkpoint=<consolidated.00.pth.pth> \
     +base.params=<params.json>
 ```
-    Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `quantization.embedding_quantize='4,32'` as shown above to further reduce the model size.
 
+Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `quantization.embedding_quantize='4,32'` as shown above to further reduce the model size.
 
-    If you're interested in deploying on non-CPU backends, [please refer to the non-cpu-backend section](non_cpu_backends.md).
+
+If you're interested in deploying on non-CPU backends, [please refer to the non-cpu-backend section](non_cpu_backends.md).
 
 ## Step 3: Run on your computer to validate
 
@@ -450,7 +451,7 @@ python -m examples.models.llama.eval_llama \
 	-d <checkpoint dtype> \
 	--tasks mmlu \
 	--num_fewshot 5 \
-	--max_seq_len <max sequence length>
+	--max_seq_len <max sequence length> \
 	--max_context_len <max context length>
 ```