Fix typos (#354)

carmocca · web-flow · commit 713a0b152f5f · 2023-06-01T18:53:06.000+02:00
diff --git a/README.md b/README.md
@@ -98,7 +98,7 @@ See `python generate.py --help` for more options.
 You can also use GPTQ-style int4 quantization, but this needs conversions of the weights first:
 
 ```bash
-python quantize/gptq.py --output_path checkpoints/lit-llama/7B/llama-gptq.4bit.pt --dtype bfloat16 --quantize gptq.int4
+python quantize/gptq.py --output_path checkpoints/lit-llama/7B/llama-gptq.4bit.pth --dtype bfloat16 --quantize gptq.int4
 ```
 
 GPTQ-style int4 quantization brings GPU usage down to about ~5GB. As only the weights of the Linear layers are quantized, it is useful to also use `--dtype bfloat16` even with the quantization enabled.
diff --git a/howto/inference.md b/howto/inference.md
@@ -31,7 +31,7 @@ See `python generate.py --help` for more options.
 You can also use GPTQ-style int4 quantization, but this needs conversions of the weights first:
 
 ```bash
-python quantize/gptq.py --output_path checkpoints/lit-llama/7B/llama-gptq.4bit.pt --dtype bfloat16 --quantize gptq.int4
+python quantize/gptq.py --output_path checkpoints/lit-llama/7B/llama-gptq.4bit.pth --dtype bfloat16 --quantize gptq.int4
 ```
 
 GPTQ-style int4 quantization brings GPU usage down to about ~5GB. As only the weights of the Linear layers are quantized, it is useful to also use `--dtype bfloat16` even with the quantization enabled.