Currently these models are private. Please join https://huggingface.co/amd to access them.
@@ -72,7 +73,7 @@ These FP8 quantized checkpoints were generated with AMD’s Quark Quantizer. For
### Quantize your own models
This step is optional; follow it only if you want to quantize models on your own. We take Llama 3.1 405B as an example.
- Download the Model: View the Meta-Llama-3.1-405B model at https://huggingface.co/meta-llama/Meta-Llama-3.1-405B. Ensure that you have been granted access, and apply for access if you do not have it.
+ Download the Model: View the Llama-3.1-405B model at https://huggingface.co/meta-llama/Llama-3.1-405B. Ensure that you have been granted access, and apply for access if you do not have it.
If you do not already have a HuggingFace token, open your user profile (https://huggingface.co/settings/profile), select "Access Tokens", press "+ Create New Token", and create a new Read token.
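As a convenience, the same two steps (authenticating with the Read token and downloading the gated weights) can be sketched with the huggingface_hub Python client. The repository id comes from the link above, while the token placeholder and target directory are illustrative assumptions rather than values taken from this guide:

```python
# Hedged sketch: authenticate with the Read token created above and download
# the gated Llama-3.1-405B weights with the huggingface_hub client.
from huggingface_hub import login, snapshot_download

login(token="hf_xxx")  # paste your Read token, or export HF_TOKEN instead

snapshot_download(
    repo_id="meta-llama/Llama-3.1-405B",          # gated repo; access must already be granted
    local_dir="/data/llama-3.1/Llama-3.1-405B",   # illustrative download directory
)
```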
@@ -92,18 +93,18 @@ Create the directory for Llama 3.1 models (if it doesn't already exist)
@@ -176,7 +177,7 @@ Below is a list of options which are useful:
- **--max-seq-len-to-capture** : Maximum sequence length for which HIP graphs are captured and utilized. It is recommended to use HIP graphs for the best decode performance. The default value of this parameter is 8K, which is lower than the large context lengths supported by recent models such as Llama. For best performance, set this parameter to max-model-len or to the maximum context length supported by the model.
- **--gpu-memory-utilization** : The fraction of GPU memory reserved by a vLLM instance. The default value is 0.9. It is recommended to set this to 0.99 to increase the KV cache space.
- Note: vLLM's server creation command line (vllm serve) supports the above parameters as command-line arguments. However, vLLM's benchmark_latency and benchmark_throughput scripts may not expose all of these flags as command-line arguments; in that case, it may be necessary to pass these parameters to the LLMEngine constructor inside the benchmark script.
+ Note: vLLM's server creation command line (vllm serve) supports the above parameters as command-line arguments.
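For illustration, here is a minimal sketch of how the flags above can be passed when constructing a vLLM engine from Python; the same options are accepted by vllm serve. The model path, parallelism, and lengths below are assumptions, not values from this guide:

```python
# Hedged sketch: the flags discussed above map to keyword arguments of vllm.LLM,
# which share the engine arguments used by `vllm serve`.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/llama-3.1/Llama-3.1-405B",  # illustrative local path
    tensor_parallel_size=8,                  # illustrative; match your GPU count
    max_model_len=32768,                     # illustrative context length
    max_seq_len_to_capture=32768,            # capture HIP graphs up to max-model-len
    gpu_memory_utilization=0.99,             # leave more room for the KV cache
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```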
##### Online Gemm Tuning
Online Gemm tuning for small decode batch sizes can improve performance in some cases, e.g. Llama 70B up to batch size 8.
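One possible way to realize this, sketched under the assumption that PyTorch's TunableOp mechanism is used for GEMM tuning on ROCm (the exact procedure prescribed later in this guide may differ), is to enable tuning through environment variables before the engine starts:

```python
# Hedged sketch: enable PyTorch TunableOp-based GEMM tuning via environment
# variables. This is an assumption about the tuning mechanism, not the guide's
# verbatim procedure; the file path and model path are illustrative.
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"                      # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"                       # tune new GEMM shapes online
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "/tmp/tuned_gemm.csv"   # where tuning results are stored

# Import vLLM only after the variables are set so the engine inherits them.
from vllm import LLM

llm = LLM(model="/data/llama-3.1/Llama-3.1-405B", tensor_parallel_size=8)  # illustrative
```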
@@ -268,16 +269,18 @@ If you want to run Meta-Llama-3.1-405B FP16, please run