README.md: 48 additions & 15 deletions
@@ -70,14 +70,23 @@ where `model_name` is the name of the model on the HF hub. Ensure that it's run
This will attempt to download weights in `.safetensors` format, and if those aren't in the HF hub it will download PyTorch `.bin` weights and then convert them to `.safetensors`.
If needed, specific file extensions can be downloaded by using the `--extension` option, for example:
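The example command itself isn't reproduced in this excerpt; a minimal sketch, assuming a `text-generation-server download-weights` CLI entry point (the command name and flag syntax here are assumptions, not necessarily this repo's exact interface):

```shell
# Hypothetical invocation: restrict the download to .safetensors files only.
# The CLI name and option syntax are assumed for illustration.
text-generation-server download-weights model_name --extension ".safetensors"
```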
`.safetensors` weights are now required for many models, in particular:
- When using the optimized flash attention mode (`FLASH_ATTENTION=true`) - this is currently supported for Llama, Falcon, Starcoder and GPT-NeoX based models, on newer GPUs
- When using tensor parallel (see below)
- Also recommended for BLOOM and T5 type models generally
They can be downloaded directly from the huggingface hub for some models. As explained above, the download command by default will download and convert them from PyTorch weights if safetensors weights aren't available.
To convert from pre-existing PyTorch `.bin` weights:
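The conversion steps themselves aren't reproduced in this excerpt. As a rough sketch of the general approach (not necessarily this repo's own tooling), the `safetensors` Python package can rewrite a `.bin` checkpoint directly:

```shell
# Illustrative only: convert a pytorch_model.bin checkpoint to .safetensors
# using the `safetensors` package. File names are placeholders, and models
# with tied/shared tensors may need extra handling.
python - <<'EOF'
import torch
from safetensors.torch import save_file

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
save_file({k: v.contiguous() for k, v in state_dict.items()}, "model.safetensors")
EOF
```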
To run with tensor parallel across multiple GPUs:
1. Ensure that the model weights are in `.safetensors` format (see above)
2. Ensure that the `CUDA_VISIBLE_DEVICES` environment variable is set appropriately (e.g. "0,1" to use the first two GPUs). The number of GPUs to use will be inferred from this or else can be set explicitly with the `NUM_GPUS` environment variable.
3. Set the environment variable `DEPLOYMENT_FRAMEWORK=hf_custom_tp`
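Putting those settings together, a minimal sketch of the environment for a two-GPU tensor-parallel run (the server launch command itself is omitted here):

```shell
# Tensor parallel across the first two GPUs
export CUDA_VISIBLE_DEVICES=0,1           # GPUs to use; NUM_GPUS is inferred from this...
export NUM_GPUS=2                         # ...or can be set explicitly
export DEPLOYMENT_FRAMEWORK=hf_custom_tp  # enable the tensor-parallel deployment framework
```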
### TLS configuration
@@ -119,4 +119,37 @@ These paths can reference mounted secrets containing the certs.
Prometheus metrics are exposed on the same port as the health probe endpoint (default 3000), at `/metrics`.
They are all prefixed with `tgi_`. Descriptions will be added to the table below soon.
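For example, to fetch the raw metrics from a locally running server (assuming the default port):

```shell
# Inspect the exposed Prometheus metrics (default port 3000, path /metrics)
curl -s http://localhost:3000/metrics | grep "^tgi_"
```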