Gated models are models that require approval from the model authors before you can download and use them. They typically require:
- Accepting terms of service on HuggingFace
- Providing a HuggingFace access token
The WebUI supports these small models that run well on CPU:
- Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  - Size: 1.1B parameters (~2.2GB)
  - Speed on CPU: Fast (10-30 tokens/sec on modern CPUs)
  - No HF token required
  - Best for: Quick testing, development, limited hardware
- Model: meta-llama/Llama-3.2-1B
  - Size: 1B parameters (~2GB)
  - Speed on CPU: Fast (10-30 tokens/sec)
  - Requires: HuggingFace account + token
  - Best for: Latest Meta technology, good quality responses
- Model: google/gemma-2-2b
  - Size: 2B parameters (~4GB)
  - Speed on CPU: Moderate (5-15 tokens/sec)
  - Requires: HuggingFace account + token
  - Best for: Google's latest efficient model, good quality
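A rule of thumb behind the sizes quoted above: on-disk size is roughly parameter count times bytes per parameter (2 bytes per parameter for bfloat16). A quick sketch (the helper function is illustrative, not part of the WebUI):

```python
def approx_model_size_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough on-disk/in-memory size for a model stored in bfloat16 (2 bytes/param)."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9

print(approx_model_size_gb(1.1))  # TinyLlama 1.1B -> ~2.2 GB
print(approx_model_size_gb(2.0))  # Gemma 2 2B -> ~4 GB
```

Actual downloads also include the tokenizer and config files, so real sizes run slightly larger.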
Create a free HuggingFace account:

- Go to https://huggingface.co/join
- Create a free account
- Verify your email
Request access to Llama 3.2:

- Visit https://huggingface.co/meta-llama/Llama-3.2-1B
- Click "Agree and access repository"
- Read and accept Meta's license agreement
- Wait for approval (usually instant, but can take up to 24 hours)
Request access to Gemma 2:

- Visit https://huggingface.co/google/gemma-2-2b
- Click "Agree and access repository"
- Read and accept Google's terms
- Wait for approval (usually instant)
Create an access token:

- Go to https://huggingface.co/settings/tokens
- Click "New token"
- Give it a name (e.g., "vLLM Playground")
- Select token type:
  - Read: Sufficient for downloading models
  - Write: Not needed for vLLM
- Click "Generate token"
- Copy the token immediately (you won't be able to see it again!)
Example token format: `hf_ABcDEfGHiJKlMNoPQrSTuVwXyZ1234567890`
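Before pasting the token anywhere, you can sanity-check its shape. A minimal sketch (the `hf_` prefix matches the format shown above; the minimum-length cutoff is an assumption, and this does not verify the token against the server):

```python
import re

def looks_like_hf_token(token: str) -> bool:
    """Loose shape check: HF user access tokens start with 'hf_' followed by an
    alphanumeric body. This does NOT verify the token is valid server-side."""
    return re.fullmatch(r"hf_[A-Za-z0-9]{10,}", token) is not None

print(looks_like_hf_token("hf_ABcDEfGHiJKlMNoPQrSTuVwXyZ1234567890"))  # True
print(looks_like_hf_token("my-token"))  # False
```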
You have three options for providing your token:

Option 1: WebUI configuration panel

- In the WebUI, go to the configuration panel
- Find the "HuggingFace Token" field
- Paste your token (e.g., `hf_xxxxxxxxxxxxx`)
- Select your gated model
- Click "Start Server"
Option 2: Environment variable

```bash
export HF_TOKEN="hf_xxxxxxxxxxxxx"
python app.py
```

Option 3: HuggingFace CLI

```bash
# Install HuggingFace CLI
pip install huggingface-hub

# Login (saves token permanently)
huggingface-cli login
# Paste your token when prompted
```

Once logged in via the CLI, you don't need to provide the token in the WebUI.
Verify your access:

```bash
# Check if you can access the model info
python -c "from huggingface_hub import HfApi; api = HfApi(); print(api.model_info('meta-llama/Llama-3.2-1B'))"
```

If this works, you have proper access!
TinyLlama (no token required):

```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "max_model_len": 2048,
  "cpu_kvcache_space": 4,
  "dtype": "bfloat16"
}
```

Llama 3.2:

```json
{
  "model": "meta-llama/Llama-3.2-1B",
  "max_model_len": 2048,
  "cpu_kvcache_space": 4,
  "dtype": "bfloat16",
  "hf_token": "hf_xxxxxxxxxxxxx"
}
```

Gemma 2:

```json
{
  "model": "google/gemma-2-2b",
  "max_model_len": 2048,
  "cpu_kvcache_space": 6,
  "dtype": "bfloat16",
  "hf_token": "hf_xxxxxxxxxxxxx"
}
```

Cause: You haven't requested access to the gated model yet.
Solution:
- Visit the model page on HuggingFace
- Click "Agree and access repository"
- Wait for approval (check your email)
- Try again
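Misconfigured JSON is another common source of startup failures; the example configs can be sanity-checked before launching. A sketch (key names taken from the example configs above; the gated-model set is assumed from the model list, and `check_config` is a hypothetical helper, not a WebUI function):

```python
import json

GATED = {"meta-llama/Llama-3.2-1B", "google/gemma-2-2b"}  # from the model list above
REQUIRED = {"model", "max_model_len", "cpu_kvcache_space", "dtype"}

def check_config(raw: str) -> list[str]:
    """Return a list of problems found in a server config JSON string."""
    cfg = json.loads(raw)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED - cfg.keys())]
    if cfg.get("model") in GATED and not cfg.get("hf_token"):
        problems.append("gated model configured without hf_token")
    return problems

raw = '{"model": "google/gemma-2-2b", "max_model_len": 2048, "cpu_kvcache_space": 6, "dtype": "bfloat16"}'
print(check_config(raw))  # flags the missing hf_token
```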
Cause: Token is incorrect, expired, or doesn't have read permissions.

Solution:

- Generate a new token with "read" permissions
- Make sure you copied the entire token (starts with `hf_`)
- Update your configuration
Cause: Token not provided or not recognized by vLLM.

Solution:

- Make sure the token is set in the WebUI or as an environment variable
- Try logging in with `huggingface-cli login` first
- Restart the WebUI
Cause: Large model files + slow internet connection.

Solution:

- Be patient - the first download takes time
- Models are cached locally after the first download
- Use `download_dir` to specify a custom cache location
- Consider downloading manually first:

```bash
python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-3.2-1B')"
```
- Never commit tokens to git
  - Add `.env` to `.gitignore`
  - Don't paste tokens in public code
- Use environment variables for production
  - `export HF_TOKEN="hf_xxxxxxxxxxxxx"`
- Rotate tokens regularly
  - Generate new tokens every few months
  - Revoke old tokens you're not using
- Use separate tokens for different projects
  - Easier to track and revoke if compromised
- Don't share tokens
  - Each user should have their own token
  - Tokens are tied to your HuggingFace account
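To keep tokens out of git, a common pattern is loading them from a `.env` file that is listed in `.gitignore`. A minimal dependency-free loader sketch (real projects often use the `python-dotenv` package instead; `load_dotenv` here is a simplified stand-in):

```python
import os
from pathlib import Path

def load_dotenv(path: str = ".env") -> None:
    """Load KEY=VALUE lines from a .env file into os.environ (existing vars win)."""
    p = Path(path)
    if not p.is_file():
        return
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Demo: write a .env file, then load it.
Path(".env").write_text('HF_TOKEN="hf_xxxxxxxxxxxxx"\n')
os.environ.pop("HF_TOKEN", None)  # make sure the value comes from the file
load_dotenv()
print(os.environ["HF_TOKEN"])
```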
On a typical M2 MacBook Pro or modern Intel i7:
| Model | Size | Speed | Quality | Best For |
|---|---|---|---|---|
| TinyLlama 1.1B | 2.2GB | 20-30 tok/s | Good | Testing, Development |
| Llama 3.2 1B | 2GB | 15-25 tok/s | Better | Quality + Speed balance |
| Gemma 2 2B | 4GB | 10-15 tok/s | Best | Best quality in 2B class |
| OPT 125M | 250MB | 50-100 tok/s | Basic | Quick testing only |
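These throughput figures translate directly into response latency. A quick sketch using approximate mid-range speeds from the table (decode speed only; prompt processing time is ignored):

```python
def response_time_s(num_tokens: int, tokens_per_sec: float) -> float:
    """Seconds to generate num_tokens at a given decode speed (ignores prompt processing)."""
    return num_tokens / tokens_per_sec

# Approximate mid-range speeds from the table above.
for name, speed in [("TinyLlama 1.1B", 25), ("Llama 3.2 1B", 20), ("Gemma 2 2B", 12)]:
    print(f"{name}: {response_time_s(200, speed):.1f}s for a 200-token reply")
```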
- TinyLlama 1.1B - No token required, fast, good enough
- Llama 3.2 1B - Good quality, fast, requires token
- Gemma 2 2B - Best quality in small size, requires token
- Llama 2 7B or Mistral 7B - Much better quality, requires GPU
- HuggingFace Models: https://huggingface.co/models
- HuggingFace Tokens: https://huggingface.co/settings/tokens
- Llama 3.2 Model Card: https://huggingface.co/meta-llama/Llama-3.2-1B
- Gemma 2 Model Card: https://huggingface.co/google/gemma-2-2b
- TinyLlama Model Card: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
```bash
# Set your token
export HF_TOKEN="hf_xxxxxxxxxxxxx"

# Start WebUI
python app.py

# In WebUI, select:
# - Model: meta-llama/Llama-3.2-1B
# - Max Model Length: 2048
# - CPU KV Cache: 4
# - Click "Start Server"
```

Q: Do I need to pay for HuggingFace?
A: No, a free account is sufficient for these models.
Q: How long does approval take?
A: Usually instant for Llama 3.2 and Gemma 2.

Q: Can I use the same token for multiple models?
A: Yes, one token works for all models you have access to.

Q: What if I don't want to use gated models?
A: Use TinyLlama or OPT models - they're ungated and work great!

Q: Will my token expire?
A: Tokens don't expire automatically, but you should rotate them periodically for security.

Q: Can I use these models offline?
A: Yes, after the first download, models are cached locally.
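On the offline question: after the first download, huggingface_hub can be forced to use only the local cache by setting its `HF_HUB_OFFLINE` environment variable before the process starts. A sketch:

```python
import os

# Must be set before huggingface_hub / transformers are imported; with
# HF_HUB_OFFLINE=1 the hub client uses the local cache instead of the network.
os.environ["HF_HUB_OFFLINE"] = "1"
print(os.environ["HF_HUB_OFFLINE"])
```

Equivalently, export the variable in the shell before running `python app.py`.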