Accessing Gated Models in vLLM Playground

What are Gated Models?

Gated models are models that require approval from the model authors before you can download and use them. They typically require:

Accepting terms of service on HuggingFace
Providing a HuggingFace access token

CPU-Friendly Models in WebUI

The WebUI includes these excellent CPU-optimized models:

1. TinyLlama-1.1B-Chat-v1.0 (Ungated) ⭐ Recommended for Testing

Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Size: 1.1B parameters (~2.2GB)
Speed on CPU: Fast (10-30 tokens/sec on modern CPUs)
No HF token required
Best for: Quick testing, development, limited hardware

2. Llama 3.2 1B (Gated) ⚠️ Requires HF Token

Model: meta-llama/Llama-3.2-1B
Size: 1B parameters (~2GB)
Speed on CPU: Fast (10-30 tokens/sec)
Requires: HuggingFace account + token
Best for: Latest Meta technology, good quality responses

3. Gemma 2 2B (Gated) ⚠️ Requires HF Token

Model: google/gemma-2-2b
Size: 2B parameters (~4GB)
Speed on CPU: Moderate (5-15 tokens/sec)
Requires: HuggingFace account + token
Best for: Google's latest efficient model, good quality

How to Access Gated Models

Step 1: Create HuggingFace Account

Go to https://huggingface.co/join
Create a free account
Verify your email

Step 2: Request Access to Gated Models

For Llama 3.2:

Visit https://huggingface.co/meta-llama/Llama-3.2-1B
Click "Agree and access repository"
Read and accept Meta's license agreement
Wait for approval (usually instant, but can take up to 24 hours)

For Gemma 2:

Visit https://huggingface.co/google/gemma-2-2b
Click "Agree and access repository"
Read and accept Google's terms
Wait for approval (usually instant)

Step 3: Generate HuggingFace Access Token

Go to https://huggingface.co/settings/tokens
Click "New token"
Give it a name (e.g., "vLLM Playground")
Select token type:
- Read: Sufficient for downloading models
- Write: Not needed for vLLM
Click "Generate token"
Copy the token immediately (you won't be able to see it again!)

Example token format: hf_ABcDEfGHiJKlMNoPQrSTuVwXyZ1234567890

Step 4: Configure Token in WebUI

You have three options:

Option A: Enter Token in WebUI (Recommended)

In the WebUI, go to the configuration panel
Find the "HuggingFace Token" field
Paste your token: hf_xxxxxxxxxxxxx
Select your gated model
Click "Start Server"

Option B: Set Environment Variable

export HF_TOKEN="hf_xxxxxxxxxxxxx"
python app.py

Option C: Use HuggingFace CLI

# Install HuggingFace CLI
pip install huggingface-hub

# Login (saves token permanently)
huggingface-cli login

# Paste your token when prompted

Once logged in via CLI, you don't need to provide the token in WebUI.

Testing Access

Test if you have access:

# Check if you can access the model info
python -c "from huggingface_hub import HfApi; api = HfApi(); print(api.model_info('meta-llama/Llama-3.2-1B'))"

If this works, you have proper access!

Example Configurations

TinyLlama (No Token Required)

{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "max_model_len": 2048,
  "cpu_kvcache_space": 4,
  "dtype": "bfloat16"
}

Llama 3.2 1B (Token Required)

{
  "model": "meta-llama/Llama-3.2-1B",
  "max_model_len": 2048,
  "cpu_kvcache_space": 4,
  "dtype": "bfloat16",
  "hf_token": "hf_xxxxxxxxxxxxx"
}

Gemma 2 2B (Token Required)

{
  "model": "google/gemma-2-2b",
  "max_model_len": 2048,
  "cpu_kvcache_space": 6,
  "dtype": "bfloat16",
  "hf_token": "hf_xxxxxxxxxxxxx"
}

Troubleshooting

Error: "Repository not found" or "Access denied"

Cause: You haven't requested access to the gated model yet.

Solution:

Visit the model page on HuggingFace
Click "Agree and access repository"
Wait for approval (check your email)
Try again

Error: "Invalid token"

Cause: Token is incorrect, expired, or doesn't have read permissions.

Solution:

Generate a new token with "read" permissions
Make sure you copied the entire token (starts with hf_)
Update your configuration

Error: "401 Unauthorized"

Cause: Token not provided or not recognized by vLLM.

Solution:

Make sure token is set in WebUI or as environment variable
Try logging in with huggingface-cli login first
Restart the WebUI

Model downloads very slowly

Cause: Large model files + slow internet connection.

Solution:

Be patient - first download takes time
Models are cached locally after first download
Use download_dir to specify custom cache location

Consider downloading manually first:

python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-3.2-1B')"

Security Best Practices

⚠️ Protect Your Token

Never commit tokens to git
- Add .env to .gitignore
- Don't paste tokens in public code
Use environment variables for production
```
export HF_TOKEN="hf_xxxxxxxxxxxxx"
```
Rotate tokens regularly
- Generate new tokens every few months
- Revoke old tokens you're not using
Use separate tokens for different projects
- Easier to track and revoke if compromised
Don't share tokens
- Each user should have their own token
- Tokens are tied to your HuggingFace account

Performance Comparison (CPU)

On a typical M2 MacBook Pro or modern Intel i7:

Model	Size	Speed	Quality	Best For
TinyLlama 1.1B	2.2GB	20-30 tok/s	Good	Testing, Development
Llama 3.2 1B	2GB	15-25 tok/s	Better	Quality + Speed balance
Gemma 2 2B	4GB	10-15 tok/s	Best	Best quality in 2B class
OPT 125M	250MB	50-100 tok/s	Basic	Quick testing only

Model Recommendations by Use Case

For Learning/Testing

TinyLlama 1.1B - No token required, fast, good enough

For Development

Llama 3.2 1B - Good quality, fast, requires token

For Production (CPU)

Gemma 2 2B - Best quality in small size, requires token

For GPU

Llama 2 7B or Mistral 7B - Much better quality, requires GPU

Additional Resources

HuggingFace Models: https://huggingface.co/models
HuggingFace Tokens: https://huggingface.co/settings/tokens
Llama 3.2 Model Card: https://huggingface.co/meta-llama/Llama-3.2-1B
Gemma 2 Model Card: https://huggingface.co/google/gemma-2-2b
TinyLlama Model Card: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

Quick Start Command

# Set your token
export HF_TOKEN="hf_xxxxxxxxxxxxx"

# Start WebUI
python app.py

# In WebUI, select:
# - Model: meta-llama/Llama-3.2-1B
# - Max Model Length: 2048
# - CPU KV Cache: 4
# - Click "Start Server"

FAQ

Q: Do I need to pay for HuggingFace? A: No, free account is sufficient for these models.

Q: How long does approval take? A: Usually instant for Llama 3.2 and Gemma 2.

Q: Can I use the same token for multiple models? A: Yes, one token works for all models you have access to.

Q: What if I don't want to use gated models? A: Use TinyLlama or OPT models - they're ungated and work great!

Q: Will my token expire? A: Tokens don't expire automatically, but you should rotate them periodically for security.

Q: Can I use these models offline? A: Yes, after the first download, models are cached locally.

FilesExpand file tree

GATED_MODELS_GUIDE.md

Latest commit

History

GATED_MODELS_GUIDE.md

File metadata and controls

Accessing Gated Models in vLLM Playground

What are Gated Models?

CPU-Friendly Models in WebUI

1. TinyLlama-1.1B-Chat-v1.0 (Ungated) ⭐ Recommended for Testing

2. Llama 3.2 1B (Gated) ⚠️ Requires HF Token

3. Gemma 2 2B (Gated) ⚠️ Requires HF Token

How to Access Gated Models

Step 1: Create HuggingFace Account

Step 2: Request Access to Gated Models

For Llama 3.2:

For Gemma 2:

Step 3: Generate HuggingFace Access Token

Step 4: Configure Token in WebUI

Option A: Enter Token in WebUI (Recommended)

Option B: Set Environment Variable

Option C: Use HuggingFace CLI

Testing Access

Test if you have access:

Example Configurations

TinyLlama (No Token Required)

Llama 3.2 1B (Token Required)

Gemma 2 2B (Token Required)

Troubleshooting

Error: "Repository not found" or "Access denied"

Error: "Invalid token"

Error: "401 Unauthorized"

Model downloads very slowly

Security Best Practices

⚠️ Protect Your Token

Performance Comparison (CPU)

Model Recommendations by Use Case

For Learning/Testing

For Development

For Production (CPU)

For GPU

Additional Resources

Quick Start Command

FAQ