Run Clawdbot with GLM-4.7 and other open-source coding models on RunPod using vLLM. Chat with your AI assistant via Telegram!
| Model | GPU | VRAM | Cost/hr | Context | Folder |
|---|---|---|---|---|---|
| Base (Qwen2.5-7B) | Any | 16GB | $0.50 | 16k | Dockerfile |
| GLM-4.7-Flash FP16 | H100/A100 80GB | 56GB | $1.20-1.99 | 32k-64k | models/glm47-flash-fp16/ |
| GLM-4.7-Flash AWQ 4-bit | A100 80GB | 71GB | $1.19 | 114k | models/glm47-flash-awq-4bit/ |
| GLM-4.7-REAP W4A16 | B200 | 108GB | $5.19 | 32k | models/glm47-reap-w4a16/ |
The AWQ 4-bit build is the best-value option, offering the full 114k context window at $1.19/hr on an A100 80GB.
# GLM-4.7-Flash AWQ 4-bit (Best value, A100 80GB)
IMAGE=yourusername/clawdbot-glm47-flash-awq-4bit:latest
# GLM-4.7-Flash FP16 (Full precision, H100/A100 80GB)
IMAGE=yourusername/clawdbot-glm47-flash-fp16:latest
# GLM-4.7-REAP W4A16 (High-end, B200)
IMAGE=yourusername/clawdbot-glm47-reap-w4a16:latest
# Base (Qwen2.5-7B, any GPU)
IMAGE=yourusername/clawdbot-vllm:latest

- Image: Your chosen image from above
- GPU: Match model requirements
- Volume: 150GB at `/workspace`
- Container Disk: 50-100GB (depending on model)
- Ports: `8000/http`, `18789/http`, `22/tcp`
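RunPod normally exposes HTTP ports through its proxy, so the vLLM API should also be reachable from outside the pod. The URL pattern below (`{pod-id}-{port}.proxy.runpod.net`) is RunPod's usual scheme; confirm the exact URL in your pod's connection details:

```bash
# Health check through the RunPod HTTP proxy (replace POD_ID with your pod's ID)
curl https://POD_ID-8000.proxy.runpod.net/health
```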
VLLM_API_KEY=your-secure-key # Required
TELEGRAM_BOT_TOKEN=your-telegram-token # Optional
GITHUB_TOKEN=ghp_xxx # Optional
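Once the pod is running, you can confirm these variables are visible inside the container with a quick loop in the pod shell (purely illustrative, not part of the repo's scripts):

```bash
# Check which of the expected variables are set in the container shell
for var in VLLM_API_KEY TELEGRAM_BOT_TOKEN GITHUB_TOKEN; do
  if [ -n "${!var}" ]; then echo "$var is set"; else echo "$var is NOT set"; fi
done
```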
# Health check
curl http://localhost:8000/health
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.7-flash",
"messages": [{"role": "user", "content": "Hello!"}]
}'
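If `jq` is installed, the same request collapses to a one-liner that prints only the assistant's reply (a convenience, not something the repo ships):

```bash
# Compact variant of the request above, printing just the reply text
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "Hello!"}]}' \
  | jq -r '.choices[0].message.content'
```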
Images are automatically built and pushed to Docker Hub via GitHub Actions.

| Image | Description |
|---|---|
| `clawdbot-glm47-flash-awq-4bit` | GLM-4.7-Flash AWQ 4-bit for A100 80GB |
| `clawdbot-glm47-flash-fp16` | GLM-4.7-Flash FP16 for H100/A100 80GB |
| `clawdbot-glm47-reap-w4a16` | GLM-4.7-REAP W4A16 for B200 |
| `clawdbot-vllm` | Base image with Qwen2.5-7B |
runpod-clawdbot/
├── README.md # This file
├── .github/
│ └── workflows/
│ └── docker-build.yml # Build & push to Docker Hub
│
├── models/
│ ├── glm47-flash-fp16/ # Full precision FP16 (H100/A100 80GB)
│ │ ├── README.md
│ │ ├── Dockerfile
│ │ └── entrypoint.sh
│ │
│ ├── glm47-flash-awq-4bit/ # AWQ 4-bit quantized (A100 80GB)
│ │ ├── README.md
│ │ ├── Dockerfile
│ │ └── entrypoint.sh
│ │
│ └── glm47-reap-w4a16/ # Pruned W4A16 quantized (B200)
│ ├── README.md
│ ├── Dockerfile
│ └── entrypoint.sh
│
├── scripts/
│ ├── setup-clawdbot.sh
│ └── start-vllm.sh
│
├── config/
│ ├── clawdbot.json
│ └── workspace/
│
├── templates/
│ └── clawdbot-vllm.json
│
├── tests/
│ ├── test-vllm.sh
│ └── test-tool-calling.sh
│
├── Dockerfile # Base image (Qwen2.5-7B)
├── docker-compose.yml
└── .env.example
Images are built automatically on:
- Push to `main` → tagged as `:latest`
- Push to other branches → tagged as `:dev-{branch-name}` (e.g., `:dev-feature-xyz`)
- Push git tag (e.g., `v1.0.0`) → tagged as `:v1.0.0` + `:latest` (see the example below)
- Pull requests → build only, no push (validation)
- Manual workflow dispatch → select specific model
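For example, publishing a versioned build is just a matter of pushing a tag (the version number here is illustrative):

```bash
# Tag the current commit and push it to trigger the versioned build
git tag v1.0.0
git push origin v1.0.0
```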
Secrets (Repository → Settings → Secrets → Actions):

| Secret | Description |
|---|---|
| `DOCKERHUB_USERNAME` | Your Docker Hub username |
| `DOCKERHUB_TOKEN` | Docker Hub access token (not password) |

Variables (Repository → Settings → Variables → Actions):

| Variable | Description |
|---|---|
| `DOCKERHUB_REPO` | (Optional) Custom repo name, defaults to username |
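If you use the GitHub CLI, the same secrets and variable can be set from a terminal (assumes a `gh` version recent enough to include `gh variable`):

```bash
# Set repository secrets (you will be prompted for the values)
gh secret set DOCKERHUB_USERNAME
gh secret set DOCKERHUB_TOKEN

# Optional: override the Docker Hub repo name
gh variable set DOCKERHUB_REPO --body "my-custom-repo"
```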
# Build locally
docker build -t clawdbot-glm47-flash-awq-4bit models/glm47-flash-awq-4bit/
docker build -t clawdbot-glm47-flash-fp16 models/glm47-flash-fp16/
docker build -t clawdbot-glm47-reap-w4a16 models/glm47-reap-w4a16/
# Push to Docker Hub
docker tag clawdbot-glm47-flash-awq-4bit yourusername/clawdbot-glm47-flash-awq-4bit:latest
docker push yourusername/clawdbot-glm47-flash-awq-4bit:latest
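To build and push all three model images in one pass, a small loop over the model folders works (a convenience, not one of the repo's scripts):

```bash
# Build and push every model image (replace "yourusername")
for model in glm47-flash-awq-4bit glm47-flash-fp16 glm47-reap-w4a16; do
  docker build -t yourusername/clawdbot-$model:latest models/$model/
  docker push yourusername/clawdbot-$model:latest
done
```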
| Variable | Default | Description |
|---|---|---|
| `VLLM_API_KEY` | `changeme` | API key for vLLM authentication |
| `MODEL_NAME` | Model-specific | HuggingFace model ID |
| `SERVED_MODEL_NAME` | `glm-4.7-flash` | Model name in API responses |
| `MAX_MODEL_LEN` | Auto-detected | Maximum context length |
| `GPU_MEMORY_UTILIZATION` | `0.92` | Fraction of GPU memory to use |
| `TELEGRAM_BOT_TOKEN` | (unset) | Telegram bot token from @BotFather |
| `GITHUB_TOKEN` | (unset) | GitHub PAT for git/gh operations |
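The same variables can be exercised outside RunPod with a plain `docker run` (values and the host volume path are illustrative; an NVIDIA GPU and the NVIDIA Container Toolkit are required):

```bash
# Run the AWQ image locally, overriding a few defaults
docker run --gpus all \
  -p 8000:8000 -p 18789:18789 \
  -v /data/workspace:/workspace \
  -e VLLM_API_KEY=your-secure-key \
  -e MAX_MODEL_LEN=65536 \
  -e GPU_MEMORY_UTILIZATION=0.90 \
  yourusername/clawdbot-glm47-flash-awq-4bit:latest
```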
Config is auto-generated at `/workspace/.clawdbot/clawdbot.json`:
{
"models": {
"providers": {
"local-vllm": {
"baseUrl": "http://localhost:8000/v1",
"apiKey": "your-vllm-api-key",
"api": "openai-completions"
}
}
}
}
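To confirm the generated config points at the local vLLM endpoint, you can inspect the provider block (assumes `jq` is available in the container):

```bash
# Print the provider block of the generated config
jq '.models.providers."local-vllm"' /workspace/.clawdbot/clawdbot.json
```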
- Create a bot with @BotFather
- Copy the bot token
- Set `TELEGRAM_BOT_TOKEN` environment variable
- Start or restart the pod
- Message your bot on Telegram!
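If the bot stays silent, a quick sanity check is Telegram's standard `getMe` endpoint, which is independent of this repo:

```bash
# Should return {"ok":true, ...} with your bot's username
curl -s "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/getMe"
```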
For git operations inside the container:
- Create a GitHub Personal Access Token
- Select scopes: `repo`, `read:org`, `workflow`
- Set `GITHUB_TOKEN` environment variable
- Token is auto-configured on startup
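To verify the token from inside the container, query the GitHub API directly; this is a generic check rather than part of the startup script:

```bash
# Should return your GitHub user object if the token is valid
curl -s -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/user
```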
# Basic health check
curl http://localhost:8000/health
# List models
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer $VLLM_API_KEY"
# Tool calling test
curl http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer $VLLM_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.7-flash",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"tools": [{
"type": "function",
"function": {
"name": "calculate",
"description": "Perform a calculation",
"parameters": {
"type": "object",
"properties": {
"expression": {"type": "string"}
}
}
}
}]
}'
- Check GPU availability: `nvidia-smi`
- Verify VRAM is sufficient for the model
- Check logs: `journalctl -u vllm` or container logs
- First load downloads the model from HuggingFace (can be 18-60GB)
- Use a network volume to persist the model across restarts
- AWQ 4-bit model (18GB) loads faster than FP16 (31GB)
- Verify `--enable-auto-tool-choice` is set
- Check tool parser matches model (`glm47` for GLM-4.7)
- Run test script: `./tests/test-tool-calling.sh`
- If vLLM crashes, GPU memory may stay allocated
- Restart the pod to clear memory
- Check with: `nvidia-smi`
- RunPod assigns random SSH ports after restart
- Check port via RunPod console or API
- Use RunPod web terminal as alternative
- GGUF not supported - vLLM doesn't support GLM-4.7's GGUF format. Use AWQ.
- Container disk doesn't persist - Only `/workspace` survives restarts.
- B200 requires CUDA 13.1+ - The REAP image includes this automatically.
- Use AWQ 4-bit - Same model, lower VRAM, cheaper GPU ($1.19 vs $1.99/hr)
- Stop pods when idle - RunPod charges per minute (example below)
- Use network volumes - Avoid re-downloading models
- Consider spot instances - Up to 80% cheaper
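For the idle-pod tip, RunPod's CLI can stop a pod from a terminal. The subcommands below reflect my understanding of `runpodctl`, so treat them as an assumption and check `runpodctl --help` on your version:

```bash
# List pods, then stop the one you are not using (assumed runpodctl syntax)
runpodctl get pod
runpodctl stop pod <pod-id>
```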
MIT