
Commit 8f6988b: Update README.md
1 parent af64c28


README.md

Lines changed: 28 additions & 1 deletion
@@ -1001,6 +1001,33 @@ Notes:
- In soft mode, the client requires a patched server that accepts soft embeddings; the flag ensures no breakage.
### Alternative: GLM API Provider
Instead of running llama.cpp locally, you can use the GLM API (ZhipuAI) as your decoder backend:
**Setup:**
```bash
# In .env
REFRAG_DECODER=1
REFRAG_RUNTIME=glm # Switch from llamacpp to glm
GLM_API_KEY=your-api-key # Required
GLM_MODEL=glm-4.6 # Optional, defaults to glm-4.6
```
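Resolving these variables at startup might look like the following sketch. The `load_glm_config` helper is hypothetical (it is not claimed to exist in the repo); the variable names and defaults come from the block above:

```python
import os

def load_glm_config(env=os.environ):
    """Hypothetical helper: resolve GLM decoder settings from the environment."""
    if env.get("REFRAG_RUNTIME", "llamacpp") != "glm":
        return None  # llama.cpp path; GLM settings are not needed
    api_key = env.get("GLM_API_KEY")
    if not api_key:
        raise RuntimeError("GLM_API_KEY is required when REFRAG_RUNTIME=glm")
    return {
        "api_key": api_key,
        "model": env.get("GLM_MODEL", "glm-4.6"),  # optional, defaults to glm-4.6
    }
```

Passing a plain dict as `env` makes the helper easy to test without touching the real environment.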
**How it works:**
- Uses the OpenAI SDK with `base_url="https://api.z.ai/api/paas/v4/"`
- Supports prompt mode only (soft embeddings ignored)
- Handles GLM-4.6's reasoning mode (`reasoning_content` field)
- Drop-in replacement for llama.cpp: same interface, no code changes needed
**Switch back to llama.cpp:**
```bash
REFRAG_RUNTIME=llamacpp
```
The GLM provider is implemented in `scripts/refrag_glm.py` and automatically selected when `REFRAG_RUNTIME=glm`.
## How context_answer works (with decoder)
The `context_answer` MCP tool answers natural-language questions using retrieval + a decoder sidecar.
@@ -1016,7 +1043,7 @@ Pipeline
1) Hybrid search (gate-first): Uses MINI-vector gating when `REFRAG_GATE_FIRST=1` to prefilter candidates, then runs dense+lexical fusion
2) Micro-span budgeting: Merges adjacent micro hits and applies a global token budget (`REFRAG_MODE=1`, `MICRO_BUDGET_TOKENS`, `MICRO_OUT_MAX_SPANS`)
3) Prompt assembly: Builds compact context blocks and a “Sources” footer
4) Decoder call: When `REFRAG_DECODER=1`, calls the configured runtime (`REFRAG_RUNTIME=llamacpp` or `glm`) to synthesize the final answer
5) Return: Answer + citations + usage flags; errors keep citations for debugging
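Step 2 (micro-span budgeting) can be sketched as follows. This is a simplified illustration, not the actual implementation: spans are assumed to be `(start, end, token_count)` tuples, and `budget_tokens`/`max_spans` stand in for `MICRO_BUDGET_TOKENS` and `MICRO_OUT_MAX_SPANS`:

```python
def budget_spans(spans, budget_tokens, max_spans):
    """Merge adjacent micro hits, then keep spans until the token budget is spent."""
    merged = []
    for start, end, tokens in sorted(spans):
        if merged and start <= merged[-1][1]:  # touching/overlapping: merge into previous
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], end), prev[2] + tokens)
        else:
            merged.append((start, end, tokens))
    out, used = [], 0
    for span in merged[:max_spans]:  # cap the number of output spans
        if used + span[2] > budget_tokens:  # global token budget
            break
        out.append(span)
        used += span[2]
    return out
```

The real pipeline also deduplicates token counts across merged spans and interacts with the gate-first prefilter, which this sketch omits.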
Environment toggles
