Moderation Filter Bypass via Wrong State Index in Arena Side-by-Side Views #3794
1. Exploitability Summary
| Aspect | Status |
|---|---|
| External Attack Path | ✅ Verified — Gradio HTTP endpoint (textbox submit / send button) |
| Runtime Protections Bypassed | ✅ Yes — Moderation filter checks wrong conversation state |
| Requires Other Vulnerabilities | ✅ None required |
| Real-World Exploitability | ✅ CONFIRMED — Partial moderation bypass for right-side model conversation history |
2. Vulnerability Details
2.1 Original Patch (34eca62)
The commit 34eca625b77fb8a514a61092d4db597e89c14f71 fixed a bug in fastchat/serve/gradio_block_arena_named.py where the moderation filter used states[0].conv.get_prompt() for both the left and right conversation sides:
# BEFORE (Vulnerable)
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[0].conv.get_prompt() # BUG: should be states[1]
# AFTER (Fixed)
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[1].conv.get_prompt() # FIXED
2.2 Unpatched Variants Found
Two files contain the exact same bug, unpatched:
Variant 1: fastchat/serve/gradio_block_arena_anony.py (Anonymous Arena)
File: fastchat/serve/gradio_block_arena_anony.py
Line: 310
Function: add_text() (line 269)
# Line 309-314 — VULNERABLE
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[0].conv.get_prompt() # ← BUG: should be states[1]
all_conv_text = (
all_conv_text_left[-1000:] + all_conv_text_right[-1000:] + "\nuser: " + text
)
flagged = moderation_filter(all_conv_text, model_list, do_moderation=True)
Key Note: In anonymous arena mode, do_moderation=True is explicitly set, meaning moderation is always active regardless of model type. This makes this variant more critical than the original patched bug, which only moderates certain model types.
Variant 2: fastchat/serve/gradio_block_arena_vision_named.py (Named Vision Arena)
File: fastchat/serve/gradio_block_arena_vision_named.py
Lines: 244-245
Function: add_text() (line 190)
# Lines 244-253 — VULNERABLE
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[0].conv.get_prompt() # ← BUG: should be states[1]
all_conv_text = (
all_conv_text_left[-1000:] + all_conv_text_right[-1000:] + "\nuser: " + text
)
images = convert_images_to_conversation_format(images)
text, image_flagged, csam_flag = moderate_input(
state0, text, all_conv_text, model_list, images, ip
)
2.3 Impact Analysis
What the bug does:
- In side-by-side arena views, users chat with two models simultaneously
- The moderation filter is supposed to check the conversation history of both models plus the new user input
- Due to the bug, the RIGHT-side model's (Model B's) conversation history is never checked — the LEFT-side model's (Model A's) history is checked twice instead
- This means if Model B generates content that should trigger moderation (e.g., violating sexual content thresholds), subsequent moderation checks won't catch it
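The effect described in the list above can be reproduced in isolation. The following is a minimal sketch, not FastChat code: `StubConv`, `StubState`, and `build_moderation_text` are hypothetical stand-ins that only mimic the `states[i].conv.get_prompt()` shape used in `add_text()`, and the `right_index` parameter exists purely to switch between the buggy and fixed behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class StubConv:
    """Hypothetical stand-in for a FastChat conversation object."""
    messages: list = field(default_factory=list)

    def get_prompt(self) -> str:
        # Join role/text pairs; the real prompt template is model-specific.
        return "\n".join(f"{role}: {msg}" for role, msg in self.messages)


@dataclass
class StubState:
    """Hypothetical stand-in for the per-model arena state."""
    conv: StubConv


def build_moderation_text(states, text, right_index):
    # Mirrors the string construction in add_text().
    # right_index=0 reproduces the bug, right_index=1 is the fix.
    left = states[0].conv.get_prompt()
    right = states[right_index].conv.get_prompt()
    return left[-1000:] + right[-1000:] + "\nuser: " + text


states = [
    StubState(StubConv([("user", "hi"), ("assistant", "A-REPLY")])),
    StubState(StubConv([("user", "hi"), ("assistant", "B-REPLY")])),
]

buggy = build_moderation_text(states, "next question", right_index=0)
fixed = build_moderation_text(states, "next question", right_index=1)

print("B-REPLY" in buggy)  # False: Model B's reply never reaches the filter
print("B-REPLY" in fixed)  # True
```

With the buggy index, Model A's reply appears twice in the moderated text and Model B's reply not at all, which matches the behaviour reported for the two unpatched files.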
Attack Scenario (Multi-Turn Bypass):
- User starts a side-by-side chat (anonymous or named vision arena)
- User sends an initial benign message → both models respond
- Model B generates a response containing borderline/violating content
- User sends another message → the moderation filter checks states[0] (Model A) history twice, completely ignoring Model B's response that contains violating content
- The moderation filter does NOT flag this, allowing the conversation to continue with violating content from Model B visible to the user
Severity: Medium — The new user INPUT text is still included and moderated, but the conversation history of the right-side model is not. This is primarily a concern for multi-turn conversations where model responses may escalate or contain violating content.
2.4 CWE Classification
- CWE-670: Always-Incorrect Control Flow Implementation
- CWE-284: Improper Access Control (content moderation bypass)
3. Reproduction Steps
Manual Reproduction
- Start the FastChat Gradio server with Arena mode:
    python3 -m fastchat.serve.controller
    python3 -m fastchat.serve.model_worker --model-path <model> --controller-address http://localhost:21001
    python3 -m fastchat.serve.gradio_web_server_multi --share
- Navigate to the Anonymous Arena tab (where do_moderation=True is always on)
- Send an initial message to establish conversation with two models
- Wait for Model B to respond. Note that its response content is stored in states[1].conv
- Send another message and observe that the moderation filter constructs all_conv_text using:
    all_conv_text_left = states[0].conv.get_prompt() # Model A's history
    all_conv_text_right = states[0].conv.get_prompt() # BUG: Also Model A's history!
  Model B's conversation history is never checked.
Code Verification
You can verify the bug by examining the source code directly:
# Variant 1: Anonymous Arena
grep -n "all_conv_text_right = states\[0\]" fastchat/serve/gradio_block_arena_anony.py
# Output: 310: all_conv_text_right = states[0].conv.get_prompt()
# Variant 2: Named Vision Arena
grep -n "all_conv_text_right = states\[0\]" fastchat/serve/gradio_block_arena_vision_named.py
# Output: 245: all_conv_text_right = states[0].conv.get_prompt()
# Compare with the FIXED file:
grep -n "all_conv_text_right = states\[1\]" fastchat/serve/gradio_block_arena_named.py
# Output: 182: all_conv_text_right = states[1].conv.get_prompt()
Suggested Fix
For both files, change states[0] to states[1] for the right-side conversation:
fastchat/serve/gradio_block_arena_anony.py line 310:
# Before:
all_conv_text_right = states[0].conv.get_prompt()
# After:
all_conv_text_right = states[1].conv.get_prompt()
fastchat/serve/gradio_block_arena_vision_named.py line 245:
# Before:
all_conv_text_right = states[0].conv.get_prompt()
# After:
all_conv_text_right = states[1].conv.get_prompt()
4. Attack Path Diagram
[User Browser] → [Gradio HTTP POST /api/predict]
│
▼
[add_text() function in arena module]
│
▼
[Construct all_conv_text]
├── all_conv_text_left = states[0].conv.get_prompt() ← Model A history ✅
├── all_conv_text_right = states[0].conv.get_prompt() ← BUG: Model A history AGAIN ❌
│ (should be states[1])
└── + "\nuser: " + text ← New user input ✅
│
▼
[moderation_filter(all_conv_text, ...)]
│
▼
[OpenAI Moderation API]
│
Result: Model B's conversation history is NEVER moderated
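One possible hardening, beyond the one-line index fix and not part of the upstream patch, is to build the moderated text by iterating over every state so that no hard-coded index can be mistyped. The sketch below assumes only that each state exposes `conv.get_prompt()`. The helper name `build_moderation_text`, the `tail_chars` parameter, and the `make_state` stub are hypothetical and do not exist in FastChat.

```python
from types import SimpleNamespace


def build_moderation_text(states, text, tail_chars=1000):
    """Concatenate the tail of every conversation, then the new user input."""
    tails = [state.conv.get_prompt()[-tail_chars:] for state in states]
    return "".join(tails) + "\nuser: " + text


def make_state(prompt):
    # Hypothetical stub exposing only conv.get_prompt(), for illustration.
    return SimpleNamespace(conv=SimpleNamespace(get_prompt=lambda: prompt))


states = [make_state("A-HISTORY"), make_state("B-HISTORY")]
all_conv_text = build_moderation_text(states, "hello")
print(all_conv_text)
```

For two states this produces the same string as the corrected `left[-1000:] + right[-1000:] + "\nuser: " + text` expression, while removing the duplicated line in which the index bug occurred.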
5. Files Affected
| File | Line | Status | Severity |
|---|---|---|---|
| fastchat/serve/gradio_block_arena_named.py | 182 | ✅ FIXED (by patch 34eca62) | N/A |
| fastchat/serve/gradio_block_arena_anony.py | 310 | ❌ UNPATCHED | Medium-High (always-on moderation) |
| fastchat/serve/gradio_block_arena_vision_named.py | 245 | ❌ UNPATCHED | Medium |
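To look for further variants beyond the files listed above, the grep checks can be generalised into a small scanner. This is a rough sketch under the assumption that any variant keeps the `all_conv_text_right` variable name. The function names `find_suspect_lines` and `scan_tree` are illustrative only, and a renamed variable would not be caught.

```python
import re
from pathlib import Path

# Matches assignments such as: all_conv_text_right = states[0].conv.get_prompt()
PATTERN = re.compile(r"all_conv_text_right\s*=\s*states\[0\]")


def find_suspect_lines(source: str):
    """Return 1-based line numbers where the right side reads states[0]."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if PATTERN.search(line)
    ]


def scan_tree(root: str):
    """Map each .py file under root to its suspect line numbers."""
    hits = {}
    for path in sorted(Path(root).rglob("*.py")):
        lines = find_suspect_lines(path.read_text(encoding="utf-8", errors="replace"))
        if lines:
            hits[str(path)] = lines
    return hits


sample = (
    "all_conv_text_left = states[0].conv.get_prompt()\n"
    "all_conv_text_right = states[0].conv.get_prompt()\n"
)
print(find_suspect_lines(sample))  # [2]
```

Running `scan_tree("fastchat/serve")` from the repository root should return the affected files and line numbers, if any remain.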