Moderation Filter Bypass via Wrong State Index in Arena Side-by-Side Views #3794

@YLChen-007

1. Exploitability Summary

| Aspect | Status |
| --- | --- |
| External Attack Path | ✅ Verified: Gradio HTTP endpoint (textbox submit / send button) |
| Runtime Protections Bypassed | ✅ Yes: moderation filter checks the wrong conversation state |
| Requires Other Vulnerabilities | ✅ None required |
| Real-World Exploitability | ✅ CONFIRMED: partial moderation bypass for the right-side model's conversation history |

2. Vulnerability Details

2.1 Original Patch (34eca62)

The commit 34eca625b77fb8a514a61092d4db597e89c14f71 fixed a bug in fastchat/serve/gradio_block_arena_named.py where the moderation filter used states[0].conv.get_prompt() for both the left and right conversation sides:

# BEFORE (Vulnerable)
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[0].conv.get_prompt()  # BUG: should be states[1]

# AFTER (Fixed)
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[1].conv.get_prompt()  # FIXED

2.2 Unpatched Variants Found

Two files contain the exact same bug, unpatched:

Variant 1: fastchat/serve/gradio_block_arena_anony.py (Anonymous Arena)

File: fastchat/serve/gradio_block_arena_anony.py
Line: 310
Function: add_text() (line 269)

# Line 309-314 — VULNERABLE
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[0].conv.get_prompt()   # ← BUG: should be states[1]
all_conv_text = (
    all_conv_text_left[-1000:] + all_conv_text_right[-1000:] + "\nuser: " + text
)
flagged = moderation_filter(all_conv_text, model_list, do_moderation=True)

Key Note: In anonymous arena mode, do_moderation=True is explicitly set, meaning moderation is always active regardless of model type. This makes this variant more critical than the original patched bug, which only moderates certain model types.

Variant 2: fastchat/serve/gradio_block_arena_vision_named.py (Named Vision Arena)

File: fastchat/serve/gradio_block_arena_vision_named.py
Lines: 244-245
Function: add_text() (line 190)

# Lines 244-253 — VULNERABLE
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[0].conv.get_prompt()   # ← BUG: should be states[1]
all_conv_text = (
    all_conv_text_left[-1000:] + all_conv_text_right[-1000:] + "\nuser: " + text
)

images = convert_images_to_conversation_format(images)

text, image_flagged, csam_flag = moderate_input(
    state0, text, all_conv_text, model_list, images, ip
)

2.3 Impact Analysis

What the bug does:

  • In side-by-side arena views, users chat with two models simultaneously
  • The moderation filter is supposed to check the conversation history of both models plus the new user input
  • Due to the bug, the RIGHT-side model's (Model B's) conversation history is never checked — the LEFT-side model's (Model A's) history is checked twice instead
  • As a result, if Model B generates content that should trigger moderation (e.g., content exceeding the sexual-content thresholds), later moderation checks never see it and cannot flag it
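
The effect described above can be reproduced in isolation. The following is a minimal sketch, not FastChat code: `StubConv` and `StubState` are hypothetical stand-ins for the real state objects, and only the three-line text construction mirrors the snippets quoted in this report.

```python
from dataclasses import dataclass, field


@dataclass
class StubConv:
    """Hypothetical stand-in for a conversation object with get_prompt()."""
    messages: list = field(default_factory=list)

    def get_prompt(self) -> str:
        return "\n".join(f"{role}: {msg}" for role, msg in self.messages)


@dataclass
class StubState:
    """Hypothetical stand-in for one side's arena state."""
    conv: StubConv


def build_text_buggy(states, text):
    left = states[0].conv.get_prompt()
    right = states[0].conv.get_prompt()  # wrong index, as in the unpatched files
    return left[-1000:] + right[-1000:] + "\nuser: " + text


def build_text_fixed(states, text):
    left = states[0].conv.get_prompt()
    right = states[1].conv.get_prompt()  # right side reads its own state
    return left[-1000:] + right[-1000:] + "\nuser: " + text


states = [
    StubState(StubConv([("user", "hi"), ("assistant", "reply from model A")])),
    StubState(StubConv([("user", "hi"), ("assistant", "MARKER reply from model B")])),
]

buggy = build_text_buggy(states, "next question")
fixed = build_text_fixed(states, "next question")

print("MARKER" in buggy)  # False: Model B's history never reaches the filter
print("MARKER" in fixed)  # True
```

With the wrong index, the text handed to the filter contains Model A's history twice and nothing from Model B, regardless of what Model B said.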

Attack Scenario (Multi-Turn Bypass):

  1. User starts a side-by-side chat (anonymous or named vision arena)
  2. User sends an initial benign message → both models respond
  3. Model B generates a response containing borderline/violating content
  4. User sends another message → the moderation filter checks states[0] (Model A) history twice and never reads Model B's history, which contains the violating response
  5. The filter therefore does not flag the conversation, and the session continues with Model B's violating content still in its history and visible to the user

Severity: Medium. The new user input is still included in the moderated text, but the right-side model's conversation history is not. The gap matters mainly in multi-turn conversations, where model responses may escalate or contain violating content.

2.4 CWE Classification

  • CWE-670: Always-Incorrect Control Flow Implementation
  • CWE-284: Improper Access Control (content moderation bypass)

3. Reproduction Steps

Manual Reproduction

  1. Start the FastChat Gradio server with Arena mode:

    python3 -m fastchat.serve.controller
    python3 -m fastchat.serve.model_worker --model-path <model> --controller-address http://localhost:21001
    python3 -m fastchat.serve.gradio_web_server_multi --share

  2. Navigate to the Anonymous Arena tab (where do_moderation=True is always on)

  3. Send an initial message to establish conversation with two models

  4. Wait for Model B to respond — note that its response content is stored in states[1].conv

  5. Send another message — observe that the moderation filter constructs all_conv_text using:

    all_conv_text_left = states[0].conv.get_prompt()    # Model A's history
    all_conv_text_right = states[0].conv.get_prompt()   # BUG: Also Model A's history!

    Model B's conversation history is never checked.

Code Verification

You can verify the bug by examining the source code directly:

# Variant 1: Anonymous Arena
grep -n "all_conv_text_right = states\[0\]" fastchat/serve/gradio_block_arena_anony.py
# Output: 310:    all_conv_text_right = states[0].conv.get_prompt()

# Variant 2: Named Vision Arena
grep -n "all_conv_text_right = states\[0\]" fastchat/serve/gradio_block_arena_vision_named.py
# Output: 245:    all_conv_text_right = states[0].conv.get_prompt()

# Compare with the FIXED file:
grep -n "all_conv_text_right = states\[1\]" fastchat/serve/gradio_block_arena_named.py
# Output: 182:    all_conv_text_right = states[1].conv.get_prompt()
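
The same check can be scripted so that it covers every Python file under a directory, which helps catch further copies of the pattern. This is a sketch only: `find_wrong_index` is a hypothetical helper, and the demo runs against two throwaway files rather than the real repository.

```python
import re
import tempfile
from pathlib import Path

# Matches the right-side variable being assigned from states[0].
WRONG_INDEX = re.compile(r"all_conv_text_right\s*=\s*states\[0\]")


def find_wrong_index(root):
    """Return (file name, line number) for each suspicious assignment under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if WRONG_INDEX.search(line):
                hits.append((path.name, lineno))
    return hits


# Demo on temporary files: one with the bug, one with the fix.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "buggy.py").write_text(
        "all_conv_text_left = states[0].conv.get_prompt()\n"
        "all_conv_text_right = states[0].conv.get_prompt()\n",
        encoding="utf-8",
    )
    Path(tmp, "fixed.py").write_text(
        "all_conv_text_left = states[0].conv.get_prompt()\n"
        "all_conv_text_right = states[1].conv.get_prompt()\n",
        encoding="utf-8",
    )
    demo_hits = find_wrong_index(tmp)

print(demo_hits)  # [('buggy.py', 2)]
```

In a checkout of the repository, calling `find_wrong_index("fastchat/serve")` should list the same locations as the grep commands above, assuming the files are unchanged.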

Suggested Fix

For both files, change states[0] to states[1] for the right-side conversation:

fastchat/serve/gradio_block_arena_anony.py line 310:

# Before:
all_conv_text_right = states[0].conv.get_prompt()
# After:
all_conv_text_right = states[1].conv.get_prompt()

fastchat/serve/gradio_block_arena_vision_named.py line 245:

# Before:
all_conv_text_right = states[0].conv.get_prompt()
# After:
all_conv_text_right = states[1].conv.get_prompt()
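
Because the same three lines are duplicated across several arena modules, one optional hardening step is to build the moderation text by iterating over the states, so that no side can be skipped by a copy-paste index error. The sketch below assumes a hypothetical helper named `build_moderation_text`, which does not exist in FastChat; the stub states are illustrative only.

```python
from types import SimpleNamespace


def build_moderation_text(states, text, tail_chars=1000):
    """Join the tail of every side's prompt, then append the new user input.

    For two states this yields the same string as the corrected
    left[-1000:] + right[-1000:] + "\nuser: " + text expression.
    """
    tails = [state.conv.get_prompt()[-tail_chars:] for state in states]
    return "".join(tails) + "\nuser: " + text


def make_state(history):
    """Hypothetical stub exposing only conv.get_prompt()."""
    return SimpleNamespace(conv=SimpleNamespace(get_prompt=lambda: history))


states = [make_state("A" * 1500), make_state("B" * 1500)]
result = build_moderation_text(states, "hello")

# Regression check: both sides contribute their last 1000 characters.
assert result == "A" * 1000 + "B" * 1000 + "\nuser: hello"
```

The one-line change suggested above remains the minimal fix; a shared helper like this is only a way to keep the modules from drifting apart again.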

4. Attack Path Diagram

[User Browser] → [Gradio HTTP POST /api/predict]
       │
       ▼
[add_text() function in arena module]
       │
       ▼
[Construct all_conv_text]
  ├── all_conv_text_left  = states[0].conv.get_prompt()  ← Model A history ✅
  ├── all_conv_text_right = states[0].conv.get_prompt()  ← BUG: Model A history AGAIN ❌
  │                                                         (should be states[1])
  └── + "\nuser: " + text  ← New user input ✅
       │
       ▼
[moderation_filter(all_conv_text, ...)]
       │
       ▼
[OpenAI Moderation API]
       │
  Result: Model B's conversation history is NEVER moderated

5. Files Affected

| File | Line | Status | Severity |
| --- | --- | --- | --- |
| fastchat/serve/gradio_block_arena_named.py | 182 | ✅ FIXED (by patch 34eca62) | |
| fastchat/serve/gradio_block_arena_anony.py | 310 | ❌ UNPATCHED | Medium-High (always-on moderation) |
| fastchat/serve/gradio_block_arena_vision_named.py | 245 | ❌ UNPATCHED | Medium |
