# Moderation Filter Bypass via Wrong State Index in Arena Side-by-Side Views #3794

## 1. Exploitability Summary

| Aspect | Status |
|--------|--------|
| External Attack Path | ✅ Verified — Gradio HTTP endpoint (textbox submit / send button) |
| Runtime Protections Bypassed | ✅ Yes — Moderation filter checks wrong conversation state |
| Requires Other Vulnerabilities | ✅ None required |
| Real-World Exploitability | ✅ CONFIRMED — Partial moderation bypass for right-side model conversation history |

## 2. Vulnerability Details

### 2.1 Original Patch (`34eca62`)

The commit `34eca625b77fb8a514a61092d4db597e89c14f71` fixed a bug in `fastchat/serve/gradio_block_arena_named.py` where the moderation filter used `states[0].conv.get_prompt()` for **both** the left and right conversation sides:

```python
# BEFORE (Vulnerable)
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[0].conv.get_prompt()  # BUG: should be states[1]

# AFTER (Fixed)
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[1].conv.get_prompt()  # FIXED
```

### 2.2 Unpatched Variants Found

**Two files contain the exact same bug, unpatched:**

#### Variant 1: `fastchat/serve/gradio_block_arena_anony.py` (Anonymous Arena)

- **File:** `fastchat/serve/gradio_block_arena_anony.py`
- **Line:** 310
- **Function:** `add_text()` (line 269)

```python
# Lines 309-314: VULNERABLE
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[0].conv.get_prompt()   # ← BUG: should be states[1]
all_conv_text = (
    all_conv_text_left[-1000:] + all_conv_text_right[-1000:] + "\nuser: " + text
)
flagged = moderation_filter(all_conv_text, model_list, do_moderation=True)
```

**Key Note:** In anonymous arena mode, `do_moderation=True` is passed explicitly, so moderation is **always active** regardless of model type. This makes the variant **more critical** than the originally patched code path, where moderation applies only to certain model types.
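The difference between the two call styles can be illustrated with a toy gating function. This is only a sketch of the behaviour described above, not FastChat's code: `should_moderate` and the `MODERATED_KEYWORDS` tuple are hypothetical names, and keyword matching is an assumption about how "certain model types" are selected.

```python
# Hypothetical keyword list, invented for illustration only.
MODERATED_KEYWORDS = ("gpt", "claude")


def should_moderate(model_list, do_moderation=False):
    """Return True when the history text would be sent to the moderation check."""
    if do_moderation:
        # Anonymous arena style: the caller forces moderation on.
        return True
    # Named arena style: moderation depends on the selected models.
    return any(k in name for name in model_list for k in MODERATED_KEYWORDS)


models = ["vicuna-13b", "llama-2-7b"]
print(should_moderate(models))                      # False
print(should_moderate(models, do_moderation=True))  # True
```

Under these assumptions, every anonymous arena conversation reaches the moderation check, so every one of them is affected by the wrong index.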

#### Variant 2: `fastchat/serve/gradio_block_arena_vision_named.py` (Named Vision Arena)

- **File:** `fastchat/serve/gradio_block_arena_vision_named.py`
- **Lines:** 244-245
- **Function:** `add_text()` (line 190)

```python
# Lines 244-253 — VULNERABLE
all_conv_text_left = states[0].conv.get_prompt()
all_conv_text_right = states[0].conv.get_prompt()   # ← BUG: should be states[1]
all_conv_text = (
    all_conv_text_left[-1000:] + all_conv_text_right[-1000:] + "\nuser: " + text
)

images = convert_images_to_conversation_format(images)

text, image_flagged, csam_flag = moderate_input(
    state0, text, all_conv_text, model_list, images, ip
)
```

### 2.3 Impact Analysis

**What the bug does:**
- In side-by-side arena views, users chat with two models simultaneously
- The moderation filter is supposed to check the conversation history of **both** models plus the new user input
- Due to the bug, the right-side model's (Model B's) conversation history is never checked; the left-side model's (Model A's) history is checked twice instead
- As a result, if Model B generates content that should trigger moderation (for example, content exceeding the sexual-content threshold), later checks of the conversation history never see it

**Attack Scenario (Multi-Turn Bypass):**
1. User starts a side-by-side chat (anonymous or named vision arena)
2. User sends an initial benign message → both models respond
3. Model B generates a response containing borderline/violating content
4. User sends another message → the moderation filter checks the `states[0]` (Model A) history twice and never reads Model B's response
5. The filter does not flag the conversation, so it continues with Model B's violating content still in the history and visible to the user
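The scenario can be modelled without a running server. The sketch below uses stand-in classes (`FakeConv`, `FakeState`) and a keyword-based `toy_filter` in place of FastChat's state objects and `moderation_filter`. All of these names are hypothetical; they only mimic the `states[i].conv.get_prompt()` shape seen in the snippets above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeConv:
    """Stand-in for a conversation object; only get_prompt() is modelled."""
    messages: list = field(default_factory=list)

    def get_prompt(self) -> str:
        return "\n".join(f"{role}: {msg}" for role, msg in self.messages)


@dataclass
class FakeState:
    conv: FakeConv = field(default_factory=FakeConv)


def toy_filter(text: str) -> bool:
    """Hypothetical filter: flags any text containing a marker string."""
    return "VIOLATING" in text


def build_text(states, text, right_index):
    """Build the moderated text; right_index=0 reproduces the bug."""
    left = states[0].conv.get_prompt()
    right = states[right_index].conv.get_prompt()
    return left[-1000:] + right[-1000:] + "\nuser: " + text


# Turn 1: benign prompt; Model A answers cleanly, Model B does not.
states = [FakeState(), FakeState()]
states[0].conv.messages += [("user", "hello"), ("assistant", "Hi there.")]
states[1].conv.messages += [("user", "hello"), ("assistant", "VIOLATING reply")]

# Turn 2: the next benign message triggers the history check.
flagged_buggy = toy_filter(build_text(states, "tell me more", right_index=0))
flagged_fixed = toy_filter(build_text(states, "tell me more", right_index=1))
print(flagged_buggy, flagged_fixed)  # False True
```

With the buggy index, Model B's reply never reaches the filter, so the second turn passes; with index 1 the same turn is flagged.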

**Severity:** Medium. The new user input text is still included and moderated, but the right-side model's conversation **history** is not. The gap matters mainly in multi-turn conversations, where model responses may escalate or contain violating content.

### 2.4 CWE Classification

- **CWE-670**: Always-Incorrect Control Flow Implementation
- **CWE-284**: Improper Access Control (content moderation bypass)

## 3. Reproduction Steps

### Manual Reproduction

1. **Start the FastChat Gradio server with Arena mode** (each command runs in the foreground, so use a separate terminal for each):
   ```bash
   python3 -m fastchat.serve.controller
   python3 -m fastchat.serve.model_worker --model-path <model> --controller-address http://localhost:21001
   python3 -m fastchat.serve.gradio_web_server_multi --share
   ```

2. **Navigate to the Anonymous Arena tab** (where `do_moderation=True` is always on)

3. **Send an initial message** to establish conversation with two models

4. **Wait for Model B to respond** — note that its response content is stored in `states[1].conv`

5. **Send another message** — observe that the moderation filter constructs `all_conv_text` using:
   ```python
   all_conv_text_left = states[0].conv.get_prompt()    # Model A's history
   all_conv_text_right = states[0].conv.get_prompt()   # BUG: Also Model A's history!
   ```
   Model B's conversation history is never checked.

### Code Verification

You can verify the bug by examining the source code directly:

```bash
# Variant 1: Anonymous Arena
grep -n "all_conv_text_right = states\[0\]" fastchat/serve/gradio_block_arena_anony.py
# Output: 310:    all_conv_text_right = states[0].conv.get_prompt()

# Variant 2: Named Vision Arena
grep -n "all_conv_text_right = states\[0\]" fastchat/serve/gradio_block_arena_vision_named.py
# Output: 245:    all_conv_text_right = states[0].conv.get_prompt()

# Compare with the FIXED file:
grep -n "all_conv_text_right = states\[1\]" fastchat/serve/gradio_block_arena_named.py
# Output: 182:    all_conv_text_right = states[1].conv.get_prompt()
```
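The same check can be scripted so that it also catches indices other than 0. The following is a minimal sketch; `find_wrong_index` is a hypothetical helper, and it assumes the assignment keeps the `all_conv_text_right = states[N]` form shown above.

```python
import re

# Matches the right-side assignment and captures the index it reads from.
PATTERN = re.compile(r"all_conv_text_right\s*=\s*states\[(\d+)\]")


def find_wrong_index(source: str):
    """Return 1-based line numbers where the right side reads an index other than 1."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = PATTERN.search(line)
        if match and match.group(1) != "1":
            hits.append(lineno)
    return hits


sample = (
    "all_conv_text_left = states[0].conv.get_prompt()\n"
    "all_conv_text_right = states[0].conv.get_prompt()\n"
)
print(find_wrong_index(sample))  # [2]
```

To scan a checkout, read each `gradio_block_arena_*.py` file under `fastchat/serve/` and pass its text to `find_wrong_index`.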

### Suggested Fix

For both files, change `states[0]` to `states[1]` for the right-side conversation:

**`fastchat/serve/gradio_block_arena_anony.py` line 310:**
```python
# Before:
all_conv_text_right = states[0].conv.get_prompt()
# After:
all_conv_text_right = states[1].conv.get_prompt()
```

**`fastchat/serve/gradio_block_arena_vision_named.py` line 245:**
```python
# Before:
all_conv_text_right = states[0].conv.get_prompt()
# After:
all_conv_text_right = states[1].conv.get_prompt()
```
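Because the same one-character mistake appeared in three files, a shared helper that loops over all states would remove the hand-written index altogether. This is a sketch of one possible refactor, not part of the upstream patch; `build_moderation_text` and `make_state` are hypothetical names, and the stub states only imitate the `conv.get_prompt()` interface.

```python
from types import SimpleNamespace


def build_moderation_text(states, text, window=1000):
    """Join the tail of every side's prompt with the new user message."""
    history = "".join(state.conv.get_prompt()[-window:] for state in states)
    return history + "\nuser: " + text


def make_state(prompt):
    """Stub state exposing conv.get_prompt(), for testing only."""
    return SimpleNamespace(conv=SimpleNamespace(get_prompt=lambda: prompt))


# Regression check: each side's history appears exactly once.
states = [make_state("A-history"), make_state("B-history")]
result = build_moderation_text(states, "hi")
assert result.count("A-history") == 1
assert result.count("B-history") == 1
assert result.endswith("\nuser: hi")
```

For two states this produces the same string as the fixed code, and an assertion like the one above would fail if either side's history were dropped or duplicated.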

## 4. Attack Path Diagram

```
[User Browser] → [Gradio HTTP POST /api/predict]
       │
       ▼
[add_text() function in arena module]
       │
       ▼
[Construct all_conv_text]
  ├── all_conv_text_left  = states[0].conv.get_prompt()  ← Model A history ✅
  ├── all_conv_text_right = states[0].conv.get_prompt()  ← BUG: Model A history AGAIN ❌
  │                                                         (should be states[1])
  └── + "\nuser: " + text  ← New user input ✅
       │
       ▼
[moderation_filter(all_conv_text, ...)]
       │
       ▼
[OpenAI Moderation API]
       │
  Result: Model B's conversation history is NEVER moderated
```

## 5. Files Affected

| File | Line | Status | Severity |
|------|------|--------|----------|
| `fastchat/serve/gradio_block_arena_named.py` | 182 | ✅ FIXED (by patch `34eca62`) | N/A |
| `fastchat/serve/gradio_block_arena_anony.py` | 310 | ❌ UNPATCHED | Medium-High (always-on moderation) |
| `fastchat/serve/gradio_block_arena_vision_named.py` | 245 | ❌ UNPATCHED | Medium |

