Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions cache-lock-bug-repro/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
/target
56 changes: 56 additions & 0 deletions cache-lock-bug-repro/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
[package]
name = "cache-lock-bug-repro"
version = "0.1.0"
edition = "2021"
description = "Reproduction of Pingora cache lock client disconnect bug"

# Keep this package out of the parent workspace
[workspace]

[[bin]]
name = "slow-origin"
path = "src/bin/slow_origin.rs"

[[bin]]
name = "proxy"
path = "src/bin/proxy.rs"

[[bin]]
name = "test-client"
path = "src/bin/test_client.rs"

[dependencies]
# Pingora dependencies - use path to local checkout
pingora = { path = "../pingora" }
pingora-core = { path = "../pingora-core" }
pingora-proxy = { path = "../pingora-proxy" }
pingora-cache = { path = "../pingora-cache" }
pingora-http = { path = "../pingora-http" }
pingora-error = { path = "../pingora-error" }

# Async runtime
tokio = { version = "1", features = ["full"] }
async-trait = "0.1"

# HTTP client for testing
reqwest = { version = "0.11", features = ["json"] }

# Logging
env_logger = "0.10"
log = "0.4"

# CLI
clap = { version = "4", features = ["derive"] }

# Time utilities
chrono = "0.4"

# HTTP types
http = "1.0"
bytes = "1.0"

# For the slow origin server
hyper = { version = "1.0", features = ["server", "http1"] }
hyper-util = { version = "0.1", features = ["tokio", "server", "http1"] }

once_cell = "1.18"
213 changes: 213 additions & 0 deletions cache-lock-bug-repro/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,213 @@
# Pingora Cache Lock Client Disconnect Bug Reproduction

This repository contains a standalone reproduction of a bug in Pingora's cache lock mechanism that causes requests to be held for up to 60 seconds when clients disconnect while waiting on a cache lock.

## Bug Summary

When Pingora's cache lock feature is enabled and multiple concurrent requests hit the same uncached URL:

1. **Writer**: The first request acquires the cache lock and fetches from origin
2. **Readers**: Subsequent requests wait on the cache lock
3. **BUG**: If a reader's downstream client disconnects while waiting:
- The server-side handler continues waiting on the cache lock
- Server resources are held until the lock times out (default: 60 seconds)
- This can cause resource exhaustion under high concurrency

### Impact

- Real user-facing delays when clients disconnect (e.g., page navigation, mobile network issues)
- Server resource exhaustion under high load
- Wasted connection pool capacity
- The issue affects HTTP/1.1, HTTP/2, and indirectly HTTP/3 (where QUIC stream cancellation may not propagate)

## Files

- `src/bin/slow_origin.rs` - A simple origin server that delays responses (configurable)
- `src/bin/proxy.rs` - A minimal Pingora proxy with caching and cache lock enabled
- `src/bin/test_client.rs` - Test client that demonstrates the bug

## Prerequisites

- Rust toolchain (1.70+)
- Pingora source code (this repo should be adjacent to pingora crates)

## Running the Reproduction

### Step 1: Start the slow origin server

```bash
# In terminal 1
RUST_LOG=info cargo run --bin slow-origin -- --port 8000 --delay 5
```

This starts an origin server that delays responses by 5 seconds (configurable via `x-set-sleep` header).

### Step 2: Start the Pingora proxy

```bash
# In terminal 2
RUST_LOG=info cargo run --bin proxy -- --port 6148 --origin-port 8000
```

This starts a Pingora proxy with:
- Caching enabled (in-memory)
- Cache lock enabled with 60-second timeout
- Listening on port 6148

### Step 3: Run the test client

```bash
# In terminal 3
RUST_LOG=info cargo run --bin test-client -- --proxy-port 6148 --num-readers 3
```

## Expected Results

### Without the Fix (BUG present)

```
=== RESULTS ===

Total test time: ~65s

Server close times after client disconnect:
Min: ~60s
Max: ~60s
Avg: ~60s

FAIL: Server took too long to close connections

Max close time: 60.123s
Threshold: 2s

This indicates the bug is present:
Server waited for cache lock timeout instead of detecting disconnect
```

### With the Fix

```
=== RESULTS ===

Total test time: ~5s (origin delay)

Server close times after client disconnect:
Min: ~100ms
Max: ~500ms
Avg: ~300ms

PASS: Server closed connections quickly after client disconnect

Max close time 423ms < threshold 2s

This indicates the fix is working:
Server detected client disconnect
Server exited cache lock wait early
```

## The Fix

The fix involves using `tokio::select!` to race the cache lock wait against client disconnect detection:

```rust
// In proxy_cache.rs
async fn cache_lock_wait_or_disconnect(
session: &mut Session,
_ctx: &SV::CTX,
) -> Option<LockStatus> {
// Skip disconnect detection for subrequests (no downstream client)
if session.subrequest_ctx.is_some() {
return Some(session.cache.cache_lock_wait().await);
}

// Can only detect disconnect if request body is done
let body_done = session.as_mut().is_body_done();
if !body_done {
return Some(session.cache.cache_lock_wait().await);
}

// Race cache lock wait against client disconnect
tokio::select! {
biased;

lock_status = session.cache.cache_lock_wait() => {
Some(lock_status)
}

disconnect_result = session.downstream_session.read_body_or_idle(true) => {
// Client disconnected
None
}
}
}
```

When a client disconnects:
1. `read_body_or_idle(true)` returns an error
2. `cache_lock_wait_or_disconnect` returns `None`
3. The request handler can clean up and exit early

## Configuration Options

### Slow Origin Server

```
--port <PORT> Port to listen on [default: 8000]
--delay <DELAY> Default response delay in seconds [default: 5]
```

### Proxy

```
--port <PORT> Port to listen on [default: 6148]
--origin-port <PORT> Origin server port [default: 8000]
--lock-age-timeout <SECS> Cache lock age timeout [default: 60]
--lock-wait-timeout <SECS> Cache lock wait timeout [default: 60]
```

### Test Client

```
--proxy-port <PORT> Proxy port [default: 6148]
--num-readers <N> Number of reader clients [default: 3]
--disconnect-delay-ms <MS> Time before disconnecting readers [default: 500]
--threshold-ms <MS> Success threshold [default: 2000]
--origin-delay <SECS> Origin delay for writer [default: 5]
```

## Test Variations

### Shorter lock timeout

To demonstrate the bug with a shorter timeout (faster iteration):

```bash
# Start origin with 10s delay
cargo run --bin slow-origin -- --delay 10

# Start proxy (lock times out at 5s)
cargo run --bin proxy -- --lock-age-timeout 5

# Run test
cargo run --bin test-client -- --threshold-ms 1000 --origin-delay 10
```

### Multiple readers

To show resource impact with more concurrent readers:

```bash
cargo run --bin test-client -- --num-readers 20
```

## Related Issues

This bug was discovered while investigating 60-second delays in production systems using Pingora as a caching proxy. The delays were particularly problematic when:

1. Mobile clients with unstable connections disconnected frequently
2. Users navigated away from pages before content loaded
3. HTTP/3 was in use (QUIC stream cancellation didn't propagate to Pingora)

## License

Apache-2.0 (same as Pingora)
64 changes: 64 additions & 0 deletions cache-lock-bug-repro/run_test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
#!/bin/bash
# Run the cache lock bug reproduction test
#
# This script:
# 1. Builds the test binaries
# 2. Starts the slow origin server
# 3. Starts the Pingora proxy
# 4. Runs the test client
# 5. Cleans up

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"

# Configuration
ORIGIN_PORT=${ORIGIN_PORT:-8000}
PROXY_PORT=${PROXY_PORT:-6148}
ORIGIN_DELAY=${ORIGIN_DELAY:-5}
NUM_READERS=${NUM_READERS:-3}
DISCONNECT_DELAY_MS=${DISCONNECT_DELAY_MS:-500}
THRESHOLD_MS=${THRESHOLD_MS:-2000}

cleanup() {
echo "Cleaning up..."
pkill -f "slow-origin" 2>/dev/null || true
pkill -f "cache-lock-bug-repro.*proxy" 2>/dev/null || true
exit
}
trap cleanup EXIT INT TERM

echo "=== Pingora Cache Lock Bug Reproduction ==="
echo ""

# Build
echo "Building..."
cargo build --release 2>&1 | grep -E "(Compiling cache-lock|Finished|error)" || true
echo ""

# Start origin server
echo "Starting slow origin server (port $ORIGIN_PORT, delay ${ORIGIN_DELAY}s)..."
RUST_LOG=info ./target/release/slow-origin --port $ORIGIN_PORT --delay $ORIGIN_DELAY &
ORIGIN_PID=$!
sleep 1

# Start proxy server
echo "Starting Pingora proxy (port $PROXY_PORT, origin port $ORIGIN_PORT)..."
RUST_LOG=info ./target/release/proxy --port $PROXY_PORT --origin-port $ORIGIN_PORT &
PROXY_PID=$!
sleep 2

# Run test
echo ""
echo "Running test client..."
echo ""
RUST_LOG=info ./target/release/test-client \
--proxy-port $PROXY_PORT \
--num-readers $NUM_READERS \
--disconnect-delay-ms $DISCONNECT_DELAY_MS \
--threshold-ms $THRESHOLD_MS \
--origin-delay $ORIGIN_DELAY

echo ""
echo "Test complete."
Loading
Loading