Thread Tag Matching Fails in Container Environments (PID=1) #250

### Summary

Thread tag matching fails completely when the profiled process runs as PID 1, which is typical for the main process of a Docker container. The tag ID is computed as `hash(thread_id % pid)`. With `pid = 1` the modulo is always 0, so every thread gets the same ID, `hash(0)`, and threads can no longer be told apart.

### Environment

- **Runtime**: Docker container
- **Process PID**: 1 (typical for container main process)
- **Affected Code**: 
  - `pyroscope_ffi/python/lib/src/lib.rs`: `add_thread_tag()` and `remove_thread_tag()`
  - `src/backend/ruleset.rs`: Thread tag matching logic

### Root Cause Analysis

#### Current Implementation

In `add_thread_tag()`:
```rust
let pid = std::process::id();  // pid = 1 in container
let mut hasher = DefaultHasher::new();
hasher.write_u64(thread_id % pid as u64);  // thread_id % 1 = 0 for ALL threads
let id = hasher.finish();  // hash(0) - same for all threads!
```

In `ruleset.rs` matching:
```rust
if let (Some(stack_thread_id), Some(stack_pid)) = (self.thread_id, self.pid) {
    let mut hasher = DefaultHasher::new();
    hasher.write_u64(stack_thread_id % stack_pid as u64);  // thread_id % 1 = 0
    let id = hasher.finish();  // hash(0) - matches ALL threads
    if &id == thread_id {
        return Some(tag.clone());
    }
}
```

#### Problem Breakdown

When PID = 1:

```
Thread-8  (thread_id = 22944203888384)
  → hash(22944203888384 % 1) = hash(0) = X

Thread-22 (thread_id = 22931155187456)
  → hash(22931155187456 % 1) = hash(0) = X

Thread-3  (thread_id = 22944157726464)
  → hash(22944157726464 % 1) = hash(0) = X
```

**Result**: All threads are assigned the same hash ID, so:
1. Every `ThreadTag` rule is stored under the same ID
2. Rules either collapse through HashSet deduplication, or every tag matches every thread
3. Multiple span_ids appear on a single thread in profiling data
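
The collapse can be reproduced outside the profiler. The following is a minimal sketch, not the project's code: `tag_id` is a hypothetical helper that mirrors the `hash(thread_id % pid)` derivation quoted above, and the thread IDs are the ones from the breakdown.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical helper mirroring the derivation quoted above:
/// hash(thread_id % pid).
fn tag_id(thread_id: u64, pid: u32) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write_u64(thread_id % pid as u64);
    hasher.finish()
}

fn main() {
    // Thread IDs taken from the breakdown above.
    let thread_ids = [22944203888384u64, 22931155187456, 22944157726464];
    let pid = 1; // typical for a container's main process

    let ids: Vec<u64> = thread_ids.iter().map(|&t| tag_id(t, pid)).collect();
    for (t, id) in thread_ids.iter().zip(&ids) {
        println!("thread_id = {t} -> id = {id:#018x}");
    }

    // Every thread collapses onto the same ID when pid == 1.
    assert!(ids.iter().all(|&id| id == ids[0]));
}
```

All three lines print the same ID, because the hasher only ever sees the input 0.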

### Observed Symptoms

From production logs:
```
Multiple span_id values found: ["106a0062439ea85d", "7100fce0bf50a932", "aa40f80152bed675"], 
thread_id: 22944203888384, thread_name: Thread-8
```

Each thread shows multiple span_ids that should belong to different threads or different time periods.

### Temporary Fix

Commenting out the hash calculation and using raw `thread_id` directly:

```rust
#[no_mangle]
pub extern "C" fn add_thread_tag(thread_id: u64, key: *const c_char, value: *const c_char) -> bool {
    let key = unsafe { CStr::from_ptr(key) }.to_str().unwrap().to_owned();
    let value = unsafe { CStr::from_ptr(value) }.to_str().unwrap().to_owned();

    // Directly use thread_id instead of hash
    return ffikit::send(ffikit::Signal::AddThreadTag(thread_id, key, value)).is_ok();
}
```

This resolves the issue in container environments.
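
If the FFI side sends the raw `thread_id`, the matching logic in `ruleset.rs` presumably has to compare raw IDs as well, otherwise it would compare a raw ID against `hash(thread_id % pid)`. The following is a minimal sketch of such a comparison. `Tag`, `Rule`, `StackInfo`, and `match_rule` are simplified, hypothetical stand-ins written for this example, not the project's actual definitions.

```rust
// Hypothetical, simplified stand-ins for the types used by the matching logic.
#[derive(Clone, Debug, PartialEq)]
struct Tag {
    key: String,
    value: String,
}

enum Rule {
    /// A tag that applies to one thread, identified by its raw thread ID.
    ThreadTag(u64, Tag),
}

struct StackInfo {
    thread_id: Option<u64>,
}

impl StackInfo {
    /// Returns the tag if the rule targets this stack's thread.
    fn match_rule(&self, rule: &Rule) -> Option<Tag> {
        match rule {
            Rule::ThreadTag(rule_thread_id, tag) => {
                // Compare raw thread IDs: no modulo, no hash.
                if self.thread_id == Some(*rule_thread_id) {
                    Some(tag.clone())
                } else {
                    None
                }
            }
        }
    }
}

fn main() {
    let tag = Tag { key: "span_id".to_string(), value: "106a0062439ea85d".to_string() };
    let rule = Rule::ThreadTag(22944203888384, tag.clone());

    let thread_8 = StackInfo { thread_id: Some(22944203888384) };
    let thread_22 = StackInfo { thread_id: Some(22931155187456) };

    assert_eq!(thread_8.match_rule(&rule), Some(tag));
    assert_eq!(thread_22.match_rule(&rule), None);
}
```

The same raw-ID convention would also need to apply in `remove_thread_tag()`, so that a tag is removed under the ID it was added with.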

### Questions

1. **Why use `thread_id % pid` in the first place?**
   - Was this designed for multi-process scenarios?
   - What problem does the modulo operation solve?

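2. **Why apply DefaultHasher?**
   - All hashers created with `DefaultHasher::new()` start from the same state, so results are deterministic within one build, but the algorithm is unspecified and may change between Rust releases
   - Even without the PID=1 issue, this could cause matching inconsistencies if the code that stores the ID and the code that matches it are built with different toolchains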

3. **Container-aware design**
   - Should we detect when PID=1 and use a different strategy?
   - Or should we abandon the hash approach entirely?
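
On question 2, the behaviour of `DefaultHasher` within a single build can be checked directly. A minimal sketch follows; `hash_u64` is a hypothetical helper written for this example. It relies on the standard library's documented guarantee that hashers created through `new` or `default` are the same as each other. It says nothing about stability across Rust releases, which the documentation explicitly does not promise.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical helper: hash a single u64 with a fresh DefaultHasher.
fn hash_u64(value: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write_u64(value);
    hasher.finish()
}

fn main() {
    // Two independent hashers created with `new()` agree on the same input.
    let a = hash_u64(22944203888384);
    let b = hash_u64(22944203888384);
    assert_eq!(a, b);
    println!("{a:#018x}");
}
```

Randomized hashing in the standard library comes from `RandomState`, which `HashMap` uses by default, not from `DefaultHasher::new()`.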

### Proposed Solutions

#### Option 1: Remove Hash Completely (Recommended)
Use raw thread_id directly:
```rust
let id = thread_id;  // No modulo, no hash
```

#### Option 2: PID-aware Hash
```rust
let id = if pid == 1 {
    thread_id  // Use raw thread_id in containers
} else {
    // Keep existing logic for non-container scenarios
    let mut hasher = DefaultHasher::new();
    hasher.write_u64(thread_id % pid as u64);
    hasher.finish()
};
```
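
One caveat with keeping the existing logic for `pid != 1`: `thread_id % pid` maps every thread ID into the range `0..pid`, so two threads whose IDs are congruent modulo the PID still share an ID. PID 1 is only the extreme case. A minimal sketch, where `bucket` is a hypothetical helper and both the PID and the second thread ID are made up for illustration:

```rust
/// Hypothetical helper: the value fed to the hasher in the existing logic.
fn bucket(thread_id: u64, pid: u32) -> u64 {
    thread_id % pid as u64
}

fn main() {
    let pid: u32 = 4096; // hypothetical non-container PID
    let a: u64 = 22944203888384; // Thread-8 from the breakdown above
    let b = a + pid as u64; // hypothetical second thread ID, congruent to `a` modulo pid

    assert_ne!(a, b);
    // Different threads, same hasher input, therefore the same tag ID.
    assert_eq!(bucket(a, pid), bucket(b, pid));
    println!("{a} and {b} both map to {}", bucket(a, pid));
}
```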

#### Option 3: Better Hash Algorithm
If hashing is necessary, use a hash that is stable across builds, or at least a formula that avoids the modulo:
```rust
let id = if pid <= 1 {
    thread_id
} else {
    // Use a different formula that doesn't break at pid=1
    thread_id.wrapping_mul(pid as u64)
};
```
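
Note that `wrapping_mul` is only collision-free when the multiplier is odd. Multiplying by an even PID shifts high bits out of the 64-bit result, so distinct thread IDs can still map to the same value. A minimal sketch, where `option3_id` is a hypothetical wrapper around the formula above and the second thread ID is constructed for illustration:

```rust
/// Hypothetical wrapper around the Option 3 formula.
fn option3_id(thread_id: u64, pid: u32) -> u64 {
    if pid <= 1 {
        thread_id
    } else {
        thread_id.wrapping_mul(pid as u64)
    }
}

fn main() {
    let a: u64 = 22944203888384; // Thread-8 from the breakdown above
    let b = a + (1u64 << 63); // hypothetical ID that differs only in the top bit

    // Even PID: the top bit is shifted out, so both IDs collide.
    assert_eq!(option3_id(a, 2), option3_id(b, 2));

    // Odd PID: multiplication modulo 2^64 is invertible, so the IDs stay distinct.
    assert_ne!(option3_id(a, 3), option3_id(b, 3));
}
```

Real thread IDs rarely differ only in their highest bits, so this may not matter in practice, but it does mean Option 3 offers no uniqueness guarantee that Option 1 lacks.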

### Additional Context

This issue primarily affects:
- Docker/Podman containers
- Systemd services only when they run in their own PID namespace (an ordinary `Type=simple` service is not PID 1; systemd itself is)
- Any environment where the profiled process is PID 1

The issue is critical for OpenTelemetry span correlation, where each thread should have exactly one active span_id at any given time.

