
ThreadTag Conflicts Cause Data Contamination #249

@luhaoqi

Description


Problem Description

When multiple ThreadTag rules with the same thread_id and key exist in the Ruleset, they are all applied to StackTrace metadata during profiling. However, during pprof encoding, a HashMap is used to store labels, which silently overwrites duplicate keys, causing profiling data from one span to be incorrectly labeled with another span's tag.

Root Cause

The issue stems from an inconsistency between data collection and encoding:

1. Data Collection (src/backend/ruleset.rs):

  • Uses HashSet<Rule> where Rule is the complete tuple (thread_id, Tag{key, value})
  • Allows multiple rules like:
    ThreadTag(100, Tag{key:"span_id", value:"span_a"})
    ThreadTag(100, Tag{key:"span_id", value:"span_b"})
  • Both rules match the same thread_id and are applied to metadata

2. Pprof Encoding (src/encode/pprof.rs):

let mut labels = HashMap::new();
for l in &stacktrace.metadata.tags {
    let k = b.add_string(&l.key);
    let v = b.add_string(&l.value);
    labels.insert(k, v);  // ⚠️ Duplicate keys are silently overwritten
}
for l in &report.metadata.tags {
    let k = b.add_string(&l.key);
    let v = b.add_string(&l.value);
    labels.insert(k, v);  // ⚠️ Duplicate keys are silently overwritten
}

The HashMap only retains the last value inserted for each key, discarding earlier values without any warning. This is the fundamental issue: the pprof label output cannot carry duplicate keys with different values, so whatever the Ruleset collected beyond one value per key is silently dropped at encode time. The sketch below reproduces both halves of the mismatch.
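
A minimal, self-contained sketch of the mismatch. The Rule and Tag definitions here are simplified stand-ins for the crate's actual types, not the real declarations:

use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Tag { key: String, value: String }

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Rule { ThreadTag(u64, Tag) }

fn main() {
    // Collection side: HashSet hashes the whole (thread_id, Tag) tuple,
    // so two rules differing only in value happily coexist.
    let mut rules = HashSet::new();
    rules.insert(Rule::ThreadTag(100, Tag { key: "span_id".into(), value: "span_a".into() }));
    rules.insert(Rule::ThreadTag(100, Tag { key: "span_id".into(), value: "span_b".into() }));
    assert_eq!(rules.len(), 2); // both rules survive

    // Encoding side: HashMap keys on the label name alone, so one value wins.
    let mut labels = HashMap::new();
    for rule in &rules {
        let Rule::ThreadTag(_, tag) = rule;
        labels.insert(tag.key.clone(), tag.value.clone()); // later insert overwrites the earlier one
    }
    assert_eq!(labels.len(), 1); // span_a or span_b was silently dropped, depending on iteration order
}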

Steps to Reproduce

import pyroscope
import threading

pyroscope.configure(application_name="test", server_address="http://localhost:4040")
thread_id = threading.get_ident()

# Step 1: Add first tag
pyroscope.add_thread_tag(thread_id, "span_id", "span_a")
# ... profiling data collected with span_a ...

# Step 2: Forget to remove, or remove fails
# pyroscope.remove_thread_tag(thread_id, "span_id", "span_a")  # Skipped

# Step 3: Add new tag
pyroscope.add_thread_tag(thread_id, "span_id", "span_b")
# ... more profiling data collected ...

# Result: Data from span_a period is labeled as span_b in pprof output

Expected Behavior

Each (thread_id, key) combination should have at most one value at any given time to match the pprof encoding constraints.

Proposed Solutions

Solution 1: Use HashMap for ThreadTags (Recommended)

Replace HashSet<Rule> with a structure that enforces uniqueness by (thread_id, key):

#[derive(Debug, Default, Clone)]
pub struct Ruleset {
    pub global_tags: Arc<Mutex<HashSet<Tag>>>,
    pub thread_tags: Arc<Mutex<HashMap<(u64, String), Tag>>>,
}

Benefits:

  • O(1) automatic deduplication - inherently prevents conflicts
  • Natural key-value semantics - matches how tags are actually used
  • Consistency with pprof encoding - aligns data structure with output format
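
A hedged sketch of how add/remove could look on top of this shape. The method names and the remove-by-key signature are assumptions for illustration, not the crate's current API:

use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Tag { pub key: String, pub value: String }

#[derive(Debug, Default, Clone)]
pub struct Ruleset {
    pub global_tags: Arc<Mutex<HashSet<Tag>>>,
    pub thread_tags: Arc<Mutex<HashMap<(u64, String), Tag>>>,
}

impl Ruleset {
    // Inserting for an existing (thread_id, key) replaces the old value in O(1)
    // and hands the displaced Tag back so the caller can detect the swap.
    pub fn add_thread_tag(&self, thread_id: u64, tag: Tag) -> Option<Tag> {
        let mut tags = self.thread_tags.lock().unwrap();
        tags.insert((thread_id, tag.key.clone()), tag)
    }

    // Removal needs only the key; callers no longer have to remember the old value.
    pub fn remove_thread_tag(&self, thread_id: u64, key: &str) -> Option<Tag> {
        let mut tags = self.thread_tags.lock().unwrap();
        tags.remove(&(thread_id, key.to_string()))
    }
}

Returning the displaced Tag from add_thread_tag would also give callers a place to log the replacement that today happens silently.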

Solution 2: Auto-remove Conflicts in add_rule

Modify add_rule to remove conflicting entries before insertion:

pub fn add_rule(&self, rule: Rule) -> Result<bool> {
    let mut rules = self.rules.lock()?;

    // Before inserting a ThreadTag, evict any existing rule that targets the
    // same (thread_id, key); rules for other threads or keys are kept.
    if let Rule::ThreadTag(thread_id, ref tag) = rule {
        rules.retain(|existing_rule| {
            match existing_rule {
                Rule::ThreadTag(existing_thread_id, existing_tag) => {
                    existing_thread_id != &thread_id || existing_tag.key != tag.key
                }
                _ => true, // non-ThreadTag rules are untouched
            }
        });
    }

    let insert = rules.insert(rule);
    Ok(insert)
}

Time complexity: O(n) per call, where n is the number of rules in the set.
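
If this route is taken, a regression test along these lines would pin the behavior down. It assumes the Rule/Tag shapes above and that the test can inspect ruleset.rules directly; those names are illustrative:

#[test]
fn second_tag_with_same_key_evicts_first() {
    let ruleset = Ruleset::default();
    ruleset.add_rule(Rule::ThreadTag(100, Tag { key: "span_id".into(), value: "span_a".into() })).unwrap();
    ruleset.add_rule(Rule::ThreadTag(100, Tag { key: "span_id".into(), value: "span_b".into() })).unwrap();

    // Only one rule remains for (100, "span_id"), and it carries the latest value.
    let rules = ruleset.rules.lock().unwrap();
    assert_eq!(rules.len(), 1);
}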

Question for Maintainers

I'm curious about the original design decision: Why use HashSet<Rule> (where Rule contains the complete Tag) instead of a structure like HashMap<(u64, String), String> for thread tags?

Was there a specific use case that required allowing multiple values for the same (thread_id, key) combination? Understanding the rationale would help ensure any fix doesn't break intended functionality.

Impact

  • Severity: High - Causes incorrect attribution of profiling data
  • Affected scenarios: Thread reuse, OpenTelemetry integration, long-running threads with sequential operations
  • Workaround: Manually call remove_thread_tag before every add_thread_tag. This is error-prone because removal requires the exact previous value (a Rule hashes the complete Tag), so every caller must track what it last set; a sketch follows below.
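
At the Ruleset level the workaround amounts to the following dance. This is a hypothetical helper: remove_rule is assumed to exist as the counterpart of add_rule, and the Result type follows the snippet above:

// Hypothetical wrapper around the workaround. Its fragility is the point:
// the caller must hold the exact Tag it previously added; forgetting it, or
// racing with another thread, leaves the stale rule behind.
fn retag_span(ruleset: &Ruleset, thread_id: u64, old_value: Option<String>, new_value: String) -> Result<()> {
    if let Some(old) = old_value {
        // Removal needs the complete previous Tag, not just the key.
        ruleset.remove_rule(Rule::ThreadTag(thread_id, Tag { key: "span_id".into(), value: old }))?;
    }
    ruleset.add_rule(Rule::ThreadTag(thread_id, Tag { key: "span_id".into(), value: new_value }))?;
    Ok(())
}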

Environment

  • pyroscope-rs version: tag 'python-0.8.11', though I believe the logic is the same in the latest version
  • Integration: OpenTelemetry SpanProcessor (Python FFI)
  • Related files:
    • src/backend/ruleset.rs
    • src/encode/pprof.rs
    • pyroscope_ffi/python/lib/src/lib.rs
