feat(uptime): Add ability to use queues to manage parallelism #8

akshayutture-augment · 2025-11-19T23:19:56Z

Test 9

Replicated from ai-code-review-evaluation/sentry-greptile#9

One potential problem we have with batch processing is that any one slow item will clog up the whole batch. This pr implements a queueing method instead, where we keep N queues that each have their own workers. There's still a chance of individual items backlogging a queue, but we can try increased concurrency here to reduce the chances of that happening

akshayutture-augment · 2025-11-20T00:40:20Z

@claude review this PR

claude · 2025-11-20T00:41:52Z

Claude finished @akshayutture-augment's task —— View job

PR Review Complete

I've reviewed this PR which adds a new "thread-queue-parallel" processing mode for managing parallelism in the uptime results consumer. Here's my comprehensive feedback:

Architecture & Design ✅

Strengths:

Well-designed queue-based parallelism: The FixedQueuePool approach using consistent hashing ensures messages for the same subscription are processed in order while allowing concurrent processing across different subscriptions
Clean separation of concerns: OffsetTracker, FixedQueuePool, and SimpleQueueProcessingStrategy are well-encapsulated with clear responsibilities
Proper offset management: The offset tracking ensures Kafka commits only happen when all prior offsets are processed, preventing data loss

Observations:

This adds a 4th parallelism mode to the existing 3 (serial, parallel, batched-parallel), increasing complexity
The pattern follows Arroyo's stream processing architecture correctly

Implementation Review

1. OffsetTracker (src/sentry/remote_subscriptions/consumers/queue_consumer.py:34-106)

Good:

Thread-safe per-partition locking ✅
Contiguous offset tracking prevents committing past gaps ✅
Cleanup of processed offsets in mark_committed() ✅

Issues:

Lock creation pattern (line 49-54): Uses setdefault() after checking get(), but this isn't atomic. Multiple threads could create different locks for the same partition.

# Current code
lock = self.partition_locks.get(partition)
if lock:
    return lock
return self.partition_locks.setdefault(partition, threading.Lock())

Recommendation: Use a single global lock for partition lock creation, or ensure atomic creation:

def _get_partition_lock(self, partition: Partition) -> threading.Lock:
    if partition not in self.partition_locks:
        with self._creation_lock:  # Add class-level lock
            if partition not in self.partition_locks:
                self.partition_locks[partition] = threading.Lock()
    return self.partition_locks[partition]

2. OrderedQueueWorker (lines 108-157)

Issues:

Line 132: queue.ShutDown exception handling: This is not a standard Python exception. The code catches it but queue.Queue in stdlib doesn't raise ShutDown. This appears to be custom but isn't defined anywhere visible.
- If using Python 3.13's queue.ShutDown, ensure minimum version compatibility
- Consider handling queue.Empty with timeout for graceful shutdown
Line 125: self.shutdown flag: The shutdown mechanism relies on a boolean flag, but the queue operations are blocking. When shutdown=True, the worker may still be blocked on queue.get() indefinitely.
Recommendation: Use queue.get(timeout=1.0) and check shutdown flag in the loop:
```
while not self.shutdown:
    try:
        work_item = self.work_queue.get(timeout=1.0)
    except queue.Empty:
        continue
    except queue.ShutDown:
        break
```

3. FixedQueuePool (lines 159-244)

Good:

Consistent hashing for queue assignment ✅
Fixed pool size prevents dynamic reassignment ✅
Metrics tracking via get_stats() ✅

Issues:

Line 238: queue.shutdown(immediate=False): Again references a shutdown method that doesn't exist in stdlib queue.Queue. If this is Python 3.13+, add version check or documentation.
Resource cleanup: If shutdown() is called while items are still in queues, those items won't be processed. Consider waiting for empty queues first or documenting this behavior.

4. SimpleQueueProcessingStrategy (lines 246-346)

Issues:

Line 276: Commit loop timing: Uses self.shutdown_event.wait(1.0) which commits every 1 second. This is reasonable but not configurable.
Error handling (lines 317-325): When decoder or grouping fails, the offset is immediately marked as complete. This could hide persistent decoding issues. Consider adding metrics/logging for these failures.

Line 297: Type assertion: assert isinstance(message.value, BrokerValue) - Assertions can be disabled with -O flag. Use proper error handling:

if not isinstance(message.value, BrokerValue):
    logger.error("Expected BrokerValue, got %s", type(message.value))
    return

5. result_consumer.py Changes

Issues:

Line 22-24: The import is fine, but adds coupling between remote_subscriptions and its consumers module. This is acceptable given the design.

Lines 96-99: The thread_queue_parallel boolean flag alongside batched_parallel and parallel is getting messy. Consider using an enum:

class ProcessingMode(Enum):
    SERIAL = "serial"
    PARALLEL = "parallel"
    BATCHED_PARALLEL = "batched-parallel"
    THREAD_QUEUE_PARALLEL = "thread-queue-parallel"

Line 136: Default num_queues=20 - This magic number should be documented. Why 20? Is it configurable?
Line 184: Shutdown of queue_pool only in specific condition. Ensure this doesn't leak resources in other modes.
Lines 249-252: The commit offset adjustment (offset + 1) is correct but deserves a comment explaining Kafka's "next offset to read" semantics.

6. consumers/init.py Changes (lines 121-122)

Good:

Adds thread-queue-parallel to the uptime options ✅
Follows existing pattern for mode configuration ✅

Test Coverage 🎯

Excellent test coverage! The test file adds 421 lines including:

Unit Tests (test_queue_consumer.py)

✅ TestOffsetTracker: Comprehensive offset tracking tests
✅ TestFixedQueuePool: Queue assignment, ordering, concurrency tests
✅ TestSimpleQueueProcessingStrategy: Message processing, commits, error handling
✅ Integration test with ResultsStrategyFactory

Integration Tests (test_results_consumer.py)

✅ test_thread_queue_parallel: Basic functionality (line 1689)
✅ test_thread_queue_parallel_preserves_order: Order guarantees (line 1733)
✅ test_thread_queue_parallel_concurrent_subscriptions: Concurrency (line 1780)
✅ test_thread_queue_parallel_offset_commit: Offset management (line 1824)
✅ test_thread_queue_parallel_error_handling: Error resilience (line 1874)
✅ test_thread_queue_parallel_offset_gaps: Gap handling (line 1926)
✅ test_thread_queue_parallel_graceful_shutdown: Shutdown behavior (line 1987)
✅ ProcessResultThreadQueueParallelKafkaTest: Real Kafka integration test! (line 2012)

Minor issues:

Line 1717-1724: Uses busy-wait with time.sleep(0.1) - acceptable for tests but consider using proper synchronization primitives
Line 1922: Comment says "assert len(committed_offsets) == 0 or test_partition not in committed_offsets" but with error handling, we might want to verify the error was actually logged

Performance & Scalability Considerations

Strengths:

Fixed pool size prevents thread explosion ✅
Consistent hashing distributes load evenly ✅
Per-partition locking minimizes contention ✅

Concerns:

Memory growth: If processing is slow, queues can fill up. There's no backpressure mechanism or max queue size limit
- Recommendation: Add max queue size to FixedQueuePool initialization
Commit frequency: 1-second commit interval may be too frequent under high load or too slow for low latency requirements
- Recommendation: Make commit interval configurable

Security & Error Handling

Good:

Proper exception handling in worker threads ✅
Sentry SDK transaction tracking for observability ✅
Metrics for queue depths and processing ✅

Concerns:

No rate limiting: Malicious or buggy producers could flood queues
Assertion usage: Line 297 uses assertions which can be disabled
Exception swallowing: Line 240, 290 - exceptions during shutdown are logged but not propagated

Code Quality & Best Practices

Positives:

✅ Type hints throughout
✅ Docstrings on classes
✅ Follows Arroyo's processing strategy patterns
✅ Comprehensive metrics instrumentation
✅ Proper use of generics for reusability

Issues:

Magic numbers: Line 136 (20 queues), line 243 (5s timeout), line 276 (1s commit)
Missing docstrings: Many methods lack docstrings (e.g., submit(), poll(), close())
Inconsistent naming: thread-queue-parallel (kebab-case) vs method names (snake_case) is fine but the mode name is a mouthful

Recommendations Summary

Critical 🔴

Fix lock creation race condition in OffsetTracker._get_partition_lock() (line 49-54)
Handle shutdown properly in OrderedQueueWorker.run() - use timeout on queue.get() (line 131)
Replace assertions with proper error handling (line 297)

Important 🟡

Add max queue size limits to prevent unbounded memory growth
Document queue.ShutDown exception - clarify Python version requirements
Make commit interval configurable in SimpleQueueProcessingStrategy
Add docstrings to public methods

Nice to Have 🟢

Use Enum for processing modes instead of multiple boolean flags
Add metrics for decoder/grouping failures
Document magic number for default queue count (20)

Verdict

This is a solid implementation that adds valuable functionality for managing parallelism. The core algorithm is sound, test coverage is excellent, and it follows established patterns.

However, there are some thread-safety and resource management issues that should be addressed before merging:

Lock creation race condition
Worker shutdown mechanism
Unbounded queue growth

The changes are well-tested and the new mode fills a legitimate need for parallelism with ordering guarantees. With the critical issues fixed, this would be ready to merge.

Estimated Risk: Medium - The thread-safety issues could cause hard-to-debug problems in production under high concurrency.

Tasks

Read and analyze changed files
Review queue_consumer.py implementation
Review result_consumer.py changes
Review test coverage and quality
Provide comprehensive feedback

github-actions bot added the Scope: Backend label Nov 19, 2025

claude bot mentioned this pull request Nov 20, 2025

Span Buffer Multiprocess Enhancement with Health Monitoring #11

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(uptime): Add ability to use queues to manage parallelism #8

feat(uptime): Add ability to use queues to manage parallelism #8

Uh oh!

akshayutture-augment commented Nov 19, 2025

Uh oh!

akshayutture-augment commented Nov 20, 2025

Uh oh!

claude bot commented Nov 20, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat(uptime): Add ability to use queues to manage parallelism #8

Are you sure you want to change the base?

feat(uptime): Add ability to use queues to manage parallelism #8

Uh oh!

Conversation

akshayutture-augment commented Nov 19, 2025

Uh oh!

akshayutture-augment commented Nov 20, 2025

Uh oh!

claude bot commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Complete

Architecture & Design ✅

Implementation Review

1. OffsetTracker (src/sentry/remote_subscriptions/consumers/queue_consumer.py:34-106)

2. OrderedQueueWorker (lines 108-157)

3. FixedQueuePool (lines 159-244)

4. SimpleQueueProcessingStrategy (lines 246-346)

5. result_consumer.py Changes

6. consumers/init.py Changes (lines 121-122)

Test Coverage 🎯

Unit Tests (test_queue_consumer.py)

Integration Tests (test_results_consumer.py)

Performance & Scalability Considerations

Security & Error Handling

Code Quality & Best Practices

Recommendations Summary

Critical 🔴

Important 🟡

Nice to Have 🟢

Verdict

Tasks

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude bot commented Nov 20, 2025 •

edited

Loading