
@akshayutture-augment (Contributor)

yuvmen and others added 2 commits July 25, 2025 09:48
…(#94376)

Part of the Error Upsampling project:
https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link

The events-stats API will now check whether all projects in the query are allowlisted for upsampling and, if so, convert the count query to a sum over `sample_weight` in Snuba. This is done by defining a new SnQL function, `upsampled_count()`.

I noticed there are also eps() and epm() functions in use in this endpoint. I considered (and even worked on) swapping eps() and epm() as well, since for correctness they should probably also use `sample_weight` rather than a naive count. But this caused some complications, and since they are only used by specific dashboard widgets and are not available in Discover, I decided to defer changing them until we know it is needed.
- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated
expensive allowlist lookups while maintaining data consistency.
@akshayutture-augment (Contributor, Author)

@claude review this PR

@claude

claude bot commented Nov 20, 2025

Claude finished @akshayutture-augment's task


Code Review - Error Upsampling with Performance Optimizations

I've completed my review of this PR. Overall, the implementation is solid with good test coverage, but I've identified several important issues that should be addressed.


✅ Strengths

  1. Good test coverage - Comprehensive unit tests and integration tests covering various scenarios
  2. Clear separation of concerns - Helper functions are well-organized
  3. Performance optimization - Caching strategy is well-documented
  4. Safe rollout - Feature is gated behind allowlist configuration

🔴 Critical Issues

1. Race Condition in Caching Logic (src/sentry/api/helpers/error_upsampling.py:27-40)

The caching implementation has a critical race condition that can cause inconsistent query results:

```python
# Lines 27-40
cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"

# Check cache first for performance optimization
cached_result = cache.get(cache_key)
if cached_result is not None:
    return cached_result and _should_apply_sample_weight_transform(dataset, request)

# Cache miss - perform fresh allowlist check
is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)

# Cache for 60 seconds to improve performance during traffic spikes
cache.set(cache_key, is_eligible, 60)

return is_eligible and _should_apply_sample_weight_transform(dataset, request)
```

Problem: The function caches is_eligible but returns is_eligible and _should_apply_sample_weight_transform(...). Only the project-eligibility half of the decision is cached, and the cache key includes neither the dataset nor the request query, even though the final result depends on both:

  • Cached: project eligibility, keyed only by organization and project IDs
  • Evaluated per call: the dataset/request check

If the dataset or request query changes between calls with the same cache key, the cached half of the decision no longer matches the inputs the final result depends on, which makes the control flow easy to get wrong and the results hard to reason about.

Solution: Cache the full eligibility decision:

```python
cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}:{dataset.__name__}:{request.GET.get('query', '')}"

cached_result = cache.get(cache_key)
if cached_result is not None:
    return cached_result

is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)
should_transform = is_eligible and _should_apply_sample_weight_transform(dataset, request)

cache.set(cache_key, should_transform, 60)
return should_transform
```
2. Potential Data Inconsistency (src/sentry/api/helpers/error_upsampling.py:43-64)

The comment on lines 50-53 acknowledges a serious issue but doesn't address it:

```python
# NOTE: This function reads the allowlist configuration fresh each time,
# which means it can return different results between calls if the
# configuration changes during request processing. This is intentional
# to ensure we always have the latest configuration state.
```

Problem: Combined with the cache, this creates a window where:

  1. Request A caches "eligible=True" for projects [1,2]
  2. Config changes to remove project 2
  3. Request B uses cached "eligible=True"
  4. Query runs with upsampling but project 2 may not have sample_weight column populated consistently

Solution: Either:

  • Include allowlist version in cache key, OR
  • Document the acceptable inconsistency window and monitoring strategy
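A versioned cache key could look like the sketch below. Here `allowlist_version` is a hypothetical counter (or config hash) assumed to be bumped on every allowlist update, so entries cached under an old version are simply never read again:

```python
def build_upsampling_cache_key(
    org_id: int, project_ids: list[int], allowlist_version: int
) -> str:
    # Sorting makes the key stable regardless of the order project IDs
    # arrive in; the version suffix invalidates all prior entries whenever
    # the allowlist configuration changes.
    ids = ",".join(map(str, sorted(project_ids)))
    return f"error_upsampling_eligible:{org_id}:{ids}:v{allowlist_version}"
```

This sidesteps explicit invalidation entirely: stale entries just age out of the cache under their old version suffix.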
3. Missing Null Safety (src/sentry/api/helpers/error_upsampling.py:87-100)

```python
def transform_query_columns_for_error_upsampling(
    query_columns: Sequence[str],
) -> list[str]:
    # ...
    # Note: We rely on the database schema to ensure sample_weight exists for all
    # events in allowlisted projects, so no additional null checks are needed here.
```

Problem: The comment claims schema guarantees, but:

  • Events may have been ingested before upsampling was enabled
  • Migration timing issues
  • Partial rollout scenarios

Impact: Queries will break with NULL errors if sample_weight is missing.

Solution: Add defensive null handling in the SQL function definition in discover.py:1046-1050:

```python
snql_aggregate=lambda args, alias: Function(
    "toInt64",
    [Function("sum", [Function("coalesce", [Column("sample_weight"), 1])])],
    alias,
),
```

⚠️ Medium Priority Issues

4. Cache Key Collision Risk (src/sentry/api/helpers/error_upsampling.py:27)

```python
cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
```

Problem: Python's hash() can have collisions. Different project ID combinations might hash to the same value.

Solution: Use a deterministic hash or just join the IDs:

```python
cache_key = f"error_upsampling_eligible:{organization.id}:{','.join(map(str, sorted(snuba_params.project_ids)))}"
```

5. Inconsistent Upsampling Application (src/sentry/api/endpoints/organization_events_stats.py:218-296)

The upsampling check happens early but transformation happens late in 3 different places (lines 233, 277, 296). This creates maintenance burden and potential for bugs if one path is missed.

Recommendation: Refactor to apply transformation in one place or use a decorator pattern.
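One way the decorator pattern could look is sketched below. All names (`with_error_upsampling`, the `upsampling_eligible` keyword, the stand-in `_transform`) are illustrative, not the PR's actual API:

```python
from functools import wraps
from typing import Callable, Sequence


def _transform(columns: Sequence[str]) -> list[str]:
    # Stand-in for transform_query_columns_for_error_upsampling():
    # rewrite the naive count into the upsampled aggregate.
    return ["upsampled_count() as count" if c == "count()" else c for c in columns]


def with_error_upsampling(func: Callable) -> Callable:
    """Apply the column transformation in exactly one place, so every
    code path through the wrapped stats function picks it up."""

    @wraps(func)
    def wrapper(query_columns, *args, upsampling_eligible=False, **kwargs):
        if upsampling_eligible:
            query_columns = _transform(query_columns)
        return func(query_columns, *args, **kwargs)

    return wrapper


@with_error_upsampling
def get_event_stats(query_columns):
    # Placeholder for the real get_event_stats closure in the endpoint.
    return query_columns
```

With this shape, forgetting one of the three call sites becomes impossible: eligibility is decided once and threaded through a single keyword.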

6. Test Data Quality Issue (tests/snuba/api/endpoints/test_organization_events_stats.py:3626-3627)

```python
assert data[0][1][0]["count"] == 10  # First bucket has 1 event
assert data[1][1][0]["count"] == 10  # Second bucket has 1 event
```

Problem: Comments say "1 event" but assertions check for count == 10. This is confusing and suggests the test might not be validating what you think it is.

Expected: With client_sample_rate: 0.1, 1 event should upsample to ~10. But are you actually testing upsampling, or is this hardcoded?

Action: Verify the test is correctly validating upsampled counts and fix the comments.


💡 Minor Issues & Code Quality

7. Empty Directory Added (sentry-repo/)

An empty sentry-repo/ directory was added. This appears unintentional and should be removed.

8. Type Safety (src/sentry/api/helpers/error_upsampling.py:103)

```python
def _should_apply_sample_weight_transform(dataset: Any, request: Request) -> bool:
```

Issue: dataset: Any loses type safety.

Recommendation: Use a more specific type or a Protocol; if datasets are passed around as modules, ModuleType works:

```python
from types import ModuleType

def _should_apply_sample_weight_transform(dataset: ModuleType, request: Request) -> bool:
```
9. Query Parsing Fragility (src/sentry/api/helpers/error_upsampling.py:135-138)

```python
query = request.GET.get("query", "").lower()
if "event.type:error" in query:
    return True
```

Problem: Simple substring matching can have false positives:

  • event.type:error_count would match
  • !event.type:error (negation) would match incorrectly

Recommendation: Use proper query parsing or more robust pattern matching.
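A lightweight middle ground, short of a full query parser, is to match the filter as a standalone token. This is only a sketch of the idea; the real fix should probably reuse Sentry's existing search-query parser:

```python
import re

# Match `event.type:error` as a whole filter token: reject a leading `!`
# (negation), a leading word/dot character (e.g. `stack.event.type:...`),
# and a longer value (e.g. `event.type:error_count`).
_ERROR_TYPE_FILTER = re.compile(r"(?<![!\w.])event\.type:error(?![\w-])", re.IGNORECASE)


def query_targets_error_events(query: str) -> bool:
    return bool(_ERROR_TYPE_FILTER.search(query))
```

The lookbehind/lookahead assertions reject exactly the two false-positive classes called out above while still matching the filter anywhere in a larger query string.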

10. Magic Numbers (src/sentry/api/helpers/error_upsampling.py:38)

```python
cache.set(cache_key, is_eligible, 60)  # 60 seconds
```

Recommendation: Extract to a named constant:

```python
UPSAMPLING_CACHE_TTL_SECONDS = 60
cache.set(cache_key, is_eligible, UPSAMPLING_CACHE_TTL_SECONDS)
```

📋 Testing Gaps

  1. No tests for cache invalidation - The invalidate_upsampling_cache function is never tested
  2. No tests for concurrent requests - Race condition scenarios are not covered
  3. No tests for mixed dataset queries - What happens when discover dataset has both errors and transactions?
  4. No performance tests - The PR title mentions "performance optimizations" but there are no benchmarks
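A minimal shape for the missing invalidation test is sketched below. The dict stands in for Django's cache backend, and the helper's signature is assumed (the PR's actual `invalidate_upsampling_cache` may differ):

```python
# In-memory stand-in for the cache backend, for illustration only.
cache: dict[str, bool] = {}


def invalidate_upsampling_cache(org_id: int, project_ids: list[int]) -> None:
    # Hypothetical implementation: drop the eligibility entry so the next
    # eligibility check misses the cache and re-reads the allowlist.
    key = f"error_upsampling_eligible:{org_id}:{','.join(map(str, sorted(project_ids)))}"
    cache.pop(key, None)


def test_invalidation_forces_fresh_lookup() -> None:
    key = "error_upsampling_eligible:1:2,3"
    cache[key] = True                      # warm the cache
    invalidate_upsampling_cache(1, [3, 2])
    assert key not in cache                # next check must hit the allowlist
```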

📚 Documentation

Missing:

  • How to monitor for cache inconsistencies
  • Rollback procedure if upsampling causes issues
  • Expected performance improvements (quantified)
  • Migration guide for adding sample_weight column to existing events

🎯 Recommendations for Next Steps

  1. Fix the race condition (Critical) - see issue #1 above
  2. Add null safety (Critical) - see issue #3 above
  3. Fix cache key collisions (High) - see issue #4 above
  4. Verify test assertions (High) - see issue #6 above
  5. Remove empty directory (Quick win)
  6. Add cache invalidation tests
  7. Document monitoring strategy

Summary

This is a solid implementation with good structure and test coverage. The main concerns are around cache consistency and null safety. The race condition in particular could lead to incorrect query results or runtime errors. I recommend addressing the critical issues before merging.

The performance optimization approach is sound, but consider adding observability (metrics, logging) to track:

  • Cache hit rate
  • Upsampling transformation frequency
  • Query performance improvements
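For the cache hit rate in particular, even a tiny in-process counter makes the metric concrete; in Sentry this would be emitted through the project's real metrics backend rather than held in memory (sketch only):

```python
from collections import Counter


class CacheStats:
    """Minimal hit/miss tracker, illustrating the metric to emit on every
    eligibility-cache lookup."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, hit: bool) -> None:
        self.counts["hit" if hit else "miss"] += 1

    def hit_rate(self) -> float:
        total = sum(self.counts.values())
        return self.counts["hit"] / total if total else 0.0
```

Tracking this alongside transformation frequency would also quantify the performance claim in the PR title.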
