feat(seer): Add Explorer service map extraction pipeline #108379
shruthilayaj merged 18 commits into master
Conversation
Implements a Celery task that extracts service dependency graphs from distributed traces and sends them to Seer for hierarchical retrieval in Explorer chat.

**Key features:**
- Queries top transactions by total duration using the EAP RPC interface
- Extracts cross-project dependencies from segment spans
- Classifies service roles (frontend, core backend, isolated) using graph analysis
- Rate limiting and batching for resource protection
- Comprehensive test coverage (40 tests)

**Implementation details:**
- Uses Spans.run_table_query for all Snuba queries (EAP RPC)
- While-loop optimization to ensure all transactions are represented
- Batches parent span resolution (500 per batch)
- Includes project slugs for readability
- Converts role dict keys to strings for orjson compatibility

**Dependencies:**
- Added networkx>=3.0 for graph-based role classification

**Options added:**
- explorer.service_map.enable
- explorer.service_map.killswitch
- explorer.service_map.allowed_organizations
- explorer.service_map.max_edges (default: 5000)
- explorer.service_map.rate_limit_seconds (default: 3600)
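The "500 per batch" parent-span resolution above is just fixed-size chunking. A minimal sketch (the helper name `batched` and the ID format are illustrative, not the PR's actual code):

```python
def batched(items: list, size: int = 500):
    """Yield fixed-size chunks so each parent-span lookup query stays bounded."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


# Hypothetical span IDs; 1200 ids resolve in batches of 500, 500, 200.
span_ids = [f"span-{n}" for n in range(1200)]
batches = list(batched(span_ids))
```

Each batch would then be fed to a single Snuba query rather than issuing one lookup per span.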
…tion

NetworkX was only used for basic in-degree/out-degree counting in the service classification logic. Replaced it with simple native Python using defaultdict, eliminating the 17MB dependency with no loss of functionality.

Classification logic remains identical:
- Counts incoming/outgoing edges for each service
- Computes average degrees
- Classifies as frontend/core_backend/isolated based on thresholds
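The defaultdict replacement for NetworkX degree counting can be sketched like this (function and field names are illustrative, not the PR's actual code):

```python
from collections import defaultdict


def count_degrees(edges: list[dict]) -> tuple[dict[int, int], dict[int, int]]:
    """Count in/out degree per service without networkx."""
    in_degree: dict[int, int] = defaultdict(int)
    out_degree: dict[int, int] = defaultdict(int)
    for edge in edges:
        out_degree[edge["source"]] += 1
        in_degree[edge["target"]] += 1
    return in_degree, out_degree


edges = [
    {"source": 1, "target": 2},
    {"source": 1, "target": 3},
    {"source": 2, "target": 3},
]
in_deg, out_deg = count_degrees(edges)
# out_deg[1] == 2, in_deg[3] == 2
```

Since only degree counts are needed, building a full graph object buys nothing over two counters.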
These options were added by mistake and are not used anywhere in the codebase. The explorer.service_map options are retained as they are actively used by the service map pipeline.
Adds comprehensive integration tests for the Explorer service map feature that use real Snuba queries instead of mocks.

These tests verify:
- Cross-project dependency extraction (A→B, A→B→C, fan-in, circular)
- Edge aggregation and filtering (same-project, missing parents)
- Service role classification (frontend, core_backend, isolated)
- Complete end-to-end workflow with Seer payload validation

The integration tests successfully identified and fixed a real bug where the query was ordering by `timestamp` without including it in the selected columns, causing an InvalidSearchQuery error.

Tests use SnubaTestCase and SpanTestCase to create real span data with proper parent-child relationships across projects, validating that the complete pipeline works correctly with actual Snuba storage and queries.
…ency test

Removes redundant mock-based test classes that are now fully covered by integration tests:
- TestQueryServiceDependencies (10 tests)
- TestClassifyServiceRoles (5 tests)

Fixes test_circular_dependencies by using unique transaction names per trace to avoid deduplication issues. The implementation's deduplication logic keeps only one segment per transaction name, so the two circular traces now use distinct transaction names:
- Trace 1: /service-a/endpoint1 → /service-b/endpoint1
- Trace 2: /service-b/endpoint2 → /service-a/endpoint2

This cleanup reduces the test suite from 51 tests to 36 tests while maintaining full coverage through the integration tests.
Fixes three mypy errors in explorer_service_map.py:

1. Fixed list-comprehension type narrowing for transactions: changed to an explicit loop to help mypy understand that None values are filtered out
2. Fixed the edges_by_pair dict type annotation: changed from dict[tuple[int, int], int] to dict[tuple[int, str | None, int, str | None], int] to match the actual 4-tuple keys storing (source_id, source_slug, target_id, target_slug)
3. Added cast() to the sort lambda to specify that x["count"] is always an int, resolving the type checker's inability to infer the specific dict value type
```python
# Dispatch tasks for each organization
for org_id in allowed_org_ids:
    try:
        build_service_map.apply_async(
```
Do a time staggered queue when productionizing
- Remove custom cache-based rate limiting from build_service_map; Snuba's policy system will handle this via the seer.explorer_service_map referrer
- Remove the explorer.service_map.rate_limit_seconds option
- Build SnubaParams once in build_service_map and pass it into _query_top_transactions and _query_service_dependencies, eliminating duplicate Organization and Project DB queries per invocation
…rvice map

The previous pipeline queried the top 100 transactions by total duration, then fetched one segment per transaction to find cross-project edges. This failed for large orgs: high-volume services dominated the top-100 list, and 100 segments were never enough to discover all edges across 200+ projects.

New approach:
- Phase 1: Org-wide paginated scan (up to max_segments rows, 100/page) with `is_transaction:true has:parent_span`. Tracks which projects appear.
- Phase 2: If any projects had zero Phase 1 representation, run a second paginated scan scoped to those uncovered projects without `has:parent_span`, giving low-traffic services a broad fallback.
- Phase 3: Batch-resolve parent span IDs in 100-span batches (unchanged logic) to determine source projects and build cross-project edges.

Deduplication is now by (child_project_id, parent_span_id) pair rather than by transaction name, so multiple calls to the same downstream service are counted correctly across different traces.

Also adds the `explorer.service_map.max_segments` option (default 10,000) to control the per-phase row budget.
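The new deduplication key can be sketched as follows (field names are illustrative, not the PR's actual code). Two traces hitting the same downstream transaction via different parent spans both survive, while an exact repeat is dropped:

```python
def dedupe_segments(segments: list[dict]) -> list[dict]:
    """Keep one segment per (child_project_id, parent_span_id) pair."""
    seen: set[tuple[int, str]] = set()
    unique: list[dict] = []
    for seg in segments:
        key = (seg["child_project_id"], seg["parent_span_id"])
        if key not in seen:
            seen.add(key)
            unique.append(seg)
    return unique


segments = [
    # Same transaction name, different parent spans: both kept under the new key.
    {"child_project_id": 2, "parent_span_id": "a1", "transaction": "/checkout"},
    {"child_project_id": 2, "parent_span_id": "b2", "transaction": "/checkout"},
    # Exact repeat of an already-seen pair: dropped.
    {"child_project_id": 2, "parent_span_id": "a1", "transaction": "/checkout"},
]
```

Keying on transaction name alone would have collapsed the first two rows into one, undercounting the edge.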
… payload

Each node now includes project_id, project_slug, role, callers, and callees instead of sending a flat roles dict and a separate edges list.
- Remove the redundant explorer.service_map.killswitch option (the enable flag is sufficient)
- Remove FLAG_ALLOW_EMPTY from the max_segments Int option
- Remove the unused timestamp column from the Phase 2 Snuba query
- Remove a leftover breakpoint() debug call
- Fix a duplicate @django_db_all decorator and an inline import in tests
```python
    )

    # TODO: Add endpoint in seer before making the actual request
```
Service map never sent to Seer
High Severity
_send_to_seer only serializes and logs the payload but never performs the HTTP request (and never signs it), so build_service_map cannot actually update Seer. The new tests also expect a POST to settings.SEER_AUTOFIX_URL and error handling, which won’t happen with the current stub.
Additional Locations (1)
```python
    default=[],
    type=Sequence,
    flags=FLAG_ALLOW_EMPTY | FLAG_AUTOMATOR_MODIFIABLE,
)
```
Mutable option default list can be shared
Low Severity
explorer.service_map.allowed_organizations uses default=[], which can create a shared mutable default across reads in the option manager. If any caller mutates the returned list, subsequent reads can observe the mutated “default” value unexpectedly.
```python
    return edges


def _classify_service_roles(edges: list[dict]) -> dict[int, str]:
```
I'm not so sure about this I might just remove it for now till I have a better idea
Okay, I'll update it to be more generic graph topology related and we can decide whether or not we want to use it in seer
Replace "core_backend", "frontend", "isolated" with "hub", "caller", "callee", "peripheral": terms that describe observed connectivity rather than inferred service type, which is unreliable with partial instrumentation. Also adds the previously missing callee branch (high in-degree, low out-degree).

Update tests to match: fix the broken TestSendToSeer tests (they now test payload construction rather than HTTP calls, which are stubbed), and rename the classification test methods to use the new terminology.
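A sketch of the topology-only scheme described above, comparing each node's in/out degree to the graph's average degree (thresholds and names are illustrative, not the PR's exact code):

```python
from collections import defaultdict


def classify(edges: list[tuple[int, int]]) -> dict[int, str]:
    """Assign hub/caller/callee/peripheral from in/out degree vs. average."""
    in_deg: dict[int, int] = defaultdict(int)
    out_deg: dict[int, int] = defaultdict(int)
    nodes: set[int] = set()
    for src, tgt in edges:
        out_deg[src] += 1
        in_deg[tgt] += 1
        nodes.update((src, tgt))
    avg = len(edges) / len(nodes) if nodes else 0.0
    roles: dict[int, str] = {}
    for n in nodes:
        hi_in, hi_out = in_deg[n] >= avg, out_deg[n] >= avg
        if hi_in and hi_out:
            roles[n] = "hub"
        elif hi_out:
            roles[n] = "caller"
        elif hi_in:
            roles[n] = "callee"  # the branch that was previously missing
        else:
            roles[n] = "peripheral"
    return roles


# 7 edges over 5 nodes, so avg degree is 1.4.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 1), (4, 2), (1, 5)]
roles = classify(edges)
# 1 is a hub, 4 a caller, 2 and 3 callees, 5 peripheral
```

Note that "peripheral" can only appear when avg > 1, i.e. when the graph has more edges than nodes; otherwise any connected node clears at least one threshold.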
```python
    type=Int,
    flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
    ttl=60 * 5,
)
```
Default max_segments is 500, not documented 10,000
Medium Severity
The explorer.service_map.max_segments option defaults to 500, but the PR description documents a default of 10,000. With only 500 segments scanned per org over a 24-hour window, the service map is likely to miss many cross-project dependencies, producing a far less useful graph than intended.
… 1 and 2

The Snuba query requires orderby columns to also appear in selected_columns. Phase 2 had timestamp removed during cleanup; Phase 1 was always missing it.
…eral node

The previous graph (2 edges, 3 nodes, avg=0.67) classified the leaf service as a callee, since its in-degree of 1 met the average threshold. Peripheral requires avg > 1 (more edges than nodes). The new graph uses 5 edges across 4 nodes (avg=1.25), so the weakly-connected service has both in- and out-degrees below average.
```python
logger = logging.getLogger("sentry.tasks.explorer_service_map")

# Seer endpoint path
SEER_SERVICE_MAP_PATH = "/v1/explorer/service-map/update"
```
Unused SEER_SERVICE_MAP_PATH constant is dead code
Low Severity
SEER_SERVICE_MAP_PATH is defined but never referenced anywhere in the codebase. While it's presumably intended for the future HTTP call in _send_to_seer, it currently contributes to dead code. The constant isn't used even in the stubbed _send_to_seer function.
```python
    try:
        organization = Organization.objects.get(id=organization_id)
        projects = list(Project.objects.filter(organization_id=organization_id))
```
Maybe let's only search for active projects.
```python
        return

    roles = _classify_service_roles(edges)
    nodes = _build_nodes(edges, roles)
```
can these two functions just be 1 function called build_graph and pass once over the edges and nodes? I think so but maybe I'm missing something
…one pass

Both functions walked the edges list to extract the same per-node data (degrees, slugs, caller/callee relationships). Merged into a single _build_nodes(edges) that collects everything in one traversal, computes average degrees, assigns roles, and returns the node list directly.
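The merged single-traversal shape can be sketched like this (field names mirror the node payload described earlier; this is an illustrative sketch, not the PR's actual implementation):

```python
from collections import defaultdict


def build_nodes(edges: list[dict]) -> list[dict]:
    """One pass over edges: degrees, slugs, caller/callee sets, then roles."""
    stats: dict[int, dict] = defaultdict(
        lambda: {"in": 0, "out": 0, "slug": None, "callers": set(), "callees": set()}
    )
    for e in edges:
        src, tgt = e["source_id"], e["target_id"]
        stats[src]["out"] += 1
        stats[tgt]["in"] += 1
        stats[src]["callees"].add(tgt)
        stats[tgt]["callers"].add(src)
        stats[src]["slug"] = stats[src]["slug"] or e.get("source_slug")
        stats[tgt]["slug"] = stats[tgt]["slug"] or e.get("target_slug")

    avg = len(edges) / len(stats) if stats else 0.0
    nodes = []
    for pid, s in stats.items():
        hi_in, hi_out = s["in"] >= avg, s["out"] >= avg
        role = ("hub" if hi_in and hi_out else "caller" if hi_out
                else "callee" if hi_in else "peripheral")
        nodes.append({
            "project_id": pid,
            "project_slug": s["slug"],
            "role": role,
            "callers": sorted(s["callers"]),
            "callees": sorted(s["callees"]),
        })
    return nodes


edges = [
    {"source_id": 1, "source_slug": "web", "target_id": 2, "target_slug": "api"},
    {"source_id": 2, "source_slug": "api", "target_id": 3, "target_slug": "worker"},
]
nodes = {n["project_id"]: n for n in build_nodes(edges)}
```

Everything the old pair of functions computed separately falls out of the single `stats` accumulator.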
```python
                child_project_id,
                segment.get("child_project_slug"),
            )
            edges_by_pair[edge_key] += 1
```
Edge aggregation key includes slugs causing potential count splitting
Low Severity
The edges_by_pair aggregation key is (parent_project_id, parent_project_slug, child_project_id, child_project_slug), including slug metadata alongside IDs. If the same project pair ever appears with a different slug value (e.g., one query returns None for project.slug), edge counts split across separate entries. Downstream in _build_nodes, each entry independently increments degree counters, inflating in/out degrees and potentially causing incorrect role classification.
Additional Locations (1)
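The count-splitting failure mode, plus one possible fix (IDs-only keys with slugs kept in a side table), can be demonstrated in isolation (illustrative code, not the PR's):

```python
from collections import defaultdict

edges_by_pair: dict[tuple, int] = defaultdict(int)
# The same logical 10 -> 20 project pair observed twice,
# once with an unresolved (None) slug:
edges_by_pair[(10, "backend", 20, "worker")] += 1
edges_by_pair[(10, None, 20, "worker")] += 1
# One logical edge is now split across two entries.
assert len(edges_by_pair) == 2

# Keying by IDs only keeps the count intact; slugs become metadata.
counts: dict[tuple[int, int], int] = defaultdict(int)
slugs: dict[int, str] = {}
for (src, src_slug, tgt, tgt_slug), n in edges_by_pair.items():
    counts[(src, tgt)] += n
    if src_slug:
        slugs.setdefault(src, src_slug)
    if tgt_slug:
        slugs.setdefault(tgt, tgt_slug)
assert counts[(10, 20)] == 2
```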
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
        try:
            build_service_map.apply_async(
                args=[org_id],
                countdown=0,
```
Redundant countdown parameter in task dispatch
Low Severity
The countdown=0 parameter in apply_async is redundant since 0 is the default value. Additionally, the PR discussion mentions implementing time-staggered dispatch when productionizing, suggesting this was a placeholder that should either be removed or replaced with actual staggering logic.
```python
        for (src_id, src_slug, tgt_id, tgt_slug), count in edges_by_pair.items()
    ]
    edges.sort(key=lambda x: cast(int, x["count"]), reverse=True)
    edges = edges[:max_edges]
```
Missing validation allows negative max_edges causing incorrect slicing
Medium Severity
The max_edges option value is used directly in list slicing without validation. If configured to a negative value, edges[:max_edges] uses Python negative indexing instead of limiting the list, causing incorrect results. For example, edges[:-1] would return all edges except the last one, rather than enforcing a maximum edge count.
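Python's negative-index slicing makes this concrete, along with a clamping fix (one possible remedy, not the PR's code):

```python
edges = [{"count": c} for c in (9, 7, 5, 3, 1)]

# A negative "max" silently drops items from the end instead of capping:
assert edges[:-1] == edges[0:4]


def truncate_edges(edges: list[dict], max_edges: int) -> list[dict]:
    """Clamp before slicing so a misconfigured option cannot go negative."""
    return edges[: max(0, max_edges)]


assert truncate_edges(edges, -1) == []
assert truncate_edges(edges, 2) == edges[:2]
```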
```python
    type=Int,
    flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
    ttl=60 * 5,
)
```
Inconsistent flag usage for max_segments option
Low Severity
The explorer.service_map.max_segments option uses FLAG_PRIORITIZE_DISK while the similar explorer.service_map.max_edges option does not. Both are runtime-tunable limits for the same feature, so they should use consistent flags. Other similar max/limit options in the codebase only use FLAG_AUTOMATOR_MODIFIABLE.
Adds a periodic Celery task that analyzes distributed traces to build a
service dependency graph for each organization and sends it to Seer.
This gives Explorer the context it needs to understand which services
call which others.
## How it works
`schedule_service_map_builds` runs daily and fans out a
build_service_map task per org from the allowlist.
`build_service_map` does the following for each org:
1. Two-pass Snuba scan to find cross-project segment relationships:
   - Phase 1: Org-wide query for transaction spans that have a parent_span
     (cross-project candidates). Tracks which projects appear.
   - Phase 2: If any projects had zero representation in Phase 1 (e.g.
     low-traffic services), runs a scoped fallback scan for those projects
     without the has:parent_span filter.
   - Phase 3: Batch-resolves all collected parent_span_ids back to their
     source projects to build directed edges.
2. Role classification using in/out degree analysis: services are
   classified as core_backend, frontend, or isolated relative to the
   average connectivity of the graph.
3. Sends to Seer via a signed POST to /v1/explorer/service-map/update
   (the HTTP call is currently stubbed pending the Seer endpoint being ready).
## Options
``` ┌────────────────────────────────────────────┬─────────┬───────────────────────────┐
│ Option │ Default │ Purpose │
├────────────────────────────────────────────┼─────────┼───────────────────────────┤
│ explorer.service_map.enable │ false │ Master on/off switch │
├────────────────────────────────────────────┼─────────┼───────────────────────────┤
│ explorer.service_map.allowed_organizations │ [] │ Allowlist of org IDs │
├────────────────────────────────────────────┼─────────┼───────────────────────────┤
│ explorer.service_map.max_segments │ 10,000 │ Max spans scanned per org │
├────────────────────────────────────────────┼─────────┼───────────────────────────┤
│ explorer.service_map.max_edges │ 5,000 │ Max edges sent to Seer │
└────────────────────────────────────────────┴─────────┴───────────────────────────┘
```
## Note
- The Seer HTTP call is commented out with a TODO; the rest of the pipeline is fully functional and can be validated end-to-end once the endpoint lands
- This task isn't actually called yet

