feat(seer): Add Explorer service map extraction pipeline #108379
shruthilayaj merged 18 commits into master
Conversation
Implements a Celery task that extracts service dependency graphs from distributed traces and sends them to Seer for hierarchical retrieval in Explorer chat.

**Key features:**
- Queries top transactions by total duration using the EAP RPC interface
- Extracts cross-project dependencies from segment spans
- Classifies service roles (frontend, core backend, isolated) using graph analysis
- Rate limiting and batching for resource protection
- Comprehensive test coverage (40 tests)

**Implementation details:**
- Uses Spans.run_table_query for all Snuba queries (EAP RPC)
- While-loop optimization to ensure all transactions are represented
- Batches parent span resolution (500 per batch)
- Includes project slugs for readability
- Converts role dict keys to strings for orjson compatibility

**Dependencies:**
- Added networkx>=3.0 for graph-based role classification

**Options added:**
- explorer.service_map.enable
- explorer.service_map.killswitch
- explorer.service_map.allowed_organizations
- explorer.service_map.max_edges (default: 5000)
- explorer.service_map.rate_limit_seconds (default: 3600)
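The "500 per batch" parent-span resolution above is just fixed-size chunking. A minimal sketch (the helper name `batched` and the ID format are illustrative, not the PR's actual code):

```python
def batched(items: list, size: int = 500):
    """Yield fixed-size chunks so each parent-span lookup query stays bounded."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


# Hypothetical span IDs; 1200 ids resolve in batches of 500, 500, 200.
span_ids = [f"span-{n}" for n in range(1200)]
batches = list(batched(span_ids))
```

Each batch would then be fed to a single Snuba query rather than issuing one lookup per span.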
…tion

NetworkX was only used for basic in-degree/out-degree counting in the service classification logic. Replaced it with simple native Python using defaultdict, eliminating the 17MB dependency with no loss of functionality.

Classification logic remains identical:
- Counts incoming/outgoing edges for each service
- Computes average degrees
- Classifies as frontend/core_backend/isolated based on thresholds
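The defaultdict replacement for NetworkX degree counting can be sketched like this (function and field names are illustrative, not the PR's actual code):

```python
from collections import defaultdict


def count_degrees(edges: list[dict]) -> tuple[dict[int, int], dict[int, int]]:
    """Count in/out degree per service without networkx."""
    in_degree: dict[int, int] = defaultdict(int)
    out_degree: dict[int, int] = defaultdict(int)
    for edge in edges:
        out_degree[edge["source"]] += 1
        in_degree[edge["target"]] += 1
    return in_degree, out_degree


edges = [
    {"source": 1, "target": 2},
    {"source": 1, "target": 3},
    {"source": 2, "target": 3},
]
in_deg, out_deg = count_degrees(edges)
# out_deg[1] == 2, in_deg[3] == 2
```

Since only degree counts are needed, building a full graph object buys nothing over two counters.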
These options were added by mistake and are not used anywhere in the codebase. The explorer.service_map options are retained as they are actively used by the service map pipeline.
Adds comprehensive integration tests for the Explorer service map feature that use real Snuba queries instead of mocks.

These tests verify:
- Cross-project dependency extraction (A→B, A→B→C, fan-in, circular)
- Edge aggregation and filtering (same-project, missing parents)
- Service role classification (frontend, core_backend, isolated)
- Complete end-to-end workflow with Seer payload validation

The integration tests successfully identified and fixed a real bug where the query was ordering by `timestamp` without including it in the selected columns, causing an InvalidSearchQuery error.

Tests use SnubaTestCase and SpanTestCase to create real span data with proper parent-child relationships across projects, validating that the complete pipeline works correctly with actual Snuba storage and queries.
…ency test

Removes redundant mock-based test classes that are now fully covered by integration tests:
- TestQueryServiceDependencies (10 tests)
- TestClassifyServiceRoles (5 tests)

Fixes test_circular_dependencies by using unique transaction names per trace to avoid deduplication issues. The implementation's deduplication logic keeps only one segment per transaction name, so the two circular traces now use distinct transaction names:
- Trace 1: /service-a/endpoint1 → /service-b/endpoint1
- Trace 2: /service-b/endpoint2 → /service-a/endpoint2

This cleanup reduces the test suite from 51 tests to 36 tests while maintaining full coverage through the integration tests.
Fixes three mypy errors in explorer_service_map.py:

1. Fixed list-comprehension type narrowing for transactions: changed to an explicit loop to help mypy understand that None values are filtered out
2. Fixed the edges_by_pair dict type annotation: changed from dict[tuple[int, int], int] to dict[tuple[int, str | None, int, str | None], int] to match the actual 4-tuple keys storing (source_id, source_slug, target_id, target_slug)
3. Added cast() to the sort lambda to specify that x["count"] is always an int, resolving the type checker's inability to infer the specific dict value type
```python
# Dispatch tasks for each organization
for org_id in allowed_org_ids:
    try:
        build_service_map.apply_async(
```
Do a time staggered queue when productionizing
- Remove custom cache-based rate limiting from build_service_map; Snuba's policy system will handle this via the seer.explorer_service_map referrer
- Remove the explorer.service_map.rate_limit_seconds option
- Build SnubaParams once in build_service_map and pass it into _query_top_transactions and _query_service_dependencies, eliminating duplicate Organization and Project DB queries per invocation
…rvice map

The previous pipeline queried the top 100 transactions by total duration, then fetched one segment per transaction to find cross-project edges. This failed for large orgs: high-volume services dominated the top-100 list, and 100 segments were never enough to discover all edges across 200+ projects.

New approach:
- Phase 1: Org-wide paginated scan (up to max_segments rows, 100/page) with `is_transaction:true has:parent_span`. Tracks which projects appear.
- Phase 2: If any projects had zero Phase 1 representation, run a second paginated scan scoped to those uncovered projects without `has:parent_span`, giving low-traffic services a broad fallback.
- Phase 3: Batch-resolve parent span IDs in 100-span batches (unchanged logic) to determine source projects and build cross-project edges.

Deduplication is now by (child_project_id, parent_span_id) pair rather than by transaction name, so multiple calls to the same downstream service are counted correctly across different traces.

Also adds the `explorer.service_map.max_segments` option (default 10,000) to control the per-phase row budget.
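The new deduplication key can be sketched as follows (field names are illustrative, not the PR's actual code). Two traces hitting the same downstream transaction via different parent spans both survive, while an exact repeat is dropped:

```python
def dedupe_segments(segments: list[dict]) -> list[dict]:
    """Keep one segment per (child_project_id, parent_span_id) pair."""
    seen: set[tuple[int, str]] = set()
    unique: list[dict] = []
    for seg in segments:
        key = (seg["child_project_id"], seg["parent_span_id"])
        if key not in seen:
            seen.add(key)
            unique.append(seg)
    return unique


segments = [
    # Same transaction name, different parent spans: both kept under the new key.
    {"child_project_id": 2, "parent_span_id": "a1", "transaction": "/checkout"},
    {"child_project_id": 2, "parent_span_id": "b2", "transaction": "/checkout"},
    # Exact repeat of an already-seen pair: dropped.
    {"child_project_id": 2, "parent_span_id": "a1", "transaction": "/checkout"},
]
```

Keying on transaction name alone would have collapsed the first two rows into one, undercounting the edge.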
… payload

Each node now includes project_id, project_slug, role, callers, and callees instead of sending a flat roles dict and a separate edges list.
- Remove the redundant explorer.service_map.killswitch option (the enable flag is sufficient)
- Remove FLAG_ALLOW_EMPTY from the max_segments Int option
- Remove the unused timestamp column from the Phase 2 Snuba query
- Remove a leftover breakpoint() debug call
- Fix a duplicate @django_db_all decorator and an inline import in tests
```python
    )

    # TODO: Add endpoint in seer before making the actual request
```
Service map never sent to Seer
High Severity
_send_to_seer only serializes and logs the payload but never performs the HTTP request (and never signs it), so build_service_map cannot actually update Seer. The new tests also expect a POST to settings.SEER_AUTOFIX_URL and error handling, which won’t happen with the current stub.
Additional Locations (1)
```python
    default=[],
    type=Sequence,
    flags=FLAG_ALLOW_EMPTY | FLAG_AUTOMATOR_MODIFIABLE,
)
```
Mutable option default list can be shared
Low Severity
explorer.service_map.allowed_organizations uses default=[], which can create a shared mutable default across reads in the option manager. If any caller mutates the returned list, subsequent reads can observe the mutated “default” value unexpectedly.
```python
    return edges


def _classify_service_roles(edges: list[dict]) -> dict[int, str]:
```
I'm not so sure about this I might just remove it for now till I have a better idea
Okay, I'll update it to be more generic graph topology related and we can decide whether or not we want to use it in seer
Replace "core_backend", "frontend", "isolated" with "hub", "caller", "callee", "peripheral": terms that describe observed connectivity rather than inferred service type, which is unreliable with partial instrumentation. Also adds the previously missing callee branch (high in-degree, low out-degree).

Update tests to match: fix the broken TestSendToSeer tests (they now test payload construction rather than HTTP calls, which are stubbed), and rename the classification test methods to use the new terminology.
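A sketch of the topology-only scheme described above, comparing each node's in/out degree to the graph's average degree (thresholds and names are illustrative, not the PR's exact code):

```python
from collections import defaultdict


def classify(edges: list[tuple[int, int]]) -> dict[int, str]:
    """Assign hub/caller/callee/peripheral from in/out degree vs. average."""
    in_deg: dict[int, int] = defaultdict(int)
    out_deg: dict[int, int] = defaultdict(int)
    nodes: set[int] = set()
    for src, tgt in edges:
        out_deg[src] += 1
        in_deg[tgt] += 1
        nodes.update((src, tgt))
    avg = len(edges) / len(nodes) if nodes else 0.0
    roles: dict[int, str] = {}
    for n in nodes:
        hi_in, hi_out = in_deg[n] >= avg, out_deg[n] >= avg
        if hi_in and hi_out:
            roles[n] = "hub"
        elif hi_out:
            roles[n] = "caller"
        elif hi_in:
            roles[n] = "callee"  # the branch that was previously missing
        else:
            roles[n] = "peripheral"
    return roles


# 7 edges over 5 nodes, so avg degree is 1.4.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 1), (4, 2), (1, 5)]
roles = classify(edges)
# 1 is a hub, 4 a caller, 2 and 3 callees, 5 peripheral
```

Note that "peripheral" can only appear when avg > 1, i.e. when the graph has more edges than nodes; otherwise any connected node clears at least one threshold.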
```python
    type=Int,
    flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
    ttl=60 * 5,
)
```
Default max_segments is 500, not documented 10,000
Medium Severity
The explorer.service_map.max_segments option defaults to 500, but the PR description documents a default of 10,000. With only 500 segments scanned per org over a 24-hour window, the service map is likely to miss many cross-project dependencies, producing a far less useful graph than intended.
… 1 and 2

The Snuba query requires orderby columns to also appear in selected_columns. Phase 2 had timestamp removed during cleanup; Phase 1 was always missing it.
…eral node

The previous graph (2 edges, 3 nodes, avg=0.67) classified the leaf service as a callee, since its in-degree of 1 met the average threshold. Peripheral requires avg > 1 (more edges than nodes). The new graph uses 5 edges across 4 nodes (avg=1.25), so the weakly-connected service has both in- and out-degrees below average.
```python
logger = logging.getLogger("sentry.tasks.explorer_service_map")

# Seer endpoint path
SEER_SERVICE_MAP_PATH = "/v1/explorer/service-map/update"
```
Unused SEER_SERVICE_MAP_PATH constant is dead code
Low Severity
SEER_SERVICE_MAP_PATH is defined but never referenced anywhere in the codebase. While it's presumably intended for the future HTTP call in _send_to_seer, it currently contributes to dead code. The constant isn't used even in the stubbed _send_to_seer function.
```python
    try:
        organization = Organization.objects.get(id=organization_id)
        projects = list(Project.objects.filter(organization_id=organization_id))
```
Maybe let's only search for active projects.
```python
        return

    roles = _classify_service_roles(edges)
    nodes = _build_nodes(edges, roles)
```
can these two functions just be 1 function called build_graph and pass once over the edges and nodes? I think so but maybe I'm missing something
…one pass

Both functions walked the edges list to extract the same per-node data (degrees, slugs, caller/callee relationships). Merged into a single _build_nodes(edges) that collects everything in one traversal, computes average degrees, assigns roles, and returns the node list directly.
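The merged single-traversal shape can be sketched like this (field names mirror the node payload described earlier; this is an illustrative sketch, not the PR's actual implementation):

```python
from collections import defaultdict


def build_nodes(edges: list[dict]) -> list[dict]:
    """One pass over edges: degrees, slugs, caller/callee sets, then roles."""
    stats: dict[int, dict] = defaultdict(
        lambda: {"in": 0, "out": 0, "slug": None, "callers": set(), "callees": set()}
    )
    for e in edges:
        src, tgt = e["source_id"], e["target_id"]
        stats[src]["out"] += 1
        stats[tgt]["in"] += 1
        stats[src]["callees"].add(tgt)
        stats[tgt]["callers"].add(src)
        stats[src]["slug"] = stats[src]["slug"] or e.get("source_slug")
        stats[tgt]["slug"] = stats[tgt]["slug"] or e.get("target_slug")

    avg = len(edges) / len(stats) if stats else 0.0
    nodes = []
    for pid, s in stats.items():
        hi_in, hi_out = s["in"] >= avg, s["out"] >= avg
        role = ("hub" if hi_in and hi_out else "caller" if hi_out
                else "callee" if hi_in else "peripheral")
        nodes.append({
            "project_id": pid,
            "project_slug": s["slug"],
            "role": role,
            "callers": sorted(s["callers"]),
            "callees": sorted(s["callees"]),
        })
    return nodes


edges = [
    {"source_id": 1, "source_slug": "web", "target_id": 2, "target_slug": "api"},
    {"source_id": 2, "source_slug": "api", "target_id": 3, "target_slug": "worker"},
]
nodes = {n["project_id"]: n for n in build_nodes(edges)}
```

Everything the old pair of functions computed separately falls out of the single `stats` accumulator.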
```python
                child_project_id,
                segment.get("child_project_slug"),
            )
            edges_by_pair[edge_key] += 1
```
Edge aggregation key includes slugs causing potential count splitting
Low Severity
The edges_by_pair aggregation key is (parent_project_id, parent_project_slug, child_project_id, child_project_slug), including slug metadata alongside IDs. If the same project pair ever appears with a different slug value (e.g., one query returns None for project.slug), edge counts split across separate entries. Downstream in _build_nodes, each entry independently increments degree counters, inflating in/out degrees and potentially causing incorrect role classification.
Additional Locations (1)
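The count-splitting failure mode, plus one possible fix (IDs-only keys with slugs kept in a side table), can be demonstrated in isolation (illustrative code, not the PR's):

```python
from collections import defaultdict

edges_by_pair: dict[tuple, int] = defaultdict(int)
# The same logical 10 -> 20 project pair observed twice,
# once with an unresolved (None) slug:
edges_by_pair[(10, "backend", 20, "worker")] += 1
edges_by_pair[(10, None, 20, "worker")] += 1
# One logical edge is now split across two entries.
assert len(edges_by_pair) == 2

# Keying by IDs only keeps the count intact; slugs become metadata.
counts: dict[tuple[int, int], int] = defaultdict(int)
slugs: dict[int, str] = {}
for (src, src_slug, tgt, tgt_slug), n in edges_by_pair.items():
    counts[(src, tgt)] += n
    if src_slug:
        slugs.setdefault(src, src_slug)
    if tgt_slug:
        slugs.setdefault(tgt, tgt_slug)
assert counts[(10, 20)] == 2
```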
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
        try:
            build_service_map.apply_async(
                args=[org_id],
                countdown=0,
```
Redundant countdown parameter in task dispatch
Low Severity
The countdown=0 parameter in apply_async is redundant since 0 is the default value. Additionally, the PR discussion mentions implementing time-staggered dispatch when productionizing, suggesting this was a placeholder that should either be removed or replaced with actual staggering logic.
```python
        for (src_id, src_slug, tgt_id, tgt_slug), count in edges_by_pair.items()
    ]
    edges.sort(key=lambda x: cast(int, x["count"]), reverse=True)
    edges = edges[:max_edges]
```
Missing validation allows negative max_edges causing incorrect slicing
Medium Severity
The max_edges option value is used directly in list slicing without validation. If configured to a negative value, edges[:max_edges] uses Python negative indexing instead of limiting the list, causing incorrect results. For example, edges[:-1] would return all edges except the last one, rather than enforcing a maximum edge count.
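Python's negative-index slicing makes this concrete, along with a clamping fix (one possible remedy, not the PR's code):

```python
edges = [{"count": c} for c in (9, 7, 5, 3, 1)]

# A negative "max" silently drops items from the end instead of capping:
assert edges[:-1] == edges[0:4]


def truncate_edges(edges: list[dict], max_edges: int) -> list[dict]:
    """Clamp before slicing so a misconfigured option cannot go negative."""
    return edges[: max(0, max_edges)]


assert truncate_edges(edges, -1) == []
assert truncate_edges(edges, 2) == edges[:2]
```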
```python
    type=Int,
    flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
    ttl=60 * 5,
)
```
Inconsistent flag usage for max_segments option
Low Severity
The explorer.service_map.max_segments option uses FLAG_PRIORITIZE_DISK while the similar explorer.service_map.max_edges option does not. Both are runtime-tunable limits for the same feature, so they should use consistent flags. Other similar max/limit options in the codebase only use FLAG_AUTOMATOR_MODIFIABLE.
Adds a periodic Celery task that analyzes distributed traces to build a
service dependency graph for each organization and sends it to Seer.
This gives Explorer the context it needs to understand which services
call which others.
## How it works
`schedule_service_map_builds` runs daily and fans out a
build_service_map task per org from the allowlist.
`build_service_map` does the following for each org:
1. Two-pass Snuba scan to find cross-project segment relationships:
   - Phase 1: Org-wide query for transaction spans that have a parent_span
     (cross-project candidates). Tracks which projects appear.
   - Phase 2: If any projects had zero representation in Phase 1 (e.g.
     low-traffic services), runs a scoped fallback scan for those projects
     without the has:parent_span filter.
   - Phase 3: Batch-resolves all collected parent_span_ids back to their
     source projects to build directed edges.
2. Role classification using in/out degree analysis: services are
   classified as core_backend, frontend, or isolated relative to the
   average connectivity of the graph.
3. Sends to Seer via a signed POST to /v1/explorer/service-map/update
   (the HTTP call is currently stubbed pending the Seer endpoint being ready).
## Options
``` ┌────────────────────────────────────────────┬─────────┬───────────────────────────┐
│ Option │ Default │ Purpose │
├────────────────────────────────────────────┼─────────┼───────────────────────────┤
│ explorer.service_map.enable │ false │ Master on/off switch │
├────────────────────────────────────────────┼─────────┼───────────────────────────┤
│ explorer.service_map.allowed_organizations │ [] │ Allowlist of org IDs │
├────────────────────────────────────────────┼─────────┼───────────────────────────┤
│ explorer.service_map.max_segments │ 10,000 │ Max spans scanned per org │
├────────────────────────────────────────────┼─────────┼───────────────────────────┤
│ explorer.service_map.max_edges │ 5,000 │ Max edges sent to Seer │
└────────────────────────────────────────────┴─────────┴───────────────────────────┘
```
## Note
- The Seer HTTP call is commented out with a TODO; the rest of the pipeline is fully functional and can be validated end-to-end once the endpoint lands
- This task isn't actually called yet

