feat(seer): Add Explorer service map extraction pipeline #108379

Merged
shruthilayaj merged 18 commits into master from shruthi/explorer-service-map
Feb 23, 2026
Conversation


@shruthilayaj shruthilayaj commented Feb 17, 2026

Adds a periodic Celery task that analyzes distributed traces to build a service dependency graph for each organization and sends it to Seer. This gives Explorer the context it needs to understand which services call which others.

How it works

schedule_service_map_builds runs daily and fans out a build_service_map task per org from the allowlist.
build_service_map does the following for each org:

  1. Two-pass Snuba scan to find cross-project segment relationships:
    - Phase 1: Org-wide query for transaction spans that have a parent_span (cross-project candidates). Tracks which projects appear.
    - Phase 2: If any projects had zero representation in Phase 1 (e.g. low-traffic services), runs a scoped fallback scan for those projects without the
    has:parent_span filter.
    - Phase 3: Batch-resolves all collected parent_span_ids back to their source projects to build directed edges.
  2. Role classification using in/out degree analysis — services are classified as core_backend, frontend, or isolated relative to the average connectivity of
    the graph.
  3. Sends to Seer via a signed POST to /v1/explorer/service-map/update (HTTP call currently stubbed pending the Seer endpoint being ready).
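The Phase 3 edge-building step can be sketched in plain Python (hypothetical `build_edges` helper and dict shapes; the real task resolves parent spans via Snuba):

```python
from collections import defaultdict


def build_edges(segments, parent_projects):
    """Build directed cross-project edges from resolved parent spans.

    segments: dicts with "project_id" (the child project) and "parent_span_id".
    parent_projects: parent_span_id -> project_id of the parent span's project.
    Same-project pairs are skipped; repeated calls aggregate into a count.
    """
    edges_by_pair: defaultdict[tuple[int, int], int] = defaultdict(int)
    for segment in segments:
        parent_project = parent_projects.get(segment["parent_span_id"])
        if parent_project is None or parent_project == segment["project_id"]:
            continue  # unresolved parent, or an intra-project span
        edges_by_pair[(parent_project, segment["project_id"])] += 1
    return [
        {"source": src, "target": tgt, "count": count}
        for (src, tgt), count in edges_by_pair.items()
    ]
```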

Options

  ┌────────────────────────────────────────────┬─────────┬───────────────────────────┐
  │                   Option                   │ Default │          Purpose          │
  ├────────────────────────────────────────────┼─────────┼───────────────────────────┤
  │ explorer.service_map.enable                │ false   │ Master on/off switch      │
  ├────────────────────────────────────────────┼─────────┼───────────────────────────┤
  │ explorer.service_map.allowed_organizations │ []      │ Allowlist of org IDs      │
  ├────────────────────────────────────────────┼─────────┼───────────────────────────┤
  │ explorer.service_map.max_segments          │ 10,000  │ Max spans scanned per org │
  ├────────────────────────────────────────────┼─────────┼───────────────────────────┤
  │ explorer.service_map.max_edges             │ 5,000   │ Max edges sent to Seer    │
  └────────────────────────────────────────────┴─────────┴───────────────────────────┘

Note

  • The Seer HTTP call is commented out with a TODO; the rest of the pipeline is fully functional and can be validated end-to-end once the endpoint lands
  • This task isn't actually called yet

Implements a Celery task that extracts service dependency graphs from
distributed traces and sends them to Seer for hierarchical retrieval in
Explorer chat.

**Key features:**
- Queries top transactions by total duration using EAP RPC interface
- Extracts cross-project dependencies from segment spans
- Classifies service roles (frontend, core backend, isolated) using graph analysis
- Rate limiting and batching for resource protection
- Comprehensive test coverage (40 tests)

**Implementation details:**
- Uses Spans.run_table_query for all Snuba queries (EAP RPC)
- While loop optimization to ensure all transactions are represented
- Batches parent span resolution (500 per batch)
- Includes project slugs for readability
- Converts role dict keys to strings for orjson compatibility

**Dependencies:**
- Added networkx>=3.0 for graph-based role classification

**Options added:**
- explorer.service_map.enable
- explorer.service_map.killswitch
- explorer.service_map.allowed_organizations
- explorer.service_map.max_edges (default: 5000)
- explorer.service_map.rate_limit_seconds (default: 3600)
@github-actions github-actions bot added the Scope: Backend label (automatically applied to PRs that change backend components) Feb 17, 2026

…tion

NetworkX was only used for basic in-degree/out-degree counting in the
service classification logic. Replaced it with simple native Python
using defaultdict, eliminating the 17MB dependency with no loss of
functionality.

Classification logic remains identical:
- Counts incoming/outgoing edges for each service
- Computes average degrees
- Classifies as frontend/core_backend/isolated based on thresholds
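The degree counting that replaced NetworkX can be sketched with a `defaultdict` (hypothetical `compute_degrees` name; the actual helper lives in the pipeline module):

```python
from collections import defaultdict


def compute_degrees(edges):
    """Count in/out degree per service from directed (source, target) edges
    and return the graph's average degree -- the only pieces of NetworkX
    functionality the classification logic actually used."""
    in_degree: defaultdict[int, int] = defaultdict(int)
    out_degree: defaultdict[int, int] = defaultdict(int)
    nodes: set[int] = set()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
        nodes.update((source, target))
    avg_degree = len(edges) / len(nodes) if nodes else 0.0
    return in_degree, out_degree, avg_degree
```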

These options were added by mistake and are not used anywhere in the
codebase. The explorer.service_map options are retained as they are
actively used by the service map pipeline.

Adds comprehensive integration tests for the Explorer service map feature
that use real Snuba queries instead of mocks. These tests verify:

- Cross-project dependency extraction (A→B, A→B→C, fan-in, circular)
- Edge aggregation and filtering (same-project, missing parents)
- Service role classification (frontend, core_backend, isolated)
- Complete end-to-end workflow with Seer payload validation

The integration tests successfully identified and fixed a real bug where
the query was ordering by `timestamp` without including it in the selected
columns, causing an InvalidSearchQuery error.

Tests use SnubaTestCase and SpanTestCase to create real span data with
proper parent-child relationships across projects, validating that the
complete pipeline works correctly with actual Snuba storage and queries.
…ency test

Removes redundant mock-based test classes that are now fully covered by
integration tests:
- TestQueryServiceDependencies (10 tests)
- TestClassifyServiceRoles (5 tests)

Fixes test_circular_dependencies by using unique transaction names per
trace to avoid deduplication issues. The implementation's deduplication
logic keeps only one segment per transaction name, so both circular
traces now use distinct transaction names:
- Trace 1: /service-a/endpoint1 → /service-b/endpoint1
- Trace 2: /service-b/endpoint2 → /service-a/endpoint2

This cleanup reduces the test suite from 51 tests to 36 tests while
maintaining full coverage through comprehensive integration tests.

Fixes three mypy errors in explorer_service_map.py:

1. Fixed list comprehension type narrowing for transactions - changed to
   explicit loop to help mypy understand that None values are filtered out

2. Fixed edges_by_pair dict type annotation - changed from
   dict[tuple[int, int], int] to dict[tuple[int, str | None, int, str | None], int]
   to match the actual 4-tuple keys storing (source_id, source_slug,
   target_id, target_slug)

3. Added cast() to sort lambda to specify that x["count"] is always int,
   resolving type checker's inability to infer the specific dict value type
# Dispatch tasks for each organization
for org_id in allowed_org_ids:
    try:
        build_service_map.apply_async(
Member Author:

Do a time-staggered queue when productionizing

- Remove custom cache-based rate limiting from build_service_map; Snuba's
  policy system will handle this via the seer.explorer_service_map referrer
- Remove explorer.service_map.rate_limit_seconds option
- Build SnubaParams once in build_service_map and pass it into
  _query_top_transactions and _query_service_dependencies, eliminating
  duplicate Organization and Project DB queries per invocation
…rvice map

The previous pipeline queried top-100 transactions by total duration then
fetched one segment per transaction to find cross-project edges. This
failed for large orgs: high-volume services dominated the top-100 list,
and 100 segments were never enough to discover all edges across 200+ projects.

New approach:
- Phase 1: Org-wide paginated scan (up to max_segments rows, 100/page) with
  `is_transaction:true has:parent_span`. Tracks which projects appear.
- Phase 2: If any projects had zero Phase 1 representation, run a second
  paginated scan scoped to those uncovered projects without `has:parent_span`,
  giving low-traffic services a broad fallback.
- Phase 3: Batch-resolve parent span IDs in 100-span batches (unchanged logic)
  to determine source projects and build cross-project edges.

Deduplication is now by (child_project_id, parent_span_id) pair rather than
transaction name, so multiple calls to the same downstream service are counted
correctly across different traces.

Also adds `explorer.service_map.max_segments` option (default 10,000) to
control the per-phase row budget.
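The new deduplication rule can be sketched as (hypothetical `dedupe_segments` helper):

```python
def dedupe_segments(segments):
    """Deduplicate by (child_project_id, parent_span_id) rather than by
    transaction name, so repeated calls to the same downstream service
    across different traces each contribute a distinct edge instead of
    collapsing to one segment per transaction name."""
    seen: set[tuple[int, str]] = set()
    unique = []
    for segment in segments:
        key = (segment["project_id"], segment["parent_span_id"])
        if key not in seen:
            seen.add(key)
            unique.append(segment)
    return unique
```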
… payload

Each node now includes project_id, project_slug, role, callers, and callees
instead of sending a flat roles dict and separate edges list.

- Remove redundant explorer.service_map.killswitch option (enable flag is sufficient)
- Remove FLAG_ALLOW_EMPTY from max_segments Int option
- Remove unused timestamp column from Phase 2 Snuba query
- Remove leftover breakpoint() debug call
- Fix duplicate @django_db_all decorator and inline import in tests
)

# TODO: Add endpoint in seer before making the actual request

Contributor:

Service map never sent to Seer

High Severity

_send_to_seer only serializes and logs the payload but never performs the HTTP request (and never signs it), so build_service_map cannot actually update Seer. The new tests also expect a POST to settings.SEER_AUTOFIX_URL and error handling, which won’t happen with the current stub.


default=[],
type=Sequence,
flags=FLAG_ALLOW_EMPTY | FLAG_AUTOMATOR_MODIFIABLE,
)
Contributor:

Mutable option default list can be shared

Low Severity

explorer.service_map.allowed_organizations uses default=[], which can create a shared mutable default across reads in the option manager. If any caller mutates the returned list, subsequent reads can observe the mutated “default” value unexpectedly.
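A self-contained illustration of the aliasing risk, using a toy options store rather than Sentry's actual option manager:

```python
# Toy options store: the default list object is stored once and handed out
# directly, mirroring how a shared mutable default can leak.
_DEFAULTS = {"explorer.service_map.allowed_organizations": []}


def get_option(key):
    # Returns the stored default object directly -- no defensive copy.
    return _DEFAULTS[key]


orgs = get_option("explorer.service_map.allowed_organizations")
orgs.append(42)  # mutates the shared default in place

# A later, unrelated read now observes the mutated "default":
assert get_option("explorer.service_map.allowed_organizations") == [42]


def get_option_safe(key):
    # Returning a copy (or registering an immutable default such as a
    # tuple) breaks the aliasing.
    value = _DEFAULTS[key]
    return list(value) if isinstance(value, list) else value
```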


return edges


def _classify_service_roles(edges: list[dict]) -> dict[int, str]:
Member Author:

I'm not so sure about this I might just remove it for now till I have a better idea

Member Author:

Okay, I'll update it to be more generic graph topology related and we can decide whether or not we want to use it in seer

Replace "core_backend", "frontend", "isolated" with "hub", "caller",
"callee", "peripheral" — terms that describe observed connectivity
rather than inferred service type, which is unreliable with partial
instrumentation. Also adds the previously missing callee branch
(high in-degree, low out-degree).
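A sketch of the renamed scheme (hypothetical `classify_topology` name; treating `>= avg` as "high" is an assumption about where the threshold sits):

```python
from collections import defaultdict


def classify_topology(edges):
    """Assign connectivity roles relative to the graph's average degree.

    hub: in- and out-degree both at/above average; caller: only out-degree
    high; callee: only in-degree high; peripheral: both below average.
    """
    in_deg: defaultdict[int, int] = defaultdict(int)
    out_deg: defaultdict[int, int] = defaultdict(int)
    nodes: set[int] = set()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
        nodes.update((source, target))
    avg = len(edges) / len(nodes) if nodes else 0.0
    roles = {}
    for node in nodes:
        high_in = in_deg[node] >= avg
        high_out = out_deg[node] >= avg
        if high_in and high_out:
            roles[node] = "hub"
        elif high_out:
            roles[node] = "caller"
        elif high_in:
            roles[node] = "callee"  # the previously missing branch
        else:
            roles[node] = "peripheral"
    return roles
```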

Update tests to match: fix broken TestSendToSeer tests (now test
payload construction rather than HTTP calls that are stubbed), and
rename classification test methods to use new terminology.
type=Int,
flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
ttl=60 * 5,
)
Contributor:

Default max_segments is 500, not documented 10,000

Medium Severity

The explorer.service_map.max_segments option defaults to 500, but the PR description documents a default of 10,000. With only 500 segments scanned per org over a 24-hour window, the service map is likely to miss many cross-project dependencies, producing a far less useful graph than intended.


… 1 and 2

The Snuba query requires orderby columns to also appear in selected_columns.
Phase 2 had timestamp removed during cleanup; Phase 1 was always missing it.
…eral node

The previous graph (2 edges, 3 nodes, avg=0.67) classified the leaf
service as callee since its in-degree of 1 met the average threshold.
Peripheral requires avg > 1 (more edges than nodes). New graph uses
5 edges across 4 nodes (avg=1.25) so the weakly-connected service has
both in and out degrees below average.
logger = logging.getLogger("sentry.tasks.explorer_service_map")

# Seer endpoint path
SEER_SERVICE_MAP_PATH = "/v1/explorer/service-map/update"
Contributor:

Unused SEER_SERVICE_MAP_PATH constant is dead code

Low Severity

SEER_SERVICE_MAP_PATH is defined but never referenced anywhere in the codebase. While it's presumably intended for the future HTTP call in _send_to_seer, it currently contributes to dead code. The constant isn't used even in the stubbed _send_to_seer function.



try:
    organization = Organization.objects.get(id=organization_id)
    projects = list(Project.objects.filter(organization_id=organization_id))
Contributor:

Maybe let's only search for active projects.

Member Author:

addressed in eac1c66

return

roles = _classify_service_roles(edges)
nodes = _build_nodes(edges, roles)
Contributor:

can these two functions just be one function called build_graph that passes once over the edges and nodes? I think so, but maybe I'm missing something

Member Author:

combined in 5fe2fe1

@Mihir-Mavalankar Mihir-Mavalankar (Contributor) left a comment:

Stamping contingent on comments above.

…one pass

Both functions walked the edges list to extract the same per-node data
(degrees, slugs, caller/callee relationships). Merged into a single
_build_nodes(edges) that collects everything in one traversal, computes
average degrees, assigns roles, and returns the node list directly.
child_project_id,
segment.get("child_project_slug"),
)
edges_by_pair[edge_key] += 1
Contributor:

Edge aggregation key includes slugs causing potential count splitting

Low Severity

The edges_by_pair aggregation key is (parent_project_id, parent_project_slug, child_project_id, child_project_slug), including slug metadata alongside IDs. If the same project pair ever appears with a different slug value (e.g., one query returns None for project.slug), edge counts split across separate entries. Downstream in _build_nodes, each entry independently increments degree counters, inflating in/out degrees and potentially causing incorrect role classification.
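One way to avoid the splitting, sketched with a hypothetical `aggregate_edges` helper that keys counts by project IDs alone and resolves slugs in a side lookup:

```python
def aggregate_edges(raw_pairs):
    """Aggregate edge counts keyed by (source_id, target_id) only; slugs
    live in a separate lookup, so a row with a missing slug cannot split
    an edge's count across two entries."""
    counts: dict[tuple[int, int], int] = {}
    slugs: dict[int, str] = {}
    for src_id, src_slug, tgt_id, tgt_slug in raw_pairs:
        counts[(src_id, tgt_id)] = counts.get((src_id, tgt_id), 0) + 1
        for project_id, slug in ((src_id, src_slug), (tgt_id, tgt_slug)):
            if slug is not None:
                slugs.setdefault(project_id, slug)  # first non-None slug wins
    return [
        {"source": src, "source_slug": slugs.get(src),
         "target": tgt, "target_slug": slugs.get(tgt), "count": count}
        for (src, tgt), count in counts.items()
    ]
```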


@cursor cursor bot (Contributor) left a comment:

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

try:
    build_service_map.apply_async(
        args=[org_id],
        countdown=0,
Contributor:

Redundant countdown parameter in task dispatch

Low Severity

The countdown=0 parameter in apply_async is redundant since 0 is the default value. Additionally, the PR discussion mentions implementing time-staggered dispatch when productionizing, suggesting this was a placeholder that should either be removed or replaced with actual staggering logic.


for (src_id, src_slug, tgt_id, tgt_slug), count in edges_by_pair.items()
]
edges.sort(key=lambda x: cast(int, x["count"]), reverse=True)
edges = edges[:max_edges]
Contributor:

Missing validation allows negative max_edges causing incorrect slicing

Medium Severity

The max_edges option value is used directly in list slicing without validation. If configured to a negative value, edges[:max_edges] uses Python negative indexing instead of limiting the list, causing incorrect results. For example, edges[:-1] would return all edges except the last one, rather than enforcing a maximum edge count.
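A minimal guard, sketched as a hypothetical helper:

```python
def truncate_edges(edges, max_edges):
    # Clamp the limit to non-negative before slicing: a misconfigured
    # negative option value would otherwise trigger Python's negative
    # indexing (e.g. edges[:-1] drops only the last edge) instead of
    # enforcing a maximum edge count.
    return edges[: max(0, max_edges)]
```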


type=Int,
flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
ttl=60 * 5,
)
Contributor:

Inconsistent flag usage for max_segments option

Low Severity

The explorer.service_map.max_segments option uses FLAG_PRIORITIZE_DISK while the similar explorer.service_map.max_edges option does not. Both are runtime-tunable limits for the same feature, so they should use consistent flags. Other similar max/limit options in the codebase only use FLAG_AUTOMATOR_MODIFIABLE.


@shruthilayaj shruthilayaj merged commit 49b141b into master Feb 23, 2026
100 checks passed
@shruthilayaj shruthilayaj deleted the shruthi/explorer-service-map branch February 23, 2026 20:10
mchen-sentry pushed a commit that referenced this pull request Feb 24, 2026

Labels

claude-code-assisted Scope: Backend Automatically applied to PRs that change backend components


2 participants