Unified Storage Performance Optimizations #1
base: performance-optimization-baseline
Conversation
…#97529)

* dont lazy init unified storage
* Inits index when creating new resource server. Fixes trace propagation by passing span ctx. Update some logging.
* Use finer grained cache locking when building indexes to speed things up. Locking the whole function was slowing things down.
* formatting
* linter fix
* go mod
* make update-workspace
* fix workspaces check error
* update dependency owner in mod file
* wait 1 second before querying metrics
* try with big timeout, see if fixes CI. Wont fail locally.
* skips postgres integration test. Only fails in drone. Will fix later.
* put delay back to 500 ms
@coderabbitai review

✅ Actions performed: Review triggered.
Walkthrough

The pull request consolidates resource server initialization to construction time, removes redundant per-call `Init()` invocations from RPC methods, propagates tracing contexts consistently across search and SQL operations, switches logging to instance-level loggers, and optimizes cache locking behavior in the search index builder.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant NewResourceServer
    participant Init
    participant SearchInit
    participant WatcherStart
    participant RPC as RPC Methods

    Caller->>NewResourceServer: Create resource server
    NewResourceServer->>Init: Call Init(ctx) at construction
    Init->>SearchInit: Initialize search index
    SearchInit-->>Init: Complete
    Init->>WatcherStart: Start watcher (after search)
    WatcherStart-->>Init: Ready
    Init-->>NewResourceServer: Return (or error)
    NewResourceServer-->>Caller: Return server instance
    Note over RPC: No redundant Init() calls
    Caller->>RPC: Call Create/Update/Delete/etc.
    RPC-->>Caller: Execute with pre-initialized state
```
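To make the construction-time initialization concrete, here is a minimal Go sketch of the pattern the diagram describes: a constructor that runs `Init` eagerly, builds the search index before starting the watcher, and stays idempotent via `sync.Once`. The names and types below are hypothetical stand-ins, not the actual Grafana resource server.

```go
package resourcesketch

import (
	"context"
	"fmt"
	"sync"
)

// searchIndex and watcher are hypothetical dependencies used only for illustration.
type searchIndex interface {
	Init(ctx context.Context) error
}

type watcher interface {
	Start(ctx context.Context) error
}

// server sketches a resource server that is fully initialized at construction time.
type server struct {
	initOnce sync.Once
	initErr  error

	search searchIndex
	watch  watcher
}

// newServer runs Init during construction, so callers either get a ready
// server or an error, instead of deferring failures to the first RPC call.
func newServer(ctx context.Context, s searchIndex, w watcher) (*server, error) {
	srv := &server{search: s, watch: w}
	if err := srv.Init(ctx); err != nil {
		return nil, fmt.Errorf("resource server init: %w", err)
	}
	return srv, nil
}

// Init is idempotent: sync.Once guarantees the search index is initialized
// exactly once, and the watcher is only started after search is ready.
func (s *server) Init(ctx context.Context) error {
	s.initOnce.Do(func() {
		if err := s.search.Init(ctx); err != nil {
			s.initErr = err
			return
		}
		s.initErr = s.watch.Start(ctx)
	})
	return s.initErr
}
```

With this shape, RPC handlers can assume a ready index and watcher, which is what makes removing the per-call `Init()` invocations safe.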
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/storage/unified/search/bleve.go (1)
96-154: Guard `TotalDocs` map iteration with `cacheMu` to avoid concurrent map access

`BuildIndex` now writes to `b.cache` under `cacheMu.Lock()`/`Unlock()`, but `TotalDocs` still iterates over `b.cache` without any lock. Concurrent calls to `BuildIndex` and `TotalDocs` risk a "concurrent map read and map write" panic. Protect the read side as well:
```diff
 func (b *bleveBackend) TotalDocs() int64 {
-	var totalDocs int64
-	for _, v := range b.cache {
+	b.cacheMu.RLock()
+	defer b.cacheMu.RUnlock()
+
+	var totalDocs int64
+	for _, v := range b.cache {
 		c, err := v.index.DocCount()
 		if err != nil {
 			continue
 		}
 		totalDocs += int64(c)
 	}
 	return totalDocs
 }
```
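For readers less familiar with this locking pattern, here is a small self-contained sketch (a hypothetical cache type, not the bleve backend itself) of the same idea: writers take the exclusive lock, readers take the shared lock, and `go test -race` flags the unguarded variant.

```go
package cachesketch

import "sync"

// indexCache guards a map with an RWMutex: writers use Lock, readers use RLock.
type indexCache struct {
	mu    sync.RWMutex
	items map[string]int64 // e.g. index name -> document count
}

func newIndexCache() *indexCache {
	return &indexCache{items: make(map[string]int64)}
}

// set is the write path (analogous to BuildIndex updating b.cache).
func (c *indexCache) set(name string, docs int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[name] = docs
}

// totalDocs is the read path (analogous to TotalDocs); the shared RLock allows
// concurrent readers while still excluding writers.
func (c *indexCache) totalDocs() int64 {
	c.mu.RLock()
	defer c.mu.RUnlock()

	var total int64
	for _, n := range c.items {
		total += n
	}
	return total
}
```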
pkg/storage/unified/resource/search.go (1)

171-221: Fix goroutine closure over `info`, `err`, and `totalBatchesIndexed` in search init

Within `init`:

- `info` from the `for _, info := range stats` loop is captured by reference, so all goroutines race on the same variable and typically see only the last element.
- `totalBatchesIndexed++` is incremented from multiple goroutines without synchronization (data race).
- `_, _, err = s.build(...)` mutates the outer `err` variable concurrently, another data race, and the assignment is unnecessary since the goroutine returns its own error.

These are concurrency bugs even if they existed before; with `errgroup.Group` they are easy to fix.

```diff
 func (s *searchSupport) init(ctx context.Context) error {
 	ctx, span := s.tracer.Start(ctx, tracingPrexfixSearch+"Init")
 	defer span.End()
 	start := time.Now().Unix()
 	totalBatchesIndexed := 0
 	group := errgroup.Group{}
 	group.SetLimit(s.initWorkers)

 	stats, err := s.storage.GetResourceStats(ctx, "", s.initMinSize)
 	if err != nil {
 		return err
 	}
-
-	for _, info := range stats {
-		group.Go(func() error {
-			s.log.Debug("initializing search index", "namespace", info.Namespace, "group", info.Group, "resource", info.Resource)
-			totalBatchesIndexed++
-			_, _, err = s.build(ctx, info.NamespacedResource, info.Count, info.ResourceVersion)
-			return err
-		})
-	}
-
-	err = group.Wait()
-	if err != nil {
-		return err
-	}
+	totalBatchesIndexed = len(stats)
+
+	for _, info := range stats {
+		info := info // capture loop variable
+		group.Go(func() error {
+			s.log.Debug("initializing search index", "namespace", info.Namespace, "group", info.Group, "resource", info.Resource)
+			_, _, err := s.build(ctx, info.NamespacedResource, info.Count, info.ResourceVersion)
+			return err
+		})
+	}
+
+	if err := group.Wait(); err != nil {
+		return err
+	}

 	span.AddEvent("namespaces indexed", trace.WithAttributes(attribute.Int("namespaced_indexed", totalBatchesIndexed)))
```

This removes the races and still reports how many namespaces were scheduled for indexing.
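As a standalone illustration of the loop-variable capture problem and the `errgroup` pattern (a generic runnable example, not the Grafana code; note that Go 1.22+ scopes loop variables per iteration, which removes the need for the re-binding line):

```go
package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	namespaces := []string{"default", "team-a", "team-b"}

	var group errgroup.Group
	group.SetLimit(2) // bound concurrency, like initWorkers in the review

	for _, ns := range namespaces {
		ns := ns // re-bind so each goroutine captures its own copy (needed before Go 1.22)
		group.Go(func() error {
			// Each goroutine returns its own error instead of writing to a shared variable.
			fmt.Println("indexing namespace", ns)
			return nil
		})
	}

	// Wait returns the first non-nil error from the goroutines.
	if err := group.Wait(); err != nil {
		fmt.Println("init failed:", err)
	}
}
```

Without the re-binding line (on older Go versions), every goroutine would typically print the last namespace, which is exactly the bug flagged above.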
🧹 Nitpick comments (1)
pkg/storage/unified/resource/search.go (1)
277-285: Enrich high-latency log with more context (optional)

The new `Warn` log for "high index latency" is useful. Consider including resource identifiers (namespace/group/resource) in the log fields so slow paths can be tied back to specific kinds more easily.

```diff
-	if latencySeconds > 5 {
-		s.log.Warn("high index latency", "latency", latencySeconds)
-	}
+	if latencySeconds > 5 {
+		s.log.Warn("high index latency",
+			"latency", latencySeconds,
+			"namespace", evt.Key.Namespace,
+			"group", evt.Key.Group,
+			"resource", evt.Key.Resource,
+		)
+	}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- pkg/server/module_server_test.go (1 hunks)
- pkg/storage/unified/resource/search.go (4 hunks)
- pkg/storage/unified/resource/server.go (2 hunks)
- pkg/storage/unified/search/bleve.go (2 hunks)
- pkg/storage/unified/sql/backend.go (1 hunks)
🔇 Additional comments (3)
pkg/storage/unified/sql/backend.go (1)
125-155: Using tracer-returned context in `GetResourceStats` is correct

Capturing the context from `b.tracer.Start` and passing it into `WithTx` ensures the DB work in `GetResourceStats` is properly associated with the span. The change is sound and aligns this method with the rest of the traced paths.

Please verify via your tracing backend that `sql.resource.GetResourceStats` spans now include the DB queries from this method.
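For context, the shape of the fix looks roughly like the sketch below. The `withTx` helper and its signature are hypothetical; only `tracer.Start` and the span API are standard OpenTelemetry. The key point is that the context returned by `tracer.Start` is the one passed to the downstream call, so its spans nest under the new span.

```go
package sqlsketch

import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// getResourceStatsSketch shows the pattern: reuse the span-carrying context for the DB work.
func getResourceStatsSketch(
	ctx context.Context,
	tracer trace.Tracer,
	withTx func(context.Context, func(context.Context) error) error,
) error {
	// Start returns a derived context carrying the new span.
	ctx, span := tracer.Start(ctx, "sql.resource.GetResourceStats")
	defer span.End()

	// Passing the derived ctx (not the caller's original one) makes the
	// transaction's own spans children of this span.
	return withTx(ctx, func(txCtx context.Context) error {
		// run queries with txCtx ...
		return nil
	})
}
```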
pkg/storage/unified/resource/search.go (1)

309-377: `build` tracing context propagation looks good

Starting a span with `ctx, span := s.tracer.Start(ctx, tracingPrexfixSearch+"Build")` and reusing that context through `ListIterator`, `BuildDocument`, and `index.Write` correctly ties index build work to a single span. No functional issues stand out here.

If you have tracing enabled, please confirm that `unified_search.Build` spans now show the expected DB and indexing sub-spans for a sample index build.

pkg/storage/unified/resource/server.go (1)
182-265: Constructor-time `Init` and watcher sequencing look correct; verify caller expectations

Calling `s.Init(ctx)` from `NewResourceServer` and ordering `Init` as lifecycle hooks → search index init → `initWatcher` improves determinism and ensures the watcher only starts once search is ready. The `sync.Once` guard keeps `Init` idempotent.

Please double-check that:

- All production code constructs servers via `NewResourceServer` (no tests or internal code rely on calling `Init` lazily later).
- Startup/shutdown tests still pass with the watcher now started during construction.
Also applies to: 292-318
```go
// TODO - fix this test for postgres
if dbType == "postgres" {
	t.Skip("skipping - test not working with postgres in Drone. Works locally.")
}
```
Postgres skip hides integration regressions; narrow or remove it
The unconditional dbType == "postgres" skip means this integration path is never exercised for postgres, so regressions can go unnoticed. Since the comment already acknowledges this is a TODO, it would be better to scope the skip to CI or tie it to a tracked issue.
```diff
-	// TODO - fix this test for postgres
-	if dbType == "postgres" {
-		t.Skip("skipping - test not working with postgres in Drone. Works locally.")
-	}
+	// TODO(grafana/<issue-id>): fix this test for postgres in CI
+	if dbType == "postgres" && os.Getenv("CI") != "" {
+		t.Skip("skipping in CI - test not working with postgres in Drone. Works locally.")
+	}
```

This keeps local postgres runs meaningful while unblocking Drone until the underlying issue is fixed.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-// TODO - fix this test for postgres
-if dbType == "postgres" {
-	t.Skip("skipping - test not working with postgres in Drone. Works locally.")
-}
+// TODO(grafana/<issue-id>): fix this test for postgres in CI
+if dbType == "postgres" && os.Getenv("CI") != "" {
+	t.Skip("skipping in CI - test not working with postgres in Drone. Works locally.")
+}
```
🤖 Prompt for AI Agents
In pkg/server/module_server_test.go around lines 35 to 38, the test currently
unconditionally skips when dbType == "postgres", which hides postgres
regressions; change the skip to be conditional (e.g., only skip in CI/Drone) or
attach it to a tracked issue. Update the logic to detect CI (for example via an
environment variable like CI or DRONE) and call t.Skipf with the CI condition
and include the issue/PR number in the message, or remove the skip entirely and
fix the underlying flakiness so postgres runs in CI; ensure the test still runs
locally for developers by only skipping when the CI flag is set.
Test 10
Summary by CodeRabbit
Bug Fixes
Improvements
Tests
✏️ Tip: You can customize this high-level summary in your review settings.
Replicated from ai-code-review-evaluation/grafana-coderabbit#10