akshayutture-augment commented Nov 14, 2025

Test 10

Summary by CodeRabbit

  • Bug Fixes

    • Improved resource server initialization reliability by initializing once at startup instead of on every request.
  • Improvements

    • Enhanced search index observability with better logging and tracing context propagation.
    • Optimized cache locking for improved performance.
  • Tests

    • Added postgres database skip condition for integration tests.


Replicated from ai-code-review-evaluation/grafana-coderabbit#10

…#97529)

* dont lazy init unified storage

* Inits index when creating new resource server. Fixes trace propagation by passing span ctx. Update some logging.

* Use finer grained cache locking when building indexes to speed things up. Locking the whole function was slowing things down.

* formatting

* linter fix

* go mod

* make update-workspace

* fix workspaces check error

* update dependency owner in mod file

* wait 1 second before querying metrics

* try with big timeout, see if fixes CI. Wont fail locally.

* skips postgres integration test. Only fails in drone. Will fix later.

* put delay back to 500 ms
@akshayutture-augment
@coderabbitai review

coderabbitai bot commented Nov 22, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot commented Nov 22, 2025

Walkthrough

The pull request consolidates resource server initialization to construction time, removes redundant per-call Init() invocations from RPC methods, propagates tracing contexts consistently across search and SQL operations, switches logging to instance-level loggers, and optimizes cache locking behavior in the search index builder.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Initialization consolidation: pkg/storage/unified/resource/server.go | Moves Init(ctx) to NewResourceServer constructor with error handling; removes per-call Init() from RPC methods (Create, Update, Delete, Read, List, Watch, Search, History, Origin, IsHealthy, PutBlob, GetBlob); reorders initialization to start watcher after search index. |
| Logging and context propagation: pkg/storage/unified/resource/search.go, pkg/storage/unified/sql/backend.go | Removes Grafana CLI logger import; switches logging from package-level to instance logger (s.log); propagates context from tracer.Start calls (ctx, span) instead of discarding; adds initialization duration log line. |
| Concurrency optimization: pkg/storage/unified/search/bleve.go | Removes initial cache lock/unlock in BuildIndex; adds guarded lock/unlock around cache assignment at function end; replaces earlier log action with placeholder comment. |
| Test skip logic: pkg/server/module_server_test.go | Adds postgres database type detection and test skip with message indicating test failure in Drone CI. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant NewResourceServer
    participant Init
    participant SearchInit
    participant WatcherStart
    participant RPC as RPC Methods
    
    Caller->>NewResourceServer: Create resource server
    NewResourceServer->>Init: Call Init(ctx) at construction
    Init->>SearchInit: Initialize search index
    SearchInit-->>Init: Complete
    Init->>WatcherStart: Start watcher (after search)
    WatcherStart-->>Init: Ready
    Init-->>NewResourceServer: Return (or error)
    NewResourceServer-->>Caller: Return server instance
    
    Note over RPC: No redundant Init() calls
    Caller->>RPC: Call Create/Update/Delete/etc.
    RPC-->>Caller: Execute with pre-initialized state

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • pkg/storage/unified/resource/server.go — Initialization refactoring affects multiple RPC methods; verify all code paths handle pre-initialization state and that error propagation is correct if Init fails at construction time.
  • pkg/storage/unified/search/bleve.go — Concurrency change with cache locking requires careful review to confirm thread safety and to rule out race conditions and deadlocks; the shift from an initial lock to an end-of-function lock is a behavioral change that affects timing and contention.
  • pkg/storage/unified/resource/search.go — Context propagation and logger switch; verify instance logger is properly initialized and context is threaded correctly through all call sites.

Poem

🐰 Initialization hops into place
No more per-call Init's race,
Traces flow where contexts bind,
Locks optimized, peace of mind,
Search and watcher aligned!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Unified Storage Performance Optimizations' accurately reflects the main changes across all modified files, which focus on initialization flow optimization, context propagation, and cache locking improvements in the unified storage system. |

coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/storage/unified/search/bleve.go (1)

96-154: Guard TotalDocs map iteration with cacheMu to avoid concurrent map access

BuildIndex now writes to b.cache under cacheMu.Lock() / Unlock(), but TotalDocs still iterates over b.cache without any lock. If BuildIndex and TotalDocs run concurrently, the Go runtime can abort with a fatal "concurrent map iteration and map write" error, and the unsynchronized access is a data race either way.

Protect the read side as well:

 func (b *bleveBackend) TotalDocs() int64 {
-	var totalDocs int64
-	for _, v := range b.cache {
+	b.cacheMu.RLock()
+	defer b.cacheMu.RUnlock()
+
+	var totalDocs int64
+	for _, v := range b.cache {
 		c, err := v.index.DocCount()
 		if err != nil {
 			continue
 		}
 		totalDocs += int64(c)
 	}
 	return totalDocs
 }
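
A minimal, self-contained sketch of the locking pattern discussed here: do the slow build without the lock, take the write lock only for the map assignment, and take the read lock for iteration. The backend and index types are hypothetical stand-ins, and the sketch assumes cacheMu is a sync.RWMutex, which the review's RLock suggestion implies but the diff shown here does not confirm.

```go
package main

import (
	"fmt"
	"sync"
)

// index is a hypothetical stand-in for a search index; only a doc count is modeled.
type index struct{ docs int64 }

// backend mirrors the shape described in the review: a map cache guarded by a mutex.
type backend struct {
	cacheMu sync.RWMutex // assumption: an RWMutex, so readers can share the lock
	cache   map[string]*index
}

// BuildIndex does the (potentially slow) build without holding the lock and
// takes the write lock only for the map assignment.
func (b *backend) BuildIndex(key string, docs int64) *index {
	idx := &index{docs: docs} // expensive work would happen here, unlocked

	b.cacheMu.Lock()
	b.cache[key] = idx
	b.cacheMu.Unlock()
	return idx
}

// TotalDocs holds the read lock while iterating so it never overlaps a write.
func (b *backend) TotalDocs() int64 {
	b.cacheMu.RLock()
	defer b.cacheMu.RUnlock()

	var total int64
	for _, idx := range b.cache {
		total += idx.docs
	}
	return total
}

func main() {
	b := &backend{cache: make(map[string]*index)}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			b.BuildIndex(fmt.Sprintf("ns-%d", i), 10)
			_ = b.TotalDocs() // concurrent read while other goroutines write
		}(i)
	}
	wg.Wait()

	fmt.Println(b.TotalDocs())
}
```

Running a sketch like this with the race detector (go run -race) is a quick way to check that both sides of the cache are covered.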
pkg/storage/unified/resource/search.go (1)

171-221: Fix goroutine closure over info, err, and totalBatchesIndexed in search init

Within init:

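  • info from the for _, info := range stats loop is captured by the closure. On Go versions before 1.22 the loop variable is shared across iterations, so all goroutines race on it and typically see only the last element; from Go 1.22 on, each iteration gets its own copy and this point no longer applies.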
  • totalBatchesIndexed++ is incremented from multiple goroutines without synchronization (data race).
  • _, _, err = s.build(...) mutates the outer err variable concurrently, another data race, and the assignment is unnecessary since the goroutine returns its own error.

These are concurrency bugs even if they existed before; with errgroup.Group they are easy to fix.

 func (s *searchSupport) init(ctx context.Context) error {
 	ctx, span := s.tracer.Start(ctx, tracingPrexfixSearch+"Init")
 	defer span.End()
 	start := time.Now().Unix()
 
 	totalBatchesIndexed := 0
 	group := errgroup.Group{}
 	group.SetLimit(s.initWorkers)
 
 	stats, err := s.storage.GetResourceStats(ctx, "", s.initMinSize)
 	if err != nil {
 		return err
 	}
-
-	for _, info := range stats {
-		group.Go(func() error {
-			s.log.Debug("initializing search index", "namespace", info.Namespace, "group", info.Group, "resource", info.Resource)
-			totalBatchesIndexed++
-			_, _, err = s.build(ctx, info.NamespacedResource, info.Count, info.ResourceVersion)
-			return err
-		})
-	}
-
-	err = group.Wait()
-	if err != nil {
-		return err
-	}
+	totalBatchesIndexed = len(stats)
+
+	for _, info := range stats {
+		info := info // capture loop variable
+		group.Go(func() error {
+			s.log.Debug("initializing search index", "namespace", info.Namespace, "group", info.Group, "resource", info.Resource)
+			_, _, err := s.build(ctx, info.NamespacedResource, info.Count, info.ResourceVersion)
+			return err
+		})
+	}
+
+	if err := group.Wait(); err != nil {
+		return err
+	}
 
 	span.AddEvent("namespaces indexed", trace.WithAttributes(attribute.Int("namespaced_indexed", totalBatchesIndexed)))

This removes the races and still reports how many namespaces were scheduled for indexing.
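
For readers who want to try the pattern in isolation, here is a stdlib-only sketch. It deliberately swaps errgroup.Group for a sync.WaitGroup plus a semaphore channel so it runs without extra modules; resourceStat, buildAll, and errEmpty are hypothetical names, not part of the PR. Unlike the suggestion above, it counts successful builds with an atomic counter rather than using len(stats).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// resourceStat is a hypothetical stand-in for the stats entries iterated in init.
type resourceStat struct {
	Namespace string
	Count     int
}

var errEmpty = errors.New("empty resource")

// buildAll runs build for every stat with at most `workers` goroutines in flight.
// It returns how many builds succeeded and the first error recorded, if any.
func buildAll(stats []resourceStat, workers int, build func(resourceStat) error) (int64, error) {
	var (
		wg       sync.WaitGroup
		built    atomic.Int64 // safe to update from many goroutines
		errOnce  sync.Once
		firstErr error
	)
	sem := make(chan struct{}, workers) // bounds concurrency, like SetLimit

	for _, info := range stats {
		info := info // per-iteration copy; redundant on Go 1.22+, harmless otherwise
		sem <- struct{}{}
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			if err := build(info); err != nil { // err is local to this goroutine
				errOnce.Do(func() { firstErr = err })
				return
			}
			built.Add(1)
		}()
	}
	wg.Wait()
	return built.Load(), firstErr
}

func main() {
	stats := []resourceStat{{"a", 1}, {"b", 2}, {"c", 0}}
	n, err := buildAll(stats, 2, func(s resourceStat) error {
		if s.Count == 0 {
			return errEmpty
		}
		return nil
	})
	fmt.Println(n, err)
}
```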

🧹 Nitpick comments (1)
pkg/storage/unified/resource/search.go (1)

277-285: Enrich high-latency log with more context (optional)

The new Warn log for high index latency is useful. Consider including resource identifiers (namespace/group/resource) in the log fields so slow paths can be tied back to specific kinds more easily.

-	if latencySeconds > 5 {
-		s.log.Warn("high index latency", "latency", latencySeconds)
-	}
+	if latencySeconds > 5 {
+		s.log.Warn("high index latency",
+			"latency", latencySeconds,
+			"namespace", evt.Key.Namespace,
+			"group", evt.Key.Group,
+			"resource", evt.Key.Resource,
+		)
+	}
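
The same idea can be tried outside Grafana with the standard library's log/slog, which accepts alternating key-value fields much like the s.log.Warn call above. This is only an illustration: resourceKey, warnIfSlow, and newBufferLogger are hypothetical helpers, and Grafana's own logger is not slog.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// resourceKey is a hypothetical stand-in for the event key referenced in the review.
type resourceKey struct {
	Namespace, Group, Resource string
}

// newBufferLogger returns a text logger that writes into an in-memory buffer.
func newBufferLogger() (*slog.Logger, *bytes.Buffer) {
	buf := &bytes.Buffer{}
	return slog.New(slog.NewTextHandler(buf, nil)), buf
}

// warnIfSlow logs a warning that includes the resource identifiers when the
// latency exceeds the threshold, and reports whether it logged.
func warnIfSlow(logger *slog.Logger, key resourceKey, latencySeconds, threshold float64) bool {
	if latencySeconds <= threshold {
		return false
	}
	logger.Warn("high index latency",
		"latency", latencySeconds,
		"namespace", key.Namespace,
		"group", key.Group,
		"resource", key.Resource,
	)
	return true
}

func main() {
	logger, buf := newBufferLogger()
	key := resourceKey{Namespace: "default", Group: "example.group", Resource: "dashboards"}
	warnIfSlow(logger, key, 7.5, 5)
	fmt.Print(buf.String())
}
```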
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 871af07 and d762a96.

📒 Files selected for processing (5)
  • pkg/server/module_server_test.go (1 hunks)
  • pkg/storage/unified/resource/search.go (4 hunks)
  • pkg/storage/unified/resource/server.go (2 hunks)
  • pkg/storage/unified/search/bleve.go (2 hunks)
  • pkg/storage/unified/sql/backend.go (1 hunks)
🔇 Additional comments (3)
pkg/storage/unified/sql/backend.go (1)

125-155: Using tracer-returned context in GetResourceStats is correct

Capturing the context from b.tracer.Start and passing it into WithTx ensures the DB work in GetResourceStats is properly associated with the span. The change is sound and aligns this method with the rest of the traced paths.

Please verify via your tracing backend that sql.resource.GetResourceStats spans now include the DB queries from this method.
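
To see why the returned context matters, the following toy sketch models a tracer whose start function has the same shape as a Start(ctx, name) call. It is not the OpenTelemetry API; span, start, query, and demo are hypothetical and exist only to show that a span opened from the original context has no parent, while one opened from the returned context does.

```go
package main

import (
	"context"
	"fmt"
)

// span is a toy span that records only its own name and its parent's name.
type span struct {
	Name, Parent string
}

type spanKey struct{}

// start returns a child context carrying the new span, plus the span itself.
// The parent is looked up from the incoming context, if one is present.
func start(ctx context.Context, name string) (context.Context, *span) {
	s := &span{Name: name}
	if p, ok := ctx.Value(spanKey{}).(*span); ok {
		s.Parent = p.Name
	}
	return context.WithValue(ctx, spanKey{}, s), s
}

// query stands in for DB work that opens its own span from the context it receives.
func query(ctx context.Context) *span {
	_, s := start(ctx, "db.query")
	return s
}

// demo returns the parent seen by the DB span in both variants.
func demo() (discarded, propagated string) {
	root := context.Background()

	// Variant 1: the context returned by start is thrown away.
	_, _ = start(root, "GetResourceStats")
	discarded = query(root).Parent

	// Variant 2: the returned context is passed on to the DB work.
	ctx, _ := start(root, "GetResourceStats")
	propagated = query(ctx).Parent
	return discarded, propagated
}

func main() {
	d, p := demo()
	fmt.Printf("discarded ctx:  parent=%q\n", d)
	fmt.Printf("propagated ctx: parent=%q\n", p)
}
```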

pkg/storage/unified/resource/search.go (1)

309-377: build tracing context propagation looks good

Starting a span with ctx, span := s.tracer.Start(ctx, tracingPrexfixSearch+"Build") and reusing that context through ListIterator, BuildDocument, and index.Write correctly ties index build work to a single span. No functional issues stand out here.

If you have tracing enabled, please confirm that unified_search.Build spans now show the expected DB and indexing sub-spans for a sample index build.

pkg/storage/unified/resource/server.go (1)

182-265: Constructor-time Init and watcher sequencing look correct; verify caller expectations

Calling s.Init(ctx) from NewResourceServer and ordering Init as: lifecycle hooks → search index init → initWatcher improves determinism and ensures the watcher only starts once search is ready. The sync.Once guard keeps Init idempotent.

Please double-check that:

  • All production code constructs servers via NewResourceServer (no tests or internal code rely on calling Init lazily later).
  • Startup/shutdown tests still pass with the watcher now started during construction.

Also applies to: 292-318
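
The constructor-time pattern described above can be sketched in a few lines. This is a reduced illustration under stated assumptions, not the actual server code: the server type, its step names, the failAt knob, and newServer are all hypothetical, and only the sync.Once guard and the ordering (lifecycle, then search, then watcher) come from the review text.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errInit = errors.New("init failed")

// server is a hypothetical, reduced stand-in for the resource server.
type server struct {
	initOnce sync.Once
	initErr  error
	steps    []string // records the order in which init steps ran
	failAt   string   // step that should fail, for illustration only
}

// Init runs the startup steps at most once and caches the result, so repeated
// calls are cheap and always return the same error.
func (s *server) Init() error {
	s.initOnce.Do(func() {
		for _, step := range []string{"lifecycle", "search", "watcher"} {
			if step == s.failAt {
				s.initErr = fmt.Errorf("%s: %w", step, errInit)
				return
			}
			s.steps = append(s.steps, step)
		}
	})
	return s.initErr
}

// newServer initializes eagerly and surfaces Init errors to the caller, so
// request handlers never need to call Init themselves.
func newServer(failAt string) (*server, error) {
	s := &server{failAt: failAt}
	if err := s.Init(); err != nil {
		return nil, err
	}
	return s, nil
}

func main() {
	s, err := newServer("")
	fmt.Println(s.steps, err)

	_ = s.Init() // second call does nothing thanks to sync.Once
	fmt.Println(len(s.steps))

	_, err = newServer("search")
	fmt.Println(err)
}
```

One consequence worth noting: when the search step fails in this sketch, the watcher step never runs, which matches the intent of starting the watcher only once search is ready.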

Comment on lines +35 to +38
	// TODO - fix this test for postgres
	if dbType == "postgres" {
		t.Skip("skipping - test not working with postgres in Drone. Works locally.")
	}
⚠️ Potential issue | 🟠 Major

Postgres skip hides integration regressions; narrow or remove it

The unconditional dbType == "postgres" skip means this integration path is never exercised for postgres, so regressions can go unnoticed. Since the comment already acknowledges this is a TODO, it would be better to scope the skip to CI or tie it to a tracked issue.

-	// TODO - fix this test for postgres
-	if dbType == "postgres" {
-		t.Skip("skipping - test not working with postgres in Drone. Works locally.")
-	}
+	// TODO(grafana/<issue-id>): fix this test for postgres in CI
+	if dbType == "postgres" && os.Getenv("CI") != "" {
+		t.Skip("skipping in CI - test not working with postgres in Drone. Works locally.")
+	}

This keeps local postgres runs meaningful while unblocking Drone until the underlying issue is fixed.
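
One way to keep the skip condition itself testable is to pull it into a small helper with the environment lookup injected. The sketch below is hypothetical: shouldSkipPostgres and envFrom are illustrative names, and the CI and DRONE marker variables are assumptions that should be checked against what the pipeline actually sets.

```go
package main

import "fmt"

// shouldSkipPostgres reports whether the test should be skipped: only for
// postgres, and only when a CI marker variable is set. getenv is injected so
// the decision can be exercised without touching the real environment.
func shouldSkipPostgres(dbType string, getenv func(string) string) bool {
	if dbType != "postgres" {
		return false
	}
	// Assumed marker variables; confirm the names used by the CI system.
	return getenv("CI") != "" || getenv("DRONE") != ""
}

// envFrom builds a getenv-style lookup backed by a map (a nil map yields "").
func envFrom(m map[string]string) func(string) string {
	return func(k string) string { return m[k] }
}

func main() {
	fmt.Println(shouldSkipPostgres("postgres", envFrom(map[string]string{"CI": "true"})))
	fmt.Println(shouldSkipPostgres("postgres", envFrom(nil)))
	fmt.Println(shouldSkipPostgres("mysql", envFrom(map[string]string{"CI": "true"})))
}
```

In a real test the helper would be called with os.Getenv, followed by t.Skip when it returns true.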

🤖 Prompt for AI Agents
In pkg/server/module_server_test.go around lines 35 to 38, the test currently
unconditionally skips when dbType == "postgres", which hides postgres
regressions; change the skip to be conditional (e.g., only skip in CI/Drone) or
attach it to a tracked issue. Update the logic to detect CI (for example via an
environment variable like CI or DRONE) and call t.Skipf with the CI condition
and include the issue/PR number in the message, or remove the skip entirely and
fix the underlying flakiness so postgres runs in CI; ensure the test still runs
locally for developers by only skipping when the CI flag is set.
