Unified Storage Performance Optimizations #1
base: performance-optimization-baseline
Conversation
…#97529)

* dont lazy init unified storage
* Inits index when creating new resource server. Fixes trace propagation by passing span ctx. Update some logging.
* Use finer grained cache locking when building indexes to speed things up. Locking the whole function was slowing things down.
* formatting
* linter fix
* go mod
* make update-workspace
* fix workspaces check error
* update dependency owner in mod file
* wait 1 second before querying metrics
* try with big timeout, see if fixes CI. Wont fail locally.
* skips postgres integration test. Only fails in drone. Will fix later.
* put delay back to 500 ms
@coderabbitai review

✅ Actions performed: Review triggered.
Walkthrough

The pull request consolidates resource server initialization to construction time, removes redundant per-call `Init()` invocations from RPC methods, propagates tracing contexts consistently across search and SQL operations, switches logging to instance-level loggers, and optimizes cache locking behavior in the search index builder.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant NewResourceServer
    participant Init
    participant SearchInit
    participant WatcherStart
    participant RPC as RPC Methods

    Caller->>NewResourceServer: Create resource server
    NewResourceServer->>Init: Call Init(ctx) at construction
    Init->>SearchInit: Initialize search index
    SearchInit-->>Init: Complete
    Init->>WatcherStart: Start watcher (after search)
    WatcherStart-->>Init: Ready
    Init-->>NewResourceServer: Return (or error)
    NewResourceServer-->>Caller: Return server instance
    Note over RPC: No redundant Init() calls
    Caller->>RPC: Call Create/Update/Delete/etc.
    RPC-->>Caller: Execute with pre-initialized state
```
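To make the construction-time initialization concrete, here is a minimal Go sketch of the pattern the diagram describes: a constructor that runs `Init` eagerly, builds the search index before starting the watcher, and stays idempotent via `sync.Once`. The names and types below are hypothetical stand-ins, not the actual Grafana resource server.

```go
package resourcesketch

import (
	"context"
	"fmt"
	"sync"
)

// searchIndex and watcher are hypothetical dependencies used only for illustration.
type searchIndex interface {
	Init(ctx context.Context) error
}

type watcher interface {
	Start(ctx context.Context) error
}

// server sketches a resource server that is fully initialized at construction time.
type server struct {
	initOnce sync.Once
	initErr  error

	search searchIndex
	watch  watcher
}

// newServer runs Init during construction, so callers either get a ready
// server or an error, instead of deferring failures to the first RPC call.
func newServer(ctx context.Context, s searchIndex, w watcher) (*server, error) {
	srv := &server{search: s, watch: w}
	if err := srv.Init(ctx); err != nil {
		return nil, fmt.Errorf("resource server init: %w", err)
	}
	return srv, nil
}

// Init is idempotent: sync.Once guarantees the search index is initialized
// exactly once, and the watcher is only started after search is ready.
func (s *server) Init(ctx context.Context) error {
	s.initOnce.Do(func() {
		if err := s.search.Init(ctx); err != nil {
			s.initErr = err
			return
		}
		s.initErr = s.watch.Start(ctx)
	})
	return s.initErr
}
```

With this shape, RPC handlers can assume a ready index and watcher, which is what makes removing the per-call `Init()` invocations safe.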
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/storage/unified/search/bleve.go (1)
96-154: Guard `TotalDocs` map iteration with `cacheMu` to avoid concurrent map access

`BuildIndex` now writes to `b.cache` under `cacheMu.Lock()`/`Unlock()`, but `TotalDocs` still iterates over `b.cache` without any lock. Concurrent calls to `BuildIndex` and `TotalDocs` risk a "concurrent map read and map write" panic. Protect the read side as well:
```diff
 func (b *bleveBackend) TotalDocs() int64 {
-	var totalDocs int64
-	for _, v := range b.cache {
+	b.cacheMu.RLock()
+	defer b.cacheMu.RUnlock()
+
+	var totalDocs int64
+	for _, v := range b.cache {
 		c, err := v.index.DocCount()
 		if err != nil {
 			continue
 		}
 		totalDocs += int64(c)
 	}
 	return totalDocs
 }
```
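For readers less familiar with this locking pattern, here is a small self-contained sketch (a hypothetical cache type, not the bleve backend itself) of the same idea: writers take the exclusive lock, readers take the shared lock, and `go test -race` flags the unguarded variant.

```go
package cachesketch

import "sync"

// indexCache guards a map with an RWMutex: writers use Lock, readers use RLock.
type indexCache struct {
	mu    sync.RWMutex
	items map[string]int64 // e.g. index name -> document count
}

func newIndexCache() *indexCache {
	return &indexCache{items: make(map[string]int64)}
}

// set is the write path (analogous to BuildIndex updating b.cache).
func (c *indexCache) set(name string, docs int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[name] = docs
}

// totalDocs is the read path (analogous to TotalDocs); the shared RLock allows
// concurrent readers while still excluding writers.
func (c *indexCache) totalDocs() int64 {
	c.mu.RLock()
	defer c.mu.RUnlock()

	var total int64
	for _, n := range c.items {
		total += n
	}
	return total
}
```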
pkg/storage/unified/resource/search.go (1)

171-221: Fix goroutine closure over `info`, `err`, and `totalBatchesIndexed` in search init

Within `init`:

- `info` from the `for _, info := range stats` loop is captured by reference, so all goroutines race on the same variable and typically see only the last element.
- `totalBatchesIndexed++` is incremented from multiple goroutines without synchronization (data race).
- `_, _, err = s.build(...)` mutates the outer `err` variable concurrently, another data race, and the assignment is unnecessary since the goroutine returns its own error.

These are concurrency bugs even if they existed before; with `errgroup.Group` they are easy to fix.

```diff
 func (s *searchSupport) init(ctx context.Context) error {
 	ctx, span := s.tracer.Start(ctx, tracingPrexfixSearch+"Init")
 	defer span.End()
 	start := time.Now().Unix()
 	totalBatchesIndexed := 0
 	group := errgroup.Group{}
 	group.SetLimit(s.initWorkers)

 	stats, err := s.storage.GetResourceStats(ctx, "", s.initMinSize)
 	if err != nil {
 		return err
 	}
-
-	for _, info := range stats {
-		group.Go(func() error {
-			s.log.Debug("initializing search index", "namespace", info.Namespace, "group", info.Group, "resource", info.Resource)
-			totalBatchesIndexed++
-			_, _, err = s.build(ctx, info.NamespacedResource, info.Count, info.ResourceVersion)
-			return err
-		})
-	}
-
-	err = group.Wait()
-	if err != nil {
-		return err
-	}
+	totalBatchesIndexed = len(stats)
+
+	for _, info := range stats {
+		info := info // capture loop variable
+		group.Go(func() error {
+			s.log.Debug("initializing search index", "namespace", info.Namespace, "group", info.Group, "resource", info.Resource)
+			_, _, err := s.build(ctx, info.NamespacedResource, info.Count, info.ResourceVersion)
+			return err
+		})
+	}
+
+	if err := group.Wait(); err != nil {
+		return err
+	}

 	span.AddEvent("namespaces indexed", trace.WithAttributes(attribute.Int("namespaced_indexed", totalBatchesIndexed)))
```

This removes the races and still reports how many namespaces were scheduled for indexing.
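As a standalone illustration of the loop-variable capture problem and the `errgroup` pattern (a generic runnable example, not the Grafana code; note that Go 1.22+ scopes loop variables per iteration, which removes the need for the re-binding line):

```go
package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	namespaces := []string{"default", "team-a", "team-b"}

	var group errgroup.Group
	group.SetLimit(2) // bound concurrency, like initWorkers in the review

	for _, ns := range namespaces {
		ns := ns // re-bind so each goroutine captures its own copy (needed before Go 1.22)
		group.Go(func() error {
			// Each goroutine returns its own error instead of writing to a shared variable.
			fmt.Println("indexing namespace", ns)
			return nil
		})
	}

	// Wait returns the first non-nil error from the goroutines.
	if err := group.Wait(); err != nil {
		fmt.Println("init failed:", err)
	}
}
```

Without the re-binding line (on older Go versions), every goroutine would typically print the last namespace, which is exactly the bug flagged above.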
🧹 Nitpick comments (1)
pkg/storage/unified/resource/search.go (1)
277-285: Enrich high-latency log with more context (optional)

The new `Warn` log for "high index latency" is useful. Consider including resource identifiers (namespace/group/resource) in the log fields so slow paths can be tied back to specific kinds more easily.

```diff
-	if latencySeconds > 5 {
-		s.log.Warn("high index latency", "latency", latencySeconds)
-	}
+	if latencySeconds > 5 {
+		s.log.Warn("high index latency",
+			"latency", latencySeconds,
+			"namespace", evt.Key.Namespace,
+			"group", evt.Key.Group,
+			"resource", evt.Key.Resource,
+		)
+	}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- pkg/server/module_server_test.go (1 hunks)
- pkg/storage/unified/resource/search.go (4 hunks)
- pkg/storage/unified/resource/server.go (2 hunks)
- pkg/storage/unified/search/bleve.go (2 hunks)
- pkg/storage/unified/sql/backend.go (1 hunks)
🔇 Additional comments (3)
pkg/storage/unified/sql/backend.go (1)
125-155: Using tracer-returned context in `GetResourceStats` is correct

Capturing the context from `b.tracer.Start` and passing it into `WithTx` ensures the DB work in `GetResourceStats` is properly associated with the span. The change is sound and aligns this method with the rest of the traced paths.

Please verify via your tracing backend that `sql.resource.GetResourceStats` spans now include the DB queries from this method.
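For context, the shape of the fix looks roughly like the sketch below. The `withTx` helper and its signature are hypothetical; only `tracer.Start` and the span API are standard OpenTelemetry. The key point is that the context returned by `tracer.Start` is the one passed to the downstream call, so its spans nest under the new span.

```go
package sqlsketch

import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// getResourceStatsSketch shows the pattern: reuse the span-carrying context for the DB work.
func getResourceStatsSketch(
	ctx context.Context,
	tracer trace.Tracer,
	withTx func(context.Context, func(context.Context) error) error,
) error {
	// Start returns a derived context carrying the new span.
	ctx, span := tracer.Start(ctx, "sql.resource.GetResourceStats")
	defer span.End()

	// Passing the derived ctx (not the caller's original one) makes the
	// transaction's own spans children of this span.
	return withTx(ctx, func(txCtx context.Context) error {
		// run queries with txCtx ...
		return nil
	})
}
```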
pkg/storage/unified/resource/search.go (1)

309-377: `build` tracing context propagation looks good

Starting a span with `ctx, span := s.tracer.Start(ctx, tracingPrexfixSearch+"Build")` and reusing that context through `ListIterator`, `BuildDocument`, and `index.Write` correctly ties index build work to a single span. No functional issues stand out here.

If you have tracing enabled, please confirm that `unified_search.Build` spans now show the expected DB and indexing sub-spans for a sample index build.

pkg/storage/unified/resource/server.go (1)
182-265: Constructor-time `Init` and watcher sequencing look correct; verify caller expectations

Calling `s.Init(ctx)` from `NewResourceServer` and ordering `Init` as lifecycle hooks → search index init → `initWatcher` improves determinism and ensures the watcher only starts once search is ready. The `sync.Once` guard keeps `Init` idempotent.

Please double-check that:

- All production code constructs servers via `NewResourceServer` (no tests or internal code rely on calling `Init` lazily later).
- Startup/shutdown tests still pass with the watcher now started during construction.
Also applies to: 292-318
```go
// TODO - fix this test for postgres
if dbType == "postgres" {
	t.Skip("skipping - test not working with postgres in Drone. Works locally.")
}
```
Postgres skip hides integration regressions; narrow or remove it
The unconditional dbType == "postgres" skip means this integration path is never exercised for postgres, so regressions can go unnoticed. Since the comment already acknowledges this is a TODO, it would be better to scope the skip to CI or tie it to a tracked issue.
```diff
-	// TODO - fix this test for postgres
-	if dbType == "postgres" {
-		t.Skip("skipping - test not working with postgres in Drone. Works locally.")
-	}
+	// TODO(grafana/<issue-id>): fix this test for postgres in CI
+	if dbType == "postgres" && os.Getenv("CI") != "" {
+		t.Skip("skipping in CI - test not working with postgres in Drone. Works locally.")
+	}
```

This keeps local postgres runs meaningful while unblocking Drone until the underlying issue is fixed.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-// TODO - fix this test for postgres
-if dbType == "postgres" {
-	t.Skip("skipping - test not working with postgres in Drone. Works locally.")
-}
+// TODO(grafana/<issue-id>): fix this test for postgres in CI
+if dbType == "postgres" && os.Getenv("CI") != "" {
+	t.Skip("skipping in CI - test not working with postgres in Drone. Works locally.")
+}
```
🤖 Prompt for AI Agents
In pkg/server/module_server_test.go around lines 35 to 38, the test currently
unconditionally skips when dbType == "postgres", which hides postgres
regressions; change the skip to be conditional (e.g., only skip in CI/Drone) or
attach it to a tracked issue. Update the logic to detect CI (for example via an
environment variable like CI or DRONE) and call t.Skipf with the CI condition
and include the issue/PR number in the message, or remove the skip entirely and
fix the underlying flakiness so postgres runs in CI; ensure the test still runs
locally for developers by only skipping when the CI flag is set.
Test 10
Summary by CodeRabbit
Bug Fixes
Improvements
Tests
✏️ Tip: You can customize this high-level summary in your review settings.
Replicated from ai-code-review-evaluation/grafana-coderabbit#10