Description
Thanos, Prometheus and Golang version used:
Current main, in CI
Object Storage Provider:
What happened:
End-to-end tests occasionally fail because the Go race detector reports a DATA RACE in the querier.
Example: https://github.com/thanos-io/thanos/actions/runs/21631782081/job/62346156098
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Run e2e tests
Full logs to relevant components:
Logs
13:24:57 querier-1: ==================
13:24:57 querier-1: WARNING: DATA RACE
13:24:57 querier-1: Write at 0x00c00035dd80 by goroutine 172:
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*endpointRef).updateStatus()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:683 +0x42a
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*endpointRef).update()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:671 +0x33a
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*EndpointSet).updateEndpoint()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:455 +0x51d
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:349 +0x15e
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update.gowrap2()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:354 +0x41
13:24:57 querier-1: Previous read at 0x00c00035dd80 by goroutine 169:
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*endpointRef).LabelSets()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:794 +0xd1
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/store.newAsyncRespSet()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/store/proxy_merge.go:631 +0xdd3
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/store.(*ProxyStore).Series()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/store/proxy.go:335 +0x1bd7
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/store.(*instrumentedStoreServer).Series()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/store/telemetry.go:181 +0x1d0
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/store.(*limitedStoreServer).Series()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/store/limiter.go:145 +0x2b4
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*querier).selectFn()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/querier.go:384 +0x80b
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*querier).Select.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/querier.go:324 +0x347
13:24:57 querier-1: Goroutine 172 (running) created at:
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:345 +0x912
13:24:57 querier-1: main.setupEndpointSet.func9.1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/cmd/thanos/endpointset.go:375 +0x9d
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/runutil.Repeat()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/runutil/runutil.go:91 +0x101
13:24:57 querier-1: main.setupEndpointSet.func9()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/cmd/thanos/endpointset.go:371 +0xdb
13:24:57 querier-1: github.com/oklog/run.(*Group).Run.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/oklog/run@v1.2.0/group.go:38 +0x39
13:24:57 querier-1: github.com/oklog/run.(*Group).Run.gowrap1()
13:24:57 querier-1: /go/pkg/mod/github.com/oklog/run@v1.2.0/group.go:39 +0x4f
13:24:57 querier-1: Goroutine 169 (running) created at:
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*querier).Select()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/querier.go:308 +0x924
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*Engine).populateSeries.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:981 +0x57d
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.inspector.Visit()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:364 +0x62
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.Walk()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:334 +0xa1
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.Walk()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:340 +0x21e
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.Walk()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:340 +0x21e
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.Inspect()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:375 +0x244
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*Engine).populateSeries()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:964 +0x12
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:726 +0x2b0
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*Engine).exec()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:687 +0x645
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*query).Exec()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:245 +0x209
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).query.func14()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/api/query/v1.go:679 +0x63
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:95 +0x14f
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).query()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/api/query/v1.go:678 +0x1355
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).query-fm()
13:24:57 querier-1: <autogenerated>:1 +0x45
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).Register.(*QueryAPI).Register.GetInstr.func1.func2()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/api/api.go:233 +0x8a
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).Register.(*QueryAPI).Register.GetInstr.func1.(*HTTPServerMiddleware).HTTPMiddleware.func9()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/logging/http.go:86 +0x368
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/klauspost/compress/gzhttp.NewWrapper.func1.1()
13:24:57 querier-1: /go/pkg/mod/github.com/klauspost/compress@v1.18.0/gzhttp/compress.go:519 +0x9c6
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/extprom/http.httpInstrumentationHandler.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/extprom/http/instrument_server.go:75 +0x161
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/client_golang@v1.23.0-rc.1/prometheus/promhttp/instrument_server.go:296 +0xe8
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/client_golang@v1.23.0-rc.1/prometheus/promhttp/instrument_server.go:147 +0xe1
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/extprom/http.httpInstrumentationHandler.instrumentHandlerInFlight.func2()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/extprom/http/instrument_server.go:164 +0x1c5
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerRequestSize.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/client_golang@v1.23.0-rc.1/prometheus/promhttp/instrument_server.go:243 +0xe8
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x1ef
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/extprom/http.(*tenantInstrumentationMiddleware).NewHandler.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/extprom/http/instrument_tenant_server.go:43 +0x1cf
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/tracing.HTTPMiddleware.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/tracing/http.go:67 +0xf37
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).Register.(*QueryAPI).Register.GetInstr.func1.RequestID.func10()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/server/http/middleware/request_id.go:40 +0x18a
13:24:57 querier-1: github.com/prometheus/common/route.(*Router).handle.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/common@v0.65.1-0.20250703115700-7f8b2a0d32d3/route/route.go:83 +0x2ed
13:24:57 querier-1: github.com/julienschmidt/httprouter.(*Router).ServeHTTP()
13:24:57 querier-1: /go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0xee2
13:24:57 querier-1: github.com/prometheus/common/route.(*Router).ServeHTTP()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/common@v0.65.1-0.20250703115700-7f8b2a0d32d3/route/route.go:126 +0x53
13:24:57 querier-1: net/http.(*ServeMux).ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2861 +0x242
13:24:57 querier-1: net/http.serverHandler.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:3340 +0x2a1
13:24:57 querier-1: net/http.(*conn).serve()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2109 +0xda4
13:24:57 querier-1: net/http.(*Server).Serve.gowrap3()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:3493 +0x4f
13:24:57 querier-1: ==================
Anything else we need to know:
Possible root cause identified by Claude:
The Bug: GetStoreClients() Creates a Shallow Copy with a Different Mutex
    // GetStoreClients returns a list of all active stores.
    func (e *EndpointSet) GetStoreClients() []store.Client {
        endpoints := e.getQueryableRefs()
        stores := make([]store.Client, 0, len(endpoints))
        for _, er := range endpoints {
            if er.HasStoreAPI() {
                er.mtx.RLock()
                stores = append(stores, &endpointRef{
                    StoreClient: storepb.NewStoreClient(er.cc),
                    addr:        er.addr,
                    metadata:    er.metadata, // ← SHARED pointer
                    status:      er.status,   // ← SHARED pointer
                    // mtx is NOT copied - gets zero-value (new mutex)
                })
                er.mtx.RUnlock()
            }
        }
        return stores
    }
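The shallow-copy pattern above can be reproduced in miniature. The following is only a sketch: `ref`, `status`, `shallowCopy`, `sharesState` and `sharesLock` are hypothetical stand-ins and not Thanos code. It shows that a copy built this way points at the same status value while being guarded by its own, separate mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// status is a hypothetical stand-in for the endpoint status struct.
type status struct {
	LabelSets []string
}

// ref is a hypothetical stand-in for endpointRef.
type ref struct {
	mtx    sync.RWMutex
	status *status
}

// shallowCopy mirrors the pattern in GetStoreClients: pointer fields are
// copied, while mtx is left at its zero value (a brand-new mutex).
func shallowCopy(r *ref) *ref {
	r.mtx.RLock()
	defer r.mtx.RUnlock()
	return &ref{status: r.status}
}

// sharesState reports whether two refs point at the same status value.
func sharesState(a, b *ref) bool { return a.status == b.status }

// sharesLock reports whether two refs are guarded by the same mutex.
func sharesLock(a, b *ref) bool { return &a.mtx == &b.mtx }

func main() {
	orig := &ref{status: &status{LabelSets: []string{"a"}}}
	cp := shallowCopy(orig)
	fmt.Println("same status pointer:", sharesState(orig, cp)) // true
	fmt.Println("same mutex:", sharesLock(orig, cp))           // false
}
```

Holding `er.mtx.RLock()` while building the copy only protects the act of copying the pointers. It does nothing for later reads made through the copy.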
The Race
Goroutine 1 (update loop):
    original.mtx.Lock()                // Locks Mutex A
    original.status.LabelSets = ...    // Writes to shared status
    original.mtx.Unlock()
Goroutine 2 (query path):
    newCopy.mtx.RLock()                // Locks Mutex B (different mutex!)
    defer newCopy.mtx.RUnlock()
    return newCopy.status.LabelSets    // Reads from SAME shared status
Two different mutexes → no synchronization → data race!
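One possible direction for a fix, sketched under assumptions, is to make every handle to the shared state use the same lock, for example by storing the mutex behind a pointer. This is not the upstream fix. The types `ref` and `status` and the helpers `newRef`, `view`, `setLabelSets`, `labelSets` and `run` are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// status is a hypothetical stand-in for the endpoint status struct.
type status struct {
	LabelSets []string
}

// ref is a hypothetical stand-in for endpointRef. The mutex is held by
// pointer so that the original and all of its views share one lock.
type ref struct {
	mtx    *sync.RWMutex
	status *status
}

func newRef() *ref {
	return &ref{mtx: &sync.RWMutex{}, status: &status{}}
}

// view returns a second handle to the same state, guarded by the same lock.
func (r *ref) view() *ref {
	return &ref{mtx: r.mtx, status: r.status}
}

// setLabelSets plays the role of the update loop's write.
func (r *ref) setLabelSets(ls []string) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	r.status.LabelSets = ls
}

// labelSets plays the role of the query path's read. It returns a copy so
// callers never alias the guarded slice after the lock is released.
func (r *ref) labelSets() []string {
	r.mtx.RLock()
	defer r.mtx.RUnlock()
	return append([]string(nil), r.status.LabelSets...)
}

// run writes through the original and reads through a view concurrently,
// then returns the final label sets as seen through the view.
func run(n int) []string {
	orig := newRef()
	cp := orig.view()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			orig.setLabelSets([]string{fmt.Sprintf("set-%d", i)})
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			_ = cp.labelSets()
		}
	}()
	wg.Wait()
	return cp.labelSets()
}

func main() {
	fmt.Println(run(1000)) // [set-999]
}
```

Because the writer and the reader take the same `sync.RWMutex`, every write is ordered with respect to every read, so this sketch should run cleanly under `go run -race`. Another option would be to copy the needed fields by value under the original's lock, which avoids sharing a lock but means the copy no longer sees later updates.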