DATA RACE exposes locking bug #8662

@madaraszg-tulip

Thanos, Prometheus and Golang version used:

Current main, in CI

Object Storage Provider:

What happened:

End-to-end tests occasionally fail with a DATA RACE warning from the Go race detector.

Example: https://github.com/thanos-io/thanos/actions/runs/21631782081/job/62346156098

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Run e2e tests

Full logs to relevant components:

Logs

13:24:57 querier-1: ==================
13:24:57 querier-1: WARNING: DATA RACE
13:24:57 querier-1: Write at 0x00c00035dd80 by goroutine 172:
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*endpointRef).updateStatus()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:683 +0x42a
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*endpointRef).update()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:671 +0x33a
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*EndpointSet).updateEndpoint()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:455 +0x51d
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:349 +0x15e
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update.gowrap2()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:354 +0x41
13:24:57 querier-1: Previous read at 0x00c00035dd80 by goroutine 169:
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*endpointRef).LabelSets()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:794 +0xd1
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/store.newAsyncRespSet()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/store/proxy_merge.go:631 +0xdd3
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/store.(*ProxyStore).Series()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/store/proxy.go:335 +0x1bd7
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/store.(*instrumentedStoreServer).Series()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/store/telemetry.go:181 +0x1d0
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/store.(*limitedStoreServer).Series()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/store/limiter.go:145 +0x2b4
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*querier).selectFn()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/querier.go:384 +0x80b
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*querier).Select.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/querier.go:324 +0x347
13:24:57 querier-1: Goroutine 172 (running) created at:
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/endpointset.go:345 +0x912
13:24:57 querier-1: main.setupEndpointSet.func9.1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/cmd/thanos/endpointset.go:375 +0x9d
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/runutil.Repeat()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/runutil/runutil.go:91 +0x101
13:24:57 querier-1: main.setupEndpointSet.func9()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/cmd/thanos/endpointset.go:371 +0xdb
13:24:57 querier-1: github.com/oklog/run.(*Group).Run.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/oklog/run@v1.2.0/group.go:38 +0x39
13:24:57 querier-1: github.com/oklog/run.(*Group).Run.gowrap1()
13:24:57 querier-1: /go/pkg/mod/github.com/oklog/run@v1.2.0/group.go:39 +0x4f
13:24:57 querier-1: Goroutine 169 (running) created at:
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/query.(*querier).Select()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/query/querier.go:308 +0x924
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*Engine).populateSeries.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:981 +0x57d
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.inspector.Visit()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:364 +0x62
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.Walk()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:334 +0xa1
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.Walk()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:340 +0x21e
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.Walk()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:340 +0x21e
13:24:57 querier-1: github.com/prometheus/prometheus/promql/parser.Inspect()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/parser/ast.go:375 +0x244
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*Engine).populateSeries()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:964 +0x12
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:726 +0x2b0
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*Engine).exec()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:687 +0x645
13:24:57 querier-1: github.com/prometheus/prometheus/promql.(*query).Exec()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/prometheus@v0.305.1-0.20250721065454-b09cf6be8d56/promql/engine.go:245 +0x209
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).query.func14()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/api/query/v1.go:679 +0x63
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:95 +0x14f
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).query()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/api/query/v1.go:678 +0x1355
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).query-fm()
13:24:57 querier-1: <autogenerated>:1 +0x45
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).Register.(*QueryAPI).Register.GetInstr.func1.func2()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/api/api.go:233 +0x8a
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).Register.(*QueryAPI).Register.GetInstr.func1.(*HTTPServerMiddleware).HTTPMiddleware.func9()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/logging/http.go:86 +0x368
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/klauspost/compress/gzhttp.NewWrapper.func1.1()
13:24:57 querier-1: /go/pkg/mod/github.com/klauspost/compress@v1.18.0/gzhttp/compress.go:519 +0x9c6
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/extprom/http.httpInstrumentationHandler.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/extprom/http/instrument_server.go:75 +0x161
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/client_golang@v1.23.0-rc.1/prometheus/promhttp/instrument_server.go:296 +0xe8
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/client_golang@v1.23.0-rc.1/prometheus/promhttp/instrument_server.go:147 +0xe1
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/extprom/http.httpInstrumentationHandler.instrumentHandlerInFlight.func2()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/extprom/http/instrument_server.go:164 +0x1c5
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerRequestSize.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/client_golang@v1.23.0-rc.1/prometheus/promhttp/instrument_server.go:243 +0xe8
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x1ef
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/extprom/http.(*tenantInstrumentationMiddleware).NewHandler.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/extprom/http/instrument_tenant_server.go:43 +0x1cf
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/tracing.HTTPMiddleware.func1()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/tracing/http.go:67 +0xf37
13:24:57 querier-1: net/http.HandlerFunc.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2322 +0x47
13:24:57 querier-1: github.com/thanos-io/thanos/pkg/api/query.(*QueryAPI).Register.(*QueryAPI).Register.GetInstr.func1.RequestID.func10()
13:24:57 querier-1: /go/src/github.com/thanos-io/thanos/pkg/server/http/middleware/request_id.go:40 +0x18a
13:24:57 querier-1: github.com/prometheus/common/route.(*Router).handle.func1()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/common@v0.65.1-0.20250703115700-7f8b2a0d32d3/route/route.go:83 +0x2ed
13:24:57 querier-1: github.com/julienschmidt/httprouter.(*Router).ServeHTTP()
13:24:57 querier-1: /go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0xee2
13:24:57 querier-1: github.com/prometheus/common/route.(*Router).ServeHTTP()
13:24:57 querier-1: /go/pkg/mod/github.com/prometheus/common@v0.65.1-0.20250703115700-7f8b2a0d32d3/route/route.go:126 +0x53
13:24:57 querier-1: net/http.(*ServeMux).ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2861 +0x242
13:24:57 querier-1: net/http.serverHandler.ServeHTTP()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:3340 +0x2a1
13:24:57 querier-1: net/http.(*conn).serve()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:2109 +0xda4
13:24:57 querier-1: net/http.(*Server).Serve.gowrap3()
13:24:57 querier-1: /usr/local/go/src/net/http/server.go:3493 +0x4f
13:24:57 querier-1: ==================

Anything else we need to know:

Possible root cause identified by Claude:

  The Bug: GetStoreClients() Creates a Shallow Copy with a Different Mutex

  // GetStoreClients returns a list of all active stores.
  func (e *EndpointSet) GetStoreClients() []store.Client {
      endpoints := e.getQueryableRefs()
      stores := make([]store.Client, 0, len(endpoints))
      for _, er := range endpoints {
          if er.HasStoreAPI() {
              er.mtx.RLock()
              stores = append(stores, &endpointRef{
                  StoreClient: storepb.NewStoreClient(er.cc),
                  addr:        er.addr,
                  metadata:    er.metadata,  // ← SHARED pointer
                  status:      er.status,    // ← SHARED pointer
                  // mtx is NOT copied - gets zero-value (new mutex)
              })
              er.mtx.RUnlock()
          }
      }
      return stores
  }
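The pattern above can be reproduced in isolation. The following is a minimal sketch, not Thanos code: `ref`, `status`, and `shallowCopy` are hypothetical stand-ins for `endpointRef`, its status field, and the struct literal in `GetStoreClients()`. It only shows that a copy built this way shares the pointer while carrying its own, unrelated mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// status is a hypothetical, simplified stand-in for the shared state.
type status struct {
	LabelSets []string
}

// ref is a hypothetical stand-in for endpointRef: a mutex guarding a pointer field.
type ref struct {
	mtx    sync.RWMutex
	status *status
}

// shallowCopy mirrors the struct-literal pattern: the pointer field is copied,
// while mtx is left at its zero value, i.e. a fresh, independent mutex.
func shallowCopy(r *ref) *ref {
	r.mtx.RLock()
	defer r.mtx.RUnlock()
	return &ref{status: r.status}
}

func main() {
	original := &ref{status: &status{LabelSets: []string{"a"}}}
	cp := shallowCopy(original)

	// Both structs point at the same status value.
	fmt.Println("shared status:", original.status == cp.status)

	// Hold the original's write lock, as an update loop would.
	original.mtx.Lock()
	// The copy's read lock is still available, so it offers no protection
	// against the writer. TryRLock requires Go 1.18 or newer.
	acquired := cp.mtx.TryRLock()
	if acquired {
		cp.mtx.RUnlock()
	}
	original.mtx.Unlock()

	fmt.Println("copy could read-lock during write:", acquired)
}
```

If both lines report `true`, the copy and the original guard the same data with two unrelated locks, which is the situation the race detector flags.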

  The Race

  Goroutine 1 (update loop):
  original.mtx.Lock()                // Locks Mutex A
  original.status.LabelSets = ...    // Writes to shared status
  original.mtx.Unlock()

  Goroutine 2 (query path):
  newCopy.mtx.RLock()                // Locks Mutex B (different mutex!)
  defer newCopy.mtx.RUnlock()
  return newCopy.status.LabelSets    // Reads from SAME shared status

  Two different mutexes → no synchronization → data race!
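One possible direction for a fix, assuming the diagnosis above is correct, is to make every handle to the shared state use the same lock. The sketch below is illustrative only and is not the actual Thanos change: `guardedRef`, `view`, `setLabelSets`, and `labelSets` are hypothetical names. The mutex is held by pointer, so a second handle shares it together with the data.

```go
package main

import (
	"fmt"
	"sync"
)

// status is a hypothetical, simplified stand-in for the shared state.
type status struct {
	LabelSets []string
}

// guardedRef holds its mutex by pointer so that additional handles can share it.
type guardedRef struct {
	mtx    *sync.RWMutex
	status *status
}

func newGuardedRef(sets ...string) *guardedRef {
	return &guardedRef{mtx: &sync.RWMutex{}, status: &status{LabelSets: sets}}
}

// view returns a second handle that shares both the status and its mutex.
func (r *guardedRef) view() *guardedRef {
	return &guardedRef{mtx: r.mtx, status: r.status}
}

// setLabelSets replaces the label sets under the write lock.
func (r *guardedRef) setLabelSets(sets []string) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	r.status.LabelSets = sets
}

// labelSets returns a copy of the label sets, taken under the read lock.
func (r *guardedRef) labelSets() []string {
	r.mtx.RLock()
	defer r.mtx.RUnlock()
	return append([]string(nil), r.status.LabelSets...)
}

func main() {
	original := newGuardedRef("a")
	v := original.view()

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer, standing in for the update loop
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			original.setLabelSets([]string{"b"})
		}
	}()
	go func() { // reader, standing in for the query path
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = v.labelSets()
		}
	}()
	wg.Wait()

	fmt.Println(v.labelSets())
}
```

Running this with `go run -race` should not produce a report, because the reader and the writer now contend on the same mutex. An alternative with different trade-offs would be to deep-copy the state under the original's lock so that each handle owns its data.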
