
Inconsistent store pointers between the region-cache and store-cache cause stale regions to become inaccessible. #1823

@AndreMouche

Description


I believe this is a bug related to inconsistent state between region-cache and store-cache when a TiKV store updates its address or labels.

if s.addr != addr || !s.IsSameLabels(store.GetLabels()) {
    newStore := newStore(
        s.storeID,
        addr,
        store.GetPeerAddress(),
        store.GetStatusAddress(),
        storeType,
        resolved,
        store.GetLabels(),
    )
    newStore.livenessState = atomic.LoadUint32(&s.livenessState)
    if newStore.getLivenessState() != reachable {
        newStore.unreachableSince = s.unreachableSince
        startHealthCheckLoop(scheduler, c, newStore, newStore.getLivenessState(), storeReResolveInterval)
    }
    if s.addr == addr {
        newStore.healthStatus = s.healthStatus
    }
    c.put(newStore)
    s.setResolveState(deleted)
    logutil.BgLogger().Info("store address or labels changed, add new store and mark old store deleted",
        zap.Uint64("store", s.storeID),
        zap.String("old-addr", s.addr),
        zap.Any("old-labels", s.labels),
        zap.String("old-liveness", s.getLivenessState().String()),
        zap.String("new-addr", newStore.addr),
        zap.Any("new-labels", newStore.labels),
        zap.String("new-liveness", newStore.getLivenessState().String()))

From the above code, when the address or labels of a TiKV instance are updated, a new store object is created and replaces the old one in the store-cache. We can confirm this by the log:

 store address or labels changed, add new store and mark old store deleted...

However, since the store pointer held by the region-cache is not replaced, a region already cached with its leader on this TiKV keeps referencing the old store object, so its status never changes and it stays unavailable:

if isSamePeer(replica.peer, leader) {
    // If hibernate region is enabled and the leader is not reachable, the raft group
    // will not be wakened up and re-elect the leader until the follower receives
    // a request. So, before the new leader is elected, we should not send requests
    // to the unreachable old leader to avoid unnecessary timeout.
    if replica.store.getLivenessState() != reachable {
        return -1
    }

When accessing new regions that were not previously cached, the new store pointer is used and the leader may become available.
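
To make the failure mode concrete, here is a minimal, runnable Go sketch. The types are simplified and hypothetical, not the actual client-go structures: a region cached before the re-resolve keeps the old, unreachable store pointer forever, while a region loaded afterwards picks up the new, reachable one.

package main

import "fmt"

// Simplified stand-ins for the real client-go types.
type livenessState uint32

const (
    reachable livenessState = iota
    unreachable
)

type Store struct {
    storeID  uint64
    addr     string
    liveness livenessState
}

// storeCache maps store ID to the current *Store object.
type storeCache map[uint64]*Store

// regionReplica models a cached region's leader replica: it holds the
// *Store pointer that was current when the region was loaded.
type regionReplica struct {
    store *Store
}

func main() {
    stores := storeCache{}
    oldStore := &Store{storeID: 1, addr: "tikv-0:20160", liveness: unreachable}
    stores[1] = oldStore

    // A region cached before the re-resolve captures the old pointer.
    cachedRegion := regionReplica{store: oldStore}

    // Re-resolve: the address changed, so a new object replaces the cache entry
    // and the old one is only marked deleted, not updated.
    newStore := &Store{storeID: 1, addr: "tikv-0-new:20160", liveness: reachable}
    stores[1] = newStore

    // A region loaded after the re-resolve sees the new, reachable store.
    freshRegion := regionReplica{store: stores[1]}

    fmt.Println(cachedRegion.store.liveness == reachable) // false: stale pointer, never recovers
    fmt.Println(freshRegion.store.liveness == reachable)  // true: fresh lookup uses the new store
}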

We do have a related issue #1401 and a related fix #1402.
However, that fix only stops the health check for the old store object; it still does not replace the store pointer in the region-cache.

Here is my question: why don't we reuse the old store object directly instead of creating a new one?
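
Just to illustrate the direction of that question (a sketch of the idea only, not a proposed patch, with illustrative field names rather than the real client-go ones): if the existing store object were updated in place instead of being replaced, every region-cache entry that already holds the pointer would observe the new address and liveness.

package main

import (
    "fmt"
    "sync"
)

// Illustrative store type; the real client-go store guards its fields
// differently (atomics) and has many more of them.
type Store struct {
    mu        sync.Mutex
    storeID   uint64
    addr      string
    labels    map[string]string
    reachable bool
}

// updateInPlace mutates the existing object instead of allocating a replacement,
// so any region-cache entry already holding this *Store observes the change.
func (s *Store) updateInPlace(addr string, labels map[string]string, reachable bool) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.addr = addr
    s.labels = labels
    s.reachable = reachable
}

func main() {
    s := &Store{storeID: 1, addr: "tikv-0:20160", reachable: false}
    cachedPointer := s // stands in for a region-cache entry holding the same object

    s.updateInPlace("tikv-0-new:20160", map[string]string{"zone": "z1"}, true)

    // The previously cached pointer sees the update without being replaced.
    fmt.Println(cachedPointer.addr, cachedPointer.reachable) // tikv-0-new:20160 true
}

I understand in-place mutation would need careful synchronization with readers that may assume these fields never change, which might be why a new object is created today; but in that case the pointer held by the region-cache also needs to be refreshed.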

Workaround: restart the TiDB instance.

Labels: affects-8.5 (This bug affects the 8.5.x (LTS) versions), bug (Something isn't working)