storage,admission: investigate read-only batch latency during high-volume snapshot ingest #89788

@irfansharif

Describe the problem

Experiment discussed internally here. While trying to reproduce snapshot-induced latency hits using the roachtest added in #89191, we noticed that p99.9 latencies increase for read traffic over data that's not currently receiving snapshots. When looking at outlier traces, the time is spent entirely below pebble. There's little trace info from within pebble to understand why; this issue tracks investigating just that.
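As a rough illustration of why a tail metric such as p99.9 can jump while most reads still look healthy, here is a minimal sketch using a nearest-rank percentile. The latency values and the `percentile` helper are made up for illustration; they are not taken from the experiment or from CockroachDB's metrics code.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical data: 10,000 reads, of which 20 (0.2%) stall for 500 ms
# while the rest complete in 2 ms.
latencies_ms = [2.0] * 9980 + [500.0] * 20

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)

print(f"p50={p50} ms, p99={p99} ms, p99.9={p999} ms")
```

With this synthetic distribution only the p99.9 value reflects the stalled reads, which is why a small fraction of slow requests may need outlier traces rather than median or p99 dashboards to spot.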

To Reproduce

Using #89191-ish:

image

First red annotation is leases for foreground load being transferred to the node that's going to start receiving snapshots. Second red annotation is when it starts receiving snapshots, and service latencies start going through the roof. A set of outlier traces can be found here: trace-snapshot-latency.tar.gz. They look roughly like the one below:

image

+cc @andrewbaptist, @sumeerbhola.

Jira issue: CRDB-20434

Labels: C-investigation (Further steps needed to qualify. C-label will change.), T-admission-control (Admission Control)
