Commit bf2c1c1

jaydeepkumar1984 authored and jaydeep1984 committed
Improved observability in AutoRepair to report both expected vs. actual repair bytes and expected vs. actual keyspaces
patch by Jaydeepkumar Chovatia; reviewed by Chris Lohfink for CASSANDRA-20581
1 parent d7a46b5 commit bf2c1c1

20 files changed, +1170 -405 lines changed

CHANGES.txt

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 5.1
+ * Improved observability in AutoRepair to report both expected vs. actual repair bytes and expected vs. actual keyspaces (CASSANDRA-20581)
  * Execution of CreateTriggerStatement should not rely on external state (CASSANDRA-20287)
  * Support LIKE expressions in filtering queries (CASSANDRA-17198)
  * Make legacy index rebuilds safe on Gossip -> TCM upgrades (CASSANDRA-20887)

doc/modules/cassandra/pages/managing/operating/metrics.adoc

Lines changed: 30 additions & 12 deletions
@@ -1085,40 +1085,58 @@ Reported name format:
 |===
 |Name |Type |Description
 |RepairsInProgress |Gauge<Integer> |Repair is in progress
-on the node
+on the node.
 
 |NodeRepairTimeInSec |Gauge<Integer> |Time taken to repair
-the node in seconds
+the node in seconds.
 
 |ClusterRepairTimeInSec |Gauge<Integer> |Time taken to repair
-the entire Cassandra cluster in seconds
+the entire Cassandra cluster in seconds.
 
 |LongestUnrepairedSec |Gauge<Integer> |Time since the last repair
-ran on the node in seconds
+ran on the node in seconds.
 
 |RepairStartLagSec|Gauge<Integer> |If a repair has not run within min_repair_interval, how long past this value since
 repairs last completed. Useful for determining if repairs are behind schedule.
 
-|SucceededTokenRangesCount |Gauge<Integer> |Number of token ranges successfully repaired on the node
+|SucceededTokenRangesCount |Gauge<Integer> |Number of token ranges successfully repaired on the node.
 
-|FailedTokenRangesCount |Gauge<Integer> |Number of token ranges failed to repair on the node
+|FailedTokenRangesCount |Gauge<Integer> |Number of token ranges failed to repair on the node.
 
 |SkippedTokenRangesCount |Gauge<Integer> |Number of token ranges skipped
-on the node
+on the node.
 
 |SkippedTablesCount |Gauge<Integer> |Number of tables skipped
-on the node
+on the node.
 
 |TotalMVTablesConsideredForRepair |Gauge<Integer> |Number of materialized
-views considered on the node
+views considered on the node.
 
 |TotalDisabledRepairTables |Gauge<Integer> |Number of tables on which
-the automated repair has been disabled on the node
+the automated repair has been disabled on the node.
+
+|TotalBytesToRepair |Gauge<Long> |Total bytes to be repaired across all keyspaces and tables involved in the current
+repair schedule.
+
+|BytesAlreadyRepaired |Gauge<Long> |Cumulative number of bytes successfully repaired so far in the current
+repair schedule.
+NOTE: This calculation is the best effort for the FixedSplitTokenRangeSplitter. In practice, this metric
+may not give you an accurate view in case of uneven data distribution.
+
+
+|TotalKeyspaceRepairPlansToRepair |Gauge<Integer> |Represents the total number of keyspace-level repair plans scheduled
+for execution. If no table-level repair priorities are configured, this number typically matches the total number
+of keyspaces under repair. However, if certain tables have repair priorities set, this number is usually higher than
+the number of keyspaces, as multiple repair plans may be generated for different prioritized tables within the
+same keyspace.
+
+|KeyspaceRepairPlansAlreadyRepaired |Gauge<Integer> |Cumulative number of keyspace-level repair plans successfully
+repaired so far in the current repair schedule.
 
-|RepairTurnMyTurn |Counter |Represents the node's turn to repair
+|RepairTurnMyTurn |Counter |Represents the node's turn to repair.
 
 |RepairTurnMyTurnDueToPriority |Counter |Represents the node's turn to repair
-due to priority set in the automated repair
+due to priority set in the automated repair.
 
 |RepairDelayedByReplica |Counter |Represents occurrences of a node's turn being
 delayed because a replica was currently taking its turn. Only relevant if
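
The four gauges added in this hunk come in total/so-far pairs, so a consumer can turn each pair into a progress figure. The following is a minimal sketch of that arithmetic, assuming the gauge values have already been read into plain numbers by some monitoring client. The class `RepairProgress` and its method names are hypothetical and are not part of Cassandra. The clamping is a choice of this sketch, motivated by the NOTE above that `BytesAlreadyRepaired` is a best-effort estimate.

```java
/** Hypothetical helper (not part of Cassandra): derives progress from the gauge pairs. */
public final class RepairProgress
{
    private RepairProgress() {}

    /**
     * Percentage of bytes repaired in the current schedule, clamped to [0, 100].
     * Returns 0 when the total is zero or negative, i.e. not yet known.
     */
    public static double bytesPercent(long bytesAlreadyRepaired, long totalBytesToRepair)
    {
        if (totalBytesToRepair <= 0)
            return 0.0;
        double pct = 100.0 * bytesAlreadyRepaired / totalBytesToRepair;
        // The byte counts are estimates, so guard against values outside the range.
        return Math.max(0.0, Math.min(100.0, pct));
    }

    /** Percentage of keyspace-level repair plans finished, with the same guards. */
    public static double plansPercent(int plansAlreadyRepaired, int totalPlansToRepair)
    {
        if (totalPlansToRepair <= 0)
            return 0.0;
        double pct = 100.0 * plansAlreadyRepaired / totalPlansToRepair;
        return Math.max(0.0, Math.min(100.0, pct));
    }
}
```

Reporting both figures side by side is likely more useful than either alone, since the byte-based figure can be skewed by uneven data distribution while the plan-based figure treats every plan as equal in size.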

src/java/org/apache/cassandra/metrics/AutoRepairMetrics.java

Lines changed: 33 additions & 0 deletions
@@ -46,6 +46,11 @@ public class AutoRepairMetrics
     public final Gauge<Integer> skippedTablesCount;
     public final Gauge<Integer> totalMVTablesConsideredForRepair;
     public final Gauge<Integer> totalDisabledRepairTables;
+    public final Gauge<Long> totalBytesToRepair;
+    public final Gauge<Long> bytesAlreadyRepaired;
+    public final Gauge<Integer> totalKeyspaceRepairPlansToRepair;
+    public final Gauge<Integer> keyspaceRepairPlansAlreadyRepaired;
+
     public Counter repairTurnMyTurn;
     public Counter repairTurnMyTurnDueToPriority;
     public Counter repairTurnMyTurnForceRepair;
@@ -155,6 +160,34 @@ public Integer getValue()
                 return AutoRepair.instance.getRepairState(repairType).getTotalDisabledTablesRepairCount();
             }
         });
+        totalBytesToRepair = Metrics.register(factory.createMetricName("TotalBytesToRepair"), new Gauge<Long>()
+        {
+            public Long getValue()
+            {
+                return AutoRepair.instance.getRepairState(repairType).getTotalBytesToRepair();
+            }
+        });
+        bytesAlreadyRepaired = Metrics.register(factory.createMetricName("BytesAlreadyRepaired"), new Gauge<Long>()
+        {
+            public Long getValue()
+            {
+                return AutoRepair.instance.getRepairState(repairType).getBytesAlreadyRepaired();
+            }
+        });
+        totalKeyspaceRepairPlansToRepair = Metrics.register(factory.createMetricName("TotalKeyspaceRepairPlansToRepair"), new Gauge<Integer>()
+        {
+            public Integer getValue()
+            {
+                return AutoRepair.instance.getRepairState(repairType).getTotalKeyspaceRepairPlansToRepair();
+            }
+        });
+        keyspaceRepairPlansAlreadyRepaired = Metrics.register(factory.createMetricName("KeyspaceRepairPlansAlreadyRepaired"), new Gauge<Integer>()
+        {
+            public Integer getValue()
+            {
+                return AutoRepair.instance.getRepairState(repairType).getKeyspaceRepairPlansAlreadyRepaired();
+            }
+        });
     }
 
     public void recordTurn(AutoRepairUtils.RepairTurn turn)
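
Each gauge registered in this hunk holds no value of its own: `getValue()` reads through to the repair state every time it is polled. The sketch below isolates that pattern so it compiles on its own. The nested `Gauge` interface and `State` class are stand-ins declared for this sketch and are not the real metrics or Cassandra types. The use of `AtomicLong` is also an assumption made here for cross-thread visibility; the diff itself stores these counters in plain fields.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, self-contained illustration of a read-through gauge. */
public final class LazyGaugeSketch
{
    private LazyGaugeSketch() {}

    /** Stand-in for the metrics library's gauge type (assumption of this sketch). */
    public interface Gauge<T>
    {
        T getValue();
    }

    /** Stand-in for the per-repair-type state object. */
    public static final class State
    {
        private final AtomicLong bytesAlreadyRepaired = new AtomicLong();

        public void setBytesAlreadyRepaired(long value)
        {
            bytesAlreadyRepaired.set(value);
        }

        public long getBytesAlreadyRepaired()
        {
            return bytesAlreadyRepaired.get();
        }
    }

    /** Builds a gauge that re-reads the state on every poll instead of caching a value. */
    public static Gauge<Long> bytesAlreadyRepairedGauge(State state)
    {
        return state::getBytesAlreadyRepaired;
    }
}
```

Because the gauge reads through on each poll, a reset of the state at the start of a new repair turn becomes visible to monitoring on the next scrape without re-registering anything.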

src/java/org/apache/cassandra/repair/autorepair/AutoRepair.java

Lines changed: 20 additions & 6 deletions
@@ -123,14 +123,15 @@ public void setup()
         repairExecutors = new EnumMap<>(AutoRepairConfig.RepairType.class);
         repairRunnableExecutors = new EnumMap<>(AutoRepairConfig.RepairType.class);
         repairStates = new EnumMap<>(AutoRepairConfig.RepairType.class);
+        AutoRepairConfig config = DatabaseDescriptor.getAutoRepairConfig();
+
         for (AutoRepairConfig.RepairType repairType : AutoRepairConfig.RepairType.values())
         {
             repairExecutors.put(repairType, executorFactory().scheduled(false, "AutoRepair-Repair-" + repairType.getConfigName(), Thread.NORM_PRIORITY));
             repairRunnableExecutors.put(repairType, executorFactory().scheduled(false, "AutoRepair-RepairRunnable-" + repairType.getConfigName(), Thread.NORM_PRIORITY));
-            repairStates.put(repairType, AutoRepairConfig.RepairType.getAutoRepairState(repairType));
+            repairStates.put(repairType, AutoRepairConfig.RepairType.getAutoRepairState(repairType, config));
         }
 
-        AutoRepairConfig config = DatabaseDescriptor.getAutoRepairConfig();
         AutoRepairUtils.setup();
 
         for (AutoRepairConfig.RepairType repairType : AutoRepairConfig.RepairType.values())
@@ -197,6 +198,8 @@ public void repair(AutoRepairConfig.RepairType repairType)
             if (turn == MY_TURN || turn == MY_TURN_DUE_TO_PRIORITY || turn == MY_TURN_FORCE_REPAIR)
             {
                 repairState.recordTurn(turn);
+                repairState.setBytesAlreadyRepaired(0L);
+                repairState.setKeyspaceRepairPlansAlreadyRepaired(0);
                 // For normal auto repair, we will use primary range only repairs (Repair with -pr option).
                 // For some cases, we may set the auto_repair_primary_token_range_only flag to false then we will do repair
                 // without -pr. We may also do force repair for certain node that we want to repair all the data on one node
@@ -231,23 +234,30 @@ public void repair(AutoRepairConfig.RepairType repairType)
                 }
 
                 // Separate out the keyspaces and tables to repair based on their priority, with each repair plan representing a uniquely occuring priority.
-                List<PrioritizedRepairPlan> repairPlans = PrioritizedRepairPlan.build(keyspacesAndTablesToRepair, repairType, shuffleFunc);
+                List<PrioritizedRepairPlan> repairPlans = PrioritizedRepairPlan.build(keyspacesAndTablesToRepair, repairType, shuffleFunc, primaryRangeOnly);
+                repairState.updateRepairScheduleStatistics(repairPlans);
 
                 // calculate the repair assignments for each priority:keyspace.
                 Iterator<KeyspaceRepairAssignments> repairAssignmentsIterator = config.getTokenRangeSplitterInstance(repairType).getRepairAssignments(primaryRangeOnly, repairPlans);
 
+                int keyspaceRepairAssignmentsAlreadyRepaired = 0;
                 while (repairAssignmentsIterator.hasNext())
                 {
                     KeyspaceRepairAssignments repairAssignments = repairAssignmentsIterator.next();
                     List<RepairAssignment> assignments = repairAssignments.getRepairAssignments();
                     if (assignments.isEmpty())
                     {
+                        keyspaceRepairAssignmentsAlreadyRepaired++;
                         logger.info("Skipping repairs for priorityBucket={} for keyspace={} since it yielded no assignments", repairAssignments.getPriority(), repairAssignments.getKeyspaceName());
                         continue;
                     }
 
-                    logger.info("Submitting repairs for priorityBucket={} for keyspace={} with assignmentCount={}", repairAssignments.getPriority(), repairAssignments.getKeyspaceName(), repairAssignments.getRepairAssignments().size());
+                    logger.info("Submitting repairs for priorityBucket={} for keyspace={} with assignmentCount={} and keyspaceRepairAssignmentsAlreadyRepaired={}/{}",
+                                repairAssignments.getPriority(), repairAssignments.getKeyspaceName(), repairAssignments.getRepairAssignments().size(),
+                                keyspaceRepairAssignmentsAlreadyRepaired, repairState.getTotalKeyspaceRepairPlansToRepair());
                     repairKeyspace(repairType, primaryRangeOnly, repairAssignments.getKeyspaceName(), repairAssignments.getRepairAssignments(), collectedRepairStats);
+                    keyspaceRepairAssignmentsAlreadyRepaired++;
+                    repairState.setKeyspaceRepairPlansAlreadyRepaired(keyspaceRepairAssignmentsAlreadyRepaired);
                 }
 
                 cleanupAndUpdateStats(turn, repairType, repairState, myId, startTimeInMillis, collectedRepairStats);
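
The loop in this hunk counts a keyspace plan as done in two places: when it yields no assignments and is skipped, and after `repairKeyspace` returns. As the hunk reads, only the second path publishes the counter to the repair state. The sketch below reproduces that control flow with plain collections so the effect can be observed. `PlanCounterSketch`, its parameters, and the use of strings as assignments are hypothetical simplifications and are not Cassandra code.

```java
import java.util.List;
import java.util.function.IntConsumer;

/** Hypothetical reduction of the plan-counting loop to plain collections. */
public final class PlanCounterSketch
{
    private PlanCounterSketch() {}

    /**
     * Walks the per-plan assignment lists and returns how many plans were counted as done.
     * publishProgress stands in for the state setter and is invoked only after a
     * non-empty plan has been processed, mirroring the hunk above.
     */
    public static int run(List<List<String>> assignmentsPerPlan, IntConsumer publishProgress)
    {
        int plansDone = 0;
        for (List<String> assignments : assignmentsPerPlan)
        {
            if (assignments.isEmpty())
            {
                // Skipped plans are counted, but the count is not published here.
                plansDone++;
                continue;
            }
            // The actual repair of this plan's assignments would run here.
            plansDone++;
            publishProgress.accept(plansDone);
        }
        return plansDone;
    }
}
```

Under this reading, a schedule whose last plans are all empty would leave the published count below the total until the next repair turn resets it, which is worth keeping in mind when alerting on the gauge pair.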
@@ -277,6 +287,7 @@ private void repairKeyspace(AutoRepairConfig.RepairType repairType, boolean prim
         long tableStartTime = timeFunc.get();
         int totalProcessedAssignments = 0;
         Set<Range<Token>> ranges = new HashSet<>();
+        long bytesAlreadyRepaired = repairState.getBytesAlreadyRepaired();
         for (RepairAssignment curRepairAssignment : repairAssignments)
         {
             try
@@ -380,7 +391,10 @@ else if (retryCount < config.getRepairMaxRetries(repairType))
                     }
                     ranges.clear();
                 }
-                logger.info("Repair completed for {} tables {}, range {}", keyspaceName, curRepairAssignment.getTableNames(), curRepairAssignment.getTokenRange());
+                bytesAlreadyRepaired += curRepairAssignment.getEstimatedBytes();
+                repairState.setBytesAlreadyRepaired(bytesAlreadyRepaired);
+                logger.info("Repair completed for {} tables {}, range {}, bytesAlreadyRepaired {}/{}",
+                            keyspaceName, curRepairAssignment.getTableNames(), curRepairAssignment.getTokenRange(), bytesAlreadyRepaired, repairState.getTotalBytesToRepair());
             }
             catch (Exception e)
             {
@@ -492,8 +506,8 @@ private void cleanupAndUpdateStats(RepairTurn turn, AutoRepairConfig.RepairType
                         TimeUnit.SECONDS.toDays(repairState.getClusterRepairTimeInSec()));
         }
         repairState.setLastRepairTime(timeFunc.get());
-
         repairState.setRepairInProgress(false);
+
         AutoRepairUtils.updateFinishAutoRepairHistory(repairType, myId, timeFunc.get());
     }

src/java/org/apache/cassandra/repair/autorepair/AutoRepairConfig.java

Lines changed: 4 additions & 4 deletions
@@ -92,16 +92,16 @@ public String getConfigName()
             return configName;
         }
 
-        public static AutoRepairState getAutoRepairState(RepairType repairType)
+        public static AutoRepairState getAutoRepairState(RepairType repairType, AutoRepairConfig config)
         {
             switch (repairType)
             {
                 case FULL:
-                    return new FullRepairState();
+                    return new FullRepairState(config);
                 case INCREMENTAL:
-                    return new IncrementalRepairState();
+                    return new IncrementalRepairState(config);
                 case PREVIEW_REPAIRED:
-                    return new PreviewRepairedState();
+                    return new PreviewRepairedState(config);
             }
 
             throw new IllegalArgumentException("Invalid repair type: " + repairType);

src/java/org/apache/cassandra/repair/autorepair/AutoRepairState.java

Lines changed: 68 additions & 9 deletions
@@ -60,6 +60,8 @@ public abstract class AutoRepairState
     @VisibleForTesting
     protected final RepairType repairType;
     @VisibleForTesting
+    protected AutoRepairConfig config;
+    @VisibleForTesting
     protected int totalTablesConsideredForRepair = 0;
     @VisibleForTesting
     protected long lastRepairTimeInMs;
@@ -84,21 +86,38 @@ public abstract class AutoRepairState
     @VisibleForTesting
     protected int skippedTablesCount = 0;
     @VisibleForTesting
+    protected long totalBytesToRepair = 0;
+    @VisibleForTesting
+    protected long bytesAlreadyRepaired = 0;
+    @VisibleForTesting
+    protected int totalKeyspaceRepairPlansToRepair = 0;
+    @VisibleForTesting
+    protected int keyspaceRepairPlansAlreadyRepaired = 0;
+    @VisibleForTesting
     protected AutoRepairHistory longestUnrepairedNode;
     protected final AutoRepairMetrics metrics;
 
-    protected AutoRepairState(RepairType repairType)
+    protected AutoRepairState(RepairType repairType, AutoRepairConfig config)
     {
         metrics = AutoRepairMetricsManager.getMetrics(repairType);
         this.repairType = repairType;
+        this.config = config;
     }
 
     public abstract RepairCoordinator getRepairRunnable(String keyspace, List<String> tables, Set<Range<Token>> ranges, boolean primaryRangeOnly);
 
     protected RepairCoordinator getRepairRunnable(String keyspace, RepairOption options)
     {
         return new RepairCoordinator(StorageService.instance, StorageService.nextRepairCommand.incrementAndGet(),
-                                     options, keyspace);
+                                     options, keyspace);
+    }
+
+    public void updateRepairScheduleStatistics(List<PrioritizedRepairPlan> repairPlans)
+    {
+        setTotalBytesToRepair(repairPlans.stream().
+                                          flatMap(repairPlan -> repairPlan.getKeyspaceRepairPlans().
+                                                                           stream()).mapToLong(KeyspaceRepairPlan::getEstimatedBytes).sum());
+        setTotalKeyspaceRepairPlansToRepair(repairPlans.stream().mapToInt(repairPlan -> repairPlan.getKeyspaceRepairPlans().size()).sum());
     }
 
     public long getLastRepairTime()
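
`updateRepairScheduleStatistics` derives both totals from the same nested structure: each prioritized plan holds a list of keyspace plans, and each keyspace plan carries an estimated byte count. The following sketch shows that aggregation in isolation. `PriorityPlan` and `KeyspacePlan` are hypothetical stand-ins declared here so the example compiles alone, and they are not the real `PrioritizedRepairPlan` and `KeyspaceRepairPlan` classes.

```java
import java.util.List;

/** Hypothetical, self-contained version of the schedule-statistics aggregation. */
public final class ScheduleStatsSketch
{
    private ScheduleStatsSketch() {}

    /** Stand-in for a keyspace-level plan with a byte estimate. */
    public static final class KeyspacePlan
    {
        private final long estimatedBytes;

        public KeyspacePlan(long estimatedBytes)
        {
            this.estimatedBytes = estimatedBytes;
        }

        public long getEstimatedBytes()
        {
            return estimatedBytes;
        }
    }

    /** Stand-in for one priority bucket holding several keyspace plans. */
    public static final class PriorityPlan
    {
        private final List<KeyspacePlan> keyspacePlans;

        public PriorityPlan(List<KeyspacePlan> keyspacePlans)
        {
            this.keyspacePlans = keyspacePlans;
        }

        public List<KeyspacePlan> getKeyspacePlans()
        {
            return keyspacePlans;
        }
    }

    /** Sums the byte estimates of every keyspace plan across all priority buckets. */
    public static long totalBytes(List<PriorityPlan> plans)
    {
        return plans.stream()
                    .flatMap(plan -> plan.getKeyspacePlans().stream())
                    .mapToLong(KeyspacePlan::getEstimatedBytes)
                    .sum();
    }

    /** Counts keyspace plans across all priority buckets. */
    public static int totalKeyspacePlans(List<PriorityPlan> plans)
    {
        return plans.stream().mapToInt(plan -> plan.getKeyspacePlans().size()).sum();
    }
}
```

Because the count is taken per priority bucket, a keyspace that appears in two buckets contributes two plans, which matches the documented behavior that the plan total can exceed the number of keyspaces.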
@@ -239,20 +258,60 @@ public int getTotalDisabledTablesRepairCount()
     {
         return totalDisabledTablesRepairCount;
     }
+
+    public void setTotalBytesToRepair(long totalBytesToRepair)
+    {
+        this.totalBytesToRepair = totalBytesToRepair;
+    }
+
+    public long getTotalBytesToRepair()
+    {
+        return totalBytesToRepair;
+    }
+
+    public void setBytesAlreadyRepaired(long bytesAlreadyRepaired)
+    {
+        this.bytesAlreadyRepaired = bytesAlreadyRepaired;
+    }
+
+    public long getBytesAlreadyRepaired()
+    {
+        return bytesAlreadyRepaired;
+    }
+
+    public void setTotalKeyspaceRepairPlansToRepair(int totalKeyspaceRepairPlansToRepair)
+    {
+        this.totalKeyspaceRepairPlansToRepair = totalKeyspaceRepairPlansToRepair;
+    }
+
+    public int getTotalKeyspaceRepairPlansToRepair()
+    {
+        return totalKeyspaceRepairPlansToRepair;
+    }
+
+    public void setKeyspaceRepairPlansAlreadyRepaired(int keyspaceRepairPlansAlreadyRepaired)
+    {
+        this.keyspaceRepairPlansAlreadyRepaired = keyspaceRepairPlansAlreadyRepaired;
+    }
+
+    public int getKeyspaceRepairPlansAlreadyRepaired()
+    {
+        return keyspaceRepairPlansAlreadyRepaired;
+    }
 }
 
 class PreviewRepairedState extends AutoRepairState
 {
-    public PreviewRepairedState()
+    public PreviewRepairedState(AutoRepairConfig config)
     {
-        super(RepairType.PREVIEW_REPAIRED);
+        super(RepairType.PREVIEW_REPAIRED, config);
     }
 
     @Override
     public RepairCoordinator getRepairRunnable(String keyspace, List<String> tables, Set<Range<Token>> ranges, boolean primaryRangeOnly)
     {
         RepairOption option = new RepairOption(RepairParallelism.PARALLEL, primaryRangeOnly, false, false,
-                                               AutoRepairService.instance.getAutoRepairConfig().getRepairThreads(repairType), ranges, false, false, PreviewKind.REPAIRED, false, true, true, false, false, false);
+                                               AutoRepairService.instance.getAutoRepairConfig().getRepairThreads(repairType), ranges, false, false, PreviewKind.REPAIRED, false, true, true, false, false, false);
 
         option.getColumnFamilies().addAll(tables);
 
@@ -262,9 +321,9 @@ public RepairCoordinator getRepairRunnable(String keyspace, List<String> tables,
 
 class IncrementalRepairState extends AutoRepairState
 {
-    public IncrementalRepairState()
+    public IncrementalRepairState(AutoRepairConfig config)
     {
-        super(RepairType.INCREMENTAL);
+        super(RepairType.INCREMENTAL, config);
     }
 
     @Override
@@ -307,9 +366,9 @@ protected List<String> filterOutUnsafeTables(String keyspaceName, List<String> t
 
 class FullRepairState extends AutoRepairState
 {
-    public FullRepairState()
+    public FullRepairState(AutoRepairConfig config)
     {
-        super(RepairType.FULL);
+        super(RepairType.FULL, config);
     }
 
     @Override
