
Commit 7944225

Improved metrics output consistency and removed references to Thread and Global variables. (#329)

1 parent e018ae2

9 files changed: +220 −172 lines

README.md (4 additions, 4 deletions)

```diff
@@ -21,7 +21,7 @@ Migrate and Validate Tables between Origin and Target Cassandra Clusters.
 - **Java11** (minimum) as Spark binaries are compiled with it.
 - **Spark `3.5.x` with Scala `2.13` and Hadoop `3.3`**
   - Typically installed using [this binary](https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3-scala2.13.tgz) on a single VM (no cluster necessary) where you want to run this job. This simple setup is recommended for most one-time migrations.
-  - However we recommend a Spark Cluster or a Spark Serverless platform like `Databricks` or `Google Dataproc` (that supports the above mentioned versions) for large (e.g. several terabytes) complex migrations OR when CDM is used as a long-term data-transfer utility and not a one-time job.
+  - However, we recommend using a Spark Cluster or a Spark Serverless platform like `Databricks` or `Google Dataproc` (that supports the above-mentioned versions) for large (e.g. several terabytes) complex migrations OR when CDM is used as a long-term data-transfer utility and not a one-time job.

 Spark can be installed by running the following: -

@@ -150,7 +150,7 @@ spark-submit --properties-file cdm.properties \
 - Supports migration/validation from and to [Azure Cosmos Cassandra](https://learn.microsoft.com/en-us/azure/cosmos-db/cassandra)
 - Validate migration accuracy and performance using a smaller randomized data-set
 - Supports adding custom fixed `writetime` and/or `ttl`
-- Track run information (start-time, end-time, status, etc.) in tables (`cdm_run_info` and `cdm_run_details`) on the target keyspace
+- Track run information (start-time, end-time, run-metrics, status, etc.) in tables (`cdm_run_info` and `cdm_run_details`) on the target keyspace

 # Things to know
 - Each run (Migration or Validation) can be tracked (when enabled). You can find summary and details of the same in tables `cdm_run_info` and `cdm_run_details` in the target keyspace.

@@ -160,7 +160,7 @@ spark-submit --properties-file cdm.properties \
 - If a table has only collection and/or UDT non-key columns, the `writetime` used on target will be the time the job was run. If you want to avoid this, we recommend setting the `spark.cdm.schema.ttlwritetime.calc.useCollections` param to `true` in such scenarios.
 - When CDM migration (or validation with autocorrect) is run multiple times on the same table (for whatever reason), it could lead to duplicate entries in `list` type columns. Note this is [due to a Cassandra/DSE bug](https://issues.apache.org/jira/browse/CASSANDRA-11368) and not a CDM issue. This issue can be addressed by enabling and setting a positive value for the `spark.cdm.transform.custom.writetime.incrementBy` param, which was specifically added to address this issue.
 - When you rerun a job to resume from a previous run, the run metrics (read, write, skipped, etc.) captured in table `cdm_run_info` will be only for the current run. If the previous run was killed for some reason, its run metrics may not have been saved. If the previous run did complete (not killed) but with errors, then you will have all run metrics from the previous run as well.
-- When running on a Spark Cluster (and not a single VM), the rate-limit values (`spark.cdm.perfops.ratelimit.origin` & `spark.cdm.perfops.ratelimit.target`) applies to individual Spark worker nodes. Hence this value should be set to `effective-rate-limit-you-need`/`number-of-spark-worker-nodes` . E.g. If you need an effective rate-limit of 10000, and the number of Spark worker nodes are 4, then you should set the above rate-limit params to a value of 2500.
+- When running on a Spark Cluster (and not a single VM), the rate-limit values (`spark.cdm.perfops.ratelimit.origin` & `spark.cdm.perfops.ratelimit.target`) apply to individual Spark worker nodes. Hence these values should be set to effective-rate-limit-you-need/number-of-spark-worker-nodes. E.g. if you need an effective rate-limit of 10000 and have 4 Spark worker nodes, set the above rate-limit params to 2500.

 # Performance recommendations
 Below recommendations may only be useful when migrating large tables where the default performance is not good enough

@@ -175,7 +175,7 @@ Below recommendations may only be useful when migrating large tables where the default performance is not good enough
 - `ratelimit`: Default is `20000`, but this property should usually be updated (after updating other properties) to the highest possible value that your `origin` and `target` clusters can efficiently handle.
 - Using schema manipulation features (like `constantColumns`, `explodeMap`, `extractJson`), transformation functions and/or where-filter-conditions (except partition min/max) may negatively impact performance
 - We typically recommend [this infrastructure](https://docs.datastax.com/en/data-migration/deployment-infrastructure.html#_machines) for CDM VMs and [this starter conf](https://github.com/datastax/cassandra-data-migrator/blob/main/src/resources/cdm.properties). You can then optimize the job further based on CDM params info provided above and the observed load and throughput on `Origin` and `Target` clusters
-- Use a Spark Cluster or a Spark Serverless platform like `Databricks` or `Google Dataproc` for large (e.g. several terabytes) complex migrations OR when CDM is used as a long-term data-transfer utility and not a one-time job.
+- We recommend using a Spark Cluster or a Spark Serverless platform like `Databricks` or `Google Dataproc` for large (e.g. several terabytes) complex migrations OR when CDM is used as a long-term data-transfer utility and not a one-time job.

 > [!NOTE]
 > For additional performance tuning, refer to details mentioned in the [`cdm-detailed.properties` file here](./src/resources/cdm-detailed.properties)
```
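To make the rate-limit arithmetic in the changed README bullet concrete, here is a minimal sketch; the class and variable names below are illustrative only and not part of CDM.

```java
public class RateLimitCalc {
    public static void main(String[] args) {
        // Desired cluster-wide effective rate limit and Spark worker count,
        // using the figures from the README example.
        long effectiveRateLimit = 10_000; // hypothetical target throughput
        int workerNodes = 4;              // number of Spark worker nodes

        // Each worker applies its own limiter, so divide the effective limit evenly.
        long perWorkerLimit = effectiveRateLimit / workerNodes;

        // This is the value to use for both rate-limit params, e.g.:
        //   spark.cdm.perfops.ratelimit.origin=2500
        //   spark.cdm.perfops.ratelimit.target=2500
        System.out.println("Set spark.cdm.perfops.ratelimit.* to " + perWorkerLimit); // prints 2500
    }
}
```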

RELEASE.md (4 additions, 0 deletions)

```diff
@@ -1,4 +1,8 @@
 # Release Notes
+## [5.1.0] - 2024-11-15
+- Improves metrics output by producing stats labels in an intuitive and consistent order
+- Refactored JobCounter by removing any references to `thread` or `global`, as CDM operations are now isolated within partition-ranges (`parts`). Each such `part` is then processed in parallel and aggregated by Spark.
+
 ## [5.0.0] - 2024-11-08
 - CDM refactored to be fully Spark Native and more performant when deployed on a multi-node Spark Cluster
 - `trackRun` feature has been expanded to record `run-info` for each part in the `CDM_RUN_DETAILS` table. Along with granular metrics, this information can be used to troubleshoot any unbalanced problematic partitions.
```
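The `JobCounter` class itself is not shown in this commit view, so the following is only a rough sketch of the part-scoped lifecycle the release note describes. The method names `increment`, `flush`, and `getMetrics` mirror the call sites visible in the diffs below; everything else (the `PartCounter` class and its fields) is hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in: counts accumulate per partition-range ("part") and are
// folded into running totals when the part completes, with no thread/global naming.
public class PartCounter {
    enum CounterType { READ, WRITE, SKIPPED }

    private final Map<CounterType, Long> interim = new EnumMap<>(CounterType.class);
    private final Map<CounterType, Long> total = new EnumMap<>(CounterType.class);

    void increment(CounterType type) {
        interim.merge(type, 1L, Long::sum);
    }

    void flush() { // part finished: fold interim counts into the totals
        interim.forEach((k, v) -> total.merge(k, v, Long::sum));
        interim.clear();
    }

    String getMetrics() {
        return total.toString();
    }

    public static void main(String[] args) {
        PartCounter c = new PartCounter();
        c.increment(CounterType.READ);
        c.increment(CounterType.WRITE);
        c.flush();
        System.out.println(c.getMetrics()); // {READ=1, WRITE=1}
    }
}
```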

src/main/java/com/datastax/cdm/job/CopyJobSession.java (19 additions, 18 deletions)

```diff
@@ -82,52 +82,53 @@ protected void processPartitionRange(PartitionRange range) {

         for (Row originRow : resultSet) {
             rateLimiterOrigin.acquire(1);
-            jobCounter.threadIncrement(JobCounter.CounterType.READ);
+            jobCounter.increment(JobCounter.CounterType.READ);

             Record record = new Record(pkFactory.getTargetPK(originRow), originRow, null);
             if (originSelectByPartitionRangeStatement.shouldFilterRecord(record)) {
-                jobCounter.threadIncrement(JobCounter.CounterType.SKIPPED);
+                jobCounter.increment(JobCounter.CounterType.SKIPPED);
                 continue;
             }

             for (Record r : pkFactory.toValidRecordList(record)) {
                 BoundStatement boundUpsert = bind(r);
                 if (null == boundUpsert) {
-                    jobCounter.threadIncrement(JobCounter.CounterType.SKIPPED);
+                    jobCounter.increment(JobCounter.CounterType.SKIPPED);
                     continue;
                 }

                 rateLimiterTarget.acquire(1);
                 batch = writeAsync(batch, writeResults, boundUpsert);
-                jobCounter.threadIncrement(JobCounter.CounterType.UNFLUSHED);
+                jobCounter.increment(JobCounter.CounterType.UNFLUSHED);

                 if (jobCounter.getCount(JobCounter.CounterType.UNFLUSHED) > fetchSize) {
                     flushAndClearWrites(batch, writeResults);
-                    jobCounter.threadIncrement(JobCounter.CounterType.WRITE,
-                            jobCounter.getCount(JobCounter.CounterType.UNFLUSHED));
-                    jobCounter.threadReset(JobCounter.CounterType.UNFLUSHED);
+                    jobCounter.increment(JobCounter.CounterType.WRITE,
+                            jobCounter.getCount(JobCounter.CounterType.UNFLUSHED, true));
+                    jobCounter.reset(JobCounter.CounterType.UNFLUSHED);
                 }
             }
         }

         flushAndClearWrites(batch, writeResults);
-        jobCounter.threadIncrement(JobCounter.CounterType.WRITE,
-                jobCounter.getCount(JobCounter.CounterType.UNFLUSHED));
-        jobCounter.threadReset(JobCounter.CounterType.UNFLUSHED);
-        jobCounter.globalIncrement();
+        jobCounter.increment(JobCounter.CounterType.WRITE,
+                jobCounter.getCount(JobCounter.CounterType.UNFLUSHED, true));
+        jobCounter.reset(JobCounter.CounterType.UNFLUSHED);
+        jobCounter.flush();
         if (null != trackRunFeature) {
-            trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.PASS, jobCounter.getThreadCounters(true));
+            trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.PASS, jobCounter.getMetrics());
         }
     } catch (Exception e) {
-        jobCounter.threadIncrement(JobCounter.CounterType.ERROR,
-                jobCounter.getCount(JobCounter.CounterType.READ) - jobCounter.getCount(JobCounter.CounterType.WRITE)
-                        - jobCounter.getCount(JobCounter.CounterType.SKIPPED));
+        jobCounter.increment(JobCounter.CounterType.ERROR,
+                jobCounter.getCount(JobCounter.CounterType.READ, true)
+                        - jobCounter.getCount(JobCounter.CounterType.WRITE, true)
+                        - jobCounter.getCount(JobCounter.CounterType.SKIPPED, true));
         logger.error("Error with PartitionRange -- ThreadID: {} Processing min: {} max: {}",
                 Thread.currentThread().getId(), min, max, e);
-        logger.error("Error stats " + jobCounter.getThreadCounters(false));
-        jobCounter.globalIncrement();
+        logger.error("Error stats " + jobCounter.getMetrics(true));
+        jobCounter.flush();
         if (null != trackRunFeature) {
-            trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.FAIL, jobCounter.getThreadCounters(true));
+            trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.FAIL, jobCounter.getMetrics());
         }
     }
 }
```
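The UNFLUSHED handling in this diff follows a flush-on-threshold pattern: writes are buffered asynchronously, and once the unflushed count exceeds `fetchSize` they are flushed and converted into WRITE counts in one step. A simplified, self-contained sketch of that control flow (`FETCH_SIZE` and the buffer are stand-ins; only the overall shape mirrors the diff):

```java
import java.util.ArrayList;
import java.util.List;

public class FlushOnThreshold {
    static final int FETCH_SIZE = 3; // stand-in for the fetchSize used in CopyJobSession

    public static void main(String[] args) {
        List<String> buffer = new ArrayList<>();
        long unflushed = 0, written = 0;

        for (int row = 0; row < 10; row++) {
            buffer.add("row-" + row); // stand-in for writeAsync(...)
            unflushed++;
            if (unflushed > FETCH_SIZE) {
                buffer.clear();       // stand-in for flushAndClearWrites(...)
                written += unflushed; // WRITE += UNFLUSHED, then reset UNFLUSHED
                unflushed = 0;
            }
        }
        buffer.clear();               // final flush after the loop, as in the diff
        written += unflushed;
        unflushed = 0;
        System.out.println("written=" + written); // written=10
    }
}
```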

src/main/java/com/datastax/cdm/job/CounterUnit.java (15 additions, 14 deletions)

```diff
@@ -20,30 +20,31 @@
 public class CounterUnit implements Serializable {

     private static final long serialVersionUID = 2194336948011681878L;
-    private long globalCounter = 0;
-    private long threadLocalCounter = 0;
+    private long count = 0;
+    private long interimCount = 0;

-    public void incrementThreadCounter(long incrementBy) {
-        threadLocalCounter += incrementBy;
+    public void increment(long incrementBy) {
+        interimCount += incrementBy;
     }

-    public long getThreadCounter() {
-        return threadLocalCounter;
+    public long getInterimCount() {
+        return interimCount;
     }

-    public void resetThreadCounter() {
-        threadLocalCounter = 0;
+    public void reset() {
+        interimCount = 0;
     }

-    public void setGlobalCounter(long value) {
-        globalCounter = value;
+    public void setCount(long value) {
+        count = value;
     }

-    public void addThreadToGlobalCounter() {
-        globalCounter += threadLocalCounter;
+    public void addToCount() {
+        count += interimCount;
+        reset();
     }

-    public long getGlobalCounter() {
-        return globalCounter;
+    public long getCount() {
+        return count;
     }
 }
```
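Since the full renamed `CounterUnit` class is visible above, a short usage sketch follows (assuming the class is on the classpath). Note that `addToCount()` now also resets the interim count, which the old `addThreadToGlobalCounter()` did not:

```java
public class CounterUnitDemo {
    public static void main(String[] args) {
        CounterUnit unit = new CounterUnit();

        unit.increment(5);      // accumulate interim count for the current part
        unit.increment(3);
        System.out.println(unit.getInterimCount()); // 8

        unit.addToCount();      // fold interim into the durable count; reset() is called internally
        System.out.println(unit.getCount());        // 8
        System.out.println(unit.getInterimCount()); // 0
    }
}
```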

src/main/java/com/datastax/cdm/job/DiffJobSession.java (19 additions, 20 deletions)

```diff
@@ -135,18 +135,18 @@ protected void processPartitionRange(PartitionRange range) {
             StreamSupport.stream(resultSet.spliterator(), false).forEach(originRow -> {
                 rateLimiterOrigin.acquire(1);
                 Record record = new Record(pkFactory.getTargetPK(originRow), originRow, null);
-                jobCounter.threadIncrement(JobCounter.CounterType.READ);
+                jobCounter.increment(JobCounter.CounterType.READ);

                 if (originSelectByPartitionRangeStatement.shouldFilterRecord(record)) {
-                    jobCounter.threadIncrement(JobCounter.CounterType.SKIPPED);
+                    jobCounter.increment(JobCounter.CounterType.SKIPPED);
                 } else {
                     for (Record r : pkFactory.toValidRecordList(record)) {
                         rateLimiterTarget.acquire(1);
                         CompletionStage<AsyncResultSet> targetResult = targetSelectByPKStatement
                                 .getAsyncResult(r.getPk());

                         if (null == targetResult) {
-                            jobCounter.threadIncrement(JobCounter.CounterType.SKIPPED);
+                            jobCounter.increment(JobCounter.CounterType.SKIPPED);
                         } else {
                             r.setAsyncTargetRow(targetResult);
                             recordsToDiff.add(r);
@@ -168,32 +168,31 @@ protected void processPartitionRange(PartitionRange range) {
                             .getCount(JobCounter.CounterType.CORRECTED_MISSING)
                     && jobCounter.getCount(JobCounter.CounterType.MISMATCH) == jobCounter
                             .getCount(JobCounter.CounterType.CORRECTED_MISMATCH)) {
-                jobCounter.globalIncrement();
+                jobCounter.flush();
                 trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.DIFF_CORRECTED,
-                        jobCounter.getThreadCounters(true));
+                        jobCounter.getMetrics());
             } else {
-                jobCounter.globalIncrement();
-                trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.DIFF,
-                        jobCounter.getThreadCounters(true));
+                jobCounter.flush();
+                trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.DIFF, jobCounter.getMetrics());
             }
         } else if (null != trackRunFeature) {
-            jobCounter.globalIncrement();
-            trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.PASS, jobCounter.getThreadCounters(true));
+            jobCounter.flush();
+            trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.PASS, jobCounter.getMetrics());
         } else {
-            jobCounter.globalIncrement();
+            jobCounter.flush();
         }
     } catch (Exception e) {
-        jobCounter.threadIncrement(JobCounter.CounterType.ERROR,
+        jobCounter.increment(JobCounter.CounterType.ERROR,
                 jobCounter.getCount(JobCounter.CounterType.READ) - jobCounter.getCount(JobCounter.CounterType.VALID)
                         - jobCounter.getCount(JobCounter.CounterType.MISSING)
                         - jobCounter.getCount(JobCounter.CounterType.MISMATCH)
                         - jobCounter.getCount(JobCounter.CounterType.SKIPPED));
         logger.error("Error with PartitionRange -- ThreadID: {} Processing min: {} max: {}",
                 Thread.currentThread().getId(), min, max, e);
-        logger.error("Error stats " + jobCounter.getThreadCounters(false));
-        jobCounter.globalIncrement();
+        logger.error("Error stats " + jobCounter.getMetrics(true));
+        jobCounter.flush();
         if (null != trackRunFeature)
-            trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.FAIL, jobCounter.getThreadCounters(true));
+            trackRunFeature.updateCdmRun(runId, min, TrackRun.RUN_STATUS.FAIL, jobCounter.getMetrics());
     }
 }
@@ -205,7 +204,7 @@ private boolean diffAndClear(List<Record> recordsToDiff, JobCounter jobCounter)

     private boolean diff(Record record, JobCounter jobCounter) {
         if (record.getTargetRow() == null) {
-            jobCounter.threadIncrement(JobCounter.CounterType.MISSING);
+            jobCounter.increment(JobCounter.CounterType.MISSING);
             logger.error("Missing target row found for key: {}", record.getPk());
             if (autoCorrectMissing && isCounterTable && !forceCounterWhenMissing) {
                 logger.error("{} is true, but not Inserting as {} is not enabled; key : {}",
@@ -218,27 +217,27 @@ private boolean diff(Record record, JobCounter jobCounter) {
             if (autoCorrectMissing) {
                 rateLimiterTarget.acquire(1);
                 targetSession.getTargetUpsertStatement().putRecord(record);
-                jobCounter.threadIncrement(JobCounter.CounterType.CORRECTED_MISSING);
+                jobCounter.increment(JobCounter.CounterType.CORRECTED_MISSING);
                 logger.error("Inserted missing row in target: {}", record.getPk());
             }
             return true;
         }

         String diffData = isDifferent(record);
         if (!diffData.isEmpty()) {
-            jobCounter.threadIncrement(JobCounter.CounterType.MISMATCH);
+            jobCounter.increment(JobCounter.CounterType.MISMATCH);
             logger.error("Mismatch row found for key: {} Mismatch: {}", record.getPk(), diffData);

             if (autoCorrectMismatch) {
                 rateLimiterTarget.acquire(1);
                 targetSession.getTargetUpsertStatement().putRecord(record);
-                jobCounter.threadIncrement(JobCounter.CounterType.CORRECTED_MISMATCH);
+                jobCounter.increment(JobCounter.CounterType.CORRECTED_MISMATCH);
                 logger.error("Corrected mismatch row in target: {}", record.getPk());
             }

             return true;
         } else {
-            jobCounter.threadIncrement(JobCounter.CounterType.VALID);
+            jobCounter.increment(JobCounter.CounterType.VALID);
             return false;
         }
     }
```
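The `diff(...)` method above classifies each record as MISSING, MISMATCH, or VALID, optionally auto-correcting the first two. A simplified sketch of that decision flow, ignoring the counter-table special case visible in the diff (the `DiffFlowSketch` class and its `classify` helper are hypothetical):

```java
public class DiffFlowSketch {
    enum Outcome { MISSING, CORRECTED_MISSING, MISMATCH, CORRECTED_MISMATCH, VALID }

    // targetRow == null -> missing; non-empty diffData -> mismatch; otherwise valid.
    static Outcome classify(String targetRow, String diffData, boolean autoCorrect) {
        if (targetRow == null) {
            return autoCorrect ? Outcome.CORRECTED_MISSING : Outcome.MISSING;
        }
        if (!diffData.isEmpty()) {
            return autoCorrect ? Outcome.CORRECTED_MISMATCH : Outcome.MISMATCH;
        }
        return Outcome.VALID;
    }

    public static void main(String[] args) {
        System.out.println(classify(null, "", true));       // CORRECTED_MISSING
        System.out.println(classify("row", "col1", false)); // MISMATCH
        System.out.println(classify("row", "", false));     // VALID
    }
}
```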

src/main/java/com/datastax/cdm/job/GuardrailCheckJobSession.java (4 additions, 4 deletions)

```diff
@@ -54,22 +54,22 @@ protected void processPartitionRange(PartitionRange range) {
         String checkString;
         for (Row originRow : resultSet) {
             rateLimiterOrigin.acquire(1);
-            jobCounter.threadIncrement(JobCounter.CounterType.READ);
+            jobCounter.increment(JobCounter.CounterType.READ);

             checkString = guardrailFeature.guardrailChecks(originRow);
             if (checkString != null && !checkString.isEmpty()) {
-                jobCounter.threadIncrement(JobCounter.CounterType.LARGE);
+                jobCounter.increment(JobCounter.CounterType.LARGE);
                 logger.error("Guardrails failed for row {}", checkString);
             } else {
-                jobCounter.threadIncrement(JobCounter.CounterType.VALID);
+                jobCounter.increment(JobCounter.CounterType.VALID);
             }
         }
     } catch (Exception e) {
         logger.error("Error occurred ", e);
         logger.error("Error with PartitionRange -- ThreadID: {} Processing min: {} max: {}",
                 Thread.currentThread().getId(), min, max);
     } finally {
-        jobCounter.globalIncrement();
+        jobCounter.flush();
     }

     ThreadContext.remove(THREAD_CONTEXT_LABEL);
```
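The guardrail loop above tallies each row as LARGE or VALID depending on whether the check returns findings. A minimal sketch of that classification, where `guardrailChecks` is a hypothetical stand-in that returns a non-empty description on violation (the real CDM check inspects row contents, not just size):

```java
public class GuardrailSketch {
    // Hypothetical stand-in: non-empty result means the row violates a guardrail.
    static String guardrailChecks(int rowSizeKb, int maxSizeKb) {
        return rowSizeKb > maxSizeKb ? "row too large: " + rowSizeKb + "KB" : "";
    }

    public static void main(String[] args) {
        long large = 0, valid = 0;
        int[] rowSizesKb = { 10, 2048, 5 }; // sample rows, sizes in KB

        for (int size : rowSizesKb) {
            String checkString = guardrailChecks(size, 1024);
            if (checkString != null && !checkString.isEmpty()) {
                large++;  // counted as CounterType.LARGE in the real job
            } else {
                valid++;  // counted as CounterType.VALID
            }
        }
        System.out.println("large=" + large + " valid=" + valid); // large=1 valid=2
    }
}
```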
