Commit ec6041c

Merge pull request #152 from datastax/diffjobfromfile

Re-run DiffData job from a file containing partition ranges

2 parents bdec1b1 + 34d4381

File tree

5 files changed: +61 −6 lines changed


.gitignore

Lines changed: 1 addition & 1 deletion

```diff
@@ -5,4 +5,4 @@ target/
 dependency-reduced-pom.xml
 .idea/*
 cassandra-data-migrator.iml
-*/DS_Store
+*.DS_Store
```

README.md

Lines changed: 19 additions & 3 deletions

````diff
@@ -33,7 +33,7 @@ tar -xvzf spark-3.3.1-bin-hadoop3.tgz
 ./spark-submit --properties-file cdm.properties /
 --conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
---class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
+--class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
 Note:
@@ -54,7 +54,7 @@ Note:
 ./spark-submit --properties-file cdm.properties /
 --conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
---class datastax.astra.migrate.DiffData cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
+--class datastax.astra.migrate.DiffData cassandra-data-migrator-3.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
 - Validation job will report differences as "ERRORS" in the log file as shown below
@@ -85,7 +85,7 @@ Note:
 ./spark-submit --properties-file cdm.properties /
 --conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
---class datastax.astra.migrate.MigratePartitionsFromFile cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
+--class datastax.astra.migrate.MigratePartitionsFromFile cassandra-data-migrator-3.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
 When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range
@@ -103,7 +103,23 @@ This mode is specifically useful to process a subset of partition-ranges that
 ```
 grep "ERROR CopyJobSession: Error with PartitionRange" /path/to/logfile_name.txt | awk '{print $13","$15}' > partitions.csv
 ```
+# Data validation for specific partition ranges
+- You can also use the tool to validate data for specific partition ranges using the class option `--class datastax.astra.migrate.DiffPartitionsFromFile` as shown below,
+```
+./spark-submit --properties-file cdm.properties /
+--conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--master "local[*]" /
+--class datastax.astra.migrate.DiffPartitionsFromFile cassandra-data-migrator-3.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
+```
 
+When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range,
+```
+-507900353496146534,-107285462027022883
+-506781526266485690,1506166634797362039
+2637884402540451982,4638499294009575633
+798869613692279889,8699484505161403540
+```
+This mode is specifically useful to process a subset of partition-ranges that may have failed during a previous run.
 
 # Perform large-field Guardrail violation checks
 - The tool can be used to identify large fields from a table that may break your cluster guardrails (e.g. AstraDB has a 10MB limit for a single large field) `--class datastax.astra.migrate.Guardrail` as shown below
````
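The README hunks above rely on `partitions.csv` holding one `min,max` token-range per line. As a minimal illustration of consuming that format, here is a hedged sketch in Java; the `Range` record and `parse` helper are hypothetical names, not part of the migrator's API:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class PartitionsCsv {
    // Hypothetical stand-in for the migrator's partition-range type.
    // Token values can exceed long range semantics conceptually, so BigInteger is used defensively.
    record Range(BigInteger min, BigInteger max) {}

    // Parse partitions.csv content: each well-formed line is "min,max"; malformed lines are skipped.
    static List<Range> parse(List<String> lines) {
        List<Range> ranges = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split(",");
            if (parts.length == 2) {
                ranges.add(new Range(new BigInteger(parts[0].trim()), new BigInteger(parts[1].trim())));
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Range> r = parse(List.of(
                "-507900353496146534,-107285462027022883",
                "2637884402540451982,4638499294009575633"));
        System.out.println(r.size() + " ranges, first min = " + r.get(0).min());
    }
}
```

Each parsed range would then be handed to one validation or migration task, matching the "each line represents a partition-range" contract stated in the README.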

pom.xml

Lines changed: 1 addition & 1 deletion

```diff
@@ -8,7 +8,7 @@
 
 <properties>
 <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-<revision>3.4.2</revision>
+<revision>3.4.4</revision>
 <scala.version>2.12.17</scala.version>
 <scala.main.version>2.12</scala.main.version>
 <spark.version>3.3.1</spark.version>
```

src/main/java/datastax/astra/migrate/DiffJobSession.java

Lines changed: 6 additions & 1 deletion

```diff
@@ -16,6 +16,7 @@
 import java.util.Map;
 import java.util.Optional;
 import java.util.concurrent.CompletionStage;
+import java.util.concurrent.ExecutionException;
 import java.util.concurrent.atomic.AtomicLong;
 import java.util.stream.IntStream;
 import java.util.stream.StreamSupport;
@@ -107,8 +108,12 @@ private void diffAndClear(Map<Row, CompletionStage<AsyncResultSet>> srcToTargetR
 try {
     Row targetRow = srcToTargetRowMap.get(srcRow).toCompletableFuture().get().one();
     diff(srcRow, targetRow);
-} catch (Exception e) {
+} catch (ExecutionException | InterruptedException e) {
     logger.error("Could not perform diff for Key: {}", getKey(srcRow, tableInfo), e);
+    throw new RuntimeException(e);
+} catch (Exception ee) {
+    logger.error("Could not perform diff for Key: {}", getKey(srcRow, tableInfo), ee);
+    throw new RuntimeException(ee);
 }
 }
 srcToTargetRowMap.clear();
```
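The `DiffJobSession` change stops swallowing failures: a failed async read is now rethrown as a `RuntimeException`, so the Spark task fails instead of silently logging and continuing. A minimal sketch of that pattern, with the hypothetical helper `awaitRow` standing in for the real `diffAndClear` logic:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class DiffPattern {
    // Block on an async result; rethrow checked failures as unchecked so the
    // caller (e.g. a Spark task) fails fast rather than skipping the row.
    static String awaitRow(CompletableFuture<String> future) {
        try {
            return future.get();
        } catch (ExecutionException | InterruptedException e) {
            // The real job logs the row key first, then rethrows.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitRow(CompletableFuture.completedFuture("row-1")));

        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("read timeout"));
        try {
            awaitRow(failed);
        } catch (RuntimeException e) {
            // ExecutionException wraps the original cause.
            System.out.println("rethrown: " + e.getCause().getCause().getMessage());
        }
    }
}
```

Rethrowing matters here because a swallowed exception would let `printCounts` report a clean validation run even when some rows were never compared.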
DiffPartitionsFromFile.scala

Lines changed: 34 additions & 0 deletions

```diff
@@ -0,0 +1,34 @@
+package datastax.astra.migrate
+
+import com.datastax.spark.connector.cql.CassandraConnector
+import org.slf4j.LoggerFactory
+
+import org.apache.spark.SparkConf
+import scala.collection.JavaConversions._
+
+object DiffPartitionsFromFile extends AbstractJob {
+
+  val logger = LoggerFactory.getLogger(this.getClass.getName)
+  logger.info("Started Data Validation App based on the partitions from partitions.csv file")
+
+  diffTable(sourceConnection, destinationConnection, sc)
+
+  exitSpark
+
+  private def diffTable(sourceConnection: CassandraConnector, destinationConnection: CassandraConnector, config: SparkConf) = {
+    val partitions = SplitPartitions.getSubPartitionsFromFile(numSplits)
+    logger.info("PARAM Calculated -- Total Partitions: " + partitions.size())
+    val parts = sContext.parallelize(partitions.toSeq, partitions.size);
+    logger.info("Spark parallelize created : " + parts.count() + " parts!");
+
+    parts.foreach(part => {
+      sourceConnection.withSessionDo(sourceSession =>
+        destinationConnection.withSessionDo(destinationSession =>
+          DiffJobSession.getInstance(sourceSession, destinationSession, config)
+            .getDataAndDiff(part.getMin, part.getMax)))
+    })
+
+    DiffJobSession.getInstance(null, null, config).printCounts(true);
+  }
+
+}
```
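The new job's core pattern is a fan-out: read the ranges from `partitions.csv`, distribute them with `sc.parallelize`, and run one diff per range. The following Java sketch mimics that fan-out locally with a thread pool instead of Spark; `diffAllRanges` and its counter are illustrative stand-ins, not migrator code:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DiffDriverSketch {
    // Fan a list of (min, max) token ranges out to workers, mirroring how
    // DiffPartitionsFromFile parallelizes partitions.csv entries. The real
    // job calls DiffJobSession.getDataAndDiff(min, max) per range.
    static int diffAllRanges(List<long[]> ranges) throws InterruptedException {
        AtomicInteger diffed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, ranges.size()));
        for (long[] range : ranges) {
            pool.submit(() -> {
                // Stand-in for diffing one token range [range[0], range[1]].
                diffed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return diffed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<long[]> ranges = List.of(
                new long[]{-507900353496146534L, -107285462027022883L},
                new long[]{2637884402540451982L, 4638499294009575633L});
        System.out.println("processed " + diffAllRanges(ranges) + " ranges");
    }
}
```

In the actual job each Spark partition also opens source and target sessions via `withSessionDo` before diffing; the sketch elides session management to show only the range fan-out.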

0 commit comments