Commit e847957

Merge pull request #174 from datastax/feature/CDM-88 (Feature/cdm 88)

2 parents: 6c8cbd9 + b5ca2fa

25 files changed: +128 additions, −144 deletions

README.md

Lines changed: 16 additions & 29 deletions

````diff
@@ -34,7 +34,7 @@ tar -xvzf spark-3.3.1-bin-hadoop3.tgz
 
 ```
 ./spark-submit --properties-file cdm.properties /
---conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
 --master "local[*]" /
 --class com.datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
@@ -44,7 +44,7 @@ Note:
 - Add option `--driver-memory 25G --executor-memory 25G` as shown below if the table migrated is large (over 100GB)
 ```
 ./spark-submit --properties-file cdm.properties /
---conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
 --master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class com.datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
@@ -55,7 +55,7 @@ Note:
 
 ```
 ./spark-submit --properties-file cdm.properties /
---conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
 --master "local[*]" /
 --class com.datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
@@ -82,46 +82,33 @@ spark.cdm.autocorrect.mismatch false|true
 Note:
 - The validation job will never delete records from target i.e. it only adds or updates data on target
 
-# Migrating specific partition ranges
-- You can also use the tool to migrate specific partition ranges using class option `--class com.datastax.cdm.job.MigratePartitionsFromFile` as shown below
-```
-./spark-submit --properties-file cdm.properties /
---conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
---master "local[*]" /
---class com.datastax.cdm.job.MigratePartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
-```
-
-When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range
+# Migrating or Validating specific partition ranges
+- You can also use the tool to Migrate or Validate specific partition ranges by using a partition-file with the name `./<keyspacename>.<tablename>_partitions.csv` in the below format in the current folder as input
 ```
 -507900353496146534,-107285462027022883
 -506781526266485690,1506166634797362039
 2637884402540451982,4638499294009575633
 798869613692279889,8699484505161403540
 ```
-This mode is specifically useful to processes a subset of partition-ranges that may have failed during a previous run.
-
-> **Note:**
-> A file ending with `*_partitions.csv` will be auto created by the Migration & Validation job in the above format containing any failed partition ranges. Just rename it as below & run the above job.
+Each line above represents a partition-range (`min,max`). Alternatively, you can also pass the partition-file via command-line param as shown below
 
 ```
-mv <keyspace>.<table>_partitions.csv partitions.csv
-```
-# Data validation for specific partition ranges
-- You can also use the tool to validate data for a specific partition ranges using class option `--class com.datastax.cdm.job.DiffPartitionsFromFile` as shown below,
-```
-./spark-submit --properties-file cdm.properties /
---conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
---master "local[*]" /
---class com.datastax.cdm.job.DiffPartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
+spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
+--conf spark.tokenRange.partitionFile="/<path-to-file>/<csv-input-filename>" /
+--master "local[*]" /
+--class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
+This mode is specifically useful to processes a subset of partition-ranges that may have failed during a previous run.
 
-When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder.
+> **Note:**
+> A file named `./<keyspacename>.<tablename>_partitions.csv` is auto generated by the Migration & Validation jobs in the above format containing any failed partition ranges. No file is created if there are no failed partitions. You can use this file as an input to process any failed partition in a following run.
 
 # Perform large-field Guardrail violation checks
 - The tool can be used to identify large fields from a table that may break you cluster guardrails (e.g. AstraDB has a 10MB limit for a single large field) `--class com.datastax.cdm.job.GuardrailCheck` as shown below
 ```
 ./spark-submit --properties-file cdm.properties /
---conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
 --conf spark.cdm.feature.guardrail.colSizeInKB=10000 /
 --master "local[*]" /
 --class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
@@ -132,7 +119,7 @@ When running in above mode the tool assumes a `partitions.csv` file to be presen
 - Including counter table [Counter tables](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCountersConcept.html)
 - Preserve [writetimes](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__retrieving-the-datetime-a-write-occurred-p) and [TTLs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__ref-select-ttl-p)
 - Supports migration/validation of advanced DataTypes ([Sets](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__set), [Lists](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__list), [Maps](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__map), [UDTs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__udt))
-- Filter records from `Origin` using `writetimes` and/or CQL conditions and/or min/max token-range
+- Filter records from `Origin` using `writetimes` and/or CQL conditions and/or a list of token-ranges
 - Perform guardrail checks (identify large fields)
 - Supports adding `constants` as new columns on `Target`
 - Supports expanding `Map` columns on `Origin` into multiple records on `Target`
````
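The README section above feeds a `min,max`-per-line partition-range CSV back into a rerun. A quick sanity check of such a file can be sketched in shell; the file name `sample_partitions.csv` and the `awk` check are illustrative only, not part of CDM:

```shell
# Illustrative sketch (not part of CDM): sanity-check a partition-range CSV
# of "min,max" token pairs before feeding it into a rerun.
cat > sample_partitions.csv <<'EOF'
-507900353496146534,-107285462027022883
2637884402540451982,4638499294009575633
EOF

# Each line must have exactly two integer fields with min < max.
awk -F',' '
  NF != 2 || $1 !~ /^-?[0-9]+$/ || $2 !~ /^-?[0-9]+$/ { bad++; next }
  ($1 + 0) >= ($2 + 0)                                { bad++ }
  END { exit (bad > 0) }
' sample_partitions.csv && echo "partition file looks valid"
```

A file that fails this check would make the awk program exit non-zero, so the success message is skipped and the failure can stop a scripted rerun early.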

RELEASE.md

Lines changed: 3 additions & 0 deletions

```diff
@@ -1,4 +1,7 @@
 # Release Notes
+## [4.1.0] - 2023-06-20
+- Refactored exception handling and loading of token-range filters to use the same Migrate & DiffData jobs instead of separate jobs to reduce code & maintenance overhead
+
 ## [4.0.2] - 2023-06-16
 - Capture failed partitions in a file for easier reruns
 - Optimized mvn to reduce jar size
```
Lines changed: 4 additions & 2 deletions

```diff
@@ -1,2 +1,4 @@
-migrateData com.datastax.cdm.job.MigratePartitionsFromFile migrate.properties
-validateData com.datastax.cdm.job.DiffPartitionsFromFile migrate.properties
+migrateDataDefault com.datastax.cdm.job.Migrate migrate.properties
+validateDataDefault com.datastax.cdm.job.DiffData migrate.properties
+migrateData com.datastax.cdm.job.Migrate migrate_with_partitionfile.properties
+validateData com.datastax.cdm.job.DiffData migrate_with_partitionfile.properties
```

SIT/features/06_partition_range/execute.sh

Lines changed: 2 additions & 0 deletions

```diff
@@ -3,9 +3,11 @@
 workingDir="$1"
 cd "$workingDir"
 
+/local/cdm.sh -f cdm.txt -s migrateDataDefault -d "$workingDir"
 /local/cdm.sh -f cdm.txt -s migrateData -d "$workingDir"
 
 cqlsh -u $CASS_USERNAME -p $CASS_PASSWORD $CASS_CLUSTER -f $workingDir/breakData.cql > $workingDir/other.breakData.out 2> $workingDir/other.breakData.err
 
+/local/cdm.sh -f cdm.txt -s validateDataDefault -d "$workingDir"
 /local/cdm.sh -f cdm.txt -s validateData -d "$workingDir"
 
```

Lines changed: 12 additions & 0 deletions

```diff
@@ -0,0 +1,12 @@
+spark.cdm.connect.origin.host cdm-sit-cass
+spark.cdm.connect.target.host cdm-sit-cass
+
+spark.cdm.schema.origin.keyspaceTable origin.feature_partition_range
+spark.cdm.schema.target.keyspaceTable target.feature_partition_range
+spark.cdm.perfops.numParts 1
+
+spark.cdm.autocorrect.missing true
+spark.cdm.autocorrect.mismatch true
+
+spark.tokenrange.partitionFile ./partitions.csv
+
```
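The new SIT properties file above points `spark.tokenrange.partitionFile` at `./partitions.csv`. A pre-flight check that the referenced CSV actually exists before launching a job could be sketched as follows; `demo.properties` and the check itself are illustrative, not part of the SIT harness:

```shell
# Illustrative pre-flight check (not part of the SIT harness): extract the
# partition-file path from a CDM properties file and verify the file exists
# before invoking spark-submit.
cat > demo.properties <<'EOF'
spark.cdm.perfops.numParts 1
spark.tokenrange.partitionFile ./partitions.csv
EOF
printf '0,2000000000000000000\n' > partitions.csv

pfile=$(awk '$1 == "spark.tokenrange.partitionFile" { print $2 }' demo.properties)
[ -f "$pfile" ] && echo "found partition file: $pfile"
```

Failing fast here is cheaper than letting the Spark job start and then error out on a missing input file.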
Lines changed: 2 additions & 0 deletions

```diff
@@ -0,0 +1,2 @@
+0,2000000000000000000
+8100000000000000000,8500000000000000000
```

SIT/features/07_migrate_rows/cdm.txt

Lines changed: 2 additions & 1 deletion

```diff
@@ -1 +1,2 @@
-migrateData com.datastax.cdm.job.MigrateRowsFromFile migrate.properties
+migrateDataDefault com.datastax.cdm.job.MigrateRowsFromFile migrate.properties
+migrateData com.datastax.cdm.job.MigrateRowsFromFile migrate_with_pkrowsfile.properties
```

SIT/features/07_migrate_rows/execute.sh

Lines changed: 1 addition & 0 deletions

```diff
@@ -3,6 +3,7 @@
 workingDir="$1"
 cd "$workingDir"
 
+/local/cdm.sh -f cdm.txt -s migrateDataDefault -d "$workingDir"
 /local/cdm.sh -f cdm.txt -s migrateData -d "$workingDir"
 
 
```
SIT/features/07_migrate_rows/migrate.properties

Lines changed: 0 additions & 2 deletions

```diff
@@ -4,5 +4,3 @@ spark.cdm.connect.target.host cdm-sit-cass
 spark.cdm.schema.origin.keyspaceTable origin.feature_migrate_rows
 spark.cdm.schema.target.keyspaceTable target.feature_migrate_rows
 spark.cdm.perfops.numParts 1
-
-
```
Lines changed: 8 additions & 0 deletions

```diff
@@ -0,0 +1,8 @@
+spark.cdm.connect.origin.host cdm-sit-cass
+spark.cdm.connect.target.host cdm-sit-cass
+
+spark.cdm.schema.origin.keyspaceTable origin.feature_migrate_rows
+spark.cdm.schema.target.keyspaceTable target.feature_migrate_rows
+spark.cdm.perfops.numParts 1
+
+spark.tokenrange.partitionFile ./primary_key_rows.csv
```
