Commit a73b10d (2 parents: 750d822 + 1371d68)

Merge pull request #120 from datastax/issue/CDM-30
Issue/cdm 30

81 files changed (+1141, -1028 lines)


README.md

Lines changed: 17 additions & 14 deletions
@@ -24,15 +24,17 @@ tar -xvzf spark-3.3.1-bin-hadoop3.tgz
 
 # Steps for Data-Migration:
 
-1. `sparkConf.properties` file needs to be configured as applicable for the environment
+> :warning: Note that Version 4 of the tool is not backward-compatible with .properties files created in previous versions, and that package names have changed.
+
+1. `sparkConf.properties` file needs to be configured as applicable for the environment. Parameter descriptions and defaults are described in the file.
 > A sample Spark conf file configuration can be [found here](./src/resources/sparkConf.properties)
 2. Place the conf file where it can be accessed while running the job via spark-submit.
 3. Run the below job using `spark-submit` command as shown below:
 
 ```
 ./spark-submit --properties-file sparkConf.properties /
 --master "local[*]" /
---class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
 ```
 
 Note:
@@ -41,26 +43,26 @@ Note:
 ```
 ./spark-submit --properties-file sparkConf.properties /
 --master "local[*]" --driver-memory 25G --executor-memory 25G /
---class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
 ```
 
 # Steps for Data-Validation:
 
-- To run the job in Data validation mode, use class option `--class datastax.astra.migrate.DiffData` as shown below
+- To run the job in Data validation mode, use class option `--class datastax.cdm.job.DiffData` as shown below
 
 ```
 ./spark-submit --properties-file sparkConf.properties /
 --master "local[*]" /
---class datastax.astra.migrate.DiffData cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
 ```
 
 - Validation job will report differences as “ERRORS” in the log file as shown below
 
 ```
-22/10/27 23:25:29 ERROR DiffJobSession: Missing target row found for key: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% Aliquam faucibus
-22/10/27 23:25:29 ERROR DiffJobSession: Inserted missing row in target: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% Aliquam faucibus
-22/10/27 23:25:30 ERROR DiffJobSession: Mismatch row found for key: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% augue odio at quam Data: (Index: 8 Origin: Hello 3 Target: Hello 2 )
-22/10/27 23:25:30 ERROR DiffJobSession: Updated mismatch row in target: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% augue odio at quam
+23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
+23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
+23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
+23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
 ```
 
 - Please grep for all `ERROR` from the output log files to get the list of missing and mismatched records.
@@ -70,18 +72,18 @@ Note:
 - Update any mismatched records between origin and target (makes target same as origin).
 - Enable/disable this feature using one or both of the below setting in the config file
 ```
-spark.target.autocorrect.missing true|false
-spark.target.autocorrect.mismatch true|false
+spark.cdm.autocorrect.missing false|true
+spark.cdm.autocorrect.mismatch false|true
 ```
 Note:
 - The validation job will never delete records from target i.e. it only adds or updates data on target
 
 # Migrating specific partition ranges
-- You can also use the tool to migrate specific partition ranges using class option `--class datastax.astra.migrate.MigratePartitionsFromFile` as shown below
+- You can also use the tool to migrate specific partition ranges using class option `--class datastax.cdm.job.MigratePartitionsFromFile` as shown below
 ```
 ./spark-submit --properties-file sparkConf.properties /
 --master "local[*]" /
---class datastax.astra.migrate.MigratePartitionsFromFile cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.MigratePartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
 ```
 
 When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range
@@ -99,6 +101,7 @@ This mode is specifically useful to processes a subset of partition-ranges that
 - Supports migration/validation of advanced DataTypes ([Sets](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__set), [Lists](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__list), [Maps](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__map), [UDTs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__udt))
 - Filter records from `Origin` using `writetimes` and/or CQL conditions and/or min/max token-range
 - Supports adding `constants` as new columns on `Target`
+- Supports expanding `Map` columns on `Origin` into multiple records on `Target`
 - Fully containerized (Docker and K8s friendly)
 - SSL Support (including custom cipher algorithms)
 - Migrate from any Cassandra `Origin` ([Apache Cassandra®](https://cassandra.apache.org) / [DataStax Enterprise™](https://www.datastax.com/products/datastax-enterprise) / [DataStax Astra DB™](https://www.datastax.com/products/datastax-astra)) to any Cassandra `Target` ([Apache Cassandra®](https://cassandra.apache.org) / [DataStax Enterprise™](https://www.datastax.com/products/datastax-enterprise) / [DataStax Astra DB™](https://www.datastax.com/products/datastax-astra))
@@ -110,7 +113,7 @@ This mode is specifically useful to processes a subset of partition-ranges that
 1. Clone this repo
 2. Move to the repo folder `cd cassandra-data-migrator`
 3. Run the build `mvn clean package` (Needs Maven 3.8.x)
-4. The fat jar (`cassandra-data-migrator-3.x.x.jar`) file should now be present in the `target` folder
+4. The fat jar (`cassandra-data-migrator-4.x.x.jar`) file should now be present in the `target` folder
 
 # Contributors
 Checkout all our wonderful contributors [here](./CONTRIBUTING.md#contributors).
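As context for the partition-range mode in the README diff above: the tool expects a `partitions.csv` where each line is a `min,max` token range. A minimal sketch of such a file follows; the boundary values are hypothetical Murmur3 token ranges chosen for illustration, not taken from the repo.

```shell
# Hypothetical partitions.csv: one min,max Murmur3 token range per line.
# The boundary values below are illustrative, not from the repo.
cat > partitions.csv <<'EOF'
-9223372036854775808,-4611686018427387905
-4611686018427387904,-1
0,4611686018427387903
4611686018427387904,9223372036854775807
EOF

# Each line feeds one partition-range to the MigratePartitionsFromFile job
cat partitions.csv
```

Per the README, this mode is useful for re-processing only the subset of partition-ranges that need it (e.g. ranges that failed in an earlier run).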

SIT/common.sh

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ export DOCKER_CASS=cdm-sit-cass
 export DOCKER_CDM=cdm-sit-cdm
 export CASS_USERNAME=cassandra
 export CASS_PASSWORD=cassandra
-export KEYSPACES="source target"
+export KEYSPACES="origin target"
 export CDM_DIRECTORY=/local
 export CDM_JARFILE=cassandra-data-migrator.jar

Lines changed: 3 additions & 3 deletions
@@ -1,3 +1,3 @@
-migrateData datastax.astra.migrate.Migrate migrate.properties
-validateData datastax.astra.migrate.DiffData migrate.properties
-fixData datastax.astra.migrate.DiffData fix.properties
+migrateData datastax.cdm.job.Migrate migrate.properties
+validateData datastax.cdm.job.DiffData migrate.properties
+fixData datastax.cdm.job.DiffData fix.properties
Lines changed: 14 additions & 21 deletions
@@ -1,25 +1,18 @@
-spark.origin.host cdm-sit-cass
-spark.origin.port 9042
-spark.origin.username cassandra
-spark.origin.password cassandra
-spark.origin.keyspaceTable source.feature_constant_column
+spark.cdm.origin.connect.host cdm-sit-cass
+spark.cdm.target.connect.host cdm-sit-cass
 
-spark.target.host cdm-sit-cass
-spark.target.port 9042
-spark.target.username cassandra
-spark.target.password cassandra
-spark.target.keyspaceTable target.feature_constant_column
+spark.cdm.schema.origin.keyspaceTable origin.feature_constant_column
+spark.cdm.schema.target.keyspaceTable target.feature_constant_column
+spark.cdm.perfops.numParts 1
 
-spark.numSplits 1
+spark.cdm.schema.origin.column.names key,value
+spark.cdm.schema.origin.column.partition.names key
+spark.cdm.schema.target.column.id.names const1,key
+spark.cdm.schema.origin.column.types 0,0
 
-spark.query.origin key,value
-spark.query.origin.partitionKey key
-spark.query.target.id const1,key
-spark.query.types 0,0
+spark.cdm.feature.constantColumns.names const1,const2
+spark.cdm.feature.constantColumns.types 0,1
+spark.cdm.feature.constantColumns.values 'abcd',1234
 
-spark.cdm.cql.feature.constantColumns.names const1,const2
-spark.cdm.cql.feature.constantColumns.types 0,1
-spark.cdm.cql.feature.constantColumns.values 'abcd',1234
-
-spark.target.autocorrect.missing true
-spark.target.autocorrect.mismatch true
+spark.cdm.autocorrect.missing true
+spark.cdm.autocorrect.mismatch true
Lines changed: 12 additions & 18 deletions
@@ -1,22 +1,16 @@
-spark.origin.host cdm-sit-cass
-spark.origin.port 9042
-spark.origin.username cassandra
-spark.origin.password cassandra
-spark.origin.keyspaceTable source.feature_constant_column
+spark.cdm.origin.connect.host cdm-sit-cass
+spark.cdm.target.connect.host cdm-sit-cass
 
-spark.target.host cdm-sit-cass
-spark.target.port 9042
-spark.target.username cassandra
-spark.target.password cassandra
-spark.target.keyspaceTable target.feature_constant_column
+spark.cdm.schema.origin.keyspaceTable origin.feature_constant_column
+spark.cdm.schema.target.keyspaceTable target.feature_constant_column
+spark.cdm.perfops.numParts 1
 
-spark.numSplits 1
+spark.cdm.schema.origin.column.names key,value
+spark.cdm.schema.origin.column.partition.names key
+spark.cdm.schema.target.column.id.names const1,key
+spark.cdm.schema.origin.column.types 0,0
 
-spark.query.origin key,value
-spark.query.origin.partitionKey key
-spark.query.target.id const1,key
-spark.query.types 0,0
+spark.cdm.feature.constantColumns.names const1,const2
+spark.cdm.feature.constantColumns.types 0,1
+spark.cdm.feature.constantColumns.values 'abcd',1234
 
-spark.cdm.cql.feature.constantColumns.names const1,const2
-spark.cdm.cql.feature.constantColumns.types 0,1
-spark.cdm.cql.feature.constantColumns.values 'abcd',1234
Lines changed: 5 additions & 5 deletions
@@ -1,8 +1,8 @@
-DROP TABLE IF EXISTS source.feature_constant_column;
-CREATE TABLE source.feature_constant_column(key text, value text, PRIMARY KEY (key));
-INSERT INTO source.feature_constant_column(key,value) VALUES ('key1','valueA');
-INSERT INTO source.feature_constant_column(key,value) VALUES ('key2','valueB');
-INSERT INTO source.feature_constant_column(key,value) VALUES ('key3','valueC');
+DROP TABLE IF EXISTS origin.feature_constant_column;
+CREATE TABLE origin.feature_constant_column(key text, value text, PRIMARY KEY (key));
+INSERT INTO origin.feature_constant_column(key,value) VALUES ('key1','valueA');
+INSERT INTO origin.feature_constant_column(key,value) VALUES ('key2','valueB');
+INSERT INTO origin.feature_constant_column(key,value) VALUES ('key3','valueC');
 
 DROP TABLE IF EXISTS target.feature_constant_column;
 CREATE TABLE target.feature_constant_column(const1 text, key text, value text, const2 int, PRIMARY KEY (const1, key));
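A rough sketch (not CDM code) of what this constant-column test expects: each origin row `(key, value)` should land on target with the configured constants `const1='abcd'` and `const2=1234` attached. The column ordering below is an assumption made for readability.

```shell
# Sketch only: emulates the expected target rows of the constant-column test.
# Origin rows (key, value) gain const1='abcd' and const2=1234 on target.
while read -r key value; do
  printf "%s|%s|%s|%s\n" "abcd" "$key" "$value" "1234"
done <<'EOF'
key1 valueA
key2 valueB
key3 valueC
EOF
```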

SIT/features/02_explode_map/cdm.txt

Lines changed: 3 additions & 3 deletions
@@ -1,3 +1,3 @@
-migrateData datastax.astra.migrate.Migrate migrate.properties
-validateData datastax.astra.migrate.DiffData migrate.properties
-fixData datastax.astra.migrate.DiffData fix.properties
+migrateData datastax.cdm.job.Migrate migrate.properties
+validateData datastax.cdm.job.DiffData migrate.properties
+fixData datastax.cdm.job.DiffData fix.properties
Lines changed: 14 additions & 20 deletions
@@ -1,25 +1,19 @@
-spark.origin.host cdm-sit-cass
-spark.origin.port 9042
-spark.origin.username cassandra
-spark.origin.password cassandra
-spark.origin.keyspaceTable source.feature_explode_map
+spark.cdm.origin.connect.host cdm-sit-cass
+spark.cdm.target.connect.host cdm-sit-cass
 
-spark.target.host cdm-sit-cass
-spark.target.port 9042
-spark.target.username cassandra
-spark.target.password cassandra
-spark.target.keyspaceTable target.feature_explode_map
+spark.cdm.schema.origin.keyspaceTable origin.feature_explode_map
+spark.cdm.schema.target.keyspaceTable target.feature_explode_map
+spark.cdm.perfops.numParts 1
 
-spark.numSplits 1
+spark.cdm.schema.origin.column.names key,value,fruits
+spark.cdm.schema.origin.column.partition.names key
+spark.cdm.schema.target.column.id.names key,fruit
+spark.cdm.schema.origin.column.types 0,0,5%0%1
 
-spark.query.origin key,value,fruits
-spark.query.origin.partitionKey key
-spark.query.target.id key,fruit
-spark.query.types 0,0,5%0%1
+spark.cdm.feature.explodeMap.origin.name fruits
+spark.cdm.feature.explodeMap.target.name.key fruit
+spark.cdm.feature.explodeMap.target.name.value fruit_qty
 
-spark.cdm.cql.feature.explodeMap.origin.name fruits
-spark.cdm.cql.feature.explodeMap.target.name.key fruit
-spark.cdm.cql.feature.explodeMap.target.name.value fruit_qty
+spark.cdm.autocorrect.missing true
+spark.cdm.autocorrect.mismatch true
 
-spark.target.autocorrect.missing true
-spark.target.autocorrect.mismatch true
Lines changed: 12 additions & 18 deletions
@@ -1,22 +1,16 @@
-spark.origin.host cdm-sit-cass
-spark.origin.port 9042
-spark.origin.username cassandra
-spark.origin.password cassandra
-spark.origin.keyspaceTable source.feature_explode_map
+spark.cdm.origin.connect.host cdm-sit-cass
+spark.cdm.target.connect.host cdm-sit-cass
 
-spark.target.host cdm-sit-cass
-spark.target.port 9042
-spark.target.username cassandra
-spark.target.password cassandra
-spark.target.keyspaceTable target.feature_explode_map
+spark.cdm.schema.origin.keyspaceTable origin.feature_explode_map
+spark.cdm.schema.target.keyspaceTable target.feature_explode_map
+spark.cdm.perfops.numParts 1
 
-spark.numSplits 1
+spark.cdm.schema.origin.column.names key,value,fruits
+spark.cdm.schema.origin.column.partition.names key
+spark.cdm.schema.target.column.id.names key,fruit
+spark.cdm.schema.origin.column.types 0,0,5%0%1
 
-spark.query.origin key,value,fruits
-spark.query.origin.partitionKey key
-spark.query.target.id key,fruit
-spark.query.types 0,0,5%0%1
+spark.cdm.feature.explodeMap.origin.name fruits
+spark.cdm.feature.explodeMap.target.name.key fruit
+spark.cdm.feature.explodeMap.target.name.value fruit_qty
 
-spark.cdm.cql.feature.explodeMap.origin.name fruits
-spark.cdm.cql.feature.explodeMap.target.name.key fruit
-spark.cdm.cql.feature.explodeMap.target.name.value fruit_qty

SIT/features/02_explode_map/setup.cql

Lines changed: 5 additions & 5 deletions
@@ -1,8 +1,8 @@
-DROP TABLE IF EXISTS source.feature_explode_map;
-CREATE TABLE source.feature_explode_map(key text, value text, fruits map<text,int>, PRIMARY KEY (key));
-INSERT INTO source.feature_explode_map(key,value,fruits) VALUES ('key1','valueA', {'apples': 3, 'oranges': 5, 'bananas': 2, 'grapes': 11});
-INSERT INTO source.feature_explode_map(key,value,fruits) VALUES ('key2','valueB', {'apples': 4, 'oranges': 6, 'bananas': 3, 'pears': 7});
-INSERT INTO source.feature_explode_map(key,value,fruits) VALUES ('key3','valueC', {'apples': 5, 'oranges': 7, 'bananas': 4, 'kiwi': 42});
+DROP TABLE IF EXISTS origin.feature_explode_map;
+CREATE TABLE origin.feature_explode_map(key text, value text, fruits map<text,int>, PRIMARY KEY (key));
+INSERT INTO origin.feature_explode_map(key,value,fruits) VALUES ('key1','valueA', {'apples': 3, 'oranges': 5, 'bananas': 2, 'grapes': 11});
+INSERT INTO origin.feature_explode_map(key,value,fruits) VALUES ('key2','valueB', {'apples': 4, 'oranges': 6, 'bananas': 3, 'pears': 7});
+INSERT INTO origin.feature_explode_map(key,value,fruits) VALUES ('key3','valueC', {'apples': 5, 'oranges': 7, 'bananas': 4, 'kiwi': 42});
 
 DROP TABLE IF EXISTS target.feature_explode_map;
 CREATE TABLE target.feature_explode_map(key text, fruit text, value text, fruit_qty int, PRIMARY KEY ((key), fruit));
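A rough sketch (not CDM code) of what the explodeMap feature means for this test's data: each entry of the origin `fruits` map becomes its own target row keyed by `(key, fruit)`, with the map value stored in `fruit_qty`.

```shell
# Sketch only: shows how one origin row's map explodes into target rows.
# Origin: key1 | valueA | fruits = {'apples': 3, 'oranges': 5, 'bananas': 2, 'grapes': 11}
for entry in apples:3 oranges:5 bananas:2 grapes:11; do
  fruit=${entry%%:*}   # map key -> target clustering column 'fruit'
  qty=${entry##*:}     # map value -> target column 'fruit_qty'
  printf "key1|%s|valueA|%s\n" "$fruit" "$qty"
done
```

One origin row thus yields four target rows here, matching the target table's `PRIMARY KEY ((key), fruit)` layout.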
