
Commit ed7ba7c

Merge pull request #162 from datastax/feature/CDM-69sit

fix issue in handling null values in target PK

2 parents 4528363 + 2234a6c

File tree

14 files changed: +181 -63 lines changed


Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ ENV MAVEN_HOME /usr/share/maven
 ENV MAVEN_CONFIG "$USER_HOME_DIR/.m2"
 COPY ./src /assets/src
 COPY ./pom.xml /assets/pom.xml
-COPY ./src/resources/sparkConf.properties /assets/
+COPY ./src/resources/cdm.properties /assets/
 COPY ./src/resources/partitions.csv /assets/
 COPY ./src/resources/primary_key_rows.csv /assets/
 COPY ./src/resources/runCommands.txt /assets/
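Reviewer note: with this change the image ships `cdm.properties` instead of `sparkConf.properties` under `/assets/`. A quick check against a pulled image, as a sketch: the image name comes from the project's DockerHub repo (per the README); the tag and entrypoint behavior are assumptions.

```
# List the bundled assets in the published image (tag omitted = latest;
# --entrypoint ls guards against any entrypoint the image may define)
docker run --rm --entrypoint ls datastax/cassandra-data-migrator /assets/
```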

README.md

Lines changed: 55 additions & 19 deletions
@@ -8,8 +8,8 @@ Migrate and Validate Tables between Origin and Target Cassandra Clusters.
 > :warning: Please note this job has been tested with spark version [3.3.1](https://archive.apache.org/dist/spark/spark-3.3.1/)
 
 ## Install as a Container
-- Get the latest image that includes all dependencies from [DockerHub](https://hub.docker.com/r/datastax/cassandra-data-migrator)
-- All migration tools (`cassandra-data-migrator` + `dsbulk` + `cqlsh`) would be available in the `/assets/` folder of the container
+- Get the latest image that includes all dependencies from [DockerHub](https://hub.docker.com/r/datastax/cassandra-data-migrator)
+- All migration tools (`cassandra-data-migrator` + `dsbulk` + `cqlsh`) would be available in the `/assets/` folder of the container
 
 ## Install as a JAR file
 - Download the latest jar file from the GitHub [packages area here](https://github.com/orgs/datastax/packages?repo_name=cassandra-data-migrator)
@@ -26,34 +26,37 @@ tar -xvzf spark-3.3.1-bin-hadoop3.tgz
 
 > :warning: Note that Version 4 of the tool is not backward-compatible with .properties files created in previous versions, and that package names have changed.
 
-1. `sparkConf.properties` file needs to be configured as applicable for the environment. Parameter descriptions and defaults are described in the file.
-> A sample Spark conf file configuration can be [found here](./src/resources/sparkConf.properties)
-2. Place the conf file where it can be accessed while running the job via spark-submit.
+1. `cdm.properties` file needs to be configured as applicable for the environment. Parameter descriptions and defaults are described in the file.
+> A sample properties file configuration can be [found here](./src/resources/cdm.properties)
+2. Place the properties file where it can be accessed while running the job via spark-submit.
 3. Run the job using the `spark-submit` command as shown below:
 
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
---class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
-Note:
+Note:
 - Above command generates a log file `logfile_name.txt` to avoid log output on the console.
 - Add option `--driver-memory 25G --executor-memory 25G` as shown below if the table migrated is large (over 100GB)
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" --driver-memory 25G --executor-memory 25G /
---class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
 # Steps for Data-Validation:
 
 - To run the job in Data validation mode, use class option `--class datastax.cdm.job.DiffData` as shown below
 
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
---class datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
 - Validation job will report differences as "ERRORS" in the log file as shown below
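Reviewer note: the commands above now timestamp the log file name. The shell substitution expands at launch time, for example:

```
# Illustrative expansion of the new log-file suffix
echo "logfile_name_$(date +%Y%m%d_%H_%M).txt"
# -> logfile_name_20230602_14_05.txt (output depends on the current time)
```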
@@ -66,10 +69,10 @@ Note:
 ```
 
 - Please grep for all `ERROR` from the output log files to get the list of missing and mismatched records.
-- Note that it lists differences by primary-key values.
+  - Note that it lists differences by primary-key values.
 - The Validation job can also be run in an AutoCorrect mode. This mode can
-- Add any missing records from origin to target
-- Update any mismatched records between origin and target (makes target same as origin).
+  - Add any missing records from origin to target
+  - Update any mismatched records between origin and target (makes target same as origin).
 - Enable/disable this feature using one or both of the below settings in the config file
 ```
 spark.cdm.autocorrect.missing false|true
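Reviewer note: the hunk cuts off after the first autocorrect setting; the companion `spark.cdm.autocorrect.mismatch` key appears in this commit's SIT properties fixture further down. A minimal sketch of enabling both in `cdm.properties`, assuming those key names:

```
# Minimal autocorrect configuration; key names taken from this commit's
# migrate.properties test fixture (see the SIT files below)
cat >> cdm.properties <<'EOF'
spark.cdm.autocorrect.missing  true
spark.cdm.autocorrect.mismatch true
EOF
```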
@@ -81,12 +84,13 @@ Note:
 # Migrating specific partition ranges
 - You can also use the tool to migrate specific partition ranges using class option `--class datastax.cdm.job.MigratePartitionsFromFile` as shown below
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
---class datastax.cdm.job.MigratePartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.MigratePartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
-When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range
+When running in the above mode, the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range
 ```
 -507900353496146534,-107285462027022883
 -506781526266485690,1506166634797362039
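Reviewer note: a `partitions.csv` in this format can also be assembled by hand; a sketch using the two sample token ranges shown above:

```
# Create a partitions.csv with the two sample min,max token ranges above
cat > partitions.csv <<'EOF'
-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
EOF
```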
@@ -95,11 +99,40 @@ When running in the above mode, the tool assumes a `partitions.csv` file to be presen
 ```
 This mode is specifically useful to process a subset of partition-ranges that may have failed during a previous run.
 
+> **Note:**
+> Here is a quick tip to prepare `partitions.csv` from the log file:
+
+```
+grep "ERROR CopyJobSession: Error with PartitionRange" /path/to/logfile_name.txt | awk '{print $13","$15}' > partitions.csv
+```
+# Data validation for specific partition ranges
+- You can also use the tool to validate data for specific partition ranges using class option `--class datastax.cdm.job.DiffPartitionsFromFile` as shown below
+```
+./spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--master "local[*]" /
+--class datastax.cdm.job.DiffPartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
+```
+
+When running in the above mode, the tool assumes a `partitions.csv` file to be present in the current folder.
+
+# Perform large-field Guardrail violation checks
+- The tool can be used to identify large fields from a table that may break your cluster guardrails (e.g. AstraDB has a 10MB limit for a single large field) using class option `--class datastax.cdm.job.GuardrailCheck` as shown below
+```
+./spark-submit --properties-file cdmGuardrail.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--master "local[*]" /
+--class datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
+```
+> A sample Guardrail properties file can be [found here](./src/resources/cdmGuardrail.properties)
+
 # Features
-- Supports migration/validation of [Counter tables](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCountersConcept.html)
+- Auto-detects table schema (column names, types, keys, collections, UDTs, etc.)
+  - Including [Counter tables](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCountersConcept.html)
 - Preserve [writetimes](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__retrieving-the-datetime-a-write-occurred-p) and [TTLs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__ref-select-ttl-p)
 - Supports migration/validation of advanced DataTypes ([Sets](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__set), [Lists](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__list), [Maps](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__map), [UDTs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__udt))
 - Filter records from `Origin` using `writetimes` and/or CQL conditions and/or min/max token-range
+- Perform guardrail checks (identify large fields)
 - Supports adding `constants` as new columns on `Target`
 - Supports expanding `Map` columns on `Origin` into multiple records on `Target`
 - Fully containerized (Docker and K8s friendly)
@@ -109,6 +142,9 @@ This mode is specifically useful to process a subset of partition-ranges that
 - Validate migration accuracy and performance using a smaller randomized data-set
 - Supports adding custom fixed `writetime`
 
+# Known Limitations
+- This tool does not migrate `ttl` & `writetime` at the field-level (for optimization reasons). It instead finds the field with the highest `ttl` & the field with the highest `writetime` within an `origin` row and uses those values on the entire `target` row.
+
 # Building Jar for local development
 1. Clone this repo
 2. Move to the repo folder `cd cassandra-data-migrator`
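Reviewer note on the new Known Limitations entry: since the tool applies one row-level `writetime` (the highest across eligible columns), inspecting per-column writetimes on origin helps anticipate which rows will differ after migration. A sketch using this commit's regression table (table and column names come from the SIT fixtures below; `writetime()` applies only to non-primary-key columns):

```
# Compare per-column writetimes on origin; after migration the whole target
# row carries a single writetime (the max across origin's eligible columns)
cqlsh -e "SELECT key, writetime(value) FROM origin.regression_null_ts_in_pk;"
```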

RELEASE.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# Release Notes
+
+## [4.0.0] - 2023-06-02
+This release is a major code refactor of Cassandra Data Migrator, focused on internal code structure and organization.
+Automated testing (both unit and integration) was introduced and incorporated into the build process. It includes all
+features of the previous version, but the properties specified within the configuration (.properties) file have been
+re-organized and renamed; therefore, the configuration file from the previous version will not work with this version.
+
+New features were also introduced with this release, on top of the 3.4.5 version.
+### Added
+- New features:
+  - `Column renaming`: Column names can differ between Origin and Target
+  - `Migrate UDTs across keyspaces`: UDTs can be migrated from Origin to Target, even when the keyspace names differ
+  - `Data Type Conversion`: Some predefined Codecs support type conversion between Origin and Target; custom Codecs can be added
+  - `Separate Writetime and TTL configuration`: Writetime columns can differ from TTL columns
+  - `Subset of columns can be specified with Writetime and TTL`: Not all eligible columns need to be used to compute the origin value
+  - `Automatic RandomPartitioner min/max`: Partition min/max values no longer need to be manually configured
+  - `Populate Target columns with constant values`: New columns can be added to the Target table, and populated with constant values
+  - `Explode Origin Map Column into Target rows`: A Map in Origin can be expanded into multiple rows in Target when the Map key is part of the Target primary key
+
+## [3.x.x]
+Previous releases of the project have not been documented in this file
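Reviewer note: the constant-columns feature above is property-driven. The sketch below is purely illustrative; the property keys are hypothetical placeholders, not confirmed by this commit, and the sample `cdm.properties` in `src/resources` is the authoritative reference:

```
# HYPOTHETICAL property names - for illustration only; consult the sample
# cdm.properties shipped with this release for the real keys
cat >> cdm.properties <<'EOF'
spark.cdm.feature.constantColumns.names   source_cluster
spark.cdm.feature.constantColumns.values  'legacy-dc1'
EOF
```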
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+#!/bin/bash -e
+
+cat <<EOF
+!!!!!!!!
+!!!!!!!! Testing Migrate
+!!!!!!!!
+EOF
+
+/local/cdm.sh -c
+spark-submit \
+  --properties-file /smoke/01_basic_kvp/migrate.properties \
+  --master "local[*]" \
+  --class datastax.astra.migrate.Migrate /local/cassandra-data-migrator.jar
+
+cat <<EOF
+!!!!!!!!
+!!!!!!!! Testing DiffData
+!!!!!!!!
+EOF
+
+spark-submit \
+  --properties-file /smoke/01_basic_kvp/migrate.properties \
+  --master "local[*]" \
+  --class datastax.astra.migrate.DiffData /local/cassandra-data-migrator.jar
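Reviewer note: this smoke script submits `datastax.astra.migrate.Migrate`, while the scenario file in the next section references `com.datastax.cdm.job.Migrate`. A quick way to see which class names the built jar actually contains, as a sketch (jar path taken from the script itself):

```
# List Migrate/DiffData classes present in the jar to verify the package name
unzip -l /local/cassandra-data-migrator.jar | grep -E '(Migrate|DiffData)\.class'
```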
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+migrateData com.datastax.cdm.job.Migrate migrate.properties
+validateData com.datastax.cdm.job.DiffData migrate.properties
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+#!/bin/bash -e
+
+workingDir="$1"
+cd "$workingDir"
+
+for scenario in $(cat cdm.txt | awk '{print $1}'); do
+  /local/cdm.sh -f cdm.txt -s $scenario -d "$workingDir"
+done
+
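Reviewer note: the loop above takes scenario names from the first column of `cdm.txt` (the file in the previous section). To preview what will run:

```
# Print the scenario names the runner will iterate over
awk '{print $1}' cdm.txt
# migrateData
# validateData
```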
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+SELECT * FROM target.regression_null_ts_in_pk;
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+
+ key  | ts                              | value
+------+---------------------------------+--------
+ key1 | 2023-06-01 00:00:00.000000+0000 | valueA
+ key2 | 2023-06-02 12:00:00.000000+0000 | valueB
+
+(2 rows)
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+spark.cdm.origin.connect.host                    cdm-sit-cass
+spark.cdm.target.connect.host                    cdm-sit-cass
+
+spark.cdm.schema.origin.keyspaceTable            origin.regression_null_ts_in_pk
+spark.cdm.schema.target.keyspaceTable            target.regression_null_ts_in_pk
+spark.cdm.perfops.numParts                       1
+
+spark.cdm.autocorrect.missing                    true
+spark.cdm.autocorrect.mismatch                   true
+
+spark.cdm.transform.missing.key.ts.replace.value 1685577600000
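Reviewer note: `spark.cdm.transform.missing.key.ts.replace.value` is an epoch-milliseconds value, and 1685577600000 corresponds exactly to the `ts` shown for `key1` in the expected-results file above. To verify (GNU `date` syntax assumed):

```
# Convert the replacement value (ms -> s) back to a human-readable UTC time
date -u -d @1685577600 '+%Y-%m-%d %H:%M:%S'
# -> 2023-06-01 00:00:00
```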
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+DROP TABLE IF EXISTS origin.regression_null_ts_in_pk;
+CREATE TABLE origin.regression_null_ts_in_pk(key text, ts timestamp, value text, PRIMARY KEY (key));
+INSERT INTO origin.regression_null_ts_in_pk(key,value) VALUES ('key1','valueA');
+INSERT INTO origin.regression_null_ts_in_pk(key,ts,value) VALUES ('key2','2023-06-02 12:00:00','valueB');
+
+DROP TABLE IF EXISTS target.regression_null_ts_in_pk;
+CREATE TABLE target.regression_null_ts_in_pk(key text, ts timestamp, value text, PRIMARY KEY (key, ts));
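Reviewer note: this fixture captures the bug being fixed. `key1` is inserted into origin with a null `ts`, but `ts` is a clustering key in the target table, and Cassandra does not allow null primary-key components, hence the replace-value transform in the properties file above. To see the offending row:

```
# key1 has a null ts in origin; without the missing-key replacement the
# migrated row could not form a valid (key, ts) primary key in target
cqlsh -e "SELECT key, ts, value FROM origin.regression_null_ts_in_pk WHERE key='key1';"
```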
