Commit b44c383

rename sparkConf.properties to cdm.properties; tidy-up of README.md
1 parent 83fdfd0 commit b44c383

3 files changed

+56
-20
lines changed

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ ENV MAVEN_HOME /usr/share/maven
2525
ENV MAVEN_CONFIG "$USER_HOME_DIR/.m2"
2626
COPY ./src /assets/src
2727
COPY ./pom.xml /assets/pom.xml
28-
COPY ./src/resources/sparkConf.properties /assets/
28+
COPY ./src/resources/cdm.properties /assets/
2929
COPY ./src/resources/partitions.csv /assets/
3030
COPY ./src/resources/primary_key_rows.csv /assets/
3131
COPY ./src/resources/runCommands.txt /assets/

README.md

Lines changed: 55 additions & 19 deletions
@@ -8,8 +8,8 @@ Migrate and Validate Tables between Origin and Target Cassandra Clusters.
88
> :warning: Please note this job has been tested with spark version [3.3.1](https://archive.apache.org/dist/spark/spark-3.3.1/)
99
1010
## Install as a Container
11-
- Get the latest image that includes all dependencies from [DockerHub](https://hub.docker.com/r/datastax/cassandra-data-migrator)
12-
- All migration tools (`cassandra-data-migrator` + `dsbulk` + `cqlsh`) would be available in the `/assets/` folder of the container
11+
- Get the latest image that includes all dependencies from [DockerHub](https://hub.docker.com/r/datastax/cassandra-data-migrator)
12+
- All migration tools (`cassandra-data-migrator` + `dsbulk` + `cqlsh`) would be available in the `/assets/` folder of the container
1313

1414
## Install as a JAR file
1515
- Download the latest jar file from the GitHub [packages area here](https://github.com/orgs/datastax/packages?repo_name=cassandra-data-migrator)
@@ -26,34 +26,37 @@ tar -xvzf spark-3.3.1-bin-hadoop3.tgz
2626

2727
> :warning: Note that Version 4 of the tool is not backward-compatible with .properties files created in previous versions, and that package names have changed.
2828
29-
1. `sparkConf.properties` file needs to be configured as applicable for the environment. Parameter descriptions and defaults are described in the file.
30-
> A sample Spark conf file configuration can be [found here](./src/resources/sparkConf.properties)
31-
2. Place the conf file where it can be accessed while running the job via spark-submit.
29+
1. `cdm.properties` file needs to be configured as applicable for the environment. Parameter descriptions and defaults are described in the file.
30+
> A sample properties file configuration can be [found here](./src/resources/cdm.properties)
31+
2. Place the properties file where it can be accessed while running the job via spark-submit.
3232
3. Run the job using the `spark-submit` command as shown below:
3333

3434
```
35-
./spark-submit --properties-file sparkConf.properties \
35+
./spark-submit --properties-file cdm.properties \
36+
--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" \
3637
--master "local[*]" \
37-
--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
38+
--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
3839
```
3940

40-
Note:
41+
Note:
4142
- The above command writes its output to a timestamped log file (`logfile_name_<timestamp>.txt`) to avoid log output on the console.
4243
- Add the options `--driver-memory 25G --executor-memory 25G` as shown below if the table being migrated is large (over 100GB)
4344
```
44-
./spark-submit --properties-file sparkConf.properties \
45+
./spark-submit --properties-file cdm.properties \
46+
--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" \
4547
--master "local[*]" --driver-memory 25G --executor-memory 25G \
46-
--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
48+
--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
4749
```
4850

4951
# Steps for Data-Validation:
5052

5153
- To run the job in Data validation mode, use class option `--class datastax.cdm.job.DiffData` as shown below
5254

5355
```
54-
./spark-submit --properties-file sparkConf.properties \
56+
./spark-submit --properties-file cdm.properties \
57+
--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" \
5558
--master "local[*]" \
56-
--class datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
59+
--class datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
5760
```
5861

5962
- The validation job reports differences as “ERRORS” in the log file as shown below
@@ -66,10 +69,10 @@ Note:
6669
```
6770

6871
- Grep for `ERROR` in the output log files to get the list of missing and mismatched records.
69-
- Note that it lists differences by primary-key values.
72+
- Note that it lists differences by primary-key values.
7073
- The Validation job can also be run in an AutoCorrect mode. This mode can
71-
- Add any missing records from origin to target
72-
- Update any mismatched records between origin and target (makes target same as origin).
74+
- Add any missing records from origin to target
75+
- Update any mismatched records between origin and target (makes target same as origin).
7376
- Enable/disable this feature using one or both of the below settings in the config file
7477
```
7578
spark.cdm.autocorrect.missing false|true
@@ -81,12 +84,13 @@ Note:
8184
# Migrating specific partition ranges
8285
- You can also use the tool to migrate specific partition ranges using class option `--class datastax.cdm.job.MigratePartitionsFromFile` as shown below
8386
```
84-
./spark-submit --properties-file sparkConf.properties \
87+
./spark-submit --properties-file cdm.properties \
88+
--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" \
8589
--master "local[*]" \
86-
--class datastax.cdm.job.MigratePartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
90+
--class datastax.cdm.job.MigratePartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
8791
```
8892

89-
When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range
93+
When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range
9094
```
9195
-507900353496146534,-107285462027022883
9296
-506781526266485690,1506166634797362039
@@ -95,11 +99,40 @@ When running in above mode the tool assumes a `partitions.csv` file to be presen
9599
```
96100
This mode is specifically useful for processing a subset of partition-ranges that may have failed during a previous run.
97101

102+
> **Note:**
103+
> Here is a quick tip to prepare `partitions.csv` from the log file:
104+
105+
```
106+
grep "ERROR CopyJobSession: Error with PartitionRange" /path/to/logfile_name.txt | awk '{print $13","$15}' > partitions.csv
107+
```
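Since the commands above write to timestamped log files, it can help to locate the newest log before extracting failed ranges. The following is a minimal sketch, assuming the logs sit in the current folder and follow the `logfile_name_<timestamp>.txt` pattern; `latest_log` is a hypothetical helper name and is not part of the tool.

```shell
# Hypothetical helper (not part of cassandra-data-migrator):
# print the most recently modified log file in the current folder
# that matches the logfile_name_<timestamp>.txt naming pattern.
latest_log() {
  # ls -t lists newest first; head keeps only the first entry.
  ls -t logfile_name_*.txt 2>/dev/null | head -n 1
}
```

The result can then replace `/path/to/logfile_name.txt` in the tip above, for example by passing `"$(latest_log)"` as the file argument to `grep`.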
108+
# Data validation for specific partition ranges
109+
- You can also use the tool to validate data for specific partition ranges using class option `--class datastax.cdm.job.DiffPartitionsFromFile` as shown below
110+
```
111+
./spark-submit --properties-file cdm.properties \
112+
--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" \
113+
--master "local[*]" \
114+
--class datastax.cdm.job.DiffPartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
115+
```
116+
117+
When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder.
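Before starting a job that reads `partitions.csv`, it may be worth checking that every line follows the `min,max` format described earlier. The sketch below is an assumption-laden sanity check rather than anything the tool provides: `check_partitions` is a hypothetical helper, and it may be stricter than the tool itself (for instance, it also flags blank lines).

```shell
# Hypothetical helper (not part of cassandra-data-migrator):
# report lines that are not two comma-separated signed integers.
# Returns 0 when all lines look valid, 1 otherwise.
check_partitions() {
  awk -F',' '
    BEGIN { bad = 0 }
    # A valid line has exactly two fields, each an optional minus sign plus digits.
    NF != 2 || $1 !~ /^-?[0-9]+$/ || $2 !~ /^-?[0-9]+$/ {
      printf "line %d: malformed: %s\n", NR, $0
      bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

Running `check_partitions partitions.csv` prints nothing for a well-formed file. The tokens are compared as text patterns only, since 64-bit token values can exceed the numeric precision of `awk`.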
118+
119+
# Perform large-field Guardrail violation checks
120+
- The tool can be used to identify large fields from a table that may break your cluster guardrails (e.g. AstraDB has a 10MB limit for a single large field) using class option `--class datastax.cdm.job.GuardrailCheck` as shown below
121+
```
122+
./spark-submit --properties-file cdmGuardrail.properties \
123+
--conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" \
124+
--master "local[*]" \
125+
--class datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
126+
```
127+
> A sample Guardrail properties file can be [found here](./src/resources/cdmGuardrail.properties)
128+
98129
# Features
99-
- Supports migration/validation of [Counter tables](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCountersConcept.html)
130+
- Auto-detects table schema (column names, types, keys, collections, UDTs, etc.)
131+
- Includes support for [Counter tables](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCountersConcept.html)
100132
- Preserve [writetimes](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__retrieving-the-datetime-a-write-occurred-p) and [TTLs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__ref-select-ttl-p)
101133
- Supports migration/validation of advanced DataTypes ([Sets](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__set), [Lists](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__list), [Maps](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__map), [UDTs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__udt))
102134
- Filter records from `Origin` using `writetimes` and/or CQL conditions and/or min/max token-range
135+
- Perform guardrail checks (identify large fields)
103136
- Supports adding `constants` as new columns on `Target`
104137
- Supports expanding `Map` columns on `Origin` into multiple records on `Target`
105138
- Fully containerized (Docker and K8s friendly)
@@ -109,6 +142,9 @@ This mode is specifically useful to processes a subset of partition-ranges that
109142
- Validate migration accuracy and performance using a smaller randomized data-set
110143
- Supports adding custom fixed `writetime`
111144

145+
# Known Limitations
146+
- This tool does not migrate `ttl` & `writetime` at the field-level (for optimization reasons). It instead finds the field with the highest `ttl` & the field with the highest `writetime` within an `origin` row and uses those values on the entire `target` row.
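
The row-level behaviour described above can be illustrated with made-up numbers. This is only a sketch of the "take the highest value" rule, not the tool's implementation; the writetimes below are invented and the variable names are hypothetical.

```shell
# Hypothetical per-field writetimes (microseconds since epoch) for one origin row.
# These values are invented for illustration only.
field_writetimes="1680000000000000 1680000500000000 1679999000000000"

# Pick the highest value, mirroring the row-level rule described above.
# $field_writetimes is left unquoted on purpose so each value becomes its own line.
row_writetime=$(printf '%s\n' $field_writetimes | sort -n | tail -n 1)

echo "writetime applied to the whole target row: $row_writetime"
```

In this example every field of the target row would carry the writetime `1680000500000000`, including the fields that were originally written earlier.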
147+
112148
# Building Jar for local development
113149
1. Clone this repo
114150
2. Move to the repo folder `cd cassandra-data-migrator`
File renamed without changes.
