rename sparkConf.properties to cdm.properties; tidy-up of README.md

mieslep · mieslep · commit b44c383b5dff · 2023-06-02T11:04:30.000+01:00
diff --git a/Dockerfile b/Dockerfile
@@ -25,7 +25,7 @@ ENV MAVEN_HOME /usr/share/maven
 ENV MAVEN_CONFIG "$USER_HOME_DIR/.m2"
 COPY ./src /assets/src
 COPY ./pom.xml /assets/pom.xml
-COPY ./src/resources/sparkConf.properties /assets/
+COPY ./src/resources/cdm.properties /assets/
 COPY ./src/resources/partitions.csv /assets/
 COPY ./src/resources/primary_key_rows.csv /assets/
 COPY ./src/resources/runCommands.txt /assets/
diff --git a/README.md b/README.md
@@ -8,8 +8,8 @@ Migrate and Validate Tables between Origin and Target Cassandra Clusters.
 > :warning: Please note this job has been tested with spark version [3.3.1](https://archive.apache.org/dist/spark/spark-3.3.1/)
 
 ## Install as a Container
-- Get the latest image that includes all dependencies from [DockerHub](https://hub.docker.com/r/datastax/cassandra-data-migrator) 
-  - All migration tools (`cassandra-data-migrator` + `dsbulk` + `cqlsh`) would be available in the `/assets/` folder of the container
+- Get the latest image that includes all dependencies from [DockerHub](https://hub.docker.com/r/datastax/cassandra-data-migrator)
+    - All migration tools (`cassandra-data-migrator` + `dsbulk` + `cqlsh`) would be available in the `/assets/` folder of the container
 
 ## Install as a JAR file
 - Download the latest jar file from the GitHub [packages area here](https://github.com/orgs/datastax/packages?repo_name=cassandra-data-migrator)
@@ -26,34 +26,37 @@ tar -xvzf spark-3.3.1-bin-hadoop3.tgz
 
 > :warning: Note that Version 4 of the tool is not backward-compatible with .properties files created in previous versions, and that package names have changed.
 
-1. `sparkConf.properties` file needs to be configured as applicable for the environment. Parameter descriptions and defaults are described in the file.
-   > A sample Spark conf file configuration can be [found here](./src/resources/sparkConf.properties)
-2. Place the conf file where it can be accessed while running the job via spark-submit.
+1. `cdm.properties` file needs to be configured as applicable for the environment. Parameter descriptions and defaults are described in the file.
+   > A sample properties file configuration can be [found here](./src/resources/cdm.properties)
+2. Place the properties file where it can be accessed while running the job via spark-submit.
 3. Run the below job using `spark-submit` command as shown below:
 
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
---class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
-Note: 
+Note:
 - Above command generates a log file `logfile_name.txt` to avoid log output on the console.
 - Add option `--driver-memory 25G --executor-memory 25G` as shown below if the table migrated is large (over 100GB)
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" --driver-memory 25G --executor-memory 25G /
---class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
 # Steps for Data-Validation:
 
 - To run the job in Data validation mode, use class option `--class datastax.cdm.job.DiffData` as shown below
 
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
---class datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
 - Validation job will report differences as “ERRORS” in the log file as shown below
@@ -66,10 +69,10 @@ Note:
 ```
 
 - Please grep for all `ERROR` from the output log files to get the list of missing and mismatched records.
-  - Note that it lists differences by primary-key values.
+    - Note that it lists differences by primary-key values.
 - The Validation job can also be run in an AutoCorrect mode. This mode can
-  - Add any missing records from origin to target
-  - Update any mismatched records between origin and target (makes target same as origin). 
+    - Add any missing records from origin to target
+    - Update any mismatched records between origin and target (makes target same as origin).
 - Enable/disable this feature using one or both of the below setting in the config file
 ```
 spark.cdm.autocorrect.missing                     false|true
@@ -81,12 +84,13 @@ Note:
 # Migrating specific partition ranges
 - You can also use the tool to migrate specific partition ranges using class option `--class datastax.cdm.job.MigratePartitionsFromFile` as shown below
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.cdm.schema.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
---class datastax.cdm.job.MigratePartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
+--class datastax.cdm.job.MigratePartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ```
 
-When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range 
+When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range
 ```
 -507900353496146534,-107285462027022883
 -506781526266485690,1506166634797362039
@@ -95,11 +99,40 @@ When running in above mode the tool assumes a `partitions.csv` file to be presen
 ```
 This mode is specifically useful to processes a subset of partition-ranges that may have failed during a previous run.
 
+> **Note:**
+> Here is a quick tip to prepare `partitions.csv` from the log file,
+
+```
+grep "ERROR CopyJobSession: Error with PartitionRange" /path/to/logfile_name.txt | awk '{print $13","$15}' > partitions.csv
+```
+# Data validation for specific partition ranges
+- You can also use the tool to validate data for a specific partition ranges using class option `--class datastax.cdm.job.DiffPartitionsFromFile` as shown below,
+```
+./spark-submit --properties-file cdm.properties /
+--conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--master "local[*]" /
+--class datastax.cdm.job.DiffPartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
+```
+
+When running in above mode the tool assumes a `partitions.csv` file to be present in the current folder.
+
+# Perform large-field Guardrail violation checks
+- The tool can be used to identify large fields from a table that may break you cluster guardrails (e.g. AstraDB has a 10MB limit for a single large field)  `--class datastax.astra.migrate.Guardrail` as shown below
+```
+./spark-submit --properties-file cdmGuardrail.properties /
+--conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--master "local[*]" /
+--class datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
+```
+> A sample Guardrail properties file can be [found here](./src/resources/cdmGuardrail.properties)
+
 # Features
-- Supports migration/validation of [Counter tables](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCountersConcept.html)
+- Auto-detects table schema (column names, types, keys, collections, UDTs, etc.)
+    - Including counter table [Counter tables](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCountersConcept.html)
 - Preserve [writetimes](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__retrieving-the-datetime-a-write-occurred-p) and [TTLs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__ref-select-ttl-p)
 - Supports migration/validation of advanced DataTypes ([Sets](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__set), [Lists](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__list), [Maps](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__map), [UDTs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__udt))
 - Filter records from `Origin` using `writetimes` and/or CQL conditions and/or min/max token-range
+- Perform guardrail checks (identify large fields)
 - Supports adding `constants` as new columns on `Target`
 - Supports expanding `Map` columns on `Origin` into multiple records on `Target`
 - Fully containerized (Docker and K8s friendly)
@@ -109,6 +142,9 @@ This mode is specifically useful to processes a subset of partition-ranges that
 - Validate migration accuracy and performance using a smaller randomized data-set
 - Supports adding custom fixed `writetime`
 
+# Known Limitations
+- This tool does not migrate `ttl` & `writetime` at the field-level (for optimization reasons). It instead finds the field with the highest `ttl` & the field with the highest `writetime` within an `origin` row and uses those values on the entire `target` row.
+
 # Building Jar for local development
 1. Clone this repo
 2. Move to the repo folder `cd cassandra-data-migrator`
diff --git a/src/resources/cdm.properties b/src/resources/cdm.properties