diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index cfff347a..51f34442 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -41,7 +41,6 @@ * {cstar-data-migrator} ** xref:cdm-overview.adoc[] ** xref:cdm-steps.adoc[Migrate data] -** xref:cdm-parameters.adoc[Parameters] * {dsbulk-loader} ** https://docs.datastax.com/en/dsbulk/overview/dsbulk-about.html[Overview] diff --git a/modules/ROOT/pages/cassandra-data-migrator.adoc b/modules/ROOT/pages/cassandra-data-migrator.adoc index 5ac7c323..8860183b 100644 --- a/modules/ROOT/pages/cassandra-data-migrator.adoc +++ b/modules/ROOT/pages/cassandra-data-migrator.adoc @@ -1,4 +1,5 @@ = {cstar-data-migrator} +:page-aliases: cdm-parameters.adoc Use {cstar-data-migrator} to migrate and validate tables between origin and target Cassandra clusters, with available logging and reconciliation support. @@ -42,55 +43,7 @@ include::partial$cdm-partition-ranges.adoc[] include::partial$cdm-guardrail-checks.adoc[] +[[cdm-next-steps]] +== Next steps -[[cdm-reference]] -== {cstar-data-migrator} references - -=== Common connection parameters for Origin and Target - -include::partial$common-connection-parameters.adoc[] - -=== Origin schema parameters - -include::partial$origin-schema-parameters.adoc[] - -=== Target schema parameters - -include::partial$target-schema-parameters.adoc[] - -=== Auto-correction parameters - -include::partial$auto-correction-parameters.adoc[] - -=== Performance and operations parameters - -include::partial$performance-and-operations-parameters.adoc[] - -=== Transformation parameters - -include::partial$transformation-parameters.adoc[] - -=== Cassandra filter parameters - -include::partial$cassandra-filter-parameters.adoc[] - -=== Java filter parameters - -include::partial$java-filter-parameters.adoc[] - -=== Constant column feature parameters - -include::partial$constant-column-feature-parameters.adoc[] - -=== Explode map feature parameters - -include::partial$explode-map-feature-parameters.adoc[] - -=== Guardrail feature parameter - -include::partial$guardrail-feature-parameters.adoc[] - -=== TLS (SSL) connection parameters - -include::partial$tls-ssl-connection-parameters.adoc[] - +For advanced operations, see the documentation in the https://github.com/datastax/cassandra-data-migrator[{cstar-data-migrator} repository]. diff --git a/modules/ROOT/pages/cdm-parameters.adoc b/modules/ROOT/pages/cdm-parameters.adoc deleted file mode 100644 index 3a1d8e52..00000000 --- a/modules/ROOT/pages/cdm-parameters.adoc +++ /dev/null @@ -1,70 +0,0 @@ -= {cstar-data-migrator} parameters - -Each parameter below offers a different connection. Review each option to determine what is best for your organization.
- -[[cdm-connection-params]] -== Common connection parameters for origin and target - -include::partial$common-connection-parameters.adoc[] - -[[cdm-origin-schema-params]] -== Origin schema parameters - -include::partial$origin-schema-parameters.adoc[] - -[[cdm-target-schema-params]] -== Target schema parameter - -include::partial$target-schema-parameters.adoc[] - -[[cdm-auto-correction-params]] -== Auto-correction parameters - -include::partial$auto-correction-parameters.adoc[] - - -[[cdm-performance-operations-params]] -== Performance and operations parameters - -include::partial$performance-and-operations-parameters.adoc[] - - -[[cdm-transformation-params]] -== Transformation parameters - -include::partial$transformation-parameters.adoc[] - - -[[cdm-cassandra-filter-params]] -== Cassandra filter parameters - -include::partial$cassandra-filter-parameters.adoc[] - - -[[cdm-java-filter-params]] -== Java filter parameters - -include::partial$java-filter-parameters.adoc[] - - -[[cdm-constant-column-feature-params]] -== Constant column feature parameters - -include::partial$constant-column-feature-parameters.adoc[] - - -[[cdm-explode-map-feature-params]] -== Explode map feature parameters - -include::partial$explode-map-feature-parameters.adoc[] - - -[[cdm-guardrail-feature-params]] -== Guardrail feature parameter - -include::partial$guardrail-feature-parameters.adoc[] - -[[cdm-tls-ssl-connection-params]] -== TLS (SSL) connection parameters - -include::partial$tls-ssl-connection-parameters.adoc[] \ No newline at end of file diff --git a/modules/ROOT/partials/cdm-guardrail-checks.adoc b/modules/ROOT/partials/cdm-guardrail-checks.adoc index b83d372b..430f47c9 100644 --- a/modules/ROOT/partials/cdm-guardrail-checks.adoc +++ b/modules/ROOT/partials/cdm-guardrail-checks.adoc @@ -9,5 +9,5 @@ Example: --conf spark.cdm.schema.origin.keyspaceTable="." \ --conf spark.cdm.feature.guardrail.colSizeInKB=10000 \ --master "local[*]" --driver-memory 25G --executor-memory 25G \ ---class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt +--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt ---- diff --git a/modules/ROOT/partials/cdm-partition-ranges.adoc b/modules/ROOT/partials/cdm-partition-ranges.adoc index 121f1566..a14b2d19 100644 --- a/modules/ROOT/partials/cdm-partition-ranges.adoc +++ b/modules/ROOT/partials/cdm-partition-ranges.adoc @@ -1,35 +1,9 @@ -You can also use {cstar-data-migrator} to migrate or validate specific partition ranges. Use a **partition-file** with the name `./._partitions.csv`. -Use the following format in the CSV file, in the current folder as input. -Example: - -[source,csv] ----- --507900353496146534,-107285462027022883 --506781526266485690,1506166634797362039 -2637884402540451982,4638499294009575633 -798869613692279889,8699484505161403540 ----- - -Each line in the CSV represents a partition-range (`min,max`). - -Alternatively, you can also pass the partition-file with a command-line parameter. -Example: +You can also use {cstar-data-migrator} to xref:cdm-steps.adoc#cdm-steps[migrate] or xref:cdm-steps.adoc#cdm-validation-steps[validate] specific partition ranges by passing the below additional parameters. [source,bash] ---- -./spark-submit --properties-file cdm.properties \ - --conf spark.cdm.schema.origin.keyspaceTable="." 
\ - --conf spark.cdm.tokenrange.partitionFile.input="//" \ - --master "local[*]" --driver-memory 25G --executor-memory 25G \ - --class com.datastax.cdm.job. cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt +--conf spark.cdm.filter.cassandra.partition.min= +--conf spark.cdm.filter.cassandra.partition.max= ---- -This mode is specifically useful to process a subset of partition-ranges that may have failed during a previous run. - -[NOTE] -==== -In the format shown above, the migration and validation jobs autogenerate a file named `./._partitions.csv`. -The file contains any failed partition ranges. -No file is created if there were no failed partitions. -You can use the CSV as input to process any failed partition in a subsequent run. -==== \ No newline at end of file +This mode is specifically useful to process a subset of partition-ranges. \ No newline at end of file diff --git a/modules/ROOT/partials/cdm-prerequisites.adoc b/modules/ROOT/partials/cdm-prerequisites.adoc index a8d39bbd..eaf3b74a 100644 --- a/modules/ROOT/partials/cdm-prerequisites.adoc +++ b/modules/ROOT/partials/cdm-prerequisites.adoc @@ -2,15 +2,15 @@ Read the prerequisites below before using the Cassandra Data Migrator. * Install or switch to Java 11. The Spark binaries are compiled with this version of Java. -* Select a single VM to run this job and install https://archive.apache.org/dist/spark/spark-3.5.1/[Spark 3.5.1] there. -No cluster is necessary. -* Optionally, install https://maven.apache.org/download.cgi[Maven] 3.9.x if you want to build the JAR for local development. +* Select a single VM to run this job and install https://archive.apache.org/dist/spark/spark-3.5.3/[Spark 3.5.3] there. +No cluster is necessary for most one-time migrations. However, Spark cluster mode is also supported for complex migrations. +* Optionally, install https://maven.apache.org/download.cgi[Maven] `3.9.x` if you want to build the JAR for local development. Run the following commands to install Apache Spark: [source,bash] ---- -wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3-scala2.13.tgz +wget https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3-scala2.13.tgz -tar -xvzf spark-3.5.1-bin-hadoop3-scala2.13.tgz +tar -xvzf spark-3.5.3-bin-hadoop3-scala2.13.tgz ---- diff --git a/modules/ROOT/partials/cdm-validation-steps.adoc b/modules/ROOT/partials/cdm-validation-steps.adoc index 050ae467..32f1f889 100644 --- a/modules/ROOT/partials/cdm-validation-steps.adoc +++ b/modules/ROOT/partials/cdm-validation-steps.adoc @@ -41,6 +41,6 @@ spark.cdm.autocorrect.mismatch false|true [IMPORTANT] ==== -The {cstar-data-migrator} validation job never deletes records from the target cluster. +The {cstar-data-migrator} validation job never deletes records from the source or target clusters. The job only adds or updates data on the target cluster. ==== \ No newline at end of file diff --git a/modules/ROOT/partials/constant-column-feature-parameters.adoc b/modules/ROOT/partials/constant-column-feature-parameters.adoc deleted file mode 100644 index 3098cc92..00000000 --- a/modules/ROOT/partials/constant-column-feature-parameters.adoc +++ /dev/null @@ -1,29 +0,0 @@ -The constant columns feature allows you to add constant columns to the target table. -If used, the `spark.cdm.feature.constantColumns.names`, `spark.cdm.feature.constantColumns.types`, and `spark.cdm.feature.constantColumns.values` lists must all be the same length. - -By default, these parameters are commented out. 
- -[cols="2,1,3"] -|=== -|Property | Default | Notes - -| `spark.cdm.feature.constantColumns.names` -| -| A comma-separated list of column names, such as `const1,const2`. - -| `spark.cdm.feature.constantColumns.type` -| -| A comma-separated list of column types. - -| `spark.cdm.feature.constantColumns.values` -| -| A comma-separated list of hard-coded values. -Each value should be provided as you would use on the `CQLSH` command line. -Examples: `'abcd'` for a string; `1234` for an int, and so on. - -| `spark.cdm.feature.constantColumns.splitRegex` -| `,` -| Defaults to comma, but can be any regex character that works with `String.split(regex)`. -This option is needed because some data values contain commas, such as in lists, maps, and sets. - -|=== \ No newline at end of file diff --git a/modules/ROOT/partials/explode-map-feature-parameters.adoc b/modules/ROOT/partials/explode-map-feature-parameters.adoc deleted file mode 100644 index f88880f0..00000000 --- a/modules/ROOT/partials/explode-map-feature-parameters.adoc +++ /dev/null @@ -1,19 +0,0 @@ -The explode map feature allows you convert an origin table map into multiple target table records. - -By default, these parameters are commented out. - -[cols="3,3"] -|=== -|Property | Notes - -| `spark.cdm.feature.explodeMap.origin.name` -| The name of the map column, such as `my_map`. -Must be defined on `spark.cdm.schema.origin.column.names`, and the corresponding type on `spark.cdm.schema.origin.column.types` must be a map. - -| `spark.cdm.feature.explodeMap.origin.name.key` -| The name of the column on the target table that holds the map key, such as `my_map_key`. -This key must be present on the target primary key `spark.cdm.schema.target.column.id.names`. - -| `spark.cdm.feature.explodeMap.origin.value` -| The name of the column on the target table that holds the map value, such as `my_map_value`. -|=== \ No newline at end of file diff --git a/modules/ROOT/partials/guardrail-feature-parameters.adoc b/modules/ROOT/partials/guardrail-feature-parameters.adoc deleted file mode 100644 index 7c4b31ab..00000000 --- a/modules/ROOT/partials/guardrail-feature-parameters.adoc +++ /dev/null @@ -1,16 +0,0 @@ -The guardrail feature manages records that exceed guardrail checks. -The guardrail job generates a report; other jobs skip records that exceed the guardrail limit. - -By default, these parameters are commented out. - -[cols="3,1,3"] -|=== -|Property | Default | Notes - -| `spark.cdm.feature.guardrail.colSizeInKB` -| `0` -| The `0` default means the guardrail check is not done. -If set, table records with one or more fields that exceed the column size in kB are flagged. -Note this is kB which is base 10, not kiB which is base 2. - -|=== diff --git a/modules/ROOT/partials/java-filter-parameters.adoc b/modules/ROOT/partials/java-filter-parameters.adoc deleted file mode 100644 index 329a6c95..00000000 --- a/modules/ROOT/partials/java-filter-parameters.adoc +++ /dev/null @@ -1,46 +0,0 @@ -Java filters are applied on the client node. -Data must be pulled from the origin cluster and then filtered. -However, this option may have a lower impact on the production cluster than xref:cdm-cassandra-filter-params[Cassandra filters]. -Java filters put a load onto the {cstar-data-migrator} processing node. -They send more data from Cassandra. -Cassandra filters put a load on the Cassandra nodes because {cstar-data-migrator} specifies `ALLOW FILTERING`, which could cause the coordinator node to perform a lot more work. 
- -By default, these parameters are commented out. - -[cols="2,1,4"] -|=== -|Property | Default | Notes - -| `spark.cdm.filter.java.token.percent` -| `100` -| Between 1 and 100 percent of the token in each split that is migrated. -This property is used to do a wide and random sampling of the data. -The percentage value is applied to each split. -Invalid percentages are treated as 100. - -| `spark.cdm.filter.java.writetime.min` -| `0` -| The lowest (inclusive) writetime values to be migrated. -Using the `spark.cdm.filter.java.writetime.min` and `spark.cdm.filter.java.writetime.max` thresholds, {cstar-data-migrator} can filter records based on their writetimes. -The maximum writetime of the columns configured at `spark.cdm.schema.origin.column.writetime.names` are compared to the `.min` and `.max` thresholds, which must be in **microseconds since the epoch**. -If the `spark.cdm.schema.origin.column.writetime.names` are not specified or the thresholds are null or otherwise invalid, the filter is ignored. -Note that `spark.cdm.s.perfops.batchSize` is ignored when this filter is in place; a value of 1 is used instead. - -| `spark.cdm.filter.java.writetime.max` -| `9223372036854775807` -| The highest (inclusive) writetime values to be migrated. -The `spark.cdm.schema.origin.column.writetime.names` specifies the maximum timestamp of the columns. -If that property is not specified or is for some reason null, the filter is ignored. - -| `spark.cdm.filter.java.column.name` -| -| Filter rows based on matching a configured value. -With `spark.cdm.filter.java.column.name`, specify the column name against which the `spark.cdm.filter.java.column.value` is compared. -Must be on the column list specified at `spark.cdm.schema.origin.column.names`. -The column value is converted to a string, trimmed of whitespace on both ends, and compared. - -| `spark.cdm.filter.java.column.value` -| -| String value to use as comparison. -The whitespace on the ends of `spark.cdm.filter.java.column.value` is trimmed. -|=== \ No newline at end of file diff --git a/modules/ROOT/partials/tls-ssl-connection-parameters.adoc b/modules/ROOT/partials/tls-ssl-connection-parameters.adoc deleted file mode 100644 index 985092d9..00000000 --- a/modules/ROOT/partials/tls-ssl-connection-parameters.adoc +++ /dev/null @@ -1,66 +0,0 @@ -These are TLS (SSL) connection parameters, if configured, for the origin and target clusters. -Note that a secure connect bundle (SCB) embeds these details. - -By default, these parameters are commented out. - -[cols="3,3,3"] -|=== -|Property | Default | Notes - -| `spark.cdm.connect.origin.tls.enabled` -| `false` -| If TLS is used, set to `true`. - -| `spark.cdm.connect.origin.tls.trustStore.path` -| -| Path to the Java truststore file. - -| `spark.cdm.connect.origin.tls.trustStore.password` -| -| Password needed to open the truststore. - -| `spark.cdm.connect.origin.tls.trustStore.type` -| `JKS` -| - -| `spark.cdm.connect.origin.tls.keyStore.path` -| -| Path to the Java keystore file. - -| `spark.cdm.connect.origin.tls.keyStore.password` -| -| Password needed to open the keystore. - -| `spark.cdm.connect.origin.tls.enabledAlgorithms` -| `TLS_RSA_WITH_AES_128_CBC_SHA`,`TLS_RSA_WITH_AES_256_CBC_SHA` -| - -| `spark.cdm.connect.target.tls.enabled` -| `false` -| If TLS is used, set to `true`. - -| `spark.cdm.connect.target.tls.trustStore.path` -| -| Path to the Java truststore file. - -| `spark.cdm.connect.target.tls.trustStore.password` -| -| Password needed to open the truststore. 
- -| `spark.cdm.connect.target.tls.trustStore.type` -| `JKS` -| - -| `spark.cdm.connect.target.tls.keyStore.path` -| -| Path to the Java keystore file. - -| `spark.cdm.connect.target.tls.keyStore.password` -| -| Password needed to open the keystore. - -| `spark.cdm.connect.target.tls.enabledAlgorithms` -| `TLS_RSA_WITH_AES_128_CBC_SHA`,`TLS_RSA_WITH_AES_256_CBC_SHA` -| - -|=== \ No newline at end of file diff --git a/modules/ROOT/partials/use-cdm-migrator.adoc b/modules/ROOT/partials/use-cdm-migrator.adoc index e5513d51..0ab31c2d 100644 --- a/modules/ROOT/partials/use-cdm-migrator.adoc +++ b/modules/ROOT/partials/use-cdm-migrator.adoc @@ -3,11 +3,10 @@ The file can have any name. It does not need to be `cdm.properties` or `cdm-detailed.properties`. In both versions, the `spark-submit` job processes only the parameters that aren't commented out. Other parameter values use defaults or are ignored. ++ See the descriptions and defaults in each file. -For more information, see the following: - * The simplified sample properties configuration, https://github.com/datastax/cassandra-data-migrator/blob/main/src/resources/cdm.properties[cdm.properties]. - This file contains only those parameters that are commonly configured. - * The complete sample properties configuration, https://github.com/datastax/cassandra-data-migrator/blob/main/src/resources/cdm-detailed.properties[cdm-detailed.properties], for the full set of configurable settings. +For more information about the sample properties configuration, see the https://github.com/datastax/cassandra-data-migrator/blob/main/src/resources/cdm-detailed.properties[cdm-detailed.properties] file. +This file contains the full set of configurable settings. . Place the properties file that you elected to use and customize where it can be accessed while running the job using `spark-submit`.
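For illustration, the new `spark.cdm.filter.cassandra.partition.min` and `spark.cdm.filter.cassandra.partition.max` parameters can be combined with the `spark-submit` pattern already shown in the guardrail check example. The following is a sketch, not the documented command: the keyspace, table, and token values are placeholders, the memory settings are copied from the existing example, and the `Migrate` job class is an assumption based on the cassandra-data-migrator repository.

[source,bash]
----
# Illustrative sketch: migrate only a specific token range.
# <keyspace_name>, <table_name>, <min_token>, and <max_token> are placeholders.
# The Migrate job class is assumed from the cassandra-data-migrator repository examples.
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspace_name>.<table_name>" \
--conf spark.cdm.filter.cassandra.partition.min=<min_token> \
--conf spark.cdm.filter.cassandra.partition.max=<max_token> \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
----

A validation run over the same token range would pass the same two `--conf` parameters with the validation job class instead.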