Migrate and Validate Tables between Origin and Target Cassandra Clusters.
> :warning: Please note this job has been tested with Spark version [3.3.1](https://archive.apache.org/dist/spark/spark-3.3.1/)
## Install as a Container
- Get the latest image that includes all dependencies from [DockerHub](https://hub.docker.com/r/datastax/cassandra-data-migrator)
- All migration tools (`cassandra-data-migrator` + `dsbulk` + `cqlsh`) are available in the `/assets/` folder of the container
## Install as a JAR file
- Download the latest jar file from the GitHub [packages area here](https://github.com/orgs/datastax/packages?repo_name=cassandra-data-migrator)
- Install Spark locally: download [Spark 3.3.1](https://archive.apache.org/dist/spark/spark-3.3.1/) and extract the archive:

```
tar -xvzf spark-3.3.1-bin-hadoop3.tgz
```
> :warning: Note that Version 4 of the tool is not backward-compatible with .properties files created in previous versions, and that package names have changed.
1. `cdm.properties` file needs to be configured as applicable for the environment. Parameter descriptions and defaults are described in the file.
> A sample properties file configuration can be [found here](./src/resources/cdm.properties)
2. Place the properties file where it can be accessed while running the job via spark-submit.
3. Run the job using the `spark-submit` command as shown below:
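A minimal sketch of the command, assuming the `datastax.cdm.job.Migrate` class (consistent with the package naming used elsewhere in this README); the jar and log file names are illustrative:

```
./spark-submit --properties-file cdm.properties \
  --master "local[*]" \
  --class datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
```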
- The validation job will report differences as `ERROR` entries in the log file.
- Please grep for all `ERROR` from the output log files to get the list of missing and mismatched records.
  - Note that it lists differences by primary-key values.
- The validation job can also be run in an AutoCorrect mode. This mode can:
  - Add any missing records from origin to target
  - Update any mismatched records between origin and target (makes target same as origin).
- Enable/disable this feature using one or both of the below settings in the config file:
```
spark.cdm.autocorrect.missing false|true
spark.cdm.autocorrect.mismatch false|true
```
# Migrating specific partition ranges
- You can also use the tool to migrate specific partition ranges using class option `--class datastax.cdm.job.MigratePartitionsFromFile` as shown below
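A sketch of the invocation, following the same pattern as the migrate command above (jar and log file names illustrative):

```
./spark-submit --properties-file cdm.properties \
  --master "local[*]" \
  --class datastax.cdm.job.MigratePartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
```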
When running in the above mode the tool assumes a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range:
```
-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
```
This mode is specifically useful for processing a subset of partition-ranges that may have failed during a previous run.
> **Note:**
> Here is a quick tip to prepare `partitions.csv` from the log file:
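For example, assuming failed ranges are logged on `ERROR` lines that end with the range's min and max tokens (the message text and `awk` field positions are assumptions; adjust them to your actual log format):

```
grep "Error with PartitionRange" logfile_name.txt | awk '{print $13","$14}' > partitions.csv
```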
- You can also use the tool to validate data for specific partition ranges using class option `--class datastax.cdm.job.DiffPartitionsFromFile` as shown below:
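Again a sketch, following the same pattern (jar and log file names illustrative):

```
./spark-submit --properties-file cdm.properties \
  --master "local[*]" \
  --class datastax.cdm.job.DiffPartitionsFromFile cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
```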
When running in the above mode the tool assumes a `partitions.csv` file to be present in the current folder.
# Perform large-field Guardrail violation checks
- The tool can be used to identify large fields from a table that may break your cluster guardrails (e.g. AstraDB has a 10MB limit for a single large field) using class option `--class datastax.astra.migrate.Guardrail` as shown below:
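A sketch, assuming the guardrail job is launched the same way as the other jobs (jar and log file names illustrative):

```
./spark-submit --properties-file cdm.properties \
  --master "local[*]" \
  --class datastax.astra.migrate.Guardrail cassandra-data-migrator-4.x.x.jar &> logfile_name.txt
```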
# Features
- Supports [Counter tables](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCountersConcept.html)
- Preserve [writetimes](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__retrieving-the-datetime-a-write-occurred-p) and [TTLs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__ref-select-ttl-p)
- Supports migration/validation of advanced DataTypes ([Sets](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__set), [Lists](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__list), [Maps](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__map), [UDTs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__udt))
- Filter records from `Origin` using `writetimes` and/or CQL conditions and/or min/max token-range
- Perform guardrail checks (identify large fields)
- Supports adding `constants` as new columns on `Target`
- Supports expanding `Map` columns on `Origin` into multiple records on `Target`
- Fully containerized (Docker and K8s friendly)
- Validate migration accuracy and performance using a smaller randomized data-set
- Supports adding custom fixed `writetime`
# Known Limitations
- This tool does not migrate `ttl` & `writetime` at the field-level (for optimization reasons). It instead finds the field with the highest `ttl` & the field with the highest `writetime` within an `origin` row and uses those values on the entire `target` row.
# Building Jar for local development
1. Clone this repo
2. Move to the repo folder `cd cassandra-data-migrator`
## [4.x.x]
This release is a major code refactor of Cassandra Data Migrator, focused on internal code structure and organization.
Automated testing (both unit and integration) was introduced and incorporated into the build process. It includes all features of the previous version, but the properties specified within the configuration (.properties) file have been re-organized and renamed; therefore, the configuration file from the previous version will not work with this version.
New features were also introduced with this release, on top of the 3.4.5 version.
### Added
- New features:
  - `Column renaming`: Column names can differ between Origin and Target
  - `Migrate UDTs across keyspaces`: UDTs can be migrated from Origin to Target, even when the keyspace names differ
  - `Data Type Conversion`: Some predefined Codecs support type conversion between Origin and Target; custom Codecs can be added
  - `Separate Writetime and TTL configuration`: Writetime columns can differ from TTL columns
  - `Subset of columns can be specified with Writetime and TTL`: Not all eligible columns need to be used to compute the origin value
  - `Automatic RandomPartitioner min/max`: Partition min/max values no longer need to be manually configured
  - `Populate Target columns with constant values`: New columns can be added to the Target table, and populated with constant values
  - `Explode Origin Map Column into Target rows`: A Map in Origin can be expanded into multiple rows in Target when the Map key is part of the Target primary key
## [3.x.x]
Previous releases of the project have not been documented in this file.