Commit 6e7c4a2

Merge branch '3.3.0_stable' into docs_update

2 parents: 8e1927c + 264e5ed

23 files changed: +643 -583 lines

.github/workflows/maven-publish.yml

Lines changed: 1 addition & 0 deletions
````diff
@@ -4,6 +4,7 @@
 name: Maven Package
 
 on:
+  workflow_dispatch:
   push:
     branches: [ main ]
 
````
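With `workflow_dispatch:` in place, the Maven Package workflow can be triggered by hand from the Actions tab. As a sketch, assuming an authenticated GitHub CLI, the same manual run can be started from a terminal:

```
# manually dispatch the workflow enabled by this commit
gh workflow run maven-publish.yml --ref main
```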

Dockerfile

Lines changed: 2 additions & 2 deletions
````diff
@@ -18,14 +18,14 @@ RUN apt-get update && apt-get install -y openssh-server vim python3 --no-install
 service ssh start
 
 # Copy CDM jar & template files
-ARG MAVEN_VERSION=3.8.7
+ARG MAVEN_VERSION=3.8.8
 ARG USER_HOME_DIR="/root"
 ARG BASE_URL=https://dlcdn.apache.org/maven/maven-3/${MAVEN_VERSION}/binaries
 ENV MAVEN_HOME /usr/share/maven
 ENV MAVEN_CONFIG "$USER_HOME_DIR/.m2"
 COPY ./src /assets/src
 COPY ./pom.xml /assets/pom.xml
-COPY ./src/resources/sparkConf.properties /assets/
+COPY src/resources/cdm.properties /assets/
 COPY ./src/resources/partitions.csv /assets/
 COPY ./src/resources/primary_key_rows.csv /assets/
 COPY ./src/resources/runCommands.txt /assets/
````
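Because `MAVEN_VERSION` is a build `ARG`, the version can also be overridden at image-build time without editing the Dockerfile. A minimal sketch using standard `docker build` options; the image tag is illustrative:

```
# pin a different Maven release at build time; tag name is arbitrary
docker build --build-arg MAVEN_VERSION=3.8.8 -t cassandra-data-migrator:local .
```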

README.md

Lines changed: 25 additions & 6 deletions
````diff
@@ -24,13 +24,14 @@ tar -xvzf spark-3.3.1-bin-hadoop3.tgz
 
 # Steps for Data-Migration:
 
-1. `sparkConf.properties` file needs to be configured as applicable for the environment
-> A sample Spark conf file configuration can be [found here](./src/resources/sparkConf.properties)
+1. `cdm.properties` file needs to be configured as applicable for the environment
+> A sample properties file can be [found here](./src/resources/cdm.properties)
 2. Place the conf file where it can be accessed while running the job via spark-submit.
 3. Run the below job using `spark-submit` command as shown below:
 
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
 --class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
 ```
````
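For illustration, here is the command above with its placeholders substituted. The keyspace `shop`, table `orders`, and log file name are hypothetical, the jar version follows the `<revision>` bump in this commit's `pom.xml`, and `\` is used as the shell line-continuation character:

```
./spark-submit --properties-file cdm.properties \
  --conf spark.origin.keyspaceTable="shop.orders" \
  --master "local[*]" \
  --class datastax.astra.migrate.Migrate cassandra-data-migrator-3.4.0.jar &> logfile_name.txt
```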
````diff
@@ -39,7 +40,8 @@ Note:
 - Above command generates a log file `logfile_name.txt` to avoid log output on the console.
 - Add option `--driver-memory 25G --executor-memory 25G` as shown below if the table migrated is large (over 100GB)
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
 ```
````
````diff
@@ -49,7 +51,8 @@ Note:
 - To run the job in Data validation mode, use class option `--class datastax.astra.migrate.DiffData` as shown below
 
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
 --class datastax.astra.migrate.DiffData cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
 ```
````
````diff
@@ -79,7 +82,8 @@ Note:
 # Migrating specific partition ranges
 - You can also use the tool to migrate specific partition ranges using class option `--class datastax.astra.migrate.MigratePartitionsFromFile` as shown below
 ```
-./spark-submit --properties-file sparkConf.properties /
+./spark-submit --properties-file cdm.properties /
+--conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
 --master "local[*]" /
 --class datastax.astra.migrate.MigratePartitionsFromFile cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
 ```
````
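This mode reads its token ranges from a `partitions.csv` file, which the next hunk's context line references. As a minimal sketch, such a file can be written from the shell; the format assumed here is one `min,max` token pair per line (following the sample shipped at `./src/resources/partitions.csv`), and the values below are made up for illustration:

```
# two token ranges to (re)process, one min,max pair per line; values are illustrative
cat > partitions.csv << 'EOF'
-9223372036854775808,-4611686018427387905
-4611686018427387904,-1
EOF
```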
````diff
@@ -93,11 +97,23 @@ When running in above mode the tool assumes a `partitions.csv` file to be presen
 ```
 This mode is specifically useful to processes a subset of partition-ranges that may have failed during a previous run.
 
+# Perform large-field Guardrail violation checks
+- The tool can be used to identify large fields from a table that may break you cluster guardrails (e.g. AstraDB has a 10MB limit for a single large field) `--class datastax.astra.migrate.Guardrail` as shown below
+```
+./spark-submit --properties-file cdmGuardrail.properties /
+--conf spark.origin.keyspaceTable="<keyspace-name>.<table-name>" /
+--master "local[*]" /
+--class datastax.astra.migrate.Guardrail cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
+```
+> A sample Guardrail properties file can be [found here](./src/resources/cdmGuardrail.properties)
+
 # Features
+- Auto-detects table schema (column names, types, id fields, collections, UDTs, etc.)
 - Supports migration/validation of [Counter tables](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useCountersConcept.html)
 - Preserve [writetimes](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__retrieving-the-datetime-a-write-occurred-p) and [TTLs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/cql_commands/cqlSelect.html#cqlSelect__ref-select-ttl-p)
 - Supports migration/validation of advanced DataTypes ([Sets](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__set), [Lists](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__list), [Maps](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__map), [UDTs](https://docs.datastax.com/en/dse/6.8/cql/cql/cql_reference/refDataTypes.html#refDataTypes__udt))
 - Filter records from `Origin` using `writetimes` and/or CQL conditions and/or min/max token-range
+- Perform guardrail checks (identify large fields)
 - Supports adding `constants` as new columns on `Target`
 - Fully containerized (Docker and K8s friendly)
 - SSL Support (including custom cipher algorithms)
````
````diff
@@ -106,6 +122,9 @@ This mode is specifically useful to processes a subset of partition-ranges that
 - Validate migration accuracy and performance using a smaller randomized data-set
 - Supports adding custom fixed `writetime`
 
+# Known Limitations
+- This tool does not migrate `ttl` & `writetime` at the field-level (for optimization reasons). It instead finds the field with the highest `ttl` & the field with the highest `writetime` within an `origin` row and uses those values on the entire `target` row.
+
 # Building Jar for local development
 1. Clone this repo
 2. Move to the repo folder `cd cassandra-data-migrator`
````
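The build steps in the README context above follow the standard Maven flow. A sketch, assuming a JDK and Maven are installed; the jar name tracks the `<revision>` property, which this commit bumps to 3.4.0:

```
# clone, enter the repo, and build; the jar lands under target/
cd cassandra-data-migrator
mvn clean package
```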

pom.xml

Lines changed: 7 additions & 1 deletion
````diff
@@ -8,7 +8,7 @@
 
     <properties>
         <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-        <revision>3.2.3</revision>
+        <revision>3.4.0</revision>
         <scala.version>2.12.17</scala.version>
         <scala.main.version>2.12</scala.main.version>
         <spark.version>3.3.1</spark.version>
@@ -89,6 +89,12 @@
             <artifactId>log4j-to-slf4j</artifactId>
             <version>2.19.0</version>
         </dependency>
+        <dependency>
+            <groupId>org.projectlombok</groupId>
+            <artifactId>lombok</artifactId>
+            <version>1.18.26</version>
+            <scope>provided</scope>
+        </dependency>
 
         <!-- Test Dependencies -->
         <dependency>
````
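Lombok is added with `provided` scope, so it is needed at compile time only and is not bundled into the built jar. A quick, generic way to confirm how Maven resolves it, using the stock dependency plugin:

```
# show where lombok appears in the dependency tree and with what scope
mvn dependency:tree -Dincludes=org.projectlombok:lombok
```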
