
Commit d5870fe (parent: 6ef7290)

DOC-4012 (#150)

* Updates on how to manage memory configuration for large workloads
* Fixed typo
* Update docs-src/zdm-core/modules/migrate/pages/deployment-infrastructure.adoc (Jamie's suggestion)

Co-authored-by: Jamie Gillenwater <[email protected]>

File tree: 2 files changed (+15, -15 lines)

docs-src/zdm-core/modules/migrate/pages/cassandra-data-migrator.adoc

Lines changed: 6 additions & 14 deletions
@@ -67,22 +67,14 @@ The fat jar (`cassandra-data-migrator-x.y.z.jar`) file should be present now in
 ----
 ./spark-submit --properties-file cdm.properties /
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
---master "local[*]" /
+--master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ----
 
 [TIP]
 ====
-* The `spark-submit` command generates a log file, `logfile_name_*.txt`, to avoid log output on the Terminal console.
-* If the table you're migrating is large (such as over 100GB), you can add the option `--driver-memory 25G --executor-memory 25G`. Example:
-
-[source,bash]
-----
-./spark-submit --properties-file cdm.properties /
---conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
---master "local[*]" --driver-memory 25G --executor-memory 25G /
---class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
-----
+* The above command generates a log file, `logfile_name_*.txt`, to avoid log output on the console.
+* Update the memory options (`--driver-memory` and `--executor-memory`) based on your use case.
 ====
 
 [[cdm-validation-steps]]
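The migrate invocation above can be wrapped in a small script so the sizing is explicit and repeatable. This is a minimal sketch: `KEYSPACE`, `TABLE`, and the 25G memory values are placeholders to tune for your environment, and the script only prints the command (a dry run) rather than executing it.

```shell
#!/usr/bin/env bash
# Sketch of the migrate invocation from the docs above, as a reusable script.
# KEYSPACE and TABLE are hypothetical; tune the memory flags to your data size.
KEYSPACE="my_keyspace"
TABLE="my_table"
LOGFILE="logfile_name_$(date +%Y%m%d_%H_%M).txt"

CMD=(./spark-submit --properties-file cdm.properties
     --conf "spark.cdm.schema.origin.keyspaceTable=${KEYSPACE}.${TABLE}"
     --master 'local[*]' --driver-memory 25G --executor-memory 25G
     --class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar)

# Dry run: print the command instead of running it; remove the echo to execute.
echo "${CMD[*]} &> ${LOGFILE}"
```

Building the command as a bash array keeps the quoting of values like `local[*]` intact when the command is eventually executed.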
@@ -95,7 +87,7 @@ Example:
 ----
 ./spark-submit --properties-file cdm.properties /
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
---master "local[*]" /
+--master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ----
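Since the `DiffData` run above redirects all output into the log file, a quick post-run check of that log is useful. This sketch assumes mismatches surface as `ERROR`/`mismatch` lines; the exact wording varies by Cassandra Data Migrator version, so adjust the pattern to match your logs. The sample log written here is purely illustrative.

```shell
#!/usr/bin/env bash
# Sketch: scan a DiffData run's log for reported problems.
# Assumption: mismatches appear as ERROR/mismatch lines; the exact
# marker differs across CDM versions, so adapt the grep pattern.
LOG="cdm_validation_sample.txt"   # hypothetical sample log for this sketch
printf 'INFO  run started\nERROR mismatch in partition 42\n' > "$LOG"

if grep -qiE 'error|mismatch' "$LOG"; then
  echo "validation found differences -- inspect $LOG"
else
  echo "no differences reported"
fi
```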

@@ -157,7 +149,7 @@ Example:
 ./spark-submit --properties-file cdm.properties /
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
 --conf spark.cdm.tokenRange.partitionFile="/<path-to-file>/<csv-input-filename>" /
---master "local[*]" /
+--master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ----
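One way to produce the partition file consumed by `spark.cdm.tokenRange.partitionFile` is to split the full Murmur3 token range into equal slices. The min,max-per-line CSV layout used here is an assumption; confirm the expected format against the Cassandra Data Migrator documentation for your version.

```shell
#!/usr/bin/env bash
# Sketch: split the full Murmur3 token range (-2^63 .. 2^63-1) into four
# equal slices, one "min,max" line per slice (the CSV layout is an assumption).
SLICES=4
STEP=$(( 1 << 62 ))                 # full 2^64 range divided by 4
lo=$(( -9223372036854775807 - 1 ))  # -2^63, built without a literal overflow
for (( i = 0; i < SLICES; i++ )); do
  hi=$(( lo + STEP - 1 ))
  echo "${lo},${hi}"
  lo=$(( hi + 1 ))                  # wraps after the last slice; loop ends first
done > partitions.csv
cat partitions.csv
```

Computing each slice incrementally (`lo = hi + 1`) avoids the signed 64-bit overflow that `min + i * STEP` would hit for the later slices.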

@@ -178,7 +170,7 @@ Use {cstar-data-migrator} to identify large fields from a table that may break y
 ./spark-submit --properties-file cdm.properties /
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
 --conf spark.cdm.feature.guardrail.colSizeInKB=10000 /
---master "local[*]" /
+--master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ----

docs-src/zdm-core/modules/migrate/pages/deployment-infrastructure.adoc

Lines changed: 9 additions & 1 deletion
@@ -51,7 +51,15 @@ We will use the term "machine" to indicate a cloud instance (on any cloud provid
 
 [NOTE]
 ====
-Scenario: when you have close to 12 TBs of data and several tables, to speed up the migration of your existing data, you can run with (for example) 4 machines that are the equivalent of an AWS `m5.4xlarge`, a GCP `e2-standard-16` or an Azure `D16v5`. Then run {dsbulk-migrator} on each machine, with each one responsible for a quarter of the full token range.
+* Scenario: If you have 20 TB of existing data to migrate and want to speed up the migration, you can use multiple VMs. For example, use four VMs that are the equivalent of an AWS `m5.4xlarge`, a GCP `e2-standard-16`, or an Azure `D16v5`.
++
+Next, run DSBulk Migrator or Cassandra Data Migrator in parallel on each VM, with each one responsible for migrating around 5 TB of data. If one table is very large (for example, 15 TB of the 20 TB is in a single table), you can migrate that table in three parts on three separate VMs in parallel by splitting the full token range into three parts, and migrate the remaining tables on the fourth VM.
+
+* Ensure that your Origin and Target clusters can handle the high traffic from Cassandra Data Migrator or DSBulk in addition to the live traffic from your application.
+
+* Test any migration in a lower environment before you run it in production.
+
+* Contact https://support.datastax.com/s/[DataStax support] for help configuring your workload.
 ====
 
 // TODO: investigate how to "leverage the parallelism of {cstar-data-migrator} to run the migration process across all 4 machines."
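The arithmetic behind the 20 TB scenario in the note above can be checked with a quick back-of-envelope script; all figures come from the example itself, not from a sizing recommendation.

```shell
#!/usr/bin/env bash
# Back-of-envelope split for the scenario above: 20 TB total, a 15 TB
# table spread across three VMs, remaining tables on the fourth VM.
TOTAL_TB=20
BIG_TABLE_TB=15
BIG_TABLE_VMS=3

per_vm_gb=$(( BIG_TABLE_TB * 1024 / BIG_TABLE_VMS ))
rest_gb=$(( (TOTAL_TB - BIG_TABLE_TB) * 1024 ))
echo "big table share per VM:      ${per_vm_gb} GB"
echo "remaining tables on last VM: ${rest_gb} GB"
```

With these numbers each VM, including the fourth, ends up responsible for roughly 5 TB, which is why the note describes the split as balanced.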
