
Commit d5870fe (parent: 6ef7290)

DOC-4012 (#150)

* Updates on how to manage memory configuration for large workloads
* Fixed typo
* Update docs-src/zdm-core/modules/migrate/pages/deployment-infrastructure.adoc (Jamie's suggestion)

Co-authored-by: Jamie Gillenwater <[email protected]>

File tree: 2 files changed (+15, -15 lines)

docs-src/zdm-core/modules/migrate/pages/cassandra-data-migrator.adoc

Lines changed: 6 additions & 14 deletions
@@ -67,22 +67,14 @@ The fat jar (`cassandra-data-migrator-x.y.z.jar`) file should be present now in
 ----
 ./spark-submit --properties-file cdm.properties /
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
---master "local[*]" /
+--master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ----
 
 [TIP]
 ====
-* The `spark-submit` command generates a log file, `logfile_name_*.txt`, to avoid log output on the Terminal console.
-* If the table you're migrating is large (such as over 100GB), you can add the option `--driver-memory 25G --executor-memory 25G`. Example:
-
-[source,bash]
-----
-./spark-submit --properties-file cdm.properties /
---conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
---master "local[*]" --driver-memory 25G --executor-memory 25G /
---class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
-----
+* The above command generates a log file, `logfile_name_*.txt`, to avoid log output on the console.
+* Update the memory options (`--driver-memory` and `--executor-memory`) based on your use case.
 ====
 
 [[cdm-validation-steps]]
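The migrate invocation above can be wrapped in a small script so the sizing is explicit and repeatable. This is a minimal sketch: `KEYSPACE`, `TABLE`, and the 25G memory values are placeholders to tune for your environment, and the script only prints the command (a dry run) rather than executing it.

```shell
#!/usr/bin/env bash
# Sketch of the migrate invocation from the docs above, as a reusable script.
# KEYSPACE and TABLE are hypothetical; tune the memory flags to your data size.
KEYSPACE="my_keyspace"
TABLE="my_table"
LOGFILE="logfile_name_$(date +%Y%m%d_%H_%M).txt"

CMD=(./spark-submit --properties-file cdm.properties
     --conf "spark.cdm.schema.origin.keyspaceTable=${KEYSPACE}.${TABLE}"
     --master 'local[*]' --driver-memory 25G --executor-memory 25G
     --class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar)

# Dry run: print the command instead of running it; remove the echo to execute.
echo "${CMD[*]} &> ${LOGFILE}"
```

Building the command as a bash array keeps the quoting of values like `local[*]` intact when the command is eventually executed.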
@@ -95,7 +87,7 @@ Example:
 ----
 ./spark-submit --properties-file cdm.properties /
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
---master "local[*]" /
+--master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ----
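Since the `DiffData` run above redirects all output into the log file, a quick post-run check of that log is useful. This sketch assumes mismatches surface as `ERROR`/`mismatch` lines; the exact wording varies by Cassandra Data Migrator version, so adjust the pattern to match your logs. The sample log written here is purely illustrative.

```shell
#!/usr/bin/env bash
# Sketch: scan a DiffData run's log for reported problems.
# Assumption: mismatches appear as ERROR/mismatch lines; the exact
# marker differs across CDM versions, so adapt the grep pattern.
LOG="cdm_validation_sample.txt"   # hypothetical sample log for this sketch
printf 'INFO  run started\nERROR mismatch in partition 42\n' > "$LOG"

if grep -qiE 'error|mismatch' "$LOG"; then
  echo "validation found differences -- inspect $LOG"
else
  echo "no differences reported"
fi
```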

@@ -157,7 +149,7 @@ Example:
 ./spark-submit --properties-file cdm.properties /
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
 --conf spark.cdm.tokenRange.partitionFile="/<path-to-file>/<csv-input-filename>" /
---master "local[*]" /
+--master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ----
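One way to produce the partition file consumed by `spark.cdm.tokenRange.partitionFile` is to split the full Murmur3 token range into equal slices. The min,max-per-line CSV layout used here is an assumption; confirm the expected format against the Cassandra Data Migrator documentation for your version.

```shell
#!/usr/bin/env bash
# Sketch: split the full Murmur3 token range (-2^63 .. 2^63-1) into four
# equal slices, one "min,max" line per slice (the CSV layout is an assumption).
SLICES=4
STEP=$(( 1 << 62 ))                 # full 2^64 range divided by 4
lo=$(( -9223372036854775807 - 1 ))  # -2^63, built without a literal overflow
for (( i = 0; i < SLICES; i++ )); do
  hi=$(( lo + STEP - 1 ))
  echo "${lo},${hi}"
  lo=$(( hi + 1 ))                  # wraps after the last slice; loop ends first
done > partitions.csv
cat partitions.csv
```

Computing each slice incrementally (`lo = hi + 1`) avoids the signed 64-bit overflow that `min + i * STEP` would hit for the later slices.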

@@ -178,7 +170,7 @@ Use {cstar-data-migrator} to identify large fields from a table that may break y
 ./spark-submit --properties-file cdm.properties /
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" /
 --conf spark.cdm.feature.guardrail.colSizeInKB=10000 /
---master "local[*]" /
+--master "local[*]" --driver-memory 25G --executor-memory 25G /
 --class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
 ----

docs-src/zdm-core/modules/migrate/pages/deployment-infrastructure.adoc

Lines changed: 9 additions & 1 deletion
@@ -51,7 +51,15 @@ We will use the term "machine" to indicate a cloud instance (on any cloud provid
 
 [NOTE]
 ====
-Scenario: when you have close to 12 TBs of data and several tables, to speed up the migration of your existing data, you can run with (for example) 4 machines that are the equivalent of an AWS `m5.4xlarge`, a GCP `e2-standard-16` or an Azure `D16v5`. Then run {dsbulk-migrator} on each machine, with each one responsible for a quarter of the full token range.
+* Scenario: If you have 20 TB of existing data to migrate and want to speed up the migration, you can use multiple VMs. For example, use four VMs that are the equivalent of an AWS `m5.4xlarge`, a GCP `e2-standard-16`, or an Azure `D16v5`.
++
+Next, run DSBulk Migrator or Cassandra Data Migrator in parallel on each VM, with each one responsible for migrating around 5 TB of data. If one table is very large (for example, 15 TB of the 20 TB is in a single table), you can migrate that table in three parts on three separate VMs in parallel by splitting the full token range into three parts, and migrate the remaining tables on the fourth VM.
+
+* Ensure that your Origin and Target clusters can handle the high traffic from Cassandra Data Migrator or DSBulk in addition to the live traffic from your application.
+
+* Test any migration in a lower environment before you run it in production.
+
+* Contact https://support.datastax.com/s/[DataStax support] for help configuring your workload.
 ====
 
 // TODO: investigate how to "leverage the parallelism of {cstar-data-migrator} to run the migration process across all 4 machines."
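The arithmetic behind the 20 TB scenario in the note above can be checked with a quick back-of-envelope script; all figures come from the example itself, not from a sizing recommendation.

```shell
#!/usr/bin/env bash
# Back-of-envelope split for the scenario above: 20 TB total, a 15 TB
# table spread across three VMs, remaining tables on the fourth VM.
TOTAL_TB=20
BIG_TABLE_TB=15
BIG_TABLE_VMS=3

per_vm_gb=$(( BIG_TABLE_TB * 1024 / BIG_TABLE_VMS ))
rest_gb=$(( (TOTAL_TB - BIG_TABLE_TB) * 1024 ))
echo "big table share per VM:      ${per_vm_gb} GB"
echo "remaining tables on last VM: ${rest_gb} GB"
```

With these numbers each VM, including the fourth, ends up responsible for roughly 5 TB, which is why the note describes the split as balanced.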
