
Commit ffc2c4f

nerpaula and Simran-B authored
DOC-557 | arangodump improved performance & resource usage limits (#295)
* add arangodump resource usage limits
* improved dump performance
* clarifications
* Apply suggestions from code review
* Review

Co-authored-by: Simran <[email protected]>
Co-authored-by: Simran Spiller <[email protected]>
1 parent 6ddf8fc commit ffc2c4f

File tree

2 files changed: +121 −7 lines changed

site/content/3.12/components/tools/arangodump/examples.md

Lines changed: 41 additions & 6 deletions
@@ -113,7 +113,7 @@ with these attributes:
 
 Document data for a collection is saved in files with name pattern
 `<collection-name>.data.json`. Each line in a data file is a document insertion/update or
-deletion marker, alongside with some meta data.
+deletion marker.
 
 ## Cluster Backup
 
@@ -213,12 +213,14 @@ RocksDB encryption-at-rest feature.
 
 ## Compression
 
-`--compress-output`
+The size of dumps can be reduced using compression, both for storage and for the
+data transfer.
 
-Data can optionally be dumped in a compressed format to save space on disk.
-The `--compress-output` option cannot be used together with [Encryption](#encryption).
+You can optionally store data in a compressed format to save space on disk with
+the `--compress-output` startup option. It cannot be used together with
+[Encryption](#encryption).
 
-If compression is enabled, no `.data.json` files are written. Instead, the
+If output compression is enabled, no `.data.json` files are written. Instead, the
 collection data gets compressed using the Gzip algorithm and for each collection
 a `.data.json.gz` file is written. Metadata files such as `.structure.json` and
 `.view.json` do not get compressed.
@@ -234,13 +236,39 @@ detects whether the data is compressed or not based on the file extension.
 arangorestore --input-directory "dump"
 ```
 
+You can optionally let the server compress the data for the network transfer
+with the `--compress-transfer` startup option. This can reduce the traffic and
+thus save time and money.
+
+The data is automatically decompressed on the client side. You can use the option
+independently of the `--compress-output` option, which controls whether the dump
+is stored compressed but does not affect the transfer size.
+
+```
+arangodump --output-directory "dump" --compress-transfer --compress-output false
+```
+
+{{< comment >}} Experimental feature in 3.12
+## Storage format
+
+The default output format for dumps is JSON.
+
+To achieve the best dump performance and the smallest data dumps in terms of
+size, you can enable the `--dump-vpack` startup option. The resulting dump data
+is then stored in the more compact but binary [VelocyPack](https://github.com/arangodb/velocypack)
+format instead of the text-based JSON format. The output file size can be smaller
+even compared to compressed JSON. It can also lead to faster dumps because there
+is less data to transfer and no conversion from the server-internal VelocyPack
+format to JSON is needed.
+{{< /comment >}}
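The commented-out storage-format section above could be exercised with an invocation like the following. This is only a sketch: the endpoint is a placeholder, and `--dump-vpack` is experimental in 3.12.

```shell
# Hypothetical invocation: dump in the binary VelocyPack format instead of JSON.
# --dump-vpack is experimental; endpoint and directory are placeholders.
arangodump \
  --server.endpoint tcp://127.0.0.1:8529 \
  --output-directory "dump" \
  --dump-vpack true
```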
+
 ## Threads
 
 _arangodump_ can use multiple threads for dumping database data in
 parallel. To speed up the dump of a database with multiple collections, it is
 often beneficial to increase the number of _arangodump_ threads.
 The number of threads can be controlled via the `--threads` option. The default
-value was changed from `2` to the maximum of `2` and the number of available CPU cores.
+value is the maximum of `2` and the number of available CPU cores.
 
 The `--threads` option works dynamically, its value depends on the number of
 available CPU cores. If the amount of available CPU cores is less than `3`, a
@@ -267,3 +295,10 @@ file should be expected. Also note that when dumping the data of multiple shards
 from the same collection, each thread's results are written to the result
 file in a non-deterministic order. This should not be a problem when restoring
 such dump, as _arangorestore_ does not assume any order of input.
+
+From v3.12.0 onward, you can make _arangodump_ write multiple output files per
+collection/shard. This file splitting allows for better parallelization when
+writing the results to disk; with non-split files, writes must be serialized.
+You can enable it with the `--split-files` startup option. It is disabled by
+default because dumps created with this option enabled cannot be restored into
+previous versions of ArangoDB.
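The `--split-files` paragraph above could be tried as follows (a sketch; the output directory is a placeholder, and remember that the resulting dump cannot be restored into pre-3.12 versions):

```shell
# Hypothetical invocation: allow multiple output files per collection/shard
# for better write parallelization. Disabled by default.
arangodump \
  --output-directory "dump" \
  --split-files true
```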

site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md

Lines changed: 80 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -232,7 +232,7 @@ can be mixed and written into the same .sst files.
 
 When these options are enabled, the RocksDB compaction is more efficient since
 a lot of different collections/shards/indexes are written to in parallel.
-The disavantage of enabling these options is that there can be more .sst
+The disadvantage of enabling these options is that there can be more .sst
 files than when the option is turned off, and the disk space used by
 these .sst files can be higher.
 In particular, on deployments with many collections/shards/indexes
@@ -241,5 +241,84 @@ of outgrowing the maximum number of file descriptors the ArangoDB process
 can open. Thus, these options should only be enabled on deployments with a
 limited number of collections/shards/indexes.
 
+## Client tools
+
+### arangodump
+
+#### Improved dump performance and size
+
+From version 3.12 onward, _arangodump_ has extended parallelization capabilities
+to work not only at the collection level, but also at the shard level.
+In combination with the newly added support for the VelocyPack format that
+ArangoDB uses internally, database dumps can now be created and restored more
+quickly and occupy less disk space. This major performance boost makes dumps and
+restores up to several times faster, which is extremely useful when dealing
+with large shards.
+
+- Whether the new parallel dump variant is used is controlled by the newly added
+  `--use-parallel-dump` startup option. The default value is `true`.
+
+- To achieve the best dump performance and the smallest data dumps in terms of
+  size, you can additionally use the `--dump-vpack` option. The resulting dump
+  data is then stored in the more compact but binary VelocyPack format instead
+  of the text-based JSON format. The output file size can be smaller even
+  compared to compressed JSON. It can also lead to faster dumps because there is
+  less data to transfer and no conversion from the server-internal format
+  (VelocyPack) to JSON is needed. Note, however, that this option is
+  **experimental** and disabled by default.
+
+- Optionally, you can make _arangodump_ write multiple output files per
+  collection/shard. This file splitting allows for better parallelization when
+  writing the results to disk; with non-split files, writes must be serialized.
+  You can enable it by setting the `--split-files` option to `true`. This option
+  is disabled by default because dumps created with this option enabled cannot
+  be restored into previous versions of ArangoDB.
+
+- You can enable the new `--compress-transfer` startup option for compressing the
+  dump data on the server for a faster transfer. This is especially helpful if
+  the network is slow or its capacity is maxed out. The data is decompressed on
+  the client side and recompressed if you enable the `--compress-output` option.
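Taken together, the bullet points above might translate into an invocation like the following. This is an illustrative sketch only: the option combination and output directory are examples, and `--dump-vpack` (not shown) would additionally require opting into the experimental format.

```shell
# Hypothetical invocation combining the new 3.12 dump options:
# parallel shard-level dumping, split output files, and compression
# for both the network transfer and the stored dump.
arangodump \
  --output-directory "dump" \
  --use-parallel-dump true \
  --split-files true \
  --compress-transfer true \
  --compress-output true
```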
+
+#### Resource usage limits and metrics
+
+The following `arangod` startup options can be used to limit
+the resource usage of parallel _arangodump_ invocations:
+
+- `--dump.max-memory-usage`: Maximum memory usage (in bytes) to be
+  used by the server-side parts of all ongoing _arangodump_ invocations.
+  This option can be used to limit the amount of memory for prefetching
+  and keeping results on the server side when _arangodump_ is invoked
+  with the `--parallel-dump` option. It does not have an effect for
+  _arangodump_ invocations that do not use the `--parallel-dump` option.
+  Note that the memory usage limit is not exact and that it can be
+  slightly exceeded in some situations to guarantee progress.
+- `--dump.max-docs-per-batch`: Maximum number of documents per batch
+  that can be used in a dump. If an _arangodump_ invocation requests a
+  higher value than configured here, the value is automatically
+  capped to this maximum. It is only enforced for _arangodump_
+  invocations that use the `--parallel-dump` option.
+- `--dump.max-batch-size`: Maximum batch size value (in bytes) that
+  can be used in a dump. If an _arangodump_ invocation requests larger
+  batch sizes than configured here, the actual batch sizes are capped
+  to this value. It is only enforced for _arangodump_ invocations that
+  use the `--parallel-dump` option.
+- `--dump.max-parallelism`: Maximum parallelism (number of server-side
+  threads) that can be used in a dump. If an _arangodump_ invocation requests
+  a higher number of prefetch threads than configured here, the actual
+  number of server-side prefetch threads is capped to this value.
+  It is only enforced for _arangodump_ invocations that use the
+  `--parallel-dump` option.
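The server-side limits listed above might be set like this when starting `arangod`. The values are illustrative placeholders, not recommendations:

```shell
# Hypothetical arangod startup flags capping the resource usage of
# parallel arangodump invocations (512 MiB memory, 10k docs/batch,
# 16 MiB batches, at most 4 server-side prefetch threads):
arangod \
  --dump.max-memory-usage 536870912 \
  --dump.max-docs-per-batch 10000 \
  --dump.max-batch-size 16777216 \
  --dump.max-parallelism 4
```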
+
+The following metrics have been added to observe the behavior of parallel
+_arangodump_ operations on the server:
+
+- `arangodb_dump_memory_usage`: Current memory usage of all ongoing
+  _arangodump_ operations on the server.
+- `arangodb_dump_ongoing`: Number of currently ongoing _arangodump_
+  operations on the server.
+- `arangodb_dump_threads_blocked_total`: Number of times a server-side
+  dump thread was blocked because it honored the server-side memory
+  limit for dumps.
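These metrics can be read from the server's regular metrics endpoint while a dump is running. A sketch, assuming a local server and the standard `/_admin/metrics/v2` Prometheus-format endpoint; host, port, and authentication are placeholders:

```shell
# Hypothetical: fetch the Prometheus-format metrics and filter the
# dump-related gauges and counters added in this release.
curl -s http://127.0.0.1:8529/_admin/metrics/v2 | grep '^arangodb_dump'
```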
+
 ## Internal changes
 