
Commit ffc2c4f

nerpaula and Simran-B authored
DOC-557 | arangodump improved performance & resource usage limits (#295)
* add arangodump resource usage limits
* improved dump performance
* clarifications
* Apply suggestions from code review
* Review

Co-authored-by: Simran <[email protected]>
Co-authored-by: Simran Spiller <[email protected]>
1 parent 6ddf8fc commit ffc2c4f

File tree

2 files changed: +121 −7 lines changed

site/content/3.12/components/tools/arangodump/examples.md

Lines changed: 41 additions & 6 deletions
@@ -113,7 +113,7 @@ with these attributes:
 
 Document data for a collection is saved in files with name pattern
 `<collection-name>.data.json`. Each line in a data file is a document insertion/update or
-deletion marker, alongside with some meta data.
+deletion marker.
 
 ## Cluster Backup
 
@@ -213,12 +213,14 @@ RocksDB encryption-at-rest feature.
 
 ## Compression
 
-`--compress-output`
+The size of dumps can be reduced using compression, both for storage and for the
+data transfer.
 
-Data can optionally be dumped in a compressed format to save space on disk.
-The `--compress-output` option cannot be used together with [Encryption](#encryption).
+You can optionally store data in a compressed format to save space on disk with
+the `--compress-output` startup option. It cannot be used together with
+[Encryption](#encryption).
 
-If compression is enabled, no `.data.json` files are written. Instead, the
+If output compression is enabled, no `.data.json` files are written. Instead, the
 collection data gets compressed using the Gzip algorithm and for each collection
 a `.data.json.gz` file is written. Metadata files such as `.structure.json` and
 `.view.json` do not get compressed.
@@ -234,13 +236,39 @@ detects whether the data is compressed or not based on the file extension.
 arangorestore --input-directory "dump"
 ```
 
+You can optionally let the server compress the data for the network transfer
+with the `--compress-transfer` startup option. This can reduce the traffic and
+thus save time and money.
+
+The data is automatically decompressed on the client side. You can use the option
+independently of the `--compress-output` option, which controls whether the dump
+is stored compressed but does not affect the transfer size.
+
+```
+arangodump --output-directory "dump" --compress-transfer --compress-output false
+```
+
+{{< comment >}} Experimental feature in 3.12
+## Storage format
+
+The default output format for dumps is JSON.
+
+To achieve the best dump performance and the smallest data dumps in terms of
+size, you can enable the `--dump-vpack` startup option. The resulting dump data
+is then stored in the more compact but binary [VelocyPack](https://github.com/arangodb/velocypack)
+format instead of the text-based JSON format. The output file size can be smaller
+even compared to compressed JSON. It can also lead to faster dumps because there
+is less data to transfer and no conversion from the server-internal VelocyPack
+format to JSON is needed.
+{{< /comment >}}
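The commented-out storage-format section above could be exercised with an invocation like the following. This is only a sketch: the endpoint is a placeholder, and `--dump-vpack` is experimental in 3.12.

```shell
# Hypothetical invocation: dump in the binary VelocyPack format instead of JSON.
# --dump-vpack is experimental; endpoint and directory are placeholders.
arangodump \
  --server.endpoint tcp://127.0.0.1:8529 \
  --output-directory "dump" \
  --dump-vpack true
```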
+
 ## Threads
 
 _arangodump_ can use multiple threads for dumping database data in
 parallel. To speed up the dump of a database with multiple collections, it is
 often beneficial to increase the number of _arangodump_ threads.
 The number of threads can be controlled via the `--threads` option. The default
-value was changed from `2` to the maximum of `2` and the number of available CPU cores.
+value is the maximum of `2` and the number of available CPU cores.
 
 The `--threads` option works dynamically, its value depends on the number of
 available CPU cores. If the amount of available CPU cores is less than `3`, a
@@ -267,3 +295,10 @@ file should be expected. Also note that when dumping the data of multiple shards
 from the same collection, each thread's results are written to the result
 file in a non-deterministic order. This should not be a problem when restoring
 such dump, as _arangorestore_ does not assume any order of input.
+
+From v3.12.0 onward, you can make _arangodump_ write multiple output files per
+collection/shard. This file splitting allows for better parallelization when
+writing the results to disk; with non-split files, writes must be serialized.
+You can enable it with the `--split-files` startup option. It is disabled by
+default because dumps created with this option enabled cannot be restored into
+previous versions of ArangoDB.
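The `--split-files` paragraph above could be tried as follows (a sketch; the output directory is a placeholder, and remember that the resulting dump cannot be restored into pre-3.12 versions):

```shell
# Hypothetical invocation: allow multiple output files per collection/shard
# for better write parallelization. Disabled by default.
arangodump \
  --output-directory "dump" \
  --split-files true
```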

site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md

Lines changed: 80 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -232,7 +232,7 @@ can be mixed and written into the same .sst files.
 
 When these options are enabled, the RocksDB compaction is more efficient since
 a lot of different collections/shards/indexes are written to in parallel.
-The disavantage of enabling these options is that there can be more .sst
+The disadvantage of enabling these options is that there can be more .sst
 files than when the option is turned off, and the disk space used by
 these .sst files can be higher.
 In particular, on deployments with many collections/shards/indexes
@@ -241,5 +241,84 @@ of outgrowing the maximum number of file descriptors the ArangoDB process
 can open. Thus, these options should only be enabled on deployments with a
 limited number of collections/shards/indexes.
 
+## Client tools
+
+### arangodump
+
+#### Improved dump performance and size
+
+From version 3.12 onward, _arangodump_ has extended parallelization capabilities
+to work not only at the collection level, but also at the shard level.
+In combination with the newly added support for the VelocyPack format that
+ArangoDB uses internally, database dumps can now be created and restored more
+quickly and occupy less disk space. This major performance boost makes dumps and
+restores up to several times faster, which is extremely useful when dealing
+with large shards.
+
+- Whether the new parallel dump variant is used is controlled by the newly added
+  `--use-parallel-dump` startup option. The default value is `true`.
+
+- To achieve the best dump performance and the smallest data dumps in terms of
+  size, you can additionally use the `--dump-vpack` option. The resulting dump
+  data is then stored in the more compact but binary VelocyPack format instead
+  of the text-based JSON format. The output file size can be smaller even
+  compared to compressed JSON. It can also lead to faster dumps because there is
+  less data to transfer and no conversion from the server-internal format
+  (VelocyPack) to JSON is needed. Note, however, that this option is
+  **experimental** and disabled by default.
+
+- Optionally, you can make _arangodump_ write multiple output files per
+  collection/shard. This file splitting allows for better parallelization when
+  writing the results to disk; with non-split files, writes must be serialized.
+  You can enable it by setting the `--split-files` option to `true`. This option
+  is disabled by default because dumps created with this option enabled cannot
+  be restored into previous versions of ArangoDB.
+
+- You can enable the new `--compress-transfer` startup option for compressing the
+  dump data on the server for a faster transfer. This is especially helpful if
+  the network is slow or its capacity is maxed out. The data is decompressed on
+  the client side and recompressed if you enable the `--compress-output` option.
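Taken together, the bullet points above might translate into an invocation like the following. This is an illustrative sketch only: the option combination and output directory are examples, and `--dump-vpack` (not shown) would additionally require opting into the experimental format.

```shell
# Hypothetical invocation combining the new 3.12 dump options:
# parallel shard-level dumping, split output files, and compression
# for both the network transfer and the stored dump.
arangodump \
  --output-directory "dump" \
  --use-parallel-dump true \
  --split-files true \
  --compress-transfer true \
  --compress-output true
```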
+
+#### Resource usage limits and metrics
+
+The following `arangod` startup options can be used to limit
+the resource usage of parallel _arangodump_ invocations:
+
+- `--dump.max-memory-usage`: Maximum memory usage (in bytes) to be
+  used by the server-side parts of all ongoing _arangodump_ invocations.
+  This option can be used to limit the amount of memory for prefetching
+  and keeping results on the server side when _arangodump_ is invoked
+  with the `--parallel-dump` option. It does not have an effect for
+  _arangodump_ invocations that do not use the `--parallel-dump` option.
+  Note that the memory usage limit is not exact and that it can be
+  slightly exceeded in some situations to guarantee progress.
+- `--dump.max-docs-per-batch`: Maximum number of documents per batch
+  that can be used in a dump. If an _arangodump_ invocation requests a
+  higher value than configured here, the value is automatically
+  capped to this maximum. It is only enforced for _arangodump_
+  invocations that use the `--parallel-dump` option.
+- `--dump.max-batch-size`: Maximum batch size value (in bytes) that
+  can be used in a dump. If an _arangodump_ invocation requests larger
+  batch sizes than configured here, the actual batch sizes are capped
+  to this value. It is only enforced for _arangodump_ invocations that
+  use the `--parallel-dump` option.
+- `--dump.max-parallelism`: Maximum parallelism (number of server-side
+  threads) that can be used in a dump. If an _arangodump_ invocation requests
+  a higher number of prefetch threads than configured here, the actual
+  number of server-side prefetch threads is capped to this value.
+  It is only enforced for _arangodump_ invocations that use the
+  `--parallel-dump` option.
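The server-side limits listed above might be set like this when starting `arangod`. The values are illustrative placeholders, not recommendations:

```shell
# Hypothetical arangod startup flags capping the resource usage of
# parallel arangodump invocations (512 MiB memory, 10k docs/batch,
# 16 MiB batches, at most 4 server-side prefetch threads):
arangod \
  --dump.max-memory-usage 536870912 \
  --dump.max-docs-per-batch 10000 \
  --dump.max-batch-size 16777216 \
  --dump.max-parallelism 4
```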
+
+The following metrics have been added to observe the behavior of parallel
+_arangodump_ operations on the server:
+
+- `arangodb_dump_memory_usage`: Current memory usage of all ongoing
+  _arangodump_ operations on the server.
+- `arangodb_dump_ongoing`: Number of currently ongoing _arangodump_
+  operations on the server.
+- `arangodb_dump_threads_blocked_total`: Number of times a server-side
+  dump thread was blocked because it honored the server-side memory
+  limit for dumps.
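These metrics can be read from the server's regular metrics endpoint while a dump is running. A sketch, assuming a local server and the standard `/_admin/metrics/v2` Prometheus-format endpoint; host, port, and authentication are placeholders:

```shell
# Hypothetical: fetch the Prometheus-format metrics and filter the
# dump-related gauges and counters added in this release.
curl -s http://127.0.0.1:8529/_admin/metrics/v2 | grep '^arangodb_dump'
```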
+
 ## Internal changes
 