54 changes: 54 additions & 0 deletions site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md
@@ -222,5 +222,59 @@ of outgrowing the maximum number of file descriptors the ArangoDB process
can open. Thus, these options should only be enabled on deployments with a
limited number of collections/shards/indexes.

## Client tools

### arangodump

#### Improved dump performance

ArangoDB 3.12 includes extended parallelization capabilities to work not only
at the collection level, but also at the shard level. In combination with the
new optimized format, database dumps are now created and restored quickly and
occupy minimal disk space. This major performance boost makes dumps five times
faster and restores three times faster, which is extremely useful when dealing
with large shards.
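
For illustration, a parallel dump might be invoked as follows. This is a
minimal sketch; the endpoint, database name, and output directory are
placeholders:

```bash
# Placeholders throughout; adjust the endpoint, database, and paths.
arangodump \
  --server.endpoint tcp://localhost:8529 \
  --server.database mydb \
  --output-directory dump \
  --threads 8
```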

**Contributor:**
I think "more quickly" would be better; it wasn't horribly slow before.

What are the changes to the format that make the dumps smaller on disk?

**Contributor Author (@nerpaula, Oct 19, 2023):**

> What are the changes to the format that make the dumps smaller on disk?

@jsteemann maybe you can clarify this?

**Contributor:**
I am also all for "more quickly" rather than "quickly". The old dump variant wasn't too slow, as Simran already mentioned. Additionally, while the new variant can make dump & restore quicker than the old variant, it is not a guarantee that we are always 5 times faster than before. So this should be rephrased to "up to several times faster" instead of giving a precise number.

The potential speedup can be achieved by the following factors (a combined invocation is sketched after the list):

- new dump variant, enabled via `--use-parallel-dump true` (which is now also the default). The new variant uses prefetching and parallelization on the server, so that the server proactively keeps producing more results for `arangodump` to fetch. This way, when an `arangodump` request comes in, the server can already respond with a ready-to-use result.
- optional: make `arangodump` write multiple output files per collection/shard. This can be enabled by setting the `--split-files` option to `true`. It is currently opt-in because dumps created with this option enabled cannot easily be restored into previous versions of ArangoDB. The file splitting allows better parallelization when writing the results to the output files, whereas writes to a single non-split file must be serialized. The serialization can easily become a bottleneck, especially when output files are gzip-compressed by `arangodump`.
- dumping the data in velocypack format instead of JSON. By setting the `--dump-vpack` option, the resulting dump data is stored in velocypack format rather than JSON. The velocypack format is normally more compact than JSON, so this option can reduce the output file size compared to JSON, even when compression is enabled. It can also lead to faster dumps, because less data needs to be shipped around and written. This is currently experimental and opt-in, because only `arangorestore` from 3.12 or higher is able to interpret and restore vpack dumps, and because there aren't many other tools that can read vpack data. So from the user's side it may be unwanted to produce dumps in a format that isn't widely supported by other tools. But the option should be mentioned for users who want the best dump performance and the smallest possible dumps.
- compressing the dump data on the server for transfer. By setting the `--compress-transfer` option to `true`, dump data can be compressed on the server for faster transfer. This is helpful especially if the network is slow or its capacity is maxed out. It won't make a difference otherwise.
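
A hedged sketch of an invocation that combines these factors (flag names
follow the description above; the endpoint and output directory are
placeholders):

```bash
# Sketch only. --use-parallel-dump is the new dump variant (the default);
# --split-files and --dump-vpack are opt-in, as described above;
# --compress-transfer compresses dump data on the server for transfer.
arangodump \
  --server.endpoint tcp://localhost:8529 \
  --output-directory dump \
  --use-parallel-dump true \
  --split-files true \
  --dump-vpack true \
  --compress-transfer true
```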

**Contributor:**
@jsteemann I haven't looked at the dump output but if --dump-vpack is used, can the arangovpack tool be used to convert (parts of) the dump from VPack to JSON, and does this work with reasonable speed and memory usage? If not, then we might want to create a ticket for that.

**Contributor:**
Unfortunately the `arangovpack` tool cannot be used for that. The reason is that `arangodump` with `--dump-vpack` produces individual batches of data, which are all valid velocypack, but these batches are all written to the same file, one after the other. Currently, `arangovpack` will either only handle the velocypack from the first batch of data, or even fail; I am not sure because I didn't test that. But I am sure we would need to augment `arangovpack` first. It is probably a good idea to do that anyway, but it hasn't been done yet.

#### Resource usage limits

The following startup options have been added to let you limit the resource
usage of parallel _arangodump_ invocations (a configuration sketch follows
the list):

- `--dump.max-memory-usage`: Maximum memory usage (in bytes) to be
used by the server-side parts of all ongoing _arangodump_ invocations.
This option can be used to limit the amount of memory for prefetching
and keeping results on the server side when _arangodump_ is invoked
with the `--parallel-dump` option. It has no effect for
_arangodump_ invocations that do not use the `--parallel-dump` option.
Note that the memory usage limit is not exact and can be
slightly exceeded in some situations to guarantee progress.
- `--dump.max-docs-per-batch`: Maximum number of documents per batch
that can be used in a dump. If an _arangodump_ invocation requests a
higher value than configured here, the value is automatically
capped to this maximum. This limit is only honored for _arangodump_
invocations that use the `--parallel-dump` option.
- `--dump.max-batch-size`: Maximum batch size value (in bytes) that
can be used in a dump. If an _arangodump_ invocation requests larger
batch sizes than configured here, the actual batch size is capped
to this value. This limit is only honored for _arangodump_ invocations
that use the `--parallel-dump` option.
- `--dump.max-parallelism`: Maximum parallelism (number of server-side
threads) that can be used in a dump. If an _arangodump_ invocation requests
a higher number of prefetch threads than configured here, the actual
number of server-side prefetch threads is capped to this value.
This limit is only honored for _arangodump_ invocations that use the
`--parallel-dump` option.
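
For illustration, a server startup sketch that applies all four limits. The
values are arbitrary placeholders, not recommendations:

```bash
# Placeholder values; these limits only affect arangodump invocations
# that use the --parallel-dump option.
arangod \
  --dump.max-memory-usage 536870912 \
  --dump.max-docs-per-batch 10000 \
  --dump.max-batch-size 16777216 \
  --dump.max-parallelism 8
```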

The following metrics have been added to observe the behavior of parallel
_arangodump_ operations on the server (a query example follows the list):

- `arangodb_dump_memory_usage`: Current memory usage of all ongoing
_arangodump_ operations on the server.
- `arangodb_dump_ongoing`: Number of currently ongoing _arangodump_
operations on the server.
- `arangodb_dump_threads_blocked_total`: Number of times a server-side
dump thread was blocked because it honored the server-side memory
limit for dumps.
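
A quick way to inspect these metrics, assuming a local server on the default
port and placeholder credentials, is to filter the metrics API output:

```bash
# Fetch the server metrics and keep only the dump-related ones.
curl -s --user root:password http://localhost:8529/_admin/metrics/v2 \
  | grep '^arangodb_dump'
```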

## Internal changes