From 060d0c2ce075f9cd30a53ade5409c9e4e3683d8a Mon Sep 17 00:00:00 2001
From: Gavin Halliday
Date: Fri, 9 Jan 2026 11:32:20 +0000
Subject: [PATCH] HPCC-35580 Further updates to the index documentation

Signed-off-by: Gavin Halliday
---
 devdoc/indexformats.md | 162 +++++++++++++++++++++++++++++------------
 1 file changed, 117 insertions(+), 45 deletions(-)

diff --git a/devdoc/indexformats.md b/devdoc/indexformats.md
index 783354552ed..b262fe0d04f 100644
--- a/devdoc/indexformats.md
+++ b/devdoc/indexformats.md
@@ -31,6 +31,8 @@ The size on disk is interesting for a few reasons:
- The size of Roxie clusters is often determined by the size of the data.\
  If the data shrinks, then the Roxie cluster could run on less expensive compute. For some single-node environments enabling all the data to fit on the local disk has significant cost and deployment advantages.
+See the further discussion in the compression ratio section below.
+
#### Effect on performance
The effect of disk size on performance is less clear.
@@ -43,7 +45,7 @@ The effect is highly dependent on the pattern of data access. If there is an ex
### Decompression time and compression ratio
-Each time a node is read from disk it needs to be decompressed before it can be used. Here are some ballpark ideas of the relative performance of some different decompression algorithms (actual numbers for compression vary much more):
+Each time a node is read from disk it needs to be decompressed before it can be used. Here are some ballpark figures for the relative performance of different decompression algorithms (actual compression ratios vary significantly depending on the data being compressed):
| Algorithm | Decompression Time | Compression Ratio |
|-----------|-------------------|------------------|
| lzw | 1000 µs | 5× |
| lz4 | 50 µs | 4× |
| zstd | 100 µs | 7× |
-Fast decompression will obviously reduce the processing time. But the decompression times and compression ratios have a more complex effect once you take the Linux page cache into account...
+#### Decompression time
+
+Fast decompression will obviously reduce the processing time. This will improve both the latency and the throughput, and will potentially allow the same workload to be performed on fewer Roxie nodes because of the reduced CPU load.
+
+#### Compression ratio
+
+The interaction between decompression times and compression ratios is complex once you take the Linux page cache into account...
Linux has a page cache which uses any spare memory to cache previous reads from disk. Often when loading a node from disk it will already be in the page cache because it has previously been read. That typically reduces the read time from ~150 µs (NVMe/SSD) to <10 µs.
-The Linux page cache contains compressed nodes, while the internal cache contains expanded pages. For the same amount of memory, more nodes can be kept in the Linux page cache than the internal cache. This reduces the number of nodes that are actually read from disk - which is more significant the slower the disk.
+The Linux page cache contains compressed nodes, while the Roxie internal node caches contain decompressed pages. For the same amount of memory, more nodes can be kept in the Linux page cache than in the internal caches. Increasing the memory allocated to the page cache reduces the number of nodes that are actually read from disk - which is more significant the slower the disk.
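+As a rough illustration of the difference in capacity (using the ballpark figures above and the 8KB node size mentioned later in this document - the exact numbers vary from index to index): 1GB of Linux page cache can hold roughly 130,000 compressed 8KB nodes, whereas 1GB of internal cache holds fewer than 20,000 nodes once they have been decompressed at a 7x ratio (~56KB each).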
The faster the decompression, the greater the benefit of the Linux page cache over the internal cache. When index formats are changed you will need to re-tune the system to find the best balance.
@@ -63,7 +71,7 @@ The compression ratio further complicates the picture. As noted above, a high c
Also see the notes above about disk size and performance.
-There is one special case worth noting. If the size in memory is the same as the size on disk then it is worth making the in-memory cache as large as possible. (Inplace indexes with expansion on demand also provide this.) If this is the case, then the worst performance is when the internal cache and the Linux page cache are the same size - since they are likely to be storing the same data, so half the memory will be wasted.
+There is one special case worth noting. If the size in memory is the same as the size on disk then it is worth making the in-memory cache as large as possible. (Inplace indexes with expansion on demand also provide this.) If this is the case, then the worst performance is when the internal cache and the Linux page cache are the same size - since they are likely to be storing the same data, half the memory will be wasted.
### Search time
@@ -75,21 +83,35 @@ There is scope for profiling and optimizing the search code in the future.
The ideal solution would be a format that had very high compression, no time to decompress and the in-memory size matched the disk size. That is unattainable, so what is the relative significance of the different factors?
-One subtlety is the difference between batch and interactive systems. Batch systems are most concerned with throughput, interactive systems are concerned with minimizing latency. Sometimes optimizing for those goals needs different solutions.
+1. Minimize disk size.\
+ If cost is the ultimate priority, then minimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required.
-1. Minimize real disk reads.\
+2. Minimize real disk reads.\
Disk reads take much longer than anything else, so reducing them will improve latency. If the disk reads are completely offloaded from the CPU and it is possible to increase the number of concurrent reads, then reducing the number of disk reads may not significantly improve throughput. The number of disk reads is most likely to be reduced by reducing the disk size, and setting aside a large proportion of the memory for compressed nodes.
-2. Reduce decompression time.\
+3. Reduce decompression time.\
If the compressed nodes are stored in the Linux page cache, then reducing the decompression time will help when they are loaded into memory. Reducing this improves both latency and throughput.
-3. Decrease the in-memory size.\
+4. Decrease the in-memory size.\
This will reduce the number of times pages will need to be read from page-cache/disk and decompressed. This will improve both latency and throughput.
It isn't clear which of (3) or (4) is the most significant - it is highly dependent on the internal cache sizes and the data access patterns - but I think (2) probably is most important. That has implications for the default compression.
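+As a rough illustration of the relative costs (using the ballpark figures above - real numbers depend on the hardware and data): a node found in the internal cache costs almost nothing, a node found in the Linux page cache costs under 10 µs to read plus ~100 µs to decompress with zstd (~50 µs with lz4), and a node that misses both caches costs ~150 µs to read from local NVMe/SSD plus the same decompression time - and considerably more if the data is on remote storage.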
+
+#### Batch vs interactive systems
+
+Batch systems are most concerned with throughput, while interactive systems are concerned with minimizing latency. This can lead to different priorities for the two systems.
+* Anything that reduces the CPU load is likely to improve both the latency and the throughput.
+* Reducing IO times will have a large effect on latency. If sufficient operations can be executed in parallel it may not affect throughput.
+* Reducing IO operations similarly has a much larger effect on latency than throughput.
+
+##### Interactive
-For a batch system it may be better to have a larger internal cache and smaller page cache - because a larger internal cache will improve throughput, but reducing the Linux page cache should only reduce latency rather than throughput.
+For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests that a relatively large Linux page cache and relatively small Roxie node caches will be optimal - since that will minimize the number of nodes that must be fetched from disk.
+
+##### Batch
+
+Reducing CPU load is the priority. It is possible that throughput is higher with a larger Roxie node cache - since it will reduce the time to decompress pages. That reduced decompression load will need to be balanced against the extra CPU load of reading pages from the Linux page cache or disk. Theoretically, reducing the Linux page cache should only increase latency rather than reduce throughput.
## Index formats
@@ -124,54 +146,65 @@ On disk:
* Blob nodes\
  Currently use LZW compression.
-Version 9.14.x adds support for inplace:zstds, which further reduces the size of the indexes. It is likely that the reduced disk size will outweigh the extra decompression time, and the default for compressed('inplace') should be compressed('inplace:zstd') once systems that support that format are widely deployed.
+Version 9.14.48 adds support for inplace:zstds, which further reduces the size of the indexes. It is likely that the reduced disk size will outweigh the extra decompression time, and the default for compressed('inplace') will be changed to compressed('inplace:zstds') in version 10.4.0.
### Hybrid
-This format uses the same representation as inplace compression for branch nodes, and an improved version of the legacy compression for leaf nodes. The most significant change is the compression algorithm can be configured - allowing the use of zstd.
+This format uses the same representation as inplace compression for branch nodes, and an improved version of the legacy compression for leaf nodes.
+
+* Branch nodes\
+  Uses the same format as inplace. Pages can be searched while compressed.
+* Leaf nodes\
+  The whole leaf node - key and payload - is compressed as a single block. The compression algorithm can be configured. (Defaults to zstds6.)
+* Blob nodes\
+  Use zstd compression.
-Initial results suggest that the on-disk sizes of hybrid:zstds indexes are notably smaller than the legacy format, and generally smaller than the current "inplace:lz4hc3s". They are very similar size to inplace:zstds for those indexes where inplace compression works well, but also compress well even when inplace compression does not.
+Initial results suggest that the on-disk sizes of hybrid indexes are notably smaller than the legacy format, and generally smaller than the current "inplace".
They are very similar in size to inplace:zstds for those indexes where inplace compression works well, but also compress well even when inplace compression is not as effective.
-Once the format is stable and systems that support it (in a future 9.14.x release) are deployed, the default compression should be changed to hybrid:zstds
+The hybrid format is supported for reading in builds 9.14.48, 10.0.22 and 10.2.0. Hybrid indexes should be created in version 10.0.22 or later.
### Further ways of reducing the index sizes
Some of the following reduce in-memory size, some on-disk and some both
-* Better encoding of variable length rows.
-  (Rows < 127 bytes larger than the keyed portion, or even the fixed portion.) Saves <.5%
-* Special case common leading text for all entries in a leaf node.
-  It would require compressing the data twice, but initial experiments suggest a further 1% saving. It should be added to hybrid indexes as an option. It would also reduce the in-memory sizes.
-* Encode fixed size strings as variable length.
-  This also relies on support for single-byte prefixed variable length strings. Initial work suggests that the disk size does not change much, but the in-memory size may nearly halve. Whether this is significant enough to be worthwhile remains to be seen.
-* Use zstd for blobs
+* Special case common leading text for all entries in a leaf node.\
+  HPCC-35477. This requires compressing the data twice - which significantly increases the time to create the indexes. Initial experiments suggest a further 1% saving. It would also reduce the in-memory sizes. The trade-off does not currently seem worthwhile, so it is not considered a priority at the moment - the code to implement the compression changes (but not reading) is in branch ghalliday/issue35477.
+* Encode fixed size strings as variable length.\
+  HPCC-35386. This also relies on support for single-byte prefixed variable length strings. Initial work suggests that it can have a noticeable effect on some indexes, and no effect on others. The changes are likely to be released as a new ECL syntax in 10.2.x.
+* Use zstd for blobs\
  Blobs are not very common, but this is likely to significantly cut the sizes of files that do use them.
-* Use dictionaries
-  This has the potential of increasing compression.
+* Use dictionaries\
+  This has the potential to increase compression. No further work has been done yet.
-## Recommendations for using the new index formats
+## Current recommendations for using the new index formats
-What are the recommendations for index formats, and what pathological indexes require a different approach?
+What are the recommendations for index formats? The first question is whether the priority is reducing costs or improving performance.
-### Use inplace by default
+### Reducing costs
-When the inplace indexes compress well, it should give the best performance. Branch nodes are much more efficient. Decompression times and in-memory sizes should be notably smaller.
+Almost all indexes should use `compress('hybrid')`. This is likely to give the smallest disk sizes, with good decompression performance. A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`. (This will become the default for inplace indexes in a future version.) Good candidates for inplace compression are indexes with few index fields that are fairly densely populated (e.g. phone numbers).
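+The ECL sketch below illustrates how these recommendations might be applied when building an index. The record, dataset and file names are purely illustrative, and it assumes the compression specification string is passed to the COMPRESSED option (written as compressed('inplace') earlier in this document) - check the exact syntax for the platform version in use.
+
+```ecl
+personRec := RECORD
+    STRING12  phone;   // keyed field - short and densely populated
+    UNSIGNED8 id;      // keyed field
+    STRING40  name;    // payload field
+END;
+persons := DATASET('~example::persons', personRec, FLAT);
+
+// Recommended default: hybrid format (defaults to zstds6 leaf compression)
+hybridIdx := INDEX(persons, {phone, id}, {name}, '~example::key::persons_hybrid',
+                   COMPRESSED('hybrid'));
+
+// Possible alternative for indexes with a few densely populated keyed fields
+inplaceIdx := INDEX(persons, {phone, id}, {name}, '~example::key::persons_inplace',
+                    COMPRESSED('inplace:zstds,blob(zstd)'));
+
+BUILD(hybridIdx);
+BUILD(inplaceIdx);
+```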
-However, given the thought process above, it may be better to default to inplace:zstds instead of the current inplace:lz4hc3s. Decompression times would be higher, but there should be fewer disk reads. (Note: for batch, inplace:lz4hc3 may be better because the disk reads have less impact on throughput and faster decompression will increase throughput!)
+### Improving performance
-### Indexes with many keyed fields
+It will require careful profiling to determine the best approach. The first approximation should be to follow the recommendations for reducing costs. Then explore the following:
-There are some indexes that have very large numbers of keyed fields, where all the fields are used for keyed searches. If the trailing fields tend to be unrelated, then these indexes will not compress as well as the legacy format.
+* Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc` for indexes with few (or very dense) keyed fields.
-In this case, using the hybrid format would be a better approach. The branch nodes will likely compress better, and zstd will improve the decompression speed and reduce the disk size - which will reduce the number of disk reads.
+This will trade increased disk space for reduced decompression time. If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already CPU constrained), then using lz4shc rather than zstds compression will reduce the CPU load. Whether that outweighs the increased time to read from disk will be data and query dependent.
### Indexes with payload fields in the keyed portion
There are other indexes where there are fields in the keyed portion that should be in the payload. This causes two problems:
+
* Branch nodes are larger than they need to be since they contain extra data.
* The inplace keyed compression does not work well on unrelated data, leading to bigger inplace indexes.
-There are two solutions. The short-term solution would be to use hybrid compression - that should improve performance and reduce the index size. A long-term solution is to migrate fields that are never keyed to the payload. This is currently hard to implement incrementally, but see the state of play below.
+Moving the fields is likely to be a worthwhile exercise:
+
+- It will reduce the size and number of the branch nodes - occasionally significantly.
+- It may marginally speed up searching on the indexes because fewer fields need to be compared when performing a keyed match.
+- It will have a notable effect on the size of inplace indexes, but relatively little on hybrid indexes.
+- It potentially allows the payload fields to be implicitly trimmed, possibly improving compression.
### Small key fields and large complex payload
@@ -179,37 +212,76 @@ There are some keys that have a very small keyed portion, and a very large payl
### Highly compressible payload
-There are some indexes that are pathologically compressible - e.g., leaves take up less than 5% of the size on disk. This can cause problems when they are read - because an 8KB page can expand to 400KB in memory.
That can cause fragmentation problems with the memory manager and means the leaf cache can only hold a few entries. In general the benefits of improved compression are likely to outweigh these problems, so the compression is limited to 100x by default.
-By default, the inplace compression restricts compression to a factor of 25 to help avoid these problems. However, that means the indexes are larger with the inplace format.
+If anything can be done to reduce the size of the payload in memory, that is likely to improve performance.
Solutions:
-- Use "inplace:maxCompressionFactor=50"\
- This circumvents the limit on compression to allow similar compression to legacy indexes.
-- Use "inplace:zstds,maxCompressionFactor=50"\
- Same as above but also use zstd compression - if lz4hc3 does not compress as well, and disk size is critical.
- Use variable size strings in the payload.\
  The most likely cause of extreme compression is large fixed-size string fields that are mainly blank or contain short strings. Where that is the case, it is better to trim the fields and store them as varstrings. This will significantly increase the effectiveness of the internal cache - because the uncompressed size will be much smaller. It will also reduce the decompression time.
- Enhance the platform to automatically compress padded strings.\
- This functionality doesn't exist yet, but if the platform can compress strings behind the scenes you will gain the benefits without having to make any changes to the ECL (and avoid any backward compatibility issues.)
+ This feature is coming soon.
## What about remote file storage?
For batch systems, the throughput should be very similar when retrieving data from remote storage rather than locally, provided the number of concurrent requests can be increased to balance the increased read time.
-For interactive systems, the extra latency is problematic. However, an upcoming change allows data that has been read from remote storage to be stored on a local SSD/NVMe drive. This should significantly reduce the average latency, and fewer remote requests may also reduce the maximum latency.
+For interactive systems, the extra latency is problematic. Configuring the local NVMe cache should significantly reduce the average latency, and fewer remote requests may also reduce the maximum latency.
## Current State of Development
* Inplace\
  Implemented. Default changed to inplace:lz4hc3. Should consider changing the default to inplace:zstd.
- Expansion on demand, rather than on load, is implemented but performance has not been evaluated.
+ Expansion of inplace index payloads on demand (rather than at index load time) is implemented, but its performance has not yet been evaluated.
* Hybrid\
- PR containing the initial work is available. Expected to be included at some point in version 9.14.x. Read support may be backported to 9.12.x if it has minimal impact.
+ Implemented for reading in 9.14.48 and writing in 10.0.22 and later.
* Trim compression\
- A relatively new idea (see above). Should be implementable with a week or two's work.
+ HPCC-35386. Work in progress - likely to be available in 10.2.x, with reading supported in 10.0.22 and later.
* Support for moving keyed fields.\
- Currently, moving a field from the key to the payload will cause problems reading the indexes. There is work in progress to allow the ECL to specify a field in the payload that is actually in the keyed portion. This would allow the query ECL to be updated in preparation for changing the format when the key is built.
There is a PR open, with most of the functionality working for hthor and roxie. It needs more testing, especially for unusual edge cases (e.g., superfiles containing a mixture of translated and untranslated.)
+ This should now be complete.
* Zstd compression of blobs\
- No work has been done in this area, but it would be relatively trivial to use zstd to improve the decompression speed and compressibility of the data.
+ Complete. Can be configured for inplace indexes using the blob(zstd) option; it is the default for hybrid.
+* Local NVMe file cache\
+ Implemented.
+* Explicitly managed page cache\
+ HPCC-35577. Not yet implemented.
+
+## Compression options
+
+The compression specification has the following format:
+
+`compress('format[:[compression[(compress-options)]|compression[(compress-options)],index-options|index-options]]')`
+
+Compression formats (for details see above):
+* legacy - the format that has been historically used
+* inplace - keyed portion is searchable without decompression
+* hybrid - inplace branches, legacy-style leaf nodes
+
+compression:
+* lzw - historical compression method. Not recommended.
+* lz4s - fast to decompress, does not compress as well as lzw.
+* lz4shc - equally fast to decompress, slower to compress but more compact
+* zstds - compresses very well, relatively fast to decompress.
+* zstds3 - alias for zstds(level=3)
+* zstds6 - alias for zstds(level=6)
+* zstds9 - alias for zstds(level=9)
+
+NOTE: lz4s, lz4shc and zstds are the streaming versions and have an 's' suffix. They should be used for the node compression.
+
+index-options:
+* blob(compression) - control the format that is used to compress blobs. Note that the options here are lzw, lz4, lz4hc and zstd (without the 's' suffix).
+
+index-options for inplace indexes (many are only for developers):
+* recompress=bool - should the payload be recompressed to see if it reduces in size? (default false)
+* maxCompressionFactor=n - do not compress the payload by more than a factor of n (default 100)
+* reuse=bool - reuse the compressor (defaults to true)
+* compressThreshold=n - use uncompressed if compressed > threshold% of uncompressed (default 95%)
+* uncompressed=bool - avoid compressing the payload
+
+compress-options: passed directly to the compression algorithm
+* level=n - which lz4/zstd compression level should be used?
+
+Examples:
+
+`compress('hybrid:zstds(level=9),blob(lz4hc)')`
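+A few more illustrative combinations built from the options above (not an exhaustive list):
+
+* `compress('hybrid')` - hybrid format with the default leaf compression (zstds6)
+* `compress('hybrid:lz4shc')` - hybrid format trading larger files for faster decompression
+* `compress('inplace:zstds,blob(zstd)')` - inplace format with zstd compression for the payload and blobs
+* `compress('inplace:zstds,maxCompressionFactor=50')` - inplace format with a tighter limit on how far the payload may expand when decompressed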