HPCC-35580 Further updates to the index documentation #20808
base: master
Signed-off-by: Gavin Halliday <[email protected]>
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-35580
Jirabot Action Result:
Pull request overview
This pull request updates the index formats documentation (devdoc/indexformats.md) to provide more comprehensive and current information about HPCC Platform index compression formats. The updates clarify performance considerations, compression options, and recommendations for using different index formats in production.
Key changes:
- Enhanced explanation of compression ratios and their interaction with Linux page cache and decompression times
- Added detailed sections distinguishing between batch and interactive system requirements
- Updated version information for hybrid and inplace format support (9.14.48, 10.0.22, 10.2.0)
- Expanded recommendations section with separate guidance for cost reduction vs. performance optimization
- Added comprehensive compression options reference section documenting syntax and parameters
| ### Reducing costs | ||
|
|
||
| When the inplace indexes compress well, it should give the best performance. Branch nodes are much more efficient. Decompression times and in-memory sizes should be notably smaller. | ||
| Almost all indexes should use `compress('hybrid')`. This is likely to give the smallest disk sizes, with good decompression performance. A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`. (This will become the default for inplace indexes in a future version.) Good candidates for inplace compression are indexes with few index fields that are fairly densly populated (e.g. phone numbers). |
Copilot
AI
Jan 7, 2026
The word "densly" is misspelled - it should be "densely".
| Almost all indexes should use `compress('hybrid')`. This is likely to give the smallest disk sizes, with good decompression performance. A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`. (This will become the default for inplace indexes in a future version.) Good candidates for inplace compression are indexes with few index fields that are fairly densly populated (e.g. phone numbers). | |
| Almost all indexes should use `compress('hybrid')`. This is likely to give the smallest disk sizes, with good decompression performance. A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`. (This will become the default for inplace indexes in a future version.) Good candidates for inplace compression are indexes with few index fields that are fairly densely populated (e.g. phone numbers). |
|
|
||
| One subtlety is the difference between batch and interactive systems. Batch systems are most concerned with throughput, interactive systems are concerned with minimizing latency. Sometimes optimizing for those goals needs different solutions. | ||
| 1. Minimize disk size.\ | ||
| If cost is the ultimate priority, then minimimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required. |
Copilot
AI
Jan 7, 2026
The word "minimimizing" is misspelled - it should be "minimizing" (the syllable "mi" is repeated one time too many).
| If cost is the ultimate priority, then minimimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required. | |
| If cost is the ultimate priority, then minimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required. |
| ##### Interactive | ||
|
|
||
| For a batch system it may be better to have a larger internal cache and smaller page cache - because a larger internal cache will improve throughput, but reducing the Linux page cache should only reduce latency rather than throughput. | ||
| For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimial - since that will minimimize the number of nodes that must be fetched from disk. |
Copilot
AI
Jan 7, 2026
The word "minimimize" is misspelled - it should be "minimize" (the syllable "mi" is repeated one time too many).
| ##### Interactive | ||
|
|
||
| For a batch system it may be better to have a larger internal cache and smaller page cache - because a larger internal cache will improve throughput, but reducing the Linux page cache should only reduce latency rather than throughput. | ||
| For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimial - since that will minimimize the number of nodes that must be fetched from disk. |
Copilot
AI
Jan 7, 2026
The word "optimial" is misspelled - it should be "optimal".
| For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimial - since that will minimimize the number of nodes that must be fetched from disk. | |
| For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimal - since that will minimize the number of nodes that must be fetched from disk. |
| It will require careful profiling to determine the best approach. The first approximation should be to follow the recommendations for reducing costs. Then explore the following: | ||
|
|
||
| There are some indexes that have very large numbers of keyed fields, where all the fields are used for keyed searches. If the trailing fields tend to be unrelated, then these indexes will not compress as well as the legacy format. | ||
| * Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc\ for indexes with few (or very dense) keyed fields. |
Copilot
AI
Jan 7, 2026
The backslash after `inplace:lz4shc` appears to be a typo for the closing backtick: it leaves the inline code span unterminated. Replace it with a backtick.
| * Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc\ for indexes with few (or very dense) keyed fields. | |
| * Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc` for indexes with few (or very dense) keyed fields. |
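The guidance discussed in this thread can be summarised as a small decision helper. The following Python sketch is illustrative only: `suggest_compression`, its `goal` argument and the `few_dense_keyed_fields` flag are hypothetical names, not part of the HPCC Platform, and the returned strings are simply the option values quoted in the documentation under review. The result is a starting point for profiling, not a guaranteed best choice.

```python
def suggest_compression(goal: str, few_dense_keyed_fields: bool) -> str:
    """Return a candidate compression option string (hypothetical helper).

    goal: "cost" to favour the smallest disk size, or "performance" to
          trade disk space for reduced decompression time.
    few_dense_keyed_fields: True when the index has few keyed fields that
          are densely populated (e.g. phone numbers).
    """
    if goal == "cost":
        # Almost all indexes: hybrid. A small subset with few, dense
        # keyed fields may produce smaller files with the inplace variant.
        return "inplace:zstds,blob(zstd)" if few_dense_keyed_fields else "hybrid"
    if goal == "performance":
        # Options the text suggests exploring after applying the
        # cost-reduction recommendations.
        return "inplace:lz4shc" if few_dense_keyed_fields else "hybrid:lz4shc"
    raise ValueError(f"unknown goal: {goal!r}")
```

For example, `suggest_compression("performance", few_dense_keyed_fields=False)` returns `"hybrid:lz4shc"`; whether that choice actually helps remains data and query dependent.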
| * Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc\ for indexes with few (or very dense) keyed fields. | ||
|
|
||
| In this case, using the hybrid format would be a better approach. The branch nodes will likely compress better, and zstd will improve the decompression speed and reduce the disk size - which will reduce the number of disk reads. | ||
| This will trade increased disk space for reduced decompression time. If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already cpu constrained), then using lz4hcs rather than zstds compression will reduce the cpu load. Whether that outweights the increased time to read from disk will be data and query dependent. |
Copilot
AI
Jan 7, 2026
The word "outweights" is misspelled - it should be "outweighs".
| This will trade increased disk space for reduced decompression time. If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already cpu constrained), then using lz4hcs rather than zstds compression will reduce the cpu load. Whether that outweights the increased time to read from disk will be data and query dependent. | |
| This will trade increased disk space for reduced decompression time. If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already cpu constrained), then using lz4hcs rather than zstds compression will reduce the cpu load. Whether that outweighs the increased time to read from disk will be data and query dependent. |
|
|
||
| - It will reduce the size and number of the branch nodes - occasionally significantly. | ||
| - It may marginally speed up searching on the indexes because fewer fields need to be compared when performing a keyed match.- | ||
| - It will have a notable affect on the size of inplace indexes, but relatively little on hybrid indexes. |
Copilot
AI
Jan 7, 2026
The word "affect" should be "effect" in this context. "Affect" is typically a verb meaning "to influence," while "effect" is a noun meaning "a result or consequence." The sentence is talking about the impact/result on the size, so "effect" is correct.
| - It will have a notable affect on the size of inplace indexes, but relatively little on hybrid indexes. | |
| - It will have a notable effect on the size of inplace indexes, but relatively little on hybrid indexes. |
| Compression types: | ||
| * hybrid - inplace branches, legacy-style leaf nodes | ||
|
|
Copilot
AI
Jan 7, 2026
The text "Compression types:" is listed after the "Types" section, but it should probably be labeled as "Formats:" or integrated into the Types section differently. The current structure lists "inplace" as a Type but then shows "hybrid" under "Compression types", which is confusing since hybrid is described as a format type in the earlier documentation, not a compression algorithm.
| Compression types: | |
| * hybrid - inplace branches, legacy-style leaf nodes | |
| * hybrid - inplace branches, legacy-style leaf nodes |
|
|
||
| compression: | ||
| * lzw - historical compression method. Not recommended. | ||
| * lz4s - fast to decompress, does not compress very well as lzw. |
Copilot
AI
Jan 7, 2026
The phrase "does not compress very well as lzw" is grammatically incorrect. It should be "does not compress as well as lzw" to properly compare the two compression methods.
| * lz4s - fast to decompress, does not compress very well as lzw. | |
| * lz4s - fast to decompress, does not compress as well as lzw. |
| * zstds6 - alias for zstds(level=6) | ||
| * zstds9 - alias for zstds(level=9) | ||
|
|
||
| NOTE: lz4s, lz4shc and zstds these are the streaming versions and have a 's' suffix. They should be used for the node compression. |
Copilot
AI
Jan 7, 2026
The sentence "NOTE: lz4s, lz4shc and zstds these are the streaming versions and have a 's' suffix." is grammatically incorrect. Remove "these" (or add a dash after "zstds") and use "an" before 's': "NOTE: lz4s, lz4shc and zstds are the streaming versions and have an 's' suffix."
| NOTE: lz4s, lz4shc and zstds these are the streaming versions and have a 's' suffix. They should be used for the node compression. | |
| NOTE: lz4s, lz4shc and zstds are the streaming versions and have an 's' suffix. They should be used for the node compression. |
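The alias and streaming-suffix rules quoted in the last two comments can be illustrated with a short lookup. This is a minimal Python sketch, not the platform's option parser: `ALIASES`, `STREAMING`, `expand_alias` and `is_streaming` are hypothetical names, and only the two aliases and three streaming methods named in the quoted documentation are included.

```python
# Aliases quoted in the documentation under review.
ALIASES = {
    "zstds6": "zstds(level=6)",
    "zstds9": "zstds(level=9)",
}

# Streaming methods (with the 's' suffix) recommended for node compression.
STREAMING = {"lz4s", "lz4shc", "zstds"}


def expand_alias(name: str) -> str:
    """Return the expanded form of a known alias, or the name unchanged."""
    return ALIASES.get(name, name)


def is_streaming(name: str) -> bool:
    """True if the method (after alias expansion) is a streaming variant."""
    base = expand_alias(name).split("(", 1)[0]  # drop any "(level=N)" part
    return base in STREAMING
```

With these definitions, `expand_alias("zstds6")` gives `"zstds(level=6)"`, and `is_streaming("zstd")` is `False` because `zstd` lacks the streaming suffix.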