@ghalliday
Member

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Copilot AI review requested due to automatic review settings January 7, 2026 12:21
@github-actions

github-actions bot commented Jan 7, 2026

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-35580

Jirabot Action Result:
Assigning user: [email protected]
Workflow Transition To: Merge Pending
Updated PR

Contributor

Copilot AI left a comment

Pull request overview

This pull request updates the index formats documentation (devdoc/indexformats.md) to provide more comprehensive and current information about HPCC Platform index compression formats. The updates clarify performance considerations, compression options, and recommendations for using different index formats in production.

Key changes:

  • Enhanced explanation of compression ratios and their interaction with Linux page cache and decompression times
  • Added detailed sections distinguishing between batch and interactive system requirements
  • Updated version information for hybrid and inplace format support (9.14.48, 10.0.22, 10.2.0)
  • Expanded recommendations section with separate guidance for cost reduction vs. performance optimization
  • Added comprehensive compression options reference section documenting syntax and parameters

### Reducing costs

When the inplace indexes compress well, it should give the best performance. Branch nodes are much more efficient. Decompression times and in-memory sizes should be notably smaller.
Almost all indexes should use `compress('hybrid')`. This is likely to give the smallest disk sizes, with good decompression performance. A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`. (This will become the default for inplace indexes in a future version.) Good candidates for inplace compression are indexes with few index fields that are fairly densly populated (e.g. phone numbers).

Copilot AI Jan 7, 2026

The word "densly" is misspelled - it should be "densely".

Suggested change
Almost all indexes should use `compress('hybrid')`. This is likely to give the smallest disk sizes, with good decompression performance. A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`. (This will become the default for inplace indexes in a future version.) Good candidates for inplace compression are indexes with few index fields that are fairly densly populated (e.g. phone numbers).
Almost all indexes should use `compress('hybrid')`. This is likely to give the smallest disk sizes, with good decompression performance. A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`. (This will become the default for inplace indexes in a future version.) Good candidates for inplace compression are indexes with few index fields that are fairly densely populated (e.g. phone numbers).

Copilot uses AI. Check for mistakes.

One subtlety is the difference between batch and interactive systems. Batch systems are most concerned with throughput, interactive systems are concerned with minimizing latency. Sometimes optimizing for those goals needs different solutions.
1. Minimize disk size.\
If cost is the ultimate priority, then minimimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required.

Copilot AI Jan 7, 2026

The word "minimimizing" is misspelled - it should be "minimizing" (the misspelling has an extra "mi").

Suggested change
If cost is the ultimate priority, then minimimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required.
If cost is the ultimate priority, then minimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required.

##### Interactive

For a batch system it may be better to have a larger internal cache and smaller page cache - because a larger internal cache will improve throughput, but reducing the Linux page cache should only reduce latency rather than throughput.
For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimial - since that will minimimize the number of nodes that must be fetched from disk.

Copilot AI Jan 7, 2026

The word "minimimize" is misspelled - it should be "minimize" (the misspelling has an extra "mi").


Copilot AI Jan 7, 2026

The word "optimial" is misspelled - it should be "optimal".

Suggested change
For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimial - since that will minimimize the number of nodes that must be fetched from disk.
For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimal - since that will minimize the number of nodes that must be fetched from disk.

It will require careful profiling to determine the best approach. The first approximation should be to follow the recommendations for reducing costs. Then explore the following:

There are some indexes that have very large numbers of keyed fields, where all the fields are used for keyed searches. If the trailing fields tend to be unrelated, then these indexes will not compress as well as the legacy format.
* Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc\ for indexes with few (or very dense) keyed fields.

Copilot AI Jan 7, 2026

The backslash after "lz4shc" appears to be a typo for the closing backtick of the inline code span, so the span is never closed. It should be replaced with a backtick.

Suggested change
* Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc\ for indexes with few (or very dense) keyed fields.
* Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc` for indexes with few (or very dense) keyed fields.

* Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc\ for indexes with few (or very dense) keyed fields.

In this case, using the hybrid format would be a better approach. The branch nodes will likely compress better, and zstd will improve the decompression speed and reduce the disk size - which will reduce the number of disk reads.
This will trade increased disk space for reduced decompression time. If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already cpu constrained), then using lz4hcs rather than zstds compression will reduce the cpu load. Whether that outweights the increased time to read from disk will be data and query dependent.

Copilot AI Jan 7, 2026

The word "outweights" is misspelled - it should be "outweighs".

Suggested change
This will trade increased disk space for reduced decompression time. If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already cpu constrained), then using lz4hcs rather than zstds compression will reduce the cpu load. Whether that outweights the increased time to read from disk will be data and query dependent.
This will trade increased disk space for reduced decompression time. If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already cpu constrained), then using lz4hcs rather than zstds compression will reduce the cpu load. Whether that outweighs the increased time to read from disk will be data and query dependent.
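
The trade-off in the paragraph above can be made concrete with a small cost model. This is only a sketch: the function name, node sizes, bandwidths and decompression times are all hypothetical values chosen for illustration, not measurements from the platform, and the model ignores per-request latency and caching.

```python
def fetch_time_us(compressed_bytes, read_bandwidth_mb_s, decompress_us):
    """Time to read one compressed node and decompress it, in microseconds.

    bytes / (MB/s) gives microseconds when 1 MB is taken as 1e6 bytes.
    """
    read_us = compressed_bytes / read_bandwidth_mb_s
    return read_us + decompress_us


# Hypothetical node: the zstd version is smaller on disk but slower to expand.
ZSTD = {"compressed_bytes": 2000, "decompress_us": 6}
LZ4 = {"compressed_bytes": 3000, "decompress_us": 2}

# Fast local NVMe (assumed 2000 MB/s): decompression dominates, lz4 wins.
nvme_zstd = fetch_time_us(ZSTD["compressed_bytes"], 2000, ZSTD["decompress_us"])  # 7.0
nvme_lz4 = fetch_time_us(LZ4["compressed_bytes"], 2000, LZ4["decompress_us"])     # 3.5

# Slower storage (assumed 100 MB/s): the read dominates, zstd wins.
slow_zstd = fetch_time_us(ZSTD["compressed_bytes"], 100, ZSTD["decompress_us"])   # 26.0
slow_lz4 = fetch_time_us(LZ4["compressed_bytes"], 100, LZ4["decompress_us"])      # 32.0
```

With these made-up numbers the preferred codec flips between the two storage tiers, which is why the text says the answer is data and query dependent and needs profiling.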


- It will reduce the size and number of the branch nodes - occasionally significantly.
- It may marginally speed up searching on the indexes because fewer fields need to be compared when performing a keyed match.-
- It will have a notable affect on the size of inplace indexes, but relatively little on hybrid indexes.

Copilot AI Jan 7, 2026

The word "affect" should be "effect" in this context. "Affect" is typically a verb meaning "to influence," while "effect" is a noun meaning "a result or consequence." The sentence is talking about the impact/result on the size, so "effect" is correct.

Suggested change
- It will have a notable affect on the size of inplace indexes, but relatively little on hybrid indexes.
- It will have a notable effect on the size of inplace indexes, but relatively little on hybrid indexes.

Comment on lines +259 to +261
Compression types:
* hybrid - inplace branches, legacy-style leaf nodes

Copilot AI Jan 7, 2026

The text "Compression types:" is listed after the "Types" section, but it should probably be labeled as "Formats:" or integrated into the Types section differently. The current structure lists "inplace" as a Type but then shows "hybrid" under "Compression types", which is confusing since hybrid is described as a format type in the earlier documentation, not a compression algorithm.

Suggested change
Compression types:
* hybrid - inplace branches, legacy-style leaf nodes
* hybrid - inplace branches, legacy-style leaf nodes


compression:
* lzw - historical compression method. Not recommended.
* lz4s - fast to decompress, does not compress very well as lzw.

Copilot AI Jan 7, 2026

The phrase "does not compress very well as lzw" is grammatically incorrect. It should be "does not compress as well as lzw" to properly compare the two compression methods.

Suggested change
* lz4s - fast to decompress, does not compress very well as lzw.
* lz4s - fast to decompress, does not compress as well as lzw.

* zstds6 - alias for zstds(level=6)
* zstds9 - alias for zstds(level=9)

NOTE: lz4s, lz4shc and zstds these are the streaming versions and have a 's' suffix. They should be used for the node compression.

Copilot AI Jan 7, 2026

The sentence "NOTE: lz4s, lz4shc and zstds these are the streaming versions and have a 's' suffix." is grammatically incorrect. It should be "NOTE: lz4s, lz4shc and zstds are the streaming versions and have an 's' suffix." (remove "these", or add a dash after "zstds", and use "an" before 's').

Suggested change
NOTE: lz4s, lz4shc and zstds these are the streaming versions and have a 's' suffix. They should be used for the node compression.
NOTE: lz4s, lz4shc and zstds are the streaming versions and have an 's' suffix. They should be used for the node compression.
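
To make the option strings quoted in this review easier to read, here is a minimal sketch that splits one into its parts. The grammar `format[:compression[,option...]]` is inferred only from the examples in this thread (`hybrid`, `hybrid:lz4shc`, `inplace:zstds,blob(zstd)`), and `parse_compress_spec` and `ALIASES` are hypothetical names; this is not the platform's parser, and combining a format with an alias such as `hybrid:zstds9` is an assumption.

```python
import re

# Aliases quoted in this review; the documented list may contain more.
ALIASES = {"zstds6": "zstds(level=6)", "zstds9": "zstds(level=9)"}


def parse_compress_spec(spec):
    """Split 'format[:compression[,option...]]' into its parts (assumed grammar)."""
    fmt, _, rest = spec.partition(":")
    # Split on commas that are not inside parentheses, so 'blob(zstd)' stays whole.
    parts = [p.strip() for p in re.split(r",(?![^()]*\))", rest) if p.strip()]
    # Expand a known alias for the node compression, if one was given.
    compression = ALIASES.get(parts[0], parts[0]) if parts else None
    return {"format": fmt.strip(), "compression": compression, "options": parts[1:]}


hybrid_default = parse_compress_spec("hybrid")
inplace_blob = parse_compress_spec("inplace:zstds,blob(zstd)")
hybrid_alias = parse_compress_spec("hybrid:zstds9")
```

Here `inplace_blob` separates the format (`inplace`), the node compression (`zstds`) and the extra option (`blob(zstd)`), while `hybrid_alias` shows the alias expanded to `zstds(level=9)`.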

@ghalliday ghalliday requested review from mckellyln and removed request for mckellyln January 8, 2026 17:31