HPCC-35580 Further updates to the index documentation #20808

ghalliday · 2026-01-07T12:21:22Z

Type of change:

This change is a bug fix (non-breaking change which fixes an issue).
This change is a new feature (non-breaking change which adds functionality).
This change improves the code (refactor or other change that does not change the functionality)
This change fixes warnings (the fix does not alter the functionality or the generated code)
This change is a breaking change (fix or feature that will cause existing behavior to change).
This change alters the query API (existing queries will have to be recompiled)

Checklist:

Smoketest:

Send notifications about my Pull Request position in Smoketest queue.
Test my draft Pull Request.

Testing:

Signed-off-by: Gavin Halliday <[email protected]>

github-actions · 2026-01-07T12:22:05Z

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-35580

Jirabot Action Result:
Assigning user: [email protected]
Workflow Transition To: Merge Pending
Updated PR

Copilot

Pull request overview

This pull request updates the index formats documentation (devdoc/indexformats.md) to provide more comprehensive and current information about HPCC Platform index compression formats. The updates clarify performance considerations, compression options, and recommendations for using different index formats in production.

Key changes:

Enhanced explanation of compression ratios and their interaction with Linux page cache and decompression times
Added detailed sections distinguishing between batch and interactive system requirements
Updated version information for hybrid and inplace format support (9.14.48, 10.0.22, 10.2.0)
Expanded recommendations section with separate guidance for cost reduction vs. performance optimization
Added comprehensive compression options reference section documenting syntax and parameters

Copilot · 2026-01-07T12:26:49Z

devdoc/indexformats.md

+### Reducing costs

-When the inplace indexes compress well, it should give the best performance.  Branch nodes are much more efficient.  Decompression times and in-memory sizes should be notably smaller.
+Almost all indexes should use `compress('hybrid')`.  This is likely to give the smallest disk sizes, with good decompression performance.  A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`.  (This will become the default for inplace indexes in a future version.)  Good candidates for inplace compression are indexes with few index fields that are fairly densly populated (e.g. phone numbers).


The word "densly" is misspelled - it should be "densely".

Suggested change

-Almost all indexes should use `compress('hybrid')`. This is likely to give the smallest disk sizes, with good decompression performance. A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`. (This will become the default for inplace indexes in a future version.) Good candidates for inplace compression are indexes with few index fields that are fairly densly populated (e.g. phone numbers).
+Almost all indexes should use `compress('hybrid')`. This is likely to give the smallest disk sizes, with good decompression performance. A relatively small subset of indexes will create smaller files by using `compress('inplace:zstds,blob(zstd)')`. (This will become the default for inplace indexes in a future version.) Good candidates for inplace compression are indexes with few index fields that are fairly densely populated (e.g. phone numbers).
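
As a rough illustration of the rule of thumb in this paragraph, the choice could be sketched as a small helper. This is only a sketch: the function name, the `densely_populated` flag and the `max_few_fields` threshold are hypothetical (the documentation only says "few index fields"), and only the two option strings are taken from the text above.

```python
def suggest_compress_option(num_keyed_fields: int,
                            densely_populated: bool,
                            max_few_fields: int = 2) -> str:
    """Return a candidate option string for compress(), per the doc's rule of thumb."""
    # Assumption: "few index fields" is modelled as a simple count threshold.
    if densely_populated and num_keyed_fields <= max_few_fields:
        # Candidate for inplace compression (e.g. a phone number index).
        return "inplace:zstds,blob(zstd)"
    # Default recommendation for almost all indexes.
    return "hybrid"


if __name__ == "__main__":
    print(suggest_compress_option(1, True))
    print(suggest_compress_option(8, False))
```

The result is only a starting point; whether an inplace index is actually smaller would still need to be checked by building it and comparing file sizes.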

Copilot · 2026-01-07T12:26:50Z

devdoc/indexformats.md


-One subtlety is the difference between batch and interactive systems.  Batch systems are most concerned with throughput, interactive systems are concerned with minimizing latency.  Sometimes optimizing for those goals needs different solutions.
+1. Minimize disk size.\
+   If cost is the ultimate priority, then minimimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required.


The word "minimimizing" is misspelled - it should be "minimizing" (it contains an extra "mi").

Suggested change

-If cost is the ultimate priority, then minimimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required.
+If cost is the ultimate priority, then minimizing the disk size is likely to provide the largest benefit - since that often defines the size of the compute required.

Copilot · 2026-01-07T12:26:50Z

devdoc/indexformats.md

+##### Interactive

-For a batch system it may be better to have a larger internal cache and smaller page cache - because a larger internal cache will improve throughput, but reducing the Linux page cache should only reduce latency rather than throughput.
+For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency.  That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimial - since that will minimimize the number of nodes that must be fetched from disk.


The word "minimimize" is misspelled - it should be "minimize" (it contains an extra "mi").

Copilot · 2026-01-07T12:26:50Z

devdoc/indexformats.md

+##### Interactive

-For a batch system it may be better to have a larger internal cache and smaller page cache - because a larger internal cache will improve throughput, but reducing the Linux page cache should only reduce latency rather than throughput.
+For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency.  That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimial - since that will minimimize the number of nodes that must be fetched from disk.


The word "optimial" is misspelled - it should be "optimal".

Suggested change

-For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimial - since that will minimimize the number of nodes that must be fetched from disk.
+For interactive systems, minimizing the disk reads is the priority - since that is the operation with the highest latency. That suggests a relatively large linux page cache, and relatively small roxie node caches will be optimal - since that will minimize the number of nodes that must be fetched from disk.

Copilot · 2026-01-07T12:26:51Z

devdoc/indexformats.md

+It will require careful profiling to determine the best approach.  The first approximation should be to follow the recommendations for reducing costs.  Then explore the following:

-There are some indexes that have very large numbers of keyed fields, where all the fields are used for keyed searches.  If the trailing fields tend to be unrelated, then these indexes will not compress as well as the legacy format.
+* Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc\ for indexes with few (or very dense) keyed fields.


There is a backslash after `inplace:lz4shc` where the closing backtick should be, so the inline code span is never closed. In markdown a backslash at the end of a line produces a hard line break, but this one sits mid-line and looks like a typo for the backtick. It should be replaced with a backtick.

Suggested change

-* Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc\ for indexes with few (or very dense) keyed fields.
+* Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc` for indexes with few (or very dense) keyed fields.

Copilot · 2026-01-07T12:26:52Z

devdoc/indexformats.md

+* Using `hybrid:lz4shc` for indexes with many keyed fields and `inplace:lz4shc\ for indexes with few (or very dense) keyed fields.

-In this case, using the hybrid format would be a better approach.  The branch nodes will likely compress better, and zstd will improve the decompression speed and reduce the disk size - which will reduce the number of disk reads.
+This will trade increased disk space for reduced decompression time.  If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already cpu constrained), then using lz4hcs rather than zstds compression will reduce the cpu load.  Whether that outweights the increased time to read from disk will be data and query dependent.


The word "outweights" is misspelled - it should be "outweighs".

Suggested change

-This will trade increased disk space for reduced decompression time. If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already cpu constrained), then using lz4hcs rather than zstds compression will reduce the cpu load. Whether that outweights the increased time to read from disk will be data and query dependent.
+This will trade increased disk space for reduced decompression time. If all files fit on local fast NVMe storage, and there is no opportunity to reduce costs by reducing the cluster size (e.g. the cluster is already cpu constrained), then using lz4hcs rather than zstds compression will reduce the cpu load. Whether that outweighs the increased time to read from disk will be data and query dependent.
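
The trade-off described in this paragraph can be made concrete with a deliberately simple cost model: time to read the compressed node from disk plus time to decompress it. This is only a sketch. The function name and every number below are made up for illustration and are not measurements of any real codec or disk, and the model ignores caching, request latency and concurrency.

```python
def node_fetch_seconds(compressed_bytes: float,
                       uncompressed_bytes: float,
                       disk_bytes_per_sec: float,
                       decompress_bytes_per_sec: float) -> float:
    """Estimate the time to read one compressed node and decompress it."""
    read_time = compressed_bytes / disk_bytes_per_sec
    # Decompression rate is expressed in output (uncompressed) bytes per second.
    decompress_time = uncompressed_bytes / decompress_bytes_per_sec
    return read_time + decompress_time


# Hypothetical node and codec characteristics (illustrative values only).
NODE_BYTES = 32768
SMALLER_SLOWER = {"compressed": 8192, "rate": 1e9}   # compresses better, slower to expand
LARGER_FASTER = {"compressed": 12288, "rate": 4e9}   # compresses worse, faster to expand


def compare(disk_bytes_per_sec: float) -> str:
    """Return which hypothetical codec gives the lower modelled fetch time."""
    a = node_fetch_seconds(SMALLER_SLOWER["compressed"], NODE_BYTES,
                           disk_bytes_per_sec, SMALLER_SLOWER["rate"])
    b = node_fetch_seconds(LARGER_FASTER["compressed"], NODE_BYTES,
                           disk_bytes_per_sec, LARGER_FASTER["rate"])
    return "smaller_slower" if a < b else "larger_faster"


if __name__ == "__main__":
    print("fast disk:", compare(3e9))
    print("slow disk:", compare(100e6))
```

With these made-up numbers the faster-to-decompress codec wins on the fast disk and the better-compressing codec wins on the slow one, which is why the outcome depends on the data and the storage.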

Copilot · 2026-01-07T12:26:53Z

devdoc/indexformats.md

+
+- It will reduce the size and number of the branch nodes - occasionally significantly.
+- It may marginally speed up searching on the indexes because fewer fields need to be compared when performing a keyed match.-
+- It will have a notable affect on the size of inplace indexes, but relatively little on hybrid indexes.


The word "affect" should be "effect" in this context. "Affect" is typically a verb meaning "to influence," while "effect" is a noun meaning "a result or consequence." The sentence is talking about the impact/result on the size, so "effect" is correct.

Suggested change

-- It will have a notable affect on the size of inplace indexes, but relatively little on hybrid indexes.
+- It will have a notable effect on the size of inplace indexes, but relatively little on hybrid indexes.

Copilot · 2026-01-07T12:26:53Z

devdoc/indexformats.md

+Compression types:
+* hybrid - inplace branches, legacy-style leaf nodes
+


The text "Compression types:" is listed after the "Types" section, but it should probably be labeled as "Formats:" or integrated into the Types section differently. The current structure lists "inplace" as a Type but then shows "hybrid" under "Compression types", which is confusing since hybrid is described as a format type in the earlier documentation, not a compression algorithm.

Suggested change

Compression types:

* hybrid - inplace branches, legacy-style leaf nodes

* hybrid - inplace branches, legacy-style leaf nodes

Copilot · 2026-01-07T12:26:53Z

devdoc/indexformats.md

+
+compression:
+* lzw - historical compression method.  Not recommended.
+* lz4s - fast to decompress, does not compress very well as lzw.


The phrase "does not compress very well as lzw" is grammatically incorrect. It should be "does not compress as well as lzw" to properly compare the two compression methods.

Suggested change

-* lz4s - fast to decompress, does not compress very well as lzw.
+* lz4s - fast to decompress, does not compress as well as lzw.

Copilot · 2026-01-07T12:26:54Z

devdoc/indexformats.md

+* zstds6 - alias for zstds(level=6)
+* zstds9 - alias for zstds(level=9)
+
+NOTE: lz4s, lz4shc and zstds these are the streaming versions and have a 's' suffix.  They should be used for the node compression.


The sentence "NOTE: lz4s, lz4shc and zstds these are the streaming versions and have a 's' suffix." is grammatically incorrect. It should be "NOTE: lz4s, lz4shc and zstds are the streaming versions and have an 's' suffix." (remove "these", and use "an" before 's').

Suggested change

-NOTE: lz4s, lz4shc and zstds these are the streaming versions and have a 's' suffix. They should be used for the node compression.
+NOTE: lz4s, lz4shc and zstds are the streaming versions and have an 's' suffix. They should be used for the node compression.
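
To make the alias and suffix rules in this hunk easier to check, here is a minimal sketch of how the names relate. The helper names are hypothetical and this is not platform code. The tables contain only the names quoted in this hunk (`zstds6`, `zstds9`, `lz4s`, `lz4shc`, `zstds`), so the real option list may be longer.

```python
# Aliases quoted in the documentation hunk above.
ALIASES = {
    "zstds6": "zstds(level=6)",
    "zstds9": "zstds(level=9)",
}

# Streaming variants named in the NOTE; these are the ones meant for node compression.
STREAMING = ("lz4s", "lz4shc", "zstds")


def expand_alias(name: str) -> str:
    """Return the long form of an alias, or the name unchanged if it is not an alias."""
    return ALIASES.get(name, name)


def is_streaming(name: str) -> bool:
    """Check whether a compression name (or alias) is one of the streaming variants."""
    base = expand_alias(name).split("(", 1)[0]
    return base in STREAMING


if __name__ == "__main__":
    for candidate in ("zstds6", "lz4shc", "zstd"):
        print(candidate, expand_alias(candidate), is_streaming(candidate))
```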

HPCC-35580 Further updates to the index documentation

4e6a3f6

Signed-off-by: Gavin Halliday <[email protected]>

Copilot AI review requested due to automatic review settings January 7, 2026 12:21

Copilot started reviewing on behalf of ghalliday January 7, 2026 12:24 View session

Copilot AI reviewed Jan 7, 2026

ghalliday requested review from mckellyln and removed request for mckellyln January 8, 2026 17:31
