
Commit 307603a

Support ZSTD dictionary compression

patch by Yifan Cai; reviewed by Jon Haddad, Stefan Miklosovic for CASSANDRA-17021

1 parent 9142d0c

85 files changed: +8971 −271 lines

CHANGES.txt

Lines changed: 1 addition & 0 deletions

@@ -1,4 +1,5 @@
 5.1
+* Support ZSTD dictionary compression (CASSANDRA-17021)
 * Fix ExceptionsTable when stacktrace has zero elements (CASSANDRA-20992)
 * Replace blocking wait with non-blocking delay in paxos repair (CASSANDRA-20983)
 * Implementation of CEP-55 - Generation of role names (CASSANDRA-20897)

conf/cassandra.yaml

Lines changed: 46 additions & 0 deletions

@@ -2870,3 +2870,49 @@ storage_compatibility_mode: NONE
 # # especially in keyspaces with many tables. The splitter avoids batching tables together if they
 # # exceed other configuration parameters like bytes_per_assignment or partitions_per_assignment.
 # max_tables_per_assignment: 64
+
+# Dictionary compression settings for ZSTD dictionary-based compression.
+# These settings control the automatic training and caching of compression dictionaries
+# for tables that use ZSTD dictionary compression.
+
+# How often to refresh compression dictionaries across the cluster.
+# During a refresh, nodes check for newer dictionary versions and update their caches.
+# Min unit: s
+compression_dictionary_refresh_interval: 3600s
+
+# Initial delay before starting the first dictionary refresh cycle after node startup.
+# This prevents all nodes from refreshing simultaneously when the cluster starts.
+# Min unit: s
+compression_dictionary_refresh_initial_delay: 10s
+
+# Maximum number of compression dictionaries to cache per table.
+# Each table using dictionary compression can have multiple dictionaries cached
+# (the current version plus recently used versions for reading older SSTables).
+compression_dictionary_cache_size: 10
+
+# How long to keep compression dictionaries in the cache before they expire.
+# Expired dictionaries are removed from memory but can be reloaded if needed.
+# Min unit: s
+compression_dictionary_cache_expire: 24h
+
+# Dictionary training configuration (advanced settings).
+# These settings control how compression dictionaries are trained from sample data.
+
+# Maximum size of a trained compression dictionary.
+# Larger dictionaries may provide better compression but use more memory.
+compression_dictionary_training_max_dictionary_size: 64KiB
+
+# Maximum total size of sample data to collect for dictionary training.
+# More sample data generally produces better dictionaries but takes longer to train.
+# The recommended sample size is 100x the dictionary size.
+compression_dictionary_training_max_total_sample_size: 10MiB
+
+# Enable automatic dictionary training based on sampling of write operations.
+# When enabled, the system automatically collects samples and trains new dictionaries.
+# Manual training via nodetool is always available regardless of this setting.
+compression_dictionary_training_auto_train_enabled: false
+
+# Sampling rate for automatic dictionary training (1-10000).
+# A value of 100 means 1% of writes are sampled. Lower values reduce overhead but may
+# result in less representative sample data for dictionary training.
+compression_dictionary_training_sampling_rate: 100

conf/cassandra_latest.yaml

Lines changed: 46 additions & 0 deletions

@@ -2621,3 +2621,49 @@ storage_compatibility_mode: NONE
 # # especially in keyspaces with many tables. The splitter avoids batching tables together if they
 # # exceed other configuration parameters like bytes_per_assignment or partitions_per_assignment.
 # max_tables_per_assignment: 64
+
+# Dictionary compression settings for ZSTD dictionary-based compression.
+# These settings control the automatic training and caching of compression dictionaries
+# for tables that use ZSTD dictionary compression.
+
+# How often to refresh compression dictionaries across the cluster.
+# During a refresh, nodes check for newer dictionary versions and update their caches.
+# Min unit: s
+compression_dictionary_refresh_interval: 3600s
+
+# Initial delay before starting the first dictionary refresh cycle after node startup.
+# This prevents all nodes from refreshing simultaneously when the cluster starts.
+# Min unit: s
+compression_dictionary_refresh_initial_delay: 10s
+
+# Maximum number of compression dictionaries to cache per table.
+# Each table using dictionary compression can have multiple dictionaries cached
+# (the current version plus recently used versions for reading older SSTables).
+compression_dictionary_cache_size: 10
+
+# How long to keep compression dictionaries in the cache before they expire.
+# Expired dictionaries are removed from memory but can be reloaded if needed.
+# Min unit: s
+compression_dictionary_cache_expire: 24h
+
+# Dictionary training configuration (advanced settings).
+# These settings control how compression dictionaries are trained from sample data.
+
+# Maximum size of a trained compression dictionary.
+# Larger dictionaries may provide better compression but use more memory.
+compression_dictionary_training_max_dictionary_size: 64KiB
+
+# Maximum total size of sample data to collect for dictionary training.
+# More sample data generally produces better dictionaries but takes longer to train.
+# The recommended sample size is 100x the dictionary size.
+compression_dictionary_training_max_total_sample_size: 10MiB
+
+# Enable automatic dictionary training based on sampling of write operations.
+# When enabled, the system automatically collects samples and trains new dictionaries.
+# Manual training via nodetool is always available regardless of this setting.
+compression_dictionary_training_auto_train_enabled: false
+
+# Sampling rate for automatic dictionary training (1-10000).
+# A value of 100 means 1% of writes are sampled. Lower values reduce overhead but may
+# result in less representative sample data for dictionary training.
+compression_dictionary_training_sampling_rate: 100

doc/modules/cassandra/pages/managing/operating/compression.adoc

Lines changed: 221 additions & 0 deletions

@@ -49,6 +49,8 @@ these areas (A is relatively good, F is relatively bad):
 
 |https://facebook.github.io/zstd/[Zstd] |`ZstdCompressor` | A- | A- | A+ | `>= 4.0`
 
+|https://facebook.github.io/zstd/[Zstd with Dictionary] |`ZstdDictionaryCompressor` | A- | A- | A++ | `>= 6.0`
+
 |http://google.github.io/snappy/[Snappy] |`SnappyCompressor` | A- | A | C | `>= 1.0`
 
 |https://zlib.net[Deflate (zlib)] |`DeflateCompressor` | C | C | A | `>= 1.0`
@@ -60,13 +62,112 @@ cycle spent. This is why it is the default choice in Cassandra.
 
 For storage critical applications (disk footprint), however, `Zstd` may
 be a better choice as it can get significant additional ratio to `LZ4`.
+For workloads with highly repetitive or similar data patterns,
+`ZstdDictionaryCompressor` can achieve even better compression ratios by
+training a compression dictionary on representative data samples.
 
 `Snappy` is kept for backwards compatibility and `LZ4` will typically be
 preferable.
 
 `Deflate` is kept for backwards compatibility and `Zstd` will typically
 be preferable.
 
+== ZSTD Dictionary Compression
+
+The `ZstdDictionaryCompressor` extends standard ZSTD compression by using
+trained compression dictionaries to achieve superior compression ratios,
+particularly for workloads with repetitive or similar data patterns.
+
+=== How Dictionary Compression Works
+
+Dictionary compression improves upon standard compression by training a
+compression dictionary on representative samples of your data. This
+dictionary captures common patterns, repeated strings, and data structures,
+allowing the compressor to reference these patterns more efficiently than
+discovering them independently in each compression chunk.
+
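The mechanism described above can be sketched with Python's standard-library `zlib`, which supports preset dictionaries via its `zdict` parameter (ZSTD trains richer dictionaries, but the principle is the same). The record layout and dictionary below are invented for illustration: patterns supplied up front make small, similar records compress far better than discovering them independently in each chunk.

```python
import zlib

# Hypothetical rows for a table with a repetitive structure, each
# compressed independently (as each compression chunk is on its own).
records = [
    b'{"user_id": %d, "status": "active", "region": "us-east-1"}' % i
    for i in range(100)
]

# A preset dictionary built from a representative sample of the data.
dictionary = b'{"user_id": 0, "status": "active", "region": "us-east-1"}'

def compress(record, zdict=None):
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(record) + c.flush()

plain = sum(len(compress(r)) for r in records)
with_dict = sum(len(compress(r, dictionary)) for r in records)

# Decompression needs the same dictionary, mirroring why historical
# dictionaries must be retained to read older SSTables.
d = zlib.decompressobj(zdict=dictionary)
assert d.decompress(compress(records[0], dictionary)) == records[0]

print(plain > with_dict)  # the dictionary-aware total is smaller
```

The gain is largest when each record is small relative to the shared structure, which is exactly the chunk-compression situation described above.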
+=== When to Use Dictionary Compression
+
+Dictionary compression is most effective for:
+
+* *Tables with similar row structures*: JSON documents, XML data, or
+repeated data schemas benefit significantly from dictionary compression.
+* *Storage-critical workloads*: When disk space savings justify the
+additional operational overhead of dictionary training and management.
+* *Large datasets with repetitive patterns*: The more similar your data,
+the better the compression ratio improvement.
+
+Dictionary compression may not be ideal for:
+
+* *Highly random or unique data*: Already-compressed data or cryptographic
+data will see minimal benefit.
+* *Small tables*: The overhead of dictionary management may outweigh the
+storage savings.
+* *Frequently changing schemas*: Schema changes may require retraining
+dictionaries to maintain optimal compression ratios.
+
+=== Dictionary Training
+
+Before dictionary compression can provide optimal results, a compression
+dictionary must be trained on representative data samples. Cassandra
+supports both manual and automatic training approaches.
+
+==== Manual Dictionary Training
+
+Use the `nodetool compressiondictionary train` command to manually train
+a compression dictionary:
+
+[source,bash]
+----
+nodetool compressiondictionary train <keyspace> <table>
+----
+
+The command trains a dictionary by sampling from existing SSTables. If no
+SSTables are available on disk (e.g., all data is in memtables), the command
+will automatically flush the memtable before sampling.
+
+The training process completes synchronously and displays progress information
+including sample count, sample size, and elapsed time. Training typically
+completes within minutes for most workloads.
+
+By default, training will only proceed if enough samples have been collected.
+To force training even with insufficient samples, use the `--force` or `-f` option:
+
+[source,bash]
+----
+nodetool compressiondictionary train --force <keyspace> <table>
+----
+
+This can be useful for testing or when you want to train a dictionary from
+limited data during initial setup.
+
+==== Automatic Dictionary Training
+
+Enable automatic training in `cassandra.yaml`:
+
+[source,yaml]
+----
+compression_dictionary_training_auto_train_enabled: true
+compression_dictionary_training_sampling_rate: 100 # 1% of writes
+----
+
+When enabled, Cassandra automatically samples write operations and
+trains dictionaries in the background based on the configured sampling
+rate (range: 1-10000, where 100 = 1% of writes).
+
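One plausible reading of the sampling-rate semantics above, as an illustrative sketch rather than Cassandra's implementation: treat the rate as a numerator out of 10000 and sample each write with that probability.

```python
import random

SAMPLING_RATE = 100  # out of 10000, i.e. 1% of writes

def should_sample(rate, rng):
    # rate=100 -> roughly 1 in 100 writes contributes a training sample
    return rng.randrange(10000) < rate

rng = random.Random(17)  # seeded so the sketch is reproducible
sampled = sum(should_sample(SAMPLING_RATE, rng) for _ in range(100_000))
print(sampled)  # close to 1000, i.e. ~1% of 100,000 writes
```

Lowering the rate cuts the per-write overhead proportionally, at the cost of a smaller (possibly less representative) training corpus.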
+=== Dictionary Storage and Distribution
+
+Compression dictionaries are stored cluster-wide in the
+`system_distributed.compression_dictionaries` table. Each table can
+maintain multiple dictionary versions: the current dictionary for
+compressing new SSTables, plus historical dictionaries needed for
+reading older SSTables.
+
+Dictionaries are identified by `dict_id`, with higher IDs representing
+newer dictionaries. Cassandra automatically refreshes dictionaries
+across the cluster based on configured intervals, and caches them
+locally to minimize lookup overhead.
+
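A toy sketch of the versioning scheme just described (the names and structure are assumptions, not Cassandra's code): the highest `dict_id` is used when writing new SSTables, while older IDs stay resolvable for SSTables already on disk.

```python
# dict_id -> dictionary bytes, as if loaded from
# system_distributed.compression_dictionaries for one table
dictionaries = {1: b"dict-v1", 2: b"dict-v2", 3: b"dict-v3"}

def current_dict_id(dicts):
    # Higher dict_id means newer; new SSTables use the latest dictionary.
    return max(dicts)

def dict_for_sstable(dicts, recorded_id):
    # Each SSTable is tied to the dictionary version it was written with;
    # historical entries must remain resolvable until those SSTables are
    # rewritten with a newer dictionary.
    return dicts[recorded_id]

print(current_dict_id(dictionaries), dict_for_sstable(dictionaries, 1))
```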
 == Configuring Compression
 
 Compression is configured on a per-table basis as an optional argument

@@ -105,6 +206,17 @@ should be used with caution, as they require more memory. The default of
 `3` is a good choice for competing with `Deflate` ratios and `1` is a
 good choice for competing with `LZ4`.
 
+The `ZstdDictionaryCompressor` supports the same options as
+`ZstdCompressor`:
+
+* `compression_level` (default `3`): Same range and behavior as
+`ZstdCompressor`. Dictionary compression provides improved ratios at
+any compression level compared to standard ZSTD.
+
+NOTE: `ZstdDictionaryCompressor` requires a trained compression
+dictionary to achieve optimal results. See the ZSTD Dictionary
+Compression section above for training instructions.
+
 Users can set compression using the following syntax:
 
 [source,cql]

@@ -121,6 +233,25 @@ ALTER TABLE keyspace.table
 WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 64};
 ----
 
+For dictionary compression:
+
+[source,cql]
+----
+CREATE TABLE keyspace.table (id int PRIMARY KEY)
+WITH compression = {'class': 'ZstdDictionaryCompressor'};
+----
+
+Or with a specific compression level:
+
+[source,cql]
+----
+ALTER TABLE keyspace.table
+WITH compression = {
+    'class': 'ZstdDictionaryCompressor',
+    'compression_level': '3'
+};
+----
+
 Once enabled, compression can be disabled with `ALTER TABLE` setting
 `enabled` to `false`:
 
@@ -140,6 +271,63 @@ immediately, the operator can trigger an SSTable rewrite using
 `nodetool scrub` or `nodetool upgradesstables -a`, both of which will
 rebuild the SSTables on disk, re-compressing the data in the process.
 
+== Dictionary Compression Configuration
+
+When using `ZstdDictionaryCompressor`, several additional configuration
+options are available in `cassandra.yaml` to control dictionary
+management, caching, and training behavior.
+
+=== Dictionary Refresh Settings
+
+* `compression_dictionary_refresh_interval` (default: `3600s`): How often
+to check for and refresh compression dictionaries cluster-wide. Newly
+trained dictionaries will be picked up by all nodes within this interval.
+* `compression_dictionary_refresh_initial_delay` (default: `10s`): Initial
+delay before the first dictionary refresh check after node startup.
+
+=== Dictionary Caching
+
+* `compression_dictionary_cache_size` (default: `10`): Maximum number of
+compression dictionaries to cache per table. Higher values reduce lookup
+overhead but increase memory usage.
+* `compression_dictionary_cache_expire` (default: `24h`): Dictionary
+cache entry TTL. Expired entries are evicted and reloaded on next access.
+
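The caching behavior described above can be sketched as a size- and TTL-bounded map. This is an illustrative sketch under the stated semantics, not Cassandra's implementation; time is passed explicitly to keep it deterministic.

```python
from collections import OrderedDict

class DictionaryCache:
    """Illustrative size- and TTL-bounded cache of dictionaries per table."""

    def __init__(self, max_size=10, expire_seconds=24 * 3600):
        self.max_size = max_size
        self.expire_seconds = expire_seconds
        self._entries = OrderedDict()  # dict_id -> (inserted_at, dictionary)

    def put(self, dict_id, dictionary, now):
        self._entries[dict_id] = (now, dictionary)
        self._entries.move_to_end(dict_id)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the oldest entry

    def get(self, dict_id, now):
        entry = self._entries.get(dict_id)
        if entry is None or now - entry[0] > self.expire_seconds:
            self._entries.pop(dict_id, None)  # expired: reload on next access
            return None
        return entry[1]

cache = DictionaryCache(max_size=2, expire_seconds=3600)
cache.put(1, b"v1", now=0)
cache.put(2, b"v2", now=10)
cache.put(3, b"v3", now=20)        # size bound: dict_id 1 is evicted
print(cache.get(1, now=30), cache.get(3, now=30))
print(cache.get(2, now=5000))      # 4990s elapsed > 3600s TTL -> None
```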
+=== Training Configuration
+
+* `compression_dictionary_training_max_dictionary_size` (default: `64KiB`):
+Maximum size of trained dictionaries. Larger dictionaries can capture
+more patterns but increase memory overhead.
+* `compression_dictionary_training_max_total_sample_size` (default:
+`10MiB`): Maximum total size of sample data to collect for training.
+* `compression_dictionary_training_auto_train_enabled` (default: `false`):
+Enable automatic background dictionary training. When enabled, Cassandra
+samples writes and trains dictionaries automatically.
+* `compression_dictionary_training_sampling_rate` (default: `100`):
+Sampling rate for automatic training, range 1-10000, where 100 = 1% of
+writes. Lower values reduce training overhead but may miss data patterns.
+
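A quick sanity check on the defaults above, assuming the ~100x sample-size guidance from the shipped yaml comments applies to the configured maximum dictionary size:

```python
# Defaults from the settings above
max_dictionary_size = 64 * 1024           # 64KiB
max_total_sample_size = 10 * 1024 * 1024  # 10MiB

# The yaml comments recommend sampling ~100x the dictionary size
recommended_sample_bytes = 100 * max_dictionary_size

print(recommended_sample_bytes)            # 6553600 bytes, i.e. 6.25MiB
print(recommended_sample_bytes <= max_total_sample_size)  # True
```

So the default 10MiB cap leaves headroom over the 100x guidance for a 64KiB dictionary.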
+Example configuration:
+
+[source,yaml]
+----
+# Dictionary refresh and caching
+compression_dictionary_refresh_interval: 3600s
+compression_dictionary_refresh_initial_delay: 10s
+compression_dictionary_cache_size: 10
+compression_dictionary_cache_expire: 24h
+
+# Automatic training
+compression_dictionary_training_auto_train_enabled: false
+compression_dictionary_training_sampling_rate: 100
+compression_dictionary_training_max_dictionary_size: 64KiB
+compression_dictionary_training_max_total_sample_size: 10MiB
+----
+
 == Other options
 
 * `crc_check_chance` (default: `1.0`): determines how likely Cassandra
@@ -186,6 +374,39 @@ correctness of data on disk, compressed tables allow the user to set
 probabilistically validate chunks on read to verify bits on disk are not
 corrupt.
 
+=== Dictionary Compression Operational Considerations
+
+When using `ZstdDictionaryCompressor`, additional operational factors
+apply:
+
+* *Dictionary Storage*: Compression dictionaries are stored in the
+`system_distributed.compression_dictionaries` table and replicated
+cluster-wide. Each table maintains current and historical dictionary
+versions.
+* *Dictionary Cache Memory*: Dictionaries are cached locally on each node
+according to `compression_dictionary_cache_size`. Memory overhead is
+typically minimal (default 64KiB per dictionary × cache size).
+* *Dictionary Training Overhead*: Manual training via
+`nodetool compressiondictionary train` samples SSTable chunk data and
+performs CPU-intensive dictionary training. Consider running training
+during off-peak hours.
+* *Automatic Training Impact*: When
+`compression_dictionary_training_auto_train_enabled` is true, write
+operations are sampled based on `compression_dictionary_training_sampling_rate`.
+This adds minimal overhead but should be monitored in write-intensive
+workloads.
+* *Dictionary Refresh*: The dictionary refresh process
+(`compression_dictionary_refresh_interval`) checks for new dictionaries
+cluster-wide. The default 1-hour interval balances freshness with
+overhead.
+* *SSTable Compatibility*: Each SSTable is compressed with a specific
+dictionary version. Historical dictionaries must be retained to read
+older SSTables until they are compacted with new dictionaries.
+* *Schema Changes*: Significant schema changes or data pattern shifts may
+require retraining dictionaries to maintain optimal compression ratios.
+Monitor the `SSTable Compression Ratio` via `nodetool tablestats` to
+detect degradation.
+
 == Advanced Use
 
 Advanced users can provide their own compression class by implementing

pylib/cqlshlib/cqlhandling.py

Lines changed: 1 addition & 0 deletions

@@ -44,6 +44,7 @@ class CqlParsingRuleSet(pylexotron.ParsingRuleSet):
     'SnappyCompressor',
     'LZ4Compressor',
     'ZstdCompressor',
+    'ZstdDictionaryCompressor',
 )
 
 available_compaction_classes = (
