Commit 2cebcfc

Dictionary Creation (#91)
* refactor: move compress_fastest to a new file
* sync
* feat(dict): bare structure of dictionary creation
* .
* .
* dict: more scaffolding for file processing
* dict: rudimentary implementation
* sync
* dict: rudimentary implementation complete
* dict: pre-clippy auto apply
* refactor: specify raw content dictionary creation
* lint: fixing clippy
* docs: update readme.md to include dict builder
* docs: include some rustdoc metadata
* lint: fixing clippy
* pr(cleanup): apply feedback from pull/91
  - Fix typo in cargo.toml
  - set VERBOSE to false and add a test to verify it's false
  - remove commented out bench code from zstd_dict.rs

Co-authored-by: arc <zleyyij@users.noreply.github.com>
1 parent fb92a10 commit 2cebcfc

File tree

19 files changed (+648 lines, −30 lines)


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 **/*.rs.bk
 Cargo.lock
 /local_corpus_files
+/local_dict_corpus_files
 /orig-zstd
 fuzz_decodecorpus
 perf.data*

Cargo.toml

Lines changed: 12 additions & 1 deletion
@@ -12,6 +12,10 @@ readme = "Readme.md"
 keywords = ["zstd", "zstandard", "decompression"]
 categories = ["compression"]
 
+[package.metadata.docs.rs]
+all-features = true
+rustdoc-args = ["--cfg", "docsrs"]
+
 [dependencies]
 twox-hash = { version = "2.0", default-features = false, features = ["xxhash64"], optional = true }
 
@@ -20,17 +24,20 @@ twox-hash = { version = "2.0", default-features = false, features = ["xxhash64"]
 compiler_builtins = { version = "0.1.2", optional = true }
 core = { version = "1.0.0", optional = true, package = "rustc-std-workspace-core" }
 alloc = { version = "1.0.0", optional = true, package = "rustc-std-workspace-alloc" }
+fastrand = "2.3.0"
+
 
 [dev-dependencies]
 criterion = "0.5"
 rand = { version = "0.8.5", features = ["small_rng"] }
-zstd = "0.13.2"
+zstd = { version = "0.13.2", features = ["zstdmt"]}
 
 [features]
 default = ["hash", "std"]
 hash = ["dep:twox-hash"]
 fuzz_exports = []
 std = []
+dict_builder = ["std"]
 
 # Internal feature, only used when building as part of libstd, not part of the
 # stable interface of this crate.
@@ -47,3 +54,7 @@ required-features = ["std"]
 [[bin]]
 name = "zstd_stream"
 required-features = ["std"]
+
+[[bin]]
+name = "zstd_dict"
+required-features = ["std", "dict_builder"]

Readme.md

Lines changed: 32 additions & 5 deletions
@@ -15,22 +15,49 @@ This crate is currently actively maintained.
 
 # Current Status
 
-Feature complete on the decoder side.
+## Decompression
+The `decoding` module provides a complete
+implementation of a Zstandard decompressor.
+
+In terms of speed, `ruzstd` is behind the original C implementation
+which has a rust binding located [here](https://github.com/gyscos/zstd-rs).
+
+Measuring with the 'time' utility the original zstd and my decoder both
+decoding the same enwik9.zst file from a ramfs, my decoder is about 3.5
+times slower. Enwik9 is highly compressible, for less compressible data
+(like a ubuntu installation .iso) my decoder comes close to only being
+1.4 times slower.
 
+## Compression
 On the compression side:
 - Support for generating compressed blocks at any compression level
   - [x] Uncompressed
   - [x] Fastest (roughly level 1)
   - [ ] Default (roughly level 3)
   - [ ] Better (roughly level 7)
   - [ ] Best (roughly level 11)
-- [ ] Checksums
+- [x] Checksums
 - [ ] Dictionaries
 
-## Speed
-In terms of speed this library is behind the original C implementation which has a rust binding located [here](https://github.com/gyscos/zstd-rs).
+## Dictionary Generation
+When the `dict_builder` feature is enabled, the `dictionary` module
+provides the ability to create new dictionaries.
+
+On the `github-users` sample set, our implementation benchmarks within
+0.2% of the official implementation (as of commit
+`09e52d07340acdb2e13817b066e8be6e424f7258`):
+```no_build
+uncompressed: 100.00% (7484607 bytes)
+no dict: 34.99% of original size (2618872 bytes)
+reference dict: 16.16% of no dict size (2195672 bytes smaller)
+our dict: 16.28% of no dict size (2192400 bytes smaller)
+```
+
+The dictionary generator only provides support for creating "raw
+content" dictionaries. Tagged dictionaries are currently unsupported.
 
-Measuring with the 'time' utility the original zstd and my decoder both decoding the same enwik9.zst file from a ramfs, my decoder is about 3.5 times slower. Enwik9 is highly compressible, for less compressible data (like a ubuntu installation .iso) my decoder comes close to only being 1.4 times slower.
+See <https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#dictionary-format>
+for clarification.
 
 
 # How can you use it?
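
To make the new Readme section concrete, here is a minimal sketch of building a raw content dictionary programmatically. It is modeled on the `create_raw_dict_from_source` call in the `zstd_dict` binary added later in this commit; the exact trait bounds, return type, and the unit of the size argument are not visible in this diff, so treat those details as assumptions.

```rust
// Sketch only: mirrors the call shape used in src/bin/zstd_dict.rs.
// Build with the `dict_builder` feature enabled.
use ruzstd::dictionary::create_raw_dict_from_source;
use std::fs::File;

fn main() -> std::io::Result<()> {
    // Concatenated training samples (hypothetical file name).
    let source = File::open("samples.bin")?;
    let source_size = source.metadata()?.len() as usize;
    let mut output = File::create("raw_content.dict")?;
    // Request a ~16 KiB raw content dictionary (size unit assumed to be bytes).
    create_raw_dict_from_source(source, source_size, &mut output, 16 * 1024);
    Ok(())
}
```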

src/bin/zstd.rs

Lines changed: 2 additions & 1 deletion
@@ -21,7 +21,7 @@ struct StateTracker {
     file_size: u64,
     old_percentage: i8,
 }
-
+#[allow(unused)]
 fn decompress(flags: &[String], file_paths: &[String]) {
     if !flags.contains(&"-d".to_owned()) {
         eprintln!("This zstd implementation only supports decompression. Please add a \"-d\" flag");
@@ -128,6 +128,7 @@ fn decompress(flags: &[String], file_paths: &[String]) {
     }
 }
 
+#[allow(unused)]
 struct PercentPrintReader<R: Read> {
     total: usize,
     counter: usize,

src/bin/zstd_dict.rs

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+use ruzstd::dictionary::{create_raw_dict_from_dir, create_raw_dict_from_source};
+use std::env::args;
+use std::fs::File;
+use std::path::Path;
+
+fn main() {
+    let args: Vec<String> = args().collect();
+    let input_path: &Path = args.get(1).expect("no input provided").as_ref();
+    let output_path: &Path = args.get(2).expect("no output path provided").as_ref();
+    let dict_size = args
+        .get(3)
+        .expect("no dict size provided (kb)")
+        .parse::<usize>()
+        .expect("dict size was not a valid num");
+
+    let mut output = File::create(output_path).unwrap();
+    if input_path.is_file() {
+        let source = File::open(input_path).expect("unable to open input path");
+        let source_size = source.metadata().unwrap().len();
+        create_raw_dict_from_source(source, source_size as usize, &mut output, dict_size);
+    } else {
+        create_raw_dict_from_dir(input_path, &mut output, dict_size).unwrap();
+    }
+}
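
With the argument order above (`<input> <output> <dict_size>`), a run over a sample directory would look like `cargo run --features dict_builder --bin zstd_dict -- ./local_dict_corpus_files out.dict 112640`. Note that the third argument's `expect` message says "(kb)", but the parsed value is handed to the builder unchanged, so its actual unit depends on how the `dictionary` module interprets it (not visible in this excerpt).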

src/bit_io/bit_reader.rs

Lines changed: 3 additions & 3 deletions
@@ -66,7 +66,7 @@ impl<'s> BitReader<'s> {
 
         let mut bit_shift = bits_left_in_current_byte; //this many bits are already set in value
 
-        assert!(self.idx % 8 == 0);
+        assert!(self.idx.is_multiple_of(8));
 
         //collect full bytes
         for _ in 0..full_bytes_needed {
@@ -116,7 +116,7 @@ impl core::fmt::Display for GetBitsError {
             } => {
                 write!(
                     f,
-                    "Cant serve this request. The reader is limited to {limit} bits, requested {num_requested_bits} bits",
+                    "Cant serve this request. The reader is limited to {limit} bits, requested {num_requested_bits} bits"
                 )
             }
             GetBitsError::NotEnoughRemainingBits {
@@ -125,7 +125,7 @@ impl core::fmt::Display for GetBitsError {
             } => {
                 write!(
                     f,
-                    "Can\'t read {requested} bits, only have {remaining} bits left",
+                    "Can\'t read {requested} bits, only have {remaining} bits left"
                 )
             }
         }

src/bit_io/bit_writer.rs

Lines changed: 5 additions & 5 deletions
@@ -45,7 +45,7 @@ impl<V: AsMut<Vec<u8>>> BitWriter<V> {
 
     /// Reset to an index. Currently only supports resetting to a byte aligned index
    pub fn reset_to(&mut self, index: usize) {
-        assert!(index % 8 == 0);
+        assert!(index.is_multiple_of(8));
         self.partial = 0;
         self.bits_in_partial = 0;
         self.bit_idx = index;
@@ -66,7 +66,7 @@ impl<V: AsMut<Vec<u8>>> BitWriter<V> {
 
         // We might be changing bits unaligned to byte borders.
         // This means the lower bits of the first byte we are touching must stay the same
-        if idx % 8 != 0 {
+        if !idx.is_multiple_of(8) {
             // How many (upper) bits will change in the first byte?
             let bits_in_first_byte = 8 - (idx % 8);
             // We don't support only changing a few bits in the middle of a byte
@@ -82,7 +82,7 @@ impl<V: AsMut<Vec<u8>>> BitWriter<V> {
             idx += bits_in_first_byte;
         }
 
-        assert!(idx % 8 == 0);
+        assert!(idx.is_multiple_of(8));
         // We are now byte aligned, change idx to byte resolution
         let mut idx = idx / 8;
 
@@ -113,7 +113,7 @@ impl<V: AsMut<Vec<u8>>> BitWriter<V> {
 
     /// Flush temporary internal buffers to the output buffer. Only works if this is currently byte aligned
     pub fn flush(&mut self) {
-        assert!(self.bits_in_partial % 8 == 0);
+        assert!(self.bits_in_partial.is_multiple_of(8));
         let full_bytes = self.bits_in_partial / 8;
         self.output
             .as_mut()
@@ -204,7 +204,7 @@ impl<V: AsMut<Vec<u8>>> BitWriter<V> {
     /// Returns how many bits are missing for an even byte
     pub fn misaligned(&self) -> usize {
         let idx = self.index();
-        if idx % 8 == 0 {
+        if idx.is_multiple_of(8) {
             0
         } else {
             8 - (idx % 8)

src/dictionary/cover.rs

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
+//! An implementation of the local maximum coverage algorithm
+//! described in the paper "Effective Construction of Relative Lempel-Ziv Dictionaries",
+//! by Liao, Petri, Moffat, and Wirth, published under the University of Melbourne.
+//!
+//! See: <https://people.eng.unimelb.edu.au/ammoffat/abstracts/lpmw16www.pdf>
+//!
+//! Facebook's implementation was also used as a reference.
+//! <https://github.com/facebook/zstd/tree/dev/lib/dictBuilder>
+
+use super::DictParams;
+use crate::dictionary::frequency::estimate_frequency;
+use core::convert::TryInto;
+use std::collections::HashMap;
+use std::vec::Vec;
+
+/// The size of each k-mer.
+///
+/// As found under "4: Experiments - Varying k-mer Size" in the original paper,
+/// "when k = 16, across all our text collections, there is a reasonable spread".
+/// Reasonable range: [6, 16]
+pub(super) const K: usize = 16;
+
+pub(super) type KMer = [u8; K];
+
+pub struct Segment {
+    /// The actual contents of the segment.
+    pub raw: Vec<u8>,
+    /// A measure of how "ideal" a given segment would be to include in the dictionary.
+    ///
+    /// Higher is better; there's no upper limit. This number is determined by
+    /// estimating the number of occurrences in a given epoch.
+    pub score: usize,
+}
+
+impl Eq for Segment {}
+
+impl PartialEq for Segment {
+    fn eq(&self, other: &Self) -> bool {
+        // We only really care about score in regard to heap order
+        self.score == other.score
+    }
+}
+
+impl PartialOrd for Segment {
+    fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl Ord for Segment {
+    fn cmp(&self, other: &Self) -> core::cmp::Ordering {
+        self.score.cmp(&other.score)
+    }
+}
+
+/// A re-usable set of large allocations
+/// that are used multiple times during dictionary construction (once per epoch).
+pub struct Context {
+    /// Keeps track of the number of occurrences of a particular k-mer within an epoch.
+    ///
+    /// Reset for each epoch.
+    pub frequencies: HashMap<KMer, usize>,
+}
+
+/// Returns the highest scoring segment in an epoch
+/// as a slice of that epoch.
+pub fn pick_best_segment(
+    params: &DictParams,
+    ctx: &mut Context,
+    collection_sample: &'_ [u8],
+) -> Segment {
+    let mut segments = collection_sample
+        .chunks(params.segment_size as usize)
+        .peekable();
+    let mut best_segment: &[u8] = segments.peek().expect("at least one segment");
+    let mut top_segment_score: usize = 0;
+    // Iterate over segments and score each segment, keeping track of the best segment
+    for segment in segments {
+        let segment_score = score_segment(ctx, collection_sample, segment);
+        if segment_score > top_segment_score {
+            best_segment = segment;
+            top_segment_score = segment_score;
+        }
+    }
+
+    Segment {
+        raw: best_segment.into(),
+        score: top_segment_score,
+    }
+}
+
+/// Given a segment, compute the score (or usefulness) of that segment against the entire epoch.
+///
+/// `score_segment` modifies `ctx.frequencies`.
+fn score_segment(ctx: &mut Context, collection_sample: &[u8], segment: &[u8]) -> usize {
+    let mut segment_score = 0;
+    // Determine the score of each overlapping k-mer
+    for i in 0..(segment.len() - K - 1) {
+        let kmer: &KMer = (&segment[i..i + K])
+            .try_into()
+            .expect("Failed to make kmer");
+        // If the k-mer is already in the pool, it receives a score of zero
+        if ctx.frequencies.contains_key(kmer) {
+            continue;
+        }
+        let kmer_score = estimate_frequency(kmer, collection_sample);
+        ctx.frequencies.insert(*kmer, kmer_score);
+        segment_score += kmer_score;
+    }
+
+    segment_score
+}
+
+/// Computes the number of epochs and the size of each epoch.
+///
+/// Returns a (number of epochs, epoch size) tuple.
+///
+/// A translation of `COVER_epoch_info_t COVER_computeEpochs()` from facebook/zstd.
+pub fn compute_epoch_info(
+    params: &DictParams,
+    max_dict_size: usize,
+    num_kmers: usize,
+) -> (usize, usize) {
+    let min_epoch_size = 10_000; // 10 KB
+    let mut num_epochs: usize = usize::max(1, max_dict_size / params.segment_size as usize);
+    let mut epoch_size: usize = num_kmers / num_epochs;
+    if epoch_size >= min_epoch_size {
+        assert!(epoch_size * num_epochs <= num_kmers);
+        return (num_epochs, epoch_size);
+    }
+    epoch_size = usize::min(min_epoch_size, num_kmers);
+    num_epochs = num_kmers / epoch_size;
+    (num_epochs, epoch_size)
+}
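
This module exposes the two COVER primitives, epoch partitioning (`compute_epoch_info`) and per-epoch selection (`pick_best_segment`), but not the driver that stitches epochs into a dictionary; that lives elsewhere in this commit. As a worked example of the epoch math: with `max_dict_size = 112_640` and `params.segment_size = 2048`, `num_epochs = 55` and `epoch_size = num_kmers / 55`. A hypothetical driver, assuming it sits in this same module and that `DictParams` carries the `segment_size` field used above, might look like:

```rust
// Hypothetical driver sketch; the commit's actual driver may differ.
fn build_dict_sketch(params: &DictParams, data: &[u8], max_dict_size: usize) -> Vec<u8> {
    // Every window of K consecutive bytes is one k-mer.
    let num_kmers = data.len().saturating_sub(K - 1);
    let (num_epochs, epoch_size) = compute_epoch_info(params, max_dict_size, num_kmers);

    let mut ctx = Context {
        frequencies: HashMap::new(),
    };
    let mut dict = Vec::with_capacity(max_dict_size);

    // Each epoch is a disjoint slice of the input; take its locally best segment
    // until the dictionary budget is exhausted.
    for epoch in 0..num_epochs {
        let start = epoch * epoch_size;
        let end = usize::min(start + epoch_size, data.len());
        let best = pick_best_segment(params, &mut ctx, &data[start..end]);
        if dict.len() + best.raw.len() > max_dict_size {
            break;
        }
        dict.extend_from_slice(&best.raw);
        // `Context::frequencies` is documented as reset per epoch.
        ctx.frequencies.clear();
    }
    dict
}
```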
