Commit e270aa7

michielp1807 authored and folkertdev committed
Add docs for trainFromBuffer and optimizeTrainFromBuffer functions
1 parent c70084e commit e270aa7

3 files changed: +133 -0 lines changed

lib/dictBuilder/cover.rs

Lines changed: 47 additions & 0 deletions
@@ -801,6 +801,29 @@ pub(super) const unsafe fn assume_init_ref<T>(slice: &[MaybeUninit<T>]) -> &[T]
     unsafe { &*(slice as *const [MaybeUninit<T>] as *const [T]) }
 }
 
+/// Train a dictionary from an array of samples using the COVER algorithm.
+///
+/// Samples must be stored concatenated in a single flat buffer `samplesBuffer`, supplied with an
+/// array of sizes `samplesSizes`, providing the size of each sample, in order.
+///
+/// The resulting dictionary will be saved into `dictBuffer`.
+///
+/// In general, a reasonable dictionary has a size of ~100 KB. It's possible to select a smaller or
+/// larger size, just by specifying `dictBufferCapacity`. In general, it's recommended to provide a
+/// few thousand samples, though this can vary a lot. It's recommended that the total size of all
+/// samples be about 100 times the target size of the dictionary.
+///
+/// # Returns
+///
+/// - on success, the size of the dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
+/// - on failure, an error code, which can be tested with [`ZDICT_isError`]
+///
+/// Dictionary training will fail if there are not enough samples to construct a dictionary, or if
+/// most of the samples are too small (< 8 bytes being the lower limit). If dictionary training
+/// fails, you should use zstd without a dictionary, as the dictionary would've been ineffective
+/// anyway. If you believe your samples would benefit from a dictionary, please open an issue with
+/// details, and we can look into it.
+///
 /// # Safety
 ///
 /// Behavior is undefined if any of the following conditions are violated:
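The flat-buffer layout described in the doc comment above can be illustrated with a small sketch. `flatten_samples` is a hypothetical helper written for this note, not a function of the crate: it concatenates the samples into one buffer (the role of `samplesBuffer`) and records each length in order (the role of `samplesSizes`).

```rust
/// Hypothetical helper (not part of the crate): concatenates samples into a
/// single flat buffer and records the size of each sample, in order.
fn flatten_samples(samples: &[&[u8]]) -> (Vec<u8>, Vec<usize>) {
    // Pre-compute the total so the buffer is allocated once.
    let total: usize = samples.iter().map(|s| s.len()).sum();
    let mut buffer = Vec::with_capacity(total);
    let mut sizes = Vec::with_capacity(samples.len());

    for sample in samples {
        buffer.extend_from_slice(sample);
        sizes.push(sample.len());
    }

    (buffer, sizes)
}
```

With this layout, the sum of the sizes always equals the buffer length, which is what the safety requirements on `samplesBuffer` and `samplesSizes` rely on.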
@@ -1200,6 +1223,30 @@ fn COVER_tryParameters(data: Box<COVER_tryParameters_data_t>) {
     drop(freqs);
 }
 
+/// This function tries many parameter combinations (specifically, `k` and `d` combinations) and
+/// picks the best parameters.
+///
+/// `*parameters` is filled with the best parameters found, and the dictionary constructed with
+/// those parameters is stored in `dictBuffer`.
+///
+/// The parameters `d`, `k`, and `steps` are optional:
+/// - If `d` is zero, we check `d` in 6..8.
+/// - If `k` is zero, we check `k` in 50..2000.
+/// - If `steps` is zero, it defaults to 40.
+///
+/// # Returns
+///
+/// - on success, the size of the dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
+/// - on failure, an error code, which can be tested with [`ZDICT_isError`]
+///
+/// Dictionary training will fail if there are not enough samples to construct a dictionary, or if
+/// most of the samples are too small (< 8 bytes being the lower limit). If dictionary training
+/// fails, you should use zstd without a dictionary, as the dictionary would've been ineffective
+/// anyway. If you believe your samples would benefit from a dictionary, please open an issue with
+/// details, and we can look into it.
+///
+/// On success `*parameters` contains the parameters selected.
+///
 /// # Safety
 ///
 /// Behavior is undefined if any of the following conditions are violated:
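To make the zero-means-search convention above concrete, here is a minimal sketch of how such a candidate grid could be enumerated. It rests on assumptions not stated in the diff: that "`d` in 6..8" means trying `d = 6` and `d = 8` (stepping by 2), and that `steps` divides the `k` range into evenly spaced candidates, as the upstream C zstd implementation is understood to do. `candidate_grid` is a hypothetical name, not a function of the crate.

```rust
/// Hypothetical sketch of the (d, k) search grid. Zero means "search this
/// parameter"; a non-zero value pins it. The spacing rule is an assumption.
fn candidate_grid(d: u32, k: u32, steps: u32) -> Vec<(u32, u32)> {
    let (min_d, max_d) = if d == 0 { (6, 8) } else { (d, d) };
    let (min_k, max_k) = if k == 0 { (50, 2000) } else { (k, k) };
    let steps = if steps == 0 { 40 } else { steps };
    // Distance between consecutive k candidates, never less than 1.
    let k_step = ((max_k - min_k) / steps).max(1);

    let mut grid = Vec::new();
    for d in (min_d..=max_d).step_by(2) {
        for k in (min_k..=max_k).step_by(k_step as usize) {
            // Assumed constraint: a segment of size k must hold a d-mer.
            if k >= d {
                grid.push((d, k));
            }
        }
    }
    grid
}
```

Under these assumptions, leaving all three parameters at zero yields 2 values of `d` times 41 values of `k`, which is why pinning `d` or `k` shortens training considerably.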

lib/dictBuilder/fastcover.rs

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -468,6 +468,32 @@ fn FASTCOVER_convertToFastCoverParams(
     fastCoverParams.shrinkDict = coverParams.shrinkDict;
 }
 
+/// Train a dictionary from an array of samples using a modified version of the COVER algorithm.
+///
+/// Samples must be stored concatenated in a single flat buffer `samplesBuffer`, supplied with an
+/// array of sizes `samplesSizes`, providing the size of each sample, in order.
+///
+/// Only parameters `d` and `k` are required. All other parameters will use default values if not
+/// provided.
+///
+/// The resulting dictionary will be saved into `dictBuffer`.
+///
+/// In general, a reasonable dictionary has a size of ~100 KB. It's possible to select a smaller or
+/// larger size, just by specifying `dictBufferCapacity`. In general, it's recommended to provide a
+/// few thousand samples, though this can vary a lot. It's recommended that the total size of all
+/// samples be about 100 times the target size of the dictionary.
+///
+/// # Returns
+///
+/// - on success, the size of the dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
+/// - on failure, an error code, which can be tested with [`crate::ZDICT_isError`]
+///
+/// Dictionary training will fail if there are not enough samples to construct a dictionary, or if
+/// most of the samples are too small (< 8 bytes being the lower limit). If dictionary training
+/// fails, you should use zstd without a dictionary, as the dictionary would've been ineffective
+/// anyway. If you believe your samples would benefit from a dictionary, please open an issue with
+/// details, and we can look into it.
+///
 /// # Safety
 ///
 /// Behavior is undefined if any of the following conditions are violated:
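The sizing guidance in the doc comment above (samples of at least 8 bytes, total sample size around 100 times the target dictionary size) can be turned into a pre-flight check before training. The sketch below is only an illustration: `CorpusReport` and `check_corpus` are hypothetical names, and the thresholds are taken from the documented rules of thumb rather than from the training code itself.

```rust
/// Hypothetical pre-flight report; not an API of the crate.
#[derive(Debug, PartialEq)]
struct CorpusReport {
    /// Sum of all sample sizes, in bytes.
    total_bytes: usize,
    /// Number of samples that reach the documented 8-byte lower limit.
    usable_samples: usize,
    /// Whether the total is at least 100 times the target dictionary size.
    meets_size_guideline: bool,
}

/// Compares a corpus, described by its sample sizes, with the documented guidance.
fn check_corpus(sample_sizes: &[usize], dict_capacity: usize) -> CorpusReport {
    const MIN_SAMPLE_BYTES: usize = 8; // from "< 8 bytes being the lower limit"
    const SIZE_RATIO: usize = 100; // from "about 100 times the target size"

    let total_bytes: usize = sample_sizes.iter().sum();
    let usable_samples = sample_sizes
        .iter()
        .filter(|&&size| size >= MIN_SAMPLE_BYTES)
        .count();

    CorpusReport {
        total_bytes,
        usable_samples,
        meets_size_guideline: total_bytes >= dict_capacity.saturating_mul(SIZE_RATIO),
    }
}
```

A report that falls short of the guideline does not mean training will fail; it only suggests that the resulting dictionary may be less effective.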
@@ -604,6 +630,31 @@ fn train_from_buffer_fastcover(
     dictionarySize
 }
 
+/// This function tries many parameter combinations (specifically, `k` and `d` combinations) and
+/// picks the best parameters.
+///
+/// `*parameters` is filled with the best parameters found, and the dictionary constructed with
+/// those parameters is stored in `dictBuffer`.
+///
+/// The parameters `d`, `k`, `steps`, and `accel` are optional:
+/// - If `d` is zero, we check `d` in 6..8.
+/// - If `k` is zero, we check `k` in 50..2000.
+/// - If `steps` is zero, it defaults to 40.
+/// - If `accel` is zero, the default value of 1 is used.
+///
+/// # Returns
+///
+/// - on success, the size of the dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
+/// - on failure, an error code, which can be tested with [`crate::ZDICT_isError`]
+///
+/// Dictionary training will fail if there are not enough samples to construct a dictionary, or if
+/// most of the samples are too small (< 8 bytes being the lower limit). If dictionary training
+/// fails, you should use zstd without a dictionary, as the dictionary would've been ineffective
+/// anyway. If you believe your samples would benefit from a dictionary, please open an issue with
+/// details, and we can look into it.
+///
+/// On success `*parameters` contains the parameters selected.
+///
 /// # Safety
 ///
 /// Behavior is undefined if any of the following conditions are violated:

lib/dictBuilder/zdict.rs

Lines changed: 35 additions & 0 deletions
@@ -1587,6 +1587,41 @@ pub unsafe extern "C" fn ZDICT_trainFromBuffer_legacy(
     )
 }
 
+/// Train a dictionary from an array of samples.
+///
+/// Calls single-threaded [`ZDICT_optimizeTrainFromBuffer_fastCover`], with `d=8`, `steps=4`,
+/// `f=20`, and `accel=1`.
+///
+/// Samples must be stored concatenated in a single flat buffer `samplesBuffer`, supplied with an
+/// array of sizes `samplesSizes`, providing the size of each sample, in order. The resulting
+/// dictionary will be saved into `dictBuffer`.
+///
+/// In general, a reasonable dictionary has a size of ~100 KB. It's possible to select a smaller or
+/// larger size, just by specifying `dictBufferCapacity`. In general, it's recommended to provide a
+/// few thousand samples, though this can vary a lot. It's recommended that the total size of all
+/// samples be about 100 times the target size of the dictionary.
+///
+/// # Returns
+///
+/// - on success, the size of the dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
+/// - on failure, an error code, which can be tested with [`ZDICT_isError`]
+///
+/// Dictionary training will fail if there are not enough samples to construct a dictionary, or if
+/// most of the samples are too small (< 8 bytes being the lower limit). If dictionary training
+/// fails, you should use zstd without a dictionary, as the dictionary would've been ineffective
+/// anyway. If you believe your samples would benefit from a dictionary, please open an issue with
+/// details, and we can look into it.
+///
+/// # Safety
+///
+/// Behavior is undefined if any of the following conditions are violated:
+///
+/// - `dictBufferCapacity` is 0 or `dictBuffer` and `dictBufferCapacity` satisfy the requirements
+///   of [`core::slice::from_raw_parts_mut`].
+/// - `nbSamples` is 0 or `samplesSizes` and `nbSamples` satisfy the requirements
+///   of [`core::slice::from_raw_parts`].
+/// - `sum(samplesSizes)` is 0 or `samplesBuffer` and `sum(samplesSizes)` satisfy the requirements
+///   of [`core::slice::from_raw_parts`].
 #[cfg_attr(feature = "export-symbols", export_name = crate::prefix!(ZDICT_trainFromBuffer))]
 pub unsafe extern "C" fn ZDICT_trainFromBuffer(
     dictBuffer: *mut core::ffi::c_void,
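The third safety condition above ties `sum(samplesSizes)` to the memory behind `samplesBuffer`. A caller that owns the samples as a Rust slice could verify this before crossing the `unsafe` boundary, roughly as sketched below. `sizes_fit_buffer` is a hypothetical helper, not part of the crate, and it covers only the length aspect of the `from_raw_parts` requirements (not alignment or pointer validity).

```rust
/// Hypothetical check (not part of the crate): returns true when the sample
/// sizes sum, without overflowing, to no more than the buffer length.
fn sizes_fit_buffer(samples_buffer: &[u8], samples_sizes: &[usize]) -> bool {
    samples_sizes
        .iter()
        // `checked_add` turns an overflowing sum into `None` instead of wrapping.
        .try_fold(0usize, |acc, &size| acc.checked_add(size))
        .map_or(false, |total| total <= samples_buffer.len())
}
```

Using a checked sum matters here: a wrapped total could look small enough to pass a naive comparison while the individual sizes still describe reads far past the end of the buffer.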
