Commit 377e2a8

Merge pull request #12 from polarsignals/determ
deterministic array encodings
2 parents d803d25 + 76e5cef commit 377e2a8

4 files changed: +36 -33 lines changed

docs/specs/file-format.md

Lines changed: 0 additions & 23 deletions
@@ -117,26 +117,3 @@ The plan is that at write-time, a minimum supported reader version is declared.
 reader version can then be embedded into the file with WebAssembly decompression logic. Old readers are able to decompress new
 data (slower than native code, but still with SIMD acceleration) and read the file. New readers are able to make the best use of
 these encodings with native decompression logic and additional push-down compute functions (which also provides an incentive to upgrade).
-
-## File Determinism and Reproducibility
-
-### Encoding Order Indeterminism
-
-When writing Vortex files, each array segment references its encoding via an integer index into the footer's `array_specs`
-list. During serialization, encodings are registered in the order they are first encountered via calls to
-`ArrayContext::encoding_idx()`. With concurrent writes, this encounter order depends on thread scheduling and lock
-acquisition timing, making the ordering in the footer non-deterministic between runs.
-
-This affects the `encoding` field in each serialized array segment. The same encoding might receive index 0 in one run and
-index 1 in another, changing the integer value stored in each array segment that uses that encoding. FlatBuffers optimize
-storage by omitting fields with default values (such as 0), so when an encoding index is 0, the field may be omitted from
-the serialized representation. This saves approximately 2 bytes per affected array segment, and with alignment adjustments,
-can result in up to 4 bytes difference per array segment between runs.
-
-:::{note}
-Despite this non-determinism, the practical impact is minimal:
-
-- File size may vary by up to 4 bytes per affected array segment
-- All file contents remain semantically identical and fully readable
-- Segment ordering (the actual data layout) remains deterministic and consistent across writes
-:::
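
The removed section describes indices being assigned on first encounter. As a minimal sketch of why that is order-dependent, the snippet below uses a hypothetical `FirstSeenContext` (not the real `ArrayContext`) and illustrative encoding IDs; it only assumes the behaviour described in the text, namely that an encoding receives the next free index the first time it is requested.

```rust
/// Hypothetical stand-in for a context that registers encodings lazily.
/// This is NOT the Vortex `ArrayContext`; it only models first-encounter indexing.
#[derive(Default)]
struct FirstSeenContext {
    ids: Vec<String>,
}

impl FirstSeenContext {
    /// Returns the index of `id`, registering it if it has not been seen yet.
    fn encoding_idx(&mut self, id: &str) -> u16 {
        if let Some(pos) = self.ids.iter().position(|e| e == id) {
            return pos as u16;
        }
        self.ids.push(id.to_string());
        (self.ids.len() - 1) as u16
    }
}

/// Registers encodings in the given encounter order, then returns the index of `target`.
/// The encounter order stands in for whatever order the writer threads happen to run in.
fn index_after(order: &[&str], target: &str) -> u16 {
    let mut ctx = FirstSeenContext::default();
    for id in order {
        ctx.encoding_idx(id);
    }
    ctx.encoding_idx(target)
}
```

Calling `index_after` with the same two IDs in opposite orders yields index 1 in one case and index 0 in the other for the same encoding, which is the run-to-run difference the section attributes to thread scheduling.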

vortex-array/src/context.rs

Lines changed: 9 additions & 0 deletions
@@ -23,6 +23,15 @@ impl<T: Clone + Eq> VTableContext<T> {
         Self(Arc::new(RwLock::new(encodings)))
     }
 
+    pub fn from_registry_sorted(registry: &Registry<T>) -> Self
+    where
+        T: Display,
+    {
+        let mut encodings: Vec<T> = registry.items().collect();
+        encodings.sort_by_key(|a| a.to_string());
+        Self::new(encodings)
+    }
+
     pub fn try_from_registry<'a>(
         registry: &Registry<T>,
         ids: impl IntoIterator<Item = &'a str>,
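
The new constructor sorts registry items by their `Display` output. A standalone sketch of that pattern follows; `EncodingId` and `sorted_by_display` are hypothetical names introduced here for illustration, and the real `Registry<T>` is replaced by any iterator of items.

```rust
use std::fmt::{self, Display};

/// Hypothetical item type; the real registry stores encoding vtables.
#[derive(Clone, Debug, PartialEq, Eq)]
struct EncodingId(&'static str);

impl Display for EncodingId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.0)
    }
}

/// Collects the items and orders them by their `Display` string,
/// so the result does not depend on the iteration order of the source.
fn sorted_by_display<T: Display>(items: impl IntoIterator<Item = T>) -> Vec<T> {
    let mut out: Vec<T> = items.into_iter().collect();
    out.sort_by_key(|item| item.to_string());
    out
}
```

Two inputs containing the same items in different orders produce identical vectors. Note that `sort_by_key` recomputes the key on each comparison; the standard library's `sort_by_cached_key` computes each key once, which may matter if the item list is large, though for a short list of encodings the difference is likely negligible.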

vortex-file/src/writer.rs

Lines changed: 25 additions & 8 deletions
@@ -6,12 +6,25 @@ use std::io::Write;
 use std::sync::Arc;
 use std::sync::atomic::AtomicU64;
 
-use futures::future::{Fuse, LocalBoxFuture, ready};
-use futures::{FutureExt, StreamExt, TryStreamExt, pin_mut, select};
-use vortex_array::iter::{ArrayIterator, ArrayIteratorExt};
-use vortex_array::stats::{PRUNING_STATS, Stat};
-use vortex_array::stream::{ArrayStream, ArrayStreamAdapter, ArrayStreamExt, SendableArrayStream};
-use vortex_array::{ArrayContext, ArrayRef};
+use futures::FutureExt;
+use futures::StreamExt;
+use futures::TryStreamExt;
+use futures::future::Fuse;
+use futures::future::LocalBoxFuture;
+use futures::future::ready;
+use futures::pin_mut;
+use futures::select;
+use vortex_array::ArrayContext;
+use vortex_array::ArrayRef;
+use vortex_array::iter::ArrayIterator;
+use vortex_array::iter::ArrayIteratorExt;
+use vortex_array::ArraySessionExt;
+use vortex_array::stats::PRUNING_STATS;
+use vortex_array::stats::Stat;
+use vortex_array::stream::ArrayStream;
+use vortex_array::stream::ArrayStreamAdapter;
+use vortex_array::stream::ArrayStreamExt;
+use vortex_array::stream::SendableArrayStream;
 use vortex_buffer::ByteBuffer;
 use vortex_dtype::DType;
 use vortex_error::{VortexError, VortexExpect, VortexResult, vortex_bail, vortex_err};
@@ -116,8 +129,12 @@ impl VortexWriteOptions {
         mut write: W,
         stream: SendableArrayStream,
     ) -> VortexResult<WriteSummary> {
-        // Set up a Context to capture the encodings used in the file.
-        let ctx = ArrayContext::empty();
+        // NOTE(os): Setup an array context that already has all known encodings pre-populated.
+        // This is preferred for now over having an empty context here, because only the
+        // serialised array order is deterministic. The serialisation of arrays are done
+        // parallel and with an empty context they can register their encodings to the context
+        // in different order, changing the written bytes from run to run.
+        let ctx = ArrayContext::from_registry_sorted(self.session.arrays().registry());
         let dtype = stream.dtype().clone();
 
         let (mut ptr, eof) = SequenceId::root().split();
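
The writer change swaps an empty context for one that is fully populated before any array is serialised. The sketch below models that idea with a hypothetical `PrePopulatedContext`; it is not the Vortex API, and it assumes (as the code comment in the diff suggests) that every encoding a writer might request is already known up front.

```rust
/// Hypothetical context whose encoding list is fixed before serialisation starts.
/// This is NOT the Vortex `ArrayContext`; it only models the pre-populated approach.
struct PrePopulatedContext {
    ids: Vec<String>,
}

impl PrePopulatedContext {
    /// Builds the context from all known IDs, sorted so that the
    /// resulting indices do not depend on the order they were supplied in.
    fn from_sorted(mut ids: Vec<String>) -> Self {
        ids.sort();
        Self { ids }
    }

    /// Looks up an index without mutating the context, so the order in which
    /// concurrent writers call this cannot change the result.
    fn encoding_idx(&self, id: &str) -> Option<u16> {
        self.ids.iter().position(|e| e == id).map(|pos| pos as u16)
    }
}
```

Because `encoding_idx` takes `&self` and never registers anything, two contexts built from the same set of IDs return the same index for a given encoding regardless of how lookups interleave. An ID that was not pre-registered yields `None` in this sketch; how the real context handles that case is not shown in the diff.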

vortex-python/src/io.rs

Lines changed: 2 additions & 2 deletions
@@ -212,7 +212,7 @@ impl PyVortexWriteOptions {
     /// >>> vx.io.VortexWriteOptions.default().write_path(sprl, "chonky.vortex")
     /// >>> import os
     /// >>> os.path.getsize('chonky.vortex')
-    /// 215196
+    /// 215996
     /// ```
     ///
     /// Wow, Vortex manages to use about two bytes per integer! So advanced. So tiny.
@@ -224,7 +224,7 @@ impl PyVortexWriteOptions {
     /// ```python
     /// >>> vx.io.VortexWriteOptions.compact().write_path(sprl, "tiny.vortex")
     /// >>> os.path.getsize('tiny.vortex')
-    /// 54200
+    /// 55116
     /// ```
     ///
     /// Random numbers are not (usually) composed of random bytes!
