deterministic encoding ordering (#5770)

onursatici · web-flow · commit 4765616e4d25 · 2025-12-18T11:28:25.000Z
Pre-register all encodings known by the session, so they get the same
encoding index from run to run. Layout encoding indices don't suffer
from this because we assign them after we have the full layout tree, which
we traverse depth-first in a single thread to get the layout indices.

Signed-off-by: Onur Satici <onur@spiraldb.com>
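
To illustrate the idea in the message above, here is a minimal standalone sketch; it is not the Vortex implementation. `SketchContext`, `from_sorted`, and the encoding id strings are hypothetical names chosen for illustration. Only `encoding_idx` mirrors the name of a method mentioned in this change, and its signature here is an assumption. The sketch shows why first-encounter registration makes indices depend on lookup order, and why pre-populating a sorted list removes that dependency.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for an encoding context: maps encoding ids to
/// integer indices. Not the real `ArrayContext`.
#[derive(Clone, Default)]
pub struct SketchContext(Arc<RwLock<Vec<String>>>);

impl SketchContext {
    /// Pre-populate the context with every known id in sorted order, so
    /// the index of each id no longer depends on who asks first.
    pub fn from_sorted<I, S>(ids: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        let mut ids: Vec<String> = ids.into_iter().map(Into::into).collect();
        ids.sort();
        ids.dedup();
        Self(Arc::new(RwLock::new(ids)))
    }

    /// Return the index of `id`, registering it on first encounter.
    /// With an empty context, concurrent callers race for the next slot,
    /// which is the source of the run-to-run differences.
    pub fn encoding_idx(&self, id: &str) -> usize {
        let mut ids = self.0.write().unwrap();
        if let Some(i) = ids.iter().position(|e| e == id) {
            return i;
        }
        ids.push(id.to_string());
        ids.len() - 1
    }
}
```

Starting from `SketchContext::default()`, whichever id is looked up first gets index 0, so two runs with different thread interleavings can disagree. Starting from `SketchContext::from_sorted(...)`, every lookup of a known id is a pure read of a fixed position. The trade-off is that the context lists every known encoding, not only those used, which is consistent with the slightly larger file sizes in the doctest updates below.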
diff --git a/docs/specs/file-format.md b/docs/specs/file-format.md
@@ -117,26 +117,3 @@ The plan is that at write-time, a minimum supported reader version is declared.
 reader version can then be embedded into the file with WebAssembly decompression logic. Old readers are able to decompress new
 data (slower than native code, but still with SIMD acceleration) and read the file. New readers are able to make the best use of
 these encodings with native decompression logic and additional push-down compute functions (which also provides an incentive to upgrade).
-
-## File Determinism and Reproducibility
-
-### Encoding Order Indeterminism
-
-When writing Vortex files, each array segment references its encoding via an integer index into the footer's `array_specs`
-list. During serialization, encodings are registered in the order they are first encountered via calls to
-`ArrayContext::encoding_idx()`. With concurrent writes, this encounter order depends on thread scheduling and lock
-acquisition timing, making the ordering in the footer non-deterministic between runs.
-
-This affects the `encoding` field in each serialized array segment. The same encoding might receive index 0 in one run and
-index 1 in another, changing the integer value stored in each array segment that uses that encoding. FlatBuffers optimize
-storage by omitting fields with default values (such as 0), so when an encoding index is 0, the field may be omitted from
-the serialized representation. This saves approximately 2 bytes per affected array segment, and with alignment adjustments,
-can result in up to 4 bytes difference per array segment between runs.
-
-:::{note}
-Despite this non-determinism, the practical impact is minimal:
-
-- File size may vary by up to 4 bytes per affected array segment
-- All file contents remain semantically identical and fully readable
-- Segment ordering (the actual data layout) remains deterministic and consistent across writes
-:::
diff --git a/vortex-array/src/context.rs b/vortex-array/src/context.rs
@@ -26,6 +26,15 @@ impl<T: Clone + Eq> VTableContext<T> {
         Self(Arc::new(RwLock::new(encodings)))
     }
 
+    pub fn from_registry_sorted(registry: &Registry<T>) -> Self
+    where
+        T: Display,
+    {
+        let mut encodings: Vec<T> = registry.items().collect();
+        encodings.sort_by_key(|a| a.to_string());
+        Self::new(encodings)
+    }
+
     pub fn try_from_registry<'a>(
         registry: &Registry<T>,
         ids: impl IntoIterator<Item = &'a str>,
diff --git a/vortex-file/src/writer.rs b/vortex-file/src/writer.rs
@@ -19,6 +19,7 @@ use vortex_array::ArrayRef;
 use vortex_array::expr::stats::Stat;
 use vortex_array::iter::ArrayIterator;
 use vortex_array::iter::ArrayIteratorExt;
+use vortex_array::session::ArraySessionExt;
 use vortex_array::stats::PRUNING_STATS;
 use vortex_array::stream::ArrayStream;
 use vortex_array::stream::ArrayStreamAdapter;
@@ -138,8 +139,12 @@ impl VortexWriteOptions {
         mut write: W,
         stream: SendableArrayStream,
     ) -> VortexResult<WriteSummary> {
-        // Set up a Context to capture the encodings used in the file.
-        let ctx = ArrayContext::empty();
+        // NOTE(os): Set up an array context with all known encodings pre-populated.
+        // This is preferred for now over starting from an empty context, because only the
+        // serialised array order is deterministic. Arrays are serialised in parallel, and
+        // with an empty context they can register their encodings in a different order on
+        // each run, changing the written bytes from run to run.
+        let ctx = ArrayContext::from_registry_sorted(self.session.arrays().registry());
         let dtype = stream.dtype().clone();
 
         let (mut ptr, eof) = SequenceId::root().split();
diff --git a/vortex-python/src/io.rs b/vortex-python/src/io.rs
@@ -222,7 +222,7 @@ impl PyVortexWriteOptions {
     /// >>> vx.io.VortexWriteOptions.default().write_path(sprl, "chonky.vortex")
     /// >>> import os
     /// >>> os.path.getsize('chonky.vortex')
-    /// 215196
+    /// 215996
     /// ```
     ///
     /// Wow, Vortex manages to use about two bytes per integer! So advanced. So tiny.
@@ -234,7 +234,7 @@ impl PyVortexWriteOptions {
     /// ```python
     /// >>> vx.io.VortexWriteOptions.compact().write_path(sprl, "tiny.vortex")
     /// >>> os.path.getsize('tiny.vortex')
-    /// 54200
+    /// 55116
     /// ```
     ///
     /// Random numbers are not (usually) composed of random bytes!