Introduce the dataset manifest and remove layer information from the partition table #11423
Conversation
Web viewer built successfully.
Note: This comment is updated whenever you push a commit.
Pull Request Overview
This PR introduces the dataset manifest table functionality and removes layer-specific columns from the partition table. The dataset manifest provides layer-level metadata while the partition table now focuses solely on partition-level information.
- Adds new gRPC endpoints for dataset manifest schema and scanning operations
- Refactors partition table structure to remove layer information and add partition metadata
- Implements dataset manifest provider for DataFusion integration
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `rerun_py/src/catalog/dataset_entry.rs` | Adds `manifest()` method to expose the dataset manifest as a DataFusion table |
| `rerun_py/rerun_bindings/rerun_bindings.pyi` | Python type hints for the new `manifest()` method |
| `crates/store/re_server/src/store.rs` | Updates the partition table schema, removing layer columns and adding metadata |
| `crates/store/re_server/src/rerun_cloud.rs` | Implements placeholder gRPC handlers for the dataset manifest endpoints |
| `crates/store/re_redap_client/src/lib.rs` | Adds error variant for dataset manifest schema operations |
| `crates/store/re_redap_client/src/connection_client.rs` | Implements client method for dataset manifest schema fetching |
| `crates/store/re_protos/src/v1alpha1/rerun.cloud.v1alpha1.rs` | Generated protobuf code for the new dataset manifest endpoints |
| `crates/store/re_protos/src/v1alpha1/rerun.cloud.v1alpha1.ext.rs` | Schema definitions and helper methods for dataset manifest responses |
| `crates/store/re_protos/proto/rerun/v1alpha1/cloud.proto` | Protocol buffer definitions for the dataset manifest endpoints |
| `crates/store/re_datafusion/src/partition_table.rs` | Adds TODO comment for deduplication |
| `crates/store/re_datafusion/src/lib.rs` | Exports the new `DatasetManifestProvider` |
| `crates/store/re_datafusion/src/dataset_manifest.rs` | Implements `DatasetManifestProvider` for DataFusion integration |
Co-authored-by: Copilot <[email protected]>
Latest documentation preview deployed successfully.
Note: This comment is updated whenever you push a commit.
```protobuf
// Returns the schema of the dataset manifest.
//
// To inspect the data of the dataset manifest, which is guaranteed to match the schema returned by
// this endpoint, check out `ScanDatasetManifest`.
//
// This endpoint requires the standard dataset headers.
rpc GetDatasetManifestSchema(GetDatasetManifestSchemaRequest) returns (GetDatasetManifestSchemaResponse) {}

// Inspect the contents of the dataset manifest.
//
// The data will follow the schema returned by `GetDatasetManifestSchema`.
//
// This endpoint requires the standard dataset headers.
rpc ScanDatasetManifest(ScanDatasetManifestRequest) returns (stream ScanDatasetManifestResponse) {}
```
This is the main point of this PR.
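For orientation, here is a rough sketch of how a client might drive these two RPCs once the service is compiled with tonic. Everything below is an assumption based on tonic/prost codegen conventions, not code from this PR: the client type name (`RerunCloudServiceClient`), the request types' `Default` impls, and the imports are guesses, and the standard dataset headers are elided.

```rust
use tonic::Request;

// Hypothetical client-side sketch, not from this PR. The generated client and request
// types are assumed to be in scope; the standard dataset headers are omitted for brevity.
async fn dump_dataset_manifest(
    client: &mut RerunCloudServiceClient<tonic::transport::Channel>,
) -> Result<(), tonic::Status> {
    // 1. Fetch the schema that the scan below is guaranteed to follow.
    let _schema = client
        .get_dataset_manifest_schema(Request::new(GetDatasetManifestSchemaRequest::default()))
        .await?
        .into_inner();

    // 2. Stream the manifest contents; every message follows the schema above.
    let mut stream = client
        .scan_dataset_manifest(Request::new(ScanDatasetManifestRequest::default()))
        .await?
        .into_inner();

    while let Some(chunk) = stream.message().await? {
        // ... decode/inspect the dataframe payload in `chunk` ...
        let _ = chunk;
    }
    Ok(())
}
```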
```rust
impl ScanPartitionTableResponse {
    pub const PARTITION_ID: &str = "rerun_partition_id";
    pub const PARTITION_TYPE: &str = "rerun_partition_type";

    /// Layer names for this partition, one per layer.
    ///
    /// Should have the same length as [`Self::STORAGE_URLS`].
    pub const LAYER_NAMES: &str = "rerun_layer_names";

    /// Storage URLs for this partition, one per layer.
    ///
    /// Should have the same length as [`Self::LAYER_NAMES`].
    pub const STORAGE_URLS: &str = "rerun_storage_urls";
    pub const LAST_UPDATED_AT: &str = "rerun_last_updated_at";

    /// Total number of chunks for this partition.
    pub const NUM_CHUNKS: &str = "rerun_num_chunks";

    /// Total size in bytes for this partition.
    pub const SIZE_BYTES: &str = "rerun_size_bytes";

    pub fn layer_names_inner_field() -> FieldRef {
        Arc::new(Field::new(Self::LAYER_NAMES, DataType::Utf8, false))
    }

    pub fn storage_urls_inner_field() -> FieldRef {
        Arc::new(Field::new(Self::STORAGE_URLS, DataType::Utf8, false))
    }

    // NOTE: changing this method is a breaking change for implementations (aka it at least breaks
    // tests in `dataplatform`)
    pub fn fields() -> Vec<Field> {
        vec![
            Field::new(Self::PARTITION_ID, DataType::Utf8, false),
            Field::new(
                Self::LAYER_NAMES,
                DataType::List(Self::layer_names_inner_field()),
                false,
            ),
            Field::new(
                Self::STORAGE_URLS,
                DataType::List(Self::storage_urls_inner_field()),
                false,
            ),
            Field::new(
                Self::LAST_UPDATED_AT,
                DataType::Timestamp(TimeUnit::Nanosecond, None),
                false,
            ),
            Field::new(Self::NUM_CHUNKS, DataType::UInt64, false),
            Field::new(Self::SIZE_BYTES, DataType::UInt64, false),
        ]
    }

    pub fn schema() -> Schema {
        Schema::new(Self::fields())
    }

    /// Helper to simplify instantiation of the dataframe in [`Self::data`].
    pub fn create_dataframe(
        partition_ids: Vec<String>,
        layer_names: Vec<Vec<String>>,
        storage_urls: Vec<Vec<String>>,
        last_updated_at: Vec<i64>,
        num_chunks: Vec<u64>,
        size_bytes: Vec<u64>,
    ) -> arrow::error::Result<RecordBatch> {
        let row_count = partition_ids.len();
        let schema = Arc::new(Self::schema());

        let mut layer_names_builder =
            ListBuilder::new(StringBuilder::new()).with_field(Self::layer_names_inner_field());

        for mut inner_vec in layer_names {
            for layer_name in inner_vec.drain(..) {
                layer_names_builder.values().append_value(layer_name);
            }
            layer_names_builder.append(true);
        }

        let mut urls_builder =
            ListBuilder::new(StringBuilder::new()).with_field(Self::storage_urls_inner_field());

        for mut inner_vec in storage_urls {
            for storage_url in inner_vec.drain(..) {
                urls_builder.values().append_value(storage_url);
            }
            urls_builder.append(true);
        }

        let columns: Vec<ArrayRef> = vec![
            Arc::new(StringArray::from(partition_ids)),
            Arc::new(layer_names_builder.finish()),
            Arc::new(urls_builder.finish()),
            Arc::new(TimestampNanosecondArray::from(last_updated_at)),
            Arc::new(UInt64Array::from(num_chunks)),
            Arc::new(UInt64Array::from(size_bytes)),
        ];

        RecordBatch::try_new_with_options(
            schema,
            columns,
            &RecordBatchOptions::default().with_row_count(Some(row_count)),
        )
    }

    pub fn data(&self) -> Result<&DataframePart, TypeConversionError> {
        Ok(self
            .data
            .as_ref()
            .ok_or_else(|| missing_field!(Self, "data"))?)
    }
}
```
This is the secondary point of this PR, aka trying to compensate for the lack of a dataframe-level spec in *.proto and to harden all implementations against drift and mismatches. This only works if all codebases use this as the ground truth (which this pair of PRs attempts to do).
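To make that concrete, here is a hypothetical usage sketch of the helper above. The partition ID, layer names, URLs, timestamp, and sizes are made up for illustration, and `ScanPartitionTableResponse` is assumed to be in scope (e.g. from `re_protos`).

```rust
// Hypothetical usage sketch, not from this PR: building a single-row partition-table
// batch with `ScanPartitionTableResponse::create_dataframe`. All values are made up.
fn example_partition_batch() -> arrow::error::Result<arrow::record_batch::RecordBatch> {
    ScanPartitionTableResponse::create_dataframe(
        vec!["part_0".to_owned()],                               // partition IDs
        vec![vec!["base".to_owned(), "annotations".to_owned()]], // layer names, one list per partition
        vec![vec![
            "s3://bucket/part_0/base.rrd".to_owned(),            // storage URLs, same length as layer names
            "s3://bucket/part_0/annotations.rrd".to_owned(),
        ]],
        vec![1_700_000_000_000_000_000],                         // last updated at (ns since epoch)
        vec![42],                                                // total number of chunks
        vec![1_024],                                             // total size in bytes
    )
}
```

The resulting `RecordBatch` matches `ScanPartitionTableResponse::schema()` by construction, which is exactly the drift protection described above.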
Looking good 👍
Mostly had things to say regarding naming conventions... do with that what you will 🤷
```rust
// --- ScanDatasetManifestResponse ---

impl ScanDatasetManifestResponse {
    pub const LAYER_NAME: &str = "rerun_layer_name";
```
Similarly, I've been trying (and still failing) to make sure that all field constants are named `const FIELD_XXX`, and that they are named exactly the same as the string they represent, so they A) can be lexically grepped for and B) can have their constant name guessed just by looking at any dataframe in the wild. 🤷
I'm gonna take the middle route here. I like the `FIELD_` prefix, but I'll drop the `RERUN_` part. It's noisy, and not having it still allows for the stated goals of this naming scheme.
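A minimal sketch of that middle-route convention, assuming the constant keeps the `FIELD_` prefix while the column string keeps its `rerun_` prefix. The struct name below is hypothetical, purely to keep the example self-contained; it is not the final code from this PR.

```rust
// Illustrative sketch only, not the final code from this PR.
pub struct ScanDatasetManifestResponseSketch;

impl ScanDatasetManifestResponseSketch {
    /// The constant name mirrors the column name it holds (minus the `rerun_` prefix),
    /// so both the constant and the column can be found with the same grep.
    pub const FIELD_LAYER_NAME: &str = "rerun_layer_name";
}
```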
Co-authored-by: Clement Rey <[email protected]>
```
# Conflicts:
#	crates/store/re_datafusion/src/dataframe_query_common.rs
#	crates/store/re_protos/src/v1alpha1/rerun.cloud.v1alpha1.ext.rs
#	crates/store/re_redap_client/src/connection_client.rs
#	crates/store/re_server/src/rerun_cloud.rs
#	rerun_py/src/catalog/dataset_entry.rs
```
Related
What
Introduces gRPC endpoints and an associated SDK method to access the dataset manifest table, which contains one row per layer. Also removes most layer-related columns from the partition table.
This PR also attempts to solidify the notion that `Scan{PartitionTable|DatasetManifest}Response` is the One True Source(tm) of information on the returned dataframe's schema; a small consumer-side sketch of that idea follows the future-work list below.

Future work:
- TODO: `ext` utilities
- `DatasetManifest` (from `LayerTable`)
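As referenced above, here is a minimal consumer-side sketch of the "one true source" idea, assuming `ScanPartitionTableResponse` is in scope from `re_protos`; the validation helper itself is hypothetical, not part of this PR.

```rust
use arrow::datatypes::Schema;
use arrow::record_batch::RecordBatch;

// Hypothetical drift check, not from this PR: compare a received dataframe against the
// schema advertised by the response type, which both server and client treat as ground truth.
fn check_partition_table_schema(batch: &RecordBatch) -> Result<(), String> {
    let expected: Schema = ScanPartitionTableResponse::schema();
    if batch.schema().as_ref() != &expected {
        return Err(format!(
            "partition table dataframe drifted from the advertised schema: expected {expected:?}, got {:?}",
            batch.schema()
        ));
    }
    Ok(())
}
```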