Skip to content

Conversation

@bkirwi
Copy link
Contributor

@bkirwi bkirwi commented Dec 17, 2025

Tease out some unnecessary code from StorageCollections.

  • Remove the "status collection" field. This is no longer needed anywhere; we manage status collections fully in envd now.
  • Remove the "data source" from our collection state. It's Weird that storage collections knew this level of detail in the first place, and there's a fair bit of code that's dedicated to keeping this metadata in sync. We grab the specific values that we care about and drop both the data_source field and all the code that keeps it in sync.

Motivation

It's very hard to make changes to storage collections! Simplifying makes it a bit easier to experiment with details of eg. how primary semantics work.

Breaking this out from my experiments since I think it's valuable cleanup in general and because I don't want it to bitrot as we get into the holidays.

Tips for reviewer

Very fine-grained commits, mostly for my own sanity - but it may help review as well.

@bkirwi bkirwi force-pushed the collections-cleanup branch 5 times, most recently from 8322316 to 47a29da Compare December 18, 2025 17:18
@def-
Copy link
Contributor

def- commented Dec 18, 2025

You'll want to rebase on main to get rid of the cargo test flake.

We report status via the clusterd protocol now, not via side-channel
collections.
Awkward fit for storage collections, but this at least avoids some
refactoring...
Sad to remove asserts, but the point is to no longer have
StorageCollections understand the details of what it's managing.
All the remaining code exists only to keep things in sync, which we
don't need.
@bkirwi bkirwi force-pushed the collections-cleanup branch from 47a29da to 5dee7c8 Compare December 19, 2025 17:01
@bkirwi bkirwi marked this pull request as ready for review December 19, 2025 18:16
@bkirwi bkirwi requested review from a team as code owners December 19, 2025 18:16
@bkirwi bkirwi requested review from aljoscha, ggevay and teskje December 19, 2025 18:16
Copy link
Contributor

@aljoscha aljoscha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, nice cleanup in there!

pub description: CollectionDescription<T>,

time_dependence: Option<TimeDependence>,
ingestion_remap_collection_id: Option<GlobalId>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like we're taking a step backwards, with now adding a field that is specific for collections, while other parts of the PR remove specific knowledge about the collections we're managing

but also fine, I might now know all the context 🤷‍♂️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally. We unfortunately still need it to resolve the remap collection in create_collections_for_bootstrap, though, and without adding this field we wouldn't be able to remove the DataSource

I think the next incremental step is to change the storage controller to pre-resolve this info, so that storage collections doesn't need to know about the remap business at all, and can just track it as an opaque dependency. But that's followup work...

@bkirwi bkirwi merged commit 1c250df into MaterializeInc:main Dec 22, 2025
333 of 335 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants