storage: allow singleton sources on multi-replica clusters #31890
Conversation
Force-pushed from 5f9de72 to 804de79
This hides the implementation detail and will make it easier to change how we track the data inside Instance, which I'll do in one of the next commits. We have to use a `Box<dyn ...>` because the trait needs to be object safe, that is, dyn-compatible.
Force-pushed from 804de79 to 9413c53
@bkirwi & @petrosagg If we're happy with this approach, we can also use it for sinks on multi-replica clusters.
```rust
#[derive(Debug)]
struct ActiveIngestion {
    /// Whether the ingestion prefers running on a single replica.
    prefers_single_replica: bool,
```
Is this a property of an active ingestion? I would expect this information to be found in the collection state for the given GlobalId. Having it here implies that it can potentially disagree with the ingestion, right?
It's a property of the `IngestionDescription`, and yeah, this is a cache of that, so that I don't have to re-derive it from the history. If you prefer, I can change it so we get it from the history instead. I don't think this is performance-sensitive code.
```rust
/// Replays commands to the specified replica.
pub fn replay_commands(&mut self, replica_id: ReplicaId) {
    let commands = self.history.iter().cloned().collect::<Vec<_>>();
```
Haven't tried it but it feels like the collect and the into_iter below can be deleted.
```rust
let filtered_commands = commands
    .into_iter()
    .map(|command| match command.clone() {
```
Aren't the commands already owned since you called cloned above?
```rust
    .map(|command| match command.clone() {
        StorageCommand::RunIngestions(mut cmds) => {
            cmds.retain(|cmd| self.is_active_replica(&cmd.id, &replica_id));
            StorageCommand::RunIngestions(cmds)
```
This may end up with no commands. You could switch the iterator combinator to `filter_map` and return `None` in that case, to avoid sending the command at all.
```rust
        }
        command => command,
    })
    .collect::<Vec<_>>();
```
This also seems like it can be deleted (also didn't try it).
This one is getting around a borrowing problem: the iterator wants to access `self`, and the call for getting the replica does as well.
```rust
    .description
    .desc
    .connection
    .prefers_single_replica();
```
I see, so this method already exists there. Is there a reason we can't use this instead of de-normalizing it in the ActiveIngestion struct?
By "already exist", you mean it's added in this PR? 😅
But yes, see my answer up here: https://github.com/MaterializeInc/materialize/pull/31890/files/9413c5366af22285e88e7c7ad279fcfcf1a20d65#r1999229940
```rust
/// we never change the scheduling decision for single-replica ingestions
/// unless we have to, that is unless the replica that they are running on
/// goes away. We do this, so that we don't send a mix of "run"/"allow
/// compaction"/"run" messages to replicas, which wouldn't deal well with
```
How is this avoided? I would expect that if I create a replica A and then create and drop replica B then replica A will receive run, allow compaction, run, allow compaction for that ingestion while we switch back and forth.
In that case replica A would receive a RunIngestion and then nothing more. From above:
Crucially, we never change the scheduling decision for single-replica ingestions unless we have to, that is unless the replica that they are running on goes away. We do this, so that we don't send a mix of "run"/"allow compaction"/"run" messages to replicas
I can highlight more that this is the important property we uphold.
Does that mean that if I create a replica and then drop it my pg sources will never get scheduled again?
No, in that case it would be scheduled on another replica.
Here's the docstring, but slightly massaged to highlight the important property. I'll change the code to that.
```rust
/// An important property of this scheduling algorithm is that we never
/// change the scheduling decision for single-replica ingestions unless we
/// have to, that is unless the replica that they are running on goes away.
/// We do this, so that we don't send a mix of "run"/"allow
/// compaction"/"run" messages to replicas, which wouldn't deal well with
/// this. When we _do_ have to make a scheduling decision we schedule a
/// single-replica ingestion on the last replica.
```
Does this help?
I want to change the last part (along with the implementation) to this, though:
```rust
/// this. When we _do_ have to make a scheduling decision we schedule a
/// single-replica ingestion on the first replica, according to the sort
/// order of `ReplicaId`. We do this latter so that the scheduling decision
/// is stable across restarts of `environmentd`/the controller.
```
```rust
pub fn get_ingestion_description(
    &self,
    id: &GlobalId,
) -> Option<mz_storage_types::sources::IngestionDescription<CollectionMetadata>> {
```
nit: maybe import `IngestionDescription` at the top to avoid the long path
```rust
if let Some(ingestion_id) = self.ingestion_exports.get(id) {
    // Right now, only ingestions can have per-replica scheduling
    // decisions.
    if let Some(ingestion) = self.active_ingestions.get(ingestion_id) {
```
nit: feels cleaner to make this a `match` instead of an `if let` with an early return, to avoid falling through
In general, the more similar the replicas of a particular cluster are, the happier I am. For the Kafka source, where we only filter at the end, things feel very similar and I am very happy. For these other sources, where we end up sending different command streams to different replicas, I am more nervous! I think it probably doesn't make sense to revisit the design here, especially given the timing of everything, but I'd be interested in exploring a different approach for the sink. (e.g. having the controller pick a leader, but only doing the filtering of the Kafka writes "at the last minute" in the sink.)
Force-pushed from 9413c53 to 25feff9
Agreed on all those accounts! This design was chosen because it was determined to be quite hard for the sources to share their replication slot. I'd be very happy if we can come up with a design that allows us to run the same dataflows on all replicas. Also, though, this thing here is not at all a trap door decision, we can easily change things in future releases.
Force-pushed from 25feff9 to 731e1e3
Force-pushed from 731e1e3 to 689fdd8
@petrosagg I addressed your easy comments, and the question around scheduling is now resolved. Could you take another look? If you like, I'll also remove the cached
(Just TBC - once Petros is satisfied I am also satisfied!)
petrosagg left a comment: Looks great! Thanks for explaining the scheduling thing in the call
This PR extends the approach of MaterializeInc#31890 to the case of sinks. The handling is identical to that of single replica sources. Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
Implements https://github.com/MaterializeInc/database-issues/issues/9079
Per the design in https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/20250127_multi_replica_scheduling_singleton_sources.md
There's now an issue around tracking "active replicas" in the storage controller, for the purposes of tracking `DroppedId` messages. I have a good idea how to fix it but wanted to get this out for review now.