storage: allow singleton sources on multi-replica clusters #31890

Merged
aljoscha merged 3 commits into MaterializeInc:main from
aljoscha:storage-allow-singleton-sources-on-multi-replica-clusters
Mar 19, 2025

Conversation

@aljoscha
Contributor

@aljoscha aljoscha commented Mar 15, 2025

Implements https://github.com/MaterializeInc/database-issues/issues/9079

Per the design in https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/20250127_multi_replica_scheduling_singleton_sources.md

There's currently an issue around tracking "active replicas" in the storage controller for the purpose of tracking DroppedId messages. I have a good idea of how to fix it, but wanted to get this out for review now.

@aljoscha aljoscha force-pushed the storage-allow-singleton-sources-on-multi-replica-clusters branch 3 times, most recently from 5f9de72 to 804de79 on March 17, 2025 09:23
This hides the implementation detail and will make it easier to change
how we track the data inside Instance, which I'll do in one of the next
commits.

We have to use a `Box<dyn ...>` because the trait needs to be object
safe, that is, dyn-compatible.
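As a rough illustration of the constraint the commit message mentions (the trait and struct names below are made up for this sketch, not types from this PR): a trait can only live behind a `Box<dyn ...>` if it is dyn-compatible, i.e. it has no generic methods and no methods returning `Self`.

```rust
// Hypothetical sketch of the `Box<dyn ...>` constraint; `CommandSink` and
// `VecSink` are illustrative names, not types from this PR.
trait CommandSink {
    // Only non-generic methods on `&self`/`&mut self` with concrete types:
    // this keeps the trait dyn-compatible (object safe).
    fn send(&mut self, command: String);
    fn len(&self) -> usize;
}

struct VecSink {
    commands: Vec<String>,
}

impl CommandSink for VecSink {
    fn send(&mut self, command: String) {
        self.commands.push(command);
    }
    fn len(&self) -> usize {
        self.commands.len()
    }
}

fn demo() -> usize {
    // The trait object hides the concrete implementation, which is the
    // point of the refactoring described in the commit message.
    let mut sink: Box<dyn CommandSink> = Box::new(VecSink { commands: Vec::new() });
    sink.send("RunIngestions".to_string());
    sink.len()
}

fn main() {
    assert_eq!(demo(), 1);
}
```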
@aljoscha aljoscha force-pushed the storage-allow-singleton-sources-on-multi-replica-clusters branch from 804de79 to 9413c53 on March 17, 2025 09:36
@aljoscha aljoscha marked this pull request as ready for review March 17, 2025 09:37
@aljoscha aljoscha requested review from a team as code owners March 17, 2025 09:37
@aljoscha
Contributor Author

@bkirwi & @petrosagg If we're happy with this approach, we can also use it for sinks on multi-replica clusters.

#[derive(Debug)]
struct ActiveIngestion {
    /// Whether the ingestion prefers running on a single replica.
    prefers_single_replica: bool,
Contributor

Is this a property of an active ingestion? I would expect this information to be found in the collection state for the given GlobalId. Having it here implies that it can potentially disagree with the ingestion, right?

Contributor Author

It's a property of the IngestionDescription, and yeah, this is a cache of that, so that I don't have to re-derive it from the history. If you prefer that, I can also change it so we get it from the history instead. I don't think this is performance-sensitive code.


/// Replays commands to the specified replica.
pub fn replay_commands(&mut self, replica_id: ReplicaId) {
    let commands = self.history.iter().cloned().collect::<Vec<_>>();
Contributor

Haven't tried it but it feels like the collect and the into_iter below can be deleted.


let filtered_commands = commands
    .into_iter()
    .map(|command| match command.clone() {
Contributor

Aren't the commands already owned since you called cloned above?

.map(|command| match command.clone() {
    StorageCommand::RunIngestions(mut cmds) => {
        cmds.retain(|cmd| self.is_active_replica(&cmd.id, &replica_id));
        StorageCommand::RunIngestions(cmds)
Contributor

This may end up with no commands, you could switch the iterator combinator to filter_map and return None in this case to avoid sending the command

    }
    command => command,
})
.collect::<Vec<_>>();
Contributor

This also seems like it can be deleted (though I didn't try it either).

Contributor Author

This one is getting around a borrowing problem: the iterator wants to access `self`, and the call for getting the replica does as well.
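Taken together, the suggestions in this thread (dropping the intermediate `collect`/`into_iter`, avoiding the extra `clone`, and switching to `filter_map` so empty batches are never sent) can be sketched with simplified stand-in types. `Command` below is hypothetical, not the real `StorageCommand`, and as a free function there is no `self` borrow, so the intermediate `collect` is unnecessary here.

```rust
// Simplified stand-in for `StorageCommand`; the names are hypothetical,
// not the real Materialize types.
#[derive(Clone, Debug, PartialEq)]
enum Command {
    RunIngestions(Vec<u64>),
    Other,
}

fn filter_history(history: &[Command], active: &[u64]) -> Vec<Command> {
    history
        .iter()
        .cloned() // commands are owned from here on; no extra `clone` needed
        .filter_map(|command| match command {
            Command::RunIngestions(mut ids) => {
                ids.retain(|id| active.contains(id));
                // Skip the command entirely when nothing is left to run,
                // instead of sending an empty batch.
                (!ids.is_empty()).then_some(Command::RunIngestions(ids))
            }
            other => Some(other),
        })
        .collect()
}

fn main() {
    let history = vec![
        Command::RunIngestions(vec![1, 2]),
        Command::RunIngestions(vec![3]),
        Command::Other,
    ];
    // Only ingestion 1 survives for this replica; the now-empty batch for
    // ingestion 3 is dropped rather than sent.
    assert_eq!(
        filter_history(&history, &[1]),
        vec![Command::RunIngestions(vec![1]), Command::Other]
    );
}
```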

.description
.desc
.connection
.prefers_single_replica();
Contributor

I see, so this method already exists there. Is there a reason we can't use this instead of de-normalizing it in the ActiveIngestion struct?

Contributor Author

By "already exist", you mean it's added in this PR? 😅

But yes, see my answer up here: https://github.com/MaterializeInc/materialize/pull/31890/files/9413c5366af22285e88e7c7ad279fcfcf1a20d65#r1999229940

/// we never change the scheduling decision for single-replica ingestions
/// unless we have to, that is unless the replica that they are running on
/// goes away. We do this, so that we don't send a mix of "run"/"allow
/// compaction"/"run" messages to replicas, which wouldn't deal well with
Contributor

How is this avoided? I would expect that if I create a replica A and then create and drop replica B then replica A will receive run, allow compaction, run, allow compaction for that ingestion while we switch back and forth.

Contributor Author

In that case replica A would receive a RunIngestion and then nothing more. From above:

Crucially, we never change the scheduling decision for single-replica ingestions unless we have to, that is unless the replica that they are running on goes away. We do this, so that we don't send a mix of "run"/"allow compaction"/"run" messages to replicas

I can highlight more that this is the important property we uphold.

Contributor

Does that mean that if I create a replica and then drop it my pg sources will never get scheduled again?

Contributor Author

No, in that case it would be scheduled on another replica.

Here's the docstring, but slightly massaged to highlight the important property. I'll change the code to that.

    /// An important property of this scheduling algorithm is that we never
    /// change the scheduling decision for single-replica ingestions unless we
    /// have to, that is unless the replica that they are running on goes away.
    /// We do this, so that we don't send a mix of "run"/"allow
    /// compaction"/"run" messages to replicas, which wouldn't deal well with
    /// this. When we _do_ have to make a scheduling decision we schedule a
    /// single-replica ingestion on the last replica.

Does this help?

I want to change the last part (along with the implementation) to this, though:

    /// this. When we _do_ have to make a scheduling decision we schedule a
    /// single-replica ingestion on the first replica, according to the sort
    /// order of `ReplicaId`. We do this latter so that the scheduling decision
    /// is stable across restarts of `environmentd`/the controller.
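The policy in that docstring can be sketched as follows, with `u64` standing in for `ReplicaId`. This is an illustration of the described behavior under those assumptions, not the actual controller code.

```rust
use std::collections::BTreeSet;

// Sketch of the scheduling policy described above; `u64` stands in for
// `ReplicaId`.
fn schedule_singleton(replicas: &BTreeSet<u64>, current: Option<u64>) -> Option<u64> {
    match current {
        // Never move a running single-replica ingestion while its replica
        // still exists, so replicas never see a "run"/"allow
        // compaction"/"run" sequence for it.
        Some(replica) if replicas.contains(&replica) => Some(replica),
        // Otherwise pick the first replica in ID sort order, so the
        // decision is stable across restarts of the controller.
        _ => replicas.iter().min().copied(),
    }
}

fn main() {
    let replicas: BTreeSet<u64> = BTreeSet::from([5, 2, 9]);
    // Fresh scheduling picks the smallest replica ID.
    assert_eq!(schedule_singleton(&replicas, None), Some(2));
    // An existing assignment is sticky, even if a smaller ID exists.
    assert_eq!(schedule_singleton(&replicas, Some(5)), Some(5));
    // If the current replica went away, we reschedule deterministically.
    assert_eq!(schedule_singleton(&replicas, Some(7)), Some(2));
}
```

Keeping the choice a pure function of the replica set and the current assignment is what makes the decision reproducible after an `environmentd` restart.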

pub fn get_ingestion_description(
    &self,
    id: &GlobalId,
) -> Option<mz_storage_types::sources::IngestionDescription<CollectionMetadata>> {
Contributor

nit: maybe import IngestionDescription at the top to avoid the long path

if let Some(ingestion_id) = self.ingestion_exports.get(id) {
    // Right now, only ingestions can have per-replica scheduling
    // decisions.
    if let Some(ingestion) = self.active_ingestions.get(ingestion_id) {
Contributor

nit: feels cleaner to make this a match instead of an if let with an early return to avoid falling through
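The nit can be sketched like this, with simplified stand-in types (an `Instance` with two `BTreeMap`s; none of these are the actual storage-controller fields): every case is handled by an explicit `match` arm, so nothing falls through.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch of the reviewer's nit: `match` instead of nested
// `if let`s with fall-through. The types are simplified stand-ins.
struct Instance {
    ingestion_exports: BTreeMap<u64, u64>,
    active_ingestions: BTreeMap<u64, String>,
}

impl Instance {
    fn ingestion_for_export(&self, id: u64) -> Option<&String> {
        // Both outcomes are explicit arms; there is no early return to
        // fall past.
        match self.ingestion_exports.get(&id) {
            Some(ingestion_id) => self.active_ingestions.get(ingestion_id),
            None => None,
        }
    }
}

fn demo() -> (Option<String>, Option<String>) {
    let instance = Instance {
        ingestion_exports: BTreeMap::from([(1, 10)]),
        active_ingestions: BTreeMap::from([(10, "pg-source".to_string())]),
    };
    (
        instance.ingestion_for_export(1).cloned(),
        instance.ingestion_for_export(2).cloned(),
    )
}

fn main() {
    assert_eq!(demo(), (Some("pg-source".to_string()), None));
}
```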

@bkirwi
Contributor

bkirwi commented Mar 17, 2025

In general, the more similar the replicas of a particular cluster are, the happier I am. For the Kafka source, where we only filter at the end, things feel very similar and I am very happy. For these other sources, where we end up sending different command streams to different replicas, I am more nervous!

I think it probably doesn't make sense to revisit the design here, especially given the timing of everything, but I'd be interested in exploring a different approach for the sink. (eg. having the controller pick a leader, but only doing the filtering of the Kafka writes "at the last minute" in the sink.)

@aljoscha aljoscha force-pushed the storage-allow-singleton-sources-on-multi-replica-clusters branch from 9413c53 to 25feff9 on March 18, 2025 11:59
@aljoscha
Contributor Author

In general, the more similar the replicas of a particular cluster are, the happier I am. For the Kafka source, where we only filter at the end, things feel very similar and I am very happy. For these other sources, where we end up sending different command streams to different replicas, I am more nervous!

I think it probably doesn't make sense to revisit the design here, especially given the timing of everything, but I'd be interested in exploring a different approach for the sink. (eg. having the controller pick a leader, but only doing the filtering of the Kafka writes "at the last minute" in the sink.)

Agreed on all accounts! This design was chosen because it was determined to be quite hard for the sources to share their replication slot. I'd be very happy if we can come up with a design that allows us to run the same dataflows on all replicas. Also, this is not at all a trap-door decision; we can easily change things in future releases.

@aljoscha aljoscha force-pushed the storage-allow-singleton-sources-on-multi-replica-clusters branch from 25feff9 to 731e1e3 on March 18, 2025 14:52
@aljoscha aljoscha force-pushed the storage-allow-singleton-sources-on-multi-replica-clusters branch from 731e1e3 to 689fdd8 on March 18, 2025 16:37
@aljoscha
Contributor Author

@petrosagg I addressed your easy comments, and the question around scheduling is now resolved. Could you take another look? If you like, I'll also remove the cached prefers_single_replica field on ActiveIngestion and recompute that from the history when needed.

@bkirwi
Contributor

bkirwi commented Mar 18, 2025

(Just TBC - once Petros is satisfied I am also satisfied!)

Contributor

@petrosagg petrosagg left a comment


Looks great! Thanks for explaining the scheduling thing in the call

@aljoscha aljoscha merged commit 7bb7b88 into MaterializeInc:main Mar 19, 2025
82 checks passed
@aljoscha aljoscha deleted the storage-allow-singleton-sources-on-multi-replica-clusters branch March 19, 2025 08:58
petrosagg added a commit to petrosagg/materialize that referenced this pull request Apr 1, 2025
This PR extends the approach of
MaterializeInc#31890 to the case of
sinks. The handling is identical to that of single replica sources.

Signed-off-by: Petros Angelatos <petrosagg@gmail.com>