REP-5219 Make migration-verifier process change events in batch. #39

tdq45gj · 2024-11-07T16:38:39Z

This makes the change stream reader:

read change events until RemainingBatchSize == 0
buffer all events of a batch in memory
batch-insert recheck docs

It also fixes a bug in which the eventRecorder counted each event twice.

FGasper

Generally this looks good, but I wonder if we can’t simplify it some.

Also the dupe-key handling seems like a point of potential concern. How necessary is that?

FGasper · 2024-11-08T16:15:13Z

internal/verifier/change_stream.go

 // HandleChangeStreamEvent performs the necessary work for change stream events that occur during
 // operation.
-func (verifier *Verifier) HandleChangeStreamEvent(ctx context.Context, changeEvent *ParsedEvent) error {
+func (verifier *Verifier) HandleChangeStreamEvent(changeEvent *ParsedEvent) error {


Just curious, what was the advantage of doing the buffering here versus closer to the read from the cursor? I was thinking we’d read all the events into a slice then pass that slice to this function (renamed HandleChangeStreamEvents).

We wouldn’t need separate handle-vs-flush methods in that case.

Good point, I made the changes.

FGasper · 2024-11-08T16:16:47Z

internal/verifier/change_stream.go

 			}
+
+			eventsRead++
+		}


Should there be error handling for cs.Err() here?

Added to the if statement above.


FGasper · 2024-11-08T16:20:04Z

internal/verifier/change_stream.go


 // StartChangeStream starts the change stream.
-func (verifier *Verifier) StartChangeStream(ctx context.Context) error {
+func (verifier *Verifier) StartChangeStream(ctx context.Context, batchSize *int32) error {


Why does this need to be a setting? Aren’t we just taking whatever batch size we get from the source?

I did this so that a test can pass in a small batch size.

Was that necessary, though?

getMore always returns after a relatively short period of inactivity with just whatever events it happens to have seen. So the batch size should be naturally limited to whichever events you’ve triggered … ?

FGasper · 2024-11-08T16:22:25Z

internal/verifier/change_stream.go

+		}
+
+		dbName, collName := SplitNamespace(namespace)
+		if err := verifier.insertRecheckDocs(ctx, dbName, collName, ids, dataSizes); err != nil {


This approach will cause us to block between each namespace’s events, I think.

Could we instead persist all of the events for all namespaces at once?

FGasper · 2024-11-08T16:24:21Z

internal/verifier/recheck.go

 	}

+	// Silence any duplicate key errors as recheck docs should have existed.
+	if mongo.IsDuplicateKeyError(err) {


Dupe key errors can exist alongside other errors.

We should probably instead look for non-simple dupe-key errors so that anything that isn’t a dupe-key will still be handled.

It seems that there shouldn't be any duplicate key errors here because the write is actually a replace rather than an insert.
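
To make the concern above concrete: a bulk write can report several per-document failures at once, so checking only "is there a duplicate key error?" can hide the others. The sketch below filters the failures instead. It is an illustration under assumptions: `writeError` and `nonDuplicateKeyErrors` are hypothetical names, not the driver's types, and the only server fact relied on is that E11000 is the duplicate key error code. It needs Go 1.20 or later for `errors.Join`.

```go
package main

import (
	"errors"
	"fmt"
)

// writeError is a hypothetical, simplified stand-in for one per-document
// failure reported by a bulk write. It is not the driver's type.
type writeError struct {
	Code    int
	Message string
}

// duplicateKeyCode is the server's E11000 duplicate key error code.
const duplicateKeyCode = 11000

// nonDuplicateKeyErrors drops duplicate key failures and returns the rest
// joined into one error, or nil if only duplicate key failures occurred.
func nonDuplicateKeyErrors(writeErrs []writeError) error {
	var errs []error
	for _, we := range writeErrs {
		if we.Code == duplicateKeyCode {
			continue
		}
		errs = append(errs, fmt.Errorf("write error %d: %s", we.Code, we.Message))
	}
	return errors.Join(errs...)
}

func main() {
	onlyDupes := []writeError{{Code: 11000, Message: "duplicate key"}}
	mixed := append(onlyDupes, writeError{Code: 2, Message: "bad value"})

	fmt.Println(nonDuplicateKeyErrors(onlyDupes) == nil)
	fmt.Println(nonDuplicateKeyErrors(mixed))
}
```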

FGasper · 2024-11-08T16:25:15Z

internal/verifier/migration_verifier.go


 	pprofInterval time.Duration
+
+	changeEventRecheckBuf ChangeEventRecheckBuffer


Suggestion: We shouldn’t need this if we switch to reading all events from the cursor into a slice.

FGasper · 2024-11-08T16:34:28Z

internal/verifier/change_stream_test.go

+		"the verifier should flush a recheck doc after a batch",
+	)
+	suite.Require().Empty(verifier.changeEventRecheckBuf["testDB.testColl1"])
+	suite.Require().Empty(verifier.changeEventRecheckBuf["testDB.testColl2"])


Can we avoid testing verifier internals here? It’d be ideal not to depend on them.

autarch

LGTM % some small stuff. No need for another review.

autarch · 2024-11-08T20:44:49Z

internal/verifier/recheck.go

+func (verifier *Verifier) insertRecheckDocsWhileLocked(
 	ctx context.Context,
-	dbName, collName string, documentIDs []interface{}, dataSizes []int) error {
+	dbNames []string, collNames []string, documentIDs []interface{}, dataSizes []int) error {


nit: This is a very weird way to format this. I think it'd be more readable as one arg per line.

autarch · 2024-11-08T20:46:27Z

internal/verifier/change_stream.go

+	verifier.mux.Lock()
+	defer verifier.mux.Unlock()
+
+	return verifier.insertRecheckDocsWhileLocked(ctx, dbNames, collNames, docIDs, dataSizes)


I think it'd make more sense to put the locking in the insertRecheckDocsWhileLocked method. Right now every caller has to remember to use the mutex properly, which seems like a recipe for mistakes.

I made it this way just this week to accommodate stats. If we’re going to change this I’d rather it be in a separate PR since the changes don’t really seem germane.

I've removed the change event stats from this function in this PR. I think it makes sense to make this change here as well.

autarch · 2024-11-08T20:50:12Z

internal/verifier/change_stream.go

+		}
+
+		if eventsRead > 0 {
+			verifier.logger.Debug().Msgf("Received a batch of %d events", eventsRead)


Do you also want to log the 0 events case? I don't have enough context to know if that would be useful.

I'd prefer not to log when there are no events. We've recommended that users run the m-v at debug log level, so I don't want too many log lines even at that level.

FGasper · 2024-11-09T01:07:32Z

internal/verifier/change_stream.go

+		}
+
+		if eventsRead > 0 {
+			verifier.logger.Debug().Msgf("Received a batch of %d events", eventsRead)


nit:

Suggested change

-verifier.logger.Debug().Msgf("Received a batch of %d events", eventsRead)
+verifier.logger.Debug().Int("eventsCount", eventsRead).Msgf("Received a batch of events.")

FGasper · 2024-11-09T01:08:16Z

internal/verifier/change_stream.go

+			verifier.logger.Debug().Msgf("Received a batch of %d events", eventsRead)
+		}
+
+		err := verifier.HandleChangeStreamEvents(ctx, changeEventBatch)


Should this handling logic also be under the >0 condition?

FGasper · 2024-11-09T01:09:36Z

internal/verifier/change_stream.go

 		}

-		return gotEvent, errors.Wrap(cs.Err(), "change stream iteration failed")
+		return eventsRead > 0, errors.Wrap(cs.Err(), "change stream iteration failed")


It seems strange to defer handling cs.Err() until after HandleChangeStreamEvents() is called. It’d look a bit more idiomatic if the check were closer to the TryNext.

FGasper · 2024-11-09T01:11:07Z

internal/verifier/recheck.go

+	dbNames []string,
+	collNames []string,
+	documentIDs []interface{},
+	dataSizes []int,


nit: If we’re doing a separate slice for each category it’d seem a bit cleaner if there were 1 slice with a struct that contains these data points.
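
For illustration, the single-slice shape suggested here might look like the sketch below. `recheckDoc`, `changeEvent`, and `recheckDocsFromEvents` are hypothetical names and simplified types, not the PR's code. The namespace is split at the first dot, since MongoDB database names cannot contain a dot while collection names can. `strings.Cut` needs Go 1.18 or later.

```go
package main

import (
	"fmt"
	"strings"
)

// recheckDoc is a hypothetical struct that keeps the data points for one
// document together instead of spreading them over four parallel slices.
type recheckDoc struct {
	DBName   string
	CollName string
	DocID    interface{}
	DataSize int
}

// changeEvent is a hypothetical, simplified change event.
type changeEvent struct {
	Namespace string // "db.coll"
	DocID     interface{}
	DataSize  int
}

// recheckDocsFromEvents converts a whole batch, across all namespaces,
// into one slice that can be persisted with a single call.
func recheckDocsFromEvents(events []changeEvent) []recheckDoc {
	docs := make([]recheckDoc, 0, len(events))
	for _, ev := range events {
		// Split at the first dot only: collection names may contain dots.
		dbName, collName, _ := strings.Cut(ev.Namespace, ".")
		docs = append(docs, recheckDoc{
			DBName:   dbName,
			CollName: collName,
			DocID:    ev.DocID,
			DataSize: ev.DataSize,
		})
	}
	return docs
}

func main() {
	docs := recheckDocsFromEvents([]changeEvent{
		{Namespace: "testDB.testColl1", DocID: 1, DataSize: 20},
		{Namespace: "testDB.testColl2", DocID: "abc", DataSize: 35},
	})
	for _, d := range docs {
		fmt.Printf("%s %s %v %d\n", d.DBName, d.CollName, d.DocID, d.DataSize)
	}
}
```

With one slice there is no way for the per-field slices to drift out of step in length or order, and the insert function's signature shrinks to a context plus one argument.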

FGasper

LGTM. Thanks!

FGasper · 2024-11-12T15:00:13Z

internal/verifier/migration_verifier_test.go

 	suite.Require().Equal(VerificationStatus{TotalTasks: 1, FailedTasks: 1}, *status)
+
+	checkContinueChan <- struct{}{}
+	require.NoError(suite.T(), errGroup.Wait())


tdq45gj added 10 commits November 7, 2024 11:38

buffer by change stream batch

7a11e3e

fix test

6dfbc64

Update recheck_test.go

b201a20

Update change_stream_test.go

c4c7a66

Update change_stream_test.go

59effc4

Update change_stream.go

84b6df6

Update change_stream_test.go

3c6383b

Update change_stream_test.go

759a9bc

Update migration_verifier_test.go

b5a170a

Merge branch 'main' into rep-5219-batch-attempt-2

91b1896

tdq45gj requested review from FGasper and autarch November 7, 2024 21:11

Merge branch 'main' into rep-5219-batch-attempt-2

a58b996

FGasper requested changes Nov 8, 2024


FGasper reviewed Nov 8, 2024


tdq45gj added 2 commits November 8, 2024 11:53

Merge branch 'main' into rep-5219-batch-attempt-2

0fea3ad

Felipe's review

3ac0547

tdq45gj requested a review from FGasper November 8, 2024 20:29

autarch approved these changes Nov 8, 2024


Dave's review

86935c1

FGasper reviewed Nov 9, 2024


Update change_stream.go

5c71523

tdq45gj requested a review from FGasper November 11, 2024 14:12

FGasper reviewed Nov 12, 2024


FGasper approved these changes Nov 12, 2024


tdq45gj merged commit 13e2a62 into mongodb-labs:main Nov 12, 2024
5 checks passed

