Conversation

Collaborator

@tdq45gj tdq45gj commented Nov 22, 2024

This PR adds a destination change stream to the MV. It also adds a ChangeStreamReader struct that reads change events from either the source or the destination. Change events are handled by change event handlers, which are goroutines started for each change stream reader.
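
For orientation, a rough sketch of the shape being described (a sketch only; field names follow the snippets quoted later in the review and may not match the final code exactly):

```go
package verifier

import (
	"go.mongodb.org/mongo-driver/bson/primitive"
	"go.mongodb.org/mongo-driver/mongo"
)

// ChangeStreamReader tails one cluster's change stream and hands parsed event
// batches to the verifier over channels. ParsedEvent is the package's existing
// change-event type.
type ChangeStreamReader struct {
	readerType    string // "source" or "destination"
	watcherClient *mongo.Client

	ChangeEventBatchChan chan []ParsedEvent
	WritesOffTsChan      chan primitive.Timestamp
	ErrChan              chan error
	DoneChan             chan struct{}
}

// String lets logs and error messages identify which cluster's reader this is.
func (csr *ChangeStreamReader) String() string {
	return csr.readerType + " change stream reader"
}
```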

@tdq45gj tdq45gj marked this pull request as ready for review November 22, 2024 20:59
@tdq45gj tdq45gj requested a review from FGasper November 25, 2024 22:12
Collaborator

@FGasper FGasper left a comment

In broad strokes this looks good.

I’ve noted some small things, mostly nits.

Also, can there be a more end-to-end-ish test verifying that, if the destination receives an update to a document after MV has checked it, we'll get a mismatch reported?

dstNamespaces []string
nsMap map[string]string
srcDstNsMap map[string]string
dstSrcNsMap map[string]string
Collaborator

Is this always just the inverse of srcDstNsMap? If so, I think we’d be better served by a dedicated type with an accessor.
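
For illustration, one possible shape for such a dedicated type (a sketch; the type and method names are assumptions, not code from this PR):

```go
// namespaceMap stores only the source→destination mapping as state and derives
// the reverse lookup at construction time, so the two directions can't drift apart.
type namespaceMap struct {
	srcToDst map[string]string
	dstToSrc map[string]string
}

func newNamespaceMap(srcToDst map[string]string) namespaceMap {
	dstToSrc := make(map[string]string, len(srcToDst))
	for src, dst := range srcToDst {
		dstToSrc[dst] = src
	}
	return namespaceMap{srcToDst: srcToDst, dstToSrc: dstToSrc}
}

// dstFor returns the destination namespace for a source namespace.
func (m namespaceMap) dstFor(srcNs string) (string, bool) {
	dst, ok := m.srcToDst[srcNs]
	return dst, ok
}

// srcFor is the accessor suggested above: the inverse lookup without keeping a
// second, independently maintained map on the verifier.
func (m namespaceMap) srcFor(dstNs string) (string, bool) {
	src, ok := m.dstToSrc[dstNs]
	return src, ok
}
```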

verifier.logger,
verifier.srcClient,
)
srcFinalTs, err := GetNewClusterTime(
Collaborator

In upstream the mux is locked during GetNewClusterTime. Does this need to change? If so, why?

If not, can we keep upstream’s behavior here?

Collaborator Author

I don't think GetNewClusterTime needs to be under the lock. Here I'm removing the verifier.writesOffTimestamp field and adding the above error so that writesOff can't be called twice.

Collaborator

I don't think GetNewClusterTime needs to be under the lock.

What benefit do we realize from changing this, though?

Collaborator Author

Ideally we should minimize the critical section.
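
For reference, the shape of the change being debated, as a generic sketch (the field names and the GetNewClusterTime signature here are assumptions, not the repo's actual code):

```go
// Sketch of "minimize the critical section": hold the mutex only while touching
// verifier state, and make the slow server round trip with the lock released.
func (verifier *Verifier) writesOff(ctx context.Context) error {
	verifier.mux.Lock()
	alreadyOff := verifier.writesOffCalled
	verifier.writesOffCalled = true
	verifier.mux.Unlock()

	if alreadyOff {
		return errors.New("WritesOff was already called")
	}

	// Network round trip happens outside the lock.
	srcFinalTs, err := GetNewClusterTime(ctx, verifier.logger, verifier.srcClient)
	if err != nil {
		return errors.Wrap(err, "failed to read a new cluster time from the source")
	}

	verifier.srcChangeStreamReader.WritesOffTsChan <- srcFinalTs
	return nil
}
```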

Collaborator

My concern is that this is fairly brittle stuff. We’ve made it less so, but altering concurrency-related code seems like a risk we’re better off avoiding.

I can’t see anything wrong with your reasoning, but this sort of stuff is prone to “surprises”. Will this palpably improve speed? If so, then I’m OK with it. If not, IMO we should retain the present logic.

}
} else {
verifier.mux.Unlock()
// This has to happen under the lock because the change stream
Collaborator

The comment restores my “under the lock” typo. :-P Should probably fix.

// sorting by _id will guarantee that all rechecks for a given
// namespace appear consecutively.
//
// DatabaseName and CollectionName should be on the source.
Collaborator

Instead of this comment, the struct could rename its members?
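
i.e., something along these lines (the struct and field names here are hypothetical):

```go
// Hypothetical rename: put "source" into the field names so the invariant is
// self-documenting instead of living in a comment.
type recheckPrimaryKey struct {
	SrcDatabaseName   string      `bson:"db"`
	SrcCollectionName string      `bson:"coll"`
	DocumentID        interface{} `bson:"docID"`
}
```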

// If the changeStreamEnderChan has a message, the user has indicated that
// source writes are ended. This means we should exit rather than continue
// If the ChangeStreamEnderChan has a message, the user has indicated that
// source and destination writes are ended. This means we should exit rather than continue
Collaborator

I don’t know that the “and destination” part is gainful here. It might suggest that user writes to the destination are expected during a migration, which they definitely aren’t.

Collaborator Author

I think it means that the migration tool has stopped/committed. I'll clarify in this comment.

gotwritesOffTimestamp = true

// Read all change events until the source reports no events.
// Read all change events until the source / destination reports no events.
Collaborator

This log is inaccurate upstream; we might as well fix it here.

Maybe:

Read change events until the stream reaches the writesOffTs.

case err := <-verifier.srcChangeStreamReader.ChangeStreamErrChan:
cancel()
return errors.Wrap(err, "change stream failed")
return errors.Wrapf(err, "got an error from %s", verifier.srcChangeStreamReader)
Collaborator

Do you prefer “got an error” to “failed” here? The latter seems (to me, at least) like stronger, more concise wording.

Maybe:

errors.Wrapf(err, "%s failed", verifier.srcChangeStreamReader)

Collaborator Author

Agreed.

return errors.Wrapf(err, "got an error from %s", verifier.srcChangeStreamReader)
case err := <-verifier.dstChangeStreamReader.ChangeStreamErrChan:
cancel()
return errors.Wrapf(err, "got an error from %s", verifier.dstChangeStreamReader)
Collaborator

same q re “got an error”

if err != nil {
return errors.Wrap(err, "failed to start change stream on destination")
}
verifier.StartChangeEventHandler(ctx, verifier.dstChangeStreamReader)
Collaborator

Since ChangeStreamReader has a String() method, you could just loop over the src & dst change streams rather than duplicating this logic.
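
For example (a sketch; the StartChangeStream method name is assumed from the quoted code):

```go
// Iterate over both readers instead of writing the start/handle sequence twice;
// each reader's String() method keeps the error messages cluster-specific.
for _, reader := range []*ChangeStreamReader{
	verifier.srcChangeStreamReader,
	verifier.dstChangeStreamReader,
} {
	if err := reader.StartChangeStream(ctx); err != nil {
		return errors.Wrapf(err, "failed to start %s", reader)
	}
	verifier.StartChangeEventHandler(ctx, reader)
}
```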

Comment on lines 475 to 476
verifier.SetSrcNamespaces([]string{"srcDB.srcColl1", "srcDB.srcColl2"})
verifier.SetDstNamespaces([]string{"dstDB.dstColl1", "dstDB.dstColl2"})
Collaborator

Can these use suite.DBNameForTest() rather than hard-coding DB names?

@tdq45gj tdq45gj requested a review from FGasper December 2, 2024 12:54
Collaborator

@FGasper FGasper left a comment

I’ve left a few more small comments. I think this is quite close! Please ping me when you’ve had a chance to look.

Thanks!


// StartChangeEventHandler starts a goroutine that handles change event batches from the reader.
// It needs to be started after the reader starts.
func (verifier *Verifier) StartChangeEventHandler(ctx context.Context, reader *ChangeStreamReader, errGroup *errgroup.Group) {
Collaborator

nit: It seems slightly cleaner not to pass the errGroup around, but just to let the caller call this function in errGroup.Go().

Collaborator Author

Done
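
The resulting call-site shape, roughly (RunChangeEventHandler is a hypothetical blocking variant of the handler):

```go
// The caller owns the errgroup and wraps the handler itself.
ceHandlerGroup.Go(func() error {
	return verifier.RunChangeEventHandler(ctx, verifier.srcChangeStreamReader)
})
ceHandlerGroup.Go(func() error {
	return verifier.RunChangeEventHandler(ctx, verifier.dstChangeStreamReader)
})
```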

return errors.Wrapf(err, "failed to decode change event to %T", changeEventBatch[eventsRead])
}

csr.logger.Trace().Msgf("%s received a change event: %v", csr, changeEventBatch[eventsRead])
Collaborator

Trace-level logs make sense, but is there any way to actually see them? The CLI just exposes a --debug option.

Collaborator

@FGasper FGasper Dec 3, 2024

Also, consider making the event an Interface() in the log and using Msg() instead.

Collaborator Author

Yeah, this only logs in tests. I added a comment.
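
The suggested logging shape, for reference (zerolog's Interface() attaches the event as a structured field instead of formatting it into the message):

```go
csr.logger.Trace().
	Str("reader", csr.String()).
	Interface("event", changeEventBatch[eventsRead]).
	Msg("Received a change event.")
```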

}

csr.logger.Trace().Msgf("%s received a change event: %v", csr, changeEventBatch[eventsRead])
fmt.Printf("%d %d\n", changeEventBatch[eventsRead].ClusterTime.T, changeEventBatch[eventsRead].ClusterTime.I)
Collaborator

leftover?

csr.logger.Trace().Msgf("%s received a change event: %v", csr, changeEventBatch[eventsRead])
fmt.Printf("%d %d\n", changeEventBatch[eventsRead].ClusterTime.T, changeEventBatch[eventsRead].ClusterTime.I)

if changeEventBatch[eventsRead].ClusterTime != nil &&
Collaborator

I know this is just copied logic, but is there ever a case where we’d expect a nil cluster time?

Collaborator Author

I don't think there would be a nil cluster time, although the ClusterTime field has an omitempty BSON tag.

Collaborator

I feel like if a missing clusterTime ever happened that’d be something we should at least warn about, if not fail outright. But that’d be more apropos to a separate PR.
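
A minimal version of that warning, sketched for a possible follow-up (not part of this PR):

```go
// Surface a missing clusterTime instead of silently skipping the timestamp bookkeeping.
if changeEventBatch[eventsRead].ClusterTime == nil {
	csr.logger.Warn().
		Str("reader", csr.String()).
		Interface("event", changeEventBatch[eventsRead]).
		Msg("Change event has no clusterTime.")
}
```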

return errors.Wrap(err, "failed to handle change events")
}

csr.ChangeEventBatchChan <- changeEventBatch
Collaborator

Why do the handling in a separate goroutine?

Collaborator Author

The handling needs to call the verifier's insertRecheckDocs method. Now that we're separating the change stream read logic into its own struct, it seems to me that having a separate goroutine in the verifier to handle the change events makes the most sense.

Collaborator

OK, yeah it works around the circular dependency.

Ideally I’d like us to solve that by breaking apart Verifier, but that may be a bigger change than the present “mini-project” warrants.
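
The handoff under discussion, as a sketch (the handler and insertRecheckDocsForEvents names are illustrative; the real code calls insertRecheckDocs):

```go
// The reader only reads and batches events; this verifier-side goroutine
// receives the batches and does the recheck bookkeeping, so ChangeStreamReader
// never needs to depend on Verifier.
func (verifier *Verifier) RunChangeEventHandler(ctx context.Context, reader *ChangeStreamReader) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case batch, ok := <-reader.ChangeEventBatchChan:
			if !ok {
				// The reader closed the channel; no more events are coming.
				return nil
			}
			if err := verifier.insertRecheckDocsForEvents(ctx, batch); err != nil {
				return errors.Wrap(err, "failed to handle change events")
			}
		}
	}
}
```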

verifier.logger.Debug().Msg("Check: Change stream already running.")
} else {
verifier.logger.Debug().Msg("Change stream not running; starting change stream")
ceHandlerGroup := &errgroup.Group{}
Collaborator

Why not errgroup.WithContext()?

Collaborator

Also, why are the event handler goroutines in an errgroup while the event reader goroutines aren’t?

Collaborator Author

Yeah, errgroup.WithContext() seems better. The only reason the handler goroutines are in an errgroup is that we want to wait for them in the CheckDriver function. The change stream readers are currently waited on through channels.
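
A sketch of the errgroup.WithContext() variant (names are illustrative): a failure in either handler cancels the shared context, and CheckDriver waits on the group.

```go
ceHandlerGroup, groupCtx := errgroup.WithContext(ctx)
ceHandlerGroup.Go(func() error {
	return verifier.RunChangeEventHandler(groupCtx, verifier.srcChangeStreamReader)
})
ceHandlerGroup.Go(func() error {
	return verifier.RunChangeEventHandler(groupCtx, verifier.dstChangeStreamReader)
})

// ...later, in CheckDriver:
if err := ceHandlerGroup.Wait(); err != nil {
	return errors.Wrap(err, "change event handler failed")
}
```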

type clusterType string

const (
srcReaderType clusterType = "source"
Collaborator

“cluster type” seems relatively non-descript: “type” often means topology, for example.

Maybe whichCluster?

Collaborator Author

I can't think of a better name. I changed it to whichCluster.
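
After the rename, the declarations read roughly as follows (the constant names are guesses):

```go
// whichCluster says which cluster a ChangeStreamReader watches; "clusterType"
// was avoided because "type" often suggests topology.
type whichCluster string

const (
	srcWhichCluster whichCluster = "source"
	dstWhichCluster whichCluster = "destination"
)
```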

verifier.logger,
verifier.srcClient,
)
srcFinalTs, err := GetNewClusterTime(
Collaborator

I don't think GetNewClusterTime needs to be under the lock.

What benefit do we realize from changing this, though?

}
} else {
verifier.mux.Unlock()
// This has to happen outside the lock because the change stream
Collaborator

There are 2 change streams … can you amend this comment accordingly?

Collaborator Author

Done

// Dry run generation 0 to make sure change stream reader is started.
suite.Require().NoError(runner.AwaitGenerationEnd())

suite.Require().NoError(runner.StartNextGeneration())
Collaborator

Should the generation start after the inserts? It seems like that’d be a bit less race-prone.

Collaborator Author

It looks like it's more likely to race if the inserts happen after the generation starts, because StartNextGeneration doesn't block until the next generation begins. I've reverted the changes for this test so that the change events occur in the previous generation and the recheck tasks should appear in the next generation.

@tdq45gj tdq45gj requested a review from FGasper December 3, 2024 18:09
Collaborator

@FGasper FGasper left a comment

LGTM % the last few small notes I’ve made.

Comment on lines 68 to 71
ChangeEventBatchChan chan []ParsedEvent
WritesOffTsChan chan primitive.Timestamp
ErrChan chan error
DoneChan chan struct{}
Collaborator

nit: Do these need to be exported?

Collaborator Author

Technically none of them need to be exported since they're all in the verifier package. Changed them to unexported.
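
i.e., roughly (a sketch of the renamed fields):

```go
type ChangeStreamReader struct {
	// The channels stay package-internal now that nothing outside the verifier
	// package needs them.
	changeEventBatchChan chan []ParsedEvent
	writesOffTsChan      chan primitive.Timestamp
	errChan              chan error
	doneChan             chan struct{}
	// ...other fields elided...
}
```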

@tdq45gj tdq45gj merged commit ad69b4d into mongodb-labs:main Dec 3, 2024
49 checks passed
FGasper added a commit to FGasper/migration-verifier that referenced this pull request Dec 4, 2024
FGasper added a commit to FGasper/migration-verifier that referenced this pull request Dec 4, 2024
This fixes an oversight from PR mongodb-labs#53. It also “upgrades” the test
for the change stream filter to focus on behavior rather than
implementation.
FGasper added a commit that referenced this pull request Dec 4, 2024
This fixes an oversight from PR #53. It also “upgrades” the test
for the change stream filter to focus on behavior rather than
implementation.