REP-5317 Add destination change stream #53
Conversation
In broad strokes this looks good.
I’ve noted some small things, mostly nits.
Also, could there be a more end-to-end-ish test verifying that if the destination receives an update to a document after MV has checked it, we report a mismatch?
    dstNamespaces []string
    nsMap         map[string]string
    srcDstNsMap   map[string]string
    dstSrcNsMap   map[string]string
Is this always just the inverse of srcDstNsMap? If so, I think we’d be better served by a dedicated type with an accessor.
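For illustration, a minimal sketch of what a dedicated bidirectional namespace-map type with accessors could look like; the type and method names here are hypothetical, not anything in the PR:

```go
// nsMapping keeps the forward and inverse namespace maps together so they
// can never drift out of sync.
type nsMapping struct {
	srcToDst map[string]string
	dstToSrc map[string]string
}

func newNsMapping() nsMapping {
	return nsMapping{
		srcToDst: map[string]string{},
		dstToSrc: map[string]string{},
	}
}

// Add records a source-to-destination namespace pair in both directions.
func (m nsMapping) Add(srcNs, dstNs string) {
	m.srcToDst[srcNs] = dstNs
	m.dstToSrc[dstNs] = srcNs
}

// DstFor returns the destination namespace mapped from a source namespace.
func (m nsMapping) DstFor(srcNs string) (string, bool) {
	dst, ok := m.srcToDst[srcNs]
	return dst, ok
}

// SrcFor returns the source namespace mapped from a destination namespace.
func (m nsMapping) SrcFor(dstNs string) (string, bool) {
	src, ok := m.dstToSrc[dstNs]
	return src, ok
}
```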
        verifier.logger,
        verifier.srcClient,
    )
    srcFinalTs, err := GetNewClusterTime(
In upstream the mux is locked during GetNewClusterTime. Does this need to change? If so, why?
If not, can we keep upstream’s behavior here?
I don't think GetNewClusterTime needs to be under the lock. Here I'm removing the verifier.writesOffTimestamp field and adding the above error so that writesOff can't be called twice.
I don't think GetNewClusterTime needs to be under the lock.
What benefit do we realize from changing this, though?
Ideally we should minimize the critical section.
My concern is that this is fairly brittle stuff. We’ve made it less so, but altering concurrency-related code seems like a risk we’re better off avoiding.
I can’t see anything wrong with your reasoning, but this sort of stuff is prone to “surprises”. Will this palpably improve speed? If so, then I’m OK with it. If not, IMO we should retain the present logic.
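For reference, the shape of the change under discussion is roughly the following; the field name and the GetNewClusterTime signature are assumptions for illustration, not the PR's actual code:

```go
// Sketch: the slow cluster-time fetch happens with no lock held, and only
// the verifier-state update sits inside the critical section.
func (verifier *Verifier) recordSrcWritesOffTs(ctx context.Context) error {
	srcFinalTs, err := GetNewClusterTime(ctx, verifier.logger, verifier.srcClient)
	if err != nil {
		return errors.Wrap(err, "getting source cluster time")
	}

	verifier.mux.Lock()
	verifier.srcWritesOffTs = srcFinalTs // illustrative field name
	verifier.mux.Unlock()

	return nil
}
```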
    }
    } else {
        verifier.mux.Unlock()
        // This has to happen under the lock because the change stream
The comment restores my “under the lock” typo. :-P Should probably fix.
internal/verifier/recheck.go (outdated)
    // sorting by _id will guarantee that all rechecks for a given
    // namespace appear consecutively.
    //
    // DatabaseName and CollectionName should be on the source.
Instead of this comment, the struct could rename its members?
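For example, the rename might look something like this; the struct and field names are illustrative, only the Src prefix is the actual suggestion:

```go
// With the source cluster baked into the field names, the explanatory
// comment becomes unnecessary.
type recheckPrimaryKey struct {
	SrcDatabaseName   string
	SrcCollectionName string
	DocumentID        interface{}
}
```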
internal/verifier/change_stream.go (outdated)
    // If the changeStreamEnderChan has a message, the user has indicated that
    // source writes are ended. This means we should exit rather than continue
    // If the ChangeStreamEnderChan has a message, the user has indicated that
    // source and destination writes are ended. This means we should exit rather than continue
I don’t know that the “and destination” part is gainful here. It might suggest that user writes to the destination are expected during a migration, which they definitely aren’t.
I think it means that the migration tool has stopped/committed. I'll clarify in this comment.
internal/verifier/change_stream.go (outdated)
    gotwritesOffTimestamp = true

    // Read all change events until the source reports no events.
    // Read all change events until the source / destination reports no events.
This log is inaccurate upstream; we might as well fix it here.
Maybe:
Read change events until the stream reaches the writesOffTs.
internal/verifier/check.go (outdated)
    case err := <-verifier.srcChangeStreamReader.ChangeStreamErrChan:
        cancel()
        return errors.Wrap(err, "change stream failed")
        return errors.Wrapf(err, "got an error from %s", verifier.srcChangeStreamReader)
Do you prefer “got an error” to “failed” here? The latter seems (to me, at least) like stronger, more concise wording.
Maybe:
errors.Wrapf(err, "%s failed", verifier.srcChangeStreamReader)
Agreed.
internal/verifier/check.go (outdated)
        return errors.Wrapf(err, "got an error from %s", verifier.srcChangeStreamReader)
    case err := <-verifier.dstChangeStreamReader.ChangeStreamErrChan:
        cancel()
        return errors.Wrapf(err, "got an error from %s", verifier.dstChangeStreamReader)
Same question re “got an error”.
internal/verifier/check.go (outdated)
    if err != nil {
        return errors.Wrap(err, "failed to start change stream on destination")
    }
    verifier.StartChangeEventHandler(ctx, verifier.dstChangeStreamReader)
Since ChangeStreamReader has a String() method, you could just loop over the src & dst change streams rather than duplicating this logic.
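A rough sketch of the suggested de-duplication; ChangeStreamReader and StartChangeEventHandler appear in the PR, but the start method's name and signature here are assumptions:

```go
func (verifier *Verifier) startChangeStreamReaders(ctx context.Context) error {
	for _, reader := range []*ChangeStreamReader{
		verifier.srcChangeStreamReader,
		verifier.dstChangeStreamReader,
	} {
		// reader's String() method distinguishes source from destination
		// in the error message.
		if err := reader.StartChangeStream(ctx); err != nil {
			return errors.Wrapf(err, "failed to start change stream on %s", reader)
		}
		verifier.StartChangeEventHandler(ctx, reader)
	}
	return nil
}
```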
    verifier.SetSrcNamespaces([]string{"srcDB.srcColl1", "srcDB.srcColl2"})
    verifier.SetDstNamespaces([]string{"dstDB.dstColl1", "dstDB.dstColl2"})
Can these use suite.DBNameForTest() rather than hard-coding DB names?
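Something along these lines, assuming DBNameForTest accepts a suffix (its exact signature isn't shown here):

```go
// Hypothetical: derive per-test DB names instead of hard-coding them.
srcDB := suite.DBNameForTest("src")
dstDB := suite.DBNameForTest("dst")
verifier.SetSrcNamespaces([]string{srcDB + ".srcColl1", srcDB + ".srcColl2"})
verifier.SetDstNamespaces([]string{dstDB + ".dstColl1", dstDB + ".dstColl2"})
```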
FGasper left a comment:
I’ve left a few more small comments. I think this is quite close! Please ping me when you’ve had a chance to look.
Thanks!
internal/verifier/change_stream.go (outdated)
    // StartChangeEventHandler starts a goroutine that handles change event batches from the reader.
    // It needs to be started after the reader starts.
    func (verifier *Verifier) StartChangeEventHandler(ctx context.Context, reader *ChangeStreamReader, errGroup *errgroup.Group) {
nit: It seems slightly cleaner not to pass the errGroup around, but just to let the caller call this function in errGroup.Go().
Done
internal/verifier/change_stream.go (outdated)
        return errors.Wrapf(err, "failed to decode change event to %T", changeEventBatch[eventsRead])
    }

    csr.logger.Trace().Msgf("%s received a change event: %v", csr, changeEventBatch[eventsRead])
Trace-level logs make sense, but is there any way to actually see them? The CLI just exposes a --debug option.
Also, consider making the event an Interface() in the log and using Msg() instead.
Yeah, this only logs in tests. I added a comment.
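For reference, the suggested zerolog call might look like this; the Stringer field for the reader is an extra touch, and the key names are arbitrary:

```go
// Structured fields instead of formatting the whole event into the message.
csr.logger.Trace().
	Stringer("reader", csr).
	Interface("event", changeEventBatch[eventsRead]).
	Msg("Received a change event.")
```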
internal/verifier/change_stream.go (outdated)
    }

    csr.logger.Trace().Msgf("%s received a change event: %v", csr, changeEventBatch[eventsRead])
    fmt.Printf("%d %d\n", changeEventBatch[eventsRead].ClusterTime.T, changeEventBatch[eventsRead].ClusterTime.I)
leftover?
    csr.logger.Trace().Msgf("%s received a change event: %v", csr, changeEventBatch[eventsRead])
    fmt.Printf("%d %d\n", changeEventBatch[eventsRead].ClusterTime.T, changeEventBatch[eventsRead].ClusterTime.I)

    if changeEventBatch[eventsRead].ClusterTime != nil &&
I know this is just copied logic, but is there ever a case where we’d expect a nil cluster time?
I don't think there would be a nil cluster time, although the ClusterTime field has a BSON tag that omits empty.
I feel like if a missing clusterTime ever happened that’d be something we should at least warn about, if not fail outright. But that’d be more apropos to a separate PR.
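A sketch of that “warn, if not fail outright” idea, for the hypothetical future PR; the variable names follow the diff above:

```go
// Surface an unexpectedly missing cluster time instead of silently skipping it.
if changeEventBatch[eventsRead].ClusterTime == nil {
	csr.logger.Warn().
		Interface("event", changeEventBatch[eventsRead]).
		Msg("Change event unexpectedly has no cluster time.")
	// ...or, if we decide a missing cluster time should be fatal:
	// return errors.New("change event has no cluster time")
}
```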
internal/verifier/change_stream.go (outdated)
        return errors.Wrap(err, "failed to handle change events")
    }

    csr.ChangeEventBatchChan <- changeEventBatch
Why do the handling in a separate goroutine?
The handling needs to call the verifier's method insertRecheckDocs. Now that we're separating the change stream read logic to its own struct, it seems to me that having a separate goroutine to handle the change events in the verifier makes the most sense.
OK, yeah it works around the circular dependency.
Ideally I’d like us to solve that by breaking apart Verifier, but that may be a bigger change than the present “mini-project” warrants.
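In miniature, the decoupling being described looks roughly like this: the reader only produces batches onto its channel, and a verifier-side goroutine consumes them, so ChangeStreamReader never has to know about Verifier. The handler's name and body here are illustrative; the channel field follows the PR:

```go
func (verifier *Verifier) RunChangeEventHandler(ctx context.Context, reader *ChangeStreamReader) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case batch, ok := <-reader.ChangeEventBatchChan:
			if !ok {
				// The reader closed the channel; nothing left to handle.
				return nil
			}
			// Recheck insertion lives on the Verifier, which is why the
			// handling runs here rather than inside the reader.
			if err := verifier.HandleChangeStreamEvents(ctx, batch); err != nil {
				return errors.Wrap(err, "failed to handle change events")
			}
		}
	}
}
```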
internal/verifier/check.go (outdated)
        verifier.logger.Debug().Msg("Check: Change stream already running.")
    } else {
        verifier.logger.Debug().Msg("Change stream not running; starting change stream")
        ceHandlerGroup := &errgroup.Group{}
Why not errgroup.WithContext()?
Also, why are the event handler goroutines in an errgroup while the event reader goroutines aren’t?
Yeah, errgroup.WithContext() seems better. The only reason for having handler goroutines in an errgroup is that we want to wait for them in the CheckDriver function. The change stream readers are currently waited on through channels.
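The errgroup.WithContext() version might look roughly like this; the handler function name is an assumption, and the Wait() call stands in for whatever CheckDriver already does to wait on the handlers:

```go
// A failure in either handler cancels groupCtx, which both handlers share.
ceHandlerGroup, groupCtx := errgroup.WithContext(ctx)
ceHandlerGroup.Go(func() error {
	return verifier.RunChangeEventHandler(groupCtx, verifier.srcChangeStreamReader)
})
ceHandlerGroup.Go(func() error {
	return verifier.RunChangeEventHandler(groupCtx, verifier.dstChangeStreamReader)
})

// ...later, in CheckDriver:
// if err := ceHandlerGroup.Wait(); err != nil { ... }
```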
    type clusterType string

    const (
        srcReaderType clusterType = "source"
“cluster type” seems relatively nondescript: “type” often means topology, for example.
Maybe whichCluster?
I can't think of a better name. I changed it to whichCluster.
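For reference, the renamed type presumably looks something like the following; the constant names are illustrative:

```go
// whichCluster says which cluster a reader is attached to, without
// implying anything about topology.
type whichCluster string

const (
	srcWhichCluster whichCluster = "source"
	dstWhichCluster whichCluster = "destination"
)
```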
    }
    } else {
        verifier.mux.Unlock()
        // This has to happen outside the lock because the change stream
There are 2 change streams … can you amend this comment accordingly?
Done
    // Dry run generation 0 to make sure change stream reader is started.
    suite.Require().NoError(runner.AwaitGenerationEnd())

    suite.Require().NoError(runner.StartNextGeneration())
Should the generation start after the inserts? It seems like that’d be a bit less race-prone.
Looks like it's more likely to race if inserts happen after generation start because StartNextGeneration isn't blocked on the next generation. I've reverted the changes for this test so that change events occur in the previous generation and tasks should appear in the next generation.
FGasper left a comment:
LGTM % the last few small notes I’ve made.
internal/verifier/change_stream.go (outdated)
    ChangeEventBatchChan chan []ParsedEvent
    WritesOffTsChan      chan primitive.Timestamp
    ErrChan              chan error
    DoneChan             chan struct{}
nit: Do these need to be exported?
Technically none of them need to be exported since they're all in the verifier package. Changed to private.
This fixes an oversight from PR #53. It also “upgrades” the test for the change stream filter to focus on behavior rather than implementation.
This PR adds a destination change stream to the MV. It also adds a ChangeStreamReader struct that reads change events from either the source or the destination. Change events are handled by change event handlers: goroutines started for each change stream reader.
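For orientation, a rough sketch of the shape this describes; the channel fields mirror ones visible in the review thread, while the other names and types are assumptions:

```go
// ChangeStreamReader (sketch): reads change events from one cluster and
// hands batches to a verifier-side handler goroutine over a channel.
type ChangeStreamReader struct {
	readerType whichCluster // "source" or "destination"

	changeEventBatchChan chan []ParsedEvent
	writesOffTsChan      chan primitive.Timestamp
	errChan              chan error
	doneChan             chan struct{}
}

// The verifier starts one change event handler goroutine per reader; each
// handler drains changeEventBatchChan and enqueues rechecks on the verifier.
```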