Commit 886e29f

Merge pull request #137 from FGasper/felipe_quick_fixes
This PR contains a series of small improvements made while testing the verifier’s performance. This PR’s merge will preserve its individual commits.
2 parents f4c983a + ea7be29 commit 886e29f

7 files changed

+260
-205
lines changed

README.md

Lines changed: 44 additions & 44 deletions
@@ -1,32 +1,44 @@
 # Verify Migrations!

-_If verifying a migration done via [mongosync](https://www.mongodb.com/docs/cluster-to-cluster-sync/current/), please check if it is possible to use the
+_If verifying a migration done via [mongosync](https://www.mongodb.com/docs/cluster-to-cluster-sync/current/), please check if it is possible to use the
 [embedded verifier](https://www.mongodb.com/docs/cluster-to-cluster-sync/current/reference/verification/embedded/#std-label-c2c-embedded-verifier) as that is the preferred approach for verification._

-# Obtaining
-To fetch the latest release:
+# Quick Start
+
+Download the verifier’s latest release:
 ```
 curl -sSL https://raw.githubusercontent.com/mongodb-labs/migration-verifier/refs/heads/main/download_latest.sh | sh
 ```
-… or, if you prefer to build locally, just do:
+(Alternatively, you can check out this repository then `./build.sh` to build from source.)
+
+Then start a local replica set to store verification metadata:
 ```
-./build.sh
+docker run -it -p27017:27017 -v ./verifier_db:/data/db --entrypoint bash mongodb/mongodb-community-server -c 'mongod --bind_ip_all --replSet rs & mpid=$! && until mongosh --eval "rs.initiate()"; do sleep 1; done && wait $mpid'
 ```
+(This will create a local `verifier_db` directory so that you can resume verification if needed.)

-# Operational UX Once Running
-
-_Assumes no port set, default port for operation webserver is 27020_
-
-# Recommendations
+Finally, run verification:
+```
+./migration_verifier \
+--srcURI mongodb://your.source.cluster \
+--dstURI mongodb://your.destination.cluster \
+--serverPort 0 \
+--verifyAll \
+--start
+```
+The above will stream verification logs to standard output. Once writes stop,
+watch for change stream lag to hit 0. The log will report either the found
+mismatches or a confirmation of exact match between the clusters.

 # Verifier Metadata Considerations

-migration-verifier needs a database to store its state. This database SHOULD be on its own cluster.
+migration-verifier needs a MongoDB cluster to store its state. This cluster *must* support transactions (i.e., either a replica set or sharded cluster, NOT a standalone instance). By default, this is assumed to run on localhost:27017.

-The verifier _can_ instead store its metadata on the destination cluster. This can severely degrade performance, though.
-It also requires either disabling mongosync’s destination write blocking or giving the `bypassWriteBlockingMode` to the verifier’s `--metaURI` user.
+See [above](#Quick-Start) for a one-line command to start up a local, single-node replica set that you can use for this purpose.

-## Launch the Verifier Binary
+The verifier can alternatively store its metadata on the destination cluster. This can severely degrade performance, though. Also, if you’re using mongosync, it requires either disabling mongosync’s destination write blocking or giving the `bypassWriteBlockingMode` to the verifier’s `--metaURI` user.
+
+# More Details

 To see all options:

@@ -36,19 +48,19 @@ To see all options:
 ```


-To check all namespaces:
+To check all namespaces:


 ```
-./migration_verifier --srcURI mongodb://127.0.0.1:27002 --dstURI mongodb://127.0.0.1:27003 --metaURI mongodb://127.0.0.1:27001 --metaDBName verify_meta --verifyAll
+./migration_verifier --srcURI mongodb://127.0.0.1:27002 --dstURI mongodb://127.0.0.1:27003 --metaURI mongodb://127.0.0.1:27001 --verifyAll
 ```


-To filter namespaces (allow list):
+To check only specific namespaces:


 ```
-./migration_verifier --srcURI mongodb://127.0.0.1:27002 --dstURI mongodb://127.0.0.1:27003 --metaURI mongodb://127.0.0.1:27001 --metaDBName verify_meta --srcNamespace foo.bar --dstNamespace foo.bar --srcNamespace foo.yar --dstNamespace foo.yar --srcNamespace mixed.namespaces --dstNamespace can.work
+./migration_verifier --srcURI mongodb://127.0.0.1:27002 --dstURI mongodb://127.0.0.1:27003 --srcNamespace foo.bar --dstNamespace foo.bar --srcNamespace foo.yar --dstNamespace foo.yar --srcNamespace mixed.namespaces --dstNamespace can.work
 ```


@@ -70,7 +82,7 @@ To set a port, use `--serverPort <port number>`. The default is 27020. Note that

 If you give 0 as the port, a random ephemeral port will be chosen. The log will show the chosen port, and you may also query the OS to learn it (e.g., `lsof -a -iTCP -sTCP:LISTEN -p <pid>`).

-### Using a configuration file
+## Using a configuration file

 To load configuration options from a YAML configuration file, use the `--configFile` parameter.

@@ -83,19 +95,17 @@ metaURI: mongodb://localhost:28012
 ```


-## Send the Verifier Process Commands:
-
+## Send the Verifier Process Commands:

-
-1. After launching the verifier (see above), you can send it requests to get it to start verifying. The verification process is started by using the `check`command. An [optional `filter` parameter](#document-filtering) can be passed within the `check` request body to only check documents within that filter. The verification process will keep running until you tell the verifier to stop. It will keep track of the inconsistencies it has found and will keep checking those inconsistencies hoping that eventually they will resolve.
+1. After launching the verifier (see above), you can send it requests to get it to start verifying. If you don’t pass the `--start` parameter, verification is started by using the `check` command. An [optional `filter` parameter](#document-filtering) can be passed within the `check` request body to only check documents within that filter. The verification process will keep running until you tell the verifier to stop. It will keep track of the inconsistencies it has found and will keep checking those inconsistencies hoping that eventually they will resolve.

 ```
 curl -H "Content-Type: application/json" -d '{}' http://127.0.0.1:27020/api/v1/check
 ```


-2. Once mongosync has committed the replication, you can tell the verifier that writes have stopped. You can see the state of mongosync’s replication by hitting mongosync’s `progress` endpoint and checking that the state is `COMMITTED`. See the documentation [here](https://www.mongodb.com/docs/cluster-to-cluster-sync/current/reference/api/progress/#response). \
-The verifier will now check to completion to make sure that there are no inconsistencies. The command you need to send the verifier to tell it that the replication is committed is `writesOff`. The command doesn’t block. This means that you will have to poll the verifier to see the status of the verification (see `progress`).
+2. Once writes on the source cluster have stopped, you can tell the verifier that writes have stopped. (You can see the state of mongosync’s replication by hitting mongosync’s `progress` endpoint and checking that the state is `COMMITTED`. See the documentation [here](https://www.mongodb.com/docs/cluster-to-cluster-sync/current/reference/api/progress/#response)). \
+The verifier will now check to completion to make sure that there are no inconsistencies. The command you need to send the verifier here is `writesOff`. The command doesn’t block. This means that you will have to poll the verifier, or watch its logs, to see the status of the verification (see `progress`).

 ```
 curl -H "Content-Type: application/json" -X POST -d '{}' http://127.0.0.1:27020/api/v1/writesOff
@@ -135,6 +145,7 @@ The verifier will now check to completion to make sure that there are no inconsi
 | `--dstNamespace <namespaces>` | destination namespaces to check |
 | `--metaDBName <name>` | name of the database in which to store verification metadata (default: "migration_verification_metadata") |
 | `--docCompareMethod` | How to compare documents. See below for details. |
+| `--start` | Start checking documents right away rather than waiting for a `/check` API request. |
 | `--verifyAll` | If set, verify all user namespaces |
 | `--clean` | If set, drop all previous verification metadata before starting |
 | `--readPreference <value>` | Read preference for reading data from clusters. May be 'primary', 'secondary', 'primaryPreferred', 'secondaryPreferred', or 'nearest' (default: "primary") |
@@ -171,19 +182,6 @@ generation’s mismatches, aggregate like this on the metadata cluster:
 Note that each mismatch includes timestamps. You can cross-reference
 these with the clusters’ oplogs to diagnose problems.

-# Benchmarking Results
-
-Ran on m6id.metal + M40 with 3 replica sets
-
-Command run python3 ./test/benchmark.py --way=recheck remote
-
-When running with 1TB of random data on 3 collections
-
-**In recheck and normal mode it runs at 1.5-2.5gbps per replica** and is **disk bound on each node** (meaning there are not of easy optimizations to make this faster) \
-On default settings it used about **200GB of RAM on m6id.metal machine when using all the cores**
-
-**This means it does about 1TB/20min but it is HIGHLY dependent on the source and dest machines**
-
 # Tests

 This project’s tests run as normal Go tests, to, with `go test`.
@@ -311,9 +309,11 @@ The migration-verifier periodically persists its change stream’s resume token

 # Performance

-The migration-verifier optimizes for the case where a migration’s initial sync is completed **and** change events are relatively infrequent. If you start verification before initial sync finishes, or if the source cluster is too busy, the verification may freeze.
+The verifier has been observed handling test source write loads of 15,000 writes per second. Real-world performance will vary according to several factors, including network latency, cluster resources, and the verifier node’s resources.
+
+## Per-shard verification

-The migration-verifier is also rather resource-hungry. To mitigate this, try limiting its number of workers (i.e., `--numWorkers`), its partition size (`--partitionSizeMB`), and/or its process group’s resource limits (see the `ulimit` command in POSIX OSes).
+If migrating shard-to-shard, you can also verify shard-to-shard to scale verification horizontally. Run 1 verifier per source shard. You can colocate all verifiers’ metadata on the same metadata cluster, but each verifier must use its own database (e.g., `verify90`, `verify1`, …). If that metadata cluster buckles under the load, consider splitting verification across multiple hosts.

 # Document comparison methods

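
The per-shard setup described in the hunk above could be scripted roughly as follows. This is only a sketch under stated assumptions: the `shardPair` type, the `verifierArgs` helper, the example URIs, and the `verify0`, `verify1`, … database naming are all hypothetical. Only the flags themselves (`--srcURI`, `--dstURI`, `--metaURI`, `--metaDBName`, `--serverPort`, `--verifyAll`) come from the README. It shows one argument list per shard pair, with a shared metadata cluster but a distinct metadata database for each verifier.

```go
package main

import (
	"fmt"
	"strings"
)

// shardPair is a hypothetical pairing of one source shard with the
// destination shard it was migrated to.
type shardPair struct {
	SrcURI string
	DstURI string
}

// verifierArgs builds one argument list per shard pair. Every verifier
// points at the same metadata cluster (metaURI) but gets its own
// --metaDBName so that the verifiers do not share state.
func verifierArgs(metaURI string, pairs []shardPair) [][]string {
	all := make([][]string, 0, len(pairs))
	for i, p := range pairs {
		all = append(all, []string{
			"--srcURI", p.SrcURI,
			"--dstURI", p.DstURI,
			"--metaURI", metaURI,
			"--metaDBName", fmt.Sprintf("verify%d", i), // hypothetical naming scheme
			"--serverPort", "0", // ephemeral port, so colocated verifiers do not collide
			"--verifyAll",
		})
	}
	return all
}

func main() {
	// Example URIs are placeholders, not real hosts.
	pairs := []shardPair{
		{"mongodb://src-shard0.example:27017", "mongodb://dst-shard0.example:27017"},
		{"mongodb://src-shard1.example:27017", "mongodb://dst-shard1.example:27017"},
	}
	for _, args := range verifierArgs("mongodb://meta.example:27017", pairs) {
		fmt.Println("./migration_verifier " + strings.Join(args, " "))
	}
}
```

Printing the commands rather than executing them keeps the sketch side-effect free; a real launcher would also need to decide how each verifier is started and monitored.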
@@ -323,11 +323,11 @@ The default. This establishes full binary equivalence, including field order and

 ## `ignoreFieldOrder`

-Like `binary` but ignores the ordering of fields. Incurs extra overhead on this host.
+Like `binary` but ignores the ordering of fields. Incurs extra overhead on the verifier host.

 ## `toHashedIndexKey`

-Compares document hashes (and lengths) rather than full documents. This minimizes the data sent to migration-verifier, which can dramatically shorten verification time.
+Compares document hashes (and lengths) rather than full documents. This minimizes the data sent to migration-verifier, which can dramatically increase performance.

 It carries a few downsides, though:

@@ -339,7 +339,7 @@ The discrepancy _will_, though, usually be seen if the BSON types are of differe

 If, however, _multiple_ numeric type changes happen, then `toHashedIndexKey` will only notice the discrepancy if the total document length changes. For example, if an Int changes to a Long, but elsewhere a Long changes to an Int, that will evade notice.

-The above are all, of course, **highly** unlikely in real-world migrations.
+The above are all **highly** unlikely in real-world migrations.

 ### Lost reporting

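
The Int/Long example in the hunk above can be checked with a little arithmetic on BSON sizes. The sketch below models only the length half of the comparison. It assumes, as the README text implies, that the hash does not distinguish the numeric types. The `field` type and `bsonDocLength` helper are hypothetical and handle only Int32 and Int64 fields. Sizes follow the BSON layout: a 4-byte length prefix, then a type byte, a NUL-terminated name, and a 4- or 8-byte value for each element, then a trailing NUL.

```go
package main

import "fmt"

// field describes one numeric field: its name and whether it is stored
// as a 64-bit Long (otherwise it is a 32-bit Int).
type field struct {
	Name   string
	IsLong bool
}

// bsonDocLength returns the encoded BSON size, in bytes, of a document
// that contains only Int32 and Int64 fields.
func bsonDocLength(fields []field) int {
	size := 4 + 1 // int32 length prefix + trailing 0x00 terminator
	for _, f := range fields {
		size += 1 + len(f.Name) + 1 // type byte + name + NUL
		if f.IsLong {
			size += 8
		} else {
			size += 4
		}
	}
	return size
}

func main() {
	src := []field{{"a", false}, {"b", true}}       // a: Int,  b: Long
	swapped := []field{{"a", true}, {"b", false}}   // a: Long, b: Int
	oneChange := []field{{"a", true}, {"b", true}}  // a: Long, b: Long

	// Two offsetting type changes leave the length unchanged, so a
	// length check cannot see them; a single change alters the length.
	fmt.Println(bsonDocLength(src), bsonDocLength(swapped), bsonDocLength(oneChange))
	// prints "23 23 27"
}
```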
@@ -359,6 +359,6 @@ Additionally, because the amount of data sent to migration-verifier doesn’t ac

 # Limitations

-- The verifier’s iterative process can handle data changes while it is running, until you hit the writesOff endpoint. However, it cannot handle DDL commands. If the verifier receives a DDL change stream event (drop, dropDatabase, rename), the verification will fail. If an untracked DDL event (create, createIndexes, dropIndexes, modify) occurs, the verifier may miss the change.
+- The verifier’s iterative process can handle data changes while it is running, until you hit the writesOff endpoint. However, it cannot handle DDL commands. If the verifier receives a DDL change stream event, the verification will fail.

 - The verifier crashes if it tries to compare time-series collections. The error will include a phrase like “Collection has nil UUID (most probably is a view)” and also mention “timeseries”.

internal/verifier/check.go

Lines changed: 15 additions & 8 deletions
@@ -117,7 +117,19 @@ func (verifier *Verifier) CheckWorker(ctxIn context.Context) error {

 	waitForTaskCreation := 0

-	finishedAllTasks := false
+	var finishedAllTasks bool
+
+	go func() {
+		delay := 30 * time.Second
+
+		time.Sleep(delay)
+
+		for cancelableCtx.Err() == nil {
+			verifier.PrintVerificationSummary(cancelableCtx, GenerationInProgress)
+
+			time.Sleep(delay)
+		}
+	}()

 	eg.Go(func() error {
 		for {
@@ -134,21 +146,16 @@ func (verifier *Verifier) CheckWorker(ctxIn context.Context) error {
 				Any("taskCountsByStatus", verificationStatus).
 				Send()

-			if waitForTaskCreation%2 == 0 {
-				if generation > 0 || verifier.gen0PendingCollectionTasks.Load() == 0 {
-					verifier.PrintVerificationSummary(ctx, GenerationInProgress)
-				}
-			}
-
 			// The generation continues as long as >=1 task for this generation is
 			// “added” or “pending”.
 			if verificationStatus.AddedTasks > 0 || verificationStatus.ProcessingTasks > 0 {
 				waitForTaskCreation++

 				time.Sleep(verifier.verificationStatusCheckInterval)
 			} else {
-				verifier.PrintVerificationSummary(ctx, GenerationComplete)
 				finishedAllTasks = true
+				verifier.PrintVerificationSummary(ctx, GenerationComplete)
+
 				canceler(errors.Errorf("generation %d finished", generation))
 				return nil
 			}
