
Commit 81f917d

REP-6380 Add optional hash-based verification. (#122)
This replaces the current `--ignoreFieldOrder` flag with a new `--docCompareMethod` parameter that accepts three values: the two old modes, plus a new one that compares documents via `$toHashedIndexKey`. The new mode is not the default because of the precision issues described in the README's new section, but its performance gains probably outweigh those concerns for most migrations. In a reference test, an initial scan that took 37 minutes with `binary` comparison took under 15 minutes with the new method.
1 parent 9af8b54 commit 81f917d
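The idea behind the new mode can be sketched in Go. Rather than shipping full documents to the verifier, each side reduces a document to a (hash, length) pair, and only those pairs are compared. This sketch uses FNV-1a as a stand-in hash; the real verifier relies on the server-side `$toHashedIndexKey` operator, and `docSummary`/`summarize` are hypothetical names, not verifier code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// docSummary is a hypothetical stand-in for what hashed comparison works
// with: a hash of the document bytes plus the document's length, instead
// of the full document.
type docSummary struct {
	hash   uint64
	length int
}

// summarize hashes raw document bytes with FNV-1a. The actual verifier
// uses the server's $toHashedIndexKey hash; FNV here is only illustrative.
func summarize(raw []byte) docSummary {
	h := fnv.New64a()
	h.Write(raw)
	return docSummary{hash: h.Sum64(), length: len(raw)}
}

func main() {
	src := []byte(`{"_id": 1, "n": 42}`)
	dst := []byte(`{"_id": 1, "n": 43}`)

	// Identical bytes produce equal summaries.
	fmt.Println(summarize(src) == summarize(src))
	// A changed value yields a different summary (the hash differs even
	// though the length is the same).
	fmt.Println(summarize(src) == summarize(dst))
}
```

A hash match with an equal length is treated as "same document," which is exactly where the precision caveats described in the README's new section come from.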

File tree

16 files changed: +639 −196

.github/workflows/all.yml

Lines changed: 15 additions & 1 deletion

```diff
@@ -22,6 +22,15 @@ jobs:
             srcConnStr: mongodb://localhost:27020,localhost:27021,localhost:27022
             dstConnStr: mongodb://localhost:27030,localhost:27031,localhost:27032

+        exclude:
+          - mongodb_versions: [ '4.2', '4.2' ]
+            toHashedIndexKey: true
+          - mongodb_versions: [ '4.2', '4.4' ]
+            toHashedIndexKey: true
+          - mongodb_versions: [ '4.2', '5.0' ]
+            toHashedIndexKey: true
+          - mongodb_versions: [ '4.2', '6.0' ]
+            toHashedIndexKey: true

         # versions are: source, destination
         mongodb_versions:
@@ -33,10 +42,12 @@ jobs:
           - [ '4.4', '4.4' ]
           - [ '4.4', '5.0' ]
           - [ '4.4', '6.0' ]
+          - [ '4.4', '8.0' ]

           - [ '5.0', '5.0' ]
           - [ '5.0', '6.0' ]
           - [ '5.0', '7.0' ]
+          - [ '5.0', '8.0' ]

           - [ '6.0', '6.0' ]
           - [ '6.0', '7.0' ]
@@ -47,6 +58,8 @@ jobs:

           - [ '8.0', '8.0' ]

+        toHashedIndexKey: [true, false]
+
         topology:
           - name: replset
             srcConnStr: mongodb://localhost:27020,localhost:27021,localhost:27022
@@ -67,7 +80,7 @@ jobs:
     # versions need.
     runs-on: ubuntu-22.04

-    name: ${{ matrix.mongodb_versions[0] }} to ${{ matrix.mongodb_versions[1] }}, ${{ matrix.topology.name }}
+    name: ${{ matrix.mongodb_versions[0] }} to ${{ matrix.mongodb_versions[1] }}, ${{ matrix.topology.name }}${{ matrix.toHashedIndexKey && ', hashed doc compare' || '' }}

     steps:
       - run: uname -a
@@ -110,6 +123,7 @@ jobs:
      - name: Test
        run: go test -v ./... -race
        env:
+         MVTEST_DOC_COMPARE_METHOD: ${{matrix.toHashedIndexKey && 'toHashedIndexKey' || ''}}
         MVTEST_SRC: ${{matrix.topology.srcConnStr}}
         MVTEST_DST: ${{matrix.topology.dstConnStr}}
         MVTEST_META: mongodb://localhost:27040
```

README.md

Lines changed: 33 additions & 1 deletion

```diff
@@ -131,7 +131,7 @@ The verifier will now check to completion to make sure that there are no inconsistencies
 | `--srcNamespace <namespaces>` | source namespaces to check |
 | `--dstNamespace <namespaces>` | destination namespaces to check |
 | `--metaDBName <name>` | name of the database in which to store verification metadata (default: "migration_verification_metadata") |
-| `--ignoreFieldOrder` | Whether or not field order is ignored in documents |
+| `--docCompareMethod` | How to compare documents. See below for details. |
 | `--verifyAll` | If set, verify all user namespaces |
 | `--clean` | If set, drop all previous verification metadata before starting |
 | `--readPreference <value>` | Read preference for reading data from clusters. May be 'primary', 'secondary', 'primaryPreferred', 'secondaryPreferred', or 'nearest' (default: "primary") |
@@ -312,6 +312,38 @@ The migration-verifier optimizes for the case where a migration’s initial sync

 The migration-verifier is also rather resource-hungry. To mitigate this, try limiting its number of workers (i.e., `--numWorkers`), its partition size (`--partitionSizeMB`), and/or its process group’s resource limits (see the `ulimit` command in POSIX OSes).

+# Document comparison methods
+
+## `binary`
+
+The default. This establishes full binary equivalence, including field order and all types.
+
+## `ignoreFieldOrder`
+
+Like `binary` but ignores the ordering of fields. Incurs extra processing overhead on the verifier host.
+
+## `toHashedIndexKey`
+
+Compares document hashes (and lengths) rather than full documents. This minimizes the data sent to migration-verifier, which can dramatically shorten verification time.
+
+It carries a few downsides, though:
+
+### Lost precision
+
+This method ignores certain type changes if the underlying value remains the same. For example, if a Long changes to a Double, and the two values are identical, `toHashedIndexKey` will not notice the discrepancy.
+
+The discrepancy _will_, though, usually be seen if the BSON types are of different lengths. For example, if a Long changes to a Decimal, `toHashedIndexKey` will notice that.
+
+If, however, _multiple_ numeric type changes happen, then `toHashedIndexKey` will only notice the discrepancy if the total document length changes. For example, if an Int changes to a Long, but elsewhere a Long changes to an Int, that will evade notice.
+
+The above are all, of course, **highly** unlikely in real-world migrations.
+
+### Lost reporting
+
+Full-document verification methods allow migration-verifier to diagnose mismatches, e.g., by identifying specific changed fields. The only such detail that `toHashedIndexKey` can discern, though, is a change in document length.
+
+Additionally, because the amount of data sent to migration-verifier doesn’t actually reflect the documents’ size, no meaningful statistics are shown concerning the collection data size. Document counts, of course, are still shown.
+
 # Known Issues

 - The verifier may report missing documents on the destination that don’t actually appear to be missing (i.e., a nonexistent problem). This has been hard to reproduce. If missing documents are reported, it is good practice to check for false positives.
```
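The length-based caveats in the new "Lost precision" section can be checked with quick arithmetic over BSON value sizes (4 bytes for Int, 8 for Long and Double, 16 for Decimal, per the BSON spec). This is an illustrative sketch, not verifier code; `docLength` ignores field names and document framing, which are unchanged in these scenarios.

```go
package main

import "fmt"

// Value sizes of BSON numeric types, per the BSON spec.
var bsonValueSize = map[string]int{
	"int32":      4,
	"int64":      8,
	"double":     8,
	"decimal128": 16,
}

// docLength sums the value sizes of a document's fields, ignoring field
// names and framing bytes (which don't change in these scenarios).
func docLength(fieldTypes []string) int {
	total := 0
	for _, t := range fieldTypes {
		total += bsonValueSize[t]
	}
	return total
}

func main() {
	// A lone Long -> Decimal change shifts the document length, so hashed
	// comparison notices it:
	fmt.Println(docLength([]string{"int64"}) != docLength([]string{"decimal128"})) // true

	// But an Int -> Long change in one field, offset by a Long -> Int
	// change in another, leaves the total length unchanged (4+8 == 8+4),
	// so only the hash itself could catch it:
	fmt.Println(docLength([]string{"int32", "int64"}) == docLength([]string{"int64", "int32"})) // true
}
```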
Lines changed: 22 additions & 0 deletions

```diff
@@ -0,0 +1,22 @@
+package comparehashed
+
+func CanCompareDocsViaToHashedIndexKey(
+	version []int,
+) bool {
+	if version[0] >= 8 {
+		return true
+	}
+
+	switch version[0] {
+	case 7:
+		return version[2] >= 6
+	case 6:
+		return version[2] >= 14
+	case 5:
+		return version[2] >= 25
+	case 4:
+		return version[1] == 4 && version[2] >= 29
+	default:
+		return false
+	}
+}
```
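Read as a version gate: hashed comparison requires a server at 8.0+, 7.0.6+, 6.0.14+, 5.0.25+, or 4.4.29+; 4.2 and earlier never qualify. The snippet below copies the function from the diff (renamed `canCompare` to keep the sketch self-contained) and exercises it on a few versions.

```go
package main

import "fmt"

// canCompare reproduces the gate from the comparehashed package: hashed
// document comparison needs a server version whose $toHashedIndexKey
// behavior is usable for verification.
func canCompare(version []int) bool {
	if version[0] >= 8 {
		return true
	}

	switch version[0] {
	case 7:
		return version[2] >= 6
	case 6:
		return version[2] >= 14
	case 5:
		return version[2] >= 25
	case 4:
		return version[1] == 4 && version[2] >= 29
	default:
		return false
	}
}

func main() {
	fmt.Println(canCompare([]int{8, 0, 0}))  // true: any 8.x+
	fmt.Println(canCompare([]int{6, 0, 13})) // false: needs 6.0.14+
	fmt.Println(canCompare([]int{4, 2, 29})) // false: 4.2 never qualifies
}
```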

internal/partitions/partition.go

Lines changed: 61 additions & 48 deletions

```diff
@@ -2,12 +2,14 @@ package partitions

 import (
 	"fmt"
+	"slices"

-	"github.com/10gen/migration-verifier/internal/logger"
 	"github.com/10gen/migration-verifier/internal/util"
+	"github.com/10gen/migration-verifier/option"
 	"github.com/pkg/errors"
 	"go.mongodb.org/mongo-driver/bson"
 	"go.mongodb.org/mongo-driver/bson/primitive"
+	"go.mongodb.org/mongo-driver/mongo"
 )

 // PartitionKey represents the _id of a partition document stored in the destination.
@@ -94,66 +96,75 @@ func (p *Partition) lowerBoundFromCurrent(current bson.Raw) (any, error) {
 	return nil, errors.New("could not find an '_id' element in the raw document")
 }

-// FindCmd constructs the Find command for reading documents from the partition. For capped
-// collections, the sort order will be `$natural` and the `lowerBound` argument is ignored. For
-// all other collections, the collection will be sorted by the `_id` field. The `lowerBound`
-// argument will determine the starting point for the find. If it is `nil`, then the value of
-// `p.Key.Lower`.
-//
-// This always constructs a non-type-bracketed find command.
-func (p *Partition) FindCmd(
-	// TODO (REP-1281)
-	logger *logger.Logger,
-	startAt *primitive.Timestamp,
-	// We only use this for testing.
-	batchSize ...int,
-) bson.D {
-	// Get the bounded query filter from the partition to be used in the Find command.
-	findCmd := bson.D{
-		{"find", p.Ns.Coll},
-		{"collectionUUID", p.Key.SourceUUID},
-		{"readConcern", bson.D{
-			{"level", "majority"},
-			// Start the cursor after the global state's ChangeStreamStartAtTs. Otherwise,
-			// there may be changes made by collection copy prior to change event application's
-			// start time that are not accounted for, leading to potential data
-			// inconsistencies.
-			{"afterClusterTime", startAt},
-		}},
-		// The cursor should not have a timeout.
-		{"noCursorTimeout", true},
+type PartitionQueryParameters struct {
+	filter    option.Option[bson.D]
+	sortField option.Option[string]
+	hint      option.Option[bson.D]
+}
+
+func (pqp PartitionQueryParameters) ToFindOptions() bson.D {
+	doc := bson.D{}
+
+	if theFilter, has := pqp.filter.Get(); has {
+		doc = append(doc, bson.E{"filter", theFilter})
+	}
+
+	pqp.addHintIfNeeded(&doc)
+
+	return doc
+}
+
+func (pqp PartitionQueryParameters) ToAggOptions() bson.D {
+	pl := mongo.Pipeline{}
+
+	if theFilter, has := pqp.filter.Get(); has {
+		pl = append(pl, bson.D{{"$match", theFilter}})
 	}
-	if len(batchSize) > 0 {
-		findCmd = append(findCmd, bson.E{"batchSize", batchSize[0]})
+
+	if theSort, has := pqp.sortField.Get(); has {
+		pl = append(pl, bson.D{{"$sort", bson.D{{theSort, 1}}}})
 	}
-	findOptions := p.GetFindOptions(nil, nil)
-	findCmd = append(findCmd, findOptions...)

-	return findCmd
+	doc := bson.D{
+		{"pipeline", pl},
+	}
+
+	pqp.addHintIfNeeded(&doc)
+
+	return doc
+}
+
+func (pqp PartitionQueryParameters) addHintIfNeeded(docRef *bson.D) {
+	if theHint, has := pqp.hint.Get(); has {
+		*docRef = append(*docRef, bson.E{"hint", theHint})
+	}
 }

-// GetFindOptions returns only the options necessary to do a find on any given collection with this
-// partition. It is intended to allow the same partitioning to be used on different collections
-// (e.g. use the partitions on the source to read the destination for verification)
+// GetQueryParameters returns a PartitionQueryParameters that describes the
+// parameters needed to fetch docs for the partition. It is intended to allow
+// the same partitioning to be used on different collections (e.g. use the
+// partitions on the source to read the destination for verification)
 // If the passed-in buildinfo indicates a mongodb version < 5.0, type bracketing is not used.
 // filterAndPredicates is a slice of filter criteria that's used to construct the "filter" field in the find option.
-func (p *Partition) GetFindOptions(clusterInfo *util.ClusterInfo, filterAndPredicates bson.A) bson.D {
+func (p *Partition) GetQueryParameters(clusterInfo *util.ClusterInfo, filterAndPredicates bson.A) PartitionQueryParameters {
+	params := PartitionQueryParameters{}
+
 	if p == nil {
 		if len(filterAndPredicates) > 0 {
-			return bson.D{{"filter", bson.D{{"$and", filterAndPredicates}}}}
+			params.filter = option.Some(bson.D{{"$and", filterAndPredicates}})
 		}
-		return bson.D{}
+
+		return params
 	}
-	findOptions := bson.D{}
+
 	if p.IsCapped {
 		// For capped collections, sort the documents by their natural order. We deliberately
 		// exclude the ID filter to ensure that documents are inserted in the correct order.
-		sort := bson.E{"sort", bson.D{{"$natural", 1}}}
-		findOptions = append(findOptions, sort)
+		params.sortField = option.Some("$natural")
 	} else {
 		// For non-capped collections, sort by _id to minimize the amount of time
 		// that a given document spends cached in memory.
-		findOptions = append(findOptions, bson.E{"sort", bson.D{{"_id", 1}}})
+		params.sortField = option.Some("_id")

 		// For non-capped collections, the cursor should use the ID filter and the _id index.
 		// Get the bounded query filter from the partition to be used in the Find command.
@@ -167,20 +178,22 @@ func (p *Partition) GetQueryParameters(clusterInfo *util.ClusterInfo, filterAndPredicates bson.A) PartitionQueryParameters {
 			}
 		}

+		filterAndPredicates = slices.Clone(filterAndPredicates)
+
 		if useExprFind {
 			filterAndPredicates = append(filterAndPredicates, p.filterWithExpr())
 		} else {
 			filterAndPredicates = append(filterAndPredicates, p.filterWithExplicitTypeChecks())
 		}

-		hint := bson.E{"hint", bson.D{{"_id", 1}}}
-		findOptions = append(findOptions, hint)
+		params.hint = option.Some(bson.D{{"_id", 1}})
 	}

 	if len(filterAndPredicates) > 0 {
-		findOptions = append(findOptions, bson.E{"filter", bson.D{{"$and", filterAndPredicates}}})
+		params.filter = option.Some(bson.D{{"$and", filterAndPredicates}})
 	}
-	return findOptions
+
+	return params
 }

 // filterWithExpr returns a range filter on _id to be used in a Find query for the
```
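The partition.go refactor replaces a find-only options builder with a parameters struct that can render as either find-command options or an aggregation pipeline. Below is a minimal stand-alone sketch of that pattern, using stand-in types in place of the driver's `bson.D`/`bson.E` and the repo's `option.Option`; all names here are illustrative, not the verifier's actual code.

```go
package main

import "fmt"

// E and D are minimal stand-ins for bson.E and bson.D.
type E struct {
	Key string
	Val any
}
type D []E

// Option is a tiny optional-value type mimicking the repo's option.Option.
type Option[T any] struct {
	val T
	ok  bool
}

func Some[T any](v T) Option[T]    { return Option[T]{val: v, ok: true} }
func (o Option[T]) Get() (T, bool) { return o.val, o.ok }

// queryParams mirrors the idea of PartitionQueryParameters: one source of
// truth that renders as either find options or an aggregation pipeline.
type queryParams struct {
	filter    Option[D]
	sortField Option[string]
	hint      Option[D]
}

// toFindOptions emits flat find-command fields (filter, hint).
func (q queryParams) toFindOptions() D {
	doc := D{}
	if f, has := q.filter.Get(); has {
		doc = append(doc, E{"filter", f})
	}
	if h, has := q.hint.Get(); has {
		doc = append(doc, E{"hint", h})
	}
	return doc
}

// toAggOptions emits the same parameters as $match/$sort pipeline stages,
// plus a top-level hint.
func (q queryParams) toAggOptions() D {
	pl := []D{}
	if f, has := q.filter.Get(); has {
		pl = append(pl, D{{"$match", f}})
	}
	if s, has := q.sortField.Get(); has {
		pl = append(pl, D{{"$sort", D{{s, 1}}}})
	}
	doc := D{{"pipeline", pl}}
	if h, has := q.hint.Get(); has {
		doc = append(doc, E{"hint", h})
	}
	return doc
}

func main() {
	q := queryParams{
		filter:    Some(D{{"_id", D{{"$gte", 5}}}}),
		sortField: Some("_id"),
		hint:      Some(D{{"_id", 1}}),
	}
	fmt.Println(len(q.toFindOptions())) // filter + hint
	fmt.Println(len(q.toAggOptions()))  // pipeline + hint
}
```

The design lets the same partition bounds drive both a `find` (used for plain document reads) and an aggregation (needed for the hashed comparison pipeline) without duplicating filter logic.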
