Unwrap wrapped values before computing their hash. #2967

nicktobey · 2025-05-05T21:08:26Z

(This PR changes the types in one of the tables used in tests. This improves the test coverage for TEXT columns, especially when they're used in conjunction with other tables with VARCHAR columns. None of the existing tests were testing the original type: this should be strictly increasing our test coverage.)

The following plans involve computing a hash of rows to store in an in-memory hash set:

Intersect, Except, and Distinct
HashLookup
InSubquery
HashSubquery
FullJoinIter
ConcatJoin
Recursive CTE
UpdateJoinIter

We weren't previously unwrapping wrapped values before computing hashes. The default hash implementation used the struct's %v representation to compute the hash, which has two problems:

The hash of the %v representation of a wrapper struct is not the same as the hash of the value that the wrapper is semantically equivalent to.
The hash of the %v representation of a wrapper struct depends on internal state, such as whether the wrapped has already been unwrapped once before (and cached the unwrapped value in an internal buffer)

The simplest fix is to unwrap values before computing a row hash in the HashOf function.

However, this fix comes at a cost: it now requires the engine to unwrap all values if they get used in any of the above plans. This will hurt performance for any of the above plans if they don't actually need to unwrap the value. For example, an UpdateJoinIter on a table with a TEXT column will now load that column from disk, even if its value is never used.

A better fix might be to use the Hash() function that is already defined on the sql.Wrapper interface. For all existing Wrapper implementations, this returns the Dolt content address of the value, and is the same regardless of whether or not that address has previously been resolved. However, this would still return a different hash than an equivalent string. If we wanted them to return the same hash, Dolt would need to define a custom hash for strings that computes the Dolt content address of the string if it were to be stored as a Dolt chunk. This would likely be slower than Go's builtin hash for strings, although the performance might be comparable? This would likely result in worse performance for plans that don't use TEXT columns.

zachmu

LGTM, makes sense

nicktobey force-pushed the nicktobey/hash branch 2 times, most recently from cc95006 to ee7f231 Compare May 5, 2025 21:18

Unwrap wrapped values before computing their hash.

d69d4fe

nicktobey force-pushed the nicktobey/hash branch from ee7f231 to d69d4fe Compare May 5, 2025 21:48

zachmu approved these changes May 6, 2025

View reviewed changes

nicktobey merged commit 9a519d1 into main May 6, 2025
7 of 8 checks passed

nicktobey mentioned this pull request May 6, 2025

More panic: interface conversion: interface {} is *val.TextStorage, not string dolthub/dolt#9166

Closed

BrewTestBot mentioned this pull request May 8, 2025

dolt 1.53.1 Homebrew/homebrew-core#222921

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Unwrap wrapped values before computing their hash. #2967

Unwrap wrapped values before computing their hash. #2967

Uh oh!

nicktobey commented May 5, 2025 •

edited

Loading

Uh oh!

zachmu left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Unwrap wrapped values before computing their hash. #2967

Unwrap wrapped values before computing their hash. #2967

Uh oh!

Conversation

nicktobey commented May 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zachmu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nicktobey commented May 5, 2025 •

edited

Loading