Skip to content

Conversation

@nicktobey
Copy link
Contributor

@nicktobey nicktobey commented May 5, 2025

(This PR changes the types in one of the tables used in tests. This improves the test coverage for TEXT columns, especially when they're used in conjunction with other tables with VARCHAR columns. None of the existing tests were testing the original type: this should be strictly increasing our test coverage.)

The following plans involve computing a hash of rows to store in an in-memory hash set:

  • Intersect, Except, and Distinct
  • HashLookup
  • InSubquery
  • HashSubquery
  • FullJoinIter
  • ConcatJoin
  • Recursive CTE
  • UpdateJoinIter

We weren't previously unwrapping wrapped values before computing hashes. The default hash implementation used the struct's %v representation to compute the hash, which has two problems:

  • The hash of the %v representation of a wrapper struct is not the same as the hash of the value that the wrapper is semantically equivalent to.
  • The hash of the %v representation of a wrapper struct depends on internal state, such as whether the wrapped has already been unwrapped once before (and cached the unwrapped value in an internal buffer)

The simplest fix is to unwrap values before computing a row hash in the HashOf function.

However, this fix comes at a cost: it now requires the engine to unwrap all values if they get used in any of the above plans. This will hurt performance for any of the above plans if they don't actually need to unwrap the value. For example, an UpdateJoinIter on a table with a TEXT column will now load that column from disk, even if its value is never used.

A better fix might be to use the Hash() function that is already defined on the sql.Wrapper interface. For all existing Wrapper implementations, this returns the Dolt content address of the value, and is the same regardless of whether or not that address has previously been resolved. However, this would still return a different hash than an equivalent string. If we wanted them to return the same hash, Dolt would need to define a custom hash for strings that computes the Dolt content address of the string if it were to be stored as a Dolt chunk. This would likely be slower than Go's builtin hash for strings, although the performance might be comparable? This would likely result in worse performance for plans that don't use TEXT columns.

@nicktobey nicktobey force-pushed the nicktobey/hash branch 2 times, most recently from cc95006 to ee7f231 Compare May 5, 2025 21:18
Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, makes sense

@nicktobey nicktobey merged commit 9a519d1 into main May 6, 2025
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants