Unwrap wrapped values before computing their hash. #2967
Merged
(This PR changes the types in one of the tables used in tests. This improves the test coverage for `TEXT` columns, especially when they're used in conjunction with other tables with `VARCHAR` columns. None of the existing tests were testing the original type, so this should strictly increase our test coverage.)

The following plans involve computing a hash of rows to store in an in-memory hash set:
We weren't previously unwrapping wrapped values before computing hashes. The default hash implementation used the struct's `%v` representation to compute the hash, which has two problems:

- The hash of the `%v` representation of a wrapper struct is not the same as the hash of the value that the wrapper is semantically equivalent to.
- The `%v` representation of a wrapper struct depends on internal state, such as whether the wrapper has already been unwrapped once before (and cached the unwrapped value in an internal buffer).

The simplest fix is to unwrap values before computing a row hash in the `HashOf` function.

However, this fix comes at a cost: the engine must now unwrap every wrapped value that is used in any of the above plans, which hurts performance whenever the plan doesn't otherwise need the unwrapped value. For example, an UpdateJoinIter on a table with a `TEXT` column will now load that column from disk, even if its value is never used.

A better fix might be to use the `Hash()` function that is already defined on the `sql.Wrapper` interface. For all existing Wrapper implementations, this returns the Dolt content address of the value, and is the same regardless of whether or not that address has previously been resolved. However, this would still return a different hash than an equivalent string. If we wanted them to return the same hash, Dolt would need to define a custom hash for strings that computes the Dolt content address the string would have if it were stored as a Dolt chunk. This would likely be slower than Go's builtin hash for strings, although the performance might be comparable, and it would likely result in worse performance for plans that don't use `TEXT` columns.
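
To make the two `%v` problems and the unwrap-before-hash fix concrete, here is a minimal, self-contained sketch. It does not use the real `sql.Wrapper` interface or the real `HashOf` function: `lazyText`, `unwrapper`, `hashFormatted`, and `hashUnwrapped` are hypothetical names invented for illustration, and FNV-1a stands in for whatever hash the engine actually uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// unwrapper is a hypothetical, simplified stand-in for a wrapper interface.
type unwrapper interface {
	Unwrap() string
}

// lazyText is a hypothetical wrapper: it holds an address into some backing
// store and caches the resolved string after the first Unwrap call.
type lazyText struct {
	addr   string            // hypothetical content address
	store  map[string]string // hypothetical backing storage
	cached string            // internal buffer filled in by Unwrap
	loaded bool              // whether cached has been populated
}

func (t *lazyText) Unwrap() string {
	if !t.loaded {
		t.cached = t.store[t.addr]
		t.loaded = true
	}
	return t.cached
}

// hashFormatted hashes the %v representation of v, mirroring the
// "default" behavior described above.
func hashFormatted(v any) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%v", v)
	return h.Sum64()
}

// hashUnwrapped unwraps v first (if it is a wrapper) and then hashes the
// underlying value, which is the shape of the simplest fix.
func hashUnwrapped(v any) uint64 {
	if w, ok := v.(unwrapper); ok {
		v = w.Unwrap()
	}
	return hashFormatted(v)
}

func main() {
	w := &lazyText{addr: "addr1", store: map[string]string{"addr1": "hello"}}

	before := hashFormatted(w)
	w.Unwrap() // fills the internal buffer, which changes the %v output
	after := hashFormatted(w)

	fmt.Println("wrapper hash stable across Unwrap:", before == after)
	fmt.Println("wrapper hash equals string hash:", after == hashFormatted("hello"))
	fmt.Println("unwrapped hash equals string hash:", hashUnwrapped(w) == hashFormatted("hello"))
}
```

The sketch also shows the cost discussed above: `hashUnwrapped` always calls `Unwrap`, so the backing store is read even when the caller only needs the hash.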