-
Notifications
You must be signed in to change notification settings - Fork 9
Description
lsm-tree
can be used in an "untyped" way by setting types of keys, values, and blobs as raw bytes. For example, a user could decide on either of the two following options:
type Typed = Table IO Word64 Word64 Word64
type Untyped = Table IO RawBytes RawBytes RawBytes
The latter option bypasses lsm-tree
's serialisation framework, which is perfectly valid to do, and it is useful in some cases where lsm-tree
's serialisation classes are too restrictive. The caveat is that the onus for (de-)serialisation now falls on the user of lsm-tree
. For example ouroboros-consensus
does this because they want to (de-)serialise keys and values with additional context.
This unearths a problem, however, since some public API functions applied to untyped tables will return keys, values, and blobs that retain unused memory. Inspecting RawBytes
, we see that it is a newtype wrapper around a primitive vector:
-- | Unboxed vectors of primitive types.
data Vector a = Vector {-# UNPACK #-} !Int -- ^ offset
{-# UNPACK #-} !Int -- ^ length
{-# UNPACK #-} !ByteArray -- ^ underlying byte array
A RawBytes
that retains unused memory is slice of a primitive vector. For example:
Vector 0 8 (replicate 8 0) -- not a slice
Vector 1 8 (replicate 8 0) -- a slice
Vector 0 7 (replicate 8 0) -- a slice
Vector 3 4 (replicate 8 0) -- a slice
Ideally, public API functions should never return slices, but some do.
We wrote a reproducer program that inserts a bunch of keys/values/blobs into an untyped table, then performs lookups and range lookups (i.e., cursor reads) while retrieving blobs, and prints the results. The reproducer program and its output is included below:
Inserting keys, values, and blobs
Printing inserted keys, values, and blobs as (bytes, bytes, Maybe bytes)
([0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],Just [0,0,0,0,0,0,0,0])
([0,0,0,0,0,0,0,1],[1,0,0,0,0,0,0,0],Just [1,0,0,0,0,0,0,0])
([0,0,0,0,0,0,0,2],[2,0,0,0,0,0,0,0],Just [2,0,0,0,0,0,0,0])
([0,0,0,0,0,0,0,3],[3,0,0,0,0,0,0,0],Just [3,0,0,0,0,0,0,0])
([0,0,0,0,0,0,0,4],[4,0,0,0,0,0,0,0],Just [4,0,0,0,0,0,0,0])
([0,0,0,0,0,0,0,5],[5,0,0,0,0,0,0,0],Just [5,0,0,0,0,0,0,0])
([0,0,0,0,0,0,0,6],[6,0,0,0,0,0,0,0],Just [6,0,0,0,0,0,0,0])
([0,0,0,0,0,0,0,7],[7,0,0,0,0,0,0,0],Just [7,0,0,0,0,0,0,0])
([0,0,0,0,0,0,0,8],[8,0,0,0,0,0,0,0],Just [8,0,0,0,0,0,0,0])
([0,0,0,0,0,0,0,9],[9,0,0,0,0,0,0,0],Just [9,0,0,0,0,0,0,0])
Performing lookups
Printing looked up values as (bytes, offset, size)
([0,0,0,0,0,0,0,0],0,8)
([1,0,0,0,0,0,0,0],0,8)
([2,0,0,0,0,0,0,0],0,8)
([3,0,0,0,0,0,0,0],0,8)
([4,0,0,0,0,0,0,0],0,8)
([5,0,0,0,0,0,0,0],0,8)
([6,0,0,0,0,0,0,0],0,8)
([7,0,0,0,0,0,0,0],0,8)
([8,0,0,0,0,0,0,0],0,8)
([9,0,0,0,0,0,0,0],0,8)
Printing looked up blobs as (bytes, offset, size)
([0,0,0,0,0,0,0,0],0,8)
([1,0,0,0,0,0,0,0],8,8)
([2,0,0,0,0,0,0,0],16,8)
([3,0,0,0,0,0,0,0],24,8)
([4,0,0,0,0,0,0,0],32,8)
([5,0,0,0,0,0,0,0],40,8)
([6,0,0,0,0,0,0,0],48,8)
([7,0,0,0,0,0,0,0],56,8)
([8,0,0,0,0,0,0,0],64,8)
([9,0,0,0,0,0,0,0],72,8)
Performing a range lookup
Printing looked up keys as (bytes, offset, size)
([0,0,0,0,0,0,0,0],106,8)
([0,0,0,0,0,0,0,1],114,8)
([0,0,0,0,0,0,0,2],122,8)
([0,0,0,0,0,0,0,3],130,8)
([0,0,0,0,0,0,0,4],138,8)
([0,0,0,0,0,0,0,5],106,8)
([0,0,0,0,0,0,0,6],114,8)
([0,0,0,0,0,0,0,7],122,8)
([0,0,0,0,0,0,0,8],130,8)
([0,0,0,0,0,0,0,9],138,8)
Printing looked up values as (bytes, offset, size)
([0,0,0,0,0,0,0,0],146,8)
([1,0,0,0,0,0,0,0],154,8)
([2,0,0,0,0,0,0,0],162,8)
([3,0,0,0,0,0,0,0],170,8)
([4,0,0,0,0,0,0,0],178,8)
([5,0,0,0,0,0,0,0],146,8)
([6,0,0,0,0,0,0,0],154,8)
([7,0,0,0,0,0,0,0],162,8)
([8,0,0,0,0,0,0,0],170,8)
([9,0,0,0,0,0,0,0],178,8)
Printing looked up blobs as (bytes, offset, size)
([0,0,0,0,0,0,0,0],0,8)
([1,0,0,0,0,0,0,0],8,8)
([2,0,0,0,0,0,0,0],16,8)
([3,0,0,0,0,0,0,0],24,8)
([4,0,0,0,0,0,0,0],32,8)
([5,0,0,0,0,0,0,0],40,8)
([6,0,0,0,0,0,0,0],48,8)
([7,0,0,0,0,0,0,0],56,8)
([8,0,0,0,0,0,0,0],64,8)
([9,0,0,0,0,0,0,0],72,8)
What we can see here is that lookups do not return slices, but blob retrieval and range lookups (i.e., cursor reads) do. We should fix this, presumably by force
ing primitive vectors before returning them as raw bytes from public API functions. We should also write tests for this.
Reproducer
{-# OPTIONS_GHC -Wno-orphans #-}
module Main (main) where
import Control.Tracer
import Data.Maybe
import qualified Data.Vector as V
import qualified Data.Vector.Primitive as VP
import Data.Word
import Database.LSMTree
main :: IO ()
main = do
foo
foo :: IO ()
foo =
withOpenSessionIO nullTracer "_temp" $ \s ->
withTableWith conf s $ \(t :: Table IO RawBytes RawBytes RawBytes) -> do
do
putStrLn "Inserting keys, values, and blobs"
let entries = V.fromList [
( serialiseKey k
, serialiseValue v
, fmap serialiseValue mb
)
| (k, v, mb) <- contents
]
inserts t entries
putStrLn "Printing inserted keys, values, and blobs as (bytes, bytes, Maybe bytes)"
mapM_ print entries
do
putStrLn "Performing lookups"
res <- lookups t $ V.fromList [
serialiseKey k
| (k, _, _) <- contents
]
let values = V.map (fromJust . getValue) res
let blobRefs = V.mapMaybe getBlob res
blobs <- retrieveBlobs s blobRefs
putStrLn "Printing looked up values as (bytes, offset, size)"
mapM_ print $ fmap deconstruct values
putStrLn "Printing looked up blobs as (bytes, offset, size)"
mapM_ print $ fmap deconstruct blobs
do
putStrLn "Performing a range lookup"
res <- rangeLookup t $
FromToIncluding (serialiseKey @Word64 minBound)
(serialiseKey @Word64 maxBound)
let keys = V.map getEntryKey res
let values = V.map getEntryValue res
let blobRefs = V.mapMaybe getEntryBlob res
blobs <- retrieveBlobs s blobRefs
putStrLn "Printing looked up keys as (bytes, offset, size)"
mapM_ print $ fmap deconstruct keys
putStrLn "Printing looked up values as (bytes, offset, size)"
mapM_ print $ fmap deconstruct values
putStrLn "Printing looked up blobs as (bytes, offset, size)"
mapM_ print $ fmap deconstruct blobs
where
conf = defaultTableConfig {
confWriteBufferAlloc = AllocNumEntries 5
}
contents :: [(Word64, Word64, Maybe Word64)]
contents = [(i, i, Just i) | i <- [0..9]]
instance SerialiseKey RawBytes where
serialiseKey = id
deserialiseKey = id
instance SerialiseValue RawBytes where
serialiseValue = id
deserialiseValue = id
instance ResolveValue RawBytes where
resolve = const
deconstruct :: RawBytes -> (RawBytes, Int, Int)
deconstruct rb@(RawBytes (VP.Vector off sz _)) = (rb, off, sz)