
Commit 6d4610d

dcoutts authored and jorisdral committed
Make the MergeBatchSize an adjustable parameter in TableConfig
Previously it was hard-coded to be the same as the write buffer size. Document what it means as a new tunable parameter. Setting this low (1) is important for getting good parallel work balance on the pipelined WP8 benchmark. It is a crucial change that makes the pipelined version actually improve performance: previously it would only get about a 5 to 10% improvement.
1 parent 2143a57 commit 6d4610d
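
As a usage illustration (not part of this commit's diff): the new parameter is set through `TableConfig`. This sketch assumes only the API added in this commit and exported from `Database.LSMTree` (`confMergeBatchSize`, `MergeBatchSize`, `defaultTableConfig`); the two config names are invented for illustration.

import Database.LSMTree (MergeBatchSize (..), TableConfig (..), defaultTableConfig)

-- Default: a large merge batch (the write buffer size), favouring the overall
-- throughput of a sequence of updates over the worst-case latency of a single one.
throughputConfig :: TableConfig
throughputConfig = defaultTableConfig

-- Minimal batching: bounds the latency of individual updates and, per the
-- commit message, gives better parallel work balance in the pipelined benchmark.
lowLatencyConfig :: TableConfig
lowLatencyConfig = defaultTableConfig { confMergeBatchSize = MergeBatchSize 1 }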

File tree

17 files changed, +202 -42 lines changed

README.md

Lines changed: 33 additions & 0 deletions
@@ -356,6 +356,12 @@ The *disk cache policy* determines if lookup operations use the OS page
 cache. Caching may improve the performance of lookups and updates if
 database access follows certain patterns.
 
+`confMergeBatchSize`
+The merge batch size balances the maximum latency of individual update
+operations against the latency of a sequence of update operations.
+Bigger batches improve overall performance, but some updates will take a
+lot longer than others. The default is to use a large batch size.
+
 ##### Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size <span id="fine_tuning_data_layout" class="anchor"></span>
 
 The configuration parameters `confMergePolicy`, `confSizeRatio`, and
@@ -647,6 +653,33 @@ locality if it is likely to access entries that have nearby keys.
 does not have good spatial or temporal locality. For instance, if the
 access pattern is uniformly random.
 
+##### Fine-tuning: Merge Batch Size <span id="fine_tuning_merge_batch_size" class="anchor"></span>
+
+The *merge batch size* is a micro-tuning parameter, and in most cases
+you do not need to think about it and can leave it at its default.
+
+When using the `Incremental` merge schedule, merging is done in batches.
+This is a trade-off: larger batches tend to mean better overall
+performance, but the downside is that while most updates (inserts,
+deletes, upserts) are fast, some are slower (when a batch of merging
+work has to be done).
+
+If you care most about the maximum latency of updates, then use a small
+batch size. If you don't care about the latency of individual operations,
+just the latency of the overall sequence of operations, then use a large
+batch size. The default is to use a large batch size, the same size as
+the write buffer itself. The minimum batch size is 1. The maximum batch
+size is the size of the write buffer `confWriteBufferAlloc`.
+
+Note that the actual batch size is the minimum of this configuration
+parameter and the size of the batch of operations performed (e.g.
+`inserts`). So if you consistently use large batches, you can use a
+batch size of 1 and the merge batch size will always be determined by
+the operation batch size.
+
+A further reason why it may be preferable to use minimal batch sizes is
+to get good parallel work balance, when using parallelism.
+
 ### References
 
 The implementation of LSM-trees in this package draws inspiration from:

lsm-tree.cabal

Lines changed: 31 additions & 0 deletions
@@ -183,6 +183,12 @@ description:
 The /disk cache policy/ determines if lookup operations use the OS page cache.
 Caching may improve the performance of lookups and updates if database access follows certain patterns.
 
+[@confMergeBatchSize@]
+The merge batch size balances the maximum latency of individual update
+operations against the latency of a sequence of update operations. Bigger
+batches improve overall performance, but some updates will take a lot
+longer than others. The default is to use a large batch size.
+
 ==== Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size #fine_tuning_data_layout#
 
 The configuration parameters @confMergePolicy@, @confSizeRatio@, and @confWriteBufferAlloc@ affect how the table organises its data.
@@ -429,6 +435,31 @@ description:
 * Use the @DiskCacheNone@ policy if the database's access pattern does not have good spatial or temporal locality.
 For instance, if the access pattern is uniformly random.
 
+==== Fine-tuning: Merge Batch Size #fine_tuning_merge_batch_size#
+
+The /merge batch size/ is a micro-tuning parameter, and in most cases you do
+not need to think about it and can leave it at its default.
+
+When using the 'Incremental' merge schedule, merging is done in batches. This
+is a trade-off: larger batches tend to mean better overall performance, but the
+downside is that while most updates (inserts, deletes, upserts) are fast, some
+are slower (when a batch of merging work has to be done).
+
+If you care most about the maximum latency of updates, then use a small batch
+size. If you don't care about the latency of individual operations, just the
+latency of the overall sequence of operations, then use a large batch size. The
+default is to use a large batch size, the same size as the write buffer itself.
+The minimum batch size is 1. The maximum batch size is the size of the write
+buffer 'confWriteBufferAlloc'.
+
+Note that the actual batch size is the minimum of this configuration
+parameter and the size of the batch of operations performed (e.g. 'inserts').
+So if you consistently use large batches, you can use a batch size of 1 and
+the merge batch size will always be determined by the operation batch size.
+
+A further reason why it may be preferable to use minimal batch sizes is to get
+good parallel work balance, when using parallelism.
+
 == References
 
 The implementation of LSM-trees in this package draws inspiration from:

src-extras/Database/LSMTree/Extras/NoThunks.hs

Lines changed: 3 additions & 0 deletions
@@ -659,6 +659,9 @@ deriving anyclass instance NoThunks DiskCachePolicy
 deriving stock instance Generic MergeSchedule
 deriving anyclass instance NoThunks MergeSchedule
 
+deriving stock instance Generic MergeBatchSize
+deriving anyclass instance NoThunks MergeBatchSize
+
 {-------------------------------------------------------------------------------
   RWVar
 -------------------------------------------------------------------------------}

src/Database/LSMTree.hs

Lines changed: 7 additions & 4 deletions
@@ -109,7 +109,8 @@ module Database.LSMTree (
         confBloomFilterAlloc,
         confFencePointerIndex,
         confDiskCachePolicy,
-        confMergeSchedule
+        confMergeSchedule,
+        confMergeBatchSize
       ),
     defaultTableConfig,
     MergePolicy (LazyLevelling),
@@ -119,6 +120,7 @@ module Database.LSMTree (
     BloomFilterAlloc (AllocFixed, AllocRequestFPR),
     FencePointerIndexType (OrdinaryIndex, CompactIndex),
     DiskCachePolicy (..),
+    MergeBatchSize (..),
 
     -- ** Table Configuration Overrides #table_configuration_overrides#
     OverrideDiskCachePolicy (..),
@@ -214,9 +216,10 @@ import qualified Database.LSMTree.Internal.BlobRef as Internal
 import Database.LSMTree.Internal.Config
          (BloomFilterAlloc (AllocFixed, AllocRequestFPR),
          DiskCachePolicy (..), FencePointerIndexType (..),
-         LevelNo (..), MergePolicy (..), MergeSchedule (..),
-         SizeRatio (..), TableConfig (..), WriteBufferAlloc (..),
-         defaultTableConfig, serialiseKeyMinimalSize)
+         LevelNo (..), MergeBatchSize (..), MergePolicy (..),
+         MergeSchedule (..), SizeRatio (..), TableConfig (..),
+         WriteBufferAlloc (..), defaultTableConfig,
+         serialiseKeyMinimalSize)
 import Database.LSMTree.Internal.Config.Override
          (OverrideDiskCachePolicy (..))
 import Database.LSMTree.Internal.Entry (NumEntries (..))

src/Database/LSMTree/Internal/Config.hs

Lines changed: 66 additions & 2 deletions
@@ -26,12 +26,16 @@ module Database.LSMTree.Internal.Config (
   , diskCachePolicyForLevel
     -- * Merge schedule
   , MergeSchedule (..)
+    -- * Merge batch size
+  , MergeBatchSize (..)
+  , creditThresholdForLevel
   ) where
 
 import Control.DeepSeq (NFData (..))
 import Database.LSMTree.Internal.Index (IndexType)
 import qualified Database.LSMTree.Internal.Index as Index
     (IndexType (Compact, Ordinary))
+import qualified Database.LSMTree.Internal.MergingRun as MR
 import qualified Database.LSMTree.Internal.RawBytes as RB
 import Database.LSMTree.Internal.Run (RunDataCaching (..))
 import Database.LSMTree.Internal.RunAcc (RunBloomFilterAlloc (..))
@@ -90,6 +94,12 @@ For a detailed discussion of fine-tuning the table configuration, see [Fine-tuni
 [@confDiskCachePolicy :: t'DiskCachePolicy'@]
 The /disk cache policy/ supports caching lookup operations using the OS page cache.
 Caching may improve the performance of lookups and updates if database access follows certain patterns.
+
+[@confMergeBatchSize :: t'MergeBatchSize'@]
+The merge batch size balances the maximum latency of individual update
+operations against the latency of a sequence of update operations. Bigger
+batches improve overall performance, but some updates will take a lot
+longer than others. The default is to use a large batch size.
 -}
 data TableConfig = TableConfig {
     confMergePolicy :: !MergePolicy
@@ -99,12 +109,14 @@ data TableConfig = TableConfig {
   , confBloomFilterAlloc :: !BloomFilterAlloc
   , confFencePointerIndex :: !FencePointerIndexType
   , confDiskCachePolicy :: !DiskCachePolicy
+  , confMergeBatchSize :: !MergeBatchSize
   }
   deriving stock (Show, Eq)
 
 instance NFData TableConfig where
-  rnf (TableConfig a b c d e f g) =
-    rnf a `seq` rnf b `seq` rnf c `seq` rnf d `seq` rnf e `seq` rnf f `seq` rnf g
+  rnf (TableConfig a b c d e f g h) =
+    rnf a `seq` rnf b `seq` rnf c `seq` rnf d `seq`
+    rnf e `seq` rnf f `seq` rnf g `seq` rnf h
 
 -- | The 'defaultTableConfig' defines reasonable defaults for all 'TableConfig' parameters.
 --
@@ -122,6 +134,8 @@ instance NFData TableConfig where
 -- OrdinaryIndex
 -- >>> confDiskCachePolicy defaultTableConfig
 -- DiskCacheAll
+-- >>> confMergeBatchSize defaultTableConfig
+-- MergeBatchSize 20000
 --
 defaultTableConfig :: TableConfig
 defaultTableConfig =
@@ -133,6 +147,7 @@ defaultTableConfig =
     , confBloomFilterAlloc = AllocRequestFPR 1.0e-3
     , confFencePointerIndex = OrdinaryIndex
     , confDiskCachePolicy = DiskCacheAll
+    , confMergeBatchSize = MergeBatchSize 20_000 -- same as write buffer
     }
 
 data RunLevelNo = RegularLevel LevelNo | UnionLevel
@@ -238,6 +253,8 @@ data MergeSchedule =
 The 'Incremental' merge schedule spreads out the merging work over time.
 This is less efficient than the 'OneShot' merge schedule, but has a consistent workload.
 Using the 'Incremental' merge schedule, the worst-case disk I\/O complexity of the update operations is /logarithmic/ in the size of the table.
+This 'Incremental' merge schedule still uses batching to improve performance.
+The batch size can be controlled using the 'MergeBatchSize'.
 -}
   | Incremental
   deriving stock (Eq, Show)
@@ -385,3 +402,50 @@ diskCachePolicyForLevel policy levelNo =
       RegularLevel l | l <= LevelNo n -> CacheRunData
                      | otherwise -> NoCacheRunData
       UnionLevel -> NoCacheRunData
+
+{-------------------------------------------------------------------------------
+  Merge batch size
+-------------------------------------------------------------------------------}
+
+{- |
+The /merge batch size/ is a micro-tuning parameter, and in most cases you do
+not need to think about it and can leave it at its default.
+
+When using the 'Incremental' merge schedule, merging is done in batches. This
+is a trade-off: larger batches tend to mean better overall performance, but the
+downside is that while most updates (inserts, deletes, upserts) are fast, some
+are slower (when a batch of merging work has to be done).
+
+If you care most about the maximum latency of updates, then use a small batch
+size. If you don't care about the latency of individual operations, just the
+latency of the overall sequence of operations, then use a large batch size. The
+default is to use a large batch size, the same size as the write buffer itself.
+The minimum batch size is 1. The maximum batch size is the size of the write
+buffer 'confWriteBufferAlloc'.
+
+Note that the actual batch size is the minimum of this configuration
+parameter and the size of the batch of operations performed (e.g. 'inserts').
+So if you consistently use large batches, you can use a batch size of 1 and
+the merge batch size will always be determined by the operation batch size.
+
+A further reason why it may be preferable to use minimal batch sizes is to get
+good parallel work balance, when using parallelism.
+-}
+newtype MergeBatchSize = MergeBatchSize Int
+  deriving stock (Show, Eq, Ord)
+  deriving newtype (NFData)
+
+-- TODO: the thresholds for doing merge work should be different for each level,
+-- and ideally all-pairs co-prime.
+creditThresholdForLevel :: TableConfig -> LevelNo -> MR.CreditThreshold
+creditThresholdForLevel TableConfig {
+                          confMergeBatchSize = MergeBatchSize mergeBatchSz,
+                          confWriteBufferAlloc = AllocNumEntries writeBufferSz
+                        }
+                        (LevelNo _i) =
+      MR.CreditThreshold
+    . MR.UnspentCredits
+    . MR.MergeCredits
+    . max 1
+    . min writeBufferSz
+    $ mergeBatchSz
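
A standalone arithmetic sketch of the clamping performed by creditThresholdForLevel above, using plain Ints rather than the library's newtypes (the function name here is illustrative, not part of the commit):

-- Effective merge batch size: the configured value clamped to [1, write buffer size].
effectiveBatchSize :: Int -> Int -> Int
effectiveBatchSize writeBufferSize mergeBatchSize =
    max 1 (min writeBufferSize mergeBatchSize)

-- effectiveBatchSize 20000 20000 == 20000   -- the default: batch = write buffer
-- effectiveBatchSize 20000 1     == 1       -- minimal batching, lowest per-update latency
-- effectiveBatchSize 20000 50000 == 20000   -- larger than the write buffer: clamped down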

src/Database/LSMTree/Internal/Config/Override.hs

Lines changed: 2 additions & 10 deletions
@@ -91,16 +91,8 @@ instance Override DiskCachePolicy SnapshotMetaData where
      in fmap (override rdc) smt
 
 instance Override DiskCachePolicy TableConfig where
-  override confDiskCachePolicy' TableConfig {..}
-    = TableConfig
-        { confMergePolicy,
-          confMergeSchedule,
-          confSizeRatio,
-          confWriteBufferAlloc,
-          confBloomFilterAlloc,
-          confFencePointerIndex,
-          confDiskCachePolicy = confDiskCachePolicy'
-        }
+  override confDiskCachePolicy' tc =
+    tc { confDiskCachePolicy = confDiskCachePolicy' }
 
 instance Override DiskCachePolicy (SnapLevels SnapshotRun) where
   override dcp (SnapLevels (vec :: V.Vector (SnapLevel SnapshotRun))) =

src/Database/LSMTree/Internal/IncomingRun.hs

Lines changed: 0 additions & 7 deletions
@@ -218,13 +218,6 @@ supplyCreditsIncomingRun conf ln (Merging _ nominalDebt nominalCreditsVar mr)
 -- use atomic operations for its counters). We could potentially simplify
 -- MergingRun by dispensing with batching for the MergeCredits counters.
 
--- TODO: the thresholds for doing merge work should be different for each level,
--- maybe co-prime?
-creditThresholdForLevel :: TableConfig -> LevelNo -> MR.CreditThreshold
-creditThresholdForLevel conf (LevelNo _i) =
-    let AllocNumEntries x = confWriteBufferAlloc conf
-     in MR.CreditThreshold (MR.UnspentCredits (MergeCredits x))
-
 -- | Deposit nominal credits in the local credits var, ensuring the total
 -- credits does not exceed the total debt.
 --

src/Database/LSMTree/Internal/Snapshot/Codec.hs

Lines changed: 28 additions & 3 deletions
@@ -294,27 +294,44 @@ instance Encode TableConfig where
             , confBloomFilterAlloc = bloomFilterAlloc
             , confFencePointerIndex = fencePointerIndex
             , confDiskCachePolicy = diskCachePolicy
+            , confMergeBatchSize = mergeBatchSize
             }
         ) =
-      encodeListLen 7
+      encodeListLen 8
         <> encode mergePolicy
         <> encode mergeSchedule
         <> encode sizeRatio
        <> encode writeBufferAlloc
        <> encode bloomFilterAlloc
        <> encode fencePointerIndex
        <> encode diskCachePolicy
+       <> encode mergeBatchSize
 
 instance DecodeVersioned TableConfig where
-  decodeVersioned v = do
-    _ <- decodeListLenOf 7
+  decodeVersioned v@V0 = do
+    decodeListLenOf 7
     confMergePolicy <- decodeVersioned v
     confMergeSchedule <- decodeVersioned v
     confSizeRatio <- decodeVersioned v
     confWriteBufferAlloc <- decodeVersioned v
     confBloomFilterAlloc <- decodeVersioned v
     confFencePointerIndex <- decodeVersioned v
     confDiskCachePolicy <- decodeVersioned v
+    let confMergeBatchSize = case confWriteBufferAlloc of
+          AllocNumEntries n -> MergeBatchSize n
+    pure TableConfig {..}
+
+  -- We introduced the confMergeBatchSize in V1
+  decodeVersioned v@V1 = do
+    decodeListLenOf 8
+    confMergePolicy <- decodeVersioned v
+    confMergeSchedule <- decodeVersioned v
+    confSizeRatio <- decodeVersioned v
+    confWriteBufferAlloc <- decodeVersioned v
+    confBloomFilterAlloc <- decodeVersioned v
+    confFencePointerIndex <- decodeVersioned v
+    confDiskCachePolicy <- decodeVersioned v
+    confMergeBatchSize <- decodeVersioned v
     pure TableConfig {..}
 
 -- MergePolicy
@@ -494,6 +511,14 @@ instance DecodeVersioned MergeSchedule where
       1 -> pure Incremental
       _ -> fail ("[MergeSchedule] Unexpected tag: " <> show tag)
 
+-- MergeBatchSize
+
+instance Encode MergeBatchSize where
+  encode (MergeBatchSize n) = encodeInt n
+
+instance DecodeVersioned MergeBatchSize where
+  decodeVersioned _v = MergeBatchSize <$> decodeInt
+
 {-------------------------------------------------------------------------------
   Encoding and decoding: SnapLevels
 -------------------------------------------------------------------------------}
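
For context on the two decoder clauses above: V0 snapshots predate the new field, so the decoder reconstructs a MergeBatchSize from the write buffer allocation, which reproduces the previously hard-coded behaviour. A simplified standalone sketch of that fallback rule (local types for illustration only, not the codec API):

newtype WriteBufferAlloc = AllocNumEntries Int
newtype MergeBatchSize = MergeBatchSize Int deriving Show

-- V0 fallback: no stored batch size, so default it to the write buffer size.
v0MergeBatchSize :: WriteBufferAlloc -> MergeBatchSize
v0MergeBatchSize (AllocNumEntries n) = MergeBatchSize n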

src/Database/LSMTree/Simple.hs

Lines changed: 4 additions & 3 deletions
@@ -111,6 +111,7 @@ module Database.LSMTree.Simple (
     FencePointerIndexType (OrdinaryIndex, CompactIndex),
     DiskCachePolicy (..),
     MergeSchedule (..),
+    MergeBatchSize (..),
 
     -- ** Table Configuration Overrides #table_configuration_overrides#
     OverrideDiskCachePolicy (..),
@@ -165,9 +166,9 @@ import Data.Vector (Vector)
 import Data.Void (Void)
 import Database.LSMTree (BloomFilterAlloc, CursorClosedError (..),
          DiskCachePolicy, FencePointerIndexType,
-         InvalidSnapshotNameError (..), MergePolicy, MergeSchedule,
-         OverrideDiskCachePolicy (..), Range (..), RawBytes,
-         ResolveAsFirst (..), SerialiseKey (..),
+         InvalidSnapshotNameError (..), MergeBatchSize, MergePolicy,
+         MergeSchedule, OverrideDiskCachePolicy (..), Range (..),
+         RawBytes, ResolveAsFirst (..), SerialiseKey (..),
         SerialiseKeyOrderPreserving, SerialiseValue (..),
         SessionClosedError (..), SizeRatio,
        SnapshotCorruptedError (..),
