Commit d87f9fa

Merge pull request #761 from IntersectMBO/dcoutts/wp8-bench-pipelined-3
Improve parallel speedup on pipelined benchmark
2 parents 2f2fe36 + 09f06ae commit d87f9fa

File tree

126 files changed

+421
-166
lines changed


README.md

Lines changed: 33 additions & 0 deletions
@@ -356,6 +356,12 @@ The *disk cache policy* determines if lookup operations use the OS page
 cache. Caching may improve the performance of lookups and updates if
 database access follows certain patterns.
 
+`confMergeBatchSize`
+The merge batch size balances the maximum latency of individual update
+operations against the latency of a sequence of update operations.
+Bigger batches improve overall performance, but some updates will take a
+lot longer than others. The default is to use a large batch size.
+
 ##### Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size <span id="fine_tuning_data_layout" class="anchor"></span>
 
 The configuration parameters `confMergePolicy`, `confSizeRatio`, and
@@ -647,6 +653,33 @@ locality if it is likely to access entries that have nearby keys.
 does not have good spatial or temporal locality. For instance, if the
 access pattern is uniformly random.
 
+##### Fine-tuning: Merge Batch Size <span id="fine_tuning_merge_batch_size" class="anchor"></span>
+
+The *merge batch size* is a micro-tuning parameter; in most cases you do
+not need to think about it and can leave it at its default.
+
+When using the `Incremental` merge schedule, merging is done in batches.
+This is a trade-off: larger batches tend to mean better overall
+performance, but the downside is that while most updates (inserts,
+deletes, upserts) are fast, some are slower (when a batch of merging
+work has to be done).
+
+If you care most about the maximum latency of updates, use a small batch
+size. If you do not care about the latency of individual operations,
+only the latency of the overall sequence of operations, use a large
+batch size. The default is a large batch size, the same size as the
+write buffer itself. The minimum batch size is 1. The maximum batch
+size is the size of the write buffer `confWriteBufferAlloc`.
+
+Note that the actual batch size is the minimum of this configuration
+parameter and the size of the batch of operations performed (e.g.
+`inserts`). So if you consistently use large batches, you can use a
+batch size of 1 and the merge batch size will always be determined by
+the operation batch size.
+
+A further reason to prefer minimal batch sizes is to get good parallel
+work balance when using parallelism.
+
 ### References
 
 The implementation of LSM-trees in this package draws inspiration from:
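The README section added above can be summarised with a short configuration sketch. This is illustrative only, using the names introduced in this commit (`defaultTableConfig`, `confMergeBatchSize`, `MergeBatchSize`); the `lowLatencyConfig` binding is a hypothetical example, not part of the library.

```haskell
import qualified Database.LSMTree as LSM

-- Hypothetical sketch: favour low per-update latency by choosing the
-- smallest merge batch size, rather than the default (write buffer size).
lowLatencyConfig :: LSM.TableConfig
lowLatencyConfig =
  LSM.defaultTableConfig { LSM.confMergeBatchSize = LSM.MergeBatchSize 1 }
```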

bench/macro/lsm-tree-bench-wp8.hs

Lines changed: 19 additions & 7 deletions
@@ -183,13 +183,23 @@ mkTableConfigSetup GlobalOpts{diskCachePolicy} SetupOpts{bloomFilterAlloc} conf
     , LSM.confBloomFilterAlloc = bloomFilterAlloc
     }
 
-mkTableConfigRun :: GlobalOpts -> LSM.TableConfig -> LSM.TableConfig
-mkTableConfigRun GlobalOpts{diskCachePolicy} conf = conf {
-    LSM.confDiskCachePolicy = diskCachePolicy
+mkTableConfigRun :: GlobalOpts -> RunOpts -> LSM.TableConfig -> LSM.TableConfig
+mkTableConfigRun GlobalOpts{diskCachePolicy} RunOpts {pipelined} conf =
+  conf {
+    LSM.confDiskCachePolicy = diskCachePolicy,
+    LSM.confMergeBatchSize  = if pipelined
+                                then LSM.MergeBatchSize 1
+                                else LSM.confMergeBatchSize conf
   }
 
-mkOverrideDiskCachePolicy :: GlobalOpts -> LSM.OverrideDiskCachePolicy
-mkOverrideDiskCachePolicy GlobalOpts{diskCachePolicy} = LSM.OverrideDiskCachePolicy diskCachePolicy
+mkTableConfigOverride :: GlobalOpts -> RunOpts -> LSM.TableConfigOverride
+mkTableConfigOverride GlobalOpts{diskCachePolicy} RunOpts {pipelined} =
+  LSM.noTableConfigOverride {
+    LSM.overrideDiskCachePolicy = Just diskCachePolicy,
+    LSM.overrideMergeBatchSize  = if pipelined
+                                    then Just (LSM.MergeBatchSize 1)
+                                    else Nothing
+  }
 
 mkTracer :: GlobalOpts -> Tracer IO LSM.LSMTreeTrace
 mkTracer gopts
@@ -585,8 +595,10 @@ doRun gopts opts = do
     -- reference version starts with empty (as it's not practical or
     -- necessary for testing to load the whole snapshot).
     tbl <- if check opts
-             then LSM.newTableWith @IO @K @V @B (mkTableConfigRun gopts benchTableConfig) session
-             else LSM.openTableFromSnapshotWith @IO @K @V @B (mkOverrideDiskCachePolicy gopts) session name label
+             then let conf = mkTableConfigRun gopts opts benchTableConfig
+                   in LSM.newTableWith @IO @K @V @B conf session
+             else let conf = mkTableConfigOverride gopts opts
+                   in LSM.openTableFromSnapshotWith @IO @K @V @B conf session name label
 
     -- In checking mode, compare each output against a pure reference.
     checkvar <- newIORef $ pureReference

lsm-tree.cabal

Lines changed: 31 additions & 0 deletions
@@ -183,6 +183,12 @@ description:
   The /disk cache policy/ determines if lookup operations use the OS page cache.
   Caching may improve the performance of lookups and updates if database access follows certain patterns.
 
+  [@confMergeBatchSize@]
+  The merge batch size balances the maximum latency of individual update
+  operations against the latency of a sequence of update operations. Bigger
+  batches improve overall performance, but some updates will take a lot
+  longer than others. The default is to use a large batch size.
+
 ==== Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size #fine_tuning_data_layout#
 
 The configuration parameters @confMergePolicy@, @confSizeRatio@, and @confWriteBufferAlloc@ affect how the table organises its data.
@@ -429,6 +435,31 @@ description:
 * Use the @DiskCacheNone@ policy if the database's access pattern does not have good spatial or temporal locality.
   For instance, if the access pattern is uniformly random.
 
+==== Fine-tuning: Merge Batch Size #fine_tuning_merge_batch_size#
+
+The /merge batch size/ is a micro-tuning parameter; in most cases you do not
+need to think about it and can leave it at its default.
+
+When using the 'Incremental' merge schedule, merging is done in batches. This
+is a trade-off: larger batches tend to mean better overall performance, but the
+downside is that while most updates (inserts, deletes, upserts) are fast, some
+are slower (when a batch of merging work has to be done).
+
+If you care most about the maximum latency of updates, use a small batch size.
+If you do not care about the latency of individual operations, only the latency
+of the overall sequence of operations, use a large batch size. The default is a
+large batch size, the same size as the write buffer itself. The minimum batch
+size is 1. The maximum batch size is the size of the write buffer
+'confWriteBufferAlloc'.
+
+Note that the actual batch size is the minimum of this configuration parameter
+and the size of the batch of operations performed (e.g. 'inserts'). So if you
+consistently use large batches, you can use a batch size of 1 and the merge
+batch size will always be determined by the operation batch size.
+
+A further reason to prefer minimal batch sizes is to get good parallel work
+balance when using parallelism.
+
 == References
 
 The implementation of LSM-trees in this package draws inspiration from:

src-extras/Database/LSMTree/Extras/NoThunks.hs

Lines changed: 3 additions & 0 deletions
@@ -659,6 +659,9 @@ deriving anyclass instance NoThunks DiskCachePolicy
 deriving stock instance Generic MergeSchedule
 deriving anyclass instance NoThunks MergeSchedule
 
+deriving stock instance Generic MergeBatchSize
+deriving anyclass instance NoThunks MergeBatchSize
+
 {-------------------------------------------------------------------------------
   RWVar
 -------------------------------------------------------------------------------}

src/Database/LSMTree.hs

Lines changed: 15 additions & 11 deletions
@@ -116,7 +116,8 @@ module Database.LSMTree (
         confBloomFilterAlloc,
         confFencePointerIndex,
         confDiskCachePolicy,
-        confMergeSchedule
+        confMergeSchedule,
+        confMergeBatchSize
       ),
     defaultTableConfig,
     MergePolicy (LazyLevelling),
@@ -126,9 +127,11 @@ module Database.LSMTree (
     BloomFilterAlloc (AllocFixed, AllocRequestFPR),
     FencePointerIndexType (OrdinaryIndex, CompactIndex),
     DiskCachePolicy (..),
+    MergeBatchSize (..),
 
     -- ** Table Configuration Overrides #table_configuration_overrides#
-    OverrideDiskCachePolicy (..),
+    TableConfigOverride (..),
+    noTableConfigOverride,
 
     -- * Ranges #ranges#
     Range (..),
@@ -221,11 +224,12 @@ import qualified Database.LSMTree.Internal.BlobRef as Internal
 import Database.LSMTree.Internal.Config
     (BloomFilterAlloc (AllocFixed, AllocRequestFPR),
      DiskCachePolicy (..), FencePointerIndexType (..),
-     LevelNo (..), MergePolicy (..), MergeSchedule (..),
-     SizeRatio (..), TableConfig (..), WriteBufferAlloc (..),
-     defaultTableConfig, serialiseKeyMinimalSize)
+     LevelNo (..), MergeBatchSize (..), MergePolicy (..),
+     MergeSchedule (..), SizeRatio (..), TableConfig (..),
+     WriteBufferAlloc (..), defaultTableConfig,
+     serialiseKeyMinimalSize)
 import Database.LSMTree.Internal.Config.Override
-    (OverrideDiskCachePolicy (..))
+    (TableConfigOverride (..), noTableConfigOverride)
 import Database.LSMTree.Internal.Entry (NumEntries (..))
 import qualified Database.LSMTree.Internal.Entry as Entry
 import Database.LSMTree.Internal.Merge (LevelMergeType (..))
@@ -2600,7 +2604,7 @@ Variant of 'withTableFromSnapshot' that accepts [table configuration overrides](
 withTableFromSnapshotWith ::
   forall k v b a.
   (ResolveValue v) =>
-  OverrideDiskCachePolicy ->
+  TableConfigOverride ->
   Session IO ->
   SnapshotName ->
   SnapshotLabel ->
@@ -2611,7 +2615,7 @@ withTableFromSnapshotWith ::
   forall m k v b a.
   (IOLike m) =>
   (ResolveValue v) =>
-  OverrideDiskCachePolicy ->
+  TableConfigOverride ->
   Session m ->
   SnapshotName ->
   SnapshotLabel ->
@@ -2675,7 +2679,7 @@ openTableFromSnapshot ::
   SnapshotLabel ->
   m (Table m k v b)
 openTableFromSnapshot session snapName snapLabel =
-  openTableFromSnapshotWith NoOverrideDiskCachePolicy session snapName snapLabel
+  openTableFromSnapshotWith noTableConfigOverride session snapName snapLabel
 
 {- |
 Variant of 'openTableFromSnapshot' that accepts [table configuration overrides](#g:table_configuration_overrides).
@@ -2684,7 +2688,7 @@ Variant of 'openTableFromSnapshot' that accepts [table configuration overrides](
 openTableFromSnapshotWith ::
   forall k v b.
   (ResolveValue v) =>
-  OverrideDiskCachePolicy ->
+  TableConfigOverride ->
   Session IO ->
   SnapshotName ->
   SnapshotLabel ->
@@ -2694,7 +2698,7 @@ openTableFromSnapshotWith ::
   forall m k v b.
   (IOLike m) =>
   (ResolveValue v) =>
-  OverrideDiskCachePolicy ->
+  TableConfigOverride ->
   Session m ->
   SnapshotName ->
   SnapshotLabel ->
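The renamed override API in this file can be used as follows. This is a hedged sketch built from the record fields visible in the diff (`noTableConfigOverride`, `overrideMergeBatchSize`); the `session`, `name`, and `label` values in the usage comment are hypothetical.

```haskell
import qualified Database.LSMTree as LSM

-- Sketch: reopen a snapshot overriding only the merge batch size,
-- leaving every other snapshotted configuration parameter unchanged.
smallBatchOverride :: LSM.TableConfigOverride
smallBatchOverride =
  LSM.noTableConfigOverride
    { LSM.overrideMergeBatchSize = Just (LSM.MergeBatchSize 1) }

-- Usage (hypothetical session/name/label):
--   tbl <- LSM.openTableFromSnapshotWith smallBatchOverride session name label
```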

src/Database/LSMTree/Internal/Config.hs

Lines changed: 66 additions & 2 deletions
@@ -26,12 +26,16 @@ module Database.LSMTree.Internal.Config (
   , diskCachePolicyForLevel
     -- * Merge schedule
   , MergeSchedule (..)
+    -- * Merge batch size
+  , MergeBatchSize (..)
+  , creditThresholdForLevel
   ) where
 
 import Control.DeepSeq (NFData (..))
 import Database.LSMTree.Internal.Index (IndexType)
 import qualified Database.LSMTree.Internal.Index as Index
     (IndexType (Compact, Ordinary))
+import qualified Database.LSMTree.Internal.MergingRun as MR
 import qualified Database.LSMTree.Internal.RawBytes as RB
 import Database.LSMTree.Internal.Run (RunDataCaching (..))
 import Database.LSMTree.Internal.RunAcc (RunBloomFilterAlloc (..))
@@ -90,6 +94,12 @@ For a detailed discussion of fine-tuning the table configuration, see [Fine-tuni
 [@confDiskCachePolicy :: t'DiskCachePolicy'@]
     The /disk cache policy/ supports caching lookup operations using the OS page cache.
     Caching may improve the performance of lookups and updates if database access follows certain patterns.
+
+[@confMergeBatchSize :: t'MergeBatchSize'@]
+    The merge batch size balances the maximum latency of individual update
+    operations against the latency of a sequence of update operations. Bigger
+    batches improve overall performance, but some updates will take a lot
+    longer than others. The default is to use a large batch size.
 -}
 data TableConfig = TableConfig {
     confMergePolicy       :: !MergePolicy
@@ -99,12 +109,14 @@ data TableConfig = TableConfig {
   , confBloomFilterAlloc  :: !BloomFilterAlloc
   , confFencePointerIndex :: !FencePointerIndexType
   , confDiskCachePolicy   :: !DiskCachePolicy
+  , confMergeBatchSize    :: !MergeBatchSize
   }
   deriving stock (Show, Eq)
 
 instance NFData TableConfig where
-  rnf (TableConfig a b c d e f g) =
-    rnf a `seq` rnf b `seq` rnf c `seq` rnf d `seq` rnf e `seq` rnf f `seq` rnf g
+  rnf (TableConfig a b c d e f g h) =
+      rnf a `seq` rnf b `seq` rnf c `seq` rnf d `seq`
+      rnf e `seq` rnf f `seq` rnf g `seq` rnf h
 
 -- | The 'defaultTableConfig' defines reasonable defaults for all 'TableConfig' parameters.
 --
@@ -122,6 +134,8 @@ instance NFData TableConfig where
 -- OrdinaryIndex
 -- >>> confDiskCachePolicy defaultTableConfig
 -- DiskCacheAll
+-- >>> confMergeBatchSize defaultTableConfig
+-- MergeBatchSize 20000
 --
 defaultTableConfig :: TableConfig
 defaultTableConfig =
@@ -133,6 +147,7 @@ defaultTableConfig =
     , confBloomFilterAlloc  = AllocRequestFPR 1.0e-3
     , confFencePointerIndex = OrdinaryIndex
     , confDiskCachePolicy   = DiskCacheAll
+    , confMergeBatchSize    = MergeBatchSize 20_000 -- same as write buffer
     }
 
 data RunLevelNo = RegularLevel LevelNo | UnionLevel
@@ -238,6 +253,8 @@ data MergeSchedule =
     The 'Incremental' merge schedule spreads out the merging work over time.
     This is less efficient than the 'OneShot' merge schedule, but has a consistent workload.
     Using the 'Incremental' merge schedule, the worst-case disk I\/O complexity of the update operations is /logarithmic/ in the size of the table.
+    The 'Incremental' merge schedule still uses batching to improve performance.
+    The batch size can be controlled using the 'MergeBatchSize'.
     -}
   | Incremental
   deriving stock (Eq, Show)
@@ -385,3 +402,50 @@ diskCachePolicyForLevel policy levelNo =
       RegularLevel l | l <= LevelNo n -> CacheRunData
                      | otherwise      -> NoCacheRunData
       UnionLevel                      -> NoCacheRunData
+
+{-------------------------------------------------------------------------------
+  Merge batch size
+-------------------------------------------------------------------------------}
+
+{- |
+The /merge batch size/ is a micro-tuning parameter; in most cases you do not
+need to think about it and can leave it at its default.
+
+When using the 'Incremental' merge schedule, merging is done in batches. This
+is a trade-off: larger batches tend to mean better overall performance, but the
+downside is that while most updates (inserts, deletes, upserts) are fast, some
+are slower (when a batch of merging work has to be done).
+
+If you care most about the maximum latency of updates, use a small batch size.
+If you do not care about the latency of individual operations, only the latency
+of the overall sequence of operations, use a large batch size. The default is a
+large batch size, the same size as the write buffer itself. The minimum batch
+size is 1. The maximum batch size is the size of the write buffer
+'confWriteBufferAlloc'.
+
+Note that the actual batch size is the minimum of this configuration parameter
+and the size of the batch of operations performed (e.g. 'inserts'). So if you
+consistently use large batches, you can use a batch size of 1 and the merge
+batch size will always be determined by the operation batch size.
+
+A further reason to prefer minimal batch sizes is to get good parallel work
+balance when using parallelism.
+-}
+newtype MergeBatchSize = MergeBatchSize Int
+  deriving stock (Show, Eq, Ord)
+  deriving newtype (NFData)
+
+-- TODO: the thresholds for doing merge work should be different for each level,
+-- and ideally all-pairs co-prime.
+creditThresholdForLevel :: TableConfig -> LevelNo -> MR.CreditThreshold
+creditThresholdForLevel TableConfig {
+                          confMergeBatchSize   = MergeBatchSize mergeBatchSz,
+                          confWriteBufferAlloc = AllocNumEntries writeBufferSz
+                        }
+                        (LevelNo _i) =
+    MR.CreditThreshold
+  . MR.UnspentCredits
+  . MR.MergeCredits
+  . max 1
+  . min writeBufferSz
+  $ mergeBatchSz
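The clamping in `creditThresholdForLevel` above can be checked with a self-contained sketch that needs no library dependencies; `effectiveBatchSize` is a hypothetical helper mirroring the `max 1 . min writeBufferSz` logic, not part of the package.

```haskell
-- Mirrors the clamp in creditThresholdForLevel: the effective batch size
-- is the configured merge batch size, clamped to [1, write buffer size].
effectiveBatchSize :: Int -> Int -> Int
effectiveBatchSize writeBufferSz mergeBatchSz =
  max 1 (min writeBufferSz mergeBatchSz)

main :: IO ()
main = do
  print (effectiveBatchSize 20000 20000) -- default config: 20000
  print (effectiveBatchSize 20000 1)     -- pipelined benchmark: 1
  print (effectiveBatchSize 20000 50000) -- clamped down to the write buffer
```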
