Skip to content

Commit 0342f3f

Browse files
authored
Fix log rotation id after restart (#2173)
<!-- Describe your change here --> After rotation, we now reset the number of events to 1 (not 0), because the checkpoint event is sourced on restart. This avoids a mismatch between the rotation check on startup and during normal operation. That discrepancy was the cause of inconsistent rotation log ids after restarts. Also, we changed the rotation condition to use (>) instead of (>=), preventing a follow up rotation on start up when the configured threshold is 1 (since checkpointing would immediately trigger a new rotation). Lastly, a checkpoint event id now matches the last persisted event id from its preceding rotated log file, preserving sequential order of event ids across logs. This also makes it easier to identify which rotated log file was used to compute the checkpoint, as its event id matches the file name suffix. --- <!-- Consider each and tick it off one way or the other --> * [X] CHANGELOG updated or not needed * [X] Documentation updated or not needed * [X] Haddocks updated or not needed * [X] No new TODOs introduced or explained herafter
2 parents f44785e + c18902d commit 0342f3f

File tree

10 files changed

+228
-65
lines changed

10 files changed

+228
-65
lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,6 +33,12 @@ changes.
3333
but is enough to resolve the problem until we can identify the central cause
3434
of the issue.
3535

36+
- Fix rotation log id consistency after restart by changing the rotation check to trigger only
37+
when the number of persisted `StateChanged` events exceeds the configured `--persistence-rotate-after` threshold.
38+
* This also prevents immediate rotation on startup when the threshold is set to 1.
39+
* `Checkpoint` event ids now match the suffix of their preceding rotated log file and the last `StateChanged` event id within it,
40+
preserving sequential order and making it easier to identify which rotated log file was used to compute it.
41+
3642
## [0.22.3] - 2025-07-21
3743

3844
* Change behavior of `Hydra.Network.Etcd` to fallback to earliest possible

docs/docs/dev/architecture/event-sourcing.md

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -36,3 +36,30 @@ When implementing an event source or sink, you might want to consider testing th
3636
- [ ] Concurrent use of `sourceEvents` is possible
3737

3838
- [ ] General: allocated resources are released (use with/bracket pattern)
39+
40+
### Event Log Rotation
41+
42+
Long-living heads may produce a large number of persisted events, which can impact the restart time of the hydra-node as it needs to read in all the previous to recreate its state.
43+
44+
Event log rotation was introduced to improve recovery times by reducing the number of events that need to be replayed on startup. This is achieved by periodically replacing the current event log with a new one that starts from a checkpoint event, which captures the latest aggregated head state.
45+
46+
Only rotated log files are saved with an incrementing `logId` suffix in their names, while the main `state` log file remains unchanged to preserve backward compatibility. This `logId` suffix corresponds to the ID of the last event included in that file.
47+
Rotation can be enabled via the optional `--persistence-rotate-after` command-line argument, which specifies the number of events after which rotation should occur.
48+
> For example, with `--persistence-rotate-after 100`, you’ll get rotated files named: state-100, state-200, state-300, and so on, each containing 101 events. This is because event IDs start at 0, so state-100 includes 101 state changed events (0–100) without a checkpoint. Subsequent rotated files include a checkpoint plus 100 new state changed events.
49+
50+
Note that a checkpoint event id matches the last persisted event id from the previous rotated log file, preserving the sequential order of event ids across logs.
51+
This also makes it easier to identify which rotated log file was used to compute the checkpoint, as its event id matches the file name suffix.
52+
53+
Depending on the rotation configuration used, the current `state` file may already contain more events than the specified threshold, causing a rotation to occur immediately on startup before any new inputs are processed.
54+
55+
Upon rotation, a server output is produced to notify external agents when a checkpoint occurs, allowing them to perform archival or cleanup actions without interrupting the Hydra Head.
56+
57+
The appropriate value for `--persistence-rotate-after` depends on your specific use case and the expected transaction volume.
58+
59+
> As a rough guideline, in a simple scenario (running a single party on devnet that repeatedly re-spends the same committed UTxO) we observed that setting `--persistence-rotate-after 10000` results in rotated log files of about 8 MB every 3 minutes.
60+
>
61+
> Keep in mind that the size and frequency of rotated files will vary depending on several factors:
62+
> * Transaction sizes: Larger transactions result in larger event payloads.
63+
> * Number of party members: More parties increase the number of L2 protocol messages per snapshot, generating more events.
64+
> * Ledger UTxO size: A higher number of UTxOs increases the size of certain events like snapshots.
65+
> * Transaction throughput (TPS): Higher TPS leads to more events being produced over time.

hydra-cluster/exe/hydra-cluster/Main.hs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -33,15 +33,15 @@ run options =
3333
Nothing -> do
3434
withCardanoNodeDevnet fromCardanoNode workDir $ \node -> do
3535
txId <- publishOrReuseHydraScripts tracer node
36-
singlePartyOpenAHead tracer workDir node txId $ \client walletSk _headId -> do
36+
singlePartyOpenAHead tracer workDir node txId persistenceRotateAfter $ \client walletSk _headId -> do
3737
case scenario of
3838
Idle -> forever $ pure ()
3939
RespendUTxO -> do
4040
-- Start respending the same UTxO with a 100ms delay.
4141
-- XXX: Should make this configurable
4242
respendUTxO client walletSk 0.1
4343
where
44-
Options{knownNetwork, stateDirectory, publishHydraScripts, useMithril, scenario} = options
44+
Options{knownNetwork, stateDirectory, publishHydraScripts, useMithril, scenario, persistenceRotateAfter} = options
4545

4646
withRunningCardanoNode tracer workDir network action =
4747
findRunningCardanoNode (contramap FromCardanoNode tracer) workDir network >>= \case

hydra-cluster/src/Hydra/Cluster/Options.hs

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,16 +6,19 @@ import Data.ByteString.Char8 qualified as BSC
66
import Data.List qualified as List
77
import Hydra.Cardano.Api (TxId, deserialiseFromRawBytesHex)
88
import Hydra.Cluster.Fixture (KnownNetwork (..))
9+
import Hydra.Options (persistenceRotateAfterParser)
910
import Hydra.Prelude
1011
import Options.Applicative (Parser, eitherReader, flag, flag', help, long, metavar, strOption)
1112
import Options.Applicative.Builder (option)
13+
import Test.QuickCheck (Positive)
1214

1315
data Options = Options
1416
{ knownNetwork :: Maybe KnownNetwork
1517
, stateDirectory :: Maybe FilePath
1618
, publishHydraScripts :: PublishOrReuse
1719
, useMithril :: UseMithril
1820
, scenario :: Scenario
21+
, persistenceRotateAfter :: Maybe (Positive Natural)
1922
}
2023
deriving stock (Show, Eq, Generic)
2124
deriving anyclass (ToJSON)
@@ -40,6 +43,7 @@ parseOptions =
4043
<*> parsePublishHydraScripts
4144
<*> parseUseMithril
4245
<*> parseScenario
46+
<*> optional persistenceRotateAfterParser
4347
where
4448
parseKnownNetwork =
4549
flag' (Just Preview) (long "preview" <> help "The preview testnet")

hydra-cluster/src/Hydra/Cluster/Scenarios.hs

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -108,7 +108,7 @@ import Hydra.Ledger.Cardano (mkSimpleTx, mkTransferTx, unsafeBuildTransaction)
108108
import Hydra.Ledger.Cardano.Evaluate (maxTxExecutionUnits)
109109
import Hydra.Logging (Tracer, traceWith)
110110
import Hydra.Node.DepositPeriod (DepositPeriod (..))
111-
import Hydra.Options (CardanoChainConfig (..), startChainFrom)
111+
import Hydra.Options (CardanoChainConfig (..), RunOptions (..), startChainFrom)
112112
import Hydra.Tx (HeadId, IsTx (balance), Party, txId)
113113
import Hydra.Tx.ContestationPeriod qualified as CP
114114
import Hydra.Tx.Utils (dummyValidatorScript, verificationKeyToOnChainId)
@@ -155,7 +155,7 @@ import System.FilePath ((</>))
155155
import System.Process (callProcess)
156156
import Test.Hydra.Tx.Fixture (testNetworkId)
157157
import Test.Hydra.Tx.Gen (genKeyPair)
158-
import Test.QuickCheck (choose, elements, generate)
158+
import Test.QuickCheck (Positive, choose, elements, generate)
159159

160160
data EndToEndLog
161161
= ClusterOptions {options :: Options}
@@ -504,10 +504,11 @@ singlePartyOpenAHead ::
504504
FilePath ->
505505
RunningNode ->
506506
[TxId] ->
507+
Maybe (Positive Natural) ->
507508
-- | Continuation called when the head is open
508509
(HydraClient -> SigningKey PaymentKey -> HeadId -> IO a) ->
509510
IO a
510-
singlePartyOpenAHead tracer workDir node hydraScriptsTxId callback =
511+
singlePartyOpenAHead tracer workDir node hydraScriptsTxId persistenceRotateAfter callback =
511512
(`finally` returnFundsToFaucet tracer node Alice) $ do
512513
refuelIfNeeded tracer node Alice 25_000_000
513514
-- Start hydra-node on chain tip
@@ -525,7 +526,9 @@ singlePartyOpenAHead tracer workDir node hydraScriptsTxId callback =
525526
utxoToCommit <- seedFromFaucet node walletVk 100_000_000 (contramap FromFaucet tracer)
526527

527528
let hydraTracer = contramap FromHydraNode tracer
528-
withHydraNode hydraTracer aliceChainConfig workDir 1 aliceSk [] [1] $ \n1 -> do
529+
options <- prepareHydraNode aliceChainConfig workDir 1 aliceSk [] [] id
530+
let options' = options{persistenceRotateAfter}
531+
withPreparedHydraNode hydraTracer workDir 1 options' $ \n1 -> do
529532
-- Initialize & open head
530533
send n1 $ input "Init" []
531534
headId <- waitMatch (10 * blockTime) n1 $ headIsInitializingWith (Set.fromList [alice])

hydra-cluster/test/Test/EndToEndSpec.hs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -108,7 +108,7 @@ import System.Directory (removeDirectoryRecursive, removeFile)
108108
import System.FilePath ((</>))
109109
import Test.Hydra.Tx.Fixture (testNetworkId)
110110
import Test.Hydra.Tx.Gen (genKeyPair, genUTxOFor)
111-
import Test.QuickCheck (generate)
111+
import Test.QuickCheck (Positive (..), generate)
112112
import Prelude qualified
113113

114114
allNodeIds :: [Int]
@@ -206,7 +206,7 @@ spec = around (showLogsOnFailure "EndToEndSpec") $ do
206206

207207
-- Measure restart after rotation
208208
options <- prepareHydraNode offlineConfig tmpDir 1 aliceSk [] [] id
209-
let options' = options{persistenceRotateAfter = Just 10}
209+
let options' = options{persistenceRotateAfter = Just (Positive 10)}
210210
t1 <- getCurrentTime
211211
diff2 <- withPreparedHydraNode (contramap FromHydraNode tracer) tmpDir 1 options' $ \_ -> do
212212
t2 <- getCurrentTime

hydra-node/src/Hydra/Events/Rotation.hs

Lines changed: 17 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,9 @@ import Conduit (MonadUnliftIO, runConduit, runResourceT, (.|))
66
import Control.Concurrent.Class.MonadSTM (modifyTVar', newTVarIO, readTVarIO, writeTVar)
77
import Data.Conduit.Combinators qualified as C
88
import Hydra.Events (EventId, EventSink (..), EventSource (..), HasEventId (..))
9+
import Test.QuickCheck (Positive (..))
910

10-
newtype RotationConfig = RotateAfter Natural
11+
newtype RotationConfig = RotateAfter (Positive Natural)
1112

1213
type LogId = EventId
1314

@@ -52,37 +53,42 @@ newRotatedEventStore config s0 aggregator checkpointer eventStore = do
5253
rotate = const . const $ pure ()
5354
}
5455
where
55-
RotateAfter rotateAfterX = config
56+
RotateAfter (Positive rotateAfterX) = config
5657

5758
aggregateEvents (!n, !_evId, !acc) e = (n + 1, getEventId e, aggregator acc e)
5859

5960
shouldRotate numberOfEventsV = do
6061
currentNumberOfEvents <- readTVarIO numberOfEventsV
61-
pure $ currentNumberOfEvents >= rotateAfterX
62+
-- since rotateAfterX can be any positive number (including 1),
63+
-- we use (>) instead of (>=) to avoid triggering a rotation immediately after a checkpoint,
64+
-- which would lead to an infinite loop
65+
pure $ currentNumberOfEvents > rotateAfterX
6266

6367
rotatedPutEvent numberOfEventsV aggregateStateV event = do
6468
putEvent event
6569
atomically $ do
6670
-- aggregate new state
6771
modifyTVar' aggregateStateV (`aggregator` event)
6872
-- bump numberOfEvents
69-
numberOfEvents <- readTVar numberOfEventsV
70-
let numberOfEvents' = numberOfEvents + 1
71-
writeTVar numberOfEventsV numberOfEvents'
73+
modifyTVar' numberOfEventsV (+ 1)
7274
-- check rotation
7375
whenM (shouldRotate numberOfEventsV) $ do
7476
let eventId = getEventId event
7577
rotateEventLog numberOfEventsV aggregateStateV eventId
7678

7779
rotateEventLog numberOfEventsV aggregateStateV lastEventId = do
78-
-- build checkpoint event
80+
-- build the checkpoint event
7981
now <- getCurrentTime
8082
aggregateState <- readTVarIO aggregateStateV
81-
let checkpoint = checkpointer aggregateState (lastEventId + 1) now
82-
-- rotate with checkpoint
83+
-- the checkpoint has the same event id as the last event persisted
84+
let checkpoint = checkpointer aggregateState lastEventId now
85+
-- the rotated log file name suffix (logId) matches the last event persisted,
86+
-- while the checkpoint event is appended to the new (current) state log file
8387
rotate lastEventId checkpoint
84-
-- clear numberOfEvents + bump logId
88+
-- reset `numberOfEvents` to 1 because
89+
-- the checkpoint event was just appended during rotation
90+
-- and will be sourced from the event store on restart
8591
atomically $ do
86-
writeTVar numberOfEventsV 0
92+
writeTVar numberOfEventsV 1
8793

8894
EventStore{eventSource, eventSink = EventSink{putEvent}, rotate} = eventStore

hydra-node/src/Hydra/Options.hs

Lines changed: 24 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -78,7 +78,7 @@ import Options.Applicative (
7878
)
7979
import Options.Applicative.Builder (str)
8080
import Options.Applicative.Help (vsep)
81-
import Test.QuickCheck (elements, listOf, listOf1, oneof, vectorOf)
81+
import Test.QuickCheck (Positive (..), choose, elements, listOf, listOf1, oneof, vectorOf)
8282

8383
data Command
8484
= Run RunOptions
@@ -193,14 +193,21 @@ data RunOptions = RunOptions
193193
, hydraSigningKey :: FilePath
194194
, hydraVerificationKeys :: [FilePath]
195195
, persistenceDir :: FilePath
196-
, persistenceRotateAfter :: Maybe Natural
196+
, persistenceRotateAfter :: Maybe (Positive Natural)
197197
, chainConfig :: ChainConfig
198198
, ledgerConfig :: LedgerConfig
199199
, whichEtcd :: WhichEtcd
200200
}
201201
deriving stock (Eq, Show, Generic)
202202
deriving anyclass (ToJSON, FromJSON)
203203

204+
-- Orphan instances
205+
instance ToJSON a => ToJSON (Positive a) where
206+
toJSON (Positive a) = toJSON a
207+
208+
instance FromJSON a => FromJSON (Positive a) where
209+
parseJSON v = Positive <$> parseJSON v
210+
204211
-- Orphan instance
205212
instance Arbitrary IP where
206213
arbitrary = IPv4 . toIPv4w <$> arbitrary
@@ -221,7 +228,7 @@ instance Arbitrary RunOptions where
221228
hydraSigningKey <- genFilePath "sk"
222229
hydraVerificationKeys <- reasonablySized (listOf (genFilePath "vk"))
223230
persistenceDir <- genDirPath
224-
persistenceRotateAfter <- arbitrary
231+
persistenceRotateAfter <- oneof [pure Nothing, Just . Positive . fromInteger <$> choose (1, 100000)]
225232
chainConfig <- arbitrary
226233
ledgerConfig <- arbitrary
227234
whichEtcd <- arbitrary
@@ -829,15 +836,22 @@ persistenceDirParser =
829836
\Do not edit these files manually!"
830837
)
831838

832-
persistenceRotateAfterParser :: Parser Natural
839+
persistenceRotateAfterParser :: Parser (Positive Natural)
833840
persistenceRotateAfterParser =
834841
option
835-
auto
842+
(eitherReader validateRotateAfter)
836843
( long "persistence-rotate-after"
837844
<> metavar "NATURAL"
838845
<> help
839-
"The number of Hydra events to trigger rotation (default: no rotation)"
846+
"The number of Hydra events to trigger rotation (default: no rotation).\
847+
\Note it must be a positive number."
840848
)
849+
where
850+
validateRotateAfter :: String -> Either String (Positive Natural)
851+
validateRotateAfter arg =
852+
case readMaybe arg of
853+
Just n | n > 0 -> Right (Positive n)
854+
_ -> Left "--persistence-rotate-after must be a positive number"
841855

842856
hydraNodeCommand :: ParserInfo Command
843857
hydraNodeCommand =
@@ -966,7 +980,7 @@ toArgs
966980
<> concatMap toArgPeer peers
967981
<> maybe [] (\port -> ["--monitoring-port", show port]) monitoringPort
968982
<> ["--persistence-dir", persistenceDir]
969-
<> maybe [] (\rotateAfter -> ["--persistence-rotate-after", show rotateAfter]) persistenceRotateAfter
983+
<> maybe [] (\rotateAfter -> ["--persistence-rotate-after", showPositive rotateAfter]) persistenceRotateAfter
970984
<> argsChainConfig chainConfig
971985
<> argsLedgerConfig
972986
where
@@ -1035,6 +1049,9 @@ toArgs
10351049
{ cardanoLedgerProtocolParametersFile
10361050
} = ledgerConfig
10371051

1052+
showPositive :: Show a => Positive a -> String
1053+
showPositive (Positive x) = show x
1054+
10381055
toArgNodeSocket :: SocketPath -> [String]
10391056
toArgNodeSocket nodeSocket = ["--node-socket", unFile nodeSocket]
10401057

0 commit comments

Comments
 (0)