Fix log rotation id after restart (#2173)

noonio · web-flow · commit 0342f3fdb5c4 · 2025-08-05T18:46:17.000-06:00

After rotation, we now reset the number of events to 1 (not 0),
because the checkpoint event is sourced, and therefore counted, on restart.
This avoids a mismatch between the rotation check on startup and the one
during normal operation. That discrepancy was the cause of inconsistent
rotation log ids after restarts.

Also, we changed the rotation condition to use (>) instead of (>=),
preventing a follow-up rotation on startup when the configured
threshold is 1 (since checkpointing would otherwise immediately trigger
a new rotation).
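
As a rough illustration of why the strict comparison matters, here is a minimal, self-contained model of the check. It is not the code in `Hydra.Events.Rotation`: the names `shouldRotateOld`, `shouldRotateNew` and `countOnStartup` are made up for this sketch, and the event counter is a plain `Natural` rather than a `TVar`.

```haskell
import Numeric.Natural (Natural)

-- | Old check (hypothetical name): rotate once the count reaches the threshold.
shouldRotateOld :: Natural -> Natural -> Bool
shouldRotateOld threshold numberOfEvents = numberOfEvents >= threshold

-- | New check (hypothetical name): rotate only once the count exceeds it.
shouldRotateNew :: Natural -> Natural -> Bool
shouldRotateNew threshold numberOfEvents = numberOfEvents > threshold

-- | Assumption of this sketch: right after a rotation the current log holds
-- exactly one event, the checkpoint, so a restart sources and counts 1 event.
countOnStartup :: Natural
countOnStartup = 1

main :: IO ()
main = do
  -- Threshold 1: the old check would rotate again immediately on startup.
  print (shouldRotateOld 1 countOnStartup)
  -- The new check does not; it needs one more event after the checkpoint.
  print (shouldRotateNew 1 countOnStartup)
  print (shouldRotateNew 1 (countOnStartup + 1))
```

With a threshold of 1, the `(>=)` variant fires as soon as the checkpoint is counted, while the `(>)` variant waits for one more event.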

Lastly, a checkpoint event id now matches the last persisted event id
from its preceding rotated log file, preserving sequential order of
event ids across logs.

This also makes it easier to identify which rotated log file was used to
compute the checkpoint, as its event id matches the file name suffix.
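
The relationship between event ids, rotated file suffixes and checkpoint ids can be sketched with a small pure simulation. This is only a model under stated assumptions: `Rotation` and `simulate` are hypothetical names, event ids are assumed to arrive in order starting at 0, and the counter is reset to 1 after each rotation as described above.

```haskell
import Numeric.Natural (Natural)

-- | Hypothetical summary of one rotation: the suffix of the rotated file
-- and the id of the checkpoint event that opens the next log.
data Rotation = Rotation
  { rotatedSuffix :: Natural
  , checkpointId :: Natural
  }
  deriving (Show, Eq)

-- | Feed event ids through the counter and collect every rotation.
-- The first argument is the threshold, the second the persisted event ids.
simulate :: Natural -> [Natural] -> [Rotation]
simulate threshold = go 0
 where
  go _ [] = []
  go n (eventId : rest)
    -- Rotate once the count exceeds the threshold; the checkpoint reuses
    -- the id of the last persisted event and the count restarts at 1.
    | n + 1 > threshold = Rotation eventId eventId : go 1 rest
    | otherwise = go (n + 1) rest

main :: IO ()
main = mapM_ print (simulate 100 [0 .. 300])
```

For a threshold of 100 and event ids 0 to 300, this model rotates at ids 100, 200 and 300, matching the `state-100`, `state-200`, `state-300` naming described in the documentation change in this PR.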

---

* [X] CHANGELOG updated or not needed
* [X] Documentation updated or not needed
* [X] Haddocks updated or not needed
* [X] No new TODOs introduced or explained hereafter
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -33,6 +33,12 @@ changes.
   but is enough to resolve the problem until we can identify the central cause
   of the issue.
 
+- Fix rotation log id consistency after restart by changing the rotation check to trigger only
+when the number of persisted `StateChanged` events exceeds the configured `--persistence-rotate-after` threshold.
+  * This also prevents immediate rotation on startup when the threshold is set to 1.
+  * `Checkpoint` event ids now match the suffix of their preceding rotated log file and the last `StateChanged` event id within it,
+  preserving sequential order and making it easier to identify which rotated log file was used to compute it.
+
 ## [0.22.3] - 2025-07-21
 
 * Change behavior of `Hydra.Network.Etcd` to fallback to earliest possible
diff --git a/docs/docs/dev/architecture/event-sourcing.md b/docs/docs/dev/architecture/event-sourcing.md
@@ -36,3 +36,30 @@ When implementing an event source or sink, you might want to consider testing th
   - [ ] Concurrent use of `sourceEvents` is possible
   
 - [ ] General: allocated resources are released (use with/bracket pattern)
+
+### Event Log Rotation
+
+Long-living heads may produce a large number of persisted events, which can impact the restart time of the hydra-node as it needs to read in all the previous events to recreate its state.
+
+Event log rotation was introduced to improve recovery times by reducing the number of events that need to be replayed on startup. This is achieved by periodically replacing the current event log with a new one that starts from a checkpoint event, which captures the latest aggregated head state.
+
+Only rotated log files are saved with an incrementing `logId` suffix in their names, while the main `state` log file remains unchanged to preserve backward compatibility. This `logId` suffix corresponds to the ID of the last event included in that file.
+Rotation can be enabled via the optional `--persistence-rotate-after` command-line argument, which specifies the number of events after which rotation should occur.
+> For example, with `--persistence-rotate-after 100`, you’ll get rotated files named: state-100, state-200, state-300, and so on, each containing 101 events. This is because event IDs start at 0, so state-100 includes 101 state changed events (0–100) without a checkpoint. Subsequent rotated files include a checkpoint plus 100 new state changed events.
+
+Note that a checkpoint event id matches the last persisted event id from the previous rotated log file, preserving the sequential order of event ids across logs.
+This also makes it easier to identify which rotated log file was used to compute the checkpoint, as its event id matches the file name suffix.
+
+Depending on the rotation configuration used, the current `state` file may already contain more events than the specified threshold, causing a rotation to occur immediately on startup before any new inputs are processed.
+
+Upon rotation, a server output is produced to notify external agents when a checkpoint occurs, allowing them to perform archival or cleanup actions without interrupting the Hydra Head.
+
+The appropriate value for `--persistence-rotate-after` depends on your specific use case and the expected transaction volume.
+
+> As a rough guideline, in a simple scenario (running a single party on devnet that repeatedly re-spends the same committed UTxO) we observed that setting `--persistence-rotate-after 10000` results in rotated log files of about 8 MB every 3 minutes.
+>
+> Keep in mind that the size and frequency of rotated files will vary depending on several factors:
+>  * Transaction sizes: Larger transactions result in larger event payloads.
+>  * Number of party members: More parties increase the number of L2 protocol messages per snapshot, generating more events.
+>  * Ledger UTxO size: A higher number of UTxOs increases the size of certain events like snapshots.
+>  * Transaction throughput (TPS): Higher TPS leads to more events being produced over time.
diff --git a/hydra-cluster/exe/hydra-cluster/Main.hs b/hydra-cluster/exe/hydra-cluster/Main.hs
@@ -33,15 +33,15 @@ run options =
         Nothing -> do
           withCardanoNodeDevnet fromCardanoNode workDir $ \node -> do
             txId <- publishOrReuseHydraScripts tracer node
-            singlePartyOpenAHead tracer workDir node txId $ \client walletSk _headId -> do
+            singlePartyOpenAHead tracer workDir node txId persistenceRotateAfter $ \client walletSk _headId -> do
               case scenario of
                 Idle -> forever $ pure ()
                 RespendUTxO -> do
                   -- Start respending the same UTxO with a 100ms delay.
                   -- XXX: Should make this configurable
                   respendUTxO client walletSk 0.1
  where
-  Options{knownNetwork, stateDirectory, publishHydraScripts, useMithril, scenario} = options
+  Options{knownNetwork, stateDirectory, publishHydraScripts, useMithril, scenario, persistenceRotateAfter} = options
 
   withRunningCardanoNode tracer workDir network action =
     findRunningCardanoNode (contramap FromCardanoNode tracer) workDir network >>= \case
diff --git a/hydra-cluster/src/Hydra/Cluster/Options.hs b/hydra-cluster/src/Hydra/Cluster/Options.hs
@@ -6,16 +6,19 @@ import Data.ByteString.Char8 qualified as BSC
 import Data.List qualified as List
 import Hydra.Cardano.Api (TxId, deserialiseFromRawBytesHex)
 import Hydra.Cluster.Fixture (KnownNetwork (..))
+import Hydra.Options (persistenceRotateAfterParser)
 import Hydra.Prelude
 import Options.Applicative (Parser, eitherReader, flag, flag', help, long, metavar, strOption)
 import Options.Applicative.Builder (option)
+import Test.QuickCheck (Positive)
 
 data Options = Options
   { knownNetwork :: Maybe KnownNetwork
   , stateDirectory :: Maybe FilePath
   , publishHydraScripts :: PublishOrReuse
   , useMithril :: UseMithril
   , scenario :: Scenario
+  , persistenceRotateAfter :: Maybe (Positive Natural)
   }
   deriving stock (Show, Eq, Generic)
   deriving anyclass (ToJSON)
@@ -40,6 +43,7 @@ parseOptions =
     <*> parsePublishHydraScripts
     <*> parseUseMithril
     <*> parseScenario
+    <*> optional persistenceRotateAfterParser
  where
   parseKnownNetwork =
     flag' (Just Preview) (long "preview" <> help "The preview testnet")
diff --git a/hydra-cluster/src/Hydra/Cluster/Scenarios.hs b/hydra-cluster/src/Hydra/Cluster/Scenarios.hs
@@ -108,7 +108,7 @@ import Hydra.Ledger.Cardano (mkSimpleTx, mkTransferTx, unsafeBuildTransaction)
 import Hydra.Ledger.Cardano.Evaluate (maxTxExecutionUnits)
 import Hydra.Logging (Tracer, traceWith)
 import Hydra.Node.DepositPeriod (DepositPeriod (..))
-import Hydra.Options (CardanoChainConfig (..), startChainFrom)
+import Hydra.Options (CardanoChainConfig (..), RunOptions (..), startChainFrom)
 import Hydra.Tx (HeadId, IsTx (balance), Party, txId)
 import Hydra.Tx.ContestationPeriod qualified as CP
 import Hydra.Tx.Utils (dummyValidatorScript, verificationKeyToOnChainId)
@@ -155,7 +155,7 @@ import System.FilePath ((</>))
 import System.Process (callProcess)
 import Test.Hydra.Tx.Fixture (testNetworkId)
 import Test.Hydra.Tx.Gen (genKeyPair)
-import Test.QuickCheck (choose, elements, generate)
+import Test.QuickCheck (Positive, choose, elements, generate)
 
 data EndToEndLog
   = ClusterOptions {options :: Options}
@@ -504,10 +504,11 @@ singlePartyOpenAHead ::
   FilePath ->
   RunningNode ->
   [TxId] ->
+  Maybe (Positive Natural) ->
   -- | Continuation called when the head is open
   (HydraClient -> SigningKey PaymentKey -> HeadId -> IO a) ->
   IO a
-singlePartyOpenAHead tracer workDir node hydraScriptsTxId callback =
+singlePartyOpenAHead tracer workDir node hydraScriptsTxId persistenceRotateAfter callback =
   (`finally` returnFundsToFaucet tracer node Alice) $ do
     refuelIfNeeded tracer node Alice 25_000_000
     -- Start hydra-node on chain tip
@@ -525,7 +526,9 @@ singlePartyOpenAHead tracer workDir node hydraScriptsTxId callback =
     utxoToCommit <- seedFromFaucet node walletVk 100_000_000 (contramap FromFaucet tracer)
 
     let hydraTracer = contramap FromHydraNode tracer
-    withHydraNode hydraTracer aliceChainConfig workDir 1 aliceSk [] [1] $ \n1 -> do
+    options <- prepareHydraNode aliceChainConfig workDir 1 aliceSk [] [] id
+    let options' = options{persistenceRotateAfter}
+    withPreparedHydraNode hydraTracer workDir 1 options' $ \n1 -> do
       -- Initialize & open head
       send n1 $ input "Init" []
       headId <- waitMatch (10 * blockTime) n1 $ headIsInitializingWith (Set.fromList [alice])
diff --git a/hydra-cluster/test/Test/EndToEndSpec.hs b/hydra-cluster/test/Test/EndToEndSpec.hs
@@ -108,7 +108,7 @@ import System.Directory (removeDirectoryRecursive, removeFile)
 import System.FilePath ((</>))
 import Test.Hydra.Tx.Fixture (testNetworkId)
 import Test.Hydra.Tx.Gen (genKeyPair, genUTxOFor)
-import Test.QuickCheck (generate)
+import Test.QuickCheck (Positive (..), generate)
 import Prelude qualified
 
 allNodeIds :: [Int]
@@ -206,7 +206,7 @@ spec = around (showLogsOnFailure "EndToEndSpec") $ do
 
         -- Measure restart after rotation
         options <- prepareHydraNode offlineConfig tmpDir 1 aliceSk [] [] id
-        let options' = options{persistenceRotateAfter = Just 10}
+        let options' = options{persistenceRotateAfter = Just (Positive 10)}
         t1 <- getCurrentTime
         diff2 <- withPreparedHydraNode (contramap FromHydraNode tracer) tmpDir 1 options' $ \_ -> do
           t2 <- getCurrentTime
diff --git a/hydra-node/src/Hydra/Events/Rotation.hs b/hydra-node/src/Hydra/Events/Rotation.hs
@@ -6,8 +6,9 @@ import Conduit (MonadUnliftIO, runConduit, runResourceT, (.|))
 import Control.Concurrent.Class.MonadSTM (modifyTVar', newTVarIO, readTVarIO, writeTVar)
 import Data.Conduit.Combinators qualified as C
 import Hydra.Events (EventId, EventSink (..), EventSource (..), HasEventId (..))
+import Test.QuickCheck (Positive (..))
 
-newtype RotationConfig = RotateAfter Natural
+newtype RotationConfig = RotateAfter (Positive Natural)
 
 type LogId = EventId
 
@@ -52,37 +53,42 @@ newRotatedEventStore config s0 aggregator checkpointer eventStore = do
         rotate = const . const $ pure ()
       }
  where
-  RotateAfter rotateAfterX = config
+  RotateAfter (Positive rotateAfterX) = config
 
   aggregateEvents (!n, !_evId, !acc) e = (n + 1, getEventId e, aggregator acc e)
 
   shouldRotate numberOfEventsV = do
     currentNumberOfEvents <- readTVarIO numberOfEventsV
-    pure $ currentNumberOfEvents >= rotateAfterX
+    -- since rotateAfterX can be any positive number (including 1),
+    -- we use (>) instead of (>=) to avoid triggering a rotation immediately after a checkpoint,
+    -- which would lead to an infinite loop
+    pure $ currentNumberOfEvents > rotateAfterX
 
   rotatedPutEvent numberOfEventsV aggregateStateV event = do
     putEvent event
     atomically $ do
       -- aggregate new state
       modifyTVar' aggregateStateV (`aggregator` event)
       -- bump numberOfEvents
-      numberOfEvents <- readTVar numberOfEventsV
-      let numberOfEvents' = numberOfEvents + 1
-      writeTVar numberOfEventsV numberOfEvents'
+      modifyTVar' numberOfEventsV (+ 1)
     -- check rotation
     whenM (shouldRotate numberOfEventsV) $ do
       let eventId = getEventId event
       rotateEventLog numberOfEventsV aggregateStateV eventId
 
   rotateEventLog numberOfEventsV aggregateStateV lastEventId = do
-    -- build checkpoint event
+    -- build the checkpoint event
     now <- getCurrentTime
     aggregateState <- readTVarIO aggregateStateV
-    let checkpoint = checkpointer aggregateState (lastEventId + 1) now
-    -- rotate with checkpoint
+    -- the checkpoint has the same event id as the last event persisted
+    let checkpoint = checkpointer aggregateState lastEventId now
+    -- the rotated log file name suffix (logId) matches the last event persisted,
+    -- while the checkpoint event is appended to the new (current) state log file
     rotate lastEventId checkpoint
-    -- clear numberOfEvents + bump logId
+    -- reset `numberOfEvents` to 1 because
+    -- the checkpoint event was just appended during rotation
+    -- and will be sourced from the event store on restart
     atomically $ do
-      writeTVar numberOfEventsV 0
+      writeTVar numberOfEventsV 1
 
   EventStore{eventSource, eventSink = EventSink{putEvent}, rotate} = eventStore
diff --git a/hydra-node/src/Hydra/Options.hs b/hydra-node/src/Hydra/Options.hs
@@ -78,7 +78,7 @@ import Options.Applicative (
  )
 import Options.Applicative.Builder (str)
 import Options.Applicative.Help (vsep)
-import Test.QuickCheck (elements, listOf, listOf1, oneof, vectorOf)
+import Test.QuickCheck (Positive (..), choose, elements, listOf, listOf1, oneof, vectorOf)
 
 data Command
   = Run RunOptions
@@ -193,14 +193,21 @@ data RunOptions = RunOptions
   , hydraSigningKey :: FilePath
   , hydraVerificationKeys :: [FilePath]
   , persistenceDir :: FilePath
-  , persistenceRotateAfter :: Maybe Natural
+  , persistenceRotateAfter :: Maybe (Positive Natural)
   , chainConfig :: ChainConfig
   , ledgerConfig :: LedgerConfig
   , whichEtcd :: WhichEtcd
   }
   deriving stock (Eq, Show, Generic)
   deriving anyclass (ToJSON, FromJSON)
 
+-- Orphan instances
+instance ToJSON a => ToJSON (Positive a) where
+  toJSON (Positive a) = toJSON a
+
+instance FromJSON a => FromJSON (Positive a) where
+  parseJSON v = Positive <$> parseJSON v
+
 -- Orphan instance
 instance Arbitrary IP where
   arbitrary = IPv4 . toIPv4w <$> arbitrary
@@ -221,7 +228,7 @@ instance Arbitrary RunOptions where
     hydraSigningKey <- genFilePath "sk"
     hydraVerificationKeys <- reasonablySized (listOf (genFilePath "vk"))
     persistenceDir <- genDirPath
-    persistenceRotateAfter <- arbitrary
+    persistenceRotateAfter <- oneof [pure Nothing, Just . Positive . fromInteger <$> choose (1, 100000)]
     chainConfig <- arbitrary
     ledgerConfig <- arbitrary
     whichEtcd <- arbitrary
@@ -829,15 +836,22 @@ persistenceDirParser =
           \Do not edit these files manually!"
     )
 
-persistenceRotateAfterParser :: Parser Natural
+persistenceRotateAfterParser :: Parser (Positive Natural)
 persistenceRotateAfterParser =
   option
-    auto
+    (eitherReader validateRotateAfter)
     ( long "persistence-rotate-after"
         <> metavar "NATURAL"
         <> help
-          "The number of Hydra events to trigger rotation (default: no rotation)"
+          "The number of Hydra events to trigger rotation (default: no rotation). \
+          \Note it must be a positive number."
     )
+ where
+  validateRotateAfter :: String -> Either String (Positive Natural)
+  validateRotateAfter arg =
+    case readMaybe arg of
+      Just n | n > 0 -> Right (Positive n)
+      _ -> Left "--persistence-rotate-after must be a positive number"
 
 hydraNodeCommand :: ParserInfo Command
 hydraNodeCommand =
@@ -966,7 +980,7 @@ toArgs
       <> concatMap toArgPeer peers
       <> maybe [] (\port -> ["--monitoring-port", show port]) monitoringPort
       <> ["--persistence-dir", persistenceDir]
-      <> maybe [] (\rotateAfter -> ["--persistence-rotate-after", show rotateAfter]) persistenceRotateAfter
+      <> maybe [] (\rotateAfter -> ["--persistence-rotate-after", showPositive rotateAfter]) persistenceRotateAfter
       <> argsChainConfig chainConfig
       <> argsLedgerConfig
    where
@@ -1035,6 +1049,9 @@ toArgs
       { cardanoLedgerProtocolParametersFile
       } = ledgerConfig
 
+    showPositive :: Show a => Positive a -> String
+    showPositive (Positive x) = show x
+
 toArgNodeSocket :: SocketPath -> [String]
 toArgNodeSocket nodeSocket = ["--node-socket", unFile nodeSocket]
 
diff --git a/hydra-node/test/Hydra/Events/RotationSpec.hs b/hydra-node/test/Hydra/Events/RotationSpec.hs
diff --git a/hydra-node/test/Hydra/OptionsSpec.hs b/hydra-node/test/Hydra/OptionsSpec.hs