From 9d57d24816f50f9ec3fd64a27ae2a909f9420760 Mon Sep 17 00:00:00 2001 From: Stevan Andjelkovic Date: Wed, 21 Dec 2022 14:43:47 +0100 Subject: [PATCH 1/7] feat(part5): expand on code section --- docs/Part05SimulationTesting.md | 90 +++++---------------- src/Part05/StateMachine.hs | 16 ++++ src/Part05/StateMachineDSL.hs | 3 + src/Part05SimulationTesting.lhs | 139 +++++++++++++------------------- 4 files changed, 99 insertions(+), 149 deletions(-) diff --git a/docs/Part05SimulationTesting.md b/docs/Part05SimulationTesting.md index 136d489..c584369 100644 --- a/docs/Part05SimulationTesting.md +++ b/docs/Part05SimulationTesting.md @@ -94,134 +94,88 @@ If the checkers find any problem, we want to be able to reproduce it from a sing ## Code +We’ll link to the most important parts of the code rather than inlining it all here. + -- Let’s start with the state machine (SM) type -- A bit more complex what we’ve seen previously - - input and output types are parameters so that applications with different message types can written - - inputs are split into client requests (synchronous) and internal messages (asynchrous) - - a step in the SM can returns several outputs, this is useful for broadcasting - - outputs can also set and reset timers, which is necessary for implementing retry logic - - when the timers expire the event loop will call the SM’s timeout handler (`smTimeout`) - - in addition to the state we also thread through a seed, `StdGen`, so that the SM can generate random numbers - - there’s also an initisation step (`smInit`) to set up the SM before it’s put to work +Let’s start with the state machine (SM) type [itself](../src/Part05/StateMachine.hs): ``` haskell import Part05.StateMachine () ``` -- In order to make it more ergonomic to write SMs we introduce a domain-specific language (DSL) for it - -- The DSL allows us to use do syntax, do `send`s or register timers anywhere rather than return a list outputs, as well as add pre-conditions via guards and do early 
returns +- In order to make it more ergonomic to write SMs we introduce a domain-specific [language](../src/Part05/StateMachineDSL.hs) (DSL) for it: ``` haskell import Part05.StateMachineDSL () ``` -- The SMs are, as mentioned previously, parametrised by their input and output messages. - -- These parameters will most likely be instantiated with concrete (record-like) datatypes. - -- Network traffic from clients and other nodes in the network will come in as bytes though, so we need a way to decode inputs from bytes and a way to encode outputs as bytes. - -- `Codec`s are used to specify these convertions: +The SMs are parametrised by their input and output messages, which will be instantiated with concrete (struct-like) datatypes. Network traffic from clients and other nodes will come in as bytes though, so we need [a way](../src/Part05/Codec.hs) to decode inputs from bytes and a way to encode outputs as bytes: ``` haskell import Part05.Codec () ``` -- A SM together with its codec constitutes an application and it’s what’s expected from the user -- Several SM and codec pairs together form a `Configuration` -- The event loop expects a configuration at start up +We have now seen everything we need from an application developer’s point of view in terms of what we need to deploy on top of the event loop: an SM and a codec for encoding and decoding inputs and outputs for the SM. We bundle these two things up in a so-called [configuration](../src/Part05/Configuration.hs): ``` haskell import Part05.Configuration () ``` -- We’ve covered what the user needs to provide in order to run an application on top of the event loop, next lets have a look at what the event loop provides +Having covered what the user needs to provide in order to run an application on top of the event loop, next let’s have a look at the event loop itself. 
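Before turning to the event loop, a toy example may help make the SM step shape (`state -> StdGen -> ([Output response message], state, StdGen)`) concrete. This is a self-contained sketch, not the repository's actual types: `Output` is simplified to client replies and node-to-node sends (the real one also carries timer commands), and a plain `Int` seed stands in for `StdGen`. The node is a counter that, on a client request, bumps its count, replies with it, and broadcasts it to its peers:

``` haskell
-- A toy sketch of the SM step shape. Simplifications: Int seed instead
-- of StdGen, and an Output type without the timer commands.
newtype NodeId   = NodeId Int   deriving (Eq, Show)
newtype ClientId = ClientId Int deriving (Eq, Show)

data Input  = ClientReq ClientId | InternalMsg NodeId deriving Show
data Output = Reply ClientId String | Send NodeId String deriving (Eq, Show)

-- state -> seed -> (outputs, state', seed'), mirroring SMStep.
type Step state = state -> Int -> ([Output], state, Int)

-- A counter node: a client request bumps the counter, replies with the
-- new count and broadcasts it to all peers; internal messages are ignored.
step :: [NodeId] -> Input -> Step Int
step peers (ClientReq cid) n seed =
  ( Reply cid (show (n + 1)) : [ Send p (show (n + 1)) | p <- peers ]
  , n + 1
  , lcg seed
  )
step _peers (InternalMsg _from) n seed = ([], n, lcg seed)

-- A tiny linear congruential generator, standing in for StdGen.
lcg :: Int -> Int
lcg s = 6364136223846793005 * s + 1442695040888963407
```

Note how broadcasting falls out of the step returning a *list* of outputs: one client request produces one reply plus one send per peer.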
-- There are three types of events, network inputs (from client requests or from other nodes in the network), timer events (triggered when timers expire), and commands (think of this as admin commands that are sent directly to the event loop, currently there’s only a exit command which makes the event loop stop running) +There are three types of [events](../src/Part05/Event.hs): network inputs (from client requests or from other nodes in the network), timer events (triggered when timers expire), and commands (think of these as admin commands that are sent directly to the event loop, currently there’s only an exit command which makes the event loop stop running): ``` haskell import Part05.Event () ``` -- How are these events created? Depends on how the event loop is deployed: in production or simulation mode +How are these events created? That depends on whether the event loop is [deployed](../src/Part05/Deployment.hs) in production or simulation mode: ``` haskell import Part05.Deployment () ``` -- network interface specifies how to send replies, and respond to clients - -- Network events in a production deployment are created when requests come in on http server +The [network interface](../src/Part05/Network.hs) specifies how to send messages to other nodes and how to respond to clients. - - Client request use POST +In production mode the network interface also starts an HTTP server which generates network events as clients make requests or other nodes send messages. 
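Since client requests are synchronous, the production flow has the HTTP handler put the request on the event queue and wait for the single-threaded worker to produce the response. Here is a minimal sketch of that flow without the actual HTTP layer (names like `worker` and `handleClient` are illustrative, not the repository's API): the handler enqueues the request together with an `MVar` and blocks on it until the worker fills it in:

``` haskell
-- Sketch of the synchronous client-request flow: the "HTTP handler"
-- enqueues (request, reply slot) and blocks; the single-threaded worker
-- dequeues, handles the request, and fills the reply slot in.
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

newtype Request  = Request String
newtype Response = Response String deriving (Eq, Show)

-- The single-threaded worker loop: dequeue, "step the SM", respond.
worker :: Chan (Request, MVar Response) -> IO ()
worker queue = do
  (Request body, replyVar) <- readChan queue
  putMVar replyVar (Response ("handled: " ++ body))
  worker queue

-- What a request handler does for one synchronous client request.
handleClient :: Chan (Request, MVar Response) -> String -> IO Response
handleClient queue body = do
  replyVar <- newEmptyMVar
  writeChan queue (Request body, replyVar)
  takeMVar replyVar  -- block until the worker has produced a response
```

The `MVar` is what turns the asynchronous event queue back into a synchronous request/response from the client's point of view.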
- - Internal messages use PUT - - - since client requests are synchronous, the http server puts the client request on the event queue and waits for the single threaded worker to create a response to the client request… - -- network events in a simulation deployment are created by the simulation itself, rather than from external requests - - - Agenda = priority queue of events - - - network interface: - - { nSend :: NodeId -> NodeId -> ByteString -> IO () , nRespond :: ClientId -> ByteString -> IO () } +In simulation mode there’s no HTTP server; instead we generate client requests on demand using the [client generator](../src/Part05/ClientGenerator.hs). Client replies go directly to the client generator, and messages sent to other nodes don’t actually use the network either, but rather get enqueued to the event (priority) queue. ``` haskell import Part05.Network () -import Part05.AwaitingClients () -import Part05.Agenda () +import Part05.ClientGenerator () ``` -- Timers are registerd by the state machines, and when they expire the event loop creates a timer event for the SM that created it -- This is the same for both production and simulation deployments -- The difference is that in production a real clock is used to check if the timer has expired, while in simulation time is advanced discretely when an event is popped from the event queue +Timers are registered by the state machines, and when they expire the event loop creates a timer event for the SM that created it. + +This is the same for both production and simulation deployments. The only difference is that in production a real clock is used to check if the timer has expired, while in simulation time is advanced discretely when an event is popped from the event queue. 
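The discrete-time trick can be sketched in a few lines. This is a simplified sketch: `Data.Map` stands in for the agenda's priority queue, events are plain strings, and each arrival time holds at most one event. Popping the earliest event jumps the simulated clock straight to that event's arrival time, so no real waiting ever happens:

``` haskell
-- Sketch of the simulation agenda: a priority queue of events keyed by
-- arrival time, where popping an event advances the simulated clock.
import qualified Data.Map.Strict as Map

type Time   = Int
type Agenda = Map.Map Time String  -- event payloads keyed by arrival time

-- Pop the earliest event together with its arrival time.
popEvent :: Agenda -> Maybe ((Time, String), Agenda)
popEvent = Map.minViewWithKey

-- Drain the agenda, recording the clock value at which each event fires.
run :: Agenda -> [(Time, String)]
run agenda = case popEvent agenda of
  Nothing                       -> []
  Just ((arrival, ev), agenda') -> (arrival, ev) : run agenda'
```

With an agenda holding a message arriving at time 5 and a timer expiring at time 30, the clock jumps from 0 to 5 to 30; a 30-second timeout fires without anyone sleeping for 30 seconds.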
``` haskell import Part05.TimerWheel () ``` -- These events get queued up, and thus an order established, by the event loop - - XXX: production - - XXX: simulation - - interface: - - +Network and timer events get queued up in the [event queue](../src/Part05/EventQueue.hs), which is also an interface with different implementations depending on the deployment mode. - data EventQueue = EventQueue - { eqEnqueue :: Event -> IO () - , eqDequeue :: DequeueTimeout -> IO Event - } +In production the event queue is a FIFO queue, while in simulation it’s a [priority queue](../src/Part05/Agenda.hs) sorted by the event’s arrival time. In simulation mode we also append network events to our concurrent [history](../src/Part05/History.hs) which we later use for linearisability checking. ``` haskell import Part05.EventQueue () +import Part05.Agenda () +import Part05.History () ``` -- Now we have all bits to implement the event loop itself +Now we have all bits to implement the [event loop](../src/Part05/EventLoop.hs) itself! 
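As an aside, the event-queue-as-interface idea can be sketched as a record of functions, here paired with a simple `IORef`-backed FIFO such as production mode might use. This is a minimal sketch: the real interface's dequeue also takes a timeout, which we omit here:

``` haskell
-- Sketch of the event queue as an interface (a record of functions), so
-- that production and simulation can plug in different implementations.
import Data.IORef (newIORef, atomicModifyIORef')

newtype Event = Event String deriving (Eq, Show)

data EventQueue = EventQueue
  { eqEnqueue :: Event -> IO ()
  , eqDequeue :: IO (Maybe Event)
  }

-- A FIFO implementation backed by a list in an IORef.
newFifoQueue :: IO EventQueue
newFifoQueue = do
  ref <- newIORef []
  pure EventQueue
    { eqEnqueue = \e -> atomicModifyIORef' ref (\es -> (es ++ [e], ()))
    , eqDequeue = atomicModifyIORef' ref dequeue
    }
  where
    dequeue []        = ([], Nothing)
    dequeue (e : es') = (es', Just e)
```

The simulation deployment would supply a different record with the same shape, backed by the arrival-time-sorted agenda instead of a FIFO.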
``` haskell import Part05.EventLoop () ``` -- Last bits needed for simulation testing: generate traffic, collect concurrent history, debug errors: - ``` haskell import Part05.ClientGenerator () import Part05.History () import Part05.Debug () ``` - -- Finally lets put all this together and develop and simulation test [Viewstamped replication](https://dspace.mit.edu/handle/1721.1/71763) by Brian Oki, Barbra Liskov and James Cowling (2012) - -XXX: Viewstamp replication example… +Finally let’s put all this together and [develop and simulation test](../src/Part05/ViewstampReplication) [Viewstamped replication](https://dspace.mit.edu/handle/1721.1/71763) by Brian Oki, Barbara Liskov and James Cowling (2012). ## Discussion diff --git a/src/Part05/StateMachine.hs b/src/Part05/StateMachine.hs index 47ce988..520efbc 100644 --- a/src/Part05/StateMachine.hs +++ b/src/Part05/StateMachine.hs @@ -28,6 +28,22 @@ newtype TimerId = TimerId Int type SMStep state message response = state -> StdGen -> ([Output response message], state, StdGen) +-- The state machine type is a bit more complex than what we've seen previously: +-- +-- * input and output types are parameters so that applications with different +-- message types can be written; +-- * inputs are split into client requests (synchronous) and internal messages +-- (asynchronous); +-- * a step in the SM can return several outputs, which is useful for +-- broadcasting; +-- * outputs can also set and reset timers, which is necessary for +-- implementing retry logic; +-- * when the timers expire the event loop will call the SM's timeout handler +-- (`smTimeout`); +-- * in addition to the state we also thread through a seed, `StdGen`, so that +-- the SM can generate random numbers; +-- * there's also an initialisation step (`smInit`) to set up the SM before it's +-- put to work. 
data SM state request message response = SM { smState :: state , smInit :: SMStep state message response diff --git a/src/Part05/StateMachineDSL.hs b/src/Part05/StateMachineDSL.hs index c3601eb..8ce62a5 100644 --- a/src/Part05/StateMachineDSL.hs +++ b/src/Part05/StateMachineDSL.hs @@ -21,6 +21,9 @@ import Part05.Time ------------------------------------------------------------------------ +-- The DSL allows us to use do syntax, do `send`s or register timers anywhere +-- rather than return a list of outputs, as well as add pre-conditions via guards +-- and do early returns. type SMM s msg resp a = ContT Guard (StateT s (StateT StdGen (Writer [Output resp msg]))) a diff --git a/src/Part05SimulationTesting.lhs b/src/Part05SimulationTesting.lhs index 8270447..92eee21 100644 --- a/src/Part05SimulationTesting.lhs +++ b/src/Part05SimulationTesting.lhs @@ -232,127 +232,104 @@ is deterministic otherwise we can't do that. Code ---- +We'll link to the most important parts of the code rather than inlining it all +here. 
+ -* Let's start with the state machine (SM) type -* A bit more complex what we've seen previously - - input and output types are parameters so that applications with different message types can written - - inputs are split into client requests (synchronous) and internal messages (asynchrous) - - a step in the SM can returns several outputs, this is useful for broadcasting - - outputs can also set and reset timers, which is necessary for implementing retry logic - - when the timers expire the event loop will call the SM's timeout handler (`smTimeout`) - - in addition to the state we also thread through a seed, `StdGen`, so that the SM can generate random numbers - - there's also an initisation step (`smInit`) to set up the SM before it's put to work +Let's start with the state machine (SM) type +[itself](../src/Part05/StateMachine.hs): > import Part05.StateMachine () * In order to make it more ergonomic to write SMs we introduce a domain-specific - language (DSL) for it - -* The DSL allows us to use do syntax, do `send`s or register timers anywhere - rather than return a list outputs, as well as add pre-conditions via guards - and do early returns + [language](../src/Part05/StateMachineDSL.hs) (DSL) for it: > import Part05.StateMachineDSL () -* The SMs are, as mentioned previously, parametrised by their input and output - messages. - -* These parameters will most likely be instantiated with concrete (record-like) datatypes. - -* Network traffic from clients and other nodes in the network will come in as - bytes though, so we need a way to decode inputs from bytes and a way to encode - outputs as bytes. - -* `Codec`s are used to specify these convertions: +The SMs are parametrised by their input and output messages, which will be +instantiated with concrete (struct-like) datatypes. 
Network traffic from clients +and other nodes will come in as bytes though, so we need [a +way](../src/Part05/Codec.hs) to decode inputs from bytes and a way to encode +outputs as bytes: > import Part05.Codec () -* A SM together with its codec constitutes an application and it's what's expected from the user -* Several SM and codec pairs together form a `Configuration` -* The event loop expects a configuration at start up +We have now seen everything we need from an application developer's point of +view in terms of what we need to deploy on top of the event loop: an SM and a +codec for encoding and decoding inputs and outputs for the SM. We bundle these +two things up in a so-called [configuration](../src/Part05/Configuration.hs): > import Part05.Configuration () -* We've covered what the user needs to provide in order to run an application on - top of the event loop, next lets have a look at what the event loop provides +Having covered what the user needs to provide in order to run an application on +top of the event loop, next let's have a look at the event loop itself. -* There are three types of events, network inputs (from client requests or from - other nodes in the network), timer events (triggered when timers expire), and - commands (think of this as admin commands that are sent directly to the event - loop, currently there's only a exit command which makes the event loop stop - running) +There are three types of [events](../src/Part05/Event.hs): network inputs (from +client requests or from other nodes in the network), timer events (triggered +when timers expire), and commands (think of these as admin commands that are sent +directly to the event loop, currently there's only an exit command which makes +the event loop stop running): > import Part05.Event () -* How are these events created? Depends on how the event loop is deployed: in - production or simulation mode +How are these events created? 
That depends on whether the event loop is +[deployed](../src/Part05/Deployment.hs) in production or simulation mode: > import Part05.Deployment () -* network interface specifies how to send replies, and respond to clients +The [network interface](../src/Part05/Network.hs) specifies how to send messages +to other nodes and how to respond to clients. -* Network events in a production deployment are created when requests come in on http server - - Client request use POST - - Internal messages use PUT +In production mode the network interface also starts an HTTP server which +generates network events as clients make requests or other nodes send messages. - - since client requests are synchronous, the http server puts the client - request on the event queue and waits for the single threaded worker to - create a response to the client request... - -* network events in a simulation deployment are created by the simulation itself, rather than from external requests - - Agenda = priority queue of events - - network interface: - ``` - { nSend :: NodeId -> NodeId -> ByteString -> IO () - , nRespond :: ClientId -> ByteString -> IO () } - ``` +In simulation mode there's no HTTP server; instead we generate client requests +on demand using the [client generator](../src/Part05/ClientGenerator.hs). Client +replies go directly to the client generator, and messages sent to other +nodes don't actually use the network either, but rather get enqueued to the +event (priority) queue. > import Part05.Network () -> import Part05.AwaitingClients () -> import Part05.Agenda () +> import Part05.ClientGenerator () + +Timers are registered by the state machines, and when they expire the event loop +creates a timer event for the SM that created it. 
-* Timers are registerd by the state machines, and when they expire the event loop creates a timer event for the SM that created it -* This is the same for both production and simulation deployments -* The difference is that in production a real clock is used to check if the - timer has expired, while in simulation time is advanced discretely when an - event is popped from the event queue +This is the same for both production and simulation deployments. The only +difference is that in production a real clock is used to check if the timer has +expired, while in simulation time is advanced discretely when an event is popped +from the event queue. > import Part05.TimerWheel () -* These events get queued up, and thus an order established, by the event loop - - XXX: production - - XXX: simulation - - interface: - ``` - data EventQueue = EventQueue - { eqEnqueue :: Event -> IO () - , eqDequeue :: DequeueTimeout -> IO Event - } - ``` +Network and timer events get queued up in the [event +queue](../src/Part05/EventQueue.hs), which is also an interface with different +implementations depending on the deployment mode. + +In production the event queue is a FIFO queue, while in simulation it's a +[priority queue](../src/Part05/Agenda.hs) sorted by the event's arrival time. In +simulation mode we also append network events to our concurrent +[history](../src/Part05/History.hs) which we later use for linearisability +checking. > import Part05.EventQueue () +> import Part05.Agenda () +> import Part05.History () -* Now we have all bits to implement the event loop itself +Now we have all bits to implement the [event loop](../src/Part05/EventLoop.hs) +itself! 
> import Part05.EventLoop () -* Last bits needed for simulation testing: generate traffic, collect concurrent - history, debug errors: - -> import Part05.ClientGenerator () -> import Part05.History () -> import Part05.Debug () - -* Finally lets put all this together and develop and simulation test - [Viewstamped replication](https://dspace.mit.edu/handle/1721.1/71763) by Brian - Oki, Barbra Liskov and James Cowling (2012) - -XXX: Viewstamp replication example... +Finally let's put all this together and [develop and simulation +test](../src/Part05/ViewstampReplication) [Viewstamped +replication](https://dspace.mit.edu/handle/1721.1/71763) by Brian Oki, Barbara +Liskov and James Cowling (2012). Discussion ---------- From d6647feefa6381dc72db1eba9eb9bc53b5424b72 Mon Sep 17 00:00:00 2001 From: Stevan Andjelkovic Date: Wed, 21 Dec 2022 14:53:10 +0100 Subject: [PATCH 2/7] fix(part5): typos --- docs/Part05SimulationTesting.md | 32 +++++++++++++++---------------- src/Part05SimulationTesting.lhs | 34 ++++++++++++++++----------------- 2 files changed, 33 insertions(+), 33 deletions(-) diff --git a/docs/Part05SimulationTesting.md b/docs/Part05SimulationTesting.md index c584369..757b82f 100644 --- a/docs/Part05SimulationTesting.md +++ b/docs/Part05SimulationTesting.md @@ -2,7 +2,7 @@ ![](../images/under_construction.gif) -*The code section needs to be turned from a bullet point presentation into a readable text. Before that can be done, we need the last pieces of code: the example and possibly the debugger. The exercises needs to be revisted as well.* +*The code section needs to be turned from a bullet point presentation into a readable text. Before that can be done, we need the last pieces of code: the example and possibly the debugger. The exercises need to be revisited as well.* ## Motivation @@ -40,7 +40,7 @@ For each client write there will be several internal messages between the nodes -How can this be achieved? 
One way would be for the distributed network to elect a leader node and have all client requests go through it, the leader would then replicate the data to all other nodes and confirm enough nodes got it before responding to the client. In case the leader because unavailable, a new leader is elected. In case a node crashes, its state is restored after it restarts by the other nodes. That way as long as enough nodes are available and running we can keep serving client requests. We’ll omit the exact details of how this is achieved or now, but hopefuly we’ve explained enough for it to be possilbe to appreciate that testing all possible corner cases related to those failure modes can be tricky. +How can this be achieved? One way would be for the distributed network to elect a leader node and have all client requests go through it, the leader would then replicate the data to all other nodes and confirm enough nodes got it before responding to the client. In case the leader becomes unavailable, a new leader is elected. In case a node crashes, its state is restored after it restarts by the other nodes. That way as long as enough nodes are available and running we can keep serving client requests. We’ll omit the exact details of how this is achieved for now, but hopefully we’ve explained enough for it to be possible to appreciate that testing all possible corner cases related to those failure modes can be tricky. Next lets sketch how we can implement the data store nodes using state machines (SMs). First recall the type of our SMs: @@ -58,11 +58,11 @@ How does the actual networking happen though? For the “real” / “production” Sometimes when we send internal messages to other nodes they can be dropped by the network, in order to be able to implement retry logic we need to extend the basic functionality of SMs with some notion of being able to keep track of the passage of time. There are many ways to do this, for our particular application we’ll choose timers. 
The way timers work is that SMs can register a timer as part of their output. Typically we’d do something like: send such and such message to such and such node and set a timer for 30s, if we don’t hear back from the node within 30s and reset the timer, then a timer wheel process will enqueue a timer event which the SM can use for doing the retry. -By the way, all this extra stuff that happens outside of the SM is packaged up in a componenet called the event loop. +By the way, all this extra stuff that happens outside of the SM is packaged up in a component called the event loop. Before we deploy the SM to production using the above event loop, we would like to test it for all those tricky failure modes we mentioned before. In order to reuse as much code as possible with the “real” / “production” deployment, we’ll use the same SM and event loop! -How can we possibly reuse the same event loop you might be thinking? The key here is that the networking and timer wheel components of the event loop are implemeneted using interfaces. +How can we possibly reuse the same event loop you might be thinking? The key here is that the networking and timer wheel components of the event loop are implemented using interfaces. The interface for networking has a method for sending internal messages to other nodes and a method for sending responses to clients (note that we don’t need a method for receiving because that’s already done by the event loop and we can merely dequeue from the event queue to get the network events). The interface for time has a method to get the current time as well as setting the current time. 
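That time interface can be sketched as a record of functions (field names here are illustrative, not necessarily the repository's actual API). The "fake" implementation below is just a mutable cell, completely detached from the system clock, which the simulation advances explicitly:

``` haskell
-- Sketch of the time interface: read the current time, set the current
-- time. The fake clock is a mutable cell that only moves when set.
import Data.IORef (newIORef, readIORef, writeIORef)

newtype Time = Time Int deriving (Eq, Ord, Show)

data Clock = Clock
  { cGetCurrentTime :: IO Time
  , cSetCurrentTime :: Time -> IO ()
  }

fakeClock :: Time -> IO Clock
fakeClock t0 = do
  ref <- newIORef t0
  pure Clock
    { cGetCurrentTime = readIORef ref
    , cSetCurrentTime = writeIORef ref
    }
```

When the simulation pops an event with arrival time `t` it calls `cSetCurrentTime` with `t` before stepping the SM, which is exactly what lets timeouts fire without any real waiting.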
@@ -80,7 +80,7 @@ In the “real” / “production” deployment we run one SM, which encodes the Once the SM is stepped and we get its outputs, the “fake” send implementation of the network interface will generate arrival times and put them back on the priority queue, while the “fake” respond implementation will notify the client generator and append the response to the concurrent history. -Note that since arrival times are randomly generated (deterministically using a seed) and because we got a priority queue rather than a FIFO we get interesting message interleavings where for example a message that was sent much later than some other message might end up getting receieved earlier. +Note that since arrival times are randomly generated (deterministically using a seed) and because we got a priority queue rather than a FIFO we get interesting message interleavings where for example a message that was sent much later than some other message might end up getting received earlier. Next lets have a look at time. The “fake” implementation of the time interface is completely detached from the actual system time, the clock is only advanced by explicit calls to the set time method. This allows us to do a key thing: set the time when we dequeue an event to the arrival time of that event! This allows us to jump in time to when we know that the next event is supposed to happen without waiting for it, i.e. no more waiting 30s for timeouts to happen! @@ -94,7 +94,7 @@ If the checkers find any problem, we want to be able to reproduce it from a sing ## Code -We’ll link to the most important parts of the code rather than inlining it all here. +We’ll link to the most important parts of the code rather than in-lining it all here.