- demoware
- built with assistance of GitHub Copilot
- practice patterns for SVG animation on the server
- reduce the size of the updates by updating only the transform
- extract and use SVG components
- practice OTP processes that externalize their state to support a minimal-interruption restart, demonstrable via fault injection
- practice injecting larger faults such as node failures
- investigate GitHub Copilot-assisted coding in Elixir
- practice writing mix tasks, e.g. to bump the app version
- practice a whole application hot code upgrade without directly using a known "advanced" approach.
- a virtual ball is flying around in a box
- its behavior can be changed at run-time
- various system failures can also be triggered/simulated
- the ball is expected to keep moving without a noticeable interruption as long as at least one machine is available
docker compose up
↓
# once
mix setup
# one node
mix phx.server
# 3 nodes
process-compose
↓
- use `process-compose` to start 3 nodes locally (additional node ports: `4001`, `4002`)

To crash a node, triple-click on an `x`.
- the demo fix: the new version contains a new ball movement module, `RandomReboundV2NonSticky`
- `release-two-versions.sh` simulates the build of two versions, one of them compiled without the new module. The new behavior name is pre-configured and added in a particular version for the demo.
- demo
- start two versions running alongside in a cluster
- look at the views of both versions, noting the new module
- try to switch over to the new module while the ball runs on the old instance → safe failure
- take down the old node → the new node takes over but still running the old behavior
- switch over to the new behavior
using `process-compose`:
scripts/release-two-versions.sh
scripts/run-two-versions.sh
↓
node-crash-ball-reschedule.mp4
- the application is clustered
- a singleton process `Ball` runs on one of the nodes in the cluster
- the ball is flying around in a box with an injectable behavior, fulfilling a `BallMovement` protocol
- the list of movement behavior modules can be found in the config `:available_ball_behaviors`
- the config includes one non-existent module, `NonExistentBehavior`, which simulates a sub-system update fault
- the nodes (dangerously → demoware!) expose a kill switch which stops a node with a non-zero exit code, triggering a restart of the ball process on another node
- the state of the ball is continuously externalized to a simple process called `StateGuardian`, local to each node
- when the ball starts, it may load its state from the `StateGuardian`
- the SVG is rendered as a LiveView template, with only the ball's position being updated
- the list of nodes is updated periodically by `ClusterInfoServer`
- upon start (with a short delay), and on detection of a new node by `NodeListener`, the compiled modules configured on one node are spread to other nodes via `BehaviorModules`. Prerequisite: the module doesn't depend on modules not present on other nodes.
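The state externalization described in this list can be sketched with two plain GenServers. This is a minimal, single-node illustration under assumptions, not the project's code: the module names `Sketch.StateGuardian` and `Sketch.Ball`, the `keep/1`, `last_known/0` and `tick/0` functions, and the shape of the ball state are all hypothetical, and the real application distributes state over PubSub topics rather than direct casts.

```elixir
defmodule Sketch.StateGuardian do
  # Hypothetical, logic-less holder of the last known ball state.
  use GenServer

  def start_link(_opts \\ []),
    do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # Asynchronous, so the ball never blocks on externalizing its state.
  def keep(state), do: GenServer.cast(__MODULE__, {:keep, state})

  # Returns the last kept state, or nil if nothing has been kept yet.
  def last_known, do: GenServer.call(__MODULE__, :last_known)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_cast({:keep, state}, _old), do: {:noreply, state}

  @impl true
  def handle_call(:last_known, _from, state), do: {:reply, state, state}
end

defmodule Sketch.Ball do
  # Hypothetical ball process; the state shape is an assumption.
  use GenServer

  @default %{x: 0, y: 0, dx: 1, dy: 1}

  def start_link(_opts \\ []),
    do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  def tick, do: GenServer.call(__MODULE__, :tick)

  @impl true
  def init(_) do
    # Resume from the externalized state if the guardian has one.
    {:ok, Sketch.StateGuardian.last_known() || @default}
  end

  @impl true
  def handle_call(:tick, _from, ball) do
    moved = %{ball | x: ball.x + ball.dx, y: ball.y + ball.dy}
    # Externalize every change so a restarted ball can pick it up.
    Sketch.StateGuardian.keep(moved)
    {:reply, moved, moved}
  end
end
```

Starting the guardian, ticking the ball a few times, stopping the ball and starting it again should let the new ball continue from the last externalized position instead of the default one. Keeping the guardian free of logic is the point of the design: the less it does, the less likely it is to crash together with the ball.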
flowchart TB
subgraph Replica1
HordeSupervisor1([HordeSupervisor]) -- schedules start of --> Ball
Ball(("Ball (singleton)"))-- publishes state changes to -->BallStateTopic@{ shape: das, label: "state:ball" }
%% Ball-- updates -->BallUpdatesTopic@{ shape: das, label: "updates:ball" }
Ball-- publishes changes to -->BallCoordinatesTopic@{ shape: das, label: "coordinates:ball, updates:ball" }
StateGuardian1[StateGuardian] -- subscribed to --> BallStateTopic
Ball -. restores state from .-> StateGuardian1
LiveView1@{ shape: manual-input, label: "LiveView"} -- subscribed to --> BallCoordinatesTopic
end
subgraph Replica2
BallStateTopic2@{ shape: das, label: "state..." } -- distributed --- BallStateTopic
StateGuardian2[StateGuardian] -- subscribed to --> BallStateTopic2
HordeSupervisor2([HordeSupervisor]) -- distributed --- HordeSupervisor1
LiveView2@{ shape: manual-input, label: "LiveView"} -- subscribed to --> BallCoordinatesTopic0
BallCoordinatesTopic0@{ shape: das, label: "coordinates..." }-- distributed ---BallCoordinatesTopic
end
subgraph Users[" "]
User1("fa:fa-user User") -- interacts with --> LiveView2
User2("fa:fa-user User") -- interacts with --> LiveView1
end
- Will it scale? → What do you mean by 'scale' exactly?
- Why publish each ball state to the state guardian? Isn't it too chatty/expensive? → Yes. I wanted to demo simulating a whole node going down on which the singleton ball is running. Without it, the take-over of the ball by another node wouldn't look that spectacular.
- Why not use technology XYZ for this? → Yes, that'd be nice. However, Phoenix LiveView, Elixir and Erlang provide so many primitives out of the box that such architectural sketches become effective and require fewer infrastructural moving parts.
- One could do this in C! → Sure! You'd just have to implement "half of Erlang" yourself. Spoiler alert: a big chunk of Erlang/OTP is C.
- Why not just use the standard Erlang/OTP mechanism for the hot code upgrade? → Yes, that'd be nice as well, and has been tried and tested all around the world. Many articles and docs on the subject suggest trying alternative approaches these days. Knowing something is possible and having tried it may lie far apart.
- Why not demo XYZ as well? → Yes, that'd be nice too. There's no-one to stop you from doing it.
- If the last known state has been a new module, downgrading will lead to a system failure, unless the module has already been distributed by `BehaviorModules` to the node that starts the ball. `last_known_good_module` cannot be relied upon, and some other behavior, like a default behavior module or a stack of last known good modules, can be designed.
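One possible shape for such a fallback is sketched below. It walks an ordered list of candidate modules (the requested one first, then a stack of last known good modules, then a default) and picks the first that is actually loadable on the current node. The module name `Sketch.BehaviorFallback`, the function `resolve/2`, and the `move/1` callback are assumptions for illustration; the callbacks of the real `BallMovement` protocol are not shown here.

```elixir
defmodule Sketch.BehaviorFallback do
  # Hypothetical resolver: returns the first candidate module that is loaded
  # (or loadable) on this node and exports the required function.
  # `candidates` is ordered most-preferred first.
  def resolve(candidates, required_fun \\ {:move, 1}) do
    {name, arity} = required_fun

    Enum.find_value(candidates, {:error, :no_usable_behavior}, fn mod ->
      # Code.ensure_loaded?/1 is false for modules missing on this node,
      # e.g. a module that only exists in a newer release.
      if Code.ensure_loaded?(mod) and function_exported?(mod, name, arity) do
        {:ok, mod}
      end
    end)
  end
end
```

With this approach a downgraded node that lacks the newest module would silently continue with an older behavior instead of failing, which trades strictness for availability; whether that is acceptable depends on the behavior in question.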
- Decoupling deployment from release
- Keeping the state of a process outside of it (e.g. in another, logic-less process)
- "The generic component should hide details of concurrency and mechanisms for fault-tolerance from the
plugins. The plugins should be written using only sequential code with well-defined types."
- Joe Armstrong's PhD Thesis "Making reliable distributed systems in the presence of software errors"