diff --git a/docs/engine/GASOLINE.md b/docs/engine/GASOLINE.md deleted file mode 100644 index 8d6ef0cdce..0000000000 --- a/docs/engine/GASOLINE.md +++ /dev/null @@ -1,11 +0,0 @@ -# Gasoline - -Gasoline (at engine/packages/gasoline) is the durable execution engine running most persistent things on Rivet Engine. - -Gasoline consists of: -- Workflows - Similar to the concept of actors (not Rivet Actors) which can sleep (be removed from memory) when not in use -- Signals - Facilitates intercommunication between workflow <-> workflow and other services (such as api) -> workflow -- Messages - Ephemeral "fire-and-forget" communication between workflows -> other services -- Activities - Individual steps in a workflow, each can be individually retried upon failure and "replayed" instead of re-executed with every workflow run -- Operations - Thin wrappers around native rust functions. Provided for clean interop with the Gasoline ecosystem - diff --git a/docs/engine/GASOLINE/DESIGN_GUIDE.md b/docs/engine/GASOLINE/DESIGN_GUIDE.md new file mode 100644 index 0000000000..2acd1eee1a --- /dev/null +++ b/docs/engine/GASOLINE/DESIGN_GUIDE.md @@ -0,0 +1,3 @@ +# Design Guide + +TODO: Common workflow patterns, systems layouts of multiple workflows diff --git a/docs/engine/GASOLINE/GOTCHAS.md b/docs/engine/GASOLINE/GOTCHAS.md new file mode 100644 index 0000000000..aa90e9e132 --- /dev/null +++ b/docs/engine/GASOLINE/GOTCHAS.md @@ -0,0 +1,3 @@ +# Gotchas + +TODO diff --git a/docs/engine/GASOLINE/OPERATIONS.md b/docs/engine/GASOLINE/OPERATIONS.md new file mode 100644 index 0000000000..98356a4634 --- /dev/null +++ b/docs/engine/GASOLINE/OPERATIONS.md @@ -0,0 +1,3 @@ +# Operations + +TODO diff --git a/docs/engine/GASOLINE/OVERVIEW.md b/docs/engine/GASOLINE/OVERVIEW.md new file mode 100644 index 0000000000..4c28cc8b9b --- /dev/null +++ b/docs/engine/GASOLINE/OVERVIEW.md @@ -0,0 +1,86 @@ +# Gasoline + +Gasoline (at engine/packages/gasoline) is the durable execution engine running most persistent things on Rivet Engine. + +Gasoline consists of: +- Workflows - Similar to the concept of actors. Can sleep (be removed from memory) when not in use +- Signals - Facilitates intercommunication between workflow <-> workflow and other services (such as api) -> workflow +- Messages - Ephemeral "fire-and-forget" communication between workflows -> other services +- Activities - Thin wrappers around native rust functions, each can be individually retried upon failure +- Operations - Thin wrappers around native rust functions. Provided for clean interop with the Gasoline ecosystem + +## Workflows + +Workflows consist of a series of durable steps. When a step is complete, its result is saved to database as workflow history. If a workflow encounters a step which requires waiting (i.e. a signal, or just sleeping) it will be removed from memory until that event happens or the sleep times out. + +When a workflow is rewoken from database back into memory, all previous steps that already completed will be "replayed". This means they will not actually execute code or write anything to database, but instead use the data already in the database to mimic an actual response. + +## Workflow Steps + +All workflow steps are durable, meaning they are either NO-OPs or return the previously calculated result when the workflow replays. + +These include: + +- Signal - a durable packet of data sent to a workflow. Can have a timeout and ready batch size +- Message - an ephemeral packet of data sent from a workflow to a subscriber +- Activity - a thin wrapper around a bare function with automatic backoff retries +- Sub workflow - publishing another workflow and/or waiting for another workflow to complete before continuing +- Loop - allows running the same closure repeatedly while intelligently handling workflow history +- Sleep +- Removed events, version checks - advanced workflow history steps (see WORKFLOW_HISTORY.md) + +You cannot run operations directly in the workflow body; they must be put in an activity first. + +Some composition methods: + +- Join - run multiple activities or closures in parallel +- Closure - create a branch in workflow history + +### Activities + +Activities are used to run user code; they are the meat and potatoes of the workflow. They should be composed in a way such that their failure and subsequent retry does not cause any side effects. In other words, each activity should be limited to 1 "action" that, when retried, will not be clobbered by previous executions of the same activity. + +Pseudocode (bad composition): + +- Activity 1 + - Transaction 1: Insert user row into database + - Transaction 2: Insert user id into group table + +If this activity were to fail on transaction 2, it will be retried with backoff. However, because the user row already exists, retrying will result in a database error. This activity will never succeed upon retry. + +To remedy this, either: + - Combine the queries into one transaction OR + - Separate the transactions into 2 activities, one for each transaction + +### Signals + +Signals are the only form of communication between services (anything outside of workflows) -> workflow, as well as workflow -> workflow. + +To send a signal you need: + +- A workflow ID OR +- Tags that match an existing incomplete workflow + +If the workflow is found, the signal is added to the workflows queue. Workflows can consume signals by using a `listen` step. + +If a workflow completes with pending signals still in its queue, the signals are marked as "acknowledged" and essentially forgotten. + +### Messages + +Messages are ephemeral packets of data sent from workflows. They are intended as status updates for real time communication and not durable communication. It is ok for messages to not be consumed by any receiver. + +Receiving a message requires subscribing to its name and a subset of its tags. + +## Tags + +Workflows, signals, and messages all use tags for convenience when their unique IDs are not available. + +Signals can be sent to a workflow if you know its name and a subset of its tags. + +Internally, it is more efficient to order signal tags in a manner of most unique to least unique: + +- Given a workflow with tags: + - namespace = foo + - type = normal + +The signal should be published with `namespace = foo` first, then `type = normal` \ No newline at end of file diff --git a/docs/engine/GASOLINE/WORKFLOW_HISTORY.md b/docs/engine/GASOLINE/WORKFLOW_HISTORY.md new file mode 100644 index 0000000000..2dba84cad5 --- /dev/null +++ b/docs/engine/GASOLINE/WORKFLOW_HISTORY.md @@ -0,0 +1,111 @@ +# Workflow History + +To provide durable execution, workflows store their steps (aka events) in a database. These are stored in order of event location. + +## Location + +All events have a __location__ consisting of a set of __coordinates__. Each __coordinate__ is a set of __ordinates__ which are positive integers. + +Locations look like: + +- `{1}` - the first event +- `{1, 4}` - the fourth child of the first event +- `{0.1}` - the first inserted event before the first event +- `{4, 0.3.1, 0.6}` - the 6th inserted event before the first event of the parent, which is the first inserted event between 0.3 and 0.4, which is a child of the fourth event in the root + +This may look confusing, but it allows dynamic location assignment without prior knowledge of the entire list of steps of a workflow, as well as modifying an existing workflow which already has some history. + +## Calculating Location + +Location is determined both by the events before it and the events after it (if the location is for an inserted event). + +For a new workflow with no history, location is determined by incrementing the final coordinate (which consists of a single ordinate). Location `{2}` follows `{1}`, etc. + +Coordinates start at `1`, not `0`. This is important to allow for inserting events before location `{1}` without negative numbers. + +### Branches + +When a branch (used internally for steps like loops and closures) is encountered, a new coordinate is added to the current location. So events that are children of a branch at location `{1}` would start at `{1, 1}`. + +### Inserting Events + +If a workflow has already executed up to location `{4}` but you want to add a new activity before location `{2}`, you can use __versioned workflow steps__ to make this happen. + +By default all steps inherit the version of the branch they come from, which for the root of the workflow is version 1. + +If you were to add a new step before location `{2}` with version 1 (denoted as `{2}v1` or `{2} v1`), the workflow would fail when it replays with the error `HistoryDiverged`. The version of inserted events must always be higher than the version of the step that comes AFTER it. + +When we add a new step before location `{2}` with a version 2, it will be assigned the location `{1.1}` because it is the first inserted event after location `{1}`. All subsequent new events we add will increment this final ordinate: `{1.2}`, `{1.3}`, etc. + +If we want to add an event between `{1.1}` and `{1.2}` (which are both version 2 events), we will need to set the event's version to 3. The new event's location will be `{1.1.1}`. + +Events inserted before the event at location `{1}` will start with a 0 (`{0.1}`). Continuing to insert events before this new event will prepend more 0's: `{0.0.1}`, `{0.0.0.1}`, etc. + +Note that inserting can be done at any root, so an event between `{2, 11, 4}` and `{2, 11, 5}` will have the location `{2, 11, 4.1}`. + +### Removing events + +Removing events requires replacing the event with a `ctx.removed::<_>()` call. This is a durable step that will either: + +- For workflows that have already executed the step that should be removed: will not manipulate the database but will skip the step when replaying. +- For workflows that have not executed the step yet: will insert a `removed` event into history + +This keeps locations consistent between the two cases. + +### Inserting Events Conditionally Based On Version + +Sometimes you may want to keep the history of existing workflows the same while modifying only new workflows. You can do this with `ctx.check_version(N)` where `N` is the version that will be used when the history does not exist yet (i.e. for a new workflow). + +Given a workflow with the history: + +- `{1}v1` activity foo +- `{2}v1` activity bar +- `{3}v1` sleep + +If we want this workflow to remain the same but new workflows to execute a different activity instead of `bar` (perhaps a newer version), we can do: + +```rust +// Activity foo +ctx.activity(...).await?; + +match ctx.check_version(2).await { + // The existing workflow will always match this path because the next event (activity bar) has version 1 + 1 => { + // Here we need to keep the workflow steps as expected by the history, run activity bar + ctx.activity(...).await?; + } + // This will be `2` because that is the value of `N` + _latest => { + // Activity bar_fast + ctx.v(2).activity(...).await?; + } +} + +ctx.sleep().await +``` + +Version checks are durable because if history already exists at the location they are added then they do not manipulate the database and read the version of that history event. But if the version check is at the end of the current branch of events (as in a new workflow), it will be inserted as an event itself. This means the workflow history for both workflows will look like this: + +- Existing workflow history: + - `{1}v1` activity foo + - `{2}v1` activity bar + - `{3}v1` sleep +- New workflow history: + - `{1}v1` activity foo + - `{2}v2` version check + - `{3}v2` activity bar_fast + - `{4}v1` sleep + +Note that you can also manipulate the existing workflow history in the `1` branch just like you would without `check_version`. We could insert a new activity after activity `bar` with a `v2` or remove activity `bar`. + +## Loops + +Loops structure event history with 2 nested branches: + +A loop at location `{2}` will have each iteration on a separate branch: `{2, 1}`, `{2, 2}`, ... `{2, iteration}`. + +Events in each iteration will be a child of the iteration branch: `{2, 2, 1}`, `{2, 2, 2}` are the first two events in the second iteration of the loop at location `{2}`. + +Loops are often used to create state machines out of workflows. Because state machines can technically run forever based on their loop configuration, Gasoline moves all complete iteration's event history into a different place in the database known as forgotten event history. + +Forgotten events will not be pulled from the database when the workflow is replayed. This will not cause issues for the workflow because we know which iteration is the current one and previous iterations should not influence the current history as each iteration is separate. diff --git a/docs/engine/HIBERNATING_WS.md b/docs/engine/HIBERNATING_WS.md index 18391f5986..b7e1699522 100644 --- a/docs/engine/HIBERNATING_WS.md +++ b/docs/engine/HIBERNATING_WS.md @@ -22,4 +22,4 @@ To facilitate state management on the runner side (specifically via RivetKit), e When a client websocket closes during hibernation, this value is cleared. -When a runner receives a CommandStartActor message via the runner protocol, it contains information about which hiberating requests are still active. +When a runner receives a CommandStartActor message via the runner protocol, it contains information about which hibernating requests are still active. diff --git a/engine/packages/gasoline/src/db/kv/mod.rs b/engine/packages/gasoline/src/db/kv/mod.rs index c208625db0..068ea3deac 100644 --- a/engine/packages/gasoline/src/db/kv/mod.rs +++ b/engine/packages/gasoline/src/db/kv/mod.rs @@ -977,7 +977,7 @@ impl Database for DatabaseKv { } /// Returns the first incomplete workflow with the given name and tags, first meaning the one with the - /// lowest id value (interpreted as u128) because its in a KV store. There is no way to get any other + /// lowest id value (by internal representation) because its in a KV store. There is no way to get any other /// workflow besides the first. #[tracing::instrument(skip_all, fields(%workflow_name))] async fn find_workflow(