Commit 807e834: Merge pull request #170 from arangodb-helper/documentation/starter-architecture (Architecture related docs). Parents: f09e943, 646c2f6.

# ArangoDB Starter Architecture

## What does the Starter do

The ArangoDB Starter is a program used to create ArangoDB database deployments
on bare-metal or virtual machines with ease.
It enables you to create everything from a simple single server instance
to a full-blown cluster with datacenter-to-datacenter replication in under 5 minutes.

The Starter is intended to be used in environments where there is no higher-level
orchestration system (e.g. Kubernetes or DC/OS) available.

## Starter versions

The Starter is a separate process in a binary called `arangodb` (or `arangodb.exe` on Windows).
This binary has its own version number that is independent of the ArangoDB (database)
version.

This means that Starter version `a.b.c` can be used to run deployments
of ArangoDB databases with different versions.
For example, the Starter with version `0.11.2` can be used to create
ArangoDB deployments with ArangoDB version `3.2.<something>` as well
as deployments with ArangoDB version `3.3.<something>`.

It also means that you can update the Starter independently from the ArangoDB
database.

Note that the Starter is also included in all binary ArangoDB packages.

To find the versions of your Starter & ArangoDB database, run the following commands:

```bash
# To get the Starter version
arangodb --version
# To get the ArangoDB database version
arangod --version
```

## Starter deployment modes

The Starter supports 3 different modes of ArangoDB deployments:

1. Single server
1. Active failover
1. Cluster

Note: Datacenter-to-datacenter replication is an option for the `cluster` deployment mode.

You select one of these modes using the `--starter.mode` command line option.

Depending on the mode you've selected, the Starter launches one or more
(`arangod` / `arangosync`) server processes.

No matter which mode you select, the Starter always provides you with
a common directory structure for storing the servers' data, configuration & log files.

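
For illustration, these are the corresponding invocations for each mode. This is a minimal
sketch: the mode values follow the list above (the value `single` for single server mode is
an assumption), `hostA`/`hostB`/`hostC` are placeholders, and the `--starter.join` option
used for the multi-machine modes is explained in the sections below.

```bash
# Single server (a sketch; assumes the mode value `single`)
arangodb --starter.mode=single

# Active failover; the same command is run on every participating machine
arangodb --starter.mode=activefailover --starter.join=hostA:8528,hostB:8528,hostC:8528

# Cluster; the same command is run on every participating machine
arangodb --starter.mode=cluster --starter.join=hostA:8528,hostB:8528,hostC:8528
```
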
57+
## Starter operating modes
58+
59+
The Starter can run as normal processes directly on the host operating system,
60+
or as containers in a docker runtime.
61+
62+
When running as normal process directly on the host operating system,
63+
the Starter launches the servers as child processes and monitors those.
64+
If one of the server processes terminates, a new one is started automatically.
65+
66+
When running in a docker container, the Starter launches the servers
67+
as separate docker containers, that share the volume namespace with
68+
the container that runs the Starter. It monitors those containers
69+
and if one terminates, a new container is launched automatically.
70+
71+
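
For the container case, a rough sketch of launching the Starter itself in Docker is shown
below. The image name `arangodb/arangodb-starter`, the mount paths and the option values are
assumptions for illustration (networking-related options are omitted); mounting the Docker
socket is what lets the Starter create the server containers next to itself.

```bash
# A sketch only: running the Starter in a Docker container.
# Image name, paths and option values are assumptions; adjust to your environment.
docker run -it \
  -v /my/data:/data \
  -v /var/run/docker.sock:/var/run/docker.sock \
  arangodb/arangodb-starter \
  --starter.data-dir=/data \
  --starter.mode=cluster \
  --starter.join=hostA:8528,hostB:8528,hostC:8528
```
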
## Starter data-directory

The Starter uses a single directory with a well-known structure to store
all data for its own configuration & logs, as well as the configuration,
data & logs of all servers it starts.

This data directory is set using the `--starter.data-dir` command line option.
It contains the following files & sub-directories:

- `setup.json`: The configuration of the "cluster of Starters".
  For details see below. DO NOT edit this file.
- `arangodb.log`: The log file of the Starter
- `single<port>`, `agent<port>`, `coordinator<port>`, `dbserver<port>`: directories for
  launched servers. These directories contain, among others, the following files:
  - `apps`: A directory with Foxx applications
  - `data`: A directory with database data
  - `arangod.conf`: The configuration file for the server. Editing this file is possible, but not recommended.
  - `arangod.log`: The log file of the server
  - `arangod_command.txt`: File containing the exact command line of the started server (for debugging purposes only)

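
For example, on a machine running a Starter in cluster mode, the data directory might look
roughly like this (a sketch only; the `<port>` suffixes stand for the ports the servers were
started on):

```bash
# Listing a hypothetical data directory of a Starter in cluster mode.
tree my-data-dir
# my-data-dir
# ├── setup.json
# ├── arangodb.log
# ├── agent<port>
# │   ├── apps
# │   ├── data
# │   ├── arangod.conf
# │   ├── arangod.log
# │   └── arangod_command.txt
# ├── coordinator<port>
# └── dbserver<port>
```
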
## Running on multiple machines

For the `activefailover` & `cluster` modes, it is required to run multiple
Starters, as every Starter will only launch a subset of all servers needed
to form the entire deployment.
For example, in `cluster` mode a Starter will launch a single agent, a single dbserver
and a single coordinator.

It is the responsibility of the user to run the Starter on multiple machines such
that enough servers are started to form the entire deployment.
The minimum number of Starters needed is 3.

The Starters running on those machines need to know about each other's existence.
In order to do so, the Starters form a "cluster" of their own (not to be confused
with the ArangoDB database cluster).
This cluster of Starters is formed from the values given to the `--starter.join`
command line option. You should pass the addresses (`<host>:<port>`) of all Starters.

For example, a typical command line for a cluster deployment looks like this:

```bash
arangodb --starter.mode=cluster --starter.join=hostA:8528,hostB:8528,hostC:8528
# this command is run on hostA, hostB and hostC.
```

The state of the cluster (of Starters) is stored in a configuration file called
`setup.json` in the data directory of every Starter, and the ArangoDB
agency is used to elect a master among all Starters.

The master Starter is responsible for maintaining the list of all Starters
involved in the cluster and their addresses. The slave Starters (all Starters
except the elected master) fetch this list from the master Starter on a regular
basis and store it in their own `setup.json` config files.

Note: The `setup.json` config file MUST NOT be edited manually.

## Running on multiple machines (under the hood)

As mentioned above, when the Starter is used to create an `activefailover`
or `cluster` deployment, it first creates a "cluster" of Starters.

These are the steps taken by the Starters to bootstrap such a deployment
from scratch:

1. All Starters are started (either manually or by some supervisor).
1. All Starters try to read their config from `setup.json`.
   If that file exists and is valid, this bootstrap-from-scratch process
   is aborted and all Starters go directly to the `running` phase described below.
1. All Starters create a unique ID.
1. The list of `--starter.join` arguments is sorted.
1. Each Starter requests the unique ID from the first server in the sorted `--starter.join` list
   and compares the result with its own unique ID.
1. The Starter that finds its own unique ID continues as `bootstrap master`;
   the other Starters continue as `bootstrap slaves`.
1. The `bootstrap master` waits for at least 2 `bootstrap slaves` to join it.
1. The `bootstrap slaves` contact the `bootstrap master` to join its cluster of Starters.
1. Once the `bootstrap master` has received enough (at least 2) requests
   to join its cluster of Starters, it continues with the `running` phase.
1. The `bootstrap slaves` keep asking the `bootstrap master` about its state.
   As soon as they receive confirmation to do so, they also continue with the `running` phase.

In the `running` phase all Starters launch the desired servers and keep monitoring those
servers. Once a functional agency is detected, all Starters will try to become the
`running master` by trying to write their ID to a well-known location in the agency.
The first Starter to succeed in doing so wins this master election.

The `running master` will keep writing its ID to the agency in order to remain
the `running master`. Since this ID is written with a short time-to-live,
other Starters are able to detect when the current `running master` has been stopped
or is no longer responsive. In that case the remaining Starters will perform
another master election to decide who will be the next `running master`.

API requests that involve the state of the cluster of Starters are always answered
by the current `running master`. All other Starters will redirect such requests to
the current `running master`.
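
The lease behind this election can be pictured with the following shell-flavored pseudocode.
It is only a sketch of the mechanism described above: `agency_read` and `agency_write_with_ttl`
are hypothetical helpers (not real commands), the key path is invented, and a real
implementation would claim and renew the key atomically inside the agency.

```bash
# Pseudocode sketch of the running-master lease described above.
# agency_read and agency_write_with_ttl are hypothetical helpers, and the
# key /starter/running-master is an invented example path.
MY_ID="starter-1f2e3d"

while true; do
  owner="$(agency_read /starter/running-master)"
  if [ -z "$owner" ] || [ "$owner" = "$MY_ID" ]; then
    # Claim or renew the lease; the short TTL makes it expire automatically
    # when this Starter stops renewing it. A real implementation would do the
    # read-and-write as one atomic (compare-and-swap) agency transaction.
    agency_write_with_ttl /starter/running-master "$MY_ID" 30
  fi
  sleep 10
done
```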
