Documentation is still very much a work-in-progress
Cluman is the server component of the Haven project. This component manages the clusters by communicating with the agent. The name comes from CLUster-MANager.
- Cluman manages the following modules (data is stored in etcd; run etcdctl ls /cluman/ to check):
  - /nodes - list of connected nodes
  - /clusters - list of clusters (currently only RealCluster is stored here)
  - /containers - list of containers
  - /applications - list of compose applications; an application is made of one or more containers instantiated by a compose file
  - /docker-registry - list of user-added registries and Docker Hub
  - /pipelines - list of pipelines (to be implemented)
- It gathers metric information from the nodes (part of the info provided by the agent) and from the containers, and saves it in file queues in the ${dm.file.fbstorage} directory (default ${java.io.tmpdir}/cluman/fbstorage).
- It provides an API to interact with the platform (see http://$MASTER_IP:8761/swagger-ui.html) and a user interface.
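The default location of the file queues can be illustrated with a small sketch. Only the property name dm.file.fbstorage and the default ${java.io.tmpdir}/cluman/fbstorage come from the text above; the class FbStorageLocation and its resolve method are hypothetical and are not part of Cluman.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of Cluman: resolves the file-queue directory.
public class FbStorageLocation {

    /** Property name taken from the documentation. */
    static final String PROPERTY = "dm.file.fbstorage";

    /**
     * Returns the configured directory, or the documented default
     * ${java.io.tmpdir}/cluman/fbstorage when the property is not set.
     */
    static Path resolve(Properties config) {
        String configured = config.getProperty(PROPERTY);
        if (configured != null && !configured.trim().isEmpty()) {
            return Paths.get(configured.trim());
        }
        return Paths.get(System.getProperty("java.io.tmpdir"), "cluman", "fbstorage");
    }

    public static void main(String[] args) {
        // With no override, this prints the default directory for this machine.
        System.out.println(resolve(new Properties()));
    }
}
```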
Cluman has the following entities:
- NodesGroup - a group of nodes. A NodesGroup has 'features' which may be used for resolving it into these group types:
  - SWARM - nodes in this group type are grouped together by a single 'swarm' service. We consider groups with this feature a 'cluster' or 'real cluster'.
  - FORBID_NODE_ADDITION - this group type is a meta group created by the system (for example, the "orphans" group mentioned below). No modification is allowed for these NodesGroups.

  In addition, Cluman has some pre-defined NodesGroups (all of them are stored in DiscoveryStorage.SYSTEM_GROUPS):
  - DiscoveryStorage.GROUP_ID_ALL - this group contains all nodes that are currently online.
  - DiscoveryStorage.GROUP_ID_ORPHANS - this group contains all nodes that do not belong to any RealCluster.
- RealCluster - a type of NodesGroup backed by a SWARM service. Do not confuse it with 'real cluster': RealCluster is the type name of the domain object which represents a 'real cluster' in Cluman.
- Node - a node of a cluster. Cluman does not distinguish the node which contains the cluster-manager container from the others.
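How features map to group types can be sketched as follows. The feature names SWARM and FORBID_NODE_ADDITION come from the text; the classes NodesGroupSketch and Group, the method names, and the group names used in main are illustrative assumptions, not Cluman's actual types or identifiers.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Illustrative model only; Cluman's real NodesGroup classes may look different.
public class NodesGroupSketch {

    /** Feature names as listed in the documentation. */
    enum Feature { SWARM, FORBID_NODE_ADDITION }

    static final class Group {
        final String name;
        final Set<Feature> features;

        Group(String name, Set<Feature> features) {
            this.name = name;
            // Copy into an unmodifiable set so the features cannot change later.
            this.features = features.isEmpty()
                    ? Collections.<Feature>emptySet()
                    : Collections.unmodifiableSet(EnumSet.copyOf(features));
        }

        /** A group with the SWARM feature is treated as a 'real cluster'. */
        boolean isRealCluster() {
            return features.contains(Feature.SWARM);
        }

        /** System meta groups (such as orphans) do not accept modifications. */
        boolean isModifiable() {
            return !features.contains(Feature.FORBID_NODE_ADDITION);
        }
    }

    public static void main(String[] args) {
        // "dev" and "orphans" are placeholder names chosen for this example.
        Group dev = new Group("dev", EnumSet.of(Feature.SWARM));
        Group orphans = new Group("orphans", EnumSet.of(Feature.FORBID_NODE_ADDITION));
        System.out.println(dev.name + " real cluster: " + dev.isRealCluster());
        System.out.println(orphans.name + " modifiable: " + orphans.isModifiable());
    }
}
```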
When Cluman is started, it:
- Sets up a DiscoveryStorageImpl.
- Reads from the etcd storage and loads the list of registered RealCluster objects.
- For each cluster, runs a swarm instance through DockerServices.getOrCreateCluster().
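The get-or-create step of the startup sequence can be sketched with a concurrent map. This is a simplified illustration under stated assumptions: the classes StartupSketch, Services, and ClusterService are hypothetical stand-ins, the list of cluster names replaces the real etcd read, and the actual signature and behavior of DockerServices.getOrCreateCluster() may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins; not the real DiscoveryStorageImpl or DockerServices.
public class StartupSketch {

    /** Stand-in for a swarm-backed service that belongs to one cluster. */
    static final class ClusterService {
        final String cluster;
        ClusterService(String cluster) { this.cluster = cluster; }
    }

    static final class Services {
        private final Map<String, ClusterService> byCluster = new ConcurrentHashMap<>();

        /** Returns the existing service for the cluster or creates it once. */
        ClusterService getOrCreateCluster(String name) {
            return byCluster.computeIfAbsent(name, ClusterService::new);
        }

        int size() { return byCluster.size(); }
    }

    /** Assumes the registered cluster names were already loaded from storage. */
    static Services start(List<String> registeredClusters) {
        Services services = new Services();
        for (String name : registeredClusters) {
            services.getOrCreateCluster(name);
        }
        return services;
    }

    public static void main(String[] args) {
        // "dev" and "prod" are placeholder cluster names.
        Services services = start(Arrays.asList("dev", "prod"));
        System.out.println("services started: " + services.size());
    }
}
```

Using computeIfAbsent keeps the call idempotent, so a repeated request for the same cluster returns the instance created earlier instead of starting a second one.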
Docker Swarm is used for managing the NodesGroups (wrapped in DockerServiceImpl) for Real Cluster
and VirtualDockerService.
To gather nodes' events from Docker services, Cluman connects to each node directly. These connections are
stored in DockerServices. The registered nodes' information is stored in NodeStorage.
When a node agent sends data to NodeStorage through TokenDiscoveryServer, the storage adds an additional node reference to the
Swarm part of the etcd tree (via NodeStorage.updateSwarmRegistration).
Cluman registers a node through the agent but also uses information about the node from docker info.
All gathered info is saved in NodeRegistrationImpl. Data about node health and metrics is published as NodeMetrics.
A node has two main flags: health.healthy and on. The difference between the two is:
- on - shows whether the node is online. The value is true when the node agent sends an ack within a specified time. If the timeout is exceeded, the node is immediately set to off (on=false). See NodeRegistrationImpl.isOn for details. This flag ignores the status of the node's Docker service.
- healthy - its value is derived from the Docker service status. The Docker developers can declare it as 'engine is unreachable', but we may also use the analysis of node metrics (for example, storage space is exceeded, or hdd SMART errors).
Note that node.health (aka NodeMetrics) has a time value based on the local node time.
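The independence of the two flags can be sketched as below. This is a minimal illustration, not the logic of NodeRegistrationImpl.isOn: the class NodeFlagsSketch, its method names, and the 30-second timeout are assumptions, and the sketch measures ack age on the manager side with an explicitly passed clock value.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only; the real check lives in NodeRegistrationImpl.isOn.
public class NodeFlagsSketch {

    private final Duration ackTimeout;
    private Instant lastAck;        // time of the last ack from the node agent
    private boolean dockerHealthy;  // derived from the Docker service status

    NodeFlagsSketch(Duration ackTimeout) {
        this.ackTimeout = ackTimeout;
    }

    void onAgentAck(Instant when) { lastAck = when; }

    void onDockerStatus(boolean healthy) { dockerHealthy = healthy; }

    /** True only while the last ack is not older than the timeout. */
    boolean isOn(Instant now) {
        return lastAck != null
                && Duration.between(lastAck, now).compareTo(ackTimeout) <= 0;
    }

    /** Independent of isOn: reflects only the Docker service status. */
    boolean isHealthy() { return dockerHealthy; }

    public static void main(String[] args) {
        NodeFlagsSketch node = new NodeFlagsSketch(Duration.ofSeconds(30));
        Instant start = Instant.now();
        node.onAgentAck(start);
        System.out.println("on after 10s: " + node.isOn(start.plusSeconds(10)));
        System.out.println("on after 31s: " + node.isOn(start.plusSeconds(31)));
        System.out.println("healthy: " + node.isHealthy());
    }
}
```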
Cluman uses options from different sources when creating a new container:
- API
- Compose-like (yml) or properties files from git; for an example, see containers-configuration, where dev is the cluster name. The git repository is configured with:
  - dm.image.configuration.git.url=https://.git
  - dm.image.configuration.git.username=
  - dm.image.configuration.git.password=
- Image labels with the arg. prefix, for example:
  - LABEL arg.memory=512M
  - LABEL arg.restart=always
  - LABEL arg.ports=8761:8761
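Reading container options from image labels can be sketched as a simple prefix filter. Only the arg. prefix and the example labels come from the text; the class ArgLabelsSketch and the method extractOptions are hypothetical, and how Cluman merges these options with the API and git sources is not covered here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; shows the idea of "arg."-prefixed labels only.
public class ArgLabelsSketch {

    static final String PREFIX = "arg.";

    /** Keeps labels that start with "arg." and strips the prefix from the key. */
    static Map<String, String> extractOptions(Map<String, String> labels) {
        Map<String, String> options = new LinkedHashMap<>();
        for (Map.Entry<String, String> label : labels.entrySet()) {
            String key = label.getKey();
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                options.put(key.substring(PREFIX.length()), label.getValue());
            }
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("arg.memory", "512M");
        labels.put("arg.restart", "always");
        labels.put("arg.ports", "8761:8761");
        labels.put("maintainer", "someone"); // ignored: no "arg." prefix
        System.out.println(extractOptions(labels));
    }
}
```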
Application uses Docker Compose as the backend. Each application contains:
- String name
- String cluster
- File initFile
- Date creatingDate
- List containers
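The fields listed above could map to a plain value class like the following sketch. The field names and types are taken from the list; the class name ApplicationSketch, the element type List<String> for containers, and the defensive copies are assumptions made for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.List;

// Illustrative value class; the element type of "containers" is assumed.
public class ApplicationSketch {

    private final String name;
    private final String cluster;
    private final File initFile;          // the compose file of the application
    private final Date creatingDate;
    private final List<String> containers;

    ApplicationSketch(String name, String cluster, File initFile,
                      Date creatingDate, List<String> containers) {
        this.name = name;
        this.cluster = cluster;
        this.initFile = initFile;
        // Date and List are mutable, so keep private copies.
        this.creatingDate = new Date(creatingDate.getTime());
        this.containers = Collections.unmodifiableList(new ArrayList<>(containers));
    }

    String getName() { return name; }
    String getCluster() { return cluster; }
    File getInitFile() { return initFile; }
    Date getCreatingDate() { return new Date(creatingDate.getTime()); }
    List<String> getContainers() { return containers; }
}
```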
TODO: More details
Cluman has global instances of MessageBus. Each instance has a unique ID, which can usually be obtained from
a static field of the event class: <EventClass>.BUS.
List of global buses:
- bus.cluman.dockerservice - DockerServiceEvent, notifies DockerServiceInfo events.
- bus.cluman.log.application - ApplicationEvent, notifies 'applications' events.
- bus.cluman.log.registry - RegistryEvent, notifies registry addition and deletion events.
- bus.cluman.node - NodeEvent, notifies node status updates, which are derived from haven-agent requests.
- bus.cluman.log.docker - DockerLogEvent, notifies proxy events from the Docker service, see DockerServices.convertToLogEvent.
- bus.cluman.log.nodesGroup - NodesGroupEvent, notifies NodesGroup creations and deletions.
- bus.cluman.erorrs - LogEvent, this bus aggregates messages from other buses with WithSeverity.getSeverity() >= WARNING, and also has history.
- bus.cluman.pipeline - PipelineEvent, notifies pipeline changes.
- bus.cluman.job - JobEvent, notifies changes of each JobInstance, contains JobInfo, can be caused by JobInstance.send().
Many events have an action field whose values are described in the StandardActions class:
- create - some object will be created; note that, for example, a DockerService cannot be created, but it can be started
- update
- delete
- start - applicable to objects which have a run state: containers, jobs, processes
- stop
- die - unexpected stop; usually it means an error
- online - this action and the one below apply to objects which are used through the network: nodes, Docker services, etc.
- offline
Events also have a severity field. Its value can be INFO, WARNING or ERROR.
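The way an aggregating bus can collect events with severity WARNING or higher from other buses can be sketched as follows. This is a simplified stand-in, not Cluman's MessageBus API: the classes EventBusSketch, Bus, and Event and the method forwardProblems are hypothetical, and history support is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified stand-in for a message bus; not Cluman's MessageBus API.
public class EventBusSketch {

    // Declared in ascending order so compareTo() expresses "severity >= WARNING".
    enum Severity { INFO, WARNING, ERROR }

    static final class Event {
        final String action;      // for example "create", "die", "offline"
        final Severity severity;
        final String message;

        Event(String action, Severity severity, String message) {
            this.action = action;
            this.severity = severity;
            this.message = message;
        }
    }

    static final class Bus {
        final String id;
        private final List<Consumer<Event>> listeners = new ArrayList<>();

        Bus(String id) { this.id = id; }

        void subscribe(Consumer<Event> listener) { listeners.add(listener); }

        void publish(Event event) {
            for (Consumer<Event> listener : listeners) {
                listener.accept(event);
            }
        }
    }

    /** Forwards events with severity WARNING or higher from source to errors. */
    static void forwardProblems(Bus source, Bus errors) {
        source.subscribe(event -> {
            if (event.severity.compareTo(Severity.WARNING) >= 0) {
                errors.publish(event);
            }
        });
    }

    public static void main(String[] args) {
        Bus nodes = new Bus("bus.cluman.node");
        Bus errors = new Bus("bus.cluman.erorrs");
        errors.subscribe(e -> System.out.println(e.severity + " " + e.action + ": " + e.message));
        forwardProblems(nodes, errors);
        nodes.publish(new Event("online", Severity.INFO, "node is up"));     // not forwarded
        nodes.publish(new Event("offline", Severity.WARNING, "node lost"));  // forwarded
    }
}
```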
The API is published at http://$MASTER_IP:8761/swagger-ui.html.