53 changes: 23 additions & 30 deletions docs/auth.md
# Kubernetes Native Authentication

Kubernetes native authentication defines a flow where the Armada
Executor cluster's in-built token handling is used to authenticate
the Executor requesting jobs from the Armada Server.

## Authentication Flow

1. The Executor requests a temporary Service Account Token from its own cluster's
   Kubernetes API using TokenRequest.
2. The Executor sends this Token to the Server in an Authorization Header.
3. The Server decodes the Token to read the KID (Kubernetes Key ID).
4. The Server swaps the KID for the Callback URL of the Executor cluster's API server.
   The KID-to-URL mapping is stored in a previously generated ConfigMap or Secret.
5. The Server uses the URL to call the Executor cluster's TokenReview endpoint
   (see the sketch after this list).
6. On successful review, the Server caches the Token for its lifetime;
   on unsuccessful review, the Token is cached for a configuration-defined time.
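
The decode-map-review steps (3 to 5) can be illustrated with a small Go sketch. This is illustrative only, not Armada's implementation: it reads the `kid` from the unverified token header, looks up a callback URL, and posts a TokenReview to that cluster's API server. The token source, the example KID, and the URL are hypothetical, and a real Server would also need TLS configuration and credentials for the callback request.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"
)

// kidFromToken reads the "kid" field from the (unverified) JWT header segment.
func kidFromToken(token string) (string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("token is not a JWT")
	}
	headerJSON, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return "", err
	}
	var header struct {
		Kid string `json:"kid"`
	}
	if err := json.Unmarshal(headerJSON, &header); err != nil {
		return "", err
	}
	return header.Kid, nil
}

// reviewToken posts a TokenReview to the executor cluster's API server at
// callbackURL and reports whether the token was authenticated.
// A real implementation would also need TLS configuration and credentials.
func reviewToken(callbackURL, token string) (bool, error) {
	body, err := json.Marshal(map[string]any{
		"apiVersion": "authentication.k8s.io/v1",
		"kind":       "TokenReview",
		"spec":       map[string]string{"token": token},
	})
	if err != nil {
		return false, err
	}
	resp, err := http.Post(callbackURL+"/apis/authentication.k8s.io/v1/tokenreviews",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var review struct {
		Status struct {
			Authenticated bool `json:"authenticated"`
		} `json:"status"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&review); err != nil {
		return false, err
	}
	return review.Status.Authenticated, nil
}

func main() {
	token := os.Getenv("EXECUTOR_TOKEN") // hypothetical: the token from the Authorization header
	kid, err := kidFromToken(token)
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical mapping; in practice this comes from the KID-mapping ConfigMap (see Setup).
	callbackURLs := map[string]string{"example-kid": "https://executor.example:6443"}
	ok, err := reviewToken(callbackURLs[kid], token)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("authenticated:", ok)
}
```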

## Setup

### KID-mapping ConfigMap

The Armada Server must have a ConfigMap in its namespace with the following
format for entries:

```yaml
data:
  "<CLUSTER_KID>": "<EXECUTOR_CLUSTER_CALLBACK_URL>"
  ...
```

This ConfigMap may be mounted anywhere on the Server's Pod.
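
As a hedged illustration (not Armada's actual loader): when the ConfigMap is mounted as a volume, Kubernetes exposes each data key as a file whose contents are the value, so the Server could load the KID-to-URL mapping roughly as follows. The mount path here is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// loadKidMapping reads a directory where the KID-mapping ConfigMap is mounted:
// each data key becomes a file named after the KID whose contents are the
// executor cluster's callback URL.
func loadKidMapping(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	mapping := make(map[string]string)
	for _, entry := range entries {
		// Skip the volume's internal bookkeeping entries (e.g. "..data").
		if strings.HasPrefix(entry.Name(), ".") {
			continue
		}
		content, err := os.ReadFile(filepath.Join(dir, entry.Name()))
		if err != nil {
			return nil, err
		}
		mapping[entry.Name()] = strings.TrimSpace(string(content))
	}
	return mapping, nil
}

func main() {
	mapping, err := loadKidMapping("/config/kid-mapping") // hypothetical mount path
	if err != nil {
		panic(err)
	}
	fmt.Println(mapping)
}
```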

### Server Configuration

Three things need to be configured in the Server Config:

- The location of the KID-mapping ConfigMap mounted on the Pod
- The retry timeout for failed Tokens
- The full service account name in the `permissionGroupMapping`

Example Config:
```yaml
applicationConfig:
  auth:
    # ...
    execute_jobs: ["system:serviceaccount:armada:armada-executor"]
```

### Client Configuration

For the Executor authentication, you will need to specify:

- The desired token `expiry`
- The Executor's `namespace` and `serviceAccount`

Example Config:
```yaml
applicationConfig:
  kubernetesNativeAuth:
    # ...
```
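
To make step 1 of the authentication flow concrete, here is a hedged Go sketch (not the Executor's actual code) of requesting a short-lived Service Account Token through the Kubernetes TokenRequest API, using the `armada` namespace and `armada-executor` service account from the server example above; the expiry value is an arbitrary placeholder.

```go
package main

import (
	"context"
	"fmt"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config: the Executor talks to the cluster it runs in.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	expiry := int64(3600) // hypothetical one-hour expiry
	tr, err := clientset.CoreV1().
		ServiceAccounts("armada").
		CreateToken(context.Background(), "armada-executor", &authenticationv1.TokenRequest{
			Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &expiry},
		}, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	// The returned token is what the Executor sends to the Server in an Authorization header.
	fmt.Println(tr.Status.Token)
}
```
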
18 changes: 0 additions & 18 deletions docs/client_libraries.md

This file was deleted.

22 changes: 22 additions & 0 deletions docs/consistency.md
# A note on consistency

The data stream approach taken by Armada is not the only way to maintain consistency across views. Here, we compare this approach with the two other possible solutions.

Armada stores its state across several databases. Whenever Armada receives an API call to update its state, all of those databases need to be updated. However, if each database were updated independently, it would be possible for some of those updates to succeed while others fail, leaving the application in an inconsistent state. It would require complex logic to detect and correct such partial failures, and even with such logic we could not guarantee consistency: if Armada crashes before it has had time to correct a partial failure, the application may remain in an inconsistent state.

There are three commonly used approaches to address this issue:

* Store all state in a single database with support for transactions. Changes are submitted atomically and are rolled back in case of failure; there are no partial failures.
* Distributed transaction frameworks (e.g., X/Open XA), which extend the notion of transactions to operations involving several databases.
* Ordered idempotent updates.

The first approach results in tight coupling between components and would limit us to a single database technology. Adding a new component (e.g., a new dashboard) could break existing components, since all operations that are part of the transaction are rolled back if any one of them fails. The second approach allows us to use multiple databases (as long as they support the distributed transaction framework), but components are still tightly coupled since they have to be part of the same transaction. Further, there are performance concerns associated with both options, since such transactions may not scale easily. Hence, we use the third approach, which we explain next.

First, note that if we can replay the sequence of state transitions that led to the current state, in case of a crash we can recover the correct state by truncating the database and replaying all transitions from the beginning of time. Because operations are ordered, this always results in the same end state. If we also, for each database, store the id of the most recent transition successfully applied to that database, we only need to replay transitions more recent than that. This saves us from having to start over from a clean database; because we know where we left off we can keep going from there. For this to work, we need transactions but not distributed transactions. Essentially, applying a transition already written to the database results in a no-op, i.e., the updates are idempotent (meaning that applying the same update twice has the same effect as applying it once).
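
A minimal Go sketch of this idea, assuming each transition carries a monotonically increasing id and each view records the highest id it has applied; the names and types are illustrative, not Armada's actual code.

```go
package main

import "fmt"

// Transition is a hypothetical ordered state update drawn from the event stream.
type Transition struct {
	ID    int64
	Apply func() // the database write for this transition
}

// View is a hypothetical database view that records the last transition it applied.
type View struct {
	LastApplied int64
}

// ApplyOrdered applies a transition only if it is newer than the last one applied,
// making replays of the stream idempotent for this view.
func (v *View) ApplyOrdered(t Transition) {
	if t.ID <= v.LastApplied {
		return // already applied; replaying it is a no-op
	}
	// In a real system the write and the LastApplied update would happen
	// in a single local (non-distributed) transaction.
	t.Apply()
	v.LastApplied = t.ID
}

func main() {
	v := &View{}
	stream := []Transition{
		{ID: 1, Apply: func() { fmt.Println("apply 1") }},
		{ID: 2, Apply: func() { fmt.Println("apply 2") }},
		{ID: 2, Apply: func() { fmt.Println("apply 2 again") }}, // duplicate: skipped
	}
	for _, t := range stream {
		v.ApplyOrdered(t)
	}
}
```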

The two principal drawbacks of this approach are:

* Eventual consistency: Whereas the first two approaches result in a system that is always consistent, with the third approach, because databases are updated independently, there will be some replication lag during which some part of the state may be inconsistent.
* Timeliness: There is some delay between submitting a change and that change being reflected in the application state.

Working around eventual consistency requires some care, but is not impossible. For example, it is fine for the UI to show a job as "running" for a few seconds after the job has finished before showing "completed". Regarding timeliness, it is not a problem if there is a few seconds' delay between a job being submitted and the job being considered for queueing. However, poor timeliness may lead to clients (i.e., the entity submitting jobs to the system) not being able to read their own writes for some time, which may lead to confusion (i.e., there may be some delay between a client submitting a job and that job showing as "pending"). This issue can be worked around by storing the set of submitted jobs in-memory, either at the client or at the API endpoint, as sketched below.
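
A minimal sketch of that workaround, assuming a hypothetical in-memory record of recently submitted job ids kept at the client or the API endpoint:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// SubmittedJobs is a hypothetical in-memory record of recently submitted job ids,
// consulted while the downstream views may still be lagging behind.
type SubmittedJobs struct {
	mu   sync.Mutex
	seen map[string]time.Time
	ttl  time.Duration
}

func NewSubmittedJobs(ttl time.Duration) *SubmittedJobs {
	return &SubmittedJobs{seen: make(map[string]time.Time), ttl: ttl}
}

func (s *SubmittedJobs) Add(jobID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen[jobID] = time.Now()
}

// RecentlySubmitted reports whether the job was submitted within the ttl,
// so it can be shown as "pending" even if no view reflects it yet.
func (s *SubmittedJobs) RecentlySubmitted(jobID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	t, ok := s.seen[jobID]
	return ok && time.Since(t) < s.ttl
}

func main() {
	cache := NewSubmittedJobs(30 * time.Second)
	cache.Add("job-123")
	fmt.Println(cache.RecentlySubmitted("job-123")) // true until the views catch up
}
```
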
264 changes: 264 additions & 0 deletions docs/developer.md
# Developer Guide

## Introduction

This document is intended for developers who want to contribute to the project. It contains information about the project structure, how to build the project, and how to run the tests.

## TLDR

Want to quickly get Armada running and test it? Install the [Pre-requisites](#pre-requisites) and then run:

```bash
mage localdev minimal testsuite
```

To get the UI running, run:

```bash
mage ui
```

## A note for Devs on Arm / Windows

There is limited information on issues that appear on Arm / Windows Machines when running this setup.

Feel free to create a ticket if you encounter any issues, and link it to the relevant issue:

* https://github.com/armadaproject/armada/issues/2493 (Arm)
* https://github.com/armadaproject/armada/issues/2492 (Windows)


## Design Docs

Please see these documents for more information about Armada's design:

* [Armada Components Diagram](./design/relationships_diagram.md)
* [Armada Architecture](./design/architecture.md)
* [Armada Design](./design/index.md)
* [How Priority Functions](./design/priority.md)
* [Armada Scheduler Design](./design/scheduler.md)

## Other Useful Developer Docs

* [Armada API](./developer/api.md)
* [Running Armada in an EC2 Instance](./developer/aws-ec2.md)
* [Armada UI](./developer/ui.md)
* [Usage Metrics](./developer/usage_metrics.md)
* [Using OIDC with Armada](./developer/oidc.md)
* [Building the Website](./developer/website.md)
* [Using Localdev Manually](./developer/manual-localdev.md)
* [Inspecting and Debugging etcd in Localdev setup](./developer/etc-localdev.md)

## Pre-requisites

- [Go](https://go.dev/doc/install) (version 1.23 or later)
- gcc (for Windows, see, e.g., [tdm-gcc](https://jmeubank.github.io/tdm-gcc/))
- [mage](https://magefile.org/)
- [docker](https://docs.docker.com/get-docker/)
- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)
- [protoc](https://github.com/protocolbuffers/protobuf/releases)
- [helm](https://helm.sh/docs/intro/install/) (version 3.10.0 or later)

## Using Mage

Mage is a build tool that we use to build Armada. It is similar to Make, but written in Go. It is used to build Armada, run tests, and run other useful commands. To see a list of available commands, run `mage -l`.

## LocalDev Setup

LocalDev provides a reliable and extendable way to install Armada as a developer. It runs the following steps:

* Bootstrap the required tools from [tools.yaml](https://github.com/armadaproject/armada/blob/master/tools.yaml)
* Create a local Kubernetes cluster using [kind](https://kind.sigs.k8s.io/)
* Start the dependencies of Armada, including Pulsar, Redis, and Postgres.

**Note: If you edit a proto file, you will also need to run `mage proto` to regenerate the Go code.**

It has the following options to customize the setup further:

* `mage localdev full` - Runs all components of Armada, including the Lookout UI.
* `mage localdev minimal` - Runs only the core components of Armada (such as the API server and an executor).
* `mage localdev no-build` - Skips the build step; set `ARMADA_IMAGE` and `ARMADA_TAG` to choose the Docker image to use.

`mage localdev minimal` is what is used to test the CI pipeline, and is the recommended way to test changes to the core components of Armada.

## Debugging a "port 6443 is already in use" error after running `mage localdev full`

### Identifying the Conflict

Before making any changes, it's essential to identify which port is causing the conflict. Port 6443 is a common source of conflicts. You can check for existing bindings to this port using commands like `netstat` or `lsof`.

1. The `kind.yaml` file is where you define the configuration for your Kind clusters. To resolve port conflicts:

* Open your [kind.yaml](https://github.com/armadaproject/armada/blob/master/e2e/setup/kind.yaml) file.

2. Locate the relevant section where the `hostPort` is set. It may look something like this:


```yaml
- containerPort: 6443 # control plane
  hostPort: 6443 # exposes control plane on localhost:6443
  protocol: TCP
```

* Modify the hostPort value to a port that is not in use on your system. For example:

```yaml
- containerPort: 6443 # control plane
  hostPort: 6444 # exposes control plane on localhost:6444
  protocol: TCP
```
You are not limited to using port 6444; you can choose any available port that doesn't conflict with other services on your system. Select a port that suits your system configuration.

### Testing if LocalDev is working

Running `mage testsuite` will run the full test suite against the localdev cluster. This is the recommended way to test changes to the core components of Armada.

You can also run the same commands yourself:

```bash
go run cmd/armadactl/main.go create queue e2e-test-queue

# To allow Ingress tests to pass
export ARMADA_EXECUTOR_INGRESS_URL="http://localhost"
export ARMADA_EXECUTOR_INGRESS_PORT=5001

go run cmd/testsuite/main.go test --tests "testsuite/testcases/basic/*" --junit junit.xml
```

### Running the UI

In LocalDev, the UI is built separately with `mage ui`. To access it, open http://localhost:8089 in your browser.

For more information see the [UI Developer Guide](./developer/ui.md).


### Choosing components to run

You can set the `ARMADA_COMPONENTS` environment variable to choose which components to run. It is a comma separated list of components to run. For example, to run only the server and executor, you can run:

```bash
export ARMADA_COMPONENTS="server,executor"
```

### Running Pulsar backed scheduler with LocalDev

Ensure your local environment is completely torn down with
```bash
mage LocalDevStop
```

And then run

```bash
mage LocalDev minimal
```

Ensure your local dev environment is completely torn down when switching between pulsar backed and legacy
setups.

If the eventingester or the scheduleringester doesn't come up, just spin it up manually with `docker-compose up`.

## Debugging

The mage target `mage debug` supports multiple methods for debugging, and runs the appropriate parts of localdev as required.

**NOTE: We are actively accepting contributions for more debugging guides!**

It supports the following commands:

* `mage debug vscode` - Runs the server and executor in debug mode, and provides a launch.json file for VSCode.
* `mage debug delve` - Runs the server and executor in debug mode, and starts the Delve debugger.

### VSCode Debugging

After running `mage debug vscode`, you can attach to the running processes using VSCode.
The launch.json file can be found [here](../developer/debug/launch.json).

For using VSCode debugging, see the [VSCode Debugging Guide](https://code.visualstudio.com/docs/editor/debugging).

### Delve Debugging

The delve target creates a new docker-compose file: `./docker-compose.dev.yaml` with the correct volumes, commands and images for debugging.

If you would like to manually create the compose file and run it yourself, you can run the following commands:

```bash
mage createDelveCompose

# You can then start components manually
docker compose -f docker-compose.dev.yaml up -d server executor
```

After running `mage debug delve`, you can attach to the running processes using Delve.

```bash
$ docker compose exec -it server bash
root@3b5e4089edbb:/app# dlv connect :4000
Type 'help' for list of commands.
(dlv) b (*SubmitServer).CreateQueue
Breakpoint 3 set at 0x1fb3800 for github.com/armadaproject/armada/internal/armada/server.(*SubmitServer).CreateQueue() ./internal/armada/server/submit.go:137
(dlv) c
> github.com/armadaproject/armada/internal/armada/server.(*SubmitServer).CreateQueue() ./internal/armada/server/submit.go:140 (PC: 0x1fb38a0)
135: }
136:
=> 137: func (server *SubmitServer) CreateQueue(ctx context.Context, request *api.Queue) (*types.Empty, error) {
138: err := checkPermission(server.permissions, ctx, permissions.CreateQueue)
139: var ep *ErrUnauthorized
140: if errors.As(err, &ep) {
141: return nil, status.Errorf(codes.PermissionDenied, "[CreateQueue] error creating queue %s: %s", request.Name, ep)
142: } else if err != nil {
143: return nil, status.Errorf(codes.Unavailable, "[CreateQueue] error checking permissions: %s", err)
144: }
145:
(dlv)
```

All outputs of delve can be found in the `./delve` directory.

External Debug Port Mappings:

| Armada service | Debug host |
|-----------------|----------------|
| server | localhost:4000 |
| executor | localhost:4001 |
| binoculars | localhost:4002 |
| eventingester | localhost:4003 |
| lookoutui | localhost:4004 |
| lookout | localhost:4005 |
| lookoutingester | localhost:4007 |


## GoLand Run Configurations

We provide a number of run configurations within the `.run` directory of this project. These will be accessible when opening the project in GoLand, allowing you to run Armada in both standard and debug mode.

The following high-level configurations are provided, each composed of sub-configurations:
1. `Armada Infrastructure Services`
- Runs Infrastructure Services required to run Armada, irrespective of scheduler type
2. `Armada (Legacy Scheduler)`
- Runs Armada with the Legacy Scheduler
3. `Armada (Pulsar Scheduler)`
- Runs Armada with the Pulsar Scheduler (recommended)
4. `Lookout UI`
- Script which configures a local UI development setup

A minimal local Armada setup using these configurations would be `Armada Infrastructure Services` and one of (`Armada (Legacy Scheduler)` or `Armada (Pulsar Scheduler)`). Running the `Lookout UI` script on top of this configuration would allow you to develop the Lookout UI live from GoLand, and see the changes visible in your browser. **These configurations (executor specifically) require a kubernetes config in `$PROJECT_DIR$/.kube/internal/config`**

GoLand does not allow us to specify an ordering for services within docker compose configurations. As a result, some database migration services may require rerunning.

## Visual Studio Code debug configurations

We similarly provide run and debug configurations for Visual Studio Code users to run each Armada service and use the debugger provided with VS Code.

The `Armada` configuration performs all required setup - setting up the Kind cluster, spinning up infrastructure services and performing database migrations - and then runs all services.

### Other Debugging Methods

Run `mage debug local` to only spin up the dependencies of Armada, and then run the individual components yourself.

For required environment variables, please see [the Environment Variables Guide](https://github.com/armadaproject/armada/tree/master/developer/env/README.md).

## Finer-Grain Control

If you would like to run the individual mage targets yourself, you can do so.
See the [Manually Running LocalDev](./developer/manual-localdev.md) guide for more information.