Commit 1bb2361

docs: Update README with installation & usage instructions
Signed-off-by: Phoevos Kalemkeris <[email protected]>
Parent: 59ba92d

2 files changed (+159, -10 lines)

README.md

Lines changed: 158 additions & 9 deletions

# CogStack Model Gateway

The CogStack Model Gateway (CMG) is a service that provides a unified interface for accessing
machine learning models deployed as standalone servers. It implements service discovery and enables
scheduling incoming tasks based on their priority, as well as the state of the cluster. The project
is designed to work with [CogStack ModelServe (CMS)](https://github.com/CogStack/CogStack-ModelServe)
model server instances and consists of two main components:

* **Model Gateway**: A RESTful API that provides a unified interface for accessing machine learning
  models deployed as standalone servers. The gateway is responsible for assigning a priority to each
  incoming task and publishing it to a queue for processing. On top of the API endpoints provided by
  CMS, the gateway also exposes endpoints for monitoring the state of submitted tasks and fetching
  their results, as well as for discovering available model servers and deploying new ones from
  previously trained models.
* **Task Scheduler**: A service that schedules queued tasks for execution based on their priority.
  The scheduler is responsible for ensuring that tasks are processed in a timely manner and that the
  cluster is not overloaded.
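
The interplay between the two components can be pictured with a minimal in-memory sketch: the gateway side publishes each task with a priority, and the scheduler side starts queued tasks in priority order while fewer than `max_concurrent` tasks are running (the role played by `CMG_SCHEDULER_MAX_CONCURRENT_TASKS` in the Installation section). This is an illustration only: Python's `heapq` stands in for the RabbitMQ queue the Gateway uses, lower numbers are assumed to mean higher priority, and `QueuedTask` and `PriorityScheduler` are hypothetical names.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedTask:
    """Hypothetical queue entry; instances compare by (priority, seq) only."""

    priority: int  # assumption: a lower value means a higher priority
    seq: int  # tie-breaker that keeps FIFO order within one priority level
    task_id: str = field(compare=False)


class PriorityScheduler:
    """In-memory illustration of priority scheduling with a concurrency cap."""

    def __init__(self, max_concurrent: int) -> None:
        self.max_concurrent = max_concurrent
        self._queue: list[QueuedTask] = []
        self._counter = itertools.count()
        self.running: set[str] = set()

    def publish(self, task_id: str, priority: int) -> None:
        # Gateway side: enqueue the task together with its assigned priority.
        heapq.heappush(self._queue, QueuedTask(priority, next(self._counter), task_id))

    def schedule(self) -> list[str]:
        # Scheduler side: start queued tasks until the concurrency cap is reached.
        started = []
        while self._queue and len(self.running) < self.max_concurrent:
            task = heapq.heappop(self._queue)
            self.running.add(task.task_id)
            started.append(task.task_id)
        return started

    def complete(self, task_id: str) -> None:
        # Free a slot so that the next queued task can be started.
        self.running.discard(task_id)


scheduler = PriorityScheduler(max_concurrent=1)
scheduler.publish("low-priority-task", priority=5)
scheduler.publish("high-priority-task", priority=1)
print(scheduler.schedule())  # ['high-priority-task']
scheduler.complete("high-priority-task")
print(scheduler.schedule())  # ['low-priority-task']
```

The sequence counter matters: without it, two tasks with equal priority would have no defined order, whereas here they are started in the order they were published.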

## Contents

* [Prerequisites](#prerequisites)
* [Installation](#installation)
* [Usage](#usage)
* [Development](#development)

## Prerequisites

To run the CogStack Model Gateway, you need:

* [Docker](https://www.docker.com/), with Docker Compose, installed on the host.
* An instance of the [CogStack ModelServe](https://github.com/CogStack/CogStack-ModelServe) stack,
  including a configured model tracking server (e.g. MLflow). The Gateway uses the external CMS
  network for model discovery and to communicate with the model servers. Make a note of the CMS
  Docker project name and the tracking server URL, which are required to set up the Gateway.
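
As described under Usage, model servers are discovered as Docker containers that carry the `org.cogstack.model-serve` label and whose `com.docker.compose.project` label equals the CMS project name, which is why that name is needed here. A minimal sketch of that filter over plain dictionaries follows; the container records and names are made up, and a real implementation would obtain them from the Docker API rather than from a hard-coded list.

```python
CMS_MODEL_LABEL = "org.cogstack.model-serve"
COMPOSE_PROJECT_LABEL = "com.docker.compose.project"


def discover_model_servers(containers: list[dict], cms_project_name: str) -> list[str]:
    """Return the names of containers that qualify as CMS model servers.

    Each container is a hypothetical record with "name" and "labels" keys.
    """
    servers = []
    for container in containers:
        labels = container.get("labels", {})
        # Keep containers that carry the model-serve label and belong to the CMS project.
        if CMS_MODEL_LABEL in labels and labels.get(COMPOSE_PROJECT_LABEL) == cms_project_name:
            servers.append(container["name"])
    return servers


# Made-up containers: only the first one is a model server in the "cms" project.
containers = [
    {"name": "medcat-snomed", "labels": {CMS_MODEL_LABEL: "", COMPOSE_PROJECT_LABEL: "cms"}},
    {"name": "mlflow-ui", "labels": {COMPOSE_PROJECT_LABEL: "cms"}},
    {"name": "other-model", "labels": {CMS_MODEL_LABEL: "", COMPOSE_PROJECT_LABEL: "dev"}},
]
print(discover_model_servers(containers, "cms"))  # ['medcat-snomed']
```

With the Docker SDK for Python, comparable records could come from `docker.from_env().containers.list()`, whose container objects expose `name` and `labels` attributes.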

## Installation

The CogStack Model Gateway is deployed with Docker Compose and configured through environment
variables. Before deploying the Gateway, set the required variables, either by exporting them in
the shell or by creating a `.env` file in the root directory of the project. The following
variables are required:

* `MLFLOW_TRACKING_URI`: The URI of the MLflow tracking server.
* `CMS_PROJECT_NAME`: The name of the Docker project in which the CogStack ModelServe stack is running.
* `CMG_SCHEDULER_MAX_CONCURRENT_TASKS`: The maximum number of concurrent tasks the scheduler can handle.
* `CMG_DB_USER`: The username for the PostgreSQL database.
* `CMG_DB_PASSWORD`: The password for the PostgreSQL database.
* `CMG_DB_NAME`: The name of the PostgreSQL database.
* `CMG_QUEUE_USER`: The username for the RabbitMQ message broker.
* `CMG_QUEUE_PASSWORD`: The password for the RabbitMQ message broker.
* `CMG_QUEUE_NAME`: The name of the RabbitMQ queue.
* `CMG_OBJECT_STORE_ACCESS_KEY`: The access key for the MinIO object store.
* `CMG_OBJECT_STORE_SECRET_KEY`: The secret key for the MinIO object store.
* `CMG_OBJECT_STORE_BUCKET_TASKS`: The name of the MinIO bucket for storing task payloads.
* `CMG_OBJECT_STORE_BUCKET_RESULTS`: The name of the MinIO bucket for storing task results.
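
Before running Docker Compose, it can help to check that a `.env` file defines every variable listed above. The sketch below assumes a deliberately simple format (one `KEY=value` per line, `#` comments ignored, no quoting or variable expansion); it is not how Docker Compose itself parses the file, and `parse_env` and `missing_vars` are hypothetical helpers.

```python
REQUIRED_VARS = [
    "MLFLOW_TRACKING_URI",
    "CMS_PROJECT_NAME",
    "CMG_SCHEDULER_MAX_CONCURRENT_TASKS",
    "CMG_DB_USER",
    "CMG_DB_PASSWORD",
    "CMG_DB_NAME",
    "CMG_QUEUE_USER",
    "CMG_QUEUE_PASSWORD",
    "CMG_QUEUE_NAME",
    "CMG_OBJECT_STORE_ACCESS_KEY",
    "CMG_OBJECT_STORE_SECRET_KEY",
    "CMG_OBJECT_STORE_BUCKET_TASKS",
    "CMG_OBJECT_STORE_BUCKET_RESULTS",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as URIs stay intact.
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def missing_vars(env: dict[str, str]) -> list[str]:
    # A variable counts as missing if it is absent or set to an empty value.
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

For example, `missing_vars(parse_env(Path(".env").read_text()))` (with `pathlib.Path`) returns an empty list once every required variable is set.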

An example configuration is provided below. It uses the default project name for the CMS stack
(i.e. "cms"), limits the scheduler to one task at a time, uses the internal Docker service name in
the MLflow URI, and sets up the remaining services with sample credentials that satisfy their
respective validation requirements (e.g. the minimum length of the MinIO secret key, and no
underscores in MinIO bucket names). Save the configuration in a `.env` file in the root directory of
the project before running Docker Compose (or export the variables directly in the shell):

```shell
CMS_PROJECT_NAME=cms

CMG_SCHEDULER_MAX_CONCURRENT_TASKS=1

# Postgres
CMG_DB_USER=admin
CMG_DB_PASSWORD=admin
CMG_DB_NAME=cmg_tasks

# RabbitMQ
CMG_QUEUE_USER=admin
CMG_QUEUE_PASSWORD=admin
CMG_QUEUE_NAME=cmg_tasks

# MinIO
CMG_OBJECT_STORE_ACCESS_KEY=admin
CMG_OBJECT_STORE_SECRET_KEY=admin123
CMG_OBJECT_STORE_BUCKET_TASKS=cmg-tasks
CMG_OBJECT_STORE_BUCKET_RESULTS=cmg-results

# MLflow
MLFLOW_TRACKING_URI=http://mlflow-ui:5000
```
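
The MinIO validation requirements mentioned above can also be checked before deployment. The sketch below assumes an 8-character minimum for the secret key and approximates S3-style bucket naming (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit), which rules out underscores. MinIO may enforce further rules not covered here, and `object_store_errors` is a hypothetical helper.

```python
import re

# Approximation of S3-style bucket naming rules, which MinIO follows (not exhaustive).
BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
MIN_SECRET_KEY_LENGTH = 8  # assumed MinIO minimum length for the secret key


def object_store_errors(env: dict[str, str]) -> list[str]:
    """Return a list of problems found in the MinIO-related settings."""
    errors = []
    secret = env.get("CMG_OBJECT_STORE_SECRET_KEY", "")
    if len(secret) < MIN_SECRET_KEY_LENGTH:
        errors.append(f"secret key must be at least {MIN_SECRET_KEY_LENGTH} characters long")
    for var in ("CMG_OBJECT_STORE_BUCKET_TASKS", "CMG_OBJECT_STORE_BUCKET_RESULTS"):
        name = env.get(var, "")
        # fullmatch() rejects underscores, uppercase letters and out-of-range lengths.
        if not BUCKET_NAME_RE.fullmatch(name):
            errors.append(f"{var}={name!r} is not a valid bucket name")
    return errors


example = {
    "CMG_OBJECT_STORE_SECRET_KEY": "admin123",
    "CMG_OBJECT_STORE_BUCKET_TASKS": "cmg-tasks",
    "CMG_OBJECT_STORE_BUCKET_RESULTS": "cmg-results",
}
print(object_store_errors(example))  # []
```

The sample values from the example configuration pass, while a bucket name such as `cmg_tasks` or a secret key such as `admin` would be reported.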

To install the CogStack Model Gateway, clone the repository and run `docker compose` inside its root
directory:

```shell
docker compose -f docker-compose.yaml up
```

This will spin up the following services:

* **Model Gateway**: The main service that provides a RESTful API for accessing machine learning
  models deployed as standalone CMS servers.
* **Task Scheduler**: A service that schedules queued tasks for execution based on their priority.
* **Ripper**: A service responsible for removing model servers deployed through the Gateway that
  have exceeded their TTL.
* **PostgreSQL**: A database used for storing task metadata (e.g. status, result references).
* **RabbitMQ**: A message broker used for task queuing and communication between the Gateway and the
  Scheduler.
* **MinIO**: An object storage service used for storing task results, as well as incoming request
  payloads.
* **pgAdmin**: A web-based interface for managing the PostgreSQL database.
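
The Ripper's rule, combined with the `ttl` semantics of the deployment endpoint described under Usage (86400 seconds by default, `-1` to never delete), can be expressed as a small predicate. This is a sketch assuming each Gateway-deployed server records its deployment time as a timezone-aware UTC timestamp; `is_expired` is a hypothetical name, and the Ripper's actual bookkeeping may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_TTL_SECONDS = 86400  # documented default: one day
PROTECTED_TTL = -1  # documented value that protects a model server from deletion


def is_expired(deployed_at: datetime, ttl: int, now: Optional[datetime] = None) -> bool:
    """Return True if a model server deployed at `deployed_at` has outlived its TTL."""
    if ttl == PROTECTED_TTL:
        return False
    # Compare timezone-aware UTC timestamps to avoid local-time ambiguity.
    if now is None:
        now = datetime.now(timezone.utc)
    return now - deployed_at > timedelta(seconds=ttl)


deployed = datetime(2024, 1, 1, tzinfo=timezone.utc)  # made-up deployment time
print(is_expired(deployed, DEFAULT_TTL_SECONDS, now=deployed + timedelta(days=2)))  # True
print(is_expired(deployed, PROTECTED_TTL, now=deployed + timedelta(days=365)))  # False
```

Accepting `now` as a parameter keeps the check deterministic and easy to test; a periodic loop would call it for each deployed server and remove those that return `True`.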

## Usage

The Gateway exposes two main groups of HTTP endpoints: one for interacting with the model servers
and one for monitoring the state of submitted tasks. The following endpoints are available:

* **Model Servers**: Interact with CMS model servers.

  * `GET /models`: List all available model servers (i.e. Docker containers with the
    `org.cogstack.model-serve` label and the `com.docker.compose.project` label set to
    `$CMS_PROJECT_NAME`).

    * **Query Parameters**:
      * `verbose (bool)`: Include model metadata from the tracking server (if available).

  * `GET /models/{model_server_name}/info`: Get information about a specific model (equivalent to
    the `/info` CMS endpoint).
  * `POST /models/{model_server_name}`: Deploy a new model server from a previously trained model.

    * **Body**:
      * `tracking_id (str)`: The tracking ID of the run that generated the model to serve (e.g. the
        MLflow run ID), used to fetch the model URI (optional if `model_uri` is provided explicitly).
      * `model_uri (str)`: The URI of the model to serve (optional if `tracking_id` is provided).
      * `ttl (int, default=86400)`: The number of seconds after which the deployed model server is
        deleted (defaults to 1 day). Set the TTL to -1 to protect the model server from deletion.

  * `POST /models/{model_server_name}/tasks/{task_name}`: Execute a task on the specified model
    server, providing any query parameters or request body required (this follows the CMS API and
    aims to support the same endpoints).

* **Tasks**: Monitor the state of submitted tasks.

  * `GET /tasks`: List all submitted tasks (currently disabled; it will be enabled once users are
    introduced).
  * `GET /tasks/{task_id}`: Get information about a specific task.

    * **Query Parameters**:
      * `detail (bool)`: Include detailed information about the task (e.g. result reference, error
        message, model tracking ID).
      * `download (bool)`: Download the result of the task (if available).
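
The task endpoints above can be called with any HTTP client. The sketch below only builds the requests with Python's standard `urllib.request` and sends nothing; passing a request to `urllib.request.urlopen` would send it. It assumes the Gateway is reachable at a hypothetical `http://localhost:8888`, that boolean query parameters are accepted as lowercase `true`/`false`, and that the CMS `process` task takes a plain-text body. The model server name, task ID and payload are placeholders.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

GATEWAY_URL = "http://localhost:8888"  # hypothetical Gateway address


def execute_task_request(server: str, task: str, body: bytes, content_type: str) -> Request:
    """Build a POST /models/{model_server_name}/tasks/{task_name} request."""
    url = f"{GATEWAY_URL}/models/{quote(server)}/tasks/{quote(task)}"
    return Request(url, data=body, headers={"Content-Type": content_type}, method="POST")


def task_status_request(task_id: str, detail: bool = False, download: bool = False) -> Request:
    """Build a GET /tasks/{task_id} request with the optional query parameters."""
    params = {"detail": str(detail).lower(), "download": str(download).lower()}
    return Request(f"{GATEWAY_URL}/tasks/{quote(task_id)}?{urlencode(params)}", method="GET")


req = execute_task_request("medcat-snomed", "process", b"Patient has diabetes.", "text/plain")
print(req.get_method(), req.full_url)
# POST http://localhost:8888/models/medcat-snomed/tasks/process
status = task_status_request("1234", detail=True)
print(status.full_url)
# http://localhost:8888/tasks/1234?detail=true&download=false
```

Because task execution is queued, a client would typically submit the task, then poll `GET /tasks/{task_id}` and request `download=true` once the result is available.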

## Development

The project is still under active development. Future work will focus on the following areas:

* **Tests**: Add unit tests for every component of the project (only the `common` package is
  currently tested adequately) and extend the integration tests to cover the training and
  evaluation CMS endpoints.
* **User management**: Introduce users and bind task requests to them, to control access to results
  and generate notifications.
* **Smart scheduling**: Implement a more sophisticated scheduling algorithm that takes into account
  the state of the cluster.
* **CI/CD**: Set up a continuous integration and deployment pipeline for the project.
* **Documentation**: Write detailed documentation for the project, starting with docstrings that
  describe the inner workings of the services.
* **Monitoring**: Integrate with Prometheus and Grafana.

cogstack_model_gateway/gateway/routers/models.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -143,7 +143,7 @@ async def deploy_model(
     ] = None,
     model_uri: Annotated[
         str | None,
-        Body(description="The URI of the model to serve (optional if run_id is provided)"),
+        Body(description="The URI of the model to serve (optional if tracking_id is provided)"),
     ] = None,
     ttl: Annotated[
         int | None,
```
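
The corrected description matches the README's `POST /models/{model_server_name}` body, in which either `tracking_id` or `model_uri` identifies the model to serve. A plain-Python sketch of building such a request (without FastAPI) follows. It assumes the Gateway expects a JSON object with these keys, uses a hypothetical `http://localhost:8888` address, and falls back to the documented default TTL of 86400 seconds; `deploy_model_request` is a hypothetical helper, and the validation inside `deploy_model` itself may differ.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import Request

GATEWAY_URL = "http://localhost:8888"  # hypothetical Gateway address


def deploy_model_request(
    model_server_name: str,
    tracking_id: Optional[str] = None,
    model_uri: Optional[str] = None,
    ttl: int = 86400,
) -> Request:
    """Build (but do not send) a POST /models/{model_server_name} request."""
    # Mirror the documented rule: at least one way of locating the model is needed.
    if tracking_id is None and model_uri is None:
        raise ValueError("either tracking_id or model_uri must be provided")
    body: dict = {"ttl": ttl}  # -1 keeps the model server from being deleted
    if tracking_id is not None:
        body["tracking_id"] = tracking_id
    if model_uri is not None:
        body["model_uri"] = model_uri
    return Request(
        f"{GATEWAY_URL}/models/{quote(model_server_name)}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = deploy_model_request("my-model", tracking_id="abc123", ttl=-1)
print(req.get_method(), req.full_url)  # POST http://localhost:8888/models/my-model
print(json.loads(req.data))  # {'ttl': -1, 'tracking_id': 'abc123'}
```

Rejecting a request with neither identifier on the client side surfaces the mistake early; the Gateway remains the authority on which combinations it accepts.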
