# CogStack Model Gateway

The CogStack Model Gateway (CMG) is a service that provides a unified interface for accessing
machine learning models deployed as standalone servers. It implements service discovery and enables
scheduling incoming tasks based on their priority, as well as the state of the cluster. The project
is designed to work with [CogStack ModelServe (CMS)](https://github.com/CogStack/CogStack-ModelServe)
model server instances and consists of two main components:

* **Model Gateway**: A RESTful API that provides a unified interface for accessing machine learning
  models deployed as standalone servers. The gateway is responsible for assigning a priority to each
  incoming task and publishing it to a queue for processing. On top of the API endpoints provided by
  CMS, the gateway also exposes endpoints for monitoring the state of submitted tasks and fetching
  their results, as well as for discovering available model servers and deploying new ones from
  previously trained models.
* **Task Scheduler**: A service that schedules queued tasks for execution based on their priority.
  The scheduler is responsible for ensuring that tasks are processed in a timely manner and that the
  cluster is not overloaded.

## Contents

* [Prerequisites](#prerequisites)
* [Installation](#installation)
* [Usage](#usage)
* [Development](#development)

## Prerequisites

In order to run the CogStack Model Gateway, you need:

* [Docker](https://www.docker.com/) installed on the host
* An instance of the [CogStack ModelServe](https://github.com/CogStack/CogStack-ModelServe) stack,
  including a configured model tracking server (e.g. MLflow). The Gateway uses the external CMS
  network for model discovery and to communicate with the model servers. You should make a note of
  the CMS Docker project name as well as the tracking server URL, which are required for setting up
  the Gateway.
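
If model discovery does not work as expected, you can check that the CMS containers are visible to
Docker under the expected project name. A quick sketch, assuming the default project name `cms` and
the labels the Gateway filters on:

```shell
# Replace "cms" with your actual CMS Compose project name if it differs.
CMS_PROJECT_NAME=cms

# List running CMS model servers by the labels the Gateway uses for discovery.
docker ps \
  --filter "label=org.cogstack.model-serve" \
  --filter "label=com.docker.compose.project=$CMS_PROJECT_NAME"
```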

## Installation

The CogStack Model Gateway is installed using Docker Compose, while configuration is done through
environment variables. Before deploying the Gateway, make sure to set the required variables either
by exporting them in the shell or by creating a `.env` file in the root directory of the project.
The following variables are required:

* `MLFLOW_TRACKING_URI`: The URI of the MLflow tracking server.
* `CMS_PROJECT_NAME`: The name of the Docker project where the CogStack ModelServe stack is running.
* `CMG_SCHEDULER_MAX_CONCURRENT_TASKS`: The maximum number of concurrent tasks the scheduler can
  handle.
* `CMG_DB_USER`: The username for the PostgreSQL database.
* `CMG_DB_PASSWORD`: The password for the PostgreSQL database.
* `CMG_DB_NAME`: The name of the PostgreSQL database.
* `CMG_QUEUE_USER`: The username for the RabbitMQ message broker.
* `CMG_QUEUE_PASSWORD`: The password for the RabbitMQ message broker.
* `CMG_QUEUE_NAME`: The name of the RabbitMQ queue.
* `CMG_OBJECT_STORE_ACCESS_KEY`: The access key for the MinIO object store.
* `CMG_OBJECT_STORE_SECRET_KEY`: The secret key for the MinIO object store.
* `CMG_OBJECT_STORE_BUCKET_TASKS`: The name of the MinIO bucket for storing task payloads.
* `CMG_OBJECT_STORE_BUCKET_RESULTS`: The name of the MinIO bucket for storing task results.

An example configuration is provided below, using the default project name for the CMS stack (i.e.
"cms"), forcing the scheduler to handle only one task at a time, using the internal Docker service
name in the MLflow URI, and setting up the remaining services with sample credentials that satisfy
their respective validation requirements (e.g. the MinIO secret key minimum length; underscores are
not allowed in MinIO bucket names). Save the configuration in a `.env` file in the root directory
of the project before running Docker Compose (or source it directly in the shell):

```shell
CMS_PROJECT_NAME=cms

CMG_SCHEDULER_MAX_CONCURRENT_TASKS=1

# Postgres
CMG_DB_USER=admin
CMG_DB_PASSWORD=admin
CMG_DB_NAME=cmg_tasks

# RabbitMQ
CMG_QUEUE_USER=admin
CMG_QUEUE_PASSWORD=admin
CMG_QUEUE_NAME=cmg_tasks

# MinIO
CMG_OBJECT_STORE_ACCESS_KEY=admin
CMG_OBJECT_STORE_SECRET_KEY=admin123
CMG_OBJECT_STORE_BUCKET_TASKS=cmg-tasks
CMG_OBJECT_STORE_BUCKET_RESULTS=cmg-results

# MLflow
MLFLOW_TRACKING_URI=http://mlflow-ui:5000
```

To install the CogStack Model Gateway, clone the repository and run `docker compose` inside the
root directory:

```shell
docker compose -f docker-compose.yaml up
```
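
To keep the stack running in the background, you can instead start it in detached mode and then
inspect the service status. A minimal sketch using standard Docker Compose commands:

```shell
COMPOSE_FILE=docker-compose.yaml

# Start the stack in the background
docker compose -f "$COMPOSE_FILE" up -d

# Check that all CMG services are up (and healthy, where health checks are defined)
docker compose -f "$COMPOSE_FILE" ps
```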

This will spin up the following services:

* **Model Gateway**: The main service that provides a RESTful API for accessing machine learning
  models deployed as standalone CMS servers.
* **Task Scheduler**: A service that schedules queued tasks for execution based on their priority.
* **Ripper**: A service responsible for removing model servers deployed through the Gateway that
  have exceeded their TTL.
* **PostgreSQL**: A database used for storing task metadata (e.g. status, result references).
* **RabbitMQ**: A message broker used for task queuing and communication between the Gateway and the
  Scheduler.
* **MinIO**: An object storage service used for storing task results, as well as incoming request
  payloads.
* **pgAdmin**: A web-based interface for managing the PostgreSQL database.

## Usage

The Gateway exposes two main sets of HTTP endpoints, one for interacting with the model servers and
one for monitoring the state of submitted tasks. The following endpoints are available:

* **Model Servers**: Interact with CMS model servers.

  * `GET /models`: List all available model servers (i.e. Docker containers with the
    "org.cogstack.model-serve" label and "com.docker.compose.project" set to `$CMS_PROJECT_NAME`).

    * **Query Parameters**:
      * `verbose (bool)`: Include model metadata from the tracking server (if available).

  * `GET /models/{model_server_name}/info`: Get information about a specific model (equivalent to
    the `/info` CMS endpoint).
  * `POST /models/{model_server_name}`: Deploy a new model server from a previously trained model.

    * **Body**:
      * `tracking_id (str)`: The tracking ID of the run that generated the model to serve (e.g. an
        MLflow run ID), used to fetch the model URI (optional if `model_uri` is provided
        explicitly).
      * `model_uri (str)`: The URI of the model to serve (optional if `tracking_id` is provided).
      * `ttl (int, default=86400)`: The deployed model server will be deleted after TTL seconds
        (defaults to 1 day). Set the TTL to -1 to protect the model server from being deleted.

  * `POST /models/{model_server_name}/tasks/{task_name}`: Execute a task on the specified model
    server, providing any query parameters or request body required (follows the CMS API, striving
    to support the same endpoints).
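
As a sketch, the model server endpoints can be exercised with `curl`. The Gateway address below is
an assumption (check `docker-compose.yaml` for the actual published port), and `medcat-snomed` and
`process` are hypothetical model server and task names:

```shell
GATEWAY_URL=http://localhost:8000  # assumed address; check docker-compose.yaml

# Discover available model servers, including tracking-server metadata
curl -s "$GATEWAY_URL/models?verbose=true"

# Deploy a new model server from a previously trained model, with a TTL of 1 hour
curl -s -X POST "$GATEWAY_URL/models/medcat-snomed" \
  -H "Content-Type: application/json" \
  -d '{"tracking_id": "<mlflow-run-id>", "ttl": 3600}'

# Submit a task to a model server; the response includes a task ID to poll
curl -s -X POST "$GATEWAY_URL/models/medcat-snomed/tasks/process" \
  -H "Content-Type: text/plain" \
  -d "Patient presents with chest pain."
```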

* **Tasks**: Monitor the state of submitted tasks.

  * `GET /tasks`: List all submitted tasks (currently not allowed; will be enabled once users are
    introduced).
  * `GET /tasks/{task_id}`: Get information about a specific task.

    * **Query Parameters**:
      * `detail (bool)`: Include detailed information about the task (e.g. result reference, error
        message, model tracking ID).
      * `download (bool)`: Download the result of the task (if available).
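
A submitted task can likewise be monitored with `curl`. The Gateway address below is an assumption
(check `docker-compose.yaml` for the actual published port), and the task ID is the one returned
when the task was submitted:

```shell
GATEWAY_URL=http://localhost:8000  # assumed address; check docker-compose.yaml
TASK_ID="<task-id-from-submission-response>"

# Poll the task status, including failure details and result references
curl -s "$GATEWAY_URL/tasks/$TASK_ID?detail=true"

# Download the task result once it has completed
curl -s "$GATEWAY_URL/tasks/$TASK_ID?download=true" -o result.out
```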

## Development

The project is still under active development. In the future we will be focusing on the following:

* **Tests**: Add unit tests for every component of the project (only the `common` package is
  currently tested appropriately) and extend the integration tests to cover the training and
  evaluation CMS endpoints.
* **User management**: Introduce users and bind task requests to them, to control access to results
  and generate notifications.
* **Smart scheduling**: Implement a more sophisticated scheduling algorithm that takes into account
  the state of the cluster.
* **CI/CD**: Set up a continuous integration and deployment pipeline for the project.
* **Documentation**: Write detailed documentation for the project, starting with docstrings that
  describe the inner workings of our services.
* **Monitoring**: Integrate with Prometheus and Grafana.