
Commit a4817e3

Merge pull request #320 from open-edge-platform/update-branch
feat: add multiserve microservice (#861)
2 parents 7664670 + c0510f6 commit a4817e3

36 files changed: +7662 −0 lines
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv

# Vscode
.vscode

# Sources
logs
models
engine/*
config.yaml
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Multi-Backend Inference Server

A unified, local inference server for managing, running, and monitoring multiple LLM backends (Llama.cpp, OpenAI-compatible APIs, and more) with a modern web dashboard and tray integration.

## Features

- **Model Management**: Download, verify, start, stop, and delete LLM models from Hugging Face or manually.
- **Multi-Task Support**: Supports text generation, embeddings, reranking, and multimodal models.
- **Device Selection**: Run models on CPU, GPU, or NPU (if supported).
- **Web Dashboard**: Modern UI for status, logs, and model management.
- **Tray App**: System tray integration for quick access and server control.
- **OpenAI Proxy**: Exposes OpenAI-compatible endpoints for easy integration.
- **Cross-Platform**: Windows and Linux support.

## Directory Structure

```
.
├── app.py               # Main FastAPI application entrypoint
├── modules/             # Core Python modules
│   ├── llamacpp/        # Llama.cpp management and GGUF downloader
│   ├── gpu_metrics.py   # XPU/GPU metrics collection
│   ├── tray_app.py      # System tray integration
│   └── utils.py         # Utility functions
├── routers/             # FastAPI routers (API endpoints)
├── engine/              # Native binaries, licenses, and XPU headers
├── static/              # Web dashboard static files
├── tests/               # Example tests
├── config.yaml          # Model/task configuration
├── verified.yaml        # List of verified models
├── pyproject.toml       # Python dependencies
└── README.md            # This file
```

## Quick Start

1. **Install dependencies**
   Python 3.12+ is required. This project uses `uv` for fast dependency management.

   ```sh
   # Install uv (if you don't have it)
   pip install uv

   # Create a virtual environment and install dependencies
   uv sync
   ```

2. **Run the server**

   ```sh
   uv run app.py
   ```

3. **Select the backend (LlamaCPP or OVMS)**

   ```sh
   uv run app.py --backend ovms      # OVMS backend
   uv run app.py --backend llamacpp  # LlamaCPP backend (default)
   ```

4. **Access the dashboard**
   Open [http://127.0.0.1:8000](http://127.0.0.1:8000) in your browser (or query the status endpoint from the command line, as sketched after these steps).

5. **Tray App**
   The tray icon should appear automatically when running on supported platforms.

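As a quick check that the server came up (not part of the original steps), querying the status endpoint listed under API Endpoints should work; the address assumes the default 127.0.0.1:8000 used above, and the response shape is not documented in this diff:

```sh
# Quick sanity check against the default address
curl http://127.0.0.1:8000/v1/status
```
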
## Model Management

- **Download**: Use the dashboard or API to download models by Hugging Face repo ID (a hedged request example follows this list).
- **Start/Stop**: Start or stop models for different tasks (text generation, embeddings, rerank, multimodal).
- **Device Selection**: Choose CPU/GPU/NPU for inference (if available).
- **Logs**: View download and runtime logs in the dashboard.

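For the API route, a request along these lines may work; the JSON field name for the repo ID and the HTTP method are not shown anywhere in this diff, so treat them as assumptions:

```sh
# Hypothetical request; the "repo_id" field name is an assumption, not confirmed by this diff
curl -X POST http://127.0.0.1:8000/v1/download \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "<hf-org>/<model>-GGUF"}'
```
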
## API Endpoints

- **/v1/model**: List available/downloaded models
- **/v1/start**: Start or swap a model
- **/v1/stop**: Stop a running model
- **/v1/download**: Download a model
- **/v1/delete**: Delete a model
- **/v1/status**: Get server and model status
- **/v1/chat/completions**: OpenAI-compatible chat completions (example below)
- **/v1/embeddings**: OpenAI-compatible embeddings
- **/v1/rerank**: OpenAI-compatible reranking

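Because the proxy is OpenAI-compatible, a standard chat-completions request should be accepted; the model name below is a placeholder for whatever model has been started on this server:

```sh
# Standard OpenAI-style request against the local proxy; "<model-name>" is a placeholder
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model-name>",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```
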
## Configuration

- **config.yaml**: Controls active models and default parameters.
- **verified.yaml**: List of models considered "verified" for auto-discovery.

## Building a Standalone App

This project supports PyInstaller for packaging as a standalone executable. See [app.spec](app.spec) for build configuration.

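A typical invocation is sketched below; it assumes PyInstaller is available in the uv-managed environment, which this diff does not confirm:

```sh
# Assumes PyInstaller is installed in the project environment
uv run pyinstaller app.spec
```
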
## Running the App

The app launches in tray mode:

![tray](./images/tray.png)

Clicking **Open Management UI** launches the dashboard in the browser:

![dashboard](./images/dashboard.png)

Clicking **Open API Docs** launches the Swagger API docs in the browser:

![api](./images/api.png)

## License

- Third-party binaries and libraries: See `engine/llama.cpp/` for individual licenses.

---
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os
import sys
# if getattr(sys, 'frozen', False):
#     sys.stdout = open(os.devnull, "w")

import threading
import argparse

from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles
from contextlib import asynccontextmanager

from modules.utils import get_resource_path
from modules.tray_app import InferenceServerTrayApp

from modules.llamacpp.cli import LlamaManagerCLI
from routers.llamacpp_api_router import create_llamacpp_api_router
from routers.llamacpp_openai_proxy_router import create_llamacpp_openai_proxy_router

from modules.ovms.cli import OVMSManagerCLI
from routers.ovms_api_router import create_ovms_api_router
from routers.ovms_openai_proxy_router import create_ovms_openai_proxy_router

argparser = argparse.ArgumentParser()
argparser.add_argument("--backend", default="llamacpp", help="Inference Backend (eg: ovms / llamacpp)")
args = argparser.parse_args()

# Select the backend-specific dashboard page, manager, and routers.
if args.backend == "llamacpp":
    index_file = "index.html"
    manager = LlamaManagerCLI(verified_model_path=get_resource_path("verified.yaml"), models_directory="models/GGUF")
    create_api_router = create_llamacpp_api_router
    create_openai_proxy_router = create_llamacpp_openai_proxy_router
else:
    index_file = "index_ov.html"
    manager = OVMSManagerCLI(verified_model_path=get_resource_path("verified.yaml"), models_directory="models/OV")
    create_api_router = create_ovms_api_router
    create_openai_proxy_router = create_ovms_openai_proxy_router

def start_server():
    manager.start_server()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Run the inference backend in a background thread for the lifetime of the FastAPI app.
    server_thread = threading.Thread(target=start_server)
    server_thread.start()

    yield

    print("FastAPI shutdown: Stopping Inference server thread...")

app = FastAPI(
    title="Inference Server Manager API",
    description="API to control and configure Inference Server (Start, Stop, Download, Config Management).",
    version="0.0.1",
    lifespan=lifespan
)

app.mount("/static", StaticFiles(directory=get_resource_path("static")), name="static")
app.mount("/webfonts", StaticFiles(directory=get_resource_path("static/webfonts")), name="webfonts")

@app.get("/", include_in_schema=False)
async def root():
    html_path = get_resource_path(f"./static/{index_file}")
    return FileResponse(html_path)

api_router = create_api_router(manager)
app.include_router(api_router)
openai_router = create_openai_proxy_router(manager)
app.include_router(openai_router)

if __name__ == "__main__":
    tray_app = InferenceServerTrayApp(app, manager)
    tray_app.start(False)
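The backend wiring above is a simple two-way branch (any value other than "llamacpp" falls through to OVMS). If further backends were added, the same wiring could be collected into a small registry; the sketch below is only an illustration of that refactor, reusing the names already imported in app.py, and is not part of this commit:

```python
# Hypothetical refactor sketch (not in this commit): backend wiring collected in one table.
BACKENDS = {
    "llamacpp": {
        "index_file": "index.html",
        "manager_cls": LlamaManagerCLI,
        "models_directory": "models/GGUF",
        "api_router": create_llamacpp_api_router,
        "proxy_router": create_llamacpp_openai_proxy_router,
    },
    "ovms": {
        "index_file": "index_ov.html",
        "manager_cls": OVMSManagerCLI,
        "models_directory": "models/OV",
        "api_router": create_ovms_api_router,
        "proxy_router": create_ovms_openai_proxy_router,
    },
}

# Fall back to OVMS for unrecognized values, mirroring the else branch in app.py.
cfg = BACKENDS.get(args.backend, BACKENDS["ovms"])
index_file = cfg["index_file"]
manager = cfg["manager_cls"](
    verified_model_path=get_resource_path("verified.yaml"),
    models_directory=cfg["models_directory"],
)
create_api_router = cfg["api_router"]
create_openai_proxy_router = cfg["proxy_router"]
```
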
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# -*- mode: python ; coding: utf-8 -*-
from PyInstaller.utils.hooks import collect_submodules

hiddenimports = collect_submodules('uvicorn')

a = Analysis(
    ['app.py'],
    pathex=[],
    binaries=[('icon.ico', '.'), ('config.yaml', '.'), ('verified.yaml', '.')],
    datas=[('engine', './engine'), ('static', './static')],
    hiddenimports=hiddenimports,
    hookspath=[],
    hooksconfig={},
    runtime_hooks=[],
    excludes=[],
    noarchive=False,
    optimize=0,
)
pyz = PYZ(a.pure)

exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.datas,
    [],
    name='InferenceServerManager',
    debug=False,
    bootloader_ignore_signals=False,
    strip=False,
    upx=True,
    upx_exclude=[],
    runtime_tmpdir=None,
    console=True,
    disable_windowed_traceback=False,
    argv_emulation=False,
    target_arch=None,
    codesign_identity=None,
    entitlements_file=None,
    icon=['icon.ico'],
)
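The `binaries` and `datas` entries above bundle `icon.ico`, the YAML files, `engine/`, and `static/` into the executable so that `get_resource_path` (defined in `modules/utils.py`, which is not shown in this diff) can resolve them at runtime. Its real implementation may differ, but the usual PyInstaller-aware pattern looks roughly like this:

```python
# Assumed sketch of a PyInstaller-aware resource resolver;
# the actual modules/utils.get_resource_path may differ.
import os
import sys

def get_resource_path(relative_path: str) -> str:
    # When frozen by PyInstaller, bundled data files are unpacked under sys._MEIPASS;
    # otherwise, resolve relative to the current working directory.
    base_path = getattr(sys, "_MEIPASS", os.path.abspath("."))
    return os.path.join(base_path, relative_path)
```
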
Binary files added (4.19 KB; 91.9 KB; 188 KB; 57.6 KB) — contents not shown.
Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import subprocess  # nosec - disable B404:import-subprocess check
import json
from pprint import pprint
from typing import List, Dict, Any, Optional

class XpuManager:
    def __init__(self):
        # Maps xpu-smi metric identifiers to friendlier keys (units encoded in the key names).
        self._key_metrics_map = {
            "XPUM_STATS_POWER": "power_W",
            "XPUM_STATS_GPU_UTILIZATION": "gpu_utilization_pct",
            "XPUM_STATS_MEMORY_USED": "memory_used_MB",
            "XPUM_STATS_MEMORY_UTILIZATION": "memory_utilization_pct",
            "XPUM_STATS_GPU_FREQUENCY": "gpu_frequency_MHz",
            "XPUM_STATS_CORE_TEMPERATURE": "core_temperature_C"
        }

    def _run_xpusmi_command(self, command_args: List[str]) -> Optional[str]:
        # Runs the bundled xpu-smi binary and returns its stdout, or None on any failure.
        full_command = ['./engine/xpu-smi/xpu-smi.exe'] + command_args

        try:
            result = subprocess.run(
                full_command,
                capture_output=True,
                text=True,
                check=True,
                timeout=10
            )
            return result.stdout
        except FileNotFoundError:
            print("Error: 'xpu-smi' command not found. Ensure it's installed and in your PATH.")
            return None
        except subprocess.CalledProcessError:
            return None
        except subprocess.TimeoutExpired:
            print("Error: Command timed out.")
            return None

    def _parse_xpu_stats(self, stats_data: Dict[str, Any]) -> Dict[str, Any]:
        if not stats_data or "device_level" not in stats_data:
            return {"error": "Invalid stats data structure."}

        extracted_metrics = {
            "device_id": stats_data.get("device_id"),
        }

        for key in self._key_metrics_map.values():
            extracted_metrics[key] = None

        for metric in stats_data.get("device_level", []):
            metric_type = metric.get("metrics_type")
            metric_value = metric.get("value")

            if metric_type in self._key_metrics_map:
                key = self._key_metrics_map[metric_type]
                if isinstance(metric_value, (float, int)):
                    extracted_metrics[key] = round(metric_value, 2) if isinstance(metric_value, float) else metric_value
                else:
                    extracted_metrics[key] = metric_value

        return extracted_metrics

    def discover_devices(self) -> List[Dict[str, Any]]:
        discovery_stdout = self._run_xpusmi_command(['discovery', '-j'])

        if not discovery_stdout:
            return []

        try:
            discovery_data = json.loads(discovery_stdout)
            return discovery_data.get("device_list", [])
        except json.JSONDecodeError:
            print("Failed to parse discovery JSON.")
            return []

    def get_device_stats(self, device_id: int) -> Optional[Dict[str, Any]]:
        stats_stdout = self._run_xpusmi_command(['stats', '-d', str(device_id), '-j'])

        if stats_stdout:
            try:
                raw_stats = json.loads(stats_stdout)
                return self._parse_xpu_stats(raw_stats)
            except json.JSONDecodeError:
                return None
        return None

    def get_all_device_data(self) -> Dict[int, Dict[str, Any]]:
        # Combines discovery info and parsed stats for every device that reports cleanly.
        devices = self.discover_devices()
        if not devices:
            return {}

        all_data: Dict[int, Dict[str, Any]] = {}

        for device in devices:
            dev_id = device.get('device_id')
            if dev_id is not None:
                stats = self.get_device_stats(dev_id)
                if stats and "error" not in stats:
                    full_info = {**device, **stats}
                    all_data[dev_id] = full_info

        return all_data

if __name__ == "__main__":
    manager = XpuManager()

    print("Initializing XPU Manager and querying all devices...")

    xpu_data = manager.get_all_device_data()

    print("\n" + "=" * 60)
    if xpu_data:
        print(f"Successfully retrieved data for {len(xpu_data)} XPU device(s).")
        print("=" * 60)

        for dev_id, data in xpu_data.items():
            print(f"Device ID {dev_id}: {data.get('device_name', 'N/A')}")
            print(f"  - Power Draw: {data.get('power_W', 'N/A')} W")
            print(f"  - GPU Util:   {data.get('gpu_utilization_pct', 'N/A')}%")
            print(f"  - Mem Util:   {data.get('memory_utilization_pct', 'N/A')}%")
            print(f"  - Mem Used:   {data.get('memory_used_MB', 'N/A')} MB")
            print(f"  - Core Temp:  {data.get('core_temperature_C', 'N/A')}°C")
            print("-" * 60)

        print("\n--- Full Data Dictionary (Combined Discovery and Stats) ---")
        pprint(xpu_data)

    else:
        print("No XPU data could be retrieved. Check your XPU-SMI installation and device status.")
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0