README.md: 9 additions & 5 deletions
@@ -31,7 +31,9 @@ The stack is set up using [Helm](https://helm.sh/docs/), and contains the follow
 - **Request router**: Directs requests to appropriate backends based on routing keys or session IDs to maximize KV cache reuse.
 - **Observability stack**: monitors the metrics of the backends through [Prometheus](https://github.com/prometheus/prometheus) + [Grafana](https://grafana.com/)
 
-<img src="https://github.com/user-attachments/assets/8f05e7b9-0513-40a9-9ba9-2d3acca77c0c" alt="Architecture of the stack" width="800"/>
+<p align="center">
+<img src="https://github.com/user-attachments/assets/8f05e7b9-0513-40a9-9ba9-2d3acca77c0c" alt="Architecture of the stack" width="80%"/>
+</p>
 
 ## Roadmap
 
@@ -86,16 +88,16 @@ The Grafana dashboard provides the following insights:
 7. **GPU KV Cache Hit Rate**: Displays the hit rate for the GPU KV cache.
 
-<img src="https://github.com/user-attachments/assets/05766673-c449-4094-bdc8-dea6ac28cb79" alt="Grafana dashboard to monitor the deployment" width="500"/>
+<p align="center">
+<img src="https://github.com/user-attachments/assets/05766673-c449-4094-bdc8-dea6ac28cb79" alt="Grafana dashboard to monitor the deployment" width="80%"/>
+</p>
 
 ### Configuration
 
-See the details in `observability/README.md`
+See the details in [`observability/README.md`](./observability/README.md)
 
 ## Router
 
-### Overview
-
 The router ensures efficient request distribution among backends. It supports:
 
 - Routing to endpoints that run different models
@@ -106,6 +108,8 @@ The router ensures efficient request distribution among backends. It supports:
 - Session-ID based routing
 - (WIP) prefix-aware routing
 
+Please refer to the [router documentation](./router/README.md) for more details.
+
 ## Contributing
 
 Contributions are welcome! Please follow the standard GitHub flow:
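
For readers unfamiliar with the "Session-ID based routing" item in the README, here is a minimal sketch of the general idea: hashing a session ID so that repeated requests from one session land on the same backend, which is what lets that backend reuse its KV cache. This is not the project's router code. The function name `pick_backend` and the backend URLs are hypothetical, and simple modulo hashing is an assumption chosen for brevity.

```python
import hashlib


def pick_backend(session_id: str, backends: list[str]) -> str:
    """Deterministically map a session ID to one backend (hypothetical helper)."""
    if not backends:
        raise ValueError("no backends available")
    # Use a stable hash (not Python's salted built-in hash()) so the mapping
    # is identical across processes and restarts.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical backend endpoints, for illustration only.
backends = [
    "http://backend-0:8000",
    "http://backend-1:8000",
    "http://backend-2:8000",
]

first = pick_backend("session-abc", backends)
second = pick_backend("session-abc", backends)
# Both calls return the same backend, so the session's cached prefix
# can be reused on the second request.
```

One trade-off of this sketch: with modulo hashing, adding or removing a backend remaps most sessions. A consistent-hashing ring would limit that churn, at the cost of more code.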