Llm-d customizes vLLM & IGW to create a disaggregated serving solution.

IGW has enhanced support for vLLM via llm-d, and broad support for any model server implementing the protocol. More details can be found in [model server integration](https://gateway-api-inference-extension.sigs.k8s.io/implementations/model-servers/).

## Status

This project is in alpha. The latest release can be found [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest). It should not be used in production yet.

## Getting Started

Follow our [Getting Started Guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to get the inference-extension up and running on your cluster!

See [our website](https://gateway-api-inference-extension.sigs.k8s.io/) for detailed API documentation on leveraging our Kubernetes-native declarative APIs.
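As a rough illustration of those declarative APIs, the sketch below registers a pool of model server pods and a model served from that pool. It assumes the `v1alpha2` alpha API surface; the resource names, labels, and port are hypothetical placeholders, so treat the Getting Started Guide and the API reference as authoritative.

```yaml
# Illustrative sketch only -- resource names, labels, and ports are placeholders.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: my-model-pool              # hypothetical pool name
spec:
  targetPortNumber: 8000           # port the model server pods listen on
  selector:
    app: my-model-server           # labels selecting the model server pods
  extensionRef:
    name: my-model-pool-epp        # endpoint picker service for this pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: my-adapter
spec:
  modelName: my-adapter            # model name clients put in their requests
  criticality: Standard            # priority band for this model's traffic
  poolRef:
    name: my-model-pool            # pool that serves this model
```

An HTTPRoute on your Gateway would then reference the InferencePool as a backend so inference traffic flows through the endpoint picker; the Getting Started Guide walks through a complete, current example.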
## Roadmap

As Inference Gateway builds towards a GA release, we will continue to expand our capabilities, namely:

1. Prefix-cache aware load balancing with interfaces for remote caches
1. Recommended LoRA adapter pipeline for automated rollout
1. Fairness and priority between workloads within the same criticality band
1. HPA support for autoscaling on aggregate metrics derived from the load balancer
1. Support for large multi-modal inputs and outputs
1. Support for other GenAI model types (diffusion and other non-completion protocols)
1. Heterogeneous accelerators - serve workloads on multiple types of accelerator using latency and request cost-aware load balancing
1. Disaggregated serving support with independently scaling pools

## End-to-End Tests

Follow this [README](./test/e2e/epp/README.md) to learn more about running the inference-extension end-to-end test suite on your cluster.