Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/check-typos.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -13,5 +13,5 @@ jobs:
uses: actions/checkout@v5

- name: Check typos
uses: crate-ci/typos@v1.36.2
uses: crate-ci/typos@v1.38.1

25 changes: 25 additions & 0 deletions deploy/config/sim-epp-no-hit-lru.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Sample EPP configuration for running without P/D
# with small hash block size for simulation purposes
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer
parameters:
hashBlockSize: 5
maxPrefixBlocksToMatch: 256
lruCapacityPerServer: 31250
- type: no-hit-lru-scorer
parameters:
lruSize: 2048
- type: decode-filter
- type: max-score-picker
- type: single-profile-handler
schedulingProfiles:
- name: default
plugins:
- pluginRef: decode-filter
- pluginRef: max-score-picker
- pluginRef: prefix-cache-scorer
weight: 2
- pluginRef: no-hit-lru-scorer
weight: 1
51 changes: 51 additions & 0 deletions docs/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -364,6 +364,57 @@ used for the same session.

---

#### NoHitLRUScorer

Scores pods based on least recently used (LRU) ordering for cold requests (requests with no KV cache hits).
This helps evenly distribute cache growth across pods, since cold requests result in new KV blocks being created.

The scorer integrates with a prefix cache plugin to determine if a request has cache hits:
- For cold requests (no cache hits): Ranks pods by LRU order, with never-used or least recently used pods
receiving higher scores (up to 1.0) and most recently used pods receiving lower scores (approaching 0.0)
- For warm requests (cache hits): Returns neutral scores (0.5) for all pods to avoid interfering with
cache locality optimization

The LRU tracking is specific to cold requests only - pods are added to the LRU cache when they serve
a cold request, not when they serve requests with cache hits.

- **Type**: `no-hit-lru-scorer`
- **Parameters**:
- `prefixPluginName` (optional): The name of the prefix cache plugin to read state from. Defaults to `prefix-cache-scorer`.
- `lruSize` (optional): The maximum number of pods to track in the LRU cache. Defaults to 1024.

Example configuration:

```yaml
plugins:
- type: prefix-cache-scorer
parameters:
hashBlockSize: 5
maxPrefixBlocksToMatch: 256
lruCapacityPerServer: 31250
- type: no-hit-lru-scorer
parameters:
lruSize: 2048
- type: decode-filter
- type: max-score-picker
- type: single-profile-handler
schedulingProfiles:
- name: default
plugins:
- pluginRef: decode-filter
- pluginRef: max-score-picker
- pluginRef: prefix-cache-scorer
weight: 2
- pluginRef: no-hit-lru-scorer
weight: 1
```

**Note:** This scorer is designed to work alongside a prefix cache scorer (such as `prefix-cache-scorer` or
`precise-prefix-cache-scorer`). If no prefix cache state is available, all requests are treated as cold.
When integrating with a prefix-cache scorer, the prefix-cache scorer should be defined first in the scheduling profile.

---

### Sample Disaggregated Prefill/Decode Configuration

The following is an example of what a configuration for disaggregated Prefill/Decode might look like:
Expand Down
89 changes: 45 additions & 44 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -8,20 +8,21 @@ require (
github.com/go-logr/logr v1.4.3
github.com/google/go-cmp v0.7.0
github.com/google/uuid v1.6.0
github.com/hashicorp/golang-lru/v2 v2.0.7
github.com/jellydator/ttlcache/v3 v3.4.0
github.com/llm-d/llm-d-kv-cache-manager v0.3.1
github.com/onsi/ginkgo/v2 v2.25.3
github.com/llm-d/llm-d-kv-cache-manager v0.3.2
github.com/onsi/ginkgo/v2 v2.26.0
github.com/onsi/gomega v1.38.2
github.com/openai/openai-go v1.12.0
github.com/stretchr/testify v1.11.1
google.golang.org/grpc v1.75.1
google.golang.org/grpc v1.76.0
k8s.io/api v0.34.1
k8s.io/apiextensions-apiserver v0.34.1
k8s.io/apimachinery v0.34.1
k8s.io/client-go v0.34.1
sigs.k8s.io/controller-runtime v0.22.1
sigs.k8s.io/gateway-api v1.3.0
sigs.k8s.io/gateway-api-inference-extension v1.0.0
sigs.k8s.io/controller-runtime v0.22.3
sigs.k8s.io/gateway-api v1.4.0
sigs.k8s.io/gateway-api-inference-extension v0.0.0-20251016181044-831a919943ba
)

require (
Expand All @@ -30,7 +31,7 @@ require (
github.com/antlr4-go/antlr/v4 v4.13.0 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/blang/semver/v4 v4.0.0 // indirect
github.com/cenkalti/backoff/v5 v5.0.2 // indirect
github.com/cenkalti/backoff/v5 v5.0.3 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/cncf/xds/go v0.0.0-20250501225837-2ac532fd4443 // indirect
github.com/daulet/tokenizers v1.22.1 // indirect
Expand All @@ -39,51 +40,49 @@ require (
github.com/dgraph-io/ristretto/v2 v2.3.0 // indirect
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f // indirect
github.com/dustin/go-humanize v1.0.1 // indirect
github.com/emicklei/go-restful/v3 v3.12.2 // indirect
github.com/envoyproxy/go-control-plane/envoy v1.32.4 // indirect
github.com/emicklei/go-restful/v3 v3.13.0 // indirect
github.com/envoyproxy/go-control-plane/envoy v1.35.0 // indirect
github.com/envoyproxy/protoc-gen-validate v1.2.1 // indirect
github.com/evanphx/json-patch/v5 v5.9.11 // indirect
github.com/felixge/httpsnoop v1.0.4 // indirect
github.com/fsnotify/fsnotify v1.9.0 // indirect
github.com/fxamacker/cbor/v2 v2.9.0 // indirect
github.com/go-logr/stdr v1.2.2 // indirect
github.com/go-logr/zapr v1.3.0 // indirect
github.com/go-openapi/jsonpointer v0.21.0 // indirect
github.com/go-openapi/jsonpointer v0.21.2 // indirect
github.com/go-openapi/jsonreference v0.21.0 // indirect
github.com/go-openapi/swag v0.23.0 // indirect
github.com/go-openapi/swag v0.23.1 // indirect
github.com/go-task/slim-sprig/v3 v3.0.0 // indirect
github.com/gogo/protobuf v1.3.2 // indirect
github.com/google/btree v1.1.3 // indirect
github.com/google/cel-go v0.26.0 // indirect
github.com/google/gnostic-models v0.7.0 // indirect
github.com/google/pprof v0.0.0-20250607225305-033d6d78b36a // indirect
github.com/google/pprof v0.0.0-20250820193118-f64d9cf942d6 // indirect
github.com/gorilla/websocket v1.5.4-0.20250319132907-e064f32e3674 // indirect
github.com/grafana/regexp v0.0.0-20240518133315-a468a5bfb3bc // indirect
github.com/grpc-ecosystem/grpc-gateway/v2 v2.26.3 // indirect
github.com/hashicorp/golang-lru/v2 v2.0.7 // indirect
github.com/grpc-ecosystem/grpc-gateway/v2 v2.27.2 // indirect
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/josharian/intern v1.0.0 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/mailru/easyjson v0.9.0 // indirect
github.com/moby/spdystream v0.5.0 // indirect
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/modern-go/reflect2 v1.0.3-0.20250322232337-35a7c28c31ee // indirect
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f // indirect
github.com/pebbe/zmq4 v1.4.0 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/planetscale/vtprotobuf v0.6.1-0.20240319094008-0393e58bdf10 // indirect
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
github.com/prometheus/client_golang v1.23.0 // indirect
github.com/prometheus/client_golang v1.23.2 // indirect
github.com/prometheus/client_model v0.6.2 // indirect
github.com/prometheus/common v0.65.0 // indirect
github.com/prometheus/procfs v0.16.1 // indirect
github.com/prometheus/prometheus v0.305.0 // indirect
github.com/prometheus/common v0.67.1 // indirect
github.com/prometheus/procfs v0.17.0 // indirect
github.com/prometheus/prometheus v0.306.0 // indirect
github.com/redis/go-redis/v9 v9.11.0 // indirect
github.com/spf13/cobra v1.9.1 // indirect
github.com/spf13/pflag v1.0.6 // indirect
github.com/spf13/pflag v1.0.7 // indirect
github.com/stoewer/go-strcase v1.3.0 // indirect
github.com/tidwall/gjson v1.14.4 // indirect
github.com/tidwall/gjson v1.18.0 // indirect
github.com/tidwall/match v1.1.1 // indirect
github.com/tidwall/pretty v1.2.1 // indirect
github.com/tidwall/sjson v1.2.5 // indirect
Expand All @@ -92,42 +91,44 @@ require (
github.com/x448/float16 v0.8.4 // indirect
go.opentelemetry.io/auto/sdk v1.1.0 // indirect
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.61.0 // indirect
go.opentelemetry.io/otel v1.37.0 // indirect
go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.36.0 // indirect
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.36.0 // indirect
go.opentelemetry.io/otel/metric v1.37.0 // indirect
go.opentelemetry.io/otel/sdk v1.37.0 // indirect
go.opentelemetry.io/otel/trace v1.37.0 // indirect
go.opentelemetry.io/proto/otlp v1.6.0 // indirect
go.opentelemetry.io/otel v1.38.0 // indirect
go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.38.0 // indirect
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.38.0 // indirect
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.38.0 // indirect
go.opentelemetry.io/otel/metric v1.38.0 // indirect
go.opentelemetry.io/otel/sdk v1.38.0 // indirect
go.opentelemetry.io/otel/trace v1.38.0 // indirect
go.opentelemetry.io/proto/otlp v1.7.1 // indirect
go.uber.org/atomic v1.11.0 // indirect
go.uber.org/automaxprocs v1.6.0 // indirect
go.uber.org/multierr v1.11.0 // indirect
go.uber.org/zap v1.27.0 // indirect
go.yaml.in/yaml/v2 v2.4.2 // indirect
go.yaml.in/yaml/v2 v2.4.3 // indirect
go.yaml.in/yaml/v3 v3.0.4 // indirect
golang.org/x/exp v0.0.0-20250106191152-7588d65b2ba8 // indirect
golang.org/x/net v0.43.0 // indirect
golang.org/x/oauth2 v0.30.0 // indirect
golang.org/x/sync v0.16.0 // indirect
golang.org/x/sys v0.35.0 // indirect
golang.org/x/term v0.34.0 // indirect
golang.org/x/text v0.28.0 // indirect
golang.org/x/mod v0.28.0 // indirect
golang.org/x/net v0.44.0 // indirect
golang.org/x/oauth2 v0.31.0 // indirect
golang.org/x/sync v0.17.0 // indirect
golang.org/x/sys v0.36.0 // indirect
golang.org/x/term v0.35.0 // indirect
golang.org/x/text v0.29.0 // indirect
golang.org/x/time v0.12.0 // indirect
golang.org/x/tools v0.36.0 // indirect
golang.org/x/tools v0.37.0 // indirect
gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20250707201910-8d1bb00bc6a7 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20250707201910-8d1bb00bc6a7 // indirect
google.golang.org/protobuf v1.36.7 // indirect
gopkg.in/evanphx/json-patch.v4 v4.12.0 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20250825161204-c5933d9347a5 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20250826171959-ef028d996bc1 // indirect
google.golang.org/protobuf v1.36.10 // indirect
gopkg.in/evanphx/json-patch.v4 v4.13.0 // indirect
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
k8s.io/apiserver v0.34.1 // indirect
k8s.io/component-base v0.34.1 // indirect
k8s.io/klog/v2 v2.130.1 // indirect
k8s.io/kube-openapi v0.0.0-20250710124328-f3f2b991d03b // indirect
k8s.io/utils v0.0.0-20250604170112-4c0f3b243397 // indirect
k8s.io/kube-openapi v0.0.0-20250814151709-d7b6acb124c3 // indirect
k8s.io/utils v0.0.0-20250820121507-0af2bda4dd1d // indirect
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.2 // indirect
sigs.k8s.io/json v0.0.0-20241014173422-cfa47c3a1cc8 // indirect
sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 // indirect
sigs.k8s.io/randfill v1.0.0 // indirect
sigs.k8s.io/structured-merge-diff/v6 v6.3.0 // indirect
sigs.k8s.io/yaml v1.6.0 // indirect
Expand Down
Loading