# Data Layer Architecture Proposal

Author(s): @elevran @nirrozenbaum

## Proposal Status

***Draft***

## Summary

The EPP Architecture proposal identifies the need for an extensible
[Data Layer](../0683-epp-architecture-proposal/README.md#data-layer).
Recently, the scheduling subsystem underwent a major [architecture change](../0845-scheduler-architecture-proposal/README.md)
to allow easier extension and pluggability. This proposal aims to apply
similar extensibility to the Data Layer subsystem, allowing custom inference
gateways to extend the Gateway API Inference Extension (GIE) for their use
cases without modifying the core GIE code base.

See [this document](https://docs.google.com/document/d/1eCCuyB_VW08ik_jqPC1__z6FzeWO_VOlPDUpN85g9Ww/edit?usp=sharing) for additional context and reference.

## Goals

The Data Layer pluggability effort aims to address the following goals and
requirements:

- Make endpoint attributes used by GIE components accessible via well-defined
  Data Layer interfaces.
- Enable collection of an additional (or different) subset of attributes from an
  existing data source (e.g., the `/metrics` endpoint scraper).
- Add a new data source that collects attributes not already collected.
- Follow best practices and experience from the Scheduling subsystem
  pluggability effort. For example, extending the system to support the above
  should be done by implementing well-defined Plugin interfaces and registering
  them with the GIE Data Layer subsystem; any configuration would be done in the
  same way (e.g., code and/or a configuration file), etc.
- Be efficient (RAM, CPU, concurrency) in collecting and storing attributes.
- Limit the change blast radius in GIE when making the above changes. Core GIE code
  should not need to be modified in order to support collecting and storing new
  attributes. Affected code should be scoped only to modules that make use of
  the new attributes.
- The extensions should not increase coupling between GIE subsystems and
  Kubernetes (i.e., environment-specific code should be encapsulated and
  not “leaked” into the subsystem and its users).
- (Future) Allow non-uniform data collection (i.e., not all endpoints share the
  same data).

## Non-Goals

- Modify existing GIE abstractions, such as `InferencePool`, to conform to the
  Data Layer pluggability design. They are to remain first class concepts, as
  today.
- Enable reconciliation or modification of external state. The data sources are
  strictly read-only. For example, a data source accessing Kubernetes state as
  part of data collection would be registered for `Watch()` notifications and
  shall not receive access to a k8s client.
- Inference scheduler plugins that rely on custom data collection accept that
  the [Model Server Protocol](../003-model-server-protocol/README.md) no longer
  provides guarantees on out-of-the-box portability of a model server.

## Proposal

### Overview

There are two existing Data Sources in the Data Layer: a Pod reconciler that
collects Pod IP address(es) and labels, copying them to endpoint attributes,
and a metrics scraper that collects a defined set of metric values from the
`/metrics` endpoint of each Pod. Note that the `InferencePool` reconciler is
*not* considered part of the Data Layer.

### Components

The proposal is to make the Data Layer more extensible by introducing
two interfaces:

- An **Attribute Collection** plugin interface, responsible for extracting relevant
  attributes from a data source and storing them in the Data Layer for consumption
  by other components. The plugin can be registered with existing or new
  *Data Sources* (see below), and sources would call their registered plugins
  periodically or on change to process attributes.
- A **Data Source** plugin interface that can be added to an inference gateway
  system, and on which *Attribute Collection* plugins can be registered to enrich
  the data model.

### Implementation Phases

In order to make iterative progress and validate the design along the way, we
propose to implement and evolve the Data Layer extensibility over several
phases:

1. Extend the backend, per-endpoint storage with a map from a name (i.e., the
   attribute collection interface) to the data it collected. Existing attributes,
   such as IP address or Pod labels, are not modified (see the sketch after this
   list).
1. Introduce a Data Source registry where new data sources can be registered, and
   bootstrap it by wrapping the existing `/metrics` scraper with a Data Source API.
   At this point, the metrics scraping code implements only the `Data Source`
   interface and the `Data Collection` interface is not used/exposed.
1. Refactor the metrics scraping code into separate Data Source and Data Collection
   plugin interfaces.
1. Following that, and based on any lessons learnt, we’ll refactor the existing
   Kubernetes Pod reconciliation loop to the new plugin interfaces.
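
A minimal sketch of the phase 1 storage extension is shown below. The package
name, `EndpointAttributes` type, field names, and locking strategy are
illustrative assumptions rather than the final implementation; the point is
only that collector-owned data sits in a name-keyed map next to the existing,
unchanged attributes.

```go
// Sketch only: extends per-endpoint storage with a map keyed by collector
// (attribute collection plugin) name. All names here are assumptions.
package datalayer

import (
	"fmt"
	"sync"
)

// EndpointAttributes keeps existing attributes unchanged (illustrative subset)
// and adds a map of collector-owned data.
type EndpointAttributes struct {
	mu sync.RWMutex

	// Existing attributes remain as-is.
	Address string
	Labels  map[string]string

	// data maps a collector name to whatever that collector extracted.
	data map[string]interface{}
}

// StoreAttributes records data on behalf of the named collector.
func (e *EndpointAttributes) StoreAttributes(collector string, data interface{}) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.data == nil {
		e.data = make(map[string]interface{})
	}
	e.data[collector] = data
	return nil
}

// GetAttributes returns the data previously stored by the named collector.
func (e *EndpointAttributes) GetAttributes(collector string) (interface{}, error) {
	e.mu.RLock()
	defer e.mu.RUnlock()
	d, ok := e.data[collector]
	if !ok {
		return nil, fmt.Errorf("no attributes stored for collector %q", collector)
	}
	return d, nil
}
```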

### Suggested Data Layer Plugin Interfaces

```go
// DataCollection consumes data updates from sources and stores
// them in the data layer for consumption by other components.
// The plugin should not assume a deterministic invocation behavior beyond
// "the data layer believes the state should be updated".
type DataCollection interface {
	// Extract is called by data sources with (possibly) updated
	// data per endpoint. Extracted attributes are added to the
	// Endpoint.
	Extract(ep Endpoint, data interface{}) error // or Collect?
}

// Endpoint allows setting and retrieving of attributes
// by a data collector.
// Note that the actual endpoint structure would be something like (pseudocode)
//
//	type EndpointState struct {
//		address
//		...
//		data map[string]interface{}
//	}
//
// The plugin interface would only mutate the `data` map.
type Endpoint interface {
	// StoreAttributes sets the data for the Endpoint on behalf
	// of the named collection plugin.
	StoreAttributes(collector string, data interface{}) error

	// GetAttributes retrieves the attributes of the named collection
	// plugin for the Endpoint.
	GetAttributes(collector string) (interface{}, error)
}

// DataLayerSourcesRegistry includes the list of available
// Data Sources (interface defined below) in the system.
// It is accompanied by functions (not shown) to register
// and retrieve sources.
type DataLayerSourcesRegistry map[string]DataSource

// DataSource represents a data source that tracks
// pods/resources and notifies data collection plugins to
// extract relevant attributes.
type DataSource interface {
	// Type of data available from this source.
	Type() string

	// Start begins the data collection and notification loop.
	Start(ctx context.Context) error

	// Stop terminates data collection.
	Stop() error

	// Subscribe a collector to receive updates for tracked endpoints.
	Subscribe(collector DataCollection) error

	// UpdateEndpoints replaces the set of pods/resources tracked by
	// this source.
	// Alternative: add/remove individual endpoints?
	UpdateEndpoints(epIDs []string) error
}
```
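
To illustrate how these pieces are intended to fit together, the following is
a hedged usage sketch: a hypothetical vLLM metrics collector subscribes to a
hypothetical `/metrics` data source looked up from the registry, and the
source later invokes `Extract` per endpoint. The package name, the concrete
collector type, the `"metrics"` registry key, and the shape of the `data`
payload are all assumptions, not part of this proposal.

```go
// Sketch only: wiring a hypothetical collector to a hypothetical data source
// using the interfaces suggested above (assumed to live in this package).
package datalayer

import (
	"context"
	"fmt"
)

// vllmMetricsCollector is a hypothetical Attribute Collection plugin that
// stores a parsed metrics payload under its own name.
type vllmMetricsCollector struct{}

// Extract stores the payload handed over by the data source on the Endpoint.
func (c *vllmMetricsCollector) Extract(ep Endpoint, data interface{}) error {
	metrics, ok := data.(map[string]float64) // payload shape is source-specific (assumption)
	if !ok {
		return fmt.Errorf("unexpected data type %T", data)
	}
	return ep.StoreAttributes("vllm-metrics", metrics)
}

// registerVLLMCollector subscribes the collector to a data source assumed to
// have been registered under the name "metrics", then starts the source's
// collection/notification loop.
func registerVLLMCollector(ctx context.Context, sources DataLayerSourcesRegistry) error {
	src, ok := sources["metrics"]
	if !ok {
		return fmt.Errorf("metrics data source not registered")
	}
	if err := src.Subscribe(&vllmMetricsCollector{}); err != nil {
		return err
	}
	return src.Start(ctx) // triggers periodic scraping and Extract() callbacks
}
```

A consumer (e.g., a scheduler plugin) would later read the stored value with
`ep.GetAttributes("vllm-metrics")` and assert it to the expected type.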

## Open Questions

1. Type safety in extensible data collection: `map[string]interface{}` seems
   like the simplest option to start with, but we may want to evolve it to support
   type safety using generics or codegen (see the sketch after this list).
1. Should we design a separate interface specifically for k8s object watching
   under GIE control, or do we want these to be managed as yet another data source?
   This affects the design (e.g., who owns the k8s caches, clients, etc.).
   With a GIE controlled data source, collectors just register the types (and
   other constraints? Labels, namespaces, …) with GIE core, and all k8s
   functionality is under GIE control.
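
For the first question, one possible direction is a thin generic accessor
layered over the untyped `GetAttributes` call, leaving the suggested
interfaces unchanged. This is a sketch under the assumption that collectors
and consumers agree on the stored Go type; the helper name is hypothetical.

```go
// Sketch only: a type-safe wrapper over the untyped Endpoint accessors,
// built on the interfaces suggested above (assumed to be in this package).
package datalayer

import "fmt"

// AttributesOf retrieves the data stored by the named collector and asserts
// it to the expected type T, turning silent misuse into an explicit error.
func AttributesOf[T any](ep Endpoint, collector string) (T, error) {
	var zero T
	raw, err := ep.GetAttributes(collector)
	if err != nil {
		return zero, err
	}
	typed, ok := raw.(T)
	if !ok {
		return zero, fmt.Errorf("collector %q stored %T, expected %T", collector, raw, zero)
	}
	return typed, nil
}

// Hypothetical usage: metrics, err := AttributesOf[map[string]float64](ep, "vllm-metrics")
```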
