# Data Layer Architecture Proposal

Author(s): @elevran @nirrozenbaum

## Proposal Status

***Draft***
## Summary

The EPP Architecture proposal identifies the need for an extensible
[Data Layer](../0683-epp-architecture-proposal/README.md#data-layer).
Recently, the scheduling subsystem underwent a major [architecture change](../0845-scheduler-architecture-proposal/README.md)
to allow easier extension and pluggability. This proposal aims to apply
similar extensibility to the Data Layer subsystem, allowing custom inference
gateways to extend the Gateway API Inference Extension (GIE) for their use
cases without modifying the core GIE code base.

See [this document](https://docs.google.com/document/d/1eCCuyB_VW08ik_jqPC1__z6FzeWO_VOlPDUpN85g9Ww/edit?usp=sharing) for additional context and reference.
| 20 | + |
## Goals

The Data Layer pluggability effort aims to address the following goals and
requirements:

- Make endpoint attributes used by GIE components accessible via well-defined
  Data Layer interfaces.
- Enable collection of an additional (or different) subset of attributes from an
  existing data source (e.g., the `/metrics` endpoint scraper).
- Add new data sources that collect attributes not already collected.
- Follow best practices and experience from the Scheduling subsystem
  pluggability effort. For example, extending the system to support the above
  should be done by implementing well-defined plugin interfaces and registering
  them in the GIE Data Layer subsystem; any configuration would be done in the
  same way (e.g., code and/or configuration file), etc.
- Be efficient (RAM, CPU, concurrency) in collecting and storing attributes.
- Limit the change blast radius in GIE when making the above changes. Core GIE
  code should not need to be modified in order to support collecting and storing
  new attributes. Affected code should be scoped only to modules that make use
  of the new attributes.
- The extensions should not increase coupling between GIE subsystems and
  Kubernetes (i.e., environment-specific code should be encapsulated and
  not “leaked” into the subsystem and its users).
- (Future) Allow non-uniform data collection (i.e., not all endpoints share the
  same data).
| 46 | + |
## Non-Goals

- Modify existing GIE abstractions, such as `InferencePool`, to conform to the
  Data Layer pluggability design. They are to remain first-class concepts, as
  today.
- Enable reconciliation or modification of external state. The data sources are
  strictly read-only. For example, a data source accessing Kubernetes state as
  part of data collection would be registered for `Watch()` notifications and
  shall not receive access to a k8s client.
- Inference scheduler plugins that rely on custom data collection accept that
  the [Model Server Protocol](../003-model-server-protocol/README.md) no longer
  provides guarantees on portability of a model server out of the box.
| 59 | + |
## Proposal

### Overview

There are two existing Data Sources in the Data Layer: a Pod reconciler that
collects Pod IP address(es) and labels, copying them to endpoint attributes,
and a metrics scraper that collects a defined set of metric values from the
`/metrics` endpoint of each Pod. Note that the `InferencePool` reconciler is
*not* considered part of the Data Layer.
| 69 | + |
### Components

The proposal is to make the Data Layer more extensible by introducing
these two interfaces:

- An **Attribute Collection** plugin interface responsible for extracting relevant
  attributes from a data source and storing them in the Data Layer for consumption
  by other components. The plugin can be registered with existing or new
  *Data Sources* (see below), and sources would call their registered plugins
  periodically or on change to process attributes.
- A **Data Source** plugin interface that can be added to an inference gateway
  system, and on which *Attribute Collection* plugins can be registered to enrich
  the data model.

### Implementation Phases

In order to make iterative progress and validate the design along the way, we
propose to implement and evolve the Data Layer extensibility over several
phases:

1. Extend the per-endpoint backend storage with a map from a name (i.e., the
   attribute collection plugin) to the data it collected. Existing attributes,
   such as IP address or Pod labels, are not modified.
1. Introduce a Data Source registry where new data sources can be registered, and
   bootstrap it by wrapping the existing `/metrics` scraper with a Data Source
   API. At this point, the metrics scraping code implements only the `Data Source`
   interface, and the `Data Collection` interface is not used/exposed.
1. Refactor the metrics scraping code into separate Data Source and Data Collection
   plugin interfaces.
1. Following that, and based on any lessons learnt, refactor the existing
   Kubernetes Pod reconciliation loop to the new plugin interfaces.

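As a sketch of the phase 1 storage extension, the per-endpoint map could look
as follows. This is illustrative only: the `AttributeStore` name and its
methods are assumptions, not part of the GIE code base, and the `sync.RWMutex`
choice simply reflects the efficiency goal above (read-heavy access by
schedulers):

```go
package main

import (
    "fmt"
    "sync"
)

// AttributeStore is a hypothetical, concurrency-safe map from a
// collection plugin name to the data it stored for one endpoint.
// An RWMutex allows many concurrent readers with exclusive writers.
type AttributeStore struct {
    mu   sync.RWMutex
    data map[string]interface{}
}

func NewAttributeStore() *AttributeStore {
    return &AttributeStore{data: make(map[string]interface{})}
}

// Store saves data on behalf of the named collection plugin.
func (s *AttributeStore) Store(collector string, data interface{}) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.data[collector] = data
}

// Get returns the data stored by the named collection plugin.
func (s *AttributeStore) Get(collector string) (interface{}, bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    d, ok := s.data[collector]
    return d, ok
}

func main() {
    s := NewAttributeStore()
    s.Store("metrics", map[string]float64{"queue_depth": 7})
    v, ok := s.Get("metrics")
    fmt.Println(ok, v) // true map[queue_depth:7]
}
```

Existing attributes (address, labels) would live beside this map in the
endpoint state, untouched by the new plugins.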
### Suggested Data Layer Plugin Interfaces

```go
// DataCollection consumes data updates from sources and stores the
// extracted attributes in the data layer for consumption.
// The plugin should not assume a deterministic invocation behavior beyond
// "the data layer believes the state should be updated".
type DataCollection interface {
    // Extract is called by data sources with (possibly) updated
    // data per endpoint. Extracted attributes are added to the
    // Endpoint.
    Extract(ep Endpoint, data interface{}) error // or Collect?
}

// Endpoint allows setting and retrieving of attributes
// by a data collector.
// Note that the actual endpoint structure would be something like (pseudocode)
// type EndpointState struct {
//    address
//    ...
//    data map[string]interface{}
// }
// The plugin interface would only mutate the `data` map.
type Endpoint interface {
    // StoreAttributes sets the data for the Endpoint on behalf
    // of the named collection plugin.
    StoreAttributes(collector string, data interface{}) error

    // GetAttributes retrieves the attributes of the named collection
    // plugin for the Endpoint.
    GetAttributes(collector string) (interface{}, error)
}

// DataLayerSourcesRegistry includes the list of available
// Data Sources (interface defined below) in the system.
// It is accompanied by functions (not shown) to register
// and retrieve sources.
type DataLayerSourcesRegistry map[string]DataSource

// DataSource represents a data source that tracks
// pods/resources and notifies data collection plugins to
// extract relevant attributes.
type DataSource interface {
    // Type of data available from this source
    Type() string

    // Start begins the data collection and notification loop
    Start(ctx context.Context) error

    // Stop terminates data collection
    Stop() error

    // Subscribe a collector to receive updates for tracked endpoints
    Subscribe(collector DataCollection) error

    // UpdateEndpoints replaces the set of pods/resources tracked by
    // this source.
    // Alternative: add/remove individual endpoints?
    UpdateEndpoints(epIDs []string) error
}
```
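To illustrate how the interfaces might compose, here is a hypothetical
collector that extracts a single attribute from raw scrape data. Everything
here is a sketch: `QueueDepthCollector`, the `queue_depth` metric name, and
the map-backed `mapEndpoint` demo type are invented for the example, and the
interface definitions are trimmed copies so the snippet is self-contained:

```go
package main

import (
    "errors"
    "fmt"
)

// Trimmed copies of the proposed interfaces so the sketch compiles
// on its own; the real definitions would live in the Data Layer package.
type Endpoint interface {
    StoreAttributes(collector string, data interface{}) error
    GetAttributes(collector string) (interface{}, error)
}

type DataCollection interface {
    Extract(ep Endpoint, data interface{}) error
}

// QueueDepthCollector is a hypothetical plugin that extracts a
// "queue_depth" value from raw scrape data and stores it on the
// endpoint under its own collector name.
type QueueDepthCollector struct{}

func (QueueDepthCollector) Extract(ep Endpoint, data interface{}) error {
    metrics, ok := data.(map[string]float64)
    if !ok {
        return errors.New("unexpected data type from source")
    }
    depth, ok := metrics["queue_depth"]
    if !ok {
        return errors.New("queue_depth not reported")
    }
    return ep.StoreAttributes("queue-depth-collector", depth)
}

// mapEndpoint is a toy Endpoint backed by a plain map, for the demo only.
type mapEndpoint map[string]interface{}

func (m mapEndpoint) StoreAttributes(c string, d interface{}) error {
    m[c] = d
    return nil
}

func (m mapEndpoint) GetAttributes(c string) (interface{}, error) {
    return m[c], nil
}

func main() {
    ep := mapEndpoint{}
    var c DataCollection = QueueDepthCollector{}
    if err := c.Extract(ep, map[string]float64{"queue_depth": 7}); err != nil {
        panic(err)
    }
    v, _ := ep.GetAttributes("queue-depth-collector")
    fmt.Println(v) // 7
}
```

A data source holding a list of subscribed `DataCollection` plugins would
invoke `Extract` for each tracked endpoint on every scrape or change
notification.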

## Open Questions

1. Type safety in extensible data collection: `map[string]interface{}` seems
   like the simplest option to start with, but we may want to evolve to support
   type safety using generics or codegen.
1. Should we design a separate interface specifically for k8s object watching
   under GIE control, or do we want these to be managed as yet another data
   source? This affects the design (e.g., who owns the k8s caches, clients,
   etc.). With a GIE-controlled data source, collectors just register the types
   (and other constraints? Labels, namespaces, …) with GIE core, and all k8s
   functionality is under GIE control.
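The generics option in question 1 above could be sketched as a typed accessor
over the untyped map, without committing to it. `GetAs` and the `attributes`
alias below are hypothetical names, not a proposed API:

```go
package main

import (
    "errors"
    "fmt"
)

// attributes stands in for the per-endpoint data map.
type attributes map[string]interface{}

// GetAs retrieves the attribute stored under name and asserts it to T,
// giving callers compile-time types over the untyped map. A failed
// assertion surfaces as an error rather than a panic.
func GetAs[T any](a attributes, name string) (T, error) {
    var zero T
    raw, ok := a[name]
    if !ok {
        return zero, errors.New("attribute not found: " + name)
    }
    v, ok := raw.(T)
    if !ok {
        return zero, errors.New("attribute has unexpected type: " + name)
    }
    return v, nil
}

func main() {
    a := attributes{"queue_depth": 7.0}
    depth, err := GetAs[float64](a, "queue_depth")
    fmt.Println(depth, err) // 7 <nil>
}
```

Codegen could instead produce per-collector typed wrappers, trading build
complexity for stricter guarantees; either path keeps the storage itself
untyped.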