# Data Layer Architecture Proposal

Author(s): @elevran @nirrozenbaum

## Proposal Status

***Draft***
## Summary

The EPP Architecture proposal identifies the need for an extensible
[Data Layer](../0683-epp-architecture-proposal/README.md#data-layer).
Recently, the scheduling subsystem underwent a major [architecture change](../0845-scheduler-architecture-proposal/README.md)
to allow easier extension and pluggability. This proposal aims to apply
similar extensibility to the Data Layer subsystem, allowing custom inference
gateways to extend the Gateway API Inference Extension (GIE) for their use
cases without modifying the core GIE code base.

See [this document](https://docs.google.com/document/d/1eCCuyB_VW08ik_jqPC1__z6FzeWO_VOlPDUpN85g9Ww/edit?usp=sharing) for additional context and reference.
| 20 | + |
## Goals

The Data Layer pluggability effort aims to address the following goals and
requirements:

- Make endpoint attributes used by GIE components accessible via well-defined
  Data Layer interfaces.
- Enable collection of an additional (or different) subset of attributes from an
  existing data source (e.g., the `/metrics` endpoint scraper).
- Add new data sources that collect attributes not already collected.
- Follow best practices and experience from the Scheduling subsystem
  pluggability effort. For example, extending the system to support the above
  should be done by implementing well-defined plugin interfaces and registering
  them in the GIE Data Layer subsystem; any configuration would be done in the
  same way (e.g., code and/or configuration file), etc.
- Be efficient (RAM, CPU, concurrency) in collecting and storing attributes.
- Limit the change blast radius in GIE when making the above changes. Core GIE
  code should not need to be modified in order to support collecting and storing
  new attributes. Affected code should be scoped only to modules that make use
  of the new attributes.
- The extensions should not increase coupling between GIE subsystems and
  Kubernetes (i.e., environment-specific code should be encapsulated and
  not “leaked” into the subsystem and its users).
- (Future) Allow non-uniform data collection (i.e., not all endpoints share the
  same data).
| 46 | + |
## Non-Goals

- Modify existing GIE abstractions, such as `InferencePool`, to conform to the
  Data Layer pluggability design. They are to remain first-class concepts, as
  today.
- Enable reconciliation or modification of external state. The data sources are
  strictly read-only. For example, a data source accessing Kubernetes state as
  part of data collection would be registered for `Watch()` notifications and
  shall not receive access to a k8s client.
- Inference scheduler plugins that rely on custom data collection accept that
  the [Model Server Protocol](../003-model-server-protocol/README.md) no longer
  provides guarantees on portability of a model server out of the box.
| 59 | + |
## Proposal

### Overview

There are two existing Data Sources in the Data Layer: a Pod reconciler that
collects Pod IP address(es) and labels, copying them to endpoint attributes,
and a metrics scraper that collects a defined set of metric values from the
`/metrics` endpoint of each Pod. Note that the `InferencePool` reconciler is
*not* considered part of the Data Layer.
| 69 | + |
### Components

The proposal is to make the Data Layer more extensible by introducing
these two interfaces:

- An **Attribute Collection** plugin interface responsible for extracting relevant
  attributes from a data source and storing them in the Data Layer for consumption
  by other components. The plugin can be registered with existing or new
  *Data Sources* (see below), and sources would call their registered plugins
  periodically or on change to process attributes.
- A **Data Source** plugin interface that can be added to an inference gateway
  system, and on which *Attribute Collection* plugins can be registered to enrich
  the data model.

### Implementation Phases

In order to make iterative progress and validate the design along the way, we
propose to implement and evolve the Data Layer extensibility over several
phases:

1. Extend the per-endpoint backend storage with a map from a name (i.e., the
   attribute collection plugin) to the data it collected. Existing attributes,
   such as IP address or Pod labels, are not modified.
1. Introduce a Data Source registry where new data sources can be registered, and
   bootstrap it by wrapping the existing `/metrics` scraper with a Data Source
   API. At this point, the metrics scraping code implements only the `Data Source`
   interface, and the `Data Collection` interface is not used/exposed.
1. Refactor the metrics scraping code into separate Data Source and Data Collection
   plugin interfaces.
1. Following that, and based on any lessons learnt, refactor the existing
   Kubernetes Pod reconciliation loop to the new plugin interfaces.

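As a sketch of the phase 1 storage extension, the per-endpoint map could look
as follows. This is illustrative only: the `AttributeStore` name and its
methods are assumptions, not part of the GIE code base, and the `sync.RWMutex`
choice simply reflects the efficiency goal above (read-heavy access by
schedulers):

```go
package main

import (
    "fmt"
    "sync"
)

// AttributeStore is a hypothetical, concurrency-safe map from a
// collection plugin name to the data it stored for one endpoint.
// An RWMutex allows many concurrent readers with exclusive writers.
type AttributeStore struct {
    mu   sync.RWMutex
    data map[string]interface{}
}

func NewAttributeStore() *AttributeStore {
    return &AttributeStore{data: make(map[string]interface{})}
}

// Store saves data on behalf of the named collection plugin.
func (s *AttributeStore) Store(collector string, data interface{}) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.data[collector] = data
}

// Get returns the data stored by the named collection plugin.
func (s *AttributeStore) Get(collector string) (interface{}, bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    d, ok := s.data[collector]
    return d, ok
}

func main() {
    s := NewAttributeStore()
    s.Store("metrics", map[string]float64{"queue_depth": 7})
    v, ok := s.Get("metrics")
    fmt.Println(ok, v) // true map[queue_depth:7]
}
```

Existing attributes (address, labels) would live beside this map in the
endpoint state, untouched by the new plugins.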
### Suggested Data Layer Plugin Interfaces

```go
// DataCollection consumes data updates from sources and stores the
// extracted attributes in the data layer for consumption.
// The plugin should not assume a deterministic invocation behavior beyond
// "the data layer believes the state should be updated".
type DataCollection interface {
    // Extract is called by data sources with (possibly) updated
    // data per endpoint. Extracted attributes are added to the
    // Endpoint.
    Extract(ep Endpoint, data interface{}) error // or Collect?
}

// Endpoint allows setting and retrieving of attributes
// by a data collector.
// Note that the actual endpoint structure would be something like (pseudocode)
// type EndpointState struct {
//    address
//    ...
//    data map[string]interface{}
// }
// The plugin interface would only mutate the `data` map.
type Endpoint interface {
    // StoreAttributes sets the data for the Endpoint on behalf
    // of the named collection plugin.
    StoreAttributes(collector string, data interface{}) error

    // GetAttributes retrieves the attributes of the named collection
    // plugin for the Endpoint.
    GetAttributes(collector string) (interface{}, error)
}

// DataLayerSourcesRegistry includes the list of available
// Data Sources (interface defined below) in the system.
// It is accompanied by functions (not shown) to register
// and retrieve sources.
type DataLayerSourcesRegistry map[string]DataSource

// DataSource represents a data source that tracks
// pods/resources and notifies data collection plugins to
// extract relevant attributes.
type DataSource interface {
    // Type of data available from this source
    Type() string

    // Start begins the data collection and notification loop
    Start(ctx context.Context) error

    // Stop terminates data collection
    Stop() error

    // Subscribe a collector to receive updates for tracked endpoints
    Subscribe(collector DataCollection) error

    // UpdateEndpoints replaces the set of pods/resources tracked by
    // this source.
    // Alternative: add/remove individual endpoints?
    UpdateEndpoints(epIDs []string) error
}
```
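To illustrate how the interfaces might compose, here is a hypothetical
collector that extracts a single attribute from raw scrape data. Everything
here is a sketch: `QueueDepthCollector`, the `queue_depth` metric name, and
the map-backed `mapEndpoint` demo type are invented for the example, and the
interface definitions are trimmed copies so the snippet is self-contained:

```go
package main

import (
    "errors"
    "fmt"
)

// Trimmed copies of the proposed interfaces so the sketch compiles
// on its own; the real definitions would live in the Data Layer package.
type Endpoint interface {
    StoreAttributes(collector string, data interface{}) error
    GetAttributes(collector string) (interface{}, error)
}

type DataCollection interface {
    Extract(ep Endpoint, data interface{}) error
}

// QueueDepthCollector is a hypothetical plugin that extracts a
// "queue_depth" value from raw scrape data and stores it on the
// endpoint under its own collector name.
type QueueDepthCollector struct{}

func (QueueDepthCollector) Extract(ep Endpoint, data interface{}) error {
    metrics, ok := data.(map[string]float64)
    if !ok {
        return errors.New("unexpected data type from source")
    }
    depth, ok := metrics["queue_depth"]
    if !ok {
        return errors.New("queue_depth not reported")
    }
    return ep.StoreAttributes("queue-depth-collector", depth)
}

// mapEndpoint is a toy Endpoint backed by a plain map, for the demo only.
type mapEndpoint map[string]interface{}

func (m mapEndpoint) StoreAttributes(c string, d interface{}) error {
    m[c] = d
    return nil
}

func (m mapEndpoint) GetAttributes(c string) (interface{}, error) {
    return m[c], nil
}

func main() {
    ep := mapEndpoint{}
    var c DataCollection = QueueDepthCollector{}
    if err := c.Extract(ep, map[string]float64{"queue_depth": 7}); err != nil {
        panic(err)
    }
    v, _ := ep.GetAttributes("queue-depth-collector")
    fmt.Println(v) // 7
}
```

A data source holding a list of subscribed `DataCollection` plugins would
invoke `Extract` for each tracked endpoint on every scrape or change
notification.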

## Open Questions

1. Type safety in extensible data collection: `map[string]interface{}` seems
   like the simplest option to start with, but we may want to evolve to support
   type safety using generics or codegen.
1. Should we design a separate interface specifically for k8s object watching
   under GIE control, or do we want these to be managed as yet another data
   source? This affects the design (e.g., who owns the k8s caches, clients,
   etc.). With a GIE-controlled data source, collectors just register the types
   (and other constraints? Labels, namespaces, …) with GIE core, and all k8s
   functionality is under GIE control.
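The generics option in question 1 above could be sketched as a typed accessor
over the untyped map, without committing to it. `GetAs` and the `attributes`
alias below are hypothetical names, not a proposed API:

```go
package main

import (
    "errors"
    "fmt"
)

// attributes stands in for the per-endpoint data map.
type attributes map[string]interface{}

// GetAs retrieves the attribute stored under name and asserts it to T,
// giving callers compile-time types over the untyped map. A failed
// assertion surfaces as an error rather than a panic.
func GetAs[T any](a attributes, name string) (T, error) {
    var zero T
    raw, ok := a[name]
    if !ok {
        return zero, errors.New("attribute not found: " + name)
    }
    v, ok := raw.(T)
    if !ok {
        return zero, errors.New("attribute has unexpected type: " + name)
    }
    return v, nil
}

func main() {
    a := attributes{"queue_depth": 7.0}
    depth, err := GetAs[float64](a, "queue_depth")
    fmt.Println(depth, err) // 7 <nil>
}
```

Codegen could instead produce per-collector typed wrappers, trading build
complexity for stricter guarantees; either path keeps the storage itself
untyped.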