Commit c9720d9

GEP-3792: Off-Cluster Gateways (#3851)
* GEP-3792 Signed-off-by: Flynn <[email protected]> * Wordsmith feature name. Signed-off-by: Flynn <[email protected]> * Address review feedback. Signed-off-by: Flynn <[email protected]> --------- Signed-off-by: Flynn <[email protected]>
1 parent 46d3d0b commit c9720d9

2 files changed

+297
-0
lines changed

geps/gep-3792/index.md

Lines changed: 264 additions & 0 deletions
@@ -0,0 +1,264 @@
1+
# GEP-3792: External Gateways
2+
3+
* Issue: [#3792](https://github.com/kubernetes-sigs/gateway-api/issues/3792)
4+
* Status: Provisional
5+
6+
(See [status definitions](../overview.md#gep-states).)
7+
8+
## User Story
9+
10+
**[Chihiro] and [Ian] want a way for out-of-cluster Gateways to be able to
11+
usefully participate in a GAMMA-compliant in-cluster service mesh.**
12+
13+
Historically, API gateways and ingress controllers have often been implemented
14+
using a Service of type LoadBalancer fronting a Kubernetes pod running a
15+
proxy. This is simple to reason about, easy to manage for sidecar meshes, and
16+
will presumably be an important implementation mechanism for the foreseeable
17+
future. Some cloud providers, though, are moving the proxy outside of the
18+
cluster, for various reasons which are out of the scope of this GEP. Chihiro
19+
and Ian want to be able to use these out-of-cluster proxies effectively and
20+
safely, though they recognize that this may require additional configuration.
21+
22+
[Chihiro]: https://gateway-api.sigs.k8s.io/concepts/roles-and-personas/#chihiro
23+
[Ian]: https://gateway-api.sigs.k8s.io/concepts/roles-and-personas/#ian
24+
25+
### Nomenclature and Background
26+
27+
In this GEP:
28+
29+
1. We will use _out-of-cluster Gateway_ (OCG) to refer to a conformant
30+
implementation of Gateway API's `GATEWAY` profile that's running outside of
31+
the cluster. This would most commonly be a managed implementation from a
32+
cloud provider, but of course there are many other possibilities -- and in
33+
fact it's worth noting that anything we define here to support OCGs could
34+
also be used by workloads that run in-cluster but which, for whatever
35+
reason, can't be brought into the mesh in the mesh's usual way.
36+
37+
2. We'll also distinguish between _mTLS meshes_, which rely on standard mTLS
38+
for secure communication (authentication, encryption, and integrity
39+
checking) between workloads, and _non-mTLS meshes_, which do anything else.
40+
We'll focus on mTLS meshes in this GEP; this isn't because of a desire to
41+
exclude non-mTLS meshes, but because we'll have enough trouble just
42+
wrangling the mTLS meshes! Supporting non-mTLS meshes will be a separate
43+
GEP.
44+
45+
**Note:** It's important to separate mTLS and HTTPS here. Saying that the
46+
mTLS meshes use mTLS for secure communication does not preclude them from
47+
using custom protocols on top of mTLS, and certainly does not mean that
48+
they must use only HTTPS.
49+
50+
3. _Authentication_ is the act of verifying the identity of some _principal_;
51+
what the principal actually is depends on context. For this GEP we will
52+
primarily be concerned with _workload authentication_, in which the
53+
principal is a workload, as opposed to _user authentication_, in which the
54+
principal is the human on whose behalf a piece of technology is acting. We
55+
expect that the OCG will handle user auth, but of course meshed workloads
56+
can't trust what the OCG says about the user unless the OCG successfully
57+
authenticates itself as a workload.
58+
59+
**Note:** A single workload will have only one identity, but in practice we
60+
often see a single identity being used for multiple workloads (both because
61+
multiple replicas of a single workload need to share the same identity, and
62+
because some low-security workloads may be grouped together under a single
63+
identity).
64+
65+
4. Finally, we'll distinguish between _inbound_ and _outbound_ behaviors.
66+
67+
Inbound behaviors are those that are applied to a request _arriving_ at a
68+
given workload. Authorization and rate limiting are canonical examples
69+
of inbound behaviors.
70+
71+
Outbound behaviors are those that are applied to a request _leaving_ a
72+
given workload. Load balancing, retries, and circuit breakers are canonical
73+
examples of outbound behaviors.
74+
75+
## Goals
76+
77+
- Allow Chihiro and Ian to configure an OCG and a mesh such that the OCG can
78+
usefully participate in the mesh, including:
79+
80+
- The OCG must be able to securely communicate with meshed workloads in
81+
the cluster, where "securely communicate" includes encryption,
82+
authentication, and integrity checking.
83+
84+
- The OCG must have a proper identity within the mesh, so that the mesh
85+
can apply authorization policy to requests from the OCG.
86+
87+
- Whatever credentials the OCG and the mesh use to authenticate each other
88+
must be able to be properly maintained over time (for example, if they
89+
use mTLS, certificates will need rotation over time).
90+
91+
- The OCG must be able to distinguish meshed workloads from non-meshed
92+
workloads, so that it can communicate appropriately with each.
93+
94+
- Allow Ana to develop and operate meshed applications without needing to know
95+
whether the Gateway she's using is an OCG or an in-cluster Gateway.
96+
97+
- Define a basic set of requirements for OCGs and meshes that want to
98+
interoperate with each other (for example, the OCG and the mesh will likely
99+
need to agree on how workload authentication principals are represented).
100+
101+
- Define how responsibility is shared between the OCG and the mesh for
102+
outbound behaviors applied to requests leaving the OCG. (Note that "the OCG
103+
has complete responsibility and authority over outbound behaviors for
104+
requests leaving the OCG" is very much a valid definition.)
105+
106+
## Non-Goals
107+
108+
- Support multicluster operations. It may be the case that functional
109+
multicluster (with, e.g., a single OCG fronting multiple clusters) ends up
110+
falling out of this GEP, but it is not a goal.
111+
112+
- Support meshes interoperating with each other. It's possible that this GEP
113+
will lay a lot of groundwork in that direction, but it is not a goal.
114+
115+
- Support non-mTLS meshes in Gateway API 1.4. We'll make every effort not to
116+
rule out non-mTLS meshes, but since starting with the mTLS meshes should
117+
tackle a large chunk of the industry with a single solution, that will be
118+
the initial focus.
119+
120+
- Solve the problem of extending a mesh to cover non-Kubernetes workloads (AKA
121+
_mesh expansion_). In many ways, mesh expansion is adjacent to the OCG
122+
situation, but where the OCG is aware of the cluster and mesh, mesh
123+
expansion deals with a non-Kubernetes workload that is largely not aware of
124+
either.
125+
126+
- Solve the problem of how to support an OCG doing mTLS directly to a
127+
_non_-meshed workload (AKA the _backend TLS problem_). Backend TLS to
128+
non-meshed workloads is also adjacent to the OCG situation, but its
129+
configuration has different needs: backends terminating TLS on their own are
130+
likely to need per-workload configuration of certificates, cipher suites,
131+
etc., where the mesh as a whole should share a single configuration.
132+
133+
- Prevent the OCG API from being used by an in-cluster workload. We're not
134+
going to make in-cluster workloads a primary use case for this GEP, but
135+
neither are we disallowing them.
136+
137+
## Overview
138+
139+
Making an OCG work with an in-cluster mesh at the most basic level doesn't
140+
really require any special effort. As long as the OCG has IP connectivity to
141+
pods in the cluster, and the mesh is configured with permissive security, the
142+
OCG can simply forward traffic from clients directly to meshed pods, and
143+
things will "function" in that requests from clients, through the OCG, can be
144+
handled by workloads in the cluster.
145+
146+
Of course, this sort of non-integration has obvious and terrible security
147+
implications, since the traffic between the OCG and the application pods in
148+
the cluster will be cleartext in the scenario above. The lack of encryption is
149+
awful in its own right, but the fact that any mTLS mesh uses mTLS for
150+
_authentication_ also means that the mesh loses any way to enforce
151+
authorization policy around the OCG. Combined, these items amount to a major
152+
problem.
153+
154+
An additional concern is that the OCG needs to be able to implement features
155+
(e.g. sticky sessions) which require it to speak directly to endpoint IPs,
156+
which can limit what the mesh will be able to do. This is likely a more minor
157+
concern since a conformant OCG should itself be able to provide advanced
158+
functionality; however, at minimum it can create some friction in
159+
configuration.
160+
161+
### The Problems
162+
163+
To allow the OCG to _usefully_ participate in the mesh, we need to solve at
164+
least four significant problems. Thankfully, these are mostly problems for
165+
Chihiro -- if we do our jobs correctly, Ana will never need to know.
166+
167+
#### 1. The Trust Problem
168+
169+
The _trust problem_ is fairly straightforward to articulate: the OCG and the
170+
mesh both need access to whatever information will allow each of them to trust
171+
the other.
172+
173+
In the case of mTLS meshes, we are helped by the fact that basically every OCG
174+
candidate already speaks mTLS, so the trust problem becomes "only" one of
175+
setting things up for the OCG and the mesh to each include the other's CA
176+
certificate in their trust bundle. (They may be using the same CA certificate,
177+
but we shouldn't rely on that.)
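
As a rough illustration of the mTLS case, the sketch below uses Python's standard `ssl` module to build one context per side, each trusting the other side's CA. The function names and the PEM arguments are hypothetical. Nothing here is an API defined by this GEP, and real OCGs and meshes will configure trust through their own mechanisms.

```python
import ssl
from typing import Optional


def mesh_side_context(ocg_ca_pem: Optional[str] = None) -> ssl.SSLContext:
    """Context for a meshed workload accepting connections from the OCG."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Require a client certificate: this is what makes the TLS "mutual".
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ocg_ca_pem is not None:
        # Trust the OCG's CA, which may differ from the mesh's own CA.
        ctx.load_verify_locations(cadata=ocg_ca_pem)
    return ctx


def ocg_side_context(mesh_ca_pem: Optional[str] = None) -> ssl.SSLContext:
    """Context for the OCG connecting to a meshed workload."""
    # Client contexts already require certificate verification by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if mesh_ca_pem is not None:
        # Trust the mesh's CA so the workload's certificate can be verified.
        ctx.load_verify_locations(cadata=mesh_ca_pem)
    return ctx

# Each side would also call load_cert_chain(...) with its own certificate
# and key; that step is omitted here because it needs files on disk.
```

The point of the two separate arguments is that neither side assumes a shared CA: each trust bundle is populated with whatever the other side actually uses.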
178+
179+
In the case of non-mTLS meshes, the trust problem is more complex; this is the
180+
major reason that this GEP is focused on mTLS meshes.
181+
182+
#### 2. The Protocol Problem
183+
184+
The _protocol problem_ is that the data-plane elements of the mesh may assume
185+
that they'll always be talking only to other mesh data-plane elements, which
186+
the OCG will not be. If the mesh data-plane elements use a specific protocol,
187+
then either the OCG will need to speak that protocol, or the mesh will need to
188+
relax its requirements (perhaps on a separate port?) to accept requests
189+
directly from the OCG.
190+
191+
For example, Linkerd and Istio Legacy both use standard mTLS for
192+
proxy-to-proxy communication -- however, both also use ALPN to negotiate
193+
custom (and distinct!) "application" protocols during mTLS negotiation, and
194+
depending on the negotiated protocol, both can require the sending proxy to
195+
send additional information after mTLS is established, before any client data
196+
is sent. (For example, Linkerd requires the originating proxy to send
197+
transport metadata right after the TLS handshake, and it will reject a
198+
connection which doesn't do that correctly.)
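
To make the ALPN interaction concrete, here is a minimal model of server-side protocol selection, assuming a mesh proxy that prefers a custom protocol but has been relaxed to also accept standard ones. The protocol name `mesh-transport.example/v1` and both helper functions are invented for this sketch and do not correspond to any real mesh's identifiers.

```python
from typing import Optional, Sequence

# Hypothetical protocol name, invented for this sketch; it is not the
# identifier used by any real mesh.
MESH_TRANSPORT = "mesh-transport.example/v1"


def select_alpn(server_prefs: Sequence[str],
                client_offers: Sequence[str]) -> Optional[str]:
    """Pick the first server-preferred protocol that the client offered."""
    for proto in server_prefs:
        if proto in client_offers:
            return proto
    return None  # No overlap: the handshake would fail.


def needs_preface(negotiated: Optional[str]) -> bool:
    """True if the sender must send extra metadata before client data."""
    return negotiated == MESH_TRANSPORT


# A mesh proxy that has been relaxed to accept standard protocols too.
mesh_proxy_prefs = [MESH_TRANSPORT, "h2", "http/1.1"]

# A peer mesh proxy offers the custom protocol, so a preface is expected.
peer = select_alpn(mesh_proxy_prefs, [MESH_TRANSPORT, "h2"])

# An OCG that only speaks standard protocols negotiates one of those
# instead, and is never asked for the preface.
ocg = select_alpn(mesh_proxy_prefs, ["h2", "http/1.1"])
```

If the mesh proxy accepted only the custom protocol, `select_alpn` would return `None` for the OCG. That is the case where the OCG would have to learn the mesh's protocol.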
199+
200+
#### 3. The Discovery Problem
201+
202+
When using a mesh, not every workload in the cluster is required to be meshed
203+
(for example, it's fairly common to have some namespaces meshed and other
204+
namespaces not meshed, especially during migrations). The _discovery problem_
205+
here is that the OCG needs to know which workloads are meshed, so that it
206+
can choose appropriate communication methods for them.
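
The decision the OCG has to make can be sketched as below, assuming (purely for illustration) that meshing is decided per namespace and that the OCG has somehow obtained the set of meshed namespaces. How it obtains that set is exactly the open problem; the `Backend` type, the transport labels, and `transport_for` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Backend:
    name: str
    namespace: str


def transport_for(backend: Backend, meshed_namespaces: FrozenSet[str]) -> str:
    """Return "mesh-mtls" for meshed backends and "plain" otherwise."""
    if backend.namespace in meshed_namespaces:
        return "mesh-mtls"
    return "plain"


# Assumption: only the "shop" namespace has been brought into the mesh.
meshed = frozenset({"shop"})
orders = transport_for(Backend("orders", "shop"), meshed)
legacy = transport_for(Backend("legacy", "batch"), meshed)
```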
207+
208+
#### 4. The Outbound Behavior Problem
209+
210+
The OCG will need to speak directly to endpoints in the cluster, as described
211+
above. This will prevent most meshes from being able to tell which service was
212+
originally requested, which makes it impossible for the mesh to apply outbound
213+
behaviors. This is the _outbound behavior problem_: it implies that either the
214+
OCG must be responsible for outbound behaviors for requests leaving the OCG
215+
for a meshed workload, or that the OCG must supply the mesh with enough
216+
information about the targeted service to allow the mesh to apply those
217+
outbound behaviors (if that's even possible: sidecar meshes may very well
218+
simply not be able to do this).
219+
220+
This is listed last because it shouldn't be a functional problem to simply
221+
declare the OCG solely responsible for outbound behaviors for requests leaving
222+
the OCG. It is a UX problem: if a given workload needs to be used by both the
223+
OCG and other meshed workloads, you'll need to either provide two Routes with
224+
the same configuration, or you'll need to provide a single Route with multiple
225+
`parentRef`s.
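
For the single-Route option, the sketch below builds an HTTPRoute manifest with two `parentRefs`: one pointing at a Gateway and one pointing at a Service, as GAMMA does for mesh routing. It assumes the OCG is represented by a Gateway resource; the resource names (`ocg`, `orders`, `shop`), the port, and the helper function are made up for the example.

```python
import json


def http_route(name, namespace, parent_refs, backend_name, backend_port):
    """Build an HTTPRoute manifest as a plain dict."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": list(parent_refs),
            "rules": [
                {"backendRefs": [{"name": backend_name, "port": backend_port}]}
            ],
        },
    }


route = http_route(
    "orders",
    "shop",
    parent_refs=[
        # Traffic entering through the (hypothetical) out-of-cluster Gateway.
        {"group": "gateway.networking.k8s.io", "kind": "Gateway", "name": "ocg"},
        # Traffic from other meshed workloads; the empty group is the core
        # API group, which is where Service lives.
        {"group": "", "kind": "Service", "name": "orders", "port": 8080},
    ],
    backend_name="orders",
    backend_port=8080,
)

print(json.dumps(route, indent=2))
```

With this shape, the routing rules are written once and attached to both parents. The alternative is two Routes whose `rules` must be kept in sync by hand.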
226+
227+
## API
228+
229+
Most of the API work for this GEP is TBD at this point, but there are two
230+
important points to note:
231+
232+
First, Gateway API has never defined a Mesh resource because, to date, it's
233+
never been clear what would go into it. This may be the first configuration
234+
item that causes us to need a Mesh resource.
235+
236+
Second, since the API should affect only Gateway API resources, it is not a
237+
good candidate for policy attachment. It is likely to be much more reasonable
238+
to simply provide whatever extra configuration we need inline in the Gateway
239+
or Mesh resources.
240+
241+
## Graduation Criteria
242+
243+
In addition to the [general graduation
244+
criteria](../concepts/versioning.md#graduation-criteria), this GEP must also
245+
guarantee that **all four** of the problems listed above have resolutions, and
246+
must have implementations from at least two different Gateways and two
247+
different meshes.
248+
249+
### Gateway for Ingress (North/South)
250+
251+
### Gateway for Mesh (East/West)
252+
253+
## Conformance Details
254+
255+
### Feature Names
256+
257+
This GEP will use the feature name `MeshOffClusterGateway`, under the
258+
assumption that we will indeed need a Mesh resource.
259+
260+
### Conformance tests
261+
262+
## Alternatives
263+
264+
## References

geps/gep-3792/metadata.yaml

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
1+
apiVersion: internal.gateway.networking.k8s.io/v1alpha1
2+
kind: GEPDetails
3+
number: 3792
4+
name: GEP template
5+
status: Provisional
6+
# Any authors who contribute to the GEP in any way should be listed here using
7+
# their GitHub handle.
8+
authors:
9+
- kflynn
10+
relationships:
11+
# obsoletes indicates that a GEP makes the linked GEP obsolete, and completely
12+
# replaces that GEP. The obsoleted GEP MUST have its obsoletedBy field
13+
# set back to this GEP, and MUST be moved to Declined.
14+
obsoletes: {}
15+
obsoletedBy: {}
16+
# extends indicates that a GEP extends the linked GEP, adding more detail
17+
# or additional implementation. The extended GEP MUST have its extendedBy
18+
# field set back to this GEP.
19+
extends: {}
20+
extendedBy: {}
21+
# seeAlso indicates other GEPs that are relevant in some way without being
22+
# covered by an existing relationship.
23+
seeAlso: {}
24+
# references is a list of hyperlinks to relevant external references.
25+
# It's intended to be used for storing GitHub discussions, Google docs, etc.
26+
references: {}
27+
# featureNames is a list of the feature names introduced by the GEP, if there
28+
# are any. This will allow us to track which feature was introduced by which GEP.
29+
# This is the value added to supportedFeatures and the conformance tests, in string form.
30+
featureNames: {}
31+
# changelog is a list of hyperlinks to PRs that make changes to the GEP, in
32+
# ascending date order.
33+
changelog: {}
