Clarify compatibility constraints #340

@danwinship

Description

We need to be clear about our compatibility / API evolution constraints.

Specifically: what happens when the user creates a CNP that uses a feature which the CNP implementation doesn't know about?

Broadly speaking, there are two possible (useful) answers to that question:

  1. It gets rejected (somehow), which interrupts the user's deployment workflow with a clear error and makes it clear that the CNP won't be supported.
  2. It gets accepted, but the implementation is able to determine (somehow) that the CNP uses unsupported features, and then it has to do something sane in response.

While the first answer is generally better, it doesn't solve the problem of rollback: if a user upgrades their cluster, adds a CNP using a new feature, and then rolls the cluster back to the old version, then the old implementation will have to deal with the new CNP when it starts up again.

(Core APIs solve this problem by having a mandatory Alpha phase; as long as you wait until Beta before you use the feature, then it's guaranteed that if you roll back, you'll be rolling back to a release that still had the Alpha implementation. I don't think we can easily copy this.)

Possibility 1a: "Don't do that then" / API server validation rejects invalid CNPs

If the CNP implementation always installs the CNP CRD itself, then only CNPs containing features supported by that CRD will validate, and CNPs containing unknown features will be rejected by the apiserver.

Having the CNP implementation install the CRD matches the way ANP is currently implemented/distributed. (The corresponding idea doesn't work for Gateway API, because you might have multiple Gateway implementations in a single cluster; but at least for now, it doesn't make sense to have multiple ANP/CNP implementations in a single cluster.)

The question is: can we guarantee that that will always work? I'm not sure we can. If we adopted something like Lior's proposed "IdentityNetworkPolicy", then in some clusters there would be two different components implementing different parts of the NetworkPolicy stack. Alternatively, if we later extend NP/CNP to support multi-network scenarios, then it might make sense to have one NP/CNP implementation for the cluster-default pod network, and a separate one for secondary pod networks.

So while this restriction makes sense now, it limits our future options.

Possibility 1b: A CNP-implementation-specific admission webhook rejects invalid CNPs

CNP implementations could be required to install a webhook that prevents the creation of CNPs that use features the CNP implementation doesn't understand. Basically, the webhook would deserialize the provided JSON object into the CNP implementation's generated golang version of the CRD, and then serialize it back to JSON, and see if any fields got lost.

This requires us to obey certain rules when adding new APIs, to ensure that implementations will be able to properly notice unsupported new features. For example, we could not add a new Action value, since that wouldn't show up as a difference when deserializing and reserializing the object. (Perhaps the webhook could also run CRD validation against the object, to ensure that it validates against the version of the CRD the webhook knows about? I'm not sure whether the code for doing this is available outside the apiserver.)
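To make the round-trip idea concrete, here is a minimal sketch in Go of the check such a webhook could run, using the existing AdminNetworkPolicy type from sigs.k8s.io/network-policy-api as a stand-in for the CNP type; the function name is illustrative, and a real webhook would wrap this in the usual AdmissionReview plumbing:

```go
package webhook

import (
	"encoding/json"
	"reflect"

	anpv1alpha1 "sigs.k8s.io/network-policy-api/apis/v1alpha1"
)

// losesFields reports whether decoding raw into the typed struct this
// implementation was built against, and re-encoding it, drops anything from
// the spec, i.e. whether the object uses fields we don't know about.
func losesFields(raw []byte) (bool, error) {
	var typed anpv1alpha1.AdminNetworkPolicy
	if err := json.Unmarshal(raw, &typed); err != nil {
		return false, err
	}
	reencoded, err := json.Marshal(&typed)
	if err != nil {
		return false, err
	}

	// Compare the specs as generic maps so key order doesn't matter.
	var before, after map[string]interface{}
	if err := json.Unmarshal(raw, &before); err != nil {
		return false, err
	}
	if err := json.Unmarshal(reencoded, &after); err != nil {
		return false, err
	}

	// Caveats, per the discussion above: this cannot detect unknown values
	// of known fields (e.g. a new Action), and omitempty/defaulting
	// differences would need to be handled to avoid false positives.
	return !reflect.DeepEqual(before["spec"], after["spec"]), nil
}
```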

Possibility 2a: The CNP implementation detects the existence of unsupported fields (and then does something)

The CNP implementation itself could recognize unsupported CNPs in roughly the same way the webhook would do in the previous example. For instance, it could use the generic client and deserialize to Unstructured to get the "raw" CNP from the apiserver, and then convert to the supported CNP struct and back as above.
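A sketch of what that could look like, assuming a dynamic client and the apimachinery unstructured converter, again with AdminNetworkPolicy standing in for the CNP type and with an illustrative function name:

```go
package controller

import (
	"context"
	"reflect"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"

	anpv1alpha1 "sigs.k8s.io/network-policy-api/apis/v1alpha1"
)

// anpGVR identifies the resource to fetch "raw" from the apiserver.
var anpGVR = schema.GroupVersionResource{
	Group:    "policy.networking.k8s.io",
	Version:  "v1alpha1",
	Resource: "adminnetworkpolicies",
}

// usesUnsupportedFields fetches the named policy as Unstructured, converts it
// to the typed struct this implementation was built against, converts it back,
// and reports whether the spec lost any fields along the way.
func usesUnsupportedFields(ctx context.Context, client dynamic.Interface, name string) (bool, error) {
	raw, err := client.Resource(anpGVR).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}

	var typed anpv1alpha1.AdminNetworkPolicy
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(raw.Object, &typed); err != nil {
		return false, err
	}
	back, err := runtime.DefaultUnstructuredConverter.ToUnstructured(&typed)
	if err != nil {
		return false, err
	}

	// Only the spec is compared, to avoid noise from server-set metadata.
	// As with the webhook sketch, this cannot catch unknown values of
	// known fields, and defaulting differences would need special care.
	return !reflect.DeepEqual(raw.Object["spec"], back["spec"]), nil
}
```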

Possibility 2b: We design the API in such a way that CNPs using unsupported features will have obvious "holes" (and when the implementation sees them, it does something)

This is the "exactly one field must be set, so if you see an object with no fields set then you know it must be using a feature you don't know about" approach. While this is theoretically easier to detect on the implementation end, it places more constraints on API evolution, and, compared to the previous approach, it means that the implementation must separately check for missing fields in every spot where they might occur, rather than being able to do a single top-level check for "some new field is missing".
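As an illustration of the pattern (the union and its members here are invented for the example, not the actual API):

```go
package example

// policyPeer is a union: exactly one member is expected to be set.
type policyPeer struct {
	Namespaces *namespaceSelector `json:"namespaces,omitempty"`
	Pods       *podSelector       `json:"pods,omitempty"`
	// A future API revision might add another member here; an older
	// implementation decoding such an object would see every member it
	// knows about as nil.
}

type namespaceSelector struct{ /* ... */ }
type podSelector struct{ /* ... */ }

// knownFieldSet returns false when none of the members this implementation
// knows about is set, which (under the "exactly one" rule) means the object
// must be using a newer, unsupported member. Note that this check has to be
// repeated for every union in the API, unlike the single top-level check in 2a.
func knownFieldSet(p policyPeer) bool {
	return p.Namespaces != nil || p.Pods != nil
}
```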

If 2a is possible, it seems strictly superior to 2b.


With 2a and 2b, we also need to define what the implementation is supposed to do when it realizes that there is a bad CNP in the system. The docs currently suggest that the right approach is "when in doubt, deny", but this runs the risk of completely breaking the cluster. Bowei suggested that a better approach is "as soon as you see a bad CNP, stop processing updates until it is fixed", but this runs the risk of letting the cluster run for some length of time in an only-half-locked-down state. Neither approach is really very good, which is why it seems clearly better to ensure that bad CNPs get rejected at creation time. The question is whether we also need to deal with rollback.
