Commit 16d4e87: Merge pull request kubernetes-sigs#2021 from shiftstack/openstackserver-controller

🌱 Add design doc for OpenStackServer

# Problem statement

We currently have two callers of `compute.CreateInstance`: the machine controller for OpenStackMachine, and the cluster controller for the bastion[^mapo].

[^mapo]: Incidentally, although it is not a primary concern of this project, there is also a third consumer in OpenShift's [Machine API Provider OpenStack](https://github.com/openshift/machine-api-provider-openstack).

Of these, the bastion has been the most problematic. The principal current problem is that, unlike OpenStackMachine, the bastion is mutable, because it is not required to have the same lifecycle as the cluster which contains it. The current API allows the user to modify the bastion, and the controller is expected to delete the current one and create a new one. This causes problems in edge cases where the spec is required to determine whether resources were previously created and need to be cleaned up. That is a fundamental design problem. A less fundamental but still very real problem is that integrating the bastion into both the cluster and (effectively) machine creation and deletion flows produces fragile code, and has resulted in numerous bugs.

# Potential solutions

## Do nothing

The current solution has sufficed in practice for quite a long time. Additionally, we now have much better test coverage of bastion operations than we had in the past, so we should hopefully catch issues sooner.

The primary issue with doing nothing is the drag it places on bug fixes and new development, because the code flow of bastion deletion in particular can be hard to reason about.

## Create an OpenStackMachine for the bastion

This is an attractive solution because it better models the independent lifecycle of the bastion. When a new bastion is created we create a new OpenStackMachine. If the bastion spec is updated we delete the old one and create a new one, with the old and new objects having their own, independent state.

The problem with this approach is that OpenStackMachine has other contracts. It cannot be created independently, but requires a CAPI Machine object. Other controllers have expectations of Machine objects, for example that they will eventually have a NodeRef.

Creating a Machine/OpenStackMachine object for the bastion could negatively impact other CAPI users.

## Create a new controller for the bastion

This is essentially the same as using OpenStackMachine, except using a new CRD which can be independent of other CAPI objects.

The obvious disadvantage is that it requires a new CRD and controller.

# Proposal: An OpenStackServer CRD and controller

## Philosophy

We will create a new controller which is capable, in principle, of doing all of the OpenStack resource creation tasks common to bastion and OpenStackMachine creation. That is:

* Resolve OpenStack resource parameters to specific resources (e.g. image, networks, subnets, security groups, server groups)
* Create and delete ports
* Create and delete volumes
* Create and delete the server

It will have no dependency on Machine or OpenStackCluster, so things it will explicitly not do include:

* Attaching the Bastion floating IP
* Creating a loadbalancer member
* Adding a port for the default cluster network if no ports are provided
* Referencing an AvailabilityZone from the Machine object
* Adding default tags from the OpenStackCluster object

## API

Based on OpenStackMachineSpec, with modifications to accommodate the above.

```go
type OpenStackServerSpec struct {
    AvailabilityZone       string
    Flavor                 string
    Image                  ImageParam
    SSHKeyName             string
    Ports                  []PortOpts
    SecurityGroups         []SecurityGroupParam
    Trunk                  bool
    Tags                   []string
    ServerMetadata         []ServerMetadata
    ConfigDrive            *bool
    RootVolume             *RootVolume
    AdditionalBlockDevices []AdditionalBlockDevice
    ServerGroup            *ServerGroupParam
    IdentityRef            *OpenStackIdentityReference
    FloatingIPPoolRef      *corev1.TypedLocalObjectReference
    UserDataRef            *corev1.TypedLocalObjectReference
}

type OpenStackServerStatus struct {
    InstanceID    optional.String
    InstanceState *InstanceState
    Addresses     []corev1.NodeAddress
    Resolved      *ResolvedMachineSpec
    Resources     *MachineResources
    Conditions    clusterv1.Conditions
}
```

As the new API is non-trivial we should initially create it in v1alpha1.
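
For illustration only, a bastion server built from this spec might look roughly like the sketch below. This is a sketch under assumptions: the `newBastionServer` helper, the `OpenStackServer` wrapper type, and the literal values are invented for illustration, and the nested parameters (Image, IdentityRef, etc.) are left to the caller.

```go
// newBastionServer sketches how the cluster controller might build the
// bastion's OpenStackServer from the proposed spec. It is written as if it
// lived alongside the types above; metav1 is k8s.io/apimachinery/pkg/apis/meta/v1.
func newBastionServer(clusterName, namespace string, ports []PortOpts, tags []string) *OpenStackServer {
	return &OpenStackServer{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-bastion",
			Namespace: namespace,
		},
		Spec: OpenStackServerSpec{
			// Image, IdentityRef and other parameters are omitted for brevity.
			Flavor:     "m1.small",    // illustrative value
			SSHKeyName: "cluster-key", // illustrative value
			Ports:      ports,         // must not be empty; the caller defaults the cluster network
			Tags:       tags,          // default cluster tags supplied by the caller
		},
	}
}
```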

### Upgrading an existing deployment

OpenStackServer would be a new API and does not affect any existing API.

In the same way that OpenStackMachine currently has an 'adoption' phase where it will adopt existing resources, OpenStackServer should adopt matching resources which it would have created. On upgrade to a new version of CAPO which manages the bastion with an OpenStackServer object, I would expect the flow to be:

* Cluster controller creates OpenStackServer and waits for it to report Ready
* OpenStackServer controller looks for existing resources matching its given spec
* OpenStackServer adopts existing resources
* OpenStackServer reports Ready

The end-user should not be required to take any action, and the upgrade process should not involve the deletion and recreation of existing OpenStack resources.
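
A minimal sketch of what the adoption step could look like in the OpenStackServer controller, under stated assumptions: `openStackServerReconciler`, `getInstanceByName`, `recordAdoptedInstance` and `createInstance` are hypothetical names standing in for the existing look-up, status-recording and creation logic, and package qualifiers are omitted.

```go
// reconcileInstance sketches the adoption flow described above: if an instance
// matching the spec already exists (for example a bastion created by an older
// CAPO version), adopt it instead of creating a new one.
func (r *openStackServerReconciler) reconcileInstance(ctx context.Context, server *OpenStackServer) error {
	existing, err := r.getInstanceByName(ctx, server.Name)
	if err != nil {
		return err
	}
	if existing != nil {
		// Adopt the pre-existing instance: record it in status exactly as if
		// this controller had created it, then report Ready.
		return r.recordAdoptedInstance(ctx, server, existing)
	}
	// Nothing to adopt: create ports, volumes and the server as usual.
	return r.createInstance(ctx, server)
}
```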

### Notes

* ProviderID is not present: this is a Machine property.
* UserDataRef is added. If it is specified, the referenced object must exist.
* AvailabilityZone is added, and is specified explicitly rather than being taken from the Machine's failure domain.
* It is an error for Ports to be empty. Defaulting the cluster network must be done by the controller creating the object (see the sketch after this list).
* No managed security group will be added automatically. The managed security group must be added explicitly by the controller creating the object.
* A Volume specified with `from: Machine` will use the value of `AvailabilityZone` instead of `Machine.FailureDomain`.
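
As a sketch of how the "Ports must not be empty" rule could be enforced at admission time (the function name and the wiring into a validating webhook are assumptions, not part of this proposal):

```go
import "k8s.io/apimachinery/pkg/util/validation/field"

// validateServerPorts rejects an empty Ports list. OpenStackServer does not
// default the cluster network, so the creating controller must always supply
// at least one port.
func validateServerPorts(spec *OpenStackServerSpec, path *field.Path) field.ErrorList {
	var errs field.ErrorList
	if len(spec.Ports) == 0 {
		errs = append(errs, field.Required(path.Child("ports"),
			"at least one port must be specified; defaulting the cluster network is the responsibility of the creating controller"))
	}
	return errs
}
```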

It will set the following `Conditions`:

* `Ready`: `True` when the server is fully provisioned. Once set to `True`, it will never subsequently be set to `False`.
* `Failed`: `True` means the OpenStackServer controller will not continue to reconcile this object. If set on a server which is not `Ready`, the server will never become `Ready`. If set on a server which is `Ready`, the server has suffered some terminal condition which cannot be resolved automatically: for example, it may be in the `ERROR` state, or it may have been deleted.
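
A sketch of how the controller might manage these conditions with the Cluster API conditions utilities, assuming OpenStackServer implements the conditions `Setter` interface; the `FailedCondition` constant and helper names are illustrative only.

```go
import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// FailedCondition is a hypothetical condition type for the terminal state
// described above.
const FailedCondition clusterv1.ConditionType = "Failed"

// markReady is called once the server is fully provisioned. Ready is never
// cleared afterwards.
func markReady(server *OpenStackServer) {
	conditions.MarkTrue(server, clusterv1.ReadyCondition)
}

// markFailed records a terminal failure (for example the server entered ERROR,
// or was deleted out of band). Ready is deliberately left untouched so that a
// server which was Ready stays Ready.
func markFailed(server *OpenStackServer) {
	conditions.MarkTrue(server, FailedCondition)
}
```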

## Changes to the cluster controller

Firstly, the OpenStackCluster will report `Ready` as soon as the cluster infrastructure is created, before creating the bastion. This is not strictly necessary for this change, but it simplifies the bastion creation logic and makes sense anyway.

The pseudocode of the cluster controller logic for the bastion becomes:
```
if spec.Bastion == nil {
    if OpenStackServer for bastion exists {
        delete OpenStackServer object
        return
    }
} else {
    serverObj := fetch OpenStackServer object
    if serverObj does not exist {
        substitute tags and default network into bastion spec if required
        create OpenStackServer object
        return
    }
    if server spec does not match bastion.Spec {
        delete OpenStackServer object
        return
    }
    if server is Ready {
        attach bastion floating IP to server
        populate bastion status in OpenStackCluster
    }
}
```

The OpenStackServer for the bastion is owned by the OpenStackCluster, and is deleted automatically along with it.
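
A sketch of how that ownership could be established when the cluster controller creates the bastion's OpenStackServer, using controller-runtime's `SetControllerReference`; the reconciler struct is illustrative and package qualifiers for the CAPO types are omitted.

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// clusterReconciler stands in for the real OpenStackCluster reconciler; only
// the fields used in this sketch are shown.
type clusterReconciler struct {
	Client client.Client
	Scheme *runtime.Scheme
}

// createBastionServer makes the OpenStackCluster the controller owner of the
// bastion's OpenStackServer, so deleting the cluster garbage collects the
// server object automatically.
func (r *clusterReconciler) createBastionServer(ctx context.Context, cluster *OpenStackCluster, server *OpenStackServer) error {
	if err := controllerutil.SetControllerReference(cluster, server, r.Scheme); err != nil {
		return err
	}
	return r.Client.Create(ctx, server)
}
```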

The `Enabled` flag for the bastion in OpenStackClusterSpec is deprecated but not removed. Explicitly setting `Enabled` to `false` will continue to result in no bastion being created, although there is no longer any need to do this. Validation will be updated to permit changing the bastion spec without requiring the bastion to be deleted first.

`Resolved` and `Resources` from `BastionStatus` would not be removed, but would no longer be populated. This state is now stored in the `OpenStackServer`.

## Optional: Changes to the machine controller

The primary goal of this change is to fix the bastion, so it is not necessary to change the machine controller as well. However, once we have implemented the change for the bastion, it may make sense to update the machine controller to create its server via OpenStackServer too. That is, the machine controller would (see the sketch after this list):

* Create an OpenStackServer object corresponding to the machine spec plus resolved defaults from the OpenStackCluster
* Wait for it to be created
* Do any API loadbalancer tasks required
* Copy required CAPI fields from OpenStackServer to OpenStackMachine
* Set ProviderID
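
A sketch of the last two steps, with package qualifiers omitted and the helper name invented for illustration; the `openstack:///` provider ID format follows what CAPO sets on OpenStackMachine today.

```go
import "fmt"

// setMachineFieldsFromServer copies the CAPI-facing fields from the
// OpenStackServer status to the OpenStackMachine, and sets ProviderID from the
// instance ID once it is known.
func setMachineFieldsFromServer(machine *OpenStackMachine, server *OpenStackServer) {
	machine.Status.Addresses = server.Status.Addresses
	machine.Status.InstanceState = server.Status.InstanceState

	if server.Status.InstanceID != nil {
		providerID := fmt.Sprintf("openstack:///%s", *server.Status.InstanceID)
		machine.Spec.ProviderID = &providerID
	}
}
```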

## Note (but out of scope of this document)

OpenShift's [Machine API Provider OpenStack](https://github.com/openshift/machine-api-provider-openstack) could create an `OpenStackServer` object instead of calling directly into `compute.CreateInstance`.
