Skip to content

Commit bbfa8a3

Browse files
committed
design: change logs design updates to remove "after" state collection
Signed-off-by: Jared Watts <[email protected]>
1 parent 1e88326 commit bbfa8a3

File tree

1 file changed

+32
-37
lines changed

1 file changed

+32
-37
lines changed

design/one-pager-change-logs.md

Lines changed: 32 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -9,8 +9,8 @@
99
* To build further trust and confidence in Crossplane, we will log every
1010
operation that is performed on all managed resources
1111
* The managed reconciler will generate a record of each change made to an
12-
external system, as well as the state of the resource before and after the
13-
operation
12+
external system, as well as the full desired and observed state of the
13+
resource before the operation is performed
1414
* Each change log entry will be sent over gRPC to a sidecar container in the
1515
provider pod, where it will be persisted as a standard pod log
1616
* Pod logs are a very standard location that allows many different tools to view
@@ -84,19 +84,10 @@ Each entry in the change log will contain the following data:
8484
| Name | `dev-instance-bt66d` |
8585
| External Name | `vpc-4fa03d9fb92dfec50` |
8686
| Operation Type | `create\|update\|delete` |
87-
| Before operation desired/observed state of resource | `{full JSON dump of resource}` which includes desired `spec.forProvider` and observed `status.AtProvider` |
88-
| After operation desired/observed state of resource | `{full JSON dump of resource}` which includes desired `spec.forProvider` and observed `status.AtProvider` |
87+
| Desired/observed state of resource before operation | `{full JSON dump of resource}` which includes desired `spec.forProvider` and observed `status.AtProvider` |
8988
| Result of operation | `success` or error object |
9089
| (optional) Additional information that Providers can set as they choose | `{JSON object of arbitrary data}` |
9190

92-
Note that we are capturing the full state of the resource both **before** and
93-
**after** the operation is performed. We could instead just record the "before"
94-
state during each reconciliation, but that would paint a less complete picture
95-
compared to observing the external system and recording the "after" state
96-
immediately after the operation as well. If we find during implementation and
97-
testing that recording the "after" state places too much burden on the system,
98-
we can revisit this decision.
99-
10091
An additional information field is available in each change log entry for
10192
Providers to store any provider specific data they choose. Providers are not
10293
required to store any data at all in this field, as it is optional. As an
@@ -107,13 +98,13 @@ change log entry.
10798
#### Quantity of Generated Data
10899

109100
From testing during prototyping, a typical change log entry was found to be
110-
around 6KB in size, including both the full "before" and "after" states of the
101+
around 4KB in size, including both the full desired and observed state of the
111102
resource. To get an idea of the total quantity of data this feature is expected
112103
to generate, we can use the below calculations for a very rough estimate:
113104

114-
* 1 change log entry is ~6KB
105+
* 1 change log entry is ~4KB
115106
* 1000 resources changing 1x per hour
116-
* 6KB * 1000 changes/hour * 24 hours/day = ~140MB/day
107+
* 4KB * 1000 changes/hour * 24 hours/day = ~94MB/day
117108

118109
Note that this estimate can vary depending on the frequency of changes being
119110
made to the resources, and the number of fields each resource has. This estimate
@@ -140,13 +131,8 @@ The managed reconciler calls the provider's [`ExternalClient.Observe()`
140131
method](https://github.com/crossplane/crossplane-runtime/blob/release-1.16/pkg/reconciler/managed/reconciler.go#L914)
141132
before any CUD operation is performed, in order to understand the current state
142133
of the external system before the provider acts upon it. This up to date
143-
external state will be used to populate the "before" field in the change log
144-
entry.
145-
146-
If a CUD operation is performed later on in the same `Reconcile()`, then we
147-
will perform an additional call to this same `Observe()` method directly after
148-
the CUD operation is performed. This second `Observe()` result will be used to
149-
populate the "after" field in the change log entry.
134+
external state will be used to populate the desired/observed state field in the
135+
change log entry.
150136

151137
Some helpful pointers to where CUD operations are performed in the managed
152138
reconciler:
@@ -155,14 +141,6 @@ reconciler:
155141
* [Update](https://github.com/crossplane/crossplane-runtime/blob/release-1.16/pkg/reconciler/managed/reconciler.go#L1189)
156142
* [Delete](https://github.com/crossplane/crossplane-runtime/blob/release-1.16/pkg/reconciler/managed/reconciler.go#L954)
157143

158-
The code paths through the managed reconciler's `Reconcile()` method are already
159-
a bit complex. One design choice that has helped minimize the complexity and
160-
assist in readability is that the reconciler returns as early as it can after
161-
performing an operation. Since we will now be adding an additional `Observe()`
162-
call after each CUD operation, this will increase the complexity further, but
163-
is likely unavoidable for us to obtain an immediate understanding of the "after"
164-
state of a change, without having to wait for another reconcile.
165-
166144
### Storage and Durability
167145

168146
We consider change logs entries to be related to both security and reliability
@@ -313,9 +291,8 @@ of providers.
313291
* `crossplane-runtime`:
314292
* Protobuf types are defined for change log entries and a gRPC client/server
315293
to send these entries from client to server
316-
* Managed reconciler is updated to generate change log entries, send them to
317-
the gRPC server (i.e. sidecar container), and to include additional calls to
318-
`Observe()` after each CUD operation to capture the "after" state of the entry
294+
* Managed reconciler is updated to generate change log entries for appropriate
295+
events and send them to the gRPC server (i.e. sidecar container)
319296
* `change-logs-sidecar`
320297
* A new repository is created to build and publish the image used in the sidecar container
321298
* Consumes the protobuf types from `crossplane-runtime`
@@ -377,10 +354,10 @@ The following alternative approaches have been considered for this change logs d
377354
We could generate standard Kubernetes events with that capture all of the data
378355
we are proposing to store for each change log entry, likely stored as
379356
annotations on the events. However, given that we propose to store the entire
380-
object before and after a given operation is performed, this would likely be too
381-
large for what a standard `Event` and its annotations, or the underlying etcd
382-
storage, would allow. Relying on annotations also would diminish the portability
383-
and ease of access of the data to a wide variety of tools.
357+
object's state as well as other data, this would likely be too large for what a
358+
standard `Event` and its annotations, or the underlying etcd storage, would
359+
allow. Relying on annotations also would diminish the portability and ease of
360+
access of the data to a wide variety of tools.
384361

385362
### Kubernetes Audit Logs
386363

@@ -435,3 +412,21 @@ these logs:
435412
* Crossplane change logs take away any guesswork as to what changes Crossplane
436413
made, as opposed to trying to use indirect means like cloud identity, HTTP
437414
user agents, or other means to identify Crossplane changes being performed.
415+
416+
### After Operation State Collection
417+
418+
We explored the option of capturing the full state of the resource both
419+
**before** and **after** the operation is performed. We decided against also
420+
collecting the "after" state because:
421+
422+
* The full state of the resource from before the operation is performed has both
423+
`spec.forProvider` desired state and `status.atProvider` observed state. The
424+
observed state is very up to date with the external system because each
425+
reconciliation makes a `Observe()` call. Therefore the desired vs. observed
426+
state is effective to determine why a change is being made.
427+
* In the Upjet based providers, we are not likely to see a difference in
428+
observed state at all because they perform their operations asynchronously. A
429+
second `Observe()` call directly after the operation is performed would very
430+
likely not discover any changes at all.
431+
* There are concerns that a second `Observe()` call could cause too many calls
432+
to the external API and possibly induce rate limiting or throttling.

0 commit comments

Comments
 (0)