Operation handling for reliable receipts #1637
New callback that can be called to insert a number of operation updates reliably

Signed-off-by: Enrique Lacal <[email protected]>
Just thought of something different, instead of using a
pkg/core/operation.go (outdated):

```go
type OperationCallbacks interface {
	OperationUpdate(update *OperationUpdate)
	BulkOperationUpdates(ctx context.Context, updates []*OperationUpdate, onCommit chan<- error)
}
```
I think passing a channel might be a little sticky. We've gone back and forth on this a few times in other code interfaces across the FireFly ecosystem.
My feeling is a callback function is much easier to pass in, and you then leave the caller the responsibility to decide if they want that callback to use a channel or not.
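A minimal sketch of the two shapes being compared (the function names and types here are illustrative stand-ins, not the actual FireFly interface): with a callback parameter, a caller that wants channel semantics can trivially bridge to one, while the interface itself stays simple.

```go
package main

import "fmt"

// OperationUpdate is a minimal stand-in for the real core.OperationUpdate type.
type OperationUpdate struct{ NamespacedOpID string }

// Channel shape: the interface dictates how completion is signalled.
func bulkUpdatesWithChannel(updates []*OperationUpdate, onCommit chan<- error) {
	// ... persist the updates in one DB transaction ...
	onCommit <- nil
}

// Callback shape: the caller decides whether to bridge to a channel or not.
func bulkUpdatesWithCallback(updates []*OperationUpdate, onCommit func(error)) {
	// ... persist the updates in one DB transaction ...
	onCommit(nil)
}

func main() {
	updates := []*OperationUpdate{{NamespacedOpID: "ns1:op1"}}

	// A caller that wants a channel can trivially wrap the callback.
	done := make(chan error, 1)
	bulkUpdatesWithCallback(updates, func(err error) { done <- err })
	fmt.Println("committed:", <-done)

	// The channel shape forces every caller to allocate and manage a channel.
	ch := make(chan error, 1)
	bulkUpdatesWithChannel(updates, ch)
	<-ch
}
```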
Yes! #1637 (comment) I got to the same conclusion after reading more code recently, and that is what I'm reworking 💪🏼
Cool - I think maybe with the discussion below around making the interface synchronous, this might become a moot point
internal/operations/manager.go (outdated):

```go
	}
	log.L(om.ctx).Debugf("%s updating operation %s status=%s%s", update.Plugin, update.NamespacedOpID, update.Status, errString)
}
om.updater.SubmitBulkOperationUpdates(ctx, updates, onCommit)
```
What is it that absolutely guarantees that this list of receipts, in the order they were submitted, will be committed before any other receipts submitted to this function?
Thanks, very insightful. I realize that today it is up to the consumer to call this function only once at a time, so this function should handle that case and not allow it to happen.
👍 - with the other discussions to make it a synchronous interface, I think you could just handle this as a comment on the API doc:
An insertion ordering guarantee is only provided when this code is called on a single goroutine inside of the connector. It is the responsibility of the connector code to allocate that routine, and ensure that there is only one.
```go
// Notice how this is not using the workers.
// The reason is that we want all updates to be stored at once, in this order.
// If offloaded to workers, the updates would be processed in parallel, in different DB transactions, and in a different order.
go func() {
```
I do think this is the fundamental question of design in this PR, but I'm currently unclear how this solves it.
This goroutine just gets fired off. Whose responsibility is it, and exactly how, to do ordering against the API in a way that means future batches are extremely strictly ordered against completion of this batch?
Fundamentally, isn't this overcomplicated? Shouldn't we just make it synchronous and use the goroutine of the caller?
I thought about making it synchronous, and wanted the connector to be able to wait for a notification back, but you are right: if we are treating this API as "persist and wait until done" in the connector, it should absolutely be synchronous.
I think that could be a big simplification overall. Push the responsibility to the connector to:
- Manage the goroutine that is the receipt writer
- Perform suitable performance batching into this code
- Handle all errors/retries
Yeah, that massively simplifies this code to a synchronous bulk write of receipts.
There is a caveat to the above, which is that doBatchUpdateWithRetry retries indefinitely. I think we need to break this if we are making it synchronous.
```go
	}
} else {
	// Less than the max inserts, just do it in one go
	err := ou.doBatchUpdateWithRetry(ctx, updates)
```
If you make it synchronous you do not need retry here; the caller can be informed it is their responsibility to retry.
```go
// This retries forever until there is no error
// but returns on cancelled context
if len(updates) > ou.conf.maxInserts {
```
You can choose to push this responsibility back to the connector. I don't think you need to try to do sub-pagination inside this implementation.
Force-pushed 5acdf0e to 283dce4
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##             main    #1637   +/- ##
=====================================
  Coverage   99.95%   99.95%
=====================================
  Files         337      337
  Lines       29524    29590    +66
=====================================
+ Hits        29512    29578    +66
  Misses          8        8
  Partials        4        4
```
@peterbroadhurst This is good to re-review. Test coverage should be good as well. There was a slight doubt on whether this function should restrict and validate the Plugin name inside each operation update; I left it up to the connector plugin to put the right one, but I might add validation there or make it pass a
```go
	return err
}
for _, update := range updates {
	if update.OnComplete != nil {
```
What's the purpose of this in practice now?
I struggled with this one. It's still available in the struct core.OperationUpdate, but it doesn't serve any purpose here, as we are returning synchronously, so the top connector goroutine can do what it needs once the updates have been committed.
So we could remove it, but we need to leave it for the existing flow, and just comment on this flow that it will not be used.
Why, @EnriqueL8? Because you think there's other stuff on core.OperationUpdate that needs to be on this batch API?
The rest of the fields in core.OperationUpdate are needed in this batch API, so yes, we could create a new type that has that subset of fields.
Thanks @peterbroadhurst for the nudge and suggestion. Moved to OperationUpdateAsync and OperationUpdate.
peterbroadhurst left a comment:
Thanks @EnriqueL8 - looks great, and appreciate you going round the loop with me on this
Proposed changes

Add a new function in the Operation Handler called BulkOperationUpdates that takes in a series of OperationUpdate and will notify the submitter when the transaction has been committed to the database. This is important for blockchain plugins that want to preserve the ordering of receipts, and to update the operation before acknowledging to the blockchain connector that FireFly has accepted those receipts.

Reasons why I went this way:
- The idempotency key is the Operation ID, so correlation between the receipt and the operation is easy in the blockchain plugin

Key parts of the code to look at:
- doBatchUpdateWithRetry runs in a retry.Do and in a DB RunAsGroup; it retries forever (see firefly/internal/operations/operation_updater.go, lines 184 to 199 in c7e6058)
- The update worker is picked based on the operation ID (firefly/internal/operations/operation_updater.go, line 90 in c7e6058)
- Workers call doBatchUpdateWithRetry after a timeout, or when a limit is reached, to write (firefly/internal/operations/operation_updater.go, lines 172 to 180 in c7e6058)
Contributes to #1622
Example Blockchain plugin code that has a durable websocket communication: