refactor(indexer-alt): CancellationToken -> Services #24503
base: amnn/futures
Conversation
```
// Job finished all work successfully
Ok(())
}
ExitReason::UserInterrupt => {
```
This behaviour has changed slightly -- Service::main doesn't distinguish between a user interruption and a termination signal -- they both come across as a termination, which will produce an exit code of 1 if the job was bounded.
@nickvikeras -- want to check that this is ok?
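For concreteness, here is a minimal, hypothetical sketch of the exit-code mapping this implies. Only `ExitReason::UserInterrupt` appears in the diff above; the other variants and the `bounded` flag are stand-ins for illustration, not the actual `Service::main` behaviour.

```
// Hypothetical sketch: mapping a service's exit reason to a process exit
// code, with interrupts and terminations folded together as described above.
enum ExitReason {
    /// The job finished all of its work (assumed variant).
    Completed,
    /// The process received a termination signal (assumed variant).
    Terminated,
    /// The user explicitly interrupted the job (from the diff above).
    UserInterrupt,
}

fn exit_code(reason: ExitReason, bounded: bool) -> i32 {
    match reason {
        // A bounded job only succeeds if it ran to completion.
        ExitReason::Completed => 0,
        // With interrupts and terminations treated the same, a bounded job
        // that is cut short exits non-zero either way.
        ExitReason::Terminated | ExitReason::UserInterrupt if bounded => 1,
        // An unbounded job is expected to be stopped by a signal eventually.
        _ => 0,
    }
}

fn main() {
    std::process::exit(exit_code(ExitReason::Terminated, true));
}
```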
```
#[error(transparent)]
Err(#[from] anyhow::Error),
pub struct Finalizer {
```
Not entirely happy with this -- the restore task needs to do some clean-up work at the end, assuming the main service exits completely cleanly. I implemented this by having the run task output a `Service` for the main service, along with this `Finalizer`, which can be run as its own service once the main service completes successfully.
Originally, I tried to add finalization to the Service abstraction, but it doesn't compose well -- in particular with finalization usually you want to say things like "once this set of tasks completes successfully, run this task", but the way we merge services together can cause a finalizer task to end up waiting arbitrarily long after the tasks it cares about have completed to run, which is usually not desired.
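A rough, hedged sketch of the resulting shape, using stand-in types: `RestoreService` and the `main` methods here are illustrative, only `Finalizer` comes from the diff, and the real `Service` API in `sui-futures` differs in detail.

```
use anyhow::Result;

/// Stand-in for the main restoration service: runs to completion or fails.
struct RestoreService;

impl RestoreService {
    async fn main(self) -> Result<()> {
        // ... drive the restoration tasks to completion ...
        Ok(())
    }
}

/// Clean-up work that must only run after a fully successful restoration.
pub struct Finalizer;

impl Finalizer {
    async fn main(self) -> Result<()> {
        // ... e.g. swap restored tables in, drop staging state ...
        Ok(())
    }
}

async fn restore() -> Result<()> {
    // Hypothetical constructor: the restore task hands back both pieces.
    let (service, finalizer) = (RestoreService, Finalizer);

    // Only finalize if the main service exited completely cleanly, so an
    // interrupted or failed restore never finalizes partial state.
    service.main().await?;
    finalizer.main().await
}

#[tokio::main]
async fn main() -> Result<()> {
    restore().await
}
```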
Straggler from before.
It's not clear from this change, but previously the commit watermark task would run one more time after it was told to exit, to ensure that any pending watermark was flushed.
Now, that will only happen if the service is shutting down gracefully (e.g. not because of a termination signal, error, or interrupt).
@nickvikeras, I'm assuming that's still okay for the checkpoint indexer? It means that if the indexer fails or is told to shut down prematurely, it won't try to flush any pending watermarks, but if it is winding down naturally after having processed a finite range of checkpoints, it will (previously, when we were using cancellation tokens, we couldn't distinguish between these cases, so we sometimes received a cancellation signal as part of a normal shutdown).
The reason I had to make this change is that the indexer sometimes runs in the same service as the database, so when the indexer has been told to shut down, the database has also been told to shut down; if the indexer then tries to connect to the database, it will hang and eventually time out (this happens when starting a local network as part of the CLI).
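For illustration, a minimal sketch of the intended behaviour with stand-in names (`commit_watermark_task`, `flush_pending_watermark`, and the oneshot-based graceful signal are assumptions, not the indexer's actual code): the task keeps flushing on a timer and performs exactly one final flush when asked to wind down gracefully, but does nothing further if it is aborted.

```
use std::time::Duration;
use tokio::sync::oneshot;

async fn commit_watermark_task(mut graceful_shutdown: oneshot::Receiver<()>) {
    let mut interval = tokio::time::interval(Duration::from_secs(1));
    loop {
        tokio::select! {
            _ = interval.tick() => {
                // Periodic flush of any pending watermark.
                flush_pending_watermark().await;
            }
            _ = &mut graceful_shutdown => {
                // Graceful wind-down (e.g. a bounded run finished): flush one
                // last time so nothing committed is lost, then exit.
                flush_pending_watermark().await;
                return;
            }
        }
    }
}

async fn flush_pending_watermark() {
    // Placeholder for writing any pending watermark to the store.
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    let task = tokio::spawn(commit_watermark_task(rx));
    // Simulate a graceful wind-down after a short run. If the task were
    // aborted instead, it would stop without touching the store again.
    tokio::time::sleep(Duration::from_millis(10)).await;
    let _ = tx.send(());
    let _ = task.await;
}
```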
## Description

Avoid using `CancellationToken` to detect when the service is being shut down while yielded. When using the `Service` abstraction, explicitly waiting on a cancellation token is not necessary, as the yield will be within the scope of a task maintained by the service, which can be aborted. Also make use of `Service` for the system package task.

## Test plan

This change breaks the build overall, but the crate builds and tests:

```
sui-indexer-alt-reader$ cargo check
sui-indexer-alt-reader$ cargo nextest run
```
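To make the difference concrete, here is a hedged before/after comparison with illustrative names (`do_work` and the shutdown mechanics are placeholders; the real service abstraction differs): with a token, every yield point has to race against cancellation, while a service-managed task can simply be aborted while it is yielded.

```
use tokio_util::sync::CancellationToken;

// Before: every await point that might block has to race against the token,
// which forces the token to be threaded into each task.
async fn task_with_token(token: CancellationToken) {
    loop {
        tokio::select! {
            _ = token.cancelled() => return,
            _ = do_work() => {}
        }
    }
}

// After: when the task runs inside a service-managed scope, the service can
// abort it while it is yielded, so the body is just the work itself.
async fn task_in_service() {
    loop {
        do_work().await;
    }
}

async fn do_work() {
    // Placeholder unit of work.
    tokio::task::yield_now().await;
}

#[tokio::main]
async fn main() {
    // The service would hold this handle and abort the task on shutdown.
    let handle = tokio::spawn(task_in_service());
    handle.abort();
    let _ = handle.await; // resolves to a JoinError marking the abort
}
```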
## Description

Switch the metrics service to using the new `Service` abstraction, and away from using `CancellationToken`.

## Test plan

Build, test, and lints for the `sui-indexer-alt-metrics` crate all succeed (but this change does break the build across the mono-repo).
## Description

Switch JSONRPC-alt to using the new `Service` abstraction, and away from using `CancellationToken`.

## Test plan

Build, test, and lints for the `sui-indexer-alt-jsonrpc` crate all succeed (but this change does break the build across the mono-repo).
## Description

Switch GraphQL-alt to using the new `Service` abstraction, and away from using `CancellationToken`.

As part of this change, chain identifier initialization was adapted. Previously, the chain identifier task ran before the RPC started, which complicated lifecycles: the task had to run to completion to produce a value before the service could start, which also meant handling termination signals for this initialization task as well as for the main service. Now, the chain identifier task is a secondary service that writes the chain identifier to a `SetOnce` when it is done. This can be started alongside the RPC, and requests that need this information wait for it to be set before they continue (but the RPC as a whole can start and serve other requests before then, and termination handling stays in one place). A sketch of this pattern follows the test plan below.

## Test plan

Build, test, and lint `sui-indexer-alt-graphql`. Also run the service locally and ensure that it exits cleanly when asked to shut down, or when forcefully aborted.
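The sketch referenced above, using `tokio::sync::watch` as a stand-in for the `SetOnce` mentioned in the description (the type alias, `fetch_chain_identifier`, and the handler logic are all placeholders): the chain identifier is fetched by a secondary task, while requests that need it wait for the value without blocking the rest of the RPC from starting.

```
use tokio::sync::watch;

type ChainIdentifier = String; // placeholder for the real type

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(None::<ChainIdentifier>);

    // Secondary "service": runs alongside the RPC, not before it.
    tokio::spawn(async move {
        let chain_id = fetch_chain_identifier().await;
        let _ = tx.send(Some(chain_id));
    });

    // A request handler that needs the chain identifier waits for it to be
    // set; other requests can be served before this resolves.
    let mut rx = rx.clone();
    let chain_id = {
        let id = rx
            .wait_for(|id| id.is_some())
            .await
            .expect("chain identifier task stopped");
        id.as_ref().expect("just checked").clone()
    };
    println!("chain identifier: {chain_id}");
}

async fn fetch_chain_identifier() -> ChainIdentifier {
    // Placeholder: in practice this would read from the store or network.
    "4c78adac".to_string()
}
```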
## Description

Use the `Service` abstraction for composing parts of the indexing framework together. Most constituent services are abortable, with the exception of the concurrent committer, which uses graceful shutdown to ensure there is one more opportunity to write the committer watermark when shutting down (see the sketch after this description).

## Test plan

Existing tests.
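The sketch mentioned above, built from stand-in tokio primitives rather than the actual `Service` merge logic: most constituents are plain abortable tasks, while the committer watches a graceful-shutdown signal so it gets one last chance to write its watermark before exiting.

```
use tokio::sync::watch;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    let (shutdown_tx, shutdown_rx) = watch::channel(false);
    let mut tasks: JoinSet<()> = JoinSet::new();

    // Abortable constituents (ingestion, processors, pruner, ...): they can
    // be cut off at any await point, so they need no shutdown plumbing.
    let abortable = tasks.spawn(async {
        loop {
            tokio::task::yield_now().await; // placeholder work
        }
    });

    // The concurrent committer instead watches for a graceful-shutdown
    // signal, giving it one more opportunity to write its watermark.
    let mut rx = shutdown_rx.clone();
    tasks.spawn(async move {
        let _ = rx.wait_for(|&stop| stop).await;
        // ... final watermark write would happen here ...
    });

    // Shutting down: abort the abortable tasks, then signal the committer.
    abortable.abort();
    let _ = shutdown_tx.send(true);
    while tasks.join_next().await.is_some() {}
}
```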
## Description

In a recent change, the commit watermark task was made to perform an extra iteration before shutting down; however, this can cause stalls during shutdown if the task fails to connect to its store. This situation occurs in calls to `sui start`, where the CLI is responsible for running the database process as a child as well as the RPC services. The database is wound down by the `Ctrl-C` signal while the RPC services are still gracefully shutting down.

The change was originally made to cater to the checkpoint indexer (or any use of the object store), where watermark updates are infrequent, so it was useful to flush the watermarks before shutting down in response to a cancellation. However, at the time of that change, the cancellation token signalled "normal shutdown" (e.g. after having processed all checkpoints) as well as interrupts (unexpected termination). The hope is that by using separate methods to communicate these two forms of exit, we can avoid adding this graceful shutdown support.

## Test plan

The `sui` CLI no longer hangs when made to wait for the RPC services to gracefully shut down.
## Description

Migrate the consistent store crate to using the `Service` abstraction, away from using cancellation tokens. This is mostly a straightforward conversion, with the exception of the finalization process after restoration, which needs to run after the main restoration service has exited successfully.

Initially, I tried adding finalization as a concept to the `Service` abstraction, but it was not a good fit, because an individual service's finalizers would end up depending on the success of unrelated services' primary tasks after merging. Instead, it was easier to have the restoration task return a finalizer that can be run as its own service, which will run if the main service finishes successfully.

## Test plan

Crate's own build, tests, lints.
## Description

Make use of the `Service` abstraction to support graceful shutdown for RPC services in localnet.

## Test plan

The following exits promptly:

```
$ cargo run --bin sui -- start --force-regenesis --with-graphql
^C
```
Description
Make use of the new `Service` abstraction to avoid handling cancellation using `CancellationToken`, which often requires threading the token deep into each service. This change also improves graceful shutdown support (for example, when there is a metrics service running alongside the indexer and the indexer triggers a shutdown, the metrics service is allowed time to drain connections before shutting down itself).
Test plan
CI
Stack
`sui-futures` crate #24502
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.
Components now yield a `Service` instead of a `JoinHandle<()>` when run. Use `Service::main` to wait for the service to exit cleanly or with an error, or respond to a termination signal with a graceful shutdown. `Service` also exposes `run`, `run_with_grace`, `join`, and `shutdown` functions to customise various aspects of the shutdown process.
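A stand-in sketch of the idea (not the real `sui-futures` API; the constructor and `main` here are simplified assumptions): a `Service` wraps the running task, and `main` waits for it to finish or reacts to a termination signal.

```
use anyhow::Result;
use tokio::task::JoinHandle;

struct Service {
    handle: JoinHandle<Result<()>>,
}

impl Service {
    fn new(handle: JoinHandle<Result<()>>) -> Self {
        Self { handle }
    }

    /// Wait for the task to finish; on Ctrl-C, abort it instead.
    async fn main(mut self) -> Result<()> {
        tokio::select! {
            res = &mut self.handle => res?,
            _ = tokio::signal::ctrl_c() => {
                self.handle.abort();
                Ok(())
            }
        }
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    // The component would hand back a `Service` instead of a bare handle.
    let service = Service::new(tokio::spawn(async { Ok(()) }));
    service.main().await
}
```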