@amnn amnn commented Dec 3, 2025

Description

Make use of the new Services abstraction to avoid handling cancellation using CancellationToken, which often requires threading the token deep into each service.

This change also improves graceful shutdown support (for example when there is a metrics service running alongside the indexer, and the indexer triggers a shutdown, the metrics service is allowed time to drain connections before shutting down itself).
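As a rough illustration of the "time to drain" behaviour described above, here is a minimal std-only sketch. It is not the `Service` implementation: `shutdown_with_grace`, the polling worker, and the simulated drain time are all hypothetical stand-ins, and the real abstraction is async rather than thread-based.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: ask a secondary worker to stop, then wait up to
/// `grace` for it to report that it has drained. Returns `true` if the
/// worker finished within the grace period.
pub fn shutdown_with_grace(drain_time: Duration, grace: Duration) -> bool {
    let stop = Arc::new(AtomicBool::new(false));
    let (done_tx, done_rx) = mpsc::channel::<()>();

    // Stand-in for a secondary service (e.g. a metrics server).
    let worker_stop = Arc::clone(&stop);
    thread::spawn(move || {
        // Serve until asked to stop.
        while !worker_stop.load(Ordering::Acquire) {
            thread::sleep(Duration::from_millis(1));
        }
        // Simulate draining in-flight connections.
        thread::sleep(drain_time);
        let _ = done_tx.send(());
    });

    // The primary service has exited: trigger shutdown of the secondary one.
    stop.store(true, Ordering::Release);

    // Wait out the grace period; past it, a caller would abort the worker.
    done_rx.recv_timeout(grace).is_ok()
}
```

A worker that drains faster than the grace period yields `true`; one that is still draining when the grace period expires yields `false`, at which point a caller would fall back to aborting it.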

Test plan

CI

Stack


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • gRPC:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • Indexing Framework: The indexer, ingestion service, and metrics service all now return a Service instead of a JoinHandle<()> when run. Use Service::main to wait for the service to exit cleanly or with an error, or respond to a termination signal with a graceful shutdown. Service also exposes run, run_with_grace, join, and shutdown functions to customise various aspects of the shutdown process.

@amnn amnn self-assigned this Dec 3, 2025
@amnn amnn requested a review from a team as a code owner December 3, 2025 12:17

// Job finished all work successfully
Ok(())
}
ExitReason::UserInterrupt => {
Contributor Author

This behaviour has changed slightly -- Service::main doesn't distinguish between a user interruption and a termination signal -- they both come across as a termination, which will produce an exit code of 1 if the job was bounded.

@nickvikeras -- want to check that this is ok?
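To make the behaviour change concrete, here is a small sketch of the exit-code mapping as described in this comment. The `Outcome` enum and `exit_code` function are hypothetical names, not part of the PR. Only the "bounded job + termination gives exit code 1" arm comes from the comment; the other arms are assumptions.

```rust
/// How a run ended, in this sketch (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The job finished all of its work.
    Completed,
    /// A user interrupt or a termination signal; no longer distinguished.
    Terminated,
    /// The job returned an error.
    Failed,
}

/// Map an outcome to a process exit code.
pub fn exit_code(outcome: Outcome, bounded: bool) -> i32 {
    match outcome {
        Outcome::Completed => 0,
        // From the comment: a bounded job that is cut short exits with 1.
        Outcome::Terminated if bounded => 1,
        // Assumption: an unbounded job can only stop by being terminated,
        // so termination is treated as a normal exit.
        Outcome::Terminated => 0,
        // Assumption: errors are reported with a non-zero code.
        Outcome::Failed => 1,
    }
}
```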


#[error(transparent)]
Err(#[from] anyhow::Error),
pub struct Finalizer {
Contributor Author

Not entirely happy with this -- the restore task needs to do some clean-up work at the end assuming the main service exits completely cleanly. I implemented this by having the run task output a Service for the main service, and then this Finalizer which can be run as its own service, when the main service completes successfully.

Originally, I tried to add finalization to the Service abstraction, but it doesn't compose well -- in particular with finalization usually you want to say things like "once this set of tasks completes successfully, run this task", but the way we merge services together can cause a finalizer task to end up waiting arbitrarily long after the tasks it cares about have completed to run, which is usually not desired.
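The shape of that arrangement, sketched with plain threads and closures: run the main work, and only if it exits cleanly run the finalizer as a separate step. This `Finalizer` and `run_then_finalize` are simplified, hypothetical stand-ins; the real types are async and built on `Service`.

```rust
use std::thread;

type TaskResult = Result<(), String>;

/// Hypothetical stand-in for the finalizer returned alongside the main service.
pub struct Finalizer {
    work: Box<dyn FnOnce() -> TaskResult + Send>,
}

impl Finalizer {
    pub fn new(work: impl FnOnce() -> TaskResult + Send + 'static) -> Self {
        Self { work: Box::new(work) }
    }

    /// Run the clean-up work.
    pub fn run(self) -> TaskResult {
        (self.work)()
    }
}

/// Run `main` to completion, then run `finalizer` only if `main` succeeded.
pub fn run_then_finalize(
    main: impl FnOnce() -> TaskResult + Send + 'static,
    finalizer: Finalizer,
) -> TaskResult {
    // Run the main service on its own thread and wait for it to exit.
    let outcome = thread::spawn(main)
        .join()
        .map_err(|_| "main service panicked".to_string())?;

    // Skip clean-up unless the main service exited cleanly.
    outcome?;

    finalizer.run()
}
```

Keeping the finalizer outside the main service means it depends only on the tasks it was created with, not on whatever else the service is later merged with.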

Contributor Author

Straggler from before.

Contributor Author

It's not clear from this change, but previously the commit watermark task would run one more time after it was told to exit, to ensure that any pending watermark was flushed.

Now, that will only happen if the service is shutting down gracefully (i.e. not because of a termination signal, error, or interrupt).

@nickvikeras, I'm assuming that's still okay for the checkpoint indexer? It means that if the indexer fails or is told to shut down prematurely, it won't try to flush any pending watermarks, but if it is winding down naturally after having processed a finite range of checkpoints, it will. (Previously, when we were using cancellation tokens, we couldn't distinguish between these cases, so we sometimes received a cancellation signal as part of a normal shutdown.)

The reason I had to make this change is that sometimes the indexer is running in the same service as the database. When the indexer has been told to shut down, the database has been told to shut down as well, so if the indexer tries to connect to the database, it will hang until it eventually times out (this happens when starting a local network as part of the CLI).
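A minimal sketch of the rule described in this comment, using an in-memory stand-in for the store. `Exit`, `Store`, and `wind_down` are hypothetical names for illustration only; the real commit watermark task is async and talks to a database or object store.

```rust
/// Why the commit watermark task is stopping (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Exit {
    /// The pipeline finished a finite range of checkpoints.
    Graceful,
    /// Termination signal, error, or interrupt.
    Aborted,
}

/// In-memory stand-in for wherever watermarks are persisted.
#[derive(Debug, Default)]
pub struct Store {
    pub committed: Option<u64>,
}

/// Final step of the task: flush a pending watermark only on a graceful exit.
pub fn wind_down(store: &mut Store, pending: Option<u64>, exit: Exit) {
    if exit == Exit::Graceful {
        if let Some(watermark) = pending {
            store.committed = Some(watermark);
        }
    }
    // On `Exit::Aborted` the store is not touched: it may already be shutting
    // down, and trying to connect to it could hang until a timeout.
}
```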

amnn added 12 commits December 3, 2025 13:14
## Description

Avoid using `CancellationToken` to detect when the service is being
shut down while yielded. When using the `Service` abstraction, explicitly
waiting on a cancellation token is not necessary, as the yield will be
within the scope of a task maintained by the service which can be
aborted.

Also make use of `Service` for the system package task.

## Test plan

This change breaks the build overall, but the crate builds and tests:

```
sui-indexer-alt-reader$ cargo check
sui-indexer-alt-reader$ cargo nextest run
```
## Description

Switch the metrics service to using the new `Service` abstraction, and
away from using `CancellationToken`.

## Test plan

Build, Test, Lints for `sui-indexer-alt-metrics` crate all succeed (but
this change does break the build across the mono-repo).
## Description

Switch JSONRPC-alt to using the new `Service` abstraction, and away from
using `CancellationToken`.

## Test plan

Build, Test, Lints for `sui-indexer-alt-jsonrpc` crate all succeed (but
this change does break the build across the mono-repo).
## Description

Switch GraphQL-alt to using the new `Service` abstraction, and away from
using `CancellationToken`.

As part of this change, Chain Identifier initialization was adapted.
Previously, the chain identifier task ran before RPC started, which
complicated lifecycles, because we needed to run that task to completion
to get a value before the service could start, which also meant that we
needed to handle termination signals for this initialization task as
well as for the main service.

Now, the chain identifier task is a secondary service that writes the
chain identifier to a `SetOnce` when it's done. This can be started
alongside the RPC, and the requests that need this information will wait
for it to be set before they can continue (but the RPC as a whole can
start and serve other requests before then, and termination handling can
remain in one place).
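The "write once, readers wait until set" pattern can be sketched with the standard library alone. The `SetOnceCell` below is a blocking, hypothetical analogue written for illustration; it is not the `SetOnce` type this commit uses, which is awaited from async code.

```rust
use std::sync::{Condvar, Mutex};

/// A std-only cell that can be set once; readers block until it is set.
pub struct SetOnceCell<T> {
    value: Mutex<Option<T>>,
    ready: Condvar,
}

impl<T: Clone> SetOnceCell<T> {
    pub fn new() -> Self {
        Self {
            value: Mutex::new(None),
            ready: Condvar::new(),
        }
    }

    /// Store the value. If the cell was already set, hand the value back as `Err`.
    pub fn set(&self, value: T) -> Result<(), T> {
        let mut slot = self.value.lock().unwrap();
        if slot.is_some() {
            return Err(value);
        }
        *slot = Some(value);
        // Wake every reader that is waiting for the value.
        self.ready.notify_all();
        Ok(())
    }

    /// Block until the value has been set, then return a copy of it.
    pub fn wait(&self) -> T {
        let guard = self.value.lock().unwrap();
        let guard = self
            .ready
            .wait_while(guard, |slot| slot.is_none())
            .unwrap();
        (*guard).clone().expect("value is set once the wait returns")
    }
}
```

In this arrangement the initialization task calls `set` when it has a value, request handlers that need the value call `wait`, and handlers that do not need it are never blocked.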

## Test plan

Build, test and lint `sui-indexer-alt-graphql`. Also run the service
locally and ensure that it exits cleanly when asked to shut down, or when
forcefully aborted.
## Description

Use the `Service` abstraction for composing parts of the Indexing
framework together. Most constituent services are abortable services,
with the exception of the concurrent committer which uses graceful
shutdown to ensure there's one more opportunity to write a committer
watermark when shutting down.

## Test plan

Existing tests.
## Description

In a recent change the commit watermark task was made to perform an extra
iteration before shutting down, however this can cause stalls during
shutdown if the task fails to connect to its store.

This situation occurs in calls to `sui start` where the CLI is
responsible for running the database process as a child as well as the
RPC services. The database is wound down by the `Ctrl-C` signal, while
the RPC services are gracefully shutting down.

The change was originally made to cater to the checkpoint indexer (or
any use of the object store) where watermark updates are infrequent, so
it was useful to flush the watermarks before shutting down in response
to a cancellation.

However, at the time of that change, the cancellation token was
responsible for signaling "normal shutdown" (e.g. after having processed
all checkpoints), as well as interrupts (unexpected termination).

The hope is that by using separate methods to communicate these two
forms of exit, we can avoid adding this graceful shutdown support.

## Test plan

The `sui` CLI no longer hangs when made to wait for the RPC services to
shut down gracefully.
## Description

Migrate the consistent store crate to using the `Service` abstraction,
away from using cancellation tokens.

This is mostly a straightforward conversion, with the exception of the
finalization process after restoration, which needs to run after the
main restoration service has exited successfully. Initially, I tried
adding finalization as a concept to the `Service` abstraction, but it
was not a good fit, because an individual service's finalizers would end
up dependent on the success of unrelated services' primary tasks after
merging.

Instead, it was easier to have the restoration task return a finalizer
that can be run to get its own service, which will run if the main
service finishes successfully.

## Test plan

Crate's own build, tests, lints.
## Description

Make use of the `Service` abstraction to support graceful shutdown for
RPC services in localnet.

## Test plan

The following exits promptly:

```
$ cargo run --bin sui -- start --force-regenesis --with-graphql
^C
```