
Chaos features #45

Merged
satyamjay-iitd merged 22 commits into master from chaos on Sep 8, 2025

Conversation

Brian-1402 (Collaborator) commented on Aug 21, 2025

PR for the chaos features tracked in #41. Core logic is done; review and small fixes remain.

Progress:

  • Crashes
    • Node endpoint for crashing individual actors
    • TOML parsing for chaos crashes (see the config sketch after this list)
    • Chaos timer function in job controller to send timed crash requests to nodes.
  • Msg loss/Msg Duplication
    • Local TOML parsing
    • ChaosManager for actor
  • Msg delay chaos (done in #51, "added delay chaos")
  • Restart actors (To be tested)
  • User callbacks for responding to chaos events
    • For the user to know that messages failed or that actors are unreachable; from the user's point of view, both look the same.
  • Actor-to-actor specific chaos settings
  • Timed start for new actors (?)
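
Since several of the items above hinge on the chaos TOML, here is a minimal sketch of what a per-actor section could deserialize into, assuming serde and the toml crate. All field names (crash_at_ms, msg_loss, msg_duplication, ...) are illustrative placeholders, not the actual schema.

```rust
use std::collections::HashMap;

use serde::Deserialize;

/// Illustrative per-actor chaos settings; field names are assumptions,
/// not the schema actually used by the job controller.
#[derive(Debug, Deserialize)]
struct ActorChaos {
    /// When to crash the actor, in ms after the job starts.
    crash_at_ms: Option<u64>,
    /// When to restart it, if at all.
    restart_at_ms: Option<u64>,
    /// Probability of dropping an outgoing message (0.0..=1.0).
    msg_loss: Option<f64>,
    /// How many extra copies of each message to send.
    msg_duplication: Option<u32>,
}

/// Illustrative top-level chaos config, keyed by actor name.
#[derive(Debug, Deserialize)]
struct ChaosConfig {
    actors: HashMap<String, ActorChaos>,
}

fn main() {
    let cfg: ChaosConfig = toml::from_str(
        r#"
        [actors.worker_1]
        crash_at_ms = 1000
        restart_at_ms = 5000
        msg_loss = 0.1
        "#,
    )
    .expect("valid chaos TOML");
    println!("{cfg:?}");
}
```

Validation (real actor names, sensible time ordering) would then run over the parsed config before any chaos timers are scheduled.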

General fixes:

  • Global chaos config toml
  • Chaos feature flag
  • Single loop chaos request timer (see the timer sketch after this list)
  • Stop chaos endpoint
  • Validation for chaos settings
    • Add a 404 Actor Not Found response
    • Actor names should refer to real actors
    • Time entries should make sense (no chaos start time after an actor crash or before its restart)
    • Figure out race conditions for chaos timers (e.g. msg_loss start_ms at 1000 ms and actor restart at 1000 ms, but the msg_loss request arrives early)
    • Chaos messages from node to actor should have higher priority than regular messages; otherwise, under high load, chaos requests get queued behind regular traffic and no longer accurately reflect the configured start/stop times.
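
A minimal sketch of the single-loop chaos request timer mentioned above, assuming tokio. ChaosEvent and send_chaos_request are stand-ins for whatever the job controller actually uses; anchoring every deadline to one job_start instant is what keeps start times comparable across events.

```rust
use std::time::Duration;

use tokio::time::{sleep_until, Instant};

/// Stand-in for the real chaos event type.
struct ChaosEvent {
    start_ms: u64,
    actor_name: String,
}

/// One loop drives every chaos request, instead of one task per event.
/// All deadlines are measured from the same `job_start` instant.
async fn chaos_timer_loop(job_start: Instant, mut schedule: Vec<ChaosEvent>) {
    // Fire events in chronological order.
    schedule.sort_by_key(|e| e.start_ms);
    for event in schedule {
        let deadline = job_start + Duration::from_millis(event.start_ms);
        sleep_until(deadline).await;
        // Hypothetical sender; the real code would hit the node's chaos endpoint.
        send_chaos_request(&event.actor_name).await;
    }
}

async fn send_chaos_request(actor_name: &str) {
    println!("sending chaos request for {actor_name}");
}

#[tokio::main]
async fn main() {
    let schedule = vec![ChaosEvent { start_ms: 1000, actor_name: "worker_1".into() }];
    chaos_timer_loop(Instant::now(), schedule).await;
}
```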

QoL fixes:

  • User-side tracing shows the file name rather than the actor name. Instead of the user having to put the actor name into log messages, try to have it built in (see the sketch below).
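
One possible way to get the actor name into user-side logs without the user threading it through every message: wrap the actor's future in a tracing span carrying the name. A minimal sketch, assuming the tracing and tracing-subscriber crates; run_actor is a hypothetical entry point, not the real one.

```rust
use tracing::{info, info_span, Instrument};

/// Hypothetical actor entry point; the span attaches the actor name
/// to every log line emitted while the actor runs.
async fn run_actor(actor_name: String) {
    async {
        info!("actor started");
        // ... actor message loop ...
        info!("actor stopping");
    }
    .instrument(info_span!("actor", name = %actor_name))
    .await;
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    run_actor("worker_1".to_string()).await;
}
```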

Some notes/questions to resolve:

  • Chaos timing issues: observing up to 10 s of delay between the node sending a chaos request and the actor receiving it.
  • Should nodes be notified about remote actors being crashed? This will require another node endpoint.
    • For now, nope.
  • Added the Clone trait for job_manager::NodeHandle; is that necessary? A clone of NodeHandle is being made for each actor-crashing async task; can this be avoided? (See the Arc sketch after this list.)
    • Fixed
  • If Ctrl-C is pressed and stop_job() is executed before the background chaos timers are complete, what happens?
    • The whole process exits, so the background tasks are also dropped.
  • Log statements for the job manager would be helpful; so far no logs are visible.
  • The stop_actor node endpoint takes RemoteActorInfo as input, but only the actor name is really required to crash an actor.
  • The rpc-client msg_duplication_request takes an i32 even though it was defined as u32 in node/rpc.rs.
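
On the NodeHandle-clone question above, one option is to put the handle behind an Arc so each crash task only bumps a reference count instead of cloning the whole handle. A minimal sketch, assuming tokio; NodeHandle and crash_actor here are stand-ins for the real job_manager types.

```rust
use std::sync::Arc;

/// Stand-in for job_manager::NodeHandle.
struct NodeHandle {
    addr: String,
}

impl NodeHandle {
    async fn crash_actor(&self, actor_name: &str) {
        println!("{} -> crash {}", self.addr, actor_name);
    }
}

#[tokio::main]
async fn main() {
    let handle = Arc::new(NodeHandle { addr: "10.0.0.1:9000".into() });

    let mut tasks = Vec::new();
    for actor in ["a", "b", "c"] {
        // Arc::clone is a cheap pointer copy, not a deep clone of NodeHandle.
        let handle = Arc::clone(&handle);
        tasks.push(tokio::spawn(async move {
            handle.crash_actor(actor).await;
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}
```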

Design thoughts for actor restarts and timed starts

  • Before stopping actors, we probably need to remove that actor's ChaosEvents from the node's chaos schedule. Does the actor rx being closed handle this implicitly? Is it better not to remove the ChaosEvents and have the node ignore them if the actor is not found?
    • Don't remove ChaosEvents; receive a 404 from the chaos endpoint and ignore it.
  • To restart the actor, we probably also need to save its logical_ops and physical_ops?
    • For now, RemoteActorInfo isn't removed from NodeHandle.actors during crashes, so we can reuse that info.
    • But that isn't enough: we should store SpawnArgs inside NodeHandle somewhere, mapped by actor_name (see the restart sketch after this list).
  • Do we simply call the node's start_actor endpoint, or do we call the place function of the NodeHandle?
    • Just start_actor. The remaining code in place() is for setting up chaos etc., which doesn't change with crashes.
  • Before stopping actors, we probably also have to remove the actor from the node's actor list (or else we can't add it again?)
    • Keep it in job_manager::NodeHandle.actors; it will be needed for restart, maybe.
    • Interestingly, when node::actor_control_loop() updates local_actors for the normal spawning of actors, there is no check for an existing actor of the same name; it just overwrites that hashmap entry. In general this could be bad, since it is undefined behavior and invalid input isn't handled, but it could be useful if we want the user to restart an actor with a different payload/settings. So we should decide what the behavior is for crash and restart.
  • It seems we would need to notify the other nodes that the actor is back up again using the notify_remote_actor_added endpoint? Would we have to create a notify_remote_actor_removed endpoint as well for when an actor is stopped?
    • Regarding a notify_remote_actor_removed endpoint: it seems better not to have it and, for that matter, not to update any node's state about remote actors being crashed. Let nodes keep stale info about remote actors.
    • This way, when other actors try to resolve the addr of an actor, it will succeed, but when they try to send messages, it will fail, and the user can handle that in the second callback, ActorUnreachable (this will allow testing FT, network partitions, etc.).
    • Now we won't need the notify_remote_actor_added endpoint here either.
    • Similarly, we will have another callback for when a node has not been able to resolve the addr of an actor after several tries, CouldntResolve (this should have a default behaviour of retrying for some time).
  • Do we set up the unfinished ChaosEvent again for the restarted actor? (If we had not removed the actor's ChaosEvent, it should still be there.)
    • No; we aren't removing ChaosEvents during a crash, so it is still there.
  • For timed actor starts, we need to decide what the chaos timers are relative to: should the timers start when the actor starts or when the JobController starts?
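
A minimal sketch of the restart bookkeeping discussed in this list: keep SpawnArgs inside NodeHandle keyed by actor name, and restart by replaying them against the node's start_actor endpoint rather than going through place() again. SpawnArgs, remember_spawn, restart_actor, and the start_actor call are illustrative stand-ins, not the real API.

```rust
use std::collections::HashMap;

/// Stand-in for the real spawn arguments (payload, ops, placement, ...).
#[derive(Clone)]
struct SpawnArgs {
    payload: Vec<u8>,
}

/// Stand-in for job_manager::NodeHandle.
struct NodeHandle {
    addr: String,
    /// Saved at first placement, keyed by actor name, so a crashed
    /// actor can be restarted without going through place() again.
    spawn_args: HashMap<String, SpawnArgs>,
}

impl NodeHandle {
    fn remember_spawn(&mut self, actor_name: &str, args: SpawnArgs) {
        self.spawn_args.insert(actor_name.to_string(), args);
    }

    /// Restart by replaying the saved SpawnArgs against the node's
    /// start_actor endpoint; chaos setup from place() is not repeated.
    async fn restart_actor(&self, actor_name: &str) -> Result<(), String> {
        let args = self
            .spawn_args
            .get(actor_name)
            .ok_or_else(|| format!("no SpawnArgs saved for {actor_name}"))?;
        // Hypothetical call to the node's /start_actor endpoint.
        start_actor(&self.addr, actor_name, args).await
    }
}

async fn start_actor(addr: &str, actor_name: &str, _args: &SpawnArgs) -> Result<(), String> {
    println!("POST http://{addr}/start_actor for {actor_name}");
    Ok(())
}

#[tokio::main]
async fn main() {
    let mut node = NodeHandle { addr: "10.0.0.1:9000".into(), spawn_args: HashMap::new() };
    node.remember_spawn("worker_1", SpawnArgs { payload: vec![] });
    node.restart_actor("worker_1").await.unwrap();
}
```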

Untested though.

Also, make the rpc_client generation Makefile code easier to run.
Brian-1402 changed the title from "Chaos features for #41" to "Chaos features" on Aug 21, 2025
Brian-1402 self-assigned this on Aug 21, 2025
Brian-1402 linked an issue on Aug 21, 2025 that may be closed by this pull request
Brian-1402 removed a link to an issue on Aug 21, 2025
Brian-1402 and others added 2 commits August 28, 2025 03:17
ChaosManager at Actor TBD

Co-authored-by: Brian Sajeev <Brian-1402@users.noreply.github.com>
Co-authored-by: Brian Sajeev <Brian-1402@users.noreply.github.com>
    Json(actor_info): Json<RemoteActorInfo>,
) -> impl IntoResponse {
    state
        .clone()
Review comment (Owner):
is cloning required here?

node/src/rpc.rs (outdated)

    post,
    path = "/stop_actor",
    responses(
        (status = 200, description = "Actor stop initiated")
Review comment (Owner):
Also return 404 for actor not found
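
A minimal sketch of what returning 404 from stop_actor could look like, assuming an axum-style handler and a local_actors map on the node state; the request and state shapes here are assumptions, not the actual node code (which currently takes RemoteActorInfo). The handler would be registered on the node's existing router.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use axum::{extract::State, http::StatusCode, response::IntoResponse, Json};
use serde::Deserialize;

/// Stand-in for the real request body; only the name is needed to stop an actor.
#[derive(Deserialize)]
struct StopActorRequest {
    actor_name: String,
}

/// Stand-in for the node state holding locally running actors.
#[derive(Clone)]
struct NodeState {
    local_actors: Arc<Mutex<HashMap<String, ()>>>,
}

async fn stop_actor(
    State(state): State<NodeState>,
    Json(req): Json<StopActorRequest>,
) -> impl IntoResponse {
    let actors = state.local_actors.lock().unwrap();
    if actors.contains_key(&req.actor_name) {
        // Real code would signal the actor's control loop to stop here.
        (StatusCode::OK, "Actor stop initiated")
    } else {
        (StatusCode::NOT_FOUND, "Actor not found")
    }
}
```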

Aman-Hassan and others added 16 commits September 4, 2025 01:01
Separate Chaos Endpoints for cleaner chaos request types

Probability & Factor Toml validation

Co-authored-by: Brian Sajeev <Brian-1402@users.noreply.github.com>
Chaos Refactored into apply/revert types
Also formatting and clippy satisfaction
* added docs workflow

* Update docs.yml

* Update docs.yml #skipci

* Update docs.yml [skip ci]

* Update docs.yml [skip ci]

* added browser based job controller

* updated ui

* formatting

* Update api (#49)

* updated generated api clients

* clippy

* Added #actor macro to register the actor (#50)

* updated generated api clients

* clippy

* updated api to get list of registerd apis

* wip ui

* added how_to.md

* added register actor macro
satyamjay-iitd dismissed their stale review on September 8, 2025:

Will do later.

satyamjay-iitd merged commit 2edd420 into master on Sep 8, 2025.
1 check passed.
satyamjay-iitd deleted the chaos branch on September 8, 2025.