
Chaos features #45

Merged
satyamjay-iitd merged 22 commits into master from chaos on Sep 8, 2025

Conversation

Brian-1402 (Collaborator) commented on Aug 21, 2025

PR for the chaos features tracked in #41. Core logic is done; review and small fixes remain.

Progress:

  • Crashes
    • Node endpoint for crashing individual actors
    • TOML parsing for chaos crashes (see the config sketch after this list)
    • Chaos timer function in job controller to send timed crash requests to nodes.
  • Msg loss/Msg Duplication
    • Local TOML parsing
    • ChaosManager for actor
  • Msg delay chaos (done in #51, "added delay chaos")
  • Restart actors (To be tested)
  • User callbacks for responding to chaos events
    • For the user to know that messages failed or that actors are unreachable; from the user's point of view, both look the same.
  • Actor-to-actor specific chaos settings
  • Timed start for new actors (?)
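
Since several of the items above hinge on the chaos TOML, here is a minimal sketch of what a per-actor section could deserialize into, assuming serde and the toml crate. All field names (crash_at_ms, msg_loss, msg_duplication, ...) are illustrative placeholders, not the actual schema.

```rust
use std::collections::HashMap;

use serde::Deserialize;

/// Illustrative per-actor chaos settings; field names are assumptions,
/// not the schema actually used by the job controller.
#[derive(Debug, Deserialize)]
struct ActorChaos {
    /// When to crash the actor, in ms after the job starts.
    crash_at_ms: Option<u64>,
    /// When to restart it, if at all.
    restart_at_ms: Option<u64>,
    /// Probability of dropping an outgoing message (0.0..=1.0).
    msg_loss: Option<f64>,
    /// How many extra copies of each message to send.
    msg_duplication: Option<u32>,
}

/// Illustrative top-level chaos config, keyed by actor name.
#[derive(Debug, Deserialize)]
struct ChaosConfig {
    actors: HashMap<String, ActorChaos>,
}

fn main() {
    let cfg: ChaosConfig = toml::from_str(
        r#"
        [actors.worker_1]
        crash_at_ms = 1000
        restart_at_ms = 5000
        msg_loss = 0.1
        "#,
    )
    .expect("valid chaos TOML");
    println!("{cfg:?}");
}
```

Validation (real actor names, sensible time ordering) would then run over the parsed config before any chaos timers are scheduled.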

General fixes:

  • Global chaos config toml
  • Chaos feature flag
  • Single loop chaos request timer (see the timer sketch after this list)
  • Stop chaos endpoint
  • Validation for chaos settings
    • Add a 404 Actor Not Found response
    • Actor names should refer to real actors
    • Time entries should make sense (no chaos start time after an actor crash or before its restart)
    • Figure out race conditions for chaos timers (e.g. msg_loss start_ms at 1000 ms and actor restart at 1000 ms, but the msg_loss request arrives early)
    • Chaos messages from node to actor should have higher priority than regular messages; otherwise, under high load, chaos requests get queued behind regular traffic and no longer accurately reflect the configured start/stop times.
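
A minimal sketch of the single-loop chaos request timer mentioned above, assuming tokio. ChaosEvent and send_chaos_request are stand-ins for whatever the job controller actually uses; anchoring every deadline to one job_start instant is what keeps start times comparable across events.

```rust
use std::time::Duration;

use tokio::time::{sleep_until, Instant};

/// Stand-in for the real chaos event type.
struct ChaosEvent {
    start_ms: u64,
    actor_name: String,
}

/// One loop drives every chaos request, instead of one task per event.
/// All deadlines are measured from the same `job_start` instant.
async fn chaos_timer_loop(job_start: Instant, mut schedule: Vec<ChaosEvent>) {
    // Fire events in chronological order.
    schedule.sort_by_key(|e| e.start_ms);
    for event in schedule {
        let deadline = job_start + Duration::from_millis(event.start_ms);
        sleep_until(deadline).await;
        // Hypothetical sender; the real code would hit the node's chaos endpoint.
        send_chaos_request(&event.actor_name).await;
    }
}

async fn send_chaos_request(actor_name: &str) {
    println!("sending chaos request for {actor_name}");
}

#[tokio::main]
async fn main() {
    let schedule = vec![ChaosEvent { start_ms: 1000, actor_name: "worker_1".into() }];
    chaos_timer_loop(Instant::now(), schedule).await;
}
```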

QoL fixes:

  • User-side tracing shows the file name rather than the actor name. Instead of the user having to put the actor name into log messages, try to have it built in (see the sketch below).
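
One possible way to get the actor name into user-side logs without the user threading it through every message: wrap the actor's future in a tracing span carrying the name. A minimal sketch, assuming the tracing and tracing-subscriber crates; run_actor is a hypothetical entry point, not the real one.

```rust
use tracing::{info, info_span, Instrument};

/// Hypothetical actor entry point; the span attaches the actor name
/// to every log line emitted while the actor runs.
async fn run_actor(actor_name: String) {
    async {
        info!("actor started");
        // ... actor message loop ...
        info!("actor stopping");
    }
    .instrument(info_span!("actor", name = %actor_name))
    .await;
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    run_actor("worker_1".to_string()).await;
}
```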

Some notes/questions to resolve:

  • Chaos timing issues: observing up to 10 s of delay between the node sending a chaos request and the actor receiving it.
  • Should nodes be notified about remote actors being crashed? This will require another node endpoint.
    • For now, nope.
  • Added the Clone trait for job_manager::NodeHandle; is that necessary? A clone of NodeHandle is being made for each actor-crashing async task; can this be avoided? (See the Arc sketch after this list.)
    • Fixed
  • If Ctrl-C is pressed and stop_job() is executed before the background chaos timers are complete, what happens?
    • The whole process exits, so the background tasks are also dropped.
  • Log statements for the job manager would be helpful; so far no logs are visible.
  • The stop_actor node endpoint takes RemoteActorInfo as input, but only the actor name is really required to crash an actor.
  • The rpc-client msg_duplication_request takes an i32 even though it was defined as u32 in node/rpc.rs.
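
On the NodeHandle-clone question above, one option is to put the handle behind an Arc so each crash task only bumps a reference count instead of cloning the whole handle. A minimal sketch, assuming tokio; NodeHandle and crash_actor here are stand-ins for the real job_manager types.

```rust
use std::sync::Arc;

/// Stand-in for job_manager::NodeHandle.
struct NodeHandle {
    addr: String,
}

impl NodeHandle {
    async fn crash_actor(&self, actor_name: &str) {
        println!("{} -> crash {}", self.addr, actor_name);
    }
}

#[tokio::main]
async fn main() {
    let handle = Arc::new(NodeHandle { addr: "10.0.0.1:9000".into() });

    let mut tasks = Vec::new();
    for actor in ["a", "b", "c"] {
        // Arc::clone is a cheap pointer copy, not a deep clone of NodeHandle.
        let handle = Arc::clone(&handle);
        tasks.push(tokio::spawn(async move {
            handle.crash_actor(actor).await;
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}
```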

Design thoughts for actor restarts and timed starts

  • Before stopping actors, we probably need to remove that actor's ChaosEvents from the node's chaos schedule. Does the actor rx being closed handle this implicitly? Is it better not to remove the ChaosEvents and have the node ignore them if the actor is not found?
    • Don't remove ChaosEvents; receive a 404 from the chaos endpoint and ignore it.
  • To restart the actor, we probably also need to save its logical_ops and physical_ops?
    • For now, RemoteActorInfo isn't removed from NodeHandle.actors during crashes, so we can reuse that info.
    • But that isn't enough: we should store SpawnArgs inside NodeHandle somewhere, mapped by actor_name (see the restart sketch after this list).
  • Do we simply call the node's start_actor endpoint, or do we call the place function of the NodeHandle?
    • Just start_actor. The remaining code in place() is for setting up chaos etc., which doesn't change with crashes.
  • Before stopping actors, we probably also have to remove the actor from the node's actor list (or else we can't add it again?)
    • Keep it in job_manager::NodeHandle.actors; it will be needed for restart, maybe.
    • Interestingly, when node::actor_control_loop() updates local_actors for the normal spawning of actors, there is no check for an existing actor of the same name; it just overwrites that hashmap entry. In general this could be bad, since it is undefined behavior and invalid input isn't handled, but it could be useful if we want the user to restart an actor with a different payload/settings. So we should decide what the behavior is for crash and restart.
  • It seems we would need to notify the other nodes that the actor is back up again using the notify_remote_actor_added endpoint? Would we have to create a notify_remote_actor_removed endpoint as well for when an actor is stopped?
    • Regarding a notify_remote_actor_removed endpoint: it seems better not to have it and, for that matter, not to update any node's state about remote actors being crashed. Let nodes keep stale info about remote actors.
    • This way, when other actors try to resolve the addr of an actor, it will succeed, but when they try to send messages, it will fail, and the user can handle that in the second callback, ActorUnreachable (this will allow testing FT, network partitions, etc.).
    • Now we won't need the notify_remote_actor_added endpoint here either.
    • Similarly, we will have another callback for when a node has not been able to resolve the addr of an actor after several tries, CouldntResolve (this should have a default behaviour of retrying for some time).
  • Do we set up the unfinished ChaosEvent again for the restarted actor? (If we had not removed the actor's ChaosEvent, it should still be there.)
    • No; we aren't removing ChaosEvents during a crash, so it is still there.
  • For timed actor starts, we need to decide what the chaos timers are relative to: should the timers start when the actor starts or when the JobController starts?
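
A minimal sketch of the restart bookkeeping discussed in this list: keep SpawnArgs inside NodeHandle keyed by actor name, and restart by replaying them against the node's start_actor endpoint rather than going through place() again. SpawnArgs, remember_spawn, restart_actor, and the start_actor call are illustrative stand-ins, not the real API.

```rust
use std::collections::HashMap;

/// Stand-in for the real spawn arguments (payload, ops, placement, ...).
#[derive(Clone)]
struct SpawnArgs {
    payload: Vec<u8>,
}

/// Stand-in for job_manager::NodeHandle.
struct NodeHandle {
    addr: String,
    /// Saved at first placement, keyed by actor name, so a crashed
    /// actor can be restarted without going through place() again.
    spawn_args: HashMap<String, SpawnArgs>,
}

impl NodeHandle {
    fn remember_spawn(&mut self, actor_name: &str, args: SpawnArgs) {
        self.spawn_args.insert(actor_name.to_string(), args);
    }

    /// Restart by replaying the saved SpawnArgs against the node's
    /// start_actor endpoint; chaos setup from place() is not repeated.
    async fn restart_actor(&self, actor_name: &str) -> Result<(), String> {
        let args = self
            .spawn_args
            .get(actor_name)
            .ok_or_else(|| format!("no SpawnArgs saved for {actor_name}"))?;
        // Hypothetical call to the node's /start_actor endpoint.
        start_actor(&self.addr, actor_name, args).await
    }
}

async fn start_actor(addr: &str, actor_name: &str, _args: &SpawnArgs) -> Result<(), String> {
    println!("POST http://{addr}/start_actor for {actor_name}");
    Ok(())
}

#[tokio::main]
async fn main() {
    let mut node = NodeHandle { addr: "10.0.0.1:9000".into(), spawn_args: HashMap::new() };
    node.remember_spawn("worker_1", SpawnArgs { payload: vec![] });
    node.restart_actor("worker_1").await.unwrap();
}
```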

Untested though.

Also, make the rpc_client generation Makefile code easier to run.
Brian-1402 changed the title from "Chaos features for #41" to "Chaos features" on Aug 21, 2025
Brian-1402 self-assigned this on Aug 21, 2025
Brian-1402 linked an issue on Aug 21, 2025 that may be closed by this pull request
Brian-1402 removed a link to an issue on Aug 21, 2025
Brian-1402 and others added 2 commits August 28, 2025 03:17
ChaosManager at Actor TBD

Co-authored-by: Brian Sajeev <Brian-1402@users.noreply.github.com>
Co-authored-by: Brian Sajeev <Brian-1402@users.noreply.github.com>
    Json(actor_info): Json<RemoteActorInfo>,
) -> impl IntoResponse {
    state
        .clone()
Review comment (Owner):
is cloning required here?

node/src/rpc.rs (outdated)

    post,
    path = "/stop_actor",
    responses(
        (status = 200, description = "Actor stop initiated")
Review comment (Owner):
Also return 404 for actor not found
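
A minimal sketch of what returning 404 from stop_actor could look like, assuming an axum-style handler and a local_actors map on the node state; the request and state shapes here are assumptions, not the actual node code (which currently takes RemoteActorInfo). The handler would be registered on the node's existing router.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use axum::{extract::State, http::StatusCode, response::IntoResponse, Json};
use serde::Deserialize;

/// Stand-in for the real request body; only the name is needed to stop an actor.
#[derive(Deserialize)]
struct StopActorRequest {
    actor_name: String,
}

/// Stand-in for the node state holding locally running actors.
#[derive(Clone)]
struct NodeState {
    local_actors: Arc<Mutex<HashMap<String, ()>>>,
}

async fn stop_actor(
    State(state): State<NodeState>,
    Json(req): Json<StopActorRequest>,
) -> impl IntoResponse {
    let actors = state.local_actors.lock().unwrap();
    if actors.contains_key(&req.actor_name) {
        // Real code would signal the actor's control loop to stop here.
        (StatusCode::OK, "Actor stop initiated")
    } else {
        (StatusCode::NOT_FOUND, "Actor not found")
    }
}
```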

Aman-Hassan and others added 16 commits September 4, 2025 01:01
Separate Chaos Endpoints for cleaner chaos request types

Probability & Factor Toml validation

Co-authored-by: Brian Sajeev <Brian-1402@users.noreply.github.com>
Chaos Refactored into apply/revert types
Also formatting and clippy satisfaction
* added docs workflow

* Update docs.yml

* Update docs.yml #skipci

* Update docs.yml [skip ci]

* Update docs.yml [skip ci]

* added browser based job controller

* updated ui

* formatting

* Update api (#49)

* updated generated api clients

* clippy

* Added #actor macro to register the actor (#50)

* updated generated api clients

* clippy

* updated api to get list of registerd apis

* wip ui

* added how_to.md

* added register actor macro
satyamjay-iitd dismissed their stale review on September 8, 2025:

Will do later.

satyamjay-iitd merged commit 2edd420 into master on Sep 8, 2025.
1 check passed.
satyamjay-iitd deleted the chaos branch on September 8, 2025.