Add adaptive circuit breaker #760

Open
AbdulRahmanAlHamali wants to merge 164 commits into main from pid-take-2

Conversation

@AbdulRahmanAlHamali AbdulRahmanAlHamali commented Sep 23, 2025

Add a new adaptive circuit breaker that:

  1. Works without configuration
  2. Can open partially or fully, depending on the severity of the incident

The circuit breaker has two main components:

Ideal Error Rate Estimator

This estimator tracks the error rate expected from a healthy dependency. It uses simple exponential smoothing (an exponentially weighted moving average) with a few domain-specific adjustments:

  1. Ignore any observations that are obviously too high
  2. Converge slowly towards higher values, and quickly towards lower values
  3. Be more receptive to signals for the first 30 minutes after boot (because we start with a random guess for the value), and less receptive afterwards
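
The three adjustments above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the class name, the `cap` threshold for "obviously too high", the smoothing factors, and the `warmup_boost` multiplier are all hypothetical values chosen for the example.

```ruby
# Hypothetical sketch of an exponentially smoothed "ideal error rate".
# All names and numeric defaults here are assumptions, not the PR's values.
class IdealErrorRateEstimator
  WARMUP_SECONDS = 30 * 60 # "first 30 minutes after boot" from the description

  attr_reader :value

  def initialize(seed: 0.01, cap: 0.1, alpha_up: 0.01, alpha_down: 0.1, warmup_boost: 5.0)
    @value = seed                # initial guess for the healthy error rate
    @cap = cap                   # observations above this are ignored
    @alpha_up = alpha_up         # slow convergence towards higher values
    @alpha_down = alpha_down     # fast convergence towards lower values
    @warmup_boost = warmup_boost # extra receptiveness during warm-up
    @elapsed = 0.0
  end

  # observed: error rate in [0, 1] for the last window
  # dt: seconds since the previous update
  def update(observed, dt)
    @elapsed += dt
    return @value if observed > @cap # adjustment 1: drop obvious outliers

    alpha = observed > @value ? @alpha_up : @alpha_down # adjustment 2
    alpha = [alpha * @warmup_boost, 1.0].min if @elapsed <= WARMUP_SECONDS # adjustment 3
    @value += alpha * (observed - @value)
  end
end
```

With these assumed defaults, an observation of 0.9 leaves the estimate untouched, while an observation of 0.0 pulls it down much faster than an observation slightly above the estimate pulls it up.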

PID (Proportional, Integral, Derivative) Controller

See wikipedia link

This controller increases or decreases the rejection rate of the circuit breaker based on the value of:

kI * Integral + kP * P + kD * derivative

Where P is:

ErrorDelta - (1 - ErrorDelta) * RejectionRate

The intuition behind P:

  • It increases when the error rate increases
  • It decreases when the rejection rate increases
  • The rejection rate has a (1 - ErrorDelta) multiplier, which we call the "defensiveness" multiplier. As the error rate rises, this multiplier shrinks, so the rejection rate can climb more aggressively

Note: The derivative component is currently set to 0. The integral component mostly acts as a history that keeps the circuit breaker from fluctuating too much.
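
To make the formula concrete, here is a minimal sketch of one controller step. It assumes that ErrorDelta is the observed error rate minus the ideal error rate, that the controller output is added to the current rejection rate, and that the integral is an unbounded running sum; the PR may differ on all three (for example, it integrates over a `history_duration` window). `PidSketch` and its method names are hypothetical. The gain defaults mirror the `kp: 1.0, ki: 0.1, kd: 0.0` defaults in the diff.

```ruby
# Hypothetical sketch of a single PID step; not the PR's implementation.
class PidSketch
  attr_reader :rejection_rate

  def initialize(kp: 1.0, ki: 0.1, kd: 0.0)
    @kp = kp
    @ki = ki
    @kd = kd
    @integral = 0.0
    @previous_p = 0.0
    @rejection_rate = 0.0
  end

  # error_rate and ideal_error_rate are fractions in [0, 1]; dt is in seconds.
  def update(error_rate, ideal_error_rate, dt = 1.0)
    error_delta = error_rate - ideal_error_rate # assumption: ErrorDelta definition
    p_value = error_delta - (1.0 - error_delta) * @rejection_rate

    @integral += p_value * dt                   # assumption: unbounded running sum
    derivative = (p_value - @previous_p) / dt
    @previous_p = p_value

    adjustment = @kp * p_value + @ki * @integral + @kd * derivative
    @rejection_rate = (@rejection_rate + adjustment).clamp(0.0, 1.0)
  end
end

pid = PidSketch.new
pid.update(0.51, 0.01) # errors spike well above the ideal rate, so rejection rises
pid.update(0.01, 0.01) # errors return to the ideal rate, so rejection falls
```

Clamping keeps the rejection rate a valid probability even when the controller output overshoots in either direction.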

attr_reader :name, :pid_controller, :ping_thread

def initialize(name:, kp: 1.0, ki: 0.1, kd: 0.0,
window_size: 10, history_duration: 3600,

Regarding the calculation of online mean, Welford's algorithm was discussed as a solution to use constant space
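
For reference, a minimal sketch of Welford's algorithm, which keeps a running mean (and variance) in constant space. The `OnlineMean` class is hypothetical and not part of this PR. Note that it averages over every sample seen so far, so on its own it is not a drop-in replacement for a sliding window.

```ruby
# Hypothetical sketch of Welford's online mean/variance; O(1) memory.
class OnlineMean
  attr_reader :count, :mean

  def initialize
    @count = 0
    @mean = 0.0
    @m2 = 0.0 # running sum of squared deviations from the mean
  end

  # Fold one sample into the running statistics.
  def add(x)
    @count += 1
    delta = x - @mean
    @mean += delta / @count
    @m2 += delta * (x - @mean) # uses the already-updated mean
    self
  end

  # Sample variance (n - 1 denominator); 0.0 until two samples are seen.
  def variance
    @count > 1 ? @m2 / (@count - 1) : 0.0
  end
end

stats = OnlineMean.new
[2, 4, 4, 4, 5, 5, 7, 9].each { |x| stats.add(x) }
stats.mean # mean of the eight samples, without storing them
```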

Also, do we want to allow developers to configure the integration interval (history_duration)?

AbdulRahmanAlHamali and others added 24 commits October 22, 2025 17:08
Update variable names

Fill sliding window with 1 hr worth data

Add comment

Update experiemnt resource to be deterministic

Change deterministic default value to false

Cleanup

Remove unused variable

Make initial seed error rate more customizable

Add seed_error_rate as a property
* Prefilling added

* Change initial duration to 900 s
* testing different circuit breaking scenarios

* adding concurrency

* adds more puts to get further information during phases

* Fixing concurrency, unprotected ping, extras

* update classic sustained test

* cleaning up outdated tests, and testing without ping rate

* modify ki instead of dividing by window size

---------

Co-authored-by: Abdulrahman Alhamali <abdulrahman.alhamali@shopify.com>
AbdulRahmanAlHamali and others added 5 commits January 12, 2026 19:24
* allow using the fiber scheduler if present

* run experiments
…rror-rate-calculation

Fixed redundant ideal error rate calculation. Calculated once and re-used appropriately.
* Added param to only use 'current_window_requests' when needed. Updated 'start_pid_controller_update_thread' to only pass in needed values to 'notify_metrics_update'.

* Removed notifying logic for state transitions.

* Refactor adaptive circuit breaker tests by removing the state transition notification test. Updated metrics expectations to include previous p_value and adjusted test logic for clarity.

* Added updated experiments
@kris-gaudel

Hey guys, hope you had a good start to the new year! Are there any issues I could work on in an open-source capacity on the ACB?

end

def wait_for_window
Kernel.sleep(@sliding_interval)

Suggested change
Kernel.sleep(@sliding_interval)
Kernel.sleep(@sliding_interval * rand(0.9...1.1))

@samuel-williams-shopify

Here are the notes from our tuple:

(A) 1 PIDController
(B) 1 Faraday request

(A) -> sleep(sliding_window_timeout)
(B) -> timeout(t, IO#write -> IO#gets)
(A) -> wakes up, update
(B) -> timeout(t, IO#write -> IO#gets -> response line) -> good

(An) -> sleep(sliding_window_timeout + random_jitter)
(A0) -> wakes up, update -> some CPU time, -> A(N+1) -> time has elapsed
(B) -> timeout(t, IO#write -> IO#gets)
(AN+1) -> wakes up, update # process all of these O(An)
(B) -> timeout(t, IO#write -> IO#gets -> response line) -> times out


Assume sliding window of 1 second
(A0) t+0.1
(A1) t+0.2
(A2) t+0.3
(A0) -> wakes up, update -> some CPU time, -> A(N+1) -> time has elapsed
(A0) t+0.1 -> ready at t+1.0 -> sleep(sliding_window_timeout)
(A1) t+0.2 -> ready at t+1.0 -> sleep(sliding_window_timeout)
(A2) t+0.3 -> ready at t+1.0 -> sleep(sliding_window_timeout)

(1) +random jitter -> always late
(2) +/- random jitter -> gaussian distribution of lateness
(3) + robust loop which computes offset, but you may need to drop sample windows if running late
	(real time stuff)

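DURATION = 1.0

def now
	Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Option (3): fixed-rate loop that sleeps only for the time left in the window.
def robust_loop(duration = DURATION)
	loop do
		started = now
		yield

		# Keep the window length fixed; only the remaining sleep time varies.
		remaining = (started + duration) - now
		if remaining > 0
			sleep(remaining)
		else
			# dropped window: the update overran the interval, so start the next one right away
		end
	end
end

# Option (2): +/- 10% jitter; the actual sleep duration is passed to the update.
def jittered_loop(duration = DURATION)
	loop do
		# uniform distribution, applied once per iteration so the jitter does not compound:
		jittered = duration * rand(0.9..1.1)
		sleep(jittered)

		yield(jittered) # e.g. pid_controller.update(jittered)
	end
end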

AbdulRahmanAlHamali and others added 21 commits January 21, 2026 19:29
* add jitter to sleep

* fix tests and re-run experiments

---------

Co-authored-by: Adrian Gudas <adrian.gudas@shopify.com>
stop thread if the last remaining circuit breaker is destroyed
use a sliding interval of 1 (run through the loop every second)
…hread

using a single thread for all PID controller statuses
* Add dead zone ratio to PID controller for noise suppression

- Introduced `dead_zone_ratio` parameter in the PID controller to suppress noise from small deviations in error rates.
- Updated the `calculate_p_value` method to implement dead zone logic, allowing for more stable control responses.
- Enhanced tests to validate the behavior of the dead zone in various scenarios, ensuring it does not impede recovery while effectively filtering noise.

* Added experiment tests

* Refine dead zone logic in PID controller

- Updated the handling of the dead zone in the PID controller to allow full signal response above the dead zone, improving control accuracy.
- Adjusted comments for clarity regarding the purpose of the dead zone in noise suppression.
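
As a rough illustration of the dead zone described in these commits, the sketch below zeroes out small positive deviations and passes everything else through unchanged. It is a guess at the shape of the logic, not the PR's `calculate_p_value`: the `dead_zone_delta` helper is hypothetical, and defining the threshold as `ideal_error_rate * dead_zone_ratio` is an assumption.

```ruby
# Hypothetical sketch of dead-zone filtering on the error delta.
# Assumption: the dead zone is a fraction of the ideal error rate.
def dead_zone_delta(error_rate, ideal_error_rate, dead_zone_ratio)
  delta = error_rate - ideal_error_rate
  return delta if delta <= 0 # never suppress the recovery signal

  threshold = ideal_error_rate * dead_zone_ratio
  # Inside the dead zone: treat as noise. Above it: full signal, no offset.
  delta <= threshold ? 0.0 : delta
end

dead_zone_delta(0.011, 0.01, 0.5) # small blip above ideal, suppressed
dead_zone_delta(0.05, 0.01, 0.5)  # real incident, passed through in full
```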

* Added experiment results

7 participants