
Stripe rate limit handling #4216

Draft

mayorova wants to merge 5 commits into master from stripe-rate-limit-handling

Conversation


@mayorova mayorova commented Feb 5, 2026

What this PR does / why we need it:

This implementation is a draft that was written by Claude (closely guided by @mayorova). It is intended as a starting point for discussions about how we can solve the issue (https://issues.redhat.com/browse/THREESCALE-4086), and potentially also to open a discussion about refactoring Billing.

As can be clearly seen, our billing process is quite complicated:

  • Many classes and modules are involved, and they keep passing control to each other; the interactions are often back-and-forth, not very intuitive, and it's hard to follow the line of execution.
  • Error handling is not very clear either: errors sometimes get re-raised, sometimes "swallowed", and it's not easy to understand what the implications of a raised exception are.
  • There are a number of "TODO" comments asking for a refactor, e.g. Invoice#charge!.
  • The process uses the sidekiq-batch gem, which is an open-source alternative to Sidekiq's commercial (paid) Batch feature. In our experience, it has some issues, for example those explained in THREESCALE-8124.
  • The whole process is not very intuitive (I'd even say counter-intuitive). For example, the main process happens in the Finance::BillingStrategy.daily method, which looks like it is intended to iterate over a list of billing strategies (i.e. provider accounts) and, inside each strategy, execute the daily process for a list of buyer accounts. However, in practice this method is typically called by the BillingWorker job that runs for a single provider and a single buyer.

In general, I think the whole billing process could be refactored to make it significantly simpler and more predictable. We could do it with Stripe rate limits in mind too, to make the implementation more straightforward, and maybe even implement some kind of client-side rate limiting (to avoid the error in the first place), rather than reacting to the error and retrying.

However, of course this is a very sensitive piece of logic, and it might be dangerous to modify it significantly (as we know it has been working quite reliably for ages).

So, let's talk 😉

Which issue(s) this PR fixes

https://issues.redhat.com/browse/THREESCALE-4086

Verification steps

Special notes for your reviewer:

  • Check out the doc STRIPE_RATE_LIMIT_HANDLING.md that Claude created explaining the implementation in detail.
  • Also, take a look at the diagrams Claude has drawn under docs/.
  • For testing the implementation (or reproducing the original issue) in development mode, instead of mocking the Stripe server, I just hard-coded a simple change in the activemerchant gem. Before this line I placed
        return {"error"=>{"message"=>"Request rate limit exceeded. Learn more about rate limits here https://stripe.com/docs/rate-limits.", "type"=>"invalid_request_error", "code"=>"rate_limit", "doc_url"=>"https://stripe.com/docs/error-codes/rate-limit"}, "response_headers"=>{}} if parameters[:amount].to_i >= 10000

So invoices with a total of 100 or more (>= 10000 cents) will trigger this error, while the "cheaper" ones should pass successfully. Beware of the order of the invoices though - if the first invoice fails, the process will not continue.

I also use these steps to prepare my Rails console for actual test:

  1. Prepare
# Prerequisites:
# - set up Stripe payment gateway configuration for the provider
# - create a buyer account (under the provider) and set up payment details (credit card and billing address) in the dev portal

buyer = Account.find ID

redis_key = "lock:billing:#{buyer.id}"
redis = System::RedisClientPool.default

def create_invoice(buyer, cost)
  new_invoice = buyer.provider_account.billing_strategy.create_invoice!(buyer_account: buyer, period: Month.new(Time.zone.now))
  billing = Finance::AdminBilling.new(new_invoice)
  line_item_params = {name: 'test', description: 'test description', quantity: 1, cost: cost}
  billing.create_line_item(line_item_params)
  new_invoice.issue
  new_invoice.update_column(:due_on, new_invoice.issued_on)
end
  2. Create invoices
create_invoice(buyer, "10.00")
create_invoice(buyer, "120.00")
  3. Remove the lock and run the job
redis.call("DEL", redis_key)

BillingWorker.perform_async(buyer.id, buyer.provider_account_id, Time.zone.now.to_fs(:iso8601))

@jlledom jlledom left a comment

I didn't review the tests but left some comments on the rest.

Basically the idea here is to differentiate between errors and warnings. I think it's fine to do so, but it could be done in a simpler way:

  1. Payment transaction: the place to create the Rate Limit exception.
  2. Everything in between: just let the exception bubble.
  3. Billing service: this is the place to check the exception and choose a different path depending on whether it's an error or a warning.
    • If warning: log, report, release lock
    • If error: same path as now
  4. About the lock: just add a release method to the logic that already exists: Synchronization::NowaitLockService.

id = billing_strategy.id
buyer_ids = options[:buyer_ids]

# Note: We don't know which specific buyer hit the rate limit, only which buyers were being processed
Contributor

Why not add the buyer or invoice id as exception attributes? That way we can track who failed.

Contributor

It would be helpful. Perhaps ideally we use the original exception error message also.

Contributor

Makes sense, although in practice we always supply a single buyer id when calling this :)

Comment on lines +133 to +135
# Check for rate limit in error message (common patterns)
message = response.message.to_s.downcase
message.include?('rate limit') || message.include?('too many requests') || message.include?('429')
Contributor

Isn't returning 429 enough to detect a rate limit error? Are there scenarios where a rate limit error doesn't return 429?

Contributor

Looks like Claude decided it makes sense. Might be worth checking the Stripe docs for any errors that would mandate a retry and how to detect them.


Contributor Author

As I explained in this comment: while Stripe does return a 429 status code, and that should be enough, the status code is lost when ActiveMerchant transforms the response, so we need to rely on the error code rate_limit.

Comment on lines +30 to +37
acquire_lock
call
rescue Finance::Payment::RateLimitError => error
# Rate limit errors should retry immediately via Sidekiq
# Release the lock so retries can proceed without waiting 1 hour
release_lock
report_error(error)
raise error
Contributor

So the with_lock method is dead code now, right?

As is Synchronization::NowaitLockService.

Comment on lines +74 to +95
def lock_key
@lock_key ||= "lock:billing:#{account_id}"
end

def lock_manager
@lock_manager ||= Redlock::Client.new([System::RedisClientPool.default], { retry_count: 0, redis_timeout: 1 })
end

def acquire_lock
# Acquire lock for 1 hour
# Normally we don't release it, but for rate limits we do (see rescue block)
@lock_info = lock_manager.lock(lock_key, 1.hour.in_milliseconds)
raise LockBillingError, "Concurrent billing job already running for account #{account_id}" unless @lock_info
end

def release_lock
# Only called on rate limit errors to allow immediate retry
lock_manager.unlock(@lock_info) if @lock_info
@lock_info = nil
rescue => e
Rails.logger.warn("Failed to release billing lock for account #{account_id}: #{e.message}")
end
Contributor

Instead of this, wouldn't it be easier to add a release method to Synchronization::NowaitLockService ?

Contributor

The reason to keep a 1-hour lock that is not manually released (the timeout is waited out instead) was that we were hitting an issue where we observed double scheduling of the same providers. So it's kind of a second line of defense in case such an issue is somehow reintroduced.

I would prefer to keep that, and instead reschedule the jobs that we want to retry soon with a random delay in the 1 hour to 1.5 hours range.

But it is also acceptable to add the release to the with_lock method.
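The rescheduling-with-jitter idea above could look something like this minimal sketch. The helper name is hypothetical, not code from this PR; only the "1 hour to 1.5 hours" range comes from the comment:

```ruby
# Hypothetical helper: pick a retry delay uniformly in the
# [1 hour, 1.5 hours] range, so the retried job always lands
# after the 1-hour billing lock TTL has expired.
def billing_retry_delay(base_seconds: 3600, jitter_seconds: 1800)
  base_seconds + rand(0..jitter_seconds)
end

# The worker would then be rescheduled with something like:
#   BillingWorker.perform_in(billing_retry_delay, buyer.id, provider.id, time)
```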

@jlledom jlledom Feb 10, 2026

Yeah, I know. What I mean is that Claude added a new parallel locking backend here, which makes both with_lock and Synchronization::NowaitLockService dead code. And I don't see any advantage over the previous service other than having a release method. If that's the point, it's simpler to add a release method to Synchronization::NowaitLockService so we can reuse what we have.

Contributor Author

Instead of this, wouldn't it be easier to add a release method to Synchronization::NowaitLockService ?

Yeah, the reason I left it like that is that NowaitLockService uses the Service pattern, and the idea is that it only exposes a single public method - call. I agree that we need a service with acquire and release methods (or something like that). I'd also probably have a specific BillingLock service, so that we have the billing: prefix owned by it.

# Rate limit errors should retry immediately via Sidekiq
# Release the lock so retries can proceed without waiting 1 hour
release_lock
report_error(error)
Contributor

It's already reported from the strategy #call! method, right?

Contributor Author

I removed (or I think I did) all reports and all log printing except here in the BillingService.

System::ErrorReporting.report_error(e, :error_message => message,
:error_class => 'RateLimitError',
:parameters => { billing_strategy_id: id, buyer_ids: buyer_ids })

Contributor

I think it would be easier to just raise all the way up the stack until reaching the service; no need for all the logic here, I think.

Contributor

This makes sense. I didn't track the whole chain carefully, but my intuition was similar: there are too many levels at which we handle the exception.

IMO, ideally we would introduce a separate worker just for payment gateway processing. That worker may have its own locking and throttling rules. Basically, schedule invoice charging and finish the billing job. Then all retrying and throttling can be reasoned about more easily.

Contributor Author

IMO ideally we would introduce a separate worker just for payment gateway processing. That may have its own locking and throttling rules. Basically scheduling invoice charging and finish the billing job. Then all retry and throttling can be more easily reasoned about.

Yeah, that would make sense. As I mentioned in the PR description:

In general, I think the whole billing process could be refactored to make it significantly simpler and more predictable. We could do it also with Stripe rate limits in mind, to make the implementation more straightforward, and maybe even implement some kind of client-side rate limit (to avoid the error in the first place), rather than reacting to the error and re-try.

It's just that currently the approach is to group all billing jobs by provider, using batches, and there are some callbacks executed at the end of processing for each provider. If we have sidekiq jobs at invoice level, we might lose that ability, and it might become more complicated to track whether the provider's billing as a whole was successful or not.

Having said that, I am not a fan of this batch library either (I think it's quite flaky and doesn't really bring much value, IMO), and probably refactoring would be good. But I am not sure I would like to tackle it at this point in time 😬

Contributor

Totally agree on getting rid of sidekiq-batch if possible.

Contributor

I would bet getting rid of sidekiq-batch is possible, because I did it for the background deletion part.

The main point is to see what the batches are used for, if anything. Like any hooks.

On the other hand, the reported issues with sidekiq-batch are mostly related to usage with ActiveJob-type jobs. If used with native Sidekiq jobs/workers, they should work properly... except for the non-expiring Redis keys we have to clean regularly 😬

Comment on lines +455 to +457
# Rate limit errors should bubble up to Sidekiq for immediate retry with exponential backoff
# Don't treat these as payment failures - they're temporary gateway issues
logger.warn("Rate limit error for invoice #{id} (buyer #{buyer_account_id}) - will retry via Sidekiq: #{e.message}")
Contributor

I don't think we need to log here, just logging and reporting once in the billing service would be enough.

Contributor

We can also use this location to reschedule billing for this provider and this buyer after a random interval. @mayorova, just another low-key approach.

Contributor Author

Without using sidekiq's mechanism?

Don't know... I think scheduling a billing job should not be inside Invoice model 🫠

Contributor

In general, doing the charging from within the model does not seem right. It should be done from a service. We can have a convenience method Invoice#charge! that ensures the invoice is chargeable, but it should offload the charging to a service that handles errors and retries.

So unless we are refactoring how things are done, I'm more in favor of adding rescheduling wherever it is easiest right now rather than handling this specific error (rate limit) on 5 levels, which will NOT make it easier to refactor later or make the code more readable.

Basically I'm for having a convenient small hack somewhere OR some refactoring that would bring us at least a step in the right direction.

But wait - in invoice.rb we don't need this rescue block at all. It performs needless logging and then raises the same exception. So my suggestion applies one layer up, or the next layer up, wherever it seems most appropriate. Something like @jlledom already suggested, I believe :)

Contributor Author

Sounds reasonable.


# Check for rate limit errors and raise immediately for Sidekiq retry
if rate_limit_error?(response)
logger.warn("Rate limit detected (429) for PaymentTransaction - will retry with backoff")
Contributor

Same, no need to log here.

# Check for rate limit errors and raise immediately for Sidekiq retry
if rate_limit_error?(response)
logger.warn("Rate limit detected (429) for PaymentTransaction - will retry with backoff")
raise Finance::Payment::RateLimitError.new(response)
Contributor

Do we have the invoice info here? It would be useful to add it to the exception.


rescue Finance::Payment::RateLimitError => error
# Rate limit errors should retry immediately via Sidekiq
# Release the lock so retries can proceed without waiting 1 hour
release_lock
Contributor

If we don't use the block mode of with_lock, the release should happen in an ensure block. Although we deliberately leave it to the timeout value to avoid spurious attempts.

So here you want to release the lock only in case of a rate limit?

My main question is: why didn't the normal retry take place previously? Or was it taking place?

From this change I conclude that the normal retry was too fast for the 1-hour timeout. Maybe we should just adjust the retry to happen after the standard timeout? Just questions; maybe this approach makes sense.

Contributor Author

From this change I make the conclusion that the normal retry was too fast for the 1 hour timeout. Maybe we should just adjust the retry to be after the standard timeout? Just questions, maybe this approach makes sense.

So, the story is that the "normal retry" (handled by Sidekiq) was never actually happening.

The reason is that the exception was being swallowed in the rescue block and never re-raised - only in tests:

raise if Rails.env.test?

So, in case of rate limit errors, the invoice was just marked as "failed", and the next attempt to charge it would happen only in 3 days, during the daily billing:

# the invoice has to be due and at least 3 days later than the last
# automatic charging date to be automatically chargeable
scope :chargeable, ->(now) {
  where.has do
    ((state == 'unpaid') | (state == 'pending')) &
      (due_on <= now) &
      ((last_charging_retry == nil) | (last_charging_retry <= (now - 3.days)))
  end
}

# Don't treat these as payment failures - they're temporary gateway issues
logger.warn("Rate limit error for invoice #{id} (buyer #{buyer_account_id}) - will retry via Sidekiq: #{e.message}")
raise e
rescue Finance::Payment::CreditCardError, ActiveMerchant::ActiveMerchantError
Contributor

What if we add the exception here, in this block? Then it would be subject to retries.

@mayorova mayorova force-pushed the stripe-rate-limit-handling branch from 4ee1be0 to 192c774 on February 23, 2026 at 17:05
def rate_limit_error?(response)
  return false if response.success?

  response.params.dig("error", "code") == 'rate_limit'
Contributor Author

I repeated the tests, and while the Stripe API does return a 429 status code, the status code is "swallowed" by the ActiveMerchant code, so we don't have the HTTP status code at this point. See https://github.com/activemerchant/active_merchant/blob/v1.137.0/lib/active_merchant/billing/gateways/stripe.rb#L704-L710

So, the error code seems to be the way to detect this.

As this is, of course, specific to the gateway (Stripe in this case), I moved this detection here, and I also decided to make the exception gateway-specific too: Finance::Payment::StripeRateLimitError.

Contributor

Fine by me; just a nitpick: maybe the rate_limit literal could be a constant inside Finance::Payment::StripeRateLimitError.
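That nitpick could be sketched like this. This is only an illustration: the constant name is an assumption, and the detection helper is simplified to take a params hash rather than the full ActiveMerchant response:

```ruby
module Finance
  module Payment
    class StripeRateLimitError < StandardError
      # Error code Stripe returns for rate limiting, surfaced by
      # ActiveMerchant in response.params["error"]["code"].
      STRIPE_ERROR_CODE = 'rate_limit'
    end
  end
end

# Detection then compares against the constant instead of a literal.
# `params` stands in for the params of an already-failed response:
def rate_limit_error?(params)
  params.dig("error", "code") == Finance::Payment::StripeRateLimitError::STRIPE_ERROR_CODE
end
```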

@mayorova (Contributor Author)

Basically the idea here is to differentiate between errors and warnings. I think it's fine to do so, but it could be done in a simpler way:

  1. Payment transaction, the place to create the Rate Limit exception

I refactored the initial code, and now the exception is raised in the StripeChargeService. That seemed more logical to me, because this is something gateway-specific.
Well, other gateways can probably also have some rate limit errors, but we do not have precedents so far.

  2. Everything in between: just let the exception bubble

Right, but we still need to rescue and re-raise, because otherwise the exception will get processed as another type at each stage of processing, and we need to prevent that.

  3. Billing service: this is the place to check the exception and choose a different path depending on whether it's an error or a warning.

    • If warning: log, report, release lock
    • If error: same path as now

Not sure what you mean by "error" or "warning". Currently the exception is bubbled up to the BillingWorker and handled in sidekiq_retry_in - basically just returning a nil value from the block, so that the built-in Sidekiq exponential backoff kicks in and retries the job.
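A minimal sketch of that dispatch. Only the nil-for-rate-limit behaviour comes from this PR; the stand-in error class and the fixed delay in the else branch are illustrations:

```ruby
# Stand-in for Finance::Payment::StripeRateLimitError in this sketch.
RateLimitError = Class.new(StandardError)

# Returning nil from a sidekiq_retry_in block tells Sidekiq to fall
# back to its default exponential backoff for that retry.
RETRY_IN = proc do |_retry_count, exception|
  if exception.is_a?(RateLimitError)
    nil # nil => Sidekiq's built-in exponential backoff
  else
    600 # hypothetical fixed 10-minute delay for other errors (illustration only)
  end
end

# Inside the worker this would be wired up as:
#   sidekiq_retry_in(&RETRY_IN)
```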

  4. About lock: just add a release method to the logic already existing: Synchronization::NowaitLockService.

I have added Synchronization::BillingLockService, which has lock and unlock methods.

Why not add the buyer or invoice id as exception attributes? That way we can track who failed.

I added payment_metadata to the StripeRateLimitError, which includes

payment_metadata = {
  invoice_id: @invoice&.id,
  buyer_id: @invoice&.buyer_account&.id,
  payment_method_id: @payment_method_id,
  gateway_options: @gateway_options
}

I just need to find a good place to use/print it.

@jlledom jlledom left a comment

This looks good to me, just a few comments.

I again didn't review the tests. @mayorova you said they are not ready, right?


Comment on lines +92 to +95
rescue Finance::Payment::StripeRateLimitError => e
# Rate limit errors should bubble up to Sidekiq for immediate retry with exponential backoff
# Don't treat these as payment failures - they are temporary gateway issues
raise e
Contributor

What if, instead of having to reference StripeRateLimitError on several levels, you make StripeRateLimitError inherit from something new like Finance::Payment::TemporaryError and rescue that? This way we can reuse this structure in the future if some other temporary error happens, also in other gateways. As long as the error inherits from TemporaryError, billing will be retried.
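Hedged sketch of that hierarchy. Class names other than StripeRateLimitError (which the PR already defines) are assumptions, and the constructor signature is simplified relative to the PR's (response, payment_metadata) version:

```ruby
module Finance
  module Payment
    # Gateway-agnostic base class: anything transient that should be
    # retried rather than treated as a payment failure.
    class TemporaryError < StandardError; end

    # Stripe-specific rate limit error, now just one kind of TemporaryError.
    class StripeRateLimitError < TemporaryError
      attr_reader :payment_metadata

      def initialize(message, payment_metadata = {})
        @payment_metadata = payment_metadata
        super(message)
      end
    end
  end
end

# Intermediate layers then rescue the base class once:
#   rescue Finance::Payment::TemporaryError => e
#     raise e # bubble up so Sidekiq retries with backoff
```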

CreditCardPurchaseFailed = Class.new(GatewayError)

# Rate limit error - should be retried immediately, not treated as payment failure
class StripeRateLimitError < ActiveMerchant::ActiveMerchantError
Contributor

Even though this only happens with Stripe, I think it's better not to make it Stripe-specific, because the concept of rate limiting is more general. Also, I don't see anything in this error definition that would force limiting it to Stripe.

@@ -0,0 +1,31 @@
# frozen_string_literal: true

class Synchronization::BillingLockService < Synchronization::NowaitLockService
Contributor

Why split the logic into two classes? Couldn't it be just one?

gateway_options: @gateway_options
}
raise Finance::Payment::StripeRateLimitError.new(response, payment_metadata)
end
Contributor

I don't like the idea that we generate one exception here and others in another place.

If service has to generate exceptions, it has to do it for any failure. And if it is not supposed to generate exceptions, then it should not generate any.

In this case it feels as if the "contract" is not to generate exceptions.

In other languages like Java, whether a method raises (throws) or not is part of the definition of the method.

Looking at the whole charging implementation though, it is rather convoluted. I think we should either:

  • implement all gateways to return/raise based on a clear classification of temporary, non-temporary, and success conditions
  • just implement the classification for all gateways in BillingService#call! and then reschedule as desired
  • be lazy and keep everything as is, but enable the Stripe SDK retry, which takes care of the back-off time as well as idempotency of the requests: Stripe.max_network_retries = 2, see https://rubydoc.info/github/stripe/stripe-ruby. The mere use of idempotency keys is a huge win; otherwise we should make sure to use idempotency keys anyway, to avoid double charging of the same payment.

My personal take would be to enable Stripe SDK native retries in the first place and do some refactor of how we classify errors in the future if needed.

My second preference would be to do both (which doesn't preclude enabling native SDK retries):

  • implement idempotency keys (this could probably be simply the hashed invoice id, so any payment attempts for that invoice would use the same idempotency key, reliably preventing double charging for the same invoice)
  • classify the errors within BillingService#call! and decide whether to schedule an immediate retry or leave it for the future billing cycles

Last preference of mine is to classify the errors in the upper layers - as now but with more prominent structure that accounts for braintree and possible future gateways. But this will still require the final decision to be taken within BillingService#call! so I'm not sure how it will reduce complexity.

P.S. I understand that it is very complicated to figure out the least shitty approach here short of heavily refactoring everything. That's why I prefer the fewest changes, or at least to be as compact as possible. I'm not trying to complain about everything - all approaches have pros and cons. The above is just what I find most sensible, but I'm very open to changing my mind.
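The hashed-invoice-id idempotency key mentioned above could be sketched like this. The key format and where it gets passed are assumptions; only Stripe.max_network_retries comes from the stripe gem docs linked earlier:

```ruby
require 'digest'

# Derive a stable idempotency key from the invoice id: every charge
# attempt for the same invoice reuses the same key, so Stripe can
# deduplicate the request and double charging is prevented.
def idempotency_key_for(invoice_id)
  "invoice-charge-#{Digest::SHA256.hexdigest(invoice_id.to_s)}"
end

# With the stripe gem, native retries (with backoff) are enabled globally:
#   Stripe.max_network_retries = 2
# and the key would be passed per request, e.g.:
#   Stripe::Charge.create(params, idempotency_key: idempotency_key_for(invoice.id))
```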
