Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
463 changes: 463 additions & 0 deletions STRIPE_RATE_LIMIT_HANDLING.md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is useful for reviewing but shouldn't be merged.

Large diffs are not rendered by default.

13 changes: 13 additions & 0 deletions app/lib/finance/payment.rb
Original file line number Diff line number Diff line change
Expand Up @@ -18,5 +18,18 @@ def message

CreditCardPurchaseFailed = Class.new(GatewayError)

# Rate limit error - should be retried immediately, not treated as payment failure
class StripeRateLimitError < ActiveMerchant::ActiveMerchantError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when this only happens in Stripe, I think it's better to not make it stripe specific, because the concept of rate limiting is more general. Also I don't see anything in this error definition that would force to limit it to Stripe.

attr_reader :response, :payment_metadata

def initialize(response, payment_metadata)
@response = response
@payment_metadata = payment_metadata
end

def message
response.try!(:message) || 'Rate limit exceeded - too many requests to payment gateway'
end
end
end
end
8 changes: 8 additions & 0 deletions app/models/finance/billing_strategy.rb
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,10 @@ def self.daily(options = {})
results.start(billing_strategy)
ignoring_find_each_scope { billing_strategy.daily(now: now, buyer_ids: options[:buyer_ids], skip_notifications: skip_notifications) }
results.success(billing_strategy)
rescue Finance::Payment::StripeRateLimitError => e
# Rate limit errors should bubble up to Sidekiq for immediate retry with exponential backoff
# Don't treat these as payment failures - they are temporary gateway issues
raise e
Comment on lines +92 to +95
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if, instead of having to reference StripeRateLimitError on several levels, you make StripeRateLimitError inherit from something new like Finance::Payment::TemporaryError and rescue that? This way we can reuse this structure in the future if some other temporary error happens, also in other gateways. As long as the error inherits from TemporaryError, billing will be retried.

rescue => e
results.failure(billing_strategy)
name = billing_strategy.provider.try!(:name)
Expand Down Expand Up @@ -352,6 +356,10 @@ def bill_and_charge_each(options = {})
buyer_accounts.find_each(:batch_size => 20) do |buyer|
begin
ignoring_find_each_scope { yield(buyer) }
rescue Finance::Payment::StripeRateLimitError => exception
# Rate limit errors should bubble up to trigger Sidekiq retry with exponential backoff
# Do NOT catch and swallow - let the job fail and retry immediately
raise exception
rescue => exception
name = buyer.name
buyer_id = buyer.id
Expand Down
4 changes: 4 additions & 0 deletions app/models/invoice.rb
Original file line number Diff line number Diff line change
Expand Up @@ -451,6 +451,10 @@ def charge!(automatic = true)
logger.info("Invoice #{id} (buyer #{buyer_account_id}) was not charged")
false
end
rescue Finance::Payment::StripeRateLimitError => e
# Rate limit errors should bubble up to Sidekiq for immediate retry with exponential backoff
# Don't treat these as payment failures - they are temporary gateway issues
raise e
rescue Finance::Payment::CreditCardError, ActiveMerchant::ActiveMerchantError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we add the exception here, in this block? then it will be subject to retries?

provider.billing_strategy&.error("Error when charging invoice #{id} (buyer #{buyer_account_id})", buyer)

Expand Down
19 changes: 10 additions & 9 deletions app/services/finance/billing_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -16,18 +16,26 @@ def self.call(account_id, options = {})
new(account_id, options).call
end

attr_reader :account_id, :provider_account_id, :now, :skip_notifications
attr_reader :account_id, :provider_account_id, :now, :skip_notifications, :lock_service

# @param account_id [Integer] either a provider account, or a buyer account with :provider_account_id in options
def initialize(account_id, options = {})
@account_id = account_id
@provider_account_id = options[:provider_account_id] || account_id
@now = Time.zone.parse(options[:now].to_s) || Time.zone.now
@skip_notifications = options[:skip_notifications]
@lock_service = Synchronization::BillingLockService.new(account_id.to_s)
end

def call!
with_lock { call }
lock_service.lock
call
rescue Finance::Payment::StripeRateLimitError => error
# Rate limit errors should retry immediately via Sidekiq
# Release the lock so retries can proceed without waiting 1 hour
lock_service.unlock
report_error(error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already reported from the strategy #call! method, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed (or I think I did) all reports and all log printing except here in the BillingService.

raise error
rescue LockBillingError, SpuriousBillingError => error
report_error(error)
nil
Expand Down Expand Up @@ -64,13 +72,6 @@ def billing_options
options
end

def with_lock
# intentionally skip unlocking, no further billing of account within 1 hour allowed
raise LockBillingError, "Concurrent billing job already running for account #{account_id}" unless Synchronization::NowaitLockService.call("billing:#{account_id}", timeout: 1.hour.in_milliseconds).result

yield
end

def report_error(error)
message = "Failed to perform billing job: #{error.message}"
System::ErrorReporting.report_error(error, error_message: message,
Expand Down
29 changes: 28 additions & 1 deletion app/services/finance/stripe_charge_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,19 @@ def initialize(gateway, payment_method_id:, invoice: nil, gateway_options: {})

def charge(amount)
payment_intent = latest_pending_payment_intent
payment_intent ? confirm_payment_intent(payment_intent) : create_payment_intent(amount)
response = payment_intent ? confirm_payment_intent(payment_intent) : create_payment_intent(amount)

if rate_limit_error? response
payment_metadata = {
invoice_id: @invoice&.id,
buyer_id: @invoice&.buyer_account&.id,
payment_method_id: @payment_method_id,
gateway_options: @gateway_options
}
raise Finance::Payment::StripeRateLimitError.new(response, payment_metadata)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea that one exception we generate here and others in another place.

If service has to generate exceptions, it has to do it for any failure. And if it is not supposed to generate exceptions, then it should not generate any.

In this case it feels as if the "contract" is not to generate exceptions.

In other languages like Java, whether a method raises (throws) or not is part of the definition of the method.

Looking at the whole charging implementation though, it is rather convoluted. I think we should either:

  • implement all gateways to return/raise based on clear lassification of temporary, non temporary and success conditions
  • just implement classification of all gateways in BillingService#call! and then reschedule as desired
  • be lazy, keep everything as is but enable stripe SDK retry which takes care of back-off time as well idempotency of the requests. Stripe.max_network_retries = 2, see https://rubydoc.info/github/stripe/stripe-ruby . The mere use of idempotency keys is a huge win, otherwise we should make sure to use idempotency keys anyways to avoid double charging of the same payment.

My personal take would be to enable Stripe SDK native retries in the first place and do some refactor of how we classify errors in the future if needed.

My second preference would be to both (which doesn't preclude enabling native SDK retries):

  • implement idemotency keys (this could probably be simply the hashed invoice id so any payment attempts to that invoice would use the same idempotency key, preventing double charging for the same invoice reliably
  • classify the errors within BillingService#call! and decide whether to schedule an immediate retry or leave it for the future billing cycles

Last preference of mine is to classify the errors in the upper layers - as now but with more prominent structure that accounts for braintree and possible future gateways. But this will still require the final decision to be taken within BillingService#call! so I'm not sure how it will reduce complexity.

P.S. I understand that it is very complicated to figure out the least shitty approach here besides heavily refactoring everything. That's why I prefer the least changes or at least to be as compact as possible. Not trying to complain for everything although all approaches have pros and cons. The above is just what I find most sensible but I'm very open to change my mind.


response
end

protected
Expand Down Expand Up @@ -86,4 +98,19 @@ def default_charge_description

"#{invoice.from.name} #{PAYMENT_DESCRIPTION} #{invoice.friendly_id}".strip
end

# Response body in case of Rate Limit error:
# {
# "error": {
# "message": "Request rate limit exceeded. Learn more about rate limits here https://stripe.com/docs/rate-limits.",
# "type": "invalid_request_error",
# "code": "rate_limit",
# "doc_url": "https://stripe.com/docs/error-codes/rate-limit"
# }
# }
def rate_limit_error?(response)
return false if response.success?

response.params.dig("error","code") == 'rate_limit'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I repeated the tests, and while the Stripe API does return 429 status code, however the status code is "swallowed" by ActiveMerchant code, so we don't have the information on the HTTP status code in this point. See https://github.com/activemerchant/active_merchant/blob/v1.137.0/lib/active_merchant/billing/gateways/stripe.rb#L704-L710

So, the error.code seems to be the way to detect this.

As this is, of course, specific to the gateway (Stripe in this case), I moved this detection here, and I also decided to make the exception gatway-specific too Finance::Payment::StripeRateLimitError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for me, just a nitpick, maybe the rate_limit literal could be a constant inside Finance::Payment::StripeRateLimitError.

end
end
31 changes: 31 additions & 0 deletions app/services/synchronization/billing_lock_service.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# frozen_string_literal: true

class Synchronization::BillingLockService < Synchronization::NowaitLockService
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why splitting the logic into two classes? couldn't it be just one?


# Default timeout for billing
DEFAULT_TIMEOUT = 1.hour.in_milliseconds

# @param account_id [String] provider account ID for locking
# @param timeout [Integer] milliseconds lock timeout
# @yield [] if lock is acquired, then execute block without parameters and ensure lock is released afterwards
def initialize(account_id, timeout: DEFAULT_TIMEOUT, &block)
self.account_id = account_id

resource = "billing:#{account_id}"
super(resource, timeout: timeout, &block)
end

attr_accessor :lock_info, :account_id

def lock
@lock_info = manager.lock(lock_key, timeout)
raise Finance::LockBillingError, "Concurrent billing job already running for account #{account_id}" unless @lock_info
end

def unlock
manager.unlock(@lock_info) if @lock_info
@lock_info = nil
rescue => e
Rails.logger.warn("Failed to release billing lock for account #{account_id}: #{e.message}")
end
end
8 changes: 6 additions & 2 deletions app/services/synchronization/nowait_lock_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -12,15 +12,19 @@ def initialize(str_resource, timeout:, &block)

# @return [Hash, NilClass, TrueClass, FalseClass] without block returns lock_info or nil, otherwise a boolean signifying whether it ran
def call
manager.lock("lock:#{resource}", timeout, &block)
manager.lock(lock_key, timeout, &block)
end

private

attr_accessor :resource, :block, :timeout

def lock_key
@lock_key ||= "lock:#{resource}"
end

def manager
# we may cache this as thread/fiber local variable but for now creating a new one seems good enough
Redlock::Client.new([System::RedisClientPool.default], { retry_count: 0, redis_timeout: 1 })
@manager ||= Redlock::Client.new([System::RedisClientPool.default], { retry_count: 0, redis_timeout: 1 })
end
end
17 changes: 12 additions & 5 deletions app/workers/billing_worker.rb
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,18 @@
class BillingWorker
include Sidekiq::Job

sidekiq_options queue: :billing, retry: 3

sidekiq_retry_in do |_count|
# after lock has been released
1.hours + 10
sidekiq_options queue: :billing, retry: 5

sidekiq_retry_in do |count, exception|
case exception
when Finance::Payment::StripeRateLimitError
# Use default exponential backoff delays that Sidekiq implements by returning nil
# see https://github.com/sidekiq/sidekiq/blob/v7.3.2/lib/sidekiq/job_retry.rb#L219-L220
nil
else
# For other errors: wait for lock to be released (original behavior)
(1.hours + 10).to_i
end
end

class Callback
Expand Down
158 changes: 158 additions & 0 deletions docs/BILLING_DIAGRAMS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,158 @@
# Billing Flow Diagrams

This directory contains Mermaid sequence diagrams documenting the billing process.

## Diagrams

1. **`billing-flow-happy-path.mmd`** - Normal successful billing flow
2. **`billing-flow-rate-limit.mmd`** - Rate limit error handling with immediate retry
3. **`billing-flow-payment-error.mmd`** - Payment error handling (card declined)

## Viewing Diagrams

### Option 1: GitHub/GitLab (Easiest)

GitHub and GitLab render `.mmd` files automatically. Just view the files in your browser:
- Click on `billing-flow-happy-path.mmd` in GitHub
- The diagram renders automatically

### Option 2: VS Code (Best for Editing)

Install the **Mermaid Preview** extension:

```bash
# Install VS Code extension
code --install-extension bierner.markdown-mermaid
```

Then:
1. Open any `.mmd` file in VS Code
2. Right-click → "Open Preview"
3. Or use `Ctrl+Shift+V` to preview

### Option 3: Command Line (Linux)

Install Mermaid CLI:

```bash
# Using npm (if you have Node.js installed)
npm install -g @mermaid-js/mermaid-cli

# Or using yarn
yarn global add @mermaid-js/mermaid-cli
```

Render to PNG/SVG:

```bash
# Generate PNG
mmdc -i billing-flow-happy-path.mmd -o billing-flow-happy-path.png

# Generate SVG (better for zoom/print)
mmdc -i billing-flow-happy-path.mmd -o billing-flow-happy-path.svg

# Generate all diagrams
for file in billing-flow-*.mmd; do
mmdc -i "$file" -o "${file%.mmd}.png"
done
```

### Option 4: Online Editor (No Installation)

Use the official Mermaid Live Editor:
- https://mermaid.live/

1. Copy the contents of any `.mmd` file
2. Paste into the editor
3. View/export the diagram

### Option 5: IntelliJ/RubyMine

Install the **Mermaid** plugin:
1. Go to Settings → Plugins
2. Search for "Mermaid"
3. Install and restart
4. Open `.mmd` files to see live preview

## Diagram Syntax

Mermaid uses a simple text-based syntax:

```mermaid
sequenceDiagram
participant A as Component A
participant B as Component B

A->>B: method_call()
activate B
B-->>A: return value
deactivate B

Note over A,B: This is a note

rect rgb(200, 255, 200)
Note over A: Success block
A->>B: another_call()
end
```

### Key Elements Used

- `participant X as Name` - Define components/actors
- `A->>B: message` - Solid arrow (synchronous call)
- `B-->>A: message` - Dashed arrow (return/response)
- `activate/deactivate` - Show when component is active
- `Note over A,B: text` - Add notes/comments
- `rect rgb(r,g,b)` - Colored background blocks
- `loop`, `alt`, `opt` - Control structures

## Editing Tips

1. **Keep it readable** - Don't make diagrams too complex
2. **Use colors** - `rect rgb(...)` to highlight different scenarios:
- Green (200, 255, 200) - Success paths
- Red (255, 200, 200) - Error paths
- Yellow (255, 255, 200) - Important notes
- Blue (200, 200, 255) - Future/delayed actions
3. **Add notes** - Explain WHY things happen, not just WHAT
4. **Break into multiple diagrams** - One scenario per diagram

## Comparing Rate Limit vs. Payment Error

| Aspect | Rate Limit Error | Payment Error |
|--------|------------------|---------------|
| **Exception** | `Finance::Payment::StripeRateLimitError` | `Finance::Payment::CreditCardPurchaseFailed` |
| **HTTP Code** | 429 | 402 |
| **Caught where?** | NOT caught in Invoice#charge! | Caught in Invoice#charge! |
| **Invoice state** | Remains "pending" | Changed to "unpaid" |
| **Retry counter** | NOT incremented | Incremented (1/3, 2/3, 3/3) |
| **Job behavior** | Fails → Sidekiq retry | Completes successfully |
| **Retry timing** | ~15s, ~45s, ~135s... | 3 days later |
| **Lock** | Released immediately | Held for 1 hour |
| **User notification** | None | Email sent immediately |
| **Re-raised?** | Yes (bubbles to BillingWorker) | No (handled in Invoice) |

## Key Implementation Details

### Critical Fixes

1. **`bill_and_charge_each` (billing_strategy.rb:378-383)**
- Separate rescue for `RateLimitError` to prevent swallowing
- Must re-raise immediately

2. **`BillingStrategy.daily` (billing_strategy.rb:92-112)**
- Separate rescue for better logging (WARN vs ERROR)
- Don't call `results.failure()` for rate limits

3. **`BillingService.call!` (billing_service.rb:29-41)**
- Release lock on rate limit
- Re-raise error for Sidekiq

4. **`BillingWorker` (billing_worker.rb:8-20)**
- Exponential backoff for rate limits: `(3 ** count) * 5 + jitter`
- Standard delay for other errors: `1.hour + 10`

## See Also

- `STRIPE_RATE_LIMIT_HANDLING.md` - Detailed implementation documentation
- `test/integration/finance/stripe_rate_limit_handling_test.rb` - Comprehensive tests
Loading