Skip to content

Comments

fix(swap): Monero wallet thread safety#281

Merged
binarybaron merged 18 commits intomasterfrom
asb/wallet-synchronization
Apr 24, 2025
Merged

fix(swap): Monero wallet thread safety#281
binarybaron merged 18 commits intomasterfrom
asb/wallet-synchronization

Conversation

@Einliterflasche
Copy link

Our current approach to thread safety regarding monero::Wallet, which we pass around as Arc<monero::Wallet>, is that it contains a Mutex::monero_rpc::wallet::Client and acquires that before using the monero wallet rpc process. Since it does that in each function, calling i.e. create_from_and_load from create_from_keys_and_sweep results in the lock being acquired, released and then acquired again, even though we expect an atomic operation. This results in a race condition where another thread acquires the lock after it is first released, even though the function isn't finished yet.

We need to migrate to an API where we can guarantee that multiple functions can be called without releasing the lock. The best way to do this would be to always pass Arc<Mutex<monero::Wallet>> and have the caller lock the mutex. Any call to the monero::Wallet would then be guaranteed to happen within a single lock acquisition. However, this would introduce a lot of (small) changes, since we use Arc<monero::Wallet> quite a lot.

@binarybaron can you think of another way to fix this with less friction, that is similarly robust?

@binarybaron
Copy link

I need to think about this a bit more but how long would you like to keep the lock? Over the duration of which process ?

@Einliterflasche
Copy link
Author

This race condition is the cause of the refund failures of #274. If we take a look at these logs:

{"timestamp":"2025-03-26T21:53:28.464127523Z","level":"INFO","fields":{"message":"Syncing Monero wallet","name":"asb-wallet","attempt":1},"span":{"id":"773...","name":"swap"},"spans":[{"id":"773...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:44.403850664Z","level":"INFO","fields":{"message":"Syncing Monero wallet","name":"asb-wallet","attempt":1},"span":{"id":"50f...","name":"swap"},"spans":[{"id":"50f...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:58.127906703Z","level":"INFO","fields":{"message":"Monero wallet synced","name":"asb-wallet"},"span":{"id":"773...","name":"swap"},"spans":[{"id":"773...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:58.128939883Z","level":"INFO","fields":{"message":"Monero wallet synced","name":"asb-wallet"},"span":{"id":"50f...","name":"swap"},"spans":[{"id":"50f...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:59.207398055Z","level":"INFO","fields":{"message":"Monero transferred back to default wallet","tx":"2ab...","monero_address":"4BA..."},"span":{"id":"773...","name":"swap"},"spans":[{"id":"773...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:59.290489473Z","level":"WARN","fields":{"message":"Failed to refund Monero. We will retry in 0 seconds","swap_id":"50f...","error":"Failed to transfer Monero to default wallet\n\nCaused by:\n    0: JSON-RPC request failed with code -4: No unlocked balance in the specified account\n    1: JSON-RPC request failed with code -4: No unlocked balance in the specified account"},"span":{"id":"50f...","name":"swap"},"spans":[{"id":"50f...","name":"swap"}]}

Then the asb is trying to sync 2 monero wallets at the same, which is not possible with monero-wallet-rpc. We need cannot release the lock unless we finish the complete sequence (opening wallet, sync, sweep) or else monero-wallet-rpc will be in an unexpected (wrong) state and cause refund issues etc.
This is triggered any time the asb tries to refund 2 swaps whose timelock expires in the same block, i.e. often (for busy makers).

Before, monero::Wallet wrapped a Mutex<Client>, and locked
the mutex on each operation. This meant releasing the
lock in between operations, even though we rely on the
operations being executed in order.

To remedy this race condition, we wrap monero::Wallet itself
in a mutex, requiring any caller to hold the lock for the duration
of the operation, including any suboperations.
@binarybaron
Copy link

This race condition is the cause of the refund failures of #274. If we take a look at these logs:

{"timestamp":"2025-03-26T21:53:28.464127523Z","level":"INFO","fields":{"message":"Syncing Monero wallet","name":"asb-wallet","attempt":1},"span":{"id":"773...","name":"swap"},"spans":[{"id":"773...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:44.403850664Z","level":"INFO","fields":{"message":"Syncing Monero wallet","name":"asb-wallet","attempt":1},"span":{"id":"50f...","name":"swap"},"spans":[{"id":"50f...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:58.127906703Z","level":"INFO","fields":{"message":"Monero wallet synced","name":"asb-wallet"},"span":{"id":"773...","name":"swap"},"spans":[{"id":"773...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:58.128939883Z","level":"INFO","fields":{"message":"Monero wallet synced","name":"asb-wallet"},"span":{"id":"50f...","name":"swap"},"spans":[{"id":"50f...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:59.207398055Z","level":"INFO","fields":{"message":"Monero transferred back to default wallet","tx":"2ab...","monero_address":"4BA..."},"span":{"id":"773...","name":"swap"},"spans":[{"id":"773...","name":"swap"}]}
{"timestamp":"2025-03-26T21:54:59.290489473Z","level":"WARN","fields":{"message":"Failed to refund Monero. We will retry in 0 seconds","swap_id":"50f...","error":"Failed to transfer Monero to default wallet\n\nCaused by:\n    0: JSON-RPC request failed with code -4: No unlocked balance in the specified account\n    1: JSON-RPC request failed with code -4: No unlocked balance in the specified account"},"span":{"id":"50f...","name":"swap"},"spans":[{"id":"50f...","name":"swap"}]}

Then the asb is trying to sync 2 monero wallets at the same, which is not possible with monero-wallet-rpc. We need cannot release the lock unless we finish the complete sequence (opening wallet, sync, sweep) or else monero-wallet-rpc will be in an unexpected (wrong) state and cause refund issues etc. This is triggered any time the asb tries to refund 2 swaps whose timelock expires in the same block, i.e. often (for busy makers).

You are totally correct here. I have encountered this myself on my dev server.

@binarybaron
Copy link

This also means we cannot provide quotes while we are syncing a refund wallet. Which is ok, I guess but is properly not expected behaviour.

@Einliterflasche
Copy link
Author

Combined with #288 this would probanly mostly resolve the issue, though? We could also return the cached quote while working on creating another one.

@binarybaron
Copy link

binarybaron commented Apr 16, 2025

Combined with #288 this would probanly mostly resolve the issue, though? We could also return the cached quote while working on creating another one.

I'd stay with the current policy of only returning a quote if the TTL is accurate. We don't want to return the same outdated quote over a long period (if e.g the syncing takes a long time). A user cannot initiate a swap while we're syncing anyway because we need to sync our main asb_wallet before initiating a swap (for the wallet snapshot)

This will be fixed once #244 is merged. No need to prevent to have concurrency, if we don't support it.

We should still merge this PR though (if it wasn't clear)

@binarybaron binarybaron changed the title rework asb monero wallet thread safety fix(swap): Monero wallet thread safety Apr 17, 2025
@binarybaron
Copy link

We can add a timeout for the acquisition of the Mutex lock for some operation.

If we want to serve a quote and can't lock the Monero wallet for >60s it might not make sense anymore to even bother with responding anymore.

Due to the newly introduced thread safety, we are currently holding
lock to the monero wallet while waiting for confirmations
-- since this takes a lot of time, it starves all other tasks
that do anything with the monero wallet.

In this commit I start implementing a change that enables us to release
the lock to the wallet while waiting for confirmations and only acquire it
when necessary.

This breaks with the current system of passing just a generic client
which implements the MoneroWalletRpc trait (which we use to pass a dummy
client for testing).

This commit is the first step towards a small refactor to that system.
By always passing Arc<Mutex<Wallet>> instead of MoneroWalletRpc clients
directly we can allow the wait_for_confirmations functions to lock the
Mutex and access the client when they need to, while releasing the lock
when waiting for the next tick. This stops the current starving of other
tasks waiting for the lock.

Since we use a dummy client for testing, this required adding a generic
parameter to the Wallet. However, since we specify a default type,
this doesn't actually require generic handling anywhere.
@Einliterflasche
Copy link
Author

We can add a timeout for the acquisition of the Mutex lock for some operation.

If we want to serve a quote and can't lock the Monero wallet for >60s it might not make sense anymore to even bother with responding anymore.

I think we just waited for the lock (without timeout) before this change too. But you're right, it totally makes sense to implement this here.

This commit adds a timeout after 60 seconds when trying to acquire
the lock on the monero wallet while making a quote.
Should a timout occur, we return an error.
This makes sure that we get _some_ return value and that
starvation is noticed.
@Einliterflasche
Copy link
Author

Currently, there still seems to be an instance of overly long lock holding and subsequent starvation of other tasks in wait_for_confirmations. Why this happens currently eludes me.

@Einliterflasche
Copy link
Author

Einliterflasche commented Apr 20, 2025

Tests are passing now, could you review this @binarybaron @delta1?

Edit: there is one test (reopens_wallet_in_case_not_available) which is probably failing

@binarybaron
Copy link

Tests are passing now, could you review this @binarybaron @delta1?

Edit: there is one test (reopens_wallet_in_case_not_available) which is probably failing

I'll review this aspa.

Our testnet asb is running this branch.

@binarybaron
Copy link

@Einliterflasche Have you added the changes from #260?

@Einliterflasche
Copy link
Author

Not yet

Copy link

@binarybaron binarybaron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments about Mutex / RwLock

@binarybaron
Copy link

LGTM but let's do some manual testing before merging this.

@Einliterflasche
Copy link
Author

Edit: there is one test (reopens_wallet_in_case_not_available) which is probably failing

The test runs fine on my machine, but timed out in the action. I'll rerun the action, but it should be fine.

@binarybaron
Copy link

I keep getting this error when trying to do a testnet swap: Failed to complete swap: Fee per byte returned by electrum server is negative: -1. This may indicate that fee estimation is not supported by this server

This isn't related to your changes but it's preventing me from testing this PR. Can you try doing a swap?

@binarybaron binarybaron added the P1 Priority 1 label Apr 22, 2025
@Einliterflasche
Copy link
Author

I'm adding some changes from #260 into this PR.

When we fail to create a monero wallet from keys, we will now try
to open it instead. I also renamed the method to be more consistent
with Wallet::open_or_create.

These changes are mostly taken from #260.
@Einliterflasche Einliterflasche force-pushed the asb/wallet-synchronization branch from eda2268 to b57dd3b Compare April 22, 2025 13:45
@Einliterflasche Einliterflasche force-pushed the asb/wallet-synchronization branch from b57dd3b to 66ce0c5 Compare April 22, 2025 13:47
This commit deduplicates logic by using
create_from_keys_and_sweep_to in bob's redeem_xmr
and also adds the create_from_keys_and_sweep_to
method while making create_from_keys_and_sweep a
wrapper around it.
@Einliterflasche
Copy link
Author

Einliterflasche commented Apr 23, 2025

There was a deadlock possibility in wait_for_confirmations_with in one match arm, which I fixed now. I also verified that no other match arm can exhibit that behaviour.

@binarybaron
Copy link

There was a deadlock possibility in wait_for_confirmations_with in one match arm, which I fixed now. I also verified that no other match arm can exhibit that behaviour.

Awesome. LGTM. Is this good to merge?

@Einliterflasche
Copy link
Author

I think so. Since you agree, I'll go ahead and merge this.

@binarybaron binarybaron merged commit f1e5cdf into master Apr 24, 2025
27 of 29 checks passed
@binarybaron
Copy link

Please also add a changelog entry.

Awesome that we finally fixed this 💪🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P1 Priority 1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants