Skip to content

Conversation

JoshLind
Copy link
Contributor

Description

This PR offers several performance, metric and logging improvements for the pepper service.

The commits are as follows:

  1. Small improvements to error logs and metrics (for request handling). This commit also bumps the maximum HTTP request timeout from 10 seconds to 15 seconds (additional buffer is required until we can identify why some external resource fetches are so slow).
  2. Use tokio spawn_blocking() for the pepper derivation logic, and add new metrics to track derivation latency.
  3. Update the firestore DB writes to be asynchronous, and add new metrics to track DB write latencies. This will avoid blocking the request handler when clients make pepper requests.

Testing Plan

Existing test infrastructure.

@JoshLind JoshLind requested review from alinush and zjma October 13, 2025 13:16
Copy link
Contributor

@alinush alinush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙈

@JoshLind JoshLind enabled auto-merge (rebase) October 14, 2025 12:52

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite compat success on 75cc050c2c35782fc994c2a7e61a7fb3ea88f71f ==> 7d237c25521c488650e093179854a8f6c9d173c7

Compatibility test results for 75cc050c2c35782fc994c2a7e61a7fb3ea88f71f ==> 7d237c25521c488650e093179854a8f6c9d173c7 (PR)
1. Check liveness of validators at old version: 75cc050c2c35782fc994c2a7e61a7fb3ea88f71f
compatibility::simple-validator-upgrade::liveness-check : committed: 14626.09 txn/s, latency: 2382.09 ms, (p50: 2500 ms, p70: 2600, p90: 2700 ms, p99: 2800 ms), latency samples: 474060
2. Upgrading first Validator to new version: 7d237c25521c488650e093179854a8f6c9d173c7
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 5329.07 txn/s, latency: 6422.97 ms, (p50: 7100 ms, p70: 7200, p90: 7300 ms, p99: 7300 ms), latency samples: 184420
3. Upgrading rest of first batch to new version: 7d237c25521c488650e093179854a8f6c9d173c7
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 5333.87 txn/s, latency: 6458.30 ms, (p50: 7200 ms, p70: 7300, p90: 7400 ms, p99: 7400 ms), latency samples: 183500
4. upgrading second batch to new version: 7d237c25521c488650e093179854a8f6c9d173c7
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 8377.99 txn/s, latency: 4075.55 ms, (p50: 4400 ms, p70: 4500, p90: 4600 ms, p99: 4700 ms), latency samples: 277420
5. check swarm health
Compatibility test for 75cc050c2c35782fc994c2a7e61a7fb3ea88f71f ==> 7d237c25521c488650e093179854a8f6c9d173c7 passed
Test Ok

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 7d237c25521c488650e093179854a8f6c9d173c7

two traffics test: inner traffic : committed: 13851.24 txn/s, submitted: 13851.61 txn/s, expired: 0.37 txn/s, latency: 2714.18 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3600 ms), latency samples: 5266620
two traffics test : committed: 99.97 txn/s, latency: 804.72 ms, (p50: 700 ms, p70: 800, p90: 900 ms, p99: 2400 ms), latency samples: 1780
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 2.240, avg: 2.098", "ConsensusProposalToOrdered: max: 0.168, avg: 0.164", "ConsensusOrderedToCommit: max: 0.193, avg: 0.095", "ConsensusProposalToCommit: max: 0.354, avg: 0.259"]
Max non-epoch-change gap was: 2 rounds at version 23607 (avg 0.00) [limit 4], 2.21s no progress at version 45816 (avg 0.07s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.30s no progress at version 2279906 (avg 0.30s) [limit 16].
Test Ok

Comment on lines +95 to +99
if self.disable_async_db_updates {
db_update_task.await // Run the task and wait for the result
} else {
tokio::spawn(db_update_task); // Run the task asynchronously
Ok(())
Copy link
Contributor

@zjma zjma Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i recall the original design is that we should wait for the recovery database txn to be committed first, otherwise if a db txn failed and it is the only login attempt in the history, it would be impossible to recover that account. Therefore async db update shouldn't be an option. (@alinush can confirm?)

That said, I'm not sure if we are still keep the original design.

Copy link
Contributor Author

@JoshLind JoshLind Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah, good point! 😄 My thinking was the opposite, i.e.,:

If our DB is unhealthy, we shouldn't block users from accessing their pepper (and thus their on-chain accounts). Instead, we emit an error (so we can raise an alarm) and save the pepper input to our logs, so that we recover it manually (if we ever need to).

That way, for a pepper to be totally unrecoverable, we'd need: (1) the DB to be unhealthy (e.g., input never saved); (2) our logs to be missing (e.g., logs to be wiped); and (3) the dApp info to be lost/disabled.

Note: for (3), if any other users have logged into the same dApp, we should be able to take the "brute force approach" and find the input for the user more manually, right?

All in all, my preference is to still allow users to access their accounts (while we figure out what's going wrong with the DB). What do you think? 😄

@JoshLind JoshLind merged commit 28dc629 into main Oct 14, 2025
89 checks passed
@JoshLind JoshLind deleted the pepper_logs branch October 14, 2025 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants