
Conversation

@vutuanlinh2k2
Contributor

Summary

This PR makes the SP1 prover API production-ready with comprehensive security improvements and reliability fixes:

  • Modular architecture: Refactored monolithic main.rs into focused modules (handlers, queue, jobs, validation, config, types, etc.)
  • Job queue with worker pool: Async job processing with configurable concurrency limits
  • Security hardening:
    • Eligibility verification against merkle tree before proof generation
    • Sr25519 signature validation with challenge-response
    • IP-based rate limiting with configurable limits
    • trust_proxy_headers config for secure IP extraction behind reverse proxies
  • Concurrency fix: Semaphore permits now held until blocking tasks complete, preventing unbounded resource usage on timeouts
  • Production config: Tuned defaults for SP1 network prover with explicit Mainnet mode
  • Configurable RPC timeout for on-chain verification calls
  • Test coverage: Added httpmock tests for on-chain verification and claim checking

Test plan

  • Verify prover API starts successfully with production config
  • Test rate limiting blocks excessive requests from same IP
  • Verify signature validation rejects invalid signatures
  • Confirm eligibility check prevents ineligible addresses from generating proofs
  • Test job queue processes requests with proper concurrency limits
  • Verify timed-out jobs don't cause unbounded concurrency
  • Run cargo test -p prover-api to verify new httpmock tests pass

🤖 Generated with Claude Code

vutuanlinh2k2 and others added 6 commits January 7, 2026 21:15
Split monolithic main.rs into separate modules for better maintainability:
- config.rs: Configuration loading and validation
- types.rs: Shared types (AppState, JobEntry, CachedProof, etc.)
- handlers.rs: HTTP request handlers (submit_job, job_status, health)
- prover.rs: ZK proof generation logic
- queue.rs: JobQueue and WorkerPool for concurrent proof generation
- cache.rs: Proof caching with TTL cleanup
- rate_limit.rs: Per-pubkey rate limiting
- jobs.rs: Job entry cleanup
- validation.rs: Request validation utilities

Added production features:
- Configurable worker pool size and queue capacity
- Background cleanup tasks for cache, rate limits, and stale jobs
- Max body size limiting
- Proof timeout configuration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
… and IP rate limiting

Adds production hardening features:
- Load eligibility data from merkle-tree.json at startup with pubkey-based O(1) lookup
- Verify sr25519 signatures before proof generation using schnorrkel
- IP-based rate limiting (separate from pubkey rate limiting) to catch bots/scanners
- Add proof metrics tracking (completions, timeouts, background tasks)
- Update Dockerfile to bundle eligibility data with configurable path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
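For reference, a minimal sketch of the sr25519 check this commit describes, assuming the standard schnorrkel API; the helper name and the <Bytes>…</Bytes> wrapping (the convention substrate wallets apply when signing raw payloads) are assumptions, not the repo's actual code:

use schnorrkel::{signing_context, PublicKey, Signature};

/// Hypothetical helper: verify a wallet-produced sr25519 signature.
fn verify_sr25519(pubkey: &[u8; 32], message: &[u8], signature: &[u8; 64]) -> bool {
    let Ok(pk) = PublicKey::from_bytes(pubkey) else { return false };
    let Ok(sig) = Signature::from_bytes(signature) else { return false };
    // Wallets (e.g. polkadot-js signRaw) wrap raw payloads before signing.
    let mut wrapped = b"<Bytes>".to_vec();
    wrapped.extend_from_slice(message);
    wrapped.extend_from_slice(b"</Bytes>");
    // Substrate uses the "substrate" signing context for sr25519.
    pk.verify(signing_context(b"substrate").bytes(&wrapped), &sig).is_ok()
}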
- Increase cache TTL from 10 minutes to 1 hour since proofs are
  deterministic and expensive to compute
- Reduce queue capacity from 100 to 50 for better wait time estimation
  (~1 hour max with 4 workers)
- Increase jobs TTL from 10 minutes to 1 hour to let users return
  for their proof

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
ProverClient::from_env() defaults to Reserved mode, which has an invalid
domain for the mainnet network. Explicitly use NetworkMode::Mainnet via
ProverClient::builder().network_for(NetworkMode::Mainnet) to ensure the
correct mainnet RPC URL is used.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add trust_proxy_headers config for secure IP extraction behind reverse proxies
  - When false (default), only socket address is used for rate limiting
  - When true, respects X-Forwarded-For and X-Real-IP headers
  - Strip port from socket addresses for consistent rate limiting

- Add configurable RPC timeout for on-chain verification
  - timeout_seconds field in VerifyOnchainConfig
  - Uses RPC_TIMEOUT_SECONDS env var (default: 10s)

- Fix unbounded concurrency when proof generation times out
  - Acquire semaphore permit before spawning task, not inside
  - Hold permit until blocking task actually completes (not just timeout)
  - Use tokio::select! instead of tokio::time::timeout to keep handle valid
  - Add decrement_timed_out_still_running metric tracking

- Fix JobQueue::from_sender to use shared queue_size counter
  - Prevents queue size tracking inconsistencies
  - Add error handling for misconfigured state

Co-Authored-By: Claude Opus 4.5 <[email protected]>
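A minimal sketch of the permit-holding pattern the unbounded-concurrency fix above describes; function names, the empty task body, and error handling are illustrative stand-ins, not the repo's actual code:

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

async fn process_one(sem: Arc<Semaphore>, proof_timeout: Duration) {
    // Acquire BEFORE spawning, so a timed-out-but-still-running task
    // continues to count against the concurrency limit.
    let permit = sem.acquire_owned().await.expect("semaphore closed");

    // Proof generation is heavy, so it runs on the blocking pool.
    let mut handle = tokio::task::spawn_blocking(|| {
        // ... expensive SP1 proof generation stands in here ...
    });

    // tokio::select! (rather than tokio::time::timeout) keeps `handle`
    // usable after the deadline fires, so we can await real completion.
    tokio::select! {
        res = &mut handle => {
            let _ = res; // finished in time: report/cache the result
        }
        _ = tokio::time::sleep(proof_timeout) => {
            // Report a timeout to the client, but do NOT release the permit:
            // the blocking task is still consuming a prover slot.
            let _ = handle.await; // late completions can still be cached here
        }
    }

    drop(permit); // released only once the blocking work has truly ended
}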
- Add httpmock dev dependency for mocking RPC responses
- Add tests for verify_onchain_proof success and revert cases
- Add test for check_already_claimed with claimed/unclaimed scenarios
- Minor code cleanup and formatting fixes

Co-Authored-By: Claude Opus 4.5 <[email protected]>
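A self-contained sketch of the httpmock pattern this commit describes. The JSON-RPC revert payload and endpoint path are assumptions, and a bare reqwest call stands in for the real code under test (verify_onchain_proof / check_already_claimed):

use httpmock::prelude::*;

#[tokio::test]
async fn rpc_revert_is_surfaced() {
    // Stand up a local mock server in place of the real RPC endpoint.
    let server = MockServer::start_async().await;
    let rpc_mock = server
        .mock_async(|when, then| {
            when.method(POST).path("/");
            then.status(200)
                .header("content-type", "application/json")
                .json_body(serde_json::json!({
                    "jsonrpc": "2.0",
                    "id": 1,
                    "error": { "code": 3, "message": "execution reverted" }
                }));
        })
        .await;

    // Stand-in for the code under test: POST an eth_call, inspect the reply.
    let body: serde_json::Value = reqwest::Client::new()
        .post(server.url("/"))
        .json(&serde_json::json!({
            "jsonrpc": "2.0", "id": 1, "method": "eth_call", "params": []
        }))
        .send()
        .await
        .unwrap()
        .json()
        .await
        .unwrap();

    assert!(body.get("error").is_some()); // revert surfaced to the caller
    rpc_mock.assert_async().await;        // the mock was actually hit
}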
@claude

claude bot commented Jan 8, 2026

Claude Code is working…

I'll analyze this and get back to you.


@vutuanlinh2k2 self-assigned this Jan 8, 2026
@claude

claude bot commented Jan 8, 2026

Claude finished @vutuanlinh2k2's task


Comprehensive Code Review

I've thoroughly analyzed this PR, which refactors the SP1 prover API from a monolithic to a modular architecture with significant security improvements. Here's my detailed feedback:

✅ Review Checklist

  • Analyze architectural refactoring (main.rs → modular design)
  • Review security implementations (rate limiting, signature validation, eligibility)
  • Examine concurrency handling and job queue
  • Check configuration and error handling
  • Review test coverage and new test files
  • Provide specific feedback and recommendations

🏗️ Excellent Architectural Refactoring

Strengths:

  • Clean separation of concerns: The modular split from a monolithic 450+ line main.rs to focused modules is excellent
  • Well-defined module boundaries: handlers/, queue/, security/, config/, etc. each have clear responsibilities
  • Maintainable structure: The new organization makes the codebase much easier to navigate and extend

Suggestions:

  • Consider extracting types.rs into domain-specific modules (e.g., types/requests.rs, types/responses.rs, types/config.rs) as it's currently quite large at 422 lines

🔒 Strong Security Implementation

Multi-layered security approach:

Rate Limiting (rate_limit.rs)

  • Two-tier rate limiting: IP-based (10/min) + pubkey-based (3 per 5 min) is smart defense-in-depth
  • Proper cleanup: Background tasks prevent memory leaks from rate limit entries
  • Good test coverage: 10 comprehensive test cases covering edge cases

⚠️ Potential Issue - Race Condition in Rate Limiter:

// In rate_limit.rs:38-66
match limits.get_mut(key) {
    Some(entry) => {
        let elapsed = now_ts() - entry.last_request_at;
        if elapsed >= self.window_seconds {
            entry.last_request_at = now_ts(); // ❌ Two calls to now_ts() could differ
            entry.request_count = 1;
        }
        // ... remaining branch elided ...
    }
    // ...
}
Fix: Capture timestamp once to avoid race conditions:

let now = now_ts();
let elapsed = now - entry.last_request_at;
if elapsed >= self.window_seconds {
    entry.last_request_at = now; // ✅ Consistent timestamp
    entry.request_count = 1;
}

Signature Verification (signature.rs)

  • Correct SR25519 implementation: Proper substrate context and byte wrapping
  • Security-first approach: Validates signatures before expensive proof generation
  • Clear documentation: Good comments explaining the substrate-specific wrapping

Eligibility Verification (eligibility.rs)

  • O(1) pubkey-based lookup: Smart choice to handle different SS58 prefixes correctly
  • Exact balance matching: Prevents amount manipulation attacks
  • Flexible JSON parsing: Handles both entriesByPubkey and fallback formats
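A minimal sketch of the lookup shape those points describe; the struct and field names are assumptions, while the pubkey keying and exact-amount check come from this PR:

use std::collections::HashMap;

/// Assumed shape of one entry parsed from merkle-tree.json.
struct EligibilityEntry {
    amount: u128,
}

/// Keying by the raw 32-byte public key (not the SS58 string) makes the
/// lookup O(1) and independent of SS58 prefix.
fn is_eligible(
    entries: &HashMap<[u8; 32], EligibilityEntry>,
    pubkey: &[u8; 32],
    claimed_amount: u128,
) -> bool {
    // Exact balance match blocks amount-manipulation attempts.
    entries
        .get(pubkey)
        .is_some_and(|e| e.amount == claimed_amount)
}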

⚙️ Robust Concurrency Handling

Outstanding semaphore-based solution:

Job Queue (queue.rs)

The concurrency control in process_job() is exceptionally well designed:

// Lines 161, 275-332: Critical concurrency fix
async fn process_job(
    // ...
    _permit: OwnedSemaphorePermit, // ✅ Held until function returns
) {
    // After timeout occurs, STILL waits for task completion
    match handle.await {
        // Handle late completion and cache results
    }
    // ✅ Permit only released after blocking task truly completes
}

Key Benefits:

  • True concurrency limits: Prevents unbounded resource usage even on timeouts
  • Cost control: Critical for expensive SP1 network proofs (10+ minutes, $0.10+ each)
  • Cache late completions: Smart to still cache results that complete after timeout
  • Proper metrics tracking: timed_out_still_running counter for operational visibility

🔧 Configuration & Error Handling

Configuration (config.rs)

Strengths:

  • Production-ready defaults: Sensible values for queue capacity (50), workers (4), timeouts (600s)
  • Security enforcements: Requires VERIFY_PROOF=true for SP1_PROVER=network
  • CORS validation: Proper handling for production security
  • Good validation: Early error detection with clear messages
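For concreteness, a sketch of those defaults as a config struct; the struct and field names are assumptions, while the values come from this PR's commits and review:

#[derive(Debug, Clone)]
pub struct ProverApiConfig {
    pub worker_pool_size: usize,   // 4
    pub queue_capacity: usize,     // 50 → ~1h worst-case wait with 4 workers
    pub proof_timeout_secs: u64,   // 600
    pub cache_ttl_secs: u64,       // 3600: proofs are deterministic, expensive
    pub jobs_ttl_secs: u64,        // 3600: let users return for their proof
    pub rpc_timeout_secs: u64,     // 10, from RPC_TIMEOUT_SECONDS
    pub trust_proxy_headers: bool, // false by default
}

impl Default for ProverApiConfig {
    fn default() -> Self {
        Self {
            worker_pool_size: 4,
            queue_capacity: 50,
            proof_timeout_secs: 600,
            cache_ttl_secs: 3600,
            jobs_ttl_secs: 3600,
            rpc_timeout_secs: 10,
            trust_proxy_headers: false,
        }
    }
}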

Error Handling

  • Structured error codes: Consistent error response format across all endpoints
  • Proper HTTP status codes: 400/401/403/429/503 used appropriately
  • Client-friendly messages: Clear error descriptions for API consumers

🧪 Test Coverage Analysis

Comprehensive testing across modules:

  • 11 modules with test coverage: Excellent adoption of unit testing
  • Integration test script: test-api.sh provides end-to-end validation
  • Edge case coverage: Rate limiting, validation, queue behavior well tested

Missing test areas:

  • Concurrency stress tests: Could add tests with high concurrent load
  • IP extraction logic: Tests for trust_proxy_headers functionality
  • Error recovery: Tests for various failure scenarios and graceful degradation

🐛 Potential Issues Identified

1. IP Extraction Security (validation.rs:224-244)

pub fn extract_client_ip(headers: &HeaderMap, addr: Option<&str>, trust_proxy: bool) -> String {
    if trust_proxy {
        // Check X-Forwarded-For, X-Real-IP headers
    }
    addr.unwrap_or("unknown").to_string()
}

Concern: When trust_proxy_headers=false, the function still silently falls back to the connection IP. Consider logging when proxy headers are present but not trusted, so a misconfigured deployment is visible.
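A fuller sketch of the branch the excerpt elides, assuming the usual header precedence (first X-Forwarded-For hop, then X-Real-IP); this is an illustration, not the repo's exact implementation:

use axum::http::HeaderMap;

pub fn extract_client_ip(headers: &HeaderMap, addr: Option<&str>, trust_proxy: bool) -> String {
    if trust_proxy {
        // The first hop in X-Forwarded-For is the original client, provided
        // the proxy in front overwrites inbound values.
        if let Some(xff) = headers.get("x-forwarded-for").and_then(|v| v.to_str().ok()) {
            if let Some(first) = xff.split(',').next().map(str::trim) {
                if !first.is_empty() {
                    return first.to_string();
                }
            }
        }
        if let Some(xri) = headers.get("x-real-ip").and_then(|v| v.to_str().ok()) {
            return xri.trim().to_string();
        }
    }
    // Untrusted or absent headers: fall back to the socket address
    // (port already stripped by the caller).
    addr.unwrap_or("unknown").to_string()
}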

2. Memory Usage (handlers.rs:111-176)

The handler creates synthetic job entries for cache hits. Over time, this could accumulate in the jobs HashMap without cleanup.

Suggestion: Consider excluding cached results from jobs tracking or ensure they're cleaned up properly.

3. Error Propagation (prover.rs)

Some errors lose context during string conversion:

let signature = parse_hex_bytes::<64>(&request.signature).map_err(err_to_string)?;

Minor improvement: Preserve error chain for better debugging in production.

🚀 Performance Considerations

Positive:

  • Efficient data structures: HashMap lookups, atomic counters for metrics
  • Async/await throughout: Non-blocking I/O for API responsiveness
  • Smart caching: Deterministic proofs cached by request hash
  • Background cleanup tasks: Prevent memory leaks

Optimization opportunities:

  • Batch cleanup operations: Currently each cleanup task runs independently
  • Connection pooling: For RPC calls to claim contract verification

📋 Recommendations

High Priority

  1. Fix timestamp race condition in rate limiter (lines 47-48 in rate_limit.rs)
  2. Add concurrency stress tests to validate semaphore behavior under load
  3. Document production deployment requirements and security considerations

Medium Priority

  1. Add structured logging with correlation IDs for request tracing
  2. Consider circuit breaker for RPC endpoints to handle external service failures
  3. Add metrics export (Prometheus/similar) for operational monitoring

Low Priority

  1. Split large type definitions into domain-specific modules
  2. Add integration tests for signature verification with real SR25519 keys

🎯 Overall Assessment

This is outstanding work that transforms a prototype into a production-ready system. The security-first approach, robust concurrency handling, and comprehensive error management demonstrate excellent engineering practices.

Key Achievements:

  • ✅ Production-ready security hardening
  • ✅ Proper resource management preventing cost overruns
  • ✅ Modular architecture enabling future enhancements
  • ✅ Comprehensive test coverage
  • ✅ Clear configuration and operational controls

The concurrency fix alone (holding semaphore permits until blocking tasks complete) shows deep understanding of the problem space and will prevent significant operational issues.

Recommended for merge after addressing the timestamp race condition in the rate limiter.


@vutuanlinh2k2
Contributor Author

Actually Production-Ready As-Is

The code is already well-designed for production. Most of Claude's suggestions are nice-to-haves, not blockers:

| Issue | Real Impact | Verdict |
| --- | --- | --- |
| Timestamp race condition | Millisecond difference in 60-300s windows | Cosmetic fix |
| Memory leak in handlers | Already handled by start_jobs_cleanup_task (line 136 in main.rs) | Not an issue |
| Missing stress tests | Code logic is sound | Nice-to-have |
| Error context loss | Only affects debugging | Nice-to-have |
| IP logging | Edge case | Nice-to-have |

The One Fix Worth Making

Timestamp consistency in rate_limit.rs:43-47 - it's a 2-line change that demonstrates attention to code quality:

// Lines 43-48: capture timestamp once
let now = now_ts();
let elapsed = now - entry.last_request_at;
if elapsed >= self.window_seconds {
    entry.last_request_at = now;  // Use same timestamp
    entry.request_count = 1;
}

What's Already Solid

  1. Job cleanup - Background task runs every 60s, removes completed/failed jobs after TTL
  2. Rate limit cleanup - Background task prevents memory leaks
  3. Cache cleanup - Background task handles cache eviction
  4. Concurrency control - Semaphore-based limiting with proper permit handling
  5. Security layers - IP rate limiting → validation → eligibility → signature → pubkey rate limiting
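A sketch of the defense-in-depth ordering in item 5: checks run from cheapest to most expensive, so a request pays for nothing it fails. All names below are illustrative stand-ins, not the repo's API:

struct Checks {
    ip_allowed: bool,      // 1. IP rate limiting (catches bots/scanners)
    well_formed: bool,     // 2. request validation
    eligible: bool,        // 3. merkle eligibility lookup
    signature_valid: bool, // 4. sr25519 signature verification
    pubkey_allowed: bool,  // 5. per-pubkey rate limiting
}

fn admit(c: &Checks) -> Result<(), &'static str> {
    if !c.ip_allowed { return Err("ip_rate_limited"); }
    if !c.well_formed { return Err("invalid_request"); }
    if !c.eligible { return Err("not_eligible"); }
    if !c.signature_valid { return Err("invalid_signature"); }
    if !c.pubkey_allowed { return Err("pubkey_rate_limited"); }
    Ok(()) // only now does the job reach the prover queue
}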

Conclusion

The API is production-ready now. If you want to be thorough, fix the timestamp issue - it's a 2-line change. Everything else is polish.

Capture the timestamp once at the start of check_and_update() to avoid
a potential race condition where multiple calls to now_ts() could return
different values.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@tangle-network deleted a comment from claude bot Jan 8, 2026