
Conversation

@vutuanlinh2k2
Contributor

Summary

This PR makes the SP1 prover API production-ready with comprehensive security improvements and reliability fixes:

  • Modular architecture: Refactored monolithic main.rs into focused modules (handlers, queue, jobs, validation, config, types, etc.)
  • Job queue with worker pool: Async job processing with configurable concurrency limits
  • Security hardening:
    • Eligibility verification against merkle tree before proof generation
    • Sr25519 signature validation with challenge-response
    • IP-based rate limiting with configurable limits
    • trust_proxy_headers config for secure IP extraction behind reverse proxies
  • Concurrency fix: Semaphore permits now held until blocking tasks complete, preventing unbounded resource usage on timeouts
  • Production config: Tuned defaults for SP1 network prover with explicit Mainnet mode
  • Configurable RPC timeout for on-chain verification calls
  • Test coverage: Added httpmock tests for on-chain verification and claim checking

Test plan

  • Verify prover API starts successfully with production config
  • Test rate limiting blocks excessive requests from same IP
  • Verify signature validation rejects invalid signatures
  • Confirm eligibility check prevents ineligible addresses from generating proofs
  • Test job queue processes requests with proper concurrency limits
  • Verify timed-out jobs don't cause unbounded concurrency
  • Run cargo test -p prover-api to verify new httpmock tests pass

🤖 Generated with Claude Code

vutuanlinh2k2 and others added 6 commits January 7, 2026 21:15
Split monolithic main.rs into separate modules for better maintainability:
- config.rs: Configuration loading and validation
- types.rs: Shared types (AppState, JobEntry, CachedProof, etc.)
- handlers.rs: HTTP request handlers (submit_job, job_status, health)
- prover.rs: ZK proof generation logic
- queue.rs: JobQueue and WorkerPool for concurrent proof generation
- cache.rs: Proof caching with TTL cleanup
- rate_limit.rs: Per-pubkey rate limiting
- jobs.rs: Job entry cleanup
- validation.rs: Request validation utilities

Added production features:
- Configurable worker pool size and queue capacity
- Background cleanup tasks for cache, rate limits, and stale jobs
- Max body size limiting
- Proof timeout configuration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
… and IP rate limiting

Adds production hardening features:
- Load eligibility data from merkle-tree.json at startup with pubkey-based O(1) lookup
- Verify sr25519 signatures before proof generation using schnorrkel
- IP-based rate limiting (separate from pubkey rate limiting) to catch bots/scanners
- Add proof metrics tracking (completions, timeouts, background tasks)
- Update Dockerfile to bundle eligibility data with configurable path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
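For reference, a minimal sketch of the sr25519 check this commit describes, assuming the standard schnorrkel API; the helper name and the <Bytes>…</Bytes> wrapping (the convention substrate wallets apply when signing raw payloads) are assumptions, not the repo's actual code:

use schnorrkel::{signing_context, PublicKey, Signature};

/// Hypothetical helper: verify a wallet-produced sr25519 signature.
fn verify_sr25519(pubkey: &[u8; 32], message: &[u8], signature: &[u8; 64]) -> bool {
    let Ok(pk) = PublicKey::from_bytes(pubkey) else { return false };
    let Ok(sig) = Signature::from_bytes(signature) else { return false };
    // Wallets (e.g. polkadot-js signRaw) wrap raw payloads before signing.
    let mut wrapped = b"<Bytes>".to_vec();
    wrapped.extend_from_slice(message);
    wrapped.extend_from_slice(b"</Bytes>");
    // Substrate uses the "substrate" signing context for sr25519.
    pk.verify(signing_context(b"substrate").bytes(&wrapped), &sig).is_ok()
}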
- Increase cache TTL from 10 minutes to 1 hour since proofs are
  deterministic and expensive to compute
- Reduce queue capacity from 100 to 50 for better wait time estimation
  (~1 hour max with 4 workers)
- Increase jobs TTL from 10 minutes to 1 hour to let users return
  for their proof

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
ProverClient::from_env() defaults to Reserved mode, which has an invalid
domain for the mainnet network. Explicitly use NetworkMode::Mainnet via
ProverClient::builder().network_for(NetworkMode::Mainnet) to ensure the
correct mainnet RPC URL is used.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add trust_proxy_headers config for secure IP extraction behind reverse proxies
  - When false (default), only socket address is used for rate limiting
  - When true, respects X-Forwarded-For and X-Real-IP headers
  - Strip port from socket addresses for consistent rate limiting

- Add configurable RPC timeout for on-chain verification
  - timeout_seconds field in VerifyOnchainConfig
  - Uses RPC_TIMEOUT_SECONDS env var (default: 10s)

- Fix unbounded concurrency when proof generation times out
  - Acquire semaphore permit before spawning task, not inside
  - Hold permit until blocking task actually completes (not just timeout)
  - Use tokio::select! instead of tokio::time::timeout to keep handle valid
  - Add decrement_timed_out_still_running metric tracking

- Fix JobQueue::from_sender to use shared queue_size counter
  - Prevents queue size tracking inconsistencies
  - Add error handling for misconfigured state

Co-Authored-By: Claude Opus 4.5 <[email protected]>
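A minimal sketch of the permit-holding pattern the unbounded-concurrency fix above describes; function names, the empty task body, and error handling are illustrative stand-ins, not the repo's actual code:

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

async fn process_one(sem: Arc<Semaphore>, proof_timeout: Duration) {
    // Acquire BEFORE spawning, so a timed-out-but-still-running task
    // continues to count against the concurrency limit.
    let permit = sem.acquire_owned().await.expect("semaphore closed");

    // Proof generation is heavy, so it runs on the blocking pool.
    let mut handle = tokio::task::spawn_blocking(|| {
        // ... expensive SP1 proof generation stands in here ...
    });

    // tokio::select! (rather than tokio::time::timeout) keeps `handle`
    // usable after the deadline fires, so we can await real completion.
    tokio::select! {
        res = &mut handle => {
            let _ = res; // finished in time: report/cache the result
        }
        _ = tokio::time::sleep(proof_timeout) => {
            // Report a timeout to the client, but do NOT release the permit:
            // the blocking task is still consuming a prover slot.
            let _ = handle.await; // late completions can still be cached here
        }
    }

    drop(permit); // released only once the blocking work has truly ended
}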
- Add httpmock dev dependency for mocking RPC responses
- Add tests for verify_onchain_proof success and revert cases
- Add test for check_already_claimed with claimed/unclaimed scenarios
- Minor code cleanup and formatting fixes

Co-Authored-By: Claude Opus 4.5 <[email protected]>
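A self-contained sketch of the httpmock pattern this commit describes. The JSON-RPC revert payload and endpoint path are assumptions, and a bare reqwest call stands in for the real code under test (verify_onchain_proof / check_already_claimed):

use httpmock::prelude::*;

#[tokio::test]
async fn rpc_revert_is_surfaced() {
    // Stand up a local mock server in place of the real RPC endpoint.
    let server = MockServer::start_async().await;
    let rpc_mock = server
        .mock_async(|when, then| {
            when.method(POST).path("/");
            then.status(200)
                .header("content-type", "application/json")
                .json_body(serde_json::json!({
                    "jsonrpc": "2.0",
                    "id": 1,
                    "error": { "code": 3, "message": "execution reverted" }
                }));
        })
        .await;

    // Stand-in for the code under test: POST an eth_call, inspect the reply.
    let body: serde_json::Value = reqwest::Client::new()
        .post(server.url("/"))
        .json(&serde_json::json!({
            "jsonrpc": "2.0", "id": 1, "method": "eth_call", "params": []
        }))
        .send()
        .await
        .unwrap()
        .json()
        .await
        .unwrap();

    assert!(body.get("error").is_some()); // revert surfaced to the caller
    rpc_mock.assert_async().await;        // the mock was actually hit
}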
@claude

claude bot commented Jan 8, 2026

Claude Code is working…

I'll analyze this and get back to you.


@vutuanlinh2k2 self-assigned this Jan 8, 2026
@claude

claude bot commented Jan 8, 2026

Claude finished @vutuanlinh2k2's task


Comprehensive Code Review

I've thoroughly analyzed this PR, which refactors the SP1 prover API from a monolithic to a modular architecture with significant security improvements. Here's my detailed feedback:

✅ Review Checklist

  • Analyze architectural refactoring (main.rs → modular design)
  • Review security implementations (rate limiting, signature validation, eligibility)
  • Examine concurrency handling and job queue
  • Check configuration and error handling
  • Review test coverage and new test files
  • Provide specific feedback and recommendations

🏗️ Excellent Architectural Refactoring

Strengths:

  • Clean separation of concerns: The modular split from a monolithic 450+ line main.rs to focused modules is excellent
  • Well-defined module boundaries: handlers/, queue/, security/, config/, etc. each have clear responsibilities
  • Maintainable structure: The new organization makes the codebase much easier to navigate and extend

Suggestions:

  • Consider extracting types.rs into domain-specific modules (e.g., types/requests.rs, types/responses.rs, types/config.rs) as it's currently quite large at 422 lines

🔒 Strong Security Implementation

Multi-layered security approach:

Rate Limiting (rate_limit.rs)

  • Two-tier rate limiting: IP-based (10/min) + pubkey-based (3 per 5 min) is smart defense-in-depth
  • Proper cleanup: Background tasks prevent memory leaks from rate limit entries
  • Good test coverage: 10 comprehensive test cases covering edge cases

⚠️ Potential Issue - Race Condition in Rate Limiter:

// In rate_limit.rs:38-66
match limits.get_mut(key) {
    Some(entry) => {
        let elapsed = now_ts() - entry.last_request_at;
        if elapsed >= self.window_seconds {
            entry.last_request_at = now_ts(); // ❌ Two calls to now_ts() could differ
            entry.request_count = 1;
        }
        // ... remaining branch elided ...
    }
    // ...
}
Fix: Capture timestamp once to avoid race conditions:

let now = now_ts();
let elapsed = now - entry.last_request_at;
if elapsed >= self.window_seconds {
    entry.last_request_at = now; // ✅ Consistent timestamp
    entry.request_count = 1;
}

Signature Verification (signature.rs)

  • Correct SR25519 implementation: Proper substrate context and byte wrapping
  • Security-first approach: Validates signatures before expensive proof generation
  • Clear documentation: Good comments explaining the substrate-specific wrapping

Eligibility Verification (eligibility.rs)

  • O(1) pubkey-based lookup: Smart choice to handle different SS58 prefixes correctly
  • Exact balance matching: Prevents amount manipulation attacks
  • Flexible JSON parsing: Handles both entriesByPubkey and fallback formats
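A minimal sketch of the lookup shape those points describe; the struct and field names are assumptions, while the pubkey keying and exact-amount check come from this PR:

use std::collections::HashMap;

/// Assumed shape of one entry parsed from merkle-tree.json.
struct EligibilityEntry {
    amount: u128,
}

/// Keying by the raw 32-byte public key (not the SS58 string) makes the
/// lookup O(1) and independent of SS58 prefix.
fn is_eligible(
    entries: &HashMap<[u8; 32], EligibilityEntry>,
    pubkey: &[u8; 32],
    claimed_amount: u128,
) -> bool {
    // Exact balance match blocks amount-manipulation attempts.
    entries
        .get(pubkey)
        .is_some_and(|e| e.amount == claimed_amount)
}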

⚙️ Robust Concurrency Handling

Outstanding semaphore-based solution:

Job Queue (queue.rs)

The concurrency control in process_job() is exceptionally well designed:

// Lines 161, 275-332: Critical concurrency fix
async fn process_job(
    // ...
    _permit: OwnedSemaphorePermit, // ✅ Held until function returns
) {
    // After timeout occurs, STILL waits for task completion
    match handle.await {
        // Handle late completion and cache results
    }
    // ✅ Permit only released after blocking task truly completes
}

Key Benefits:

  • True concurrency limits: Prevents unbounded resource usage even on timeouts
  • Cost control: Critical for expensive SP1 network proofs (10+ minutes, $0.10+ each)
  • Cache late completions: Smart to still cache results that complete after timeout
  • Proper metrics tracking: timed_out_still_running counter for operational visibility

🔧 Configuration & Error Handling

Configuration (config.rs)

Strengths:

  • Production-ready defaults: Sensible values for queue capacity (50), workers (4), timeouts (600s)
  • Security enforcements: Requires VERIFY_PROOF=true for SP1_PROVER=network
  • CORS validation: Proper handling for production security
  • Good validation: Early error detection with clear messages
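For concreteness, a sketch of those defaults as a config struct; the struct and field names are assumptions, while the values come from this PR's commits and review:

#[derive(Debug, Clone)]
pub struct ProverApiConfig {
    pub worker_pool_size: usize,   // 4
    pub queue_capacity: usize,     // 50 → ~1h worst-case wait with 4 workers
    pub proof_timeout_secs: u64,   // 600
    pub cache_ttl_secs: u64,       // 3600: proofs are deterministic, expensive
    pub jobs_ttl_secs: u64,        // 3600: let users return for their proof
    pub rpc_timeout_secs: u64,     // 10, from RPC_TIMEOUT_SECONDS
    pub trust_proxy_headers: bool, // false by default
}

impl Default for ProverApiConfig {
    fn default() -> Self {
        Self {
            worker_pool_size: 4,
            queue_capacity: 50,
            proof_timeout_secs: 600,
            cache_ttl_secs: 3600,
            jobs_ttl_secs: 3600,
            rpc_timeout_secs: 10,
            trust_proxy_headers: false,
        }
    }
}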

Error Handling

  • Structured error codes: Consistent error response format across all endpoints
  • Proper HTTP status codes: 400/401/403/429/503 used appropriately
  • Client-friendly messages: Clear error descriptions for API consumers

🧪 Test Coverage Analysis

Comprehensive testing across modules:

  • 11 modules with test coverage: Excellent adoption of unit testing
  • Integration test script: test-api.sh provides end-to-end validation
  • Edge case coverage: Rate limiting, validation, queue behavior well tested

Missing test areas:

  • Concurrency stress tests: Could add tests with high concurrent load
  • IP extraction logic: Tests for trust_proxy_headers functionality
  • Error recovery: Tests for various failure scenarios and graceful degradation

🐛 Potential Issues Identified

1. IP Extraction Security (validation.rs:224-244)

pub fn extract_client_ip(headers: &HeaderMap, addr: Option<&str>, trust_proxy: bool) -> String {
    if trust_proxy {
        // Check X-Forwarded-For, X-Real-IP headers
    }
    addr.unwrap_or("unknown").to_string()
}

Concern: When trust_proxy_headers=false, the function still silently falls back to the connection IP. Consider logging when proxy headers are present but not trusted, so a misconfigured deployment is visible.
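A fuller sketch of the branch the excerpt elides, assuming the usual header precedence (first X-Forwarded-For hop, then X-Real-IP); this is an illustration, not the repo's exact implementation:

use axum::http::HeaderMap;

pub fn extract_client_ip(headers: &HeaderMap, addr: Option<&str>, trust_proxy: bool) -> String {
    if trust_proxy {
        // The first hop in X-Forwarded-For is the original client, provided
        // the proxy in front overwrites inbound values.
        if let Some(xff) = headers.get("x-forwarded-for").and_then(|v| v.to_str().ok()) {
            if let Some(first) = xff.split(',').next().map(str::trim) {
                if !first.is_empty() {
                    return first.to_string();
                }
            }
        }
        if let Some(xri) = headers.get("x-real-ip").and_then(|v| v.to_str().ok()) {
            return xri.trim().to_string();
        }
    }
    // Untrusted or absent headers: fall back to the socket address
    // (port already stripped by the caller).
    addr.unwrap_or("unknown").to_string()
}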

2. Memory Usage (handlers.rs:111-176)

The handler creates synthetic job entries for cache hits. Over time, this could accumulate in the jobs HashMap without cleanup.

Suggestion: Consider excluding cached results from jobs tracking or ensure they're cleaned up properly.

3. Error Propagation (prover.rs)

Some errors lose context during string conversion:

let signature = parse_hex_bytes::<64>(&request.signature).map_err(err_to_string)?;

Minor improvement: Preserve error chain for better debugging in production.

🚀 Performance Considerations

Positive:

  • Efficient data structures: HashMap lookups, atomic counters for metrics
  • Async/await throughout: Non-blocking I/O for API responsiveness
  • Smart caching: Deterministic proofs cached by request hash
  • Background cleanup tasks: Prevent memory leaks

Optimization opportunities:

  • Batch cleanup operations: Currently each cleanup task runs independently
  • Connection pooling: For RPC calls to claim contract verification

📋 Recommendations

High Priority

  1. Fix timestamp race condition in rate limiter (lines 47-48 in rate_limit.rs)
  2. Add concurrency stress tests to validate semaphore behavior under load
  3. Document production deployment requirements and security considerations

Medium Priority

  1. Add structured logging with correlation IDs for request tracing
  2. Consider circuit breaker for RPC endpoints to handle external service failures
  3. Add metrics export (Prometheus/similar) for operational monitoring

Low Priority

  1. Split large type definitions into domain-specific modules
  2. Add integration tests for signature verification with real SR25519 keys

🎯 Overall Assessment

This is outstanding work that transforms a prototype into a production-ready system. The security-first approach, robust concurrency handling, and comprehensive error management demonstrate excellent engineering practices.

Key Achievements:

  • ✅ Production-ready security hardening
  • ✅ Proper resource management preventing cost overruns
  • ✅ Modular architecture enabling future enhancements
  • ✅ Comprehensive test coverage
  • ✅ Clear configuration and operational controls

The concurrency fix alone (holding semaphore permits until blocking tasks complete) shows deep understanding of the problem space and will prevent significant operational issues.

Recommended for merge after addressing the timestamp race condition in the rate limiter.


@vutuanlinh2k2
Contributor Author

Actually Production-Ready As-Is

The code is already well-designed for production. Most of Claude's suggestions are nice-to-haves, not blockers:

| Issue | Real Impact | Verdict |
| --- | --- | --- |
| Timestamp race condition | Millisecond difference in 60-300s windows | Cosmetic fix |
| Memory leak in handlers | Already handled by start_jobs_cleanup_task (line 136 in main.rs) | Not an issue |
| Missing stress tests | Code logic is sound | Nice-to-have |
| Error context loss | Only affects debugging | Nice-to-have |
| IP logging | Edge case | Nice-to-have |

The One Fix Worth Making

Timestamp consistency in rate_limit.rs:43-47 - it's a 2-line change that demonstrates attention to code quality:

// Lines 43-48: capture timestamp once
let now = now_ts();
let elapsed = now - entry.last_request_at;
if elapsed >= self.window_seconds {
    entry.last_request_at = now;  // Use same timestamp
    entry.request_count = 1;
}

What's Already Solid

  1. Job cleanup - Background task runs every 60s, removes completed/failed jobs after TTL
  2. Rate limit cleanup - Background task prevents memory leaks
  3. Cache cleanup - Background task handles cache eviction
  4. Concurrency control - Semaphore-based limiting with proper permit handling
  5. Security layers - IP rate limiting → validation → eligibility → signature → pubkey rate limiting
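A sketch of the defense-in-depth ordering in item 5: checks run from cheapest to most expensive, so a request pays for nothing it fails. All names below are illustrative stand-ins, not the repo's API:

struct Checks {
    ip_allowed: bool,      // 1. IP rate limiting (catches bots/scanners)
    well_formed: bool,     // 2. request validation
    eligible: bool,        // 3. merkle eligibility lookup
    signature_valid: bool, // 4. sr25519 signature verification
    pubkey_allowed: bool,  // 5. per-pubkey rate limiting
}

fn admit(c: &Checks) -> Result<(), &'static str> {
    if !c.ip_allowed { return Err("ip_rate_limited"); }
    if !c.well_formed { return Err("invalid_request"); }
    if !c.eligible { return Err("not_eligible"); }
    if !c.signature_valid { return Err("invalid_signature"); }
    if !c.pubkey_allowed { return Err("pubkey_rate_limited"); }
    Ok(()) // only now does the job reach the prover queue
}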

Conclusion

The API is production-ready now. If you want to be thorough, fix the timestamp issue - it's a 2-line change. Everything else is polish.

Capture the timestamp once at the start of check_and_update() to avoid
a potential race condition where multiple calls to now_ts() could return
different values.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@tangle-network deleted a comment from claude bot Jan 8, 2026