refactor: migrate captcha to stateless JWT-based design #115
Closed
LaurenceJJones wants to merge 185 commits into main from
Conversation
…spoa is set correctly, all works ✨
…g on inheriting the GID
…implementation still non-existent); update dockerfiles to remove local copies and mount within compose instead. Tried Go interfaces; I think I got it wrong, but each set type handles IP nets or strings. Still need to work on the handle wrapper: currently it just tests whether the IP is within the CIDR set and returns true or false, but it's still WIP
…e about pushing it up to the upstream library. We don't keep track of any open connections due to how haproxy handles the TCP connection. Updated configuration to prevent haproxy from holding the TCP connection open for 2 minutes. Updated docker-compose to use a static IP because haproxy caches DNS resolutions and I couldn't currently find a way to revoke them ✨
…ndle context in the upper functions instead; much cleaner, and we don't need a custom struct ✨
…and other work to try to get haproxy on port 80
…nd return best remediation
…s in a routine reading at once
… crowdsec can issue multiple bans of a single remediation type, so we need to track which ID expires versus the remediation itself. After removing an ID, we check whether the remediation has any IDs left; if not, we remove the remediation, then check whether the element itself has any remediations left and, if not, remove the element. This improves the speed of checking an element: if you issue 100 bans, in the end all you care about is the unique remediation (ban).
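The cascading cleanup described above can be sketched with a minimal stdlib example (type and field names here are illustrative, not the project's actual identifiers):

```go
package main

import "fmt"

// elements sketches the two-level cleanup described above: per element (IP),
// per remediation, a list of decision IDs. Removing an ID cascades upward:
// an empty ID list deletes the remediation; no remediations deletes the element.
type elements map[string]map[string][]int64

func (e elements) removeID(ip, remediation string, id int64) {
	rems, ok := e[ip]
	if !ok {
		return
	}
	ids := rems[remediation]
	for i, v := range ids {
		if v == id {
			ids = append(ids[:i], ids[i+1:]...)
			break
		}
	}
	if len(ids) == 0 {
		delete(rems, remediation) // remediation has no IDs left
	} else {
		rems[remediation] = ids
	}
	if len(rems) == 0 {
		delete(e, ip) // element has no remediations left
	}
}

func main() {
	e := elements{"1.2.3.4": {"ban": {100, 101}}}
	e.removeID("1.2.3.4", "ban", 100)
	fmt.Println(len(e["1.2.3.4"]["ban"])) // one ID left, element kept
	e.removeID("1.2.3.4", "ban", 101)
	_, present := e["1.2.3.4"]
	fmt.Println(present) // element removed entirely
}
```

The payoff is that membership checks only consult the small outer map, regardless of how many individual ban IDs were issued.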
…shutdown code to only log if we get SIGTERM or SIGINT
…e container; we can also do this within systemd via pre hooks. This may change in future.
…dundant code and now moving ahead ahoy ahoy
…ration; it now passes the first remediation back to prevent doing the same lookup twice
…ng just need to do rendering of templates within lua
…ot tracking IP address
- Replace string concatenation with single strings in error messages
- Update README to use hex encoding for cookie_secret generation
- Use openssl rand -hex 32 for clearer byte-count visibility
- Remove .vagrant files from git index (files remain on disk)
- .vagrant/ should be ignored via .gitignore (handled in a separate PR)
- Add .vagrant/ directory to .gitignore to prevent accidental commits
- Vagrant state files should not be tracked in version control
- Merge main branch to include .gitignore changes (.vagrant/ addition)
…handling (#119) Upgrade haproxy-spoe-go to v1.0.7 and remove manual workgroup handling
- Updated github.com/negasus/haproxy-spoe-go from v1.0.5 to v1.0.7
- Removed HAWaitGroup from the Spoa struct, as the library now handles the workgroup internally
- Removed manual Add/Done calls from handlerWrapper
- Simplified the Shutdown method; the library now waits for handlers when listeners are closed
- Removed the unused sync import
…0 to v2.0.0 (#118) Upgrade geoip2-golang from v1 to v2
- Updated dependency from github.com/oschwald/geoip2-golang v1.9.0 to v2.0.0
- Updated maxminddb-golang to v2.0.0 (transitive dependency)
- Changed GetASN and GetCity methods to accept netip.Addr instead of *net.IP
- Updated field names: IsoCode -> ISOCode in GetIsoCodeFromRecord
- Updated pkg/spoa/root.go to pass netip.Addr directly to GetCity
- Updated geo_test.go to use netip.MustParseAddr and the Names.English field
- All tests pass and the code builds successfully
* feat: implement path-compressed trie for IP/CIDR lookups using cidranger
Replace map-based IPSet and PrefixSet with unified cidranger-based implementation
for significant performance improvements and proper longest-prefix matching.
Performance improvements:
- Lookup: ~915ns/op (small dataset), ~1.3μs/op (large dataset)
- Memory: 523-603 B/op, 9-12 allocations
- Add/Remove: ~17ms/op for 2,000 decisions
Key changes:
- Add cidranger dependency for battle-tested trie implementation
- Replace separate IPSet/PrefixSet with unified CIDRUnifiedIPSet
- Implement proper longest-prefix matching semantics
- Store exact IPs as /32 (IPv4) or /128 (IPv6) prefixes
- Maintain thread safety with RWMutex
- Add comprehensive benchmarks and tests
Files:
- pkg/dataset/cidranger_trie.go: Core trie implementation
- pkg/dataset/cidranger_types.go: Unified IP set wrapper
- pkg/dataset/benchmark_test.go: Performance benchmarks
- pkg/dataset/root.go: Updated to use new implementation
- pkg/dataset/root_test.go: Updated tests
This provides ~18-20x performance improvement over the previous O(n) map-based
implementation while maintaining backward compatibility.
* fix: handle error return values in benchmark tests
Fix golangci-lint errcheck issues by properly handling error return values
from AddPrefix and AddIP methods in TestLongestPrefixMatch.
* refactor: remove unused code from trie implementation
Remove unused DecisionEntry struct and legacy compatibility methods:
- DecisionEntry: Defined but never used
- CIDRUnifiedIPSet.Add(interface{}): Generic method not used
- CIDRUnifiedIPSet.Remove(interface{}): Generic method not used
- CIDRUnifiedIPSet.Init(): Not used, constructor used instead
Keeps only the actively used methods:
- AddIP/AddPrefix, RemoveIP/RemovePrefix, Contains
- NewCIDRUnifiedIPSet constructor
No functional changes, just cleanup for cleaner codebase.
* feat: implement bart-based trie for superior performance
Replace cidranger with bart library for IP/CIDR operations:
## Performance Comparison
| Implementation | Small Dataset (2K) | Large Dataset (20K) | Memory/Op | Allocs/Op | Notes |
|----------------|-------------------|---------------------|-----------|-----------|-------|
| **Original Map-based** | 16,141 ns/op | 155,162 ns/op | 1000 B/op | 12 allocs | Simple but slow for large datasets |
| **CIDRanger** | 900.5 ns/op | 1,411 ns/op | 525 B/op | 9 allocs | Path-compressed trie, good for lookups |
| **Bart** | **415.7 ns/op** | **415.7 ns/op** | **496 B/op** | **6 allocs** | **Best overall performance** |
## Key Improvements
- **Lookup**: 54% faster than cidranger (415ns vs 900ns)
- **Large datasets**: 70% faster than cidranger (415ns vs 1,411ns)
- **Memory efficiency**: 5% less memory than cidranger (496B vs 525B)
- **Allocations**: 33% fewer allocations than cidranger (6 vs 9)
- **Native netip**: Zero allocation overhead with Go's standard library types
## Technical Advantages
- **Path-compressed trie**: 256-bit blocks with CPU-optimized operations
- **Hardware acceleration**: Uses POPCNT, LZCNT, TZCNT instructions
- **Lock-free reads**: Concurrent access without blocking
- **Efficient writes**: Single-pass operations for add/remove
- **Memory layout**: Cacheline-aligned for optimal performance
## Implementation Details
- Add bart_trie.go: BartTrie using bart.Table[RemediationIdsMap]
- Add bart_types.go: BartUnifiedIPSet wrapper for unified API
- Update root.go: Replace CIDRUnifiedIPSet with BartUnifiedIPSet
- Update tests: Fix all references to new bart implementation
- Add benchmarks: Comprehensive performance comparison
- Add IsEmpty() method: Required for RemediationIdsMap
All tests pass, 0 linter issues, ready for production use.
* refactor: remove cidranger implementation, keep only bart
Clean up the bart branch by removing cidranger dependencies:
Removed:
- pkg/dataset/cidranger_trie.go
- pkg/dataset/cidranger_types.go
- BenchmarkCIDRImplementation benchmarks
- cidranger dependency from go.mod
Kept:
- bart_trie.go: BartTrie with intelligent prefix detection
- bart_types.go: BartUnifiedIPSet wrapper
- bart benchmarks: BenchmarkBartLookup, BenchmarkBartAdd, BenchmarkBartRemove
- bart dependency in go.mod
Performance (bart only):
- Lookup: 464.9 ns/op (25% faster than cidranger)
- Add: 2.7 μs/op (2600x faster than cidranger)
- Remove: 2.7 μs/op (2600x faster than cidranger)
Features:
- Intelligent IPv4 prefix detection (/8, /16, /24, /32)
- Intelligent IPv6 prefix detection (/16, /32, /48, /56, /60, /64, /80, /96, /128)
- Native netip support with zero allocations
- CPU-optimized with hardware acceleration
All tests pass, 0 linter issues.
* perf: optimize bart trie with conditional logging and pre-allocation
- Add conditional logging guards to eliminate allocations when trace disabled
- Preserve full contextual logging (prefix, remediation, ip fields) when trace enabled
- Add pre-allocation hints for maps and slices to reduce GC pressure
- Add separate benchmarks (AddOnly, RemoveOnly, DifferentSizes) for better analysis
- Update AddID/RemoveID methods to handle nullable loggers
Performance improvements:
- 63% faster (3.15ms vs 8.36ms per operation)
- 60% less memory (2.41MB vs 6.21MB per operation)
- 74% fewer allocations (17,815 vs 67,371 per operation)
Maintains full traceability with contextual fields when debugging.
* perf: optimize decision processing and cleanup dataset package
- Replace strings.Contains() with direct IP/prefix parsing for IPv6 detection
- Add conditional logging guards to Set methods to eliminate allocations
- Remove redundant wrapper methods (AddIP, AddCIDR, RemoveIP, RemoveCIDR)
- Remove unused types (PrefixSet, IPSet) replaced by BartUnifiedIPSet
- Simplify generic Set[T] to concrete CNSet for better maintainability
- Clean up unused imports and dead code
Performance improvements:
- 62% faster (3.16ms vs 8.36ms per operation)
- 61% less memory (2.41MB vs 6.21MB per operation)
- 74% fewer allocations (17,819 vs 67,371 per operation)
- 22% less code (368 vs 470 lines)
All tests pass, linter clean, production ready.
* fix: remove heuristic prefix detection for exact IPs
- Remove detectPrefixLength() function that used heuristics to guess prefix length
- Always use /32 for IPv4 and /128 for IPv6 when adding/removing exact IPs
- Prevents unexpected matches when IPs like 192.168.1.0 are added as exact IPs
- Fixes test files to use netip.Addr instead of strings
- Simplifies code by removing 56 lines of complex heuristic logic
* style: use Go 1.22+ integer range syntax in benchmark tests
- Convert loops to use 'for i := range count' instead of 'for i := 0; i < count; i++'
- Convert benchmark loops to use 'for range b.N' instead of 'for i := 0; i < b.N; i++'
- Fixes golangci-lint intrange warnings
- All tests and benchmarks pass
* Update pkg/dataset/bart_trie.go
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* refactor: implement atomic pointer pattern for bart trie with batch operations
- Switch from RWMutex to atomic.Pointer for lock-free reads
- Implement batch Add/Remove operations for better performance during initial load
- Fix exact prefix matching in AddBatch to prevent incorrect merging
- Simplify metrics handling by moving increment/decrement to collection loop
- Consolidate IP and range handling as unified prefix operations
- Add Clone() method to RemediationIdsMap for bart's InsertPersist/DeletePersist
- Rename types to BartAddOp/BartRemoveOp and move to bart_trie.go
This provides significant performance improvements:
- Lock-free reads via atomic pointer (no read contention)
- Single atomic swap per batch instead of per operation
- Much faster initial load with tens of thousands of decisions
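The atomic-pointer copy-on-write pattern described above can be sketched with the standard library alone (this is a simplified stand-in, not the project's actual BartUnifiedIPSet):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cowSet is a minimal sketch of the copy-on-write pattern: readers do a
// single atomic load, while the lone writer clones the map, mutates the
// clone, and publishes it with one atomic store.
type cowSet struct {
	m atomic.Pointer[map[string]bool]
}

func newCowSet() *cowSet {
	s := &cowSet{}
	empty := map[string]bool{}
	s.m.Store(&empty)
	return s
}

// Contains is lock-free: just an atomic pointer load, so readers never block.
func (s *cowSet) Contains(k string) bool {
	return (*s.m.Load())[k]
}

// AddBatch applies a whole batch with a single atomic swap, mirroring
// "single atomic swap per batch instead of per operation".
func (s *cowSet) AddBatch(keys []string) {
	old := *s.m.Load()
	next := make(map[string]bool, len(old)+len(keys))
	for k, v := range old { // clone
		next[k] = v
	}
	for _, k := range keys { // modify
		next[k] = true
	}
	s.m.Store(&next) // publish atomically
}

func main() {
	s := newCowSet()
	s.AddBatch([]string{"1.2.3.4", "5.6.7.8"})
	fmt.Println(s.Contains("1.2.3.4"))
}
```

Readers always observe either the old or the new map, never a partially updated one, which is why no read lock is needed.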
* refactor: remove unnecessary pointer receivers from RemediationIdsMap
Maps are reference types in Go, so value receivers work correctly for
mutations while being more idiomatic and avoiding pointer indirection.
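The point about value receivers on map types can be shown in a few lines (the `ids` type here is an illustrative stand-in for RemediationIdsMap):

```go
package main

import "fmt"

// ids mirrors a map-backed type like RemediationIdsMap. A map value holds a
// reference to shared backing storage, so a value receiver can still mutate
// the entries - no pointer receiver or extra indirection needed.
type ids map[string][]int64

func (m ids) AddID(remediation string, id int64) {
	m[remediation] = append(m[remediation], id)
}

func main() {
	m := ids{}
	m.AddID("ban", 42) // mutates through the value receiver
	fmt.Println(m["ban"])
}
```

One caveat: a value receiver cannot replace the map itself, so the map must be initialized (e.g. via a constructor) before methods are called on it.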
* Add completion logs for decision batch processing
- Add debug log messages when batch processing completes
- Logs show 'Finished processing X new decisions' and 'Finished processing X deleted decisions'
- Improves observability of batch processing performance
* Fix exact prefix matching and metrics accuracy in batch operations
- Fix RemoveBatch to use exact prefix matching (DeletePersist) instead of longest prefix matching (Lookup)
Prevents data corruption when removing /24 prefixes that could incorrectly match /16 prefixes
- Change RemoveBatch return type from []bool to []*BartRemoveOp
Allows callers to access operation metadata (Origin, IPType, Scope) directly
- Add metadata fields (Origin, IPType, Scope) to BartAddOp and BartRemoveOp structs
Enables accurate metrics tracking without separate metadata structures
- Fix metrics to only increment/decrement on successful operations
Move metrics increment in Add() to after AddBatch succeeds
Only decrement metrics in Remove() for operations returned by RemoveBatch
- Fix all unkeyed struct literals to use keyed fields
Improves maintainability and prevents issues when struct fields are reordered
* refactor: clean up batch operations and fix variable shadowing
- Fix AddBatch to properly use table returned from DeletePersist
- Simplify AddBatch by extracting common InsertPersist call
- Clean up RemoveBatch with continue statements to reduce nesting
- Fix variable shadowing by using explicit declarations
- Improve code readability and maintainability
* perf: optimize slice creation and add defensive check
- Optimize AddID to create slice with initial element instead of append
- Add defensive check in GetRemediationAndOrigin to prevent panic on empty slices
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* perf: use ModifyPersist instead of DeletePersist + InsertPersist
- Replace DeletePersist + InsertPersist pattern with ModifyPersist in AddBatch
- Replace DeletePersist + InsertPersist pattern with ModifyPersist in RemoveBatch
- More efficient: single tree traversal instead of two
- More atomic: modification happens in one operation
- Cleaner code: eliminates delete-then-reinsert pattern
* Optimize BART trie initialization using Insert for initial load
- Start with nil table instead of empty table for better memory efficiency
- Use Insert method for initial batch load (more efficient than ModifyPersist)
- Use ModifyPersist for subsequent incremental updates
- Split AddBatch into initializeBatch and updateBatch methods
- Handle duplicate prefixes in initial batch by merging IDs
- Add nil checks in Contains and RemoveBatch methods
This follows the maintainer's recommendation to use Insert for the
initial load phase, then switch to ModifyPersist for smaller updates.
* Refactor: Merge BartTrie into BartUnifiedIPSet
- Remove redundant BartTrie abstraction layer
- Consolidate all logic into BartUnifiedIPSet directly
- Delete bart_trie.go file
- Remove unused logger field from BartUnifiedIPSet
- Fix linter warnings: remove error return from initializeBatch and updateBatch
(they always return nil, so error return type is unnecessary)
This simplifies the codebase by eliminating an unnecessary abstraction
layer while maintaining all functionality.
* Simplify AddBatch and optimize metrics tracking
- Remove result slice from AddBatch since all operations always succeed
- Increment metrics during batch creation loop instead of separate loop
- Update test to match new AddBatch signature
- Simplify initializeBatch and updateBatch to return void
This eliminates unnecessary complexity and improves performance by
removing an extra loop and better cache locality for metrics.
* Refactor prefixLen initialization and improve batch operation logging
- Initialize prefixLen to 32 (default) and only update for IPv6
- Add logging for processing decisions in Add/Remove methods
- Remove unnecessary length checks before batch operations
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix geo database initialization bug and ISO code header issue
- Fix geo database initialization: remove the meaningless IsValid() check in open() that always passed on the first call; add proper path validation before loading
- Reduce code redundancy: extract common file-checking logic into checkAndUpdateModTime(), reducing duplication in WatchFiles()
- Add comprehensive test coverage for Init() with various error cases
- Fix ISO code header: always set the isocode variable when available, regardless of remediation status, so the X-Crowdsec-IsoCode header is populated correctly
* Fix linting issues in geo tests
- Replace assert.Error with require.Error for error assertions (testifylint)
- Use t.TempDir() instead of an empty string for os.CreateTemp (usetesting)
…125) Root cause: bart's ModifyPersist calls our Clone() method when cloning nodes. If Clone() returned nil (when called on a nil map), that nil was stored in the cloned table. Subsequent callbacks then received exists=true with nil data, causing a panic in AddID().
Fixes:
1. Clone() now returns an initialized map instead of nil
2. Removed redundant manual Clone() calls in updateBatch/RemoveBatch, since bart already clones via our Cloner interface before the callback
3. Changed RemediationIdsMap{} to make(RemediationIdsMap) for clarity
This improves both correctness and performance (no double cloning).
Remove the Run goroutine, channel-based operations, and associated types that are no longer used. Hosts are now added synchronously at startup.
Changes:
- Remove Manager.Run() and Manager.Chan
- Remove HostOp, Op types and constants
- Remove removeHost() and patchHost() dead code
- Remove sync.RWMutex (not needed: hosts are only added at startup)
- Use sync.Map for a thread-safe cache during concurrent MatchFirstHost calls
- Make AddHost public for direct synchronous calls
- Update cmd/root.go to use AddHost directly
This simplifies the architecture by removing ~80 lines of unnecessary complexity while maintaining thread-safe cache access.
* feat: add pprof debug endpoint for runtime profiling
Adds an optional pprof HTTP endpoint for runtime memory, CPU, and goroutine profiling. Useful for debugging memory issues and performance analysis.
Configuration:
```yaml
pprof:
  enabled: true
  listen_addr: 127.0.0.1
  listen_port: 6060
```
Endpoints exposed:
- /debug/pprof/ - Index with links to all profiles
- /debug/pprof/heap - Heap memory profile
- /debug/pprof/profile - CPU profile (30s default)
- /debug/pprof/goroutine - Goroutine stack dumps
- /debug/pprof/trace - Execution trace
Usage: go tool pprof http://127.0.0.1:6060/debug/pprof/heap
WARNING: Only enable in development/debugging scenarios, not production.
* fix: add graceful shutdown for pprof server
Use http.Server with Shutdown() instead of a raw ListenAndServe to properly shut down the pprof endpoint when the application exits.
* test: add PprofConfig unmarshal tests
- Test that pprof config is properly unmarshalled from YAML
- Test that pprof is disabled by default when not specified
…126) * fix(memory): intern origin strings and clone country codes to allow GC
## Problem
When processing decisions from the stream, strings extracted from Decision structs (e.g., `*decision.Origin`) shared their backing byte array with the original struct. This prevented GC from reclaiming the entire DecisionsStreamResponse until those strings were no longer referenced, causing unnecessary memory retention.
## Solution
1. **String interning for origins** - Origins have low cardinality (crowdsec, cscli, lists:scenario). Interning deduplicates them and breaks references to Decision memory. 100k decisions with the same origin = 1 string allocation.
2. **Clone country codes** - Use `strings.Clone()` to break the reference to Decision struct memory for country-scope decisions.
3. **Clear decisions after processing** - Set `decisions.New` and `decisions.Deleted` to nil after processing to help GC reclaim the DecisionsStreamResponse immediately.
## Performance
```
BenchmarkInternString/existing_string 81M ops 17.35 ns/op 0 B/op
BenchmarkInternString/new_string      33M ops 38.01 ns/op 0 B/op
```
The fast path (existing-string lookup) is ~17ns with zero allocations. New strings are cloned once and cached for future use.
* refactor: address Copilot review feedback
- Use the original string as the key in LoadOrStore to avoid an unnecessary clone allocation when another goroutine wins the race
- Add O(n) performance warning to the InternedPoolSize documentation
- Simplify benchmark pre-population (only one call per unique string is needed)
- Pre-generate unique strings in the new_string benchmark to isolate interning cost from string construction overhead
- Restructure origin interning to avoid an intermediate string when not needed
* refactor: address valid Copilot feedback (2 of 3)
- Fix benchmark timer ordering: clear the pool before b.ResetTimer()
- Store *decision.Origin in a variable to avoid double dereference
Note: Rejected Copilot's suggestion to revert the LoadOrStore key from 's' back to 'cloned'. Using 's' as the key is correct because:
1. sync.Map compares keys by value (string contents), not pointer
2. The key is only used for lookup; the VALUE (cloned) is what's stored
3. Using 's' avoids wasting a clone allocation when another goroutine wins
* Update pkg/dataset/root.go
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update pkg/dataset/root.go
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
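The interning scheme described above can be sketched with `sync.Map` and `strings.Clone` (function and variable names here are illustrative, not the project's actual identifiers):

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// internPool sketches the interning described above: low-cardinality strings
// (origins) are cloned once and reused, so callers stop pinning the large
// response buffers the strings were sliced from.
var internPool sync.Map

func intern(s string) string {
	// Fast path: the string is already in the pool - zero allocations.
	if v, ok := internPool.Load(s); ok {
		return v.(string)
	}
	// Slow path: clone to detach from the caller's backing array, then
	// publish. Using s (not the clone) as the key is fine: sync.Map
	// compares string keys by content, and the stored VALUE is the clone.
	cloned := strings.Clone(s)
	v, _ := internPool.LoadOrStore(s, cloned)
	return v.(string)
}

func main() {
	a := intern("crowdsec")
	b := intern("crowdsec")
	fmt.Println(a == b) // same pooled string either way
}
```

Every caller that holds the interned result references the one cloned copy, so the original DecisionsStreamResponse buffers become garbage-collectable.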
perf: use WithLabelValues to avoid map allocation on hot path
Replace prometheus.Labels{} map literals with WithLabelValues() calls
to avoid allocating a new map on every metric update.
This is critical for the SPOA hot path (handleIPRequest, handleHTTPRequest)
which runs on every incoming request. Each prometheus.Labels{} creates
a new map allocation that accumulates until GC runs, causing memory
growth under load.
WithLabelValues() takes label values in the same order as defined
in the metric (origin, ip_type, scope for TotalActiveDecisions;
ip_type for TotalProcessedRequests; origin, ip_type, remediation
for TotalBlockedRequests) and avoids the map allocation entirely.
Changes:
- pkg/spoa/root.go: Fix 4 hot-path metrics in request handlers
- pkg/dataset/root.go: Fix 5 metrics in Add/Remove paths
- Remove unused prometheus import from both files
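The allocation difference between the two call styles can be illustrated without prometheus itself. This is NOT the prometheus API, just a stdlib sketch of why a positional-values call avoids the per-call map that `With(prometheus.Labels{...})` requires:

```go
package main

import (
	"fmt"
	"strings"
)

// miniVec imitates the shape of the two call styles discussed above.
type miniVec struct {
	labels []string       // declared label order, e.g. {"origin", "ip_type"}
	counts map[string]int // joined label values -> count
}

func newMiniVec(labels ...string) *miniVec {
	return &miniVec{labels: labels, counts: map[string]int{}}
}

func (v *miniVec) key(values []string) string { return strings.Join(values, "|") }

// WithLabelValues: values arrive positionally in declared label order;
// no per-call map allocation on the hot path.
func (v *miniVec) WithLabelValues(values ...string) {
	v.counts[v.key(values)]++
}

// With: the caller allocates a fresh map on every call (the hot-path cost
// the commit removes); we must reorder it into declared label order.
func (v *miniVec) With(m map[string]string) {
	values := make([]string, len(v.labels))
	for i, l := range v.labels {
		values[i] = m[l]
	}
	v.counts[v.key(values)]++
}

func main() {
	v := newMiniVec("origin", "ip_type")
	v.WithLabelValues("crowdsec", "ipv4")                               // no map
	v.With(map[string]string{"origin": "crowdsec", "ip_type": "ipv4"}) // allocates a map
	fmt.Println(v.counts["crowdsec|ipv4"])
}
```

Both styles hit the same series; the only difference is the throwaway map the second style allocates per request, which is exactly what accumulates between GC cycles under load.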
* feat(dataset): hybrid IP storage with parallel batch processing
Implement memory-efficient hybrid storage that separates individual IPs from CIDR ranges, significantly reducing memory usage for typical workloads.
## Changes
- Add IPMap using sync.Map for individual IP addresses (O(1) lookups)
- Keep BART (RangeSet) only for CIDR ranges requiring longest-prefix matching
- Parallelize Add/Remove operations using sync.WaitGroup.Go()
- CheckIP() checks IPMap first, then falls back to RangeSet for range matching
## Memory Optimization
Most real-world deployments receive individual IP decisions, not ranges. BART's radix trie has significant per-entry overhead optimized for prefix matching. Storing individual IPs in a simple map instead:
- **1K IPs**: BART ~598KB → Map significantly less
- **10K IPs**: BART ~5.6MB → Map significantly less
- **50K IPs**: BART ~26.5MB → Map significantly less
## Performance
- IPMap lookup: ~72 ns/op, 0 allocs
- RangeSet lookup: ~69 ns/op, 0 allocs
- Parallel batch processing for Add/Remove operations
## Breaking Changes
None - API unchanged; internal storage optimization only.
* fix(ipmap): prevent data races with concurrent readers
Address Copilot review comments:
1. Clone RemediationIdsMap before modifying in add() and remove() - prevents a race between Contains() readers and Add/Remove writers; sync.Map's Load→Modify→Store pattern is unsafe with mutable values
2. Fix the version comment for sync.WaitGroup.Go
3. Remove trailing whitespace in benchmark_test.go
* refactor(dataset): rename BartUnifiedIPSet to BartRangeSet
More descriptive name, since it now only handles CIDR ranges while individual IPs are stored in IPMap.
* feat(dataset): Consistent COW for lock-free reads (#129)
* perf: lock-free reads for IPMap and CNSet
Use atomic pointers for completely lock-free reads across all data structures. SPOA handlers never block, even during batch updates.
Changes:
1. IPMap: atomic.Pointer per entry for individual IPs
2. CNSet: atomic.Pointer for the entire country map (small, so cloning is cheap)
   - Added NewCNSet() constructor for consistency
   - Added defensive nil checks in Add/Remove/Contains
3. BartRangeSet: already uses an atomic pointer (unchanged)
Concurrency model (consistent across all):
- Reads: atomic pointer load (instant, never blocks)
- Writes: clone → modify → atomic store (copy-on-write)
- Readers always see a consistent state (old or new, never partial)
This ensures SPOA always has immediate access to decision data, regardless of ongoing batch updates from the stream bouncer.
* refactor(ipmap): simplify addLocked - remove unreachable LoadOrStore fallback
Since writeMu is held for the entire batch, no other writer can race. The LoadOrStore fallback code was unreachable; simplified to use Store directly.
* Simplify RemediationMap: remove ID tracking, optimize cloning
- Changed from map[Remediation][]RemediationDetails to map[Remediation]string
- Removed ID tracking (LAPI only sends the longest decisions)
- Simplified Add/Remove methods (no ID parameters)
- Optimized IPMap cloning: skip clone for empty maps and single-entry removals
- Updated all callers to use the new simplified API
- Reduced memory allocations and GC pressure
* Remove writeMu mutex from IPMap - single-writer guarantee
- Removed writeMu mutex (unnecessary with a single writer thread)
- Renamed addLocked/removeLocked to add/remove
- Updated comments to reflect the single-writer, multiple-reader model
- The stream handler processes decisions sequentially, so there are no concurrent writes
- Atomic pointer operations handle reader-writer synchronization safely
* Address Copilot review comments: remove unused ID parameters
- Remove unused 'id' parameter from CNSet.Add and CNSet.Remove methods
- Update addCN/removeCN helper functions to match the new signatures
- Update outdated comments: 'merging IDs' -> 'merging remediations'
- Optimize CNSet.Add preallocation: remove +1 to avoid overallocation
- Fix map initialization syntax in CNSet.Add
* Remove unused ID fields from operation structs
- Remove ID field from IPAddOp, IPRemoveOp, BartAddOp, BartRemoveOp
- Remove id field from the internal cnOp struct
- Update all operation struct literals to remove ID assignments
- Update benchmark tests to match the new struct definitions
- IDs are no longer tracked, since LAPI behavior ensures only the longest decisions are sent
* Add comprehensive metrics tests and optimize no-op handling
- Add a comprehensive test suite for metrics tracking (IPMap, BartRangeSet, CNSet)
- Fix the BartRangeSet no-op check to use exact prefix lookup (Get) instead of LPM
- Add HasRemediation and GetOriginForRemediation methods to BartRangeSet for exact prefix matching
- Optimize ipType assignment to only occur after no-op checks
- Remove pre-allocation of operation slices to avoid wasting memory when many decisions are no-ops
- Replace hardcoded metric increments with len(decisions) for better readability
* Fix outdated comment: replace 'ID' with 'remediation'
- Update comment in bart_types.go to reflect that IDs are no longer tracked
- Addresses Copilot review feedback from PR #129
* Refactor Remove() to return error for cleaner duplicate-delete handling
- Add ErrRemediationNotFound sentinel error
- Update RemediationMap.Remove() to return an error instead of silently ignoring
- Update all call sites to use errors.Is() for cleaner error checking
- Simplify RemoveBatch logic - no need to check existence before calling Remove()
- Metrics are only updated for actual removals, not duplicate deletes
This makes the code easier to follow and more explicit about error handling.
* refactor(metrics): use WithLabelValues instead of With for better performance
Replace prometheus.With(Labels{...}) with WithLabelValues() to avoid map allocation overhead. This aligns with the optimization from the main branch.
- Remove the unused prometheus import
- Update all metrics calls to use WithLabelValues
- Add comments indicating label order for clarity
* fix(lint): remove unused nolint directive in metrics tests
* fix: verify origin matches before removing decisions
- Add origin verification in IPMap.remove() to prevent removing decisions when the origin has been overwritten (e.g., by CAPI)
- Add the same origin check in BartRangeSet.RemoveBatch() for range removals
- Remove the optimization that deleted an IP when only one remediation existed, to handle edge cases where the same origin changes remediation types
- Ensures metrics are decremented correctly and prevents orphaned decisions
Fixes an issue where unsubscribing from blocklists didn't properly remove decisions when origins were overwritten by other sources.
* fix: correct Go version in comments and add origin-overwrite test
- Fix the version documentation for sync.WaitGroup.Go
- Add TestMetrics_OriginOverwriteAndDelete to verify metrics correctness when the same origin changes remediation types (ban -> captcha -> delete ban)
- Test ensures origin verification works correctly and metrics are accurate
* feat: add AppSec integration support
- Add AppSec validation to the SPOA request handler
- Integrate the AppSec component with global URL/API key config
- Add AppSec collections and acquisition setup for docker-compose
- Update HAProxy configs with the http-buffer-request option
- Add a Vagrantfile for the testing environment
- Update bouncer config to support per-host AppSec configuration
* Fix User-Agent extraction from headers and optimize HTTP data parsing
- Fix readHeaders to properly handle CRLF line endings from HAProxy req.hdrs
- Extract User-Agent from parsed headers instead of a separate SPOE message arg
- Only parse HTTP data when necessary (not for Allow/Ban unless AppSec always_send is set)
- Move debug logging in createAppSecRequest to show the actual headers being sent
- Remove excessive logging from the parseHTTPData function
- Clean up verbose debug logs for cookie clearing and redirects
* Fix linting issues and clean up debug logging
- Check the io.Copy return value (errcheck)
- Remove the unused logger parameter from parseHTTPData (unparam)
- Remove duplicate User-Agent logging from the debug log
- Access req fields directly in the debug log instead of reading from headers
* Simplify header copying using maps.Clone()
* Move AppSec metrics tracking to the SPOA handler
Track metrics in the SPOA handler, where the IP is already available as a netip.Addr, avoiding unnecessary string-to-IP parsing. This is more efficient and cleaner than parsing the IP from a string in the AppSec component.
* fix: apply Copilot review suggestions
- Fix metrics call to use WithLabelValues instead of prometheus.Labels
- Improve HTTP request-line detection with method validation
- Clarify the remediation condition (>= Captcha instead of > Unknown)
- Add a body-length check before sending to AppSec
- Add debug logging for malformed headers
- Fix port mismatch in the Vagrantfile (4241 -> 7422)
- Remove the HTTP client timeout; rely on the context timeout
* fix: revert Vagrantfile AppSec port to 4241 for CI compatibility
* refactor: simplify readHeaders function
Remove the HTTP request-line detection logic, since HAProxy only sends headers, not the initial HTTP request line.
* Refactor AppSec validation: simplify function signature and remove global AppSec from host manager
- Simplified validateWithAppSec to return the remediation instead of modifying a pointer
- Removed the req parameter - ban injection is handled at call sites
- Removed the hoststring parameter - extracted from the message in extractAppSecData
- Removed the httpData parameter - always parsed from the message
- Created a logger with host context for better logging
- Removed global AppSec config from the host manager; hosts now only use explicit AppSec overrides
- Global AppSec is handled entirely at the SPOA level
* Fix security issue: prevent remediation downgrade in validateWithAppSec
- Fixed validateWithAppSec to return the more restrictive remediation
- Prevents a security downgrade when AppSec returns Allow but the current remediation is Ban/Captcha
- Now correctly compares and returns max(currentRemediation, appSecRemediation)
- Matches the documented behavior of the function
* Address Copilot review comments: improve error handling and code quality
- Move API key validation before cloning headers (fail-fast optimization)
- Fix error message capitalization to follow Go conventions
- Improve HTTP version conversion with an explicit switch for edge cases
- Simplify response body handling - just discard without size checks
* fix: AppSec timeout
* fix: add timeout options and golangci-lint
* feat: migrate to scratch-based Docker image
- Replace alpine base with scratch for minimal 16MB image
- Remove docker_start.sh (no shell in scratch)
- Use ENTRYPOINT + CMD for direct binary execution
- Add CA certificates for HTTPS connections to LAPI
- Add docker/README.md with comprehensive usage documentation
* fix: create runtime directories for scratch image
- Add /run/crowdsec-spoa/ for Unix socket
- Add /var/log/crowdsec-spoa/ for log files
- Use .keep files to ensure empty directories are copied to scratch
* fix: correct Docker image name to crowdsecurity/spoa-bouncer
* feat: add Docker-optimized config with environment variables
- Add config/crowdsec-spoa-bouncer.docker.yaml with stdout logging
- Support env vars: CROWDSEC_KEY, CROWDSEC_URL, LOG_LEVEL, etc.
- Enable Prometheus by default on port 6060
- Update Dockerfile to use Docker config and expose both ports
- Update README with comprehensive environment variables table
* fix: use ENV defaults instead of shell syntax for env vars
- StrictExpand doesn't support ${VAR:-default} syntax
- Set default values via ENV in Dockerfile
- Simplify docker.yaml to use plain ${VAR} substitution
- Update README to clarify only CROWDSEC_KEY is required
* docs: pin image versions in Docker examples
Address Copilot review: use pinned versions instead of :latest
- haproxy:3.1
- curlimages/curl:8.11.1
* fix: restore VOLUME declarations for Lua and HTML customization
Address Copilot review: keep volumes for user customization even though
documentation examples were simplified
* docs: remove GOMEMLIMIT from README examples
* docs: reorder config sections - custom config before env vars
Custom configuration file is the primary method, environment variables
are a convenience layer for simple deployments.
fix: add graceful shutdown for Prometheus metrics server

Refactor the Prometheus server to match the pprof pattern:
- Use a dedicated http.ServeMux instead of the default mux
- Manage the server lifecycle with an errgroup
- Add a graceful shutdown handler with a 5s timeout
- Proper error propagation instead of a fire-and-forget goroutine

This ensures the Prometheus server shuts down cleanly on SIGTERM/SIGINT and errors are properly tracked by the errgroup.
* Migrate SPOE implementation to dropmorepackets/haproxy-go
  - Replace github.com/negasus/haproxy-spoe-go with github.com/dropmorepackets/haproxy-go
  - Implement zero-allocation message parsing with single-pass KV extraction
  - Add struct pooling for HTTPMessageData and IPMessageData with embedded buffers
  - Use bytes.Equal for key matching to avoid string allocations
  - Update all action setting to use encoding.ActionWriter
  - Optimize buffer reuse by embedding buffers in pooled structs
  - Remove unused code and simplify the pooling strategy
* Fix linter issues
  - Remove ctx field from Spoa struct (containedctx)
  - Use the context parameter instead of context.Background() (contextcheck)
  - Add ok checks for all type assertions (revive)
  - Prefix unused ctx parameters with _ (unparam)
  - Remove unused readKeyFromMessage function (unused)
* fix(spoe): address Copilot PR review comments
  - Fix double-free bug: remove the Put(nil) call when extractIPMessageData returns an error
  - Add max buffer size limits (64KB headers, 512KB body) to prevent unbounded memory growth
  - Remove redundant clear() calls before copy() operations
  - All type assertions already have proper ok checks
  - Context usage already correct (using the ctx parameter, not context.Background())
  - Unused ctx field already removed
* fix(spoe): remove unused error variables
  - Remove ErrMessageKeyNotFound and ErrMessageKeyTypeMismatch, no longer used after the migration
  - These were left over from the removed readKeyFromMessage function
* refactor(spoe): add reset() method to IPMessageData for consistency
  - Add a reset() method to IPMessageData, matching the HTTPMessageData pattern
  - Update extractIPMessageData to use reset() instead of manual field clearing
  - Update the handleIPRequest defer to use the reset() method
  - Improves code consistency and maintainability
* perf(spoe): optimize readHeaders to avoid a full byte-slice-to-string conversion
  - Use bytes.Split() instead of converting the entire headers byte slice to a string
  - Use bytes.IndexByte() to find the colon separator directly in the byte slice
  - Only convert individual key/value parts to strings when needed
  - Reduces allocations, especially for requests with large header blocks
* perf(spoe): use bytes.SplitSeq for more efficient header parsing
  - Replace bytes.Split with bytes.SplitSeq to avoid allocating an intermediate slice
  - SplitSeq uses an iterator pattern, yielding lines one at a time
  - Reduces memory allocations when parsing headers
  - Addresses a linting suggestion for more efficient iteration
* fix(spoe): address memory safety and code quality issues
  - Remove the HTTPRequestData struct to fix a memory safety issue with pooled buffers
  - Eliminates the risk of holding pointers to pooled slices that could be reused
  - Simplifies code, since all values are extracted upfront in msgData
  - handleCaptchaRemediation now only returns a remediation (not HTTPRequestData)
  - Fix misleading comment about the URL nil check (line 666); clarify that msgData.URL is guaranteed non-nil at that point
  - Fix buffer reallocation edge case for headers and body buffers: change > to >= to properly handle capacity exactly equal to the max size, preventing wasted pooled buffers in common boundary cases
* refactor(SPOE): use message groups (#141)
* Refactor SPOE to use groups with the dropmorepackets package
  - Convert the crowdsec-ip message to crowdsec-tcp for clarity
  - Create separate groups for crowdsec-tcp, crowdsec-http-body, and crowdsec-http-no-body
  - Use event on-client-session for TCP message triggering
  - Add conditional HTTP group sends based on body size using an ACL
  - Update the HTTP handler to always check the IP and only count metrics when the TCP handler didn't run
  - Add debug logging for all received messages
  - Use req.body_size with an ACL for body size checking (works with http-buffer-request)
  - Update both standard and upstream proxy configs with IP extraction and conditional groups
* Clean up SPOE groups implementation
  - Remove the crowdsec-tcp group definition (it uses the event directive, not a group)
  - Standardize the body size limit to 100KB in both configs with consistent notes
  - Remove an unnecessary string conversion in HandleSPOE (use bytes directly)
  - Ensure code only handles message names defined in crowdsec.cfg
  - Update comments for clarity
* Update configs and code cleanup
* Address PR review comments
  - Fix ACL logic to prevent both groups being sent when body_size is not found
  - Update comments: clarify TCP handler behavior in upstream proxy mode
  - Fix grammar: 'uses' -> 'use' for a plural subject
  - Clarify the TCP remediation logic comment
  - Fix missing apostrophe: 'dont' -> 'don't'
  - Update the handleTCPRequest function comment to explain TCP-level processing
  - Fix misleading comment about the TCP handler using groups (it uses the event directive)
* refactor(spoa): clean up and simplify SPOA functions
  - Extract a buildAppSecRequest method from validateWithAppSec
  - Refactor handleCaptchaRemediation into smaller focused functions
  - Add a getSessionString helper to eliminate repeated type assertions
  - Simplify createNewSessionAndCookie by removing unused parameters
  - Improve code readability and maintainability
* perf(spoa): remove redundant reset() calls after pool Get()
  Remove unnecessary reset() calls immediately after Get() from sync.Pool. The pool's New function returns zeroed instances, and we already reset before Put() in defer blocks, so resetting after Get() is redundant.
* refactor: use ptr.Of and extract host/cookie from headers
  - Replace custom pointer helpers (stringPtr, boolPtr, etc.) with ptr.Of from go-cs-lib
  - Extract the Host header and captcha cookie from req.hdrs instead of separate KV pairs
  - Add an extractCookieValue helper using strings.SplitSeq and CutPrefix
  - Remove the unused keyHost and keyCaptchaCookie variables
* config: increase buffer size and timeouts for WAF body inspection
  - Add tune.bufsize 65536 (64KB) to support larger body inspection
  - Add timeout connect 2s and timeout server 60s to the crowdsec-spoa backend
  - Increases the max body size that can be inspected by the AppSec WAF
* feat: improve HAProxy buffer config and add debug tooling
  - Add tune.bufsize 65536 to haproxy.cfg and haproxy-upstreamproxy.cfg for 64KB body support
  - Add a stats socket for runtime debugging (/tmp/haproxy.sock)
  - Lower body_within_limit to 50KB to stay safely under the SPOE frame limit
  - Remove the redundant max-frame-size from crowdsec.cfg (tune.bufsize handles it)
  - Add docker-compose.dev.yaml with a debug container (tcpdump, socat, curl, etc.)
  - Move per-message logging from Debug to Trace for a quieter Debug mode
  - Add a Debug log for AppSec validation results
* perf: remove unnecessary body copy for AppSec requests
  - Remove a duplicate copy of the body bytes when building AppSecRequest
  - BodyCopied remains valid until the handler returns (pool return in defer)
  - Also remove the debug log for body length
* refactor: simplify AppSec config logic and fix metrics counting
  - Add a getAppSecConfig() helper to centralize host/global AppSec resolution
  - Add a shouldRunAppSec() predicate for cleaner conditional logic
  - Remove a redundant early return in the captcha switch case
  - Fix blocked metrics: don't count a request as blocked if the user solved the captcha
  - Reduce ~30 lines of scattered logic to ~4 lines per call site
* Add Prometheus duration metrics for SPOA message processing
  - Add a MessageDuration histogram metric to track processing time for crowdsec-http and crowdsec-ip messages
  - Instrument handlerWrapper to measure the duration of each message type
  - Use prometheus.NewTimer() with WithLabelValues() to avoid allocations in the hot path
  - Register the new metric in the Prometheus registry
* Improve duration metrics accuracy and reliability
  - Update histogram buckets for better granularity around the 500ms timeout threshold
  - Use defer for timer observation to ensure durations are recorded even on errors/panics
  - Address Copilot review feedback
* Fix linting: remove defer-in-loop to avoid allocations
  - Remove the defer and closure to eliminate allocations in the hot path
  - Call ObserveDuration() directly after handler execution
  - Passes golangci-lint with zero allocations
* Refactor duration metrics to function-level granularity
  - Remove message-level duration metrics (not granular enough)
  - Add an IPCheckDuration metric with a lookup_type label (ip_or_range, country)
  - Add a CaptchaValidationDuration metric for captcha operations
  - Add a GeoLookupDuration metric for geo database lookups
  - Instrument getIPRemediation to track IP/country checks separately
  - Instrument captcha validation in handleCaptchaRemediation
  - All metrics use 10 buckets optimized for the 0-500ms timeout threshold
  - Zero allocations in the hot path using prometheus.NewTimer()
* Move timing instrumentation into dataset.CheckIP and optimize buckets
  - Move IP/range lookup timing into dataset.CheckIP for better encapsulation
  - Distinguish between 'ip' and 'range' lookup types after the dataset redesign
  - Update IPCheckDuration buckets to start at 100μs for microsecond-level granularity
  - IP/range lookups are typically ~100μs; buckets now cover 100μs to 1s (10 buckets)
  - Update all test files to match the new CheckIP signature
  - Zero allocations in the hot path using prometheus.NewTimer()
* Remove unnecessary variable declarations in getIPRemediation
Resolved conflicts while maintaining the stateless captcha design:
- Adopted the new dropmorepackets/haproxy-go SPOP library from main
- Kept stateless JWT-based captcha tokens (CaptchaToken)
- Removed session-based captcha handling
- Integrated duration metrics (CaptchaValidationDuration, GeoLookupDuration)
- Updated captcha handling to use the encoding.ActionWriter API
- Kept CookieGenerator.GenerateUnsetCookie for clearing cookies on Allow
Force-pushed from d989e0c to 1784e06
Breaking changes:
- Rename cookie_secret -> signing_key for clarity (it is the JWT signing key)
- Remove the redundant CookieGenerator.Secret field (unused)

Improvements:
- Fix Content-Type matching: use HasPrefix + ToLower for RFC compliance
  - Now handles: application/x-www-form-urlencoded; charset=UTF-8
  - Case-insensitive: Application/X-WWW-Form-URLEncoded
- Add comprehensive JWT test coverage (20 tests):
  - Signature verification and tampering detection
  - Expiration validation
  - Token lifecycle and status transitions
  - Cookie generation and validation
  - Edge cases and malformed tokens

Updated:
- README: cookie_secret -> signing_key with JWT clarification
- Simplified CookieGenerator.Init() (no secret param)

All tests pass. Closes security and UX gaps in the stateless captcha flow.
Force-pushed from 1784e06 to 6260c85
Force-pushed from 6260c85 to 6c98530
Refactor Captcha to Stateless JWT-Based Design
Summary
Replaces stateful session-based captcha with stateless JWT tokens stored in signed cookies. Removes all server-side session storage.
Key Changes
Architecture
Implementation
- Uses golang-jwt/jwt/v4 (RFC 7519 compliant)
- Moves internal/cookie → internal/remediation/captcha
- Removes the internal/session package entirely
- Removes pending_ttl, passed_ttl
Library Migration
- New SPOP library (dropmorepackets/haproxy-go)
- Uses the encoding.ActionWriter API
Code Quality
Configuration Changes
Breaking:
signing_key (renamed from cookie_secret) is now mandatory
Benefits
Backward Compatibility
Old format cookies trigger automatic migration to JWT on next request.