Skip to content

Conversation

@akshayutture-augment
Copy link
Contributor

* Anonymous: Add device limiter

* break auth if limit reached

* fix typo

* refactored const to make it clearer with expiration

* anon device limit for config

---------

Co-authored-by: Eric Leijonmarck <[email protected]>
@akshayutture-augment
Copy link
Contributor Author

@claude review this PR

1 similar comment
@akshayutture-augment
Copy link
Contributor Author

@claude review this PR

@claude
Copy link

claude bot commented Nov 20, 2025

Claude finished @akshayutture-augment's task —— View job


PR Review Complete ✓

Summary

This PR adds a configurable device limit feature for anonymous access in Grafana. The implementation allows administrators to cap the number of anonymous devices that can be tracked, preventing unlimited device registration while still allowing existing devices to update their information.

Critical Issues

🔴 Race Condition in Device Limit Check

Location: pkg/services/anonymous/anonimpl/anonstore/database.go:109-117

The CreateOrUpdateDevice method has a race condition between checking the count and inserting a new device:

if s.deviceLimit > 0 {
    count, err := s.CountDevices(ctx, time.Now().UTC().Add(-anonymousDeviceExpiration), time.Now().UTC().Add(time.Minute))
    if err != nil {
        return err
    }
    
    if count >= s.deviceLimit {
        return s.updateDevice(ctx, device)
    }
}
// Between here, another goroutine could insert a device
// Time-of-check to time-of-use (TOCTOU) vulnerability

Impact: Multiple concurrent requests can bypass the device limit if they all check the count before any of them insert. For example, if the limit is 100 and current count is 99, 10 concurrent requests could all see count=99 and all proceed to insert, resulting in 109 devices.

Recommendation: Wrap the count check and insert in a database transaction, or use a database-level constraint/lock to enforce the limit atomically.

🟡 Inconsistent Error Handling in Authentication Flow

Location: pkg/services/anonymous/anonimpl/client.go:44-50

When ErrDeviceLimitReached occurs during authentication, it's returned as an authentication error. However, this could result in a poor user experience:

if err := a.anonDeviceService.TagDevice(ctx, httpReqCopy, anonymous.AnonDeviceUI); err != nil {
    if errors.Is(err, anonstore.ErrDeviceLimitReached) {
        return nil, err  // Authentication fails, but error message may be unclear
    }
    a.log.Warn("Failed to tag anonymous session", "error", err)
}

Recommendation: Consider wrapping the error with more user-friendly context, or handling it differently to provide clearer feedback to users that the anonymous device limit has been reached.

🟡 Missing CreatedAt in updateDevice

Location: pkg/services/anonymous/anonimpl/anonstore/database.go:73-103

The updateDevice method doesn't set CreatedAt, which is inconsistent. While this might be intentional, the test in database_test.go:55-70 also doesn't set CreatedAt:

anonDevice := &Device{
    DeviceID:  "32mdo31deeqwes",
    ClientIP:  "10.30.30.2",
    UserAgent: "test",
    UpdatedAt: time.Now().Add(-time.Hour),
    // CreatedAt is missing - will be zero value
}

Impact: Tests pass with zero CreatedAt, which may hide issues in production where CreatedAt might be required or used for other logic.

Recommendation: Ensure CreatedAt is set in tests, or document why it's acceptable for it to be zero.

Code Quality Issues

🔵 Redundant Time Arithmetic

Location: pkg/services/anonymous/anonimpl/anonstore/database.go:110

The code adds one minute to time.Now() when checking the device count:

count, err := s.CountDevices(ctx, time.Now().UTC().Add(-anonymousDeviceExpiration), time.Now().UTC().Add(time.Minute))

This pattern appears throughout the codebase. While the comment at pkg/services/anonymous/anonimpl/impl.go:69 explains it's for MySQL precision issues in tests, adding a minute to the future seems unusual. Consider using <= instead of BETWEEN in queries, or documenting this pattern more clearly.

🔵 Magic Number in Time Window

Location: pkg/services/anonymous/anonimpl/anonstore/database.go:81

device.UpdatedAt.UTC().Add(-anonymousDeviceExpiration), device.UpdatedAt.UTC().Add(time.Minute)

The updateDevice method uses a time window to check if a device exists and was updated recently. However, using device.UpdatedAt (which is time.Now()) as the center point and checking if the existing record's updated_at falls within [now - 30 days, now + 1 minute] is confusing. The logic seems to be "only update devices that were active within the expiration window," but this could be clearer.

🔵 Potential Performance Issue

Location: pkg/services/anonymous/anonimpl/anonstore/database.go:109-117

Every device creation checks the total count when the limit is enabled:

if s.deviceLimit > 0 {
    count, err := s.CountDevices(ctx, ...)  // Full table scan on every request

Impact: This adds a COUNT(*) query with a date range filter for every anonymous user request when limits are enabled. For high-traffic sites, this could become a performance bottleneck.

Recommendation: Consider using caching with invalidation, or implementing a more efficient approach like a materialized view or periodic count updates.

Security Considerations

SQL Injection Protection

The code properly uses parameterized queries throughout, protecting against SQL injection.

Input Validation

Device IDs and other inputs appear to be handled safely through the ORM layer.

⚠️ Denial of Service Potential

While the device limit feature is designed to prevent abuse, the race condition mentioned above could allow attackers to bypass limits. Additionally, the performance concern means that even legitimate traffic could strain the database when limits are enabled.

Test Coverage

Good Test Coverage

  • Basic device limit test: TestIntegrationBeyondDeviceLimit
  • Device cleanup tests updated
  • Integration tests cover main scenarios

🟡 Missing Test Cases

  1. No test for updating existing device when at limit - The test only checks rejection of new devices, not updates to existing ones
  2. No test for concurrent access - Given the race condition, concurrent test cases would be valuable
  3. No test for the exact boundary - What happens when count == deviceLimit vs count == deviceLimit - 1?
  4. No test for disabled limit - Should verify deviceLimit = 0 behaves correctly (appears to work based on existing tests)

Configuration

Frontend Integration

The configuration is properly exposed to the frontend via:

  • TypeScript type definitions
  • Frontend settings DTO
  • Runtime config

🔵 Documentation Missing

No documentation was added for:

  • The new device_limit configuration option in auth.anonymous section
  • How the limit works (applies to 30-day window)
  • What happens when the limit is reached (users get auth errors)
  • Recommended values for different deployment sizes

Minor Issues

  1. Type inconsistency: Frontend uses number | undefined while backend uses int64. Consider documenting that 0 means "no limit"

  2. Error message: ErrDeviceLimitReached is generic. Consider making it more descriptive: "anonymous device limit reached: maximum %d devices allowed"

  3. Logging: When device limit is reached, only debug logging occurs in impl.go:146. Consider warning-level logging for limit violations to help admins monitor this feature.

Recommendations

High Priority

  1. Fix the race condition in device limit checking
  2. Add documentation for the new configuration option
  3. Improve error handling to provide better user feedback when limits are reached

Medium Priority

  1. Add performance optimizations for the count query (caching or alternative approach)
  2. Enhance test coverage for boundary conditions and concurrent scenarios
  3. Improve error messages with more context

Low Priority

  1. Refactor time window logic for better clarity
  2. Add metrics/monitoring for device limit violations
  3. Consider configuration validation to prevent invalid values (e.g., negative limits)

Positive Aspects

✅ Well-structured code that follows existing patterns
✅ Proper separation of concerns
✅ Good use of dependency injection
✅ Appropriate error types defined
✅ Configuration properly threaded through the stack


Overall Assessment: The feature implementation is functional and follows Grafana's code patterns well. However, the race condition in the device limit check is a critical issue that should be addressed before merging to prevent bypassing the limit under concurrent load. The performance implications of counting devices on every request should also be considered for high-traffic deployments.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants