| 1 | +# KrakenD Plugins and Quotas Example |
| 2 | + |
| 3 | +This example demonstrates how to implement **token-based quota management** for streaming AI/LLM endpoints using KrakenD Enterprise Edition with a custom middleware plugin. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The example showcases: |
| 8 | + |
| 9 | +- **Custom Middleware Plugin**: A Go plugin that intercepts streaming responses and tracks token usage in real-time |
| 10 | +- **Quota Management**: Multi-tier token quotas (gold, silver, bronze) configured through KrakenD's `governance/processors` |
| 11 | +- **Stream Processing**: Parsing streaming responses to extract token usage from the response body |
| 12 | +- **Persistent Tracking**: Quota consumption persists across requests; the quota processor handles storage transparently |
| 13 | +- **Mock Backend**: An OpenAI-compatible mock server that streams responses with token metadata |
| 14 | + |
| 15 | +## Architecture |
| 16 | + |
| 17 | +``` |
| 18 | +┌──────────┐ ┌─────────────────────────────┐ ┌──────────┐ |
| 19 | +│ Client │────▶│ KrakenD EE │────▶│ Backend │ |
| 20 | +└──────────┘ │ - Quota Pre-check │ │ (Mock │ |
| 21 | + │ - Stream Wrapper │ │ OpenAI) │ |
| 22 | + │ - Token Extraction │ └──────────┘ |
| 23 | + │ - Quota Update │ |
| 24 | + │ │ |
| 25 | + │ Quota Processor handles │ |
| 26 | + │ storage transparently │ |
| 27 | + └─────────────────────────────┘ |
| 28 | +``` |
| 29 | + |
| 30 | +## Components |
| 31 | + |
| 32 | +### 1. KrakenD Configuration (`config/krakend/krakend.json`) |
| 33 | +- Defines quota rules for three tiers (gold, silver, bronze) |
| 34 | +- Configures the quota processor with its storage backend |
| 35 | +- Sets up the endpoint with the custom middleware plugin |
| 36 | + |
| 37 | +### 2. Middleware Plugin (`plugins/quota-control-mw/middleware.go`) |
| 38 | +- **Pre-check**: Checks whether the user still has quota before forwarding the request |
| 39 | +- **Stream Wrapping**: Intercepts the response stream from the backend |
| 40 | +- **Token Extraction**: Uses regex to parse token usage from streaming chunks |
| 41 | +- **Quota Update**: Dynamically updates quota based on actual token consumption |
| 42 | + |
| 43 | +### 3. Mock Backend (`backend/openai-mock/main.go`) |
| 44 | +- Simulates OpenAI's streaming API format |
| 45 | +- Returns Server-Sent Events (SSE) with mock responses |
| 46 | +- Includes token usage metadata in the final chunk |
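
The stream shape described above can be sketched as a small handler. This is an illustrative sketch, not the code in `backend/openai-mock/main.go`: the handler name `streamMock`, the helper `fetchBody`, the chunk contents, and the token counts are all assumptions. Only the final data chunk carries the `usage` object, followed by the `[DONE]` terminator used by OpenAI-style streams.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// streamMock (hypothetical name) writes OpenAI-style SSE chunks; only the
// last data chunk before [DONE] carries the usage object.
func streamMock(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	flusher, canFlush := w.(http.Flusher)
	chunks := []string{
		`{"choices":[{"delta":{"content":"Hello"}}]}`,
		`{"choices":[{"delta":{"content":" there"}}]}`,
		`{"choices":[],"usage":{"prompt_tokens":5,"completion_tokens":2,"total_tokens":7}}`,
	}
	for _, c := range chunks {
		// Each SSE event is a "data:" line followed by a blank line.
		fmt.Fprintf(w, "data: %s\n\n", c)
		if canFlush {
			flusher.Flush() // push the event to the client right away
		}
	}
	fmt.Fprint(w, "data: [DONE]\n\n")
}

// fetchBody runs the handler in-process and returns everything it wrote.
func fetchBody() string {
	rec := httptest.NewRecorder()
	streamMock(rec, httptest.NewRequest(http.MethodPost, "/", nil))
	return rec.Body.String()
}

func main() {
	fmt.Print(fetchBody())
}
```

To serve this over the network instead of in-process, the handler could be registered with `http.HandleFunc` and started with `http.ListenAndServe(":8090", nil)`; the route path would be whatever the KrakenD backend definition points at.
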
| 47 | + |
| 48 | +## Prerequisites |
| 49 | + |
| 50 | +- Docker and Docker Compose |
| 51 | +- KrakenD Enterprise Edition license (place `LICENSE` file in the root directory) |
| 52 | + |
| 53 | +## Running the Example |
| 54 | + |
| 55 | +1. **Ensure you have a KrakenD EE license file** named `LICENSE` in this directory |
| 56 | + |
| 57 | +2. **Start all services**: |
| 58 | + ```bash |
| 59 | + docker compose up --build |
| 60 | + ``` |
| 61 | + |
| 62 | + This will start: |
| 63 | + - KrakenD EE on port `8080` |
| 64 | + - Mock backend on port `8090` |
| 65 | + - Supporting services (quota storage) |
| 66 | + |
| 67 | +3. **Test the endpoint**: |
| 68 | + ```bash |
| 69 | + curl -X POST http://localhost:8080/ \ |
| 70 | + -H "Content-Type: application/json" \ |
| 71 | + -d '{ |
| 72 | + "messages": [{"role": "user", "content": "Hello!"}] |
| 73 | + }' |
| 74 | + ``` |
| 75 | + |
| 76 | + You should see a streaming response with the mock data. |
| 77 | + |
| 78 | +## How It Works |
| 79 | + |
| 80 | +1. **Request arrives** at KrakenD's `/` endpoint |
| 81 | +2. **Middleware pre-checks** quota limits (a zero-weight check that consumes no tokens) |
| 82 | +3. If quota allows, **request is forwarded** to the backend |
| 83 | +4. **Backend streams** the response in OpenAI-compatible SSE format |
| 84 | +5. **Middleware wraps** the response stream with a custom reader |
| 85 | +6. As **chunks are read**, the plugin: |
| 86 | + - Searches for usage metadata in the stream |
| 87 | + - Extracts token counts using a regex pattern |
| 88 | + - Updates the quota processor with actual consumption |
| 89 | +7. **Client receives** the full streaming response |
| 90 | +8. **Quota is accurately tracked** based on real token usage |
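
Steps 5 and 6 can be sketched as an `io.Reader` wrapper that forwards bytes untouched while scanning them for the usage object. This is a minimal sketch under assumptions, not the plugin's actual implementation: the names `usageReader`, `countTokens`, `tailKeep`, and the `onUsage` callback are hypothetical, and the real plugin works with KrakenD's own request and response wrappers. The carry-over buffer is there because a usage object can be split across two reads.

```go
package main

import (
	"fmt"
	"io"
	"regexp"
	"strconv"
	"strings"
)

// Same shape as the usage pattern quoted in this README.
var usageRe = regexp.MustCompile(`{"prompt_tokens":(\d+),"completion_tokens":(\d+),"total_tokens":(\d+)}`)

const tailKeep = 256 // bytes carried over so a match split across reads is still found

// usageReader passes bytes through unchanged and reports total_tokens once.
type usageReader struct {
	src     io.Reader
	tail    []byte
	done    bool
	onUsage func(total int)
}

func (u *usageReader) Read(p []byte) (int, error) {
	n, err := u.src.Read(p)
	if n == 0 || u.done {
		return n, err
	}
	// Scan the carried-over tail plus the new bytes.
	buf := append(u.tail, p[:n]...)
	if m := usageRe.FindSubmatch(buf); m != nil {
		if total, convErr := strconv.Atoi(string(m[3])); convErr == nil {
			u.onUsage(total)
		}
		u.done, u.tail = true, nil
		return n, err
	}
	if len(buf) > tailKeep {
		buf = buf[len(buf)-tailKeep:]
	}
	u.tail = append([]byte(nil), buf...) // private copy for the next Read
	return n, err
}

// countTokens drains stream through a usageReader using reads of bufSize bytes.
func countTokens(stream string, bufSize int) int {
	total := 0
	r := &usageReader{src: strings.NewReader(stream), onUsage: func(t int) { total = t }}
	p := make([]byte, bufSize)
	for {
		if _, err := r.Read(p); err != nil {
			return total
		}
	}
}

func main() {
	stream := "data: {\"choices\":[{\"delta\":{\"content\":\"Hi\"}}]}\n\n" +
		"data: {\"choices\":[],\"usage\":{\"prompt_tokens\":5,\"completion_tokens\":2,\"total_tokens\":7}}\n\n" +
		"data: [DONE]\n\n"
	fmt.Println(countTokens(stream, 16))
}
```

In this sketch the closing `}` in the pattern matters: it keeps the reader from reporting a truncated number when the digits of `total_tokens` straddle two reads.
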
| 91 | + |
| 92 | +## Quota Tiers |
| 93 | + |
| 94 | +The configuration defines three quota tiers: |
| 95 | + |
| 96 | +| Tier | Hourly Limit | Daily Limit | |
| 97 | +|--------|--------------|-------------| |
| 98 | +| Gold | 1,000 tokens | 5,000 tokens| |
| 99 | +| Silver | 500 tokens | 2,000 tokens| |
| 100 | +| Bronze | 200 tokens | 1,000 tokens| |
| 101 | + |
| 102 | +**Note**: In the plugin code (`middleware.go:73`), the tier is hardcoded to `"admin"` (which maps to the `gold` tier). In a production setup, you would extract this from: |
| 103 | +- JWT claims |
| 104 | +- Request headers (e.g., `X-User-Tier`) |
| 105 | +- API key lookup |
| 106 | +- Query parameters |
| 107 | + |
| 108 | +## Key Features |
| 109 | + |
| 110 | +### 1. **Streaming-Aware Quota Management** |
| 111 | +Unlike traditional rate limiting that counts requests, this example counts tokens consumed during streaming, making it ideal for AI/LLM APIs where cost is token-based. |
| 112 | + |
| 113 | +### 2. **Pre-check + Post-update Pattern** |
| 114 | +- Pre-check (weight=0): Fast rejection of users who've already exceeded quota |
| 115 | +- Post-update (weight=actual): Accurate quota deduction after the response completes |
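
The pattern can be illustrated with a toy in-memory counter standing in for the quota processor. Everything here is a hypothetical sketch (the `quota` type, `allowed`, `add`, and `handle` are not KrakenD APIs); the limit of 200 borrows the bronze hourly figure from the tier table.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errQuotaExceeded = errors.New("quota exceeded")

// quota is a toy stand-in for the quota processor: one limit, one window.
type quota struct {
	mu    sync.Mutex
	limit int
	used  map[string]int
}

func newQuota(limit int) *quota {
	return &quota{limit: limit, used: make(map[string]int)}
}

// allowed is the zero-weight pre-check: it reads usage but consumes nothing.
func (q *quota) allowed(user string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.used[user] < q.limit
}

// add is the post-update: it records the tokens the response actually used.
func (q *quota) add(user string, tokens int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.used[user] += tokens
}

// handle rejects early if the user is already over quota; otherwise it calls
// the backend and charges whatever token count the backend reports.
func handle(q *quota, user string, backend func() int) error {
	if !q.allowed(user) {
		return errQuotaExceeded
	}
	q.add(user, backend())
	return nil
}

func main() {
	q := newQuota(200)
	backend := func() int { return 150 } // pretend every response costs 150 tokens
	for i := 1; i <= 3; i++ {
		fmt.Printf("request %d: %v\n", i, handle(q, "user-1", backend))
	}
}
```

Because the cost is only known after streaming, the second request in this sketch is admitted at 150 tokens used and finishes at 300, overshooting the limit; the third is then rejected by the pre-check. That overshoot is a property of this sketch's pre-check and post-update ordering, and it may or may not match how the real quota processor accounts for in-flight requests.
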
| 116 | + |
| 117 | +### 3. **Regex-based Token Extraction** |
| 118 | +The plugin uses a regex pattern to extract token usage from the streaming response: |
| 119 | +```go |
| 120 | +usagePattern: `{"prompt_tokens":(\d+),"completion_tokens":(\d+),"total_tokens":(\d+)}` |
| 121 | +``` |
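
As a rough illustration of how such a pattern behaves, the sketch below compiles it with Go's `regexp` package and reads the three capture groups. The `usage` type, `parseUsage` function, and sample chunks are assumptions made for this example, not code from the plugin.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var usagePattern = regexp.MustCompile(`{"prompt_tokens":(\d+),"completion_tokens":(\d+),"total_tokens":(\d+)}`)

// usage holds the three counts captured by the pattern.
type usage struct {
	Prompt, Completion, Total int
}

// parseUsage returns the token counts found in chunk, or false if the
// chunk does not contain a usage object in exactly the expected form.
func parseUsage(chunk string) (usage, bool) {
	m := usagePattern.FindStringSubmatch(chunk)
	if m == nil {
		return usage{}, false
	}
	var vals [3]int
	for i := range vals {
		v, err := strconv.Atoi(m[i+1])
		if err != nil { // \d+ already matched, so this is only integer overflow
			return usage{}, false
		}
		vals[i] = v
	}
	return usage{Prompt: vals[0], Completion: vals[1], Total: vals[2]}, true
}

func main() {
	compact := `data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":30,"total_tokens":42}}`
	spaced := `data: {"usage":{"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}}`
	fmt.Println(parseUsage(compact))
	fmt.Println(parseUsage(spaced))
}
```

The second call finds nothing: the pattern expects compact JSON with the three keys in this exact order, so added whitespace, reordered keys, or extra fields inside the object prevent a match. That is worth checking first when tokens are not being tracked.
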
| 122 | + |
| 123 | +### 4. **Bloom Filter Optimization** |
| 124 | +The configuration includes a rejecter cache (Bloom filter) to quickly deny previously blocked users without storage lookups. |
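
For readers unfamiliar with the data structure, here is a minimal Bloom filter sketch. It is a generic illustration, not KrakenD's rejecter cache: the `bloom` type, its size, the hash count, and the use of FNV with double hashing are all assumptions chosen for brevity.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a fixed-size Bloom filter with k hash positions per key.
type bloom struct {
	bits []bool
	k    uint32
}

func newBloom(size int, k uint32) *bloom {
	return &bloom{bits: make([]bool, size), k: k}
}

// positions derives k bit positions from one 64-bit FNV hash by combining
// its two 32-bit halves (double hashing).
func (b *bloom) positions(key string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	pos := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		pos[i] = (h1 + i*h2) % uint32(len(b.bits))
	}
	return pos
}

// Add marks key as seen.
func (b *bloom) Add(key string) {
	for _, p := range b.positions(key) {
		b.bits[p] = true
	}
}

// MayContain reports false only if key was definitely never added;
// true means "probably added", since different keys can share bits.
func (b *bloom) MayContain(key string) bool {
	for _, p := range b.positions(key) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	rejected := newBloom(1024, 3)
	rejected.Add("user-42") // remember a user who ran out of quota
	fmt.Println(rejected.MayContain("user-42"))
	fmt.Println(rejected.MayContain("user-7"))
}
```

The appeal for a rejecter is that membership checks touch only a small in-memory bit array. The trade-off inherent to any Bloom filter is that it never forgets a key that was added but can occasionally report a key that was not.
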
| 125 | + |
| 126 | +## Customization |
| 127 | + |
| 128 | +### Change User/Tier Extraction |
| 129 | +Modify `middleware.go:73-74` to extract tier and user ID from the request: |
| 130 | + |
| 131 | +```go |
| 132 | +// Option A: extract from JWT claims (extractFromJWT is a helper you would write) |
| 133 | +tier := extractFromJWT(reqw, "tier") |
| 134 | +userId := extractFromJWT(reqw, "user_id") |
| 135 | + |
| 136 | +// Option B (instead of A): extract from headers; confirm the header is present first, because [0] panics on a missing key |
| 137 | +// tier := reqw.Headers()["X-User-Tier"][0] |
| 138 | +// userId := reqw.Headers()["X-User-Id"][0] |
| 139 | +``` |
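
A header lookup that cannot panic might look like the sketch below. The helper names `firstHeader` and `tierFromHeaders`, the fallback to `bronze`, and the assumption that header keys arrive in Go's canonical form are all illustrative choices, not behavior taken from the plugin.

```go
package main

import (
	"fmt"
	"net/http"
)

// knownTiers mirrors the tier names used in this example's quota rules.
var knownTiers = map[string]bool{"gold": true, "silver": true, "bronze": true}

// firstHeader returns the first non-empty value of a header, if any.
// It assumes keys are stored in canonical form (e.g. "X-User-Tier").
func firstHeader(headers map[string][]string, name string) (string, bool) {
	vals := headers[http.CanonicalHeaderKey(name)]
	if len(vals) == 0 || vals[0] == "" {
		return "", false
	}
	return vals[0], true
}

// tierFromHeaders falls back to the most restrictive tier when the header
// is missing or names a tier that does not exist.
func tierFromHeaders(headers map[string][]string) string {
	if tier, ok := firstHeader(headers, "X-User-Tier"); ok && knownTiers[tier] {
		return tier
	}
	return "bronze"
}

func main() {
	fmt.Println(tierFromHeaders(map[string][]string{"X-User-Tier": {"gold"}}))
	fmt.Println(tierFromHeaders(nil)) // missing header
}
```

With the plugin's request wrapper, the call would presumably be `tierFromHeaders(reqw.Headers())`. A tier header is only trustworthy if something upstream sets it (for example after validating a JWT); a value sent directly by the client lets callers pick their own tier.
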
| 140 | + |
| 141 | +### Adjust Quota Limits |
| 142 | +Edit `config/krakend/krakend.json` in the `governance/processors.quotas.rules` section. |
| 143 | + |
| 144 | +### Change Token Pattern |
| 145 | +If your backend uses a different format for token usage, update the regex pattern in `middleware.go:62`. |
| 146 | + |
| 147 | +## Building the Plugin Manually |
| 148 | + |
| 149 | +If you want to build the plugin outside Docker: |
| 150 | + |
| 151 | +```bash |
| 152 | +cd plugins/quota-control-mw |
| 153 | +make go.mod # Initialize Go module |
| 154 | +make amd64 # Build for AMD64 |
| 155 | +# or |
| 156 | +make arm64 # Build for ARM64 |
| 157 | +``` |
| 158 | + |
| 159 | +## Troubleshooting |
| 160 | + |
| 161 | +**Plugin not loading**: Check KrakenD logs for plugin-related errors. Ensure the `.so` file is in `/opt/krakend/plugins/` inside the container. |
| 162 | + |
| 163 | +**Quota not working**: Verify all services are running with `docker compose ps`. Check KrakenD logs for quota processor initialization errors. |
| 164 | + |
| 165 | +**Tokens not being tracked**: Check that the token usage format in the backend response matches the regex pattern in the plugin. |
| 166 | + |
| 167 | +## License |
| 168 | + |
| 169 | +This example requires a valid KrakenD Enterprise Edition license. Place your `LICENSE` file in the root directory. |
| 170 | + |
| 171 | +## References |
| 172 | + |
| 173 | +- [KrakenD Plugins Documentation](https://www.krakend.io/docs/enterprise/extending/) |
| 174 | +- [KrakenD Quota Management](https://www.krakend.io/docs/enterprise/governance/quota/) |
| 175 | +- [OpenAI Streaming Format](https://platform.openai.com/docs/api-reference/streaming) |