Release v0.6.3: Enhanced configurability and reliability improvements
Added partitionable concurrency gates with per-tenant/location keys to prevent
cross-tenant starvation. Concurrency permit wait timeout is now configurable
via permit_timeout_ms (defaults to :infinity). Permit holders are monitored
and permits are automatically reclaimed if processes die without releasing.
Added per-request timeout overrides for HTTP and streaming operations. The global
default timeout increased from 30s to 120s. Streaming gains a tunable backoff
ceiling (max_backoff_ms), a connect timeout, and a configurable ManagerV2 cleanup delay.
Context cache TTL defaults are now configurable via the application environment.
The rate limiter's retry-delay fallback is similarly configurable for cases where
API responses lack explicit retry timing.
Fixed streaming client memory leaks by removing persistent_term state tracking.
SSE parse errors now properly surface as errors instead of being silently dropped.
Streaming backoff and connection timeouts are now tunable parameters.
All timeout and concurrency parameters support per-call overrides while
maintaining sensible global defaults. Documentation updated throughout to
reflect new configuration options and behavioral changes.
Streaming knobs: pass `timeout:` (per attempt, default `config :gemini_ex, :timeout` = 120_000), `max_retries:` (default 3), `max_backoff_ms:` (default 10_000), and `connect_timeout:` (default 5_000). Manager cleanup delay can be tuned via `config :gemini_ex, :streaming, cleanup_delay_ms: ...`.
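Taken together, a streaming call with these knobs might look like the sketch below. The option names and config keys come from the notes above; the `Gemini.stream_generate/2` function name and the `cleanup_delay_ms` value are illustrative assumptions.

```elixir
# config/config.exs — global defaults (timeout and knob values shown are the defaults)
import Config

config :gemini_ex, timeout: 120_000                       # per-attempt timeout
config :gemini_ex, :streaming, cleanup_delay_ms: 30_000   # hypothetical value

# Per-call overrides; the function name is illustrative:
{:ok, stream_id} =
  Gemini.stream_generate("Summarize this document",
    timeout: 60_000,         # this attempt only, instead of the 120_000 default
    max_retries: 3,          # default
    max_backoff_ms: 10_000,  # default backoff ceiling between retries
    connect_timeout: 5_000   # default connect budget
  )
```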
### Rate Limiting & Concurrency (built-in)
- Enabled by default: requests block when over budget; non-blocking mode returns `{:error, {:rate_limited, retry_at, details}}` with `retry_at` set to the window end.
- Cached context tokens are counted toward budgets. When you precompute cache size, you can pass `estimated_cached_tokens:` alongside `estimated_input_tokens:` to budget correctly before the API reports usage.
- Optional `max_budget_wait_ms` caps how long blocking calls sleep for a full window; if the cap is hit and the window is still full, you get a `rate_limited` error with `retry_at` set to the actual window end.
- Concurrency gate: `max_concurrency_per_model` plus `permit_timeout_ms` (default `:infinity`, per-call override). `non_blocking: true` is the fail-fast path (returns `{:error, :no_permit_available}` immediately).
- Partition the gate with `concurrency_key:` (e.g., tenant/location) to avoid cross-tenant starvation; default key is the model name.
- Permit leak protection: holders are monitored; if a holder dies without releasing, its permits are reclaimed automatically.
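A sketch of how the budget options above combine in non-blocking mode. The option names and error shapes come from the notes; the `Gemini.generate/2` entry point is an assumption.

```elixir
opts = [
  non_blocking: true,              # fail fast instead of sleeping on a full window
  max_budget_wait_ms: 30_000,      # blocking mode only: cap the wait for a window
  estimated_input_tokens: 1_200,
  estimated_cached_tokens: 8_000   # precomputed cache size, counted up front
]

case Gemini.generate(prompt, opts) do
  {:ok, response} ->
    response

  {:error, {:rate_limited, retry_at, _details}} ->
    # retry_at marks the end of the current window; reschedule for then.
    {:retry_at, retry_at}

  {:error, :no_permit_available} ->
    # Concurrency gate full under non_blocking: true — shed load immediately.
    :overloaded
end
```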
### Timeouts (HTTP & Streaming)
- Global HTTP/stream timeout default is 120_000ms via `config :gemini_ex, :timeout`.
- Per-call override: `timeout:` on any request/stream.
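For example (the config key and `timeout:` option come from the notes above; the generate function name is illustrative):

```elixir
# config/config.exs — raise the global default from 120_000 ms
config :gemini_ex, timeout: 180_000

# One long-running request can still override it per call:
Gemini.generate(long_prompt, timeout: 300_000)
```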
**TTL defaults:** The default cache TTL is configurable via `config :gemini_ex, :context_cache, default_ttl_seconds: ...` (defaults to 3_600). You can also override per call with `default_ttl_seconds:` or pass `:ttl`/`:expire_time` explicitly.
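A minimal sketch of the TTL options, assuming a cache-creation function such as `Gemini.create_cached_context/2` (the function name and the `:ttl` value format are illustrative; the config key and option names come from the notes above):

```elixir
# config/config.exs — application-wide default (3_600 is the shipped default)
config :gemini_ex, :context_cache, default_ttl_seconds: 7_200

# Per-call: override the default, or pass an explicit TTL/expiry instead.
Gemini.create_cached_context(contents, default_ttl_seconds: 1_800)
Gemini.create_cached_context(contents, ttl: "900s")  # value format is an assumption
Gemini.create_cached_context(contents, expire_time: ~U[2025-06-01 00:00:00Z])
```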
The concurrency gate is per model by default (all callers to the same model share a queue). Use `concurrency_key:` to partition by tenant/location. `permit_timeout_ms` defaults to `:infinity`; a waiter only errors if you explicitly set a finite cap and it expires. Use `non_blocking: true` to fail fast instead of queueing.
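The blocking and fail-fast paths side by side (options as described above; the function name and tuple-shaped partition key are illustrative):

```elixir
# Blocking: queue for a permit, but give up after 10 s instead of :infinity.
Gemini.generate(prompt,
  concurrency_key: {tenant_id, location},  # partition: no cross-tenant starvation
  permit_timeout_ms: 10_000
)

# Fail-fast: never queue; surface backpressure to the caller immediately.
case Gemini.generate(prompt, concurrency_key: {tenant_id, location}, non_blocking: true) do
  {:ok, resp} -> resp
  {:error, :no_permit_available} -> :overloaded
end
```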