Summary
Add a configurable environment variable (e.g. LLM_REQUEST_TIMEOUT) to control the timeout for requests to the LLM backend.
Motivation
Speakr's current 10-minute request timeout (inherited from the OpenAI Python SDK default) works well for cloud-hosted models but is insufficient for users running local inference via Ollama or similar self-hosted backends, particularly when:
- Using larger models (e.g. 12B+ parameter models) that are partially CPU-offloaded due to limited VRAM
- Processing long transcripts (60–120 minute sessions can generate 8,000–25,000+ token prompts)
- Running on consumer hardware where inference is significantly slower than cloud APIs
Currently the only workaround is to patch the source directly or to use smaller models that may produce lower-quality summaries.
Proposed Solution
Expose the LLM client timeout as an environment variable, for example:
LLM_REQUEST_TIMEOUT=1800 # seconds, default 600
This would be passed to the OpenAI client at initialization:
client = OpenAI(base_url=..., api_key=..., timeout=int(os.getenv('LLM_REQUEST_TIMEOUT', 600)))
Additional Context
It would also be worth making the SDK's automatic retry behavior configurable, or disabling it for local inference endpoints (openai._base_client retries on timeout). Retrying a timed-out request against a still-processing Ollama instance queues a duplicate job, compounding the problem rather than resolving it.
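A minimal sketch of how both knobs could be read from the environment before constructing the client. The variable names LLM_REQUEST_TIMEOUT and LLM_MAX_RETRIES and the helper function are hypothetical; the kwargs themselves (timeout, max_retries) are standard OpenAI Python SDK client parameters:

```python
import os

def llm_client_kwargs(env=os.environ):
    """Build timeout/retry kwargs for the OpenAI client from env vars.

    LLM_REQUEST_TIMEOUT and LLM_MAX_RETRIES are hypothetical names
    proposed in this issue, not existing Speakr configuration.
    Defaults mirror the suggestion above: 600 s timeout, and zero
    retries so a slow local backend is never hit twice.
    """
    return {
        "timeout": float(env.get("LLM_REQUEST_TIMEOUT", "600")),
        "max_retries": int(env.get("LLM_MAX_RETRIES", "0")),
    }

# The kwargs would then be spread into the client constructor, e.g.:
# client = OpenAI(base_url=..., api_key=..., **llm_client_kwargs())
```

Defaulting max_retries to 0 for local endpoints sidesteps the duplicate-job issue described above, while cloud users could opt back into the SDK's default of 2.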
Use Case
Local Ollama deployment with gemma3:12b on an NVIDIA RTX A2000 6GB, processing 90–120 minute meeting transcripts. Inference completes successfully when tested manually but exceeds the 10-minute timeout window in Speakr.