feat: Add LLM improvements and feedback signal API #100

csfet9 · 2026-01-05T12:15:52Z

Summary

This PR adds several high-value improvements to the Hindsight memory system:

1. Feedback Signal API

New API endpoint for tracking fact usefulness (POST /feedback)
Allows agents to signal which retrieved facts were helpful or not
Enables future improvements to retrieval ranking based on feedback
Includes database migration, tests, and documentation

2. Gemini 3 Flash Preview Optimizations

Added support for Gemini 3 Flash Preview model
Optimized configuration for better performance with this model

3. Empty LLM Response Handling

Added retry logic when LLM returns empty responses
Logs finish_reason for debugging
Exponential backoff between retries
Prevents failures from transient empty responses

Test plan

Run existing test suite: cd hindsight-api && uv run pytest tests/
Test feedback signal API endpoint manually
Verify Gemini provider works with new optimizations
Test empty response handling by simulating empty LLM responses

🤖 Generated with Claude Code

- Add Anthropic as LLM provider with full async support - Add LM Studio provider for local model inference - Fix JSON response format compatibility for local models - Update .env.example with configuration examples - Update docstrings with all supported providers Tested with: - Claude Sonnet 4 (claude-sonnet-4-20250514) - Claude Haiku 4.5 (claude-haiku-4-5-20251001) - Qwen 30B via LM Studio

Add configurable timeout support for LLM API calls: - Environment variable override via HINDSIGHT_API_LLM_TIMEOUT - Dynamic heuristic for lmstudio/ollama: 20 mins for large models (30b, 33b, 34b, 65b, 70b, 72b, 8x7b, 8x22b), 5 mins for others - Pass timeout to Anthropic, OpenAI, and local model clients 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Remove CLAUDE.md from .gitignore (should stay in repository) - Pass max_completion_tokens to _call_anthropic instead of hardcoding 4096 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Provides project context and development commands for AI-assisted coding. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add docker-compose.yml for local development - Add test_internal.py for local testing - Sync uv.lock and llm_wrapper.py changes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Move LLM config to config.py with HINDSIGHT_API_ prefix - Add HINDSIGHT_API_LLM_MAX_CONCURRENT (default: 32) - Add HINDSIGHT_API_LLM_TIMEOUT (default: 120s) - Remove fragile model-size timeout heuristic - Apply markdown JSON extraction to all providers, not just local - Fix Anthropic markdown extraction bug (missing split) - Change LLM request/response logs from info to debug level 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Remove test_internal.py (debug file) - Remove docker-compose.yml (to be moved to hindsight-cookbook repo) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Implements a feedback signal system that tracks which recalled facts are actually useful, enabling usefulness-boosted recall. API Endpoints: - POST /v1/default/banks/{bank_id}/signal - Submit feedback signals - GET /v1/default/banks/{bank_id}/facts/{fact_id}/stats - Fact stats - GET /v1/default/banks/{bank_id}/stats/usefulness - Bank stats Features: - Signal types: used, ignored, helpful, not_helpful - Time-decayed scoring (5% decay per week) - Usefulness-boosted recall with configurable weight - Query pattern tracking for analytics Database: - fact_usefulness: Aggregate scores per fact - usefulness_signals: Individual signal records - query_pattern_stats: Pattern tracking Documentation: - Full API reference in hindsight-docs - Python, Node.js, and cURL examples - Updated recall.mdx with new parameters 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Adds a startup script that waits for dependencies (database and LLM Studio) before launching the Hindsight API. Retries indefinitely by default, allowing the container to start before LM Studio is available. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Strip <think>, <thinking>, <reasoning>, and |startthink|/|endthink| tags from reasoning model outputs to enable proper JSON parsing. This allows local reasoning models like Qwen3 to work with Hindsight's structured extraction pipeline. Also adds slow call logging for Ollama native function and updates reasoning model detection to include qwq family. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

…artup Container now waits for database and LLM Studio to be accessible before starting Hindsight. Configurable via environment variables: - HINDSIGHT_RETRY_MAX: Max retries (0 = infinite, default) - HINDSIGHT_RETRY_INTERVAL: Seconds between retries (default 10) Applied to all three Docker stages: api-only, cp-only, and standalone. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

When HINDSIGHT_API_DATABASE_URL is not set, the standalone container uses embedded pg0 which starts with start-all.sh. The retry script now detects this and skips the external database check. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Documents testing of Qwen3 8B/14B, Gemma 3, and NuExtract models for Hindsight memory extraction on Apple Silicon. Includes: - Benchmark results and performance comparisons - Configuration recommendations - Docker setup with retry-start script - Troubleshooting guide 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

# Conflicts: # hindsight-api/hindsight_api/engine/llm_wrapper.py # uv.lock

- Add llm-comparison.py benchmark script to compare LLM providers - Reduce max_completion_tokens from 65000 to 8192 for better local LLM compatibility - Include benchmark results for Qwen3-8B and Claude Haiku 4.5 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add thinking_level parameter support (LOW/MEDIUM/HIGH) for Gemini 3 models - Add temperature and max_output_tokens support for Gemini - Improve rate limit handling with 10-120s backoff for 429 errors - Add HINDSIGHT_API_LLM_THINKING_LEVEL env var (default: low) - Include benchmark results comparing thinking levels Performance with thinking_level=medium: - 4.3x faster retain (4.5s vs 19.4s per memory) - 92% fact extraction quality retained (33 vs 36 facts) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- LLM wrapper: detect and retry on empty responses (None or empty string) - LLM wrapper: add detailed logging with finish_reason and safety ratings - Docker startup: skip endpoint check for cloud providers (openai, anthropic, gemini, groq) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Resolved conflict in llm_wrapper.py: - Kept upstream's more descriptive comment for thinking tag stripping - Preserved local empty response handling with retry logic 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Implements a feedback signal system that tracks which recalled facts are actually useful, enabling usefulness-boosted recall. API Endpoints: - POST /v1/default/banks/{bank_id}/signal - Submit feedback signals - GET /v1/default/banks/{bank_id}/facts/{fact_id}/stats - Fact stats - GET /v1/default/banks/{bank_id}/stats/usefulness - Bank stats Features: - Signal types: used, ignored, helpful, not_helpful - Time-decayed scoring (5% decay per week) - Usefulness-boosted recall with configurable weight - Query pattern tracking for analytics Database: - fact_usefulness: Aggregate scores per fact - usefulness_signals: Individual signal records - query_pattern_stats: Pattern tracking Documentation: - Full API reference in hindsight-docs - Python, Node.js, and cURL examples - Updated recall.mdx with new parameters 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add thinking_level parameter support (LOW/MEDIUM/HIGH) for Gemini 3 models - Add temperature and max_output_tokens support for Gemini - Improve rate limit handling with 10-120s backoff for 429 errors - Add HINDSIGHT_API_LLM_THINKING_LEVEL env var (default: low) - Include benchmark results comparing thinking levels Performance with thinking_level=medium: - 4.3x faster retain (4.5s vs 19.4s per memory) - 92% fact extraction quality retained (33 vs 36 facts) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

- LLM wrapper: detect and retry on empty responses (None or empty string) - LLM wrapper: add detailed logging with finish_reason and safety ratings - Docker startup: skip endpoint check for cloud providers (openai, anthropic, gemini, groq) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

The retry logic is already integrated into start-all.sh, making this separate script unnecessary. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

These are local test results that shouldn't be in the upstream PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

nicoloboschi

Hey @csfet9 thanks for this big addition!

I've a more general question about this:
isn't the helpfulness related to the recall query and other parameters?

for example, if I ask the bank about "Alice" I might say that Bob facts weren't helpful but if I ask about Bob, that fact is indeed helpful (maybe)

So I think we're missing something that connects the helpfulness to the actual query

csfet9 · 2026-01-05T13:11:44Z

Hey @nicoloboschi, great feedback! You're absolutely right - the helpfulness signal should be tied to the query context.

I've implemented a query-context aware scoring system:

Changes

New query_fact_usefulness table - stores usefulness scores per (query_embedding, fact) pair instead of just per fact
query field is now required in SignalItem - ensures every signal is tied to the query that triggered the recall
Semantic matching for similar queries - uses cosine similarity (threshold 0.85) so "Who is the CEO?" and "Who is the chief executive?" share the same score context
Hybrid scoring in recall - when boost_by_usefulness=true:
- First looks for query-specific scores (similar queries)
- Falls back to global scores for unseen query patterns
- Falls back to 0.5 (neutral) for facts without any signals

Example

Fact: "Bob works at TechCorp"

Query "Who works at TechCorp?" → marked helpful → score = 0.65
Query "What's the weather?" → marked not_helpful → score = 0.40

Recall with "Tell me about TechCorp employees" → uses 0.65 (similar to first query)
Recall with "What's today's forecast?" → uses 0.40 (similar to second query)

This way, the same fact can have different usefulness scores depending on the query context.

Currently testing locally - will push once verified.

Want me to adjust anything?

nicoloboschi · 2026-01-05T13:15:43Z

Hey @nicoloboschi, great feedback! You're absolutely right - the helpfulness signal should be tied to the query context.

I've implemented a query-context aware scoring system:

Changes

New query_fact_usefulness table - stores usefulness scores per (query_embedding, fact) pair instead of just per fact

query field is now required in SignalItem - ensures every signal is tied to the query that triggered the recall

Semantic matching for similar queries - uses cosine similarity (threshold 0.85) so "Who is the CEO?" and "Who is the chief executive?" share the same score context

Hybrid scoring in recall - when boost_by_usefulness=true:

First looks for query-specific scores (similar queries)

Falls back to global scores for unseen query patterns

Falls back to 0.5 (neutral) for facts without any signals

Example

Fact: "Bob works at TechCorp"

Query "Who works at TechCorp?" → marked helpful → score = 0.65 Query "What's the weather?" → marked not_helpful → score = 0.40

Recall with "Tell me about TechCorp employees" → uses 0.65 (similar to first query) Recall with "What's today's forecast?" → uses 0.40 (similar to second query)

This way, the same fact can have different usefulness scores depending on the query context.

Currently testing locally - will push once verified.

Want me to adjust anything?

thanks, that looks better!

can you share your use case for this feature and what is the pattern you're building? it looks like you want some human-in-the-loop mechanism and I'd love to hear more about how you intend to use hindsight there

csfet9 · 2026-01-05T13:30:09Z

Thanks, Here's the use case:

I'm building an automatic feedback loop for Claude Code where the system learns from Claude's actual behavior.

How it works

Detection - After recall, analyze Claude's response to detect which facts were actually used (semantic similarity, explicit references like "based on context", behavioral signals like file access)
Signal - Send used/ignored verdicts back to Hindsight via /signal
Learn - Use boost_by_usefulness in future recalls to prioritize facts that worked before

Why query-context matters

A fact like "Bob is a senior engineer" might be helpful for "Who can help with code?" but irrelevant for "What's the deadline?". Without query-context, one "helpful" signal would boost it for ALL queries. The semantic matching ensures similar queries share scores while different query types stay separate.

The human-in-the-loop is mostly implicit - Claude's usage patterns provide the signal automatically, no manual thumbs up/down needed.

hindsight-api/hindsight_api/engine/llm_wrapper.py

hindsight-docs/examples/api/feedback-signals.mjs

- Make query field required in SignalItem for context-aware scoring - Add query_fact_usefulness table with HNSW index for semantic matching - Store query embeddings with signals for similarity-based grouping - Use query-specific scores with global fallback during recall - Similar queries (cosine similarity >= 0.85) share scores - Update documentation and examples with required query field 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Replace separate env var with existing reasoning_effort parameter for Gemini 3 thinking_level configuration. This unifies the config across providers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

csfet9 · 2026-01-05T16:05:12Z

Updates pushed:

Query-Context Aware Scoring (addresses your feedback)

Made query field required in SignalItem
New query_fact_usefulness table stores scores per (query_embedding, fact) pair
Semantic matching (cosine similarity ≥ 0.85) groups similar queries
Hybrid scoring: query-specific → global fallback → 0.5 neutral
Updated all docs and examples with required query field
Added tests for context-aware scoring

Review comment fixes
Gemini 3 thinking level now uses reasoning_effort parameter instead of separate env var
Query field already in examples (part of context-aware scoring commit)

Ready for re-review!

Resolve conflict in config.py: - Keep Groq service tier from upstream (vectorize-io#102) - Remove unused LLM thinking level env var (now using reasoning_effort) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Read HINDSIGHT_API_LLM_THINKING_LEVEL from environment in all LLMProvider factory methods (for_memory, for_answer_generation, for_judge) instead of hardcoding the reasoning_effort value. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Features merged: - Query-context aware feedback scoring for improved recall relevance - Groq service tier configuration (ENV_LLM_GROQ_SERVICE_TIER) - Configurable thinking level for Gemini 3 via reasoning_effort - OpenAI embeddings support with configurable dimensions - GitHub issue templates 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

The file was referenced in Dockerfile but was missing from the repo. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

The retry logic was merged into start-all.sh. Updated Dockerfile from upstream which uses start-all.sh directly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

… 500 Add Pydantic field_validator to SignalItem.fact_id to validate UUID format at the API layer. Invalid UUIDs now receive a proper 422 Validation Error instead of causing a 500 Internal Server Error when the database rejects them. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

…tion Merged three changes: - llm_wrapper.py: Combined Gemini temperature/max_completion_tokens (from HEAD) with return_usage parameter (from upstream) - fact_extraction.py: Use config.retain_max_completion_tokens instead of hardcoded 8192 for configurable token limits Co-Authored-By: Claude Opus 4.5 <[email protected]>

The `<0.5s` was being interpreted as a JSX tag by the MDX parser. Co-Authored-By: Claude Opus 4.5 <[email protected]>

csfet9 and others added 29 commits December 30, 2025 16:12

chore: Remove deleted AI assistant files from .gitignore

47c6008

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

docs: Add CLAUDE.md for Claude Code integration

8c7e28f

Provides project context and development commands for AI-assisted coding. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

chore: Remove local dev docker-compose.yml

f69a048

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

chore: Add local dev docker-compose.yml

4ce365f

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

fix: Update LM Studio port to 2222 in docker-compose

10eb651

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

chore: Remove obsolete version attribute from docker-compose

92d05d5

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

fix: Remove test file and docker-compose per PR review

79bc663

- Remove test_internal.py (debug file) - Remove docker-compose.yml (to be moved to hindsight-cookbook repo) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

docs: Move local LLM testing docs to hindsight-docs

a0cec18

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Merge remote-tracking branch 'upstream/main'

bf9c0c0

# Conflicts: # hindsight-api/hindsight_api/engine/llm_wrapper.py # uv.lock

chore: remove redundant retry-start.sh

a85ac28

The retry logic is already integrated into start-all.sh, making this separate script unnecessary. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

chore: remove benchmark result files from PR

e4eae57

These are local test results that shouldn't be in the upstream PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

nicoloboschi requested changes Jan 5, 2026

View reviewed changes

nicoloboschi reviewed Jan 5, 2026

View reviewed changes

hindsight-api/hindsight_api/engine/llm_wrapper.py Outdated Show resolved Hide resolved

hindsight-docs/examples/api/feedback-signals.mjs Show resolved Hide resolved

csfet9 and others added 2 commits January 5, 2026 14:42

csfet9 and others added 10 commits January 5, 2026 17:06

Merge remote-tracking branch 'upstream/main'

1b6ffb7

fix: restore missing retry-start.sh for Docker build

0c31269

The file was referenced in Dockerfile but was missing from the repo. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

Merge branch 'feature/llm-improvements-and-feedback-api'

d085415

fix: escape < in MDX to fix docs build

74e2190

The `<0.5s` was being interpreted as a JSX tag by the MDX parser. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Add LLM improvements and feedback signal API #100

feat: Add LLM improvements and feedback signal API #100

Uh oh!

csfet9 commented Jan 5, 2026

Uh oh!

nicoloboschi left a comment

Uh oh!

csfet9 commented Jan 5, 2026

Uh oh!

nicoloboschi commented Jan 5, 2026

Uh oh!

csfet9 commented Jan 5, 2026

Uh oh!

Uh oh!

Uh oh!

csfet9 commented Jan 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: Add LLM improvements and feedback signal API #100

Are you sure you want to change the base?

feat: Add LLM improvements and feedback signal API #100

Uh oh!

Conversation

csfet9 commented Jan 5, 2026

Summary

1. Feedback Signal API

2. Gemini 3 Flash Preview Optimizations

3. Empty LLM Response Handling

Test plan

Uh oh!

nicoloboschi left a comment

Choose a reason for hiding this comment

Uh oh!

csfet9 commented Jan 5, 2026

Uh oh!

nicoloboschi commented Jan 5, 2026

Uh oh!

csfet9 commented Jan 5, 2026

Uh oh!

Uh oh!

Uh oh!

csfet9 commented Jan 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants