Skip to content

Conversation

@csfet9
Copy link
Contributor

@csfet9 csfet9 commented Jan 5, 2026

Summary

This PR adds several high-value improvements to the Hindsight memory system:

1. Feedback Signal API

  • New API endpoint for tracking fact usefulness (POST /feedback)
  • Allows agents to signal which retrieved facts were helpful or not
  • Enables future improvements to retrieval ranking based on feedback
  • Includes database migration, tests, and documentation

2. Gemini 3 Flash Preview Optimizations

  • Added support for Gemini 3 Flash Preview model
  • Optimized configuration for better performance with this model

3. Empty LLM Response Handling

  • Added retry logic when LLM returns empty responses
  • Logs finish_reason for debugging
  • Exponential backoff between retries
  • Prevents failures from transient empty responses

Test plan

  • Run existing test suite: cd hindsight-api && uv run pytest tests/
  • Test feedback signal API endpoint manually
  • Verify Gemini provider works with new optimizations
  • Test empty response handling by simulating empty LLM responses

🤖 Generated with Claude Code

csfet9 and others added 29 commits December 30, 2025 16:12
- Add Anthropic as LLM provider with full async support
- Add LM Studio provider for local model inference
- Fix JSON response format compatibility for local models
- Update .env.example with configuration examples
- Update docstrings with all supported providers

Tested with:
- Claude Sonnet 4 (claude-sonnet-4-20250514)
- Claude Haiku 4.5 (claude-haiku-4-5-20251001)
- Qwen 30B via LM Studio
Add configurable timeout support for LLM API calls:
- Environment variable override via HINDSIGHT_API_LLM_TIMEOUT
- Dynamic heuristic for lmstudio/ollama: 20 mins for large models
  (30b, 33b, 34b, 65b, 70b, 72b, 8x7b, 8x22b), 5 mins for others
- Pass timeout to Anthropic, OpenAI, and local model clients

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove CLAUDE.md from .gitignore (should stay in repository)
- Pass max_completion_tokens to _call_anthropic instead of hardcoding 4096

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Provides project context and development commands for AI-assisted coding.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add docker-compose.yml for local development
- Add test_internal.py for local testing
- Sync uv.lock and llm_wrapper.py changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Move LLM config to config.py with HINDSIGHT_API_ prefix
  - Add HINDSIGHT_API_LLM_MAX_CONCURRENT (default: 32)
  - Add HINDSIGHT_API_LLM_TIMEOUT (default: 120s)
- Remove fragile model-size timeout heuristic
- Apply markdown JSON extraction to all providers, not just local
- Fix Anthropic markdown extraction bug (missing split)
- Change LLM request/response logs from info to debug level

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove test_internal.py (debug file)
- Remove docker-compose.yml (to be moved to hindsight-cookbook repo)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Implements a feedback signal system that tracks which recalled facts
are actually useful, enabling usefulness-boosted recall.

API Endpoints:
- POST /v1/default/banks/{bank_id}/signal - Submit feedback signals
- GET /v1/default/banks/{bank_id}/facts/{fact_id}/stats - Fact stats
- GET /v1/default/banks/{bank_id}/stats/usefulness - Bank stats

Features:
- Signal types: used, ignored, helpful, not_helpful
- Time-decayed scoring (5% decay per week)
- Usefulness-boosted recall with configurable weight
- Query pattern tracking for analytics

Database:
- fact_usefulness: Aggregate scores per fact
- usefulness_signals: Individual signal records
- query_pattern_stats: Pattern tracking

Documentation:
- Full API reference in hindsight-docs
- Python, Node.js, and cURL examples
- Updated recall.mdx with new parameters

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Adds a startup script that waits for dependencies (database and LLM Studio)
before launching the Hindsight API. Retries indefinitely by default, allowing
the container to start before LM Studio is available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Strip <think>, <thinking>, <reasoning>, and |startthink|/|endthink| tags
from reasoning model outputs to enable proper JSON parsing. This allows
local reasoning models like Qwen3 to work with Hindsight's structured
extraction pipeline.

Also adds slow call logging for Ollama native function and updates
reasoning model detection to include qwq family.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…artup

Container now waits for database and LLM Studio to be accessible before
starting Hindsight. Configurable via environment variables:
- HINDSIGHT_RETRY_MAX: Max retries (0 = infinite, default)
- HINDSIGHT_RETRY_INTERVAL: Seconds between retries (default 10)

Applied to all three Docker stages: api-only, cp-only, and standalone.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
When HINDSIGHT_API_DATABASE_URL is not set, the standalone container
uses embedded pg0 which starts with start-all.sh. The retry script
now detects this and skips the external database check.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Documents testing of Qwen3 8B/14B, Gemma 3, and NuExtract models
for Hindsight memory extraction on Apple Silicon. Includes:
- Benchmark results and performance comparisons
- Configuration recommendations
- Docker setup with retry-start script
- Troubleshooting guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
# Conflicts:
#	hindsight-api/hindsight_api/engine/llm_wrapper.py
#	uv.lock
- Add llm-comparison.py benchmark script to compare LLM providers
- Reduce max_completion_tokens from 65000 to 8192 for better local LLM compatibility
- Include benchmark results for Qwen3-8B and Claude Haiku 4.5

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add thinking_level parameter support (LOW/MEDIUM/HIGH) for Gemini 3 models
- Add temperature and max_output_tokens support for Gemini
- Improve rate limit handling with 10-120s backoff for 429 errors
- Add HINDSIGHT_API_LLM_THINKING_LEVEL env var (default: low)
- Include benchmark results comparing thinking levels

Performance with thinking_level=medium:
- 4.3x faster retain (4.5s vs 19.4s per memory)
- 92% fact extraction quality retained (33 vs 36 facts)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- LLM wrapper: detect and retry on empty responses (None or empty string)
- LLM wrapper: add detailed logging with finish_reason and safety ratings
- Docker startup: skip endpoint check for cloud providers (openai, anthropic, gemini, groq)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Resolved conflict in llm_wrapper.py:
- Kept upstream's more descriptive comment for thinking tag stripping
- Preserved local empty response handling with retry logic

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Implements a feedback signal system that tracks which recalled facts
are actually useful, enabling usefulness-boosted recall.

API Endpoints:
- POST /v1/default/banks/{bank_id}/signal - Submit feedback signals
- GET /v1/default/banks/{bank_id}/facts/{fact_id}/stats - Fact stats
- GET /v1/default/banks/{bank_id}/stats/usefulness - Bank stats

Features:
- Signal types: used, ignored, helpful, not_helpful
- Time-decayed scoring (5% decay per week)
- Usefulness-boosted recall with configurable weight
- Query pattern tracking for analytics

Database:
- fact_usefulness: Aggregate scores per fact
- usefulness_signals: Individual signal records
- query_pattern_stats: Pattern tracking

Documentation:
- Full API reference in hindsight-docs
- Python, Node.js, and cURL examples
- Updated recall.mdx with new parameters

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add thinking_level parameter support (LOW/MEDIUM/HIGH) for Gemini 3 models
- Add temperature and max_output_tokens support for Gemini
- Improve rate limit handling with 10-120s backoff for 429 errors
- Add HINDSIGHT_API_LLM_THINKING_LEVEL env var (default: low)
- Include benchmark results comparing thinking levels

Performance with thinking_level=medium:
- 4.3x faster retain (4.5s vs 19.4s per memory)
- 92% fact extraction quality retained (33 vs 36 facts)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- LLM wrapper: detect and retry on empty responses (None or empty string)
- LLM wrapper: add detailed logging with finish_reason and safety ratings
- Docker startup: skip endpoint check for cloud providers (openai, anthropic, gemini, groq)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The retry logic is already integrated into start-all.sh, making this
separate script unnecessary.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
These are local test results that shouldn't be in the upstream PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Copy link
Collaborator

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @csfet9 thanks for this big addition!

I've a more general question about this:
isn't the helpfulness related to the recall query and other parameters?

for example, if I ask the bank about "Alice" I might say that Bob facts weren't helpful but if I ask about Bob, that fact is indeed helpful (maybe)

So I think we're missing something that connects the helpfulness to the actual query

@csfet9
Copy link
Contributor Author

csfet9 commented Jan 5, 2026

Hey @nicoloboschi, great feedback! You're absolutely right - the helpfulness signal should be tied to the query context.

I've implemented a query-context aware scoring system:

Changes

  1. New query_fact_usefulness table - stores usefulness scores per (query_embedding, fact) pair instead of just per fact
  2. query field is now required in SignalItem - ensures every signal is tied to the query that triggered the recall
  3. Semantic matching for similar queries - uses cosine similarity (threshold 0.85) so "Who is the CEO?" and "Who is the chief executive?" share the same score context
  4. Hybrid scoring in recall - when boost_by_usefulness=true:
    - First looks for query-specific scores (similar queries)
    - Falls back to global scores for unseen query patterns
    - Falls back to 0.5 (neutral) for facts without any signals

Example

Fact: "Bob works at TechCorp"

Query "Who works at TechCorp?" → marked helpful → score = 0.65
Query "What's the weather?" → marked not_helpful → score = 0.40

Recall with "Tell me about TechCorp employees" → uses 0.65 (similar to first query)
Recall with "What's today's forecast?" → uses 0.40 (similar to second query)

This way, the same fact can have different usefulness scores depending on the query context.

Currently testing locally - will push once verified.

Want me to adjust anything?

@nicoloboschi
Copy link
Collaborator

Hey @nicoloboschi, great feedback! You're absolutely right - the helpfulness signal should be tied to the query context.

I've implemented a query-context aware scoring system:

Changes

  1. New query_fact_usefulness table - stores usefulness scores per (query_embedding, fact) pair instead of just per fact
  2. query field is now required in SignalItem - ensures every signal is tied to the query that triggered the recall
  3. Semantic matching for similar queries - uses cosine similarity (threshold 0.85) so "Who is the CEO?" and "Who is the chief executive?" share the same score context
  4. Hybrid scoring in recall - when boost_by_usefulness=true:
    • First looks for query-specific scores (similar queries)
    • Falls back to global scores for unseen query patterns
    • Falls back to 0.5 (neutral) for facts without any signals

Example

Fact: "Bob works at TechCorp"

Query "Who works at TechCorp?" → marked helpful → score = 0.65 Query "What's the weather?" → marked not_helpful → score = 0.40

Recall with "Tell me about TechCorp employees" → uses 0.65 (similar to first query) Recall with "What's today's forecast?" → uses 0.40 (similar to second query)

This way, the same fact can have different usefulness scores depending on the query context.

Currently testing locally - will push once verified.

Want me to adjust anything?

thanks, that looks better!

can you share your use case for this feature and what is the pattern you're building? it looks like you want some human-in-the-loop mechanism and I'd love to hear more about how you intend to use hindsight there

@csfet9
Copy link
Contributor Author

csfet9 commented Jan 5, 2026

Thanks, Here's the use case:

I'm building an automatic feedback loop for Claude Code where the system learns from Claude's actual behavior.

How it works

  1. Detection - After recall, analyze Claude's response to detect which facts were actually used (semantic similarity, explicit references like "based on context", behavioral signals like file access)
  2. Signal - Send used/ignored verdicts back to Hindsight via /signal
  3. Learn - Use boost_by_usefulness in future recalls to prioritize facts that worked before

Why query-context matters

A fact like "Bob is a senior engineer" might be helpful for "Who can help with code?" but irrelevant for "What's the deadline?". Without query-context, one "helpful" signal would boost it for ALL queries. The semantic matching ensures similar queries share scores while different query types stay separate.

The human-in-the-loop is mostly implicit - Claude's usage patterns provide the signal automatically, no manual thumbs up/down needed.

csfet9 and others added 2 commits January 5, 2026 14:42
- Make query field required in SignalItem for context-aware scoring
- Add query_fact_usefulness table with HNSW index for semantic matching
- Store query embeddings with signals for similarity-based grouping
- Use query-specific scores with global fallback during recall
- Similar queries (cosine similarity >= 0.85) share scores
- Update documentation and examples with required query field

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Replace separate env var with existing reasoning_effort parameter
for Gemini 3 thinking_level configuration. This unifies the config
across providers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@csfet9
Copy link
Contributor Author

csfet9 commented Jan 5, 2026

Updates pushed:

Query-Context Aware Scoring (addresses your feedback)

  • Made query field required in SignalItem
  • New query_fact_usefulness table stores scores per (query_embedding, fact) pair
  • Semantic matching (cosine similarity ≥ 0.85) groups similar queries
  • Hybrid scoring: query-specific → global fallback → 0.5 neutral
  • Updated all docs and examples with required query field
  • Added tests for context-aware scoring

Review comment fixes
Gemini 3 thinking level now uses reasoning_effort parameter instead of separate env var
Query field already in examples (part of context-aware scoring commit)

Ready for re-review!

csfet9 and others added 10 commits January 5, 2026 17:06
Resolve conflict in config.py:
- Keep Groq service tier from upstream (vectorize-io#102)
- Remove unused LLM thinking level env var (now using reasoning_effort)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Read HINDSIGHT_API_LLM_THINKING_LEVEL from environment in all LLMProvider
factory methods (for_memory, for_answer_generation, for_judge) instead of
hardcoding the reasoning_effort value.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Features merged:
- Query-context aware feedback scoring for improved recall relevance
- Groq service tier configuration (ENV_LLM_GROQ_SERVICE_TIER)
- Configurable thinking level for Gemini 3 via reasoning_effort
- OpenAI embeddings support with configurable dimensions
- GitHub issue templates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The file was referenced in Dockerfile but was missing from the repo.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The retry logic was merged into start-all.sh. Updated Dockerfile from
upstream which uses start-all.sh directly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
… 500

Add Pydantic field_validator to SignalItem.fact_id to validate UUID format
at the API layer. Invalid UUIDs now receive a proper 422 Validation Error
instead of causing a 500 Internal Server Error when the database rejects them.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…tion

Merged three changes:
- llm_wrapper.py: Combined Gemini temperature/max_completion_tokens (from HEAD)
  with return_usage parameter (from upstream)
- fact_extraction.py: Use config.retain_max_completion_tokens instead of
  hardcoded 8192 for configurable token limits

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The `<0.5s` was being interpreted as a JSX tag by the MDX parser.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants