Build a knowledge system that never responds without citations
A RAG system that can:
- Ingest documents from the `../../data/corpus/` directory
- Answer questions about AI and computing topics
- Always cite sources - never respond without attribution; if no sources are used, say so
- Monitor performance with Langfuse, logging the prompt, retrieved chunks, and response
- Evaluate responses using LLM-as-a-judge in real time
- Citation Compliance Check: 100% of responses must include source attribution
- Retrieval Quality Assessment: Measure precision/recall of chunk retrieval
Your Mission: Load, chunk, and embed all corpus documents in one go.
What to Build:
- Document loader for all `.md` files in `data/corpus/ai/` and `data/corpus/computing/`
- Chunking system (800-token pieces, 200-token overlap) with metadata preservation
- Vector database: Pinecone (simplest - web-based), Chroma (local), or SQLite-vec (lightweight)
- Embeddings: OpenAI text-embedding-3-large or Hugging Face sentence-transformers
- Basic search functionality that retrieves relevant chunks
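The chunking step above can be sketched as follows. This is a minimal, self-contained illustration: it uses a whitespace tokenizer as a stand-in for `tiktoken` (swap in `tiktoken`'s encoder for real token counts), and the filename is hypothetical.

```python
def chunk_document(text, source, chunk_size=800, overlap=200):
    """Split text into overlapping token chunks, tagging each with its source file."""
    tokens = text.split()  # stand-in for tiktoken tokens
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + chunk_size]
        if not piece:
            break
        chunks.append({
            "text": " ".join(piece),
            # Metadata preservation: keep the source path so responses can cite it later
            "metadata": {"source": source, "start_token": start},
        })
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = ("word " * 2000).strip()
chunks = chunk_document(doc, "ai/transformers.md")  # illustrative filename
```

A 2000-token document with 800-token chunks and 200-token overlap yields three chunks (starting at tokens 0, 600, and 1200), each carrying the source path needed for citations downstream.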
🎮 Side Quest: Try semantic chunking and hybrid search (vector + keyword matching).
Your Mission: Build citation-aware response generation with monitoring.
What to Build:
- Citation-aware prompt engineering that never responds without sources
- Response validation (no citations = "I don't know")
- Langfuse integration for logging prompts, chunks, and responses
- Basic test question set (10+ questions from your corpus)
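One way to make the prompt citation-aware is to label every chunk with its source file and instruct the model to refuse when the sources don't contain the answer. The template below is a sketch, not a prescribed format; the chunk dict shape and filename are assumptions.

```python
CITATION_PROMPT = """You are a research assistant. Answer ONLY from the sources below.
Every claim must end with a citation marker like [Source: filename.md].
If the sources do not contain the answer, reply exactly:
"I don't have enough information to answer that."

Sources:
{context}

Question: {question}
Answer:"""

def build_prompt(question, chunks):
    # Prefix each chunk with its source so the model can copy the marker verbatim
    context = "\n\n".join(
        f"[Source: {c['metadata']['source']}]\n{c['text']}" for c in chunks
    )
    return CITATION_PROMPT.format(context=context, question=question)

chunks = [{"text": "RAG retrieves documents before generating.",
           "metadata": {"source": "rag_intro.md"}}]  # illustrative chunk
prompt = build_prompt("What is RAG?", chunks)
```

Putting the exact refusal string in the prompt makes the downstream validation step a simple string check rather than a fuzzy match.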
🎮 Side Quest: Experiment with different citation formats in Langfuse playground.
📸 Citation Example: See citation example.png for a real-world example from InsightMesh showing proper source attribution in action.
Your Mission: Implement both core evaluations and LLM-as-a-judge.
What to Build:
- Citation Compliance Check (automated validation of 100% attribution rate)
- Retrieval Quality Assessment (precision/recall metrics for chunk retrieval)
- LLM-as-a-judge pipeline for response quality scoring
- Expanded test dataset (20+ questions)
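For the LLM-as-a-judge pipeline, a robust pattern is to ask the judge model for a small JSON verdict and parse it defensively, failing closed when the reply is malformed. The judge reply below is mocked; in the real pipeline it would come from an LLM call, and the prompt wording is an assumption.

```python
import json
import re

JUDGE_PROMPT = """Rate the answer 1-5 for faithfulness to the cited sources.
Respond with JSON: {"score": <int>, "reasoning": "<one sentence>"}"""

def parse_judge_reply(reply):
    """Extract the first JSON object from the judge's reply; fail closed on garbage."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return {"score": None, "reasoning": "unparseable judge reply"}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"score": None, "reasoning": "unparseable judge reply"}

# Mocked judge output; a real call would send JUDGE_PROMPT plus the answer to an LLM
verdict = parse_judge_reply('Sure! {"score": 4, "reasoning": "Mostly faithful."}')
bad = parse_judge_reply("no json here")
```

Failing closed (score `None`) means a flaky judge reply shows up in your metrics instead of silently passing.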
🎮 Side Quest: Build adversarial test questions and a retrieval debugger.
Your Mission: Polish your system and document learnings.
What to Build:
- Performance monitoring dashboard in Langfuse
- Error handling for edge cases (empty results, API failures)
- A/B testing setup for different prompt templates
- Documentation of what worked, what didn't, and key learnings
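For the A/B testing setup, one simple approach is to bucket queries into prompt variants by hashing, so the same query always hits the same arm and results are reproducible. The variant names and templates here are placeholders.

```python
import hashlib

# Hypothetical prompt variants to compare
TEMPLATES = {"A": "Cite as [Source: {f}].", "B": "End each sentence with ({f})."}

def assign_variant(query, variants=("A", "B")):
    """Deterministically bucket a query into a prompt variant via hashing."""
    digest = hashlib.sha256(query.encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

v = assign_variant("What is machine learning?")
```

Logging the assigned variant alongside each Langfuse trace lets you compare citation rates and judge scores per template.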
🎮 Side Quest: Test different chunking strategies and embedding models.
Purpose: Ensure 100% of responses include proper source attribution.
How it Works:
- For every response, check if it contains citation markers like `[Source: filename.md]`
- If no citation is found, check whether the response is "I don't have enough information" or similar
- Flag any response that makes claims without citations
Success Metric: 100% compliance rate - zero tolerance for uncited responses
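The check above can be sketched as a single function: a response passes only if it cites a file that actually exists in the corpus, or explicitly refuses. The marker format and refusal phrases are assumptions taken from the examples in this guide.

```python
import re

# Assumed marker format: [Source: filename.md]
CITATION_RE = re.compile(r"\[Source:\s*([\w./-]+\.md)\]")
REFUSALS = ("i don't have enough information", "i don't know")

def check_compliance(response, known_files):
    """Pass if the response cites a real source file or explicitly refuses."""
    cited = CITATION_RE.findall(response)
    if cited:
        # Validate citations against actual corpus files, not just the format
        return all(f in known_files for f in cited)
    return any(phrase in response.lower() for phrase in REFUSALS)

known = {"transformers.md", "ai_bias.md"}  # illustrative corpus listing
ok_cited = check_compliance("Attention is key. [Source: transformers.md]", known)
ok_refused = check_compliance("I don't have enough information to answer that.", known)
flagged = check_compliance("Attention is key.", known)  # uncited claim
```

Running this on every response before it is returned (and logging the result to Langfuse) is what turns the 100% target into an enforced invariant rather than a hope.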
🎮 Side Quest: Build a regex pattern that catches different citation formats and validates them against your actual source files.
Purpose: Measure how well your system finds relevant information.
How it Works:
- For each test question, manually identify which corpus files contain the answer
- Compare your system's retrieved chunks against the "golden" relevant files
- Calculate precision (% of retrieved chunks that are relevant) and recall (% of relevant info that was retrieved)
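The precision/recall calculation above reduces to set arithmetic over retrieved versus golden source files. The filenames below are illustrative.

```python
def retrieval_metrics(retrieved, golden):
    """Precision/recall of retrieved source files against the golden set."""
    retrieved, golden = set(retrieved), set(golden)
    hits = retrieved & golden
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(golden) if golden else 0.0
    return precision, recall

# One question: the system pulled 4 files, 3 truly relevant, and missed 1 golden file
p, r = retrieval_metrics(
    ["rag_intro.md", "transformers.md", "ai_bias.md", "turing.md"],
    ["rag_intro.md", "transformers.md", "ai_bias.md", "embeddings.md"],
)
```

Here 3 of 4 retrieved files are relevant (precision 0.75) and 3 of 4 golden files were found (recall 0.75); averaging these per-question scores across the test set gives the headline metrics.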
Success Metrics:
- Precision >80% (most retrieved chunks should be useful)
- Recall >70% (shouldn't miss important information)
- Top-3 retrieval should contain the answer for 90% of answerable questions
🎮 Side Quest: Create a "retrieval debugger" that shows you exactly which chunks were retrieved for each question, so you can spot patterns in what your system misses.
Create Test Questions from Your Corpus:
- Write 20+ questions that can be answered from your AI and computing materials
- Include easy questions ("What is machine learning?") and hard ones ("How did WWII influence computer development?")
- Add trick questions that can't be answered from your sources
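A lightweight structure for the test set records each question's golden files and whether it is answerable at all, so the same dataset drives both the retrieval metrics and the refusal checks. Filenames here are illustrative stand-ins for your corpus.

```python
TEST_SET = [
    {"question": "What is machine learning?",
     "golden_files": ["ml_basics.md"], "answerable": True},
    {"question": "How did WWII influence computer development?",
     "golden_files": ["computing_history.md"], "answerable": True},
    {"question": "What is the capital of France?",
     "golden_files": [], "answerable": False},  # trick question: not in corpus
]

# Trick questions should produce the refusal response, not a fabricated answer
unanswerable = [q for q in TEST_SET if not q["answerable"]]
```

Keeping `golden_files` per question is what makes the precision/recall evaluation automatic later.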
🎮 Side Quest: Create questions that require combining information from multiple files. Can your system cite both ai_bias.md and transformers.md in one response?
By the end of your 2-week sprint, you should have:
- 100% Citation Rate: Every response includes source attribution or says "I don't know"
- >75% Accuracy: On your test question set
- Langfuse Integration: All queries logged and monitored
- LLM Judge Working: Automated evaluation pipeline
- Corpus Coverage: Can answer questions from both AI and computing materials
Required Dependencies:
- `openai` - For embeddings and LLM calls
- `langchain` - For document processing
- `langfuse` - For monitoring and evaluation
- `tiktoken` - For token counting
Vector Database Options (choose one):
- `pinecone-client` - Recommended for beginners (web-based, no setup)
- `chromadb` - For local development
- `sqlite-vec` - Lightweight embedded option
Embedding Options (choose one):
- OpenAI `text-embedding-3-small` - Strong quality at low cost (~$0.02/1M tokens); `text-embedding-3-large` is higher quality at a higher price
- Hugging Face sentence-transformers - Free, runs locally (e.g., `all-MiniLM-L6-v2`)
Environment Variables:
- `OPENAI_API_KEY` - Your OpenAI API key (required for embeddings/LLM)
- `PINECONE_API_KEY` - Your Pinecone API key (if using Pinecone)
- `LANGFUSE_SECRET_KEY` - Your Langfuse secret key
- `LANGFUSE_PUBLIC_KEY` - Your Langfuse public key
- `LANGFUSE_HOST` - Usually `https://cloud.langfuse.com`
From Your Corpus:
- RAG Introduction - Start here for basic concepts
- RAG Challenges - Common problems you'll face
- Prompt Engineering - Essential for citation prompts
External Resources:
- Langfuse Docs - For monitoring setup
- Chroma Docs - Vector database guide
- OpenAI Embeddings - Best practices
1. Citations Are Non-Negotiable
- Never let your system respond without proper source attribution
- "I don't have enough information" is better than an uncited answer
- Validate every response before returning it
2. Test with Your Own Data
- Your corpus has rich AI and computing content - use it!
- Create questions that span multiple documents
- Test edge cases where no good answer exists
3. Monitor Everything
- Use Langfuse to track query patterns and response quality
- Set up alerts for low citation rates or poor accuracy
- Experiment with different prompts and measure the results
- Set up your development environment with the dependencies listed above
- Start with document loading - get your corpus ingested first
- Build incrementally - test each component as you build it
- Use AI assistance - ask Claude/GPT to help with implementation details, architecture, testing, and design; lean into the collaboration
- Focus on citations - this is your #1 priority throughout
Once you have a working RAG system with 100% citation compliance and >75% accuracy, move to Phase 2: Voice Interface in ../2. Voice/README.md.
You'll connect your RAG system to a voice interface using Pipecat or OpenAI's Realtime API!
Remember: The goal is learning by building, not perfect code on the first try. Use your AI coding assistant liberally and focus on getting something working quickly! 🚀