This is a standalone, production-ready example of the RAG (Retrieval-Augmented Generation) system that powers the Trainer-Teacher chat interface.
This FastAPI application showcases four key RAG optimizations:
- Dynamic Page Filtering - Reduces irrelevant context by ~70% through section-specific search
- History-Aware Retrieval - LLM reformulates vague follow-ups using conversation history
- Streaming Responses - Real-time token delivery via Server-Sent Events (SSE)
- Redis Session Persistence - <10ms conversation history retrieval across sessions
```
backend-sample/
├── main.py            # FastAPI application with heavily commented code
├── requirements.txt   # Python dependencies
├── .env.example       # Environment variables template
└── README.md          # This file
```
- Python 3.9+
- OpenAI API key
- Pinecone account (free tier available)
- Redis instance (Upstash recommended for serverless)
1. Clone and navigate to this directory

   ```bash
   cd backend-sample
   ```

2. Create virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables

   ```bash
   cp .env.example .env
   # Edit .env with your actual API keys
   ```

5. Prepare Pinecone index
   - Create a Pinecone index named `ethic-teacher`
   - Dimension: `1536` (for OpenAI text-embedding-3-small)
   - Metric: `cosine`
   - Upload your vectorized documents with metadata (see the upsert sketch after these steps):

     ```json
     {
       "source": "path/to/document.pdf",
       "page": 15,
       "text": "Document content..."
     }
     ```
6. Run the server

   ```bash
   uvicorn main:app --reload --host 0.0.0.0 --port 8000
   ```

   Server runs at: http://localhost:8000
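For step 5, the snippet below sketches one way to create the index and upsert documents with the expected metadata. It assumes the current Pinecone Python SDK and LangChain's Pinecone integration; the cloud/region, document contents, and variable names are placeholders, not code from main.py.

```python
import os

from pinecone import Pinecone, ServerlessSpec
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

pc = Pinecone(api_key=os.environ["PINECONE_API"])

# Index settings must match the embedding model: 1536 dimensions, cosine metric
pc.create_index(
    name="ethic-teacher",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # choose your own cloud/region
)

# Each chunk carries the metadata the API later filters on: source path and page number
docs = [
    Document(
        page_content="Document content...",
        metadata={"source": "path/to/document.pdf", "page": 15},
    ),
]

vector_store = PineconeVectorStore(
    index=pc.Index("ethic-teacher"),
    embedding=OpenAIEmbeddings(
        model="text-embedding-3-small",
        openai_api_key=os.environ["OPENAI_KEY"],
    ),
)
vector_store.add_documents(docs)
```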
POST /stream_chat: Stream RAG responses with context-aware retrieval.
Request:
```json
{
  "message": "What is informed consent?",
  "session_id": "user123_teacher456",
  "pages": [10, 11, 12, 13, 14, 15],
  "doc_path": "documents/ethics_course.pdf"
}
```

Response: Server-Sent Events stream of text chunks.
cURL Example:
```bash
curl -X POST "http://localhost:8000/stream_chat" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain business ethics",
    "session_id": "test_session",
    "pages": [1, 2, 3, 4, 5],
    "doc_path": "ethics.pdf"
  }'
```
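On the FastAPI side, the request body above corresponds to a simple Pydantic model. The class below is an illustrative sketch; the actual model name and validation rules in main.py may differ.

```python
from pydantic import BaseModel

class ChatRequest(BaseModel):
    # Mirrors the request JSON: question text, session key, section pages, document path
    message: str
    session_id: str
    pages: list[int]
    doc_path: str
```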
Health check endpoint.

Response:
```json
{
  "status": "healthy",
  "service": "RAG Educational Chat"
}
```

DELETE /session/{session_id}: Clear conversation history for a session.
Example:
```bash
curl -X DELETE "http://localhost:8000/session/test_session"
```

Dynamic Page Filtering: instead of searching the entire document, we filter by the current section's pages:
```python
retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 14,  # Top-14 most relevant chunks
        "filter": {
            "source": "ethics.pdf",
            "page": {"$in": [15, 16, 17, 18, 19, 20]}  # Section 3 only
        }
    }
)
```

Impact: When a student is in "Section 3" (pages 15-20), only those pages are searched instead of all 100 pages.
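In the live endpoint, the hard-coded source and pages above come from the request body (`doc_path` and `pages`). A small illustrative helper; the function name is hypothetical, not necessarily the one in main.py:

```python
def build_section_retriever(vector_store, doc_path: str, pages: list[int]):
    """Scope the vector search to the pages of the section the student is viewing."""
    return vector_store.as_retriever(
        search_kwargs={
            "k": 14,  # top-14 most relevant chunks
            "filter": {
                "source": doc_path,      # e.g. "ethics.pdf"
                "page": {"$in": pages},  # e.g. [15, 16, 17, 18, 19, 20]
            },
        }
    )
```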
History-Aware Retrieval: the LLM reformulates vague follow-ups using conversation history. For example:

User: "What is informed consent?"
AI: [responds with context]
User: "Can you give me an example?"
System reformulates internally to:
"Can you give an example of informed consent in business ethics?"
[Then retrieves with full context]
This maintains conversation continuity across multi-turn interactions.
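This reformulation step maps naturally onto LangChain's `create_history_aware_retriever`. A minimal sketch under that assumption, reusing the page-filtered `retriever` from above and an already-constructed `llm`; the prompt wording is illustrative, not the exact one in main.py:

```python
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Prompt that asks the LLM to rewrite a follow-up as a standalone question,
# using the prior turns supplied via "chat_history".
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Given the chat history and the latest user question, rewrite the question "
     "so it can be understood without the history. Do not answer it."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Wraps the page-filtered retriever: vague follow-ups like "Can you give me
# an example?" are expanded before the vector search runs.
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)
```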
Streaming Responses: tokens are yielded to the client as soon as the LLM produces them.

```python
async for chunk in streaming_chain.astream({}):
    response_text += chunk
    yield chunk  # stream to client immediately
```

Users see responses appear token-by-token (ChatGPT-style), improving perceived latency.
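In FastAPI, a generator like this is typically wrapped in a `StreamingResponse` with the SSE media type. A simplified sketch, not the exact handler in main.py: `streaming_chain` is assumed to be built per request as described above, and `ChatRequest` is the request model sketched earlier.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/stream_chat")
async def stream_chat(request: ChatRequest):
    async def token_stream():
        # streaming_chain: per-request chain (retrieval + prompt + LLM), assumed
        # to be constructed in the handler as outlined in the sections above
        async for chunk in streaming_chain.astream({}):
            yield chunk  # each token is flushed to the client as it arrives

    # "text/event-stream" tells the client to treat this as Server-Sent Events
    return StreamingResponse(token_stream(), media_type="text/event-stream")
```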
Redis Session Persistence: conversation history is stored under a session-scoped key.

```python
# Save to Redis with a session-scoped key
redis_client.set(
    f"chat_history:{session_id}",  # e.g. "chat_history:user123_teacher456"
    json.dumps(messages)
)

# Retrieve in <10ms
history = redis_client.get(f"chat_history:{session_id}")
```

Conversations survive page reloads, navigation, and server restarts.
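Because the stack keeps history as LangChain messages (see Tech Stack below), one way to round-trip them through Redis is LangChain's dict serializers. A minimal sketch with hypothetical helper names (`load_history`/`save_history`), not the exact code in main.py:

```python
import json

from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.messages import messages_from_dict, messages_to_dict

def load_history(redis_client, session_id: str) -> ChatMessageHistory:
    """Rebuild a ChatMessageHistory from the JSON stored under the session key."""
    raw = redis_client.get(f"chat_history:{session_id}")
    history = ChatMessageHistory()
    if raw:
        history.messages = messages_from_dict(json.loads(raw))
    return history

def save_history(redis_client, session_id: str, history: ChatMessageHistory) -> None:
    """Serialize the updated conversation back to JSON for the next request."""
    redis_client.set(
        f"chat_history:{session_id}",
        json.dumps(messages_to_dict(history.messages)),
    )
```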
A typical request flows through these steps:

1. User sends question + session_id + pages filter
2. Load conversation history from Redis
3. LLM reformulates question using history context
4. Vector search with page filtering (k=14, pages=[...])
5. Retrieve relevant document chunks
6. Stream LLM response token-by-token
7. Save updated conversation to Redis
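A condensed sketch of how these steps compose, reusing the helpers sketched in the sections above (`build_section_retriever`, `contextualize_prompt`, `load_history`/`save_history`); names and prompts are illustrative, not the exact code in main.py:

```python
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

async def run_rag(request, llm, vector_store, redis_client):
    history = load_history(redis_client, request.session_id)        # steps 1-2

    retriever = build_section_retriever(                             # step 4
        vector_store, request.doc_path, request.pages
    )
    history_aware = create_history_aware_retriever(                  # step 3
        llm, retriever, contextualize_prompt
    )

    answer_prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using the following context:\n\n{context}"),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])
    rag_chain = create_retrieval_chain(                              # step 5
        history_aware, create_stuff_documents_chain(llm, answer_prompt)
    )

    response_text = ""
    async for chunk in rag_chain.astream(                            # step 6
        {"input": request.message, "chat_history": history.messages}
    ):
        token = chunk.get("answer", "")
        if token:
            response_text += token
            yield token

    history.add_user_message(request.message)                        # step 7
    history.add_ai_message(response_text)
    save_history(redis_client, request.session_id, history)
```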
Performance characteristics:

- Context noise reduction: Estimated ~70% via page filtering (compared to full-document search)
- Session retrieval: Redis provides sub-10ms latency for typical chat history sizes
- Response delivery: Token-by-token streaming reduces perceived latency vs. buffered responses
- Scalability: Stateless API design allows horizontal scaling; Redis handles concurrent sessions efficiently
Environment variables (template in .env.example):

| Variable | Description | Example |
|---|---|---|
| `OPENAI_KEY` | OpenAI API key for embeddings & LLM | `sk-proj-...` |
| `PINECONE_API` | Pinecone API key for the vector store | `xxxxxxxx-xxxx-...` |
| `REDIS_URL` | Redis connection URL (Upstash format) | `rediss://default:...` |
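These are typically loaded at startup from `.env`. A minimal sketch assuming python-dotenv; the exact loading code in main.py may differ:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

OPENAI_KEY = os.environ["OPENAI_KEY"]      # embeddings + chat model
PINECONE_API = os.environ["PINECONE_API"]  # vector store
REDIS_URL = os.environ["REDIS_URL"]        # session persistence
```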
- API Framework: FastAPI (async, high-performance)
- LLM Integration: LangChain with OpenAI GPT-4o-mini
- Vector Store: Pinecone (managed vector database)
- Embeddings: OpenAI text-embedding-3-small (1536 dimensions)
- Session Storage: Redis (Upstash recommended)
- Message History: LangChain's `ChatMessageHistory`
- This sample uses real production code with enhanced documentation
- The main application has additional features (PDF ingestion, multi-document support, analytics)
- For full implementation details, contact the repository owner
Test with a simple Python client:
```python
import requests

response = requests.post(
    "http://localhost:8000/stream_chat",
    json={
        "message": "What is business ethics?",
        "session_id": "test_session",
        "pages": [1, 2, 3, 4, 5],
        "doc_path": "ethics.pdf"
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    if chunk:
        print(chunk, end='', flush=True)
```

Built with production-grade RAG patterns for educational AI applications