SmartEdu AI — RAG-Powered MCQ Generator

SmartEdu AI V2 is a lightweight system that turns your study materials (PDF/DOCX/PPTX) into simple, meaningful multiple‑choice questions (MCQs). It uses a Retrieval‑Augmented Generation (RAG) pipeline built around Google Gemini for both embeddings and text generation, with a minimal in‑memory vector store for fast, local retrieval.

What You Get

  • Upload course content and generate MCQs in JSON.
  • Optional focus prompts (e.g., “Unit 2 only”, “definitions”).
  • A standalone Streamlit demo, plus a Flask API that serves the React frontend.
  • Simple feedback generation endpoint to turn results into structured feedback.

Tech Stack

  • Backend: Flask API + Streamlit demo
  • LLM + Embeddings: Google Gemini (gemini-2.5-flash, text-embedding-004)
  • Chunking: LangChain RecursiveCharacterTextSplitter
  • Vector store: In‑memory NumPy arrays + cosine similarity
  • Frontend: React (Vite)

RAG Architecture (as implemented in rag.py)

The core pipeline follows classic RAG, kept intentionally simple and transparent:

  1. File Ingestion

    • Supported: PDF, DOCX, PPTX.
    • extract_text() reads the file and returns raw text using PyPDF2, docx2txt, and python-pptx.
  2. Chunking

    • Uses RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200) to break the document into coherent chunks.
    • Purpose: Keep context windows small and semantically tight for retrieval.
  3. Embeddings & Index

    • embed_texts(texts) calls Gemini text-embedding-004:batchEmbedContents and collects embeddings for each chunk.
    • build_index(chunks) wraps the embeddings into a NumPy matrix and returns (chunks, embeddings).
    • No external DB: it’s a simple, in‑memory vector store intended for single‑document workflows.
  4. Retrieval

    • retrieve_top_k(query, chunks, embeddings, k=5) embeds the query and computes cosine similarity against all chunk vectors.
    • Returns the top‑k chunk texts as the context for generation (see the sketch after this list).
  5. Generation (Strictly Context‑Bound)

    • call_gemini(prompt) sends a carefully structured prompt to gemini-2.5-flash, asking for MCQs only from the retrieved context.
    • Output is validated as JSON; the code trims any ```json fences and parses with json.loads (also sketched after this list).
  6. Streamlit UX

    • The Streamlit page lets you upload a file, build the RAG index, and trigger MCQ generation.
    • Stores rag_chunks and rag_embeddings in st.session_state after index build.
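
Condensed into code, steps 2 through 4 look roughly like the sketch below. This is a minimal illustration under stated assumptions, not a copy of rag.py: the chunk_text helper name, the requests-based HTTP call, and the request/response shapes follow the public Generative Language API (and the splitter's import path varies across LangChain versions), while embed_texts, build_index, and retrieve_top_k mirror the function names above.

import os

import numpy as np
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter

API_KEY = os.environ["GEMINI_API_KEY"]
EMBED_URL = ("https://generativelanguage.googleapis.com/v1beta/"
             "models/text-embedding-004:batchEmbedContents")

def chunk_text(text):
    # Splitter settings described above: 1500-char chunks, 200-char overlap.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
    return splitter.split_text(text)

def embed_texts(texts):
    # One batch request; the response carries one embedding per input, in order.
    body = {"requests": [
        {"model": "models/text-embedding-004",
         "content": {"parts": [{"text": t}]}} for t in texts]}
    resp = requests.post(EMBED_URL, params={"key": API_KEY}, json=body, timeout=60)
    resp.raise_for_status()
    return np.array([e["values"] for e in resp.json()["embeddings"]])

def build_index(chunks):
    # The whole "vector store": the chunk list plus a NumPy embedding matrix.
    return chunks, embed_texts(chunks)

def retrieve_top_k(query, chunks, embeddings, k=5):
    # Embed the query and rank every chunk by cosine similarity.
    q = embed_texts([query])[0]
    denom = np.maximum(np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q), 1e-10)
    sims = embeddings @ q / denom          # 1e-10 floor guards zero-norm vectors
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]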
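
Step 5 then prompts gemini-2.5-flash with the retrieved chunks and cleans the reply before json.loads. Another hedged sketch, continuing the code above; parse_mcq_json, document_text, and the prompt wording are illustrative, not the repository's exact code:

import json

GEN_URL = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-2.5-flash:generateContent")

def call_gemini(prompt):
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    resp = requests.post(GEN_URL, params={"key": API_KEY}, json=body, timeout=120)
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]

def parse_mcq_json(raw):
    # Models often wrap JSON in ```json ... ``` fences; strip them before parsing.
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]     # drop the opening ```json line
        cleaned = cleaned.rsplit("```", 1)[0]   # drop the closing fence
    return json.loads(cleaned)

document_text = "...raw text from extract_text()..."    # placeholder document
chunks, embeddings = build_index(chunk_text(document_text))
context = "\n\n".join(retrieve_top_k("photosynthesis basics", chunks, embeddings))
mcqs = parse_mcq_json(call_gemini(
    "Using ONLY the context below, write 5 MCQs as a JSON object.\n\nContext:\n" + context))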

Why this design?

  • Keeps the system simple and inspectable.
  • Minimizes hallucinations by strictly prompting from retrieved chunks.
  • Avoids infra overhead (no external vector DB) while staying fast for single uploads.

Text Flow at a Glance

File → Text → Chunks → Embeddings → In‑Memory Index → Query Embedding → Top‑K Chunks → Prompt → MCQ JSON

API Overview (Flask)

The Flask app in Backend/app.py exposes two main endpoints:

  • POST /generate_mcq

    • Form‑data: file (PDF/DOCX/PPTX), num_questions (int, default 10), user_focus (string, optional).
    • Pipeline: extract → chunk → embed → retrieve → prompt → JSON MCQs.
    • Response: { "mcqs": { ... } }, or an error with the raw model output if the LLM returns invalid JSON.
  • POST /generate_feedback

    • Body: MCQ results JSON (see Backend/sample_result.json).
    • Uses feedback.py to convert free‑form LLM text into a clean 5‑section feedback JSON.
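
For a quick smoke test from the command line (assuming app.py listens on Flask's default http://localhost:5000, and with notes.pdf standing in for your own file; adjust both to match your setup):

curl -X POST http://localhost:5000/generate_mcq \
  -F "file=@notes.pdf" \
  -F "num_questions=5" \
  -F "user_focus=Unit 2 only"

curl -X POST http://localhost:5000/generate_feedback \
  -H "Content-Type: application/json" \
  --data @Backend/sample_result.json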

Environment & Setup

  1. Requirements

    • Python 3.10+
    • Node.js 18+
    • A valid Google Gemini API key
  2. Environment variable

    • Create a .env file in Backend/ with:

      GEMINI_API_KEY=your_api_key_here
  3. Python dependencies

    • From Backend/:

      python -m venv .venv
      source .venv/bin/activate   # macOS/Linux; on Windows: .venv\Scripts\activate
      pip install -r requirements.txt
  4. Frontend dependencies

    • From Frontend/eduai/:

      npm install

Run It

Option A — Streamlit demo (quickest way to try RAG locally):

cd Backend
streamlit run rag.py

Option B — Flask API (for the React frontend):

cd Backend
python app.py

Then start the frontend:

cd Frontend/eduai
npm run dev

Notable Implementation Details

  • Gemini endpoints:
    • Generation: gemini-2.5-flash:generateContent
    • Embeddings: text-embedding-004:batchEmbedContents
  • JSON hygiene: both Streamlit and Flask flows strip markdown fences and validate JSON before responding.
  • Cosine similarity: a small helper computes similarity on NumPy arrays and guards against zero‑norm vectors.
  • Error handling: friendly st.error messages in Streamlit; Flask returns HTTP errors with details.

Limitations & Next Steps

  • In‑memory index: great for single uploads, but not persistent. Consider plugging in a vector DB (FAISS, Chroma, pgvector) for multi‑document collections; a FAISS sketch follows this list.
  • PDF text extraction: scanned PDFs may require OCR (e.g., Tesseract) to get reliable text.
  • Strict JSON: if the model drifts from the format, retries or function‑calling patterns can improve reliability.
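
If you do swap in a persistent store, the retrieval step maps over almost unchanged. A minimal sketch with FAISS (one possible direction, not part of this repo; L2‑normalizing the vectors makes inner‑product search equivalent to cosine similarity):

import faiss
import numpy as np

def build_faiss_index(embeddings):
    # Normalize so IndexFlatIP scores behave like cosine similarity.
    vecs = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(vecs)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index

def faiss_top_k(index, chunks, query_vec, k=5):
    # Same contract as retrieve_top_k: return the k best-matching chunk texts.
    q = np.ascontiguousarray(query_vec, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]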

Project Structure

Backend/
  app.py            # Flask API exposing MCQ + feedback endpoints
  auth.py           # Authentication helpers (if used by your deployment)
  feedback.py       # Feedback normalization utilities
  rag.py            # Streamlit app + RAG core (extract, embed, retrieve, generate)
  requirements.txt  # Python dependencies
Frontend/eduai/
  src/...           # React components for upload, results, feedback

Contributing

Issues and PRs are welcome. If you add a persistent vector store or new generators (e.g., short‑answer, flashcards), please document the new configuration in this README.

License

This project is for educational purposes. Add your preferred license if you plan to distribute more broadly.
