An adaptive AI study agent powered by Claude: it observes lecture content, plans personalized study paths, generates quizzes, evaluates answers, and adapts in real time.
| Where it runs | How to try it | |
|---|---|---|
| Web app | Hugging Face Spaces (free Kimi-K2 backend) | Open in browser |
| Chrome extension | Your browser: works on any page, PDF, or YouTube video | Install from the Chrome Web Store · Source (MV3) |
Live on the Chrome Web Store: click to install.
Side panel running on a YouTube ML course: topics auto-extracted from captions, belief state updating in real time.
The web app runs on Hugging Face Spaces using free HF Inference Providers (Kimi-K2). The Chrome extension calls the Anthropic API directly from your browser, with the same agent core and zero backend. For local development, plug in your own Anthropic key to get Claude's higher-quality reasoning.
SmartStudy Agent is a goal-based, partially observable AI agent that turns any lecture material into a fully personalized learning experience. Unlike a chatbot, it maintains a persistent belief state about student knowledge and uses an adaptive policy to decide what to study next.
Traditional study tools are static. They show you the same content regardless of what you already know. SmartStudy Agent solves this by closing the loop:
| Problem | SmartStudy's Solution |
|---|---|
| Generic study materials | Topics extracted and prioritized per student |
| No feedback on weak areas | Quiz answers update a persistent belief state |
| Same recommendations for everyone | Q-learning policy (or Contextual Bandit) adapts per student trajectory |
| Forgetting without practice | SM-2 spaced repetition scheduler |
| Out-of-order topics | Topological sort over a concept dependency graph |
SmartStudy implements the OPEAA loop, a five-phase adaptive agent cycle:
┌─────────────────────────────────────────────────┐
│                Lecture Materials                │
│    PDF · TXT · MD · DOCX · PPTX · VTT · SRT     │
└────────────────────────┬────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────┐
│          Claude API · claude-opus-4-6           │
│         thinking: { type: "adaptive" }          │
└────────────────────────┬────────────────────────┘
                         ▼
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│ OBSERVE │─▶│  PLAN   │─▶│   ACT   │─▶│ EVALUATE│
│         │  │  + DAG  │  │ quizzes │  │  + LLM  │
│ extract │  │  sort   │  │ 3 MCQs  │  │feedback │
│ topics  │  │         │  │         │  │         │
└─────────┘  └─────────┘  └─────────┘  └────┬────┘
     ▲                                      │
     │              ┌───────────────────────▼────┐
     │              │ ADAPT                      │
     └──────────────┤ Heuristic OR Q-learning    │
                    │ StudentProfile updated     │
                    └─────────────┬──────────────┘
                                  ▼
                  ┌───────────────────────────────┐
                  │    Persistent Belief State    │
                  │ (JSON storage · per student)  │
                  └───────────────┬───────────────┘
                                  │
              ┌───────────────────┼───────────────────┐
              ▼                   ▼                   ▼
      Spaced Repetition     Concept Graph       Streamlit UI
           (SM-2)          (DAG topo sort)        (8 pages)
The agent is modeled as a POMDP (partially observable Markov decision process):
- State: student's true knowledge (hidden)
- Belief state: `StudentProfile` (mastered topics, weak areas, quiz history)
- Actions: `advance` · `reinforce` · `review`
- Observations: student answers to generated quizzes
- Reward: improvement in quiz scores over time
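A minimal sketch of how such a belief state might be represented. The class name `BeliefState`, its fields, the `update` method, and the 0.7 cutoff (borrowed from the Bloom threshold mentioned elsewhere in this README) are illustrative assumptions, not the project's actual `StudentProfile` code.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefState:
    """Hypothetical stand-in for StudentProfile: what the agent believes, not the hidden truth."""
    mastered: set = field(default_factory=set)
    weak: set = field(default_factory=set)
    quiz_history: list = field(default_factory=list)

    def update(self, topic: str, score: float, mastery_threshold: float = 0.7) -> None:
        """Fold one observation (a quiz score in [0, 1]) into the belief."""
        self.quiz_history.append({"topic": topic, "score": score})
        if score >= mastery_threshold:
            self.mastered.add(topic)
            self.weak.discard(topic)
        else:
            self.weak.add(topic)
            self.mastered.discard(topic)


belief = BeliefState()
belief.update("Linear Algebra", 0.9)
belief.update("Backpropagation", 0.4)
print(sorted(belief.mastered), sorted(belief.weak))
```

The agent never sees the true knowledge state; it only sees quiz scores, which is why the belief is revised after every observation rather than set once.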
- 5-phase OPEAA loop: Observe → Plan → Act → Evaluate → Adapt
- Claude integration with `thinking: {type: "adaptive"}` for internal reasoning
- Goal-based agent design following Russell & Norvig's PEAS framework
- POMDP belief state persisted across sessions
- Two adaptive policies: heuristic (Bloom's 70% mastery threshold) and tabular Q-learning
- Concept dependency graph: Kahn's algorithm topological sort over a topic prerequisite DAG
- SM-2 spaced repetition: schedules reviews based on forgetting curves
- Persistent JSON storage: student profiles survive across sessions
- Multi-student support with peer comparison dashboard
- 7 input formats: PDF, TXT, MD, DOCX, PPTX, VTT, SRT
- Quantitative evaluation: Monte Carlo simulation of adaptive vs random baselines
- Mock client: `MockAnthropic` lets you run the entire system offline without an API key
- Streamlit web app with 8 pages (premium glassmorphism theme)
- Chrome extension (MV3): run the full OPEAA loop on any web page, Q-table persisted in `chrome.storage.local`
- Interactive terminal UI powered by `rich`
- Auto-demo mode for video recording
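To make the concept-graph feature above concrete, here is a rough sketch of Kahn's algorithm over a prerequisite map. The function signature and the example edges are assumptions for illustration; the project's `ConceptGraph` class may structure this differently.

```python
from collections import deque


def topological_sort(topics, prerequisites):
    """Order topics so every prerequisite comes before the topics that need it.

    `prerequisites` maps a topic to the list of topics it depends on.
    """
    topic_set = set(topics)
    indegree = {t: 0 for t in topics}
    dependents = {t: [] for t in topics}
    for topic in topics:
        for prereq in prerequisites.get(topic, []):
            if prereq in topic_set:  # ignore prerequisites outside this lecture
                indegree[topic] += 1
                dependents[prereq].append(topic)

    # Kahn's algorithm: repeatedly emit a topic with no unmet prerequisites.
    queue = deque(t for t in topics if indegree[t] == 0)
    order = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for nxt in dependents[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if len(order) != len(topics):
        raise ValueError("prerequisite graph contains a cycle")
    return order


# Hypothetical prerequisite edges, for illustration only.
prereqs = {
    "Neural Networks": ["Linear Algebra"],
    "Backpropagation": ["Neural Networks"],
}
order = topological_sort(
    ["Backpropagation", "Linear Algebra", "Neural Networks"], prereqs
)
print(order)  # ['Linear Algebra', 'Neural Networks', 'Backpropagation']
```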
git clone https://github.com/HumphreySun98/Smart-Study-Agent.git
cd Smart-Study-Agent
pip install -r requirements.txt

The agent supports three LLM backends and picks one automatically:
| Backend | Env variable | Cost | Quality |
|---|---|---|---|
| Anthropic Claude | `ANTHROPIC_API_KEY` | Pay as you go | ★★★★★ Best, supports adaptive thinking |
| HF Inference (Kimi-K2) | `HF_TOKEN` | Free | ★★★★ Great |
| Mock | (no env vars) | Free | ★★ Canned responses for offline demos |
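One way the automatic selection could work is sketched below. The function name `pick_backend` and the precedence order (Anthropic first, then Hugging Face, then mock) are assumptions; only the two environment variable names come from the table above.

```python
import os


def pick_backend(env=None):
    """Return the backend name implied by the environment variables.

    The precedence (Anthropic, then Hugging Face, then mock) is an assumption.
    """
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("HF_TOKEN"):
        return "huggingface"
    return "mock"


print(pick_backend({"HF_TOKEN": "hf_example"}))  # huggingface
print(pick_backend({}))                          # mock
```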
# Option 1: Claude (premium quality)
export ANTHROPIC_API_KEY=sk-ant-...
# Option 2: Hugging Face (completely free)
export HF_TOKEN=hf_...
# Option 3: Mock mode (no setup)
# just run the agent without any keys

Get a Claude key from console.anthropic.com ($5 free credit) or a free HF token from huggingface.co/settings/tokens.

https://huggingface.co/spaces/HumphreySun98/smart-study-agent

streamlit run app.py

Open http://localhost:8501, create a student in the sidebar, then go to Study Session to run the full OPEAA loop on a sample ML lecture or your own PDF.
python demo.py
python demo.py --pdf path/to/lecture.pdf
python demo.py --mock   # offline mode, no API key needed

python demo_auto.py

Now live on the Chrome Web Store: install in one click.
Prefer to run the source directly? Load it unpacked in < 60 seconds:
1. chrome://extensions → enable Developer mode
2. Load unpacked → select the chrome-extension/ folder
3. Pin the SmartStudy icon → clicking it opens the Side Panel
4. Settings → pick a backend (Anthropic or free HF) → paste key → Save
5. Open any article / PDF / YouTube page → "Observe this page"
Full install + architecture notes in chrome-extension/README.md.
from smartstudy_agent import SmartStudyAgent
agent = SmartStudyAgent() # uses ANTHROPIC_API_KEY env var
# Phase 1: Observe
observed = agent.observe("Lecture text about Machine Learning...")
# {'topics': [...], 'descriptions': {...}, 'summary': '...'}
# Phase 2: Plan
plan = agent.plan(observed)
print(plan.sequence) # ['Linear Algebra', 'Neural Networks', ...]
# Phase 3: Act
topic = plan.sequence[0]
questions = agent.act(topic, observed["descriptions"][topic], n=3)
# Phase 4: Evaluate
result = agent.evaluate(questions, answers=["B", "A", "C"])
print(f"Score: {result['score']:.0%}")
print(result["feedback"])
# Phase 5: Adapt
adaptation = agent.adapt(topic, result)
print(adaptation["action"]) # 'advance' | 'reinforce' | 'review'
print(agent.profile.summary())

import storage
from concept_graph import ConceptGraph
from spaced_repetition import get_review_queue
from rl_policy import QLearningPolicy
from evaluation import compare
# Persistent storage
record = storage.load_student("alice")
storage.add_session("alice", {"topic": "Neural Networks", "score": 0.9})
# Concept dependency graph (topological sort)
g = ConceptGraph()
g.topological_sort(["Backpropagation", "Linear Algebra", "Neural Networks"])
# -> ['Linear Algebra', 'Neural Networks', 'Backpropagation']
# Spaced repetition scheduler (SM-2)
due_today = get_review_queue(record["quiz_history"])
# Q-learning adaptive policy
policy = QLearningPolicy()
action = policy.choose_action(score=0.55) # 'reinforce'
policy.update(prev_score=0.55, action=action, new_score=0.80)
# Quantitative evaluation vs random baseline
results = compare(n_runs=30, n_sessions=20)
print(f"Adaptive beats baseline by {results['improvement_pct']:.1f}%")

| Page | Purpose |
|---|---|
| Dashboard | Mastered topics, weak areas, due reviews, and key metrics |
| Study Session | Upload a lecture and run the full OPEAA loop step-by-step |
| Spaced Review | SM-2 scheduler shows what to review today |
| Concept Graph | Visualizes the topic prerequisite DAG with mastered topics highlighted |
| Progress History | Personal score trajectory across all attempts |
| Peer Comparison | Multi-student leaderboard ranked by average score |
| RL Policy | Inspect the Q-table and train it on simulated episodes |
| Baseline Evaluation | Adaptive vs random topic-selection simulation results |
| Pilot Study | Real usage metrics, engagement analysis, learning progression report |
smartstudy-agent/
├── smartstudy_agent.py        # Core agent: 5 OPEAA phases
├── mock_claude.py             # Offline mock client
├── hf_client.py               # Hugging Face Inference adapter (free LLM backend)
├── app.py                     # Streamlit web app (8 pages)
├── demo.py                    # Interactive terminal demo
├── demo_auto.py               # Automated demo (no input needed)
│
├── storage.py                 # SQLite persistent storage (auto-migrates from JSON)
├── concept_graph.py           # Topic prerequisite DAG with cross-course linking
├── pilot_study.py             # Pilot study data collection and analysis
├── rl_policy.py               # Tabular Q-learning policy
├── bandit_policy.py           # Contextual Bandit (LinUCB), alternative to RL
├── spaced_repetition.py       # SM-2 review scheduler
├── multi_format.py            # PDF/TXT/MD/DOCX/PPTX/VTT/SRT loader
├── evaluation.py              # Adaptive vs baseline simulation
│
├── generate_visuals.py        # Generates architecture diagrams
├── requirements.txt           # Python dependencies
├── README.md                  # This file
│
├── chrome-extension/          # Chrome MV3 extension: OPEAA loop in the browser
│   ├── manifest.json
│   ├── popup.{html,css,js}    # Gradient popup UI + full agent logic
│   ├── content.js             # Active-tab text extractor
│   ├── options.{html,js}      # API key + model settings
│   ├── background.js          # Service worker
│   └── icons/                 # 16/48/128 PNG
│
├── data/                      # Created at runtime
│   ├── smartstudy.db          # SQLite database (student profiles + sessions)
│   ├── qtable.json            # Q-learning policy state
│   └── concept_graph.json     # User-defined graph edges
│
└── visuals/                   # Generated PNG diagrams
    ├── adaptive_loop.png
    ├── system_architecture.png
    ├── performance_dashboard.png
    └── ai_techniques.png
| Layer | Technology |
|---|---|
| LLM | Anthropic Claude (claude-opus-4-6 with adaptive thinking) |
| Web UI | Streamlit |
| RL | Tabular Q-learning over discretized score buckets |
| Knowledge Graph | NetworkX + Kahn's algorithm |
| Spaced Repetition | SM-2 algorithm |
| Storage | SQLite (auto-migrates from JSON, scales to >1k students) |
| Document Parsing | pypdf, python-docx, python-pptx |
| Terminal UI | rich |
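For readers unfamiliar with the SM-2 entry in the table above, the following is a rough sketch of the commonly published SM-2 update. The `Card` dataclass, the `sm2_review` name, and the mapping from a quiz score to the 0..5 quality scale are assumptions; `spaced_repetition.py` may implement the scheduler differently.

```python
from dataclasses import dataclass


@dataclass
class Card:
    repetitions: int = 0      # consecutive successful reviews
    interval_days: int = 0    # days until the next review
    ease: float = 2.5         # SM-2 easiness factor, never below 1.3


def sm2_review(card: Card, quality: int) -> Card:
    """Apply one SM-2 review with quality in 0..5 (3 or higher counts as a pass)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality >= 3:
        if card.repetitions == 0:
            interval = 1
        elif card.repetitions == 1:
            interval = 6
        else:
            interval = round(card.interval_days * card.ease)
        repetitions = card.repetitions + 1
    else:
        repetitions, interval = 0, 1  # lapse: restart the schedule
    ease = card.ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return Card(repetitions, interval, max(1.3, ease))


def score_to_quality(score: float) -> int:
    """Assumed mapping from a quiz score in [0, 1] to the SM-2 0..5 scale."""
    return round(score * 5)


card = Card()
for score in (0.8, 1.0, 1.0):
    card = sm2_review(card, score_to_quality(score))
print(card.interval_days, round(card.ease, 2))
```

Successful reviews stretch the interval (1 day, 6 days, then multiplied by the ease factor), while a failed review resets it to 1 day and lowers the ease, so hard topics come back sooner.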
The ADAPT phase uses a two-layer decision system: the RL policy chooses the action, and the LLM explains the decision to the student in natural language.
The action (advance / reinforce / review) is chosen by a tabular Q-learning agent, not by the LLM. This runs every time a student finishes a quiz.
| Component | Value |
|---|---|
| State | Quiz score discretized into 5 buckets: very_low / low / medium / high / very_high |
| Actions | review · reinforce · advance |
| Reward | Score change between attempts: r = (new_score - prev_score) × 10 |
| Learning rate (α) | 0.2 |
| Discount factor (γ) | 0.8 |
| Exploration (ε) | 0.15 (epsilon-greedy) |
Update rule:
Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') - Q(s, a)]
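The update rule and hyperparameters above can be sketched in a few lines. The equal-width bucket boundaries, the zero-initialized table, and the function names are assumptions made for illustration; this is not the code in `rl_policy.py`.

```python
import random

ACTIONS = ("review", "reinforce", "advance")
BUCKETS = ("very_low", "low", "medium", "high", "very_high")
ALPHA, GAMMA, EPSILON = 0.2, 0.8, 0.15


def bucket(score: float) -> str:
    """Discretize a score in [0, 1] into five equal-width buckets (assumed boundaries)."""
    return BUCKETS[min(int(score * 5), 4)]


def choose_action(q, state, rng=random):
    """Epsilon-greedy: explore with probability EPSILON, otherwise take the best action."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def update(q, prev_score, action, new_score):
    """One tabular Q-learning step with reward = (new_score - prev_score) * 10."""
    s, s_next = bucket(prev_score), bucket(new_score)
    reward = (new_score - prev_score) * 10
    best_next = max(q[(s_next, a)] for a in ACTIONS)
    q[(s, action)] += ALPHA * (reward + GAMMA * best_next - q[(s, action)])
    return reward


# Zero-initialized table for the sketch; the README describes a Bloom-informed init.
q_table = {(s, a): 0.0 for s in BUCKETS for a in ACTIONS}
reward = update(q_table, prev_score=0.40, action="reinforce", new_score=0.55)
print(round(reward, 2), round(q_table[("medium", "reinforce")], 2))  # 1.5 0.3
```

With an all-zero table, the single update moves Q("medium", "reinforce") to α · r = 0.2 × 1.5 = 0.3.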
The Q-table is persisted to disk (data/qtable.json) and trains on every real quiz attempt. It can also be inspected and manually trained in the RL Policy page.
After the RL policy picks the action, Claude (or Kimi-K2) generates a natural-language explanation of why that action makes sense for the student. The LLM cannot override the RL decision; it only produces the recommendation text.
Student takes quiz → score = 55%
→ RL policy: Q("medium", "reinforce") = 0.42 (highest) → action = "reinforce"
→ Q-table updated with reward = (0.55 - 0.40) × 10 = 1.5
→ LLM generates: "You're close! Practice the same topic one more time..."
The Q-table is initialized with values informed by Bloom's 1968 mastery learning threshold (70%). As real data accumulates, the learned policy diverges from the heuristic and adapts to actual student behavior patterns.
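A sketch of what a threshold-informed initialization could look like. Only the 70% mastery threshold comes from this README; the 40% review cutoff, the bucket midpoints, the prior value of 0.5, and the function names are assumptions.

```python
ACTIONS = ("review", "reinforce", "advance")
# Bucket midpoints in percent, assuming five equal-width score buckets.
BUCKET_MIDPOINTS = {"very_low": 10, "low": 30, "medium": 50, "high": 70, "very_high": 90}


def heuristic_action(score_pct: int) -> str:
    """Bloom-style rule: advance at 70% or above; the 40% review cutoff is an assumption."""
    if score_pct >= 70:
        return "advance"
    if score_pct >= 40:
        return "reinforce"
    return "review"


def init_q_table(prior: float = 0.5) -> dict:
    """Seed the heuristic's preferred action with a small positive prior, others with 0."""
    return {
        (state, action): prior if action == heuristic_action(mid) else 0.0
        for state, mid in BUCKET_MIDPOINTS.items()
        for action in ACTIONS
    }


q_table = init_q_table()
greedy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in BUCKET_MIDPOINTS}
print(greedy)
```

Before any real data arrives, the greedy policy over this table reproduces the heuristic; later Q-learning updates can then move individual entries away from the prior.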
Context. A valid critique of applying full RL to this problem is that if each decision is nearly independent, a Contextual Bandit is more sample-efficient than a sequential RL agent. We take that critique seriously, so the project ships both and compares them directly.
When RL is justified here. The student's mastery state depends on the sequence of actions, not just the current context:
- Prerequisite coupling. Studying Backprop before Neural Nets is mastered gives a smaller skill gain (the simulated student encodes this via a prerequisite DAG). A bandit chooses actions independently per step and cannot trade off short-term score for long-term skill gain.
- Forgetting. Topics not practiced decay each step, so when you schedule a review matters: a classic sequential credit-assignment problem.
- Action latency. `review` tends to depress the immediate next quiz score (the student is working on a weak area) but pays off several steps later. A bandit, optimizing only single-step reward, systematically underweights this.
When a Bandit is better. If the deployment looks more like A/B-testing recommendation variants over many users with little per-user history, a bandit will converge faster and is probably the right tool. We added bandit_policy.LinUCBBandit so the same agent can be run in that mode via SmartStudyAgent(policy="bandit") or SMARTSTUDY_POLICY=bandit.
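For orientation, here is a compact, dependency-free sketch of disjoint LinUCB. The class names, the `[1.0, score]` context vector, and the `alpha` value are assumptions; this does not reproduce the interface of `bandit_policy.LinUCBBandit`.

```python
import math


class LinUCBArm:
    """Per-action ridge-regression estimate plus an upper-confidence bonus."""

    def __init__(self, dim: int):
        # A_inv starts as the identity (A = I); b starts at zero.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def ucb(self, x, alpha: float) -> float:
        a_inv_x = self._a_inv_times(x)
        theta = self._a_inv_times(self.b)  # theta = A^-1 b
        mean = sum(t * xi for t, xi in zip(theta, x))
        bonus = alpha * math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + bonus

    def update(self, x, reward: float) -> None:
        # Sherman-Morrison rank-one update of A^-1 after A += x x^T.
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


class LinUCB:
    def __init__(self, actions, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.arms = {a: LinUCBArm(dim) for a in actions}

    def choose(self, x):
        """Pick the action with the highest upper confidence bound for context x."""
        return max(self.arms, key=lambda a: self.arms[a].ucb(x, self.alpha))

    def update(self, action, x, reward: float) -> None:
        self.arms[action].update(x, reward)


bandit = LinUCB(("review", "reinforce", "advance"), dim=2, alpha=0.5)
context = [1.0, 0.55]  # hypothetical features: bias term and last quiz score
for _ in range(20):
    bandit.update("reinforce", context, 1.0)
    bandit.update("review", context, 0.0)
    bandit.update("advance", context, 0.0)
print(bandit.choose(context))  # reinforce
```

Note that each arm is scored from the current context alone, with no term for the next state. That is the single-step limitation discussed in the bullets above.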
Run python evaluation.py. Each policy is evaluated on 30 simulated students × 30 sessions, all facing the same student trajectories for a fair paired comparison:
| Policy | Avg. observed score | Final mean skill | vs. random |
|---|---|---|---|
| Random | 0.33 ± 0.02 | 0.29 ± 0.01 | +0.0 % |
| Rule-based (Bloom 70 %) | 0.45 ± 0.02 | 0.53 ± 0.01 | +35 % |
| Contextual Bandit (LinUCB) | 0.43 ± 0.02 | 0.47 ± 0.02 | +28 % |
| Q-learning (tabular) | 0.40 ± 0.03 | 0.43 ± 0.06 | +18 % |
Numbers will vary run-to-run; representative of n_runs=30, n_sessions=30.
Reading the result honestly. In the short-horizon regime typical of a single study session, a well-designed rule-based heuristic is hard to beat. The Bandit matches it with a small sample-efficiency penalty. Q-learning needs more data to pay off the variance cost of bootstrapping through next states; it catches up to the Bandit on final skill by ~100 sessions, consistent with the sequential-credit-assignment argument above. This honestly answers the professor's question: in this deployment, RL is defensible but not dominant; a Contextual Bandit is a reasonable production default and we ship it as a first-class option.
Following the evaluation feedback, we replaced the earlier noise-only simulator with a small cognitive model (evaluation.SimulatedStudent): per-topic hidden skills, prerequisite-gated learning gain, diminishing returns as skill → 1, and per-step forgetting on unpracticed topics. This is what makes the rule-based vs. bandit vs. RL comparison meaningful: a purely random simulated student would flatten the differences.
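The four ingredients named above can be illustrated with a toy model. Every constant here (initial skill 0.1, base gain 0.3, forgetting rate 0.02, gate 0.5) and the method names are assumptions; the real `evaluation.SimulatedStudent` may differ in both structure and parameters.

```python
class SimulatedStudent:
    """Toy cognitive model; all constants here are illustrative assumptions."""

    def __init__(self, prerequisites, base_gain=0.3, forget_rate=0.02, gate=0.5):
        self.prerequisites = prerequisites             # topic -> list of prerequisite topics
        self.skill = {t: 0.1 for t in prerequisites}   # hidden per-topic skill in [0, 1]
        self.base_gain = base_gain
        self.forget_rate = forget_rate
        self.gate = gate                               # prerequisite skill needed for full gain

    def study(self, topic):
        """Practice one topic for one step and return its new hidden skill."""
        prereqs = self.prerequisites[topic]
        # Prerequisite gating: the gain is scaled by the weakest prerequisite.
        readiness = min((min(self.skill[p] / self.gate, 1.0) for p in prereqs), default=1.0)
        # Diminishing returns: the gain shrinks as skill approaches 1.
        self.skill[topic] += self.base_gain * readiness * (1.0 - self.skill[topic])
        # Forgetting: every unpracticed topic decays a little each step.
        for other in self.skill:
            if other != topic:
                self.skill[other] *= 1.0 - self.forget_rate
        return self.skill[topic]


graph = {"Neural Networks": [], "Backpropagation": ["Neural Networks"]}

cold = SimulatedStudent(graph)
cold_gain = cold.study("Backpropagation") - 0.1       # prerequisite not yet learned

warm = SimulatedStudent(graph)
for _ in range(5):
    warm.study("Neural Networks")                     # learn the prerequisite first
before = warm.skill["Backpropagation"]
warm_gain = warm.study("Backpropagation") - before

print(round(cold_gain, 3), round(warm_gain, 3))
```

Studying Backpropagation after its prerequisite yields a much larger gain than studying it cold, which is the order-dependence that separates sequential policies from random ones in the comparison.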
- Core 5-phase OPEAA loop with Claude
- Heuristic adaptive policy (Bloom 70%)
- Persistent multi-student storage
- Concept dependency graph + topological sort
- Q-learning adaptive policy
- SM-2 spaced repetition
- Streamlit web app with 8 pages
- Multi-format input loader
- Quantitative baseline evaluation
- Concept graph editor in the UI
- Cross-course prerequisite linking (4 courses: AI, Data Science, NLP, Computer Vision)
- Pilot study dashboard with engagement analysis and progression tracking
- SQLite storage backend (replaces JSON, handles >1k students)
- Deployed as hosted SaaS on Hugging Face Spaces
- Contextual Bandit (LinUCB) policy as an alternative to full RL
- 4-way evaluation against Rule-based baseline + Simulated Student Model (per professor feedback)
- Chrome extension (MV3): same OPEAA loop on any web page, client-side Q-learning
- Migrate extension to `chrome.sidePanel` for persistent belief-state display
- Chrome Web Store listing
MIT License. See LICENSE for details.
Copyright © 2026 Haofei Sun
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
Haofei Sun
If you find this project useful, please consider giving it a star on GitHub.
For questions, suggestions, or collaboration: open an issue or start a discussion.