
HumphreySun98/Smart-Study-Agent


SmartStudy Agent

An adaptive AI study agent powered by Claude: it observes lecture content, plans personalized study paths, generates quizzes, evaluates answers, and adapts in real time.

Python 3.10+ · Claude API · Streamlit · Chrome Extension · License: MIT · Hugging Face Space

🌐 Live: Two Ways to Use It

|  | Where it runs | How to try it |
|---|---|---|
| Web app | Hugging Face Spaces (free Kimi-K2 backend) | Open in browser |
| Chrome extension | Your browser; works on any page, PDF, or YouTube video | Install from the Chrome Web Store · Source (MV3) |

Live on the Chrome Web Store: click to install.

Side panel running on a YouTube ML course: topics auto-extracted from captions, belief state updating in real time.

The web app runs on Hugging Face Spaces using free HF Inference Providers (Kimi-K2). The Chrome extension calls the LLM provider (Anthropic or free HF) directly from your browser: same agent core, zero backend. For local development, plug in your own Anthropic key to get Claude's higher-quality reasoning.

SmartStudy Agent is a goal-based, partially observable AI agent that turns any lecture material into a fully personalized learning experience. Unlike a chatbot, it maintains a persistent belief state about student knowledge and uses an adaptive policy to decide what to study next.


Why SmartStudy?

Traditional study tools are static. They show you the same content regardless of what you already know. SmartStudy Agent solves this by closing the loop:

| Problem | SmartStudy's Solution |
|---|---|
| Generic study materials | Topics extracted and prioritized per student |
| No feedback on weak areas | Quiz answers update a persistent belief state |
| Same recommendations for everyone | Q-learning policy (or Contextual Bandit) adapts per student trajectory |
| Forgetting without practice | SM-2 spaced repetition scheduler |
| Out-of-order topics | Topological sort over a concept dependency graph |

Architecture

SmartStudy implements the OPEAA loop (Observe, Plan, Act, Evaluate, Adapt), a five-phase adaptive agent cycle:

```text
       ┌─────────────────────────────────────────────────┐
       │              Lecture Materials                  │
       │   PDF · TXT · MD · DOCX · PPTX · VTT · SRT      │
       └────────────────────────┬────────────────────────┘
                                ▼
       ┌─────────────────────────────────────────────────┐
       │     Claude API  ·  claude-opus-4-6              │
       │     thinking: { type: "adaptive" }              │
       └────────────────────────┬────────────────────────┘
                                ▼
       ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
       │ OBSERVE │─▶│  PLAN   │─▶│   ACT   │─▶│ EVALUATE│
       │         │  │  + DAG  │  │  quizzes│  │  + LLM  │
       │ extract │  │   sort  │  │  3 MCQs │  │feedback │
       │ topics  │  │         │  │         │  │         │
       └─────────┘  └─────────┘  └─────────┘  └────┬────┘
            ▲                                      │
            │           ┌──────────────────────────▼┐
            │           │           ADAPT           │
            └───────────┤  Heuristic OR Q-learning  │
                        │  StudentProfile updated   │
                        └──────────┬────────────────┘
                                   ▼
                  ┌─────────────────────────────────┐
                  │   Persistent Belief State       │
                  │   (SQLite · per student)        │
                  └────────────────┬────────────────┘
                                   │
                  ┌────────────────┼────────────────┐
                  ▼                ▼                ▼
          Spaced Repetition  Concept Graph    Streamlit UI
               (SM-2)       (DAG topo sort)     (9 pages)
```

The agent is modeled as a POMDP (partially observable Markov decision process):

  • State: the student's true knowledge (hidden)
  • Belief state: StudentProfile (mastered topics, weak areas, quiz history)
  • Actions: advance · reinforce · review
  • Observations: student answers to generated quizzes
  • Reward: improvement in quiz scores over time
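To make the belief-state idea concrete, here is a minimal sketch of a profile that folds each quiz observation into mastered and weak-area sets and serializes to JSON for persistence. `StudentProfileSketch`, `observe`, and the rule that a low score revokes earlier mastery are illustrative assumptions, not the project's StudentProfile class; only the 70% threshold comes from this README.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field

MASTERY_THRESHOLD = 0.70  # the 70% mastery threshold this README attributes to Bloom


@dataclass
class StudentProfileSketch:
    """Hypothetical belief state: what the agent currently believes the student knows."""

    student_id: str
    mastered: set[str] = field(default_factory=set)
    weak_areas: set[str] = field(default_factory=set)
    quiz_history: list[dict] = field(default_factory=list)

    def observe(self, topic: str, score: float) -> None:
        """Fold one observation (a quiz score in [0, 1]) into the belief state."""
        self.quiz_history.append({"topic": topic, "score": score})
        if score >= MASTERY_THRESHOLD:
            self.mastered.add(topic)
            self.weak_areas.discard(topic)
        else:
            # Assumption: a low score revokes earlier mastery (belief follows latest evidence).
            self.weak_areas.add(topic)
            self.mastered.discard(topic)

    def to_json(self) -> str:
        """Serialize for persistence across sessions (sets become sorted lists)."""
        data = asdict(self)
        data["mastered"] = sorted(self.mastered)
        data["weak_areas"] = sorted(self.weak_areas)
        return json.dumps(data)

    @classmethod
    def from_json(cls, text: str) -> StudentProfileSketch:
        """Rebuild a profile saved by to_json."""
        data = json.loads(text)
        return cls(
            student_id=data["student_id"],
            mastered=set(data["mastered"]),
            weak_areas=set(data["weak_areas"]),
            quiz_history=data["quiz_history"],
        )
```

Because the true knowledge state is only seen through quiz scores, the sketch keeps the raw quiz_history next to the summary sets, so a later policy could re-derive its beliefs from the observations themselves.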

Features

Core Agent

  • 5-phase OPEAA loop: Observe → Plan → Act → Evaluate → Adapt
  • Claude integration with thinking: {type: "adaptive"} for internal reasoning
  • Goal-based agent design following Russell & Norvig's PEAS framework
  • POMDP belief state persisted across sessions
  • Three adaptive policies: a heuristic (Bloom's 70% mastery threshold), tabular Q-learning, and a LinUCB contextual bandit

Knowledge & Memory

  • Concept dependency graph: Kahn's algorithm topological sort over a topic-prerequisite DAG
  • SM-2 spaced repetition: schedules reviews based on forgetting curves
  • Persistent storage: student profiles survive across sessions in SQLite (auto-migrated from the earlier JSON files)
  • Multi-student support with a peer comparison dashboard
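The topological sort can be illustrated with a short, self-contained version of Kahn's algorithm. This is a sketch under stated assumptions: the prerequisite edges in `PREREQS` are made up for the example, and `kahn_topological_sort` is a hypothetical helper, not the `ConceptGraph` API.

```python
from __future__ import annotations

from collections import deque


def kahn_topological_sort(topics: list[str], prereqs: dict[str, set[str]]) -> list[str]:
    """Order `topics` so every prerequisite comes before the topics that need it.

    `prereqs[t]` is the set of topics that must be studied before `t`.
    Edges to topics outside `topics` are ignored.
    """
    wanted = set(topics)
    indegree = {t: 0 for t in topics}
    dependents: dict[str, list[str]] = {t: [] for t in topics}
    for t in topics:
        for p in prereqs.get(t, set()):
            if p in wanted:
                indegree[t] += 1
                dependents[p].append(t)

    # Start from topics with no unmet prerequisites; input order breaks ties.
    queue = deque(t for t in topics if indegree[t] == 0)
    order: list[str] = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for d in dependents[t]:
            indegree[d] -= 1  # one prerequisite of d is now scheduled
            if indegree[d] == 0:
                queue.append(d)

    if len(order) != len(topics):
        raise ValueError("prerequisite graph has a cycle")
    return order


# Hypothetical prerequisite edges for the README's three-topic example.
PREREQS = {
    "Neural Networks": {"Linear Algebra"},
    "Backpropagation": {"Neural Networks", "Linear Algebra"},
}
print(kahn_topological_sort(["Backpropagation", "Linear Algebra", "Neural Networks"], PREREQS))
# ['Linear Algebra', 'Neural Networks', 'Backpropagation']
```

Kahn's algorithm also detects cycles for free: if some topics never reach in-degree zero, the prerequisite graph is not a DAG, and this sketch raises a ValueError instead of returning a partial study plan.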

Input & Evaluation

  • 7 input formats: PDF, TXT, MD, DOCX, PPTX, VTT, SRT
  • Quantitative evaluation: Monte Carlo simulation of adaptive policies against random and rule-based baselines
  • Mock client: MockAnthropic lets you run the entire system offline without an API key
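For the VTT and SRT caption formats, loading mostly means stripping cue numbers, timing lines, and inline markup. A minimal sketch, assuming the captions arrive as an already-decoded string; `captions_to_text` is a hypothetical helper and not the multi_format.py interface.

```python
import re

_TAG = re.compile(r"<[^>]+>")  # inline caption markup such as <i>, <b>, or <c.color>


def captions_to_text(raw: str) -> str:
    """Flatten WebVTT or SRT captions into plain lecture text (hypothetical helper)."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, the WEBVTT header, SRT cue numbers, and timing lines.
        # Caveat: a caption line consisting only of a number is dropped too.
        if not line or line.startswith("WEBVTT") or line.isdigit() or "-->" in line:
            continue
        text = _TAG.sub("", line).strip()
        # Auto-generated captions often repeat a line across cues; keep one copy.
        if text and (not kept or kept[-1] != text):
            kept.append(text)
    return " ".join(kept)
```

It does not handle VTT NOTE or STYLE blocks; a fuller loader would skip those as well.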

User Interface

  • Streamlit web app with 9 pages (premium glassmorphism theme)
  • Chrome extension (MV3): run the full OPEAA loop on any web page, with the Q-table persisted in chrome.storage.local
  • Interactive terminal UI powered by rich
  • Auto-demo mode for video recording

Installation

```bash
git clone https://github.com/HumphreySun98/Smart-Study-Agent.git
cd Smart-Study-Agent
pip install -r requirements.txt
```

The agent supports three LLM backends and picks one automatically based on which environment variables are set:

| Backend | Env variable | Cost | Quality |
|---|---|---|---|
| Anthropic Claude | ANTHROPIC_API_KEY | Pay as you go | ⭐⭐⭐⭐⭐ Best; supports adaptive thinking |
| HF Inference (Kimi-K2) | HF_TOKEN | Free | ⭐⭐⭐⭐ Great |
| Mock | (no env vars) | Free | ⭐⭐ Canned responses for offline demos |
```bash
# Option 1: Claude (premium quality)
export ANTHROPIC_API_KEY=sk-ant-...

# Option 2: Hugging Face (completely free)
export HF_TOKEN=hf_...

# Option 3: Mock mode (no setup)
# just run the agent without any keys
```
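The automatic backend choice could look like the sketch below. The precedence (Anthropic, then Hugging Face, then mock) is an assumption read off the order of the backend table, and `pick_backend` is a hypothetical name; the actual rule lives in the agent code.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def pick_backend(env: Mapping[str, str] | None = None) -> str:
    """Choose an LLM backend from environment variables (hypothetical helper).

    Assumed precedence: Anthropic first, then Hugging Face, then the offline mock.
    Empty values count as unset.
    """
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("HF_TOKEN"):
        return "hf"
    return "mock"
```

Passing an explicit mapping instead of reading os.environ directly keeps the function easy to test without touching the real environment.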

Get a Claude key from console.anthropic.com ($5 free credit) or a free HF token from huggingface.co/settings/tokens.


Quick Start

Hosted Demo (zero install)

👉 https://huggingface.co/spaces/HumphreySun98/smart-study-agent

Web App (local)

```bash
streamlit run app.py
```

Open http://localhost:8501, create a student in the sidebar, then go to 📖 Study Session to run the full OPEAA loop on a sample ML lecture or your own PDF.

Terminal demo (interactive)

```bash
python demo.py
python demo.py --pdf path/to/lecture.pdf
python demo.py --mock                  # offline mode, no API key needed
```

Auto demo (for screen recording)

```bash
python demo_auto.py
```

Chrome extension (run the agent on any web page, PDF, or YouTube video)

Now live on the Chrome Web Store: install in one click.

Prefer to run the source directly? Load it unpacked in under 60 seconds:

1. Open chrome://extensions → enable Developer mode
2. Load unpacked → select the chrome-extension/ folder
3. Pin the SmartStudy icon; clicking it opens the Side Panel
4. Settings → pick a backend (Anthropic or free HF) → paste your key → Save
5. Open any article, PDF, or YouTube page → click "Observe this page"

Full install + architecture notes in chrome-extension/README.md.


Programmatic API

```python
from smartstudy_agent import SmartStudyAgent

agent = SmartStudyAgent()   # backend picked from ANTHROPIC_API_KEY / HF_TOKEN, else mock

# Phase 1: Observe
observed = agent.observe("Lecture text about Machine Learning...")
# {'topics': [...], 'descriptions': {...}, 'summary': '...'}

# Phase 2: Plan
plan = agent.plan(observed)
print(plan.sequence)        # ['Linear Algebra', 'Neural Networks', ...]

# Phase 3: Act
topic = plan.sequence[0]
questions = agent.act(topic, observed["descriptions"][topic], n=3)

# Phase 4: Evaluate
result = agent.evaluate(questions, answers=["B", "A", "C"])
print(f"Score: {result['score']:.0%}")
print(result["feedback"])

# Phase 5: Adapt
adaptation = agent.adapt(topic, result)
print(adaptation["action"])              # 'advance' | 'reinforce' | 'review'
print(agent.profile.summary())
```

Supporting modules

```python
import storage
from concept_graph import ConceptGraph
from spaced_repetition import get_review_queue
from rl_policy import QLearningPolicy
from evaluation import compare

# Persistent storage
record = storage.load_student("alice")
storage.add_session("alice", {"topic": "Neural Networks", "score": 0.9})

# Concept dependency graph (topological sort)
g = ConceptGraph()
g.topological_sort(["Backpropagation", "Linear Algebra", "Neural Networks"])
# -> ['Linear Algebra', 'Neural Networks', 'Backpropagation']

# Spaced repetition scheduler (SM-2)
due_today = get_review_queue(record["quiz_history"])

# Q-learning adaptive policy
policy = QLearningPolicy()
action = policy.choose_action(score=0.55)   # e.g. 'reinforce' (epsilon-greedy, so it can vary)
policy.update(prev_score=0.55, action=action, new_score=0.80)

# Quantitative evaluation vs random baseline
results = compare(n_runs=30, n_sessions=20)
print(f"Adaptive beats baseline by {results['improvement_pct']:.1f}%")
```
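`get_review_queue` appears above only as a call. The scheduling rule behind it can be sketched with the textbook SM-2 update: intervals of 1 and then 6 days, after that the previous interval times the easiness factor (EF), and a reset after any grade below 3. `ReviewItem`, `sm2_review`, `review_queue`, and especially the `score_to_quality` mapping from quiz scores to SM-2's 0-5 grades are assumptions for illustration, not spaced_repetition.py.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ReviewItem:
    topic: str
    repetitions: int = 0    # consecutive successful reviews
    interval_days: int = 0  # gap until the next review
    easiness: float = 2.5   # SM-2 easiness factor (EF), never below 1.3
    due: date = date.min    # never reviewed: due immediately


def sm2_review(item: ReviewItem, quality: int, today: date) -> ReviewItem:
    """Apply one SM-2 review with a recall-quality grade in 0..5."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be in 0..5")
    if quality < 3:
        # Failed recall: restart the repetition sequence, review again tomorrow.
        item.repetitions = 0
        item.interval_days = 1
    else:
        if item.repetitions == 0:
            item.interval_days = 1
        elif item.repetitions == 1:
            item.interval_days = 6
        else:
            item.interval_days = round(item.interval_days * item.easiness)
        item.repetitions += 1
    # EF update, applied on every review as in the widely used SM-2 pseudocode.
    penalty = (5 - quality) * (0.08 + (5 - quality) * 0.02)
    item.easiness = max(1.3, item.easiness + 0.1 - penalty)
    item.due = today + timedelta(days=item.interval_days)
    return item


def score_to_quality(score: float) -> int:
    """Hypothetical mapping from a quiz score in [0, 1] to an SM-2 grade 0..5."""
    return max(0, min(5, round(score * 5)))


def review_queue(items: list[ReviewItem], today: date) -> list[str]:
    """Topics whose next review date has arrived."""
    return [i.topic for i in items if i.due <= today]
```

Python's round uses banker's rounding, so in this sketch a score of exactly 0.5 maps to grade 2, which SM-2 treats as a failed recall.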

Web App Pages

| Page | Purpose |
|---|---|
| 🏠 Dashboard | Mastered topics, weak areas, due reviews, and key metrics |
| 📖 Study Session | Upload a lecture and run the full OPEAA loop step by step |
| 🔁 Spaced Review | SM-2 scheduler shows what to review today |
| 🧠 Concept Graph | Visualizes the topic prerequisite DAG with mastered topics highlighted |
| 📊 Progress History | Personal score trajectory across all attempts |
| 👥 Peer Comparison | Multi-student leaderboard ranked by average score |
| 🎯 RL Policy | Inspect the Q-table and train it on simulated episodes |
| 🧪 Baseline Evaluation | Adaptive vs random topic-selection simulation results |
| 📋 Pilot Study | Real usage metrics, engagement analysis, and a learning-progression report |

Project Structure

```text
smartstudy-agent/
├── smartstudy_agent.py     # Core agent: 5 OPEAA phases
├── mock_claude.py          # Offline mock client
├── hf_client.py            # Hugging Face Inference adapter (free LLM backend)
├── app.py                  # Streamlit web app (9 pages)
├── demo.py                 # Interactive terminal demo
├── demo_auto.py            # Automated demo (no input needed)
│
├── storage.py              # SQLite persistent storage (auto-migrates from JSON)
├── concept_graph.py        # Topic prerequisite DAG with cross-course linking
├── pilot_study.py          # Pilot study data collection and analysis
├── rl_policy.py            # Tabular Q-learning policy
├── bandit_policy.py        # Contextual Bandit (LinUCB): alternative to RL
├── spaced_repetition.py    # SM-2 review scheduler
├── multi_format.py         # PDF/TXT/MD/DOCX/PPTX/VTT/SRT loader
├── evaluation.py           # Adaptive vs baseline simulation
│
├── generate_visuals.py     # Generates architecture diagrams
├── requirements.txt        # Python dependencies
├── README.md               # This file
│
├── chrome-extension/       # Chrome MV3 extension: OPEAA loop in the browser
│   ├── manifest.json
│   ├── popup.{html,css,js} # Gradient popup UI + full agent logic
│   ├── content.js          # Active-tab text extractor
│   ├── options.{html,js}   # API key + model settings
│   ├── background.js       # Service worker
│   └── icons/              # 16/48/128 PNG
│
├── data/                   # Created at runtime
│   ├── smartstudy.db       # SQLite database (student profiles + sessions)
│   ├── qtable.json         # Q-learning policy state
│   └── concept_graph.json  # User-defined graph edges
│
└── visuals/                # Generated PNG diagrams
    ├── adaptive_loop.png
    ├── system_architecture.png
    ├── performance_dashboard.png
    └── ai_techniques.png
```

Tech Stack

| Layer | Technology |
|---|---|
| LLM | Anthropic Claude (claude-opus-4-6 with adaptive thinking); Kimi-K2 via HF Inference Providers as the free backend |
| Web UI | Streamlit |
| RL | Tabular Q-learning over discretized score buckets; LinUCB contextual bandit as an alternative |
| Knowledge Graph | NetworkX + Kahn's algorithm |
| Spaced Repetition | SM-2 algorithm |
| Storage | SQLite (auto-migrates from JSON, scales to >1k students) |
| Document Parsing | pypdf, python-docx, python-pptx |
| Terminal UI | rich |

How the Agent Decides

The ADAPT phase uses a two-layer decision system: the RL policy chooses the action, and the LLM explains the decision to the student in natural language.

Q-Learning Policy (decides the action)

The action (advance / reinforce / review) is chosen by a tabular Q-learning agent, not by the LLM. This runs every time a student finishes a quiz.

| Component | Value |
|---|---|
| State | Quiz score discretized into 5 buckets: very_low / low / medium / high / very_high |
| Actions | review · reinforce · advance |
| Reward | Score change between attempts: r = (new_score − prev_score) × 10 |
| Learning rate (α) | 0.2 |
| Discount factor (γ) | 0.8 |
| Exploration (ε) | 0.15 (epsilon-greedy) |

Update rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The Q-table is persisted to disk (data/qtable.json) and trains on every real quiz attempt. It can also be inspected and manually trained on the 🎯 RL Policy page.
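The parameter table and update rule translate into a few lines of tabular Q-learning. This sketch assumes equal-width score buckets and a zero-initialized table (the README describes seeding the table from Bloom's heuristic instead); `TabularQ`, `bucket`, and `reward` are hypothetical names, not rl_policy.QLearningPolicy, although `update(prev_score, action, new_score)` mirrors the call shape in the supporting-modules example.

```python
from __future__ import annotations

import random

ACTIONS = ("review", "reinforce", "advance")
BUCKETS = ("very_low", "low", "medium", "high", "very_high")
ALPHA, GAMMA, EPSILON = 0.2, 0.8, 0.15  # values from the parameter table


def bucket(score: float) -> str:
    """Map a score in [0, 1] to one of 5 equal-width buckets (assumed boundaries)."""
    return BUCKETS[min(int(score * 5), 4)]


def reward(prev_score: float, new_score: float) -> float:
    """r = (new_score - prev_score) * 10."""
    return (new_score - prev_score) * 10


class TabularQ:
    """Hypothetical tabular Q-learner over (score bucket, action) pairs."""

    def __init__(self, seed: int | None = None) -> None:
        # Zero-initialized for simplicity; see the heuristic seeding described below.
        self.q = {(s, a): 0.0 for s in BUCKETS for a in ACTIONS}
        self.rng = random.Random(seed)

    def choose_action(self, score: float) -> str:
        """Epsilon-greedy: explore with probability EPSILON, else exploit."""
        s = bucket(score)
        if self.rng.random() < EPSILON:
            return self.rng.choice(ACTIONS)
        # Greedy choice; ties go to the first action in ACTIONS.
        return max(ACTIONS, key=lambda a: self.q[(s, a)])

    def update(self, prev_score: float, action: str, new_score: float) -> float:
        """One Q-learning backup for the transition prev_score --action--> new_score."""
        s, s_next = bucket(prev_score), bucket(new_score)
        r = reward(prev_score, new_score)
        best_next = max(self.q[(s_next, a)] for a in ACTIONS)
        self.q[(s, action)] += ALPHA * (r + GAMMA * best_next - self.q[(s, action)])
        return r
```

With a fresh table, `update(0.40, "reinforce", 0.55)` yields r = 1.5 and moves Q(medium, reinforce) from 0 to 0.2 × 1.5 = 0.3, because both scores fall into the medium bucket under these assumed boundaries.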

LLM Layer (explains the decision)

After the RL policy picks the action, Claude (or Kimi-K2) generates a natural-language explanation of why that action makes sense for the student. The LLM cannot override the RL decision β€” it only produces the recommendation text.

```text
Student takes quiz → score = 55%  (previous attempt: 40%)
    → RL policy: Q("medium", "reinforce") = 0.42 (highest)  →  action = "reinforce"
    → Q-table updated with reward = (0.55 − 0.40) × 10 = 1.5
    → LLM generates: "You're close! Practice the same topic one more time..."
```

Heuristic Fallback

The Q-table is initialized with values informed by Bloom's 1968 mastery learning threshold (70%). As real data accumulates, the learned policy diverges from the heuristic and adapts to actual student behavior patterns.
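A rule-based policy built on the 70% threshold, plus one way to seed a Q-table from it, might look like this. The 40% review cut-off, the bucket midpoints, and the seeding scheme (a fixed bonus on the heuristic's action, zero elsewhere) are assumptions; the README does not specify the initial Q-values.

```python
from __future__ import annotations

ACTIONS = ("review", "reinforce", "advance")
MASTERY = 0.70       # Bloom-style mastery threshold used in this README
REVIEW_BELOW = 0.40  # hypothetical cut-off; the README does not state one

# Hypothetical midpoints of the five score buckets, assuming equal-width buckets.
BUCKET_MIDPOINTS = {"very_low": 0.1, "low": 0.3, "medium": 0.5, "high": 0.7, "very_high": 0.9}


def heuristic_action(score: float) -> str:
    """Rule-based policy: advance on mastery, review when far below it, else reinforce."""
    if score >= MASTERY:
        return "advance"
    if score >= REVIEW_BELOW:
        return "reinforce"
    return "review"


def heuristic_q_init(bonus: float = 1.0) -> dict[tuple[str, str], float]:
    """Seed a Q-table so each bucket's greedy action starts out equal to the heuristic."""
    return {
        (bucket, action): (bonus if heuristic_action(mid) == action else 0.0)
        for bucket, mid in BUCKET_MIDPOINTS.items()
        for action in ACTIONS
    }
```

Under these assumptions a 55% score maps to reinforce, matching the worked example above, and the seeded greedy policy behaves exactly like the heuristic until real rewards start moving the values.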


Why RL (and not just a Contextual Bandit)?

Context. A valid critique of applying full RL to this problem is that if each decision is nearly independent, a Contextual Bandit is more sample-efficient than a sequential RL agent. We take that critique seriously, so the project ships both and compares them directly.

When RL is justified here. The student's mastery state depends on the sequence of actions, not just the current context:

  1. Prerequisite coupling. Studying Backprop before Neural Nets is mastered gives a smaller skill gain (the simulated student encodes this via a prerequisite DAG). A bandit chooses actions independently per step and cannot trade off short-term score for long-term skill gain.
  2. Forgetting. Topics not practiced decay each step, so when you schedule a review matters: this is a classic sequential credit-assignment problem.
  3. Action latency. review tends to depress the immediate next quiz score (the student is working on a weak area) but pays off several steps later. A bandit, optimizing only single-step reward, systematically underweights this.

When a Bandit is better. If the deployment looks more like A/B-testing recommendation variants over many users with little per-user history, a bandit will converge faster and is probably the right tool. We added bandit_policy.LinUCBBandit so the same agent can be run in that mode via SmartStudyAgent(policy="bandit") or SMARTSTUDY_POLICY=bandit.
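For reference, the standard disjoint LinUCB algorithm is small enough to sketch without NumPy. This is not bandit_policy.LinUCBBandit: `LinUCBSketch`, the Sherman-Morrison shortcut for maintaining the inverse matrix, and the example context vector (a bias term plus the last quiz score) are illustrative assumptions.

```python
from __future__ import annotations

import math
from collections.abc import Sequence


def _identity(d: int) -> list[list[float]]:
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def _matvec(m: list[list[float]], v: Sequence[float]) -> list[float]:
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


class LinUCBSketch:
    """Disjoint LinUCB: one ridge-regression reward model per action (hypothetical class)."""

    def __init__(self, actions: Sequence[str], dim: int, alpha: float = 1.0) -> None:
        self.alpha = alpha  # scale of the confidence bonus (exploration strength)
        # Store A^-1 directly: A starts as the identity (ridge prior), so A^-1 does too.
        self.a_inv = {a: _identity(dim) for a in actions}
        self.b = {a: [0.0] * dim for a in actions}  # running sum of reward * context

    def choose(self, x: Sequence[float]) -> str:
        """Pick the action with the highest upper confidence bound for context x."""
        def ucb(action: str) -> float:
            a_inv = self.a_inv[action]
            theta = _matvec(a_inv, self.b[action])         # ridge estimate A^-1 b
            bonus = math.sqrt(_dot(x, _matvec(a_inv, x)))  # sqrt(x^T A^-1 x)
            return _dot(theta, x) + self.alpha * bonus
        return max(self.a_inv, key=ucb)

    def update(self, action: str, x: Sequence[float], reward: float) -> None:
        """Add one (context, reward) observation for the chosen action."""
        a_inv = self.a_inv[action]
        ax = _matvec(a_inv, x)  # A^-1 x (A^-1 is symmetric)
        denom = 1.0 + _dot(x, ax)
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        self.a_inv[action] = [
            [a_inv[i][j] - ax[i] * ax[j] / denom for j in range(len(ax))]
            for i in range(len(ax))
        ]
        self.b[action] = [bi + reward * xi for bi, xi in zip(self.b[action], x)]
```

Each decision depends only on the current context x; nothing in the update looks ahead to the next state, which is exactly the single-step limitation behind the three arguments for RL listed earlier.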

Empirical comparison

Run python evaluation.py. Each policy is evaluated on 30 simulated students × 30 sessions, all facing the same student trajectories for a fair paired comparison:

| Policy | Avg. observed score | Final mean skill | vs. random |
|---|---|---|---|
| Random | 0.33 ± 0.02 | 0.29 ± 0.01 | +0.0 % |
| Rule-based (Bloom 70 %) | 0.45 ± 0.02 | 0.53 ± 0.01 | +35 % |
| Contextual Bandit (LinUCB) | 0.43 ± 0.02 | 0.47 ± 0.02 | +28 % |
| Q-learning (tabular) | 0.40 ± 0.03 | 0.43 ± 0.06 | +18 % |

Numbers vary from run to run; these are representative of n_runs=30, n_sessions=30.

Reading the result honestly. In the short-horizon regime typical of a single study session, a well-designed rule-based heuristic is hard to beat. The Bandit comes close, paying a small sample-efficiency penalty. Q-learning needs more data to pay off the variance cost of bootstrapping through next states; it catches up to the Bandit on final skill by ~100 sessions, consistent with the sequential-credit-assignment argument above. This answers the professor's question directly: in this deployment, RL is defensible but not dominant, and a Contextual Bandit is a reasonable production default, so we ship it as a first-class option.

Simulated Student Model

Following the evaluation feedback, we replaced the earlier noise-only simulator with a small cognitive model (evaluation.SimulatedStudent): per-topic hidden skills, prerequisite-gated learning gain, diminishing returns as skill → 1, and per-step forgetting on unpracticed topics. This is what makes the rule-based vs. bandit vs. RL comparison meaningful; a purely random simulated student would flatten the differences.
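A toy version of such a student model fits in one small class. The learning gain, the forgetting rate, the 0.3 floor on prerequisite gating, and the three-question quiz are all invented for illustration; they are not the parameters of evaluation.SimulatedStudent.

```python
from __future__ import annotations

import random


class SimulatedStudentSketch:
    """Toy cognitive model with hidden per-topic skill in [0, 1] (hypothetical parameters)."""

    def __init__(self, prereqs: dict[str, set[str]], gain: float = 0.25,
                 forget: float = 0.03, seed: int | None = None) -> None:
        self.prereqs = prereqs                  # topic -> set of prerequisite topics
        self.skill = {t: 0.0 for t in prereqs}  # hidden state the agent never sees
        self.gain, self.forget = gain, forget
        self.rng = random.Random(seed)

    def study(self, topic: str) -> None:
        """One study step on `topic`; every other topic decays slightly."""
        # Prerequisite gating: weak prerequisites shrink (but never zero) the gain.
        weakest = min((self.skill[p] for p in self.prereqs[topic]), default=1.0)
        gate = 0.3 + 0.7 * weakest
        # Diminishing returns: the step shrinks as skill approaches 1.
        self.skill[topic] += self.gain * gate * (1.0 - self.skill[topic])
        # Forgetting: unpracticed topics lose a fixed fraction of their skill.
        for t in self.skill:
            if t != topic:
                self.skill[t] *= 1.0 - self.forget

    def quiz(self, topic: str, n_questions: int = 3) -> float:
        """Noisy observation: each question is correct with probability equal to skill."""
        correct = sum(self.rng.random() < self.skill[topic] for _ in range(n_questions))
        return correct / n_questions
```

Because `quiz` samples a noisy score from the hidden skill, a policy only ever sees the observation, which is the partial observability the POMDP framing assumes; studying a topic before its prerequisites still helps, just less.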


Roadmap

  • Core 5-phase OPEAA loop with Claude
  • Heuristic adaptive policy (Bloom 70%)
  • Persistent multi-student storage
  • Concept dependency graph + topological sort
  • Q-learning adaptive policy
  • SM-2 spaced repetition
  • Streamlit web app with 8 pages
  • Multi-format input loader
  • Quantitative baseline evaluation
  • Concept graph editor in the UI
  • Cross-course prerequisite linking (4 courses: AI, Data Science, NLP, Computer Vision)
  • Pilot study dashboard with engagement analysis and progression tracking
  • SQLite storage backend (replaces JSON, handles >1k students)
  • Deployed as hosted SaaS on Hugging Face Spaces
  • Contextual Bandit (LinUCB) policy as an alternative to full RL
  • 4-way evaluation against Rule-based baseline + Simulated Student Model (per professor feedback)
  • Chrome extension (MV3): same OPEAA loop on any web page, client-side Q-learning
  • Migrate extension to chrome.sidePanel for persistent belief-state display
  • Chrome Web Store listing

License

MIT License β€” see LICENSE for details.

Copyright © 2026 Haofei Sun

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...

Author

Haofei Sun

If you find this project useful, please consider giving it a ⭐ on GitHub.

For questions, suggestions, or collaboration: open an issue or start a discussion.

About

🎓 Adaptive AI study agent with POMDP belief state: OPEAA loop, Q-learning + LinUCB bandit policies, SM-2 spaced repetition, concept DAG. Streamlit web app + Chrome extension (MV3). Claude & free HF backends.
