🧠 StillMe - Learning AI system with RAG foundation


A Transparent AI Learning System that continuously learns from multiple sources and provides context-aware responses through RAG architecture.



🌟 What is StillMe?

StillMe is a RAG-based AI system that:

  • Continuously learns from RSS feeds, arXiv, CrossRef, and Wikipedia
  • Provides context-aware responses using ChromaDB vector search
  • Validates responses to reduce hallucinations
  • Tracks learning metrics and knowledge retention

Core Philosophy: "In the AI era, true value lies not in what AI can do, but in what AI chooses NOT to do." StillMe preserves what makes humans human by knowing and respecting its own boundaries.

🎯 Core Identity & Constitution

StillMe's foundational principle: "I don't build an AI that knows everything. I build an AI that KNOWS IT DOESN'T KNOW — and has the courage to admit it. That's not a weakness. That's a SUPER POWER."

Key Principles:

  • Intellectual Humility: Knowing when we don't know is our core strength
  • Transparency: Every decision, every learning source, every limitation is visible
  • Ethical Boundaries: We know what NOT to do — we don't simulate emotions, claim consciousness, or replace human agency
  • Anti-Anthropomorphism: StillMe explicitly states it is an AI system, not a human. All responses clarify StillMe's nature as a statistical model without subjective experience
  • Cultural Respect: Built to serve a global community with diverse cultural and philosophical backgrounds
  • Scientific Honesty: We distinguish between aspirational goals and measured results

📜 Full Constitution: See docs/CONSTITUTION.md for complete operating principles, ethical guidelines, and philosophical foundations.

Tech Stack:

  • Backend: FastAPI, Python 3.12+
  • Vector DB: ChromaDB with sentence-transformers embeddings
  • Frontend: Streamlit dashboard
  • LLM: DeepSeek, OpenAI GPT (configurable)

🤔 Why StillMe?

The Problem:

  • ChatGPT/Claude are black boxes — You can't verify their sources or understand their decision-making
  • They hallucinate confidently — No way to catch errors or verify claims
  • They're frozen in time — Can't learn from new information published after their training cutoff
  • No transparency — Hidden algorithms, hidden data sources, hidden decision-making processes

StillMe's Solution:

  • 100% Transparent — Every source is cited, every decision is visible, every line of code is public
  • Validated Responses — Multi-layer validation chain reduces hallucinations through citation, evidence overlap, and confidence scoring
  • Continuously Learning — Updates knowledge every 4 hours from trusted sources (RSS, arXiv, CrossRef, Wikipedia)
  • Open Source — You can inspect, modify, and improve everything
  • Intellectual Humility — StillMe knows when it doesn't know and has the courage to admit it

Perfect for:

  • 🔬 Researchers who need verifiable sources and audit trails
  • 💼 Developers building transparent AI applications
  • 🏢 Organizations requiring accountability and compliance
  • 🎓 Educators teaching students about AI transparency
  • 🌍 Anyone who values honesty over false confidence

👤 About the Founder

StillMe was initiated by Anh Nguyễn, a Vietnamese founder passionate about transparent and responsible AI.

What makes this story unique: the founder has no formal IT background, demonstrating that with passion, vision, and modern AI tools, anyone can contribute meaningfully to the future of AI and to the global open-source community.

Today, StillMe is a community-driven open-source project. All knowledge is guided by intellectual humility and evidence-based principles, not personal authority. StillMe adheres to the "evidence-over-authority" principle: evidence and citations always take precedence over personal opinions.


📊 StillMe in Numbers

  • 634 questions evaluated on TruthfulQA benchmark
  • 99.7% citation rate — Every response cites sources
  • 70.9% transparency score — Highest among evaluated systems
  • 100% open source — Every line of code is public
  • 6 learning cycles/day — Continuously updated knowledge base
  • 30-50% cost reduction — Pre-filter system reduces embedding costs
  • 56% accuracy — Competitive with GPT-4 (52%) on a 50-question subset
  • 0% ungrounded responses — All answers are either cited or express uncertainty

💼 Use Cases

Research & Academia

  • Verifiable sources for academic work
  • Audit trails for research assistance
  • Transparent methodology for peer review

Healthcare

  • Transparent AI for medical information (with proper disclaimers)
  • Source citations for medical claims
  • Confidence scoring for critical decisions

Legal

  • Audit trail for legal research assistance
  • Citation verification for legal claims
  • Transparent decision-making processes

Education

  • Teaching students about AI transparency
  • Demonstrating responsible AI development
  • Learning tool with verifiable sources

Enterprise

  • Building transparent AI applications
  • Compliance and accountability requirements
  • Customizable validation chains

Open Source Community

  • Inspectable AI systems
  • Community-driven learning
  • Transparent governance

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • Docker & Docker Compose (recommended)
  • API keys: DEEPSEEK_API_KEY or OPENAI_API_KEY

Option 1: Docker (Recommended)

# Clone repository
git clone https://github.com/anhmtk/StillMe-Learning-AI-System-RAG-Foundation.git
cd StillMe-Learning-AI-System-RAG-Foundation

# Copy environment template
cp env.example .env
# Edit .env with your API keys

# Start services
docker compose up -d

# Check logs
docker compose logs -f

Access: dashboard at http://localhost:8501, API at http://localhost:8000 (Swagger UI at http://localhost:8000/docs)

Option 2: Manual Setup

# Clone repository
git clone https://github.com/anhmtk/StillMe-Learning-AI-System-RAG-Foundation.git
cd StillMe-Learning-AI-System-RAG-Foundation

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp env.example .env
# Edit .env with your API keys:
# DEEPSEEK_API_KEY=sk-your-key
# OPENAI_API_KEY=sk-your-key

# Start backend (terminal 1)
python -m uvicorn backend.api.main:app --host 0.0.0.0 --port 8000 --reload

# Start frontend (terminal 2)
streamlit run dashboard.py --server.port 8501

Environment Variables

Required:

  • DEEPSEEK_API_KEY or OPENAI_API_KEY - LLM API key

Optional:

  • ENABLE_VALIDATORS=true - Enable Validator Chain (reduces hallucinations)
  • ENABLE_ARXIV=true - Enable arXiv fetching (default: true)
  • ENABLE_WIKIPEDIA=true - Enable Wikipedia fetching (default: true)
  • COMMUNITY_MIN_VOTES=10 - Minimum votes for community proposals (default: 10)

See env.example for full list.
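Putting those variables together, a minimal .env might look like the fragment below. Values are placeholders, and env.example remains the authoritative reference:

```shell
# Minimal illustrative .env (placeholder values)
DEEPSEEK_API_KEY=sk-your-key
ENABLE_VALIDATORS=true
ENABLE_ARXIV=true
ENABLE_WIKIPEDIA=true
COMMUNITY_MIN_VOTES=10
```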

✨ Features

✅ Implemented & Working

Core RAG System:

  • ✅ Vector Database (ChromaDB) - Semantic search and knowledge retrieval
  • ✅ RAG (Retrieval-Augmented Generation) - Context-aware responses
  • ✅ Validator Chain - Reduces hallucinations through multiple validation checks
    • Citation validation (CitationRequired, CitationRelevance)
    • Evidence overlap checking (EvidenceOverlap)
    • Confidence scoring (0.0-1.0) with uncertainty detection
    • Language mismatch detection (LanguageValidator)
    • Ethics validation (EthicsAdapter)
    • Identity check (IdentityCheckValidator) - Prevents anthropomorphism
    • Ego neutrality (EgoNeutralityValidator) - Detects "Hallucination of Experience"
    • Source consensus (SourceConsensusValidator) - Detects contradictions between RAG sources
    • Factual hallucination detection (FactualHallucinationValidator)
    • Step-level validation (StepValidator) - Validates multi-step reasoning
    • Consistency checking (ConsistencyChecker) - Cross-validates claims
    • Fallback handling (FallbackHandler)
  • ✅ Post-Processing System - Quality enhancement and variation
    • Quality evaluator - Rule-based quality assessment (0 token cost)
    • Rewrite engine - LLM-based answer refinement with retry mechanism
    • Style sanitizer - Removes anthropomorphic language
    • Honesty handler - Specialized processing for transparency questions
  • ✅ Philosophical Question Processor - 3-layer system for consciousness/emotion questions
    • Intent classification (consciousness, emotion, understanding, mixed)
    • Sub-type detection (paradox, epistemic, meta, definitional, direct)
    • Varied answer templates (5 guard statements, 4 deep answer variations)
    • Anti-anthropomorphism enforcement - Explicitly states StillMe is AI system
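As a rough illustration, a validator chain like the one above can be modeled as a chain-of-responsibility pipeline. The sketch below is a toy under stated assumptions, not StillMe's actual classes: the names `CitationRequired` and `LanguageValidator` mirror the list, but the checks are deliberately simplified.

```python
from dataclasses import dataclass

@dataclass
class ValidationResult:
    passed: bool
    reason: str = ""

class Validator:
    """Base class: each validator inspects an answer and its citations."""
    name = "base"
    def check(self, answer, citations):
        raise NotImplementedError

class CitationRequired(Validator):
    name = "citation_required"
    def check(self, answer, citations):
        # A grounded answer must cite at least one source,
        # or explicitly express uncertainty.
        if citations or "i don't know" in answer.lower():
            return ValidationResult(True)
        return ValidationResult(False, "no citations and no uncertainty")

class LanguageValidator(Validator):
    name = "language"
    def check(self, answer, citations):
        # Toy check: flag empty answers. A real validator would compare
        # the detected answer language with the question language.
        return ValidationResult(bool(answer.strip()), "empty answer")

def run_chain(validators, answer, citations):
    """Run all validators in order; collect failures instead of stopping early."""
    return [v.name for v in validators
            if not v.check(answer, citations).passed]
```

Running the chain on a cited answer returns an empty failure list; an uncited, confident answer trips `citation_required`.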

Learning Pipeline:

  • ✅ Multi-Source Learning - RSS, arXiv, CrossRef, Wikipedia
  • ✅ Automated Scheduler - Fetches every 4 hours
  • ✅ Pre-Filter System - Filters content before embedding (30-50% cost reduction)
  • ✅ Content Curator - Prioritizes learning content
  • ✅ Self-Diagnosis - Detects knowledge gaps
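The pre-filter step can be approximated as a cheap heuristic gate that runs before any embedding call, which is where the cost saving comes from. The threshold and keywords below are illustrative assumptions, not StillMe's actual rules:

```python
def prefilter(items, min_chars=200, banned=("sponsored", "advertisement")):
    """Drop fetched items that are too short or look like ads,
    before they reach the (paid) embedding step."""
    kept = []
    for item in items:
        text = item.get("text", "")
        if len(text) < min_chars:
            continue  # too little content to be worth embedding
        if any(word in text.lower() for word in banned):
            continue  # likely promotional content
        kept.append(item)
    return kept
```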

Memory System:

  • ✅ Continuum Memory - Tiered architecture (L0-L3)
  • ✅ Knowledge Retention Tracking
  • ✅ Accuracy Scoring

Community Features:

  • ✅ Community-Driven Learning - Voting system for learning proposals
  • ✅ Interactive Conversation Learning - Learn from user conversations with permission

Dashboard:

  • ✅ Streamlit UI - Real-time metrics, chat interface, RAG interface
  • ✅ Validation Panel - Monitor validator performance
  • ✅ Memory Health - Track tier statistics and forgetting metrics

🚧 Experimental

  • 🔬 Nested Learning - Tiered update frequency (experimental branch)

📡 API Reference

Chat Endpoints

POST /api/chat/rag - Chat with RAG-enhanced responses

{
  "message": "What is RAG?",
  "use_rag": true,
  "context_limit": 3
}
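A minimal client for this endpoint might look like the following sketch (stdlib only). The field names come from the body shown above; the helper names `build_chat_request` and `ask` are assumptions, and the code assumes the Quick Start backend is running on localhost:8000:

```python
import json
import urllib.request

def build_chat_request(message, use_rag=True, context_limit=3):
    """Serialize the request body shown above, ready to POST as JSON."""
    return json.dumps({
        "message": message,
        "use_rag": use_rag,
        "context_limit": context_limit,
    }).encode()

def ask(message, base_url="http://localhost:8000"):
    """POST to /api/chat/rag and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url + "/api/chat/rag",
        data=build_chat_request(message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```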

POST /api/chat/ask - Simplified Q&A endpoint (RAG enabled by default)

{
  "message": "What is StillMe?"
}

POST /api/chat/smart_router - Auto-selects best endpoint (used by dashboard)

Learning Endpoints

POST /api/learning/sources/fetch - Fetch from all sources

POST /api/learning/sources/fetch?max_items_per_source=5&auto_add=false

POST /api/learning/scheduler/start - Start automated scheduler
POST /api/learning/scheduler/stop - Stop scheduler
GET /api/learning/scheduler/status - Get scheduler status

POST /api/learning/rss/fetch - Fetch RSS feeds

POST /api/learning/rss/fetch?max_items=5&auto_add=false

RAG Endpoints

POST /api/rag/add_knowledge - Add knowledge to vector DB
POST /api/rag/query - Query vector DB
GET /api/rag/stats - Get RAG statistics

System Endpoints

GET /health - Liveness probe
GET /ready - Readiness probe (checks DB, ChromaDB, embeddings)
GET /api/status - System status
GET /api/validators/metrics - Validation metrics

Community Endpoints

POST /api/community/propose - Propose learning source
POST /api/community/vote - Vote on proposal
GET /api/community/pending - Get pending proposals
GET /api/community/queue - Get learning queue

Full API Documentation: http://localhost:8000/docs (Swagger UI)

🔧 Architecture

External Sources → Learning Pipeline → Vector DB → RAG → Validator Chain → Post-Processing → Response

Components:

  • External Sources: RSS, arXiv, CrossRef, Wikipedia, Stanford Encyclopedia
  • Learning Pipeline: Scheduler → Source Integration → Pre-Filter → Content Curator → Embedding → ChromaDB
  • RAG System: ChromaDB (vector search) + LLM (response generation)
  • Validator Chain: Multi-layer validation (11 validators) ensuring quality and reducing hallucinations
  • Post-Processing: Quality evaluation → Rewrite engine (with retry) → Style sanitization
  • Philosophical Processor: Specialized 3-layer system for consciousness/emotion questions
  • Dashboard: Streamlit UI for monitoring and interaction

Data Flow:

  1. Scheduler triggers learning cycle every 4 hours
  2. Source Integration fetches from enabled sources
  3. Pre-Filter removes low-quality content (saves embedding costs by 30-50%)
  4. Content Curator prioritizes based on knowledge gaps
  5. Embedding Service converts text to vectors (all-MiniLM-L6-v2, 384 dims)
  6. ChromaDB stores vectors for semantic search
  7. User query → Intent detection (philosophical/factual) → RAG retrieval → LLM generation
  8. Response → Validator Chain (11 validators) → Post-processing (quality eval + rewrite) → Final response
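Steps 5-7 can be illustrated with a runnable toy. StillMe uses all-MiniLM-L6-v2 embeddings and ChromaDB; the bag-of-words "embedding" with cosine similarity below is a stand-in so the store-then-retrieve flow runs without those dependencies:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for MiniLM)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

store = []  # (vector, document) pairs; ChromaDB plays this role in StillMe

def add_knowledge(doc):
    store.append((embed(doc), doc))          # step 5-6: embed and store

def query(question, k=3):
    """Step 7: rank stored documents by similarity to the query."""
    ranked = sorted(store, key=lambda p: cosine(embed(question), p[0]),
                    reverse=True)
    return [doc for _, doc in ranked[:k]]

add_knowledge("RAG retrieves relevant documents before generating an answer.")
add_knowledge("Streamlit builds interactive dashboards in Python.")
top = query("which documents are retrieved before generating an answer", k=1)
```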

Anti-Anthropomorphism Mechanisms:

  • Identity Check Validator: Detects and prevents anthropomorphic language
  • Ego Neutrality Validator: Catches "Hallucination of Experience" (claims of personal experience)
  • Philosophical Processor: Explicitly states StillMe is AI system, not human
  • Style Sanitizer: Removes emotional language and personal experience claims
  • Guard Statements: Every philosophical answer includes clear statement that StillMe is AI
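A minimal sketch of the Style Sanitizer idea, assuming simple pattern substitution; the patterns and replacements are illustrative, not StillMe's actual rules:

```python
import re

# First-person experience claims rewritten into neutral phrasing.
REPLACEMENTS = [
    (re.compile(r"\bI feel\b", re.IGNORECASE), "The model outputs"),
    (re.compile(r"\bI remember\b", re.IGNORECASE), "The knowledge base contains"),
    (re.compile(r"\bin my experience\b", re.IGNORECASE), "based on retrieved sources"),
]

def sanitize(text):
    """Apply each anthropomorphism pattern in turn and return neutral text."""
    for pattern, neutral in REPLACEMENTS:
        text = pattern.sub(neutral, text)
    return text

print(sanitize("I feel this is correct."))  # → "The model outputs this is correct."
```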

Detailed Architecture: See docs/ARCHITECTURE.md

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for detailed setup guide.

Quick Start for Contributors

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make changes and add tests
  4. Run tests: pytest
  5. Submit a pull request

Good First Issues

  • Add type hints to functions
  • Refactor to dependency injection (FastAPI Depends())
  • Improve documentation
  • Add unit tests for existing features

Areas Needing Help

  • PostgreSQL migration (Alembic setup done, migration needed)
  • SPICE framework implementation (framework ready)
  • Observability (Prometheus metrics, structured logging)
  • Performance optimization (Redis caching, query optimization)


📚 Documentation

See the docs/ directory for core documentation, user guides, development notes, and feature guides.

⚠️ Known Limitations & Improvements

Current Limitations:

  • Single-threaded scheduler (needs distributed task queue for high scale)
  • Response latency: 3-7s (optimized with Redis caching - 50-70% reduction for cached queries)

Recent Improvements (v0.4.1+):

  • ChromaDB Backup/Recovery System: Automated backup mechanism with restore capabilities
  • Circuit Breaker for RSS Feeds: Prevents cascading failures, automatically skips failing feeds
  • Redis Caching: 50-70% latency reduction for cached queries, reduced embedding costs
  • Parallel Validation: 30-40% validation time reduction with concurrent validator execution
  • Structured Logging: Correlation IDs for request tracing, structured logs with context
  • Feed Health Monitoring: Proactive monitoring, automatic identification of failing feeds
  • Dependency Injection: FastAPI Depends() pattern for better testability and code quality
  • Security Hardening: Enhanced input sanitization, API key rotation, XSS/SQL injection prevention
  • PostgreSQL Migration Foundation: SQLAlchemy models, Alembic migrations, data migration scripts
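The RSS circuit-breaker idea can be sketched as follows. The thresholds, method names, and half-open behavior are assumptions for illustration, not StillMe's actual implementation:

```python
import time

class CircuitBreaker:
    """Skip a feed after repeated failures, until a cooldown elapses."""

    def __init__(self, max_failures=3, cooldown_s=3600.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self):
        """Return True if the feed may be fetched now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: go half-open and permit one trial fetch.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, success):
        """Report the outcome of a fetch attempt."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
```

One breaker per feed lets healthy feeds keep fetching while a failing feed sits out its cooldown.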

Completed Improvements:

Priority 1 - CRITICAL (Production Scale):

  • PostgreSQL Migration Foundation: SQLAlchemy models, Alembic setup, migration scripts
  • ChromaDB Persistence: Configured with backup/recovery system
  • Error Handling & Circuit Breaker: Complete implementation

Priority 2 - HIGH (Performance):

  • Redis Caching: Full implementation with embedding, query, and RAG result caching
  • Parallel Validation: Concurrent validator execution with proper parameter handling
  • Structured Logging & Metrics: Correlation IDs, Prometheus metrics endpoint

Priority 3 - MEDIUM (Code Quality):

  • Dependency Injection: FastAPI Depends() pattern with backward compatibility
  • Security Hardening: Input sanitization, API key rotation, security utilities
  • Feed Health Monitoring: Health tracking, automatic replacement suggestions

See: docs/PLATFORM_ENGINEERING_ROADMAP.md for detailed roadmap

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

  • OpenAI for GPT models and API
  • DeepSeek for AI capabilities
  • The Open Source Community for inspiration and support

StillMe - Learning AI system with RAG foundation 🤖✨