Intelligent Research Paper Discovery and Management System
Research Agent is an automated system for discovering, filtering, downloading, and organizing academic research papers from multiple sources. It uses intelligent filtering to identify relevant papers while automatically excluding job postings, link aggregators, and marketing content.
- 🔍 Multi-Source Search: Simultaneously searches ArXiv, Semantic Scholar, LessWrong, and AI lab publications
- ☁️ Cloud Storage Integration: Automatically transfers results to a "ground truth" cloud folder (e.g., Google Drive)
- 🛡️ Absolute Data Protection: Robust safeguards prevent accidental deletion of existing papers in cloud storage
- 🧠 Intelligent Filtering: Advanced content filtering with boolean query support
- 🚀 Parallel Execution: All sources searched concurrently using multiprocessing for maximum speed
- 🔄 Self-Healing: Automatic error detection, rollback, and retry
- 💾 Smart Deduplication: Detects and merges papers found across multiple sources and cloud storage
- 🎯 Mode-Based Operation: TESTING, TEST (count-only), DAILY, and BACKFILL modes, with automatic mode selection
- 🖥️ Progress Tracking: Real-time stats and progress bars for large backfill operations
- 📦 ZIP Backups: Dedicated backup system with compression and cloud directory support
- Quick Start
- Installation
- Usage
- Configuration
- Search Modes
- Prompt Syntax
- Architecture
- Features
- Testing
- Troubleshooting
- Advanced Topics
- Contributing
pip install -r requirements.txt

Edit prompts/prompt.txt:
# Example
echo '("AI" OR "machine learning") AND ("safety" OR "alignment")' > prompts/prompt.txt

CLI (Recommended):
python main.py --mode TESTING

GUI (Recommended):
# Double-click the launcher script:
run_gui.bat

Results are organized across staging and production storage:
- Production (Cloud): R:/MyDrive/03 Research Papers/ (configurable)
- Staging (Temp): F:/RESTMP/ (temporary staging area)
- Database: F:/TMPRES/metadata.db (indexed metadata)
- Excel Log: F:/Antigravity_Results/Research_Papers/research_log.xlsx
Operating System:
- ✅ Windows 10/11
- ✅ macOS 10.15 (Catalina) or higher
- ✅ Linux (Ubuntu 20.04+, Debian 10+, or equivalent)
Hardware:
- RAM: 2GB minimum, 4GB+ recommended
- Disk Space:
- 500MB for application and dependencies
- 1GB+ recommended for paper storage
- 5GB+ for extensive backfill operations
- Network: Stable internet connection required for API access
Performance Estimates:
- TESTING mode: ~50MB download
- DAILY mode: ~200MB download
- BACKFILL mode: 500MB - 5GB+ (depends on query scope)
- Python 3.8 or higher
  - Check version: python --version or python3 --version
  - Download from: https://www.python.org/downloads/
  - ⚠️ Important: During installation, check "Add Python to PATH"
- pip (Python package manager)
  - Usually included with Python 3.8+
  - Check: pip --version or pip3 --version
  - If missing: python -m ensurepip --upgrade
- Git (for cloning the repository)
  - Check: git --version
  - Download from: https://git-scm.com/downloads
  - Alternative: download a ZIP from GitHub
Windows:
- No additional dependencies required
- Microsoft Visual C++ Redistributable (usually pre-installed)
macOS:
- Xcode Command Line Tools (usually pre-installed)
- Install if needed:
xcode-select --install
Linux (Debian/Ubuntu):
# Install system dependencies for Playwright
sudo apt-get update
sudo apt-get install -y \
libnss3 \
libnspr4 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libcups2 \
libdrm2 \
libxkbcommon0 \
libxcomposite1 \
libxdamage1 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libasound2

Linux (Fedora/RHEL):
sudo dnf install -y \
nss \
nspr \
atk \
at-spi2-atk \
cups-libs \
libdrm \
libxkbcommon \
libXcomposite \
libXdamage \
libXfixes \
libXrandr \
mesa-libgbm \
alsa-lib

- SQLite Browser: for viewing database contents
  - Download: https://sqlitebrowser.org/
  - Alternative: use the command-line sqlite3 tool
- Virtual Environment Support: included with Python 3.8+
  - Check: python -m venv --help
- Clone the repository:
  git clone https://github.com/yourusername/research-agent.git
  cd research-agent
- Create virtual environment (recommended):
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Install Playwright browsers (for web scraping):
  playwright install chromium
- Configure settings:
  cp config.yaml.example config.yaml
  # Edit config.yaml with your paths
- Create prompt file:
  echo '("AI" OR "machine learning") AND ("safety")' > prompts/prompt.txt
Run these commands to verify everything is working:
1. Check Python and Dependencies:
# Verify Python version
python --version
# Should show: Python 3.8.x or higher
# Verify dependencies installed
pip list | grep arxiv
pip list | grep playwright
pip list | grep openpyxl
# Should show versions for each package

2. Verify Playwright Browsers:
playwright install --dry-run chromium
# Should show: chromium is already installed

3. Run Quick Test:
python test_mode_settings.py
# Expected output:
# [PASS] All three modes present in config
# [PASS] TESTING mode configured correctly
# [PASS] DAILY mode configured correctly
# [PASS] BACKFILL mode configured correctly
# [PASS] Supervisor stores search_params correctly
# ...all tests passing

4. Test Run (Optional):
# Quick 30-second test
python main.py --mode TESTING
# Should complete without errors and download a few papers

Troubleshooting Verification:
If test fails with "ModuleNotFoundError":
# Ensure virtual environment is activated
source venv/bin/activate # Linux/Mac
# OR
venv\Scripts\activate # Windows
# Reinstall dependencies
pip install -r requirements.txt

If test fails with "playwright command not found":
# Install playwright CLI
pip install playwright
playwright install chromium

If test fails with "FileNotFoundError: config.yaml":
# Create default config
cp config.yaml.example config.yaml
# OR manually create config.yaml from template# Automatic mode (detects based on database)
python main.py
# Test mode (count-only verification)
python main.py --mode TEST
# Backfill mode (full historical retrieval)
python main.py --mode BACKFILL

# Custom prompt from command line
python main.py --mode TESTING --prompt '("robotics") AND ("safety")'
# Override max results (deprecated, use config.yaml instead)
python main.py --max-results 50

If no mode specified, the agent automatically chooses:
- BACKFILL if database is empty
- DAILY if database contains papers
python main.py # Auto-detects appropriate mode

python gui.py

- Real-time progress for each source (ArXiv, Semantic Scholar, etc.)
- Live status updates showing current operations
- Cancel button for graceful shutdown
- Scrolling log of all operations
- Paper counts per source
- Launch GUI: python gui.py
- Edit prompts/prompt.txt with your search terms
- Click "Start Agent"
- Monitor progress in real-time
- Click "Cancel Run" if needed
- Results saved automatically when complete
# General Settings
storage_path: "F:/Antigravity_Results/Research_Papers/data"
staging_dir: "F:/RESTMP"
db_path: "F:/TMPRES/metadata.db"
# Cloud Storage Settings
cloud_storage:
enabled: true
path: "R:/MyDrive/03 Research Papers"
check_duplicates: true
backup_enabled: true
# Mode-Specific Settings
mode_settings:
testing:
max_papers_per_agent: 10
per_query_limit: 5
respect_date_range: false
test:
max_papers_per_agent: 0 # Count-only mode
per_query_limit: 100
daily:
max_papers_per_agent: 50
per_query_limit: 20
respect_date_range: true
backfill:
max_papers_per_agent: null # Unlimited
per_query_limit: 10
respect_date_range: true
# Retry and Timeout Settings
retry_settings:
max_worker_retries: 2 # Worker restart attempts
worker_retry_delay: 5 # Seconds between retries
worker_timeout: 600 # Worker timeout (10 min)
api_max_retries: 3 # API call retries
api_base_delay: 2 # Exponential backoff base
request_pacing_delay: 1.0 # Rate limiting delay
# Export Settings
export_dir: "."
export_filename: "research_log.xlsx"

| Setting | Description | Default |
|---|---|---|
| papers_dir | Where PDFs are saved | data/papers |
| db_path | SQLite database location | data/metadata.db |
| max_papers_per_agent | Total papers per source | Mode-dependent |
| per_query_limit | Papers per API call | Mode-dependent |
| worker_timeout | Max worker runtime | 600 seconds |
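The retry-related settings follow a standard exponential-backoff pattern. The sketch below is illustrative only; the `call_with_retries` helper is hypothetical and not part of the codebase, but it shows how the three knobs would interact:

```python
import time
import random

def call_with_retries(fn, max_retries=3, base_delay=2, pacing_delay=1.0):
    """Retry fn() with exponential backoff, mirroring retry_settings.

    max_retries  -> api_max_retries
    base_delay   -> api_base_delay (exponential backoff base)
    pacing_delay -> request_pacing_delay (rate limiting)
    """
    for attempt in range(max_retries + 1):
        try:
            time.sleep(pacing_delay)  # pace every request
            return fn()
        except Exception as exc:
            if attempt == max_retries:
                raise  # retries exhausted
            # 2s, 4s, 8s, ... plus jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```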
Purpose: Quick verification during development
Characteristics:
- Limit: 10 papers per source (40 total)
- Batch size: 5 papers per API call
- Date handling: Ignores date ranges
- Duration: ~30 seconds
When to use:
- Testing new search queries
- Verifying system works
- Quick sampling of results
Example:
python main.py --mode TESTING

If the local database (metadata.db) becomes corrupted or out of sync with the cloud folder, use the Reconstruction Tools:
- Run Reconstruction:
  - Double-click Run Reconstruction.lnk (or run_reconstruct.bat).
  - Choose Update (default) or Wipe (clean start).
  - Runs invisibly in the background.
- Monitor Progress:
  - Double-click View Logs.lnk (or view_logs.bat).
  - Tails the log file from cloud storage in real time.
Purpose: Incremental updates for ongoing monitoring
Characteristics:
- Limit: 50 papers per source (200 total)
- Batch size: 20 papers per API call
- Date handling: Only papers after last run
- Duration: 2-5 minutes
When to use:
- Daily morning update
- Scheduled cron jobs
- Monitoring new publications
Example:
python main.py --mode DAILY

Auto-selected when: Database contains papers
Purpose: Historical collection for new topics
Characteristics:
- Limit: UNLIMITED (until date range satisfied)
- Batch size: 10 papers per API call (stable)
- Date handling: Fetches back to start_date (default: 2023-01-01)
- Duration: 10+ minutes (depends on topic)
When to use:
- First run with new topic
- Building historical corpus
- Comprehensive literature review
Example:
python main.py --mode BACKFILL

Auto-selected when: Database is empty
Research Agent uses boolean logic for precise filtering:
("term1" OR "term2") AND ("term3" OR "term4") ANDNOT ("exclude1" OR "exclude2")
- Quoted Terms: always use quotes around search terms
  "AI safety" "machine learning"
- OR Groups: parentheses with OR for alternatives
  ("AI" OR "artificial intelligence" OR "machine learning")
- AND Logic: connect groups with AND
  ("AI" OR "ML") AND ("safety" OR "alignment")
- Exclusions: use ANDNOT at the end
  ANDNOT ("automotive" OR "medical" OR "clinical")
Basic Search:
("AI safety")
Multiple Terms:
("AI" OR "machine learning") AND ("safety")
Complex Query:
("artificial intelligence" OR "large language model" OR "LLM")
AND ("alignment" OR "safety" OR "risk")
ANDNOT ("automotive" OR "medical" OR "agriculture")
Very Specific:
("AI" OR "machine learning" OR "deep learning")
AND ("safety" OR "alignment" OR "interpretability" OR "explainability")
AND ("language model" OR "LLM" OR "GPT" OR "transformer")
ANDNOT ("medical" OR "clinical" OR "automotive" OR "financial")
Prompts are automatically validated for:
- ✓ Balanced parentheses
- ✓ Balanced quotes
- ✓ No empty groups
- ✓ Valid operators (AND, OR, ANDNOT only)
- ✓ At least one inclusion term
Invalid prompts are rejected with helpful error messages.
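For illustration, here is a minimal sketch of those validation checks; the project's actual validator may differ in detail:

```python
import re

VALID_OPERATORS = {"AND", "OR", "ANDNOT"}

def validate_prompt(prompt):
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    if prompt.count("(") != prompt.count(")"):
        errors.append("Unbalanced parentheses")
    quotes = prompt.count('"')
    if quotes % 2 != 0:
        errors.append(f"Unbalanced quotes: found {quotes} quotes (must be even)")
    if re.search(r"\(\s*\)", prompt):
        errors.append("Empty group: ()")
    # Outside quoted terms, only AND / OR / ANDNOT are allowed as words
    outside = re.sub(r'"[^"]*"', "", prompt)
    for token in re.findall(r"[A-Za-z]+", outside):
        if token.upper() not in VALID_OPERATORS:
            errors.append(f"Invalid operator: {token}")
    # At least one inclusion term: a quoted term before any ANDNOT
    if not re.search(r'"[^"]+"', prompt.split("ANDNOT")[0]):
        errors.append("No inclusion terms found")
    return errors

# Example: prints [] for a valid prompt
print(validate_prompt('("AI" OR "ML") AND ("safety" OR "alignment")'))
```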
┌─────────────────────────────────────────────────────────────┐
│ User Interface │
│ (CLI: main.py / GUI: gui.py) │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Supervisor │
│ (Orchestrates workers, handles errors) │
└───┬──────────┬──────────┬──────────┬─────────┬─────────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│  ArXiv  │ │Semantic │ │LessWrong│ │ AI Labs │ │   ...   │
│ Worker  │ │ Scholar │ │ Worker  │ │ Worker  │ │ Worker  │
│         │ │ Worker  │ │         │ │         │ │         │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │ │
└──────────┴──────────┴──────────┴──────────┘
│
▼
┌──────────────────────────────────────────────┐
│ FilterManager │
│ (Boolean logic, content filtering) │
└───────────────────┬──────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ StorageManager │
│ (SQLite DB, deduplication, versioning) │
└───────────────────┬──────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ ExportManager │
│ (Excel generation) │
└──────────────────────────────────────────────┘
Supervisor (src/supervisor.py)
- Manages worker processes
- Handles errors and retries
- Implements self-healing
- Tracks heartbeats and timeouts
Worker (src/worker.py)
- Runs searchers in isolated processes
- Filters results
- Downloads PDFs
- Stores metadata
FilterManager (src/filter.py)
- Parses boolean queries
- Filters papers by content
- Excludes job postings, marketing, link aggregators
StorageManager (src/storage.py)
- SQLite database operations
- Deduplication across sources and cloud storage
- Schema migrations
- Rollback support with cloud protection
CloudTransferManager (src/cloud_transfer.py)
- Manages staging to cloud transfers
- Implements conflict resolution dialogs
- Verified non-destructive operations
BackupManager (src/backup.py)
- ZIP compression for database and papers
- Configurable backup directories
Searchers (src/searchers/)
- ArxivSearcher: arXiv.org papers
- SemanticSearcher: Semantic Scholar API
- LessWrongSearcher: LessWrong/Alignment Forum
- LabScraper: OpenAI, Anthropic, DeepMind, etc.
Automatically excludes non-research content:
Job Postings:
- "Job Opening: AI Researcher"
- "We're Hiring - Apply Now"
- "Career Opportunities"
Link Aggregators:
- "Weekly AI Safety Roundup"
- "This Week in Machine Learning"
- "Curated Research Links"
Marketing Content:
- "New AI Platform - Sign Up Today"
- "Request a Demo"
- "Buy Now - Limited Offer"
36+ default exclusion terms are always applied.
See CONTENT_FILTERING_GUIDE.md for details.
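As a rough illustration of the mechanism, a minimal sketch follows; the term list here is a made-up subset, not the project's real 36+ terms in src/filter.py:

```python
# Illustrative subset only -- the real list has 36+ terms (src/filter.py)
DEFAULT_EXCLUSIONS = [
    "job opening", "we're hiring", "career opportunities",  # job postings
    "weekly roundup", "this week in", "curated research",   # link aggregators
    "sign up", "request a demo", "buy now",                 # marketing
]

def is_excluded(title, abstract=""):
    """True if an item looks like non-research content."""
    text = f"{title} {abstract}".lower()
    return any(term in text for term in DEFAULT_EXCLUSIONS)

# Example
print(is_excluded("Job Opening: AI Researcher"))  # True
print(is_excluded("A Survey of AI Safety"))       # False
```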
Papers found on multiple sources are automatically merged:
ArXiv: "AI Safety Survey" (ID: arxiv-123)
↓
Database: [arxiv-123, source="arxiv"]
Semantic Scholar: "AI Safety Survey" (ID: arxiv-123)
↓
Database: [arxiv-123, source="arxiv, semantic", urls="url1 ; url2"]
Benefits:
- Single entry per paper
- All source URLs preserved
- No duplicate downloads
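A sketch of the merge step, assuming a simplified papers table with id, title, source, and urls columns (the real schema in src/storage.py may differ):

```python
import sqlite3

def upsert_paper(conn, paper_id, source, url, title):
    """Insert a paper, or merge source/url into an existing row."""
    row = conn.execute(
        "SELECT source, urls FROM papers WHERE id = ?", (paper_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO papers (id, title, source, urls) VALUES (?, ?, ?, ?)",
            (paper_id, title, source, url),
        )
    else:
        sources, urls = row
        if source not in sources.split(", "):
            sources = f"{sources}, {source}"  # e.g. "arxiv, semantic"
        if url not in urls.split(" ; "):
            urls = f"{urls} ; {url}"          # e.g. "url1 ; url2"
        conn.execute(
            "UPDATE papers SET source = ?, urls = ? WHERE id = ?",
            (sources, urls, paper_id),
        )
    conn.commit()
```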
When errors occur:
- Detection: Worker crash or timeout detected
- Rollback: Database entries and files deleted
- Analysis: Error logged and analyzed
- Retry: Worker restarted (up to N times)
- Recovery: System continues with other workers
Example:
[ArXiv] ERROR: Connection reset
[Supervisor] Rolling back ArXiv work...
[Supervisor] Deleted 5 DB entries, 5 files
[Supervisor] Self-healing attempt 1/2...
[ArXiv] Restarting...
[ArXiv] Complete - 10 papers downloaded
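A simplified sketch of that detect/rollback/retry loop; run_worker and rollback_worker are hypothetical stand-ins for the real supervisor internals:

```python
def supervise(worker_name, run_worker, rollback_worker, max_retries=2):
    """Run a worker; on failure, roll back its work and retry."""
    for attempt in range(1, max_retries + 2):  # initial run + retries
        try:
            return run_worker(worker_name)     # blocks until the worker finishes
        except Exception as exc:
            print(f"[{worker_name}] ERROR: {exc}")
            print(f"[Supervisor] Rolling back {worker_name} work...")
            rollback_worker(worker_name)       # delete DB entries and files
            if attempt > max_retries:
                print(f"[Supervisor] Giving up on {worker_name}")
                raise
            print(f"[Supervisor] Self-healing attempt {attempt}/{max_retries}...")
```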
Different modes optimized for different use cases:
| Feature | TESTING | DAILY | BACKFILL |
|---|---|---|---|
| Total Limit | 10 | 50 | ∞ |
| Batch Size | 5 | 20 | 10 |
| Duration | 30s | 2-5m | 10-60m |
| Use Case | Quick test | Incremental | Historical |
See MODE_SETTINGS_GUIDE.md for details.
Automatic schema versioning and migrations:
# Current version tracked in database
CURRENT_VERSION = 2
# Migrations applied automatically
v1: Add 'source' column
v2: Create schema_version table

Benefits:
- Safe upgrades across versions
- Idempotent migrations
- Handles legacy databases
See test_migrations.py for verification.
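A minimal sketch of how versioned, idempotent migrations like these are commonly applied; this is not the project's exact code and assumes a papers table already exists:

```python
import sqlite3

CURRENT_VERSION = 2
MIGRATIONS = {
    1: "ALTER TABLE papers ADD COLUMN source TEXT DEFAULT ''",
    2: "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)",
}

def migrate(conn):
    """Apply any migrations newer than the stored schema version."""
    # Ensure the version table exists before reading it (idempotent)
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    version = row[0] or 0  # 0 = legacy/unversioned database
    for v in range(version + 1, CURRENT_VERSION + 1):
        conn.execute(MIGRATIONS[v])
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (v,))
        conn.commit()
```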
URLs normalized to prevent duplicates:
http://example.com/paper/ → https://example.com/paper
https://example.com/paper?utm_source=twitter → https://example.com/paper
HTTP://EXAMPLE.COM/PAPER → https://example.com/paper
Normalized by:
- Protocol (https)
- Domain case (lowercase)
- Trailing slashes (removed)
- Tracking parameters (removed)
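A sketch of these rules using Python's standard urllib.parse; the tracking-parameter list here is an assumed subset, and only the domain is lowercased, per the rules above:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Assumed subset of tracking parameters; the real list may differ
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "ref", "fbclid"}

def normalize_url(url):
    """Force https, lowercase the domain, drop trailing slashes
    and tracking parameters."""
    parts = urlparse(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in TRACKING_PARAMS]
    return urlunparse((
        "https",                # force https
        parts.netloc.lower(),   # lowercase domain
        parts.path.rstrip("/"), # remove trailing slash
        parts.params,
        urlencode(query),       # tracking parameters already filtered out
        parts.fragment,
    ))

# Example
print(normalize_url("http://Example.com/paper/?utm_source=twitter"))
# -> https://example.com/paper
```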
All sources searched simultaneously:
[2024-01-11 10:00:00] Starting parallel search workers...
[2024-01-11 10:00:00] Supervisor started worker: ArXiv
[2024-01-11 10:00:00] Supervisor started worker: Semantic Scholar
[2024-01-11 10:00:00] Supervisor started worker: LessWrong
[2024-01-11 10:00:01] Supervisor started worker: AI Labs
Performance: ~3-4x faster than sequential execution
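A stripped-down sketch of this launch pattern with Python's multiprocessing; run_worker is a placeholder, and the real supervisor adds heartbeats, retries, and rollback:

```python
import multiprocessing as mp

def run_worker(name, query, results):
    # Placeholder: each worker would search its source, filter,
    # download PDFs, and report counts back to the supervisor.
    results[name] = f"{name} done"

if __name__ == "__main__":
    sources = ["ArXiv", "Semantic Scholar", "LessWrong", "AI Labs"]
    manager = mp.Manager()
    results = manager.dict()
    procs = [mp.Process(target=run_worker, args=(s, "query", results))
             for s in sources]
    for p in procs:
        p.start()            # all sources start simultaneously
    for p in procs:
        p.join(timeout=600)  # worker_timeout from config
    print(dict(results))
```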
All retry/timeout settings configurable:
retry_settings:
max_worker_retries: 2 # Supervisor retries
worker_retry_delay: 5 # Seconds between retries
worker_timeout: 600 # Worker max runtime
api_max_retries: 3 # API call retries
api_base_delay: 2 # Exponential backoff
request_pacing_delay: 1.0 # Rate limiting

Run all tests:
# Mode settings
python test_mode_settings.py
# Config loading
python test_config_settings.py
# Migrations
python test_migrations.py
# Content filtering
python test_content_filtering.py
# Integration tests
python test_integration.py
# High priority fixes
python test_verification.py
# Self-healing
python test_self_healing.py
# Prompt validation
python test_prompt_validation.py

| Test File | Tests | Status |
|---|---|---|
| test_mode_settings.py | 8 | ✅ PASS |
| test_config_settings.py | 8 | ✅ PASS |
| test_migrations.py | 6 | ✅ PASS |
| test_content_filtering.py | 7 | ✅ PASS |
| test_integration.py | 5 | ✅ PASS |
| test_prompt_validation.py | 8 | ✅ PASS |
Total: 42+ automated tests
Symptom:
ERROR - Invalid prompt syntax:
- Unbalanced quotes: found 3 quotes (must be even)
Solution:
- Check all terms have matching quotes
- Example: ("AI" OR "ML"), not ("AI OR "ML")
Symptom:
ERROR - Zero documents returned during backfill run.
Causes:
- Search terms too specific (no matches)
- All papers filtered out
- API errors across all sources
Solutions:
- Broaden search terms
- Check debug logs: python main.py --mode TESTING 2>&1 | grep Filtered
- Test individual sources
Symptom:
WARNING - Worker ArXiv timeout after 600s
Solutions:
- Increase timeout in config.yaml:
retry_settings:
  worker_timeout: 1200  # 20 minutes
- Reduce per_query_limit to avoid connection errors
- Check network connectivity
Symptom: PDFs exist but research_log.xlsx is empty
Cause: Papers marked as synced_to_cloud=1 already
Solution:
# Reset sync status
sqlite3 data/metadata.db
UPDATE papers SET synced_to_cloud = 0;
.exit
# Re-run agent
python main.py --mode DAILY

Symptom:
ERROR - database is locked
Cause: Multiple processes accessing database
Solution:
- Ensure only one agent running
- Close GUI if running CLI
- Restart if hung: pkill -f python (Linux/Mac)
Symptom: Expected papers not downloaded
Solution:
- Check logs for the filtering reason:
  python main.py --mode TESTING 2>&1 | grep "Filtered"
- If a default exclusion is too broad, edit src/filter.py:
  DEFAULT_EXCLUSIONS = [
      # 'sign up',  # Comment out if too aggressive
  ]
- Adjust detection thresholds if needed
Add a new source by creating a searcher:
# src/searchers/custom_searcher.py
import os

from .base import BaseSearcher

class CustomSearcher(BaseSearcher):
    def __init__(self, config):
        super().__init__(config)
        self.source_name = "custom"
        self.download_dir = os.path.join(
            config.get("papers_dir"),
            self.source_name
        )
        os.makedirs(self.download_dir, exist_ok=True)

    def search(self, query, start_date=None, max_results=10, stop_event=None):
        papers = []
        # Implement search logic
        return papers

    def download(self, paper_meta):
        # Implement download logic
        return pdf_path

Register in main.py:
from src.searchers.custom_searcher import CustomSearcher
workers = [
(ArxivSearcher, "ArXiv"),
(CustomSearcher, "Custom Source"), # Add here
# ...
]

Linux/Mac (cron):
# Edit crontab
crontab -e
# Run daily at 6 AM
0 6 * * * cd /path/to/research-agent && /path/to/venv/bin/python main.py --mode DAILY

Windows (Task Scheduler):
- Open Task Scheduler
- Create Basic Task
- Trigger: Daily at 6:00 AM
- Action: Start a program
- Program: C:\path\to\venv\Scripts\python.exe
- Arguments: main.py --mode DAILY
- Start in: C:\path\to\research-agent
Useful SQL queries:
# Open database
sqlite3 data/metadata.db
# Count papers by source
SELECT source, COUNT(*) FROM papers GROUP BY source;
# Recent papers
SELECT title, published_date FROM papers
ORDER BY published_date DESC LIMIT 10;
# Papers from multiple sources
SELECT title, source FROM papers
WHERE source LIKE '%,%';
# Unsynced papers
SELECT COUNT(*) FROM papers WHERE synced_to_cloud = 0;
# Check schema version
SELECT * FROM schema_version;

Create configs for different environments:
# Development
cp config.yaml config.dev.yaml
# Edit config.dev.yaml (smaller limits)
# Production
cp config.yaml config.prod.yaml
# Edit config.prod.yaml (full limits)
# Use specific config
export RESEARCH_AGENT_CONFIG=config.dev.yaml
python main.py --mode TESTING

Speed-optimized settings:

mode_settings:
daily:
per_query_limit: 50 # Larger batches (faster but riskier)
retry_settings:
request_pacing_delay: 0.5 # Faster requests (watch rate limits)

Stability-optimized settings:

mode_settings:
backfill:
per_query_limit: 5 # Smaller batches (slower but stable)
retry_settings:
request_pacing_delay: 2.0 # Slower requests (avoid rate limits)
api_max_retries: 5 # More retries

research-agent/
├── README.md # This file
├── requirements.txt # Python dependencies
├── config.yaml # Configuration
├── prompts/
│ └── prompt.txt # Search query
│
├── main.py # CLI entry point
├── gui.py # GUI entry point
│
├── src/
│ ├── supervisor.py # Worker orchestration
│ ├── worker.py # Worker process logic
│ ├── filter.py # Query parsing & filtering
│ ├── storage.py # Database operations
│ ├── export.py # Excel generation
│ ├── utils.py # Utilities
│ │
│ └── searchers/
│ ├── base.py # Base searcher class
│ ├── arxiv_searcher.py # ArXiv integration
│ ├── semantic_searcher.py # Semantic Scholar API
│ ├── lesswrong_searcher.py # LessWrong/AF scraper
│ └── lab_scraper.py # AI lab scrapers
│
├── test_*.py # Test files
│
├── data/
│ ├── metadata.db # SQLite database
│ └── papers/ # Downloaded PDFs
│ ├── arxiv/
│ ├── semantic/
│ ├── lesswrong/
│ └── labs/
│
└── docs/
├── MODE_SETTINGS_GUIDE.md # Mode configuration
├── CONTENT_FILTERING_GUIDE.md # Filtering details
├── LOW_PRIORITY_RECOMMENDATIONS.md # Future enhancements
└── TEST_RESULTS.md # Test documentation
arxiv==2.1.0 # ArXiv API client
requests==2.31.0 # HTTP library
pyyaml==6.0.1 # Config parsing
openpyxl==3.1.2 # Excel generation
beautifulsoup4==4.12.2 # HTML parsing
playwright==1.40.0 # Browser automation
semanticscholar==0.8.0 # Semantic Scholar API
pytest==7.4.3 # Testing framework
pytest-qt==4.2.0 # GUI testing (optional)
memory-profiler==0.61.0 # Performance profiling (optional)
Install all:
pip install -r requirements.txt

This project uses semantic versioning: MAJOR.MINOR.PATCH
Current Version: 1.0.0
- 1.0.0 (2024-01-11):
- Initial release
- Multi-source search (ArXiv, Semantic Scholar, LessWrong, AI Labs)
- Intelligent content filtering
- Self-healing architecture
- Mode-specific parameters
- Database migration versioning
- Comprehensive test suite
Features:
- Multi-source parallel search (ArXiv, Semantic Scholar, LessWrong, AI Labs)
- Intelligent content filtering (job postings, link aggregators, marketing)
- Self-healing error recovery with rollback
- Mode-specific parameters (TESTING, DAILY, BACKFILL)
- Database migration versioning system
- URL normalization for deduplication
- Configurable retry/timeout settings
- Comprehensive test suite (42+ tests)
- CLI and GUI interfaces
- Excel export with metadata
- Cross-source deduplication
Fixed:
- GUI worker management crash
- Stop event responsiveness across searchers
- ArXiv hardcoded query issue
- Self-healing file cleanup
- Worker timeout detection
Q: How do I add a new search term?
A: Edit prompts/prompt.txt and re-run the agent.
Q: Can I search for papers before 2023?
A: Yes, edit start_date in main.py or gui.py to an earlier date.
Q: How do I exclude more terms?
A: Add them to the ANDNOT section of your prompt.
Q: Where are PDFs saved?
A: data/papers/{source}/ where source is arxiv, semantic, lesswrong, or labs.
Q: Can I disable a source?
A: Yes, comment it out in the workers list in main.py or gui.py.
Q: How do I reset the database?
A: Delete data/metadata.db (backups recommended).
Q: Does it work offline?
A: No, it requires internet access for searching and downloading.
Q: What's the API rate limit?
A: Varies by source. Built-in rate limiting handles this automatically.
Q: Can I run multiple instances?
A: No, database locking prevents this. Run one at a time.
Q: How do I update to a new version?
A: git pull and run python test_migrations.py to verify database compatibility.
- Documentation: check the guides in the docs/ folder
- Tests: run the relevant test file to diagnose issues
- Logs: check research_agent.log for detailed errors
- Issues: report bugs on GitHub Issues
Enable detailed logging:
# In src/utils.py
logging.basicConfig(
level=logging.DEBUG, # Change from INFO
# ...
)

- Fork the repository
- Create feature branch: git checkout -b feature-name
- Make changes
- Run tests: python test_*.py
- Commit: git commit -am 'Add feature'
- Push: git push origin feature-name
- Submit pull request
- PEP 8 compliance
- Type hints where appropriate
- Docstrings for all public methods
- Unit tests for new features
- New searcher implementations
- Additional export formats (CSV, JSON, BibTeX)
- Performance optimizations
- Additional test coverage
- Documentation improvements
MIT License - see LICENSE file for details.
- ArXiv for open access to research papers
- Semantic Scholar for comprehensive academic search
- LessWrong and Alignment Forum for AI safety content
- AI Research Labs (Anthropic, OpenAI, DeepMind, etc.) for publishing research
Project Repository: https://github.com/yourusername/research-agent
Issues: https://github.com/yourusername/research-agent/issues
Documentation: See docs/ folder
# Test run
python main.py --mode TESTING
# Daily update
python main.py --mode DAILY
# Full backfill
python main.py --mode BACKFILL
# Launch GUI
python gui.py
# Run all tests
python test_*.py

("term1" OR "term2") AND ("term3") ANDNOT ("exclude")
- config.yaml - Main configuration
- prompts/prompt.txt - Search query
- data/metadata.db - Database
- research_log.xlsx - Output
python test_mode_settings.py # Mode configuration
python test_content_filtering.py # Filtering logic
python test_integration.py # End-to-end tests

Built with ❤️ for the AI safety research community