A powerful research automation system that combines web crawling and content generation to provide comprehensive research results.
This project aims to revolutionize the research process by:
- Automating Research: Eliminate manual searching and reading through multiple sources
- Comprehensive Coverage: Gather information from diverse sources (academic, news, blogs)
- Time Efficiency: Reduce research time from hours to minutes
- Quality Output: Generate well-structured, fact-based research summaries
- Accessibility: Make research accessible to non-experts while maintaining academic rigor
The system uses a dual-agent architecture:
- URL Discovery Agent:
  - Searches multiple sources simultaneously
  - Filters results based on relevance and date
  - Deduplicates and ranks URLs
  - Sources include:
    - Academic: arXiv, Semantic Scholar, Google Scholar
    - News: NewsAPI, Tech blogs
    - Blogs: Medium, GitHub
    - General: DuckDuckGo
- Content Generation Agent:
  - Crawls discovered URLs
  - Extracts and processes content
  - Generates comprehensive summaries
  - Formats the output as PDF, Markdown, or JSON
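A minimal sketch of how the two agents chain together. The classes below are illustrative stand-ins only, not the project's real interfaces (the actual entry points, `ResearchAgent` and `DraftingAgent`, appear in the Usage section):

```python
# Illustrative sketch of the dual-agent flow; these classes are hypothetical
# stand-ins, not the project's real agents.
import asyncio
from typing import List


class UrlDiscoveryAgent:
    """Hypothetical stand-in: searches sources, then filters, dedupes, and ranks URLs."""

    async def discover(self, query: str) -> List[str]:
        # A real implementation would query arXiv, NewsAPI, DuckDuckGo, etc.
        return ["https://example.org/paper", "https://example.org/blog-post"]


class ContentGenerationAgent:
    """Hypothetical stand-in: crawls URLs, extracts content, and summarizes it."""

    async def generate(self, query: str, urls: List[str]) -> str:
        # A real implementation would crawl each URL and synthesize a summary.
        return f"Summary for {query!r} based on {len(urls)} sources."


async def run_research(query: str) -> str:
    urls = await UrlDiscoveryAgent().discover(query)
    return await ContentGenerationAgent().generate(query, urls)


print(asyncio.run(run_research("latest advancements in quantum computing")))
```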
- ✅ Multi-source URL discovery
- ✅ Asynchronous content processing
- ✅ Basic content generation
- ✅ API endpoints
- ✅ Database integration
- ✅ Rate limiting
- ✅ Docker support
- 🔄 Enhanced content analysis
- 🔄 Improved relevance ranking
- 🔄 Better error handling
- 🔄 Frontend development
- Limited to English content
- Basic content summarization
- No user authentication
- No persistent storage of results
- Automated Research: Combines web crawling and content generation to provide comprehensive research results
- URL Discovery: Automatically finds relevant URLs from multiple sources:
- Academic: arXiv, Semantic Scholar, Google Scholar
- News: NewsAPI, Tech blogs (TechCrunch, The Verge, Wired, etc.)
- Blogs: Medium, GitHub repositories and discussions
- General: DuckDuckGo
- Advanced Search Options:
- Date range filtering
- Source filtering (academic papers, news articles, blogs, GitHub)
- Language filtering
- Customizable crawl depth and page limits
- Export Options:
- PDF export
- Markdown export
- JSON export
- Rate Limiting: Prevents abuse and ensures fair usage
- Asynchronous Processing: Handles long-running research tasks efficiently
- Web Interface: User-friendly React frontend
- Docker Support: Easy deployment with Docker and Docker Compose
- Database Integration: PostgreSQL for task management and result storage
- Redis Caching: Efficient caching of research results
- Multimodal Research: Process both text and images from web content
- Intelligent Crawling: Smart web crawling with depth control and domain limits
- Vector Search: Efficient similarity search using FAISS
- Free Alternatives: Uses open-source models instead of paid APIs
- Flexible Deployment: Support for both HuggingFace API and local models
- Start the Server:

  ```bash
  set PORT=8081 && python app.py
  ```
- Submit a Research Task:

  ```bash
  curl -X 'POST' \
    'http://localhost:8081/api/tasks' \
    -H 'Content-Type: application/json' \
    -d '{
      "query": "What are the latest advancements in quantum computing?",
      "max_depth": 2,
      "max_pages_per_domain": 20,
      "date_range": {
        "start": "2024-01-01",
        "end": "2024-06-03"
      },
      "sources": ["academic", "news", "blogs"],
      "languages": ["en"],
      "output_format": "article"
    }'
  ```
- Check Task Status:

  ```bash
  curl http://localhost:8081/api/tasks/{task_id}
  ```
- Custom Search Parameters:
  - Adjust `max_depth` for deeper research
  - Modify `max_pages_per_domain` for broader coverage
  - Use specific `sources` for targeted research
- Output Formats:
  - `article`: Comprehensive research summary
  - `bullet_points`: Key findings and insights
  - `academic`: Formal research paper format
- API Integration:
  - Use the API in your applications (see the Python client sketch below)
  - Integrate with existing research workflows
  - Build custom frontends
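The same flow from Python, using `requests`. This is a sketch: the response field names (`task_id`, `status`) are assumptions about the API's JSON schema, not documented guarantees.

```python
# Sketch of a Python client for the task API; the response field names
# ("task_id", "status") are assumed, not taken from a documented schema.
import time

import requests

BASE_URL = "http://localhost:8081"

payload = {
    "query": "What are the latest advancements in quantum computing?",
    "max_depth": 2,
    "max_pages_per_domain": 20,
    "sources": ["academic", "news", "blogs"],
    "languages": ["en"],
    "output_format": "article",
}

task = requests.post(f"{BASE_URL}/api/tasks", json=payload, timeout=30).json()
task_id = task["task_id"]  # assumed field name

# Research tasks run asynchronously, so poll until the task finishes.
while True:
    status = requests.get(f"{BASE_URL}/api/tasks/{task_id}", timeout=30).json()
    if status.get("status") in ("completed", "failed"):  # assumed status values
        break
    time.sleep(5)

print(status)
```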
- User Management:
- User authentication and authorization
- User preferences and settings
- Saved searches and research history
- Advanced Content Generation:
- Support for different output formats (e.g., academic papers, blog posts)
- Automatic citation generation
- Fact-checking capabilities
- Enhanced Crawling:
- PDF parsing support
- Academic paper parsing
- Social media content integration
- Monitoring and Analytics:
- Prometheus metrics
- Grafana dashboards
- Usage statistics and insights
- API Key Authentication: Secure API access with key-based authentication
- Advanced Search Features:
- Semantic search capabilities
- Topic clustering
- Related content suggestions
- Python 3.8+
- Node.js 14+
- PostgreSQL 15.x
- Redis (optional, for caching)
- Docker and Docker Compose (optional, for containerized deployment)
- Download and Install PostgreSQL 15.x
- Download from: https://www.enterprisedb.com/downloads/postgres-postgresql-downloads
- Choose PostgreSQL 15.x (latest version)
- During installation:
- Use port: 5433 (or your preferred port)
- Set the password to: postgres
- Keep the default username: postgres
- Install all components (including pgAdmin)
- Create Database
Option 1: Using pgAdmin 4 (Recommended)
- Open pgAdmin 4 from Windows Start menu
- Enter your master password when prompted
- In the left sidebar, expand "Servers"
- Connect to "PostgreSQL 15" (enter your PostgreSQL password)
- Right-click on "Databases"
- Select "Create" > "Database"
- Enter:
  - Database: `research_ai`
  - Owner: `postgres`
- Click "Save"
Option 2: Using Command Line
```bash
# If psql is in your PATH:
psql -U postgres -p 5433

# Or use the full path (adjust based on your installation):
"C:\Program Files\PostgreSQL\15\bin\psql.exe" -U postgres -p 5433

# Then create the database:
CREATE DATABASE research_ai;

# Verify creation
\l
```
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/deep-research-ai-agent.git
  cd deep-research-ai-agent
  ```

- Set up Python virtual environment:

  ```bash
  # Windows
  python -m venv venv
  .\venv\Scripts\activate

  # Linux/Mac
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure environment variables: Create a `.env` file in the project root with the following content:

  ```env
  # Database configuration
  DATABASE_URL=postgresql+asyncpg://postgres:postgres@localhost:5433/research_ai
  DB_POOL_SIZE=5
  DB_MAX_OVERFLOW=10
  SQL_ECHO=false

  # API configuration
  HOST=0.0.0.0
  PORT=8081

  # Logging
  LOG_LEVEL=INFO

  # Rate limiting
  RATE_LIMIT_PER_MINUTE=10

  # Optional API Keys (for enhanced search capabilities)
  NEWS_API_KEY=your_news_api_key
  GOOGLE_SCHOLAR_API_KEY=your_serpapi_key
  ```
- Initialize the database:

  ```bash
  # Run database migrations
  alembic upgrade head
  ```
- Start the backend server:

  ```bash
  # Windows
  set PORT=8081 && python app.py

  # Linux/Mac
  PORT=8081 python app.py
  ```
The API will be available at http://localhost:8081
- Test the API:
- Open your browser and navigate to http://localhost:8081/docs
- You should see the Swagger UI documentation
- Test the health endpoint: http://localhost:8081/health
- Create a new research task using the POST /api/tasks endpoint
- `GET /`: Root endpoint for API health check
- `GET /health`: Health check endpoint
- `GET /docs`: Swagger UI documentation
- `POST /api/tasks`: Create a new research task

  ```json
  {
    "query": "What are the latest advancements in quantum computing?",
    "max_depth": 2,
    "max_pages_per_domain": 20,
    "date_range": {
      "start": "2024-01-01",
      "end": "2024-06-03"
    },
    "sources": ["academic", "news", "blogs"],
    "languages": ["en"],
    "output_format": "article"
  }
  ```

- `GET /api/tasks/{task_id}`: Get task status and results
- `GET /api/tasks`: List all tasks
The API uses a global exception handler that provides detailed error messages:
```json
{
  "detail": "Error message",
  "type": "ErrorType",
  "message": "An unexpected error occurred"
}
```

The API implements rate limiting to prevent abuse:
- 10 requests per minute for task creation
- 60 requests per minute for task status checks
- 30 requests per minute for task listing
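For reference, a minimal sketch of a FastAPI global exception handler that produces the error format shown above. This is an assumption about how such a handler could be wired up, not the project's actual code:

```python
# Sketch only: a global exception handler returning the documented error shape.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception) -> JSONResponse:
    # Any unhandled exception is converted into a consistent JSON error payload.
    return JSONResponse(
        status_code=500,
        content={
            "detail": str(exc),
            "type": type(exc).__name__,
            "message": "An unexpected error occurred",
        },
    )
```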
`main.py`: This is a CLI script for quick testing and debugging of the research pipeline. It is not the main entry point for the deployed application (which uses `app.py`). You can run it directly to test the research and content generation agents without starting the API server.
- Database Connection Issues
  - Verify PostgreSQL is running: `pg_isready -p 5433` (use your port number)
  - Check database credentials in `.env`
  - Ensure the database exists: `psql -U postgres -p 5433 -l` (use your port number)
  - Verify the port number matches your PostgreSQL installation (the default is 5432, but it might be 5433)
- Port Conflicts
  - Check if ports 8081 (API) and 3000 (Frontend) are available
  - Modify the ports in `.env` if needed
- Missing Dependencies
  - Ensure all Python packages are installed: `pip install -r requirements.txt`
  - Check Node.js dependencies: `npm install`
- Search Source Issues
  - If using NewsAPI, ensure your API key is valid
  - If using Google Scholar, ensure your SerpAPI key is valid
  - Check rate limits for each service
- Backend logs: Check console output or `logs/app.log`
- Frontend logs: Check browser console
- Database logs: Check PostgreSQL logs
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for the GPT models
- FastAPI for the backend framework
- React for the frontend framework
- PostgreSQL and Redis for data storage
- Research Agent
  - Discovers and crawls web content
  - Processes text and images
  - Generates embeddings for both modalities
  - Stores content in vector database

- Drafting Agent
  - Retrieves relevant content
  - Generates answers and summaries
  - Handles multimodal context
  - Provides source attribution

- URL Discovery Agent
  - Finds relevant URLs for research
  - Supports multiple search sources
  - Implements relevance ranking
- Content Discovery
  - User query triggers URL discovery
  - System crawls relevant web pages
  - Extracts text and images

- Content Processing
  - Text cleaning and chunking
  - Image extraction and storage
  - Embedding generation for both modalities

- Vector Storage
  - FAISS for efficient similarity search
  - Stores text and image embeddings
  - Maintains metadata and relationships

- Retrieval & Generation
  - Semantic search for relevant content
  - Context formatting with multimodal support
  - Answer generation using LLM
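A condensed sketch of the text half of this flow, using the `sentence-transformers` model and FAISS index named below. The chunking strategy and index type here are simplified assumptions, not the project's exact settings:

```python
# Simplified sketch: chunk text, embed it, index with FAISS, run a semantic search.
# The fixed-size chunking and flat index are illustrative choices only.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

document = "Quantum error correction protects qubits from decoherence. " * 20
chunks = [document[i:i + 200] for i in range(0, len(document), 200)]  # naive fixed-size chunks

# Embed the chunks; with L2-normalized vectors, inner product behaves like cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

# Semantic search: retrieve the chunks most relevant to the query.
query_vec = model.encode(["How are qubits protected from noise?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), k=3)
print([chunks[i] for i in ids[0]])
```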
- Text Embeddings
  - Model: `sentence-transformers/all-MiniLM-L6-v2`
  - Purpose: Generate text embeddings for semantic search
  - Usage: Content chunking and query processing
- Image Embeddings
  - Model: `openai/clip-vit-base-patch32`
  - Purpose: Generate image embeddings
  - Usage: Image similarity search and multimodal context
- HuggingFace API (Default)
  - Model: `mistralai/Mistral-7B-Instruct-v0.2`
  - Setup:

    ```bash
    # Set HuggingFace token
    export HF_TOKEN="your_token_here"
    ```

  - Benefits:
    - High-quality responses
    - Managed infrastructure
    - Regular updates
- Local Model
  - Setup:

    ```bash
    # Set local model endpoint
    export LOCAL_LLM_ENDPOINT="http://localhost:8000/v1"
    ```

  - Benefits:
    - No API costs
    - Full control
    - No rate limits
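A hedged sketch of how these two backends could be called. The model name and environment variables are the ones listed above, but the client libraries used here (`huggingface_hub`, `openai`) are assumptions about the implementation, not confirmed project dependencies:

```python
# Sketch only: choosing between the hosted HuggingFace model and a local
# OpenAI-compatible endpoint. The client libraries here are assumptions.
import os


def generate(prompt: str) -> str:
    local_endpoint = os.getenv("LOCAL_LLM_ENDPOINT")
    if local_endpoint:
        # Local model served behind an OpenAI-compatible API.
        from openai import OpenAI

        client = OpenAI(base_url=local_endpoint, api_key="not-needed")
        response = client.chat.completions.create(
            model="local-model",  # placeholder; depends on your local server
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Default: HuggingFace Inference API with the hosted Mistral model.
    from huggingface_hub import InferenceClient

    client = InferenceClient(
        model="mistralai/Mistral-7B-Instruct-v0.2", token=os.environ["HF_TOKEN"]
    )
    return client.text_generation(prompt, max_new_tokens=512)


print(generate("Summarize the latest advancements in quantum computing."))
```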
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/deep-research-ai-agent.git
  cd deep-research-ai-agent
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables:

  ```bash
  # Create .env file
  cp .env.example .env

  # Edit .env with your configuration
  # HF_TOKEN=your_token_here
  # LOCAL_LLM_ENDPOINT=http://localhost:8000/v1
  ```
- Initialize the database:

  ```bash
  alembic upgrade head
  ```
```python
import asyncio

from agents.research_agent import ResearchAgent
from agents.drafting_agent import DraftingAgent


async def main():
    # Initialize agents
    research_agent = ResearchAgent()
    drafting_agent = DraftingAgent()

    # Perform research
    results = await research_agent.discover_and_research("Your research query")

    # Generate answer
    answer = drafting_agent.answer_question("Your question")
    print(answer)


asyncio.run(main())
```

The system automatically handles multimodal content:
- Extracts images from web pages
- Generates embeddings for both text and images
- Combines modalities for better context
- Returns relevant images with answers
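A small sketch of how text and image embeddings can be produced in CLIP's shared space, using the image model listed in the Models section. This mirrors the idea rather than the project's exact code:

```python
# Sketch: embed an image and a text query with CLIP so both modalities share
# one vector space. Not the project's exact code; the file path is a placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("downloaded_page_image.png")  # e.g. an image saved while crawling

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

    text_inputs = processor(text=["a diagram of a quantum circuit"], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity between the image and the text query.
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(similarity.item())
```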
- Research Parameters

  ```python
  research_agent = ResearchAgent(
      max_depth=2,               # Crawling depth
      max_pages_per_domain=20,   # Pages per domain limit
      vector_store_path="./data/embeddings",
      raw_data_path="./data/raw",
  )
  ```
- Drafting Parameters

  ```python
  drafting_agent = DraftingAgent(
      vector_store_path="./data/embeddings",
      model_type="huggingface",  # or "local"
  )
  ```
- Location: `./data/embeddings/`
  - Format: FAISS index
  - Content: Text and image embeddings
- Location: `./data/raw/`
  - Content:
    - Text chunks
    - Downloaded images
    - Metadata
- Models:
- User
- Task
- ContentChunk
- Purpose: Track research tasks and results
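A rough sketch of what these models could look like with SQLAlchemy's declarative API. The column names below are illustrative guesses, not the project's actual schema:

```python
# Illustrative only: possible shapes for the Task and ContentChunk models.
# Column names are assumptions, not the project's actual schema.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text, func
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Task(Base):
    __tablename__ = "tasks"

    id = Column(Integer, primary_key=True)
    query = Column(Text, nullable=False)             # the research question
    status = Column(String(32), default="pending")   # pending / running / completed / failed
    created_at = Column(DateTime, server_default=func.now())

    chunks = relationship("ContentChunk", back_populates="task")


class ContentChunk(Base):
    __tablename__ = "content_chunks"

    id = Column(Integer, primary_key=True)
    task_id = Column(Integer, ForeignKey("tasks.id"), nullable=False)
    source_url = Column(Text)   # where the chunk was crawled from
    content = Column(Text)      # extracted text

    task = relationship("Task", back_populates="chunks")
```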