Dual-Agent Deep Research AI

A powerful research automation system that combines web crawling and content generation to provide comprehensive research results.

Project Aims

This project aims to revolutionize the research process by:

  1. Automating Research: Eliminate manual searching and reading through multiple sources
  2. Comprehensive Coverage: Gather information from diverse sources (academic, news, blogs)
  3. Time Efficiency: Reduce research time from hours to minutes
  4. Quality Output: Generate well-structured, fact-based research summaries
  5. Accessibility: Make research accessible to non-experts while maintaining academic rigor

How It Works

The system uses a dual-agent architecture:

  1. URL Discovery Agent:

    • Searches multiple sources simultaneously
    • Filters results based on relevance and date
    • Deduplicates and ranks URLs
    • Sources include:
      • Academic: arXiv, Semantic Scholar, Google Scholar
      • News: NewsAPI, Tech blogs
      • Blogs: Medium, GitHub
      • General: DuckDuckGo
  2. Content Generation Agent:

    • Crawls discovered URLs
    • Extracts and processes content
    • Generates comprehensive summaries
    • Formats output in various formats (PDF, Markdown, JSON)

Current Status

Working Features

  • ✅ Multi-source URL discovery
  • ✅ Asynchronous content processing
  • ✅ Basic content generation
  • ✅ API endpoints
  • ✅ Database integration
  • ✅ Rate limiting
  • ✅ Docker support

In Progress

  • 🔄 Enhanced content analysis
  • 🔄 Improved relevance ranking
  • 🔄 Better error handling
  • 🔄 Frontend development

Known Limitations

  • Limited to English content
  • Basic content summarization
  • No user authentication
  • No persistent storage of results

Features

  • Automated Research: Combines web crawling and content generation to provide comprehensive research results
  • URL Discovery: Automatically finds relevant URLs from multiple sources:
    • Academic: arXiv, Semantic Scholar, Google Scholar
    • News: NewsAPI, Tech blogs (TechCrunch, The Verge, Wired, etc.)
    • Blogs: Medium, GitHub repositories and discussions
    • General: DuckDuckGo
  • Advanced Search Options:
    • Date range filtering
    • Source filtering (academic papers, news articles, blogs, GitHub)
    • Language filtering
    • Customizable crawl depth and page limits
  • Export Options:
    • PDF export
    • Markdown export
    • JSON export
  • Rate Limiting: Prevents abuse and ensures fair usage
  • Asynchronous Processing: Handles long-running research tasks efficiently
  • Web Interface: User-friendly React frontend
  • Docker Support: Easy deployment with Docker and Docker Compose
  • Database Integration: PostgreSQL for task management and result storage
  • Redis Caching: Efficient caching of research results
  • Multimodal Research: Process both text and images from web content
  • Intelligent Crawling: Smart web crawling with depth control and domain limits
  • Vector Search: Efficient similarity search using FAISS
  • Free Alternatives: Uses open-source models instead of paid APIs
  • Flexible Deployment: Support for both HuggingFace API and local models

Usage Guide

Basic Usage

  1. Start the Server:

    # Windows
    set PORT=8081 && python app.py

    # Linux/Mac
    PORT=8081 python app.py
  2. Submit a Research Task:

    curl -X 'POST' \
      'http://localhost:8081/api/tasks' \
      -H 'Content-Type: application/json' \
      -d '{
        "query": "What are the latest advancements in quantum computing?",
        "max_depth": 2,
        "max_pages_per_domain": 20,
        "date_range": {
          "start": "2024-01-01",
          "end": "2024-06-03"
        },
        "sources": ["academic", "news", "blogs"],
        "languages": ["en"],
        "output_format": "article"
      }'
  3. Check Task Status:

    curl http://localhost:8081/api/tasks/{task_id}

Advanced Usage

  1. Custom Search Parameters:

    • Adjust max_depth for deeper research
    • Modify max_pages_per_domain for broader coverage
    • Use specific sources for targeted research
  2. Output Formats:

    • article: Comprehensive research summary
    • bullet_points: Key findings and insights
    • academic: Formal research paper format
  3. API Integration:

    • Use the API in your applications
    • Integrate with existing research workflows
    • Build custom frontends
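
For programmatic integration, here is a minimal Python sketch using the requests package against the task endpoints described under "API Endpoints" below. The response field names "id" and "status" are illustrative assumptions; check /docs for the actual schema.

import time
import requests

BASE_URL = "http://localhost:8081"

# Submit a research task
payload = {
    "query": "What are the latest advancements in quantum computing?",
    "max_depth": 2,
    "max_pages_per_domain": 20,
    "sources": ["academic", "news", "blogs"],
    "languages": ["en"],
    "output_format": "article",
}
response = requests.post(f"{BASE_URL}/api/tasks", json=payload)
response.raise_for_status()
task = response.json()

# Poll the status endpoint until the task finishes
# ("id" and "status" are illustrative field names; see /docs for the real schema)
task_id = task["id"]
while True:
    status = requests.get(f"{BASE_URL}/api/tasks/{task_id}").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(5)
print(status)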

Future Enhancements

  • User Management:
    • User authentication and authorization
    • User preferences and settings
    • Saved searches and research history
  • Advanced Content Generation:
    • Support for different output formats (e.g., academic papers, blog posts)
    • Automatic citation generation
    • Fact-checking capabilities
  • Enhanced Crawling:
    • PDF parsing support
    • Academic paper parsing
    • Social media content integration
  • Monitoring and Analytics:
    • Prometheus metrics
    • Grafana dashboards
    • Usage statistics and insights
  • API Key Authentication: Secure API access with key-based authentication
  • Advanced Search Features:
    • Semantic search capabilities
    • Topic clustering
    • Related content suggestions

Prerequisites

  • Python 3.8+
  • Node.js 14+
  • PostgreSQL 15.x
  • Redis (optional, for caching)
  • Docker and Docker Compose (optional, for containerized deployment)

Detailed Installation Guide

1. PostgreSQL Setup

  1. Download and Install PostgreSQL 15.x

  2. Create Database

    Option 1: Using pgAdmin 4 (Recommended)

    1. Open pgAdmin 4 from Windows Start menu
    2. Enter your master password when prompted
    3. In the left sidebar, expand "Servers"
    4. Connect to "PostgreSQL 15" (enter your PostgreSQL password)
    5. Right-click on "Databases"
    6. Select "Create" > "Database"
    7. Enter:
      • Database: research_ai
      • Owner: postgres
    8. Click "Save"

    Option 2: Using Command Line

    # If psql is in your PATH:
    psql -U postgres -p 5433
    
    # Or use the full path (adjust based on your installation):
    "C:\Program Files\PostgreSQL\15\bin\psql.exe" -U postgres -p 5433
    
    # At the psql prompt, create the database:
    CREATE DATABASE research_ai;
    
    # Verify it was created:
    \l

2. Project Setup

  1. Clone the repository:

    git clone https://github.com/yourusername/deep-research-ai-agent.git
    cd deep-research-ai-agent
  2. Set up Python virtual environment:

    # Windows
    python -m venv venv
    .\venv\Scripts\activate
    
    # Linux/Mac
    python3 -m venv venv
    source venv/bin/activate
  3. Install Python dependencies:

    pip install -r requirements.txt
  4. Configure environment variables: Create a .env file in the project root with the following content:

    # Database configuration
    DATABASE_URL=postgresql+asyncpg://postgres:postgres@localhost:5433/research_ai
    DB_POOL_SIZE=5
    DB_MAX_OVERFLOW=10
    SQL_ECHO=false
    
    # API configuration
    HOST=0.0.0.0
    PORT=8081
    
    # Logging
    LOG_LEVEL=INFO
    
    # Rate limiting
    RATE_LIMIT_PER_MINUTE=10
    
    # Optional API Keys (for enhanced search capabilities)
    NEWS_API_KEY=your_news_api_key
    GOOGLE_SCHOLAR_API_KEY=your_serpapi_key
  5. Initialize the database:

    # Run database migrations
    alembic upgrade head

3. Running the Application

  1. Start the backend server:

    # Windows
    set PORT=8081 && python app.py
    
    # Linux/Mac
    PORT=8081 python app.py

    The API will be available at http://localhost:8081

  2. Test the API:
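
    # For example, hit the health check endpoint (documented under "API Endpoints" below):
    curl http://localhost:8081/health

    # Or open the interactive Swagger UI in a browser:
    # http://localhost:8081/docs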

API Endpoints

Core Endpoints

  • GET /: Root endpoint for API health check
  • GET /health: Health check endpoint
  • GET /docs: Swagger UI documentation

Task Management

  • POST /api/tasks: Create a new research task
    {
      "query": "What are the latest advancements in quantum computing?",
      "max_depth": 2,
      "max_pages_per_domain": 20,
      "date_range": {
        "start": "2024-01-01",
        "end": "2024-06-03"
      },
      "sources": ["academic", "news", "blogs"],
      "languages": ["en"],
      "output_format": "article"
    }
  • GET /api/tasks/{task_id}: Get task status and results
  • GET /api/tasks: List all tasks

Error Handling

The API uses a global exception handler that provides detailed error messages:

{
  "detail": "Error message",
  "type": "ErrorType",
  "message": "An unexpected error occurred"
}

Rate Limiting

The API implements rate limiting to prevent abuse:

  • 10 requests per minute for task creation
  • 60 requests per minute for task status checks
  • 30 requests per minute for task listing
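
Clients can cooperate with these limits by backing off and retrying. A minimal Python sketch, assuming the API responds with HTTP 429 (Too Many Requests) when a limit is exceeded:

import time
import requests

def post_with_backoff(url, payload, retries=5):
    """POST to the API, waiting and retrying if the rate limit is hit."""
    for attempt in range(retries):
        response = requests.post(url, json=payload)
        if response.status_code != 429:  # 429 = Too Many Requests (assumed status code)
            return response
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    return response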

Development and Testing

  • main.py: This is a CLI script for quick testing and debugging of the research pipeline. It is not the main entry point for the deployed application (which uses app.py). You can run it directly to test the research and content generation agents without starting the API server.

Troubleshooting

Common Issues

  1. Database Connection Issues

    • Verify PostgreSQL is running: pg_isready -p 5433 (use your port number)
    • Check database credentials in .env
    • Ensure database exists: psql -U postgres -p 5433 -l (use your port number)
    • Verify port number matches your PostgreSQL installation (default is 5432, but might be 5433)
  2. Port Conflicts

    • Check if ports 8081 (API) and 3000 (Frontend) are available
    • Modify ports in .env if needed
  3. Missing Dependencies

    • Ensure all Python packages are installed: pip install -r requirements.txt
    • Check Node.js dependencies: npm install
  4. Search Source Issues

    • If using NewsAPI, ensure your API key is valid
    • If using Google Scholar, ensure your SerpAPI key is valid
    • Check rate limits for each service

Logs

  • Backend logs: Check console output or logs/app.log
  • Frontend logs: Check browser console
  • Database logs: Check PostgreSQL logs

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OpenAI for the CLIP model
  • FastAPI for the backend framework
  • React for the frontend framework
  • PostgreSQL and Redis for data storage

System Architecture

Core Components

  1. Research Agent

    • Discovers and crawls web content
    • Processes text and images
    • Generates embeddings for both modalities
    • Stores content in vector database
  2. Drafting Agent

    • Retrieves relevant content
    • Generates answers and summaries
    • Handles multimodal context
    • Provides source attribution
  3. URL Discovery Agent

    • Finds relevant URLs for research
    • Supports multiple search sources
    • Implements relevance ranking

RAG Process Flow

  1. Content Discovery

    • User query triggers URL discovery
    • System crawls relevant web pages
    • Extracts text and images
  2. Content Processing

    • Text cleaning and chunking
    • Image extraction and storage
    • Embedding generation for both modalities
  3. Vector Storage

    • FAISS for efficient similarity search
    • Stores text and image embeddings
    • Maintains metadata and relationships
  4. Retrieval & Generation

    • Semantic search for relevant content
    • Context formatting with multimodal support
    • Answer generation using LLM
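
Steps 3 and 4 can be illustrated with a minimal sketch of FAISS-based retrieval, using the text embedding model listed under "Model Configuration" below. This shows the technique, not the project's own code; the example chunks and query are illustrative.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed a few example chunks (the real system stores crawled content here)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = ["Quantum error correction overview", "Superconducting qubit advances"]
embeddings = model.encode(chunks, normalize_embeddings=True)

# Build a FAISS index over the chunk embeddings
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product ~ cosine on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))

# Retrieve the chunks most similar to a query
query = model.encode(["latest advancements in quantum computing"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
print([chunks[i] for i in ids[0]])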

Model Configuration

Embedding Models

  1. Text Embeddings

    • Model: sentence-transformers/all-MiniLM-L6-v2
    • Purpose: Generate text embeddings for semantic search
    • Usage: Content chunking and query processing
  2. Image Embeddings

    • Model: openai/clip-vit-base-patch32
    • Purpose: Generate image embeddings
    • Usage: Image similarity search and multimodal context
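
For the image side, here is a minimal sketch of generating a CLIP embedding with the Hugging Face transformers library. The file path is an illustrative assumption (the crawler stores downloaded images under ./data/raw/), and this is not the project's own code.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Path is illustrative; downloaded images live under ./data/raw/
image = Image.open("./data/raw/example_image.png")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_embedding = model.get_image_features(**inputs)  # tensor of shape (1, 512)
print(image_embedding.shape)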

Generation Model

  1. HuggingFace API (Default)

    • Model: mistralai/Mistral-7B-Instruct-v0.2 (see the example call after this list)
    • Setup:
      # Set HuggingFace token
      export HF_TOKEN="your_token_here"
    • Benefits:
      • High-quality responses
      • Managed infrastructure
      • Regular updates
  2. Local Model

    • Setup:
      # Set local model endpoint
      export LOCAL_LLM_ENDPOINT="http://localhost:8000/v1"
    • Benefits:
      • No API costs
      • Full control
      • No rate limits
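
A minimal sketch of calling the default HuggingFace-hosted model with the huggingface_hub client, assuming HF_TOKEN is set as above; the project's actual client code may differ, and the prompt is illustrative.

import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    token=os.environ["HF_TOKEN"],
)

# Generate a short answer (in the full pipeline, retrieved context is included in the prompt)
prompt = "[INST] Summarize the key advances in quantum error correction. [/INST]"
answer = client.text_generation(prompt, max_new_tokens=256)
print(answer)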

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/deep-research-ai-agent.git
    cd deep-research-ai-agent
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set up environment variables:

    # Create .env file
    cp .env.example .env
    
    # Edit .env with your configuration
    # HF_TOKEN=your_token_here
    # LOCAL_LLM_ENDPOINT=http://localhost:8000/v1
  4. Initialize the database:

    alembic upgrade head

Usage

Basic Research

import asyncio

from agents.research_agent import ResearchAgent
from agents.drafting_agent import DraftingAgent

async def main():
    # Initialize agents
    research_agent = ResearchAgent()
    drafting_agent = DraftingAgent()

    # Crawl, process, and index content for the query (asynchronous)
    results = await research_agent.discover_and_research("Your research query")

    # Generate an answer from the indexed content
    answer = drafting_agent.answer_question("Your question")
    print(answer)

asyncio.run(main())

Multimodal Research

The system automatically handles multimodal content:

  • Extracts images from web pages
  • Generates embeddings for both text and images
  • Combines modalities for better context
  • Returns relevant images with answers

Configuration Options

  1. Research Parameters

    research_agent = ResearchAgent(
        max_depth=2,                    # Crawling depth
        max_pages_per_domain=20,        # Pages per domain limit
        vector_store_path="./data/embeddings",
        raw_data_path="./data/raw"
    )
  2. Drafting Parameters

    drafting_agent = DraftingAgent(
        vector_store_path="./data/embeddings",
        model_type="huggingface"        # or "local"
    )

Data Storage

Vector Store

  • Location: ./data/embeddings/
  • Format: FAISS index
  • Content: Text and image embeddings

Raw Data

  • Location: ./data/raw/
  • Content:
    • Text chunks
    • Downloaded images
    • Metadata

Database

  • Models:
    • User
    • Task
    • ContentChunk
  • Purpose: Track research tasks and results
