Skip to content

msylvester/funding_agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

180 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿ’ฐ Funding Intelligence RAG

A comprehensive RAG (Retrieval-Augmented Generation) system for analyzing startup funding data with AI-powered insights using a sophisticated multi-agent architecture.

Funding Intelligence RAG Home Screen

Figure 1: Home screen of the Funding Intelligence RAG Streamlit app.

โœจ Features

  • ๐Ÿค– Multi-Agent Workflows: Hierarchical AI agent orchestration using OpenAI Agents SDK
  • ๐Ÿง  Custom RAG System: Production-grade RAG with ChromaDB vectorization and intelligent reasoning
  • ๐Ÿ” Intelligent Query Routing: Automatic intent classification and workflow selection
  • ๐Ÿ’ก Investor Recommendations: AI-powered investor matching based on sector and funding data
  • ๐Ÿ”ฌ Company Research: Automated research combining internal knowledge base with live web search
  • ๐Ÿ•ท๏ธ Web Scraping: Automated scraping of TechCrunch and other funding sources
  • ๐ŸŒ Streamlit Interface: Interactive multi-page web app
  • ๐Ÿ“Š MongoDB Integration: Scalable database storage for funding data

๐Ÿค– Multi-Agent Workflow Architecture

The system implements a sophisticated multi-agent architecture using the OpenAI Agents SDK (v0.3.0+) with a hierarchical orchestrator-worker pattern. Multiple specialized AI agents collaborate in chains to deliver intelligent responses.

Architecture Overview

User Query (Streamlit)
    โ†“
Orchestrator Workflow (Intent Classification)
    โ†“
    โ”œโ”€ "advice" โ†’ Advice Workflow
    โ”‚       โ†“
    โ”‚   1. Sector Classifier Agent (GPT-4o-mini)
    โ”‚   2. Advice Agent (GPT-4o + MongoDB tools)
    โ”‚   3. Summarize Agent (GPT-4o)
    โ”‚
    โ””โ”€ "research" โ†’ Research Workflow
            โ†“
        1. RAG Research Agent (GPT-4o + RAG tools)
        2. Web Research Agent (GPT-4o-mini + WebSearch)

Workflow Components

๐ŸŽฏ Orchestrator Workflow (services/workflows/orchestrator_workflow.py)

Entry point for all user queries - intelligently routes requests to specialized workflows.

  • Intent Classifier Agent (GPT-4o-mini): Analyzes user intent and classifies into:
    • "advice": Fundraising recommendations, investor matching, strategy questions
    • "research": Company information lookup, funding history queries
  • Smart Routing: Directs queries to appropriate specialized workflow based on classification
  • Trace Metadata: Built-in observability for debugging and monitoring

๐Ÿ”ฌ Research Workflow (services/workflows/research_workflow.py)

Multi-stage pipeline for comprehensive company research combining internal knowledge with live web data.

Agent Chain:

  1. RAG Research Agent (GPT-4o):

    • Queries ChromaDB vector database using custom RAG tools
    • Retrieves relevant companies from internal funding knowledge base
    • Returns structured list with company names, descriptions, and relevance scores
  2. Web Research Agent (GPT-4o-mini):

    • Enriches RAG results with live web search
    • Gathers: website, company size, headquarters, founding year, industry details
    • Returns comprehensive company profiles

Data Flow: Conversation history is maintained across agent calls, enabling context-aware responses.

๐Ÿ’ก Advice Workflow (services/workflows/advice_workflow.py)

Intelligent investor matching and strategic fundraising guidance.

Agent Chain:

  1. Sector Classification:

    • Dynamically fetches valid sectors from MongoDB
    • Uses OpenAI structured outputs with confidence scoring
    • Confidence threshold: 0.8 (falls back to generic advice below threshold)
  2. Advice Agent (GPT-4o):

    • Uses MongoDB function tools to query funding database:
      • search_funded_companies_by_sector(): Find funded companies in target sector
      • get_investors_for_sector(): Get investor activity statistics
    • Extracts investor-company relationships
    • Generates personalized strategic advice
  3. Summarize & Display Agent (GPT-4o):

    • Condenses advice into dashboard-friendly format
    • Structures output for UI presentation

Agent Composition Pattern

All workflows follow this reusable pattern:

# 1. Define Agent with OpenAI Agents SDK
agent = Agent(
    name="Agent Name",
    instructions="Detailed agent instructions...",
    model="gpt-4o",                    # or "gpt-4o-mini"
    output_type=PydanticSchema,        # Structured output validation
    tools=[tool1, tool2],               # Function tools (RAG, MongoDB, WebSearch)
    model_settings=ModelSettings(store=True)
)

# 2. Run Agent Asynchronously
result = await Runner.run(
    agent,
    input=conversation_history,
    run_config=RunConfig(
        trace_metadata={"workflow_id": "..."}
    )
)

# 3. Extract Structured Output
output = result.final_output  # Pydantic model instance

Key Features:

  • Pydantic Schemas: Enforce structured, type-safe outputs
  • Conversation History: Maintained between agents for contextual awareness
  • Function Tools: Reusable capabilities (RAG, MongoDB, WebSearch) injected into agents
  • Async Execution: High-performance async/await patterns
  • Observability: Built-in trace metadata for debugging

๐Ÿง  Custom RAG Tool Implementation

The system features a production-grade RAG (Retrieval-Augmented Generation) implementation using a dual-layer architecture: ChromaDB for vector storage and OpenRouter LLM for intelligent reasoning.

RAG Architecture

MongoDB (Source Data)
    โ†“
DataService (Orchestration)
    โ†“
    โ”œโ”€ OpenAI Embeddings (text-embedding-3-small, 1536-dim)
    โ†“
ChromaDB (Vector Storage)
    โ†“
RAG Agent (OpenRouter LLM Reasoning)
    โ†“
Structured Response

Three RAG Function Tools

The RAG system exposes three function tools (via services/agents/rag_service_agent.py) that any agent can use:

1. rag_semantic_search(query, top_k=5, distance_threshold=0.3)

Vector similarity search in ChromaDB knowledge base.

  • Creates query embedding using OpenAI
  • Performs cosine similarity search
  • Filters by distance threshold (0 = identical, 2 = opposite)
  • Returns documents, metadata, distances, and count

Example:

results = rag_semantic_search("fintech startups", top_k=5)
# Returns: {documents: [...], metadatas: [...], distances: [...], count: 5}

2. rag_generate_reasoning(query, documents, metadatas_json, distances)

LLM-powered synthesis from retrieved documents.

  • Takes search results as input
  • Uses OpenRouter LLM for intelligent reasoning
  • Synthesizes coherent answer with citations
  • Acknowledges data limitations when appropriate

Example:

answer = rag_generate_reasoning(
    query="Who invested in Stripe?",
    documents=retrieved_docs,
    metadatas_json=json.dumps(metadata),
    distances=distances
)
# Returns: "Stripe was funded by [investors], raising $X million..."

3. rag_full_query(query, top_k=5)

Convenience tool combining search + reasoning in one call.

  • All-in-one: retrieval + synthesis
  • Returns structured dict with answer, sources, and document count
  • Recommended for most use cases

Example:

result = rag_full_query("AI companies in healthcare")
# Returns: {answer: "...", sources: [...], document_count: 5}

Usage in Agents

RAG tools are injected into agents via get_rag_tools():

from services.agents.rag_service_agent import get_rag_tools

rag_research_agent = Agent(
    name="RAG research agent",
    instructions="You are a research assistant with RAG access...",
    model="gpt-4o",
    output_type=RagResearchAgentSchema,
    tools=get_rag_tools(),  # โ† RAG tools available to agent
    model_settings=ModelSettings(store=True)
)

The agent can now autonomously decide when and how to use RAG tools based on the user query.

ChromaDB Integration

Vector Database: ChromaDB with persistent storage at ./chromadb_data

Storage Schema:

  • Documents: Text format: "Company: [name]\nDescription: [description]"
  • Embeddings: OpenAI text-embedding-3-small vectors (1536 dimensions)
  • Metadata:
    • company_name: Company identifier
    • company_index: Index in source data
    • date_unix: Unix timestamp (optional)
  • Distance Metric: Cosine similarity (default)

Data Pipeline (services/database/data_service.py):

  1. Ingestion: ingest_data() pulls companies from MongoDB
  2. Embedding: embed_data() creates vectors with OpenAI API
  3. Storage: Batch insertion into ChromaDB with metadata
  4. Retrieval: retrieve_documents() performs similarity search
  5. Reasoning: generate_response_with_reasoning() synthesizes answers

Key Features:

  • Auto-ingestion: Automatically loads data from MongoDB if ChromaDB is empty
  • Similarity Filtering: Configurable threshold for relevance (default: 0.3 distance)
  • Batch Processing: Efficient embedding creation for large datasets
  • Persistent Storage: Data persists across application restarts

RAG Agent (services/agents/custom/agent_rag.py)

LLM reasoning layer using OpenRouter API.

Features:

  • Configurable model selection (default via OpenRouter)
  • Context construction from retrieved documents
  • Prompt engineering for funding domain expertise
  • Citation of specific details (companies, amounts, investors, dates)
  • Acknowledgment of data limitations

Example Prompt Pattern:

You are a funding data expert. Based on these documents:
[Document 1] Company: Stripe, Description: Payment processing...
[Document 2] Company: Plaid, Description: Financial APIs...

Answer this query: "Who invested in fintech companies?"
Cite specific details and acknowledge if data doesn't fully answer.

Funding Intelligence RAG Query Results

Figure 2: Example of AI-powered funding query results.

๐Ÿ“ฆ Installation

The project is pip-installable with a standard Python package structure, making it easy to install and import across different environments.

Quick Install

Development Mode (recommended for development):

# Clone the repository
git clone <repository-url>
cd funding_scraper

# Install in editable mode
pip install -e .

# Install dependencies
pip install -r requirements.txt

Production Install:

pip install .

Package Structure

The project uses pyproject.toml for modern Python packaging:

[project]
name = "funding_scraper"
version = "0.1.0"
description = "Funding data scraper and investor recommendation system"
requires-python = ">=3.8"

Included Packages:

  • services/ - Core business logic and agents
    • services.agents - AI agents (RAG, sector, advice)
    • services.workflows - Multi-agent workflows
    • services.database - MongoDB and ChromaDB integration
    • services.scrapers - Web scraping modules
    • services.processing - Data processing utilities
  • config/ - Application configuration
  • views/ - Streamlit UI pages
  • ui/ - UI components and styling
  • utils/ - Utility functions

Import Usage

After installation, import from anywhere:

# Import workflow orchestrator
from services.workflows.orchestrator_workflow import run_orchestrator_workflow

# Import RAG tools
from services.agents.rag_service_agent import get_rag_tools

# Import configuration
from config.settings import API_CONFIG

# Import database services
from services.database.data_service import DataService
from services.database.mongodb_tools import search_funded_companies_by_sector

# Run async workflow
import asyncio
result = asyncio.run(run_orchestrator_workflow("Research Tesla"))

Environment Setup

Required Environment Variables:

# OpenAI API (for embeddings and agents)
export OPENAI_API_KEY=your_openai_key_here

# OpenRouter API (for RAG reasoning)
export OPENROUTER_API_KEY=your_openrouter_key_here

# MongoDB (optional, defaults to localhost)
export MONGODB_URI=mongodb://localhost:27017/

Optional Configuration:

# ChromaDB path (defaults to ./chromadb_data)
export CHROMA_PATH=./chromadb_data

# Logging level
export LOG_LEVEL=INFO

๐ŸŽฎ Usage

๐ŸŽจ Run the Streamlit Web Interface

streamlit run app.py

๐Ÿ’ป Programmatic Usage

import asyncio
from services.workflows.orchestrator_workflow import run_orchestrator_workflow

# Run a research query
async def main():
    result = await run_orchestrator_workflow("What investors fund AI startups?")
    print(result)

asyncio.run(main())

๐Ÿงช Testing Harnesses

Test individual components:

# Test sector classification
python services/harnesses/sector_harness.py

# Test advice workflow
python services/harnesses/advice_harness.py

# Test research workflow
python services/harnesses/research_workflow_harness.py

๐Ÿ“ Project Structure

๐ŸŽฏ Main Application

  • app.py - Streamlit multi-page application

๐Ÿค– AI Agents & Workflows (/services/)

Workflows (/services/workflows/):

  • orchestrator_workflow.py - Main entry point with intent classification
  • research_workflow.py - RAG + web research pipeline
  • advice_workflow.py - Investor matching and strategic advice

Agents (/services/agents/):

  • rag_service_agent.py - RAG function tools for agents
  • sector_agent.py - Sector classification agent
  • custom/agent_rag.py - RAG reasoning agent (OpenRouter)

Database (/services/database/):

  • data_service.py - ChromaDB integration and embeddings
  • mongodb_tools.py - MongoDB function tools for agents
  • database.py - MongoDB operations and schema

Scrapers (/services/scrapers/):

  • scraper_service.py - TechCrunch scraper
  • article_processor.py - Article content processing

Harnesses (/services/harnesses/):

  • Testing frameworks for individual workflow components

๐ŸŽจ UI Components (/ui/ & /views/)

  • views/research.py - Research page with orchestrator integration
  • ui/components.py - Reusable Streamlit components
  • ui/styles.py - Custom CSS styling

โš™๏ธ Configuration (/config/)

  • settings.py - Application configuration and API settings

๐Ÿ”ง Key Components

  • ๐ŸŽฏ Multi-Agent Orchestration: Hierarchical agent workflow with intelligent routing
  • ๐Ÿง  Production RAG System: ChromaDB + OpenAI embeddings + OpenRouter reasoning
  • ๐Ÿ’ก Investor Matching: AI-powered recommendations based on sector analysis
  • ๐Ÿ” Vector Search: Semantic similarity search for funding data retrieval
  • โšก Live Web Research: Real-time data enrichment via web search
  • ๐Ÿค– Function Tools: Reusable RAG and MongoDB tools for any agent
  • ๐Ÿ“Š Database Integration: MongoDB storage with comprehensive schema
  • ๐ŸŽจ Interactive UI: Streamlit multi-page app with research and advice pages

๐Ÿ› ๏ธ Technology Stack

  • Agents: OpenAI Agents SDK 0.3.0+
  • LLMs: GPT-4o, GPT-4o-mini, OpenRouter models
  • Embeddings: OpenAI text-embedding-3-small (1536-dim)
  • Vector DB: ChromaDB (persistent, cosine similarity)
  • Document DB: MongoDB
  • UI: Streamlit multi-page app
  • Schema Validation: Pydantic v2+
  • Web Scraping: BeautifulSoup4, requests

About

this is an agent used to track companies that have received funding. It ingests in real time

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages