A production-grade multi-agent AI research pipeline built with LangGraph, FastAPI, Docker, and Azure DevOps. The system uses a Supervisor → Researcher → Writer agent architecture to autonomously research any topic and produce a polished, structured report.
```
User Request (POST /research)
        │
        ▼
┌──────────────────┐
│   FastAPI App    │  ← REST API layer
└────────┬─────────┘
         │
         ▼
┌──────────────────────────────────────────┐
│         LangGraph State Machine          │
│                                          │
│   ┌────────────┐                         │
│   │ Supervisor │ ← validates & routes    │
│   └─────┬──────┘                         │
│         │                                │
│         ▼                                │
│   ┌────────────┐     ┌──────────────┐    │
│   │ Researcher │────▶│ Tavily Search│    │
│   │   Agent    │     │  Web Reader  │    │
│   └─────┬──────┘     └──────────────┘    │
│         │                                │
│         ▼                                │
│   ┌────────────┐     ┌──────────────┐    │
│   │   Writer   │────▶│ Self Critique│    │
│   │   Agent    │     │     Tool     │    │
│   └─────┬──────┘     └──────────────┘    │
│         │                                │
└─────────┼────────────────────────────────┘
          │
          ▼
Structured JSON Response
(report + sources + agent trace)
```

| Layer | Technology |
|---|---|
| Agent Framework | LangGraph (StateGraph, ReAct pattern) |
| LLM | OpenAI GPT-4o / GPT-4o-mini via init_chat_model |
| Web Search | Tavily Search API |
| API Framework | FastAPI + Pydantic v2 |
| Dependency Management | uv |
| Containerization | Docker |
| CI/CD | Azure DevOps Pipelines (3-stage) |
| Image Registry | Docker Hub |
| Code Quality | Ruff (linting) |
| Testing | Pytest |
**Supervisor** — orchestrates the pipeline. Validates state between agents and handles routing. If the Researcher produces no findings, the pipeline fails gracefully before the Writer is invoked.
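The fail-fast routing described above amounts to a conditional-edge predicate. A minimal sketch of that check (illustrative only; the real logic lives in `agents/supervisor.py` and the exact state keys may differ):

```python
def route_after_research(state: dict) -> str:
    # Hypothetical LangGraph conditional-edge function: hand off to the
    # Writer only if the Researcher actually produced findings.
    if not state.get("research_findings"):
        return "fail"    # end the graph gracefully, never invoke the Writer
    return "writer"
```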
**Researcher**
- Model: `gpt-4o-mini`
- Tools: Tavily web search + webpage content reader
- Behaviour: Executes 2-3 targeted searches, reads full articles when needed, outputs structured JSON findings
- Max tool calls: 5 (cost control)
**Writer**
- Model: `gpt-4o`
- Tools: Self-critique tool
- Behaviour: Transforms raw research into a polished Markdown report, self-reviews the draft, revises before finalising
All agents communicate through a typed AgentState (a Python TypedDict that serves as the LangGraph state schema) — no direct agent-to-agent calls. State carries the topic, research findings, final report, agent trace, and token count.
3-stage pipeline triggered on every push to master:
```
Stage 1: 🧪 Quality Gate
├── Set Python 3.11
├── Install dependencies via uv
├── Ruff lint check
└── Pytest (mocked pipeline tests)

Stage 2: 🐳 Build & Ship
├── Docker build
└── Push to Docker Hub (:latest + :build_id)

Stage 3: 🔍 Container Health Verification
├── Pull image from Docker Hub
├── Run container with env vars
├── Hit /health endpoint → assert HTTP 200
├── Print health response
└── Cleanup container
```
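The quality-gate tests mock the pipeline so CI never calls real LLM or search APIs. A sketch of that pattern (names hypothetical; the actual tests live in `tests/test_api.py`):

```python
from unittest.mock import MagicMock

# Hypothetical sketch: the real pipeline entry point is replaced with a
# mock, so tests assert on routing and response shape without spending
# tokens or needing API keys.
pipeline = MagicMock(return_value={"status": "success", "tokens_used": 0})

result = pipeline("Impact of LLMs on software engineering jobs in 2025")
pipeline.assert_called_once()
```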
Secrets (OPENAI_API_KEY, TAVILY_API_KEY, DOCKER_HUB_USERNAME) are stored in Azure DevOps Variable Groups — never in code.
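In `azure-pipelines.yml`, pulling in such a variable group typically looks like the fragment below (the group name is illustrative; note that secret variables must be mapped into script steps explicitly):

```yaml
variables:
  - group: llmops-secrets              # hypothetical group, defined under Pipelines → Library

steps:
  - script: echo "secrets are injected as pipeline variables"
    env:
      OPENAI_API_KEY: $(OPENAI_API_KEY)  # secret variables are not exposed to scripts unless mapped
```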
```
llmops-multi-agent-cicd-pipeline/
├── agents/
│   ├── researcher.py      # Researcher ReAct agent
│   ├── writer.py          # Writer ReAct agent
│   └── supervisor.py      # Routing & validation logic
├── graph/
│   ├── state.py           # Shared AgentState TypedDict
│   └── pipeline.py        # LangGraph StateGraph wiring
├── tools/
│   ├── search.py          # Tavily search tool
│   └── web_reader.py      # URL content extractor
├── models/
│   └── schemas.py         # Pydantic request/response schemas
├── tests/
│   └── test_api.py        # Pytest tests (mocked pipeline)
├── main.py                # App entry point
├── api/
│   └── main.py            # FastAPI application
├── Dockerfile
├── docker-compose.yml
├── azure-pipelines.yml
└── pyproject.toml
```
`GET /health` returns service health status.

```json
{
  "status": "healthy",
  "service": "multi-agent-pipeline",
  "version": "1.0.0"
}
```

`POST /research` runs the full multi-agent pipeline.
Request:

```json
{
  "topic": "Impact of LLMs on software engineering jobs in 2025",
  "max_search_results": 3
}
```

Response:
```json
{
  "topic": "Impact of LLMs on software engineering jobs in 2025",
  "research_summary": {
    "key_findings": ["Finding 1", "Finding 2"],
    "sources": ["https://..."],
    "search_queries_used": ["query 1", "query 2"]
  },
  "final_report": "# Impact of LLMs...\n\n## Executive Summary\n...",
  "agent_trace": [
    {"agent": "researcher", "action": "start", "detail": "Starting research on: ..."},
    {"agent": "researcher", "action": "complete", "detail": "Found 5 key findings"},
    {"agent": "writer", "action": "start", "detail": "Starting report writing"},
    {"agent": "writer", "action": "complete", "detail": "Report written — 520 words"}
  ],
  "tokens_used": 3420,
  "status": "success"
}
```

- Python 3.11+
- uv installed
- Docker Desktop
- OpenAI API key
- Tavily API key (free tier at tavily.com)
```bash
# Clone the repo
git clone https://github.com/your-username/llmops-multi-agent-cicd-pipeline.git
cd llmops-multi-agent-cicd-pipeline

# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate    # Linux/Mac
# .venv\Scripts\Activate.ps1 # Windows PowerShell

# Install dependencies
uv sync

# Add environment variables
cp .env.example .env
# Edit .env and add your API keys

# Run the API
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

```bash
docker build -t llmops-multi-agent-cicd-pipeline .
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=your_key \
  -e TAVILY_API_KEY=your_key \
  llmops-multi-agent-cicd-pipeline
```

```bash
uv run pytest tests/ -v
uv run ruff check .
```

| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key |
| `TAVILY_API_KEY` | Tavily Search API key |
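Reading these variables at startup might look like the sketch below (a hypothetical helper, not the project's actual code) so a missing key fails fast rather than surfacing mid-request:

```python
import os

def load_settings() -> dict:
    # Hypothetical helper: read required keys from the environment and
    # fail fast at startup if any are missing.
    required = ("OPENAI_API_KEY", "TAVILY_API_KEY")
    missing = [key for key in required if not os.environ.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {key: os.environ[key] for key in required}
```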
Why LangGraph over LangChain AgentExecutor? LangGraph models each agent as an explicit node in a state graph, giving full control over state, routing, and error handling. Each agent's inputs and outputs are typed and traceable.
Why separate Researcher and Writer agents? Separation of concerns — the Researcher never writes prose, the Writer never searches the web. This prevents hallucination (Writer can only use what Researcher found) and makes each agent's behaviour predictable and testable.
Why init_chat_model? Provider-agnostic LLM initialisation. Swapping from OpenAI to Anthropic or any other provider requires changing one string, not the entire codebase.
Why GPT-4o-mini for Researcher and GPT-4o for Writer? Cost optimisation — the Researcher makes many tool calls and processes raw text, where speed and cost matter more than prose quality. The Writer makes fewer calls but needs higher quality output.
Every API response includes a full agent_trace showing every action taken by every agent. This is intentional — it provides transparency into how the answer was produced, which is essential for debugging and for demonstrating the system's reasoning in interviews and demos.
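Trace entries are simple structured records, one per agent action. Appending them might look like this (a hypothetical helper; field names taken from the response example above):

```python
def trace(state: dict, agent: str, action: str, detail: str) -> None:
    # Hypothetical helper: append one structured entry to the shared
    # agent_trace list carried in the pipeline state.
    state.setdefault("agent_trace", []).append(
        {"agent": agent, "action": action, "detail": detail}
    )

state = {}
trace(state, "researcher", "start", "Starting research on: LLMs")
trace(state, "researcher", "complete", "Found 5 key findings")
```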
MIT