Distill is a high-performance, production-ready web extraction API designed for LLMs and AI agents. It converts messy web pages into clean, structured Markdown and JSON data.

Features:
- Scrape: Synchronous extraction of clean Markdown, metadata, and link maps.
- Map: Asynchronous site discovery to crawl and find all internal URLs.
- Search: Integrated web search (via Serper) with optional top-N scraping.
- Agent Extract: AI-powered structured data extraction using Gemini 1.5 Flash.
- Reliability: Built-in SSRF protection, robots.txt compliance, and rate limiting.
- Intelligent Fetching: Automatic fallback from HTTPX to Playwright for JS-heavy pages.
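
The HTTPX-to-Playwright fallback described above can be pictured roughly as follows. This is only a sketch under stated assumptions: the function name `fetch_with_fallback`, the two injected fetchers, and the `min_chars` threshold are hypothetical, and the rule Distill actually uses to decide that a page is "JS-heavy" is not stated here.

```python
from typing import Callable, Tuple


def fetch_with_fallback(
    url: str,
    fast_fetch: Callable[[str], str],
    browser_fetch: Callable[[str], str],
    min_chars: int = 200,
) -> Tuple[str, str]:
    """Return (html, source), trying the cheap HTTP fetch first.

    fast_fetch    -- hypothetical wrapper around a plain HTTP client (e.g. HTTPX)
    browser_fetch -- hypothetical wrapper around a headless browser (e.g. Playwright)
    min_chars     -- assumed heuristic: shorter responses are treated as an
                     empty JavaScript shell and rendered in the browser
    """
    try:
        html = fast_fetch(url)
    except Exception:
        # Any failure of the fast path is treated as "try the browser instead".
        html = ""

    if len(html.strip()) >= min_chars:
        return html, "http"

    # Fast path failed or returned too little markup: render the page.
    return browser_fetch(url), "browser"
```

Passing the fetchers in as callables keeps the decision logic independent of either library, so it can be exercised with simple stubs.
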

Tech stack:
- Framework: FastAPI (Python)
- Database: PostgreSQL (SQLAlchemy + AsyncPG)
- AI: Google Gemini API
- Scraping: HTTPX, Playwright, Trafilatura, Readability
- Validation: Pydantic

Prerequisites:
- Python 3.10+
- PostgreSQL
- Gemini API Key (from AI Studio)
- Serper API Key (optional, for search)

Installation:
- Clone the repository:

      git clone https://github.com/m1r4g3-code/Distill.git
      cd Distill

- Install dependencies:

      cd backend
      pip install -r requirements.txt

- Configure the environment. Create a .env file in the backend/ directory:

      DATABASE_URL=postgresql+asyncpg://user:password@localhost:5432/webextract
      GEMINI_API_KEY=your_gemini_key
      SERPER_API_KEY=your_serper_key
      SECRET_KEY=your_secret_key

- Initialize the database:

      python seed_api_key.py

Start the development server:

    uvicorn app.main:app --reload --port 8000

API endpoints:
- POST /api/v1/scrape: Scrape a single URL.
- POST /api/v1/map: Start a site mapping job.
- POST /api/v1/search: Search and optionally scrape results.
- POST /api/v1/agent/extract: Extract structured JSON via Gemini.
- GET /api/v1/jobs/{id}: Check job status.
- GET /api/v1/jobs/{id}/results: Get job results.
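
A minimal client sketch for the endpoints above, using only the Python standard library. The endpoint paths come from this README; the `X-API-Key` header name and the `{"url": ...}` request body are assumptions for illustration, so check the server's interactive docs for the real schema before relying on them.

```python
import json
import urllib.request


def build_scrape_request(base_url: str, api_key: str, target_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/scrape request."""
    body = json.dumps({"url": target_url}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/scrape",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # assumed header name
        },
    )


def job_status_url(base_url: str, job_id: str) -> str:
    """URL for GET /api/v1/jobs/{id}, used to poll asynchronous jobs."""
    return f"{base_url.rstrip('/')}/api/v1/jobs/{job_id}"


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the development server running on port 8000, `send(build_scrape_request("http://localhost:8000", "your_api_key", "https://example.com"))` would issue the scrape call.
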

Security and reliability:
- SSRF Protection: Blocks internal IP ranges and private networks.
- Robots.txt: Optional compliance toggle for all requests.
- Rate Limiting: Sliding window rate limiting per API key.
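
To make the SSRF check and the sliding-window limit concrete, here is a small standard-library sketch of both ideas. It is an illustration under assumptions, not Distill's implementation: the names `is_public_address`, `is_url_allowed`, and `SlidingWindowLimiter` are hypothetical, and the limiter keeps its state in process memory, which a real deployment may not do.

```python
import ipaddress
import socket
import time
from collections import defaultdict, deque
from urllib.parse import urlparse


def is_public_address(ip_text: str) -> bool:
    """True only for addresses outside private, loopback, and similar ranges."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_url_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Reject non-HTTP(S) URLs and hosts resolving to any non-public address."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, None)
    except socket.gaierror:
        return False
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return bool(infos) and all(is_public_address(info[4][0]) for info in infos)


class SlidingWindowLimiter:
    """Allow at most `limit` requests per API key in any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # api_key -> timestamps of accepted requests

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        hits = self._hits[api_key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Checking every resolved address, rather than only the first, avoids accepting a hostname that maps to both a public and an internal IP. A production check would also need to pin the connection to the validated address so that a later DNS lookup cannot return a different one.
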

Authors:
- m1r4g3-code
- Hephzibah (Collaborator)
Built with ❤️ for the AI Developer community.