This full-stack AI application transforms any website, PDF, or DOCX into an intelligent, searchable, and chat-capable knowledge base using a powerful combination of:
- 🌐 Autonomous unlimited web crawling
- 📄 Document ingestion (PDF & DOCX)
- 💾 Persistent vector-based knowledge storage
- 🔍 Retrieval-Augmented Generation (RAG)
- 🧠 Semantic search
- 🤖 Natural-language answers generated with OpenAI embeddings and an LLM
Go beyond a chatbot. This project gives you a private AI that answers from the full content of any website or document you point it at, with no repeated scraping and no shallow Q&A. A single deep crawl is embedded once and stored, so the knowledge stays available for every later question.
- 🧠 Autonomous Crawler — Input a single URL, and the bot navigates the entire site.
- 📄 File Support — Upload PDFs & DOCX and query them seamlessly.
- 💾 One-Time Crawl, Permanent Memory — Stores embeddings in Pinecone for future queries.
- 📊 Live Crawler Logs — Watch the bot explore and learn in real-time.
- 🔍 Semantic Search — Vector-based search that understands meaning, not just keywords.
- 🧹 Smart Scraping — Puppeteer handles modern sites and ignores unnecessary assets.
- 🤖 AI Answers by OpenAI — Answers grounded in context-rich embeddings from your target source.
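To make the semantic search idea concrete, here is a minimal sketch that ranks stored chunks by cosine similarity to a query vector. The function names (`cosineSimilarity`, `topK`) and the hand-made 3-dimensional vectors are hypothetical stand-ins; in the app this ranking is delegated to Pinecone over real OpenAI embeddings, and the similarity metric depends on how the index is configured.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k records most similar to the query vector, best match first.
function topK(queryVector, records, k = 3) {
  return records
    .map((r) => ({ ...r, score: cosineSimilarity(queryVector, r.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data: hand-made vectors standing in for real embeddings (assumption).
const records = [
  { id: "pricing", vector: [0.9, 0.1, 0.0] },
  { id: "careers", vector: [0.0, 0.2, 0.9] },
  { id: "billing", vector: [0.8, 0.3, 0.1] },
];

const results = topK([1, 0, 0], records, 2);
console.log(results.map((r) => r.id)); // [ 'pricing', 'billing' ]
```

Because the score depends on vector direction rather than shared words, two texts with similar meaning can match even when they use different vocabulary.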
| Layer | Technologies | Purpose |
|---|---|---|
| Frontend | React, Tailwind CSS | Beautiful, responsive UI |
| Backend | Node.js, Express.js | API routes & job management |
| Crawler | Puppeteer | Headless browser scraping |
| AI Orchestration | LangChain | Text chunking & embedding pipeline |
| Vector DB | Pinecone | Embedding-based knowledge retrieval |
| LLM + Embeddings | OpenAI (small embedding model) | Embedding + Answer generation |
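The "text chunking" step in the table can be illustrated with a small, dependency-free sketch. This is not LangChain's API: `chunkText` is a hypothetical helper that splits on fixed character counts with overlap, and the default `chunkSize` and `overlap` values are assumptions rather than settings taken from this project.

```javascript
// Split text into fixed-size character chunks; consecutive chunks share
// `overlap` characters so a sentence cut at a boundary appears in both.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in the range [0, chunkSize)");
  }
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

const sample = "abcdefghijklmnopqrstuvwxyz";
const chunks = chunkText(sample, 10, 3);
console.log(chunks); // [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk would then be embedded separately, so chunk size trades retrieval precision against how much context a single match carries.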
- Start Job: You input a `startUrl`, and the server generates a `jobId`.
- Live Updates: Frontend polls crawler logs every 2s.
- Crawling & Scraping: Puppeteer discovers and scrapes all pages.
- Vectorizing: LangChain splits and OpenAI embeds the content.
- Saving: All vectors are saved into Pinecone vector DB.
- Optional Docs: Upload PDFs or DOCX files, automatically vectorized.
- Load Knowledge: Pinecone vector DB is queried.
- Semantic Search: Retrieves top-matching chunks.
- AI Response: OpenAI crafts a natural answer using the context.
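The job and live-log steps above could be backed by something like the following in-memory store. This is a sketch under stated assumptions, not the project's actual server code: `startJob`, `appendLog`, `getLogs`, the `since`/`next` cursor, and the `status` field are all hypothetical names chosen for illustration.

```javascript
const { randomUUID } = require("node:crypto");

// In-memory job store: jobId -> { startUrl, status, logs }.
// Contents are lost on restart; only the vectors are meant to persist.
const jobs = new Map();

// Register a crawl job and return its generated id.
function startJob(startUrl) {
  new URL(startUrl); // throws a TypeError if the URL is malformed
  const jobId = randomUUID();
  jobs.set(jobId, { startUrl, status: "running", logs: [] });
  return jobId;
}

// Called by the crawler whenever it has something to report.
function appendLog(jobId, message) {
  const job = jobs.get(jobId);
  if (!job) throw new Error(`Unknown job: ${jobId}`);
  job.logs.push(message);
}

// Return log lines from index `since` onward, plus the cursor for the next poll.
function getLogs(jobId, since = 0) {
  const job = jobs.get(jobId);
  if (!job) throw new Error(`Unknown job: ${jobId}`);
  return { status: job.status, lines: job.logs.slice(since), next: job.logs.length };
}

// Simulate two polls with a new log line arriving in between.
const jobId = startJob("https://example.com");
appendLog(jobId, "Visiting https://example.com");
const firstPoll = getLogs(jobId);
appendLog(jobId, "Visiting https://example.com/about");
const secondPoll = getLogs(jobId, firstPoll.next);
console.log(secondPoll.lines); // [ 'Visiting https://example.com/about' ]
```

A frontend polling every 2 seconds could pass back the last `next` value so that each request transfers only the lines it has not yet seen.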
- Node.js (v18+)
- OpenAI API Key
- Pinecone API Key
git clone https://github.com/kartik0905/Website-Chatbot.git
cd Website-Chatbot
# Install frontend dependencies
npm install
# Install backend dependencies
cd server
npm install
Create a `.env` file in the `server/` folder:
# server/.env
OPENAI_API_KEY=your_openai_key
PINECONE_API_KEY=your_pinecone_key
In `server.js`:
require("dotenv").config();
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const PINECONE_API_KEY = process.env.PINECONE_API_KEY;
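It can help to fail fast when a key is missing instead of discovering it on the first API call. The sketch below shows one way to do that; `missingEnv` and `assertEnv` are hypothetical helpers that are not part of the repository, and the check would run after `dotenv` has loaded the file.

```javascript
// Names of the variables the server expects, as listed in server/.env.
const REQUIRED = ["OPENAI_API_KEY", "PINECONE_API_KEY"];

// Return the names of required variables that are missing or blank.
function missingEnv(env, required) {
  return required.filter((name) => !env[name] || env[name].trim() === "");
}

// Hypothetical startup check: call assertEnv(process.env) once at boot.
function assertEnv(env) {
  const missing = missingEnv(env, REQUIRED);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}

// Example with a plain object instead of process.env:
console.log(missingEnv({ OPENAI_API_KEY: "sk-test" }, REQUIRED)); // [ 'PINECONE_API_KEY' ]
```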
Terminal 1 — Backend
cd server
node server.js
# Runs on http://localhost:8000
Terminal 2 — Frontend
npm run dev
# Runs on http://localhost:5173
Website-Chatbot/
├── public/
├── src/
│ ├── App.jsx
│ └── main.jsx
├── server/
│ ├── server.js
│ └── vector_stores/
├── .env
├── README.md
├── package.json
└── ...
Built with ❤️ by Kartik Garg