kartik0905/Website-Chatbot

🤖 AI Web Crawler & Chatbot

Transform any website or document into an intelligent, conversational knowledge base

React Node.js LangChain Puppeteer OpenAI Pinecone


📌 Overview

This full-stack AI application turns any website, PDF, or DOCX file into a searchable knowledge base you can chat with. It combines:

  • 🌐 Autonomous unlimited web crawling
  • 📄 Document ingestion (PDF & DOCX)
  • 💾 Persistent vector-based knowledge storage
  • 🔍 Retrieval-Augmented Generation (RAG)
  • 🧠 Semantic search
  • 🤖 Natural language answering via OpenAI embeddings and an OpenAI LLM

🚀 The Vision

Go beyond a chatbot. This project gives you a private AI grounded in any website or document you point it at. The source is crawled once, its embeddings are stored, and later questions are answered from that stored index, with no repeated scraping and no shallow Q&A.


✨ Features

  • 🧠 Autonomous Crawler — Input a single URL, and the bot navigates the entire site.
  • 📄 File Support — Upload PDFs & DOCX and query them seamlessly.
  • 💾 One-Time Crawl, Permanent Memory — Stores embeddings in Pinecone for future queries.
  • 📊 Live Crawler Logs — Watch the bot explore and learn in real-time.
  • 🔍 Semantic Search — Vector-based search that understands meaning, not just keywords.
  • 🧹 Smart Scraping — Puppeteer handles modern sites and ignores unnecessary assets.
  • 🤖 AI Answers by OpenAI — Answers grounded in context-rich embeddings from your target source.
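
The "Smart Scraping" feature above could be approximated with Puppeteer's request interception. The following is only a sketch under assumptions: the list of blocked resource types and the helper names `shouldBlock` and `blockHeavyAssets` are hypothetical and not taken from this repository. It relies on the Puppeteer calls `page.setRequestInterception`, `request.resourceType`, `request.abort`, and `request.continue`.

```javascript
// Assumption: these resource types are not needed for text extraction.
const BLOCKED_TYPES = new Set(["image", "stylesheet", "font", "media"]);

// Hypothetical predicate: true when a request should be skipped.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Hypothetical helper: `page` is a Puppeteer Page created elsewhere.
async function blockHeavyAssets(page) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (shouldBlock(request.resourceType())) {
      request.abort(); // skip assets that carry no text
    } else {
      request.continue(); // let documents, scripts, XHR, etc. through
    }
  });
}
```

Calling `blockHeavyAssets(page)` before `page.goto(url)` would apply the filter to every request the page makes.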

🛠️ Tech Stack

Layer             | Technologies                   | Purpose
Frontend          | React, Tailwind CSS            | Beautiful, responsive UI
Backend           | Node.js, Express.js            | API routes & job management
Crawler           | Puppeteer                      | Headless browser scraping
AI Orchestration  | LangChain                      | Text chunking & embedding pipeline
Vector DB         | Pinecone                       | Embedding-based knowledge retrieval
LLM + Embeddings  | OpenAI (small embedding model) | Embedding + Answer generation
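
To make the "text chunking" step in the table concrete, here is a dependency-free sketch of fixed-size chunking with overlap. It is a simplified stand-in, not LangChain's splitter and not the code used in this repository; the function name `chunkText` and the default sizes are assumptions chosen for illustration.

```javascript
// Hypothetical helper: split text into fixed-size chunks that overlap,
// so text near a boundary appears in both neighbouring chunks.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("need chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks = [];
  const step = chunkSize - overlap; // how far each chunk start advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}
```

Each chunk would then be embedded separately, which keeps individual vectors focused on a small span of the source text.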

⚙️ How It Works (RAG Pipeline)

Phase 1: Crawl & Index

  1. Start Job: You submit a startUrl and the server generates a jobId.
  2. Live Updates: Frontend polls crawler logs every 2s.
  3. Crawling & Scraping: Puppeteer discovers and scrapes all pages.
  4. Vectorizing: LangChain splits the content into chunks and OpenAI embeds each chunk.
  5. Saving: All vectors are saved to the Pinecone vector DB.
  6. Optional Docs: Uploaded PDF or DOCX files are vectorized and stored the same way.
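
Steps 1 to 3 above can be sketched as a breadth-first crawl that records log lines per job. This is an illustration under assumptions, not the repository's crawler: `jobLogs`, `log`, `crawlSite`, `normalize`, and the `maxPages` safety cap are hypothetical, and `fetchPage` is an injected function (in the real app it would presumably wrap Puppeteer) that resolves to `{ text, links }`.

```javascript
// Hypothetical in-memory log store keyed by jobId; the frontend could poll it.
const jobLogs = new Map();

function log(jobId, message) {
  if (!jobLogs.has(jobId)) jobLogs.set(jobId, []);
  jobLogs.get(jobId).push(message);
}

// Drop the #fragment so the same page is not queued twice.
function normalize(url) {
  const u = new URL(url);
  u.hash = "";
  return u.href;
}

// Breadth-first crawl restricted to the start URL's origin.
// maxPages is an assumed safety cap, not something the README specifies.
async function crawlSite(startUrl, fetchPage, jobId, maxPages = 100) {
  const origin = new URL(startUrl).origin;
  const queue = [normalize(startUrl)];
  const visited = new Set(queue);
  const pages = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const url = queue.shift();
    log(jobId, `Scraping ${url}`);
    const { text, links } = await fetchPage(url);
    pages.push({ url, text });

    for (const href of links) {
      let next;
      try {
        next = new URL(href, url); // resolve relative links
      } catch {
        continue; // ignore malformed hrefs
      }
      if (next.origin !== origin) continue; // stay on the same site
      const key = normalize(next.href);
      if (!visited.has(key)) {
        visited.add(key);
        queue.push(key);
      }
    }
  }
  log(jobId, `Done: ${pages.length} pages`);
  return pages;
}
```

The returned `pages` array is what the vectorizing step would consume, and `jobLogs.get(jobId)` is what a log-polling endpoint could return.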

Phase 2: Query Time

  1. Load Knowledge: The question is embedded and the Pinecone vector DB is queried with that vector.
  2. Semantic Search: Retrieves top-matching chunks.
  3. AI Response: OpenAI crafts a natural answer using the context.
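
The query-time steps above can be illustrated with a small in-memory version of semantic search. In the real application Pinecone performs the similarity search server-side, so this is only a conceptual sketch: `cosineSimilarity`, `topK`, `buildPrompt`, the record shape `{ text, vector }`, and the prompt wording are all assumptions made for illustration.

```javascript
// Cosine similarity between two equal-length, non-zero vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k records most similar to the query vector, best first.
function topK(queryVector, records, k = 3) {
  return records
    .map((r) => ({ ...r, score: cosineSimilarity(queryVector, r.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Assemble the retrieved chunks and the question into one LLM prompt.
function buildPrompt(question, matches) {
  const context = matches.map((m, i) => `[${i + 1}] ${m.text}`).join("\n");
  return `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;
}
```

The string returned by `buildPrompt` is what would be sent to the LLM, so the answer stays grounded in the retrieved chunks rather than in the model's general knowledge.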

🧪 Local Development

🔧 Requirements

  • Node.js (v18+)
  • OpenAI API Key
  • Pinecone API Key

🏁 Getting Started

1. Clone & Setup

git clone https://github.com/kartik0905/Website-Chatbot.git
cd Website-Chatbot

# Install frontend dependencies
npm install

# Install backend dependencies
cd server
npm install

2. Add API Keys

Create a .env file in the server/ folder:

# server/.env
OPENAI_API_KEY=your_openai_key
PINECONE_API_KEY=your_pinecone_key

In server.js:

require("dotenv").config();

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const PINECONE_API_KEY = process.env.PINECONE_API_KEY;
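
If either key is missing, the constants above are simply undefined and the failure only surfaces later, on the first API call. A fail-fast check is one way to catch this at startup. The helper name `requireEnv` is hypothetical and not part of this repository:

```javascript
// Hypothetical helper: read an environment variable or fail at startup.
// `env` defaults to process.env and is a parameter only to ease testing.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (!value || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Possible usage in server.js, after require("dotenv").config():
//   const OPENAI_API_KEY = requireEnv("OPENAI_API_KEY");
//   const PINECONE_API_KEY = requireEnv("PINECONE_API_KEY");
```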

🚦 Run the App

Terminal 1 — Backend

cd server
node server.js
# Runs on http://localhost:8000

Terminal 2 — Frontend

npm run dev
# Runs on http://localhost:5173

📁 Folder Structure

Website-Chatbot/
├── public/
├── src/
│   ├── App.jsx
│   └── main.jsx
├── server/
│   ├── server.js
│   ├── .env
│   └── vector_stores/
├── README.md
├── package.json
└── ...

🙌 Acknowledgments


Built with ❤️ by Kartik Garg
