
A simple GPT-3.5 chatbot for querying scraped website data using LangChain and ChromaDB.


phil1px/webscrape-chatbot


webscraper-using-langchain-and-chromaDB

This is a small demo project illustrating a chatbot that can query a scraped website. It uses LangChain as the chatbot framework, Gradio for a user-friendly interface, OpenAI's gpt-3.5-turbo model as the LLM, and ChromaDB as a vector store.


Getting started

This project supports both pip and pipenv. I recommend pipenv for the smoothest and least error-prone experience.

Installation

Pip

If using pip, run

pip install -r requirements.txt

Pipenv

If using pipenv, run

pipenv install

then run pipenv shell to start a shell with the installed packages.

Environment variables

We need to create a new .env file from .env.example and set our OPENAI_API_KEY in it. We can create an API key on OpenAI's platform.
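To make the role of the .env file concrete, here is a minimal sketch of how KEY=VALUE lines can be read into the process environment using only the standard library. The helper name load_env_file is hypothetical, and this is not necessarily how the project itself loads the file (a dedicated package may be used instead); it only illustrates the idea.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into os.environ.

    Hypothetical helper for illustration; returns the parsed pairs.
    """
    loaded = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without a KEY=VALUE shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop surrounding whitespace and optional quotes around the value.
        value = value.strip().strip('"').strip("'")
        # Variables already set in the real environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded
```

Calling load_env_file(".env") before the chatbot starts would make OPENAI_API_KEY available through os.environ, assuming the file contains a line of the form OPENAI_API_KEY=<key>.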

Web scraping

To scrape a site, run

python scrape.py --site <site_url> --depth <int>

This scrapes the given URL and, recursively, every link found there, up to the specified depth. Only pages with the same origin as <site_url> are scraped, so for example scraping https://python.langchain.com/docs will only scrape pages at https://python.langchain.com.

The data will be stored in a new scrape/ directory.
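The depth-limited, same-origin behaviour described above can be sketched as a breadth-first traversal. This is an illustration only, not the code in scrape.py: the function names and the get_links callback are hypothetical, and it assumes that a depth of 0 means "only the start page". The origin check compares scheme, host, and explicit port, and does not normalize default ports.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def same_origin(url_a, url_b):
    """Return True when both URLs share scheme, host, and explicit port."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.hostname, a.port) == (b.scheme, b.hostname, b.port)


def crawl(start_url, depth, get_links):
    """Breadth-first crawl from start_url, following same-origin links.

    get_links is a caller-supplied function mapping a URL to the raw hrefs
    found on that page (hypothetical; a real one would fetch and parse HTML).
    Returns the visited URLs in visit order.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, level = queue.popleft()
        visited.append(url)
        if level >= depth:
            continue  # Do not follow links beyond the requested depth.
        for href in get_links(url):
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and same_origin(start_url, link):
                seen.add(link)
                queue.append((link, level + 1))
    return visited
```

Passing the link extractor in as a function keeps the traversal logic separate from network access, so it can be exercised against an in-memory dictionary of pages before pointing it at a real site.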

Data embeddings

To generate and persist the embeddings and create a vector store, run

python embed.py

A new persisted vector store will be created in the chroma/ directory.
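For intuition about what a vector store is used for, the toy sketch below ranks stored vectors by similarity to a query vector. It is not ChromaDB and not the code in embed.py: the vectors are made up rather than produced by an embedding model, the function names are hypothetical, and cosine similarity is only one common choice of metric (the metric used by the project's store is not stated here).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k document ids whose vectors are most similar to the query.

    store maps a document id to its (toy) embedding vector.
    """
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(query_vector, store[doc_id]),
        reverse=True,
    )
    return ranked[:k]
```

A real vector store performs the same kind of lookup over high-dimensional embeddings and persists them to disk, which is why the chatbot can retrieve relevant scraped text without re-embedding the site on every launch.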

Launching the chatbot

To launch the chatbot, we need to run

python main.py

This will start a Gradio server at http://127.0.0.1:7860, where we can ask the chatbot questions about the scraped website data held in the vector store.

NOTE: for this to work, we must first scrape a site and then persist a vector store.
