🌟🤖 AbleBot > The Able QnA Chatbot 🤖🌟

Setup

Navigate to the directory of your choice and clone the repository: git clone https://github.com/drewku42/AbleBot.git (then move into it with cd AbleBot)
Install Python if necessary: https://www.python.org/downloads/
Create a virtual environment and activate it: python3 -m venv <name>, then source <name>/bin/activate (on Windows: <name>\Scripts\activate)
Install Python dependencies: pip install -r requirements.txt
Run the scraper on the website of your choice, e.g. https://able.co : python3 streamlit_chatbot/scraper.py
- Note: Set the sitemap URL in the scraper module itself (line 8). Format: https://able.co/sitemap.xml
Create a .env file in the streamlit_chatbot directory and add your OpenAI API key. Example: OPENAI_API_KEY=sk-proj-...
Start the chatbot: streamlit run streamlit_chatbot/app.py
Ask away!
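
If the chatbot fails to start because of a missing key, it can help to check what the .env file from the setup steps actually contains. The snippet below is a minimal, standard-library sketch of reading KEY=VALUE lines; it is not the loader the app uses (that is not shown here, and a library such as python-dotenv may well be what does it). The function name load_env_file is hypothetical.

```python
from pathlib import Path


def load_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(path: str) -> bool:
    """Return True when the file defines a non-empty OPENAI_API_KEY."""
    return bool(load_env_file(path).get("OPENAI_API_KEY"))
```

Calling has_openai_key("streamlit_chatbot/.env") from the repository root should return True once the key has been added.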

Scraping the Data

Fetch URLs from Sitemap: The script fetches URLs from the sitemap (https://able.co/sitemap.xml).
Scrape Page Content: For each URL, the scraper extracts the main text content while removing navigation, footer, and script elements.
Store Scraped Data: The script collects text data from all pages and saves it to a file called company_info.txt.
Run the Scraper: Execute the script to start scraping and save the data locally.
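
The steps above can be sketched roughly as follows. This is an illustration using only the Python standard library, not the contents of streamlit_chatbot/scraper.py; the function names (sitemap_urls, page_text, fetch, scrape) and the exact set of skipped tags are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.request import urlopen

# Tags whose contents are dropped (assumed set: navigation, footer, scripts, styles).
SKIPPED_TAGS = {"nav", "footer", "script", "style"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL in a sitemap document, ignoring XML namespaces."""
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]


class _TextExtractor(HTMLParser):
    """Collects text that is not nested inside any of the skipped tags."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def page_text(html: str) -> str:
    """Return the visible text of a page, one text fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch(url: str) -> str:
    """Download a URL and decode it as UTF-8 text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(sitemap_url: str, out_path: str = "company_info.txt") -> None:
    """Fetch every page listed in the sitemap and write the combined text to a file."""
    pages = [page_text(fetch(url)) for url in sitemap_urls(fetch(sitemap_url))]
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write("\n\n".join(pages))
```

With this sketch, scrape("https://able.co/sitemap.xml") would write the combined page text to company_info.txt in the current directory.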

RAG w/ LangChain Overview

Retrieval Augmented Generation (RAG)
A typical RAG application has two main components, indexing and retrieval augmented generation:
- Indexing: A pipeline for ingesting data from a source (e.g., able.co) and indexing it. Usually done offline.
- RAG: Takes a user query at run time, retrieves the relevant data from the index, and passes it to the model.

A Common Pipeline

Load: First we need to load our data using a Document Loader.
Split: We need to break these documents into smaller chunks using a Text Splitter. This is useful for indexing and passing data into a model.
- Note: Larger chunks are harder to search over and may not fit in a model's finite context window.
Store: We need somewhere to store and index the splits, so that they can be searched over later. This is done with a VectorStore (database) and an Embeddings model (semantic representation).
Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.
Generate: A ChatModel (or LLM) produces an answer using a prompt that includes both the question and the retrieved data.
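
To make the store, retrieve, and generate steps concrete, here is a toy sketch in plain Python. It deliberately does not use LangChain: word counts stand in for an Embeddings model, a Python list stands in for the VectorStore, and the final prompt is only built, not sent to a model. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use an embeddings model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Store step: keep each chunk next to its vector (done once, offline)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Retrieve step: return the k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Generate step input: a prompt containing both the question and the retrieved data."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


index = build_index([
    "Able builds custom software products.",
    "The office is closed on public holidays.",
    "Contact the team by email.",
])
question = "What software does Able build?"
prompt = build_prompt(question, retrieve(index, question, k=1))
```

The example chunks are made up for illustration. In the real application, the string in prompt would be passed to the ChatModel, and the index would be built from company_info.txt rather than a hard-coded list.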

Future Improvements

Citing Information: A good way to mitigate hallucinations is to implement citations, where the chatbot provides a reference to the specific chunk or document it pulled information from. This makes it easy to verify the information that the model provides.
System Prompts: System prompts can help a model remember its own identity. For example, we could tell the chatbot that it has a name like AbleBot, that it works for Able and answers questions from Able's customers, etc.
Optimal Text Splitting: When the RAG pipeline is set up, we define a few important parameters. The chunk size defines the number of characters per chunk, and the chunk overlap defines the number of characters shared between adjacent chunks. Optimizing these numbers could improve retrieval recall and the accuracy of answers. I did not tune them much for this prototype.
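
To show how chunk size and chunk overlap interact, here is a minimal fixed-width splitter. It is a sketch, not the splitter used in this project: LangChain's text splitters also try to break on separators such as paragraphs and sentences, which this version ignores. The function name split_text is hypothetical.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more context across neighboring chunks, which can help when an answer straddles a chunk boundary, at the cost of storing and searching more chunks.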
