# How to Build a Custom Q&A Chatbot Using OpenAI, LangChain, and Chroma
The OpenAI API generates answers to questions, LangChain handles prompt construction and retrieval, and ChromaDB serves as a vector database to search relevant content chunks.
```
brew install pyenv
pyenv install 3.12.9
pyenv local 3.12.9
```

Installed via `requirements.txt`:
- LangChain: Framework to interface with LLMs and orchestrate prompt chaining.
- Chroma: Lightweight vector database for fast retrieval.
- OpenAI: Language model and embedding API.
- python-dotenv: Loads environment variables.
- Streamlit: Interactive UI framework.
- Others: tiktoken, colorama, requests, dateutil.
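A `requirements.txt` matching the list above might look like the following (unpinned here; pin versions for reproducible builds, and note that the `dateutil` library is published on PyPI as `python-dateutil`):

```
langchain
chromadb
openai
python-dotenv
streamlit
tiktoken
colorama
requests
python-dateutil
```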
macOS/Linux:

```
python3 -m venv env
source env/bin/activate
```

Windows:

```
python -m venv env
env\Scripts\activate
```

Install dependencies:

```
pip install -r requirements.txt
```

Get your API key from OpenAI.
Set it via environment variable:

```
export OPENAI_API_KEY='sk-...'
```

Or store it in a `.env` file:

```
OPENAI_API_KEY=sk-...
```

Or duplicate the template:

```
cp .env.example .env
```

Run the CLI chatbot:

```
python main.py
```

Run the Streamlit UI:

```
streamlit run app.py
```

Alternative (minimalist UI):

```
streamlit run app-nb.py
```

Then open http://localhost:8501.
| Component | Purpose |
|---|---|
| LangChain | Manages prompt templates, chaining, and LLM interactions. |
| OpenAI API | Provides natural language understanding and embedding generation. |
| ChromaDB | Stores document embeddings for similarity search. |
| Streamlit | Builds a user-friendly, interactive web interface. |
| Docker | Containerizes the app for environment consistency and ease of deployment. |
| Docker Compose | Orchestrates CLI and UI services simultaneously with shared config. |
| dotenv | Loads and manages API keys securely in local development. |
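The repo ships its own `docker-compose.yml`; a sketch of what a two-service setup like the one described above could look like (service names, commands, and ports here are assumptions, not the repo's actual file):

```yaml
services:
  cli:
    build: .
    env_file: .env
    command: python main.py
    stdin_open: true
    tty: true
  ui:
    build: .
    env_file: .env
    command: streamlit run app.py
    ports:
      - "8501:8501"
```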
1. **Document Ingestion**: Raw text (`faq_real_estate.txt`) is loaded and split into 100-character chunks using `CharacterTextSplitter`.
2. **Embedding & Vector Storage**: Chunks are embedded using `OpenAIEmbeddings` and stored in a ChromaDB vector store.
3. **Query Flow**: User questions are embedded, compared to stored chunks for similarity, and the top matches are passed as context.
4. **Prompt Assembly & LLM Output**: LangChain constructs a system + human prompt using the retrieved context and sends it to OpenAI's chat model.
5. **Response Output**: The chatbot returns a refined, context-aware response through the CLI or Streamlit UI.
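The ingestion-to-retrieval steps above can be sketched end to end with toy components. Here a naive fixed-size splitter and a letter-frequency "embedding" stand in for `CharacterTextSplitter` and `OpenAIEmbeddings` (both far more sophisticated in practice), and the FAQ text is made up, purely for illustration:

```python
import math

def split_chunks(text, chunk_size=100):
    """Naive fixed-size splitting (the real CharacterTextSplitter is separator-aware)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(text):
    """Toy stand-in embedding: letter-frequency vector over a-z."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Made-up FAQ content, split and "embedded" into an in-memory index.
faq = ("Closing costs typically run two to five percent of the purchase price. "
       "Earnest money is a deposit showing the buyer is serious.")
chunks = split_chunks(faq, chunk_size=80)
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query flow: embed the question, rank chunks by similarity,
# and the best match becomes the prompt context.
query = "What are the closing costs?"
qvec = embed(query)
best = max(index, key=lambda pair: cosine(qvec, pair[1]))[0]
print(best)
```

In the real pipeline, Chroma performs this similarity search over OpenAI embedding vectors instead of letter counts.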
```
.
├── app.py               # Streamlit app (model selector)
├── app-nb.py            # Streamlit app (simplified)
├── main.py              # CLI chatbot + core logic
├── Dockerfile
├── docker-compose.yml
├── docs/
│   └── faq_real_estate.txt
├── requirements.txt
└── .env.example
```
Load and split the documents:

```python
raw_documents = TextLoader("./docs/faq_real_estate.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=100)
documents = text_splitter.split_documents(raw_documents)
```

Embed the chunks and store them in Chroma:

```python
embedding_function = OpenAIEmbeddings()
db = Chroma.from_documents(documents, embedding_function)
retriever = db.as_retriever()
```

Build the prompt:

```python
template = (
    "You are a knowledgeable assistant. Use the following info:\n{context}"
)
chat_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(template),
    HumanMessagePromptTemplate.from_template("{question}")
])
```

Compose and invoke the chain:

```python
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | chat_prompt
    | ChatOpenAI(...)
    | StrOutputParser()
)
response = chain.invoke("What are the closing costs?")
```

Build and run with Docker:

```
docker build -t custom-chatbot-cli .
docker run -it --rm --env-file .env custom-chatbot-cli
docker-compose up --build
```

Rebuild with changes:
```
docker-compose up --build --force-recreate
```

Make builds faster by ignoring these paths in a `.dockerignore`:

```
env/
.idea/
__pycache__/
```
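A note on the chain construction shown earlier: LangChain's `|` operator (LCEL) simply feeds each stage's output into the next. A toy pure-Python analogue of that piping idea, where the `fake_*` names are hypothetical stand-ins for the retriever, prompt template, chat model, and output parser:

```python
from functools import reduce

def pipe(*stages):
    """Compose stages left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), stages, x)

# Hypothetical stand-ins for the real chain components.
fake_retriever = lambda q: {"context": "Closing costs are 2-5%.", "question": q}
fake_prompt = lambda d: f"Use: {d['context']}\nQ: {d['question']}"
fake_llm = lambda prompt: f"ANSWER({prompt.splitlines()[-1][3:]})"
fake_parser = lambda s: s.strip()

chain = pipe(fake_retriever, fake_prompt, fake_llm, fake_parser)
print(chain("What are the closing costs?"))
```

The real LCEL runtime adds streaming, batching, and async support on top of this basic composition idea.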
- Real Estate Agents: e.g., a Sunrise Realty FAQ bot
- Internal Knowledgebase: HR, IT support, SOPs
- Legal/Compliance Q&A: clause-specific search
- Education: course notes and FAQ retrieval
- ✅ Swap out `faq_real_estate.txt` for any domain-specific `.txt` content in `docs/`.
- ✅ Update the prompt template in `main.py` to reflect your brand tone.
- ✅ Modify the vector store to use alternatives like FAISS or Weaviate for scale.
- ✅ Replace `OpenAIEmbeddings` with Hugging Face or Cohere embeddings.
- ✅ Store chat history with SQLite or connect Streamlit to Supabase for persistence.