A Retrieval-Augmented Generation (RAG) chatbot powered by OpenAI embeddings to answer questions using your personal knowledge base: CV, master thesis, GitHub repos, and research papers.
This project builds a personalized chatbot that leverages your own data to answer questions. It uses OpenAI embeddings and FAISS to enable fast and accurate document search and retrieval.
Sources supported:
- 📄 PDFs (CVs, theses, research papers)
- 💻 GitHub repositories (code)
- 🧠 Chunked embeddings with metadata stored locally
```bash
git clone https://github.com/b-elamine/MyKnowledgeRAG
cd MyKnowledgeRAG
python3 -m venv venv
source venv/bin/activate  # For Windows: venv\Scripts\activate
pip install -r requirements.txt
```

You must create a `.env` file in the project root with the following variables:
```env
OPENAI_API_KEY=your_openai_api_key
GITHUB_USERNAME=your_github_username
GITHUB_TOKEN=your_github_personal_access_token
```

- `OPENAI_API_KEY`: required to generate embeddings.
- `GITHUB_USERNAME` and `GITHUB_TOKEN`: used to authenticate and clone your GitHub repositories automatically.
💡 Tip: You can generate a GitHub token at https://github.com/settings/tokens (give it `repo` access if you're cloning private repos).
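The scripts presumably read these variables at startup. A minimal sketch of such a loader (the `load_config` helper and its error message are illustrative, not part of the repo):

```python
import os

REQUIRED = ("OPENAI_API_KEY",)
OPTIONAL = ("GITHUB_USERNAME", "GITHUB_TOKEN")

def load_config():
    """Collect settings from the environment; fail fast if the API key is missing."""
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    # Optional variables come back as None when unset.
    return {name: os.environ.get(name) for name in REQUIRED + OPTIONAL}
```

If you use `python-dotenv`, calling `load_dotenv()` before this populates `os.environ` from the `.env` file.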
Put all relevant documents inside the `data/pdfs/` folder.
Examples:
- Your updated CV
- Academic thesis or dissertation
- Research papers you've written or use
Accepted format: `.pdf`
Your repositories will be cloned automatically using the GitHub token. You’ll specify the repo URLs or names inside the script or configuration.
They will be stored in `data/github_projects/`.
⚠️ This folder is ignored by Git to prevent uploading private code.
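One way the automatic cloning might work (the function names here are hypothetical; the actual script may differ) is to embed the token in an HTTPS clone URL:

```python
import subprocess

def auth_url(username: str, token: str, repo: str) -> str:
    """Build an HTTPS clone URL that authenticates with a personal access token."""
    return f"https://{username}:{token}@github.com/{username}/{repo}.git"

def clone_repo(username: str, token: str, repo: str, dest: str = "data/github_projects"):
    """Clone one repository into the (gitignored) projects folder."""
    subprocess.run(["git", "clone", auth_url(username, token, repo), f"{dest}/{repo}"], check=True)
```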
```bash
python src/data_processing.py
```

- Loads PDFs and GitHub files
- Extracts and preprocesses text
- Chunks the content into smaller units
- Saves the output to `raw_data.pkl`
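The chunking step can be pictured as a sliding-window splitter along these lines (the chunk size, overlap, and metadata fields are assumptions, not the script's exact values):

```python
def chunk_text(text: str, source: str, size: int = 500, overlap: int = 50):
    """Split text into overlapping chunks, each tagged with its source and offset."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + size]
        if piece:
            chunks.append({"text": piece, "source": source, "offset": start})
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.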
```bash
python src/embedding.py
```

- Loads chunks
- Uses the OpenAI Embeddings API
- Batches embedding requests to avoid token limits
- Saves `embeddings.pkl` with chunks and vectors
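Batching can be as simple as slicing the chunk list before each API call. A sketch using the `openai` v1 client (`client.embeddings.create`); the model name and batch size are assumptions, not necessarily what the script uses:

```python
def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_chunks(chunks, client, model="text-embedding-3-small", batch_size=100):
    """Embed chunk texts in batches to stay under request/token limits."""
    vectors = []
    for batch in batched(chunks, batch_size):
        resp = client.embeddings.create(model=model, input=[c["text"] for c in batch])
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```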
```bash
python src/vector_store.py
```

- Loads `embeddings.pkl`
- Creates the FAISS index
- Saves the index and metadata locally
Example usage in `test.py`:

```bash
python src/test.py
```

You can change the query in the script like this:

```python
query = "What are the main contributions of the thesis?"
```

It will:

- Embed the question
- Search FAISS for the most similar document chunks
- Return the top matches
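The retrieval logic amounts to: embed the query, score it against the stored chunk vectors, return the top k. A numpy sketch of that scoring (brute-force cosine similarity, which is what a flat FAISS index computes far more efficiently at scale):

```python
import numpy as np

def top_k(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose vectors are most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity per chunk
    order = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in order]
```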
```
PersonalRAGBot/
├── data/
│   ├── pdfs/              # Your documents (CV, thesis, papers)
│   ├── github_projects/   # Auto-cloned repos (gitignored)
│   └── vector_store/      # FAISS index + chunk metadata
├── src/                   # All core logic scripts
│   ├── data_processing.py
│   ├── embedding.py
│   ├── vector_store.py
│   └── test.py
├── .env                   # Secrets (NOT tracked by Git)
├── .gitignore
├── requirements.txt
└── README.md
```
- Costs: Embeddings API is not free; use batching and caching to reduce calls.
- Privacy: Everything (documents, GitHub code, embeddings) is stored and processed locally except for the embedding API calls.
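One simple way to cut embedding cost is an on-disk cache keyed by a hash of each chunk's text, so reruns only embed new or changed chunks. The cache format below is an assumption, not something the repo implements:

```python
import hashlib
import os
import pickle

def cached_embed(texts, embed_fn, cache_path="embed_cache.pkl"):
    """Embed only texts not already cached; persist the cache between runs."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)
    keys = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    new = [t for t, k in zip(texts, keys) if k not in cache]
    if new:
        # embed_fn is the batched API call; only uncached texts hit the API.
        for t, v in zip(new, embed_fn(new)):
            cache[hashlib.sha256(t.encode()).hexdigest()] = v
    with open(cache_path, "wb") as f:
        pickle.dump(cache, f)
    return [cache[k] for k in keys]
```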
- Add LLM-based answer generation using retrieved chunks
- Optional web-based chatbot interface
- More advanced PDF structure handling