
Commit c0b172a

Commit message: "uploade all" (0 parents: initial commit)

29 files changed: +794 −0 lines

.github/workflows/main.yml

Lines changed: 38 additions & 0 deletions

```yaml
name: Docker Image CI-CD

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Cache Docker layers
        uses: actions/cache@v3
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-

      - name: Build Docker image
        run: docker build -t auto-doc-thinker .

      - name: Test the application (run tests inside the container)
        run: docker run --rm auto-doc-thinker pytest tests/
```
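The final step runs `pytest tests/` inside the freshly built image. The contents of `tests/test_app.py` are not visible in this commit view, so purely as an illustration, a smoke test of the shape that step expects might look like this (all names here are hypothetical, not the repo's actual test code):

```python
# Hypothetical smoke test in the shape the CI's `pytest tests/` step expects.
# tests/test_app.py is not shown in this diff; every name here is illustrative.

def chunk_settings() -> dict:
    """Mirror the chunking constants committed in config.py."""
    return {"CHUNK_SIZE": 500, "CHUNK_OVERLAP": 100}

def test_overlap_smaller_than_chunk():
    # A chunk overlap equal to or larger than the chunk size would loop forever
    # or duplicate entire chunks, so sanity-check the relationship.
    cfg = chunk_settings()
    assert 0 < cfg["CHUNK_OVERLAP"] < cfg["CHUNK_SIZE"]
```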

Dockerfile

Lines changed: 17 additions & 0 deletions

```dockerfile
# Use an official Python runtime as a parent image
FROM python:3.11-slim

# Set the working directory
WORKDIR /main

# Copy the current directory contents into the container
COPY . /main

# Install the dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Expose the Streamlit default port
EXPOSE 8501

# Run the Streamlit app
CMD ["streamlit", "run", "main.py", "--server.port=8501", "--server.address=0.0.0.0"]
```

README.md

Lines changed: 221 additions & 0 deletions
# 🧠 AutoDocThinker: Intelligent Search Engine with Reasoning + Tool Usage Logic

## 📌 Overview

This project is an **AI-powered Question Answering Web App** that lets users upload documents (PDF, DOCX, HTML) or provide URLs and ask questions about the content. The system uses a **Retrieval-Augmented Generation (RAG)** architecture combined with **LangGraph** for structured reasoning, the **Gemini LLM** for responses, and **ChromaDB** for vector storage.

✅ File & URL ingestion
✅ Chunking with Recursive Splitter
✅ SentenceTransformer Embeddings
✅ ChromaDB Vector Store
✅ LangGraph Reasoning Pipeline
✅ DuckDuckGo External Tool
✅ Streamlit Interface
✅ Logger Integration
✅ Short-term Memory
✅ 🐳 Docker + CI/CD ready

---

## 🧱 Project Structure

```
AutoDocThinker/
├── main.py                  # Main Streamlit UI
├── data/
│   └── Invoice.pdf          # Sample document
├── ingestion/
│   └── loader.py            # Handles document & URL ingestion
├── processing/
│   ├── chunker.py           # Chunk documents
│   └── embeddings.py        # Load HuggingFace embeddings
├── vectorstore/
│   └── chroma_store.py      # ChromaDB initialization
├── reasoning/
│   └── langgraph_chain.py   # LangGraph-based RAG pipeline
├── utils/
│   └── memory.py            # Conversation memory
├── tests/
│   └── test_app.py          # Test suite
├── logger.py                # Logger configuration
├── config.py                # Configuration for secrets and constants
├── setup.py                 # Package setup
├── .gitignore               # Git ignore file
├── Dockerfile               # Docker image
├── .github/
│   └── workflows/
│       └── main.yml         # GitHub Actions CI/CD pipeline
├── requirements.txt         # Python dependencies
├── README.md                # Project documentation
```

---
## ⚙️ Features

| Feature | Description |
| ----------------------------- | ------------------------------------------------ |
| 📄 Multi-format ingestion | Supports PDF, DOCX, HTML, and URLs |
| 🔄 Modular design | Organized into reusable components |
| 🧩 LangGraph planner-executor | Custom planner & executor with LLM |
| 🧠 Memory | Maintains short-term conversational context |
| 🌐 DuckDuckGo tool | External real-time info search |
| 📦 ChromaDB | Embedded document storage & retrieval |
| 🖼️ Streamlit UI | Elegant and interactive web interface |
| 🐳 Docker | Containerized for easy deployment |
| ✅ CI/CD | GitHub Actions pipeline for linting and testing |

---

## 📥 Installation

```bash
# 1. Clone the repository
git clone https://github.com/Md-Emon-Hasan/AutoDocThinker.git
cd AutoDocThinker

# 2. Install dependencies
pip install -r requirements.txt
```

Or with Docker:

```bash
# Build the Docker image
docker build -t auto-doc-thinker .

# Run the container
docker run -p 8501:8501 auto-doc-thinker
```

---

## 🔑 Configuration

Edit `config.py` (note that the committed `config.py` reads the key from the `GOOGLE_API_KEY` environment variable rather than hardcoding it):

```python
GOOGLE_API_KEY = "your_google_gemini_api_key"
CHROMA_DB_DIR = "./chroma_db"
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
TOP_K = 5
```

---

## 🧠 How It Works

1. **Ingestion**

   * Accepts a file upload or a URL.
   * Loads content via the appropriate loader.

2. **Chunking**

   * Breaks documents apart using recursive splitting.

3. **Embeddings + Vector Store**

   * Converts chunks into embeddings via SentenceTransformers.
   * Stores them in ChromaDB.

4. **LangGraph Reasoning**

   * Uses a `planner → executor` structure.
   * The planner routes the user query to the executor.
   * The executor uses the retriever to fetch documents, then the Gemini LLM to generate a response.

5. **External Tools**

   * If RAG fails or needs additional info, the system falls back to the DuckDuckGo tool.

6. **Conversation Memory**

   * Short-term memory provides context in multi-turn dialogue.

---
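The chunk → embed → retrieve loop described above can be sketched without any framework. A toy illustration (this is not the repo's `chunker.py` or `chroma_store.py` code; it substitutes fixed-size overlapping chunks for recursive splitting and bag-of-words cosine similarity for SentenceTransformer embeddings):

```python
from collections import Counter
from math import sqrt

def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size chunks with overlap (toy stand-in for recursive splitting)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk: str) -> Counter:
    """Bag-of-words 'embedding' (toy stand-in for SentenceTransformers)."""
    return Counter(chunk.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query (stand-in for ChromaDB retrieval)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

doc = "invoices are billed monthly. payment terms are net 30. contact support for refunds."
chunks = chunk_text(doc, size=30, overlap=8)
top = retrieve("what are the payment terms", chunks, top_k=1)
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence split across a boundary still appears whole in at least one chunk, which is the same motivation behind `CHUNK_OVERLAP` in `config.py`.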

## 🐳 Docker Setup

**Dockerfile:**

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY . .

RUN pip install --upgrade pip && \
    pip install -r requirements.txt

EXPOSE 8501

CMD ["streamlit", "run", "main.py", "--server.port=8501", "--server.address=0.0.0.0"]
```

---

## 🔁 GitHub Actions CI/CD

**.github/workflows/main.yml**

```yaml
name: CI

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build-and-test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Lint with flake8
        run: |
          pip install flake8
          flake8 .
```

---

## 📝 Future Enhancements

* ✅ Multilingual document ingestion
* ✅ Audio document ingestion + Whisper
* ⏳ Long-term memory + history viewer
* ⏳ MongoDB/FAISS alternative to Chroma
* ✅ More tools (WolframAlpha, SerpAPI)
* ⏳ Model selection dropdown (Gemini, LLaMA, GPT-4)

---

## 👨‍💻 Author

**Md Emon Hasan**
📧 [iconicemon01@gmail.com](mailto:iconicemon01@gmail.com)
🔗 [LinkedIn](https://www.linkedin.com/in/md-emon-hasan)
🔗 [GitHub](https://github.com/Md-Emon-Hasan)
🔗 [Facebook](https://www.facebook.com/mdemon.hasan2001/)
🔗 [WhatsApp](https://wa.me/8801834363533)

---

## 📄 License

MIT License — Free to use, share, and contribute.

---

Binary files added (contents not shown):

- __pycache__/config.cpython-311.pyc (377 Bytes)
- __pycache__/logger.cpython-311.pyc (1.37 KB)
- app.png (84.9 KB)

app.py

Lines changed: 46 additions & 0 deletions

```python
from ingestion.loader import load_documents
from processing.chunker import chunk_documents
from processing.embeddings import get_embedding_model
from vectorstore.chroma_store import build_vector_store, get_retriever
from reasoning.langgraph_chain import create_langgraph
from reasoning.tools import get_tools
from utils.memory import get_memory
from config import GOOGLE_API_KEY
from logger import setup_logger

import sys
sys.stdout.reconfigure(encoding='utf-8')

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import initialize_agent
from langchain.agents.agent_types import AgentType

logger = setup_logger(__name__)

def main():
    logger.info("Starting Document Search AI System")
    # Use the key imported from config rather than an empty string
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", google_api_key=GOOGLE_API_KEY)

    docs = load_documents("data/Invoice.pdf")
    chunks = chunk_documents(docs)
    embeddings = get_embedding_model()
    vectordb = build_vector_store(chunks, embeddings)
    retriever = get_retriever(vectordb)

    buffer_memory, _ = get_memory(llm)
    tools = get_tools()

    agent_executor = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, memory=buffer_memory)
    graph = create_langgraph(llm)

    question = "How to get jobs in Generative AI?"
    state = {"question": question, "retriever": retriever}

    logger.info("Running reasoning chain via LangGraph")
    result = graph.invoke(state)
    print("\n📌 Answer:", result["answer"])

    print("\n🌐 Tool Result:", agent_executor.run("Find latest GenAI job postings."))

if __name__ == "__main__":
    main()
```
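`app.py` drives the pipeline through `graph.invoke(state)`, passing a state dict in and reading `result["answer"]` out. The planner → executor flow can be mimicked with plain functions; this is a toy stand-in for `reasoning/langgraph_chain.py` (all names here are illustrative, not the actual LangGraph code):

```python
from typing import Callable

def planner(state: dict) -> str:
    """Decide which node handles the question (trivial routing in this toy)."""
    return "executor"

def executor(state: dict) -> dict:
    """Fetch context via the retriever in the state, then 'generate' an answer."""
    context = state["retriever"](state["question"])
    state["answer"] = f"Based on {len(context)} chunk(s): {context[0]}"
    return state

NODES: dict[str, Callable[[dict], dict]] = {"executor": executor}

def invoke(state: dict) -> dict:
    """Minimal planner -> executor dispatch, mirroring graph.invoke(state) in app.py."""
    return NODES[planner(state)](state)

# Toy retriever: returns canned chunks for any question.
fake_retriever = lambda q: ["Invoices are billed monthly."]
result = invoke({"question": "When are invoices billed?", "retriever": fake_retriever})
```

Keeping everything in one mutable state dict is what makes the real graph composable: each node reads the keys it needs (`question`, `retriever`) and writes new ones (`answer`).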

config.py

Lines changed: 7 additions & 0 deletions

```python
import os

PERSIST_DIR = "./chroma_db"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
```

Binary files added (contents not shown):

- data/Invoice.pdf (98.7 KB)
- (unnamed binary file, 2.02 KB)
