Commit f7c74e0

committed
updated readme
1 parent c1d8e4c commit f7c74e0

2 files changed: +56 -7 lines changed

.env.example

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
GOOGLE_API_KEY=YOUR-GOOGLE-API-KEY
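The placeholder above is meant to be replaced with a real key before the agent runs. As a rough sketch of how a script could pick the value up using only the standard library: the helper names `parse_env_text`, `load_env_file`, and `require_google_api_key` are hypothetical and not taken from this repository, which may well rely on a library such as python-dotenv instead.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> None:
    """Copy values from the file into os.environ without overriding existing ones."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for key, value in parse_env_text(env_path.read_text()).items():
        os.environ.setdefault(key, value)


def require_google_api_key() -> str:
    """Return the key, or fail early if it is missing or still the placeholder."""
    key = os.environ.get("GOOGLE_API_KEY", "")
    if not key or key == "YOUR-GOOGLE-API-KEY":
        raise RuntimeError("Set GOOGLE_API_KEY in your environment or .env file.")
    return key
```

Failing early on the unchanged placeholder avoids a less obvious authentication error later, when the first embedding request is made.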

README.md

Lines changed: 55 additions & 7 deletions
@@ -1,11 +1,59 @@
1-
# available file formats
2-
text files, images (with text through OCR)
1+
# Learn Anything AI Chatbot
32

4-
# needed for image OCR
5-
sudo apt update
6-
sudo apt install -y tesseract-ocr libtesseract-dev
3+
Learn Anything AI Chatbot lets you query your own documents using a language model. It indexes a folder of files, converts CSV and Excel sheets into a DuckDB database, and performs semantic search with Google Gemini embeddings. An interactive agent built with LangGraph combines retrieval and SQL queries so you can "chat" with your data.
74

5+
## Features
6+
- **Multi-format ingestion** – PDFs, Word docs, PowerPoint, Markdown, HTML, and plain text are split into searchable chunks. Images are processed through OCR so their text is also indexed.
7+
- **Data summarization** – CSV and Excel files are loaded into DuckDB tables. Summary cards for each table are added to the vector index.
8+
- **Embeddings & retrieval** – Documents are embedded with `GoogleGenerativeAIEmbeddings` and stored in a FAISS index for fast semantic search.
9+
- **SQL integration** – The agent can issue DuckDB queries over your uploaded spreadsheets. Only `SELECT` and `PRAGMA` statements are allowed for safety.
10+
- **Persistent conversations** – The ReAct agent from LangGraph saves its history to SQLite so you can resume chats.
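The `SELECT`/`PRAGMA` restriction mentioned in the SQL integration bullet can be illustrated with a small guard function. This is only a sketch of the idea, not the check used in this repository: the name `is_read_only_sql` is hypothetical, and the rule is deliberately conservative (it also rejects multi-statement input and queries that start with `WITH`).

```python
import re

# Statement types the guard accepts, compared case-insensitively.
ALLOWED_PREFIXES = ("select", "pragma")


def is_read_only_sql(query: str) -> bool:
    """Return True only for a single statement starting with SELECT or PRAGMA."""
    # Drop surrounding whitespace and one trailing semicolon.
    statement = query.strip()
    if statement.endswith(";"):
        statement = statement[:-1].rstrip()
    # Reject empty input and anything that still contains ';',
    # which would indicate more than one statement.
    if not statement or ";" in statement:
        return False
    first_word = re.match(r"[A-Za-z]+", statement)
    return bool(first_word) and first_word.group(0).lower() in ALLOWED_PREFIXES
```

A keyword check like this is a first line of defense only; opening the database in read-only mode, where the deployment allows it, would be a stronger guarantee.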
811

9-
DuckDB restrictions for agent to only SELECT/PRAGMA queries. No destructive queries.
1012

11-
Take time to load first time docs
13+
## Installation
14+
1. Install system packages needed for OCR (first time only):
15+
```bash
sudo apt update
sudo apt install -y tesseract-ocr libtesseract-dev
```
19+
2. Install the Python package and dependencies:
20+
```bash
pip install -e .
pip install -r requirements-dev.txt # optional dev tools
```
24+
25+
## Usage
26+
1. Place the documents you want to search under the `data/` directory.
27+
2. Run the agent. The first run may take a while as it loads and indexes the files:
28+
```bash
bash scripts/run_agent.sh --ask "What kinds of files have I provided?" --load_data
```
31+
3. For later sessions, omit `--load_data` to reuse the existing FAISS index and DuckDB database. If you add documents under `data/` afterwards, pass `--load_data` again on the next run to index them.
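To make the role of the two flags concrete, here is a hedged sketch of how a Python entry point might parse them. Only the flag names `--ask` and `--load_data` come from the usage steps above; `build_parser` and `describe_run` are hypothetical, and how `scripts/run_agent.sh` actually forwards its arguments is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags shown in the usage steps."""
    parser = argparse.ArgumentParser(
        description="Chat with indexed documents (illustrative only)."
    )
    parser.add_argument("--ask", help="Question to send to the agent.")
    parser.add_argument(
        "--load_data",
        action="store_true",  # False unless the flag is present
        help="Re-index the files under data/ before answering.",
    )
    return parser


def describe_run(args: argparse.Namespace) -> str:
    """Summarize what a run with these arguments would do."""
    if args.load_data:
        return f"rebuild index, then answer: {args.ask}"
    return f"reuse saved index, then answer: {args.ask}"
```

Because `--load_data` is a `store_true` flag, leaving it out is what selects the faster path that reuses the saved index.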
32+
33+
## Supported File Types
34+
- Text documents: PDF, DOCX, PPTX, Markdown, HTML, TXT
35+
- Images: PNG, JPG, JPEG, TIFF (processed via OCR)
36+
- Spreadsheets: CSV, XLSX
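The three groups above suggest a simple routing step by file extension. The following is a minimal sketch under stated assumptions: `classify_file` and `EXTENSION_GROUPS` are hypothetical names, and the exact extensions (for example `.md` for Markdown and `.html` for HTML) are guesses based on the list rather than the project's real loader configuration.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from the file types listed above to a processing group.
EXTENSION_GROUPS = {
    "text": {".pdf", ".docx", ".pptx", ".md", ".html", ".txt"},
    "image": {".png", ".jpg", ".jpeg", ".tiff"},
    "spreadsheet": {".csv", ".xlsx"},
}


def classify_file(path: str) -> Optional[str]:
    """Return the group for a file, or None if the extension is unsupported."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    for group, extensions in EXTENSION_GROUPS.items():
        if suffix in extensions:
            return group
    return None
```

Returning `None` for unknown extensions lets the caller decide whether to skip the file silently or report it.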
37+
38+
## Testing
39+
Run formatting checks and unit tests with:
40+
```bash
pre-commit run --all-files
pytest
```
44+
45+
## Repository Structure
46+
- `src/any_chatbot/` – core modules for indexing, tools, and agent
47+
- `scripts/` – helper script to launch the agent
48+
- `notebooks/` – example notebooks for experiments
49+
- `tests/` – unit tests for the indexing and tool utilities
50+
51+
## Requirements
52+
- Python 3.10+
53+
- A Google Gemini API key (`GOOGLE_API_KEY` environment variable)
54+
55+
## Contributing
56+
Contributions are welcome! Feel free to open issues or pull requests.
57+
58+
## License
59+
This project is licensed under the MIT License.
