1 | | -# available file formats |
2 | | -text files, images (with text through OCR) |
| 1 | +# Learn Anything AI Chatbot |
3 | 2 |
4 | | -# needed for image OCR |
5 | | -sudo apt update |
6 | | -sudo apt install -y tesseract-ocr libtesseract-dev |
| 3 | +Learn Anything AI Chatbot lets you query your own documents using a language model. It indexes a folder of files, converts CSV and Excel sheets into a DuckDB database, and performs semantic search with Google Gemini embeddings. An interactive agent built with LangGraph combines retrieval and SQL queries so you can "chat" with your data. |
7 | 4 |
| 5 | +## Features |
| 6 | +- **Multi-format ingestion** – PDFs, Word docs, PowerPoint, Markdown, HTML, and plain text are split into searchable chunks. Images are processed through OCR so their text is also indexed. |
| 7 | +- **Data summarization** – CSV and Excel files are loaded into DuckDB tables. Summary cards for each table are added to the vector index. |
| 8 | +- **Embeddings & retrieval** – Documents are embedded with `GoogleGenerativeAIEmbeddings` and stored in a FAISS index for fast semantic search. |
| 9 | +- **SQL integration** – The agent can issue DuckDB queries over your uploaded spreadsheets. Only `SELECT` and `PRAGMA` statements are allowed for safety. |
| 10 | +- **Persistent conversations** – The ReAct agent from LangGraph saves its history to SQLite so you can resume chats. |
8 | 11 |
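
The `SELECT`/`PRAGMA` restriction described under SQL integration could be enforced with a small guard like the one below. This is a minimal sketch, not the project's actual implementation: the function name `is_read_only` and the prefix-based check are assumptions made for illustration, and a production guard would more likely rely on a real SQL parser.

```python
import re

# Statement types the README says the agent may run.
ALLOWED_PREFIXES = ("select", "pragma")


def is_read_only(sql: str) -> bool:
    """Return True only for a single SELECT or PRAGMA statement.

    Hypothetical helper: a conservative prefix check, not a SQL parser.
    """
    # Drop surrounding whitespace and any trailing semicolons.
    stripped = sql.strip().rstrip(";").strip()
    # Reject empty input and anything that still contains a semicolon,
    # which blocks stacked statements such as "SELECT 1; DROP TABLE t".
    if not stripped or ";" in stripped:
        return False
    first_word = re.split(r"\s+", stripped, maxsplit=1)[0].lower()
    return first_word in ALLOWED_PREFIXES
```

The check is deliberately strict: it also rejects queries that begin with `WITH` and queries containing a semicolon inside a string literal, trading some convenience for safety.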
9 | | -DuckDB restrictions for agent to only SELECT/PRAGMA queries. No destructive queries. |
10 | 12 |
11 | | -Take time to load first time docs |
| 13 | +## Installation |
| 14 | +1. Install the system packages needed for OCR (first time only; these commands assume a Debian/Ubuntu system with `apt`): |
| 15 | + ```bash |
| 16 | + sudo apt update |
| 17 | + sudo apt install -y tesseract-ocr libtesseract-dev |
| 18 | + ``` |
| 19 | +2. Install the Python package and dependencies: |
| 20 | + ```bash |
| 21 | + pip install -e . |
| 22 | + pip install -r requirements-dev.txt # optional dev tools |
| 23 | + ``` |
| 24 | + |
| 25 | +## Usage |
| 26 | +1. Place the documents you want to search under the `data/` directory. |
| 27 | +2. Run the agent. The first run may take a while as it loads and indexes the files: |
| 28 | + ```bash |
| 29 | + bash scripts/run_agent.sh --ask "What kinds of files have I provided?" --load_data |
| 30 | + ``` |
| 31 | +3. For later sessions, omit `--load_data` to reuse the existing FAISS index and DuckDB database. If you have since added documents under `data/`, pass `--load_data` again on the next run so they are indexed. |
| 32 | + |
| 33 | +## Supported File Types |
| 34 | +- Text documents: PDF, DOCX, PPTX, Markdown, HTML, TXT |
| 35 | +- Images: PNG, JPG, JPEG, TIFF (processed via OCR) |
| 36 | +- Spreadsheets: CSV, XLSX |
| 37 | + |
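
To illustrate how the three groups above might be told apart during ingestion, here is a rough sketch that routes a file by its extension. The `categorize` helper and the exact extension strings (for example `.md` for Markdown and `.html` for HTML) are assumptions for illustration; the project's own loader may work differently.

```python
from pathlib import Path
from typing import Optional

# Extension sets are assumed from the list of supported file types.
FILE_CATEGORIES = {
    "text": {".pdf", ".docx", ".pptx", ".md", ".html", ".txt"},
    "image": {".png", ".jpg", ".jpeg", ".tiff"},
    "spreadsheet": {".csv", ".xlsx"},
}


def categorize(path: str) -> Optional[str]:
    """Return "text", "image", or "spreadsheet", or None if unsupported."""
    # Compare case-insensitively so "SCAN.PNG" matches ".png".
    suffix = Path(path).suffix.lower()
    for category, extensions in FILE_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None
```

For example, `categorize("data/report.pdf")` falls in the text group, while a file with an extension outside the list returns `None` and could be skipped.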
| 38 | +## Testing |
| 39 | +Run formatting checks and unit tests with: |
| 40 | +```bash |
| 41 | +pre-commit run --all-files |
| 42 | +pytest |
| 43 | +``` |
| 44 | + |
| 45 | +## Repository Structure |
| 46 | +- `src/any_chatbot/` – core modules for indexing, tools, and agent |
| 47 | +- `scripts/` – helper script to launch the agent |
| 48 | +- `notebooks/` – example notebooks for experiments |
| 49 | +- `tests/` – unit tests for the indexing and tool utilities |
| 50 | + |
| 51 | +## Requirements |
| 52 | +- Python 3.10+ |
| 53 | +- A Google Gemini API key (`GOOGLE_API_KEY` environment variable) |
| 54 | + |
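
Both requirements can be verified before launching the agent. The following is a small, hypothetical preflight check (the `check_requirements` function is not part of the repository) that reports a too-old interpreter or a missing `GOOGLE_API_KEY`.

```python
import os
import sys


def check_requirements(env=None, version=None):
    """Return a list of human-readable problems; an empty list means OK.

    `env` and `version` default to the real environment and interpreter
    version, and can be overridden to make the check easy to test.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if version < (3, 10):
        problems.append(f"Python 3.10+ is required, found {version[0]}.{version[1]}")
    # An unset or empty key is treated as missing.
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY environment variable is not set")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(f"Requirement not met: {problem}")
```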
| 55 | +## Contributing |
| 56 | +Contributions are welcome! Feel free to open issues or pull requests. |
| 57 | + |
| 58 | +## License |
| 59 | +This project is licensed under the MIT License. |