Learn Anything AI Chatbot lets you query your own documents using a language model. It indexes a folder of files, converts CSV and Excel sheets into a DuckDB database, and performs semantic search with Google Gemini embeddings. An interactive agent built with LangGraph combines retrieval and SQL queries so you can "chat" with your data.
1
+
# Anyfile-Agent
2
+
Anyfile-Agent lets you query your own documents in natural language. It indexes a folder of files, converts CSV and Excel sheets into a DuckDB database, and performs semantic search with Google Gemini embeddings. An interactive LLM agent built with LangGraph combines retrieval and SQL queries so you can "chat" with your data.
4
3
5
4
## Features
6
-
- **Multi-format ingestion** – PDFs, Word docs, PowerPoint, Markdown, HTML, and plain text are split into searchable chunks. Images are processed through OCR so their text is also indexed.
5
+
- **Multi-format ingestion** – Images are processed through OCR so their text is indexed. PDFs, Word docs, PowerPoint, Markdown, HTML, and plain text are split into searchable chunks.
7
6
- **Data summarization** – CSV and Excel files are loaded into DuckDB tables. Summary cards for each table are added to the vector index.
8
7
- **Embeddings & retrieval** – Documents are embedded with `GoogleGenerativeAIEmbeddings` and stored in a FAISS index for fast semantic search.
9
8
- **SQL integration** – The agent can issue DuckDB queries over your uploaded spreadsheets. Only `SELECT` and `PRAGMA` statements are allowed for safety.
10
-
- **Persistent conversations** – The ReAct agent from LangGraph saves its history to SQLite so you can resume chats.
11
-
9
+
- **Persistent conversations** – The agent saves each conversation's history to SQLite under a `thread_id`, so you can resume a chat or switch between chats.
10
+
- **Gradio App** – Run a user-friendly interface for interactive sessions.
11
+
- **Developer Tools & CI** – Linting and formatting with ruff and black, unit tests with pytest, and end-to-end smoke tests in GitHub Actions.
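
The `SELECT`/`PRAGMA` restriction described under SQL integration can be pictured as a small allow-list check that runs before a statement reaches DuckDB. The sketch below is an assumption about how such a guard might look, not the project's actual code; `is_allowed_statement` and `ALLOWED_KEYWORDS` are hypothetical names.

```python
import re

# Assumption: the allow-list mirrors the README ("Only SELECT and PRAGMA").
ALLOWED_KEYWORDS = re.compile(r"(select|pragma)\b", re.IGNORECASE)


def is_allowed_statement(sql: str) -> bool:
    """Return True if `sql` looks like a single SELECT or PRAGMA statement."""
    # Drop line and block comments so they cannot hide the first keyword.
    stripped = re.sub(r"--[^\n]*|/\*.*?\*/", " ", sql, flags=re.DOTALL).strip()
    # Ignore trailing semicolons, then reject any remaining semicolon,
    # which would indicate more than one statement.
    stripped = stripped.rstrip(";").strip()
    if not stripped or ";" in stripped:
        return False
    # The statement must start with one of the allowed keywords.
    return ALLOWED_KEYWORDS.match(stripped) is not None
```

This check is deliberately conservative: it also rejects a semicolon inside a string literal and queries that begin with `WITH`. A real guard would more likely rely on a SQL parser or a read-only database connection.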
12
12
13
13
## Installation
14
14
1. Install system packages needed for OCR (first time only):
@@ -23,19 +23,190 @@ Learn Anything AI Chatbot lets you query your own documents using a language mod
23
23
```
24
24
25
25
## Usage
26
+
### CLI
26
27
1. Place the documents you want to search under the `data/` directory.
27
28
2. Run the agent. The first run may take a while as it loads and indexes the files:
28
29
```bash
29
-
bash scripts/run_agent.sh --ask "What kinds of files have I provided?" --load_data
30
+
bash scripts/run_agent.sh --thread_id 12345 --ask "What kinds of files have I provided?" --load_data
31
+
```
32
+
Additional options are available via:
33
+
```bash
34
+
bash scripts/run_agent.sh --help
30
35
```
31
-
3. For later sessions, omit `--load_data` to reuse the existing FAISS index and DuckDB database. If you have added more documents under `data/`, please load them again using `--load_data` for the first run.
36
+
3. For later sessions, omit `--load_data` to reuse the existing FAISS index and DuckDB database.
37
+
4. If you add more documents under `data/` later, run the agent with `--load_data` again so the new files are indexed.
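
The `--thread_id` flag ties a run to a stored conversation. As a rough illustration of the idea (not LangGraph's actual checkpoint format), the sketch below keeps messages in a SQLite table keyed by thread ID using Python's standard `sqlite3` module. The table layout and the helper names `open_history`, `append_message`, and `load_thread` are hypothetical.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a history database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "thread_id TEXT NOT NULL, "
        "role TEXT NOT NULL, "
        "content TEXT NOT NULL)"
    )
    return conn


def append_message(conn: sqlite3.Connection, thread_id: str, role: str, content: str) -> None:
    """Store one message under the given thread."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (thread_id, role, content) VALUES (?, ?, ?)",
            (thread_id, role, content),
        )


def load_thread(conn: sqlite3.Connection, thread_id: str) -> list:
    """Return the (role, content) pairs of one thread in insertion order."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE thread_id = ? ORDER BY id",
        (thread_id,),
    )
    return rows.fetchall()


# Two independent threads share one database file.
conn = open_history()
append_message(conn, "12345", "user", "What kinds of files have I provided?")
append_message(conn, "67890", "user", "Summarize the spreadsheets.")
print(load_thread(conn, "12345"))
```

Because every row carries its `thread_id`, resuming a chat is just a filtered read, and switching chats means passing a different ID.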
38
+
39
+
### Gradio App
40
+
Run the app locally:
41
+
```bash
42
+
python app.py
43
+
```
44
+
- Visit the printed URL (e.g., `http://127.0.0.1:7860`) to interact with the agent.
32
45
33
46
## Supported File Types
34
47
- Text documents: PDF, DOCX, PPTX, Markdown, HTML, TXT
35
-
- Images: PNG, JPG, JPEG, TIFF (processed via OCR)
48
+
- Images: PNG, JPG, JPEG, TIFF (text processed via OCR)
36
49
- Spreadsheets: CSV, XLSX
50
+
* For best results with XLSX, use a simple tabular layout: one header row, uniform columns, and no merged cells or custom formatting. A workbook can contain multiple sheets.
| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | [email protected] | 2020-08-24 | http://www.stephenson.com/ |
query: SELECT AVG(Age) AS AverageAge, COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount, COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount FROM file_example_xlsx_50__sheet1;
SELECT AVG(Age) AS AverageAge, COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount, COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount FROM file_example_xlsx_50__sheet1;
129
+
================================== Ai Message ==================================
query: SELECT AVG(Age) AS AverageAge, COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount, COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount FROM file_example_xlsx_50__sheet1;
query: SELECT SUM(CustomerCount) FROM (SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 WHERE Country IN ('Solomon Islands', 'Netherlands', 'Sri Lanka', 'United Arab Emirates', 'Belarus') GROUP BY Country)
SELECT SUM(CustomerCount) FROM (SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 WHERE Country IN ('Solomon Islands', 'Netherlands', 'Sri Lanka', 'United Arab Emirates', 'Belarus') GROUP BY Country)
190
+
================================== Ai Message ==================================
query: SELECT SUM(CustomerCount) FROM (SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 WHERE Country IN ('Solomon Islands', 'Netherlands', 'Sri Lanka', 'United Arab Emirates', 'Belarus') GROUP BY Country)
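
The query above counts customers per country and then sums those counts. A single `COUNT(*)` with the same `WHERE` filter should give the same total whenever at least one row matches. The sketch below checks this using Python's built-in `sqlite3` as a stand-in for DuckDB; the rows are invented for illustration and are not taken from the real `customers_100` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_100 (Country TEXT)")
# Invented sample rows; four of the five match the filter below.
conn.executemany(
    "INSERT INTO customers_100 VALUES (?)",
    [("Chile",), ("Netherlands",), ("Belarus",), ("Netherlands",), ("Sri Lanka",)],
)

COUNTRIES = (
    "Solomon Islands",
    "Netherlands",
    "Sri Lanka",
    "United Arab Emirates",
    "Belarus",
)
placeholders = ", ".join("?" for _ in COUNTRIES)

# Shape of the agent's query: per-country counts, then their sum.
nested = (
    "SELECT SUM(CustomerCount) FROM ("
    "SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 "
    f"WHERE Country IN ({placeholders}) GROUP BY Country)"
)
# Flattened form: count the matching rows directly.
flat = f"SELECT COUNT(*) FROM customers_100 WHERE Country IN ({placeholders})"

nested_total = conn.execute(nested, COUNTRIES).fetchone()[0]
flat_total = conn.execute(flat, COUNTRIES).fetchone()[0]
print(nested_total, flat_total)
```

One difference to keep in mind: when no row matches, `SUM` over the empty subquery returns `NULL` while `COUNT(*)` returns `0`.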