Commit 5c37f00

committed
updated readme with examples
1 parent 7484e36 commit 5c37f00

4 files changed

+192
-12
lines changed

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ build/
 wandb/

 # datasets
-data/
+# data/
 old-data/

 # outputs

README.md

Lines changed: 188 additions & 11 deletions
@@ -1,14 +1,14 @@
1-
# Learn Anything AI Chatbot
2-
3-
Learn Anything AI Chatbot lets you query your own documents using a language model. It indexes a folder of files, converts CSV and Excel sheets into a DuckDB database, and performs semantic search with Google Gemini embeddings. An interactive agent built with LangGraph combines retrieval and SQL queries so you can "chat" with your data.
1+
# Anyfile-Agent
2+
Anyfile-Agent lets you query your own documents in natural language. It indexes a folder of files, converts CSV and Excel sheets into a DuckDB database, and performs semantic search with Google Gemini embeddings. An interactive LLM agent built with LangGraph combines retrieval and SQL queries so you can "chat" with your data.
43

54
## Features
6-
- **Multi-format ingestion** – PDFs, Word docs, PowerPoint, Markdown, HTML, and plain text are split into searchable chunks. Images are processed through OCR so their text is also indexed.
5+
- **Multi-format ingestion** – Images are processed through OCR so their text is indexed. PDFs, Word docs, PowerPoint, Markdown, HTML, and plain text are split into searchable chunks.
76
- **Data summarization** – CSV and Excel files are loaded into DuckDB tables. Summary cards for each table are added to the vector index.
87
- **Embeddings & retrieval** – Documents are embedded with `GoogleGenerativeAIEmbeddings` and stored in a FAISS index for fast semantic search.
98
- **SQL integration** – The agent can issue DuckDB queries over your uploaded spreadsheets. Only `SELECT` and `PRAGMA` statements are allowed for safety.
10-
- **Persistent conversations** – The ReAct agent from LangGraph saves its history to SQLite so you can resume chats.
11-
9+
- **Persistent conversations** – The agent saves each conversation to SQLite under a `thread_id`, so you can resume a chat or switch between chats.
10+
- **Gradio App** – Run a user-friendly interface for interactive sessions.
11+
- **Developer Tools & CI** – Linting with ruff and black, unit tests with pytest, end‐to‐end smoke tests in GitHub Actions.
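
The "only `SELECT` and `PRAGMA`" rule from the SQL integration feature can be illustrated with a small allow-list check. This is a minimal sketch, not the project's actual implementation: the function name `is_read_only` and the constant `ALLOWED_KEYWORDS` are hypothetical, and the check is deliberately conservative (it rejects any statement containing an inner semicolon, even inside a string literal, and does not accept `WITH ... SELECT`).

```python
import re

# Hypothetical allow-list mirroring the README's rule: SELECT and PRAGMA only.
ALLOWED_KEYWORDS = {"select", "pragma"}


def is_read_only(sql: str) -> bool:
    """Return True if `sql` looks like a single SELECT or PRAGMA statement."""
    # Drop surrounding whitespace and any trailing semicolons.
    body = sql.strip().rstrip(";").strip()
    # Reject empty input and anything with an inner semicolon, which could
    # smuggle in a second statement. This also rejects semicolons inside
    # string literals, a conservative trade-off for a sketch.
    if not body or ";" in body:
        return False
    # The statement must start with an allowed keyword.
    match = re.match(r"[A-Za-z]+", body)
    return match is not None and match.group(0).lower() in ALLOWED_KEYWORDS


print(is_read_only("SELECT COUNT(*) FROM customers_100;"))
print(is_read_only("DROP TABLE customers_100"))
```

A keyword check like this is only a first line of defense; opening the database in read-only mode is a sturdier complement where the database driver supports it.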
1212

1313
## Installation
1414
1. Install system packages needed for OCR (first time only):
@@ -23,19 +23,190 @@ Learn Anything AI Chatbot lets you query your own documents using a language mod
2323
```
2424

2525
## Usage
26+
### CLI
2627
1. Place the documents you want to search under the `data/` directory.
2728
2. Run the agent. The first run may take a while as it loads and indexes the files:
2829
```bash
29-
bash scripts/run_agent.sh --ask "What kinds of files have I provided?" --load_data
30+
bash scripts/run_agent.sh --thread_id 12345 --ask "What kinds of files have I provided?" --load_data
31+
```
32+
Additional options are available via:
33+
```bash
34+
bash scripts/run_agent.sh --help
3035
```
31-
3. For later sessions, omit `--load_data` to reuse the existing FAISS index and DuckDB database. If you have added more documents under `data/`, please load them again using `--load_data` for the first run.
36+
3. For later sessions, omit `--load_data` to reuse the existing FAISS index and DuckDB database.
37+
4. If you add more documents under `data/`, rerun with `--load_data` to index them.
38+
39+
### Gradio App
40+
Run the app locally:
41+
```bash
42+
python app.py
43+
```
44+
- Visit the printed URL (e.g., `http://127.0.0.1:7860`) to interact with the agent.
3245

3346
## Supported File Types
3447
- Text documents: PDF, DOCX, PPTX, Markdown, HTML, TXT
35-
- Images: PNG, JPG, JPEG, TIFF (processed via OCR)
48+
- Images: PNG, JPG, JPEG, TIFF (text processed via OCR)
3649
- Spreadsheets: CSV, XLSX
50+
* For best results with XLSX, use a simple tabular layout—one header row, uniform columns, and no merged cells or custom formatting. You can have multiple sheets.
51+
52+
## Example Results
53+
### Multi‑Step Reasoning with Tool Use
54+
<div style="max-height:400px; overflow-y:auto; border:1px solid #ccc; padding:8px;">
55+
<pre><code class="language-bash">
56+
================================ Human Message =================================
57+
58+
Can you generate some mathematically useful information across these excel files?
59+
================================== Ai Message ==================================
60+
Tool Calls:
61+
sql_db_schema (aea9b712-105d-4253-9fdb-0ef4eebc79b3)
62+
Call ID: aea9b712-105d-4253-9fdb-0ef4eebc79b3
63+
Args:
64+
table_names: file_example_xlsx_50__sheet1
65+
================================= Tool Message =================================
66+
Name: sql_db_schema
67+
68+
69+
CREATE TABLE file_example_xlsx_50__sheet1 (
70+
"0" BIGINT,
71+
"First Name" VARCHAR,
72+
"Last Name" VARCHAR,
73+
"Gender" VARCHAR,
74+
"Country" VARCHAR,
75+
"Age" BIGINT,
76+
"Date" VARCHAR,
77+
"Id" BIGINT
78+
)
79+
80+
/*
81+
3 rows from file_example_xlsx_50__sheet1 table:
82+
0 First Name Last Name Gender Country Age Date Id
83+
1 Dulce Abril Female United States 32 15/10/2017 1562
84+
2 Mara Hashimoto Female Great Britain 25 16/08/2016 1582
85+
3 Philip Gent Male France 36 21/05/2015 2587
86+
*/
87+
================================== Ai Message ==================================
88+
Tool Calls:
89+
sql_db_schema (78cbce3d-fb28-42d6-80bd-eb77d277f7f9)
90+
Call ID: 78cbce3d-fb28-42d6-80bd-eb77d277f7f9
91+
Args:
92+
table_names: customers_100
93+
================================= Tool Message =================================
94+
Name: sql_db_schema
95+
96+
97+
CREATE TABLE customers_100 (
98+
"Index" BIGINT,
99+
"Customer Id" VARCHAR,
100+
"First Name" VARCHAR,
101+
"Last Name" VARCHAR,
102+
"Company" VARCHAR,
103+
"City" VARCHAR,
104+
"Country" VARCHAR,
105+
"Phone 1" VARCHAR,
106+
"Phone 2" VARCHAR,
107+
"Email" VARCHAR,
108+
"Subscription Date" DATE,
109+
"Website" VARCHAR
110+
)
111+
112+
/*
113+
3 rows from customers_100 table:
114+
Index Customer Id First Name Last Name Company City Country Phone 1 Phone 2 Email Subscription Date Website
115+
1 DD37Cf93aecA6Dc Sheryl Baxter Rasmussen Group East Leonard Chile 229.077.5154 397.884.0519x718 [email protected] 2020-08-24 http://www.stephenson.com/
116+
2 1Ef7b82A4CAAD10 Preston Lozano Vega-Gentry East Jimmychester Djibouti 5153435776 686-620-1820x944 [email protected] 2021-04-23 http://www.hobbs.com/
117+
3 6F94879bDAfE5a6 Roy Berry Murillo-Perry Isabelborough Antigua and Barbuda +1-539-402-0259 (496)978-3969x58947 [email protected] 2020-03-25 http://www.lawrence.com/
118+
*/
119+
================================== Ai Message ==================================
120+
Tool Calls:
121+
sql_db_query_checker (9c7c199a-e8d5-45f8-acef-996eb9039681)
122+
Call ID: 9c7c199a-e8d5-45f8-acef-996eb9039681
123+
Args:
124+
query: SELECT AVG(Age) AS AverageAge, COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount, COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount FROM file_example_xlsx_50__sheet1;
125+
================================= Tool Message =================================
126+
Name: sql_db_query_checker
37127

38-
## Testing
128+
SELECT AVG(Age) AS AverageAge, COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount, COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount FROM file_example_xlsx_50__sheet1;
129+
================================== Ai Message ==================================
130+
Tool Calls:
131+
sql_db_query (e1413327-1faf-41f1-bb90-8ea283bd52d8)
132+
Call ID: e1413327-1faf-41f1-bb90-8ea283bd52d8
133+
Args:
134+
query: SELECT AVG(Age) AS AverageAge, COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount, COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount FROM file_example_xlsx_50__sheet1;
135+
================================= Tool Message =================================
136+
Name: sql_db_query
137+
138+
[(33.26, 12, 38)]
139+
================================== Ai Message ==================================
140+
Tool Calls:
141+
sql_db_query_checker (d6a472b0-24e7-4103-bf5e-4b7453a6c70a)
142+
Call ID: d6a472b0-24e7-4103-bf5e-4b7453a6c70a
143+
Args:
144+
query: SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 GROUP BY Country ORDER BY CustomerCount DESC LIMIT 5;
145+
================================= Tool Message =================================
146+
Name: sql_db_query_checker
147+
148+
SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 GROUP BY Country ORDER BY CustomerCount DESC LIMIT 5;
149+
================================== Ai Message ==================================
150+
Tool Calls:
151+
sql_db_query (5a4d951c-4fce-4cc8-98ad-764d91ab6b78)
152+
Call ID: 5a4d951c-4fce-4cc8-98ad-764d91ab6b78
153+
Args:
154+
query: SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 GROUP BY Country ORDER BY CustomerCount DESC LIMIT 5;
155+
================================= Tool Message =================================
156+
Name: sql_db_query
157+
158+
[('Solomon Islands', 4), ('Netherlands', 2), ('Sri Lanka', 2), ('United Arab Emirates', 2), ('Belarus', 2)]
159+
================================== Ai Message ==================================
160+
161+
From the `file_example_xlsx_50__sheet1` table:
162+
* The average age is 33.26 years.
163+
* There are 12 males and 38 females.
164+
165+
From the `customers_100` table, the top 5 countries by customer count are:
166+
* Solomon Islands: 4 customers
167+
* Netherlands: 2 customers
168+
* Sri Lanka: 2 customers
169+
* United Arab Emirates: 2 customers
170+
* Belarus: 2 customers
171+
</code></pre>
172+
</div>
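
The first query in the transcript above uses conditional aggregation: `COUNT(CASE WHEN ... THEN 1 END)` counts only the rows that match, because `COUNT` ignores `NULL`. The sketch below reproduces the pattern on the three sample rows shown in the schema output. It makes two assumptions for the sake of a self-contained example: it uses Python's built-in `sqlite3` module instead of DuckDB, and the table name `people` is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the project's DuckDB database.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE people ("First Name" TEXT, "Gender" TEXT, "Age" INTEGER)')
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Dulce", "Female", 32), ("Mara", "Female", 25), ("Philip", "Male", 36)],
)

# Same shape as the agent's query: one pass over the table yields the
# average age plus a per-gender count.
query = """
    SELECT AVG(Age) AS AverageAge,
           COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount,
           COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount
    FROM people
"""
result = conn.execute(query).fetchall()
print(result)  # [(31.0, 1, 2)]
```

With only three rows the numbers differ from the transcript's `(33.26, 12, 38)`, which was computed over the full 50-row sheet.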
173+
174+
### Persistent Memory and Database
175+
<div style="max-height:400px; overflow-y:auto; border:1px solid #ccc; padding:8px;">
176+
<pre><code class="language-bash">
177+
================================ Human Message =================================
178+
179+
Could you please add the customer counts for these 5 countries?
180+
================================== Ai Message ==================================
181+
Tool Calls:
182+
sql_db_query_checker (7a0f341a-b860-4398-997e-3352f77c081e)
183+
Call ID: 7a0f341a-b860-4398-997e-3352f77c081e
184+
Args:
185+
query: SELECT SUM(CustomerCount) FROM (SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 WHERE Country IN ('Solomon Islands', 'Netherlands', 'Sri Lanka', 'United Arab Emirates', 'Belarus') GROUP BY Country)
186+
================================= Tool Message =================================
187+
Name: sql_db_query_checker
188+
189+
SELECT SUM(CustomerCount) FROM (SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 WHERE Country IN ('Solomon Islands', 'Netherlands', 'Sri Lanka', 'United Arab Emirates', 'Belarus') GROUP BY Country)
190+
================================== Ai Message ==================================
191+
Tool Calls:
192+
sql_db_query (6fe70492-8dc5-42c7-92f2-3de22caf15ab)
193+
Call ID: 6fe70492-8dc5-42c7-92f2-3de22caf15ab
194+
Args:
195+
query: SELECT SUM(CustomerCount) FROM (SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 WHERE Country IN ('Solomon Islands', 'Netherlands', 'Sri Lanka', 'United Arab Emirates', 'Belarus') GROUP BY Country)
196+
================================= Tool Message =================================
197+
Name: sql_db_query
198+
199+
[(12,)]
200+
================================== Ai Message ==================================
201+
202+
The total customer count for Solomon Islands, Netherlands, Sri Lanka, United Arab Emirates, and Belarus is 12.
203+
</code></pre>
204+
</div>
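
The follow-up question above works because the earlier turns are stored under the same `thread_id`. The sketch below shows the general idea with Python's built-in `sqlite3` module. It is not LangGraph's checkpointer and does not reflect the project's real schema: the `messages` table and the helpers `open_store`, `append_message`, and `load_history` are all hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a message store; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "thread_id TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL)"
    )
    return conn


def append_message(conn: sqlite3.Connection, thread_id: str, role: str, content: str) -> None:
    """Append one message to the given thread."""
    conn.execute(
        "INSERT INTO messages (thread_id, role, content) VALUES (?, ?, ?)",
        (thread_id, role, content),
    )
    conn.commit()


def load_history(conn: sqlite3.Connection, thread_id: str) -> list:
    """Return (role, content) pairs for one thread, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE thread_id = ? ORDER BY id",
        (thread_id,),
    )
    return rows.fetchall()


conn = open_store()
append_message(conn, "12345", "human", "Which 5 countries have the most customers?")
append_message(conn, "12345", "ai", "Solomon Islands leads with 4 customers.")
append_message(conn, "67890", "human", "An unrelated chat")

# Resuming thread 12345 returns only that thread's messages.
history = load_history(conn, "12345")
print(history)
```

Keying every row by `thread_id` is what lets one database file hold several independent chats.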
205+
206+
### User-Friendly Gradio Interface
207+
![Gradio Interface Example](assets/gradio.png)
208+
209+
## Code Hygiene Tools & Testing
39210
Run formatting checks and unit tests with:
40211
```bash
41212
pre-commit run --all-files
@@ -44,9 +215,11 @@ pytest
44215

45216
## Repository Structure
46217
- `src/any_chatbot/` – core modules for indexing, tools, and agent
218+
- `data/` – directory where you place the files to index when using the CLI
47219
- `scripts/` – helper script to launch the agent
48220
- `notebooks/` – example notebooks for experiments
49221
- `tests/` – unit tests for the indexing and tool utilities
50223

51224
## Requirements
52225
- Python 3.10+
@@ -55,5 +228,9 @@ pytest
55228
## Contributing
56229
Contributions are welcome! Feel free to open issues or pull requests.
57230

231+
## Acknowledgements
232+
- [LangChain Docs](https://python.langchain.com/docs/introduction/)
233+
- [LVMH Annual Report](https://www.lvmh.com/en/investors)
234+
58235
## License
59-
This project is licensed under the MIT License.
236+
This project is licensed under the [MIT License](LICENSE).

assets/gradio.png

134 KB

data/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# data/.gitignore
+*
+!.gitignore
