Commit 66a337a

added annotations and docstrings
1 parent 936aca9 commit 66a337a

12 files changed: +94 -13 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ __pycache__/
 build/
 .gradio
 *.tar.gz
+.benchmarks

 # secrets
 .env

ETHICS.md

Lines changed: 2 additions & 2 deletions
@@ -2,9 +2,9 @@

 Anyfile-Agent helps users explore their local documents with the assistance of an external language model. I designed the software with the following principles:

-- **User Control and Privacy** – Files remain on the local machine. Processing uses opensource libraries and the configured language model API. No uploaded content is sent elsewhere by the application.
+- **User Control and Privacy** – Files remain on the local machine. Processing uses open-source libraries and the configured language model API. No uploaded content is sent elsewhere by the application.
 - **Transparency** – Indexing creates temporary representations of the documents (e.g., embeddings, OCR text) so the agent can search them. These artifacts are stored locally and users may delete them at any time.
-- **Responsible Use** – The agent can generate or execute SQL queries over the user’s data. Only readonly commands are permitted, but users should review outputs before acting on them. Do not rely on the agent for legal, medical, or safetycritical decisions.
+- **Responsible Use** – The agent can generate or execute SQL queries over the user’s data. Only read-only commands are permitted, but users should review outputs before acting on them. Do not rely on the agent for legal, medical, or safety-critical decisions.
 - **Bias and Limitations** – Responses may reflect biases of the underlying language model or the provided data. Users should validate critical information from original sources.
 - **Open Development** – The project is MIT licensed so that others may inspect, modify, and improve the code. Contributions must follow these ethical guidelines.

MODEL_CARD.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ Anyfile-Agent is a retrieval-based assistant that helps users search and analyze
 ## Intended Use
 - **Primary uses**: Searching personal documents, extracting structured summaries, and answering questions via natural language.
 - **Users**: Individuals or teams who want a local assistant for their files. Requires a valid Google Gemini API key.
-- **Out-of-scope uses**: Do not use the agent for generating legal, medical, or safetycritical advice. It should not be used to process data that violates privacy regulations or thirdparty terms of service.
+- **Out-of-scope uses**: Do not use the agent for generating legal, medical, or safety-critical advice. It should not be used to process data that violates privacy regulations or third-party terms of service.

 ## Data and Training
 Anyfile-Agent does not train a new model. It indexes user-provided documents locally and sends text chunks to a Google Gemini model for embedding and chat responses. The quality of answers depends on that service and the content of the uploaded data.

README.md

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 # Anyfile-Agent
-Anyfile-Agent lets you query your own documents using natural language. It indexes a folder of files, converts CSV and Excel sheets into a DuckDB database, and performs semantic search via vector retrieval. Built with LangChain/LangGraph, this interactive LLM agent combines RAGbased retrieval and SQL querying so you can “chat” with your data.
+Anyfile-Agent lets you query your own documents using natural language. It indexes a folder of files, converts CSV and Excel sheets into a DuckDB database, and performs semantic search via vector retrieval. Built with LangChain/LangGraph, this interactive LLM agent combines RAG-based retrieval and SQL querying so you can “chat” with your data.

 ## Features
 - **Multi-format ingestion** – Images are processed through OCR so their text is indexed. PDFs, Word docs, PowerPoint, Markdown, HTML, and plain text are split into searchable chunks.
@@ -50,7 +50,7 @@ python app.py
 * For best results with XLSX, use a simple tabular layout—one header row, uniform columns, and no merged cells or custom formatting. You can have multiple sheets.

 ## Example Results
-### MultiStep Reasoning with Tool Use
+### Multi-Step Reasoning with Tool Use
 <div style="max-height:400px; overflow-y:auto; border:1px solid #ccc; padding:8px;">
 <pre><code class="language-bash">
 ================================ Human Message =================================

app.py

Lines changed: 25 additions & 0 deletions
@@ -1,3 +1,5 @@
+"""Gradio-based web app for Anyfile-Agent: upload documents, index them, and chat interactively."""
+
 import atexit
 import asyncio
 import gc
@@ -25,7 +27,10 @@


 class Session:
+    """Session state for file uploads, indexing, and chat history."""
+
     def __init__(self):
+        """Initialize session IDs, paths, and database connections."""
         self.sid = uuid.uuid4().hex
         self.db_path = TMP_DIR / "csv_excel_to_db.duckdb"
         self.index_path = TMP_DIR / "faiss_index"
@@ -35,6 +40,7 @@ def __init__(self):
         self.sql_engines: List = []

     def cleanup(self):
+        """Dispose SQL engines, close history DB, reset agent, and remove tmp directory."""
         # dispose any SQLAlchemy/SQL-toolkit engines
         for eng in self.sql_engines:
             try:
@@ -60,10 +66,12 @@ def cleanup(self):
 # shutdown hook that is called when session ends
 @atexit.register
 def _purge_all():
+    """Cleanup session data on program exit."""
     sess.cleanup()


 def _safe_copy(src: Path, dst_dir: Path):
+    """Copy a file to dst_dir, avoiding name collisions by appending a random suffix."""
     dst = dst_dir / src.name
     if dst.exists():
         dst = dst.with_name(f"{dst.stem}_{uuid.uuid4().hex[:4]}{dst.suffix}")
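Review note: the hunk above shows only the renaming rule of `_safe_copy`; the rest of the function is outside the diff context. The sketch below is one way the whole helper could look. The name `safe_copy`, the `shutil.copy2` call, and the returned path are assumptions, not the author's code.

```python
import shutil
import uuid
from pathlib import Path


def safe_copy(src: Path, dst_dir: Path) -> Path:
    """Copy src into dst_dir; on a name clash, append a short random suffix."""
    dst = dst_dir / src.name
    if dst.exists():
        # Same renaming rule as in the diff: stem + "_" + 4 hex chars + suffix.
        dst = dst.with_name(f"{dst.stem}_{uuid.uuid4().hex[:4]}{dst.suffix}")
    # Assumption: the lines hidden by the diff perform the actual copy.
    shutil.copy2(src, dst)
    return dst
```

A four-character hex suffix gives 65,536 possible names, so a second collision is unlikely but not impossible; looping until `dst` does not exist would close that gap.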
@@ -72,6 +80,14 @@ def _safe_copy(src: Path, dst_dir: Path):

 # upload & sync
 def cb_upload_and_sync(files: List[gr.File]) -> Generator[Tuple[str, list], None, None]:
+    """Handle uploaded files: copy to temp, index docs, build agent, and yield status updates.
+
+    Args:
+        files: List of uploaded files from the Gradio interface.
+
+    Yields:
+        Tuples of (status_message, chat_history).
+    """
     # GUARDRAIL FOR EMPTY FILES
     if not files:
         yield "⚠️ No files selected.", []
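Review note: the new docstring describes a generator that yields `(status_message, chat_history)` tuples. A minimal sketch of that pattern follows, with the empty-input guard ending the generator early. The function name and the status strings are hypothetical, and the real callback does actual copying and indexing between yields.

```python
from typing import Generator, List, Tuple


def upload_and_sync_sketch(files: List[str]) -> Generator[Tuple[str, list], None, None]:
    """Yield (status_message, chat_history) tuples while work progresses."""
    if not files:
        # Guard: report the problem once, then stop the generator.
        yield "No files selected.", []
        return
    yield f"Copying {len(files)} file(s)...", []
    yield "Indexing...", []
    yield "Ready.", []
```

Each yielded tuple can be pushed to the UI as it arrives, so the user sees progress instead of waiting for a single final result.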
@@ -132,6 +148,15 @@ def cb_upload_and_sync(files: List[gr.File]) -> Generator[Tuple[str, list], None

 # chat
 def cb_chat(hist: List[dict], msg: str) -> Tuple[List[dict], str]:
+    """Handle user messages: stream agent response and update conversation history.
+
+    Args:
+        hist: Conversation history as a list of role/content dicts.
+        msg: New user message.
+
+    Returns:
+        A tuple of (updated_history, clear_input_str).
+    """
     if sess.agent is None:
         hist.append(
             {

src/any_chatbot/agent.py

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,5 @@
+"""CLI entry-point for running Anyfile-Agent in streaming mode."""
+
 import argparse
 import sqlite3
 import logging
@@ -18,6 +20,7 @@


 def parse_args() -> argparse.Namespace:
+    """Parse command-line options for the agent."""
     p = argparse.ArgumentParser()

     p.add_argument(
@@ -61,6 +64,7 @@ def parse_args() -> argparse.Namespace:


 def main() -> None:
+    """Entry-point invoked by `python agent.py`."""
     logging.basicConfig(level=logging.INFO)
     cfg = parse_args()
     load_environ_vars()

src/any_chatbot/indexing.py

Lines changed: 11 additions & 4 deletions
@@ -1,3 +1,5 @@
+"""Data-ingestion pipeline: load docs, build DuckDB tables, create FAISS index."""
+
 import os
 import re
 import logging
@@ -6,6 +8,7 @@
 import shutil
 from dotenv import load_dotenv
 from pathlib import Path
+from typing import List, Tuple

 from langchain_community.vectorstores import FAISS
 from langchain_google_genai import GoogleGenerativeAIEmbeddings
@@ -20,7 +23,8 @@
 DATA = BASE / "data"


-def load_and_split_text_docs(data_dir):
+def load_and_split_text_docs(data_dir: Path) -> List[Document]:
+    """Load PDFs, DOCX, PPTX, etc. and split into chunks suitable for embeddings."""
     text_chunks = []
     globs = [
         "**/*.pdf",
@@ -60,7 +64,8 @@ def load_and_split_text_docs(data_dir):
     return text_chunks


-def load_image_docs_as_text(data_dir):
+def load_image_docs_as_text(data_dir: Path) -> List[Document]:
+    """Run OCR on images and return one Document per image."""
     image_text_docs = []
     globs = [
         "**/*.png",
@@ -90,7 +95,7 @@ def load_image_docs_as_text(data_dir):


 def _tbl(name: str) -> str:
-    """make a safe SQL table name"""
+    """Sanitize an arbitrary string so it can be used as a SQL table name."""
     name = re.sub(r"[^0-9a-zA-Z_]+", "_", name).strip("_")
     if not name or name[0].isdigit():
         name = f"t_{name}"
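Review note: the three visible lines of `_tbl` can be exercised on their own. The sketch below repeats them under a hypothetical name, `sanitize_table_name`, and assumes the function returns the name right after the prefix step; the real function continues past the end of the hunk and may do more.

```python
import re


def sanitize_table_name(name: str) -> str:
    """Turn an arbitrary string into an identifier-like SQL table name."""
    # Collapse every run of non-alphanumeric, non-underscore characters into "_",
    # then trim underscores from both ends.
    name = re.sub(r"[^0-9a-zA-Z_]+", "_", name).strip("_")
    # Identifiers should not be empty or start with a digit, so add a prefix.
    if not name or name[0].isdigit():
        name = f"t_{name}"
    return name


print(sanitize_table_name("2024 sales.csv"))  # t_2024_sales_csv
print(sanitize_table_name("my-data"))         # my_data
```

Two different file names can map to the same table name under this rule (for example `my-data` and `my data`), so a caller that loads many files may want to check for duplicates.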
@@ -101,6 +106,7 @@ def build_duckdb_and_summary_cards(
     data_dir: Path,
     db_path: Path,
 ) -> list[Document]:
+    """Create DuckDB tables for CSV/XLSX files and return vector-searchable summary cards."""
     summary_cards = []
     # skip if there are no .csv/.xlsx/.xls files
     patterns = ("*.csv", "*.xlsx", "*.xls")
@@ -197,7 +203,8 @@ def embed_and_index_all_docs(
     db_path: Path = DATA / "generated_db" / "csv_excel_to_db.duckdb",
     index_path: Path = DATA / "generated_db" / "faiss_index",
     load_data: bool = False,
-):
+) -> Tuple[GoogleGenerativeAIEmbeddings, FAISS]:
+    """Return (embeddings, vector_store). Build or load FAISS & DuckDB as needed."""
     # load embeedings and vector store
     embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

src/any_chatbot/prompts.py

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+"""Prompt templates used by Anyfile-Agent."""
+
 system_message = """
 You are a agent designed to conduct semantic search on the uploaded user documents
 and/or also interact with a SQL database.

src/any_chatbot/tools.py

Lines changed: 28 additions & 1 deletion
@@ -1,3 +1,5 @@
+"""Utility helpers that turn a FAISS vector store or DuckDB database into LangChain tools usable by the agent."""
+
 from typing import Tuple, List, Literal
 from pathlib import Path

@@ -12,6 +14,15 @@


 def initialize_retrieve_tool(vector_store: VectorStore):
+    """Return a LangChain `@tool` that performs semantic search.
+
+    Args:
+        vector_store: A pre-built FAISS (or compatible) vector store.
+
+    Returns:
+        The decorated `retrieve` function ready to be passed into an agent.
+    """
+
     @tool(
         description=(
             """
@@ -41,7 +52,14 @@ def retrieve(


 def is_safe_sql(query: str) -> bool:
-    """Filters out destructive sql queries"""
+    """Reject queries that contain DML/DDL keywords.
+
+    Args:
+        query: Arbitrary SQL supplied by the LLM.
+
+    Returns:
+        True if the query looks safe (SELECT/PRAGMA), else False.
+    """
     forbidden = ["insert", "update", "delete", "drop", "alter", "create", "replace"]
     # Make sure to only block whole words (e.i., don't block 'updated_at')
     return not any(f" {word} " in f" {query.lower()} " for word in forbidden)
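Review note: the space-padded check in `is_safe_sql` only matches a keyword that has a space on both sides, so a keyword followed by a newline or preceded by a semicolon appears to slip through. The sketch below compares it with a regex word-boundary variant. `is_safe_sql_spaces` restates the line from the diff; `is_safe_sql_regex` is a hypothetical alternative, not part of the commit.

```python
import re

FORBIDDEN = ["insert", "update", "delete", "drop", "alter", "create", "replace"]


def is_safe_sql_spaces(query: str) -> bool:
    """The space-padded substring check shown in the diff."""
    return not any(f" {word} " in f" {query.lower()} " for word in FORBIDDEN)


# \b matches at a word boundary, so "update" is caught next to newlines,
# semicolons, or parentheses, while "updated_at" is still allowed.
_FORBIDDEN_RE = re.compile(r"\b(?:" + "|".join(FORBIDDEN) + r")\b", re.IGNORECASE)


def is_safe_sql_regex(query: str) -> bool:
    """Hypothetical variant: reject any forbidden keyword found as a whole word."""
    return _FORBIDDEN_RE.search(query) is None
```

The regex variant is stricter in both directions: it also rejects harmless queries that call a `replace(...)` string function or mention a keyword inside a string literal. A keyword filter of either kind is best treated as one layer; opening the database in read-only mode, where the driver supports it, is a stronger guarantee.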
@@ -51,6 +69,15 @@ def initialize_sql_toolkit(
     llm,
     db_path: Path = DATA / "generated_db" / "csv_excel_to_db.duckdb",
 ):
+    """Wrap DuckDB in a LangChain `SQLDatabaseToolkit` with a safety filter.
+
+    Args:
+        llm: The chat model that will power SQL-aware tools.
+        db_path: Location of the DuckDB file created during indexing.
+
+    Returns:
+        A list of LangChain tools for schema look-up and SELECT queries.
+    """
     db = SQLDatabase.from_uri(f"duckdb:///{db_path}")

     # Monkey-path the run method to include safety filter
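Review note: the diff context ends at the comment about patching `run`, so the actual patch is not visible. The sketch below shows the general technique on a stand-in class. `FakeDatabase`, `guard_run`, and the error string are hypothetical, and nothing here calls LangChain.

```python
from typing import Callable


class FakeDatabase:
    """Stand-in for the real database wrapper; for illustration only."""

    def run(self, command: str) -> str:
        return f"executed: {command}"


def guard_run(db, is_safe: Callable[[str], bool]) -> None:
    """Replace db.run on this instance with a version that checks the query first."""
    original_run = db.run  # keep the bound method so it can still be called

    def safe_run(command: str, *args, **kwargs):
        if not is_safe(command):
            # Return a message instead of raising so the agent can read it and retry.
            return "Error: only read-only queries are allowed."
        return original_run(command, *args, **kwargs)

    db.run = safe_run  # instance-level patch; the class itself is unchanged


db = FakeDatabase()
guard_run(db, lambda q: "drop" not in q.lower())
print(db.run("SELECT 1"))      # executed: SELECT 1
print(db.run("DROP TABLE t"))  # Error: only read-only queries are allowed.
```

Patching the instance rather than the class keeps the filter local to one database object, but any other code path that reaches the database without going through `run` would bypass it.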

src/any_chatbot/utils.py

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+"""Small utility helpers not tied to LangChain."""
+
 import getpass
 import os

