
Commit 2142380

version 2.2.0 ready to fix the database problem

1 parent a40fcb9 commit 2142380

6 files changed: +416 additions, -81 deletions


AGENTS.md

Lines changed: 32 additions & 42 deletions
````diff
@@ -1,57 +1,47 @@
 # Repository Guidelines
 
 ## Project Structure & Module Organization
-- `backend/`: FastAPI service and data-to-spec pipeline (`main.py`, `analyzer.py`, `spec_generator.py`, `llm.py`).
-- `frontend/`: React + TypeScript + Vite UI (`src/components`, `src/lib/catalog.ts`, `src/App.tsx`).
-- `sample_data/`: local datasets for manual validation (CSV/TSV/JSON/XLSX).
-- Root docs: `json-render-docs.md`, `quickstart.md`.
-
-Keep backend logic data-focused (analysis/spec generation) and frontend logic presentation-focused (renderer and components).
+- `backend/`: FastAPI service and analysis pipeline (`main.py`, `analyzer.py`, `llm.py`, `db.py`, `nl2sql.py`).
+- `frontend/`: React 19 + TypeScript app (Vite). UI components live in `frontend/src/components/`, JSON catalog/registry in `frontend/src/lib/` and `frontend/src/components/registry.tsx`.
+- `sample_data/`: local datasets for smoke testing uploads.
+- `resource/`: demo media assets.
+- `.github/workflows/pylint.yml`: Python lint CI job.
 
 ## Build, Test, and Development Commands
-- Backend setup/run:
-  ```powershell
-  cd backend
-  python -m venv .venv
-  .\.venv\Scripts\Activate.ps1
-  pip install -r requirements.txt
-  uvicorn main:app --reload --port 8000
-  ```
-- Frontend setup/run:
-  ```powershell
-  cd frontend
-  npm install
-  npm run dev
-  ```
+- Backend setup:
+  - `cd backend && pip install -r requirements.txt`
+  - `uvicorn main:app --reload --port 8000` (runs the API at `http://localhost:8000`)
+- Frontend setup:
+  - `cd frontend && npm install`
+  - `npm run dev` (local app at `http://localhost:5173`)
 - Frontend quality/build:
-  ```powershell
-  npm run lint     # ESLint on TS/TSX
-  npm run build    # Type-check + production build
-  npm run preview  # Preview built app
-  ```
+  - `npm run lint` (ESLint for `ts/tsx`)
+  - `npm run build` (TypeScript compile + Vite production build)
+  - `npm run preview` (serve the built app)
 
 ## Coding Style & Naming Conventions
-- Python: follow PEP 8, 4-space indentation, `snake_case` for functions/variables, small focused helpers.
-- TypeScript/React: 2-space indentation, `PascalCase` for components (`StatCard.tsx`), `camelCase` for functions/props.
-- Keep component schemas and registry aligned with `frontend/src/lib/catalog.ts` and `frontend/src/components/registry.tsx`.
-- Run `npm run lint` before opening a PR.
+- Python: 4-space indentation, `snake_case` for functions/variables, small focused functions.
+- TypeScript/React: `PascalCase` for components (`StatCard.tsx`), `camelCase` for helpers/hooks.
+- Keep component contracts aligned with the JSON render catalog (`BarChart`, `LineChart`, `PieChart`, etc.).
+- Run `npm run lint` before opening a PR; keep imports and unused variables clean.
 
 ## Testing Guidelines
-- No automated test suite is currently committed.
-- Minimum expectation: manual smoke test with both servers running and at least one file from `sample_data/` uploaded.
-- New tests are encouraged:
-  - Backend: `backend/tests/test_*.py` with `pytest`.
-  - Frontend: `*.test.tsx` near components or under `frontend/src/__tests__/`.
+- There is no committed unit-test suite yet for backend or frontend.
+- Minimum checks before a PR:
+  - `npm run lint`
+  - `npm run build`
+  - Manual smoke test: upload at least one file from `sample_data/` and verify dashboard rendering.
+- For backend logic changes, add targeted tests when introducing non-trivial parsing/query behavior.
 
 ## Commit & Pull Request Guidelines
-- Existing history uses short, release-style messages (example: `version 1.0.4 supportAllKindOfData`).
-- Prefer concise commits with clear scope, e.g. `frontend: improve chart legend layout`.
+- Current history is release-oriented (`version <semver> <note>`, e.g. `version 2.1.0 support personal database analyze`).
+- Prefer concise, imperative commit subjects; include a scope when useful (e.g. `backend: tighten SQL guard`).
 - PRs should include:
-  - What changed and why.
-  - Manual verification steps.
-  - Screenshots/GIFs for UI changes.
-  - Any config/env updates.
+  - What changed and why
+  - Manual verification steps
+  - UI screenshots/GIFs for frontend changes
+  - A linked issue/task reference
 
 ## Security & Configuration Tips
-- Configure secrets in `backend/.env` (`LLM_API_KEY`, `LLM_BASE_URL`, optional `LLM_MODEL`).
-- Never commit API keys or sensitive datasets.
+- Use `backend/.env` for `LLM_API_KEY`, `LLM_BASE_URL`, `LLM_MODEL`, and DB credentials (`DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`).
+- Never commit secrets. Keep generated metadata (`backend/db_meta.json`) out of version control.
````

backend/db.py

Lines changed: 42 additions & 6 deletions
```diff
@@ -1,12 +1,17 @@
 import os
 import json
+import re
 import pymysql
 import pandas as pd
 from dotenv import load_dotenv
 
 load_dotenv()
 
 META_PATH = os.path.join(os.path.dirname(__file__), "db_meta.json")
+FORBIDDEN_SQL_PATTERN = re.compile(
+    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|REPLACE|GRANT|REVOKE)\b",
+    flags=re.IGNORECASE,
+)
 
 
 def get_connection():
@@ -43,7 +48,6 @@ def scan_schema():
         # sample values
         cur.execute(f"SELECT * FROM `{table}` LIMIT 3")
         sample_rows = cur.fetchall()
-        col_names = [c["name"] for c in columns]
         for i, c in enumerate(columns):
             c["sample"] = [row[i] for row in sample_rows if row[i] is not None]
 
@@ -75,14 +79,46 @@ def get_meta():
     return scan_schema()
 
 
-def execute_query(sql: str) -> pd.DataFrame:
-    stripped = sql.strip().rstrip(";").strip()
-    if not stripped.upper().startswith("SELECT"):
+def normalize_sql(sql: str) -> str:
+    stripped = (sql or "").strip().rstrip(";").strip()
+    # Remove leading SQL comments before validation.
+    cleaned = re.sub(r"^(--[^\n]*\n|/\*.*?\*/\s*)*", "", stripped, flags=re.DOTALL).strip()
+    return cleaned
+
+
+def validate_select_sql(sql: str, require_from: bool = False) -> str:
+    cleaned = normalize_sql(sql)
+    if not cleaned:
+        raise ValueError("SQL is empty")
+    if ";" in cleaned:
+        raise ValueError("Multiple SQL statements are not allowed")
+
+    first_word = cleaned.split()[0].upper() if cleaned.split() else ""
+    if first_word not in ("SELECT", "WITH"):
         raise ValueError("Only SELECT queries are allowed")
 
+    if FORBIDDEN_SQL_PATTERN.search(cleaned):
+        raise ValueError("Only read-only SELECT queries are allowed")
+
+    # Reject placeholder outputs often produced by LLMs.
+    if re.search(r"\bSELECT\s+statement\.?\b", cleaned, flags=re.IGNORECASE):
+        raise ValueError("LLM returned a placeholder SQL instead of a real query")
+
+    if require_from and first_word == "SELECT" and "FROM" not in cleaned.upper():
+        raise ValueError("Generated SQL must include a FROM clause")
+
+    return cleaned
+
+
+def execute_query(sql: str) -> pd.DataFrame:
+    cleaned = validate_select_sql(sql)
+
     conn = get_connection()
     try:
-        df = pd.read_sql(sql, conn)
-        return df.head(5000)
+        with conn.cursor() as cur:
+            cur.execute(cleaned)
+            rows = cur.fetchmany(5000)
+            columns = [desc[0] for desc in (cur.description or [])]
+            return pd.DataFrame(rows, columns=columns)
     finally:
         conn.close()
```
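
The new SQL guard can be exercised in isolation. Below is a minimal, dependency-free sketch: it re-declares the same keyword blocklist and a trimmed `validate_select_sql` (omitting the placeholder and `require_from` checks from the diff) so the pass/fail behavior can be seen without a database:

```python
import re

# Same keyword blocklist as the FORBIDDEN_SQL_PATTERN added in this commit.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|REPLACE|GRANT|REVOKE)\b",
    re.IGNORECASE,
)

def validate_select_sql(sql: str) -> str:
    # Trim whitespace, a trailing semicolon, and leading SQL comments.
    stripped = (sql or "").strip().rstrip(";").strip()
    cleaned = re.sub(r"^(--[^\n]*\n|/\*.*?\*/\s*)*", "", stripped, flags=re.DOTALL).strip()
    if not cleaned:
        raise ValueError("SQL is empty")
    if ";" in cleaned:
        raise ValueError("Multiple SQL statements are not allowed")
    first_word = cleaned.split()[0].upper()
    if first_word not in ("SELECT", "WITH"):
        raise ValueError("Only SELECT queries are allowed")
    if FORBIDDEN.search(cleaned):
        raise ValueError("Only read-only SELECT queries are allowed")
    return cleaned

# A leading comment and trailing semicolon are stripped, then the query passes.
print(validate_select_sql("-- top-n\nSELECT name FROM users LIMIT 5;"))
# A write statement is rejected before it ever reaches the database.
try:
    validate_select_sql("DROP TABLE users")
except ValueError as e:
    print("rejected:", e)
```

Note that a regex blocklist is deliberately conservative: a query whose text merely contains a forbidden keyword inside a string literal (e.g. `WHERE action = 'delete'`) is also rejected. That is a known tradeoff of this style of guard.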

backend/e2b_runner.py

Lines changed: 123 additions & 0 deletions
This commit adds `backend/e2b_runner.py` as a new file:

````python
import os
import json
import base64
import httpx
from openai import OpenAI
from dotenv import load_dotenv
from e2b_code_interpreter import Sandbox

load_dotenv()

client = OpenAI(
    api_key=os.getenv("LLM_API_KEY"),
    base_url=os.getenv("LLM_BASE_URL"),
    http_client=httpx.Client(trust_env=False),
)

MAX_STEPS = 6

AGENT_PROMPT = """You are a Deep Research Agent for data analysis. You work iteratively: plan, execute code, observe results, then decide next steps.

You have a dataset at '/tmp/data.csv'. You are inside a Python sandbox with pandas, matplotlib, scipy, sklearn available.

At each step, respond with a JSON object:
{
  "thought": "What I learned and what I want to do next",
  "code": "python code to execute (or empty string if done)",
  "done": false
}

When your analysis is complete, set done=true and include a final summary:
{
  "thought": "Final comprehensive summary of all findings",
  "code": "",
  "done": true
}

Rules:
1. Start by exploring the data (shape, columns, dtypes, basic stats).
2. Each step should build on previous results. Don't repeat work.
3. Save charts to '/tmp/chart_N.png' (increment N). Use plt.savefig() then plt.close().
4. Print results so you can observe them in the next step.
5. If code errors, fix it in the next step.
6. Use the SAME LANGUAGE as the user question for all text output.
7. Output ONLY valid JSON, no other text."""


def _parse_response(text: str) -> dict:
    text = text.strip()
    if "```" in text:
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"thought": text, "code": "", "done": True}


def _run_code(sbx: Sandbox, code: str, chart_idx: int) -> dict:
    execution = sbx.run_code(code)
    stdout = "\n".join(execution.logs.stdout) if execution.logs.stdout else ""
    result_text = execution.text or ""
    output = (stdout + "\n" + result_text).strip()
    error = execution.error.value if execution.error else ""

    charts = []
    for i in range(chart_idx, chart_idx + 5):
        try:
            content = sbx.files.read(f"/tmp/chart_{i}.png", format="bytes")
            charts.append(base64.b64encode(content).decode())
        except Exception:
            break

    return {"output": output, "error": error, "charts": charts}


def deep_analyze(data_csv: str, question: str) -> dict:
    sbx = Sandbox.create()
    steps = []
    chart_idx = 0

    try:
        sbx.files.write("/tmp/data.csv", data_csv)
        messages = [
            {"role": "system", "content": AGENT_PROMPT},
            {"role": "user", "content": f"Dataset preview:\n{data_csv[:5000]}\n\nQuestion: {question}"},
        ]

        for step_num in range(MAX_STEPS):
            resp = client.chat.completions.create(
                model=os.getenv("LLM_MODEL", "claude-opus-4-6-thinking"),
                messages=messages,
                max_tokens=2048,
            )
            reply = resp.choices[0].message.content
            parsed = _parse_response(reply)

            step = {"step": step_num + 1, "thought": parsed.get("thought", "")}

            if parsed.get("done") or not parsed.get("code", "").strip():
                steps.append(step)
                break

            result = _run_code(sbx, parsed["code"], chart_idx)
            chart_idx += len(result["charts"])

            step["code"] = parsed["code"]
            step["output"] = result["output"]
            step["error"] = result["error"]
            step["charts"] = result["charts"]
            steps.append(step)

            # Feed results back to the LLM
            observation = f"Output:\n{result['output']}"
            if result["error"]:
                observation += f"\nError:\n{result['error']}"
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "user", "content": observation})

    finally:
        sbx.kill()

    return {"steps": steps}
````
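
The `_parse_response` helper tolerates model replies wrapped in a markdown code fence, and falls back to treating any unparseable reply as a final free-text summary. A standalone sketch of the same idea (the function name here is illustrative, not part of the module):

```python
import json

def parse_agent_reply(text: str) -> dict:
    # Unwrap an optional markdown fence, then parse JSON.
    text = text.strip()
    if "```" in text:
        # Take the fenced body and drop an optional "json" language tag.
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Non-JSON replies are treated as a final free-text summary.
        return {"thought": text, "code": "", "done": True}

fenced = '```json\n{"thought": "explore first", "code": "print(df.shape)", "done": false}\n```'
print(parse_agent_reply(fenced)["code"])        # print(df.shape)
print(parse_agent_reply("plain text")["done"])  # True
```

The fallback matters for the agent loop: a malformed reply ends the run gracefully (as a final summary step) instead of crashing mid-analysis.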

backend/main.py

Lines changed: 55 additions & 5 deletions
```diff
@@ -1,11 +1,13 @@
-from fastapi import FastAPI, UploadFile, Query
+from fastapi import FastAPI, Query, UploadFile
 from fastapi.middleware.cors import CORSMiddleware
 from pydantic import BaseModel
+
 from analyzer import analyze
-from spec_generator import generate_spec
+from db import execute_query, scan_schema
+from e2b_runner import deep_analyze
 from llm import generate_spec_with_llm
-from db import scan_schema, execute_query
 from nl2sql import generate_recommendations, nl_to_sql
+from spec_generator import generate_spec
 
 app = FastAPI()
 app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])
@@ -15,6 +17,15 @@ class QueryRequest(BaseModel):
     question: str
 
 
+class SqlRequest(BaseModel):
+    sql: str
+
+
+class DeepRequest(BaseModel):
+    data_csv: str
+    question: str
+
+
 @app.post("/analyze")
 async def analyze_file(file: UploadFile, mode: str = Query("ai")):
     data = await file.read()
@@ -44,8 +55,24 @@ async def db_recommend():
 
 @app.post("/db/query")
 async def db_query(req: QueryRequest):
-    sql = nl_to_sql(req.question)
-    df = execute_query(sql)
+    try:
+        sql = nl_to_sql(req.question)
+    except ValueError as e:
+        return {
+            "error": (
+                "Cannot convert this question to executable SQL. "
+                "Ask a more concrete data query.\n"
+                f"Reason: {e}"
+            )
+        }
+
+    print(f"[NL2SQL] question: {req.question}")
+    print(f"[NL2SQL] generated sql: {repr(sql)}")
+    try:
+        df = execute_query(sql)
+    except Exception as e:
+        return {"error": f"SQL execution failed: {e}"}
+
     data = df.to_csv(index=False).encode("utf-8")
     analysis = analyze(data, "query_result.csv")
     try:
@@ -54,3 +81,26 @@ async def db_query(req: QueryRequest):
     except Exception as e:
         print(f"LLM failed, falling back to template: {e}")
         spec = generate_spec(analysis)
     return {"spec": spec, "analysis": analysis, "sql": sql}
+
+
+@app.post("/db/query-sql")
+async def db_query_sql(req: SqlRequest):
+    try:
+        df = execute_query(req.sql)
+    except Exception as e:
+        return {"error": f"SQL execution failed: {e}"}
+
+    data = df.to_csv(index=False).encode("utf-8")
+    analysis = analyze(data, "query_result.csv")
+    try:
+        spec = generate_spec_with_llm(analysis)
+    except Exception as e:
+        print(f"LLM failed, falling back to template: {e}")
+        spec = generate_spec(analysis)
+    return {"spec": spec, "analysis": analysis, "sql": req.sql}
+
+
+@app.post("/analyze/deep")
+async def analyze_deep(req: DeepRequest):
+    result = deep_analyze(req.data_csv, req.question)
+    return result
```
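
The reworked `/db/query` handler converts failures into an `{"error": ...}` payload rather than letting exceptions surface as HTTP 500s. That two-stage control flow (translation error vs. execution error) can be sketched without FastAPI; `nl_to_sql` and `execute_query` are injected stand-ins here, not the real modules:

```python
from typing import Any, Callable

def run_nl_query(
    question: str,
    nl_to_sql: Callable[[str], str],
    execute_query: Callable[[str], Any],
) -> dict:
    # Stage 1: NL-to-SQL translation failures become a user-facing error payload.
    try:
        sql = nl_to_sql(question)
    except ValueError as e:
        return {"error": f"Cannot convert this question to executable SQL. Reason: {e}"}
    # Stage 2: execution failures are reported the same way instead of raising.
    try:
        result = execute_query(sql)
    except Exception as e:
        return {"error": f"SQL execution failed: {e}"}
    return {"result": result, "sql": sql}

ok = run_nl_query("top users", lambda q: "SELECT 1", lambda s: [(1,)])
print(ok["sql"])  # SELECT 1

def refuse(q: str) -> str:
    raise ValueError("question too vague")

print(run_nl_query("hmm", refuse, lambda s: [])["error"])
```

Returning the error as part of a 200 response is a design choice: the frontend can render it inline next to the query box instead of handling a generic server error.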
