Agentic QueryBot (Task 3 – Streamlined)

A lightweight, production-oriented Natural Language → SQL → Answer system with Local LLM, hybrid schema RAG, security guardrails, structured logging, and a Streamlit UI. This codebase follows the interview task specs, focuses on SQLite (Postgres/Docker optional), and removes Ollama. Local LLM and OpenAI are supported with an Auto fallback.

Key Features

NL → SQL → Execute → Summarize pipeline
Local LLM + OpenAI with Auto fallback; offline works with Local
MCP-style provider abstraction (app/services/provider.py)
Hybrid RAG over schema (dense + keyword, with optional metadata filters)
Security guardrails:
- Allow-list tables (from env)
- Block DML/destructive ops; block multiple statements
- Enforce ORDER BY when LIMIT is used
Structured JSON logging with trace IDs
Streamlit UI with mode selection, query history, stats, and JSON download
Minimal Windows-safe smoke tests

Repository Layout (selected)

app/
  __init__.py               # .env auto-loader
  main.py                   # end-to-end query pipeline
  agents/
    sql_agent.py            # LLM → SQL generation
    system_prompt.py        # schema RAG + hybrid retrieval helpers
  db/
    connection.py           # SQLite connection
  history/
    query_history.py        # history storage & export
  llm/
    local_llm_client.py     # transformers-based local client (with fallback)
    llm_manager.py          # OpenAI / Local / Auto
  security/
    table_allowlist.py      # table allow-list checks
  services/
    provider.py             # MCP-like service layer
  tools/
    sql_executor.py         # query execution
    sql_validator.py        # safety rules (LIMIT/ORDER BY, DML block, etc.)
    stats_tool.py           # optional column stats
  ui/
    streamlit_app.py        # Streamlit frontend
  utils/
    logger.py               # structured JSON logger
    internet_check.py       # online/OpenAI checks
    tracing.py              # trace IDs (optional)
logs/
  app.log                   # runtime logs
tests/
  test_structured_logging.py # minimal smoke test (Windows-safe)

Requirements

Python 3.9+
Windows/Linux/macOS
Recommended: virtualenv or conda

Install dependencies:

pip install -r requirements.txt

Note: transformers may download models on first use; ensure disk/RAM are sufficient. The local client falls back to a smaller model if needed.

Configuration (.env)

The app loads .env automatically via app/__init__.py (optional). Supported keys:

OPENAI_API_KEY # optional; enables OpenAI mode and Auto fallback
ALLOWED_TABLES # comma-separated allow-list, e.g. json_patients,json_admissions,...

Example .env:

OPENAI_API_KEY=sk-...  # optional
ALLOWED_TABLES=json_patients,json_admissions,json_transfers,json_providers,json_lab,json_diagnoses,json_insurance,json_careunits

Running the App

Streamlit UI:

cd JSL_TASK_2
streamlit run app/ui/streamlit_app.py

LLM Mode selection in sidebar (Auto/OpenAI/Local)
If internet is off, Auto falls back to Local automatically; use the Refresh status button to update the indicators
After submitting a question:
- SQL + answer are shown
- Metadata (validation, tables, performance) available in expanders
- “Download Response” provides JSON
- Query history and statistics appear below results

CLI (optional):

cd JSL_TASK_2
python -c "from app.main import run_query_pipeline; print(run_query_pipeline('Kaç hasta var?'))"

What’s Implemented vs Task 3

Implemented
- MCP-style provider abstraction (service layer)
- Local LLM (offline) + Auto fallback
- Hybrid schema RAG: dense (SentenceTransformer) + keyword, with metadata filters
- Security: allow-list, DML/destructive block, multiple statement block, LIMIT requires ORDER BY
- Observability: structured JSON logs + trace IDs
- UI: query history, statistics, downloadable JSON response
- .env based configuration
Not included by design (can be added on request)
- Postgres and Docker Compose setup
- Persistent vector store (current approach is schema-centric hybrid retrieval)
- Expanded test matrix; kept a minimal Windows-safe smoke test

Example Questions (and expected output shape)

“Kaç hasta var?”
- SQL: SELECT COUNT(*) ...
- Answer: “Toplam X hasta bulundu.”
- Meta: database.query_type=COUNT_QUERY, performance.execution_ms, results.row_count
“Son 30 günde kaç yatış var?”
- SQL: SELECT COUNT(*) FROM json_admissions WHERE admittime >= date('now','-30 days')
- Meta: validation passes; ORDER BY / LIMIT rules apply if used
“Hangi bakım birimlerinde kaç transfer var?”
- SQL: SELECT careunit, COUNT(*) FROM json_transfers GROUP BY careunit

These are examples; exact numbers depend on your dataset.

Security Guardrails

Only SELECT queries are allowed
Forbidden keywords: insert, update, delete, drop, alter, create, attach, pragma, vacuum, truncate, replace
Multiple statements (;) blocked
LIMIT requires ORDER BY
Table allow-list enforced via ALLOWED_TABLES

Logging & Tracing

Structured JSON logs written to logs/app.log
Each pipeline run includes a trace_id
Minimal smoke test: python tests/test_structured_logging.py

Development Notes

Local LLM
- Uses transformers pipeline; falls back to microsoft/DialoGPT-small if the configured model cannot load
- Token usage is approximate (based on tokenizer lengths)
RAG
- Schema documents are built from SQLite PRAGMA
- Dense embeddings via sentence-transformers/all-MiniLM-L6-v2 when available; else bag-of-words fallback
- get_hybrid_relevant_schema_snippets_with_metadata supports filters such as { "table": "json_admissions" }
Provider Layer (app/services/provider.py)
- Centralizes access to DB, Security, RAG, Logger, Tracing, and LLM manager

Troubleshooting

Streamlit shows “Connection error”
- Check terminal for exceptions; large model load may exceed memory
- Switch to Local Only or rely on fallback small model
OpenAI unavailable
- Ensure internet and OPENAI_API_KEY; otherwise use Local Only
SQLite schema
- Ensure the sample DB file exists under app/db/; adjust connection.py if needed

Lightweight Testing

Run the minimal smoke test:

cd JSL_TASK_2
python tests/test_structured_logging.py

Optionally, you can add more tests under tests/ following the same pattern.

Roadmap (Optional Extensions)

Postgres + Docker Compose (app + db + vector store)
Persistent vector index (FAISS/Chroma)
More comprehensive test suite
Expanded UI (intermediate steps, assumptions panel)

License

This repository is intended for interview evaluation. Licensing can be adapted to company policy.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
app		app
etl		etl
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
LLM_SWITCHING_GUIDE.md		LLM_SWITCHING_GUIDE.md
LOCAL_LLM_SETUP.md		LOCAL_LLM_SETUP.md
README.md		README.md
UPGRADE_SUMMARY.md		UPGRADE_SUMMARY.md
_jsl_smoke.py		_jsl_smoke.py
_jsl_smoke_verbose.py		_jsl_smoke_verbose.py
config.py		config.py
debug_test.py		debug_test.py
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt
run_etl.py		run_etl.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agentic QueryBot (Task 3 – Streamlined)

Key Features

Repository Layout (selected)

Requirements

Configuration (.env)

Running the App

What’s Implemented vs Task 3

Example Questions (and expected output shape)

Security Guardrails

Logging & Tracing

Development Notes

Troubleshooting

Lightweight Testing

Roadmap (Optional Extensions)

License

About

Uh oh!

Releases

Packages

Languages

Mehmetda/JSL_TASK_QUERYBOT

Folders and files

Latest commit

History

Repository files navigation

Agentic QueryBot (Task 3 – Streamlined)

Key Features

Repository Layout (selected)

Requirements

Configuration (.env)

Running the App

What’s Implemented vs Task 3

Example Questions (and expected output shape)

Security Guardrails

Logging & Tracing

Development Notes

Troubleshooting

Lightweight Testing

Roadmap (Optional Extensions)

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages