An AI-powered data analysis platform that allows users to upload datasets (CSV/Excel) and interact with them using natural language. The system leverages a dual-database architecture and an autonomous AI agent to convert questions into high-performance analytical queries.
- CSV Support: Direct high-speed ingestion using DuckDB's native readers.
- Excel (.xlsx) Support: Automated conversion of binary Excel sheets to structured data via Apache POI.
- DuckDB (Analytical Engine): Runs fast SQL queries over uploaded datasets.
- PostgreSQL (Metadata & Memory): Stores persistent chat sessions and system metadata.
- LLM: Llama-3.3-70b-versatile served via Groq for low-latency reasoning.
- Tool Calling: The AI can autonomously discover schemas, preview data, and execute SQL.
- Multi-user session support, keyed by `sessionId`.
- Conversation history is persisted in PostgreSQL, so the AI retains context across server restarts.
- Responses include a human-friendly answer, the generated SQL query, and structured JSON data ready for frontend charts.
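The three-part response described above can be modeled as a simple DTO. The names below are illustrative, not the project's actual classes; a minimal sketch:

```java
import java.util.List;
import java.util.Map;

// Hypothetical shape of the chat response: a natural-language answer,
// the SQL the agent generated, and row data ready for charting.
public record ChatResponse(
        String answer,
        String sql,
        List<Map<String, Object>> rows) {
}
```

Returning the rows as `List<Map<String, Object>>` lets the frontend feed them straight into a charting library without knowing the dataset's columns in advance.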
| Layer | Technology |
|---|---|
| Framework | Spring Boot 3.5.9 (Java 21) |
| AI Orchestration | LangChain4j 0.36.2 |
| LLM Provider | Groq (Llama 3.3 70B) |
| Analytical DB | DuckDB (Embedded) |
| Operational DB | PostgreSQL |
| File Processing | Apache POI, Commons CSV |
```
com.praj.datasetai/
├── config/         # Multi-DataSource & AI Bean configurations
├── controller/     # REST Endpoints for Upload & Chat
├── domain/         # JPA Entities (PostgreSQL)
├── dto/            # Data Transfer Objects (Request/Response)
├── langchain/
│   ├── agent/      # AI Service Interfaces
│   ├── store/      # Persistent Memory implementation
│   └── tool/       # SQL Execution & Schema tools
├── repository/     # Spring Data JPA Repositories
└── service/        # Business Logic (File Ingestion & Storage)
```
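To illustrate the `langchain/tool` layer: the agent needs a tool that describes a table before it writes SQL. A minimal sketch, with the LangChain4j `@Tool` annotation and the real JDBC wiring omitted, and all names invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative schema-discovery tool: the agent calls this to learn
// which columns exist before generating SQL. In the real project this
// would query DuckDB's information_schema via JDBC.
public class SchemaTool {
    private final Map<String, Map<String, String>> tables = new LinkedHashMap<>();

    public void register(String table, Map<String, String> columns) {
        tables.put(table, columns);
    }

    /** Returns a compact "name TYPE" listing the LLM can read. */
    public String describe(String table) {
        Map<String, String> cols = tables.get(table);
        if (cols == null || cols.isEmpty()) {
            return "Unknown table: " + table;
        }
        StringBuilder sb = new StringBuilder(table).append("(");
        cols.forEach((name, type) -> sb.append(name).append(" ").append(type).append(", "));
        sb.setLength(sb.length() - 2); // drop trailing ", "
        return sb.append(")").toString();
    }
}
```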
POST /api/files/upload
- Payload: `multipart/form-data` (key: `file`)
- Action: Saves the file and creates a corresponding table in DuckDB.
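One way the upload step could map a filename to a DuckDB table, sketched with a hypothetical helper. `read_csv_auto` is DuckDB's built-in CSV reader; the sanitization rules here are illustrative, and untrusted paths would need stricter validation than this:

```java
// Builds the DuckDB ingestion statement for an uploaded CSV.
// The table name is derived from the filename; anything that is not
// a letter, digit, or underscore becomes "_" so it is a safe identifier.
public class IngestionSql {

    public static String tableNameFor(String filename) {
        String base = filename.replaceFirst("\\.[Cc][Ss][Vv]$", "");
        String safe = base.replaceAll("[^A-Za-z0-9_]", "_");
        return (safe.isEmpty() || Character.isDigit(safe.charAt(0))) ? "t_" + safe : safe;
    }

    public static String createTable(String filename, String path) {
        // Single quotes in the path are doubled for SQL literal escaping;
        // illustrative only, not a full defense against hostile input.
        return "CREATE TABLE \"" + tableNameFor(filename)
                + "\" AS SELECT * FROM read_csv_auto('" + path.replace("'", "''") + "')";
    }
}
```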
POST /api/chat
- Payload: `{"sessionId": "string", "question": "string"}`
- Action: The AI reasons over the data and returns structured insights.
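The `sessionId` in that payload is what keys the persistent memory. A toy in-memory version of such a store, purely to illustrate the idea (the real store is backed by PostgreSQL, and these names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative session-scoped chat memory: each sessionId gets its own
// message list, so concurrent users never see each other's history.
public class SessionMemoryStore {
    private final Map<String, List<String>> messagesBySession = new ConcurrentHashMap<>();

    public void append(String sessionId, String message) {
        messagesBySession.computeIfAbsent(sessionId, id -> new ArrayList<>()).add(message);
    }

    public List<String> history(String sessionId) {
        return List.copyOf(messagesBySession.getOrDefault(sessionId, List.of()));
    }
}
```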
By embedding DuckDB, the application avoids the network round-trips and serialization overhead of a client-server database for heavy analytical work. DuckDB's columnar, vectorized engine makes GROUP BY and aggregate queries dramatically faster than a row-oriented OLTP database would be for this workload.
The system doesn't just "run SQL." It follows a ReAct (Reasoning + Acting) loop:
- Observe: Inspect the table schema.
- Think: Determine which columns answer the user's question.
- Act: Execute the query and validate the results.
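The three steps above can be sketched as a plain loop. In the real project each step is delegated to the LLM via LangChain4j tool calling, so everything here, names included, is a simplified illustration:

```java
import java.util.List;
import java.util.function.Function;

// Simplified ReAct loop: observe the schema, think (generate SQL),
// act (run it), validate, and retry once if the result looks empty.
public class ReactLoop {

    public static String run(String schema,
                             Function<String, String> think,        // schema -> SQL
                             Function<String, List<String>> act) {  // SQL -> rows
        for (int attempt = 0; attempt < 2; attempt++) {
            String sql = think.apply(schema);   // Think: pick columns, write SQL
            List<String> rows = act.apply(sql); // Act: execute the query
            if (!rows.isEmpty()) {              // Validate the results
                return sql;
            }
        }
        return null; // give up after two attempts
    }
}
```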