This application allows users to query a complex business database using natural language questions, which are then translated to SQL by the Gemini Flash AI model and executed against a Supabase database.
- Natural language to SQL conversion using Google Gemini Flash
- Complex database schema with intentionally "dirty" elements to challenge AI
- Visualization of query results
- Detailed explanations of generated SQL
- Handling of business analytics questions about sales, users, products, and regions
The project consists of two main parts:
- Client: React frontend with TypeScript and Tailwind CSS
- Server: Node.js backend with Express and TypeScript
- Node.js (v16 or higher)
- pnpm (or npm/yarn)
- Supabase account
- Google Gemini API key
- Clone the repository
git clone <repository-url>
cd ai-data-analytics- Set up environment variables
Create .env files in both the client and server directories.
For server:
SUPABASE_URL=https://nxyvixarrciicytnkrvs.supabase.co
SUPABASE_ANON_KEY=your-supabase-anon-key
GEMINI_API_KEY=your-gemini-api-key
PORT=5000
For client:
VITE_API_URL=http://localhost:5000/api
- Install dependencies and start the applications
For server:
cd server
pnpm install
pnpm devFor client:
cd client
pnpm install
pnpm devThe database schema is intentionally complex with some "dirty" elements to challenge the AI:
-
regions: Geographic regions with cryptic columns
- Contains boolean flags Q1-Q4 indicating quarterly activity
- Has ambiguous x_factor metric
-
products: Product catalog
- Contains columns with abbreviated names like prd_desc and prd_code
- Includes various product status indicators
-
users: Customer information
- Contains cryptic tier indicators t1, t2, t3
- Stores user preferences in JSON format
-
sales: Transaction data
- Includes cryptic columns c1 (channel), c2 (campaign code)
- Contains discount percentage and satisfaction score
- Linked to products, users, and regions
-
inventory: Stock information
- Tracks inventory by product and region
- Contains fields for stock thresholds and flags
Here are some example questions you can ask:
-
"Why did sales drop in Q2 in the southern zone?"
-- Expected SQL: SELECT r.name AS region_name, p.name AS product_name, SUM(CASE WHEN EXTRACT(QUARTER FROM s.date) = 1 THEN s.amount ELSE 0 END) AS q1_sales, SUM(CASE WHEN EXTRACT(QUARTER FROM s.date) = 2 THEN s.amount ELSE 0 END) AS q2_sales, COUNT(DISTINCT s.id) AS transaction_count, AVG(s.dscnt) AS avg_discount FROM sales s JOIN regions r ON s.region_id = r.id JOIN products p ON s.product_id = p.id WHERE r.zone = 'South' AND EXTRACT(YEAR FROM s.date) = 2024 GROUP BY r.name, p.name HAVING SUM(CASE WHEN EXTRACT(QUARTER FROM s.date) = 1 THEN s.amount ELSE 0 END) > SUM(CASE WHEN EXTRACT(QUARTER FROM s.date) = 2 THEN s.amount ELSE 0 END) ORDER BY (SUM(CASE WHEN EXTRACT(QUARTER FROM s.date) = 1 THEN s.amount ELSE 0 END) - SUM(CASE WHEN EXTRACT(QUARTER FROM s.date) = 2 THEN s.amount ELSE 0 END)) DESC
-
"Which products had the highest profit margin in Q4?"
-- Expected SQL: SELECT p.id, p.name AS product_name, p.category, SUM(s.amount) AS total_sales, SUM(s.qty * p.cost) AS total_cost, SUM(s.amount) - SUM(s.qty * p.cost) AS total_profit, (SUM(s.amount) - SUM(s.qty * p.cost)) / SUM(s.amount) * 100 AS profit_margin_percentage FROM sales s JOIN products p ON s.product_id = p.id JOIN regions r ON s.region_id = r.id WHERE r."Q4" = true OR EXTRACT(QUARTER FROM s.date) = 4 GROUP BY p.id, p.name, p.category ORDER BY profit_margin_percentage DESC LIMIT 10
-
"Show me the top 5 selling products in New York in 2024"
-- Expected SQL: SELECT p.id, p.name AS product_name, p.category, SUM(s.qty) AS total_quantity_sold, SUM(s.amount) AS total_sales_amount FROM sales s JOIN products p ON s.product_id = p.id JOIN regions r ON s.region_id = r.id WHERE r.name = 'New York' AND EXTRACT(YEAR FROM s.date) = 2024 GROUP BY p.id, p.name, p.category ORDER BY total_quantity_sold DESC LIMIT 5
-
"Compare online vs in-store sales across different regions"
-- Expected SQL: SELECT r.name AS region_name, r.country, SUM(CASE WHEN s.c1 = 'online' THEN s.amount ELSE 0 END) AS online_sales, SUM(CASE WHEN s.c1 = 'in-store' THEN s.amount ELSE 0 END) AS in_store_sales, COUNT(CASE WHEN s.c1 = 'online' THEN s.id END) AS online_transactions, COUNT(CASE WHEN s.c1 = 'in-store' THEN s.id END) AS in_store_transactions, SUM(CASE WHEN s.c1 = 'online' THEN s.amount ELSE 0 END) / NULLIF(SUM(CASE WHEN s.c1 = 'in-store' THEN s.amount ELSE 0 END), 0) AS online_to_instore_ratio FROM sales s JOIN regions r ON s.region_id = r.id GROUP BY r.name, r.country ORDER BY r.name
-
"Which regions have the most critical inventory levels?"
-- Expected SQL: SELECT r.name AS region_name, r.country, p.name AS product_name, p.category, i.qty_available, i.min_stock_level, i.qty_available - i.min_stock_level AS stock_buffer, CASE WHEN i.qty_available <= i.min_stock_level THEN 'Critical' WHEN i.qty_available <= i.min_stock_level * 1.2 THEN 'Warning' ELSE 'Normal' END AS stock_status FROM inventory i JOIN regions r ON i.region_id = r.id JOIN products p ON i.product_id = p.id WHERE i.qty_available <= i.min_stock_level * 1.2 ORDER BY stock_buffer ASC, r.name LIMIT 15
- The application uses Supabase for database hosting
- Gemini Flash API is used for natural language to SQL conversion
- The backend includes a service to provide the AI with schema information
- A custom SQL execution function in Supabase allows dynamic query execution
- The visualization component is a placeholder and would need integration with a charting library in a production environment
- The Gemini Flash model may sometimes generate incorrect SQL for very complex queries
- Error handling could be improved in a production version