---
slug: /use-cases/AI/ai-powered-sql-generation
sidebar_label: 'AI-powered SQL generation'
title: 'AI-powered SQL generation'
pagination_prev: null
pagination_next: null
description: 'This guide explains how to use AI to generate SQL queries in ClickHouse Client or clickhouse-local.'
keywords: ['AI', 'SQL generation']
show_related_blogs: true
---
Starting from ClickHouse 25.7, [ClickHouse Client](https://clickhouse.com/docs/interfaces/cli) and [clickhouse-local](https://clickhouse.com/docs/operations/utilities/clickhouse-local) include [AI-powered functionality](https://clickhouse.com/docs/interfaces/cli#ai-sql-generation) that converts natural language descriptions into SQL queries. This feature lets users describe their data requirements in plain text, which the system then translates into corresponding SQL statements.

This capability is particularly useful for users who may not be familiar with complex SQL syntax or who need to quickly generate queries for exploratory data analysis. The feature works with standard ClickHouse tables and supports common query patterns including filtering, aggregation, and joins.

It does this with help from the following built-in tools:

* `list_databases` - List all available databases in the ClickHouse instance
* `list_tables_in_database` - List all tables in a specific database
* `get_schema_for_table` - Get the `CREATE TABLE` statement (schema) for a specific table
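
These tools map onto ordinary schema introspection, so you can run roughly equivalent checks yourself. The sketch below uses the `uk` database and `uk_price_paid` table from the SQL playground as illustrative names:

```sql
-- What list_databases discovers
SHOW DATABASES;

-- What list_tables_in_database does for a given database
SHOW TABLES FROM uk;

-- What get_schema_for_table returns for a given table
SHOW CREATE TABLE uk.uk_price_paid;
```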

## Prerequisites {#prerequisites}

We'll need to add an Anthropic or OpenAI API key as an environment variable:

```bash
export ANTHROPIC_API_KEY=your_api_key
export OPENAI_API_KEY=your_api_key
```

Alternatively, you can [provide a configuration file](https://clickhouse.com/docs/interfaces/cli#ai-sql-generation-configuration).
## Connecting to the ClickHouse SQL playground {#connecting-to-the-clickhouse-sql-playground}

We're going to explore this feature using the [ClickHouse SQL playground](https://sql.clickhouse.com/).

We can connect to the ClickHouse SQL playground using the following command:

```bash
clickhouse client -mn \
  --host sql-clickhouse.clickhouse.com \
  --secure \
  --user demo --password ''
```

:::note
We'll assume you have ClickHouse installed, but if not, refer to the [installation guide](https://clickhouse.com/docs/install).
:::

## Asking ClickHouse questions in natural language {#asking-clickhouse-questions-in-natural-language}

Now it's time to start asking some questions!

The text-to-SQL feature is effectively a one-shot query generation tool. Since it doesn't maintain conversation history, include as much useful context as possible in your question. Be specific about:

* Time periods or date ranges
* The type of analysis you want (averages, totals, rankings, etc.)
* Any filtering criteria
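
For example, a prompt that spells out the time period, the metric, and the filters gives the AI much more to work with than "show me house prices". The question below is a hypothetical illustration, not one we run later in this guide:

```sql
?? What was the average house price per county in 2022, only counting counties with at least 100 sales?;
```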

### Finding expensive housing markets {#finding-expensive-housing-markets}

Let's start by asking a question about house prices. The SQL playground contains a UK house prices dataset, which the AI will automatically discover:

```sql
?? Can you tell me the most expensive place to buy a house in 2021?;
```

Once we press Enter, we'll see the AI's thought process as it tries to answer our question:
```text
• Starting AI SQL generation with schema discovery...
─────────────────────────────────────────────────
🧠 thinking...[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01QpQd35ZD8HM9QSYFZuQ8Wh
🔧 Calling: list_databases [toolu_01...]
🧠 thinking✓ list_databases completed
└─ Found 37 databases: - amazon - bluesky - country - covid - default - dns - en...
🔧 Calling: list_databases [toolu_01...]
✓ list_databases completed
└─ Found 37 databases: - amazon - bluesky - country - covid - default - dns - en...
🧠 thinking[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_016DxK8SMVKGrUPfFwi43AzE
🔧 Calling: list_tables_in_database [toolu_01...]
└─ Args: {"database":"uk"}
🧠 thinking.✓ list_tables_in_database completed
└─ Found 9 tables in database 'uk': - uk_codes - uk_postcode_to_iso - uk_price_p...
🔧 Calling: list_tables_in_database [toolu_01...]
└─ Args: {"database":"uk"}
✓ list_tables_in_database completed
└─ Found 9 tables in database 'uk': - uk_codes - uk_postcode_to_iso - uk_price_p...
🧠 thinking.[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_015giF12wm619beNGC5aTrmw
🔧 Calling: get_schema_for_table [toolu_01...]
└─ Args: {"database":"uk","table":"uk_price_paid"}
✓ get_schema_for_table completed
└─ Schema for uk.uk_price_paid: CREATE TABLE uk.uk_price_paid ( `price` UInt...
🔧 Calling: get_schema_for_table [toolu_01...]
└─ Args: {"database":"uk","table":"uk_price_paid"}
🧠 thinking..✓ get_schema_for_table completed
└─ Schema for uk.uk_price_paid: CREATE TABLE uk.uk_price_paid ( `price` UInt...
🧠 thinking[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01HxT1HKbaTT3165Wx5bDtY9
─────────────────────────────────────────────────
• ✨ SQL query generated successfully!
:) SELECT town, district, county, round(avg(price), 2) as avg_price, count() as total_sales FROM uk.uk_price_paid WHERE date >= '2021-01-01' AND date <= '2021-12-31' GROUP BY town, district, county HAVING total_sales >= 10 ORDER BY avg_price DESC LIMIT 10
```

The AI follows these steps:

1. Schema discovery - Explores available databases and tables
2. Table analysis - Examines the structure of relevant tables
3. Query generation - Creates SQL based on your question and the discovered schema

We can see that it found the `uk_price_paid` table and generated a query for us to run.
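
Formatted for readability, the generated query looks like this:

```sql
SELECT
    town,
    district,
    county,
    round(avg(price), 2) AS avg_price,
    count() AS total_sales
FROM uk.uk_price_paid
WHERE date >= '2021-01-01' AND date <= '2021-12-31'
GROUP BY town, district, county
HAVING total_sales >= 10
ORDER BY avg_price DESC
LIMIT 10
```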

If we run that query, we'll see the following output:

```text
┌─town───────────┬─district───────────────┬─county──────────┬──avg_price─┬─total_sales─┐
│ ILKLEY         │ HARROGATE              │ NORTH YORKSHIRE │    4310200 │          10 │
│ LONDON         │ CITY OF LONDON         │ GREATER LONDON  │ 4008117.32 │         311 │
│ LONDON         │ CITY OF WESTMINSTER    │ GREATER LONDON  │ 2847409.81 │        3984 │
│ LONDON         │ KENSINGTON AND CHELSEA │ GREATER LONDON  │  2331433.1 │        2594 │
│ EAST MOLESEY   │ RICHMOND UPON THAMES   │ GREATER LONDON  │ 2244845.83 │          12 │
│ LEATHERHEAD    │ ELMBRIDGE              │ SURREY          │ 2051836.42 │         102 │
│ VIRGINIA WATER │ RUNNYMEDE              │ SURREY          │ 1914137.53 │         169 │
│ REIGATE        │ MOLE VALLEY            │ SURREY          │ 1715780.89 │          18 │
│ BROADWAY       │ TEWKESBURY             │ GLOUCESTERSHIRE │ 1633421.05 │          19 │
│ OXFORD         │ SOUTH OXFORDSHIRE      │ OXFORDSHIRE     │ 1628319.07 │         405 │
└────────────────┴────────────────────────┴─────────────────┴────────────┴─────────────┘
```

If we want to ask follow-up questions, we need to ask them from scratch.

### Finding expensive properties in Greater London {#finding-expensive-properties-in-greater-london}

Since the feature doesn't maintain conversation history, each query must be self-contained. When asking follow-up questions, you need to provide the full context rather than referring to previous queries.

For example, after seeing the previous results, we might want to focus specifically on Greater London properties. Rather than asking "What about Greater London?", we need to include the complete context:

```sql
?? Can you tell me the most expensive place to buy a house in Greater London across the years?;
```

Notice that the AI goes through the same discovery process, even though it just examined this data:
```text
• Starting AI SQL generation with schema discovery...
─────────────────────────────────────────────────
🧠 thinking[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_012m4ayaSHTYtX98gxrDy1rz
🔧 Calling: list_databases [toolu_01...]
✓ list_databases completed
└─ Found 37 databases: - amazon - bluesky - country - covid - default - dns - en...
🔧 Calling: list_databases [toolu_01...]
🧠 thinking.✓ list_databases completed
└─ Found 37 databases: - amazon - bluesky - country - covid - default - dns - en...
🧠 thinking.[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01KU4SZRrJckutXUzfJ4NQtA
🔧 Calling: list_tables_in_database [toolu_01...]
└─ Args: {"database":"uk"}
🧠 thinking..✓ list_tables_in_database completed
└─ Found 9 tables in database 'uk': - uk_codes - uk_postcode_to_iso - uk_price_p...
🔧 Calling: list_tables_in_database [toolu_01...]
└─ Args: {"database":"uk"}
✓ list_tables_in_database completed
└─ Found 9 tables in database 'uk': - uk_codes - uk_postcode_to_iso - uk_price_p...
🧠 thinking[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01X9CnxoBpbD2xj2UzuRy2is
🔧 Calling: get_schema_for_table [toolu_01...]
└─ Args: {"database":"uk","table":"uk_price_paid"}
🧠 thinking.✓ get_schema_for_table completed
└─ Schema for uk.uk_price_paid: CREATE TABLE uk.uk_price_paid ( `price` UInt...
🔧 Calling: get_schema_for_table [toolu_01...]
└─ Args: {"database":"uk","table":"uk_price_paid"}
✓ get_schema_for_table completed
└─ Schema for uk.uk_price_paid: CREATE TABLE uk.uk_price_paid ( `price` UInt...
🧠 thinking...[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01QTMypS1XuhjgVpDir7N9wD
─────────────────────────────────────────────────
• ✨ SQL query generated successfully!
:) SELECT district, toYear(date) AS year, round(avg(price), 2) AS avg_price, count() AS total_sales FROM uk.uk_price_paid WHERE county = 'GREATER LONDON' GROUP BY district, year HAVING total_sales >= 10 ORDER BY avg_price DESC LIMIT 10;
```

This generates a more targeted query that filters specifically for Greater London and breaks down results by year.
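
Written out in full, the generated query is:

```sql
SELECT
    district,
    toYear(date) AS year,
    round(avg(price), 2) AS avg_price,
    count() AS total_sales
FROM uk.uk_price_paid
WHERE county = 'GREATER LONDON'
GROUP BY district, year
HAVING total_sales >= 10
ORDER BY avg_price DESC
LIMIT 10;
```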

The output of the query is shown below:

```text
┌─district────────────┬─year─┬───avg_price─┬─total_sales─┐
│ CITY OF LONDON      │ 2019 │ 14504772.73 │         299 │
│ CITY OF LONDON      │ 2017 │  6351366.11 │         367 │
│ CITY OF LONDON      │ 2016 │  5596348.25 │         243 │
│ CITY OF LONDON      │ 2023 │  5576333.72 │         252 │
│ CITY OF LONDON      │ 2018 │  4905094.54 │         523 │
│ CITY OF LONDON      │ 2021 │  4008117.32 │         311 │
│ CITY OF LONDON      │ 2025 │  3954212.39 │          56 │
│ CITY OF LONDON      │ 2014 │  3914057.39 │         416 │
│ CITY OF LONDON      │ 2022 │  3700867.19 │         290 │
│ CITY OF WESTMINSTER │ 2018 │  3562457.76 │        3346 │
└─────────────────────┴──────┴─────────────┴─────────────┘
```

The City of London consistently appears as the most expensive district! You'll notice the AI created a reasonable query, though the results are ordered by average price rather than chronologically. For a year-over-year analysis, we might refine our question to ask specifically for "the most expensive district each year" to get results grouped differently.
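
Such a refined prompt might look like the following. This is a hypothetical example we haven't run, and the exact SQL the AI produces will vary between runs:

```sql
?? For each year, which Greater London district had the highest average house price? Only include districts with at least 10 sales in that year;
```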
