|
| 1 | +--- |
| 2 | +slug: /use-cases/AI/ai-powered-sql-generation |
| 3 | +sidebar_label: 'AI-powered SQL generation' |
| 4 | +title: 'AI-powered SQL generation' |
| 5 | +pagination_prev: null |
| 6 | +pagination_next: null |
| 7 | +description: 'This guide explains how to use AI to generate SQL queries in ClickHouse Client or clickhouse-local.' |
| 8 | +keywords: ['AI', 'SQL generation'] |
| 9 | +show_related_blogs: true |
| 10 | +--- |
| 11 | + |
| 12 | +Starting from ClickHouse 25.7, [ClickHouse Client](https://clickhouse.com/docs/interfaces/cli) and [clickhouse-local](https://clickhouse.com/docs/operations/utilities/clickhouse-local) include [AI-powered functionality](https://clickhouse.com/docs/interfaces/cli#ai-sql-generation) that converts natural language descriptions into SQL queries. This feature allows users to describe their data requirements in plain text, which the system then translates into corresponding SQL statements. |
| 13 | + |
| 14 | +This capability is particularly useful for users who may not be familiar with complex SQL syntax or need to quickly generate queries for exploratory data analysis. The feature works with standard ClickHouse tables and supports common query patterns including filtering, aggregation, and joins. |
| 15 | + |
| 16 | +It does this with the help of the following built-in tools: |
| 17 | + |
| 18 | +* `list_databases` - List all available databases in the ClickHouse instance |
| 19 | +* `list_tables_in_database` - List all tables in a specific database |
| 20 | +* `get_schema_for_table` - Get the `CREATE TABLE` statement (schema) for a specific table |
| 21 | + |
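The source does not describe how these tools are implemented. As a rough mental model only, each one corresponds to a statement you could run by hand. The sketch below prints those statements; `discovery_sql` is a hypothetical helper name, and the mapping from tool to statement is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ClickHouse): prints the SQL statements
# that roughly correspond to the three discovery tools listed above.
# The tool-to-statement mapping is an assumption, not documented behavior.
discovery_sql() {
  local database="$1" table="$2"
  printf 'SHOW DATABASES;\n'                                   # ~ list_databases
  printf 'SHOW TABLES FROM %s;\n' "$database"                  # ~ list_tables_in_database
  printf 'SHOW CREATE TABLE %s.%s;\n' "$database" "$table"     # ~ get_schema_for_table
}

discovery_sql uk uk_price_paid
```

Running any of the printed statements in the client yourself is a quick way to see roughly what context the AI has to work with.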
| 22 | +## Prerequisites {#prerequisites} |
| 23 | + |
| 24 | +First, set an Anthropic or OpenAI API key as an environment variable. Only one of the two is needed: |
| 25 | + |
| 26 | +```bash |
| 27 | +export ANTHROPIC_API_KEY=your_api_key |
| 28 | +export OPENAI_API_KEY=your_api_key |
| 29 | +``` |
| 30 | + |
| 31 | +Alternatively, you can [provide a configuration file](https://clickhouse.com/docs/interfaces/cli#ai-sql-generation-configuration). |
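Before launching the client, it can help to confirm that one of the two variables is actually set in the current shell. This is a minimal sketch: `has_ai_key` is a hypothetical helper, not a ClickHouse command, and it does not check for a configuration file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ClickHouse): succeeds if at least one of
# the two API key variables named in this guide is set and non-empty.
has_ai_key() {
  [ -n "${ANTHROPIC_API_KEY:-}" ] || [ -n "${OPENAI_API_KEY:-}" ]
}

if has_ai_key; then
  echo "API key found"
else
  echo "Set ANTHROPIC_API_KEY or OPENAI_API_KEY first" >&2
fi
```

The check only looks at the environment, so it says nothing about whether the key itself is valid.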
| 32 | + |
| 33 | +## Connecting to the ClickHouse SQL playground {#connecting-to-the-clickhouse-sql-playground} |
| 34 | + |
| 35 | +We're going to explore this feature using the [ClickHouse SQL playground](https://sql.clickhouse.com/). |
| 36 | + |
| 37 | +We can connect to the ClickHouse SQL playground using the following command: |
| 38 | + |
| 39 | +```bash |
| 40 | +clickhouse client -mn \ |
| 41 | +--host sql-clickhouse.clickhouse.com \ |
| 42 | +--secure \ |
| 43 | +--user demo --password '' |
| 44 | +``` |
| 45 | + |
| 46 | +:::note |
| 47 | +We'll assume you have ClickHouse installed. If not, refer to the [installation guide](https://clickhouse.com/docs/install). |
| 48 | +::: |
| 49 | + |
| 50 | +## Asking ClickHouse questions in natural language {#asking-clickhouse-questions-in-natural-language} |
| 51 | + |
| 52 | +Now it's time to start asking some questions! |
| 53 | + |
| 54 | +The text-to-SQL feature is effectively a one-shot query generation tool. Since it doesn't maintain conversation history, include as much useful context as possible in your question. Be specific about: |
| 55 | + |
| 56 | +* Time periods or date ranges |
| 57 | +* The type of analysis you want (averages, totals, rankings, etc.) |
| 58 | +* Any filtering criteria |
| 59 | + |
| 60 | +### Finding expensive housing markets {#finding-expensive-housing-markets} |
| 61 | + |
| 62 | +Let's start by asking a question about house prices. The SQL playground contains a UK house prices dataset, which the AI will automatically discover: |
| 63 | + |
| 64 | +```sql |
| 65 | +?? Can you tell me the most expensive place to buy a house in 2021?; |
| 66 | +``` |
| 67 | + |
| 68 | +Once we press enter, we'll see the thought process of the AI as it tries to answer our question. |
| 69 | + |
| 70 | +```text |
| 71 | +• Starting AI SQL generation with schema discovery... |
| 72 | +───────────────────────────────────────────────── |
| 73 | +🧠 thinking...[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01QpQd35ZD8HM9QSYFZuQ8Wh |
| 74 | +🔧 Calling: list_databases [toolu_01...] |
| 75 | +🧠 thinking✓ list_databases completed |
| 76 | + └─ Found 37 databases: - amazon - bluesky - country - covid - default - dns - en... |
| 77 | +🔧 Calling: list_databases [toolu_01...] |
| 78 | +✓ list_databases completed |
| 79 | + └─ Found 37 databases: - amazon - bluesky - country - covid - default - dns - en... |
| 80 | +🧠 thinking[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_016DxK8SMVKGrUPfFwi43AzE |
| 81 | +🔧 Calling: list_tables_in_database [toolu_01...] |
| 82 | + └─ Args: {"database":"uk"} |
| 83 | +🧠 thinking.✓ list_tables_in_database completed |
| 84 | + └─ Found 9 tables in database 'uk': - uk_codes - uk_postcode_to_iso - uk_price_p... |
| 85 | +🔧 Calling: list_tables_in_database [toolu_01...] |
| 86 | + └─ Args: {"database":"uk"} |
| 87 | +✓ list_tables_in_database completed |
| 88 | + └─ Found 9 tables in database 'uk': - uk_codes - uk_postcode_to_iso - uk_price_p... |
| 89 | +🧠 thinking.[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_015giF12wm619beNGC5aTrmw |
| 90 | +🔧 Calling: get_schema_for_table [toolu_01...] |
| 91 | + └─ Args: {"database":"uk","table":"uk_price_paid"} |
| 92 | +✓ get_schema_for_table completed |
| 93 | + └─ Schema for uk.uk_price_paid: CREATE TABLE uk.uk_price_paid ( `price` UInt... |
| 94 | +🔧 Calling: get_schema_for_table [toolu_01...] |
| 95 | + └─ Args: {"database":"uk","table":"uk_price_paid"} |
| 96 | +🧠 thinking..✓ get_schema_for_table completed |
| 97 | + └─ Schema for uk.uk_price_paid: CREATE TABLE uk.uk_price_paid ( `price` UInt... |
| 98 | +🧠 thinking[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01HxT1HKbaTT3165Wx5bDtY9 |
| 99 | +───────────────────────────────────────────────── |
| 100 | +• ✨ SQL query generated successfully! |
| 101 | +:) SELECT town, district, county, round(avg(price), 2) as avg_price, count() as total_sales FROM uk.uk_price_paid WHERE date >= '2021-01-01' AND date <= '2021-12-31' GROUP BY town, district, county HAVING total_sales >= 10 ORDER BY avg_price DESC LIMIT 10 |
| 102 | +``` |
| 103 | + |
| 104 | +The AI follows these steps: |
| 105 | + |
| 106 | +1. Schema discovery - Explores available databases and tables |
| 107 | +2. Table analysis - Examines the structure of relevant tables |
| 108 | +3. Query generation - Creates SQL based on your question and the discovered schema |
| 109 | + |
| 110 | +We can see that it did find the `uk_price_paid` table and generated a query for us to run. |
| 111 | +If we run that query, we'll see the following output: |
| 112 | + |
| 113 | +```text |
| 114 | +┌─town───────────┬─district───────────────┬─county──────────┬──avg_price─┬─total_sales─┐ |
| 115 | +│ ILKLEY │ HARROGATE │ NORTH YORKSHIRE │ 4310200 │ 10 │ |
| 116 | +│ LONDON │ CITY OF LONDON │ GREATER LONDON │ 4008117.32 │ 311 │ |
| 117 | +│ LONDON │ CITY OF WESTMINSTER │ GREATER LONDON │ 2847409.81 │ 3984 │ |
| 118 | +│ LONDON │ KENSINGTON AND CHELSEA │ GREATER LONDON │ 2331433.1 │ 2594 │ |
| 119 | +│ EAST MOLESEY │ RICHMOND UPON THAMES │ GREATER LONDON │ 2244845.83 │ 12 │ |
| 120 | +│ LEATHERHEAD │ ELMBRIDGE │ SURREY │ 2051836.42 │ 102 │ |
| 121 | +│ VIRGINIA WATER │ RUNNYMEDE │ SURREY │ 1914137.53 │ 169 │ |
| 122 | +│ REIGATE │ MOLE VALLEY │ SURREY │ 1715780.89 │ 18 │ |
| 123 | +│ BROADWAY │ TEWKESBURY │ GLOUCESTERSHIRE │ 1633421.05 │ 19 │ |
| 124 | +│ OXFORD │ SOUTH OXFORDSHIRE │ OXFORDSHIRE │ 1628319.07 │ 405 │ |
| 125 | +└────────────────┴────────────────────────┴─────────────────┴────────────┴─────────────┘ |
| 126 | +``` |
| 127 | + |
| 128 | +If we want to ask follow-up questions, we need to ask each one from scratch. |
| 129 | + |
| 130 | +### Finding expensive properties in Greater London {#finding-expensive-properties-in-greater-london} |
| 131 | + |
| 132 | +Since the feature doesn't maintain conversation history, each query must be self-contained. When asking follow-up questions, you need to provide the full context rather than referring to previous queries. |
| 133 | +For example, after seeing the previous results, we might want to focus specifically on Greater London properties. Rather than asking "What about Greater London?", we need to include the complete context: |
| 134 | + |
| 135 | +```sql |
| 136 | +?? Can you tell me the most expensive place to buy a house in Greater London across the years?; |
| 137 | +``` |
| 138 | + |
| 139 | +Notice that the AI goes through the same discovery process, even though it just examined this data: |
| 140 | + |
| 141 | +```text |
| 142 | +• Starting AI SQL generation with schema discovery... |
| 143 | +───────────────────────────────────────────────── |
| 144 | +🧠 thinking[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_012m4ayaSHTYtX98gxrDy1rz |
| 145 | +🔧 Calling: list_databases [toolu_01...] |
| 146 | +✓ list_databases completed |
| 147 | + └─ Found 37 databases: - amazon - bluesky - country - covid - default - dns - en... |
| 148 | +🔧 Calling: list_databases [toolu_01...] |
| 149 | +🧠 thinking.✓ list_databases completed |
| 150 | + └─ Found 37 databases: - amazon - bluesky - country - covid - default - dns - en... |
| 151 | +🧠 thinking.[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01KU4SZRrJckutXUzfJ4NQtA |
| 152 | +🔧 Calling: list_tables_in_database [toolu_01...] |
| 153 | + └─ Args: {"database":"uk"} |
| 154 | +🧠 thinking..✓ list_tables_in_database completed |
| 155 | + └─ Found 9 tables in database 'uk': - uk_codes - uk_postcode_to_iso - uk_price_p... |
| 156 | +🔧 Calling: list_tables_in_database [toolu_01...] |
| 157 | + └─ Args: {"database":"uk"} |
| 158 | +✓ list_tables_in_database completed |
| 159 | + └─ Found 9 tables in database 'uk': - uk_codes - uk_postcode_to_iso - uk_price_p... |
| 160 | +🧠 thinking[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01X9CnxoBpbD2xj2UzuRy2is |
| 161 | +🔧 Calling: get_schema_for_table [toolu_01...] |
| 162 | + └─ Args: {"database":"uk","table":"uk_price_paid"} |
| 163 | +🧠 thinking.✓ get_schema_for_table completed |
| 164 | + └─ Schema for uk.uk_price_paid: CREATE TABLE uk.uk_price_paid ( `price` UInt... |
| 165 | +🔧 Calling: get_schema_for_table [toolu_01...] |
| 166 | + └─ Args: {"database":"uk","table":"uk_price_paid"} |
| 167 | +✓ get_schema_for_table completed |
| 168 | + └─ Schema for uk.uk_price_paid: CREATE TABLE uk.uk_price_paid ( `price` UInt... |
| 169 | +🧠 thinking...[INFO] Text generation successful - model: claude-3-5-sonnet-latest, response_id: msg_01QTMypS1XuhjgVpDir7N9wD |
| 170 | +───────────────────────────────────────────────── |
| 171 | +• ✨ SQL query generated successfully! |
| 172 | +:) SELECT district, toYear(date) AS year, round(avg(price), 2) AS avg_price, count() AS total_sales FROM uk.uk_price_paid WHERE county = 'GREATER LONDON' GROUP BY district, year HAVING total_sales >= 10 ORDER BY avg_price DESC LIMIT 10; |
| 173 | +``` |
| 174 | + |
| 175 | +This generates a more targeted query that filters specifically for Greater London and breaks down results by year. |
| 176 | +The output of the query is shown below: |
| 177 | + |
| 178 | +```text |
| 179 | +┌─district────────────┬─year─┬───avg_price─┬─total_sales─┐ |
| 180 | +│ CITY OF LONDON │ 2019 │ 14504772.73 │ 299 │ |
| 181 | +│ CITY OF LONDON │ 2017 │ 6351366.11 │ 367 │ |
| 182 | +│ CITY OF LONDON │ 2016 │ 5596348.25 │ 243 │ |
| 183 | +│ CITY OF LONDON │ 2023 │ 5576333.72 │ 252 │ |
| 184 | +│ CITY OF LONDON │ 2018 │ 4905094.54 │ 523 │ |
| 185 | +│ CITY OF LONDON │ 2021 │ 4008117.32 │ 311 │ |
| 186 | +│ CITY OF LONDON │ 2025 │ 3954212.39 │ 56 │ |
| 187 | +│ CITY OF LONDON │ 2014 │ 3914057.39 │ 416 │ |
| 188 | +│ CITY OF LONDON │ 2022 │ 3700867.19 │ 290 │ |
| 189 | +│ CITY OF WESTMINSTER │ 2018 │ 3562457.76 │ 3346 │ |
| 190 | +└─────────────────────┴──────┴─────────────┴─────────────┘ |
| 191 | +``` |
| 192 | + |
| 193 | +The City of London consistently appears as the most expensive district! The AI created a reasonable query, though the results are ordered by average price rather than chronologically. For a year-over-year analysis, we could refine our question to ask specifically for "the most expensive district each year" to get results grouped differently. |
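For comparison, a hand-written query that answers "the most expensive district each year" might look like the sketch below. It is not output from the AI feature, and the AI may generate something different. It reuses the table, columns, and filters from the generated query above and adds ClickHouse's `LIMIT ... BY` clause to keep one row per year. `run_on_playground` is a hypothetical wrapper around the connection command used earlier in this guide.

```bash
#!/usr/bin/env bash
# Hand-written sketch, not AI output: top district per year in Greater London.
query=$(cat <<'SQL'
SELECT
    toYear(date) AS year,
    district,
    round(avg(price), 2) AS avg_price,
    count() AS total_sales
FROM uk.uk_price_paid
WHERE county = 'GREATER LONDON'
GROUP BY year, district
HAVING total_sales >= 10
ORDER BY year ASC, avg_price DESC
LIMIT 1 BY year
SQL
)

# Hypothetical wrapper: sends a query to the SQL playground using the same
# host, user, and password as the connection command shown earlier.
run_on_playground() {
  clickhouse client --host sql-clickhouse.clickhouse.com --secure \
    --user demo --password '' --query "$1"
}

# Print the query; call run_on_playground "$query" to execute it.
printf '%s\n' "$query"
```

Because rows are sorted by year and then by descending average price, `LIMIT 1 BY year` keeps only the highest-priced district for each year.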