|
| 1 | +# SQL Console: Query Hugging Face datasets in your browser |
| 2 | + |
| 3 | +You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the dataset page by clicking on the **SQL Console** badge. |
| 4 | + |
| 5 | +<div class="flex justify-center"> |
| 6 | + <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/sql-console-histogram.png"/> |
| 7 | + <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/sql-console-histogram-dark.png"/> |
| 8 | +</div> |
| 9 | + |
| 10 | +<p class="text-sm text-center italic"> |
| 11 | + To learn more about the SQL Console, see the <a href="https://huggingface.co/blog/sql-console" target="_blank" rel="noopener noreferrer">SQL Console blog post</a>. |
| 12 | +</p> |
| 13 | + |
| 14 | + |
| 15 | +Through the SQL Console, you can: |
| 16 | + |
| 17 | +- Run [DuckDB SQL queries](https://duckdb.org/docs/sql/query_syntax/select) on the dataset (_check out [SQL Snippets](https://huggingface.co/spaces/cfahlgren1/sql-snippets) for useful queries_) |
| 18 | +- Share results of the query with others via a link (_check out [this example](https://huggingface.co/datasets/gretelai/synthetic-gsm8k-reflection-405b?sql_console=true&sql=FROM+histogram%28%0A++train%2C%0A++topic%2C%0A++bin_count+%3A%3D+10%0A%29)_) |
| 19 | +- Download the results of the query to a parquet file |
| 20 | +- Embed the results of the query in your own webpage using an iframe |
| 21 | + |
| 22 | +<Tip> |
| 23 | +You can also use DuckDB locally through the CLI to query the dataset via the `hf://` protocol. See the <a href="https://huggingface.co/docs/hub/en/datasets-duckdb" target="_blank" rel="noopener noreferrer">DuckDB Datasets documentation</a> for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL for creating views and running your query in the DuckDB CLI. |
| 24 | +</Tip> |
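As a rough illustration of what querying over `hf://` from the DuckDB CLI can look like, here is a minimal sketch. The path is an assumption: it presumes the dataset has an auto-converted Parquet branch (`@~parquet`) with a `default` config and a `train` split, so adjust it to the layout of the dataset you are querying.

```sql
-- Assumption: the Parquet files live under @~parquet/default/train/;
-- other datasets may use a different config name, split name, or file layout.
CREATE VIEW train AS
SELECT *
FROM read_parquet('hf://datasets/SkunkworksAI/reasoning-0.01@~parquet/default/train/*.parquet');

-- Once the view exists, queries can refer to `train` as in the SQL Console.
SELECT len(reasoning_chains) AS reason_len, instruction
FROM train
ORDER BY reason_len DESC
LIMIT 5;
```

Creating a view named after the split keeps queries portable between the SQL Console and the CLI.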
| 25 | + |
| 26 | + |
| 27 | +## Examples |
| 28 | + |
| 29 | +### Filtering |
| 30 | + |
| 31 | +The SQL Console makes it easy to filter datasets. For example, to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length greater than 10, you can use the following query: |
| 32 | + |
| 33 | +<div class="flex justify-center"> |
| 34 | + <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length.png"/> |
| 35 | + <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length-dark.png"/> |
| 36 | +</div> |
| 37 | + |
| 38 | +In the query, we can use the `len` function to get the length of the `reasoning_chains` column and the `bar` function to create a bar chart of the reasoning lengths. |
| 39 | + |
| 40 | +```sql |
| 41 | +SELECT len(reasoning_chains) AS reason_len, bar(reason_len, 0, 100), * |
| 42 | +FROM train |
| 43 | +WHERE reason_len > 10 |
| 44 | +ORDER BY reason_len DESC |
| 45 | +``` |
| 46 | + |
| 47 | +The [bar](https://duckdb.org/docs/sql/functions/char.html#barx-min-max-width) function is a built-in DuckDB function that renders each value as a text bar whose width is proportional to where the value falls between the given minimum and maximum (here `0` and `100`). |
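To see how `bar` scales its output without touching a dataset, a small self-contained sketch may help. The optional fourth argument sets the bar width in characters; the value `20` and the generated sample values are illustrative choices, not part of the original example.

```sql
-- Generate the values 0, 25, 50, 75, 100 and draw each as a bar.
-- bar(x, min, max, width): a value equal to max fills all `width` characters.
SELECT x, bar(x, 0, 100, 20) AS chart
FROM range(0, 101, 25) AS t(x);
```

Values near the minimum produce short bars and values near the maximum produce bars close to the full width, which is why choosing `min` and `max` to match the data range matters.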
| 48 | + |
| 49 | +### Histogram |
| 50 | + |
| 51 | +Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values. |
| 52 | + |
| 53 | +For example, to plot a histogram of the length of the `reasoning_chains` column in the `SkunkworksAI/reasoning-0.01` dataset, you can use the following query: |
| 54 | + |
| 55 | +<div class="flex justify-center"> |
| 56 | + <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple.png"/> |
| 57 | + <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple-dark.png"/> |
| 58 | +</div> |
| 59 | +<p class="text-sm text-center italic"> |
| 60 | + Learn more about the <code>histogram</code> function and parameters <a href="https://cfahlgren1-sql-snippets.hf.space/histogram" target="_blank" rel="noopener noreferrer">here</a>. |
| 61 | +</p> |
| 62 | + |
| 63 | +```sql |
| 64 | +FROM histogram(train, len(reasoning_chains)) |
| 65 | +``` |
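The number of bins can be set explicitly with the `bin_count` named parameter, which also appears in the shared example link near the top of this page. A minimal sketch, with `10` as an arbitrary choice:

```sql
-- Same histogram as above, but limited to 10 bins.
FROM histogram(train, len(reasoning_chains), bin_count := 10)
```

Fewer bins give a coarser overview, while more bins expose finer structure in the distribution.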
| 66 | + |
| 67 | +### Regex Matching |
| 68 | + |
| 69 | +One of the most powerful features of DuckDB is its deep support for regular expressions. You can use functions such as `regexp_matches` to match patterns in your data. |
| 70 | + |
| 71 | +Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the `SkunkworksAI/reasoning-0.01` dataset for instructions that contain Markdown code blocks. |
| 72 | + |
| 73 | +<div class="flex justify-center"> |
| 74 | + <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code.png"/> |
| 75 | + <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code-dark.png"/> |
| 76 | +</div> |
| 77 | +<p class="text-sm text-center italic"> |
| 78 | + Learn more about the DuckDB regex functions <a href="https://duckdb.org/docs/sql/functions/regular_expressions.html" target="_blank" rel="noopener noreferrer">here</a>. |
| 79 | +</p> |
| 80 | + |
| 81 | + |
| 82 | +```sql |
| 83 | +SELECT * |
| 84 | +FROM train |
| 85 | +WHERE regexp_matches(instruction, '```[a-z]*\n') |
| 86 | +LIMIT 100 |
| 87 | +``` |
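Beyond filtering, a capture group can pull the matched text out of each row. The following sketch, which is an illustration rather than part of the original example, uses `regexp_extract` to read the language tag after the opening fence and counts how often each tag occurs. The alias names `fence_lang` and `n` are made up for this example.

```sql
-- Extract the language tag that follows an opening code fence (capture group 1)
-- and count the instructions per tag. Only fences with a non-empty tag are kept.
SELECT regexp_extract(instruction, '```([a-z]+)\n', 1) AS fence_lang,
       COUNT(*) AS n
FROM train
WHERE regexp_matches(instruction, '```[a-z]+\n')
GROUP BY fence_lang
ORDER BY n DESC
LIMIT 20
```

Note that `regexp_extract` returns only the first match per row, so an instruction with several code blocks is counted once, under its first tag.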
| 88 | + |
| 89 | + |
| 90 | +### Leakage Detection |
| 91 | + |
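| 92 | +Leakage detection is the process of identifying whether the same data appears in more than one split of a dataset, for example, whether rows of the test set are also present in the training set. The query below counts the rows shared by the `train` and `test` splits and reports them as a percentage of all unique rows. |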
| 93 | + |
| 94 | +<div class="flex justify-center"> |
| 95 | + <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection.png"/> |
| 96 | + <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection-dark.png"/> |
| 97 | +</div> |
| 98 | + |
| 99 | +<p class="text-sm text-center italic"> |
| 100 | + Learn more about leakage detection <a href="https://huggingface.co/blog/lbourdois/lle">here</a>. |
| 101 | +</p> |
| 102 | + |
| 103 | +```sql |
| 104 | +WITH |
| 105 | + overlapping_rows AS ( |
| 106 | + -- Count the rows that appear in both splits. |
| 107 | + -- INTERSECT keeps only rows present in both train and test. |
| 108 | + SELECT COUNT(*) AS overlap_count |
| 109 | + FROM ( |
| 110 | + SELECT * FROM train |
| 111 | + INTERSECT |
| 112 | + SELECT * FROM test |
| 113 | + ) common_rows |
| 114 | + ), |
| 115 | + total_unique_rows AS ( |
| 116 | + SELECT COUNT(*) AS total_count |
| 117 | + FROM ( |
| 118 | + SELECT * FROM train |
| 119 | + UNION |
| 120 | + SELECT * FROM test |
| 121 | + ) combined |
| 122 | + ) |
| 123 | +SELECT |
| 124 | + overlap_count, |
| 125 | + total_count, |
| 126 | + CASE |
| 127 | + WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count) |
| 128 | + ELSE 0 |
| 129 | + END AS overlap_percentage |
| 130 | +FROM overlapping_rows, total_unique_rows; |
| 131 | +``` |
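To inspect the overlapping rows themselves rather than only counting them, a minimal sketch along the same lines is shown here. It assumes that `train` and `test` have the same columns in the same order, which `INTERSECT` requires.

```sql
-- List up to 100 rows that occur in both the train and test splits.
-- Assumption: both splits share an identical schema.
SELECT * FROM train
INTERSECT
SELECT * FROM test
LIMIT 100
```

Because `INTERSECT` compares entire rows, near-duplicates that differ in any column are not reported. Selecting only the columns that matter (for example, the input text) on both sides is one way to loosen the comparison.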