
Commit 59d3159

cfahlgren1, lhoestq, julien-c, and severo authored
add sql console section (#1434)
* add sql console section
* reorder sections
* make sql console bold
* add wasm
* make sql console be its own sub page
* Update docs/hub/datasets-viewer-sql-console.md
  Co-authored-by: Quentin Lhoest <[email protected]>
* Update docs/hub/_toctree.yml
  Co-authored-by: Quentin Lhoest <[email protected]>
* Update docs/hub/datasets-viewer.md
  Co-authored-by: Julien Chaumond <[email protected]>
* Update docs/hub/datasets-viewer.md
  Co-authored-by: Sylvain Lesage <[email protected]>
* Update docs/hub/datasets-viewer-sql-console.md
  Co-authored-by: Julien Chaumond <[email protected]>
* add examples, remove .pyc
* update headings
* update title for seo
* Update docs/hub/datasets-viewer-sql-console.md
  Co-authored-by: Quentin Lhoest <[email protected]>
* Update docs/hub/datasets-viewer-sql-console.md
  Co-authored-by: Quentin Lhoest <[email protected]>
* Update docs/hub/datasets-viewer-sql-console.md
  Co-authored-by: Quentin Lhoest <[email protected]>
* reorder sections
* update tip

---------

Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: Sylvain Lesage <[email protected]>
1 parent c2a741f commit 59d3159

4 files changed, 139 additions and 0 deletions


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 node_modules/
+__pycache__/
 .vscode/
 .idea/

docs/hub/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -208,6 +208,8 @@
   title: Configure the Dataset Viewer
 - local: datasets-viewer-embed
   title: Embed the Dataset Viewer in a webpage
+- local: datasets-viewer-sql-console
+  title: "SQL Console"
 - local: datasets-download-stats
   title: Datasets Download Stats
 - local: datasets-data-files-configuration

docs/hub/datasets-viewer-sql-console.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# SQL Console: Query Hugging Face datasets in your browser

You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the dataset page by clicking on the **SQL Console** badge.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/sql-console-histogram.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/sql-console-histogram-dark.png"/>
</div>

<p class="text-sm text-center italic">
To learn more about the SQL Console, see the <a href="https://huggingface.co/blog/sql-console" target="_blank" rel="noopener noreferrer">SQL Console blog post</a>.
</p>

Through the SQL Console, you can:

- Run [DuckDB SQL queries](https://duckdb.org/docs/sql/query_syntax/select) on the dataset (_check out [SQL Snippets](https://huggingface.co/spaces/cfahlgren1/sql-snippets) for useful queries_)
- Share results of the query with others via a link (_check out [this example](https://huggingface.co/datasets/gretelai/synthetic-gsm8k-reflection-405b?sql_console=true&sql=FROM+histogram%28%0A++train%2C%0A++topic%2C%0A++bin_count+%3A%3D+10%0A%29)_)
- Download the results of the query to a Parquet file
- Embed the results of the query in your own webpage using an iframe

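The example link in the list above carries the query in the URL itself, through the `sql_console=true` and `sql` query parameters. As a minimal sketch, assuming those two parameters are all the dataset page needs, such a link can be rebuilt with the Python standard library. The helper name `sql_console_url` is hypothetical and is not part of any Hub API.

```python
from urllib.parse import urlencode

HUB_DATASETS_URL = "https://huggingface.co/datasets"


def sql_console_url(dataset_id: str, query: str) -> str:
    """Build a dataset-page link that opens the SQL Console with `query`.

    Assumption: the page reads the `sql_console` and `sql` query parameters,
    as seen in the example link above. This is not an official Hub API.
    """
    # urlencode applies quote_plus, so spaces become "+" and newlines "%0A".
    params = urlencode({"sql_console": "true", "sql": query})
    return f"{HUB_DATASETS_URL}/{dataset_id}?{params}"


query = "FROM histogram(\n  train,\n  topic,\n  bin_count := 10\n)"
url = sql_console_url("gretelai/synthetic-gsm8k-reflection-405b", query)
print(url)
```

With this dataset and query, the printed URL is the same as the example link above.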
<Tip>
You can also use DuckDB locally through the CLI to query the dataset via the `hf://` protocol. See the <a href="https://huggingface.co/docs/hub/en/datasets-duckdb" target="_blank" rel="noopener noreferrer">DuckDB Datasets documentation</a> for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI.
</Tip>

## Examples

### Filtering

The SQL Console makes filtering datasets easy. For example, to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length greater than 10, you can use the following query:

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length-dark.png"/>
</div>

In the query, we use the `len` function to get the length of the `reasoning_chains` column and the `bar` function to draw a text bar for each reasoning length.

```sql
SELECT len(reasoning_chains) AS reason_len, bar(reason_len, 0, 100), *
FROM train
WHERE reason_len > 10
ORDER BY reason_len DESC
```

The [bar](https://duckdb.org/docs/sql/functions/char.html#barx-min-max-width) function is a built-in DuckDB function that draws a bar whose width is proportional to where the value falls between `min` and `max` (here, 0 and 100).

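To make the `bar` output more concrete, here is a rough Python approximation of the idea: a run of block characters whose length is proportional to where a value falls between a minimum and a maximum. The helper `text_bar` and the sample values are hypothetical. This is only an illustration, and DuckDB's own implementation may differ in details such as rounding and padding.

```python
def text_bar(x: float, lo: float, hi: float, width: int = 80) -> str:
    """Return a run of block characters proportional to (x - lo) / (hi - lo)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Clamp so values outside [lo, hi] give an empty or a full bar.
    fraction = min(max((x - lo) / (hi - lo), 0.0), 1.0)
    return "█" * round(fraction * width)


# Made-up reasoning lengths, drawn on the same 0..100 scale as the query.
for reason_len in (12, 50, 100):
    print(f"{reason_len:>3} {text_bar(reason_len, 0, 100, width=20)}")
```

Sorting by the underlying value, as the query does with `ORDER BY reason_len DESC`, makes the bars read like a sorted chart.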
### Histogram

Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values.

For example, to plot a histogram of the reasoning lengths (`len(reasoning_chains)`) in the `SkunkworksAI/reasoning-0.01` dataset, you can use the following query:

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple-dark.png"/>
</div>
<p class="text-sm text-center italic">
Learn more about the <code>histogram</code> function and parameters <a href="https://cfahlgren1-sql-snippets.hf.space/histogram" target="_blank" rel="noopener noreferrer">here</a>.
</p>

```sql
FROM histogram(train, len(reasoning_chains))
```

### Regex Matching

One of the most powerful features of DuckDB is its deep support for regular expressions, which is exposed through a family of regex functions.

Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the `SkunkworksAI/reasoning-0.01` dataset for instructions that contain markdown code blocks.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code-dark.png"/>
</div>
<p class="text-sm text-center italic">
Learn more about the DuckDB regex functions <a href="https://duckdb.org/docs/sql/functions/regular_expressions.html" target="_blank" rel="noopener noreferrer">here</a>.
</p>

```sql
SELECT *
FROM train
WHERE regexp_matches(instruction, '```[a-z]*\n')
LIMIT 100
```

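To see what the pattern in the query accepts, the same regular expression can be tried outside the console. The sketch below uses Python's `re` module rather than DuckDB's regex engine, on the assumption that both treat this simple pattern the same way. The helper `has_code_fence` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the query: three backticks, an optional lowercase
# language tag, then a newline.
CODE_FENCE = re.compile(r"```[a-z]*\n")


def has_code_fence(text: str) -> bool:
    """Return True if `text` contains a code fence followed by a newline."""
    # re.search looks for the pattern anywhere in the string.
    return CODE_FENCE.search(text) is not None


samples = [
    "Fix this function:\n```python\nprint('hi')\n```",
    "Explain what a closure is.",
    "Inline `code` only",
]
for sample in samples:
    print(has_code_fence(sample), repr(sample[:30]))
```

Because the language tag is restricted to `[a-z]*`, a fence tagged with uppercase letters or digits is not matched unless another fence in the same text is.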
### Leakage Detection

Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether rows from the test set also appear in the training set.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection-dark.png"/>
</div>

<p class="text-sm text-center italic">
Learn more about leakage detection <a href="https://huggingface.co/blog/lbourdois/lle">here</a>.
</p>

```sql
WITH
    overlapping_rows AS (
        -- Count the rows that appear in both splits.
        -- INTERSECT must compare the rows themselves, not the two row counts.
        SELECT COUNT(*) AS overlap_count
        FROM (
            SELECT * FROM train
            INTERSECT
            SELECT * FROM test
        ) shared_rows
    ),
    total_unique_rows AS (
        -- Count the distinct rows across both splits.
        SELECT COUNT(*) AS total_count
        FROM (
            SELECT * FROM train
            UNION
            SELECT * FROM test
        ) combined
    )
SELECT
    overlap_count,
    total_count,
    CASE
        WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count)
        ELSE 0
    END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
```
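The idea can be checked on a toy example outside the console. The sketch below uses Python's built-in `sqlite3` module as a stand-in for DuckDB, with two made-up tables named `train` and `test`. It counts the rows shared by both splits with a row-level `INTERSECT`, and the distinct rows overall with `UNION`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE train (instruction TEXT, label INTEGER);
    CREATE TABLE test (instruction TEXT, label INTEGER);
    INSERT INTO train VALUES ('a', 1), ('b', 2), ('c', 3);
    INSERT INTO test VALUES ('c', 3), ('d', 4);  -- ('c', 3) is in both splits
    """
)

LEAKAGE_QUERY = """
WITH
    overlapping_rows AS (
        SELECT COUNT(*) AS overlap_count
        FROM (SELECT * FROM train INTERSECT SELECT * FROM test)
    ),
    total_unique_rows AS (
        SELECT COUNT(*) AS total_count
        FROM (SELECT * FROM train UNION SELECT * FROM test)
    )
SELECT
    overlap_count,
    total_count,
    CASE
        WHEN total_count > 0 THEN overlap_count * 100.0 / total_count
        ELSE 0
    END AS overlap_percentage
FROM overlapping_rows, total_unique_rows
"""

overlap_count, total_count, overlap_percentage = conn.execute(LEAKAGE_QUERY).fetchone()
print(overlap_count, total_count, overlap_percentage)  # 1 4 25.0
```

`INTERSECT` compares whole rows, so this only counts rows that are identical in every column. Near-duplicates, such as the same text with different whitespace, are not detected.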

docs/hub/datasets-viewer.md

Lines changed: 5 additions & 0 deletions
@@ -20,6 +20,11 @@ Similarly, if you select one class from a categorical column, it will show only

 You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and will match any row containing the word. The text is searched in the columns of `string`, even if the values are nested in a dictionary or a list.

+## Run SQL queries on the dataset
+
+You can run SQL queries on the dataset in the browser using the SQL Console. This feature also leverages our [auto-conversion to Parquet](datasets-viewer#access-the-parquet-files).
+For more information see our guide on [SQL Console](./datasets-viewer-sql-console).
+
 ## Share a specific row

 You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser. For example https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc/test?p=2&row=241 will open the dataset viewer on the MRPC dataset, on the test split, and on the 241st row.
