docs/hub/datasets-viewer-sql-console.md
Lines changed: 108 additions & 2 deletions
@@ -20,5 +20,111 @@ Through the SQL Console, you can:
- Embed the results of the query in your own webpage using an iframe

<Tip>
You can also use the DuckDB CLI to query the dataset via the `hf://` protocol. See the <a href="https://huggingface.co/docs/hub/en/datasets-duckdb" target="_blank" rel="noopener noreferrer">DuckDB CLI documentation</a> for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL for creating the views and running your query in the DuckDB CLI.
</Tip>

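As a rough sketch of what querying over `hf://` can look like in the DuckDB CLI, the statements below create a `train` view and count its rows. The dataset name is borrowed from the examples later on this page. The `@~parquet` revision and the `default/train/*.parquet` path are assumptions about how the files are laid out, and the SQL that the `Copy to DuckDB CLI` button generates may differ.

```sql
-- Assumption: the auto-converted Parquet files sit under default/train/;
-- adjust the path to match the files actually present in the repository.
CREATE VIEW train AS
SELECT *
FROM 'hf://datasets/SkunkworksAI/reasoning-0.01@~parquet/default/train/*.parquet';

-- Query the view like any local table
SELECT COUNT(*) AS row_count
FROM train;
```
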
# Examples

## Leakage Detection

Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.

<p>
Learn more about leakage detection <a href="https://huggingface.co/blog/lbourdois/lle">here</a>.
</p>

```sql
WITH
    -- Rows present in both splits. The count is taken over the INTERSECT
    -- of the rows, not the INTERSECT of the two row counts.
    overlapping_rows AS (
        SELECT COUNT(*) AS overlap_count
        FROM (
            SELECT * FROM train
            INTERSECT
            SELECT * FROM test
        ) AS shared
    ),
    -- Distinct rows across both splits (UNION removes duplicates)
    total_unique_rows AS (
        SELECT COUNT(*) AS total_count
        FROM (
            SELECT * FROM train
            UNION
            SELECT * FROM test
        ) AS combined
    )
SELECT
    overlap_count,
    total_count,
    CASE
        WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count)
        ELSE 0
    END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
```
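
To inspect which rows leak rather than only how many, a small variation can list them. This is a sketch that assumes, like the query above, that `train` and `test` share the same columns; the `LIMIT` of 10 is arbitrary.

```sql
-- Show up to 10 distinct rows that are present in both splits
SELECT * FROM train
INTERSECT
SELECT * FROM test
LIMIT 10;
```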

## Filtering

The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length greater than 10, you can use the query below. It uses the `len` function to get the length of the `reasoning_chains` column and the `bar` function to draw a bar chart of the reasoning lengths:

```sql
select len(reasoning_chains) as reason_len, bar(reason_len, 0, 100), *
from train
where reason_len > 10
order by reason_len desc
```
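
The [bar](https://duckdb.org/docs/sql/functions/char.html#barx-min-max-width) function is a neat built-in DuckDB function, `bar(x, min, max[, width])`, that draws a text bar whose length is proportional to `x` within the range `min` to `max`. Here it gives a quick visual comparison of the reasoning lengths.
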
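To see `bar` in isolation, the sketch below draws bars for a handful of made-up values. The values come from `range` and are purely illustrative, and the width of 20 characters is an arbitrary choice.

```sql
-- Draw a bar for each of 0, 25, 50, 75, 100 on a 0-100 scale,
-- where a full bar is 20 characters wide
SELECT
    x,
    bar(x, 0, 100, 20) AS chart
FROM range(0, 101, 25) AS t(x);
```
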
## Histogram

Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values.

<p>
Learn more about the `histogram` function and parameters <a href="https://cfahlgren1-sql-snippets.hf.space/histogram" target="_blank" rel="noopener noreferrer">here</a>.
</p>

For example, to plot a histogram of the reasoning lengths (`len(reasoning_chains)`) in the `SkunkworksAI/reasoning-0.01` dataset, you can use the following query:

```sql
from histogram(train, len(reasoning_chains))
```
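
If the default binning is too coarse or too fine, the number of bins can be adjusted. The sketch below assumes the `histogram` table macro accepts a `bin_count` named parameter, as it does in recent DuckDB releases; the value 20 is arbitrary.

```sql
-- Same histogram as above, but requesting about 20 bins
from histogram(train, len(reasoning_chains), bin_count := 20)
```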

## Regex Matching

One of the most powerful features of DuckDB is its deep support for regular expressions. You can use the `regexp_*` family of functions to match patterns in your data.

Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the `SkunkworksAI/reasoning-0.01` dataset for instructions that contain markdown code blocks.

Learn more about the DuckDB regex functions <a href="https://duckdb.org/docs/sql/functions/regular_expressions.html" target="_blank" rel="noopener noreferrer">here</a>.
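
A minimal sketch of the filter described above might look like the following. It assumes the instructions are stored in a column named `instruction` and treats three backticks followed by an optional lowercase language tag and a newline as the start of a code block; both the column name and the pattern are assumptions, and the `LIMIT` is arbitrary.

```sql
-- Keep rows whose instruction contains an opening markdown code fence:
-- three backticks, an optional lowercase language tag, then a newline.
-- `instruction` is an assumed column name.
SELECT *
FROM train
WHERE regexp_matches(instruction, '`{3}[a-z]*\n')
LIMIT 100;
```

`regexp_matches` returns true when the pattern is found anywhere in the string, so no leading or trailing wildcard is needed.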