Commit 65b92c2

add examples, remove .pyc
1 parent 0ab43dd commit 65b92c2

3 files changed

+109
-2
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
11
node_modules/
2+
__pycache__/
23
.vscode/
34
.idea/
45

-145 Bytes
Binary file not shown.

docs/hub/datasets-viewer-sql-console.md

Lines changed: 108 additions & 2 deletions
@@ -20,5 +20,111 @@ Through the SQL Console, you can:
2020
- Embed the results of the query in your own webpage using an iframe
2121

2222
<Tip>
23-
You can also use a local DuckDB CLI to query the dataset via the `hf://` protocol. See the <a href="https://huggingface.co/docs/hub/en/datasets-duckdb" target="_blank" rel="noopener noreferrer">DuckDB CLI documentation</a> for more information.
24-
</Tip>
23+
You can also use the DuckDB CLI to query the dataset via the `hf://` protocol. See the <a href="https://huggingface.co/docs/hub/en/datasets-duckdb" target="_blank" rel="noopener noreferrer">DuckDB CLI documentation</a> for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI.
24+
</Tip>
25+
26+
27+
# Examples
28+
29+
## Leakage Detection
30+
31+
Leakage detection is the process of identifying whether the same data appears in more than one split of a dataset, for example, whether rows from the test set also appear in the training set.
32+
33+
<div class="flex justify-center">
34+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection.png"/>
35+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection-dark.png"/>
36+
</div>
37+
38+
<p class="text-sm text-center italic">
39+
Learn more about leakage detection <a href="https://huggingface.co/blog/lbourdois/lle">here</a>.
40+
</p>
41+
42+
```sql
43+
WITH
44+
overlapping_rows AS (
45+
-- Count the distinct rows that appear in both train and test
46+
-- (INTERSECT compares whole rows and removes duplicates).
47+
SELECT COUNT(*) AS overlap_count
48+
FROM (
49+
SELECT * FROM train
50+
INTERSECT
51+
SELECT * FROM test
52+
) shared
53+
),
54+
total_unique_rows AS (
55+
SELECT COUNT(*) AS total_count
56+
FROM (
57+
SELECT * FROM train
58+
UNION
59+
SELECT * FROM test
60+
) combined
61+
)
62+
SELECT
63+
overlap_count,
64+
total_count,
65+
CASE
66+
WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count)
67+
ELSE 0
68+
END AS overlap_percentage
69+
FROM overlapping_rows, total_unique_rows;
70+
```
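To sanity-check row-level overlap logic on a small sample outside the SQL Console, the sketch below runs a variant of the query against in-memory SQLite tables using Python's standard `sqlite3` module. The two-column schema (`instruction`, `output`), the `leakage_stats` helper, and the toy rows are assumptions made for illustration. SQLite stands in for DuckDB here only because both support `INTERSECT` and `UNION`.

```python
import sqlite3

# Row-level overlap: INTERSECT keeps rows found in both splits,
# UNION keeps every distinct row across the two splits.
LEAKAGE_QUERY = """
WITH
    overlapping_rows AS (
        SELECT COUNT(*) AS overlap_count
        FROM (
            SELECT * FROM train
            INTERSECT
            SELECT * FROM test
        ) AS shared
    ),
    total_unique_rows AS (
        SELECT COUNT(*) AS total_count
        FROM (
            SELECT * FROM train
            UNION
            SELECT * FROM test
        ) AS combined
    )
SELECT
    overlap_count,
    total_count,
    CASE
        WHEN total_count > 0 THEN overlap_count * 100.0 / total_count
        ELSE 0
    END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
"""


def leakage_stats(train_rows, test_rows):
    """Return (overlap_count, total_count, overlap_percentage) for two splits.

    Each row is an (instruction, output) tuple; this schema is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    try:
        for name, rows in (("train", train_rows), ("test", test_rows)):
            conn.execute(f"CREATE TABLE {name} (instruction TEXT, output TEXT)")
            conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)
        return conn.execute(LEAKAGE_QUERY).fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    train = [("a", "1"), ("b", "2"), ("c", "3")]
    test = [("c", "3"), ("d", "4")]
    print(leakage_stats(train, test))  # (1, 4, 25.0)
```

One row is shared out of four distinct rows, so the overlap is 25%. The percentage is relative to the number of distinct rows across both splits, not to the size of the test set.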
71+
72+
## Filtering
73+
74+
The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length greater than 10, you can use the following query:
75+
76+
<div class="flex justify-center">
77+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length.png"/>
78+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length-dark.png"/>
79+
</div>
80+
81+
The query uses the `len` function to get the length of each row's `reasoning_chains` value and the `bar` function to render that length as a bar next to each row.
82+
83+
```sql
84+
select len(reasoning_chains) as reason_len, bar(reason_len, 0, 100), *
85+
from train
86+
where reason_len > 10
87+
order by reason_len desc
88+
```
89+
90+
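The [bar](https://duckdb.org/docs/sql/functions/char.html#barx-min-max-width) function is a built-in DuckDB function: `bar(x, min, max)` draws a horizontal text bar whose length is proportional to `x - min` and reaches full width when `x` equals `max`. In the query above, reasoning lengths are scaled between 0 and 100.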
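To build intuition for how a value is scaled into a text bar, here is a rough Python approximation. The `text_bar` helper is hypothetical and is not DuckDB's implementation: it uses only whole block characters and assumes a default width of 80, so the exact rendering may differ from what the SQL Console shows.

```python
BLOCK = "\u2588"  # full block character


def text_bar(x, lo, hi, width=80):
    """Return a bar of block characters proportional to where x sits in [lo, hi].

    Values below lo give an empty bar; values above hi give a full-width bar.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    fraction = (x - lo) / (hi - lo)
    fraction = max(0.0, min(1.0, fraction))  # clamp to [0, 1]
    return BLOCK * int(fraction * width)


if __name__ == "__main__":
    # Hypothetical reasoning lengths, scaled between 0 and 100.
    for reason_len in (12, 35, 80):
        print(f"{reason_len:>3} {text_bar(reason_len, 0, 100, width=40)}")
```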
91+
92+
## Histogram
93+
94+
Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values.
95+
96+
For example, to plot a histogram of the length of the `reasoning_chains` column in the `SkunkworksAI/reasoning-0.01` dataset, you can use the following query:
97+
98+
<div class="flex justify-center">
99+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple.png"/>
100+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple-dark.png"/>
101+
</div>
102+
<p class="text-sm text-center italic">
103+
Learn more about the `histogram` function and parameters <a href="https://cfahlgren1-sql-snippets.hf.space/histogram" target="_blank" rel="noopener noreferrer">here</a>.
104+
</p>
105+
106+
```sql
107+
from histogram(train, len(reasoning_chains))
108+
```
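As a rough illustration of what a histogram computes, the Python sketch below groups values into equal-width bins and counts how many fall into each. The `equal_width_histogram` helper and the sample lengths are assumptions made for illustration. DuckDB's `histogram` function chooses its own bins and output format, so this shows the idea rather than reproducing its results.

```python
def equal_width_histogram(values, bins=10):
    """Return a list of (lower_edge, upper_edge, count), one per equal-width bin."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    values = list(values)
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so a single bin holds everything.
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(bins)]


if __name__ == "__main__":
    # Hypothetical reasoning-chain lengths.
    lengths = [0, 1, 2, 3, 4, 5, 6, 7, 8]
    for lower, upper, count in equal_width_histogram(lengths, bins=4):
        print(f"[{lower:.1f}, {upper:.1f}) {count}")
```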
109+
110+
## Regex Matching
111+
112+
One of the most powerful features of DuckDB is its deep support for regular expressions. You can use the `regexp_*` family of functions, such as `regexp_matches`, `regexp_replace`, and `regexp_extract`, to match patterns in your data.
113+
114+
Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the `SkunkworksAI/reasoning-0.01` dataset for instructions that contain markdown code blocks.
115+
116+
<div class="flex justify-center">
117+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code.png"/>
118+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code-dark.png"/>
119+
</div>
120+
<p class="text-sm text-center italic">
121+
Learn more about the DuckDB regex functions <a href="https://duckdb.org/docs/sql/functions/regular_expressions.html" target="_blank" rel="noopener noreferrer">here</a>.
122+
</p>
123+
124+
125+
```sql
126+
SELECT *
127+
FROM train
128+
WHERE regexp_matches(instruction, '```[a-z]*\n')
129+
limit 100
130+
```
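To check what the pattern matches before running it on a full dataset, you can try it on a few strings locally. The sketch below uses Python's `re` module, which handles this particular pattern the same way even though its regex dialect is not identical to DuckDB's. The `has_code_fence` helper and the sample strings are hypothetical.

```python
import re

# Three backticks, built this way so the snippet is easy to embed in Markdown.
FENCE = "`" * 3

# An opening fence: three backticks, an optional lowercase language tag, a newline.
CODE_FENCE = re.compile(FENCE + r"[a-z]*\n")


def has_code_fence(text):
    """Return True if text contains a fence line matching the pattern anywhere."""
    return CODE_FENCE.search(text) is not None


if __name__ == "__main__":
    with_code = "Fix this function:\n" + FENCE + "python\nprint(1)\n" + FENCE
    without_code = "Explain what a binary search does."
    print(has_code_fence(with_code))     # True
    print(has_code_fence(without_code))  # False
```

Like `regexp_matches`, `re.search` succeeds when the pattern occurs anywhere in the string, so no leading or trailing `.*` is needed.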
