docs/hub/datasets-viewer-sql-console.md
Lines changed: 19 additions & 23 deletions
@@ -3,8 +3,8 @@
3
3
You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the Data Studio.
@@ -32,59 +32,55 @@ You can also use the DuckDB locally through the CLI to query the dataset via the
32
32
The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length of at least 10, you can use the following query:
In the query, we can use the `len` function to get the length of the `reasoning_chains` column and the `bar` function to create a bar chart of the reasoning lengths.
40
-
39
+
Here's the SQL to sort by length of the reasoning
41
40
```sql
42
-
SELECT len(reasoning_chains) AS reason_len, bar(reason_len, 0, 100), *
41
+
SELECT *
43
42
FROM train
44
-
WHERE reason_len > 10
45
-
ORDER BY reason_len DESC
43
+
WHERE LENGTH(reasoning_chains) > 10;
46
44
```
47
45
48
-
The [bar](https://duckdb.org/docs/sql/functions/char.html#barx-min-max-width) function is a neat built-in DuckDB function that creates a bar chart of the reasoning lengths.
49
-
50
46
### Histogram
51
47
52
48
Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values.
53
49
54
-
For example, to plot a histogram of the `reason_len` column in the `SkunkworksAI/reasoning-0.01` dataset, you can use the following query:
50
+
For example, to plot a histogram of the `Rating` column in the [Lichess/chess-puzzles](https://huggingface.co/datasets/Lichess/chess-puzzles) dataset, you can use the following query:
Learn more about the `histogram` function and parameters <a href="https://cfahlgren1-sql-snippets.hf.space/histogram" target="_blank" rel="noopener noreferrer">here</a>.
62
58
</p>
63
59
64
60
```sql
65
-
FROM histogram(train, len(reasoning_chains))
61
+
from histogram(train, Rating)
66
62
```
67
63
68
64
### Regex Matching
69
65
70
66
One of the most powerful features of DuckDB is the deep support for regular expressions. You can use the `regexp` function to match patterns in your data.
71
67
72
-
Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the `SkunkworksAI/reasoning-0.01` dataset for instructions that contain markdown code blocks.
68
+
Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the [GeneralReasoning/GeneralThought-195k](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-195K) dataset for instructions that contain markdown code blocks.
Learn more about the DuckDB regex functions <a href="https://duckdb.org/docs/sql/functions/regular_expressions.html" target="_blank" rel="noopener noreferrer">here</a>.
80
76
</p>
81
77
82
78
83
79
```sql
84
-
SELECT *
80
+
SELECT *
85
81
FROM train
86
-
WHERE regexp_matches(instruction, '```[a-z]*\n')
87
-
limit 100
82
+
WHERE regexp_matches(model_answer, '```')
83
+
LIMIT 10;
88
84
```
89
85
90
86
@@ -93,8 +89,8 @@ limit 100
93
89
Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.
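The definition of leakage above can be sketched as a set intersection between two splits. This is a minimal illustration, not the query from the docs page: it uses Python's built-in `sqlite3` as a stand-in for DuckDB, toy in-memory tables, a hypothetical helper named `count_leaked_rows`, and an assumed single text column called `content`. The `INTERSECT` query itself is plain SQL and should carry over to the SQL Console with the real split and column names substituted.

```python
import sqlite3


def count_leaked_rows(train_rows, test_rows):
    """Return how many distinct test rows also appear in the train split.

    Hypothetical helper: `train_rows` and `test_rows` are lists of strings
    standing in for one text column of each split.
    """
    con = sqlite3.connect(":memory:")
    # `content` is an assumed column name, not one taken from a real dataset.
    con.execute("CREATE TABLE train (content TEXT)")
    con.execute("CREATE TABLE test (content TEXT)")
    con.executemany("INSERT INTO train VALUES (?)", [(r,) for r in train_rows])
    con.executemany("INSERT INTO test VALUES (?)", [(r,) for r in test_rows])

    # INTERSECT keeps only the distinct values present in both splits.
    (leaked,) = con.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT content FROM test"
        "  INTERSECT"
        "  SELECT content FROM train"
        ")"
    ).fetchone()
    con.close()
    return leaked


train = ["the cat sat", "hello world", "foo bar"]
test = ["hello world", "unseen example", "foo bar", "foo bar"]
leaked = count_leaked_rows(train, test)
print(f"{leaked} distinct test rows also appear in train")
```

Because `INTERSECT` deduplicates, a test row repeated several times counts once; dividing the result by the number of distinct test rows would give a leakage rate instead of a raw count.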