docs/hub/datasets-viewer-sql-console.md
Lines changed: 44 additions & 43 deletions
@@ -26,49 +26,6 @@ You can also use the DuckDB CLI to query the dataset via the `hf://` protocol. S
 
 ## Examples
 
-### Leakage Detection
-
-Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.
-Learn more about leakage detection <a href="https://huggingface.co/blog/lbourdois/lle">here</a>.
-</p>
-
-```sql
-WITH
-    overlapping_rows AS (
-        SELECT COALESCE(
-            (SELECT COUNT(*) AS overlap_count
-            FROM train
-            INTERSECT
-            SELECT COUNT(*) AS overlap_count
-            FROM test),
-            0
-        ) AS overlap_count
-    ),
-    total_unique_rows AS (
-        SELECT COUNT(*) AS total_count
-        FROM (
-            SELECT * FROM train
-            UNION
-            SELECT * FROM test
-        ) combined
-    )
-SELECT
-    overlap_count,
-    total_count,
-    CASE
-        WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count)
-        ELSE 0
-    END AS overlap_percentage
-FROM overlapping_rows, total_unique_rows;
-```
-
 ### Filtering
 
 The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length of at least 10, you can use the following query:
@@ -128,3 +85,47 @@ FROM train
 WHERE regexp_matches(instruction, '```[a-z]*\n')
 limit 100
 ```
+
+
+### Leakage Detection
+
+Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.
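
As a rough illustration of the definition above, the following sketch counts rows that appear in both splits and reports them as a percentage of all unique rows. It is not the query from this change: it runs against an in-memory SQLite database rather than the SQL Console, the helper name `leakage_report` is hypothetical, and the `train` and `test` tables with `instruction` and `response` columns are filled with made-up toy rows.

```python
import sqlite3


def leakage_report(conn):
    """Return (overlap_count, total_count, overlap_percentage) for tables `train` and `test`.

    Hypothetical helper for illustration; assumes both tables share the same columns.
    """
    # Rows that are identical in both splits (row-level INTERSECT).
    overlap = conn.execute(
        "SELECT COUNT(*) FROM (SELECT * FROM train INTERSECT SELECT * FROM test)"
    ).fetchone()[0]
    # Distinct rows across both splits (UNION removes duplicates).
    total = conn.execute(
        "SELECT COUNT(*) FROM (SELECT * FROM train UNION SELECT * FROM test)"
    ).fetchone()[0]
    percentage = overlap * 100.0 / total if total > 0 else 0.0
    return overlap, total, percentage


# Toy data (assumed, not taken from any real dataset): one row is shared.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (instruction TEXT, response TEXT)")
conn.execute("CREATE TABLE test (instruction TEXT, response TEXT)")
conn.executemany(
    "INSERT INTO train VALUES (?, ?)", [("a", "1"), ("b", "2"), ("c", "3")]
)
conn.executemany("INSERT INTO test VALUES (?, ?)", [("c", "3"), ("d", "4")])

print(leakage_report(conn))  # prints (1, 4, 25.0)
```

The intersection here is taken over whole rows, so two rows only count as leaked when every column matches; near-duplicates would need a fuzzier comparison.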