Commit 2b9ca20

authored
Dataset Viewer -> Data Studio (#1621)
* rename dataset viewer -> data studio, add graphics
* update older images
* small nit for images sizes
1 parent 4545d17 commit 2b9ca20

File tree

4 files changed

+49
-36
lines changed

docs/hub/_toctree.yml

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -206,7 +206,7 @@
206206
- local: datasets-webdataset
207207
title: WebDataset
208208
- local: datasets-viewer
209-
title: Dataset Viewer
209+
title: Data Studio
210210
sections:
211211
- local: datasets-viewer-configure
212212
title: Configure the Dataset Viewer

docs/hub/datasets-adding.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -98,9 +98,9 @@ Other formats and structures may not be recognized by the Hub.
9898

9999
For most types of datasets, **Parquet** is the recommended format due to its efficient compression, rich typing, and broad tool support with optimized read and batched operations. Alternatively, CSV or JSON Lines/JSON can be used for tabular data (prefer JSON Lines for nested data). Although easier to parse than Parquet, these formats are not recommended for data larger than several GBs. For image and audio datasets, uploading raw files is the most practical option for most use cases since it's easy to access individual files. For streaming large-scale image and audio datasets, [WebDataset](https://github.com/webdataset/webdataset) should be preferred over raw image and audio files to avoid the overhead of accessing individual files. For more general use cases involving analytics, data filtering, or metadata parsing, however, Parquet is the recommended option for large-scale image and audio datasets.
100100

101-
### Dataset Viewer
101+
### Data Studio
102102

103-
The [Dataset Viewer](./datasets-viewer) is useful to know how the data actually looks like before you download it.
103+
The [Data Studio](./datasets-viewer) lets you see what the data actually looks like before you download it.
104104
It is enabled by default for all public datasets. It is also available for private datasets owned by a [PRO user](https://huggingface.co/pricing) or an [Enterprise Hub organization](https://huggingface.co/enterprise).
105105

106106
After uploading your dataset, make sure the Dataset Viewer correctly shows your data, or [Configure the Dataset Viewer](./datasets-viewer-configure).

docs/hub/datasets-viewer-sql-console.md

Lines changed: 23 additions & 26 deletions
Original file line number | Diff line number | Diff line change
@@ -1,10 +1,10 @@
11
# SQL Console: Query Hugging Face datasets in your browser
22

3-
You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the dataset page by clicking on the **SQL Console** badge.
3+
You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the Data Studio.
44

55
<div class="flex justify-center">
6-
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/sql-console-histogram.png"/>
7-
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/sql-console-histogram-dark.png"/>
6+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/sql-ai.png"/>
7+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/sql-ai-dark.png"/>
88
</div>
99

1010
<p class="text-sm text-center italic">
@@ -16,8 +16,9 @@ Through the SQL Console, you can:
1616

1717
- Run [DuckDB SQL queries](https://duckdb.org/docs/sql/query_syntax/select) on the dataset (_check out [SQL Snippets](https://huggingface.co/spaces/cfahlgren1/sql-snippets) for useful queries_)
1818
- Share results of the query with others via a link (_check out [this example](https://huggingface.co/datasets/gretelai/synthetic-gsm8k-reflection-405b?sql_console=true&sql=FROM+histogram%28%0A++train%2C%0A++topic%2C%0A++bin_count+%3A%3D+10%0A%29)_)
19-
- Download the results of the query to a parquet file
19+
- Download the results of the query to a Parquet or CSV file
2020
- Embed the results of the query in your own webpage using an iframe
21+
- Query datasets with natural language
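
The example share link in the list above suggests that a shared query is just the dataset URL plus two query parameters. As a minimal sketch, assuming the link only needs `sql_console=true` and the URL-encoded query in `sql` (the helper name `sql_console_link` is hypothetical and not part of any Hub library), such a link could be built like this:

```python
from urllib.parse import urlencode


def sql_console_link(repo_id: str, query: str) -> str:
    """Build a shareable SQL Console URL for a dataset (hypothetical helper)."""
    # Assumption: the link needs only `sql_console=true` plus the encoded query,
    # mirroring the example link shown in the list above.
    params = urlencode({"sql_console": "true", "sql": query})
    return f"https://huggingface.co/datasets/{repo_id}?{params}"


# The query encoded in the example link, with its original line breaks.
query = "FROM histogram(\n  train,\n  topic,\n  bin_count := 10\n)"
link = sql_console_link("gretelai/synthetic-gsm8k-reflection-405b", query)
print(link)
```

`urlencode` encodes spaces as `+` and newlines as `%0A`, which matches the encoding used in the example link.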
2122

2223
<Tip>
2324
You can also use DuckDB locally through the CLI to query the dataset via the `hf://` protocol. See the <a href="https://huggingface.co/docs/hub/en/datasets-duckdb" target="_blank" rel="noopener noreferrer">DuckDB Datasets documentation</a> for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI.
@@ -31,59 +32,55 @@ You can also use the DuckDB locally through the CLI to query the dataset via the
3132
The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length of at least 10, you can use the following query:
3233

3334
<div class="flex justify-center">
34-
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length.png"/>
35-
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length-dark.png"/>
35+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-filtering.png"/>
36+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-filtering-dark.png"/>
3637
</div>
3738

38-
In the query, we can use the `len` function to get the length of the `reasoning_chains` column and the `bar` function to create a bar chart of the reasoning lengths.
39-
39+
Here's the SQL to filter by the length of the reasoning:
4040
```sql
41-
SELECT len(reasoning_chains) AS reason_len, bar(reason_len, 0, 100), *
41+
SELECT *
4242
FROM train
43-
WHERE reason_len > 10
44-
ORDER BY reason_len DESC
43+
WHERE LENGTH(reasoning_chains) > 10;
4544
```
4645

47-
The [bar](https://duckdb.org/docs/sql/functions/char.html#barx-min-max-width) function is a neat built-in DuckDB function that creates a bar chart of the reasoning lengths.
48-
4946
### Histogram
5047

5148
Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values.
5249

53-
For example, to plot a histogram of the `reason_len` column in the `SkunkworksAI/reasoning-0.01` dataset, you can use the following query:
50+
For example, to plot a histogram of the `Rating` column in the [Lichess/chess-puzzles](https://huggingface.co/datasets/Lichess/chess-puzzles) dataset, you can use the following query:
5451

5552
<div class="flex justify-center">
56-
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple.png"/>
57-
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple-dark.png"/>
53+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-histogram.png"/>
54+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-histogram-dark.png"/>
5855
</div>
5956
<p class="text-sm text-center italic">
6057
Learn more about the `histogram` function and parameters <a href="https://cfahlgren1-sql-snippets.hf.space/histogram" target="_blank" rel="noopener noreferrer">here</a>.
6158
</p>
6259

6360
```sql
64-
FROM histogram(train, len(reasoning_chains))
61+
FROM histogram(train, Rating)
6562
```
6663

6764
### Regex Matching
6865

6966
One of the most powerful features of DuckDB is the deep support for regular expressions. You can use the `regexp` function to match patterns in your data.
7067

71-
Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the `SkunkworksAI/reasoning-0.01` dataset for instructions that contain markdown code blocks.
68+
Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the [GeneralReasoning/GeneralThought-195K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-195K) dataset for model answers that contain Markdown code blocks.
7269

7370
<div class="flex justify-center">
74-
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code.png"/>
75-
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code-dark.png"/>
71+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-regex.png"/>
72+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-regex-dark.png"/>
7673
</div>
7774
<p class="text-sm text-center italic">
7875
Learn more about the DuckDB regex functions <a href="https://duckdb.org/docs/sql/functions/regular_expressions.html" target="_blank" rel="noopener noreferrer">here</a>.
7976
</p>
8077

8178

8279
```sql
83-
SELECT *
80+
SELECT *
8481
FROM train
85-
WHERE regexp_matches(instruction, '```[a-z]*\n')
86-
limit 100
82+
WHERE regexp_matches(model_answer, '```')
83+
LIMIT 10;
8784
```
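
To prototype the same filter locally before running it in the SQL Console, here is a rough Python equivalent. It is only a sketch: the sample rows are made up for illustration, and `re.search` stands in for `regexp_matches`, which likewise matches when the pattern occurs anywhere in the string.

```python
import re

# Matches three consecutive backticks, i.e. a Markdown code fence.
FENCE = re.compile("`{3}")


def has_code_fence(text: str) -> bool:
    """Return True if the text contains a Markdown code fence anywhere."""
    return FENCE.search(text) is not None


fence = "`" * 3
# Made-up rows standing in for the dataset's `model_answer` column.
rows = [
    {"model_answer": f"Use this:\n{fence}python\nprint(1)\n{fence}"},
    {"model_answer": "No code here."},
]

# Keep at most 10 matching rows, like LIMIT 10 in the query.
matches = [row for row in rows if has_code_fence(row["model_answer"])][:10]
print(len(matches))
```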
8885

8986

@@ -92,8 +89,8 @@ limit 100
9289
Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.
9390

9491
<div class="flex justify-center">
95-
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection.png"/>
96-
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection-dark.png"/>
92+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-leakage.png"/>
93+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-leakage-dark.png"/>
9794
</div>
9895

9996
<p class="text-sm text-center italic">
@@ -128,4 +125,4 @@ SELECT
128125
ELSE 0
129126
END AS overlap_percentage
130127
FROM overlapping_rows, total_unique_rows;
131-
```
128+
```
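
Only the tail of the leakage query appears in this hunk, so the following is not a translation of it. It is a plain-Python sketch of the same idea, under two assumptions: rows are compared as whole tuples, and the overlap is reported as a percentage of the unique train rows, with 0 returned when there are none (mirroring the `ELSE 0` branch).

```python
def overlap_percentage(train_rows, test_rows):
    """Percentage of unique train rows that also appear in the test split."""
    train_unique = set(train_rows)
    test_unique = set(test_rows)
    if not train_unique:
        # Guard against division by zero.
        return 0.0
    overlapping = train_unique & test_unique
    return 100.0 * len(overlapping) / len(train_unique)


# Made-up rows: one of the three unique train rows also appears in test.
train = [("a", 1), ("b", 2), ("c", 3), ("a", 1)]
test = [("c", 3), ("d", 4)]
pct = overlap_percentage(train, test)
print(f"{pct:.2f}%")
```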

docs/hub/datasets-viewer.md

Lines changed: 23 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -1,10 +1,10 @@
1-
# Dataset viewer
1+
# Data Studio
22

33
Each dataset page includes a table with the contents of the dataset, arranged by pages of 100 rows. You can navigate between pages using the buttons at the bottom of the table.
44

55
<div class="flex justify-center">
6-
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/dataset-viewer.png"/>
7-
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/dataset-viewer-dark.png"/>
6+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio.png"/>
7+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-dark.png"/>
88
</div>
99

1010
## Inspect data distributions
@@ -16,18 +16,34 @@ At the top of the columns you can see the graphs representing the distribution o
1616
If you click on a bar of a histogram from a numerical column, the dataset viewer will filter the data and show only the rows with values that fall in the selected range.
1717
Similarly, if you select one class from a categorical column, it will show only the rows from the selected category.
1818

19+
<div class="flex justify-center">
20+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-filter.png"/>
21+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-filter-dark.png"/>
22+
</div>
23+
1924
## Search a word in the dataset
2025

2126
You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and will match any row containing the word. The text is searched in columns of type `string`, even if the values are nested in a dictionary or a list.
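
To make the described behavior concrete, here is a small Python sketch of a case-insensitive search over string values, including strings nested in dictionaries and lists. It illustrates the semantics only and is not how the Hub implements search. In particular, plain substring matching is an assumption, and the helper names and sample rows are hypothetical.

```python
def iter_strings(value):
    """Yield every string found in a value, descending into dicts and lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def row_matches(row, word):
    """Case-insensitive check: does any string in the row contain the word?"""
    needle = word.casefold()
    return any(needle in text.casefold() for text in iter_strings(row))


# Made-up rows with strings at the top level and nested in a dict and a list.
rows = [
    {"id": 1, "text": "Hello World", "meta": {"tags": ["greeting"]}},
    {"id": 2, "text": "Goodbye", "meta": {"tags": ["Farewell", "world tour"]}},
    {"id": 3, "text": "Nothing", "meta": {"tags": []}},
]
hits = [row["id"] for row in rows if row_matches(row, "WORLD")]
print(hits)
```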
2227

2328
## Run SQL queries on the dataset
2429

2530
You can run SQL queries on the dataset in the browser using the SQL Console. This feature also leverages our [auto-conversion to Parquet](datasets-viewer#access-the-parquet-files).
26-
For more information see our guide on [SQL Console](./datasets-viewer-sql-console).
31+
32+
<div class="flex justify-center">
33+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/sql-ai.png" />
34+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/sql-ai-dark.png"/>
35+
</div>
36+
37+
For more information see our guide on [SQL Console](./datasets-viewer-sql-console).
2738

2839
## Share a specific row
2940

30-
You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser. For example https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc/test?p=2&row=241 will open the dataset viewer on the MRPC dataset, on the test split, and on the 241st row.
41+
You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser. For example, https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc/test?p=2&row=241 will open Data Studio on the MRPC dataset, on the test split, and on the 241st row.
42+
43+
<div class="flex justify-center">
44+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-row.png"/>
45+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datastudio-row-dark.png"/>
46+
</div>
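
The example URL above can also be assembled programmatically. The sketch below assumes that `p` is the zero-based index of the 100-row page containing the row. That is consistent with `p=2&row=241` in the example but is inferred rather than documented, and `row_link` is a hypothetical helper.

```python
from urllib.parse import urlencode

ROWS_PER_PAGE = 100  # the table is arranged by pages of 100 rows


def row_link(repo_id: str, config: str, split: str, row: int) -> str:
    """Build a link to a specific row of a dataset (hypothetical helper)."""
    # Assumption: `p` is the zero-based page that contains `row`.
    page = row // ROWS_PER_PAGE
    params = urlencode({"p": page, "row": row})
    return f"https://huggingface.co/datasets/{repo_id}/viewer/{config}/{split}?{params}"


link = row_link("nyu-mll/glue", "mrpc", "test", 241)
print(link)
```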
3147

3248
## Large scale datasets
3349

@@ -53,8 +69,8 @@ Parquet is a columnar storage format optimized for querying and processing large
5369
When you create a new dataset, the [`parquet-converter` bot](https://huggingface.co/parquet-converter) notifies you once it converts the dataset to Parquet. The [discussion](./repositories-pull-requests-discussions) it opens in the repository provides details about the Parquet format and links to the Parquet files.
5470

5571
<div class="flex justify-center">
56-
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parquet-converter-profile-light.png"/>
57-
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parquet-converter-profile-dark.png"/>
72+
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parquet-converter-profile-light.png" width=600/>
73+
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parquet-converter-profile-dark.png" width=600/>
5874
</div>
5975

6076
### Programmatic access
