Commit 58e35e1

davanstrien and pcuenca authored
Add Embedding Atlas to dataset library integrations (#1870)
* Add Embedding Atlas to dataset library integrations
  - Add comprehensive documentation for Embedding Atlas integration
  - Include examples using real datasets (stanfordnlp/imdb) for easy testing
  - Document both CLI and Python/Jupyter usage patterns
  - Add Embedding Atlas to the libraries table and navigation
* Remove visualization image from Embedding Atlas documentation
* Update docs/hub/datasets-embedding-atlas.md
  Co-authored-by: Pedro Cuenca <[email protected]>
* Update docs/hub/datasets-embedding-atlas.md
  Co-authored-by: Pedro Cuenca <[email protected]>
* simplify examples

---------

Co-authored-by: Pedro Cuenca <[email protected]>
1 parent 931924c commit 58e35e1

3 files changed: +167 -0 lines changed

docs/hub/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -192,6 +192,8 @@
   title: Combine datasets and export
 - local: datasets-duckdb-vector-similarity-search
   title: Perform vector similarity search
+- local: datasets-embedding-atlas
+  title: Embedding Atlas
 - local: datasets-fiftyone
   title: FiftyOne
 - local: datasets-pandas

docs/hub/datasets-embedding-atlas.md

Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
# Embedding Atlas

[Embedding Atlas](https://apple.github.io/embedding-atlas/) is an interactive visualization tool for exploring large embedding spaces. It lets you visualize, cross-filter, and search embeddings alongside their associated metadata, helping you understand patterns and relationships in high-dimensional data. All computation happens on your computer, so your data remains private and secure.

## Key Features

- **Interactive exploration**: Navigate through millions of embeddings with smooth, responsive visualization
- **Browser-based computation**: Compute embeddings and projections locally without sending data to external servers
- **Cross-filtering**: Link and filter data across multiple metadata columns
- **Search capabilities**: Find similar data points to a given query or existing item
- **Multiple integration options**: Use via command line, Jupyter widgets, or web interface

## Prerequisites

First, install Embedding Atlas:

```bash
pip install embedding-atlas
```

If you plan to load private datasets from the Hugging Face Hub, you'll also need to [log in with your Hugging Face account](/docs/huggingface_hub/quick-start#login):

```bash
hf auth login
```

## Loading Datasets from the Hub

Embedding Atlas integrates with the Hugging Face Hub: pass a dataset's repository ID and it loads the data for visualization directly.

### Using the Command Line

The simplest way to visualize a Hugging Face dataset is through the command line interface. Try it with the IMDB dataset:

```bash
# Load the IMDB dataset from the Hub
embedding-atlas stanfordnlp/imdb

# Specify the text column for embedding computation
embedding-atlas stanfordnlp/imdb --text "text"

# Load only a sample for faster exploration
embedding-atlas stanfordnlp/imdb --text "text" --sample 5000
```

For your own datasets, use the same pattern:

```bash
# Load your dataset from the Hub
embedding-atlas username/dataset-name

# Load multiple splits
embedding-atlas username/dataset-name --split train --split test

# Specify custom text column
embedding-atlas username/dataset-name --text "content"
```

59+
### Using Python and Jupyter
60+
61+
You can also use Embedding Atlas in Jupyter notebooks for interactive exploration:
62+
63+
```python
64+
from embedding_atlas.widget import EmbeddingAtlasWidget
65+
from datasets import load_dataset
66+
import pandas as pd
67+
68+
# Load the IMDB dataset from Hugging Face Hub
69+
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")
70+
71+
# Convert to pandas DataFrame
72+
df = dataset.to_pandas()
73+
74+
# Create interactive widget
75+
widget = EmbeddingAtlasWidget(df)
76+
widget
77+
```
78+
79+
For your own datasets:
80+
81+
```python
82+
from embedding_atlas.widget import EmbeddingAtlasWidget
83+
from datasets import load_dataset
84+
import pandas as pd
85+
86+
# Load your dataset from the Hub
87+
dataset = load_dataset("username/dataset-name", split="train")
88+
df = dataset.to_pandas()
89+
90+
# Create interactive widget
91+
widget = EmbeddingAtlasWidget(df)
92+
widget
93+
```
94+
### Working with Pre-computed Embeddings

If you have datasets with pre-computed embeddings, you can load them directly:

```bash
# Load dataset with pre-computed coordinates
embedding-atlas username/dataset-name \
  --x "embedding_x" \
  --y "embedding_y"

# Load with pre-computed nearest neighbors
embedding-atlas username/dataset-name \
  --neighbors "neighbors_column"
```

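To make the idea of pre-computed columns concrete, here is a small sketch that derives a nearest-neighbors column from 2D coordinates with a brute-force search. It rests on assumptions: the column names simply mirror the placeholders above, and the `{"ids": [...], "distances": [...]}` layout for each neighbors entry is assumed here, so check the Command Line Reference for the exact format before relying on it.

```python
import math


def nearest_neighbors(points, k):
    """Brute-force k nearest neighbors for a list of (x, y) points."""
    results = []
    for i, (xi, yi) in enumerate(points):
        # Euclidean distance from point i to every other point
        distances = [
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        ]
        distances.sort()
        nearest = distances[:k]
        results.append({
            "ids": [j for _, j in nearest],
            "distances": [d for d, _ in nearest],
        })
    return results


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]

# One row per data point; column names are illustrative placeholders
rows = [
    {"embedding_x": x, "embedding_y": y, "neighbors_column": n}
    for (x, y), n in zip(points, nearest_neighbors(points, k=2))
]
print(rows[0]["neighbors_column"])
# prints: {'ids': [1, 2], 'distances': [1.0, 2.0]}
```

Brute force costs quadratic time, so it only suits small samples; for large datasets an approximate nearest-neighbor index is the usual choice.
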
## Customizing Embeddings

Embedding Atlas uses [SentenceTransformers](https://huggingface.co/sentence-transformers) by default but supports custom embedding models:

```bash
# Use a specific embedding model
embedding-atlas stanfordnlp/imdb \
  --text "text" \
  --model "sentence-transformers/all-MiniLM-L6-v2"

# For models requiring remote code execution
embedding-atlas username/dataset-name \
  --model "custom/model" \
  --trust-remote-code
```

### UMAP Projection Parameters

Fine-tune the dimensionality reduction for your specific use case:

```bash
embedding-atlas stanfordnlp/imdb \
  --text "text" \
  --umap-n-neighbors 30 \
  --umap-min-dist 0.1 \
  --umap-metric "cosine"
```

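In UMAP generally, the number of neighbors trades local detail against global structure, the minimum distance controls how tightly points are packed in the projection, and the metric defines how distance between embeddings is measured. As an illustration of the `cosine` choice, the sketch below computes cosine distance in plain Python. It is a standalone example of the metric, not code taken from Embedding Atlas.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing the same way are close regardless of their length
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])

# Orthogonal vectors are at distance 1
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Because cosine distance ignores vector magnitude, it is a common choice for text embeddings, where direction tends to carry the semantic signal.
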
## Use Cases

### Exploring Text Datasets

Visualize and explore text corpora to identify clusters, outliers, and patterns:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset
import pandas as pd

# Load a text classification dataset
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")
df = dataset.to_pandas()

# Visualize with metadata
widget = EmbeddingAtlasWidget(df)
widget
```

## Additional Resources

- [Embedding Atlas GitHub Repository](https://github.com/apple/embedding-atlas)
- [Official Documentation](https://apple.github.io/embedding-atlas/)
- [Interactive Demo](https://apple.github.io/embedding-atlas/upload/)
- [Command Line Reference](https://apple.github.io/embedding-atlas/tool.html)

docs/hub/datasets-libraries.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ The table below summarizes the supported libraries and their level of integratio
 | [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). |||
 | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. |||
 | [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. |||
+| [Embedding Atlas](./datasets-embedding-atlas) | Interactive visualization and exploration tool for large embeddings. |||
 | [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. |||
 | [Pandas](./datasets-pandas) | Python data analysis toolkit. |||
 | [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. |||
