
Commit bb48da1

Add Embedding Atlas to dataset library integrations
- Add comprehensive documentation for Embedding Atlas integration
- Include examples using real datasets (stanfordnlp/imdb) for easy testing
- Document both CLI and Python/Jupyter usage patterns
- Add Embedding Atlas to the libraries table and navigation

1 parent 0699a3e commit bb48da1

3 files changed: 191 additions, 0 deletions

docs/hub/_toctree.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -192,6 +192,8 @@
     title: Combine datasets and export
   - local: datasets-duckdb-vector-similarity-search
     title: Perform vector similarity search
+  - local: datasets-embedding-atlas
+    title: Embedding Atlas
   - local: datasets-fiftyone
     title: FiftyOne
   - local: datasets-pandas
```

docs/hub/datasets-embedding-atlas.md (new file)

Lines changed: 188 additions & 0 deletions

# Embedding Atlas

Embedding Atlas is an interactive visualization tool for exploring large embedding spaces. It lets you visualize, cross-filter, and search embeddings alongside their associated metadata, helping you spot patterns and relationships in high-dimensional data. All computation happens in your browser, so your data stays private and secure.

![Embedding Atlas Visualization](https://github.com/apple/embedding-atlas/assets/1688009/d3bb956e-d43e-4797-89c3-c1c69b98b30e)

## Key Features

- **Interactive exploration**: Navigate through millions of embeddings with smooth, responsive visualization
- **Browser-based computation**: Compute embeddings and projections locally, without sending data to external servers
- **Cross-filtering**: Link and filter data across multiple metadata columns
- **Search capabilities**: Find data points similar to a given query or to an existing item
- **Multiple integration options**: Use it from the command line, as a Jupyter widget, or through the web interface

## Prerequisites

First, install Embedding Atlas:

```bash
pip install embedding-atlas
```

If you plan to load private datasets from the Hugging Face Hub, you'll also need to [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login):

```bash
hf auth login
```

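If you prefer to stay in Python (for example, inside a notebook), the standard `huggingface_hub` login helper works as well; this is the general Hub API rather than anything specific to Embedding Atlas:

```python
from huggingface_hub import login

# Prompts for a Hugging Face access token; alternatively pass token="hf_..."
login()
```
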
## Loading Datasets from the Hub

Embedding Atlas integrates directly with the Hugging Face Hub, so you can visualize embeddings for any dataset hosted there.

### Using the Command Line

The simplest way to visualize a Hugging Face dataset is through the command-line interface. Try it with the IMDB dataset:

```bash
# Load the IMDB dataset from the Hub
embedding-atlas stanfordnlp/imdb

# Specify the text column for embedding computation
embedding-atlas stanfordnlp/imdb --text "text"

# Load only a sample for faster exploration
embedding-atlas stanfordnlp/imdb --text "text" --sample 5000
```

For your own datasets, use the same pattern:

```bash
# Load your dataset from the Hub
embedding-atlas username/dataset-name

# Load multiple splits
embedding-atlas username/dataset-name --split train --split test

# Specify a custom text column
embedding-atlas username/dataset-name --text "content"
```

### Using Python and Jupyter

You can also use Embedding Atlas in Jupyter notebooks for interactive exploration:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset

# Load the IMDB dataset from the Hugging Face Hub
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")

# Convert to a pandas DataFrame
df = dataset.to_pandas()

# Create the interactive widget
widget = EmbeddingAtlasWidget(df)
widget
```

For your own datasets:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset

# Load your dataset from the Hub
dataset = load_dataset("username/dataset-name", split="train")
df = dataset.to_pandas()

# Create the interactive widget
widget = EmbeddingAtlasWidget(df)
widget
```

### Working with Pre-computed Embeddings

If your dataset already contains pre-computed embeddings, you can point Embedding Atlas at the relevant columns directly:

```bash
# Load a dataset with pre-computed 2D coordinates
embedding-atlas username/dataset-name \
  --x "embedding_x" \
  --y "embedding_y"

# Load a dataset with pre-computed nearest neighbors
embedding-atlas username/dataset-name \
  --neighbors "neighbors_column"
```

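The `--x`/`--y` flags assume the coordinate columns already exist in the dataset. As a rough sketch of one way such columns could be produced — the model choice, UMAP settings, column names, and commented-out repo id below are illustrative assumptions, not part of Embedding Atlas, and require `sentence-transformers` and `umap-learn` to be installed:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import umap

# Embed a small sample of the IMDB dataset
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(dataset["text"], show_progress_bar=True)

# Project the embeddings down to 2D with UMAP
coords = umap.UMAP(n_neighbors=30, min_dist=0.1, metric="cosine").fit_transform(embeddings)

# Store the coordinates as columns that --x/--y can point to
dataset = dataset.add_column("embedding_x", coords[:, 0].tolist())
dataset = dataset.add_column("embedding_y", coords[:, 1].tolist())

# Optionally push the enriched dataset back to the Hub (hypothetical repo id)
# dataset.push_to_hub("username/imdb-with-coordinates")
```

You can then launch `embedding-atlas` on the resulting dataset with `--x "embedding_x" --y "embedding_y"` as shown above.
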
## Customizing Embeddings

Embedding Atlas uses SentenceTransformers by default but also supports custom embedding models:

```bash
# Use a specific embedding model
embedding-atlas stanfordnlp/imdb \
  --text "text" \
  --model "sentence-transformers/all-MiniLM-L6-v2"

# For models that require remote code execution
embedding-atlas username/dataset-name \
  --model "custom/model" \
  --trust-remote-code
```

### UMAP Projection Parameters

Fine-tune the dimensionality reduction for your use case: `--umap-n-neighbors` controls how much local versus global structure is preserved, `--umap-min-dist` controls how tightly points are packed together, and `--umap-metric` sets the distance metric used to compare embeddings.

```bash
embedding-atlas stanfordnlp/imdb \
  --text "text" \
  --umap-n-neighbors 30 \
  --umap-min-dist 0.1 \
  --umap-metric "cosine"
```

## Use Cases

### Exploring Text Datasets

Visualize and explore text corpora to identify clusters, outliers, and patterns:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset

# Load a text classification dataset
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")
df = dataset.to_pandas()

# Visualize the text together with its metadata columns
widget = EmbeddingAtlasWidget(df)
widget
```

### Analyzing Model Outputs

Compare embeddings from different models or examine how embeddings change during training:

```bash
# Visualize a sample embedded with a specific model
embedding-atlas stanfordnlp/imdb \
  --text "text" \
  --model "sentence-transformers/all-MiniLM-L6-v2" \
  --sample 5000
```

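One straightforward way to actually compare models is to run the same command once per model and inspect the resulting maps side by side; the second model id below is simply another widely used SentenceTransformers checkpoint, chosen for illustration:

```bash
# First model
embedding-atlas stanfordnlp/imdb --text "text" --sample 5000 \
  --model "sentence-transformers/all-MiniLM-L6-v2"

# Second model, for comparison
embedding-atlas stanfordnlp/imdb --text "text" --sample 5000 \
  --model "sentence-transformers/all-mpnet-base-v2"
```
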
### Quality Control and Data Curation

Identify mislabeled data, duplicates, or unusual patterns in your datasets:

```bash
# Visualize with sampling for large datasets
embedding-atlas username/large-dataset \
  --sample 50000 \
  --text "content"
```

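As a complementary, purely illustrative check outside Embedding Atlas itself, you can pre-screen a sample for near-duplicates with `sentence-transformers` before digging into the visualization; the threshold, sample size, and model below are assumptions:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

# Screen a small sample for near-duplicate texts
dataset = load_dataset("stanfordnlp/imdb", split="train[:2000]")
texts = dataset["text"]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# paraphrase_mining returns [score, index_a, index_b] entries, highest scores first
pairs = util.paraphrase_mining(model, texts, top_k=5)

candidates = [p for p in pairs if p[0] > 0.95]
print(f"{len(candidates)} candidate near-duplicate pairs")
for score, i, j in candidates[:5]:
    print(f"{score:.3f}  |  {texts[i][:60]!r}  ~  {texts[j][:60]!r}")
```
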
## Additional Resources

- [Embedding Atlas GitHub Repository](https://github.com/apple/embedding-atlas)
- [Official Documentation](https://apple.github.io/embedding-atlas/)
- [Interactive Demo](https://apple.github.io/embedding-atlas/upload/)
- [Command Line Reference](https://apple.github.io/embedding-atlas/tool.html)

docs/hub/datasets-libraries.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -13,6 +13,7 @@ The table below summarizes the supported libraries and their level of integration
 | [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). |||
 | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. |||
 | [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. |||
+| [Embedding Atlas](./datasets-embedding-atlas) | Interactive visualization and exploration tool for large embeddings. |||
 | [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. |||
 | [Pandas](./datasets-pandas) | Python data analysis toolkit. |||
 | [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. |||
```
