
Commit bb48da1

Add Embedding Atlas to dataset library integrations
- Add comprehensive documentation for Embedding Atlas integration
- Include examples using real datasets (stanfordnlp/imdb) for easy testing
- Document both CLI and Python/Jupyter usage patterns
- Add Embedding Atlas to the libraries table and navigation

1 parent 0699a3e commit bb48da1

3 files changed: 191 additions, 0 deletions

docs/hub/_toctree.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -192,6 +192,8 @@
     title: Combine datasets and export
   - local: datasets-duckdb-vector-similarity-search
     title: Perform vector similarity search
+  - local: datasets-embedding-atlas
+    title: Embedding Atlas
   - local: datasets-fiftyone
     title: FiftyOne
   - local: datasets-pandas
```

docs/hub/datasets-embedding-atlas.md (new file)

Lines changed: 188 additions & 0 deletions

# Embedding Atlas

Embedding Atlas is an interactive visualization tool for exploring large embedding spaces. It lets you visualize, cross-filter, and search embeddings alongside their associated metadata, helping you spot patterns and relationships in high-dimensional data. All computation happens in your browser, so your data stays private and secure.

![Embedding Atlas Visualization](https://github.com/apple/embedding-atlas/assets/1688009/d3bb956e-d43e-4797-89c3-c1c69b98b30e)

## Key Features

- **Interactive exploration**: Navigate through millions of embeddings with smooth, responsive visualization
- **Browser-based computation**: Compute embeddings and projections locally, without sending data to external servers
- **Cross-filtering**: Link and filter data across multiple metadata columns
- **Search capabilities**: Find data points similar to a given query or to an existing item
- **Multiple integration options**: Use it from the command line, as a Jupyter widget, or through the web interface

## Prerequisites

First, install Embedding Atlas:

```bash
pip install embedding-atlas
```

If you plan to load private datasets from the Hugging Face Hub, you'll also need to [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login):

```bash
hf auth login
```

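If you prefer to stay in Python (for example, inside a notebook), the standard `huggingface_hub` login helper works as well; this is the general Hub API rather than anything specific to Embedding Atlas:

```python
from huggingface_hub import login

# Prompts for a Hugging Face access token; alternatively pass token="hf_..."
login()
```
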
## Loading Datasets from the Hub

Embedding Atlas integrates directly with the Hugging Face Hub, so you can visualize embeddings for any dataset hosted there.

### Using the Command Line

The simplest way to visualize a Hugging Face dataset is through the command-line interface. Try it with the IMDB dataset:

```bash
# Load the IMDB dataset from the Hub
embedding-atlas stanfordnlp/imdb

# Specify the text column for embedding computation
embedding-atlas stanfordnlp/imdb --text "text"

# Load only a sample for faster exploration
embedding-atlas stanfordnlp/imdb --text "text" --sample 5000
```

For your own datasets, use the same pattern:

```bash
# Load your dataset from the Hub
embedding-atlas username/dataset-name

# Load multiple splits
embedding-atlas username/dataset-name --split train --split test

# Specify a custom text column
embedding-atlas username/dataset-name --text "content"
```

### Using Python and Jupyter

You can also use Embedding Atlas in Jupyter notebooks for interactive exploration:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset

# Load the IMDB dataset from the Hugging Face Hub
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")

# Convert to a pandas DataFrame
df = dataset.to_pandas()

# Create the interactive widget
widget = EmbeddingAtlasWidget(df)
widget
```

For your own datasets:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset

# Load your dataset from the Hub
dataset = load_dataset("username/dataset-name", split="train")
df = dataset.to_pandas()

# Create the interactive widget
widget = EmbeddingAtlasWidget(df)
widget
```

### Working with Pre-computed Embeddings

If your dataset already contains pre-computed embeddings, you can point Embedding Atlas at the relevant columns directly:

```bash
# Load a dataset with pre-computed 2D coordinates
embedding-atlas username/dataset-name \
  --x "embedding_x" \
  --y "embedding_y"

# Load a dataset with pre-computed nearest neighbors
embedding-atlas username/dataset-name \
  --neighbors "neighbors_column"
```

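The `--x`/`--y` flags assume the coordinate columns already exist in the dataset. As a rough sketch of one way such columns could be produced — the model choice, UMAP settings, column names, and commented-out repo id below are illustrative assumptions, not part of Embedding Atlas, and require `sentence-transformers` and `umap-learn` to be installed:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import umap

# Embed a small sample of the IMDB dataset
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(dataset["text"], show_progress_bar=True)

# Project the embeddings down to 2D with UMAP
coords = umap.UMAP(n_neighbors=30, min_dist=0.1, metric="cosine").fit_transform(embeddings)

# Store the coordinates as columns that --x/--y can point to
dataset = dataset.add_column("embedding_x", coords[:, 0].tolist())
dataset = dataset.add_column("embedding_y", coords[:, 1].tolist())

# Optionally push the enriched dataset back to the Hub (hypothetical repo id)
# dataset.push_to_hub("username/imdb-with-coordinates")
```

You can then launch `embedding-atlas` on the resulting dataset with `--x "embedding_x" --y "embedding_y"` as shown above.
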
## Customizing Embeddings

Embedding Atlas uses SentenceTransformers by default but also supports custom embedding models:

```bash
# Use a specific embedding model
embedding-atlas stanfordnlp/imdb \
  --text "text" \
  --model "sentence-transformers/all-MiniLM-L6-v2"

# For models that require remote code execution
embedding-atlas username/dataset-name \
  --model "custom/model" \
  --trust-remote-code
```

### UMAP Projection Parameters

Fine-tune the dimensionality reduction for your use case: `--umap-n-neighbors` controls how much local versus global structure is preserved, `--umap-min-dist` controls how tightly points are packed together, and `--umap-metric` sets the distance metric used to compare embeddings.

```bash
embedding-atlas stanfordnlp/imdb \
  --text "text" \
  --umap-n-neighbors 30 \
  --umap-min-dist 0.1 \
  --umap-metric "cosine"
```

## Use Cases

### Exploring Text Datasets

Visualize and explore text corpora to identify clusters, outliers, and patterns:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset

# Load a text classification dataset
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")
df = dataset.to_pandas()

# Visualize the text together with its metadata columns
widget = EmbeddingAtlasWidget(df)
widget
```

### Analyzing Model Outputs

Compare embeddings from different models or examine how embeddings change during training:

```bash
# Visualize a sample embedded with a specific model
embedding-atlas stanfordnlp/imdb \
  --text "text" \
  --model "sentence-transformers/all-MiniLM-L6-v2" \
  --sample 5000
```

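One straightforward way to actually compare models is to run the same command once per model and inspect the resulting maps side by side; the second model id below is simply another widely used SentenceTransformers checkpoint, chosen for illustration:

```bash
# First model
embedding-atlas stanfordnlp/imdb --text "text" --sample 5000 \
  --model "sentence-transformers/all-MiniLM-L6-v2"

# Second model, for comparison
embedding-atlas stanfordnlp/imdb --text "text" --sample 5000 \
  --model "sentence-transformers/all-mpnet-base-v2"
```
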
### Quality Control and Data Curation

Identify mislabeled data, duplicates, or unusual patterns in your datasets:

```bash
# Visualize with sampling for large datasets
embedding-atlas username/large-dataset \
  --sample 50000 \
  --text "content"
```

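As a complementary, purely illustrative check outside Embedding Atlas itself, you can pre-screen a sample for near-duplicates with `sentence-transformers` before digging into the visualization; the threshold, sample size, and model below are assumptions:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

# Screen a small sample for near-duplicate texts
dataset = load_dataset("stanfordnlp/imdb", split="train[:2000]")
texts = dataset["text"]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# paraphrase_mining returns [score, index_a, index_b] entries, highest scores first
pairs = util.paraphrase_mining(model, texts, top_k=5)

candidates = [p for p in pairs if p[0] > 0.95]
print(f"{len(candidates)} candidate near-duplicate pairs")
for score, i, j in candidates[:5]:
    print(f"{score:.3f}  |  {texts[i][:60]!r}  ~  {texts[j][:60]!r}")
```
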
## Additional Resources

- [Embedding Atlas GitHub Repository](https://github.com/apple/embedding-atlas)
- [Official Documentation](https://apple.github.io/embedding-atlas/)
- [Interactive Demo](https://apple.github.io/embedding-atlas/upload/)
- [Command Line Reference](https://apple.github.io/embedding-atlas/tool.html)

docs/hub/datasets-libraries.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -13,6 +13,7 @@ The table below summarizes the supported libraries and their level of integration
 | [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). |||
 | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. |||
 | [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. |||
+| [Embedding Atlas](./datasets-embedding-atlas) | Interactive visualization and exploration tool for large embeddings. |||
 | [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. |||
 | [Pandas](./datasets-pandas) | Python data analysis toolkit. |||
 | [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. |||
```
