2 changes: 2 additions & 0 deletions docs/hub/_toctree.yml
@@ -192,6 +192,8 @@
title: Combine datasets and export
- local: datasets-duckdb-vector-similarity-search
title: Perform vector similarity search
- local: datasets-embedding-atlas
title: Embedding Atlas
- local: datasets-fiftyone
title: FiftyOne
- local: datasets-pandas
186 changes: 186 additions & 0 deletions docs/hub/datasets-embedding-atlas.md
@@ -0,0 +1,186 @@
# Embedding Atlas

Embedding Atlas is an interactive visualization tool for exploring large embedding spaces. It lets you visualize, cross-filter, and search embeddings alongside their associated metadata, helping you understand patterns and relationships in high-dimensional data. Computation runs locally, in your browser or on your machine, so your data stays private.

## Key Features

- **Interactive exploration**: Navigate through millions of embeddings with smooth, responsive visualization
- **Browser-based computation**: Compute embeddings and projections locally without sending data to external servers
- **Cross-filtering**: Link and filter data across multiple metadata columns
- **Search capabilities**: Find similar data points to a given query or existing item
- **Multiple integration options**: Use via command line, Jupyter widgets, or web interface

## Prerequisites

First, install Embedding Atlas:

```bash
pip install embedding-atlas
```

If you plan to load private datasets from the Hugging Face Hub, you'll also need to [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login):

```bash
hf auth login
```
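Alternatively, Hugging Face client libraries read a token from the `HF_TOKEN` environment variable, which should also work when loading private datasets (the value below is a placeholder):

```bash
# Placeholder token; create one at https://huggingface.co/settings/tokens
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```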

## Loading Datasets from the Hub

Embedding Atlas integrates directly with the Hugging Face Hub, so you can visualize embeddings for any Hub dataset by passing its ID.

### Using the Command Line

The simplest way to visualize a Hugging Face dataset is through the command line interface. Try it with the IMDB dataset:

```bash
# Load the IMDB dataset from the Hub
embedding-atlas stanfordnlp/imdb

# Specify the text column for embedding computation
embedding-atlas stanfordnlp/imdb --text "text"

# Load only a sample for faster exploration
embedding-atlas stanfordnlp/imdb --text "text" --sample 5000
```

For your own datasets, use the same pattern:

```bash
# Load your dataset from the Hub
embedding-atlas username/dataset-name

# Load multiple splits
embedding-atlas username/dataset-name --split train --split test

# Specify custom text column
embedding-atlas username/dataset-name --text "content"
```

### Using Python and Jupyter

You can also use Embedding Atlas in Jupyter notebooks for interactive exploration:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset
import pandas as pd

# Load the IMDB dataset from Hugging Face Hub
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")

# Convert to pandas DataFrame
df = dataset.to_pandas()

# Create interactive widget
widget = EmbeddingAtlasWidget(df)
widget
```

For your own datasets:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset
import pandas as pd

# Load your dataset from the Hub
dataset = load_dataset("username/dataset-name", split="train")
df = dataset.to_pandas()

# Create interactive widget
widget = EmbeddingAtlasWidget(df)
widget
```

### Working with Pre-computed Embeddings

If you have datasets with pre-computed embeddings, you can load them directly:

```bash
# Load dataset with pre-computed coordinates
embedding-atlas username/dataset-name \
--x "embedding_x" \
--y "embedding_y"

# Load with pre-computed nearest neighbors
embedding-atlas username/dataset-name \
--neighbors "neighbors_column"
```
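As a rough sketch of how such coordinate columns could be produced, the snippet below computes embeddings with `sentence-transformers` and projects them to 2D with `umap-learn`. This is not part of Embedding Atlas itself, and the dataset name and `text` column are placeholders you would replace with your own:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import umap

# Placeholder dataset name and text column; substitute your own
dataset = load_dataset("username/dataset-name", split="train")
texts = dataset["text"]

# Compute sentence embeddings locally
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(texts, show_progress_bar=True)

# Project to two dimensions with UMAP and attach the coordinates as columns
projection = umap.UMAP(metric="cosine").fit_transform(embeddings)
dataset = dataset.add_column("embedding_x", projection[:, 0].tolist())
dataset = dataset.add_column("embedding_y", projection[:, 1].tolist())

# Optionally push the enriched dataset back to the Hub so --x/--y can pick it up
# dataset.push_to_hub("username/dataset-name")
```

Once the coordinate columns exist, the `--x`/`--y` flags above let Embedding Atlas skip the embedding and projection steps entirely.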

## Customizing Embeddings

Embedding Atlas uses SentenceTransformers by default but supports custom embedding models:

```bash
# Use a specific embedding model
embedding-atlas stanfordnlp/imdb \
--text "text" \
--model "sentence-transformers/all-MiniLM-L6-v2"

# For models requiring remote code execution
embedding-atlas username/dataset-name \
--model "custom/model" \
--trust-remote-code
```
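Only pass `--trust-remote-code` for models whose repository code you have reviewed or trust, since it allows code from the model repository to run on your machine.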

### UMAP Projection Parameters

Fine-tune the dimensionality reduction for your specific use case:

```bash
embedding-atlas stanfordnlp/imdb \
--text "text" \
--umap-n-neighbors 30 \
--umap-min-dist 0.1 \
--umap-metric "cosine"
```
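As a rule of thumb, larger `--umap-n-neighbors` values emphasize global structure over fine local detail, while smaller `--umap-min-dist` values pack similar points into tighter clusters; `--umap-metric` should match the similarity measure appropriate for your embedding model (cosine is a common choice for sentence embeddings).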

## Use Cases

### Exploring Text Datasets

Visualize and explore text corpora to identify clusters, outliers, and patterns:

```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from datasets import load_dataset
import pandas as pd

# Load a text classification dataset; the IMDB "label" column comes along as metadata
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")
df = dataset.to_pandas()

# Visualize; metadata columns such as "label" can be used for cross-filtering
widget = EmbeddingAtlasWidget(df)
widget
```

### Analyzing Model Outputs

Compare embeddings from different models or examine how embeddings change during training:

```bash
# First pass: embed the IMDB sample with all-MiniLM-L6-v2
embedding-atlas stanfordnlp/imdb \
--text "text" \
--model "sentence-transformers/all-MiniLM-L6-v2" \
--sample 5000
```
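Running the same command with a different model, for example `sentence-transformers/all-mpnet-base-v2` (used here purely as an illustrative alternative), gives you a second map to contrast with the first:

```bash
embedding-atlas stanfordnlp/imdb \
    --text "text" \
    --model "sentence-transformers/all-mpnet-base-v2" \
    --sample 5000
```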

### Quality Control and Data Curation

Identify mislabeled data, duplicates, or unusual patterns in your datasets:

```bash
# Visualize with sampling for large datasets
embedding-atlas username/large-dataset \
--sample 50000 \
--text "content"
```

## Additional Resources

- [Embedding Atlas GitHub Repository](https://github.com/apple/embedding-atlas)
- [Official Documentation](https://apple.github.io/embedding-atlas/)
- [Interactive Demo](https://apple.github.io/embedding-atlas/upload/)
- [Command Line Reference](https://apple.github.io/embedding-atlas/tool.html)
1 change: 1 addition & 0 deletions docs/hub/datasets-libraries.md
@@ -13,6 +13,7 @@ The table below summarizes the supported libraries and their level of integration
| [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | ✅ | ✅ |
| [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. | ✅ | ✅ |
| [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. | ✅ | ✅ |
| [Embedding Atlas](./datasets-embedding-atlas) | Interactive visualization and exploration tool for large embedding spaces. | ✅ | ❌ |
| [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. | ✅ | ✅ |
| [Pandas](./datasets-pandas) | Python data analysis toolkit. | ✅ | ✅ |
| [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. | ✅ | ✅ |