| 1 | +# Embedding Atlas |
| 2 | + |
| 3 | +[Embedding Atlas](https://apple.github.io/embedding-atlas/) is an interactive visualization tool for exploring large embedding spaces. It lets you visualize, cross-filter, and search embeddings alongside their associated metadata, helping you understand patterns and relationships in high-dimensional data. All computation happens on your computer, so your data stays private. |
| 4 | + |
| 5 | +## Key Features |
| 6 | + |
| 7 | +- **Interactive exploration**: Navigate through millions of embeddings with smooth, responsive visualization |
| 8 | +- **Browser-based computation**: Compute embeddings and projections locally without sending data to external servers |
| 9 | +- **Cross-filtering**: Link and filter data across multiple metadata columns |
| 10 | +- **Search capabilities**: Find data points similar to a given query or existing item |
| 11 | +- **Multiple integration options**: Use it from the command line, as a Jupyter widget, or through the web interface |
| 12 | + |
| 13 | +## Prerequisites |
| 14 | + |
| 15 | +First, install Embedding Atlas: |
| 16 | + |
| 17 | +```bash |
| 18 | +pip install embedding-atlas |
| 19 | +``` |
| 20 | + |
| 21 | +If you plan to load private datasets from the Hugging Face Hub, you'll also need to [log in with your Hugging Face account](/docs/huggingface_hub/quick-start#login): |
| 22 | + |
| 23 | +```bash |
| 24 | +hf auth login |
| 25 | +``` |
| 26 | + |
| 27 | +## Loading Datasets from the Hub |
| 28 | + |
| 29 | +Embedding Atlas can load datasets directly from the Hugging Face Hub by repository ID, so you can visualize any Hub dataset without preparing local files first. |
| 30 | + |
| 31 | +### Using the Command Line |
| 32 | + |
| 33 | +The simplest way to visualize a Hugging Face dataset is through the command line interface. Try it with the IMDB dataset: |
| 34 | + |
| 35 | +```bash |
| 36 | +# Load the IMDB dataset from the Hub |
| 37 | +embedding-atlas stanfordnlp/imdb |
| 38 | + |
| 39 | +# Specify the text column for embedding computation |
| 40 | +embedding-atlas stanfordnlp/imdb --text "text" |
| 41 | + |
| 42 | +# Load only a sample for faster exploration |
| 43 | +embedding-atlas stanfordnlp/imdb --text "text" --sample 5000 |
| 44 | +``` |
| 45 | + |
| 46 | +For your own datasets, use the same pattern: |
| 47 | + |
| 48 | +```bash |
| 49 | +# Load your dataset from the Hub |
| 50 | +embedding-atlas username/dataset-name |
| 51 | + |
| 52 | +# Load multiple splits |
| 53 | +embedding-atlas username/dataset-name --split train --split test |
| 54 | + |
| 55 | +# Specify a custom text column |
| 56 | +embedding-atlas username/dataset-name --text "content" |
| 57 | +``` |
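If you launch Embedding Atlas from a script rather than typing the command by hand, the flags above can be assembled programmatically. The following is a minimal sketch, not part of the Embedding Atlas API: `build_atlas_command` is a hypothetical helper, and it only uses the `--text`, `--split`, and `--sample` flags shown in the examples above.

```python
import shlex
from typing import Optional, Sequence


def build_atlas_command(
    dataset: str,
    text: Optional[str] = None,
    splits: Sequence[str] = (),
    sample: Optional[int] = None,
) -> list[str]:
    """Assemble an argument list from the CLI flags shown above (hypothetical helper)."""
    command = ["embedding-atlas", dataset]
    if text is not None:
        command += ["--text", text]
    for split in splits:
        # --split is repeated once per split, mirroring the example above
        command += ["--split", split]
    if sample is not None:
        command += ["--sample", str(sample)]
    return command


command = build_atlas_command(
    "username/dataset-name", text="content", splits=["train", "test"], sample=5000
)
# Print a shell-quoted line for inspection before running it.
print(shlex.join(command))
```

Passing `command` to `subprocess.run` would start the tool. Keeping the arguments as a list avoids shell-quoting problems when column names contain spaces.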
| 58 | + |
| 59 | +### Using Python and Jupyter |
| 60 | + |
| 61 | +You can also use Embedding Atlas in Jupyter notebooks for interactive exploration: |
| 62 | + |
| 63 | +```python |
| 64 | +from embedding_atlas.widget import EmbeddingAtlasWidget |
| 65 | +from datasets import load_dataset |
| 66 | +import pandas as pd |
| 67 | + |
| 68 | +# Load the IMDB dataset from Hugging Face Hub |
| 69 | +dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]") |
| 70 | + |
| 71 | +# Convert to pandas DataFrame |
| 72 | +df = dataset.to_pandas() |
| 73 | + |
| 74 | +# Create interactive widget |
| 75 | +widget = EmbeddingAtlasWidget(df) |
| 76 | +widget |
| 77 | +``` |
| 78 | + |
| 79 | +For your own datasets: |
| 80 | + |
| 81 | +```python |
| 82 | +from embedding_atlas.widget import EmbeddingAtlasWidget |
| 83 | +from datasets import load_dataset |
| 84 | +import pandas as pd |
| 85 | + |
| 86 | +# Load your dataset from the Hub |
| 87 | +dataset = load_dataset("username/dataset-name", split="train") |
| 88 | +df = dataset.to_pandas() |
| 89 | + |
| 90 | +# Create interactive widget |
| 91 | +widget = EmbeddingAtlasWidget(df) |
| 92 | +widget |
| 93 | +``` |
| 94 | + |
| 95 | +### Working with Pre-computed Embeddings |
| 96 | + |
| 97 | +If your dataset already contains pre-computed projection coordinates or nearest neighbors, point Embedding Atlas at those columns instead of recomputing them: |
| 98 | + |
| 99 | +```bash |
| 100 | +# Load dataset with pre-computed coordinates |
| 101 | +embedding-atlas username/dataset-name \ |
| 102 | + --x "embedding_x" \ |
| 103 | + --y "embedding_y" |
| 104 | + |
| 105 | +# Load with pre-computed nearest neighbors |
| 106 | +embedding-atlas username/dataset-name \ |
| 107 | + --neighbors "neighbors_column" |
| 108 | +``` |
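To make the idea of a pre-computed neighbors column concrete, here is a small brute-force sketch that builds one record per row from 2D coordinates. The record layout (a dictionary with `ids` as zero-based row indices and `distances` in matching order) is an assumption about what the tool expects, so check it against the Command Line Reference before relying on it. `nearest_neighbors` is a hypothetical helper, and whether a point should list itself as a neighbor is also left as an assumption (it is excluded here).

```python
import math


def nearest_neighbors(points, k):
    """Brute-force k nearest neighbors by Euclidean distance (O(n^2), small data only)."""
    records = []
    for i, p in enumerate(points):
        # Score every other point, then keep the k closest
        scored = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        records.append(
            {
                "ids": [j for _, j in scored],        # assumed: zero-based row indices
                "distances": [d for d, _ in scored],  # assumed: same order as ids
            }
        )
    return records


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
neighbors = nearest_neighbors(points, k=2)
for row, record in enumerate(neighbors):
    print(row, record)
```

For datasets of realistic size, an approximate nearest-neighbor library would replace the quadratic loop. The sketch only illustrates the shape of the data.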
| 109 | + |
| 110 | +## Customizing Embeddings |
| 111 | + |
| 112 | +Embedding Atlas uses [SentenceTransformers](https://huggingface.co/sentence-transformers) by default but supports custom embedding models: |
| 113 | + |
| 114 | +```bash |
| 115 | +# Use a specific embedding model |
| 116 | +embedding-atlas stanfordnlp/imdb \ |
| 117 | + --text "text" \ |
| 118 | + --model "sentence-transformers/all-MiniLM-L6-v2" |
| 119 | + |
| 120 | +# For models that ship custom code on the Hub (only enable this for models you trust) |
| 121 | +embedding-atlas username/dataset-name \ |
| 122 | + --model "custom/model" \ |
| 123 | + --trust-remote-code |
| 124 | +``` |
| 125 | + |
| 126 | +### UMAP Projection Parameters |
| 127 | + |
| 128 | +Fine-tune the dimensionality reduction for your specific use case: |
| 129 | + |
| 130 | +```bash |
| 131 | +embedding-atlas stanfordnlp/imdb \ |
| 132 | + --text "text" \ |
| 133 | + --umap-n-neighbors 30 \ |
| 134 | + --umap-min-dist 0.1 \ |
| 135 | + --umap-metric "cosine" |
| 136 | +``` |
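In UMAP generally, `n_neighbors` balances local against global structure, and `min_dist` controls how tightly points pack together in the projection. The `cosine` metric compares the direction of two embedding vectors and ignores their length. The sketch below shows that definition in plain Python. It illustrates the metric itself, not how Embedding Atlas or UMAP implements it internally, and `cosine_distance` is a name chosen for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing the same way are (numerically) at distance 0, whatever their length
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Orthogonal vectors are at distance 1
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

Because vector length is ignored, cosine distance is a common choice for text embeddings, where direction tends to carry the semantic signal.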
| 137 | + |
| 138 | +## Use Cases |
| 139 | + |
| 140 | +### Exploring Text Datasets |
| 141 | + |
| 142 | +Visualize and explore text corpora to identify clusters, outliers, and patterns: |
| 143 | + |
| 144 | +```python |
| 145 | +from embedding_atlas.widget import EmbeddingAtlasWidget |
| 146 | +from datasets import load_dataset |
| 147 | +import pandas as pd |
| 148 | + |
| 149 | +# Load a text classification dataset |
| 150 | +dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]") |
| 151 | +df = dataset.to_pandas() |
| 152 | + |
| 153 | +# Visualize with metadata |
| 154 | +widget = EmbeddingAtlasWidget(df) |
| 155 | +widget |
| 156 | +``` |
| 157 | + |
| 158 | + |
| 159 | +## Additional Resources |
| 160 | + |
| 161 | +- [Embedding Atlas GitHub Repository](https://github.com/apple/embedding-atlas) |
| 162 | +- [Official Documentation](https://apple.github.io/embedding-atlas/) |
| 163 | +- [Interactive Demo](https://apple.github.io/embedding-atlas/upload/) |
| 164 | +- [Command Line Reference](https://apple.github.io/embedding-atlas/tool.html) |