This repository explores what embeddings are and how we can better understand them through visualization and clustering.
Install dependencies with Poetry:
poetry install
Then copy .env.template to .env and fill in the required values.
Open and run the notebooks in order.
- Prepare: Generates embeddings for the dataset in the data/ folder using Cohere’s embed-english-v3.0 embedding model, and ingests them into OpenSearch.
- PCA: Analyzes the raw embedding vectors:
  - Checks for normalization
  - Uses unit vectors to explore if certain dimensions capture specific semantic features
- KMeans: Applies KMeans clustering to identify groups of semantically similar items.
- UMAP: Reduces the embedding space to 2D and 3D for visualization. Also demonstrates how even a small change in context can shift the meaning, and with it the embedding, of a term.
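
To give a feel for the two checks listed under PCA, here is a minimal sketch using NumPy on random stand-in vectors. It is not the notebook's actual code: the function names `is_normalized` and `top_matches_for_dimension`, the tolerance, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def is_normalized(embeddings: np.ndarray, tol: float = 1e-3) -> bool:
    """Return True if every row has an L2 norm within `tol` of 1."""
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.all(np.abs(norms - 1.0) <= tol))


def top_matches_for_dimension(embeddings: np.ndarray, dim: int, k: int = 3) -> np.ndarray:
    """Rank rows by their dot product with the unit vector for `dim`.

    The unit vector is 1 at `dim` and 0 elsewhere, so the dot product equals
    the row's value in that coordinate. Returns the indices of the k rows
    with the largest value in that dimension, highest first.
    """
    unit = np.zeros(embeddings.shape[1])
    unit[dim] = 1.0
    scores = embeddings @ unit
    return np.argsort(scores)[::-1][:k]


# Stand-in data (assumption): 5 random vectors of size 8.
# Real embeddings would come from the embedding model instead.
rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 8))
unit_rows = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(is_normalized(raw))
print(is_normalized(unit_rows))
print(top_matches_for_dimension(unit_rows, dim=0))
```

With real data, looking up the texts behind the returned indices is one way to judge whether a single dimension lines up with a recognizable semantic feature.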

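The clustering step can be illustrated with a small, self-contained sketch. This is a plain NumPy version of Lloyd's algorithm (the standard k-means iteration), chosen here so the example has no dependency beyond NumPy; the notebook itself may well use a library implementation. The function name `kmeans`, its parameters, and the toy points are assumptions for illustration only.

```python
import numpy as np


def kmeans(points: np.ndarray, k: int, n_iter: int = 50, seed: int = 0):
    """Plain Lloyd's algorithm. Returns (labels, centroids). Requires n_iter >= 1."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its members;
        # keep the previous centroid if a cluster ends up empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Converged: centroids stopped moving.
        centroids = new_centroids
    return labels, centroids


# Toy stand-in for embeddings (assumption): two obvious groups in 2D.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(points, k=2)
print(labels)     # The first two points share one label, the last two the other.
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.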