BigDataBoutique/what-are-embeddings

 
 


This repository explores what embeddings are and how we can better understand them through visualization and clustering.

🔧 Getting Started

Install dependencies with Poetry:

poetry install

Then copy .env.template to .env and fill in the required values.

Open and run the notebooks in order.

📓 Included Notebooks

  1. Prepare: Generates embeddings for the dataset in the data/ folder using Cohere’s embed-english-v3.0 embedding model and ingests them into OpenSearch.

  2. PCA: Analyzes the raw embedding vectors:

    • Checks whether the vectors are normalized
    • Uses unit vectors to explore whether certain dimensions capture specific semantic features
  3. KMeans: Applies KMeans clustering to identify groups of semantically similar items.

  4. UMAP: Reduces the embedding space to 2D and 3D for visualization. Also demonstrates how even a small change in context can shift the meaning of a term, and with it the embedding.
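
The checks described for the PCA and KMeans notebooks can be illustrated with a minimal, standard-library sketch. This is not the code from the notebooks. The toy 3-dimensional vectors and the helper names (`is_normalized`, `unit_vector`, `kmeans`) are hypothetical, and the hand-rolled KMeans below is a simplified stand-in for whatever library implementation the notebooks actually use. Reading "uses unit vectors" as "take the dot product with a one-hot vector to isolate one dimension" is also an assumption.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def is_normalized(vec, tol=1e-6):
    """True if the vector has unit length within a tolerance."""
    return abs(l2_norm(vec) - 1.0) <= tol


def unit_vector(dim, index):
    """One-hot vector: 1.0 at `index`, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(dim)]


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Toy Lloyd's algorithm, seeded with the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy "embeddings"; real embedding vectors have far more dimensions.
embeddings = [
    [0.6, 0.8, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.6, 0.8],
    [0.0, 0.0, 1.0],
]

# 1. Normalization check: are all vectors unit length?
all_normalized = all(is_normalized(v) for v in embeddings)

# 2. Probe a single dimension: the dot product with a one-hot vector
#    returns that coordinate of the embedding.
probe = unit_vector(3, 2)
scores = [dot(v, probe) for v in embeddings]

# 3. Group similar vectors into two clusters.
labels = kmeans(embeddings, k=2)

print(all_normalized, scores, labels)
```

When the vectors are normalized, the dot product and cosine similarity coincide, so the normalization check determines which similarity metrics are interchangeable for the later steps.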
