In this project we use SentenceTransformers (SBERT) to create text embeddings from the following MARC fields:
- id: 001
- title: 245$a, 245$p, 245$f
- title inclusive dates: 245$b
- pagination: 300$a
- publication year: 008[7:11], 008[11:15], 264$c, 260$c
- vernacular title: linked fields of 245
- vernacular author: linked fields of 100, 110, 111
- context title: 505$t
- edition: 250$a
- publisher name: 264$b, 260$b
As a first step we use the pretrained model all-MiniLM-L6-v2, which according to the documentation is 5 times faster while still offering good quality.
We parse the MARCXML file and build a string from the values of the MARC fields listed above. We use this string to create a text embedding, and we save the result in a new JSON file that contains the record id and a text_embedding field.
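The string-building and embedding steps can be sketched as follows. The simplified record structure, the helper name `record_to_text`, and the field ordering are assumptions for illustration, not the repo's actual code:

```python
from typing import Dict, List

# Field/subfield keys used to build the embedding input; mirrors the
# field list above (the exact ordering here is an assumption).
FIELD_ORDER = [
    "245$a", "245$p", "245$f", "245$b", "300$a",
    "008[7:11]", "264$c", "260$c", "250$a", "264$b", "260$b",
]

def record_to_text(record: Dict[str, List[str]]) -> str:
    """Concatenate the selected MARC field values into one string.

    `record` is a simplified mapping of field/subfield keys to the values
    extracted from the MARCXML (a stand-in for a full pymarc record).
    """
    parts = []
    for key in FIELD_ORDER:
        parts.extend(v.strip() for v in record.get(key, []) if v.strip())
    return " ".join(parts)

# Embedding the string with SBERT (not run here; requires a model download):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# embedding = model.encode(record_to_text(record)).tolist()
# json_row = {"id": record_id, "text_embedding": embedding}
```

One JSON row per record, holding the id and the embedding vector, is enough for the PCA step that follows.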
In order to test the program:
- Clone the repo.
- Install JupyterLab in your local environment: `pip install jupyterlab`
- Go to the Bibdata events page. Find the events with the label *partner updates*. Download a few dump files. Don't use files that have *delete* in the dump file name. Rename the `.xml.gz` files so that they match the naming convention `scsb_update_*.xml.gz`. Save the files in the directory `data_marcxml`.
- In your terminal run `jupyter lab`. This will load Jupyter in localhost. Find and select the notebook file `dedup_embeddings_pca.ipynb`. Select `Kernel -> Restart Kernel and Run All Cells`.
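The renaming step above can be scripted. A minimal sketch, assuming the dumps were saved into `data_marcxml` and that prefixing the original filename is an acceptable way to satisfy the pattern:

```python
from pathlib import Path

def rename_dumps(directory: str = "data_marcxml") -> list:
    """Prefix downloaded .xml.gz dumps so they match scsb_update_*.xml.gz.

    Skips files that already match the pattern and files with 'delete'
    in the name, per the instructions above.
    """
    renamed = []
    for path in sorted(Path(directory).glob("*.xml.gz")):
        if path.name.startswith("scsb_update_") or "delete" in path.name:
            continue
        target = path.with_name(f"scsb_update_{path.name}")
        path.rename(target)
        renamed.append(target.name)
    return renamed
```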
Next steps:
- Use different sentence-transformer models to produce text embeddings. Apply PCA and compare the results.
- Use a clustering method to cluster the resulting pairs from the PCA components that share the same id.
- Use K-means instead of PCA to cluster the text embeddings. See Semantic Deduplication for different methods.
- Rewrite the repo in Rust, using sklears and polars.
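As a sketch of the PCA step on the saved embeddings, here is a minimal principal component analysis via SVD in plain NumPy rather than scikit-learn; the function name and the default of two components are illustrative choices:

```python
import numpy as np

def pca(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project embeddings onto their top principal components.

    embeddings: (n_records, dim) array, e.g. 384-dim all-MiniLM-L6-v2
    vectors loaded from the JSON output. Returns an
    (n_records, n_components) array of PCA scores.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Candidate duplicate pairs can then be found by comparing records that land close together in the reduced space, for example via pairwise distances on the PCA scores.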
References:
- Principal Component Analysis
- An intuitive introduction to text embeddings
- SentenceTransformers
- Why is it ok to average embeddings?
- Deep Learning vs Principal Component Analysis
- How to train Sentence Transformers
- Linking Theory and Practice of Digital Libraries
- Mastering Text Embeddings: A Key Ingredient for RAG Success
- Semantic Deduplication