Identify similarity and duplicates using text embeddings and PCA

This repository is a copy of dedup_pca.

Text Embeddings

In this project we use SentenceTransformers (SBERT) to create text embeddings from the following MARC fields:

  • id: 001
  • title: 245$a, 245$p, 245$f
  • title inclusive dates: 245$b
  • pagination: 300$a
  • publication year: 008[7:11], 008[11:15], 264$c, 260$c
  • vernacular title: linked fields of 245
  • vernacular author: linked fields of 100, 110, 111
  • context title: 505$t
  • edition: 250$a
  • publisher name: 264$b, 260$b
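
As an illustration, fields like those above can be pulled out of a MARCXML record with the standard library's ElementTree. The sketch below covers only a few of the listed fields; the sample record and helper names are hypothetical, and the real project may use a dedicated MARC library instead:

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}

# Hypothetical sample record, trimmed to a few of the fields listed above.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">9912345</controlfield>
  <controlfield tag="008">850101s1984    nyu           000 0 eng d</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">A sample title :</subfield>
    <subfield code="b">with dates</subfield>
  </datafield>
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">xii, 250 p.</subfield>
  </datafield>
</record>"""

def field_values(record, tag, codes):
    """Collect subfield values for one datafield tag and a set of subfield codes."""
    out = []
    for df in record.findall(f"m:datafield[@tag='{tag}']", NS):
        for sf in df.findall("m:subfield", NS):
            if sf.get("code") in codes and sf.text:
                out.append(sf.text.strip())
    return out

def record_to_text(record):
    """Concatenate selected MARC field values into one embedding input string."""
    f008 = record.findtext("m:controlfield[@tag='008']", default="", namespaces=NS)
    parts = []
    parts += field_values(record, "245", {"a", "p", "f"})  # title
    parts += field_values(record, "245", {"b"})            # title inclusive dates
    parts += field_values(record, "300", {"a"})            # pagination
    parts.append(f008[7:11])                               # publication year
    return " ".join(p for p in parts if p)

record = ET.fromstring(SAMPLE)
rec_id = record.findtext("m:controlfield[@tag='001']", namespaces=NS)
print(rec_id, "->", record_to_text(record))
```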

Methodology

As a first step we use the trained model all-MiniLM-L6-v2, which, according to the documentation, is 5 times faster than the larger models while still offering good quality.

We parse the MARCXML file and build a single string from the values of the MARC fields listed above. We use this string to create a text embedding, then save each embedding to a new JSON file that includes the record id and a text_embedding field.
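
A minimal sketch of that step. The embed function below is a placeholder standing in for the real SentenceTransformer call (shown in the comments), since downloading the model is outside the scope of this example; the record strings are hypothetical:

```python
import json

# With sentence-transformers installed, the real call would be roughly:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   vector = model.encode(text).tolist()  # a 384-dimensional vector
def embed(text):
    # Placeholder vector so the sketch runs without the model.
    return [float(len(word)) for word in text.split()]

records = {
    "9912345": "A sample title : with dates xii, 250 p. 1984",
    "9912346": "Another title 1990",
}

# One JSON object per record, holding the record id and its text_embedding,
# mirroring the structure described above.
output = [{"id": rid, "text_embedding": embed(text)} for rid, text in records.items()]
payload = json.dumps(output, indent=2)
print(payload[:80])
```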

Run the program

In order to test the program:

  • Clone the repo.
  • Install JupyterLab in your local environment: pip install jupyterlab
  • Go to the Bibdata events page and find the events with the label partner updates. Download a few dump files; don't use files that have delete in the dump file name. Rename the ...xml.gz files so that they match the naming convention scsb_update_*.xml.gz, and save them in the directory data_marcxml.
  • In your terminal run jupyter lab. This loads JupyterLab on localhost. Find and open the notebook file dedup_embeddings_pca.ipynb, then select Kernel -> Restart Kernel and Run All Cells.
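
For example, the data directory setup from the steps above looks like this (the dump file name is a placeholder; real files come from the Bibdata events page):

```shell
# Create the directory the notebook reads from
mkdir -p data_marcxml

# Stand-in for a downloaded dump file (a real one comes from the Bibdata events page)
touch update_20240101.xml.gz

# Rename to match the scsb_update_*.xml.gz convention and move it into place
mv update_20240101.xml.gz data_marcxml/scsb_update_20240101.xml.gz
```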

Next steps

  1. Use different sentence-transformer models to produce text embeddings. Apply PCA and compare the results.
  2. Use a clustering method to cluster the resulting pairs from the PCA components that share the same id.
  3. Use K-means instead of PCA to cluster the text embeddings. See Semantic Deduplication for different methods.
  4. Rewrite the repo in Rust, using sklears and polars.
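
To make the PCA idea concrete, here is a numpy-only sketch that projects toy embeddings onto the top two principal components and flags high-cosine-similarity pairs as duplicate candidates. The threshold, dimensions, and data are illustrative, not the notebook's actual settings:

```python
import numpy as np

# Toy embeddings (rows = records); in the project these would be the
# 384-dimensional SBERT vectors loaded from the JSON file.
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 8))
emb = np.vstack([
    base + 0.01 * rng.normal(size=(1, 8)),  # near-duplicate of the next row
    base,
    rng.normal(size=(2, 8)),                # unrelated records
])

# PCA via SVD: center the data, then project onto the top-2 principal components.
centered = emb - emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:2].T

# Cosine similarity in the reduced space flags candidate duplicate pairs.
unit = reduced / np.linalg.norm(reduced, axis=1, keepdims=True)
sim = unit @ unit.T
pairs = [(i, j) for i in range(len(sim)) for j in range(i + 1, len(sim))
         if sim[i, j] > 0.95]
print(pairs)
```

A clustering method such as K-means (next step 3) would replace the pairwise threshold with cluster assignments over the same reduced vectors.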

References

  1. Principal Component Analysis
  2. An intuitive introduction to text embeddings
  3. SentenceTransformers
  4. Why is it ok to average embeddings?
  5. Deep Learning vs Principal Component Analysis
  6. How to train Sentence Transformers
  7. Linking Theory and Practice of Digital Libraries
  8. Mastering Text Embeddings: A Key Ingredient for RAG Success
  9. Semantic Deduplication
