Identify similarity and duplicates using text embeddings and PCA

This repository is a copy of dedup_pca.

Text Embeddings

In this project we use SentenceTransformers (SBERT) to create text embeddings from the following MARC fields:

  • id: 001
  • title: 245$a, 245$p, 245$f
  • title inclusive dates: 245$b
  • pagination: 300$a
  • publication year: 008[7:11], 008[11:15], 264$c, 260$c
  • vernacular title: linked fields of 245
  • vernacular author: linked fields of 100, 110, 111
  • context title: 505$t
  • edition: 250$a
  • publisher name: 264$b, 260$b
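
As an illustration, fields like those above can be pulled out of a MARCXML record with the standard library's ElementTree. The sketch below covers only a few of the listed fields; the sample record and helper names are hypothetical, and the real project may use a dedicated MARC library instead:

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}

# Hypothetical sample record, trimmed to a few of the fields listed above.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">9912345</controlfield>
  <controlfield tag="008">850101s1984    nyu           000 0 eng d</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">A sample title :</subfield>
    <subfield code="b">with dates</subfield>
  </datafield>
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">xii, 250 p.</subfield>
  </datafield>
</record>"""

def field_values(record, tag, codes):
    """Collect subfield values for one datafield tag and a set of subfield codes."""
    out = []
    for df in record.findall(f"m:datafield[@tag='{tag}']", NS):
        for sf in df.findall("m:subfield", NS):
            if sf.get("code") in codes and sf.text:
                out.append(sf.text.strip())
    return out

def record_to_text(record):
    """Concatenate selected MARC field values into one embedding input string."""
    f008 = record.findtext("m:controlfield[@tag='008']", default="", namespaces=NS)
    parts = []
    parts += field_values(record, "245", {"a", "p", "f"})  # title
    parts += field_values(record, "245", {"b"})            # title inclusive dates
    parts += field_values(record, "300", {"a"})            # pagination
    parts.append(f008[7:11])                               # publication year
    return " ".join(p for p in parts if p)

record = ET.fromstring(SAMPLE)
rec_id = record.findtext("m:controlfield[@tag='001']", namespaces=NS)
print(rec_id, "->", record_to_text(record))
```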

Methodology

As a first step we use the trained model all-MiniLM-L6-v2, which, according to the documentation, is 5 times faster than the larger models while still offering good quality.

We parse the MARCXML file and build a single string from the values of the MARC fields listed above. We use this string to create a text embedding, then save each embedding to a new JSON file that includes the record id and a text_embedding field.
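
A minimal sketch of that step. The embed function below is a placeholder standing in for the real SentenceTransformer call (shown in the comments), since downloading the model is outside the scope of this example; the record strings are hypothetical:

```python
import json

# With sentence-transformers installed, the real call would be roughly:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   vector = model.encode(text).tolist()  # a 384-dimensional vector
def embed(text):
    # Placeholder vector so the sketch runs without the model.
    return [float(len(word)) for word in text.split()]

records = {
    "9912345": "A sample title : with dates xii, 250 p. 1984",
    "9912346": "Another title 1990",
}

# One JSON object per record, holding the record id and its text_embedding,
# mirroring the structure described above.
output = [{"id": rid, "text_embedding": embed(text)} for rid, text in records.items()]
payload = json.dumps(output, indent=2)
print(payload[:80])
```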

Run the program

In order to test the program:

  • Clone the repo.
  • Install JupyterLab in your local environment: pip install jupyterlab
  • Go to the Bibdata events page and find the events with the label partner updates. Download a few dump files; don't use files that have delete in the dump file name. Rename the ...xml.gz files so that they match the naming convention scsb_update_*.xml.gz, and save them in the directory data_marcxml.
  • In your terminal run jupyter lab. This loads JupyterLab on localhost. Find and open the notebook file dedup_embeddings_pca.ipynb, then select Kernel -> Restart Kernel and Run All Cells.
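
For example, the data directory setup from the steps above looks like this (the dump file name is a placeholder; real files come from the Bibdata events page):

```shell
# Create the directory the notebook reads from
mkdir -p data_marcxml

# Stand-in for a downloaded dump file (a real one comes from the Bibdata events page)
touch update_20240101.xml.gz

# Rename to match the scsb_update_*.xml.gz convention and move it into place
mv update_20240101.xml.gz data_marcxml/scsb_update_20240101.xml.gz
```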

Next steps

  1. Use different sentence-transformer models to produce text embeddings. Apply PCA and compare the results.
  2. Use a clustering method to cluster the resulting pairs from the PCA components that share the same id.
  3. Use K-means instead of PCA to cluster the text embeddings. See Semantic Deduplication for different methods.
  4. Rewrite the repo in Rust, using sklears and polars.
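
To make the PCA idea concrete, here is a numpy-only sketch that projects toy embeddings onto the top two principal components and flags high-cosine-similarity pairs as duplicate candidates. The threshold, dimensions, and data are illustrative, not the notebook's actual settings:

```python
import numpy as np

# Toy embeddings (rows = records); in the project these would be the
# 384-dimensional SBERT vectors loaded from the JSON file.
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 8))
emb = np.vstack([
    base + 0.01 * rng.normal(size=(1, 8)),  # near-duplicate of the next row
    base,
    rng.normal(size=(2, 8)),                # unrelated records
])

# PCA via SVD: center the data, then project onto the top-2 principal components.
centered = emb - emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:2].T

# Cosine similarity in the reduced space flags candidate duplicate pairs.
unit = reduced / np.linalg.norm(reduced, axis=1, keepdims=True)
sim = unit @ unit.T
pairs = [(i, j) for i in range(len(sim)) for j in range(i + 1, len(sim))
         if sim[i, j] > 0.95]
print(pairs)
```

A clustering method such as K-means (next step 3) would replace the pairwise threshold with cluster assignments over the same reduced vectors.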

References

  1. Principal Component Analysis
  2. An intuitive introduction to text embeddings
  3. SentenceTransformers
  4. Why is it ok to average embeddings?
  5. Deep Learning vs Principal Component Analysis
  6. How to train Sentence Transformers
  7. Linking Theory and Practice of Digital Libraries
  8. Mastering Text Embeddings: A Key Ingredient for RAG Success
  9. Semantic Deduplication
