Add support for pruning embeddings, such that only N embeddings are retained. Words whose embeddings are removed are mapped to their nearest neighbor among the retained embeddings.
This should provide more or less the same functionality as pruning in spaCy:
https://spacy.io/api/vocab#prune_vectors
I encourage some investigation here. Some ideas:
1. The most basic version could simply retain the embeddings of the N most frequent words and map each remaining word to its nearest neighbor among the N retained embeddings (see the first sketch below).
2. Select the retained vectors such that the similarities to the pruned vectors are maximized. The challenge here is making this tractable.
3. An approach similar to quantization, where k-means clustering is performed with N clusters. The embedding matrix is then replaced by the cluster centroid matrix, and each word maps to the cluster it is in (see the second sketch below). This could reuse the KMeans implementation from reductive, which is already a dependency of finalfusion.
I would focus on (1) and (3) first.
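
A minimal sketch of idea (1), assuming the rows of the embedding matrix are ordered by descending word frequency (as in finalfusion vocabularies). It is written against a plain `Vec<Vec<f32>>` rather than the actual finalfusion storage types, and the helpers `cosine` and `prune_most_frequent` are made-up names for illustration:

```rust
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// For every word index, return the index of the row that represents it
/// after pruning: the identity for the `n` retained words, the most
/// similar retained row for the pruned ones.
fn prune_most_frequent(embeds: &[Vec<f32>], n: usize) -> Vec<usize> {
    (0..embeds.len())
        .map(|idx| {
            if idx < n {
                idx
            } else {
                // Brute-force nearest neighbor over the retained rows:
                // O(n) per pruned word, fine for a first version.
                (0..n)
                    .max_by(|&a, &b| {
                        cosine(&embeds[idx], &embeds[a])
                            .partial_cmp(&cosine(&embeds[idx], &embeds[b]))
                            .unwrap()
                    })
                    .unwrap()
            }
        })
        .collect()
}
```

If the embeddings are l2-normalized, the cosine reduces to a plain dot product, which would make the brute-force search somewhat cheaper.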
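
And a rough sketch of idea (3), with a toy Lloyd iteration standing in for reductive's k-means; `prune_kmeans` and its signature are hypothetical:

```rust
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn nearest(point: &[f32], centroids: &[Vec<f32>]) -> usize {
    (0..centroids.len())
        .min_by(|&a, &b| {
            squared_distance(point, &centroids[a])
                .partial_cmp(&squared_distance(point, &centroids[b]))
                .unwrap()
        })
        .unwrap()
}

/// Return the centroid matrix (the new, smaller embedding matrix) and the
/// cluster assignment of each word.
fn prune_kmeans(
    embeds: &[Vec<f32>],
    n: usize,
    iterations: usize,
) -> (Vec<Vec<f32>>, Vec<usize>) {
    let dims = embeds[0].len();
    // Arbitrary but convenient initialization: seed the centroids with the
    // rows of the `n` most frequent words.
    let mut centroids: Vec<Vec<f32>> = embeds[..n].to_vec();
    let mut assignments = vec![0; embeds.len()];

    for _ in 0..iterations {
        // Assignment step: attach every word to its closest centroid.
        for (idx, embed) in embeds.iter().enumerate() {
            assignments[idx] = nearest(embed, &centroids);
        }

        // Update step: move each centroid to the mean of its cluster.
        let mut sums = vec![vec![0f32; dims]; n];
        let mut counts = vec![0usize; n];
        for (embed, &cluster) in embeds.iter().zip(&assignments) {
            counts[cluster] += 1;
            for (s, v) in sums[cluster].iter_mut().zip(embed) {
                *s += v;
            }
        }
        for (centroid, (sum, count)) in centroids.iter_mut().zip(sums.iter().zip(&counts)) {
            if *count > 0 {
                for (c, s) in centroid.iter_mut().zip(sum) {
                    *c = s / *count as f32;
                }
            }
        }
    }

    (centroids, assignments)
}
```

Note that unlike (1), this replaces the retained embeddings as well: every word, frequent or not, is represented by its cluster centroid.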
Benefits:
- Compresses the embedding matrix.
- Faster than quantized embedding matrices, because simple lookups are used.
- Could later be applied to @sebpuetz's non-hashed subword n-grams as well.
- Could perhaps be combined with quantization for even better compression.