## Introduction

`finalfusion-python` is a Python module for reading, writing, and
using *finalfusion* embeddings. It also offers methods to read and use
fastText, word2vec, and GloVe embeddings. The module is implemented in
Rust as a wrapper around the
[finalfusion](https://docs.rs/finalfusion/) crate.

The Python module supports the same types of embeddings as the crate:

* Vocabulary:
  * No subwords
  * Subwords
* Storage:
  * Array
  * Memory-mapped
  * Quantized
* Format:
  * finalfusion
  * fastText
  * word2vec
  * GloVe

## Installation

Wheels built from the Rust source end up in the `target/wheels` directory.

## Getting embeddings

finalfusion uses its own embedding format, which supports memory mapping,
subword units, and quantized matrices. Moreover, finalfusion can read
fastText, GloVe, and word2vec embeddings, but does not support memory
mapping those formats. Such embeddings can be converted to the
finalfusion format with the `convert` command of
[finalfusion-utils](https://github.com/finalfusion/finalfusion-utils).
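If you prefer to stay in Python, a rough sketch of the same conversion
(the `write` method name is an assumption based on the module's
read/write support; check the API documentation for the exact call):

~~~ python
import finalfusion

# Read word2vec embeddings and save them in finalfusion format, so that
# later loads can use memory mapping. NOTE: `write` is an assumed method
# name, based on the reading/writing support described in the introduction.
embeds = finalfusion.read_word2vec("myembeddings.w2v")
embeds.write("myembeddings.fifu")
~~~
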
Embeddings trained with
[finalfrontier](https://github.com/finalfusion/finalfrontier) can be
used directly with this Python module.

## Usage

Embeddings can be loaded as follows:

~~~ python
import finalfusion

# Loading embeddings in the finalfusion format:
embeds = finalfusion.Embeddings("myembeddings.fifu")

# Or if you want to memory-map the embedding matrix:
embeds = finalfusion.Embeddings("myembeddings.fifu", mmap=True)

# fastText format:
embeds = finalfusion.read_fasttext("myembeddings.bin")

# word2vec format:
embeds = finalfusion.read_word2vec("myembeddings.w2v")
~~~
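
Since each format has its own reader, a small convenience wrapper can be
handy. The helper below is hypothetical (not part of the module) and only
uses the loading functions shown above:

~~~ python
import finalfusion

# Hypothetical helper: pick a reader based on the file extension.
def load_embeddings(path):
    if path.endswith(".fifu"):
        # Memory mapping is only supported for the finalfusion format.
        return finalfusion.Embeddings(path, mmap=True)
    elif path.endswith(".bin"):
        return finalfusion.read_fasttext(path)
    else:
        return finalfusion.read_word2vec(path)
~~~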

You can then compute an embedding and perform similarity or analogy
queries:
8098
8199~~~ python
82100e = embeds.embedding(" Tübingen" )
83- embeds.similarity(" Tübingen" )
101+ # default similarity query for "Tübingen"
102+ embeds.word_similarity(" Tübingen" )
103+
104+ # similarity query based on a vector, returning the closest embedding to
105+ # the input vector, skipping "Tübingen"
106+ embeds.embeddings_similarity(e, skip = {" Tübingen" })
107+
108+ # default analogy query
84109embeds.analogy(" Berlin" , " Deutschland" , " Amsterdam" )
110+
111+ # analogy query allowing "Deutschland" as answer
112+ embeds.analogy(" Berlin" , " Deutschland" , " Amsterdam" , mask = (True ,False ,True ))
113+ ~~~
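
For comparisons that the query methods do not cover, you can also work
with the raw vectors. A minimal sketch, assuming that `embedding` returns
a NumPy array:

~~~ python
import numpy as np

# Cosine similarity between two words, computed from their raw embeddings.
v1 = embeds.embedding("Berlin")
v2 = embeds.embedding("Hamburg")
cosine = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
~~~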

If you want to operate directly on the full embedding matrix, you can
get a copy of this matrix through:

~~~ python
# get a copy of the embedding matrix; changes to the copy won't touch
# the original matrix
embeds.matrix_copy()
~~~
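
The copy behaves like an ordinary matrix, so the usual NumPy operations
apply. A short sketch (assuming the copy is a NumPy array):

~~~ python
import numpy as np

m = embeds.matrix_copy()
print(m.shape)  # (number of rows, embedding dimensionality)
# Compute the norm of every stored vector; `m` is a copy, so this
# cannot corrupt `embeds`.
norms = np.linalg.norm(m, axis=1)
~~~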

Finally, access to the vocabulary is provided through:

~~~ python
v = embeds.vocab()

# get a list of indices associated with "Tübingen"
v.item_to_indices("Tübingen")

# get a list of `(ngram, index)` tuples for "Tübingen"
v.ngram_indices("Tübingen")

# get a list of subword indices for "Tübingen"
v.subword_indices("Tübingen")
~~~
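
With a subword vocabulary, these indices point into the embedding matrix;
this is how embeddings for unknown words are composed. The sketch below
illustrates the idea and rests on an assumption: that subword indices are
valid row indices into the copied matrix.

~~~ python
import numpy as np

# Approximate an embedding for an out-of-vocabulary word by averaging
# the matrix rows at its subword indices (assumption, see above).
m = embeds.matrix_copy()
indices = v.subword_indices("Tübingenerin")
if indices:
    oov = np.mean(m[indices], axis=0)
~~~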

More usage examples can be found in the