
Commit 178ca4f

sebpuetz authored and Daniël de Kok committed
Update README.md
1 parent f49680c commit 178ca4f

File tree: 1 file changed (+57 -9 lines)


README.md

Lines changed: 57 additions & 9 deletions
```diff
@@ -3,10 +3,12 @@
 ## Introduction
 
 `finalfusion-python` is a Python module for reading, writing, and
-using *finalfusion* embeddings. This module is implemented in Rust as
-a wrapper around the [finalfusion](https://docs.rs/finalfusion/)
-crate. The Python module supports the same types of finalfusion
-embeddings:
+using *finalfusion* embeddings, but also offers methods to read
+and use fastText, word2vec and GloVe embeddings. This module is
+implemented in Rust as a wrapper around the
+[finalfusion](https://docs.rs/finalfusion/) crate.
+
+The Python module supports the same types of embeddings:
 
 * Vocabulary:
   * No subwords
```
```diff
@@ -15,6 +17,11 @@ embeddings:
   * Array
   * Memory-mapped
   * Quantized
+* Format:
+  * finalfusion
+  * fastText
+  * word2vec
+  * GloVe
 
 ## Installation
 
```
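The subword vocabularies in the list above back off to character n-grams for unknown words. As a rough, stdlib-only illustration of the idea (this is *not* finalfusion's implementation; the n-gram length range of 3 to 6 and the angle-bracket wrapping follow the usual fastText scheme, and real vocabularies additionally map each n-gram to a matrix index):

```python
def character_ngrams(word, min_n=3, max_n=6):
    """Extract fastText-style character n-grams from a word.

    The word is wrapped in angle brackets so that prefixes and
    suffixes get distinct n-grams (e.g. "<Tü" vs. "übi").
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

# The first few n-grams for "Tübingen": "<Tü", "Tüb", "übi", ...
print(character_ngrams("Tübingen")[:3])
```

An embedding for an out-of-vocabulary word can then be computed by averaging the embeddings of its known n-grams, which is what subword vocabularies make possible.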

```diff
@@ -55,8 +62,12 @@ The wheels are then in the `target/wheels` directory.
 ## Getting embeddings
 
 finalfusion uses its own embedding format, which supports memory mapping,
-subword units, and quantized matrices. GloVe and word2vec embeddings
-can be converted using finalfusion's `ff-convert` utility.
+subword units, and quantized matrices. Moreover, finalfusion can read
+fastText, GloVe and word2vec embeddings, but does not support memory
+mapping those formats. Such embeddings can be converted to finalfusion
+format using
+[finalfusion-utils'](https://github.com/finalfusion/finalfusion-utils)
+`convert`.
 
 Embeddings trained with
 [finalfrontier](https://github.com/finalfusion/finalfrontier) version
```
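The word2vec binary layout mentioned above is simple enough to sketch: a text header with vocabulary size and dimensionality, then each word followed by a space and its raw float32 vector (assumed little-endian here, as written by the usual tools). A minimal stdlib-only reader/writer pair for illustration; finalfusion's real readers handle this format and its variants, so this is a sketch, not the library's implementation:

```python
import struct
from io import BytesIO

def write_word2vec(vectors):
    """Serialize a {word: [floats]} dict in word2vec binary format."""
    dims = len(next(iter(vectors.values())))
    buf = BytesIO()
    # Header: "<vocab_size> <dims>\n"
    buf.write(f"{len(vectors)} {dims}\n".encode("utf-8"))
    for word, vec in vectors.items():
        # Each entry: the word, a space, then dims little-endian float32s.
        buf.write(word.encode("utf-8") + b" ")
        buf.write(struct.pack(f"<{dims}f", *vec))
    return buf.getvalue()

def read_word2vec(data):
    """Parse word2vec binary data back into a {word: [floats]} dict."""
    buf = BytesIO(data)
    n_words, dims = (int(part) for part in buf.readline().split())
    vectors = {}
    for _ in range(n_words):
        # Read the word byte-by-byte up to the separating space.
        word = bytearray()
        while (ch := buf.read(1)) != b" ":
            word.extend(ch)
        vec = list(struct.unpack(f"<{dims}f", buf.read(4 * dims)))
        vectors[word.decode("utf-8")] = vec
    return vectors
```

Round-tripping a small table through these two functions reproduces the input exactly as long as the values are representable in float32.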
```diff
@@ -65,23 +76,60 @@ with this Python module.
 
 ## Usage
 
-finalfusion embeddings can be loaded as follows:
+Embeddings can be loaded as follows:
 
 ~~~python
 import finalfusion
+# Loading embeddings in finalfusion format
 embeds = finalfusion.Embeddings("myembeddings.fifu")
 
 # Or if you want to memory-map the embedding matrix:
-embeds = finalfusion.Embeddings("myembeddings", mmap=True)
+embeds = finalfusion.Embeddings("myembeddings.fifu", mmap=True)
+
+# fastText format
+embeds = finalfusion.read_fasttext("myembeddings.bin")
+
+# word2vec format
+embeds = finalfusion.read_word2vec("myembeddings.w2v")
 ~~~
 
 You can then compute an embedding, perform similarity queries, or analogy
 queries:
 
 ~~~python
 e = embeds.embedding("Tübingen")
-embeds.similarity("Tübingen")
+# default similarity query for "Tübingen"
+embeds.word_similarity("Tübingen")
+
+# similarity query based on a vector, returning the closest embedding to
+# the input vector, skipping "Tübingen"
+embeds.embeddings_similarity(e, skip={"Tübingen"})
+
+# default analogy query
 embeds.analogy("Berlin", "Deutschland", "Amsterdam")
+
+# analogy query allowing "Deutschland" as answer
+embeds.analogy("Berlin", "Deutschland", "Amsterdam", mask=(True,False,True))
+~~~
+
+If you want to operate directly on the full embedding matrix, you can
+get a copy of this matrix through:
+~~~python
+# get a copy of the embedding matrix; changes to it won't touch the original
+embeds.matrix_copy()
+~~~
+
+Finally, access to the vocabulary is provided through:
+~~~python
+v = embeds.vocab()
+# get a list of indices associated with "Tübingen"
+v.item_to_indices("Tübingen")
+
+# get a list of `(ngram, index)` tuples for "Tübingen"
+v.ngram_indices("Tübingen")
+
+# get a list of subword indices for "Tübingen"
+v.subword_indices("Tübingen")
 ~~~
 
 More usage examples can be found in the
```
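The similarity and analogy queries in the diff above boil down to cosine similarity over the embedding matrix. A toy, stdlib-only sketch of the idea; the words and vectors are invented, and `word_similarity` and `analogy` here are local stand-ins illustrating the computation, not finalfusion's API:

```python
import math

# Toy embedding table; the vectors are made up for illustration only.
embeddings = {
    "Berlin":      [0.9, 0.1, 0.3],
    "Deutschland": [0.8, 0.2, 0.9],
    "Amsterdam":   [0.1, 0.9, 0.3],
    "Niederlande": [0.0, 1.0, 0.9],
}

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def word_similarity(word):
    # Rank all other words by cosine similarity to `word`.
    query = embeddings[word]
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: cosine(embeddings[w], query), reverse=True)

def analogy(a, b, c):
    # 3CosAdd analogy: the word closest to b - a + c, masking out
    # the three query words (cf. the `mask` parameter above).
    query = [vb - va + vc
             for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(embeddings[w], query))

print(word_similarity("Berlin"))                      # most similar first
print(analogy("Berlin", "Deutschland", "Amsterdam"))  # → Niederlande
```

Allowing a query word as an answer, as `mask=(True,False,True)` does above, simply amounts to not excluding that word from the candidate set.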
