Commit cee0e99

Merge pull request #251 from dbespalov/python_bindings_pickle_io
Add pickle support to python bindings `Index` class
2 parents 21b54fe + 345f71d commit cee0e99

4 files changed: +561 -92 lines changed

README.md

Lines changed: 35 additions & 3 deletions
@@ -37,7 +37,7 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib.
 #### Short API description
 * `hnswlib.Index(space, dim)` creates a non-initialized HNSW index in space `space` with integer dimension `dim`.
 
-Index methods:
+`hnswlib.Index` methods:
 * `init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100)` initializes the index with no elements.
     * `max_elements` defines the maximum number of elements that can be stored in the structure (can be increased/shrunk).
     * `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)).
@@ -76,14 +76,34 @@ Index methods:
 
 * `get_current_count()` - returns the current number of elements stored in the index
 
-
-
+Read-only properties of `hnswlib.Index` class:
+
+* `space` - name of the space (can be one of "l2", "ip", or "cosine").
+
+* `dim` - dimensionality of the space.
+
+* `M` - parameter that defines the maximum number of outgoing connections in the graph.
+
+* `ef_construction` - parameter that controls speed/accuracy trade-off during the index construction.
+
+* `max_elements` - current capacity of the index. Equivalent to `p.get_max_elements()`.
+
+* `element_count` - number of items in the index. Equivalent to `p.get_current_count()`.
+
+Properties of `hnswlib.Index` that support reading and writing:
+
+* `ef` - parameter controlling query time/accuracy trade-off.
+
+* `num_threads` - default number of threads to use in `add_items` or `knn_query`. Note that calling `p.set_num_threads(3)` is equivalent to `p.num_threads=3`.
+
+
 
 
 #### Python bindings examples
 ```python
 import hnswlib
 import numpy as np
+import pickle
 
 dim = 128
 num_elements = 10000
@@ -106,6 +126,18 @@ p.set_ef(50) # ef should always be > k
 
 # Query dataset, k - number of closest elements (returns 2 numpy arrays)
 labels, distances = p.knn_query(data, k = 1)
+
+# Index objects support pickling
+# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
+# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
+p_copy = pickle.loads(pickle.dumps(p)) # creates a copy of index p using pickle round-trip
+
+### Index parameters are exposed as class properties:
+print(f"Parameters passed to constructor: space={p_copy.space}, dim={p_copy.dim}")
+print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
+print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
+print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")
+
 ```
 
 An example with updates after serialization/deserialization:

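The README hunks above describe two things: read-only versus read/write properties, and a pickle round-trip that preserves `ef`. The sketch below illustrates both mechanisms with a hypothetical pure-Python class, `ToyIndex`. It is not hnswlib's implementation (the bindings are compiled code), and the state layout is an assumption made only for this illustration.

```python
import pickle


class ToyIndex:
    """Hypothetical pure-Python stand-in for an index; not hnswlib code."""

    def __init__(self, space, dim):
        self._space = space   # fixed at construction
        self._dim = dim       # fixed at construction
        self._items = []      # stored vectors
        self.ef = 10          # plain attribute: readable and writable

    @property
    def space(self):
        return self._space

    @property
    def dim(self):
        return self._dim

    @property
    def element_count(self):
        return len(self._items)

    def add_items(self, vectors):
        for vector in vectors:
            if len(vector) != self._dim:
                raise ValueError("vector has the wrong dimensionality")
            self._items.append(list(vector))

    def __getstate__(self):
        # Everything needed to rebuild the object, as plain Python data.
        return {"space": self._space, "dim": self._dim,
                "ef": self.ef, "items": self._items}

    def __setstate__(self, state):
        # Called by pickle on a fresh instance; __init__ is not run.
        self._space = state["space"]
        self._dim = state["dim"]
        self.ef = state["ef"]
        self._items = [list(v) for v in state["items"]]


p = ToyIndex("l2", 3)
p.add_items([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
p.ef = 50

p_copy = pickle.loads(pickle.dumps(p))  # round-trip through bytes

try:
    p_copy.dim = 4  # no setter is defined, so this raises AttributeError
    dim_is_read_only = False
except AttributeError:
    dim_is_read_only = True

print(p_copy.space, p_copy.dim, p_copy.element_count, p_copy.ef, dim_is_read_only)
```

Defining a property without a setter is what makes `space`, `dim`, and `element_count` read-only here, while `ef` stays a writable attribute that travels with the pickled state.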
hnswlib/hnswalg.h

Lines changed: 6 additions & 7 deletions
@@ -637,7 +637,6 @@ namespace hnswlib {
 if (!input.is_open())
     throw std::runtime_error("Cannot open file");
 
-
 // get file size:
 input.seekg(0,input.end);
 std::streampos total_filesize=input.tellg();
@@ -874,7 +873,7 @@ namespace hnswlib {
 for (auto&& cand : sCand) {
     if (cand == neigh)
         continue;
-
+
     dist_t distance = fstdistfunc_(getDataByInternalId(neigh), getDataByInternalId(cand), dist_func_param_);
     if (candidates.size() < elementsToKeep) {
         candidates.emplace(distance, cand);
@@ -1137,7 +1136,7 @@ namespace hnswlib {
 }
 
 std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst> top_candidates;
-if (has_deletions_) {
+if (has_deletions_) {
     top_candidates=searchBaseLayerST<true,true>(
             currObj, query_data, std::max(ef_, k));
 }
@@ -1186,27 +1185,27 @@ namespace hnswlib {
                 std::unordered_set<tableint> s;
                 for (int j=0; j<size; j++){
                     assert(data[j] > 0);
-                    assert(data[j] < cur_element_count);
+                    assert(data[j] < cur_element_count);
                     assert (data[j] != i);
                     inbound_connections_num[data[j]]++;
                     s.insert(data[j]);
                     connections_checked++;
-
+
                 }
                 assert(s.size() == size);
             }
         }
         if(cur_element_count > 1){
             int min1=inbound_connections_num[0], max1=inbound_connections_num[0];
-            for(int i=0; i < cur_element_count; i++){
+            for(int i=0; i < cur_element_count; i++){
                 assert(inbound_connections_num[i] > 0);
                 min1=std::min(inbound_connections_num[i],min1);
                 max1=std::max(inbound_connections_num[i],max1);
             }
             std::cout << "Min inbound: " << min1 << ", Max inbound:" << max1 << "\n";
         }
         std::cout << "integrity ok, checked " << connections_checked << " connections\n";
-
+
     }
 
 };
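
The last hunk touches an integrity check that counts inbound connections per element. As a reading aid, here is a minimal Python sketch of that logic on a plain adjacency structure. The outer loops over elements and levels are an assumption, since the hunk begins inside them, and `check_integrity` and `links` are hypothetical names. The sketch accepts neighbor id 0, whereas the hunk asserts `data[j] > 0`, and it raises `ValueError` where the C++ code uses `assert`.

```python
def check_integrity(links):
    """Validate a layered graph.

    links[i] is a list of per-level neighbor lists for element i
    (a hypothetical layout chosen for this sketch).
    Returns (connections_checked, min_inbound, max_inbound).
    """
    n = len(links)
    inbound = [0] * n
    checked = 0
    for i, levels in enumerate(links):
        for neighbors in levels:
            seen = set()
            for j in neighbors:
                if not 0 <= j < n:
                    raise ValueError(f"element {i}: neighbor {j} out of range")
                if j == i:
                    raise ValueError(f"element {i} links to itself")
                inbound[j] += 1
                seen.add(j)
                checked += 1
            # A shorter set means the neighbor list contained duplicates.
            if len(seen) != len(neighbors):
                raise ValueError(f"element {i} has duplicate neighbors")
    # With more than one element, each must be reachable from another.
    if n > 1 and min(inbound) == 0:
        raise ValueError("some element has no inbound connection")
    return checked, min(inbound, default=0), max(inbound, default=0)


# Three elements, fully connected on a single level.
links = [[[1, 2]], [[0, 2]], [[0, 1]]]
result = check_integrity(links)
print(result)
```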

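The README comment added in this commit warns that `pickle.dumps(p)` is not thread-safe with `p.add_items`. One common way to avoid such a race is to guard both operations with the same lock. The sketch below shows that pattern on a hypothetical `LockedStore` class that holds a plain list. Whether hnswlib needs exactly this arrangement is an assumption; the source only states the warning.

```python
import pickle
import threading


class LockedStore:
    """Hypothetical wrapper: one lock guards both mutation and serialization."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def add_items(self, batch):
        with self._lock:
            self._items.extend(batch)

    def dumps(self):
        # Serializing under the lock means no batch is half-written.
        with self._lock:
            return pickle.dumps(self._items)


store = LockedStore()
writers = [threading.Thread(target=store.add_items, args=([i] * 100,))
           for i in range(4)]
for writer in writers:
    writer.start()
early = pickle.loads(store.dumps())  # taken while writers may still be running
for writer in writers:
    writer.join()
final = pickle.loads(store.dumps())

print(len(early) % 100 == 0, len(final))
```

Because every batch is added under the lock, a snapshot taken mid-run contains only whole batches, although which batches it contains depends on thread scheduling.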