Generalize SIMD distance implementation to n-length vectors #38
Conversation
Is the regression for building or for querying? We should ideally benchmark both. I think we'd be willing to take a small regression on building, but not regressing querying is more important. It would also be good to take a quick look at the impact on in-memory size.
I've just run some benchmarks for querying. There's a lot of variance in the results; I think it's coming from Python's overhead since I'm running the queries through the Python bindings. For the original, the shortest loop took 3.227s; after updating the metric it grew to 5.059s, and after using …
Right, those are definitely some substantial latency regressions, so I don't think we want to go this route. (Probably not necessary right now, but it might be interesting to test the low-level metric stuff with Rust benchmarks instead of the Python bindings? Would get rid of the Python noise at least.)
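As a side note, a pure-Rust timing loop along these lines would cut the Python bindings out of the measurement entirely. This is only a sketch: the `euclidean` function, the 300-dimension inputs, and the iteration counts are illustrative assumptions rather than the crate's actual code, and a criterion-based benchmark would give more rigorous statistics (warm-up, outlier handling, confidence intervals), which matters when the differences are in the single-digit percent range.

```rust
// Hypothetical micro-benchmark for a low-level distance routine, using only std.
// The function and the inputs are illustrative assumptions, not this crate's API.
use std::hint::black_box;
use std::time::Instant;

fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

fn main() {
    // 300-dimension vectors, roughly matching fastText embeddings.
    let a: Vec<f32> = (0..300).map(|i| i as f32 * 0.01).collect();
    let b: Vec<f32> = (0..300).map(|i| i as f32 * 0.02).collect();

    // Warm up so the timed loop is not paying one-time costs.
    for _ in 0..10_000 {
        black_box(euclidean(black_box(&a), black_box(&b)));
    }

    let iters = 1_000_000u32;
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from removing the calls.
        black_box(euclidean(black_box(&a), black_box(&b)));
    }
    let elapsed = start.elapsed();
    println!("{:.1} ns/call", elapsed.as_nanos() as f64 / iters as f64);
}
```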
Should keep this issue in mind:
I think I'd prefer to continue with this one and close/rework #35 once this PR is merged.
Why is that? Do you think you'll still have time to work on this? Hope the new job is working out well for you!
The new job is great. I'm still hoping to finish this, but the last few months are proving me wrong.
Ambient looks cool, BTW. How about we just say that we're going to pick it up, and if you get to it before us that's nice, but we'll both assume that you have enough other things on your plate as it is? 🙂
Addresses #20
Changes
- `instant-distance-py` crate
- `Metric` trait
- `PointStorage` to prevent an extra dereference when getting to data (see the sketch below)
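As a rough illustration of what a `Metric` / `PointStorage` split could look like: the trait names come from the list above, but the signatures and the example implementation below are assumptions made for illustration, not the definitions from this PR.

```rust
// Illustrative only: assumed shapes for a Metric / PointStorage split,
// not the actual definitions in this PR.

/// A distance function between two points of type `P`.
pub trait Metric<P> {
    fn distance(a: &P, b: &P) -> f32;
}

/// Storage for points; `get` hands back a reference straight into the
/// backing buffer, avoiding an extra dereference on the hot path.
pub trait PointStorage {
    type Point;

    fn len(&self) -> usize;
    fn get(&self, index: usize) -> &Self::Point;
}

/// Plain Euclidean metric over vectors of arbitrary length.
pub struct Euclidean;

impl Metric<Vec<f32>> for Euclidean {
    fn distance(a: &Vec<f32>, b: &Vec<f32>) -> f32 {
        a.iter()
            .zip(b.iter())
            .map(|(x, y)| (x - y) * (x - y))
            .sum::<f32>()
            .sqrt()
    }
}
```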
Benchmark
I used building an HNSW index for the first 100k fastText words as the benchmark (benchmarking script for reference).
The original implementation takes around 18.7s (min of 20 runs); after the changes it takes 19.5s (again min of 20 runs). That is roughly 4.2% slower; I tried profiling it, but I'm not sure where the difference comes from.
As for querying (10k queries per benchmark): the original takes 4.8s and the new version 4.5s, so it actually looks slightly faster.
Benchmarking directly in Rust (added in the 1st commit of this PR)
Original (almost unchanged, just with the benchmarking added):
After changes:
This time building seems to be 16.8% slower (less visible when going through Python). Querying is comparable.
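As background on the technique the title refers to (not code from this PR): a distance over n-length vectors can walk the slices in fixed-width chunks, which the compiler can auto-vectorize or which map directly onto SIMD intrinsics, with a scalar tail for lengths that are not a multiple of the chunk width. The sketch below is a hedged illustration; the function name, the chunk width of 8, and all details are assumptions. The per-call chunk/remainder bookkeeping is one plausible, unconfirmed source of a small build-time overhead compared to an implementation hard-coded for a single dimension count.

```rust
/// Squared Euclidean distance over slices of arbitrary (equal) length,
/// processed in fixed-width chunks so the hot loop is SIMD-friendly.
/// A width of 8 is an arbitrary illustrative choice (one AVX f32 register).
fn squared_euclidean(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());

    let mut chunks_a = a.chunks_exact(8);
    let mut chunks_b = b.chunks_exact(8);
    let mut acc = [0.0f32; 8];

    // Main loop over full-width chunks; the fixed inner loop of 8 lanes
    // is easy for the compiler to vectorize.
    for (ca, cb) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        for i in 0..8 {
            let d = ca[i] - cb[i];
            acc[i] += d * d;
        }
    }

    // Scalar tail for lengths that are not a multiple of the chunk width.
    let mut sum: f32 = acc.iter().sum();
    for (x, y) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        let d = x - y;
        sum += d * d;
    }
    sum
}
```

The square root is omitted here on purpose: nearest-neighbor search only needs the relative ordering of distances, so squared distance is commonly used to save a sqrt per comparison.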