
Conversation


@asrelo asrelo commented Nov 10, 2025

The documentation for the max sum distance algorithm says (or can easily be interpreted to say) that the time complexity of the algorithm is O(n^2). That is correct only for the inner part of the algorithm's main loop. The loop itself iterates over all possible combinations of top_n phrases among nr_candidates (usually 2 * top_n).

It's a bit complicated to derive the actual time complexity. The loop body runs C(nr_candidates, top_n) times (where C denotes a binomial coefficient), and each run has quadratic complexity. For the default case nr_candidates == 2 * top_n, Stirling's approximation gives C(2 * top_n, top_n) ≈ 4^top_n / sqrt(pi * top_n), so the total work grows exponentially in top_n (I described it as "super-exponential" in the docs; either way, it is nowhere near polynomial).
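For context, the structure in question looks roughly like this (a simplified sketch with illustrative names, not the exact KeyBERT code):

    # Simplified sketch of the maxsum selection pattern: every combination of
    # top_n phrases out of nr_candidates is scored, and scoring one combination
    # touches every pair inside it, i.e. roughly top_n^2 work per combination.
    import itertools

    import numpy as np

    def pick_least_similar_subset(candidate_similarities: np.ndarray, top_n: int):
        """candidate_similarities: (nr_candidates, nr_candidates) pairwise matrix."""
        nr_candidates = candidate_similarities.shape[0]
        best_combo, best_score = None, float("inf")
        # C(nr_candidates, top_n) iterations of the outer loop ...
        for combo in itertools.combinations(range(nr_candidates), top_n):
            # ... each doing O(top_n^2) pair lookups
            score = sum(
                candidate_similarities[i, j]
                for i, j in itertools.combinations(combo, 2)
            )
            if score < best_score:
                best_combo, best_score = combo, score
        return best_combo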

Anyway, this is much worse than quadratic complexity; look at this growth:

  • top_n == 5: 252 combinations to evaluate;
  • top_n == 10: about 185 thousand combinations;
  • top_n == 20: about 138 billion combinations.

No wonder it somehow worked on top_n = 10 but I never got it to work at top_n = 20.
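The counts above are just C(2 * top_n, top_n) and are easy to check directly:

    # Number of combinations the loop has to evaluate when
    # nr_candidates == 2 * top_n (the default policy).
    from math import comb

    for top_n in (5, 10, 20):
        print(top_n, comb(2 * top_n, top_n))
    # -> 5 252
    # -> 10 184756
    # -> 20 137846528820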

This PR fixes the misleading notes on the time complexity of the algorithm and slightly clarifies the explanation of the algorithm itself. The docstring is still a bit odd, since its note on complexity is written as if the argument nr_candidates did not exist and 2 * top_n were always used (even though some examples in the docs set these arguments separately).

Only the part inside the loop has quadratic time complexity;
the loop itself has super-exponential time complexity.
@MaartenGr
Owner

Thank you for taking the time to go through this in this much detail! I haven't taken a look at this code for a while so I'll need some time to go through it in more detail to get a better understanding. I might want to create a small test case with dummy data to verify this particular case if I find the time.

That said, the note that you refer to isn't what I would call "wildly misleading". Runtime complexity isn't much of a theme throughout the documentation of KeyBERT, nor am I trying to mislead users into thinking this has a certain runtime complexity. To me, it is an attempt to explain the underlying methodology. By saying "wildly misleading", it now sounds as if I did something horrible... Explanations can always be improved, but I would rather frame this as enhancing documentation than as me somehow attempting to mislead others.

@asrelo
Author

asrelo commented Nov 10, 2025

I am not a native speaker and probably assumed the wrong shade of meaning, sorry.

I believe this particular note should be addressed one way or another: within the practical range of top_n, quadratic complexity does not look like it could ever be a big problem, whereas super-exponential complexity becomes a very real problem somewhere beyond top_n = 10.

@asrelo asrelo changed the title Fix the wildly misleading note on time complexity of the maxsum algorithm Fix the erroneous note on time complexity of the maxsum algorithm Nov 10, 2025
@MaartenGr
Owner

No worries, glad to hear you understand.

Do you perhaps have any reproducible code so that I can test this further? I think it's best to have a practical example to showcase this.

@asrelo
Author

asrelo commented Dec 18, 2025

I'm not sure what reproducible code would be needed beyond inspecting the library code. I tried to put together a demo measuring execution time at different (nr_candidates, top_n), but it turned out to be tricky to implement properly with all the necessary timeouts, so it runs horrendously slowly (4 s per trial).

Some results (obsolete, see the next comment):

  • each entry represents the average run time of max_sum_distance for the given parameters, over as many trials as fit into 1 s overall;
  • blue entries below the diagonal mark parameters for which even a single run of max_sum_distance took longer than 1 s;
  • note the logarithmic time scale on the figure;
  • the dashed line marks the default policy of nr_candidates = top_n * 2.
[Figure_1: run times of max_sum_distance over (nr_candidates, top_n), log time scale]

raw data: keybert_maxsum_perf_data.csv - Pastebin.com

Or you can edit the script and just do 2 tests with --trial-timeout=inf:

  1. CANDIDATES_NUM_LIMITS = (20, 20)
    RESULTS_NUM_LIMITS = (10, 10)

    which finishes in seconds;

  2. CANDIDATES_NUM_LIMITS = (40, 40)
    RESULTS_NUM_LIMITS = (20, 20)

    which finishes when the Sun explodes.
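If a more self-contained repro is useful, something along these lines should show the effect through the public API (assuming the usual extract_keywords parameters use_maxsum, top_n and nr_candidates; the embedding step is the same for every case, so the growth between cases comes from the maxsum part):

    # Minimal repro sketch via the public KeyBERT API; timings include the
    # (constant) embedding step, so only the growth between cases matters.
    import time

    from keybert import KeyBERT

    doc = (
        "Supervised learning is the machine learning task of learning a function "
        "that maps an input to an output based on example input-output pairs. "
        "It infers a function from labeled training data consisting of a set of "
        "training examples. In supervised learning, each example is a pair "
        "consisting of an input object (typically a vector) and a desired output "
        "value (also called the supervisory signal). A supervised learning "
        "algorithm analyzes the training data and produces an inferred function, "
        "which can be used for mapping new examples."
    )

    kw_model = KeyBERT()

    # (top_n=20, nr_candidates=40), i.e. the default policy for top_n = 20,
    # effectively never finishes, so it is left out here.
    for top_n, nr_candidates in [(5, 10), (10, 20)]:
        start = time.perf_counter()
        kw_model.extract_keywords(
            doc,
            keyphrase_ngram_range=(1, 2),
            use_maxsum=True,
            top_n=top_n,
            nr_candidates=nr_candidates,
        )
        print(top_n, nr_candidates, f"{time.perf_counter() - start:.2f} s")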

@asrelo
Author

asrelo commented Dec 18, 2025

Made the demo a bit more stable:

[Figure_1: updated run times of max_sum_distance over (nr_candidates, top_n)]

raw data: keybert_maxsum_perf_data.csv - Pastebin.com

