
Conversation


@asrelo asrelo commented Nov 10, 2025

The documentation for the max sum distance algorithm says (or can easily be interpreted to say) that the time complexity of the algorithm is O(n^2). That is correct only for the inner part of the algorithm's main loop. The loop itself iterates over all possible combinations of top_n phrases among nr_candidates (usually 2 * top_n).

It's a bit complicated to derive the actual time complexity. The loop body runs C(nr_candidates, top_n) times (where C denotes a binomial coefficient), and each run has quadratic complexity. For the default case nr_candidates == 2 * top_n, Stirling's approximation gives C(2 * top_n, top_n) ≈ 4^top_n / sqrt(pi * top_n), so the total work grows exponentially in top_n (I described it as "super-exponential" in the docs; either way, it is nowhere near polynomial).
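For context, the structure in question looks roughly like this (a simplified sketch with illustrative names, not the exact KeyBERT code):

    # Simplified sketch of the maxsum selection pattern: every combination of
    # top_n phrases out of nr_candidates is scored, and scoring one combination
    # touches every pair inside it, i.e. roughly top_n^2 work per combination.
    import itertools

    import numpy as np

    def pick_least_similar_subset(candidate_similarities: np.ndarray, top_n: int):
        """candidate_similarities: (nr_candidates, nr_candidates) pairwise matrix."""
        nr_candidates = candidate_similarities.shape[0]
        best_combo, best_score = None, float("inf")
        # C(nr_candidates, top_n) iterations of the outer loop ...
        for combo in itertools.combinations(range(nr_candidates), top_n):
            # ... each doing O(top_n^2) pair lookups
            score = sum(
                candidate_similarities[i, j]
                for i, j in itertools.combinations(combo, 2)
            )
            if score < best_score:
                best_combo, best_score = combo, score
        return best_combo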

Anyway, this is much worse than quadratic complexity; look at this growth:

  • top_n == 5: 252 combinations to evaluate;
  • top_n == 10: about 185 thousand combinations;
  • top_n == 20: about 138 billion combinations.

No wonder it somehow worked on top_n = 10 but I never got it to work at top_n = 20.
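The counts above are just C(2 * top_n, top_n) and are easy to check directly:

    # Number of combinations the loop has to evaluate when
    # nr_candidates == 2 * top_n (the default policy).
    from math import comb

    for top_n in (5, 10, 20):
        print(top_n, comb(2 * top_n, top_n))
    # -> 5 252
    # -> 10 184756
    # -> 20 137846528820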

This PR fixes the misleading notes on the time complexity of the algorithm and slightly clarifies the explanation of the algorithm itself. The docstring is still a bit odd, since its note on complexity is written as if the argument nr_candidates did not exist and 2 * top_n were always used (even though some examples in the docs set these arguments separately).

Only the part inside the loop has quadratic time complexity;
the loop itself has super-exponential time complexity.
@MaartenGr
Owner

Thank you for taking the time to go through this in this much detail! I haven't taken a look at this code for a while so I'll need some time to go through it in more detail to get a better understanding. I might want to create a small test case with dummy data to verify this particular case if I find the time.

That said, the note that you refer to isn't what I would call "wildly misleading". Runtime complexity isn't much of a theme throughout the documentation of KeyBERT, nor am I trying to mislead users into thinking this has a certain runtime complexity. To me, it is an attempt to explain the underlying methodology. By saying "wildly misleading", it now sounds as if I did something horrible... Explanations can always be improved, but I would rather frame this as enhancing documentation than as me somehow attempting to mislead others.

@asrelo
Author

asrelo commented Nov 10, 2025

I am not a native speaker and probably assumed the wrong shade of meaning, sorry.

I believe this particular note should be addressed one way or another: within the practical range of top_n, quadratic complexity does not look like it could ever be a big problem, whereas super-exponential complexity becomes a very real problem somewhere beyond top_n = 10.

@asrelo asrelo changed the title Fix the wildly misleading note on time complexity of the maxsum algorithm Fix the erroneous note on time complexity of the maxsum algorithm Nov 10, 2025
@MaartenGr
Owner

No worries, glad to hear you understand.

Do you perhaps have any reproducible code so that I can test this further? I think it's best to have a practical example to showcase this.

@asrelo
Author

asrelo commented Dec 18, 2025

I'm not sure what reproducible code would be needed beyond inspecting the library code. I tried to put together a demo measuring execution time at different (nr_candidates, top_n), but it turned out to be tricky to implement properly with all the necessary timeouts, so it runs horrendously slowly (4 s per trial).

Some results (obsolete, see the next comment):

  • each entry represents the average run time of max_sum_distance for the given parameters, over as many trials as fit into 1 s overall;
  • blue entries below the diagonal mark parameters for which even a single run of max_sum_distance took longer than 1 s;
  • note the logarithmic time scale on the figure;
  • the dashed line marks the default policy of nr_candidates = top_n * 2.
[Figure_1: run times of max_sum_distance over (nr_candidates, top_n), log time scale]

raw data: keybert_maxsum_perf_data.csv - Pastebin.com

Or you can edit the script and just do 2 tests with --trial-timeout=inf:

  1. CANDIDATES_NUM_LIMITS = (20, 20)
    RESULTS_NUM_LIMITS = (10, 10)

    which finishes in seconds;

  2. CANDIDATES_NUM_LIMITS = (40, 40)
    RESULTS_NUM_LIMITS = (20, 20)

    which finishes when the Sun explodes.
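If a more self-contained repro is useful, something along these lines should show the effect through the public API (assuming the usual extract_keywords parameters use_maxsum, top_n and nr_candidates; the embedding step is the same for every case, so the growth between cases comes from the maxsum part):

    # Minimal repro sketch via the public KeyBERT API; timings include the
    # (constant) embedding step, so only the growth between cases matters.
    import time

    from keybert import KeyBERT

    doc = (
        "Supervised learning is the machine learning task of learning a function "
        "that maps an input to an output based on example input-output pairs. "
        "It infers a function from labeled training data consisting of a set of "
        "training examples. In supervised learning, each example is a pair "
        "consisting of an input object (typically a vector) and a desired output "
        "value (also called the supervisory signal). A supervised learning "
        "algorithm analyzes the training data and produces an inferred function, "
        "which can be used for mapping new examples."
    )

    kw_model = KeyBERT()

    # (top_n=20, nr_candidates=40), i.e. the default policy for top_n = 20,
    # effectively never finishes, so it is left out here.
    for top_n, nr_candidates in [(5, 10), (10, 20)]:
        start = time.perf_counter()
        kw_model.extract_keywords(
            doc,
            keyphrase_ngram_range=(1, 2),
            use_maxsum=True,
            top_n=top_n,
            nr_candidates=nr_candidates,
        )
        print(top_n, nr_candidates, f"{time.perf_counter() - start:.2f} s")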

@asrelo
Author

asrelo commented Dec 18, 2025

Made the demo a bit more stable:

[Figure_1: updated run times of max_sum_distance over (nr_candidates, top_n)]

raw data: keybert_maxsum_perf_data.csv - Pastebin.com

