Conversation

@shengbo-ma (Contributor) commented Mar 9, 2025

Issue: #259

Description:

  • replace deprecated LangChain API .run() with .invoke()
  • replace deprecated load_qa_chain with a simple LangChain chain
  • add a new default prompt
  • update the docstring for documentation use

I did not add a test for this change, since langchain is not installed in the KeyBERT test environment; instead, I posted the test below (same as in the keybert.llm.LangChain docstring).

New user experience, which also shows how to test this change:

from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

from keybert import KeyLLM
from keybert.llm import LangChain

_llm = ChatOpenAI(
    model="gpt-4o",
    api_key="my-openai-api-key",
    temperature=0,
)
_prompt = ChatPromptTemplate(
    [
        ("human", LangChain.DEFAULT_PROMPT_TEMPLATE),  # the default prompt from KeyBERT
    ]
)
chain = _prompt | _llm | StrOutputParser()


# Create your LLM
llm = LangChain(chain)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
docs = [
    "KeyBERT: A minimal method for keyword extraction with BERT. The keyword extraction is done by finding the sub-phrases in a document that are the most similar to the document itself. First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.",
    "KeyLLM: A minimal method for keyword extraction with Large Language Models (LLM). The keyword extraction is done by simply asking the LLM to extract a number of keywords from a single piece of text.",
]
candidates = [
    ["keyword extraction", "Large Language Models", "LLM", "BERT", "transformer", "embeddings"],
    ["keyword extraction", "Large Language Models", "LLM", "BERT", "transformer", "embeddings"],
]
keywords = kw_model.extract_keywords(docs=docs, candidate_keywords=candidates)
print(keywords)
# [['keyword extraction', 'BERT', 'embeddings'], ['keyword extraction', 'Large Language Models', 'LLM']]

Discussion

@MaartenGr Please review these concerns below.

  • While implementing this PR, I found it a bit hard to identify what KeyBERT wants to provide on top of LangChain, so please review whether the new user experience above makes sense. My understanding is that:

    • keybert.llm.LangChain provides
      • a default prompt template that nicely extracts keywords from a doc using candidate keywords
      • keywords output parsed as list[str] so that it can be integrated with KeyLLM
    • I am assuming that users have a basic understanding of writing a simple LangChain chain like the one above, since they chose to use LangChain with KeyBERT.
    • I assume the motivation of keybert.llm.LangChain is to support a chain, not an LLM model.
      However, a chain can be so flexible (containing prompts, LLMs, output formats/parsers, etc.) that it is very tricky to change the prompt template of a given chain inside our own keybert.llm.LangChain by taking a prompt arg. So in this PR, I suggest that users use LangChain.DEFAULT_PROMPT_TEMPLATE when they create the chain, and the prompt arg is removed from keybert.llm.LangChain.__init__.
  • I see keybert.llm.LangChain imports langchain in the source code (this line), but does not list it as a dependency in pyproject.toml. Are we assuming that before keybert.llm.LangChain is called, the user must have installed langchain, since they have to write their own chain?

@shengbo-ma marked this pull request as ready for review March 9, 2025 23:45
@MaartenGr (Owner)

@shengbo-ma Thanks for the PR!

keybert.llm.LangChain provides
a default prompt template that nicely extracts keywords from a doc using candidate keywords
keywords output parsed as list[str] so that it can be integrated with KeyLLM

That's correct; like the other LLMs, it is merely used as the main model to parse the documents and/or keywords according to the 5 principles here.

I am assuming that users have a basic understanding of writing a simple LangChain chain like the one above, since they chose to use LangChain with KeyBERT.

The intent is that users do not need a basic understanding of any LLM provider/framework but can simply throw in any model (GGUF or otherwise). LangChain is a tricky library to be familiar with considering the quick changes in its API over the years. As such, I would prefer the most straightforward implementation so that users do not have to learn a framework. Compared to the LangChain implementation in BERTopic, I believe the keyword extraction use case is much simpler.

I assume the motivation of keybert.llm.LangChain is to support a chain, not an LLM model.
However, a chain can be so flexible (containing prompts, LLMs, output formats/parsers, etc.) that it is very tricky to change the prompt template of a given chain inside our own keybert.llm.LangChain by taking a prompt arg. So in this PR, I suggest that users use LangChain.DEFAULT_PROMPT_TEMPLATE when they create the chain, and the prompt arg is removed from keybert.llm.LangChain.__init__.

It is actually to support an LLM model considering many users encounter LangChain as one of the first techniques to load a model (rather than llama-cpp-python, ollama, vllm, etc.).

That said, I would actually prefer to keep the prompt variable here as users would only have to replace {DOCUMENTS} with [DOCUMENTS]. Perhaps you can check out the BERTopic implementation here: https://github.com/MaartenGr/BERTopic/blob/master/bertopic/representation/_langchain.py
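For illustration, keeping a prompt argument would let users customize the template along these lines (a hedged sketch: it assumes the llm-plus-prompt signature proposed below in this thread, and [DOCUMENTS] is KeyBERT's placeholder convention rather than a LangChain template variable):

from langchain_openai import ChatOpenAI

from keybert import KeyLLM
from keybert.llm import LangChain

# Hedged sketch: assumes LangChain.__init__ accepts an LLM plus a prompt string.
# [DOCUMENTS] is KeyBERT's placeholder; KeyBERT substitutes the document text itself.
custom_prompt = "I have the following document: [DOCUMENTS]. Give me the five most important keywords, separated by commas."
llm = LangChain(ChatOpenAI(model="gpt-4o"), prompt=custom_prompt)
kw_model = KeyLLM(llm)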

I see keybert.llm.LangChain imports langchain in the source code (this line), but does not list it as a dependency in pyproject.toml. Are we assuming that before keybert.llm.LangChain is called, the user must have installed langchain, since they have to write their own chain?

Yes, LangChain needs to be installed beforehand, but it does not need to be included in the pyproject.toml considering there are currently no tests for LangChain. It will give an error if it is not installed: https://github.com/MaartenGr/KeyBERT/blob/master/keybert/llm/__init__.py
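For reference, a guarded import of the kind linked above behaves roughly like the sketch below (the module path, class body, and message wording are assumptions, not the actual source):

# Hedged sketch of a guarded optional import; the real keybert/llm/__init__.py differs.
try:
    from keybert.llm._langchain import LangChain  # module path assumed
except ModuleNotFoundError:
    class LangChain:
        """Placeholder that raises a helpful error when langchain is missing."""

        def __init__(self, *args, **kwargs):
            raise ModuleNotFoundError(
                "langchain is not installed. Install it with `pip install langchain` "
                "to use keybert.llm.LangChain."
            )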

@shengbo-ma (Contributor, Author)

@MaartenGr Thanks for the info. Will change this PR accordingly.

As such, I would prefer the most straightforward implementation so that users do not have to learn a framework.

Will match what BERTopic does for LangChain.
Glad to see BERTopic also relies on the user to copy the LangChain Expression Language (LCEL) from the BERTopic documentation and feed it into the LangChain API, as is done in this PR.

Compared to the LangChain implementation in BERTopic, I believe the keyword extraction use case is much simpler.

Agree. I don't think load_qa_chain is necessary here. A simple and straightforward chain like the one in this PR should be good enough, which also means one less dependency from langchain.

It is actually to support an LLM model considering many users encounter LangChain as one of the first techniques to load a model (rather than llama-cpp-python, ollama, vllm, etc.).
That said, I would actually prefer to keep the prompt variable here as users would only have to replace {DOCUMENTS} with [DOCUMENTS]. Perhaps you can check out the BERTopic implementation here:

Got it. Will restore the prompt arg in next commits.

@shengbo-ma (Contributor, Author) commented Mar 17, 2025

It is actually to support an LLM model considering many users encounter LangChain as one of the first techniques to load a model (rather than llama-cpp-python, ollama, vllm, etc.).

@MaartenGr Thanks for this clarification. It inspires me to simplify the user experience as below. Please let me know if it makes sense.

Updates

  • keybert.llm.LangChain.__init__ takes a LangChain LLM class rather than a chain, so users simply pass their LangChain LLM, e.g. ChatOpenAI
  • keybert.llm.LangChain.__init__ takes a prompt again.
  • A simple chain is constructed inside keybert.llm.LangChain.__init__ from the input llm and prompt, so users don't need any knowledge of LangChain (a rough sketch follows this list).
  • Updated the example in the docstring accordingly.
  • langchain >= 0.1 is required for this PR, since we now use .invoke instead of .run, and the simple chain uses the LangChain Expression Language (LCEL), which also requires langchain >= 0.1.
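As a rough illustration of the chain construction described above, the internals could look like the sketch below (built from the LCEL pipeline shown earlier in this thread; the class body, default prompt text, and comma-splitting parser are assumptions, not the merged code):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate


class LangChainSketch:
    """Hedged sketch of the described design, not the merged implementation."""

    # Illustrative only; the actual DEFAULT_PROMPT_TEMPLATE differs.
    DEFAULT_PROMPT_TEMPLATE = "Extract keywords from this document: {document}"

    def __init__(self, llm, prompt=None):
        prompt = prompt or self.DEFAULT_PROMPT_TEMPLATE
        # LCEL pipeline: prompt -> llm -> plain-string output
        template = ChatPromptTemplate([("human", prompt)])
        self.chain = template | llm | StrOutputParser()

    def extract_keywords(self, document: str) -> list[str]:
        output = self.chain.invoke({"document": document})
        return [kw.strip() for kw in output.split(",") if kw.strip()]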

The new implementation

from langchain_openai import ChatOpenAI

from keybert import KeyLLM
from keybert.llm import LangChain

_llm = ChatOpenAI(
    model="gpt-4o",
    api_key="my-openai-api-key",
    temperature=0,
)


# Create your LLM
llm = LangChain(_llm)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
docs = [
    "KeyBERT: A minimal method for keyword extraction with BERT. The keyword extraction is done by finding the sub-phrases in a document that are the most similar to the document itself. First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.",
    "KeyLLM: A minimal method for keyword extraction with Large Language Models (LLM). The keyword extraction is done by simply asking the LLM to extract a number of keywords from a single piece of text.",
]
keywords = kw_model.extract_keywords(docs=docs)
print("with no candidates")
print(keywords)

# Output:
# [
#     ['KeyBERT', 'keyword extraction', 'BERT', 'document embeddings', 'word embeddings', 'N-gram phrases', 'cosine similarity', 'document representation'],
#     ['KeyLLM', 'keyword extraction', 'Large Language Models', 'LLM', 'minimal method']
# ]


# fine tune with candidate keywords
candidates = [
    ["keyword extraction", "Large Language Models", "LLM", "BERT", "transformer", "embeddings"],
    ["keyword extraction", "Large Language Models", "LLM", "BERT", "transformer", "embeddings"],
]
keywords = kw_model.extract_keywords(docs=docs, candidate_keywords=candidates)
print("with candidates")
print(keywords)

# Output:
# [
#     ['keyword extraction', 'BERT', 'document embeddings', 'word embeddings', 'cosine similarity', 'N-gram phrases'],
#     ['KeyLLM', 'keyword extraction', 'Large Language Models', 'LLM']
# ]

@shengbo-ma (Contributor, Author)

This PR is ready for review @MaartenGr

@MaartenGr (Owner) left a comment


Left the tiniest of comments, but otherwise it looks great!

Comment on lines 12 to 16
"""NOTE
langchain >= 0.1 is required. Which supports:
- chain.invoke()
- LangChain Expression Language (LCEL) is used and it is not compatible with langchain < 0.1.
"""
@MaartenGr (Owner)

Might be best to also update this in the documentation here: https://github.com/MaartenGr/KeyBERT/blob/master/docs/guides/llms.md

@shengbo-ma (Contributor, Author)

Sure. Updated.

@shengbo-ma (Contributor, Author) commented Mar 20, 2025

@MaartenGr
Added a simple version check for langchain.

The packaging package is used for the version check. It is a popular package for Python versioning, and it is an existing (transitive) dependency of KeyBERT:

  • keybert -> sentence-transformers -> transformers -> packaging

That said, it would still be better to include it explicitly in pyproject.toml, in case sentence-transformers is no longer a dependency or drops packaging itself, although that is not very likely.
Please let me know if it makes sense to use packaging in this PR and whether to explicitly add packaging as a dependency in pyproject.toml.
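For context, a packaging-based guard of the kind described would look roughly like this (a sketch, not the exact code in the commit):

from importlib.metadata import version as installed_version

from packaging.version import Version

# Minimum-version guard using packaging (sketch only).
if Version(installed_version("langchain")) < Version("0.1"):
    raise ImportError("keybert.llm.LangChain requires langchain >= 0.1.")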

@MaartenGr (Owner)

Thanks for the initiative! There's an edge case for this, namely that it is possible to install KeyBERT without sentence-transformers by simply running pip install keybert --no-deps scikit-learn model2vec.

As such, it will throw an error when using this functionality. I would prefer not using this additional dependency if it is not absolutely needed for the core functionality of KeyBERT (which includes this light-weight installation option).

@shengbo-ma (Contributor, Author) commented Mar 22, 2025

I would prefer not using this additional dependency if it is not absolutely needed for the core functionality of KeyBERT (which includes this light-weight installation option).

@MaartenGr Yeah, makes sense.
I removed the version check logic that uses the packaging dependency. Instead, I catch the ModuleNotFoundError for langchain_core, which was introduced in langchain 0.1.0.
It is equivalent to checking version >= 0.1.
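In code, that check amounts to something like the following (langchain_core is the real package name; the error message wording is an assumption):

# langchain_core only exists for langchain >= 0.1.0, so a failed import
# doubles as a minimum-version check.
try:
    import langchain_core  # noqa: F401
except ModuleNotFoundError as e:
    raise ModuleNotFoundError(
        "keybert.llm.LangChain requires langchain >= 0.1. "
        "Please upgrade, e.g. `pip install --upgrade langchain`."
    ) from e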

@shengbo-ma requested a review from MaartenGr March 22, 2025 06:53
@MaartenGr (Owner)

Awesome, thank you for the work on this. Everything looks in order!

@MaartenGr merged commit 55a9d1e into MaartenGr:master Mar 25, 2025
6 checks passed