Conversation

@shengbo-ma (Contributor) commented Mar 9, 2025

Issue: #259

Description:

  • replace deprecated LangChain API .run() with .invoke()
  • replace deprecated load_qa_chain with a simple LangChain chain
  • add a new default prompt
  • update the docstring for documentation use

I did not add a test for this change, since langchain is not installed in the KeyBERT test environment; instead, I posted the test below (same as in the keybert.llm.LangChain docstring).

New user experience, which also shows how to test this change:

from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

from keybert import KeyLLM
from keybert.llm import LangChain

_llm = ChatOpenAI(
    model="gpt-4o",
    api_key="my-openai-api-key",
    temperature=0,
)
_prompt = ChatPromptTemplate(
    [
        ("human", LangChain.DEFAULT_PROMPT_TEMPLATE),  # the default prompt from KeyBERT
    ]
)
chain = _prompt | _llm | StrOutputParser()


# Create your LLM
llm = LangChain(chain)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
docs = [
    "KeyBERT: A minimal method for keyword extraction with BERT. The keyword extraction is done by finding the sub-phrases in a document that are the most similar to the document itself. First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.",
    "KeyLLM: A minimal method for keyword extraction with Large Language Models (LLM). The keyword extraction is done by simply asking the LLM to extract a number of keywords from a single piece of text.",
]
candidates = [
    ["keyword extraction", "Large Language Models", "LLM", "BERT", "transformer", "embeddings"],
    ["keyword extraction", "Large Language Models", "LLM", "BERT", "transformer", "embeddings"],
]
keywords = kw_model.extract_keywords(docs=docs, candidate_keywords=candidates)
print(keywords)
# [['keyword extraction', 'BERT', 'embeddings'], ['keyword extraction', 'Large Language Models', 'LLM']]

Discussion

@MaartenGr Please review these concerns below.

  • While implementing this PR, I found it a bit hard to identify what KeyBERT wants to provide on top of LangChain, so please review whether the new user experience above makes sense. My understanding is that:

    • keybert.llm.LangChain provides
      • a default prompt template that nicely extracts keywords from a doc using candidate keywords
      • keywords output parsed as list[str] so that it can be integrated with KeyLLM
    • I am assuming that users have a basic understanding of writing a simple LangChain chain like the one above, since they chose to use LangChain with KeyBERT.
    • I assume the motivation of keybert.llm.LangChain is to support a chain, not an LLM model.
      However, a chain can be so flexible (containing prompts, LLMs, output formats/parsers, etc.) that it is very tricky to change the prompt template of a given chain inside our own keybert.llm.LangChain by taking a prompt arg. So in this PR, I suggest that users use LangChain.DEFAULT_PROMPT_TEMPLATE when they create the chain, and the prompt arg is removed from keybert.llm.LangChain.__init__.
  • I see keybert.llm.LangChain imports langchain in the source code (this line), but does not list it as a dependency in pyproject.toml. Are we assuming that before keybert.llm.LangChain is called, the user must have installed langchain, since they have to write their own chain?

@shengbo-ma marked this pull request as ready for review March 9, 2025 23:45
@MaartenGr (Owner)

@shengbo-ma Thanks for the PR!

keybert.llm.LangChain provides
a default prompt template that nicely extracts keywords from a doc using candidate keywords
keywords output parsed as list[str] so that it can be integrated with KeyLLM

That's correct; like the other LLMs, it is merely used as the main model to parse the documents and/or keywords according to the 5 principles here.

I am assuming that users have a basic understanding of writing a simple LangChain chain like the one above, since they chose to use LangChain with KeyBERT.

The intent is that users do not need a basic understanding of any LLM provider/framework but can simply throw in any model (GGUF or otherwise). LangChain is a tricky library to be familiar with considering the quick changes in its API over the years. As such, I would prefer the most straightforward implementation so that users do not have to learn a framework. Compared to the LangChain implementation in BERTopic, I believe the keyword extraction use case is much simpler.

I assume the motivation of keybert.llm.LangChain is to support a chain, not an LLM model.
However, a chain can be so flexible (containing prompts, LLMs, output formats/parsers, etc.) that it is very tricky to change the prompt template of a given chain inside our own keybert.llm.LangChain by taking a prompt arg. So in this PR, I suggest that users use LangChain.DEFAULT_PROMPT_TEMPLATE when they create the chain, and the prompt arg is removed from keybert.llm.LangChain.__init__.

It is actually to support an LLM model considering many users encounter LangChain as one of the first techniques to load a model (rather than llama-cpp-python, ollama, vllm, etc.).

That said, I would actually prefer to keep the prompt variable here as users would only have to replace {DOCUMENTS} with [DOCUMENTS]. Perhaps you can check out the BERTopic implementation here: https://github.com/MaartenGr/BERTopic/blob/master/bertopic/representation/_langchain.py
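For illustration, keeping a prompt argument would let users customize the template along these lines (a hedged sketch: it assumes the llm-plus-prompt signature proposed below in this thread, and [DOCUMENTS] is KeyBERT's placeholder convention rather than a LangChain template variable):

from langchain_openai import ChatOpenAI

from keybert import KeyLLM
from keybert.llm import LangChain

# Hedged sketch: assumes LangChain.__init__ accepts an LLM plus a prompt string.
# [DOCUMENTS] is KeyBERT's placeholder; KeyBERT substitutes the document text itself.
custom_prompt = "I have the following document: [DOCUMENTS]. Give me the five most important keywords, separated by commas."
llm = LangChain(ChatOpenAI(model="gpt-4o"), prompt=custom_prompt)
kw_model = KeyLLM(llm)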

I see keybert.llm.LangChain imports langchain in the source code (this line), but does not list it as a dependency in pyproject.toml. Are we assuming that before keybert.llm.LangChain is called, the user must have installed langchain, since they have to write their own chain?

Yes, LangChain needs to be installed beforehand, but it does not need to be included in the pyproject.toml considering there are currently no tests for LangChain. It will give an error if it is not installed: https://github.com/MaartenGr/KeyBERT/blob/master/keybert/llm/__init__.py
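For reference, a guarded import of the kind linked above behaves roughly like the sketch below (the module path, class body, and message wording are assumptions, not the actual source):

# Hedged sketch of a guarded optional import; the real keybert/llm/__init__.py differs.
try:
    from keybert.llm._langchain import LangChain  # module path assumed
except ModuleNotFoundError:
    class LangChain:
        """Placeholder that raises a helpful error when langchain is missing."""

        def __init__(self, *args, **kwargs):
            raise ModuleNotFoundError(
                "langchain is not installed. Install it with `pip install langchain` "
                "to use keybert.llm.LangChain."
            )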

@shengbo-ma (Contributor, Author)

@MaartenGr Thanks for the info. Will change this PR accordingly.

As such, I would prefer the most straightforward implementation so that users do not have to learn a framework.

Will match what BERTopic does for LangChain.
Glad to see BERTopic also relies on the user to copy the LangChain Expression Language (LCEL) from the BERTopic documentation and feed it into the LangChain API, as is done in this PR.

Compared to the LangChain implementation in BERTopic, I believe the keyword extraction use case is much simpler.

Agree. I don't think load_qa_chain is necessary here. A simple and straightforward chain like the one in this PR should be good enough, which also means one less dependency from langchain.

It is actually to support an LLM model considering many users encounter LangChain as one of the first techniques to load a model (rather than llama-cpp-python, ollama, vllm, etc.).
That said, I would actually prefer to keep the prompt variable here as users would only have to replace {DOCUMENTS} with [DOCUMENTS]. Perhaps you can check out the BERTopic implementation here:

Got it. Will restore the prompt arg in next commits.

@shengbo-ma (Contributor, Author) commented Mar 17, 2025

It is actually to support an LLM model considering many users encounter LangChain as one of the first techniques to load a model (rather than llama-cpp-python, ollama, vllm, etc.).

@MaartenGr Thanks for this clarification. It inspires me to simplify the user experience as below. Please let me know if it makes sense.

Updates

  • keybert.llm.LangChain.__init__ takes a LangChain LLM class rather than a chain, so users simply pass their LangChain LLM, e.g. ChatOpenAI
  • keybert.llm.LangChain.__init__ takes a prompt again.
  • A simple chain is constructed inside keybert.llm.LangChain.__init__ from the input llm and prompt, so users don't need any knowledge of LangChain (a rough sketch follows this list).
  • Updated the example in the docstring accordingly.
  • langchain >= 0.1 is required for this PR, since we now use .invoke instead of .run, and the simple chain uses the LangChain Expression Language (LCEL), which also requires langchain >= 0.1.
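As a rough illustration of the chain construction described above, the internals could look like the sketch below (built from the LCEL pipeline shown earlier in this thread; the class body, default prompt text, and comma-splitting parser are assumptions, not the merged code):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate


class LangChainSketch:
    """Hedged sketch of the described design, not the merged implementation."""

    # Illustrative only; the actual DEFAULT_PROMPT_TEMPLATE differs.
    DEFAULT_PROMPT_TEMPLATE = "Extract keywords from this document: {document}"

    def __init__(self, llm, prompt=None):
        prompt = prompt or self.DEFAULT_PROMPT_TEMPLATE
        # LCEL pipeline: prompt -> llm -> plain-string output
        template = ChatPromptTemplate([("human", prompt)])
        self.chain = template | llm | StrOutputParser()

    def extract_keywords(self, document: str) -> list[str]:
        output = self.chain.invoke({"document": document})
        return [kw.strip() for kw in output.split(",") if kw.strip()]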

The new implementation

from langchain_openai import ChatOpenAI

from keybert import KeyLLM
from keybert.llm import LangChain

_llm = ChatOpenAI(
    model="gpt-4o",
    api_key="my-openai-api-key",
    temperature=0,
)


# Create your LLM
llm = LangChain(_llm)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
docs = [
    "KeyBERT: A minimal method for keyword extraction with BERT. The keyword extraction is done by finding the sub-phrases in a document that are the most similar to the document itself. First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.",
    "KeyLLM: A minimal method for keyword extraction with Large Language Models (LLM). The keyword extraction is done by simply asking the LLM to extract a number of keywords from a single piece of text.",
]
keywords = kw_model.extract_keywords(docs=docs)
print("with no candidates")
print(keywords)

# Output:
# [
#     ['KeyBERT', 'keyword extraction', 'BERT', 'document embeddings', 'word embeddings', 'N-gram phrases', 'cosine similarity', 'document representation'],
#     ['KeyLLM', 'keyword extraction', 'Large Language Models', 'LLM', 'minimal method']
# ]


# fine tune with candidate keywords
candidates = [
    ["keyword extraction", "Large Language Models", "LLM", "BERT", "transformer", "embeddings"],
    ["keyword extraction", "Large Language Models", "LLM", "BERT", "transformer", "embeddings"],
]
keywords = kw_model.extract_keywords(docs=docs, candidate_keywords=candidates)
print("with candidates")
print(keywords)

# Output:
# [
#     ['keyword extraction', 'BERT', 'document embeddings', 'word embeddings', 'cosine similarity', 'N-gram phrases'],
#     ['KeyLLM', 'keyword extraction', 'Large Language Models', 'LLM']
# ]

@shengbo-ma (Contributor, Author)

This PR is ready for review @MaartenGr

@MaartenGr (Owner) left a comment


Left the tiniest of comments, but otherwise it looks great!

Comment on lines 12 to 16
"""NOTE
langchain >= 0.1 is required. Which supports:
- chain.invoke()
- LangChain Expression Language (LCEL) is used and it is not compatible with langchain < 0.1.
"""
@MaartenGr (Owner)

Might be best to also update this in the documentation here: https://github.com/MaartenGr/KeyBERT/blob/master/docs/guides/llms.md

@shengbo-ma (Contributor, Author)

Sure. Updated.

@shengbo-ma (Contributor, Author) commented Mar 20, 2025

@MaartenGr
Added a simple version check for langchain.

The packaging package is used for the version check. It is a popular package for Python versioning, and it is an existing (transitive) dependency of KeyBERT:

  • keybert -> sentence-transformers -> transformers -> packaging

That said, it would still be better to include it explicitly in pyproject.toml, in case sentence-transformers is no longer a dependency or drops packaging itself, although that is not very likely.
Please let me know if it makes sense to use packaging in this PR and whether to explicitly add packaging as a dependency in pyproject.toml.
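For context, a packaging-based guard of the kind described would look roughly like this (a sketch, not the exact code in the commit):

from importlib.metadata import version as installed_version

from packaging.version import Version

# Minimum-version guard using packaging (sketch only).
if Version(installed_version("langchain")) < Version("0.1"):
    raise ImportError("keybert.llm.LangChain requires langchain >= 0.1.")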

@MaartenGr (Owner)

Thanks for the initiative! There's an edge case for this, namely that it is possible to install KeyBERT without sentence-transformers by simply running pip install keybert --no-deps scikit-learn model2vec.

As such, it will throw an error when using this functionality. I would prefer not using this additional dependency if it is not absolutely needed for the core functionality of KeyBERT (which includes this light-weight installation option).

@shengbo-ma (Contributor, Author) commented Mar 22, 2025

I would prefer not using this additional dependency if it is not absolutely needed for the core functionality of KeyBERT (which includes this light-weight installation option).

@MaartenGr Yeah, makes sense.
I removed the version check logic that uses the packaging dependency. Instead, I catch the ModuleNotFoundError for langchain_core, which was introduced in langchain 0.1.0.
It is equivalent to checking version >= 0.1.
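In code, that check amounts to something like the following (langchain_core is the real package name; the error message wording is an assumption):

# langchain_core only exists for langchain >= 0.1.0, so a failed import
# doubles as a minimum-version check.
try:
    import langchain_core  # noqa: F401
except ModuleNotFoundError as e:
    raise ModuleNotFoundError(
        "keybert.llm.LangChain requires langchain >= 0.1. "
        "Please upgrade, e.g. `pip install --upgrade langchain`."
    ) from e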

@shengbo-ma requested a review from MaartenGr March 22, 2025 06:53
@MaartenGr (Owner)

Awesome, thank you for the work on this. Everything looks in order!

@MaartenGr merged commit 55a9d1e into MaartenGr:master Mar 25, 2025
6 checks passed