GPT4All LangChain Community model trims model output #27021
Hey @Luca-Pozzi! I'm here to help you with any bugs, questions, or contributions you have. A similar issue was discussed in this thread, where the response length was limited. The solution involves adjusting the `max_tokens` parameter when constructing the model. To get longer responses, ensure you set it to a sufficiently large value:

```python
from langchain_community.llms import GPT4All

model = GPT4All(
    model=model_path,
    device='gpu',
    max_tokens=4096,  # Increase this value to allow for longer outputs
)
```

Additionally, check other generation parameters that may cap the output length.
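As a quick check, a minimal usage sketch with the wrapper configured this way might look like the following; the model path and prompt are placeholders, not values from the original discussion:

```python
from langchain_community.llms import GPT4All

# Placeholder path to a locally downloaded GGUF model file.
model_path = "./Meta-Llama-3-8B-Instruct.Q4_0.gguf"

llm = GPT4All(
    model=model_path,
    max_tokens=4096,  # raise this to allow longer completions
)

# LangChain LLMs expose `invoke` for a single prompt/completion call.
response = llm.invoke("Write a detailed, multi-paragraph overview of the Apollo program.")
print(len(response.split()), "words")
```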
Example Code
Description
I am trying to use Llama3 via GPT4All. When I ask the model for a long answer directly via the Python GPT4All SDK (i.e. LANGCHAIN = False in the code), everything works as expected. In the attached file output_SDK.txt you can see a sample response with >700 words.

If I do (what I expect to be) the same thing by calling GPT4All from LangChain, my output is limited to ~200 words. By this, I mean that the response is not just shorter, but interrupted abruptly at a certain length. You can see a sample in the attached file output_LangChain.txt.

Is this behavior expected? How can I modify it to get long answers from Llama3 via LangChain?
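For reference, a minimal sketch of the comparison described above might look like the following. This is an illustrative reconstruction, not the original example code: the LANGCHAIN toggle is taken from the description, while MODEL_FILE, PROMPT, and the specific calls are placeholder assumptions.

```python
from gpt4all import GPT4All as GPT4AllSDK
from langchain_community.llms import GPT4All as GPT4AllLangChain

LANGCHAIN = False  # toggle between the plain SDK and the LangChain wrapper

# Placeholder model file; adjust to whatever GGUF model is available locally.
MODEL_FILE = "Meta-Llama-3-8B-Instruct.Q4_0.gguf"
PROMPT = "Write a detailed, multi-paragraph explanation of how transformers work."

if LANGCHAIN:
    # LangChain Community wrapper: `model` is the path to the model file.
    llm = GPT4AllLangChain(model=MODEL_FILE)
    answer = llm.invoke(PROMPT)
else:
    # Plain GPT4All SDK: pass the model file name directly.
    model = GPT4AllSDK(MODEL_FILE)
    answer = model.generate(PROMPT, max_tokens=4096)

print(len(answer.split()), "words")
print(answer)
```

If the LangChain path stops at roughly 200 words while the SDK path does not, the wrapper's max_tokens setting (see the answer above) is the most likely knob to adjust.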
I tried to inspect the LangChain LLM and BaseLLM classes, to no avail.

System Info
pip freeze | FINDSTR langchain