Hey @schopra6! 👋 I'm here to help you with bugs, questions, and becoming a contributor. Let's tackle this together!

To customize LLM decoding with `ChatOpenAI` in LangChain, you can create a custom LLM class by subclassing `LLM` and implementing the `_call` method:

```python
import json
import logging
from typing import Any, Dict, List, Optional, Set

import requests
from langchain_community.llms.utils import enforce_stop_tokens
from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM

# langchain-core 0.2.x models are pydantic v1; use the compatibility shim.
from langchain_core.pydantic_v1 import Field

logger = logging.getLogger(__name__)


class CustomLLM(LLM):
    """A custom LLM that allows customization of the inference method
    and token decoding parameters."""

    infer_api: str = "http://127.0.0.1:8000/custom"
    max_tokens: int = 1024
    temp: Optional[float] = 0.7
    top_p: Optional[float] = 0.9
    top_k: Optional[int] = 0
    do_sample: bool = False
    echo: Optional[bool] = False
    stop: Optional[List[str]] = Field(default_factory=list)
    repeat_last_n: Optional[int] = 64
    repeat_penalty: Optional[float] = 1.18
    streaming: bool = False
    history: List[str] = Field(default_factory=list)
    use_history: bool = False

    def __init__(self, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        # top_p (nucleus) and top_k sampling are mutually exclusive here.
        if (self.top_p or 0) > 0 and (self.top_k or 0) > 0:
            logger.warning(
                "top_p and top_k cannot be set simultaneously; "
                "setting top_k to 0 instead..."
            )
            self.top_k = 0

    @property
    def _llm_type(self) -> str:
        return "CustomLLM"

    @staticmethod
    def _model_param_names() -> Set[str]:
        return {
            "max_tokens",
            "temp",
            "top_k",
            "top_p",
            "do_sample",
        }

    def _default_params(self) -> Dict[str, Any]:
        return {
            "do_sample": self.do_sample,
            "infer_api": self.infer_api,
            "max_tokens": self.max_tokens,
            "repeat_penalty": self.repeat_penalty,
            "temp": self.temp,
            "top_k": self.top_k,
            "top_p": self.top_p,
            "use_history": self.use_history,
        }

    @property
    def _identifying_params(self) -> Dict[str, Any]:
        return {
            "model": self._llm_type,
            **self._default_params(),
            **{
                k: v
                for k, v in self.__dict__.items()
                if k in self._model_param_names()
            },
        }

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        # Optionally prepend prior turns, joined by the model's separator token.
        if self.use_history:
            self.history.append(prompt)
            input_text = "<n>".join(self.history)
        else:
            input_text = prompt

        headers = {"Content-Type": "application/json"}
        data = json.dumps(
            {
                "ques_list": [{"id": "000", "ques": input_text}],
                "tokens_to_generate": self.max_tokens,
                "temperature": self.temp,
                "top_p": self.top_p,
                "top_k": self.top_k,
                "do_sample": self.do_sample,
            }
        )
        logger.debug("CustomLLM prompt: %s", input_text)

        try:
            response = requests.put(self.infer_api, headers=headers, data=data)
        except requests.exceptions.RequestException as e:
            raise ValueError(f"Error raised by inference api: {e}")

        logger.debug(f"CustomLLM response: {response}")

        if response.status_code != 200:
            raise ValueError(f"Failed with response: {response}")

        try:
            resp = response.json()
            if resp["errCode"] != "0":
                raise ValueError(
                    f"Failed with error code [{resp['errCode']}], "
                    f"error message: [{resp['exceptionMsg']}]"
                )
            if "resData" in resp:
                if len(resp["resData"]["output"]) > 0:
                    generate_text = resp["resData"]["output"][0]["ans"]
                else:
                    raise ValueError("No output found in response.")
            else:
                raise ValueError("No resData found in response.")
        except requests.exceptions.JSONDecodeError as e:
            raise ValueError(
                f"Error raised during decoding response from inference api: {e}."
                f"\nResponse: {response.text}"
            )

        if stop is not None:
            generate_text = enforce_stop_tokens(generate_text, stop)

        if self.use_history:
            self.history.append(generate_text)
            logger.debug(f"history: {self.history}")

        return generate_text


# Example usage
llm = CustomLLM(
    infer_api="http://127.0.0.1:8000/custom",
    max_tokens=50,
    temp=0.8,
    top_p=0.95,
)
result = llm.invoke("This is a foobar thing")
print(result)
```

In this example, `CustomLLM` sends the prompt to a custom inference endpoint and exposes the decoding parameters (`max_tokens`, `temp`, `top_p`, `top_k`, `do_sample`) as fields you can set at construction time.
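If you want to use the custom model inside a chain, here's a minimal sketch of wiring it up with the LCEL pipe syntax; the template text and question are illustrative only:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Compose prompt -> model -> string parser with the LCEL pipe syntax.
prompt = PromptTemplate.from_template("Answer briefly: {question}")
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"question": "What does top_p control?"}))
```

Because `CustomLLM` subclasses `LLM`, it picks up the standard Runnable methods (`invoke`, `batch`, `stream`) automatically.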
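Also worth noting: if you are calling the OpenAI API through `ChatOpenAI` itself, the common decoding knobs are already plain constructor arguments, so for simple cases you may not need a custom class at all. A minimal sketch (the model name is illustrative, and `OPENAI_API_KEY` must be set):

```python
from langchain_openai import ChatOpenAI

# Standard decoding parameters are exposed directly on the constructor.
chat = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.2,
    top_p=0.9,
    max_tokens=256,
)
print(chat.invoke("Say hello in one word.").content)
```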
System Info
langchain==0.2.14
langchain-cli==0.0.30
langchain-community==0.2.12
langchain-core==0.2.35
langchain-experimental==0.0.64
langchain-openai==0.1.22
langchain-text-splitters==0.2.2