Example Code

```python
import os
import base64
from mimetypes import guess_type
from langchain_openai import AzureChatOpenAI
from langchain.prompts import (
    ChatPromptTemplate,
    MessagesPlaceholder,
)
from langchain.schema import HumanMessage
from langchain.agents import AgentExecutor
from langchain.memory import ConversationBufferMemory
from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.tools import tool
from langchain_community.chat_message_histories import ChatMessageHistory

os.environ["no_proxy"] = "localhost,127.0.0.1,::1"


def local_image_to_data_url(image_path):
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"  # Default MIME type if none is found
    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode("utf-8")
    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"


def img_path2url(path):
    img_encoded = local_image_to_data_url(path)
    img_url_dict = {"type": "image_url", "image_url": {"url": f"{img_encoded}"}}
    return img_url_dict


llm = AzureChatOpenAI(
    azure_deployment="gpt-4o", api_version="2024-02-15-preview", temperature=0
)

prompt = ChatPromptTemplate(
    messages=[
        (
            "system",
            "you are a general assistant",
        ),
        MessagesPlaceholder(variable_name="chat_history", optional=True),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)


@tool
def say_hi(a):
    """Says hi"""
    return "hi"


memory = ChatMessageHistory(session_id="test-session")
tools = [say_hi]
agent = create_openai_functions_agent(llm, tools, prompt)
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, memory=memory)

result1 = agent_executor.invoke(input={"input": "hi how are you ? "})  # works great

img_path = "./out0001.png"
img_msg = img_path2url(img_path)
message = HumanMessage(
    content=[
        {"type": "text", "text": "what is in the image ? "},
        img_msg,
    ],
)

llm.invoke([message])  # works
agent_executor.invoke(input={"input": message})  # crashes
pass
```
Description

I am trying to have an agent interact with an image and execute tools based on the content of that image, but I have not been able to achieve this. My goal is for the user to provide both an input image and text, and for the agent to execute actions based on the image's content. What is the proper way to communicate with an agent using images? I searched online, but most of the working examples I found invoke the LLM directly rather than going through an agent. When I execute the code above, I encounter the following error: `Invalid 'messages[3].content': string too long`.
-
Hello @sbugdary! I'm here to assist you with any issues you're experiencing. I found some similar discussions and issues that might help with the problem you're facing.
To address the error "Invalid 'messages[3].content': string too long", you need to ensure that the content of your messages is within the acceptable length limits. Here is an example of how to format messages with images and handle the error:

```python
from typing import List, Dict, Any
import base64
import httpx
# Example message with image URL
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "describe the weather in this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
        ],
    }
]
def _convert_messages_to_ollama_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    ollama_messages = []
    for message in messages:
        role = message.get("role")
        content = message.get("content", "")
        if isinstance(content, list):
            content_str = str(content)
            if len(content_str) > 1000:  # Example length check
                raise ValueError("Message content is too long")
        msg = {
            "role": role,
            "content": content,
        }
        ollama_messages.append(msg)
    return ollama_messages
# Convert messages
ollama_messages = _convert_messages_to_ollama_messages(messages)
# Example function to send messages (simplified)
def send_messages(messages: List[Dict[str, Any]]):
    for message in messages:
        print(f"Sending message: {message}")
# Send the converted messages
send_messages(ollama_messages)
```

In this example, the `_convert_messages_to_ollama_messages` function checks the length of each message's content and raises a `ValueError` if it is too long. Additionally, LangChain supports multimodal inputs, which can be useful for applications involving images. For example, you can resize a base64-encoded image and pass it to a model directly. This allows the model to reason over the two-dimensional content on the page, instead of a "one-dimensional" string representation. Here is a utility function to resize a base64-encoded image:

```python
import base64
import io
from PIL import Image
def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string.

    :param base64_string: A Base64 encoded string of the image to be resized.
    :param size: A tuple representing the new size (width, height) for the image.
    :return: A Base64 encoded string of the resized image.
    """
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))
    resized_img = img.resize(size, Image.LANCZOS)
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)
    return base64.b64encode(buffered.getvalue()).decode("utf-8")
# `Document` is needed for the isinstance check below
from langchain_core.documents import Document


def get_resized_images(docs):
    """
    Resize images from base64-encoded strings.

    :param docs: A list of base64-encoded images (or Documents) to be resized.
    :return: Dict containing a list of resized base64-encoded strings.
    """
    b64_images = []
    for doc in docs:
        if isinstance(doc, Document):
            doc = doc.page_content
        resized_image = resize_base64_image(doc, size=(1280, 720))
        b64_images.append(resized_image)
    return {"images": b64_images}
```

You can then use this function to resize a base64-encoded image and include it in your messages [6][7][8]. For more detailed information on using LangChain with multimodal inputs, you can refer to the LangChain documentation, which provides comprehensive guidance on setting up and using models, including examples of passing images as URLs or base64-encoded strings.
-
I solved this issue using langgraph.
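The comment above doesn't include code; the following is only a sketch of one way a LangGraph-based setup could look, assuming the `llm`, `tools`, and `img_path2url` objects from the question and a model deployment that accepts image input:

```python
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent

# Prebuilt ReAct-style agent: it operates on a list of messages, so a multimodal
# HumanMessage (text part + image_url part) can be passed through unchanged.
graph = create_react_agent(llm, tools)

result = graph.invoke(
    {
        "messages": [
            HumanMessage(
                content=[
                    {"type": "text", "text": "what is in the image?"},
                    img_path2url("./out0001.png"),  # helper from the question's snippet
                ]
            )
        ]
    }
)
print(result["messages"][-1].content)
```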