
New design for Multimodal Messages #73

@HavenDV

Description


https://python.langchain.com/docs/how_to/multimodal_inputs/

There are currently several leading approaches to representing a sequence of multimodal messages:

AnthropicMessage
  Content = IList<ContentBlock = OneOf<Text, Image, ToolUse, ToolResult>>
  Other non-content properties

OllamaMessage
  Content = string
  Images = IList<string>
  ToolCalls = IList<ToolCall>

OpenAiMessage  // each role has a different content shape
  System
    Content = IList<ContentPart = OneOf<Text>>
  User
    Content = IList<ContentPart = OneOf<Text, Image>>
  Assistant
    Content = IList<ContentPart = OneOf<Text, Refusal>>
    ToolCalls
  Tool
    Content = IList<ContentPart = OneOf<Text>>

GoogleMessage
  Content = IList<ContentPart = OneOf<Text, Blob = (byte[], string MimeType)>>

I like the simplicity of Anthropic's approach, but I would rename ContentBlock to ContentPart.

So far, in LangChain I see it either as a single message type with parts:

Message
  Content = IList<ContentPart = OneOf<Text, Image, ToolUse, ToolResult, Blob, Video>>
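
A minimal C# sketch of that shape, using the OneOf discriminated-union library already referenced above (all part type names and properties here are assumptions, not a final API):

```csharp
using System.Collections.Generic;
using OneOf;

public enum MessageRole { System, User, Assistant, Tool }

// Hypothetical part types; names and properties are illustrative only.
public record Text(string Value);
public record Image(byte[] Data, string MimeType);
public record ToolUse(string Id, string Name, string ArgumentsJson);
public record ToolResult(string Id, string Content);
public record Blob(byte[] Data, string MimeType);
public record Video(byte[] Data, string MimeType);

// A single Message type for all roles; Content is a list of typed parts.
public record Message(
    MessageRole Role,
    IList<OneOf<Text, Image, ToolUse, ToolResult, Blob, Video>> Content);
```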

or as separate message types that don't allow parts inside (see the sketch after this list):

TextMessage
ImageMessage
ToolUseMessage
ToolResultMessage
BlobMessage // allows specifying a MimeType
VideoMessage
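
A hypothetical sketch of this alternative as a small class hierarchy; the base is named ChatMessage here only to avoid clashing with the parts-based Message sketch above, and all properties are assumptions:

```csharp
// Alternative to the parts-based design: one message type per content kind.
public abstract record ChatMessage(MessageRole Role);

public record TextMessage(MessageRole Role, string Text) : ChatMessage(Role);
public record ImageMessage(MessageRole Role, byte[] Data, string MimeType) : ChatMessage(Role);
public record ToolUseMessage(MessageRole Role, string Id, string Name, string ArgumentsJson) : ChatMessage(Role);
public record ToolResultMessage(MessageRole Role, string Id, string Content) : ChatMessage(Role);
// Blob carries an explicit MimeType so arbitrary payloads can be described.
public record BlobMessage(MessageRole Role, byte[] Data, string MimeType) : ChatMessage(Role);
public record VideoMessage(MessageRole Role, byte[] Data, string MimeType) : ChatMessage(Role);
```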

When the user sends multiple parts, we simply use two (or more) messages in a row.
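
For example, a user turn combining text and an image would become two consecutive messages (hypothetical usage of the sketch above):

```csharp
using System.Collections.Generic;
using System.IO;

var imageBytes = File.ReadAllBytes("picture.png");

var history = new List<ChatMessage>
{
    new TextMessage(MessageRole.User, "What is shown in this picture?"),
    new ImageMessage(MessageRole.User, imageBytes, "image/png"),
};
```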

OpenAI has also changed the message structure in its Realtime API; I will update this proposal to take those changes into account.
