The UserInput struct can represent a series of messages with media attached to each message:

    return UserInput(
        chat: [
            .system(generate.system),
            .user(prompt, images: media.images, videos: media.videos),
        ],
        processing: media.processing
    )

This could include back-and-forth between the user and assistant, including adding additional media.

The UserInputProcessor converts this to an LMInput:

    public struct LMInput {
        public let text: Text
        public let image: ProcessedImage?
        public let video: ProcessedVideo?
        ...
    }

but that only allows for one set of image/video. It should probably have:

    public let images: [ProcessedImage]
    public let videos: [ProcessedVideo]

though the model would have to be updated to take advantage of that.

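To make the proposal concrete, here is a minimal sketch of what an array-based input could look like. Everything in it is an assumption for illustration: `MultiMediaLMInput` is a hypothetical name, the `ProcessedImage` and `ProcessedVideo` types are trivial stand-ins for the real ones, and `text` is simplified to a `String`. The computed `image` and `video` properties are one possible way to keep existing single-media callers working during a migration.

```swift
// Stand-ins for the real types, which hold processed pixel data.
struct ProcessedImage: Equatable { let id: Int }
struct ProcessedVideo: Equatable { let id: Int }

// Hypothetical array-based variant of LMInput (not the library's API).
struct MultiMediaLMInput {
    let text: String  // simplified; the real struct uses a Text type
    let images: [ProcessedImage]
    let videos: [ProcessedVideo]

    // Compatibility shims for code that still expects a single item.
    var image: ProcessedImage? { images.first }
    var video: ProcessedVideo? { videos.first }
}

let input = MultiMediaLMInput(
    text: "describe the second image",
    images: [ProcessedImage(id: 1), ProcessedImage(id: 2)],
    videos: []
)
```

With this shape, a model that has not been updated could keep reading `input.image`, while an updated model could iterate over `input.images`.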
Consider this chat:
> /image /tmp/img.jpeg
> what animal is in the image?
[["role": "system", "content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]]], ["role": "user", "content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**]]]
The animal in the image is a dog.
> /image /tmp/img2.jpeg
> describe the second image
[["content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]], "role": "system"], ["content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**], "role": "user"], ["content": [["type": "text", "text": "The animal in the image is a dog."]], "role": "assistant"], ["content": [["type": "text", "text": "describe the second image"], **["type": "image"]**], "role": "user"]]
The image shows a dog wearing a Santa hat.
Ideally, the second image would be presented at the second image marker. As it is today, both images are combined and injected at the first marker.
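The desired behavior can be sketched as an in-order pairing of image markers with images. This is only an illustration under stated assumptions: `ContentPart`, `ChatMessage`, `MarkerAssignment`, and `assignImages` are hypothetical names, not part of the library, and the messages mirror the chat above in simplified form.

```swift
// Hypothetical, simplified message model mirroring the chat above.
enum ContentPart: Equatable {
    case text(String)
    case image  // a marker only; the image data lives in a separate array
}

struct ChatMessage {
    let role: String
    let content: [ContentPart]
}

struct MarkerAssignment: Equatable {
    let messageIndex: Int
    let imageIndex: Int
}

// Pair the n-th image marker with the n-th image, in message order.
// Returns nil if the marker count and image count do not match.
func assignImages(to messages: [ChatMessage], imageCount: Int) -> [MarkerAssignment]? {
    var assignments: [MarkerAssignment] = []
    var next = 0
    for (messageIndex, message) in messages.enumerated() {
        for part in message.content where part == .image {
            guard next < imageCount else { return nil }
            assignments.append(MarkerAssignment(messageIndex: messageIndex, imageIndex: next))
            next += 1
        }
    }
    return next == imageCount ? assignments : nil
}

let messages = [
    ChatMessage(role: "system", content: [.text("You are a helpful assistant who answers questions in English.")]),
    ChatMessage(role: "user", content: [.text("what animal is in the image?"), .image]),
    ChatMessage(role: "assistant", content: [.text("The animal in the image is a dog.")]),
    ChatMessage(role: "user", content: [.text("describe the second image"), .image]),
]

let assignments = assignImages(to: messages, imageCount: 2)
```

Here the first image is paired with the first user message and the second image with the second user message, rather than both landing on the first marker.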