LMInput restricts model input to a single collection of images and video frames #282

See #277 and #276 

The `UserInput` struct can represent a series of messages with media attached to each message:

```swift
        return UserInput(
            chat: [
                .system(generate.system),
                .user(prompt, images: media.images, videos: media.videos),
            ],
            processing: media.processing
        )
```

This could include a back-and-forth between the user and the assistant, with additional media attached in later turns.

The `UserInputProcessor` converts this to an `LMInput`:

```swift
public struct LMInput {
    public let text: Text
    public let image: ProcessedImage?
    public let video: ProcessedVideo?
    // ... remaining members omitted
}
```

but that only allows for a single set of images and a single set of videos. It should probably have:

```swift
    public let images: [ProcessedImage]
    public let videos: [ProcessedVideo]
```

though the model would have to be updated to take advantage of that.
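To make the proposed shape concrete, here is a minimal sketch. It assumes nothing about the real implementation: `MultiMediaInput` is a hypothetical name, and `ProcessedImage` / `ProcessedVideo` are simplified stand-ins for the real types so the example runs on its own.

```swift
// Stand-ins for the real processed-media types; only the shape matters here.
struct ProcessedImage: Equatable { var name: String }
struct ProcessedVideo: Equatable { var name: String }

// Hypothetical variant of LMInput that keeps media as ordered lists,
// one entry per attachment, instead of a single optional slot.
struct MultiMediaInput {
    var tokens: [Int]
    var images: [ProcessedImage] = []
    var videos: [ProcessedVideo] = []
}

// Two images from two different turns stay separate and ordered.
let input = MultiMediaInput(
    tokens: [1, 2, 3],
    images: [ProcessedImage(name: "img.jpeg"), ProcessedImage(name: "img2.jpeg")]
)
print(input.images.map(\.name))
```

Keeping the attachments in order is what would let a model line up each one with its own marker in the prompt.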

Consider this chat:

```
> /image /tmp/img.jpeg


> what animal is in the image?
[["role": "system", "content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]]], ["role": "user", "content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**]]]
The animal in the image is a dog.

> /image /tmp/img2.jpeg


> describe the second image
[["content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]], "role": "system"], ["content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**], "role": "user"], ["content": [["type": "text", "text": "The animal in the image is a dog."]], "role": "assistant"], ["content": [["type": "text", "text": "describe the second image"], **["type": "image"]**], "role": "user"]]
The image shows a dog wearing a Santa hat.

```

Ideally the second image would be presented at the second image marker. As it is today, both images are combined and injected at the first marker.
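The desired behavior could be sketched as pairing image markers with images in order of appearance. This is an illustration only: `Part`, `Message`, and `assignImages` are hypothetical names, not part of the library, and a real implementation would work on tokens and processed pixel data rather than indices.

```swift
// Simplified message content: text, or an image placeholder
// like ["type": "image"] in the chat above.
enum Part: Equatable {
    case text(String)
    case image
}

struct Message {
    var role: String
    var content: [Part]
}

/// Pairs each image marker, in order of appearance, with the image at the
/// same position in the attachment list. Returns nil if the counts differ.
func assignImages(to messages: [Message], imageCount: Int) -> [(message: Int, image: Int)]? {
    var result: [(message: Int, image: Int)] = []
    var next = 0
    for (index, message) in messages.enumerated() {
        for part in message.content where part == .image {
            guard next < imageCount else { return nil }  // more markers than images
            result.append((message: index, image: next))
            next += 1
        }
    }
    return next == imageCount ? result : nil  // nil if images are left over
}

// The chat from this issue: two user turns, each with one image marker.
let chat = [
    Message(role: "system", content: [.text("You are a helpful assistant who answers questions in English.")]),
    Message(role: "user", content: [.text("what animal is in the image?"), .image]),
    Message(role: "assistant", content: [.text("The animal in the image is a dog.")]),
    Message(role: "user", content: [.text("describe the second image"), .image]),
]

if let pairs = assignImages(to: chat, imageCount: 2) {
    for pair in pairs {
        print("message \(pair.message) uses image \(pair.image)")
    }
}
```

With this pairing, the marker in the second user message maps to the second image instead of both images landing on the first marker.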
