
[Feature]: Replace raw &str prompt in SmolVlm::inference with a structured prompt type #786

@Varun-sai-500

Description


### πŸš€ Feature Description

File: kornia/kornia-rs/crates/kornia-vlm/src/smolvlm/mod.rs

The current `SmolVlm::inference` API accepts a raw `&str` prompt:

```rust
pub fn inference(
    &mut self,
    prompt: &str, // TODO: make it structured
    image: Option<Image<u8, 3, A>>,
    sample_len: usize,
    alloc: A,
)
```

However, internally the function constructs a structured chat prompt containing roles and special tokens:

```text
<|im_start|>
User:<image>
<prompt>
<end_of_utterance>
Assistant:
```

This suggests the API conceptually operates on structured messages, while the public interface only exposes a plain string.

### πŸ“‚ Feature Category

Rust Core Library

### πŸ’‘ Motivation

Using a raw `&str` prompt has a few limitations:

- The prompt formatting logic is embedded directly in `inference`.
- The API does not reflect the structured format expected by the model.
- Images are passed separately instead of being associated with the message that contains them.
- The current design makes future extensions (e.g. multi-turn conversations or richer message types) harder to implement.

Introducing a structured prompt representation would make the API clearer, safer, and easier to extend.

### πŸ’­ Proposed Solution

Introduce a structured prompt type that represents the user input more explicitly.

For example:

```rust
pub struct Prompt<A: ImageAllocator> {
    pub text: String,
    pub image: Option<Image<u8, 3, A>>,
}
```

The inference API could then be updated to:

```rust
pub fn inference(
    &mut self,
    prompt: Prompt<A>,
    sample_len: usize,
    alloc: A,
) -> Result<String, SmolVlmError>
```

Internally, the existing prompt formatting logic can remain unchanged, but it would be derived from the structured prompt instead of a raw string.

This keeps the implementation simple while improving API clarity.
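As a rough illustration of that derivation, the standalone sketch below reproduces the chat template from a structured prompt. Note that the `Image` stub and the `to_chat_template` method name are hypothetical, not part of kornia's current API:

```rust
// Hypothetical, self-contained sketch: deriving the chat template from a
// structured prompt. `Image` is a stand-in for kornia's `Image<u8, 3, A>`,
// and `to_chat_template` is an assumed method name.
struct Image;

struct Prompt {
    text: String,
    image: Option<Image>,
}

impl Prompt {
    // Emits the `<image>` token only when the message actually carries an
    // image, mirroring the template shown in the feature description.
    fn to_chat_template(&self) -> String {
        let image_token = if self.image.is_some() { "<image>" } else { "" };
        format!(
            "<|im_start|>\nUser:{}\n{}\n<end_of_utterance>\nAssistant:",
            image_token, self.text
        )
    }
}

fn main() {
    let with_image = Prompt {
        text: "Describe the scene.".to_string(),
        image: Some(Image),
    };
    assert!(with_image.to_chat_template().contains("<image>"));

    let text_only = Prompt {
        text: "Hello".to_string(),
        image: None,
    };
    assert!(!text_only.to_chat_template().contains("<image>"));

    println!("{}", with_image.to_chat_template());
}
```

One benefit this makes concrete: because the formatter sees whether `image` is present, the `<image>` token can never be emitted for a text-only prompt, a mistake that is easy to make when callers build the template by hand.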

### πŸ“š Library Reference

This proposal follows the structured message design used by several modern LLM/VLM APIs, such as:

- Hugging Face Transformers chat templates
- OpenAI Chat Completions API
- Various multimodal chat interfaces where messages can optionally include images

These systems typically represent prompts as structured messages rather than raw strings.

### πŸ”„ Alternatives Considered

One alternative would be to keep the `&str` interface and rely on users to manually construct prompts with special tokens.

However, this approach keeps prompt formatting logic outside the library and increases the risk of incorrect formatting.

### 🎯 Use Cases

- Cleaner and more explicit API for multimodal inference
- Easier integration into chat-based pipelines
- Simplified support for multi-turn interactions in future updates
- Reduced risk of malformed prompt templates
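
To make the multi-turn use case concrete, a conversation could later be modeled as a list of role-tagged messages without changing the single-turn `Prompt` shape. All of the names below (`Role`, `Message`, `Conversation`) are hypothetical and do not exist in kornia today:

```rust
// Hypothetical sketch of a future multi-turn extension; none of these
// types exist in kornia today.
struct Image; // stand-in for kornia's Image<u8, 3, A>

enum Role {
    User,
    Assistant,
}

struct Message {
    role: Role,
    text: String,
    // An image is attached to the message that contains it, rather than
    // being passed as a separate argument to `inference`.
    image: Option<Image>,
}

struct Conversation {
    messages: Vec<Message>,
}

impl Conversation {
    fn push_user(&mut self, text: &str, image: Option<Image>) {
        self.messages.push(Message {
            role: Role::User,
            text: text.to_string(),
            image,
        });
    }
}

fn main() {
    let mut conv = Conversation { messages: Vec::new() };
    conv.push_user("What is in this image?", Some(Image));
    conv.push_user("And the background?", None);
    assert_eq!(conv.messages.len(), 2);
}
```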

### πŸ“ Additional Context

This change is primarily an API-level improvement and should require minimal changes to the existing inference logic.

### 🀝 Contribution Intent

- [x] I plan to submit a PR to implement this feature
- [ ] I'm requesting this feature but not planning to implement it

Metadata

Labels

enhancement (New feature or request), help wanted (Extra attention is needed), triage (wait for a maintainer to approve and assign this ticket)
