
[Feature]: Replace raw &str prompt in SmolVlm::inference with a structured prompt type #786

@Varun-sai-500

Description


### πŸš€ Feature Description

File: kornia/kornia-rs/crates/kornia-vlm/src/smolvlm/mod.rs

The current `SmolVlm::inference` API accepts a raw `&str` prompt:

```rust
pub fn inference(
    &mut self,
    prompt: &str, // TODO: make it structured
    image: Option<Image<u8, 3, A>>,
    sample_len: usize,
    alloc: A,
)
```

However, internally the function constructs a structured chat prompt containing roles and special tokens:

```text
<|im_start|>
User:<image>
<prompt>
<end_of_utterance>
Assistant:
```

This suggests the API conceptually operates on structured messages, while the public interface only exposes a plain string.

### πŸ“‚ Feature Category

Rust Core Library

### πŸ’‘ Motivation

Using a raw `&str` prompt has a few limitations:

- The prompt formatting logic is embedded directly in `inference`.
- The API does not reflect the structured format expected by the model.
- Images are passed separately instead of being associated with the message that contains them.
- The current design makes future extensions (e.g. multi-turn conversations or richer message types) harder to implement.

Introducing a structured prompt representation would make the API clearer, safer, and easier to extend.

### πŸ’­ Proposed Solution

Introduce a structured prompt type that represents the user input more explicitly.

For example:

```rust
pub struct Prompt<A: ImageAllocator> {
    pub text: String,
    pub image: Option<Image<u8, 3, A>>,
}
```

The inference API could then be updated to:

```rust
pub fn inference(
    &mut self,
    prompt: Prompt<A>,
    sample_len: usize,
    alloc: A,
) -> Result<String, SmolVlmError>
```

Internally, the existing prompt formatting logic can remain unchanged, but it would be derived from the structured prompt instead of a raw string.

This keeps the implementation simple while improving API clarity.
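As a rough illustration of that derivation, the standalone sketch below reproduces the chat template from a structured prompt. Note that the `Image` stub and the `to_chat_template` method name are hypothetical, not part of kornia's current API:

```rust
// Hypothetical, self-contained sketch: deriving the chat template from a
// structured prompt. `Image` is a stand-in for kornia's `Image<u8, 3, A>`,
// and `to_chat_template` is an assumed method name.
struct Image;

struct Prompt {
    text: String,
    image: Option<Image>,
}

impl Prompt {
    // Emits the `<image>` token only when the message actually carries an
    // image, mirroring the template shown in the feature description.
    fn to_chat_template(&self) -> String {
        let image_token = if self.image.is_some() { "<image>" } else { "" };
        format!(
            "<|im_start|>\nUser:{}\n{}\n<end_of_utterance>\nAssistant:",
            image_token, self.text
        )
    }
}

fn main() {
    let with_image = Prompt {
        text: "Describe the scene.".to_string(),
        image: Some(Image),
    };
    assert!(with_image.to_chat_template().contains("<image>"));

    let text_only = Prompt {
        text: "Hello".to_string(),
        image: None,
    };
    assert!(!text_only.to_chat_template().contains("<image>"));

    println!("{}", with_image.to_chat_template());
}
```

One benefit this makes concrete: because the formatter sees whether `image` is present, the `<image>` token can never be emitted for a text-only prompt, a mistake that is easy to make when callers build the template by hand.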

### πŸ“š Library Reference

This proposal follows the structured message design used by several modern LLM/VLM APIs, such as:

- Hugging Face Transformers chat templates
- OpenAI Chat Completions API
- Various multimodal chat interfaces where messages can optionally include images

These systems typically represent prompts as structured messages rather than raw strings.

### πŸ”„ Alternatives Considered

One alternative would be to keep the `&str` interface and rely on users to manually construct prompts with special tokens.

However, this approach keeps prompt formatting logic outside the library and increases the risk of incorrect formatting.

### 🎯 Use Cases

- Cleaner and more explicit API for multimodal inference
- Easier integration into chat-based pipelines
- Simplified support for multi-turn interactions in future updates
- Reduced risk of malformed prompt templates
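
To make the multi-turn use case concrete, a conversation could later be modeled as a list of role-tagged messages without changing the single-turn `Prompt` shape. All of the names below (`Role`, `Message`, `Conversation`) are hypothetical and do not exist in kornia today:

```rust
// Hypothetical sketch of a future multi-turn extension; none of these
// types exist in kornia today.
struct Image; // stand-in for kornia's Image<u8, 3, A>

enum Role {
    User,
    Assistant,
}

struct Message {
    role: Role,
    text: String,
    // An image is attached to the message that contains it, rather than
    // being passed as a separate argument to `inference`.
    image: Option<Image>,
}

struct Conversation {
    messages: Vec<Message>,
}

impl Conversation {
    fn push_user(&mut self, text: &str, image: Option<Image>) {
        self.messages.push(Message {
            role: Role::User,
            text: text.to_string(),
            image,
        });
    }
}

fn main() {
    let mut conv = Conversation { messages: Vec::new() };
    conv.push_user("What is in this image?", Some(Image));
    conv.push_user("And the background?", None);
    assert_eq!(conv.messages.len(), 2);
}
```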

### πŸ“ Additional Context

This change is primarily an API-level improvement and should require minimal changes to the existing inference logic.

### 🀝 Contribution Intent

- [x] I plan to submit a PR to implement this feature
- [ ] I'm requesting this feature but not planning to implement it

Metadata

Labels

enhancement (New feature or request), help wanted (Extra attention is needed), triage (wait for a maintainer to approve and assign this ticket)
