Open
Labels: enhancement (new feature or request), help wanted (extra attention is needed), triage (wait for a maintainer to approve and assign this ticket)
Description
### Feature Description
File: kornia\kornia-rs\crates\kornia-vlm\src\smolvlm\mod.rs
The current SmolVlm::inference API accepts a raw &str prompt:
```rust
pub fn inference(
    &mut self,
    prompt: &str, // TODO: make it structured
    image: Option<Image<u8, 3, A>>,
    sample_len: usize,
    alloc: A,
)
```
However, internally the function constructs a structured chat prompt containing roles and special tokens:
```text
<|im_start|>
User:<image>
<prompt>
<end_of_utterance>
Assistant:
```
This suggests the API conceptually operates on structured messages, while the public interface only exposes a plain string.
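
To make the embedded formatting step concrete, here is a minimal sketch of how that template could be assembled from the user text. The function name `format_chat_prompt` and the `has_image` flag are hypothetical, and the exact whitespace between tokens is an assumption; only the token order comes from the template quoted above.

```rust
/// Hypothetical helper (not part of kornia-vlm): builds the chat prompt
/// in the token order shown in the issue. The exact whitespace between
/// tokens is an assumption made for illustration.
pub fn format_chat_prompt(text: &str, has_image: bool) -> String {
    let mut out = String::from("<|im_start|>User:");
    // The image placeholder is only emitted when an image accompanies the text.
    if has_image {
        out.push_str("<image>");
    }
    out.push_str(text);
    // Close the user turn and open the assistant turn for generation.
    out.push_str("<end_of_utterance>\nAssistant:");
    out
}
```

Callers of the current API never see this string, which is why a structured input type would describe the real contract more faithfully than a bare `&str`.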
### Feature Category
Rust Core Library
### Motivation
Using a raw `&str` prompt has a few limitations:
- The prompt formatting logic is embedded directly in `inference`.
- The API does not reflect the structured format expected by the model.
- Images are passed separately instead of being associated with the message that contains them.
- The current design makes future extensions (e.g. multi-turn conversations or richer message types) harder to implement.
Introducing a structured prompt representation would make the API clearer, safer, and easier to extend.
### Proposed Solution
Introduce a structured prompt type that represents the user input more explicitly.
For example:
```rust
pub struct Prompt<A: ImageAllocator> {
    pub text: String,
    pub image: Option<Image<u8, 3, A>>,
}
```

The inference API could then be updated to:

```rust
pub fn inference(
    &mut self,
    prompt: Prompt<A>,
    sample_len: usize,
    alloc: A,
) -> Result<String, SmolVlmError>
```
Internally, the existing prompt formatting logic can remain unchanged, but it would be derived from the structured prompt instead of a raw string.
This keeps the implementation simple while improving API clarity.
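
As a rough illustration of how such a type might be used, the sketch below adds two constructors and a `From<&str>` conversion so that existing text-only callers could migrate with `.into()`. This is an assumption-laden sketch, not the proposed implementation: `DummyImage` is a stand-in for kornia's `Image<u8, 3, A>`, the generic parameter `I` replaces the `A: ImageAllocator` bound so the example compiles on its own, and the helper names `text_only`, `with_image`, and `has_image` are hypothetical.

```rust
/// Stand-in for kornia's `Image<u8, 3, A>`; NOT the real type.
#[derive(Debug, Clone, PartialEq)]
pub struct DummyImage {
    pub width: usize,
    pub height: usize,
    pub data: Vec<u8>,
}

/// Sketch of the structured prompt. `I` stands in for the image type
/// so the example does not depend on kornia.
#[derive(Debug, Clone, PartialEq)]
pub struct Prompt<I> {
    pub text: String,
    pub image: Option<I>,
}

impl<I> Prompt<I> {
    /// Hypothetical constructor for a prompt without an image.
    pub fn text_only(text: impl Into<String>) -> Self {
        Self { text: text.into(), image: None }
    }

    /// Hypothetical constructor that ties the image to its message.
    pub fn with_image(text: impl Into<String>, image: I) -> Self {
        Self { text: text.into(), image: Some(image) }
    }

    /// Tells the formatter whether to emit the image placeholder token.
    pub fn has_image(&self) -> bool {
        self.image.is_some()
    }
}

/// Lets existing `&str` call sites migrate with `.into()`.
impl<I> From<&str> for Prompt<I> {
    fn from(text: &str) -> Self {
        Self::text_only(text)
    }
}
```

Keeping the image inside the prompt means the formatter can decide whether to emit the `<image>` token from one value instead of two loosely related arguments.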
### Library Reference
This proposal follows the structured message design used by several modern LLM/VLM APIs, such as:
- Hugging Face Transformers chat templates
- OpenAI Chat Completions API
- Various multimodal chat interfaces where messages can optionally include images
These systems typically represent prompts as structured messages rather than raw strings.
### Alternatives Considered
One alternative would be to keep the `&str` interface and rely on users to manually construct prompts with special tokens.
However, this approach keeps prompt formatting logic outside the library and increases the risk of incorrect formatting.
### Use Cases
- Cleaner and more explicit API for multimodal inference
- Easier integration into chat-based pipelines
- Simplified support for multi-turn interactions in future updates
- Reduced risk of malformed prompt templates
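
To show why a structured representation would ease the multi-turn case, here is a speculative sketch that renders a list of role-tagged messages. Everything in it is hypothetical: the `Role` and `Message` types, the `render_conversation` function, and in particular the way earlier turns are joined, which is an assumption extrapolated from the single-turn template and not confirmed by the model's actual chat template.

```rust
/// Hypothetical role tag for a chat message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    User,
    Assistant,
}

/// Hypothetical message; `has_image` marks where an image placeholder goes.
#[derive(Debug, Clone)]
pub struct Message {
    pub role: Role,
    pub text: String,
    pub has_image: bool,
}

/// Renders all turns, then opens a new assistant turn for generation.
/// The multi-turn layout is an assumption, not the confirmed template.
pub fn render_conversation(messages: &[Message]) -> String {
    let mut out = String::from("<|im_start|>");
    for m in messages {
        out.push_str(match m.role {
            Role::User => "User:",
            Role::Assistant => "Assistant:",
        });
        if m.has_image {
            out.push_str("<image>");
        }
        out.push_str(&m.text);
        out.push_str("<end_of_utterance>\n");
    }
    out.push_str("Assistant:");
    out
}
```

With a single user message this reduces to the single-turn layout assumed for the current template, so a multi-turn API could be introduced later without changing how one-shot prompts are rendered.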
### Additional Context
This change is primarily an API-level improvement and should require minimal changes to the existing inference logic.
### Contribution Intent
- [x] I plan to submit a PR to implement this feature
- [ ] I'm requesting this feature but not planning to implement it