Inference takes forever when using pipeline=vlm #399

@lfoppiano

Description

Hi all,
I'm trying to deploy docling-serve 1.6.0 with the docling-granite VLM on an A10 GPU on modal.com. The standard pipeline works fine (about 30 s per document; is that runtime reasonable?). When I run with pipeline=vlm, the request times out. I also tried an L40S GPU, but the GPU does not seem to be the issue: a model with fewer than 1B parameters should fit on either of them.
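For reference, this is the quick check I run inside the container to confirm PyTorch actually sees the GPU (a minimal sketch; if it prints False, the VLM pipeline would fall back to CPU, which could explain the timeout):

```python
import torch

# Sanity check inside the serving container: if CUDA is not visible,
# VLM inference runs on CPU and becomes far slower than on an A10/L40S.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```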

What GPU would you recommend using?

Any suggestions for deploying with pipeline=vlm?
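For context, this is roughly how I call the service (a minimal sketch: the endpoint path, option names, and Modal URL below are assumptions on my side and should be checked against the docling-serve 1.6.0 API docs):

```python
import requests

# Hypothetical Modal deployment URL; replace with the actual one.
DOCLING_SERVE_URL = "https://<your-modal-deployment>.modal.run"

payload = {
    "options": {
        "pipeline": "vlm",      # switch from the standard pipeline to the VLM pipeline
        "to_formats": ["md"],   # assumed option name for the output format
    },
    "http_sources": [
        {"url": "https://arxiv.org/pdf/2408.09869"}  # any publicly reachable PDF
    ],
}

# Generous timeout: a single document can take minutes if inference
# falls back to CPU instead of using the GPU.
resp = requests.post(
    f"{DOCLING_SERVE_URL}/v1alpha/convert/source",  # assumed endpoint path
    json=payload,
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```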
