diff --git a/docs/adr/adr-model-and-stages.md b/docs/adr/adr-model-and-stages.md
new file mode 100644
index 000000000..57daf1e82
--- /dev/null
+++ b/docs/adr/adr-model-and-stages.md
@@ -0,0 +1,254 @@
# Stages and model runtimes

The current architecture mixes the model runtime (framework, inline vs. remote, etc.) with the stage definition.
The actual choice is made via the `kind` field in the respective `..Options` object.

This is leading to duplicated runtime logic: for example, we have two implementations for running vision models from Hugging Face Transformers, two implementations for running models via API inference servers, etc.

Requirements for the proposed changes:
1. Make more code reusable.
2. Provide easy presets.
3. Allow running custom (model) choices without code changes, e.g. a different model that can do the same task.
4. Make presets discoverable as plugins (which can also be third-party contributions).
5. Plugins should allow easy choices in clients (CLI, APIs, etc.).


TODO:
- [ ] table processing example


## Proposal

### Generic model runtimes

Model runtimes are kept as generic as possible (although some duplication will very likely remain).

1. They operate only on basic objects such as PIL images and expose only an API for batch predictions.
2. The prompt is left out of the model runtime, so that runtimes can be reused.
3. Model runtimes are preferably not bound to a specific model (but they can be, if the model is very specific).
4. Model runtimes may still do some internal pre-/post-processing, but it should be limited to model internals, e.g. normalizing images to RGB.

Open questions:
a. Should `__init__` load the models, or do we prefer lazy loading?

```py
class BaseModelOptions(BaseModel):
    kind: str

#####

class VisionOpenAILikeApi:
    def __init__(self, options):
        ...

    def predict_batch(self, images: Iterable[PILImage], prompt: str) -> Iterable[...]:
        ...

    @classmethod
    def get_options_type(cls) -> Type[BaseModelOptions]:
        return VisionOpenAILikeApiOptions

#####

class VisionHfTransformersOptions(BaseVlmOptions):
    kind: Literal["vision_hf_transformers"] = "vision_hf_transformers"

    repo_id: str
    trust_remote_code: bool = False
    load_in_8bit: bool = True
    llm_int8_threshold: float = 6.0
    quantized: bool = False

    transformers_model_type: TransformersModelType = TransformersModelType.AUTOMODEL
    transformers_prompt_style: TransformersPromptStyle = TransformersPromptStyle.CHAT

    torch_dtype: Optional[str] = None
    supported_devices: List[AcceleratorDevice] = [
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ]

    use_kv_cache: bool = True
    max_new_tokens: int = 4096


class VisionHfTransformers:
    def predict_batch(self, images: Iterable[PILImage], prompt: str) -> Iterable[...]:
        ...

#####

class VisionMlx:
    def predict_batch(self, images: Iterable[PILImage], prompt: str) -> Iterable[...]:
        ...

#####

class VisionVllm:
    def predict_batch(self, images: Iterable[PILImage], prompt: str) -> Iterable[...]:
        ...
```

### Model options and instances

```py
class BaseModelOptions(BaseModel):
    kind: str  # needed for the options-type to model factory
    name: str  # needed for the name (e.g. from the CLI, etc.) to options-instance to model factory

class VisionOpenAILikeApiOptions(BaseModelOptions):
    kind: Literal["vision_openailike_api"] = "vision_openailike_api"
    name: str


# Instances

QWEN_VL_OLLAMA = VisionOpenAILikeApiOptions(
    name="qwen_vl_ollama",
    api_url="...",
    model_name="qwen_vl..",
)

SMOLDOCLING_LMSTUDIO = VisionOpenAILikeApiOptions(
    name="smoldocling_lms",
    api_url="...",
    model_name="smoldocling..",
)
SMOLDOCLING_MLX = VisionHfTransformersOptions(
    name="smoldocling_mlx",
    repo_id="ds4sd/smoldocling...",
)
SMOLDOCLING_VLLM = ...
```

### Model factories

Level 1: class names
- Maps a `Type[BaseModelOptions]` to a model runtime class.
- No enum of kinds/names, because these options have mandatory arguments (`api_url`, `repo_id`, etc.).

Level 2: instance names
- Resolves the `name` of a preset options instance.
- Exposes an enum of all names to be used in the CLI, etc. (see the sketch below).
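To make the two levels concrete, here is a minimal sketch of how the factories could be wired together, building on the options classes and presets sketched above. The registry layout and the names `register_model_class`, `create_model`, `register_preset`, and `preset_names` are illustrative assumptions, not an existing or final API.

```py
# Illustrative sketch only: names and structure are assumptions, not a final API.
from enum import Enum
from typing import Dict, Type

# Level 1: options `kind` -> model runtime class.
# Populated by the runtimes themselves (or by plugins) at import time.
_MODEL_CLASSES: Dict[str, Type] = {}

def register_model_class(kind: str, model_cls: Type) -> None:
    _MODEL_CLASSES[kind] = model_cls

def create_model(options: BaseModelOptions):
    """Instantiate the runtime that matches the options' `kind`."""
    return _MODEL_CLASSES[options.kind](options)

# Level 2: preset `name` -> fully instantiated options object.
_PRESETS: Dict[str, BaseModelOptions] = {}

def register_preset(options: BaseModelOptions) -> None:
    _PRESETS[options.name] = options

def preset_names() -> Enum:
    """Enum of all registered preset names, e.g. to populate CLI choices."""
    return Enum("ModelPreset", {name: name for name in _PRESETS})

def create_model_from_preset(name: str):
    return create_model(_PRESETS[name])

# Example usage with the classes and presets above:
# register_model_class("vision_openailike_api", VisionOpenAILikeApi)
# register_preset(QWEN_VL_OLLAMA)
# model = create_model_from_preset("qwen_vl_ollama")
```

Keeping the preset registry separate from the runtime-class registry keeps Level 2 optional: fully custom options (requirement 3) can still be passed directly to `create_model` without defining a preset.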
### Stage definition

Stages are responsible for:

1. **Pre-processing** the input (DoclingDocument, Page batch, etc.) into the more generic format
   that can be consumed by the models.
2. **Post-processing** the output of the models into the format in which it should be saved back.


The stage options link together:
1. Which stage to run and its own settings, e.g. the model prompt to use.
2. `model_options`, used to get the model from the factory.
3. `model_interpreter_options`, used to interpret the raw model response; this depends on the use case, so it is independent of the model runtime.
    - We could have each stage (or only the ones needing it) define its own factory, but a shared one should also be enough.


```py
## Base classes (options, etc.)

class BaseStageOptions(BaseModel):
    kind: str
    model_options: BaseModelOptions
    model_interpreter_options  # in the base class


## Helper base classes

class BaseDocItemImageEnrichment:
    labels: list[DocItemLabel]  # ..or with a simple filter callable (like now)
    image_scale: float
    expansion_factor: float

    ...


## Actual stages

class PictureDescriptionOptions(BaseStageOptions):
    kind: Literal["picture_description"] = "picture_description"
    model_options: BaseModelOptions = ...  # default choice, fully instantiated
    ...  # other options

class PictureDescription(BaseDocItemImageEnrichment):
    labels = [PictureItem]
    ...

    def __init__(self, options, ...):
        ...

class CodeUnderstanding(BaseDocItemImageEnrichment):
    labels = [CodeItem]
    ...

    def __init__(self, options, ...):
        ...

class VisionConvertOptions(BaseStageOptions):
    kind: Literal["vision_converter"] = "vision_converter"
    model_options: BaseModelOptions = ...  # default choice, fully instantiated


class VisionConvert:
    """Equivalent to the current VlmModel, for DocTags or Markdown."""
    ...

    def __init__(self, options, ...):
        ...
```


### Usage

#### SDK

```py
# Raw inputs
pipeline_options.STAGE_options = PictureDescriptionOptions(
    model_options=VisionOpenAILikeApiOptions(
        api_url="my fancy url",
        model_name="qwen_vl",
    ),
    prompt="Write a few sentences which describe this image in detail. If it is a diagram, also provide some key numeric highlights.",
)

# Using presets
pipeline_options.STAGE_options = PictureDescriptionOptions(
    model_options=model_specs.GRANITE_VISION_LMSTUDIO,
    # there will be a default prompt (but not specific to the model!)
)
```

#### CLI

We could expose the options as `--stage-NAME-X` or directly as `--NAME-X`.

```sh
# Default options
docling --enrich-picture-description

# Change model (only from a preset)
docling --enrich-picture-description \
    --stage-picture-description-model=qwen_vl \
    --stage-picture-description-prompt="..."
```

How third-party preset plugins could be discovered and exposed as CLI choices (requirements 4 and 5) is sketched below.
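One possible mechanism, assuming Python ≥ 3.10, is package entry points. The entry-point group name `docling.model_presets` and the `load_preset_plugins` helper below are purely illustrative assumptions, and the sketch reuses the hypothetical `register_preset` / `preset_names` helpers from the factory sketch above.

```py
# Illustrative sketch: discover third-party model presets via Python entry points.
# The group name "docling.model_presets" is an assumption, not an existing contract.
from importlib.metadata import entry_points


def load_preset_plugins() -> None:
    """Collect preset options objects advertised by installed packages."""
    for ep in entry_points(group="docling.model_presets"):
        preset = ep.load()  # expected to be a fully instantiated *Options object
        register_preset(preset)


# A third-party package would declare its preset in pyproject.toml:
#
#   [project.entry-points."docling.model_presets"]
#   qwen_vl_ollama = "my_pkg.presets:QWEN_VL_OLLAMA"
#
# The CLI can then offer `preset_names()` as the valid values for
# `--stage-picture-description-model`.
```

This would keep the CLI choices in sync with whatever presets are installed, without code changes in the core package.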
### Open points

Some minor open questions:

1. Should we move the accelerator options into the `model_options`?
2. Where should the `batch_size` live?

### Weaknesses

Should we consider storing presets of the full stage options? Will this quickly become too complex?


## Status

Proposed