
Commit 5b3b7ea

Authored by tic-top, ydshieh, and stevhliu

Add Kosmos-2.5 (#31711)

Add Microsoft Kosmos-2.5

Co-authored-by: [email protected] <tic-top>
Co-authored-by: ydshieh <[email protected]>
Co-authored-by: Yih-Dar <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
1 parent c93594e commit 5b3b7ea

24 files changed (+4873 / -4 lines)

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -1039,6 +1039,8 @@
     - local: model_doc/kosmos-2
       title: KOSMOS-2
+    - local: model_doc/kosmos2_5
+      title: KOSMOS-2.5
     - local: model_doc/layoutlm
       title: LayoutLM
     - local: model_doc/layoutlmv2
```

docs/source/en/model_doc/kosmos2_5.md

Lines changed: 196 additions & 0 deletions

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZCD9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# KOSMOS-2.5

The Kosmos-2.5 model was proposed in [KOSMOS-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419) by Microsoft.

The abstract from the paper is the following:

*We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png"
alt="drawing" width="600"/>

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_md.png"
alt="drawing" width="600"/>

<small> Overview of tasks that KOSMOS-2.5 can handle. Taken from the <a href="https://arxiv.org/abs/2309.11419">original paper</a>. </small>

The examples below demonstrate how to generate with [`AutoModel`] for both the Markdown and OCR tasks.

<hfoptions id="usage">
<hfoption id="AutoModel - Markdown Task">

```py
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

repo = "ydshieh/kosmos-2.5"
device = "cuda:0"
dtype = torch.bfloat16
model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=device, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(repo)

# sample image
url = "https://huggingface.co/ydshieh/kosmos-2.5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "<md>"
inputs = processor(text=prompt, images=image, return_tensors="pt")

# the processor also returns the resized height/width; they are only needed to
# rescale bounding boxes in the OCR task, so simply pop them here
inputs.pop("height")
inputs.pop("width")

inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
generated_ids = model.generate(**inputs, max_new_tokens=1024)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

</hfoption>
<hfoption id="AutoModel - OCR Task">

```py
import re
import torch
import requests
from PIL import Image, ImageDraw
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

repo = "ydshieh/kosmos-2.5"
device = "cuda:0"
dtype = torch.bfloat16
model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=device, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(repo)

# sample image
url = "https://huggingface.co/ydshieh/kosmos-2.5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)

# bs = 1
prompt = "<ocr>"
inputs = processor(text=prompt, images=image, return_tensors="pt")
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_height = raw_height / height
scale_width = raw_width / width

# bs > 1, batch generation
# inputs = processor(text=[prompt, prompt], images=[image, image], return_tensors="pt")
# height, width = inputs.pop("height"), inputs.pop("width")
# raw_width, raw_height = image.size
# scale_height = raw_height / height[0]
# scale_width = raw_width / width[0]

inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
generated_ids = model.generate(**inputs, max_new_tokens=1024)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)


def post_process(y, scale_height, scale_width):
    # strip the task prompt from the decoded text
    y = y.replace(prompt, "")
    if "<md>" in prompt:
        return y
    pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
    bboxs_raw = re.findall(pattern, y)
    lines = re.split(pattern, y)[1:]
    bboxs = [re.findall(r"\d+", i) for i in bboxs_raw]
    bboxs = [[int(j) for j in i] for i in bboxs]
    info = ""
    for i in range(len(lines)):
        box = bboxs[i]
        x0, y0, x1, y1 = box
        if not (x0 >= x1 or y0 >= y1):
            # rescale the box from the processor's resolution back to the raw image
            x0 = int(x0 * scale_width)
            y0 = int(y0 * scale_height)
            x1 = int(x1 * scale_width)
            y1 = int(y1 * scale_height)
            # four corner points of the axis-aligned quadrilateral, then the text
            info += f"{x0},{y0},{x1},{y0},{x1},{y1},{x0},{y1},{lines[i]}"
    return info


output_text = post_process(generated_text[0], scale_height, scale_width)
print(output_text)

draw = ImageDraw.Draw(image)
for line in output_text.split("\n"):
    # draw the bounding box as a quadrilateral
    line = line.split(",")
    if len(line) < 8:
        continue
    draw.polygon(list(map(int, line[:8])), outline="red")
image.save("output.png")
```

</hfoption>
</hfoptions>
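The `<ocr>` post-processing is plain string manipulation and can be exercised without loading the model. Below is a minimal, self-contained sketch of the same parse-and-rescale logic; the decoded string and scale factors are invented for illustration, while the `<bbox>` token format follows the OCR example above.

```python
import re

# hypothetical decoded <ocr> output (invented for this sketch)
decoded = (
    "<bbox><x_10><y_20><x_110><y_40></bbox>Total: 9.99\n"
    "<bbox><x_10><y_50><x_90><y_70></bbox>Thank you\n"
)
# assumed scale factors: raw image size divided by the processor's resized size
scale_width, scale_height = 2.0, 0.5

pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
# extract the four coordinates of each bbox token and the text that follows it
boxes = [[int(n) for n in re.findall(r"\d+", b)] for b in re.findall(pattern, decoded)]
texts = re.split(pattern, decoded)[1:]

lines = []
for (x0, y0, x1, y1), text in zip(boxes, texts):
    if x0 < x1 and y0 < y1:  # keep only well-formed boxes
        x0, x1 = int(x0 * scale_width), int(x1 * scale_width)
        y0, y1 = int(y0 * scale_height), int(y1 * scale_height)
        # four corner points of the axis-aligned quadrilateral, then the text
        lines.append(f"{x0},{y0},{x1},{y0},{x1},{y1},{x0},{y1},{text.strip()}")

print("\n".join(lines))
```

Each output line carries the eight rescaled corner coordinates followed by the recognized text, which is exactly the shape the drawing loop above consumes.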

## Example

**Markdown Task:** For usage instructions, please refer to [md.py](https://huggingface.co/ydshieh/kosmos-2.5/blob/main/md.py).

**OCR Task:** For usage instructions, please refer to [ocr.py](https://huggingface.co/ydshieh/kosmos-2.5/blob/main/ocr.py).

## Kosmos2_5Config

[[autodoc]] Kosmos2_5Config

## Kosmos2_5ImageProcessor

[[autodoc]] Kosmos2_5ImageProcessor
    - preprocess

## Kosmos2_5ImageProcessorFast

[[autodoc]] Kosmos2_5ImageProcessorFast
    - preprocess

## Kosmos2_5Processor

[[autodoc]] Kosmos2_5Processor

## Kosmos2_5Model

[[autodoc]] Kosmos2_5Model
    - forward

## Kosmos2_5ForConditionalGeneration

[[autodoc]] Kosmos2_5ForConditionalGeneration
    - forward

src/transformers/__init__.py

File mode changed: 100644 -> 100755

src/transformers/models/auto/configuration_auto.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -209,6 +209,7 @@
         ("jetmoe", "JetMoeConfig"),
         ("jukebox", "JukeboxConfig"),
         ("kosmos-2", "Kosmos2Config"),
+        ("kosmos-2.5", "Kosmos2_5Config"),
         ("kyutai_speech_to_text", "KyutaiSpeechToTextConfig"),
         ("layoutlm", "LayoutLMConfig"),
         ("layoutlmv2", "LayoutLMv2Config"),
@@ -626,6 +627,7 @@
         ("jetmoe", "JetMoe"),
         ("jukebox", "Jukebox"),
         ("kosmos-2", "KOSMOS-2"),
+        ("kosmos-2.5", "KOSMOS-2.5"),
         ("kyutai_speech_to_text", "KyutaiSpeechToText"),
         ("layoutlm", "LayoutLM"),
         ("layoutlmv2", "LayoutLMv2"),
@@ -908,6 +910,7 @@
         ("data2vec-vision", "data2vec"),
         ("donut-swin", "donut"),
         ("kosmos-2", "kosmos2"),
+        ("kosmos-2.5", "kosmos2_5"),
         ("maskformer-swin", "maskformer"),
         ("xclip", "x_clip"),
         ("clip_vision_model", "clip"),
```
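Each of these auto mappings is a plain model-type-to-class-name registry, and auto-class resolution is conceptually a dict lookup keyed on `model_type`. A toy sketch of the idea (the mini registry and the `config_class_for` helper are invented for illustration, not the actual `transformers` code):

```python
# Toy registry mirroring a few entries of the mapping extended above.
# Illustrative only; the real transformers registry is far larger.
CONFIG_MAPPING_NAMES = {
    "kosmos-2": "Kosmos2Config",
    "kosmos-2.5": "Kosmos2_5Config",
    "layoutlm": "LayoutLMConfig",
}


def config_class_for(model_type: str) -> str:
    """Resolve a model_type string to its config class name, AutoConfig-style."""
    try:
        return CONFIG_MAPPING_NAMES[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model type: {model_type!r}") from None


print(config_class_for("kosmos-2.5"))  # Kosmos2_5Config
```

This is why adding a model only requires inserting one `(model_type, class name)` pair per registry: lookup, error reporting, and listing all fall out of the dict.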

src/transformers/models/auto/image_processing_auto.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -116,6 +116,7 @@
         ("instructblipvideo", ("InstructBlipVideoImageProcessor", None)),
         ("janus", ("JanusImageProcessor", "JanusImageProcessorFast")),
         ("kosmos-2", ("CLIPImageProcessor", "CLIPImageProcessorFast")),
+        ("kosmos-2.5", ("Kosmos2_5ImageProcessor", "Kosmos2_5ImageProcessorFast")),
         ("layoutlmv2", ("LayoutLMv2ImageProcessor", "LayoutLMv2ImageProcessorFast")),
         ("layoutlmv3", ("LayoutLMv3ImageProcessor", "LayoutLMv3ImageProcessorFast")),
         ("levit", ("LevitImageProcessor", "LevitImageProcessorFast")),
```

src/transformers/models/auto/modeling_auto.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -209,6 +209,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
         ("jetmoe", "JetMoeModel"),
         ("jukebox", "JukeboxModel"),
         ("kosmos-2", "Kosmos2Model"),
+        ("kosmos-2.5", "Kosmos2_5Model"),
         ("kyutai_speech_to_text", "KyutaiSpeechToTextModel"),
         ("layoutlm", "LayoutLMModel"),
         ("layoutlmv2", "LayoutLMv2Model"),
@@ -943,6 +944,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
         ("instructblip", "InstructBlipForConditionalGeneration"),
         ("instructblipvideo", "InstructBlipVideoForConditionalGeneration"),
         ("kosmos-2", "Kosmos2ForConditionalGeneration"),
+        ("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"),
         ("llava", "LlavaForConditionalGeneration"),
         ("llava_next", "LlavaNextForConditionalGeneration"),
         ("llava_next_video", "LlavaNextVideoForConditionalGeneration"),
@@ -992,6 +994,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
         ("internvl", "InternVLForConditionalGeneration"),
         ("janus", "JanusForConditionalGeneration"),
         ("kosmos-2", "Kosmos2ForConditionalGeneration"),
+        ("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"),
         ("llama4", "Llama4ForConditionalGeneration"),
         ("llava", "LlavaForConditionalGeneration"),
         ("llava_next", "LlavaNextForConditionalGeneration"),
```

src/transformers/models/auto/processing_auto.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -88,6 +88,7 @@
         ("internvl", "InternVLProcessor"),
         ("janus", "JanusProcessor"),
         ("kosmos-2", "Kosmos2Processor"),
+        ("kosmos-2.5", "Kosmos2_5Processor"),
         ("kyutai_speech_to_text", "KyutaiSpeechToTextProcessor"),
         ("layoutlmv2", "LayoutLMv2Processor"),
         ("layoutlmv3", "LayoutLMv3Processor"),
```

src/transformers/models/auto/tokenization_auto.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -343,6 +343,7 @@
                 "XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
             ),
         ),
+        ("kosmos-2.5", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
         ("layoutlm", ("LayoutLMTokenizer", "LayoutLMTokenizerFast" if is_tokenizers_available() else None)),
         ("layoutlmv2", ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast" if is_tokenizers_available() else None)),
         ("layoutlmv3", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),
```

src/transformers/models/kosmos2/modeling_kosmos2.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -1820,6 +1820,7 @@ def forward(
             vision_model_output=vision_model_output,
         )
 
+    @torch.no_grad()
     def generate(
         self,
         pixel_values: Optional[torch.Tensor] = None,
```
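The added `@torch.no_grad()` decorator wraps every call to `generate` in a no-grad region; it works as a decorator because `torch.no_grad` is a context manager that can also decorate functions. The general pattern can be sketched with the standard library alone (the `no_trace` class and its `enabled` flag below are invented stand-ins, not PyTorch internals):

```python
from contextlib import ContextDecorator


class no_trace(ContextDecorator):
    """Toy stand-in for torch.no_grad: a context manager usable as a decorator."""

    enabled = True  # flag a hypothetical engine would consult

    def __enter__(self):
        self.prev = no_trace.enabled
        no_trace.enabled = False  # disable tracking inside the block
        return self

    def __exit__(self, *exc):
        no_trace.enabled = self.prev  # restore on exit, even after exceptions
        return False


@no_trace()
def generate():
    # while the decorator is active, tracking is off
    return no_trace.enabled


print(generate(), no_trace.enabled)  # False True
```

Decorating `generate` this way saves each caller from having to remember the `with torch.no_grad():` block, and the flag is restored even if generation raises.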
Lines changed: 31 additions & 0 deletions

```py
# coding=utf-8
# Copyright 2024 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_kosmos2_5 import *
    from .image_processing_kosmos2_5 import *
    from .image_processing_kosmos2_5_fast import *
    from .modeling_kosmos2_5 import *
    from .processing_kosmos2_5 import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
```
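The `_LazyModule` pattern above defers importing the heavy submodules until an attribute is first accessed, which keeps `import transformers` fast. The same idea can be sketched with the standard library via PEP 562 module `__getattr__` (the `make_lazy_module` helper is invented for illustration, and lazy-loading `json` is just an example target):

```python
import importlib
import types


def make_lazy_module(name: str, submodules: dict) -> types.ModuleType:
    """Return a module that imports its submodules only on first attribute access."""
    mod = types.ModuleType(name)

    def __getattr__(attr):
        if attr in submodules:
            target = importlib.import_module(submodules[attr])
            setattr(mod, attr, target)  # cache so later lookups skip __getattr__
            return target
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # PEP 562: attribute misses on a module fall back to __dict__["__getattr__"]
    mod.__getattr__ = __getattr__
    return mod


lazy = make_lazy_module("toy_pkg", {"json": "json"})
print(lazy.json.dumps({"ok": True}))  # the real json module loads on this access
```

Nothing is imported when `make_lazy_module` runs; the first attribute access triggers the real import and caches it on the module, which is the behavior `_LazyModule` provides for the `kosmos2_5` subpackage.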
