Commit 2efb71a

fix: allow captions without holding item (#215)
Signed-off-by: Panos Vagenas <[email protected]>
1 parent bcace5d commit 2efb71a

5 files changed, 37 additions and 19 deletions

docling_core/experimental/serializer/common.py

Lines changed: 31 additions & 18 deletions
@@ -7,10 +7,11 @@
 import sys
 from abc import abstractmethod
 from copy import deepcopy
+from functools import cached_property
 from pathlib import Path
 from typing import Any, Optional, Union
 
-from pydantic import AnyUrl, BaseModel, NonNegativeInt
+from pydantic import AnyUrl, BaseModel, NonNegativeInt, computed_field
 from typing_extensions import Self, override
 
 from docling_core.experimental.serializer.base import (
@@ -96,6 +97,21 @@ class Config:
 
     _excluded_refs_cache: dict[str, list[str]] = {}
 
+    @computed_field  # type: ignore[misc]
+    @cached_property
+    def _captions_of_some_item(self) -> set[str]:
+        layers = {cl for cl in ContentLayer}  # TODO review
+        refs = {
+            cap.cref
+            for (item, _) in self.doc.iterate_items(
+                with_groups=True,
+                traverse_pictures=True,
+                included_content_layers=layers,
+            )
+            for cap in (item.captions if isinstance(item, FloatingItem) else [])
+        }
+        return refs
+
     @override
     def get_excluded_refs(self, **kwargs) -> list[str]:
         """References to excluded items."""
@@ -201,11 +217,6 @@ def serialize(
             else:
                 return empty_res
 
-        label_blocklist = {
-            # captions only considered in context of floating items (pictures, tables)
-            DocItemLabel.CAPTION,
-        }
-
         ########
         # groups
         ########
@@ -231,20 +242,22 @@ def serialize(
         ###########
         # doc items
         ###########
-        elif isinstance(item, DocItem) and item.label in label_blocklist:
-            return empty_res
         elif isinstance(item, TextItem):
-            part = (
-                self.text_serializer.serialize(
-                    item=item,
-                    doc_serializer=self,
-                    doc=self.doc,
-                    is_inline_scope=is_inline_scope,
-                    **kwargs,
+            if item.self_ref in self._captions_of_some_item:
+                # those captions will be handled by the floating item holding them
+                return empty_res
+            else:
+                part = (
+                    self.text_serializer.serialize(
+                        item=item,
+                        doc_serializer=self,
+                        doc=self.doc,
+                        is_inline_scope=is_inline_scope,
+                        **kwargs,
+                    )
+                    if item.self_ref not in self.get_excluded_refs(**kwargs)
+                    else empty_res
                 )
-                if item.self_ref not in self.get_excluded_refs(**kwargs)
-                else empty_res
-            )
         elif isinstance(item, TableItem):
             part = self.table_serializer.serialize(
                 item=item,
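
Note: the change in common.py replaces the blanket skip of every item labeled CAPTION with a skip of only those captions that some floating item actually references. A minimal, self-contained sketch of that idea follows. It is not the docling_core implementation: `Item`, `SketchSerializer`, and `caption_refs` are hypothetical stand-ins, and the sketch simply drops held captions where the real serializer would delegate them to the floating item holding them.

```python
from dataclasses import dataclass, field
from functools import cached_property


@dataclass
class Item:
    """Hypothetical stand-in for a document item (not a docling_core class)."""

    self_ref: str
    text: str = ""
    # Refs of the captions this item holds (empty for plain text items).
    caption_refs: list[str] = field(default_factory=list)


class SketchSerializer:
    """Hypothetical serializer that mirrors only the caption check."""

    def __init__(self, items: list[Item]) -> None:
        self.items = items

    @cached_property
    def captions_of_some_item(self) -> set[str]:
        # Computed once on first access: refs of all captions held by any item.
        return {ref for item in self.items for ref in item.caption_refs}

    def serialize(self) -> list[str]:
        out = []
        for item in self.items:
            if item.self_ref in self.captions_of_some_item:
                # Held caption: left to the holding item (skipped in this sketch).
                continue
            if item.text:
                out.append(item.text)
        return out


items = [
    Item("#/pictures/0", caption_refs=["#/texts/1"]),
    Item("#/texts/1", text="Figure 1: held caption"),
    Item("#/texts/2", text="Figure 4: caption without holding item"),
]
result = SketchSerializer(items).serialize()
print(result)
```

In this sketch the caption at `#/texts/2` is emitted as ordinary text because no item references it, which corresponds to the "Figure 4" caption that now appears in the updated test fixtures.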

test/data/doc/2206.01062.yaml.dt

Lines changed: 1 addition & 0 deletions
@@ -82,6 +82,7 @@
 <text><loc_44><loc_364><loc_241><loc_445>Phase 3: Training. After a first trial with a small group of people, we realised that providing the annotation guideline and a set of random practice pages did not yield the desired quality level for layout annotation. Therefore we prepared a subset of pages with two different complexity levels, each with a practice and an exam part. 974 pages were reference-annotated by one proficient core team member. Annotation staff were then given the task to annotate the same subsets (blinded from the reference). By comparing the annotations of each staff member with the reference annotations, we could quantify how closely their annotations matched the reference. Only after passing two exam levels with high annotation quality, staff were admitted into the production phase. Practice iterations</text>
 <picture><loc_258><loc_54><loc_457><loc_290></picture>
 <text><loc_327><loc_290><loc_389><loc_291>05237a14f2524e3f53c8454b074409d05078038a6a36b770fcc8ec7e540deae0</text>
+<caption><loc_260><loc_299><loc_457><loc_318>Figure 4: Examples of plausible annotation alternatives for the same page. Criteria in our annotation guideline can resolve cases A to C, while the case D remains ambiguous.</caption>
 <text><loc_259><loc_332><loc_456><loc_344>were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.</text>
 <text><loc_259><loc_346><loc_457><loc_448>Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted</text>
 <page_break>

test/data/doc/2206.01062.yaml.md

Lines changed: 2 additions & 0 deletions
@@ -152,6 +152,8 @@ Phase 3: Training. After a first trial with a small group of people, we realised
 
 05237a14f2524e3f53c8454b074409d05078038a6a36b770fcc8ec7e540deae0
 
+Figure 4: Examples of plausible annotation alternatives for the same page. Criteria in our annotation guideline can resolve cases A to C, while the case D remains ambiguous.
+
 were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.
 
 Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted

test/data/doc/2206.01062.yaml.min.dt

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

test/data/doc/2206.01062.yaml.paged.md

Lines changed: 2 additions & 0 deletions
@@ -160,6 +160,8 @@ Phase 3: Training. After a first trial with a small group of people, we realised
 
 05237a14f2524e3f53c8454b074409d05078038a6a36b770fcc8ec7e540deae0
 
+Figure 4: Examples of plausible annotation alternatives for the same page. Criteria in our annotation guideline can resolve cases A to C, while the case D remains ambiguous.
+
 were carried out over a timeframe of 12 weeks, after which 8 of the 40 initially allocated annotators did not pass the bar.
 
 Phase 4: Production annotation. The previously selected 80K pages were annotated with the defined 11 class labels by 32 annotators. This production phase took around three months to complete. All annotations were created online through CCS, which visualises the programmatic PDF text-cells as an overlay on the page. The page annotation are obtained by drawing rectangular bounding-boxes, as shown in Figure 3. With regard to the annotation practices, we implemented a few constraints and capabilities on the tooling level. First, we only allow non-overlapping, vertically oriented, rectangular boxes. For the large majority of documents, this constraint was sufficient and it speeds up the annotation considerably in comparison with arbitrary segmentation shapes. Second, annotator staff were not able to see each other's annotations. This was enforced by design to avoid any bias in the annotation, which could skew the numbers of the inter-annotator agreement (see Table 1). We wanted
