Commit a334b9f

fix: remove duplicate captions in markdown (#31)
Signed-off-by: Michele Dolfi <[email protected]>
1 parent bdda0ee commit a334b9f

File tree

2 files changed: +36 -18 lines changed

docling_core/types/doc/document.py

Lines changed: 35 additions & 5 deletions
```diff
@@ -477,6 +477,26 @@ def export_to_markdown( # noqa: C901
         md_texts: list[str] = []
 
         if self.main_text is not None:
+            # collect all captions embedded in table and figure objects
+            # to avoid repeating them
+            embedded_captions = set()
+            for orig_item in self.main_text[main_text_start:main_text_stop]:
+                item = (
+                    self._resolve_ref(orig_item)
+                    if isinstance(orig_item, Ref)
+                    else orig_item
+                )
+                if item is None:
+                    continue
+
+                if (
+                    isinstance(item, (Table, Figure))
+                    and item.text
+                    and item.obj_type in main_text_labels
+                ):
+                    embedded_captions.add(item.text)
+
+            # serialize document to markdown
             for orig_item in self.main_text[main_text_start:main_text_stop]:
                 markdown_text = ""
 
```
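For context, a minimal standalone sketch of the first pass introduced above. This is not the real docling_core API: `Item`, `Table`, `Figure`, and `collect_embedded_captions` are simplified stand-ins for the document-model types that appear in the diff.

```python
from dataclasses import dataclass


@dataclass
class Item:
    obj_type: str
    text: str | None = None


class Table(Item):
    pass


class Figure(Item):
    pass


def collect_embedded_captions(items: list[Item], labels: set[str]) -> set[str]:
    """First pass: gather caption texts already attached to floating objects."""
    captions: set[str] = set()
    for item in items:
        # Only tables/figures that carry a caption and are actually exported
        # (their obj_type is in the selected labels) contribute to the set.
        if isinstance(item, (Table, Figure)) and item.text and item.obj_type in labels:
            captions.add(item.text)
    return captions
```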
```diff
@@ -492,6 +512,11 @@ def export_to_markdown( # noqa: C901
             if isinstance(item, BaseText) and item_type in main_text_labels:
                 text = item.text
 
+                # skip captions if they are embedded in the actual
+                # floating object
+                if item_type == "caption" and text in embedded_captions:
+                    continue
+
                 # ignore repeated text
                 if prev_text == text or text is None:
                     continue
```
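Continuing the same hypothetical sketch, the serialization pass can then drop standalone caption items whose text was collected in the first pass, which is exactly the check this hunk adds:

```python
def is_duplicate_caption(item: Item, embedded_captions: set[str]) -> bool:
    # A standalone "caption" item whose text the table/figure will emit
    # itself is redundant and can be skipped.
    return item.obj_type == "caption" and item.text in embedded_captions
```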
```diff
@@ -523,8 +548,9 @@ def export_to_markdown( # noqa: C901
                 isinstance(item, Table)
                 and item.data
                 and item_type in main_text_labels
-                and not strict_text
             ):
+
+                md_table = ""
                 table = []
                 for row in item.data:
                     tmp = []
```
```diff
@@ -545,15 +571,19 @@ def export_to_markdown( # noqa: C901
                         disable_numparse=True,
                     )
 
-                markdown_text = md_table
+                markdown_text = ""
+                if item.text:
+                    markdown_text = item.text
+                if not strict_text:
+                    markdown_text += "\n" + md_table
 
             elif isinstance(item, Figure) and item_type in main_text_labels:
 
                 markdown_text = ""
-                if not strict_text:
-                    markdown_text = f"{image_placeholder}"
                 if item.text:
-                    markdown_text += "\n" + item.text
+                    markdown_text = item.text
+                if not strict_text:
+                    markdown_text += f"\n{image_placeholder}"
 
             if markdown_text:
                 md_texts.append(markdown_text)
```
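These last two hunks also swap the emission order for floating objects: the caption, when present, now comes first, and the table body or image placeholder follows only outside strict_text mode. A hedged sketch of that ordering, again using a hypothetical helper name rather than the library's internals:

```python
def render_figure(
    caption: str | None,
    strict_text: bool,
    image_placeholder: str = "<!-- image -->",
) -> str:
    # Caption first, placeholder second; strict_text suppresses the placeholder.
    markdown_text = ""
    if caption:
        markdown_text = caption
    if not strict_text:
        markdown_text += f"\n{image_placeholder}"
    return markdown_text


# Tables follow the same pattern, appending the serialized table body
# (md_table) instead of the placeholder:
# render_figure("Fig. 2. Frequency of tokens ...", strict_text=False)
# -> "Fig. 2. Frequency of tokens ...\n<!-- image -->"
```

In strict_text mode a figure with no caption renders as an empty string and is then dropped by the `if markdown_text:` guard, matching the diff above.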

test/data/doc/doc-export.md

Lines changed: 1 addition & 13 deletions
```diff
@@ -16,8 +16,8 @@ In modern document understanding systems [1,15], table extraction is typically a
 
 Fig. 1. Comparison between HTML and OTSL table structure representation: (A) table-example with complex row and column headers, including a 2D empty span, (B) minimal graphical representation of table structure using rectangular layout, (C) HTML representation, (D) OTSL representation. This example demonstrates many of the key-features of OTSL, namely its reduced vocabulary size (12 versus 5 in this case), its reduced sequence length (55 versus 30) and a enhanced internal structure (variable token sequence length per row in HTML versus a fixed length of rows in OTSL).
 
-<!-- image -->
 Fig. 1. Comparison between HTML and OTSL table structure representation: (A) table-example with complex row and column headers, including a 2D empty span, (B) minimal graphical representation of table structure using rectangular layout, (C) HTML representation, (D) OTSL representation. This example demonstrates many of the key-features of OTSL, namely its reduced vocabulary size (12 versus 5 in this case), its reduced sequence length (55 versus 30) and a enhanced internal structure (variable token sequence length per row in HTML versus a fixed length of rows in OTSL).
+<!-- image -->
 
 today, table detection in documents is a well understood problem, and the latest state-of-the-art (SOTA) object detection methods provide an accuracy comparable to human observers [7,8,10,14,23]. On the other hand, the problem of table structure recognition (TSR) is a lot more challenging and remains a very active area of research, in which many novel machine learning algorithms are being explored [3,4,5,9,11,12,13,14,17,18,21,22].
 
```
```diff
@@ -46,9 +46,7 @@ All known Im2Seq based models for TSR fundamentally work in similar ways. Given
 ulary and can be interpreted as a table structure. For example, with the HTML tokens <table>, </table>, <tr>, </tr>, <td> and </td>, one can construct simple table structures without any spanning cells. In reality though, one needs at least 28 HTML tokens to describe the most common complex tables observed in real-world documents [21,22], due to a variety of spanning cells definitions in the HTML token vocabulary.
 
 Fig. 2. Frequency of tokens in HTML and OTSL as they appear in PubTabNet.
-
 <!-- image -->
-Fig. 2. Frequency of tokens in HTML and OTSL as they appear in PubTabNet.
 
 Obviously, HTML and other general-purpose markup languages were not designed for Im2Seq models. As such, they have some serious drawbacks. First, the token vocabulary needs to be artificially large in order to describe all plausible tabular structures. Since most Im2Seq models use an autoregressive approach, they generate the sequence token by token. Therefore, to reduce inference time, a shorter sequence length is critical. Every table-cell is represented by at least two tokens (<td> and </td>). Furthermore, when tokenizing the HTML structure, one needs to explicitly enumerate possible column-spans and row-spans as words. In practice, this ends up requiring 28 different HTML tokens (when including column-and row-spans up to 10 cells) just to describe every table in the PubTabNet dataset. Clearly, not every token is equally represented, as is depicted in Figure 2. This skewed distribution of tokens in combination with variable token row-length makes it challenging for models to learn the HTML structure.
 
```
```diff
@@ -83,9 +81,7 @@ The OTSL vocabulary is comprised of the following tokens:
 A notable attribute of OTSL is that it has the capability of achieving lossless conversion to HTML.
 
 Fig. 3. OTSL description of table structure: A-table example; B-graphical representation of table structure; C-mapping structure on a grid; D-OTSL structure encoding; E-explanation on cell encoding
-
 <!-- image -->
-Fig. 3. OTSL description of table structure: A-table example; B-graphical representation of table structure; C-mapping structure on a grid; D-OTSL structure encoding; E-explanation on cell encoding
 
 ## 4.2 Language Syntax
 
```
```diff
@@ -118,9 +114,7 @@ The design of OTSL allows to validate a table structure easily on an unfinished
 To evaluate the impact of OTSL on prediction accuracy and inference times, we conducted a series of experiments based on the TableFormer model (Figure 4) with two objectives: Firstly we evaluate the prediction quality and performance of OTSL vs. HTML after performing Hyper Parameter Optimization (HPO) on the canonical PubTabNet data set. Secondly we pick the best hyper-parameters found in the first step and evaluate how OTSL impacts the performance of TableFormer after training on other publicly available data sets (FinTabNet, PubTables-1M [14]). The ground truth (GT) from all data sets has been converted into OTSL format for this purpose, and will be made publicly available.
 
 Fig. 4. Architecture sketch of the TableFormer model, which is a representative for the Im2Seq approach.
-
 <!-- image -->
-Fig. 4. Architecture sketch of the TableFormer model, which is a representative for the Im2Seq approach.
 
 We rely on standard metrics such as Tree Edit Distance score (TEDs) for table structure prediction, and Mean Average Precision (mAP) with 0.75 Intersection Over Union (IOU) threshold for the bounding-box predictions of table cells. The predicted OTSL structures were converted back to HTML format in
 
```
```diff
@@ -131,7 +125,6 @@ order to compute the TED score. Inference timing results for all experiments wer
 We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.
 
 Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
-
 | # | # | Language | TEDs | TEDs | TEDs | mAP (0.75) | Inference |
 |------------|------------|------------|-------------|-------------|-------------|--------------|-------------|
 | enc-layers | dec-layers | | simple | complex | all | | time (secs) |
```
```diff
@@ -147,7 +140,6 @@ We picked the model parameter configuration that produced the best prediction qu
 Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.
 
 Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using Table-Former [9] (with enc=6, dec=6, heads=8).
-
 | Data set | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
 |--------------|------------|--------|---------|--------|-------------|-------------------------|
 | | | simple | complex | all | | |
```
```diff
@@ -163,18 +155,14 @@ Table 2. TSR and cell detection results compared between OTSL and HTML on the Pu
 To illustrate the qualitative differences between OTSL and HTML, Figure 5 demonstrates less overlap and more accurate bounding boxes with OTSL. In Figure 6, OTSL proves to be more effective in handling tables with longer token sequences, resulting in even more precise structure prediction and bounding boxes.
 
 Fig. 5. The OTSL model produces more accurate bounding boxes with less overlap (E) than the HTML model (D), when predicting the structure of a sparse table (A), at twice the inference speed because of shorter sequence length (B),(C). 'PMC2807444_006_00.png ' PubTabNet. μ
-
 <!-- image -->
-Fig. 5. The OTSL model produces more accurate bounding boxes with less overlap (E) than the HTML model (D), when predicting the structure of a sparse table (A), at twice the inference speed because of shorter sequence length (B),(C). 'PMC2807444_006_00.png ' PubTabNet. μ
 
 μ
 
 
 
 Fig. 6. Visualization of predicted structure and detected bounding boxes on a complex table with many rows. The OTSL model (B) captured repeating pattern of horizontally merged cells from the GT (A), unlike the HTML model (C). The HTML model also didn't complete the HTML sequence correctly and displayed a lot more of drift and overlap of bounding boxes. 'PMC5406406_003_01.png ' PubTabNet.
-
 <!-- image -->
-Fig. 6. Visualization of predicted structure and detected bounding boxes on a complex table with many rows. The OTSL model (B) captured repeating pattern of horizontally merged cells from the GT (A), unlike the HTML model (C). The HTML model also didn't complete the HTML sequence correctly and displayed a lot more of drift and overlap of bounding boxes. 'PMC5406406_003_01.png ' PubTabNet.
 
 ## 6 Conclusion
 
```