
Commit 5c93dea

fix(layout,table): perform orientation detection at the table level
Signed-off-by: Clément Doumouro <[email protected]>
1 parent: 7183dc6

35 files changed (+46284, -46040 lines)

tests/data/groundtruth/docling_v2/2203.01017v2.doctags.txt

Lines changed: 3 additions & 3 deletions
@@ -9,7 +9,7 @@
<section_header_level_1><loc_258><loc_138><loc_334><loc_143>a. Picture of a table:</section_header_level_1>
<otsl><loc_258><loc_146><loc_439><loc_191><ched>1<nl></otsl>
<text><loc_437><loc_224><loc_441><loc_228>7</text>
-<otsl><loc_258><loc_274><loc_439><loc_313><fcel>0<fcel>1 2 1<lcel><lcel><nl><fcel>3 4<fcel>5<fcel>6<fcel>7<nl><fcel>9 13<fcel>10<fcel>11<fcel>12<nl><fcel>8 2<fcel>14<fcel>15<fcel>16<nl><fcel>17<fcel>18<fcel>19<fcel>20<nl></otsl>
+<otsl><loc_258><loc_274><loc_439><loc_313></otsl>
<picture><loc_257><loc_143><loc_439><loc_313><caption><loc_252><loc_325><loc_445><loc_353>Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.</caption></picture>
<text><loc_252><loc_369><loc_445><loc_420>Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.</text>
<text><loc_252><loc_422><loc_445><loc_450>The first problem is called table-location and has been previously addressed [30, 38, 19, 21, 23, 26, 8] with stateof-the-art object-detection networks (e.g. YOLO and later on Mask-RCNN [9]). For all practical purposes, it can be</text>
@@ -94,11 +94,11 @@
<text><loc_41><loc_114><loc_234><loc_135>where T a and T b represent tables in tree structure HTML format. EditDist denotes the tree-edit distance, and | T | represents the number of nodes in T .</text>
<section_header_level_1><loc_41><loc_142><loc_139><loc_148>5.4. Quantitative Analysis</section_header_level_1>
<text><loc_41><loc_154><loc_234><loc_250>Structure. As shown in Tab. 2, TableFormer outperforms all SOTA methods across different datasets by a large margin for predicting the table structure from an image. All the more, our model outperforms pre-trained methods. During the evaluation we do not apply any table filtering. We also provide our baseline results on the SynthTabNet dataset. It has been observed that large tables (e.g. tables that occupy half of the page or more) yield poor predictions. We attribute this issue to the image resizing during the preprocessing step, that produces downsampled images with indistinguishable features. This problem can be addressed by treating such big tables with a separate model which accepts a large input image size.</text>
-<otsl><loc_43><loc_258><loc_231><loc_367><ched>Model<ched>Dataset<ched>Simple<ched>TEDS Complex<ched>All<nl><rhed>EDD<fcel>PTN<fcel>91.1<fcel>88.7<fcel>89.9<nl><rhed>GTE<fcel>PTN<fcel>-<fcel>-<fcel>93.01<nl><rhed>TableFormer<fcel>PTN<fcel>98.5<fcel>95.0<fcel>96.75<nl><rhed>EDD<fcel>FTN<fcel>88.4<fcel>92.08<fcel>90.6<nl><rhed>GTE<fcel>FTN<fcel>-<fcel>-<fcel>87.14<nl><rhed>GTE (FT)<fcel>FTN<fcel>-<fcel>-<fcel>91.02<nl><rhed>TableFormer<fcel>FTN<fcel>97.5<fcel>96.0<fcel>96.8<nl><rhed>EDD<fcel>TB<fcel>86.0<fcel>-<fcel>86.0<nl><rhed>TableFormer<fcel>TB<fcel>89.6<fcel>-<fcel>89.6<nl><rhed>TableFormer<fcel>STN<fcel>96.9<fcel>95.7<fcel>96.7<nl></otsl>
+<otsl><loc_43><loc_258><loc_231><loc_367></otsl>
<text><loc_41><loc_374><loc_234><loc_387>Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) and SynthTabNet (STN).</text>
<text><loc_41><loc_389><loc_214><loc_395>FT: Model was trained on PubTabNet then finetuned.</text>
<text><loc_41><loc_407><loc_234><loc_450><loc_252><loc_48><loc_445><loc_144>Cell Detection. Like any object detector, our Cell BBox Detector provides bounding boxes that can be improved with post-processing during inference. We make use of the grid-like structure of tables to refine the predictions. A detailed explanation on the post-processing is available in the supplementary material. As shown in Tab. 3, we evaluate our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.</text>
-<otsl><loc_252><loc_156><loc_436><loc_192><ched>Model<ched>Dataset<ched>mAP<ched>mAP (PP)<nl><rhed>EDD+BBox<fcel>PubTabNet<fcel>79.2<fcel>82.7<nl><rhed>TableFormer<fcel>PubTabNet<fcel>82.1<fcel>86.8<nl><rhed>TableFormer<fcel>SynthTabNet<fcel>87.7<fcel>-<nl><caption><loc_252><loc_200><loc_445><loc_213>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption></otsl>
+<otsl><loc_252><loc_156><loc_436><loc_192><caption><loc_252><loc_200><loc_445><loc_213>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption></otsl>
<text><loc_252><loc_232><loc_445><loc_328>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</text>
<otsl><loc_272><loc_341><loc_426><loc_406><ched>Model<ched>Simple<ched>TEDS Complex<ched>All<nl><rhed>Tabula<fcel>78.0<fcel>57.8<fcel>67.9<nl><rhed>Traprange<fcel>60.8<fcel>49.9<fcel>55.4<nl><rhed>Camelot<fcel>80.0<fcel>66.0<fcel>73.0<nl><rhed>Acrobat Pro<fcel>68.9<fcel>61.8<fcel>65.3<nl><rhed>EDD<fcel>91.2<fcel>85.4<fcel>88.3<nl><rhed>TableFormer<fcel>95.4<fcel>90.1<fcel>93.6<nl><caption><loc_252><loc_415><loc_445><loc_435>Table 4: Results of structure with content retrieved using cell detection on PubTabNet. In all cases the input is PDF documents with cropped tables.</caption></otsl>
<page_footer><loc_241><loc_464><loc_245><loc_469>7</page_footer>
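For reference, the <otsl> cell streams changed in this diff use a small token set visible above: <ched> column header, <rhed> row header, <fcel> full cell, <lcel> continuation of a merged cell, <nl> row break. Below is a minimal sketch of how such a stream could be turned back into rows; the helper name otsl_to_rows and the generic string handling are assumptions for illustration, not docling's actual parsing code.

```python
import re

# Hypothetical helper (not docling's API): rebuild table rows from an OTSL
# cell stream such as the ones removed in this diff. Only the tokens visible
# above are handled; <loc_*>, <otsl> and <caption> markup is stripped first.
CELL_TOKEN = re.compile(r"<(ched|rhed|fcel|lcel|nl)>")

def otsl_to_rows(otsl: str) -> list[list[str]]:
    body = re.sub(r"<loc_\d+>|</?otsl>|<caption>.*?</caption>", "", otsl)
    parts = CELL_TOKEN.split(body)           # [text0, tok1, text1, tok2, ...]
    rows: list[list[str]] = []
    current: list[str] = []
    for token, text in zip(parts[1::2], parts[2::2]):
        if token == "nl":                     # row break
            rows.append(current)
            current = []
        else:                                 # ched / rhed / fcel / lcel cell
            current.append(text.strip())
    if current:
        rows.append(current)
    return rows

# Example on a shortened stream in the style of the removed Table 2 markup:
rows = otsl_to_rows("<otsl><loc_43><loc_258><ched>Model<ched>Simple<nl>"
                    "<rhed>EDD<fcel>91.1<nl></otsl>")
print(rows)  # [['Model', 'Simple'], ['EDD', '91.1']]
```

Cells covered by <lcel> (merge continuations) come out as empty strings, matching how the merged cells in the removed picture-embedded table carry no text.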
