Commit 4e424ef
authored
feat: use lxml instead of bs4 to parse hOCR data (#3960)
- `lxml` is a much faster library than `bs4` when the input data is
regular
- since the hOCR data is guaranteed to be regular (programmatically
generated) we don't need `bs4` here to parse the data
- `lxml` improves parsing speed by about 10x
Example runtime profiling locally using the same `hocr` data from 1 page
pdf, where `agent.hocr_to_dataframe_bs4` is the current method on main
and `agent.hocr_to_dataframe` is the PR's method.
1 parent 66bf4b0 commit 4e424ef
File tree
4 files changed
+46
-37
lines changed- test_unstructured/partition/pdf_image
- unstructured
- partition/utils/ocr_models
4 files changed
+46
-37
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
| 1 | + | |
2 | 2 | | |
3 | 3 | | |
4 | 4 | | |
5 | 5 | | |
6 | | - | |
| 6 | + | |
| 7 | + | |
7 | 8 | | |
8 | 9 | | |
9 | 10 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
6 | 6 | | |
7 | 7 | | |
8 | 8 | | |
9 | | - | |
| 9 | + | |
10 | 10 | | |
11 | 11 | | |
12 | 12 | | |
| |||
536 | 536 | | |
537 | 537 | | |
538 | 538 | | |
539 | | - | |
540 | | - | |
541 | | - | |
542 | | - | |
543 | | - | |
544 | | - | |
545 | | - | |
546 | | - | |
| 539 | + | |
| 540 | + | |
| 541 | + | |
| 542 | + | |
| 543 | + | |
| 544 | + | |
| 545 | + | |
| 546 | + | |
| 547 | + | |
| 548 | + | |
547 | 549 | | |
548 | | - | |
549 | | - | |
550 | | - | |
551 | | - | |
552 | | - | |
553 | | - | |
554 | | - | |
555 | | - | |
| 550 | + | |
| 551 | + | |
| 552 | + | |
| 553 | + | |
| 554 | + | |
| 555 | + | |
| 556 | + | |
556 | 557 | | |
557 | 558 | | |
558 | 559 | | |
| |||
565 | 566 | | |
566 | 567 | | |
567 | 568 | | |
568 | | - | |
| 569 | + | |
| 570 | + | |
569 | 571 | | |
570 | | - | |
| 572 | + | |
571 | 573 | | |
572 | 574 | | |
573 | | - | |
| 575 | + | |
574 | 576 | | |
575 | 577 | | |
576 | | - | |
| 578 | + | |
577 | 579 | | |
578 | 580 | | |
579 | | - | |
| 581 | + | |
580 | 582 | | |
581 | 583 | | |
582 | 584 | | |
| |||
590 | 592 | | |
591 | 593 | | |
592 | 594 | | |
593 | | - | |
594 | | - | |
| 595 | + | |
| 596 | + | |
| 597 | + | |
595 | 598 | | |
596 | 599 | | |
597 | 600 | | |
| |||
608 | 611 | | |
609 | 612 | | |
610 | 613 | | |
611 | | - | |
| 614 | + | |
612 | 615 | | |
613 | 616 | | |
614 | 617 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
| 1 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
8 | 8 | | |
9 | 9 | | |
10 | 10 | | |
11 | | - | |
| 11 | + | |
12 | 12 | | |
13 | 13 | | |
14 | 14 | | |
| |||
34 | 34 | | |
35 | 35 | | |
36 | 36 | | |
| 37 | + | |
| 38 | + | |
37 | 39 | | |
38 | 40 | | |
39 | 41 | | |
| |||
106 | 108 | | |
107 | 109 | | |
108 | 110 | | |
109 | | - | |
110 | | - | |
111 | 111 | | |
112 | 112 | | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
113 | 120 | | |
114 | 121 | | |
115 | 122 | | |
116 | 123 | | |
117 | | - | |
118 | | - | |
119 | | - | |
120 | 124 | | |
121 | 125 | | |
122 | 126 | | |
| |||
140 | 144 | | |
141 | 145 | | |
142 | 146 | | |
143 | | - | |
144 | | - | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
145 | 150 | | |
146 | 151 | | |
147 | | - | |
| 152 | + | |
148 | 153 | | |
149 | 154 | | |
150 | 155 | | |
| |||
0 commit comments