Commit 961c8d5
feat: use block matrix to reduce peak memory usage for matmul (#3947)
This PR targets the most memory-expensive operation in partitioning PDFs
and images: deduplicating pdfminer elements. On large pages the number of
elements can exceed 10k, which generates multiple 10k x 10k square
double-precision float matrices during deduplication and pushes peak
memory usage close to 13 GB.
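
To make the scale concrete, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate only: the helper name `matrix_size_gb` is hypothetical, and the number of intermediate matrices alive at once is not stated in this PR.

```python
def matrix_size_gb(rows: int, cols: int, itemsize: int = 8) -> float:
    """Size in GB (10**9 bytes) of a dense rows x cols matrix of float64 values."""
    return rows * cols * itemsize / 1e9


# One full 10k x 10k float64 intermediate.
full = matrix_size_gb(10_000, 10_000)

# One 2000 x 10k block, the shape used when only 2000 rows are processed at a time.
block = matrix_size_gb(2_000, 10_000)

print(f"full matrix: {full:.2f} GB, block: {block:.2f} GB")
```

Each full intermediate costs about 0.8 GB, so a handful of them held at the same time (coordinates, intersections, areas, ratios) plausibly adds up to the reported peak.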

This PR breaks this computation down by computing partial IOU matrices.
More precisely, it computes IOU for a block of 2000 elements at a time
against all the elements, which reduces peak memory usage by roughly 8x,
to around 1.6 GB.
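
The blocking idea can be illustrated with a minimal NumPy sketch. It is not the code from this PR: the function names, the `(x1, y1, x2, y2)` box layout, the 0.5 threshold, and the "overlaps an earlier box" duplicate rule are all assumptions chosen for illustration.

```python
import numpy as np


def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """IOU of every box in boxes_a against every box in boxes_b.

    Boxes are rows of (x1, y1, x2, y2); the result has shape (len(a), len(b)).
    """
    a = np.asarray(boxes_a, dtype=np.float64)
    b = np.asarray(boxes_b, dtype=np.float64)
    ax1, ay1, ax2, ay2 = (a[:, i][:, None] for i in range(4))
    bx1, by1, bx2, by2 = (b[:, i][None, :] for i in range(4))
    inter_w = np.clip(np.minimum(ax2, bx2) - np.maximum(ax1, bx1), 0, None)
    inter_h = np.clip(np.minimum(ay2, by2) - np.maximum(ay1, by1), 0, None)
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Guard against division by zero for degenerate (zero-area) boxes.
    return inter / np.maximum(union, np.finfo(np.float64).eps)


def duplicate_mask_blockwise(
    boxes: np.ndarray, threshold: float = 0.5, block_size: int = 2000
) -> np.ndarray:
    """Flag boxes that overlap an earlier box, one block of rows at a time.

    Only a (block_size x n) IOU matrix exists at any moment, never (n x n).
    """
    n = len(boxes)
    is_duplicate = np.zeros(n, dtype=bool)
    cols = np.arange(n)[None, :]
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        iou = pairwise_iou(boxes[start:stop], boxes)
        rows = np.arange(start, stop)[:, None]
        # Compare each box only against boxes that come before it.
        is_duplicate[start:stop] = ((iou > threshold) & (cols < rows)).any(axis=1)
    return is_duplicate
```

Because each block is reduced to a per-row result before the next block is computed, the output is identical for any `block_size`; only the peak size of the intermediate matrices changes.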

The block size is configurable to match the user's peak memory budget; it
is controlled through the environment variable `UNST_MATMUL_MEMORY_CAP_IN_GB`.

1 parent 19373de commit 961c8d5
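
One way a memory cap such as `UNST_MATMUL_MEMORY_CAP_IN_GB` could translate into a block size is sketched below. The formula, the helper names, the default of 1.0 GB, and the `n_intermediates` parameter are assumptions for illustration; the commit message does not state how the library derives the block size.

```python
import os


def read_memory_cap_gb(default_gb: float = 1.0) -> float:
    """Read the cap from the environment; default_gb is an arbitrary placeholder."""
    return float(os.environ.get("UNST_MATMUL_MEMORY_CAP_IN_GB", default_gb))


def block_rows_for_cap(
    n_cols: int, cap_gb: float, n_intermediates: int = 1, itemsize: int = 8
) -> int:
    """Largest number of rows whose float64 intermediates fit under the cap.

    n_intermediates is a guess at how many (rows x n_cols) matrices are alive
    at the same time during the computation.
    """
    budget_bytes = cap_gb * 1e9
    bytes_per_row = n_cols * itemsize * n_intermediates
    return max(1, int(budget_bytes // bytes_per_row))


if __name__ == "__main__":
    cap = read_memory_cap_gb()
    print(block_rows_for_cap(n_cols=10_000, cap_gb=cap))
```

Setting the variable before launching the process, for example `UNST_MATMUL_MEMORY_CAP_IN_GB=2`, would then trade a higher peak for fewer, larger blocks.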
File tree
3 files changed, +14 -6 lines changed
- unstructured
- partition/pdf_image
Diff hunks (line content not preserved; file names not shown):
- File 1: line 1 modified; one line added at new line 9.
- File 2: line 1 modified.
- File 3: one line added at new line 3; original lines 711-714 replaced by new lines 712-721.