@PeterStaar-IBM PeterStaar-IBM commented Sep 15, 2025

We are adding perf-tools to compare native docling parse with other parsing libraries.

taa@Munlochy docling-parse % uv run python perf/run_perf.py /Users/taa/Documents/projects/_data/cas_gt20 --parser pypdfium2
Parsing PDFs with pypdfium2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:09<00:00,  2.17it/s]

Summary for parser=pypdfium2
 - files:        20
 - pages total:  2270
 - pages ok:     2270
 - pages failed: 0
 - total sec:    9.071300
 - avg sec/page: 0.003996
 - p50: 0.002912  p90: 0.004833  p95: 0.007059  p99: 0.042817
 - min: 0.000027  max: 0.105380

Per-document statistics (sec/page):
document                                                                                                                     pages     total      mean    median       min       max       p90       p95       p99
-------------------------------------------------------------------------------------------------------------------------  -------  --------  --------  --------  --------  --------  --------  --------  --------
/Users/taa/Documents/projects/_data/cas_gt20/cd-bitmap-space-configuration-2.pdf                                                 3  0.023439  0.007813  0.008968  0.005376  0.009096  0.00907   0.009083  0.009093
/Users/taa/Documents/projects/_data/cas_gt20/concurrent-compatibility-and-code-cross-reference-ibm-storage-virtualize.pdf       41  0.369759  0.009019  0.006148  0.001466  0.059231  0.010384  0.013811  0.056612
/Users/taa/Documents/projects/_data/cas_gt20/configuring-call-home.pdf                                                           1  0.005256  0.005256  0.005256  0.005256  0.005256  0.005256  0.005256  0.005256
/Users/taa/Documents/projects/_data/cas_gt20/pip-environmental-requirements-3.pdf                                                5  0.035835  0.007167  0.007101  0.004275  0.010678  0.009393  0.010035  0.010549
/Users/taa/Documents/projects/_data/cas_gt20/pools-creating-data-reduction.pdf                                                   3  0.021791  0.007264  0.009914  0.001322  0.010555  0.010427  0.010491  0.010543
/Users/taa/Documents/projects/_data/cas_gt20/problem-procedure-understanding-system-status-from-leds.pdf                         6  0.030994  0.005166  0.005928  0.001246  0.006533  0.006292  0.006412  0.006509
/Users/taa/Documents/projects/_data/cas_gt20/redp5520.pdf                                                                       16  0.054211  0.003388  0.002783  3e-05     0.013255  0.005441  0.007547  0.012113
/Users/taa/Documents/projects/_data/cas_gt20/redp5534.pdf                                                                      118  0.395341  0.00335   0.00222   2.8e-05   0.10538   0.004422  0.004927  0.007827
/Users/taa/Documents/projects/_data/cas_gt20/redp5613.pdf                                                                      152  0.514253  0.003383  0.003137  2.9e-05   0.037295  0.004338  0.007336  0.01078
/Users/taa/Documents/projects/_data/cas_gt20/redp5617.pdf                                                                       72  0.205231  0.00285   0.002802  3.2e-05   0.008208  0.004025  0.004257  0.006277
/Users/taa/Documents/projects/_data/cas_gt20/redp5638.pdf                                                                       54  0.128385  0.002378  0.002439  2.8e-05   0.008159  0.003402  0.003866  0.006801
/Users/taa/Documents/projects/_data/cas_gt20/redp5669.pdf                                                                       74  0.208506  0.002818  0.002815  2.9e-05   0.008052  0.004042  0.004442  0.006295
/Users/taa/Documents/projects/_data/cas_gt20/redp5670.pdf                                                                       48  0.145144  0.003024  0.003097  2.7e-05   0.008027  0.003985  0.004731  0.006918
/Users/taa/Documents/projects/_data/cas_gt20/sg248468.pdf                                                                      452  1.06892   0.002365  0.002271  2.8e-05   0.007948  0.003902  0.004371  0.005224
/Users/taa/Documents/projects/_data/cas_gt20/sg248497.pdf                                                                      364  1.0194    0.002801  0.002781  2.7e-05   0.011013  0.004282  0.004863  0.005895
/Users/taa/Documents/projects/_data/cas_gt20/sg248521.pdf                                                                      784  2.38926   0.003048  0.003148  2.7e-05   0.008834  0.004429  0.004843  0.005684
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_6172647.pdf                                                      4  0.082807  0.020702  0.019703  0.001479  0.041922  0.036375  0.039149  0.041367
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_6380846.pdf                                                      3  0.071179  0.023726  0.013885  0.007232  0.050061  0.042826  0.046444  0.049338
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_690527.pdf                                                       6  0.094528  0.015755  0.015209  0.003775  0.028282  0.025911  0.027096  0.028045
/Users/taa/Documents/projects/_data/cas_gt20/system-storage-ts3500-tape-library-model-d23.pdf                                   64  2.20705   0.034485  0.02846   0.002481  0.079582  0.062871  0.070184  0.077868

Wrote: perf/results/perf_pypdfium2_20250915-163339.csv
Total wall time: 9.215333 sec
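For readers who want to reproduce the columns in the summary and per-document tables above, here is a minimal sketch of how per-page timings could be collected and summarized. It is not the code in `perf/run_perf.py`: the helper names `time_pages`, `percentile`, and `summarize` are hypothetical, and the linear-interpolation percentile is an assumption (it is, however, consistent with the p90/p95/p99 values reported for the 3-page document).

```python
import time
from statistics import mean, median


def percentile(sorted_vals, q):
    """Percentile q (0-100) of a sorted, non-empty list, using linear
    interpolation between neighbouring order statistics (an assumption)."""
    if not sorted_vals:
        raise ValueError("no samples")
    rank = (len(sorted_vals) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (rank - lo) * (sorted_vals[hi] - sorted_vals[lo])


def time_pages(parse_page, n_pages):
    """Call parse_page(i) for every page index and return the seconds each call took.
    parse_page is a hypothetical callable wrapping whichever parser is being measured."""
    timings = []
    for i in range(n_pages):
        start = time.perf_counter()
        parse_page(i)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce per-page seconds to the columns shown in the per-document table."""
    vals = sorted(timings)
    return {
        "pages": len(vals),
        "total": sum(vals),
        "mean": mean(vals),
        "median": median(vals),
        "min": vals[0],
        "max": vals[-1],
        "p90": percentile(vals, 90),
        "p95": percentile(vals, 95),
        "p99": percentile(vals, 99),
    }


# The 3-page document in the pypdfium2 table has min/median/max of
# 0.005376 / 0.008968 / 0.009096, which for 3 pages are the three timings.
stats = summarize([0.005376, 0.008968, 0.009096])
for key, value in stats.items():
    print(f"{key:>6}: {value}")

# Timing a dummy per-page function, just to show the harness shape.
dummy = time_pages(lambda i: sum(range(1000)), n_pages=5)
print(summarize(dummy)["pages"])
```

Rounded to six decimals, the sketch gives p90 = 0.00907, p95 = 0.009083, and p99 = 0.009093 for that document, which agrees with the row for `cd-bitmap-space-configuration-2.pdf`.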
taa@Munlochy docling-parse % uv run python perf/run_perf.py /Users/taa/Documents/projects/_data/cas_gt20 --parser docling

Summary for parser=docling
 - files:        20
 - pages total:  2270
 - pages ok:     2270
 - pages failed: 0
 - total sec:    55.937700
 - avg sec/page: 0.024642
 - p50: 0.018701  p90: 0.047992  p95: 0.057145  p99: 0.138790
 - min: 0.000082  max: 0.695331

Per-document statistics (sec/page):
document                                                                                                                     pages      total      mean    median       min       max       p90       p95       p99
-------------------------------------------------------------------------------------------------------------------------  -------  ---------  --------  --------  --------  --------  --------  --------  --------
/Users/taa/Documents/projects/_data/cas_gt20/cd-bitmap-space-configuration-2.pdf                                                 3   0.066386  0.022129  0.017372  0.016411  0.032603  0.029557  0.03108   0.032299
/Users/taa/Documents/projects/_data/cas_gt20/concurrent-compatibility-and-code-cross-reference-ibm-storage-virtualize.pdf       41   0.326398  0.007961  0.006582  0.002146  0.030672  0.011083  0.01489   0.028812
/Users/taa/Documents/projects/_data/cas_gt20/configuring-call-home.pdf                                                           1   0.019819  0.019819  0.019819  0.019819  0.019819  0.019819  0.019819  0.019819
/Users/taa/Documents/projects/_data/cas_gt20/pip-environmental-requirements-3.pdf                                                5   0.103324  0.020665  0.023853  0.007514  0.032654  0.029866  0.03126   0.032375
/Users/taa/Documents/projects/_data/cas_gt20/pools-creating-data-reduction.pdf                                                   3   0.091617  0.030539  0.029676  0.00262   0.059322  0.053393  0.056357  0.058729
/Users/taa/Documents/projects/_data/cas_gt20/problem-procedure-understanding-system-status-from-leds.pdf                         6   0.067737  0.011289  0.011884  0.002055  0.016188  0.015311  0.01575   0.016101
/Users/taa/Documents/projects/_data/cas_gt20/redp5520.pdf                                                                       16   0.314556  0.01966   0.016715  8.2e-05   0.087908  0.030249  0.044779  0.079282
/Users/taa/Documents/projects/_data/cas_gt20/redp5534.pdf                                                                      118   2.34795   0.019898  0.012499  0.000118  0.138592  0.047552  0.059891  0.11705
/Users/taa/Documents/projects/_data/cas_gt20/redp5613.pdf                                                                      152   4.62904   0.030454  0.017637  0.000134  0.695331  0.052979  0.105751  0.148997
/Users/taa/Documents/projects/_data/cas_gt20/redp5617.pdf                                                                       72   1.65364   0.022967  0.022107  0.000114  0.087608  0.035781  0.049695  0.064743
/Users/taa/Documents/projects/_data/cas_gt20/redp5638.pdf                                                                       54   0.763623  0.014141  0.012112  8.3e-05   0.08701   0.022073  0.026895  0.083627
/Users/taa/Documents/projects/_data/cas_gt20/redp5669.pdf                                                                       74   1.7683    0.023896  0.022534  0.000102  0.087034  0.041909  0.046114  0.064978
/Users/taa/Documents/projects/_data/cas_gt20/redp5670.pdf                                                                       48   1.31494   0.027395  0.025007  0.000106  0.086364  0.046853  0.051582  0.071403
/Users/taa/Documents/projects/_data/cas_gt20/sg248468.pdf                                                                      452   9.09468   0.020121  0.012579  0.000187  0.169903  0.048343  0.060521  0.086648
/Users/taa/Documents/projects/_data/cas_gt20/sg248497.pdf                                                                      364   9.25975   0.025439  0.020284  0.000221  0.168398  0.048626  0.055955  0.137817
/Users/taa/Documents/projects/_data/cas_gt20/sg248521.pdf                                                                      784  22.6357    0.028872  0.023877  0.000341  0.178095  0.050237  0.057951  0.166556
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_6172647.pdf                                                      4   0.061035  0.015259  0.016006  0.00249   0.026534  0.024042  0.025288  0.026284
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_6380846.pdf                                                      3   0.050379  0.016793  0.013387  0.009066  0.027926  0.025018  0.026472  0.027635
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_690527.pdf                                                       6   0.080042  0.01334   0.014352  0.004538  0.021341  0.019659  0.0205    0.021173
/Users/taa/Documents/projects/_data/cas_gt20/system-storage-ts3500-tape-library-model-d23.pdf                                   64   1.2888    0.020137  0.016832  0.00263   0.042742  0.033855  0.038352  0.041713

Wrote: perf/results/perf_docling_20250915-163355.csv
Total wall time: 56.524318 sec

With some optimizations in the post-processing, we cut the total time for docling from 56 sec to 20 sec:

taa@Munlochy docling-parse % uv run python perf/run_perf.py /Users/taa/Documents/projects/_data/cas_gt20/ --parser docling

Summary for parser=docling
 - files:        20
 - pages total:  2270
 - pages ok:     2270
 - pages failed: 0
 - total sec:    19.795889
 - avg sec/page: 0.008721
 - p50: 0.007942  p90: 0.012235  p95: 0.013742  p99: 0.022533
 - min: 0.000292  max: 0.599250

Per-document statistics (sec/page):
document                                                                                                                     pages     total      mean    median       min       max       p90       p95       p99
-------------------------------------------------------------------------------------------------------------------------  -------  --------  --------  --------  --------  --------  --------  --------  --------
/Users/taa/Documents/projects/_data/cas_gt20/cd-bitmap-space-configuration-2.pdf                                                 3  0.020358  0.006786  0.006948  0.005593  0.007817  0.007644  0.00773   0.0078
/Users/taa/Documents/projects/_data/cas_gt20/concurrent-compatibility-and-code-cross-reference-ibm-storage-virtualize.pdf       41  0.265393  0.006473  0.006063  0.002323  0.016817  0.008584  0.009183  0.016245
/Users/taa/Documents/projects/_data/cas_gt20/configuring-call-home.pdf                                                           1  0.007035  0.007035  0.007035  0.007035  0.007035  0.007035  0.007035  0.007035
/Users/taa/Documents/projects/_data/cas_gt20/pip-environmental-requirements-3.pdf                                                5  0.032129  0.006426  0.006521  0.004478  0.007909  0.007836  0.007873  0.007902
/Users/taa/Documents/projects/_data/cas_gt20/pools-creating-data-reduction.pdf                                                   3  0.019728  0.006576  0.007729  0.00274   0.009259  0.008953  0.009106  0.009228
/Users/taa/Documents/projects/_data/cas_gt20/problem-procedure-understanding-system-status-from-leds.pdf                         6  0.02744   0.004573  0.004795  0.002087  0.005677  0.005631  0.005654  0.005672
/Users/taa/Documents/projects/_data/cas_gt20/redp5520.pdf                                                                       16  0.130631  0.008164  0.008022  0.000303  0.019775  0.015141  0.016508  0.019122
/Users/taa/Documents/projects/_data/cas_gt20/redp5534.pdf                                                                      118  0.808388  0.006851  0.006126  0.000363  0.020969  0.012235  0.015409  0.01887
/Users/taa/Documents/projects/_data/cas_gt20/redp5613.pdf                                                                      152  2.49114   0.016389  0.007895  0.000361  0.59925   0.012232  0.046459  0.144804
/Users/taa/Documents/projects/_data/cas_gt20/redp5617.pdf                                                                       72  0.585817  0.008136  0.007988  0.000365  0.02174   0.011132  0.012277  0.018216
/Users/taa/Documents/projects/_data/cas_gt20/redp5638.pdf                                                                       54  0.333605  0.006178  0.006279  0.000292  0.019145  0.008869  0.015019  0.017126
/Users/taa/Documents/projects/_data/cas_gt20/redp5669.pdf                                                                       74  0.602868  0.008147  0.007971  0.000347  0.019501  0.010964  0.012345  0.017658
/Users/taa/Documents/projects/_data/cas_gt20/redp5670.pdf                                                                       48  0.425696  0.008869  0.009087  0.000331  0.018868  0.01204   0.015151  0.017346
/Users/taa/Documents/projects/_data/cas_gt20/sg248468.pdf                                                                      452  3.03613   0.006717  0.006153  0.000459  0.038482  0.01117   0.01259   0.018323
/Users/taa/Documents/projects/_data/cas_gt20/sg248497.pdf                                                                      364  3.09135   0.008493  0.007899  0.000507  0.033431  0.012508  0.015614  0.023992
/Users/taa/Documents/projects/_data/cas_gt20/sg248521.pdf                                                                      784  7.15395   0.009125  0.009189  0.000588  0.047945  0.012466  0.013484  0.021405
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_6172647.pdf                                                      4  0.04104   0.01026   0.011187  0.003159  0.015506  0.014484  0.014995  0.015404
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_6380846.pdf                                                      3  0.028673  0.009558  0.008706  0.005812  0.014156  0.013066  0.013611  0.014047
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_690527.pdf                                                       6  0.04509   0.007515  0.007635  0.004257  0.010149  0.009575  0.009862  0.010092
/Users/taa/Documents/projects/_data/cas_gt20/system-storage-ts3500-tape-library-model-d23.pdf                                   64  0.649441  0.010148  0.009776  0.002652  0.015244  0.013474  0.014139  0.01509
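As a quick check of the improvement, the arithmetic below uses only the `total sec` and `pages total` values from the two docling summaries above. The variable names are illustrative.

```python
# Values copied from the two "Summary for parser=docling" blocks.
pages = 2270
before_total = 55.937700  # seconds, before the post-processing optimizations
after_total = 19.795889   # seconds, after the optimizations

speedup = before_total / after_total
saved_pct = 100.0 * (1.0 - after_total / before_total)
before_ms_per_page = 1000.0 * before_total / pages
after_ms_per_page = 1000.0 * after_total / pages

print(f"speedup:     {speedup:.2f}x")
print(f"time saved:  {saved_pct:.1f}%")
print(f"ms per page: {before_ms_per_page:.2f} -> {after_ms_per_page:.2f}")
```

This works out to roughly a 2.8x speedup, or about 65% less time, and the per-page figures match the reported `avg sec/page` values of 0.024642 and 0.008721.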

With pdfplumber:

taa@Munlochy docling-parse % uv run python perf/run_perf.py /Users/taa/Documents/projects/_data/cas_gt20/ --parser pdfplumber

Summary for parser=pdfplumber
 - files:        20
 - pages total:  2270
 - pages ok:     2270
 - pages failed: 0
 - total sec:    62.418775
 - avg sec/page: 0.027497
 - p50: 0.021881  p90: 0.040325  p95: 0.056019  p99: 0.151062
 - min: 0.000015  max: 2.254295

Per-document statistics (sec/page):
document                                                                                                                     pages      total      mean    median       min       max       p90       p95       p99
-------------------------------------------------------------------------------------------------------------------------  -------  ---------  --------  --------  --------  --------  --------  --------  --------
/Users/taa/Documents/projects/_data/cas_gt20/cd-bitmap-space-configuration-2.pdf                                                 3   0.100635  0.033545  0.031535  0.029609  0.03949   0.037899  0.038694  0.039331
/Users/taa/Documents/projects/_data/cas_gt20/concurrent-compatibility-and-code-cross-reference-ibm-storage-virtualize.pdf       41   1.13814   0.02776   0.025909  0.005255  0.059836  0.037877  0.042827  0.059378
/Users/taa/Documents/projects/_data/cas_gt20/configuring-call-home.pdf                                                           1   0.030102  0.030102  0.030102  0.030102  0.030102  0.030102  0.030102  0.030102
/Users/taa/Documents/projects/_data/cas_gt20/pip-environmental-requirements-3.pdf                                                5   0.199461  0.039892  0.037457  0.018594  0.071347  0.058667  0.065007  0.070079
/Users/taa/Documents/projects/_data/cas_gt20/pools-creating-data-reduction.pdf                                                   3   0.103717  0.034572  0.048773  0.005703  0.049241  0.049147  0.049194  0.049232
/Users/taa/Documents/projects/_data/cas_gt20/problem-procedure-understanding-system-status-from-leds.pdf                         6   0.127908  0.021318  0.022745  0.005306  0.028139  0.027702  0.027921  0.028095
/Users/taa/Documents/projects/_data/cas_gt20/redp5520.pdf                                                                       16   0.354731  0.022171  0.019615  2.5e-05   0.050667  0.045086  0.047448  0.050023
/Users/taa/Documents/projects/_data/cas_gt20/redp5534.pdf                                                                      118   2.30042   0.019495  0.016468  2.2e-05   0.074085  0.046614  0.051927  0.065132
/Users/taa/Documents/projects/_data/cas_gt20/redp5613.pdf                                                                      152   8.72473   0.0574    0.021635  2.1e-05   2.25429   0.053616  0.199148  0.568759
/Users/taa/Documents/projects/_data/cas_gt20/redp5617.pdf                                                                       72   1.88302   0.026153  0.022836  2.5e-05   0.226009  0.03814   0.046467  0.106629
/Users/taa/Documents/projects/_data/cas_gt20/redp5638.pdf                                                                       54   1.04583   0.019367  0.016451  2.2e-05   0.131393  0.038497  0.044431  0.086597
/Users/taa/Documents/projects/_data/cas_gt20/redp5669.pdf                                                                       74   1.83392   0.024783  0.022822  2.5e-05   0.080191  0.04097   0.051279  0.077794
/Users/taa/Documents/projects/_data/cas_gt20/redp5670.pdf                                                                       48   1.32962   0.0277    0.025943  3.2e-05   0.129767  0.044269  0.04689   0.091079
/Users/taa/Documents/projects/_data/cas_gt20/sg248468.pdf                                                                      452   9.01397   0.019942  0.016959  1.6e-05   0.174405  0.035862  0.041936  0.118115
/Users/taa/Documents/projects/_data/cas_gt20/sg248497.pdf                                                                      364  10.0202    0.027528  0.022053  1.5e-05   0.546701  0.042372  0.065037  0.148075
/Users/taa/Documents/projects/_data/cas_gt20/sg248521.pdf                                                                      784  20.8973    0.026655  0.023247  1.6e-05   0.623846  0.034225  0.04102   0.160503
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_6172647.pdf                                                      4   0.153167  0.038292  0.042402  0.005951  0.062411  0.056923  0.059667  0.061862
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_6380846.pdf                                                      3   0.118929  0.039643  0.036796  0.022411  0.059721  0.055136  0.057429  0.059263
/Users/taa/Documents/projects/_data/cas_gt20/support_pages_node_690527.pdf                                                       6   0.180209  0.030035  0.032347  0.014005  0.0393    0.038431  0.038865  0.039213
/Users/taa/Documents/projects/_data/cas_gt20/system-storage-ts3500-tape-library-model-d23.pdf                                   64   2.86276   0.044731  0.042375  0.008636  0.069816  0.061402  0.065401  0.069189
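To put the four runs side by side, the sketch below derives throughput and the ratio to pypdfium2 from the `total sec` values reported in the summaries above. The labels "docling (before)" and "docling (optimized)" are mine, used only to tell the two docling runs apart.

```python
# Totals (seconds) copied from the four summaries; all runs cover the same 2270 pages.
pages = 2270
totals = {
    "pypdfium2": 9.071300,
    "docling (before)": 55.937700,
    "docling (optimized)": 19.795889,
    "pdfplumber": 62.418775,
}

baseline = totals["pypdfium2"]
rows = sorted(totals.items(), key=lambda kv: kv[1])  # fastest first

for name, total in rows:
    print(f"{name:<20} {pages / total:7.1f} pages/s  {total / baseline:5.2f}x pypdfium2 time")

ranking = [name for name, _ in rows]
relative = {name: total / baseline for name, total in totals.items()}
```

On these numbers the optimized docling run takes about 2.2x the pypdfium2 time, compared with about 6.2x before the optimizations and about 6.9x for pdfplumber. The ratios compare elapsed time only; they say nothing about what each parser extracts.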

@PeterStaar-IBM PeterStaar-IBM self-assigned this Sep 15, 2025
@github-actions

DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉

mergify bot commented Sep 15, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Signed-off-by: Peter Staar <[email protected]>
dolfim-ibm previously approved these changes Sep 15, 2025

@dolfim-ibm dolfim-ibm left a comment

nice!

Signed-off-by: Peter Staar <[email protected]>
@PeterStaar-IBM PeterStaar-IBM marked this pull request as ready for review September 16, 2025 12:21
@PeterStaar-IBM PeterStaar-IBM merged commit f8d53ee into main Sep 16, 2025
39 checks passed
@PeterStaar-IBM PeterStaar-IBM deleted the dev/add-perf-tools branch September 16, 2025 14:53