Handling poor formula extraction performance #1254
agodinezmm2007 started this conversation in Ideas · 0 comments
Hello,
I have been testing the formula extraction feature extensively and noticed poor performance: long processing times, repeated/spammy formula output, and runaway token generation. Initially I thought that adjusting the batch size would make processing faster. The default setting for:
docling.models.code_formula_model.CodeFormulaModel.elements_batch_size
is 5. When I leave it at the default, it uses approximately 18-20 GB of VRAM. I tested setting it to 7, and it used a bit more (see attached image).
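For context, this is a minimal sketch of the setup I am testing with, enabling formula enrichment in the PDF pipeline and overriding the batch-size attribute quoted above ("article.pdf" is a placeholder, and reassigning the class attribute is just the quickest way I know to change the value):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.models.code_formula_model import CodeFormulaModel

CodeFormulaModel.elements_batch_size = 7  # default is 5

pipeline_options = PdfPipelineOptions()
pipeline_options.do_formula_enrichment = True  # turn on the formula model

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("article.pdf")
```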
Raising the batch size, however, does not increase the speed. In my testing, I noticed that the settings in the file located here:
"\wsl.localhost\Ubuntu\home\wstation\miniconda3\envs\newenv\lib\python3.12\site-packages\docling_ibm_models\code_formula_model\code_formula_predictor.py"
lead the formula recognizer to produce extremely long formulas of random characters; in some cases the output reaches or exceeds the 4096-token limit, which is why processing the formulas takes so long. I first tried to adjust the prompt, but that led to more problems. So what I did instead was adjust the _predict function:
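Roughly, the idea is to constrain the generate() call that _predict ultimately makes through transformers. The helper below is a sketch; the function name, argument names, and parameter values are illustrative, not my exact edit:

```python
import torch

def generate_formula_ids(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Sketch of a constrained generate() call for the formula model.
    Parameter values are illustrative, not the exact ones I ended up with."""
    return model.generate(
        input_ids,
        max_new_tokens=512,       # stop runaway outputs well before 4096 tokens
        num_beams=3,              # beam search instead of plain greedy decoding
        no_repeat_ngram_size=4,   # suppress repeated/spammy LaTeX fragments
        repetition_penalty=1.2,
        early_stopping=True,
    )
```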
Additionally, I had to adjust several .float() dtype casts in this file, since autocast expects bfloat16: "\wsl.localhost\Ubuntu\home\wstation\miniconda3\envs\newenv\lib\python3.12\site-packages\transformers\generation\utils.py"
The changes are in def _sample around lines 3300~3305, in def _beam_search around lines 3772~3779, and in def _beam_search between lines 4069~4100.
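As a sketch of the kind of cast change (the exact lines differ between transformers versions, so treat this as an illustration, not a patch):

```python
import torch

def to_scores(next_token_logits: torch.Tensor) -> torch.Tensor:
    """Illustration only: keep the logits in bfloat16 instead of upcasting
    them with .float(), so they match the surrounding bfloat16 autocast."""
    # before: torch.nn.functional.log_softmax(next_token_logits.float(), dim=-1)
    return torch.nn.functional.log_softmax(
        next_token_logits.to(torch.bfloat16), dim=-1
    )
```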
With beam search enabled, however, more VRAM is used, and I kept getting out-of-memory errors even though I was running on a 48 GB GPU. So I then lowered the batch size to 2:
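A minimal sketch of that override, assuming the class attribute quoted earlier can simply be reassigned before conversion:

```python
from docling.models.code_formula_model import CodeFormulaModel

# same converter setup as the sketch further up, just with a smaller batch
CodeFormulaModel.elements_batch_size = 2  # default is 5
```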
The end result is that processing went from 5 minutes down to about 63 seconds for a 15-page PDF article containing 16 equations. GPU usage with beam search enabled and the batch size set to 2 is approximately 40 GB:
Despite the large speedup, the model still struggles to extract formulas correctly. For example, here are all 16 of the LaTeX equations from the PDF:
But the extracted equations are quite messy:
Nevertheless, I believe that simple "iterative refinement" or post-processing should bring the quality up. Below is the kind of Python post-processing I am working on:
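Roughly, the cleanup I have in mind looks like this (helper names and heuristics are placeholders, not the final file):

```python
import re

def collapse_repeats(latex: str, max_repeat: int = 3) -> str:
    """Collapse runs of one short token repeated many times in a row,
    the runaway/spammy failure mode described above."""
    pattern = r"(\\?[A-Za-z0-9]+[\s{}^_]*)\1{%d,}" % max_repeat
    return re.sub(pattern, lambda m: m.group(1) * max_repeat, latex)

def balance_braces(latex: str) -> str:
    """Drop unmatched closing braces and append any missing closing braces."""
    depth, out = 0, []
    for ch in latex:
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                continue  # skip an unmatched closing brace
            depth -= 1
        out.append(ch)
    return "".join(out) + "}" * depth

def clean_formula(latex: str) -> str:
    """Apply the cleanup passes in order; more passes can be chained here."""
    return balance_braces(collapse_repeats(latex.strip()))

if __name__ == "__main__":
    messy = r"E = m c^{2 \alpha \alpha \alpha \alpha \alpha \alpha \alpha"
    print(clean_formula(messy))
```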
But anyway, expect it to use anywhere between 20-40 GB of VRAM to get it working properly, or try to offload some of the work to the CPU. Or perhaps offload beam search to the CPU and do the formula extraction on the GPU? Maybe enable the use of multiple GPUs?