Open
Description
Words on a line are being combined into a single word box by pdf_sanitator<PAGE_CELLS>::create_word_cells.
The gap between the end of one word and the start of the next falls below the merge threshold set by space_width_factor_for_merge, so the resulting word box contains multiple words.
Should a space character force a word break regardless of the distance between character boxes? That is, rather than removing all space characters (pdf_sanitator<PAGE_CELLS>::create_word_cells, https://github.com/docling-project/docling-parse/blob/main/src/v2/pdf_sanitators/cells.h, line 136), should they not be used as markers that prevent merging?
    // remove all spaces
    auto itr = word_cells.begin();
    while(itr!=word_cells.end())
      {
        if(utils::string::is_space(itr->text))
          {
            itr = word_cells.erase(itr);
          }
        else
          {
            itr++;
          }
      }
As a workaround, I currently post-process the word boxes: I check whether each one overlaps a ' ' character box and, if it does, split the word box at that point.
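To make the suggestion concrete, here is a minimal sketch of the behaviour I have in mind. It does not use the real docling-parse types: `Cell`, `is_space`, `merge_into_words` and the absolute `max_gap` threshold are all hypothetical simplifications, and `max_gap` only stands in for whatever distance check space_width_factor_for_merge controls in the actual code. The point is just that a space cell closes the current word before any distance test is applied.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, simplified cell; NOT the real docling-parse cell type.
struct Cell {
    double x0, x1;     // left and right edges of the box on the line
    std::string text;  // text carried by the cell
};

// True if the text is non-empty and consists only of whitespace.
bool is_space(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) {
        if (!std::isspace(c)) return false;
    }
    return true;
}

// Merge character cells (assumed to be in reading order on a single line)
// into word cells. A space cell always closes the current word, however
// small the gap is; otherwise a cell is appended to the current word while
// the horizontal gap stays below max_gap (an assumed absolute threshold).
std::vector<Cell> merge_into_words(const std::vector<Cell>& chars, double max_gap) {
    std::vector<Cell> words;
    bool open = false;  // is there a word that may still be extended?
    for (const Cell& c : chars) {
        if (is_space(c.text)) {
            open = false;  // the space acts as a barrier and is then dropped
            continue;
        }
        if (open && c.x0 - words.back().x1 < max_gap) {
            words.back().x1 = c.x1;
            words.back().text += c.text;
        } else {
            words.push_back(c);
            open = true;
        }
    }
    return words;
}
```

With this ordering the space cells are still absent from the output, as they are today, but two words separated by a very narrow space no longer end up in the same box.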