Word cell sometimes contains multiple words #186

@jaleigh

I am seeing several words on a line being merged into a single word box by pdf_sanitator<PAGE_CELLS>::create_word_cells.

The gap between the end of one word and the start of the next is below the space_width_factor_for_merge threshold, so the resulting word box contains multiple words.

Should a space character always trigger a word break, regardless of the distance between character boxes? That is, rather than removing all space characters in pdf_sanitator<PAGE_CELLS>::create_word_cells (https://github.com/docling-project/docling-parse/blob/main/src/v2/pdf_sanitators/cells.h, line 136), should they not be used as markers that prevent merging?

    // remove all spaces
    auto itr = word_cells.begin();
    while(itr!=word_cells.end())
      {
        if(utils::string::is_space(itr->text))
          {
            itr = word_cells.erase(itr);
          }
        else
          {
            itr++;
          }
      }
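To make the suggestion concrete, here is a minimal sketch of what I mean by using the space as a marker. It is not the docling-parse implementation: `Cell`, `is_space`, `merge_with_space_barrier` and the absolute `max_gap` threshold are all hypothetical simplifications (horizontal, left-to-right text only), and `max_gap` merely stands in for whatever distance the library derives from space_width_factor_for_merge.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified cell: text plus horizontal extent only.
struct Cell {
  std::string text;
  double x0;  // left edge
  double x1;  // right edge
};

// Stand-in for utils::string::is_space (assumption: non-empty, blanks only).
bool is_space(const std::string& s) {
  return !s.empty() && s.find_first_not_of(' ') == std::string::npos;
}

// Merge character cells into words. A space cell is never merged and is
// dropped from the output, but it forces the next cell to start a new word,
// even when the geometric gap is smaller than max_gap.
std::vector<Cell> merge_with_space_barrier(const std::vector<Cell>& chars,
                                           double max_gap) {
  std::vector<Cell> words;
  bool start_new = true;  // true at the beginning and right after a space
  for (const Cell& c : chars) {
    if (is_space(c.text)) {
      start_new = true;
      continue;
    }
    if (!start_new && c.x0 - words.back().x1 < max_gap) {
      // Close enough to the previous cell and no space in between: extend.
      words.back().text += c.text;
      words.back().x1 = c.x1;
    } else {
      words.push_back(c);
    }
    start_new = false;
  }
  return words;
}
```

With cells "a", "b", " ", "c", "d" where the space is only 1 unit wide and `max_gap` is 3, a purely distance-based merge would glue everything together, whereas this version yields two words because the space cell acts as a barrier.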

I currently have to post-process the word boxes: I check whether each one overlaps a ' ' box and, if it does, split the word box at that point.
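For reference, the workaround looks roughly like the sketch below. This is a simplified illustration rather than my exact code: `Box` and `split_word_at_spaces` are hypothetical names, the text is assumed to be horizontal and left-to-right, and the function assumes the non-space character boxes of the merged word are available in reading order alongside the ' ' boxes.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified box: text plus horizontal extent only.
struct Box {
  std::string text;
  double x0;  // left edge
  double x1;  // right edge
};

// Rebuild the words inside one merged word box. `chars` are the non-space
// character boxes of that word, left to right; `spaces` are the ' ' boxes.
// A new part starts whenever the centre of a space box lies between the
// centres of two consecutive characters.
std::vector<Box> split_word_at_spaces(const std::vector<Box>& chars,
                                      const std::vector<Box>& spaces) {
  std::vector<Box> parts;
  double prev_center = 0.0;
  for (const Box& c : chars) {
    const double center = 0.5 * (c.x0 + c.x1);
    bool split = parts.empty();  // the first character always opens a part
    for (const Box& s : spaces) {
      const double mid = 0.5 * (s.x0 + s.x1);
      if (!parts.empty() && prev_center < mid && mid < center) {
        split = true;
      }
    }
    if (split) {
      parts.push_back(c);
    } else {
      parts.back().text += c.text;
      parts.back().x1 = c.x1;
    }
    prev_center = center;
  }
  return parts;
}
```

Comparing centres rather than edges is a deliberate choice in this sketch, so that a space box that slightly overlaps its neighbours still splits the word.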
