When using image embeddings, some image embeddings may be skipped #1675

@pamelafox


This is reproducible with the sample data. When running prepdocs, you'll see that several pages aren't represented in the uploaded sections: no section has a sourcepage equal to that page, and thus no section has an imageEmbedding for that page. That means some answers may be lower quality, because the relevant matching image is never found.
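To check which pages are affected, something like the following could be run against the sourcepage values of the uploaded sections. This is a minimal sketch, not code from the repo: the helper name `pages_without_sections` is hypothetical, and it assumes each sourcepage ends in `#page=N` with 1-based page numbers, which may not match the actual format.

```python
import re


def pages_without_sections(sourcepages, page_count):
    """Return the 1-based page numbers that no section names as its sourcepage.

    Assumption: each sourcepage string ends in "#page=N". Values that do not
    match this pattern are ignored.
    """
    seen = set()
    for sourcepage in sourcepages:
        match = re.search(r"#page=(\d+)$", sourcepage)
        if match:
            seen.add(int(match.group(1)))
    return [page for page in range(1, page_count + 1) if page not in seen]


# Hypothetical data: a 4-page document whose sections only name pages 1 and 3.
uploaded = ["slides.pdf#page=1", "slides.pdf#page=1", "slides.pdf#page=3"]
print(pages_without_sections(uploaded, page_count=4))
```

Any page returned here has no section, and therefore no imageEmbedding, in the index.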

Possible approaches:

  • For certain document types, like slides, never chunk sections across pages. This was my original idea, but then I realized our sample document was a slide deck exported as a PDF, so I couldn't use a PPT-dependent condition. Thus, this isn't a full solution.
  • Never let sections go across pages. This may not work well with many PDFs, like research papers, that legitimately have sections spanning pages.
  • Associate multiple sourcepage values with a single section. @mattgotteiner says that's possible by picking a delimiter. I'm not sure whether multiple imageEmbedding values would also be possible. Otherwise, we'd have to pick the imageEmbedding we thought was best.
  • ...? Your idea here!
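The third approach could look roughly like the sketch below. Everything in it is an assumption for illustration: the `;` delimiter, the helper names, and the "page that contributed the most characters wins" heuristic for choosing a single imageEmbedding are not taken from the repo.

```python
# Hypothetical delimiter; it must never appear inside a sourcepage value.
SOURCEPAGE_DELIMITER = ";"


def join_sourcepages(sourcepages):
    """Combine the pages a section spans into one delimited string."""
    unique = list(dict.fromkeys(sourcepages))  # drop repeats, keep order
    for sourcepage in unique:
        if SOURCEPAGE_DELIMITER in sourcepage:
            raise ValueError(f"delimiter found in sourcepage: {sourcepage!r}")
    return SOURCEPAGE_DELIMITER.join(unique)


def split_sourcepages(value):
    """Recover the individual sourcepage values from the delimited string."""
    return value.split(SOURCEPAGE_DELIMITER) if value else []


def pick_primary_page(chars_per_page):
    """Choose the one page whose imageEmbedding the section would carry.

    Assumed heuristic: the page contributing the most characters to the
    section wins, and ties go to the lowest page number.
    """
    return max(chars_per_page, key=lambda page: (chars_per_page[page], -page))


# Hypothetical section that starts at the end of page 2 and runs into page 3.
combined = join_sourcepages(["slides.pdf#page=2", "slides.pdf#page=3"])
print(combined)
print(split_sourcepages(combined))
print(pick_primary_page({2: 120, 3: 480}))
```

Note that picking a single page still leaves the other page without an imageEmbedding, so this only fully addresses the problem if multiple imageEmbedding values per section turn out to be possible.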

Labels: vision (Related to the multimodal feature that can ingest figures and answer questions based off images)
