Wrong sourcepage when a section includes text from two pages #370

@jomieljaniuk

Description

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

1. Run the app with some PDF data that will be split into pages.
2. After the pages are indexed in Azure Search, find a section that includes text from the end of page x and the beginning of page x+1.
3. Ask the chatbot about information in that section that comes from page x+1.
4. The chat responds correctly, but the citation points to page x of the PDF, even though the information is on page x+1.

Expected/desired behavior

When prepdocs.py divides a PDF (our data) into sections to be indexed in Azure Search, it should take page boundaries into account. Text from page x and text from page x+1 should end up in separate sections.
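To illustrate the desired behavior, here is a minimal sketch of page-aware splitting. It is not the code in prepdocs.py: the function name `split_by_page`, the `pages` list of per-page strings, and the `max_len` parameter are all assumptions made for this example, and the real splitter also handles sentence boundaries and overlap.

```python
def split_by_page(pages, max_len=1000):
    """Yield (page_num, section_text) pairs.

    Hypothetical helper: each page is split on its own, so no section
    ever contains text from two different pages.
    """
    for page_num, text in enumerate(pages):
        # Cut the page into chunks of at most max_len characters.
        for start in range(0, len(text), max_len):
            yield page_num, text[start:start + max_len]


# Example: a 1500-character page followed by a 300-character page.
pages = ["a" * 1500, "b" * 300]
sections = list(split_by_page(pages, max_len=1000))
```

With this approach every section carries exactly one page number, so the citation cannot point to the wrong page. The trade-off is that a sentence running across a page break is cut in two.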

Mention any other details that might be useful

The function find_page in prepdocs.py looks up the page only for the beginning of a section. It does not consider that the section can end on the next page.
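The problem can be sketched as follows. This is an assumption about the shape of the logic, not a copy of prepdocs.py: `page_offsets` (the character offset at which each page starts in the concatenated text) and `pages_spanned` are hypothetical names, and this `find_page` takes the offsets as an explicit argument for the sake of a self-contained example.

```python
from bisect import bisect_right


def find_page(page_offsets, offset):
    """Return the index of the page that contains character `offset`.

    page_offsets[i] is the offset at which page i starts (assumed layout).
    """
    return bisect_right(page_offsets, offset) - 1


def pages_spanned(page_offsets, start, end):
    """Return every page index touched by the section [start, end)."""
    first = find_page(page_offsets, start)
    # Look up the last character of the section, not only the first one.
    last = find_page(page_offsets, max(start, end - 1))
    return list(range(first, last + 1))


# Pages start at offsets 0, 100 and 250 in the concatenated text.
page_offsets = [0, 100, 250]
start_only = find_page(page_offsets, 90)        # what a start-only lookup reports
spanned = pages_spanned(page_offsets, 90, 130)  # the section actually covers two pages
```

A section that starts at offset 90 and ends at offset 130 is attributed to page 0 by the start-only lookup, although most of its text sits on page 1. Checking the end offset as well makes it possible to detect such sections and either split them or cite both pages.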
