Indexing PDF documents that are not in English #689

@suma-sai-paluri

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

This is not a bug report; I have a few questions. I have PDF documents in Italian and want to use this repository to query them. After deploying the repo, I found the results satisfactory with the semantic ranker mode, acceptable with vectors, and unsatisfactory with text-only search. My goal is to get good results using only vectors (embeddings) and decent results with text, because the number of semantic searches available is currently limited. I'd appreciate help with the following questions:

  1. What is the optimal method for indexing? Should I opt for a language-specific analyzer (such as Lucene Italian) or a language-agnostic one (like standard Lucene)?

  2. Would it be advisable to modify the prompt instructions, specifying not to translate the queries into English, given that my data is in Italian?

  3. My documents also contain many tables and images with Italian text. Will Form Recognizer extract information from them effectively?
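
To make question 1 concrete, here is a minimal sketch of the two options as field definitions in the JSON shape used by the Azure AI Search REST API. The field name `content` and the helper `searchable_field` are assumptions for illustration, not necessarily what this repository uses; `it.lucene` and `standard.lucene` are the analyzer names for Lucene Italian and standard Lucene.

```python
import json


def searchable_field(name: str, analyzer: str) -> dict:
    """Hypothetical helper: one searchable string field in REST API JSON shape."""
    return {
        "name": name,
        "type": "Edm.String",
        "searchable": True,
        "analyzer": analyzer,
    }


# Option A: language-specific analyzer (Lucene Italian).
italian_field = searchable_field("content", "it.lucene")

# Option B: language-agnostic analyzer (standard Lucene).
standard_field = searchable_field("content", "standard.lucene")

# Print both definitions so they can be compared side by side.
print(json.dumps([italian_field, standard_field], indent=2))
```

As far as I understand, the analyzer is fixed when a field is created, so comparing the two options would mean indexing the documents into two separate indexes (or two separate fields).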

Please let me know if you have any other suggestions that might help in my case.

Any log messages given by the failure

Expected/desired behavior

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful

If anyone has experience using a data repository with content in a language other than English and a significant number of images and tables, I would greatly appreciate hearing about your experience and any modifications that gave good results.
Thanks! We'll be in touch soon.
