diff --git a/examples/multi_format_indexing/README.md b/examples/multi_format_indexing/README.md
index 9f98cb56d..1547625b7 100644
--- a/examples/multi_format_indexing/README.md
+++ b/examples/multi_format_indexing/README.md
@@ -55,6 +55,19 @@ Run:
 python main.py
 ```
 
+## Data Attribution
+
+The example data files used in this demonstration come from the following sources:
+
+### PDF Documents
+- **ArXiv Papers**: Research papers sourced from [ArXiv](https://arxiv.org/), an open-access repository of electronic preprints covering various scientific disciplines.
+
+### Image Documents
+- **Healthcare Industry Dataset**: Images from the [vidore/syntheticDocQA_healthcare_industry_test](https://huggingface.co/datasets/vidore/syntheticDocQA_healthcare_industry_test) dataset on Hugging Face, which contains synthetic document question-answering data for healthcare industry documents.
+- **ESG Reports Dataset**: Images from the [vidore/esg_reports_eng_v2](https://huggingface.co/datasets/vidore/esg_reports_eng_v2) dataset on Hugging Face, containing Environmental, Social, and Governance (ESG) reports.
+
+We thank the creators and maintainers of these datasets for making their data available for research and development purposes.
+
 ## About ColPali
 This example uses [ColPali](https://github.com/illuin-tech/colpali), a state-of-the-art vision-language model that enables:
 - Direct visual understanding of document layouts, tables, and figures
diff --git a/examples/multi_format_indexing/source_files/2502.06786v3.pdf b/examples/multi_format_indexing/source_files/2502.06786v3.pdf
new file mode 100644
index 000000000..9edb7eec6
Binary files /dev/null and b/examples/multi_format_indexing/source_files/2502.06786v3.pdf differ
diff --git a/examples/multi_format_indexing/source_files/cat1.jpeg b/examples/multi_format_indexing/source_files/cat1.jpeg
deleted file mode 100644
index cd92bb4a1..000000000
Binary files a/examples/multi_format_indexing/source_files/cat1.jpeg and /dev/null differ
diff --git a/examples/multi_format_indexing/source_files/dog1.jpeg b/examples/multi_format_indexing/source_files/dog1.jpeg
deleted file mode 100644
index 53767beb1..000000000
Binary files a/examples/multi_format_indexing/source_files/dog1.jpeg and /dev/null differ
diff --git a/examples/multi_format_indexing/source_files/elephant1.jpg b/examples/multi_format_indexing/source_files/elephant1.jpg
deleted file mode 100644
index a6a412427..000000000
Binary files a/examples/multi_format_indexing/source_files/elephant1.jpg and /dev/null differ
diff --git a/examples/multi_format_indexing/source_files/giraffe.jpg b/examples/multi_format_indexing/source_files/giraffe.jpg
deleted file mode 100644
index db6d23dfa..000000000
Binary files a/examples/multi_format_indexing/source_files/giraffe.jpg and /dev/null differ
diff --git a/examples/multi_format_indexing/source_files/healthcare_industry_test_p101.jpg b/examples/multi_format_indexing/source_files/healthcare_industry_test_p101.jpg
new file mode 100644
index 000000000..635c3879b
Binary files /dev/null and b/examples/multi_format_indexing/source_files/healthcare_industry_test_p101.jpg differ
diff --git a/examples/multi_format_indexing/source_files/healthcare_industry_test_p86.jpg b/examples/multi_format_indexing/source_files/healthcare_industry_test_p86.jpg
new file mode 100644
index 000000000..49a61b971
Binary files /dev/null and b/examples/multi_format_indexing/source_files/healthcare_industry_test_p86.jpg differ
diff --git a/examples/multi_format_indexing/source_files/healthcare_industry_test_p9.jpg b/examples/multi_format_indexing/source_files/healthcare_industry_test_p9.jpg
new file mode 100644
index 000000000..0729aef89
Binary files /dev/null and b/examples/multi_format_indexing/source_files/healthcare_industry_test_p9.jpg differ
diff --git a/examples/multi_format_indexing/source_files/restaurant_brands_international_2023.jpg b/examples/multi_format_indexing/source_files/restaurant_brands_international_2023.jpg
new file mode 100644
index 000000000..600b4c016
Binary files /dev/null and b/examples/multi_format_indexing/source_files/restaurant_brands_international_2023.jpg differ
diff --git a/examples/multi_format_indexing/source_files/rfc8259.pdf b/examples/multi_format_indexing/source_files/rfc8259.pdf
deleted file mode 100644
index 6c032d5fc..000000000
Binary files a/examples/multi_format_indexing/source_files/rfc8259.pdf and /dev/null differ
diff --git a/examples/multi_format_indexing/source_files/sweetgreen_2023.jpg b/examples/multi_format_indexing/source_files/sweetgreen_2023.jpg
new file mode 100644
index 000000000..aa34e4ce5
Binary files /dev/null and b/examples/multi_format_indexing/source_files/sweetgreen_2023.jpg differ