We have had great success using https://github.com/datalab-to/marker to extract data from `pdf` into `markdown` components, but it would be interesting to compare it against a couple of newly released tools:

1. https://github.com/Yuliang-Liu/MonkeyOCR