Add Omni Reader Project #187
Conversation
strickvl
left a comment
A few quick general observations:
- you have a poetry.lock, a requirements.txt and a pyproject.toml, all seemingly with different dependencies... I think you need to pick one format and coalesce on that :) I think requirements.txt is generally what we've done in projects prior to now so I'd suggest you keep that.
- seems like maybe too much logic is contained within the pipeline itself that normally we'd see happen inside the step. The loop over the images is a pattern we'd normally see inside a step. We'd also often have a loader step which would load the images (even if it was just returning their paths vs the actual Image object) just for purposes of logging the filenames etc.
- where are you running these pipelines? I can't see them on the internal or the demo tenant?
- if you put the looping logic inside the step, it'd be nice then to have the average processing times etc (i.e. across multiple images) logged as metadata which we can then compare in the dashboard
- I think we do have room for at least one Ollama project in our projects, but I wonder a bit whether it's this one. Maybe it depends a bit on whether we get standout results here or not.
- Feels maybe a bit unfair to compare gemma3 vs pixtral. They're almost in different classes (see https://huggingface.co/spaces/opencompass/open_vlm_leaderboard, one of the popular leaderboards, to get a sense of this). Wondering whether maybe to switch out pixtral for either mistral-small 3.1, or one of the Qwen vision models of a similar size.
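To sketch the step structure I mean (names like `load_images` and `run_ocr` and the metrics shape are placeholders, not your actual code, and in the real project these would carry ZenML `@step` decorators):

```python
import time
from pathlib import Path
from statistics import mean

# Hypothetical loader step: returns paths rather than actual Image objects,
# just so the filenames get logged and tracked as an artifact.
def load_images(image_dir: str) -> list[str]:
    paths = sorted(str(p) for p in Path(image_dir).glob("*.png"))
    print(f"Loaded {len(paths)} images: {[Path(p).name for p in paths]}")
    return paths

# Hypothetical OCR step: the loop over images lives INSIDE the step, and the
# average per-image processing time is computed so it can be logged as step
# metadata and compared across runs in the dashboard.
def run_ocr(image_paths: list[str], model: str) -> dict:
    results, timings = {}, []
    for path in image_paths:
        start = time.perf_counter()
        results[path] = f"<text extracted by {model}>"  # placeholder for the real OCR call
        timings.append(time.perf_counter() - start)
    avg = mean(timings) if timings else 0.0
    # In a real ZenML step you'd log this as metadata, e.g.:
    # log_metadata({"avg_processing_time_s": avg})
    return {"model": model, "results": results, "avg_processing_time_s": avg}
```

Not suggesting this is exactly how you wire it up, just that the loop and the timing aggregation sit inside the step so the metadata ends up on the step's outputs.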
Otherwise I think the README can be a bit nicer probably, including some image of the streamlit app, perhaps.
Also def return and log the results of the evaluate_models step as dict etc, but hoping you can add in an HTML visualization (i.e. return an HTMLString (see ZenML docs or other projects for how to do this)) for the data. Maybe with one sample result included inside or something as well so that you have an actual report at the end?
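Something along these lines for the report (the metrics dict shape and `build_eval_report` name are just an assumption about your data; in the step you'd wrap the returned string in ZenML's `HTMLString` so the dashboard renders it):

```python
# Sketch of the kind of HTML report the evaluate_models step could return,
# with one sample result embedded so there's an actual report at the end.
def build_eval_report(metrics: dict[str, dict], sample: dict) -> str:
    rows = "".join(
        f"<tr><td>{model}</td><td>{m['accuracy']:.2%}</td>"
        f"<td>{m['avg_time_s']:.2f}s</td></tr>"
        for model, m in metrics.items()
    )
    return f"""
    <h2>OCR Model Comparison</h2>
    <table border="1">
      <tr><th>Model</th><th>Accuracy</th><th>Avg time / image</th></tr>
      {rows}
    </table>
    <h3>Sample result ({sample['image']})</h3>
    <pre>{sample['text']}</pre>
    """
```

Then the step returns both the raw dict (for downstream use) and the `HTMLString` (for the visualization).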
I was thinking about ways to improve the evals part of the pipeline and collaboratively came up with this document: https://gist.github.com/strickvl/c165a3d73310f7d91e75cde670aa428d The code suggestions may or may not be actually how you implement things, but I think some of the ideas are maybe worth exploring, esp the quick wins. You should also def do the
htahir1
left a comment
strickvl
left a comment
Basically you're almost there. Just some small changes and this is ready to publish.
OmniReader is built for teams who routinely work with unstructured documents (e.g., PDFs, images, scanned forms) and want a scalable workflow for structured text extraction. It provides an end-to-end batch OCR pipeline with optional multi-model comparison to help ML engineers evaluate different OCR solutions before deployment.