Open
Description
The current design of docling-eval assumes the workflow:
1. create-gt: Create a Ground Truth dataset in HF parquet format.
2. create-eval: Create a prediction dataset in HF parquet format that contains the predictions and the ground truth data from step 1.
3. evaluate: Run evaluations on the prediction dataset created in step 2.
If the predictions already exist in lossless files such as DocTag or DoclingDocument json, it is still possible to use the workflow above via the FileProvider. However, this imposes unnecessary overhead because:
- It requires additional storage space to save the prediction parquet dataset.
- There is significant time spent in I/O to save the prediction dataset.
- A quick runtime benchmark shows that 15% of the time is spent converting DocTag files into DoclingDocument objects and 85% dumping the shards of the created prediction dataset.
An improved design should allow DocTag/json files to be evaluated directly, without having to dump a prediction dataset to disk.
One approach could be:
- The user places the dt or json files in a directory.
  - Each dt/json file follows the naming convention: <document_id>.dt, <document_id>.json. document_id must be the same as the document_id column of the GT dataset.
- All evaluators must accept an optional parameter external_predictions_path. If present:
  - Each GT document is matched to a doctags/json file.
  - The doctags file is loaded and converted on-the-fly to a DoclingDocument object. The json file is deserialized into a DoclingDocument.
  - The evaluation proceeds between the GT-sourced doc and the prediction doc.
- The CLI for the evaluate command must be expanded accordingly to accept an optional parameter --external-predictions-path.
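
The matching step in the proposal above could look roughly like the following. This is only a minimal sketch using the standard library: the function name find_external_prediction and the constant PREDICTION_SUFFIXES are hypothetical and not part of docling-eval. Trying json before dt is an arbitrary choice made for illustration. Loading the matched file into a DoclingDocument is deliberately left out.

```python
from pathlib import Path
from typing import Optional

# Hypothetical constant: the two extensions from the proposed naming convention.
# The lookup order (json first, then dt) is an assumption, not part of the proposal.
PREDICTION_SUFFIXES = (".json", ".dt")


def find_external_prediction(
    external_predictions_path: Path, document_id: str
) -> Optional[Path]:
    """Return the prediction file matching document_id, or None if there is none."""
    for suffix in PREDICTION_SUFFIXES:
        # Append the suffix instead of using Path.with_suffix, so that a
        # document_id that itself contains dots is left untouched.
        candidate = external_predictions_path / f"{document_id}{suffix}"
        if candidate.is_file():
            return candidate
    return None
```

An evaluator would call this once per GT row with the value of the document_id column, and could skip or report documents for which it returns None.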
Note: This design allows the evaluations to be parallelized by comparing batches of GT/predicted documents concurrently.
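
As a rough illustration of the batching idea, the sketch below splits GT/prediction pairs into batches and evaluates the batches on a thread pool. The names batched, evaluate_concurrently and evaluate_pair are hypothetical, and the default batch size and worker count are arbitrary. Threads are an assumption that mainly suits I/O-bound loading; CPU-bound metrics might be better served by a process pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence, TypeVar

Pair = TypeVar("Pair")
Score = TypeVar("Score")


def batched(items: Sequence[Pair], batch_size: int) -> Iterator[Sequence[Pair]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def evaluate_concurrently(
    pairs: Sequence[Pair],
    evaluate_pair: Callable[[Pair], Score],
    batch_size: int = 8,
    max_workers: int = 4,
) -> List[Score]:
    """Evaluate (GT, prediction) pairs batch by batch on a thread pool.

    evaluate_pair is a hypothetical callable that scores a single pair.
    Results are returned in the same order as the input pairs.
    """

    def evaluate_batch(batch: Sequence[Pair]) -> List[Score]:
        return [evaluate_pair(pair) for pair in batch]

    results: List[Score] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields batch results in submission order.
        for batch_scores in pool.map(evaluate_batch, batched(pairs, batch_size)):
            results.extend(batch_scores)
    return results
```

Because the results keep the input order, per-document scores can be joined back to the GT dataset by position.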