Commit 2ee3cac

feat: add metadata model hierarchy (#408)
* feat: add metadata model hierarchy
* add deprecation, add first migration
* extend annotations migration
* update with feedback
* expose main prediction
* ideas on enforcing separation between standard and custom fields
* add custom field setter method
* update Markdown serialization
* revert description, add include_non_meta, showcase custom serializer for summaries
* simplify customization
* fix reference exclusion
* eliminate serialization duplication between meta & (legacy) annotations
* remove old file
* fix item used in get_parts for meta ser
* serialize GroupItem meta prior to content, DocItem meta after content
* restore ser order for all nodeitems
* move meta serialization into DocSerializer.serialize() to maintain seamless chunking integration
* add allow- & block-lists for meta names, add std field name enum
* add HTML serializer, document meta field names, rename SMILES field
* bump DoclingDocument version
* make TabularChartMetaField.title optional, expose new classes through __init__.py, add MetaUtils
* add DocTags serialization, revert smiles to smi to prevent confusion with plural

---------

Signed-off-by: Panos Vagenas <[email protected]>
1 parent a3feae0 commit 2ee3cac

80 files changed: +2721 -258 lines changed

docling_core/transforms/serializer/base.py

Lines changed: 31 additions & 0 deletions
@@ -9,6 +9,7 @@
 from typing import Any, Optional, Union
 
 from pydantic import AnyUrl, BaseModel
+from typing_extensions import deprecated
 
 from docling_core.types.doc.document import (
     DocItem,
@@ -258,6 +259,7 @@ def serialize_captions(
         """Serialize the item's captions."""
         ...
 
+    @deprecated("Use serialize_meta() instead.")
     @abstractmethod
     def serialize_annotations(
         self,
@@ -267,6 +269,15 @@ def serialize_annotations(
         """Serialize the item's annotations."""
         ...
 
+    @abstractmethod
+    def serialize_meta(
+        self,
+        item: NodeItem,
+        **kwargs: Any,
+    ) -> SerializationResult:
+        """Serialize the item's meta."""
+        ...
+
     @abstractmethod
     def get_excluded_refs(self, **kwargs: Any) -> set[str]:
         """Get references to excluded items."""
@@ -287,6 +298,26 @@ def get_serializer(self, doc: DoclingDocument) -> BaseDocSerializer:
         ...
 
 
+class BaseMetaSerializer(ABC):
+    """Base class for meta serializers."""
+
+    @abstractmethod
+    def serialize(
+        self,
+        *,
+        item: NodeItem,
+        doc: DoclingDocument,
+        **kwargs: Any,
+    ) -> SerializationResult:
+        """Serializes the meta of the passed item."""
+        ...
+
+    def _humanize_text(self, text: str, title: bool = False) -> str:
+        tmp = text.replace("__", "_").replace("_", " ")
+        return tmp.title() if title else tmp.capitalize()
+
+
+@deprecated("Use BaseMetaSerializer() instead.")
 class BaseAnnotationSerializer(ABC):
     """Base class for annotation serializers."""
 
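To illustrate how a concrete serializer might build on the new `BaseMetaSerializer`, here is a minimal, self-contained sketch. It does not import docling_core: `FakeItem`, `SketchMetaSerializer`, and `PlainTextMetaSerializer` are hypothetical stand-ins, and `serialize()` is simplified to return a plain `str` and to omit the `doc` argument, whereas the real method takes `doc` and returns a `SerializationResult`. Only the body of `_humanize_text` is copied from the diff above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeItem:
    """Hypothetical stand-in for NodeItem: meta is just a name -> value dict."""

    meta: dict[str, str] = field(default_factory=dict)


class SketchMetaSerializer(ABC):
    """Simplified stand-in for BaseMetaSerializer (assumption: returns str)."""

    @abstractmethod
    def serialize(self, *, item: FakeItem, **kwargs: Any) -> str:
        """Serialize the meta of the passed item."""
        ...

    def _humanize_text(self, text: str, title: bool = False) -> str:
        # Same logic as in the diff: collapse "__" to "_", then turn "_" into spaces.
        tmp = text.replace("__", "_").replace("_", " ")
        return tmp.title() if title else tmp.capitalize()


class PlainTextMetaSerializer(SketchMetaSerializer):
    """Hypothetical subclass that renders each meta field as 'Name: value'."""

    def serialize(self, *, item: FakeItem, **kwargs: Any) -> str:
        lines = [
            f"{self._humanize_text(name, title=True)}: {value}"
            for name, value in item.meta.items()
        ]
        return "\n".join(lines)


serializer = PlainTextMetaSerializer()
print(serializer._humanize_text("my_corp__score"))              # → My corp score
print(serializer._humanize_text("my_corp__score", title=True))  # → My Corp Score
print(serializer.serialize(item=FakeItem(meta={"summary": "A short text"})))
# → Summary: A short text
```

The field name `my_corp__score` is made up for the example. It shows why the helper collapses a double underscore before replacing single ones: without that step a name with a `__` separator would humanize with two consecutive spaces.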