
Add JSON-based serialization logic and deepcopy implementation for Sentence object #3669

Merged
alanakbik merged 4 commits into master from GH-3243-serialize-sentence
Jun 6, 2025
alanakbik commented Jun 5, 2025

This pull request adds serialization and deep-copy support for Flair's Sentence objects, addressing a long-standing issue where standard methods like copy.deepcopy() and pickle would fail.

Closes #3245

This PR implements two key mechanisms to solve this problem:

  1. JSON Serialization (to_dict / from_dict):

    • A new sentence.to_dict() method serializes the entire Sentence object, including its text, tokenizer, and all annotations (spans, relations, and labels at all levels), into a JSON-compatible dictionary.
    • A corresponding class method, Sentence.from_dict(), fully reconstructs the Sentence object from this dictionary, correctly restoring all annotations and the original tokenizer.
    • The core logic lives in two private helpers, _capture_annotations() and _reapply_annotations(), which use character offsets to map annotations between different tokenizations. Reusing these helpers also simplified the existing retokenize() logic considerably.
  2. Custom Deep Copy (__deepcopy__):

    • A custom __deepcopy__ method has been implemented for the Sentence class.
    • This method walks the object graph, manually recreating tokens, spans, and relations while respecting the caching mechanism of Span and Relation.
    • This makes Sentence objects fully compatible with Python's copy.deepcopy(), resolving the original TypeError.
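
The character-offset idea behind the two annotation helpers can be illustrated with a small, self-contained sketch. This is not Flair's code: `Tok`, `tokenize`, `capture_annotations`, and `reapply_annotations` are hypothetical names, and the overlap rule used here (a token belongs to a span if their character ranges intersect) is an assumption made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    """Hypothetical token: surface text plus character offsets."""
    text: str
    start: int
    end: int


def tokenize(text, pattern):
    # Toy regex tokenizer that records character offsets for each match.
    return [Tok(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]


def capture_annotations(tokens, spans):
    # Convert (first_token_idx, last_token_idx, label) into
    # tokenization-independent (start_char, end_char, label) triples.
    return [(tokens[i].start, tokens[j].end, label) for i, j, label in spans]


def reapply_annotations(tokens, captured):
    # Map character ranges back onto token indices of a (possibly different)
    # tokenization. Assumption: a token is covered if it overlaps the range.
    result = []
    for start, end, label in captured:
        covered = [k for k, t in enumerate(tokens) if t.start < end and t.end > start]
        if covered:
            result.append((covered[0], covered[-1], label))
    return result


text = "New-York is big"
coarse = tokenize(text, r"\S+")          # New-York | is | big
fine = tokenize(text, r"\w+|[^\w\s]")    # New | - | York | is | big

captured = capture_annotations(coarse, [(0, 0, "LOC")])
remapped = reapply_annotations(fine, captured)
```

Here a one-token span under the coarse tokenization becomes a three-token span under the finer one, because both describe the same character range.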

Key Changes

  • New Public Methods: Added Sentence.to_dict(), Sentence.from_dict(), and Sentence.__deepcopy__().
  • Refactoring:
    • Refactored the Sentence constructor (__init__) to correctly handle initialization from raw text, a list of strings, and a list of pre-made Token objects, which was critical for the deserialization logic.
    • Simplified the retokenize() implementation by reusing the new annotation helper methods.
    • Made the _add_token() method more robust in calculating token start positions.
  • Testing:
    • Created a new test file, tests/test_sentence_serialization.py, with comprehensive unit tests.
    • These tests verify that both deep-copying and the JSON serialization cycle preserve complex annotations (including multi-word spans and relations) and the sentence's tokenizer configuration.
  • Bug Fixes: Addressed several mypy type-checking errors that arose during the refactoring to ensure the new code is robust and maintainable.
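
The round-trip properties described above can be sketched with a toy class. `ToySentence` and its dictionary layout are hypothetical stand-ins, since the PR description does not show the actual schema produced by Sentence.to_dict(). The sketch only illustrates the general pattern of a JSON round trip plus a custom `__deepcopy__` that rebuilds the object by hand.

```python
import copy
import json


class ToySentence:
    """Hypothetical stand-in for illustration; not Flair's Sentence."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.spans = []  # list of (first_idx, last_idx, label)

    def add_span(self, first, last, label):
        self.spans.append((first, last, label))

    def to_dict(self):
        # Only JSON-compatible types: lists, dicts, strings, ints.
        return {
            "tokens": list(self.tokens),
            "spans": [
                {"first": f, "last": l, "label": lab} for f, l, lab in self.spans
            ],
        }

    @classmethod
    def from_dict(cls, data):
        sentence = cls(data["tokens"])
        for span in data["spans"]:
            sentence.add_span(span["first"], span["last"], span["label"])
        return sentence

    def __deepcopy__(self, memo):
        # Rebuild the object manually instead of relying on the default
        # recursive copy, and register the clone in memo before filling it.
        clone = ToySentence(self.tokens)
        memo[id(self)] = clone
        for first, last, label in self.spans:
            clone.add_span(first, last, label)
        return clone


original = ToySentence(["George", "Washington", "went", "to", "Washington"])
original.add_span(0, 1, "PER")
original.add_span(4, 4, "LOC")

# JSON round trip: dict -> string -> dict -> object.
restored = ToySentence.from_dict(json.loads(json.dumps(original.to_dict())))

# Deep copy: changes to the clone must not leak into the original.
clone = copy.deepcopy(original)
clone.add_span(2, 2, "VERB")
```

A test in this style would compare `restored.to_dict()` with `original.to_dict()` and check that mutating `clone` leaves `original` unchanged.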

alanakbik merged commit fa9e439 into master Jun 6, 2025
2 checks passed