Skip to content

Refactor json_serialization_test.py to use schema-based hierarchical tests#338

Open
rwxayheee wants to merge 17 commits intoforlilab:developfrom
rwxayheee:better_json_tests
Open

Refactor json_serialization_test.py to use schema-based hierarchical tests#338
rwxayheee wants to merge 17 commits intoforlilab:developfrom
rwxayheee:better_json_tests

Conversation

@rwxayheee
Copy link
Contributor

@rwxayheee rwxayheee commented Apr 11, 2025

This is a rewrite of json_serialization_test.py to achieve the three major objectives:

  • Code efficiency
    This PR follows the hierarchical spirit of the original JSON tests, in a schema-first way with minimal redundant codes. Hopefully this is easier for long-term maintainance and automated test coverage assessment.

  • Ensure exhaustive attribute checking
    This PR exhaustively checks all attributes of all JSON-parsable classes, without hardcoding the list of attributes to check. As requested in added function stitch from PR188, with options residues_to_add and bonds_to_use #306 (comment) by @diogomart, the JSON-interchange logics of new attributes must be checked. This PR demands all attributes to be evaluated in original and decoded object, unless explicitly ignored. Otherwise, pytest will throw errors.

  • Deep equality check and bug fixes
    This PR implements a deep recursive equality check, instead of hardcoding how each attribute should be evaluated. Because of this, some bugs are exposed.

(Bug fixes will be added in separate commits, in case we want to fix actual problems without major changes of the tests)

@rwxayheee rwxayheee marked this pull request as draft April 11, 2025 07:05
@rwxayheee
Copy link
Contributor Author

rwxayheee commented Apr 11, 2025

Below is a little more technical details for future reference:

Design explained

This PR rewrites the tests using a schema-first principle to exhaustively validate (i) the serialization and deserialization process of JSON parsable classes, and (ii) the deep equality of original and decoded instances, based on the internal structure of the original object. The main components in the updated tests are:

  • JSONParsable classes subject to serialization tests
    Additionally, it is possible to specify attributes to skip by EQUALITY_SKIP_FIELDS

  • Input data for testing and fixtures
    These construct the highest-level objects, two Polymer instances and one default ResidueChemTemplates

  • Hierarchical tests
    These tests (see an example below) have two parts. First, @pytest.mark.parametrize gives the list of JSON-parsable classes to be iterated for seralization/deserialization tests. Then, the test_json_roundtrip test performs deep equality tests for all attributes in all corresponding instances made from the subobject_factory function. The recursion tree, which describes the parent/child relationship between different classes and how one can be part of another is specified in subobject_factory.

# iterate over nested classes in the Polymer hierarchy
@pytest.mark.parametrize("cls", [
    Polymer,
    Monomer,
    RDKitMoleculeSetup,
])
# check for seralization/deserialization and deep equality
def test_json_roundtrip(cls, populated_polymer):
    """Tests starting from a populated polymer object"""
    for obj in subobject_factory(cls, populated_polymer):
        json_str = obj.to_json()
        decoded = cls.from_json(json_str)
        assert isinstance(decoded, cls)
        deep_assert_equal(decoded, obj)
  • Other tests
    There are just a few more tests that were kept as is. Overall, the updated version is supposed to be a strict superset, with a greater and deeper coverage compared to the original version.

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Apr 11, 2025

Bugs discovered

Running the tests points out some minor issues:

FAILED test/json_serialization_test.py::test_json_roundtrip[Polymer] - AssertionError: [root.residue_chem_templates.padders.3-prime.adjacent_smartsmol] Mol SMILES mismatch
FAILED test/json_serialization_test.py::test_json_roundtrip[Monomer] - AssertionError: [root.molsetup.rmsd_symmetry_indices] Type mismatch: <class 'list'> != <class 'tuple'>
FAILED test/json_serialization_test.py::test_json_roundtrip[RDKitMoleculeSetup] - AssertionError: [root.rmsd_symmetry_indices] Type mismatch: <class 'list'> != <class 'tuple'>
FAILED test/json_serialization_test.py::test_json_roundtrip_missing[Polymer] - AssertionError: [root.residue_chem_templates.padders.3-prime.adjacent_smartsmol] Mol SMILES mismatch
FAILED test/json_serialization_test.py::test_json_rct[ResidueChemTemplates] - AssertionError: [root.padders.3-prime.adjacent_smartsmol] Mol SMILES mismatch
FAILED test/json_serialization_test.py::test_json_rct[ResiduePadder] - AssertionError: [root.adjacent_smartsmol] Mol SMILES mismatch
FAILED test/json_serialization_test.py::test_json_molsetup[RDKitMoleculeSetup] - AssertionError: [root.rmsd_symmetry_indices] Type mismatch: <class 'list'> != <class 'tuple'>
=========================================================== 7 failed, 79 passed, 1 skipped, 2 warnings in 3.79s ===========================================================

These will be fixed in additional, separate commits to address each issue.

because roundtrip MolToSmarts-MolFromSmarts not preserving all atom specs
for class ResiduePadder attribute adjacent_smartsmol
@rwxayheee
Copy link
Contributor Author

rwxayheee commented Apr 11, 2025

Issues and solutions

(1) The MolToSmarts-MolFromSmarts roundtrip not preserving all atom specs (3d6858c)

FAILED test/json_serialization_test.py::test_json_roundtrip[Polymer] - AssertionError: [root.residue_chem_templates.padders.3-prime.adjacent_smartsmol] Mol SMILES mismatch

This problem was discussed in rdkit/rdkit#3193. Here it occurred because we are serializing/deserializing SMARTS via a molecule attribute: class ResiduePadder attribute adjacent_smartsmol. There is possible information loss in the roundtrip (see example below). A potential a workaround would be to serialize the SMARTS directly without going through the roundtrip.

from rdkit import Chem

smarts_in = "[P:11]([O-:12])(=[O:13])[O:14][C:15]"
mol_1 = Chem.MolFromSmarts(smarts_in)
smarts_2 = Chem.MolToSmarts(mol_1)
print(f"Original SMARTS: {smarts_in}")
print(f"Converted SMARTS: {smarts_2}")
mol_2 = Chem.MolFromSmarts(smarts_2)

# Compare charges
for a1, a2 in zip(mol_1.GetAtoms(), mol_2.GetAtoms()):
    print(f"Atom {a1.GetSymbol()} — original: {a1.GetFormalCharge()}, decoded: {a2.GetFormalCharge()}")

Output:

Original SMARTS: [P:11]([O-:12])(=[O:13])[O:14][C:15]
Converted SMARTS: [P:11]([O&-:12])(=[O:13])[O:14][C:15]
Atom P — original: 0, decoded: 0
Atom O — original: -1, decoded: 0
Atom O — original: 0, decoded: 0
Atom O — original: 0, decoded: 0
Atom C — original: 0, decoded: 0

(2) Small type fix for class RDKitMoleculeSetup attribute rmsd_symmetry_indices in encoding logic (63fb250)

FAILED test/json_serialization_test.py::test_json_roundtrip[Monomer] - AssertionError: [root.molsetup.rmsd_symmetry_indices] Type mismatch: <class 'list'> != <class 'tuple'>

(3) Some bug fixes within class ResidueChemTemplatese (e8d734b)
There're some minor issues with variable names and scope of data_path.

(4) Remove unused attribute atom_true_count from class RDKitMoleculeSetup (2c94571)

(5) Similar to (2), restores tuple (converting list to tuple in decoding logic) in nested lists under class Polymer attr log (a46f885)

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Apr 12, 2025

(970b1e9) are improvements to attribute detection and comparison in the deep equality test. Now, the tests are able to capture the following problems:

What if...?

(1) What if we add an attribute in the __init__ of the class, without revision of the serialization/deserialization logic for it?

For example, as discussed in conversation #306 (comment), adding bonds to Polymer:

        _bonds = {}
        for key, bond_list in bonds.items():
            monomer1 = self.monomers[key[0]]
            monomer2 = self.monomers[key[1]]
            if monomer1.rdkit_mol is None or monomer2.rdkit_mol is None:
                continue
            invmap1 = {j: i for i, j in monomer1.mapidx_to_raw.items()}
            invmap2 = {j: i for i, j in monomer2.mapidx_to_raw.items()}
            _bonds[key] = [(invmap1[b[0]], invmap2[b[1]]) for b in bond_list]
        bonds = _bonds
        self.bonds = bonds

Expected pytest output

============================================================================== short test summary info ==============================================================================
FAILED test/json_serialization_test.py::test_json_roundtrip[Polymer] - AssertionError: [root.bonds] Dict keys mismatch

(2) What if the original instance for some reason got additional attribute that wasn't passed to the decoded object?

Example output

============================================================================== short test summary info ==============================================================================
FAILED test/json_serialization_test.py::test_json_roundtrip[Polymer] - AssertionError: [root] Missing attribute: additional_attr
FAILED test/json_serialization_test.py::test_json_roundtrip_missing[Polymer] - AssertionError: [root] Missing attribute: additional_attr

(3) What if the decoded instance for some reason got additional attribute that wasn't present in original object?

Example output

============================================================================== short test summary info ==============================================================================
FAILED test/json_serialization_test.py::test_json_roundtrip_missing[Polymer] - AssertionError: [root] Extra attributes in decoded object: {'excess_attr'}
FAILED test/json_serialization_test.py::test_json_roundtrip_missing[Monomer] - AssertionError: [root] Extra attributes in decoded object: {'excess_attr'}

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Apr 12, 2025

The updated test script currently includes four separable (mutually independent) test groups in #region Hierarchical Tests. It's very doable to add a new test group for a full hierachical test. Here're the general steps:

Developer Usage: Steps to write a new, independent hierarchical test

Here's a practical example:

step 1: Import the necessary data and create a fixture as needed

pkgdir = pathlib.Path(meeko.__file__).parents[1]
ahhy_v061_json = pkgdir / "test/polymer_data/AHHY-v0.6.1.json"
@pytest.fixture
def populated_polymer_v061():
    """fixture for a populated polymer object, from a v0.6.1 JSON file"""
    with open(ahhy_v061_json) as f:
        json_str = f.read()
    polymer = Polymer.from_json(json_str)
    if len(polymer.monomers) == 0:
        raise ValueError("Polymer creation failed")
    return polymer

step 2: Define an object factory

def lazy_factory(cls, root): 
    """Factory function to create a single object based on the class and root object."""
    if isinstance(root, Polymer):
        if cls is Polymer:
            return root
        monomers = list(root.monomers.values())
        if not monomers:
            raise ValueError("No monomers found in Polymer")
        last_monomer = monomers[-1]
        if cls is Monomer:
            return last_monomer
        if cls is RDKitMoleculeSetup:
            return last_monomer.molsetup
    raise ValueError(f"Unexpected class or root: {cls}, {type(root)}")

step 3: Define the class to be iterated and include nested JSON-parsable classes, if any

@pytest.mark.parametrize("cls", [
    Polymer,
    Monomer,
    RDKitMoleculeSetup,
])

step 4: Write the test_ function, in which serialization-deserialization and deep equality checks take place in order

def test_json_roundtrip_lazy(cls, populated_polymer_v061):
    """Lazy tests starting from a populated polymer object"""
    obj = lazy_factory(cls, populated_polymer_v061)
    if obj is None:
        warnings.warn(f"Subobject of type {cls.__name__} is None — skipping.", stacklevel=1)
        pass
    json_str = obj.to_json()
    decoded = cls.from_json(json_str)
    assert isinstance(decoded, cls)
    deep_assert_equal(decoded, obj)

The fixture (test case), object factory (scope) & classes to be iterated (depth of testing), and the testing logics are all flexible.

@rwxayheee rwxayheee marked this pull request as ready for review April 12, 2025 07:44
@rwxayheee rwxayheee requested review from diogomart and removed request for diogomart April 12, 2025 07:53
@rwxayheee rwxayheee marked this pull request as draft April 12, 2025 08:09
@rwxayheee
Copy link
Contributor Author

(Just realized that EQUALITY_SKIP_FIELDS is currently non-empty but should be empty if there are no remaining issues. Currently all tests are passing because it suppresses potential failures

EQUALITY_SKIP_FIELDS = { 
    RDKitMoleculeSetup: {"atom_true_count" },
    Monomer: {"template", "link_labels"},
}

(working on two small fixes before finalizing.)

@rwxayheee
Copy link
Contributor Author

The test script is finalized, but the tests are not passing without exceptions (EQUALITY_SKIP_FIELDS) when there is unfinished attribute like Monomer: {"template", "link_labels"},. This was #293 and has yet to be solved. Therefore, I will take a look

@rwxayheee rwxayheee marked this pull request as ready for review April 15, 2025 07:34
@rwxayheee rwxayheee requested a review from diogomart April 15, 2025 08:04
@rwxayheee
Copy link
Contributor Author

This should be ready. Overall, this PR has a rewritten version of the JSON roundtrip tests, and some minor fixes. I will write some more tests for residue templates in another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant