Skip to content

Allow multiple class_derivations for the same target class#118

Merged
amc-corey-cox merged 7 commits intomainfrom
inlined-as-list-class-derivations
Feb 25, 2026
Merged

Allow multiple class_derivations for the same target class#118
amc-corey-cox merged 7 commits intomainfrom
inlined-as-list-class-derivations

Conversation

@amc-corey-cox
Copy link
Contributor

Summary

  • Add inlined_as_list: true to TransformationSpecification.class_derivations in the schema, changing storage from Dict[str, ClassDerivation] to List[ClassDerivation]
  • This enables multiple derivations targeting the same class name (e.g., two source tables both mapping to Condition), which was previously impossible due to dict key uniqueness
  • All existing YAML specs (dict format) and programmatic construction (class_derivations={"Agent": ...}) continue to work — a Pydantic field_validator auto-converts dicts to lists

Changes

  • Schema/Model: inlined_as_list: true on class_derivations; regenerated Pydantic model; hand-added backward-compat field_validator
  • Normalization: Pre-process YAML None values and compact list keys before ReferenceValidator.normalize(); convert ObjectDerivation.class_derivations back to dict after normalization (it stays as dict)
  • All consumers: Mechanical .values() → direct iteration, .items()for cd in ...: cd.name, [key] =.append() across transformer, inference, compiler, and test modules
  • Tests: 11 comprehensive tests covering duplicate names, dict backward compat, name inference, empty lists, append, serialization roundtrip, YAML list format, YAML dict format, YAML None values, and create_transformer_specification

Closes #111

Test plan

  • uv run pytest tests/test_datamodel.py -v — 11 tests covering all input/output formats
  • uv run pytest tests/ -q — 258 passed, 4 skipped, 0 failures
  • uv run ruff check on modified files — no new lint issues

🤖 Generated with Claude Code

amc-corey-cox and others added 3 commits February 13, 2026 08:49
Add inlined_as_list: true to TransformationSpecification.class_derivations,
changing storage from Dict[str, ClassDerivation] to List[ClassDerivation].
This enables multiple derivations targeting the same class name (e.g., two
source tables both mapping to Condition), which was previously impossible
due to dict key uniqueness.

Backward compatibility is preserved via a Pydantic field_validator that
auto-converts dict input to list format, so all existing YAML specs and
programmatic construction continue to work unchanged.

Closes #111

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per EXTRACT marker in map_object(), extract source type resolution
logic to a private method. Add comprehensive unit tests covering all
resolution paths (explicit, no-schema fallback, tree_root, errors)
and an integration test through the full transformer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hand-added field_validator was lost when CI regenerates the model.
Add a class.py.jinja template override that injects the
coerce_class_derivations validator for TransformationSpecification,
so it survives model regeneration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enables class_derivations to support multiple derivations for the same target class name by changing it from a Dict[str, ClassDerivation] to a List[ClassDerivation] in TransformationSpecification. This is needed to support use cases like mapping multiple source tables to the same target class (e.g., two source tables both mapping to Condition).

Changes:

  • Modified schema to use inlined_as_list: true for TransformationSpecification.class_derivations while keeping ObjectDerivation.class_derivations as a dict
  • Added backward compatibility via Pydantic field_validator to accept dict input and convert to list
  • Updated all code to iterate over the list instead of using dict methods (.values(), .items(), etc.)
  • Added comprehensive test coverage for all input formats and edge cases

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/linkml_map/datamodel/transformer_model.yaml Added inlined_as_list: true to class_derivations
src/linkml_map/datamodel/transformer_model.py Generated Pydantic model with List type and field_validator
templates/pydantic/class.py.jinja Added field_validator template for backward compatibility
src/linkml_map/transformer/transformer.py Added preprocessing and normalization logic, updated helper methods
src/linkml_map/transformer/object_transformer.py Updated to use list indexing, added _resolve_source_type method
src/linkml_map/utils/loaders.py Added preprocessing call
src/linkml_map/session.py Added preprocessing call
src/linkml_map/inference/*.py Updated to iterate over list
src/linkml_map/compiler/*.py Updated to iterate over list
tests/test_datamodel.py Added 11 comprehensive tests
tests/test_transformer/*.py Updated existing tests and added new ones
tests/test_schema_mapper/*.py Updated to use .append() instead of dict assignment

@kevinschaper
Copy link
Contributor

PR #118 Issue: schema_mapper.derive_schema silently overwrites duplicate-name class derivations

Problem

The core motivation of PR #118 is enabling multiple ClassDerivation entries that target the same class name — e.g., two source tables both mapping to Condition. However, SchemaMapper.derive_schema writes derived classes into a dict keyed by name:

# src/linkml_map/inference/schema_mapper.py, derive_schema()
for class_derivation in specification.class_derivations:
    class_definition = self._derive_class(class_derivation)
    target_schema.classes[class_definition.name] = class_definition  # <-- last one wins

When two derivations share the same target name, the second silently overwrites the first. The slot derivations, populated_from, copy_directives, and all other attributes from the first derivation are lost.

Example

Given this transformation specification (the exact use case the PR enables):

id: dbgap-to-fhir
class_derivations:
  - Condition:
      populated_from: pht004031
      slot_derivations:
        code:
          populated_from: diagnosis_code
        onset_date:
          populated_from: dx_date
  - Condition:
      populated_from: pht004047
      slot_derivations:
        code:
          populated_from: condition_code
        severity:
          populated_from: severity_level

After derive_schema, the target schema would only contain the second Condition class definition (with severity and code from pht004047). The first derivation's onset_date slot is silently dropped.

This is because _derive_class builds a fresh ClassDefinition for each derivation, resetting attributes = {} each time:

# _derive_class()
target_class = copy(source_class)
target_class.attributes = {}   # fresh start each time
# ... then populates from this derivation's slot_derivations only
for slot_derivation in class_derivation.slot_derivations.values():
    slot_definition = self._derive_slot(slot_derivation)
    target_class.attributes[slot_definition.name] = slot_definition

Suggested test

This test demonstrates the problem — it will fail on the current PR branch, proving that derive_schema drops the first derivation:

def test_derive_schema_preserves_duplicate_name_derivations():
    """derive_schema should not silently discard class derivations that share a target name."""
    from copy import deepcopy
    from linkml_runtime import SchemaView
    from linkml_map.datamodel.transformer_model import (
        ClassDerivation,
        SlotDerivation,
        TransformationSpecification,
    )
    from linkml_map.inference.schema_mapper import SchemaMapper

    source_schema_str = """
    id: https://example.org/source
    name: source
    imports:
      - linkml:types
    classes:
      pht004031:
        attributes:
          diagnosis_code:
            range: string
          dx_date:
            range: string
      pht004047:
        attributes:
          condition_code:
            range: string
          severity_level:
            range: string
    """
    sv = SchemaView(source_schema_str)
    mapper = SchemaMapper(source_schemaview=sv)

    spec = TransformationSpecification(
        id="dup-name-test",
        class_derivations=[
            ClassDerivation(
                name="Condition",
                populated_from="pht004031",
                slot_derivations={
                    "code": SlotDerivation(name="code", populated_from="diagnosis_code"),
                    "onset_date": SlotDerivation(name="onset_date", populated_from="dx_date"),
                },
            ),
            ClassDerivation(
                name="Condition",
                populated_from="pht004047",
                slot_derivations={
                    "code": SlotDerivation(name="code", populated_from="condition_code"),
                    "severity": SlotDerivation(name="severity", populated_from="severity_level"),
                },
            ),
        ],
    )

    target_schema = mapper.derive_schema(spec)

    # There is only one "Condition" class in the target schema (dict keyed by name),
    # so at minimum all slot derivations from BOTH derivations should be present.
    condition_cls = target_schema.classes["Condition"]
    derived_slot_names = set(condition_cls.attributes.keys())

    # This assertion will FAIL: "onset_date" from the first derivation is lost
    assert "onset_date" in derived_slot_names, (
        f"Expected 'onset_date' from first Condition derivation, "
        f"but target schema only has: {derived_slot_names}"
    )
    assert "severity" in derived_slot_names, (
        f"Expected 'severity' from second Condition derivation, "
        f"but target schema only has: {derived_slot_names}"
    )
    # All four slots should be present if both derivations are merged
    assert derived_slot_names >= {"code", "onset_date", "severity"}, (
        f"Expected slots from both derivations, got: {derived_slot_names}"
    )

What should happen

This is a design question for the maintainers. Options include:

  1. Merge slot derivations — when two derivations share a target name, accumulate their slots into a single ClassDefinition (with a conflict strategy for overlapping slot names like code above).
  2. Raise an error in derive_schema — duplicate target names are valid for object transformation but not for schema derivation. Fail loudly so users know.
  3. Warn and keep last — emit a warning so it's at least visible, preserving current behavior.

Option 2 is the safest minimal fix; option 1 is the most useful but requires defining merge semantics.


Issue 2: session.py and loaders.py skip normalize_transform_spec — missing ObjectDerivation dict conversion

Problem

The PR adds logic to Transformer.normalize_transform_spec that converts ObjectDerivation.class_derivations back from list to dict after the ReferenceValidator normalizes it:

# transformer.py, normalize_transform_spec()
od_cd = od_normalized.get("class_derivations")
if isinstance(od_cd, list):
    od_normalized["class_derivations"] = {
        item["name"]: item for item in od_cd if isinstance(item, dict)
    }

This is needed because ObjectDerivation.class_derivations is still a dict (no inlined_as_list), but the normalizer may convert it to a list.

However, both session.py and loaders.py call normalizer.normalize() directly — they never go through normalize_transform_spec:

# session.py:51-58
elif isinstance(specification, dict):
    # TODO: centralize this code
    Transformer._preprocess_class_derivations(specification)
    normalizer = ReferenceValidator(...)
    normalizer.expand_all = True
    specification = normalizer.normalize(specification)  # <-- no nested OD fixup
    self.transformer_specification = TransformationSpecification(**specification)
# loaders.py:11-22
def load_specification(path):
    with open(path) as f:
        obj = yaml.safe_load(f)
        Transformer._preprocess_class_derivations(obj)
        normalizer = ReferenceValidator(...)
        normalizer.expand_all = True
        obj = normalizer.normalize(obj)  # <-- no nested OD fixup
        return TransformationSpecification(**obj)

If a specification with nested object_derivations is loaded through either path, the ObjectDerivation.class_derivations could receive a list instead of the expected dict.

Suggested test

def test_session_loads_spec_with_object_derivations():
    """session.set_transformer_specification should handle nested object_derivations correctly."""
    from linkml_map.session import Session

    spec_dict = {
        "id": "od-test",
        "class_derivations": {
            "Agent": {
                "populated_from": "Person",
                "slot_derivations": {
                    "address": {
                        "populated_from": "address",
                        "object_derivations": [
                            {
                                "class_derivations": {
                                    "Address": {
                                        "populated_from": "Address",
                                        "slot_derivations": {
                                            "street": {"populated_from": "street"},
                                        },
                                    }
                                }
                            }
                        ],
                    }
                },
            }
        },
    }
    session = Session()
    session.set_transformer_specification(spec_dict)
    spec = session.transformer_specification

    # Verify the top-level is a list (new behavior)
    assert isinstance(spec.class_derivations, list)

    # Verify the nested ObjectDerivation.class_derivations is still a dict
    agent_cd = spec.class_derivations[0]
    addr_sd = agent_cd.slot_derivations["address"]
    od = addr_sd.object_derivations[0]
    assert isinstance(od.class_derivations, dict), (
        f"Expected dict for ObjectDerivation.class_derivations, got {type(od.class_derivations)}"
    )

Suggested fix

Consolidate all three normalization paths (Transformer.load_transformer_specification, session.py, loaders.py) into a single shared function that includes both _preprocess_class_derivations and normalize_transform_spec. The existing TODO: centralize this code comment in session.py already calls for this.


Issue 3: _get_class_derivation edge case with duplicate names after inference

Problem

Transformer._get_class_derivation matches derivations by populated_from:

# transformer.py:180-191
matching_tgt_class_derivs = [
    deriv
    for deriv in spec.class_derivations
    if deriv.populated_from == target_class_name
    or (not deriv.populated_from and target_class_name == deriv.name)
]
if len(matching_tgt_class_derivs) != 1:
    msg = f"Could not find class derivation for {target_class_name} (results={len(matching_tgt_class_derivs)})"
    raise ValueError(msg)

induce_missing_values (called when building derived_specification) sets populated_from = cd.name for any derivation where populated_from is None. After that, two derivations with the same name and both originally missing populated_from would both have populated_from == name, causing both to match the same target_class_name query, hitting the len != 1 error.

The core use case (same name, different populated_from) works fine. But the edge case should be documented or tested.

Suggested test

def test_get_class_derivation_with_duplicate_names_requires_populated_from():
    """Two derivations with the same name must have distinct populated_from values."""
    from linkml_runtime import SchemaView
    from linkml_map.datamodel.transformer_model import (
        ClassDerivation,
        TransformationSpecification,
    )
    from linkml_map.transformer.object_transformer import ObjectTransformer

    source_schema_str = """
    id: https://example.org/src
    name: src
    imports:
      - linkml:types
    classes:
      TableA:
        tree_root: true
        attributes:
          x:
            range: string
      TableB:
        attributes:
          y:
            range: string
    """
    sv = SchemaView(source_schema_str)

    # Works: same name, different populated_from
    tr = ObjectTransformer()
    tr.source_schemaview = sv
    spec_ok = TransformationSpecification(
        id="ok",
        class_derivations=[
            ClassDerivation(name="Result", populated_from="TableA"),
            ClassDerivation(name="Result", populated_from="TableB"),
        ],
    )
    tr.specification = spec_ok
    # Each can be looked up by its populated_from
    cd_a = tr._get_class_derivation("TableA")
    assert cd_a.populated_from == "TableA"

    # Fails: same name, both missing populated_from — after inference, both get populated_from="Result"
    tr2 = ObjectTransformer()
    tr2.source_schemaview = sv
    spec_bad = TransformationSpecification(
        id="bad",
        class_derivations=[
            ClassDerivation(name="Result"),
            ClassDerivation(name="Result"),
        ],
    )
    tr2.specification = spec_bad
    import pytest
    with pytest.raises(ValueError, match="Could not find class derivation"):
        # After induce_missing_values, both have populated_from="Result"
        # and _get_class_derivation("Result") matches 2
        tr2._get_class_derivation("Result")

Issue 4: _find_class_derivation_by_name returns first match — undocumented

Problem

# transformer.py (PR diff)
def _find_class_derivation_by_name(self, name: str) -> ClassDerivation:
    """Look up a class derivation by name from the specification."""
    for cd in self.specification.class_derivations:
        if cd.name == name:
            return cd
    msg = f"No class derivation named '{name}'"
    raise KeyError(msg)

With the duplicate-name feature, this returns the first match. It's called from _class_derivation_ancestors for is_a/mixins resolution. If someone uses inheritance with duplicate-named derivations, the first match may not be the intended parent. The docstring should note this "first match" behavior.

This is low-risk since inheritance (is_a/mixins) references would typically use unique names, but worth documenting.


Issue 5: test_self_transform duplicates preprocessing logic

Problem

The PR adds a manual None-value fix in test_object_transformer.py:575-579:

# Fix None values in class_derivations before normalization
cd = source_object.get("class_derivations")
if isinstance(cd, dict):
    for k, v in cd.items():
        if v is None:
            cd[k] = {}

This duplicates one branch of Transformer._preprocess_class_derivations. It should call the shared method:

from linkml_map.transformer.transformer import Transformer
Transformer._preprocess_class_derivations(source_object)

This avoids divergence if the preprocessing logic changes and follows DRY.


Issue 6: get_biolink_class_derivations has incorrect return type

Problem

In tests/test_transformer/test_biolink_subsetting.py:24:

def get_biolink_class_derivations(sv: SchemaView, subset_classes: list) -> dict:
    """
    Get Biolink class definitions.

    :param sv: SchemaView object
    :param subset_classes: List of classes to subset
    :return: Dictionary of class derivations incl slot derivations
    """
    class_derivations = []  # <-- builds a list
    for class_name in subset_classes:
        ...
        class_derivations.append(class_derivation)  # <-- appends to list
    return class_derivations  # <-- returns list

The return type annotation says -> dict and the docstring says "Dictionary of class derivations", but the function body (after the PR's changes) builds and returns a list. Should be -> list with a matching docstring.


Issue 7: Backward-compat validator doesn't handle name mismatches for ClassDerivation objects

Problem

The coerce_class_derivations field_validator only sets name from the dict key when the value is a plain dict:

@field_validator('class_derivations', mode='before')
@classmethod
def coerce_class_derivations(cls, v):
    if isinstance(v, dict):
        result = []
        for key, cd in v.items():
            if isinstance(cd, dict):
                cd.setdefault('name', key)  # only for plain dicts
            result.append(cd)              # ClassDerivation objects pass through as-is
        return result
    return v

Consider:

spec = TransformationSpecification(
    class_derivations={
        "Agent": ClassDerivation(name="Person", populated_from="Person")
    }
)

In the old dict-based model, the key "Agent" was authoritative — you'd access it as spec.class_derivations["Agent"]. Now, the ClassDerivation's internal name "Person" is silently used, and the dict key "Agent" is discarded. This is a subtle semantic change.

This is low-severity since real-world code almost always has matching keys and names, but the validator could handle ClassDerivation objects too:

if isinstance(cd, dict):
    cd.setdefault('name', key)
elif hasattr(cd, 'name') and not cd.name:
    cd.name = key

Issue 8: _preprocess_class_derivations compact-list heuristic

Observation

# transformer.py, _preprocess_class_derivations
elif isinstance(cd, list):
    for i, item in enumerate(cd):
        if isinstance(item, dict) and len(item) == 1:
            key, val = next(iter(item.items()))
            if key != "name" and isinstance(val, (dict, type(None))):
                expanded = val if val is not None else {}
                expanded.setdefault("name", key)
                cd[i] = expanded

The heuristic detects YAML compact keys like - Condition: {populated_from: x} via len(item) == 1 and key != "name". The isinstance(val, (dict, type(None))) guard prevents false positives on items like {"populated_from": "table1"} (string value). But a comment explaining why these checks exist would help future maintainers understand the intent. For example:

# Detect YAML compact key format: - ClassName: {body...}
# Distinguished from regular single-field dicts by requiring the value
# to be a dict or None (not a scalar like a string).

Summary

# Severity Issue
1 Critical schema_mapper.derive_schema silently drops first derivation with duplicate names
2 Important session.py/loaders.py skip ObjectDerivation dict conversion
3 Important _get_class_derivation fails with duplicate names when populated_from not set
4 Minor _find_class_derivation_by_name first-match behavior undocumented
5 Minor test_self_transform duplicates preprocessing logic (DRY)
6 Minor get_biolink_class_derivations return type annotation is wrong
7 Minor Backward-compat validator ignores dict key for ClassDerivation objects
8 Suggestion Compact-list heuristic needs an explanatory comment

…te derivations

Consolidate the duplicated preprocess→normalize→create-spec pipeline into a
single `_normalize_spec_dict` classmethod so all four entry points (including
Session and loaders) get nested ObjectDerivation fixup. Add merge-on-conflict
logic in `derive_schema` so duplicate-name class derivations combine their
slots instead of silently dropping the first derivation's attributes. Include
minor fixes: docstring/type-annotation corrections, compact-list heuristic
comments, and `coerce_class_derivations` handling of ClassDerivation objects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

@amc-corey-cox
Copy link
Contributor Author

Thanks for the thorough review, Kevin. All 8 issues are addressed in 14788f6:

Issue #1 (Critical)derive_schema now merges duplicate-name derivations instead of overwriting. New _merge_class_definition method combines attributes (later wins on conflict with a warning), deduplicates slots and mixins, and warns on is_a conflicts. Tests: test_derive_schema_merges_duplicate_name_derivations and test_derive_schema_warns_on_slot_conflict.

Issue #2 — Consolidated all four normalization entry points into Transformer._normalize_spec_dict, a classmethod that bundles _preprocess_class_derivations + ReferenceValidator normalization + nested ObjectDerivation fixup. session.py, loaders.py, load_transformer_specification, and create_transformer_specification all call it now. Removed the # TODO: centralize this code comment and unused imports.

Issue #3 — Added test_get_class_derivation_raises_for_duplicate_names_without_populated_from confirming the existing ValueError is raised (results=2). No code change needed — the behavior was already correct.

Issue #4 — Updated _find_class_derivation_by_name docstring to note first-match behavior.

Issue #5 — Replaced the inline None-value fix in test_self_transform with Transformer._normalize_spec_dict(source_object).

Issue #6 — Changed return type from -> dict to -> list and updated docstring.

Issue #7 — Added elif hasattr(cd, 'name') and not cd.name: cd.name = key branch in coerce_class_derivations (both transformer_model.py and class.py.jinja template).

Issue #8 — Added comments explaining the compact-list heuristic in _preprocess_class_derivations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

… has none

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@amc-corey-cox amc-corey-cox merged commit 303add3 into main Feb 25, 2026
11 checks passed
@amc-corey-cox amc-corey-cox deleted the inlined-as-list-class-derivations branch February 25, 2026 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow class_derivations to be a list for multiple target instances

3 participants