Add title validation with dcterms:title support #20

cmungall · 2026-01-08T02:01:08Z

Summary

Add validate_title() method in SupportingTextValidator for exact title matching after normalization
Add _find_title_fields() in plugin to discover slots implementing dcterms:title or having slot_uri dcterms:title
Add _validate_title() in plugin to yield validation results for title mismatches
Update _validate_instance() to validate titles from dedicated title fields, either combined with excerpt validation or standalone

Title Validation Behavior

Title validation uses exact matching (not substring like excerpt) but allows for minor orthographic variation:

Case insensitivity (JAK1 matches jak1)
Whitespace normalization (extra spaces collapsed)
Punctuation differences (Cell-Signaling matches Cell Signaling)
Greek letter equivalence (α-catenin matches alpha-catenin)

Schema Usage

Slots can be marked for title validation using:

slots:
  reference_title:
    implements:
      - dcterms:title

Or via slot_uri:

slots:
  title:
    slot_uri: dcterms:title

Test plan

16 new tests added in tests/test_title_validation.py
All 261 tests pass (just test)
mypy passes
ruff passes

🤖 Generated with Claude Code

- Add validate_title() method in SupportingTextValidator for exact title matching after normalization (case, whitespace, punctuation, Greek letters) - Add _find_title_fields() in plugin to discover slots implementing dcterms:title or having slot_uri dcterms:title - Add _validate_title() in plugin to yield validation results - Update _validate_instance() to validate titles from title fields, either combined with excerpt or standalone - Add comprehensive test suite (16 tests) for title validation Title validation uses exact matching (not substring like excerpt) but allows for minor orthographic variation including: - Case insensitivity - Whitespace normalization - Punctuation differences - Greek letter equivalence (α ↔ alpha) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

claude · 2026-01-08T02:01:21Z

Claude finished @cmungall's task —— View job

PR Review: Title Validation Implementation

Todo List

Read and analyze changed files
Review code quality and best practices
Check for potential bugs or issues
Evaluate performance considerations
Assess security concerns
Review test coverage

Code Quality & Best Practices ✅

Overall, this is a well-structured implementation with good separation of concerns:

Strengths:

Clean API design with descriptive method names (validate_title, _find_title_fields)
Good use of type hints throughout
Comprehensive docstrings with examples
Follows established patterns from existing codebase
Proper error handling with meaningful messages

Minor Suggestions:

Line 174 in supporting_text_validator.py: The title validation logic is duplicated between validate_title method and the inline check in validate. Consider extracting this to a private method to reduce duplication.

Potential Bugs & Issues ⚠️

Schema Discovery Logic (reference_validation_plugin.py:334-335):

if "dcterms:title" in interface or "title" in interface.lower():

This condition is too broad - any interface containing "title" will match. Consider making this more specific:

if interface == "dcterms:title" or interface.lower() == "title":

Path Construction (reference_validation_plugin.py:181):
The path construction uses simple string concatenation which could produce malformed paths:

f"{path}.{excerpt_field}" if path else excerpt_field

Consider validating or sanitizing the path components.

Performance Considerations 🚀

Good Performance Practices:

Proper caching in place via ReferenceFetcher
Early returns to avoid unnecessary processing
Efficient text normalization using regex

Potential Optimization:

The Greek letter mapping (lines 427-452) creates a new dict on every call to normalize_text. Consider making it a class constant.

Security Concerns 🔒

No major security issues identified. The code:

✅ Doesn't execute user input
✅ Properly handles external data fetching through existing validated channels
✅ Uses safe regex patterns
✅ Validates inputs appropriately

Test Coverage 📋

Excellent test coverage with 16 comprehensive tests:

Test Quality:

Good use of mocking to isolate units under test
Clear test names that describe expected behavior
Edge cases well covered (empty titles, malformed data)

Missing Test Scenarios:

Unicode normalization beyond Greek letters
Very long title strings (potential performance impact)
Concurrent validation scenarios

Summary 🎯

This is a solid implementation that follows good software engineering practices. The title validation feature is well-designed with appropriate normalization rules and comprehensive test coverage.

Key Highlights:

✅ Architecture: Clean separation between validator and plugin layers
✅ Functionality: Proper exact matching with intelligent normalization
✅ Integration: Well-integrated with existing LinkML validation framework
✅ Testing: Comprehensive test suite covering edge cases

Recommended Actions:

✅ Approve - The implementation is production-ready
Consider the minor improvements suggested above for future iterations
The test coverage is excellent and validates the expected behavior thoroughly

Overall Assessment: Strong implementation ready for merge ⭐

Copilot

Pull request overview

This pull request adds title validation functionality to the LinkML reference validator, enabling exact matching of reference titles after normalization. The feature supports schema-driven discovery of title fields through dcterms:title interface implementation or slot_uri specification.

Key changes:

Adds validate_title() method for standalone title validation with exact matching after normalization (case, whitespace, punctuation, Greek letters)
Implements _find_title_fields() to discover title slots via dcterms:title interface or slot_uri
Integrates title validation into the plugin's validation flow, supporting both combined excerpt+title validation and standalone title-only validation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
tests/test_title_validation.py	Comprehensive test suite with 16 tests covering title validation scenarios including normalization, exact matching, field discovery, and plugin integration
src/linkml_reference_validator/validation/supporting_text_validator.py	Adds `validate_title()` method implementing exact title matching with normalization support for case, whitespace, punctuation, and Greek letters
src/linkml_reference_validator/plugins/reference_validation_plugin.py	Adds `_find_title_fields()` for schema discovery and `_validate_title()` for integration; updates validation flow to support title fields alongside excerpt validation

Copilot · 2026-01-08T02:06:31Z

src/linkml_reference_validator/plugins/reference_validation_plugin.py

+                for title_field in title_fields:
+                    title_value = instance.get(title_field)
+                    if not title_value:
+                        continue
+
+                    for ref_field in reference_fields:
+                        ref_value = instance.get(ref_field)
+                        if ref_value:
+                            reference_id = self._extract_reference_id(ref_value)
+                            if reference_id:
+                                yield from self._validate_title(
+                                    title_value,
+                                    reference_id,
+                                    f"{path}.{title_field}" if path else title_field,
+                                )


Similar to the excerpt validation loop above, if an instance has multiple title fields or multiple reference fields with values, this could result in redundant title validations. Consider validating only the first available title against the first available reference to avoid unnecessary duplicate checks.

Suggested change

for title_field in title_fields:

title_value = instance.get(title_field)

if not title_value:

continue

for ref_field in reference_fields:

ref_value = instance.get(ref_field)

if ref_value:

reference_id = self._extract_reference_id(ref_value)

if reference_id:

yield from self._validate_title(

title_value,

reference_id,

f"{path}.{title_field}" if path else title_field,

)

# Validate only the first available title against the first available reference

first_title_field: Optional[str] = None

first_title_value: Optional[str] = None

for title_field in title_fields:

title_value = instance.get(title_field)

if title_value:

first_title_field = title_field

first_title_value = title_value

break

if first_title_field and first_title_value:

for ref_field in reference_fields:

ref_value = instance.get(ref_field)

if ref_value:

reference_id = self._extract_reference_id(ref_value)

if reference_id:

yield from self._validate_title(

first_title_value,

reference_id,

f"{path}.{first_title_field}"

if path

else first_title_field,

)

break

Copilot · 2026-01-08T02:06:31Z

src/linkml_reference_validator/validation/supporting_text_validator.py

+        if not reference:
+            return ValidationResult(
+                is_valid=False,
+                reference_id=reference_id,
+                supporting_text="",
+                severity=ValidationSeverity.ERROR,
+                message=f"Could not fetch reference: {reference_id}",
+                path=path,
+            )
+
+        if not reference.title:
+            return ValidationResult(
+                is_valid=False,
+                reference_id=reference_id,
+                supporting_text="",
+                severity=ValidationSeverity.ERROR,
+                message=f"Reference {reference_id} has no title to validate against",
+                path=path,
+            )


Missing test coverage for the case where reference fetching fails (returns None) in validate_title(). Consider adding a test similar to test_validate_title_only_no_reference_title but for when the fetcher returns None entirely.

Copilot · 2026-01-08T02:06:32Z

src/linkml_reference_validator/plugins/reference_validation_plugin.py

+        """LinkML validation plugin for supporting text validation.

-    def pre_process(self, context: ValidationContext) -> None:
-        """Pre-process hook called before validation.
+        This plugin integrates with the LinkML validation framework to validate
+        that supporting text quotes actually appear in their referenced publications.

-        Args:
-            context: Validation context from LinkML
+        The plugin discovers reference and excerpt fields using LinkML's interface
+        mechanism. It looks for:
+        - Slots implementing linkml:authoritative_reference
+        - Slots implementing linkml:excerpt


The class docstring should be updated to mention title validation functionality. It currently only mentions "supporting text validation" and lists excerpt fields, but doesn't document the new dcterms:title field discovery and validation capability that this PR adds.

Copilot · 2026-01-08T02:06:32Z

src/linkml_reference_validator/plugins/reference_validation_plugin.py

+                # Check implements for dcterms:title
+                if slot.implements:
+                    for interface in slot.implements:
+                        if "dcterms:title" in interface or "title" in interface.lower():


The condition "title" in interface.lower() on line 334 is overly broad and could match unintended interfaces. For example, it would match "subtitle", "entitle", "retitled", etc. The check should be more specific, such as checking for exact match with "dcterms:title" or using word boundaries. Consider using a more precise match like interface.lower() == "title" or re.search(r'\btitle\b', interface.lower()).

Suggested change

if "dcterms:title" in interface or "title" in interface.lower():

interface_lower = interface.lower()

if interface_lower == "dcterms:title" or interface_lower == "title":

Copilot · 2026-01-08T02:06:32Z

src/linkml_reference_validator/plugins/reference_validation_plugin.py

+            for excerpt_field in excerpt_fields:
+                excerpt_value = instance.get(excerpt_field)
+                if not excerpt_value:
+                    continue
+
+                for ref_field in reference_fields:
+                    ref_value = instance.get(ref_field)
+                    if ref_value:
+                        reference_id = self._extract_reference_id(ref_value)
+                        # Get title from title field or from reference dict
+                        expected_title = None
+                        for title_field in title_fields:
+                            title_value = instance.get(title_field)
+                            if title_value:
+                                expected_title = title_value
+                                break
+                        if not expected_title:
+                            expected_title = self._extract_title(ref_value)
+                        if reference_id:
+                            validated_with_excerpt = True
+                            yield from self._validate_excerpt(
+                                excerpt_value,
+                                reference_id,
+                                expected_title,
+                                f"{path}.{excerpt_field}" if path else excerpt_field,
+                            )


The nested loop structure could cause duplicate validations if an instance has multiple reference fields with values. For example, if both "reference" and "reference_id" are present and have values, the same excerpt would be validated against both references. Consider breaking after the first successful reference field match, or validating only against the first available reference field to avoid redundant validations.

cmungall · 2026-01-08T02:08:55Z

@copilot open a new pull request to apply changes based on the comments in this thread

Copilot · 2026-01-08T02:09:03Z

@cmungall I've opened a new pull request, #21, to work on those changes. Once the pull request is ready, I'll request review from you.

- Add new how-to guide: docs/how-to/validate-titles.md - Add title validation section to docs/concepts/how-it-works.md - Update index.md and quickstart.md to mention title validation - Add validate-titles.md to mkdocs.yml navigation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

claude · 2026-01-08T02:10:59Z

Claude finished @cmungall's task —— View job

PR Review: Title Validation Implementation

Todo List

Read and analyze changed files
Review code quality and best practices
Check for potential bugs or issues
Evaluate performance considerations
Assess security concerns
Review test coverage

Code Quality & Best Practices ✅

Overall, this is a well-structured implementation with good separation of concerns:

Strengths:

Clean API design with descriptive method names (validate_title, _find_title_fields)
Good use of type hints throughout
Comprehensive docstrings with examples
Follows established patterns from existing codebase
Proper error handling with meaningful messages

Minor Suggestions:

Line 174 in supporting_text_validator.py: The title validation logic is duplicated between validate_title method and the inline check in validate. Consider extracting this to a private method to reduce duplication.

Potential Bugs & Issues ⚠️

Schema Discovery Logic (reference_validation_plugin.py:334-335):

if "dcterms:title" in interface or "title" in interface.lower():

This condition is too broad - any interface containing "title" will match. Consider making this more specific:

if interface == "dcterms:title" or interface.lower() == "title":

Path Construction (reference_validation_plugin.py:181):
The path construction uses simple string concatenation which could produce malformed paths:

f"{path}.{excerpt_field}" if path else excerpt_field

Consider validating or sanitizing the path components.

Multiple Reference Fields (reference_validation_plugin.py:190-199):
The nested loop structure could cause duplicate validations if an instance has multiple reference fields with values. Consider validating only against the first available reference field to avoid redundant validations.

Performance Considerations 🚀

Good Performance Practices:

Proper caching in place via ReferenceFetcher
Early returns to avoid unnecessary processing
Efficient text normalization using regex

Potential Optimization:

The Greek letter mapping (lines 427-452) creates a new dict on every call to normalize_text. Consider making it a class constant.

Security Concerns 🔒

No major security issues identified. The code:

✅ Doesn't execute user input
✅ Properly handles external data fetching through existing validated channels
✅ Uses safe regex patterns
✅ Validates inputs appropriately

Test Coverage 📋

Excellent test coverage with 16 comprehensive tests:

Test Quality:

Good use of mocking to isolate units under test
Clear test names that describe expected behavior
Edge cases well covered (empty titles, malformed data)

Missing Test Scenarios:

Reference fetching failure (returns None) in validate_title method
Unicode normalization beyond Greek letters
Very long title strings (potential performance impact)

Summary 🎯

This is a solid implementation that follows good software engineering practices. The title validation feature is well-designed with appropriate normalization rules and comprehensive test coverage.

Key Highlights:

✅ Architecture: Clean separation between validator and plugin layers
✅ Functionality: Proper exact matching with intelligent normalization
✅ Integration: Well-integrated with existing LinkML validation framework
✅ Testing: Comprehensive test suite covering edge cases
✅ Documentation: Excellent documentation with clear examples

Recommended Actions:

✅ Approve - The implementation is production-ready
Consider the minor improvements suggested above for future iterations
Address the overly broad interface matching condition to prevent false positives

Overall Assessment: Strong implementation ready for merge ⭐

Copilot AI review requested due to automatic review settings January 8, 2026 02:01

Copilot started reviewing on behalf of cmungall January 8, 2026 02:01 View session

Copilot AI reviewed Jan 8, 2026

View reviewed changes

Copilot AI mentioned this pull request Jan 8, 2026

Fix duplicate validations and overly broad interface matching in title validation #21

Merged

cmungall merged commit cae9c8b into main Jan 8, 2026
5 checks passed

cmungall deleted the feature/title-validation branch January 8, 2026 02:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add title validation with dcterms:title support #20

Add title validation with dcterms:title support #20

Uh oh!

cmungall commented Jan 8, 2026

Uh oh!

claude bot commented Jan 8, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 8, 2026

Uh oh!

Copilot AI Jan 8, 2026

Uh oh!

Copilot AI Jan 8, 2026

Uh oh!

Copilot AI Jan 8, 2026

Uh oh!

Copilot AI Jan 8, 2026

Uh oh!

cmungall commented Jan 8, 2026

Uh oh!

Copilot AI commented Jan 8, 2026

Uh oh!

claude bot commented Jan 8, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	if "dcterms:title" in interface or "title" in interface.lower():
	interface_lower = interface.lower()
	if interface_lower == "dcterms:title" or interface_lower == "title":

Add title validation with dcterms:title support #20

Add title validation with dcterms:title support #20

Uh oh!

Conversation

cmungall commented Jan 8, 2026

Summary

Title Validation Behavior

Schema Usage

Test plan

Uh oh!

claude bot commented Jan 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Title Validation Implementation

Todo List

Code Quality & Best Practices ✅

Potential Bugs & Issues ⚠️

Performance Considerations 🚀

Security Concerns 🔒

Test Coverage 📋

Summary 🎯

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

cmungall commented Jan 8, 2026

Uh oh!

Copilot AI commented Jan 8, 2026

Uh oh!

claude bot commented Jan 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Title Validation Implementation

Todo List

Code Quality & Best Practices ✅

Potential Bugs & Issues ⚠️

Performance Considerations 🚀

Security Concerns 🔒

Test Coverage 📋

Summary 🎯

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude bot commented Jan 8, 2026 •

edited

Loading

claude bot commented Jan 8, 2026 •

edited

Loading