Add JSON extraction for citation author and editor metadata #238
jordanpadams merged 7 commits into main
Replaces flattened individual fields for citation authors and editors with JSON-based extraction to preserve the complete nested structure of Person and Organization elements. This enables richer citation metadata in the Solr registry while maintaining all relationships between contributors, affiliations, and organizations.

Changes:
- Add getValuesAsJsonFromDoc() method in XMLExtractor for XML-to-JSON conversion
- Update Pds4MetExtractor to detect "_json" suffix and use JSON extraction
- Replace individual citation author/editor XPath expressions with JSON variants in global-policy.xml
- Update bundle.xml field mappings to use new JSON field names
- Add corresponding field definitions in Solr managed-schema.xml
- Include test data for pvmp_dlbi bundle
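The "_json" suffix check described above could look roughly like the following. This is a minimal sketch, not the code from Pds4MetExtractor: the class name `SlotDispatch`, the method names, and the slot names used in the usage note are hypothetical, and the two extraction strategies are passed in as plain functions so the example stays self-contained.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper; the real logic lives in Pds4MetExtractor and may differ.
final class SlotDispatch {
    static final String JSON_SUFFIX = "_json";

    // True when the slot name signals that the whole XML subtree should be kept as JSON.
    static boolean usesJsonExtraction(String slotName) {
        return slotName != null && slotName.endsWith(JSON_SUFFIX);
    }

    // Routes the XPath expression to the JSON extractor or the plain-value extractor.
    static List<String> extract(String slotName, String xpath,
                                Function<String, List<String>> plainExtractor,
                                Function<String, List<String>> jsonExtractor) {
        return usesJsonExtraction(slotName)
                ? jsonExtractor.apply(xpath)
                : plainExtractor.apply(xpath);
    }
}
```

With this shape, a slot named, say, `citation_author_json` would go through the JSON path, while `citation_author` would keep the existing plain-value behavior.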
Pull request overview
This PR adds JSON extraction capability for citation author and editor metadata to preserve the nested structure of Person and Organization elements. Previously, these complex nested relationships were flattened into individual fields, losing the associations between contributors, their affiliations, and organizations. The new approach converts XML subtrees to JSON strings, maintaining the complete hierarchical structure.
- Adds `getValuesAsJsonFromDoc()` method in XMLExtractor for XML-to-JSON conversion using org.json library
- Updates field extraction logic to detect "_json" suffix and use JSON extraction
- Replaces 22+ flattened individual citation XPath fields with 4 JSON-based fields per product type
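To illustrate what "one JSON string per matched XML subtree" means, here is a rough sketch using only the JDK. It is not the PR's implementation, which uses Saxon and org.json; the class `XmlSubtreeToJson`, its methods, and the sample element names in the usage note are assumptions made for illustration. Attributes and mixed content are ignored to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical JDK-only stand-in for the Saxon + org.json conversion in the PR.
final class XmlSubtreeToJson {

    // Returns one JSON string per element matched by the XPath expression.
    static List<String> valuesAsJson(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                if (nodes.item(i) instanceof Element) {
                    Element e = (Element) nodes.item(i);
                    out.add("{" + quote(e.getTagName()) + ":" + toJson(e) + "}");
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract JSON for " + xpath, e);
        }
    }

    // Leaf elements become strings; repeated child names become arrays.
    private static String toJson(Element element) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        NodeList kids = element.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid instanceof Element) {
                children.computeIfAbsent(((Element) kid).getTagName(), k -> new ArrayList<>())
                        .add(toJson((Element) kid));
            }
        }
        if (children.isEmpty()) {
            return quote(element.getTextContent().trim());
        }
        StringBuilder sb = new StringBuilder("{");
        String sep = "";
        for (Map.Entry<String, List<String>> entry : children.entrySet()) {
            List<String> values = entry.getValue();
            sb.append(sep).append(quote(entry.getKey())).append(":")
              .append(values.size() == 1 ? values.get(0) : "[" + String.join(",", values) + "]");
            sep = ",";
        }
        return sb.append("}").toString();
    }

    // Minimal JSON string escaping.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append("\"").toString();
    }
}
```

For an illustrative Person element with two Affiliation children, the affiliations stay grouped under that person as a JSON array, which is the association the flattened fields could not express.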
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/test/resources/data/pds4/pvmp_dlbi/bundle.xml | Adds test data file with complex author/editor structures including multiple affiliations |
| src/test/resources/data/pds4/pvmp_dlbi/README.md | Documents test case expectations for JSON extraction validation |
| registry-mgr-legacy/src/main/resources/collections/data/managed-schema.xml | Removes 22 flattened citation fields, adds 4 JSON fields for authors/editors |
| harvest-legacy/src/main/resources/policy/global-policy.xml | Replaces individual XPath mappings with JSON variants across all 14 product types |
| harvest-legacy/src/main/resources/conf/search/defaults/pds/pds4/bundle.xml | Updates bundle field mappings to use new JSON field names |
| harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/util/XMLExtractor.java | Implements XML-to-JSON conversion methods using Saxon and org.json |
| harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/crawler/metadata/extractor/Pds4MetExtractor.java | Adds conditional logic to extract JSON when slot name ends with "_json" |
| jsonStrings.add(jsonObject.toString()); | ||
| } catch (Exception e) { | ||
| // If conversion fails, log and skip this node | ||
| System.err.println("Warning: Failed to convert XML node to JSON: " + e.getMessage()); |
Using System.err.println for error logging is not a best practice. Consider using a proper logging framework (e.g., SLF4J or java.util.logging) for consistent error handling and better log management.
@copilot open a new pull request to apply changes based on this feedback
nutjob4life left a comment
Code delta: almost 99% great!
Tests: Maven has no complaints, but the procedure in src/test/resources/data/pds4/pvmp_dlbi/README.md was not run as it requires a local Solr
Approval: almost 99% there! 😉
harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/util/XMLExtractor.java
@jordanpadams I've opened a new pull request, #239, to work on those changes. Once the pull request is ready, I'll request review from you.
- Add java.util.logging.Logger import
- Initialize static final Logger instance for XMLExtractor class
- Replace System.err.println with log.warning() for error logging

This addresses the code review feedback to use a proper logging framework instead of System.err for consistent error handling and better log management.

Co-authored-by: jordanpadams <33492486+jordanpadams@users.noreply.github.com>
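The logging change described in this commit follows a common java.util.logging pattern, sketched below. This is only an approximation of the fix, not the actual XMLExtractor code: the class `ConversionLogging`, the method `convertAll`, and the generic converter function are hypothetical stand-ins for the node-to-JSON loop shown in the reviewed snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical class; mirrors the "log and skip" behavior of the reviewed catch block.
final class ConversionLogging {
    // One static logger per class, named after the class.
    private static final Logger log = Logger.getLogger(ConversionLogging.class.getName());

    // Converts each node, logging a warning and skipping any node that fails.
    static <T> List<String> convertAll(List<T> nodes, Function<T, String> toJson) {
        List<String> jsonStrings = new ArrayList<>();
        for (T node : nodes) {
            try {
                jsonStrings.add(toJson.apply(node));
            } catch (RuntimeException e) {
                // Replaces System.err.println: the message now goes through the logging framework.
                log.warning("Failed to convert XML node to JSON: " + e.getMessage());
            }
        }
        return jsonStrings;
    }
}
```

Routing the message through a Logger lets the handlers and levels configured for the application decide where the warning ends up, rather than always writing to standard error.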
Address feedback on JSON extraction for citation metadata