Add JSON extraction for citation author and editor metadata #238

Merged
jordanpadams merged 7 commits into main from i237
Jan 10, 2026

Member

@jordanpadams jordanpadams commented Dec 31, 2025

🗒️ Summary

Replaces flattened individual fields for citation authors and editors with JSON-based extraction to preserve the complete nested structure of Person and Organization elements. This enables richer citation metadata in the Solr registry while maintaining all relationships between contributors, affiliations, and organizations.

Changes:

  • Add getValuesAsJsonFromDoc() method in XMLExtractor for XML-to-JSON conversion
  • Update Pds4MetExtractor to detect "_json" suffix and use JSON extraction
  • Replace individual citation author/editor XPath expressions with JSON variants in global-policy.xml
  • Update bundle.xml field mappings to use new JSON field names
  • Add corresponding field definitions in Solr managed-schema.xml
  • Include test data for pvmp_dlbi bundle

⚙️ Test Data and/or Report

  1. See GitHub Actions run
  2. From a local Solr query:
        "citation_editor_person_json":["{\"Person\":{\"sequence_number\":1,\"display_full_name\":\"Young-In Won\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"person_orcid\":\"https://orcid.org/0009-0003-0452-774X\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"Young-In\",\"family_name\":\"Won\",\"Affiliation\":[{\"organization_name\":\"ADNET Systems, Inc.\"},{\"organization_name\":\"NASA Space Science Data Coordinated Archive\"}]}}",
          "{\"Person\":{\"sequence_number\":2,\"display_full_name\":\"Stephanie A. McLaughlin\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"Stephanie A.\",\"family_name\":\"McLaughlin\",\"Affiliation\":[{\"organization_name\":\"Telophase Corporation\"},{\"organization_name\":\"NASA Space Science Data Coordinated Archive\"}]}}",
          "{\"Person\":{\"sequence_number\":3,\"display_full_name\":\"Lynn D. V. Neakrase\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"person_orcid\":\"https://orcid.org/0000-0002-6370-5791\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"Lynn D. V.\",\"family_name\":\"Neakrase\",\"Affiliation\":[{\"organization_name\":\"New Mexico State University\"},{\"organization_name\":\"NASA Planetary Data System\"}]}}",
          "{\"Person\":{\"sequence_number\":4,\"display_full_name\":\"Nancy J. Chanover\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"person_orcid\":\"https://orcid.org/0000-0002-9984-4670\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"Nancy J.\",\"family_name\":\"Chanover\",\"Affiliation\":[{\"organization_name\":\"New Mexico State University\"},{\"organization_name\":\"NASA Planetary Data System\"}]}}",
          "{\"Person\":{\"sequence_number\":5,\"display_full_name\":\"David R. Williams\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"person_orcid\":\"https://orcid.org/0000-0003-2187-2716\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"David R.\",\"family_name\":\"Williams\",\"Affiliation\":[{\"organization_name\":\"Goddard Space Flight Center\"},{\"organization_name\":\"NASA Space Science Data Coordinated Archive\"}]}}"],
  3. From ds-view:
Screenshot 2026-01-06 at 4 20 52 PM

♻️ Related Issues

@jordanpadams jordanpadams requested a review from a team as a code owner December 31, 2025 00:23
@jordanpadams jordanpadams marked this pull request as draft December 31, 2025 00:23
@jordanpadams jordanpadams requested a review from Copilot December 31, 2025 00:24
Copilot AI left a comment

Pull request overview

This PR adds JSON extraction capability for citation author and editor metadata to preserve the nested structure of Person and Organization elements. Previously, these complex nested relationships were flattened into individual fields, losing the associations between contributors, their affiliations, and organizations. The new approach converts XML subtrees to JSON strings, maintaining the complete hierarchical structure.

  • Adds getValuesAsJsonFromDoc() method in XMLExtractor for XML-to-JSON conversion using org.json library
  • Updates field extraction logic to detect "_json" suffix and use JSON extraction
  • Replaces 22+ flattened individual citation XPath fields with 4 JSON-based fields per product type
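To illustrate how converting an XML subtree to a JSON string keeps contributors and their affiliations together, here is a minimal sketch that uses only the JDK's DOM parser and a hand-written serializer. It is not the PR's `getValuesAsJsonFromDoc()` implementation, which the file summary says uses Saxon and org.json. The class name `XmlToJsonSketch` is hypothetical, attributes and namespace declarations are ignored, and every leaf value is emitted as a string, so numeric values such as `sequence_number` would not come out as numbers the way they do in the Solr output shown earlier.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical JDK-only converter; a stand-in for the Saxon/org.json approach. */
final class XmlToJsonSketch {

    /** Parses an XML string and returns {"rootName": value} as JSON text. */
    static String toJson(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return "{" + quote(root.getTagName()) + ":" + value(root) + "}";
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not convert XML to JSON", e);
        }
    }

    /** Leaf elements become JSON strings; elements with children become objects. */
    private static String value(Element element) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        NodeList nodes = element.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                children.computeIfAbsent(node.getNodeName(), k -> new ArrayList<>())
                        .add(value((Element) node));
            }
        }
        if (children.isEmpty()) {
            return quote(element.getTextContent().trim());
        }
        List<String> members = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : children.entrySet()) {
            List<String> values = entry.getValue();
            // Repeated sibling names (e.g. two Affiliation elements) become a JSON array.
            String json = values.size() == 1
                    ? values.get(0)
                    : "[" + String.join(",", values) + "]";
            members.add(quote(entry.getKey()) + ":" + json);
        }
        return "{" + String.join(",", members) + "}";
    }

    /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
    private static String quote(String text) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : text.toCharArray()) {
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        String xml = "<Person><display_full_name>Young-In Won</display_full_name>"
                + "<Affiliation><organization_name>ADNET Systems, Inc.</organization_name></Affiliation>"
                + "<Affiliation><organization_name>NASA Space Science Data Coordinated Archive"
                + "</organization_name></Affiliation></Person>";
        System.out.println(toJson(xml));
    }
}
```

Because each `Person` subtree is serialized as one string, both affiliations stay attached to the person they belong to, which is the association that flattened per-field values lose.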

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/test/resources/data/pds4/pvmp_dlbi/bundle.xml | Adds test data file with complex author/editor structures including multiple affiliations |
| src/test/resources/data/pds4/pvmp_dlbi/README.md | Documents test case expectations for JSON extraction validation |
| registry-mgr-legacy/src/main/resources/collections/data/managed-schema.xml | Removes 22 flattened citation fields, adds 4 JSON fields for authors/editors |
| harvest-legacy/src/main/resources/policy/global-policy.xml | Replaces individual XPath mappings with JSON variants across all 14 product types |
| harvest-legacy/src/main/resources/conf/search/defaults/pds/pds4/bundle.xml | Updates bundle field mappings to use new JSON field names |
| harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/util/XMLExtractor.java | Implements XML-to-JSON conversion methods using Saxon and org.json |
| harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/crawler/metadata/extractor/Pds4MetExtractor.java | Adds conditional logic to extract JSON when slot name ends with "_json" |

        jsonStrings.add(jsonObject.toString());
    } catch (Exception e) {
        // If conversion fails, log and skip this node
        System.err.println("Warning: Failed to convert XML node to JSON: " + e.getMessage());
Copilot AI Dec 31, 2025

Using System.err.println for error logging is not a best practice. Consider using a proper logging framework (e.g., SLF4J or java.util.logging) for consistent error handling and better log management.
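As a rough illustration of this suggestion, the log-and-skip loop could be written with `java.util.logging` along these lines. This is a sketch under assumptions, not the change that was actually committed: the class name `ConversionLoggingSketch`, the generic `convertAll` helper, and the converter function are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper showing log-and-skip with java.util.logging. */
final class ConversionLoggingSketch {
    private static final Logger log =
            Logger.getLogger(ConversionLoggingSketch.class.getName());

    /** Converts each node, logging a warning and skipping any node that fails. */
    static <T> List<String> convertAll(List<T> nodes, Function<T, String> converter) {
        List<String> jsonStrings = new ArrayList<>();
        for (T node : nodes) {
            try {
                jsonStrings.add(converter.apply(node));
            } catch (RuntimeException e) {
                // If conversion fails, log and skip this node
                log.log(Level.WARNING, "Failed to convert XML node to JSON", e);
            }
        }
        return jsonStrings;
    }
}
```

Passing the exception to `log.log` keeps the stack trace in the log record; a plain `log.warning(message)` call would also satisfy the review comment if only the message is wanted.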

Member Author

@copilot open a new pull request to apply changes based on this feedback

Member

@nutjob4life nutjob4life left a comment

Code delta: almost 99% great!
Tests: Maven has no complaints, but the procedure in src/test/resources/data/pds4/pvmp_dlbi/README.md was not run because it requires a local Solr instance.
Approval: almost 99% there! 😉

Contributor

Copilot AI commented Jan 7, 2026

@jordanpadams I've opened a new pull request, #239, to work on those changes. Once the pull request is ready, I'll request review from you.

@jordanpadams jordanpadams marked this pull request as ready for review January 7, 2026 00:26
Copilot AI and others added 2 commits January 7, 2026 00:35
- Add java.util.logging.Logger import
- Initialize static final Logger instance for XMLExtractor class
- Replace System.err.println with log.warning() for error logging

This addresses the code review feedback to use a proper logging framework
instead of System.err for consistent error handling and better log management.

Co-authored-by: jordanpadams <33492486+jordanpadams@users.noreply.github.com>
Address feedback on JSON extraction for citation metadata
@sonarqubecloud
sonarqubecloud bot commented Jan 7, 2026

Member

@nutjob4life nutjob4life left a comment

Woot woot!

@jordanpadams jordanpadams merged commit 8cb18f2 into main Jan 10, 2026
3 checks passed
@jordanpadams jordanpadams deleted the i237 branch January 10, 2026 18:57

Development

Successfully merging this pull request may close these issues.

harvest-solr XML flattening does not preserve relationships between nested elements

4 participants