Add JSON extraction for citation author and editor metadata by jordanpadams · Pull Request #238 · NASA-PDS/registry-legacy-solr

jordanpadams · 2025-12-31T00:23:18Z

🗒️ Summary

Replaces flattened individual fields for citation authors and editors with JSON-based extraction to preserve the complete nested structure of Person and Organization elements. This enables richer citation metadata in the Solr registry while maintaining all relationships between contributors, affiliations, and organizations.

Changes:

- Add getValuesAsJsonFromDoc() method in XMLExtractor for XML-to-JSON conversion
- Update Pds4MetExtractor to detect "_json" suffix and use JSON extraction
- Replace individual citation author/editor XPath expressions with JSON variants in global-policy.xml
- Update bundle.xml field mappings to use new JSON field names
- Add corresponding field definitions in Solr managed-schema.xml
- Include test data for pvmp_dlbi bundle
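
The suffix check described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual Pds4MetExtractor code: the class `JsonSlotDispatch`, its method names, the two extractor functions, and the XPath strings are all hypothetical. Only the "_json" suffix and the slot name `citation_editor_person_json` come from this PR.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of suffix-based dispatch; not the real Pds4MetExtractor. */
class JsonSlotDispatch {
    // The suffix named in the PR description.
    static final String JSON_SUFFIX = "_json";

    /** True when the slot name asks for JSON extraction. */
    static boolean isJsonSlot(String slotName) {
        return slotName != null && slotName.endsWith(JSON_SUFFIX);
    }

    /** Chooses between the two (assumed) extraction strategies by slot name. */
    static List<String> extract(String slotName, String xpath,
                                Function<String, List<String>> plainExtractor,
                                Function<String, List<String>> jsonExtractor) {
        return isJsonSlot(slotName)
                ? jsonExtractor.apply(xpath)
                : plainExtractor.apply(xpath);
    }

    public static void main(String[] args) {
        // Stand-in extractors that only tag which path was taken.
        Function<String, List<String>> plain = xp -> List.of("plain:" + xp);
        Function<String, List<String>> json = xp -> List.of("json:" + xp);

        // Illustrative XPath strings, not taken from global-policy.xml.
        System.out.println(extract("citation_editor_person_json", "//List_Editor/Person", plain, json));
        System.out.println(extract("title", "//Identification_Area/title", plain, json));
    }
}
```

Keying the behavior on the field name keeps the policy file as the single place where a field is switched between flattened and JSON extraction, at the cost of making the suffix a naming convention that new fields have to respect.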

⚙️ Test Data and/or Report

See GitHub Actions run
From a local Solr query:

        "citation_editor_person_json":["{\"Person\":{\"sequence_number\":1,\"display_full_name\":\"Young-In Won\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"person_orcid\":\"https://orcid.org/0009-0003-0452-774X\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"Young-In\",\"family_name\":\"Won\",\"Affiliation\":[{\"organization_name\":\"ADNET Systems, Inc.\"},{\"organization_name\":\"NASA Space Science Data Coordinated Archive\"}]}}",
          "{\"Person\":{\"sequence_number\":2,\"display_full_name\":\"Stephanie A. McLaughlin\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"Stephanie A.\",\"family_name\":\"McLaughlin\",\"Affiliation\":[{\"organization_name\":\"Telophase Corporation\"},{\"organization_name\":\"NASA Space Science Data Coordinated Archive\"}]}}",
          "{\"Person\":{\"sequence_number\":3,\"display_full_name\":\"Lynn D. V. Neakrase\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"person_orcid\":\"https://orcid.org/0000-0002-6370-5791\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"Lynn D. V.\",\"family_name\":\"Neakrase\",\"Affiliation\":[{\"organization_name\":\"New Mexico State University\"},{\"organization_name\":\"NASA Planetary Data System\"}]}}",
          "{\"Person\":{\"sequence_number\":4,\"display_full_name\":\"Nancy J. Chanover\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"person_orcid\":\"https://orcid.org/0000-0002-9984-4670\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"Nancy J.\",\"family_name\":\"Chanover\",\"Affiliation\":[{\"organization_name\":\"New Mexico State University\"},{\"organization_name\":\"NASA Planetary Data System\"}]}}",
          "{\"Person\":{\"sequence_number\":5,\"display_full_name\":\"David R. Williams\",\"xmlns\":\"http://pds.nasa.gov/pds4/pds/v1\",\"person_orcid\":\"https://orcid.org/0000-0003-2187-2716\",\"xmlns:xsi\":\"http://www.w3.org/2001/XMLSchema-instance\",\"given_name\":\"David R.\",\"family_name\":\"Williams\",\"Affiliation\":[{\"organization_name\":\"Goddard Space Flight Center\"},{\"organization_name\":\"NASA Space Science Data Coordinated Archive\"}]}}"],

from ds-view:

♻️ Related Issues

Copilot

Pull request overview

This PR adds JSON extraction capability for citation author and editor metadata to preserve the nested structure of Person and Organization elements. Previously, these complex nested relationships were flattened into individual fields, losing the associations between contributors, their affiliations, and organizations. The new approach converts XML subtrees to JSON strings, maintaining the complete hierarchical structure.

- Adds getValuesAsJsonFromDoc() method in XMLExtractor for XML-to-JSON conversion using org.json library
- Updates field extraction logic to detect "_json" suffix and use JSON extraction
- Replaces 22+ flattened individual citation XPath fields with 4 JSON-based fields per product type
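
To make the subtree-to-JSON idea concrete, here is a minimal sketch that uses only the JDK's built-in DOM and XPath APIs. The PR itself uses Saxon and org.json, so this is a stand-in technique and not the getValuesAsJsonFromDoc() implementation. The class `XmlSubtreeToJson` and its methods are hypothetical. Unlike the Solr output shown earlier, this sketch ignores attributes (so no `xmlns` keys) and emits every leaf as a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical JDK-only sketch; the PR uses Saxon and org.json instead. */
class XmlSubtreeToJson {

    /** Returns one JSON string per element matched by the XPath expression. */
    static List<String> valuesAsJson(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                if (hits.item(i) instanceof Element) {
                    Element e = (Element) hits.item(i);
                    // Wrap each match under its own tag name, e.g. {"Person":{...}}.
                    out.add("{" + quote(e.getTagName()) + ":" + toJson(e) + "}");
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("XML-to-JSON conversion failed", e);
        }
    }

    /** Leaf elements become strings; repeated child names become arrays. */
    static String toJson(Element e) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                Element c = (Element) n;
                children.computeIfAbsent(c.getTagName(), k -> new ArrayList<>()).add(toJson(c));
            }
        }
        if (children.isEmpty()) {
            return quote(e.getTextContent().trim());
        }
        List<String> members = new ArrayList<>();
        for (Map.Entry<String, List<String>> m : children.entrySet()) {
            List<String> v = m.getValue();
            String value = v.size() == 1 ? v.get(0) : "[" + String.join(",", v) + "]";
            members.add(quote(m.getKey()) + ":" + value);
        }
        return "{" + String.join(",", members) + "}";
    }

    /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char ch : s.toCharArray()) {
            if (ch == '"' || ch == '\\') {
                sb.append('\\').append(ch);
            } else if (ch < 0x20) {
                sb.append(String.format("\\u%04x", (int) ch));
            } else {
                sb.append(ch);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        String xml = "<List_Editor><Person><given_name>Young-In</given_name>"
                + "<family_name>Won</family_name>"
                + "<Affiliation><organization_name>ADNET Systems, Inc.</organization_name></Affiliation>"
                + "<Affiliation><organization_name>NASA Space Science Data Coordinated Archive</organization_name></Affiliation>"
                + "</Person></List_Editor>";
        valuesAsJson(xml, "//Person").forEach(System.out::println);
    }
}
```

Because each matched subtree is serialized as a whole, a person's affiliations stay attached to that person, which is the association that separate flattened fields cannot express.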

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

File	Description
src/test/resources/data/pds4/pvmp_dlbi/bundle.xml	Adds test data file with complex author/editor structures including multiple affiliations
src/test/resources/data/pds4/pvmp_dlbi/README.md	Documents test case expectations for JSON extraction validation
registry-mgr-legacy/src/main/resources/collections/data/managed-schema.xml	Removes 22 flattened citation fields, adds 4 JSON fields for authors/editors
harvest-legacy/src/main/resources/policy/global-policy.xml	Replaces individual XPath mappings with JSON variants across all 14 product types
harvest-legacy/src/main/resources/conf/search/defaults/pds/pds4/bundle.xml	Updates bundle field mappings to use new JSON field names
harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/util/XMLExtractor.java	Implements XML-to-JSON conversion methods using Saxon and org.json
harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/crawler/metadata/extractor/Pds4MetExtractor.java	Adds conditional logic to extract JSON when slot name ends with "_json"

Copilot · 2025-12-31T00:25:28Z

harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/util/XMLExtractor.java

+                    jsonStrings.add(jsonObject.toString());
+                } catch (Exception e) {
+                    // If conversion fails, log and skip this node
+                    System.err.println("Warning: Failed to convert XML node to JSON: " + e.getMessage());


Using System.err.println for error logging is not a best practice. Consider using a proper logging framework (e.g., SLF4J or java.util.logging) for consistent error handling and better log management.

@copilot open a new pull request to apply changes based on this feedback

nutjob4life

Code delta: almost 99% great!
Tests: Maven has no complaints, but the procedure in src/test/resources/data/pds4/pvmp_dlbi/README.md was not run as it requires a local Solr
Approval: almost 99% there! 😉

harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/util/XMLExtractor.java

Copilot · 2026-01-07T00:24:52Z

@jordanpadams I've opened a new pull request, #239, to work on those changes. Once the pull request is ready, I'll request review from you.

- Add java.util.logging.Logger import
- Initialize static final Logger instance for XMLExtractor class
- Replace System.err.println with log.warning() for error logging

This addresses the code review feedback to use a proper logging framework instead of System.err for consistent error handling and better log management.

Co-authored-by: jordanpadams <33492486+jordanpadams@users.noreply.github.com>
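
A rough sketch of the logging change described in that commit message, assuming the log-and-skip loop from the reviewed snippet. The class `LoggingConversionSketch`, the method `convertAll`, and the `converter` function are hypothetical stand-ins. The real XMLExtractor code may differ, and a later commit on this PR mentions propagating a runtime exception instead of skipping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;

/** Hypothetical sketch of swapping System.err.println for java.util.logging. */
class LoggingConversionSketch {
    // One static logger per class, named after the class.
    private static final Logger log =
            Logger.getLogger(LoggingConversionSketch.class.getName());

    /** Converts each node, logging a warning and skipping any that fail. */
    static List<String> convertAll(List<String> nodes, Function<String, String> converter) {
        List<String> jsonStrings = new ArrayList<>();
        for (String node : nodes) {
            try {
                jsonStrings.add(converter.apply(node));
            } catch (Exception e) {
                // Previously: System.err.println("Warning: Failed to convert ...")
                log.warning("Failed to convert XML node to JSON: " + e.getMessage());
            }
        }
        return jsonStrings;
    }

    public static void main(String[] args) {
        // Stand-in converter that fails on one input to exercise the warning path.
        Function<String, String> converter = s -> {
            if (s.equals("bad")) {
                throw new IllegalStateException("boom");
            }
            return "{" + s + "}";
        };
        System.out.println(convertAll(List.of("a", "bad", "c"), converter));
    }
}
```

Routing the message through a Logger lets the deployment decide the level, format, and destination of warnings, which System.err cannot offer.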

Address feedback on JSON extraction for citation metadata

sonarqubecloud · 2026-01-07T00:42:53Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

nutjob4life

Woot woot!

jordanpadams requested a review from a team as a code owner December 31, 2025 00:23

jordanpadams marked this pull request as draft December 31, 2025 00:23

jordanpadams requested a review from Copilot December 31, 2025 00:24

Copilot AI reviewed Dec 31, 2025

nutjob4life requested changes Dec 31, 2025

harvest-legacy/src/main/java/gov/nasa/pds/harvest/search/util/XMLExtractor.java

jordanpadams and others added 3 commits January 6, 2026 15:53

Update smoke tests and test data

110759a

Update with rorid for testing purposes

f4b5ce5

Initial plan

c1b5ba3

Copilot AI mentioned this pull request Jan 7, 2026

Address feedback on JSON extraction for citation metadata #239

Merged

6 tasks

jordanpadams marked this pull request as ready for review January 7, 2026 00:26

Propagate runtime exception per PR comment from @nutjob4life

6f84269

jordanpadams mentioned this pull request Jan 7, 2026

Update citation author/editor handling to use JSON fields with ORCID and ROR links NASA-PDS/ds-view#59

Merged

9 tasks

Copilot AI and others added 2 commits January 7, 2026 00:35

Merge pull request #239 from NASA-PDS/copilot/sub-pr-238

0395a03

Address feedback on JSON extraction for citation metadata

nutjob4life approved these changes Jan 7, 2026

jordanpadams merged commit 8cb18f2 into main Jan 10, 2026
3 checks passed

jordanpadams deleted the i237 branch January 10, 2026 18:57

jordanpadams merged 7 commits into main from i237


4 participants
