Vocabularies: ORCID Names conversion command includes full ORCID records activities instead of normalized name entries #3351

@Samk13

Description

Package version (if known): latest

Describe the bug

For some records, with no apparent pattern, the invenio vocabularies convert command for the names vocabulary outputs the full ORCID record data instead of a normalized names vocabulary entry.

Steps to Reproduce

  1. Download the gigantic ORCID dump.
  2. Convert the dump by following the [steps here](https://inveniordm.docs.cern.ch/operate/customize/vocabularies/names/#creating-a-namesyaml-file), which will take a day or so.
  3. Inspect the start of the generated file: head -n 1000 names_dump_2025.yaml > test.yaml
  4. Notice the large file size, 13.58 GB; it should usually be around 4 GB.
  5. Observe entries containing raw ORCID record fields (e.g. @ xmlns:*, activities-summary, history) instead of name fields.
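
To make step 5 less manual, here is a minimal Python sketch that flags entries which look like raw ORCID records. It works on entries that have already been parsed into Python dicts (for example, from the 1000-line sample). The marker keys come only from the fields named in this report, and the helper names `has_raw_orcid_fields` and `find_raw_entries` are hypothetical. None of this is part of invenio-vocabularies.

```python
from typing import Any, List

# Assumption: these markers are the raw fields mentioned in this report;
# a real dump may contain other raw keys as well.
RAW_KEYS = {"activities-summary", "history"}
RAW_KEY_PREFIXES = ("@xmlns", "@ xmlns")


def has_raw_orcid_fields(value: Any) -> bool:
    """Return True if any nested mapping key looks like raw ORCID record data."""
    if isinstance(value, dict):
        for key, child in value.items():
            key = str(key)
            if key in RAW_KEYS or key.startswith(RAW_KEY_PREFIXES):
                return True
            if has_raw_orcid_fields(child):
                return True
        return False
    if isinstance(value, list):
        return any(has_raw_orcid_fields(item) for item in value)
    return False


def find_raw_entries(entries: List[Any]) -> List[int]:
    """Return the indexes of entries that contain raw ORCID record fields."""
    return [i for i, entry in enumerate(entries) if has_raw_orcid_fields(entry)]


# Hypothetical sample: one normalized entry and one raw-looking entry.
sample = [
    {"given_name": "Josiah", "family_name": "Carberry"},
    {"record": {"@xmlns:common": "placeholder", "activities-summary": {}}},
]
print(find_raw_entries(sample))
```

On the sample above, only the second entry (index 1) is flagged. The same check could be run over the parsed contents of test.yaml to estimate how many entries are affected.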

Expected behavior

Each ORCID record is converted into a normalized names vocabulary entry containing only name fields, identifiers, and optional affiliations, or is skipped if required name data is missing.
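
The expected behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the invenio-vocabularies transformer. The input is a hypothetical, already-flattened dict, and the function name `to_names_entry` is made up for this example. The output keys (`id`, `given_name`, `family_name`, `identifiers`, `affiliations`) are assumed to match the names vocabulary format described in the linked documentation. Treating both name parts and the ORCID iD as required is also an assumption.

```python
from typing import Optional


def to_names_entry(person: dict) -> Optional[dict]:
    """Build a normalized names entry, or return None to skip the record.

    Assumption: `person` is a hypothetical flattened record with the keys
    "orcid", "given_name", "family_name" and an optional "affiliations" list.
    """
    given = (person.get("given_name") or "").strip()
    family = (person.get("family_name") or "").strip()
    orcid = (person.get("orcid") or "").strip()

    # Skip records that lack the data assumed to be required here.
    if not (given and family and orcid):
        return None

    # Only allow-listed fields are copied, so raw ORCID keys cannot leak through.
    entry = {
        "id": orcid,
        "given_name": given,
        "family_name": family,
        "identifiers": [{"scheme": "orcid", "identifier": orcid}],
    }
    affiliations = [{"name": name} for name in person.get("affiliations") or [] if name]
    if affiliations:
        entry["affiliations"] = affiliations
    return entry


# Hypothetical input that carries a raw field alongside the name data.
raw = {
    "orcid": "0000-0002-1825-0097",
    "given_name": "Josiah",
    "family_name": "Carberry",
    "activities-summary": {"educations": {}},
}
print(to_names_entry(raw))
print(to_names_entry({"orcid": "0000-0002-1825-0097", "given_name": "Josiah"}))
```

Building the entry from an allow-list, rather than copying the record and deleting unwanted keys, means any unexpected raw field is dropped by construction.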

Screenshots (if applicable)

test.yaml

Additional context

Command used:

invenio vocabularies convert \
  --vocabulary names \
  --origin /path/to/ORCID_2025_summaries.tar.gz \
  --target names.yaml

This commit might be related to the issue: inveniosoftware/invenio-vocabularies@fb68221

Labels: bug