Description
Expected behaviour
- validate imported XML metadata without any problems aside from the data's own validity
- have control over schema files associated with a certain namespace URI
- https://www.archivematica.org/en/docs/archivematica-1.18/user-manual/transfer/import-metadata/#metadata-xml-validation
Current behaviour
XML metadata in the library context comes in many different flavours. Data producers can often only provide metadata in an XML schema supported by their cataloguing or presentation systems. Some XML can be a challenge to work with, but we cannot skip validation for two reasons: we have to ensure the given metadata is valid for archiving, and we must not break the AIP METS XML, since the metadata gets embedded into it. Two general scenarios still cause problems for us:
- (1) schema files with unavailable resources
- temporarily unavailable (hosting issues, rate limiting),
- permanently unavailable because of outdated URLs in old (but still used) schemas, or
- unavailable during revalidation/re-ingest of older AIPs
- (2) complex schemas with multiple layers of imports/includes
- multiple nested dependencies spanning more than one host
- imports with relative paths
- circular imports
The current implementation provides the XML parser with a URI mapping when reading the XML metadata files, but then struggles with parsing the schema files themselves. The URI mapping does not extend to the custom resolver that is used in this case, so the parser simply fails on unknown or unavailable schemaLocation URLs for namespaces used in XSD files and their imports.
Example for (1): OAI_DC (https://www.openarchives.org/OAI/openarchivesprotocol.html#dublincore)
- container for DC elements, many public facing presentation systems have an OAI interface for metadata harvesting
- the schema imports an old DC version via an HTTP URL that now returns HTTP 403; the file is only accessible via HTTPS, and no redirect is set up
- job "Generate METS.xml document" fails
Error(s) processing and/or validating XML metadata:
- Could not parse schema file: file:///etc/xml/slub/schemas/oai-dc/oai_dc.xsd
- Element '{http://www.w3.org/2001/XMLSchema}element', attribute 'ref': The QName value '{http://purl.org/dc/elements/1.1/}title' does not resolve to a(n) element declaration., line 23
Example for (2): LIDO v1.1 (https://lido-schema.org/schema/v1.1/lido-v1.1.html)
- a general object description schema for sharing metadata, often related to portals covering photography and physical objects
- amongst other things, the schema imports a specific version of GML (https://schemas.liquid-technologies.com/GML/3.1.1/?page=gml_xsd.html); GML itself is split into ~30 XSDs with some additional imports
- job "Generate METS.xml document" fails
Error(s) processing and/or validating XML metadata:
- Could not parse schema file: file:///etc/xml/slub/schemas/lido-1.1/lido-v1.1.xsd
- Element '{http://www.w3.org/2001/XMLSchema}include': Failed to load the document 'dynamicFeature.xsd' for inclusion., line 14
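To see which schemaLocation values a schema tree like this pulls in, and which of them are not yet covered by a local file, a small walker can help. The following is only a sketch using the Python standard library; the function name `collect_schema_refs` and its `mapping` argument are hypothetical and not part of Archivematica. It follows `xs:import`, `xs:include` and `xs:redefine`, resolves relative paths against the including file, and guards against circular imports.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XS = "{http://www.w3.org/2001/XMLSchema}"
REFERENCE_TAGS = (XS + "import", XS + "include", XS + "redefine")


def collect_schema_refs(path, mapping, seen=None, unmapped=None):
    """Walk the schema references reachable from a local XSD file.

    Returns (seen, unmapped): the local files visited, and the
    schemaLocation values that could not be resolved to a local file.
    `mapping` is a hypothetical {schemaLocation URL: local path} dict.
    """
    seen = set() if seen is None else seen
    unmapped = set() if unmapped is None else unmapped

    path = os.path.abspath(path)
    if path in seen:
        # Already visited: this is what stops circular imports.
        return seen, unmapped
    seen.add(path)

    root = ET.parse(path).getroot()
    for child in root:
        if child.tag not in REFERENCE_TAGS:
            continue
        location = child.get("schemaLocation")
        if not location:
            continue
        if location in mapping:
            # An explicit override wins over everything else.
            target = mapping[location]
        elif urlparse(location).scheme in ("http", "https"):
            # Remote resource without a local override.
            unmapped.add(location)
            continue
        else:
            # Relative (or plain local) path: resolve against the
            # directory of the file that references it.
            target = os.path.join(os.path.dirname(path), location)
        if os.path.exists(target):
            collect_schema_refs(target, mapping, seen, unmapped)
        else:
            unmapped.add(location)
    return seen, unmapped
```

Calling `collect_schema_refs("/etc/xml/slub/schemas/lido-1.1/lido-v1.1.xsd", {})` would then list every remote or missing schemaLocation that still needs a local copy or an override entry.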
Proposal
artefactual/archivematica#2225
The initial prototype code aims to extend the capabilities of the custom resolver. It forwards the existing URI mapping (a dictionary) to the resolver, which now searches it for matches first and falls back to the existing resolution logic only if there is no match. Users can add further mappings to the configuration file that holds the dictionary.
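The lookup order described here (mapping first, existing behaviour as fallback) can be illustrated with a minimal sketch. This is not the code from the pull request: the function `resolve_schema_location` and the stand-in `existing_behaviour` are hypothetical names, and the dictionary is just a one-entry excerpt of the `XML_VALIDATION` setting. In a real resolver, logic like this would presumably sit inside the resolver's own lookup hook.

```python
from typing import Callable, Mapping, Optional


def resolve_schema_location(
    url: str,
    mapping: Mapping[str, str],
    fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a local path for `url`, consulting `mapping` before `fallback`."""
    local_path = mapping.get(url)
    if local_path is not None:
        # A configured override always wins.
        return local_path
    # No match: defer to whatever the resolver did before.
    return fallback(url)


# One-entry excerpt of the XML_VALIDATION dictionary, for illustration.
XML_VALIDATION = {
    "http://dublincore.org/schemas/xmls/simpledc20021212.xsd":
        "/etc/xml/slub/schemas/oai-dc/requires/simpledc20021212.xsd",
}


def existing_behaviour(url: str) -> Optional[str]:
    # Hypothetical stand-in for the resolver's current logic; in this
    # sketch it simply reports that it cannot resolve the URL.
    return None
```

With this ordering, a schemaLocation that the remote host no longer serves is never requested as long as an override exists, and unmapped URLs behave exactly as before.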
Example (my current configuration):
# Validation error behaviour
XML_VALIDATION_FAIL_ON_ERROR = True
# Validation URI mapping
XML_VALIDATION = {
#
# Local XML schema files (provided by SLUBArchiv.digital XML catalog package)
"http://purl.org/dc/elements/1.1/": "/etc/xml/slub/schemas/dc-1.1/dc.xsd", # descriptive producer XML metadata ("data/metadata" only)
"http://slubarchiv.slub-dresden.de/bag_info_metadata1": "/etc/xml/slub/schemas/slub-bag-info-metadata-1.0/bag_info_metadata1.xsd", # mapped bag-info.txt metadata
"http://slubarchiv.slub-dresden.de/other_tag_metadata1": "/etc/xml/slub/schemas/slub-other-tag-metadata-1.0/other_tag_metadata1.xsd", # mapped descriptive producer non-XML metadata ("data/metadata" only)
"http://slubarchiv.slub-dresden.de/rights1": "/etc/xml/slub/schemas/slub-rights-1.0/rights1.xsd", # producer rights declaration
"http://slubarchiv.slub-dresden.de/sigprops1": "/etc/xml/slub/schemas/slub-sigprops-1.1/sigprops.xsd", # significant properties of ingest workflow
"http://www.opengis.net/gml": "/etc/xml/slub/schemas/lido-1.1/requires/gml-3.1.1.2/gml.xsd", # descriptive producer XML metadata ("data/metadata" only), required for LIDO 1.1
"http://www.lido-schema.org": "/etc/xml/slub/schemas/lido-1.1/lido-v1.1.xsd", # descriptive producer XML metadata ("data/metadata" only)
"http://www.loc.gov/MARC21/slim": "/etc/xml/slub/schemas/marcxml-1.2/MARC21slim.xsd", # descriptive producer XML metadata ("data/metadata" only)
"http://www.loc.gov/METS/": "/etc/xml/slub/schemas/mets-1.12.1/mets.xsd", # descriptive producer XML metadata ("data/metadata" only)
"http://www.loc.gov/mods/v3": "/etc/xml/slub/schemas/mods-3.8/mods-3-8.xsd", # descriptive producer XML metadata ("data/metadata" only)
"http://www.loc.gov/standards/alto/ns-v2#": "/etc/xml/slub/schemas/alto-2.0/alto-v2.0.xsd", # descriptive producer XML metadata ("data/metadata" only)
"http://www.openarchives.org/OAI/2.0/oai_dc/": "/etc/xml/slub/schemas/oai-dc/oai_dc.xsd", # descriptive producer XML metadata ("data/metadata" only)
"http://www.tei-c.org/ns/1.0": "/etc/xml/slub/schemas/tei-4.6.0/xsd/tei_all.xsd", # descriptive producer XML metadata ("data/metadata" only), required for LIDO 1.1
#
# 'schemaLocation' overrides (redirect via resolver to local XML schema files)
"http://dublincore.org/schemas/xmls/simpledc20021212.xsd": "/etc/xml/slub/schemas/oai-dc/requires/simpledc20021212.xsd",
"http://www.w3.org/2001/03/xml.xsd": "/etc/xml/slub/schemas/xml-1.0/xml.2001.03.xsd",
"http://www.w3.org/1999/xlink.xsd": "/etc/xml/slub/schemas/xlink-1.1/xlink.xsd",
"http://www.loc.gov/mods/xml.xsd": "/etc/xml/slub/schemas/mods-3.8/xml.xsd",
"http://www.loc.gov/standards/xlink/xlink.xsd": "/etc/xml/slub/schemas/mods-3.8/xlink.xsd",
"http://schemas.opengis.net/gml/3.1.1/base/gml.xsd": "/etc/xml/slub/schemas/lido-1.1/requires/gml-3.1.1.2/gml.xsd",
"http://schemas.opengis.net/gml/3.1.1/smil/smil20.xsd": "/etc/xml/slub/schemas/gml-3.3.1/requires/smil-2.0/smil20.xsd",
}
This approach could probably be expanded further, since for now it only affects the parsing of XSD files.
In summary, my motivations for this are:
- to provide a way forward on broken external schema resources
- to not be rate limited or blocked by external sites
- to overall stop or heavily reduce outgoing schema-related traffic (and potentially even decrease loading times)
- to have a user-controlled local schema file catalogue, which addresses security concerns
Steps to reproduce
- use parts of the config above
- import metadata test files for scenarios (1) and (2)
Your environment (version of Archivematica, operating system, other relevant details)
- Archivematica 1.18.0, Storage Service 0.24.0, package install
- Ubuntu 24.04 LTS
For Artefactual use:
Before you close this issue, you must check off the following:
- All pull requests related to this issue are properly linked
- All pull requests related to this issue have been merged
- A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
- Documentation regarding this issue has been written and merged (if applicable)
- Details about this issue have been added to the release notes (if applicable)