Description
Hello,
I've encountered an issue with the get_element() method where it can find elements when the path contains the index [1], but fails and returns None for any higher indexes like [2], [3], etc.
I'm not entirely sure if this is expected behavior or a regression, as the documentation doesn't explicitly specify how indexed XPath suffixes should be handled. However, this behavior is problematic for our use case where we automatically generate XPaths from XML documents using lxml.etree.getpath(), which includes these indexes starting from [1] for all elements in a sequence.
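To make the shape of these paths concrete, here is a rough, stdlib-only approximation of how the positional indexes end up in the generated paths. It is only a sketch, not lxml's implementation: `indexed_paths` is a hypothetical helper, namespaces are ignored, and it assumes an index is added only when a parent has several children with the same tag, which matches the paths in the output below.

```python
import xml.etree.ElementTree as ET


def indexed_paths(root):
    """Yield a getpath()-like path for every element, in document order.

    Assumption: a [n] index is added only when the parent has more than
    one child with the same tag (hypothetical helper, not an lxml API).
    """
    def walk(element, path):
        yield path
        children = list(element)
        # Count how many children share each tag.
        totals = {}
        for child in children:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        # Number repeated tags starting from 1.
        seen = {}
        for child in children:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            yield from walk(child, f"{path}/{step}")

    yield from walk(root, f"/{root.tag}")


root = ET.fromstring(
    "<catalog><book><title/></book><book><title/></book></catalog>"
)
for p in indexed_paths(root):
    print(p)
```

Every repeated `book` element gets its own index, so any path below the second or later occurrence carries an index greater than 1.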
Steps to Reproduce:
1. Have an XSD schema with a sequence of elements where the same name appears multiple times (e.g., via maxOccurs="unbounded").
2. Generate an XPath for elements in a corresponding XML file using lxml.etree.getpath().
3. Try to get the XSD element using schema.get_element(tag, path=path).
4. The method works for paths with [1] but returns None for paths with [2], [3], etc.
Minimal Reproducible Example:
import re

import xmlschema
from lxml import etree

xsd_path = '/path/to/test.xsd'
xml_path = '/path/to/test.xml'

schema = xmlschema.XMLSchema11(xsd_path)
xml_document = xmlschema.XmlDocument(xml_path, schema=schema)
xml_resource = xmlschema.XMLResource(source=xml_document)
namespaces = xml_resource.get_namespaces()

tree = etree.parse(xml_path)
for xml_element in tree.iter():
    tag = xml_element.tag
    path: str = tree.getpath(xml_element)

    # This call works for [1] but fails for [2], [3], etc.
    xsd_element = schema.get_element(
        tag,
        path=path,
        namespaces=namespaces,
    )
    if xsd_element is None:
        print(f'FAILED: XSD element not found for path="{path}", tag="{tag}"')
    else:
        print(f'SUCCESS for path: "{path}"')

    # Workaround: remove indexes
    cleaned_path = re.sub(r'\[\d+\]', '', path)
    xsd_element_by_cleaned_path = schema.get_element(
        tag,
        path=cleaned_path,
        namespaces=namespaces,
    )
    if xsd_element is None and xsd_element_by_cleaned_path is not None:
        print(f'SUCCESS with cleaned path: "{cleaned_path}"')

Test files:
- test.xml - https://www.paste.org/129573
- test.xsd - https://www.paste.org/129574
Actual Output:
Based on my test with a catalog of books, here's the actual output demonstrating the issue:
SUCCESS for path: "/catalog"
SUCCESS for path: "/catalog/book[1]"
SUCCESS for path: "/catalog/book[1]/title"
SUCCESS for path: "/catalog/book[1]/author"
SUCCESS for path: "/catalog/book[1]/year"
SUCCESS for path: "/catalog/book[1]/price"
SUCCESS for path: "/catalog/book[1]/category"
FAILED: XSD element not found for path="/catalog/book[2]", tag="book"
SUCCESS with cleaned path: "/catalog/book"
FAILED: XSD element not found for path="/catalog/book[2]/title", tag="title"
SUCCESS with cleaned path: "/catalog/book/title"
FAILED: XSD element not found for path="/catalog/book[2]/author", tag="author"
SUCCESS with cleaned path: "/catalog/book/author"
FAILED: XSD element not found for path="/catalog/book[2]/year", tag="year"
SUCCESS with cleaned path: "/catalog/book/year"
FAILED: XSD element not found for path="/catalog/book[2]/price", tag="price"
SUCCESS with cleaned path: "/catalog/book/price"
FAILED: XSD element not found for path="/catalog/book[2]/category", tag="category"
SUCCESS with cleaned path: "/catalog/book/category"
FAILED: XSD element not found for path="/catalog/book[3]", tag="book"
SUCCESS with cleaned path: "/catalog/book"
FAILED: XSD element not found for path="/catalog/book[3]/title", tag="title"
SUCCESS with cleaned path: "/catalog/book/title"
FAILED: XSD element not found for path="/catalog/book[3]/author", tag="author"
SUCCESS with cleaned path: "/catalog/book/author"
FAILED: XSD element not found for path="/catalog/book[3]/year", tag="year"
SUCCESS with cleaned path: "/catalog/book/year"
FAILED: XSD element not found for path="/catalog/book[3]/price", tag="price"
SUCCESS with cleaned path: "/catalog/book/price"
FAILED: XSD element not found for path="/catalog/book[3]/category", tag="category"
SUCCESS with cleaned path: "/catalog/book/category"
Questions:
1. Is this the expected behavior? Should get_element() only work with [1] but not with higher indexes?
2. If this is expected, what is the recommended approach to handle XPaths generated by lxml.etree.getpath() that include these indexes?
3. If this is a regression, could it be fixed to handle all indexes consistently?
Current workaround:
We're currently using a regex (\[\d+\]) to strip every numeric index such as [2] from the path before calling get_element(). This works, but it feels like a workaround rather than a proper solution.
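For reference, the workaround can be wrapped in a small helper. This is only a sketch of what we do on our side: `strip_positional_predicates` is a hypothetical name, not part of xmlschema or lxml. It removes purely numeric predicates and leaves any other predicate (for example an attribute test) untouched.

```python
import re

# Matches purely numeric positional predicates such as [1] or [23].
_POSITIONAL_PREDICATE = re.compile(r"\[\d+\]")


def strip_positional_predicates(path: str) -> str:
    """Return the path with every numeric [n] predicate removed.

    Hypothetical helper for illustration; not an xmlschema or lxml API.
    """
    return _POSITIONAL_PREDICATE.sub("", path)


print(strip_positional_predicates("/catalog/book[2]/title"))
```

The cleaned path is then passed to schema.get_element(tag, path=cleaned_path, namespaces=namespaces) exactly as in the example above.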
Environment:
- Python Version: 3.11
- xmlschema Version: 4.2.0
- lxml Version: 5.2.2
Additional context:
I suspect this might be a regression because our code worked with an older version of xmlschema (3.4.5).