get_element() fails to find XSD element when path contains indexed suffixes starting from [2] - Expected behavior or regression? #468

@serwizz

Hello,

I've encountered an issue with the get_element() method: it finds the XSD element when the path contains the index [1], but returns None for any higher index such as [2] or [3].

I'm not entirely sure if this is expected behavior or a regression, as the documentation doesn't explicitly specify how indexed XPath suffixes should be handled. However, this behavior is problematic for our use case where we automatically generate XPaths from XML documents using lxml.etree.getpath(), which includes these indexes starting from [1] for all elements in a sequence.
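To make the shape of these paths concrete, here is a rough standard-library approximation of how repeated siblings end up numbered. It is only a sketch: iter_indexed_paths is a hypothetical helper (not part of lxml or xmlschema), the sample XML is made up for illustration, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample document, not the real test file.
SAMPLE = """
<catalog>
  <book><title>A</title></book>
  <book><title>B</title></book>
</catalog>
"""


def iter_indexed_paths(root):
    """Yield getpath()-style paths for un-namespaced documents (approximation)."""

    def walk(elem, path):
        yield path
        totals = Counter(child.tag for child in elem)
        seen = Counter()
        for child in elem:
            seen[child.tag] += 1
            step = child.tag
            # A positional index is added only when the tag repeats among siblings.
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            yield from walk(child, f"{path}/{step}")

    yield from walk(root, f"/{root.tag}")


paths = list(iter_indexed_paths(ET.fromstring(SAMPLE)))
for p in paths:
    print(p)
```

Every occurrence after the first gets an index of [2] or higher, so any document with repeated elements produces paths that hit the failing case.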

Steps to Reproduce:

1. Have an XSD schema with a sequence of elements where the same name appears multiple times (e.g., via maxOccurs="unbounded").

2. Generate an XPath for elements in a corresponding XML file using lxml.etree.getpath().

3. Try to get the XSD element using schema.get_element(tag, path=path).

The method works for paths with [1] but returns None for paths with [2], [3], etc.

Minimal Reproducible Example:

import re

import xmlschema
from lxml import etree

xsd_path = '/path/to/test.xsd'
xml_path = '/path/to/test.xml'
schema = xmlschema.XMLSchema11(xsd_path)

xml_document = xmlschema.XmlDocument(xml_path, schema=schema)
xml_resource = xmlschema.XMLResource(source=xml_document)
namespaces = xml_resource.get_namespaces()

tree = etree.parse(xml_path)
for xml_element in tree.iter():
    tag = xml_element.tag
    path: str = tree.getpath(xml_element)

    # This call works for [1] but fails for [2], [3], etc.
    xsd_element = schema.get_element(
        tag,
        path=path,
        namespaces=namespaces,
    )
    if xsd_element is None:
        print(f'FAILED: XSD element not found for path="{path}", tag="{tag}"')
    else:
        print(f'SUCCESS for path: "{path}"')

    # Workaround: remove indexes
    cleaned_path = re.sub(r'\[\d+\]', '', path)
    xsd_element_by_cleaned_path = schema.get_element(
        tag,
        path=cleaned_path,
        namespaces=namespaces,
    )
    if xsd_element is None and xsd_element_by_cleaned_path is not None:
        print(f'SUCCESS with cleaned path: "{cleaned_path}"')

Test files:

Actual Output:

Based on my test with a catalog of books, here's the actual output demonstrating the issue:

SUCCESS for path: "/catalog"
SUCCESS for path: "/catalog/book[1]"
SUCCESS for path: "/catalog/book[1]/title"
SUCCESS for path: "/catalog/book[1]/author"
SUCCESS for path: "/catalog/book[1]/year"
SUCCESS for path: "/catalog/book[1]/price"
SUCCESS for path: "/catalog/book[1]/category"
FAILED: XSD element not found for path="/catalog/book[2]", tag="book"
SUCCESS with cleaned path: "/catalog/book"
FAILED: XSD element not found for path="/catalog/book[2]/title", tag="title"
SUCCESS with cleaned path: "/catalog/book/title"
FAILED: XSD element not found for path="/catalog/book[2]/author", tag="author"
SUCCESS with cleaned path: "/catalog/book/author"
FAILED: XSD element not found for path="/catalog/book[2]/year", tag="year"
SUCCESS with cleaned path: "/catalog/book/year"
FAILED: XSD element not found for path="/catalog/book[2]/price", tag="price"
SUCCESS with cleaned path: "/catalog/book/price"
FAILED: XSD element not found for path="/catalog/book[2]/category", tag="category"
SUCCESS with cleaned path: "/catalog/book/category"
FAILED: XSD element not found for path="/catalog/book[3]", tag="book"
SUCCESS with cleaned path: "/catalog/book"
FAILED: XSD element not found for path="/catalog/book[3]/title", tag="title"
SUCCESS with cleaned path: "/catalog/book/title"
FAILED: XSD element not found for path="/catalog/book[3]/author", tag="author"
SUCCESS with cleaned path: "/catalog/book/author"
FAILED: XSD element not found for path="/catalog/book[3]/year", tag="year"
SUCCESS with cleaned path: "/catalog/book/year"
FAILED: XSD element not found for path="/catalog/book[3]/price", tag="price"
SUCCESS with cleaned path: "/catalog/book/price"
FAILED: XSD element not found for path="/catalog/book[3]/category", tag="category"
SUCCESS with cleaned path: "/catalog/book/category"

Questions:

Is this the expected behavior? Should get_element() only work with [1] but not with higher indexes?

If this is expected, what is the recommended approach to handle XPaths generated by lxml.etree.getpath() that include these indexes?

If this is a regression, could it be fixed to handle all indexes consistently?

Current workaround:

We're currently using a regex to strip every positional index (anything matching \[\d+\]) from the path before calling get_element(). This works, but it feels like a workaround rather than a proper solution.
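For reference, the workaround can be isolated as a small helper. This is a minimal sketch: strip_positional_indexes is a hypothetical name, and it assumes the paths only contain purely numeric predicates of the kind getpath() produces. It also assumes that every occurrence of a repeated element maps to the same XSD declaration, which holds for our schema but may not hold for every content model.

```python
import re

# Matches purely numeric predicates such as [1] or [42].
_POSITIONAL_INDEX = re.compile(r"\[\d+\]")


def strip_positional_indexes(path: str) -> str:
    """Return the path with every numeric [n] predicate removed."""
    return _POSITIONAL_INDEX.sub("", path)


cleaned = strip_positional_indexes("/catalog/book[2]/title")
print(cleaned)
```

The cleaned path can then be passed to schema.get_element(tag, path=cleaned, namespaces=namespaces) exactly as in the example above.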

Environment:

  • Python Version: 3.11
  • xmlschema Version: 4.2.0
  • lxml Version: 5.2.2

Additional context:

I suspect this might be a regression, because the same code worked with an older version of xmlschema (3.4.5).
