@marc-gr marc-gr commented Jun 30, 2025

This PR creates a new XML processor that achieves feature parity with Logstash's XML filter.

⚙️ Configuration Options

processors:
  - xml:
      field: "xml_data"
      target_field: "parsed"
      to_lower: false
      # Logstash-compatible options
      xpath:
        "/root/item/@id": "item_id"
        "//product/name/text()": "product_name"
      namespaces:
        "ns": "http://example.com/namespace"
      force_array: true
      force_content: false
      remove_namespaces: false
      ignore_empty_value: true
      parse_options: "strict"

🏗️ Architecture

  • Streaming SAX Parser: Optimal memory usage for large XML documents
  • Selective DOM Building: Only builds DOM when XPath expressions are configured
  • Pre-compiled XPath: XPath expressions compiled at processor creation for performance
  • Security: Enhanced XXE protection with secure parser factory configurations
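As an illustration of the XXE point above (not the PR's actual factory code), here is a minimal sketch of a hardened DOM factory using only standard JAXP calls. The class name `HardenedDomSketch` and its methods are hypothetical; the feature URIs are the commonly recommended ones and are assumed to be supported by the JDK's built-in parser.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class HardenedDomSketch {

    // Builds a factory that refuses DOCTYPE declarations and external entities.
    static DocumentBuilderFactory newHardenedFactory() {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Rejecting DOCTYPE outright blocks most XXE and entity-expansion attacks.
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            f.setFeature("http://xml.org/sax/features/external-general-entities", false);
            f.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            f.setXIncludeAware(false);
            f.setExpandEntityReferences(false);
            f.setNamespaceAware(true);
            return f;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("cannot configure a hardened XML parser", e);
        }
    }

    // Parses a string with the hardened factory; any parse failure becomes unchecked.
    static Document parse(String xml) {
        try {
            return newHardenedFactory().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("XML rejected or malformed", e);
        }
    }
}
```

With this configuration, a document that declares a DOCTYPE is expected to be rejected at parse time rather than having its entities resolved.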

📚 Documentation

Documentation includes:

  • Complete configuration reference
  • XPath expression examples
  • Namespace configuration guide

Logstash differences

  • ignore_empty_value behaves a bit differently from suppress_empty, but I think it matches the behavior of other processors better. It could be adapted, or we could even add both, but I found that confusing.

Closes #97364

@marc-gr force-pushed the feat/xml-processor branch from 95df637 to 67dd264 on June 30, 2025 14:28
@marc-gr requested a review from Copilot on June 30, 2025 14:29
@marc-gr marked this pull request as ready for review on June 30, 2025 14:29
@elasticsearchmachine added the needs:triage label (Requires assignment of a team area label) on Jun 30, 2025
@marc-gr added the Team:Security label (Meta label for security team) on Jun 30, 2025
@elasticsearchmachine removed the Team:Security label (Meta label for security team) on Jun 30, 2025
@marc-gr added the :Data Management/Ingest Node label (Execution or management of Ingest Pipelines including GeoIP) on Jul 1, 2025
@elasticsearchmachine added the Team:Data Management label (Meta label for data/management team) on Jul 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine removed the needs:triage label (Requires assignment of a team area label) on Jul 1, 2025
@elasticsearchmachine
Collaborator

Hi @marc-gr, I've created a changelog YAML for you.

@PeteGillinElastic
Member

Sorry, GitHub has been acting up and leaving some of my comments pending when I didn't expect it to. I meant to reply to the question about the noop processor. The other comments are nits, and I'm not sure whether they're still valid on the latest iteration of the code.

The callers are always statically either a String or a List, so let's
just have static arities for exactly and only those versions.
They don't make a drug for what's wrong with me.
this.xpathExpressions = xpathExpressions != null ? Map.copyOf(xpathExpressions) : Map.of();
this.namespaces = namespaces != null ? Map.copyOf(namespaces) : Map.of();

this.compiledXPathExpressions = compileXPathExpressions(this.xpathExpressions, this.namespaces);
Contributor

From https://jcp.org/aboutJava/communityprocess/review/jsr063/jaxp-pd1.pdf, though, it seems that the DocumentBuilderFactory and SAXParserFactory classes are thread-safe, so XPathFactory is the oddball. This is fun, we're having fun.
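To make the thread-safety concern concrete, here is a rough sketch of one way to keep compiled XPath expressions per thread, so that no XPath or XPathExpression object is shared across threads. This is an assumption about a possible approach, not the code in this PR; `PerThreadXPathSketch` and its methods are hypothetical names, and the `parse` helper uses an unhardened factory purely for demonstration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class, for illustration only.
final class PerThreadXPathSketch {
    private final Map<String, String> expressions; // XPath expression -> target field
    private final ThreadLocal<Map<String, XPathExpression>> compiled;

    PerThreadXPathSketch(Map<String, String> expressions) {
        this.expressions = Map.copyOf(expressions);
        compileAll(); // validate eagerly so bad expressions fail at creation time
        this.compiled = ThreadLocal.withInitial(this::compileAll);
    }

    // Each thread gets its own factory, XPath, and compiled expressions.
    private Map<String, XPathExpression> compileAll() {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, XPathExpression> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : expressions.entrySet()) {
            try {
                out.put(e.getValue(), xpath.compile(e.getKey()));
            } catch (XPathExpressionException ex) {
                throw new IllegalArgumentException("invalid XPath [" + e.getKey() + "]", ex);
            }
        }
        return out;
    }

    // Evaluates every expression as a string against the given document.
    Map<String, String> evaluate(Document doc) {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, XPathExpression> e : compiled.get().entrySet()) {
            try {
                results.put(e.getKey(), e.getValue().evaluate(doc));
            } catch (XPathExpressionException ex) {
                throw new IllegalStateException("XPath evaluation failed for [" + e.getKey() + "]", ex);
            }
        }
        return results;
    }

    // Demo-only parser; a real processor would use a hardened factory.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid XML", e);
        }
    }
}
```

The trade-off is memory: every ingest thread holds its own compiled copy for every processor instance, which may or may not be acceptable.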

Contributor

#134923 is tangentially related to this conversation (not the thread safety part, though, which is the primary thrust of this conversation).

Contributor

// Use pre-configured secure DOM factory
// Since we build DOM programmatically (createElementNS/createElement),
// the factory's namespace awareness doesn't affect our usage
DocumentBuilder builder = XmlFactories.DOM_FACTORY.newDocumentBuilder();
Contributor

newDocumentBuilder is too expensive to invoke per-document. We're going to have to do the same threadlocal business on that one, too.
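A minimal sketch of what that thread-local reuse could look like, assuming `DocumentBuilder.reset()` is supported by the underlying implementation (it is in the JDK's default one). `PerThreadDocumentBuilderSketch` is a hypothetical name, and the plain factory here stands in for the PR's `XmlFactories.DOM_FACTORY`.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Hypothetical class, for illustration only.
final class PerThreadDocumentBuilderSketch {
    // Stand-in for a pre-configured secure factory.
    private static final DocumentBuilderFactory FACTORY = DocumentBuilderFactory.newInstance();

    // One builder per thread, created lazily on first use.
    private static final ThreadLocal<DocumentBuilder> BUILDER = ThreadLocal.withInitial(() -> {
        try {
            // Synchronize defensively; factory thread safety is not assumed here.
            synchronized (FACTORY) {
                return FACTORY.newDocumentBuilder();
            }
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("cannot create DocumentBuilder", e);
        }
    });

    // Returns this thread's builder, reset to its original configuration.
    static DocumentBuilder builder() {
        DocumentBuilder b = BUILDER.get();
        b.reset();
        return b;
    }

    // Creates an empty document without paying for newDocumentBuilder() each time.
    static Document newDocument() {
        return builder().newDocument();
    }
}
```

Repeated calls on the same thread reuse one builder, so the per-document cost is reduced to `reset()` plus `newDocument()`.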

@joegallo
Contributor

a structured object tree will be built 
  (but we don't need a dom object!):

  target_field: foobar
  store_xml: true # note: this is the default


xpath expressions will be evaluated (and their results stored),
  (but we don't need a structured object tree!):

  store_xml: false
  xpath: {...}


a structured object tree will be built AND
xpath expressions will be evaluated (and their results stored):

  target_field: foobar
  store_xml: true # note: this is the default
  xpath: {...}

The existing implementation always does the work of building the structured result, even if the structured result is not used for anything. That hurts performance in the xpath-expressions-only case.
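The three cases above could be decided once, at processor creation, along these lines. This is only a sketch of the gating logic under the assumption that `store_xml` and `xpath` are the sole inputs; `XmlWorkPlanSketch` and its `Work` enum are hypothetical names, not part of the PR.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class, for illustration only.
final class XmlWorkPlanSketch {
    enum Work { STRUCTURED_TREE, DOM_FOR_XPATH }

    // Decides which outputs to build so that unused ones are skipped entirely.
    static Set<Work> plan(boolean storeXml, Map<String, String> xpath) {
        EnumSet<Work> work = EnumSet.noneOf(Work.class);
        if (storeXml) {
            work.add(Work.STRUCTURED_TREE); // structured tree stored under target_field
        }
        if (xpath != null && !xpath.isEmpty()) {
            work.add(Work.DOM_FOR_XPATH); // DOM is needed only to evaluate XPath
        }
        return work;
    }
}
```

A parser handler could then check the returned set before allocating either the structured tree or the DOM.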

@joegallo
Contributor

I merged main in to take advantage of #134923, but I haven't addressed my complaints about the other aspects of the XPath handling here yet.

return XmlUtils.getHardenedXPath();
} catch (Exception e) {
logger.warn("Cannot configure secure XPath object - XML processor may not work correctly", e);
return null;
Contributor

@joegallo joegallo Sep 18, 2025

I followed the existing pattern of returning null here, but honestly I think we should blow up and die much more aggressively if this happens. The processor is just going to throw NPEs at every step of the way anyway, and that seems... unproductive.

Member

FYI: The returning-null pattern was introduced at a point when we were eagerly creating things in static initializers. I suggested it so that a node lacking the required functionality would not fail to start up before anyone even tried to create an XML processor. If the thread-local changes mean we're now creating things lazily and will only throw when someone tries to use the missing functionality, we might want to revisit the exception handling.
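One possible shape for that revisited handling, sketched under the assumption that creation is lazy and per-thread: throw a descriptive unchecked exception at first use instead of returning null. `LazyXPathSketch` is a hypothetical name, and the factory setup here is a stand-in for whatever `XmlUtils.getHardenedXPath()` does internally.

```java
import javax.xml.XMLConstants;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathFactoryConfigurationException;

// Hypothetical class, for illustration only.
final class LazyXPathSketch {
    // Nothing is created at class-load time, so node startup is unaffected.
    private static final ThreadLocal<XPath> XPATH = ThreadLocal.withInitial(LazyXPathSketch::create);

    private static XPath create() {
        try {
            XPathFactory factory = XPathFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            return factory.newXPath();
        } catch (XPathFactoryConfigurationException e) {
            // Fail loudly at first use rather than handing back null.
            throw new IllegalStateException("cannot configure a secure XPath object", e);
        }
    }

    // Returns this thread's XPath, or throws if one cannot be configured.
    static XPath get() {
        return XPATH.get();
    }
}
```

Callers then see a single clear failure at the point of use instead of scattered NullPointerExceptions later on.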

@joegallo
Contributor

More notes from myself to myself:

should it be allowed to have `target_field` set,
and to have `store_xml` be *false*?

force_array only makes sense when store_xml is true,
probably same with force_content
and probably same with remove_empty_values

maybe remove_empty_values should work with xpath?

@joegallo
Contributor

joegallo commented Sep 18, 2025

The XPath performance here is bad enough that I don't think we can ship this feature as is.

With just store_xml, I'm seeing 2.9 micros/doc for the xml processor for a 'hello world'-ish xml document. That's absolutely fine.

With this configuration, however:

        "field": "source",
        "store_xml": false,
        "xpath": {
          "//root/element": "root_element",
          "//root/note": "root_note"
        }

I'm getting 32.9 micros per doc, the vast majority of that time being the evaluation of the xpath expressions. Additionally, my test cluster (running on this laptop, so yes, it's a toy) starts to struggle with GCing (25% of its time is in GC) and the vast majority of the allocations that it's having to GC through are allocations from the evaluation of the xpath expressions (>95%).

edit: Here's a flamegraph. Broadly, the leftmost section is parsing XML, the rightmost section is compiling xpath expressions, and the middle section is evaluating xpath expressions. The evaluation is hurting us a lot, and even if I come up with some per-instance per-thread scheme for reusing precompiled xpath expressions, I don't see what we can do about the evaluation time.

[Image: CPU flamegraph, Screenshot 2025-09-18 at 4 33 42 PM]

double edit: And the allocation flamegraph, showing that xpath evaluation is >95% of the allocations (so, like, everything else Elasticsearch does in terms of ingest and indexing is getting blown out of the water -- it all becomes that little strip on the RHS).

[Image: allocation flamegraph, Screenshot 2025-09-18 at 4 38 12 PM]

Labels

  • :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP)
  • >enhancement
  • external-contributor (Pull request authored by a developer outside the Elasticsearch team)
  • Team:Data Management (Meta label for data/management team)
  • v9.3.0

Development

Successfully merging this pull request may close these issues.

[Ingest Pipeline] XML Processor

5 participants