Add XmlProcessor initial implementation #130337

marc-gr · 2025-06-30T14:18:38Z

This PR creates a new XML processor that achieves feature parity with Logstash's XML filter.

⚙️ Configuration Options

processors:
  - xml:
      field: "xml_data"
      target_field: "parsed"
      to_lower: false
      # Logstash-compatible options
      xpath:
        "/root/item/@id": "item_id"
        "//product/name/text()": "product_name"
      namespaces:
        "ns": "http://example.com/namespace"
      force_array: true
      force_content: false
      remove_namespaces: false
      ignore_empty_value: true
      parse_options: "strict"
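
For reviewers less familiar with JAXP, here is a rough sketch of how a `namespaces` map like the one above could be wired into XPath evaluation. This is only an illustration: `MapNamespaceContextSketch` and its `evaluate` helper are hypothetical names, not the code in this PR, and the parser here is deliberately left unhardened to keep the example short.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the PR's implementation.
final class MapNamespaceContextSketch implements NamespaceContext {
    private final Map<String, String> prefixToUri;

    MapNamespaceContextSketch(Map<String, String> prefixToUri) {
        this.prefixToUri = Map.copyOf(prefixToUri);
    }

    @Override
    public String getNamespaceURI(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        // Unbound prefixes resolve to the empty namespace URI.
        // Special handling of the reserved "xml"/"xmlns" prefixes is omitted here.
        return prefixToUri.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
    }

    @Override
    public String getPrefix(String namespaceUri) {
        Iterator<String> prefixes = getPrefixes(namespaceUri);
        return prefixes.hasNext() ? prefixes.next() : null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceUri) {
        if (namespaceUri == null) {
            throw new IllegalArgumentException("namespaceUri must not be null");
        }
        List<String> matches = prefixToUri.entrySet()
            .stream()
            .filter(e -> e.getValue().equals(namespaceUri))
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
        return matches.iterator();
    }

    // Evaluates one expression against one document, resolving prefixes via the map.
    static String evaluate(String xml, String expression, Map<String, String> namespaces) {
        try {
            // NOTE: unhardened parser, for illustration only.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new MapNamespaceContextSketch(namespaces));
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not evaluate [" + expression + "]", e);
        }
    }
}
```

Note that the prefix used in the expression only has to map to the same URI as the one in the document; the document itself may use a different prefix.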

🏗️ Architecture

Streaming SAX Parser: Optimal memory usage for large XML documents
Selective DOM Building: Only builds DOM when XPath expressions are configured
Pre-compiled XPath: XPath expressions compiled at processor creation for performance
Security: Enhanced XXE protection with secure parser factory configurations
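
To make the XXE point concrete, a minimal sketch of a hardened DOM factory using the JDK's built-in JAXP parser might look like the following. `HardenedDomSketch` is a hypothetical name and this is an assumption about the general approach, not the exact configuration used in this PR.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; the PR's own factories may be configured differently.
final class HardenedDomSketch {

    static DocumentBuilderFactory newHardenedFactory() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Rejecting DOCTYPE declarations outright removes the usual XXE entry point.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.setXIncludeAware(false);
            factory.setExpandEntityReferences(false);
            return factory;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("cannot configure a hardened DOM factory", e);
        }
    }

    static Document parse(String xml) {
        try {
            return newHardenedFactory().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Includes documents rejected because they carry a DOCTYPE.
            throw new IllegalArgumentException("invalid or disallowed XML", e);
        } catch (Exception e) {
            throw new IllegalStateException("unexpected failure while parsing", e);
        }
    }
}
```

With this configuration, any document containing a DOCTYPE declaration fails to parse, while ordinary documents are unaffected.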

📚 Documentation

Documentation includes:

Complete configuration reference
XPath expression examples
Namespace configuration guide

Logstash differences

ignore_empty_value behaves a bit differently from Logstash's suppress_empty, but I think it matches the behavior of other processors better. It could be adapted, or we could even add both, but I found that confusing.

Closes #97364

github-actions · 2025-06-30T14:18:49Z

🔍 Preview links for changed docs:

🔔 The preview site may take up to 3 minutes to finish building. These links will become live once it completes.

elasticsearchmachine · 2025-07-01T08:33:36Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2025-07-01T08:34:15Z

Hi @marc-gr, I've created a changelog YAML for you.

PeteGillinElastic · 2025-09-16T08:26:38Z

Sorry, GitHub has been acting weird and leaving some of my comments in pending when I didn't expect it to. I meant to reply to the question about the noop processor. The other comments are nits, and I'm not sure whether they're still valid on the latest iteration of the code.

joegallo · 2025-09-16T20:33:02Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/XmlProcessor.java

+        this.xpathExpressions = xpathExpressions != null ? Map.copyOf(xpathExpressions) : Map.of();
+        this.namespaces = namespaces != null ? Map.copyOf(namespaces) : Map.of();
+
+        this.compiledXPathExpressions = compileXPathExpressions(this.xpathExpressions, this.namespaces);


An XPath expression is not thread-safe and not reentrant.

For that matter, the XPathFactory class is not thread-safe either.

From https://jcp.org/aboutJava/communityprocess/review/jsr063/jaxp-pd1.pdf, though, it seems that the DocumentBuilderFactory and SAXParserFactory classes are thread-safe, so XPathFactory is the oddball. This is fun, we're having fun.
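
One possible direction, sketched under the assumption that we keep one set of compiled expressions per thread: each thread lazily builds its own XPath object and compiles the configured expressions once. `PerThreadXPathSketch` is a hypothetical class, not code from this PR, and it parses with an unhardened default parser purely to keep the example short.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical sketch of per-thread compiled expressions.
final class PerThreadXPathSketch {
    // Target field name -> compiled expression, one map per thread.
    private final ThreadLocal<Map<String, XPathExpression>> compiled;

    PerThreadXPathSketch(Map<String, String> xpathToTargetField) {
        Map<String, String> config = Map.copyOf(xpathToTargetField);
        this.compiled = ThreadLocal.withInitial(() -> compileAll(config));
    }

    private static Map<String, XPathExpression> compileAll(Map<String, String> config) {
        // A fresh factory and XPath per thread, so nothing non-thread-safe is shared.
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, XPathExpression> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : config.entrySet()) {
            try {
                result.put(entry.getValue(), xpath.compile(entry.getKey()));
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("invalid XPath [" + entry.getKey() + "]", e);
            }
        }
        return result;
    }

    // Evaluates every configured expression against the document as a string result.
    // A real implementation would evaluate against an already-parsed, hardened DOM
    // instead of re-parsing the input for each expression.
    Map<String, String> evaluate(String xml) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, XPathExpression> entry : compiled.get().entrySet()) {
            try {
                out.put(entry.getKey(), entry.getValue().evaluate(new InputSource(new StringReader(xml))));
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("XPath evaluation failed for [" + entry.getKey() + "]", e);
            }
        }
        return out;
    }
}
```

This only addresses sharing; it does nothing for the cost of evaluation itself, and a per-instance ThreadLocal has its own lifecycle questions when pipelines are reloaded.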

#134923 is tangentially related to this conversation (not the thread safety part, though, which is the primary thrust of this conversation).

#130337 (comment)

joegallo · 2025-09-17T21:01:31Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/XmlProcessor.java

+                    // Use pre-configured secure DOM factory
+                    // Since we build DOM programmatically (createElementNS/createElement),
+                    // the factory's namespace awareness doesn't affect our usage
+                    DocumentBuilder builder = XmlFactories.DOM_FACTORY.newDocumentBuilder();


newDocumentBuilder is too expensive to invoke per document. We're going to have to do the same thread-local business on that one, too.
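
Roughly what I have in mind, as a sketch only: a shared factory plus one DocumentBuilder per thread that gets reset before each use. `PerThreadDocumentBuilderSketch` is a hypothetical name, and the factory here is a plain default one rather than the hardened `XmlFactories.DOM_FACTORY`.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Hypothetical sketch; stands in for whatever the processor ends up using.
final class PerThreadDocumentBuilderSketch {
    // Assumption: newDocumentBuilder() on a shared factory is safe to call from
    // multiple threads, as the JAXP document linked earlier suggests.
    private static final DocumentBuilderFactory FACTORY = DocumentBuilderFactory.newInstance();

    private static final ThreadLocal<DocumentBuilder> BUILDER = ThreadLocal.withInitial(() -> {
        try {
            return FACTORY.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("cannot create DocumentBuilder", e);
        }
    });

    // Returns this thread's builder, reset to its original configuration.
    static DocumentBuilder builder() {
        DocumentBuilder builder = BUILDER.get();
        builder.reset();
        return builder;
    }

    // Creates an empty document without paying for a new builder each time.
    static Document newDocument() {
        return builder().newDocument();
    }
}
```

The expensive call then happens once per thread instead of once per document.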

joegallo · 2025-09-17T21:03:06Z

a structured object tree will be built 
  (but we don't need a dom object!):

  target_field: foobar
  store_xml: true # note: this is the default


xpath expressions will be evaluated (and their results stored),
  (but we don't need a structured object tree!):

  store_xml: false
  xpath: {...}


a structured object tree will be built AND
xpath expressions will be evaluated (and their results stored):

  target_field: foobar
  store_xml: true # note: this is the default
  xpath: {...}

The existing implementation always does the work of building the structured result, even if the structured result is not used for anything. That hurts performance in the xpath-expressions-only case.
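
The three cases boil down to two independent decisions. A minimal sketch of that planning step, with hypothetical names (`XmlWorkPlanSketch`, `Work`) that do not exist in the PR:

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of deciding up front which work a configuration needs.
final class XmlWorkPlanSketch {
    enum Work {
        STRUCTURED_TREE, // build the map/list result stored under target_field
        DOM_FOR_XPATH    // build a DOM so XPath expressions can be evaluated
    }

    static Set<Work> plan(boolean storeXml, Map<String, String> xpath) {
        EnumSet<Work> work = EnumSet.noneOf(Work.class);
        if (storeXml) {
            work.add(Work.STRUCTURED_TREE);
        }
        if (xpath != null && !xpath.isEmpty()) {
            work.add(Work.DOM_FOR_XPATH);
        }
        return work;
    }
}
```

Computing this once at processor creation would let the per-document path skip the structured tree entirely when only XPath results are wanted.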

joegallo · 2025-09-18T13:58:22Z

I merged main in to take advantage of #134923, but I haven't addressed my complaints about the other aspects of the XPath handling here yet.

joegallo · 2025-09-18T14:00:23Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/XmlProcessor.java

+            return XmlUtils.getHardenedXPath();
+        } catch (Exception e) {
+            logger.warn("Cannot configure secure XPath object - XML processor may not work correctly", e);
+            return null;


I followed the existing pattern of returning null here, but honestly I think we should blow up and die much more aggressively if this happens. The processor is just going to throw NPEs at every step of the way anyway, and that seems... unproductive.

FYI: The return-null pattern was introduced at a point when we were eagerly creating things in static initializers. It was my suggested approach to keep a node that lacked the required functionality from failing to start up before anyone even tried to create an XML processor. If the thread-local changes mean we're now creating things lazily and will only throw when you try to use the missing functionality, we might want to revisit the exception handling.
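
If creation does become lazy, a fail-fast variant could look something like this sketch. `FailFastXPathSketch` and its `Callable` parameter are hypothetical stand-ins (the callable plays the role of `XmlUtils.getHardenedXPath()`), so treat it as an illustration of the exception handling only.

```java
import java.util.concurrent.Callable;
import javax.xml.xpath.XPath;

// Hypothetical sketch: throw at the point of use instead of returning null.
final class FailFastXPathSketch {

    static XPath createOrThrow(Callable<XPath> hardenedXPathSource) {
        XPath xpath;
        try {
            xpath = hardenedXPathSource.call();
        } catch (Exception e) {
            // Surface the root cause immediately rather than logging and limping along.
            throw new IllegalStateException("Cannot configure secure XPath object", e);
        }
        if (xpath == null) {
            throw new IllegalStateException("Secure XPath source returned null");
        }
        return xpath;
    }
}
```

Because this would only run when someone actually uses an XML processor, a node without the required functionality would still start up normally.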

joegallo · 2025-09-18T14:50:14Z

More notes from myself to myself:

should it be allowed to have `target_field` set,
and to have `store_xml` be *false*?

force_array only makes sense when store_xml is true,
probably same with force_content
and probably same with remove_empty_values

maybe remove_empty_values should work with xpath?

joegallo · 2025-09-18T20:28:45Z

The XPath performance here is bad enough that I don't think we can ship this feature as is.

With just store_xml, I'm seeing 2.9 micros/doc for the xml processor for a 'hello world'-ish xml document. That's absolutely fine.

With this configuration, however:

        "field": "source",
        "store_xml": false,
        "xpath": {
          "//root/element": "root_element",
          "//root/note": "root_note"
        }

I'm getting 32.9 micros per doc, the vast majority of that time being the evaluation of the xpath expressions. Additionally, my test cluster (running on this laptop, so yes, it's a toy) starts to struggle with GCing (25% of its time is in GC) and the vast majority of the allocations that it's having to GC through are allocations from the evaluation of the xpath expressions (>95%).

edit: Here's a flamegraph. Broadly, the leftmost section is parsing XML, the rightmost section is compiling xpath expressions, and the middle section is evaluating xpath expressions. The evaluation is hurting us a lot, and even if I come up with some per-instance per-thread scheme for reusing precompiled xpath expressions, I don't see what we can do about the evaluation time.

double edit: And the allocation flamegraph, showing that xpath evaluation is >95% of the allocations (so, like, everything else Elasticsearch does in terms of ingest and indexing is getting blown out of the water -- it all becomes that little strip on the RHS).

marc-gr added the >enhancement label Jun 30, 2025

elasticsearchmachine added the v9.2.0 and external-contributor labels Jun 30, 2025

github-actions bot deployed to docs-preview June 30, 2025 14:19 View deployment

marc-gr force-pushed the feat/xml-processor branch from 5bd50d5 to 95df637 Compare June 30, 2025 14:25

github-actions bot deployed to docs-preview June 30, 2025 14:26 View deployment

Add XmlProcessor initial implementation

67dd264

marc-gr force-pushed the feat/xml-processor branch from 95df637 to 67dd264 Compare June 30, 2025 14:28

marc-gr requested a review from Copilot June 30, 2025 14:29

marc-gr marked this pull request as ready for review June 30, 2025 14:29

elasticsearchmachine added the needs:triage label Jun 30, 2025

github-actions bot deployed to docs-preview June 30, 2025 14:29 View deployment

marc-gr added the Team:Security label Jun 30, 2025

elasticsearchmachine removed the Team:Security label Jun 30, 2025

[CI] Auto commit changes from spotless

16e129e

github-actions bot deployed to docs-preview June 30, 2025 14:38 View deployment

Make factory static

12f4560

github-actions bot deployed to docs-preview June 30, 2025 16:07 View deployment

[CI] Auto commit changes from spotless

0a72059

github-actions bot deployed to docs-preview June 30, 2025 16:24 View deployment

marc-gr added the :Data Management/Ingest Node label Jul 1, 2025

elasticsearchmachine added the Team:Data Management label Jul 1, 2025

elasticsearchmachine removed the needs:triage label Jul 1, 2025

Update docs/changelog/130337.yaml

3a5689e

github-actions bot deployed to docs-preview July 1, 2025 08:35 View deployment

Merge remote-tracking branch 'upstream/main' into feat/xml-processor

23f80b0

joegallo added 7 commits September 16, 2025 13:34

Merge branch 'main' into feat/xml-processor

14393c2

Invert these conditionals for consistency reasons

43cd0c8

Always use the utility function

babffdb

This can be static

bbe8d6c

This doesn't need to handle the general case

071f60a

The callers are always either static-ly a String or a List, so let's just have static arities for exactly and only those versions.

Whitespace

da00dab

They don't make a drug for what's wrong with me.

Only trim the text if it's not blank

a4e6a93

github-actions bot deployed to docs-preview September 16, 2025 18:06 View deployment

joegallo reviewed Sep 16, 2025

View reviewed changes

joegallo mentioned this pull request Sep 17, 2025

Add XPath to XmlUtils #134923

Merged

joegallo reviewed Sep 17, 2025

View reviewed changes

Merge branch 'main' into feat/xml-processor

566f90d

github-actions bot deployed to docs-preview September 18, 2025 13:58 View deployment

joegallo reviewed Sep 18, 2025

View reviewed changes

joegallo added 4 commits September 18, 2025 10:10

Handle these options in the opposite order

4336cfc

Use better variable names here

9218fd9

Handle this validation in the factory

980a7d4

Hoist the namespace context into a field

68920c4

github-actions bot deployed to docs-preview September 18, 2025 14:43 View deployment

Update this test string

57ad518

github-actions bot deployed to docs-preview September 18, 2025 15:13 View deployment

elasticsearchmachine added v9.3.0 and removed v9.2.0 labels Oct 2, 2025

[CI] Update transport version definitions

88a03a3

github-actions bot deployed to docs-preview October 2, 2025 07:34 View deployment
