Skip to content

Conversation

jbaiera
Copy link
Member

@jbaiera jbaiera commented Aug 20, 2025

Adds support for the flexible field access pattern within ingest pipelines, which provides the ability to progressively scan for fields on an ingest document in a way that includes dotted field names:

PUT _ingest/pipeline/my-flexible-pipeline
{
  "field_access_pattern": "flexible",
  "processors": [
    {
      "set": {
        "field": "result",
        "copy_from": "a.b.c.d"
      }
    }
  ]
}

Setting field_access_pattern to classic supports reading the field a.b.c.d from the following document:

{"a": {"b": {"c": {"d": "value"}}}}

Setting field_access_pattern to flexible supports reading the field a.b.c.d from the following documents:

{"a": {"b": {"c": {"d": "value"}}}}
{"a.b": {"c": {"d": "value"}}}
{"a": {"b.c": {"d": "value"}}}
{"a": {"b": {"c.d": "value"}}}
{"a.b": {"c.d": "value"}}
{"a.b.c": {"d": "value"}}
{"a": {"b.c.d": "value"}}
{"a.b.c.d": "value"}

When writing and removing fields with flexible mode, as much of the path is resolved as it can be following the abilities of detailed above. When the field path is exhausted, any unresolved fields will be concatenated together to write the field to the document. Note: This does not account for field names that resolve to non-object values.

This results in the following outputs when setting a.b.c.d.e.f to "new_value"

Before Result After
{"foo": "bar"} {"foo": "bar", "a.b.c.d.e.f": "new_value"}
{"a": {"foo": "bar"}} {"a": {"foo": "bar", "b.c.d.e.f": "new_value"}}
{"a": {"b": {"foo": "bar"}}} {"a": {"b": {"foo": "bar", "c.d.e.f": "new_value"}}}
{"a": {"b": {"c": {"foo": "bar"}}}} {"a": {"b": {"c": {"foo": "bar", "d.e.f": "new_value"}}}}
{"a": {"b": {"c": {"d": {"foo": "bar"}}}}} {"a": {"b": {"c": {"d": {"foo": "bar", "e.f": "new_value"}}}}}
{"a.b": {"foo": "bar"}} {"a.b": {"foo": "bar", "b.c.d.e.f": "new_value"}}
{"a.b": {"c": {"foo": "bar"}}} {"a.b": {"c": {"foo": "bar", "d.e.f": "new_value"}}}
{"a.b": {"c": {"d": {"foo": "bar"}}}} {"a.b": {"c": {"d": {"foo": "bar", "e.f": "new_value"}}}}
{"a": {"b.c": {"d": {"foo": "bar"}}}} {"a": {"b.c": {"d": {"foo": "bar", "e.f": "new_value"}}}}
{"a": {"b": {"c.d": {"foo": "bar"}}}} {"a": {"b": {"c.d": {"foo": "bar", "e.f": "new_value"}}}}
{"a.b": {"c": {"d": "value"}}} cannot set [e.f] with parent object of type [java.lang.String] as part of path [a.b.c.d.e.f]
{"a": {"b.c": {"d": "value"}}} cannot set [e.f] with parent object of type [java.lang.String] as part of path [a.b.c.d.e.f]
{"a": {"b": {"c.d": "value"}}} cannot set [e.f] with parent object of type [java.lang.String] as part of path [a.b.c.d.e.f]
{"a.b": {"c.d": "value"}} cannot set [e.f] with parent object of type [java.lang.String] as part of path [a.b.c.d.e.f]
{"a.b.c": {"d": "value"}} cannot set [e.f] with parent object of type [java.lang.String] as part of path [a.b.c.d.e.f]
{"a": {"b.c.d": "value"}} cannot set [e.f] with parent object of type [java.lang.String] as part of path [a.b.c.d.e.f]
{"a.b.c.d": "value"} cannot set [e.f] with parent object of type [java.lang.String] as part of path [a.b.c.d.e.f]

The structure of this new access pattern algorithm is very similar across all methods (get, set, hasField, remove), but is clunky to extract into a reuseable bit of code without making the result worse to read than duplicating the structure in every method. This is because it would need to return multiple values, and each method needs to perform slightly different logic in the conditional branches.

Added tests, of which several will require updates when we add support for arrays.
@jbaiera jbaiera added >non-issue :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v9.2.0 labels Aug 20, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Aug 20, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@jbaiera jbaiera requested a review from masseyke August 20, 2025 21:36
@joegallo joegallo self-requested a review August 22, 2025 20:25
@joegallo
Copy link
Contributor

I haven't read the code yet, but I do intend to -- I'll also spend some time early next week on benchmarking this just to see if anything interesting pops out.

Comment on lines +283 to +338
switch (getCurrentAccessPatternSafe()) {
case CLASSIC -> context = map.get(pathElement);
case FLEXIBLE -> {
Object object = map.getOrDefault(pathElement, NOT_FOUND);
if (object != NOT_FOUND) {
context = object;
} else if (i == lastContainerIndex) {
// This is the last path element, update the leaf key to use this path element as a dotted prefix.
// Leave the context as it is.
leafKey = pathElement + "." + leafKey;
} else {
// Iterate through the remaining path elements, joining them with dots, until we get a hit
String combinedPath = pathElement;
for (int j = i + 1; j <= lastContainerIndex; j++) {
combinedPath = combinedPath + "." + fieldPath.pathElements[j];
object = map.getOrDefault(combinedPath, NOT_FOUND); // getOrDefault is faster than containsKey + get
if (object != NOT_FOUND) {
// Found one, update the outer loop index to skip past the elements we've used
context = object;
i = j;
break;
}
}
if (object == NOT_FOUND) {
// Made it to the last path element without finding the field.
// Update the leaf key to use the visited combined path elements as a dotted prefix.
leafKey = combinedPath + "." + leafKey;
// Update outer loop index to skip past the elements we've used
i = lastContainerIndex;
}
}
}
}
} else if (context instanceof Map<?, ?> map) {
context = map.get(pathElement);
switch (getCurrentAccessPatternSafe()) {
case CLASSIC -> context = map.get(pathElement);
case FLEXIBLE -> {
@SuppressWarnings("unchecked")
Map<String, Object> typedMap = (Map<String, Object>) context;
Object object = typedMap.getOrDefault(pathElement, NOT_FOUND);
if (object != NOT_FOUND) {
context = object;
} else if (i == lastContainerIndex) {
// This is the last path element, update the leaf key to use this path element as a dotted prefix.
// Leave the context as it is.
leafKey = pathElement + "." + leafKey;
} else {
// Iterate through the remaining path elements, joining them with dots, until we get a hit
String combinedPath = pathElement;
for (int j = i + 1; j <= lastContainerIndex; j++) {
combinedPath = combinedPath + "." + fieldPath.pathElements[j];
object = typedMap.getOrDefault(combinedPath, NOT_FOUND); // getOrDefault is faster than containsKey + get
if (object != NOT_FOUND) {
// Found one, update the outer loop index to skip past the elements we've used
context = object;
i = j;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these two blocks identical? It didn't seem like a big deal to have them separate when they were 1-liners, but it seems kind of risky to have this much copy/pasted code. Or maybe there's some difference I'm missing? They're really both just operating on Map<String, Object>, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh apparently there was some reason for this as @joegallo describes here: #125051. But maybe the code could go into a method? Or maybe his comment no longer holds with these changes anyway?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe we need to just make the code different between the different paths? I generated code coverage when i ran the tests, and certain lines in certain branches I just can't hit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are both indeed just operating on Map<String,Object> - I'm not sure I understand the optimization work that has gone into the splitting of these code paths, maybe it's avoiding polymorphic access overhead in favor of the monomorphic usage of the types. I think @joegallo might have more insight into the reasoning. I'm just following his lead.

A lot of these algorithms are nearly identical to each other, with some small differences in the last operations they perform (some return true/false for if a field exists, others need to return a context and how much it couldn't resolve, etc). Additionally, splitting these into their own methods will require multiple return values in the form of updating the outer loop index i, the context object that has been selected, and in some cases the building out of the leafKey variable. I could make these algorithms return all of that packaged in a record class and unpack it at the call site, and inject the different logic via some functions, but there's just enough different between all of them that I worry the result will not be that much easier to follow or maintain.

Copy link
Contributor

@joegallo joegallo Aug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's monomorphic callsites -- in a resolution of foo.bar.baz the foo lookup will be against an IngestCtxMap but the bar and baz lookups will be against HashMaps.

And that's the usual pattern -- always the first lookup against an IngestCtxMap, and then all subsequent lookups against something else (mostly HashMap).

Copy link
Member

@masseyke masseyke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. The tests are really thorough and helpful. I left a few questions about code reuse / unreachable code. It's also worth seeing how this impacts performance.

@masseyke masseyke requested a review from Copilot August 25, 2025 20:20
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds flexible field access pattern support to ingest pipelines, allowing progressive scanning for dotted field names during document processing. This enables pipelines to handle documents with various field naming patterns.

Key Changes:

  • Added new field access pattern functionality to ingest document operations
  • Enhanced field resolution to support dotted field name combinations
  • Added comprehensive test coverage for flexible access pattern features

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
server/src/main/java/org/elasticsearch/rest/action/ingest/RestPutPipelineAction.java Adds capability flag for flexible field access pattern support
server/src/main/java/org/elasticsearch/ingest/IngestService.java Updates cluster state usage for feature detection
server/src/main/java/org/elasticsearch/ingest/IngestDocument.java Implements flexible field access pattern logic for get, set, has, and remove operations
modules/ingest-common/src/yamlRestTest/resources/rest-api-spec/test/ingest/340_flexible_access_pattern.yml Adds comprehensive test cases for flexible access pattern functionality
modules/ingest-common/build.gradle Adds capabilities API to test dependencies

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@jbaiera jbaiera merged commit 04ea11b into elastic:main Aug 29, 2025
33 checks passed
@jbaiera jbaiera deleted the streams-ingest-pipeline-field-access-pattern branch August 29, 2025 06:54
JeremyDahlgren pushed a commit to JeremyDahlgren/elasticsearch that referenced this pull request Aug 29, 2025
…3270)

Adds support for the flexible field access pattern within ingest pipelines, which provides 
the ability to progressively scan for fields on an ingest document in a way that includes 
dotted field names

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >non-issue Team:Data Management Meta label for data/management team v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants