@joegallo joegallo commented Sep 8, 2025

This PR fixes a discrepancy in how the append processor handles the "allow_duplicates": false option (reminder: the append processor's default is "allow_duplicates": true).

Specifically, the handling of the 'data to be appended' varies according to the value of the 'data to be appended to'. To wit, if there are duplicates in the data to be appended (on its own, regardless of the data to be appended to) then:

  1. If the data to be appended to is a non-empty list OR an empty list, duplicates in the data to be appended will be removed.
  2. If the data to be appended to is non-existent (think null-ish), duplicates in the data to be appended will NOT be removed (<-- this is the edge case / bug).

The code is, pretty understandably, written as if the case we're handling is one where, for example, the data to be appended to is ["foo", "bar"] and the data to be appended is ["bar", "baz"] (or ["bar", "baz", "bar"]). But it mistakenly takes a shortcut when the data to be appended to is non-existent/null: it just copies the data to be appended in its entirety. That means it correctly handles ["bar", "baz"] but not ["bar", "baz", "bar"], so duplicate items sneak through even though allow_duplicates is set to false.
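To make the edge case concrete, here is a minimal sketch of the behavior described above. It is a simplified model, not the actual Elasticsearch code: the class name AppendShortcutSketch and the method appendWithShortcut are hypothetical, and values are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the pre-fix behavior with allow_duplicates set to false.
// This is NOT the real append processor implementation.
final class AppendShortcutSketch {

    static List<String> appendWithShortcut(List<String> existing, List<String> incoming) {
        if (existing == null) {
            // The shortcut: nothing to append to, so copy the incoming data wholesale.
            // Duplicates inside the incoming data survive here.
            return new ArrayList<>(incoming);
        }
        // Otherwise add each incoming value only if it is not already present.
        List<String> result = new ArrayList<>(existing);
        for (String value : incoming) {
            if (!result.contains(value)) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> incoming = List.of("bar", "baz", "bar");
        System.out.println(appendWithShortcut(List.of("foo", "bar"), incoming)); // [foo, bar, baz]
        System.out.println(appendWithShortcut(List.of(), incoming));             // [bar, baz]
        System.out.println(appendWithShortcut(null, incoming));                  // [bar, baz, bar]
    }
}
```

Only the null case lets the second "bar" through, which matches the two cases listed above.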

Anyway, this PR unifies those cases so that duplicates are processed the same way regardless of whether the list being appended to is present or absent.
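As a rough illustration of what "unified" means here, the sketch below treats a missing target as an empty list and then runs every incoming value through the same duplicate check. It is an assumption-laden model rather than the code in this PR: AppendUnifiedSketch and appendDeduplicated are hypothetical names, and values are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the unified behavior with allow_duplicates set to false.
// This is NOT the code from the PR, only a sketch of the idea.
final class AppendUnifiedSketch {

    static List<String> appendDeduplicated(List<String> existing, List<String> incoming) {
        // A missing (null) target is treated like an empty list, so there is no special case.
        List<String> result = existing == null ? new ArrayList<>() : new ArrayList<>(existing);
        for (String value : incoming) {
            // The same check applies whether or not the target existed beforehand.
            if (!result.contains(value)) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> incoming = List.of("bar", "baz", "bar");
        System.out.println(appendDeduplicated(List.of("foo", "bar"), incoming)); // [foo, bar, baz]
        System.out.println(appendDeduplicated(List.of(), incoming));             // [bar, baz]
        System.out.println(appendDeduplicated(null, incoming));                  // [bar, baz]
    }
}
```

In this model the null and empty-list cases produce the same result, which is the behavior the PR description is aiming for.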


I discovered this odd behavior in the append processor's allow_duplicates handling while reviewing #105718, and it seemed important to handle it separately from that PR: first, to call it out as a bug, and second, (since it's a bug) to put up a PR that fixes it and can be backported to all the currently-maintained branches.

Personally, I think this PR is best reviewed one commit at a time (the better to see the expected behavior, and then the fix) but you can do it how you'd like.

Oh, and one more thing... I'm of the opinion that this is 'just' a bug (and one that warrants fixing), but in full https://www.hyrumslaw.com fashion, you could argue with me that actually this is a breaking change.

@joegallo joegallo added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v9.2.0 v9.1.4 v9.0.7 v8.18.7 v8.19.4 labels Sep 8, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @joegallo, I've created a changelog YAML for you.

Contributor

@samxbr samxbr left a comment

LGTM, I would agree that this is a bug.

@dakrone
Member

dakrone commented Sep 8, 2025

I also agree this is a bug, but can you describe the issue in the PR body so that it's searchable?

@joegallo joegallo added the auto-backport Automatically create backport pull requests when merged label Sep 9, 2025
@joegallo joegallo merged commit a24f2ad into elastic:main Sep 9, 2025
34 checks passed
@joegallo joegallo deleted the append-fix-allow_duplicates-edge-case branch September 9, 2025 16:37
@elasticsearchmachine
Collaborator

💔 Backport failed

| Status | Branch | Result |
|--------|--------|--------|
|        | 9.1    |        |
|        | 9.0    | Commit could not be cherrypicked due to conflicts |
|        | 8.18   | Commit could not be cherrypicked due to conflicts |
|        | 8.19   |        |

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 134319

@joegallo
Contributor Author

joegallo commented Sep 9, 2025

All the backport PRs are up, so I'm dropping backport pending from this now.

elasticsearchmachine pushed a commit that referenced this pull request Sep 9, 2025
#134377 is an automatically generated backport PR that was merged
without CI actually having run. It turns out that there are changes in
my newly added tests there that aren't compatible with the 9.1 branch.
That would have shaken out if CI had actually run, but it didn't, so I'm
fixing it in post.

This is tangentially related to #134319.
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Sep 9, 2025
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Sep 9, 2025
sarog pushed a commit to portsbuild/elasticsearch that referenced this pull request Sep 11, 2025
sarog pushed a commit to portsbuild/elasticsearch that referenced this pull request Sep 19, 2025