Skip to content

Conversation

@joegallo
Copy link
Contributor

@joegallo joegallo commented Feb 20, 2025

If you upgrade Elasticsearch via a full cluster restart, then the DatabaseConfigurationMetadata entries from the cluster state get cleared out. This is easiest to reproduce by putting a single node cluster through an upgrade OR a restart. In either of those cases, we restore the cluster state by reading the x-content from disk (rather than serializing it over the network), so we're exercising an entirely different path than in the case of the ordinary operations of a cluster, or the case of rolling (node-by-node) upgrades.


I renamed the existing full-cluster-restart test to geoip-reindexed to better capture the purpose of the test, and I added a new full-cluster-restart test that would have caught wrong behavior here (note: it really doesn't do anything more than that, though, but I'd argue that's enough for right now).

As a matter of inside baseball, the previous paragraph means that you'd almost certainly be better off reviewing this PR commit by commit rather than reading the overall diff of the PR. But do as you'd like.

This makes this class more like SnapshotLifecycleMetadata and
IndexLifecycleMetadata, but I don't think this is load bearing code.
The previous String literal version was more like EnrichMetadata, so
there's prior art for that, but it feels more like an odd duck to me.
@joegallo joegallo added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v8.18.1 v8.19.0 v9.0.1 v9.1.0 v8.16.5 v8.17.3 labels Feb 20, 2025
@joegallo joegallo changed the title Register IngestGeoipMetadata as a NamedXContent Register IngestGeoIpMetadata as a NamedXContent Feb 20, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @joegallo, I've created a changelog YAML for you.

@DaveCTurner
Copy link
Contributor

In terms of tests, I would expect ./gradlew :modules:ingest-geoip:qa:full-cluster-restart:check to pick this up. It seems that it doesn't, which indicates maybe this test needs to be expanded.

Also when I ran that task I saw that it only attempts an upgrade from v8.19.0. If it's trying to test BwC then it should be picking up all compatible versions right? And it should definitely be doing a full-cluster-restart without an upgrade too.

@DaveCTurner
Copy link
Contributor

By "pick this up" I mean I would expect some assertion somewhere to trip, if we wrote out some NamedXContent to the cluster state and then couldn't read it back in again. But then I see that we just quietly ignore that kind of issue even in tests:

} catch (NamedObjectNotFoundException ex) {
logger.warn("Skipping unknown custom object with type {}", currentFieldName);
parser.skipChildren();

@DaveCTurner
Copy link
Contributor

FTR we don't hit that lenient code in ./gradlew :modules:ingest-geoip:qa:full-cluster-restart:check anyway, in the sense that this task still passes even with the following:

diff --git a/server/src/main/java/org/elasticsearch/cluster/metadata/Metadata.java b/server/src/main/java/org/elasticsearch/cluster/metadata/Metadata.java
index 35e853cdd55a..bc7e08fd76eb 100644
--- a/server/src/main/java/org/elasticsearch/cluster/metadata/Metadata.java
+++ b/server/src/main/java/org/elasticsearch/cluster/metadata/Metadata.java
@@ -2682,6 +2682,7 @@ public class Metadata implements Iterable<IndexMetadata>, Diffable<Metadata>, Ch
                             Custom custom = parser.namedObject(Custom.class, currentFieldName, null);
                             builder.putCustom(custom.getWriteableName(), custom);
                         } catch (NamedObjectNotFoundException ex) {
+                            assert false : ex;
                             logger.warn("Skipping unknown custom object with type {}", currentFieldName);
                             parser.skipChildren();
                         }

So there's at least three test problems here:

  1. We don't actually run full-cluster-restart tests to cover very many relevant scenarios.
  2. Even in the ones we do cover, we don't try and insert the problematic object in the cluster state first.
  3. Even if we did insert such a problematic object, the tests wouldn't automatically fail when encountering a missing NamedXContent registration.

@joegallo
Copy link
Contributor Author

joegallo commented Feb 21, 2025

I happen to know the history here -- I wrote the FullClusterRestartIT three years ago in #85792 specifically to cover the case of upgrading and using the system indices migration feature, that's why it covers such a limited number of upgrade-from versions. Reading between the lines of what I know now from this conversation, that test is poorly named -- we don't have (and never have had) a true FullClusterRestartIT for the geoip code.

I'll rename the test to better reflect its true purpose, and write a new FullClusterRestartIT that would have failed without the fix on this PR.

I'll add the assert you mentioned, too. I'd already found that line of code and had a big loop of printlns there to figure out why it was claiming that there wasn't any way to handle "ingest_geoip" when I'd clearly registered all the relevant NamedWriteables (I'm in the future, too, so I understand that this thinking was not right, although I can't actually explain confidently why it isn't right).

However, I think we should go one more: I'm curious what it would look like to add an assert that makes sure that a cluster state that we write to disk is deserializable from disk (immediately after writing). This would have failed on the code prior to this PR, and it wouldn't have required a test that we don't have, and that, indeed, I didn't realize I had to write.

@DaveCTurner
Copy link
Contributor

However, I think we should go one more: I'm curious what it would look like to add an assert that makes sure that a cluster state that we write to disk is deserializable from disk (immediately after writing).

We already have this, see org.elasticsearch.gateway.PersistedClusterStateService#getAssertOnCommit. The trouble is that the state in question is deserializable (i.e. doesn't throw an exception) but not faithfully so.

@joegallo joegallo added the auto-backport Automatically create backport pull requests when merged label Feb 21, 2025
@joegallo joegallo marked this pull request as ready for review February 24, 2025 19:06
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@joegallo joegallo merged commit 6315b8a into elastic:main Feb 24, 2025
17 checks passed
@joegallo joegallo deleted the register-geoip-metadata-as-named-xcontent branch February 24, 2025 22:25
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.18 Commit could not be cherrypicked due to conflicts
8.x Commit could not be cherrypicked due to conflicts
9.0
8.16 Commit could not be cherrypicked due to conflicts
8.17 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 123079

@joegallo
Copy link
Contributor Author

Manual backports are up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v8.16.5 v8.17.3 v8.18.1 v8.19.0 v9.0.1 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants