@samxbr
Contributor

@samxbr samxbr commented Jun 23, 2025

Make GeoIP database node loader project-aware:

  • loads the downloaded GeoIP databases from system index to ingest node file system for each project
  • each project's databases are loaded to directory tmp/geoip-databases/{nodeId}/{projectId}
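As a rough illustration of the directory layout in the list above, here is a minimal sketch. The class and method names are hypothetical and not taken from this PR, and node and project IDs are modeled as plain strings rather than the real `ProjectId` type.

```java
import java.nio.file.Path;

// Hypothetical helper for illustration only; not the PR's actual code.
final class GeoIpDirLayout {
    private GeoIpDirLayout() {}

    /** Resolves {tmpDir}/geoip-databases/{nodeId}/{projectId}. */
    static Path projectDatabaseDir(Path tmpDir, String nodeId, String projectId) {
        return tmpDir.resolve("geoip-databases").resolve(nodeId).resolve(projectId);
    }
}
```

Keeping the project ID as the innermost path segment means one project's databases can be cleaned up without touching another project's files on the same node.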

Note: more work is needed to make the REST tests run in multi-project (MP) mode.

Apologies for touching so many classes; the changes are related and hard to split into separate PRs. Some FixForMultiProject annotations will be addressed in separate PRs to keep this PR small.

@samxbr samxbr added the >non-issue and :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP) labels Jun 23, 2025
@samxbr samxbr requested a review from PeteGillinElastic June 23, 2025 09:19
@samxbr samxbr marked this pull request as ready for review June 23, 2025 09:20
@elasticsearchmachine elasticsearchmachine added the Team:Data Management (Meta label for data/management team) label Jun 23, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Member

@PeteGillinElastic PeteGillinElastic left a comment

Mostly LGTM, just a few comments. I'll hold off on approving because @nielsbauman said he'd like to take a look as well.


@Before
private void setup() {
    projectId = ProjectId.DEFAULT;
Member

Am I right that there's more work required to make this pass with a random project? Can we add a @FixForMultiProject to make sure we remember to come back to it?

Contributor

AbstractGeoIpIT extends ESIntegTestCase. We don't yet have support for running internal cluster tests in MP mode, so we're bound to use the default project ID in these tests.

That said, we don't need to reinitialize this field for every test. At the very least, we should make it a private final or even a private static final field, although I personally lean towards just passing the ProjectId.DEFAULT constant where we need it, as a private static final doesn't feel very valuable to me.

Contributor Author

Exactly as Niels said, ESIntegTestCase is not MP-enabled yet, so I added a @FixForMultiProject as a reminder. I prefer keeping it as a class variable rather than a local variable, since that makes it easier to change later in a single place for all tests.

  private ProjectResolver projectResolver;

- private final ConcurrentMap<String, DatabaseReaderLazyLoader> databases = new ConcurrentHashMap<>();
+ private final ConcurrentMap<ProjectId, ConcurrentMap<String, DatabaseReaderLazyLoader>> databases = new ConcurrentHashMap<>();
Member

Would it be simpler to use a single map keyed by a two-member record, rather than the nested maps?

Member

It would save having to do the databases.computeIfAbsent(projectId, (k) -> new ConcurrentHashMap<>()) dance in a few places. On the other hand, it would mean that in the removeStale... method you'd have to iterate over everything and filter, rather than being able to go straight to the map for the project in question... I'm not sure which is better.
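To make the record-key alternative concrete, here is a minimal sketch. All names are hypothetical, and `String` stands in for both `ProjectId` and `DatabaseReaderLazyLoader`; this is not the code in the PR.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: a single flat map keyed by a two-member record.
final class RecordKeyedDatabases {
    record Key(String projectId, String databaseName) {}

    private final ConcurrentMap<Key, String> databases = new ConcurrentHashMap<>();

    void put(String projectId, String databaseName, String loader) {
        // No computeIfAbsent needed: the composite key is inserted directly.
        databases.put(new Key(projectId, databaseName), loader);
    }

    String get(String projectId, String databaseName) {
        return databases.get(new Key(projectId, databaseName));
    }

    // Removing one project's entries requires scanning all keys and filtering.
    int removeProject(String projectId) {
        int before = databases.size();
        databases.keySet().removeIf(key -> key.projectId().equals(projectId));
        return before - databases.size();
    }
}
```

Inserts and lookups become one-liners, while per-project cleanup has to walk the whole key set, which is the tradeoff described in the comment above.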

Contributor

I always struggle with these kinds of tradeoffs. FWIW, the "dance" could also be partially avoided by adding a private method that does the retrieval. I think the performance impact of the iteration in removeStale is acceptable, as that method won't be called frequently. Personally, I usually prefer nested maps, as they avoid an extra record class and extra object allocations. I don't think there are strong arguments either way in this case, so I don't think it matters much.
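The private-method idea for the nested maps could look roughly like the sketch below. The names are hypothetical, and `String` again stands in for `ProjectId` and `DatabaseReaderLazyLoader`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: nested maps with the computeIfAbsent call in one helper.
final class NestedDatabases {
    private final ConcurrentMap<String, ConcurrentMap<String, String>> databases = new ConcurrentHashMap<>();

    // The only place that creates the per-project map on demand.
    private ConcurrentMap<String, String> forProject(String projectId) {
        return databases.computeIfAbsent(projectId, k -> new ConcurrentHashMap<>());
    }

    void put(String projectId, String databaseName, String loader) {
        forProject(projectId).put(databaseName, loader);
    }

    String get(String projectId, String databaseName) {
        // Read path avoids creating an empty map for unknown projects.
        Map<String, String> perProject = databases.get(projectId);
        return perProject == null ? null : perProject.get(databaseName);
    }

    // Cleanup goes straight to the project's own map; no scan of other projects.
    int removeProject(String projectId) {
        Map<String, String> removed = databases.remove(projectId);
        return removed == null ? 0 : removed.size();
    }
}
```

With the helper in place, callers never repeat the `computeIfAbsent` expression, and removing a project is a single map operation.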

Contributor Author

I used nested maps because they save object creation, as Niels pointed out (and we use nested maps in a few other places, so it's arguably more consistent). A record key does make the map initialization easier. I don't have a strong opinion either way, so unless anyone feels strongly about this, I'll leave it unchanged :)

Member

I don't feel strongly :-)

@Override
public FileVisitResult visitFileFailed(Path file, IOException e) {
    if (e instanceof NoSuchFileException == false) {
        // https://github.com/elastic/elasticsearch/issues/104782
Member

I'm not sure what this comment is meant to tell the reader?

Contributor Author

Ah, my IDE warns here because it thinks I should use a parameterized log message ({}), but that would fail the logger check. I added the comment to point to an earlier issue that explains this, and I've now updated it to make that clearer.
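For context on the visitFileFailed snippet under discussion, here is a minimal, self-contained sketch of a tree-deleting file visitor that ignores NoSuchFileException. The class name is hypothetical, and collecting warnings in a list is an assumption for illustration; the snippet in this PR does not show what the real code does inside the `if`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: delete a directory tree, tolerating files that vanish mid-walk.
final class TolerantDeleter {
    private TolerantDeleter() {}

    /** Deletes the tree under root and returns warnings for unexpected visit failures. */
    static List<String> deleteTree(Path root) {
        List<String> warnings = new ArrayList<>();
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    Files.deleteIfExists(file);
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException e) {
                    // A file that is already gone is not worth reporting.
                    if (e instanceof NoSuchFileException == false) {
                        warnings.add("can't visit " + file + ": " + e); // assumption: stand-in for logging
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException e) throws IOException {
                    Files.deleteIfExists(dir);
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return warnings;
    }
}
```

Calling `deleteTree` on a path that does not exist produces no warnings, because the walk reports the missing start path through `visitFileFailed` with a NoSuchFileException.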

Contributor

@nielsbauman nielsbauman left a comment

Just left one more comment but other than that LGTM, so I'm approving to allow you to merge in your morning.

Comment on lines 128 to 131
boolean multiProject = randomBoolean();
projectId = multiProject ? randomProjectIdOrDefault() : ProjectId.DEFAULT;
projectResolver = multiProject ? TestProjectResolvers.singleProject(projectId) : TestProjectResolvers.DEFAULT_PROJECT_ONLY;
projectId = randomProjectIdOrDefault();
Contributor

I assume we do the multiProject switch because we want to cover when the project resolver doesn't support multiple projects?

  1. Could you add a comment explaining that? (also in GeoIpProcessorFactoryTests.java)
  2. I don't think the last projectId assignment is correct, right? I think that line should be deleted.

Contributor Author

Ah, good catch, that's a leftover from before.
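To illustrate why the trailing assignment matters, here is a small sketch of the setup logic with stand-in types. `Resolver`, `Fixture`, and the string-based project IDs are hypothetical simplifications and not the real `ProjectResolver` or `TestProjectResolvers` API.

```java
import java.util.Random;

// Hypothetical sketch of the test setup; all types are simplified stand-ins.
final class SetupSketch {
    record Resolver(String projectId) {}

    record Fixture(String projectId, Resolver resolver) {}

    static final String DEFAULT_PROJECT = "default";

    static String randomProjectIdOrDefault(Random random) {
        return random.nextBoolean() ? DEFAULT_PROJECT : "project-" + random.nextInt(1000);
    }

    static Fixture setup(Random random) {
        // Randomly exercise either a project-aware resolver or a default-project-only one.
        boolean multiProject = random.nextBoolean();
        String projectId = multiProject ? randomProjectIdOrDefault(random) : DEFAULT_PROJECT;
        Resolver resolver = new Resolver(projectId);
        // A further `projectId = randomProjectIdOrDefault(random);` at this point would
        // re-randomize the ID after the resolver was built, so the two could disagree.
        return new Fixture(projectId, resolver);
    }
}
```

With the extra assignment removed, the project ID used by the test always matches the one the resolver was built with.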

@samxbr samxbr merged commit 41a47c2 into elastic:main Jun 25, 2025
33 checks passed
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025
- loads the downloaded GeoIP databases from system index to ingest node file system for each project
- each project's databases are loaded to directory `tmp/geoip-databases/{nodeId}/{projectId}`

Labels

:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), >non-issue, Team:Data Management (Meta label for data/management team), v9.1.0

4 participants