- 
                Notifications
    You must be signed in to change notification settings 
- Fork 25.6k
Description
The geoip processor is one of our more costly ones (see comment), and a common pattern is to have several of them in a pipeline (e.g. a location and asn geoip processor for both a source and destination ip address, so that'd be four processors).
We have a cache around the actual geoip database lookup, but even in the case of a cache hit, it's still relatively expensive.
First, the cache is roughly an InetAddress --> Response map. However, we receive ip addresses as strings -- if the cache used String keys we could avoid even converting the String to an InetAddress.
Second, the cached values are not small, here these CityResponse objects are ~7kb a piece:
If we cached a more compact 'destination' object (e.g. a Map<String, Object>), we could expand the size (in objects) of the cache while not increasing the memory footprint (in bytes) -- that is, for a given bytes-size of cache we could have more records cached.
Third, the cached values are not optimized for cache-y usage. Consider for example this code snippet:
elasticsearch/modules/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpProcessor.java
Line 231 in e81f461
| String countryName = country.getName(); | 
and then see the implementation of getName itself:
Each time we process a cache hit, we call several methods that end up invoking a getName() and those end up doing a map scan to find a result String, but the returned String would be the same each time -- rather than caching a Response and building an object from it N times, we could build such an object once and just use it as the cached value.
That is, we're paying a price to build the cache key (String to InetAddress conversion), and we're paying both a memory and cpu price for using the Response as the cache value.
