
Dynamic internal object cache #2368

@joto

Description


In general, osm2pgsql is built around the principle that, while processing, it looks at only one OSM object at a time. But that is a simplified view: there are already cases where we look at more than one object at a time, and there will likely be more such cases in the future with the advanced relations processing that is often asked for.

For this to work we need to store OSM objects (or at least their locations) in the middle and get them back when needed. Osm2pgsql already has code for that. Unfortunately that code has changed many times over the years and has become hard to reason about and to check, as the recent PRs #2365 and #2367 have shown.

Basically the problem is this:

  • We need different pieces of objects at different times and for different reasons. Sometimes we only need the geometry (location), sometimes the tags, sometimes related objects (members, parents, ...).
  • We cannot predict which pieces of data we will need, because that depends on the complex logic implemented in Lua scripts by the user.
  • Depending on which middle is used (RAM middle or pgsql middle) and on several options, parts of the data can be stored in different places. Some of these places are expensive to access (mainly the database).
  • Accessing the database is more efficient if we don't do it every time we need something. For instance, if and when we need a node member of a relation, it makes sense to also get the other node members in the same query. Chances are we are going to need them too, and we can do one query instead of n queries for n nodes.
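
The batching idea in the last point could look roughly like this. This is a minimal C++ sketch, not osm2pgsql's actual API; `node_store`, `location_t`, and `get_many` are hypothetical names, and an in-memory map stands in for the real backend:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using osmid_t = std::int64_t;

struct location_t {
    double lon = 0.0;
    double lat = 0.0;
};

// Hypothetical batched lookup: collect all the node ids a relation
// needs and resolve them in a single call instead of one per id.
class node_store
{
public:
    void set(osmid_t id, location_t loc) { m_data[id] = loc; }

    // One round trip for the whole id list; missing ids are skipped.
    std::unordered_map<osmid_t, location_t>
    get_many(std::vector<osmid_t> const &ids) const
    {
        std::unordered_map<osmid_t, location_t> result;
        for (auto const id : ids) {
            if (auto const it = m_data.find(id); it != m_data.end()) {
                result.emplace(id, it->second);
            }
        }
        return result;
    }

private:
    std::unordered_map<osmid_t, location_t> m_data;
};
```

With a pgsql middle the body of `get_many()` would become a single `WHERE id = ANY(...)` query instead of n point lookups.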

Keeping track of all this "manually" in the code will lead to headaches and bugs every time we want to add new features in osm2pgsql that need extra bits and pieces of objects. So we should think about a better way to solve this.

We'd need some kind of "smart cache", either in the middle implementations or between the middles (RAM and pgsql) and the users of the middle, that answers requests for objects. If an object is not available yet, the cache retrieves it, possibly together with other pieces of data.

To make this work without the outside code having to understand the details, the cache must be accessed through the objects themselves. For instance, the outside code says "give me node 17" and gets a proxy object back. When the code then uses the object ("give me the location for this node"), the proxy figures out that it needs to fetch the location "just in time". It stores the location in the proxy so it doesn't have to fetch it again if the code needs the location a second time. The cache probably also needs some kind of interface to get more than one object at a time, so that it can optimize database queries as mentioned above.
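
The proxy idea could be sketched like this. Again a hypothetical design, not existing osm2pgsql code; `node_proxy` memoizes the location after the first "just in time" fetch, and the `fetch` callback stands in for the cache/middle lookup:

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

using osmid_t = std::int64_t;

struct location_t {
    double lon;
    double lat;
};

// Hypothetical proxy: the caller asks the cache for "node 17" and gets
// one of these back. The location is fetched from the middle only on
// first access and then memoized for later accesses.
class node_proxy
{
public:
    node_proxy(osmid_t id, std::function<location_t(osmid_t)> fetch)
    : m_id(id), m_fetch(std::move(fetch))
    {}

    osmid_t id() const noexcept { return m_id; }

    location_t const &location()
    {
        if (!m_location) {        // first use: get it "just in time"
            m_location = m_fetch(m_id);
        }
        return *m_location;       // later uses: no backend access
    }

private:
    osmid_t m_id;
    std::function<location_t(osmid_t)> m_fetch;
    std::optional<location_t> m_location;
};
```

The same pattern would extend to tags and member lists: each piece gets its own lazily-filled slot, and only the pieces the Lua logic actually touches are ever fetched.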

Currently we are using osmium::Node/Way/Relation objects in many places. But they are cumbersome, because they have to live in an osmium::Buffer. And they have no space to store the extra data needed for our proxy objects. We have to change all the code to work with those proxy objects instead. The only place where we really need the Osmium objects is when interacting with the Osmium library, which is when reading the data from the input file and when building multipolygons. We need to take that into account, but I believe that in all other cases we can move away from that interface.

One other thing we need to keep in mind here: one way to speed things up is multithreading. If an extra thread can ask the database for objects we are likely to need soon, we could hide the query latency. But that means the cache would have to support multithreading in some form.
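
A toy sketch of that prefetching, assuming `std::async` for the worker thread; the lambda's stand-in computation replaces the real database query, and a production cache would of course need proper synchronization around shared state:

```cpp
#include <cstdint>
#include <future>
#include <unordered_map>
#include <utility>
#include <vector>

using osmid_t = std::int64_t;

struct location_t {
    double lon;
    double lat;
};

// Hypothetical prefetch: start the expensive backend lookup on a worker
// thread; the main thread keeps processing and only blocks on get()
// when the results are actually needed.
std::future<std::unordered_map<osmid_t, location_t>>
prefetch_nodes(std::vector<osmid_t> ids)
{
    return std::async(std::launch::async, [ids = std::move(ids)]() {
        std::unordered_map<osmid_t, location_t> result;
        for (auto const id : ids) {
            // stand-in for the real database lookup
            result.emplace(id, location_t{static_cast<double>(id), 0.0});
        }
        return result;
    });
}
```

The caller would kick this off as soon as it knows which members a relation has, do other work, and call `.get()` on the future when the locations are actually needed.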
