Description
osm2pgsql by default "clusters" all output tables by geometry. I have put "clusters" in quotes because this doesn't use the CLUSTER functionality of the database; we do this our own way.
How this works:
1. Data is first written into a kind of temporary table. It isn't a TEMP table in the SQL sense, but it is created as UNLOGGED, so no WAL is written, which speeds up writing. If the database crashes during the import its contents are automatically discarded, but that's fine, we have to restart from scratch anyway.
2. After the data import is finished, a new "final table" is created as a normal table.
3. Then we do an INSERT into the new table selecting all data from the temporary table with ORDER BY [geomcolumn]. So all data is copied over, but in sorted form.
4. We delete the temporary table.
5. Then we build a geometry index on the final table.
(You can see here why we are not using the PostgreSQL CLUSTER command. For CLUSTER we'd have to build the table, build the index, and then run the CLUSTER command. The effect is that a) we cannot use an UNLOGGED table for the first version of the table, because it is the same as the final one, and b) the index basically has to be built twice. This makes our approach more efficient, albeit a bit more complex.)
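In SQL terms, the steps above look roughly like this. This is only a sketch; the table and column names (import_lines, planet_osm_line, way) are placeholders, not necessarily what osm2pgsql actually emits:

```sql
-- Step 1: import target, created UNLOGGED so no WAL is written.
CREATE UNLOGGED TABLE import_lines (
    osm_id bigint,
    tags   jsonb,
    way    geometry(LineString, 3857)
);
-- ... data is loaded into import_lines with COPY ...

-- Steps 2 and 3: create the final table and copy everything over sorted by
-- geometry (PostGIS orders geometries along a Hilbert curve, see below).
CREATE TABLE planet_osm_line (LIKE import_lines);
INSERT INTO planet_osm_line SELECT * FROM import_lines ORDER BY way;

-- Step 4: remove the temporary table.
DROP TABLE import_lines;

-- Step 5: build the geometry index on the final table.
CREATE INDEX ON planet_osm_line USING GIST (way);
```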
Problems
What we are doing here has two problems:
A. While we are copying the table in step (3), we temporarily need twice the amount of disk space, because the old and the new table exist at the same time. A table with, say, all of the planet's roughly 1 billion ways in it can easily need 300 GB or more, so that is quite a lot of extra disk space. This is made worse because usually (if --disable-parallel-indexing is not used) we do step (3) for several tables in parallel. So this step needs a lot more disk space in the import phase than is used in the later day-to-day operation of the database, which is an inefficient use of the available disk space.
B. We also need to do the sorting of the table contents in step (3). As long as the data being sorted fits into the memory set with work_mem, the database will do the sort in memory. If it does not fit, the database will sort the data in chunks and write those out to disk. That means we'll have yet another temporary copy of the data on disk, compounding the problem explained above. It also means a lot more disk I/O, slowing down the whole process.
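Whether the sort in step (3) fits into memory can be checked on a test system with EXPLAIN; a minimal sketch, using the placeholder names from above:

```sql
-- Raise the per-sort memory budget for this session (the value is only an example).
SET work_mem = '1GB';

-- If the plan output shows "Sort Method: external merge  Disk: ..." the sort
-- spilled to disk; "Sort Method: quicksort  Memory: ..." means it stayed in RAM.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM import_lines ORDER BY way;
```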
A way to improve this would be to write the data in step (1) not into one table but into many tables, doing some kind of pre-sorting. And then in step (3) sort and copy each of the temporary tables one after the other. This would solve problem A, because we never duplicate the whole data set on disk, only the part currently being copied. And it can help with problem B, because ideally the sorting in step (3) can now be done in memory.
Why are we doing the clustering?
It probably makes sense to quickly talk about why we are doing this "clustering".
The idea is that data that will often be queried together (because it is "near" each other) is located in the same or adjacent blocks on disk. This makes getting the data much more efficient.
Note that this doesn't have to be perfect. In normal operation, because data is changed all the time, table ordering will "degrade" over time anyway. Functionality will not be affected if the ordering is not perfect, but it will give us a nice performance bonus.
What does "ordering" mean for a geometry
I have glossed over the question of what exactly ordering means in this context. How are geometries ordered? The answer is that modern PostGIS sorts geometries along a Hilbert curve based on the center of the bounding box of each geometry. You can read about the details here.
This is actually pretty convenient, because if we want to split the large temporary tables into smaller ones, we only have to split the world into quadrants (or 16 tiles, or 64 tiles, and so on) and write the data for each one into its own table. Sorting each of those tables individually and appending them to the final table in the order given by the Hilbert curve should then produce the right overall order, because the curve completely traverses one tile before it enters the next.
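As an illustration only, assigning each geometry to one of four quadrants could look like the sketch below (in Web Mercator coordinates, with the placeholder names from above). The tile numbers produced here would still have to be mapped to the order in which the Hilbert curve visits the quadrants, and osm2pgsql would do this calculation itself rather than in SQL:

```sql
-- Hypothetical sketch: put each geometry into one of four quadrants based on
-- the center of its bounding box.
SELECT osm_id,
       CASE WHEN ST_X(c) < 0 THEN 0 ELSE 1 END
     + CASE WHEN ST_Y(c) < 0 THEN 0 ELSE 2 END AS tile
FROM (
    SELECT osm_id, ST_Centroid(ST_Envelope(way)) AS c
    FROM import_lines
) AS sub;
```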
Proposal
We can use partitioning on the output tables. We create the partitioned table with one extra column that holds the "tile" the row falls into. We write to the partitioned table as usual, adding the "tile" from our own Hilbert curve calculation, and PostgreSQL will put the data into the correct partition. After the import we copy the ordered content of each partition in turn (in the order given by the Hilbert curve) into the final table. After each copy we remove the partition that was just copied.
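A rough sketch of what that could look like, again with placeholder names and a 2x2 split just for illustration:

```sql
-- Import-side table, partitioned by the tile value computed by osm2pgsql.
CREATE TABLE import_lines (
    osm_id bigint,
    tags   jsonb,
    way    geometry(LineString, 3857),
    tile   int
) PARTITION BY LIST (tile);

-- One UNLOGGED partition per tile.
CREATE UNLOGGED TABLE import_lines_0 PARTITION OF import_lines FOR VALUES IN (0);
CREATE UNLOGGED TABLE import_lines_1 PARTITION OF import_lines FOR VALUES IN (1);
CREATE UNLOGGED TABLE import_lines_2 PARTITION OF import_lines FOR VALUES IN (2);
CREATE UNLOGGED TABLE import_lines_3 PARTITION OF import_lines FOR VALUES IN (3);

-- After the import: copy each partition, sorted, into the final table in the
-- order given by the Hilbert curve, dropping it before moving on to the next.
INSERT INTO planet_osm_line
    SELECT osm_id, tags, way FROM import_lines_0 ORDER BY way;
DROP TABLE import_lines_0;
-- ... repeat for the remaining partitions in Hilbert order ...
```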
Data is not spread uniformly around the globe, so we will need more than (all data / number of partitions) of extra disk space, but with 16 or 64 partitions that should be reduced enough that it doesn't matter any more.
What about the number of tables we are creating here? It seems there are no limits we have any chance of hitting. From all I have read, having thousands of tables in PostgreSQL is totally fine and people routinely use that in production. And we'd only need them during the import anyway. So we should not run into any problems even if we have, say, 50 output tables and split each of them into 64 or even more subtables. Empty tables seem to consume less than 1 kByte of disk space, so having thousands of them around unnecessarily also doesn't seem to be a problem.
Options
Other options for getting a similar result:
- Use several tables as described, but without the partitioning feature. In this case we have to write to all the individual tables ourselves, which would mean a lot more changes in our code.
- Don't use temporary tables at all. Write the COPY output into files instead of directly into the database, using the same partitioning scheme. After all data is processed, sort each file ourselves and write all of them to the database in one go (see the sketch below). This gives us a bit more control and should have the least overhead, because the database isn't involved at all until the end. But it needs more work and more code.
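A minimal sketch of the final load step for this option, assuming osm2pgsql has already written one pre-sorted COPY file per tile (the file names are made up, and in practice the data would more likely be streamed over the connection with COPY ... FROM STDIN):

```sql
-- Load the pre-sorted per-tile files into the final table, in the order given
-- by the Hilbert curve, and build the index once at the end.
COPY planet_osm_line FROM '/path/to/lines_tile_0.copy';
COPY planet_osm_line FROM '/path/to/lines_tile_1.copy';
-- ... remaining tiles in Hilbert order ...
CREATE INDEX ON planet_osm_line USING GIST (way);
```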