Recursion with PostgreSQL Follow-Up 1 - Performance #4
vbilopav started this conversation in PostgreSQL
In the previous article, Vik Fearing shared a code snippet using the CYCLE feature of Postgres 14 to achieve this, and it seems that even prior to Postgres 14, the Postgres documentation contained sample code that addressed precisely the kind of "depth" output required here, by tracking an array of "seen" rows and using an ANY operator.
This is a follow-up article to my previous blog post, A Different Type of SQL Recursion with PostgreSQL.
Previously, I explored two different approaches to recursion with PostgreSQL.
Now, it's time to do some testing.
Testing
Instead of a view on system tables that enables us to traverse the table relationship tree, I've decided to create a simple test table:
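The original table definition isn't reproduced here; a minimal sketch of what such a test table could look like, assuming a plain `id`/`parent_id` pair of text columns (the exact layout is an assumption), is:

```sql
-- A guessed, minimal layout for the test table used in this post;
-- the actual definition from the original article may differ.
create table big_tree_test (
    id text not null,
    parent_id text
);
```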
And then use a script to insert as many random values as I like. This script will insert text nodes in groups. There are two parameters (a rough sketch of such a generator follows the list):

- `_max_rows` - how many `id` rows will be generated
- `_max_groups` - how many times any `id` can repeat to point to different parents (minimum 1, maximum this number)
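The original generator script isn't shown above; a rough sketch of how such a script might look, using the two parameters just described (the grouping logic and randomization details are assumptions), is:

```sql
-- Hypothetical generator sketch: inserts _max_rows node ids, each repeated a
-- random number of times (1 to _max_groups) with randomly chosen parents.
-- Distinct parents per id are not enforced in this sketch.
do
$$
declare
    _max_rows int = 10;
    _max_groups int = 3;
    _i int;
    _j int;
begin
    for _i in 1.._max_rows loop
        for _j in 1..(1 + floor(random() * _max_groups))::int loop
            insert into big_tree_test (id, parent_id)
            values (
                'node_' || _i,
                'node_' || (1 + floor(random() * _max_rows))::int
            );
        end loop;
    end loop;
end;
$$;
```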
For `_max_rows = 10` and `_max_groups = 3`, this script will generate the following test set:

So, based on a manual calculation, this is the result we need to get for processing `node_1`:

Finally, we can create new functions to work with the `big_tree_test` table:

So, we're all set. Let's go.
Problem
Right from the start, while doing smoke tests, I noticed something very wrong. The function `select_nodes_recursive_cte` doesn't return the expected results at all.

For the small test set above, this is what it returns:

Nodes `node_5`, `node_10`, and `node_9` are all level 3, not 3, 4 and 6, and `node_7` is level 4, not 6.

After some investigation, it turns out that my "islands and gaps" strategy (see the previous article) is completely wrong!

The implementation was fine; the fact is that every change in the `id` row doesn't automatically mean that we have a new recursion level. At all.

And after some more investigation, it appears that the only way to implement the recursion level calculation is to add a level counter to the CTE union queries.
However, there is a big problem with this approach:
There is no way to know which node has already been processed in a recursive CTE, so there is no way to skip it entirely, and:
So, my final version looks like this now:
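The function body isn't reproduced above. A sketch of the general shape being described, a recursive CTE with a level counter and an arbitrary hard depth cut-off, might look like this (the traversal direction, column names, and the cut-off value of 100 are all assumptions):

```sql
-- Hypothetical sketch: recursive CTE with a level counter and a hard,
-- arbitrary depth cut-off to stop it from running in circles forever.
-- Nothing prevents the same node from being visited more than once.
create or replace function select_nodes_recursive_cte(_id text)
returns table (id text, parent_id text, level int)
language sql
as
$$
with recursive tree as (
    select t.id, t.parent_id, 1 as level
    from big_tree_test t
    where t.id = _id
    union all
    select t.id, t.parent_id, tree.level + 1
    from big_tree_test t
    join tree on t.id = tree.parent_id
    where tree.level < 100 -- arbitrary recursion threshold
)
select * from tree;
$$;
```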
But this is a very bad solution, and here is why:
We don't know which recursion depth is appropriate, and increasing the recursion threshold will only slow this function dramatically. We will still not know whether we fetched all levels or whether the recursive CTE ran in circles when it encountered a circular reference.
Let's see some tests - a test table with 1290 records and 250 nodes:
So that's no good. If anyone has a better implementation with recursion level (depth) implemented, I'd like to see it.
The Solution
Let's try our procedural approach to this problem to see how it performs. Same test table, 1290 records and 250 nodes: 0.783 seconds.
It's not bad at all, but let's try a bit bigger dataset:
ERROR: stack depth limit exceeded
So that's no good.
Let's see if we can optimize this function a bit.
Optimization 1
The first thing I noticed was that the message log was full of `NOTICE: relation "_result" already exists, skipping` messages.

I'm not sure if this even counts as an optimization, but I would first like to get rid of these messages.
So, instead of this:
We will have this:
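Neither snippet is reproduced above. Assuming the notice comes from `create temp table if not exists` (which is the statement that emits that message), one way to silence it is to check for the temp table explicitly, roughly like this inside the PL/pgSQL body (the column layout of `_result` is an assumption):

```sql
-- Hypothetical sketch: check whether the temp table already exists instead of
-- relying on "if not exists", which raises the NOTICE on every repeated call.
if to_regclass('pg_temp._result') is null then
    create temp table _result (
        id text,
        parent_id text,
        level int
    ) on commit drop;
end if;
```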
The second thing I would try is to get rid of that loop. Looping is an anti-pattern in SQL anyhow.

Instead, this is what I came up with: take the `id` parameter and aggregate the `parent_id` values into an array variable.

This is how that looks:
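The original snippet isn't shown above; a rough sketch of that aggregation, assuming the table and variable names used so far, could be:

```sql
-- Hypothetical sketch: collect all parents of the current node into an array
-- in one statement, instead of looping over rows one by one.
select array_agg(distinct t.parent_id)
into _parents
from big_tree_test t
where t.id = _id
  and t.parent_id is not null;
```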
Now, let's do performance things again:
Not bad, but let's see if we can do even better.
Optimization 2
The idea is to reduce the number of recursion calls as much as possible. Currently, we have this:
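The snippet isn't reproduced above; the pattern being described is one recursive call per array element, which might look roughly like this inside the PL/pgSQL body (the variable names are assumptions, and `_parent` is assumed to be declared as `text`):

```sql
-- Hypothetical sketch of the current per-element recursion:
-- one call to the function for every parent found at this level.
foreach _parent in array _parents loop
    perform select_nodes_recursive_function(_parent, _level + 1);
end loop;
```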
This executes the `select_nodes_recursive_function` function once for each element in the `_parents` array variable. Instead, we could change the input parameter type so that it receives an entire array instead of a single value.

Great idea, let's do it!
Oh, wait, but to have that work as intended, we need to slightly modify the condition we use to exit the recursion, in two steps: exclude from the `id` parameter all IDs that are already processed (we don't want them), and exit when nothing is left to process.

Also, we need to use the `any` operator when filtering our table for the insert:

This seems to produce satisfying results...
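The actual statement isn't shown above; a sketch of what that filtered insert could look like, assuming the `_result` temp table and an `_ids text[]` parameter (both names are assumptions), is:

```sql
-- Hypothetical sketch: insert every row whose id matches any element of the
-- array parameter, skipping ids that are already in the result.
insert into _result (id, parent_id, level)
select t.id, t.parent_id, _level
from big_tree_test t
where t.id = any(_ids)
  and not exists (select 1 from _result r where r.id = t.id);
```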
Now, let's do performance things again:
Wait what?
Let's increase the test data set some more:
So it's almost 757K tree records with 20K unique nodes in 5 seconds, and I haven't even used any indexes yet. Not bad, not bad at all...
Finally, let's see what that magic function looks like.
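The full source isn't reproduced here. A simplified sketch of what such an array-based recursive function could look like, combining the pieces discussed above (the exact signature, the `_result` temp table layout, and most details are assumptions), might be:

```sql
-- Hypothetical sketch of an array-based recursive tree function.
-- It collects visited rows in a temp table, skips ids that were already
-- processed, and recurses once per level with an array of parents.
create or replace function select_nodes_recursive_function(
    _ids text[],
    _level int default 1
)
returns table (id text, parent_id text, level int)
language plpgsql
as
$$
declare
    _parents text[];
begin
    -- note: _result lives until the end of the transaction (on commit drop)
    if to_regclass('pg_temp._result') is null then
        create temp table _result (id text, parent_id text, level int)
        on commit drop;
    end if;

    -- keep only the ids that haven't been processed yet
    _ids := array(
        select unnest(_ids)
        except
        select r.id from _result r
    );

    if _ids is not null and cardinality(_ids) > 0 then

        -- record every row that matches the current batch of ids
        insert into _result (id, parent_id, level)
        select t.id, t.parent_id, _level
        from big_tree_test t
        where t.id = any(_ids);

        -- collect the next batch: all parents of the current batch
        select array_agg(distinct t.parent_id)
        into _parents
        from big_tree_test t
        where t.id = any(_ids) and t.parent_id is not null;

        -- a single recursive call per level instead of one per node
        if _parents is not null then
            perform select_nodes_recursive_function(_parents, _level + 1);
        end if;
    end if;

    return query select r.id, r.parent_id, r.level from _result r;
end;
$$;
```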
To invoke this function, pass the array parameter:
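An invocation, again assuming the sketched signature above, might look like this:

```sql
-- Hypothetical call: start from node_1 and return all reachable rows with levels.
select * from select_nodes_recursive_function(array['node_1']);
```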
And that's it: fast tree parsing recursion for PostgreSQL.
Conclusion
This started as an exploration into the possibilities of parsing tree structures with PostgreSQL and has now grown into a series of articles.
My conclusion is that SQL is not that suitable for recursion. As we can see in this article, it has some severe limitations, mainly regarding recursion depth issues.
As far as I know, SQL was never even intended to do that in the first place.
On the other hand, the more procedural approach, optimized properly, can yield some spectacular results.
However, this approach may not be for everybody.
In the next follow-up, I'll focus on debugging and testing. Those two issues are the biggest concerns, as I understand from the comments. We can use the TDD approach to develop this function again.
Edit
The new follow-up in this series: Recursion with PostgreSQL Follow-Up 2 - CYCLE Detection and Processing Millions of Nodes