Recursion with PostgreSQL Follow-Up 3 - Finding the Right Path #6
Pinned
vbilopav
started this conversation in
PostgreSQL
Replies: 1 comment 2 replies
-
I'm only now coming back to have a look at your original requirement (in my words: select the partition of connected tables, "partition" as in graph theory). This is what I'd do, while most of the complexity comes from the requirement to also show the distance:
Instead of using two arrays, a JSONB object |
Beta Was this translation helpful? Give feedback.
2 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Recursion with PostgreSQL Follow-Up 3 - Finding the Right Path
The mystery has finally been resolved, and I have learned something today.
In my previous article, I managed to get some really crazy performances in tree processing by using PostgreSQL recursive procedural-style function.
While recursive CTE was getting some severe performance degradation - even on a smaller scale:
It usually would eat all of my available space.
Thanks to @fatalmind (Markus Winand) comment, I got better clarification on what path and cycle really mean.
Example
So we have this test table:
Now, If we run recursive CTE for the
node_1
and if we include this statement:cycle id set is_cycle using path
, we will get this:Now, If we take a look at this random record on level 3 for example:
To be able to get to that node connection
| node_5 | node_7 |
on level 3, and by starting from thenode_1
- we would have to go through following path:| node_1 | node_8 |
- level 1: we start withnode_1
and next node isnode_8
(parent ofnode_1
).| node_8 | node_5 |
- level 2: next isnode_8
, because that was parent ofnode_1
| node_5 | node_7 |
- level 3:node_5
is the parent ofnode_8
at previous level 2, so now, we're atnode_5
at that is level 3.Following this path, we have calculated the entire path from
node_1
to node connection| node_5 | node_7 |
to be this following path:node_1
,node_8
,node_5
.And that is the correct path calculation.
A cycle path means simply that if you follow that path - you will end at the already visited node at that path, which means you will run in circles (endless loop).
And indeed, if we check out the cycle paths in this example:
We will notice some repeating nodes:
{(node_1),(node_8),(node_9),(node_9)}
-node_9
is repeating.{(node_1),(node_8),(node_10),(node_1)}
-node_1
is repeating.{(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
-node_1
is repeating.{(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
-node_1
is repeating.New Recursive Function
Given all that new understanding of what path and cycle actually means - I created a new version of my recursive function. This time to calculate and return the correct path and is cycle check:
This is a crucial piece of logic that allowed for correct path calculation:
p.level = _level - 1
- for the same parent node -p.parent_id = t.id
. That is what thisleft outer join lateral
does. It is alateral
join because we need to include main tablebig_tree_test t
into the subselect join query, and it isleft lateral
because the subquery may not find any results, and we want all frombig_tree_test t
array_append(l.path, t.id) as path
to calculate the right path.t.id = any(p.path) as is_cycle
.and not coalesce(l.is_cycle, false)
I created many different tests with many different datasets. And they all return the identical results to a recursive CTE version:
Perfomances
Finally, performances:
select_nodes_recursive_function
select_nodes_recursive_cte
big_tree_test
table. That is the number of tree connections.id
values (nodes) inbig_tree_test
table. That is the number of the unique tree nodes.Note: any larger dataset from these would yield a too big dataset, and it would end up in the same error for both of these functions:
ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp226.8386": No space left on device
Now, based on these results, we can conclude two things:
PostgreSQL Recursive CTE built-in functionality is indeed faster and better optimized.
Both functions are returning an insane number of records. That is because the further two nodes are distant from each other on the graph - there are more possible paths between the two nodes. If we take two nodes from opposite sides of the graphs, there will be a vast number of possible paths between those two nodes. Both of these functions return all possible paths between all existing nodes. That is why the number of function results is progressively growing as the number of nodes (records in the test table) grows.
Conslusion
Now, we see that PostgreSQL functionality for the same algorithm is indeed better optimized, as it should be.
However, they both return an insane number of records, that is, the total number of possible paths between all possible nodes. And that is, of course, very hardware intensive.
And it very well may be unnecessary.
I started this series of articles with a clear requirement - Let's write a function that will return all table names that are related to the table from a parameter on any possible level. In that context, path calculation and even cycle aren't really necessary.
And that small distinct requirement could mean milliseconds of processing versus hours of expensive processing.
A simple example, let's modify this recursive function
select_nodes_recursive_function
so that:In that case, we would add
limit 1
to the lateral subquery that calculates our path:And that is enough to return only one example path, not all. Let's see performances now and a bit bigger dataset:
select_nodes_recursive_function
select_nodes_recursive_cte
ERROR
ERROR
ERROR
So this small change with
limit 1
gave us:However, even that is not what is required in these functional requirements: Let's write a function that will return all table names that are related to the table from a parameter on any possible level.
For this, we simply need a unique node on any level. We basically don't even care about the level at all. So we may return a unique node on the first level we see that node.
And for that, the algorithm from the previous article is enough. I will paste it here now:
To repeat: this function returns unique nodes at any level of recursion and exits when all nodes have been visited. And that's all that we need.
Now, let's see some performances:
select_nodes_recursive_function
select_nodes_recursive_cte
ERROR
ERROR
ERROR
Of course, for this dataset, recursive CTE with cycle/path check breaks down.
But the thing is, we don't really need that. We can fulfill what is required by using simple recursion with
plpgsql
, and we will be able to process the data thousands of times faster.So, in the final conclusion, I still believe that the classical recursion approach using the
plpgsql
procedural approach can yield much better results. It is flexible enough to help us avoid unnecessary processing and to return exactly what we need from our tree structures. Calculation of all possible paths can lead to serious performance degradations, and it would be wise to avoid that if we can. This is one possibility how we can do that.Next time, I'll focus on the
plpgsql
function debugging and testing techniques.Beta Was this translation helpful? Give feedback.
All reactions