Recursion with PostgreSQL Follow-Up 3 - Finding the Right Path #6

vbilopav · 2023-09-20T11:59:02Z

vbilopav
Sep 20, 2023
Maintainer

Recursion with PostgreSQL Follow-Up 3 - Finding the Right Path

The mystery has finally been resolved, and I have learned something today.

In my previous article, I managed to get some really crazy performances in tree processing by using PostgreSQL recursive procedural-style function.

While recursive CTE was getting some severe performance degradation - even on a smaller scale:
It usually would eat all of my available space.

Thanks to @fatalmind (Markus Winand) comment, I got better clarification on what path and cycle really mean.

Example

So we have this test table:

create table big_tree_test (id text, parent_id text);

   id    | parent_id | 
---------+-----------+
 node_1  | node_3    |
 node_1  | node_8    |
 node_4  | node_8    |
 node_5  | node_7    |
 node_6  | node_4    |
 node_6  | node_6    |
 node_6  | node_9    |
 node_7  | node_10   |
 node_8  | node_10   |
 node_8  | node_5    |
 node_8  | node_9    |
 node_9  | node_9    |
 node_10 | node_1    |

Now, If we run recursive CTE for the node_1 and if we include this statement: cycle id set is_cycle using path, we will get this:

with recursive _recursive_cte as (
    select
        id,
        parent_id,
        1 as level
    from
        big_tree_test
    where 
        id = 'node_1'
    
    union all

    select
        rec.parent_id as id,
        t.parent_id,
        rec.level + 1 as level
    from
        _recursive_cte rec
        inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r

   id    | parent_id | level | is_cycle |                      path
---------+-----------+-------+----------+-------------------------------------------------
| node_1 |    node_3 |     1 |    false | {(node_1)}
| node_1 |    node_8 |     1 |    false | {(node_1)}
| node_8 |    node_9 |     2 |    false | {(node_1),(node_8)}
| node_8 |    node_5 |     2 |    false | {(node_1),(node_8)}
| node_8 |   node_10 |     2 |    false | {(node_1),(node_8)}
| node_9 |    node_9 |     3 |    false | {(node_1),(node_8),(node_9)}
| node_5 |    node_7 |     3 |    false | {(node_1),(node_8),(node_5)}
|node_10 |    node_1 |     3 |    false | {(node_1),(node_8),(node_10)}
| node_9 |    node_9 |     4 |     true | {(node_1),(node_8),(node_9),(node_9)}
| node_7 |   node_10 |     4 |    false | {(node_1),(node_8),(node_5),(node_7)}
| node_1 |    node_8 |     4 |     true | {(node_1),(node_8),(node_10),(node_1)}
| node_1 |    node_3 |     4 |     true | {(node_1),(node_8),(node_10),(node_1)}
|node_10 |    node_1 |     5 |    false | {(node_1),(node_8),(node_5),(node_7),(node_10)}
| node_1 |    node_8 |     6 |     true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
| node_1 |    node_3 |     6 |     true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}

Now, If we take a look at this random record on level 3 for example:

   id    | parent_id | level | is_cycle |                      path
---------+-----------+-------+----------+-------------------------------------------------
| node_5 |    node_7 |     3 |    false | {(node_1),(node_8),(node_5)}

To be able to get to that node connection | node_5 | node_7 | on level 3, and by starting from the node_1 - we would have to go through following path:

| node_1 | node_8 | - level 1: we start with node_1 and next node is node_8 (parent of node_1).
| node_8 | node_5 | - level 2: next is node_8, because that was parent of node_1
| node_5 | node_7 | - level 3: node_5 is the parent of node_8 at previous level 2, so now, we're at node_5 at that is level 3.

Following this path, we have calculated the entire path from node_1 to node connection | node_5 | node_7 | to be this following path: node_1, node_8, node_5.

And that is the correct path calculation.

A cycle path means simply that if you follow that path - you will end at the already visited node at that path, which means you will run in circles (endless loop).

And indeed, if we check out the cycle paths in this example:

   id    | parent_id | level | is_cycle |                      path
---------+-----------+-------+----------+-------------------------------------------------
| node_9 |    node_9 |     4 |     true | {(node_1),(node_8),(node_9),(node_9)}
| node_1 |    node_8 |     4 |     true | {(node_1),(node_8),(node_10),(node_1)}
| node_1 |    node_3 |     4 |     true | {(node_1),(node_8),(node_10),(node_1)}
| node_1 |    node_8 |     6 |     true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}
| node_1 |    node_3 |     6 |     true | {(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)}

We will notice some repeating nodes:

{(node_1),(node_8),(node_9),(node_9)} - node_9 is repeating.
{(node_1),(node_8),(node_10),(node_1)} - node_1 is repeating.
{(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)} - node_1 is repeating.
{(node_1),(node_8),(node_5),(node_7),(node_10),(node_1)} - node_1 is repeating.

New Recursive Function

Given all that new understanding of what path and cycle actually means - I created a new version of my recursive function. This time to calculate and return the correct path and is cycle check:

select drop_function('public', 'select_nodes_recursive_function');

create or replace function select_nodes_recursive_function(
    _id text[],
    _level int = 1
)
returns table (
    id text,
    parent_id text,
    level int,
    is_cycle boolean,
    path text[]
)
language plpgsql
as 
$$
declare 
    _parents text[];
begin    
    if not exists(
        select 1 from information_schema.tables t
        where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
    ) then
        create temp table _result (
            id text,
            parent_id text,
            level int,
            is_cycle boolean,
            path text[]
        ) on commit drop;
    end if;

    with _insert_cte as (
        insert into _result
        select
            t.id,
            t.parent_id,
            _level,
            coalesce(l.is_cycle, false) as is_cycle,
            array_append(l.path, t.id) as path
        from
            big_tree_test t
            left outer join lateral (
                select p.path, t.id = any(p.path) as is_cycle
                from _result p
                where p.level = _level - 1 and p.parent_id = t.id
            ) l on true
        where 
            t.id = any(_id)
            and not coalesce(l.is_cycle, false)
        returning 
            _result.parent_id
    ) 
    select array_agg(p.parent_id)
    into _parents
    from _insert_cte p;
            
    if _parents is not null then
        perform select_nodes_recursive_function(_parents, _level + 1);
    end if;
    
    return query 
    select 
        t.id,
        t.parent_id,
        t.level,
        t.is_cycle,
        t.path
    from 
        _result t;
end;
$$;

-- helper function overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
    id text,
    parent_id text,
    level int,
    is_cycle boolean,
    path text[]
)
language sql
as 'select id, parent_id, level, is_cycle, path from select_nodes_recursive_function(array[_id]::text[])';

This is a crucial piece of logic that allowed for correct path calculation:

select
    t.id,
    t.parent_id,
    _level,
    coalesce(l.is_cycle, false) as is_cycle,
    array_append(l.path, t.id) as path
from
    big_tree_test t
    left outer join lateral (
        select p.path, t.id = any(p.path) as is_cycle
        from _result p
        where p.level = _level - 1 and p.parent_id = t.id
    ) l on true
where 
    t.id = any(_id)
    and not coalesce(l.is_cycle, false)

For every new node connection level, we can select the previous level p.level = _level - 1 - for the same parent node - p.parent_id = t.id. That is what this left outer join lateral does. It is a lateral join because we need to include main table big_tree_test t into the subselect join query, and it is left lateral because the subquery may not find any results, and we want all from big_tree_test t
And now we can add the current ID to this path - array_append(l.path, t.id) as path to calculate the right path.
That path is a cycle path if the current ID already exists in the previous path - t.id = any(p.path) as is_cycle.
And now, we can filter out cycles if we want with the expression and not coalesce(l.is_cycle, false)

I created many different tests with many different datasets. And they all return the identical results to a recursive CTE version:

create or replace function select_nodes_recursive_cte(
    _id text
)
returns table (
    id text,
    parent_id text,
    level int,
    is_cycle boolean,
    path record[]
)
language sql
as 
$$
with recursive _recursive_cte as (
    select
        id,
        parent_id,
        1 as level
    from
        big_tree_test
    where 
        id = _id
    
    union all

    select
        rec.parent_id as id,
        t.parent_id,
        rec.level + 1 as level
    from
        _recursive_cte rec
        inner join big_tree_test t on rec.parent_id = t.id
)
cycle id set is_cycle using path
select r.id, r.parent_id, r.level, is_cycle, path
from _recursive_cte r
where not is_cycle;
$$;

Perfomances

Finally, performances:

round	records ¹	unique nodes ²	total rows ³	`select_nodes_recursive_function`	`select_nodes_recursive_cte`
1	20	10	28	00:00:00.055	00:00:00.040
2	20	10	28	00:00:00.076	00:00:00.043
3	20	10	28	00:00:00.056	00:00:00.040
1	49	20	3674	00:00:00.106	00:00:00.086
2	49	20	3674	00:00:00.078	00:00:00.082
3	49	20	3674	00:00:00.071	00:00:00.074
1	101	40	1583446	00:00:49.447	00:00:31.218
2	101	40	1583446	00:00:46.344	00:00:31.118
3	101	40	1583446	00:00:44.252	00:00:31.099

1 - records: This is the number of total records in big_tree_test table. That is the number of tree connections.
2 - unique nodes: This is the number of unique id values (nodes) in big_tree_test table. That is the number of the unique tree nodes.
3 - total rows: This is the total number of records returned by those two functions. That is the number of unique paths between nodes.

Note: any larger dataset from these would yield a too big dataset, and it would end up in the same error for both of these functions: ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp226.8386": No space left on device

Now, based on these results, we can conclude two things:

PostgreSQL Recursive CTE built-in functionality is indeed faster and better optimized.
Both functions are returning an insane number of records. That is because the further two nodes are distant from each other on the graph - there are more possible paths between the two nodes. If we take two nodes from opposite sides of the graphs, there will be a vast number of possible paths between those two nodes. Both of these functions return all possible paths between all existing nodes. That is why the number of function results is progressively growing as the number of nodes (records in the test table) grows.

Conslusion

Now, we see that PostgreSQL functionality for the same algorithm is indeed better optimized, as it should be.

However, they both return an insane number of records, that is, the total number of possible paths between all possible nodes. And that is, of course, very hardware intensive.

And it very well may be unnecessary.

I started this series of articles with a clear requirement - Let's write a function that will return all table names that are related to the table from a parameter on any possible level. In that context, path calculation and even cycle aren't really necessary.

And that small distinct requirement could mean milliseconds of processing versus hours of expensive processing.

A simple example, let's modify this recursive function select_nodes_recursive_function so that:

It doesn't return all possible paths.
It returns all possible nodes (connections) with only one path, Any path, just not all paths.
We will keep our "is cycle" check, just in case we don't end up in an endless loop.

In that case, we would add limit 1 to the lateral subquery that calculates our path:

select
    t.id,
    t.parent_id,
    _level,
    coalesce(l.is_cycle, false) as is_cycle,
    array_append(l.path, t.id) as path
from
    big_tree_test t
    left outer join lateral (
        select p.path, t.id = any(p.path) as is_cycle
        from _result p
        where p.level = _level - 1 and p.parent_id = t.id
        limit 1
    ) l on true
where 
    t.id = any(_id)
    and not coalesce(l.is_cycle, false)

And that is enough to return only one example path, not all. Let's see performances now and a bit bigger dataset:

round	records	unique nodes	total rows	`select_nodes_recursive_function`	`select_nodes_recursive_cte`
1	305	100	9130	00:00:01.815	`ERROR`
2	305	100	9130	00:00:01.815	`ERROR`
3	305	100	9130	00:00:01.815	`ERROR`

So this small change with limit 1 gave us:

Unique nodes per recursion level with one example path.
Many times better performances than recursive CTE, which tried to yield a too-huge dataset (probably hundreds of millions) and ended up with ERROR (ran out of space).

However, even that is not what is required in these functional requirements: Let's write a function that will return all table names that are related to the table from a parameter on any possible level.

For this, we simply need a unique node on any level. We basically don't even care about the level at all. So we may return a unique node on the first level we see that node.

And for that, the algorithm from the previous article is enough. I will paste it here now:

create or replace function select_nodes_recursive_function(
    _id text[],
    _level int
)
returns table (
    id text,
    parent_id text,
    level int
)
language plpgsql
as 
$$
declare 
    _row record;
    _parents text[];
begin    
    if not exists(
        select 1 from information_schema.tables t
        where t.table_name = '_result' and table_type = 'LOCAL TEMPORARY'
    ) then
        create temp table _result (
            id text,
            parent_id text,
            level int
        ) on commit drop;
    end if;

    _id = array(select unnest(_id) except select distinct t.id from _result t);

    if _id = '{}' then
        return;
    end if;
    
    with _insert_cte as (
        insert into _result
        select
            t.id,
            t.parent_id,
            _level
        from
            big_tree_test t
        where 
            t.id = any(_id)
        returning 
            _result.parent_id
    ) 
    select array_agg(distinct p.parent_id)
    into _parents
    from _insert_cte p; 

    perform select_nodes_recursive_function(_parents, _level + 1);

    return query 
    select 
        t.id,
        t.parent_id,
        t.level
    from 
        _result t;
end;
$$;

-- helper overload
create or replace function select_nodes_recursive_function(_id text)
returns table (
    id text,
    parent_id text,
    level int
)
language sql
as 'select id, parent_id, level from select_nodes_recursive_function(array[_id]::text[], 1)';

To repeat: this function returns unique nodes at any level of recursion and exits when all nodes have been visited. And that's all that we need.

Now, let's see some performances:

round	records	unique nodes	total rows	`select_nodes_recursive_function`	`select_nodes_recursive_cte`
1	305	100	9130	00:00:00.055	`ERROR`
2	305	100	9130	00:00:00.048	`ERROR`
3	305	100	9130	00:00:00.049	`ERROR`

Of course, for this dataset, recursive CTE with cycle/path check breaks down.

But the thing is, we don't really need that. We can fulfill what is required by using simple recursion with plpgsql, and we will be able to process the data thousands of times faster.

So, in the final conclusion, I still believe that the classical recursion approach using the plpgsql procedural approach can yield much better results. It is flexible enough to help us avoid unnecessary processing and to return exactly what we need from our tree structures. Calculation of all possible paths can lead to serious performance degradations, and it would be wise to avoid that if we can. This is one possibility how we can do that.

Next time, I'll focus on the plpgsql function debugging and testing techniques.

fatalmind · 2023-09-25T11:55:44Z

fatalmind
Sep 25, 2023

I'm only now coming back to have a look at your original requirement (in my words: select the partition of connected tables, "partition" as in graph theory).

This is what I'd do, while most of the complexity comes from the requirement to also show the distance:

WITH RECURSIVE _recursive_cte
AS ( SELECT ARRAY['node_1']                                         AS  new
          , ARRAY[]                                    ::TEXT[]     AS  prev
          , ARRAY[]                                    ::INTEGER[]  AS  prev_level
          , 1                                                       AS  level
          , FALSE                                                   AS  done
      UNION ALL
     SELECT n.new                                                       AS  new
          , prev       || o.new                                         AS  prev
          , prev_level || ARRAY_FILL(level, ARRAY[COALESCE(CARDINALITY(o.new),0)])  AS  prev_level
          , level + 1                                                   AS  level
          , CARDINALITY(n.new) IS NULL                                  AS  done
       FROM _recursive_cte o
       LEFT JOIN LATERAL (SELECT ARRAY_AGG(DISTINCT parent_id) FILTER(WHERE NOT parent_id = ANY(o.prev || o.new)) AS  new
                            FROM big_tree_test
                           WHERE id  = ANY(o.new)
                         ) n ON TRUE
      WHERE NOT done
   )
SELECT ids.id, levels.level
  FROM _recursive_cte
 CROSS JOIN UNNEST(prev)       WITH ORDINALITY ids    (id   , ord)
       JOIN UNNEST(prev_level) WITH ORDINALITY levels (level, ord) ON ids.ord = levels.ord
WHERE done
;

Instead of using two arrays, a JSONB object {"node_id": level} working with new || old so that the old level is pertained might be the better option.

2 replies

vbilopav Sep 29, 2023
Maintainer Author

Thanks, I'll test this over the next weekend (I hope).

vbilopav Oct 1, 2023
Maintainer Author

Hi @fatalmind,

I finally managed to get some free time to get my hands on this (Sunday morning, while everyone is asleep). Anyway...

The first thing I noticed is that your query returns id and level. A few things:

My function will return a unique connection between nodes the first time it appears on any level (id, parent, level).
Your query returns a unique node that appeared on any level, and that is indeed the original requirement, so you are correct.

However, the level is pretty much unnecessary, so I modified both:

Your query to return only the unique node that at least once without level: SELECT ids.id FROM _recursive_cte
My function to return exactly the same result return query select distinct t.parent_id from _result t;

They are both indeed the same, returning the same results, and those results are indeed the original requirements. Also, both functions (I put your query into SQL function) are performing well on medium and smaller-sized datasets.

However, when the dataset grows over a certain threshold, recursive CTE starts to show serious performance degradation.

Example datasets with 10,000 unique nodes with 255,559 nodes connection:

--Round 1: Total rows: 1000 of 10000 Query complete 00:00:00.886
--Round 2: Total rows: 1000 of 10000 Query complete 00:00:00.885
--Round 3: Total rows: 1000 of 10000 Query complete 00:00:00.904
select * from select_nodes_recursive_function('node_1');

--Round 1: Total rows: 1000 of 10000 Query complete 00:00:45.279
--Round 2: Total rows: 1000 of 10000 Query complete 00:00:47.369
--Round 3: Total rows: 1000 of 10000 Query complete 00:00:46.311
select * from select_nodes_recursive_cte('node_1');

Function select_nodes_recursive_function without recursive CTE will start to degrade in performance after a few million unique nodes.

That, of course, doesn't mean that recursive CTE can't be optimized; it probably can, but I still think that function select_nodes_recursive_function is way more understandable and manageable. And, my goal is to get as many developers as possible to start writing SQL (even if it is a procedural one, nothing wrong with that).

I'll soon write another article to show how these functions can and should be tested and even TDD-ed (hopefully, not on Sunday morning again).

Thanks for your feedback, much appreciated.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Recursion with PostgreSQL Follow-Up 3 - Finding the Right Path #6

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Recursion with PostgreSQL Follow-Up 3 - Finding the Right Path #6

Uh oh!

Uh oh!

vbilopav Sep 20, 2023 Maintainer

Recursion with PostgreSQL Follow-Up 3 - Finding the Right Path

Example

New Recursive Function

Perfomances

Conslusion

Replies: 1 comment · 2 replies

Uh oh!

fatalmind Sep 25, 2023

Uh oh!

vbilopav Sep 29, 2023 Maintainer Author

Uh oh!

vbilopav Oct 1, 2023 Maintainer Author

vbilopav
Sep 20, 2023
Maintainer

Replies: 1 comment 2 replies

fatalmind
Sep 25, 2023

vbilopav Sep 29, 2023
Maintainer Author

vbilopav Oct 1, 2023
Maintainer Author