Conversation

@krishna-samy
Contributor

@krishna-samy krishna-samy commented Dec 26, 2025

Problem:

BGP path lookup currently uses an O(N) linear search through the path list (dest->info) for every incoming route update. In high-ECMP scenarios, each update requires iterating through the entire path list to check whether a path from that peer already exists.

This becomes a severe performance bottleneck in data center environments with high ECMP during large-scale route-update churn.

Solution:

Implement a per-table path_info hash using the typesafe hash library, and use it in bgp_update() for efficient lookup.

This hash lookup reduces CPU overhead by up to ~60%.
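
For readers unfamiliar with FRR's typesafe containers, below is a minimal sketch of the idea using PREDECL_HASH/DECLARE_HASH from lib/typesafe.h. It is illustrative, not the patch's exact code: the member and helper names (net, pi_itm, pi_hash, bgp_pi_hash_cmp, ...) are assumptions; only the hash name bgp_pi_hash follows the snippet quoted later in this review.

#include "typesafe.h"   /* PREDECL_HASH / DECLARE_HASH */
#include "jhash.h"
#include "prefix.h"

PREDECL_HASH(bgp_pi_hash);

/* Sketch of the additions to the existing struct bgp_path_info. */
struct bgp_path_info {
        struct bgp_dest *net;              /* owning dest; carries the prefix */
        struct peer *peer;                 /* advertising peer */
        struct bgp_pi_hash_item pi_itm;    /* typesafe hash linkage */
        /* ... remaining members elided ... */
};

/* Key = (prefix, peer): a peer contributes at most one path per prefix.
 * The compare function only has to disambiguate entries that collide. */
static int bgp_pi_hash_cmp(const struct bgp_path_info *a,
                           const struct bgp_path_info *b)
{
        int ret = prefix_cmp(bgp_dest_get_prefix(a->net),
                             bgp_dest_get_prefix(b->net));

        if (ret)
                return ret;
        if (a->peer == b->peer)
                return 0;
        return a->peer < b->peer ? -1 : 1;
}

static uint32_t bgp_pi_hash_key(const struct bgp_path_info *pi)
{
        /* Mix the prefix hash with the peer pointer. */
        return jhash_1word(prefix_hash_key(bgp_dest_get_prefix(pi->net)),
                           (uint32_t)(uintptr_t)pi->peer);
}

DECLARE_HASH(bgp_pi_hash, struct bgp_path_info, pi_itm,
             bgp_pi_hash_cmp, bgp_pi_hash_key);

With a struct bgp_pi_hash_head pi_hash member hung off struct bgp_table, the per-update O(N) walk over dest->info becomes an average O(1) lookup (again, hypothetical names):

        /* In bgp_update(), replacing the linear scan over dest->info. */
        struct bgp_path_info ref = { .net = dest, .peer = peer };
        struct bgp_path_info *pi = bgp_pi_hash_find(&table->pi_hash, &ref);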

@krishna-samy
Contributor Author

krishna-samy commented Dec 26, 2025

Test Results:

Overall scale: 32k routes × 128-way ECMP ≈ 4 million paths

32279 RIB nodes, using 4035 KiB of memory
4114785 BGP routes, using 565 MiB of memory
1 Static routes, using 152 bytes of memory
1 Packets, using 56 bytes of memory
32149 Adj-Out entries, using 2763 KiB of memory
129 Nexthop cache entries, using 29 KiB of memory
358 BGP attributes, using 106 KiB of memory
9 BGP AS-PATH entries, using 360 bytes of memory
8 BGP AS-PATH segments, using 192 bytes of memory
1 BGP ext-community entries, using 32 bytes of memory
131 peers, using 2622 KiB of memory
1 peer groups, using 64 bytes of memory

Without Fix:

> cat bgp-mem-wo-fix | head
System allocator statistics:
  Total heap allocated:  662 MiB
  Holding block headers: 82 MiB
  Used small blocks:     0 bytes
  Used ordinary blocks:  656 MiB
  Free small blocks:     9408 bytes
  Free ordinary blocks:  6037 KiB
  Ordinary blocks:       8056
  Small blocks:          178
  Holding blocks:        131

> cat bgp-mem-wo-fix | awk 'NF > 0 && $NF+0 > 1000000'
Buffer data                   :        1   4120        4120      577   2377240
Ring buffer                   :      260 variable  85732432      260  85732432
Stream                        :       14 variable    842752     2942  15824752
Route node                    :    64556    120     7747040    64556   7747040
BGP peer                      :      131  20496     2686024      131   2686024
BGP node                      :    32279    128     4390104    32279   4390104
BGP route                     :  4114785    144   625471416  4114785 625471416
BGP adj out                   :    32149     88     2830040    32149   2830040
BGP multipath info            :    32148     32     1289696    32148   1289696

Event CPU:

Event statistics for bgpd:

Showing statistics for pthread default
--------------------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
    1       1230.206     32035       38       192       38       193         0         0          0  R      zclient_read
    1          2.652        62       42       427       42       427         0         0          0  R      vtysh_read
    0        114.999        10    11499     20363    11500     20363         0         0          0   W     zclient_flush_data
    0       5318.575       700     7597     73560     7598     73560         0         0          0    TE   work_queue_run
   64         40.846     15424        2      4747        3      4748         0         0          0    T    (bgp_generate_updgrp_packets)
    0        533.971       180     2966     11819     2966     11819         0         0          0     E   bgp_handle_route_announcements_to_zebra
    0          0.040         7        5         7        6         8         0         0          0    T    update_group_refresh_default_originate_route_map
    0      20670.469      5198     3976    124041     3977    124041         0         0          0     E   bgp_process_packet
    1          0.016         1       16        16       17        17         0         0          0  R      vtysh_accept


Showing statistics for pthread BGP I/O thread
---------------------------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
    0        189.527     16704       11       135       12       136         0         0          0   W     bgp_process_writes
  128        106.145     14282        7       171        7       172         0         0          0  R      bgp_process_reads


Total Event statistics
-------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
  195      28207.446     84603      333    124041      333    124041         0         0          0  RWTEX  TOTAL

With Fix:


>  cat bgp-mem-with-fix | head
System allocator statistics:
  Total heap allocated:  789 MiB
  Holding block headers: 82 MiB
  Used small blocks:     0 bytes
  Used ordinary blocks:  784 MiB
  Free small blocks:     10736 bytes
  Free ordinary blocks:  5809 KiB
  Ordinary blocks:       34201
  Small blocks:          223
  Holding blocks:        131

>  cat bgp-mem-with-fix | awk 'NF > 0 && $NF+0 > 1000000'
Ring buffer                   :      260 variable  85732400      260  85732400
Stream                        :       14 variable    842752     1090  10727632
Route node                    :    64556    120     8000672    64556   8000672
Typed-hash bucket             :    32158 variable  66803224    32158  66803224          >>> ~10% increase
BGP peer                      :      131  20496     2686024      131   2686024
BGP node                      :    32279    144     4906680    32279   4906680
BGP route                     :  4114881    160   691300136  4114881 691300136          >>> ~10% increase
BGP adj out                   :    32147     88     2829240    32147   2829240
BGP multipath info            :    32148     32     1286640    32148   1286640

Event CPU: ~60% improvement (total event runtime drops from ~28,207 ms to ~11,790 ms)

Event statistics for bgpd:

Showing statistics for pthread default
--------------------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
    1          0.238         7       34        86       34        87         0         0          0  R      vtysh_accept
    0        437.779       153     2861     12610     2862     12611         0         0          0     E   bgp_handle_route_announcements_to_zebra
    0        286.284        21    13632     19355    13633     19356         0         0          0   W     zclient_flush_data
    0         49.167       640       76       476      129     28768         0         0          0     E   bgp_event
    2          3.274       128       25        74       26        75         0         0          0  R      bgp_accept
    0          0.299       128        2        24        2        24         0         0          0    T    (bgp_routeadv_timer)
    0         37.511     13437        2      3013        3      3013         0         0          0    T    (bgp_generate_updgrp_packets)
    0       4062.274      5922      685     40060      687     40066         0         0          0     E   bgp_process_packet
    0          0.043         5        8        21       10        22         0         0          0    T    update_group_refresh_default_originate_route_map
    1       1218.576     36683       33       227       33       228         0         0          0  R      zclient_read
    1         42.768       869       49      1713       50      1715         0         0          0  R      vtysh_read
    0          0.033         1       33        33       34        34         0         0          0     E   zclient_connect
    0       5372.033       413    13007     79805    13010     79933         0         0          0    TE   work_queue_run


Showing statistics for pthread BGP I/O thread
---------------------------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
    0        176.297     13952       12       126       13       184         0         0          0   W     bgp_process_writes
  128        100.120     13652        7       105        7       105         0         0          0  R      bgp_process_reads


Total Event statistics
-------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
  133      11789.620     86146      136     79805      137     79933         0         0          0  RWTEX  TOTAL                               >>> ~60% improvement in CPU

@krishna-samy krishna-samy force-pushed the krishna/bgp-path-lookup branch from 0093961 to 61983d3 on December 26, 2025 07:27
@krishna-samy
Copy link
Contributor Author

krishna-samy commented Dec 26, 2025

Flamegraphs showing the CPU cycles spent in bgp_update() during NLRI processing.

Without Fix:
[flamegraph image]

With Fix: bgp_update() is significantly improved
[flamegraph image]

@krishna-samy krishna-samy force-pushed the krishna/bgp-path-lookup branch from 61983d3 to b2d8f2c on December 26, 2025 07:34
bgpd/bgp_route.c Outdated
/*
 * Initialize typesafe hash on first path add.
 * XCALLOC zeroes dest, so count is 0 initially.
 */
if (bgp_pi_hash_count(&dest->pi_hash) == 0)
Member

This approach seems fragile and prone to future problems to me. When we create the dest, why not init there?

Member

What do you think of having the hash stuck off the table w/ the prefix as part of the key instead? Wouldn't this reduce the memory footprint?

Contributor Author

@krishna-samy krishna-samy Dec 29, 2025

What do you think of having the hash stuck off the table w/ the prefix as part of the key instead? Wouldn't this reduce the memory footprint?

Do you mean having a global hash per AFI/SAFI instead of per dest?
In that case, yes, we would save the per-dest hash_head, which is calculated below.
Let's assume a high-route scenario: 1.2 million paths with 4-way ECMP
--> 1.2M × 16 bytes (hash_head) = 19,200,000 bytes ≈ 18 MiB
But in a low-route (16k/32k), high-ECMP (256) scenario, the memory footprint would be pretty much the same, given how buckets are allocated.

So we consume around ~130 MB with the existing per-dest hash, and would save ~18 MB by moving to a global hash.
I have a couple of questions in that case:

  • Larger bucket arrays are needed, since the hash must accommodate ALL paths in the table. Would ~5M entries in one hash be OK?
  • All paths for a prefix must be removed when deleting/cleaning up. Do we see any complexity there? (See the sketch below.)
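
On that clean-up point, a sketch under the same illustrative names as the earlier example (bgp_pi_hash_del and pi_hash are assumptions, not the patch's code): since all paths for a prefix stay chained on dest->info, a prefix delete can walk that short list and unlink each entry from the per-table hash, keeping the cost proportional to the paths of that prefix rather than the whole table.

        /* Hypothetical clean-up on prefix withdrawal: dest->info still
         * chains every path for this prefix, so removal from the
         * per-table hash is O(paths for this prefix), not O(table). */
        for (pi = bgp_dest_get_bgp_path_info(dest); pi; pi = pi->next)
                bgp_pi_hash_del(&table->pi_hash, pi);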

Contributor Author

This approach seems fragile and prone to future problems to me. When we create the dest, why not init there?

There was a dependency issue when doing this in table.c, but I got the point. Will revisit this.
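
A sketch of the reviewer's suggestion, assuming the per-table layout the patch eventually settled on (names as in the earlier illustrative sketch; the actual init point in the merged code may differ):

        /* Hypothetical: initialize the hash once at table creation,
         * e.g. in bgp_table_init(), instead of lazily on first path add. */
        struct bgp_table *table = XCALLOC(MTYPE_BGP_TABLE,
                                          sizeof(struct bgp_table));
        /* ... existing table setup elided ... */
        bgp_pi_hash_init(&table->pi_hash);

A matching bgp_pi_hash_fini(&table->pi_hash) when the table is freed would keep the lifetimes symmetric.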

bgpd/bgp_route.c Outdated
}
}

/* Show BGP path info hash keys */
Member

I don't understand the purpose of this new DEFUN. Can you elaborate on why this is needed and how it differs from just showing the prefix with its paths?

Contributor Author

I don't understand the purpose of this new DEFUN. Can you elaborate on why this is needed and how it differs from just showing the prefix with its paths?

Ack. I added this for debugging purposes, but we can display the same information in the existing show command instead.

@vayetze vayetze requested a review from choppsv1 January 6, 2026 16:52
@krishna-samy krishna-samy force-pushed the krishna/bgp-path-lookup branch from b2d8f2c to d6490dc on January 8, 2026 07:23
@github-actions github-actions bot added the rebase (PR needs rebase) label Jan 8, 2026
@krishna-samy
Contributor Author

krishna-samy commented Jan 8, 2026

Below is the memory footprint with the per-table hash implementation

Memory without any hash:

  Total heap allocated:  2065 MiB         >>> 2065 MiB
  Holding block headers: 221 MiB
  Used small blocks:     0 bytes
  Used ordinary blocks:  2021 MiB
  Free small blocks:     4864 bytes
  Free ordinary blocks:  44 MiB
  Ordinary blocks:       5308
  Small blocks:          113
  Holding blocks:        355

Memory with per-table hash: ~7% increase

  Total heap allocated:  2205 MiB          >>> 2205 MiB
  Holding block headers: 265 MiB
  Used small blocks:     0 bytes
  Used ordinary blocks:  2166 MiB
  Free small blocks:     3344 bytes
  Free ordinary blocks:  40 MiB
  Ordinary blocks:       3553
  Small blocks:          85
  Holding blocks:        323

CPU performance improves by 50-60%, as shared in the comments above.

@choppsv1
Contributor

choppsv1 commented Jan 8, 2026

Below is the memory footprint with the new patch (per-table hash implementation). Memory without any hash:

  Total heap allocated:  2065 MiB
  Used ordinary blocks:  2021 MiB
  Free ordinary blocks:  44 MiB
  Ordinary blocks:       5308

Memory per table hash: ~7% increase with per-table hash

Total heap allocated:  2205 MiB
 Used ordinary blocks:  2166 MiB
 Free ordinary blocks:  40 MiB
 Ordinary blocks:       3553

Memory with per dest hash: more than 10% increase with per-dest hash

  Total heap allocated:  2324 MiB
  Used ordinary blocks:  2283 MiB
  Free ordinary blocks:  41 MiB
  Ordinary blocks:       10351

I don't know what these blocks are, but 3 very different numbers of "Ordinary Blocks", and yet the actual memory usage seems close to identical -- seems a bit fishy.

@krishna-samy krishna-samy force-pushed the krishna/bgp-path-lookup branch from d6490dc to 3ef3edd on January 8, 2026 11:48
Problem:
--------
BGP path lookup currently uses O(N) linear search through the list
(dest->info) for every incoming route update. In high-ECMP scenarios,
each update requires iterating through the entire path list to check
if a path from that peer already exists.

This becomes a severe performance bottleneck in data center environments
with high ECMP during large-scale route-update churn.

Solution:
---------
Implement a per-table path_info hash using the typesafe hash library,
and use it in bgp_update() for efficient lookup.

This hash lookup reduces CPU overhead by up to ~60%.

Signed-off-by: Krishnasamy R <[email protected]>
Signed-off-by: Krishnasamy <[email protected]>
@krishna-samy krishna-samy force-pushed the krishna/bgp-path-lookup branch from 3ef3edd to b45c53b on January 8, 2026 11:51
@krishna-samy
Contributor Author

Below is the memory footprint with the new patch (per-table hash implementation). Memory without any hash:

  Total heap allocated:  2065 MiB
  Used ordinary blocks:  2021 MiB
  Free ordinary blocks:  44 MiB
  Ordinary blocks:       5308

Memory per table hash: ~7% increase with per-table hash

Total heap allocated:  2205 MiB
 Used ordinary blocks:  2166 MiB
 Free ordinary blocks:  40 MiB
 Ordinary blocks:       3553

Memory with per dest hash: more than 10% increase with per-dest hash

  Total heap allocated:  2324 MiB
  Used ordinary blocks:  2283 MiB
  Free ordinary blocks:  41 MiB
  Ordinary blocks:       10351

I don't know what these blocks are, but 3 very different numbers of "Ordinary Blocks", and yet the actual memory usage seems close to identical -- seems a bit fishy.

I think the comment should have been a bit clearer; just updated it now.
Basically, the real goal is the CPU improvement during BGP path lookup. Since we are adding a new hash, I tried to show the additional memory footprint (Total heap allocated) as well.
My comments above have more detail on what we achieve from the CPU perspective.

Member

@donaldsharp donaldsharp left a comment

LGTM

@choppsv1
Contributor

choppsv1 commented Jan 9, 2026

I'm good with adding hashing; I didn't want to gate this change. I was just noticing that the numbers didn't add up :)

Contributor

@choppsv1 choppsv1 left a comment

hashing for lookups, good idea.

Member

@riw777 riw777 left a comment

looks good

@riw777 riw777 merged commit 87d33a0 into FRRouting:master Jan 9, 2026
18 checks passed