[BUG] load_balancer algorithm weaknesses #3297

@spacetourist

OpenSIPS version you are running

version: opensips 3.2.11 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 14e4858f4
main.c compiled on 00:00:00 May 27 2021 with gcc 11

Describe the bug

The load_balancer algorithm is not correctly accounting for existing calls on the destinations.

When selecting a destination, the algorithm computes the availability score (av) in one of the following two ways. In my case I'm using the relative flag, so we include the CPU score:

if( flags & LB_FLAGS_RELATIVE ) {
    if( dst->rmap[l].max_load )
        av = 100 - (100 * lb_dlg_binds.get_profile_size(res[k]->profile, &dst->profile_id) / dst->rmap[l].max_load);
} else {
    av = dst->rmap[l].max_load - lb_dlg_binds.get_profile_size(res[k]->profile, &dst->profile_id);
}

The max_load value is derived from the FreeSWITCH HEARTBEAT data using this code:

if (psz < dst->fs_sock->stats.max_sess) {
    dst->rmap[ri].max_load =
        (dst->fs_sock->stats.id_cpu / (float)100) *
        (dst->fs_sock->stats.max_sess -
            (dst->fs_sock->stats.sess - psz));
} else {
    dst->rmap[ri].max_load =
        (dst->fs_sock->stats.id_cpu / (float)100) *
        dst->fs_sock->stats.max_sess;
}

This means that the max_load score is the maximum number of sessions configured on the server, minus the sessions that already exist there other than the dialogs this OpenSIPS instance has allocated to it (sess - psz), all reduced by the CPU availability score.

This calculation means that the dialog profile counts are included twice: once in the max_load calculation and again in the destination selection calculation. Most of those sessions will be double counted, as the Session-Count value in the heartbeats already includes these dialogs.
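To see where the profile size enters each step, the two calculations can be reproduced in isolation. This is a minimal sketch, not the module's code: the struct and function names are hypothetical, max_load is assumed to be an unsigned integer, and only the arithmetic is taken from the snippets quoted above.

```c
#include <assert.h>

/* Hypothetical stand-in for the heartbeat stats referenced above. */
struct fs_stats_sketch {
    float id_cpu;   /* idle CPU percentage, 0..100 */
    int   sess;     /* Session-Count from the heartbeat */
    int   max_sess; /* configured maximum sessions */
};

/* Same arithmetic as the quoted max_load code; psz is the profile size. */
static unsigned int max_load_from_heartbeat(const struct fs_stats_sketch *st,
                                            int psz)
{
    assert(st->max_sess > 0 && psz >= 0);
    if (psz < st->max_sess)
        return (unsigned int)((st->id_cpu / (float)100) *
                              (st->max_sess - (st->sess - psz)));
    return (unsigned int)((st->id_cpu / (float)100) * st->max_sess);
}

/* Same arithmetic as the non-relative branch of the selection code. */
static int absolute_availability(unsigned int max_load, int psz)
{
    return (int)max_load - psz;
}
```

With id_cpu = 100, sess = 400, max_sess = 1000 and psz = 300 this gives a max_load of 900 and an availability of 600; with id_cpu = 50 the same inputs give 450 and 150.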

Furthermore, when no calls have been allocated to a destination from this OpenSIPS instance, every destination gets the same score of 100, as the calculation is inevitably 100 - (100 * 0 / max_load). In my environment this means that an instance with 100 available channels is just as likely to be allocated the call as an instance with 1000 channels available.
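The equal-score case can be checked with a small standalone version of the relative calculation. This is a sketch under assumptions: the function name is hypothetical, and it requires a non-zero max_load, whereas the quoted module code simply skips the calculation in that case.

```c
#include <assert.h>

/* Relative availability: 100 minus the percentage of max_load in use. */
static int relative_availability(unsigned int max_load,
                                 unsigned int profile_size)
{
    assert(max_load > 0);
    return 100 - (int)(100 * profile_size / max_load);
}
```

With an empty profile, relative_availability(100, 0) and relative_availability(1000, 0) both return 100, so the two destinations are indistinguishable. The scores only diverge once dialogs exist: with 50 dialogs each they return 50 and 95 respectively.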

Expected behavior

My goal is to get proportional load balancing working with this module, such that incoming calls are spread evenly over all the available FreeSWITCH instances of varying sizes.

In my opinion a better calculation would account for the existing sessions on the destinations as well as any that have been allocated since the last heartbeat. Something like:

100 - ( 100 * ( FS_Session_Count + Profile_dialogs_since_last_heartbeat ) / FS_Max_Sessions ) - CPU_load

This would need the max sessions and current session data to be added to the lb_resource_map struct to make it available to the balancer. I'm not clear regarding how easy it would be to get the dialogs added to the profile since the last heartbeat was processed for that instance - perhaps that's too complex.
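One possible reading of the formula above, as a minimal sketch: it assumes the CPU load is a percentage subtracted after the session ratio, and that the count of dialogs added since the last heartbeat is somehow available. The function and parameter names are hypothetical and do not exist in the module.

```c
#include <assert.h>

/* Score in percent: free session capacity minus CPU load. */
static int proposed_score(int fs_session_count,
                          int dialogs_since_heartbeat,
                          int fs_max_sessions,
                          int cpu_load_pct)
{
    assert(fs_max_sessions > 0);
    int used_pct = 100 * (fs_session_count + dialogs_since_heartbeat)
                   / fs_max_sessions;
    return 100 - used_pct - cpu_load_pct;
}
```

Under this reading, a 1000-session instance with 400 sessions, 50 new dialogs and 20% CPU load scores 35, the same as a 100-session instance with 40 sessions, 5 new dialogs and 20% CPU load, so equally loaded instances of different sizes score equally.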

Modifying the max_load calculation in favour of a system utilisation figure would probably be enough of an improvement for my situation:

/* percentage of free channels, scaled by the idle CPU score */
dst->rmap[ri].utilisation = (dst->fs_sock->stats.id_cpu / (float)100) *
   ( 100 * (dst->fs_sock->stats.max_sess - dst->fs_sock->stats.sess) / dst->fs_sock->stats.max_sess );

The new value utilisation should equal the percentage of channels still available, reduced by the CPU utilisation. This would remove the dialog profiles from the calculation and allow the destinations to be scored proportionally. As the capacity figures are reduced to a load percentage, this might work well with the random destination flag, such that we converge on an eventually even distribution.
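A standalone sketch of this figure, for experimenting with numbers. It assumes the subtraction is meant to be grouped before the division, so the expression yields the share of free channels, and it uses floating point throughout. The function name is hypothetical, and the utilisation field does not currently exist in lb_resource_map.

```c
#include <assert.h>

/* Percentage of free channels, scaled by the idle CPU percentage. */
static float proposed_utilisation(float id_cpu, int sess, int max_sess)
{
    assert(max_sess > 0);
    return (id_cpu / 100.0f) *
           (100.0f * (float)(max_sess - sess) / (float)max_sess);
}
```

For example, an instance with 400 of 1000 sessions in use scores 60 at 100% idle CPU and 30 at 50% idle CPU, and an instance with 40 of 100 sessions in use also scores 60 at 100% idle CPU.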

--

Please note this report is a work in progress as I gather information on the module and is meant as a discussion point rather than a call for a solution at this time!
