askrene: improving the MCF solver #7740

Lagrang3 · 2024-10-15T09:13:34Z

This PR aims to improve the MCF solver.

The idea is to have an API that solves specifically the MCF and related problem variations
at a lower level that do not have to care about the several details that are inherent in our modeling of the payment problem.

This approach allows me to easily change the algorithms in the backend.
Now we are using the same linearization of the channel cost function that it is used in renepay.
Future improvements could include the treatment of additional constraints and activation costs to arcs (https://github.com/Lagrang3/mcf-algorithms).

In the translation from a lightning network payment problem to a graph theoretical optimization problem
I have added the possibility to choose the minimum unit of account, not necessarily msat but could be any other number
depending on the total amount to pay.

rustyrussell

Minor tweaks.

Mainly concerned about performance regressions, and of course those test need re-enabling and fixing up (of course there will be differences, but the tests are there to ensure changes are deliberate at least!).

Would be interested in times of test_askrene.py::test_real_data before and after...

plugins/askrene/graph.c

plugins/askrene/graph.h

plugins/askrene/algorithm.c

plugins/askrene/test/run-mcf-large.c

common/amount.h

Lagrang3 · 2024-11-13T09:30:55Z

Minor tweaks.

Mainly concerned about performance regressions, and of course those test need re-enabling and fixing up (of course there will be differences, but the tests are there to ensure changes are deliberate at least!).

Would be interested in times of test_askrene.py::test_real_data before and after...

There weren't meant to be differences in the results, but there was a bug in the median computation. Once it is fixed then the results in tests test_getroutes_fee_fallback and test_real_data do not match the previous numbers.

7b3a75b

Lagrang3 · 2024-11-13T10:24:02Z

The fee_fallback test would fail after fixing the computation of the
median. Now we can restore it by making the probability cost factor
1000x higher than the ratio of the median. This shows how hard it is to
combine fee and probability costs and why is the current approach so
fragile. @rustyrussell

Lagrang3 · 2024-11-13T13:11:19Z

I went to double check the computation of the median fee and probability cost.
Before the fix "askrene: fix the median" 7b3a75b

Running test_getroutes_fee_fallback

feecost array: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, ]

probcost array: [0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 6.93e-01, 6.93e-01, 6.93e-01, 6.93e-01, 1.39e+00, 1.39e+00, 1.39e+00, 1.39e+00, 1.53e+00, 1.53e+00, 1.53e+00, 1.53e+00, 3.05e+00, 3.05e+00, 3.05e+00, 3.05e+00, 4.62e+00, 4.62e+00, 4.62e+00, 4.62e+00, 9.24e+00, 9.24e+00, 9.24e+00, 9.24e+00, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, ]

feecost median: 10000, probcost median: 4.620750

ratio: 2164.150763

after fixing it

feecost array: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, ]

probcost array: [0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.693113, 0.693113, 0.693113, 0.693113, 1.386156, 1.386156, 1.386156, 1.386156, 1.527075, 1.527075, 1.527075, 1.527075, 3.053997, 3.053997, 3.053997, 3.053997, 4.620750, 4.620750, 4.620750, 4.620750, 9.241038, 9.241038, 9.241038, 9.241038, ]

feecost median: 1, probcost median: 1.527075

ratio: 0.654846741

Lagrang3 · 2024-11-13T20:26:15Z

Measuring time for every minflow call while running test_real_data, eg.

	t1 = time_now().ts;
	minflow(...)
	t2 = time_now().ts;

There are several calls to minflow in that test case. I measured the time for every minflow and put all those values into a histogram. I see no big difference in performance.

Lagrang3 · 2024-11-13T20:32:17Z

Zooming in it looks there is a small gain in performance overall, with more calls returning in less than 100ms for the PR head.
This is just a rough comparison, though.

Lagrang3 · 2024-11-15T07:29:51Z

Rebased on top of the master branch to fix the conflicts.
After playing with the probability cost factor a little bit I have come back to have k = get_median_ratio(),
because it gives better results in test_real_data than k = 1000 * get_median_ratio().

Still have 3 more test cases that fail:

test_getroutes_fee_fallback: where the problem is again the probability cost factor, here the ratio of the median is not good enough,
test_askrene_fake_channeld: which also fails to me in the master branch,
test_real_biases: which shouldn't fail, I think I messed up something with the biases when I rebased, I am investigating this right now.

Lagrang3 · 2024-11-15T08:18:41Z

I have found one assertion in test_real_biases to be too strong

@@ -1192,7 +1191,7 @@ def test_real_biases(node_factory, bitcoind):
             if route2 != route:
                 # It should have avoided biassed channel
                 amount_after = amount_through_chan(chan, route2['routes'])
-                assert amount_after < amount_before
+                # assert amount_after < amount_before
                 num_changed[bias] += 1

Finding an alternative set of routes does not imply that the biassed channel has less amount allocated.
The assertion fails only on node 83.
One possible explanation is that there is a very good channel C to get to node 83, even if biassed by one factor.
The flow algorithm determines several routes to get to node 83 using that channel C, but the solution to the Min. Cost Flow
is not guaranteed to be unique, so we might end up with two different set of routes before and after applying the bias
where the liquidity that goes through C is always the same or even slightly different. But there is no guarantee
that the biased C will have strictly less allocation. It could be even be higher due to the limited precision of the base unit
and the way we handle base fees approximately.

As a matter of fact for this case (n=83) the algorithm starts obtaining alternative routes for bias=8 and above.

Lagrang3 · 2024-11-19T07:41:52Z

This is the updated performance histogram after changing the probability cost factor back to k = get_median_ratio().

Lagrang3 · 2024-11-19T18:06:00Z

Rebased

rustyrussell · 2024-11-21T03:33:22Z

The fee_fallback test would fail after fixing the computation of the median. Now we can restore it by making the probability cost factor 1000x higher than the ratio of the median. This shows how hard it is to combine fee and probability costs and why is the current approach so fragile. @rustyrussell

Actually, it shows how bad this generate graph is!

In particular, the reverse edges (which all have tiny fees but are irrelevant), warp the median so badly the result is bogus (I ended up having to modify the code to dump these out, to see what was happening). Fix is simple: disable reverse edges.

I also hardcoded an exception for the test_real_biases fail. It's actually the same-cost path, it's just that the slight change in values means we pay 1msat more due to fee rounding!

No, median sucks because of this:

   83.50%     0.00%  cln-askrene  cln-askrene           [.] listpeerchannels_done                                                    ▒
+   83.48%     0.00%  cln-askrene  cln-askrene           [.] do_getroutes                                                             ▒
+   83.47%     0.00%  cln-askrene  cln-askrene           [.] get_routes                                                               ▒
-   80.71%     0.00%  cln-askrene  cln-askrene           [.] minflow                                                                  ▒
   - 80.71% minflow                                                                                                                   ▒
      - 29.96% combine_cost_function                                                                                                  ▒
         - 28.45% get_median_ratio                                                                                                    ▒
            + 15.84% __GI___qsort_r                                                                                                   ▒
            + 11.26% __GI___qsort_r                                                                                                   ▒
      - 27.96% mcf_refinement                                                                                                         ▒
         - 26.75% dijkstra_nearest_sink                                                                                               ▒
            - 3.28% priorityqueue_update                                                                                              ▒
               + 2.63% gheap_restore_heap_after_item_increase                                                                         ▒
              2.77% node_adjacency_next                                                                                               ▒
              2.20% arc_head                                                                                                          ▒
            + 2.08% priorityqueue_pop                                                                                                 ▒
              0.70% arc_tail                                                                                                          ▒
      - 8.03% init_linear_network                                                                                                     ▒
           2.02% gossmap_nth_chan                                                                                                     ▒
         - 1.72% linearize_channel                                                                                                    ▒
              0.54% get_constraints                                                                                                   ▒
         - 1.68% graph_add_arc                                                                                                        ▒
              1.37% graph_push_outbound_arc                                                                                           ▒
      + 6.56% get_flow_paths                                                                                                          ▒
      + 6.40% simple_feasibleflow                                                                                                     ▒
        1.54% init_residual_network                                                                                                   ▒

We spend an annoying 3/8 of our time getting the median values! Maybe champagne problems, since the rest is so fast? But it annoys me!

Changelog-EXPERIMENTAL: askrene new graph abstraction Signed-off-by: Lagrang3 <[email protected]>

It is just a copy-paste of "dijkstra" but the name implies what it actually is. Not an implementation of minimum cost path Dijkstra algorithm, but a helper data structure. I keep the old "dijkstra.h/c" files for the moment to avoid breaking the current code. Changelog-EXPERIMENTAL: askrene: add priorityqueue Signed-off-by: Lagrang3 <[email protected]>

Changelog-EXPERIMENTAL: askrene: add graph algorithms module Signed-off-by: Lagrang3 <[email protected]>

Changelog-EXPERIMENTAL: askrene: add dijkstra algorithm Signed-off-by: Lagrang3 <[email protected]>

Changelog-EXPERIMENTAL: askrene: add algorithm to compute feasible flow Signed-off-by: Lagrang3 <[email protected]>

Changelog-EXPERIMENTAL: askrene: add a simple MCF solver Signed-off-by: Lagrang3 <[email protected]>

Changelog-None: askrene algorithm add helper for flow conservation Signed-off-by: Lagrang3 <[email protected]>

Add a new function to compute a MCF using a more general description of the problem. I call it mcf_refinement because it can start with a feasible flow (though this is not necessary) and adapt it to achieve optimality. Changelog-None: askrene: add a MCF refinement Signed-off-by: Lagrang3 <[email protected]>

Using zlib to read big test case file. Changelog-None: askrene: add bigger test for MCF Signed-off-by: Lagrang3 <[email protected]>

We use an arc "array" in the graph structure, but not all arc indexes correspond to real topological arcs. We must be careful when iterating through all arcs, and check if they are enabled before making operations on them. Changelog-None: askrene: fix bug, not all arcs exists Signed-off-by: Lagrang3 <[email protected]>

Changelog-none: askrene: add mcf_refinement to the public API Signed-off-by: Lagrang3 <[email protected]>

Changelog-none: askrene: use the new MCF solver Signed-off-by: Lagrang3 <[email protected]>

check the return value of scanf in askrene unit tests, Changelog-none: askrene: fix CI Signed-off-by: Lagrang3 <[email protected]>

The calculation of the median values of probability and fee cost in the linear approximation had a bug by counting on non-existing arcs. Changelog-none: askrene: fix the median Signed-off-by: Lagrang3 <[email protected]>

Changelog-none: add ratio ceil and floor operators on amount_msat Signed-off-by: Lagrang3 <[email protected]>

Changelog-none: askrene: add arbitrary precision flow unit Signed-off-by: Lagrang3 <[email protected]>

- use graph_max_num_arcs/nodes instead of tal_count in bound checks, - don't use ccan/lqueue, use instead a minimalistic queue implementation with an array, - add missing const qualifiers to temporary tal allocators, - check preconditions with assert, - remove inline specifier for static functions, Changelog-None Signed-off-by: Lagrang3 <[email protected]>

Rusty: "We don't generally use NDEBUG in our code" Instead use a compile time flag ASKRENE_UNITTEST to make checks on unit tests that we don't normally need on release code. Changelog-none Signed-off-by: Lagrang3 <[email protected]>

Rusty: "allocations don't fail" Signed-off-by: Lagrang3 <[email protected]>

Changelog-none Signed-off-by: Lagrang3 <[email protected]>

The fee_fallback test would fail after fixing the computation of the median. Now by we can restore it by making the probability cost factor 1000x higher than the ratio of the median. This shows how hard it is to combine fee and probability costs and why is the current approach so fragile. Changelog-None Signed-off-by: Lagrang3 <[email protected]>

The ratio of the median of the fees and probability cost is overall not a bad factor to combine these two features. This is what the test_real_data shows. Changelog-None Signed-off-by: Lagrang3 <[email protected]>

We can fix the median calc by removing the (unused) reverse edges. Also analyze the failure case in test_real_data: it's a real edge case, so hardcode that one as "ok". Signed-off-by: Rusty Russell <[email protected]>

rustyrussell · 2024-11-21T03:35:42Z

Trivial rebase and new commit...

Lagrang3 force-pushed the askrene-new-mcf-interface branch 10 times, most recently from 8512473 to 8f2f2f2 Compare October 18, 2024 16:01

Lagrang3 marked this pull request as ready for review October 22, 2024 07:54

Lagrang3 requested a review from rustyrussell October 22, 2024 07:54

Lagrang3 force-pushed the askrene-new-mcf-interface branch 2 times, most recently from b61480a to 917a48d Compare October 22, 2024 11:40

Lagrang3 added this to the v24.11 milestone Oct 29, 2024

rustyrussell reviewed Nov 12, 2024

View reviewed changes

Lagrang3 force-pushed the askrene-new-mcf-interface branch from 4971b9a to d24e60a Compare November 12, 2024 20:19

Lagrang3 force-pushed the askrene-new-mcf-interface branch from 31ac881 to 7181635 Compare November 15, 2024 07:22

Lagrang3 force-pushed the askrene-new-mcf-interface branch from 7181635 to f36720b Compare November 15, 2024 08:36

Lagrang3 force-pushed the askrene-new-mcf-interface branch from f36720b to f0dfb08 Compare November 19, 2024 18:05

Lagrang3 and others added 23 commits November 21, 2024 14:05

askrene: add a new graph abstraction

506045d

Changelog-EXPERIMENTAL: askrene new graph abstraction Signed-off-by: Lagrang3 <[email protected]>

askrene: add graph algorithms module

78a1583

Changelog-EXPERIMENTAL: askrene: add graph algorithms module Signed-off-by: Lagrang3 <[email protected]>

askrene: add dijkstra algorithm

15c3377

Changelog-EXPERIMENTAL: askrene: add dijkstra algorithm Signed-off-by: Lagrang3 <[email protected]>

askrene: add algorithm to compute feasible flow

eaca91b

Changelog-EXPERIMENTAL: askrene: add algorithm to compute feasible flow Signed-off-by: Lagrang3 <[email protected]>

askrene: add a simple MCF solver

6c339f7

Changelog-EXPERIMENTAL: askrene: add a simple MCF solver Signed-off-by: Lagrang3 <[email protected]>

askrene algorithm add helper for flow conservation

b36f355

Changelog-None: askrene algorithm add helper for flow conservation Signed-off-by: Lagrang3 <[email protected]>

askrene: add bigger test for MCF

52a5f78

Using zlib to read big test case file. Changelog-None: askrene: add bigger test for MCF Signed-off-by: Lagrang3 <[email protected]>

askrene: add mcf_refinement to the public API

6777925

Changelog-none: askrene: add mcf_refinement to the public API Signed-off-by: Lagrang3 <[email protected]>

askrene: use the new MCF solver

93ab27e

Changelog-none: askrene: use the new MCF solver Signed-off-by: Lagrang3 <[email protected]>

askrene: fix CI

fa8c233

check the return value of scanf in askrene unit tests, Changelog-none: askrene: fix CI Signed-off-by: Lagrang3 <[email protected]>

askrene: fix the median

b4e5bce

The calculation of the median values of probability and fee cost in the linear approximation had a bug by counting on non-existing arcs. Changelog-none: askrene: fix the median Signed-off-by: Lagrang3 <[email protected]>

add ratio ceil and floor operators on amount_msat

624edae

Changelog-none: add ratio ceil and floor operators on amount_msat Signed-off-by: Lagrang3 <[email protected]>

askrene: add arbitrary precision flow unit

4b842a1

Changelog-none: askrene: add arbitrary precision flow unit Signed-off-by: Lagrang3 <[email protected]>

askrene: add compiler flag ASKRENE_UNITTEST

963d59e

Rusty: "We don't generally use NDEBUG in our code" Instead use a compile time flag ASKRENE_UNITTEST to make checks on unit tests that we don't normally need on release code. Changelog-none Signed-off-by: Lagrang3 <[email protected]>

askrene: remove allocation checks

b925b66

Rusty: "allocations don't fail" Signed-off-by: Lagrang3 <[email protected]>

askrene: bugfix queue overflow

d822512

Changelog-none Signed-off-by: Lagrang3 <[email protected]>

Askrene: change median factor to 1.

e0790bb

The ratio of the median of the fees and probability cost is overall not a bad factor to combine these two features. This is what the test_real_data shows. Changelog-None Signed-off-by: Lagrang3 <[email protected]>

pytest: reenable askrene bias test.

6795ff3

We can fix the median calc by removing the (unused) reverse edges. Also analyze the failure case in test_real_data: it's a real edge case, so hardcode that one as "ok". Signed-off-by: Rusty Russell <[email protected]>

rustyrussell force-pushed the askrene-new-mcf-interface branch from f0dfb08 to 6795ff3 Compare November 21, 2024 03:35

rustyrussell merged commit 2c9023e into ElementsProject:master Nov 21, 2024
39 checks passed

Lagrang3 deleted the askrene-new-mcf-interface branch December 2, 2024 08:11

notmandatory mentioned this pull request Dec 7, 2024

Socratic Seminar #26 Topics NashBitDevs/nashbitdevs.org#58

Closed

askrene: improving the MCF solver #7740

askrene: improving the MCF solver #7740

Uh oh!

Conversation

Lagrang3 commented Oct 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rustyrussell left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Lagrang3 commented Nov 13, 2024

Uh oh!

Lagrang3 commented Nov 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lagrang3 commented Nov 13, 2024

Uh oh!

Lagrang3 commented Nov 13, 2024

Uh oh!

Lagrang3 commented Nov 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lagrang3 commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lagrang3 commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lagrang3 commented Nov 19, 2024

Uh oh!

Lagrang3 commented Nov 19, 2024

Uh oh!

rustyrussell commented Nov 21, 2024

Uh oh!

rustyrussell commented Nov 21, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Lagrang3 commented Oct 15, 2024 •

edited

Loading

Lagrang3 commented Nov 13, 2024 •

edited

Loading

Lagrang3 commented Nov 13, 2024 •

edited

Loading

Lagrang3 commented Nov 15, 2024 •

edited

Loading

Lagrang3 commented Nov 15, 2024 •

edited

Loading