Skip to content

Conversation

@Lagrang3
Copy link
Collaborator

@Lagrang3 Lagrang3 commented Oct 15, 2024

This PR aims to improve the MCF solver.

The idea is to have an API that solves specifically the MCF and related problem variations
at a lower level that do not have to care about the several details that are inherent in our modeling of the payment problem.

This approach allows me to easily change the algorithms in the backend.
Now we are using the same linearization of the channel cost function that it is used in renepay.
Future improvements could include the treatment of additional constraints and activation costs to arcs (https://github.com/Lagrang3/mcf-algorithms).

In the translation from a lightning network payment problem to a graph theoretical optimization problem
I have added the possibility to choose the minimum unit of account, not necessarily msat but could be any other number
depending on the total amount to pay.

@Lagrang3 Lagrang3 force-pushed the askrene-new-mcf-interface branch 10 times, most recently from 8512473 to 8f2f2f2 Compare October 18, 2024 16:01
@Lagrang3 Lagrang3 marked this pull request as ready for review October 22, 2024 07:54
@Lagrang3 Lagrang3 requested a review from rustyrussell October 22, 2024 07:54
@Lagrang3 Lagrang3 force-pushed the askrene-new-mcf-interface branch 2 times, most recently from b61480a to 917a48d Compare October 22, 2024 11:40
@Lagrang3 Lagrang3 added this to the v24.11 milestone Oct 29, 2024
Copy link
Contributor

@rustyrussell rustyrussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor tweaks.

Mainly concerned about performance regressions, and of course those test need re-enabling and fixing up (of course there will be differences, but the tests are there to ensure changes are deliberate at least!).

Would be interested in times of test_askrene.py::test_real_data before and after...

@Lagrang3 Lagrang3 force-pushed the askrene-new-mcf-interface branch from 4971b9a to d24e60a Compare November 12, 2024 20:19
@Lagrang3
Copy link
Collaborator Author

Minor tweaks.

Mainly concerned about performance regressions, and of course those test need re-enabling and fixing up (of course there will be differences, but the tests are there to ensure changes are deliberate at least!).

Would be interested in times of test_askrene.py::test_real_data before and after...

There weren't meant to be differences in the results, but there was a bug in the median computation. Once it is fixed then the results in tests test_getroutes_fee_fallback and test_real_data do not match the previous numbers.

7b3a75b

@Lagrang3
Copy link
Collaborator Author

Lagrang3 commented Nov 13, 2024

The fee_fallback test would fail after fixing the computation of the
median. Now we can restore it by making the probability cost factor
1000x higher than the ratio of the median. This shows how hard it is to
combine fee and probability costs and why is the current approach so
fragile. @rustyrussell

@Lagrang3
Copy link
Collaborator Author

I went to double check the computation of the median fee and probability cost.
Before the fix "askrene: fix the median" 7b3a75b

Running test_getroutes_fee_fallback

feecost array: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, 9223372036854775807, ]

probcost array: [0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 6.93e-01, 6.93e-01, 6.93e-01, 6.93e-01, 1.39e+00, 1.39e+00, 1.39e+00, 1.39e+00, 1.53e+00, 1.53e+00, 1.53e+00, 1.53e+00, 3.05e+00, 3.05e+00, 3.05e+00, 3.05e+00, 4.62e+00, 4.62e+00, 4.62e+00, 4.62e+00, 9.24e+00, 9.24e+00, 9.24e+00, 9.24e+00, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, 1.80e+308, ]

feecost median: 10000, probcost median: 4.620750

ratio: 2164.150763

after fixing it

feecost array: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, ]

probcost array: [0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.693113, 0.693113, 0.693113, 0.693113, 1.386156, 1.386156, 1.386156, 1.386156, 1.527075, 1.527075, 1.527075, 1.527075, 3.053997, 3.053997, 3.053997, 3.053997, 4.620750, 4.620750, 4.620750, 4.620750, 9.241038, 9.241038, 9.241038, 9.241038, ]

feecost median: 1, probcost median: 1.527075

ratio: 0.654846741

@Lagrang3
Copy link
Collaborator Author

Measuring time for every minflow call while running test_real_data, eg.

	t1 = time_now().ts;
	minflow(...)
	t2 = time_now().ts;

There are several calls to minflow in that test case. I measured the time for every minflow and put all those values into a histogram. I see no big difference in performance.
Screenshot from 2024-11-13 21-22-44

@Lagrang3
Copy link
Collaborator Author

Lagrang3 commented Nov 13, 2024

Zooming in it looks there is a small gain in performance overall, with more calls returning in less than 100ms for the PR head.
This is just a rough comparison, though.
Screenshot from 2024-11-13 21-31-46

@Lagrang3 Lagrang3 force-pushed the askrene-new-mcf-interface branch from 31ac881 to 7181635 Compare November 15, 2024 07:22
@Lagrang3
Copy link
Collaborator Author

Lagrang3 commented Nov 15, 2024

Rebased on top of the master branch to fix the conflicts.
After playing with the probability cost factor a little bit I have come back to have k = get_median_ratio(),
because it gives better results in test_real_data than k = 1000 * get_median_ratio().

Still have 3 more test cases that fail:

  • test_getroutes_fee_fallback: where the problem is again the probability cost factor, here the ratio of the median is not good enough,
  • test_askrene_fake_channeld: which also fails to me in the master branch,
  • test_real_biases: which shouldn't fail, I think I messed up something with the biases when I rebased, I am investigating this right now.

@Lagrang3
Copy link
Collaborator Author

Lagrang3 commented Nov 15, 2024

I have found one assertion in test_real_biases to be too strong

@@ -1192,7 +1191,7 @@ def test_real_biases(node_factory, bitcoind):
             if route2 != route:
                 # It should have avoided biassed channel
                 amount_after = amount_through_chan(chan, route2['routes'])
-                assert amount_after < amount_before
+                # assert amount_after < amount_before
                 num_changed[bias] += 1

Finding an alternative set of routes does not imply that the biassed channel has less amount allocated.
The assertion fails only on node 83.
One possible explanation is that there is a very good channel C to get to node 83, even if biassed by one factor.
The flow algorithm determines several routes to get to node 83 using that channel C, but the solution to the Min. Cost Flow
is not guaranteed to be unique, so we might end up with two different set of routes before and after applying the bias
where the liquidity that goes through C is always the same or even slightly different. But there is no guarantee
that the biased C will have strictly less allocation. It could be even be higher due to the limited precision of the base unit
and the way we handle base fees approximately.

As a matter of fact for this case (n=83) the algorithm starts obtaining alternative routes for bias=8 and above.

@Lagrang3 Lagrang3 force-pushed the askrene-new-mcf-interface branch from 7181635 to f36720b Compare November 15, 2024 08:36
@Lagrang3
Copy link
Collaborator Author

This is the updated performance histogram after changing the probability cost factor back to k = get_median_ratio().

Screenshot from 2024-11-19 08-40-18

@Lagrang3 Lagrang3 force-pushed the askrene-new-mcf-interface branch from f36720b to f0dfb08 Compare November 19, 2024 18:05
@Lagrang3
Copy link
Collaborator Author

Rebased

@rustyrussell
Copy link
Contributor

The fee_fallback test would fail after fixing the computation of the median. Now we can restore it by making the probability cost factor 1000x higher than the ratio of the median. This shows how hard it is to combine fee and probability costs and why is the current approach so fragile. @rustyrussell

Actually, it shows how bad this generate graph is!

In particular, the reverse edges (which all have tiny fees but are irrelevant), warp the median so badly the result is bogus (I ended up having to modify the code to dump these out, to see what was happening). Fix is simple: disable reverse edges.

I also hardcoded an exception for the test_real_biases fail. It's actually the same-cost path, it's just that the slight change in values means we pay 1msat more due to fee rounding!

No, median sucks because of this:

   83.50%     0.00%  cln-askrene  cln-askrene           [.] listpeerchannels_done                                                    ▒
+   83.48%     0.00%  cln-askrene  cln-askrene           [.] do_getroutes                                                             ▒
+   83.47%     0.00%  cln-askrene  cln-askrene           [.] get_routes                                                               ▒
-   80.71%     0.00%  cln-askrene  cln-askrene           [.] minflow                                                                  ▒
   - 80.71% minflow                                                                                                                   ▒
      - 29.96% combine_cost_function                                                                                                  ▒
         - 28.45% get_median_ratio                                                                                                    ▒
            + 15.84% __GI___qsort_r                                                                                                   ▒
            + 11.26% __GI___qsort_r                                                                                                   ▒
      - 27.96% mcf_refinement                                                                                                         ▒
         - 26.75% dijkstra_nearest_sink                                                                                               ▒
            - 3.28% priorityqueue_update                                                                                              ▒
               + 2.63% gheap_restore_heap_after_item_increase                                                                         ▒
              2.77% node_adjacency_next                                                                                               ▒
              2.20% arc_head                                                                                                          ▒
            + 2.08% priorityqueue_pop                                                                                                 ▒
              0.70% arc_tail                                                                                                          ▒
      - 8.03% init_linear_network                                                                                                     ▒
           2.02% gossmap_nth_chan                                                                                                     ▒
         - 1.72% linearize_channel                                                                                                    ▒
              0.54% get_constraints                                                                                                   ▒
         - 1.68% graph_add_arc                                                                                                        ▒
              1.37% graph_push_outbound_arc                                                                                           ▒
      + 6.56% get_flow_paths                                                                                                          ▒
      + 6.40% simple_feasibleflow                                                                                                     ▒
        1.54% init_residual_network                                                                                                   ▒

We spend an annoying 3/8 of our time getting the median values! Maybe champagne problems, since the rest is so fast? But it annoys me!

Lagrang3 and others added 23 commits November 21, 2024 14:05
Changelog-EXPERIMENTAL: askrene new graph abstraction

Signed-off-by: Lagrang3 <[email protected]>
It is just a copy-paste of "dijkstra" but the name
implies what it actually is. Not an implementation of minimum cost path
Dijkstra algorithm, but a helper data structure.
I keep the old "dijkstra.h/c" files for the moment to avoid breaking the
current code.

Changelog-EXPERIMENTAL: askrene: add priorityqueue

Signed-off-by: Lagrang3 <[email protected]>
Changelog-EXPERIMENTAL: askrene: add graph algorithms module

Signed-off-by: Lagrang3 <[email protected]>
Changelog-EXPERIMENTAL: askrene: add dijkstra algorithm

Signed-off-by: Lagrang3 <[email protected]>
Changelog-EXPERIMENTAL: askrene: add algorithm to compute feasible flow

Signed-off-by: Lagrang3 <[email protected]>
Changelog-EXPERIMENTAL: askrene: add a simple MCF solver

Signed-off-by: Lagrang3 <[email protected]>
Changelog-None: askrene algorithm add helper for flow conservation

Signed-off-by: Lagrang3 <[email protected]>
Add a new function to compute a MCF using a more general description of
the problem. I call it mcf_refinement because it can start with a
feasible flow (though this is not necessary) and adapt it to achieve
optimality.

Changelog-None: askrene: add a MCF refinement

Signed-off-by: Lagrang3 <[email protected]>
Using zlib to read big test case file.

Changelog-None: askrene: add bigger test for MCF

Signed-off-by: Lagrang3 <[email protected]>
We use an arc "array" in the graph structure, but not all arc indexes
correspond to real topological arcs. We must be careful when iterating
through all arcs, and check if they are enabled before making operations
on them.

Changelog-None: askrene: fix bug, not all arcs exists

Signed-off-by: Lagrang3 <[email protected]>
Changelog-none: askrene: add mcf_refinement to the public API

Signed-off-by: Lagrang3 <[email protected]>
Changelog-none: askrene: use the new MCF solver

Signed-off-by: Lagrang3 <[email protected]>
check the return value of scanf in askrene unit tests,

Changelog-none: askrene: fix CI

Signed-off-by: Lagrang3 <[email protected]>
The calculation of the median values of probability and fee cost in the
linear approximation had a bug by counting on non-existing arcs.

Changelog-none: askrene: fix the median

Signed-off-by: Lagrang3 <[email protected]>
Changelog-none: add ratio ceil and floor operators on amount_msat

Signed-off-by: Lagrang3 <[email protected]>
Changelog-none: askrene: add arbitrary precision flow unit

Signed-off-by: Lagrang3 <[email protected]>
- use graph_max_num_arcs/nodes instead of tal_count in bound checks,
- don't use ccan/lqueue, use instead a minimalistic queue
  implementation with an array,
- add missing const qualifiers to temporary tal allocators,
- check preconditions with assert,
- remove inline specifier for static functions,

Changelog-None

Signed-off-by: Lagrang3 <[email protected]>
Rusty: "We don't generally use NDEBUG in our code"

Instead use a compile time flag ASKRENE_UNITTEST to make checks on unit
tests that we don't normally need on release code.

Changelog-none

Signed-off-by: Lagrang3 <[email protected]>
Rusty: "allocations don't fail"

Signed-off-by: Lagrang3 <[email protected]>
Changelog-none

Signed-off-by: Lagrang3 <[email protected]>
The fee_fallback test would fail after fixing the computation of the
median. Now by we can restore it by making the probability cost factor
1000x higher than the ratio of the median. This shows how hard it is to
combine fee and probability costs and why is the current approach so
fragile.

Changelog-None

Signed-off-by: Lagrang3 <[email protected]>
The ratio of the median of the fees and probability cost is overall not
a bad factor to combine these two features. This is what the
test_real_data shows.

Changelog-None

Signed-off-by: Lagrang3 <[email protected]>
We can fix the median calc by removing the (unused) reverse edges.

Also analyze the failure case in test_real_data: it's a real edge case, so
hardcode that one as "ok".

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the askrene-new-mcf-interface branch from f0dfb08 to 6795ff3 Compare November 21, 2024 03:35
@rustyrussell
Copy link
Contributor

Trivial rebase and new commit...

@rustyrussell rustyrussell merged commit 2c9023e into ElementsProject:master Nov 21, 2024
39 checks passed
@Lagrang3 Lagrang3 deleted the askrene-new-mcf-interface branch December 2, 2024 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants