Skip to content

Conversation

@Garfield96
Copy link
Collaborator

The error handling is aligned with the error handling of the surrounding function or file.

@Garfield96 Garfield96 added the bug Something isn't working label Oct 13, 2023
@Garfield96 Garfield96 self-assigned this Oct 13, 2023
alexanderstephan pushed a commit that referenced this pull request Oct 13, 2023
Since the following patch :
  commit 33c49ce
  MINOR: quic: Make qc_dgrams_retransmit() return a status.
retransmission process is interrupted as soon as a fatal send error has
been encounted. However, this may leave frames in local list. This cause
several issues : a memory leak and a potential crash.

The crash happens because leaked frames are duplicated of an origin
frame via qc_dup_pkt_frms(). If an ACK arrives later for the origin
frame, all duplicated frames are also freed. During qc_frm_free(),
LIST_DEL_INIT() operation is invalid as it still references the local
list used inside qc_dgrams_retransmit().

This bug was reproduced using the following injection from another
machine :
  $ h2load --npn-list h3 -t 8 -c 10000 -m 1 -n 2000000000 \
      https://<host>:<port>/?s=4m

Haproxy was compiled using ASAN. The crash resulted in the following
trace :
==332748==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fff82bf9d78 at pc 0x556facd3b95a bp 0x7fff82bf8b20 sp 0x7fff82bf8b10
WRITE of size 8 at 0x7fff82bf9d78 thread T0
    #0 0x556facd3b959 in qc_frm_free include/haproxy/quic_frame.h:273
    #1 0x556facd59501 in qc_release_frm src/quic_conn.c:1724
    #2 0x556facd5a07f in quic_stream_try_to_consume src/quic_conn.c:1803
    #3 0x556facd5abe9 in qc_treat_acked_tx_frm src/quic_conn.c:1866
    #4 0x556facd5b3d8 in qc_ackrng_pkts src/quic_conn.c:1928
    #5 0x556facd60187 in qc_parse_ack_frm src/quic_conn.c:2354
    #6 0x556facd693a1 in qc_parse_pkt_frms src/quic_conn.c:3203
    #7 0x556facd7531a in qc_treat_rx_pkts src/quic_conn.c:4606
    haproxy#8 0x556facd7a528 in quic_conn_app_io_cb src/quic_conn.c:5059
    haproxy#9 0x556fad3284be in run_tasks_from_lists src/task.c:596
    haproxy#10 0x556fad32a3fa in process_runnable_tasks src/task.c:876
    haproxy#11 0x556fad24a676 in run_poll_loop src/haproxy.c:2968
    haproxy#12 0x556fad24b510 in run_thread_poll_loop src/haproxy.c:3167
    haproxy#13 0x556fad24e7ff in main src/haproxy.c:3857
    haproxy#14 0x7fae30ddd0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x240b2)
    haproxy#15 0x556facc9375d in _start (/opt/haproxy-quic-2.8/haproxy+0x1ea75d)

Address 0x7fff82bf9d78 is located in stack of thread T0 at offset 40 in frame
    #0 0x556facd74ede in qc_treat_rx_pkts src/quic_conn.c:4580

This must be backported up to 2.7.
src/log.c Outdated
{
struct logsrv *cpy = malloc(sizeof(*cpy));
if (unlikely(!cpy))
return NULL;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return NULL;
goto error;

For consistency goto error could be used here as well. free_logsrv should be a no-op in this case.

@alexanderstephan
Copy link

The merge conflict with the latest master version should be also addressed.

Garfield96 pushed a commit that referenced this pull request Mar 3, 2024
A quic_conn is instantiated and tied on the first thread which has
received the first INITIAL packet. After handshake completion,
listener_accept() is called. For each quic_conn, a new thread is
selected among the least loaded ones Note that this occurs earlier if
handling 0-RTT data.

This thread connection migration is done in two steps :
* inside listener_accept(), on the origin thread, quic_conn
  tasks/tasklet are killed. After this, no quic_conn related processing
  will occur on this thread. The connection is flagged with
  QUIC_FL_CONN_AFFINITY_CHANGED.
* as soon as the first quic_conn related processing occurs on the new
  thread, the migration is finalized. This allows to allocate the new
  tasks/tasklet directly on the destination thread.

This last step on the new thread must be done prior to other quic_conn
access. There is two events which may trigger it :
* a packet is received on the new thread. In this case,
  qc_finalize_affinity_rebind() is called from quic_dgram_parse().
* the recently accepted connection is popped from accept_queue_ring via
  accept_queue_process(). This will called session_accept_fd() as
  listener.bind_conf.accept callback. This instantiates a new session
  and start connection stack via conn_xprt_start(), which itself calls
  qc_xprt_start() where qc_finalize_affinity_rebind() is used.

A condition was recently found which could cause a closing to be used
with qc_finalize_affinity_rebind() which is forbidden with a BUG_ON().

This lat step was not compatible with layer 4 rule such as "tcp-request
connection reject" which closes the connection early. In this case, most
of the body of session_accept_fd() is skipped, including
qc_xprt_start(), so thread migration is not finalized. At the end of the
function, conn_xprt_close() is then called which flags the connection as
CLOSING.

If a datagram is received for this connection before it is released,
this will call qc_finalize_affinity_rebind() which triggers its BUG_ON()
to prevent thread migration for CLOSING quic_conn.

FATAL: bug condition "qc->flags & ((1U << 29)|(1U << 30))" matched at src/quic_conn.c:2036
Thread 3 "haproxy" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7ffff794f700 (LWP 2973030)]
0x00005555556221f3 in qc_finalize_affinity_rebind (qc=0x7ffff002d060) at src/quic_conn.c:2036
2036            BUG_ON(qc->flags & (QUIC_FL_CONN_CLOSING|QUIC_FL_CONN_DRAINING));
(gdb) bt
 #0  0x00005555556221f3 in qc_finalize_affinity_rebind (qc=0x7ffff002d060) at src/quic_conn.c:2036
 #1  0x0000555555682463 in quic_dgram_parse (dgram=0x7fff5003ef10, from_qc=0x0, li=0x555555f38670) at src/quic_rx.c:2602
 #2  0x0000555555651aae in quic_lstnr_dghdlr (t=0x555555fc4440, ctx=0x555555fc3f78, state=32832) at src/quic_sock.c:189
 #3  0x00005555558c9393 in run_tasks_from_lists (budgets=0x7ffff7944c90) at src/task.c:596
 #4  0x00005555558c9e8e in process_runnable_tasks () at src/task.c:876
 #5  0x000055555586b7b2 in run_poll_loop () at src/haproxy.c:2966
 #6  0x000055555586be87 in run_thread_poll_loop (data=0x555555d3d340 <ha_thread_info+64>) at src/haproxy.c:3165
 #7  0x00007ffff7b59609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
 haproxy#8  0x00007ffff7a7e133 in clone () from /lib/x86_64-linux-gnu/libc.so.6

To fix this issue, ensure quic_conn migration is completed earlier
inside session_accept_fd(), before any tcp rules processing. This is
done by moving qc_finalize_affinity_rebind() invocation from
qc_xprt_start() to qc_conn_init().

This must be backported up to 2.7.
Garfield96 pushed a commit that referenced this pull request Mar 3, 2024
The connections are flagged as "to be killed" asap when the peer has left
(detected by sendto() "Connection refused" errno) by qc_kill_conn(). This
function has to wakeup the idle timer task to release the connection (and the idle
timer  and the idle timer task itself). Then if in the meantime the connection
was flagged as having to process some retransmissions, some packet could lead
to sendto() errors again with a call to qc_kill_conn(), this time with a released
idle timer task.

This bug could be detected by libasan as follows:

.AddressSanitizer:DEADLYSIGNAL
=================================================================
==21018==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x    560b5d898717 bp 0x7f9aaac30000 sp 0x7f9aaac2ff80 T3)
==21018==The signal is caused by a READ memory access.
==21018==Hint: address points to the zero page.
.    #0 0x560b5d898717 in _task_wakeup include/haproxy/task.h:209
    #1 0x560b5d8a563c in qc_kill_conn src/quic_conn.c:171
    #2 0x560b5d97f832 in qc_send_ppkts src/quic_tx.c:636
    #3 0x560b5d981b53 in qc_send_app_pkts src/quic_tx.c:876
    #4 0x560b5d987122 in qc_send_app_probing src/quic_tx.c:910
    #5 0x560b5d987122 in qc_dgrams_retransmit src/quic_tx.c:1397
    #6 0x560b5d8ab250 in quic_conn_app_io_cb src/quic_conn.c:712
    #7 0x560b5de41593 in run_tasks_from_lists src/task.c:596
    haproxy#8 0x560b5de4333c in process_runnable_tasks src/task.c:876
    haproxy#9 0x560b5dd6f2b0 in run_poll_loop src/haproxy.c:2966
    haproxy#10 0x560b5dd700fb in run_thread_poll_loop src/haproxy.c:3165
    haproxy#11 0x7f9ab9188ea6 in start_thread nptl/pthread_create.c:477
    haproxy#12 0x7f9ab90a8a2e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0xfba2e)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV include/haproxy/task.h:209 in _task_wakeup
Thread T3 created by T0 here:
    #0 0x7f9ab97ac2a2 in __interceptor_pthread_create ../../../../src/libsaniti    zer/asan/asan_interceptors.cpp:214
    #1 0x560b5df4f3ef in setup_extra_threads src/thread.c:252                  o
    #2 0x560b5dd730c7 in main src/haproxy.c:3856
    #3 0x7f9ab8fd0d09 in __libc_start_main ../csu/libc-start.c:308             i

==21018==ABORTING
AddressSanitizer:DEADLYSIGNAL
Aborted (core dumped)

To fix, simply reset the connection flag QUIC_FL_CONN_RETRANS_NEEDED to cancel
the retransmission when qc_kill_conn is called.

Note that this new bug arrived with this fix which is correct and flagged as to be
backported as far as 2.6.
      BUG/MINOR: quic: idle timer task requeued in the past

Must be backported as far as 2.6.
Garfield96 pushed a commit that referenced this pull request Mar 3, 2024
This bug could be reproduced with -dMfail and dectected by libasan as follows:

$ ASAN_OPTIONS=disable_coredump=0:unmap_shadow_on_exit=1:abort_on_error=f quic-freeze.cfg -dMfail -dMno-cache -dM0x55
=================================================================
==82989==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffc 0x560790cc4749 bp 0x7fff8e0e8e30 sp 0x7fff8e0e8e28
WRITE of size 8 at 0x7fff8e0ea338 thread T0
    #0 0x560790cc4748 in qc_frm_free src/quic_frame.c:1222
    #1 0x560790cc5260 in qc_release_frm src/quic_frame.c:1261
    #2 0x560790d1de99 in qc_treat_acked_tx_frm src/quic_rx.c:312
    #3 0x560790d1e708 in qc_ackrng_pkts src/quic_rx.c:370
    #4 0x560790d22a1d in qc_parse_ack_frm src/quic_rx.c:694
    #5 0x560790d25daa in qc_parse_pkt_frms src/quic_rx.c:988
    #6 0x560790d2a509 in qc_treat_rx_pkts src/quic_rx.c:1373
    #7 0x560790c72d45 in quic_conn_io_cb src/quic_conn.c:906
    haproxy#8 0x560791207847 in run_tasks_from_lists src/task.c:596
    haproxy#9 0x5607912095f0 in process_runnable_tasks src/task.c:876
    haproxy#10 0x560791135564 in run_poll_loop src/haproxy.c:2966
    haproxy#11 0x5607911363af in run_thread_poll_loop src/haproxy.c:3165
    haproxy#12 0x56079113938c in main src/haproxy.c:3862
    haproxy#13 0x7f92606edd09 in __libc_start_main ../csu/libc-start.c:308
    haproxy#14 0x560790bcd529 in _start (/home/flecaille/src/haproxy/haproxy+0x

Address 0x7fff8e0ea338 is located in stack of thread T0 at offset 1032 i
    #0 0x560790d29b52 in qc_treat_rx_pkts src/quic_rx.c:1341

  This frame has 2 object(s):
    [32, 48) 'ar' (line 1380)
    [64, 1088) '_msg' (line 1368) <== Memory access at offset 1032 is inable
HINT: this may be a false positive if your program uses some custom stacnism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-use-after-scope src/quic_frame.c:1222 i
Shadow bytes around the buggy address:
  0x100071c15410: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  0x100071c15420: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  0x100071c15430: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  0x100071c15440: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  0x100071c15450: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
=>0x100071c15460: f8 f8 f8 f8 f8 f8 f8[f8]f8 f8 f8 f8 f8 f8 f3 f3
  0x100071c15470: f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 f3 00 00
  0x100071c15480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100071c15490: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100071c154a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100071c154b0: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 04 f3 f3 f3
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==82989==ABORTING
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
Aborted (core dumped)

Note that a coredump could not always be produced with all compilers. This was
always the case with clang 11.

When allocating frames to be retransmitted from qc_dgrams_retransmit(), if they
could not be sent for any reason, they could remain attached to a local list to
qc_dgrams_retransmit() and trigger a crash with libasan when releasing the
original frames they were duplicated from.

To fix this, always release the frames which could not be sent during
retransmissions calling qc_free_frm_list() where needed.

Must be backported as far as 2.6.
Garfield96 pushed a commit that referenced this pull request Mar 3, 2024
This bug could be reproduced loading several certificated from "bind" line:
with "server_ocsp.pem" as argument to "crt" setting and updating
the CDSA certificate with the RSA as follows:

echo -e "set ssl cert reg-tests/ssl/ocsp_update/multicert/server_ocsp.pem.ecdsa \
	     <<\n$(cat reg-tests/ssl/ocsp_update/multicert/server_ocsp.pem.rsa)\n" | socat - /tmp/stats
followed by an "commit ssl cert reg-tests/ssl/ocsp_update/multicert/server_ocsp.pem.ecdsa"
command. This could be detected by libasan as follows:

=================================================================
==507223==ERROR: AddressSanitizer: attempting double-free on 0x60200007afb0 in thread T3:
    #0 0x7fabc6fb5527 in __interceptor_free (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x54527)
    #1 0x7fabc6ae8f8c in ossl_asn1_string_embed_free (/opt/quictls/lib/libcrypto.so.81.3+0xd4f8c)
    #2 0x7fabc6af54e9 in ossl_asn1_primitive_free (/opt/quictls/lib/libcrypto.so.81.3+0xe14e9)
    #3 0x7fabc6af5960 in ossl_asn1_template_free (/opt/quictls/lib/libcrypto.so.81.3+0xe1960)
    #4 0x7fabc6af569f in ossl_asn1_item_embed_free (/opt/quictls/lib/libcrypto.so.81.3+0xe169f)
    #5 0x7fabc6af58a4 in ASN1_item_free (/opt/quictls/lib/libcrypto.so.81.3+0xe18a4)
    #6 0x46a159 in ssl_sock_free_cert_key_and_chain_contents src/ssl_ckch.c:723
    #7 0x46aa92 in ckch_store_free src/ssl_ckch.c:869
    haproxy#8 0x4704ad in cli_release_commit_cert src/ssl_ckch.c:1981
    haproxy#9 0x962e83 in cli_io_handler src/cli.c:1140
    haproxy#10 0xc1edff in task_run_applet src/applet.c:454
    haproxy#11 0xaf8be9 in run_tasks_from_lists src/task.c:634
    haproxy#12 0xafa2ed in process_runnable_tasks src/task.c:876
    haproxy#13 0xa23c72 in run_poll_loop src/haproxy.c:3024
    haproxy#14 0xa24aa3 in run_thread_poll_loop src/haproxy.c:3226
    haproxy#15 0x7fabc69e7ea6 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x7ea6)
    haproxy#16 0x7fabc6907a2e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0xfba2e)

0x60200007afb0 is located 0 bytes inside of 3-byte region [0x60200007afb0,0x60200007afb3)
freed by thread T3 here:
    #0 0x7fabc6fb5527 in __interceptor_free (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x54527)
    #1 0x7fabc6ae8f8c in ossl_asn1_string_embed_free (/opt/quictls/lib/libcrypto.so.81.3+0xd4f8c)

previously allocated by thread T2 here:
    #0 0x7fabc6fb573f in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x5473f)
    #1 0x7fabc6ae8d77 in ASN1_STRING_set (/opt/quictls/lib/libcrypto.so.81.3+0xd4d77)

Thread T3 created by T0 here:
    #0 0x7fabc6f84bba in pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x23bba)
    #1 0xc04f36 in setup_extra_threads src/thread.c:252
    #2 0xa2761f in main src/haproxy.c:3917
    #3 0x7fabc682fd09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)

Thread T2 created by T0 here:
    #0 0x7fabc6f84bba in pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x23bba)
    #1 0xc04f36 in setup_extra_threads src/thread.c:252
    #2 0xa2761f in main src/haproxy.c:3917
    #3 0x7fabc682fd09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)

SUMMARY: AddressSanitizer: double-free ??:0 __interceptor_free
==507223==ABORTING
Aborted

The OCSP CID stored in the impacted ckch data were freed but not reset to NULL,
leading to a subsequent double free.

Must be backported to 2.8.
Garfield96 pushed a commit that referenced this pull request Mar 3, 2024
h3_resp_data_send() is used to transcode HTX data into H3 data frames.
If QCS Tx buffer is not aligned when first invoked, two separate frames
may be built, first until buffer end, then with remaining space in
front.

If buffer space is not enough for at least the H3 frame header, -1 is
returned with the flag QC_SF_BLK_MROOM set to await for more room. An
issue arises if this occurs for the second frame : -1 is returned even
though HTX data were properly transcoded and removed on the first step.
This causes snd_buf callback to return an incorrect value to the stream
layer, which in the end will corrupt the channel output buffer.

To fix this, stop considering that not enough remaining space is an
error case. Instead, return 0 if this is encountered for the first frame
or the HTX removed block size for the second one. As QC_SF_BLK_MROOM is
set, this will correctly interrupt H3 encoding. Label err is thus only
properly limited to fatal error which should cause a connection closure.
A new BUG_ON() has been added which should prevent similar issues in the
future.

This issue was detected using the following client :
 $ ngtcp2-client --no-quic-dump --no-http-dump --exit-on-all-streams-close \
   127.0.0.1 20443 -n2 "http://127.0.0.1:20443/?s=50k"

This triggers the following CHECK_IF statement. Note that it may be
necessary to disable fast forwarding to enforce snd_buf usage.

Thread 1 "haproxy" received signal SIGILL, Illegal instruction.
0x00005555558bc48a in co_data (c=0x5555561ed428) at include/haproxy/channel.h:130
130             CHECK_IF_HOT(c->output > c_data(c));
[ ## gdb ## ] bt
 #0  0x00005555558bc48a in co_data (c=0x5555561ed428) at include/haproxy/channel.h:130
 #1  0x00005555558c1d69 in sc_conn_send (sc=0x5555561f92d0) at src/stconn.c:1637
 #2  0x00005555558c2683 in sc_conn_io_cb (t=0x5555561f7f10, ctx=0x5555561f92d0, state=32832) at src/stconn.c:1824
 #3  0x000055555590c48f in run_tasks_from_lists (budgets=0x7fffffffdaa0) at src/task.c:596
 #4  0x000055555590cf88 in process_runnable_tasks () at src/task.c:876
 #5  0x00005555558aae3b in run_poll_loop () at src/haproxy.c:3049
 #6  0x00005555558ab57e in run_thread_poll_loop (data=0x555555d9fa00 <ha_thread_info>) at src/haproxy.c:3251
 #7  0x00005555558ad053 in main (argc=6, argv=0x7fffffffddd8) at src/haproxy.c:3948

In case CHECK_IF are not activated, it may cause crash or incorrect
transfers.

This was introduced by the following commit
  commit 2144d24
  BUG/MINOR: h3: close connection on sending alloc errors

This must be backported wherever the above patch is.
Garfield96 pushed a commit that referenced this pull request Mar 3, 2024
Before a dynamic server can be deleted, a set of preconditions must be
validated to ensure it is not referenced naymore by a stream or a
connection. This is implemented in srv_check_for_deletion().

The various criteria specified were incomplete. This allows a server
instance to be deleted while still be referenced by a stream and a
connection.

This bug was reproduced by using ASAN compilation. A script was used to
add and delete a server every second, while using h2load to generate
traffic with download of 1k objects. Here is the ASAN error.

==140916==ERROR: AddressSanitizer: heap-use-after-free on address 0x520000020080 at pc 0x63cb25679537 bp 0x701529ff5070 sp 0x701529ff5060
READ of size 1 at 0x520000020080 thread T7
    #0 0x63cb25679536 in objt_server include/haproxy/obj_type.h:99
    #1 0x63cb2568f465 in process_stream src/stream.c:1823
    #2 0x63cb25a4a4a2 in run_tasks_from_lists src/task.c:632
    #3 0x63cb25a4bf62 in process_runnable_tasks src/task.c:876
    #4 0x63cb2596a220 in run_poll_loop src/haproxy.c:3050
    #5 0x63cb2596b192 in run_thread_poll_loop src/haproxy.c:3252
    #6 0x701539aa9559  (/usr/lib/libc.so.6+0x8b559) (BuildId: c0caa0b7709d3369ee575fcd7d7d0b0fc48733af)
    #7 0x701539b26a3b  (/usr/lib/libc.so.6+0x108a3b) (BuildId: c0caa0b7709d3369ee575fcd7d7d0b0fc48733af)

To fix this, add <curr_used_conns> to the counters checked in
srv_check_for_deletion().

Outside of this bug, one case which remains sensible is for SF_DIRECT
streams which referenced a server instance early in process_stream()
before connect_server(). This occurs with use-server directive,
force-persist rule or cookie persistence. However, after code
reexamination, the code is considered reliable as process_stream() is
not rescheduled before connect_server() invocation. These observations
have been saved in sess_change_server() documentation to ensure it
remains valid in the future.

This must be backported up to 2.6.
alexanderstephan pushed a commit that referenced this pull request Mar 14, 2025
Due to QUIC packet reordering, a stream may be opened via a new
RESET_STREAM or STOP_SENDING frame. This would cause either Tx or Rx
channel to be immediately closed.

This can cause an issue with current QUIC MUX implementation with QCS
purging. QCS are inserted into QCC purge list when transfer could be
considered as completed. In most cases, this happens after full
request/response exchange. However, it can also happens after request
reception if RESET_STREAM/STOP_SENDING are received first.

A BUG_ON() crash will occur if a STREAM frame is received after. In this
case, streamdesc instance will be attached via qcs_attach_sc() to handle
the new request. However, QCS is already considered eligible to purging.
It could cause it to be released while its streamdesc instance remains.
A BUG_ON() crash detects this problem in qcc_purge_streams().

To fix this, extend qcc_decode_qcs() to skip app proto rcv_buf
invokation if QCS is considered completed. A similar condition was
already implemented when read was previously aborted after a
STOP_SENDING emission by QUIC MUX.

This crash was reproduced on haproxy.org. Here is the output of the
backtrace :
Core was generated by `./haproxy-dev -db -f /etc/haproxy/haproxy-current.cfg -sf 16495'.
Program terminated with signal SIGILL, Illegal instruction.
 #0  0x00000000004e442b in qcc_purge_streams (qcc=0x774cca0) at src/mux_quic.c:2661
2661                    BUG_ON_HOT(!qcs_is_completed(qcs));
[Current thread is 1 (LWP 1457)]
[ ## gdb ## ] bt
 #0  0x00000000004e442b in qcc_purge_streams (qcc=0x774cca0) at src/mux_quic.c:2661
 #1  0x00000000004e4db7 in qcc_io_process (qcc=0x774cca0) at src/mux_quic.c:2744
 #2  0x00000000004e5a54 in qcc_io_cb (t=0x7f71193940c0, ctx=0x774cca0, status=573504) at src/mux_quic.c:2886
 #3  0x0000000000b4f792 in run_tasks_from_lists (budgets=0x7ffdcea1e670) at src/task.c:603
 #4  0x0000000000b5012f in process_runnable_tasks () at src/task.c:883
 #5  0x00000000007de4a3 in run_poll_loop () at src/haproxy.c:2771
 #6  0x00000000007deb9f in run_thread_poll_loop (data=0x1335a00 <ha_thread_info>) at src/haproxy.c:2985
 #7  0x00000000007dfd8d in main (argc=6, argv=0x7ffdcea1e958) at src/haproxy.c:3570

This BUG_ON() crash can only happen since 3.1 refactoring. Indeed, purge
list was only implemented on this version. As such, please backport it
on 3.1 immediately. However, a logic issue remains for older version as
a stream could be attached on a fully closed QCS. Thus, it should be
backported up to 2.8, this time after a period of observation.
Garfield96 pushed a commit that referenced this pull request Aug 22, 2025
…ll()

In the following trace trying to abuse the watchdog from the CLI's
"debug dev loop" command running in parallel to "show threads" loops,
it's clear that some re-entrance may happen in ha_thread_dump_fill().

A first minimal fix consists in using a test-and-set on the flag
indicating that the function is currently dumping threads, so that
the one from the signal just returns. However the caller should be
made more reliable to serialize all of this, that's for future
work.

Here's an example capture of 7 threads stuck waiting for each other:
  (gdb) bt
  #0  0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
  #1  0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
  #2  0x00000000005ba4f5 in ha_thread_dump_fill (thr=2, buf=0x7ffdd8e08ab0) at src/debug.c:402
  #3  ha_thread_dump_fill (buf=0x7ffdd8e08ab0, thr=<optimized out>) at src/debug.c:384
  #4  0x00000000005baac4 in ha_stuck_warning (thr=thr@entry=2) at src/debug.c:840
  #5  0x00000000006a360d in wdt_handler (sig=<optimized out>, si=<optimized out>, arg=<optimized out>) at src/wdt.c:156
  #6  <signal handler called>
  #7  0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
  haproxy#8  0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
  haproxy#9  0x00000000005ba4c2 in ha_thread_dump_fill (thr=2, buf=0x7fe78f2d6420) at src/debug.c:426
  haproxy#10 ha_thread_dump_fill (buf=0x7fe78f2d6420, thr=2) at src/debug.c:384
  haproxy#11 0x00000000005ba7c6 in cli_io_handler_show_threads (appctx=0x2a89ab0) at src/debug.c:548
  haproxy#12 0x000000000057ea43 in cli_io_handler (appctx=0x2a89ab0) at src/cli.c:1176
  haproxy#13 0x00000000005d7885 in task_process_applet (t=0x2a82730, context=0x2a89ab0, state=<optimized out>) at src/applet.c:920
  haproxy#14 0x0000000000659002 in run_tasks_from_lists (budgets=budgets@entry=0x7ffdd8e0a5c0) at src/task.c:644
  haproxy#15 0x0000000000659bd7 in process_runnable_tasks () at src/task.c:886
  haproxy#16 0x00000000005cdcc9 in run_poll_loop () at src/haproxy.c:2858
  haproxy#17 0x00000000005ce457 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3075
  haproxy#18 0x0000000000430628 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3665
Garfield96 pushed a commit that referenced this pull request Aug 22, 2025
Stephen Farrell reported in issue haproxy#2942 that recent haproxy versions
crash if there's no resolv.conf. A quick bisect with his reproducer
showed that it started with commit 4194f75 ("MEDIUM: tree-wide: avoid
manually initializing proxies") which reorders the proxies initialization
sequence a bit. The crash shows a corrupted tree, typically indicating a
use-after-free. With the help of ASAN it was possible to find that a
resolver proxy had been destroyed and freed before the name insertion
that causes the crash, very likely caused by the absence of the needed
resolv.conf:

    #0 0x7ffff72a82f7 in free (/usr/local/lib64/libasan.so.5+0x1062f7)
    #1 0x94c1fd in free_proxy src/proxy.c:436
    #2 0x9355d1 in resolvers_destroy src/resolvers.c:2604
    #3 0x93e899 in resolvers_create_default src/resolvers.c:3892
    #4 0xc6ed29 in httpclient_resolve_init src/http_client.c:1170
    #5 0xc6fbcf in httpclient_create_proxy src/http_client.c:1310
    #6 0x4ae9da in ssl_ocsp_update_precheck src/ssl_ocsp.c:1452
    #7 0xa1b03f in step_init_2 src/haproxy.c:2050

But free_proxy() doesn't delete the ebpt_node that carries the name,
which perfectly explains the situation. This patch simply deletes the
name node and Stephen confirmed that it fixed the problem for him as
well. Let's also free it since the key points to p->id which is never
freed either in this function!

No backport is needed since the patch above was first merged into
3.2-dev10.
Garfield96 pushed a commit that referenced this pull request Aug 22, 2025
For frontend side, quic_conn is only released if MUX wasn't allocated,
either due to handshake abort, in which case upper layer is never
allocated, or after transfer completion when full conn + MUX layers are
already released.

On the backend side, initialization is not performed in the same order.
Indeed, in this case, connection is first instantiated, the nthe
quic_conn is created to execute the handshake, while MUX is still only
allocated on handshake completion. As such, it is not possible anymore
to free immediately quic_conn on handshake failure. Else, this can cause
crash if the connection try to reaccess to its transport layer after
quic_conn release.

Such crash can easily be reproduced in case of connection error to the
QUIC server. Here is an example of an experienced backtrace.

Thread 1 "haproxy" received signal SIGSEGV, Segmentation fault.
  0x0000555555739733 in quic_close (conn=0x55555734c0d0, xprt_ctx=0x5555573a6e50) at src/xprt_quic.c:28
  28              qc->conn = NULL;
  [ ## gdb ## ] bt
  #0  0x0000555555739733 in quic_close (conn=0x55555734c0d0, xprt_ctx=0x5555573a6e50) at src/xprt_quic.c:28
  #1  0x00005555559c9708 in conn_xprt_close (conn=0x55555734c0d0) at include/haproxy/connection.h:162
  #2  0x00005555559c97d2 in conn_full_close (conn=0x55555734c0d0) at include/haproxy/connection.h:206
  #3  0x00005555559d01a9 in sc_detach_endp (scp=0x7fffffffd648) at src/stconn.c:451
  #4  0x00005555559d05b9 in sc_reset_endp (sc=0x55555734bf00) at src/stconn.c:533
  #5  0x000055555598281d in back_handle_st_cer (s=0x55555734adb0) at src/backend.c:2754
  #6  0x000055555588158a in process_stream (t=0x55555734be10, context=0x55555734adb0, state=516) at src/stream.c:1907
  #7  0x0000555555dc31d9 in run_tasks_from_lists (budgets=0x7fffffffdb30) at src/task.c:655
  haproxy#8  0x0000555555dc3dd3 in process_runnable_tasks () at src/task.c:889
  haproxy#9  0x0000555555a1daae in run_poll_loop () at src/haproxy.c:2865
  haproxy#10 0x0000555555a1e20c in run_thread_poll_loop (data=0x5555569d1c00 <ha_thread_info>) at src/haproxy.c:3081
  haproxy#11 0x0000555555a1f66b in main (argc=5, argv=0x7fffffffde18) at src/haproxy.c:3671

To fix this, change the condition prior to calling quic_conn release. If
<conn> member is not NULL, delay the release, similarly to the case when
MUX is allocated. This allows connection to be freed first, and detach
from quic_conn layer through close xprt operation.

No need to backport.
Garfield96 pushed a commit that referenced this pull request Aug 22, 2025
Following c24de07 ("OPTIM: stats: store fast sharded counters pointers
at session and stream level") some crashes were observed in
connect_server():

  #0  0x00000000007ba39c in connect_server (s=0x65117b0) at src/backend.c:2101
  2101                            _HA_ATOMIC_INC(&s->sv_tgcounters->connect);
  Missing separate debuginfos, use: debuginfo-install glibc-2.17-325.el7_9.x86_64 libgcc-4.8.5-44.el7.x86_64 nss-softokn-freebl-3.67.0-3.el7_9.x86_64 pcre-8.32-17.el7.x86_64
  (gdb) bt
  #0  0x00000000007ba39c in connect_server (s=0x65117b0) at src/backend.c:2101
  #1  0x00000000007baff8 in back_try_conn_req (s=0x65117b0) at src/backend.c:2378
  #2  0x00000000006c0e9f in process_stream (t=0x650f180, context=0x65117b0, state=8196) at src/stream.c:2366
  #3  0x0000000000bd3e51 in run_tasks_from_lists (budgets=0x7ffd592752e0) at src/task.c:655
  #4  0x0000000000bd49ef in process_runnable_tasks () at src/task.c:889
  #5  0x0000000000851169 in run_poll_loop () at src/haproxy.c:2834
  #6  0x0000000000851865 in run_thread_poll_loop (data=0x1a03580 <ha_thread_info>) at src/haproxy.c:3050
  #7  0x0000000000852a53 in main (argc=7, argv=0x7ffd592755f8) at src/haproxy.c:3637

Here the crash occurs during the atomic inc of a sv_tgcounters metric from
the stream pointer, which tells us the pointer is likely garbage.

In fact, we assign s->sv_tgcounters each time the stream target is set to
a valid server. For that we use stream_set_srv_target() helper which does
assigment for us. By reviewing the code, in turns out we forgot to call
stream_set_srv_target() in pendconn_dequeue(), where the stream target
is set to the server who picked the pendconn.

Let's fix the bug by using stream_set_srv_target() there.

No backport needed unless c24de07 is.
alexanderstephan and others added 6 commits September 1, 2025 10:07
…tus() and filter_count_url()

This patch adds missing out-of-memory (OOM) checks after calls to
calloc() in the functions `filter_count_srv_status()` and `filter_count_url()`.
If memory allocation fails, an error message is printed to stderr
and the process exits with status 1. This improves robustness
and prevents undefined behavior in low-memory situations.

Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
… parser and dup_logger()

This patch adds missing out-of-memory (OOM) checks after calls
to `calloc()` and `malloc()` in the logformat parser and the
`dup_logger()` function. If memory allocation fails, an error
is reported or NULL is returned, preventing undefined behavior
in low-memory conditions.

Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
This patch adds a missing out-of-memory (OOM) check after
the call to `calloc()` in `smp_fetch_acl_parse()`. If
memory allocation fails, an error message is set and
the function returns 0, improving robustness in
low-memory situations.

Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
This commit adds a missing out-of-memory (OOM) check
after the call to `calloc()` in `cfg_parse_listen()`.
If memory allocation fails, an alert is logged, error
codes are set, and parsing is aborted to prevent
undefined behavior.

Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
…on_options()

This patch adds a missing out-of-memory (OOM) check after
the call to `calloc()` in `parse_compression_options()`. If
memory allocation fails, an error message is set, the function
returns -1, and parsing is aborted to ensure safe handling
of low-memory conditions.

Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
This patch adds a missing out-of-memory (OOM) check after
the call to `malloc()` in `indent_msg()`. If memory
allocation fails, the function returns NULL to prevent
undefined behavior.

Co-authored-by: Christian Norbert Menges <christian.norbert.menges@sap.com>
Garfield96 pushed a commit that referenced this pull request Nov 5, 2025
When an SNI is set on a QUIC server line, ssl_sock_set_servername() is called
from connect_server() (backend.c). This leads some BUG_ON() to be triggered
because the CO_FL_WAIT_L6_CONN | CO_FL_SSL_WAIT_HS were not set. This must
be done into the ->init() xprt callback. This patch move the flags settings
from ->start() to ->init() callback.

Indeed, connect_server() calls these functions in this order:

   ->init(),
   ssl_sock_set_servername() # => crash if CO_FL_WAIT_L6_CONN | CO_FL_SSL_WAIT_HS not set
   ->start()

Furthermore ssl_sock_set_servername() has a side effect to reset the SSL_SESSION
object (attached to SSL object) calling SSL_set_session(), leading to crashes as follows:

 [Thread debugging using libthread_db enabled]
 Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
 Core was generated by `./haproxy -f quic_srv.cfg'.
 Program terminated with signal SIGSEGV, Segmentation fault.
 #0  tls_process_server_hello (s=0x560c259733b0, pkt=0x7fffac239f20)
     at ssl/statem/statem_clnt.c:1624
 1624            if (s->session->session_id_length > 0) {
 [Current thread is 1 (Thread 0x7fc364e53dc0 (LWP 35514))]
 (gdb) bt
 #0  tls_process_server_hello (s=0x560c259733b0, pkt=0x7fffac239f20)
     at ssl/statem/statem_clnt.c:1624
 #1  0x00007fc36540fba4 in ossl_statem_client_process_message (s=0x560c259733b0,
     pkt=0x7fffac239f20) at ssl/statem/statem_clnt.c:1042
 #2  0x00007fc36540d028 in read_state_machine (s=0x560c259733b0) at ssl/statem/statem.c:646
 #3  0x00007fc36540ca70 in state_machine (s=0x560c259733b0, server=0)
     at ssl/statem/statem.c:439
 #4  0x00007fc36540c576 in ossl_statem_connect (s=0x560c259733b0) at ssl/statem/statem.c:250
 #5  0x00007fc3653f1698 in SSL_do_handshake (s=0x560c259733b0) at ssl/ssl_lib.c:3835
 #6  0x0000560c22620327 in qc_ssl_do_hanshake (qc=qc@entry=0x560c25961f60,
     ctx=ctx@entry=0x560c25963020) at src/quic_ssl.c:863
 #7  0x0000560c226210be in qc_ssl_provide_quic_data (len=90, data=<optimized out>,
     ctx=0x560c25963020, level=ssl_encryption_initial, ncbuf=0x560c2588bb18)
     at src/quic_ssl.c:1071
 haproxy#8  qc_ssl_provide_all_quic_data (qc=qc@entry=0x560c25961f60, ctx=0x560c25963020)
     at src/quic_ssl.c:1123
 haproxy#9  0x0000560c2260ca5f in quic_conn_io_cb (t=0x560c25962f80, context=0x560c25961f60,
     state=<optimized out>) at src/quic_conn.c:791
 haproxy#10 0x0000560c228255ed in run_tasks_from_lists (budgets=<optimized out>) at src/task.c:648
 haproxy#11 0x0000560c22825f7a in process_runnable_tasks () at src/task.c:889
 haproxy#12 0x0000560c22793dc7 in run_poll_loop () at src/haproxy.c:2836
 haproxy#13 0x0000560c22794481 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3056
 haproxy#14 0x0000560c2259082d in main (argc=<optimized out>, argv=<optimized out>)
     at src/haproxy.c:3667

<s> is the SSL object, and <s->session> is the SSL_SESSION object.

For the client, this is the first call do SSL_do_handshake() which initializes this
SSL_SESSION object from ->init() xpt callback. Then it is reset by
ssl_sock_set_servername(), then tls_process_server_hello() TLS stack is called with
NULL value for s->session when receiving the ServerHello TLS message.

To fix this, simply move the first call to SSL_do_handshake to ->start xprt call
back (qc_xprt_start()).

No need to backport.
alexanderstephan pushed a commit that referenced this pull request Dec 11, 2025
This bug arrived with this commit:
     MINOR: quic: implement cc-algo server keyword
where <srv> keywords list with a missing array NULL termination inside was
introduced to parse the QUIC backend CC algorithms.

Detected by ASAN during ssl/add_ssl_crt-list.vtc execution as follows:

***  h1    debug|==4066081==ERROR: AddressSanitizer: global-buffer-overflow on address 0x5562e31dedb8 at pc 0x5562e298951f bp 0x7ffe9f9f2b40 sp 0x7ffe9f9f2b38
***  h1    debug|READ of size 8 at 0x5562e31dedb8 thread T0
**** dT    0.173
***  h1    debug|    #0 0x5562e298951e in srv_find_kw src/server.c:789
***  h1    debug|    #1 0x5562e2989630 in _srv_parse_kw src/server.c:3847
***  h1    debug|    #2 0x5562e299db1f in parse_server src/server.c:4024
***  h1    debug|    #3 0x5562e2c86ea4 in cfg_parse_listen src/cfgparse-listen.c:593
***  h1    debug|    #4 0x5562e2b0ede9 in parse_cfg src/cfgparse.c:2708
***  h1    debug|    #5 0x5562e2c47d48 in read_cfg src/haproxy.c:1077
***  h1    debug|    #6 0x5562e2682055 in main src/haproxy.c:3366
***  h1    debug|    #7 0x7ff3ff867249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
***  h1    debug|    haproxy#8 0x7ff3ff867304 in __libc_start_main_impl ../csu/libc-start.c:360
***  h1    debug|    haproxy#9 0x5562e26858d0 in _start (/home/flecaille/src/haproxy/haproxy+0x2638d0)
***  h1    debug|
***  h1    debug|0x5562e31dedb8 is located 40 bytes to the left of global variable 'bind_kws' defined in 'src/cfgparse-quic.c:255:28' (0x5562e31dede0) of size 120
***  h1    debug|0x5562e31dedb8 is located 0 bytes to the right of global variable 'srv_kws' defined in 'src/cfgparse-quic.c:264:27' (0x5562e31ded80) of size 56
***  h1    debug|SUMMARY: AddressSanitizer: global-buffer-overflow src/server.c:789 in srv_find_kw
***  h1    debug|Shadow bytes around the buggy address:
***  h1    debug|  0x0aacdc633d60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
***  h1    debug|  0x0aacdc633d70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
***  h1    debug|  0x0aacdc633d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
***  h1    debug|  0x0aacdc633d90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
***  h1    debug|  0x0aacdc633da0: 00 00 00 00 00 00 00 00 00 00 f9 f9 f9 f9 f9 f9
***  h1    debug|=>0x0aacdc633db0: 00 00 00 00 00 00 00[f9]f9 f9 f9 f9 00 00 00 00
***  h1    debug|  0x0aacdc633dc0: 00 00 00 00 00 00 00 00 00 00 00 f9 f9 f9 f9 f9
***  h1    debug|  0x0aacdc633dd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
***  h1    debug|  0x0aacdc633de0: 00 00 00 00 00 00 00 00 f9 f9 f9 f9 f9 f9 f9 f9
***  h1    debug|  0x0aacdc633df0: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
***  h1    debug|  0x0aacdc633e00: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
***  h1    debug|Shadow byte legend (one shadow byte represents 8 application bytes):

This should be backported where the commit above is supposed to be backported.
@alexanderstephan
Copy link

Can be closed since it is now part of the upstream haproxy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants