YQ-5091 fixed PQ partitions balancer idle / reconnects #34486

Merged
GrigoriyPA merged 9 commits into ydb-platform:main from GrigoriyPA:YQ-5091-fix-partitions-balancer-with-idle-and-reconnect-settings
Feb 20, 2026

@GrigoriyPA
Collaborator

@GrigoriyPA GrigoriyPA commented Feb 19, 2026

Changelog entry

Fixed PQ partitions balancer idle / reconnects

Changelog category

  • Bugfix

Description for reviewers

Fixed handling of the Idle setting, fixed a hang on reconnect, and fixed counter names.

Bugfix ticket: https://st.yandex-team.ru/YQ-5091

@github-actions

github-actions bot commented Feb 19, 2026

2026-02-19 10:52:27 UTC Pre-commit check linux-x86_64-relwithdebinfo for 4914311 has started.
2026-02-19 10:53:10 UTC Artifacts will be uploaded here
2026-02-19 10:55:29 UTC ya make is running...
2026-02-19 11:00:01 UTC Check cancelled

@github-actions

github-actions bot commented Feb 19, 2026

2026-02-19 10:52:30 UTC Pre-commit check linux-x86_64-release-asan for 4914311 has started.
2026-02-19 10:52:49 UTC Artifacts will be uploaded here
2026-02-19 10:55:00 UTC ya make is running...
2026-02-19 11:00:01 UTC Check cancelled

@ydbot
Collaborator

ydbot commented Feb 19, 2026

Run Extra Tests

Run additional tests for this PR. You can customize:

  • Test Size: small, medium, large (default: all)
  • Test Targets: any directory path (default: ydb/)
  • Sanitizers: ASAN, MSAN, TSAN
  • Coredumps: enable for debugging (default: off)
  • Additional args: custom ya make arguments

▶  Run tests

@github-actions

github-actions bot commented Feb 19, 2026

🟢 2026-02-19 11:02:17 UTC The validation of the Pull Request description is successful.

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes issues in the PQ (PersQueue) partitions balancer related to idle timeout handling, reconnection hanging, and counter naming. The changes include a significant refactoring of the composite read session implementation to better manage partition states and improve observability through enhanced metrics.

Changes:

  • Refactored partition state management in composite read session with separate tracking of suspended, pending, ready, and idle partitions
  • Added sequence number tracking to counter updates to prevent out-of-order message processing
  • Introduced new signal utilities for coordinating asynchronous operations between actors and read sessions
  • Fixed counter naming conventions and added more granular metrics for debugging
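
As a rough illustration of the sequence number idea in the list above, the sketch below keeps the highest sequence number seen per sender and drops any update that is not newer. All names here (`CounterUpdate`, `SeqNoFilter`, the string sender id) are hypothetical and are not taken from `dq_events.proto` or `dq_info_aggregation_actor.cpp`; the PR's actual handling may differ.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical counter update message; field names are illustrative only.
struct CounterUpdate {
    std::string sender;   // stand-in for an actor id
    std::uint64_t seqNo;  // assumed to grow monotonically per sender
    std::int64_t value;
};

class SeqNoFilter {
public:
    // Stores the update and returns true only if it is newer than
    // everything already seen from the same sender.
    bool Apply(const CounterUpdate& update) {
        auto [it, inserted] = Latest.try_emplace(update.sender, State{update.seqNo, update.value});
        if (inserted) {
            return true;
        }
        if (update.seqNo <= it->second.seqNo) {
            return false;  // stale or duplicate update, ignore it
        }
        it->second = State{update.seqNo, update.value};
        return true;
    }

    // Last accepted value for a sender, if any.
    std::optional<std::int64_t> Value(const std::string& sender) const {
        auto it = Latest.find(sender);
        if (it == Latest.end()) {
            return std::nullopt;
        }
        return it->second.value;
    }

    // Drops all state for a sender, e.g. when it terminates.
    void Forget(const std::string& sender) {
        Latest.erase(sender);
    }

private:
    struct State {
        std::uint64_t seqNo;
        std::int64_t value;
    };
    std::unordered_map<std::string, State> Latest;
};
```

With this shape, an update carrying a lower sequence number that arrives late leaves the stored value unchanged, and forgetting a sender lets a restarted sender begin again from any sequence number.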

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

Show a summary per file
File | Description
yql_pq_composite_read_session.h | Added Cluster field to settings and standardized NMonitoring namespace usage
yql_pq_composite_read_session.cpp | Major refactoring of balancer actor and read session with improved partition state management, metrics, and logging
ya.make | Added signals library dependency
dq_pq_read_actor.cpp | Added cluster-specific counters and wakeup scheduling for hanging detection
dq_events.proto | Added SeqNo field for ordering counter updates
dq_info_aggregation_actor.cpp | Implemented sequence number handling and sender cleanup on actor termination
signal_utils.h/cpp | New utility classes for counter management and future signaling
kqp_federated_query_helpers.cpp | Increased max queued requests to prevent request throttling

@github-actions github-actions bot added bugfix and removed bugfix labels Feb 19, 2026
GrigoriyPA and others added 7 commits February 19, 2026 13:59
…composite_read_session.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…composite_read_session.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…composite_read_session.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…composite_read_session.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…composite_read_session.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…composite_read_session.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions

github-actions bot commented Feb 19, 2026

2026-02-19 11:05:49 UTC Pre-commit check linux-x86_64-release-asan for bd9712f has started.
2026-02-19 11:06:08 UTC Artifacts will be uploaded here
2026-02-19 11:08:30 UTC ya make is running...
🟡 2026-02-19 13:20:44 UTC Some tests failed, follow the links below. This failure is not yet in the blocking policy.

Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
18841 | 18798 | 0 | 27 | 7 | 9

🟢 2026-02-19 13:20:54 UTC Build successful.
🟡 2026-02-19 13:21:28 UTC ydbd size 3.9 GiB changed* by +381.5 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash | main: 6196fd7 | merge: bd9712f | diff | diff %
ydbd size | 4 190 805 048 Bytes | 4 191 195 752 Bytes | +381.5 KiB | +0.009%
ydbd stripped size | 1 567 995 200 Bytes | 1 568 137 376 Bytes | +138.8 KiB | +0.009%

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@github-actions

github-actions bot commented Feb 19, 2026

2026-02-19 11:05:56 UTC Pre-commit check linux-x86_64-relwithdebinfo for bd9712f has started.
2026-02-19 11:06:14 UTC Artifacts will be uploaded here
2026-02-19 11:08:40 UTC ya make is running...
🟡 2026-02-19 14:00:33 UTC Some tests failed, follow the links below. Going to retry failed tests...

Details

Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
48957 | 45442 | 0 | 4 | 3494 | 17

2026-02-19 14:00:53 UTC ya make is running... (failed tests rerun, try 2)
🟢 2026-02-19 14:04:02 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
104 (only retried tests) | 104 | 0 | 0 | 0 | 0

🟢 2026-02-19 14:04:09 UTC Build successful.
🟡 2026-02-19 14:04:33 UTC ydbd size 2.4 GiB changed* by +233.5 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash | main: 6196fd7 | merge: bd9712f | diff | diff %
ydbd size | 2 573 584 568 Bytes | 2 573 823 688 Bytes | +233.5 KiB | +0.009%
ydbd stripped size | 542 463 848 Bytes | 542 515 656 Bytes | +50.6 KiB | +0.010%

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

void DoUpdate(i64 newValue, std::optional<i64> oldValue, const TActorId& sender) final {
    if (oldValue) {
-       OrderedValues.erase({*oldValue, sender});
+       Y_VALIDATE(OrderedValues.erase({*oldValue, sender}), "Unexpected OrderedValues");
Collaborator

(Definitely not for now):
In places like this, reallocation can be avoided with set::extract -> assign to rec.value() -> set::insert(move). If this is called often, it may be worth planning to replace the set with a heap someday; the heap would have to be hand-written, though, since the standard one does not support update.

Collaborator Author

Okay, I'll leave it as is for now.
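
For reference, the set::extract -> modify -> set::insert(move) pattern mentioned in this thread could look roughly like the sketch below. It is a minimal illustration, not code from the PR: `TEntry`, `UpdateInPlace`, and the use of `int` in place of `TActorId` are assumptions made to keep the example self-contained. It relies on C++17 node handles.

```cpp
#include <cstdint>
#include <set>
#include <utility>

// (value, sender id); int is a stand-in for TActorId in this sketch.
using TEntry = std::pair<std::int64_t, int>;

// Replaces (oldValue, sender) with (newValue, sender) while reusing the
// already allocated tree node. Returns false if the old entry is absent.
bool UpdateInPlace(std::set<TEntry>& ordered, std::int64_t oldValue,
                   std::int64_t newValue, int sender) {
    auto node = ordered.extract(TEntry{oldValue, sender});
    if (node.empty()) {
        return false;  // nothing to update
    }
    node.value().first = newValue;    // mutate the key outside the set
    ordered.insert(std::move(node));  // relink the same node, no new allocation
    return true;
}
```

If an equal entry already exists, `insert` hands the node back in its return value and it is destroyed when that value goes out of scope, so the set never holds duplicates. Each set holds at most one entry per (value, sender) pair in this sketch, which makes that case unlikely in practice.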

@github-actions

github-actions bot commented Feb 19, 2026

2026-02-19 14:07:03 UTC Pre-commit check linux-x86_64-relwithdebinfo for d9c1a9d has started.
2026-02-19 14:07:21 UTC Artifacts will be uploaded here
2026-02-19 14:09:45 UTC ya make is running...
🟡 2026-02-19 16:41:48 UTC Some tests failed, follow the links below. Going to retry failed tests...

Details

Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
48747 | 45317 | 0 | 1 | 3419 | 10

2026-02-19 16:42:06 UTC ya make is running... (failed tests rerun, try 2)
🟢 2026-02-19 16:44:39 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
2 (only retried tests) | 2 | 0 | 0 | 0 | 0

🟢 2026-02-19 16:44:46 UTC Build successful.
🟡 2026-02-19 16:45:08 UTC ydbd size 2.4 GiB changed* by +236.8 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash | main: 33b756e | merge: d9c1a9d | diff | diff %
ydbd size | 2 573 584 664 Bytes | 2 573 827 176 Bytes | +236.8 KiB | +0.009%
ydbd stripped size | 542 463 336 Bytes | 542 515 272 Bytes | +50.7 KiB | +0.010%

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@github-actions

github-actions bot commented Feb 19, 2026

2026-02-19 14:07:09 UTC Pre-commit check linux-x86_64-release-asan for d9c1a9d has started.
2026-02-19 14:07:28 UTC Artifacts will be uploaded here
2026-02-19 14:09:41 UTC ya make is running...
🟡 2026-02-19 16:18:41 UTC Some tests failed, follow the links below. This failure is not yet in the blocking policy.

Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
18837 | 18787 | 0 | 27 | 11 | 12

🟢 2026-02-19 16:18:52 UTC Build successful.
🟡 2026-02-19 16:19:26 UTC ydbd size 3.9 GiB changed* by +386.2 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash | main: 33b756e | merge: d9c1a9d | diff | diff %
ydbd size | 4 190 802 520 Bytes | 4 191 197 944 Bytes | +386.2 KiB | +0.009%
ydbd stripped size | 1 567 992 320 Bytes | 1 568 134 944 Bytes | +139.3 KiB | +0.009%

*Please be aware that the difference is based on comparing your commit with the last completed post-commit build; check the comparison.

@GrigoriyPA GrigoriyPA requested a review from yumkam February 19, 2026 14:16
@GrigoriyPA GrigoriyPA merged commit e4cd50a into ydb-platform:main Feb 20, 2026
9 checks passed
@GrigoriyPA GrigoriyPA deleted the YQ-5091-fix-partitions-balancer-with-idle-and-reconnect-settings branch February 20, 2026 08:52
@ydbot
Collaborator

ydbot commented Feb 20, 2026

Backport

To backport this PR, click the button next to the target branch and then click "Run workflow" in the Run Actions UI.

Branch Run
stable-25-3, stable-25-3-1, stable-25-4, stable-25-4-1 ▶  Backport
stable-25-4, stable-25-4-1 ▶  Backport
stable-25-4 ▶  Backport

▶  Backport manual
