@ankushT369
Contributor

Fixes #661
This PR adds lightweight PostgreSQL wire-protocol message validation and logging to improve debugging of dump-related issues.

Changes

  • Introduced pgagroal_log_postgresql(void*, size_t) helper.
  • Added dynamic buffer appending functionality.
  • Logs when a valid message type is encountered (e.g. 'd').

Note

This functionality is intended for debugging purposes only.

@ankushT369
Contributor Author

ankushT369 commented Dec 29, 2025

@fluca1978 is this okay?

@ankushT369
Contributor Author

@fluca1978 after running ./clang-format.sh, other files also changed. Will that be a problem?

@fluca1978
Collaborator

Uhm, your PR is fine @ankushT369, but there is an issue with clang-format.
I'm opening a new issue about this; please stand by. Once we understand what to do, you'll rebase and I'll merge.

@jesperpedersen
Collaborator

@fluca1978 Did @Userfrom1995 sign off on it? Are we closer to finding the core issue?

@Userfrom1995
Collaborator

> @fluca1978 Did @Userfrom1995 sign off on it? Are we closer to finding the core issue?

I’ll test it today and run all three pipelines with the new log tool. I’ll share the logs and any findings if I spot something useful.

@fluca1978
Collaborator

> @fluca1978 Did @Userfrom1995 sign off on it? Are we closer to finding the core issue?

An interesting observation from @ankushT369 is that, after enabling this message logging facility, some issues are reduced. I don't think we are any closer to the solution right now, but we have another tool in the bag that can surely help.

Besides, there is #667 slowing down the merge of this work.

@ankushT369
Contributor Author

What I’m seeing is that when I enable pgagroal_log_debug or pgagroal_log_trace, everything slows down just enough that the bug disappears. The null / invalid bytes are no longer being sent to Postgres. But when those logs are disabled, data arrives from pgbench extremely fast and the bug shows up again.

@fluca1978 said it looks like a critical-section issue or something similar: the logging changes the execution timing and masks the real problem.

One idea is to add a very targeted check instead of heavy logging, something like:

if (msg_type != 'd')
{
   pgagroal_log_trace("Message type: %c len: %u", msg_type, msg_len);
}

This way we only log when the message type is not d (normal data), and we can catch the corrupted or unexpected packets without slowing down the hot path too much. If the corruption is happening, it should show up here as an invalid msg_type or a weird length, giving us the exact packet that breaks the stream.
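For context on what such a check inspects: in protocol version 3, every regular message after startup begins with a one-byte type followed by a four-byte big-endian length that counts itself but not the type byte. Below is a minimal sketch of decoding that header and applying the `!= 'd'` filter, assuming the raw bytes are available in a buffer. `pg_header`, `parse_pg_header` and `should_log` are hypothetical names for illustration, not pgagroal functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, not part of pgagroal. */
struct pg_header
{
   char type;       /* message type byte, e.g. 'd' for CopyData */
   uint32_t length; /* length field: counts itself (4 bytes) plus the payload */
};

/*
 * Decode the 5-byte header of a regular (non-startup) protocol v3 message.
 * Returns false when the buffer is too short or the length is below 4,
 * which cannot occur in a well-formed stream.
 */
static bool
parse_pg_header(const unsigned char* buf, size_t size, struct pg_header* out)
{
   if (buf == NULL || out == NULL || size < 5)
   {
      return false;
   }

   out->type = (char)buf[0];

   /* The length is transmitted as a big-endian 32-bit integer. */
   out->length = ((uint32_t)buf[1] << 24) | ((uint32_t)buf[2] << 16) |
                 ((uint32_t)buf[3] << 8) | (uint32_t)buf[4];

   return out->length >= 4;
}

/* Targeted filter from the comment above: report everything except CopyData. */
static bool
should_log(const struct pg_header* h)
{
   return h->type != 'd';
}
```

With a decoder like this, a length below 4 or an unexpected type byte such as 0x00 would be a plausible signature of the misaligned packet, though where the real check belongs in pgagroal is left open here.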

@Userfrom1995
Collaborator

These are the logs from all three pipelines using io_uring, along with pgagroal_log_postgres(msg):
io_uring_log_performance.log
io_uring_log_session.log
io_uring_log_transaction.log

Please look specifically for WARN messages in these logs.

Logs generated with:

if (msg_type != 'd')
{
   pgagroal_log_trace("Message type: %c len: %u", msg_type, msg_len);
}

io_uring_log_session_non_d.log
io_uring_log_transaction_non_d.log
io_uring_log_performance_non_d.log

I didn’t notice any change after adding these log statements. The issue still behaves the same. The assumption that adding these log statements improves something doesn’t seem correct to me. I’m not able to see any improvement at all.

@ankushT369, are you sure that enabling pgagroal_log_trace or pgagroal_log_debug resolves the issue on your machine #666 (comment) and that you’re able to run the pgbench initialization command successfully with io_uring enabled?
If yes, please share:

  • Your relevant logs

  • The pgbench command you ran

  • The terminal output

One more suggestion: add the calls pgagroal_log_postgres(msg) for all three pipelines in the same way you added them for the performance pipeline, but comment them out. That way, we can selectively uncomment them during debugging. However, I don’t think we should push such calls directly to the master branch, as they are very verbose and won’t be particularly useful for end users.

@ankushT369
Contributor Author

Hey @Userfrom1995,

The thing is, @jesperpedersen asked me to add a logging function to check the message type.

This is the pgbench command I ran and the errors I am getting:

[ankush@cognitive pgbench] ./pgbench -i -h localhost -p 2345 -U ankush ankush
dropping old tables...
creating tables...
generating data (client-side)...
pgbench: error: PQputline failed

[ankush@cognitive pgbench] ./pgbench -i -h localhost -p 2345 -U ankush ankush
dropping old tables...
creating tables...
generating data (client-side)...
ERROR:  invalid byte sequence for encoding "UTF8": 0x00
CONTEXT:  COPY pgbench_accounts, line 27931
pgbench: error: PQendcopy failed

> Please look specifically for WARN messages in these logs.
>
> Logs generated with:
>
> if (msg_type != 'd')
> {
>    pgagroal_log_trace("Message type: %c len: %u", msg_type, msg_len);
> }
>
>   • io_uring_log_session_non_d.log
>   • io_uring_log_transaction_non_d.log
>   • io_uring_log_performance_non_d.log

What I mean is that you should not just remove pgagroal_log_postgres(msg).

Instead, inside the function

pgagroal_log_postgres(msg)

replace the existing line

pgagroal_log_trace("Message type: %c len: %u", msg_type, msg_len);

with

if (msg_type != 'd')
{
   pgagroal_log_trace("Message type: %c len: %u", msg_type, msg_len);
}

@jesperpedersen
Collaborator

This is a function that we can use to help pinpoint where the bug is. Once we are done, the function will remain in the code base but will never be called, due to its overhead.

So, it is a matter of finding the call sites and creating output that will help us.

@ankushT369
Contributor Author

@jesperpedersen one question: is this happening because of a race condition or something similar?

@jesperpedersen
Collaborator

@ankushT369 It is likely socket descriptor management, which is the hardest area to debug.

@Userfrom1995
Collaborator

> What I mean is that you should not just remove pgagroal_log_postgres(msg).
>
> Instead, inside the function
>
> pgagroal_log_postgres(msg)
>
> replace the existing line
>
> pgagroal_log_trace("Message type: %c len: %u", msg_type, msg_len);
>
> with
>
> if (msg_type != 'd')
> {
>    pgagroal_log_trace("Message type: %c len: %u", msg_type, msg_len);
> }

That’s exactly what I did to produce the non_d logs.

Also, regarding your earlier comment where you mentioned that enabling debug and trace solves the issue. I’m particularly interested in that. What you’ve shared so far looks like the same standard issue we are already facing.

Could you please explain what improvement you noticed after enabling the debug and trace level logging? Specifically, I’d like to understand what changes when the debug/trace is enabled and how it helps with the issue.

@ankushT369
Contributor Author

> Also, regarding your earlier comment where you mentioned that enabling debug and trace solves the issue. I’m particularly interested in that. What you’ve shared so far looks like the same standard issue we are already facing.

You can see it's not showing anymore:

[ankush@cognitive pgbench] ./pgbench -i -h localhost -p 2345 -U ankush ankush
dropping old tables...
creating tables...
generating data (client-side)...
vacuuming...
creating primary keys...
done in 0.30 s (drop tables 0.01 s, create tables 0.00 s, client-side generate 0.24 s, vacuum 0.03 s, primary keys 0.02 s).
[ankush@cognitive pgbench]

Here's the log file:
logfile.txt

I placed it in the performance pipeline:

static void
performance_client(struct io_watcher* watcher)
{
   int status = MESSAGE_STATUS_ERROR;
   struct worker_io* wi = NULL;
   struct message* msg = NULL;
   struct main_configuration* config = (struct main_configuration*)shmem;

   wi = (struct worker_io*)watcher;

   status = pgagroal_recv_message(watcher, &msg);
   pgagroal_log_postgres(msg);

   if (likely(status == MESSAGE_STATUS_OK))
   {
      if (likely(msg->kind != 'X'))
      {
         status = pgagroal_send_message(watcher, msg);

         if (unlikely(status != MESSAGE_STATUS_OK))
         {
            goto server_error;

@fluca1978
Collaborator

I'm wondering if it makes sense to move the pgagroal_log_postgres call into the place that reads the message when using io_uring, which should be https://github.com/agroal/pgagroal/blob/master/src/libpgagroal/message.c#L102. This would make its usage ubiquitous(-ish), in the hope of making the behavior uniform.

Besides, @jesperpedersen, why not make a macro that expands into pgagroal_log_postgres so that we can compile it in and out when needed?

@ankushT369
Contributor Author

> Besides, @jesperpedersen, why not make a macro that expands into pgagroal_log_postgres so that we can compile it in and out when needed?

Can you tell me more about this? For example, what should I name the macro?

@jesperpedersen
Collaborator

jesperpedersen commented Dec 30, 2025

@fluca1978 Either that or an explicit

#ifdef DEBUG
   if (pgagroal_log_is_enabled(PGAGROAL_LOGGING_LEVEL_DEBUG5))
   {
      pgagroal_log_postgres...
   }
#endif

@fluca1978
Collaborator

> @fluca1978 Either that or an explicit
>
> #ifdef DEBUG
>    if (pgagroal_log_is_enabled(PGAGROAL_LOGGING_LEVEL_DEBUG5))
>    {
>       pgagroal_log_postgres...
>    }
> #endif

I would rather add the macro and the test within the function itself, something like (pseudo-code):

pgagroal_log_postgres( .. )
{
#ifndef DEBUG
   return;
#endif

   if (!pgagroal_log_is_enabled(PGAGROAL_LOGGING_LEVEL_DEBUG5))
   {
      return;
   }

   // function code here
}

pros:

  • we can apply pgagroal_log_postgres wherever it makes sense and enable it with the macro
  • if DEBUG is not enabled the function simply returns

cons:

  • overhead for a noop function call

Otherwise, define something like (pseudo-code):

#ifdef DEBUG
#define PGAGROAL_LOG_POSTGRES(x) pgagroal_log_postgres(x)
#else
#define PGAGROAL_LOG_POSTGRES(x) 1
#endif

so that using PGAGROAL_LOG_POSTGRES in the code either results in calling the function or doing nothing.
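The first option (guarding inside the function) could look roughly like the sketch below. It is only an illustration under stated assumptions: `log_postgres_guarded`, `log_is_enabled`, `current_level`, `records_written` and the level values are hypothetical stand-ins, not the pgagroal logging API, and the sketch uses `#else` instead of code after an unconditional `return` to avoid unreachable statements in non-DEBUG builds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical log levels; the real constants live in pgagroal's logging.h. */
enum log_level
{
   LEVEL_INFO = 0,
   LEVEL_DEBUG5 = 5
};

/* Stand-ins for the configured level and for the actual log output. */
enum log_level current_level = LEVEL_INFO;
int records_written = 0;

bool
log_is_enabled(enum log_level level)
{
   return current_level >= level;
}

/* The call sites stay unconditional; the function decides whether to work. */
void
log_postgres_guarded(const void* data, size_t size)
{
#ifndef DEBUG
   /* Non-DEBUG build: the body collapses to an immediate return. */
   (void)data;
   (void)size;
   return;
#else
   /* DEBUG build: still skip the work unless the level is high enough. */
   if (!log_is_enabled(LEVEL_DEBUG5))
   {
      return;
   }

   if (data == NULL || size == 0)
   {
      return;
   }

   records_written++; /* placeholder for the real message inspection */
#endif
}
```

The trade-off matches the cons listed above: the call itself is still made in every build unless the compiler inlines it away.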

@ankushT369
Contributor Author

ankushT369 commented Jan 2, 2026

@fluca1978
I went with option 2 to avoid the no-op function-call overhead, but it produces this error:

In file included from /home/ankush/pgagroal/src/libpgagroal/pipeline_perf.c:32:
/home/ankush/pgagroal/src/libpgagroal/pipeline_perf.c: In function ‘performance_client’:
/home/ankush/pgagroal/src/include/logging.h:71:34: error: statement with no effect [-Werror=unused-value]
   71 | #define PGAGROAL_LOG_POSTGRES(x) 1
      |                                  ^
/home/ankush/pgagroal/src/libpgagroal/pipeline_perf.c:123:4: note: in expansion of macro ‘PGAGROAL_LOG_POSTGRES’
  123 |    PGAGROAL_LOG_POSTGRES(msg);
      |    ^~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [src/CMakeFiles/pgagroal.dir/build.make:289: src/CMakeFiles/pgagroal.dir/libpgagroal/pipeline_perf.c.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1155: src/CMakeFiles/pgagroal.dir/all] Error 2
make: *** [Makefile:166: all] Error 2
[ankush@cognitive build]

Can I use ((void)(x)) in place of 1? Like this:

#ifdef DEBUG
#define PGAGROAL_LOG_POSTGRES(x) pgagroal_log_postgres(x)
#else
#define PGAGROAL_LOG_POSTGRES(x) ((void)(x))
#endif

@fluca1978
Collaborator

> @fluca1978 I went with option 2 to avoid the no-op function-call overhead, but it produces this error:
>
> In file included from /home/ankush/pgagroal/src/libpgagroal/pipeline_perf.c:32:
> /home/ankush/pgagroal/src/libpgagroal/pipeline_perf.c: In function ‘performance_client’:
> /home/ankush/pgagroal/src/include/logging.h:71:34: error: statement with no effect [-Werror=unused-value]
>    71 | #define PGAGROAL_LOG_POSTGRES(x) 1
>       |                                  ^
> /home/ankush/pgagroal/src/libpgagroal/pipeline_perf.c:123:4: note: in expansion of macro ‘PGAGROAL_LOG_POSTGRES’
>   123 |    PGAGROAL_LOG_POSTGRES(msg);
>       |    ^~~~~~~~~~~~~~~~~~~~~
> cc1: all warnings being treated as errors
> make[2]: *** [src/CMakeFiles/pgagroal.dir/build.make:289: src/CMakeFiles/pgagroal.dir/libpgagroal/pipeline_perf.c.o] Error 1
> make[1]: *** [CMakeFiles/Makefile2:1155: src/CMakeFiles/pgagroal.dir/all] Error 2
> make: *** [Makefile:166: all] Error 2
> [ankush@cognitive build]
>
> Can I use ((void)(x)) in place of 1? Like this:
>
> #ifdef DEBUG
> #define PGAGROAL_LOG_POSTGRES(x) pgagroal_log_postgres(x)
> #else
> #define PGAGROAL_LOG_POSTGRES(x) ((void)(x))
> #endif

If it works, yes, please use it.
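A minimal sketch of why the replacement works: a bare `1;` is an expression statement whose value is unused, which `-Wunused-value` reports, whereas an explicit cast to `void` tells the compiler the value is discarded on purpose. The names `LOG_POSTGRES`, `log_postgres_stub`, `log_calls` and `forward` are hypothetical stand-ins so the example compiles on its own; they are not the pgagroal definitions.

```c
#include <stddef.h>

/* Stand-in for the real logger: it only counts how often it is called. */
int log_calls = 0;

void
log_postgres_stub(const void* msg)
{
   (void)msg;
   log_calls++;
}

#ifdef DEBUG
/* DEBUG build: the macro expands to a real function call. */
#define LOG_POSTGRES(x) log_postgres_stub(x)
#else
/* Otherwise the argument is evaluated and explicitly discarded. */
#define LOG_POSTGRES(x) ((void)(x))
#endif

/* Hypothetical call site, loosely modelled on a pipeline callback. */
static int
forward(const void* msg)
{
   LOG_POSTGRES(msg);
   return msg != NULL ? 0 : 1;
}
```

One caveat with this form: `((void)(x))` still evaluates `x`, so an argument with side effects behaves the same in both builds. That is harmless for a plain pointer like `msg`.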

@fluca1978
Collaborator

@ankushT369 commit 0f96b0b is fine for me, except the problem with clang-format (#667).

@jesperpedersen
Collaborator

@fluca1978 Once @Userfrom1995 has signed off on it, do a deep review, and then merge

@fluca1978
Collaborator

> @fluca1978 Once @Userfrom1995 has signed off on it, do a deep review, and then merge

OK for me; so far it seems fine, but I'll wait for any news from @Userfrom1995.
In the meantime, @ankushT369 please rebase on 6ea50e0.

For example something like:

git fetch --all
git rebase origin/master

and then force push the changes.

@fluca1978
Collaborator

@ankushT369 we were both using the wrong clang-format version.
Please install clang-format version 21, rebase on fa30ea7 and force push.

For Ubuntu, I did the following to install clang-format version 21:

          sudo apt install -y lsb-release wget software-properties-common gnupg
          wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo gpg --dearmor -o /usr/share/keyrings/llvm-snapshot.gpg
          echo "deb [signed-by=/usr/share/keyrings/llvm-snapshot.gpg] http://apt.llvm.org/$(lsb_release -cs)/ llvm-toolchain-$(lsb_release -cs)-21 main" | sudo tee /etc/apt/sources.list.d/llvm.list
          sudo apt update -y
          sudo apt remove -y clang-format 2>/dev/null 
          sudo apt install -y clang-format-21
          sudo ln -sf /usr/bin/clang-format-21 /usr/bin/clang-format

@ankushT369
Contributor Author

@fluca1978 is this OK? Do I have to squash?

@jesperpedersen
Collaborator

@ankushT369 Yes, always squash

@ankushT369
Contributor Author

@jesperpedersen this is off topic: can I work on another existing issue, or on the tests you mentioned (test cases for io_uring and epoll scenarios using one or more connections)?

@jesperpedersen
Collaborator

@ankushT369 The focus right now is to get epoll, io_uring and kqueue bug-free. So, write test cases that use the different pipelines, and use your function to list the problems found.

@jesperpedersen
Collaborator

@ankushT369 See CI

@fluca1978
Collaborator

I'm going to review this soon.

@fluca1978
Collaborator

@ankushT369 looks good to me, but:

  • there are two lines of code that I don't get; it looks like a copy and paste (see comments)
  • I wrongly added a commit to your branch while playing with gh pr (it turned out `gh pr update-branch` doesn't do what I expected), shame on me! Please reset your branch to commit b53bc0e

@jesperpedersen once commit b53bc0e has been fixed, can we merge or do you want some extra review?

@jesperpedersen
Collaborator

@fluca1978 Yes, start merging - we need clang-format 21+

@ankushT369
Contributor Author

> @ankushT369 looks good to me, but:
>
>   • there are two lines of code that I don't get; it looks like a copy and paste (see comments)
>   • I wrongly added a commit to your branch while playing with gh pr (it turned out `gh pr update-branch` doesn't do what I expected), shame on me! Please reset your branch to commit b53bc0e
>
> @jesperpedersen once commit b53bc0e has been fixed, can we merge or do you want some extra review?

Do I only have to reset, or do I also need to change any files?

@ankushT369 force-pushed the ankdev branch 2 times, most recently from c831444 to 33cbaa9 on January 8, 2026 17:17

@fluca1978
Collaborator

@ankushT369 commit 33cbaa9 is almost fine, but:

  • there are changes in 71-git.md, why?
  • there are duplicated lines in utils.c
  • there are duplicated lines in status.c

Adjust the above and I'll merge today, thanks.

@ankushT369
Contributor Author

@fluca1978

  • I will keep 71-git.md the same as it is in the current codebase
  • I will remove the duplicates

Is that it?

@fluca1978
Collaborator

> @fluca1978
>
>   • I will keep 71-git.md the same as it is in the current codebase
>   • I will remove the duplicates
>
> Is that it?

Exactly!

@fluca1978 merged commit ea0dad2 into pgagroal:master on Jan 9, 2026 (6 of 8 checks passed)

@fluca1978
Collaborator

Merged with a clean commit message.

Thanks for your contribution @ankushT369

@ankushT369
Contributor Author

> Merged with a clean commit message.
>
> Thanks for your contribution @ankushT369

I learned a lot. Thank you all! I am looking forward to contributing more.

@fluca1978
Collaborator

@ankushT369 sorry, I just found you are not listed in AUTHORS.md, could you please add yourself and open a new PR about that?


Development

Successfully merging this pull request may close these issues.

Hard to debug PostgreSQL protocol misalignment during dump

4 participants