Replies: 5 comments 3 replies
---
Thanks for sharing your considerations on this topic. Let me add my thoughts, which are still at a very early stage; they need more reflection and discussion. First of all, we should distinguish between a poison message and a poison pill, which refer to two different concepts. The literature does not seem to agree on the terminology, and this is partly my fault, as I wasn't aware of the ambiguity when I first mentioned it. Poison messages are those that cause a consumer to fail every time they are received. This is what you're referring to in the first link. I believe this concept is mostly relevant in truly distributed systems, where senders expect acknowledgments ("ack") when a message is processed. For example, in RabbitMQ:
The main issue here is that failing messages cause requeueing, which can lead to infinite processing loops or similar problems. In contrast, SObjectizer uses a fire-and-forget messaging model, so we do not face the requeue problem. However, any effort toward issue detection could be beneficial, and your reflections on runtime monitoring and "urgent deregistration" are quite interesting. Now let me introduce the other pattern: the poison pill. Again, terminology on the web is mixed, but I interpret a poison pill as a special message that signals no more messages will be sent, allowing consumers to terminate. A similar concept is explained here. A poison pill could be seen as a special message that signals to a subscriber: please, unsubscribe. Then the subscriber can take appropriate action. This raises new questions:
Technically, we can already handle a poison pill signal. What we could do is formalize and simplify its usage. I see two main opportunities:
```cpp
// Using a standard poison pill type
agent_t(agent_context_t ctx)
  : agent_t(ctx + deactivate_on_poison_pill())

// Specifying a custom poison pill type
agent_t(agent_context_t ctx)
  : agent_t(ctx + deactivate_on<poison_pill_type>())

// Or specifying the reaction with an enum
agent_t(agent_context_t ctx)
  : agent_t(ctx + on_poison_pill(deactivate))

agent_t(agent_context_t ctx)
  : agent_t(ctx + on_poison_pill(deregister))

agent_t(agent_context_t ctx)
  : agent_t(ctx + on_poison_pill(unsubscribe))
```

And the send might just be:

```cpp
send<so_5::poison_pill>(...);
// or
send_poison_pill(...);
```

I think your "urgent deregistration" idea overlaps a bit with this. That's all for now. I won't add more to these ramblings. Please let me know your thoughts.
---
Sure, I recognize this as a common pattern. However, the key point is that RabbitMQ inherently supports an "ack" request-response mechanism, whereas in SObjectizer we would need to implement such behavior ourselves. My thought is that the concept of a poison message is more naturally rooted in systems where the acknowledgment cycle is built into the infrastructure. That said, having support for detecting and reacting to such issues in SObjectizer would still be valuable.
---
A few words about the "urgent message delivery" idea. It is not really an idea yet, just a fantasy, but I have nothing better at the moment. Let me show what I mean. Now we have:

```cpp
void
agent_t::demand_handler_on_message(
    current_thread_id_t working_thread_id,
    execution_demand_t & d )
{
    message_limit::control_block_t::decrement( d.m_limit );

    auto handler = d.m_receiver->m_handler_finder(
        d, "demand_handler_on_message" );
    if( handler )
        process_message(
            working_thread_id,
            d,
            handler->m_thread_safety,
            handler->m_method );
}
```

This logic could be changed this way:

```cpp
void
agent_t::demand_handler_on_message(
    current_thread_id_t working_thread_id,
    execution_demand_t & d )
{
    message_limit::control_block_t::decrement( d.m_limit );

    std::unique_ptr<execution_demand_t> urgent_demand{
        m_urgent_demand.exchange(nullptr)
    };
    if( urgent_demand )
    {
        // Handle urgent_demand first, and only then we'll return to the
        // current incoming demand.
        auto handler = d.m_receiver->m_handler_finder(
            *urgent_demand, "demand_handler_on_message" );
        if( handler )
            process_message(
                working_thread_id,
                *urgent_demand,
                handler->m_thread_safety,
                handler->m_method );
    }

    auto handler = d.m_receiver->m_handler_finder(
        d, "demand_handler_on_message" );
    if( handler )
        process_message(
            working_thread_id,
            d,
            handler->m_thread_safety,
            handler->m_method );
}
```

It means that an additional field (something like `m_urgent_demand`) has to be added to the `agent_t` class.

It's an open question how an "urgent message" will be sent to an agent. Maybe it will be a special function, something like:

```cpp
so_5::send_urgently<msg_type>(dest, ...);
```

Maybe it will be another form of the existing `send`:

```cpp
so_5::send<msg_type>(so_5::urgent_delivery_to(agent_ptr), ...);
```

or:

```cpp
so_5::send<so_5::urgent_message<msg_type>>(dest, ...);
```

But it seems that such an urgent message can't be delivered the usual way, and I don't know yet how a message instance passed to such a call should be stored.

It also seems that a new urgent message will replace the existing one. For example:

```cpp
so_5::send_urgently<msg_type1>(dest, ...); // (1)
so_5::send_urgently<msg_type2>(dest, ...); // (2)
```

It's possible that the first message will be handled before completion of point (2), but it's also possible that it will be replaced by the second one before being handled.

It also seems that urgent messages won't be counted by message limits.

There is also an open question with exception safety for an agent. This area has to be investigated.
---
I agree with your point that handling the poison pill should ideally be as "automatic" as deregistration. However, backward compatibility could be a concern, as older agents might be unintentionally deactivated upon receiving a poison pill. That's why I opted for a more conservative approach by introducing a context option. As I imagine it, poisoning could be triggered either via a standard `so_5::poison_pill` signal or via a custom user-specified type.

This brings us to the next question: should a poisoned agent transition to the deactivated state, or should we introduce a new state like "poisoned"? And if all agents within a cooperation are poisoned, should that trigger deregistration of the entire cooperation? Personally, I lean toward using the deactivated state followed by deregistration, as it keeps the model simpler, but I might be overlooking something. Maybe cooperation deregistration could be triggered automatically by the last poisoned agent?

Regarding internal changes: I agree this is a relatively niche feature. If the implementation cost is significant, it might be worth reconsidering.

Lastly, how does this relate to issue #96? When a poison pill is received, it's expected to take precedence over other messages in the queue. This is similar to #96, though it's not part of a shutdown process. Perhaps there are shortcuts or ideas from #96 that could be reused here?
---
I have a couple of comments:
---
There is an interesting topic in message-handling systems: "poison pill" messages. A very good introduction to this topic can be found here: Kafka Poison Pill.
An application built on SObjectizer can also be seen as a "message handling system", so it can also suffer from "poison pill" messages. It's therefore interesting to investigate this topic with respect to SObjectizer. This post contains the very first thoughts on the subject. Maybe it will start a journey toward inventing and implementing some useful tools to cope with the "poison pill" message problem.
Disclaimer: everything described below is my personal opinion so it may be totally wrong. And if you find something that contradicts your point of view then feel free to tell me about it.
First of all, I think that the "poison pill" effect is a consequence of an error in application code. An agent receives a message it can't handle because of an error in the agent's implementation (or in the code that the agent uses; for example, the agent tries to deserialize message data and encounters a bug in an external deserialization library). There could be different consequences of such errors:
SObjectizer can't protect from errors in the code of user-written agents or in libraries used by user-written agents. But we can discuss how SObjectizer could help with the consequences of errors in the user's agents.
There is nothing to do if an error leads to an application crash. I think the only good way to fight such an error is to use a multi-process configuration where agents work in different processes with separate address spaces and communicate via some form of interprocess communication (IPC) like sockets, pipes, message brokers, database tables or even files. In such a configuration a problematic agent will crash just one process but won't affect other processes. The crashed process will be restarted by some supervisor.
If a problematic agent throws exceptions then this case can be handled by exception reactions already provided by SObjectizer. Please see the corresponding section in the project Wiki for more details.
The most interesting case is when an agent stays in the SObjectizer Environment, but doesn't work properly.
SObjectizer can't control the correctness of the agent's behaviour. For example, SObjectizer doesn't know that when an agent receives message A it has to reply by message B. This is a part of application logic and SObjectizer doesn't know anything about this logic.
But maybe SObjectizer can help in the following two areas.
The first. An agent hangs and doesn't return control back to the dispatcher. Think of it as an infinite loop in the agent's event handler. The agent's work can't be interrupted, but if such a case is diagnosed then the application will have a chance to fail and restart as early as possible.
I think it could be done by using SObjectizer's run-time monitoring facilities (information from dispatchers in particular). At the moment it's possible to monitor the number of demands in dispatchers' queues.
But the count of demands in a queue is not a precise indicator, especially if an agent receives messages from time to time (once in several seconds or even in several minutes).
Maybe it is worth adding a data source like the time of the last event handler execution. Now SObjectizer provides the `work_thread_activity` message that contains only the number of event handler executions and total+average times. It's possible to analyze such information (for example, if the total time grows but the number of handled demands does not, then it's a sign that the last event handler works too long). But I think it's not as convenient as it should be.

So there is an idea to add another message to the `so_5::stats::messages` namespace. Something like `last_handler_execution_time`. This message may be sent only if the execution time exceeds some threshold (1s, for example).

An application can have a monitor agent that receives `last_handler_execution_time` and checks a number inside the message. If the number exceeds some maximum value then the application may decide that some of its agents hang and initiate an immediate restart.

There is an open question: what should be included into `last_handler_execution_time` in addition to a time duration? The name of the agent? The direct mbox of the agent? The `coop_handle` of the agent's cooperation?

The second. Deregistration of the problematic agent. It's not hard to deregister an agent if you have a `coop_handle_t` related to the agent's coop. We won't discuss how such a handle can be obtained.

The main problem, I think, is that the normal deregistration procedure could be inappropriate if we decided that an agent doesn't work properly. The normal procedure means that all pending demands will be delivered to the agent. And this may not be what we wanted.

So I wonder if there is a way to do something like "a distant call of the agent's `so_deactivate_agent`" from outside the agent. Maybe in the form of "urgent deregistration". Maybe it's possible to add some hypothetical `deregister_coop_urgently` method to the `environment_t` class. This method will do the typical deregistration-related actions with one additional step: all agents in the cooperation will be deactivated (`so_deactivate_agent` will be called for every agent somehow). Because of that additional step, agents from the cooperation being deregistered won't handle pending messages.

This "urgent deregistration" won't remove the coop from the SOEnv immediately; this operation will still be asynchronous with unpredictable delays. But if there are 100500 pending messages waiting for processing, they will be skipped because all agents from the deregistered cooperation will be deactivated first.
Disclaimer №1. I don't know yet how this "distant deactivation" can be implemented. Just believe that this is possible somehow.
Disclaimer №2. The described approach with "urgent deregistration" won't help with agents that are hung in infinite loops.