You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
After an in-depth pass over src/praisonai/praisonai/, three concrete gaps stand out that directly contradict the project philosophy (multi-agent safe + async-safe by default, production-ready, DRY). They are not stylistic β each one produces a real, observable behavioral defect today.
Each finding below is reproduced against the current main (HEAD a9f4bd5) with file paths and line numbers, plus a proposed fix.
1. arun() silently skips observability initialization and adapter setup β sync and async paths have drifted
Meanwhile _arun_framework (lines 742β787) takes its own divergent shape:
# agents_generator.py:742-787 (ASYNC path only)framework=self.frameworkorconfig.get('framework', 'crewai')
# AutoGen version selection logic β does NOT exist in the sync pathifframework=="autogen":
autogen_v4_adapter=self._get_framework_adapter("autogen_v4")
autogen_v2_adapter=self._get_framework_adapter("autogen")
...
framework="autogen_v4"ifuse_v4else"autogen"# AgentOps init β does NOT exist in the sync pathagentops_api_key=os.getenv("AGENTOPS_API_KEY")
ifagentops_api_key:
try:
importagentopsagentops.init(agentops_api_key, default_tags=[framework])
exceptImportError:
pass
And on top of that, the tool-resolution loop differs subtly:
# SYNC (line 561) β instantiates class tools eagerlyresolved_tool=self.tool_resolver.resolve(tool_name, instantiate=True)
# ASYNC (line 704) β does NOT pass instantiate=True; instantiation is open-coded belowresolved_tool=self.tool_resolver.resolve(tool_name)
...
tools_dict[tool_name] = (
resolved_tool() ifinspect.isclass(resolved_tool) elseresolved_tool
)
Format-conversion blocks (sync lines 515β535 vs async lines 663β675) are also near-identical copies that have already begun drifting (the sync copy still carries comments that the async one dropped).
Why it matters
Same AgentsGenerator(...), same YAML, two entrypoints β run() and arun() β produce different runtime behavior:
Behavior
run() (sync)
arun() (async)
adapter.resolve() for framework variant
β
β
init_observability(adapter.name)
β
β
adapter.setup(framework_tag=...)
β
β
Manual autogen v2/v4 selection
β
β
agentops.init(...) from AGENTOPS_API_KEY
β
β
instantiate=True in tool resolver call
β
β
Concrete user-visible consequences:
Langfuse/observability traces silently disappear the moment a user switches to arun() or await gen.agenerate_crew_and_kickoff(). Observability hooks are never wired.
Adapter setup() hooks never fire on the async path (any framework adapter that depends on setup() for telemetry, callback registration, or environment preparation is broken under async).
Drift is one-directional: AgentOps init is only on the async path, so a sync run never registers with AgentOps even if the env var is set.
Every future bug fix has to be applied twice and review pressure inevitably misses one side.
This is a direct violation of the βmulti-agent + async safe by defaultβ pillar β same feature, three ways (CLI, YAML, Python), but only two of the three (sync paths) are actually fully wired.
Proposed fix
Extract the shared preparation into one method, then have the two entrypoints reduce to just prepare β dispatch:
# agents_generator.pydef_prepare_run(self, config):
"""Shared preparation for sync and async runs. Returns: Tuple of (adapter, topic, tools_dict, framework_name, config) ready for adapter.run()/adapter.arun() dispatch. """# --- Canonical format conversion ---------------------------------if'agents'inconfigand'roles'notinconfig:
config['roles'] = {}
foragent_name, agent_configinconfig['agents'].items():
role_config=dict(agent_config) ifagent_configelse {}
if'instructions'inrole_configand'backstory'notinrole_config:
role_config['backstory'] =role_config['instructions']
if'role'notinrole_config:
role_config['role'] =agent_name.replace('_', ' ').title()
if'goal'notinrole_config:
role_config['goal'] =role_config.get('backstory', 'Complete the assigned task')
if'backstory'notinrole_config:
role_config['backstory'] =f'You are a {role_config["role"]}'config['roles'][agent_name] =role_configtopic=config.get('input', config.get('topic', ''))
self._validate_agents_config(config)
# --- Tool resolution (single source of truth) --------------------tools_dict=self._collect_tools(config)
# --- Framework adapter resolution + lifecycle --------------------framework=self.frameworkorconfig.get('framework', 'crewai')
initial_adapter=self._get_framework_adapter(framework)
adapter=initial_adapter.resolve()
from .framework_adapters.validatorsimportassert_framework_availableassert_framework_available(adapter.name)
# Observability + adapter setup β MUST run for both sync and asyncfrom .observability.hooksimportinit_observabilityinit_observability(adapter.name)
adapter.setup(framework_tag=adapter.name)
# AgentOps (was only in async path β should apply to both)agentops_api_key=os.getenv("AGENTOPS_API_KEY")
ifagentops_api_key:
try:
importagentopsagentops.init(agentops_api_key, default_tags=[adapter.name])
exceptImportError:
passself.framework=adapter.nameself.framework_adapter=adapterself._validate_cli_backend_compatibility(config, adapter.name)
self.logger.info(f"Using framework: {adapter.name}")
returnadapter, topic, tools_dict, configdef_load_config(self):
ifself.agent_yaml:
returnyaml.safe_load(self.agent_yaml)
ifself.agent_filein ('/app/api:app', 'api:app'):
self.agent_file='agents.yaml'withopen(self.agent_file, 'r') asf:
config=yaml.safe_load(f)
ifself.cli_config:
self._merge_cli_config(config, self.cli_config)
returnconfigdefgenerate_crew_and_kickoff(self):
config=self._load_config()
ifconfigisNone:
returnifself._is_workflow_yaml(config):
returnself._run_yaml_workflow(config)
adapter, topic, tools_dict, config=self._prepare_run(config)
returnadapter.run(
config, self.config_list, topic,
tools_dict=tools_dict,
agent_callback=getattr(self, 'agent_callback', None),
task_callback=getattr(self, 'task_callback', None),
cli_config=getattr(self, 'cli_config', None),
)
asyncdefagenerate_crew_and_kickoff(self):
config=self._load_config()
ifconfigisNone:
returnifself._is_workflow_yaml(config):
returnawaitself._arun_yaml_workflow(config)
adapter, topic, tools_dict, config=self._prepare_run(config)
returnawaitadapter.arun(
config, self.config_list, topic,
tools_dict=tools_dict,
agent_callback=getattr(self, 'agent_callback', None),
task_callback=getattr(self, 'task_callback', None),
cli_config=getattr(self, 'cli_config', None),
)
This collapses ~150 lines of duplicated logic into one shared path, makes observability and adapter setup mandatory on both sides, and removes the AgentOps/AutoGen-version drift.
2. File handle leak in KanbanDispatcher._spawn_worker
# kanban_dispatcher.py:147-166# Start process with output redirect to avoid deadlockimporttempfilewithtempfile.NamedTemporaryFile(mode='w+', delete=False, suffix='.log') astemp_log:
temp_log_path=temp_log.nameprocess=subprocess.Popen(
cmd,
env=env,
stdout=open(temp_log_path, 'w'), # <-- file object is never assigned, never closedstderr=subprocess.STDOUT,
text=True
)
# Store log path for later cleanupifnothasattr(self, '_temp_logs'):
self._temp_logs= {}
self._temp_logs[task.id] =temp_log_path# Track the running taskself.running_tasks[task.id] =process
grep confirms this is the only place in the package that constructs an inline open(...) argument to subprocess.Popen without storing the handle:
Popen does dup the file descriptor for the child, but the parent process keeps its own copy of the FD open until garbage collection β the file object is anonymous and held only by Popen for as long as the process tracks it. Even after the subprocess exits and _cleanup_completed_tasks (line 200) removes the process from running_tasks, the parent FD is only released when CPython gets around to GC-ing the Popen object.
For a gateway that polls every poll_interval=5.0s and spawns up to max_concurrent=3 worker subprocesses, this is a slow but unbounded leak. On long-running deployments β exactly the production case this dispatcher targets β the dispatcher eventually trips the per-process FD limit and crashes with OSError: [Errno 24] Too many open files. Worse, by that point in-flight tasks are already in claimed state in the SQLite store, so a restart leaves orphaned claims that other workers canβt pick up.
Compare the correct pattern already used in scheduler/daemon_manager.py:54-65 (the file is opened, passed to Popen, then the parent-side handle is dropped at the end of the with block β Popen keeps the child-side FD it duped).
Proposed fix
Either keep an explicit reference and close it on cleanup, or open inside a with block (parent drops its FD after Popen dups it):
# Option A β explicit ownership, close in _cleanup_completed_taskslog_handle=open(temp_log_path, 'w')
try:
process=subprocess.Popen(
cmd,
env=env,
stdout=log_handle,
stderr=subprocess.STDOUT,
text=True,
)
exceptBaseException:
log_handle.close()
raiseself.running_tasks[task.id] =processself._temp_logs[task.id] =temp_log_pathself._log_handles[task.id] =log_handle# close in _cleanup_completed_tasks# ...in _cleanup_completed_tasks, after the subprocess is reaped:handle=self._log_handles.pop(task_id, None)
ifhandleisnotNone:
handle.close()
# Option B β let the parent FD drop immediately, like daemon_manager.pywithopen(temp_log_path, 'w') aslog_handle:
process=subprocess.Popen(
cmd,
env=env,
stdout=log_handle,
stderr=subprocess.STDOUT,
text=True,
)
# parent FD closed here; child still has its duped copyself.running_tasks[task.id] =processself._temp_logs[task.id] =temp_log_path
Option B matches the existing convention elsewhere in the codebase and is the smaller diff.
3. Bare except: in the kanban dispatcher silently strands tasks and breaks clean shutdown
Site 3 β force-terminate in _shutdown (lines 332β338):
fortask_id, processinself.running_tasks.items():
try:
logger.warning(f"Force terminating task {task_id}")
process.terminate()
process.wait(timeout=5)
except: # <-- bare exceptprocess.kill() # always runs on KeyboardInterrupt too
Why it matters
Bare except: catches BaseException, which means KeyboardInterrupt, SystemExit, and asyncio.CancelledError are all swallowed.
Stranded claims (sites 1 & 2): When release_claim fails (DB locked, connection dropped, network blip), the failure is silently swallowed. The task remains claimed by a worker that no longer holds it. The next dispatch cycle sees the task is "in progress" and skips it. Other workers canβt pick it up either β they look for status='ready'. The task is stuck until someone manually clears the claim. Because the error is passβd with no log, operators have nothing to triage from. For a feature whose entire purpose is multi-agent coordination, thatβs a quiet deadlock.
Broken shutdown (site 3): A KeyboardInterrupt arriving mid-terminate() jumps into process.kill() instead of propagating. Combined with the leaked file handle from finding Merge pull request #1 from MervinPraison/developΒ #2, Ctrl-C against a busy dispatcher leaves zombie processes, leaked FDs, and stuck claims β none of which surface in logs.
Async cancellation is swallowed:run_forever (line 295) is driven by asyncio.create_task (line 368 in start_kanban_dispatcher). When the task is cancelled (e.g. server shutdown), cancellation enters _shutdown β loops the tasks β may hit process.wait(timeout=5). The bare except at line 337 catches CancelledError, runs process.kill(), and continues the loop. Cancellation never propagates β the parent await may hang past the shutdown timeout.
Proposed fix
Three small, targeted changes β narrow the catches, log the failures, never silently pass:
# Site 1 (lines 108-114)exceptExceptionase:
logger.error(f"Error processing task {task.id}: {e}")
try:
store.release_claim(task.id, self.worker_id)
exceptExceptionasrelease_err:
logger.error(
"Failed to release claim for task %s (worker %s); task may be stuck ""in 'claimed' state until manually released: %s",
task.id, self.worker_id, release_err,
exc_info=True,
)
# Site 2 (lines 273-279) β same shapeexceptExceptionase:
logger.error(f"Error processing completed task {task_id}: {e}")
try:
store.release_claim(task_id, self.worker_id)
exceptExceptionasrelease_err:
logger.error(
"Failed to release claim during cleanup for task %s: %s",
task_id, release_err,
exc_info=True,
)
# Site 3 (lines 332-338) β let BaseException propagate; only handle timeout/oserrorfortask_id, processinself.running_tasks.items():
logger.warning(f"Force terminating task {task_id}")
try:
process.terminate()
process.wait(timeout=5)
exceptsubprocess.TimeoutExpired:
logger.warning(f"Task {task_id} did not terminate in 5s; sending SIGKILL")
process.kill()
exceptOSErrorasos_err:
logger.error(f"OS error while terminating task {task_id}: {os_err}")
process.kill()
# KeyboardInterrupt, SystemExit, CancelledError now propagate as intended
Scope of validation
All three findings were validated by reading the actual files at HEAD a9f4bd5 on branch claude/bold-bohr-Yb720.
A separate ruamel.yaml safety claim that came up during the investigation was tested and confirmed to be a false positive (ruamel.yaml.YAML() round-trip loader is safe against !!python/... tags), so it was excluded from this issue.
Line numbers above match the current source.
Suggested order of attack
Finding Github actions fixΒ #1 is the highest-leverage fix β it removes ~150 lines of drifting duplication and restores observability/adapter lifecycle parity between run() and arun(). Touch this once; never re-fix bugs in two places again.
Finding MainΒ #3 is a low-risk surgical change with immediate impact on multi-agent coordination correctness.
Summary
After an in-depth pass over
src/praisonai/praisonai/, three concrete gaps stand out that directly contradict the project philosophy (multi-agent safe + async-safe by default, production-ready, DRY). They are not stylistic β each one produces a real, observable behavioral defect today.Each finding below is reproduced against the current
main(HEADa9f4bd5) with file paths and line numbers, plus a proposed fix.1.
arun()silently skips observability initialization and adapter setup β sync and async paths have driftedFile:
src/praisonai/praisonai/agents_generator.pySync path:
generate_crew_and_kickoffβ lines 472β626Async path:
agenerate_crew_and_kickoff+_arun_frameworkβ lines 628β795Where the issue is
The synchronous
generate_crew_and_kickoffperforms three lifecycle calls that are completely absent from the async path:grepconfirms these three calls exist exactly once in the file β only in the sync path:Meanwhile
_arun_framework(lines 742β787) takes its own divergent shape:And on top of that, the tool-resolution loop differs subtly:
Format-conversion blocks (sync lines 515β535 vs async lines 663β675) are also near-identical copies that have already begun drifting (the sync copy still carries comments that the async one dropped).
Why it matters
Same
AgentsGenerator(...), same YAML, two entrypoints βrun()andarun()β produce different runtime behavior:run()(sync)arun()(async)adapter.resolve()for framework variantinit_observability(adapter.name)adapter.setup(framework_tag=...)autogenv2/v4 selectionagentops.init(...)fromAGENTOPS_API_KEYinstantiate=Truein tool resolver callConcrete user-visible consequences:
arun()orawait gen.agenerate_crew_and_kickoff(). Observability hooks are never wired.setup()hooks never fire on the async path (any framework adapter that depends onsetup()for telemetry, callback registration, or environment preparation is broken under async).This is a direct violation of the βmulti-agent + async safe by defaultβ pillar β same feature, three ways (CLI, YAML, Python), but only two of the three (sync paths) are actually fully wired.
Proposed fix
Extract the shared preparation into one method, then have the two entrypoints reduce to just prepare β dispatch:
This collapses ~150 lines of duplicated logic into one shared path, makes observability and adapter setup mandatory on both sides, and removes the AgentOps/AutoGen-version drift.
2. File handle leak in
KanbanDispatcher._spawn_workerFile:
src/praisonai/praisonai/gateway/kanban_dispatcher.pyLine: 155
Where the issue is
grepconfirms this is the only place in the package that constructs an inlineopen(...)argument tosubprocess.Popenwithout storing the handle:Why it matters
Popendoes dup the file descriptor for the child, but the parent process keeps its own copy of the FD open until garbage collection β the file object is anonymous and held only by Popen for as long as the process tracks it. Even after the subprocess exits and_cleanup_completed_tasks(line 200) removes the process fromrunning_tasks, the parent FD is only released when CPython gets around to GC-ing the Popen object.For a gateway that polls every
poll_interval=5.0sand spawns up tomax_concurrent=3worker subprocesses, this is a slow but unbounded leak. On long-running deployments β exactly the production case this dispatcher targets β the dispatcher eventually trips the per-process FD limit and crashes withOSError: [Errno 24] Too many open files. Worse, by that point in-flight tasks are already inclaimedstate in the SQLite store, so a restart leaves orphaned claims that other workers canβt pick up.Compare the correct pattern already used in
scheduler/daemon_manager.py:54-65(the file is opened, passed toPopen, then the parent-side handle is dropped at the end of thewithblock βPopenkeeps the child-side FD it duped).Proposed fix
Either keep an explicit reference and close it on cleanup, or open inside a
withblock (parent drops its FD after Popen dups it):Option B matches the existing convention elsewhere in the codebase and is the smaller diff.
3. Bare
except:in the kanban dispatcher silently strands tasks and breaks clean shutdownFile:
src/praisonai/praisonai/gateway/kanban_dispatcher.pyLines: 108β114, 273β279, 332β338
grepconfirms this file is one of only two in the package (the other is a.bakbackup) with bareexcept::Where the issue is
Site 1 β claim-release in
dispatch_once(lines 108β114):Site 2 β claim-release in
_cleanup_completed_tasks(lines 273β279):Site 3 β force-terminate in
_shutdown(lines 332β338):Why it matters
Bare
except:catchesBaseException, which meansKeyboardInterrupt,SystemExit, andasyncio.CancelledErrorare all swallowed.Stranded claims (sites 1 & 2): When
release_claimfails (DB locked, connection dropped, network blip), the failure is silently swallowed. The task remainsclaimedby a worker that no longer holds it. The next dispatch cycle sees the task is "in progress" and skips it. Other workers canβt pick it up either β they look forstatus='ready'. The task is stuck until someone manually clears the claim. Because the error ispassβd with no log, operators have nothing to triage from. For a feature whose entire purpose is multi-agent coordination, thatβs a quiet deadlock.Broken shutdown (site 3): A
KeyboardInterruptarriving mid-terminate()jumps intoprocess.kill()instead of propagating. Combined with the leaked file handle from finding Merge pull request #1 from MervinPraison/developΒ #2, Ctrl-C against a busy dispatcher leaves zombie processes, leaked FDs, and stuck claims β none of which surface in logs.Async cancellation is swallowed:
run_forever(line 295) is driven byasyncio.create_task(line 368 instart_kanban_dispatcher). When the task is cancelled (e.g. server shutdown), cancellation enters_shutdownβ loops the tasks β may hitprocess.wait(timeout=5). The bareexceptat line 337 catchesCancelledError, runsprocess.kill(), and continues the loop. Cancellation never propagates β the parentawaitmay hang past the shutdown timeout.Proposed fix
Three small, targeted changes β narrow the catches, log the failures, never silently
pass:Scope of validation
a9f4bd5on branchclaude/bold-bohr-Yb720.ruamel.yaml.YAML()round-trip loader is safe against!!python/...tags), so it was excluded from this issue.Suggested order of attack
run()andarun(). Touch this once; never re-fix bugs in two places again.