Three high-impact gaps in src/praisonai/praisonai: sandbox path traversal, sync/async generator divergence, rate-limiter serialization + leak

## Summary

In-depth analysis of `src/praisonai/praisonai/` (the wrapper layer) surfaced three high-impact, validated gaps. Each has been confirmed by reading the actual source — file paths and line numbers are exact, and the snippets below are copied from the tree as it currently stands.

Scope is intentionally narrow: not docs, not tests, not file sizes. These are correctness / safety / production-reliability issues sitting on hot paths (sandbox file API, multi-agent generation, bot send loop).

---

## 1. Sandbox file APIs (`SubprocessSandbox`, `DockerSandbox`) are vulnerable to path traversal

**Files:**
- `src/praisonai/praisonai/sandbox/subprocess.py` — `write_file` (L269-288), `read_file` (L290-310), `list_files` (L312-336)
- `src/praisonai/praisonai/sandbox/docker.py` — same three methods (L340-407)

**The bug**

Both backends compute the host-side full path with only `path.lstrip("/")` and join — no canonicalization, no containment check. `lstrip("/")` strips leading slashes; it does **not** neutralize `../` sequences.

```python
# src/praisonai/praisonai/sandbox/subprocess.py:290-310
async def read_file(self, path: str) -> Optional[Union[str, bytes]]:
    if not self._temp_dir:
        return None
    full_path = os.path.join(self._temp_dir, path.lstrip("/"))
    if not os.path.exists(full_path):
        return None
    try:
        with open(full_path, "r") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(full_path, "rb") as f:
            return f.read()
    except Exception:
        return None
```

```python
# src/praisonai/praisonai/sandbox/subprocess.py:269-288
async def write_file(self, path: str, content: Union[str, bytes]) -> bool:
    if not self._temp_dir:
        return False
    full_path = os.path.join(self._temp_dir, path.lstrip("/"))
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    ...
```

Reproduction (the math, not a payload):

```
os.path.realpath(os.path.join("/tmp/praisonai_sandbox_xxx",
                              "../../../etc/passwd".lstrip("/")))
# -> "/etc/passwd"
```

`DockerSandbox` is doubly bad: the documented intent is "code execution isolation", but the file APIs operate on the **host** filesystem (`tempfile.mkdtemp`), not on the container's FS — so the traversal escapes onto the host, not into the container.

The sister backend `sandbox/sandlock.py` already has the correct guard, `_safe_sandbox_path` (L167-184):

```python
# src/praisonai/praisonai/sandbox/sandlock.py:167-184
def _safe_sandbox_path(self, path: str) -> Optional[str]:
    if not self._temp_dir:
        return None
    candidate = os.path.realpath(
        os.path.join(self._temp_dir, path.lstrip("/"))
    )
    sandbox_root = os.path.realpath(self._temp_dir)
    if not candidate.startswith(sandbox_root + os.sep):
        logger.warning("Path traversal attempt blocked: %s", path)
        return None
    return candidate
```

`subprocess.py` and `docker.py` just don't call any equivalent — confirmed by grep over both files.

**Why it matters**

The whole point of `SubprocessSandbox` / `DockerSandbox` is to safely run LLM-generated code and tool calls. The LLM chooses the `path` argument when it calls `read_file` / `write_file` / `list_files`. A prompt-injected agent can read `/etc/passwd`, `~/.ssh/id_rsa`, `~/.aws/credentials`, the project's `.env`, or overwrite arbitrary files the process can write to — completely outside the "sandbox" boundary.

**Fix**

Promote `_safe_sandbox_path` to a shared helper in `sandbox/_compat.py` (or `_shell.py`) and use it in every backend.

```python
# src/praisonai/praisonai/sandbox/_compat.py
import os, logging
logger = logging.getLogger(__name__)

def safe_sandbox_path(temp_dir: str | None, path: str) -> str | None:
    if not temp_dir:
        return None
    candidate = os.path.realpath(os.path.join(temp_dir, path.lstrip("/")))
    sandbox_root = os.path.realpath(temp_dir)
    if not (candidate == sandbox_root or candidate.startswith(sandbox_root + os.sep)):
        logger.warning("Path traversal attempt blocked: %s", path)
        return None
    return candidate
```

```python
# src/praisonai/praisonai/sandbox/subprocess.py
from ._compat import safe_sandbox_path

async def read_file(self, path: str) -> Optional[Union[str, bytes]]:
    full_path = safe_sandbox_path(self._temp_dir, path)
    if full_path is None or not os.path.exists(full_path):
        return None
    ...

async def write_file(self, path: str, content: Union[str, bytes]) -> bool:
    full_path = safe_sandbox_path(self._temp_dir, path)
    if full_path is None:
        return False
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    ...

async def list_files(self, path: str = "/") -> list[str]:
    full_path = safe_sandbox_path(self._temp_dir, path)
    if full_path is None or not os.path.exists(full_path):
        return []
    ...
```

Same change in `docker.py`. Add a regression test that calls `read_file("../../../etc/passwd")` and asserts `None`.

---

## 2. `AgentsGenerator` sync and async paths run different setup logic — silent divergence

**Files:**
- `src/praisonai/praisonai/agents_generator.py` — `generate_crew_and_kickoff` (sync, L472-626)
- `src/praisonai/praisonai/agents_generator.py` — `_arun_framework` (async, L660-795), dispatched from `agenerate_crew_and_kickoff`

**The bug**

Both methods independently reimplement ~150 lines of YAML normalization, tool resolution, framework adapter selection, and observability bootstrap. They are not just duplicated — they have **silent behavioural divergence**:

| Step | Sync (`generate_crew_and_kickoff`) | Async (`_arun_framework`) |
|---|---|---|
| AutoGen v0.2 / v0.4 version selection | **missing** | L744-763 |
| AgentOps init from `AGENTOPS_API_KEY` | **missing** | L765-772 |
| `_validate_cli_backend_compatibility` | **missing** | L783-784 |
| `initial_adapter.resolve()` (variant resolution) | L599-600 | **missing** |
| `init_observability(adapter.name)` | L607-608 | **missing** |
| `adapter.setup(framework_tag=...)` | L611 | **missing** |
| Tool instantiation strategy | `resolve(name, instantiate=True)` (L561) | `resolve(name)` + manual call (L702-713) |

Grep confirms:

```
$ grep -n 'init_observability\|_validate_cli_backend_compatibility\|adapter.setup(\|agentops' \
       src/praisonai/praisonai/agents_generator.py
41:    "AGENTOPS_AVAILABLE": "agentops",
346:    def _validate_cli_backend_compatibility(self, config, framework):
607:        from .observability.hooks import init_observability
608:        init_observability(adapter.name)
611:        adapter.setup(framework_tag=adapter.name)
765:        # Initialize AgentOps if configured
769:                import agentops
770:                agentops.init(agentops_api_key, default_tags=[framework])
784:        self._validate_cli_backend_compatibility(config, framework)
```

Each of these symbols appears exactly once — i.e., each setup step exists in exactly one of the two paths.

Concrete tool-instantiation difference:

```python
# Sync, agents_generator.py:561
resolved_tool = self.tool_resolver.resolve(tool_name, instantiate=True)
if resolved_tool is not None:
    tools_dict[tool_name] = resolved_tool
```

```python
# Async, agents_generator.py:702-713
resolved_tool = self.tool_resolver.resolve(tool_name)
if resolved_tool is None:
    self.logger.warning(f"Tool '{tool_name}' not found")
    continue
tools_dict[tool_name] = (
    resolved_tool() if inspect.isclass(resolved_tool) else resolved_tool
)
```

The sync path's `resolve(..., instantiate=True)` swallows constructor exceptions inside `ToolResolver`; the async path lets them surface to its `try/except`. These are not equivalent.

**Why it matters**

1. **Observability blind spot in async mode.** Long-running production servers use the async path; they get no `init_observability()` and no `adapter.setup()` hook. The shape that's most exposed to production traffic is the one that's under-instrumented.
2. **AutoGen version pinning is async-only.** Users running `praisonai run` (sync) on a system where only AutoGen 0.4 is installed get no version switch.
3. **`--cli-backend` validation silently no-ops in sync.** Misconfiguration produces a cryptic downstream error instead of the explicit `_validate_cli_backend_compatibility` message.
4. **Future drift compounds.** Every new feature has to be added in two places; whoever forgets one creates another mode-specific bug.

**Fix**

Extract a single `_prepare()` method called by both entry points. The only thing that should differ is `adapter.run(...)` vs `await adapter.arun(...)`:

```python
# src/praisonai/praisonai/agents_generator.py
def _prepare(self) -> tuple[dict, FrameworkAdapter, dict]:
    """Shared by sync and async entry points."""
    config = self._load_config()
    if self.cli_config:
        self._merge_cli_config(config, self.cli_config)
    config = self._canonicalize_config(config)
    self._validate_agents_config(config)

    tools_dict = self._collect_tools(config)              # includes tools.py load
    framework = self._select_framework(config)            # incl. autogen v2/v4
    self._init_agentops(framework)
    self._validate_cli_backend_compatibility(config, framework)

    adapter = self._get_framework_adapter(framework).resolve()
    init_observability(adapter.name)
    adapter.setup(framework_tag=adapter.name)
    self.framework, self.framework_adapter = adapter.name, adapter
    return config, adapter, tools_dict

def generate_crew_and_kickoff(self):
    config, adapter, tools_dict = self._prepare()
    return adapter.run(
        config, self.config_list, config.get('input', ''),
        tools_dict=tools_dict,
    )

async def _arun_framework(self, _=None):
    config, adapter, tools_dict = self._prepare()
    return await adapter.arun(
        config, self.config_list, config.get('input', ''),
        tools_dict=tools_dict,
    )
```

Tool instantiation should also collapse to a single strategy inside `_collect_tools`, with `ToolResolver.resolve(..., instantiate=True)` (or its async-aware counterpart) being the only call site.

---

## 3. `RateLimiter.acquire` serializes all bot callers and leaks `_channel_last_send` forever

**File:** `src/praisonai/praisonai/bots/_rate_limit.py:85-120`

**The bug**

```python
# src/praisonai/praisonai/bots/_rate_limit.py:85-120
async def acquire(self, channel_id: Optional[str] = None) -> None:
    async with self._lock:                                  # global lock
        now = time.monotonic()
        elapsed = now - self._last_refill
        self._tokens = min(
            self._config.burst_size,
            self._tokens + elapsed * self._config.messages_per_second
        )
        self._last_refill = now

        if self._tokens < 1.0:
            wait_time = (1.0 - self._tokens) / self._config.messages_per_second
            await asyncio.sleep(wait_time)                  # sleep holding the lock
            self._tokens = 1.0

        self._tokens -= 1.0

        if channel_id:
            last_send = self._channel_last_send.get(channel_id, 0.0)
            channel_elapsed = now - last_send               # `now` is stale
            if channel_elapsed < self._config.per_channel_delay:
                wait_time = self._config.per_channel_delay - channel_elapsed
                await asyncio.sleep(wait_time)              # sleep holding the lock
            self._channel_last_send[channel_id] = time.monotonic()   # unbounded dict
```

Three concrete problems:

1. **Lock held across `await asyncio.sleep`** (lines 106 and 119). When one caller waits, *every* other caller — including ones targeting different channels with plenty of tokens — blocks on `self._lock`. Token-bucket semantics promise concurrency proportional to `messages_per_second`; the implementation delivers strict serialization regardless of channel. Telegram's `PLATFORM_LIMITS` of 25 msg/sec is effectively pinned to 1 in-flight send.
2. **`now` captured pre-sleep is reused after the global-token sleep** (line 115). Real wall time has advanced by `wait_time` seconds, but `now` is still the pre-sleep value. `channel_elapsed` is underestimated → redundant double-wait during bursts.
3. **`_channel_last_send` is never evicted.** Discord / Slack / Telegram bots routinely see thousands of unique channel IDs over a process lifetime (DMs, ephemeral threads, deleted channels). Grep confirms only `.get(...)`, `[...] = ...`, and `.clear()` in `reset()` — no scoped eviction. Steady memory leak + a slowdown of every subsequent `acquire()` as the hash table grows.

**Why it matters**

This sits on the bot send path — every outbound user message goes through `acquire()`. Under load:
- Effective throughput collapses to the slowest channel's rate.
- Memory grows linearly with unique channel count for the life of the process.
- Pre-sleep `now` causes redundant waits during burst traffic.

**Fix**

Decouple state mutation (under lock) from sleeping (lock-free), and bound the channel dict with an LRU:

```python
# src/praisonai/praisonai/bots/_rate_limit.py
from collections import OrderedDict

class RateLimiter:
    _CHANNEL_CAPACITY = 4096  # tune via config

    def __init__(self, config=None):
        self._config = config or RateLimitConfig()
        self._tokens = float(self._config.burst_size)
        self._last_refill = time.monotonic()
        self._channel_last_send: "OrderedDict[str, float]" = OrderedDict()
        self._lock = asyncio.Lock()

    async def acquire(self, channel_id: Optional[str] = None) -> None:
        # Phase 1: under lock, compute waits + reserve token + update last_send.
        async with self._lock:
            now = time.monotonic()
            self._tokens = min(
                self._config.burst_size,
                self._tokens + (now - self._last_refill) * self._config.messages_per_second,
            )
            self._last_refill = now

            global_wait = 0.0
            if self._tokens < 1.0:
                global_wait = (1.0 - self._tokens) / self._config.messages_per_second
                self._tokens = 1.0  # reserved; consumed below
            self._tokens -= 1.0

            channel_wait = 0.0
            if channel_id:
                last = self._channel_last_send.pop(channel_id, 0.0)
                projected_now = now + global_wait
                elapsed = projected_now - last
                if elapsed < self._config.per_channel_delay:
                    channel_wait = self._config.per_channel_delay - elapsed
                # LRU touch + bounded insertion
                self._channel_last_send[channel_id] = projected_now + channel_wait
                while len(self._channel_last_send) > self._CHANNEL_CAPACITY:
                    self._channel_last_send.popitem(last=False)

        # Phase 2: sleep OUTSIDE the lock so other channels proceed concurrently.
        total_wait = global_wait + channel_wait
        if total_wait > 0:
            await asyncio.sleep(total_wait)
```

This preserves the token-bucket math, removes serialization, fixes the stale-`now` double-wait, and caps memory.

---

## Honorable mentions (not in top 3, worth flagging during the fix work)

- **`tool_resolver.py:273`** — fast-path cache read bypasses the `instantiate` flag. If the same tool is resolved with mixed `instantiate=True/False` calls, the cached value can come back as either a class or an instance depending on first-call order.
- **`bots/_session.py:113-125`** — `_get_lock` / `_get_agent_lock` lazily insert `asyncio.Lock()` into a dict; `reap_stale` / `reset` evict locks via `.pop(...)` without checking whether they're currently held, which can break in-flight chats on a shared agent.

---

## Validation summary

Each finding was confirmed by directly reading the offending file in this tree:

| Finding | Verified by |
|---|---|
| Sandbox traversal | Read `subprocess.py` L269-336, `docker.py` L340-407, `sandlock.py` L167-184. Confirmed `realpath`/`startswith` guard exists only in `sandlock.py`. |
| Generator sync/async divergence | Read `agents_generator.py` L472-626 and L660-795 in full. Grep for `init_observability`, `_validate_cli_backend_compatibility`, `adapter.setup(`, `agentops` shows each symbol appearing in exactly one of the two methods. |
| Rate limiter serialization + leak | Read `_rate_limit.py` end-to-end. Confirmed both `await asyncio.sleep(...)` calls are inside `async with self._lock:`, `now` is captured once at L92 and reused at L115, and `_channel_last_send` has no eviction. |

Step	Sync (`generate_crew_and_kickoff`)	Async (`_arun_framework`)
AutoGen v0.2 / v0.4 version selection	missing	L744-763
AgentOps init from `AGENTOPS_API_KEY`	missing	L765-772
`_validate_cli_backend_compatibility`	missing	L783-784
`initial_adapter.resolve()` (variant resolution)	L599-600	missing
`init_observability(adapter.name)`	L607-608	missing
`adapter.setup(framework_tag=...)`	L611	missing
Tool instantiation strategy	`resolve(name, instantiate=True)` (L561)	`resolve(name)` + manual call (L702-713)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Three high-impact gaps in src/praisonai/praisonai: sandbox path traversal, sync/async generator divergence, rate-limiter serialization + leak #1869

Summary

1. Sandbox file APIs (`SubprocessSandbox`, `DockerSandbox`) are vulnerable to path traversal

2. `AgentsGenerator` sync and async paths run different setup logic — silent divergence

3. `RateLimiter.acquire` serializes all bot callers and leaks `_channel_last_send` forever

Honorable mentions (not in top 3, worth flagging during the fix work)

Validation summary

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Finding	Verified by
Sandbox traversal	Read `subprocess.py` L269-336, `docker.py` L340-407, `sandlock.py` L167-184. Confirmed `realpath`/`startswith` guard exists only in `sandlock.py`.
Generator sync/async divergence	Read `agents_generator.py` L472-626 and L660-795 in full. Grep for `init_observability`, `_validate_cli_backend_compatibility`, `adapter.setup(`, `agentops` shows each symbol appearing in exactly one of the two methods.
Rate limiter serialization + leak	Read `_rate_limit.py` end-to-end. Confirmed both `await asyncio.sleep(...)` calls are inside `async with self._lock:`, `now` is captured once at L92 and reused at L115, and `_channel_last_send` has no eviction.

Uh oh!

Three high-impact gaps in src/praisonai/praisonai: sandbox path traversal, sync/async generator divergence, rate-limiter serialization + leak #1869

Description

Summary

1. Sandbox file APIs (SubprocessSandbox, DockerSandbox) are vulnerable to path traversal

2. AgentsGenerator sync and async paths run different setup logic — silent divergence

3. RateLimiter.acquire serializes all bot callers and leaks _channel_last_send forever

Honorable mentions (not in top 3, worth flagging during the fix work)

Validation summary

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

1. Sandbox file APIs (`SubprocessSandbox`, `DockerSandbox`) are vulnerable to path traversal

2. `AgentsGenerator` sync and async paths run different setup logic — silent divergence

3. `RateLimiter.acquire` serializes all bot callers and leaks `_channel_last_send` forever