feat: WhatsApp interface V2 — media I/O, interactive tools, Team/Workflow support #6466

Open
Mustafa-Esoofally wants to merge 39 commits into main from update/whatsapp-interface-v2

Mustafa-Esoofally (Contributor) commented Feb 10, 2026

Summary

Complete overhaul of the WhatsApp interface — adds media I/O, interactive UI tools, Team/Workflow entity support, security hardening, and 16 cookbooks aligned with Slack's naming conventions.

28 files changed: 6 core modules, 17 cookbooks, 4 test suites, 1 deleted (utils/whatsapp.py → consolidated into helpers.py).

Architecture

os/interfaces/whatsapp/
  router.py     — Webhook handler + background message processor
  helpers.py    — NEW: Graph API helpers, media upload/download, message batching
  security.py   — HMAC signature validation + replay protection
  whatsapp.py   — Interface class (thin wrapper, passes config to attach_routes)

tools/whatsapp.py — 9 sync tools for agents to send interactive WhatsApp messages

Key design pattern: The router handles inbound messages and sends agent responses. WhatsAppTools lets agents send during execution (reply buttons, list messages, reactions). _WA_TOOL_NAMES in router.py prevents duplicate text when a tool already messaged the user — verified via experiment (see line comment).

Core changes

Router rewrite (router.py):

  • Unified entity dispatch: Agent, Team, and Workflow all route through entity.arun()
  • Full media response handling: images, files (PDF etc.), audio (with PCM→WAV conversion for Gemini TTS)
  • Session isolation: phone numbers are SHA-256 hashed before reaching storage — deterministic (same phone = same session) but irreversible
  • /new command starts a fresh session without losing old data
  • show_reasoning support — extended thinking sent as italicized prefix
  • Typing indicator refresh every 20s while agent runs (WhatsApp auto-dismisses after ~25s)
  • Handles 6 inbound types: text, image, video, audio, document, interactive (button_reply + list_reply)

Extracted helpers (helpers.py — NEW file):

  • WhatsAppConfig dataclass — credentials resolved once at startup, passed via closure
  • upload_and_send_media_async — unified upload+send for image/document/audio with text fallback on failure
  • prepare_audio_for_whatsapp — converts raw PCM (e.g. Gemini TTS audio/L16;rate=24000) to WAV container
  • send_whatsapp_message_async — batches messages >4096 chars into 4000-char chunks with [1/N] prefix
  • extract_media_bytes — handles both raw bytes and base64-encoded content from tool results
  • Caption truncation at 1024 chars (WhatsApp hard limit)
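
The caption rule in the last bullet can be sketched as follows. This is a minimal illustration, not the code in helpers.py: `split_caption` is a hypothetical name, and the behavior of sending the full text as a separate follow-up message is taken from a later commit message in this PR.

```python
from typing import Optional, Tuple

CAPTION_LIMIT = 1024  # WhatsApp hard limit on media captions, per the PR description


def split_caption(text: str) -> Tuple[str, Optional[str]]:
    """Return (caption, follow_up).

    follow_up is None when the text fits in a caption. Otherwise the caption
    is cut at the limit and the full text is returned so the caller can send
    it as a separate text message.
    """
    if len(text) <= CAPTION_LIMIT:
        return text, None
    return text[:CAPTION_LIMIT], text
```

A caller would attach `caption` to the media message and, when `follow_up` is not None, send it afterwards as plain text.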

WhatsAppTools rewrite (tools/whatsapp.py):

  • 9 sync tools with per-tool enable_* flags and all=True override
  • Tools: send_text, send_template, send_reply_buttons, send_list_message, send_image, send_document, send_location, send_reaction, mark_as_read
  • JSON error responses (never raise) — agents can handle failures gracefully
  • WhatsApp API limits enforced: max 3 reply buttons, max 10 list sections, max 10 rows per section
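
A rough sketch of how the "never raise" and limit-enforcement bullets might combine for reply buttons. The function name, button ID scheme, and error wording are assumptions made for illustration; the 3-button limit and the 20-character title truncation come from this PR, and the payload shape follows the WhatsApp Cloud API interactive button message.

```python
import json
from typing import List

MAX_REPLY_BUTTONS = 3    # limit stated in the PR description
BUTTON_TITLE_LIMIT = 20  # title limit mentioned in a later commit message


def build_reply_buttons(body_text: str, titles: List[str]) -> str:
    """Return a JSON string: either the message payload or an error object.

    Errors are reported as JSON instead of exceptions so the calling agent
    can read the failure and retry with corrected arguments.
    """
    if not titles:
        return json.dumps({"error": "At least one button is required"})
    if len(titles) > MAX_REPLY_BUTTONS:
        return json.dumps({"error": f"Maximum {MAX_REPLY_BUTTONS} reply buttons allowed"})

    buttons = [
        # Hypothetical ID scheme; titles are truncated to the platform limit.
        {"type": "reply", "reply": {"id": f"btn_{i}", "title": title[:BUTTON_TITLE_LIMIT]}}
        for i, title in enumerate(titles)
    ]
    payload = {
        "type": "interactive",
        "interactive": {
            "type": "button",
            "body": {"text": body_text},
            "action": {"buttons": buttons},
        },
    }
    return json.dumps(payload)
```

The real tool would also add the recipient and post the payload to the Graph API; that part is left out here because it depends on credentials and network access.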

Security hardening (security.py):

  • Removed development mode bypass (was if APP_ENV == "development": return True)
  • Added replay protection: rejects messages with timestamps >5 minutes old
  • HMAC-SHA256 signature validation using WHATSAPP_APP_SECRET

Deleted utils/whatsapp.py — all functionality consolidated into helpers.py with proper async support. Replaced requests with httpx.

Cookbooks (16 total, all E2E tested)

Naming aligned with Slack interface cookbooks for cross-platform consistency.

| Cookbook | What it demonstrates |
| --- | --- |
| basic.py | Minimal agent, text I/O |
| basic_workflow.py | 2-step workflow (Research → Write) |
| streaming_deep_research.py | 7+ toolkits, WhatsApp interactive features |
| research_assistant.py | Web search → PDF report via WhatsApp |
| agent_with_media.py | Image input → text (Gemini vision) |
| image_generation_model.py | Text → image output (Gemini) |
| image_generation_tools.py | Text → image output (OpenAI DALL-E) |
| audio_tools.py | Text → audio output (ElevenLabs TTS) |
| agent_with_user_memory.py | Memory persistence across messages |
| reasoning_agent.py | Extended thinking + show_reasoning |
| interactive_concierge.py | Reply buttons, list messages, location sharing |
| tourist_guide.py | Interactive buttons + location tools |
| support_team.py | Team entity via WhatsApp |
| multimodal_team.py | Team with vision + generation agents |
| multimodal_workflow.py | Workflow with media pipeline |
| multiple_instances.py | Multiple agents on one server, prefix routing |

Plus cookbook/91_tools/whatsapp_all_tools.py — standalone sync demo of all 9 tools.

Tests (115 passing)

| Test suite | Count | Coverage |
| --- | --- | --- |
| test_whatsapp_router.py | 30 | Webhook verification, all message types, session isolation, /new command, error handling, media responses |
| test_whatsapp_helpers.py | 37 | Media extraction, audio conversion, message batching, caption truncation, upload+send flows |
| test_whatsapp_security.py | 11 | HMAC validation, replay protection, missing signatures, no-secret bypass |
| test_whatsapp_tools.py | 37 | All 9 tools, init flags, API limits, error responses |

Type of change

  • New feature
  • Improvement

Checklist

  • Code complies with style guidelines
  • Ran format/validation scripts (./scripts/format.sh and ./scripts/validate.sh)
  • Self-review completed
  • Documentation updated (module docstrings in tools, WHY comments in code)
  • Examples and guides: 16 cookbooks + 1 tools demo included
  • Tested in clean environment (E2E via WhatsApp Web + ngrok tunnel)
  • Tests added/updated (115 unit tests across 4 test suites)

Additional Notes

  • Follows the Slack interface as reference implementation — cookbook names, reload=True, operation IDs
  • WhatsApp Cloud API v22.0
  • requests fully replaced with httpx (sync + async)
  • Cookbook names aligned with Slack: content_workflow → basic_workflow, deep_research → streaming_deep_research, research_agent → research_assistant
  • All cookbooks use reload=True instead of hardcoded port=8000 (works with new AGENT_OS_PORT env var from feat: add AGENT_OS_HOST/AGENT_OS_PORT env var fallbacks to serve() #6857)

…and tests

- Rewrite WhatsAppTools with 9 sync-only tools and enable/disable flags
- Add send_reply_buttons, send_list_message, send_image, send_document,
  send_location, send_reaction, and mark_as_read tools
- Harden security: remove dev mode bypass, add replay protection (5-min window)
- Handle interactive replies, location, reaction, contacts, sticker messages
- Replace requests with httpx in utils
- Add Pydantic response models and operation IDs to router
- Add 48 unit tests covering tools, security, and router
@Mustafa-Esoofally Mustafa-Esoofally requested a review from a team as a code owner February 10, 2026 18:54
VirusDumb (Contributor) commented

@claude what are the sync only tools added and create some code example to test an agent using those tools in prod

Himanshu040604 (Contributor) commented Feb 18, 2026

Code review

Found 1 issue:

  1. Replay protection causes false "signature invalid" errors in security.py

The HMAC-SHA256 signature validation algorithm itself is correct. The issue is the replay protection at lines 30-33 uses the message body timestamp (when the user sent the message) instead of an HTTP request timestamp. This causes:

  • Legitimate messages arriving >5 min after sending (Facebook retries, queued messages) are rejected
  • validate_webhook_signature returns False before the HMAC check ever runs
  • Router logs "Invalid webhook signature" which is misleading — the signature was never checked
  • Non-message webhooks (status updates) skip replay protection entirely since they have no messages field

    """
    if timestamp is not None:
        if abs(time.time() - timestamp) > 300:
            log_warning("Rejecting webhook: timestamp too old (possible replay attack)")
            return False

@Mustafa-Esoofally Mustafa-Esoofally changed the title from "feat: enhance WhatsApp interface with new tools, security hardening, and tests" to "feat: enhance WhatsApp interface" on Feb 19, 2026
@Mustafa-Esoofally Mustafa-Esoofally marked this pull request as draft February 19, 2026 12:41
VirusDumb (Contributor) commented

@claude what are the new tools needed to be tested for whatsapp and create an agent that uses them all

github-actions bot and others added 13 commits February 20, 2026 07:49
Co-authored-by: VirusDumb <VirusDumb@users.noreply.github.com>
Enable showing agent reasoning in the WhatsApp interface and switch to a newer Claude model. Updated cookbook agent to use "claude-sonnet-4-6" and enabled debug_mode; pass show_reasoning=True to the Whatsapp interface. In the library, default show_reasoning is now True for attach_routes and the Whatsapp class (added constructor param and stored property), and the router message formatting for reasoning was slightly adjusted. This makes the WhatsApp interface surface agent reasoning by default and aligns the example with the updated model.
Add interactive message handling and typed WhatsApp tools, plus cookbook examples and tests.

- Introduces handling for interactive messages (button_reply, list_reply) in the WhatsApp router and exposes send_user_number_to_context to include the sender number in agent/team/workflow context.
- Adds ReplyButton, ListRow and ListSection Pydantic models to WhatsAppTools and validates limits (max buttons, sections, rows). Improves HTTP error handling when posting to the WhatsApp API.
- Updates Whatsapp interface to accept send_user_number_to_context and propagate it to the router.
- Adds cookbook examples: content_workflow, support_team, tourist_guide.
- Updates whatsapp_all_tools example content (image/document URLs) and removes a few debug_mode flags from example agents.
- Adds/updates unit tests to cover interactive message processing and the typed tools API.
Add support for audio and document messages and include WhatsApp agent examples. New cookbook examples: audio_tools (ElevenLabs TTS voice replies) and research_agent (web research + PDF generation). Update image_generation_model to use gemini-3-pro-image-preview and enable debug_mode in tourist_guide. Enhance WhatsApp router to extract/upload media, handle images, files, audio (including PCM->WAV wrapping for Gemini TTS), and send document/audio messages. Add send_document_message and send_audio_message (sync + async) utilities in agno.utils.whatsapp and helper functions for media extraction and MIME handling.
@Mustafa-Esoofally Mustafa-Esoofally marked this pull request as ready for review March 3, 2026 16:22
)
from agno.workflow import RemoteWorkflow, Workflow

from .security import validate_webhook_signature

full import

- Extract 8 pure utility functions from router.py into helpers.py,
  matching the Slack interface's separation pattern
- Add send_text_message_async/send_text_message to utils/whatsapp.py
  so the router uses proper async sends instead of sync httpx.post
- Unify entity dispatch with single entity.arun() call
- Fix messages[0] bug: now iterates all messages in webhook payload
- Pass add_dependencies_to_context as per-call kwarg instead of
  mutating shared agent state
- Update test mock paths for helpers extraction
…2E testing

- Remove reload=True from all WhatsApp cookbooks (causes child process issues)
- Fix gpt-5.2 -> gpt-4o in image_generation_tools and multiple_instances
- Fix multiple_instances app reference
- Add prepare_audio_for_whatsapp helper for PCM->WAV conversion
- Fix test_send_text_message_no_recipient env isolation
- Minor router cleanup from E2E testing
@Mustafa-Esoofally Mustafa-Esoofally changed the title from "feat: enhance WhatsApp interface" to "feat: WhatsApp interface V2 — media I/O, interactive tools, Team/Workflow support" on Mar 4, 2026
WhatsApp auto-dismisses the typing indicator after ~25 seconds.
For agents that take longer to respond, the indicator would disappear
and users would see no feedback until the response arrived.

Spawns a background asyncio task that refreshes the typing indicator
every 20 seconds while entity.arun() is executing, cancelled via
try/finally when the run completes.
- deep_research: multi-tool agent with 7 toolkits, PDF output,
  reply buttons for format choice, reactions, mark-as-read
- multimodal_team: coordinated Vision Analyst + Creative Agent
  team for image analysis and generation
- interactive_concierge: showcases ALL WhatsApp interactive features
  (buttons, lists, location pins, reactions, images) in a concierge flow
- multimodal_workflow: parallel workflow with visual analysis +
  web research followed by creative synthesis with DALL-E + PDF
- Add geocode-mcp (OpenStreetMap Nominatim) for accurate venue coordinates
- Update instructions to use mcp_geocoding_get_coordinates before send_location
- Add instruction to keep text responses brief when interactive elements are sent
- Pin uvx to Python 3.12 for geocode-mcp compatibility
@agno-agi agno-agi deleted a comment from claude bot Mar 4, 2026
@agno-agi agno-agi deleted a comment from claude bot Mar 4, 2026
VirusDumb and others added 9 commits March 5, 2026 13:53
- Extract parse_whatsapp_message() to helpers with ParsedMessage dataclass
- Move extract_earliest_timestamp to security.py (co-located with validation)
- Replace _MIME_MAP dict with stdlib mimetypes.guess_type()
- Add operation_id to /status endpoint (prevents FastAPI collision)
- Accept Union[str, float] for lat/lng in send_location
- Skip signature validation when WHATSAPP_APP_SECRET is unset (dev convenience)
- Truncate button/list titles to WhatsApp limits (20/24 chars)
- Default show_reasoning to False
- Add test coverage for video/audio/document parsing, edge cases
Adds inline comments across all WhatsApp files matching the Slack
interface commenting style. Also refactors run_kwargs from inline
ternaries to incremental dict building for clarity.
…on truncation

- Move Graph API functions from utils/whatsapp.py into
  os/interfaces/whatsapp/helpers.py and delete the old module
- Introduce WhatsAppConfig dataclass to resolve credentials once at
  startup instead of reading env vars on every API call
- Hash phone numbers (SHA-256) before they reach storage so databases
  never contain plaintext PII
- Truncate media captions at 1024 chars and send full text as a
  separate message when content exceeds the limit
- Consolidate four upload functions into one upload_and_send_media_async
- Accept per-instance credentials in attach_routes for multi-bot
  deployments
- Update multiple_instances cookbook with detailed setup docs
Hash phone numbers with SHA-256 for PII protection — raw numbers never
reach storage as user_id or session_id. Add /new command that starts a
fresh session without deleting old data, using in-memory session mapping
with UUID suffixes. On server restart the map is empty so users resume
their original deterministic session.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…exception

- Move _WA_TOOL_NAMES to module-level frozenset (was recreated per message)
- Remove unused WhatsAppVerifyResponse model
- Remove redundant docstring on webhook endpoint (description= kwarg suffices)
- Fix except (UnicodeDecodeError, Exception) → Exception (superset)
- Clean up str(e) in f-strings and inline asyncio imports in tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename content_workflow.py -> basic_workflow.py
- Rename deep_research.py -> streaming_deep_research.py
- Rename research_agent.py -> research_assistant.py
- Replace port=8000 with reload=True across all cookbooks
- Remove stale __main__ docstrings
Remove json, log_debug, Mock — unused after helpers consolidation.

Mustafa-Esoofally (Contributor, Author) left a comment

Self-review: annotating non-obvious design decisions for reviewers. All 115 unit tests pass, 16 cookbooks E2E tested.

        "send_reaction",
        "mark_as_read",
    }
)

Why _WA_TOOL_NAMES exists: When WhatsAppTools (e.g. send_reply_buttons) runs during agent execution, it sends a message directly to the user via the Graph API. Without this check, the router would also send response.content as a plain text message — duplicating the output.

Verified via experiment: with tools=[WhatsAppTools(enable_send_reply_buttons=True)], agent calls send_reply_buttons during execution. The response.tools list contains the tool execution, and this check prevents the router from also calling send_text_message_async.

frozenset because it's a module-level constant used in a hot path (any(t.tool_name in ...) is O(1) per lookup).
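
A minimal sketch of the duplicate-suppression check described above. The `ToolExecution` dataclass and `should_send_text_reply` function are stand-ins invented for this example, and the set below lists only a few tool names for illustration; the real constant in router.py may contain different entries.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative subset of tool names; assumed, not copied from router.py.
WA_TOOL_NAMES = frozenset(
    {
        "send_reply_buttons",
        "send_list_message",
        "send_reaction",
        "mark_as_read",
    }
)


@dataclass
class ToolExecution:
    """Stand-in for an entry in response.tools (only the name matters here)."""

    tool_name: str


def should_send_text_reply(tools: Optional[List[ToolExecution]]) -> bool:
    """Return False when a WhatsApp tool already messaged the user during the run."""
    return not any(t.tool_name in WA_TOOL_NAMES for t in (tools or []))
```

With this shape, the router would send `response.content` as text only when `should_send_text_reply(response.tools)` is true.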


# Maps hashed_phone → session_id; absent key falls back to default deterministic ID.
# On server restart the map is empty, so users resume their original session.
_active_sessions: Dict[str, str] = {}

In-memory session map — intentional. Maps hashed_phone → session_id for /new command support. On server restart, the map is empty so users resume their default deterministic session (wa:{hash}).

No lock needed: Python's GIL protects dict reads/writes, and asyncio runs on a single thread. The Slack interface doesn't need this because Slack provides its own thread-based session management.

phone_number = message["from"]
# Hash phone number before it reaches storage — deterministic so the
# same phone always resolves to the same session, but irreversible
hashed_phone = hashlib.sha256(phone_number.encode()).hexdigest()

Privacy: phone number hashing. The raw phone number is only used for sending WhatsApp API responses (line 180, 254, etc.). For all storage-facing operations (user_id, session_id, log messages), only the SHA-256 hash is used. This means:

  • Same phone → same hash → same session (deterministic)
  • Storage never sees the raw number (irreversible)
  • Log messages show truncated hash: hashed_phone[:12]
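
The three properties listed above can be checked with a short standalone sketch. The `hash_phone` and `log_label` helper names and the sample numbers are made up for the example; the hashing call itself matches the line quoted from the router.

```python
import hashlib


def hash_phone(phone_number: str) -> str:
    """SHA-256 hex digest of the phone number, used in place of the raw value."""
    return hashlib.sha256(phone_number.encode()).hexdigest()


def log_label(hashed_phone: str) -> str:
    """Short prefix of the hash that is safe to write to logs."""
    return hashed_phone[:12]


if __name__ == "__main__":
    first = hash_phone("15551230001")   # made-up number
    second = hash_phone("15551230002")  # made-up number
    print(first == hash_phone("15551230001"))  # same phone, same hash
    print(first != second)                     # different phones differ
    print(log_label(first))
```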


log_info(f"Processing message from {hashed_phone[:12]}: {parsed.text}")

session_id = _active_sessions.get(hashed_phone, f"wa:{hashed_phone}")

Session ID fallback design: When _active_sessions has no entry (fresh server or user never ran /new), the session ID is deterministic: wa:{hashed_phone}. This means:

  1. Server restarts don't orphan sessions — the user gets the same session back
  2. /new creates wa:{hash}:{random_8hex} — a new session without deleting old data
  3. Multiple /new calls just overwrite the map entry

run_kwargs["add_dependencies_to_context"] = True

# Refresh typing indicator every 20s while the agent runs
# WhatsApp auto-dismisses the indicator after ~25s

Why 20s interval: WhatsApp's typing indicator auto-dismisses after ~25 seconds. Agent execution (especially with tools like web search, file generation) can take 30-60s+. This task refreshes the indicator before it expires, giving users visual feedback that the bot is still working. The CancelledError catch ensures clean shutdown when the agent finishes.
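
A sketch of the refresh pattern under stated assumptions: `run_with_typing_refresh`, `run`, and `send_typing` are hypothetical names, with the two callables standing in for `entity.arun()` and the Graph API typing call. Only the 20-second interval, the background task, and the try/finally cancellation come from this PR.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_typing_refresh(
    run: Callable[[], Awaitable[T]],
    send_typing: Callable[[], Awaitable[None]],
    interval: float = 20.0,
) -> T:
    """Await run() while re-sending the typing indicator every `interval` seconds."""

    async def _refresh() -> None:
        try:
            while True:
                await send_typing()
                await asyncio.sleep(interval)
        except asyncio.CancelledError:
            pass  # normal shutdown once the run has finished

    task = asyncio.create_task(_refresh())
    try:
        return await run()
    finally:
        task.cancel()
        # Wait for the refresher to finish so no task is left pending.
        await asyncio.gather(task, return_exceptions=True)
```

Because cancellation happens in `finally`, the indicator loop stops whether the run returns normally or raises.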

            return content
    elif isinstance(content, str):
        return base64.b64decode(content)
    return None

Why two code paths for bytes content:

  1. Some tools return media as raw binary bytes (e.g. httpx.get().content) → goes to except branch, returned as-is
  2. Other tools return base64-encoded data as bytes (e.g. b'iVBORw0KGgo...') → decode('utf-8') succeeds, then b64decode produces the actual binary
  3. String content (third branch) is straightforward base64 decode

The try/except on decode('utf-8') distinguishes case 1 from case 2 — if the bytes aren't valid UTF-8, they're already raw binary.
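
The three branches can be reconstructed as a standalone sketch. This is an approximation pieced together from the quoted fragment and the explanation above, not the exact function from helpers.py; the name `extract_media_bytes_sketch` is chosen to make that clear.

```python
import base64
from typing import Optional, Union


def extract_media_bytes_sketch(content: Union[bytes, str, None]) -> Optional[bytes]:
    """Normalize tool output to raw binary bytes."""
    if isinstance(content, bytes):
        try:
            # Case 2: the bytes are base64 text, so decoding yields the real binary.
            return base64.b64decode(content.decode("utf-8"))
        except Exception:
            # Case 1: not valid UTF-8 (or not base64), so treat as raw binary.
            return content
    elif isinstance(content, str):
        # Case 3: base64 string.
        return base64.b64decode(content)
    return None
```

One limitation of this heuristic: raw binary that happens to be valid UTF-8 and valid base64 would be decoded rather than passed through, so it relies on real media formats starting with non-text bytes.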

fmt = getattr(audio_obj, "format", None) or mime_type.split("/")[-1]
return audio_bytes, mime_type, f"audio.{fmt}"

# Raw PCM from Gemini (e.g. "audio/L16;rate=24000") — wrap as WAV

Gemini TTS PCM→WAV conversion: Google's Gemini TTS returns raw PCM audio with mime_type audio/L16;rate=24000 — just raw 16-bit samples with no container. WhatsApp rejects this because it requires a proper audio container format (mp3, ogg, wav, aac, amr). This wraps the PCM data in a WAV header using Python's wave module.

The sample rate is parsed from the mime_type string (rate=24000) with a fallback to 24000 (Gemini's default). Channels default to 1 (mono).
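
A minimal sketch of the wrapping step using the standard-library `wave` module. The function name `wrap_pcm_as_wav` and the regex-based rate parsing are assumptions for illustration; the 24000 Hz fallback, mono default, and 16-bit sample width follow the description above.

```python
import io
import re
import wave

DEFAULT_RATE = 24000  # fallback sample rate described in the comment above


def wrap_pcm_as_wav(pcm: bytes, mime_type: str, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container."""
    match = re.search(r"rate=(\d+)", mime_type)
    rate = int(match.group(1)) if match else DEFAULT_RATE

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # L16 means 16-bit samples, i.e. 2 bytes each
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

The result starts with a RIFF/WAVE header, so it can be uploaded with an `audio/wav` MIME type instead of the raw `audio/L16` type.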

        return "\n".join([f"_{line}_" for line in text.split("\n")])
    return text

# WhatsApp limit is 4096 chars; split at 4000 to leave room for batch prefix

Why 4000, not 4096: WhatsApp's hard limit is 4096 characters per message. We split at 4000 to leave 96 chars of headroom for the [1/N] batch prefix (e.g. [1/12] plus a trailing space = 7 chars). This avoids edge cases where the prefix + content would exceed the limit.
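
The arithmetic can be checked with a small sketch. `batch_message` is a hypothetical name, and the exact prefix format (including the trailing space) is an assumption; the 4096 limit and 4000-char chunk size are from this PR.

```python
from typing import List

MAX_MESSAGE_LEN = 4096  # WhatsApp hard limit per text message
CHUNK_SIZE = 4000       # leaves headroom for the "[i/N] " prefix


def batch_message(text: str) -> List[str]:
    """Split text into prefixed chunks only when it exceeds the hard limit."""
    if len(text) <= MAX_MESSAGE_LEN:
        return [text]
    chunks = [text[i : i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    total = len(chunks)
    return [f"[{index}/{total}] {chunk}" for index, chunk in enumerate(chunks, start=1)]
```

With 96 characters of headroom, the prefix would only overflow for batch counts with dozens of digits, which is not a practical concern.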

    if not app_secret:
        # Explicit opt-out: skip validation when secret is not configured
        log_warning("WHATSAPP_APP_SECRET not set — signature validation disabled (DO NOT use in production)")
        return True

Intentional opt-out for development. When WHATSAPP_APP_SECRET is not set, signature validation is skipped entirely. This lets developers test with ngrok/tunnels without configuring the app secret. The log_warning ensures this doesn't silently pass in production.

Previously this module had if APP_ENV == 'development': return True which was removed — checking for the secret's absence is more explicit and doesn't rely on environment naming conventions.
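
For context, a minimal sketch of HMAC-SHA256 validation with the opt-out branch. The function name and signature are assumptions, not the code in security.py. The `sha256=` prefix follows the format Meta uses for the `X-Hub-Signature-256` webhook header.

```python
import hashlib
import hmac
from typing import Optional


def validate_signature_sketch(
    payload: bytes, signature_header: Optional[str], app_secret: Optional[str]
) -> bool:
    """Check the webhook body against its HMAC-SHA256 signature."""
    if not app_secret:
        # Opt-out branch: no secret configured, so validation is skipped.
        return True
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = hmac.new(app_secret.encode(), payload, hashlib.sha256).hexdigest()
    received = signature_header[len("sha256="):]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, received)
```

The payload must be the raw request body bytes; re-serializing parsed JSON can change whitespace or key order and invalidate the signature.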


    # 5-minute window guards against replay attacks
    if timestamp is not None:
        if abs(time.time() - timestamp) > 300:

Replay protection: 5-minute window. Rejects any webhook with a message timestamp more than 5 minutes from the current server time. This guards against replay attacks where an attacker captures and re-sends a valid signed webhook payload. The abs() also catches future timestamps (clock skew). Timestamp is extracted by extract_earliest_timestamp() which walks the nested webhook payload to find the oldest message.
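
A sketch of the timestamp walk and the window check, kept separate from signature validation. Both function names are hypothetical, and the traversal assumes the usual WhatsApp Cloud API webhook layout (entry → changes → value → messages, with timestamps as Unix-second strings).

```python
import time
from typing import Any, Dict, Optional

REPLAY_WINDOW_SECONDS = 300  # 5 minutes


def earliest_message_timestamp(payload: Dict[str, Any]) -> Optional[int]:
    """Return the oldest message timestamp in the payload, or None if there is none."""
    stamps = []
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            for message in change.get("value", {}).get("messages", []):
                raw = message.get("timestamp")
                if raw is not None:
                    stamps.append(int(raw))
    return min(stamps) if stamps else None


def is_within_replay_window(timestamp: Optional[int], now: Optional[float] = None) -> bool:
    """True when the timestamp is within 5 minutes of `now` in either direction."""
    if timestamp is None:
        return True  # nothing to check, e.g. status-update webhooks
    current = time.time() if now is None else now
    return abs(current - timestamp) <= REPLAY_WINDOW_SECONDS
```

Note that the `None` branch reproduces the behavior raised in the code review earlier in this thread: payloads without a messages field, such as status updates, are not covered by the window check.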

Mustafa-Esoofally and others added 4 commits March 5, 2026 14:36
helpers.py (-77 lines):
- Delete dead sync functions (get_media, upload_media)
- Replace extract_media_bytes with get_content_bytes()
- Privatize low-level senders (_send_text, _send_media)
- Remove banner comments, stale docstrings, redundant f-strings
- Use typed Audio/File field access instead of getattr
- Widen get_media_async/upload_media_async to catch HTTPError

router.py:
- Validate get_media_async responses before constructing media objects
- Add send_text_message/send_template_message to _WA_TOOL_NAMES

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete 5 cookbooks that overlap with existing examples or
demonstrate features better covered by tourist_guide and support_team.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… utility, delete dead code

- Rename WhatsAppConfig.from_env → init, ParsedMessage → MessageContent,
  parse_whatsapp_message → extract_message_content
- Make internal senders private: _send_text, _send_media
- Extract pcm_to_wav_bytes to agno/utils/audio.py for reuse across interfaces
- Delete prepare_audio_for_whatsapp — inline audio logic in upload_and_send_media_async
- Delete sync get_media/upload_media (router is async-only)
- Use typed media objects (Image, Audio, File) instead of getattr
- Update tests: proper media objects, updated patch targets, remove dead tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Audio import

- Router tests patched old send_text_message_async → now patch _send_text
- Remove unused Audio import from helpers.py (ruff F401)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>