Commit aaa7d21

d3xvn and Nash0x7E2 authored
feat: implemented heygen avatars (#126)
* implemented heygen avatars
* add lip-sync support by forwarding agent audio to heygen
* switch avatar example to use gemini realtime for better lip-sync testing
* WIP: audio track approach for lip-sync (audio flows but no lip movement)
* Clean up HeyGen implementation and fix duplicate text sending
  - Removed obsolete heygen_audio_track.py (from old audio-based approach)
  - Removed unused _audio_sender field and transceiver logic
  - Removed unused _original_audio_write field
  - Simplified audio track management
  - Moved imports to top of file
  - Updated docstrings to reflect text-based lip-sync approach

  Fixed duplicate text sending issue:
  - Added deduplication tracking with _sent_texts set
  - Added minimum length filter (>15 chars) to prevent tiny fragments
  - Simplified event handling to avoid duplicate subscriptions
  - Proper buffer management between chunk and complete events

  Known limitation: ~3-4 second audio delay is inherent to HeyGen platform
* PR cleanup
* Auto-attach processors to agent (no more manual set_agent calls)
  - Add processor._attach_agent() lifecycle hook to Agent.__init__
  - Rename HeyGen set_agent() -> _attach_agent() for consistency with LLM
  - Remove manual agent attachment from examples and docs
  - HeyGen now works like YOLO - just add to processors list

  Examples are now much cleaner:

      agent = Agent(processors=[heygen.AvatarPublisher()])
      # That's it! No manual wiring needed.
* fixed audio duplication and sluggishness
* Fix video aspect ratio stretching - add letterboxing
* fixed and simplified both implementations
* Fix ruff linting - remove unused imports
* Fix HeyGen plugin tests - import paths and mocking
* Fix mypy type errors in HeyGen plugin
* Allow reattaching to new HeyGen video tracks on renegotiation
* Migrate quality to enum
* Ruff and Mypy
* More ruff issues
* Fix broken method sigs
* Unused var
* final ruff error

---------

Co-authored-by: Neevash Ramdial (Nash) <[email protected]>
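
The duplicate-text fix described in the commit message (a `_sent_texts` set plus a ">15 chars" minimum length) could look roughly like the sketch below. Only the `_sent_texts` name and the 15-character threshold come from the message; the class `TextDeduplicator` and its `should_send` method are hypothetical, and the plugin's real code may be organized differently.

```python
class TextDeduplicator:
    """Hypothetical helper: decides whether a text fragment should be forwarded."""

    MIN_LENGTH = 15  # the commit message mentions a ">15 chars" filter

    def __init__(self) -> None:
        # Texts that have already been forwarded once.
        self._sent_texts: set[str] = set()

    def should_send(self, text: str) -> bool:
        normalized = text.strip()
        # Drop tiny fragments (15 characters or fewer).
        if len(normalized) <= self.MIN_LENGTH:
            return False
        # Drop anything that was already sent.
        if normalized in self._sent_texts:
            return False
        self._sent_texts.add(normalized)
        return True
```

A caller would check `should_send(text)` before forwarding each chunk or completed response, so a sentence that arrives through both events is only sent once.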
1 parent 60f6d83 commit aaa7d21

24 files changed: +1889 -8 lines changed

agents-core/pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -42,6 +42,7 @@ deepgram = ["vision-agents-plugins-deepgram"]
 elevenlabs = ["vision-agents-plugins-elevenlabs"]
 gemini = ["vision-agents-plugins-gemini"]
 getstream = ["vision-agents-plugins-getstream"]
+heygen = ["vision-agents-plugins-heygen"]
 kokoro = ["vision-agents-plugins-kokoro"]
 krisp = ["vision-agents-plugins-krisp"]
 moonshine = ["vision-agents-plugins-moonshine"]
@@ -57,6 +58,7 @@ all-plugins = [
     "vision-agents-plugins-elevenlabs",
     "vision-agents-plugins-gemini",
     "vision-agents-plugins-getstream",
+    "vision-agents-plugins-heygen",
     "vision-agents-plugins-kokoro",
     "vision-agents-plugins-krisp",
     "vision-agents-plugins-moonshine",

agents-core/vision_agents/core/agents/agents.py

Lines changed: 14 additions & 1 deletion
@@ -220,6 +220,11 @@ def __init__(

         self.llm._attach_agent(self)

+        # Attach processors that need agent reference
+        for processor in self.processors:
+            if hasattr(processor, '_attach_agent'):
+                processor._attach_agent(self)
+
         self.events.subscribe(self._on_vad_audio)
         self.events.subscribe(self._on_agent_say)
         # Initialize state variables
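
The hunk above attaches the agent to any processor that defines `_attach_agent`. A minimal standalone sketch of that duck-typed hook follows; the `Agent` here is a stripped-down stand-in, and `AvatarLikeProcessor` and `PlainProcessor` are hypothetical classes used only for illustration.

```python
class Agent:
    """Stripped-down stand-in for the real Agent; only the attach loop is shown."""

    def __init__(self, processors):
        self.processors = list(processors)
        for processor in self.processors:
            # Duck typing: only processors that define the hook receive the agent.
            if hasattr(processor, "_attach_agent"):
                processor._attach_agent(self)


class AvatarLikeProcessor:
    """Hypothetical processor that needs a reference to its agent."""

    def __init__(self):
        self.agent = None

    def _attach_agent(self, agent):
        self.agent = agent


class PlainProcessor:
    """Hypothetical processor without the hook; the loop skips it."""
```

Because the check is `hasattr`, existing processors need no changes, and callers no longer wire the agent in by hand.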
@@ -1176,10 +1181,13 @@ def publish_audio(self) -> bool:
         """Whether the agent should publish an outbound audio track.

         Returns:
-            True if TTS is configured or when in Realtime mode.
+            True if TTS is configured, when in Realtime mode, or if there are audio publishers.
         """
         if self.tts is not None or self.realtime_mode:
             return True
+        # Also publish audio if there are audio publishers (e.g., HeyGen avatar)
+        if self.audio_publishers:
+            return True
         return False

     @property
@property
@@ -1305,6 +1313,11 @@ def _prepare_rtc(self):
         if self.realtime_mode and isinstance(self.llm, Realtime):
             self._audio_track = self.llm.output_track
             self.logger.info("🎵 Using Realtime provider output track for audio")
+        elif self.audio_publishers:
+            # Get the first audio publisher to create the track
+            audio_publisher = self.audio_publishers[0]
+            self._audio_track = audio_publisher.publish_audio_track()
+            self.logger.info("🎵 Audio track initialized from audio publisher")
         else:
             # Default to WebRTC-friendly format unless configured differently
             framerate = 48000
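
The `_prepare_rtc` change gives the outbound audio track a three-step priority: the Realtime provider's track, then the first audio publisher, then a default track. The sketch below restates that order as a plain function. `select_audio_track`, `make_default_track`, and `FakePublisher` are hypothetical names, and the Realtime check is simplified to "a track was supplied" rather than the `isinstance` test in the diff.

```python
def select_audio_track(realtime_track, audio_publishers, make_default_track):
    """Pick the outbound audio track in the priority order shown in the diff."""
    if realtime_track is not None:
        # Realtime provider already exposes an output track.
        return realtime_track
    if audio_publishers:
        # Only the first audio publisher is consulted; later ones are ignored.
        return audio_publishers[0].publish_audio_track()
    # Fall back to a default track (48 kHz in the original code).
    return make_default_track()


class FakePublisher:
    """Hypothetical stand-in for a processor exposing publish_audio_track()."""

    def __init__(self, name):
        self.name = name

    def publish_audio_track(self):
        return f"track:{self.name}"
```

Note that with several audio publishers configured, only the first one in the list supplies the track.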

aiortc

Lines changed: 0 additions & 1 deletion
This file was deleted.

plugins/aws/example/uv.lock

Lines changed: 3 additions & 1 deletion
Some generated files are not rendered by default.

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -152,7 +152,7 @@ async def simple_audio_response(
152152
audio_bytes = pcm.resample(
153153
target_sample_rate=16000, target_channels=1
154154
).samples.tobytes()
155-
mime = f"audio/pcm;rate=16000"
155+
mime = "audio/pcm;rate=16000"
156156
blob = Blob(data=audio_bytes, mime_type=mime)
157157

158158
await self._require_session().send_realtime_input(audio=blob)

plugins/heygen/README.md

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
# HeyGen Avatar Plugin for Vision Agents

Add realistic avatar video to your AI agents using HeyGen's streaming avatar API.

## Features

- 🎭 **Realistic Avatars**: Use HeyGen's high-quality avatars with natural movements
- 🎤 **Automatic Lip-Sync**: Avatar automatically syncs with audio from any TTS provider
- 🚀 **WebRTC Streaming**: Low-latency real-time video streaming via WebRTC
- 🔌 **Easy Integration**: Works seamlessly with Vision Agents framework
- 🎨 **Customizable**: Configure avatar, quality, resolution, and more

## Installation

```bash
pip install vision-agents-plugins-heygen
```

Or with uv:

```bash
uv pip install vision-agents-plugins-heygen
```

## Quick Start

```python
import asyncio
from uuid import uuid4
from dotenv import load_dotenv

from vision_agents.core import User, Agent
from vision_agents.plugins import cartesia, deepgram, getstream, gemini, heygen
from vision_agents.plugins.heygen import VideoQuality

load_dotenv()

async def start_avatar_agent():
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="AI Assistant with Avatar", id="agent"),
        instructions="You're a friendly AI assistant.",

        llm=gemini.LLM("gemini-2.0-flash"),
        tts=cartesia.TTS(),
        stt=deepgram.STT(),

        # Add HeyGen avatar
        processors=[
            heygen.AvatarPublisher(
                avatar_id="default",
                quality=VideoQuality.HIGH
            )
        ]
    )

    call = agent.edge.client.video.call("default", str(uuid4()))

    with await agent.join(call):
        await agent.edge.open_demo(call)
        await agent.simple_response("Hello! I'm your AI assistant with an avatar.")
        await agent.finish()

if __name__ == "__main__":
    asyncio.run(start_avatar_agent())
```

## Configuration

### Environment Variables

Set your HeyGen API key in the environment, for example in a `.env` file loaded by `load_dotenv()`:

```bash
HEYGEN_API_KEY=your_heygen_api_key_here
```
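
A minimal sketch of how the key might be resolved, assuming an explicitly passed `api_key` takes precedence over the `HEYGEN_API_KEY` environment variable. `resolve_heygen_api_key` is a hypothetical helper written for illustration, not a function the plugin is known to expose.

```python
import os
from typing import Optional


def resolve_heygen_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else fall back to HEYGEN_API_KEY."""
    key = api_key or os.environ.get("HEYGEN_API_KEY")
    if not key:
        # Fail early with a clear message instead of a late connection error.
        raise ValueError("HeyGen API key missing: set HEYGEN_API_KEY or pass api_key")
    return key
```

Failing at startup like this tends to be easier to debug than a rejected connection later in the call.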

### AvatarPublisher Options

```python
from vision_agents.plugins import heygen
from vision_agents.plugins.heygen import VideoQuality

heygen.AvatarPublisher(
    avatar_id="default",        # HeyGen avatar ID
    quality=VideoQuality.HIGH,  # Video quality: VideoQuality.LOW, VideoQuality.MEDIUM, or VideoQuality.HIGH
    resolution=(1920, 1080),    # Output resolution (width, height)
    api_key=None,               # Optional: override env var
)
```
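
The commit message for this plugin mentions letterboxing to avoid aspect-ratio stretching. As a rough illustration of what that involves for a given `resolution`, the sketch below computes a centered, aspect-preserving fit. `letterbox_geometry` is a hypothetical function; the plugin's actual implementation is not shown in this README.

```python
def letterbox_geometry(src_w, src_h, dst_w, dst_h):
    """Return (scaled_w, scaled_h, x_offset, y_offset) for an aspect-preserving fit."""
    # Use the smaller scale factor so the whole source frame fits inside the target.
    scale = min(dst_w / src_w, dst_h / src_h)
    scaled_w = round(src_w * scale)
    scaled_h = round(src_h * scale)
    # Center the scaled frame; the remaining area becomes the bars.
    x_offset = (dst_w - scaled_w) // 2
    y_offset = (dst_h - scaled_h) // 2
    return scaled_w, scaled_h, x_offset, y_offset
```

For example, a 640x480 source placed into a 1920x1080 output is scaled to 1440x1080 and offset 240 pixels from the left, leaving equal bars on both sides.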

## Usage Examples

### With Realtime LLM

```python
from uuid import uuid4

from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, heygen, getstream


async def start_realtime_avatar_agent():
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Realtime Avatar AI"),
        instructions="Be conversational and responsive.",

        llm=gemini.Realtime(fps=2),  # No separate TTS needed

        processors=[
            heygen.AvatarPublisher(avatar_id="professional_presenter")
        ]
    )

    call = agent.edge.client.video.call("default", str(uuid4()))

    with await agent.join(call):
        await agent.finish()
```

### With Multiple Processors

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, heygen, ultralytics

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Fitness Coach"),
    instructions="Analyze user poses and provide feedback.",

    llm=gemini.Realtime(fps=3),

    processors=[
        # Process incoming user video
        ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt"),
        # Publish avatar video
        heygen.AvatarPublisher(avatar_id="fitness_trainer")
    ]
)
```

## How It Works

1. **Connection**: Establishes WebRTC connection to HeyGen's streaming API
2. **Audio Input**: Receives audio from your TTS provider or Realtime LLM
3. **Avatar Generation**: HeyGen generates avatar video with lip-sync
4. **Video Streaming**: Streams avatar video to call participants via GetStream Edge
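
The data flow in steps 2 to 4 can be pictured as a producer/consumer pipeline. The toy sketch below uses `asyncio.Queue` and string placeholders instead of real audio and video. Every name in it (`generate_avatar_frames`, `run_pipeline`) is hypothetical, and it does not talk to HeyGen or WebRTC at all.

```python
import asyncio


async def generate_avatar_frames(audio_in: asyncio.Queue, video_out: asyncio.Queue) -> None:
    """Stand-in for steps 2-3: consume audio chunks, emit one 'frame' per chunk."""
    while True:
        chunk = await audio_in.get()
        if chunk is None:  # sentinel: no more audio
            await video_out.put(None)
            return
        await video_out.put(f"frame({chunk})")


async def run_pipeline(chunks):
    """Feed audio chunks in, collect the frames that would be published (step 4)."""
    audio_in, video_out = asyncio.Queue(), asyncio.Queue()
    worker = asyncio.create_task(generate_avatar_frames(audio_in, video_out))
    for chunk in chunks:
        await audio_in.put(chunk)
    await audio_in.put(None)
    published = []
    while (frame := await video_out.get()) is not None:
        published.append(frame)  # a real agent would publish this to the call
    await worker
    return published
```

Running `asyncio.run(run_pipeline(["a", "b"]))` collects one placeholder frame per audio chunk, in order.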

## Requirements

- Python 3.10+
- HeyGen API key (get one at [heygen.com](https://heygen.com))
- GetStream account for video calls
- TTS provider (Cartesia, ElevenLabs, etc.) or Realtime LLM

## Troubleshooting

### Connection Issues

If you experience connection problems:

1. Check that your HeyGen API key is valid
2. Ensure you have network access to HeyGen's servers
3. Check firewall settings for WebRTC traffic

### Video Quality

To optimize video quality:

- Use `quality=VideoQuality.HIGH` for best results
- Increase resolution if bandwidth allows
- Ensure a stable internet connection

## API Reference

### AvatarPublisher

Main class for publishing HeyGen avatar video.

**Methods:**
- `publish_video_track()`: Returns the video track for streaming
- `state()`: Returns current state information
- `close()`: Cleans up resources

## License

MIT

## Links

- [Documentation](https://visionagents.ai/)
- [GitHub](https://github.com/GetStream/Vision-Agents)
- [HeyGen API Docs](https://docs.heygen.com/docs/streaming-api)