PaksaTalker is an enterprise-grade AI platform for generating hyper-realistic talking-head videos with frame-accurate lip-sync, natural facial expressions, and lifelike body gestures. It combines multiple state-of-the-art models, including Qwen for language processing, SadTalker for facial animation, and PantoMatrix/EMAGE for full-body gesture generation, into a production-ready video synthesis pipeline.
- Precise Lip-Sync: Frame-accurate audio-visual synchronization using SadTalker
- Expressive Faces: Emotionally aware facial animations with micro-expressions
- Full-Body Gestures: Context-appropriate head movements, body language, and hand gestures
- High Fidelity: Up to 4K resolution with minimal artifacts
- Multi-Model Architecture: Integrates the Qwen LLM, SadTalker, Wav2Lip, and PantoMatrix/EMAGE
- GPU-Accelerated: Optimized for NVIDIA GPUs with CUDA support
- Modular Pipeline: Independent, swappable components for face, body, and voice
- Batch Processing: Process multiple generation jobs in one run
- Real-Time Preview: Preview output before final rendering
- RESTful API: Easy integration with existing systems
- Custom Avatars: Support for 3D models and 2D images
- Custom Voices: Support for custom voice models
- Plugin System: Extend with custom models and effects
- Multi-Language: Support for multiple languages and accents
- Customization: Fine-tune animation styles and rendering parameters
- Python 3.9+ with pip
- Node.js 16+ and npm 8+ (for the web interface)
- CUDA 11.8+ (for GPU acceleration)
- ffmpeg 4.4+ for video processing
- NVIDIA GPU with 16GB+ VRAM recommended
- Docker (optional, for containerized deployment)
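To sanity-check the GPU and ffmpeg requirements before installing, here is a minimal check script (a sketch that assumes PyTorch is already installed in your environment; adjust as needed):

```python
# check_env.py -- quick prerequisite check (assumes PyTorch is installed)
import shutil
import subprocess
import sys

import torch

print(f"Python: {sys.version.split()[0]}")              # expect 3.9+
print(f"CUDA available: {torch.cuda.is_available()}")   # expect True for GPU runs
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")  # 16 GB+ recommended

ffmpeg = shutil.which("ffmpeg")
if ffmpeg:
    # First line of `ffmpeg -version` reports the version (expect 4.4+)
    out = subprocess.run([ffmpeg, "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])
else:
    print("ffmpeg not found on PATH")
```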
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/paksatalker.git
  cd paksatalker
  ```

- Set up the Python environment:

  ```bash
  # Create and activate a virtual environment
  python -m venv venv
  # On Windows:
  .\venv\Scripts\activate
  # On macOS/Linux:
  # source venv/bin/activate

  # Install Python dependencies
  pip install -r requirements.txt
  ```

- Install the AI models:

  ```bash
  # Download SadTalker models
  python -c "from models.sadtalker import download_models; download_models()"

  # Download PantoMatrix/EMAGE models
  python -c "from models.gesture import download_models; download_models()"

  # Download Qwen model weights (optional, can use API)
  # python -c "from models.qwen import download_models; download_models()"
  ```

- Set up the frontend (for the web interface):

  ```bash
  cd frontend
  npm install
  npm run build
  cd ..
  ```
```bash
# Basic usage
python -m PaksaTalker.cli \
  --image input/face.jpg \
  --audio input/speech.wav \
  --output output/result.mp4 \
  --enhance_face True \
  --expression_intensity 0.8

# Advanced options
python -m PaksaTalker.cli \
  --image input/face.jpg \
  --audio input/speech.wav \
  --output output/result.mp4 \
  --resolution 1080 \
  --fps 30 \
  --background blur \
  --gesture_level medium
```
### Generate a Talking Avatar from Text
```bash
# Generate speech and animate avatar from text
python -m cli.generate \
--text "Hello, I'm your AI assistant. Welcome to PaksaTalker!" \
--image assets/avatars/default.jpg \
--voice en-US-JennyNeural \
--output output/welcome.mp4 \
--gesture-style natural \
  --resolution 1080

# Animate avatar with existing audio
python -m cli.animate \
  --image assets/avatars/presenter.jpg \
  --audio input/presentation.wav \
  --output output/presentation.mp4 \
  --expression excited \
  --background blur \
  --lighting studio

# Full pipeline with custom settings
python -m cli.pipeline \
  --prompt "Explain quantum computing in simple terms" \
  --avatar assets/avatars/scientist.jpg \
  --voice en-US-ChristopherNeural \
  --style professional \
  --gesture-level high \
  --output output/quantum_explainer.mp4 \
  --resolution 4k \
  --fps 30 \
  --enhance-face \
  --background office
```

The same generation pipeline is also available from Python:

```python
from paksatalker import Pipeline

# Initialize the pipeline
pipeline = Pipeline(
model_dir="models",
device="cuda" # or "cpu" if no GPU
)
# Generate a talking avatar video
result = pipeline.generate(
text="Welcome to PaksaTalker, the future of digital avatars.",
image_path="assets/avatars/host.jpg",
voice="en-US-JennyNeural",
output_path="output/welcome.mp4",
gesture_style="casual",
resolution=1080
)
print(f"Video generated at: {result['output_path']}")from paksatalker import (
TextToSpeech,
FaceAnimator,
GestureGenerator,
VideoRenderer
)
# Initialize components
tts = TextToSpeech(voice="en-US-ChristopherNeural")
animator = FaceAnimator(model_path="models/sadtalker")
gesture = GestureGenerator(model_path="models/pantomatrix")
renderer = VideoRenderer(resolution=1080, fps=30)
# Process pipeline
text = "Let me show you how this works..."
audio = tts.generate(text)
face_animation = animator.animate("assets/avatars/assistant.jpg", audio)
body_animation = gesture.generate(audio, style="presentation")
# Render final video
video = renderer.combine(
face_animation=face_animation,
body_animation=body_animation,
audio=audio,
output_path="output/demo.mp4"
)
```
The top-level PaksaTalker class accepts a full configuration dictionary:

```python
from pathlib import Path

from PaksaTalker import PaksaTalker

pt = PaksaTalker(
    device="cuda",   # or "cpu"
    model_dir="models/",
    temp_dir="temp/"
)

result = pt.generate(
    image_path="input/face.jpg",
    audio_path="input/speech.wav",
    output_path="output/result.mp4",
    config={
        "resolution": 1080,
        "fps": 30,
        "expression_scale": 0.9,
        "head_pose": "natural",
        "background": {
            "type": "blur",
            "blur_strength": 0.7
        },
        "post_processing": {
            "denoise": True,
            "color_correction": True,
            "stabilization": True
        }
    }
)
```
## Architecture

```
PaksaTalker/
├── api/                    # REST API endpoints
│   ├── routes/             # API route definitions
│   ├── schemas/            # Pydantic models
│   └── utils/              # API utilities
├── config/                 # Configuration management
│   ├── __init__.py
│   └── config.py
├── core/                   # Core functionality
│   ├── engine.py           # Main processing pipeline
│   ├── video.py            # Video processing
│   └── audio.py            # Audio processing
├── integrations/           # Model integrations
│   ├── sadtalker/          # SadTalker implementation
│   ├── wav2lip/            # Wav2Lip integration
│   ├── qwen/               # Qwen language model
│   └── gesture/            # Gesture generation
├── models/                 # Model architectures
│   ├── base.py             # Base model interface
│   └── registry.py         # Model registry
├── static/                 # Static files
│   ├── css/
│   ├── js/
│   └── templates/
├── tests/                  # Test suite
│   ├── unit/
│   └── integration/
├── utils/                  # Utility functions
│   ├── audio_utils.py
│   ├── video_utils.py
│   └── face_utils.py
├── app.py                  # Main application
├── cli.py                  # Command-line interface
└── requirements.txt        # Dependencies
```
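The models/ package above follows a base-interface-plus-registry layout. As an illustrative sketch only (the real contents of base.py and registry.py are not shown in this README, so every name below is hypothetical), such a registry typically looks like this:

```python
# Hypothetical sketch of a model registry; names do not necessarily match
# the actual models/base.py and models/registry.py in the repository.
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Minimal interface every pluggable model implements."""

    @abstractmethod
    def load(self, checkpoint_dir: str) -> None: ...

    @abstractmethod
    def infer(self, **inputs): ...


_REGISTRY: dict[str, type] = {}


def register(name: str):
    """Class decorator that makes a model discoverable by name."""
    def wrapper(cls):
        _REGISTRY[name] = cls
        return cls
    return wrapper


def create(name: str, **kwargs) -> BaseModel:
    """Instantiate a registered model, e.g. create("sadtalker")."""
    return _REGISTRY[name](**kwargs)
```

A design like this is what makes the components swappable: a new face or gesture model only needs to implement the base interface and register itself under a name.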
- Start the development servers:

  ```bash
  # In the project root directory
  python run_dev.py
  ```

This will start:

- Frontend at http://localhost:5173
- Backend API at http://localhost:8000
- API Docs at http://localhost:8000/api/docs
- Build the frontend:

  ```bash
  cd frontend
  npm run build
  cd ..
  ```

- Start the production server:

  ```bash
  uvicorn app:app --host 0.0.0.0 --port 8000
  ```
The application will be available at http://localhost:8000
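Once the server is up, a quick smoke test against the documented Swagger UI route confirms it is serving (a sketch assuming the requests package is installed):

```python
# Smoke test: confirm the server answers on the documented Swagger UI route.
import requests

resp = requests.get("http://localhost:8000/api/docs", timeout=5)
print(resp.status_code)  # expect 200 when the server is running
```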
You can also generate a video directly from the command line:

```bash
python app.py --input "Hello world" --output output/video.mp4
```

PaksaTalker is highly configurable. Here's an example configuration:
```yaml
# config/config.yaml
models:
  sadtalker:
    checkpoint: "models/sadtalker/checkpoints"
    config: "models/sadtalker/configs"
  wav2lip:
    checkpoint: "models/wav2lip/checkpoints"
  qwen:
    model_name: "Qwen/Qwen-7B-Chat"

processing:
  resolution: 1080
  fps: 30
  batch_size: 4
  device: "cuda"

api:
  host: "0.0.0.0"
  port: 8000
  workers: 4
  debug: false
```
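A configuration file like this can be read with a few lines of Python (a minimal sketch assuming PyYAML; the loader helper below is hypothetical, not part of the project API):

```python
# Hypothetical helper for reading config/config.yaml (requires PyYAML)
import yaml


def load_config(path: str = "config/config.yaml") -> dict:
    """Read the YAML configuration into a plain dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


config = load_config()
print(config["processing"]["device"])      # e.g. "cuda"
print(config["processing"]["resolution"])  # e.g. 1080
```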
Run the test suite:

```bash
# Install test dependencies
pip install -r requirements-test.txt

# Run tests
pytest tests/
```

We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
```
paksatalker/
├── frontend/            # React + TypeScript frontend
│   ├── src/             # Source files
│   ├── public/          # Static files
│   └── package.json     # Frontend dependencies
├── api/                 # API endpoints
├── config/              # Configuration files
├── models/              # AI models
├── static/              # Static files (served by FastAPI)
├── app.py               # Main application entry point
├── requirements.txt     # Python dependencies
└── README.md            # This file
```
Create a .env file in the project root with the following variables:

```
# Backend
DEBUG=True
PORT=8000
# Database
DATABASE_URL=sqlite:///./paksatalker.db
# JWT
SECRET_KEY=your-secret-key
ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=30
```

For development, you can also create a .env.development file in the frontend directory.
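To confirm these variables are picked up at runtime, here is a small sketch using python-dotenv (an assumption; the application may load them through its own settings module):

```python
# Sketch: read .env values with python-dotenv (assumed to be installed)
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

debug = os.getenv("DEBUG", "False") == "True"
port = int(os.getenv("PORT", "8000"))
print(f"DEBUG={debug}, PORT={port}, SECRET_KEY set: {bool(os.getenv('SECRET_KEY'))}")
```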
Once the server is running, visit /api/docs for interactive API documentation (Swagger UI).
For detailed documentation, please visit our documentation website.
Project Link: https://github.com/yourusername/paksatalker
- SadTalker - For the amazing talking head generation
- Wav2Lip - For lip-sync technology
- Qwen - For advanced language modeling
- All contributors and open-source maintainers who made this project possible
Run the bundled stable server with fallbacks and background asset prefetch:
```bash
python stable_server.py
# API: http://localhost:8000
# Swagger UI: http://localhost:8000/api/docs
```
Optionally, force asset preparation ahead of time by calling the ensure endpoint:

```bash
curl -X POST http://localhost:8000/api/v1/assets/ensure
```
`POST /api/v1/generate/fusion-video` supports optional background parameters:

- `backgroundMode`: `none` | `blur` | `portrait` | `cinematic` | `color` | `image` | `greenscreen`
- `backgroundColor`: hex color (for `color`/`greenscreen`)
- `backgroundImage`: file (for `image`/`greenscreen`)
- `chromaColor`, `similarity`, `blend`: green-screen chroma key tuning
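For illustration, the background parameters can be sent as form fields (a sketch assuming the requests package; any other fields the route requires, such as the source media, are omitted here):

```python
# Sketch: pass background options to the fusion-video route.
# Only the parameters documented above are shown; other required
# fields for the route are intentionally omitted.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/generate/fusion-video",
    data={
        "backgroundMode": "color",     # none|blur|portrait|cinematic|color|image|greenscreen
        "backgroundColor": "#101820",  # hex, used for color/greenscreen
    },
    # For image/greenscreen modes, attach the background as a file:
    # files={"backgroundImage": open("background.jpg", "rb")},
)
print(resp.status_code, resp.text)
```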
Suggest presets matching your context:
```bash
curl -X POST http://localhost:8000/api/v1/style-presets/suggest \
  -F prompt="energetic keynote" -F cultural_context=GLOBAL -F formality=0.7
```