
Commit 8bcc0d9

jcggl and claude committed
refactor: reframe as voice-driven avatar animation engine
Shift messaging from "lip sync" to full animation engine concept:

- Voice → lip sync + facial expressions + eye animation + body motion
- New "What AnimaSync Does" section with animation layer breakdown
- Updated architecture diagram showing expression/blink/body layers
- Richer V1/V2 comparison (expression depth, body motion, VAD)
- Hero banner: "Voice-driven Avatar Animation" subtitle
- Landing page: emotion-aware description
- All example READMEs emphasize multi-layer animation output
- Repo description updated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a4f7786 commit 8bcc0d9

File tree

9 files changed, +121 −84 lines changed


README.md

Lines changed: 84 additions & 50 deletions
@@ -4,9 +4,9 @@
 
 <br><br>
 
-**Real-time audio-to-blendshape lip sync for the browser.**
+**Voice-driven 3D avatar animation engine for the browser.**
 
-Rust/WASM engine that converts speech into ARKit-compatible facial animations at 30fps — entirely client-side.
+Extracts emotion from speech and generates lip sync, facial expressions, and body motion in real time — entirely client-side via Rust/WASM.
 
 <br>
 
@@ -36,33 +36,48 @@ Rust/WASM engine that converts speech into ARKit-compatible facial animations at
 <tr>
 <td width="50%">
 
-**Browser-native WASM**<br>
-<sub>No server needed. Entire pipeline runs in the browser with near-native performance via Rust → WebAssembly compilation.</sub>
+**Voice → Full-body Animation**<br>
+<sub>Not just lip sync. Analyzes speech to generate lip movements, emotional facial expressions, eye blinks, and body poses — all from a single audio stream.</sub>
 
-**ARKit-compatible Output**<br>
-<sub>Standard 52-dim or 111-dim blendshape weight arrays. Works with any 3D framework — Three.js, Babylon.js, Unity WebGL.</sub>
+**Emotion-aware Expressions**<br>
+<sub>Automatically maps vocal characteristics to facial expressions. Eyebrow raises, smile intensity, jaw dynamics, and blink patterns respond to how things are said, not just what is said.</sub>
 
-**Built-in Bone Animation**<br>
-<sub>Embedded VRMA idle/speaking pose clips with automatic crossfade. Natural body movement out of the box.</sub>
+**Built-in Body Motion**<br>
+<sub>Embedded VRMA bone animation clips (idle / speaking poses) with automatic crossfade. Your avatar breathes, shifts weight, and moves naturally — out of the box.</sub>
 
 </td>
 <td width="50%">
 
-**Real-time Streaming**<br>
-<sub>AudioWorklet-based microphone capture with ~300ms latency. Stream TTS audio or process recorded files.</sub>
+**Browser-native WASM**<br>
+<sub>No server needed. Entire pipeline runs in the browser at 30fps with near-native performance via Rust → WebAssembly. ARKit-compatible 52 or 111-dim output.</sub>
 
-**30-day Free Trial**<br>
-<sub>No signup, no API key. Call `init()` and start building. Internet required for license validation only.</sub>
+**Real-time Streaming**<br>
+<sub>AudioWorklet-based microphone capture with ~300ms latency. Feed live mic, TTS, or recorded audio — get animated avatar frames back instantly.</sub>
 
-**Three.js + VRM Ready**<br>
-<sub>First-class integration with @pixiv/three-vrm. Drop a VRM avatar and it just works.</sub>
+**Plug & Play**<br>
+<sub>3 lines of code to go from audio to animated avatar. 30-day free trial, no signup. First-class Three.js + VRM integration.</sub>
 
 </td>
 </tr>
 </table>
 
 ---
 
+## What AnimaSync Does
+
+Most lip sync engines stop at mouth shapes. AnimaSync goes further — it treats voice as the **complete animation source**:
+
+| Layer | What it generates | How |
+|-------|-------------------|-----|
+| **Lip Sync** | Mouth shapes matching phonemes | ONNX inference → ARKit blendshapes (jaw, mouth, tongue) |
+| **Facial Expression** | Emotion-driven brows, cheeks, eyes | Voice energy & pitch → expression mapping + anatomical constraints |
+| **Eye Animation** | Natural blinks, micro-movements | Stochastic blink injection (2.5–4.5s intervals, 15% double-blink) |
+| **Body Motion** | Idle breathing, speaking gestures | Embedded VRMA bone clips with automatic idle ↔ speaking crossfade |
+
+One audio stream in → a fully animated 3D avatar out.
+
+---
+
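The blink parameters quoted in the table above (2.5–4.5 s intervals, 15% double-blink) are concrete enough to sketch. A minimal scheduler in plain JavaScript — illustrative only, assuming a uniform interval distribution and a 0.25 s double-blink gap, not the engine's actual code:

```javascript
// Hypothetical blink scheduler matching the documented parameters:
// one blink every 2.5–4.5 s, with a 15% chance of a double-blink.
// The 0.25 s double-blink gap is an assumption for the sketch.
function scheduleBlinks(durationSec, rand = Math.random) {
  const times = [];
  let t = 2.5 + rand() * 2.0; // first blink lands in [2.5, 4.5)
  while (t < durationSec) {
    times.push(t);
    if (rand() < 0.15) times.push(t + 0.25); // occasional double-blink
    t += 2.5 + rand() * 2.0; // next interval, again [2.5, 4.5)
  }
  return times;
}
```

Each scheduled time would then drive the `eyeBlinkLeft`/`eyeBlinkRight` channels for a few frames.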
 ## Quick Start
 
 ### Install
@@ -85,9 +100,10 @@ import { LipSyncWasmWrapper } from '@goodganglabs/lipsync-wasm-v2';
 const lipsync = new LipSyncWasmWrapper();
 await lipsync.init(); // 30-day free trial — no key needed
 
+// One call — get lip sync + expressions + blinks, all at once
 const result = await lipsync.processFile(audioFile);
 for (let i = 0; i < result.frame_count; i++) {
-  const frame = lipsync.getFrame(result, i); // number[52] — ARKit blendshapes
+  const frame = lipsync.getFrame(result, i); // number[52] — full face animation
   applyToYourAvatar(frame);
 }
 ```
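`applyToYourAvatar` in the snippet above is left to the integrator. One possible shape for it is to name the channels before applying them. The index → name mapping below is purely illustrative (the engine's actual channel order must come from its docs), though the names themselves are standard ARKit blendshape keys:

```javascript
// Illustrative only: turn a number[52] frame into named ARKit-style
// weights. The indices below are assumptions — check the engine's
// documentation for the real channel order.
const CHANNELS = ['jawOpen', 'mouthClose', 'mouthFunnel', 'mouthSmileLeft',
                  'mouthSmileRight', 'eyeBlinkLeft', 'eyeBlinkRight', 'browInnerUp'];

function frameToWeights(frame, channels = CHANNELS) {
  const weights = {};
  channels.forEach((name, i) => { weights[name] = frame[i] ?? 0; });
  return weights;
}
```

How the named weights are applied then depends on the target framework (Three.js morph targets, VRM expressions, etc.).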
@@ -114,9 +130,9 @@ Working examples you can run locally — zero npm install, all loaded from CDN.
 
 | Example | Description | Source |
 |---------|-------------|--------|
-| **[Basic](examples/vanilla-basic/)** | Audio file → blendshape bar chart. No 3D, pure API demo. | [index.html](examples/vanilla-basic/index.html) |
-| **[VRM Avatar](examples/vanilla-avatar/)** | Full 3D avatar with mic, file upload, bone animation. | [index.html](examples/vanilla-avatar/index.html) |
-| **[V1 vs V2](examples/vanilla-comparison/)** | Side-by-side dual avatar comparison. Same audio, two engines. | [index.html](examples/vanilla-comparison/index.html) |
+| **[Basic](examples/vanilla-basic/)** | Audio → animated blendshape visualization. No 3D, pure API demo. | [index.html](examples/vanilla-basic/index.html) |
+| **[VRM Avatar](examples/vanilla-avatar/)** | Full 3D avatar — lip sync, expressions, body motion, mic streaming. | [index.html](examples/vanilla-avatar/index.html) |
+| **[V1 vs V2](examples/vanilla-comparison/)** | Side-by-side dual avatar comparison. Same voice, two animation engines. | [index.html](examples/vanilla-comparison/index.html) |
 
 **Run any example:**
 
@@ -135,58 +151,76 @@ npx serve . # or: python3 -m http.server 8080
 | **Output** | 52-dim ARKit blendshapes | 111-dim ARKit blendshapes |
 | **Model** | Student distillation (direct prediction) | Phoneme classification → viseme mapping |
 | **Post-processing** | crisp_mouth + fade + auto-blink | OneEuroFilter + anatomical constraints |
-| **Idle expressions** | Not included | Built-in `IdleExpressionGenerator` |
-| **Voice activity** | Not included | Built-in `VoiceActivityDetector` |
+| **Expression generation** | Blink injection in post-process | Built-in `IdleExpressionGenerator` (blinks + micro-expressions) |
+| **Voice activity** | Not included | Built-in `VoiceActivityDetector` (body pose switching) |
 | **ONNX fallback** | None (ONNX required) | Heuristic mode (energy-based) |
+| **Body motion** | VRMA idle/speaking (both versions) | VRMA idle/speaking + VAD auto-switch |
 | **Best for** | Most projects, quick integration | Full expression control, custom avatars |
 
 ---
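The `VoiceActivityDetector` and the energy-based heuristic fallback mentioned in the table suggest a simple RMS gate. A toy version with hangover frames so speech doesn't flicker off between words — the threshold and hangover values are illustrative assumptions, not the engine's:

```javascript
// Toy energy-based voice activity detector: RMS energy vs. a fixed
// threshold, plus a few "hangover" frames so short pauses between
// words don't drop the avatar back to its idle pose.
function createVad(threshold = 0.02, hangoverFrames = 5) {
  let hang = 0;
  return function isSpeech(samples) {
    let sum = 0;
    for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
    const rms = Math.sqrt(sum / samples.length);
    if (rms > threshold) { hang = hangoverFrames; return true; }
    if (hang > 0) { hang--; return true; } // stay "speaking" briefly
    return false;
  };
}
```

This kind of speech/silence signal is what drives the idle ↔ speaking body-pose switching described in the table.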
 
 ## Architecture
 
 ```
-                              Browser
-
-Audio Source (File / Mic / TTS)
-      │
-      ▼
-┌──────────┐   ┌────────────┐   ┌─────────────────────┐
-│  WASM    │   │   ONNX     │   │  WASM               │
-│  Feature │──▶│  Inference │──▶│  Post-processing    │
-│  Extract │   │   (JS)     │   │  + Blendshape map   │
-└──────────┘   └────────────┘   └─────────┬───────────┘
-                                          │
-                                          ▼
-                               52 / 111-dim ARKit
-                               Blendshapes @30fps
-                                          │
-                                          ▼
-                               3D Avatar (Three.js,
-                               Babylon, Unity WebGL)
+                              Browser
+
+Audio Source (File / Mic / TTS)
+      │
+      ▼
+┌──────────┐   ┌────────────┐   ┌──────────────────────────────┐
+│  WASM    │   │   ONNX     │   │  WASM                        │
+│  Feature │──▶│  Inference │──▶│  Post-processing             │
+│  Extract │   │   (JS)     │   │  + Expression mapping        │
+└──────────┘   └────────────┘   └────────────┬─────────────────┘
+                                             │
+                   ┌─────────────────────────┼────────────┐
+                   │                         │            │
+                   ▼                         ▼            ▼
+                Lip Sync          Facial Expression     Blinks
+              (jaw, mouth,        (brows, cheeks,      (natural
+                tongue)            smile, frown)      stochastic)
+                   │                         │            │
+                   └────────────┬────────────┘            │
+                                ▼                         │
+              52/111-dim ARKit Blendshapes @30fps         │
+                                │  ◄──────────────────────┘
+                                ▼
+                 ┌───────────────────────────┐
+                 │   VRMA Bone Animation     │
+                 │ idle ↔ speaking crossfade │
+                 │  (body pose + gestures)   │
+                 └─────────────┬─────────────┘
+                               ▼
+              3D Avatar (Three.js / Babylon / Unity)
 ```
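The idle ↔ speaking crossfade at the bottom of the diagram boils down to ramping two clip weights against each other. A sketch of that weight math, assuming a linear ramp and a 0.3 s fade (both assumptions — the engine's real curve and duration are not documented here):

```javascript
// Linear crossfade weights between an idle and a speaking bone clip.
// tSinceSwitch is seconds since the speaking state last changed.
// The 0.3 s fade duration is an illustrative assumption.
function crossfadeWeights(speaking, tSinceSwitch, fadeSec = 0.3) {
  const k = Math.min(1, Math.max(0, tSinceSwitch / fadeSec)); // ramps 0 → 1
  const speakW = speaking ? k : 1 - k;
  return { idle: 1 - speakW, speaking: speakW }; // weights always sum to 1
}
```

The two weights would then feed whatever animation mixer plays the VRMA clips (e.g. per-clip weights in Three.js's `AnimationMixer`).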
 
 ### V2 Pipeline
 
 ```
 Audio 16kHz PCM
 → [WASM] librosa-compatible features: 141-dim @30fps
-→ [JS] ONNX student model: 52-dim direct output
-→ [WASM] crisp_mouth → fade_in_out → add_blinks
-→ [Optional] Preset blending
+→ [JS] ONNX student model → 52-dim (lip sync + expressions)
+→ [WASM] crisp_mouth (mouth sharpening) → fade_in_out (natural onset/offset)
+→ [WASM] add_blinks (stochastic eye animation)
+→ [WASM] Preset blending: expression channels (brows, eyes) blended with lip sync
+→ [VRMA] Bone animation: idle ↔ speaking pose auto-crossfade
 ```
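The `fade_in_out` step above tapers blendshape weights at the start and end of an utterance so the mouth doesn't snap open or closed. A sketch of such an envelope, with an assumed 0.1 s ramp (the engine's actual ramp length is not stated here):

```javascript
// Per-frame envelope: ramps 0 → 1 over the first `ramp` seconds of a
// clip and 1 → 0 over the last `ramp` seconds. Multiply each frame's
// weights by this value. The 0.1 s ramp is an assumption.
function fadeInOut(tSec, durationSec, ramp = 0.1) {
  const fadeIn = Math.min(1, tSec / ramp);
  const fadeOut = Math.min(1, (durationSec - tSec) / ramp);
  return Math.max(0, Math.min(fadeIn, fadeOut));
}
```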
 
 ### V1 Pipeline
 
 ```
 Audio 16kHz PCM
 → [WASM] MFCC extraction: 13-dim @100fps
-→ [JS] ONNX inference: 61 phoneme probabilities
-→ [WASM] Phoneme22 visemes → 111-dim ARKit blendshapes
+→ [JS] ONNX inference: 61 phoneme → 22 visemes
+→ [WASM] Viseme → 111-dim ARKit blendshapes (lip + expression + extras)
 → [WASM] FPS conversion: 100fps → 30fps
-→ [WASM] Anatomical constraints + OneEuroFilter
-→ [Optional] Preset blending (face 40% + mouth 60%)
+→ [WASM] Anatomical constraints (bilateral symmetry + jaw correction)
+→ [WASM] OneEuroFilter (temporal smoothing for natural motion)
+→ [WASM] Preset blending: face 40% (expression) + mouth 60% (lip sync)
+→ [WASM] IdleExpressionGenerator: blinks (2.5–4.5s, 15% double) + micro-expressions
+→ [VRMA] Bone animation: idle ↔ speaking pose crossfade (VAD-triggered)
 ```
 
 ---
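The `OneEuroFilter` stage in the V1 pipeline is the well-known 1€ filter: an exponential low-pass whose cutoff frequency rises with signal speed, so slow jitter is smoothed while fast mouth movements stay sharp. A compact per-channel sketch — the parameter defaults are illustrative, not the engine's actual settings:

```javascript
// Minimal 1€ filter for one blendshape channel. minCutoff and beta
// below are illustrative defaults, not the engine's real values.
function createOneEuro(minCutoff = 1.0, beta = 0.05, dCutoff = 1.0) {
  let prev = null, prevDx = 0;
  const alpha = (cutoff, dt) => 1 / (1 + 1 / (2 * Math.PI * cutoff * dt));
  return function filter(x, dt) {
    if (prev === null) { prev = x; return x; } // first sample passes through
    const dx = (x - prev) / dt;
    const aD = alpha(dCutoff, dt);
    prevDx = aD * dx + (1 - aD) * prevDx;               // smoothed speed estimate
    const cutoff = minCutoff + beta * Math.abs(prevDx); // faster motion → higher cutoff
    const a = alpha(cutoff, dt);
    prev = a * x + (1 - a) * prev;                      // adaptive low-pass
    return prev;
  };
}
```

At 30 fps, `dt` would be `1/30` for every frame.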
@@ -241,10 +275,10 @@ interface ProcessResult {
 
 | Method | Use Case |
 |--------|----------|
-| `processFile(file)` | File upload UI |
-| `processAudio(float32)` | Pre-loaded audio (fetched from API) |
+| `processFile(file)` | File upload → returns lip sync + expression + blink frames |
+| `processAudio(float32)` | Pre-loaded audio (e.g., fetched from TTS API) |
 | `processAudioChunk(chunk)` | Real-time mic / TTS streaming |
-| `getVrmaBytes()` | Bone animations for idle & speaking poses |
+| `getVrmaBytes()` | Bone animation clips for idle breathing & speaking gestures |
 | `reset()` | Clear streaming state between utterances |
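For `processAudioChunk(chunk)`, a caller typically slices the incoming stream into fixed-size `Float32Array` chunks. A sketch of that slicing — the 1024-sample size is an assumption, since the engine's expected chunk size isn't stated here:

```javascript
// Split a PCM buffer into fixed-size views for a streaming API.
// subarray() creates views, not copies; the last chunk may be shorter.
// The 1024-sample chunk size is an illustrative assumption.
function toChunks(samples, chunkSize = 1024) {
  const chunks = [];
  for (let i = 0; i < samples.length; i += chunkSize) {
    chunks.push(samples.subarray(i, i + chunkSize));
  }
  return chunks;
}
```

Each chunk would then be passed to `processAudioChunk()`, with `reset()` called between utterances.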
 
 ### Loading Progress Stages

assets/readme/hero-banner.svg

Lines changed: 3 additions & 3 deletions

examples/vanilla-avatar/README.md

Lines changed: 10 additions & 9 deletions
@@ -1,14 +1,15 @@
 # Vanilla Avatar
 
-Full 3D VRM avatar that lip-syncs to audio using AnimaSync V2. Supports file upload and real-time microphone streaming.
+Full 3D VRM avatar that comes alive from voice alone. Lip sync, emotional facial expressions, natural eye blinks, and body motion — all generated from a single audio stream via AnimaSync V2.
 
 ## What it demonstrates
 
-- Three.js + `@pixiv/three-vrm` avatar rendering
-- VRMA bone animation (idle pose crossfade)
-- Real-time mic streaming via `processAudioChunk()` + AudioWorklet
-- Batch file processing via `processFile()`
-- 52-dim ARKit blendshape application to VRM expressions
+- **Lip sync**: Mouth shapes driven by voice phonemes
+- **Facial expressions**: Brows, cheeks, and eye area respond to vocal characteristics
+- **Eye animation**: Natural stochastic blinks injected automatically
+- **Body motion**: VRMA bone animation (idle breathing ↔ speaking pose crossfade)
+- Real-time mic streaming + batch file processing
+- Three.js + `@pixiv/three-vrm` integration
 
 ## Run locally
 
@@ -28,6 +29,6 @@ Drop any `.vrm` file onto the canvas. Free CC0 avatars are available at:
 ## How it works
 
 1. Page loads → WASM + ONNX model initialized from CDN
-2. Drop a `.vrm` file → Three.js scene renders the avatar with idle bone animation
-3. Upload audio or click Microphone → blendshapes applied to VRM at 30fps
-4. Frame queue pattern: audio processing pushes frames, render loop consumes at 30fps
+2. Drop a `.vrm` file → Three.js scene renders the avatar with idle breathing animation
+3. Upload audio or click Microphone → engine generates lip sync + expressions + blinks
+4. All animation layers (face + body) applied to VRM at 30fps via frame queue

examples/vanilla-avatar/index.html

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
-<title>AnimaSync — VRM Avatar</title>
+<title>AnimaSync — Voice-driven VRM Avatar</title>
 <script type="importmap">
 { "imports": {
 "three": "https://cdn.jsdelivr.net/npm/three@0.179.1/build/three.module.js",

examples/vanilla-basic/README.md

Lines changed: 5 additions & 5 deletions
@@ -1,12 +1,12 @@
 # Vanilla Basic
 
-Minimal AnimaSync example — no 3D avatar, no Three.js. Drop an audio file and watch blendshape values animate in real time.
+Minimal AnimaSync example — no 3D avatar, no Three.js. Drop an audio file and see how voice drives lip sync, facial expression, and blink animation data in real time.
 
 ## What it demonstrates
 
 - Loading `@goodganglabs/lipsync-wasm-v2` from CDN (zero `npm install`)
-- `processFile()` batch API
-- Extracting frames with `getFrame()` and visualizing 23 key ARKit channels
+- `processFile()` batch API — returns lip sync + expressions + blinks in one call
+- Visualizing 23 key ARKit channels: jaw, mouth, eyes, brows, cheeks
 
 ## Run locally
 
@@ -22,7 +22,7 @@ Open `http://localhost:8080` (or the port your server shows).
 ## How it works
 
 1. WASM + ONNX model load from jsdelivr CDN on page load
-2. Drop/select an audio file → `processFile()` returns all frames at once
-3. `requestAnimationFrame` loop plays frames at 30fps, updating bar widths
+2. Drop/select an audio file → `processFile()` returns all animation frames (lip sync + expressions + blinks)
+3. `requestAnimationFrame` loop plays frames at 30fps, showing how each facial channel responds to the voice
 
 No bundler, no framework, single HTML file.

examples/vanilla-basic/index.html

Lines changed: 3 additions & 3 deletions
@@ -3,7 +3,7 @@
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
-<title>AnimaSync — Basic Example</title>
+<title>AnimaSync — Basic: Voice-driven Animation Data</title>
 <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.0/dist/ort.min.js"></script>
 <style>
 *, *::before, *::after { margin: 0; padding: 0; box-sizing: border-box; }
@@ -142,7 +142,7 @@ <h2>Audio Input</h2>
 
 <!-- Right: Blendshapes -->
 <div class="card">
-<h2>ARKit Blendshapes (52-dim)</h2>
+<h2>Face Animation Data (52-dim)</h2>
 <div class="bs-grid" id="bs-grid"></div>
 </div>
 </main>
@@ -155,7 +155,7 @@ <h2>ARKit Blendshapes (52-dim)</h2>
 <script type="module">
 // ================================================================
 // AnimaSync — Vanilla Basic Example
-// No 3D avatar, no Three.js. Pure audio → blendshape visualization.
+// No 3D avatar, no Three.js. Pure audio → lip sync + expression + blink data.
 // ================================================================
 
 const VERSION = '0.3.9';
