| layout | page |
|---|---|
| title | VoViT |
| subtitle | Low Latency Graph-based Audio-Visual Voice Separation |

The VoViT model consists of TODO

| Variant | Preprocessing (ms) | Inference: graph network (ms) | Inference: whole model (ms) | Preprocessing + inference (ms) |
|---|---|---|---|---|
| VoViT-s1 | 17.95 | 4.50 | 52.21 | 82.18 |
| VoViT | 17.95 | 4.55 | 57.45 | 93.31 |
| VoViT-s1 fp16 | 10.94 | 2.88 | 30.47 | 52.43 |
| VoViT fp16 | 10.94 | 2.86 | 34.18 | 46.14 |

Latency estimates for the different variants of VoViT. Each value is the average of 10 runs with batch size 100. Device: Nvidia RTX 3090, GPU utilization >98%, memory allocated on demand. Two forward passes were run beforehand as warm-up. Timings are in milliseconds to process 10 s of audio.
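
The measurement protocol described above (a couple of warm-up passes, then an average over several timed runs) can be sketched roughly as follows. This is only an illustration, not the script used to produce the table: `measure_latency_ms`, `forward` and `sync` are hypothetical names introduced here. When timing on a GPU, the clock should only be read after the device has finished its work, for example by passing `torch.cuda.synchronize` as `sync`.

```python
import time
from statistics import mean
from typing import Callable, Optional


def measure_latency_ms(
    forward: Callable[[], object],
    n_warmup: int = 2,
    n_runs: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the mean wall-clock time of `forward` in milliseconds.

    `forward` is a hypothetical zero-argument callable that runs one
    forward pass on an already prepared batch. `sync` is an optional
    callable that blocks until the device is idle (an assumption for
    GPU use; it defaults to a no-op).
    """
    if sync is None:
        sync = lambda: None

    # Warm-up passes are executed but not timed.
    for _ in range(n_warmup):
        forward()
    sync()

    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        forward()
        sync()  # wait for pending device work before stopping the clock
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(timings_ms)


if __name__ == "__main__":
    # Stand-in CPU workload, used here instead of a real model.
    dummy_forward = lambda: sum(i * i for i in range(10_000))
    print(f"{measure_latency_ms(dummy_forward):.3f} ms")
```

The defaults of 2 warm-up passes and 10 timed runs mirror the numbers quoted in the caption. The batch size and input length are left to whatever `forward` wraps.
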

Note: PyTorch no longer supports the complex32 dtype in PyTorch 1.11.
TODO
