---
layout: page
title: VoViT
subtitle: Low Latency Graph-based Audio-Visual Voice Separation
---
The VoViT model consists of TODO

VoViT

Latency

<style type="text/css"> .tg {border-collapse:collapse;border-color:#93a1a1;border-spacing:0;margin:0px auto;} .tg td{background-color:#fdf6e3;border-color:#93a1a1;border-style:solid;border-width:0px;color:#002b36; font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{background-color:#657b83;border-color:#93a1a1;border-style:solid;border-width:0px;color:#fdf6e3; font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top} .tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top} </style>
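<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky" rowspan="2"></th>
    <th class="tg-c3ow" rowspan="2">Preprocessing</th>
    <th class="tg-c3ow" colspan="2">Inference</th>
    <th class="tg-c3ow" rowspan="2">Preprocessing + Inference</th>
  </tr>
  <tr>
    <th class="tg-c3ow">Graph Network</th>
    <th class="tg-c3ow">Whole model</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">VoViT-s1</td>
    <td class="tg-c3ow">17.95</td>
    <td class="tg-c3ow">4.50</td>
    <td class="tg-c3ow">52.21</td>
    <td class="tg-c3ow">82.18</td>
  </tr>
  <tr>
    <td class="tg-0pky">VoViT</td>
    <td class="tg-c3ow">17.95</td>
    <td class="tg-c3ow">4.55</td>
    <td class="tg-c3ow">57.45</td>
    <td class="tg-c3ow">93.31</td>
  </tr>
  <tr>
    <td class="tg-0pky">VoViT-s1 fp16</td>
    <td class="tg-c3ow">10.94</td>
    <td class="tg-c3ow">2.88</td>
    <td class="tg-c3ow">30.47</td>
    <td class="tg-c3ow">52.43</td>
  </tr>
  <tr>
    <td class="tg-0pky">VoViT fp16</td>
    <td class="tg-c3ow">10.94</td>
    <td class="tg-c3ow">2.86</td>
    <td class="tg-c3ow">34.18</td>
    <td class="tg-c3ow">46.14</td>
  </tr>
</tbody>
</table>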

Latency estimates for the different variants of VoViT. Each value is the average of 10 runs with batch size 100. Device: NVIDIA RTX 3090, GPU utilization >98%, memory on demand. Two forward passes were run beforehand as warm-up. Timings are in milliseconds needed to process 10 s of audio.
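The measurement protocol described above (a couple of untimed warm-up passes followed by an average over several timed runs) can be sketched roughly as follows. This is only an illustrative helper, not the script used to produce the numbers in the table: `benchmark_ms`, its parameters, and the `model`/`batch` names in the usage comment are all hypothetical. On a GPU, work is launched asynchronously, so the sketch assumes a synchronization callback (for PyTorch, `torch.cuda.synchronize`) is called before reading the clock.

```python
import time
from statistics import mean
from typing import Callable, Optional


def benchmark_ms(
    fn: Callable[[], object],
    warmup: int = 2,
    runs: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the mean wall-clock time of fn() in milliseconds.

    Hypothetical helper: `warmup` untimed calls, then `runs` timed calls.
    `sync` is an optional callback that blocks until pending device work
    has finished (assumption: torch.cuda.synchronize when timing on CUDA).
    """

    def timed_call() -> float:
        if sync is not None:
            sync()  # drain pending asynchronous work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        return (time.perf_counter() - start) * 1000.0

    for _ in range(warmup):
        fn()  # warm-up passes are executed but not timed

    return mean(timed_call() for _ in range(runs))


# Hypothetical usage with PyTorch (model and batch are placeholders):
#   latency = benchmark_ms(lambda: model(batch), sync=torch.cuda.synchronize)
```

Without the synchronization step, the timer would mostly measure kernel launch overhead rather than the time the GPU actually spends on the forward pass.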

Note: PyTorch 1.11 no longer supports the complex32 dtype.

TODO