---
layout: page
gh-repo: JuanFMontesinos/VoViT
gh-badge: [star, watch, fork, follow]
share-description: Official website of VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer
---
<div class="overlay"></div>
<div class="container">
    <div class="row">
        <div class="col-xl-12 mx-auto text-center">
            <h1>VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer</h1>
        </div>
        <div class="col-md-10 col-lg-8 col-xl-7 mx-auto">
        </div>
    </div>
</div>


<div class="col-xl-10 col-lg-8 offset-lg-1">

    <!-- Testimonials -->
    <section class="testimonials text-center">
        <div class="container">
            <div class="row">
                <div class="col-lg-4 text-center" style="">
                    <div class="testimonial-item mx-auto mb-5 mb-lg-0">
                        <h5>
                            <a href="mailto:juanfelipe.montesinosATupfDOTedu"
                               style="text-decoration : none; color : #000000;">
                                Juan F. Montesinos
                            </a>
                        </h5>
                        <p class="font-weight-light mb-0"></p>
                    </div>
                </div>
                <div class="col-lg-4 text-center">
                    <div class="testimonial-item mx-auto mb-5 mb-lg-0" style="width: 109%">
                        <h5>
                            &nbsp;
                            <a href="mailto:venkatesh.kadandaleATupfDOTedu"
                               style="text-decoration : none; color : #000000;">
                                Venkatesh S. Kadandale
                            </a>
                        </h5>
                        <p class="font-weight-light mb-0"></p>
                    </div>
                </div>
                <div class="col-lg-4 text-center">
                    <div class="testimonial-item mx-auto mb-5 mb-lg-0">
                        <h5>
                            <a href="mailto:gloria.haroATupfDOTedu" style="text-decoration : none; color : #000000;">
                                Gloria Haro
                            </a>
                        </h5>
                        <p class="font-weight-light mb-0"></p>
                    </div>
                </div>
            </div>
            <div class="row">
                <div class="offset-lg-3 col-lg-6 padtop" style="padding-bottom: 2rem">
          <span class="align-middle">
            <p class="mylead2">
                <a href="https://www.upf.edu/web/etic"
                   style="color:black">Universitat Pompeu Fabra, Barcelona, Spain</a><br>
            </p>
          </span>
                </div>
            </div>
        </div>
    </section>
    <div class="row justify-content-center">
        <div class="col-sm-3 text-center">
            <a target="_blank"
               href="http://arxiv.org/abs/2203.04099"><img src="assets/img/paper.png" width="120" height="130"
                                                            style="border:1px solid black;"></a>
            <h5 style="padding-bottom: 5%; padding-top: 5%">Paper</h5>
        </div>
        <div class="col-sm-3 text-center">
            <a href="https://github.com/JuanFMontesinos/VoViT"
               style="color: #242124">
                <i class="fab fa-github fa-8x"></i></a>
            <h5 style="padding-bottom: 5%; padding-top: 5%">Code + Weights</h5>
        </div>
        <div class="col-sm-3 text-center">
            <a style="color: #242124;"
               href="./demos/">
                <i class="fas fa-film fa-8x" style="transform: scale(1,1.275); padding-top: 2.5px"></i>
            </a>
            <h5 style="padding-bottom: 5%; padding-top: 5px">Demos</h5>
        </div>
    </div>
    <h6 class="mx-auto text-center" style="color: saddlebrown">Accepted at ECCV 2022</h6>
    <br>
    <!-- Image Showcases -->
    <h2 style="text-align: center">Abstract</h2>
    <p class="lead mb-0" align="justify">
        This paper presents an audio-visual approach for voice separation which outperforms
        state-of-the-art methods at a low latency in two scenarios: speech and singing voice. The model is based on
        a two-stage network. Motion cues are obtained with a lightweight graph convolutional network
        that processes face landmarks. Then, both audio and motion features are fed to an audio-visual
        transformer which produces a fairly good estimation of the isolated target source. In a second stage,
        the predominant voice is enhanced with an audio-only network. We present different ablation studies
        and comparison to state-of-the-art methods. Finally, we explore the transferability of models trained
        for speech separation in the task of singing voice separation. The demos, code, and weights are
        publicly available at <a href="https://ipcv.github.io/VoViT/">https://ipcv.github.io/VoViT/</a>.
    </p>
    <br>


    <div class="mx-auto">
        <br>
        <h5>Citation</h5>
        <pre class="hightlight" style="background-color:rgba(0,0,0, 0.1)"><p class="mb-0" align="justify">
@inproceedings{montesinos2022vovit,
author = {Montesinos, Juan F. and Kadandale, Venkatesh S. and Haro, Gloria},
title = {VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer},
year = {2022},
isbn = {978-3-031-19835-9},
publisher = {Springer-Verlag},
doi = {10.1007/978-3-031-19836-6_18},
booktitle = {Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII},
pages = {310–326},
}</p></pre>
    </div>

    <div class="mx-auto">
        <br>
        <h5>Video presentation at ECCV 2022</h5>
    </div>
       <iframe width="100%" height="400" src="https://www.youtube.com/embed/EtgSx2pjtyU" title="VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


    <div class="mx-auto">
        <br>
        <h5>Acknowledgements</h5>
        <p class="lead mb-0" align="justify">
            The authors acknowledge support by the MICINN/FEDER UE project, ref. PGC2018-098625-B-I00;
            the PID2021-127643NB-I00 project; the H2020-MSCA-RISE-2017 project, ref. 777826 NoMADS;
            the ReAViPeRo network, ref. RED2018-102511-T;
            and the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence
            Program (MDM-2015-0502) and the Social European Funds. J. F. M. acknowledges support by
            FPI scholarship PRE2018-083920. V. S. K. has received financial support through “la Caixa”
            Foundation (ID 100010434), fellowship code: LCF/BQ/DI18/11660064. V. S. K. has also received
            funding from the European Union’s Horizon 2020 research and innovation programme under the
            Marie Skłodowska-Curie grant agreement No. 713673. We gratefully acknowledge NVIDIA Corporation
            for the donation of GPUs used for the experiments.
        </p>
    </div>
    <div class="row justify-content-center">
        <div style="text-align: center;">
            <img class="round" src="assets/logo_ministerio.png" width="700px">
        </div>
    </div>
</div>