[ {
    "title": "Variational Hetero-Encoder Randomized GANs for Joint Image-Text Modeling",
    "pdf_link": "https://openreview.net/pdf?id=H1x5wRVtvS",
    "abstract": "For bidirectional joint image-text modeling, we develop variational hetero-encoder (VHE) randomized generative adversarial network (GAN), a versatile deep genera- tive model that integrates a probabilistic text decoder, probabilistic image encoder, and GAN into a coherent end-to-end multi-modality learning framework. VHE randomized GAN (VHE-GAN) encodes an image to decode its associated text, and feeds the variational posterior as the source of randomness into the GAN image generator. We plug three off-the-shelf modules, including a deep topic model, a ladder-structured image encoder, and StackGAN++, into VHE-GAN, which already achieves competitive performance. This further motivates the development of VHE-raster-scan-GAN that generates photo-realistic images in not only a multi- scale low-to-high-resolution manner, but also a hierarchical-semantic coarse-to-\ufb01ne fashion. By capturing and relating hierarchical semantic and visual concepts with end-to-end training, VHE-raster-scan-GAN achieves state-of-the-art performance in a wide variety of image-text multi-modality learning and generation tasks.",
    "paper_text": "Published as a conference paper at ICLR 2020 VARIATIONAL HETERO -ENCODER RANDOMIZED GAN S FOR JOINT IMAGE -TEXT MODELING Hao Zhang, Bo Chen\u0003, Long Tian, Zhengjue Wang National Laboratory of Radar Signal Processing Xidian University, Xian, China zhanghao xidian@163.com bchen@mail.xidian.edu.cn tianlong xidian@163.com zhengjuewang@163.com Mingyuan Zhou McCombs School of Business The University of Texas at Austin, Austin, TX 78712, USA mingyuan.zhou@mccombs.utexas.edu ABSTRACT For bidirectional joint image-text modeling, we develop variational hetero-encoder (VHE) randomized generative adversarial network (GAN), a versatile deep genera- tive model that integrates a probabilistic text decoder, probabilistic image encoder, and GAN into a coherent end-to-end multi-modality learning framework. VHE randomized GAN (VHE-GAN) encodes an image to decode its associated text, and feeds the variational posterior as the source of randomness into the GAN image generator. We plug three off-the-shelf modules, including a deep topic model, a ladder-structured image encoder, and StackGAN++, into VHE-GAN, which already achieves competitive performance. This further motivates the development of VHE-raster-scan-GAN that generates photo-realistic images in not only a multi- scale low-to-high-resolution manner, but also a hierarchical-semantic coarse-to-\ufb01ne fashion. By capturing and relating hierarchical semantic and visual concepts with end-to-end training, VHE-raster-scan-GAN achieves state-of-the-art performance in a wide variety of image-text multi-modality learning and generation tasks. 1 I NTRODUCTION Images and texts commonly occur together in the real world. There exists a wide variety of deep neural network based unidirectional methods that model images (texts) given texts (images) (Gomez et al., 2017; Kiros & Szepesvari, 2012; Reed et al., 2016; Xu et al., 2018; Zhang et al., 2017a). 
There also exist probabilistic graphical model based bidirectional methods (Srivastava & Salakhutdinov, 2012b;a; Wang et al., 2018) that capture the joint distribution of images and texts. These bidirectional methods, however, often make restrictive parametric assumptions that limit their image generation ability. Exploiting recent progress on deep probabilistic models and variational inference (Kingma & Welling, 2014; Zhou et al., 2016; Zhang et al., 2018a; Goodfellow et al., 2014; Zhang et al., 2017b), we propose an end-to-end learning framework to construct multi-modality deep generative models that can not only generate vivid image-text pairs, but also achieve state-of-the-art results on various unidirectional tasks (Srivastava & Salakhutdinov, 2012b;a; Wang et al., 2018; Gomez et al., 2017; Xu et al., 2018; Zhang et al., 2017a;b; Verma et al., 2018; Zhang et al., 2018b), such as generating photo-realistic images given texts and performing text-based zero-shot learning. To extract and relate semantic and visual concepts, we first introduce variational hetero-encoder (VHE) that encodes an image to decode its textual description (e.g., tags, sentences, binary attributes, and long documents), where the probabilistic encoder and decoder are jointly optimized using variational inference (Blei et al., 2017; Hoffman et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014). The latent representation of VHE can be sampled from either the variational posterior provided by the image encoder given an image input, or the posterior of the text decoder via MCMC given a text input. VHE by construction has the ability to generate texts given images.\n* Corresponding author.\n
To further enhance its text generation performance and allow synthesizing photo-realistic images given an image, text, or random noise, we feed the variational posterior of VHE in lieu of random noise as the source of randomness into the image generator of a generative adversarial network (GAN) (Goodfellow et al., 2014). We refer to this new modeling framework as VHE randomized GAN (VHE-GAN). Off-the-shelf text decoders, image encoders, and GANs can be directly plugged into the VHE-GAN framework for end-to-end multi-modality learning. To begin with, as shown in Figs. 1(a) and 1(b), we construct VHE-StackGAN++ by using the Poisson gamma belief network (PGBN) (Zhou et al., 2016) as the VHE text decoder, using the Weibull upward-downward variational encoder (Zhang et al., 2018a) as the VHE image encoder, and feeding the concatenation of the multi-stochastic-layer latent representation of the VHE as the source of randomness into the image generator of StackGAN++ (Zhang et al., 2017b). While VHE-StackGAN++ already achieves very attractive performance, we find that its performance can be clearly boosted by better exploiting the multi-stochastic-layer semantically meaningful hierarchical latent structure of the PGBN text decoder. To this end, as shown in Figs. 1(a) and 1(c), we develop VHE-raster-scan-GAN to perform image generation in not only a multi-scale low-to-high-resolution manner in each layer, as done by StackGAN++, but also a hierarchical-semantic coarse-to-fine fashion across layers, a unique feature distinguishing it from existing methods. Consequently, VHE-raster-scan-GAN can not only generate vivid high-resolution images with better details, but also build interpretable hierarchical semantic-visual relationships between the generated images and texts.
Our main contributions include: 1) VHE-GAN that provides a plug-and-play framework to integrate off-the-shelf probabilistic decoders, variational encoders, and GANs for end-to-end bidirectional multi-modality learning; the shared latent space can be inferred either by image encoder q(z|x), if given images, or by Gibbs sampling from the conditional posterior of text decoder p(t|z), if given texts; 2) VHE-raster-scan-GAN that captures and relates hierarchical semantic and visual concepts to achieve state-of-the-art results in various unidirectional and bidirectional image-text modeling tasks.\n2 VARIATIONAL HETERO-ENCODER RANDOMIZED GANS\nVAEs and GANs are two distinct types of deep generative models. Consisting of a generator (decoder) p(x|z), a prior p(z), and an inference network (encoder) q(z|x) that is used to approximate the posterior p(z|x), VAEs (Kingma & Welling, 2014; Rezende et al., 2014) are optimized by maximizing the evidence lower bound (ELBO) as ELBO = E_{x ~ p_data(x)}[L(x)], L(x) := E_{z ~ q(z|x)}[ln p(x|z)] - KL[q(z|x) || p(z)], (1) where p_data(x) = Σ_{i=1}^{N} (1/N) δ_{x_i} represents the empirical data distribution. Distinct from VAEs that make parametric assumptions on data distribution and perform posterior inference, GANs in general use implicit data distribution and do not provide meaningful latent representations (Goodfellow et al., 2014); they learn both a generator G and a discriminator D by optimizing a mini-max objective as min_G max_D { E_{x ~ p_data(x)}[ln D(x)] + E_{z ~ p(z)}[ln(1 - D(G(z)))] }, (2) where p(z) is a random noise distribution that acts as the source of randomness for data generation.\n2.1 VHE-GAN OBJECTIVE FUNCTION FOR END-TO-END MULTI-MODALITY LEARNING\nBelow we show how to construct VHE-GAN to jointly model images x and their associated texts t, capturing and relating hierarchical semantic and visual concepts.
First, we modify the usual VAE into VHE, optimizing a lower bound of the text log-marginal-likelihood E_{t ~ p_data(t)}[ln p(t)] as ELBO_vhe = E_{p_data(t,x)}[L_vhe(t,x)], L_vhe(t,x) := E_{z ~ q(z|x)}[ln p(t|z)] - KL[q(z|x) || p(z)], (3) where p(t|z) is the text decoder, p(z) is the prior, p(t) = E_{z ~ p(z)}[p(t|z)], and L_vhe(t,x) ≤ ln E_{z ~ q(z|x)}[p(t|z) p(z) / q(z|x)] = ln p(t). Second, the image encoder q(z|x), which encodes image x into its latent representation z, is used to approximate the posterior p(z|t) = p(t|z) p(z) / p(t). Third, variational posterior q(z|x) in lieu of random noise p(z) is fed as the source of randomness into the GAN image generator.\nFigure 1: Illustration of (a) VHE, (b) StackGAN++, (c) raster-scan-GAN, (d) vanilla-GAN, and (e) simple-raster-scan-GAN. VHE-raster-scan-GAN consists of (a) and (c). x↓d is down-sampled from x with scaling factor d. VHE-StackGAN++, consisting of (a) and (b), VHE-vanilla-GAN, consisting of (a) and (d), and VHE-simple-raster-scan-GAN, consisting of (a) and (e), are all used for ablation studies.\nCombining these three steps, with the parameters of the image encoder
q(z|x), text decoder p(t|z), and GAN generator denoted by E, G_vae, and G_gan, respectively, we express the objective function of VHE-GAN for joint image-text end-to-end learning as min_{E, G_vae, G_gan} max_D E_{p_data(t,x)}[L(t,x)], L(t,x) := ln D(x) + KL[q(z|x) || p(z)] + E_{z ~ q(z|x)}[ln(1 - D(G_gan(z))) - ln p(t|z)]. (4) Note that the objective function in (4) implies a data-triple-reuse training strategy, which uses the same data mini-batch in each stochastic gradient update iteration to jointly train the VHE, GAN discriminator, and GAN generator; see a related objective function, shown in (10) of Appendix A, that results from naively combining the VHE and GAN training objectives. In VHE-GAN, the optimization of the encoder parameter E is related to not only the VHE\u2019s ELBO, but also the GAN mini-max objective function, forcing the variational posterior q(z|x) to serve as a bridge between VHE and GAN, allowing them to help each other. Although there are some models (Mescheder et al., 2017; Makhzani et al., 2015; Tolstikhin et al., 2018; Dumoulin et al., 2017; Donahue et al., 2017; Che et al., 2017; Srivastava et al., 2017; Grover et al., 2018; Larsen et al., 2016; Huang et al., 2018) combining VAEs and GANs in various ways, they focus on single-modality tasks, whereas VHE-GAN models two different modalities. In Appendix A, we analyze the properties of the VHE-GAN objective function and discuss related works. Below we develop two different VHE-GANs: one integrates off-the-shelf modules, while the other introduces a new interpretable hierarchical latent structure.\n2.2 VHE-STACKGAN++ WITH OFF-THE-SHELF MODULES\nAs shown in Figs. 1(a) and 1(b), we first construct VHE-StackGAN++ by plugging into VHE-GAN three off-the-shelf modules, including a deep topic model (Zhou et al., 2016), a ladder-structured encoder (Zhang et al., 2018a), and StackGAN++ (Zhang et al., 2017b). For text analysis, both sequence models and topic models are widely used.
Sequence models (Bengio et al., 2003) often represent each document as a sequence of word embedding vectors, capturing local dependency structures with some type of recurrent neural network (RNN), such as long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997). Topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) often represent each document as a bag of words (BoW), capturing global word cooccurrence patterns into latent topics. Suitable for capturing local dependency structure, existing sequence models often have difficulty in capturing long-range word dependencies and hence macro-level information, such as global word cooccurrence patterns (i.e., topics), especially for long documents. By contrast, while topic models ignore word order, they are very effective in capturing latent topics, which are often directly related to macro-level visual information (Gomez et al., 2017; Dieng et al., 2017; Lau et al., 2017). Moreover, topic models can be applied to not only sequential texts, such as a few sentences (Wang et al., 2009; Jin et al., 2015) and long documents (Zhou et al., 2016), but also non-sequential ones, such as textual tags (Srivastava & Salakhutdinov, 2012a; 2014; Wang et al., 2018) and binary attributes (Elhoseiny et al., 2017b; Zhu et al., 2018). For this reason, for the VHE text decoder, we choose PGBN (Zhou et al., 2016), a state-of-the-art topic model that can also be represented as a multi-stochastic-layer deep generalization of LDA (Cong et al., 2017). We complete VHE-StackGAN++ by choosing the Weibull upward-downward variational encoder (Zhang et al., 2018a) as the VHE image encoder, and feeding the concatenation of all the hidden layers of PGBN as the source of randomness to the image generator of StackGAN++ (Zhang et al., 2017b). As in Fig.
1, we use a VHE that encodes an image into a deterministic-upward\u2013stochastic-downward ladder-structured latent representation, which is used to decode the corresponding text. Specifically, we represent each document as a BoW high-dimensional sparse count vector t_n ∈ Z^{K_0}, where Z = {0, 1, ...} and K_0 is the vocabulary size. For the VHE text decoder, we choose to use PGBN to extract hierarchical latent representation from t_n. PGBN consists of multiple gamma distributed stochastic hidden layers, generalizing the \u201cshallow\u201d Poisson factor analysis (Zhou et al., 2012; Zhou & Carin, 2015) into a deep setting. PGBN with L hidden layers, from top to bottom, is expressed as θ_n^(L) ~ Gam(r, 1/s_n^(L+1)), r ~ Gam(γ_0/K_L, 1/s_0), θ_n^(l) ~ Gam(Φ^(l+1) θ_n^(l+1), 1/s_n^(l+1)), l = L-1, ..., 2, 1, t_n ~ Pois(Φ^(1) θ_n^(1)), (5) where the hidden units θ_n^(l) ∈ R_+^{K_l} of layer l are factorized under the gamma likelihood into the product of topics Φ^(l) ∈ R_+^{K_{l-1} × K_l} and hidden units of the next layer, R_+ = {x : x ≥ 0}, s_n^(l) > 0, and K_l is the number of topics of layer l. If the texts are represented as binary attribute vectors b_n, we can add a Bernoulli-Poisson link layer as b_n = 1(t_n ≥ 1) (Zhou, 2015; Zhou et al., 2016). We place a Dirichlet prior on each column of Φ^(l). The topics can be organized into a directed acyclic graph (DAG), whose node k at layer l can be visualized with the top words of [∏_{t=1}^{l-1} Φ^(t)] φ_k^(l); the topics tend to be very general in the top layer and become increasingly more specific when moving downwards. This semantically meaningful latent hierarchy provides unique opportunities to build a better image generator by coupling the semantic hierarchical structures with visual ones. Let us denote Φ = {Φ^(1), ..., Φ^(L), r} as the set of global parameters of PGBN shown in (5).
Given Φ, we adopt the inference in Zhang et al. (2018a) to build a Weibull upward-downward variational image encoder as ∏_{n=1}^{N} ∏_{l=1}^{L} q(θ_n^(l) | x_n, Φ^(l+1), θ_n^(l+1)), where Φ^(L+1) := r, θ_n^(L+1) := ∅, and q(θ_n^(l) | x_n, Φ^(l+1), θ_n^(l+1)) = Weibull(k_n^(l) + Φ^(l+1) θ_n^(l+1), λ_n^(l)). (6) The Weibull distribution is used to approximate the gamma distributed conditional posterior, and its parameters k_n^(l), λ_n^(l) ∈ R^{K_l} are deterministically transformed from the convolutional neural network (CNN) image features f(x_n) (Szegedy et al., 2016), as shown in Fig. 1(a) and described in Appendix D.1. We denote Ω as the set of encoder parameters. We refer to Zhang et al. (2018a) for more details about this deterministic-upward\u2013stochastic-downward ladder-structured inference network, which is distinct from a usual inference network that has a pure bottom-up structure and only interacts with the generative model via the ELBO (Kingma & Welling, 2014; Gulrajani et al., 2017). The multi-stochastic-layer latent representation z = {θ^(l)}_{l=1}^{L} is the bridge between two modalities. As shown in Fig. 1(b), VHE-StackGAN++ simply randomizes the image generator of StackGAN++ (Zhang et al., 2017b) with the concatenated vector θ = [θ^(1), ..., θ^(L)]. We provide the overall objective function in (15) of Appendix D.2. Note that existing neural-network-based models (Gomez et al., 2017; Xu et al., 2018; Zhang et al., 2017a;b; Verma et al., 2018; Zhang et al., 2018b) are often able to perform unidirectional but not bidirectional transforms between images x and texts t. However, bidirectional transforms are straightforward for the proposed model, as z can be either drawn from the image encoder q(z|x) in (6), or drawn with an upward-downward Gibbs sampler (Zhou et al., 2016) from the conditional posteriors p(z|t) of the PGBN text decoder p(t|z) in (5).
\n2.3 VHE-RASTER-SCAN-GAN WITH A HIERARCHICAL-SEMANTIC MULTI-RESOLUTION IMAGE GENERATOR\nWhile we find that VHE-StackGAN++ has already achieved impressive results, its simple concatenation of θ^(l) does not fully exploit the semantically-meaningful hierarchical latent representation of the PGBN-based text decoder. For three DAG subnets inferred from three different datasets, as shown in Figs. 21-23 of Appendix C.7, the higher-layer PGBN topics match general visual concepts, such as those on shapes, colors, and backgrounds, while the lower-layer ones provide finer details. This motivates us to develop an image generator to exploit the semantic structure, which matches coarse-to-fine visual concepts, to gradually refine its generation. To this end, as shown in Fig. 1(c), we develop \u201craster-scan\u201d GAN that performs generation not only in a multi-scale low-to-high-resolution manner in each layer, but also a hierarchical-semantic coarse-to-fine fashion across layers. Suppose we are building a three-layer raster-scan GAN to generate an image of size 256^2. We randomly select an image x_n and then sample {θ_n^(l)}_{l=1}^{3} from the variational posterior ∏_{l=1}^{3} q(θ_n^(l) | x_n, Φ^(l+1), θ_n^(l+1)). First, the top-layer latent variable θ^(3), often capturing general semantic information, is transformed to hidden features h_i^(3) for the i-th branch: h_1^(3) = F_1^(3)(θ^(3)), h_i^(3) = F_i^(3)(h_{i-1}^(3), θ^(3)), i = 2, 3, (7) where F_i^(l) is a CNN. Second, having obtained {h_i^(3)}_{i=1}^{3}, generators {G_i^(3)}_{i=1}^{3} synthesize low-to-high-resolution image samples {s_i^(3) = G_i^(3)(h_i^(3))}_{i=1}^{3}, where s_1^(3), s_2^(3), and s_3^(3) are of size 16^2, 32^2, and 64^2, respectively.
Third, s_3^(3) is down-sampled to ŝ_3^(3) of size 32^2 and combined with the information from θ^(2) to provide the hidden features at layer two: h_1^(2) = C(F_1^(2)(θ^(2)), ŝ_3^(3)), h_i^(2) = F_i^(2)(h_{i-1}^(2), θ^(2)), i = 2, 3, (8) where C denotes concatenation along the channel. Fourth, the generators synthesize image samples {s_i^(2) = G_i^(2)(h_i^(2))}_{i=1}^{3}, where s_1^(2), s_2^(2), and s_3^(2) are of size 32^2, 64^2, and 128^2, respectively. The same process is then replicated at layer one to generate {s_i^(1) = G_i^(1)(h_i^(1))}_{i=1}^{3}, where s_1^(1), s_2^(1), and s_3^(1) are of size 64^2, 128^2, and 256^2, respectively, and s_3^(1) becomes a desired high-resolution synthesized image with fine details. The detailed structure of raster-scan-GAN is described in Fig. 26 of Appendix D.3. PyTorch code is provided to aid the understanding and help reproduce the results. Different from many existing methods (Gomez et al., 2017; Reed et al., 2016; Xu et al., 2018; Zhang et al., 2017b) whose textual feature extraction is separated from the end task, VHE-raster-scan-GAN performs joint optimization. As described in detail in the Algorithm in Appendix E, at each mini-batch based iteration, after updating Φ by the topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC of Cong et al. (2017), a Weibull distribution based reparameterization gradient (Zhang et al., 2018a) is used to end-to-end optimize the following objective: min_{Ω, {G_i^(l)}_{i,l}} max_{{D_i^(l)}_{i,l}} E_{p_data(x_n, t_n)} E_{∏_{l=1}^{3} q(θ_n^(l) | x_n, Φ^(l+1), θ_n^(l+1))} { -log p(t_n | Φ^(1), θ_n^(1)) + Σ_{l=1}^{3} KL[q(θ_n^(l) | x_n, Φ^(l+1), θ_n^(l+1)) || p(θ_n^(l) | Φ^(l+1), θ_n^(l+1))] + Σ_{l=1}^{3} Σ_{i=1}^{3} [log D_i^(l)(x_{n,i}^(l), θ_n^(l)) + log(1 - D_i^(l)(G_i^(l)(θ_n^(l)), θ_n^(l)))] }, (9) where {x_{n,i}^(l)}_{i=1,l=1}^{3,3} denote different resolutions of x_n, corresponding to {s_{n,i}^(l)}_{i=1,l=1}^{3,3}.\n2.4 RELATED WORK ON JOINT IMAGE-TEXT LEARNING\nGomez et al. (2017) develop a CNN to learn a transformation from images to textual features pre-extracted by LDA.
GANs have been exploited to generate images given pre-learned textual features extracted by RNNs (Denton et al., 2015; Reed et al., 2016; Zhang et al., 2017a; Xu et al., 2018; Zhang et al., 2018b; Li et al., 2019). All these works need a pre-trained linguistic model based on large-scale extra text data, and the transformations between images and texts are only unidirectional. The recently proposed Obj-GAN (Li et al., 2019) needs even more side information, such as the locations and labels of objects inside images, which could be difficult and costly to acquire in practice. On the other hand, probabilistic graphical model based methods (Srivastava & Salakhutdinov, 2012b;a; Wang et al., 2018) are proposed to learn a joint latent space for images and texts to realize bidirectional transformations, but their image generators are often limited to generating low-level image features. By contrast, VHE-raster-scan-GAN performs bidirectional end-to-end learning to capture and relate hierarchical visual and semantic concepts across multiple stochastic layers, and is capable of a wide variety of joint image-text learning and generation tasks, as described below.\n3 EXPERIMENTAL RESULTS\nFor joint image-text learning, following previous work, we evaluate the proposed VHE-StackGAN++ and VHE-raster-scan-GAN on three datasets: CUB (Wah et al., 2011), Flower (Nilsback & Zisserman, 2008), and COCO (Lin et al., 2014), as described in Appendix F. Besides the usual text-to-image generation task, due to the distinct bidirectional inference capability of the proposed models, we can perform a rich set of additional tasks such as image-to-text, image-to-image, and noise-to-image-text-pair generation. Due to space constraints, we present below some representative results, and defer additional ones to the Appendix. We provide the details of our experimental settings in Appendix F. PyTorch code is provided at https://github.com/BoChenGroup/VHE-GAN.
\nTable 1: Inception score (IS, larger is better) and Frechet inception distance (FID, smaller is better) of StackGAN++ (Zhang et al., 2017b), HDGAN (Zhang et al., 2018b), AttnGAN (Xu et al., 2018), Obj-GAN (Li et al., 2019), and the proposed VHE-raster-scan-GAN; the values labeled with * are calculated by the provided well-trained models and the others are quoted from the original publications; see Tab. 5 in Appendix C.1 for the error bars of IS. Note that while the FID of Obj-GAN is the lowest, it does not necessarily imply it produces high-quality images, as shown in Figs. 13 and 27; this is because FID only measures the similarity on the image feature space, but ignores the shapes of objects and diversity of generated images. More discussions can be found in Section 3.1 and Appendix G.\nDataset | StackGAN++ IS / FID | HDGAN IS / FID | AttnGAN IS / FID | Obj-GAN IS / FID | VHE-raster-scan-GAN IS / FID\nFlower | 3.26 / 48.68 | 3.45 / 40.12* | - / - | - / - | 3.72 / 35.13\nCUB | 3.84 / 15.30 | 4.15 / 13.48* | 4.36 / 13.02* | - / - | 4.41 / 12.02\nCOCO | 8.30 / 81.59 | 11.86 / 78.16* | 25.89 / 77.01* | 26.58* / 36.98* | 27.16 / 75.88\nTable 2: Ablation study for text-to-image learning, where the structures of different variations of raster-scan-GAN are illustrated in Figs. 1(b), 1(d), and 1(e).\nDataset | PGBN+StackGAN++ IS / FID | VHE-vanilla-GAN IS / FID | VHE-StackGAN++ IS / FID | VHE-simple-raster-scan-GAN IS / FID\nFlower | 3.29 / 41.04 | 3.01 / 52.15 | 3.56 / 38.66 | 3.62 / 36.18\nCUB | 3.92 / 13.79 | 3.52 / 21.24 | 4.20 / 12.93 | 4.31 / 12.35\nCOCO | 10.63 / 79.65 | 6.36 / 97.15 | 12.63 / 78.02 | 20.13 / 77.18\n[Figure 2 images omitted. Test texts, in order: (1) Brown duck playing on the lake making a poodle. (2) This bird is yellow with grey wings and a black crown. (3) An all black bird with a thick, round black bill. (4) This flower has long, curling orange petals with dark red spots. (5) This is a purple bell shaped flower, with a yellow pistil and long stigma. (6) This flower contains hundred of needle like yellow petals around the brighter yellow stamen. (7) A wooden desk topped with a laptop computer. (8) A very dark city street with cars and buildings. Methods compared: StackGAN++, HDGAN, AttnGAN (N/A on Flower), VHE-StackGAN++, and VHE-raster-scan-GAN.]\nFigure 2: Comparison on image generation given texts from CUB, Flower, and COCO. Shown in the top row are the textual descriptions and their associated real images; see Appendix C.2 for higher-resolution images. Note AttnGAN did not perform experiments on Flower and hence its results on Flower are not shown, and since Obj-GAN only performed experiments on COCO, we defer its visual results to Appendix C.3.\n3.1 TEXT-TO-IMAGE LEARNING\nAlthough the proposed VHE-GANs do not have a text encoder to directly project a document to the shared latent space, given a document and a set of topics inferred during training, we use the upward-downward Gibbs sampler of Zhou et al. (2016) to draw {θ^(l)}_{l=1}^{L} from its conditional posterior under PGBN, which are then fed into the GAN image generator to synthesize random images. Text-to-image generation: In Tab. 1, with inception score (IS) (Salimans et al., 2016) and Frechet inception distance (FID) (Heusel et al., 2017), we compare our models with three state-of-the-art GANs in text-to-image generation. For visualization, we show in the top row of Fig. 2 different test textual descriptions and the real images associated with them, and in the other rows random images generated conditioning on these textual descriptions by different algorithms. Higher-resolution images are shown in Appendix C.2. We also provide example results on COCO, a much more challenging dataset, in Fig. 13 of Appendix C.3. It is clear from Fig.
2 that although both StackGAN++ (Zhang et al., 2017b) and HDGAN (Zhang et al., 2018b) generate photo-realistic images nicely matched to the given texts, they often misrepresent or ignore some key textual information, such as \u201cblack crown\u201d for the 2nd test text, \u201cyellow pistil\u201d for the 5th, \u201cyellow stamen\u201d for the 6th, and \u201ccomputer\u201d for the 7th. These observations also apply to AttnGAN (Xu et al., 2018). By contrast, both the proposed VHE-StackGAN++ and VHE-raster-scan-GAN do a better job in capturing and faithfully representing this key textual information in their generated images. Fig. 13 for COCO further shows the advantages of VHE-raster-scan-GAN in better representing the given textual information in its generated images.\n[Figure 3 images omitted. Recoverable panel text: the five-attribute inputs \u201ccolorful body, blue head, green back, solid tail, yellow belly\u201d and \u201cred body, red head, black wings, long bill, black belly\u201d; and the class document excerpt: Class name: Rhinoceros Auklet. It is a seabird, nesting in seabird colonies, with a large orange/brown bill. Plumage is dark on top and paler below, in offshore and inshore water. Sometimes it swim in the water and sometimes it stand on the strong.]\nFigure 3: Example results of VHE-raster-scan-GAN on three different tasks: (a) image generation given five textual attributes; (b) image generation given a long class-specific document (showing three representative sentences for brevity) from CUB; and (c) latent space interpolation for joint image-text generation on CUB (left column) and Flower (right column), where the texts in the first and last row are given.\nNote Obj-GAN, which learns a bounding box generator that restricts object locations, obtains the lowest FID on COCO. However, it appears that this type of restriction significantly improves FID at the expense of sacrificing the diversity of generated images given text, as shown in Fig. 27 of Appendix G. From the results in Fig. 13, it also appears that Obj-GAN overly emphasizes correctly arranging the spatial locations of different visual features, which is important to achieve low FID, but does not do well in generating correct object shapes, which is important for visual quality.
Besides, the training of Obj-GAN requires more side information, including the locations and labels of objects in the images, which are often not provided in practice (e.g., neither CUB nor Flower comes with this type of side information). While the proposed VHE-GAN models do not need this additional side information, they could be further improved by following Obj-GAN to take it into consideration. As discussed in Section 2.2, compared with sequence models, topic models can be applied to more diverse textual descriptions, including textual attributes and long documents. For illustration, we show in Figs. 3(a) and 3(b) example images generated conditioning on a set of textual attributes and an encyclopedia document, respectively. These synthesized images are photo-realistic and their visual contents match the semantics of the given texts well. Trained on CelebA (Liu et al., 2015), we provide in Fig. 9 examples of facial image generation given attributes; see Appendix B for details. Ablation studies: We also consider several ablation studies for text-to-image generation, as shown in Tab. 2. First, we modify StackGAN++ (Zhang et al., 2017b), using the text features extracted by PGBN to replace the original ones extracted by an RNN, referred to as PGBN+StackGAN++. It is clear that PGBN+StackGAN++ outperforms the original StackGAN++, but underperforms VHE-StackGAN++, which can be explained by two factors: 1) the PGBN deep topic model is more effective in extracting macro-level textual information, such as key words, than RNNs; and 2) jointly end-to-end training the textual feature extractor and image encoder, discriminator, and generator helps better capture and relate the visual and semantic concepts. Second, note that VHE-StackGAN++ has the same structured image generator as both StackGAN++ and HDGAN do, but performs better than them.
We attribute its performance gain to two factors: 1) its PGBN deep topic model helps better capture key semantic information from the textual descriptions; and 2) it performs end-to-end joint image-text learning via the VHE-GAN framework, rather than separating the extraction of textual features from text-to-image generation. Third, VHE-vanilla-GAN underperforms VHE-StackGAN++, suggesting that the stacking structure is helpful for generating high-resolution images, as previously verified in Zhang et al. (2017a). VHE-simple-raster-scan-GAN outperforms VHE-StackGAN++ but underperforms VHE-raster-scan-GAN, confirming the benefits of combining the stacking and raster-scan structures. More visual results for ablation studies can be found in Appendix C.2. Below we focus on illustrating the performance of VHE-raster-scan-GAN.
Latent space interpolation: In order to understand the jointly learned image and text manifolds, given texts t_1 and t_2, we draw θ_1 and θ_2 and use the interpolated variables between them to generate both images via the GAN’s image generator and texts via the PGBN text decoder. As in Fig. 3(c), the first row shows the true text t_1 and images generated with θ_1, the last row shows t_2 and images generated with θ_2, and the second to fourth rows show the generated texts and images with the interpolations from θ_1 to θ_2. The strong correspondences between the generated images and texts, with smooth changes in colors, object positions, and backgrounds between adjacent rows, suggest that the latent space of VHE-raster-scan-GAN is both visually and semantically meaningful. Additional, more finely gridded latent space interpolation results are shown in Figs. 15-18 of Appendix C.4.
Visualization of captured semantic and visual concepts: Zhou et al.
(2016) show that the semantic concepts extracted by PGBN and their hierarchical relationships can be represented as a DAG, only a subnet of which will be activated given a specific text input. In each subplot of Fig. 4, we visualize example topic nodes of the DAG subnet activated by the given text input, and show the corresponding images generated at different hidden layers. There is a good match at each layer between the visual contents of the generated images and the semantics of the top activated topics, which are mainly about general shapes, colors, or backgrounds at the top layer, and become more and more fine-grained when moving downward. In Fig. 5, for the DAG learned on COCO, we show a representative subnet that is rooted at a top-layer node about “rooms and objects at home,” and provide both semantic and visual representations for each node. Being able to capture and relate hierarchical semantic and visual concepts helps explain the state-of-the-art performance of VHE-raster-scan-GAN.
3.2 IMAGE-TO-TEXT LEARNING
VHE-raster-scan-GAN can perform a wide variety of extra tasks, such as image-to-text generation, text-based zero-shot learning (ZSL), and image retrieval given a text query. In particular, given image x_n, we draw t̂_n as t̂_n | θ_n ∼ p(t | Φ, θ_n), θ_n | x_n ∼ q(θ | Φ, x_n) and use it for downstream tasks.
Image-to-text generation: Given an image, we may generate some key words, as shown in Fig. 6(a), where the true and generated ones are displayed on the left and right of the input image, respectively. It is clear that VHE-raster-scan-GAN successfully captures the object colors, shapes, locations, and backgrounds to predict relevant key words.
Text-based ZSL: Text-based ZSL is a specific task that learns a relationship between images and texts on the seen classes and transfers it to the unseen ones (Fu et al., 2018). We follow the same settings on CUB and Flower as existing text-based ZSL methods summarized in Tab. 3.
There are two default splits for CUB, the hard one (CUB-H) and the easy one (CUB-E), and one split setting for Flower, as described in Appendix F. Note that except for our models, which infer a shared semantically meaningful latent space between two modalities, none of the other methods have generative models for both modalities, regardless of whether they learn a classifier or a distance metric in a latent space for ZSL. Tab. 3 shows that VHE-raster-scan-GAN clearly outperforms the state of the art in terms of Top-1 accuracy on both CUB-H and Flower, and is comparable to the second best on CUB-E (it is the best among all methods that have reported their Top-5 accuracies on CUB-E). Note that for CUB-E, every unseen class has some corresponding seen classes under the same super-category, which makes the classification surface or distance metric learned on the seen classes easier to generalize to the unseen ones. We also note that both GAZSL and ZSLPP rely on visual part detection to extract image features, making their performance sensitive to the quality of the visual part detector, which often has to be elaborately tuned for different classes and hence limits their generalization ability; for example, the visual part detector for birds is not suitable for flowers. Tab. 3 also includes the results of ZSL using VHE, which show that given the same structured text decoder and image encoder, VHE consistently underperforms both VHE-StackGAN++ and VHE-raster-scan-GAN. This suggests 1) the advantage of a joint generation of two modalities, and 2) the ability of GAN in helping VHE achieve better data representation. The results in Tab. 3 also show that the ZSL performance of VHE-raster-scan-GAN has a clear trend of improvement as PGBN becomes deeper, suggesting the advantage of having a multi-stochastic-hidden-layer deep topic model for text generation.
We also collect the ZSL results of the last 1000 mini-batch based stochastic gradient update iterations to calculate the error bars. For existing methods, since no error bars are provided in the published papers, we only provide the error bars of the methods that have publicly accessible code.

[Figure 4 panels (a)-(c): for each input text, the three most active topics at layers 3, 2, and 1, the real image, and one generated image per layer. Input texts: (a) “This bright colored red flower on the green leaves has petals that surround the ovary in a ruffled wavy manner.” (b) “This bright red colored bird with dark rounded eyes, grey wing and brown beak are standing.” (c) “The picture shows a view of village having blue fine sky, low house, grey road and green trees.”]
Figure 4: Visualization of example semantic and visual concepts captured by a three-stochastic-hidden-layer VHE-raster-scan-GAN from (a) Flower, (b) Bird, and (c) COCO. In each subplot, given the real text t_n shown at the bottom, we draw {θ_n^(l)}_{l=1}^{3} via Gibbs sampling; we show the three most active topics in Φ^(l) (ranked by the weights of θ_n^(l)) at layer l = 3, 2, 1, where each topic is visualized by its top three words; and we feed {θ_n^(l)}_{l=1}^{3} into raster-scan-GAN to generate three random images (one per layer, coarse to fine from layers 3 to 1).

Table 3: Accuracy (%) of ZSL on CUB and Flower. Note that some of them are attribute-based methods but applicable in our setting by replacing attribute vectors with text features (labeled by *), as discussed in Elhoseiny et al. (2017b).
Method                                  CUB-H top-1   CUB-E top-1   CUB-E top-5   Flower top-1
WAC-Kernel (Elhoseiny et al., 2017a)    7.7±0.28      33.5±0.22     64.3±0.20     9.1±2.77
ZSLNS (Qiao et al., 2016)               7.3±0.36      29.1±0.28     61.8±0.22     8.7±2.46
ESZSL* (Romeraparedes & Torr, 2015)     7.4±0.31      28.5±0.26     59.9±0.20     8.6±2.53
SynC* (Changpinyo et al., 2016)         8.6           28.0          61.3          8.2
ZSLPP (Elhoseiny et al., 2017b)         9.7           37.2          –             –
GAZSL (Zhu et al., 2018)                10.3±0.26     43.7±0.28     67.61±0.24    –
VHE-L3                                  14.0±0.24     34.6±0.25     64.6±0.20     8.9±1.57
VHE-StackGAN++-L3                       16.1          38.5          68.2          10.6
VHE-raster-scan-GAN-L1                  11.7±0.31     32.1±0.32     62.6±0.33     9.4±1.68
VHE-raster-scan-GAN-L2                  14.9±0.26     37.1±0.24     64.6±0.25     11.0±1.54
VHE-raster-scan-GAN-L3                  16.7±0.24     39.6±0.20     70.3±0.18     12.1±1.47

3.3 IMAGE/TEXT RETRIEVAL
As discussed in Section 2.4, the proposed models are able to infer the shared latent space given either an image or text. We test both VHE-StackGAN++ and VHE-raster-scan-GAN on the same image/text retrieval tasks as in TA-GAN (Nam et al., 2018), where we use the cosine distance between the latent representations inferred from images (q(θ | x), via the image encoder) and those inferred from texts (p(θ | t), via Gibbs sampling) to compute the similarity scores. Similar to TA-GAN, the top-1 image-to-text retrieval accuracy (Top-1 Acc) and the percentage of matching images in the top-50 text-to-image retrieval results (AP@50) on the CUB-E dataset are used to measure the performance. As shown in Table 4, VHE-raster-scan-GAN clearly outperforms AttnGAN (Xu et al., 2018) and is comparable with TA-GAN. Note that TA-GAN needs to extract its text features based on the fastText model (Bojanowski et al., 2017) pre-trained on a large corpus, while VHE-raster-scan-GAN learns everything directly from the current dataset in an end-to-end manner. Also, VHE-raster-scan-GAN outperforms VHE-StackGAN++, which further confirms the benefits of combining both the stacking and raster-scan structures.

Table 4: Comparison of the image-to-text retrieval performance, measured by Top-1 accuracy, and text-to-image retrieval performance, measured by AP@50, between different methods on CUB-E.
Method                          Top-1 Acc (%)   AP@50 (%)
CNN-LSTM (Li et al., 2017)      61.5            57.6
AttnGAN (Xu et al., 2018)       55.1            51.0
TA-GAN (Nam et al., 2018)       61.3            62.8
VHE-StackGAN++                  60.2            61.3
VHE-raster-scan-GAN             61.7            62.6

3.4 GENERATION OF RANDOM TEXT-IMAGE PAIRS
Below we show how to generate data samples that contain both modalities. After training a three-stochastic-hidden-layer VHE-raster-scan-GAN, following the data generation process of the PGBN text decoder, given {Φ^(l)}_{l=1}^{3} and r, we first generate θ^(3) ∼ Gam(r, 1/s^(4)) and then downward propagate it through the PGBN as in (5) to calculate the Poisson rates for all words using Φ^(1) θ^(1).
Given a random draw, {θ^(l)}_{l=1}^{3} is fed into the raster-scan-GAN image generator to generate a corresponding image. Shown in Fig. 6(b) are six random draws, for each of which we show its top seven words and generated image, whose relationships are clearly interpretable, suggesting that VHE-raster-scan-GAN is able to encode the key information of both modalities and the relationships between them. In addition to the tasks shown above, VHE-raster-scan-GAN can also be used to perform image retrieval given a text query, and image regeneration; see Appendices C.5 and C.6 for example results on these additional tasks.

[Figure 5 graphic: topic nodes of the hierarchy, each shown with its top words and example images.]
Figure 5: An example topic hierarchy learned on COCO and its visual representation. We sample θ_n^(1:3) ∼ q(θ_n^(1:3) | Φ, x_n) for all n; for topic node k of layer l, we show both its top words and the top two images ranked by their activations θ_nk^(l).

[Figure 6 graphic: (a) input images with true and generated tags; (b) generated images with their top words.]
Figure 6: Example results of using VHE-raster-scan-GAN for (a) image-to-textual-tags generation, where the generated tags highlighted in red are included in the original ones; (b) image-text-pair generations (columns from left to right are based on Flower, CUB, and COCO, respectively).

4 CONCLUSION
We develop variational hetero-encoder randomized generative adversarial network (VHE-GAN) to provide a plug-and-play joint image-text modeling framework. VHE-GAN is a versatile deep generative model that integrates off-the-shelf image encoders, text decoders, and GAN image discriminators and generators into a coherent end-to-end learning objective. It couples its VHE and GAN components by feeding the VHE variational posterior in lieu of noise as the source of randomness of the GAN generator. We show that VHE-StackGAN++, which combines the Poisson gamma belief network, a deep topic model, with StackGAN++, achieves competitive performance, and that VHE-raster-scan-GAN, which further improves VHE-StackGAN++ by exploiting the semantically meaningful hierarchical structure of the deep topic model, generates photo-realistic images not only in a multi-scale low-to-high-resolution manner, but also in a hierarchical-semantic coarse-to-fine fashion, achieving strong results in many challenging image-to-text, text-to-image, and joint text-image learning and generation tasks.

ACKNOWLEDGEMENTS
B.
Chen acknowledges the support of the Program for Young Thousand Talent by Chinese Central Government, the 111 Project (No. B18039), NSFC (61771361), NSFC for Distinguished Young Scholars (61525105), Shaanxi Innovation Team Project, and the Innovation Fund of Xidian University. M. Zhou acknowledges the support of the U.S. National Science Foundation under Grant IIS-1812699. REFERENCES Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for \ufb01ne-grained image classi\ufb01cation. In CVPR , pp. 2927\u20132936, 2015. Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research , 3(6):1137\u20131155, 2003. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research , 3:993\u20131022, 2003. David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association , 112(518):859\u2013877, 2017. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics , 5:135\u2013146, 2017. Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classi\ufb01ers for zero-shot learning. In CVPR , pp. 5327\u20135336, 2016. Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. In ICLR , 2017. Yulai Cong, Bo Chen, Hongwei Liu, and Mingyuan Zhou. Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In ICML , 2017. Emily L Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS , pp. 1486\u20131494, 2015. Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 
TopicRNN: A recurrent neural network with long-range semantic dependency. In ICLR, 2017. Jeff Donahue, Philipp Krahenbuhl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017. Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron C Courville. Adversarially learned inference. In ICLR, 2017. Mohamed Elhoseiny, Ahmed M Elgammal, and Babak Saleh. Write a classifier: Predicting visual classifiers from unstructured text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2539–2553, 2017a. Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and Ahmed M Elgammal. Link the head to the “beak”: Zero shot learning from noisy text description at part precision. In CVPR, pp. 6288–6297, 2017b. Yanwei Fu, Tao Xiang, Yu Gang Jiang, Xiangyang Xue, Leonid Sigal, and Shaogang Gong. Recent advances in zero-shot recognition. IEEE Signal Processing Magazine, 35, 2018. Lluis Gomez, Yash Patel, Marcal Rusinol, Dimosthenis Karatzas, and C V Jawahar. Self-supervised learning of visual features through embedding images into text topic spaces. In CVPR, pp. 2017–2026, 2017. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014. Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In AAAI, pp. 3069–3076, 2018. Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron C Courville. PixelVAE: A latent variable model for natural images. In ICLR, 2017. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pp. 6626–6637, 2017.
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. Matthew D Hoffman and Matthew J Johnson. ELBO surgery: Yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016. Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013. Huaibo Huang, Zhihang Li, Ran He, Zhenan Sun, and Tieniu Tan. IntroVAE: Introspective variational autoencoders for photographic image synthesis. In NeurIPS, 2018. Junqi Jin, Kun Fu, Runpeng Cui, Fei Sha, and Changshui Zhang. Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. In CVPR, 2015. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Diederik P Kingma and Max Welling. Stochastic gradient VB and the variational auto-encoder. In ICLR, 2014. Ryan Kiros and Csaba Szepesvari. Deep representations and codes for image auto-annotation. pp. 908–916, 2012. Anders Boesen Lindbo Larsen, Soren Kaae Sonderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, pp. 1558–1566, 2016. Jey Han Lau, Timothy Baldwin, and Trevor Cohn. Topically driven neural language model. In ACL, pp. 355–365, 2017. Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, and Xiaogang Wang. Identity-aware textual-visual matching with latent co-attention. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1890–1899, 2017. Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12174–12182, 2019.
Tsungyi Lin, Michael Maire, Serge J Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pp. 740–755, 2014. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015. Lars Mescheder, S Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, pp. 2391–2400. PMLR, 2017. Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: Manipulating images with natural language. In Advances in Neural Information Processing Systems, pp. 42–51, 2018. Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, pp. 722–729. IEEE, 2008. Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton Van Den Hengel. Less is more: Zero-shot learning from online textual documents with noise suppression. In CVPR, pp. 2249–2257, 2016. Scott E Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, pp. 1060–1069, 2016. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286, 2014. Bernardino Romeraparedes and Philip H S Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pp. 2152–2161, 2015.
Tim Salimans, Ian J Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, pp. 2234–2242, 2016. Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles A Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In NIPS, pp. 3308–3318, 2017. Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, pp. 2222–2230, 2012a. Nitish Srivastava and Ruslan Salakhutdinov. Learning representations for multimodal data with deep belief nets. In NIPS workshop, pp. 2222–2230, 2012b. Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15(1):2949–2980, 2014. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826, 2016. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In ICLR, 2018. Vinay Kumar Verma, Gundeep Arora, Ashish Kumar Mishra, and Piyush Rai. Generalized zero-shot learning via synthesized examples. In CVPR, pp. 4281–4289, 2018. C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011. Chaojie Wang, Bo Chen, and Mingyuan Zhou. Multimodal Poisson gamma belief network. In AAAI, 2018. Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. Multi-document summarization using sentence-based topic models. In ACL, pp. 297–300, 2009. Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, pp. 1316–1324, 2018. Han Zhang, Tao Xu, and Hongsheng Li.
StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In CVPR, 2017a. Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2017b. Hao Zhang, Bo Chen, Dandan Guo, and Mingyuan Zhou. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In ICLR, 2018a. Zizhao Zhang, Yuanpu Xie, and Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In CVPR, 2018b. Mingyuan Zhou. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, pp. 1135–1143, 2015. Mingyuan Zhou and Lawrence Carin. Negative binomial process count and mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell., 37(2):307–320, 2015. Mingyuan Zhou, Lauren Hannah, David Dunson, and Lawrence Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, pp. 1462–1471, 2012. Mingyuan Zhou, Yulai Cong, and Bo Chen. Augmentable gamma belief networks. Journal of Machine Learning Research, 17(163):1–44, 2016. Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed M Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018.

A MODEL PROPERTY OF VHE-GAN AND RELATED WORK
Let us denote q(z) = E_{x∼p_data(x)}[q(z | x)] = (1/N) Σ_{n=1}^{N} q(z | x_n) as the aggregated posterior (Hoffman & Johnson, 2016; Makhzani et al., 2015).
Removing the triple-data-reuse training strategy, we can re-express the VHE-GAN objective in (4) as
min_{E, G_vae, G_gan} max_D [−ELBO_vhe + L_gan],   L_gan := E_{x∼p_data(x)} ln D(x) + E_{z∼q(z)} ln(1 − D(G_gan(z))),   (10)
which corresponds to a naive combination of the VHE and GAN training objectives, where the data samples used to train the VHE, GAN generator, and GAN discriminator in each gradient update iteration are not required to be the same. While the naive objective function in (10) differs from the true one in (4) that is used to train VHE-GAN, it simplifies the analysis of its theoretical property, as described below. Let us denote q(z, x, t) := q(z | x) p_data(x, t) as the joint distribution of (x, t) and z under the VHE variational posterior q(z | x), I_q(x, z) := E_{q(z,x)}[ln (q(z, x) / (q(z) p_data(x)))] as the mutual information between x ∼ p_data(x) and z ∼ q(z), and JSD(p_1 || p_2) := (1/2) KL[p_1 || (p_1 + p_2)/2] + (1/2) KL[p_2 || (p_1 + p_2)/2] as the Jensen–Shannon divergence between distributions p_1 and p_2. Similar to the analysis in Hoffman & Johnson (2016), the VHE’s ELBO can be rewritten as
ELBO_vhe = E_{q(z,x,t)}[log p(t | z)] − I_q(x, z) − KL[q(z) || p(z)],
where the mutual information term can also be expressed as I_q(x, z) = E_{x∼p_data(x)} KL[q(z | x) || q(z)]. Thus maximizing the ELBO encourages the mutual information term I_q(x, z) to be minimized, which means that while the data reconstruction term E_{q(z,x,t)}[log p(t | z)] needs to be maximized, part of the VHE optimization objective penalizes a z from carrying the information of the x that it is encoded from. This mechanism helps provide necessary regularization to prevent overfitting. As in Goodfellow et al. (2014), with an optimal discriminator D*_G for generator G, we have
L_gan(D*_G, G) = −ln 4 + 2 JSD(p_data(x) || p_{G_z}(x)),
where p_{G_z}(x) denotes the distribution of the generated data G(z) that uses z ∼ q(z) as the random source fed into the GAN generator. The JSD term is minimized when p_{G_z}(x) = p_data(x).
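The identity from Goodfellow et al. (2014) used above can be checked numerically on a toy discrete example: for the L_gan defined in (10), plugging in the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_G(x)) gives −ln 4 + 2 JSD(p_data || p_G). The sketch below is only an illustration under stated assumptions: the two three-point distributions and all function names are hypothetical and are not taken from the paper.

```python
import math

def kl(p, q):
    # KL[p || q] for two discrete distributions given as lists of probabilities.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence: average KL of p and q to their mixture m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value_at_optimal_d(p_data, p_gen):
    # E_data[ln D*(x)] + E_gen[ln(1 - D*(x))] with D*(x) = p_data / (p_data + p_gen).
    total = 0.0
    for pd, pg in zip(p_data, p_gen):
        if pd + pg == 0:
            continue  # x has no mass under either distribution
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1 - d_star)
    return total

p_data = [0.5, 0.3, 0.2]  # toy data distribution (assumption)
p_gen = [0.2, 0.2, 0.6]   # toy generator distribution (assumption)
lhs = gan_value_at_optimal_d(p_data, p_gen)
rhs = -math.log(4) + 2 * jsd(p_data, p_gen)
```

Here lhs and rhs agree up to floating-point error, and when the generator distribution equals the data distribution the JSD term vanishes so the value reduces to −ln 4.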
With these analyses, given an optimal GAN discriminator, the naive VHE-GAN objective function in (10) reduces, up to an additive constant, to
min_{E, G_gan, G_vae} −E_{q(z,x,t)}[log p(t | z)] + KL[q(z) || p(z)] + I_q(x, z) + 2 JSD(p_data(x) || p_{G_z}(x)).   (11)
From the VHE’s point of view, examining (11) shows that it alleviates the inherent conflict in VHE of maximizing the ELBO and maximizing the mutual information I_q(x, z). This is because while the VHE part of VHE-GAN still relies on minimizing I_q(x, z) to regularize the learning, the GAN part tries to transform q(z) through the GAN generator to match the true data distribution p_data(x). In other words, while its VHE part penalizes a z from carrying the information about the x that it is encoded from, its GAN part encourages a z to carry information about the true data distribution p_data(x), but not necessarily the observed x that it is encoded from. From the GAN’s point of view, examining (11) shows that it provides GAN with a meaningful latent space, necessary for performing inference and data reconstruction (with the aid of the data-triple-use training strategy). More specifically, this latent representation is also used by the VHE to maximize the data log-likelihood, a training procedure that tries to cover all modes of the empirical data distribution rather than dropping modes. For VHE-GAN (4), the source distribution is q(z | x), not only allowing GANs to participate in posterior inference and data reconstruction, but also helping GANs resist mode collapse. In the following, we discuss some related works on combining VAEs and GANs.

A.1 RELATED WORK ON COMBINING VAES AND GANS
Examples in improving VAEs with adversarial learning include Mescheder et al.
(2017), which allows the VAEs to take an implicit encoder distribution, and adversarial auto-encoder (Makhzani et al., 2015) and Wasserstein auto-encoder (Tolstikhin et al., 2018), which drop the mutual information term from the ELBO and use adversarial learning to match the aggregated posterior and prior. Examples in allowing GANs to perform inference include Dumoulin et al. (2017) and Donahue et al. (2017), which use GANs to match the joint distribution q(z | x) p_data(x) defined by the encoder and the one p(x | z) p(z) defined by the generator. However, they often do not provide good data reconstruction. Examples in using VAEs or maximum likelihood to help GANs resist mode collapse include Che et al. (2017); Srivastava et al. (2017); Grover et al. (2018). Another example is VAEGAN (Larsen et al., 2016), which combines unit-wise likelihood at a hidden layer and adversarial loss in the original space, but its update of the encoder is separated from the GAN mini-max objective. On the contrary, IntroVAE (Huang et al., 2018) retains the pixel-wise likelihood with an adversarial regularization on the latent space. Sharing the network between the VAE decoder and GAN generator in VAEGAN and IntroVAE, however, limits them to modeling a single modality.

B MORE DISCUSSION ON SEQUENCE MODELS AND TOPIC MODELS IN TEXT ANALYSIS
In Section 3.1, we have discussed two models to represent the text: sequence models and topic models.
Considering the versatility of topic models (Wang et al., 2009; Jin et al., 2015; Zhou et al., 2016; Srivastava & Salakhutdinov, 2012a; 2014; Wang et al., 2018; Elhoseiny et al., 2017b; Zhu et al., 2018) in dealing with different types of textual information, and their effectiveness in capturing latent topics that are often directly related to macro-level visual information (Gomez et al., 2017; Dieng et al., 2017; Lau et al., 2017), we choose a state-of-the-art deep topic model, PGBN, to model the textual descriptions in VHE. Due to space constraints, we only provide simple illustrations in Figs. 3(a) and 3(b). In this section, more insights and discussions are provided.

[Figure 7 graphic: generated bird images, each group labeled with five attributes such as “red body,” “black belly,” and “long bill.”]
Figure 7: Generated random images by VHE-raster-scan-GAN conditioning on five binary attributes.

As discussed before, topic models are able to model non-sequential texts such as binary attributes. The CUB dataset provides 312 binary attributes (Wah et al., 2011) for each image, such as whether “crown color is blue” and whether “tail shape is solid,” to define the color or shape of different body parts of a bird. We first transform these binary attributes for the nth image to a 312-dimensional binary vector t_n, whose ith element is 1 or 0 depending on whether the bird in this image owns the ith attribute or not. The binary attribute vectors t_n are used together with the corresponding bird images x_n to train VHE-raster-scan-GAN. As shown in Fig.
7, we generate images given five binary attributes, which are formed into a 312-dimensional binary vector $t$ (with five non-zero elements at these five attributes) that becomes the input to the PGBN text decoder. Clearly, these generated images are photo-realistic and faithfully represent the five provided attributes. The proposed VHE-GANs can also model long documents well. In text-based ZSL discussed in Section 3.2, each class (not each image) is represented as a long encyclopedia document, whose global semantic structure is hard to capture with existing sequence models. Besides the good ZSL performance achieved by VHE-raster-scan-GAN, which illustrates its advantage in text generation given images, we show in Fig. 8 example results of image generation conditioning on long encyclopedia documents on the unseen classes of CUB-E (Qiao et al., 2016; Akata et al., 2015) and Flower (Elhoseiny et al., 2017a). 
Figure 8: Image generation conditioning on long encyclopedia documents using VHE-raster-scan-GAN trained on (a) CUB-E and (b) Flower. Shown in the top part of each subplot are representative sentences taken from the long document that describes an unseen class; for the three rows of images shown in the bottom part, the first row includes three real images from the corresponding unseen class, and the other two rows include a total of six randomly generated images conditioning on the long encyclopedia document of the corresponding unseen class. Analogous to how the bird images are generated in Fig. 7, we also perform facial image generation given a set of textual attributes. On the CelebA dataset, given attributes, we train VHE-StackGAN++ and VHE-raster-scan-GAN to generate facial images with resolution 128\u00d7128. As shown in Fig. 9, after 20 epochs of training, we generate facial images given five attributes. While the facial images generated by both models nicely match the given attributes, VHE-raster-scan-GAN provides higher visual quality and does a better job in representing the details. Figure 9: Example results of facial image generation conditioning on five textual attributes, by VHE-StackGAN++ and VHE-raster-scan-GAN trained on the CelebA dataset. Both models are trained for 20 epochs, with the output resolution set as 128\u00d7128. Note that our current network architecture, designed mainly for natural images, has not yet been fine-tuned for facial images. 
C MORE EXPERIMENTAL RESULTS ON JOINT IMAGE-TEXT LEARNING C.1 TABLES 1 AND 2 WITH ERROR BARS. For text-to-image generation tasks, we use the official pre-defined training/testing split (illustrated in Appendix F) to train and test all the models. Following the definition of the error bar of IS in StackGAN++ (Zhang et al., 2017b), HDGAN (Zhang et al., 2018b), and AttnGAN (Xu et al., 2018), we provide the IS results with error bars for various methods in Table 5, where the results of StackGAN++, HDGAN, and AttnGAN are quoted from the published papers. The FID error bar is not included as it has not been clearly defined.\nTable 5: Inception score (IS) results in Table 1 with error bars.\nMethod | StackGAN++ | HDGAN | AttnGAN | Obj-GAN | VHE-raster-scan-GAN\nFlower | 3.26\u00b1.01 | 3.45\u00b1.07 | - | - | 3.72\u00b1.01\nCUB | 3.84\u00b1.06 | 4.15\u00b1.05 | 4.36\u00b1.03 | - | 4.41\u00b1.03\nCOCO | 8.30\u00b1.10 | 11.86\u00b1.18 | 25.89\u00b1.47 | 26.68\u00b1.52 | 27.16\u00b1.23\nTable 6: Inception score (IS) results in Table 2 with error bars.\nMethod | PGBN+StackGAN++ | VHE-vanilla-GAN | VHE-StackGAN++ | VHE-simple-raster-scan-GAN\nFlower | 3.29\u00b1.02 | 3.01\u00b1.06 | 3.56\u00b1.03 | 3.62\u00b1.02\nCUB | 3.92\u00b1.06 | 3.52\u00b1.08 | 4.20\u00b1.04 | 4.31\u00b1.06\nCOCO | 10.63\u00b1.10 | 6.36\u00b1.20 | 12.63\u00b1.15 | 20.13\u00b122\nC.2 HIGH-QUALITY IMAGES OF FIGURE 2 Due to space constraints, we provide relatively small-size images in Fig. 2. Below we show the corresponding images with larger sizes. 
Figure 10: The images above the blue line are the larger-size replots of CUB Bird images in Figure 2, while the images below the blue line are results for ablation study. Figure 11: The images above the blue line are the larger-size replots of Flower images in Figure 2, while the images below the blue line are results for ablation study. Figure 12: The images above the blue line are the larger-size replots of COCO images in Figure 2, while the images below the blue line are results for ablation study. C.3 MORE TEXT-TO-IMAGE GENERATION RESULTS ON COCO COCO is a more challenging dataset than CUB and Flower, as it contains very diverse objects and scenes. We show in Fig. 13 more samples conditioned on different textual descriptions. Figure 13: Example text-to-image generation results on COCO. 
C.4 LATENT SPACE INTERPOLATION In addition to the latent space interpolation results of VHE-raster-scan-GAN in Fig. 3(c) of Section 3.1, below we provide more fine-gridded latent space interpolation in Figs. 15-18. Figure 14: Example of latent space interpolation on CUB. Figure 15: Example of latent space interpolation on CUB. Figure 16: Example of latent space interpolation on CUB. Figure 17: Example of latent space interpolation on Flower. Figure 18: Example of latent space interpolation on Flower. C.5 IMAGE RETRIEVAL GIVEN A TEXT QUERY For image $x_n$, we draw its BoW textual description $\\hat{t}_n$ as $\\hat{t}_n | \\theta_n \\sim p(t | \\Phi, \\theta_n)$, $\\theta_n | x_n \\sim q_{\\Omega}(\\theta | \\Phi, x_n)$. Given the BoW textual description $t$ as a text query, we retrieve the top five images ranked by the cosine distances between $t$ and the $\\hat{t}_n$\u2019s. Shown in Fig. 19 are three example image retrieval results, which suggest that the retrieved images are semantically related to their text queries in colors, shapes, and locations. Figure 19: Top-5 retrieved images given a text query. Rows 1 to 3 are for Flower, CUB, and COCO, respectively. C.6 IMAGE REGENERATION We note that for VHE-GAN, its image encoder and GAN component together can also be viewed as an \u201cautoencoding\u201d GAN for images. More specifically, given image $x$, VHE-GAN can provide random regenerations using $G(q_{\\Omega}(\\theta | \\Phi, x))$. We show example image regeneration results by both VHE-StackGAN++ and VHE-raster-scan-GAN in Fig. 20. These example results suggest that the regenerated random images by the proposed VHE-GANs more or less resemble the original real image fed into the VHE image encoder. 
Figure 20: Example results of image regeneration using VHE-StackGAN++ and VHE-raster-scan-GAN. An original image is fed into the VHE image encoder, whose latent representation is then fed into the GAN image generator to generate a corresponding random image. The models in columns 1-4 are trained on Flower, columns 5-8 on CUB, and columns 9-12 on COCO. C.7 LEARNED HIERARCHICAL TOPICS IN VHE The inferred topics at different layers and the inferred sparse connection weights between the topics of adjacent layers are found to be highly interpretable. In particular, we can understand the meaning of each topic by projecting it back to the original data space via $\\left[\\prod_{t=1}^{l-1} \\Phi^{(t)}\\right] \\phi_k^{(l)}$, and understand the relationship between the topics by arranging them into a directed acyclic graph (DAG) and choosing its subnets to visualize. We show in Figs. 21, 22, and 23 example subnets taken from the DAGs inferred by the three-layer VHE-raster-scan-GAN of size 256-128-64 on Flower, CUB, and COCO, respectively. The semantic meaning of each topic and the connection weights between the topics of adjacent layers are highly interpretable. For example, in Fig. 21, the topics describe very specific flower characteristics, such as special colors, textures, shapes, and parts, at the bottom layer, and become increasingly general when moving upwards. 
Figure 21: An example topic hierarchy taken from the directed acyclic graph learned by a three-layer VHE-raster-scan-GAN of size 256-128-64 on Flower. 
Figure 22: Analogous plot to Fig. 21 on CUB. 
Figure 23: Analogous plot to Fig. 21 on COCO. D SPECIFIC MODEL STRUCTURE IN VHE-STACKGAN++ AND VHE-RASTER-SCAN-GAN D.1 MODEL STRUCTURE OF VHE In Fig. 24, we give the structure of VHE used in VHE-StackGAN++ and VHE-raster-scan-GAN, where $f(x)$ is the image features extracted by the Inception v3 network and $\\varepsilon^{(l)} \\sim \\prod_{k=1}^{K_l} \\mathrm{Uniform}(\\varepsilon_k^{(l)}; 0, 1)$. 
With the definition of $g^{(0)} = f(x)$, we have $k^{(l)} = \\exp(W_1^{(l)} g^{(l)} + b_1^{(l)})$, (12) $\\lambda^{(l)} = \\exp(W_2^{(l)} g^{(l)} + b_2^{(l)})$, (13) $g^{(l)} = \\ln[1 + \\exp(W_3^{(l)} g^{(l-1)} + b_3^{(l)})]$, (14) where $W_1^{(l)} \\in \\mathbb{R}^{K_l \\times K_l}$, $W_2^{(l)} \\in \\mathbb{R}^{K_l \\times K_l}$, $W_3^{(l)} \\in \\mathbb{R}^{K_l \\times K_{l-1}}$, $b_1^{(l)} \\in \\mathbb{R}^{K_l}$, $b_2^{(l)} \\in \\mathbb{R}^{K_l}$, and $b_3^{(l)} \\in \\mathbb{R}^{K_l}$. Figure 24: The architecture of VHE in VHE-StackGAN++ and VHE-raster-scan-GAN. D.2 MODEL OF VHE-STACKGAN++ In Section 2.2, we first introduce VHE-StackGAN++, where the multi-layer textual representation $\\{\\theta^{(1)}, \\theta^{(2)}, \\cdots, \\theta^{(L)}\\}$ is concatenated as $\\theta = [\\theta^{(1)}, \\cdots, \\theta^{(L)}]$ and then fed into StackGAN++ (Zhang et al., 2017b). In Figs. 1 (a) and (b), we provide the model structure of VHE-StackGAN++. We also provide a detailed plot of the structure of StackGAN++ used in VHE-StackGAN++ in Fig. 25, where JCU is a specific type of discriminator; see Zhang et al. (2017b) for more details. 
As with VHE-raster-scan-GAN, VHE-StackGAN++ is also able to jointly optimize all components by merging the expectations in VHE and GAN to define its loss function as $\\min_{\\Omega, \\{G_i\\}_{i=1}^{3}} \\max_{\\{D_i\\}_{i=1}^{3}} \\mathbb{E}_{p_{data}(x_n, t_n)} \\mathbb{E}_{\\prod_{l=1}^{L} q(\\theta_n^{(l)} | x_n, \\Phi^{(l+1)}, \\theta_n^{(l+1)})} \\Big\\{ -\\log p(t_n | \\Phi^{(1)}, \\theta_n^{(1)}) + \\sum_{l=1}^{L} \\mathrm{KL}[q(\\theta_n^{(l)} | x_n, \\Phi^{(l+1)}, \\theta_n^{(l+1)}) \\,\\|\\, p(\\theta_n^{(l)} | \\Phi^{(l+1)}, \\theta_n^{(l+1)})] + \\sum_{i=1}^{3} [\\log D_i(x_{n,i}, \\theta_n) + \\log(1 - D_i(G_i(\\theta_n), \\theta_n))] \\Big\\}$. (15) Figure 25: The structure of StackGAN++ in VHE-StackGAN++, where JCU is a type of discriminator proposed in Zhang et al. (2017b). D.3 STRUCTURE OF RASTER-SCAN-GAN In Fig. 26, we provide a detailed plot of the structure of the proposed raster-scan-GAN. Figure 26: The structure of raster-scan-GAN in VHE-raster-scan-GAN, where JCU is a type of discriminator proposed in Zhang et al. (2017b). 
E JOINT OPTIMIZATION FOR VHE-RASTER-SCAN-GAN Based on the loss function of VHE-raster-scan-GAN (9), with TLASGR-MCMC (Cong et al., 2017) and WHAI (Zhang et al., 2018a), we describe in Algorithm 1 how to perform a mini-batch based joint update of all model parameters.\nAlgorithm 1 Hybrid TLASGR-MCMC/VHE inference algorithm for VHE-raster-scan-GAN.\nInitialize encoder parameters $\\Omega$, topic parameters of PGBN $\\{\\Phi^{(l)}\\}_{1,L}$, generator $G$, and discriminator $D$.\nfor iter = 1, 2, $\\cdots$ do\n  Randomly select a mini-batch containing $N$ image-text pairs $d = \\{x_n, t_n\\}_{n=1}^{N}$;\n  Draw random noise $\\{\\varepsilon_n^{(l)}\\}_{n=1,l=1}^{N,L}$ from the uniform distribution;\n  Calculate $\\nabla_D L(D, G, \\Omega | x)$;\n  Calculate $\\nabla_G L(D, G, \\Omega | x)$;\n  Calculate $\\nabla_{\\Omega} L$ with the aid of $\\{\\varepsilon_n^{(l)}\\}_{n=1,l=1}^{N,L}$;\n  Update $D$ as $D = D + \\nabla_D L(D, G, \\Omega | x)$;\n  Update $G$ as $G = G - \\nabla_G L(D, G, \\Omega | x)$;\n  Update $\\Omega$ as $\\Omega = \\Omega - \\nabla_{\\Omega} L$;\n  Sample $\\{\\theta_n^{(l)}\\}_{l=1}^{L}$ from (6) given $\\Omega$ and $\\{\\Phi^{(l)}\\}_{l=1}^{L}$, and use $\\{t\\}_{n=1}^{N}$ to update topics $\\{\\Phi^{(l)}\\}_{l=1}^{L}$ according to TLASGR-MCMC;\nend for\nF DATA DESCRIPTION ON CUB, FLOWER, AND COCO WITH TRAINING DETAILS In image-text multi-modality learning, CUB (Wah et al., 2011), Flower (Nilsback & Zisserman, 2008) and COCO (Lin et al., 2014) are widely used datasets. CUB (http://www.vision.caltech.edu/visipedia/CUB-200-2011.html): CUB contains 200 bird species with 11,788 images. Since 80% of birds in this dataset have object-image size ratios of less than 0.5 (Wah et al., 2011), as a preprocessing step, we crop all images to ensure that bounding boxes of birds have greater-than-0.75 object-image size ratios, which is the same as in all related work. For textual description, Wah et al. (2011) provide ten sentences for each image and we collect them together to form BoW vectors. Besides, for each species, Elhoseiny et al. (2017a) provide its encyclopedia document for text-based ZSL, which is also used in our text-based ZSL experiments. For CUB, there are two split settings: the hard one and the easy one. 
The hard one ensures that the bird subspecies belonging to the same super-category belong to either the training split or the test split without overlapping, referred to as CUB-hard (CUB-H in our manuscript). A recently used split setting (Qiao et al., 2016; Akata et al., 2015) is the super-category split, where for each super-category, except for one subspecies that is left as unseen, all the others are used for training, referred to as CUB-easy (CUB-E in our manuscript). For CUB-H, there are 150 species containing 9410 samples for training and 50 species containing 2378 samples for testing. For CUB-E, there are 150 species containing 8855 samples for training and 50 species containing 2933 samples for testing. We use both of them for the text-based ZSL, and only CUB-E for all the other experiments as usual. Flower (http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html): Oxford-102, commonly referred to as Flower, contains 8,189 images of flowers from 102 different categories. For textual description, Nilsback & Zisserman (2008) provide ten sentences for each image and we collect them together to form BoW vectors. Besides, for each species, Elhoseiny et al. (2017a) provide its encyclopedia document for text-based ZSL, which is also used in our text-based ZSL experiments in Section 4.2.2. There are 82 species containing 7034 samples for training and 20 species containing 1155 samples for testing. For text-based ZSL, we follow Elhoseiny et al. (2017a) to split the data. Specifically, five random splits are performed, in each of which 4/5 of the classes are considered as \u201cseen classes\u201d for training and 1/5 of the classes as \u201cunseen classes\u201d for testing. For other experiments, we follow Zhang et al. (2017b) to split the data. 
COCO (http://cocodataset.org/#download): Compared with Flower and CUB, COCO is a more challenging dataset, since it contains images with multiple objects and diverse backgrounds. To show the generalization capability of the proposed VHE-GANs, we also utilize COCO for evaluation. Following the standard experimental setup for COCO (Reed et al., 2016; Zhang et al., 2017b), we directly use the pre-split training and test sets to train and evaluate our proposed models. There are 82081 samples for training and 40137 samples for testing. Training details: we train VHE-raster-scan-GAN on four Nvidia GeForce RTX 2080 Ti GPUs. The experiments are performed with mini-batch size 32 and about 30.2 GB of GPU memory. We run 600 epochs to train the models on CUB and Flower, taking about 797 seconds per epoch for CUB-E and 713 seconds per epoch for Flower. We run 100 epochs to train the models on COCO, taking about 6315 seconds per epoch. We use the Adam optimizer (Kingma & Ba, 2014) with learning rate $2 \\times 10^{-4}$, $\\beta_1 = 0.5$, and $\\beta_2 = 0.999$ to optimize the parameters of the GAN generator and discriminator, and use Adam with learning rate $10^{-4}$, $\\beta_1 = 0.9$, and $\\beta_2 = 0.999$ to optimize the VHE parameters. The hyper-parameters to update the topics $\\Phi$ with TLASGR-MCMC are the same as those in Cong et al. (2017). G ADDITIONAL DISCUSSION ON OBJ-GAN Focusing on the COCO dataset, the recently proposed Obj-GAN (Li et al., 2019) exploits more side information, including the bounding boxes and labels of objects existing in the images, to perform text-to-image generation. More specifically, Obj-GAN first trains an attentive sequence-to-sequence model to infer the bounding boxes given a text $t$: $B_{1:T} = [B_1, B_2, \\cdots, B_T] = G_{box}(e)$, (16) where $e$ are the pre-trained bi-LSTM word vectors of $t$, and $B_t = (l_t, b_t)$ consists of the class label of the $t$th object and its bounding box $b = (x, y, w, h) \\in \\mathbb{R}^4$. 
Given the bounding boxes $B_{1:T}$, Obj-GAN learns a shape generator to predict the shape of each object in its bounding box: $\\hat{M}_{1:T} = G_{shape}(B_{1:T}, z_{1:T})$, (17) where $z_t \\sim N(0, 1)$ is a random noise vector. Having obtained $B_{1:T}$ and $\\hat{M}_{1:T}$, Obj-GAN trains an attentive multi-stage image generator to generate the images conditioned on $B_{1:T}$, $\\hat{M}_{1:T}$, and $e$. Although Obj-GAN achieves a better FID on COCO, it has two major limitations in practice. First, it is not always possible to obtain accurate bounding boxes and labels of objects in the image; even if they can be acquired by manual labeling, doing so is often time- and labor-consuming, especially on large datasets. Second, each word is associated with one fixed bounding box; in other words, given one sentence, the locations of the objects in the generated images are fixed, which clearly hurts the diversity of the Obj-GAN generated images, as shown in Fig. 27. Figure 27: The generated random images of Obj-GAN given text lack diversity."
},
{
    "title": "Learning to Explore using Active Neural SLAM",
    "pdf_link": "https://openreview.net/pdf?id=HklXn1BKDH",
    "abstract": "This work presents a modular and hierarchical approach to learn policies for exploring 3D environments, called \u2018Active Neural SLAM\u2019. Our approach leverages the strengths of both classical and learning-based methods, by using analytical path planners with learned SLAM module, and global and local policies. The use of learning provides flexibility with respect to input modalities (in the SLAM module), leverages structural regularities of the world (in global policies), and provides robustness to errors in state estimation (in local policies). Such use of learning within each module retains its benefits, while at the same time, hierarchical decomposition and modular training allow us to sidestep the high sample complexities associated with training end-to-end policies. Our experiments in visually and physically realistic simulated 3D environments demonstrate the effectiveness of our approach over past learning and geometry-based approaches. The proposed model can also be easily transferred to the PointGoal task and was the winning entry of the CVPR 2019 Habitat PointGoal Navigation Challenge.",
    "paper_text": "Published as a conference paper at ICLR 2020 LEARNING TO EXPLORE USING ACTIVE NEURAL SLAM Devendra Singh Chaplot1\u2020, Dhiraj Gandhi2, Saurabh Gupta3*, Abhinav Gupta1,2*, Ruslan Salakhutdinov1* 1Carnegie Mellon University, 2Facebook AI Research, 3UIUC Project webpage: https://devendrachaplot.github.io/projects/Neural-SLAM Code: https://github.com/devendrachaplot/Neural-SLAM ABSTRACT This work presents a modular and hierarchical approach to learn policies for exploring 3D environments, called \u2018Active Neural SLAM\u2019. Our approach leverages the strengths of both classical and learning-based methods, by using analytical path planners with learned SLAM module, and global and local policies. The use of learning provides flexibility with respect to input modalities (in the SLAM module), leverages structural regularities of the world (in global policies), and provides robustness to errors in state estimation (in local policies). Such use of learning within each module retains its benefits, while at the same time, hierarchical decomposition and modular training allow us to sidestep the high sample complexities associated with training end-to-end policies. Our experiments in visually and physically realistic simulated 3D environments demonstrate the effectiveness of our approach over past learning and geometry-based approaches. The proposed model can also be easily transferred to the PointGoal task and was the winning entry of the CVPR 2019 Habitat PointGoal Navigation Challenge. 1 INTRODUCTION Navigation is a critical task in building intelligent agents. Navigation tasks can be expressed in many forms; for example, point-goal tasks involve navigating to specific coordinates and semantic navigation involves finding the path to a specific scene or object. Irrespective of the task, a core problem for navigation in unknown environments is exploration, i.e., how to efficiently visit as much of the environment as possible. 
This is useful for maximizing the coverage to give the best chance of finding the target in unknown environments or for efficiently pre-mapping environments on a limited time budget. Recent work from Chen et al. (2019) has used end-to-end learning to tackle this problem. Their motivation is three-fold: a) learning provides flexibility in the choice of input modalities (classical systems rely on observing geometry through the use of specialized sensors, while learning systems can infer geometry directly from RGB images), b) use of learning can improve robustness to errors in explicit state estimation, and c) learning can effectively leverage structural regularities of the real world, leading to more efficient behavior in previously unseen environments. This led to their design of an end-to-end trained neural network-based policy that processes raw sensory observations to directly output actions that the agent should execute. While the use of learning for exploration is well-motivated, casting the exploration problem as an end-to-end learning problem has its own drawbacks. Learning about mapping, state estimation, and path planning purely from data in an end-to-end manner can be prohibitively expensive. Consequently, past end-to-end learning work for exploration from Chen et al. (2019) relies on the use of imitation learning and many millions of frames of experience, but still performs worse than classical methods that don\u2019t require any training at all. This motivates our work. In this paper, we investigate alternate formulations of employing learning for exploration that retain the advantages that learning has to offer, but don\u2019t suffer from the drawbacks of full-blown end-to-end learning. \u2020Correspondence: chaplot@cs.cmu.edu. *Equal contribution. 
Our key conceptual insight is that the use of learning for leveraging structural regularities of indoor environments, for robustness to state-estimation errors, and for flexibility with respect to input modalities happens at different time scales and can thus be factored out. This motivates the use of learning in a modular and hierarchical fashion inside what one may call a \u2018classical navigation pipeline\u2019. This results in navigation policies that can work with raw sensory inputs such as RGB images, are robust to state estimation errors, and leverage the regularities of real-world layouts. It also yields extremely competitive performance compared with both geometry-based methods and recent learning-based methods, while requiring a fraction of the number of samples. More specifically, our proposed exploration architecture consists of a learned Neural SLAM module, a global policy, and a local policy, which are interfaced via the map and an analytical path planner. The learned Neural SLAM module produces free-space maps and estimates the agent pose from input RGB images and motion sensors. The global policy consumes this free-space map with the agent pose and employs learning to exploit structural regularities in layouts of real-world environments to produce long-term goals. These long-term goals are used to generate short-term goals for the local policy (using a geometric path planner). This local policy uses learning to directly map raw RGB images to actions that the agent should execute. Use of learning in the SLAM module provides flexibility with respect to input modality, the learned global policy can exploit regularities in layouts of real-world environments, and learned local policies can use visual feedback to exhibit more robust behavior. At the same time, the hierarchical and modular design and the use of analytical planning significantly cut down the search space during training, leading to better performance as well as sample efficiency. 
We demonstrate our proposed approach in visually and physically realistic simulators for the task of geometric exploration (visit as much area as possible). We work with the Habitat simulator from Savva et al. (2019). While Habitat is already visually realistic (it uses real-world scans from Chang et al. (2017) and Xia et al. (2018) as environments), we improve its physical realism by using actuation and odometry sensor noise models built from data that we collected by conducting physical experiments on a real mobile robot. Our experiments and ablations in this realistic simulation reveal the effectiveness of our proposed approach for the task of exploration. A straightforward modification of our method also tackles point-goal navigation tasks, and won the AI Habitat challenge at CVPR 2019 across all tracks. 2 RELATED WORK Navigation has been well studied in classical robotics. There has been a renewed interest in the use of learning to arrive at navigation policies, for a variety of tasks. Our work builds upon concepts in classical robotics and learning for navigation. We survey related works below. Navigation Approaches. Classical approaches to navigation break the problem into two parts: mapping and path planning. Mapping is done via simultaneous localization and mapping (Thrun et al., 2005; Hartley and Zisserman, 2003; Fuentes-Pacheco et al., 2015), by fusing information from multiple views of the environment. While sparse reconstruction can be done well with monocular RGB images (Mur-Artal and Tardós, 2017), dense mapping is inefficient (Newcombe et al., 2011) or requires specialized scanners such as Kinect (Izadi et al., 2011). Maps are used to compute paths to goal locations via path planning (Kavraki et al., 1996; Lavalle and Kuffner Jr, 2000; Canny, 1988). These classical methods have inspired recent learning-based techniques. 
Researchers have designed neural network policies that reason via spatial representations (Gupta et al., 2017; Parisotto and Salakhutdinov, 2018; Zhang et al., 2017; Henriques and Vedaldi, 2018; Gordon et al., 2018), topological representations (Savinov et al., 2018; 2019), or use differentiable and trainable planners (Tamar et al., 2016; Lee et al., 2018; Gupta et al., 2017; Khan et al., 2017). Our work furthers this research: we study a hierarchical and modular decomposition of the problem and employ learning inside these components instead of end-to-end learning. Research also focuses on incorporating semantics in SLAM (Pronobis and Jensfelt, 2012; Walter et al., 2013). Exploration in Navigation. While a number of works focus on passive map-building, path planning and goal-driven policy learning, a much smaller body of work tackles the problem of active SLAM, i.e., how to actively control the camera for map building. We point readers to Fuentes-Pacheco et al. (2015) for a detailed survey, and summarize the major themes below. Most such works frame this problem as a Partially Observable Markov Decision Process (POMDP) that is approximately solved (Martinez-Cantin et al., 2009; Kollar and Roy, 2008), and/or seek to find a sequence of actions that minimizes uncertainty of maps (Stachniss et al., 2005; Carlone et al., 2014). Another line of work explores by picking vantage points, such as on the frontier between explored and unexplored regions (Dornhege and Kleiner, 2013; Holz et al., 2010; Yamauchi, 1997; Xu et al., 2017). Recent works from Chen et al. (2019); Savinov et al. (2019); Fang et al. (2019) attack this problem via learning. Our proposed modular policies unify the last two lines of research, and we show improvements over representative methods from both these lines of work. 
Exploration has also been studied more generally in RL in the context of the exploration-exploitation trade-off (Sutton and Barto, 2018; Kearns and Singh, 2002; Auer, 2002; Jaksch et al., 2010). Hierarchical and Modular Policies. Hierarchical RL (Dayan and Hinton, 1993; Sutton et al., 1999; Barto and Mahadevan, 2003) is an active area of research, aimed at automatically discovering hierarchies to speed up learning. However, this has proven to be challenging, and thus most work has resorted to using hand-defined hierarchies. For example, in the context of navigation, Bansal et al. (2019) and Kaufmann et al. (2019) design modular policies that interface learned policies with low-level feedback controllers. Hierarchical and modular policies have also been used for Embodied Question Answering (Das et al., 2018a; Gordon et al., 2018; Das et al., 2018b). 3 TASK SETUP We follow the exploration task setup proposed by Chen et al. (2019), where the objective is to maximize the coverage in a fixed time budget. The coverage is defined as the total area in the map known to be traversable. Our objective is to train a policy which takes in an observation s_t at each time step t and outputs a navigational action a_t to maximize the coverage. We try to make our experimental setup in simulation as realistic as possible, with the goal of transferring trained policies to the real world. We use the Habitat simulator (Savva et al., 2019) with the Gibson (Xia et al., 2018) and Matterport (MP3D) (Chang et al., 2017) datasets for our experiments. Both the Gibson and Matterport datasets are based on real-world scene reconstructions and are thus significantly more realistic than the synthetic SUNCG dataset (Song et al., 2017) used for past research on exploration (Chen et al., 2019; Fang et al., 2019). In addition to synthetic scenes, prior works on learning-based navigation have also assumed simplistic agent motion. 
Some works limit agent motion to a grid with 90 degree rotations (Zhu et al., 2017; Gupta et al., 2017; Chaplot et al., 2018). Other works which implement fine-grained control typically assume unrealistic agent motion with no noise (Savva et al., 2019) or perfect knowledge of agent pose (Chaplot et al., 2016). Since the motion is simplistic, it becomes trivial to estimate the agent pose in most cases even if it is not assumed to be known. The reason behind these assumptions on agent motion and pose is that motion and sensor noise models are not known. In order to relax both these assumptions, we collect motion and sensor data in the real world and implement more realistic agent motion and sensor noise models in the simulator, as described in the following subsection. 3.1 ACTUATION AND SENSOR NOISE MODEL We represent the agent pose by (x, y, o), where x and y represent the xy coordinates of the agent measured in metres and o represents the orientation of the agent in radians (measured counter-clockwise from the x-axis). Without loss of generality, assume the agent starts at p_0 = (0, 0, 0). Now, suppose the agent takes an action a_t. Each action is implemented as a control command on a robot. Let the corresponding control command be Δu_a = (x_a, y_a, o_a). Let the agent pose after the action be p_1 = (x*, y*, o*). The actuation noise (ε_act) is the difference between the actual agent pose (p_1) after the action and the intended agent pose (p_0 + Δu_a): ε_act = p_1 − (p_0 + Δu_a) = (x* − x_a, y* − y_a, o* − o_a). Mobile robots typically have sensors which estimate the robot pose as it moves. Let the sensor estimate of the agent pose after the action be p'_1 = (x', y', o'). The sensor noise (ε_sen) is given by the difference between the sensor pose estimate (p'_1) and the actual agent pose (p_1): ε_sen = p'_1 − p_1 = (x' − x*, y' − y*, o' − o*). 
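As a worked illustration of the two definitions above, the following minimal Python sketch computes the actuation and sensor noise for a single Forward action. The numeric poses are made-up values chosen for illustration, not measurements from the paper, and the helper names pose_add and pose_diff are hypothetical. Orientation differences are taken directly, without angle wrapping, which is only reasonable for the small angles assumed here.

```python
import math

def pose_add(a, b):
    '''Component-wise sum of two (x, y, o) tuples.'''
    return tuple(ai + bi for ai, bi in zip(a, b))

def pose_diff(a, b):
    '''Component-wise difference a - b of two (x, y, o) tuples.'''
    return tuple(ai - bi for ai, bi in zip(a, b))

# Start pose p_0 and the control command for Forward (25cm, no rotation).
p0 = (0.0, 0.0, 0.0)
delta_u = (0.25, 0.0, 0.0)

# Assumed values: where the robot actually ended up, and what its
# odometry sensor reported. Both are invented for this example.
p1 = (0.26, 0.01, math.radians(1.0))
p1_sensor = (0.24, 0.0, 0.0)

# eps_act = p_1 - (p_0 + delta_u): actual pose minus intended pose.
eps_act = pose_diff(p1, pose_add(p0, delta_u))

# eps_sen = p'_1 - p_1: sensor estimate minus actual pose.
eps_sen = pose_diff(p1_sensor, p1)

print('actuation noise:', eps_act)
print('sensor noise:', eps_sen)
```

With these assumed numbers the robot overshoots slightly in x and drifts in y and o, while the sensor under-reports the distance travelled.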
In order to implement the actuation and sensor noise models, we would like to collect data for navigational actions in the Habitat simulator. We use three default navigational actions: Forward: move forward by 25cm, Turn Right: on-the-spot rotation clockwise by 10 degrees, and Turn Left: on-the-spot rotation counter-clockwise by 10 degrees. The control commands are implemented as u_Forward = (0.25, 0, 0), u_Right = (0, 0, −10π/180) and u_Left = (0, 0, 10π/180). In practice, a robot can also rotate slightly while moving forward and translate a bit while rotating on the spot, creating rotational actuation noise in the forward action and, similarly, translational actuation noise in the on-the-spot rotation actions. Figure 1: Overview of our approach. The Neural SLAM module predicts a map and agent pose estimate from incoming RGB observations and sensor readings. This map and pose are used by a Global policy to output a long-term goal, which is converted to a short-term goal using an analytic path planner. A Local Policy is trained to navigate to this short-term goal. We use a LoCoBot (see footnote 1) to collect data for building the actuation and sensor noise models. We use the pyrobot API (Murali et al., 2019) along with ROS (Quigley et al., 2009) to implement the control commands and get sensor readings. For each action a, we fit a separate Gaussian Mixture Model for the actuation noise and sensor noise, making a total of 6 models. Each component in these Gaussian mixture models is a multivariate Gaussian in 3 variables, x, y and o. For each model, we collect 600 datapoints. The number of components in each Gaussian mixture model is chosen using cross-validation. We implement these actuation and sensor noise models in the Habitat simulator for our experiments. We have released the noise models, along with their implementation in the Habitat simulator, in the open-source code. 
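To make the role of the noise models concrete, here is a minimal sketch of how a fitted mixture might be sampled to perturb a control command. It is not the released implementation: the mixture parameters are invented, each component is simplified to independent per-dimension Gaussians (the text describes multivariate Gaussians in x, y and o), and the names FORWARD_ACT_GMM, sample_gmm and noisy_step are hypothetical. The pose update follows the definition of actuation noise with the start pose at the origin, so p_1 = Δu_a + ε_act.

```python
import math
import random

# Control commands for the three actions, as (x, y, o) in metres and radians.
U_FORWARD = (0.25, 0.0, 0.0)
U_RIGHT = (0.0, 0.0, -10 * math.pi / 180)
U_LEFT = (0.0, 0.0, 10 * math.pi / 180)

# Hypothetical actuation-noise mixture for Forward. Each component is
# (weight, per-dimension mean, per-dimension std). Values are illustrative
# only and use a diagonal-covariance simplification.
FORWARD_ACT_GMM = [
    (0.7, (0.00, 0.00, 0.000), (0.010, 0.005, 0.01)),
    (0.3, (0.02, 0.00, 0.005), (0.020, 0.010, 0.02)),
]

def sample_gmm(components, rng):
    '''Pick a component by weight, then sample each dimension from it.'''
    weights = [w for w, _, _ in components]
    _, mean, std = rng.choices(components, weights=weights, k=1)[0]
    return tuple(rng.gauss(m, s) for m, s in zip(mean, std))

def noisy_step(command, components, rng):
    '''Return the pose after one action, starting from (0, 0, 0).'''
    eps_act = sample_gmm(components, rng)
    return tuple(u + e for u, e in zip(command, eps_act))

rng = random.Random(0)  # seeded so repeated runs give the same samples
p1 = noisy_step(U_FORWARD, FORWARD_ACT_GMM, rng)
print('pose after noisy Forward:', p1)
```

A full-covariance version would replace the per-dimension sampling with a draw from a 3-variable multivariate Gaussian per component. Applying the step from a non-zero start pose would also require rotating the command into the world frame.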
4 METHODS We propose a modular navigation model, 'Active Neural SLAM'. It consists of three components: a Neural SLAM module, a Global policy and a Local policy, as shown in Figure 1. The Neural SLAM module predicts the map of the environment and the agent pose based on the current observations and previous predictions. The Global policy uses the predicted map and agent pose to produce a long-term goal. The long-term goal is converted into a short-term goal using path planning. The Local policy takes navigational actions based on the current observation to reach the short-term goal. Map Representation. The Active Neural SLAM model internally maintains a spatial map, m_t, and the pose of the agent, x_t. The spatial map, m_t, is a 2 × M × M matrix, where M × M denotes the map size and each element in this spatial map corresponds to a cell of size 25cm² (5cm × 5cm) in the physical world. Each element in the first channel denotes the probability of an obstacle at the corresponding location and each element in the second channel denotes the probability of that location being explored. A cell is considered to be explored when it is known to be free space or an obstacle. The spatial map is initialized with all zeros at the beginning of an episode, m_0 = [0]^(2×M×M). The pose x_t ∈ R³ denotes the x and y coordinates of the agent and the orientation of the agent at time t. The agent always starts at the center of the map facing east at the beginning of the episode, x_0 = (M/2, M/2, 0.0). Neural SLAM Module. The Neural SLAM Module (f_SLAM) takes in the current RGB observation, s_t, the current and last sensor readings of the agent pose, x'_(t−1:t), and the last agent pose and map estimates, x̂_(t−1) and m_(t−1), and outputs an updated map, m_t, and the current agent pose estimate, x̂_t (see Figure 2): m_t, x̂_t = f_SLAM(s_t, x'_(t−1:t), x̂_(t−1), m_(t−1) | θ_S), where θ_S denotes the trainable parameters of the Neural SLAM module. 
It consists of two learned components, a Mapper and a Pose Estimator. The Mapper (f_Map) outputs an egocentric top-down 2D spatial map, p^ego_t ∈ [0, 1]^(2×V×V) (where V is the vision range), predicting the obstacles and the explored area in the current observation. The Pose Estimator (f_PE) predicts the agent pose (x̂_t) based on the past pose estimate (x̂_(t−1)) and the last two egocentric map predictions (p^ego_(t−1:t)). It essentially compares the current egocentric map prediction to the last egocentric map prediction transformed to the current frame to predict the pose change between the two maps. The egocentric map from the Mapper is transformed to a geocentric map based on the pose estimate given by the Pose Estimator and then aggregated with the previous spatial map (m_(t−1)) to get the current map (m_t). More implementation details of the Neural SLAM module are provided in the Appendix. (Footnote 1: http://locobot.org) Figure 2: Architecture of the Neural SLAM module: The Neural SLAM module (f_SLAM) takes in the current RGB observation, s_t, the current and last sensor readings of the agent pose, x'_(t−1:t), the last agent pose estimate, x̂_(t−1), and the map at the previous time step, m_(t−1), and outputs an updated map, m_t, and the current agent pose estimate, x̂_t. 'ST' denotes spatial transformation. Global Policy. The Global Policy takes h_t ∈ [0, 1]^(4×M×M) as input, where the first two channels of h_t are the spatial map m_t given by the SLAM module, the third channel represents the current agent position estimated by the SLAM module, and the fourth channel represents the visited locations, i.e. for all i, j ∈ {1, 2, ..., M}: h_t[c, i, j] = m_t[c, i, j] for all c ∈ {0, 1}; h_t[2, i, j] = 1 if i = x̂_t[0] and j = x̂_t[1]; h_t[3, i, j] = 1 if (i, j) ∈ [(x̂_k[0], x̂_k[1])] for k ∈ {0, 1, ..., t}. We perform two transformations before passing h_t to the Global Policy model. The first transformation subsamples a window of size 4 × G × G around the agent from h_t. 
The second transformation performs max pooling operations to get an output of size 4 × G × G from h_t. Both the transformations are stacked to form a tensor of size 8 × G × G and passed as input to the Global Policy model. The Global Policy uses a convolutional neural network to predict a long-term goal, g^l_t, in G × G space: g^l_t = π_G(h_t | θ_G), where θ_G are the parameters of the Global Policy. Planner. The Planner takes the long-term goal (g^l_t), the spatial obstacle map (m_t) and the agent pose estimate (x̂_t) as input and computes the short-term goal g^s_t, i.e. g^s_t = f_Plan(g^l_t, m_t, x̂_t). It computes the shortest path from the current agent location to the long-term goal (g^l_t) using the Fast Marching Method (Sethian, 1996) based on the current spatial map m_t. The unexplored area is considered as free space for planning. We compute a short-term goal coordinate (the farthest point within d_s (= 0.25m) from the agent) on the planned path. Local Policy. The Local Policy takes as input the current RGB observation (s_t) and the short-term goal (g^s_t) and outputs a navigational action, a_t = π_L(s_t, g^s_t | θ_L), where θ_L are the parameters of the Local Policy. The short-term goal coordinate is transformed into relative distance and angle from the agent's location before being passed to the Local Policy. The Local Policy is a recurrent neural network consisting of a pretrained ResNet18 (He et al., 2016) as the visual encoder. 5 EXPERIMENTAL SETUP We use the Habitat simulator (Savva et al., 2019) with the Gibson (Xia et al., 2018) and Matterport (MP3D) (Chang et al., 2017) datasets for our experiments. Both Gibson and MP3D consist of scenes which are 3D reconstructions of real-world environments; however, Gibson is collected using a different set of cameras and consists mostly of office spaces, while MP3D consists mostly of homes with a larger average scene area. 
We will use Gibson as our training domain, and use MP3D for domain generalization experiments. The observation space consists of RGB images of size 3 × 128 × 128 and base odometry sensor readings of size 3 × 1 denoting the change in the agent's x-y coordinates and orientation. The action space consists of three actions: move_forward, turn_left, turn_right. Both the base odometry sensor readings and the agent motion based on the actions are noisy. They are implemented using the sensor and actuation noise models based on real-world data as discussed in Section 3.1. We follow the Exploration task setup proposed by Chen et al. (2019), where the objective is to maximize the coverage in a fixed time budget. Coverage is the total area in the map known to be traversable. We define a traversable point to be known if it is in the field-of-view of the agent and is less than 3.2m away. We use two evaluation metrics, the absolute coverage area in m² (Cov) and the percentage of area explored in the scene (% Cov), i.e. the ratio of coverage to maximum possible coverage in the corresponding scene. During training, each episode lasts for a fixed length of 1000 steps. We use the train/val/test splits provided by Savva et al. (2019) for both the datasets. Note that the set of scenes used in each split is disjoint, which means the agent is tested on new scenes never seen during training. The Gibson test set is not public but rather held out on an online evaluation server for the Pointgoal task. We use the validation set as the test set for comparison and analysis for the Gibson domain. We do not use the validation set for hyper-parameter tuning. To analyze the performance of all the models with respect to the size of the scene, we split the Gibson validation set into two parts, a small set of 10 scenes with explorable area ranging from 16m² to 36m², and a large set of 4 scenes with explorable area ranging from 55m² to 100m². 
Note that the size of the map is usually much larger than the traversable area, with the largest map being about 23m long and 11m wide. Training Details. We train our model in the Gibson domain and transfer it to the Matterport domain. The Mapper is trained to predict egocentric projections, and the Pose Estimator is trained to predict agent pose using supervised learning. The ground truth egocentric projection is computed using geometric projections from ground truth depth. The Global Policy is trained using Reinforcement Learning with reward proportional to the increase in coverage. The Local Policy is trained using Imitation Learning (behavioral cloning). All the modules are trained simultaneously. Their parameters are independent, but the data distribution is inter-dependent. Based on the actions taken by the Local policy, the future input to the Neural SLAM module changes, which in turn changes the map and agent pose input to the Global policy and consequently affects the short-term goal given to the Local Policy. For more architecture and hyperparameter details, please refer to the supplementary material and the open-source code. Baselines. We use a range of end-to-end Reinforcement Learning (RL) methods as baselines: RL + 3LConv: An RL Policy with a 3-layer convolutional network followed by a GRU (Cho et al., 2014) as described by Savva et al. (2019). RL + Res18: An RL Policy initialized with ResNet18 (He et al., 2016) pre-trained on ImageNet followed by a GRU. RL + Res18 + AuxDepth: This baseline is adapted from Mirowski et al. (2017), who use depth prediction as an auxiliary task. We use the same architecture as our Neural SLAM module (conv layers from ResNet18) with one additional deconvolutional layer for depth prediction followed by a 3-layer convolution and GRU for the policy. RL + Res18 + ProjDepth: This baseline is adapted from Chen et al. 
(2019) who project the depth image into an egocentric top-down view, which is given in addition to the RGB image as input to the RL policy. Since we do not have depth as input, we use the architecture from RL + Res18 + AuxDepth for depth prediction and project the predicted depth before passing it to the 3-layer Conv and GRU policy. For all the baselines, we also feed a 32-dimensional embedding of the sensor pose reading to the GRU along with the image-based representation. This embedding is also learnt end-to-end using RL. All baselines are trained using PPO (Schulman et al., 2017) with increase in coverage as the reward (identical to the reward used for the Global policy). All the baselines require access to the ground-truth map during training for computing the reward. The supervision for the Global Policy, the Local Policy and the Mapper can also be obtained from the ground-truth map. The Pose Estimator requires additional supervision in the form of the ground-truth agent pose. We study the effect of this additional supervision in ablation experiments. \nTable 1: Exploration performance of the proposed model, Active Neural SLAM (ANS), and baselines. The baselines are adapted from [1] Savva et al. (2019), [2] Mirowski et al. (2017) and [3] Chen et al. (2019). The MP3D Test columns report domain generalization. \nMethod | Gibson Val % Cov. | Gibson Val Cov. (m²) | MP3D Test % Cov. | MP3D Test Cov. (m²) \nRL + 3LConv [1] | 0.737 | 22.838 | 0.332 | 47.758 \nRL + Res18 | 0.747 | 23.188 | 0.341 | 49.175 \nRL + Res18 + AuxDepth [2] | 0.779 | 24.467 | 0.356 | 51.959 \nRL + Res18 + ProjDepth [3] | 0.789 | 24.863 | 0.378 | 54.775 \nActive Neural SLAM (ANS) | 0.948 | 32.701 | 0.521 | 73.281 \nFigure 3: Plot showing the % Coverage as the episode progresses for ANS and the baselines on the large and small scenes in the Gibson Val set as well as the overall Gibson Val set. (Panels: Gibson Val - Large, Gibson Val - Small, Gibson Val - Overall; x-axis: episode length, 0 to 1000; y-axis: % Coverage, 0.0 to 1.0.) 6 RESULTS We train the proposed ANS model and all the baselines for the Exploration task with 10 million frames on the Gibson training set. The results are shown in Table 1. The results on the Gibson Val set are averaged over a total of 994 episodes in 14 different unseen scenes. The proposed model achieves an average absolute and relative coverage of 32.701m² / 0.948 as compared to 24.863m² / 0.789 for the best baseline. This indicates that the proposed model is more efficient and effective at exhaustive exploration as compared to the baselines. This is because our hierarchical policy architecture reduces the horizon of the long-term exploration problem: instead of taking tens of low-level navigational actions, the Global policy only takes a few long-term goal actions. We also report the domain generalization performance on the Exploration task in Table 1 (see the MP3D Test columns), where all models trained on Gibson are evaluated on the Matterport domain. ANS leads to higher domain generalization performance (73.281m² / 0.521 vs 54.775m² / 0.378). The absolute coverage is higher and % Cov is lower for the Matterport domain as it consists of larger scenes on average. 
On a set of small MP3D test scenes (comparable to Gibson scene sizes), ANS achieved a performance of 31.407m² / 0.836 as compared to 23.091m² / 0.620 for the best baseline. Some visualizations of policy execution are provided in Figure 4 (see https://devendrachaplot.github.io/projects/Neural-SLAM for visualization videos). In Fig. 3, we plot the relative coverage (% Cov) of all the models as the episode progresses on the large and small scene sets, as well as the overall Gibson Val set. The plot on the small scene set shows that ANS is able to almost completely explore the small scenes in around 500 steps; however, the baselines are only able to explore 85-90% of the small scenes in 1000 steps (see Fig. 3 center). This indicates that ANS explores more efficiently in small scenes. The plot on the large scene set shows that the performance gap between ANS and the baselines widens as the episode progresses (see Fig. 3 left). Looking at the behavior of the baselines, we saw that they often got stuck in local areas. This behavior indicates that they are unable to remember explored areas over long time horizons and are ineffective at long-term planning. On the other hand, ANS uses a Global policy on the map, which allows it to keep a memory of explored areas over long time horizons and to plan effectively to reach distant long-term goals by leveraging analytical planners. As a result, it is able to explore effectively in large scenes with long episode lengths. \nTable 2: Results of the ablation experiments on the Gibson environment. \nMethod | Gibson Val Overall % Cov. | Gibson Val Overall Cov. (m²) | Gibson Val Large % Cov. | Gibson Val Large Cov. (m²) | Gibson Val Small % Cov. | Gibson Val Small Cov. (m²) \nANS w/o Local Policy + Det. Planner | 0.941 | 32.188 | 0.845 | 53.999 | 0.980 | 23.464 \nANS w/o Global Policy + FBE | 0.925 | 30.981 | 0.782 | 49.731 | 0.982 | 23.481 \nANS w/o Pose Estimation | 0.916 | 30.746 | 0.771 | 49.518 | 0.973 | 23.237 \nANS | 0.948 | 32.701 | 0.862 | 55.608 | 0.983 | 23.538 \nFigure 4: Exploration visualization. Figure showing a sample trajectory of the Active Neural SLAM model in the Exploration task. Top: RGB observations seen by the agent. Inset: Global ground truth map and pose (not visible to the agent). Bottom: Local map and pose predictions. Long-term goals selected by the Global policy are shown by blue circles. The ground-truth map and pose are under-laid in grey. Map prediction is overlaid in green, with dark green denoting correct predictions and light green denoting false positives. Agent pose predictions are shown in red. The light blue shaded region shows the explored area. 6.1 ABLATIONS Local Policy. An alternative to learning a Local Policy is to have a deterministic policy which follows the plan given by the Planner. As shown in Table 2, the ANS model performs slightly worse without the Local Policy. The Local Policy is designed to adapt to small errors in mapping. We observed the Local policy overcoming false positives encountered in mapping. For example, the Neural SLAM module could sometimes wrongly predict a carpet as an obstacle. In this case, the planner would plan to go around the carpet. However, if the short-term goal is beyond the carpet, the Local policy can understand that the carpet is not an obstacle based on the RGB observation and learn to walk over it. Global Policy. An alternative to learning a Global Policy for sampling long-term goals is to use a classical algorithm called Frontier-based exploration (FBE) (Yamauchi, 1997). A frontier is defined as the boundary between the explored free space and the unexplored space. Frontier-based exploration essentially samples points on this frontier as goals to explore the space. There are different variants of Frontier-based exploration based on the sampling strategy. Holz et al. (2010) compare different sampling strategies and find that sampling the point on the frontier closest to the agent gives the best results empirically. 
We implement this variant and use it in place of our learned Global Policy. As shown in Table 2, the performance of the Frontier-based exploration policy is comparable on small scenes, but around 10% lower on large scenes, relative to the Global policy. This indicates the importance of learning as compared to classical exploration methods in larger scenes. Qualitatively, we observed that Frontier-based exploration spent a lot of time exploring corners or small areas behind furniture. In contrast, the trained Global policy ignored small spaces and chose distant long-term goals, which led to higher coverage. Pose Estimation. A difference between ANS and the baselines is that ANS uses additional supervision to train the Pose Estimator. In order to understand whether the performance gain is coming from this additional supervision, we remove the Pose Estimator from ANS and just use the input sensor reading as our pose estimate. Results in Table 2 show that ANS still outperforms the baselines even without the Pose Estimator. We also observed that performance without the Pose Estimator drops only about 1% on small scenes, but around 10% on large scenes. This is expected because larger scenes take longer to explore, and pose errors accumulate over time to cause drift. Passing the ground truth pose as input to the baselines instead of the sensor reading did not improve their performance. Figure 5: Real-world Transfer. Left: Image showing the living area in an apartment used for the real-world experiments. Right: Sample images seen by the robot and the predicted map. The long-term goal selected by the Global Policy is shown by a blue circle on the map. 6.2 REAL-WORLD TRANSFER We deploy the trained ANS policy on a Locobot in the real world. 
In order to match the real-world observations to the simulator observations as closely as possible, we change the simulator input configuration to match the camera intrinsics on the Locobot. This includes the camera height and the horizontal and vertical fields of view. In Figure 5, we show an episode of ANS exploring the living area in an apartment. The figure shows that the policy transfers well to the real world and is able to effectively explore the environment. The long-term goals sampled by the Global policy (shown by blue circles on the map) are often towards open spaces in the explored map, which indicates that it is learning to exploit the structure in the map. Please refer to the project webpage for real-world transfer videos. 6.3 POINTGOAL TASK TRANSFER PointGoal has been the most studied task in recent literature on navigation, where the objective is to navigate to a goal location whose relative coordinates are given as input within a limited time budget. In this task, each episode ends when either the agent takes the stop action or at a maximum of 500 timesteps. An episode is considered a success when the final position of the agent is within 0.2m of the goal location. In addition to Success rate (Succ), Success weighted by (normalized inverse) Path Length or SPL is also used as a metric for evaluation, as proposed by Anderson et al. (2018). All the baseline models trained for the task of Exploration either need to be retrained or at least fine-tuned to be transferred to the Pointgoal task. The modularity of ANS provides it another advantage: it can be transferred to the Pointgoal task without any additional training. For transferring to the Pointgoal task, we just fix the Global policy to always output the PointGoal coordinates as the long-term goal and use the Local Policy and Neural SLAM module trained for the Exploration task. 
We found that an ANS policy trained on exploration, when transferred to the Pointgoal task, performed better than several RL and Imitation Learning baselines trained on the Pointgoal task. The transferred ANS model achieves a success rate/SPL of 0.950/0.846 as compared to 0.827/0.730 for the best baseline model on the Gibson val set. The ANS model also generalized significantly better than the baselines to harder goals and to the Matterport domain. In addition to better performance, ANS was also 10 to 75 times more sample efficient than the baselines. This transferred ANS policy was also the winner of the CVPR 2019 Habitat Pointgoal Navigation Challenge for both RGB and RGB-D tracks among over 150 submissions from 16 teams. These results highlight a key advantage of our model. It allows us to transfer the knowledge of obstacle avoidance and control in low-level navigation across tasks, as the Local Policy and Neural SLAM module are task-invariant. More details about the Pointgoal experiments, baselines, and results, including domain and goal generalization on the Pointgoal task, are provided in the supplementary material. 7 CONCLUSION In this paper, we proposed a modular navigational model which leverages the strengths of classical and learning-based navigational methods. We show that the proposed model outperforms prior methods on both Exploration and PointGoal tasks and shows strong generalization across domains, goals, and tasks. In the future, the proposed model can be extended to complex semantic tasks such as Semantic Goal Navigation and Embodied Question Answering by using a semantic Neural SLAM module which creates a multi-channel map capturing semantic properties of the objects in the environment. The model can also be combined with prior work on Localization to relocalize in a previously created map for efficient navigation in subsequent episodes. 
ACKNOWLEDGEMENTS This work was supported by IARPA DIVA D17PC00340, ONR Grant N000141812861, ONR MURI, ONR Young Investigator, DARPA MCS, and Apple. We would also like to acknowledge NVIDIA\u2019s GPU support. We thank Guillaume Lample for discussions and coding during the initial stages of this project. Licenses for referenced datasets. Gibson: http://svl.stanford.edu/gibson2/assets/GDS_agreement.pdf Matterport3D: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf REFERENCES Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018. Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397\u2013422, 2002. Somil Bansal, Varun Tolani, Saurabh Gupta, Jitendra Malik, and Claire Tomlin. Combining optimal control and learning for visual navigation in novel environments. In Conference on Robot Learning (CoRL), 2019. Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41\u201377, 2003. John Canny. The complexity of robot motion planning. MIT Press, 1988. Luca Carlone, Jingjing Du, Miguel Kaouk Ng, Basilio Bona, and Marina Indri. Active SLAM and exploration with particle filters using Kullback-Leibler divergence. Journal of Intelligent & Robotic Systems, 75(2):291\u2013311, 2014. Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nie\u00dfner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 2017 International Conference on 3D Vision (3DV), pages 667\u2013676. IEEE, 2017. Devendra Singh Chaplot and Guillaume Lample. Arnold: An autonomous agent to play FPS games. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. 
Devendra Singh Chaplot, Guillaume Lample, Kanthashree Mysore Sathyendra, and Ruslan Salakhutdinov. Transfer deep reinforcement learning in 3D environments: An empirical study. In NIPS Deep Reinforcement Learning Workshop, 2016. Devendra Singh Chaplot, Emilio Parisotto, and Ruslan Salakhutdinov. Active neural localization. ICLR, 2018. Tao Chen, Saurabh Gupta, and Abhinav Gupta. Learning exploration policies for navigation. In ICLR, 2019. Kyunghyun Cho, Bart Van Merri\u00ebnboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014. Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. In CVPR, 2018a. Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Neural modular control for embodied question answering. In Conference on Robot Learning, pages 53\u201362, 2018b. Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pages 271\u2013278, 1993. Christian Dornhege and Alexander Kleiner. A frontier-void-based approach for autonomous exploration in 3D. Advanced Robotics, 27(6):459\u2013468, 2013. Kuan Fang, Alexander Toshev, Li Fei-Fei, and Silvio Savarese. Scene memory transformer for embodied agents in long-horizon tasks. In CVPR, 2019. J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rend\u00f3n-Mancha. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, 2015. Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. IQA: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4089\u20134098, 2018. 
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616\u20132625, 2017. Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, 2016. Joao F Henriques and Andrea Vedaldi. MapNet: An allocentric spatial memory for mapping environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8476\u20138484, 2018. Dirk Holz, Nicola Basilico, Francesco Amigoni, and Sven Behnke. Evaluating the efficiency of frontier-based exploration strategies. In ISR 2010 (41st International Symposium on Robotics) and ROBOTIK 2010 (6th German Conference on Robotics), pages 1\u20138. VDE, 2010. Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. UIST, 2011. Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017\u20132025, 2015. Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563\u20131600, 2010. Elia Kaufmann, Mathias Gehrig, Philipp Foehn, Ren\u00e9 Ranftl, Alexey Dosovitskiy, Vladlen Koltun, and Davide Scaramuzza. Beauty and the beast: Optimal methods meet learning for drone racing. In 2019 International Conference on Robotics and Automation (ICRA), pages 690\u2013696. IEEE, 2019. 
Lydia E Kavraki, Petr Svestka, J-C Latombe, and Mark H Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. RA, 1996. Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209\u2013232, 2002. Arbaaz Khan, Clark Zhang, Nikolay Atanasov, Konstantinos Karydis, Daniel D Lee, and Vijay Kumar. End-to-end navigation in unknown environments using neural networks. arXiv preprint arXiv:1707.07385, 2017. S. Kohlbrecher, J. Meyer, O. von Stryk, and U. Klingauf. A flexible and scalable SLAM system with full 3D motion estimation. In Proc. IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR). IEEE, November 2011. Thomas Kollar and Nicholas Roy. Trajectory optimization using reinforcement learning for map exploration. The International Journal of Robotics Research, 27(2):175\u2013196, 2008. Ilya Kostrikov. PyTorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, 2018. Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. Steven M Lavalle and James J Kuffner Jr. Rapidly-exploring random trees: Progress and prospects. In Algorithmic and Computational Robotics: New Directions, 2000. Lisa Lee, Emilio Parisotto, Devendra Singh Chaplot, Eric Xing, and Ruslan Salakhutdinov. Gated path planning networks. In ICML, 2018. Ruben Martinez-Cantin, Nando de Freitas, Eric Brochu, Jos\u00e9 Castellanos, and Arnaud Doucet. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots, 27(2):93\u2013103, 2009. 
Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. ICLR, 2017. Raul Mur-Artal and Juan D Tard\u00f3s. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255\u20131262, 2017. Adithyavairavan Murali, Tao Chen, Kalyan Vasudev Alwala, Dhiraj Gandhi, Lerrel Pinto, Saurabh Gupta, and Abhinav Gupta. PyRobot: An open-source robotics framework for research and benchmarking. arXiv preprint arXiv:1906.08236, 2019. Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. DTAM: Dense tracking and mapping in real-time. In 2011 International Conference on Computer Vision, pages 2320\u20132327. IEEE, 2011. Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. ICLR, 2018. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop, 2017. Andrzej Pronobis and Patric Jensfelt. Large-scale semantic mapping and reasoning with heterogeneous modalities. In 2012 IEEE International Conference on Robotics and Automation, pages 3515\u20133522. IEEE, 2012. Morgan Quigley, Brian Gerkey, Ken Conley, Josh Faust, Tully Foote, Jeremy Leibs, Eric Berger, Rob Wheeler, and Andrew Ng. ROS: an open-source robot operating system. In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA) Workshop on Open Source Robotics, Kobe, Japan, May 2009. Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations (ICLR), 2018. Nikolay Savinov, Anton Raichuk, Rapha\u00ebl Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. 
Episodic curiosity through reachability. In ICLR, 2019. Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In Proceedings of the IEEE International Conference on Computer Vision, pages 9339\u20139347, 2019. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. James A Sethian. A fast marching level set method for monotonically advancing fronts. Proceedings of the National Academy of Sciences, 93(4):1591\u20131595, 1996. Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017. Cyrill Stachniss, Giorgio Grisetti, and Wolfram Burgard. Information gain-based exploration using Rao-Blackwellized particle filters. In Robotics: Science and Systems, volume 2, pages 65\u201372, 2005. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018. Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181\u2013211, 1999. Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154\u20132162, 2016. Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. MIT Press, 2005. Matthew R Walter, Sachithra Hemachandra, Bianca Homberg, Stefanie Tellex, and Seth Teller. Learning semantic maps from natural language descriptions. In Robotics: Science and Systems, 2013. Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: real-world perception for embodied agents. 
In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018. Kai Xu, Lintao Zheng, Zihao Yan, Guohang Yan, Eugene Zhang, Matthias Niessner, Oliver Deussen, Daniel Cohen-Or, and Hui Huang. Autonomous reconstruction of unknown indoor scenes guided by time-varying tensor fields. ACM Transactions on Graphics (TOG), 36(6):202, 2017. Brian Yamauchi. A frontier-based approach for autonomous exploration. In CIRA, volume 97, page 146, 1997. Jingwei Zhang, Lei Tai, Joschka Boedecker, Wolfram Burgard, and Ming Liu. Neural SLAM: Learning to explore with external memory. arXiv preprint arXiv:1706.09520, 2017. Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3357\u20133364. IEEE, 2017. \nTable 3: Performance of the proposed model, Active Neural SLAM (ANS), and all the baselines on the PointGoal task. \u2018ANS - Task Transfer\u2019 refers to the ANS model transferred to the PointGoal task after training on the Exploration task. Gibson Val is the standard test setting, MP3D Test measures domain generalization, and Hard-GEDR and Hard-Dist measure goal generalization.\nTrain Task | Method | Gibson Val Succ | Gibson Val SPL | MP3D Test Succ | MP3D Test SPL | Hard-GEDR Succ | Hard-GEDR SPL | Hard-Dist Succ | Hard-Dist SPL\nPointGoal | Random | 0.027 | 0.021 | 0.010 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000\nPointGoal | RL + Blind | 0.625 | 0.421 | 0.136 | 0.087 | 0.052 | 0.020 | 0.008 | 0.006\nPointGoal | RL + 3LConv + GRU | 0.550 | 0.406 | 0.102 | 0.080 | 0.072 | 0.046 | 0.006 | 0.006\nPointGoal | RL + Res18 + GRU | 0.561 | 0.422 | 0.160 | 0.125 | 0.176 | 0.109 | 0.004 | 0.003\nPointGoal | RL + Res18 + GRU + AuxDepth | 0.640 | 0.461 | 0.189 | 0.143 | 0.277 | 0.197 | 0.013 | 0.011\nPointGoal | RL + Res18 + GRU + ProjDepth | 0.614 | 0.436 | 0.134 | 0.111 | 0.180 | 0.129 | 0.008 | 0.004\nPointGoal | IL + Res18 + GRU | 0.823 | 0.725 | 0.365 | 0.318 | 0.682 | 0.558 | 0.359 | 0.310\nPointGoal | CMP | 0.827 | 0.730 | 0.320 | 0.270 | 0.670 | 0.553 | 0.369 | 0.318\nPointGoal | ANS | 0.951 | 0.848 | 0.593 | 0.496 | 0.824 | 0.710 | 0.662 | 0.534\nExploration | ANS - Task Transfer | 0.950 | 0.846 | 0.588 | 0.490 | 0.821 | 0.703 | 0.665 | 0.532\n
A POINTGOAL EXPERIMENTS PointGoal has been the most studied task in recent literature on navigation, where the objective is to navigate, within a limited time budget, to a goal location whose relative coordinates are given as input. We follow the PointGoal task setup from Savva et al. (2019), using train/val/test splits for both Gibson and Matterport datasets. Note that the set of scenes used in each split is disjoint, which means the agent is tested on new scenes never seen during training. The Gibson test set is not public but rather held out on an online evaluation server (footnote 3). We report the performance of our model on the Gibson test set when submitted to the online server, but also use the validation set as another test set for extensive comparison and analysis. We do not use the validation set for hyper-parameter tuning. Savva et al. (2019) identify two measures to quantify the difficulty of a PointGoal dataset. The first is the average geodesic distance (distance along the shortest path) to the goal location from the starting location of the agent, and the second is the average geodesic to Euclidean distance ratio (GED ratio). The GED ratio is always greater than or equal to 1, with higher ratios resulting in harder episodes. 
The train/val/test splits in the Gibson dataset come from the same distribution, with similar average geodesic distance and GED ratio. In order to analyze the performance of the proposed model on an out-of-set goal distribution, we create two harder sets, Hard-Dist and Hard-GEDR. In the Hard-Dist set, the geodesic distance to the goal is always more than 10 m and the average geodesic distance to the goal is 13.48 m, as compared to 6.9/6.5/7.0 m in the train/val/test splits (Savva et al., 2019). The Hard-GEDR set consists of episodes with an average GED ratio of 2.52 and a minimum GED ratio of 2.0, as compared to an average GED ratio of 1.37 in the Gibson val set. We also follow the episode specification from Savva et al. (2019). Each episode ends when either the agent takes the stop action or at a maximum of 500 timesteps. An episode is considered a success when the final position of the agent is within 0.2 m of the goal location. In addition to Success rate (Succ), we also use Success weighted by (normalized inverse) Path Length (SPL) as a metric for evaluation for the PointGoal task, as proposed by Anderson et al. (2018). A.1 POINTGOAL RESULTS In Table 3, we show the performance of the proposed model transferred to the PointGoal task along with the baselines trained on the PointGoal task with the same amount of data (10 million frames). The proposed model achieves a success rate/SPL of 0.950/0.846 as compared to 0.827/0.730 for the best baseline model on the Gibson val set. We also report the performance of the proposed model trained from scratch on the PointGoal task for 10 million frames. The results indicate that the performance of ANS transferred from Exploration is comparable to ANS trained on PointGoal. This highlights a key advantage of our model: it allows us to transfer the knowledge of obstacle avoidance and control in low-level navigation across tasks, as the Local Policy and Neural SLAM module are task-invariant. Sample efficiency. 
RL models are typically trained for more than 10 million samples. In order to compare the performance and sample efficiency, we trained the best performing RL model (RL + Res18 + GRU + ProjDepth) for 75 million frames and it achieved a Succ/SPL of 0.678/0.486. ANS reaches a Succ/SPL of 0.789/0.703 at only 1 million frames. These numbers indicate that ANS achieves a >75\u00d7 speedup as compared to the best RL baseline. Footnote 3: https://evalai.cloudcv.org/web/challenges/challenge-page/254 \nFigure 7: Performance of the proposed ANS model along with the CMP and IL + Res18 + GRU (GRU) baselines with increase in geodesic distance to goal and increase in GED ratio on the Gibson Val set. Figure 8 (panels: Successful Trajectories, Failure Case): Sample trajectories of the proposed model along with the predicted map in the PointGoal task. The starting and goal locations are shown by black squares and blue circles, respectively. The ground-truth map is under-laid in grey. Map prediction is overlaid in green, with dark green denoting correct predictions and light green denoting false positives. The blue shaded region shows the explored area prediction. On the left, we show some successful trajectories which indicate that the model is effective at long distance goals with high GED ratio. On the right, we show a failure case due to mapping error. Figure 6: Screenshot of CVPR 2019 Habitat Challenge Results. The proposed model was submitted under code-name \u2018Arnold\u2019. Domain and Goal Generalization: In Table 3 (see shaded region), we evaluate all the baselines and ANS trained on the PointGoal task in the Gibson domain on the test set in the Matterport domain as well as on the harder goal sets in Gibson. We also transfer ANS trained on Exploration in Gibson to all 3 sets. The results show that ANS outperforms all the baselines on all generalization sets. 
Interestingly, RL-based methods almost fail completely on the Hard-Dist set. We also analyze the performance of the proposed model as compared to the two best baselines, CMP and IL + Res18 + GRU, as a function of geodesic distance to goal and GED ratio in Figure 7. The performance of the baselines drops faster as compared to ANS, especially with the increase in goal distance. This indicates that end-to-end learning methods are effective at short-term navigation but struggle when long-term planning is required to reach a distant goal. In Figure 8, we show some example trajectories of the ANS model along with the predicted map. The successful trajectories indicate that the model exhibits strong backtracking behavior, which makes it effective at distant goals requiring long-term planning. Figure 9 visualizes a trajectory in the PointGoal task, showing first-person observations and the corresponding map predictions. Please refer to the project webpage for visualization videos. Habitat Challenge Results. We submitted the ANS model to the CVPR 2019 Habitat Pointgoal Navigation Challenge. The results are shown in Figure 6. ANS was submitted under code-name \u2018Arnold\u2019. ANS was the winning entry for both RGB and RGB-D tracks among over 150 submissions from 16 teams, achieving an SPL of 0.805 (RGB) and 0.948 (RGB-D) on the Test Challenge set. \nFigure 9: Pointgoal visualization. Sample trajectories of the proposed model along with the predicted map in the PointGoal task as the episode progresses (panels at t=1, t=50, t=100, t=150, t=200 and t=223). The starting and goal locations are shown by black squares and blue circles, respectively. The ground-truth map is under-laid in grey. Map prediction is overlaid in green, with dark green denoting correct predictions and light green denoting false positives. The blue shaded region shows the explored area prediction. 
B NOISE MODEL IMPLEMENTATION DETAILS In order to implement the actuation and sensor noise models, we would like to collect data for navigational actions in the Habitat simulator. We use three default navigational actions: Forward: move forward by 25 cm, Turn Right: on-the-spot rotation clockwise by 10 degrees, and Turn Left: on-the-spot rotation counter-clockwise by 10 degrees. The control commands are implemented as $u_{Forward} = (0.25, 0, 0)$, $u_{Right} = (0, 0, -10 \\cdot \\pi/180)$ and $u_{Left} = (0, 0, 10 \\cdot \\pi/180)$. In practice, a robot can also rotate slightly while moving forward and translate a bit while rotating on the spot, creating rotational actuation noise in the forward action and, similarly, translational actuation noise in the on-the-spot rotation actions. We use a Locobot (footnote 4) to collect data for building the actuation and sensor noise models. We use the pyrobot API (Murali et al., 2019) along with ROS (Quigley et al., 2009) to implement the control commands and get sensor readings. In order to get an accurate agent pose, we use a Hokuyo UST-10LX Scanning Laser Rangefinder (LiDAR), which is especially precise in our scenario as we take static readings in 2D (Kohlbrecher et al., 2011). We install the LiDAR on the Locobot by replacing the arm with the LiDAR. We note that the Hokuyo UST-10LX Scanning Laser Rangefinder is an expensive sensor. It costs $1600, as compared to the whole Locobot costing less than $2000 without the arm. Using expensive sensors can improve the performance of a model; however, for a method to be scalable, it should ideally work with cheaper sensors too. In order to demonstrate the scalability of our method, we use the LiDAR only to collect the data for building noise models and not for training or deploying navigation policies in the real world. For the sensor estimate, we use the Kobuki base odometry available in Locobot. We approximate the LiDAR pose estimate to be the true pose of the agent as it is orders of magnitude more accurate than the base sensor. 
For each action, we collect 600 datapoints from both the base sensor and the LiDAR, making a total of 3600 datapoints ($600 \\times 3 \\times 2$). We use 500 datapoints for each action to fit the actuation and sensor noise models and use the remaining 100 datapoints for validation. For each action $a$, the LiDAR pose estimates give us samples of $p_1$ and the base sensor readings give us samples of $p_1'$, $i = 1, 2, \\ldots, 600$. The difference between the LiDAR estimates ($p_1^i$) and the control command ($\\Delta u_a$) gives us samples of the actuation noise for the action $a$: $\\epsilon^i_{act,a} = p_1^i - \\Delta u_a$, and the difference between the base sensor readings and the LiDAR estimates gives us samples of the sensor noise: $\\epsilon^i_{sen,a} = p_1^{i\\prime} - p_1^i$. For each action $a$, we fit a separate Gaussian Mixture Model for the actuation noise and the sensor noise using samples $\\epsilon^i_{act,a}$ and $\\epsilon^i_{sen,a}$ respectively, making a total of 6 models. We fit Gaussian mixture models with the number of components ranging from 1 to 20 and pick the model with the highest likelihood on the validation set. Each component in these Gaussian mixture models is a multi-variate Gaussian in 3 variables, $x$, $y$ and $o$. We implement these actuation and sensor noise models in the Habitat simulator for our experiments. Footnote 4: http://locobot.org \nC NEURAL SLAM MODULE IMPLEMENTATION DETAILS The Neural SLAM module ($f_{SLAM}$) takes in the current RGB observation, $s_t \\in \\mathbb{R}^{3 \\times H \\times W}$, the current and last sensor readings of the agent pose, $x'_{t-1:t}$, and the map at the previous time step, $m_{t-1} \\in \\mathbb{R}^{2 \\times M \\times M}$, and outputs an updated map, $m_t \\in \\mathbb{R}^{2 \\times M \\times M}$, and the current agent pose estimate, $\\hat{x}_t$ (see Figure 2): $m_t, \\hat{x}_t = f_{SLAM}(s_t, x'_{t-1:t}, \\hat{x}_{t-1}, m_{t-1} | \\theta_S, b_{t-1})$, where $\\theta_S$ denotes the trainable parameters and $b_{t-1}$ denotes the internal representations of the Neural SLAM module. 
The Neural SLAM module can be broken down into two parts, a Mapper ($f_{Map}$) and a Pose Estimator Unit ($f_{PE}$). The Mapper outputs an egocentric top-down 2D spatial map, $p^{ego}_t \\in [0, 1]^{2 \\times V \\times V}$ (where $V$ is the vision range), predicting the obstacles and the explored area in the current observation: $p^{ego}_t = f_{Map}(s_t | \\theta_M)$, where $\\theta_M$ are the parameters of the Mapper. It consists of ResNet18 convolutional layers to produce an embedding of the observation. This embedding is passed through two fully-connected layers followed by 3 deconvolutional layers to get the first-person top-down 2D spatial map prediction. Now, we would like to add the egocentric map prediction ($p^{ego}_t$) to the geocentric map from the previous time step ($m_{t-1}$). In order to transform the egocentric map to the geocentric frame, we need the pose of the agent in the geocentric frame. The sensor reading $x'_t$ is typically noisy. Thus, we have a Pose Estimator to correct the sensor reading and give an estimate of the agent\u2019s geocentric pose. In order to estimate the pose of the agent, we first calculate the relative pose change ($dx$) from the last time step using the sensor readings at the current and last time step ($x'_{t-1}, x'_t$). Then we use a Spatial Transformation (Jaderberg et al., 2015) on the egocentric map prediction at the last frame ($p^{ego}_{t-1}$) based on the relative pose change ($dx$): $p'_{t-1} = f_{ST}(p^{ego}_{t-1} | dx)$. Note that the parameters of this Spatial Transformation are not learnt, but calculated using the pose change ($dx$). This transforms the projection at the last step to the current egocentric frame of reference. If the sensor were accurate, $p'_{t-1}$ would highly overlap with $p^{ego}_t$. 
The Pose Estimator Unit takes in $p'_{t-1}$ and $p^{ego}_t$ as input and predicts the relative pose change: $\\widehat{dx}_t = f_{PE}(p'_{t-1}, p^{ego}_t | \\theta_P)$. The intuition is that by looking at the egocentric predictions of the last two frames, the pose estimator can learn to predict the small translation and/or rotation that would align them better. The predicted relative pose change is then added to the last pose estimate to get the final pose estimate, $\\hat{x}_t = \\hat{x}_{t-1} + \\widehat{dx}_t$. Finally, the egocentric spatial map prediction is transformed to the geocentric frame using the current pose prediction of the agent ($\\hat{x}_t$) with another Spatial Transformation, and aggregated with the previous spatial map ($m_{t-1}$) using a Channel-wise Pooling operation: $m_t = m_{t-1} + f_{ST}(p^{ego}_t | \\hat{x}_t)$. Combining all the functions and transformations: $m_t, \\hat{x}_t = f_{SLAM}(s_t, x'_{t-1:t}, m_{t-1} | \\theta_S, b_{t-1})$; $p^{ego}_t = f_{Map}(s_t | \\theta_M)$; $\\hat{x}_t = \\hat{x}_{t-1} + f_{PE}(f_{ST}(p^{ego}_{t-1} | \\hat{x}_{t-1:t}), p^{ego}_t | \\theta_P)$; $m_t = m_{t-1} + f_{ST}(p^{ego}_t | \\hat{x}_t)$, where $\\theta_M, \\theta_P \\in \\theta_S$ and $p^{ego}_{t-1}, \\hat{x}_{t-1} \\in b_{t-1}$. D ARCHITECTURE DETAILS We use PyTorch (Paszke et al., 2017) for implementing and training our model. The Mapper in the Neural SLAM module consists of ResNet18 convolutional layers followed by 2 fully-connected layers trained with a dropout of 0.5, followed by 3 deconvolutional layers. The Pose Estimator consists of 3 convolutional layers followed by 3 fully connected layers. The Global Policy is a 5 layer convolutional network followed by 3 fully connected layers. We also pass the agent orientation as a separate input (not captured in the map tensor) to the Global Policy. It is processed by an Embedding layer and added as an input to the fully-connected layers. The Local Policy consists of pretrained ResNet18 convolutional layers followed by fully connected layers and a recurrent GRU layer. In addition to the RGB observation, the Local policy receives the relative distance and angle to the short-term goal as input. We bin the relative distance (bin size increasing with distance), 
We bin the relative distance (bin size increasing with distance), \nPublished as a conference paper at ICLR 2020 relative angle ( 5degree bins) and current timestep ( 30time step bins) before passing them through embedding layers. This kind of discretization is used previously for RL policies (Lample and Chaplot, 2017; Chaplot and Lample, 2017) and it improved the sample ef\ufb01ciency as compared to passing the continuous values as input directly. For a fair comparison, we use the same discretization for all the baselines as well. The short-term goal is processed using Embedding layers. For the exact architectures of all the modules, please refer to the open-source code. E H YPERPARAMETER DETAILS We train all the components with 72 parallel threads, with each thread using one of the 72 scenes in the Gibson training set. We maintain a FIFO memory of size 500000 for training the Neural SLAM module. After one step in all the environments (i.e. every 72 steps) we perform 10 updates to the Neural SLAM module with a batch size of 72. We use Adam optimizer with a learning rate of 0:0001 . We use binary cross-entropy loss for obstacle map and explored area prediction and MSE Loss for pose prediction (in meters and radians). The obstacle map and explored area loss coef\ufb01cients are 1 and the pose loss coef\ufb01cient is 10000 (as MSE loss in meters and radians is much smaller). The Global policy samples a new goal every 25 timesteps. We use Proximal Policy Optimization (PPO) (Schulman et al., 2017) for training the Global policy. Our PPO implementation for the Global Policy is based on Kostrikov (2018). The reward for the Global policy is the increase in coverage in m2scaled by 0.02. It is trained with 72 parallel threads and a horizon length of 40 steps (40 steps for Global policy is equivalent to 1000 low-level timesteps as Global policy samples a new goal after every 25 timesteps). We use 36 mini-batches and do 4 epochs in each PPO update. 
We use the Adam optimizer with a learning rate of 0.000025, a discount factor of $\\gamma = 0.99$, an entropy coefficient of 0.001 and a value loss coefficient of 0.5 for training the Global Policy. The Local Policy is trained using binary cross-entropy loss. We use the Adam optimizer with a learning rate of 0.0001 for training the Local Policy. The input frame size is $128 \\times 128$ and the vision range for the SLAM module is $V = 64$, i.e. 3.2 m (each cell is 5 cm in length). Since there are no parameters dependent on the map size, it can be adaptive. We use a map size of $M = 480$ (equivalent to 24 m) for training and $M = 960$ (equivalent to 48 m) for evaluation. A map of size 48 m $\\times$ 48 m is large enough for all scenes in the Gibson val set. The size of the Global Policy input is constant, $G = 240$, which means we downscale the map by a factor of 2 during training and 4 during evaluation. All hyperparameters are available in the code. F ADDITIONAL RESULTS [Figure 10 plot: Coverage ($m^2$) versus episode length (0 to 1000) in three panels, Gibson Val - Large, Gibson Val - Small and Gibson Val - Overall, for ANS, RL + 3LConv + GRU, RL + Res18 + GRU, RL + Res18 + GRU + AuxDepth and RL + Res18 + GRU + ProjDepth.] Figure 10: Plot showing the absolute Coverage in $m^2$ as the episode progresses for ANS and the baselines on the large and small scenes in the Gibson Val set as well as the overall Gibson Val set."
},
{
    "title": "On the Relationship between Self-Attention and Convolutional Layers",
    "pdf_link": "https://openreview.net/pdf?id=HJlnC1rKPB",
    "abstract": "Recent trends of incorporating attention mechanisms in vision have led re- searchers to reconsider the supremacy of convolutional layers as a primary build- ing block. Beyond helping CNNs to handle long-range dependencies, Ramachan- dran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work pro- vides evidence that attention layers can perform convolution and, indeed, they of- ten learn to do so in practice. Speci\ufb01cally, we prove that a multi-head self-attention layer with suf\ufb01cient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. Our code is publicly available1.",
    "paper_text": "Published as a conference paper at ICLR 2020 ON THE RELATIONSHIP BETWEEN SELF-ATTENTION AND CONVOLUTIONAL LAYERS Jean-Baptiste Cordonnier, Andreas Loukas & Martin Jaggi \u00b4Ecole Polytechnique F \u00b4ed\u00b4erale de Lausanne (EPFL) ffirst.lastg@epfl.ch ABSTRACT Recent trends of incorporating attention mechanisms in vision have led re- searchers to reconsider the supremacy of convolutional layers as a primary build- ing block. Beyond helping CNNs to handle long-range dependencies, Ramachan- dran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work pro- vides evidence that attention layers can perform convolution and, indeed, they of- ten learn to do so in practice. Speci\ufb01cally, we prove that a multi-head self-attention layer with suf\ufb01cient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. Our code is publicly available1. 1 I NTRODUCTION Recent advances in Natural Language Processing (NLP) are largely attributed to the rise of the trans- former (Vaswani et al., 2017). Pre-trained to solve an unsupervised task on large corpora of text, transformer-based architectures, such as GPT-2 (Radford et al., 2018), BERT (Devlin et al., 2018) and Transformer-XL (Dai et al., 2019), seem to possess the capacity to learn the underlying structure of text and, as a consequence, to learn representations that generalize across tasks. The key differ- ence between transformers and previous methods, such as recurrent neural networks (Hochreiter & Schmidhuber, 1997) and convolutional neural networks (CNN), is that the former can simultane- ously attend to every word of their input sequence. 
This is made possible thanks to the attention mechanism \u2014originally introduced in Neural Machine Translation to better handle long-range de- pendencies (Bahdanau et al., 2015). With self-attention in particular, the similarity of two words in a sequence is captured by an attention score measuring the distance of their representations. The representation of each word is then updated based on those words whose attention score is highest. Inspired by its capacity to learn meaningful inter-dependencies between words, researchers have recently considered utilizing self-attention in vision tasks. Self-attention was \ufb01rst added to CNN by either using channel-based attention (Hu et al., 2018) or non-local relationships across the image (Wang et al., 2018). More recently, Bello et al. (2019) augmented CNNs by replacing some convolu- tional layers with self-attention layers, leading to improvements on image classi\ufb01cation and object detection tasks. Interestingly, Ramachandran et al. (2019) noticed that, even though state-of-the art results are reached when attention and convolutional features are combined, under same com- putation and model size constraints, self-attention- only architectures also reach competitive image classi\ufb01cation accuracy. These \ufb01ndings raise the question, do self-attention layers process images in a similar manner to convolutional layers? From a theoretical perspective, one could argue that transfomers have the capacity to simulate any function\u2014including a CNN. Indeed, P \u00b4erez et al. (2019) showed that a multi- layer attention-based architecture with additive positional encodings is Turing complete under some strong theoretical assumptions, such as unbounded precision arithmetic. Unfortunately, universality results do not reveal how a machine solves a task, only that it has the capacity to do so. Thus, the question of how self-attention layers actually process images remains open. 
(Footnote 1: Code: github.com/epfml/attention-cnn. Website: epfml.github.io/attention-cnn.) Contributions. In this work, we put forth theoretical and empirical evidence that self-attention layers can (and do) learn to behave similarly to convolutional layers: I. From a theoretical perspective, we provide a constructive proof showing that self-attention layers can express any convolutional layer. Specifically, we show that a single multi-head self-attention layer using relative positional encoding can be re-parametrized to express any convolutional layer. II. Our experiments show that the first few layers of attention-only architectures (Ramachandran et al., 2019) do learn to attend to a grid-like pattern around each query pixel, similar to our theoretical construction. Strikingly, this behavior is confirmed both for our quadratic encoding and for a relative encoding that is learned. Our results seem to suggest that localized convolution is the right inductive bias for the first few layers of an image classifying network. We provide an interactive website (footnote 2) to explore how self-attention exploits localized position-based attention in lower layers and content-based attention in deeper layers. For reproducibility purposes, our code is publicly available. 2 BACKGROUND ON ATTENTION MECHANISMS FOR VISION We here recall the mathematical formulation of self-attention layers and emphasize the role of positional encodings. 2.1 THE MULTI-HEAD SELF-ATTENTION LAYER Let X ∈ R^{T×D_in} be an input matrix consisting of T tokens of D_in dimensions each. While in NLP each token corresponds to a word in a sentence, the same formalism can be applied to any sequence of T discrete objects, e.g. pixels. 
A self-attention layer maps any query token t ∈ [T] from D_in to D_out dimensions as follows: Self-Attention(X)_{t,:} := softmax(A_{t,:}) X W_val, (1) where we refer to the elements of the T × T matrix A := X W_qry W_key^T X^T (2) as attention scores and to the softmax output (footnote 3) as attention probabilities. The layer is parametrized by a query matrix W_qry ∈ R^{D_in×D_k}, a key matrix W_key ∈ R^{D_in×D_k} and a value matrix W_val ∈ R^{D_in×D_out}. For simplicity, we exclude any residual connections, batch normalization and constant factors. A key property of the self-attention model described above is that it is equivariant to reordering, that is, the output computed for each token is the same regardless of how the T input tokens are shuffled (the output rows are simply permuted along with the inputs). This is problematic for cases where we expect the order of things to matter. To alleviate the limitation, a positional encoding is learned for each token in the sequence (or pixel in an image), and added to the representation of the token itself before applying self-attention: A := (X + P) W_qry W_key^T (X + P)^T, (3) where P ∈ R^{T×D_in} contains the embedding vectors for each position. More generally, P may be substituted by any function that returns a vector representation of the position. It has been found beneficial in practice to replicate this self-attention mechanism into multiple heads, each being able to focus on different parts of the input by using different query, key and value matrices. In multi-head self-attention, the outputs of the N_h heads of output dimension D_h are concatenated and projected to dimension D_out as follows: MHSA(X) := concat_{h∈[N_h]} [Self-Attention_h(X)] W_out + b_out (4) and two new parameters are introduced: the projection matrix W_out ∈ R^{N_h D_h × D_out} and a bias term b_out ∈ R^{D_out}. (Footnote 2: epfml.github.io/attention-cnn. Footnote 3: softmax(A_{t,:})_k = exp(A_{t,k}) / Σ_p exp(A_{t,p}).) 2.2 ATTENTION FOR IMAGES Convolutional layers are the de facto choice for building neural networks that operate on images. 
We recall that, given an image tensor X ∈ R^{W×H×D_in} of width W, height H and D_in channels, the output of a convolutional layer for pixel (i, j) is given by Conv(X)_{i,j,:} := Σ_{(δ_1,δ_2)∈Δ_K} X_{i+δ_1, j+δ_2, :} W_{δ_1,δ_2,:,:} + b, (5) where W is the K × K × D_in × D_out weight tensor (footnote 4), b ∈ R^{D_out} is the bias vector and the set Δ_K := [−⌊K/2⌋, ..., ⌊K/2⌋] × [−⌊K/2⌋, ..., ⌊K/2⌋] contains all possible shifts appearing when convolving the image with a K × K kernel. In the following, we review how self-attention can be adapted from 1D sequences to images. With images, rather than tokens, we have query and key pixels q, k ∈ [W] × [H]. Accordingly, the input is a tensor X of dimension W × H × D_in and each attention score associates a query and a key pixel. To keep the formulas consistent with the 1D case, we abuse notation and slice tensors by using a 2D index vector: if p = (i, j), we write X_{p,:} and A_{p,:} to mean X_{i,j,:} and A_{i,j,:,:}, respectively. With this notation in place, the multi-head self-attention layer output at pixel q can be expressed as follows: Self-Attention(X)_{q,:} = Σ_k softmax(A_{q,:})_k X_{k,:} W_val (6) and accordingly for the multi-head case. 2.3 POSITIONAL ENCODING FOR IMAGES There are two types of positional encoding that have been used in transformer-based architectures: the absolute and the relative encoding (see also Table 3 in the Appendix). With absolute encodings, a (fixed or learned) vector P_{p,:} is assigned to each pixel p. The computation of the attention scores we saw in eq. (2) can then be decomposed as follows: A^abs_{q,k} = (X_{q,:} + P_{q,:}) W_qry W_key^T (X_{k,:} + P_{k,:})^T = X_{q,:} W_qry W_key^T X_{k,:}^T + X_{q,:} W_qry W_key^T P_{k,:}^T + P_{q,:} W_qry W_key^T X_{k,:}^T + P_{q,:} W_qry W_key^T P_{k,:}^T (7) where q and k correspond to the query and key pixels, respectively. The relative positional encoding was introduced by Dai et al. (2019). 
The main idea is to only consider the position difference between the query pixel (the pixel we compute the representation of) and the key pixel (the pixel we attend to) instead of the absolute position of the key pixel: A^rel_{q,k} := X_{q,:}^T W_qry^T W_key X_{k,:} + X_{q,:}^T W_qry^T Ŵ_key r_δ + u^T W_key X_{k,:} + v^T Ŵ_key r_δ (8) In this manner, the attention scores only depend on the shift δ := k − q. Above, the learnable vectors u and v are unique for each head, whereas for every shift δ the relative positional encoding r_δ ∈ R^{D_p} is shared by all layers and heads. Moreover, the key weights are now split into two types: W_key pertains to the input and Ŵ_key to the relative position of pixels. 3 SELF-ATTENTION AS A CONVOLUTIONAL LAYER This section derives sufficient conditions such that a multi-head self-attention layer can simulate a convolutional layer. Our main result is the following: Theorem 1. A multi-head self-attention layer with N_h heads of dimension D_h, output dimension D_out and a relative positional encoding of dimension D_p ≥ 3 can express any convolutional layer of kernel size √N_h × √N_h and min(D_h, D_out) output channels. (Footnote 4: To simplify notation, we index the first two dimensions of the tensor from −⌊K/2⌋ to ⌊K/2⌋.) The theorem is proven constructively by selecting the parameters of the multi-head self-attention layer so that the latter acts like a convolutional layer. In the proposed construction, each self-attention head should attend to a different relative shift within the set Δ_K = {−⌊K/2⌋, ..., ⌊K/2⌋}^2 of all pixel shifts in a K × K kernel. The exact condition can be found in the statement of Lemma 1. 
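The construction can be checked numerically on a tiny example. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names conv_pixel and mhsa_pixel are hypothetical, pixels outside the image are assumed to be zero-padded, and each head is assumed to attend to the key pixel at q + shift so that the indexing matches eq. (5) directly (the lemma's convention f(h) = q − k only relabels the heads).

```python
import random

def conv_pixel(X, kernel, bias, i, j):
    # Eq. (5) at pixel (i, j); pixels outside the image are treated as zeros.
    r = len(kernel) // 2
    out = list(bias)
    for d1 in range(-r, r + 1):
        for d2 in range(-r, r + 1):
            ki, kj = i + d1, j + d2
            if 0 <= ki < len(X) and 0 <= kj < len(X[0]):
                for c, x in enumerate(X[ki][kj]):
                    for o in range(len(out)):
                        out[o] += x * kernel[d1 + r][d2 + r][c][o]
    return out

def mhsa_pixel(X, kernel, bias, i, j):
    # Eq. (12) at query pixel q = (i, j), with one head per shift (d1, d2).
    r = len(kernel) // 2
    width, height, d_in = len(X), len(X[0]), len(X[0][0])
    out = list(bias)
    for d1 in range(-r, r + 1):
        for d2 in range(-r, r + 1):
            # Hard attention: probability 1 on the key pixel q + shift, 0 elsewhere.
            # If that pixel lies outside the image, the head sees only zeros (padding).
            probs = [[1.0 if (ki, kj) == (i + d1, j + d2) else 0.0
                      for kj in range(height)] for ki in range(width)]
            # Attention-weighted sum of the key pixels.
            attended = [sum(probs[ki][kj] * X[ki][kj][c]
                            for ki in range(width) for kj in range(height))
                        for c in range(d_in)]
            # Head matrix W^(h), chosen here as the kernel slice for this shift.
            for c, a in enumerate(attended):
                for o in range(len(out)):
                    out[o] += a * kernel[d1 + r][d2 + r][c][o]
    return out

random.seed(0)
width, height, d_in, d_out, K = 5, 4, 3, 2, 3
X = [[[random.gauss(0, 1) for _ in range(d_in)]
      for _ in range(height)] for _ in range(width)]
kernel = [[[[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
           for _ in range(K)] for _ in range(K)]
bias = [random.gauss(0, 1) for _ in range(d_out)]

# Largest absolute difference between the two layers over all pixels and channels.
max_gap = max(abs(a - b)
              for i in range(width) for j in range(height)
              for a, b in zip(conv_pixel(X, kernel, bias, i, j),
                              mhsa_pixel(X, kernel, bias, i, j)))
print(max_gap)
```

With K^2 = 9 heads and hard attention, the two outputs should agree up to floating-point rounding, which is the content of the construction: the attention pattern selects the shifted pixel and the head matrix plays the role of the corresponding kernel slice.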
Then, Lemma 2 shows that the aforementioned condition is satisfied for the relative positional encoding that we refer to as the quadratic encoding: v^(h) := −α^(h) (1, −2Δ_1^(h), −2Δ_2^(h)), r_δ := (‖δ‖^2, δ_1, δ_2), W_qry = W_key := 0, Ŵ_key := I (9) The learned parameters Δ^(h) = (Δ_1^(h), Δ_2^(h)) and α^(h) determine the center and width of attention of each head, respectively. On the other hand, δ = (δ_1, δ_2) is fixed and expresses the relative shift between query and key pixels. It is important to stress that the above encoding is not the only one for which the conditions of Lemma 1 are satisfied. In fact, in our experiments, the relative encoding learned by the neural network also matched the conditions of the lemma (despite being different from the quadratic encoding). Nevertheless, the encoding defined above is very efficient in terms of size, as only D_p = 3 dimensions suffice to encode the relative position of pixels, while also reaching similar or better empirical performance (than the learned one). The theorem covers the general convolution operator as defined in eq. (17). However, machine learning practitioners using differentiable programming frameworks (Paszke et al., 2017; Abadi et al., 2015) might question if the theorem holds for all hyper-parameters of 2D convolutional layers: • Padding: a multi-head self-attention layer uses by default the \"SAME\" padding while a convolutional layer would decrease the image size by K − 1 pixels. The correct way to alleviate these boundary effects is to pad the input image with ⌊K/2⌋ zeros on each side. In this case, the cropped outputs of a MHSA and a convolutional layer are the same. • Stride: a strided convolution can be seen as a convolution followed by a fixed pooling operation, with computational optimizations. 
Theorem 1 is defined for stride 1, but a fixed pooling layer could be appended to the self-attention layer to simulate any stride. • Dilation: a multi-head self-attention layer can express any dilated convolution as each head can attend to a value at any pixel shift and form a (dilated) grid pattern. Remark for the 1D case. Convolutional layers acting on sequences are commonly used in the literature for text (Kim, 2014), as well as audio (van den Oord et al., 2016) and time series (Franceschi et al., 2019). Theorem 1 can be straightforwardly extended to show that multi-head self-attention with N_h heads can also simulate a 1D convolutional layer with a kernel of size K = N_h and min(D_h, D_out) output channels using a positional encoding of dimension D_p ≥ 2. Since we have not tested empirically if the preceding construction matches the behavior of 1D self-attention in practice, we cannot claim that it actually learns to convolve an input sequence, only that it has the capacity to do so. PROOF OF MAIN THEOREM The proof follows directly from Lemmas 1 and 2 stated below: Lemma 1. Consider a multi-head self-attention layer consisting of N_h = K^2 heads, D_h ≥ D_out, and let f : [N_h] → Δ_K be a bijective mapping of heads onto shifts. Further, suppose that for every head the following holds: softmax(A^(h)_{q,:})_k = 1 if f(h) = q − k, and 0 otherwise. (10) Then, for any convolutional layer with a K × K kernel and D_out output channels, there exists {W^(h)_val}_{h∈[N_h]} such that MHSA(X) = Conv(X) for every X ∈ R^{W×H×D_in}. [Figure 1 diagram: Multi-Head Self-Attention layer with per-head attention maps for the query pixel, filter matrices and a concatenation step.] Figure 1: Illustration of a Multi-Head Self-Attention layer applied to a tensor image X. Each head h attends to pixel values around shift Δ^(h) and learns a filter matrix W^(h)_val. We show attention maps computed for a query pixel at position q. Proof. 
Our first step will be to rework the expression of the Multi-Head Self-Attention operator from equation (1) and equation (4) such that the effect of the multiple heads becomes more transparent: MHSA(X) = b_out + Σ_{h∈[N_h]} softmax(A^(h)) X W^(h)_val W_out[(h−1)D_h + 1 : hD_h + 1], where the product W^(h)_val W_out[(h−1)D_h + 1 : hD_h + 1] is denoted W^(h). (11) Note that each head's value matrix W^(h)_val ∈ R^{D_in×D_h} and each block of the projection matrix W_out of dimension D_h × D_out are learned. Assuming that D_h ≥ D_out, we can replace each pair of matrices by a learned matrix W^(h) for each head. We consider one output pixel of the multi-head self-attention: MHSA(X)_{q,:} = Σ_{h∈[N_h]} (Σ_k softmax(A^(h)_{q,:})_k X_{k,:}) W^(h) + b_out (12) Due to the conditions of the Lemma, for the h-th attention head the attention probability is one when k = q − f(h) and zero otherwise. The layer's output at pixel q is thus equal to MHSA(X)_q = Σ_{h∈[N_h]} X_{q−f(h),:} W^(h) + b_out (13) For K = √N_h, the above can be seen to be equivalent to a convolutional layer expressed in eq. 17: there is a one-to-one mapping (implied by the map f) between the matrices W^(h) for h ∈ [N_h] and the matrices W_{k_1,k_2,:,:} for all (k_1, k_2) ∈ [K]^2. Remark about D_h and D_out. It is frequent in transformer-based architectures to set D_h = D_out/N_h, hence D_h < D_out. In that case, W^(h) can be seen to be of rank D_out − D_h, which does not suffice to express every convolutional layer with D_out channels. Nevertheless, it can be seen that any D_h out of D_out outputs of MHSA(X) can express the output of any convolutional layer with D_h output channels. To cover both cases, in the statement of the main theorem we assert that the output channels of the convolutional layer should be min(D_h, D_out). In practice, we advise concatenating heads of dimension D_h = D_out instead of splitting the D_out dimensions among heads, to have exact re-parametrization and no “unused” channels. Lemma 2. 
There exists a relative encoding scheme {r_δ ∈ R^{D_p}}_{δ∈Z^2} with D_p ≥ 3 and parameters W_qry, W_key, Ŵ_key, u with D_p ≤ D_k such that, for every Δ ∈ Δ_K there exists some vector v (conditioned on Δ) yielding softmax(A_{q,:})_k = 1 if k − q = Δ and zero otherwise. Proof. We show by construction the existence of a D_p = 3 dimensional relative encoding scheme yielding the required attention probabilities. As the attention probabilities are independent of the input tensor X, we set W_key = W_qry = 0, which leaves only the last term of eq. (8). Setting Ŵ_key ∈ R^{D_k×D_p} to the identity matrix (with appropriate row padding) yields A_{q,k} = v^T r_δ where δ := k − q. Above, we have assumed that D_p ≤ D_k such that no information from r_δ is lost. Now, suppose that we could write: A_{q,k} = −α(‖δ − Δ‖^2 + c) (14) for some constant c. In the above expression, the maximum attention score over A_{q,:} is −αc and it is reached for A_{q,k} with δ = Δ. On the other hand, the α coefficient can be used to scale arbitrarily the difference between A_{q,Δ} and the other attention scores. In this way, for δ = Δ, we have lim_{α→∞} softmax(A_{q,:})_k = lim_{α→∞} e^{−α(‖δ−Δ‖^2 + c)} / Σ_{k'} e^{−α(‖(k'−q)−Δ‖^2 + c)} = lim_{α→∞} e^{−α‖δ−Δ‖^2} / Σ_{k'} e^{−α‖(k'−q)−Δ‖^2} = 1 / (1 + lim_{α→∞} Σ_{k'≠k} e^{−α‖(k'−q)−Δ‖^2}) = 1 and for δ ≠ Δ, the equation becomes lim_{α→∞} softmax(A_{q,:})_k = 0, exactly as needed to satisfy the lemma statement. What remains is to prove that there exist v and {r_δ}_{δ∈Z^2} for which eq. (14) holds. 
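The limit above can also be observed numerically. The sketch below is only an illustration under stated assumptions, not code from the paper: the function name quadratic_attention is hypothetical, the key pixels are restricted to a finite window of shifts around the query (the radius parameter), and the scores are computed as v^T r_δ with the quadratic encoding v = −α(1, −2Δ_1, −2Δ_2) and r_δ = (‖δ‖^2, δ_1, δ_2).

```python
import math

def quadratic_attention(alpha, center, radius=3):
    # Scores A_{q,k} = v^T r_delta for every shift delta in a square window,
    # with v = -alpha * (1, -2*center[0], -2*center[1]) and
    # r_delta = (|delta|^2, delta_1, delta_2).
    v = (-alpha, 2 * alpha * center[0], 2 * alpha * center[1])
    shifts = [(d1, d2)
              for d1 in range(-radius, radius + 1)
              for d2 in range(-radius, radius + 1)]
    scores = [v[0] * (d1 * d1 + d2 * d2) + v[1] * d1 + v[2] * d2
              for d1, d2 in shifts]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {shift: w / total for shift, w in zip(shifts, weights)}

# Probability mass placed on the target shift as alpha grows.
for alpha in (0.1, 1.0, 10.0):
    probs = quadratic_attention(alpha, center=(1, -2))
    print(alpha, probs[(1, -2)])
```

For small α the attention is spread over the window like a wide Gaussian centered at Δ, and as α increases the probability concentrates on the single shift δ = Δ, which is the hard attention required by Lemma 1.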
Expanding the RHS of the equation, we have −α(‖δ − Δ‖^2 + c) = −α(‖δ‖^2 + ‖Δ‖^2 − 2⟨δ, Δ⟩ + c). Now if we set v = −α(1, −2Δ_1, −2Δ_2) and r_δ = (‖δ‖^2, δ_1, δ_2), then A_{q,k} = v^T r_δ = −α(‖δ‖^2 − 2Δ_1 δ_1 − 2Δ_2 δ_2) = −α(‖δ‖^2 − 2⟨δ, Δ⟩) = −α(‖δ − Δ‖^2 − ‖Δ‖^2), which matches eq. (14) with c = −‖Δ‖^2 and the proof is concluded. Remark on the magnitude of α. The exact representation of one pixel requires α (or the matrices W_qry and W_key) to be arbitrarily large, despite the fact that the attention probabilities of all other pixels converge exponentially to 0 as α grows. Nevertheless, practical implementations always rely on finite precision arithmetic for which a constant α suffices to satisfy our construction. For instance, since the smallest positive float32 scalar is approximately 10^{−45}, setting α = 46 would suffice to obtain hard attention. 4 EXPERIMENTS The aim of this section is to validate the applicability of our theoretical results (which state that self-attention can perform convolution) and to examine whether self-attention layers in practice do actually learn to operate like convolutional layers when trained on standard image classification tasks. In particular, we study the relationship between self-attention and convolution with quadratic and learned relative positional encodings. We find that, for both cases, the attention probabilities learned tend to respect the conditions of Lemma 1, supporting our hypothesis. 4.1 IMPLEMENTATION DETAILS We study a fully attentional model consisting of six multi-head self-attention layers. As it has already been shown by Bello et al. 
(2019) that combining attention features with convolutional features improves performance on Cifar-100 and ImageNet, we do not focus on attaining state-of-the-art performance. Nevertheless, to validate that our model learns a meaningful classifier, we compare it to the standard ResNet18 (He et al., 2015) on the CIFAR-10 dataset (Krizhevsky et al.). In all experiments, we use a 2 × 2 invertible down-sampling (Jacobsen et al., 2018) on the input to reduce the size of the image. As the size of the attention coefficient tensors (stored during forward) scales quadratically with the size of the input image, full attention cannot be applied to bigger images. The fixed size representation of the input image is computed as the average pooling of the last layer representations and given to a linear classifier. [Figure 2 plot: test accuracy (0.6 to 1.0) versus epoch (0 to 300) for ResNet18, SA quadratic emb., SA learned emb. and SA learned emb. + content-based att.] Figure 2: Test accuracy on CIFAR-10. Table 1: Test accuracy on CIFAR-10 and model sizes. SA stands for Self-Attention. Columns: model | accuracy | # of params | # of FLOPS. ResNet18 | 0.938 | 11.2M | 1.1B. SA quadratic emb. | 0.938 | 12.1M | 6.2B. SA learned emb. | 0.918 | 12.3M | 6.2B. SA learned emb. + content | 0.871 | 29.5M | 15B. Figure 3: Centers of attention of each attention head (different colors) at layer 4 during the training with quadratic relative positional encoding. The central black square is the query pixel, whereas solid and dotted circles represent the 50% and 90% percentiles of each Gaussian, respectively. We used the PyTorch library (Paszke et al., 2017) and based our implementation on PyTorch Transformers (footnote 5). We release our code on Github (footnote 6) and hyper-parameters are listed in Table 2 (Appendix). Remark on accuracy. 
To verify that our self-attention models perform reasonably well, we dis- play in Figure 6 the evolution of the test accuracy on CIFAR-10 over the 300 epochs of training for our self-attention models against a small ResNet (Table 1). The ResNet is faster to converge, but we cannot ascertain whether this corresponds to an inherent property of the architecture or an artifact of the adopted optimization procedures. Our implementation could be optimized to exploit the locality of Gaussian attention probabilities and reduce signi\ufb01cantly the number of FLOPS. We observed that learned embeddings with content-based attention were harder to train probably due to their increased number of parameters. We believe that the performance gap can be bridged to match the ResNet performance, but this is not the focus of this work. 4.2 Q UADRATIC ENCODING As a \ufb01rst step, we aim to verify that, with the relative position encoding introduced in equation (9), attention layers learn to behave like convolutional layers. We train nine attention heads at each layer to be on par with the 3\u00023kernels used predominantly by the ResNet architecture. The center of attention of each head his initialized to \u0001(h)\u0018N(0;2I2). Figure 3 shows how the initial positions of the heads (different colors) at layer 4 changed during training. We can see that after optimization, the heads attend on speci\ufb01c pixel of the image forming a grid around the query pixel. Our intuition that Self-Attention applied to images learns convolutional \ufb01lters around the queried pixel is con\ufb01rmed. Figure 4 displays all attention head at each layer of the model at the end of the training. It can be seen that in the \ufb01rst few layers the heads tend to focus on local patterns (layers 1 and 2), while deeper layers (layers 3-6) also attend to larger patterns by positioning the center of attention further from the queried pixel position. 
We also include in the Appendix a plot of the attention positions for a higher number of heads ( Nh= 16 ). Figure 14 displays both local patterns similar to CNN and long range dependencies. Interestingly, attention heads do not overlap and seem to take an arrangement maximizing the coverage of the input space. 5github.com/huggingface/pytorch-transformers 6github.com/epfml/attention-cnn \nPublished as a conference paper at ICLR 2020 Figure 4: Centers of attention of each attention head (different colors) for the 6 self-attention layers using quadratic positional encoding. The central black square is the query pixel, whereas solid and dotted circles represent the 50% and 90% percentiles of each Gaussian, respectively. 4.3 L EARNED RELATIVE POSITIONAL ENCODING We move on to study the positional encoding used in practice by fully-attentional models on images. We implemented the 2D relative positional encoding scheme used by (Ramachandran et al., 2019; Bello et al., 2019): we learn a bDp=2cposition encoding vector for each row and each column pixel shift. Hence, the relative positional encoding of a key pixel at position kwith a query pixel at posi- tionqis the concatenation of the row shift embedding \u000e1and the column shift embedding \u000e2(where \u000e=k\u0000q). We chose Dp=Dout= 400 in the experiment. We differ from their (unpublished) implementation in the following points: ( i) we do not use convolution stem and ResNet bottlenecks for downsampling, but only a 2\u00022invertible downsampling layer (Jacobsen et al., 2018) at input, (ii) we useDh=Doutinstead ofDh=Dout=Nhbacked by our theory that the effective number of learned \ufb01lters is min(Dh;Dout). At \ufb01rst, we discard the input data and compute the attention scores solely as the last term of eq. (8). The attention probabilities of each head at each layer are displayed on Figure 5. 
The \ufb01gure con\ufb01rms our hypothesis for the \ufb01rst two layers and partially for the third: even when left to learn the positional encoding scheme from randomly initialized vectors, certain self-attention heads (depicted on the left) learn to attend to individual pixels, closely matching the condition of Lemma 1 and thus Theorem 1. At the same time, other heads pay attention to horizontally-symmetric but non-localized patterns, as well as to long-range pixel inter-dependencies. We move on to a more realistic setting where the attention scores are computed using both positional and content-based attention (i.e., q>k+q>rin (Ramachandran et al., 2019)) which corresponds to a full-blown standalone self-attention model. The attention probabilities of each head at each layer are displayed in Figure 6. We average the attention probabilities over a batch of 100 test images to outline the focus of each head and remove the dependency on the input image. Our hypothesis is con\ufb01rmed for some heads of layer 2 and 3: even when left to learn the encoding from the data, certain self-attention heads only exploit position- based attention to attend to distinct pixels at a \ufb01xed shift from the query pixel reproducing the receptive \ufb01eld of a convolutional kernel. Other heads use more content-based attention (see Figures 8 to 10 in Appendix for non-averaged probabilities) leveraging the advantage of Self-Attention over CNN which does not contradict our theory. In practice, it was shown by Bello et al. (2019) that combining CNN and self-attention features outperforms each taken separately. Our experiments shows that such combination is learned when optimizing an unconstrained fully-attentional model. The similarity between convolution and multi-head self-attention is striking when the query pixel is slid over the image: the localized attention patterns visible in Figure 6 follow the query pixel. 
This characteristic behavior materializes when comparing Figure 6 with the attention probabilities at a different query pixel (see Figure 7 in Appendix). Attention patterns in layers 2 and 3 are not only localized but stand at a constant shift from the query pixel, similarly to convolving the receptive \ufb01eld of a convolutional kernel over an image. This phenomenon is made evident on our interactive website7. This tool is designed to explore different components of attention for diverse images with or without content-based attention. We believe that it is a useful instrument to further understand how MHSA learns to process images. 7epfml.github.io/attention-cnn \nPublished as a conference paper at ICLR 2020 Figure 5: Attention probabilities of each head ( column ) at each layer ( row) using learned relative positional encoding without content-based attention. The central black square is the query pixel. We reordered the heads for visualization and zoomed on the 7x7 pixels around the query pixel. layer 1  layer 2  layer 3  layer 4  layer 5  layer 6 Figure 6: Attention probabilities for a model with 6 layers ( rows ) and 9 heads ( columns ) using learned relative positional encoding and content-content based attention. Attention maps are aver- aged over 100 test images to display head behavior and remove the dependence on the input content. The black square is the query pixel. More examples are presented in Appendix A. \nPublished as a conference paper at ICLR 2020 5 R ELATED WORK In this section, we review the known differences and similarities between CNNs and transformers. The use of CNN networks for text\u2014at word level (Gehring et al., 2017) or character level (Kim, 2014)\u2014is more seldom than transformers (or RNN). Transformers and convolutional models have been extensively compared empirically on tasks of Natural Language Processing and Neural Ma- chine Translation. 
It was observed that transformers have a competitive advantage over convolu- tional model applied to text (Vaswani et al., 2017). It is only recently that Bello et al. (2019); Ramachandran et al. (2019) used transformers on images and showed that they achieve similar ac- curacy as ResNets. However, their comparison only covers performance and number of parameters and FLOPS but not expressive power. Beyond performance and computational-cost comparisons of transformers and CNN, the study of expressiveness of these architectures has focused on their ability to capture long-term dependencies (Dai et al., 2019). Another interesting line of research has demonstrated that transformers are Turing- complete (Dehghani et al., 2018; P \u00b4erez et al., 2019), which is an important theoretical result but is not informative for practitioners. To the best of our knowledge, we are the \ufb01rst to show that the class of functions expressed by a layer of self-attention encloses all convolutional \ufb01lters. The closest work in bridging the gap between attention and convolution is due to Andreoli (2019). They cast attention and convolution into a uni\ufb01ed framework leveraging tensor outer- product. In this framework, the receptive \ufb01eld of a convolution is represented by a \u201cbasis\u201d tensor A2RK\u0002K\u0002H\u0002W\u0002H\u0002W. For instance, the receptive \ufb01eld of a classical K\u0002Kconvolutional kernel would be encoded by A\u0001;q;k= 1fk\u0000q=\u0001gfor\u00012\u0001 \u0001K. The author distinguishes thisindex-based convolution with content-based convolution where Ais computed from the value of the input, e.g., using a key/query dot-product attention. Our work moves further and presents suf\ufb01cient conditions for relative positional encoding injected into the input content (as done in prac- tice) to allow content-based convolution to express any index-based convolution. We further show experimentally that such behavior is learned in practice. 
6 CONCLUSION We showed that self-attention layers applied to images can express any convolutional layer (given sufficiently many heads) and that fully-attentional models learn to combine local behavior (similar to convolution) and global attention based on input content. More generally, fully-attentional models seem to learn a generalization of CNNs where the kernel pattern is learned at the same time as the filters\u2014similar to deformable convolutions (Dai et al., 2017; Zampieri, 2019). Interesting directions for future work include translating existing insights from the rich CNN literature back to transformers on various data modalities, including images, text and time series. ACKNOWLEDGMENTS Jean-Baptiste Cordonnier is thankful to the Swiss Data Science Center (SDSC) for funding this work. Andreas Loukas was supported by the Swiss National Science Foundation (project \u201cDeep Learning for Graph Structured Data\u201d, grant number PZ00P2 179981). \nREFERENCES Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org. Jean-Marc Andreoli. Convolution, attention and structure embedding. NeurIPS 2019 workshop on Graph Representation Learning, Dec 13, 2019, Vancouver, BC, Canada, 2019.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings , 2015. Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V . Le. Attention Augmented Convolutional Networks. arXiv:1904.09925 [cs] , April 2019. Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. CoRR , abs/1703.06211, 2017. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V . Le, and Ruslan Salakhut- dinov. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. CoRR , abs/1901.02860, 2019. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. CoRR , abs/1807.03819, 2018. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR , abs/1810.04805, 2018. Jean-Yves Franceschi, Aymeric Dieuleveut, and Martin Jaggi. Unsupervised scalable representation learning for multivariate time series. In NeurIPS 2019 , 2019. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. CoRR , abs/1705.03122, 2017. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog- nition. CoRR , abs/1512.03385, 2015. Sepp Hochreiter and J \u00a8urgen Schmidhuber. Long short-term memory. Neural Computation , 9(8): 1735\u20131780, 1997. Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 , pp. 7132\u20137141, 2018. J\u00a8orn-Henrik Jacobsen, Arnold W.M. Smeulders, and Edouard Oyallon. i-revnet: Deep invertible networks. 
In International Conference on Learning Representations , 2018. Yoon Kim. Convolutional neural networks for sentence classi\ufb01cation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pp. 1746\u2013 1751, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/ D14-1181. Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced re- search). Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop , 2017. \nPublished as a conference paper at ICLR 2020 Jorge P \u00b4erez, Javier Marinkovic, and Pablo Barcel \u00b4o. On the turing completeness of modern neural network architectures. CoRR , abs/1901.03429, 2019. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2018. Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. CoRR , abs/1906.05909, 2019. A\u00a8aron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 , 2016. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR , abs/1706.03762, 2017. Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 , pp. 7794\u20137803, 2018. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V . Le. 
Xlnet: Generalized autoregressive pretraining for language understanding. CoRR , abs/1906.08237, 2019. Luca Zampieri. Geometric deep learning for volumetric computational \ufb02uid dynamics. pp. 67, 2019. \nPublished as a conference paper at ICLR 2020 APPENDIX A M ORE EXAMPLES WITH CONTENT -BASED ATTENTION We present more examples of attention probabilities computed by self-attention model. Figure 7 shows average attention at a different query pixel than Figure 6. Figures 8 to 10 display attention for single images. layer 1  layer 2  layer 3  layer 4  layer 5  layer 6 Figure 7: Attention probabilities for a model with 6 layers ( rows ) and 9 heads ( columns ) using learned relative positional encoding and content-content attention. We present the average of 100 test images. The black square is the query pixel. original  layer 1  layer 2  layer 3  layer 4  layer 5  layer 6 Figure 8: Attention probabilities for a model with 6 layers ( rows ) and 9 heads ( columns ) using learned relative positional encoding and content-content based attention. The query pixel (black square) is on the frog head. \nPublished as a conference paper at ICLR 2020 original  layer 1  layer 2  layer 3  layer 4  layer 5  layer 6 Figure 9: Attention probabilities for a model with 6 layers ( rows ) and 9 heads ( columns ) using learned relative positional encoding and content-content based attention. The query pixel (black square) is on the horse head. original  layer 1  layer 2  layer 3  layer 4  layer 5  layer 6 Figure 10: Attention probabilities for a model with 6 layers ( rows ) and 9 heads ( columns ) using learned relative positional encoding and content-content based attention. The query pixel (black square) is on the building in the background. 
\nB HYPER-PARAMETERS USED IN OUR EXPERIMENTS Table 2: Self-attention network parameters. number of layers: 6; number of heads: 9; hidden dimension: 400; intermediate dimension: 512; invertible pooling width: 2; dropout probability: 0.1; layer normalization epsilon: 10^{−12}; number of epochs: 300; batch size: 100; learning rate: 0.1; weight decay: 0.0001; momentum: 0.9; cosine decay: yes; linear warm up ratio: 0.05. C POSITIONAL ENCODING REFERENCES Columns: model; type of positional encoding (sinusoids, learned, quadratic); relative. Vaswani et al. (2017): X; Radford et al. (2018): X; Devlin et al. (2018): X; Dai et al. (2019): X X; Yang et al. (2019): X X; Bello et al. (2019): X X; Ramachandran et al. (2019): X X; Our work: X X X. Table 3: Types of positional encoding used by transformer models applied to text (top) and images (bottom). When multiple encoding types have been tried, we report the one advised by the authors. D GENERALIZED LEMMA 1 We present a generalization of Lemma 1 that replaces the necessity of hard attention (to single pixels) by a milder assumption: the attention probabilities should span the grid receptive field. The conditions of this Lemma are still satisfied by Lemma 2, hence Theorem 1 follows. Lemma 3. Consider a multi-head self-attention layer consisting of N_h ≥ K^2 heads, D_h ≥ D_out, and let ω: [H]×[W] → [HW] be a pixel indexing. Then, for any convolutional layer with a K×K kernel and D_out output channels, there exist {W_val^(h)}_{h∈[N_h]} and W_out such that MHSA(X) = Conv(X) for every X ∈ R^{W×H×D_in} if and only if, for all q ∈ [H]×[W], span({e_{ω(q+Δ)} ∈ R^{HW} : Δ ∈ ΔΔ_K}) ⊆ span({vect(softmax(A^(h)_{q,:})) : h ∈ [N_h]}). (Footnote 8: the vectorization operator vect(·) flattens a matrix into a vector.) \nFigure 11: Factorization of the vectorized weight matrices V^conv_q and V^SA_q used to compute the output at position q for an input image of dimension H×W.
On the left: a convolution of kernel 2×2, on the right: a self-attention with N_h = 5 heads. D_in = 2, D_out = 3 in both cases. Proof. Our first step will be to rework the expression of the Multi-Head Self-Attention operator from equation (1) and equation (4) such that the effect of the multiple heads becomes more transparent: MHSA(X) = b_out + Σ_{h∈[N_h]} softmax(A^(h)) X W_val^(h) W_out[(h−1)D_h + 1 : hD_h + 1], (15) where the product W_val^(h) W_out[(h−1)D_h + 1 : hD_h + 1] is denoted W^(h). Note that each head\u2019s value matrix W_val^(h) ∈ R^{D_in×D_h} and each block of the projection matrix W_out of dimension D_h×D_out are learned. Assuming that D_h ≥ D_out, we can replace each pair of matrices by a learned matrix W^(h) for each head. We consider one output pixel of the multi-head self-attention and drop the bias term for simplicity: MHSA(X)_{q,:} = Σ_{h∈[N_h]} (Σ_k a^(h)_{q,k} X_{k,:}) W^(h) = Σ_k X_{k,:} (Σ_{h∈[N_h]} a^(h)_{q,k} W^(h)), (16) where the inner sum Σ_{h∈[N_h]} a^(h)_{q,k} W^(h) is denoted W^SA_{q,k} ∈ R^{D_in×D_out} and a^(h)_{q,k} = softmax(A^(h)_{q,:})_k. We rewrite the output of a convolution at pixel q in the same manner: Conv(X)_{q,:} = Σ_{Δ∈ΔΔ_K} X_{q+Δ,:} W_{Δ,:,:} = Σ_{k∈[H]×[W]} X_{k,:} 1{k − q ∈ ΔΔ_K} W_{k−q,:,:}, (17) where 1{k − q ∈ ΔΔ_K} W_{k−q,:,:} is denoted W^conv_{q,k} ∈ R^{D_in×D_out}. Equality between equations (16) and (17) holds for any input X if and only if the linear transformations for each pair of key/query pixels are equal, i.e. W^conv_{q,k} = W^SA_{q,k} for all q, k. We vectorize the weight matrices into matrices of dimension D_in D_out × HW as V^conv_q := [vect(W^conv_{q,k})]_{k∈[H]×[W]} and V^SA_q := [vect(W^SA_{q,k})]_{k∈[H]×[W]}. Hence, to show that Conv(X) = MHSA(X) for all X, we must show that V^conv_q = V^SA_q for all q. The matrix V^conv_q has a restricted support: only the columns associated with a pixel shift Δ ∈ ΔΔ_K in the receptive field of pixel q can be non-zero. This leads to the factorization V^conv_q = W^conv E_q displayed in Figure 11, where W^conv ∈ R^{D_in D_out × K^2} and E_q ∈ R^{K^2 × HW}. Given an ordering of the shifts Δ ∈ ΔΔ_K indexed by j, set (W^conv)_{:,j} = vect(W_{Δ,:,:}) and (E_q)_{j,:} = e_{ω(q+Δ)}.
On the other hand, we decompose V^SA_q = W^SA A_q with (W^SA)_{:,h} = vect(W^(h)) and (A_q)_{h,i} = a^(h)_{q,ω(i)}. The proof is concluded by showing that row(E_q) ⊆ row(A_q) is a necessary and sufficient condition for the existence of a W^SA such that any V^conv_q = W^conv E_q can be written as W^SA A_q. Sufficient. Given that row(E_q) ⊆ row(A_q), there exists Φ ∈ R^{K^2 × N_h} such that E_q = Φ A_q, and a valid decomposition is W^SA = W^conv Φ, which gives W^SA A_q = V^conv_q. Necessary. Assume there exists x ∈ R^{HW} such that x ∈ row(E_q) and x ∉ row(A_q), and set x^T to be a row of V^conv_q. Then, W^SA A_q ≠ V^conv_q for any W^SA and there is no possible decomposition. \nE GENERALIZED QUADRATIC POSITIONAL ENCODING We noticed the similarity of the attention probabilities in the quadratic positional encoding (Section 3) to isotropic bivariate Gaussian distributions with bounded support: softmax(A_{q,:})_k = exp(−α ‖(k − q) − Δ‖^2) / Σ_{k'∈[W]×[H]} exp(−α ‖(k' − q) − Δ‖^2). (18) Building on this observation, we further extended our attention mechanism to non-isotropic Gaussian distributions over pixel positions. Each head is parametrized by a center of attention Δ and a covariance matrix Σ to obtain the following attention scores, A_{q,k} = −(1/2) (δ − Δ)^T Σ^{−1} (δ − Δ) = −(1/2) δ^T Σ^{−1} δ + δ^T Σ^{−1} Δ − (1/2) Δ^T Σ^{−1} Δ, (19) where, once more, δ = k − q. The last term can be discarded because the softmax is shift invariant, and we rewrite the attention coefficient as a dot product between the head target vector v and the relative position encoding r_δ (consisting of the first and second order combinations of the shift in pixels δ): v = (1/2) (2(Σ^{−1}Δ)_1, 2(Σ^{−1}Δ)_2, −(Σ^{−1})_{1,1}, −(Σ^{−1})_{2,2}, −2(Σ^{−1})_{1,2})^T and r_δ = (δ_1, δ_2, δ_1^2, δ_2^2, δ_1 δ_2)^T. Evaluation.
We trained our model using this generalized quadratic relative position encoding. We were curious to see whether, using the above encoding, the self-attention model would learn to attend to non-isotropic groups of pixels\u2014thus forming patterns unseen in CNNs. Each head was parametrized by Δ ∈ R^2 and Σ^{−1/2} ∈ R^{2×2} to ensure that the covariance matrix remained positive semi-definite. We initialized the center of attention to Δ^(h) ∼ N(0, 2 I_2) and Σ^{−1/2} = I_2 + N(0, 0.01 I_2) so that initial attention probabilities were close to an isotropic Gaussian. Figure 12 shows that the network did learn non-isotropic attention probability patterns, especially in high layers. Nevertheless, the fact that we do not obtain any performance improvement seems to suggest that attention non-isotropy is not particularly helpful in practice\u2014the quadratic positional encoding suffices. Figure 12: Centers of attention of each attention head (different colors) for the 6 self-attention layers using non-isotropic Gaussian parametrization. The central black square is the query pixel, whereas solid and dotted circles represent the 50% and 90% percentiles of each Gaussian, respectively. Pruning degenerated heads. Some non-isotropic attention heads attend to \u201cnon-intuitive\u201d patches of pixels: either attending to a very thin stripe of pixels, when Σ^{−1} was almost singular, or attending to all pixels uniformly, when Σ^{−1} was close to 0 (i.e. constant attention scores). We asked ourselves whether such attention patterns are indeed useful for the model or whether these heads are degenerated and unused. To find out, we pruned all heads having largest eigenvalues smaller than 10^{−5} or condition number (ratio of the biggest and smallest eigenvalues) greater than 10^5. Specifically, in our model with 6 layers and 9 heads each, we pruned [2, 4, 1, 2, 6, 0] heads from the first to the last layer. This means that these layers cannot express a 3×3 kernel anymore.
As shown in yellow on fig. 2, this ablation initially hurts performance slightly, probably due to off biases, but after a few epochs of continued training with a smaller learning rate (divided by 10) the accuracy recovers its unpruned value. Hence, without sacrificing performance, we reduce the number of parameters and the number of FLOPS by a fourth. F INCREASING THE NUMBER OF HEADS For completeness, we also tested increasing the number of heads of our architecture from 9 to 16. \n[Plot: test accuracy (0.6 to 1.0) versus epoch (0 to 400) for ResNet18, SA quadratic emb., SA quadratic emb. gen., SA quadratic emb. gen. pruned, SA learned emb., and SA learned emb. + content-based att.] Figure 13: Evolution of test accuracy on CIFAR-10. Pruned model (yellow) is continued training of the non-isotropic model (orange). Table 4: Number of parameters and accuracy on CIFAR-10 per model. SA stands for Self-Attention. Columns: model, accuracy, # of params, # of FLOPS. ResNet18: 0.938, 11.2M, 1.1B; SA quadratic emb.: 0.938, 12.1M, 6.2B; SA quadratic emb. gen.: 0.934, 12.1M, 6.2B; SA quadratic emb. gen. pruned: 0.934, 9.7M, 4.9B; SA learned emb.: 0.918, 12.3M, 6.2B; SA learned emb. + content: 0.871, 29.5M, 15B. Figure 14: Centers of attention for 16 attention heads (different colors) for the 6 self-attention layers using quadratic positional encoding. The central black square is the query pixel, whereas solid and dotted circles represent the 50% and 90% percentiles of each Gaussian, respectively. Similar to Figure 4, we see that the network distinguishes two main types of attention patterns. Localized heads (i.e., those that attend to nearly individual pixels) appear more frequently in the first few layers. The self-attention layer uses these heads to act in a manner similar to how convolutional layers do. Heads with less-localized attention become more common at higher layers."
},
{
    "title": "Vid2Game: Controllable Characters Extracted from Real-World Videos",
    "pdf_link": "https://openreview.net/pdf?id=SkxBUpEKwH",
    "abstract": "We extract a controllable model from a video of a person performing a certain activity. The model generates novel image sequences of that person, according to user-de\ufb01ned control signals, typically marking the displacement of the moving body. The generated video can have an arbitrary background, and effectively capture both the dynamics and appearance of the person. The method is based on two networks. The \ufb01rst maps a current pose, and a single- instance control signal to the next pose. The second maps the current pose, the new pose, and a given background, to an output frame. Both networks include multiple novelties that enable high-quality performance. This is demonstrated on multiple characters extracted from various videos of dancers and athletes.",
    "paper_text": "Published as a conference paper at ICLR 2020 VID2GAME : C ONTROLLABLE CHARACTERS EX- TRACTED FROM REAL-WORLD VIDEOS Oran Gafni Facebook AI Research oran@fb.comLior Wolf Facebook AI Research & Tel Aviv Uni. wolf@fb.comYaniv Taigman Facebook AI Research yaniv@fb.com ABSTRACT We extract a controllable model from a video of a person performing a certain activity. The model generates novel image sequences of that person, according to user-de\ufb01ned control signals, typically marking the displacement of the moving body. The generated video can have an arbitrary background, and effectively capture both the dynamics and appearance of the person. The method is based on two networks. The \ufb01rst maps a current pose, and a single- instance control signal to the next pose. The second maps the current pose, the new pose, and a given background, to an output frame. Both networks include multiple novelties that enable high-quality performance. This is demonstrated on multiple characters extracted from various videos of dancers and athletes. 1 I NTRODUCTION We propose a new video generation tool that is able to extract a character from a video, reanimate it, and generate a novel video of the modi\ufb01ed scene, see Fig. 1. Unlike previous work, the reanimation is controlled by a low-dimensional signal, such as the one provided by a joystick, and the model has to complete this signal to a high-dimensional full-body signal, in order to generate realistic motion sequences. In addition, our method is general enough to position the extracted character in a new background, which is possibly also dynamic. A video containing a short explanation of our method, samples of output videos, and a comparison to previous work, is provided in https: //youtu.be/sNp6HskavBE . Our work provides a general and convenient way for human users to control the dynamic development of a given video. The input is a video, which contains one or more characters. 
The characters are extracted, and each is associated with a sequence of displacements. In the current implementation, the motion is taken as the trajectory of the center of mass of that character in the frame. This can be readily generalized to separate different motion elements. Given a user-de\ufb01ned trajectory, a realistic video of the character, placed in front of an arbitrary background, is generated. The method employs two networks, applied in a sequential manner. The \ufb01rst is the Pose2Pose (P2P) network, responsible for manipulating a given pose in an autoregressive manner, based on an input stream of control signals. The second is the Pose2Frame (P2F) network, accountable for generating a high-resolution realistic video frame, given an input pose and a background image. Each network addresses a computational problem not previously fully met, together paving the way for the generation of video games with realistic graphics. The Pose2Pose network enables guided human-pose generation for a speci\ufb01c trained domain (e.g., a tennis player, a dancer, etc.), where guiding takes the form of 2D motion controls, while the Pose2Frame network allows the incorporation of a photo-realistic generated character into a desired environment. In order to enable this, the following challenges are to be addressed: (1) replacing the background requires the system to separate the character from the surroundings, which is not handled by previous work, since they either embed the character into the same learned background, or paste the generated character into the background with noticeable artifacts, (2) the separation is not binary, and some effects, such as shadows, blend the character\u2019s motion effect with that background information, (3) the control signal is arbitrary, and can lead the character to poses that are not covered by the training set, and (4) generated sequences may easily drift, by accumulating small errors over time. 
\nPublished as a conference paper at ICLR 2020 Figure 1: Our method extracts a character from an uncontrolled video, and enables us to control its motion. The pose of the character, shown in the \ufb01rst row, is created by our Pose2Pose network in an autoregressive way, so that the motion matches the control signal illustrated by the joystick. The second row depicts the character\u2019s appearance, as generated by the Pose2Frame network, which also generates the masks shown in the third row. The \ufb01nal frame (last row) blends a given background and the generated frames, in accordance with these masks. (a) (b) (c) (d) Figure 2: Comparison with Esser et al. (2018b). (a) Their input, (b) their output, (c) a frame from our training video, (d) our generated frame. With different objectives and dataset types, a direct comparison is not applicable. Qualitatively, Esser et al. (2018b) output a low-res image with notice- able artifacts, and cannot model the racket, while ours is indistinguishable from the source. (a) (b) (c) (d) Figure 3: Comparison with Esser et al. (2018a). (a) Their input, (b) their generated output, (c) our pose input, (d) the output generated by our P2F network. In contrast to our method, Esser et al. (2018a) do not render environmental effects, resulting in unnatural blending of the character, undesired residues (e.g. source clothing), and works in lower resolution. Both the Pose2Pose and Pose2Frame networks adopt the pix2pixHD framework of Wang et al. (2018b) as the generator and discriminator backbones, yet add many contributions in order to address the aforementioned challenges. As a building block, we use the pose representation provided by the DensePose framework by R \u02dciza Alp G\u00fcler (2018), unmodi\ufb01ed. Similarly, the hand-held object is extracted using the semantic segmentation method of Zhou et al. (2019), which incorporates elements from Maninis et al. (2018); Law & Deng (2018). 
In addition to the main application of generating a realistic video from a 2D trajectory, the learned Pose2Frame network can be used for other applications. For example, instead of predicting the pose, it can be extracted from an existing video. This allows us to compare the Pose2Frame network directly with recent video-to-video solutions. 2 RELATED WORK Novel view synthesis is a task where unseen frames, camera views, or poses, are synthesized given a prior image. Recent approaches have also shown success in generating detailed images of human subjects in different poses (Balakrishnan et al., 2018; Kanazawa et al., 2018), where some of them also condition on pose (Chan et al., 2018; Yang et al., 2018; Li et al., 2019) to guide the generation. These approaches do not build a movable character model, but transfer one image to target poses. The pose variability in these images is smaller than required for our application, the handling of the background is limited, and these were also not demonstrated on video. For example, much of the literature presents results on a fashion dataset, in which the poses are limited and a white background is used. Another common benchmark is gathered from surveillance cameras, where the resolution is low, and background generation is lacking due to an inherent lack of supervision. A method for learning motion patterns by analyzing YouTube videos is demonstrated by Peng et al. (2018), where synthetic virtual characters are set to perform complex skills in physically simulated environments, leveraging a data-driven Reinforcement Learning method that utilizes a reference motion. This method outputs a control policy that enables the character to reproduce a particular skill observed in video, which the rendered character then imitates. Unlike our method, the control signal is not provided online, one frame at a time.
In addition, rendering is performed using simulated characters only, and the character in the video is not reanimated. Autoregressive models, which can be controlled one step at a time, are suitable for the dynamic nature of video games. However, such models, including RNNs, can easily drift with long range sequences (Fragkiadaki et al., 2015), and training RNN models for long sequences suffers from vanishing or exploding gradients. Holden et al. (2017) propose a more stable model by generating the weights of a regression network at each frame as a function of the motion phase. However, this is mostly practical to apply given a limited number of keypoints, whereas dense pose models contain more information. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and conditional GANs (Mirza & Osindero, 2014), have been used for video synthesis by V ondrick et al. (2016) who separately generates the static background and the foreground motion. Frameworks such as vid2vid (Wang et al., 2018a; Chan et al., 2018) learn mappings between different videos, and demonstrate motion transfer between faces, and from poses to body. In these contributions, the reference pose is extracted from a real frame, and the methods are not challenged with generated poses. Working with generated poses, with the accompanying artifacts and the accumulated error, is considerably more challenging. In order to address this, we incorporate a few modi\ufb01cations, such as relying on a second input pose, in case one of the input poses is of lesser quality, and add additional loss terms to increase the realism of the generated image. In addition, these approaches model the entire frame, including both the character and the background, which usually leads to blurry results (Pumarola et al., 2018; Chao et al., 2018), particularly near the edges of the generated pose, and with complex objects, such as faces. 
It also leads to a loss of details from the background, and to unnatural background motion. A method for mixing the appearance of a \ufb01gure seen in an image with an arbitrary pose is presented by Esser et al. (2018b). While it differs greatly in the performed task, we can compare the richness of the generated images, as shown in Fig. 2. Their method results in a low-resolution output with noticeable artifacts, and cannot model the object, while our result is indistinguishable from the source. The same is true for the follow-up work (Esser et al., 2018a). We work at a higher resolution of 1024p, while their work is limited to low-resolution characters, see Fig. 3. Similarly, the work of Balakrishnan et al. (2018) provides lower resolution outputs, limited to the same background, and does not handle shadows (as seen in Fig. 9-10 of that work). In another set of experiments, Esser et al. (2018a) also present a step toward our task and show results for generating a controllable \ufb01gure, building upon the phase-based neural network of Holden et al. (2017). Their work is keypoint based and does not model environmental factors, such as shadows. The videos presented by Esser et al. (2018a) for a controllable \ufb01gure are displayed only on a synthetic background with a checkerboard \ufb02oor pattern in an otherwise empty scene. These examples are limited to either walking or running, and the motion patterns are of an existing animation model. 3 M ETHOD OVERVIEW The method\u2019s objective is to learn the character\u2019s motion from a video sequence, such that new videos of that character can be rendered, based on a user-provided motion sequence. The input of the training procedure is a video sequence of a character performing an action. From this video, the pose and an approximated foreground mask are extracted by the DensePose network, augmented by the semantic \nPublished as a conference paper at ICLR 2020 Figure 4: The architecture of the Pose2Pose generator. 
During training, the middle n_r − 2 residual blocks are conditioned by a linear projection (FC layer) of the center-mass differences between consecutive frames (in the x and y axes). For each concatenation of input pose and object [p_{i−1}, obj_{i−1}], the network generates the next consecutive pose and object [p_i, obj_i]. At inference time, the network generates the next pose-object pair in an autoregressive manner, conditioned on input directions. segmentation of the hand-held object, for each frame. The trajectory of the center of mass is taken to be the control sequence. At test time, the user provides a sequence of 2D displacements, and a video is created, in which the character moves in accordance with this control sequence. The background can be arbitrary, and is also selected by the user. The method then predicts the sequence of poses based on the given control sequence (starting with an arbitrary pose), and synthesizes a video in which the character extracted from the training video is rendered in the given background. The following notation is used: a video sequence with frames f_i is generated, based on a sequence of poses p_i and a sequence of background images b_i, where i = 1, 2, ... is the frame index. The frame generation process also involves a sequence of spatial masks m_i that determine which regions of the background are replaced by synthesized image information z_i. To generate a video, the user provides the pose at time zero, p_0, the sequence of background images b_i (which can be static, i.e., b_i = b for all i) and a sequence of control signals s_i. In our experiments, the control signal typically comprises the desired 2D displacement of the animated character. Our method is an autoregressive pose model, coupled with a frame-rendering mechanism. The first aspect of our method creates a sequence of poses, and optionally of hand-held objects.
Each pose and object pair [p_i, obj_i] is dependent on the previous pair [p_{i−1}, obj_{i−1}], as well as on the current control signal s_i. The second aspect generates the current frame f_i, based on the current background image b_i, the previous combined pose and object p_{i−1} + obj_{i−1}, and the current combined pose and object p_i + obj_i. The pose and object are combined by simply summing the object channel with each of the three RGB channels that encode the pose. This rendering process includes the generation of both a raw image output z_i and a blending mask m_i. The mask m_i has values between 0 and 1, with 1 − m_i denoting the inverted mask. Formally, the high-level processing is given by the following three equations: [p_i, obj_i] = P2P([p_{i−1}, obj_{i−1}], s_i) (1), (z_i, m_i) = P2F([p_{i−1} + obj_{i−1}, p_i + obj_i]) (2), f_i = z_i ⊙ m_i + b_i ⊙ (1 − m_i) (3), where P2P and P2F are the Pose2Pose and the Pose2Frame networks. As stated, P2F returns a pair of outputs that are then linearly blended with the desired background, using the per-pixel multiplication operator ⊙. 4 THE POSE2POSE NETWORK As mentioned, the P2P network is an evolution of the pix2pixHD architecture. Although the primary use of the pix2pixHD framework in the literature is for unconditioned image-to-image translation, we show how to modify it to enable conditioning on a control signal. The P2P network generates a scaled-down frame (512 pixels wide), allowing the network to focus on pose representation, rather than high-resolution image generation. Generation of a high-res output is deferred to the P2F network. This enables us to train the P2P network much more effectively, resulting in a stable training process that generates natural dynamics, and leads to significantly reduced inference time (post-training). The generator's architecture is illustrated in Fig. 4. 
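As a rough illustration of Eqs. 1-3, the following Python sketch runs the autoregressive loop with toy stand-ins for the two networks. The functions toy_p2p and toy_p2f, the point-shaped pose, and the single-channel nested-list images are assumptions made for this example only and are not part of the method; only the control flow and the blend of Eq. 3 mirror the text.

```python
# Toy sketch of Eqs. 1-3; toy_p2p and toy_p2f are hypothetical stand-ins
# for the trained Pose2Pose and Pose2Frame networks.

def blend(z, m, b):
    # Eq. 3: f = z * m + b * (1 - m), evaluated per pixel.
    return [[zv * mv + bv * (1.0 - mv) for zv, mv, bv in zip(zr, mr, br)]
            for zr, mr, br in zip(z, m, b)]

def toy_p2p(pose, s):
    # Stand-in for Eq. 1: the pose is a single (x, y) point moved by s.
    return (pose[0] + s[0], pose[1] + s[1])

def toy_p2f(prev_pose, pose, height, width):
    # Stand-in for Eq. 2: a raw image z and a mask m that are both 1
    # at the current pose and 0 elsewhere (prev_pose is unused here).
    z = [[0.0] * width for _ in range(height)]
    m = [[0.0] * width for _ in range(height)]
    x, y = pose
    if 0 <= y < height and 0 <= x < width:
        z[y][x] = 1.0
        m[y][x] = 1.0
    return z, m

def generate(p0, controls, backgrounds):
    # Autoregressive rollout: each pose depends on the previous pose
    # and on the current control signal.
    frames, pose = [], p0
    for s, b in zip(controls, backgrounds):
        prev, pose = pose, toy_p2p(pose, s)
        z, m = toy_p2f(prev, pose, len(b), len(b[0]))
        frames.append(blend(z, m, b))
    return frames

background = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
frames = generate((0, 0), [(1, 0), (1, 1)], [background, background])
```

In this toy setting the background shows through wherever the mask is 0, and the generated pixel replaces it wherever the mask is 1; a learned mask would take intermediate values, for example in shadow regions.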
The encoder is composed of a convolutional layer, followed by convolutions with batch normalization (Ioffe & Szegedy, 2015) and ReLU (Nair & Hinton, 2010) activations. The latent space combines a sequence of n_r residual blocks. The decoder is composed of fractional-strided convolutions with instance normalization (Ulyanov et al., 2016) and ReLU activations, followed by a single convolution terminated by a Tanh activation for the generated frame output. Recall that the P2P network also receives the control signal as a second input (Eq. 1). In our experiment, the control signal is a vector of dimension n_d = 2 representing displacements along the x and y axes. This signal is incorporated into the network by conditioning the center n_r − 2 blocks of the latent space. The conditioning takes place by adding, to the activations of each residual block, a similarly sized tensor that is obtained by linearly projecting the 2D control vector s_i. Modified conditional block. Rather than applying a conditioning block based on a traditional ResNet block, we apply a modified one that does not allow for a complete bypass of the convolutional layers. This form of conditioning increases the motion naturalness, as seen in our ablation study. The specific details are as follows. The P2P network contains a down-sampling encoder e, a latent space transformation network r, and an up-sampling decoder u. The r network is conditioned on the control signal s, and contains n_r blocks of two types: vanilla residual blocks (v), and conditioned blocks (w). P2P(p, s) = u(r(e(p), s)) (4), r = v ∘ w ∘ w ∘ ... ∘ w ∘ v, with w repeated n_r − 2 times (5). The architecture and implementation details of the P2P network can be found in appendix A. Briefly, let x denote the activations of the previous layer, and f_1(x), f_2(x) be two consecutive convolutional layers. 
Let s be a 2D displacement vector, and g a fully-connected layer with a number of output neurons that equals the product of the dimensions of the tensor x. The two block types take the form: v(x) = f_2(f_1(x)) + x (6), w(x, s) = f_2(f_1(x) + g(s)) + f_1(x) + g(s) (7). 4.1 TRAINING THE POSE PREDICTION NETWORK Following Wang et al. (2018b), we employ two discriminators (low-res and high-res), indexed by k = 1, 2. During training, the LSGAN (Mao et al., 2017) loss is applied to the generator and discriminator. An L1 feature-matching loss is applied over the discriminators' activations, and over those of a trained VGG (Simonyan & Zisserman, 2014b) network. The loss applied to the generator can then be formulated as: L_{P2P} = Σ_{k=1}^{2} (L_{LS}^k + λ_D L_{FM_D}^k) + λ_{VGG} L_{FM_VGG} (8), where the networks are trained with λ_D = λ_{VGG} = 10. The LSGAN generator loss is (the obj_i elements are omitted for brevity): L_{LS}^k = E_{(p_{i−1}, s_i)} [(D_k(p_{i−1}, P2P(p_{i−1}, s_i)) − 1)^2] (9). The expectation is computed per mini-batch, over the input pose p_{i−1} and the associated s_i. The discriminator feature-matching loss compares the real pose with the generated pose, using the activations of the discriminator, and is calculated as: L_{FM_D}^k = E_{(p_{i−1}, p_i)} Σ_{j=1}^{M} (1/N_j) ||D_k^{(j)}(p_{i−1}, p_i) − D_k^{(j)}(p_{i−1}, P2P(p_{i−1}, s_i))||_1 (10), with M being the number of layers, N_j the number of elements in each layer, p_{i−1} the input (previous) pose, p_i the current (real) pose, P2P(p_{i−1}, s_i) the estimated pose, and D_k^{(j)} the activations of discriminator k in layer j. The VGG feature-matching loss is calculated similarly, acting as a perceptual loss over a trained VGG classifier: L_{FM_VGG} = Σ_{j=1}^{M} (1/N'_j) ||VGG^{(j)}(p_i) − VGG^{(j)}(P2P(p_{i−1}, s_i))||_1 (11), with N'_j being the number of elements in the j-th layer, and VGG^{(j)} the VGG classifier activations at the j-th layer. 
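Returning to the two block types of Eqs. 6-7, the following Python sketch contrasts them on flat lists of activations. The layers f1 and f2 are passed in as plain functions, and make_g builds a linear projection of the 2D control vector; all of these names are hypothetical helpers introduced for illustration, standing in for the convolutional and fully-connected layers of the actual network.

```python
# Sketch of the vanilla block v (Eq. 6) and the conditioned block w (Eq. 7)
# on flat lists; f1, f2 and g are placeholders for learned layers.

def v_block(x, f1, f2):
    # Eq. 6: v(x) = f2(f1(x)) + x, with an identity skip path.
    return [a + b for a, b in zip(f2(f1(x)), x)]

def w_block(x, s, f1, f2, g):
    # Eq. 7: w(x, s) = f2(f1(x) + g(s)) + f1(x) + g(s).
    # The skip path carries f1(x) + g(s), so x never bypasses f1.
    h = [a + b for a, b in zip(f1(x), g(s))]
    return [a + b for a, b in zip(f2(h), h)]

def make_g(weights):
    # Linear projection of the 2D control vector, one (wx, wy) pair
    # per output element; a bias term is omitted for brevity.
    return lambda s: [wx * s[0] + wy * s[1] for wx, wy in weights]

f1 = lambda x: [2.0 * a for a in x]   # toy first layer
f2 = lambda h: [0.0 for _ in h]       # toy second layer that outputs zeros
g = make_g([(0.5, 0.0), (0.0, 0.5)])

x, s = [1.0, 2.0], (1.0, 0.0)
v_out = v_block(x, f1, f2)
w_out = w_block(x, s, f1, f2, g)
```

With the second layer silenced, the vanilla block returns its input unchanged, while the conditioned block still returns f_1(x) + g(s). This is one way to read the statement that the modified block does not allow a complete bypass of the convolutional layers.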
The loss applied to the discriminator is formulated as: L_D^k = (1/2) E_{(p_{i−1}, s_i)} [(D_k(p_{i−1}, P2P(p_{i−1}, s_i)))^2] + (1/2) E_{(p_{i−1}, p_i)} [(D_k(p_{i−1}, p_i) − 1)^2] (12). The training sequences are first processed by employing the DensePose network, in order to extract the pose information from each frame. This pose information takes the form of an RGB image, where the 2D RGB intensity levels are a projection of the 3D (I)UV mapping. By applying a binary threshold over the DensePose RGB image, we are able to create a binary mask for the character in the video. From the binary mask t_i of each frame i, we compute the center of mass of the character ρ_i. The control signal during training is denoted as s_i = ρ_i − ρ_{i−1}. Due to the temporal smoothness in the videos, the difference between consecutive frames in the full frame-rate videos (30fps) is too small to observe significant motion. This results in learned networks that are biased towards motionless poses. Hence, we train with Δ = 2 inter-frame intervals (where Δ = 1 describes using consecutive frames). During inference, we sample at 30fps and apply a directional conditioning signal that has half of the average motion magnitude during training. Stopping criteria. We use the Adam optimizer (Kingma & Ba, 2016) with a learning rate of 2·10^−4, β_1 = 0.5 and β_2 = 0.999. We observe that training the P2P network does not provide for monotonic improvement in output quality. We stipulate the P2P network final model to be that which yields the lowest loss, in terms of discriminator feature-matching. While there are several losses applied while training the P2P network, the discriminator feature-matching loss is the only one that holds both motion context (i.e. receives both the previous and current pose), and information of different abstraction levels (i.e. feature-matching is applied over different levels of activations). 
This results in improved motion naturalness, and reduced perceptual distance, as evident from the ablation study. Random occlusions To cope with pose detection imperfections that occasionally occur, which in turn impair the quality of the generated character, we employ a dedicated data augmentation method, in order to boost the robustness of the P2P network. A black ellipse of random size and location is added to each input pose frame within the detection bounding box, resulting in an impaired pose (see appendix Fig. 8), with characteristics that are similar to \"naturally\" occurring imperfections. 5 T HEPOSE2FRAME NETWORK While the original pix2pixHD network transforms an entire image to an output image of the same size from a speci\ufb01ed domain, our Pose2Frame network transforms a pose to a character that is localized in a speci\ufb01c part of the output image and embedded in a given, possibly dynamic, background. This is done by both refocusing the discriminators\u2019 receptive \ufb01eld, and applying a learned blending mask over the raw image output. The DensePose network plays a crucial role, as it provides both the relevant image region and a prior over the blending mask. Focusing the discriminator on the character eliminates the need for feature-enhancing techniques, such as the introduction of a face-GAN, as done by Chan et al. (2018)), or adding a temporal loss (which is useful for reducing irrelevant background motion) as done by Wang et al. (2018a). The generator architecture is illustrated in Fig. 5(a). 
The P2F low-level network architecture details are somewhat similar to those of the P2P network, with the following modifications: (1) the P2F network generates frames with a resolution width of 1024, (2) no conditioning is applied, i.e., the w blocks are replaced by v blocks, (3) the network generates two outputs: the raw image data z and a separate blending mask m, (4) the discriminators are altered to reflect the added focus, and (5) new regularization terms are added to ensure that the masking takes place at the relevant regions (Eq. 17), see Fig. 9 in the appendix. The generated mask m blends the raw output z with the desired background b, rendering the final output frame f, according to Eq. 3 (omitting the index i for brevity). Note that the blending mask is not binary, since various effects, such as shadows, contain both character-derived information and background information, see Fig. 6. Nevertheless, we softly encourage the blending mask to favor the background in regions external to the character, and discourage the generator from rendering meaningful representations outside the character. This is done by employing several regularization terms over the generated mask. As a side effect of these added losses, the network is required to perform higher-level reasoning and not rely on memorization. In other words, instead of expanding the mask to include all background changes, the network separates between character dependent changes, such as shadows, held items, and reflections, and those that are independent. The discriminator setup is illustrated in Fig. 5(b). The discriminator's attention is predominantly shifted towards the character, by applying an inverse binary mask over the character. The masked character image is fed into the discriminators, affecting both the multi-scale loss, and the feature-matching loss applied over the discriminators' activations. In parallel, the fully generated frame is fed into the VGG network, allowing the VGG feature-matching loss to aid in the generation of desired structures external to the character. Figure 5: The Pose2Frame network. (a) For each two combined input pose and object (p = [p_{i−1} + obj_{i−1}, p_i + obj_i]), the network generates an RGB image (z_i) and a mask (m_i). The RGB and background images are then linearly blended by the generated mask to create the output frame f_i. (b) The P2F discriminator setup. The multi-scale discriminator focuses on the binary-thresholded character, obtained with the binary mask t, as it appears in both the ground truth image o and the output of the P2F network, for a given pose p = (p_i, p_{i−1}). The ↓ operator denotes downscaling by a factor of two, obtained by average pooling, as applied before the low-resolution discriminator. The VGG feature-matching loss term engages with the full frame, covering perceptual context in higher abstraction layers (e.g. generated shadows). Figure 6: Samples of masks that model both the character, and places in the scene in which appearance is changed by the character. (a) The shadow and the tennis racket of the character are captured by the mask, (b) the dancer's shadow appears as part of the mask. 5.1 TRAINING THE POSE TO FRAME NETWORK The P2F generator loss is formulated as: L_{P2F} = Σ_{k=1}^{2} (L_{LS}^k + λ_D L_{FM_D}^k) + λ_1 L_{FM_VGG} + λ_2 L_{mask} (13), where λ_1 = 10 and λ_2 = 1. The LSGAN generator loss is calculated as: L_{LS}^k = E_{(p,t)} [(D_k(p ⊙ t, f ⊙ t) − 1)^2] (14), where p = [p_{i−1} + obj_{i−1}, p_i + obj_i] denotes the two pose images, and t is the binary mask obtained by thresholding the DensePose image at time i. 
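The effect of masking the discriminator inputs in Eq. 14 can be sketched in a few lines of Python. The discriminator below is a hypothetical toy that returns one score per image row, and poses and frames are reduced to single-channel nested lists; both are simplifying assumptions for this example and do not reflect the actual PatchGAN discriminators or the six-channel pose input.

```python
# Sketch of the masked LSGAN generator loss of Eq. 14 for a single sample.

def masked(img, t):
    # Per-pixel product img * t, with t a binary mask of the same shape.
    return [[v * tv for v, tv in zip(row, trow)] for row, trow in zip(img, t)]

def lsgan_generator_loss(d_k, p, f, t):
    # (D_k(p * t, f * t) - 1)^2, averaged over the discriminator scores.
    scores = d_k(masked(p, t), masked(f, t))
    return sum((sc - 1.0) ** 2 for sc in scores) / len(scores)

def toy_discriminator(p, f):
    # Hypothetical stand-in: one score per row, the mean of the frame row.
    return [sum(row) / len(row) for row in f]

t = [[1, 1], [0, 0]]                     # character occupies the top row
p = [[1.0, 1.0], [1.0, 1.0]]
f_a = [[1.0, 1.0], [0.5, 0.5]]
f_b = [[1.0, 1.0], [9.0, 9.0]]           # differs only outside the mask

loss_a = lsgan_generator_loss(toy_discriminator, p, f_a, t)
loss_b = lsgan_generator_loss(toy_discriminator, p, f_b, t)
```

The two frames differ only in pixels where t is 0, so they receive the same adversarial loss. In this simplified picture, content outside the character is left to the VGG feature-matching term and the mask term instead.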
The discriminator feature-matching loss is calculated as: L_{FM_D}^k = E_{(p,o,t)} Σ_{j=1}^{M} (1/N_j) ||D_k^{(j)}(p ⊙ t, o ⊙ t) − D_k^{(j)}(p ⊙ t, f ⊙ t)||_1 (15), with M being the number of layers, N_j the number of elements in each layer, and o the real (ground truth) frame. The VGG feature-matching loss is calculated over the full ground truth frame, rather than the one masked by t: L_{FM_VGG} = Σ_{j=1}^{M} (1/N_j) ||VGG^{(j)}(o) − VGG^{(j)}(f)||_1 (16), with o being the ground truth frame, N_j being the number of elements in the j-th layer, and, as before, VGG^{(j)} the VGG activations of the j-th layer. The mask term penalizes the mask (see appendix Fig. 9 for a visual illustration): L_{mask} = ||m ⊙ (1 − t)||_1 + ||m_x ⊙ (1 − t)||_1 + ||m_y ⊙ (1 − t)||_1 + ||1 − m ⊙ t||_1 (17), where m is the generated mask, and m_x and m_y the mask derivatives in the x and y axes respectively. The first term acts to reduce the mask's activity outside the regions detected by DensePose. The mask, however, is still required to function in such regions, e.g., to render shadows. Similarly, we suppress the mask derivative outside the pose-detected region, in order to eliminate secluded points, and other high-frequency patterns. Finally, a term is added to encourage the mask to be active in the image regions occupied by the character. The loss applied to the two discriminators is given by: L_D^k = (1/2) E_{(p,t)} [(D_k(p ⊙ t, f ⊙ t))^2] + (1/2) E_{(p,o,t)} [(D_k(p ⊙ t, o ⊙ t) − 1)^2] (18). The Adam optimizer is used for P2F similar to the P2P. The training progression across the epochs is visualized in the appendix (Fig. 10). Figure 7: Generated frames for the controllable tennis character, blended into different backgrounds. 6 EXPERIMENTS The method was tested on multiple video sequences, see the supplementary video (https://youtu.be/sNp6HskavBE). The first video shows a tennis player outdoors, the second video, a person swiping a sword indoors, and the third, a person walking. 
The part of the videos used for training consists of 5.5 min, 3.5 min, and 7.5 min, respectively. In addition, for comparative purposes, we trained the P2F network on a three-minute video of a dancer, which was part of the evaluation done by Wang et al. (2018a). The controllable output of the tennis player is shown in Fig. 1, which depicts the controller signal used to drive the pose, as well as the generated pose p_i, object obj_i, mask m_i, raw frame z_i, and output frame f_i. A realistic character is generated with some artifacts (see supplementary video) around the tennis racket, for which the segmentation of the training video is only partially successful. Fig. 7 depicts additional results, in which the character is placed on a diverse set of backgrounds containing considerable motion. Appendix B also presents a controlled walking character, and a controlled fencing character, which also appear in the supplementary video. A comparison of the P2F network with the pix2pixHD method of Wang et al. (2018b) is provided in Tab. 1, and as a figure in appendix Fig. 14.
Table 1: Comparison with pix2pixHD (see also Fig. 14). SSIM and LPIPS (multiplied by 1000) are shown for three scenarios: (1) tennis (contains dynamic elements, e.g. other players, crowd, difference in camera angle), (2) walking (different character clothing, lighting, and camera angle), (3) fencing (fixed background and view).
Dataset | Method | SSIM | LPIPS (SqzNet) | LPIPS (AlexNet) | LPIPS (VGG)
Tennis | ours | 240 ± 2 | 265 ± 3 | 400 ± 4 | 474 ± 5
Tennis | pix2pixHD | 301 ± 26 | 379 ± 35 | 533 ± 42 | 589 ± 32
Walking | ours | 193 ± 133 | 216 ± 149 | 365 ± 252 | 374 ± 258
Walking | pix2pixHD | 224 ± 156 | 308 ± 224 | 485 ± 347 | 434 ± 303
Fencing | ours | 45 ± 4 | 41 ± 8 | 52 ± 11 | 150 ± 15
Fencing | pix2pixHD | 308 ± 95 | 531 ± 129 | 670 ± 168 | 642 ± 86
We compare by Structural Similarity (SSIM) (Wang et al., 2004) and Learned Perceptual Image Patch Similarity (LPIPS) Zhang et al. 
(2018) distance methods. The mean and standard deviation are calculated for each generated video. The LPIPS method provides a perceptual distance metric, by comparing the activations of three different network architectures, VGG (Simonyan & Zisserman, 2014a), AlexNet (Krizhevsky, 2014), and SqueezeNet (Iandola et al., 2016), with an additional linear layer set on top of each network. For each dataset, we select a test set that was not used during training. Although this test set is evaluated as the ground-truth, there is a domain shift between the training and the test video: the tennis test set contains dynamic elements, such as other players, crowd, and a slight difference in camera angle; the walking test set contains different character clothing, background lighting, and camera angle. The fencing test set is more similar to the training set. As seen in appendix Fig. 14, the baseline method results in many background and character artifacts, and a degradation in image and character quality, as it is forced to model the entire scene, rather than focus solely on the character and its shadow, as our method does. This is also apparent in the statistics reported in the table. Another experiment dedicated to the P2F network (other methods do not employ P2P), compares it with the vid2vid method of Wang et al. (2018a). The results are reported in the supplementary video, and in appendix C. Our method produces far fewer background distortions, can better handle variation in the character\u2019s location, and has the ability to embed the character into novel backgrounds. An ablation study is presented in appendix D, showing the contribution of the various components of the system both quantitatively and qualitatively. In addition, we describe the unfavorable results obtained when replacing the autoregressive model with a concatenative model. 
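The per-video statistics of Tab. 1 (mean and standard deviation, multiplied by 1000) can be summarized along the following lines. This is a minimal sketch that assumes the per-frame distances have already been computed by some SSIM or LPIPS implementation; the function name summarize, the rounding to integers, and the use of the population standard deviation are assumptions for this example, since the text does not state which estimator is used.

```python
from statistics import mean, pstdev

def summarize(distances, scale=1000):
    # Mean and standard deviation over the frames of one generated video,
    # scaled by 1000 as in Tab. 1. pstdev (population form) is an assumption.
    return round(scale * mean(distances)), round(scale * pstdev(distances))

# Hypothetical per-frame distances for a two-frame video.
avg, std = summarize([0.2, 0.4])
```

Reporting the spread alongside the mean is what makes the large per-video variation on the walking test set visible in the table.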
7 C ONCLUSIONS Generating smooth motion that combines unpredictable control, the current pose, and previous motion patterns is a challenging task. The proposed novel method employs two autoencoders: one generates autoregressive motion for a speci\ufb01c learned style, and the other generates a realistic frame for blending with a dynamic background. Our work paves the way for new types of realistic and personalized games, which can be casually created from everyday videos. In addition, controllable characters extracted from YouTube-like videos can \ufb01nd their place in the virtual worlds and augmented realities. The work is still limited in various aspects, such as not allowing control over the illumination of the character, the lack of support for novel views, and not modeling the character\u2019s interaction with scene objects. ACKNOWLEDGMENTS The authors would like to thank Lisa Rhee, Ilkka Hartikainen, and Adrian Bryant for allowing us to use their videos for training. \nPublished as a conference paper at ICLR 2020 REFERENCES Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 8340\u20138348, 2018. Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. arXiv preprint arXiv:1808.07371 , 2018. Patrick Chao, Alexander Li, and Gokul Swamy. Generative models for pose transfer. arXiv preprint arXiv:1806.09070 , 2018. Patrick Esser, Johannes Haux, Timo Milbich, and Bj\u00f6rn Ommer. Towards learning a realistic rendering of human behavior. In ECCV WORKSHOP , 2018a. Patrick Esser, Ekaterina Sutter, and Bj\u00f6rn Ommer. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 8857\u20138866, 2018b. Katerina Fragkiadaki, Sergey Levine, and Jitendra Malik. 
Recurrent network models for kinematic tracking. arXiv preprint arXiv:1508.00271 , 2015. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems . 2014. Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control. ACM Trans. Graph. , 36(4):42:1\u201342:13, July 2017. ISSN 0730-0301. doi: 10.1145/3072959. 3073663. URL http://doi.acm.org/10.1145/3072959.3073663 . Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mb model size. arXiv preprint arXiv:1602.07360 , 2016. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 , 2015. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2017. Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV) , 2016. Angjoo Kanazawa, Jason Y . Zhang, Panna Felsen, and Jitendra Malik. Learning 3d human dynamics from video. arXiv preprint arXiv:1812.01601 , 2018. D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR , 2016. Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 , 2014. Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV) , pp. 734\u2013750, 2018. Yining Li, Chen Huang, and Chen Change Loy. Dense intrinsic appearance \ufb02ow for human pose transfer. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 3693\u20133702, 2019. K.K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool. Deep extreme cut: From extreme points to object segmentation. In Computer Vision and Pattern Recognition (CVPR) , 2018. Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV , 2017. \nPublished as a conference paper at ICLR 2020 Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 , 2014. Vinod Nair and Geoffrey E Hinton. Recti\ufb01ed linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML) , pp. 807\u2013814, 2010. Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. Sfv: Reinforce- ment learning of physical skills from videos. ACM Trans. Graph. , 37(6), November 2018. Albert Pumarola, Antonio Agudo, Alberto Sanfeliu, and Francesc Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2018. Iasonas Kokkinos R \u02dciza Alp G\u00fcler, Natalia Neverova. Densepose: Dense human pose estimation in the wild. 2018. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 , 2014a. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 , 2014b. Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 , 2016. Carl V ondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. arXiv preprint arXiv:1609.02612 , 2016. 
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS) , 2018a. Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High- resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2018b. Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing , 13(4):600\u2013612, 2004. Ceyuan Yang, Zhe Wang, Xinge Zhu, Chen Huang, Jianping Shi, and Dahua Lin. Pose guided human video generation. arXiv preprint arXiv:1807.11152 , 2018. Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR , 2018. Xingyi Zhou, Jiacheng Zhuo, and Philipp Kr\u00e4henb\u00fchl. Bottom-up object detection by grouping extreme and center points. In CVPR , 2019. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networkss. arXiv preprint arXiv:1703.10593 , 2017. \nPublished as a conference paper at ICLR 2020 (a) (b) (c) Figure 8: The occlusion-based augmentation technique used to increase robustness during training the P2P network. Each row is a single sample. (a) pi\u00001with part of it occluded by a random ellipse, (b) the predicted pose ^pi, (c) the ground truth pose pi. The generated output seems to \"\ufb01ll in\" the missing limbs, as well as predict the next frame. In this \ufb01gure and elsewhere, the colors represent the 3D UV mapping. A A DDITIONAL POSE2POSE NETWORK ARCHITECTURE AND IMPLEMENTATION DETAILS We follow the naming convention of (Wang et al., 2018b; Zhu et al., 2017; Johnson et al., 2016). 
Let Ck denote a Conv-InstanceNorm-ReLU layer with k \ufb01lters, each with a kernel size of 7x7, with a stride of 1. Dk denotes a Convolution-InstanceNorm-ReLU layer with k\ufb01lters and a stride of 2, where re\ufb02ection padding is used. Vk denotes a vanilla residual block with two 3x3 convolutional layers with the same number of \ufb01lters on both layers. Wk denotes a conditioned residual block. Uk denotes a 3x3 Fractional-Strided-Convolution-InstanceNorm layer with k\ufb01lters, and a stride of 0:5. The generator, i.e., the P2P network, can then be described as: C64, D128, D256, D512, D1024, V1024, W1024, W1024, W1024, W1024, W1024, W1024, W1024, V1024, U512, U256, U128, U64, C3. The input images are scaled to a width size of 512 pixels, with the height scaled accordingly. The discriminators are two PatchGANs (Isola et al., 2017) with an identical architecture of C64,C128,C256,C512, working at the input resolution and a lower resolution, down-sampled by an average-2D-pooling operation with a kernel size of 3, and a stride of 2. The architecture of the P2F network is similar to that of the P2P network, with the following adjustments: (i) the conditional residual blocks are replaced by non residual ones, (ii) the input of P2F has 6 channels for piandpi\u00001, (iii) there is an additional head generating the mask output, which uses a sigmoid activation function. B A DDITIONAL IMAGES Fig. 8 depicts the random occlusion process (P2P training), in which a black ellipse of random size and location is added to each input pose frame within the detection bounding box. This results in an impaired pose, with characteristics that are similar to \"naturally\" occurring imperfections. The mask loss term Lmask of P2F (Sec. 5) is illustrated in Fig. 9. Fig. 10 depicts the progression during training of the P2F dancer model. 
As training progresses, the details of the dancer become sharper and the hair becomes part of the mask, despite being outside the DensePose detected area (i.e., off pixels in t). Fig. 11 depicts a controlled walking character along with the control signal and the generated poses. \nPublished as a conference paper at ICLR 2020 Figure 9: Mask losses applied during the P2F network training. An inverse binary-thresholded mask is used to penalize pixel intensity for the generated mask, in the regions excluding the character of interest. For the generated mask, we apply regularization over the derivatives in the x and y axes as well, to encourage smooth mask generation, and discourage high-frequency pattern generation. Figure 10: Training the P2F network. (a) A sample pose, (b) the target frame, (c) the generated raw frame, the mask, and the output frame at different epochs: 10, 30, 50, and 200 (\ufb01nal). The fencing character is shown in Fig. 12. The mask for various frames in the controlled sequence is shown, as well as two backgrounds: the background of the reference video, and an animated background. Fig. 13 depicts an additional controlled walking character, along with the control signal and the generated poses. Fig 14 compares visually with the baseline method of pix2pixHD (Wang et al., 2018b). As can be seen, the baseline method results in many background and character artifacts, a degradation in image and character quality, as it is forced to model the entire scene, rather than focus solely on the character and the environmental factors, such as in our method. C C OMPARISON WITH VID 2VID Fig. 15(a-e) presents a comparison with the method of Wang et al. (2018a). Shown are the target image from which the driving pose is extracted, the extracted pose, the results of the baseline method, and our result. 
As can be seen, our method handles the background in a way that creates far fewer distortions, as we apply a learned mask, thus background generation is not required. The characters themselves are mostly comparable in quality, despite our choice not to add a dedicated treatment to the face. In addition, despite not applying considerable emphasis on temporal consistency during training (e.g. optical flow, temporal discriminator), our method produces videos that are as smooth. Finally, the proportions of the character in our video are better maintained, while in the baseline model, the character is slightly distorted toward the driving pose. In addition, as we demonstrate in Fig. 15(f-k), our method has the ability to replace the background. Figure 11: Synthesizing a walking character, emphasizing the control between the frames. Shown are the sequence of poses generated in an autoregressive manner, as well as the generated frames. Figure 12: Generated frames for the controllable fencing character. Each column is a different pose. The rows are the obtained mask, and the placement on two different backgrounds: the one obtained by applying a median filter to the reference video, and one taken from a motion picture. Figure 13: Synthesizing an additional walking character. Shown are the sequence of poses generated in an autoregressive manner, as well as the generated frames. Figure 14: A comparison of the P2F network with the pix2pixHD method of Wang et al. (2018b). (a) Ground truth image used as the pose source, (b) our result, (c) the results of pix2pixHD. The baseline method results in many background artifacts, as it generates the entire frame. The degradation in image quality is apparent as well, and that of the character in particular. Figure 15: A comparison of the P2F network with the vid2vid method of Wang et al. (2018a). (a) The target-pose image, (b) the pose extracted from this image, (c) the result of vid2vid, (d) our result, (e) a frame from the reference video. Many artifacts are apparent in the background produced by vid2vid. vid2vid also distorts the character's appearance and dimensions to better match the pose. (f-k) The same pose, displayed by two characters on three different backgrounds, demonstrates our advantage over vid2vid in replacing backgrounds.
Table 2: Ablation study of the P2P network on the tennis sequence. The results are multiplied by a factor of 1000 for readability.
Network Component | SSIM | LPIPS (SqzNet) | LPIPS (AlexNet) | LPIPS (VGG)
Base Conditioning | 15.0 ± 4 | 20.5 ± 14 | 39.8 ± 25 | 37.0 ± 14
+ Conditioning Block | 14.7 ± 3 | 15.6 ± 7 | 29.8 ± 14 | 30.6 ± 8
+ Stopping Criteria | 14.0 ± 3 | 14.9 ± 7 | 28.1 ± 14 | 29.5 ± 8
+ Object Channel | 14.1 ± 3 | 13.3 ± 6 | 24.9 ± 12 | 28.6 ± 7
Figure 16: P2F ablation. (a) Ours, (b) no VGG FM on the full-frame (no shadows generated), (c) no mask regularization (background artifacts), (d) 1 input pose (no racket generation due to a semantic segmentation mis-detection), (e) no discriminator FM (character/racket heavily distorted), (f) no mask, i.e. background fully-generated (excessive distortion in background / character). Figure 17: P2P vs. baseline method comparison. Temporal consistency of P2P generated motion (row 1) is apparent, as opposed to the baseline method (row 2), that results in temporal inconsistency. D ABLATION STUDY We test the effect of several novel P2P network components, both by SSIM and LPIPS. The test is performed when predicting one frame into the future (in a “teacher forcing” mode). The results in Tab. 
2 demonstrate that our conditioning block is preferable to the conventional one, and that adding the object channel is beneficial. Selecting the model based on the minimal discriminator feature-matching loss is helpful as well. A qualitative ablation study for the P2F network is provided in Fig. 16. As can be seen, each component contributes to the naturalness of the results. To validate the need for autoregressive motion generation, as done by the P2P network, we implemented a baseline method that copies motion patterns from the training set, matching the displacement, and verified that such a naive approach fails to produce natural motion. A sequence of frames from the experiment video can be seen in Fig. 17."
}]