+ "'We used beam search as described in the previous section, but no\\ncheckpoint averaging. We present these results in Table 3.\\nIn Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions,\\nkeeping the amount of computation constant, as described in Section 3.2.2. While single-head\\nattention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.\\nIn Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality. This\\nsuggests that determining compatibility is not easy and that a more sophisticated compatibility\\nfunction than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected,\\nbigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our\\nsinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical\\nresults to the base model.\\n6.3 English Constituency Parsing\\nTo evaluate if the Transformer can generalize to other tasks we performed experiments on English\\nconstituency parsing. This task presents specific challenges: the output is subject to strong structural\\nconstraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence\\nmodels have not been able to attain state-of-the-art results in small-data regimes [37].\\nWe trained a 4-layer transformer with dmodel = 1024 on the Wall Street Journal (WSJ) portion of the\\nPenn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting,\\nusing the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences\\n[37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens\\nfor the semi-supervised setting.\\nWe performed only a small number of experiments to select the dropout, both attention and residual\\n(section 5.4), learning rates and beam size on the Section 22 development set, all other parameters\\nremained unchanged from the English-to-German base translation model. During inference, we\\n 9\\n---\\nTable 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23\\nof WSJ)\\n Parser Training WSJ 23 F1\\n Vinyals & Kaiser el al. (2014) [37] WSJ only, discriminative 88.3\\n Petrov et al. (2006) [29] WSJ only, discriminative 90.4\\n Zhu et al. (2013) [40] WSJ only, discriminative 90.4\\n Dyer et al. (2016) [8] WSJ only, discriminative 91.7\\n Transformer (4 layers) WSJ only, discriminative 91.3\\n Zhu et al. (2013) [40] semi-supervised 91.3\\n Huang & Harper (2009) [14] semi-supervised 91.3\\n McClosky et al. (2006) [26] semi-supervised 92.1\\n Vinyals & Kaiser el al. (2014) [37] semi-supervised 92.1\\n Transformer (4 layers) semi-supervised 92.7\\n Luong et al. (2015) [23] multi-task 93.0\\n Dyer et al. (2016) [8] generative 93.3\\nincreased the maximum output length to input length + 300. 
Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].

In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.

7 Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
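For readers who want a concrete reference point for the multi-headed self-attention summarized above, and for the constant-computation head sizing varied in Table 3 rows (A), the following is a minimal NumPy sketch, not the reference implementation: the weights are random placeholders, the function name is hypothetical, and per-head sizes follow d_k = d_v = d_model / h.

```python
# Minimal illustrative sketch (not the reference implementation) of
# multi-head scaled dot-product self-attention with d_k = d_v = d_model / h,
# the sizing that keeps total computation roughly constant as the number of
# heads h is varied in Table 3 rows (A). Weights are random placeholders.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, h=8, seed=0):
    """x: (seq_len, d_model) array. Returns a (seq_len, d_model) array."""
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    d_k = d_model // h                              # per-head query/key/value size
    heads = []
    for _ in range(h):
        w_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        w_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        w_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax(q @ k.T / np.sqrt(d_k))   # scaled dot-product attention
        heads.append(weights @ v)                   # (seq_len, d_k)
    w_o = rng.standard_normal((h * d_k, d_model)) / np.sqrt(h * d_k)
    return np.concatenate(heads, axis=-1) @ w_o     # project back to d_model

# Example: base-model sizing, d_model = 512 and h = 8 gives d_k = d_v = 64.
out = multi_head_self_attention(np.ones((5, 512)), h=8)
```

With d_model = 512 and h = 8 this reproduces the base-model head size of 64; setting h = 1 gives the single-head variant that Table 3 reports as 0.9 BLEU worse than the best setting.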