Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation
Authors
Abstract
State-of-the-art neural machine translation models generate outputs autoregressively, where every step conditions on the previously generated tokens. This sequential nature causes inherent decoding latency. Non-autoregressive translation techniques, on the other hand, parallelize generation across positions and speed up inference at the expense of translation quality. Much recent effort has been devoted to non-autoregressive methods, aiming for a better balance between speed and quality. In this work, we re-examine the trade-off and argue that transformer-based autoregressive models can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. Our extensive experiments show that given a sufficiently deep encoder, a one-layer autoregressive decoder yields state-of-the-art accuracy with comparable latency to strong non-autoregressive models. Our findings suggest that the latency disadvantage for autoregressive translation has been overestimated due to a suboptimal choice of layer allocation, and we provide a new speed-quality baseline for future research toward fast, accurate translation.
1 Introduction
Fast, accurate machine translation is a fundamental goal with a wide range of applications both in research and production. State-of-the-art neural machine translation systems generate translations autoregressively, where words are predicted one-by-one conditioned on all previous words (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017). This sequential property causes inherent latency in inference since multiple tokens in each sentence cannot be generated in parallel. A flurry of recent work developed ways to (partially) parallelize the decoder with non-autoregressive machine translation (NAT, Gu et al., 2018), thereby speeding up decoding during inference. NAT tends to suffer in translation quality because parallel decoding requires conditional independence assumptions and prevents the model from properly capturing the highly multimodal distribution of target translations (Gu et al., 2018). Recent work proposed methods to mitigate this multimodality issue, including iterative refinement (e.g., Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019b; Kasai et al., 2020) and modeling with latent variables (e.g., Ma et al., 2019; Shu et al., 2020). These approaches modify the decoder transformer to find a balance between decoding parallelism and translation quality.
In this work, however, we take a contrasting approach to the speed-quality trade-off. The standard transformer for machine translation is typically assumed to have the same number of encoding and decoding layers (Vaswani et al., 2017). Observing that the encoder transformer is inherently parallel, we place most of the model capacity in the encoder while keeping the decoder minimal, to accelerate inference. The resulting autoregressive transformer with a deep encoder and a shallow decoder (Fig. 1) achieves a substantial latency improvement over the standard transformer configuration, without sacrificing performance.
* Work partially done at Facebook AI.
We provide extensive speed-quality comparisons between iterative NAT models and autoregressive models with varying numbers of encoder and decoder layers. In particular, we use two types of latency measures for translation and discuss their relation to computational complexity. The two measures reflect two possible application scenarios: feeding one sentence at a time, or feeding as many words as possible into GPU memory. The first scenario is designed to simulate, for example, instantaneous machine translation that translates text (or even speech) input from users. This is where current NAT models shine: we can make full use of parallelism across decoding positions in a GPU. For this reason, much prior work in NAT only measures latency using this metric (Gu et al., 2018, 2019b; Kasai et al., 2020). The second scenario aims at a situation where we want to translate a large amount of text as quickly as possible. In this case, we see that autoregressive models run faster than NAT models by a large margin. Computation at each time step is large enough to exploit parallelism in a GPU, which cancels out the benefit from parallel NAT decoding. Further, autoregressive models can reduce latency by caching all hidden states from the previous positions and computing each step in linear complexity with respect to the sequence length. In contrast, NAT models require a fresh run of quadratic self and cross attention in every decoding iteration.
Interestingly, if we apply the layer allocation strategy of a deep encoder and shallow decoder to NAT models, we fail to retain the original translation quality from 6 layers each (§5.1). This suggests that a departure from autoregressive decoding requires more computational capacity on the decoder side, and that our strategy is effective specifically for autoregressive models. Our analysis demonstrates that the decoder in NAT models requires more capacity because it needs to learn to reorder words for the target (§6). Since the configuration of deep encoder and shallow decoder is specifically effective for autoregressive models, we need to re-establish where autoregressive transformers sit in the spectrum of the speed-quality trade-off for future work in fast, accurate machine translation.
2 Transformer And Parallelism
The transformer architecture (Vaswani et al., 2017) differs from recurrent neural networks such as LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014) in its parallel structure. Here we review the architecture and discuss its implications for fast machine translation.
2.1 Architecture
An autoregressive transformer (Vaswani et al., 2017) consists of an encoder and a decoder. Each encoder layer takes as input a sequence of vectors $X_{\text{in}}$ and outputs $X_{\text{out}}$:¹

$$X_{\text{self}} = \text{self-attention}(X_{\text{in}}) + X_{\text{in}}$$
$$X_{\text{out}} = \text{feed-forward}(X_{\text{self}}) + X_{\text{self}}$$

A decoder layer takes as input a sequence of vectors $Y_{\text{in}}$ and encoded source tokens $X_{\text{src}}$ from the final encoder layer:

$$Y_{\text{self}} = \text{causal-attention}(Y_{\text{in}}) + Y_{\text{in}}$$
$$Y_{\text{cross}} = \text{cross-attention}(Y_{\text{self}}, X_{\text{src}}) + Y_{\text{self}}$$
$$Y_{\text{out}} = \text{feed-forward}(Y_{\text{cross}}) + Y_{\text{cross}}$$

Here causal-attention denotes a variant of self attention that only attends to the prefix (i.e., $Y_{\text{self},i}$ only attends to $Y_{\text{in},\le i}$). During training one can parallelize computation across positions both in the encoder and decoder, resulting in linear complexity in sequence length. At inference time, the decoder generates outputs sequentially, and thus computation cannot be parallelized over positions. This sequential nature of autoregressive decoding causes inherent latency, with complexity quadratic in sequence length.
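To make the residual structure of the two layer types concrete, the following is a minimal sketch in plain Python. It is a simplification under stated assumptions, not the implementation used in our experiments: attention is single-head with no learned projections, the feed-forward network is replaced by a hypothetical ReLU-and-scale stand-in, and layer normalization is omitted. All function names are illustrative.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, memory, causal=False):
    """Single-head dot-product attention with no learned projections (assumption)."""
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        # With causal=True, position i only sees memory positions <= i.
        visible = memory[: i + 1] if causal else memory
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in visible]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)])
    return out

def add(a, b):
    """Residual connection: element-wise sum of two sequences of vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def feed_forward(x, scale=0.5):
    # Hypothetical stand-in for the position-wise feed-forward network.
    return [[scale * max(v, 0.0) for v in row] for row in x]

def encoder_layer(x_in):
    x_self = add(attention(x_in, x_in), x_in)
    return add(feed_forward(x_self), x_self)

def decoder_layer(y_in, x_src):
    y_self = add(attention(y_in, y_in, causal=True), y_in)
    y_cross = add(attention(y_self, x_src), y_self)
    return add(feed_forward(y_cross), y_cross)
```

Because of the causal mask, changing a later target position leaves the decoder outputs at all earlier positions unchanged, which is what allows an autoregressive decoder to cache and reuse earlier states.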
2.2 Deep Encoder, Shallow Decoder
Since its first proposal (Vaswani et al., 2017), much prior work has assumed that the transformer architecture in machine translation has the same number of encoder and decoder layers, including top-performing systems in recent WMT competitions (Pinnis et al., 2018). We challenge this convention and explore pairing deep encoders with a shallow decoder. As we will show in later experiments, this deep-shallow configuration retains translation accuracy but can substantially reduce decoding time. This is because at inference time, the encoder accounts for only a minor part of the latency overhead since its computation can be easily parallelized over input positions; on the other hand, the speedup gains from a lightweight decoder are substantial. Several prior works explored the use of deep encoders and shallow decoders to improve translation accuracy (Barone et al., 2017; Wang et al., 2019a). Here, we study the impact of such architectures from the perspective of a speed-quality trade-off.
3 Latency In Machine Translation
In this section, we present two types of latency for machine translation that target two different application scenarios: $S_1$ and $S_{\max}$. We then discuss complexity differences between autoregressive translation (AT) and non-autoregressive translation (NAT) models and how their computational complexity affects their $S_1$ and $S_{\max}$ latency. Our analysis shows that under the same layer configuration, NAT models improve $S_1$ over AT models by parallelizing the decoder computation. A deep-shallow AT model reduces the complexity of the decoder's sequential computation and achieves $S_1$ competitive with those NAT models.
3.1 Latency Measures
We use two translation latency metrics:
• $S_1$ measures the speed to translate one sentence at a time. It aligns with applications like instantaneous machine translation that translates text input from users immediately.
• $S_{\max}$ measures the speed to translate in mini-batches as large as the hardware allows. This is closer to scenarios where one wants to translate a large amount of text.
Both metrics measure wall-clock time speedups relative to an AT baseline with a 6-layer encoder and decoder, following prior work (Gu et al., 2018).
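The two measures can be sketched as a small timing harness. This is an illustrative sketch only: `translate` and `baseline` are hypothetical callables that take a list of sentences, and batches are formed by sentence count for simplicity, whereas filling the hardware in practice is governed by the number of tokens that fit in GPU memory.

```python
import time

def wall_clock_seconds(translate, sentences, batch_size):
    """Time one pass over `sentences`, feeding `batch_size` sentences per call."""
    start = time.perf_counter()
    for i in range(0, len(sentences), batch_size):
        translate(sentences[i : i + batch_size])
    return time.perf_counter() - start

def speedup(baseline_seconds, model_seconds):
    """Relative speedup; values above 1.0 mean faster than the baseline."""
    return baseline_seconds / model_seconds

def s1_and_smax(translate, baseline, sentences, max_batch_size):
    """Return (S_1, S_max) speedups of `translate` relative to `baseline`."""
    s1 = speedup(wall_clock_seconds(baseline, sentences, 1),
                 wall_clock_seconds(translate, sentences, 1))
    smax = speedup(wall_clock_seconds(baseline, sentences, max_batch_size),
                   wall_clock_seconds(translate, sentences, max_batch_size))
    return s1, smax
```

In this sketch the baseline would be the AT model with 6 encoder and 6 decoder layers, so that both numbers read as speedups over that configuration.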
3.2 Complexity Analysis
Table 1 presents a complexity analysis of different types of transformer layers and full translation models. Assume that the source and target lengths are both $N$ for simplicity. $T$ denotes the number of iterations in an iterative NAT method, where $T < N$. We use incremental decoding for AT models, where the model states from previously generated tokens are cached and reused. In this case, the total complexity in one AT decoder layer will be $O(N^2)$. NAT decoding with $T$ iterative steps will have cross and self attention of quadratic complexity. Since iterative NAT models run fresh transformer passes in each iteration (Lee et al., 2018; Ghazvininejad et al., 2019, 2020b; Kasai et al., 2020; Saharia et al., 2020), we will have complexity of $O(N^2 T)$ per layer. Some of these operations can be parallelized over $N$ target positions on a GPU, resulting in a reduction in time complexity (Harris, 2007; column "w/ parallelization"). Assuming parallelization over all $N$ positions, each encoder layer only costs $O(N)$. Similarly, one NAT decoder layer with $T$ iterations can be computed in $O(NT)$. The AT decoder layer still costs $O(N^2)$ due to its sequential nature.
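| Group | Configuration | Total Complexity | w/ parallelization |
|---|---|---|---|
| By Layer | Enc. Layer | $O(N^2)$ | $O(N)$ |
| By Layer | AT Dec. Layer | $O(N^2)$ | $O(N^2)$ |
| By Layer | NAT Dec. Layer | $O(N^2 T)$ | $O(NT)$ |
| Full Model | AT Enc-E Dec-D | $O(N^2(E + D))$ | $O(N(E + ND))$ |
| Full Model | AT Enc-E Dec-1 | $O(N^2 E)$ | $O(N(E + N))$ |
| Full Model | NAT Enc-E Dec-D | $O(N^2(E + DT))$ | $O(N(E + DT))$ |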
$S_1$ is dominated by the complexity after parallel reduction; a GPU typically has enough memory to parallelize all operations across $N$ target positions in a NAT decoder layer. This means that a NAT model with an $E$-layer encoder and $D$-layer decoder has an advantage over an AT model with the same layer configuration because $T < N$ and $O(N(E + DT)) < O(N(E + ND))$. However, NAT and AT models have similar complexity when the AT model only uses one decoder layer ($O(N(E + DT))$ vs. $O(N(E + N))$). This results in comparable $S_1$ latency between NAT and deep-shallow AT models. In the case of $S_{\max}$, total complexity without parallelization is also at stake since an AT decoder can make crucial use of a GPU by simply parallelizing over the batch instances, which offsets NAT's benefit. We observe that NAT incurs much higher total complexity than AT because of the $T$ factor from the decoder: $O(N^2(E + DT))$. Indeed, we will see in a later section that NAT models yield much slower $S_{\max}$ than AT models.
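As a worked illustration of the complexity expressions above, the sketch below plugs concrete values into the full-model formulas with constants dropped. The resulting numbers are unit-free operation counts, not latency predictions, and the sentence length of 25 used in the example is an assumed value chosen only for illustration.

```python
def at_cost(n, enc_layers, dec_layers):
    """(total, with-parallelization) cost of an AT model, constants dropped."""
    total = n * n * (enc_layers + dec_layers)
    parallel = n * (enc_layers + n * dec_layers)
    return total, parallel

def nat_cost(n, enc_layers, dec_layers, iterations):
    """(total, with-parallelization) cost of an iterative NAT model."""
    total = n * n * (enc_layers + dec_layers * iterations)
    parallel = n * (enc_layers + dec_layers * iterations)
    return total, parallel

# Example with an assumed length N = 25:
#   at_cost(25, 6, 6)       -> (7500, 3900)
#   at_cost(25, 12, 1)      -> (8125, 925)
#   nat_cost(25, 6, 6, 4)   -> (18750, 750)
#   nat_cost(25, 6, 6, 10)  -> (41250, 1650)
```

Under these assumptions, the parallelized cost of AT 12-1 falls between that of NAT 6-6 with 4 and with 10 iterations, while the total cost of both NAT settings is more than double that of either AT configuration.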
4 Experiments
We conduct extensive experiments on standard benchmark datasets of varying sizes. We compare latency across non-autoregressive and autoregressive models and show that autoregressive models with a deep encoder and shallow decoder provide a substantially better balance between speed and quality than standard autoregressive models with the encoder and decoder of equal total depth.
4.1 Baselines And Comparison
Prior work has proposed various approaches to non-autoregressive machine translation (NAT). These methods must seek a balance in the speed-quality trade-off: the more parallelization is introduced into a model, the more the output quality deteriorates because of a stronger conditional independence assumption. Some approaches require external models to achieve competitive accuracy, such as candidate rescoring with an autoregressive model (Sun et al., 2019) and a reordering module to align input word order to the target (Ran et al., 2019). Given this complication in much recent work, latency comparisons among NAT models present challenges. In this work, we focus on comparisons with iteration-based approaches because they perform competitively with autoregressive models without any external system. Specifically, we use two strong iteration-based NAT models from recent work (Ghazvininejad et al., 2019; Kasai et al., 2020). See §7 for descriptions of more prior work on NAT.
CMLM The conditional masked language model (Ghazvininejad et al., 2019) predicts randomly masked target tokens given observed target tokens as well as the source, similar to masked language models for contextual word representations (Devlin et al., 2019). CMLM is used for iterative NAT via mask-predict inference. Following Ghazvininejad et al. (2019, 2020b), we use 4 and 10 iterations and length beam 5, where the 5 most probable lengths are chosen and each of those candidates is decoded in parallel until we select the one with the best score at the end.
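The iterative procedure can be sketched as follows. This is a simplified illustration of mask-predict as we understand it from Ghazvininejad et al. (2019), for a single target length: the length beam is omitted, and `predict` is a hypothetical model call that returns one (token, confidence) pair per position given the current partially masked target, with `None` marking a masked position.

```python
def mask_predict(length, iterations, predict):
    """Toy mask-predict loop for one candidate target length."""
    tokens = [None] * length
    confidences = [0.0] * length
    for t in range(1, iterations + 1):
        predictions = predict(tokens)
        for i, (token, confidence) in enumerate(predictions):
            if tokens[i] is None:  # only masked positions are re-predicted
                tokens[i], confidences[i] = token, confidence
        # The number of re-masked positions decays linearly over iterations.
        num_mask = length * (iterations - t) // iterations
        if num_mask == 0:
            break
        lowest = sorted(range(length), key=lambda i: confidences[i])[:num_mask]
        for i in lowest:
            tokens[i] = None
    return tokens
```

Each pass through the loop is a full forward pass over all target positions, which is why the number of iterations multiplies the decoder cost of an iterative NAT model.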
DisCo The disentangled context transformer (Kasai et al., 2020) is an efficient alternative to CMLM. DisCo predicts every target token given an arbitrary subset of the rest of the reference tokens. Following Kasai et al. (2020) , we use their parallel easy-first inference, and set the maximum number of iterations and length beam to be 10 and 5 respectively.
Distillation Following previous work on non-autoregressive translation (e.g., Ghazvininejad et al., 2019; Kasai et al., 2020; Saharia et al., 2020), we apply sequence-level knowledge distillation (Kim and Rush, 2016) by training every model in all directions on translations produced by a standard left-to-right transformer model (transformer large for EN-DE, EN-ZH, EN-FR and base for EN-RO). We assess the impact of distillation in §6 and demonstrate that distillation is important, especially for non-autoregressive models. Note that we apply distillation to all configurations, including autoregressive models, for fair comparisons.²
4.2 Experimental Setup
We experiment with 7 translation directions from four datasets of various training data sizes: WMT14 EN-DE (4.5M pairs), WMT16 EN-RO (610K), WMT17 EN-ZH (20M), and WMT14 EN-FR (36M, EN→FR only). These datasets are all encoded into subwords by BPE (Sennrich et al., 2016). We follow the preprocessing and data splits of previous work (Gehring et al., 2017). We evaluate performance with BLEU (Papineni et al., 2002) for all directions, except that we use SacreBLEU (Post, 2018) for EN→ZH following a previous protocol (Ghazvininejad et al., 2019, 2020b; Kasai et al., 2020).³ For all autoregressive models, we apply beam search decoding with beam size 5 and length penalty 1.0. $S_1$ and $S_{\max}$ wall-clock time speedups (§3) for all models are evaluated on the same single Nvidia V100 GPU with 16GB memory, with CUDA 10.1, cuDNN 7.6.3, and PyTorch version 1.4.0 (Paszke et al., 2019). We apply half-precision inference and find that it speeds up $S_{\max}$ for non-autoregressive models by 30+%, but not $S_1$, in line with previous observations (Kim et al., 2019).
Hyperparameters We generally follow the hyperparameters of the base sized transformer (Vaswani et al., 2017): 8 attention heads, 512 model dimensions, and 2048 hidden dimensions for both the encoder and decoder. The dropout rate is tuned. We apply weight decay of 0.01 and label smoothing with $\varepsilon = 0.1$. We train with a batch size of approximately 65K tokens, using Adam (Kingma and Ba, 2015) with $\beta = (0.9, 0.98)$ and $\varepsilon = 10^{-6}$. The EN→FR model is trained for 500K updates, while the others are trained for 300K (Kasai et al., 2020). Dev. BLEU is measured at the end of each epoch, and we average the 5 best checkpoints to obtain the final model (Vaswani et al., 2017). We use mixed precision training (Micikevicius et al., 2018) and implement all models with fairseq. Further details are described in the appendix.
² Several works in the NAT literature only apply distillation to NAT models, which undermines comparability.
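The checkpoint averaging step mentioned in the hyperparameters can be illustrated with a small sketch. It assumes, purely for illustration, that each checkpoint is a dictionary mapping a parameter name to a flat list of floats and that each checkpoint comes paired with its dev BLEU; the function names are hypothetical and do not correspond to the fairseq tooling.

```python
def best_k_by_bleu(scored_checkpoints, k=5):
    """Keep the k checkpoints with the highest dev BLEU.

    `scored_checkpoints` is a list of (dev_bleu, checkpoint) pairs.
    """
    ranked = sorted(scored_checkpoints, key=lambda pair: pair[0], reverse=True)
    return [ckpt for _, ckpt in ranked[:k]]

def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts mapping name -> list of floats."""
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged
```

The final model would then be `average_checkpoints(best_k_by_bleu(scored, k=5))`, where `scored` holds one entry per epoch.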
5 Results And Discussion
We provide in-depth results comparing performance and speedup across autoregressive and non-autoregressive models. Fig. 2 shows translation speed-quality trade-off curves of CMLM, DisCo, and AT models on EN→DE and RO→EN test data. For each model we plot the results of configurations with varying encoder and decoder depths. For brevity, we denote by E-D a model with an E-layer encoder and a D-layer decoder. All speedups are measured with respect to the AT 6-6 baseline (§3).
5.1 Deep Encoder, Shallow Decoder
First, under the 6-6 configuration, the AT model outperforms both CMLM and DisCo by a considerable margin in BLEU, but it achieves the slowest $S_1$. Using a single-layer decoder, AT 6-1 gains a substantial $S_1$ speedup (2.6x for EN→DE and 2.9x for RO→EN), but this comes at a cost of [...]
Interestingly, all NAT models achieve slower $S_{\max}$ than the AT 6-6 baseline (DisCo 6-6: 0.3x; CMLM 6-6 T=10: 0.1x in RO→EN). This is consistent with our complexity analysis in §3.2, where we found that with the same layer allocation, iterative NAT models need more total computation than their AT counterparts. AT 12-1 still gains a considerable speedup over AT 6-6 (2.0x in EN→RO). These results suggest that current NAT models have little advantage when translating a large amount of text, and one should clarify this distinction when discussing translation latency. See the appendix for full results from all four directions.
Table 2 presents results from large bitext, EN↔ZH and EN→FR. We observe similar trends: AT deep-shallow achieves similar BLEU to AT 6-6 while reducing both $S_1$ and $S_{\max}$ latency substantially. For EN↔ZH, AT deep-shallow has a larger $S_1$ speedup than DisCo (2.7x vs. 2.5x in EN→ZH, 2.9x vs. 2.6x in ZH→EN). Particularly noteworthy is its performance in EN→FR: 42.04 BLEU, a 1.4 point improvement over the best NAT model. These results illustrate that the strategy of having a deep encoder and shallow decoder remains effective in large-scale bitext when the model has to learn potentially more complex distributions from more samples.
Lastly, Table 3 compares AT deep-shallow to recent iteration-based NAT results. All NAT models use the 6-6 configuration with the base size (Vaswani et al., 2017), except that Imputer (Saharia et al., 2020) uses 12 self-attention layers over the concatenated source and target. Overall, our AT deep-shallow models outperform all NAT models. The one exception is EN→RO, where Imputer achieves 34.4 points with 8 iterations compared to our 33.8 points. We note, however, that the latency overhead in each iteration of their model is strictly larger than that of CMLM or DisCo since every iteration involves a fresh run of 12-layer self attention over a concatenation of input and output sequences. As we saw in Fig. 2, AT deep-shallow yields $S_1$ comparable to CMLM 6-6 with 4 iterations, which would be more than twice as fast as Imputer with 8 iterations.
5.2 Constrained Views
In this section, we present two controlled experiments to compare NAT and autoregressive models more thoroughly.
$S_1$ Latency Constraint From §5.1 we see that compared to NAT models, AT deep-shallow yields a better translation speed-quality balance: despite being slightly slower in $S_1$ on some of the datasets, it achieves better BLEU across the board. To confirm this result, we further compare AT deep-shallow against two NAT models, controlling for $S_1$ latency. More specifically, we experiment with NAT models of varying encoder depths, and pair each with as many decoder layers as possible until it reaches AT 12-1's $S_1$ latency. Fig. 3 shows the results. For CMLM T=4, CMLM T=10, and DisCo, the best configurations of 12-layer encoders were paired with 12, 4, and 9 decoder layers, respectively. All NAT models improve performance as the encoder becomes deeper and surpass the scores of the 6-6 baselines (shown as squares along x = 6). Nonetheless, there is still a large performance drop from AT 12-1. This illustrates that the two NAT models are not able to match AT deep-shallow's accuracy under the same $S_1$ latency budget.
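The pairing procedure in this controlled experiment can be sketched as a simple search. The sketch is illustrative only: `latency_of` is a hypothetical function returning the measured latency of a NAT model with the given encoder and decoder depths, the cap of 12 decoder layers is an assumed search limit, and the early exit assumes that latency grows monotonically with decoder depth.

```python
def deepest_decoder_within_budget(encoder_layers, latency_of, budget,
                                  max_decoder_layers=12):
    """Largest decoder depth whose latency stays within `budget`, or None."""
    best = None
    for decoder_layers in range(1, max_decoder_layers + 1):
        if latency_of(encoder_layers, decoder_layers) <= budget:
            best = decoder_layers
        else:
            break  # assumes latency increases with decoder depth
    return best
```

Here `budget` would be the single-sentence latency of AT 12-1, and the search would be repeated for each encoder depth under consideration.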
Layer Constraint We can speed up autoregressive translation (AT) by developing a model with a deep encoder and a one-layer decoder. Here we thoroughly compare layer allocation strategies. Fig. 4 shows results of NAT and AT methods under the constraint of 12 transformer layers in total. NAT models perform well when the decoder and encoder are balanced, with a slight tendency toward deep encoders. On the other hand, the AT models perform consistently with 4 or more encoder layers. This confirms that using deep encoders and shallow decoders is more effective in AT models than in NAT ones. Note that the number of parameters in each layer allocation differs since a decoder layer contains 30% more parameters than an encoder layer, due to cross attention (§2.1).
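The parameter gap between the two layer types can be checked with a back-of-the-envelope count. The sketch below counts weight matrices only, which is a simplifying assumption: biases and layer normalization parameters are ignored, and attention is assumed to use four square projection matrices (query, key, value, output). The default dimensions are the base transformer sizes stated in §4.2.

```python
def attention_params(d_model):
    # Query, key, value, and output projections (weights only).
    return 4 * d_model * d_model

def ffn_params(d_model, d_hidden):
    # Two linear maps: d_model -> d_hidden -> d_model (weights only).
    return 2 * d_model * d_hidden

def encoder_layer_params(d_model=512, d_hidden=2048):
    return attention_params(d_model) + ffn_params(d_model, d_hidden)

def decoder_layer_params(d_model=512, d_hidden=2048):
    # Self attention + cross attention + feed-forward.
    return 2 * attention_params(d_model) + ffn_params(d_model, d_hidden)
```

Under this count an encoder layer has 3,145,728 weights and a decoder layer has 4,194,304, a ratio of 4/3, which is in line with the roughly 30% difference noted above.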
6 Further Analysis
Decoder Depth and Reordering Words From earlier results we see that NAT models need deeper decoders than AT models to perform well. We hypothesize that one reason is that NAT decoders need to learn to adjust to diverging word order between the source and the target: an AT decoder takes as input all preceding tokens and explicitly learns the conditional distribution, while a NAT decoder needs to learn target word ordering from scratch. To test this hypothesis, we conduct the following controlled experiment in EN→DE translation. We first run the fast_align tool (Dyer et al., 2013)⁴ on all bitext data (including the test set), and disable the NULL word feature to ensure that every English word is aligned to exactly one German word. We then shuffle the English words according to the order of their aligned German words. When multiple English words are aligned to the same German word, we keep the original English order. We apply the same BPE operations as the original data. Table 4 compares performance on the original and reordered data. AT gains the same improvement regardless of the layer configuration; in contrast, NAT 12-1 benefits more than NAT 6-6. This result supports our hypothesis that word reordering is one reason why NAT models need a deeper decoder. The overall improvements from reordering are consistent with Ran et al. (2019), who found that a NAT model benefits from reordering the source to match the target.
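The reordering step can be sketched in a few lines. The sketch assumes the alignment is given as one German position per English word (as guaranteed by disabling the NULL word feature); the function name and the input format are illustrative and are not the output format of fast_align. Python's `sorted` is stable, so English words aligned to the same German word keep their original relative order.

```python
def reorder_source(source_words, aligned_target_index):
    """Reorder source words by the position of their aligned target word."""
    order = sorted(range(len(source_words)),
                   key=lambda i: aligned_target_index[i])
    return [source_words[i] for i in order]

# Illustrative example: "I have read the book" aligned to
# "Ich habe das Buch gelesen" with read -> gelesen (position 4).
example = reorder_source(["I", "have", "read", "the", "book"], [0, 1, 4, 2, 3])
# example == ["I", "have", "the", "book", "read"]
```

After this step the source already follows the target word order, so any remaining gap between shallow and deep NAT decoders cannot be attributed to reordering.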
Effects Of Distillation
We applied sequence-level knowledge distillation (Kim and Rush, 2016) to all models. Here we analyze its effects over the WMT14 EN→DE evaluation data (Table 5). An autoregressive transformer large model (Vaswani et al., 2017) is used as the teacher model. All models benefit from knowledge distillation as indicated by positive ∆, including the AT models. Several recent works only compare NAT models trained with knowledge distillation to AT models trained without. Our finding shows that AT models with knowledge distillation can be an additional baseline for future NAT research. AT deep-shallow deteriorates much less on the raw data compared to the iterative NAT methods, suggesting that our strategy of speeding up autoregressive models is better suited to modeling raw, complex data than the NAT methods.
| Model | Raw | Dist. | ∆ |
|---|---|---|---|
| CMLM, T = 4 | 22.3 | 25.9 | 3.6 |
| CMLM, T = 10 | 24.6 | 27.0 | 2.4 |
| Imputer, T = 4 | 24.7 | 27.9 | 3.2 |
| Imputer, T = 8 | 25.0 | 27.9 | 2.9 |
| DisCo Enc-6 Dec-6 | 24.8 | 27.4 | 2.6 |
| AT Deep-Shallow (12-1) | 26.9 | 28.3 | 1.4 |
| AT Enc-6 Dec-6 | 27.4 | 28.3 | 0.9 |

Table 5: WMT14 EN→DE test results in BLEU that analyze the effects of distillation in fast translation methods. All distillation data are obtained from a transformer large. T denotes the number of iterations.
Speedup and Batch Size When decoding with large mini-batches, NAT models can be slower than their AT counterpart (§5.1). Here we further study this effect. Fig. 5 plots the relative speedups of different models' decoding with varying numbers of sentences per batch up to the hardware limit ("max," §3.1). The speedup by NAT models diminishes as the batch size grows: they have similar decoding latency to AT 6-6 with batch size 50, and become slower with larger batch sizes. In contrast, AT deep-shallow achieves consistent speedups over the AT 6-6 baseline.
Can we reduce the decoder further? We saw that an autoregressive model with a single-layer decoder and a sufficiently deep encoder can retain the accuracy of the baseline with 6 layers each. One may ask whether we can make the decoder even more compact. Our preliminary experiments showed that we can remove the feed-forward module from the decoder (Fig. 1) without hurting performance. This reduces the $S_1$ latency by 10%. We leave further exploration to future work.
7 Further Related Work
Non-autoregressive Translation In addition to the work already discussed, several other works proposed to iteratively refine (or insert) output predictions (Mansimov et al., 2019; Stern et al., 2019; Gu et al., 2019a; Chan et al., 2019a,b). Other approaches include adding a light autoregressive module to parallel decoding (Kaiser et al., 2018; Sun et al., 2019; Ran et al., 2019), partially decoding autoregressively (Stern et al., 2018, 2019), rescoring output candidates autoregressively (e.g., Gu et al., 2018), mimicking hidden states of an autoregressive teacher, training with different objectives than vanilla cross-entropy (Libovický and Helcl, 2018; Shao et al., 2020; Tu et al., 2020; Saharia et al., 2020; Ghazvininejad et al., 2020a), reordering input sentences (Ran et al., 2019), training on additional data from an autoregressive model (Zhou and Keung, 2020), and modeling with latent variables (Ma et al., 2019; Shu et al., 2020). The approach of adding a light autoregressive module is closest to our method, but note that we pack all non-autoregressive computation into the encoder.
Optimizing Autoregressive Transformer Prior work has suggested ways to optimize autoregressive transformers for fast inference. For example, Kim et al. (2019) employed layer tying (Dabre and Fujita, 2019; Dehghani et al., 2019) on the transformer decoder and found that it sped up inference on CPUs, but not on a GPU. Shi and Knight (2017) proposed a vocabulary reduction method to speed up the last softmax computation. Dynamic programming in an average attention network has also been used to accelerate inference. Press and Smith (2018) proposed an eager translation method to avoid attention computation. Reformer (Kitaev et al., 2020) reduced the quadratic complexity of attention computation by locality-sensitive hashing. Some of these methods can be used orthogonally to further facilitate fast inference in a transformer with a deep encoder and shallow decoder.
Rich Encoding, Light Decoding Our experiments suggest that rich features from a deep encoder avoid the need for multiple layers of decoding in machine translation. Wang et al. (2019a) showed that using more encoder transformer layers while keeping 6 decoder layers improves translation quality. Barone et al. (2017) found that RNN-based models with a deep encoder and a shallow decoder can reduce training time with a small performance drop. We took an extreme configuration of a single-layer transformer decoder and focused on inference latency, but all of these results corroborate the benefit of deep encoders. Beyond machine translation, a surprisingly light decoder (e.g., multilayer perceptrons) with a powerful encoder (e.g., bidirectional LSTMs) has proven successful in structured prediction, such as syntactic and semantic parsing (Kiperwasser and Goldberg, 2016; Manning, 2017, 2018; Kasai et al., 2018). Generating a target translation is perhaps a more complex task than producing a parse tree, but our results provide further support for the claim that useful distributed representations of natural language can be obtained in a conditionally independent manner.
8 Conclusion And Future Work
We presented extensive empirical studies to demonstrate that autoregressive translation can be dramatically sped up by a simple layer allocation strategy: deep encoder, shallow decoder. Compared to strong non-autoregressive models, deep-shallow autoregressive models achieve substantial improvement in translation quality with comparable latency. Our results suggest that layer allocation is an important factor that future work on fast machine translation, particularly non-autoregressive machine translation, should take into consideration. More generally, our work suggests that a better layer allocation between the encoder and decoder might be able to accelerate inference in any sequence-to-sequence task. In particular, a model with a deep encoder and a shallow decoder can be used for large-scale pretraining for sequence generation such as BART, where latency reduction will be key in a wide range of real-world applications.
Table 8: BLEU and speed comparisons with varying number of encoder (E) and decoder (D) layers.
¹ Layer normalization (Ba et al., 2016) is applied after attention and feed forward. We suppress this for brevity.
³ https://github.com/facebookresearch/Mask-Predict
⁴ https://github.com/clab/fast_align