A Mixture of h − 1 Heads is Better than h Heads


Abstract

Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead “reallocate” them—the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts model (MAE). MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. Particularly, on the WMT14 English to German translation dataset, MAE improves over “transformer-base” by 0.8 BLEU, with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.

1 Introduction

The transformer architecture and its variants achieve state-of-the-art performance across a variety of NLP tasks, including machine translation (Vaswani et al., 2017), language modeling (Radford et al., 2018; Baevski and Auli, 2019), semantic role labeling (Strubell et al., 2018), and more (Devlin et al., 2019; Liu et al., 2019b; Yang et al., 2019b). Under the hood, multi-head attention provides the driving force: multiple separately parameterized attention functions act in parallel to contextualize the input representations; their outputs are then gathered by an affine transformation, and fed to onward computation.

Figure 1: Illustration of MAE: a mixture of attentive experts. Each H_i box is an attention head in a given layer; there are h of them in total. Experts are groups of h − 1 attention heads. MAE learns an input-dependent distribution over the experts (g). At each training step, a single expert is selected and updated (solid line); during evaluation, the experts' outputs are linearly combined with weights produced by g.

Recent efforts by Voita et al. (2019) and Michel et al. (2019) suggest that typical transformer networks are overparameterized, in the sense that at test time, many of the heads, or even a full layer (Fan et al., 2020) , can be removed without significant loss in performance. 2 In response to this observation, they propose to prune the unimportant attention heads in the model after it is trained, aiming for faster inference.

In this paper, we ask whether, instead of reducing the model capacity, we can use it more effectively. We propose mixture of attentive experts (MAE). MAE retains all attention heads, and learns to activate different heads on different inputs (see the illustration in Figure 1). We start by showing that multi-head attention can be seen as a uniform, input-agnostic mixture of experts (Jacobs et al., 1991), by grouping a subset of attention heads as an expert (§2.2). We then introduce MAE, which, instead of uniformly weighting the experts, complements the experts with a learned, input-dependent function that assigns their responsibilities (§2.3). To train MAE, we propose a two-step algorithm based on block coordinate descent (§3), which alternates between updating the experts' responsibilities and their parameters.

We evaluate MAE on machine translation and language modeling (§4). Our approach outperforms strong baselines on both; on the WMT14 English to German MT dataset, MAE outperforms transformer-base (Vaswani et al., 2017) by 0.8 BLEU with a negligible increase in the number of parameters. Our analysis shows that MAE learns to encourage different experts to specialize on different inputs (§5).

2 MAE: Mixture Of Attentive Experts

This section describes MAE in detail. It is inspired by a mixture-of-experts view of multi-head attention, which we present in §2.2. Specifically, we show that multi-head attention can be viewed as a mixture of uniformly weighted experts, each consisting of a subset of attention heads. Based on this observation, we propose MAE, which learns to weight the experts ( §2.3) depending on the input. We begin by laying out notation and necessary background in §2.1.

2.1 Background: Mixture Of Experts

Mixture of experts is a well-established technique for ensemble learning (Jacobs et al., 1991). It jointly trains a set of expert models $\{f_i\}_{i=1}^{k}$ that are intended to specialize across different input cases. The outputs produced by the experts are aggregated by a linear combination, with a "gating function" $g = [g_1, \ldots, g_k]$ determining the importance of each expert in the final decision:

$$f(X) = \sum_{i=1}^{k} g_i \cdot f_i(X). \qquad (1)$$

The gating function can be parameterized by, e.g., a neural network. We will also refer to g as the responsibilities or weights of the experts.

2.2 Multi-Head Attention: A Mixture-Of-Experts Perspective

Multi-head attention is the key building block for the state-of-the-art transformer architectures (Vaswani et al., 2017). At its core are multiple separately parameterized attention heads. An attention head takes as input an n-by-d matrix X, with each row being the vector representation of an input element. It contextualizes the input using a dot-product attention mechanism:

$$H_i = \operatorname{softmax}\!\left(X Q_i K_i^\top X^\top\right) X V_i, \qquad (2)$$

where $Q_i$, $K_i$, and $V_i$ are learned matrices, 3 and the softmax normalizes row-wise. The outputs of the attention heads are then concatenated and fed through a learned affine transformation:

$$Z = \mathrm{MultiHead}(X) = [H_1; \ldots; H_h]\, W, \qquad (3)$$

where W is a learned matrix, and h denotes the number of attention heads. We now present a different but equivalent way to compute Eq. 3, aiming for a smoother transition into the following sections. Let

$$Z = \sum_{i=1}^{h} H_i W_i = \sum_{i=1}^{h} \tilde{H}_i, \qquad (4)$$

where $W_i$ is the block of rows of $W$ (Eq. 3) that multiplies $H_i$, so that Eq. 4 equals Eq. 3.

Eq. 4 provides a different view of the output computation of multi-head attention: each attention head first projects its contextualized representation with a learned matrix (i.e., $\tilde{H}_i = H_i W_i$); their outputs are then gathered with a sum (Eq. 4). We now show that this can be seen as a uniformly weighted mixture of experts.

A mixture-of-experts perspective. Let us take a closer look at Eq. 4 and rewrite it:

$$Z = \frac{1}{h-1}\sum_{i=1}^{h}(h-1)\,\tilde{H}_i = \frac{1}{h-1}\left(-\sum_{i=1}^{h}\tilde{H}_i + \sum_{i=1}^{h}\sum_{j=1}^{h}\tilde{H}_j\right) = \sum_{i=1}^{h}\;\underbrace{\frac{1}{h}}_{\text{gate } g_i}\;\cdot\;\underbrace{\frac{h}{h-1}\left(-\tilde{H}_i + \sum_{j=1}^{h}\tilde{H}_j\right)}_{\text{expert } f_i(X;\,\theta_i)}. \qquad (5)$$

Eq. 5 interprets multi-head attention as a mixture of $\binom{h}{h-1} = h$ experts. It first constructs a set of $h$ experts $\{f_i(\cdot;\theta_i)\}$, with $\theta_i$ denoting $f_i$'s parameters. $f_i(\cdot;\theta_i)$ is a parameterized function of the input, which calculates the sum of the outputs of all but the $i$th attention head. This is achieved by subtracting $\tilde{H}_i$ from $\sum_{j=1}^{h}\tilde{H}_j$, then scaling up the result by $h/(h-1)$. The experts share part of their parameters: any two of them share $h-2$ attention heads. A uniform responsibility of $1/h$ is used.
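The identity in Eq. 5 is easy to check numerically. The following is a minimal NumPy sketch (not from the paper; shapes and random values are arbitrary) verifying that the plain sum over projected head outputs (Eq. 4) equals the uniform mixture of h leave-one-out experts (Eq. 5).

```python
import numpy as np

rng = np.random.default_rng(0)
h, n, d = 8, 5, 64                       # heads, sequence length, model dimension

# H_tilde[i] stands in for the projected head output H_i W_i from Eq. 4.
H_tilde = rng.normal(size=(h, n, d))

# Eq. 4: the standard multi-head attention output, a plain sum over heads.
Z_multihead = H_tilde.sum(axis=0)

# Eq. 5: a uniform mixture of h experts; expert i uses all heads except head i.
experts = [(h / (h - 1)) * (H_tilde.sum(axis=0) - H_tilde[i]) for i in range(h)]
Z_mixture = sum((1.0 / h) * f_i for f_i in experts)

assert np.allclose(Z_multihead, Z_mixture)  # the two computations coincide
```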

Discussion. Viewing multi-head attention through this MoE lens suggests some interesting consequences. One can replace the input-agnostic responsibility in Eq. 5 with a function over the input. Indeed, we have good reasons for doing so. Voita et al. (2019) and Michel et al. (2019) show that for transformer networks, a handful of important attention heads are sufficient to achieve good test-time performance. They propose to prune the rest using an input-agnostic procedure. Instead of doing so, here we see a potential alternative: keep all the heads, but only activate those that are important to the input. This motivates MAE, which we now introduce.

2.3 MAE: Learning To Weight Experts

MAE is inspired by the connections between MoE and multi-head attention we draw in §2.2. On top of multi-head attention, MAE learns an input-dependent, parameterized gating function $g(\cdot;\phi)$ to complement the experts. More formally, the uniform responsibility $1/h$ in Eq. 5 is replaced by a learned, input-dependent weighting:

$$Z = \sum_{i=1}^{h} g_i(X;\phi)\cdot f_i(X;\theta_i). \qquad (6)$$

Experts $f_i$ are the same as those in Eq. 5. $g(\cdot;\phi)$ is parameterized with a multi-layer perceptron (MLP) followed by a softmax. It first averages X along the rows (i.e., the sequence direction), and then feeds the result through a two-layer tanh-MLP. $g(\cdot;\phi)$ outputs a normalized h-dimensional vector using a softmax, indicating the responsibilities of the experts. It can be seen as a learned probability distribution over the experts.
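As a concrete illustration, the description above maps onto a small module like the following PyTorch sketch; the class name `ExpertGate` and the hidden size are our own illustrative choices, not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class ExpertGate(nn.Module):
    """Input-dependent gate g(X; phi): a distribution over h experts."""
    def __init__(self, d_model: int, n_experts: int, d_hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.Tanh(),
            nn.Linear(d_hidden, n_experts),
        )

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch, seq_len, d_model); average along the sequence direction.
        pooled = X.mean(dim=1)
        # The softmax produces the experts' responsibilities (rows sum to 1).
        return torch.softmax(self.mlp(pooled), dim=-1)

gate = ExpertGate(d_model=512, n_experts=8)
weights = gate(torch.randn(2, 10, 512))   # shape: (2, 8)
```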

MAE can learn to assign more responsibility to the experts that are more important for a given input, allowing them to contribute more. MAE is applicable wherever multi-head attention is used. For example, in our machine translation experiments (§4.2), we replace all the multi-head attention in a transformer network with MAE, including the self-attention in all encoder and decoder layers, as well as the decoder's attention over the encoded source. Each of them is separately treated as a mixture of experts, and has its own gating function. The additional parameter overhead is small: gating functions account for only 3-5% of the parameters of the full model (Appendix A).

3 Training MAE With Block Coordinate Descent

It is straightforward to jointly train the experts and the gating functions in an MAE model using backpropagation. However, in line with previous observations (Shen et al., 2019), we empirically observe that this is prone to degenerate solutions where the gating functions tend to weight the experts similarly (see §5.1). 4 As a remedy, we propose block coordinate descent (BCD) training. At a high level, training is decomposed into two interleaving steps: a G step updates the gating function $g(\cdot;\phi)$, fixing the experts; an F step fixes the gating function and updates one randomly selected expert $f_i(\cdot;\theta_i)$. 5 The computations in G and F steps differ:

• In a G step, MAE outputs a linear combination of the experts' outputs, and only updates the gating function's parameters (Algorithm 1). No expert is updated.

• In an F step, MAE computes the experts' responsibilities g(X), according to which an expert i is sampled (Algorithm 2). MAE computes the output with $f_i$, which is then updated, without updating the gating function or the other experts. 6

4 Besides the undesired degeneracy, we also find that the model suffers worse overfitting when θ and φ are jointly updated (Appendix B). One possible reason is that, compared to standard multi-head attention, the learned gates give the model additional capacity to compensate for one expert's errors with the others' outputs at training time, hurting generalization (Jacobs et al., 1991). Another common degeneracy of MoEs is "rich get richer," where one of the experts is always picked and the others are ignored. As observed by Voita et al. (2019), this can happen when the experts are trained to be sparsely weighted. When tuning the hyperparameters, we observe the "rich get richer" degeneracy if the learning rate is set too large.

Algorithm 1 A G step update for MAE, with step size η.

1: procedure MAEG(X)
2:     Z ← Σ_{i=1}^{h} g_i(X; φ) · f_i(X; θ_i)
3:     Forwardprop with Z and calculate L.
4:     Calculate ∇_φ L with backprop.
5:     φ ← φ − η · ∇_φ L
6: end procedure

Algorithm 2 An F step update for MAE, with step size η.

1: procedure MAEF(X)
2:     Draw i ∼ Cat(g(X; φ))
3:     Z ← f_i(X; θ_i)
4:     Forwardprop with Z and calculate L.
5:     Calculate ∇_{θ_i} L with backprop.
6:     θ_i ← θ_i − η · ∇_{θ_i} L
7: end procedure

The non-differentiable sampling from g in an F step does not create difficulties for backpropagation, since an F step never calculates the gradients w.r.t. φ. At test time, the computation is the same as that in a G step, i.e., MAE outputs a linear combination of the experts' outputs, weighted by g.
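To make the two update types concrete, here is a schematic PyTorch sketch of Algorithms 1 and 2. It assumes the mixture is exposed as a list of expert modules `experts`, a gate module `gate`, a task loss `loss_fn`, and per-group optimizers; these names are our own. In the actual model these computations happen inside every attention layer, and (per footnote 6) experts are sampled per instance rather than per mini-batch, as done here for simplicity.

```python
import torch

def g_step(X, y, experts, gate, loss_fn, gate_opt):
    """Algorithm 1: combine all experts, update only the gate parameters phi."""
    weights = gate(X)                                        # (batch, h)
    # Expert outputs are treated as constants: only phi receives gradients.
    outputs = torch.stack([f(X) for f in experts], dim=1).detach()
    Z = (weights[:, :, None, None] * outputs).sum(dim=1)     # linear combination
    loss = loss_fn(Z, y)
    gate_opt.zero_grad()
    loss.backward()
    gate_opt.step()

def f_step(X, y, experts, gate, loss_fn, expert_opts):
    """Algorithm 2: sample one expert from g, update only that expert's theta_i."""
    with torch.no_grad():                                    # sampling is not differentiated
        probs = gate(X).mean(dim=0)
        i = torch.distributions.Categorical(probs).sample().item()
    Z = experts[i](X)                                        # forward with the sampled expert only
    loss = loss_fn(Z, y)
    expert_opts[i].zero_grad()
    loss.backward()
    expert_opts[i].step()
```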

Training time overhead. A straightforward training procedure would be, for each training instance, to first take a G step and then an F step. This doubles the forward propagation overhead. In practice, it is not necessary to take G steps as frequently as F steps, since they only update a small portion of the model. In our experiments, we take G steps one fifth as frequently as F steps: we make G updates every 5 epochs, while always taking F steps. In preliminary experiments, we find this reduces the training time overhead without a significant impact on performance. 7 Algorithm 3 summarizes the block coordinate descent training in a given epoch.

Algorithm 3 Block coordinate descent (BCD) training for MAE, at epoch e. D denotes the training data. 8

1: procedure BCD(D = {X_i}_i, e)
2:     for X_i ∈ D do
3:         ▷ Take G steps every 5 epochs.
4:         if e mod 5 = 0 then
5:             MAEG(X_i)
6:         end if
7:         ▷ Always do F step updates.
8:         MAEF(X_i)
9:     end for
10: end procedure

Connections to dropout. In the above block coordinate descent training algorithm, an F step samples an expert to update, and ignores the rest in both the forward and backward computation. This is reminiscent of dropout (Srivastava et al., 2014). Specifically, selecting expert f_i is equivalent to dropping head i. 9 In other words, the F steps (Algorithm 2) can be seen as a structured dropout applied to the attention heads, but with learned, input-dependent drop probabilities. When g is a constant vector with elements 1/h, it recovers head dropout, which is also explored by concurrent work (Fan et al., 2020).

So far, we view MAE as a mixture of h experts, each consisting of h − 1 attention heads. One can, of course, generalize this to other settings, e.g., mixing $\binom{h}{h-2}$ experts, each containing h − 2 heads. From the dropout view, this translates to dropping more attention heads: dropping t heads out of h is equivalent to applying dropout with drop probability t/h, in the sense that the expected numbers of dropped units are the same.
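The correspondence can be made explicit with a head mask: selecting an expert that keeps h − t of the h heads is the same as zeroing out t head outputs and rescaling by h/(h − t), as in inverted dropout with drop probability t/h. A small NumPy illustration follows (shapes arbitrary; this is our own restatement, not code from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)
h, n, d, t = 8, 5, 64, 1                  # drop t heads out of h (t = 1 recovers Eq. 5)

H_tilde = rng.normal(size=(h, n, d))      # per-head projected outputs, as in Eq. 4

# Selecting an expert with h - t heads == masking t heads, then rescaling.
dropped = rng.choice(h, size=t, replace=False)
mask = np.ones(h)
mask[dropped] = 0.0
expert_out = (h / (h - t)) * (mask[:, None, None] * H_tilde).sum(axis=0)

# Over the random choice of dropped heads, E[expert_out] equals the full sum
# in Eq. 4, which is why the expected number of dropped units matches a
# dropout rate of t / h.
```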

Despite the similarity between MAE and dropout, a key difference exists between the two: with the latter, the constant dropout probability is set a priori, while MAE uses a gating function $g(\cdot;\phi)$ to calculate a learned, input-dependent dropout probability.

4 Experiments

We empirically evaluate MAE on machine translation ( §4.2) and language modeling ( §4.3) benchmarks. We first introduce the compared models ( §4.1).

4.1 Compared Models

MAE is evaluated under two settings:

• MAE-7 mixes 8 experts each with 7 attention heads.

• MAE-6 is similar to MAE-7, but mixes $\binom{8}{2} = 28$ experts, each with 6 attention heads. 10

We compare MAE to the following baselines.

• BASE is a sequence-to-sequence model based on the transformer architecture.

• NOBCD is the same model as MAE, but does not use block coordinate descent training. Instead, it jointly updates all experts and the gating function at training time, as discussed at the start of §3.

• UNI-MAE-7 is similar to MAE but does not have parameterized gating functions. It builds on BASE, and mixes 8 experts, each with 7 attention heads. Constant uniform responsibilities are assigned to the experts. At each training step, it updates one uniformly sampled expert; at test time, the outputs of all experts are averaged according to Eq. 5.

• UNI-MAE-6 mixes 28 6-attention-head experts, and is otherwise the same as UNI-MAE-7.

We refer the readers to Appendix A for implementation details.

4.2 Machine Translation

Datasets. We experiment with two machine translation datasets:

• WMT14 EN-DE (Bojar et al., 2014). 11 Following previous practice (Vaswani et al., 2017), we train on WMT14, and designate newstest2013 and newstest2014 as the development and test data, respectively. Our preprocessing follows that of Vaswani et al. (2017). A shared source-target vocabulary is used, with 32K byte pair encoding types (BPE; Sennrich et al., 2016).

• IWSLT14 DE-EN (Cettolo et al., 2014). 12 It is based on TED talks, and is much smaller than WMT14. We use the preprocessing from prior work. Following previous practice, we use separate vocabularies for the source and target, with around 9K and 7K BPE types, respectively.

Table 1 summarizes some statistics of the datasets.

10 Preliminary results show that mixing experts with fewer heads leads to underwhelming performance. We conjecture this is due to too strong a regularization effect (§3).

Table 1: Some statistics for WMT14 and IWSLT14 datasets. We use separate source and target vocabularies in IWSLT14 experiments.

11 https://drive.google.com/a/haopeng.name/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8

12 http://workshop2014.iwslt.org/

Evaluation. The models are evaluated using BLEU (Papineni et al., 2002). A beam search with beam size 5 is used. In the WMT14 experiments, we follow Vaswani et al. (2017) and apply a compound-split postprocessing. 13

Results. Table 2 summarizes WMT14 EN-DE translation test performance. The base and large sized transformer models are due to Vaswani et al. (2017). To control for confounding factors, we additionally compare to our own implementation of the base-sized model (BASE). It achieves slightly better performance than Vaswani et al. (2017), with a 0.3 BLEU edge. MAE-7 improves over the base transformer by 0.8 BLEU, obtaining performance similar to the large-sized transformer of Vaswani et al. (2017) using less than a third as many parameters. Since we do not see a similar improvement from UNI-MAE-7, we attribute this gain to input-dependent expert weighting. Having a smaller number of heads per expert, MAE-6 slightly underperforms MAE-7, and so does UNI-MAE-6 in comparison to UNI-MAE-7. Finally, NOBCD performs worse than the transformer baseline, demonstrating the importance of the block coordinate descent training.

Table 2: WMT14 EN-DE translation test performance on newstest2014. † indicates models that randomly select an expert to update for each training instance; ‡ indicates models that learn a gating function to weight the experts. The transformer performance in the first two rows is due to Vaswani et al. (2017).


We observe similar trends on the IWSLT14 DE-EN dataset, summarized in Table 3 . The BASE model here is similar to the base-sized transformer in the WMT14 experiment, but with a smaller hidden dimension. MAE-7 outperforms BASE by 0.9 BLEU. Interestingly, UNI-MAE-7 improves over BASE by 0.3 BLEU, possibly because the regularization effect of random expert selection training helps more on this smaller dataset. 14

Table 3: IWSLT14 DE-EN test set performance. See the Table 2 caption for the meaning of the superscripts.

4.3 Token-Level Language Modeling

Dataset. We experiment with the WikiText-103 dataset (Merity et al., 2016). It contains articles from English Wikipedia.

13 https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh

14 Selecting an expert can be seen as dropping one attention head during training (§3).

Setting. Here the BASE model is the strong language model of Baevski and Auli (2019). It is based on a 16-layer transformer network; each multi-head attention layer has 8 heads. It uses different embedding dimensions for tokens, based on their frequencies. We closely follow Baevski and Auli (2019) in terms of hyperparameters and training procedures. Readers are referred to their paper and Appendix A for further architecture and hyperparameter details.

Table 4: Language modeling performance on the WikiText-103 test set (lower is better). *Trains/evaluates with 3,072/2,048 context sizes and is therefore not directly comparable to the other models, which use 512/480-sized contexts. See the Table 2 caption for the meanings of the other superscripts. Bold font indicates the best performance using the smaller context sizes. The first two rows are due to Table 5 of Baevski and Auli (2019).

Notes on context size. Baevski and Auli (2019) study the effect of the context window, i.e., the number of history tokens the model attends over. They find that using larger context sizes leads to better performance (Baevski and Auli, 2019, Table 5).

Table 5: Performance decrease for different models on the WMT14 development set when only one expert is used for each multi-head attention layer (§5.1).

Their best setting uses a 3,072 training context size, and 2,048 at test time (i.e., the model has access to 2,048 tokens before predicting any token at test time). However, we are not able to train MAE, nor replicate their results, under this setting: our GPUs have far less memory, and it is impossible to even load a 3,072-token context chunk. 15 Therefore we train and evaluate MAE and UNI-MAE-7 with smaller 512/480 context sizes, also explored by Baevski and Auli (2019), which allows for a head-to-head comparison.

Results. Table 4 shows the perplexity on WikiText-103 test data. When trained under the same setting, MAE outperforms Baevski and Auli (2019) by more than 0.3 perplexity. Interestingly, despite the much smaller context at both training and test time, MAE matches the best setting by Baevski and Auli (2019) . UNI-MAE-7 and NOBCD underperform the baseline (higher perplexity).

5 Analysis

This section first empirically confirms that MAE learns to activate different experts on different inputs in §5.1. We then run a synthetic experiment to explore MAE's potential in transfer learning ( §5.2).

5.1 Does MAE Learn To Specialize The Experts?

One of the appealing properties of MoE models is that they can learn to activate different experts, depending on what "expertise" is needed for the input. Does MAE learn to do so? We empirically study this question, and present evidence indicating that it does, at least in part. We consider the encoders of the UNI-MAE-7, NOBCD, and MAE-7 models trained on WMT14. 16 We first study whether BCD training helps drift MAE away from uniformly weighting the experts regardless of the input. We treat the gating values as probabilities, and calculate their entropies:

$$H(g) = -\sum_{i=1}^{h} g_i \cdot \log g_i,$$ which

are then averaged across different layers. The average entropy on the development set for MAE-7 is 1.91, lower than the 2.02 of the NOBCD model trained without BCD. In comparison, UNI-MAE-7 uniformly weights the experts and has an entropy of 2.08. This indicates that the gating weights of MAE trained with BCD are more "focused" on one or a subset of the experts than those of the model trained without it.

Second, we study whether MAE learns to specialize different experts for different inputs. To do so we attribute the development instances to the experts that maximize the gating weights. For the first encoder layer of MAE-7, the percentages of instances attributed to each of the 8 experts are relatively balanced: 13%, 14%, 9%, 16%, 10%, 15%, 10%, 12%. 17 This suggests that all experts are assigned a substantial part of the input, and it is not the case that BCD leads to a "rich get richer" outcome.
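For reference, the two quantities used in this analysis, the average gate entropy and the per-expert attribution percentages, can be computed from the gate outputs as in the following sketch (our own code, assuming the responsibilities for a layer have been collected into an array).

```python
import numpy as np

def gate_statistics(gate_weights):
    """gate_weights: (num_instances, h) responsibilities g(X) on the dev set."""
    # Average entropy of the gate distributions (lower = more focused).
    entropy = -(gate_weights * np.log(gate_weights)).sum(axis=-1).mean()
    # Attribute each instance to the expert with the largest responsibility.
    counts = np.bincount(gate_weights.argmax(axis=-1),
                         minlength=gate_weights.shape[1])
    return entropy, counts / counts.sum()

# A uniform gate over 8 experts gives entropy log(8) ~= 2.08, matching UNI-MAE-7.
```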

We then continue and explore whether MAE performs reasonably well when using only the most "specialized" experts. For each development instance, we select those experts maximizing the gating weights and ignore the rest, instead of linearly combining them as in Eq. 6. We see from Table 5 a 0.3 BLEU decrease under this setting. In comparison, NOBCD has a larger performance decrease of 0.7 BLEU. NOBCD's performance drop is similar to that of UNI-MAE-7, for which we randomly select an expert at each layer and average the performance over 5 runs. These results support the proposition that MAE specializes better when trained with BCD. Finally, we search for the tokens that are more likely to activate each expert. We compute the pointwise mutual information (PMI; Church and Hanks, 1990 ) between tokens and experts:

$$\mathrm{PMI}(\mathrm{token}_i, \mathrm{expert}_j) = \log\frac{p(\mathrm{token}_i, \mathrm{expert}_j)}{p(\mathrm{token}_i)\,p(\mathrm{expert}_j)}.$$

Table 6 lists the most indicative tokens of each expert for the first layer. While some of the terms for some experts seem loosely related (e.g., bell, reuters, and computing for expert 2), it is hard to find clear patterns in most of them.
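A minimal sketch of this computation, assuming a co-occurrence count matrix over (token, argmax expert) pairs collected from the development set (our own illustration):

```python
import numpy as np

def pmi_table(counts):
    """counts[i, j]: times token i appears in an instance attributed to expert j."""
    p_joint = counts / counts.sum()
    p_token = p_joint.sum(axis=1, keepdims=True)
    p_expert = p_joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(p_joint / (p_token * p_expert))

# The most indicative tokens for expert j are the rows with the largest PMI[:, j].
```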

Table 6: Indicative tokens for each expert (§5.1). Tokens attributed to Expert 2 are mostly computer science terminology; trends for other experts are less clear.

5.2 MAE's Potential In Transfer Learning: A Case Study

We now turn to evaluate another property of MAE: its potential for data-efficient transfer learning, by only updating the gating functions, freezing the experts. We consider the pretrain-then-finetune setting. Due to computation limits, we are unable to explore MAE for pre-training contextual representations (Peters et al., 2018; Devlin et al., 2019) . Rather, we focus on the following small-scale machine translation experiments.

Setting. We explore finetuning on IWSLT14 a model pretrained on the much larger WMT14 dataset. 18 We compare three finetuning methods:

• FTG finetunes the gating functions' parameters (i.e., φ), keeping the rest frozen.

• FTG+ updates the parameter matrix W in Eq. 4 in addition to φ. The rest of the model parameters are fixed.

• FTALL updates all parameters.

As baselines, NOFT is the out-of-box pretrained model without any finetuning, and SCRATCH trains an MAE model from scratch. Table 7 summarizes the IWSLT14 EN-DE development set performance. Surprisingly, NOFT already outperforms SCRATCH without any finetuning. We attribute this improvement to the larger pretraining (WMT14) data. Only updating the gating functions, FTG improves over NOFT by 0.8 BLEU. Yet there is still a significant gap of 1.8 BLEU between FTG and FTALL. Interestingly, FTG+ almost matches the performance of FTALL, but only updates 1/9 as many parameters. Both FTG and FTG+ reach their best performance after around 1K gradient updates, i.e., one epoch, significantly fewer than FTALL or SCRATCH.
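In implementation terms, FTG+ amounts to freezing everything except the gating functions and the attention output projections. The sketch below shows one way to set this up in PyTorch; the parameter-name patterns and learning rate are illustrative assumptions, not fairseq's actual configuration.

```python
import torch

def configure_ftg_plus(model, lr=0.1):
    """Freeze all parameters except the gates (phi) and the attention output
    projections (W in Eq. 4); return an optimizer over that trainable subset."""
    trainable = []
    for name, param in model.named_parameters():
        if "gate" in name or "out_proj" in name:   # illustrative name patterns
            param.requires_grad = True
            trainable.append(param)
        else:
            param.requires_grad = False
    return torch.optim.SGD(trainable, lr=lr)

# FTG would keep only the "gate" parameters; FTALL leaves everything trainable.
```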

Table 7: IWSLT14 development set performance of different finetuning methods (§5.2). The last two columns indicate the number of parameters to update, and the number of gradient steps needed to achieve the best development performance.

We further compare FTG+ and FTALL where less downstream training data is available. To simulate this, we randomly sample [5%, 10%, 25%, 50%, 75%] subsets of the IWSLT14 training data, on which the pretrained model is finetuned. Figure 2 plots their performance. We see a clear trend: as less training data is available, the gap between FTG+ and FTALL decreases; when less than 20% of the training data is available, FTG+ outperforms FTALL. These results suggest that finetuning MAE with FTG+ can be viable in low-resource transfer learning.

18 Here we reverse the translation direction of IWSLT14: §4.2 experimented with DE-EN, while here we use EN-DE.

Figure 2: IWSLT14 development performance of FTG+ and FTALL using different amounts of training data (§5.2). When trained on less than a 20% subset of the original training data, FTG+ outperforms FTALL.

6 Related Work

Multi-head attention. An increasing amount of effort has been devoted to developing better attention mechanisms (Malaviya et al., 2018; Deng et al., 2018; Sukhbaatar et al., 2019; Correia et al., 2019; Maruf et al., 2019, inter alia), and to improving transformer architectures (Shaw et al., 2018; Dehghani et al., 2019; Hao et al., 2019; Correia et al., 2019; Yang et al., 2019a, inter alia). Closely related, Iida et al. (2019) apply another attention mechanism over the attention heads, allowing a learned reweighting of them. Our work focuses on the connection between multi-head attention and MoE, and the BCD training it suggests and benefits from. Concurrent to our work, Fan et al. (2020) study structurally pruning transformer layers for more efficient inference. Another line of work aims to better understand the workings of transformer models (Clark et al., 2019; Liu et al., 2019a; Tenney et al., 2019, inter alia).

Mixture of experts. One of the most successful applications of MoE is ensemble learning (Caruana et al., 2004; Liu et al., 2018; Dutt et al., 2017, inter alia). Recent efforts also explore MoE in sequence learning, and to promote diversity in text generation (He et al., 2018; Shen et al., 2019; Cho et al., 2019, inter alia).

7 Conclusion

We presented MAE. It is inspired by a mixture-of-experts perspective of multi-head attention. With a learned gating function, MAE activates different experts on different inputs. MAE is trained using a block coordinate descent algorithm, which alternates between updating the responsibilities of the experts and their parameters. Our experiments show that MAE outperforms the transformer baselines on machine translation and language modeling benchmarks. The analysis shows that MAE learns to activate different experts. The code is publicly available at https://github.com/Noahs-ARK/MAE.

A Implementation Details

No weight decay is used. φ is updated using SGD with a fixed learning rate of 1, separate from the optimizer for the rest of the model. This aims to avoid using momentum-based optimization algorithms (e.g., Adam) for the gating functions, which we empirically find helps alleviate the "rich get richer" degeneracy. 23 In the language modeling experiments, the most recent 100 input vectors are averaged and then fed into the gating functions, while in the machine translation experiments we average all the input vectors as the inputs to g(·; φ).

B Learning Curve Comparison For MAE And NOBCD

In §3 (footnote 4) we discuss an overfitting issue caused by jointly updating the experts and the gating function. This section empirically studies it. We compare the learning curves of BASE, NOBCD, and MAE-7 trained on the IWSLT14 dataset, plotted in Figure 3. The models are described in §4.1. We tune dropout and ℓ2 regularization based on development performance. Other hyperparameters are the same for the compared models. The training loss for NOBCD decreases much faster than that of BASE; however, on the development set, it never outperforms BASE, and the development loss starts increasing after epoch 40. MAE-7 finds a nice middle ground in terms of training loss. It outperforms both BASE and NOBCD on the validation set. This provides further evidence for the importance of BCD training.

C Additional Results for §5.1

§5.1 describes an experiment with the MAE-7 model where we attribute the development instances of WMT14 to the experts maximizing the gating weights. Table 8 presents more results. The number of instances each expert receives is relatively balanced, and the trend is consistent across different layers.

23 It is not entirely clear to us why using momentum-based optimization algorithms to learn the gating functions leads to degenerate solutions more often. One possible reason is that the accumulated momentum steers the gating functions to keep selecting the experts they pick at the early stage of training.

Table 8: The percentage of WMT14 development instances attributed to each of the experts in MAE-7's encoder layers (§5.1).

Layer | Expert 1 | Expert 2 | Expert 3 | Expert 4 | Expert 5 | Expert 6 | Expert 7 | Expert 8
2 | 13.8 | 14.5 | 10.7 | 10.8 | 15.4 | 7.9 | 16.0 | 10.9
3 | 14.0 | 14.4 | 12.4 | 10.6 | 14.3 | 9.8 | 15.4 | 9.0
4 | 14.5 | 13.7 | 10.4 | 8.3 | 15.1 | 11.8 | 11.2 | 15.1
5 | 11.9 | 13.8 | 13.7 | 15.7 | 10.1 | 16.4 | 6.9 | 11.5
6 | 12.9 | 10.0 | 12.4 | 14.6 | 9.5 | 15.2 | 15.7 | 9.8

Figure 3: Learning curves of BASE, NOBCD, and MAE-7 (§B), trained on the IWSLT14 EN-DE using the same setup. NOBCD quickly fits the training data, but it does not outperform BASE on the validation set. Trained with BCD, MAE finds a nice middle ground. For better readability, the x-axis starts at epoch 8.

Our implementation is publicly available at https:// github.com/Noahs-ARK/MAE.

We do not argue that overparameterization is bad for training. In fact, it may be necessary for successful optimization and good generalization (Neyshabur et al., 2014; Zhang et al., 2016; Soudry and Carmon, 2016, inter alia). Rather, we try to explore more efficient ways to use the modeling capacity than, e.g., removing part of the model.

Some authors explicitly distinguish queries, keys, and values (Vaswani et al., 2017). These inputs can sometimes differ, e.g., in encoder-decoder attention. We suppress such differences for clarity.

For clarity, our discussion focuses on θ and φ. The rest of the model, e.g., the word embeddings in a transformer network, are updated along with θ. Training aims to minimize the loss L over {θ, φ}.

6 In mini-batch training, which we use in the experiments, different experts can be sampled for different instances in a mini-batch. This is because g depends on the inputs. This means that multiple experts will be updated in an F step, but each due to a subset of the examples in the mini-batch.

In this way, training time for MAE is roughly 1.2 times longer than that of the transformer network it builds on.

8 Although we assume supervised learning, we suppress the gold outputs for notational clarity. We slightly overload the notation and denote by X_i the training instance, although they can also be the outputs of intermediate layers.

Recall from Eq. 5 that $f_i$ includes all but head $i$.

Baevski and Auli (2019) use NVIDIA Tesla V100 GPUs with 32GB memory, while we only have access to GeForce RTX 2080 Ti, with 11GB memory.

The same experiments can be done with the decoders, where the inputs to the gating functions are German sentences. The authors lack German expertise, and interpretation of the following analysis would not have been possible for us.

17 We observe similar trends in other layers. See Appendix C for more details.

https://pytorch.org/; https://github.com/pytorch/fairseq

20 https://github.com/pytorch/fairseq/issues/346

21 Due to the randomness in random expert selection, we find that warming up the learning rate more slowly helps stabilize early training.

22 https://github.com/pytorch/fairseq/tree/master/examples/translation