Effective Attention Sheds Light On Interpretability

Authors

Abstract

An attention matrix of a transformer self-attention sublayer can provably be decomposed into two components, and only one of them (effective attention) contributes to the model output. This leads us to ask whether visualizing effective attention gives different conclusions than interpretation of standard attention. Using a subset of the GLUE tasks and BERT, we carry out an analysis to compare the two attention matrices, and show that their interpretations differ. Effective attention is less associated with features related to language modeling pretraining, such as the separator token, and it has more potential to illustrate linguistic features captured by the model for solving the end-task. Given these differences, we recommend using effective attention for studying a transformer's behavior, since it is more pertinent to the model output by design.

1 Introduction

The attention mechanism (Bahdanau et al., 2015) is an essential component of many NLP models, including those built on the ubiquitous transformer architecture (Vaswani et al., 2017). As a result, visualizing attention weights is a widely used technique for interpreting models' behavior (Belinkov and Glass, 2019). Despite that, the validity of this analysis method is the subject of intense discussion and study in NLP (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Serrano and Smith, 2019; Moradi et al., 2019; Mohankumar et al., 2020; Tutek and Snajder, 2020, i.a.).

Related to this discussion, Brunner et al. (2020) show that, under mild conditions, the attention matrix of a transformer self-attention sublayer can be written as a sum of two components. One of them is irrelevant for the model output because its product with the value matrix is zero. They term the other component effective attention (formally defined in §2). We study whether effective attention gives interpretations that differ from the conclusions we get by analyzing standard attention. If this is the case, interpretation of effective attention is better suited for studying transformers' internals because it is more pertinent to the model output by design. Brunner et al. (2020) briefly discuss this by comparing standard and effective attention matrices from a single BERT head (Devlin et al., 2019) for one example. They observe that: (i) standard attention is largely concentrated on the delimiter tokens ([SEP], [CLS]) or on near-diagonal elements; (ii) effective attention is more dispersed; (iii) effective attention disregards the delimiters. They stress that we should not extrapolate too much from these observations since they are based on a single example, and that further research is needed on this topic.

In this work, we aim to reliably answer whether effective attention disregards the [SEP] and [CLS] tokens, and if so, whether effective attention weights are dispersed to linguistic features. To address these questions, we adopt the methodology for a quantitative analysis of the attention patterns produced by individual transformer heads proposed by Kovaleva et al. (2019). We carry out their experiments on a subset of the GLUE tasks with BERT's standard and effective attention. We show that effective attention "ignores" [SEP] and punctuation symbols (§3.1, §3.2), but not [CLS] (§3.2), and that it highlights end-task features instead (§3.1, §3.2, §3.3).1

2 Background: Effective Attention

Each transformer layer consists of multi-head self-attention and feedforward sublayers (Vaswani et al., 2017, see Appendix A). Brunner et al. (2020) show that the standard attention matrix $A$ can be decomposed into two components if a mild condition holds, namely that the left nullspace of the value matrix $V$,

$$\mathrm{LN}(V) := \{x \in \mathbb{R}^{1 \times d_s} \mid xV = 0\},$$

is not trivial (contains vectors other than $\mathbf{0}$). This is satisfied when the maximum input sequence length is larger than the value matrix dimension (see Appendix A). The two components are: the component in the left nullspace of $V$, denoted $A^{\parallel}$, and the component orthogonal to the nullspace, denoted $A^{\perp}$. Notably, $A^{\parallel}$ does not contribute to the output of the self-attention sublayer:

$$AV = (A^{\parallel} + A^{\perp})V = 0 + A^{\perp}V = A^{\perp}V. \qquad (1)$$

The effective attention matrix is defined as $A^{\perp}$. If visualizations of standard and effective attention differ, the interpretation of effective attention is the accurate one, because effective attention is what contributes to the model output (per Eq. 1).

We explain how to compute $A^{\perp}$, since this was not described in Brunner et al. (2020). We first compute the singular value decomposition (SVD) of the value matrix, $V = U \Sigma W^{\top}$. The rows of $U^{\top}$ (equivalently, the left singular vectors of $V$) that correspond to singular values equal to zero span $\mathrm{LN}(V)$:

$$\mathrm{LN}(V) = \mathrm{span}\{u_1, \ldots, u_k\},$$

where $k$ is the number of singular values that equal zero. We project each row $a_i$ of the attention matrix $A \in \mathbb{R}^{d_s \times d_s}$ onto $\mathrm{LN}(V)$ to construct the projection of $A$ onto $\mathrm{LN}(V)$:

$$P_{\mathrm{LN}(V)}(a_i) = \sum_{j=1}^{k} \langle a_i, u_j \rangle\, u_j, \quad \forall i \in \{1, \ldots, d_s\},$$
$$P_{\mathrm{LN}(V)}(A) = \left[P_{\mathrm{LN}(V)}(a_1), \ldots, P_{\mathrm{LN}(V)}(a_{d_s})\right]^{\top},$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product. Finally, effective attention equals:

$$A^{\perp} := A - P_{\mathrm{LN}(V)}(A).$$

Effective attention is not guaranteed to be a probability distribution: some of its weights may be negative or larger than 1.
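In practice, the projection above can be implemented directly with a standard SVD routine. The following is a minimal NumPy sketch of the computation for a single head (our own illustration, not the authors' released code); `attn` and `value` stand for one head's $A$ and $V$ matrices.

```python
import numpy as np

def effective_attention(attn: np.ndarray, value: np.ndarray, tol: float = 1e-6) -> np.ndarray:
    """Compute A_perp = A - P_LN(V)(A) for one head.

    attn:  (d_s, d_s) standard attention matrix A.
    value: (d_s, d_v) value matrix V.
    """
    # Full SVD of V; the left singular vectors associated with (near-)zero
    # singular values span the left nullspace LN(V).
    U, S, _ = np.linalg.svd(value, full_matrices=True)
    null_mask = np.ones(U.shape[1], dtype=bool)
    null_mask[: S.shape[0]] = S < tol            # keep only zero singular values
    U_null = U[:, null_mask]                     # (d_s, k) basis of LN(V)

    # Project every row a_i of A onto LN(V): sum_j <a_i, u_j> u_j.
    projection = attn @ U_null @ U_null.T        # (d_s, d_s)
    return attn - projection                     # effective attention A_perp
```

For a full model, this computation is repeated for every head in every layer, which is the source of the overhead discussed below.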

Table 1: Specifications of the datasets.
Table 2: Estimated percentage of the attention patterns (§3.1): block (B), diagonal (D), vertical + diagonal (V + D), heterogeneous (H), vertical (V). Effective attention exhibits different patterns than standard attention, i.e., fewer vertical patterns (associated with delimiter tokens) and more block patterns (associated with task features).

We observe that effective attention is slower to compute due to the SVD of $V$ for each of the 144 BERT-base heads and the additional matrix multiplications (Table 3; §B). If speed is a bottleneck, we recommend running quantitative analyses with effective attention on a subset of the dev set. For qualitative analyses, common practice is already to select a subset for manual analysis.

3 What Does Effective Attention Reveal?

Table 3: A comparison of the evaluation clock time (minutes:seconds) of BERT models (trained with standard attention) evaluated with standard attention and with effective attention separately.

We compare visualizations of standard and effective attention following the methodology for the analysis of attention patterns (Kovaleva et al., 2019). We carry out our analyses using five English-language datasets from the GLUE benchmark (Wang et al., 2019): RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), MRPC (Dolan and Brockett, 2005), QNLI (Rajpurkar et al., 2016; Wang et al., 2019), SST-2 (Socher et al., 2013), and STS-B (Cer et al., 2017).2 See Table 1 for their specifications. For each dataset, we train BERT-base with standard attention, a batch size of 8, a maximum sequence length of 128, and 3 training epochs.3 For analyzing effective attention, we replace standard with effective attention at test time.
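To make this concrete, the sketch below (our own illustration; it assumes the module layout of the HuggingFace `transformers` BertModel and uses `bert-base-uncased` rather than our finetuned checkpoints) shows one way to obtain both the standard attention matrices and the per-head value matrices at test time, from which effective attention can be computed with the function sketched in §2.

```python
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model.eval()

value_cache = {}  # layer index -> value projections of shape (batch, seq, hidden)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        value_cache[layer_idx] = output.detach()
    return hook

# Each BertSelfAttention module exposes its value projection as a Linear submodule.
for i, layer in enumerate(model.encoder.layer):
    layer.attention.self.value.register_forward_hook(make_hook(i))

inputs = tokenizer("A new antibiotic is effective against resistant strains.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of 12 tensors, each (batch, n_heads, seq, seq).
# value_cache[i]: (batch, seq, 768); split into 12 per-head (seq, 64) value matrices.
attn_layer0 = outputs.attentions[0][0]                # (12, seq, seq)
values_layer0 = value_cache[0][0].view(-1, 12, 64)    # (seq, 12, 64)
```

The cached attention and value matrices then yield effective attention per head via the projection in §2; the same procedure would be applied to each finetuned task-specific model.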

3.1 Classification Of Attention Patterns

In this section, we begin to study whether effective attention disregards the delimiter tokens. Visualizations of attention matrices exhibit patterns (Clark et al., 2019; Vig and Belinkov, 2019). Kovaleva et al. (2019) identified five frequently occurring pattern categories:

• vertical (associated with the delimiter tokens)
• diagonal (either syntactic features between neighbouring words in English or the previous/following-token attention coming from the language modeling pretraining)
• vertical + diagonal
• block (intra-sentence attention for tasks with two distinct sequences; potentially encodes semantic and syntactic information)
• heterogeneous (like "block", more likely to capture interpretable linguistic features)

They annotated 400 of BERT's attention matrices using these categories, and used them to train a ConvNet for pattern classification of 1K random test-set attention matrices (a minimal sketch of such a classifier is given below). We replicate their results for standard attention (using their code), and classify effective attention matrices for comparison.4
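For reference, below is a minimal PyTorch sketch of such a pattern classifier (an illustrative architecture of our own; Kovaleva et al.'s released model may differ): an attention matrix is treated as a single-channel image and mapped to one of the five categories.

```python
import torch
import torch.nn as nn

PATTERNS = ["vertical", "diagonal", "vertical+diagonal", "block", "heterogeneous"]

class AttentionPatternCNN(nn.Module):
    """Classify an attention map (as a 1-channel image) into one of five patterns."""

    def __init__(self, n_classes: int = len(PATTERNS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.classifier = nn.Linear(32 * 8 * 8, n_classes)

    def forward(self, attn_map: torch.Tensor) -> torch.Tensor:
        # attn_map: (batch, 1, seq, seq); standard or effective attention.
        h = self.features(attn_map)
        return self.classifier(h.flatten(start_dim=1))
```

A classifier of this kind, trained on annotated standard-attention maps, can be applied unchanged to effective-attention maps, so the two pattern distributions in Table 2 are directly comparable.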

Figure 1: Effective attention “pays less attention” to [SEP] and punctuation. Per-task and per-head (0–11) attention when processing [CLS] in the final layer, averaged over the test set. Darker colors correspond to larger attention values. The green plots (two upper rows in subfigures) illustrate standard attention, and the blue plots (two lower rows in subfigures) effective attention.
Figure 2: Visualizations of standard and effective attention from one head for one example from the RTE dataset (recognizing textual entailment). Only the last few rows are visible; see the full version in Fig. 7 (Appendix §B).
Figure 3: Per-task cosine similarity between the pretrained and finetuned attention weights for selected GLUE tasks, calculated across layers and heads. Darker colors correspond to larger absolute attention weights. The top (green) figure is computed with standard attention, and the bottom (blue) figure with effective attention.

Results Table 2 (Fig. 4 in Appendix B) shows a drop in the percentage of the "vertical" and "vertical + diagonal" patterns when we replace standard with effective attention. Since the vertical patterns are associated predominantly with attention to the delimiter tokens, this result supports the hypothesis that effective attention disregards the delimiter tokens. Moreover, although the proportion of "heterogeneous" patterns did not change notably, the proportions of "block" and "diagonal" patterns increased. This suggests that we are better positioned to find end-task linguistic features captured by the model by visualizing effective attention.

Figure 4: Estimated percentage of the attention patterns (§3.1) for each task.

As an illustration, Figure 2 presents the attention matrices for one sentence from one attention head. In this example, effective attention highlights all mentions of the noun "antibiotics", which the adjective "new" modifies and which is also the object of the preposition "against", instead of giving prominence to the [SEP] token as standard attention does.

3.2 Delimiter Tokens Vs. Linguistic Features

We showed that the "vertical" pattern, associated with the delimiter tokens, is less dominant with effective attention (§3.1). To verify that both delimiter tokens are indeed less relevant with effective attention, following Kovaleva et al. (2019), we report the standard and effective attention weights of specific token types when processing the [CLS] token in the final layer: the attention weights of linguistic features (nouns, pronouns, verbs), the delimiter tokens ([SEP], [CLS]), and punctuation symbols, which are conceptually similar to [SEP].5

Results Figure 1 shows that, according to standard attention (upper two rows in each subfigure, colored green), [SEP] is among the two most relevant features for all tasks except QNLI. For all but one task (SST-2), it loses its dominance with effective attention, and its weight is apparently shifted to linguistic features. This is also the case for punctuation symbols. This result shows that the [SEP] token and punctuation symbols are not as important for understanding how the model solves the end-task as standard attention suggests.

We observe that [CLS] is attended to similarly with effective and standard attention, contrary to what Brunner et al. suggested. To rule out that this is merely because we plot the attention assigned to [CLS] when processing [CLS], we report the attention assigned to [CLS] when processing other input words (regardless of their type) in Fig. 5 in Appendix B. Again, we do not observe differences between standard and effective attention, unlike for [SEP] (Fig. 6 in §B). These results confirm the hypothesis of Brunner et al. that effective attention disregards [SEP], but not their hypothesis that it also disregards [CLS]. Notably, [SEP] is associated with the LM pretraining, while [CLS] is associated only with the task-specific finetuning.
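A small sketch of the aggregation behind Figure 1 (our own illustration; it assumes token types, e.g. POS tags, have already been assigned to the subtoken sequence):

```python
import numpy as np

def cls_attention_by_type(final_layer_attn: np.ndarray, token_types: list) -> dict:
    """Aggregate final-layer attention from the [CLS] row by token type.

    final_layer_attn: (n_heads, seq, seq) attention of the last layer.
    token_types: one label per subtoken, e.g. "NOUN", "VERB", "[SEP]", "PUNCT".
    Following footnote 5, the maximum weight is kept when several tokens share a type.
    """
    cls_row = final_layer_attn[:, 0, :]                # attention *from* [CLS] (position 0)
    per_type = {}
    for t in set(token_types):
        idx = [i for i, tt in enumerate(token_types) if tt == t]
        per_type[t] = cls_row[:, idx].max(axis=1)      # (n_heads,) per-head maximum
    return per_type
```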

Figure 5: Per-task attention across layers and heads to the [CLS] token when processing other input tokens, averaged over sequence length and dataset items for the selected GLUE task. Darker colors correspond to larger absolute attention weights. The top (green) figure is computed with standard attention, and the bottom (blue) figure with effective attention. Since effective attention does not have a fixed range like standard attention (from 0 to 1), we use the minimum and maximum effective attention weights for each task, calculated across all weights (not only those associated with the [CLS] token).
Figure 6: Per-task attention across layers and heads to the [SEP] token when processing other input tokens, averaged over sequence length and dataset items for the selected GLUE task. Darker colors correspond to larger absolute attention weights. The top (green) figure is computed with standard attention, and the bottom (blue) figure with effective attention. Since effective attention does not have a fixed range like standard attention (from 0 to 1), we use the minimum and maximum effective attention weights for each task, calculated across all weights (not only those associated with the [SEP] token).

3.3 Effects Of Task-Specific Finetuning

To provide our final piece of evidence that effective attention captures end-task features, we investigate how attention changes with finetuning, layer-wise, again following Kovaleva et al. (2019). They calculate the cosine similarity between the pretrained and finetuned flattened attention matrices. The layers that change the most encode the most task-specific features. To reiterate, effective attention is the part of standard attention that contributes to the model output (Eq. 1; §2), and we showed that it is less associated with the pretraining feature [SEP] and more with linguistic features (§3.1, §3.2). Thus, changes of standard attention from task-specific finetuning should be the product of changes of effective attention, and the outcome of this analysis should be the same regardless of the attention "type".
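A minimal sketch of this comparison (our own illustration; it assumes the per-layer attention maps of the pretrained and finetuned models have been collected on the same inputs):

```python
import torch
import torch.nn.functional as F

def attention_change(pretrained_attn, finetuned_attn):
    """Cosine similarity between flattened attention maps, per layer and head.

    Both arguments: lists with one tensor per layer, each of shape (n_heads, seq, seq),
    computed on the same input so that the maps are directly comparable.
    """
    sims = []
    for pre, fin in zip(pretrained_attn, finetuned_attn):
        pre_flat = pre.reshape(pre.shape[0], -1)       # (n_heads, seq*seq)
        fin_flat = fin.reshape(fin.shape[0], -1)
        sims.append(F.cosine_similarity(pre_flat, fin_flat, dim=-1))
    return torch.stack(sims)                           # (n_layers, n_heads)
```

Low similarity in a layer indicates that finetuning changed its attention the most; the same function can be run on standard or effective attention maps.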

Results As expected, we come to the same conclusion with effective attention as Kovaleva et al. did with standard attention: the last two layers change the most with finetuning (Fig. 3). This soundness check suggests once again that effective attention is the component of standard attention that manifests end-task features.

4 Conclusions

We study whether effective attention, the part of the transformer attention matrix that does not get canceled out by the value matrix, gives different interpretations than standard attention. We present a comparison of the two attentions and show that they differ in the weights assigned to delimiter tokens such as [SEP] and punctuation marks, but not [CLS], as was previously thought. Instead, effective attention gives more weight to linguistic features. Given these differences, and that effective attention is more pertinent to the model output by design, we recommend using it for studying transformers' internals.

As an alternative to analyzing attention weights, Kobayashi et al. (2020) propose analyzing the norm of the vectors produced by multiplying the outputs of the value matrix with the attention weights. Following the experimental setting of Clark et al. (2019), i.e., analyzing 992 sequences extracted from Wikipedia, their norm-based analysis also shows that the contributions of [SEP] and punctuation are actually small. However, unlike us, they report the same observation for [CLS]. Future work might consider a more formal comparison between the norm-based analysis and effective attention, especially since the norm-based analysis could circumvent the cost of the SVD.
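For concreteness, a simplified sketch of that quantity (our own illustration; Kobayashi et al.'s full formulation also folds in the per-head output projection):

```python
import torch

def norm_based_contributions(attn: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Simplified norm-based contributions ||a_{h,i,j} * v_{h,j}||.

    attn:   (n_heads, seq, seq) attention weights of one layer.
    values: (n_heads, seq, d_v) per-head value vectors of the same layer.
    """
    weighted = attn.unsqueeze(-1) * values.unsqueeze(1)  # (n_heads, seq, seq, d_v)
    return weighted.norm(dim=-1)                         # (n_heads, seq, seq)
```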

A Transformer Self-Attention Sublayer

In each transformer layer, a self-attention head computes

$$Q = Z_{l-1} W_Q \in \mathbb{R}^{d_s \times d_q}, \quad K = Z_{l-1} W_K \in \mathbb{R}^{d_s \times d_k}, \quad V = Z_{l-1} W_V \in \mathbb{R}^{d_s \times d_v},$$
$$A = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \in \mathbb{R}^{d_s \times d_s}, \qquad Z = AV \in \mathbb{R}^{d_s \times d_v},$$

where $d_s$ is the maximum length of the input sequence (in number of subtokens), $Z_{l-1}$ is the output of the previous transformer layer, and $W_Q$, $W_K$, $W_V$ are the query, key, and value weight matrices, respectively. For BERT-base,

$$d_q = d_k = d_v = 64, \quad n_{\text{heads}} = 12, \quad d_s = 512, \quad d_v \cdot n_{\text{heads}} = 768.$$

Brunner et al. (2020) show that the rank of the value matrix $V$ is bounded by

$$\mathrm{rank}(V) = \mathrm{rank}(Z_{l-1} W_V) \le \min\{d_s,\ d_v \cdot n_{\text{heads}},\ d_v\} \le \min\{d_s, d_v\}.$$

As a result, the left nullspace of $V$, defined as

$$\mathrm{LN}(V) := \{x \in \mathbb{R}^{1 \times d_s} \mid xV = 0\},$$

is non-trivial ($\mathrm{LN}(V) \neq \{\mathbf{0}\}$) when the maximum input length $d_s$ is larger than the dimension of the value matrix $d_v$, i.e., $d_s > d_v$.

In this case, we can construct infinitely many matrices $A + \tilde{A}$, with

$$\tilde{A} = [x_1, \ldots, x_{d_s}]^{\top}, \quad x_i \in \mathrm{LN}(V),$$

which contribute exactly the same to the output as the attention matrix $A$:

$$(A + \tilde{A})V = AV + \tilde{A}V = AV + 0 = AV.$$

This also holds when the weights of $A + \tilde{A}$ are constrained to the probability simplex, and such constrained matrices $A + \tilde{A}$ exist.
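The argument above can be checked numerically; the following is a small sketch of our own with toy dimensions (it demonstrates only the basic cancellation, not the simplex-constrained construction):

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_v = 128, 64                        # d_s > d_v, so LN(V) is non-trivial
A = rng.random((d_s, d_s))
A /= A.sum(axis=1, keepdims=True)         # rows of A on the probability simplex
V = rng.standard_normal((d_s, d_v))

U, S, _ = np.linalg.svd(V, full_matrices=True)
U_null = U[:, d_v:]                       # basis of LN(V) (dimension d_s - d_v for generic V)
A_tilde = rng.standard_normal((d_s, d_s - d_v)) @ U_null.T  # every row lies in LN(V)

print(np.allclose(A @ V, (A + A_tilde) @ V))   # True: identical self-attention output
print(np.linalg.matrix_rank(V) <= d_v)         # True: rank bound from above
```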

B Additional Results

We provide the following additional results that complement the discussions in Section 3:

• A comparison of the evaluation time with standard vs. effective attention (Table 3).
• In Figure 4, a visualization of the results presented in Table 2.
• In Figure 5, per-task attention across layers and heads to the [CLS] token when processing other input tokens.
• In Figure 6, per-task attention across layers and heads to the [SEP] token when processing other input tokens.


We omit larger datasets (QQP, MNLI) due to the limit of our computation budget (a single Nvidia GTX 1070 with 8GB of memory), and CoLA/WNLI following Kovaleva et al. (2019).
All other hyperparameters are set to the default values in the transformers library (Wolf et al., 2020).

We thank the authors for sharing their code and model weights for this experiment.

If there are multiple tokens of the same type in the input, we use the one with the maximum weight. If a word consists of multiple subtokens, we use the weight of the first subtoken.

Figure 7: The complete version of Figure 2.