
When Bert Forgets How To POS: Amnesic Probing of Linguistic Properties and MLM Predictions

Authors

Abstract

A growing body of work makes use of probing in order to investigate the workings of neural models, which are often considered black boxes. Recently, a debate has emerged surrounding the limitations of the probing paradigm. In this work, we point out the inability to infer behavioral conclusions from probing results, and offer an alternative method that focuses on how the information is being used, rather than on what information is encoded. Our method, Amnesic Probing, follows the intuition that the utility of a property for a given task can be assessed by measuring the influence of a causal intervention that removes it from the representation. Equipped with this new analysis tool, we can ask questions that were not possible before, e.g., is part-of-speech information important for word prediction? We perform a series of analyses on BERT to answer these types of questions. Our findings demonstrate that conventional probing performance does not correlate with task importance, and we call for increased scrutiny of claims that draw behavioral or causal conclusions from probing results.

1 Introduction

What drives a model to make a specific prediction? What information is being used for prediction, and what would have happened if that information had been missing? Since neural representations are opaque and hard to interpret, answering these questions is challenging.

The recent advancements in Language Models (LMs) and their success in transfer learning of many NLP tasks (e.g., Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019b) sparked interest in understanding how these models work and what is being encoded in them. One prominent methodology that attempts to shed light on these questions is probing (Conneau et al., 2018), also known as auxiliary prediction (Adi et al., 2016) and diagnostic classification (Hupkes et al., 2018). Under this methodology, one trains a simple model, a probe, to predict some desired information from the latent representations of the pre-trained model. High prediction performance is interpreted as evidence for the information being encoded in the representation. A key drawback of such an approach is that while it may indicate that the information can be extracted from the representation, it provides no evidence for or against the actual use of this information by the model. Indeed, Hewitt and Liang (2019) have shown that under certain conditions, above-random probing accuracy can be achieved even when the information one probes for is linguistically meaningless noise, which is unlikely to be of any use to the actual model. More recently, Ravichander et al. (2020) showed that models encode linguistic properties even when these are not required at all for solving the task, questioning the usefulness and common interpretation of probing. These results call for higher scrutiny of causal claims based on probing results.

Figure 1: A schematic description of the proposed amnesic intervention: we transform the contextualized representation of the word "ran" so as to remove information (here, POS), resulting in a "cleaned" version h_ran^{¬POS}. This representation is fed to the word-prediction layer and the behavioral influence of POS erasure is measured.

In this paper, we propose a counterfactual approach which serves as a step towards causal attribution: Amnesic Probing (see Figure 1 for a schematic view). We build on the intuition that if a property P (e.g., part-of-speech) is being used for a task T (e.g., language modeling), then the removal of P should negatively influence the ability of the model to solve the task. Conversely, when the removal of P has little or no influence on the ability to solve T, one can argue that knowing P is not a significant contributing factor in the strategy the model employs in solving T. As opposed to previous work that focused on intervention in the input space (Goyal et al., 2019; Kaushik et al., 2020; Vig et al., 2020) or in specific parameters (Vig et al., 2020), our intervention is performed on the representation layers. This makes it easier than changing the input (which is non-trivial) and more efficient than querying millions of parameters.

We demonstrate that amnesic probing can function as a debugging and analysis tool for neural models. Specifically, we show how amnesic probing can be used to deduce whether a property is used by a given model for prediction. This allows practitioners to state hypotheses about the properties a model should or should not be using, and to test them in practice.

In order to build the counterfactual representations, we need a function that operates on a pre-trained representation and returns a counterfactual version which no longer encodes the property we focus on. We use a recently proposed algorithm for neutralizing linear information: Iterative Nullspace Projection (INLP) (Ravfogel et al., 2020). Put together, this approach allows us to ask the counterfactual question: "How would the prediction on a task differ without some property?" This approach relies on the assumption that the usefulness of some information can be measured by neutralizing it from the representation and witnessing the resulting behavioral change. This assumption echoes the basic idea of ablation tests, where one removes some component and measures the influence of that intervention.

We study several linguistic properties such as part-of-speech (POS) and dependency labels. Overall, we find that, contrary to common belief, high probing performance does not mean that the probed information is used for predicting the main task ( §4). This is consistent with the recent findings of Ravichander et al. (2020). Our analysis also reveals that the properties we examine are often used differently in the masked setting (which is mostly used in LM training) and in the non-masked setting (which is commonly used for probing or fine-tuning) ( §5). We then dive deeper into a more fine-grained analysis, and show that not all of the linguistic property labels equally influence prediction ( §6). Finally, we re-evaluate previous claims about the way that BERT processes the traditional NLP pipeline (Tenney et al., 2019a) with amnesic probing and provide a novel interpretation of the utility of different layers ( §7).

2.1 Setup Formulation

Given a set of labeled data D of data points X = x_1, ..., x_n [1] and task labels Y = y_1, ..., y_n, we analyze a model f that predicts the labels Y from X: ŷ_i = f(x_i). We assume that this model is composed of two parts: an encoder h that transforms an input x_i into a representation vector h(x_i), and a classifier c that predicts ŷ_i based on h(x_i): ŷ_i = c(h(x_i)). We refer to the component that follows the encoding function h and is used for the classification of the task of interest y as the model. Each data point x_i is also associated with a property of interest z_i, which represents additional information that may or may not affect the decision of the classifier c.

In this work, we are interested in the change in the classifier c's prediction ŷ_i that is caused by the removal of the property Z from the representation h(x_i), denoted h(x_i)^{¬Z}.
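As a minimal illustration of this setup, the following sketch (ours, not the authors' code) contrasts the ordinary prediction ŷ_i = c(h(x_i)) with the amnesic prediction computed on h(x_i)^{¬Z}, under the assumption that the amnesic representation is obtained by applying a projection matrix P to h(x_i), as described in the next subsection; `encode`, `classify` and `P` are illustrative placeholders.

```python
import numpy as np

def predict(x, encode, classify):
    """Ordinary prediction: y_hat = c(h(x))."""
    return classify(encode(x))

def amnesic_predict(x, encode, classify, P):
    """Amnesic prediction: y_hat computed on h(x)^{not-Z} = P h(x)."""
    return classify(P @ np.asarray(encode(x)))
```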

2.2 Amnesic Probing with INLP

Under the counterfactual approach, we aim to evaluate the behavioral influence of a specific type of information Z (e.g. POS) on some task (e.g. language modeling). To do so, we selectively remove this information from the representation and observe the change in the behavior of the model on the main task.

One commonly used method for information removal relies on adversarial training through the gradient-reversal-layer technique (Ganin et al., 2016). However, this technique requires changing the original encoding, which is not desirable in our case, as we wish to study the original model's behavior. Additionally, it was found that this technique does not manage to completely remove all the information from the learned representation (Elazar and Goldberg, 2018). Instead, we make use of a recently proposed algorithm called Iterative Nullspace Projection (INLP) (Ravfogel et al., 2020). Given a labeled dataset of representations X and a property to remove Z, INLP neutralizes the ability to linearly predict Z from X. It does so by training a sequence of linear classifiers c_0, ..., c_n that predict Z, interpreting each one as conveying information on a unique direction in the latent space that corresponds to Z, and iteratively removing each of these directions. Concretely, we assume that the i-th classifier c_i [2] is parameterized by a matrix W_i. In the i-th iteration, c_i is trained to predict Z from X, and the data is projected onto its nullspace using a projection matrix P_{N(W_i)}. This operation guarantees W_i P_{N(W_i)} X = 0, i.e., it neutralizes the features in the latent space which were found by W_i to be indicative of Z. By repeating this process until no classifier achieves above-majority accuracy, INLP removes all such features. [3]
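The following is a minimal sketch of the INLP procedure described above, not the authors' released implementation. It assumes a numpy matrix X with one row per representation and an array Z of non-negative integer property labels; accumulating the projection by repeated multiplication is a simplification of the nullspace-intersection construction used in the original work.

```python
import numpy as np
from sklearn.svm import LinearSVC

def nullspace_projection(W):
    """Projection matrix onto the nullspace of the rows of W (num_classes x dim)."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = Vt[s > 1e-10]                     # orthonormal basis of the row-space of W
    return np.eye(W.shape[1]) - B.T @ B   # project away the directions used by W

def inlp(X, Z, max_iter=50, tol=0.01):
    """Iteratively remove the directions that linear probes use to predict Z from X."""
    P = np.eye(X.shape[1])
    X_proj = X.copy()
    majority = np.bincount(Z).max() / len(Z)
    for _ in range(max_iter):
        clf = LinearSVC(dual=False).fit(X_proj, Z)
        if clf.score(X_proj, Z) <= majority + tol:
            break                         # Z is no longer linearly predictable
        P_i = nullspace_projection(clf.coef_)
        X_proj = X_proj @ P_i             # neutralize this round's directions
        P = P_i @ P                       # simplification; see the note above
    return P, X_proj
```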

2.3 Controls

The usage of INLP in this setup involves some subtleties we aim to account for. (1) Any modification to the representation, regardless of whether it removes information necessary to the task, may cause a decrease in performance. Is the performance drop simply due to the modification of the representation? (2) The removal of any property using INLP may also cause removal of correlating properties. Does the removed information only pertain to the property in question?

Control over Information In order to control for the information loss of the representations, we make use of a baseline that removes the same number of directions as INLP does, but randomly.

In every INLP iteration, the rank of the data matrix decreases by the number of labels of the inspected property. This operation removes information from the representation that might be used for prediction. In this control, Rand, instead of finding the directions with a classifier trained on some task, we generate random vectors from a uniform distribution to serve as random directions. Then, we construct the projection matrix as in INLP, by finding the intersection of nullspaces.

If the Rand performance is higher than the amnesic probing performance for some property, we conclude that we removed directions that are important for the main task. Otherwise, when the control reaches similar performance, we conclude that the removed property is not significant for the main task's performance.
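A minimal sketch of the Rand control under the same assumptions as the INLP sketch above: the same number of directions is removed, but they are sampled at random rather than learned by classifiers. The function name and the QR-based orthogonalization are ours.

```python
import numpy as np

def random_projection(dim, num_directions, seed=0):
    """Remove `num_directions` random directions, mirroring INLP's rank reduction."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(num_directions, dim))
    Q, _ = np.linalg.qr(W.T)          # orthonormal basis of the random subspace
    return np.eye(dim) - Q @ Q.T      # projection onto its orthogonal complement

# Illustrative usage: match the total number of directions removed by INLP,
# i.e. roughly (number of INLP iterations) x (number of property labels).
# P_rand = random_projection(dim=768, num_directions=num_iterations * num_labels)
```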

Control over Selectivity The result of amnesic probing is taken as an indication of whether or not the inspected model (in our case, BERT) made use of the inspected property for prediction. If the performance stays intact, it suggests that the model did not make use of that information for its predictions. Otherwise, if it decreases, it suggests that the model was relying on this information when making its predictions. However, drawing such a conclusion requires first assessing the selectivity of the amnesic intervention.

In practice, the amnesic intervention relies on neutralizing the features that were used by linear classifiers to predict a property. Naturally, these might include features that are only correlated with the desired property (e.g. linear position in the sentence, which has a nonzero correlation with syntactic function). To what extent is the information-removal process we employ selective to the property in focus? This is crucial, as lack of selectivity would prevent us from drawing conclusions about any specific property. We test this by explicitly providing the information that has been removed from the representation, and fine-tuning the last word-prediction layer (while the rest of the network is frozen). Restoring the original performance is taken as evidence that the property we aimed to remove is enough to account for the damage sustained by the amnesic intervention (it may still be the case that the intervention removes unrelated properties; but given the explicitly provided property information, the model can make up for the damage). However, if the original performance is not restored, this indicates that the intervention removed more information than intended, which cannot be accounted for by merely providing the value of the single property we focused on.

Concretely, we concatenate feature vectors of the studied property to the amnesic representations. These vectors are 32-dimensional and are initialized randomly, with a unique vector for each value of the property of interest. They are fine-tuned until convergence. We note that as the new representation vectors are of a higher dimension than the original ones, we cannot use the original word-prediction matrix. For an easier learning process, we take the original embedding matrix, concatenate it with a new, randomly initialized embedding matrix, and treat the result as the new decision function.
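A minimal sketch of this selectivity control (presumably the control reported as 1-Hot in Table 2), assuming PyTorch. It builds the new decision function by concatenating the original output-embedding matrix with a randomly initialized block for the 32 extra dimensions; only this head and the property embeddings are trained, while the encoder stays frozen. All names are illustrative.

```python
import torch
import torch.nn as nn

class SelectivityHead(nn.Module):
    """Word prediction over [amnesic representation ; property embedding]."""
    def __init__(self, bert_output_embeddings, num_property_values, prop_dim=32):
        super().__init__()
        vocab_size, hidden = bert_output_embeddings.shape
        self.prop_emb = nn.Embedding(num_property_values, prop_dim)
        extra = torch.randn(vocab_size, prop_dim) * 0.02   # new, randomly initialized block
        self.decoder = nn.Linear(hidden + prop_dim, vocab_size, bias=False)
        with torch.no_grad():                              # start from the original matrix
            self.decoder.weight.copy_(torch.cat([bert_output_embeddings, extra], dim=1))

    def forward(self, amnesic_h, prop_ids):
        x = torch.cat([amnesic_h, self.prop_emb(prop_ids)], dim=-1)
        return self.decoder(x)                             # logits over the vocabulary
```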

3.1 BERT

We use our proposed method to investigate BERT (Devlin et al., 2019), [4] a popular and competitive masked language model (MLM) which has recently been the subject of many analysis works (e.g., Hewitt and Manning (2019); Liu et al. (2019a); Tenney et al. (2019a)). While most probing works focus on the ability to decode a certain linguistic property of the input text from the representation, we aim to understand which information is being used by the model when predicting words from context. For example, we seek to answer questions such as the following: "Is POS information used by the model in word prediction?"

The following experiments focus on language modeling, as a basic and popular task, but our method is more widely applicable.

3.2 Studied Properties

We focus on tasks formulated as sequence tagging. We consider the coarse and fine-grained part-of-speech tags (c-pos and f-pos, respectively) and dependency-tree labels, based on the English UD treebank (McDonald et al., 2013). We also use named-entity labels (ner) and phrase markers, [5] which mark the beginning and the end of a phrase (phrase start and phrase end, respectively), from the English part of the OntoNotes corpus (Weischedel et al., 2013). For training we use 100,000 random tokens from each dataset, and we use the entire evaluation set of each dataset.

3.3 Metrics

We report the following metrics. LM accuracy: word-prediction accuracy. Kullback–Leibler divergence (D_KL): we calculate the D_KL between the model's distributions over tokens before and after the amnesic intervention. This measure focuses on the entire distribution rather than on the correct token only; larger values imply a more significant change.
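A minimal sketch of the D_KL metric, assuming two arrays of shape (num_tokens, vocab_size) holding the model's output distributions before and after the amnesic intervention; the direction of the divergence and the averaging over tokens are our assumptions, as the text does not spell them out.

```python
import numpy as np

def mean_kl(p_before, p_after, eps=1e-12):
    """Average KL(p_before || p_after) over tokens; larger values mean a bigger change."""
    p = np.clip(p_before, eps, 1.0)
    q = np.clip(p_after, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```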

4 To Probe Or Not To Probe?

Using the probing technique, different linguistic phenomena such as POS, dependency information and NER (Tenney et al., 2019a; Liu et al., 2019a; Alt et al., 2020) have been found to be "easily extractable" (typically using linear models). These works often conclude that since the information can be easily extracted by the probing model, this information is being used for the predictions. We show that this is not the case. Some properties, such as syntactic structure and POS, are very informative and are used in practice to predict words. However, we also find some properties, such as phrase markers, which BERT does not make use of when predicting a token, in contrast to what one might naively deduce from probing results. This finding is in line with recent work that observed the same behavior (Ravichander et al., 2020).

For each linguistic property, we report the probing accuracy using a linear model, as well as the word-prediction accuracy after removing information about that property. The results are summarized in Table 1. Probing achieves substantially higher performance than the majority baseline across all tasks. Moreover, after neutralizing the studied property from the representation, the probing performance on that property drops to majority (not presented in the table). Next, we compare the LM performance before and after the projection and observe a major drop for dep and f-pos information (decreases of 87.0 and 81.8 accuracy points, respectively), and a moderate drop for c-pos and ner information (decreases of 32.2 and 10.8 accuracy points, respectively). For these tasks, the Rand LM-Acc scores are lower than the original scores, but substantially higher than the Amnesic scores. The D_KL metric shows the same trends (but in reverse, as a lower value indicates a smaller change). These results suggest that, to a large degree, the damage to LM performance is to be attributed to the specific information we remove, and not to rank reduction alone. We conclude that dependency information, POS and NER are important for word prediction.

Table 1: Property statistics, probing accuracies and the influence of the amnesic intervention on the model’s distribution over words. dep: dependency edge identity; f-pos and c-pos: fine-grained and coarse POS tags; phrase start and phrase end: beginning and end of phrases. Rand refers to replacing our INLP-based projection with removal of an equal number of random directions from the representation.

Interestingly, for phrase start and phrase end we observe a small improvement in accuracy, of 0.21 and 0.32 points respectively. The performance of the control on these properties is lower; therefore, not only are these properties not important for the LM prediction at this part of the model, they even slightly harm it. This last observation is rather surprising, as phrase boundaries are coupled to the structure of sentences and the words that form them. A potential explanation for this phenomenon is that this information is simply not used at this part of the model, but is rather processed at an earlier stage. We further inspect this hypothesis in Section 7.

Finally, the Spearman correlation between the probe scores and the amnesic probing scores is 8.5, with a p-value of 87.1; that is, probing accuracy does not correlate with task importance as measured by our method.

These results strengthen recent works that question the usefulness of probing as an analysis tool (Hewitt and Liang, 2019; Ravichander et al., 2020), but approach it from a different angle: the usefulness of properties for the main task. We conclude that high probing performance does not entail that the information is being used in a later part of the network.

5 What Properties are Important for the Pre-Training Objective?

Probing studies tend to focus on representations that are used for an end task (usually the last hidden layer before the classification layer). In the case of MLM models, the words are not masked when encoding them for downstream tasks (as opposed to the pre-training step, in which 15% of the words are masked in BERT (Devlin et al., 2019)). However, these representations are different from those used during the pre-training LM phase (which is of interest to us), where the input words are masked. It is therefore unclear whether the conclusions drawn from conventional probing also apply to the way the pre-trained model operates. From this section on, unless mentioned otherwise, we report our experiments on the masked words.

That is, given a sequence of tokens x_1, ..., x_i, ..., x_n, we encode the representation of each token i using its context, as follows: x_1, ..., x_{i-1}, [MASK], x_{i+1}, ..., x_n. The rest of the tokens remain intact. We feed these input tokens to BERT and use only the masked representation of each word in its context, h(x_1, ..., x_{i-1}, [MASK], x_{i+1}, ..., x_n)_i.
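A minimal sketch of how such a masked representation can be obtained, assuming the HuggingFace transformers library and BERT-BASE-UNCASED; this is an illustration rather than the authors' code, and it simply locates the single [MASK] position in the word-piece sequence.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def masked_representation(words, i):
    """Contextual vector of position i when that word is replaced by [MASK]."""
    masked = list(words)
    masked[i] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (sequence_length, hidden_size)
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    return hidden[mask_pos]
```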

We repeat the experiments from Section 4 and report the results in Table 2. As expected, the LM accuracy drops significantly, as the model does not have access to the original word and has to infer it based only on its context. Overall, the trends in the masked setting are similar to those in the non-masked setting. However, this is not always the case, as we show in Section 7. We also report the selectivity control. Notice that the performance in this experiment improved across all tasks. In the case of dep and f-pos, where we had to neutralize most of the dimensions, the performance does not fully recover. However, for the rest of the properties (c-pos, ner, and the phrase markers), the performance is fully recovered, indicating the selectivity of our method.

Table 2: Amnesic probing results for the masked representations. Property statistics, word-prediction accuracy and D_KL results for the different properties inspected in this work. We report the Vanilla word-prediction accuracy and the Amnesic scores, as well as the Rand and 1-Hot controls, which show minimal information loss and high selectivity (except for the dep property, for which all information was removed). D_KL is also reported for all properties in the last rows and shows trends similar to the accuracy results.

To further study the effect of INLP and to inspect how the removal of dimensions affects performance, we plot in Figure 2 the LM accuracy over the INLP iterations for the inspected tasks, both with the amnesic probing and with the control, and observe a consistent gap between them. Moreover, we highlight the difference in slope between our method and the random-direction removal. The amnesic probing exhibits a much steeper slope than the random-direction removal, indicating that the studied property is indeed correlated with the task of word prediction.

Figure 2: LM accuracy over INLP iterations, for the masked-tokens version. We present the Vanilla word-prediction score (straight blue line), as well as the control (orange, large circles) and INLP (green, small circles). Note that the number of removed dimensions per iteration differs, based on the number of classes of that property.

6 Specific Labels And Word Prediction

In the previous sections we observed the impact (or lack thereof) of different properties on word prediction. But when a property affects word prediction, are all words affected similarly? In this section, we inspect a more fine-grained version of the properties of interest and study their impact on word prediction.

Fine-Grained Analysis When we remove the POS information from the representation, are nouns affected the same as conjunctions? We repeat the masked experimental setting from Section 5, but this time we inspect the word-prediction performance for the different labels. We report the results for the c-pos tagging in Table 3. We observe large differences between the labels in the word-prediction performance before and after the POS removal. Nouns, numbers and verbs show a relatively small drop in performance (8.64, 6.91 and 11.73 points, respectively), while conjunctions, particles and determiners exhibit large performance drops (73.73, 77.66 and 65.65 points, respectively). We conclude that POS information at the word-level prediction is much more important for closed-class vocabularies (such as conjunctions and determiners) than for open-class vocabularies (such as nouns and verbs).

Table 3: Masked version, tag removal, fine-grained LM analysis. We remove POS (tag) information and test how specific words are affected, aggregated by their label. ∆ is the difference in performance between the Vanilla and Amnesic scores.


Removal Of Specific Labels

Following the observation that classes are affected differently when predicting words, we further investigate the effect of removing specific labels. To this end, we repeat the amnesic probing experiments, but instead of removing the fine-grained information of a linguistic property, we make a subtler removal: the distinction between a specific label and the rest. For example, with POS as the general property, we now investigate whether the distinction between nouns and the rest is important for predicting a word. We perform this experiment for all of the pos-c labels and report the results in Table 4. We observe large performance gaps when removing different labels. For example, removing the distinction between nouns and the rest, or verbs and the rest, has minimal impact on performance. On the other hand, determiners, adpositions and punctuation are highly affected. This is consistent with the previous observation on removing the more specific information. These results call for more detailed observations and experiments when studying a phenomenon, as the fine-grained property distinctions do not behave the same across labels. [6]

Table 4: Word-prediction accuracy after fine-grained tag-distinction removal, masked version. Rand control performance is between 56.07 and 56.88 accuracy for all labels (with a maximum difference from Vanilla of 0.9 points).

7 Behavior Across Layers

The results up to this section treat all of BERT's Transformer blocks (Vaswani et al., 2017) as the encoding function and the embedding matrix as the model. But what happens when we remove the information of some linguistic property from earlier layers? By using INLP to remove a property from an intermediate layer, we prevent the subsequent layer from using linear information originally stored in that layer. Although this operation does not erase all the information correlated with the studied property (as INLP only removes linear information), it makes it harder for the model to use.

Concretely, we begin by extracting the representation of some text with the first k layers of BERT and then run INLP on these representations to remove the property of interest. Given that we wish to study the effect of a property on layer i, we project the representation using the corresponding projection matrix P_i that was learned on those representations, and then continue the encoding through the following layers. [7]
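A minimal sketch of this layer-wise intervention, assuming the HuggingFace transformers BertModel and a projection matrix P learned with INLP on the hidden states of layer i. A forward hook projects the outputs of that layer so that all subsequent layers only see the amnesic representations; for brevity the sketch applies the projection to every position, whereas the paper skips BERT's special tokens (see footnote [7]). All names are ours.

```python
import torch
from transformers import BertModel

def attach_amnesic_hook(model: BertModel, layer_idx: int, P):
    """Project the hidden states leaving layer `layer_idx` so later layers see h^{not-Z}."""
    P_t = torch.tensor(P, dtype=torch.float32)

    def hook(module, inputs, output):
        # BertLayer returns a tuple whose first element is the hidden states;
        # this structure may differ across transformers versions.
        if isinstance(output, tuple):
            return (output[0] @ P_t.T,) + output[1:]
        return output @ P_t.T

    return model.encoder.layer[layer_idx].register_forward_hook(hook)

# Illustrative usage: intervene on layer 3, run the model, then remove the hook.
# handle = attach_amnesic_hook(model, 3, P); ... ; handle.remove()
```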

7.1 Property Recovery After An Amnesic Operation

Is the property we linearly remove from a given layer recoverable by subsequent layers? We remove the information about some linguistic property from layer i and train a probe classifier on all subsequent layers i+1, ..., n. This tests how much information about the property the following layers have recovered. We experiment with the properties that could be removed without reducing too many dimensions: pos-c, ner, phrase start and phrase end. These results are summarized in Figure 3, both for the non-masked version (upper row) and the masked version (lower row). Notably, for the non-masked version of pos-c, the property is highly recoverable in subsequent layers when the amnesic operation is applied to one of the first seven layers: the performance drop from the baseline probing of that layer is between 5.72 and 12.69 accuracy points. However, in the second half of the network the drop is substantially larger, between 16.57 and 46.39 accuracy points. For the masked version, we witness the opposite trend: the pos-c information is much less recoverable in the lower parts of the network than in the upper parts. In particular, the removal of pos-c from the second layer appears to affect all of the following layers, which do not manage to recover a high score on this task, ranging from 32.7 to 42.1 accuracy.

Figure 3: Layer-wise removal. Removing from layer i (the rows) and testing probing performance on layer j (the columns). The top row (3a) is the non-masked version; the bottom row (3b) is the masked version.

In all of the non-masked experiments, the upper layers seem to make it harder for the subsequent layers to extract the property. In the masked version, however, there is no clear trend. For pos-c and ner, it is harder to extract the property after removal in the lower parts of the network. For phrase start, the upper part makes further extraction harder, and for phrase end, both the lower and upper parts make it harder, as opposed to the middle layers. Further research is needed in order to understand the significance of these findings, and whether or not they are related to information usage across layers.

This leads us to the final amnesic probing experiment, in which we test the main task performance after an amnesic operation at intermediate layers.

7.2 Re-Rediscovering The NLP Pipeline

In the previous set of experiments, we measured how much of the signal removed in layer i is recovered in subsequent layers. We now study how the removal of information in layer i affects the word-prediction accuracy at the final layer, in order to obtain an alternative measure of layer importance with respect to a property. The results for the different properties are presented in Figure 4, where we plot the difference in word-prediction performance between the control and the amnesic probing when removing a linguistic property from a given layer.

Figure 4: The influence of the different properties, removed from each layer, on LM predictions. The top figure (4a) shows results for the regular, non-masked version; the bottom figure (4b) for the masked version. Colors are for readability only and correspond to the different layers.

These results provide a clear interpretation of the internal function of BERT's layers. For the masked version, we observe that the pos-c property is mostly important at layer 3 and its surrounding layers, as well as at layer 12. However, this information is accurately extractable only towards the last layers. For ner, we observe that the main performance loss occurs at layer 4. For the phrase markers, the middle layers are important: layers 5 and 7 for the start marker (although the absolute performance loss is not large), and layer 6 for the end marker, contribute the most to word-prediction performance.

The story for the non-masked version is quite different. First, notice that across all properties, in some layers the amnesic operation causes an improvement in LM performance. Second, the layer at which the performance drop peaks differs, for all properties, from the masked-version experiments. In particular, it seems that for pos-c, when the words are not masked in the input, the most important layer is layer 11 (and not layer 3, as in the masked version), while this information is easily extractable across all layers (above 80% accuracy).

Interestingly, the conclusions we draw on layer importance from amnesic probing partly differ from those of the "pipeline processing" hypothesis (Tenney et al., 2019a), which aims to localize and attribute the processing of linguistic properties to parts of BERT (for the non-masked version). [8] On the one hand, the ner trends are similar: the last layers are much more important than earlier ones (in particular, layer 11 is the most affected in our case, with a decrease of 31.09 accuracy points). On the other hand, in contrast to their hypothesis, we find that POS information (pos-c), which was considered to be more important in the earlier layers, affects word-prediction performance much more in the upper layers (a loss of 40.99 accuracy points at the 11th layer). Finally, we note that our approach performs an ablation of these properties in the representation space, which reveals which layers are actually responsible for processing these properties, as opposed to Tenney et al. (2019a), who focused on where this information is easily extractable.

We conclude that by using amnesic probing on internal parts of the model, one can study how the representation of the input text is gradually built, and which kinds of linguistic information are processed by different parts of the model. Moreover, we note the substantial differences in behavior when analyzing the masked vs. the non-masked version of BERT, and call for future work to make a clearer distinction between the two.

8 Related Work

With the established impressive performance of large pre-trained language models (Devlin et al., 2019; Liu et al., 2019b) based on the Transformer architecture (Vaswani et al., 2017), interest in gaining insight into how these models work and what they encode has increased, leading to a large body of research dedicated to these models. These works cover a wide variety of topics, including grammatical generalization (Goldberg, 2019; Warstadt et al., 2019), syntax (Tenney et al., 2019b; Lin et al., 2019; Reif et al., 2019; Hewitt and Manning, 2019; Liu et al., 2019a), world knowledge (F. Petroni and Riedel, 2019; Jiang et al., 2019), reasoning (Talmor et al., 2019), and commonsense (Zhou et al., 2019; Weir et al., 2020). For a thorough summary of these advancements we refer the reader to a recent primer on the subject (Rogers et al., 2020).

A particularly popular and easy-to-use interpretation method is probing (Conneau et al., 2018). Despite its popularity, recent works have questioned the use of probing as an interpretation tool. Those critiques raise doubts regarding the ability to interpret probing results as conveying meaningful information about the probed model. This is manifested by the difficulty of discerning between decoding existing properties from representations and learning them with the supervised probe. Most recently, it has been shown that the ability to decode properties using probing may bear no relevance to the task of interest. Hewitt and Liang (2019) have emphasized the need to distinguish between decoding and learning the probing tasks. They introduced control tasks, a consistent but linguistically meaningless attribution of labels to tokens, and have shown that probes trained on the control tasks often perform well, due to the strong lexical information held in the representations and learned by the probe. This leads them to propose a selectivity measure that aims to choose probes which achieve high accuracy only on linguistically meaningful tasks. Tamkin et al. (2020) claim that probing cannot serve as an explanation of downstream task success. They observe that the probing scores do not correlate with the transfer scores achieved by fine-tuning. Finally, Ravichander et al. (2020) show that probing can achieve non-trivial results on representations of a model that does not need to encode the probed information in order to solve the task it was trained on. Given a linguistic property (e.g. tense), they trained the model on a subset of examples with limited variation with regard to that property (e.g. only past tense), but tested it on a large variety of examples (e.g. sentences containing verbs in present tense). As the probe achieves high performance, they question the utility of probes as a tool to analyze representations. In this work, we observe a similar phenomenon, but from a different angle. We actively remove some property of interest from the queried representation, and measure the impact of the amnesic representation on the main task.

Two recent works analyse probing, and the meaning of claims about the "encoding" of a property in a representation, from an information-theoretic perspective. Pimentel et al. (2020) emphasize that if one views probing as training a model that maximizes the mutual information between the representations and the labels, then a more accurate probe is necessarily better, regardless of its complexity. They question the rationale behind probing, as a simple consequence of the data-processing inequality implies that a representation can at most include the same amount of information as the original input sentence, and propose ease of extractability as a future criterion for probe selection that needs formalization. Voita and Titov (2020) follow this direction by analysing probing from an algorithmic information-theoretic perspective, using the concept of minimum description length (MDL; Rissanen, 1978), which quantifies the total information needed to transmit both the probing model and the labels predicted by its probability distribution. Our discussion in this paper is somewhat orthogonal to discussions on the meaning of encoding and of probe complexity, as we focus on the influence of information on the model behavior, rather than on the ability to extract it from the representation.

Finally, and concurrently with this work, Feder et al. (2020) have studied the similar question of causal attribution of concepts to representations, using adversarial training guided by causal graphs.

9 Discussion

Intuitively, we would like to completely neutralize the abstract property we are interested in, e.g. POS information, as represented by the model (completeness), while keeping the rest of the representation intact (selectivity). This is a non-trivial goal, as it is not clear whether neural models actually have abstract and disentangled representations of properties such as POS which are independent of other properties of the text; it may be the case that the representations of many properties are intertwined. Indeed, there is an ongoing debate on the meaning of the assertion that certain information is "encoded" in the representation (Voita and Titov, 2020; Pimentel et al., 2020). Furthermore, even if a disentangled representation of the information we focus on exists to some degree, it is not clear how to detect it.

In light of these limitations, we realize the information-removal operation with INLP, which gives a first-order approximation using linear classifiers; we note, however, that one can in principle use other approaches to achieve the same goal. While we show that we do remove the linear ability to predict the properties, and provide some evidence for the selectivity of this method ( §2), one has to bear in mind that we remove only linearly-present information, and that the classifiers can rely on arbitrary features that happen to correlate with the gold label, be it a result of spurious correlations or of inherent encoding of the direct property. We thus stress that the information we remove in practice should be seen only as an approximation of the abstract information we are interested in, and that one has to be cautious about causal interpretations of the results.

Another open question is how to quantify the relative importance of the different properties encoded in the representation for the word-prediction task. The different number of directions removed for different properties makes it hard to draw conclusions on which property is more important for the task of interest. While we do not make claims such as "dependency information is more important than POS", these are interesting questions that should be further discussed and researched.

10 Conclusions

In this work, we propose a new method, Amnesic Probing, which aims to quantify the influence of specific properties on a model that solves a final task of interest. We demonstrate that conventional probing falls short in answering such behavioral questions, and perform a series of experiments on different linguistic phenomena, quantifying their influence on the masked language modeling task. Furthermore, we inspect both unmasked and masked representations and detail the differences between them, which we find to be substantial. We also highlight the different influence of specific fine-grained properties (e.g. nouns and determiners) on the final task. Finally, we use our proposed method on the different layers of BERT, and study which parts of the model make use of the different properties. Taken together, we argue that compared with probing, counterfactual intervention, such as the one we present here, can provide a richer and more refined view of the way symbolic linguistic information is encoded and used by neural models with distributed representations.

Footnotes

[1] The data points can be words, documents, images, etc., based on the application.

[2] Concretely, we use a linear SVM (Pedregosa et al., 2011).

[3] All relevant directions are removed to the extent that they are identified and used by the classifiers we train. Therefore, we make sure that a linear classifier achieves a score within one point of the majority baseline on the development set.

[4] Specifically, BERT-BASE-UNCASED (Wolf et al., 2019).

[5] Based on the Penn Treebank syntactic definitions.

[6] We repeat these experiments with other properties and observe similar trends.

[7] As the representations used to train INLP do not include BERT's special tokens (e.g. 'CLS', 'SEP'), we also do not apply the projection matrix to those tokens.

[8] As opposed to Tenney et al. (2019a), which studied BERT-Large, this paper focuses on BERT-Base, which might hold some differences.