ABNIRML: Analyzing the Behavior of Neural IR Models


Abstract

Numerous studies have demonstrated the effectiveness of pretrained contextualized language models such as BERT and T5 for ad-hoc search. However, it is not well understood why these methods are so effective, what makes some variants more effective than others, and what pitfalls they may have. We present a new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic tests that allow us to probe several characteristics, such as sensitivity to word order, that are not addressed by previous techniques. To demonstrate the value of the framework, we conduct an extensive empirical study that yields insights into the factors that contribute to neural models' gains and identifies potential unintended biases the models exhibit. We find evidence that recent neural ranking models have fundamentally different characteristics from prior ranking models. For instance, these models can be highly influenced by altered document word order, sentence order, and inflectional endings. They can also exhibit unexpected behaviors when additional content is added to documents, or when documents are expressed with different levels of fluency or formality. We find that these differences can depend on the architecture and not just the underlying language model.

1 Introduction

Pretrained contextualized language models such as BERT (Devlin et al., 2019) are state-of-the-art for a wide variety of natural language processing tasks (Xia et al., 2020). Similarly, in Information Retrieval (IR), these models have brought about large improvements in ad-hoc retrieval, the task of ranking documents by their relevance to a textual query (MacAvaney et al., 2019a; Dai and Callan, 2019b), and they increasingly dominate the leaderboards of ad-hoc retrieval competitions (Craswell et al., 2019; Dalton et al., 2019).

Despite this success, little is understood about why pretrained language models are effective for ad-hoc ranking. What new aspects of the task do neural models solve that previous approaches do not? Previous work has shown that traditional IR axioms, e.g., that increased term frequency should correspond to higher relevance, do not explain the behavior of recent neural models (Câmara and Hauff, 2020). Outside of IR, others have examined what characteristics contextualized language models learn in general (Liu et al., 2019; Rogers et al., 2020; Loureiro et al., 2020), but it remains unclear whether these qualities are valuable to the ad-hoc ranking task specifically. Thus, new approaches are necessary to characterize the models.

We propose a new framework aimed at Analyzing the Behavior of Neural IR ModeLs (ABNIRML) based on three testing strategies: "measure and match", "textual manipulation", and "dataset transfer". The "measure and match" strategy, akin to the diagnostic tests proposed by Rennings et al. (2019), constructs test samples by controlling one measurement (e.g., term frequency) and varying another (e.g., document length) using samples from an existing IR collection. The "textual manipulation" strategy tests the effect that altering the document text has on ranking. The "dataset transfer" strategy constructs tests from non-IR datasets. The new tests allow us to isolate model characteristics, such as sensitivity to word order or a preference for summarized rather than full documents, that are imperceptible using other approaches. We also release an open-source implementation of our framework that makes it easy to define new diagnostics and to replicate the analysis on new models.

Using our new framework, we perform the first large-scale analysis of neural IR models. We compare today's leading ranking techniques, including those using BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020), as well as methods focused on efficiency like DocT5Query and EPIC (MacAvaney et al., 2020). We find evidence showing that neural models are able to make effective use of textual signals that are not reflected by classical term matching methods like BM25. For example, when controlling for term frequency match, the neural models detect document relevance much more accurately than the BM25 baseline, and the effect is more pronounced in larger neural models. Further, unlike prior approaches, rankers based on BERT and T5 are heavily influenced by word order: shuffling the words in a document consistently lowers the document's score relative to the unmodified version. We also find significant differences between neural models: e.g., while most models treat queries navigationally (i.e., they score a document containing the query text itself higher than documents that elaborate on the query), the BERT-based EPIC model and T5 do not exhibit such behavior. Finally, these models can exhibit unexpected (and probably unwanted) behaviors: adding additional relevant text to the end of a document can frequently reduce its ranking score, and adding non-relevant content can increase it, despite document length itself having a limited effect on the ranking scores.

In summary, we present a new framework (ABNIRML) for performing analysis of ad-hoc ranking models. We then demonstrate how the framework can provide insights into ranking model characteristics by providing the most comprehensive analysis of neural ranking models to date. Our released software framework facilitates conducting further analyses in future work.

2 ABNIRML

In order to characterize the behavior of ranking models, we construct several diagnostic pair tests. Each test aims to evaluate specific properties of ranking models and probe their behavior (e.g., whether they are heavily influenced by term matching, discourse and coherence, conciseness/verbosity, writing styles, etc.). We formulate three different approaches to construct tests: Measure and Match, Textual Manipulation, and Dataset Transfer.

2.1 Preliminaries

In ad-hoc ranking, a query (expressed in natural language) is submitted by a user to a search engine, and a ranking function provides the user with a list of natural language documents sorted by relevance to the query.

More formally, let R(q, d) ∈ R be a ranking function, which maps a given query q and document d (each being a natural-language sequence of terms) to a real-valued ranking score. At query time, documents in a collection D are scored using R(•) for a given query q, and ranked by the scores (conventionally, sorted descending by score). Learning-to-rank models, such as neural approaches based on contextualized language models, define a ranking function R θ (•) as a model structure parameterized by θ. The parameters are optimized for the task of relevance ranking.
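To make the setup concrete, the following minimal sketch (ours, not part of the ABNIRML release) shows the ranking interface assumed throughout: a scoring function R(q, d) returning a real value, and ranking by sorting a collection in descending order of score. The toy term-overlap scorer stands in for a learned R_θ.

    from typing import Callable, List, Tuple

    RankingFn = Callable[[str, str], float]  # R(q, d) -> real-valued ranking score

    def rank(R: RankingFn, query: str, collection: List[str]) -> List[Tuple[float, str]]:
        """Score every document in the collection and sort descending by score."""
        scored = [(R(query, doc), doc) for doc in collection]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)

    # Toy scorer standing in for a learned model R_theta (illustration only).
    toy_R = lambda q, d: float(len(set(q.lower().split()) & set(d.lower().split())))
    print(rank(toy_R, "neural ranking", ["neural ranking with BERT", "classic BM25 baseline"]))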

2.2 Pair Testing

We utilize a pair testing strategy, in which tests are comprised of samples, each of which consists of a query and two documents that differ primarily in some characteristic of interest (e.g., textual elaboration). The ranking scores of the two documents are then compared (with respect to the query). This allows the isolation of particular model preferences. For instance, a test could consist of summarized and full texts of news articles; if the model consistently ranks the full text above the summary, we learn that the model prefers more elaborative text.

More formally, each pair test consists of a collection of samples S, where each (q, d_1, d_2) ∈ S is a 3-tuple consisting of a query (or query-like text, q) and two documents (or document-like texts, d_1 and d_2). The relationship between (q, d_1) and (q, d_2) for each sample defines the test. For instance, a test probing summarization could be defined as: (1) d_2 is a summary of d_1, and (2) d_1 is relevant to the query q. Note that the relationship between d_1 and d_2 must be asymmetric, or the test is ill-defined. For example, paraphrasing is not a valid pair test because both texts are paraphrases of one another, and it would therefore be ambiguous which to assign to d_1 and which to d_2.

Each sample in the test is scored as: (+1) scoring d_1 above d_2 (a positive effect), (-1) scoring d_2 above d_1 (a negative effect), or (0) a negligible effect. Formally, the effect eff(·) of a given sample is defined as:

eff(q, d_1, d_2) = \begin{cases} +1 & \text{if } R(q, d_1) - R(q, d_2) > \delta \\ -1 & \text{if } R(q, d_2) - R(q, d_1) > \delta \\ 0 & \text{otherwise} \end{cases} \quad (1)

The parameter δ adjusts how large the difference between the scores of d_1 and d_2 must be in order to count as a positive or negative effect. This allows us to disregard small changes to the score that are unlikely to affect the final ranking. In practice, δ depends on the ranking model because each model scores on a different scale. Therefore we tune δ for each model (see Section 3.3 for further details). The test is summarized by a single score s that averages the effect of all samples in the test:

s = \frac{1}{|S|} \sum_{(q, d_1, d_2) \in S} eff(q, d_1, d_2) \quad (2)

Note that this score is in the interval [−1, 1]. Positive scores indicate a stronger preference towards documents from group 1 (d_1 documents), and negative scores indicate a preference towards documents from group 2 (d_2 documents). Scores near 0 indicate no strong preference or preferences that are split roughly evenly; disentangling these two cases requires analyzing the individual effect scores.
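The following sketch (our illustration, using hypothetical names) shows how Equations (1) and (2) translate into code, given a scoring function R and a model-specific δ:

    def eff(R, q, d1, d2, delta):
        """Effect of a single sample, per Equation (1)."""
        diff = R(q, d1) - R(q, d2)
        if diff > delta:
            return +1   # d1 scored meaningfully higher (positive effect)
        if diff < -delta:
            return -1   # d2 scored meaningfully higher (negative effect)
        return 0        # negligible effect

    def test_score(R, samples, delta):
        """Average effect over all (q, d1, d2) samples, per Equation (2); lies in [-1, 1]."""
        return sum(eff(R, q, d1, d2, delta) for q, d1, d2 in samples) / len(samples)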

There are several important differences between our setup and the "diagnostic dataset" approach proposed by Rennings et al. (2019). First, by including the δ threshold, we ensure that our tests measure differences that can affect the final order in ranked lists. Second, by including the "negligible effect" case in our scoring function, we distinguish between cases in which d_1 or d_2 is preferred and cases in which neither document is strongly preferred. And finally, our tests are aimed at describing model behavior, rather than evaluating models. For instance, one of our tests measures whether the model prefers a document that provides an answer to a user query over a document containing the query text itself; whether this preference is desirable depends on the application (it is a useful preference in Q&A settings, where the user is seeking further details, but may be less desirable in search engines in which navigational queries are common).

2.3 Pair Testing Strategies

In this work, we explore three strategies for designing pair tests. As discussed below, the strategies have different strengths and weaknesses. When used in concert, they allow our framework to characterize a wide variety of model behaviors. See Figure 1 for an overview of the testing strategies.

Figure 1: Overview of strategies for constructing tests. Each test in ABNIRML is comprised of samples, each of which consists of a query (q) and two documents (d1 and d2).

Measure and Match Tests (MMTs). Some surface-level characteristics of documents, such as their Term Frequency (TF) for a given query, are both easy to measure and valuable for characterizing models.[2] By comparing the ranking scores of two documents in which such a characteristic differs (but which are otherwise similar), one can gain empirical evidence about what factors influence a model. Measure and Match Tests (MMTs) follow such an approach. MMTs involve first measuring the characteristics of judged query-document pairs in an IR dataset. Then, the pairs are matched to form test samples based on a control (a characteristic that matches between the documents, such as document length) and a variable (which differs between the documents, such as TF). When matching control characteristics, a threshold can be employed to allow for more potential matches. MMTs have been explored by others (Câmara and Hauff, 2020; Rennings et al., 2019) by formulating existing ranking axioms[3] into empirical tests. Our MMT strategy generalizes the process for building such tests. Specifically, this setting encourages one to consider all combinations of controls and variables, resulting in a more comprehensive analysis that shows how all the characteristics interact with each other.
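As an illustration of the matching step, the sketch below (ours; the measurement dictionaries and matching rule are assumptions, not the released ABNIRML code) pairs judged documents for one query such that a control characteristic matches within a tolerance while the variable characteristic differs:

    from itertools import combinations

    def build_mmt_samples(query, judged_docs, control, variable, ctrl_tol=0):
        """judged_docs: list of dicts holding a document's text and its measured
        characteristics (e.g., {'text': ..., 'length': ..., 'tf': ...}).
        Returns (q, d1, d2) samples where d1 has the higher value of `variable`."""
        samples = []
        for a, b in combinations(judged_docs, 2):
            if abs(a[control] - b[control]) <= ctrl_tol and a[variable] != b[variable]:
                d1, d2 = (a, b) if a[variable] > b[variable] else (b, a)
                samples.append((query, d1["text"], d2["text"]))
        return samples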

Textual Manipulation Tests (TMTs). Not all characteristics are easily captured with MMTs. For instance, if one wanted to test the sensitivity of models to word order, this would be difficult with MMTs for multiple reasons. First, it would be unlikely to find naturally-occurring documents that use the same words but in a different order. Further, even if such documents were identified, it would be unclear how to measure the quality of each word order. We overcome limitations like this by proposing Textual Manipulation Tests (TMTs). These tests apply a manipulation function to scored documents from an existing IR dataset. For testing word order, a simple manipulation function that shuffles the order of the words can be used, which eliminates both the matching problem (all documents can be manipulated) and the measurement problem (the proposed manipulation function is destructive, almost certainly reducing the quality of the word order). The only prior work we are aware of that uses a similar approach for testing ranking methods is a single diagnostic dataset proposed by Rennings et al. (2019), which tests the effect of duplicating the document, an adaptation of a traditional ranking axiom. Although TMTs allow testing a wider variety of characteristics than MMTs, we acknowledge that they involve constructing artificial data; d_2 may not resemble documents seen in practice. Despite this, their versatility makes them an attractive choice for testing a variety of characteristics.
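A minimal word-order manipulation function for such a TMT might look as follows (our sketch; whitespace tokenization is a simplification):

    import random

    def shuffle_words(document: str, seed: int = 0) -> str:
        """Destructive manipulation: randomly reorder the document's tokens."""
        tokens = document.split()
        random.Random(seed).shuffle(tokens)
        return " ".join(tokens)

    # In the TMT, d1 is the original document and d2 = shuffle_words(d1).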

Dataset Transfer Tests (DTTs). Even with MMTs and TMTs, some characteristics may still be difficult to measure. For instance, if one wanted to test the effect of text fluency (the degree to which language sounds like a native speaker wrote it) with an MMT, one would need an effective measure of text fluency and be able to match it to documents that otherwise provide similar text, which is a tall order. To test fluency with a TMT, one would need a function that is able to consistently reduce (or improve) textual fluency, which is difficult to accomplish with today's techniques. To test characteristics like these, we propose Dataset Transfer Tests (DTTs). In this setting, a dataset built for a purpose other than ranking is repurposed to test a ranking model's behavior. By using a DTT, one could use a dataset of human-written textual fluency pairs (e.g., from the JFLEG dataset (Napoles et al., 2017a)) to sidestep challenges in both measurement and manipulation. Text pair datasets are abundant, allowing us to test a wide variety of characteristics, like fluency, formality, and summarization. With these tests, d_1 and d_2 can be easily defined by the source dataset. In some cases, external information can be used to infer a corresponding q, such as using the title of the article as a query for news article summarization tasks, a technique that has been studied before to train ranking models (MacAvaney et al., 2019b). In other cases, a query can be generated for the document pair, for instance by taking the overlapping non-stopword terms in each. DTTs therefore also involve artificial data, but unlike TMTs, the documents d_i are authentic and only q is artificial.

3 Experimental Setup

3.1 Datasets

We use the MS-MARCO passage dataset (Campos et al., 2016) to train the neural ranking models. The training subset contains approximately 809k natural-language questions from a query log (with an average length of 7.5 terms) and 8.8 million candidate answer passages (with an average length of 73.1 terms). Due to its large number of queries, it is shallowly annotated, almost always containing fewer than 3 positive judgments per query. This dataset is frequently used for training neural ranking models, and has been shown to effectively transfer relevance signals to other collections.

We build MMT and TMT pair tests using the TREC Deep Learning 2019 passage dataset (Craswell et al., 2019). Although it uses the MS-MARCO passage collection, TREC DL is much smaller in terms of the number of queries (containing only 43 queries with relevance judgments). However, it has much deeper relevance judgments (on average, 215 per query). The judgments are also graded as highly relevant (7%), relevant (19%), topical (17%), and non-relevant (56%), allowing us to make more fine-grained comparisons. We opt to perform our analysis in a passage ranking setting to eliminate the effects of long-document aggregation, an area with many model varieties that is still under active investigation.

We describe the datasets used for our proposed DTTs as they are introduced in Section 6.

3.2 Models

We compare a sample of models covering a traditional lexical model (BM25), a conventional learning-to-rank approach (LightGBM), and neural models with and without contextualized language modeling components. We also include two models that focus on query-time computational efficiency. The neural models represent a sample of current and recent state-of-the-art ranking models.

BM25. We use the Terrier (Ounis et al., 2006) implementation of BM25 with default parameters. BM25 is an unsupervised model that incorporates the lexical features of term frequency (TF), inverse document frequency (IDF), and document length.
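For reference, one common formulation of the BM25 scoring function is shown below (a sketch; Terrier's exact variant and its default parameter values, e.g., k_1 and b, may differ in detail):

    \mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
        \frac{\mathrm{tf}(t, d)\,(k_1 + 1)}
             {\mathrm{tf}(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}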

L-GBM (Ke et al., 2017). We use a Light Gradient Boosting Machine model, currently used by the Semantic Scholar search engine (Feldman, 2020). The model was trained on clickthrough data from this search engine, meaning that it services various information needs (e.g., navigational and topical queries). A wide variety of features are used, not all of which are available in our ranking setting (e.g., recency, in-links, etc.). Thus, we only supply the text-based features L-GBM uses, such as lexical overlap and scores from a lightweight language model (Heafield et al., 2013). This serves as a non-neural learning-to-rank baseline.

KNRM and C-KNRM (Xiong et al., 2017; Dai et al., 2018) are kernel-based ranking models that calculate the cosine similarity between the word embeddings of query and document terms, aggregating the scores using Gaussian kernels. The convolutional variant (C-KNRM) applies 1-dimensional convolutions to build n-gram representations. For both models, we use default settings (e.g., 11 buckets, 3-grams), and train the models using the official training sequence of the MS-MARCO passage ranking dataset.

BERT (Devlin et al., 2019). We use a model consisting of a linear ranking layer atop a pretrained BERT transformer language model (MacAvaney et al., 2019a; Dai and Callan, 2019b). (This setup goes by several names in the literature, including Vanilla BERT and monoBERT.) We fine-tune the bert-base-uncased model for this task using the official training sequence of the MS-MARCO passage ranking dataset.

T5 (Raffel et al., 2020). The Text-to-Text Transfer Transformer ranking model scores documents by predicting whether the sequence "Query: [query] Document: [document] Relevant:" is more likely to be followed by the term 'true' or 'false'. We use the publicly released model weights, which were tuned on the MS-MARCO passage ranking dataset. At the time of writing, this model tops several ad-hoc ranking leaderboards.
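A hedged sketch of this scoring scheme using HuggingFace Transformers is shown below. This is our illustration, not the released ranking code: the checkpoint name 't5-base' is a placeholder (the actual ranking weights are a separately released fine-tuned model), and prompt details may differ.

    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()
    TRUE_ID = tok.encode("true")[0]    # first sentencepiece token of 'true'
    FALSE_ID = tok.encode("false")[0]  # first sentencepiece token of 'false'

    def t5_score(query: str, document: str) -> float:
        text = f"Query: {query} Document: {document} Relevant:"
        enc = tok(text, return_tensors="pt", truncation=True)
        dec = torch.tensor([[model.config.decoder_start_token_id]])
        with torch.no_grad():
            logits = model(**enc, decoder_input_ids=dec).logits[0, -1]
        # Ranking score: probability mass on 'true' vs. 'false' at the first decoding step.
        return torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)[0].item()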

EPIC (MacAvaney et al., 2020). This is an efficiency-focused BERT-based model, which separately encodes query and document content into vectors the size of the source vocabulary (where each element represents the importance of the corresponding term in the query/document). We use the pretrained bert-base-uncased model, and tune it for ranking on the train split of the MS-MARCO passage ranking dataset using the code released by the EPIC authors with default settings.

DT5Q. The T5 variant of the Doc2Query model (DT5Q) uses a T5 model to generate additional terms to add to documents. The expanded document can be efficiently indexed, boosting the weight of terms likely to match queries. We use the model released by the authors, which was trained using the MS-MARCO passage training dataset. For our tests, we generate four queries to add to each document. As was done in the original paper, we use BM25 as the scoring function over the expanded documents.

3.3 Choosing δ

Recall that δ indicates the minimum absolute difference between scores in a pair test for a sample to count as a positive or negative effect. Since each model scores documents on a different scale, we empirically choose a δ per model. We do this by first scoring the top 100 documents retrieved by BM25 for each query in the MS-MARCO dev collection. Among the top 10 results, we calculate the differences between each adjacent pair of scores, i.e., {R(q, d_1) − R(q, d_2), R(q, d_2) − R(q, d_3), ..., R(q, d_9) − R(q, d_10)}, where d_i is the ith highest-scored document for q. We set δ to the median difference over all queries in the dev collection. By setting the threshold this way, we can expect the differences captured by the pair tests to have an effect on the final ranking at least half the time.
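A sketch of this procedure (ours; `score_top100` is a hypothetical helper returning the model's scores for the top-100 BM25 candidates of a query, sorted descending):

    import statistics

    def choose_delta(model, dev_queries, score_top100):
        """Median adjacent score gap among each query's top-10 re-ranked results."""
        gaps = []
        for q in dev_queries:
            scores = score_top100(model, q)[:10]
            gaps.extend(scores[i] - scores[i + 1] for i in range(len(scores) - 1))
        return statistics.median(gaps)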

3.4 Significance Testing

We use a two-sided paired t-test to determine significance, pairing R(q, d_1) with R(q, d_2) for each sample. We apply a Bonferroni correction over each table to account for multiple tests, and test for p < 0.001.
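A minimal sketch of this test with SciPy (ours; `num_tests_in_table` is the number of tests sharing a Bonferroni correction):

    from scipy.stats import ttest_rel

    def is_significant(scores_d1, scores_d2, num_tests_in_table, alpha=0.001):
        """Two-sided paired t-test; Bonferroni correction applied by dividing alpha."""
        _, p = ttest_rel(scores_d1, scores_d2)
        return p < alpha / num_tests_in_table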

3.5 Software And Libraries

We use the following software libraries and packages to conduct our analysis: Anserini (Yang et al., 2018) , PyTerrier (MacDonald and Tonellotto, 2020), OpenNIR (MacAvaney, 2020), and HuggingFace Transformers (Wolf et al., 2019) .

4 Measure and Match Tests (MMTs)

Recall that MMTs measure a characteristic of a document (e.g., query term frequency) and match it with documents that have a differing value of that characteristic, given a control. We explore the following characteristics:

• Relevance: the human-assessed graded relevance score of a document to the given query.
• Length: the document length, in total number of non-stopword tokens. Note that this is the only characteristic we test that is not query-dependent.
• TF: the individual Porter-stemmed Term Frequencies of non-stopword terms from the query. To determine when the TFs of two documents differ, we use the condition that the TF of at least one query term in d_1 must be greater than that of the same term in d_2, and that no term in d_1 can have a lower TF than the corresponding term in d_2 (a condition sketched in the code below).
• Sum-TF: the total TF of query terms.

• Overlap: the proportion of non-stopword terms in the document that appear in the query; put another way, the Sum-TF divided by the document length.

Each of these characteristics can be used as both a variable (matching based on differing values) and a control (matching based on identical values). We examine all pairs of these characteristics, greatly expanding upon the IR axioms investigated in prior work.[4] The results on the TREC DL 2019 dataset are shown in Table 1. Positive scores indicate a preference for a higher value of the variable, and negative scores indicate a preference for a lower value.
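The TF-difference condition from the list above can be written as follows (our sketch; `tf_d1` and `tf_d2` map each stemmed, non-stopword query term to its frequency in the respective document):

    def tf_dominates(tf_d1: dict, tf_d2: dict, query_terms) -> bool:
        """d1's query-term frequencies dominate d2's: at least one strictly greater, none lower."""
        higher_somewhere = any(tf_d1.get(t, 0) > tf_d2.get(t, 0) for t in query_terms)
        never_lower = all(tf_d1.get(t, 0) >= tf_d2.get(t, 0) for t in query_terms)
        return higher_somewhere and never_lower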

Table 1: Results of Measure and Match Tests (MMTs) on TREC DL 2019. Positive scores indicate a preference towards a higher value of the variable. Scores marked with * are not statistically significant (see Section 3.4).

Several trends stand out from this analysis. First, although all models distinguish relevance grades reasonably effectively when controlling for length, Overlap, and Sum-TF, only EPIC, BERT, and T5 excel at identifying relevance when TF is held constant. Further, it appears that model capacity is related to effectiveness in this setting: T5 (+0.41) performs better than BERT (+0.34), which performs better than EPIC (+0.29).

We observe that most models are largely unaffected by differences in document length when controlled for relevance, TF, and Sum-TF. The most notable exceptions are BM25 and DT5Q (which uses BM25 for scoring), which are most affected when Sum-TF is constant. Although BM25 discounts the score of longer documents given the same TF, this test shows that these differences in score are often not enough to alter the order of search results in this collection.

The results when varying TF, Sum-TF, and Overlap all look similar: BM25 and DT5Q both strongly prefer documents with a higher TF. Even EPIC, BERT, and T5 are still affected by these characteristics, suggesting that they remain important for the models.

5 Textual Manipulation Tests (TMTs)

As we saw in Section 4, the ranking models that use contextualized language models are able to effectively distinguish relevance when controlled for various traditional relevance signals. We now use TMTs to investigate alternative explanations for this performance. Recall that a TMT manipulates documents in a collection and measures the effect this has. We begin by focusing on characteristics that are ignored by the traditional ranking pipeline: stop words, punctuation, inflectional word suffixes, and word order. These are a natural place to start because simple manipulation functions exist for them and because handling them is an important ingredient in most IR systems. To test the effects of stop words and punctuation, we simply remove these from the documents (for stop words, we use Terrier's list of English stop words; for punctuation, we remove typical punctuation symbols). We present the results for these tests in Table 2, each with an overall score and scores broken down by non-relevant (rel ∈ {0, 1}) and relevant (rel ∈ {2, 3}) documents. We find that this manipulation negatively impacts most models, with the largest effect seen on EPIC and T5. The BERT model actually prefers this modification for non-relevant documents, though it reduces the score of relevant documents. This behavior is particularly interesting because it is not exhibited by EPIC, even though EPIC uses the same contextualized model base.

Table 2: Results of Textual Manipulation Tests (TMTs) on TREC DL 2019. Positive scores indicate a preference for the manipulated document text; negative scores indicate a preference for the original text. Scores marked with * are not statistically significant (see Section 3.4). Documents with rel ∈ {0, 1} are at best "topical", and those with rel ∈ {2, 3} are relevant or highly relevant.

Stemming is also common practice in search engines. For our TMT, we opt to use SpaCy's (Honnibal and Montani, 2017) lemmatizer to remove word inflections, rather than a stemmer. We do this because the outputs of a stemming function like Porter are often not found in the lexicon of models like BERT and T5, which would result in the model falling back to subword units. T5 is most affected by this change (-0.59). However, BERT exhibits different preferences for relevant and non-relevant documents: there is no significant effect for non-relevant documents, while for relevant documents the manipulation yields a score of -0.14. We note that in cases where there is a disparity like this (i.e., scores for relevant documents and non-relevant documents differ), the model is applying different behavior at different relevance grades. In this case, the behavior suggests that inflectional endings are, at least in part, used by BERT as a signal that distinguishes relevance grades (i.e., they are more important for higher relevance grades).

Many ranking models make a bag-of-words assumption; that is, they assume that the order of the terms in the query and document does not matter. This can be verified analytically for models like BM25 and KNRM, but the importance of word order to other models is unclear because their behavior is encoded in millions of parameters. As our next TMT, we test the effect of word order by randomly shuffling the words in each document. As seen in Table 2, shuffling the words in a document can have a large effect on ranking scores. Most notably, it has a particularly large effect on EPIC, BERT, and T5. We see here again that the impact is larger for higher relevance levels, suggesting that word order is an important signal for distinguishing relevance. Note that the magnitude of these scores is higher than those for MMTs, signaling that word order can have a larger impact on model scores than traditional signals like TF.

To control for local word order (i.e., term adjacency), which is captured by n-gram approaches, we conduct another TMT that shuffles the order of the sentences in a document. (Documents containing only a single sentence are omitted from this test.) Here, we see that an effect remains for EPIC, BERT, and T5, though it is substantially reduced (and, in some cases for T5, not statistically significant). This suggests that discourse-level signals (e.g., what topics are discussed earlier in a document) have some effect on the models, or that the models encode some positional bias (e.g., preferring answers at the start of documents). We also see effects when maintaining the sentence order but shuffling the words within sentences, and even when only swapping the positions of noun chunks in the text. Finally, we find that shuffling the prepositions in a text has a large effect on T5 but no other models, suggesting that the T5 model uses the relation of terms to one another in the text as an important signal. Overall, the shuffling results indicate that ranking models like EPIC, BERT, and T5 make use of word order, and that the importance of word order is correlated with the degree of relevance.

In some applications, it is desirable to rank exact query matches highly (e.g., navigational queries), but in other applications users would expect further elaboration on the topic they are searching for (e.g., question answering). To test this behavior, our next TMT uses a manipulation function that replaces the document with the query text itself. We observe that nearly all systems are biased towards these exact query matches. The exceptions are EPIC and T5, which are substantially less biased, especially for relevant documents. We note that although EPIC is close to a score of 0, this is mostly due to positive effects (1,321 of 2,501) and negative effects (1,129 of 2,501) cancelling one another out, rather than to neutral effects (i.e., falling within the δ range). What leads to this behavior, and how to build models that exhibit predominantly negative or neutral effects, remains an open question.

We now explore model behaviors when adding content to the text. We conduct two TMTs: one that adds expansion terms from the DocT5Query model to the end of the document, and one that adds non-relevant text to the end of the document (by sampling a sentence from a rel = 0 document). These tests allow us to examine how models respond to added information. The models that rely heavily on unigram matching (e.g., BM25) respond positively to the addition of DocT5Query terms. Even the DocT5Query model itself sees an additional boost,[6] suggesting that weighting the expansion terms higher in the document may further improve the effectiveness of this model. However, EPIC, BERT, and T5 often respond negatively to these additions. As far as we know, we are the first to demonstrate that these re-ranking methods are not effective on expanded documents.

We also find that adding non-relevant sentences to the end of relevant documents often increases the ranking scores assigned by EPIC, BERT, and T5. This is in contrast with models like BM25, in which the scores of relevant documents decrease with the addition of non-relevant information. From the variable-length MMTs, we know that this increase in score is likely not due to increased length alone. Such characteristics may pose a risk to ranking systems based on EPIC, BERT, or T5, in which content sources could aim to increase their ranking simply by adding non-relevant content to the end of their documents.

6 Dataset Transfer Tests (DTTs)

We now explore model behaviors using DTTs. Recall that a DTT repurposes a non-IR dataset as a diagnostic tool by using it to construct pair tests. In this work, we explore three characteristics: Fluency, Formality, and Summarization.

6.1 Fluency

From the TMTs, we found that models are highly influenced by word order and inflectional endings. One possible explanation is that these models incorporate signals of textual fluency. To test this hypothesis directly, we propose a DTT using the JFLEG dataset (Napoles et al., 2017b). This dataset contains sentences from English-language fluency tests. Each non-fluent sentence is corrected for fluency by four fluent English speakers to make the text sound 'natural' (changes include grammar and word usage). We treat each fluent text as d_1, paired with the non-fluent d_2. We generate q by randomly selecting a noun chunk that appears in both versions of the text. (If no such chunk exists, we discard the pair.)
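A sketch of this query-generation step (ours; the spaCy model name is an assumption):

    import random
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def make_query(fluent: str, nonfluent: str, seed: int = 0):
        """Pick a noun chunk shared by both versions; None means discard the pair."""
        chunks_fluent = {c.text.lower() for c in nlp(fluent).noun_chunks}
        chunks_nonfluent = {c.text.lower() for c in nlp(nonfluent).noun_chunks}
        shared = sorted(chunks_fluent & chunks_nonfluent)
        return random.Random(seed).choice(shared) if shared else None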

The results for this DTT are shown in Table 3. We observe that the models most substantially impacted by shuffling words in a text (EPIC, BERT, and T5) are also most affected here. However, the magnitude of the effect is much lower, considering that the corrections are often rather minor. We see similar results when controlling for spelling (i.e., when both versions use correct spelling). We note that in this case, a large proportion of the effects are neutral (i.e., within δ), meaning that they may not affect the ordering of ranked lists. For instance, 3,124 of the 5,069 samples for EPIC have a neutral effect. The fluency experiments indicate that the shuffling results from the TMTs may be partially explained by the fact that shuffling reduces the fluency of the text, and that fluency can have an effect on model scores.

Table 3: Results of Dataset Transfer Tests (DTTs). Positive scores indicate a preference for fluent, formal, and summarized text. Scores marked with * are not statistically significant (see Section 3.4).

6.2 Formality

We hypothesize that the style of a text may also have an influence on learning-to-rank models, potentially due to bias in training data labeling (an annotator may be more likely to select an answer that is written more formally) or due to the pretraining objective (e.g., BERT is trained on books and Wikipedia text, both of which are more formal than much other text online). We test this by building a DTT from the GYAFC dataset (Rao and Tetreault, 2018). This dataset selects sentences from Yahoo Answers and has four annotators make edits to the text that either improve the formality (for text that is informal) or reduce the formality (for text that is already formal). We treat formal text as d_1 and informal text as d_2. Since the text came from Yahoo Answers, we can link it back to the original questions using the Yahoo L6 dataset (https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&did=11). We treat the question (title) as q. In cases where we cannot find the original text, or there are no non-stopword lemmas from q that overlap with both d_1 and d_2, we discard the sample.

Results from this DTT are shown in Table 3. Positive scores indicate a preference for formal text. We split the samples by their category: entertainment (3,149 samples) and family/relationships (3,701 samples). We observe that EPIC and T5 exhibit similar behaviors (preferring formal text in the entertainment category and having no style preference in the family category). This contrasts with BERT's behavior, which shows no significant preference for entertainment but prefers informal text for family. These tests show that new ranking methods can be affected by the textual formality of the document, a quality usually not exhibited by prior models.

6.3 Summarization

Recall from the MMTs in Section 4 that most models do not have strong biases for document lengths, except when textual overlap is held constant (in which case they prefer longer documents). This raises the question: to what extent does the verbosity of the text (i.e., how much the text elaborates on the details) matter to ranking models? To test this, we construct DTTs from summarization datasets. Intuitively, a text's summary will be less verbose than its full text. We utilize two datasets to conduct this test: XSum (Narayan et al., 2018) and CNN/DailyMail (See et al., 2017). The former uses extremely concise summaries of BBC articles, usually consisting of a single sentence. The CNN/DailyMail dataset uses slightly longer bullet-point summaries, usually consisting of around 3 sentences. For these tests, we use the title of the article as q, the summarized text as d_1, and the article body as d_2. When there are no non-stopword lemmas of q that overlap with both d_1 and d_2, we discard the sample. We further subsample the datasets at 10% because they are already rather large.
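A sketch of this construction (ours; `lemmas` is a hypothetical helper returning a text's set of non-stopword lemmas):

    import random

    def build_summarization_dtt(articles, lemmas, subsample=0.1, seed=0):
        """articles: iterable of (title, summary, body) triples -> (q, d1, d2) samples."""
        rng = random.Random(seed)
        samples = []
        for title, summary, body in articles:
            q_terms = lemmas(title)
            if q_terms & lemmas(summary) and q_terms & lemmas(body) and rng.random() < subsample:
                samples.append((title, summary, body))
        return samples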

Results from the summarization DTTs are shown in Table 3. Positive scores indicate a preference for the summarized text. Here, we observe some interesting behaviors. First, BM25 has a strong preference (+0.66) for the summaries in XSum, a moderate preference for summaries in CNN (+0.37), and no significant preference for Daily Mail. This suggests different conventions among the datasets: e.g., XSum (BBC) summaries likely reuse many of the terms from the titles, while the full articles are long (reducing the score) and may not repeat the title terms as often. The preference for summaries in XSum can be seen across all models except T5, which very slightly favors the full text.

The behaviors for the CNN and Daily Mail DTTs vary considerably across models. For instance, EPIC prefers summaries for both (+0.45 and +0.52, respectively), and T5 prefers full text for both (-0.64 and -0.68). These discrepancies warrant exploration in future work.

7 Related Work

Pretrained contextualized language models are neural networks that are initially trained on language modeling objectives and are later fine-tuned on task-specific objectives (Peters et al., 2018). Signals from language modeling can be beneficial for downstream tasks, reducing the amount of in-domain training data required. Among the most well-known of these models are ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and T5 (Raffel et al., 2020).

Ad-hoc retrieval is the task of ranking a collection of documents by relevance to a particular natural-language query. Researchers have found that pretrained contextualized language models can effectively transfer signals to this task, either by using the model directly (i.e., vanilla or mono models) or by using the outputs as features of a larger model (MacAvaney et al., 2019a). There has been a multitude of work in this area, such as strategies for long-passage aggregation (Dai and Callan, 2019b) and efficiency-conscious approaches (Dai and Callan, 2020; Khattab and Zaharia, 2020; Hofstätter et al., 2020). We refer readers to existing surveys for a comprehensive treatment of these techniques. These models significantly outperform prior attempts to use neural networks for ad-hoc ranking and represent the most substantial gains in effectiveness for this task in the past decade (Lin, 2019). Our goal in this work is to help shed light on the mechanisms, strengths, and weaknesses of this burgeoning body of work.

Analyses of recent learning-to-rank models tend to take an empirical approach, because analytic methods are impractical given the models' large number of parameters. Diagnostic datasets, proposed by Rennings et al. (2019), reformulate traditional ranking axioms (e.g., that documents with a higher term frequency should receive a higher ranking score (Fang et al., 2004)) as empirical tests. Rennings et al. studied neural ranking architectures that predate the rise of contextualized language models for ranking, and focused on just four axioms. Câmara and Hauff (2020) extended this work by adding five more previously-proposed ranking axioms (e.g., term proximity (Tao and Zhai, 2007) and word semantics (Fang and Zhai, 2006)) and evaluating on a distilled BERT model. They found that the axioms are inadequate to explain the ranking effectiveness of their model. Unlike these prior lines of work, we propose new tests that shed light onto possible sources of effectiveness, and we test against current leading neural ranking architectures.

Others have attempted to characterize the strengths of BERT-based ranking models. For example, Dai and Callan (2019b) found a BERT model to be more effective at ranking documents for question queries (as compared to keyword-based queries), an interesting finding given that most prior ranking techniques (both neural and traditional) tend to perform better with keyword-based queries. However, the effectiveness may be due to additional information provided by the question queries in the collection, rather than the linguistic characteristics of the queries themselves. Others have found that contextualized language models are more effective at identifying salient terms. For example, doc2query identifies salient parts of a document to generate plausible queries. DeepCT models term salience explicitly, by trying to predict which document terms will appear in a query (Dai and Callan, 2019a). EPIC performs both, jointly modeling query and document term salience while also performing document expansion. However, these techniques use specialized architectures, and do not necessarily imply that vanilla models are effective due to this type of modeling. For instance, doc2query uses a sequential decoder to produce terms to add to the document, a component that is not present in vanilla models. Still others have found that contextualized embedding similarity can be a useful signal for ranking (MacAvaney et al., 2019a; Hofstätter et al., 2020; Khattab and Zaharia, 2020). However, these architectures only show that contextualized embedding similarity signals can be used for ranking, not what characteristics these embeddings capture. Rather than proposing new ranking models, in this work we analyze the effectiveness of existing models using controlled diagnostic tests, which allow us to gain insights into the particular behaviors and preferences of the ranking models.

Outside of work in IR, others have developed techniques for investigating the behavior of contextualized language models in general. Although probing techniques (Tenney et al., 2019) and attention analysis (Serrano and Smith, 2019) can be beneficial for understanding model capabilities, these techniques cannot by themselves characterize and quantify the behaviors of neural ranking models. CheckList (Ribeiro et al., 2020) and other challenge set techniques (McCoy et al., 2019) differ conceptually from our goals; we aim to characterize model behaviors in order to understand the qualities of ranking models, rather than to provide additional measures of model quality.

8 Conclusion

We presented a new framework (ABNIRML) for analyzing ranking models based on three testing strategies: Measure and Match Tests (MMTs), Textual Manipulation Tests (TMTs), and Dataset Transfer Tests (DTTs). By using these tests, we demonstrated that a variety of insights can be gained about the behaviors of recently-proposed ranking models, such as those based on BERT and T5. Our analysis is, to date, the most extensive analysis of the behaviors of neural ranking models, and sheds light on several unexpected model behaviors. For instance, adding non-relevant text can increase a document's ranking score, even though the models are largely not biased towards longer documents. We also see that the same base language model used with a different ranking architecture can yield different behaviors, such as higher sensitivity to shuffling a document's text. Meanwhile, different language models can be sensitive to different characteristics, such as the importance of prepositions.

[2] In the case of TF, it has long been used as a core signal for ranking algorithms; a departure from monotonically increasing the score of a document as TF increases would represent a fundamental shift in the notion of relevance scoring (Fang et al., 2004).

[3] An example is TFC1 from (Fang et al., 2004), which suggests that higher TFs should be mapped to higher relevance scores.

[4] We omit variable TF with control Sum-TF and vice versa because they result in no pairs. Some of these tests correspond with traditional ranking axioms from (Fang et al., 2004), such as TFC1 and LNC1. A variable TF when controlled for length corresponds with TFC1, and a variable length when controlled for TF corresponds with LNC1.

[6] We recognize that this is a somewhat strange setting: the DocT5Query model generates new terms based on a version of the document that already has DocT5Query expansion terms added. In other words, expansion is conducted on a document that is already expanded.