# Not All Claims are Created Equal: Choosing the Right Approach to Assess Your Hypotheses

## Authors

## Abstract

Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community. We address this gap by contrasting various hypothesis assessment techniques, especially those not commonly used in the field (such as evaluations based on Bayesian inference). Since these statistical techniques differ in the hypotheses they can support, we argue that practitioners should first decide their target hypothesis before choosing an assessment method. This is crucial because common fallacies, misconceptions, and misinterpretation surrounding hypothesis assessment methods often stem from a discrepancy between what one would like to claim versus what the method used actually assesses. Our survey reveals that these issues are omnipresent in the NLP research community. As a step forward, we provide best practices and guidelines tailored to NLP research, as well as an easy-to-use package called 'HyBayes' for Bayesian assessment of hypotheses, complementing existing tools.

## 1 Introduction

Empirical fields, such as Natural Language Processing (NLP), must follow scientific principles for assessing hypotheses and drawing conclusions from experiments. For instance, suppose we come across the results in Table 1, summarizing the accuracy of two question-answering (QA) systems S1 and S2 on some datasets. What is the correct way to interpret this empirical observation in terms of the superiority of one system over another? While S1 has higher accuracy than S2 in both cases, the gap is moderate and the datasets are of limited size. Can this apparent difference in performance be explained simply by random chance, or do we have sufficient evidence to conclude that S1 is in fact inherently different (in particular, inherently stronger) than S2 on these datasets? If the latter, can we quantify this gap in inherent strength while accounting for random fluctuation?

Table 1 (caption): Performance of two systems (Devlin et al., 2019; Sun et al., 2018) on the ARC question-answering dataset (Clark et al., 2018). ARC-easy and ARC-challenge have 2376 and 1172 instances, respectively. Acc.: accuracy as a percentage.

Such fundamental questions arise in one form or another in every empirical NLP effort. Researchers often wish to draw conclusions such as:

(Ca) I'm 95% confident that S1 and S2 are inherently different, in the sense that if they were inherently identical, it would be highly unlikely to witness the observed 3.5% empirical gap for ARC-easy.

(Cb) With probability at least 95%, the inherent accuracy of S1 exceeds that of S2 by at least 1%.

These two conclusions differ in two respects. First, Ca claims the two systems are inherently different, while Cb goes further to claim a margin of at least 1% between their inherent accuracies. The second, more subtle difference lies in the interpretation of the 95% figure: the 95% confidence expressed in Ca is in terms of the space of empirical observations we could have made, given some underlying truth about how the inherent accuracies of S1 and S2 relate; while the 95% probability expressed in Cb is directly over the space of possible inherent accuracies of the two systems.

arXiv:1911.03850v1 [cs.CL] 10 Nov 2019

To support such a claim, one must turn it into a proper mathematical statement that can be validated using a statistical calculation. This in turn brings in additional choices: we can make at least four statistically distinct hypotheses here, each supported by a different statistical evaluation:

(H1) Assuming S1 and S2 have inherently identical accuracy, the probability (p-value) of making a hypothetical observation with an accuracy gap at least as large as the empirical observation (here, 3.5%) is at most 5% (making us 95% confident that the above assumption is false).

(H2) Assuming S1 and S2 have inherently identical accuracy, the empirical accuracy gap (here, 3.5%) is larger than the maximum possible gap (confidence interval) that could hypothetically be observed with a probability of over 5% (making us 95% confident that the above assumption is false).

(H3) Assume a prior belief (a probability distribution) w.r.t. the inherent accuracy of a typical system. Given the empirically observed accuracies, the probability (posterior interval) that the inherent accuracy of S1 exceeds that of S2 by a margin of 1% is at least 95%.

(H4) Assume a prior belief (a probability distribution) w.r.t. the relative inherent accuracies of S1 and S2. Given the empirically observed accuracies, the odds increase by a factor of 1.32 (Bayes factor) in favor of the hypothesis that the inherent accuracy of S1 exceeds that of S2 by a margin of 1%.

As this illustrates, there are multiple ways to formulate empirical hypotheses and support empirical claims. Since each hypothesis starts with a different assumption and makes a (mathematically) different claim, it can only be tested with a certain set of statistical methods. Therefore, NLP practitioners ought to define their target hypothesis before any effort to test or assess it.

The most common statistical methodology used in NLP is null-hypothesis significance testing (NHST), which uses p-values (Søgaard et al., 2014; Koehn, 2004; Dror and Reichart, 2018). Hypotheses H1 and H2 can be tested with p-value-based methods, which include confidence intervals and operate over the probability space of observations (§2.1 and §2.2). On the other hand, there are often overlooked approaches, based on Bayesian inference (Kruschke and Liddell, 2018), that can be used to assess hypotheses H3 and H4 (§2.3 and §2.4) and have two broad strengths: they can deal more naturally with accuracy margins, and they operate directly over the probability space of inherent accuracy (rather than of observations).

For each technique reviewed in this work, we discuss how it compares with alternatives and summarize common misinterpretations surrounding it (§3.1). For example, a common misconception about the p-value is that it represents a probability over the validity of a hypothesis. While such an interpretation would be desirable, p-values do not in fact provide it. It is instead through a Bayesian analysis of the posterior distribution of the test statistic (inherent accuracy in the earlier example) that one can make claims about the probability space of that statistic, such as H3.

We quantify and demonstrate the related common malpractices in the field through a manual annotation of 439 ACL'18 papers, and a survey filled out by 55 NLP researchers ( §3). We highlight surprising findings from the survey, such as the following: While 86% expressed fair-to-complete confidence in the interpretation of p-values, only a small percentage of them correctly answered a basic p-value interpretation question.

Contributions. This work seeks to inform the NLP community about crucial distinctions between various statistical hypotheses and their corresponding assessment methods, helping move the community towards well-substantiated empirical claims and conclusions. Our exposition covers a broader range of methods (§2) than those included in recent related efforts (§1.1), and highlights that they measure different goals. Our survey of NLP researchers reveals problematic trends (§3), emphasizing the need for increased scrutiny and clarity. We conclude by suggesting guidelines for better testing (§4), as well as providing a toolkit called HyBayes (cf. Footnote 1) tailored towards commonly used NLP metrics. In summary, this work is expected to encourage a better understanding of statistical assessment methods and effective reporting with measures of uncertainty.

## 1.1 Related Work

While there is abundant discussion of significance testing in other fields, only a handful of NLP efforts address it. For instance, Chinchor (1992) defined the principles of using hypothesis testing in the context of NLP problems. Most notably, there are works studying various randomized tests (Koehn, 2004; Ojala and Garriga, 2010; Graham et al., 2014), or metric-specific tests (Evert, 2004). More recently, Dror and Reichart (2018) provide a thorough review of frequentist tests. While this is an important step in better informing the community, it covers only a subset of statistical tools. Our work complements this effort by pointing out alternative tests.

With increasing over-reliance on certain hypothesis testing techniques, there are growing troubling trends of misuse or misinterpretation of such techniques (Goodman, 2008; Demšar, 2008) . Some communities, such as statistics and psychology, even have published guidelines and restrictions on the use of p-values (Trafimow and Marks, 2015; Wasserstein et al., 2016) . In parallel, some authors have advocated for using alternate paradigms such as Bayesian evaluations (Kruschke, 2010) .

NLP is arguably an equally empirical field. Yet, proper practices of scientific testing, common pitfalls, and various alternatives are rarely discussed in this community. In particular, while the limitations of p-values are highly debated in statistics and psychology, only a few NLP efforts have done so: over-estimation of significance by model-based tests (Riezler and Maxwell, 2005) , lack of independence assumption in practice (Berg-Kirkpatrick et al., 2012) , and sensitivity to the choice of the significance level (Søgaard et al., 2014) . Our goal is to provide a unifying view of the pitfalls and best practices, and equip NLP researchers with Bayesian hypothesis assessment approaches as an important tool in their toolkit.

## 2 Assessment Of Hypotheses

We often wish to draw qualitative inferences based on the outcome of experiments (for example, inferring the relative inherent performance of systems). To do so, we usually formulate a hypothesis that can be assessed through some analysis.

Suppose we want to compare two systems on a dataset of instances $\mathbf{x} = [x_1, \ldots, x_n]$ with respect to a measure $M(S, x)$ representing the performance of system $S$ on an instance $x$. Let $M(S, \mathbf{x})$ denote the vector $[M(S, x_i)]_{i=1}^{n}$. Given systems $S_1, S_2$, let $y = [M(S_1, \mathbf{x}), M(S_2, \mathbf{x})]$ denote a vector of observations. In a typical NLP experiment, the goal is to infer some inherent and unknown properties of systems. To this end, a practitioner assumes a probability distribution on the observations $y$, parameterized by $\theta$, the properties of the systems. In other words, $y$ is assumed to have a distribution with unknown parameters $\theta$. In this setting, a hypothesis $H$ is a condition on $\theta$. Hypothesis assessment is a way of evaluating the degree to which the observations $y$ are compatible with $H$. The overall process is depicted in Fig. 1.

Following our running example, we use the task of answering natural language questions (Clark et al., 2018) . While our examples are shown for this particular task, all the ideas are applicable to more general experimental settings.

For this task, the performance metric $M(S, x)$ is defined as a binary function indicating whether a system $S$ answers a given question $x$ correctly or not. The performance vector $M(S, \mathbf{x})$ captures the system's accuracy on the entire dataset (cf. Table 1). We assume that each system $S_i$ has an unknown inherent accuracy value, denoted $\theta_i$. Let $\theta = [\theta_1, \theta_2]$ denote the unknown inherent accuracies of the two systems. In this setup, one might, for instance, be interested in assessing the credibility of the hypothesis $H$ that $\theta_1 < \theta_2$. Fig. 2 shows a categorization of statistical tools developed for the assessment of such hypotheses. The two tools on the left are based on frequentist statistics, while the ones on the right are based on Bayesian inference (Kruschke and Liddell, 2018). A complementary categorization of these tools is based on the nature of the results that they provide: the ones on the top encourage binary decision making, while those on the bottom provide uncertainty around estimates. We discuss all four classes of tests in the following sub-sections.

## 2.1 Null-Hypothesis Significance Testing

In frequentist hypothesis testing, there is an asymmetric relationship between two hypotheses. The hypothesis formulated to be rejected is usually called the null-hypothesis $H_0$. For instance, in our example $H_0: \theta_1 = \theta_2$. A decision procedure is devised by which, depending on $y$, the null-hypothesis will either be rejected in favor of $H_1$, or the test will stay undecided.

A key notion here is the p-value: the probability, under the null-hypothesis $H_0$, of observing an outcome at least as extreme as the empirical observations $y$. To apply this notion to a set of observations $y$, one has to define a function that maps $y$ to a numerical value. This function is called the test statistic $\delta(\cdot)$, and it formalizes the interpretation of extremeness. Concretely, the p-value is defined as

$$p\text{-value}(y) = P\big(\delta(Y) \geq \delta(y) \mid H_0\big) \tag{1}$$

In this notation, Y is a random variable over possible observations and δ(y) is the empirically observed value of the test statistic. A large p-value implies that the data could easily have been observed under the null-hypothesis. Therefore, a lower p-value is used as evidence towards rejecting the null-hypothesis.

Example 1 (Assessment of H1) We assess the accuracy of the two systems (Table 1) using a one-sided z-test with $\delta(y) = \frac{1}{n} \sum_{i=1}^{n} [M(S_1, x_i) - M(S_2, x_i)]$. We formulate a null-hypothesis against the claim of $S_1$ having strictly better accuracy than $S_2$. This results in a p-value of 0.0037 (details in §A.1), which can be interpreted as follows: if the systems have inherently identical accuracy values, the probability of observing a superiority at least as extreme as our observations is 0.0037. For a significance level of 0.05 (picked before the test), this p-value is small enough to reject the null-hypothesis.
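The computation behind a test like the one in Example 1 can be sketched as follows. This is a minimal illustration that assumes an unpaired, pooled two-proportion z-test, which may differ from the exact procedure in §A.1. The counts `CORRECT_S1` and `CORRECT_S2` are hypothetical placeholders, chosen only to give a gap of roughly 3.5% on 2376 instances; they are not taken from Table 1.

```python
import math


def one_sided_z_test(correct_1, correct_2, n_1, n_2):
    """One-sided z-test of H0: theta_1 == theta_2 against H1: theta_1 > theta_2.

    Returns the z statistic and the p-value P(Z >= z | H0).
    """
    acc_1 = correct_1 / n_1
    acc_2 = correct_2 / n_2
    # Under H0 both systems share one accuracy, estimated by pooling the counts.
    pooled = (correct_1 + correct_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_1 + 1.0 / n_2))
    z = (acc_1 - acc_2) / std_err
    # Upper-tail probability of a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts (assumption): not the values from Table 1.
N = 2376
CORRECT_S1 = 1720
CORRECT_S2 = 1637

z, p_value = one_sided_z_test(CORRECT_S1, CORRECT_S2, N, N)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```

The p-value printed here answers only the question posed by H1: how surprising the observed gap would be if the two inherent accuracies were identical.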

This family of tests is by far the most widely used tool in NLP research. Each variant of this test is based on some assumptions about the distribution of the observations under the null-hypothesis, and on an appropriate definition of the test statistic $\delta(\cdot)$. Since a complete exposition of such tests is outside the scope of this work, we encourage interested readers to refer to the existing reviews (see §1.1).

## 2.2 Confidence Intervals

Confidence Intervals (CIs) are used to express the uncertainty of estimated parameters. In particular, the 95% CI is the range of values for parameter θ such that the corresponding test based on p-value is not rejected:

$$\mathrm{CI}_{95\%}(y) = \big\{ \theta' : P\big(\delta(Y) \geq \delta(y) \mid H_0\colon \theta = \theta'\big) \geq 0.05 \big\} \tag{2}$$

In other words, the confidence interval merely asks which values of the parameter θ the corresponding test would not reject.

Example 2 (Assessment of H2) Consider the same setting as in Example 1. According to the experimental result in Table 1, the estimated value of the accuracy difference (maximum-likelihood estimate) is $\theta_1 - \theta_2 = 0.035$. A 95% CI of this quantity provides us with a range of values that are not rejected under the corresponding null-hypothesis. In particular, a 95% CI gives us $\theta_1 - \theta_2 \in [0.0136, 0.057]$ (details in §A.2).

The blue bar in Fig.2 (right) shows the corresponding CI. Notice that the conclusion of Example 1 is compatible with this confidence interval; the null-hypothesis θ 1 = θ 2 which got rejected is not included in the CI.
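One common way to obtain an interval of this kind is the normal-approximation (Wald) interval for a difference of two proportions, sketched below. This is an assumption about the construction, not a reproduction of §A.2, so the printed interval need not coincide with the one reported in Example 2. As before, the counts are hypothetical placeholders rather than the values in Table 1.

```python
import math
from statistics import NormalDist


def wald_interval(correct_1, correct_2, n_1, n_2, level=0.95):
    """Two-sided normal-approximation interval for theta_1 - theta_2."""
    acc_1 = correct_1 / n_1
    acc_2 = correct_2 / n_2
    diff = acc_1 - acc_2
    # Unpooled standard error: each system keeps its own estimated accuracy.
    std_err = math.sqrt(acc_1 * (1.0 - acc_1) / n_1 + acc_2 * (1.0 - acc_2) / n_2)
    # Critical value of the standard normal for the requested coverage level.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff - z_crit * std_err, diff + z_crit * std_err


# Hypothetical counts (assumption): not the values from Table 1.
N = 2376
low, high = wald_interval(1720, 1637, N, N)
print(f"95% interval for theta_1 - theta_2: [{low:.4f}, {high:.4f}]")
print("zero excluded:", not (low <= 0.0 <= high))
```

Checking whether zero falls inside the interval mirrors the observation above that a rejected null value is not contained in the CI.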

## 2.3 Posterior Intervals

Bayesian methods focus on prior and posterior distributions of θ. Recall that in a typical NLP experiment, these parameters can be, e.g., the actual mean or standard deviation of the performance of a system, as its inherent and unobserved property. In Bayesian inference frameworks, a priori assumptions and beliefs are encoded in the form of a prior distribution P(θ) on the parameters of the model. In other words, a prior distribution describes the common belief about the parameters of the model. It also implies a distribution over possible observations. For assessing hypotheses H3 and H4 in our running example, we will simply use the uniform prior, i.e., the inherent accuracy is uniformly distributed over [0, 1]. This corresponds to having no prior belief about how high or low the inherent accuracy of a typical QA system may be.

In general, the choice of this prior can be viewed as a compromise between the beliefs of the analyzer and those of the audience. The above uniform prior, which is equivalent to the Beta(1,1) distribution, is completely non-committal and thus best suited for a broad audience who has no reason to believe an inherent accuracy of 0.8 is more likely than 0.3. For a moderately informed audience that already believes the inherent accuracy is likely to be widely distributed but centered around 0.67, the analyzer may use a Beta(3,1.5) prior to evaluate a hypothesis. Similarly, for an audience that already believes the inherent accuracy to be highly peaked around 0.75, the analyzer may want to use a Beta(9,3) prior.

Formally, one incorporates θ in a hierarchical model in the form of a likelihood function $P(y \mid \theta)$. This explicitly models the underlying process that connects the latent parameters to the observations. Consequently, a posterior distribution is inferred using the Bayes rule and conditioned on the observations:

$$P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}.$$

The posterior distribution is a combined summary of the data and prior information, about likely values of θ. The mode of the posterior (maximum a posteriori) can be seen as an estimate for θ. Additionally, the posterior can be used to describe the uncertainty around the mode.

While the posterior distribution can be analytically calculated for simple models, it is not so straightforward to compute for general models. Fortunately, recent advances in hardware, MCMC techniques and probabilistic programming 6 allow us to numerically obtain sufficiently-accurate approximations of posteriors.
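To make the idea of a numerical posterior concrete, the sketch below uses grid approximation, which is a much simpler technique than MCMC and is swapped in here purely for illustration. It assumes a single system with a binomial likelihood and the uniform prior from above; the counts are hypothetical. For this simple model the exact posterior is Beta(1 + k, 1 + n − k), so the numerical answer can be checked against the closed-form mean (k + 1)/(n + 2).

```python
import math


def grid_posterior(correct, n, grid_size=2001):
    """Approximate P(theta | y) on an evenly spaced grid over [0, 1].

    Assumes a binomial likelihood and a uniform prior, so the unnormalized
    posterior is theta**correct * (1 - theta)**(n - correct).
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    log_weights = []
    for theta in thetas:
        if theta <= 0.0 or theta >= 1.0:
            # Endpoints get zero weight (exact whenever 0 < correct < n).
            log_weights.append(float("-inf"))
        else:
            log_weights.append(
                correct * math.log(theta) + (n - correct) * math.log(1.0 - theta)
            )
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    peak = max(log_weights)
    weights = [math.exp(lw - peak) for lw in log_weights]
    total = sum(weights)
    return thetas, [w / total for w in weights]


# Hypothetical counts (assumption): 1720 correct answers out of 2376.
thetas, probs = grid_posterior(1720, 2376)
approx_mean = sum(t * p for t, p in zip(thetas, probs))
exact_mean = (1720 + 1) / (2376 + 2)
print(f"grid mean = {approx_mean:.5f}, closed-form mean = {exact_mean:.5f}")
```

Grid approximation scales poorly with the number of parameters, which is one reason general models rely on MCMC instead.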

One way to summarize the uncertainty around the point estimate of parameters is by marking the span of values that cover α% of the most credible density in the posterior distribution (e.g., α = 95%). This span is called the Highest Density Interval (HDI), or Bayesian Confidence Interval (Oliphant, 2006) (not to be confused with the CI in §2.2).

Recall that a hypothesis H is a condition on θ (see Fig.1 ). Therefore, given the posterior P(θ|y), one can calculate the probability of H, as a probabilistic event, conditioned on y: P(H|y).

For example, in an unpaired t-test, $H_0$ is the event that the means of two groups are equal. Bayesian statisticians usually relax this strict equality $\theta_1 = \theta_2$ and instead evaluate the credibility of $|\theta_1 - \theta_2| < \varepsilon$ for some small value of $\varepsilon$. The intuition is that when $\theta_1$ and $\theta_2$ are close enough, they are practically equivalent. This motivates the definition of the Region Of Practical Equivalence (ROPE): an interval around zero with "negligible" radius. The boundaries of the ROPE depend on the application, the meaning of the parameters, and the audience. In our running example, a radius of one percent for the ROPE implies that improvements of less than 1 percent are not considered notable. For a discussion on setting the ROPE, see Kruschke (2018).

These concepts give researchers the flexibility to define and assess a wide range of hypotheses. For instance, we can address H3 (from the Introduction) and its different variations that can be of interest depending on the application. The analysis of H3 is depicted in Fig. 2 and explained next.

Example 3 (Assessment of H3) Recall the setting from previous examples. The left panel of Fig. 2 shows the prior on the latent accuracy values of the systems and their differences (further details on the hierarchical model in §A.3.) Then the posterior distribution (Fig.2, right) is obtained (in this case, through numerical methods).

From this posterior one can read off the following conclusion: with probability 0.996, hypothesis H3 (with x = 0) holds true. As discussed in §3.1, this statement does not imply any difference with a notable margin. In fact, the posterior in Fig. 2 implies that this experiment is not sufficient to claim the following statement: with probability at least 0.95, hypothesis H3 (with x = 1) holds true.
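The posterior quantities used in this section (the probability of a hypothesis, the HDI, and its relation to the ROPE) can be sketched with a simple conjugate model. The code below is an illustrative simplification, not the hierarchical model of §A.3: it assumes independent uniform Beta(1,1) priors and binomial likelihoods, so each posterior is a Beta distribution that can be sampled directly. The counts are hypothetical placeholders, and the function names are introduced here only for illustration.

```python
import math
import random


def posterior_difference_samples(correct_1, correct_2, n_1, n_2,
                                 num_samples=20000, seed=0):
    """Draw samples of theta_1 - theta_2 from independent Beta posteriors."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        # Uniform prior + binomial likelihood gives Beta(1 + k, 1 + n - k).
        theta_1 = rng.betavariate(1 + correct_1, 1 + n_1 - correct_1)
        theta_2 = rng.betavariate(1 + correct_2, 1 + n_2 - correct_2)
        samples.append(theta_1 - theta_2)
    return samples


def probability_exceeds(samples, margin):
    """Monte Carlo estimate of P(theta_1 - theta_2 > margin | y)."""
    return sum(s > margin for s in samples) / len(samples)


def highest_density_interval(samples, mass=0.95):
    """Narrowest interval that contains `mass` of the sampled values."""
    ordered = sorted(samples)
    window = math.ceil(mass * len(ordered))
    start = min(range(len(ordered) - window + 1),
                key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[start], ordered[start + window - 1]


def compare_with_rope(hdi, radius):
    """Classify an HDI against the ROPE [-radius, radius]."""
    low, high = hdi
    if -radius <= low and high <= radius:
        return "inside"
    if high < -radius or low > radius:
        return "outside"
    return "overlaps"


# Hypothetical counts (assumption): not the values from Table 1.
samples = posterior_difference_samples(1720, 1637, 2376, 2376)
hdi = highest_density_interval(samples)
print("P(theta_1 - theta_2 > 0 | y) ~", probability_exceeds(samples, 0.0))
print("95% HDI:", hdi, "vs. ROPE of radius 0.01:", compare_with_rope(hdi, 0.01))
```

Because the model and counts are assumptions, the numbers printed here are only indicative and should not be read as a reproduction of Fig. 2.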

## 2.4 Bayes Factor

A common tool among Bayesian frameworks is the notion of Bayes Factor (see footnote 8). Intuitively, the Bayes Factor compares how the observations $y$ shift the credibility of two competing hypotheses from prior to posterior:

$$\mathrm{BF}_{01} = \frac{P(H_0 \mid y)}{P(H_1 \mid y)} \bigg/ \frac{P(H_0)}{P(H_1)}$$

If $\mathrm{BF}_{01}$ equals 1, then the data provide equal support for the two hypotheses and there is no reason to change our a priori opinion about the relative likelihood of the two hypotheses. A Bayes Factor smaller than 1 is an indication for rejecting the null-hypothesis $H_0$. If it is greater than 1, then there is support for the null-hypothesis and we should infer that the odds are in favor of $H_0$.

Notice that the symmetric nature of Bayes Factor allows all the three outcomes of "accept", "reject", and "undecided," as opposed to the definition of p-value that cannot accept a hypothesis.

Example 4 (Assessment of H4) Here we want to assess the null-hypothesis $H_0: |\theta_1 - \theta_2| < 0.01$ against $H_1: |\theta_1 - \theta_2| \geq 0.01$ ($x = 0.01$). Substituting posterior and prior values, one obtains:

$$\mathrm{BF}_{01} = \frac{0.027}{0.980} \bigg/ \frac{0.019}{0.972} = 1.382.$$

This value is very close to 1, which means that this observation does not change our prior belief about the difference between the two systems.

Footnote 8: "Bayesian Hypothesis Testing" usually refers to the arguments based on "Bayes Factor." However, as shown in §2.3, there are other Bayesian approaches for assessing hypotheses.
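The arithmetic of Example 4 is simply a ratio of posterior odds to prior odds, as the short sketch below shows. The function name is introduced here for illustration. Note that plugging in the four rounded probabilities exactly as printed in Example 4 yields roughly 1.41 rather than the reported 1.382; the gap presumably reflects rounding of the displayed inputs, which is an assumption on our part. Either way, the value stays close to 1.

```python
def bayes_factor_01(posterior_h0, posterior_h1, prior_h0, prior_h1):
    """Bayes Factor BF_01: posterior odds of H0 divided by prior odds of H0."""
    posterior_odds = posterior_h0 / posterior_h1
    prior_odds = prior_h0 / prior_h1
    return posterior_odds / prior_odds


# Rounded probabilities as printed in Example 4.
bf_01 = bayes_factor_01(0.027, 0.980, 0.019, 0.972)
print(f"BF_01 = {bf_01:.3f}")
```

A value above 1 shifts the odds towards $H_0$, and a value below 1 shifts them towards $H_1$; swapping the roles of the two hypotheses gives the reciprocal.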

## 3 Trends And Comparisons

When it comes to choosing an approach to assess the significance of hypotheses, there are many aspects that must be taken into account. This section highlights common practices relevant to our target methods and compares them.

To better understand the common practices or misinterpretations in the field, we conduct a survey. We share the survey among ∼450 NLP researchers (randomly selected from ACL'18 Proceedings) from which 55 individuals filled out the survey. While similar surveys have been performed in other fields (Windish et al., 2007) , this is the first in the NLP community, to the best of our knowledge. Here we review the main highlights (see Appendix for more details and charts).

Interpreting p-values. While the majority of the participants have a self-claimed ability to interpret p-values (Fig. 11), many choose the imprecise interpretation "The probability of the observation this extreme happening due to pure chance" (the popular choice) over the more precise statement "Conditioned on the null hypothesis, the probability of the observation this extreme happening." (Fig. 14, 15).

The use of CIs. Even though 95% of the participants self-claimed knowledge of CIs (Fig. 13), CIs are rarely used in practice. In an annotation of ACL'18 papers by two of the authors, only 6 (out of 439) papers were found to use CIs.

The use of Bayes Factors. A majority of the participants had "heard" about "Bayesian Hypothesis Testing" but did not know the definition of "Bayes Factor" (Fig. 3) . HDIs (discussed in §2.3) were the least known. Additionally, we did not find any papers in ACL'18 that use Bayesian tools for assessment of hypotheses.

The use of "significan*". A notable portion of NLP papers express their findings by using the term "significant" (e.g., "our approach improves over the baseline significantly."). Almost all ACL'18 papers use the term "significant" somewhere in their paper. Unfortunately, there is no single universal interpretation of such phrases across readers. In our survey, we observe that when participants read "X significantly improves Y" in the abstract of a hypothetical paper:

1. About 82% expect the claim to be backed by "hypothesis testing"; however, only 57% expect notable empirical improvement (see Q3 in Appendix; Fig. 16);

2. About 35% expect the paper to test "practical significance", which is not generally assessed by popular tests (see §3.1);

3. A small fraction also expect a theoretical argument.

Table 3 provides a summary of the techniques studied here. We make two key observations: (i) many papers do not use any hypothesis assessment method and would benefit from one; (ii) from the final column, p-value based techniques clearly dominate the field. However, as we will delineate later, the bottom two alternatives offer multiple advantages.

## 3.1 Comparison Of Assessment Methods

Ease of Interpretation. The complexity of interpreting significance tests can result in ambiguous or misleading reports. Among the techniques studied here, p-values have received by far the most criticism: although they are the most common approach, their complex definition makes them easy to misinterpret. Interpreting CIs can also be challenging, since they are an extension of p-values (Hoekstra et al., 2014). Tests that provide measures of uncertainty (like the ones in §2.3) are more natural choices for reporting the results of experiments (Kruschke and Liddell, 2018). Overall, our view on the "ease of interpretation" of the approaches is summarized in Table 3.

Measures of Certainty. p-values do not provide probability estimates on the systems being different (or equal) (Goodman, 2008; Wasserstein et al., 2016) (see §2.1). Additionally, they encourage binary thinking (Gelman, 2013; Amrhein et al., 2017). A binary significance test cannot say anything about the relative merits of two hypotheses (the null and the alternative), since it is calculated assuming that the null-hypothesis is true. CIs provide a range of values for the target parameter; however, this range does not have any probabilistic interpretation (du Prel et al., 2009). Among the Bayesian analyses, reporting posterior intervals (§2.3) generally provides the most useful summary, as they provide uncertainty estimates.

Flexibility in the Choice of Hierarchical Model. The commonly used tests through p-values provide a few options (e.g., t-test, z-test, etc.), which leave little room for researchers to incorporate their specific assumptions into their test. Advocates of the Bayesian approaches argue that they provide a more flexible framework for integrating a variety of information about a property of interest. This is especially true with the advent of faster computational tools that make it fairly easy to code a custom hierarchical model that best incorporates the assumptions of the problem at hand. This is further discussed in (Wetzels et al., 2009; Andraszewicz et al., 2015). In this respect, the approaches in §2.3 and §2.4 are better suited to take the specifics of each setting into account.

Practical Significance. Statistical significance is different from practical significance (Berger and Sellke, 1987). While Ex. 1 showed that the difference between systems $S_1$ and $S_2$ is statistically significant (H1), it does not provide any intuition on the magnitude of their difference (the x parameter in H2-4). In other words, a small p-value does not necessarily mean that the effect is practically important. CIs alleviate this issue by providing the range of parameters that are compatible with the data; however, they do not provide probability estimates (Kruschke and Liddell, 2018). Bayesian analysis provides probability distributions over target parameters. This allows users to report uncertainty estimates for hypotheses that encode the margins of the effects.

Sensitivity to the Choice of Prior. The choice of prior can change the resulting posteriors (both §2.3 and §2.4). It is a well-known issue that decisions based on the Bayes Factor (§2.4) are highly sensitive to the choice of prior, while the posterior estimates in §2.3 are less so (Sinharay and Stern, 2002; Vanpaemel, 2010). Since p-values and CIs do not depend on a prior, they are not subject to this issue.
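The posterior side of this claim can be illustrated with the three priors mentioned in §2.3. The sketch below assumes a single system with a binomial likelihood and a Beta(a, b) prior, for which the posterior mean has the closed form (a + k)/(a + b + n). The counts are hypothetical, and the example says nothing about the sensitivity of Bayes Factors, which would require a separate calculation.

```python
PRIORS = {
    "Beta(1,1)": (1.0, 1.0),    # uniform, non-committal
    "Beta(3,1.5)": (3.0, 1.5),  # widely spread, centered around 0.67
    "Beta(9,3)": (9.0, 3.0),    # peaked around 0.75
}


def posterior_means(correct, n):
    """Posterior mean of the inherent accuracy under each prior in PRIORS."""
    return {
        name: (a + correct) / (a + b + n)
        for name, (a, b) in PRIORS.items()
    }


def spread(values):
    """Largest disagreement between the posterior means of different priors."""
    return max(values) - min(values)


# Hypothetical counts (assumption): one large and one small evaluation set.
large = posterior_means(1720, 2376)
small = posterior_means(14, 20)
print("large dataset:", large, "spread:", spread(large.values()))
print("small dataset:", small, "spread:", spread(small.values()))
```

With thousands of instances the three priors lead to nearly identical posterior means, whereas with twenty instances the prior visibly shapes the estimate, which is why the choice of prior should be reported and justified.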

## 4 Recommended Practices

Given the discussion of common issues in the previous section, we provide a collection of recommendations (in addition to prior recommendations). The first step is to define your goal. Each of the tools in §2 provides a distinct set of information. Therefore, you need to formalize a hypothesis and, consequently, the question you intend to answer by assessing this hypothesis. Here are four representative questions, one for each method:

1. Assuming that the null-hypothesis is true, is it likely to witness observations this extreme? (§2.1)

2. How much can my null-hypothesis deviate from the mean of the observations before a p-value argument rejects it? (§2.2)

3. Having observed the observations, how probable is my claimed hypothesis? (§2.3)

4. By observing the data, how much do the odds increase in favor of the hypothesis? (§2.4)

If you decide to use frequentist tests:

• Check if your setting is compatible with the assumptions of the test. In particular, investigate if the meaning of null-hypothesis and sampling distribution match the experimental setting. • Include the summary of the above investigation.

Justify potential assumption mismatches that you could not resolve. • The statements reporting p-value and confidence interval need to be precise. The statements need to be accurate-enough so that the results are not misinterpreted (see §C.1). • The term "significant" should be used with caution and clear purpose in order to not cause any misinterpretations (see §3). One way to achieve this is by using adjectives "statistical" or "practical" before any (possibly inflected) usage of For each group: mu ∈ R+ (rate parameter) alpha ∈ R+ (shape parameter)

The count of certain patterns an algorithm could find in a big pool, in a fixed amount of time. Notice that you can't convert this into a ratio form, since there is no welldefined denominator. Ex: measuring how many of questions could be answered correctly (from an infinite pool of questions) by a particular QA systems, in a limited minute (the system is allowed to skip the questions too)

## Bootstrap / Permutation

Ordinal model ordinals Normal distribution with parameterized tresholds 2 For each group: mu ∈ R and sigma ∈ R+ Shared between groups: thresholds between possibe levels Collection of objects/labels arranged in a certain ordering, not necessarily with a metric distance between them; for example sentiment labels (https://www. aclweb.org/anthology/S16-1001.pdf), product review categories, grammaticality of sentences bootstrap / permutation Assumption 1: The observations are distributed as a t-student with unknown normality parameter (a normal distribution with potentially longer tales).

Assumption 2: The observations from each group are assumed to be i.i.d, conditioned on the inherent characterstics of two systems

Assumption 3: The total number of instances (the denominators) is known.

Assumption 4: The variable is inherently continuous, or the granularity (the denominator) is high enough to treat the variable as continuous.

Assumption 5: The observations follow a Negative-Binomail / Poisson distribution.

Assumption 6: The observations follow a binomial-distribution.

Assumption 7: The observations follow a normal distribution. * In this model (unlike frequentist t-test) outliers don't need to be discarded manually to realize the strict normality assumption. "significance." • Often times, a notable margin in the superiority of one system over another is desired to conclude from the analysis (see §3.1). In these cases, a pointwise p-value argument is not enough. It is highly recommended to perform a thorough confidence interval analysis. If the researcher finds CI not to be applicable in a specific circumstance, it should also be mentioned. If you decide to use Bayesian approaches:

• Since Bayesian tests are less known, it is better to provide a short motivation for the usage.

• Familiarize yourself with packages that help you decide on a hierarchical model, e.g., the software provided here. If necessary, customize these models for your specific problem.

• Be clear about your hierarchical model, any parameters in the model, and the choice of priors. In most cases, these choices need to be justified (see §2.3).

• Comment on the certainty (or the lack thereof) of your inference in terms of HDI and ROPE: (I) the HDI is completely inside the ROPE, (II) they are completely disjoint, or (III) the HDI contains values both inside and outside the ROPE (see §2.3).

• For reproducibility, include further details about your test: MCMC traces, convergence plots, etc. Note that the software accompanying the paper provides all of these.

• Be wary that the Bayes Factor is highly sensitive to the choice of prior (see §3.1). See Appendix §C.5 for possible ways to mitigate this.
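The three HDI/ROPE cases listed above can be turned into a small decision helper. The following is a minimal sketch, not part of HyBayes: the function name `hdi_rope_decision`, the outcome labels, and the example intervals are all hypothetical and chosen only for illustration.

```python
def hdi_rope_decision(hdi, rope):
    """Classify an HDI against a ROPE (both given as (low, high) tuples).

    Returns one of three labels (hypothetical names):
      "inside"   - case (I):   HDI completely inside the ROPE
      "disjoint" - case (II):  HDI and ROPE do not overlap
      "overlap"  - case (III): HDI has values inside and outside the ROPE
    """
    hdi_low, hdi_high = hdi
    rope_low, rope_high = rope
    if rope_low <= hdi_low and hdi_high <= rope_high:
        return "inside"
    if hdi_high < rope_low or hdi_low > rope_high:
        return "disjoint"
    return "overlap"


# Illustrative intervals for the difference in accuracy, with a ROPE of +/- 1 point.
rope = (-0.01, 0.01)
print(hdi_rope_decision((-0.005, 0.004), rope))  # inside
print(hdi_rope_decision((0.020, 0.060), rope))   # disjoint
print(hdi_rope_decision((0.005, 0.040), rope))   # overlap
```

Reporting which of the three cases holds, together with the interval endpoints, makes the certainty of the inference explicit.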

## 4.1 Package HyBayes

We provide an accompanying package, HyBayes, to facilitate comparing systems using the two Bayesian hypothesis assessment approaches discussed earlier: (a) inference with posterior probabilities and (b) Bayes Factors. Table 4 summarizes common settings in which HyBayes can be used (at the time of this publication). These models cover several typical assumptions on observed data; however, if a user has specific information on their observation or other assumptions, it is highly recommended to add a custom model, which can be done relatively easily.

## 5 Conclusion

Having solid mechanisms for hypothesis assessment is crucial for any field that relies on empirical work. The NLP community is not fully utilizing scientific assessment of hypotheses: a relatively small number of experimental works use such tests, and almost all of those are based on p-values.

In this work, our goal was to review different alternatives, especially a few that are ignored in the NLP community. We compared different issues and potential dangers of careless use and interpretations of the different tools.

We did not intend to recommend a particular approach. Every technique has its own weaknesses. Therefore, a researcher should pick the right approach according to their needs and intentions. And in doing so, they have to do it with a proper understanding of the techniques. Mindless use of any of the techniques could result in misleading conclusions.

We contribute a new toolkit, HyBayes, to make it easy for NLP practitioners to use Bayesian assessment in their efforts. We hope that this work provides a complementary picture of hypothesis assessment techniques for the field and encourages a more rigorous treatment of such techniques.

## A Details Of Examples

## A.1 More Details On Example 1

One-sided Z-test to compare $s_1 = 1721$ out of 2376 vs. $s_2 = 1637$ out of 2376:

We start by calculating the Z-score:

$$
\begin{aligned}
s_1 &= 1721, \quad n_1 = 2376 && (3)\\
s_2 &= 1637, \quad n_2 = 2376 && (4)\\
p_1 &= s_1 / n_1 = 0.72432 && (5)\\
p_2 &= s_2 / n_2 = 0.68897 && (6)\\
p &= (s_1 + s_2) / (n_1 + n_2) = 0.70664 && (7)\\
z &= \frac{p_1 - p_2}{\sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = 2.6763676 && (8)
\end{aligned}
$$

We can then read the one-sided tail probability corresponding to a Z-score of 2.6763 from a Z-score table: 0.00372124.
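The calculation above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `one_sided_z_test` is hypothetical, and the tail probability is computed with the complementary error function instead of a Z-score table.

```python
import math


def one_sided_z_test(s1, n1, s2, n2):
    """Pooled two-proportion Z-test; returns (z, one-sided p-value)."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper tail of the standard normal: P(Z > z) = erfc(z / sqrt(2)) / 2
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


z, p_value = one_sided_z_test(1721, 2376, 1637, 2376)
print(f"z = {z:.4f}, one-sided p = {p_value:.5f}")
```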

## A.2 More Details On Example 2

Let $\tilde{z}$ denote the Z-score corresponding to 95%, i.e., $\tilde{z} = 1.644853$. Also, let $\sigma$ be the denominator of the Z-score formula above, i.e., $\sigma = \sqrt{p(1 - p)\left((1/n_1) + (1/n_2)\right)}$. Then the confidence interval is calculated as follows:

$$
[\,p_1 - p_2 - \tilde{z}\sigma,\; p_1 - p_2 + \tilde{z}\sigma\,] = [0.0136, 0.057].
$$
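The interval can be reproduced numerically as follows. This is a short sketch under the same pooled-variance definition of σ given above; the function name `difference_interval` and its `z_crit` parameter are hypothetical.

```python
import math


def difference_interval(s1, n1, s2, n2, z_crit=1.644853):
    """Interval for p1 - p2 using the pooled standard error sigma."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    sigma = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = p1 - p2
    return diff - z_crit * sigma, diff + z_crit * sigma


low, high = difference_interval(1721, 2376, 1637, 2376)
print(f"[{low:.4f}, {high:.4f}]")
```

Because the whole interval lies above zero but its lower end is close to 0.01, the data support strict superiority much more clearly than a margin of several points.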

## A.3 More Details On Example 3

Hierarchical model. In this analysis, the input consists of four non-negative integers $a_1$, $n_1$, $a_2$, and $n_2$. The $i$th algorithm has answered $a_i$ (out of $n_i$) questions correctly. In our model, we assume that $a_i$ follows a binomial distribution with parameters $\theta_i$ and $n_i$. Note that this is mathematically the same as considering $n_i$ Bernoulli random variables, with $a_i$ of them being successes and $n_i - a_i$ being failures. At the higher level, $\theta_i$ is assumed to follow a uniform distribution on $[0, 1]$. In a later section, we write this prior as Beta(1, 1), which is the same distribution and generalizes to other Beta priors.

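Here is PyMC3 code for specifying this model:

```python
import pymc3 as pm

# Counts from Appendix A.1, used here only for illustration.
correct = [1721, 1637]
totals = [2376, 2376]

a = 1
b = 1
with pm.Model() as pymcModel:
    theta = pm.Beta("theta", a, b, shape=2)
    observations = []
    for i in range(2):
        # The loop body was lost in extraction; this likelihood is reconstructed
        # (as an assumption) from the description above: a_i ~ Binomial(n_i, theta_i).
        observations.append(pm.Binomial(f"obs_{i}", n=totals[i], p=theta[i], observed=correct[i]))
```

Generalizations. It is possible to take previous observations in the literature into account and start with a non-uniform prior. For example, setting $\alpha$ and $\beta$ to $\frac{1}{2}$ incorporates the idea that performances are generally closer to 0 or 1. On the other hand, $\alpha = \beta = 2$ lowers the probability of such extremes.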

Notice that, as long as $\theta_1$ and $\theta_2$ are assumed to follow the same prior distribution, the prior probability that $S_1$ is better than $S_2$ is 0.5, as expected from any fair prior.
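For this particular model, the posterior is also available without MCMC: a Beta($\alpha$, $\beta$) prior with a binomial likelihood gives a Beta($\alpha + a_i$, $\beta + n_i - a_i$) posterior for each $\theta_i$. The sketch below uses this conjugacy as a swapped-in shortcut (it is not how the accompanying software works) to estimate the probability that $S_1$ is better than $S_2$ and a 95% HDI for $\theta_1 - \theta_2$. The function names, the number of draws, and the seed are hypothetical choices; the counts are those of Appendix A.1.

```python
import math
import random


def difference_samples(a1, n1, a2, n2, alpha=1.0, beta=1.0, draws=20000, seed=0):
    """Sample theta_1 - theta_2 from the conjugate Beta posteriors."""
    rng = random.Random(seed)
    return [
        rng.betavariate(alpha + a1, beta + n1 - a1)
        - rng.betavariate(alpha + a2, beta + n2 - a2)
        for _ in range(draws)
    ]


def hdi(samples, mass=0.95):
    """Narrowest interval that contains `mass` of the samples."""
    ordered = sorted(samples)
    width = math.ceil(mass * len(ordered))
    start = min(
        range(len(ordered) - width + 1),
        key=lambda i: ordered[i + width - 1] - ordered[i],
    )
    return ordered[start], ordered[start + width - 1]


prior = difference_samples(0, 0, 0, 0)  # no observations: prior only
posterior = difference_samples(1721, 2376, 1637, 2376)

prior_prob = sum(d > 0 for d in prior) / len(prior)
post_prob = sum(d > 0 for d in posterior) / len(posterior)
low, high = hdi(posterior)
print(f"P(S1 better) before data: {prior_prob:.2f}, after data: {post_prob:.3f}")
print(f"95% HDI of theta_1 - theta_2: [{low:.3f}, {high:.3f}]")
```

With no observations the estimated probability stays near 0.5, matching the remark about fair priors; changing `alpha` and `beta` shows how a non-uniform prior shifts the posterior.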

New observations. The researcher might proceed with another experiment on another dataset. Fig. 6 shows the posterior given both observations, i.e., the performances on the "easy" and "challenge" datasets. Notice that in this case, the HDI lies entirely above a two percent superiority of the accuracy of $S_1$ over $S_2$. This means that one can make the following statement: with probability 0.95, the accuracy of system $S_1$ is at least two percent higher than that of $S_2$. In this case, it seems acceptable to informally state this claim as: system $S_1$ practically significantly outperforms system $S_2$.

ACL, EMNLP, NAACL, TACL, and similar "natural language processing" venues.

AAAI, IJCAI, and similar "artificial intelligence" venues.

ICASSP, InterSpeech, and other "speech" venues.

KDD, ICDM, WSDM, and other "data mining" venues.

NeurIPS, AISTATS, JMLR, and similar "machine learning" venues.

## C Further Details On Comparison, Misinterpretations, And Fallacies

When choosing an approach to assess the significance of hypotheses, many issues have to be taken into account. This section compares different techniques in terms of various well-known issues. A summary of the techniques studied here is provided in Table 3. Notice from the final column that p-value-based techniques are dominant in the field. However, as we will delineate later, the alternatives could have advantages.

## C.1 Susceptibility To Misinterpretation

The complexity of interpreting significance tests, combined with insufficient reporting (as shown by ), could result in ambiguous or misleading conclusions. Not only are many papers unclear about their tests, but their results could also be misinterpreted by readers.

Among the techniques studied here, p-values have received by far the largest number of criticisms. While p-values are the most common approach, their definition is inherently complex, which makes them easy to misinterpret. Here are a few common misinterpretations (Demšar, 2008; Goodman, 2008):

• Misconception #1: If p < 0.05, the null hypothesis has only a 5% chance of being true. To see that this is false, remember that the p-value is defined under the assumption that the null hypothesis is correct (Eq. 1).

• Misconception #2: If p > 0.05, there is no difference between the two systems. A large p-value only means that the null hypothesis is consistent with the observations; it does not say anything about how likely the null hypothesis is.

• Misconception #3: A statistically significant result (p < 0.05) indicates a large/notable difference between two systems. The p-value only indicates strict superiority and provides no information about the margin of the effect.

Interpretation of CI can also be challenging, since it is an extension of the p-value (Hoekstra et al., 2014). As a result, the difficulties in reporting and interpreting the results of frequentist tests have become an obstacle for researchers to communicate their conclusions effectively.
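Misconception #3 can be made concrete with a small numerical sketch. The counts below are hypothetical: the same 0.2-point accuracy gap (70.2% vs. 70.0%) is tested on a dataset of 1,000 instances and on one of 1,000,000 instances, using a pooled one-sided two-proportion Z-test. The function name `one_sided_p` is an illustrative choice.

```python
import math


def one_sided_p(s1, s2, n):
    """One-sided p-value of a pooled two-proportion Z-test (equal sample sizes)."""
    pooled = (s1 + s2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (s1 - s2) / n / se
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical counts: identical gap in accuracy, very different dataset sizes.
p_small = one_sided_p(702, 700, 1_000)
p_large = one_sided_p(702_000, 700_000, 1_000_000)
print(f"n = 1,000:     p = {p_small:.3f}")
print(f"n = 1,000,000: p = {p_large:.5f}")
```

The gap between the systems is the same in both cases; only the sample size changes, yet the larger dataset yields a far smaller p-value. The p-value therefore cannot be read as a measure of how large the difference is.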

Tests that provide measures of uncertainty (like the ones in §2.3) are more natural choices for reporting the results of experiments (Kruschke and Liddell, 2018).

Overall, our view on "ease of interpretation" of the approaches is summarized in Table 3 .

## C.2 Dependence On Stopping Intention

The process by which samples in the test are collected could affect the outcome of the test. For example, the sample size n (whether it is determined before the process of gathering information begins, or it is a random variable with a certain distribution) could change the test.

Once the observations are recorded, this distinction is usually ignored. Hence, tests that do not depend on the distribution of n are more desirable. Unfortunately, the definition of the p-value depends on the distribution of n. For instance, §11.1 of Kruschke (2010) provides examples where this subtlety can change the outcome of a test for different "stopping intentions," even when the final set of observations is identical.
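A coin-flip style sketch makes this dependence visible. The data below are illustrative numbers (7 successes in 24 trials, null hypothesis θ = 0.5) and the function names are hypothetical. The same observations are scored under two stopping intentions: the number of trials was fixed in advance, or sampling continued until the 7th success.

```python
from math import comb


def p_fixed_n(successes, trials, theta=0.5):
    """P(X <= successes) when the number of trials is fixed in advance."""
    return sum(
        comb(trials, k) * theta**k * (1 - theta) ** (trials - k)
        for k in range(successes + 1)
    )


def p_fixed_successes(successes, trials, theta=0.5):
    """P(N >= trials) when sampling stops at the `successes`-th success.

    N >= trials happens exactly when the first trials - 1 attempts
    contain fewer than `successes` successes.
    """
    m = trials - 1
    return sum(
        comb(m, k) * theta**k * (1 - theta) ** (m - k)
        for k in range(successes)
    )


p_n = p_fixed_n(7, 24)
p_z = p_fixed_successes(7, 24)
print(f"fixed number of trials:    p = {p_n:.4f}")
print(f"fixed number of successes: p = {p_z:.4f}")
```

The one-sided p-value is about 0.032 under the first intention and about 0.017 under the second, so a 0.025 threshold would reject the null hypothesis in one case and not in the other, although the recorded data are identical.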

## C.3 Unintended Misleading Result By Iterative Testing

While many tests are designed for a single-round experiment, in practice researchers perform multiple rounds of experiments until a predetermined condition is satisfied. This is particularly a problem in binary tests (such as p-value) when the condition is to achieve a desired result. For example, a researcher could continue experimenting until they achieve a statistically significant result (even if they don't necessarily have any intention of cheating the test) (Kim and Bang, 2016).
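The effect of such iterative testing can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis from the paper: two systems with the same true accuracy (0.7) are compared with a two-sided pooled Z-test after every batch of 50 new instances, for up to 10 batches. All parameter values, the seed, and the function names are hypothetical.

```python
import math
import random


def two_sided_p(s1, s2, n):
    """Two-sided p-value of a pooled two-proportion Z-test (equal sample sizes)."""
    pooled = (s1 + s2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = abs(s1 - s2) / n / se
    return math.erfc(z / math.sqrt(2))


def simulate(trials=1000, looks=10, batch=50, accuracy=0.7, alpha=0.05, seed=0):
    """Return (rate of 'significant at any look', rate of 'significant at last look')."""
    rng = random.Random(seed)
    any_look = 0
    last_look = 0
    for _ in range(trials):
        s1 = s2 = 0
        significant_somewhere = False
        for look in range(1, looks + 1):
            # Both systems have the same true accuracy, so the null hypothesis holds.
            s1 += sum(rng.random() < accuracy for _ in range(batch))
            s2 += sum(rng.random() < accuracy for _ in range(batch))
            p = two_sided_p(s1, s2, look * batch)
            significant_somewhere = significant_somewhere or p < alpha
        any_look += significant_somewhere
        last_look += p < alpha
    return any_look / trials, last_look / trials


peeking_rate, single_test_rate = simulate()
print(f"false positives with a single final test: {single_test_rate:.3f}")
print(f"false positives when stopping at the first p < 0.05: {peeking_rate:.3f}")
```

Since the two simulated systems are identical, every "significant" outcome is a false positive. Testing after each batch and stopping at the first significant result can only raise that rate relative to a single test at the final sample size.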

Since p-values and CIs can only "reject" or stay "undecided", these tests reinforce an unintentional bias towards the only possible decision. Consequently, it becomes easy to misuse this testing mechanism: with enough data points, it is possible to make statistically significant claims (Amrhein et al., 2017).

On the other hand, the approaches in §2.3 and §2.4 provide both "accept" and "reject" outcomes, besides staying "undecided." Therefore, an honest researcher is more likely to accept that their data supports the opposite of their conjecture.

For an in-depth study of how each test behaves in sequential testing, refer to §13.3 of Kruschke (2010).

## C.4 Sensitivity To The Choice Of Prior

The choice of prior could change the output of posteriors (both §2.3 and §2.4). It is a well-known issue that decisions based on the Bayes Factor (§2.4) are highly sensitive to the choice of prior, while the posterior estimates in §2.3 are less so. See Appendix C.5 or (Sinharay and Stern, 2002; Liu and Aitkin, 2008; Dienes, 2008; Vanpaemel, 2010) for discussions on this topic.

Since p-values and CIs do not depend on a prior, they are not subject to this issue.

## C.5 Choice Of Prior For Bayes Factor

As discussed in §2.4, §3.1, and §4, Bayes Factor is highly sensitive to the choice of prior. Here are a few options to set a prior for an analysis based on Bayes Factor:

1. Within the framework of model selection, if your priors are decided based on a clear meaning, as opposed to formulating a "vague" prior, then you can justify them to the audience.

2. Often, there are a few choices of noncommittal priors that seem equally representative of our beliefs. The best option is to perform and report one test for each of these choices to control the sensitivity to the prior.

3. Another common approach to mitigate this concern is to use a small portion of the data to get an "informed" prior and do the analysis using this prior. This ensures that the new prior is meaningful and defensible.

4. If none of the above applies, it is recommended to use the approaches in §2.3 instead. Even though the posterior density depends on the prior, it is robust to different choices of similar priors.
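Option 2 above (reporting one test per candidate prior) can be sketched in closed form for a toy setting. The setting is hypothetical: among 100 questions on which two systems disagree, $S_1$ wins 60. The null hypothesis fixes the win rate at θ = 0.5, and the alternative places a symmetric Beta(a, a) prior on θ. The function names and the list of candidate priors are illustrative assumptions.

```python
from math import exp, lgamma, log


def log_beta(x, y):
    """Logarithm of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def bayes_factor_10(wins, trials, a):
    """Bayes Factor of H1: theta ~ Beta(a, a) against H0: theta = 0.5.

    The binomial coefficient appears in both marginal likelihoods and cancels.
    """
    log_m1 = log_beta(wins + a, trials - wins + a) - log_beta(a, a)
    log_m0 = trials * log(0.5)
    return exp(log_m1 - log_m0)


# One Bayes Factor per candidate prior, for the same hypothetical data.
factors = {a: bayes_factor_10(60, 100, a) for a in (0.5, 1.0, 2.0, 10.0, 50.0)}
for a, bf in factors.items():
    print(f"Beta({a}, {a}) prior: BF10 = {bf:.2f}")
```

In this toy setting the Bayes Factor changes noticeably across priors that might all seem reasonable, which is why reporting the full set, as opposed to a single value, helps readers judge the sensitivity.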

## C.6 Bayes Factors Vs Posterior Intervals

We refer interested readers to Kruschke and Liddell (2018, pp. 165-166) for a list of Bayes Factor caveats.

More precisely, over the probability space of an aggregation function over observations, called the test statistic.

For simplicity of exposition, we assume the performances of two systems are on a single dataset; however, it is possible to have observations on multiple different datasets.

Parametric tests assume this distribution, while nonparametric tests do not.

We use P(x), in its most general form, to denote Probability Mass Functions for discrete variables and Probability Density Functions for continuous variables.

For example, PyMC3 (in Python) and JAGS & Stan (in R) are among the commonly-used packages for this purpose.

Fig. 2 can be readily reproduced via the accompanying software, HyBayes.

Or other variants "significantly", "significance", etc.