How Good Are My Predictions? Efficiently Approximating Precision-Recall Curves for Massive Datasets

Ashish Sabharwal
Hanie Sedghi
UAI
2017
View in Semantic Scholar

Abstract

Large scale machine learning produces massive datasets whose items are often associated with a confidence level and can thus be ranked. However, computing the precision of these resources requires human annotation, which is often prohibitively expensive and is therefore skipped. We consider the problem of cost-effectively approximating precisionrecall (PR) or ROC curves for such systems. Our novel approach, called PAULA, provides theoretically guaranteed lower and upper bounds on the underlying precision function while relying on only O(logN) annotations for a resource with N items. This contrasts favorably with Θ( √ N logN) annotations needed by commonly used sampling based methods. Our key insight is to capitalize on a natural monotonicity property of the underlying confidence-based ranking. PAULA provides tight bounds for PR curves using, e.g., only 17K annotations for resources with 200K items and 48K annotations for resources with 2B items. We use PAULA to evaluate a subset of the much utilized PPDB paraphrase database and a recent Science knowledge base.

1 Introduction

Precision-recall (PR) curves and receiver operating characteristic (ROC) curves play a fundamental role in understanding the behavior of a variety of systems in the presence of uncertainty. These curves are frequently used in machine learning (ML), electrical engineering, radiology, and many other fields to study the performance of a binary prediction system as a function of a control parameter. As the parameter is varied, one is often able to increase the precision (or decrease the false positive rate) of the system at the expense of lower recall (also known as the true positive rate or sensitivity). PR and ROC curves make this precision-recall tradeoff explicit, enabling users to adaptively choose a parameter value based on the needs of their application, or to compare two systems across the entire spectrum of the control parameter rather than at a single design point. From these curves, one can also "read off" summary performance metrics such as area under the curve (AOC) for classification algorithms, precision of the top k results (Prec@k) or average precision (AP) for ranking algorithms in applications such as Web search, and the F1 score for balancing precision and recall.

A key challenge in this space is cost-effectively approximating PR and ROC curves for massive resources produced by ML systems, when annotating whether an item in the resource is correct or not is expensive. For instance, consider the paraphrase database PPDB (Ganitkevitch et al., 2013; Pavlick et al., 2015) with 169M automatically computed pairs of English paraphrases. Each item in PPDB is associated with a confidence level, with the understanding that higher confidence correlates with being a valid paraphrase. However, the overall precision of PPDB, not to mention its PR tradeoff as a function of the confidence level, is unknown. Instead, PPDB is pre-packaged into six sizes (S, M, L, XL, etc.) , ranging from the smallest package containing the highest ranking 6.8M pairs to the largest containing all 169M pairs. Similarly, the NELL system (Mitchell et al., 2012) has collected over 50M beliefs of which 3M are identified as more reliable. Again, its PR tradeoff is not available. Even smaller resources such as a semi-automatically generated Science KB with 217K triples (Dalvi et al., 2017) do not come with an associated PR curve. The bottleneck is the high cost of human annotation.

To address this bottleneck, we propose a novel method, called PAULA for Precision Approximation Using Logarithmic Annotations, for cost-effectively (in terms of an-notations) computing a constant-factor approximation of the underlying precision function p(r), which assigns a precision value (i.e., the fraction of items that are correct) to the top r items in a large, unannotated, ranked list T = (t 1 , . . . , t N ). Note that any resource where each item is associated with a correctness confidence can be viewed as a ranked list. PAULA is able to obtain a pointwise (1 + )-approximation of p at a large number of points of interest (and roughly a pointwise (1 + ) 2 -approximation of entire p) using only roughly ∆ log 1+ N annotations in total (cf. Theorem 3), where ∆ is a monotonicity parameter that is typically in the range 50 to 100. For the Science KB with 217K items, for instance, this translates into 17K annotations when = 0.03. For a resource with 2B items, PAULA would need only 47K annotations.

The new insight is that one can leverage a natural property of p (and of derived metrics such as PR and ROC curves, to which our results also apply; see Sec. 2), that they are monotonic-when viewed through the lens of a sufficiently wide sliding window and looking beyond the initial few results in T where p is typically more erratic.

To put the benefit of this approach in perspective, we note that naïvely computing p(r) exactly even at a single value of rank r is impractical, as it requires annotating all of the top r items. A common method to approximate PR curves is to randomly draw s samples T ⊆ R T , annotate T , create the PR curve for T , and use it as a surrogate for the actual PR curve for T . This has the advantage of providing a PR curve that becomes more and more accurate (due to more samples) as we go down the ranked list. We show (cf. Section 4.4.3), however, that it requires s = Θ( √ N log N ) annotations to achieve a constant factor approximation of p. E.g., for the 217K Science KB, it requires 47K annotations for a 95% confidence when = 0.03 as opposed to only 17K annotations needed by PAULA, and this gap widens as N grows.

To analyze the behavior of PAULA, we first derive a general result (cf. Theorem 1) that under a monotonicity assumption, it is sufficient to estimate p at only logarithmically many points in T . We formalize the assumption and discuss its validity in Sections 4.1 and 4.2.

Building upon this, we consider a refined stratified sampling approach that requires only Θ(log N log log N ) annotations (Section 4.4.1, Theorem 2). However, while stratification helps reduce the number of annotations, the probabilistic approach results in cumulative errors, requiring one to be much more precise at estimating p at each individual point. We then develop our main contribution, the PAULA algorithm, in Section 4.4.2, leveraging a stronger form of monotonicity. Our estimates reveal that in typical use cases, PAULA requires an order of magnitude fewer annotations than the random sampling baseline and half the annotations needed by our refined logarithmic stratified sampling approach.

We evaluate PAULA on two large resources mentioned earlier: the PPDB dataset from natural language research and the Science KB for common facts about elementary level science. Our experiments show that the lower and upper bounds provided by PAULA are very close in practice-to each other and to the ground truth, when available. The method, being logarithmic and deterministic, easily scales to larger resources.

1.1 Related Work

Much of the existing work in this area has focused on small ranked lists (e.g., the top 10 or 100 Web search results), on aggregating across multiple ranked lists (e.g., multiple search queries), and on summary statistics (e.g., Prec@k for some small k of interest, average precision AP, discounted cumulative gain DCG, etc.). In contrast, our challenge is to compute the entire precision function (and the associated PR, ROC, and precision-yield curves) for a single, large, ranked list. From the entire curve, it is then easy to "read off" various summary statistics.

There has been much interest in random sampling based methods to compute a particular summary statistic of interest (e.g., AP), where the focus is on choosing nonuniform samples in a way that minimizes variance. While very effective for that particular statistic, these methods require different sets of annotations to minimize variance for different metrics (e.g., Prec@k, Prec@k for k = k, Gain@k, DCG, etc.). Our method, on the other hand, relies on a deterministic set of annotations for the entire curve, and hence for any derived summary metrics. Carterette et al. (2006) focus on the AP metric in the information retrieval (IR) setting, and propose a method to design a test subset that measures AP with high confidence. They choose documents based on the benefit they would provide when fully ranking the system. Yilmaz et al. (2008) consider large scale retrieval evaluation using incomplete relevance judgments, focusing on a simple yet efficient method. Unlike prior work, they provide confidence intervals for the inferred AP and improve upon two proposals, one that is accurate but complex and another called InfAP that is simple but less efficient . Kanoulas (2015) provides a short survey of methods, measures, and designs used in the field of IR to evaluate the quality of search algorithms against collections of observations, considering the use of relevance and observable user behavior to compute search quality. Schnabel et al. (2016) provide an interesting, comprehensive perspective, recasting nearly all previous approaches with guarantees as Monte Carlo estimation of an expectation, focusing on variance reduction via importance sampling. Their domain of interest is Web search, where importance is placed on the top 10 or so ranked results, aggregation across multiple queries, and specific summary metrics that are used to guide the sampling strategy so as to minimizes variance.

2 Preliminaries

Consider the ranked output T = (t 1 , t 2 , . . . , t N ) of an algorithm A, where each t i comes from some universe U (e.g., all documents on the Web, all paraphrase pairs, all subject-verb-object triples, etc.). Each item u ∈ U is associated with an unknown true label v(u) ∈ {0, 1} that captures the semantics of some underlying task (e.g., whether a document is relevant to a query, whether a pair of phrases is a linguistic paraphrase, whether a triple denotes a true fact, etc.). The precision function of A, p : [N ] → [0, 1], maps each rank r ∈ [N ] to the fraction of the top r items in T that are positive, i.e., labeled as 1:

EQUATION (1): Not extracted; please refer to original document.

where we omit A from the notation for brevity. The commonly used metric Prec@k for a specific k (e.g., Prec@10) is simply p(k). The associated recall of A at rank r is the fraction of all positive items U + ⊆ U that appear among t 1 , . . . , t r ; that is, r p(r) |U+| . Since U + is unknown in many ranking applications, it is common to instead use its unnormalized variant called yield, defined as

r i=1 v(t i ), which equals r p(r).

A plot of p(r) vs. recall (or yield) at r is commonly referred to as the precision-recall (or precision-yield, resp.) curve for A. One can similarly define a variety of metrics that are derivable from p(r), r, U + , and U , such as Gain@k, accuracy, F1, true positive rate (TPR), false positive rate (FPR), specificity, sensitivity, etc. The ROC curve refers to a plot of FPR vs. TPR, and average precision (AP) is the area under the PR curve. Fawcett (2006) and Majnik and Bosnic (2013) provide good surveys of such metrics, while Davis and Goadrich (2006) explore interesting relationships between PR and ROC curves.

Our goal in this work is to efficiently approximate p, which is a fundamental building block for all of the above metrics-indeed, a pointwise-approximation of p allows one to approximately compute all of these metrics.

Importantly, we operate under a setting where obtaining true labels v(t i ) is costly (e.g., requires a system simulation, human annotation, crowdsourcing, etc.). We aim to compute a pointwise-approximation of p with as few annotations of true labels v(t i ) as possible.

2.1 Point Estimates: Random Sampling

A simple way to obtain an unbiased point estimate of p(r) for a fixed rank r is via random sampling: Sample (with repetition) z indices J independently and uniformly from {1, 2, . . . , r} and compute the empirical averagep(r) = 1 z j∈J v(t j ). Thenp(r) is an unbiased estimator of p(r) and one can apply tail inequalities such as the two-sided Hoeffding bound (Hoeffding, 1963) to compute how tight the estimate is:

Pr[|p(r) − p(r)| ≥ p(r)] ≤ 2 exp −2z 2 p(r) 2

For a 1 − δ confidence in the outcome (e.g., a 95% confidence would mean δ = 0.05), it suffices to have 2 exp −2z 2 p(r) 2 ≤ δ, which translates into needing:

z ≥ 1 2 2 p(r) 2 ln 2 δ

(2) samples in order to achieve a (1 + )-approximation of p(r). A complementary way of viewing the same result is that the use of z samples results in the following 1 − δ confidence interval (e.g., a 95% confidence interval) which can be used for error bars around the estimate:

p(r) ± 1 2z ln 2 δ (3) 3 PRECISION WITH O( √ N log N ) ANNOTATIONS

A common practice to compute an approximation of the entire precision function p, not only its value at one point, is to draw s uniform random samples T from the ranked list T , compute the precision function p for T , and use this as a surrogate for p as p(r) ≈ p ( rs N ). How close is p ( rs N ) an approximation of p(r) is determined by the number z = rs N of the s random samples that are expected to land in the range t 1 , . . . , t r . We provide a formal analysis of this approach below.

Eqn. (3) indicates that the approximation error starts off large (since we observe very few samples when r is small) and decreases as r increases, scaling proportionally to 1/ √ r. To compare this method to our proposal for obtaining a pointwise (1 + α)-approximation for entire p, we ask the following question: Given s and N , what is the smallest

n α,s ≤ N such that for all r ≥ n α,s , p ( rs N ) is a (1 + α)-approximation of p(r)?

We apply Eqn. (2) to each of the N points, bounding the correctness confidence by δ = δ/N at each point and taking the union bound to guarantee an overall correctness confidence of δ. This yields the requirement that n α,s s/N must be at least 1 2α 2 p(r) 2 ln 2 δ . The smallest n α,s satisfying this is:

EQUATION (4): Not extracted; please refer to original document.

Thus, we obtain a (1 + α)-approximation after roughly the first N ln N sα 2 points. To obtain a (1 + α)-approximation of the entire precision function, even for low values of r, one possibility is to annotate all of t 1 , . . . , t nα,s , in addition to s uniform random samples overall. 1 This would mean annotating s + n α,s items in total. It can be verified that this expression is minimized when s is chosen such that s = n α,s , in which case the total number of annotations needed is:

EQUATION (5): Not extracted; please refer to original document.

This expression grows asymptotically as Θ( √ N log N ). The methods developed in the next section achieve the same approximation with only O(log N ) annotations.

4 Precision With O(Log N ) Annotations

Exactly computing the entire precision function p, in particular computing p(N ), requires annotating N labels, which can be prohibitive for massive datasets. To alleviate this, we develop here a novel method called PAULA to efficiently obtain, for any > 0, a pointwise γ(1 + )approximation of p with roughly only ∆ log 1+ N deterministically chosen annotation points, where ∆ is a constant capturing the level of monotonicity in the underlying data and γ is slightly larger than 1 + .

4.1 Local Precision And Monotonicity

We start by defining a ∆-local variant of p and two related notions of monotonicity. Let T = (t 1 , t 2 , . . . , t N ) be the ranked output of an algorithm A with (unknown) true labels v(t i ) ∈ {0, 1} and precision function p.

EQUATION (6): Not extracted; please refer to original document.

where we omit A as before for brevity of notation.

∆-precision may be viewed as a smoothed version of the true label sequence

v(t 1 ), v(t 2 ), . . . , v(t N )

. Although the actual label sequence, being a sequence of 0s and 1s, is bound to be erratic, we expect the density of 1s in this sequence to decrease as i increases. In other words, a sufficiently smoothed out variant of the true label sequence should be non-increasing, except perhaps for the initial few values. Formally, we use the following two characterizations of monotonicity:

Definition 2 (Weak Monotonicity). For m ∈ N + , p is m-monotone if for all r 2 ≥ r 1 + m, p(r 1 ) ≥ p(r 2 ).

Weak monotonicity guarantees that precision is nonincreasing for points ranked at least m apart. We will use this to show that it is sufficient to compute precision at only logarithmically many points.

Definition 3 (Strong Monotonicity). For ∆, m ∈ N + s.t. m ≥ ∆, p is (∆, m)-monotone if for all r 2 ≥ r 1 + m, p ∆ (r 1 ) ≥ r2 r=r1+1 p(r) r 2 − r 1 ≥ p ∆ (r 2 )

Strong monotonicity says that for every large enough rank interval [r 1 , r 2 ], the ∆-precisions at the two ends of the interval bound the average precision across the interval. We will use this property to bound the actual precision function p by functions of local precision p ∆ .

We observe that p(r) is simply p r (r). Further, when r is a multiple of ∆, p(r) can be decomposed as

∆ r r/∆ j=1 p ∆ (j∆).

Computing all of these uniformly spaced r/∆ local precision terms is, however, still as expensive in terms of the required annotation as computing p(r) directly. We will instead approximately decompose p(r) using roughly only log 1+ r local precision terms that are spaced based on a geometrically increasing spread with geometric ratio 1 + .

4.2 Setup And Assumptions

Throughout this section, let T = (t 1 , t 2 , . . . , t N ) be the ranked output of an algorithm A with (unknown) true labels v(t i ) ∈ {0, 1}. Let p be A's precision function. Let ∆ ∈ N + and p ∆ be the corresponding local precision function (cf. Definition 1). Let ∈ (0, 1],r ∈ N be such thatr ≥ ∆+2 , = log 1+ r , m = (1 + ) − 1 , γ = 1 + + 2+ m , and

L = log 1+ N . Observe that m > (1 + ) − 2 ≥ r − 2 ≥ ∆.

The notion of (∆, m)-monotonicity can thus be applied. Our method will involve annotating the true labels v for all of the first (roughly)r items in the ranked list, followed by fewer than ∆ log 1+ N annotations for the rest of the list, in order to guarantee a γ(1 + )-approximation of p.

Assumptions. The algorithms and results below rely on one or both of the following assumptions:

A.1 p is m-monotone for all r ≥r.

A.2 p is (∆, m)-monotone for all r ≥r.

While one intuitively expects monotonicity to hold (for large enough m and ∆) for any dataset produced by a well-designed prediction system, how often this happens in practice is an empirical question. We provide two pieces of support for it in Section 5: (a) visual support via the monotonically non-increasing ground truth points in Figures 2 and 3 ; and (b) quantitative support via the success of our method on two large and diverse datasets.

Figure 1: PAULA needs substantially fewer annotations to guarantee various levels of approximation of the precision function. N = 107, pmin = 0.5,∆ = 100.

We noticed that the assumption fails near the 15K point in Figure 2 , where the black ground truth curve starts to rise. Our estimate, as expected, deviates here a little, but then quickly regains accuracy as one moves to the right.

Figure 2: Bounds provided by PAULA with ∆ = 100 are very close to the ground truth on a fully annotated subset of PPDB 2.0 of size 36K. Top: = 0.03, 11K annotations. Bottom: = 0.05, 8K annotations.

4.3 Logarithmic Evaluations

We start with a general result that evaluating p at O(log 1+ N ) points is sufficient to obtain a (1 + )approximation of the entire precision function p. The idea is to compute p at the following geometrically spaced points, where the spacing is determined by :

Definition 4. For j ∈ N + and > 0, define g ,j = (1 + ) j .

For brevity, when is clear from the context, we write g j to mean g ,j . Observe that

g j+1 −g j ≥ (1+ ) j+1 −(1+ ) j − 1 = (1 + ) j − 1. When (1 + ) j ≥ m+1 , we thus have g j+1 − g j ≥ m.

If we assume p is m-monotone for large enough r, we can show that evaluating p at only roughly log 1+ N points is sufficient to obtain a (1 + )approximation of the entire precision function p. Formally, we define a step-function approximation:

Definition 5. Let N, , be as in Section 4.2 and f :

[N ] → [0, 1]. Then step f , : [N ] → [0, 1] is defined as: step f , (r) = f (r) if r ≤ g f (g j ) for j = log 1+ r otherwise

In other words, step f , (r) can be computed using g + L − evaluations of f , mirrors f (r) for small r, and is a geometrically adjusting step function afterwards. The following theorem, whose proof is deferred to the Appendix, shows that under the weak monotonicity assumption, step p , is a tight approximation of p. Theorem 1. Let p, ,r, , m, L be as in Section 4.2. If Assumption A.1 holds, then step p , is a pointwise (1+ )approximation of p and can be computed using evaluations of p at points 1, 2, . . . , g , g +1 , . . . , g L .

This result as such is not directly helpful when evaluations of p are costly, as computing p exactly even at a single point requires a linear number of true label annotations (e.g., computing p(N ) exactly requires N annotations). However, what the result shows is that if we could efficiently compute point-estimates of p, we would need to do so at roughly only log 1+ N points: Corollary 1. Let p, ,r, , m, L be as in Sec. 4.2 and β ≥ 1. If Assumption A.1 holds and q(r) is a β-approximation of p(r) for r ∈ {1, 2, . . . , g , g +1 , . . . , g L }, then step q , is a pointwise β(1 + )-approximation of p.

4.4 Efficient Point Estimates

We now consider various ways of efficiently (in terms of required label annotations) computing β-approximations q of p at the O(log N ) points g +1 , . . . , g L , in order to then apply Corollary 1 to obtain a β(1+ )-approximation of the entire precision function p.

4.4.1 Stratified Sampling

A simple way is to estimate each of these L − points is to employ random sampling and bound correctness confidence via Hoeffding's inequality, Eqn. (2). Since each point estimate requires many samples, it is substantially more efficient (in terms of evaluations of p) to reuse samples obtained for g k when evaluating p at g k+1 . The resulting lack of independence slightly weakens the probabilistic correctness guarantee (we instead use the union bound), but leads to significantly fewer samples. Specifically, only a 1+ fraction of the required samples need to be obtained afresh; the rest can be reused. This is formalized in the stratified sampling mechanism described in Algorithm 1, where X k is the set of random samples considered for the point g k and S k denotes the precision of X k . Samples in X k+1 are a combination of reusing most samples from X k and obtaining only a few new samples from the range g k + 1, . . . , g k+1 . For this algorithm, we can derive the following probabilistic correctness guarantees (see Appendix for proofs):

Lemma 1. Let T, v, p, ,r, , L be as in Sec. 4.2. Let δ > 0, p min be the minimum value of p, and β > 1. Then, with probability at least 1 − δ, q(r) in Algorithm 1 on input (T, v, ,r, δ, p min , β) is a β-approximation of p(r) for r ∈ {g +1 , . . . , g L }.

Putting this together with Corollary 1 and noting that q(r) = p(r) in Algorithm 1 when r ≤ , we obtain:

Theorem 2 (Logarithmic Stratified Sampling). Let T, v, p, ,r, , m, L be as in Sec. 4.2. Let δ > 0, p min be the minimum value of p, and β > 1. If Assumption A.1 holds, then with probability at least 1 − δ, the output of

Algorithm 1 Logarithmic Stratified Sampling for Approxi- mating Precision Function input T = (t 1 , t 2 , . . . , t N ), v, ,r, δ, p min , β = log 1+ r ; L = log 1+ N ; r = (1 + ) −1 for j = to L do r = r * (1 + ); g j = r end for s = 1 2(β−1) 2 p 2 min ln L− δ/2 X = s random samples from {1, . . . , g } S = i∈X v(t i ) for k = to L − 1 do X (1) k+1 = include each i ∈ X k independently with probability g k /g k+1 X (2) k+1 = s − |X (1) k+1 | random samples from {g k + 1, . . . , g k+1 } X k+1 = X (1) k+1 ∪ X (2) k+1 S k+1 = i∈X k+1 v(t i ) end for q(r) = 1 r r i=1 v(t i ) for r ∈ {1, . . . , g } q(g k ) = S k /s for k ∈ { + 1, . . . , L} Compute step q , using the above values of q output step q , Algorithm 1 on input (T, v, ,r, δ, p min , β) is a β(1 + )- approximation of p(r). Further, Algorithm 1 evaluates p only at 1, . . . , g and at (L− ) 2(β−1) 2 (1+ )p 2 min ln L−

δ/2 points chosen randomly via stratified sampling.

Note that since L = Θ(log N ), the stratified sampling approach requires Θ(log N log log N ) annotations.

4.4.2 New Approach: Paula

While random sampling provides efficient single point estimates, using it to approximate the entire p with error probability ≤ δ requires bounding the error probability of each point more strictly, namely, by δ L− , and using the union bound over L − dependent random events. 2 We develop a novel method called PAULA to obtain γapproximations of p at points of interest by evaluating p at logarithmically many deterministically chosen points. Since there is no probabilistic confidence involved, γapproximations of the points directly carry over to a tight approximation of entire p, without any loss.

To this end, we employ the notions of local precision and strong monotonicity introduced in Section 4.1. We begin by observing that for any k ≥ ,

g k+1 −g k ≥ g +1 −g ≥ (1 + ) +1 − (1 + ) − 1 = (1 + ) − 1 ≥ m.

The strong monotonicity assumption A.2 thus implies p ∆ (g k+1 ) ≤ p ∆ (g k ), i.e., the local precision is monotonically non-increasing along the geometrically spaced points of interest. We define our candidates for lower and upper bounds on p via telescopic sums of local precisions at these points, as follows. As we will shortly see (Lemma 3), these terms bound the yield of A at rank g k , normalizing which by g k generates bounds on p(g j ). Definition 6. Let ∆, , , L be as in Section 4.2 and k ∈ { , . . . , L}. Then:

Y − ( , , k, ∆) = g p(g ) + k−1 j= (g j+1 − g j ) p ∆ (g j+1 ) Y + ( , , k, ∆) = g p(g ) + k−1 j= (g j+1 − g j ) p ∆ (g j )

Algorithm 2 describes our precision function approximation method, PAULA. We abbreviate Y − ( , , k, ∆) and Y + ( , , k, ∆) as Y k − and Y k + , resp. The algorithm is, in fact, very simple: compute some derived parameters and then loop over k to compute Y k − and Y k + via ∆-precision computations as defined above.

A similar idea has been used previously by Ermon et al. (2013) , but in a different context, namely discrete integration for probabilistic graphical models. Our more delicate analysis, motivated by a novel use case, extends their finding for = 1 to any ∈ (0, 1]. 3 Lemma 2 captures the nature of the approximation provided by PAULA: Lemma 2. Let T, v, p, ∆, ,r, , m, γ, L be as in Section 4.2. If Assumption A.2 holds and p(g ) ≥ p ∆ (g ), then q − (r) and q + (r) in PAULA on input (T, v, ∆, ,r) are pointwise γ-approximations of p(r) from below and above, resp., for r ∈ {g +1 , . . . , g L }.

This follows from Lemmas 3 and 4 below, whose proof is deferred to the Appendix. Lemma 3. Let p, ∆, , , m be as in Section 4.2. If Assumption A.2 holds, then for any k ≥ :

Y − ( , , k, ∆) ≤ g k p(g k ) ≤ Y + ( , , k, ∆)

Lemma 4. Let p, ∆, , , m, γ be as in Section 4.2. If Assumption A.2 holds and p(g ) ≥ p ∆ (g ), then for all k ≥ :

γ • Y − ( , , k, ∆) ≥ Y + ( , , k, ∆)

3 Specifically, (a) the notion of (m, ∆)-monotonicity is irrelevant to that work, as the search space there always (implicitly) satisfies monotonicity; (b) Theorems 1 and 2 and Corollary 1 are unrelated to that work; and (c) the 2-approximation that arose there naturally from parity constraints is too loose to be useful as an approximation of the precision function.

Algorithm 2 PAULA for Approximating Precision Function input T = (t 1 , t 2 , . . . , t n ), v, ∆, ,r = log 1+ r ; L = log 1+ n ; r = (1 + ) −1 for j = to L do r = r * (1 + ); g j = r end for

Y − = Y + = g p(g ) for k = to L − 1 do Y k+1 − = Y k − + (g k+1 − g k ) p ∆ (g k+1 ) Y k+1 + = Y k + + (g k+1 − g k ) p ∆ (g k ) end for q − (r) = q + (r) = 1 r r i=1 v(t i ) for r ∈ {1, . . . , } q − (g k ) = Y k − /g k for k ∈ { + 1, . . . , L} q + (g k ) = Y k + /g k for k ∈ { + 1, . . . , L} Compute step q −

, and step q + , using the above values of q − and q + , resp.

output (step q − , , step q + , )

The above lemmas, in fact, show that Y − and Y + together provide a γ-approximation jointly, in that sense that if Y − is far from p at some point, then Y + must be close to p at that point. For simplicity, Lemma 2 (and Theorem 3 to follow shortly) states a weaker version that each of Y − and Y + is a γ-approximation of p, independently.

Combining Lemma 2 and Corollary 1, we obtain:

Theorem 3 (PAULA). Let p, ∆, , , m, γ, L be as in Section 4.2. If Assumptions A.1 and A.2 hold, and p(g ) ≥ p ∆ (g ), then the output of PAULA on input (T, v, ∆, ,r) provides pointwise γ(1 + )approximations of p from below and above, resp. Further, PAULA uses evaluations of p only at g + ∆(L − ) deterministically chosen points.

In typical use cases, g can be taken to be a constant like 500 or 1000, after which the precision function stabilizes. With ∆ being a constant (typically 50 to 100) and L = Θ(log N ), PAULA thus requires Θ(log N ) annotations.

4.4.3 Comparison With Random Sampling

PAULA starts with zero error in approximating p(r) as long as r ≤ n 0 = ∆+2 . As r increases beyond n 0 , the error increases as we only take logarithmically many samples afterwards. Nonetheless, under the monotonicity assumption, the error remains provably bounded by γ(1 + ). This contrasts with the common random sampling approach discussed in Section 3, which is inaccurate in the beginning and starts providing a (1 + α)approximation, where α = γ(1 + ) − 1, with confidence 1 − δ once r ≥ n α,s as defined in Eqn. (4).

As noted earlier, the random sampling approach requires Θ( √ N log N ) samples where as PAULA requires only Θ(log N ) samples. Further, given the same number s of true label annotations and the same desired approximation factor, we find that n α,s is often too large. Asymptotically, since s scales as Θ(log N ) for PAULA, we see from Eqn. (4) that n α,s scales as Θ(N ) when using the same s. In other words, given the same number of samples, PAULA obtains an approximation that holds everywhere whereas the random sampling method is inaccurate at a linear fraction of the N points.

For concreteness, we provide a numeric example. Consider the Science tensor (to be described later) where N = 217077. Using = 0.03, PAULA requires s = 17392 annotations to provide a γ(1 + ) approximation of p at every point. In contrast, the random sampling approach from Section 3 needs 47030 annotations to achieve the same approximation (with error probability δ = 0.05 and p(r) ≈ 0.7). Alternatively, given the same number s of samples, the random sampling baseline has a much larger error initially and provides a pointwise αapproximation of p(r) only for r ≥ n α,s = 41056.

4.4.4 Comparison With Stratified Sampling

Comparing the number of evaluations of p needed in Theorems 2 vs. 3 to achieve a γ(1 + )-approximation, we see that PAULA has an asymptotic advantage over stratified sampling:

Θ(log N ) vs. Θ(log N log log N ).

From a practical perspective, the difference is in the factor multiplying the L − term in each, which is

2(γ−1) 2 (1+ )p 2 min log L−

δ/2 for stratified sampling and simply ∆ for PAULA. To illustrate how these terms compare in practice, consider a typical application with parameter values δ = 0.05 (i.e., 95% correctness confidence), = 0.03, and p min = 0.5. If we assume p is stable after the first 1000 points (i.e.,r = 1000), then = 234. In this case, the first expression simplifies to roughly 5.8 log 40(L − 234). For a ranked list T of size N = 10000 (or 100000), we have L = 311 (389, resp.) and the expression evaluates to 46.6 (50.6, resp.). In contrast, ∆ is independent of N and can often be taken safely to be somewhat larger than 1/ = 20. This difference can be small in practice. We note, however, that the stratified sampling approach considered here is substantially more efficient than the conventional one, as it exploits our finding (Theorem 1) that logarithmically many point-estimates are sufficient.

5 Experiments

We begin by illustrating how the number of samples needed by various precision function estimation methods scales as the desired approximation factor is varied. 4 We will then evaluate the accuracy and effectiveness of PAULA on resources from two application domains: natural language processing and knowledge acquisition. In the first case, ground truth is available and we demonstrate that the bounds produced by PAULA are very close to it. In the second case, we compute bounds on the precision function for a larger resource and demonstrate that (a) the bounds are close to each other and (b) match a few independently generated point estimates.

5.1 Number Of Annotations Needed

For this experiment, we vary the desired approximation factor (1 + α) and compute the number of annotations needed by various methods to achieve this level of accuracy in estimating the precision function. Specifically, we consider the random sampling baseline (Section 3), our logarithmic stratified sampling (Section 4.4.1), and our deterministic approach, PAULA (Section 4.4.2). For the last two methods, 1 + α is a function of , namely, γ(1 + ).

1.02

1.04 1.06 1.08 1.10

Guaranteed Approximation Factor, 1 + α needed when N is 10M, p min for the sampling based methods is taken to be 0.5, and ∆ for PAULA is 100. The plot shows that PAULA requires over an order of magnitude fewer annotations than the random sampling baseline in order to achieve the same level of guaranteed accuracy. The logarithmic stratified sampling approach, which also exploits our Theorem 1, starts off as more efficient (in terms of annotations) than PAULA when α is small, but requires twice as many annotations when the approximation error is varied from 3% to 10%. 4 Code and data available at http://allenai.org/software.

5.2 Application: Nlp Resources

We consider the Paraphrase Database PPDB (Ganitkevitch et al., 2013) , a massive dataset for major NLP tasks such as paraphrase detection and entailment. Each paraphrase pair in PPDB 2.0 is associated with a validity score generated via supervised learning on 209 features (Pavlick et al., 2015) . It also includes entailment relations, word embedding similarities, and style annotations. PPDB 2.0 can be viewed as a list of 169M items, ranked based on the trained paraphrase score.

In the first experiment, our goal is to demonstrate that the bounds provided by PAULA are close to the ground truth.

To this end, we consider a subset of PPDB for which Pavlick et al. (2015) provide crowdsourced annotations on a 5-point scale (1-5) using five annotators per phrase pair. 5 If the average human judgment for a phrase pair t i is at least 3, we consider v(t i ) = 1; otherwise, v(t i ) = 0. with step functions corresponding to PAULA's upper (red) and lower (blue) bounds when = 0.03 (top) and = 0.05 (bottom), using ∆ = 100. The bounds are extremely tight in practice, even though they were obtained using only 11292 and 7822 annotations, resp.

The random baseline can provide a good estimate as well, but is generally less accurate in the earlier part of the curve and resulted in substantial random fluctuations, as also suggested by the error bounds in Figure 1 and the numerical examples discussed in Section 4.4.3.

Finally, we note that using the same setting for and ∆ as in the upper plot would require only 40K annotations to generate a precision function curve for entire PPDB 2.0 containing N = 169M items.

5.3 Application: Knowledge Bases

As a second application, we consider evaluating the quality of knowledge bases (KBs) which store facts in a structured manner, e.g., in form of (entity, relation, entity) triples such as (EiffelTower, locatedIn, Paris) or (butterfly, pollinate, flower). Knowledge acquisition is the task of extracting such knowledge from various unstructured resources, such as the Web, while knowledge completion is the task of inferring more facts given a partial KB. In both these tasks, machine learning is commonly employed to model the entities and relations, and score various combinations of them as candidate facts. The triples that score higher are considered more reliable.

In this second experiment, our goal is to demonstrate that PAULA can scale to large resources while still producing upper and lower bounds that are very close to each other and to point estimates obtained via random sampling.

To this end, we consider Science KB (Dalvi et al., 2017) , a dataset with facts about elementary science, along with corresponding confidence scores. 6 The KB has N = 217076 triples, making it prohibitively expensive to assess its quality by analyzing all triples. Figure 3 shows the precision function bounds produced by PAULA, using the same setup as earlier ( = 0.03, ∆ = 100), which led to 17392 annotations. Due to the lack of ground truth for this dataset, we compute three point estimates independently by drawing 2000 uniform samples for each and using Eqn. (3) to compute a confidence interval. We observe that the upper (red) and lower (blue) bounds are very close to each other for the entire dataset, and also near the independently obtained point estimates.

Figure 3: Bounds provided by PAULA with = 0.03,∆ = 100 using 17K annotations for Science KB.

6 Conclusion

This is the first work, to our knowledge, that provides a practical way of cost-effectively obtaining guaranteed approximations of the precision function, and associated metrics such as the PR curve, for massive datasets. This is particularly suitable for massive resources generated by today's large scale machine learning algorithms, when assessing individual data items for correctness requires human annotation. While these items are generally associated with a correctness score, and can thus be viewed as a ranked list, naïvely computing the PR curve of this list is prohibitively expensive due to the human annotation step. Consequently, these datasets often do not come with an associated precision-recall analysis, leaving their suitability for downstream applications unclear.

Our method, PAULA, addresses this limitation via tight approximations of the precision function with only logarithmically many annotations. PAULA leverages the fact that the precision function of reliable prediction systems is essentially monotonically non-increasing. 7 Our analysis reveals that sampling techniques require many more annotations (both asymptotically and for typical use cases) to guarantee the same level of approximation. Experiments on two large datasets demonstrate the accuracy of our method and its scalability.

There is a small overlap when counting this way, which we ignore for simplicity of exposition.

Even if the samples are not shared and one obtains independent estimates for each point with error probability δ each, the overall error probability is 1 − (1 − δ ) L− , which is very close to δ (L − ) when δ is small.

http://www.seas.upenn.edu/∼nlp/resources/ppdb-2.0human-labels.tgz

Available at http://data.allenai.org/tuple-kb as Aristo Tuple KB v4 (March 2017).

For poorly designed prediction systems whose score does not correlate well with correctness, the monotonicity assumption is violated and PAULA's theoretical guarantees do not apply. PAULA can still be used to obtain a cost-effective estimate.