Selecting Near-Optimal Learners via Incremental Data Allocation


Abstract

We study a novel machine learning (ML) problem setting of sequentially allocating small subsets of training data amongst a large set of classifiers. The goal is to select a classifier that will give near-optimal accuracy when trained on all data, while also minimizing the cost of misallocated samples. This is motivated by large modern datasets and ML toolkits with many combinations of learning algorithms and hyper-parameters. Inspired by the principle of "optimism under uncertainty," we propose an innovative strategy, Data Allocation using Upper Bounds (DAUB), which robustly achieves these objectives across a variety of real-world datasets. We further develop substantial theoretical support for DAUB in an idealized setting where the expected accuracy of a classifier trained on n samples can be known exactly. Under these conditions we establish a rigorous sub-linear bound on the regret of the approach (in terms of misallocated data), as well as a rigorous bound on suboptimality of the selected classifier. Our accuracy estimates using real-world datasets only entail mild violations of the theoretical scenario, suggesting that the practical behavior of DAUB is likely to approach the idealized behavior.

Introduction

The goal of our work is to develop novel practical methods to enhance the tractability of Data Science practice in the era of Big Data. Consider, for example, the following very common scenario: a Data Science practitioner is given a dataset, split into a training set and a validation set, along with a collection of classifiers in an ML toolkit, each of which may have numerous possible hyper-parameterizations. The practitioner would like to determine which classifier/parameter combination (hereafter referred to as "learner") would yield the highest validation accuracy after training on all examples in the training set. However, the practitioner may have quite limited domain knowledge of salient characteristics of the data, or indeed of many of the algorithms in the toolkit.

In such a scenario, the practitioner may inevitably resort to the traditional approach to finding the best learner (cf. Caruana and Niculescu-Mizil 2006), namely, brute-force training of all learners on the full training set, and selecting the one with the best validation accuracy. Such an approach is acceptable if the computational cost of training all learners is not an issue. However, in the era of Big Data, this is becoming increasingly infeasible. Web-scale datasets are proliferating from sources such as Twitter, TREC, SNAP, ImageNet, and the UCI repository, particularly in domains such as vision and NLP. ImageNet datasets can exceed 100 gigabytes, and the recent "YouTube-Sports-1M" video collection exceeds 40 terabytes. Moreover, the diverse set of learners available in today's ML packages (Hall et al. 2009; Pedregosa and others 2011; Schaul et al. 2010; McCallum 2002) is continually expanding, and many of the most successful recent algorithms entail very heavy training costs (e.g., Deep Learning neural nets with Dropout).

The above factors motivate a search for techniques that reduce training cost while still reliably finding a near-optimal learner. One could consider training each learner on a small subset of the training examples and choosing the best-performing one. This entails less computation, but could result in a significant loss of learner accuracy, since performance on a small subset can be a misleading predictor of performance on the full dataset. As an alternative, the small-subset results could be projected forward using parameterized accuracy models to predict full-training-set accuracy. Creating such models is, however, a daunting task (Guerra, Prudencio, and Ludermir 2008), potentially requiring prior knowledge about learners and domain, characteristic features of the data, etc.

In this paper, we develop a novel formulation of what it means to solve the above dual-objective problem, and we present a novel solution approach, inspired by multi-armed bandit literature (Auer, Cesa-Bianchi, and Fischer 2002; Thompson 1933; Scott 2010; Agrawal and Goyal 2012) . Our method develops model-free, cost-sensitive strategies for sequentially allocating small batches of training data to selected learners, wherein "cost" reflects misallocated samples that were used to train other learners that were ultimately not selected. We express the cost in terms of the regret of the approach, comparing the algorithm's cost with that of an oracle which only allocates data to the best learner.

Our main contributions are as follows. First, we give a precise definition of a new ML problem setting, called the Cost-Sensitive Training Data Allocation Problem. Second, we present a simple, knowledge-free, easy-to-use and practical new algorithm for this setting, called DAUB (Data Allocation using Upper Bounds). Third, we give empirical demonstrations that DAUB achieves significant savings in training time while reliably achieving optimal or near-optimal learner accuracy over multiple real-world datasets. Fourth, we provide theoretical support for DAUB in an idealization of the real-world setting, wherein DAUB can work with noiseless accuracy estimates when training on n samples, in lieu of actual noisy estimates. The real-world behavior of DAUB will progressively approach the idealized behavior as n becomes large. In this setting, we establish a bound on the accuracy of learners selected by DAUB, a sub-linear bound on the data misallocated by DAUB, and an associated bound on the computational training cost (regret).

Related work on traditional bandit strategies mentioned above, such as the celebrated UCB1 (Auer, Cesa-Bianchi, and Fischer 2002) and Thompson sampling (Thompson 1933; Agrawal and Goyal 2012), presumes that additional trials of a given arm yield stationary payoffs. In our scenario, by contrast, additional data allocations to a learner yield increasing values of its accuracy. There are also existing methods to optimize a single arbitrary function while minimizing the number of evaluations (cf. Munos 2014). These also do not fit our setting: we are dealing with multiple unknown but well-behaved functions, and wish to rank them by estimated accuracy after training on the full dataset, based on upper bounds obtained from far fewer samples.

Somewhat related is algorithm portfolio selection (Rice 1976), which seeks the most suitable algorithm (e.g., learner) for a given problem instance, based on knowledge from other instances and features characterizing the current instance. Note, however, that most selection algorithms use parameterized accuracy models which are fit to data (e.g., Hutter et al. 2014). Also related is work on hyper-parameter optimization, where one searches for novel configurations of algorithms to improve performance (Snoek, Larochelle, and Adams 2012; Bergstra, Yamins, and Cox 2013; Bergstra and Bengio 2012; Bergstra et al. 2011), or on a combination of selection and configuration (Feurer, Springenberg, and Hutter 2015). An example is Auto-WEKA (Thornton et al. 2013), which combines selection and parameter configuration based on Bayesian optimization (cf. Brochu, Cora, and de Freitas 2009). Predicting generalization error on unseen data has in fact been recognized as a major ML challenge (Guyon et al. 2006).

A recent non-frequentist approach (Hoffman, Shahriari, and de Freitas 2014) takes a Bayesian view of multi-armed bandits, applicable especially when the number of arms exceeds the number of allowed evaluations, and applies it also to automatic selection of ML algorithms. Like some prior methods, it evaluates algorithms on a small fixed percentage (e.g., 10%) of the full dataset. Unlike the above approaches, we do not assume that training (and evaluation) on a small fixed fraction of the data reliably ranks full-training results.

Finally, Domhan, Springenberg, and Hutter (2015) recently proposed extrapolating learning curves to enable early termination of non-promising learners. Their method is designed specifically for neural networks and does not apply directly to many classifiers (SVMs, trees, etc.) that train non-iteratively from a single pass through the dataset. They also do not focus on a theoretical justification, and they fit accuracy estimates to a library of hand-designed learning curves.

Cost-Sensitive Training Data Allocation

We begin by formally defining the problem of cost-sensitive training data allocation. As before, we use learner to refer to a classifier along with a hyper-parameter setting for it. Let $\mathcal{C} = \{C_1, C_2, \ldots, C_M\}$ be a set of $M$ learners which can be trained on subsets of a training set $T_r$ and evaluated on a validation set $T_v$. Let $|T_r| = N$. For $k \in \mathbb{N}$, let $[k]$ denote the set $\{1, 2, \ldots, k\}$. For $i \in [M]$, let $c_i : [N] \to \mathbb{R}^+$ be a cost function denoting the expected computational cost of training learner $C_i$ when $n$ training examples are drawn uniformly at random from $T_r$.¹ We make two common assumptions about the training process, namely, that it looks at all training data and that its complexity grows at least linearly. Formally, $c_i(n) \ge n$ and $c_i(m) \ge \frac{m}{n} c_i(n)$ for $m > n$.

For $i \in [M]$, let $f_i : [N] \to [0, 1]$ be an accuracy function, where $f_i(n)$ denotes the expected accuracy of $C_i$ on $T_v$ when trained on $n$ training examples chosen at random from $T_r$. The corresponding error function, $e_i(n)$, is defined as $1 - f_i(n)$. Note that our tool also supports accuracy functions not tied to a fixed validation set $T_v$ (e.g., cross-validation) and other measures such as precision, recall, and F1-score; our analysis applies equally well to these measures.

We denote a training data allocation of $n$ training samples to learner $i$ by a pair $a = (i, n)$. Let $S = \langle (i^{(1)}, n^{(1)}), (i^{(2)}, n^{(2)}), \ldots, (i^{(s)}, n^{(s)}) \rangle$ be a sequence of allocations to learners in $\mathcal{C}$. We will use $S_i$ to denote the induced subsequence containing all training data allocations to learner $C_i$, i.e., the subsequence of $S$ induced by all pairs $(i^{(k)}, n^{(k)})$ such that $i^{(k)} = i$. In our context, if allocations $(i, n^{(k)})$ and $(i, n^{(\ell)})$ are in $S_i$ with $k < \ell$, then $n^{(k)} < n^{(\ell)}$.

Evaluating $f_i(n)$ amounts to training learner $C_i$ on $n$ examples from $T_r$ and evaluating its accuracy. This, in expectation, incurs a computational cost of $c_i(n)$. In general, the expected training complexity or cost associated with $\mathcal{C}$ under the data allocation sequence $S$ is $\mathrm{cost}(S) = \sum_{(i,n) \in S} c_i(n)$.

Our goal is to search for an $\tilde{i} \in [M]$ such that $f_{\tilde{i}}(N)$ is maximized, while also ensuring that the overall training cost is not too large relative to $c_{\tilde{i}}(N)$. This bi-objective criterion is not easy to achieve. E.g., a brute-force evaluation, corresponding to $S = \langle (1, N), (2, N), \ldots, (M, N) \rangle$ and $\tilde{i} = \arg\max_{i \in [M]} f_i(N)$, obtains the optimal $\tilde{i}$ but incurs the maximum training cost of $c_i(N)$ for every suboptimal learner. On the other hand, a low-cost heuristic $S = \langle (1, n), (2, n), \ldots, (M, n), (\tilde{i}, N) \rangle$ for some $n \ll N$ and $\tilde{i} = \arg\max_i f_i(n)$ incurs a small training overhead of only $c_i(n)$ for each suboptimal $C_i$, but may choose an arbitrarily suboptimal $\tilde{i}$.

We seek an in-between solution, ideally with the best of both worlds: a bounded optimality gap on $C_{\tilde{i}}$'s accuracy, and a bounded regret in terms of data misallocated to sufficiently suboptimal learners. Informally speaking, we will ensure that learners with performance at least $\Delta$ worse than optimal are allocated only $o(N)$ training examples, i.e., an asymptotically vanishing fraction of $T_r$. Under certain conditions, this will ensure that the training cost regret is sublinear. We next formally define the notions of suboptimality and regret in this context.

Definition 1. Let $\mathcal{C}$ be a collection of $M$ learners with accuracy functions $f_i$, $\Delta \in (0, 1]$, $n \in \mathbb{N}^+$, and $i^* = \arg\max_{i \in [M]} f_i(n)$. A learner $C_j \in \mathcal{C}$ is called $(n, \Delta)$-suboptimal for $\mathcal{C}$ if $f_{i^*}(n) - f_j(n) \ge \Delta$, and $(n, \Delta)$-optimal otherwise.

Definition 2. Let $S$ be a data allocation sequence for a collection $\mathcal{C}$ of learners with accuracy functions $f_i$, $\Delta \in (0, 1]$, and $n \in \mathbb{N}^+$. The $(n, \Delta)$-regret of $S$ for $\mathcal{C}$ is defined as $\sum_{i : C_i \text{ is } (n,\Delta)\text{-suboptimal}} \mathrm{cost}(S_i)$.

The regret of $S$ is thus the cumulative cost of training all $(n, \Delta)$-suboptimal learners when using $S$.
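As a concrete illustration, the following minimal sketch (the helper names and all accuracy/cost numbers are ours, hypothetical, and not from the paper's experiments) computes the set of $(n, \Delta)$-suboptimal learners and the resulting regret of an allocation sequence:

    # Minimal sketch: computing the (n, Delta)-regret of an allocation sequence.
    # Accuracy and cost values below are hypothetical, for illustration only.

    def regret(S, f_at_n, cost_fn, delta):
        """(n, Delta)-regret: total training cost spent on learners whose
        expected accuracy at n trails the best learner's by at least delta."""
        f_star = max(f_at_n.values())
        suboptimal = {i for i, f in f_at_n.items() if f_star - f >= delta}
        return sum(cost_fn(i, n) for (i, n) in S if i in suboptimal)

    # Three learners; f_at_n[i] = expected accuracy f_i(n) at n = N.
    f_at_n = {1: 0.80, 2: 0.79, 3: 0.70}
    cost_fn = lambda i, n: n          # simplest cost obeying c_i(n) >= n

    # An allocation sequence S of (learner, samples) pairs.
    S = [(1, 500), (2, 500), (3, 500), (1, 750), (2, 750), (1, 1000)]

    print(regret(S, f_at_n, cost_fn, delta=0.05))  # only learner 3 is suboptimal -> 500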

Definition 3 (COST-SENSITIVE TRAINING DATA ALLOCATION PROBLEM). Let $\mathcal{C} = \{C_1, \ldots, C_M\}$ be a set of learners, $T_r$ be a training set for $\mathcal{C}$ containing $N$ examples, $T_v$ be a validation set, $c_i$ and $f_i$, for $i \in [M]$, be the training cost and accuracy functions, resp., for learner $C_i$, and $\Delta \in (0, 1]$ be a constant. The Cost-Sensitive Training Data Allocation Problem is to compute a training data allocation sequence $S$ for $\mathcal{C}$ and $T_r$ as well as a value $\tilde{i} \in [M]$ such that:

1. $S$ contains $(\tilde{i}, N)$,
2. $C_{\tilde{i}}$ is $(N, \Delta)$-optimal,
3. $\mathrm{cost}(S_{\tilde{i}}) \le d \cdot c_{\tilde{i}}(N)$ for some fixed constant $d$, and
4. the $(N, \Delta)$-regret of $S$ is $o(M \cdot \mathrm{cost}(S_{\tilde{i}}))$ in $N$.

A solution to this problem thus identifies an $(N, \Delta)$-optimal learner $C_{\tilde{i}}$, trained on all of $T_r$, incurring on $C_{\tilde{i}}$ no more than a constant-factor overhead relative to the minimum training cost $c_{\tilde{i}}(N)$, and with a guarantee that any $(N, \Delta)$-suboptimal learner $C_i$ incurred a vanishingly small training cost compared to training $C_{\tilde{i}}$ (specifically, $\mathrm{cost}(S_i)/\mathrm{cost}(S_{\tilde{i}}) \to 0$ as $N \to \infty$).

The DAUB Algorithm

Algorithm 1 describes our Data Allocation using Upper Bounds strategy. The basic idea is to project an optimistic upper bound on the full-training accuracy $f_i(N)$ of learner $i$ using recent evaluations of $f_i(n)$. The learner with the highest upper bound is then selected to receive additional samples. Our implementation of DAUB uses monotone regression to estimate upper bounds on $f_i(N)$, as detailed below, since observed accuracies are noisy and may occasionally violate the known monotonicity of learning curves. In the noise-free setting, by contrast, a straight line through the two most recent values of $f_i(n)$ provides a strict upper bound on $f_i(N)$.
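The reasoning behind this straight-line projection, sketched here in the paper's notation (it is formalized as Lemma 3 in the Appendix): for a well-behaved accuracy function, the secant slope through the two most recent evaluations majorizes all later growth, so

    \[
      f_i(N) \;\le\; f_i(n) + (N - n)\,\frac{f_i(n) - f_i(n')}{n - n'},
      \qquad n' < n \le N,
    \]

since the discrete derivative of $f_i$ is non-increasing.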

As a bootstrapping step, DAUB first allocates $b$, $br$, and $br^2 \le N$ training examples to each learner $C_i$, trains them, and records their training and validation accuracies in arrays $f^T_i$ and $f^V_i$, resp. If $f^V_i$ at the current point is smaller than at the previous point, DAUB uses a simple monotone regression method, making the two values meet in the middle.

Input : Learners $\mathcal{C} = \{C_1, \ldots, C_M\}$, training examples $T_r$, $N = |T_r|$, validation set $T_v$
Output : Learner $C_{\tilde{i}}$ trained on $T_r$, data allocation sequence $S$
Params : Geometric ratio $r > 1$, granularity $b \in \mathbb{N}^+$ s.t. $br^2 \le N$

    DAUB(C, T_r, T_v, r, b):
        S <- empty sequence
        for i in 1..M do
            for k in 0..2 do
                append (i, br^k) to S; TrainLearner(i, br^k)
            n_i <- br^2; u_i <- UpdateBound(i, n_i)
        while (max_i n_i) < N do
            j <- arg max_{i in [M]} u_i   (break ties arbitrarily)
            n_j <- min{ceil(r * n_j), N}
            append (j, n_j) to S; TrainLearner(j, n_j)
            u_j <- UpdateBound(j, n_j)
        select i~ such that n_{i~} = N
        return C_{i~}, S

    TrainLearner(i in [M], n in [N]):
        T <- n examples sampled from T_r; train C_i on T
        f^T_i[n] <- training accuracy of C_i on T
        f^V_i[n] <- validation accuracy of C_i on T_v
        if n/r >= b then
            delta <- f^V_i[n] - f^V_i[n/r]
            if delta < 0 then                      (monotone regression fix:
                f^V_i[n/r] += delta/2              make the two values
                f^V_i[n]   -= delta/2              meet in the middle)

    UpdateBound(i in [M], n in [N]):
        f'^V_i[n] <- LinearRegrSlope(f^V_i[n/r^2], f^V_i[n/r], f^V_i[n])
        ub^V_i[n] <- f^V_i[n] + (N - n) * f'^V_i[n]
        return min{f^T_i[n], ub^V_i[n]}

Algorithm 1: Data Allocation using Upper Bounds
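For readers who prefer code, here is a compact Python sketch of Algorithm 1. It is our illustration, not the authors' implementation: train_and_eval is a hypothetical stand-in for training the actual (e.g., WEKA) learner on n random examples and returning its training and validation accuracies. The defaults r=1.5, b=500 match the experimental settings reported below.

    import math

    def daub(learners, N, train_and_eval, r=1.5, b=500):
        # Sketch of Algorithm 1 (DAUB). `train_and_eval(learner, n)` is assumed
        # to train on n random examples and return (train_acc, val_acc).
        M = len(learners)
        fT = [dict() for _ in range(M)]      # f^T_i[n]: training accuracies
        fV = [dict() for _ in range(M)]      # f^V_i[n]: validation accuracies
        S, n_alloc, u = [], [0] * M, [0.0] * M
        grids = [[] for _ in range(M)]       # allocation sizes seen per learner

        def train(i, n):
            S.append((i, n))
            fT[i][n], fV[i][n] = train_and_eval(learners[i], n)
            if len(grids[i]) >= 2:           # monotone-regression fix:
                prev = grids[i][-2]          # a decreasing pair meets halfway
                delta = fV[i][n] - fV[i][prev]
                if delta < 0:
                    fV[i][prev] += delta / 2
                    fV[i][n] -= delta / 2

        def update_bound(i, n):
            xs = grids[i][-3:]               # last three allocation sizes
            ys = [fV[i][x] for x in xs]
            xbar, ybar = sum(xs) / 3, sum(ys) / 3
            slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
                     / sum((x - xbar) ** 2 for x in xs))
            # ub^V_i[n] = f^V_i[n] + (N - n) * estimated derivative,
            # capped by the training accuracy f^T_i[n].
            return min(fT[i][n], fV[i][n] + (N - n) * slope)

        for i in range(M):                   # bootstrap with b, br, br^2 samples
            for k in range(3):
                n = round(b * r ** k)
                grids[i].append(n)
                train(i, n)
            n_alloc[i] = grids[i][-1]
            u[i] = update_bound(i, n_alloc[i])

        while max(n_alloc) < N:
            j = max(range(M), key=lambda i: u[i])   # most promising upper bound
            n_alloc[j] = min(math.ceil(r * n_alloc[j]), N)
            grids[j].append(n_alloc[j])
            train(j, n_alloc[j])
            u[j] = update_bound(j, n_alloc[j])

        i_sel = n_alloc.index(N)             # the learner trained on all of T_r
        return learners[i_sel], S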

After bootstrapping, in each iteration, DAUB identifies the learner $C_j$ that has the most promising upper bound estimate (computed as discussed next) on the unknown projected expected accuracy $f_j(N)$ and allocates $r$ times more examples (up to $N$) to it than what $C_j$ was allocated previously. For computing the upper bound estimate, DAUB uses two sources. First, assuming training and validation data come from the same distribution, $f^T_i[n_i]$ provides such an estimate. Further, as will be justified in the analysis of the idealized scenario called DAUB*, $ub^V_i[n_i] = f^V_i[n_i] + (N - n_i) f'^V_i[n_i]$ also provides such an estimate under certain conditions, where $f'^V_i[n_i]$ is the estimated derivative, computed as the slope of the linear-regression best-fit line through $f^V_i[n]$ for $n \in \{n_i/r^2, n_i/r, n_i\}$.
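As a numeric illustration of the bound computation (the accuracies are made up, not values from the paper), with $r = 1.5$, $n_i = 1125$, and $N = 38500$:

    import numpy as np

    # Hypothetical validation accuracies at the three most recent sizes
    # n/r^2, n/r, n:
    ns = np.array([500.0, 750.0, 1125.0])
    fV = np.array([0.62, 0.65, 0.67])

    slope = np.polyfit(ns, fV, 1)[0]          # estimated derivative f'^V_i[n_i]
    ub = fV[-1] + (38500 - ns[-1]) * slope    # ub^V_i[n_i]

    # The bound DAUB actually uses is min(f^T_i[n_i], ub^V_i[n_i]); early in
    # training the linear projection is wildly optimistic (here ub ~ 3.58),
    # so capping it by the training accuracy (say 0.81) matters.
    print(min(0.81, ub))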

Once some learner $C_{\tilde{i}}$ is allocated all $N$ training examples, DAUB halts and outputs $C_{\tilde{i}}$ along with the allocation sequence it used.

Theoretical Support for DAUB

To help understand the behavior of DAUB, we consider an idealized variant, DAUB*, that operates precisely like DAUB but has access to the true expected accuracy and cost functions, $f_i(n)$ and $c_i(n)$, rather than just their observed estimates. As $n$ grows, learning variance (across random batches of size $n$) decreases, the observed estimates of $f_i(n)$ and $c_i(n)$ converge to these ideal values, and the behavior of DAUB thus approaches that of DAUB*.

Let $f^* = \max_{i \in [M]} f_i(N)$ be the (unknown) target accuracy and $C_{i^*}$ be the corresponding (unknown) optimal learner. For each $C_i$, let $u_i : [N] \to [0, 1]$ be an arbitrary projected upper bound estimate that DAUB* uses for $f_i(N)$ when it has allocated $n < N$ training examples to $C_i$. We will assume w.l.o.g. that $u_i$ is non-increasing at the points where it is evaluated by DAUB*.² For the initial part of the analysis, we will think of $u_i$ as a black-box function, ignoring how it is computed. Let $u_{\min}(N) = \min_{i \in [M]} \{u_i(N)\}$.

It may be verified that once $u_j(n)$ drops below $u_{\min}(N)$, DAUB* will stop allocating more samples to $C_j$. While this gives insight into the behavior of DAUB*, for the analysis we will use a slightly weaker quantity $n^*_i$ that depends on the target accuracy $f^*$ rather than $u_{\min}(N)$.

Definition 4. $u_i : [N] \to [0, 1]$ is a valid projected upper bound function if $u_i(n) \ge f_i(N)$ for all $n \in [N]$.

Definition 5. Define $n^*_i \in \mathbb{N}$ as $N$ if $u_i(N) \ge f^*$, and as $\min\{\ell \mid u_i(\ell) < f^*\}$ otherwise.

A key observation is that when using $u_j$ as the only source of information about $C_j$, one must allocate at least $n^*_j$ examples to $C_j$ before acquiring enough information to conclude that $C_j$ is suboptimal. Note that $n^*_j$ depends on the interaction between $u_j$ and $f^*$, and is thus unknown. Interestingly, we can show that DAUB*$(\mathcal{C}, T_r, T_v, r, b)$ allocates to $C_j$ at most a constant factor more examples, specifically fewer than $r n^*_j$ in each step and $\frac{r^2}{r-1} n^*_j$ in total, if it has access to valid projected upper bound functions for $C_j$ and $C_{i^*}$ (cf. Lemma 1 in Appendix). In other words, DAUB*'s allocation is essentially optimal w.r.t. $u_j$.

Remark 1. A careful selection of the learner in each round is critical for allocation optimality w.r.t. $u_j$. Consider a simpler alternative: in round $k$, train all currently active learners on $n = br^k$ examples, compute all $f_i(n)$ and $u_i(n)$, and permanently drop $C_j$ from consideration if $u_j(n) < f_i(n)$ for some $C_i$. This does not guarantee allocation optimality; any permanent decision to drop a classifier must necessarily be conservative to be correct. By instead only temporarily suspending suboptimal-looking learners, DAUB* guarantees a much stronger property: $C_j$ receives no more allocation as soon as $u_j(n)$ drops below the (unknown) target $f^*$.

The following observation connects data allocation to training cost: if DAUB* allocates at most $n$ training examples to a learner $C_j$ in each step, then its overall cost for $C_j$ is at most $\frac{r}{r-1} c_j(n)$ (cf. Lemma 2 in Appendix). Combining this with Lemma 1, we immediately obtain the following result regarding DAUB*'s regret:³

Theorem 1. Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3. Let $r > 1$, $b \in \mathbb{N}^+$, and $S$ be the allocation sequence produced by DAUB*$(\mathcal{C}, T_r, T_v, r, b)$. If the projected upper bound functions $u_j$ and $u_{i^*}$ used by DAUB* are valid, then $\mathrm{cost}(S_j) \le \frac{r}{r-1} c_j(r n^*_j)$.

In the remainder of the analysis, we will (a) study the validity of the actual projected upper bound functions used by DAUB* and (b) explore conditions under which $(N, \Delta)$-suboptimality of $C_j$ guarantees that $n^*_j$ is a vanishingly small fraction of $N$, implying that DAUB* incurs a vanishingly small training cost on any $(N, \Delta)$-suboptimal learner.

Obtaining Valid Projected Upper Bounds. If the $f_i$ for $i \in [M]$ were arbitrary functions, it would clearly be impossible to upper bound $f_i(N)$ by looking only at estimates of $f_i(n)$ for $n < N$. Fortunately, each $f_i$ is the expected accuracy of a learner and is thus expected to behave in a certain way. In order to bound DAUB*'s regret, we make two assumptions on the behavior of $f_i$. First, $f_i$ is non-decreasing, i.e., more training data does not hurt validation accuracy. Second, $f_i$ has a diminishing-returns property: as $n$ grows, the additional validation accuracy benefit of including more training examples diminishes. Formally:

Definition 6. $f : \mathbb{N} \to [0, 1]$ is well-behaved if it is non-decreasing and its discrete derivative, $f'$, is non-increasing.

These assumptions on expected accuracy are well-supported from the PAC theory perspective. Let $ub_i(n)$ be the projected upper bound function used by DAUB* for $C_i$, namely the minimum of the training accuracy $f^T_i(n)$ of $C_i$ at $n$ and the validation-accuracy-based expression $f_i(n) + (N - n) f'_i(n)$. For DAUB*, we treat $f'_i(n)$ as the one-sided discrete derivative defined as $(f_i(n) - f_i(n - s))/s$ for some parameter $s \in \mathbb{N}^+$. We assume the training and validation sets, $T_r$ and $T_v$, come from the same distribution, which means $f^T_i(n)$ itself is a valid projected upper bound. Further, we can show that if $f_i(n)$ is well-behaved, then $ub_i(n)$ is a valid projected upper bound function (cf. Lemma 3 in Appendix).
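As a concrete example of a well-behaved function (our illustration, echoing the construction used in the proof of Theorem 5), take

    \[
      f(n) = 1 - \frac{c}{n} \quad (c > 0), \qquad
      f'(n) = f(n) - f(n-1) = \frac{c}{n(n-1)} .
    \]

This $f$ is non-decreasing and its discrete derivative is non-increasing, so Definition 6 holds; moreover $f'(n) = \Theta(1/n^2) = o(1/n)$, consistent with Proposition 1 in the Appendix.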

Thus, instead of relying on a parameterized functional form to model $f_i(n)$, DAUB* evaluates $f_i(n)$ for certain values of $n$ and computes an expression that is guaranteed to be a valid upper bound on $f_i(N)$ if $f_i$ is well-behaved.

Bounding Regret. We now fix $ub_i$ as the projected upper bound functions and explore how $(N, \Delta)$-suboptimality and the well-behaved nature of $f_i$ together limit how large $n^\Delta_i$ is.

Definition 7. For $\Delta \in (0, 1]$ and a well-behaved accuracy function $f_i$, define $n^\Delta_i \in \mathbb{N}$ as $N$ if $f'_i(N) > \Delta/N$, and as $\min\{\ell \mid f'_i(\ell) \le \Delta/N\}$ otherwise.
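Continuing the $f(n) = 1 - c/n$ illustration from above, $f'(\ell) \approx c/\ell^2$, so the threshold of Definition 7 scales as

    \[
      n^{\Delta} \;=\; \min\{\ell \mid f'(\ell) \le \Delta/N\}
               \;\approx\; \sqrt{cN/\Delta} \;=\; o(N) \quad \text{for fixed } \Delta,
    \]

i.e., a learner whose curve has already flattened to this degree receives only $O(\sqrt{N})$ of the $N$ available examples, in the spirit of Lemma 5 below.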

Using a first-order Taylor expansion, we can prove that $ub_j(n^\Delta_j) < f^*$, implying $n^*_j \le n^\Delta_j$ (cf. Lemma 4 in Appendix). Combining this with Theorem 1, we obtain:

Theorem 2. Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3. Let $r > 1$, $b \in \mathbb{N}^+$, $\Delta \in (0, 1]$, $C_j \in \mathcal{C}$ be an $(N, \Delta)$-suboptimal learner, and $S$ be the allocation sequence produced by DAUB*$(\mathcal{C}, T_r, T_v, r, b)$. If $f_j$ and $f_{i^*}$ are well-behaved, then $\mathrm{cost}(S_j) \le \frac{r}{r-1} c_j(r n^\Delta_j)$.

The final piece of the analysis is an asymptotic bound on $n^\Delta_j$. To this end, we observe that the discrete derivative of any bounded, well-behaved function on $\mathbb{N}$ behaves asymptotically as $o(1/n)$ (cf. Proposition 1 in Appendix). Applying this to $f_j$, we can prove that if $f'_j(N) \le \Delta/N$, then $n^\Delta_j$ is $o(N)$ in $N$ (cf. Lemma 5 in Appendix). This leads to our main result regarding DAUB*'s regret:

Theorem 3 (Sub-Linear Regret). Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3. Let $r > 1$, $b \in \mathbb{N}^+$, and $\Delta \in (0, 1]$. Let $J = \{j \mid C_j \in \mathcal{C} \text{ is } (N, \Delta)\text{-suboptimal}\}$. For all $j \in J$, suppose $f_j$ is well-behaved and $f'_j(N) \le \Delta/N$. If DAUB*$(\mathcal{C}, T_r, T_v, r, b)$ outputs $S$ as the training data allocation sequence along with a selected learner $C_{\tilde{i}}$ trained on all of $T_r$, then:

1. $\mathrm{cost}(S_{\tilde{i}}) \le \frac{r}{r-1} c_{\tilde{i}}(N)$; and
2. the $(N, \Delta)$-regret of $S$ is $o(\sum_{j \in J} c_j(N))$.

Thus, DAUB* successfully solves the cost-sensitive training data allocation problem whenever, for $j \in J$, $f_j$ is well-behaved and $c_j(n) = O(c_{\tilde{i}}(n))$, i.e., training any suboptimal learner is asymptotically not any costlier than training an optimal learner. While more refined versions of this result can be generated, the necessity of an assumption on the cost function is clear: if a suboptimal learner $C_j$ were arbitrarily costlier to train than optimal learners, then, in order to guarantee near-optimality, one must incur a significant misallocation cost training $C_j$ on some reasonable subset of $T_r$ in order to ascertain that $C_j$ is in fact suboptimal.

Tightness of Bounds. The cost bound on misallocated data in Theorem 2 in terms of $n^\Delta_j$ is in fact tight (up to a constant factor) in the worst case, unless further assumptions are made about the accuracy functions. In particular, every algorithm that guarantees $(N, \Delta)$-optimality without further assumptions must, in the worst case, incur a cost of the order of $c_j(n^\Delta_j)$ for every suboptimal $C_j \in \mathcal{C}$ (cf. Theorem 5 in Appendix for a formal statement):

Theorem 4 (Lower Bound, informal statement). Let $\Delta \in (0, 1]$ and $\mathcal{A}$ be a training data allocation algorithm that always outputs an $(N, \Delta)$-optimal learner. Then there exists an $(N, \Delta)$-suboptimal learner $C_j$ that would force $\mathcal{A}$ to incur a misallocated training cost larger than $c_j(n^\Delta_j)/2$.

Experiments

Our experiments make use of 41 classifiers covering a wide range of algorithms (SVMs, Decision Trees, Neural Networks, Logistic Regression, etc.) as implemented in WEKA (Hall et al. 2009). All experiments were conducted on AMD Opteron 6134 machines with 32 cores and 64 GB memory, running Scientific Linux 6.1.⁴

We first evaluate DAUB on one real-world binary classification dataset, "Higgs boson" (Baldi, Sadowski, and Whiteson 2014), and one artificial dataset, "Parity with distractors," to examine the robustness of DAUB's strategy across two extremely different types of data. In the latter task, the class label is the parity of a (hidden) subset of binary features; the remaining features serve as distractors, with no influence on the class label. We generated 65,535 distinct examples based on 5-bit parity with 11 distractors, and randomly selected 21,500 samples each for $T_r$ and $T_v$. For the Higgs and other real-world datasets, we first randomly split the data with a 70/30 ratio, selected 38,500 samples for $T_r$ from the 70% split, and used the 30% split as $T_v$. We coarsely optimized the DAUB parameters at $b = 500$ and $r = 1.5$ based on the Higgs data, and kept those values fixed for all datasets. This yielded 11 possible allocation sizes: 500, 1000, 1500, 2500, 4000, 5000, 7500, 11500, 17500, 25500, 38500.⁵

Results for HIGGS and PARITY are as follows. The accuracy loss of the ultimate classifiers selected by DAUB turned out to be quite small: DAUB selected the top classifier for HIGGS (i.e., 0.0% loss) and one of the top three classifiers for PARITY (0.3% loss). In terms of complexity reduction, Table 1 shows clear gains over "full" training of all classifiers on the full $T_r$, in both total allocated samples as well as total CPU training time, for standard DAUB as well as a variant that does not use training set accuracy $f^T$ as an upper bound. Both variants reduce the allocated samples by ∼2x-4x for HIGGS, and by ∼5x for PARITY. The impact on CPU runtime is more pronounced, as many suboptimal classifiers with supra-linear runtimes receive very small amounts of training data. As the table shows, standard DAUB reduces total training time by a factor of ∼25x for HIGGS, and ∼15x for PARITY.

Table 1: Comparison of full training, DAUB without the training accuracy bound $f^T$, and DAUB.

Figures 1 and 2 provide additional insight into DAUB's behavior. Figure 1 shows how validation accuracy progresses with increasing training data allocation for several classifiers on the HIGGS dataset. The plots for the most part conform to our ideal-case assumptions of increasing accuracy with diminishing slope, barring a few monotonicity glitches⁶ due to stochastic sampling noise. We note that, while there is one optimal classifier $C^*$ (a parameterization of a Rotation Forest) with the best validation accuracy after training on all of $T_r$, several other classifiers outperformed $C^*$ in early training. For instance, LADTree is better than $C^*$ until 5,000 examples but then flattens out.

Figure 1: Higgs validation accuracy curves of several classifiers that initially outperform the classifier (Rotation Forest) that is ultimately the best when given all training data.

Figure 2: Number of Higgs examples allocated by DAUB to various classifiers, sorted by decreasing validation performance at N = 38500.

Figure 2 gives perspective on how DAUB distributes data allocations among the 41 classifiers when run on the HIGGS dataset. The classifiers here are sorted by decreasing validation accuracy $f_i(N)$. While DAUB manages to select $C^*$ in this case, what's equally critical is the distribution of allocated training data. The figure shows that DAUB allocates most of the training data to the top eight classifiers. Most classifiers receive 2,500 or fewer samples, and only four classifiers receive more than 10k samples, all of them within 1.5% of the optimal performance.

Finally, in Table 2 we report results of DAUB on Higgs plus five other real-world benchmarks as indicated: Buzz (Kawala et al. 2013); Covertype (Blackard and Dean 2000); Million Song Dataset (Bertin-Mahieux et al. 2011); SUSY (Baldi, Sadowski, and Whiteson 2014); and Vehicle-SensIT (Duarte and Hu 2004). These experiments use exactly the same parameter settings as for HIGGS and PARITY. As before, the table shows a comparison in terms of allocated training samples and runtime. In addition, it displays the accuracy loss incurred by DAUB's final selected classifier. The highest loss is ∼1%, well within an acceptable range. The average loss across all six benchmarks is 0.4%, and the average speedup is 16x. Our empirical findings thus show that, in practice, DAUB can consistently select near-optimal classifiers at a substantially reduced computational cost compared to full training of all classifiers.

Table 2: Comparison of full training and DAUB across six benchmarks.

Conclusion

We reiterate the potential practical impact of our original Cost-Sensitive Training Data Allocation problem formulation, and our proposed DAUB algorithm for solving this problem. In our experience, DAUB has been quite easy to use, easy to code and tune, and is highly practical in robustly finding near-optimal learners with greatly reduced CPU time across datasets drawn from a variety of real-world domains. Moreover, it does not require built-in knowledge of learners or properties of datasets, making it ideally suited for practitioners without domain knowledge of the learning algorithms or data characteristics. Furthermore, all intermediate results can be used to interactively inform the practitioner of relevant information such as progress (e.g., updated learning curves) and decisions taken (e.g., allocated data). Such a tool was introduced by Biem et al. (2015) and a snapshot of it is depicted in the Appendix. Our theoretical work on the idealized DAUB* scenario also reveals novel insights and provides important support for the real-world behavior of DAUB with noisy accuracy estimates. As dataset sizes scale, we expect that DAUB will better and better approach the idealized behavior of DAUB*, which offers strong bounds on both learner sub-optimality as well as regret due to misallocated samples.

There are many opportunities for further advances in both the theoretical and practical aspects of this work. It should be possible to develop more accurate bound estimators given noisy accuracy estimates, e.g., using monotone spline regression. Likewise, it may be possible to extend the theory to encompass noisy accuracy estimates, for example, by making use of PAC lower bounds on generalization error to establish upper bounds on learner accuracy. DAUB could also be combined in an interesting way with methods (Ali, Caruana, and Kapoor 2014) to optimally split data between training and validation sets.

Appendix: Proof Details

Lemma 1. Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3, $r > 1$, and $b \in \mathbb{N}^+$. If the projected upper bound functions $u_j$ and $u_{i^*}$ used by DAUB*$(\mathcal{C}, T_r, T_v, r, b)$ are valid, then it allocates to $C_j$ fewer than $r n^*_j$ examples in each step and $\frac{r^2}{r-1} n^*_j$ examples in total.

Proof of Lemma 1. Suppose, for the sake of contradiction, that DAUB* allocates at least $r n^*_j$ examples to learner $C_j$ at some point in its execution. Since $r > 1$ and all allocation sizes are at most $N$, $n^*_j < N$. Further, since $n_j$, the number of examples allocated to $C_j$, is always incremented geometrically by a factor of at most $r$, at some previous point in the algorithm we must have had $n_j \in \{n^*_j, \ldots, r n^*_j - 1\}$. Since the projected upper bound function $u_j$ is non-increasing, at that previous point $u_j(n_j) \le u_j(n^*_j) < f^*$ by the definition of $n^*_j$. On the other hand, since the projected upper bound function $u_{i^*}$ is valid, the projected upper bound for $C_{i^*}$ would always be at least $f^*$.

Therefore, the algorithm, when choosing which learner to allocate the next set of examples to, will, from this point onward, always prefer $C_{i^*}$ (and possibly another learner appearing to be even better) over $C_j$, implying that $n_j$ will never exceed its current value. This contradicts the assumption that DAUB* allocates at least $r n^*_j$ examples to $C_j$ at some point during its execution.

For the bound on the total number of examples allocated to $C_j$, let $k = \lfloor \log_r(r n^*_j / b) \rfloor$. Since DAUB* allocates fewer than $r n^*_j$ examples to $C_j$ in any single step and the allocation sizes start at $b$ and increase by a factor of $r$ in each step, $C_j$ must have received precisely $b + br + br^2 + \ldots + br^k$ examples in total. This is smaller than $\frac{r}{r-1} br^k \le \frac{r^2}{r-1} n^*_j$.

Lemma 2. Let $\mathcal{C}, T_r, T_v, N, M$, and $c_i$ for $i \in [M]$ be as in Definition 3. Let $r > 1$, $b \in \mathbb{N}^+$, and $S$ be the training data allocation sequence produced by DAUB*$(\mathcal{C}, T_r, T_v, r, b)$. If DAUB* allocates at most $n$ training examples to a learner $C_j \in \mathcal{C}$ in each step, then $\mathrm{cost}(S_j) \le \frac{r}{r-1} c_j(n)$.

Proof of Lemma 2. As in the proof of Lemma 1, let $k = \lfloor \log_r(n/b) \rfloor$ and observe that the data allocation subsequence $S_j$ for learner $C_j$ must have been $(b, br, br^2, \ldots, br^k)$. The corresponding training cost for $C_j$ is $\mathrm{cost}(S_j) = c_j(b) + c_j(br) + c_j(br^2) + \ldots + c_j(br^k)$. By the assumption that $c_j$ grows at least linearly, $c_j(br^\ell) \le c_j(br^k)/r^{k-\ell}$ for $\ell \le k$. It follows that:

\[
\mathrm{cost}(S_j) \le c_j(br^k) \cdot (r^{-k} + r^{-k+1} + \ldots + 1) < \frac{r}{r-1} c_j(br^k) \le \frac{r}{r-1} c_j(n)
\]

This finishes the proof.

Lemma 3. If $f_i(n)$ is well-behaved, then $ub_i(n)$ is a valid projected upper bound function.

Proof of Lemma 3. Recall that $ub_i(n)$ is the minimum of $f^T_i(n)$ and $f_i(n) + (N - n) f'_i(n)$. Since $f^T_i(n)$ is a non-increasing function and was already argued to be an upper bound on $f_i(N)$, it suffices to show that $g(n) = f_i(n) + (N - n) f'_i(n)$ is also a non-increasing function of $n$ and that $g(n) \ge f_i(N)$.

\[
\begin{aligned}
g(n + 1) &= f_i(n+1) + (N - n - 1) f'_i(n+1) \\
&\le (f_i(n) + f'_i(n)) + (N - n - 1) f'_i(n+1) \\
&\le (f_i(n) + f'_i(n)) + (N - n - 1) f'_i(n) && \text{because } f'_i \text{ is non-increasing} \\
&= f_i(n) + (N - n) f'_i(n) = g(n)
\end{aligned}
\]

The first inequality follows from the assumptions on the behavior of $f_i$ w.r.t. $n$ and its first-order Taylor expansion. Specifically, recall that $f'_i(n)$ is defined as $(f_i(n) - f_i(n - s))/s$ for some (implicit) parameter $s$. For concreteness, let's refer to that implicit function as $h_i(n, s)$. Let $h'_i(n, s)$ denote the discrete derivative of $h_i(n, s)$ w.r.t. $n$. The non-increasing nature of $f'_i(n)$ w.r.t. $n$ implies that $h_i(n, s)$, for any fixed $n$, is a non-decreasing function of $s$. In particular, $f'_i(n) = h_i(n, s) \ge h_i(n, 1)$ for any $s \ge 1$. It follows that $f_i(n + 1) = f_i(n) + h_i(n + 1, 1) \le f_i(n) + h_i(n + 1, s) = f_i(n) + f'_i(n + 1) \le f_i(n) + f'_i(n)$, as desired.

Thus, $g(n)$ is non-increasing. Since $g(N) = f_i(N)$ by definition, we have $g(n) \ge f_i(N)$, finishing the proof.

Lemma 4. For an $(N, \Delta)$-suboptimal learner $C_j$ with well-behaved accuracy function $f_j$, $n^*_j \le n^\Delta_j$.

Proof of Lemma 4. If $f'_j(N) > \Delta/N$, then $n^\Delta_j = N$ and the statement of the lemma holds trivially. Otherwise, by the definition of $n^\Delta_j$, $f'_j(n^\Delta_j) \le \Delta/N$. In order to prove $n^*_j \le n^\Delta_j$, we must show that $ub_j(n^\Delta_j) < f^*$. We do this by using a first-order Taylor expansion:

\[
\begin{aligned}
ub_j(n^\Delta_j) &\le f_j(n^\Delta_j) + (N - n^\Delta_j) f'_j(n^\Delta_j) \\
&\le f_j(n^\Delta_j) + (N - n^\Delta_j) \frac{\Delta}{N} && \text{by the above observation} \\
&< f_j(n^\Delta_j) + \Delta \\
&\le f_j(N) + \Delta && \text{since } n^\Delta_j \le N \text{ and } f_j \text{ is non-decreasing} \\
&\le f_{i^*}(N) && \text{by } (N, \Delta)\text{-suboptimality of } C_j
\end{aligned}
\]

Hence, $ub_j(n^\Delta_j) < f_{i^*}(N) = f^*$.

Proposition 1. If $g : \mathbb{N} \to [m_1, m_2]$ is a well-behaved function for $m_1, m_2 \in \mathbb{R}$, then its discrete derivative $g'(n)$ decreases asymptotically as $o(1/n)$.

Proof of Proposition 1. Recall the definition of the discrete derivative and assume, for simplicity of exposition, that the parameter $s$ is 1, namely $g'(n) = g(n) - g(n-1)$; the argument easily extends to $s > 1$. Applying the definition repeatedly, we get $g(n) = g(1) + (g'(2) + \ldots + g'(n)) \ge m_1 + g'(2) + \ldots + g'(n)$. If $g'(n)$ were $\Omega(1/n)$, there would exist $n_0$ and $c$ such that for all $n \ge n_0$, $g'(n) \ge c/n$. This would mean $g(n) \ge m_1 + \sum_{n \ge n_0} c/n$. This summation, however, diverges to infinity, while $g(n)$ is bounded above by $m_2$. It follows that $g'(n)$ cannot be $\Omega(1/n)$; it must decrease asymptotically strictly faster than $1/n$, that is, be $o(1/n)$.

Lemma 5. For an $(N, \Delta)$-suboptimal learner $C_j$ with a well-behaved accuracy function $f_j$ satisfying $f'_j(N) \le \Delta/N$, we have that $n^\Delta_j$ is $o(N)$ in $N$.

Proof of Lemma 5. From Proposition 1, $f'_j(n) = o(1/n)$, implying $\Delta/f'_j(n) = \omega(n\Delta)$. This means there is a value $N'$ that is $o(N/\Delta)$ and suffices to ensure $\Delta/f'_j(N') \ge N$ for all large enough $N$, that is, $f'_j(N') \le \Delta/N$. Since $n^\Delta_j$ is, by definition, no larger than $N'$, $n^\Delta_j$ must also be $o(N/\Delta)$.

Proof of Theorem 3. Since DAUB* never allocates more than $N$ training examples in a single step to $C_i$ for any $i \in [M]$, it follows from Lemma 2 that $\mathrm{cost}(S_i) \le \frac{r}{r-1} c_i(N)$. In particular, $\mathrm{cost}(S_{\tilde{i}}) \le \frac{r}{r-1} c_{\tilde{i}}(N)$. The $(N, \Delta)$-regret of DAUB*, by definition, is $\sum_{j \in J} \mathrm{cost}(S_j)$. By Theorem 2, this is at most $\sum_{j \in J} \frac{r}{r-1} c_j(r n^\Delta_j)$. By Lemma 5, $n^\Delta_j$ is $o(N)$, and since $c_j$ grows at least linearly, $c_j(r n^\Delta_j) \le \frac{r n^\Delta_j}{N} c_j(N) = o(c_j(N))$; the $(N, \Delta)$-regret of $S$ is therefore $o(\sum_{j \in J} c_j(N))$.

Theorem 5 (Lower Bound). Let $\Delta \in (0, 1]$ and $\mathcal{A}$ be a training data allocation algorithm that, when executed on a training set of size $N$, is guaranteed to always output an $(N, \Delta)$-optimal learner. Let $\mathcal{C}, T_r, T_v, N, M, c_i$ and $f_i$ for $i \in [M]$ be as in Definition 3. Let $\gamma = (\sqrt{5} - 1)/2$ and $C_j \in \mathcal{C}$ be an $(N, \Delta)$-suboptimal learner. Then there exists a choice of $f_j(n)$ such that $f_j$ is well-behaved, $f'_j(N) \le \Delta/N$, and $\mathcal{A}(\mathcal{C}, T_r, T_v)$ allocates to $C_j$ more than $\gamma n^\Delta_j$ examples, thus incurring a misallocated training cost on $C_j$ larger than $\gamma c_j(n^\Delta_j)$.

Proof of Theorem 5. We will argue that, under certain circumstances, $\mathcal{A}$ must allocate at least $\gamma n^\Delta_j$ examples to $C_j$ in order to guarantee $(N, \Delta)$-optimality.

To prove the desired result by contradiction, we consider the optimal learner $C_{i^*}$ and will construct specific accuracy functions $f_j(n)$ and $f_{i^*}(n)$ such that they are identical for all $n \le \gamma n^\Delta_j$, have derivative no more than $\Delta/N$ at $n^\Delta_j$, but differ by at least $\Delta$ when evaluated at $n = N$. This would imply that $\mathcal{A}$ simply cannot distinguish between the accuracy of $C_j$ and $C_{i^*}$ by evaluating them on at most $\gamma n^\Delta_j$ examples and thus cannot guarantee $(N, \Delta)$-optimality of the learner it outputs.

Suppose we can preserve the properties of $f_j$ required by the theorem, including $f'_j(n^\Delta_j) \le \Delta/N$, but can enforce that $f'_j(\gamma n^\Delta_j) \ge d\Delta/N$ for some $d > 1$ whose value we will determine shortly. Further, let us alter $f_j$ such that it remains unchanged for $n \le \gamma n^\Delta_j$ and is altered for larger $n$ such that $f'_j(n) = d\Delta/N$ for $\gamma n^\Delta_j < n \le n^\Delta_j$ and $f'_j(n) = \Delta/N$ for $n > n^\Delta_j$. Recalling that $n^\Delta_j \le N$, this modified $f_j$ then satisfies:

\[
f_j(N) - f_j(\gamma n^\Delta_j) = (n^\Delta_j - \gamma n^\Delta_j) \frac{d\Delta}{N} + (N - n^\Delta_j) \frac{\Delta}{N} = (1 - \gamma - 1/d)\, n^\Delta_j \frac{d\Delta}{N} + \Delta
\]

which is at least $\Delta$ as long as $1/d \le 1 - \gamma$. Now define $f_{i^*}$ s.t. $f_{i^*}(n) = f_j(n)$ for $n \le \gamma n^\Delta_j$ and $f_{i^*}(n) = f_j(\gamma n^\Delta_j)$ for $n > \gamma n^\Delta_j$. Thus, $f_j$ and $f_{i^*}$ are identical for $n \le \gamma n^\Delta_j$ but differ by at least $\Delta$ at $n = N$, as desired. It remains to show that we can choose a valid $f_j$ satisfying the properties required by the theorem but such that $f'_j(\gamma n^\Delta_j) \ge d\Delta/N$. To achieve this, consider $f_j(n) = 1 - c\Delta/n$ for any $c > 0$. Then $f_j$ is non-decreasing, $f'_j(n) = c\Delta/n^2$ is non-increasing, $n^\Delta_j = \sqrt{cN}$, $f'_j(n^\Delta_j) = \Delta/N$, and $f'_j(\gamma n^\Delta_j) = f'_j(\gamma\sqrt{cN}) = \Delta/(\gamma^2 N)$. Notice that $\gamma = (\sqrt{5} - 1)/2$ satisfies the condition $\gamma^2 \le 1 - \gamma$; it is in fact the largest value that satisfies it. Hence, we can choose $d = 1/\gamma^2$ in order to ensure $1/d \le 1 - \gamma$, which in turn ensures that the altered $f_j(N)$ and $f_j(\gamma n^\Delta_j)$, as constructed above, differ by at least $\Delta$. This finishes the construction for the lower bound.

¹ While we define the core concepts in terms of expected values, suitable for a formal definition and idealized analysis, the actual DAUB algorithm will operate on observed values of $c_i$ on particular subsets of $n$ training examples chosen at runtime.

² Since DAUB* evaluates $u_i$ for increasing values of $n$, it is easy to enforce monotonicity.
³ All proofs are deferred to the Appendix.

⁴ Code and data, including the full parameterization of each classifier, are available from the authors.

⁵ Unfortunately, some of our classifiers crash and/or run out of memory above 38500 samples.
⁶ In our experience, most of these glitches pertain to weak classifiers and thus would not significantly affect DAUB, since DAUB mostly focuses its effort on the strongest classifiers.

Figure 3: Implementation of a practitioner's tool based on DAUB. The interactive dashboard shows the performance of various learners in terms of loss on the validation set, based on training on the currently allocated training data points. For each learner, continuously updated plots show the learning curve and a comparison of training and validation performance.