AdaWISH: Faster Discrete Integration via Adaptive Quantiles

Abstract

Discrete integration in a high dimensional space of $n$ variables poses fundamental challenges. The WISH algorithm reduces the intractable discrete integration problem into $n$ optimization queries subject to randomized constraints, obtaining a constant approximation guarantee. The optimization queries are expensive, which limits the applicability of WISH. We propose AdaWISH, which obtains the same guarantee while accessing only a small subset of the queries of WISH. For example, when the number of function values is bounded by a constant, AdaWISH issues only $O(\log n)$ queries. The key idea is to query adaptively, taking advantage of the shape of the weight function. In general, we prove that AdaWISH has a regret of no more than $O(\log n)$ relative to an oracle that issues queries at data-dependent optimal points. Experimentally, AdaWISH gives precise estimates for discrete integration problems, of the same quality as WISH and better than several competing approaches, on a variety of probabilistic inference benchmarks, while saving substantially on the number of optimization queries compared to WISH. For example, it saves $81.5\%$ of WISH's queries while retaining the quality of results on a suite of UAI inference challenge benchmarks.

Introduction

Discrete integration in a high dimensional space poses fundamental challenges in scientific computing. Yet, it has numerous applications in artificial intelligence, machine learning, statistics, biology, and physics (Aziz et al. 2015; Vlasselaer et al. 2016) . In probabilistic inference, discrete integration is crucial for computing core quantities such as the partition function and marginal probabilities of probabilistic graphical models. The key challenge is the exponential growth in the volume of the space as the dimensionality increases, commonly known as the curse of dimensionality.

A fruitful line of work, based on hashing and optimization, is able to achieve constant-factor upper and lower bounds to the discrete integration problem (Ermon et al. 2013b; Zhao et al. 2016; Belle, Van den Broeck, and Passerini 2015; Asteris and Dimakis 2016; Gomes, Sabharwal, and Selman 2006; 2007; Kuck et al. 2019; Soos and Meel 2019). The key idea is to transform the discrete integration problem into optimization problems subject to additional randomly sampled parity constraints. Each imposed parity constraint randomly cuts the original space in half. If there is one element of interest in a subspace formed by applying $k$ random parity constraints, then there should be approximately $2^k$ elements of interest in the original space.

Figure 1: (Top) WISH obtains a constant approximation of $W = \sum_\sigma w(\sigma)$ by querying quantiles of $w$ that are exponentially apart, namely $b_i$, the $2^i$-th largest value of $w$. The yellow and blue curves show the corresponding upper and lower bounds after knowing the values of these quantiles, which form a 2-approximation. (Bottom) AdaWISH makes queries adaptively. The set of queried points forms a subset of that of WISH. In this example, AdaWISH does not need to query the two points in blue squares, since their values can be inferred from the left and right neighbors.


The WISH algorithm of Ermon et al. (2013b) first gave a constant approximation guarantee for weighted integration problems. They followed a two-step approach. In the first

step, they worked out a logarithmic slicing schedule, which was able to obtain a constant approximation guarantee by querying only the quantiles of the weight function that are exponentially apart, i.e., the $2^i$-th largest element $b_i$ of the weight function. See Figure 1 (Top) for an illustration. The second step was to obtain $b_i$ using hashing and optimization. They completed this step by querying optimization oracles subject to randomized parity constraints.

We propose AdaWISH, which obtains the same constant approximation guarantees for weighted integration, but with significantly fewer queries to the optimization oracle. The key idea is to make quantile queries in an adaptive way, using divide-and-conquer. See Figure 1 (Bottom) for the intuitive idea. Here, WISH queries all quantiles. AdaWISH, on the other hand, skips querying the two quantiles shown in blue boxes, because the neighboring black quantiles to their left and right, which sandwich the blue quantiles, are close in value. Exploiting this observation, one can prove, for instance, that AdaWISH needs only $O(\log n)$ queries if the function has a fixed number of distinct values. This coincides with the results of Chakraborty, Meel, and Vardi (2016). While their work focused on the unweighted case, our results apply to the general discrete case.

We analyze the performance of AdaWISH when it accesses oracles in two different ways. In the first, the oracle returns a pointwise approximation at each query point with a known approximation factor. In the second, when queried with one specific quantile, the oracle returns a value that is guaranteed to be bounded between neighboring quantiles with high probability. This second setting corresponds precisely to the probabilistic bound one obtains with hashing- and optimization-based oracles. In both settings, we are able to prove: (i) (Upper bound) AdaWISH makes at most $n + 1$ queries, forming a subset of WISH's queries, while achieving an identical constant approximation guarantee. (ii) (Regret bound) The number of queries of AdaWISH cannot exceed a logarithmic factor times that of an "optimal" algorithm, which makes the fewest queries for the same guarantee while knowing the exact shape of the weight function a priori. (iii) (Lower bound) Any algorithm that guarantees a constant approximation must make at least $\Omega(n)$ accesses to the oracle. We thus conclude that the number of optimization queries made by AdaWISH is close to optimal.

Experimentally, we test the efficacy of AdaWISH in the context of computing the partition function of random clique-structured Ising models, grid Ising models, and instances from the UAI inference competitions. We also count the solutions of randomly generated 3-SAT instances. Our results demonstrate that AdaWISH provides precise estimates for discrete integration problems, of the same quality as WISH and better than competing approaches such as belief propagation, mean field, dynamic importance sampling, and HAK. Meanwhile, AdaWISH saves the vast majority of optimization oracle queries compared to WISH. For example, it reduces the number of queries by 47%-60% on Ising models and by a median of 81.5% on benchmarks from the UAI inference challenge.

Related Work

Over the years, many exact and approximate approaches have been proposed to tackle the discrete integration problem. Variational methods (Wainwright and Jordan 2008) search for a tractable variational form to approximate the otherwise intractable integration. These methods are fast but often cannot provide tight bounds on the quality of the outcome. Approximate sampling techniques (Jerrum and Sinclair 1997; Madras 2002; Poon and Domingos 2006) are popular, but the number of samples required to obtain a reliable estimate often grows exponentially in the problem size. Importance sampling approaches such as SampleSearch (Gogate and Dechter 2012; Lou, Dechter, and Ihler 2019) and methods based on knowledge compilation (Choi, Kisa, and Darwiche 2013) have also achieved fundamental breakthroughs in performance (Hazan and Jaakkola 2012; Liu et al. 2015; Lou, Dechter, and Ihler 2017; Friedman and Van den Broeck 2018).

There have been recent developments on the use of short parity constraints to speed up computation in WISH style methods (Zhao et al. 2016; Ermon et al. 2014 ; Asteris and Dimakis 2016; Achlioptas and Theodoropoulos 2017; Achlioptas, Hammoudeh, and Theodoropoulos 2018) . While these approaches reduce the empirical complexity of answering individual optimization oracle queries, our method reduces the number of queries itself, complementing these approaches.

Only a handful of hashing based approaches provide guarantees for weighted functions (Chakraborty et al. 2015; Ermon et al. 2013a) . Chakraborty et al. (2014a) propose an algorithm that exploits the tilt parameter of a weighted Boolean formula. Our empirical comparison reveals that their approach requires substantially more queries to an NP oracle. On the other hand, it accesses NP oracles, which can be more efficient in practice than the MAP oracles used by WISH and AdaWISH.

The adaptive approach of AdaWISH is motivated by the work of Sabharwal and Xue (2018), who addressed the problem of reducing the number of human annotations needed to reliably estimate the precision-recall curve of massive noisy datasets. Our application domain, namely discrete integration, differs in a number of aspects: our space is exponential and thus too large to enumerate (they operated on an explicitly stored dataset), our query oracles are NP-hard (their query was human annotation of a data point), and our weight function comes with no guarantee against increasing or decreasing too fast (PR curves, in contrast, are necessarily relatively stable at large enough ranks). These differences necessitate new proof techniques.

Preliminaries

We consider a weighted model with $n$ binary variables $x_1, \ldots, x_n$ where $x_i \in \{0, 1\}$. $x = (x_1, \ldots, x_n)^T$ is a vector which takes values from the space $\Sigma = \{0, 1\}^n$. Define a weight function $w : \Sigma \to \mathbb{R}^+$ that assigns a non-negative weight to each element $\sigma$ in $\Sigma$. The discrete integration problem is to compute the total weight $W = \sum_{\sigma \in \Sigma} w(\sigma)$.
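To make the problem concrete, here is a minimal brute-force sketch in Python (the toy weight function is ours, purely for illustration); its $2^n$-step loop is exactly the cost that WISH and AdaWISH are designed to avoid:

import itertools

def brute_force_integral(w, n):
    """W = sum over all sigma in {0,1}^n of w(sigma); visits all 2^n configurations."""
    return sum(w(sigma) for sigma in itertools.product([0, 1], repeat=n))

# Toy weight function (ours, for illustration): weight grows with the number of ones.
w = lambda sigma: 2.0 ** sum(sigma)
print(brute_force_integral(w, n=10))  # 3^10 = 59049 by the binomial theorem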

The discrete integration problem is an important but challenging problem in machine learning. A recently proposed method, Weight-Integral-And-Sum-By-Hashing (WISH) (Ermon et al. 2013b), provides a constant-approximation guarantee for this problem. A careful analysis of WISH reveals that the constant approximation guarantee is achieved via two main ingredients: (i) a logarithmic slicing schedule; (ii) counting via hashing and optimization.

Algorithm 1: XorQUERY(i, Σ, w, T)
1 for t = 1, ..., T do
2   sample a hash function $H^t_i : \{0,1\}^n \to \{0,1\}^i$ from a pairwise independent family, and a bucket label $d \in \{0,1\}^i$
3   $w^t_i \gets \max_\sigma w(\sigma)$ subject to $H^t_i(\sigma) = d$
4 $M \gets \mathrm{Median}(w^1_i, \ldots, w^T_i)$
5 return $M$ as the estimate $\hat{b}_i$

Logarithmic Slicing Schedule

We fix an ordering of all $\sigma \in \Sigma$ such that $w(\sigma_j) \ge w(\sigma_{j+1})$ holds for all $j$, $1 \le j \le 2^n - 1$, and let $b_i$ be the $2^i$-th largest element, i.e., $b_i = w(\sigma_{2^i})$. The first contribution of Ermon et al. (2013b) is a constant approximation scheme constructed from knowing only $n + 1$ points of the function $w$, namely $b_0, \ldots, b_n$.

Lemma 1 (Ermon et al. (2013b)). $LB = b_0 + \sum_{i=1}^{n} b_i (2^i - 2^{i-1})$ is a lower bound and a 2-approximation to $W$. In other words, $LB \le W \le 2LB$.

Counting via Hashing and Optimization

In order to estimate $W$, the next step is to estimate $b_i = w(\sigma_{2^i})$ for $i = 0, \ldots, n$. The work of Ermon et al. (2013b) translates this problem into optimization problems subject to randomized hashing (parity) constraints. The high-level idea is as follows. To compute $b_m$, the $2^m$-th largest item, consider $2^m$ buckets, each labeled with one vector from $\{0,1\}^m$. Suppose we hash all elements of $\Sigma$ uniformly at random into these $2^m$ buckets, and use an optimization oracle to compute the largest item in a given bucket. Because elements are hashed randomly, if we repeat this process multiple times and consistently find that the largest item in one bucket is larger than $w^*$, then we can conclude that there must be more than $2^m$ items larger than $w^*$, hence $b_m \ge w^*$. Following the same argument, if there are more than $2^m$ items larger than some $\hat{w} > w^*$, then the optimization oracle should return a value at least $\hat{w}$, which is larger than $w^*$. Combining these two points, if we repeat this experiment multiple times, the median of the largest items in the repeated bucket experiments should reflect the value of $b_m$. In practice, we form the buckets with pairwise independent hash functions:

Definition 1. A function family $\mathcal{H}_m = \{H_m : \{0,1\}^n \to \{0,1\}^m\}$ is pairwise independent if the following conditions hold when $H_m$ is chosen uniformly at random from $\mathcal{H}_m$: 1) $\forall x \in \{0,1\}^n$, the random variable $H_m(x)$ is uniformly distributed in $\{0,1\}^m$; 2) $\forall x_1, x_2 \in \{0,1\}^n$ with $x_1 \ne x_2$, the random variables $H_m(x_1)$ and $H_m(x_2)$ are independent.

For a configuration $\sigma \in \Sigma$, we say $\sigma$ is hashed to the bucket labeled $d \in \{0,1\}^m$ by function $H_m$ if $H_m(\sigma) = d$. In practice, pairwise independent hash functions are constructed with random parity functions. Let $A \in \{0,1\}^{m \times n}$ be a randomly sampled 0-1 matrix. One can prove that the function family $\{h_A(x) = Ax \bmod 2\}$ is pairwise independent. XorQUERY(i, Σ, w, T) (Algorithm 1) is the actual implementation of this high-level idea for computing the $2^i$-th largest item $b_i$. The formal mathematical result in (Ermon et al. 2013b) bounds the returned value $M$ of the algorithm between $b_{\min\{i+c,n\}}$ and $b_{\max\{i-c,0\}}$, a small range around $b_i$:

Lemma 2 (Ermon et al. (2013b)). Let $M$ be the value returned by XorQUERY(i, Σ, w, T). Then for any $c \ge 2$, there exists an $\alpha^*(c) > 0$ such that for $0 < \alpha < \alpha^*(c)$,

$\Pr\left[M \in [b_{\min\{i+c,n\}}, b_{\max\{i-c,0\}}]\right] \ge 1 - \exp(-\alpha T).$
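A minimal sketch of this construction, with a brute-force loop standing in for the MAP/ILP optimization oracle used in practice (function names are ours):

import itertools
import random
import statistics

def xor_query(w, n, i, T):
    """Estimate b_i, the 2^i-th largest weight, via hashing + optimization.

    The inner loop is a brute-force stand-in for the optimization oracle;
    real implementations solve max_sigma w(sigma) s.t. A.sigma = d (mod 2)
    with a MAP/ILP solver instead of enumerating Sigma.
    """
    maxima = []
    for _ in range(T):
        # random parity constraints: i rows of A, plus a random bucket label d
        A = [[random.randint(0, 1) for _ in range(n)] for _ in range(i)]
        d = [random.randint(0, 1) for _ in range(i)]
        best = 0.0  # empty buckets contribute 0 (w is non-negative)
        for sigma in itertools.product([0, 1], repeat=n):
            if all(sum(a * x for a, x in zip(row, sigma)) % 2 == dj
                   for row, dj in zip(A, d)):
                best = max(best, w(sigma))
        maxima.append(best)
    return statistics.median(maxima)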

Combining Lemma 2 with the logarithmic slicing schedule, the authors of (Ermon et al. 2013b) provide a constant approximation algorithm for the discrete integration problem with only a logarithmic number of repetitions of each optimization query:

Theorem 1 (Ermon et al. (2013b)). For $\delta > 0$, the WISH algorithm makes $\Theta(n \log n \log(1/\delta))$ MAP queries and, with probability at least $1 - \delta$, outputs a 16-approximation of $W$.

AdaWISH: Adaptive Discrete Integration

In WISH, we query all the $n + 1$ quantiles $b_0, b_1, \ldots, b_n$. In practice, the number of queries can be reduced by exploiting the shape of the $w$ function. For example, the two blue quantiles in Figure 1 (Bottom) are sandwiched between the left and right quantiles in black, which are close in value. From this observation, we do not need to query the two blue quantiles; we can instead use the values of the neighboring quantiles in their place.

Motivated by this example, we propose Adaptive-Weight-Integral-And-Sum-By-Hashing (AdaWISH), an algorithm which makes queries adaptively using divide-and-conquer. We analyze the performance of AdaWISH when it accesses quantile oracles in two different ways. In the first way, every oracle access returns a quantile with a point-wise bound, i.e., the multiplicative distance between the returned and exact values is within a constant factor. In the second way, when queried with one quantile, the oracle returns a value that is guaranteed to be bounded between neighboring quantiles with high probability. This oracle corresponds to the case of using hashing and randomization, i.e., using XorQUERY in Algorithm 1.

The detailed AdaWISH procedure is shown in Algorithm 2. The algorithm starts by estimating the quantiles $b_0, \ldots, b_n$ over the entire range (SEARCH(Σ, w, β, 0, n)) and recursively splits the range at its middle point (geometric mean) using divide-and-conquer (shown as SEARCH(Σ, w, β, l, m) and SEARCH(Σ, w, β, m, r)) until a stopping condition is met.

Algorithm 2: AdaWISH(Σ, w, β)
1 $n \gets \log_2 |\Sigma|$
2 $\hat{b}_0, \ldots, \hat{b}_n \gets$ SEARCH(Σ, w, β, 0, n)
3 $\hat{W} \gets \hat{b}_0 + \sum_{i=0}^{n-1} 2^i \hat{b}_i$
4 return $\hat{W}$

Algorithm 3: SEARCH(Σ, w, β, l, r)
1 if r == l + 1 then
2   // stopping condition 1 met
3   $\hat{b}_l \gets$ ApproxQUERY(l, Σ, w)
4   $\hat{b}_r \gets$ ApproxQUERY(r, Σ, w)
5 else
6   $\hat{b}_l \gets$ UpperBound(l, Σ, w)
7   $\hat{b}_r \gets$ LowerBound(r, Σ, w)
8   if $\hat{b}_l \le \beta \hat{b}_r$ then
9     // stopping condition 2 met
10    for $i \in \{l, \ldots, r-1\}$ do $\hat{b}_i \gets \hat{b}_r$
11  else
12    $m \gets \lfloor (r+l)/2 \rfloor$ // bisect the interval
13    $\hat{b}_l, \ldots, \hat{b}_m \gets$ SEARCH(Σ, w, β, l, m)
14    $\hat{b}_m, \ldots, \hat{b}_r \gets$ SEARCH(Σ, w, β, m, r)
15 return $\hat{b}_l, \ldots, \hat{b}_r$

In this algorithm, β > 1 is a user-defined trade-off parameter: the larger β is, the worse the approximation guarantee, but the fewer queries AdaWISH has to make. AdaWISH also depends on three query functions, ApproxQUERY, LowerBound, and UpperBound, which are implemented differently for different types of oracles; their implementations are discussed together with the specific oracles below. At a high level, ApproxQUERY(i, Σ, w) outputs a point estimate of the quantile $b_i$, while LowerBound(i, Σ, w) (resp. UpperBound(i, Σ, w)) outputs a lower (resp. upper) bound on $b_i$. Once SEARCH(Σ, w, β, l, r) has queried the left point $\hat{b}_l$ and the right point $\hat{b}_r$, there are two stopping conditions. The first is r = l + 1, in which case there is no point to query between the two quantiles. The second is that the values of $\hat{b}_l$ and $\hat{b}_r$ are close: $\hat{b}_l \le \beta \hat{b}_r$. This suggests that the $w$ function stays roughly flat between $b_l$ and $b_r$. If either condition is met, we stop and replace all the $b_i$ between $\hat{b}_l$ and $\hat{b}_r$ with the value of $\hat{b}_r$. Otherwise, we split the range $l \ldots r$ at the middle point $m = \lfloor (r+l)/2 \rfloor$ and recursively call SEARCH on the two sub-ranges. A Python sketch of this procedure is given below.
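The following sketch mirrors Algorithms 2 and 3 against an abstract oracle interface; approx_query, upper_bound, and lower_bound are placeholders for the oracle-specific implementations discussed next, and this is a schematic rendering rather than the authors' released code:

def adawish(n, beta, approx_query, upper_bound, lower_bound):
    """Adaptive quantile search (schematic version of Algorithms 2 and 3).

    approx_query(i) -> point estimate of b_i
    upper_bound(i)  -> upper bound on b_i
    lower_bound(i)  -> lower bound on b_i
    """
    b_hat = [None] * (n + 1)

    def search(l, r):
        if r == l + 1:                 # stopping condition 1: nothing in between
            b_hat[l] = approx_query(l)
            b_hat[r] = approx_query(r)
            return
        bl = upper_bound(l)
        br = lower_bound(r)
        if bl <= beta * br:            # stopping condition 2: w roughly flat on [l, r]
            for i in range(l, r + 1):
                b_hat[i] = br
        else:
            m = (l + r) // 2           # bisect the interval and recurse
            search(l, m)
            search(m, r)

    search(0, n)
    # W_hat = b_hat_0 + sum_{i=0}^{n-1} 2^i * b_hat_i  (Algorithm 2, line 3)
    return b_hat[0] + sum((2 ** i) * b_hat[i] for i in range(n))

With memoizing oracles (sketched below), endpoints shared between recursive calls are never queried twice, matching the lookup-table behavior described in the text.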

Case 1: Oracle With Point-Wise Bounds

First, we consider the case where we have a PointQuery oracle, which returns a quantile estimate with a point-wise bound. Mathematically, there exists γ > 1 such that for all i = 0, ..., n, we have $b_i/\gamma \le \tilde{b}_i \le \gamma b_i$, where $\tilde{b}_i$ denotes the returned value of PointQuery(i, Σ, w). In this case, we can design AdaWISH so that it obtains a $2\beta\gamma^2$-approximation of $W$. In our implementation, ApproxQUERY(i, Σ, w) returns exactly PointQuery(i, Σ, w), UpperBound(i, Σ, w) returns γ·PointQuery(i, Σ, w), and LowerBound(i, Σ, w) returns PointQuery(i, Σ, w)/γ. We implement a lookup table within PointQuery: if PointQuery(i, Σ, w) is called twice with the same i, the second call directly returns the result computed the first time. This interface can be realized as in the sketch below.
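Concretely, under the point-wise oracle the three query functions can be thin wrappers around a memoized PointQuery, e.g. (a sketch with our own names):

from functools import lru_cache

def make_pointwise_oracle(point_query, gamma):
    """Wrap a point-wise oracle satisfying b_i/gamma <= point_query(i) <= gamma*b_i
    into the (approx_query, upper_bound, lower_bound) interface."""
    cached = lru_cache(maxsize=None)(point_query)  # lookup table: never query i twice
    approx_query = lambda i: cached(i)
    upper_bound = lambda i: gamma * cached(i)      # certified upper bound on b_i
    lower_bound = lambda i: cached(i) / gamma      # certified lower bound on b_i
    return approx_query, upper_bound, lower_bound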

Theorem 2. Let w, Σ, γ be as defined earlier. For any $\kappa > 2\gamma^2$, the output of AdaWISH (Algorithm 2) on input $(\Sigma, w, \kappa/(2\gamma^2))$, assuming oracles with point-wise bounds, is a κ-approximation of $W$.

(Proof sketch) We already know from Lemma 1 that $LB = b_0 + \sum_{i=0}^{n-1} b_{i+1}(2^{i+1} - 2^i)$ is a lower bound and $UB = b_0 + \sum_{i=0}^{n-1} b_i(2^{i+1} - 2^i)$ an upper bound of $W$, satisfying $LB \le W \le UB \le 2LB$ if we query all $b_i$. That is to say, $LB$ is a 2-approximation of $W$. Since here we do not query all quantiles, we instead construct a relaxed lower bound $LB'$ and upper bound $UB'$ of $W$ by replacing each $b_i$ in $LB$ and $UB$ with other values. For each $b_i$ returned from ApproxQUERY, we use its returned value divided by γ (resp. multiplied by γ) as its replacement in $LB'$ (resp. $UB'$); in this case, the replacements of $b_i$ in $LB'$ and $UB'$ differ by a factor of $\gamma^2$, which is less than $\kappa/2$ because $\kappa/2 > \gamma^2$ (condition of Theorem 2). Otherwise, $b_i$ is contained in an interval satisfying stopping condition 2. We use the output of LowerBound on $b_i$'s immediate right-hand side as its replacement in forming $LB'$, and the output of UpperBound on $b_i$'s immediate left-hand side as its replacement in forming $UB'$; in this case, the replacements of $b_i$ in $LB'$ and $UB'$ differ by a factor of $\beta\gamma^2$ (stopping condition 2), which is $\beta\gamma^2 = \frac{\kappa}{2\gamma^2}\gamma^2 = \frac{\kappa}{2}$. Originally, $LB \le UB \le 2LB$ (Lemma 1). The replacement of each quantile $b_i$ in $LB$ and $UB$ broadens the gap by another factor of at most $\kappa/2$. Therefore, $LB' \le UB' \le \frac{\kappa}{2} \cdot 2 \cdot LB' = \kappa LB'$. Since $LB'$ and $UB'$ are still valid bounds, we have $LB' \le W \le UB' \le \kappa LB'$. This concludes the proof.

To understand AdaWISH in terms of the number of calls to PointQuery, we analyze the upper bound, the regret bound, and the asymptotic lower bound respectively.

Theorem 3. (Upper bound) Under the conditions of Theorem 2, the number of PointQuery calls is at most n + 1, and the query points form a subset of those of WISH.

To see Theorem 3, note that in the worst case AdaWISH has to query all of $b_0, \ldots, b_n$, which is exactly the set queried by WISH. The lookup table implemented in the oracle guarantees that the same query is never computed twice. In practice, AdaWISH can save many queries. As a specific example:

Observation 1. If the function w(σ) takes only a fixed set of k different values, then AdaWISH makes only O(log n) accesses to PointQuery.

Intuitively, since w(σ) takes only a fixed set of k values, AdaWISH only needs to locate the k − 1 quantiles at which the function value changes from one value to another. AdaWISH uses binary search, so it requires O(log n) queries to locate one such point; since k is a constant, the total number of queries to determine the entire w function is O(log n). The logarithmic bound coincides with WeightMC (Chakraborty et al. 2014b), which gives a similar guarantee with O(log tilt) queries, where tilt is the ratio of the maximum weight of an assignment to the minimum weight of an assignment. Although both AdaWISH and WeightMC apply a logarithmic slicing scheme to the weight function, WeightMC (Chakraborty et al. 2014b) slices in the value domain, while AdaWISH slices in the quantile space. The difference is similar to the discussion of vertical and horizontal slices in (Ermon et al. 2013b). As a toy illustration using the sketches above, see the snippet below.
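For intuition, one can plug a two-valued quantile function into the sketches above and count oracle calls; the following toy check (our code, with an arbitrary jump position) touches only on the order of log n of the 1001 quantiles:

calls = 0

def point_query(i):
    """Hypothetical two-valued quantile function with a single jump at i = 137."""
    global calls
    calls += 1
    return 100.0 if i < 137 else 1.0

aq, ub, lb = make_pointwise_oracle(point_query, gamma=1.0)  # exact oracle for the toy check
adawish(n=1000, beta=2.0, approx_query=aq, upper_bound=ub, lower_bound=lb)
print(calls)  # on the order of log n: far fewer than the 1001 queries WISH would issue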

Regret bound. We consider an "optimal" algorithm, which is guaranteed to produce a κ-approximation while issuing the fewest accesses to an ExactQuery function, which returns the exact value at the queried point. We allow the optimal algorithm to know the shape of the $w$ function a priori, so it can place its ExactQuery calls very smartly. Let $B = \{b_0, b_{i_1}, \ldots, b_{i_k}, b_n\}$ be the set of points at which the optimal algorithm issues ExactQuery. We can bound the sum $W$ between the upper bound $UB = b_0 + b_0(2^{i_1} - 2^0) + \sum_{l=1}^{k-1} b_{i_l}(2^{i_{l+1}} - 2^{i_l}) + b_{i_k}(2^n - 2^{i_k})$ and the lower bound $LB = b_0 + b_{i_1}(2^{i_1} - 2^0) + \sum_{l=2}^{k} b_{i_l}(2^{i_l} - 2^{i_{l-1}}) + b_n(2^n - 2^{i_k})$.

We require the optimal algorithm to obtain a κ-approximation, i.e., $LB \le UB \le \kappa LB$ must hold. Formally, the optimal algorithm is the one that minimizes the size of the set $B$ while enforcing $LB \le UB \le \kappa LB$. We call the number of ExactQuery accesses of this optimal algorithm, i.e., the size of $B$, OPT. We compare the number of query accesses of AdaWISH against OPT and show that the difference is within a multiplicative $O(\log n)$ factor:

Theorem 4. Suppose $\kappa = 2\beta\gamma^2$. Then AdaWISH (Algorithm 2) on input (Σ, w, β), assuming oracles with point-wise bounds, calls PointQuery no more than $(OPT - 1)(2 + \log_2 n) + 1$ times.

This says that the number of query calls made by AdaWISH is roughly $O(OPT \cdot \log_2 n)$. The high-level idea of the proof of Theorem 4 is as follows. Suppose $q_1, \ldots, q_{OPT}$ are the actual query points of the optimal algorithm. AdaWISH uses a binary search, i.e., it always splits an interval at (roughly) its geometric middle point, so it takes AdaWISH roughly $O(\log_2 n)$ splits to "locate" one query point $q_i$ of the optimal algorithm (more precisely, to find a point sufficiently close to $q_i$ to guarantee the approximation bound). Hence, the total number of queries of AdaWISH is bounded by OPT times $\log_2 n$. Our full proof of Theorem 4 walks through the recursion tree of the function SEARCH, where each node represents an interval on which SEARCH was called. We leave this proof to Supplementary Material A.2.

Asymptotic lower bound. What is the minimum number of calls to the oracle needed to guarantee a constant-factor approximation to $W$, say a κ-approximation? In this section, assuming ExactQuery(i) returns the exact value of $b_i$, we prove that at least $(1 - \epsilon)\frac{n}{\kappa^2}$ accesses to ExactQuery are needed to obtain a κ-approximation in the worst case, even for the optimal algorithm, which knows the shape of the $w$ function a priori (Corollary 1). This confirms the asymptotic optimality of AdaWISH.

Our proof sketch is as follows, by contradiction. Suppose an algorithm A queries fewer than $(1 - \epsilon)\frac{n}{\kappa^2}$ times and obtains a κ-approximation. Then we can construct two functions $w_1(\sigma)$ and $w_2(\sigma)$ such that (i) $w_1$ and $w_2$ match on all the queried points of algorithm A, but (ii) the sums $W_1 = \sum_\sigma w_1(\sigma)$ and $W_2 = \sum_\sigma w_2(\sigma)$ differ by more than a multiplicative factor of $\kappa^2$ (see Theorem 5). This implies that algorithm A cannot give a κ-approximation of both, which contradicts the assumption.

Theorem 5. For $\kappa \ge 1$ and $0 < \epsilon < 1$, given any $(1 - \epsilon)\frac{n}{\kappa^2}$ queried points on $w$, there always exist two functions $w_1(\sigma)$ and $w_2(\sigma)$ which match on the queried points, but whose corresponding sums $W_1$ and $W_2$ satisfy $W_2 > \kappa^2 W_1$.

The proof of Theorem 5 is a careful mathematical construction and is left to Section A.3 in the supplementary material. With Theorem 5, we can show:

Corollary 1. Let A be any algorithm that accesses $w(\sigma)$ via the ExactQuery oracle. For any $\kappa \ge 1$ and $0 < \epsilon < 1$, A cannot guarantee a κ-approximation of $W = \sum_\sigma w(\sigma)$ if A issues only $(1 - \epsilon)\frac{n}{\kappa^2}$ ExactQuery calls.

Case 2: Oracle With Neighboring Bounds

We now consider the case of a NeighborQuery oracle. Given $c \ge 2$ and $\delta > 0$, for all i, NeighborQuery(i, Σ, w) returns an estimate that is guaranteed to lie in the range $[b_{\min\{i+c,n\}}, b_{\max\{i-c,0\}}]$ with probability at least $1 - \delta$. Notice that we can use hashing and optimization to build NeighborQuery oracles: according to Lemma 2, if we set $T = \frac{\ln(1/\delta)}{\alpha(c)}\ln n$, then the output of XorQUERY(i, Σ, w, T) (Algorithm 1) satisfies the conditions of a NeighborQuery oracle. We also implement a lookup table in this case, so the same query is never computed more than once.

We implement the AdaWISH algorithm as follows: for all i, ApproxQUERY(i, Σ, w) returns exactly NeighborQuery(i, Σ, w); LowerBound(i, Σ, w) returns the value of NeighborQuery(min{i + c, n}, Σ, w), which with high probability is a lower bound for $b_i$; and UpperBound(i, Σ, w) returns NeighborQuery(max{i − c, 0}, Σ, w), which with high probability is an upper bound for $b_i$. A sketch of these wrappers is given below.
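Under the neighboring-bound oracle, the same interface can be instantiated as follows (a sketch; neighbor_query stands in for XorQUERY with the T above):

from functools import lru_cache

def make_neighbor_oracle(neighbor_query, n, c):
    """Wrap an oracle whose output lies in [b_{min(i+c,n)}, b_{max(i-c,0)}] w.h.p.
    into the (approx_query, upper_bound, lower_bound) interface."""
    cached = lru_cache(maxsize=None)(neighbor_query)  # lookup table, as in the text
    approx_query = lambda i: cached(i)
    upper_bound = lambda i: cached(max(i - c, 0))     # w.h.p. an upper bound on b_i
    lower_bound = lambda i: cached(min(i + c, n))     # w.h.p. a lower bound on b_i
    return approx_query, upper_bound, lower_bound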

We can prove that the AdaWISH algorithm implemented in this way also gives a constant-factor approximation:

Theorem 6. Let w, Σ, δ, and $c \ge 2$ be as defined earlier. For any $\kappa > 2^{2c}$, the output of AdaWISH (Algorithm 2) on input $(\Sigma, w, \kappa/2^{2c})$, assuming oracles with neighboring bounds, is a κ-approximation of $W$ with probability at least $1 - \delta$.

Proof. The approximation $\hat{W}$ given by AdaWISH is $\hat{W} = \hat{b}_0 + \sum_{i=0}^{n-1} \hat{b}_i 2^i$. Define $L' = b_0 + \sum_{i=0}^{n-1} b_{\min\{i+c,n\}} 2^i$ and $U' = b_0 + \sum_{i=0}^{n-1} b_{\max\{i-c,0\}} 2^i$. From Lemma 2 in (Ermon et al. 2013b), we have $L' \le W \le U'$ (1) and $U' \le 2^{2c} L'$ (2). We are going to prove that $\hat{W}$ satisfies $L'/\beta \le \hat{W} \le \beta U'$ (3) with high probability. Combining (2) and (3), we have $L'/\beta \le \hat{W} \le \beta 2^{2c} L'$ (4). Combining (1), (2), and (4), we have $W \le U' \le 2^{2c} L' \le 2^{2c} \beta \hat{W}$; in short, $W \le 2^{2c} \beta \hat{W}$ (5). Combining (1) and (4), we have $\hat{W} \le \beta 2^{2c} L' \le \beta 2^{2c} W$ (6). Together, (5) and (6) imply that $\hat{W}$ is a $\beta 2^{2c}$-approximation. Under the condition of this theorem, β is set to $\kappa/2^{2c}$; therefore, the overall approximation factor is κ.

We are left to prove (3): $L'/\beta \le \hat{W} \le \beta U'$. Comparing the corresponding terms of $L'$, $\hat{W}$, and $U'$, it is sufficient to prove that for all i, the following Inequality (7) holds:

$b_{\min\{i+c,n\}}/\beta \le \hat{b}_i \le \beta\, b_{\max\{i-c,0\}}. \quad (7)$

Consider the two stopping conditions of AdaWISH. If stopping condition 1 is met, $\hat{b}_l$ and $\hat{b}_r$ come from ApproxQUERY, hence (7) holds because of Lemma 2. If stopping condition 2 is met for a range from l to r, we have $\hat{b}_l \le \beta \hat{b}_r$ (8). Here, $\hat{b}_l$ is the result of XorQUERY(max{l − c, 0}, Σ, w, T); according to Lemma 2, $\hat{b}_l \ge b_l$ (9) with high probability. For the same reason, $\hat{b}_r \le b_r$ (10) with high probability. The estimates $\hat{b}_i$ in between are replaced with $\hat{b}_r$. Hence, $\hat{b}_i = \hat{b}_r \le b_r \le b_{\max\{i-c,0\}} \le \beta\, b_{\max\{i-c,0\}}$ (11), where the first inequality is due to (10) and the second is due to the monotonicity of the quantiles. Similarly, we have $\hat{b}_i = \hat{b}_r \ge \hat{b}_l/\beta \ge b_l/\beta \ge b_{\min\{i+c,n\}}/\beta$ (12), where the first inequality is due to the stopping condition. Together, (11) and (12) give (7).

AdaWISH also satisfies the upper bound, the regret bound, and the asymptotic lower bound assuming oracles with neighboring bounds. In terms of the upper bound, it is worth noting that we use a lookup table to save all previously queried points; therefore, our algorithm queries no more points than WISH to obtain the same approximation guarantee. For the regret bound, the optimal algorithm is defined as in the previous section, and we can again prove a regret bound of $O(\log n)$ relative to OPT, with a slight change in the constant. In terms of the asymptotic lower bound, Corollary 1 states that even the optimal algorithm, which has access to ExactQuery, has to query $\Omega(n)$ times in the worst case to guarantee a constant approximation. AdaWISH in this case does not even get the exact value when querying a quantile; therefore, at least $\Omega(n)$ queries are needed.

An interesting point is that the parameter T in XorQUERY can also be reduced adaptively. In WISH, T scales with log n because of the need for a union bound on the estimation error introduced in each of the n oracle queries. If AdaWISH makes only $k \le n$ queries, then T only needs to scale as log k, thus reducing the number of repetitions.
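To make the union-bound arithmetic explicit (our back-of-the-envelope calculation, not a claim from the paper): if each XorQUERY fails with probability at most $\exp(-\alpha T)$ and all $k$ issued queries must succeed simultaneously with probability $1 - \delta$, it suffices that $\exp(-\alpha T) \le \delta/k$, i.e., $T \ge \frac{\ln(k/\delta)}{\alpha(c)} = \frac{\ln k + \ln(1/\delta)}{\alpha(c)}$, so replacing $n$ by $k \ll n$ directly shrinks the required number of repetitions.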

Experiments

We implemented AdaWISH using the IBM ILOG CPLEX Optimizer 12.9 for MAP queries. We adopted the implementation of WISH (Ermon et al. 2013b) augmented with randomized low-density parity constraints (Ermon et al. 2014). In our experiments, each MAP query is carried out using CPLEX on a single core. The whole experiment was carried out on a community cluster, where each node has 24 cores and 96GB of memory. For comparison, we consider Junction Tree (JT), which provides exact inference results as ground truth; Tree-Reweighted Belief Propagation (TRWBP) (Wainwright, Jaakkola, and Willsky 2003), which provides an upper bound on the partition function; Mean Field (MF) (Wainwright and Jordan 2008), which provides a lower bound; Loopy Belief Propagation (BP) (Murphy, Weiss, and Jordan 1999), which has no guarantees; and Double-loop GBP (HAK) (Heskes, Albers, and Kappen 2003), a winning solver in the UAI inference challenge. We use the implementations of these algorithms available in LibDAI (Mooij 2010). We also compare with Dynamic Importance Sampling (DIS) (Lou, Dechter, and Ihler 2017; Lou, Dechter, and Ihler 2019), a recently proposed strong baseline. In addition, we compare with Weighted Model Counting (WMC) (Chakraborty et al. 2014b) on weighted SAT instances. This algorithm depends on one parameter, tilt, the ratio between the maximum and minimum satisfying weights in a CNF. Since a straightforward extension of tilt beyond weighted CNFs leads to very large tilt values ($10^{25}$ to $10^{150}$) on UAI and Ising model instances, we did not compare with WMC on these benchmarks. The inference problem we consider is computing the partition function.

In summary, as shown in Figure 2, AdaWISH gives very precise estimates of the partition function, almost overlapping with the WISH algorithm and better than competing approaches, while saving a large fraction of the queries. Details of the experimental setup and additional results are deferred to the supplementary material.

Figure 2: (Top) The errors in log partition function estimation for various methods on different benchmarks. (Bottom) The number of queries needed by WISH and AdaWISH. Both WISH and AdaWISH provide good-quality estimates. They are competitive with the best benchmark methods on various tasks, and much better than traditional methods such as MF, BP, and TRWBP. AdaWISH needs far fewer queries than WISH (saving 60% of WISH queries on mixed Ising models in the left column, a median of 81.5% on UAI inference benchmarks in the middle column, and on average 62% on 3-SAT instances in the right column).

Ising Models

We first report results on 10 × 10 grid mixed Ising models in Figures 2(a) and 2(d), where the coupling strengths can have mixed signs. We repeat our experiments 10 times for each coupling strength and report the median error in the log10 partition function estimate. We insert structure into the Ising grids by introducing a rectangle of strong interactions; the coupling strength inside the rectangle is amplified by 10.

From Figure 2(a), we can see that AdaWISH and WISH are competitive with the state-of-the-art method DIS and outperform the other methods. DIS works well on Ising models and SAT instances, but not on UAI instances. When the coupling strength is larger than 2.5, WISH and AdaWISH give near-optimal estimates, with an error of roughly 0.8 in the log10 partition function, while the next best approach, BP, has a log10 error of 6. Meanwhile, in Figure 2(d), AdaWISH reduces the number of queries by approximately 60% relative to WISH, while giving estimates of the same or even better quality. We also conducted experiments on clique Ising models as well as grid Ising models with attractive coupling strengths, which are deferred to the supplementary material.

UAI Inference

We also test AdaWISH on open datasets released by the UAI Approximate Inference Challenge.

As observed, AdaWISH and WISH give very close estimates of the partition function on all instances, outperforming the other methods in accuracy. A few points are missing for several methods because their errors are too big or they failed to complete within the given time. Meanwhile, as shown in Figure 2(e), the number of queries of AdaWISH is stable at around 400 across benchmarks, saving a median of 81.5% of queries compared to WISH.

3-SAT Instances

We also consider unweighted 3-SAT instances. Here, AdaWISH is guaranteed to make O(log n) queries due to Observation 1. We generate random 3-SAT instances with the number of variables varying from 5 to 70 in steps of 5 and the clause-to-variable ratio fixed at 1.
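The instance generator is straightforward; a sketch (our code) follows:

import random

def random_3sat(n_vars, ratio=1.0, seed=0):
    """Random 3-SAT: each clause samples 3 distinct variables, each negated w.p. 1/2."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(int(ratio * n_vars)):
        vs = rng.sample(range(1, n_vars + 1), 3)
        clauses.append([v if rng.random() < 0.5 else -v for v in vs])
    return clauses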

We show in Figure 2(c) that WISH and AdaWISH give among the best estimates, within an error of roughly 0.2 in the log partition function, and are competitive with the state-of-the-art methods. Mean Field does not provide good estimates, and TRWBP is absent because it ran out of time. In addition, Figure 2(f) demonstrates that the number of queries of AdaWISH grows much more slowly than the problem size, while the number of queries of WISH grows proportionally to the number of variables; on average this amounts to a 62% cut in queries. We use the default setting of WMC as mentioned in (Chakraborty et al. 2014b). For 30-variable SAT instances, WMC issues more than 120,000 accesses to NP oracles, far more than WISH and AdaWISH. Nevertheless, we point out that their NP queries are cheaper than the MAP queries of WISH and AdaWISH.

Conclusion

We introduced AdaWISH, an algorithm based on hashing and optimization that provides a constant-factor approximation to the discrete integration problem. AdaWISH significantly reduces the number of optimization queries needed by WISH via an adaptive binary search, while continuing to provide equally precise estimates for discrete integration. The number of queries made by AdaWISH is, in fact, provably no more than a logarithmic factor worse than that of an oracle algorithm that optimally decides which points to query. Empirically, AdaWISH uses 47%-60% fewer queries on Ising models, and saves a median of 81.5% of the queries on benchmarks from the UAI inference challenge.

AdaWISH: Faster Discrete Integration via Adaptive Quantiles

A. Proofs

A.1 Proof of Lemma 1

Proof. (Lemma 1) We have $b_i \ge w(\sigma_k) \ge b_{i+1}$ for all $k \in \{2^i, \ldots, 2^{i+1}\}$. Therefore, the lower bound is obtained by replacing $w(\sigma_k)$ with $b_{i+1}$ in the summation, while the upper bound is obtained by replacing $w(\sigma_k)$ with $b_i$. Then we get the lower bound $LB = b_0 + \sum_{i=1}^{n} b_i(2^i - 2^{i-1})$ and the upper bound $UB = b_0 + \sum_{i=1}^{n} b_{i-1}(2^i - 2^{i-1})$. Obviously, $LB \le W \le UB$, and we also have

$UB = b_0 + \sum_{i=1}^{n} b_{i-1}(2^i - 2^{i-1}) = b_0 + b_0 + \sum_{i=1}^{n-1} 2 b_i (2^i - 2^{i-1}) \le 2b_0 + \sum_{i=1}^{n-1} 2 b_i (2^i - 2^{i-1}) \le 2\left(b_0 + \sum_{i=1}^{n-1} b_i (2^i - 2^{i-1})\right) \le 2LB. \quad (1)$

This finishes the proof.

A.2 Proof of Theorem 4

Proof. (Theorem 4) Suppose the optimal algorithm calls QUERY at points $q_1, \ldots, q_{OPT}$ as above. These points split the entire range $[\sigma_{2^0}, \sigma_{2^n}]$ into OPT − 1 segments; we call a segment of this type an opt-segment. We also define tuples $\langle \sigma_{2^l}, \sigma_{2^r}, d, \mathrm{SMALL/BIG} \rangle$, where $(\sigma_{2^l}, \sigma_{2^r})$ is an interval on which the function SEARCH was called during the execution of AdaWISH, d is the depth of this recursive call, and the fourth entry is either SMALL or BIG: it is BIG if and only if the interval $(\sigma_{2^l}, \sigma_{2^r})$ covers at least one complete opt-segment. The log distance of a tuple is measured as r − l. Two tuples are treated as cousins if they are called by a single SEARCH invocation as its two children calls (i.e., one is $(\sigma_{2^l}, \sigma_{2^m})$ while the other is $(\sigma_{2^m}, \sigma_{2^r})$). We can merge two cousin tuples; the result of merging is the tuple representing the parent call that invoked the functions represented by these two cousin tuples. Notice that the log distance of the merged tuple lies between twice that of the first cousin tuple and twice that of the second.

We start with the set Φ consisting of the tuples $\langle \sigma_{2^l}, \sigma_{2^r}, d, \mathrm{SMALL/BIG} \rangle$ such that $(\sigma_{2^l}, \sigma_{2^r})$ reached a stopping condition of SEARCH during the execution of the algorithm. We repeat the following merge operation on tuples in Φ until no tuple in Φ is tagged SMALL:

1. Choose a tuple from Φ with the largest depth d among those tagged SMALL.
2. Merge this tuple with its cousin tuple.
3. Add the merged tuple back into Φ.

We refer to the set Φ obtained after merging all SMALL tuples as $\Phi_E$. First, all tuples in $\Phi_E$ are tagged BIG, so each of them contains an opt-segment. Second, since the tuples in $\Phi_E$ still come from SEARCH calls, they are non-overlapping. Based on these two observations, the number of tuples in $\Phi_E$ is bounded by the number of opt-segments, which is OPT − 1. Notice also that the number of merge operations each tuple in $\Phi_E$ has gone through is bounded by $O(\log_2 n)$: the largest possible log distance is no more than $\log_2 2^n = n$ (the entire range), and each merge operation nearly doubles the log distance. Accounting for intervals whose log distance is odd, at most $\log_2 n + 1$ merge operations are performed per tuple.

Now we count the number of QUERY calls that AdaWISH makes. It is easy to see that

$\#QUERY = |\Phi_E| + \#MERGE + 1, \quad (2)$

where #MERGE is the total number of merge operations performed. This number is bounded by

$|\Phi_E| + |\Phi_E| \cdot \max\{\#MERGE \text{ for one tuple in } \Phi_E\} + 1. \quad (3)$

Combined with the facts that $|\Phi_E| \le OPT - 1$ and that #MERGE for one tuple in $\Phi_E$ is at most $\log_2 n + 1$, we obtain the claimed bound.

A.3 Proof of Theorem 5

Proof. (Theorem 5) Our proof works by constructing two curves passing through the shared $(1-\epsilon)\frac{n}{\kappa^2}$ points such that the discrete integrals $W_1$ and $W_2$ of the curves $w_1(\sigma)$ and $w_2(\sigma)$ are more than $\kappa^2$ apart. To this end, $w_1(\sigma)$ and $w_2(\sigma)$ are constructed such that $w_1(\sigma_j) = b_r$ and $w_2(\sigma_j) = b_l$ for all $j \in \{2^l, \ldots, 2^r\}$, where $b_l$ and $b_r$ are neighboring queried points among the $(1-\epsilon)\frac{n}{\kappa^2}$ points. If we were to query all $n + 1$ points, $W_1$ would be $b_0 + \sum_{i=0}^{n-1} b_{i+1}(2^{i+1} - 2^i)$ and $W_2$ would be $b_0 + \sum_{i=0}^{n-1} b_i(2^{i+1} - 2^i)$.

First consider $\epsilon = 0$. By carefully constructing a special series of quantile values, namely $b_i = 1/2^i$, we can make each interval equally important. We should therefore keep the $\frac{n}{\kappa^2}$ queried points as scattered as possible, i.e., query every $\kappa^2$-th point, to make $W_1$ as large and $W_2$ as small as possible, leading to the least possible ratio $\frac{W_2}{W_1}$. We claim that even in this situation the ratio $\frac{W_2}{W_1}$ still reaches $\kappa^2$. First, we can obtain

$W_1 = 1 + \sum_{i=0}^{n/\kappa^2} \left((\kappa^2)^{i+1} - (\kappa^2)^i\right) \frac{1}{(\kappa^2)^{i+1}} = 1 + \frac{n}{\kappa^2}\left(1 - \frac{1}{\kappa^2}\right).$

Meanwhile, $W_2$ can be written as

$W_2 = 1 + \sum_{i=0}^{n/\kappa^2} \left((\kappa^2)^{i+1} - (\kappa^2)^i\right) \frac{1}{(\kappa^2)^i} = 1 + \frac{n}{\kappa^2}\left(\kappa^2 - 1\right).$

Therefore, we can calculate the limit of $\frac{W_2}{W_1}$ as $n$ tends to infinity:

$\lim_{n \to +\infty} \frac{W_2}{W_1} = \kappa^2.$

Since $\epsilon$ is a fixed constant, if we do not omit $\epsilon$, there are fewer queried points, so $W_1$ decreases by $O(n)$ and $W_2$ increases by $O(n)$, leading to

$\frac{W_2}{W_1} > \kappa^2.$

Therefore, $W_2 > \kappa^2 W_1$ holds if we only have $(1-\epsilon)\frac{n}{\kappa^2}$ accesses to QUERY. This finishes the proof.

A.4 Proof of Corollary 1

Proof. (Corollary 1) Suppose we have an algorithm A which makes only $(1-\epsilon)\frac{n}{\kappa^2}$ accesses to ExactQuery and correspondingly outputs $\hat{W}$. Given these queried points, Theorem 5 tells us that there exist two functions $w_1(\sigma)$ and $w_2(\sigma)$ matching on the queried points whose discrete integrals $W_1$ and $W_2$ satisfy $W_2 > \kappa^2 W_1$. The output $\hat{W}$ of algorithm A is therefore the estimate of both $W_1$ and $W_2$. Assuming $\hat{W}$ is a κ-approximation of both $W_1$ and $W_2$, we have

$\frac{\hat{W}}{\kappa} \le W_1 \le \kappa \hat{W} \quad \text{and} \quad \frac{\hat{W}}{\kappa} \le W_2 \le \kappa \hat{W}.$

Then we can directly infer that $W_2 \le \kappa^2 W_1$. However, Theorem 5 tells us that $W_2 > \kappa^2 W_1$, which is a contradiction.

B. Experiment Setup

In practice, we set a global timeout of 4 hours for AdaWISH and the baseline methods (Belief Propagation, Tree-Reweighted Belief Propagation, HAK, Dynamic Importance Sampling, and WeightMC) to simulate a real application with limited computational resources. Additionally, a local timeout is set for each of AdaWISH's MAP queries to avoid getting blocked by ill-behaved queries. Note that, by design, each ApproxQUERY issued by AdaWISH consists of a batch of MAP queries that can be executed in parallel; moreover, ApproxQUERY calls at the same binary search depth can be executed simultaneously. To further assess the quality of AdaWISH's estimates, we run WISH with only local timeouts, whose values are shared with AdaWISH, and compare the performance of WISH and AdaWISH; those WISH results indicate the best performance AdaWISH could attain with only a local timeout. Empirically, AdaWISH completes within the 4-hour global timeout on all benchmarks except the clique Ising models. We can therefore solidly evaluate the estimates AdaWISH provides with some queries eliminated, especially in comparison to WISH.

B.1 Grid Ising Models

For 10 × 10 binary grid Ising models, we adopt an experimental setup similar to that of WISH (Ermon et al. 2013b). We consider the weight of a configuration $\sigma = (x_1 x_2 \ldots x_n) \in \{-1, 1\}^n$ to be $w(\sigma) = \prod_i \psi_i(x_i) \prod_{ij \in E} \psi_{ij}(x_i x_j)$, where the local field is $\psi_i(x_i) = \exp(f_i x_i)$, with $f_i$ uniformly sampled from $[-f, f]$ and $f = 0.1$, and the interaction is $\psi_{ij}(x_i x_j) = \exp(w_{ij} x_i x_j)$.

For the attractive case, we draw $w_{ij}$ from $[0, w]$; for the mixed case, from $[-w, w]$, where $w$ is known as the coupling strength. We insert structure into the grids by introducing a rectangle of strong interactions, inside which the coupling strength is amplified by 10. A local timeout of 15 minutes is set for this task, so not all instances are solved to optimality. Let T, β, and c be as defined before. WISH and AdaWISH both use c = 5 and T = 10, and AdaWISH uses the trade-off parameter β = 100. A sketch of this generative setup is shown below.
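A sketch of the generative setup (our NumPy rendering of the formulas above; the amplified rectangle is omitted for brevity):

import numpy as np

def mixed_grid_ising(size=10, f=0.1, w=1.0, seed=0):
    """Sample local fields f_i ~ U[-f, f] and mixed couplings w_ij ~ U[-w, w]
    on the edges of a size x size grid."""
    rng = np.random.default_rng(seed)
    fields = rng.uniform(-f, f, size=(size, size))
    couplings = {}
    for i in range(size):
        for j in range(size):
            if i + 1 < size:
                couplings[((i, j), (i + 1, j))] = rng.uniform(-w, w)
            if j + 1 < size:
                couplings[((i, j), (i, j + 1))] = rng.uniform(-w, w)
    return fields, couplings

def weight(sigma, fields, couplings):
    """w(sigma) = prod_i exp(f_i x_i) * prod_{ij in E} exp(w_ij x_i x_j),
    where sigma is a (size, size) array with entries in {-1, +1}."""
    log_w = float(np.sum(fields * sigma))
    log_w += sum(wij * sigma[u] * sigma[v] for (u, v), wij in couplings.items())
    return float(np.exp(log_w))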

B.2 Clique Ising Models

We consider random clique-structured Ising models with n binary variables $x_i \in \{0, 1\}$ for $i \in \{0, 1, \ldots, n-1\}$. The interaction between each pair of variables $x_i$ and $x_j$ is defined as $\psi_{ij} = \exp(-w_{ij})$ when $x_i = x_j$, and $\psi_{ij} = 1$ otherwise, where $w_{ij}$ is uniformly sampled from $[0, w\,|i - j|]$. The coupling strength $w$, controlling the intensity of the interactions, is 0.1 in this experiment.

Additional structure is injected by introducing two closed chains of strong repulsive interactions of length 0.3n. The strengths of these interactions are uniformly sampled from $[0, \gamma w]$, where $\gamma = 10 \times 2[\mathrm{sigmoid}(x) - 0.5]$ and $x \sim U(0, 1)$. Note that clique Ising models have treewidth n, so those with more than 31 variables cannot be solved exactly. Let β, c, and δ be as defined before. WISH and AdaWISH both use c = 5 and δ = 0.01, and AdaWISH uses the trade-off parameter β = 1000. In this benchmark, we let T strictly follow its theoretical value and run AdaWISH/WISH to optimality so as to obtain results with strict theoretical guarantees. All the probabilistic inference methods presented run without local or global timeouts and execute to completion.

B.3 UAI Inference

We also consider testing AdaWISH on open datasets released by the UAI Approximate Inference Challenge. Specifically, we collect .uai instances with various labels from the UAI 2011 and 2014 inference competitions. Table 1 shows all the .uai files we test on and their abbreviations. For AdaWISH and WISH, the local timeout for every MAP query is set to 30 minutes.

Table 1: Abbreviations of .uai files from the UAI 2011 and 2014 inference competitions.

B.4 3-SAT Problems

We generate random 3-SAT instances in two distinct settings. In the first, we use a fixed clause-to-variable ratio of 1, with the number of variables varying from 5 to 70 in steps of 5; in the second, we consider 3-SAT instances with a fixed number of clauses and a varying number of variables. A local timeout of 30 minutes is set for both WISH and AdaWISH.

As for ground truth, JT cannot complete within the timeout for SAT instances with more than 35 variables, so we adopt ACE, a publicly available exact inference tool, as mentioned in (Huang, Chavira, and Darwiche 2006).

C. Experimental Results

C.1 Grid Ising Models

The results of the experiments on attractive grid Ising models are shown in Figure 3(b). WISH and AdaWISH provide accurate estimates over all coupling strengths. We notice a slight performance drop when the coupling strength is close to 1, in both the attractive and mixed cases. The reason is as follows. Empirically, we find that optimization instances with roughly n/2 constraints are the hardest to solve, and AdaWISH therefore fails to obtain their values within the local timeout of 15 minutes. Although we apply a conservative approximation policy for those missing values, replacing them with their nearest queried right neighbor (which does return within the local timeout), the missing values still have a strong influence on the global estimate. As for the number of queries, Figure 3(e) shows that AdaWISH achieves a solid 60% cut compared to WISH, with estimates of the same or even better quality.

Figure 3: (Top) Log partition function estimation errors of various methods on different models. (Bottom) The number of queries WISH and AdaWISH rely on to provide estimates of that quality. Figures 3(a) and 3(d) show that AdaWISH/WISH outperform the other methods on clique Ising models across a variety of problem sizes, with AdaWISH issuing 47.3% fewer queries than WISH on average. Moreover, Figure 3(b) shows that for attractive grid Ising models with increasing coupling strength, AdaWISH and WISH are competitive with the state-of-the-art methods; the average log10 estimation errors of methods such as HAK and TRWBP diverge significantly, while AdaWISH and WISH are still able to give accurate estimates. AdaWISH also reduces queries by 60% relative to WISH, as shown in Figure 3(e). Figure 3(c) reports that, for 3-SAT problems with a fixed number of clauses, all the algorithms except MF have similar performance. In this case the number of queries AdaWISH relies on is stable at around 15 over increasing problem sizes, as shown in Figure 3(f), while WISH requires many more queries to provide an estimate of the same quality.

C.2 Clique Ising Models

Figure 3(a) reports the performance of all the methods on randomly generated clique Ising models. The theoretical guarantee provided by Theorem 6 holds because all the optimizations are solved to optimality. Note that the figure plots the estimates rather than the errors, since we can only obtain the ground truth for models with fewer than 31 variables. The estimates given by WISH and AdaWISH visually overlap with the ground truth, outperforming the second-best method, BP, in consistently providing accurate estimates. Among the other methods, TRWBP and MF diverge from the ground truth markedly, and HAK failed to complete within the timeout and is therefore absent from the graph. We also note that the empirical performance of AdaWISH is much better than the theoretical guarantee, while saving 47.3% of queries on average across problem sizes when compared to WISH (Figure 3(d)). We observe a negative correlation between the number of variables and the percentage of reduced queries. This is due to the increasing size of the embedded structures: the strong interactions inside them make the weights more distinct from each other and therefore less likely to trigger an early stop in the binary search.

C.3 UAI Instances

TRWBP and HAK fail to complete on a few instances within the timeout and are therefore absent from the graph. Figure 3(c) shows that AdaWISH and WISH are still able to provide near-optimal estimates of the partition function with a fixed number of clauses but an increasing number of variables, competitive with the state-of-the-art methods. Meanwhile, Figure 3(f) demonstrates that AdaWISH relies on a stable number of queries as the problem size increases, resulting in higher efficiency compared to WISH. A global timeout of 4 hours is set for WeightMC, as for AdaWISH, along with a 2500-second timeout for each SAT query, the same default setting mentioned in (Chakraborty et al. 2014b). Similarly, we set a 30-minute timeout for AdaWISH and WISH on each MAP query.

Also note that although AdaWISH uses a MAP oracle in our implementation while WMC accesses an NP oracle, we can augment AdaWISH with CryptoMiniSat (Soos, Nohl, and Castelluccia 2009) for SAT instances, which is likewise an NP oracle; the experimental setup would then be similar to that in (Ermon et al. 2013b). Meanwhile, according to empirical observation, WMC issues more than 120,000 accesses to its oracle to estimate a random 30-variable 3-SAT instance, while that number is roughly 200 for AdaWISH to provide an estimate of the same quality.