Expanding Holographic Embeddings for Knowledge Completion

Authors

Abstract

Neural models operating over structured spaces such as knowledge graphs require a continuous embedding of the discrete elements of this space (such as entities) as well as the relationships between them. Relational embeddings with high expressivity, however, have high model complexity, making them computationally difficult to train. We propose a new family of embeddings for knowledge graphs that interpolate between a method with high model complexity and one, namely Holographic embeddings (HolE), with low dimensionality and high training efficiency. This interpolation, termed HolEx, is achieved by concatenating several linearly perturbed copies of the original HolE. We formally characterize the number of perturbed copies needed to provably recover the full entity-entity or entity-relation interaction matrix, leveraging ideas from Haar wavelets and compressed sensing. In practice, using just a handful of Haar-based or random perturbation vectors results in a much stronger knowledge completion system. On the Freebase FB15K dataset, HolEx outperforms the originally reported HolE by 14.7% on the HITS@10 metric, and the current path-based state-of-the-art method, PTransE, by 4% (absolute).

1 Introduction

Relations, as a key concept in artificial intelligence and machine learning, allow human beings as well as intelligent systems to learn and reason about the world. In particular, relations among multiple entities and concepts enable us to make logical inference, learn new concepts, draw analogies, make comparisons, etc. This paper considers relational learning for knowledge graphs (KGs), which often contain knowledge in the form of binary relations, such as livesIn(Bill Gates, Seattle). A number of very large KGs, with millions and even billions of facts, have become prominent in the last decade, such as Freebase [3] , DBpedia [2] , YAGO [11] , WordNet [17] , and WebChild [26] .

KGs can be represented as a multigraph, where entities such as Bill Gates and Seattle are nodes, connected by zero or more relations such as livesIn and likes. Facts such as livesIn(Bill Gates, Seattle) form typed edges, with the relation (in this case livesIn) being the edge type. In particular, we are interested in the knowledge completion task for KGs: given an existing KG, we would like to use statistical machine learning tools to extract correlations among its entities and relations, and use these correlations to derive new knowledge about them.

Compositional vector space models, also referred to as matrix or tensor factorization based methods, have proven to be highly effective for KG completion [e.g., 4, 5, 7, 8, 12, 14-16, 18, 19, 23-25, 28] . In these models, entities and relations are represented as (learned) vectors in a high dimensional space, and various forms of compositional operators are used to determine the likelihood of a candidate fact. A good design of the compositional operator is often key to the success of the model. Such design must balance computational complexity against model complexity. Not surprisingly, embedding models capable of capturing rich correlations in relational data often have limited computational scalability. On the other hand, models that can be trained efficiently are often less expressive.

We focus on two compositional operators. The first is the full tensor product [18], which captures correlations between every pair of dimensions of two embedding vectors in R^d by considering their outer product. The resulting quadratic (d^2) parameter space makes this impractical even for modestly sized KGs. The second is the circular correlation underlying holographic embeddings, or HOLE [19], which is inspired by holographic models of associative memory. Notably, HOLE keeps the parameter space linear in d by capturing only the sum along each diagonal of the full tensor product matrix.

Our main contribution is a new compositional operator that combines the strengths of these two models, resulting in a much stronger knowledge completion system. Specifically, we propose expanded holographic embeddings, or HOLEX, a collection of models that interpolates between holographic embeddings and the full tensor product.

The idea is to concatenate l ≥ 1 copies of the HOLE model, each perturbed by a linear vector, allowing various copies to focus on different subspaces of the embedding. HOLEX forms a complete spectrum connecting HOLE with the full tensor product model: it falls back to HOLE when l = 1 and all entries in the perturbation vector are non-zero, and is equivalent to the full tensor product model when l = d, the embedding dimension, and all perturbation vectors are linearly independent.

We consider two families of perturbation vectors, low frequency Haar wavelets [6, 10] and random 0/1 vectors. We show that using the former corresponds to considering sums of multiple subsequences of each diagonal line of the full product matrix, in contrast to the original holographic embedding, which sums up the entire diagonal. We find that even just a few low frequency vectors in the Haar matrix are quite effective in practice for HOLEX. When using the complete Haar matrix, the length of the subsequences becomes one, thereby recovering the full tensor product case. Our second family of perturbation vectors, namely random 0/1 vectors, corresponds to randomly sub-selecting half the rows of the tensor product matrix in each copy. This is valuable when the full product matrix is sparse. Specifically, using techniques from compressed sensing, if each diagonal line is dominated by a few large entries (in terms of absolute values), we show that a logarithmic number of random vectors suffice to recover information from these large entries.

To assess its efficacy, we implement HOLEX using the framework of ProjE [23] , a recent neural method developed for the Freebase FB15K knowledge completion dataset [3, 5] , where the 95% confidence interval for statistical significance is 0.3%. In terms of the standard HITS@10 metric, HOLEX using 16 random 0/1 vectors outperforms the original HOLE by 14.7% (absolute), ProjE by 5.7%, and a path-based state-of-the-art method by 4%.

2 Preliminaries

We use knowledge graphs to predict new relations between entities. For example, given the entities Albany and New York State, possible relationships between these two entities are CityIn and CapitalOf. Formally, let E denote the set of all entities in a KG G. A relation r is a subset of E × E, corresponding to all entity pairs that satisfy the relation. For example, the relation CapitalOf contains all (City, State) pairs in which the City is the capital of that particular State. For each relation r, we would like to learn its characterization function φ_r(s, o), which evaluates to +1 if the entity pair (s, o) is in the relation set, and to −1 otherwise. Note that the roles of s and o are typically asymmetric. For example, Albany is the capital of New York State, but not the other way around. Relations can be visualized as a knowledge graph, where the nodes represent entities, and each relation corresponds to the set of edges connecting the entity pairs in that relation.

As mentioned earlier, compositional embeddings are useful models for prediction in knowledge graphs. Generally speaking, these models embed entities as well as relations jointly into a high-dimensional space. Let s ∈ R^{d_s}, o ∈ R^{d_o}, and r ∈ R^{d_r} be the embeddings for entities s and o and the relation r, respectively. Compositional embeddings learn a score function σ(·) that approximates the posterior probability of φ_r(s, o) conditioned on the dataset Ω:

σ(s, r, o) ≈ Pr(φ_r(s, o) = +1 | Ω).    (1)

Many models have been proposed with different functional forms for σ [e.g., 4, 5, 8, 12, 15, 16, 18, 19, 23-25, 28]. A crucial part of these models is the compositional operator they use to capture the correlation between entities and relations. Given embeddings a = (a_0, . . . , a_{d_a−1}) ∈ R^{d_a} and b = (b_0, . . . , b_{d_b−1}) ∈ R^{d_b} of entities (and/or relations), a compositional operator is a function f : R^{d_a} × R^{d_b} → R^{d_f} that maps a and b into another high-dimensional space (Footnote 1). Such operators are used to combine the information from the embeddings of entities and relations to predict the likelihood of a particular entity-relation tuple in the score function. A good compositional operator not only extracts information effectively from a and b, but also trades this off against model complexity.

One approach is to use vector arithmetic operations, such as (weighted) vector addition and subtraction used by TransE [5] , TransH [28] , and ProjE [23] . One drawback of this approach is that the embedding dimensions remain independent in such vector operations, preventing the model from capturing rich correlations across different dimensions. Another popular compositional operator is to concatenate the embeddings of relations and entities, and later apply a non-linear activation function to implicitly capture correlations [8, 24] .

Given the importance of capturing rich correlations, we focus on two representative compositional operators that explicitly model the correlations among entities and relations: the full tensor product and the holographic embedding, described below.

Full Tensor Product Many models, such as RESCAL [18], its compositional training extension [9], and the Neural Tensor Network [25], take the full tensor product as the compositional operator. Given two embedding vectors a, b ∈ R^d, the full tensor product is defined as a ⊗ b = ab^T, i.e.,

(a ⊗ b)_{ij} = a_i b_j.    (2)

The full tensor product captures all pairwise multiplicative interactions between a and b. Intuitively, a feature in a ⊗ b is "on" (has large absolute value) if and only if the corresponding features in both a and b are "on". This helps model entities with multiple characteristics. For example, consider the entity Obama, who is a man, a basketball player, and a former president of the US. In the embedding for Obama, one dimension can fire when it is coupled with Chicago Bulls (a basketball team), while a different dimension fires when coupled with the White House.

However, this rich expressive power comes at a cost: a huge parameter space, which makes it difficult, if not impossible, to effectively train a model on large datasets. For example, for RESCAL, the score for a triple (s, r, o) is defined as:

σ(s, r, o) = grandsum(W_r • (s ⊗ o)),    (3)

where W_r ∈ R^{d×d} is the matrix encoding relation r, • refers to the Hadamard product (i.e., the element-wise product), and grandsum refers to the sum of all entries of a matrix. With |R| relations, the number of parameters is dominated by the embeddings of all relations, totaling d^2 |R|. This quickly becomes infeasible even for modestly sized knowledge graphs.
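To make the two equivalent forms of the RESCAL score concrete, here is a minimal NumPy sketch (illustrative only; the variable names and dimensions are ours, not the authors' code) checking that the grandsum of the Hadamard product equals the bilinear form s^T W_r o:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
s, o = rng.normal(size=d), rng.normal(size=d)   # subject / object embeddings
W_r = rng.normal(size=(d, d))                   # per-relation d x d matrix (d^2 |R| parameters overall)

# RESCAL-style score: grandsum of the Hadamard product of W_r with the full tensor product s (x) o ...
score_grandsum = np.sum(W_r * np.outer(s, o))
# ... which is the same as the bilinear form s^T W_r o.
score_bilinear = s @ W_r @ o
assert np.isclose(score_grandsum, score_bilinear)
```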

Holographic Embedding HOLE provides an alternative compositional operator using the idea of circular correlation. Given a, b ∈ R^d, the holographic compositional operator h : R^d × R^d → R^d produces an interaction vector of the same dimension as a and b, with the k-th dimension being:

[h(a, b)]_k = Σ_{i=0}^{d−1} a_i b_{(i+k) mod d}.    (4)

Figure 1 (left) provides a graphical illustration. HOLE computes the sum of each (circular) diagonal line of the original tensor product matrix, collapsing a two-dimensional matrix into a one-dimensional vector. Put another way, HOLE still captures pairwise interactions between different dimensions of a and b, but collapses everything along each individual diagonal and retains only the sum for each such 'bucket'. The HOLE score for a triple (s, r, o) is defined as:

Figure 1: (Left) Visualization of HOLE, which collapses the full tensor product M = ab^T into a vector by summing along each (circular) diagonal line, depicted with the same color. (Middle) HOLE perturbed with a vector c, where each row of M is multiplied with one entry of c prior to the holographic operation. (Right) HOLEX using the first two Haar vectors. When M has dimension d × d, this is equivalent to returning a 2 × d matrix that sums along each half of each diagonal line, depicted with the same color.

σ(s, r, o) = r • h(s, o),    (5)

where • denotes the dot product. This requires only d|R| parameters for encoding all relations (Footnote 2).

The circular correlation used in holographic embeddings can be seen as a projection of the full tensor product that weights all interactions along each diagonal line equally. Given its similarity to (circular) convolution, the actual computation can be carried out efficiently with the fast Fourier transform (FFT): h(a, b) = F^{-1}(conj(F(a)) • F(b)), where • refers to the element-wise product, F is the discrete Fourier transform, and conj(x) is the complex conjugate of x.

Footnote 1: A compositional operator can be between two entities, or between an entity and a relation.
Footnote 2: As discussed later, our improved reimplementation of HOLE, inspired by recent work [23], uses a slight variation for tail prediction.
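As a sanity check of the FFT identity above, the following NumPy sketch (ours, not the paper's implementation) compares the direct diagonal-sum definition of Eq. (4) with the FFT-based computation, and then forms the HOLE score of Eq. (5) for a toy triple:

```python
import numpy as np

def circular_correlation_direct(a, b):
    # [h(a, b)]_k = sum_i a_i b_{(i+k) mod d}   (Eq. 4)
    d = len(a)
    return np.array([sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)])

def circular_correlation_fft(a, b):
    # h(a, b) = F^{-1}( conj(F(a)) * F(b) ), computed in O(d log d)
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

rng = np.random.default_rng(0)
d = 16
a, b = rng.normal(size=d), rng.normal(size=d)
assert np.allclose(circular_correlation_direct(a, b), circular_correlation_fft(a, b))

# HOLE score for a toy triple (s, r, o): dot product of r with the composed vector (Eq. 5).
s, r, o = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
print(r @ circular_correlation_fft(s, o))
```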

3 Expanding Holographic Embeddings

Is there a model that sits in between HOLE and the full tensor product, and provides a better trade-off between computational complexity and model complexity than either extreme? We present Expanded Holographic Embeddings, or HOLEX, a collection of models with increasing complexity that provides a controlled way to interpolate between HOLE and the full tensor product. Given a fixed vector c ∈ R^d, we define the perturbed holographic compositional operator for a, b ∈ R^d as:

h(a, b; c) = (c • a) ⋆ b,    (6)

where ⋆ denotes the circular correlation of Eq. (4). As before, • represents the Hadamard product, and the score for a triple (s, r, o) is computed by taking the dot product of this composition between two elements (e.g., s and o) and a d-dimensional vector encoding the third element (e.g., r). In other words, the k-th dimension of h now becomes:

[h(a, b; c)]_k = Σ_{i=0}^{d−1} c_i a_i b_{(i+k) mod d}.    (7)

In practice, the vector c is chosen prior to training. As depicted in Figure 1 (middle), HOLEX visually first forms the full tensor product of a and b, then multiplies each row with the corresponding dimension of c, and finally sums along each (circular) diagonal line. Computationally, HOLEX continues to benefit from the fast Fourier transform:

h(a, b; c) = F^{-1}(conj(F(c • a)) • F(b)).
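The same identity extends to the perturbed operator. Here is a small NumPy sketch (again illustrative, not the authors' code) verifying that the FFT form matches the element-wise definition of Eq. (7), here with a random 0/1 perturbation vector:

```python
import numpy as np

def perturbed_hole_fft(a, b, c):
    # h(a, b; c) = F^{-1}( conj(F(c * a)) * F(b) ): perturb a with c, then circular-correlate with b.
    return np.real(np.fft.ifft(np.conj(np.fft.fft(c * a)) * np.fft.fft(b)))

def perturbed_hole_direct(a, b, c):
    # [h(a, b; c)]_k = sum_i c_i a_i b_{(i+k) mod d}   (Eq. 7)
    d = len(a)
    return np.array([sum(c[i] * a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)])

rng = np.random.default_rng(0)
d = 16
a, b = rng.normal(size=d), rng.normal(size=d)
c = rng.integers(0, 2, size=d).astype(float)    # a random 0/1 perturbation vector, fixed before training
assert np.allclose(perturbed_hole_fft(a, b, c), perturbed_hole_direct(a, b, c))
```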

On one hand, HOLEX falls back to HOLE if we only use one perturbation vector with all non-zero entries. This is because one can always rescale a to subsume the effect of c.

On the other hand, we can expand HOLE into more complex models by using multiple perturbation vectors. Suppose c_0, . . . , c_{l−1} ∈ R^d are l perturbation vectors, collected as the columns of a matrix C_l ∈ R^{d×l}. The rank-l expanded holographic embedding is the concatenation of the l perturbed copies:

h(a, b; C_l) = [h(a, b; c_0); . . . ; h(a, b; c_{l−1})].    (9)

Importantly, this expanded embedding has the same number of parameters as HOLE itself (Footnote 3). We start with a basic question: does rank-l HOLEX capture more information than rank-l′ HOLEX when l > l′? The answer is affirmative if c_0, . . . , c_{l−1} are linearly independent. In fact, Theorem 1 shows that under this setting, rank-d HOLEX is equivalent to the full tensor product up to a linear transformation.

Theorem 1. Let a, b ∈ R^d, l = d, and let R be the full tensor product matrix rearranged according to diagonal lines, i.e., R_{i,j} = a_i b_{(i+j) mod d}. Then rank-d HOLEX satisfies:

h(a, b; C_d) = R^T C_d.

Note that this linear transformation is invertible if C_d has full rank. In other words, learning a rank-d expanded holographic embedding is equivalent to learning the full tensor product. As an example, consider the RESCAL model with score function r^T(s ⊗ o). This is an inner product between the relation embedding r and the full tensor product matrix (s ⊗ o) of the subject and object entities. Suppose we replace the tensor product matrix (s ⊗ o) with the rank-d expanded embedding h(s, o; C_d); by Theorem 1 the two are related by an invertible linear transformation, which can be absorbed into the learned relation embedding, so no expressive power is lost.
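The identity in Theorem 1 is easy to check numerically. The sketch below (illustrative; a, b, and C_d are drawn at random) builds R and the rank-d HOLEX matrix column by column and compares them:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
a, b = rng.normal(size=d), rng.normal(size=d)
C_d = rng.normal(size=(d, d))   # d (almost surely linearly independent) perturbation vectors as columns

# R_{i,j} = a_i b_{(i+j) mod d}: the full tensor product rearranged along (circular) diagonals.
R = np.array([[a[i] * b[(i + j) % d] for j in range(d)] for i in range(d)])

# Rank-d HOLEX: column m holds the HOLE composition perturbed by the m-th column of C_d.
H = np.column_stack([
    [sum(C_d[i, m] * a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]
    for m in range(d)
])
assert np.allclose(H, R.T @ C_d)   # Theorem 1: h(a, b; C_d) = R^T C_d
```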

3.1 Low Rank Holographic Expansions

Theorem 1 states that if we could afford d perturbation vectors, then HOLEX is equivalent to the full tensor product (RESCAL) matrix. What happens if we cannot afford all d perturbation vectors?

We will see that, in this case, HOLEX forms a collection of models with increasingly richer representation power and correspondingly higher computational needs. Our goal is to choose a family of perturbation vectors that provides a substantial benefit even if only a handful of vectors are used in HOLEX.

Different choices of linearly independent families of perturbation vectors extract different information from the full RESCAL matrix, thereby leading to different empirical performance. For example, consider using the truncated identity matrix I_{k×d}, i.e., the first k columns of the d × d identity matrix, as the perturbation vectors. This is equivalent to retaining the first k major diagonal lines of the full RESCAL matrix and ignoring everything else in it. Empirically, we found that using I_{k×d} substantially worsened performance. Our intuition is that such a choice is worse than using perturbation vectors that condense information from the entire RESCAL matrix, i.e., vectors with a wider footprint. We consider two such perturbation families, Haar vectors and random 0/1 vectors.

The following example illustrates the intuition behind such wide-footprint vectors being a better fit for the task than I_{k×d}. Consider the tuple ⟨Alice, review, paper42⟩. When we embed Alice and paper42 as entities, the i-th embedding dimension may be indicative of a person (i.e., this dimension has a large value whenever the entity is a person) and the j-th dimension indicative of an article. In this case, the (i, j)-th entry of the interaction matrix will have a large value for the pair ⟨Alice, paper42⟩, signaling that it fits the relation "review". If the (i, j)-th entry is far away from the main diagonal, it will be zeroed out (thus losing this information) when using I_{k×d} for perturbation, but captured by vectors with a wide footprint.

3.1.1 Perturbation With Low Frequency Haar Vectors

The Haar wavelet system [6, 10] is widely used in signal processing. The 2 × 2 Haar matrix H_2 associated with the Haar wavelet is shown on the left in Figure 2, which also shows H_4. In general, the 2n × 2n Haar matrix H_{2n} can be derived from the n × n Haar matrix H_n as shown on the right in Figure 2, where ⊗_k represents the Kronecker product and I the identity matrix.

Figure 2: The Haar matrices H_2 and H_4, and the recursive construction of H_{2n} from H_n (reproduced below).

Haar matrices have many desirable properties. Consider multiplying H_4 with a vector a. The inner product between the first row of H_4 and a gives the sum of the entries of a (i.e., Σ_{i=0}^{3} a_i). The inner product between the second row of H_4 and a gives the difference between the sum of the first half of a and the sum of the second half. Importantly, we can infer the sum of each half of a by examining these two inner products. Generalizing this, consider the first 2^k rows of H_d, referred to as the (unnormalized) 2^k Haar wavelets with the lowest frequency. If we split vector a into 2^k segments of equal size, then one can infer the partial sum of each segment by computing the inner product of a with each of the first 2^k rows of H_d.

H_2 = [ 1   1
        1  −1 ]

H_4 = [ 1   1   1   1
        1   1  −1  −1
        1  −1   0   0
        0   0   1  −1 ]

H_{2n} = [ H_n ⊗_k [1, 1]
           I_n ⊗_k [1, −1] ]
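The following sketch (ours; the helper name haar is hypothetical, not from the paper) builds H_d via the recursion above and confirms that the inner products of a vector a with the first 2^k rows of H_d determine the sums of its 2^k equal-sized segments:

```python
import numpy as np

def haar(n):
    # Unnormalized Haar matrix via the recursion H_{2n} = [H_n kron [1, 1]; I_n kron [1, -1]].
    if n == 1:
        return np.array([[1.0]])
    H = haar(n // 2)
    return np.vstack([np.kron(H, [1, 1]), np.kron(np.eye(n // 2), [1, -1])])

d, k = 8, 2                         # use the first 2^k = 4 low-frequency rows of H_d
rng = np.random.default_rng(0)
a = rng.normal(size=d)

H_d, H_l = haar(d), haar(2 ** k)
low_freq = H_d[: 2 ** k]            # first 2^k rows of H_d (the low-frequency Haar wavelets)
segment_sums = a.reshape(2 ** k, -1).sum(axis=1)   # sums of the 2^k equal-sized segments of a
# The 2^k inner products low_freq @ a determine the segment sums (invert the small 2^k x 2^k system).
assert np.allclose(np.linalg.solve(H_l, low_freq @ a), segment_sums)
```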

This view provides an intuitive interpretation of HOLEX when using perturbation with low frequency Haar vectors. In fact, we can prove that HOLEX using the first 2^k rows of H_d yields an embedding that contains the partial sums of 2^k equal-sized segments along each (circular) diagonal line of the tensor product matrix. This is stated formally in Proposition 1. The case of using the first two rows of H_d is visually depicted in the rightmost panel of Figure 1.

Proposition 1. Let a, b ∈ R^d, l = 2^k, and let H_{d,k}^T denote the d × l matrix whose columns are the first l rows of H_d. Let R be as in Theorem 1, and define

W = (1/l) H_l^T h(a, b; H_{d,k}^T).

Then W captures the partial column sums of R. In other words, W_{i,j} is the sum of the entries from R_{di/l, j} to R_{d(i+1)/l−1, j}, where all indices start from 0.

Proposition 1 formalizes how HOLEX forms an interpolation between HOLE and the full tensor product as an increasing number of Haar wavelets is used as perturbation vectors. While HOLE captures the sum along each full diagonal line, HOLEX gradually enriches the representation by adding subsequence sums as we include more and more rows from the Haar matrix.
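To illustrate this interpolation numerically, the sketch below (ours, under the same unnormalized Haar construction as before) stacks the HOLE copies perturbed by the first l low-frequency Haar rows and recovers the partial segment sums of every (circular) diagonal of R by solving a small l × l linear system with H_l; it checks the segment-sum relationship behind Proposition 1 using a reconstruction that is convenient to verify, rather than the proposition's exact normalization:

```python
import numpy as np

def haar(n):
    # Unnormalized Haar matrix H_n (same recursion as in the earlier sketch).
    if n == 1:
        return np.array([[1.0]])
    H = haar(n // 2)
    return np.vstack([np.kron(H, [1, 1]), np.kron(np.eye(n // 2), [1, -1])])

def perturbed_hole(a, b, c):
    # [h(a, b; c)]_k = sum_i c_i a_i b_{(i+k) mod d}   (Eq. 7)
    d = len(a)
    return np.array([sum(c[i] * a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)])

d, l = 8, 4                          # embedding dimension and number of low-frequency Haar rows (l = 2^k)
rng = np.random.default_rng(0)
a, b = rng.normal(size=d), rng.normal(size=d)

H_d, H_l = haar(d), haar(l)
R = np.array([[a[i] * b[(i + j) % d] for j in range(d)] for i in range(d)])   # R_{i,j} = a_i b_{(i+j) mod d}

# Rank-l HOLEX output using the first l rows of H_d as perturbation vectors; it equals H_l @ P,
# where P_{s,j} is the sum of segment s of diagonal column j of R.
holex_out = np.vstack([perturbed_hole(a, b, H_d[m]) for m in range(l)])
P_recovered = np.linalg.solve(H_l, holex_out)
P_direct = R.reshape(l, d // l, d).sum(axis=1)
assert np.allclose(P_recovered, P_direct)
```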

3.2 Projection With Random 0/1 Vectors

We next consider random perturbation vectors, each of whose entries is sampled independently and uniformly from {0, 1}. As suggested by Figure 1 (middle), HOLE perturbed with one such random 0/1 vector is equivalent to randomly zeroing out roughly half the d rows (corresponding to the 0s in the vector) from the tensor product matrix, before summing along each (circular) diagonal line.

Random vectors work particularly well if the full tensor product matrix is sparse, which turns out to often be the case. Figure 3 illustrates this sparsity for the FB15K dataset. The heat map on the left highlights that there are relatively few large (dark) entries overall. The plot on the right shows that each circular diagonal line, on average, is dominated by very few large entries. For example, on average, the 5th largest entry (out of 64) has a magnitude of only about half that of the largest. The values decay rapidly. This is in line with our expectation that different entries of the RESCAL matrix carry different semantic information, not all of which is generally relevant for all entity-relation pairs.

Figure 3: Sparse nature of full 64 × 64 RESCAL matrices learned from the FB15K dataset. The heat map on the left shows a typical entity-relation matrix, i.e., a particular s ⊗ r. The plot on the right shows the average magnitudes of the entries in each (circular) diagonal line, normalized so that the largest entry in each diagonal is 1, sorted in decreasing order, and averaged over the entire dataset.

To understand why random vectors are suitable for sparse interactions, consider the extreme but intuitive case where only one of the d entries in each diagonal line has large magnitude, and the rest are close to zero. For a particular diagonal line, one random vector zeros out a set of approximately d/2 entries, and a second random vector zeros out another set of d/2 entries, chosen randomly and independently of the first set. The number of entries not zeroed out by either of the two random vectors is thus approximately d/4. Continuing this reasoning, in expectation only one entry will "survive", i.e., remain not zeroed out, after applying log_2 d such random 0/1 vectors.
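A quick Monte Carlo check of this back-of-the-envelope argument (illustrative only; d and the number of trials are arbitrary choices) confirms that applying log_2 d independent random 0/1 masks to d rows leaves about one surviving row on average:

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 64, 2000
survivors = []
for _ in range(trials):
    alive = np.ones(d, dtype=bool)                       # rows of the diagonal not yet zeroed out
    for _ in range(int(np.log2(d))):                     # apply log2(d) random 0/1 perturbation vectors
        alive &= rng.integers(0, 2, size=d).astype(bool)
    survivors.append(alive.sum())
print(np.mean(survivors))   # close to 1: on average a single row survives all log2(d) vectors
```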

Suppose we apply HOLEX with 2 log d random vectors. For a particular diagonal line, approximately half (log d) of the random vectors will zero out the unique row of the large entry, thereby resulting in a small sum for that diagonal. Consider those log d random vectors that produce large sums. According to the previous reasoning, there is, in expectation, only one row that none of these vectors zeros out. The intersection of this row and the diagonal line must, then, be the location of the large entry.

This leads to the following theorem, which says that if there is only one non-zero entry in every diagonal, HOLEX can recover the whole matrix.

Theorem 2. Suppose there is only one non-zero entry, of value 1, in each diagonal line of the tensor product matrix. Let η > 0 and let d be the embedding dimension. HOLEX expanded with 3 log d − log η − 1 random 0/1 vectors can locate the non-zero entry in each diagonal line of the tensor product matrix with probability at least 1 − η.

Assuming exactly one non-zero entry per diagonal might be too strong, but this assumption can be weakened using techniques from compressed sensing, as reflected in the following theorem.

Theorem 3. Suppose each diagonal line of the tensor product matrix is s-sparse, i.e., has no more than s non-zero entries. Let A ∈ R^{l×d} be a random 0/1 matrix, let η ∈ (0, 1), and let l ≥ C(s log(d/s) + log(η^{−1})) for a universal constant C > 0. Then HOLEX with the rows of A as perturbation vectors can recover the tensor product matrix, i.e., identify all non-zero entries, with probability at least 1 − η.

The proofs of the above two theorems are deferred to the Appendix. We note that Theorem 3 also holds in the noisy setting where diagonal lines have s large entries, but are corrupted by some bounded noise vector e. In this case, we do not expect to fully recover the original tensor product matrix, but can identify a matrix that is close enough, which is sufficient for machine learning applications. We omit the details (cf. Theorem 2.7 of Rauhut [20] ). Thus, HOLEX works provably as long as each diagonal of the tensor product matrix can be approximated by a sparse vector.

4 Experiments

For evaluation, we use the standard knowledge completion dataset FB15K [5] . This dataset is a subset of Freebase [3] , which contains a large number of general facts about the world. FB15K contains 14,951 entities, 1,345 relations, and 592,213 facts. The facts are divided into 483,142 for training, 50,000 for validation, and 59,071 for testing.

We follow the evaluation methodology of prior work in this area. For each triple (s, r, o), we create a head prediction query (?, r, o) and a tail prediction query (s, r, ?). For head prediction (tail prediction is handled similarly), we use the knowledge completion method at hand to rank all entities based on their predicted likelihood of being the correct head, resulting in an ordered list L. Since many relations are not 1-to-1, there often are other (already known) valid facts of the form (s′, r, o) with s′ ≠ s in the training data. To account for these equally valid answers, we follow prior work and filter out such other valid heads from L to obtain L′. Finally, for this query, we compute three metrics: the 0-based rank r of s in this (filtered) ordered list L′, the reciprocal rank 1/(r + 1), and whether s appears among the top k items of L′ (HITS@k, for k ∈ {10, 5, 1}). The overall performance of the method is taken to be the average of these metrics across all head and tail prediction queries.
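For concreteness, here is a small Python sketch of the filtered evaluation protocol described above (the function name filtered_metrics and the toy scores are ours, not part of any benchmark code):

```python
from typing import Iterable, List, Set

def filtered_metrics(scores: List[float], correct: int, known: Set[int], ks: Iterable[int] = (10, 5, 1)):
    """Rank all candidate entities by score (higher is better) for one query, filter out other
    known-valid answers, and return the 0-based rank, reciprocal rank 1/(rank+1), and HITS@k."""
    order = sorted(range(len(scores)), key=lambda e: -scores[e])
    filtered = [e for e in order if e == correct or e not in known]   # drop other valid answers
    rank = filtered.index(correct)
    return {"rank": rank, "mrr": 1.0 / (rank + 1), **{f"hits@{k}": int(rank < k) for k in ks}}

# Toy usage: entity 2 is the gold head; entity 0 is another known-valid head and gets filtered out.
print(filtered_metrics(scores=[0.9, 0.1, 0.8, 0.4], correct=2, known={0}))
```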

We reimplemented HOLE using the recent framework of Shi and Weninger [23], which is based on TensorFlow [1] and is optimized for multiple CPUs. We consider both the original embedding dimension of 150 and a larger dimension of 256 that is better suited for our Haar vector based linear perturbation. In the notation of Shi and Weninger [23], we changed their interaction component between entity e and relation r from e ⊕ r = D_e e + D_r r + b_c to the (expanded) holographic interaction h(e, r). We also dropped their non-linearity function, tanh, around this interaction for slightly better results. Their other implementation choices were left intact, such as computing the interaction between e and r rather than between two entities, using dropout, and other hyper-parameters.

4.1 Impact Of Varying The Number Of Perturbation Vectors

To gain insight into HOLEX, we first consider the impact of adding an increasing number l of linear perturbation vectors. We start with a small embedding dimension, 32, which allows for a full interpolation between HOLE and RESCAL. We then report results with embedding dimension 256, with up to 8 random 0/1 vectors. Figure 4 depicts the resulting HITS@10 and Mean Rank metrics, as well as the training time per epoch (on a 32-CPU machine on Google Cloud Platform). In both the small- and large-scale experiments, we observe that both the mean rank and HITS@10 metrics generally improve (ignoring random fluctuations) as l increases. In the large-scale experiment, even with l = 2 we already observe a substantial 2.5% improvement in HITS@10 (higher is better) and a reduction in mean rank (lower is better) by 8. In this particular case, mean rank saturates after l = 3, although HITS@10 continues to climb steadily until l = 8, suggesting further gains if even more perturbation vectors were used. The rightmost plot indicates that the training time scales roughly linearly, thus making l an effective knob for trading off test performance with training time.

Figure 4: Impact of using a varying number l of random 0/1 vectors on the performance of HOLEX on FB15K. Left: HITS@10 and Mean Rank for a full interpolation between HOLE (l = 1) and RESCAL (l = 32), when the embedding dimension d = 32. Middle: Similar trend for d = 256. Right: Training time per epoch for experiments with d = 256.

4.2 Comparison With Existing Methods

We now compare HOLEX with several representative baselines. All baseline numbers, except for our reimplementation of HOLE, are taken from Shi and Weninger [23], who also report the performance of additional baselines, which fared worse than the TransR results reported here.

Remark 1. In private communication, Shi and Weninger noted that the published numbers for their method, ProjE, were inaccurate due to a bug (Github issue #3). We use their updated code from https://github.com/bxshi/ProjE and new suggested parameters, reported here for completeness: max 50 iterations, learning rate 0.0005, and negative sampling weight 0.1. We increased the embedding dimension from 200 to 256 for consistency with our method and reduced the batch size to 128, which improved the HITS@10 metric for the best variant, ProjE_listwise, from 80.0% to 82.9%. We use this final number here as the best ProjE baseline.

Table 1 summarizes our main results, with the various methods sorted by increasing HITS@10 performance. The best baseline numbers are highlighted in bold, and so are the best numbers using our expanded holographic embeddings method. We make a few observations.

Table 1: Expanded holographic embeddings, HOLEX, outperform a variety of knowledge completion methods on the FB15K dataset. Mean Rank (0-based) and HITS@10 are the main metrics we track; other metrics are reported for a more comprehensive comparison with prior work. Numbers are averages across head- and tail-prediction tasks, individual results for which may be found in the Appendix. (*) See Remark 1 for an explanation of ProjE results, and Footnote 4 for very recent models.

First, although the RESCAL approach [18] , which works with the full outer product matrix, is capable of capturing rich correlation by looking at every pair of dimensions, the resulting quadratically many parameters make it difficult to train in practice, eventually resulting in poor performance.

Second, models such as TransE [5] and TransR [16] that rely on simple vector arithmetic, such as adding or subtracting vectors, are unable to capture rich correlations, again resulting in low performance.

Third, reimplementing HOLE using the ProjE framework increases HITS@10 from 73.9% to 78.4%, likely due to improved training with the TensorFlow backend, regularization techniques like dropout, and entity-relation interaction rather than the original HOLE's entity-entity interaction. Further, simply increasing the embedding dimension from 150 to 256 allows HOLE to achieve 83.0% HITS@10, higher than most baseline methods that do not explicitly model KG paths, except for DistMult [29], which was re-tuned very carefully for this task [13] to achieve state-of-the-art results (Footnote 4).

Relative to the (reimplemented) HOLE baseline, our proposed HOLEX with 8 Haar vectors improves the HITS@10 metric by 3.7%. The use of random 0/1 vectors appears somewhat more effective, achieving 88.6% HITS@10 with 16 such vectors, which is a 5.7% improvement over ProjE, which formed our codebase. This setting also achieves a mean reciprocal rank (MRR) of 0.800 and HITS@1 of 75.0%, matching or outperforming a wide variety of existing methods along various metrics (Footnote 5).

A Additional Empirical Results

Tables 2 and 3 summarize the performance of HOLEX for the head-and tail-prediction tasks, respectively. Note that the corresponding numbers are averaged when reporting the main results in Table 1 on the full task.

As has been observed in prior work, the tail-prediction task is considerably easier than head-prediction for named-entity knowledge bases such as Freebase. This is because many-to-one relations tend to be more common than one-to-many relations. For instance, many people "live in" one city or "work for" one company, whereas relatively few people have been the "president of" the United States.

We see, for example, that when using 8 random 0/1 vectors in HOLEX, the tail-prediction HITS@10 metric is 90.5%, which is 5.2% higher than that for head-prediction. Similarly, the mean rank for tail-prediction is 35 in this case, compared to 58 for head prediction.

Table 3: Performance of HOLEX on the tail-prediction task. Table 1 reports the average of this and head-prediction performance.

B Proof Details

Proof of Theorem 1. By the definition of the expanded holographic embedding, the (j, i)-th entry of the matrix h(a, b; C_d) is

[h(a, b; C_d)]_{j,i} = Σ_{l=0}^{d−1} c_{l,i} a_l b_{(l+j) mod d},

in which c_{l,i} is the (l, i)-th entry of the matrix C_d (the l-th component of the i-th perturbation vector), and a_l b_{(l+j) mod d} = R_{l,j}, the (l, j)-th entry of the matrix R. Therefore,

h(a, b; C_d) = R^T C_d,

which is what the theorem states.

Definition 1. A random 0/1 matrix A ∈ {0, 1}^{l×d} is a matrix whose entries are chosen independently and uniformly at random from {0, 1}.

Claim 1. Suppose x, y ∈ R^d are two vectors, each with exactly one non-zero entry, at different locations. Let A ∈ {0, 1}^{l×d} be a random 0/1 matrix. Then Pr(Ax = Ay) ≤ 1/2^l.

Proof. Suppose the i-th entry is the unique non-zero entry of x, and similarly the j-th entry for y, with i ≠ j. If Ax = Ay, then we must have A(:, i) = A(:, j): otherwise there is a row k with, say, A_{k,i} = 1 and A_{k,j} = 0, giving (Ax)_k = x_i ≠ 0 while (Ay)_k = 0, a contradiction. Therefore,

Pr(Ax = Ay) ≤ Pr(A(:, i) = A(:, j)) = 1/2^l,

as claimed.
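A short simulation (illustrative) agrees with this bound: for two one-hot vectors with non-zeros at different positions, the empirical collision rate over random 0/1 matrices matches (1/2)^l.

```python
import numpy as np

rng = np.random.default_rng(0)
l, d, trials = 6, 16, 20000
i, j = 3, 11                        # locations of the unique non-zero entries of x and y
x, y = np.zeros(d), np.zeros(d)
x[i], y[j] = 1.0, 1.0

collisions = 0
for _ in range(trials):
    A = rng.integers(0, 2, size=(l, d))
    collisions += np.array_equal(A @ x, A @ y)
print(collisions / trials, 1 / 2 ** l)   # empirical collision rate vs. the 1/2^l bound of Claim 1
```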

Proof of Theorem 2. It suffices to prove that the statement holds for one diagonal line with probability at least 1 − η/d; a union bound over all d diagonal lines then shows that the statement holds for all of them with probability at least 1 − η. The rest of the proof therefore focuses on a single diagonal line.

The effect of applying the expanded holographic embedding with l random 0/1 vectors to one diagonal line is to multiply this diagonal line by an l-by-d random 0/1 matrix A. This fact can be quickly checked with the graphical example in Figure 1 (middle). Suppose x and y are two possible configurations of the diagonal line of interest (i.e., both x and y have one non-zero entry of value 1). If a random 0/1 matrix A can tell apart every pair of such x and y, we can decide which configuration the diagonal line is actually in by examining the result of the expanded holographic embedding. In other words, it is sufficient to prove the following: let l = 3 log d − log η − 1 and sample an l-by-d random 0/1 matrix A; then with probability at least 1 − η/d, Ax ≠ Ay holds for any two distinct vectors x and y, each with exactly one non-zero entry of value 1.

Pr(∀ x, y ∈ D with x ≠ y : Ax ≠ Ay)
= 1 − Pr(∃ x, y ∈ D with x ≠ y : Ax = Ay)
≥ 1 − Σ_{x, y ∈ D, x ≠ y} Pr(Ax = Ay)
≥ 1 − (d(d−1)/2) · (1/2^l)
≥ 1 − η/d.

Here, D is the set of vectors with exactly one non-zero entry of value 1, so there are d(d−1)/2 unordered pairs x ≠ y. The first inequality is a union bound, the second uses Claim 1, and the last holds because l ≥ 3 log d − log η − 1.

The proof of Theorem 3 makes many connections to compressed sensing, which we briefly review here; many of the definitions and lemmas can be found in [20]. We first introduce the notion of the restricted isometry property.

Definition 2 (restricted isometry property [20]). The restricted isometry constant δ_s of a matrix A ∈ R^{m×d} is defined as the smallest δ_s such that

(1 − δ_s) ‖x‖_2^2 ≤ ‖Ax‖_2^2 ≤ (1 + δ_s) ‖x‖_2^2

for all s-sparse x ∈ R^d.

It is well known that the restricted isometry property implies recovery of sparse vectors, as stated below.

Lemma 1 (Theorem 2.6, [20]). Suppose the restricted isometry constant δ_{2s} of a matrix A ∈ R^{m×d} satisfies δ_{2s} < 1/3. Then every s-sparse vector x* ∈ R^d is recovered by ℓ_1-minimization.

Therefore, in order to guarantee sparse recovery of x*, we need a good matrix A. It turns out that a random Bernoulli matrix has a good upper bound on its restricted isometry constant:

Lemma 2 (Theorem 2.12, [20]). Let A ∈ R^{m×d} be a Bernoulli random matrix, where every entry of the matrix takes the value 1/√m or −1/√m with equal probability. Let ε, δ ∈ (0, 1) and assume m ≥ C δ^{−2}(s log(d/s) + log(ε^{−1})) for a universal constant C > 0. Then with probability at least 1 − ε the restricted isometry constant of A satisfies δ_s ≤ δ.


Footnote 3: While we discuss expansion in the context of HOLE, it is evident from Eq. (9) that one can easily generalize the notion (even if not the theoretical results that follow) to any embedding method that can be decomposed as σ(s, r, o) = g(f(s, r), o), or a similar decomposition for another permutation of s, r, o. In this case, the expanded version would simply be Σ_{j=0}^{l−1} g(f(c_j • s, r), o).

Conclusion

We proposed expanded holographic embeddings (HOLEX), a new family of embeddings for knowledge graphs that smoothly interpolates between the full product matrix of correlations on one hand, and an effective lower-dimensional method, namely HOLE, on the other. By concatenating several linearly perturbed copies of HOLE, our approach allows the system to focus on different subspaces of the full embedding space, resulting in a richer representation. It recovers the full interaction matrix when sufficiently many copies are used. Empirical results on the standard FB15K dataset demonstrate the strength of HOLEX even with only a handful of perturbation vectors, and the benefit of being able to select a point that effectively trades off the expressivity of relational embeddings with computation.

Footnote 4: A recent model called EKGN [22] outperforms this with HITS@10 at 92.7% and mean rank 38. Explicitly modeling reciprocal relations, adding a new regularizer, and using a weighted nuclear 3-norm have also recently been shown to improve both CP decomposition and ComplEx to HITS@10 of 91% and MRR 0.86, albeit with much larger embedding dimensions of 4,000 and 2,000 for CP and ComplEx, respectively [14, 21].

Footnote 5: Our HOLEX implementation uses the hyper-parameters recommended for ProjE, except for embedding dimension 256 and batch size 128. Hyper-parameter tuning targeted at HOLEX should improve results further.

Table 2: Performance of HOLEX on the head-prediction task. Table 1 reports the average of this and tail-prediction performance.