
Learning to Predict Citation-Based Impact Measures

Authors

Abstract

Citations implicitly encode a community's judgment of a paper's importance and thus provide a unique signal by which to study scientific impact. Efforts in understanding and refining this signal are reflected in the probabilistic modeling of citation networks and the proliferation of citation-based impact measures such as Hirsch's h-index. While these efforts focus on understanding the past and present, they leave open the question of whether scientific impact can be predicted into the future. Recent work addressing this deficiency has employed linear and simple probabilistic models; we show that these results can be handily outperformed by leveraging non-linear techniques. In particular, we find that these AI methods can predict measures of scientific impact for papers and authors, namely citation rates and h-indices, with surprising accuracy, even 10 years into the future. Moreover, we demonstrate how existing probabilistic models for paper citations can be extended to better incorporate refined prior knowledge. While predictions of scientific impact should be approached with healthy skepticism, our results improve upon prior efforts and form a baseline against which future progress can be easily judged.

1 Introduction

This paper investigates the problem of predicting scientific impact for individual authors and papers up to 10 years into the future. As there is no consensus as to which measure of scientific impact is best, we follow prior work [1, 20] and quantify author impact using the h-index, and paper impact with citation counts. We test the efficacy of our methods on a data set of close to four million computer science papers written by approximately 800,000 authors and published in the years spanning 1975 to 2016. This data set, which we have made publicly available,1 is unique both in size, more than an order of magnitude larger than others, and in breadth, covering papers published in over 7,000 conferences and journals.

Figure 1: The cumulative number of published papers in our data set over time. The (plotted) exponential fit suggests that the number of papers can be expected to double approximately every six years.

For our prediction task, we use information available in 2005 to predict impact in the subsequent 10-year period, the years 2006 to 2015. Because of their simplicity and ubiquity, citation counts have become perhaps the most popular measure of scientific impact. While citation counts are indeed a simple and useful proxy for impact, they exhibit a number of flaws, especially when used to characterize authors. One common criticism of citation counts is that they fail to capture any notion of how citations are distributed across a researcher's publications. For instance, we might expect that an author with 60 citations is more impactful if those citations are distributed equally across 6 papers rather than across 30. Equally problematic, if an author has a short career with a single highly cited paper, perhaps a survey or interdisciplinary publication, they may seem more impactful than a researcher with a long history of moderate-impact publications. A growing realization, pioneered by Hirsch [12], that these flaws must be addressed has led to a recent explosion in surrogate measures of impact. A small subset of these measures include the h-index, g-index, c-index, eigenfactor, and hip-index [4, 10, 12, 25, 27]. The h-index, the author impact measure we predict, equals the largest value N for which an author has N publications each with at least N citations. Notice that the h-index is unaffected by outliers (a single publication can only ever increase the h-index by 1) and also penalizes having many papers with few citations (a publication only contributes to the h-index if it has sufficiently many citations).
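The h-index is straightforward to compute from a list of per-paper citation counts. The sketch below (our illustration, not code from the paper's pipeline) makes both properties above concrete:

```python
def h_index(citations):
    """Largest N such that N papers each have at least N citations."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):  # rank = candidate value of N
        if c >= rank:
            h = rank
        else:
            break  # counts are sorted, so no later rank can qualify
    return h
```

For example, `h_index([10, 8, 5, 4, 3])` is 4, while `h_index([100])` and `h_index([1] * 30)` are both 1: a single blockbuster paper and thirty barely cited papers each contribute an h-index of just 1.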

978-1-5386-3861-3/17/$31.00 ©2017 IEEE

While quantifying current scientific impact is of substantial interest, there are a wide variety of problems for which the future is far more important than the present (e.g. the question of granting tenure). Moreover, given the exponential explosion in published papers (see Figure 1), it is critical to design automated systems which detect impactful work as early as possible. To better address these questions, there has been a surge of research in scientific impact prediction for authors and individual papers.

There are two primary strategies for impact prediction. The first is the spiritual successor of the work of Price [18], in which citation counts are statistically modeled using the intuitions provided by the preferential attachment model of network growth and empirical studies [20, 22]. The second predicts impact with a machine learning approach, that is, extensive feature engineering followed by supervised learning with a regression model [1, 6-9, 26]. We compare these two approaches on our data set and propose a method, for paper citation prediction, that bridges the gap between them. We note that there have been some recent efforts which do not fall precisely into either of the above categories. For instance, Nezhadbiglari et al. [15] exploit K-Spectral Clustering to discover cluster centroids among author citation histories invariant to both scaling and shifts; they then combine the information from these centroids with simple author-level features to predict author cluster membership and future citation counts.

The remainder of this paper is organized as follows. We begin with a summary of the features we use to characterize individual papers and authors. We then demonstrate how these features can be used to predict author h-index and paper citation counts. For papers, we compare the machine learning approaches to those inspired by probabilistic modeling; we are not aware of any predictive probabilistic models for author h-index and thus cannot make such a comparison for authors. We end with an analysis of which features are most strongly related to our prediction targets and a discussion of our results and future work. We also include an appendix detailing our modifications to the reinforced Poisson model of Shen et al. [20] used to predict paper citations.

2 Feature Engineering

In order to apply supervised learning techniques to our problem, we must first develop a collection of features summarizing individual papers and authors. In this work we focus our attention on features that can be extracted from the citation graph, coauthor graph, and paper metadata (e.g. authors and venue); we leave the creation of content-based features extracted from papers' text as future work. A specification of all 44 author features, many inspired by prior work [1, 7, 8, 19, 26], can be found in Table 1. The 63 features for papers are similar and remain unlisted to save space;2 we give a summary of all features below. Before continuing we should note that there are several one-to-many relationships between papers, authors, and venues; namely, an author publishes many papers in different venues and a single paper can have many authors. Because of this, we often must summarize a variable-length vector of information. For instance, suppose a paper had five authors who have, respectively, obtained an average of 5, 3, 2, 1, and 1 citations per paper. We capture such variable-length information with three summary statistics, namely the mean, minimum, and maximum; hence the vector (5, 3, 2, 1, 1) would be reduced to (2.4, 1, 5). As we always make such a reduction, we will not explicitly note so below.
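The (mean, min, max) reduction can be written as a small helper. This is our sketch of the idea; the zero-filled default for empty inputs is an assumption, not something the paper specifies:

```python
def summarize(values):
    """Reduce a variable-length feature vector to (mean, min, max)."""
    if not values:
        return (0.0, 0.0, 0.0)  # assumed convention for empty inputs
    return (sum(values) / len(values), min(values), max(values))
```

Applied to the example above, `summarize([5, 3, 2, 1, 1])` returns `(2.4, 1, 5)`.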

Table 1: All features used for author h-index prediction.

2.1 Metadata

Some features can be extracted from paper metadata with little to no processing: for instance, the author count, whether or not the paper is a survey, and the number of years since the paper was published.

2.2 Impact History

Some of the strongest predictors of future impact can be extracted directly from the time series of citation counts and h-indices for individual papers and authors. Such features include the total citation count, year-over-year change in citation rate, and long-term average citation rate. We also consider the impact history of venues, extracting essentially the same features for venues as for authors, and summarize this information for individual papers and authors.

2.3 Citation And Coauthor Graphs

The topology of the citation and coauthor graphs offers compelling information about the centrality and influence of papers and their authors. One might expect, for example, that coauthors will tend to cite one another; thus, having high degree in the coauthor graph suggests future h-index growth. Indeed, Sarigöl et al. [19] demonstrate that measures of centrality in the coauthorship network alone provide strong signals of future success. For computational efficiency, our primary measures of centrality are in/out-degree and PageRank [16].
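As an illustration of these centrality features, the following self-contained sketch computes PageRank on a toy coauthor graph by power iteration. The graph, the damping factor of 0.85, and the dangling-node handling are our illustrative assumptions, not details taken from the paper:

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on an adjacency dict {node: [neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if not out:
                # dangling node: spread its rank uniformly over all nodes
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                share = damping * rank[v] / len(out)
                for u in out:
                    new[u] += share
        rank = new
    return rank

# Toy undirected coauthor graph: each author maps to her coauthors (made-up data).
coauthors = {
    "ann": ["bob", "carol"],
    "bob": ["ann", "carol"],
    "carol": ["ann", "bob", "dave"],
    "dave": ["carol"],
}
pr = pagerank(coauthors)
```

In this toy network, carol, the best-connected author, receives the highest PageRank, matching the intuition that well-connected authors are central.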

3 Author Impact

We first consider author h-index prediction using a machine learning approach. We generate a collection of 44 features for each author and use these features within several regression models. As we predict up to 10 years into the future, we generate features for authors whose first publication was in or before 2005 and only use data that would have been available in 2005. The observed author h-indices in 2006-2015 are then used as targets for prediction. Note that we train the same models several times, once for each of the 10 target years 2006-2015. To filter out inactive authors, we follow Acuna et al. [1] and only include authors having an h-index of at least 4 and whose first article was published between 5 and 12 years prior to 2005. We train the following regression models; these models are ordered by increasing complexity, beginning with simple baselines and ending with state-of-the-art machine learning algorithms.

(1) Plus-k (PK) - A baseline model that adds a fixed constant to all authors' h-indices every year; this constant, equaling 0.402, is chosen by linear regression using the Huber loss, which better handles outliers than the usual squared-error loss [13].

The author features of Table 1, with their descriptions, are:

• author citations delta {0,1} - Citations this year and one year ago
• author key citations delta {0,1} - Key citations this year and one year ago
• author mean citations per paper - Mean number of citations per paper
• author mean citation per paper delta - Change in mean cites per paper over last two years
• author mean citations per year - Mean number of citations per year
• author papers - Number of papers published
• author papers delta - Number of papers published in last two years
• author mean citation rank - Rank of author (between 0 and 1) among all other authors in terms of mean citations per year
• author unweighted pagerank - PageRank of author in unweighted coauthorship network
• author weighted pagerank - PageRank of author in weighted coauthorship network
• author age - Career length (years since first paper published)
• author recent num coauthors - Total number of coauthors in last two years
• author max single paper citations - Max number of citations for any of author's papers
• venue hindex {mean, min, max} - H-indices of venues author has published in
• venue hindex delta {mean, min, max} - 2-year h-index change for venues author has published in
• venue citations {mean, min, max} - Mean citations per paper of venues author has published in
• venue citations delta {mean, min, max} - Change in mean citations per paper over last two years for venues author has published in
• venue papers {mean, min, max} - Number of papers in venues in which the author has published
• venue papers delta {mean, min, max} - Change in number of papers in venues in which the author has published over the last two years
• venue rank {mean, min, max} - Ranks of venues (between 0-1) in which the author has published, determined by mean number of citations per paper
• venue max single paper citations {mean, min, max} - Maximum number of citations any paper published in a venue has received, for each venue the author has published in
• total num venues - Total number of venues published in

(3) Lasso (LAS) - A regularized linear regression model using all features, with the regularization parameter chosen by 10-fold cross validation [21].

(4) Random forest (RF) - An ensemble of regression trees using randomization techniques to improve performance [5].

(5) Gradient boosted regression trees (GBRT) - A collection of simple regression trees trained iteratively by a type of functional gradient descent [11]. Gradient boosted trees can perform very well but, unlike random forests, can require extensive parameter tuning.
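The overall training scheme — features computed only from data available in 2005, with a separate regressor fit for each of the 10 target years — can be sketched as follows. Ordinary least squares on synthetic data stands in for the GBRT/RF/lasso models here, so this illustrates the per-year setup rather than the paper's actual learners:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: rows are authors, columns are features computed
# from information available in 2005 (illustrative only).
n_authors, n_features, horizon = 200, 5, 10
X = rng.normal(size=(n_authors, n_features))
w_true = rng.normal(size=(n_features, horizon))
Y = X @ w_true + 0.1 * rng.normal(size=(n_authors, horizon))  # Y[:, k] = target in year 2006+k

# One model per target year, as in the paper; least squares stands in for the
# actual regression models to keep the sketch dependency-free.
models = []
for k in range(horizon):
    w, *_ = np.linalg.lstsq(X, Y[:, k], rcond=None)
    models.append(w)

preds = np.column_stack([X @ w for w in models])
```

Any of the five regressors above can be substituted for the least-squares fit inside the loop without changing the per-year structure.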

To evaluate the performance of our predictions we consider three performance metrics, described below. The first is the well-known R² metric, which compares the relative performance of a model against a predictor that simply returns the mean of the labels. In our setting,

$$R^2 = 1 - \frac{\sum_{i=1}^N (h_{i,j} - \hat{h}_{i,j})^2}{\sum_{i=1}^N (h_{i,j} - \bar{h}_j)^2}$$

where $N$ is the total number of authors, $h_{i,j}$ is the h-index of the $i$th author in year $j$, $\hat{h}_{i,j}$ is the predicted h-index for that author in that year, and

$$\bar{h}_j = \frac{1}{N}\sum_{i=1}^N h_{i,j}$$

is the average h-index over all authors. While R² is a popular measure of regression performance, it tends to overstate predictive power in the impact prediction setting [17].

This inflation occurs because citation counts and h-indices cannot decrease and are highly auto-correlated, that is, dependent on their value in the prior year. To remove some of this auto-correlation, we modify the R² metric by subtracting the known number of citations in 2005 from the prediction targets in 2006-2015. We define a new metric, which we call the Past Adjusted R² (PA-R²), as

$$\text{PA-}R^2 = 1 - \frac{\sum_{i=1}^N (h_{i,j} - \hat{h}_{i,j})^2}{\sum_{i=1}^N (z_{i,j} - \bar{z}_j)^2}, \quad \text{where } z_{i,j} = h_{i,j} - h_{i,2005} \text{ and } \bar{z}_j = \frac{1}{N}\sum_{i=1}^N z_{i,j}.$$

By subtracting the known quantity $h_{i,2005}$ from $h_{i,j}$ in the denominator of PA-R², we make the denominator strictly smaller and thus remove some misleading inflation in the statistic. While it is often stated that the R² metric lies between 0 and 1, this need only be true in some special cases, for instance, when computing training error with linear regression. Both R² and PA-R² are always less than or equal to 1 but may be negative.

For a test set of 2,566 authors, we compute the R² and PA-R² measures for all of the above models and display the results in Figure 2. In terms of R², we see in Fig. 2a that the gradient boosted regression trees substantially outperform the simple baseline models. Among the three machine learning models, the differences in performance are less stark: the GBRT and RF models are essentially tied, while the lasso model performs only slightly worse (Fig. 2b). However, when using the PA-R² metric, the differences become more pronounced and interesting trends appear (Figs. 2c, 2d). The PA-R² metric illuminates the surprising difficulty of predicting the h-index over short time periods of 1-2 years, a reversal from the R² plots where short time periods appear very predictable.

Figure 2: R² and PA-R² of author h-index predictions. The GBRT outperform the baseline models significantly and also provide notable improvements over the other machine learning models.

This is intuitive, as an author's h-index can only increase in integral jumps and tends to grow slowly; while the long-term cumulative effects of these changes are predictable, the short term is much less so. Note also that the PA-R² metric suggests that the GBRT offer a notable improvement over the other machine learning models (Fig. 2d).

Acuna et al. (2012) consider the task of predicting author h-index using elastic-net regularized linear regression on a crowdsourced data set of neuroscientists; Table 2 displays their results alongside ours. Our model substantially outperforms theirs, with a 50% relative improvement in R² when predicting 10 years into the future. Note, however, that the data sets are distinct. Their data set was pieced together from multiple evolving sources and is not publicly available as a cohesive whole; as such, we could not run our models on their data set.

Table 2: Unadjusted R² values for 1-, 5-, and 10-year author h-index predictions. Our GBRT model attains substantially higher R² scores than those reported by Acuna et al. (2012), especially when predicting 10 years into the future.

While R²-type measures are popular, we prefer the Mean Absolute Percentage Error (MAPE), which averages the percentage error of each prediction. MAPE is defined as

$$\text{MAPE} = \frac{1}{N}\sum_{i=1}^N \left| \frac{h_{i,j} - \hat{h}_{i,j}}{h_{i,j}} \right|.$$

Here, the $i$th summand $|(h_{i,j} - \hat{h}_{i,j})/h_{i,j}|$ is the absolute percentage error of the $i$th prediction, and the MAPE is the average of these errors. Note that a smaller MAPE is better, the opposite of R². Examining Figure 3b we see that, in terms of MAPE, the strengths of GBRT over the other models become even more apparent.
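All three metrics are simple to implement; the sketch below follows the definitions in this section (h-indices are replaced by paper citations in Section 4). The toy numbers are illustrative only and show how PA-R² deflates an R² score inflated by auto-correlation:

```python
import numpy as np

def r2(y, yhat):
    """Standard R^2 against the mean-of-labels predictor."""
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def pa_r2(y, yhat, y_past):
    """Past-Adjusted R^2: baseline variance is taken around the change since 2005."""
    z = y - y_past
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((z - z.mean()) ** 2)

def mape(y, yhat):
    """Mean absolute percentage error; smaller is better."""
    return float(np.mean(np.abs((y - yhat) / y)))

# Illustrative h-indices for three authors: known 2005 values, plus true and
# predicted values for some later year (made-up numbers).
h_2005 = np.array([5.0, 10.0, 20.0])
h_true = np.array([7.0, 11.0, 25.0])
h_pred = np.array([6.0, 12.0, 24.0])
```

On this toy data, R² is about 0.98 while PA-R² drops to about 0.65, exactly the kind of inflation the adjustment is designed to remove.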

Figure 3: MAPE for author h-index prediction with 95% confidence intervals; these intervals are very narrow. Averaging over years, the GBRT obtain an error of 0.138 in comparison to 0.211 for the SMs and 0.151 for the RFs.

Beyond its intuitive description, MAPE has two distinct advantages over the R² measures. First, it normalizes the error of each prediction individually. Without this normalization, being off by three is equally poor for an author with an h-index of one as for an author with an h-index of fifty, a result that makes little intuitive sense. Second, unlike MAPE, R² measures are highly sensitive to outliers and can change dramatically in the presence of a small number of poor predictions. The MAPE produces easily interpretable results; indeed, the h-index is seen to be surprisingly predictable: even when predicting 10 years into the future, the GBRT are within ±19% of the truth on average (Fig. 3a).

4 Paper Impact

The early history of study of the citation graph was concerned with describing the observed power-law distribution of citation counts.

The first major success in modeling this phenomenon came from Price [18], who modeled the probability that a newly published paper, p_new, would cite some other paper, p_old, as being proportional to the number of citations p_old had at the time of p_new's publication. Price showed that a network growing with this "rich get richer" mechanism resulted in node degrees following a power-law distribution closely mimicking that observed in real citation networks. This model was later rediscovered and popularized by Barabási and Albert [2], who coined the term preferential attachment.

This preferential attachment model has recently inspired probabilistic models for predicting citation counts of individual papers [20, 22]. One such predictive model describes citation trajectories using a Reinforced Poisson Process (RPP). In the RPP model, obtaining a citation increases the probability of receiving a citation in the future, a type of self-reinforcement analogous to the notion of preferential attachment [20]. In particular, the RPP models $C_p(t)$, the number of citations a paper $p$ has attained by time $t > 0$ after its publication, as a Poisson process with rate function

$$r_p(t) = \lambda_p \cdot f_p(t \mid \theta_p) \cdot (C_p(t) + m)$$

where $\lambda_p$ is a fitness parameter, $f_p$ is a non-negative temporal decay function with parameters $\theta_p$, and $m$ is a positive integer representing initial visibility. The parameters of the above model can then be inferred by maximum likelihood estimation. As maximum likelihood estimation is prone to overfitting, this model can be naturally augmented within the Bayesian framework, where $\lambda_p$ is assumed to be generated from some prior distribution; in our case we assume that $\lambda_p$ is drawn from a Gamma(α, β) distribution. This prior substantially reduces the number of parameters that have to be estimated and helps mitigate the effect of overfitting [20]. Unfortunately, even this updated model does not perform quite as well as one might expect [23]. In order to improve the accuracy of this elegant model, we consider the following three modifications.

(i) The RPP model requires knowledge of the exact date when papers are published; we extend it to the more realistic setting where only the publication year is known.

(ii) We employ regularization to help mitigate the model's propensity to overfit.

(iii) Instead of requiring that the Gamma prior parameters, α and β, be shared across all papers, we allow them to be the output of a fully connected single-layer neural network taking as input the same features we use in the machine learning models below. This allows more refined information about the paper to inform our prior knowledge.

For details on how the above three items are accomplished, see Appendix A. These changes leave us with two candidate models; the first, which we call an RPPNet, applies all three modifications, leveraging the features we extract, while the second, which we call an RPP, implements only the first two modifications. We also consider a machine learning approach and extract 63 features for each paper, exclusively using information available in 2005. We then use these features to train the same collection of regression models described for author h-index prediction, now predicting citation counts in the years 2006-2015. We filter our data to only include papers having received at least five citations before the end of 2005, a minimum threshold of impact. By simply replacing author h-indices with paper citations, we adapt the MAPE, PA-R², and R² measures to the paper impact setting.
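To make the discrete-time RPP of Appendix A concrete, the sketch below simulates one citation trajectory under the model. The use of a log-normal decay and m = 10 follows the text, while the specific λ, µ, σ values are arbitrary illustrative choices:

```python
import numpy as np
from math import erf, log, sqrt

def lognormal_cdf(t, mu, sigma):
    """CDF of the log-normal decay distribution Phi_theta."""
    if t <= 0:
        return 0.0
    return 0.5 * (1.0 + erf((log(t) - mu) / (sigma * sqrt(2.0))))

def simulate_rpp(lam, mu, sigma, m=10, years=10, seed=0):
    """One citation trajectory C(1..years) from the discrete-time RPP:
    C(n) - C(n-1) ~ Poisson(lam * (C(n-1) + m) * [Phi(n) - Phi(n-1)])."""
    rng = np.random.default_rng(seed)
    traj, c = [], 0
    for n in range(1, years + 1):
        dphi = lognormal_cdf(n, mu, sigma) - lognormal_cdf(n - 1, mu, sigma)
        c += rng.poisson(lam * (c + m) * dphi)  # self-reinforcing: rate grows with c
        traj.append(c)
    return traj

traj = simulate_rpp(2.0, 1.0, 0.5)
```

Larger fitness values λ produce faster-growing trajectories, while the log-normal parameters (µ, σ) control when the citation rate peaks and decays.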

For a test set of 10,000 papers, we plot the MAPE for all of the above models in Figure 4. As with author impact prediction, GBRT substantially outperform the baseline models, a difference of almost 15 percentage points after 10 years (Fig. 4a). Among all models, GBRT are consistently the best, with the RPPNet a close second (Fig. 4b). Given that the RPPNet is more easily interpretable than GBRT and naturally produces error estimates along with its predictions, its slight loss in performance compared to the GBRT may be acceptable. We do not display plots of R² and PA-R²; instead we note that the GBRT outperform all other models but, surprisingly, the simple SM model performs almost as well as the GBRT. This performance is, however, largely a reflection of these measures' sensitivity to outliers. When we remove the 50 best and worst predictions of each model from their evaluations (1% of the test data), the GBRT outperform the SM models in PA-R² by ≈ 0.1 (0.76 vs. 0.66) when predicting 10 years into the future.

Figure 4: MAPE for paper citation prediction with 95% confidence intervals; these intervals are very narrow. As with author h-index prediction, the GBRT outperform all other models; averaging over years, the GBRT obtain an error of 0.192 in comparison to 0.29 for SMs and 0.204 for RPPNets.

5 Factors Contributing To Prediction

One notable omission from our analyses up to this point has been a discussion of the effect of author career length and paper age on predictive performance. In particular, one may expect that the citation rate of a paper stabilizes with age and thus that citation counts of older papers become easier to predict. Similarly, an author with a well-established career should have a more stable h-index than that of a relative newcomer. To address this question, Figure 5 displays the prediction error for every author and paper in our test sets when stratifying by age.3 Notice that while the errors in our predictions seem to be concentrated near zero, suggesting a lack of bias, we observe substantially larger variability in the accuracy of our predictions for younger authors and papers. This exactly fits our above intuition that older authors and papers are more predictable. Figure 5 also displays the MAPE for authors and papers in each of the different age groups. It is interesting to note that, for both authors and papers, the predictability of citations and h-indices, as measured by the MAPE, increases rapidly in the first few years and then begins to plateau.

3 Recall that we previously restricted our data set to only those authors with an h-index of ≥ 4 and with a career length between 5-12 years by 2005. To form these new predictions we fit a GBRT model in which we allowed authors of any age (but still required an h-index of ≥ 4).

Figure 5: For every author and paper in our test sets, we plot the percentage error of our GBRT predictions after 10 years. We have separated authors and papers into groups based on their respective ages at the end of 2005; that is, papers published in 2004 and 2005 would, respectively, be considered to have ages 2 and 1 at the end of 2005. Author career ages are based off of the dates of their first publications. As many points overlap in these plots we have, within each age group, colored the points using a kernel density estimate; dark red corresponds to many overlapping points while dark blue corresponds to few overlapping points. Above and below each group we have also included the MAPE when restricting to the papers and authors within that group; for example, the MAPE among authors with a career length of 5 is 24%.

Notice, for instance, that authors with a 25-year-old career by 2005 have a MAPE of 0.13, which is only four percentage points less than that of authors with a 10-year-old career. This suggests that there is an inherent variability in citation counts and h-indices that cannot be explained by our model even in the best case, which motivates future work developing features beyond those present in the citation history.

Beyond understanding the variability in our predictions, we are also interested in which features contributed most to our predictions. In order to quantify this feature importance, we measure the dependence between our features and the observed scientific impact measure using the t* statistic, a non-parametric measure of correlation [3, 24]. We use t* as it captures any dependence structure between two variables, unlike, for example, Pearson correlation, which only captures linear trends. We focus only on the features contributing to the h-index predictions (see Fig. 6), since the results are very similar for paper citation prediction. Perhaps surprisingly, the relative importance of several features changes when considering predictions at different forecasting horizons. For instance, when predicting author h-index in 2006, the best feature is clearly the author's h-index in 2005. But when predicting the h-index in 2015, the h-index in 2005 is less important than the number of papers published in the years 2004-2005 (Fig. 6). One might expect that the papers an author publishes today will take several years to accumulate enough citations to influence the author's h-index; but once sufficiently many years have passed, an author's h-index is strongly determined by those papers. These results mirror the trends seen by Acuna et al. [1].

Figure 6: t* between feature values and observed h-indices in the years 2006-2015; larger values mean more dependent. All features use data available in 2005. The features are Hind. (h-index), Cites (total citations), Ave. Cites (average citations per year), Cites '05 (citations in 2005), Papers '04-'05 (papers published between 2004-2005), Papers (total number of papers published), and PageRank (author PageRank in coauthor network). We see that the importance of authors' h-indices in 2005 decreases while the importance of several other predictors, e.g. the number of papers an author published in 2004-2005, increases.

6 Discussion And Future Work

While the above results suggest that scientific impact prediction is possible, even over a ten-year horizon, we should stress that our metrics assess average trends and do not apply to each author and paper uniformly. Indeed, as we discussed in Section 5, factors such as author and paper age play an important role in determining the accuracy of our predictions; we observed the intuitive result that the variability in our predictions decreases with the age of papers and the length of researchers' careers. Somewhat surprisingly, the variability in our predictions does not appear to tend to zero for very old papers and authors.

This suggests that there is still room for new features and modeling techniques to improve upon our predictive performance in future work. We suspect, for instance, that features describing the topic of a paper may yield substantial predictive gains, especially for relatively young authors and papers. The need for such features is highlighted by the work of Newman [14], who showed that the preferential attachment model predicts a substantial first-mover advantage for those publishing in a new area. Beyond developing new features, we also expect that gains can be made by more directly modeling citation events; while the RPP and RPPNet models bring us part of the way, it is clear that they do not capture all of the underlying dynamics. Moreover, we still lack a predictive probabilistic model for author h-indices and citations. Finally, as it may soon be the case that decisions are made on the basis of impact predictions, it is no longer enough to produce a single-number prediction of impact. Instead, future techniques must be able to reliably assess the confidence in their predictions; indeed, this is one reason why combining machine learning techniques with probabilistic modeling approaches, such as the RPPNet, is so appealing.

Of course, any attempt to summarize scientific impact is limited. Existing measures, such as citation counts and h-indices, do not provide a comprehensive assessment of a paper or author. Instead, they provide a signal that helps to inform our understanding of a paper's or author's impact. In this way, our results suggest that scientific impact predictions may be a useful tool, among many, in guiding our focus to where it will be most fruitful.

A Modifications To The Reinforced Poisson Process Model

Recall that the reinforced Poisson process models $C_p(t)$, the number of citations a paper $p$ has at time $t > 0$ after its publication, as a Poisson process with rate function

$$r_p(t) = \lambda_p \, f(t \mid \theta_p) \, (C_p(t) + m)$$

where $\lambda_p$, $\theta_p$, and $m$ are parameters and $f$ is a temporal decay function [20]. While the $\lambda_p$ and $\theta_p$ parameters are inferred by maximum likelihood, [20] found that performance depends only weakly on the value of $m$; we thus follow this prior work and simply fix $m = 10$. To help rein in the problem of overfitting, one can place a Gamma(α, β) prior on the $\lambda_p$ parameters. Recall from Section 4 that we make three modifications to the above model; these modifications are described below.

A.1 Discrete Time

As a natural extension of the continuous-time RPP model, we model $C_p(n)$, the number of citations a paper $p$ has $n \geq 1$ years after its publication, as a discrete-time Poisson process with rate function

$$r_p(n) = \lambda_p \,(C_p(n-1) + m) \int_{n-1}^{n} f(t \mid \theta_p)\, dt.$$

Note that we integrate above because $r_p(n)$ represents the mean for the entire time period between $n-1$ and $n$. Following prior work we let

$$f(t \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma t} \exp\!\left(-\frac{1}{2\sigma^2}(\ln t - \mu)^2\right)$$

so that $f$ is a log-normal probability density function for $\theta = (\mu, \sigma) \in \mathbb{R} \times \mathbb{R}_{>0}$. To avoid degenerate cases we will always assume that $\mu \geq -1$ and $\sigma \geq 0.5$. Let $\Phi_\theta$ be the log-normal cumulative distribution function corresponding to $f(t \mid \theta)$ and, for all $i \geq 1$, let $\Delta\Phi_\theta(i) = \Phi_\theta(i) - \Phi_\theta(i-1)$. Using this notation we may rewrite the rate function as $r_p(n) = \lambda_p (C_p(n-1) + m)\, \Delta\Phi_\theta(n)$.

The above definition gives us, for all $n \geq 1$, the following self-reinforcing relationship:

$$C_p(n) - C_p(n-1) \mid C_p(n-1) \sim \text{Poisson}\big(\lambda_p (C_p(n-1) + m)\, \Delta\Phi_\theta(n)\big)$$

where $C_p(0) = 0$. For the sake of simplifying notation we will, for the moment, drop the $p$ subscripts.

Suppose we observe per-year citation counts $C(1) - C(0) = d_1, \dots, C(n) - C(n-1) = d_n$ and wish to perform maximum likelihood estimation of $\lambda, \theta$. To do this we will first need an explicit form of the likelihood function. Writing $C_{i-1}$ as shorthand for $C(i-1) + m$, we have by definition,

$$P\big(C(i) - C(i-1) = d_i \mid d_1, \dots, d_{i-1}\big) = e^{-\lambda C_{i-1} \Delta\Phi_\theta(i)}\, \frac{\lambda^{d_i}\, \Delta\Phi_\theta(i)^{d_i}\, C_{i-1}^{d_i}}{d_i!}.$$

[Figure: For every author and paper in our test sets, we plot the percentage error of our GBRT predictions after 10 years. Authors and papers are separated into groups based on their respective ages at the end of 2005; for example, papers published in 2004 and 2005 would, respectively, be considered to have ages 2 and 1 at the end of 2005. Author career ages are based off of the dates of their first publications. As many points overlap in these plots, points within each age group are colored using a kernel density estimate: dark red corresponds to many overlapping points while dark blue corresponds to few. Above and below each group we also include the MAPE when restricting to the papers and authors within that group; for example, the MAPE among authors with a career length of 5 is 24%.]

From the above it is easy to see that the likelihood of the observations is simply

$$\mathcal{L}(\lambda, \theta \mid d_1, \dots, d_n) = \prod_{i=1}^{n} e^{-\lambda C_{i-1} \Delta\Phi_\theta(i)}\, \frac{\lambda^{d_i}\, \Delta\Phi_\theta(i)^{d_i}\, C_{i-1}^{d_i}}{d_i!} = \exp\Big(-\lambda \sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)\Big)\, \lambda^{\sum_{i=1}^{n} d_i} \prod_{i=1}^{n} \frac{\Delta\Phi_\theta(i)^{d_i}\, C_{i-1}^{d_i}}{d_i!}.$$

Letting $N = \sum_{i=1}^{n} d_i$, we may estimate $\lambda, \theta$ directly by maximum likelihood estimation. In particular, the log-likelihood has the form

$$L(\lambda, \theta) = \log \mathcal{L}(\lambda, \theta \mid d_1, \dots, d_n) = -\lambda \sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i) + N \log \lambda + \sum_{i=1}^{n} d_i \log\big(\Delta\Phi_\theta(i)\big) + \sum_{i=1}^{n} \log\Big(\frac{C_{i-1}^{d_i}}{d_i!}\Big);$$

differentiating we find that

$$0 = \frac{\partial}{\partial \lambda} L(\lambda, \theta) \iff \lambda = \frac{N}{\sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)}.$$

Plugging this optimum value,

$$\lambda^* = \frac{N}{\sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)},$$

into $L$ gives

$$L(\lambda^*, \theta) = \sum_{i=1}^{n} d_i \log\big(\Delta\Phi_\theta(i)\big) - N \log\Big(\sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)\Big) + \mathrm{const}.$$

We now differentiate the above in $\mu, \sigma$. Let $\phi_\theta$ be the density corresponding to a $\mathrm{Normal}(\mu, \sigma^2)$ distribution. We then have that

$$\frac{\partial}{\partial \mu} L(\lambda^*, \theta) = \sum_{i=1}^{n} \Big(\lambda^* C_{i-1} - \frac{d_i}{\Delta\Phi_\theta(i)}\Big)\big(\phi_\theta(\log(i)) - \phi_\theta(\log(i-1))\big),$$

$$\frac{\partial}{\partial \sigma} L(\lambda^*, \theta) = \sum_{i=1}^{n} \Big(\lambda^* C_{i-1} - \frac{d_i}{\Delta\Phi_\theta(i)}\Big)\Big(\frac{\log(i) - \mu}{\sigma}\, \phi_\theta(\log(i)) - \frac{\log(i-1) - \mu}{\sigma}\, \phi_\theta(\log(i-1))\Big),$$

where we adopt the convention $\phi_\theta(\log(0)) = 0$ for the $i = 1$ terms.

Using the above we can find $\mu, \sigma$ with derivative-based optimization approaches. Using the discovered $\theta = (\mu, \sigma)$ we may predict into the future via the mean of the Poisson process given the prior observations; writing this mean, $t$ years after year $n$, as $c_n(t \mid \lambda, \theta)$, we may compute recursively that, for $t \geq 1$,

$$c_n(t \mid \lambda, \theta) = (C(n) + m) \prod_{i=1}^{t} \big(1 + \lambda\, \Delta\Phi_\theta(n+i)\big) - m.$$
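To make Section A.1 concrete, the following sketch fits $\lambda$ by its closed form for a fixed $\theta = (\mu, \sigma)$ and predicts future counts via the recursion above. The per-year counts and parameter values are synthetic stand-ins, not data from the paper; `delta_phi` and the $C_{i-1} = C(i-1) + m$ shorthand follow the definitions above.

```python
import numpy as np
from scipy.stats import norm

def delta_phi(i, mu, sigma):
    hi = norm.cdf(np.log(i), loc=mu, scale=sigma)
    lo = norm.cdf(np.log(i - 1), loc=mu, scale=sigma) if i > 1 else 0.0
    return hi - lo

def lambda_mle(d, mu, sigma, m=10):
    """Closed-form MLE: lambda* = N / sum_i C_{i-1} * DeltaPhi_theta(i),
    with C_{i-1} = C(i - 1) + m and N = sum_i d_i."""
    cum = np.concatenate(([0], np.cumsum(d)))  # C(0), ..., C(n)
    N = cum[-1]
    denom = sum((cum[i - 1] + m) * delta_phi(i, mu, sigma)
                for i in range(1, len(d) + 1))
    return N / denom

def log_lik(lam, d, mu, sigma, m=10):
    """Log-likelihood L(lambda, theta), dropping the -log(d_i!) constants."""
    cum = np.concatenate(([0], np.cumsum(d)))
    ll = 0.0
    for i in range(1, len(d) + 1):
        mean_i = lam * (cum[i - 1] + m) * delta_phi(i, mu, sigma)
        ll += -mean_i + d[i - 1] * np.log(mean_i)
    return ll

def predict(c_n, n, t, lam, mu, sigma, m=10):
    """c_n(t): expected count t years past year n, via the recursion
    c <- (c + m) * (1 + lambda * DeltaPhi_theta(n + i)) - m."""
    c = c_n
    for i in range(1, t + 1):
        c = (c + m) * (1 + lam * delta_phi(n + i, mu, sigma)) - m
    return c
```

At $t = 0$ the recursion returns $C(n)$ itself, and unrolling it reproduces the closed-form product above exactly.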

A.2 Prior Extensions And Regularization

We now place a Gamma(α, β) prior on λ and compute the marginalized likelihood. To simplify notation we will let

$$A = \prod_{i=1}^{n} \frac{\Delta\Phi_\theta(i)^{d_i}\, C_{i-1}^{d_i}}{d_i!}.$$

As in the previous section, we suppose we have observed C(1) − C(0) = d 1 , ..., C(n) − C(n − 1) = d n for which we may write the marginalized likelihood

$$\mathcal{L}(\theta, \alpha, \beta \mid d_1, \dots, d_n) = \int_0^\infty \mathcal{L}(\lambda, \theta \mid d_1, \dots, d_n)\, \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}\, d\lambda = A\, \frac{\beta^\alpha}{\Gamma(\alpha)}\, \frac{\Gamma(\alpha + N)}{\big(\beta + \sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)\big)^{\alpha + N}}.$$

From the above we also immediately observe that the posterior distribution of $\lambda$ given the observations is

$$\lambda \mid d_1, \dots, d_n \sim \mathrm{Gamma}\Big(\alpha + N,\; \beta + \sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)\Big)$$

which allows us to easily compute the posterior mean,

$$\bar\lambda = \mathbb{E}[\lambda \mid d_1, \dots, d_n] = \frac{\alpha + N}{\beta + \sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)}.$$

Now taking the log of the marginalized likelihood gives

$$L(\theta, \alpha, \beta) = B + \sum_{i=1}^{n} d_i \log\big(\Delta\Phi_\theta(i)\big) + \alpha \log \beta - \log \Gamma(\alpha) + \log \Gamma(\alpha + N) - (\alpha + N) \log\Big(\beta + \sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)\Big)$$

where

$$B = \sum_{i=1}^{n} \log\Big(\frac{C_{i-1}^{d_i}}{d_i!}\Big)$$

is constant with respect to the parameters $\theta, \alpha, \beta$. Now letting $\psi$ be the digamma function we have that

$$\frac{\partial}{\partial \alpha} L(\theta, \alpha, \beta) = \log \beta - \psi(\alpha) + \psi(\alpha + N) - \log\Big(\beta + \sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)\Big),$$

$$\frac{\partial}{\partial \beta} L(\theta, \alpha, \beta) = \frac{\alpha}{\beta} - \frac{\alpha + N}{\beta + \sum_{i=1}^{n} C_{i-1} \Delta\Phi_\theta(i)} = \frac{\alpha}{\beta} - \bar\lambda,$$

$$\frac{\partial}{\partial \mu} L(\theta, \alpha, \beta) = \sum_{i=1}^{n} \Big(\bar\lambda C_{i-1} - \frac{d_i}{\Delta\Phi_\theta(i)}\Big)\big(\phi_\theta(\log(i)) - \phi_\theta(\log(i-1))\big),$$

$$\frac{\partial}{\partial \sigma} L(\theta, \alpha, \beta) = \sum_{i=1}^{n} \Big(\bar\lambda C_{i-1} - \frac{d_i}{\Delta\Phi_\theta(i)}\Big)\Big(\frac{\log(i) - \mu}{\sigma}\, \phi_\theta(\log(i)) - \frac{\log(i-1) - \mu}{\sigma}\, \phi_\theta(\log(i-1))\Big).$$

It is interesting to note that, when comparing the derivatives in $\mu, \sigma$ to those computed in the previous section, we have simply replaced the optimal $\lambda^*$ in that setting with the posterior mean $\bar\lambda$.
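The conjugate Gamma update above amounts to a few lines of code. A minimal sketch, with hypothetical prior values and counts, and `delta_phi` as defined earlier:

```python
import numpy as np
from scipy.stats import norm

def delta_phi(i, mu, sigma):
    hi = norm.cdf(np.log(i), loc=mu, scale=sigma)
    lo = norm.cdf(np.log(i - 1), loc=mu, scale=sigma) if i > 1 else 0.0
    return hi - lo

def lambda_posterior(alpha, beta, d, mu, sigma, m=10):
    """Conjugate update: lambda | d ~ Gamma(alpha + N, beta + S), where
    N = sum_i d_i and S = sum_i (C(i - 1) + m) * DeltaPhi_theta(i).
    Returns the posterior shape, rate, and mean (alpha + N) / (beta + S)."""
    cum = np.concatenate(([0], np.cumsum(d)))
    N = cum[-1]
    S = sum((cum[i - 1] + m) * delta_phi(i, mu, sigma)
            for i in range(1, len(d) + 1))
    return alpha + N, beta + S, (alpha + N) / (beta + S)
```

As the observed counts grow, the posterior mean approaches the MLE $N / S$, so the prior matters most for young papers with few observed years.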

When performing maximum likelihood inference in the last section it was sufficient to consider each paper individually as no two papers shared any common parameters. In our current setting, however, all papers share the same $\alpha, \beta$ parameters and hence we will need to perform maximum likelihood estimation for all papers simultaneously. To this end let $P$ be a collection of papers. For each $p \in P$ suppose we have observations

$$C_p(1) - C_p(0) = d_1^p,\; C_p(2) - C_p(1) = d_2^p,\; \dots,\; C_p(n_p) - C_p(n_p - 1) = d_{n_p}^p,$$

and let $L_p(\theta_p, \alpha, \beta)$ be the log-likelihood for the individual paper $p$. Then the log-likelihood of all papers simultaneously is simply the sum

$$L_P(\alpha, \beta, \{\theta_p\}_{p \in P}) = \sum_{p \in P} L_p(\theta_p, \alpha, \beta).$$

It is easy to compute gradients of $L_P$ using the derivatives we have computed for the individual $L_p$, and thus maximum likelihood inference for the parameters $\alpha, \beta$, and $\{\theta_p\}_{p \in P}$ can be performed using any gradient-based optimizer. Now suppose that $\alpha, \beta, \theta_p$ are fixed. Obtaining future predictions is straightforward by using posterior means. By iterated conditioning one may easily check that

$$c_{p, n_p}(t \mid \alpha, \beta, \theta_p) = \mathbb{E}\big[c_{n_p}(t \mid \lambda_p, \theta_p) \mid C_p(1), \dots, C_p(n_p)\big].$$

Now for fixed $\lambda_p$ we can compute $c_{n_p}(t \mid \lambda_p, \theta_p)$ using the results in the previous section. Hence we can approximate the above expectation to arbitrary precision using a Monte-Carlo strategy. Namely, we draw samples from the posterior distribution of $\lambda_p$, $\mathrm{Gamma}\big(\alpha + N_p,\; \beta + \sum_{i=1}^{n_p} C_{i-1} \Delta\Phi_{\theta_p}(i)\big)$, compute $c_{n_p}(t \mid \lambda_p, \theta_p)$ for each of these samples, and then average the results.
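The Monte-Carlo prediction step can be sketched as follows. The posterior parameters here are hypothetical placeholders, `delta_phi` and the recursive `predict` follow the earlier definitions, and note that NumPy's Gamma sampler takes a scale, i.e. one over the rate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def delta_phi(i, mu, sigma):
    hi = norm.cdf(np.log(i), loc=mu, scale=sigma)
    lo = norm.cdf(np.log(i - 1), loc=mu, scale=sigma) if i > 1 else 0.0
    return hi - lo

def predict(c_n, n, t, lam, mu, sigma, m=10):
    """Fixed-lambda expected count t years past year n (recursion from A.1)."""
    c = c_n
    for i in range(1, t + 1):
        c = (c + m) * (1 + lam * delta_phi(n + i, mu, sigma)) - m
    return c

def predict_mc(c_n, n, t, a_post, b_post, mu, sigma, m=10, n_samples=2000):
    """Average the fixed-lambda prediction over draws from the
    Gamma(a_post, b_post) posterior of lambda (rate parameterization)."""
    lams = rng.gamma(shape=a_post, scale=1.0 / b_post, size=n_samples)
    return float(np.mean([predict(c_n, n, t, lam, mu, sigma, m) for lam in lams]))
```

Because `predict` is monotone in each sampled $\lambda_p$, the Monte-Carlo average always lies at or above the current count $C_p(n_p)$.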

As we have found that simply using maximum likelihood inference results in overfitting, we consider adding a regularization penalty of the form $-\gamma (\alpha/\beta)^2$ to the optimization procedure, where $\gamma \geq 0$ is a hyperparameter. In particular, rather than attempting to maximize $L_P$ we maximize

$$L^*_P(\alpha, \beta, \{\theta_p\}_{p \in P}) = \frac{1}{|P|} \sum_{p \in P} L_p(\theta_p, \alpha, \beta) - \gamma \Big(\frac{\alpha}{\beta}\Big)^2.$$

Here $\alpha/\beta$ is equal to the mean of the prior distribution $\mathrm{Gamma}(\alpha, \beta)$, and so this penalty discourages the prior mean of $\lambda$ from being too large. In practice, $\gamma$ is chosen by cross validation.
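As a small sketch, the regularized objective simply averages the per-paper log-likelihoods and subtracts the squared prior mean; the per-paper terms are hypothetical inputs here (computing them is sketched earlier):

```python
import numpy as np

def penalized_objective(per_paper_ll, alpha, beta, gamma):
    """L*_P: mean of the L_p values minus gamma * (alpha / beta)^2,
    the squared mean of the Gamma(alpha, beta) prior on lambda."""
    return float(np.mean(per_paper_ll) - gamma * (alpha / beta) ** 2)
```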

A.3 Neural Network Structure

In order to allow more subtle information to influence our prior distribution we also consider letting $\alpha, \beta$ be learned functions of features extracted from each paper. For each paper $p$ let $x_p \in \mathbb{R}^k$ be a collection of features corresponding to the paper. Then we consider $\alpha, \beta$ as functions of the $x_p$, written as $\alpha(x_p), \beta(x_p)$, so that $\lambda_p \sim \mathrm{Gamma}(\alpha(x_p), \beta(x_p))$. We learn the functions $\alpha, \beta$ as the output of a single-layer, fully connected, neural network with softplus non-linearities, by maximizing the penalized log-likelihood

$$L^{**}_P(\alpha, \beta, \{\theta_p\}_{p \in P}) = \frac{1}{|P|} \sum_{p \in P} \Big[ L_p\big(\theta_p, \alpha(x_p), \beta(x_p)\big) - \gamma \Big(\frac{\alpha(x_p)}{\beta(x_p)}\Big)^2 \Big].$$

As was shown empirically in the main paper, allowing α, β to depend on x p can improve performance.
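A minimal sketch of the feature-dependent prior: one fully connected layer with softplus outputs maps features $x_p$ to $(\alpha(x_p), \beta(x_p))$, with softplus keeping both positive. The weights below are random stand-ins for values that would be learned by maximizing the penalized log-likelihood, and the initialization scale is an assumption, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(z):
    """Smooth positive non-linearity: log(1 + exp(z))."""
    return np.log1p(np.exp(z))

class PriorNet:
    """Single fully connected layer producing alpha(x), beta(x) > 0."""
    def __init__(self, k):
        self.W = rng.normal(scale=0.1, size=(2, k))  # learned in practice
        self.b = np.zeros(2)

    def __call__(self, x):
        alpha, beta = softplus(self.W @ x + self.b)
        return alpha, beta
```

Each paper then receives its own prior $\lambda_p \sim \mathrm{Gamma}(\alpha(x_p), \beta(x_p))$, with the network weights shared across all papers.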

The data set, along with code to reproduce our results, can be found at https://github.com/Lucaweihs/impact-prediction.

A listing of the paper features can be found, along with the code and data, at https://github.com/Lucaweihs/impact-prediction.