Citation Count Analysis for Papers with Preprints


Abstract

We explore the degree to which papers prepublished on arXiv garner more citations, in an attempt to paint a sharper picture of fairness issues related to prepublishing. We estimate a paper's citation count using a negative binomial generalized linear model (GLM) with a binary covariate indicating whether the paper was prepublished, controlling for author influence (via the authors' h-index at the time of paper writing), publication venue, and the overall time the paper has been available on arXiv. Our analysis only includes papers that were eventually accepted for publication at top-tier CS conferences and were posted on arXiv either before or after the acceptance notification. We observe that papers submitted to arXiv before acceptance have, on average, 65% more citations in the following year than papers submitted after. We note that this finding is not causal, and discuss possible next steps.

1 Introduction

Preprint servers like arXiv enable researchers to self-distribute scientific paper drafts with minimal moderation. While some of these papers are never published elsewhere, many are also accepted for publication at academic venues after a double-blind peer-review process. Authors of these papers are faced with the decision to distribute their papers on arXiv before or after acceptance at their target publication venues. We refer to those papers that are posted on arXiv before acceptance as prepublished.

With the increasing popularity of prepublishing computer science (CS) papers on arXiv [Sutton and Gong, 2017], this decision has been the subject of considerable debate in the CS research community (among others). 1 Some researchers abstain from posting their work on arXiv until it has been accepted for publication at the target venue. One reason is to preserve author anonymity during the double-blind review process and mitigate reviewer bias favoring well-known authors and affiliations [e.g., Snodgrass, 2006, Tomkins et al., 2017]. Other reasons may include fear of circulating incorrect results or conclusions, or fear of retaliation by a reviewer in conflict with the author.

On the other side of the debate, some researchers prefer publishing drafts of their work on arXiv before it is accepted for publication. One reason is to allow other researchers to build on their work which can expedite scientific developments. Another reason is to allow researchers aside from the official reviewers at the target venue to provide feedback, which can be used to further improve the paper even before it is published (i.e., before the camera-ready due date). Authors also may use arXiv for "flag-planting", i.e., claiming a research contribution before getting scooped by other researchers who may be doing similar work.

Quoting a recent blog post by Yoav Goldberg: "[T]here is also a rising trend of people using arXiv for flag-planting, and to circumvent the peer-review process. This is especially true for work coming from 'strong' groups. Currently, there is practically no downside to posting your (often very preliminary, often incomplete) work to arXiv, only potential benefits." 2

In this work, our goal is to quantify some of these perceived benefits of posting a paper on arXiv before it is submitted for publication. While it may be hard to study reviewer bias for prepublished papers (since the review results are not made public), we can observe the number of times a paper is cited, which is often used to measure a paper's impact. 3 We focus on papers with an arXiv-published draft which have also been accepted for publication at a top-tier venue. Specifically, we are interested in studying whether there are significant differences in citation counts between papers that were posted on arXiv before vs. after they were accepted at the venue at which they were eventually published.

1 See Marti Hearst's and Kelly Cruz's thoughtful discussions of this topic at https://acl2017.wordpress.com/2017/02/19/arxiv-and-the-future-of-double-blind-conference-reviewing/ and http://www.astrobetter.com/blog/2011/12/12/to-post-or-not-to-post-publishing-to-the-arxiv-before-acceptance/.

2 https://medium.com/@yoav.goldberg/an-adversarial-review-of-adversarial-generation-of-natural-language-409ac3378bd7

3 While most peer reviews are not publicly available, a notable exception is the International Conference on Learning Representations (ICLR), which makes all reviews available and also allows any researcher to comment on papers under submission using the openreview.net platform.

To motivate this study, consider the following scenario: two researchers, R1 and R2, independently developed an outstanding method around the same time. R1 decides to prepublish her draft on arXiv, while R2 decides to wait until the paper is accepted for publication at the target venue. Naturally, the earlier exposure of the research community to R1's paper may result in researchers attributing most of the credit to R1 rather than R2, even though both papers are eventually published at the same venue. We may consequently observe a higher number of citations for R1's work than for R2's. This is especially concerning when metrics derived from citation counts (e.g., h-index) play a significant role in hiring and promotion decisions at universities and research labs (despite the controversy surrounding citation counts as a measure of a paper's impact).

In this draft, we explore the degree to which prepublished papers garner more citations, in an attempt to paint a sharper picture of arXiv-related fairness issues. We use a negative binomial generalized linear model (GLM) to regress a paper's number of citations onto a binary indicator of arXiv prepublishing, controlling for author influence (via the authors' h-index at the time of paper writing), publication venue, and the overall time the paper has been available on arXiv. We analyze papers that were eventually accepted for publication at top-tier CS conferences and were posted on arXiv either before or after the acceptance notification. We observe a significant positive association between citation count and prepublishing on arXiv.

Our results are consistent with previous work [e.g., Larivière et al., 2014], which found that papers posted on arXiv have a higher citation rate (among all papers published in Web of Science). 4 Also related is Moed [2007], who studied the higher citation rate of arXiv papers and found a strong quality bias and early view effect, but no effect due to the open access nature of arXiv. To the best of our knowledge, this is the first study to analyze the prepublication effect by distinguishing between papers posted on arXiv before vs. after conference acceptance. We also control for important factors such as author popularity and venue, which are known to affect citation rates.

2 Data

Here, we describe the data we used for this analysis in some detail.

2.1 Venues

All papers included in our study were eventually published at one of the following top-tier computer science conferences, which have a significant portion of their papers on arXiv: AAAI, ACL, CVPR, ECCV, EMNLP, FOCS, HLT-NAACL, ICCV, ICML, ICRA, IJCAI, INFOCOM, KDD, NIPS, SODA, and WWW. We include papers published between 2007 and 2016, so that we can count the number of citations they receive during the year following their publication.

To obtain this data, we queried Semantic Scholar for all papers published at each of these conferences. We then looked up each paper in the arXiv metadata dump contributed by Sutton and Gong [2017], 5 and obtained arXiv submission dates for each paper that was posted. For papers with multiple versions on arXiv, we record the date of the earliest submission; papers that were never posted to arXiv were excluded.
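As a concrete illustration, the earliest-submission rule could be implemented along these lines; this is a minimal sketch, assuming each paper's arXiv record carries a hypothetical list of per-version records with a parsed "created" date string (the actual metadata schema may differ):

    from datetime import datetime

    def earliest_arxiv_date(versions):
        # `versions` is a hypothetical list of per-version records from the
        # metadata dump, each with a "created" date string such as "2015-03-02".
        dates = [datetime.strptime(v["created"], "%Y-%m-%d").date() for v in versions]
        return min(dates)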

See Table 1 for a per-conference breakdown of the 4392 papers in our dataset. We used the Calls for Papers Wiki 6 to obtain paper submission deadlines for most of the conference and year combinations in our dataset; the rest were obtained via web search.

Table 1: Number of papers from each conference in our dataset. (Table not extracted; please refer to the original document.)

2.2 Citations

The response variable we would like to model is the number of times a paper is cited in the calendar year following the conference, which we label "all citations." 7 Figure 1a shows a histogram of citation counts with buckets of size 5; the vast majority of papers in this population receive fewer than 20 citations in the calendar year following the conference.

Figure 1: (a) Histogram of citation counts in the year following the conference (bucket size 5). (b) Histogram of arXiv submission dates relative to the conference submission deadline. (Figure not extracted; please refer to the original document.)

We also experiment with a modified definition of the response variable meant to count only meaningful citations (e.g., omitting self-citations), which we label "influential citations" to distinguish it from "all citations." Our definition of influential citations is based on Valenzuela et al. [2015] and only counts citations with no overlap between the author lists of the citing and cited papers. In an influential citation, the cited paper is referenced three or more times in the narrative of the citing paper, is not consistently combined with other references, is mentioned in the context of experimental results, or is explicitly mentioned as a foundation for the citing paper.
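For illustration, these criteria could be expressed as a predicate roughly as follows; this is a sketch under our reading of the definition, and all attribute names are hypothetical rather than part of any published implementation:

    def is_influential(citation):
        # No overlap between the citing and cited author lists.
        if citation.shares_authors:
            return False
        # Any one of the Valenzuela et al. [2015]-style signals suffices.
        return (citation.num_narrative_mentions >= 3
                or not citation.consistently_grouped_with_others
                or citation.mentioned_in_experimental_results
                or citation.cited_as_foundation)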

2.3 Author Influence

We suspect that well-known authors tend to garner more citations than less well-known authors. In order to control for this source of bias in our analysis, we model an observed variable representing the authors' influence. Given the paper in question, we first compute the h-index of each of its authors one year before it was published. We then take the maximum h-index among all the authors of a paper and use this single value as a per-paper summary of author influence. Let h(a, year) be the h-index of author a in the specified year. The author influence for paper p can then be written as:

$h_{\max}(p) = \max_{a \in \mathrm{authors}(p)} h(a, \mathrm{year}(p) - 1)$

Because the h-index has a non-linear relationship with citation counts, we model it as a categorical variable with ten buckets (deciles), each containing the same number of papers. The first bucket includes all papers with h_max(p) ≤ 6 and the last bucket includes all papers with h_max(p) ≥ 42.
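A minimal sketch of this bucketing, assuming a hypothetical pandas DataFrame df with one row per paper and a column holding each paper's per-author h-indices (computed one year before publication):

    import pandas as pd

    # Per-paper author influence: the maximum h-index among the authors.
    df["h_max"] = df["author_hindices"].apply(max)

    # Ten equal-frequency buckets (deciles) of h_max.
    df["max_hindex_decile"] = pd.qcut(df["h_max"], q=10, labels=False)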

2.4 Time Available on arXiv

Papers prepublished on arXiv before acceptance have had more time to gather citations than those posted to arXiv after acceptance, which may explain any differences in citation counts. To control for this factor, we compute the fraction of the year the paper has been available on arXiv. In particular, we measure the number of days between the first arXiv submission and the beginning of the calendar year in which we count citations of that paper, then divide by the number of days in the year, as illustrated by the following Python code:

    from datetime import datetime
    import numpy as np

    next_year_jan_1 = datetime(year=conf_year + 1, month=1, day=1).date()
    delta = next_year_jan_1 - arxiv_submission_date
    # Fraction of a year on arXiv before the citation-counting year, clamped at zero.
    frac_year_remaining = np.maximum(delta.days / 365, 0)

We clamp the difference (delta.days) at a minimum of zero because a paper may be put on arXiv for the first time long after it is officially published.

2.5 Submitted to arXiv Before Vs. After Acceptance

This variable is an indicator for whether the paper was posted to arXiv before or after it was accepted for publication. Ideally, we would like to observe whether the arXiv submission date is before or after the acceptance notification, but since exact acceptance dates were not available for all venues, we use a conservative estimate of 28 days after the submission deadline of the conference as our prepublishing threshold. Figure 1b contains a histogram showing the distribution of arXiv submission dates relative to the paper's target venue deadline.
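This threshold could be computed along the following lines (a minimal sketch, assuming both dates are available as datetime.date objects):

    from datetime import timedelta

    # Prepublished: on arXiv no later than 28 days after the conference deadline.
    threshold = conference_deadline + timedelta(days=28)
    submitted_before_deadline = arxiv_submission_date <= threshold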

2.6 Summary Of Variables

To summarize, we compute the following variables for each paper p:

• cites_1year - the number of papers that cited p and were published in the calendar year following the official publication of p (count-valued response variable).

• influential_cites_1year - the number of papers that cited p, were published in the calendar year following the official publication of p, and satisfied the 'influential' criteria (count-valued response variable).

• max_hindex_decile - the decile into which the maximum (across all authors) h-index of p falls (categorical feature - 10 values).

• submitted_before_deadline - whether p was posted to arXiv no later than 28 days after the conference submission deadline (binary feature).

• frac_year_remaining - the fraction of a year between the first arXiv submission of p and the start of the calendar year in which citations are counted, clamped below at zero (continuous feature).

• conf -the conference where p was published (categorical feature -16 values).

3 Analysis

Here, we describe how we model the variables discussed in the previous section and then analyze the results.

3.1 Model

Negative binomial GLMs are a common option for modeling count-valued response variables that exhibit overdispersion (i.e., when the variance of the variable exceeds its mean, thus deviating from the standard Poisson count model), which is typical of real-world data [Hilbe, 2007]. One can interpret the negative binomial distribution as a Poisson distribution whose Gamma-distributed mean has been marginalized out. The conditional mean model is expressed as:

$\mathbb{E}[y \mid x] = \exp\left(w_0 + \sum_i w_i x_i\right),$

where y is the response variable, x is the vector of covariates/features, and w_i is the learned weight of the i-th feature x_i. In our case, the response variable y is either cites_1year or influential_cites_1year. Within our feature vector x, our primary covariate of interest is submitted_before_deadline, while the other features are possible confounders that we want to control for. We use Python's statsmodels [Seabold and Perktold, 2010] to fit the following regression models (expressed in the standard formula mini-language from R that is also used in statsmodels):

    cites_1year ~ max_hindex_decile + frac_year_remaining + conf
    cites_1year ~ max_hindex_decile + frac_year_remaining + conf + submitted_before_deadline

The only difference between these two models is the presence of the submitted_before_deadline binary variable. We repeat this with influential_cites_1year as the response variable.
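A minimal sketch of how these fits could be reproduced with statsmodels, assuming a hypothetical pandas DataFrame df with one row per paper and the columns listed in Section 2.6:

    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Baseline model without the prepublishing indicator.
    base = smf.glm("cites_1year ~ max_hindex_decile + frac_year_remaining + conf",
                   data=df, family=sm.families.NegativeBinomial()).fit()

    # Full model with the prepublishing indicator.
    full = smf.glm("cites_1year ~ max_hindex_decile + frac_year_remaining + conf"
                   " + submitted_before_deadline",
                   data=df, family=sm.families.NegativeBinomial()).fit()
    print(full.summary())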

3.2 Results

We conducted a likelihood ratio test on the two models and obtained a tiny p-value: 6.27e−29. This means that the second model has a significantly higher likelihood, indicating that it better fits the data. The coefficients of the full model that includes submitted_before_deadline are shown below. Due to the exp term in the regression function, these coefficients have a multiplicative effect, rather than an additive effect as in linear regression. We can thus look at the 0.5029 coefficient of submitted_before_deadline (the coef column) and interpret its effect as multiplying the number of citations by exp(0.5029) = 1.65. In other words, the fitted regression model estimates that papers submitted to arXiv before acceptance tend, on average, to have 65% more citations in the following year than papers submitted after.

[Negative binomial GLM summary table not extracted; please refer to the original document.]
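For completeness, the likelihood ratio test reported above could be reproduced roughly as follows (a sketch using the hypothetical base and full fits from the earlier snippet):

    from scipy import stats

    # Twice the log-likelihood gap between the nested models.
    lr_stat = 2 * (full.llf - base.llf)
    # One extra parameter: the submitted_before_deadline coefficient.
    p_value = stats.chi2.sf(lr_stat, df=full.df_model - base.df_model)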

The difference is even more pronounced when we look at the number of influential citations. 8 Papers submitted to arXiv before acceptance tend, on average, to have 75% more influential citations in the following year than papers submitted after. We emphasize that we cannot conclude that prepublishing on arXiv has a causal effect on citation counts, since this result is not based on a randomized controlled experiment.

Note that in this framework, each categorical variable with k values has only k − 1 coefficients. Each coefficient can be interpreted as being relative to a baseline value, determined by the left-out category. For example, the baseline category for max_hindex_decile is [0, 6], and the coefficients for the other nine deciles capture how many more citations one can expect with a higher h-index (in an associative, not causal, sense). In particular, an h-index between 42 and 99 is associated, on average, with more than double the number of next-year citations compared to an h-index between 0 and 6. These coefficients increase nearly monotonically across h-index deciles, which is consistent with our intuition that more famous authors tend to get more citations. Similarly, the baseline conference is AAAI.
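Under the same assumptions as the earlier snippets, these multiplicative effects could be read off the fitted model as follows (exact term names depend on how the categorical variables are encoded by the formula interface):

    import numpy as np

    # Multiplicative effect of each covariate relative to its baseline category.
    effects = np.exp(full.params)
    print(effects)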

The results suggest that frac_year_remaining plays a minor role, with zero inside its 95% confidence interval (the last two columns of the summary table). This is somewhat surprising, since we expected papers that have been on arXiv for a larger fraction of a given year to have more citations in the following year.

4 Conclusion

Our exploratory analysis shows that publishing a CS paper on arXiv before (as opposed to after) it is accepted for publication at a top-tier target venue is associated with 65% more citations in the calendar year following the conference. Although we take into account other factors that can influence the number of citations (namely, author influence, publication venue, and time available on arXiv), there may be other confounding factors that we did not include in our study (e.g., author affiliation, paper quality). We invite researchers interested in this analysis to explore the effect of factors we have not included in the model, and invite conference chairs to conduct randomized controlled experiments in which authors submitting their drafts to the conference agree to prepublish their drafts on arXiv if randomly selected.

We note that identifying the potential unfair advantage given to prepublished papers may not give researchers a sufficiently compelling reason to delay posting their paper drafts on arXiv until the review process is complete. Instead, we encourage the community to adopt anonymous prepublished submissions (with pre-specified time limits on the anonymity) on arXiv and related platforms, similar to how the OpenReview platform implemented the peer-review process for ICLR 2018. 9

4 http://webofknowledge.com/

5 https://github.com/casutton/cs-arxiv-popularity-code

6 http://wikicfp.com

7 Alternatively, we could have simply counted all citations a paper received, but this would require making stronger assumptions about how the number of citations changes over the years, which is not the focus of this study.

8 We omit detailed results for influential citations for brevity.

9 https://iclr.cc/archive/www/doku.php%3Fid=iclr2018:faq.html#what_is_the_signature_field_when_submitting_a_comment_review