Abstract
Figures and tables are key sources of information in many scholarly documents. However, current academic search engines do not make use of figures and tables when semantically parsing documents or presenting document summaries to users. To facilitate these applications, we develop "PDFFigures 2.0," an algorithm that extracts figures, tables, and captions from documents. Our proposed approach analyzes the structure of individual pages by detecting captions, graphical elements, and chunks of body text, and then locates figures and tables by reasoning about the empty regions within that text. To evaluate our work, we introduce a new dataset of computer science papers, along with ground truth labels for the locations of the figures, tables, and captions within them. Our algorithm achieves impressive results (94% precision at 90% recall) on this dataset, surpassing the previous state of the art. Further, we show how our framework was used to extract figures from a corpus of over one million papers, and how the resulting extractions were integrated into the user interface of a smart academic search engine, Semantic Scholar (www.semanticscholar.org). Finally, we present results of exploratory data analysis performed on the extracted figures, as well as an extension of our method to the task of section title extraction. We release our dataset and code on our project webpage (http://pdffigures2.allenai.org) to enable future research.
1. Introduction
Traditional tools for organizing and presenting digital libraries only make use of the text of the documents they index. Focusing exclusively on text, however, comes at a price because in many domains much of the important content is contained within figures and tables. Especially in scholarly documents, figures and tables are key sources of information.
Tables and figures also have the potential to be used as powerful document summarization tools. It is common to get the gist of a paper by glancing through the figures, which often contain both the main results and visual aids that outline the work being discussed. Being able to extract these figures and present them to a user would be an effective way to let users quickly get an overview of the paper's content. To this end, we introduce PDFFigures 2.0. PDFFigures 2.0 takes as input computer science papers in PDF format and outputs the figures, tables, and captions contained within them.
Our work builds upon the PDFFigures algorithm [5] . The approach used by [5] has high accuracy but was only tested on papers from a narrow range of sources. In this work, we improve upon that method to build a figure extractor that is suitable for use as part of academic search engines for computer science papers. To meet this goal we improve upon the accuracy of PDFFigures [5] and, more importantly, build an extractor that is effective across the entire range of content in a digital library. This requires an approach that is robust to the large number of possible formats and styles papers might use. Particular challenges include handling documents with widely differing spacing conventions, avoiding false positives while maintaining the ability to extract a broad range of possible captions, and extracting a highly varied selection of figures and tables.
Our approach follows the same general structure used in [5] (see Section 3) and employs data-driven heuristics that leverage formatting conventions used consistently in the computer science domain. Following a heuristic approach makes our method transparent and easy to modify [13] , which we have found to be important for developing an effective solution to this task.
While our focus is on extracting figures, our method also produces a rich decomposition of the document and analysis of the text. In this paper we demonstrate how this analysis can be leveraged for other extraction tasks, such as identifying section titles. Section titles are important because they reveal the overall structure of the document, and can be a crucial feature for upstream components analyzing body text. Section titles can also be used to identify which section figures were introduced in, thereby providing some additional context for interpreting extracted figures. We evaluate our section title extraction method on a dataset of over 50 papers and compare our results against prior work.
In order to evaluate PDFFigures 2.0 against a diverse set of documents, we introduce a new dataset of over 325 computer science papers along with ground truth labels for the locations of figures, tables, and captions within them. We also show how our method was used to extract figures from over one million documents and integrated into the user interface of Semantic Scholar [3], a smart academic search engine for computer science papers. We conclude by using these large-scale extractions to study how figure usage has evolved over time, how figure usage relates to future citations, and how figure usage differs between conference venues.
2. Related Work
For a comprehensive survey of previous work in figure extraction as well as relevant open source tools, please see [5] . In this section we review some recent developments in the field as well as exciting applications of figure extraction.
A machine learning based approach to figure extraction was recently proposed in [11]. Their method classifies the graphical elements in a PDF as being part of a figure or not, and elements classified as part of a figure are then clustered to locate individual figures in the document. Rather than working primarily with the graphical elements in a document, our approach focuses on identifying body text and then using layout analysis to locate the figures, which allows our approach to not only extract a wide variety of figures but also generalize to extracting tables.
The possibility of being able to semantically parse figures is an exciting area of research, and the figure extraction method of [5] has already demonstrated its ability to facilitate pioneering work in this area. In [12] , researchers experimented with an approach to extracting data from line plots. Given a figure, their system uses a classifier to determine whether the figure is a line plot. If the figure is determined to be a line plot, a word recognition system is then used to locate text in the plot and classify that text as being part of an axis, a title, or a legend. Finally, heuristics based on color were used to identify curves in the plot and match them against the plot's legend. Their work used PDFFigures [5] to extract a large corpus of figures from papers published in top computer science conferences. The figures were mined to collect real world examples of line plots. Since PDFFigures can additionally extract the text contained in vector graphic based figures, these figures were also used to provide ground truth labels for the word detection system. Another recent project has similarly found that figures extracted by PDFFigures can be used to generate large amounts of text detection training data for a neural network [4] .
Researchers in [14] introduced a novel framework for parsing result figures in research papers. They used PDFFigures [5] to extract figures from computer science papers and subsequently used a classifier to determine the figure type. For line plots composed of vector graphics, heuristics were used to locate key elements of the charts, including the axis, axis labels, numeric scales, and legend. Apprenticeship learning was then used to train a model to identify the lines, and thus the data, contained within the plot. In all these cases PDFFigures provided the critical building block needed for building tools that are effective on real world figures and papers.
PDFFigures [5] has also been used as a component of PDFMEF, a knowledge extraction framework that extracts a wide variety of entities and semantic information from scholarly documents [15] . In PDFMEF, PDFFigures was used to add figures and tables to the elements PDFMEF is capable of extracting. The authors remarked that PDFFigures is notable for its accuracy and its ability to extract both figures and tables, and concluded by stating "...it [PDFFigures] is arguably one of the best open source figure/table extraction tools."
These projects suggest that the ability to extract figures from arbitrary documents is extremely valuable. With PDFFigures 2.0, we hope to provide a higher quality, more robust tool for researchers wishing to use figures in their work.
The problem of locating section titles within documents has also received attention from researchers, and is addressed in systems such as ParsCit [6] , Grobid [8] and SectLabel [9] . All these approaches use machine learning to classify lines of text as being a section title or not. However, we have found that exploiting some natural properties of section titles, such as their use of salient fonts and their location relative to the rest of the document's text, makes heuristic approaches very effective for this task.
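As a rough illustration of what we mean (and not the exact rules our extractor uses), a heuristic of this kind might flag a line as a section-title candidate when its font is larger or bolder than the document's dominant body font and it begins a new block of text; the `Line` fields below are assumptions about the output of the text extraction stage.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Line:
    text: str
    font_size: float
    is_bold: bool
    starts_block: bool  # True if this line begins a new block of text

def section_title_candidates(lines: Iterable[Line]) -> Iterator[str]:
    """Flag lines set in a font that is salient relative to the dominant body font.

    The thresholds and features here are illustrative assumptions, not the
    exact rules used by our extractor.
    """
    lines = list(lines)
    if not lines:
        return
    # Treat the most common font size as the body-text size.
    body_size = Counter(round(l.font_size, 1) for l in lines).most_common(1)[0][0]
    for line in lines:
        salient = line.font_size > body_size + 0.5 or line.is_bold
        short = len(line.text.split()) <= 12
        if salient and line.starts_block and short:
            yield line.text
```

For example, a bold 14 pt line reading "4. Experiments" at the top of a block would be flagged, while ordinary 10 pt body lines would not.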
3. Approach of PDFFigures [5]
Since our work builds upon PDFFigures [5], we review the general strategy employed by [5] in this section. The approach is to focus primarily on identifying the captions and the body text of a document, since these elements are often the easiest to detect in scholarly articles. Once the captions and body text have been identified, areas containing figures can be found by locating rectangular regions of the document that are adjacent to captions and do not contain body text. PDFFigures has three phases: Caption Detection, Region Identification, and Figure Assignment.
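To keep the following description concrete, we illustrate individual steps with small Python sketches. They assume the minimal data model below (axis-aligned bounding boxes in page coordinates); this is an assumption made purely for illustration, not the internal representation used by PDFFigures.

```python
from dataclasses import dataclass
from typing import Tuple

# Axis-aligned bounding box in page coordinates: (x0, y0, x1, y1), y grows downward.
Box = Tuple[float, float, float, float]

@dataclass
class Caption:
    name: str      # e.g. "Figure 1" or "Table 2"
    page: int
    boundary: Box  # box around the caption text

@dataclass
class Figure:
    caption: Caption
    region: Box    # rectangular region assumed to contain the figure or table itself
```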
Caption Detection.
This phase of the algorithm identifies words that mark the beginning of captions within the document. Text is extracted from the document using Poppler [2] , and a keyword search is used to identify phrases that are likely to start a caption. False positives are then removed using a consistency assumption: that authors have labelled their figures in a consistent manner as is required by most academic venues.
If the first pass yields multiple phrases referring to the same figure, for example, two phrases of the form "Figure 1", it is assumed that all but one of those phrases is a false positive. If such false positives are detected, an attempt is made to remove them by applying a "filter" that removes all phrases that do not follow a particular formatting convention. Filters are only applied if they do not remove all phrases referring to a particular figure.
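The sketch below captures the spirit of this phase, assuming captions start at the beginning of a text line and using a single illustrative filter (require a trailing period or colon); the actual keyword list and filters are more extensive.

```python
import re
from collections import defaultdict

# Phrases that can start a caption, e.g. "Figure 2.", "Fig. 3:", "Table 1"
CAPTION_START = re.compile(r"^(Figure|Fig\.?|Table)\s+(\d+)\s*([.:])?", re.IGNORECASE)

def detect_caption_starts(lines):
    """Keyword search for caption-starting phrases, followed by a consistency filter.

    `lines` is a list of (page_number, line_text) pairs.  This is a simplified
    sketch of the strategy, not the exact rules used by PDFFigures.
    """
    candidates = defaultdict(list)
    for page, text in lines:
        m = CAPTION_START.match(text.strip())
        if m:
            kind = "figure" if m.group(1).lower().startswith("fig") else "table"
            candidates[(kind, m.group(2))].append((page, text, m.group(3) is not None))

    # Consistency assumption: each figure/table has exactly one caption, so
    # multiple matches for the same number imply false positives.
    if any(len(v) > 1 for v in candidates.values()):
        filtered = {k: [c for c in v if c[2]] for k, v in candidates.items()}
        # Only apply the filter if it leaves at least one phrase per figure.
        if all(filtered.values()):
            candidates = filtered
    return candidates
```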
Region Identification.
Region identification decomposes document pages into regions, each one labelled as caption, graphical element, body text, or figure text. Caption regions are built by starting from the caption phrases found in the prior step and combining them with subsequent lines of text. The rest of the text in the document is grouped into paragraphs using Poppler's paragraph grouping mechanism. Paragraphs that are either too large or aligned to the left margin of a column are classified as body text; otherwise, they are classified as figure text.
Page headers and page numbers are handled as special cases. PDFFigures checks whether pages in the document are consistently headed by the same phrase, and if so marks those phrases as body text. Likewise, page numbers are detected by checking whether all pages end with a number; if so, those numbers are marked as body text.
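A simplified version of the body-text versus figure-text rule and the page-number check might look like the following; the `Paragraph` fields and the numeric thresholds are illustrative assumptions, not the values used by PDFFigures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Paragraph:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float  # bounding box, y grows downward

def classify_paragraph(par: Paragraph, column_lefts: List[float],
                       min_body_height: float = 40.0, margin_tol: float = 2.0) -> str:
    """Label a paragraph as body text or figure text.

    Mirrors the rule described above: large paragraphs, or paragraphs flush
    with a column's left margin, are body text; everything else is figure text.
    """
    height = par.y1 - par.y0
    at_margin = any(abs(par.x0 - left) <= margin_tol for left in column_lefts)
    return "body_text" if height >= min_body_height or at_margin else "figure_text"

def page_numbers_are_body_text(last_line_per_page: List[str]) -> bool:
    """Treat trailing page numbers as body text only if every page ends with a bare number."""
    return bool(last_line_per_page) and all(l.strip().isdigit() for l in last_line_per_page)
```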
Finally, the graphical elements of the document are located. To do this, each page is rendered as a 2D image using a customized PDF renderer that ignores text. The bounding boxes of the connected components in the resulting image are then used as graphical regions of the document. An example of such a decomposition is shown in Figure 1.

Figure Assignment.

The last step is to assign each caption a region of the document containing the figure it refers to. First, up to four "proposal" regions are generated for each caption. Proposal regions are built by generating a rectangular region adjacent to each side of the caption, and then maximally expanding those regions as long as they do not overflow the page margin, overlap with body text, or overlap a caption. This is shown in Figure 2. For two-column papers, regions are constrained to not cross the center of the page unless the caption itself spans both columns.
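As a sketch of how the proposal region above a caption might be built (the other three sides are symmetric), treating body text and other captions as obstacles; column constraints and page margin handling are simplified here, and the box convention matches the earlier sketches.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward

def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def propose_region_above(caption: Box, obstacles: List[Box], page: Box) -> Box:
    """Build the proposal region directly above a caption.

    Start from the full strip between the page's top margin and the caption,
    then push the top edge down past any body text or caption it would overlap.
    Simplified: handles one direction only and ignores column boundaries.
    """
    x0, x1 = page[0], page[2]
    top = page[1]
    for ob in obstacles:
        if overlaps((x0, top, x1, caption[1]), ob):
            top = max(top, ob[3])
    return (x0, top, x1, caption[1])
```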
Next, a single proposed figure region is selected for each caption. To do this, a scoring function is used to rate each proposed region based on how likely it is to contain a figure. The scoring function gives higher scores to regions that are large and contain graphical elements. To ensure captions are not assigned regions that overlap, the algorithm iterates through every possible permutation of how figure regions could be matched to captions. Each permutation is scored based on the sum of the scores of the proposed regions it includes, with regions that overlap given a score of 0. The highest scoring permutation is then selected as the final set of figure regions to return.
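A minimal sketch of the scoring-and-matching step, under the same box conventions; the scoring function shown is a stand-in for the one actually used, and the exhaustive enumeration is feasible only because each caption has at most four proposals.

```python
from itertools import product
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]

def overlaps(a: Box, b: Box) -> bool:  # same intersection test as the previous sketch
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def region_score(region: Box, graphics: List[Box]) -> float:
    """Favor large regions that contain graphical elements (illustrative scoring only)."""
    area = max(0.0, (region[2] - region[0]) * (region[3] - region[1]))
    n_graphics = sum(1 for g in graphics if overlaps(region, g))
    return area * (1.0 + n_graphics)

def assign_regions(proposals: List[List[Box]], graphics: List[Box]) -> Optional[Tuple[Box, ...]]:
    """Pick one proposal per caption so that the total score is maximized.

    Choices that overlap another chosen region contribute a score of 0.
    """
    best, best_score = None, float("-inf")
    for choice in product(*proposals):
        score = 0.0
        for i, region in enumerate(choice):
            if any(overlaps(region, other) for j, other in enumerate(choice) if j != i):
                continue  # overlapping region scores 0
            score += region_score(region, graphics)
        if score > best_score:
            best, best_score = choice, score
    return best
```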
An additional complication comes from cases where figures are immediately adjacent, so that they are not separated by any intervening body text or captions. In these cases, proposal regions might get overly expanded and therefore contain multiple figures. To handle this, when iterating through permutations, if two proposal regions overlap, an attempt is made to split them by detecting a central band of whitespace that separates them. An example of such a figure can be found in Figure 4, second row, right column.
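A sketch of this splitting step, again under the same box conventions; only horizontal whitespace bands are considered here, whereas the actual logic is more general.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def overlaps(a: Box, b: Box) -> bool:  # same intersection test as the earlier sketches
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def split_on_whitespace_band(region: Box, contents: List[Box], min_gap: float = 10.0) -> List[Box]:
    """Split `region` at a horizontal band of empty space, if one exists.

    `contents` are the boxes of graphical elements and figure text inside the
    region.  The gap threshold is an illustrative assumption.
    """
    x0, y0, x1, y1 = region
    # Vertical extents of the contents, clipped to the region and sorted top to bottom.
    spans = sorted((max(c[1], y0), min(c[3], y1)) for c in contents if overlaps(region, c))
    merged: List[Tuple[float, float]] = []
    for top, bottom in spans:
        if merged and top <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], bottom))
        else:
            merged.append((top, bottom))
    # Split at the first sufficiently wide gap between consecutive content groups.
    for (_, prev_bottom), (next_top, _) in zip(merged, merged[1:]):
        if next_top - prev_bottom >= min_gap:
            mid = (prev_bottom + next_top) / 2.0
            return [(x0, y0, x1, mid), (x0, mid, x1, y1)]
    return [region]
```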