
Extracting Scientific Figures with Distantly Supervised Neural Networks


Abstract

Non-textual components such as charts, diagrams and tables provide key information in many scientific documents, but the lack of large labeled datasets has impeded the development of data-driven methods for scientific figure extraction. In this paper, we induce high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention. To accomplish this we leverage the auxiliary data provided in two large web collections of scientific documents (arXiv and PubMed) to locate figures and their associated captions in the rasterized PDF. We share the resulting dataset of over 5.5 million induced labels (4,000 times larger than the previous largest figure extraction dataset) with an average precision of 96.8%, to enable the development of modern data-driven methods for this task. We use this dataset to train a deep neural network for end-to-end figure detection, yielding a model that can be more easily extended to new domains compared to previous work. The model was successfully deployed in Semantic Scholar (https://www.semanticscholar.org/), a large-scale academic search engine, and used to extract figures in 13 million scientific documents. A demo of our system is available at http://labs.semanticscholar.org/deepfigures/, our dataset of induced labels can be downloaded at https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/deepfigures/jcdl-deepfigures-labels.tar.gz, and code to run our system locally can be found at https://github.com/allenai/deepfigures-open.

1 Introduction

Non-textual components (e.g., charts, diagrams and tables) provide key information in many scientific documents. Previous research has studied the utility of figures in scientific search and information extraction systems; however, the vast majority of published research papers are only available in PDF format, making figure extraction a challenging first step before any downstream application involving figures or other graphical elements can be tackled. While some venues (e.g., PubMed) provide the figures used in recently published documents, figure extraction remains a problem for older papers and for the many venues that publish only PDF files of research papers.

Recent years have seen the emergence of a body of work focusing on use cases for extracted figures (see Section 2). All of these downstream tasks rely upon accurate figure extraction. Unfortunately, the lack of large-scale labeled datasets has hindered the application of modern data-driven techniques to figure and table extraction. Previous work on this task used rule-based methods to address the problem in limited domains. In particular, [6] extract figures in research papers at NIPS, ICML and AAAI, and [7] extend their work to address papers in computer science more generally; however, stylistic conventions vary widely across academic fields, and since previous methods relied primarily on hand-designed features from computer science papers, they do not generalize well to other scientific domains, as we show in Section 5.

Our main contribution in this paper is to propose a novel method for inducing high-quality labels for figure extraction in a large number of scientific documents, with no human intervention. Our technique utilizes auxiliary data provided in two large web collections of scientific documents (arXiv and PubMed) to locate each figure and its associated caption in the rendered PDFs. The resulting dataset consists of 5.5 million induced labels with an average precision of 96.8%. The size of this dataset is three orders of magnitude larger than human-labeled datasets available for figure extraction in scientific documents.

To demonstrate the value of this dataset, we introduce DeepFigures, a deep neural model for detecting figures in PDF documents, built on a standard neural network architecture for modeling real-world images, ResNet-101. Comparison to prior rule-based techniques reveals better generalization across different scientific domains. Additionally, we discuss a production system for figure extraction built on DeepFigures that is currently deployed in a large-scale academic search engine (Semantic Scholar) covering multiple domains, illustrating the practical utility of our proposed approach.

Our main contributions are:

• We propose a novel method for inducing high-quality labels for figure extraction in large web collections of scientific documents.

• We introduce, to the best of our knowledge, the first statistical model for figure extraction, using a neural network trained exclusively on our dataset with no human labels.

• We release our figure extraction data, tool, and code for generating the datasets and extracting figures locally to facilitate future research in graphical information understanding in scientific documents.

2 Related Work

In this section, we discuss two lines of related work in the literature. The first line focuses on extraction and understanding of figures in scientific documents, which motivate this work. The second line reviews related neural models which we build on.

2.2 Neural Models For Computer Vision

The model architecture we use for figure extraction in this paper leverages the great success of convolutional neural networks on a variety of computer vision tasks including object recognition and detection [14], motion analysis [12], and scene reconstruction [27]. Inspired by the brain's visual cortex, these networks consist of millions of neurons arranged in a series of layers that learn successively higher-level visual representations. For example, when performing facial recognition, a neuron in the first layer might detect horizontal edges, a neuron in the second layer might respond to certain curves, one in the third layer to an eye, and one in the fourth layer to a whole face. Recent work has used neural networks for semantic page segmentation, suggesting that these models can be applied to synthetic documents as well as natural scenes [3, 13, 28]. In this section we provide more details on two building blocks we use in DeepFigures: ResNet-101 and OverFeat.

2.2.1 ResNet-101. An important task in computer vision is object recognition: for example, given an image, determine whether it contains a cat or a dog. Using the raw pixels as features poses difficulties for most traditional classification algorithms, due to the sheer volume of information and the curse of dimensionality. Instead, computer vision techniques generally extract higher-level features from the image, and then run a standard machine learning classifier such as logistic regression on these features. Before neural networks, features were generally hand-engineered by researchers or practitioners; one common such feature is the frequency of edges in various regions of the image [8]. In contrast, convolutional neural networks learn their feature representations from the data. This learning is achieved by defining a broad space of possible feature extractors and then optimizing over it, typically using backpropagation and stochastic gradient descent. The architecture of the neural network specifies how neurons are defined and pass information to each other, and it is this architecture that defines the space of possible feature extractors that can be learned.

Numerous highly successful neural network architectures have been proposed for computer vision [14, 15, 22]. In general, neural networks achieve better performance with more data and more layers, which makes large-scale datasets and effective optimization methods key to their success. One problem in training neural networks with many layers is that of vanishing (or exploding) gradients: as gradients are propagated back through successive layers, they tend either to blow up (causing parameters to quickly diverge during training) or to shrink toward zero, making earlier layers difficult to train. Residual networks (ResNets) [14] address this problem by adding identity connections between blocks: rather than each layer receiving as input only the previous layer's output, some layers also receive the output of several layers before. These identity connections provide a path for gradients to reach earlier layers in the network undiminished, allowing much deeper networks to be trained. An ensemble of ResNet models won the ImageNet object detection competition in 2015.

Because useful image features transfer well across tasks, it is common to use parts of one neural network architecture in place of components of another. ResNet-101 [14] provides one such feature extraction architecture. ResNet-101 is a 101-layer deep neural network formed by stacking "bottleneck" units, each consisting of a 1x1 convolutional layer that brings down the dimension of the embedding, a 3x3 convolutional layer operating at this reduced dimension, and another 1x1 convolutional layer that brings the dimension of the embedding back up to that of the original input. An identity connection adds the input of the bottleneck unit to the output of its last layer before passing it further down the network.
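A minimal PyTorch sketch of such a bottleneck unit is shown below; it is our own simplification for illustration (channel widths, normalization placement, and downsampling variants differ in the real ResNet-101), not the actual implementation:

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Illustrative ResNet-style bottleneck: a 1x1 conv reduces the channel
    dimension, a 3x3 conv operates at the reduced width, and a final 1x1 conv
    restores the original width; an identity connection adds the block's
    input to its output."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, kernel_size=3,
                      padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The identity (residual) connection lets gradients flow to earlier
        # layers undiminished.
        return self.relu(self.block(x) + x)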

2.2.2 OverFeat.

Another common task in computer vision, and the one we are interested in for this work, is object detection: for example, determine the location of all human faces in a given image. This task can be formalized as predicting a bounding box that encloses the object while being as small as possible. Object detection is more complex than classification: rather than predicting a binary or multiclass label, the model must predict a variable-length list of bounding boxes that vary in size and shape. The problem can be reduced to classification by running a classifier on every possible box in the image, but due to the high computational cost of running neural networks with millions of parameters this is generally infeasible.

OverFeat [19] introduced the idea of bounding box regression. Rather than producing a class output, the model can use regression to predict bounding box coordinates directly. To enable detecting multiple objects as well as handling objects in various locations in the image, the model is run fully convolutionally, i.e., the entire model is run on cropped image sections centered on a uniformly spaced grid of 20x15 points. (Running the model at each point on a grid is significantly less computationally expensive than running it on the same number of independent images, because the convolutional structure of the network layers means much of the work on overlapping regions is redundant and can be shared; see [19] for details.) For each cropped region, OverFeat uses 5 initial layers that perform convolutions and max pooling to extract a feature vector. Classification is then performed on the feature vector by two fully connected layers and an output layer, while bounding box regression is performed by two fully connected layers and a final output layer providing four numbers: the coordinates of the predicted bounding box. Each class has its own output layer to provide a bounding box for that class alone. The classification result then provides a confidence for each class and every region in the grid, while the bounding box regression yields a possible bounding box for that class. Thus, for any class many bounding boxes are predicted and later merged.
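To make this structure concrete, the following PyTorch sketch shows classification and per-class box-regression heads run fully convolutionally over a spatial feature grid. The layer sizes are illustrative and do not reproduce OverFeat's exact configuration; a 1x1 convolution plays the role of a fully connected layer applied at each grid cell.

import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Illustrative OverFeat-style heads: every grid cell yields class
    confidences and one candidate bounding box per class."""
    def __init__(self, feat_dim, num_classes, hidden=512):
        super().__init__()
        self.cls_head = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_classes, kernel_size=1),
        )
        self.box_head = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 4 * num_classes, kernel_size=1),  # 4 coords per class
        )

    def forward(self, feature_grid):                 # (B, feat_dim, 15, 20)
        class_logits = self.cls_head(feature_grid)   # (B, num_classes, 15, 20)
        boxes = self.box_head(feature_grid)          # (B, 4*num_classes, 15, 20)
        return class_logits, boxes

# Example: a 20x15 grid of 256-dimensional features and 10 object classes.
heads = DetectionHeads(feat_dim=256, num_classes=10)
logits, boxes = heads(torch.randn(1, 256, 15, 20))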

3 Inducing Figure Extraction Labels

Table 1: Number of papers, figures, and tables in the manually-labeled datasets (left) and our datasets of induced labels (right).

3.1 Aligning Figures In LaTeX Documents

Figure 1: Modifying LaTeX source to recover figure positions. Figure bounding boxes are shown in red, figure names in green, and captions in blue. Left: original document. Middle: document compiled from modified source. Right: image difference between original and modified documents.

Labeling Captions:

In order to find the coordinates of the bounding box for caption text, we modify the color of figure names and captions using the following command:

\usepackage[labelfont={color=green}, textfont={color=blue}]{caption}

Finally, we modify the coordinates of the bounding box of each figure and table to exclude the caption, by identifying the largest rectangular region inside the float that contains no caption text. This is robust even to uncommon caption locations (e.g., above a table, or to the side of a figure).
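A minimal sketch of this step in Python, assuming boxes are given as (x0, y0, x1, y1) in image coordinates with y increasing downward (our own illustration; the actual implementation may differ):

def largest_region_excluding_caption(float_box, caption_box):
    """Return the largest axis-aligned rectangle inside float_box that
    contains no part of caption_box. The candidates are the strips of the
    float above, below, left of, and right of the caption."""
    fx0, fy0, fx1, fy1 = float_box
    cx0, cy0, cx1, cy1 = caption_box  # assumed to lie within float_box
    candidates = [
        (fx0, fy0, fx1, cy0),  # strip above the caption
        (fx0, cy1, fx1, fy1),  # strip below the caption
        (fx0, fy0, cx0, fy1),  # strip to the left of the caption
        (cx1, fy0, fx1, fy1),  # strip to the right of the caption
    ]
    def area(box):
        x0, y0, x1, y1 = box
        return max(0, x1 - x0) * max(0, y1 - y0)
    return max(candidates, key=area)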

First, for each image in the auxiliary data, we determine which page in the corresponding PDF file contains this figure by searching the PDF text (extracted by a tool such as PDFBox) for the caption text (which is also available in the auxiliary data). Since the XML and PDF text do not always match exactly (e.g., an em dash in the PDF vs. a hyphen in the XML), we use dynamic programming to find the substring of the PDF text with the smallest Levenshtein distance to the caption text in the XML file. We modify the standard Wagner-Fischer dynamic programming algorithm for edit distance [25] by setting the cost of starting (and ending) at any position in the PDF text to 0. This modification maintains the time complexity of O(mn), where m and n are the string lengths.
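The following Python sketch illustrates the modified recurrence (our own reconstruction, not the released code): a free start corresponds to initializing the first row of the dynamic programming table to zero, and a free end corresponds to taking the minimum over the final row.

def best_matching_substring(pattern, text):
    """Find the substring of text with the smallest Levenshtein distance to
    pattern, using Wagner-Fischer with free start and end positions in text.
    Returns (start, end, distance) so that text[start:end] is the match.
    Runs in O(len(pattern) * len(text)) time."""
    m, n = len(pattern), len(text)
    prev = [0] * (n + 1)              # matching the empty pattern is free
    prev_start = list(range(n + 1))   # the (empty) match starts at position j
    for i in range(1, m + 1):
        cur = [i] + [0] * n           # pattern[:i] against an empty substring
        cur_start = [0] * (n + 1)
        for j in range(1, n + 1):
            substitute = prev[j - 1] + (0 if pattern[i - 1] == text[j - 1] else 1)
            delete = prev[j] + 1      # drop pattern[i - 1]
            insert = cur[j - 1] + 1   # skip text[j - 1]
            cur[j], cur_start[j] = min(
                (substitute, prev_start[j - 1]),
                (delete, prev_start[j]),
                (insert, cur_start[j - 1]),
            )
        prev, prev_start = cur, cur_start
    end = min(range(n + 1), key=lambda j: prev[j])   # free end position
    return prev_start[end], end, prev[end]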

arXiv:

In order to construct our dataset, we download LaTeX source files from arXiv (https://arxiv.org/), a popular platform for pre-publishing research in various fields including physics, computer science, and quantitative biology. When authors submit a paper to arXiv, they are required to upload the source files if the paper is typeset using LaTeX. As of the time of writing, arXiv hosts over 900,000 papers with LaTeX source code.

PubMed:

Fortunately, some publishers provide XML markup for their papers, which can also be used to induce figure extraction labels. In particular, the PubMed Central Open Access Subset is a free archive of medical and life sciences research papers. The National Center for Biotechnology Information (NCBI) makes this subset available for bulk download. In addition to the PDF files, it provides auxiliary data to improve the user experience while reading a paper. The auxiliary data includes the paper text marked up with XML tags (including figure captions) as well as image files for all graphics.

In principle, this data can be used to induce labels for figure extraction. However, unlike LaTeX documents, the XML markup cannot be used to compile the PDF. Therefore, we propose a different approach to recover the positional information of figures.

3.3 Comparison To Manual Annotation

In this section, we have proposed a method for automatically inducing labeled data for figure extraction in scientific documents. An alternative approach is to train annotators to sift through a large number of research papers and label the coordinates of figures, tables, and their captions. While this approach typically results in high-quality annotations, it is often impractical: manual annotation is slow and expensive, and it is hard to find annotators with appropriate training or domain knowledge. With limited time and budget, the amount of labeled data we can collect with this approach is modest.

Scalability Of Induced Labels:

In contrast to manual annotation, our proposed method for inducing labels is both scalable and accurate. We compare the size of our datasets with induced labels to that of manually labeled datasets in Table 1 . We compare with two manually labeled datasets:

• The "CS-Large" dataset [7] : To our knowledge, this was previously the largest dataset for the task of figure extraction. Papers in this dataset were randomly sampled from computer science papers published after the year 1999 with nine citations or more.

Table 2: Precision, recall and F1 score of induced labels in the "LaTeX" and "XML" datasets.

• The "PubMed" dataset: We collected this dataset by sampling papers from PubMed, and hired experts in biological sciences to annotate figures and tables, and their captions.

Both manually labeled datasets are used as test sets in our experiments (Section 5). Notably, both of our datasets with induced labels, "LaTeX" and "XML", are three orders of magnitude larger than "CS-Large".

Accuracy Of Induced Labels:

In order to assess the accuracy of labels induced using our method, we collected human judgments for a sample of papers in the "LaTeX" and "XML" datasets. Table 2 reports the precision and recall of figures and tables, including captions, for 150 pages in 61 papers in the "LaTeX" dataset and 106 pages in 86 papers in the "XML" dataset. For a data point to count as a true positive, we require that the four corners of the figure (or table) and the four corners of its caption be correct. As shown in Table 2, the quality of the induced labels is fairly high (e.g., the F1 score of induced labels ranges between 93.9% and 100%).

The following section discusses the model we developed in order to consume the induced labeled data described in this section.

4 The DeepFigures Model

Our system takes as input a PDF file, which we then render as a list of page images, and feed each page to our figure detection neural network. The network architecture we use for figure extraction is a slight variant of several standard neural network architectures for image classification and object detection. In particular, our model is based on TensorBox [21] , applying the OverFeat detection architecture [19] to image embeddings generated using ResNet-101 [14] . This object detector then finds bounding boxes for figures in the PDF, and captions are extracted separately.

In contrast to OverFeat, which uses a relatively shallow 5-layer network to generate the spatial feature grid, we use ResNet-101 which enables higher model capacity and accuracy. Additionally, while OverFeat trained the embedding network on a classification task and then fixed those weights while training localization, learning only the weights for the final regression layer, we train the full network end-to-end, allowing the embedding network to learn features more relevant to localization and eliminating the need for pre-training.

Figure 2: Distributions of figures (left) and tables (right) in our automatically generated datasets. X-axis shows number of figures/tables per paper. Y-axis shows fraction of papers with that many figures/tables. Differences are likely a result of the differing source datasets: for example, the life science papers found in PubMed may rely more on tables to convey information than math papers on arXiv.

As illustrated in Figure 3 , in the network we use for figure extraction each grid cell is represented by 1024 features extracted from the ResNet-101 model, resulting in a 20x15x1024 spatial feature grid. At each grid cell, the model uses these features to predict both a bounding box and a confidence score. Boxes with confidence score above a selected threshold are returned as predictions. Figure 4 illustrates the architecture in more detail.

Figure 3: High-level structure of the DeepFigures model. The input to the model is a 640x480 page image. ResNet-101 is run fully-convolutionally over the image, yielding a 20x15 spatial grid of 1024-dimensional image embedding vectors. Next, regressions to predict box coordinates and confidences are run on each of the 300 grid cells, yielding 300 candidate bounding boxes. Running non-maximum suppression and filtering out predictions with confidences below a threshold yields the final predictions.
Figure 4: Architecture of the DeepFigures network expressed fully convolutionally. Note that a 1x1 convolution is equivalent to a fully connected layer run at each point on a spatial grid. Strides are 1 where not specified and all convolutional layers except those outputting predictions use ReLU activations. See [14] for the full ResNet-101 architecture.

Matching Captions:

The OverFeat-ResNet figure detection model outputs a set of bounding boxes for figures; however, many applications, including academic search, benefit most from having a set of figure-caption pairs for the PDF. Our figure extraction pipeline extracts captions' text and bounding boxes using the same method as [7]: finding paragraphs starting with a string that matches a regular expression capturing variations of "Figure N." or "Table N.", and then locating the identified textual elements in the page using standard PDF processing libraries. Once we have a list of proposed captions, we match figures to captions so as to minimize the total Euclidean distance between the centers of paired boxes. This is an instance of the linear assignment problem and can be solved efficiently using the Hungarian algorithm [16]. If there are more detected figures than captions or vice versa, the algorithm picks the min(figure count, caption count) pairs that minimize total distance. See [7] for more details on matching captions.

Both of the data generation methods described in Section 3 produce bounding boxes for captions as well as figures, so in principle the captions could also be detected using a neural network. In our experience, however, training the model to predict captions reduced performance. There are a few likely causes: captions are often very small along the height dimension, amplifying small absolute errors in bounding box coordinates; captions also have fewer visual cues and are much less distinct from surrounding text than figures. Finally, the baseline caption detection model from [6] performs very well: most errors in PDFFigures 2.0 are caused by figure detection rather than caption detection. For these reasons, we continue to use the rule-based approach for detecting captions.
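The matching step described above can be sketched as follows, using SciPy's assignment solver (an illustration under our own box convention, not the deployed code):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_figures_to_captions(figure_boxes, caption_boxes):
    """Pair detected figure boxes with caption boxes by minimizing the total
    Euclidean distance between box centers (a linear assignment problem).
    Boxes are (x0, y0, x1, y1); both lists are assumed non-empty. Returns
    min(#figures, #captions) pairs of (figure_index, caption_index)."""
    def center(box):
        x0, y0, x1, y1 = box
        return np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])

    cost = np.array([[np.linalg.norm(center(f) - center(c))
                      for c in caption_boxes]
                     for f in figure_boxes])
    fig_idx, cap_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(fig_idx, cap_idx))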

To summarize how our model combines the previously mentioned components: the model generates a 20x15 spatial grid of image embedding vectors, each with 1024 dimensions, using ResNet-101 [14]. The feature vectors are then fed into a linear regression layer whose four outputs represent the coordinates of the bounding box. Simultaneously, the feature vectors are passed through a logistic regression layer to predict the confidence that each grid cell is at the center of a figure. Redundant boxes are eliminated via non-maximum suppression. At test time, we run inference with a confidence threshold of 50%, although this parameter may be tuned to favor precision or recall if needed.
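A schematic PyTorch version of this per-grid-cell prediction step is given below; it is our own simplification of a TensorBox-style head, not the actual implementation, and it omits the training loss and non-maximum suppression details:

import torch
import torch.nn as nn

class FigureDetectionHead(nn.Module):
    """Each cell of the 20x15 grid of 1024-dimensional ResNet features
    produces four box coordinates and a confidence that the cell lies at the
    center of a figure. A 1x1 convolution acts as a per-cell regression."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.box_regressor = nn.Conv2d(feat_dim, 4, kernel_size=1)
        self.confidence = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feature_grid, threshold=0.5):            # (B, 1024, 15, 20)
        boxes = self.box_regressor(feature_grid)                # (B, 4, 15, 20)
        scores = torch.sigmoid(self.confidence(feature_grid))   # (B, 1, 15, 20)
        # Cells whose confidence exceeds the threshold become candidate
        # detections; non-maximum suppression then removes redundant boxes.
        keep = scores > threshold
        return boxes, scores, keep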

5 Experiments

In this section, we compare the DeepFigures model described in section 4 to PDFFigures 2.0 [7] , the previous state of the art for the task of figure extraction.

Data:

We train the DeepFigures model on 4,095,622 induced figures (1,030,671 in the LaTeX dataset and 3,064,951 in the XML dataset) and 1,431,820 induced tables (164,356 in the LaTeX dataset and 1,267,464 in the XML dataset). See section 3 for more details on the two datasets.

When using any algorithmically generated dataset, the question arises of how to ensure that the model is really learning something useful, rather than simply taking advantage of some algorithmic quirk of the generating process. Therefore, we perform evaluation entirely on human annotated figures. The algorithm is trained entirely on synthetic data and tested entirely on human labeled data, so our high performance demonstrates the quality of our distantly supervised dataset.

We run evaluation on two datasets: the "CS-Large" computer science dataset introduced by [7] , and a new dataset we introduce using papers randomly sampled from PubMed. Our new dataset, consisting of 289 figures and 124 tables from 104 papers, was annotated by experts in biological sciences.

Hyperparameters:

We use RMSProp as our optimizer with initial learning rate 0.001. We train for 5,000,000 steps, decaying the learning rate by a factor of 2 every 330,000 steps. We use a batch size of 1 during training and did not observe significant performance gains from larger batches, likely due to the inherent parallelism in sliding window detection.
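For illustration, an equivalent schedule sketched in PyTorch (the system itself is built on TensorFlow/TensorBox; the stand-in model and loss below are placeholders):

import torch
from torch import nn, optim

model = nn.Conv2d(1024, 5, kernel_size=1)   # stand-in for the full detector
optimizer = optim.RMSprop(model.parameters(), lr=1e-3)
# Halve the learning rate every 330,000 steps.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=330_000, gamma=0.5)

for step in range(5_000_000):
    features = torch.randn(1, 1024, 15, 20)   # batch size 1
    loss = model(features).pow(2).mean()      # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()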

Evaluation Methodology:

Our evaluation methodology follows that of [7]. A predicted box is evaluated against a ground truth box based on the Jaccard index, also known as intersection over union (IOU): the area of their intersection divided by the area of their union. As in [7], a predicted figure bounding box is considered correct if its IOU with the true box exceeds 0.8, while a predicted caption box is considered correct if its IOU exceeds 0.8 or if the predicted caption text matches the text from the true region extracted from the PDF. However, while [7] required annotations to include the figure number in order to be matched with predictions, we eliminate this requirement in order to simplify the human annotation task. Instead, we find the optimal assignment of predicted figures to true figures for each page, which is an instance of the linear assignment problem and can be done efficiently using the Hungarian algorithm.

Table 3: F1-scores for figure extraction systems on human labeled datasets. In keeping with [7], a predicted bounding box is considered correct if its IOU (intersection over union) with the true box is at least 0.8.

System               CS-Large   PubMed
PDFFigures 2.0 [7]   87.9%      63.5%
DeepFigures (Ours)   84.9%      80.6%
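For reference, a minimal implementation of the IOU criterion used in this evaluation, with boxes given as (x0, y0, x1, y1):

def iou(box_a, box_b):
    """Intersection over union (Jaccard index) of two axis-aligned boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# A predicted box is counted as correct when iou(predicted, true) >= 0.8.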

Results:

As shown in Table 3, DeepFigures underperforms PDFFigures 2.0 by 3 F1 points on "CS-Large", but achieves a 17-point improvement on the "PubMed" dataset. Given that PDFFigures 2.0 is a rule-based method that was tuned specifically for the "CS-Large" test set, it is unsurprising that it works better than DeepFigures in this domain.

Since DeepFigures does not use any human annotation or domain-specific feature engineering, it learns a robust model for identifying figures across a variety of domains. For example, PDFFigures 2.0 often generates false positives on graphical headers; these are visually distinct from actual figures, however, allowing our model to correctly reject them.

7 Conclusion

In this work, we present a novel method for inducing high-quality labels for figure extraction in scientific documents. Using this method, we contribute a dataset of 5.5 million induced labels with high accuracy, enabling researchers to develop more advanced methods for figure extraction in scientific documents.

We also introduce DeepFigures, a neural model for figure extraction trained on our induced dataset. DeepFigures has been successfully used to extract figures in 13 million papers in a large-scale academic search engine, demonstrating its scalability and robustness across a variety of domains.

Future work includes training a model to perform the full task of figure extraction end-to-end, including detecting and matching captions. This task could be aided by providing the network with additional information available from the PDF beyond the rendered image, e.g., the locations of text and image elements on the page. Additionally, our data generation approach could be extended to other information about papers, such as titles, authors, and sections; the distinctive visual characteristics of these elements as they appear in papers suggest that neural detection models could be useful.

For brevity, we use figure extraction to refer to the extraction of both figures and tables in the remainder of the paper.


We use the term LaTeX to refer to the formal language used to describe a document's content, structure and format in TeX software distributions.

The added border shifts figure positions by a few pixels, so we use the same command to add white borders to figures in the original to make them align exactly.

Another alternative is to use crowdsourced workers (e.g., via Amazon Mechanical Turk, https://www.mturk.com/, or CrowdFlower, http://www.crowdflower.com/) to do the annotation. Although crowdsourcing has been successfully used to construct useful image datasets such as ImageNet [11], [20] found that crowdsourcing figure annotations in research papers yielded low inter-annotator agreement and significant noise due to workers' lack of familiarity with scholarly documents.

Simple Storage Service, AWS's highly scalable object store.

Figure 5: Deployment architecture for the DeepFigures service.