SciBERT: Pretrained Contextualized Embeddings for Scientific Text

Iz Beltagy
Arman Cohan
Kyle Lo
ArXiv
2019

Abstract

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained contextualized embedding model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks.

1 Introduction

There has been an exponential increase in the volume of scientific publications in the past decades, making NLP an essential tool for large-scale knowledge extraction and machine reading of scientific documents. Recent progress in NLP has been driven by the adoption of deep neural network models, but training such models often requires large amounts of labeled data. In general domains, large-scale training data is often possible to obtain through crowdsourcing. In scientific domains, however, annotated data is difficult and expensive to collect due to the expertise required to perform annotation tasks.

As shown through ELMo, GPT (Radford et al., 2018) and BERT (Devlin et al., 2018), unsupervised pretraining of language models on large corpora significantly improves performance on many NLP tasks. These models can read a sentence and return a contextualized embedding for each token, which can be used in task-specific neural architectures. Due to the success of these models across a variety of NLP tasks, leveraging unsupervised pretraining has become especially useful when task-specific annotations are difficult to obtain. Yet while both BERT and ELMo have released pretrained models, they are still trained on general domain corpora such as news articles and Wikipedia.

In this work, we make the following contributions:

(i) We release SCIBERT, a new resource to successfully tackle a broad set of NLP tasks in the scientific domain. SCIBERT is based on BERT trained on a large corpus of scientific text. The code and pretrained models are available at https://github.com/allenai/scibert.

(ii) We evaluate SCIBERT against BERT on a suite of tasks in the scientific domain, i.e., sequence tagging, parsing, and text classification.

(iii) With SCIBERT, we achieve new state-of-the-art results on many of these tasks with minimal task-specific architectures and without any hyperparameter search or fine-tuning.

2 Methods

Background The BERT model architecture (Devlin et al., 2018) is based on a multilayer bidirectional Transformer (Vaswani et al., 2017). Instead of the traditional left-to-right language modeling objective, BERT is pretrained on two tasks: predicting randomly masked tokens and predicting whether two sentences follow each other. SCIBERT follows the same architecture as BERT and is optimized on scientific text.
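The masked-token objective can be illustrated with a minimal sketch. The 15% selection rate and the 80/10/10 replacement split come from Devlin et al. (2018), not from this paper. The function name `mask_tokens`, the toy vocabulary, and the independent per-token sampling are simplifying assumptions made here for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mask_prob: float = 0.15,
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Corrupt a token sequence for masked language modeling.

    Returns the corrupted inputs and, per position, the original token
    to predict (None where no prediction is required).
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    targets: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(list(vocab))  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, targets


sentence = "the protein binds to the receptor".split()
corrupted, labels = mask_tokens(sentence, vocab=["cell", "gene", "binds"], seed=0)
print(corrupted)
print(labels)
```

Only the selected positions contribute to the loss, so the model sees mostly intact context on both sides of each prediction target.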

Vocabulary BERT uses WordPiece for unsupervised tokenization of the input text. WordPiece, similar to byte pair encoding (Sennrich et al., 2016), relies on a vocabulary of subword units. The vocabulary is built such that it contains the most frequently used words or subwords. We refer to the original vocabulary released with BERT as BASEVOCAB. We consider that frequently observed words and subwords in scientific text may differ from those occurring in general domain text. To better model the scientific vocabulary, we use the SentencePiece¹ library to build a new WordPiece vocabulary, SCIVOCAB, on our scientific corpus. We produce both cased and uncased vocabularies and set the vocabulary size to 30K, akin to BERT. The resulting token overlap between BASEVOCAB and SCIVOCAB is 42%.
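To make the role of the vocabulary concrete, the sketch below implements the greedy longest-match-first segmentation commonly used with WordPiece vocabularies, together with one plausible way to measure vocabulary overlap. The two toy vocabularies are invented for illustration and are not excerpts of BASEVOCAB or SCIVOCAB. The overlap definition (shared tokens divided by the size of the smaller vocabulary) is also an assumption, since the text reports the 42% figure without defining it.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into subword units by greedy longest-match-first."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


def vocab_overlap(a: Set[str], b: Set[str]) -> float:
    """Share of tokens common to both vocabularies, relative to the smaller one."""
    return len(a & b) / min(len(a), len(b))


# Toy vocabularies (hypothetical, for illustration only).
general_vocab = {"the", "cell", "im", "##mun", "##og", "##lo", "##bul", "##in"}
science_vocab = {"the", "cell", "im", "##mun", "immunoglobulin"}

print(wordpiece_tokenize("immunoglobulin", general_vocab))
print(wordpiece_tokenize("immunoglobulin", science_vocab))
print(vocab_overlap(general_vocab, science_vocab))
```

With the general-domain toy vocabulary the word is split into six pieces, while the in-domain vocabulary keeps it whole, which is the effect a domain-specific vocabulary is meant to achieve.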

Corpus We train SCIBERT on a random sample of 1.14M papers from Semantic Scholar (Ammar et al., 2018). This corpus consists of 18% papers from the computer science domain and 82% from the broad biomedical domain (see Table 1). We use the full text of the papers, not just the abstracts.² It is worth noting that the full text is the output of a noisy PDF parser,³ but we keep those sentences in to train a model that is robust to noisy inputs. The average paper length is 154 sentences (2769 tokens), resulting in a corpus size of 3.17B tokens. This provides extensive language data from scientific text, comparable to the size of the general corpora on which BERT was trained (3.3B tokens). To split the documents into sentences, we use ScispaCy (Neumann et al., 2019),⁴ which is optimized for scientific text.
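The corpus statistics above can be cross-checked with a few lines of arithmetic. The sketch uses only the figures quoted in the text; the variable names are ours. Because the paper count is rounded to 1.14M, the product lands slightly below the reported 3.17B tokens.

```python
# Figures quoted in the text.
NUM_PAPERS = 1.14e6
AVG_SENTENCES_PER_PAPER = 154
AVG_TOKENS_PER_PAPER = 2769
REPORTED_CORPUS_TOKENS = 3.17e9
CS_SHARE = 0.18

# Derived quantities.
estimated_tokens = NUM_PAPERS * AVG_TOKENS_PER_PAPER
relative_gap = abs(estimated_tokens - REPORTED_CORPUS_TOKENS) / REPORTED_CORPUS_TOKENS
tokens_per_sentence = AVG_TOKENS_PER_PAPER / AVG_SENTENCES_PER_PAPER
cs_papers = NUM_PAPERS * CS_SHARE
biomedical_papers = NUM_PAPERS - cs_papers

print(f"estimated corpus size: {estimated_tokens / 1e9:.2f}B tokens")
print(f"gap to reported size:  {relative_gap:.2%}")
print(f"tokens per sentence:   {tokens_per_sentence:.1f}")
print(f"CS papers: {cs_papers:,.0f}, biomedical papers: {biomedical_papers:,.0f}")
```

The estimate agrees with the reported corpus size to within half a percent, and the implied average sentence length is about 18 tokens.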

3 Experiments

To demonstrate the effectiveness of SCIBERT, we conduct extensive experiments with different variants of BERT and SCIBERT on a large suite of NLP tasks on scientific text. We first describe the BERT variants and then the models

3.1 BERT Variants

BERT-Base This is the original model detailed in Devlin et al. (2018). We use the pretrained weights for BERT-Base released with the original BERT code.⁵ The vocabulary used is BASEVOCAB. We evaluate both cased and uncased versions of this model.

SCIBERT

We use the original BERT code to train SCIBERT on our corpus with the same configuration and size as BERT-Base. We train 4 different versions of SCIBERT: (a) cased or uncased, and (b) using BASEVOCAB or SCIVOCAB. The two models that use BASEVOCAB are fine-tuned from the corresponding BERT-Base models. The other two models that use the new SCIVOCAB are trained from scratch.
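The four SCIBERT variants form a simple 2 x 2 grid, enumerated in the sketch below. The dictionary layout and the helper name `initialization` are illustrative assumptions; only the casing options, the vocabulary names, and the initialization rule come from the text.

```python
from itertools import product

CASINGS = ("cased", "uncased")
VOCABS = ("BASEVOCAB", "SCIVOCAB")


def initialization(vocab: str) -> str:
    """BASEVOCAB models continue from BERT-Base; SCIVOCAB models start fresh."""
    if vocab == "BASEVOCAB":
        return "fine-tuned from BERT-Base"
    return "trained from scratch"


variants = [
    {"casing": casing, "vocab": vocab, "init": initialization(vocab)}
    for casing, vocab in product(CASINGS, VOCABS)
]

for variant in variants:
    print(f"{variant['casing']:8s} {variant['vocab']:10s} {variant['init']}")
```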

3.2 Pretraining BERT

Training BERT for long sentences can be slow. Following the recommended settings in the original BERT code, we set a maximum sentence length of 128 tokens, and train the model until the training loss stops decreasing. We then continue training the model, allowing sentence lengths up to 512 tokens.

We train the SCIBERT models on a single TPU v3 with 8 cores. Training the SCIVOCAB models from scratch on our corpus of 3.17B tokens takes 1 week⁶ (5 days with max length 128, then 2 days with max length 512). Fine-tuning the BASEVOCAB models starting from the BERT-Base weights reduces the overall training time by 2 days.
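The two-phase schedule and the reported training times can be written down as a small configuration sketch. The `Phase` class and the constant names are hypothetical. The text does not say how the 2-day saving is distributed across the two phases, so only the total is derived here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Phase:
    max_seq_length: int  # maximum sentence length in tokens
    days: int            # reported wall-clock time on a single TPU v3 (8 cores)


# Schedule reported for the SCIVOCAB models trained from scratch.
FROM_SCRATCH: List[Phase] = [Phase(max_seq_length=128, days=5), Phase(max_seq_length=512, days=2)]
DAYS_SAVED_BY_FINE_TUNING = 2


def total_days(phases: List[Phase]) -> int:
    return sum(phase.days for phase in phases)


scratch_days = total_days(FROM_SCRATCH)
fine_tune_days = scratch_days - DAYS_SAVED_BY_FINE_TUNING
print(f"from scratch: {scratch_days} days, fine-tuned from BERT-Base: {fine_tune_days} days")
```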

3.3 Tasks

We evaluate SCIBERT on a broad set of tasks in the scientific NLP domain.

NER (Named Entity Recognition): This task is a sequence-labeling task where tokens corresponding to entities in a sentence are labeled. This may also include classifying entities to a predefined set of types.

PICO (Participant Intervention Comparison Outcome Extraction): This is also a sequence-labeling task involving the extraction of spans of interest in clinical trial papers (Huang et al., 2006). PICO is a technique used in evidence-based practice to frame and answer a health-related question and is helpful to develop literature search strategies in the biomedical domain.

CLS (Classification): This task is classifying a sequence of tokens (e.g., a sentence) with its corresponding label.

REL (Relation Classification): This task is predicting the type of relation expressed between two entities in a sentence. To mark entity locations in the sentence, the entity mentions are encapsulated by special characters. The task is then framed as a multiclass sentence-level classification problem.

DEP (Dependency Parsing): This task is predicting the dependencies between tokens in the sentence as a structured tree.
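For the REL task, the entity-marking step can be sketched as follows. The text does not specify which special characters are used, so the marker strings `<<`, `>>`, `[[` and `]]` are placeholders chosen here, and `mark_entities` is a hypothetical helper. Spans are token index pairs with an exclusive end and are assumed not to overlap.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def mark_entities(
    tokens: Sequence[str],
    first: Span,
    second: Span,
    first_marks: Tuple[str, str] = ("<<", ">>"),
    second_marks: Tuple[str, str] = ("[[", "]]"),
) -> List[str]:
    """Wrap two non-overlapping entity mentions in marker tokens."""
    marked: List[str] = []
    for i, token in enumerate(tokens):
        if i == first[0]:
            marked.append(first_marks[0])
        if i == second[0]:
            marked.append(second_marks[0])
        marked.append(token)
        if i == first[1] - 1:
            marked.append(first_marks[1])
        if i == second[1] - 1:
            marked.append(second_marks[1])
    return marked


sentence = "Aspirin inhibits cyclooxygenase activity".split()
print(" ".join(mark_entities(sentence, first=(0, 1), second=(2, 3))))
# prints: << Aspirin >> inhibits [[ cyclooxygenase ]] activity
```

The marked sentence is then classified as a whole, which is what allows relation classification to reuse the same sentence-level architecture as CLS.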

3.4 Models

To keep things simple, we use minimal task-specific architectures atop BERT-Base and SCIBERT embeddings. Each token is represented as the concatenation of its BERT embedding with a CNN-based character embedding. If the token has multiple BERT subword units, we use the first one. We apply a multilayer BiLSTM to token embeddings. For text classification, we apply a multilayer perceptron on the first and last BiLSTM states. For sequence tagging, we use a CRF on top of the BiLSTM, as done in Ma and Hovy (2016). For dependency parsing, we use the biaffine attention model from Dozat and Manning (2017).
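The rule of keeping only the first subword unit per token amounts to an index selection over the subword sequence. The sketch below is one minimal way to express it; the function names and the `offset` parameter (for example, 1 when a leading [CLS] symbol precedes the sentence) are assumptions made here, and the vectors are stand-ins for real BERT outputs.

```python
from typing import List, Sequence, TypeVar

Vector = TypeVar("Vector")


def first_subword_indices(pieces_per_token: Sequence[Sequence[str]], offset: int = 0) -> List[int]:
    """Position of each token's first subword within the flat subword sequence."""
    indices: List[int] = []
    position = offset
    for pieces in pieces_per_token:
        indices.append(position)
        position += len(pieces)
    return indices


def select_first_subword(
    subword_vectors: Sequence[Vector],
    pieces_per_token: Sequence[Sequence[str]],
    offset: int = 0,
) -> List[Vector]:
    """Keep one vector per original token: the one for its first subword."""
    return [subword_vectors[i] for i in first_subword_indices(pieces_per_token, offset)]


# Three tokens, the second of which is split into three subword units.
pieces = [["the"], ["im", "##mun", "##og"], ["cell"]]
vectors = ["v0", "v1", "v2", "v3", "v4"]  # placeholders for subword embeddings
print(first_subword_indices(pieces))
print(select_first_subword(vectors, pieces))
```

After this selection the sequence length matches the number of original tokens, so the per-token labels used for tagging and parsing line up with the BiLSTM inputs.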

3.5 Task-Specific Training

For simplicity, experiments are performed without any hyperparameter tuning and with fixed BERT weights.⁷ All our models are implemented in AllenNLP (Gardner et al., 2017), which provides an easy interface for using pretrained BERT embeddings. The BERT pretrained models are converted to be compatible with PyTorch using the pytorch-pretrained-BERT library.⁸

3.6 Datasets

We evaluate our models on a suite of well-established NLP datasets spanning multiple scientific domains (Table 2). For brevity, we do not explain the details of older datasets and refer the reader to the corresponding citations. Instead, we briefly describe the newer datasets.

Table 2: The domain and size of each dataset used for evaluation. The totals include train, dev and test splits.

PubMed 20k RCT (Dernoncourt and Lee, 2017) is a dataset of discourse labels (e.g. Background, Method, Results, etc.) for sentences in scientific abstracts. ScienceCite (Cohan et al., 2019) and ACL-ARC (Jurgens et al., 2018) include citation intent labels in scientific papers (e.g. Comparison, Extension, etc.). The SciERC dataset (Luan et al., 2018) contains entities and relations from computer science abstracts. Finally, the Paper Field dataset⁹ contains paper titles mapped to 7 different fields of study and is built from the Microsoft Academic Graph (Sinha et al., 2015).¹⁰

Table 3: Results on all scientific fields, tasks and datasets. Bold indicates the best performing BERT variant, while underline indicates the best overall result. All SCIBERT results are statistically significantly higher than BERT-Base (based on 95% bootstrap confidence intervals) except for ACL-ARC and ScienceCite datasets. All results are the average of multiple runs with different random seeds to control potential non-determinism associated with neural models (Reimers and Gurevych, 2017). Most results are macro F1 scores (span-level for NER, sentence-level for REL and CLS, and token-level for PICO), except ChemProt and PubMed 20k RCT (micro F1 scores). Parsing is evaluated using labeled attachment score (LAS) and unlabeled attachment score (UAS), both reported in two separate rows.

Table 3 summarizes the results of all experiments. Following Devlin et al. (2018), we use the cased models for sequence tagging (NER and PICO) and dependency parsing (DEP). For text classification (CLS and REL), we use the uncased models. All reported results are the average of multiple runs with different random seeds. Except for ACL-ARC and ScienceCite, all SCIBERT results are statistically significantly higher than BERT-Base based on 95% bootstrap confidence intervals.
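The text reports macro and micro F1 scores and bases its significance statements on 95% bootstrap confidence intervals, without describing the procedure further. The sketch below shows one common variant, a paired percentile bootstrap over test examples for the difference in macro F1 between two systems. It should be read as an assumption about how such a test can be set up, not as the procedure actually used for Table 3. The function names and the toy predictions are ours.

```python
import random
from typing import List, Sequence, Tuple


def macro_micro_f1(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float]:
    """Return (macro F1, micro F1) for non-empty single-label predictions."""
    labels = sorted(set(gold) | set(pred))
    tp = dict.fromkeys(labels, 0)
    fp = dict.fromkeys(labels, 0)
    fn = dict.fromkeys(labels, 0)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def f1(true_pos: int, false_pos: int, false_neg: int) -> float:
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)  # mean of per-label F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))    # F1 over pooled counts
    return macro, micro


def bootstrap_diff_ci(
    gold: Sequence[str],
    pred_base: Sequence[str],
    pred_new: Sequence[str],
    n_resamples: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for macro F1(new) - macro F1(base) on resampled test sets."""
    rng = random.Random(seed)
    n = len(gold)
    diffs: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample examples with replacement
        g = [gold[i] for i in idx]
        base = macro_micro_f1(g, [pred_base[i] for i in idx])[0]
        new = macro_micro_f1(g, [pred_new[i] for i in idx])[0]
        diffs.append(new - base)
    diffs.sort()
    tail = (1.0 - confidence) / 2.0
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


# Toy data: the baseline always predicts "a"; the new system is always right.
gold = ["a", "b"] * 20
pred_base = ["a"] * 40
pred_new = list(gold)
lower, upper = bootstrap_diff_ci(gold, pred_base, pred_new)
print(f"95% CI for the macro F1 difference: [{lower:.3f}, {upper:.3f}]")
```

Under this setup, an improvement would be called significant at the 95% level when the interval excludes zero. Macro F1 averages per-label scores and so weights rare labels as heavily as frequent ones, whereas micro F1 pools the counts first.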

4 Results

Biomedical Domain The top part of Table 3 summarizes the results on datasets from the biomedical domain. We observe that SCIBERT always outperforms BERT-Base on biomedical tasks. On average across tasks, SCIBERT has a higher F1 score than BERT-Base (+1.57 with BASEVOCAB and +2.06 with SCIVOCAB). In addition, SCIBERT achieves new state-of-the-art (SOTA) results on the following datasets: BC5CDR (Yoon et al., 2018), EBM-NLP (Nye et al., 2018), and ChemProt (Lim and Kang, 2018). SCIBERT performs slightly worse than the SOTA on JNLPBA (Yoon et al., 2018), PubMed 20k RCT (Jin and Szolovits, 2018), and GENIA (Nguyen and Verspoor, 2019). We suspect performance gaps can be explained by task-specific architectures and hyperparameter tuning used in the SOTA models. For example, the current SOTA results (Jin and Szolovits, 2018) on the PubMed 20k RCT have been obtained using a hierarchical BiLSTM-CRF architecture that also takes neighboring sentences as an important signal for prediction.

¹⁰ https://academic.microsoft.com/

Computer Science Domain The middle part of Table 3 shows the results on datasets from the computer science domain. As shown, SCIBERT always outperforms BERT-Base on computer science datasets. On average across tasks, SCIBERT has a higher F1 score than BERT-Base (+3.08 with BASEVOCAB and +3.25 with SCIVOCAB). In addition, SCIBERT outperforms the SOTA on ACL-ARC (Jurgens et al., 2018) and the NER part of SciERC (Luan et al., 2018). For relations in SciERC, our results are not comparable with those in Luan et al. (2018) because we are performing relation classification given gold entities, while they perform NER and relation extraction jointly.

Multidomain Results

The bottom part of Table 3 shows the results on datasets from multiple scientific domains. As shown, SCIBERT always outperforms BERT-Base on both tasks (+0.32 F1 with BASEVOCAB and +0.62 F1 with SCIVOCAB). In addition, SCIBERT outperforms the SOTA on ScienceCite (Cohan et al., 2019). For the Paper Field dataset, there are no published SOTA results at the time of writing.

5 Discussion

Effect of Vocabulary Table 3 shows better performance when using SCIBERT with SCIVOCAB than with BASEVOCAB. Averaging across tasks, we see an improvement of approximately +0.38 F1. This suggests that retraining the vocabulary could be an important step when retraining BERT embeddings on a new domain. Given an overlap between BASEVOCAB and SCIVOCAB of 42%, this level of improvement seems reasonable.

Effect of Casing

We ran additional experiments to compare cased and uncased vocabularies. Averaging across tasks, we find for SCIBERT with SCIVOCAB that the cased model performs better than the uncased one on sequence tagging and parsing (+0.04 F1) and worse on sentence classification (-0.18 F1). Interestingly, BERT-Base and SCIBERT with BASEVOCAB both show better performance with uncased vocabularies on all tasks.

BIOBERT BIOBERT, a version of BERT fine-tuned on a collection of biomedical text, was published on arXiv by Lee et al. (2019) during the course of our SCIBERT experiments. We also performed preliminary experiments with BIOBERT trained on the same suite of tasks. For controlled experimentation, we use the released pretrained weights¹¹ in the same manner as we did with previous experiments. Compared with BIOBERT, averaged over tasks, SCIBERT achieves +0.51 and +0.89 F1 improvements when using BASEVOCAB and SCIVOCAB, respectively. We observed larger performance gains by SCIBERT over BIOBERT on CS tasks.

6 Conclusion and Future Work

We release SCIBERT, a pretrained contextualized embedding model for scientific text based on BERT. We evaluate SCIBERT on a suite of tasks and datasets from scientific domains. SCIBERT often significantly outperforms BERT-Base and achieves new state-of-the-art results with minimal task-specific architectures and without fine-tuning.

An interesting future line of work would be to evaluate different proportions of papers from each domain, though one consideration would be that these language models are costly to retrain. This also motivates our interest in building a single resource that's useful across multiple domains.

While we achieve significant improvements on many scientific NLP tasks, the absolute performance numbers show that there is still room for improvement in many of these tasks. We are optimistic that SCIBERT will be a helpful resource to foster research in the scientific NLP domain.

¹ https://github.com/google/sentencepiece

² We observed significantly worse performance when training only on abstracts compared with full text.

³ https://github.com/allenai/science-parse

⁴ https://github.com/allenai/SciSpaCy

⁵ https://github.com/google-research/bert

⁶ BERT's largest model was trained on 16 Cloud TPUs for 4 days while on an 8-GPU machine, it is expected to take 40-

⁷ We've found that fine-tuning BERT weights results in 2.5x slower training times on average.

⁸ https://github.com/huggingface/pytorch-pretrained-b

⁹ The corresponding paper to this dataset is not yet published at the time of writing. We will update the paper with the corresponding citation once it becomes available.

¹¹ https://github.com/naver/biobert-pretrained