ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Mark Neumann
Daniel King
Iz Beltagy
Waleed Ammar
BioNLP@ACL
2019
View in Semantic Scholar

Abstract

Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.

1 Introduction

The publication rate in the medical and biomedical sciences is growing at an exponential rate (Bornmann and Mutz, 2014) . The information overload problem is widespread across academia, but is particularly apparent in the biomedical sciences, where individual papers may contain specific discoveries relating to a dizzying variety of genes, drugs, and proteins. In order to cope with the sheer volume of new scientific knowledge, there have been many attempts to automate the process of extracting entities, relations, protein interactions and other structured knowledge from scientific papers Ammar et al., 2018; Poon et al., 2014) .

Although there exists a wealth of tools for processing biomedical text, many focus primarily on entity linking, negation detection and abbreviation detection. MetaMap and MetaMapLite (Aronson, 2001; Demner-Fushman et al., 2017) , the two most widely used and supported tools for biomedical text processing, consider additional features, such as negation detection and acronym resolution. However, tools which cover more classical natural language processing (NLP) tasks such as the GENIA tagger and phrase structure parsers such as those presented in (McClosky and Charniak, 2008) typically do not make use of new research innovations such as word representations or neural networks.

In this paper, we introduce scispaCy, a specialized NLP library for processing biomedical texts which builds on the robust spaCy library, 1 and document its performance relative to state of the art models for part of speech (POS) tagging, dependency parsing, named entity recognition (NER) and sentence segmentation. Specifically, we:

• Release a reformatted version of the GENIA 1.0 (Kim et al., 2003) with the original text from the PubMed abstracts.

• Benchmark 9 named entity recognition models for more specific entity extraction applications demonstrating competitive performance when compared to strong baselines.

• Release and evaluate two fast and convenient pipelines for biomedical text, which include tokenization, part of speech tagging, dependency parsing and named entity recognition.

2 Overview of (sci)spaCy

In this section, we briefly describe the models used in the spaCy library and describe how we build on them in scispaCy.

spaCy. The spaCy library (Honnibal and Montani, 2017) 2 provides a variety of practical tools for text processing in multiple languages. Their models have emerged as the defacto standard for practical NLP due to their speed, robustness and close to state of the art performance. As the spaCy models are popular and the spaCy API is widely known to many potential users, we choose to build upon the spaCy library for creating a biomedical text processing pipeline.

scispaCy. Our goal is to develop scispaCy as a robust, efficient and performant NLP library to satisfy the primary text processing needs in the biomedical domain. In this release of scispaCy, we retrain spaCy 3 models for POS tagging, dependency parsing, and NER using datasets relevant to biomedical text, and enhance the tokenization module with additional rules. scispaCy contains two core released packages: en core sci sm and en core sci md. Models in the en core sci md package have a larger vocabulary and include word vectors, while those in en core sci sm have a smaller vocabulary and do not include word vectors, as shown in Table 1 .

Table 1: Vocabulary statistics for the two core packages in scispaCy.

Processing Speed. To emphasize the efficiency and practical utility of the end-to-end pipeline provided by scispaCy packages, we perform a speed comparison with several other publicly available processing pipelines for biomedical text using 10k randomly selected PubMed abstracts. We report results with and without segmenting the abstracts into sentences since some of the libraries (e.g., GENIA tagger) are designed to operate on sentences. As shown in Table 2 , both models released in scispaCy demonstrate competitive speed to pipelines written in C++ and Java, languages designed for production settings.

Table 2: Wall clock comparison of different publicly available biomedical NLP pipelines. All experiments run on a single machine with 12 Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz and 62GB RAM. For the Biaffine Parser, a pre-compiled Tensorflow binary with support for AVX2 instructions was used in a good faith attempt to optimize the implementation. Dynet does support the Intel MKL, but requires compilation from scratch and as such, does not represent an “off the shelf” system. TF is short for Tensorflow.

Whilst scispaCy is not as fast as pipelines designed for purely production use-cases (e.g., NLP4J), it has the benefit of straightforward integration with the large ecosystem of Python libraries for machine learning and text processing. Although the comparison in Table 2 is not an apples to apples comparison with other frameworks (different tasks, implementation languages etc), it is useful to understand scispaCy's runtime in the context of other pipeline components. Running scispaCy models in addition to standard Entity Linking software such as MetaMap would result in only a marginal increase in overall runtime.

In the following section, we describe the POS taggers and dependency parsers in scispaCy.

The joint POS tagging and dependency parsing model in spaCy is an arc-eager transition-based parser trained with a dynamic oracle, similar to (Goldberg and Nivre, 2012) . Features are CNN representations of token features and shared across all pipeline models (Kiperwasser and Goldberg, 2016; Zhang and Weiss, 2016) . Next, we describe the data we used to train it in scispaCy.

3.1 Datasets GENIA 1.0 Dependencies. To train the dependency parser and part of speech tagger in both released models, we convert the treebank of (Mc-Closky and Charniak, 2008), 4 which is based on the GENIA 1.0 corpus (Kim et al., 2003) , to Universal Dependencies v1.0 using the Stanford Dependency Converter (Schuster and Manning, 2016) . As this dataset has POS tags annotated, we use it to train the POS tagger jointly with the dependency parser in both released models.

As we believe the Universal Dependencies converted from the original GENIA 1.0 corpus are generally useful, we have released them as a separate contribution of this paper. 5 In this data release, we also align the converted dependency parses to their original text spans in the raw, untokenized abstracts from the original release, 6 and include the PubMed metadata for the abstracts which was discarded in the GENIA corpus released by McClosky and Charniak (2008) . We hope that this raw format can emerge as a resource for practical evaluation in the biomedical domain of core NLP tasks such as tokenization, sentence segmentation and joint models of syntax.

Finally, we also retrieve from PubMed the original metadata associated with each abstract. This includes relevant named entities linked to their Medical Subject Headings (MeSH terms) as well as chemicals and drugs linked to a variety of ontologies, as well as author metadata, publication dates, citation statistics and journal metadata. We hope that the community can find interesting problems for which such natural supervision can be used. OntoNotes 5.0. To increase the robustness of the dependency parser and POS tagger to generic text, we make use of the OntoNotes 5.0 corpus 7 when training the dependency parser and part of speech tagger (Weischedel et al., 2011; Hovy et al., 2006) . The OntoNotes corpus consists of multiple genres of text, annotated with syntactic and semantic information, but we only use POS and dependency parsing annotations in this work.

3.2 Experiments

We compare our models to the recent survey study of dependency parsing and POS tagging for biomedical data (Nguyen and Verspoor, 2018) in Tables 3 and 4 . POS tagging results show that both models released in scispaCy are competitive with state of the art systems, and can be considered of equivalent practical value. In the case of dependency parsing, we find that the Biaffine parser of (Dozat and Manning, 2016) outperforms the scis-paCy models by a margin of 2-3%. However, as demonstrated in Table 2 , the scispaCy models are approximately 9x faster due to the speed optimizations in spaCy.

Figure 1: Growth of the annual number of cited references from 1650 to 2012 in the medical and health sciences (citing publications from 1980 to 2012). Figure from (Bornmann and Mutz, 2014).

Robustness to Web Data. A core principle of the scispaCy models is that they are useful on a wide variety of types of text with a biomedical focus, such as clinical notes, academic papers, clinical trials reports and medical records. In order to make our models robust across a wider range of domains more generally, we experiment with incorporating training data from the OntoNotes 5.0 corpus when training the dependency parser and POS tagger. Figure 2 demonstrates the effectiveness of adding increasing percentages of web data, showing substantially improved performance on OntoNotes, at no reduction in performance on biomedical text. Note that mixing in web text during training has been applied to previous systems -the GENIA Tagger also employs this technique.

Figure 2: Unlabeled attachment score (UAS) performance for an en core sci md model trained with increasing amounts of web data incorporated. Table shows mean of 3 random seeds.

4 Named Entity Recognition

The NER model in spaCy is a transition-based system based on the chunking model from (Lample et al., 2016). Tokens are represented as hashed, embedded representations of the prefix, suffix, shape and lemmatized features of individual words. Next, we describe the data we used to train NER models in scispaCy.

4.1 Datasets

The main NER model in both released packages in scispaCy is trained on the mention spans in the MedMentions dataset (Murty et al., 2018) . Since the MedMentions dataset was originally designed for entity linking, this model recognizes a wide variety of entity types, as well as non-standard syntactic phrases such as verbs and modifiers, but the model does not predict the entity type. In order to provide for users with more specific requirements around entity types, we release four additional packages en ner {bc5cdr|craft |jnlpba|bionlp13cg} md with finer-grained NER models trained on BC5CDR (for chemicals and diseases; , CRAFT (for cell types, chemicals, proteins, genes; Bada et al., 2011), JNLPBA (for cell lines, cell types, DNAs, RNAs, proteins; Collier and Kim, 2004) and BioNLP13CG (for cancer genetics; Pyysalo et al., 2015), respectively.

Table 3: Part of Speech tagging results on the GENIA Test set.

Table 4: Dependency Parsing results on the GENIA 1.0 corpus converted to dependencies using the Stanford Universal Dependency Converter.

4.2 Experiments

As NER is a key task for other biomedical text processing tasks, we conduct a through evaluation of the suitability of scispaCy to provide baseline performance across a wide variety of datasets. In particular, we retrain the spaCy NER model on each of the four datasets mentioned earlier (BC5CDR, CRAFT, JNLPBA, BioNLP13CG) as well as five more datasets in Crichton et al. 2017: AnatEM, BC2GM, BC4CHEMD, Linnaeus, NCBI-Disease. These datasets cover a wide variety of entity types required by different biomedical domains, including cancer genetics, disease-drug interactions, pathway analysis and trial population extraction. Additionally, they vary considerably in size and number of entities. For example, BC4CHEMD (Krallinger et al., 2015) has 84,310 annotations while Linnaeus (Gerner et al., 2009) only has 4,263. BioNLP13CG (Pyysalo et al., 2015) annotates 16 entity types while five of the datasets only annotate a single entity type. 8 Table 5 provides a through comparison of the scispaCy NER models compared to a variety of models. In particular, we compare the models to strong baselines which do not consider the use of 1) multi-task learning across multiple datasets and 2) semi-supervised learning via large pretrained language models. Overall, we find that the scis-paCy models are competitive baselines for 5 of the 9 datasets.

Table 5: Test F1 Measure on NER for the small and medium scispaCy models compared to a variety of strong baselines and state of the art models. The Baseline and SOTA (State of the Art) columns include only single models which do not use additional resources, such as language models, or additional sources of supervision, such as multi-task learning. + Resources allows any type of supervision or pretraining. All scispaCy results are the mean of 5 random seeds.

Additionally, in Table 6 we evaluate the recall of the pipeline mention detector available in both scispaCy models (trained on the MedMentions dataset) against all 9 specialised NER datasets. Overall, we observe a modest drop in average recall when compared directly to the MedMentions results in Table 7 , but considering the diverse domains of the 9 specialised NER datasets, achieving this level of recall across datasets is already nontrivial. Table 7 : Performance of the base mention detector on the MedMentions Corpus.

Table 6: Recall on the test sets of 9 specialist NER datasets, when the base mention detector is trained on MedMentions. The base mention detector is available in both en core sci sm and en core sci md models.

Table 7: Performance of the base mention detector on the MedMentions Corpus.

5 Sentence Segmentation And Citation Handling

Accurate sentence segmentation is required for many practical applications of natural language processing. Biomedical data presents many difficulties for standard sentence segmentation algorithms: abbreviated names and noun compounds containing punctuation are more common, whilst the wide range of citation styles can easily be misidentified as sentence boundaries. We evaluate sentence segmentation using both sentence and full-abstract accuracy when segmenting PubMed abstracts from the raw, untokenized GENIA development set (the Sent/Abstract columns in Table 8 ).

Table 8: Sentence segmentation performance for the core spaCy and scispaCy models. cs = custom rule based sentence segmenter and ct = custom rule based tokenizer, both designed explicitly to handle citations and common patterns in biomedical text.

Additionally, we examine the ability of the segmentation learned by our model to generalise to the body text of PubMed articles. Body text is typically more complex than abstract text, but in particular, it contains citations, which are considerably less frequent in abstract text. In order to examine the effectiveness of our models in this scenario, we design the following synthetic experiment. Given sentences from (Anonymous, 2019) 9 which were originally designed for citation intent prediction, we run these sentences individually through our models. As we know that these sentences should be single sentences, we can simply count the frequency with which our models segment the individual sentences containing citations into multiple sentences (the Citation column in Table 8 ).

As demonstrated by Table 8 , training the dependency parser on in-domain data (both the scis-paCy models) completely obviates the need for rule-based sentence segmentation. This is a positive result -rule based sentence segmentation is a brittle, time consuming process, which we have replaced with a domain specific version of an existing pipeline component.

Both scispaCy models are released with the custom tokeniser, but without a custom sentence segmenter by default. (Bada et al., 2011) 79.55 --72.31 76.17 JNLPBA (Collier and Kim, 2004) 68.95 73.48 b 75.50 bb 71.78 73.21 BioNLP13CG (Pyysalo et al., 2015) 76.74 --72.98 77.60 AnatEM (Pyysalo and Ananiadou, 2014) 88.55 91.61 ** -80.13 84.14 BC2GM (Smith et al., 2008) 84.41 80.51 b 81.69 bb 75.77 78.30 BC4CHEMD (Krallinger et al., 2015) 82.32 88.75 a 89.37 aa 82.24 84.55 Linnaeus (Gerner et al., 2009) 79.33 95. Medicine focus specifically on entity linking using the Unified Medical Language System (UMLS) (Bodenreider, 2004) as a knowledge base. (Buyko et al.) adapt Apache OpenNLP using the GENIA corpus, but their system is not openly available and is less suitable for modern, Python-based workflows. The GENIA Tagger provides the closest comparison to scispaCy due to it's multi-stage pipeline, integrated research contributions and production quality runtime. We improve on the GENIA Tagger by adding a full dependency parser rather than just noun chunking, as well as improved results for NER without compromising significantly on speed. In more fundamental NLP research, the GENIA corpus (Kim et al., 2003) has been widely used to evaluate transfer learning and domain adaptation. (McClosky et al., 2006) demonstrate the effectiveness of self-training and parse re-ranking for domain adaptation. (Rimell and Clark, 2008 ) adapt a CCG parser using only POS and lexical categories, while (Joshi et al., 2018 ) extend a neural phrase structure parser trained on web text to the biomedical domain with a small number of partially annotated examples. These papers focus mainly of the problem of domain adaptation itself, rather than the objective of obtaining a robust, high-performance parser using existing resources.

Model

NLP techniques, and in particular, distant supervision have been employed to assist the curation of large, structured biomedical resources. (Poon et al., 2015) extract 1.5 million cancer path-way interactions from PubMed abstracts, leading to the development of Literome (Poon et al., 2014) , a search engine for genic pathway interactions and genotype-phenotype interactions. A fundamental aspect of (Valenzuela-Escarcega et al., 2018; Poon et al., 2014) is the use of hand-written rules and triggers for events based on dependency tree paths; the connection to the application of scispaCy is quite apparent.

7 Conclusion

In this paper we presented several robust model pipelines for a variety of natural language processing tasks focused on biomedical text. The scis-paCy models are fast, easy to use, scalable, and achieve close to state of the art performance. We hope that the release of these models enables new applications in biomedical information extraction whilst making it easy to leverage high quality syntactic annotation for downstream tasks. Additionally, we released a reformatted GENIA 1.0 corpus augmented with automatically produced Universal Dependency annotations and recovered and aligned original abstract metadata.

https://nlp.stanford.edu/˜mcclosky/ biomedical.html 5 Available at https://github.com/allenai/ genia-dependency-trees 6 Available at http://www.geniaproject.org/

Instructions for download at http://cemantix. org/data/ontonotes.html

For a detailed discussion of the datasets and their creation, we refer the reader to https://github.com/ cambridgeltl/MTL-Bioinformatics-2016/ blob/master/Additional%20file%201.pdf

Paper currently under review.