Extracting a Knowledge Base of Mechanisms from COVID-19 Papers

Authors

  • Aida Amini
  • Tom Hope
  • David Wadden
  • Madeleine van Zuylen
  • E. Horvitz
  • Roy Schwartz
  • Hannaneh Hajishirzi
  • NAACL 2021

Abstract

The COVID-19 pandemic has spawned a diverse body of scientific literature that is challenging to navigate, stimulating interest in automated tools to help find useful knowledge. We pursue the construction of a knowledge base (KB) of mechanisms—a fundamental concept across the sciences, which encompasses activities, functions and causal relations, ranging from cellular processes to economic impacts. We extract this information from the natural language of scientific papers by developing a broad, unified schema that strikes a balance between relevance and breadth. We annotate a dataset of mechanisms with our schema and train a model to extract mechanism relations from papers. Our experiments demonstrate the utility of our KB in supporting interdisciplinary scientific search over COVID-19 literature, outperforming the prominent PubMed search in a study with clinical experts. Our search engine, dataset and code are publicly available.

Our goal is to provide scientists with structured knowledge and to accelerate exploration and discovery. In this work we extract relations capturing a broad notion of mechanisms in CORD-19 papers, spanning mechanisms as diverse as psychological intervention techniques, computational algorithms, and molecular mechanisms of viral cell entry. This unified view of natural and artificial mechanisms helps generalize across the CORD-19 corpus and is designed to scale the study of the many different types of processes, activities and functions described in the dataset.

We collect a set of annotations from domain experts for direct mechanisms (operations and functions explicitly described in the text) and indirect mechanisms (observed effects and interactions without an explicit description of a direct functional relation). For example, descriptions of the mechanism by which the SARS-CoV-2 virus binds to cells, or of a diagnostic procedure based on computer vision, are considered direct mechanisms. Indirect mechanisms, by contrast, include observed links between COVID-19 and certain symptoms, with no explicit mention of the functional process leading from the disease to the symptoms. This distinction between direct and indirect relations is inspired by a review of biomedical and scientific ontologies (e.g., direct and indirect regulation of proteins by chemicals).
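
To make the direct/indirect distinction concrete, here is a minimal sketch of how such a relation could be represented. The class, field names, and example spans are illustrative assumptions, not the paper's released schema or data.

```python
from dataclasses import dataclass
from enum import Enum

class MechanismType(Enum):
    DIRECT = "direct"      # operation or function explicitly described in the text
    INDIRECT = "indirect"  # observed effect or association, no explicit functional description

@dataclass
class MechanismRelation:
    arg0: str              # free-form text span (e.g., the acting entity or cause)
    arg1: str              # free-form text span (e.g., the affected entity or effect)
    relation: MechanismType
    sentence: str          # source sentence from the paper

# Hypothetical examples of the two relation types:
direct = MechanismRelation(
    arg0="SARS-CoV-2 spike protein",
    arg1="ACE2 receptor on host cells",
    relation=MechanismType.DIRECT,
    sentence="The spike protein binds the ACE2 receptor to mediate cell entry.",
)
indirect = MechanismRelation(
    arg0="COVID-19",
    arg1="loss of smell",
    relation=MechanismType.INDIRECT,
    sentence="Patients with COVID-19 frequently report loss of smell.",
)
```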

We allow annotators to select free-form text spans as the arguments in our mechanism relations; this contrasts with many existing datasets of annotated scientific relations, which are often entity-centric (e.g., protein-chemical interactions). We do so in order to capture, in a scalable way, the complexity and diversity of the many concepts and ideas described in the corpus. Because the task admits multiple "correct" annotations of complex and diverse spans, we conduct a multi-round annotation process with final adjudication by a domain expert experienced in bioNLP annotation.
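
As an illustration of the contrast with entity-centric annotation, the record below sketches what a span-based annotation with character offsets might look like. The JSON layout, field names, and example sentence are hypothetical, not the format of the released dataset.

```python
import json

# A hypothetical annotation record: arguments are arbitrary character spans
# in the sentence, not links to a fixed entity vocabulary.
record = {
    "sentence": "Telehealth consultations reduced in-person visits during lockdown.",
    "relations": [
        {
            "type": "DIRECT",
            "arg0": {"text": "Telehealth consultations", "start": 0, "end": 24},
            "arg1": {"text": "in-person visits", "start": 33, "end": 49},
        }
    ],
}

# Sanity-check that the offsets really point at the annotated spans.
for rel in record["relations"]:
    for arg in ("arg0", "arg1"):
        span = rel[arg]
        assert record["sentence"][span["start"]:span["end"]] == span["text"]

print(json.dumps(record, indent=2))
```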

Our annotations are used in combination with existing datasets from different domains to train a relation extraction model. We map previously introduced scientific datasets into our schema, selecting only direct and indirect mechanism relations (e.g., DIRECT UP-REGULATION in the ChemProt dataset) and unifying relation labels into our typology with the help of a domain expert. Our results show that our model outperforms baselines, including OpenIE and SRL, as well as supervised models trained on related scientific IE datasets in the biomedical and computer science domains. To organize the extracted relations, we fine-tune a biomedical language model to capture semantic similarity, build a graph of similar mechanisms, and induce concepts by finding cliques. To support search over our KB, we use the same language model to retrieve relations similar to the query. To help boost community efforts, we release our curated data and models, as well as a large-scale knowledge graph of extracted mechanisms.
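
The following sketch illustrates the similarity-graph and search steps described above under stated assumptions: relation spans are embedded, mutually similar spans are linked, concepts are induced as maximal cliques, and queries are answered by nearest-neighbor retrieval with the same encoder. The encoder name, similarity threshold, and example spans are placeholders, not the paper's actual fine-tuned biomedical model or settings.

```python
import networkx as nx
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Toy free-form mechanism spans standing in for extracted relation arguments.
spans = [
    "spike protein binding to ACE2",
    "viral spike attachment to the ACE2 receptor",
    "CT-based screening for pneumonia",
    "chest CT classification of COVID-19 pneumonia",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
emb = model.encode(spans, normalize_embeddings=True)
sim = cosine_similarity(emb)

# Connect sufficiently similar spans, then induce "concepts" as maximal
# cliques of mutually similar spans.
threshold = 0.6  # placeholder value
g = nx.Graph()
g.add_nodes_from(range(len(spans)))
for i in range(len(spans)):
    for j in range(i + 1, len(spans)):
        if sim[i, j] >= threshold:
            g.add_edge(i, j)
concepts = [[spans[i] for i in clique] for clique in nx.find_cliques(g)]
print(concepts)

# Search: embed the query with the same encoder and rank spans by similarity.
query = "how does the virus enter host cells?"
q = model.encode([query], normalize_embeddings=True)
scores = (emb @ q.T).ravel()
for idx in np.argsort(-scores)[:3]:
    print(f"{scores[idx]:.2f}  {spans[idx]}")
```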