Ontology-Aware Clinical Abstractive Summarization

Sean MacAvaney
Sajad Sotudeh
Arman Cohan
Nazli Goharian
Ish A. Talati
Ross W. Filice
SIGIR
2019

Abstract

Automatically generating accurate summaries from clinical reports could save a clinician's time, improve summary coverage, and reduce errors. We propose a sequence-to-sequence abstractive summarization model augmented with domain-specific ontological information to enhance content selection and summary generation. We apply our method to a dataset of radiology reports and show that it significantly outperforms the current state-of-the-art on this task in terms of ROUGE scores. Extensive human evaluation conducted by a radiologist further indicates that this approach yields summaries that are less likely to omit important details, without sacrificing readability or accuracy.

1 Introduction

Clinical note summaries are critical to the clinical process. After writing a detailed note about a clinical encounter, practitioners often write a short summary called an impression (example shown in Figure 1). This summary is important because it is often the primary document of the encounter considered when reviewing a patient's clinical history. The summary allows for a quick view of the most important information from the report. Automated summarization of clinical notes could save clinicians' time, and has the potential to capture important aspects of the note that the author might not have considered [7]. If high-quality summaries are generated frequently, the practitioner may only need to review the summary and occasionally make minor edits.

Recently, neural abstractive summarization models have shown successful results [1, 11, 13, 14]. While promising in general domains, existing abstractive models can suffer from deficiencies in content accuracy and completeness [18], which is a critical issue in the medical domain. For instance, when summarizing a clinical note, it is crucial to include all the main diagnoses in the summary accurately. To overcome this challenge, we propose an extension to the pointer-generator model [14] that incorporates domain-specific knowledge for more accurate content selection. Specifically, we link entities in the clinical text with a domain-specific medical ontology (e.g., RadLex¹ or UMLS²), and encode them into a separate context vector, which is then used to aid the generation process. We train and evaluate our proposed model on a large collection of real-world radiology findings and impressions from a large urban hospital, MedStar Georgetown University Hospital. Results using the ROUGE evaluation metric indicate statistically significant improvements over existing state-of-the-art summarization models. Further extensive human evaluation by a radiology expert demonstrates that our method produces more complete summaries than the top-performing baseline, while not sacrificing readability or accuracy.

* Both authors contributed equally to this research.

Figure 1. Not extracted; please refer to original document.

In summary, our contributions are: 1) An approach for incorporating domain-specific information into an abstractive summarization model, allowing for domain-informed decoding; and 2) Extensive automatic and human evaluation on a large collection of radiology notes, demonstrating the effectiveness of our model and providing insights into the qualities of our approach.

1.1 Related Work

Recent trends in abstractive summarization are based on sequence-to-sequence (seq2seq) neural networks with the incorporation of attention [13], a copying mechanism [14], a reinforcement learning objective [8, 12], and coverage tracking [14]. While successful, a few recent studies have shown that neural abstractive summarization models can have high readability, but fall short in generating accurate and complete content [6, 18]. Content accuracy is especially crucial in the medical domain. In contrast with prior work, we focus on improving summary completeness using a medical ontology. Gigioli et al. [8] used a reinforced loss for abstractive summarization in the medical domain, although their focus was headline generation from medical literature abstracts. Here, we focus on summarization of clinical notes, where content accuracy and completeness are more critical. The most relevant work to ours is by Zhang et al. [19], where an additional section from the radiology report (background) is used to improve summarization. Extensive automated and human evaluation and analyses demonstrate the benefits of our proposed model in comparison with existing work.

2 Model

Pointer-generator network (PG). Standard neural approaches for abstractive summarization follow the seq2seq framework, where an encoder network reads the input and a separate decoder network (often augmented with an attention mechanism) learns to generate the summary [17]. Bidirectional LSTMs (BiLSTMs) [9] are often used as the encoder and decoder. A more recent successful summarization model, the pointer-generator network, allows the decoder to also directly copy text from the input in addition to generation [14]. Given a report $x = \{x_1, x_2, \ldots, x_n\}$, the encoded input sequence $h = \mathrm{BiLSTM}(x)$, and the current decoding state $s_t = \mathrm{BiLSTM}(x')[t]$, where $x'$ is the input to the decoder (i.e., the gold-standard summary token at training time or the previously generated token at inference time), the model computes the attention weights over the input terms, $a = \mathrm{softmax}(h^\top W_1 s^\top)$. The attention scores are used to compute a context vector $c$, a weighted sum over the input, $c = \sum_{i=1}^{n} a_i h_i$, which is used along with the output of the decoder BiLSTM to either generate the next term from a known vocabulary or copy the token from the input sequence with the highest attention value. We refer the reader to See et al. [14] for additional details on the pointer-generator architecture.
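The attention and context-vector computation described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy dimensions, and the toy values are assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(h, W1, s):
    """Return (a, c): attention weights over the encoder states and the context vector.

    h  : list of n encoder states, each a list of d_h floats
    W1 : d_h x d_s weight matrix, given as a list of rows
    s  : current decoder state, a list of d_s floats
    """
    # Project the decoder state into the encoder space: W1 s
    proj = [sum(w * sj for w, sj in zip(row, s)) for row in W1]
    # One score per input position: h_i . (W1 s)
    scores = [sum(hk * pk for hk, pk in zip(h_i, proj)) for h_i in h]
    a = softmax(scores)
    # Context vector c = sum_i a_i h_i
    c = [sum(a[i] * h[i][k] for i in range(len(h))) for k in range(len(h[0]))]
    return a, c


# Toy values (hypothetical): three input positions, 2-dimensional states.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]
s = [2.0, 0.0]
a, c = attention_context(h, W1, s)
```

In this toy case the first and third positions receive equal, larger weights because their states align with the decoder state, and the context vector is pulled toward them.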

Ontology-aware pointer-generator (Ontology PG). In this work, we propose an extension of the pointer-generator network that allows us to leverage domain-specific knowledge encoded in an ontology to improve clinical summarization. We introduce a new encoded sequence $u = \{u_1, \ldots, u_{n'}\}$ that is the result of linking an ontology $U$ to the input text. In other words, $u = F_U(x)$, where $F_U$ is a mapping function, e.g., a simple mapping function that only outputs a word sequence if it appears in the ontology and otherwise skips it. We then use a second BiLSTM to encode these additional ontology terms, similar to the way the original input is encoded: $h_u = \mathrm{BiLSTM}(u)$. We then calculate an additional context vector $c'$, which includes the domain-ontology information:

EQUATION (1): Not extracted; please refer to original document.
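As an illustration of the simple mapping function $F_U$ described above (emit a word sequence only if it appears in the ontology, otherwise skip it), the following is a minimal sketch. The function name `map_to_ontology`, the longest-match-first strategy, the `max_n` limit, and the three-term toy ontology are all assumptions, not details from the paper.

```python
def map_to_ontology(tokens, ontology_terms, max_n=3):
    """Keep only the word sequences (up to max_n tokens long) found in the ontology.

    tokens         : list of input tokens x_1..x_n
    ontology_terms : set of lowercase ontology terms (a toy stand-in for U)
    """
    matched = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at position i first.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in ontology_terms:
                matched.extend(tokens[i:i + n])
                i += n
                break
        else:
            # No ontology term starts here, so the token is skipped.
            i += 1
    return matched


# Hypothetical toy ontology and input sentence.
toy_ontology = {"pleural effusion", "lung", "pneumothorax"}
tokens = "There is a small pleural effusion in the left lung".split()
linked = map_to_ontology(tokens, toy_ontology)
```

The resulting sequence plays the role of $u$ and would then be fed to the second encoder; a fuzzy matcher could replace the exact lookup without changing the interface.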

The second context vector acts as additional global information to aid the decoding process, and is akin to how Zhang et al. [19] include background information from the report. We modify the decoder BiLSTM to include the ontology-aware context vector in the decoding process. Recall that an LSTM network controls the flow of its previous state and the current input using several gates (input gate $i$, forget gate $f$, and output gate $o$), where each of these gates is a vector calculated according to an additive combination of the previous LSTM state and the current input. For example, for the forget gate we have:

$$f_t = \tanh(W_f [s_{t-1}; x'_t] + b)$$

where $s_{t-1}$ is the previous decoder state, $x'_t$ is the decoder input, and ";" denotes concatenation (for more details on LSTMs, refer to [9]). The ontology-aware context vector $c'$ is passed as additional input to this function for all the LSTM gates; e.g., for the forget gate we have:

$$f_t = \tanh(W_f [s_{t-1}; x'_t; c'] + b)$$

This intuitively guides the information flow in the decoder using the ontology information.
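The effect of concatenating the ontology-aware context vector into a gate can be sketched as follows. The sketch follows the formula as written in the text, which uses tanh; standard LSTM formulations typically use a sigmoid for gates, so the activation is left as a parameter. The function name `gate` and all toy weights are assumptions for illustration.

```python
import math


def gate(W, b, s_prev, x_t, c_onto=None, activation=math.tanh):
    """Compute one gate vector from the concatenation [s_prev; x_t; c_onto].

    W      : weight matrix as a list of rows, one row per gate unit
    b      : bias vector, one entry per gate unit
    c_onto : ontology-aware context vector c', or None for the unmodified gate
    """
    z = list(s_prev) + list(x_t) + (list(c_onto) if c_onto is not None else [])
    if any(len(row) != len(z) for row in W):
        raise ValueError("each row of W must match the concatenated input length")
    return [activation(sum(w * v for w, v in zip(row, z)) + b_k)
            for row, b_k in zip(W, b)]


# Toy values (hypothetical): 1-dimensional state, input, and ontology context.
s_prev, x_t, c_onto = [0.5], [1.0], [2.0]
f_base = gate([[1.0, -1.0]], [0.0], s_prev, x_t)                # without c'
f_onto = gate([[1.0, -1.0, 0.25]], [0.0], s_prev, x_t, c_onto)  # with c'
```

The only structural change is that the weight matrix gains extra columns for $c'$, so the ontology signal can shift every gate at every decoding step.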

3 Experimental Setup

We train and evaluate our model on a dataset of 41,066 real-world radiology reports from MedStar Georgetown University Hospital, covering a variety of imaging modalities (e.g., x-rays and CT scans). The dataset is randomly split into 80-10-10 train-dev-test splits. Each report contains clinical findings about a specific diagnostic case and an impression summary (as shown in Figure 1). The findings sections are 136.6 tokens long on average, and the impression sections are 37.1 tokens long on average. Performing cross-institutional evaluation is challenging and beyond the scope of this work due to the varying nature of reports between institutions. For instance, the public Indiana University radiology dataset [4] consists only of chest x-rays, and has much shorter reports (average length of findings: 40.0 tokens; average length of impressions: 10.5 tokens). Thus, in this work, we focus on summarization within a single institution.
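A random 80-10-10 split of this kind could be produced along the following lines. The seed, the function name, and the rounding behavior are assumptions; the paper does not report the exact split sizes, so the sizes this sketch yields should not be read as the authors' numbers.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle the items and split them into train, dev, and test portions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_dev = int(0.1 * len(items))
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the test portion
    return train, dev, test


# Report identifiers are stand-ins; only the total count comes from the text.
train, dev, test = split_80_10_10(range(41066))
```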

Ontologies. We employ two ontologies in this work. UMLS is a general medical ontology maintained by the US National Library of Medicine and includes various procedures, conditions, symptoms, body parts, etc. We use QuickUMLS [15] (a fuzzy UMLS concept matcher) with a Jaccard similarity threshold of 0.7 and a window size of 3 to extract UMLS concepts from the radiology findings. We also evaluate using an ontology focused on radiology, RadLex, which is a widely-used ontology of radiological terms maintained by the Radiological Society of North America. It consists of 68,534 radiological concepts organized according to a hierarchical structure. We use exact n-gram matching to find important radiological entities, only considering RadLex concepts at a depth of 8 or greater.³ In pilot studies, we found that the entities between depths 8 and 20 tend to represent concrete entities (e.g., 'thoracolumbar spine region') rather than abstract categories (e.g., 'anatomical entity').
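The depth filter on the RadLex hierarchy can be pictured with a small sketch. The child-to-parent dictionary below is an invented toy hierarchy, not RadLex data, and the convention that the root has depth 0 is an assumption (the paper does not state how depth is counted). The threshold is lowered to 2 so that the toy example has something to keep.

```python
def concept_depth(concept, parent_of):
    """Number of edges from a concept up to the root (assumes a tree, no cycles)."""
    depth = 0
    while concept in parent_of:
        concept = parent_of[concept]
        depth += 1
    return depth


def concepts_at_min_depth(parent_of, min_depth):
    """Return every concept whose depth is at least min_depth."""
    concepts = set(parent_of) | set(parent_of.values())
    return {c for c in concepts if concept_depth(c, parent_of) >= min_depth}


# Invented toy hierarchy (child -> parent); the real RadLex tree is far deeper.
parent_of = {
    "anatomical entity": "entity",
    "spine region": "anatomical entity",
    "thoracolumbar spine region": "spine region",
}
concrete = concepts_at_min_depth(parent_of, min_depth=2)
```

With the real ontology, the same filter with `min_depth=8` would yield the term set used for exact n-gram matching.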

Comparison. We compare our model to well-established extractive baselines as well as state-of-the-art abstractive summarization models.

- Background-Aware Pointer-Generator (Back. PG) [19]: An extension of PG that is specifically designed to improve radiology note summarization by encoding the Background section of the report to aid the decoding process.⁵

Parameters and training. We use 100-dimensional GloVe embeddings pre-trained over a large corpus of 4.5 million radiology reports [19], a 2-layer BiLSTM encoder with a hidden size of 100, and a 1-layer LSTM decoder with a hidden size of 200. At inference time, we use beam search with a beam size of 5. We use a dropout of 0.5 in all models, and train to optimize negative log-likelihood loss using the Adam optimizer [10] with a learning rate of 0.001.

Figure 2: Average attention weight comparison between our approach (RadLex PG) and the baseline (PG). Color differences show to which term each model attends more while generating the summary. RadLex concepts of depth 8 or lower are marked with *. Our approach attends to more RadLex terms throughout the document, allowing for more complete summaries.

Table 1: ROUGE results on MedStar Georgetown University Hospital’s development and test sets. Both the UMLS and RadLex ontology PG models are statistically better than the other models (paired t-test, p < 0.05).

4 Results and Analysis

4.1 Experimental Results

Table 1 presents ROUGE evaluation results of our model compared with the baselines (as compared to human-written impressions). The extractive summarization methods (LexRank and LSA) perform particularly poorly. This may be because these approaches are limited to simply selecting sentences from the text, and the most central sentences may not be the most important for building an effective impression summary. Interestingly, the Back. PG approach (which uses the background section of the report to guide the decoding process) is ineffective on our dataset. This may be due to differences in conventions across institutions, such as what information is included in a report's background and what is considered important to include in its impression.

We observe that our Ontology-Aware models (UMLS PG and RadLex PG) significantly outperform all other approaches (paired t-test, p < 0.05) on both the development and test sets. The RadLex model slightly outperforms the UMLS model, suggesting that the radiology-specific ontology is beneficial (though the difference between UMLS and RadLex is not statistically significant). We also experimented with incorporating both ontologies in the model simultaneously, but this resulted in slightly lower performance (1.26% lower than the best model on ROUGE-1). To verify that including ontological concepts in the decoder helps the model identify and focus on more radiology terms, we examined the attention weights. In Figure 2, we show attention plots for two reports, comparing the attention of our approach and PG. The plots show that our approach results in attention weights being shared across radiological terms throughout the findings, potentially helping the model to capture a more complete summary.
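For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference. The following is a simplified sketch, assuming whitespace tokenization and lowercasing with no stemming or stopword handling; it is not the evaluation toolkit used in the paper, and the two example impressions are invented.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Simplified ROUGE-1 precision, recall, and F1 from clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: the candidate omits one reference token.
scores = rouge_1("no acute abnormality", "no acute cardiopulmonary abnormality")
```

Here every candidate token appears in the reference, so precision is 1.0, while recall is 0.75 because one of the four reference tokens is missing. This is the kind of omission that completeness-oriented evaluation is meant to surface.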

4.2 Expert Human Evaluation

While our approach surpasses state-of-the-art results on our dataset in terms of ROUGE scores, we recognize the limitations of the ROUGE framework for evaluating summarization [2, 3]. To gain better insights into how and why our methodology performs better, we also conduct expert human evaluation. We had a domain expert (radiologist) who is familiar with the process of writing radiological findings and impressions evaluate 100 reports. Each report consists of the radiology findings, one manually-written impression, one impression generated using PG, and one impression generated using our ontology PG method (with RadLex). In each sample, the order of the impressions is shuffled to avoid bias between samples. Samples were randomly chosen from the test set, one from each of 100 evenly-spaced bins sorted by our system's ROUGE-1 score. The radiologist was asked to score each impression in terms of the following on a scale of 1 (worst) to 5 (best):

- Readability. Impression is understandable (5) or gibberish (1).
- Accuracy. Impression is fully accurate (5), or contains critical errors (1).
- Completeness. Impression contains all important information (5), or is missing important points (1).

We present our manual evaluation results using histograms and arrow plots in Figure 3. The histograms indicate the score distributions of each approach, and the arrows indicate how the scores changed. The tail of an arrow indicates the score of the impression we compare to (either the human-written impression or the summary generated by PG). The head of an arrow indicates the score of our approach. The numbers next to each arrow indicate how many reports made the transition. The figure shows that our approach improves completeness considerably, while maintaining readability and accuracy. The major improvement in completeness is between the scores of 3 and 4, where there is a net gain of 10 reports. Completeness is particularly important because it is where existing summarization models, such as PG, are currently lacking as compared to human performance. Despite the remaining gap between human and generated completeness, our approach yields considerable gains toward human-level completeness. Our model is nearly as accurate as human-written summaries, only making critical errors (scores of 1 or 2) in 5% of the cases evaluated, as compared to 8% of cases for PG. No critical errors were found in the human-written summaries, although the human-written summaries go through a manual review process to ensure accuracy.

The expert annotator furthermore conducted blind qualitative analysis to gain a better understanding of when our model is doing better and how it can be further improved. In line with the completeness score improvements, the annotator noted that in many cases our approach is able to identify pertinent points associated with RadLex terms that were missed by the PG model. In some cases, such as when the author picked only one main point, our approach was able to pick up important items that the author missed. Interestingly, it was also able to include specific measurement details better than the PG network, even though these measurements do not appear in RadLex. Although readability is generally strong, our approach generates repetitive sentences and syntactical errors more often than humans do. These could be addressed in future work with additional post-processing heuristics, such as removing repetitive n-grams as done in [12]. In terms of accuracy, our approach sometimes mixes up the "left" and "right" sides. This often occurs with findings that mention both sides of a specific body part. Multi-level attention (e.g., [1]) could address this by forcing the model to focus on important segments of the text. There were also some cases where our model under-performed in terms of accuracy and completeness due to synonymy that is not captured by RadLex. For instance, in one case our model did not identify torsion, likely because the findings section referred to it as twisting (a term that does not appear in RadLex).
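The sample selection described in this section (one report from each of 100 evenly-spaced bins sorted by the system's ROUGE-1 score) could be implemented roughly as below. The text does not say whether the bins are equal in report count or equal in score width; this sketch assumes equal-count bins over the sorted order. The function name, the seed, and the synthetic scores are likewise assumptions.

```python
import random


def sample_one_per_bin(scored_ids, n_bins=100, seed=0):
    """Pick one report id at random from each of n_bins equal-count bins.

    scored_ids : list of (report_id, rouge1_score) pairs
    """
    ranked = sorted(scored_ids, key=lambda pair: pair[1])
    n = len(ranked)
    if n < n_bins:
        raise ValueError("need at least one report per bin")
    rng = random.Random(seed)
    picks = []
    for b in range(n_bins):
        lo = b * n // n_bins        # first index of bin b
        hi = (b + 1) * n // n_bins  # one past the last index of bin b
        picks.append(rng.choice(ranked[lo:hi])[0])
    return picks


# Synthetic scores for 1,000 hypothetical test reports.
scored = [(i, i / 1000) for i in range(1000)]
picks = sample_one_per_bin(scored)
```

Sampling this way spreads the evaluated reports across the full range of system quality, so the expert does not see only the best or the worst outputs.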

Figure 3: Histograms and arrow plots depicting differences between impressions of 100 manually-scored radiology reports. Although challenges remain to reach human parity for all metrics, our approach makes strong gains to address the problem of report completeness (c, f), as compared to the next leading summarization approach (PG).

5 Conclusion

In this work, we present an approach for informing clinical summarization models of ontological information. This is accomplished by providing an encoding of ontological terms matched in the original text as an additional feature to guide the decoding. We find that our system exceeds state-of-the-art performance at this task, producing summaries that are more comprehensive than those generated by other methods, while not sacrificing readability or accuracy.

1 RadLex version 3.10, http://www.radlex.org/Files/radlex3.10.xlsx
2 https://www.nlm.nih.gov/research/umls/
3 The maximum tree depth is 20.
4 For LSA and LexRank, we use the Sumy implementation (https://pypi.python.org/pypi/sumy) with the top 3 sentences.
5 Using the author's code at github.com/yuhaozhang/summarize-radiology-findings

Short Research Papers 2A: AI, Mining, and others. SIGIR '19, July 21-25, 2019, Paris, France