MS2: Multi-Document Summarization of Medical Studies

Jay DeYoung
Iz Beltagy
Madeleine van Zuylen
Bailey Kuehl
Lucy Lu Wang
ArXiv
2021

Abstract

To assess the effectiveness of any medical intervention, researchers must conduct a time-intensive and highly manual literature review. NLP systems can help to automate or assist in parts of this expensive process. In support of this goal, we release MS^2 (Multi-Document Summarization of Medical Studies), a dataset of over 470k documents and 20k summaries derived from the scientific literature. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is the first large-scale, publicly available multi-document summarization dataset in the biomedical domain. We experiment with a summarization system based on BART, with promising early results. We formulate our summarization inputs and targets in both free text and structured forms and modify a recently proposed metric to assess the quality of our system's generated summaries. Data and models are available at https://github.com/allenai/ms2

1 Introduction

Multi-document summarization (MDS) is a challenging task, with relatively limited resources and modeling techniques. Existing datasets are either in the general domain, such as WikiSum and Multi-News (Fabbri et al., 2019), or very small, such as DUC or TAC 2011 (Owczarzak and Dang, 2011). In this work, we add to this burgeoning area by developing a dataset for summarizing biomedical findings. We derive documents and summaries from systematic literature reviews, a type of biomedical paper that synthesizes results across many other studies. Our aim in introducing MS^2 is to: (1) expand MDS to the biomedical domain, (2) investigate fundamentally challenging issues in NLP over scientific text, such as summarization over contradictory information and assessing summary quality via a structured intermediate form, and (3) aid in distilling large amounts of biomedical literature by supporting automated generation of literature review summaries.

Figure 1: Our primary formulation (texts-to-text) is a seq2seq MDS task. Given study abstracts and a BACKGROUND statement, generate the TARGET summary.

Systematic reviews synthesize knowledge across many studies (Khan et al., 2003), and they are so called for the systematic (and expensive) process of creating a review, with each review taking 1-2 years to complete (Michelson and Reuter, 2019). As we note in Fig. 2, a delay of around 8 years is observed between reviews and the studies they cite. The time and cost of creating and updating reviews have inspired efforts at automation (Tsafnat et al., 2014; Marshall et al., 2016; Beller et al., 2018; Marshall and Wallace, 2019), and the constant deluge of studies has only increased this need.

Figure 2: The distributions of review and study publication years in MS^2 show a clear temporal lag. Dashed lines mark the median year of publication.

To move the needle on these challenges and support further work on literature review automation, we present MS^2, a multi-document summarization dataset in the biomedical domain. Our contributions in this paper are as follows:

• We introduce MS^2, a dataset of 20K reviews and 470k studies summarized by these reviews.

• We define a texts-to-text MDS task (Fig. 1) based on MS^2, by identifying target summaries in each review and using study abstracts as input documents. We develop a BART-based model for this task, which produces fluent summaries that agree with the evidence direction stated in gold summaries around 50% of the time.

• In order to expose more granular representations to users, we define a structured form of our data to support a table-to-table task (§4.2). We leverage existing biomedical information extraction systems (Nye et al., 2018; DeYoung et al., 2020) (§3.3.1, §3.3.2) to evaluate agreement between target and generated summaries.

2 Background

Systematic reviews aim to synthesize results over all relevant studies on a topic, providing high quality evidence for biomedical and public health decisions. They are a fixture in the biomedical literature, with many established protocols around their registration, production, and publication (Chalmers et al., 2002; Starr et al., 2009; Booth et al., 2012). Each systematic review addresses one or several research questions, and results are extracted from relevant studies and summarized. For example, a review investigating the effectiveness of Vitamin B12 supplementation in older adults (Andrès et al., 2010) synthesizes results from 9 studies. The research questions in systematic reviews can be described using the PICO framework (Zakowski et al., 2004). PICO (which stands for Population: who is studied? Intervention: what intervention was studied? Comparator: what was the intervention compared against? Outcome: what was measured?) defines the main facets of biomedical research questions, and allows the person(s) conducting a review to identify relevant studies (studies included in a review generally have the same or similar PICO elements as the review). A medical systematic review is one which reports results for applying any kind of medical or social intervention to a group of people. Interventions are wide-ranging, including yoga, vaccination, team training, education, vitamins, mobile reminders, and more. Recent work on evidence inference (DeYoung et al., 2020; Nye et al., 2020) goes beyond identifying PICO elements, and aims to group and identify overall findings in reviews. MS^2 is a natural extension of these paths: we create a dataset and build a system with natural summarization targets from input studies, while also incorporating the inherent structure studied in previous work.

In this work, we use the term review when describing literature review papers, which provide our summary targets. We use the term study to describe the documents that are cited and summarized by each review. There are various study designs which offer differing levels of evidence, e.g. clinical trials, cohort studies, observational studies, case studies, and more (Concato et al., 2000). Of these study types, randomized controlled trials (RCTs) offer the highest quality of evidence (Meldrum, 2000).

3 Dataset

We construct MS^2 from papers in the Semantic Scholar literature corpus. First, we create a corpus of reviews and studies based on the suitability criteria defined in §3.1. For each review, we classify individual sentences in the abstract to identify summarization targets (§3.2). We augment all reviews and studies with PICO span labels and evidence inference classes as described in §3.3.1 and §3.3.2. As a final step in data preparation, we cluster reviews by topic and form train, development, and test sets from these clusters (§3.4).

3.1 Identifying Suitable Reviews And Studies

To identify suitable reviews, we apply (i) a high-recall heuristic keyword filter, (ii) a PubMed filter, (iii) a study-type filter, and (iv) a suitability classifier, in series. The keyword filter looks for the phrase "systematic review" in the titles and abstracts of all papers in Semantic Scholar, which yields 220K matches. The PubMed filter, yielding 170K matches, limits search results to papers that have been indexed in the PubMed database, which restricts reviews to those in the biomedical, clinical, psychological, and associated domains. We then use citations and Medical Subject Headings (MeSH) to identify input studies via their document types and further refine the remaining reviews; see App. A for details on the full filtering process.

Finally, we train a suitability classifier as the final filtering step, using SciBERT, a BERT (Devlin et al., 2019) based language model trained on scientific text. Details on classifier training and performance are provided in Appendix C. Applying this classifier to the remaining reviews leaves us with 20K candidate reviews.

Suitability criteria:

• Must review studies involving multiple participants. We are interested in reviews of trials or cohort studies. We are *not* interested in reviews of case studies, which describe one or a few specific people.

• Must study an explicit population or problem (P from PICO).
  - Example populations: women > 55 years old with breast cancer, migrant workers, elementary school children in Spokane, WA, etc.

• Must compare one or more medical interventions.
  - Example interventions: drugs, vaccines, yoga, therapy, surgery, education, annoying mobile device reminders, professional naggers, personal trainers, and more!
  - Note: placebo / no intervention is a type of intervention.
  - Comparing the effectiveness of an intervention against no intervention is okay.
  - Combinations of interventions count as comparisons (e.g. yoga vs. yoga + therapy).
  - Two different dosages also count (e.g. 500ppm fluoride vs 1000ppm fluoride in toothpaste).

• Must have an explicit outcome measure.
  - Example outcome measures: survival time, frequency of headaches, relief of depression, survey results, and many other possibilities.

• The outcome measure must measure the effectiveness of the intervention.

Label: Sentence

BACKGROUND: ... AREAS COVERED IN THIS REVIEW The objective of this review is to evaluate the efficacy of oral cobalamin treatment in elderly patients.

OTHER: To reach this objective, PubMed data were systematically searched for English and French articles published from January 1990 to July 2008. ...

OTHER: The effect of oral cobalamin treatment in patients presenting with severe neurological manifestations has not yet been adequately documented.

TARGET: The efficacy was particularly highlighted when looking at the marked improvement in serum vitamin B12 levels and hematological parameters, for example hemoglobin level, mean erythrocyte cell volume and reticulocyte count.

TARGET: Oral cobalamin treatment avoids the discomfort, inconvenience and cost of monthly injections.

TARGET: TAKE HOME MESSAGE Our experience and the present analysis support the use of oral cobalamin therapy in clinical practice

Table 1: Abbreviated example from Andrès et al. (2010) with predicted sentence labels (full abstract in Tab. 11, App. D.3). Spans corresponding to Population, Intervention, and Outcome elements are tagged and surrounded with special tokens.

3.2 Background And Target Identification

For each review, we identify two sections: 1) the BACKGROUND statement, which describes the research question, and 2) the overall effect or findings statement as the TARGET of the MDS task (Fig. 1). We frame this as a sequential sentence classification task: given the sentences in the review abstract, classify them as BACKGROUND, TARGET, or OTHER. All BACKGROUND sentences are aggregated and used as input in modeling. All TARGET sentences are aggregated and form the summary target for that review. Sentences classified as OTHER may describe the methods used to conduct the review, detailed findings such as the number of included studies or numerical results, as well as recommendations for practice. OTHER sentences are not suitable for modeling because they either contain information specific to the review, as in methods; too much detail, in the case of results; or contain guidance on how medicine should be practiced, which is both outside the scope of our task definition and ill-advised to generate.

Table 2: Sample Intervention, Outcome, evidence statement, and identified effect directions from a systematic review investigating the effectiveness of vitamin B12 therapies in the elderly (Andrès et al., 2010).
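The aggregation of BACKGROUND and TARGET sentences described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the function name `split_abstract` and the `(label, sentence)` input format are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def split_abstract(labeled: List[Tuple[str, str]]) -> Dict[str, str]:
    """Join BACKGROUND and TARGET sentences in order; OTHER sentences are dropped.

    `labeled` is a hypothetical list of (label, sentence) pairs, e.g. the
    output of a sentence classifier run over one review abstract.
    """
    parts: Dict[str, List[str]] = {"BACKGROUND": [], "TARGET": []}
    for label, sentence in labeled:
        if label in parts:  # anything else (e.g. OTHER) is ignored
            parts[label].append(sentence)
    return {label: " ".join(sentences) for label, sentences in parts.items()}


example = [
    ("BACKGROUND", "The objective of this review is to evaluate oral cobalamin."),
    ("OTHER", "PubMed was searched for articles from 1990 to 2008."),
    ("TARGET", "The analysis supports the use of oral cobalamin therapy."),
]
result = split_abstract(example)
```

Here `result["BACKGROUND"]` would serve as part of the model input and `result["TARGET"]` as the summary target; an empty TARGET string would flag a review with no target sentences.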

Five annotators with undergraduate or graduate level biomedical background labeled 3000 sentences from 220 review abstracts. During annotation, we asked annotators to label sentences into 9 classes (which we collapse into the 3 above; see App. D for detailed info on other classes). Two annotators then reviewed all annotations and corrected mistakes. The corrections yield a Cohen's κ (Cohen, 1960) of 0.912. Though we retain only BACKGROUND and TARGET sentences for modeling, we provide labels for all 9 classes in our dataset.

Using SciBERT, we train a sequential sentence classifier. We prepend each sentence with a [SEP] token and use a linear layer followed by a softmax to classify each sentence. A detailed breakdown of the classifier scores is available in Tab. 9, App. D. While the classifier performs well (94.1 F1) at identifying BACKGROUND sentences, it only achieves 77.4 F1 for TARGET sentences. The most common error for TARGET sentences is confusing them for results from individual studies or detailed statistical analysis. Tab. 1 shows example sentences with predicted labels. Due to the size of the dataset, we cannot manually annotate sentence labels for all reviews, so we use the sentence classifier output as silver labels in the training set. To ensure the highest degree of accuracy for the summary targets in our test set, we manually review all 4519 TARGET sentences in the 2K reviews of the test set, correcting 1109 sentences. Any reviews without TARGET sentences are considered unsuitable and are removed from the final dataset.

Table 9: Precision, Recall, and F1-scores for all annotation classes, averaged over five folds of cross validation.

3.3 Structured Form

As discussed in §2, the key findings of studies and reviews can be succinctly captured in a structured representation. The structure consists of PICO elements (Nye et al., 2018) that define what is being studied, in addition to the effectiveness of the intervention as inferred through Evidence Inference (§3.3.2). In addition to the textual form of our task, we construct this structured form and release it with MS^2 to facilitate investigation of consistency between input studies and reviews, and to provide additional information for interpreting the findings reported in each document.

3.3.1 Adding Pico Tags

The Populations, Interventions, and Outcomes of interest are a common way of representing clinical knowledge (Huang et al., 2006). Recent work (Nye et al., 2020) has found that the Comparator is rarely mentioned explicitly, so we exclude it from our dataset. Previous summarization work has shown that tagging salient entities, especially PIO elements (Wallace et al., 2020), can improve summarization performance (Nallapati et al., 2016a,b), so we surround PIO elements with special opening and closing tokens added to our model vocabulary.

Using the EBM-NLP corpus (Nye et al., 2018), a crowd-sourced collection of PIO tags, we train a token classification model (Wolf et al., 2020) to identify these spans in our study and review documents. These span sets are denoted $P = \{P_1, P_2, \ldots, P_{\bar{P}}\}$, $I = \{I_1, I_2, \ldots, I_{\bar{I}}\}$, and $O = \{O_1, O_2, \ldots, O_{\bar{O}}\}$. At the level of each review, we perform a simple aggregation over these elements. Any P, I, or O span fully contained within any other span of the same type is removed from these sets (though they remain tagged in the text). Removing these contained elements reduces the number of duplicates in our structured representation. Our dataset has an average of 3.0 P, 3.5 I, and 5.4 O spans per review.
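The containment rule can be illustrated with a short sketch. It assumes spans of a single type are given as `(start, end)` offsets with an exclusive end, which is a representation chosen for this example; the paper does not state how containment is computed, and `drop_contained` is a hypothetical name.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) offsets, end exclusive (assumed format)


def drop_contained(spans: List[Span]) -> List[Span]:
    """Remove spans that lie fully inside a different span of the same type.

    Exact duplicates are collapsed first so that two identical spans do not
    eliminate each other.
    """
    unique = sorted(set(spans))
    kept = []
    for span in unique:
        contained = any(
            other != span and other[0] <= span[0] and span[1] <= other[1]
            for other in unique
        )
        if not contained:
            kept.append(span)
    return kept


intervention_spans = [(0, 5), (1, 3), (4, 8), (0, 5)]
aggregated = drop_contained(intervention_spans)  # (1, 3) sits inside (0, 5)
```

Overlapping spans that are not nested, such as `(0, 5)` and `(4, 8)`, are both kept under this rule.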

3.3.2 Adding Evidence Inference

We predict the direction of evidence associated with every Intervention-Outcome (I/O) pair found in the review abstract. Taking the product of each $I_i$ and $O_j$ in the sets $I$ and $O$ yields all possible I/O pairs, and each I/O pair is associated with an evidence direction $d_{ij}$, which can take on one of the values in {increases, no_change, decreases}. For each I/O pair, we also derive a sentence $s_{ij}$ from the document supporting the $d_{ij}$ classification. Each review can therefore be represented as a set of tuples $T$ of the form $(I_i, O_j, s_{ij}, d_{ij})$ with cardinality $\bar{I} \times \bar{O}$. See Tab. 2 for examples. For modeling, as in PICO tagging, we surround supporting sentences with special tokens and append a token marking the direction class.
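As a rough sketch of this structured form, the tuples can be built from the Cartesian product of the intervention and outcome sets. The `infer` callable below is a hypothetical stand-in for the evidence identification and classification modules; it is not the actual Evidence Inference API.

```python
from itertools import product
from typing import Callable, List, NamedTuple, Tuple


class EvidenceTuple(NamedTuple):
    intervention: str
    outcome: str
    sentence: str   # supporting sentence s_ij
    direction: str  # one of: increases, no_change, decreases


def build_tuples(
    interventions: List[str],
    outcomes: List[str],
    infer: Callable[[str, str], Tuple[str, str]],
) -> List[EvidenceTuple]:
    """Pair every intervention with every outcome and attach the prediction."""
    return [
        EvidenceTuple(i, o, *infer(i, o))
        for i, o in product(interventions, outcomes)
    ]


def dummy_infer(intervention: str, outcome: str) -> Tuple[str, str]:
    # Placeholder: a real system would select a sentence and classify it.
    return (f"{intervention} was evaluated on {outcome}.", "no_change")


tuples = build_tuples(
    ["oral cobalamin", "injections"],
    ["serum B12", "hemoglobin", "cost"],
    dummy_infer,
)
```

With 2 interventions and 3 outcomes this yields 6 tuples, matching the stated cardinality of the product of the two set sizes.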

We adapt the Evidence Inference (EI) dataset and models (DeYoung et al., 2020) for labeling. The EI dataset is a collection of RCTs, tagged PICO elements, evidence sentences, and overall evidence direction labels increases, no_change, or decreases. The EI models are composed of 1) an evidence identification module which identifies an evidence sentence, and 2) an evidence classification module for classifying the direction of effectiveness. The former is a binary classifier on top of SciBERT, whereas the latter is a softmax distribution over effectiveness directions. Using the same parameters as DeYoung et al. (2020), we modify these two modules to function solely over I and O spans. The resulting 354k EI classifications for our reviews are 13.4% decreases, 57.0% no_change, and 29.6% increases. Of the 907k classifications over input studies, 15.7% are decreases, 60.7% no_change, and 23.6% increases. Only 53.8% of study classifications match review classifications, highlighting the prevalence and challenges of contradictory data.

3.4 Clustering And Train / Test Split

Reviews addressing overlapping research questions or providing updates to previous reviews may share input studies and results in common, e.g., a review studying the effect of Vitamin B12 supplementation on B12 levels in older adults and a review studying the effect of B12 supplementation on heart disease risk will cite similar studies. To avoid the phenomenon of learning from test data, we cluster reviews before splitting into train, validation, and test sets. We compute SPECTER paper embeddings using the title and abstract of each review, and perform agglomerative hierarchical clustering using the scikit-learn library (Buitinck et al., 2013). This results in 200 clusters, which we randomly partition into 80/10/10 train/development/test sets.
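The cluster-level split can be sketched as below. The sketch assumes cluster assignments are already available as a mapping from review id to cluster id, and that the 80/10/10 proportions apply to the number of clusters rather than the number of reviews; the text does not say which, and `split_by_cluster` is a hypothetical helper.

```python
import random
from collections import defaultdict
from typing import Dict, List


def split_by_cluster(cluster_of: Dict[str, int], seed: int = 0) -> Dict[str, List[str]]:
    """Assign whole clusters to train/dev/test so related reviews stay together."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for review_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(review_id)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)  # seeded for reproducibility

    n = len(cluster_ids)
    n_train, n_dev = n * 8 // 10, n // 10  # integer arithmetic: 80% / 10%
    assignment = {
        "train": cluster_ids[:n_train],
        "dev": cluster_ids[n_train:n_train + n_dev],
        "test": cluster_ids[n_train + n_dev:],
    }
    return {
        name: [review for cid in ids for review in clusters[cid]]
        for name, ids in assignment.items()
    }


# Toy data: 20 reviews spread over 10 clusters (two reviews per cluster).
toy_clusters = {f"review_{i}": i % 10 for i in range(20)}
splits = split_by_cluster(toy_clusters)
```

Because every review in a cluster lands in the same partition, reviews that share many input studies cannot straddle the train/test boundary.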

3.5 Dataset Statistics

The final dataset consists of 20K reviews and 470k studies. Each review in the dataset summarizes an average of 23 studies, ranging from 1 to 401 studies. See Tab. 3 for statistics, and Tab. 4 for a comparison to other datasets. The median review has 6.7K input tokens from its input studies, while the average has 9.4K tokens (a few reviews have lots of studies). We restrict the input size when modeling to 25 studies, which reduces the average input to 6.6K tokens without altering the median. Fig. 2 shows the temporal distribution of reviews and input studies in MS^2. We observe that though reviews in our dataset have a median publication year of 2016, the studies cited by these reviews are largely from before 2010, with a median of 2007 and peak in 2009. This citation delay has been observed in prior work (Shojania et al., 2007; Beller et al., 2013), and further illustrates the need for automated or assisted reviews.

Table 4: A comparison of MDS datasets; adapted from Fabbri et al. (2019). Datasets are DUC '03/'04, TAC 2011 (Owczarzak and Dang, 2011), WikiSum (Liu et al., 2018), and Multi-News (Fabbri et al., 2019). Note: WikiSum only provides ranges, not exact size.

4 Experiments

We experiment with a texts-to-text task formulation (Fig. 1). The model input consists of the BACKGROUND statement and study abstracts; the output is the TARGET statement. We also investigate the use of the structured form described in §3.3.2 for a supplementary table-to-table task, where, given I/O pairs from the review as input, the model tries to predict the evidence direction. We provide initial results for the table-to-table task, although we consider this an area in need of active research.

4.1 Texts-To-Text Task

Our approach leverages BART (Lewis et al., 2020b), a seq2seq autoencoder. Using BART, we encode the BACKGROUND and input studies as in Fig. 3, and pass these representations to a decoder. Training follows a standard auto-regressive paradigm used for building summarization models. In addition to PICO tags (§3.3.1), we augment the inputs by surrounding the background and each input study with special tokens.

Figure 3: Two input encoding configurations. Above: LongformerEncoderDecoder (LED), where all input studies are appended to the BACKGROUND and encoded together. Below: In the BART configuration, each input study is encoded independently with the review BACKGROUND. These are concatenated to form the input encoding.

For representing multiple inputs, we experiment with two configurations: one leveraging BART with independent encodings of each input, and LongformerEncoderDecoder (LED), which can encode long inputs of up to 16K tokens. For the BART configuration, each study abstract is appended to the BACKGROUND statement and encoded independently. These representations are concatenated together to form the input to the decoder layer. In the BART configuration, interactions happen only in the decoder. For the LED configuration, the input sequence starts with the BACKGROUND statement followed by a concatenation of all input study abstracts. The BACKGROUND representation is shared among all input studies; global attention allows interactions between studies, and a sliding attention window of 512 tokens allows each token to attend to its neighbors.

Table 5: Results for the texts-to-text setting. We report ROUGE, ∆EI (§4.3), and macro-averaged F1-scores.

Model R-1 R-2 R-L ∆EI↓

Table 6: Results for the table-to-table setting. We report macro-averaged precision, recall, and F-scores.

Table 7: Confusion matrix for human evaluation results

Table 8: Text from the input studies to Petrelli and Barni (2013), a review investigating the effectiveness of cisplatin-based (CAP) chemotherapy for non-small cell lung cancer (NSCLC). Input studies vary in their results, with some stating a positive effect for adjuvant chemotherapy, and some stating no survival benefit.

Table 10: Full 9-class sentence classification confusion matrix, averaged over five folds of cross validation.

Table 11: Example review abstract from Andrès et al. (2010) with predicted sentence labels. Spans corresponding to Population, Intervention, and Outcome elements are tagged and surrounded with special tokens.

Table 12: Confusion matrix between review effect findings and input study effect findings. Each row corresponds to the fraction of the effect direction found in the review with the fraction of that direction accounted for in the study. The most frequent confusion is with no_change, as opposed to flipping the overall direction of the finding.

We train a BART-base model, with hyperparameters described in App. F. We report experimental results in Tab. 5. In addition to ROUGE (Lin, 2004), we also report two metrics derived from evidence inference: ∆EI and F1. We describe the intuition and computation of the ∆EI metric in Section 4.3; because it is a distance metric, lower ∆EI is better. For F1, we use the EI classification module to identify evidence directions for both the generated and target summaries. Using these classifications, we report a macro-averaged F1 over the class agreement between the generated and target summaries (Buitinck et al., 2013). For example generations, see Tab. 13 in App. G.
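To make the macro-averaged F1 over evidence directions concrete, here is a small standard-library sketch. It is an illustration of the metric's definition, not the scikit-learn call used in the paper; treating a class with no gold or predicted instances as scoring 0.0 is a choice made for this example.

```python
from typing import Sequence

CLASSES = ("increases", "no_change", "decreases")


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             classes: Sequence[str] = CLASSES) -> float:
    """Unweighted mean of per-class F1 between gold and predicted directions."""
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class that never appears scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Directions classified from target summaries vs. generated summaries (toy data).
target_dirs = ["increases", "increases", "no_change", "decreases"]
generated_dirs = ["increases", "no_change", "no_change", "decreases"]
score = macro_f1(target_dirs, generated_dirs)
```

Macro-averaging weights the three direction classes equally, so a system cannot score well simply by predicting the most frequent class.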

Table 13: Example summaries from the test set generated using the BART configuration.

Model P R F1

BART 50.31 67.98 65.89

BART 46.98 45.39 46.97

We report precision, recall, and macro-averaged F1-scores.

4.3 ∆EI Metric

Recent work in summarization evaluation has highlighted the weaknesses of ROUGE for capturing factuality of generated summaries, and has focused on developing automated metrics more closely correlated with human-assessed factuality and quality (Zhang et al., 2020; Falke et al., 2019). In this vein, we modify a recently proposed metric based on EI classification distributions (Wallace et al., 2020), intending to capture the agreement of Is, Os, and EI directions between input studies and the generated summary. For each I/O tuple $(I_i, O_j)$, the predicted direction $d_{ij}$ is actually a distribution of probabilities over the three direction classes, $P_{ij} = (p_{\text{increases}}, p_{\text{decreases}}, p_{\text{no\_change}})$. If we consider this distribution for the gold summary ($P_{ij}$) and the generated summary ($Q_{ij}$), we can compute the Jensen-Shannon Distance (JSD) (Lin, 1991), a score bounded in [0, 1], between these distributions. For each review, we can then compute a summary JSD metric, which we call ∆EI, as an average over the JSD of each I/O tuple in that review:

$$\Delta\mathrm{EI} = \frac{1}{\bar{I}\,\bar{O}} \sum_{i=1}^{\bar{I}} \sum_{j=1}^{\bar{O}} \mathrm{JSD}(P_{ij}, Q_{ij}) \qquad (1)$$

Different from Wallace et al. (2020), ∆EI is an average over all outputs, attempting to capture an overall picture of system performance, and our metric retains the directionality of increases and decreases, as opposed to collapsing them together.

To facilitate interpretation of the ∆EI metric, we offer a degenerate example. Given the case where all direction classifications are certain, and the probability distributions $P_{ij}$ and $Q_{ij}$ exist in the space of (1, 0, 0), (0, 1, 0), or (0, 0, 1), ∆EI takes on the following values at various levels of consistency between $P_{ij}$ and $Q_{ij}$ for the input studies:

• 100% consistent: ∆EI = 0.0
• 50% consistent: ∆EI = 0.42
• 0% consistent: ∆EI = 0.83

In other words, in both the standard BART and LED setting, the evidence directions predicted in relation to the generated summary are slightly less than 50% consistent with the direction predictions produced relative to the gold summary.
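The degenerate example can be checked numerically with a short sketch. It assumes the Jensen-Shannon distance is the square root of the Jensen-Shannon divergence computed with the natural logarithm; that choice is an inference from the 0.83 value above (which equals the square root of ln 2), not something the text states. The function names are illustrative.

```python
import math
from typing import List, Sequence

Dist = Sequence[float]  # (p_increases, p_decreases, p_no_change)


def _kl(p: Dist, m: Dist) -> float:
    """KL divergence with natural log; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def js_distance(p: Dist, q: Dist) -> float:
    """Jensen-Shannon distance: sqrt of the JS divergence (assumed base e)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
    return math.sqrt(max(0.0, divergence))  # clamp tiny negative rounding error


def delta_ei(gold: List[Dist], generated: List[Dist]) -> float:
    """Average JS distance over the I/O tuples of one review."""
    return sum(js_distance(p, q) for p, q in zip(gold, generated)) / len(gold)


INC, DEC, NOC = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

all_match = delta_ei([INC, DEC], [INC, DEC])
half_match = delta_ei([INC, DEC], [INC, NOC])
none_match = delta_ei([INC, DEC], [DEC, NOC])
```

Rounded to two decimals, the three values come out as 0.0, 0.42, and 0.83, in line with the consistency levels listed above.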

4.4 Human Evaluation & Error Analysis

We randomly sample 150 reviews from the test set for manual evaluation. For each generated and gold summary, we annotate the primary effectiveness direction in the summary to the following classes: (i) increases: intervention has a positive effect on the outcome; (ii) no_change: no effect, or no difference between the intervention and the comparator; (iii) decreases: intervention has a negative effect on the outcome; (iv) insufficient: insufficient evidence is available; (v) skip: the summary is disfluent, off topic, or does not contain information on efficacy.

Here, increases, no_change, and decreases correspond to the EI classes, while we introduce insufficient to describe cases where insufficient evidence is available on efficacy, and skip to describe data or generation failures. Two annotators provide labels, and agreement is computed over 50 reviews (agreement: 86%, Cohen's κ: 0.76). Of these, 17 gold summaries lack an efficacy statement, and are excluded from analysis. Tab. 7 shows the confusion matrix for the sample. Around 50% (67/133) of generated summaries have the same evidence direction as the gold summary. Most confusions happen between increases, no_change, and insufficient.
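For readers unfamiliar with the agreement statistic reported above, Cohen's κ compares observed agreement with the agreement expected by chance from each annotator's label frequencies. The sketch below uses toy labels, not the study's annotations, and `cohens_kappa` is an illustrative helper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """kappa = (p_o - p_e) / (1 - p_e) for two annotators' label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labeled independently according to their own frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["increases", "increases", "no_change", "no_change"]
annotator_2 = ["increases", "no_change", "no_change", "no_change"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 3 of 4 items (75%), chance agreement is 50%, and κ is 0.5, which shows why κ is lower than raw agreement.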

Tab. 8 shows how individual studies can provide contradictory information, some supporting a positive effect for an intervention and some observing no or negative effects. EI may be able to capture some of the differences between these input studies. From observations on limited data: while studies with positive effect tend to have more EI predictions that were increases or decreases, those with no or negative effect tended to have predictions that were mostly no_change. However, more work is needed to better understand how to capture these directional relations and how to aggregate them into a coherent summary.

5 Related Work

NLP for scientific text has been gaining interest recently with work spanning the whole NLP pipeline: datasets (S2ORC, CORD-19 (Wang et al., 2020b)), pretrained transformer models (SciBERT, BioBERT, ClinicalBERT (Huang et al., 2019), SPECTER), NLP tasks like NER (Nye et al., 2018; Li et al., 2016), relation extraction (Jain et al., 2020; Luan et al., 2018; Kringelum et al., 2016), QA (Abacha et al., 2019), NLI (Romanov and Shivade, 2018; Khot et al., 2018), summarization (Cachola et al., 2020; Chandrasekaran et al., 2019), claim verification (Wadden et al., 2020), and more. MS^2 adds an MDS dataset to the scientific document NLP literature.

A small number of MDS datasets are available for other domains, including MultiNews (Fabbri et al., 2019), WikiSum, and Wikipedia Current Events (Gholipour Ghalandari et al., 2020). Most similar to MS^2 is MultiNews, where multiple news articles about the same event are summarized into one short paragraph. Aside from being in a different textual domain (scientific vs. newswire), one unique characteristic of MS^2 compared to existing datasets is that MS^2 input documents have contradicting evidence. Modeling in other domains has typically focused on straightforward applications of single-document summarization to the multi-document setting (Lebanoff et al., 2018; Zhang et al., 2018), although some methods explicitly model multi-document structure using semantic graph approaches (Baumel et al., 2018; Liu and Lapata, 2019).

In the systematic review domain, work has typically focused on information retrieval (Boudin et al., 2010; Ho et al., 2016; Znaidi et al., 2015; Schoot et al., 2020), extracting findings (Lehman et al., 2019; DeYoung et al., 2020; Nye et al., 2020), and quality assessment (Marshall et al., 2015, 2016). Only recently, in Wallace et al. (2020) and this work, has consideration been made for approaching the entire system as a whole. We refer the reader to App. I for more context regarding the systematic review process.

6 Discussion

Though MDS has been explored in the general domain, biomedical text poses unique challenges such as the need for domain-specific vocabulary and background knowledge. To support development of biomedical MDS systems, we release the MS^2 dataset. MS^2 contains summaries and documents derived from biomedical literature, and can be used to study literature review automation, a pressing real-world application of MDS.

We define a seq2seq modeling task over this dataset, as well as a structured task that incorporates prior work on modeling biomedical text (Nye et al., 2018; DeYoung et al., 2020) . We show that although generated summaries tend to be fluent and on-topic, they only agree with the evidence direction in gold summaries around half the time, leaving plenty of room for improvement. This observation holds both through our ∆EI metric and through human evaluation of a small sample of generated summaries. Given that only 54% of study evidence directions agree with the evidence directions of their review, modeling contradiction in source documents may be key to improving upon existing summarization methods.

Limitations Challenges in co-reference resolution and PICO extraction limit our ability to generate accurate PICO labels at the document level. Errors compound at each stage: PICO tagging, taking the product of Is and Os at the document level, and predicting EI direction. Pipeline improvements are needed to bolster overall system performance and increase our ability to automatically assess performance via automated metrics like ∆EI. Relatedly, automated metrics for summarization evaluation can be difficult to interpret, as the intuition for each metric must be built up through experience. Though we attempt to facilitate understanding of ∆EI by offering a degenerate example, more exploration is needed to understand how a practically useful system would perform on such a metric.

Future work Though we demonstrate that seq2seq approaches are capable of producing fluent and on-topic review summaries, there are significant opportunities for improvement. Data improvements include improving the quality of summary targets and intermediate structured representations (PICO tags and EI direction). Another opportunity lies in linking to structured data in external sources such as various clinical trial databases 7,8,9 rather than relying solely on PICO tagging. For modeling, we are interested in pursuing joint retrieval and summarization approaches. We also hope to explicitly model the types of contradictions observed in Tab. 8, such that generated summaries can capture nuanced claims made by individual studies.

7 Conclusion

Given increasing rates of publication, multi-document summarization, or the creation of literature reviews, has emerged as an important NLP task in science. The urgency for automation technologies has been magnified by the COVID-19 pandemic, which has led to both an accelerated speed of publication (Horbach, 2020) and a proliferation of non-peer-reviewed preprints which may be of lower quality (Lachapelle, 2020). By releasing MSˆ2, we provide an MDS dataset that can help to address these challenges. Though we demonstrate that our MDS models can produce fluent text, our results show that significant challenges remain unsolved, such as PICO tuple extraction, co-reference resolution, and evaluation of summary quality and faithfulness in the multi-document setting. We encourage others to use this dataset to better understand the challenges specific to MDS in the domain of biomedical text, and to push the boundaries on the real-world task of systematic review automation.

Ethical Concerns And Broader Impact

We believe that automation in systematic reviews has great potential value to the medical and scientific community; our aim in releasing our dataset and models is to facilitate research in this area. Given unresolved issues in evaluating the factuality of summarization systems, as well as a lack of strong guarantees about what the summary outputs contain, we do not believe that such a system is ready to be deployed in practice. Deploying such a system now would be premature, as without these guarantees we would be likely to generate plausible-looking but factually incorrect summaries, an unacceptable outcome in such a high impact domain. We hope to foster development of useful systems with correctness guarantees and evaluations to support them.

Table 9: Precision, Recall, and F1-scores for all annotation classes, averaged over five folds of cross validation.

A MeSH Filtering

For each candidate review, we extract its cited papers and identify the study type of each cited paper using MeSH publication type, 10 keeping only studies that are clinical trials, cohort studies, and/or observational studies (see Appendix A.1 for full list of MeSH terms). We exclude case reports, which usually report findings on one or a small number of individuals. We observe that publication type MeSH terms tend to be under-tagged. 11 Therefore, we also use ArrowSmith trial labels (Shao et al., 2015) and a keyword heuristic (the span "randomized" occurring in the title or abstract) to identify additional RCT-like studies. 12 Candidate reviews are culled to retain only those that cite at least one suitable study and no case studies, leaving us with 30K reviews.

A.1 Suitability MeSH Terms

We use the following publication type MeSH terms to decide whether a review's input document is a study of interest:

1. 'Clinical Study'
2. 'Clinical Trial'
3. 'Controlled Clinical Trial'
4. 'Randomized Controlled Trial'
5. 'Pragmatic Clinical Trial'
6. 'Clinical Trial, Phase I'
7. 'Clinical Trial, Phase II'
8. 'Clinical Trial, Phase III'
9. 'Clinical Trial, Phase IV'
10. 'Equivalence Trial'
11. 'Comparative Study'
12. 'Observational Study'
13. 'Adaptive Clinical Trial'

And we exclude any reviews citing studies with the following publication type MeSH terms:

1. 'Randomized Controlled Trial, Veterinary'
2. 'Clinical Trial, Veterinary'
3. 'Observational Study, Veterinary'
4. 'Case Report'
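The filtering rule described in this appendix can be sketched as follows. This is a minimal illustration, not the released pipeline: the dictionary keys (`publication_types`, `title`, `abstract`, `arrowsmith_rct`) and function names are hypothetical, the first include term is read as 'Clinical Study', and matching "randomized" case-insensitively is an assumption.

```python
# Publication-type MeSH terms that mark a cited paper as a study of interest.
INCLUDE_TYPES = frozenset({
    "Clinical Study", "Clinical Trial", "Controlled Clinical Trial",
    "Randomized Controlled Trial", "Pragmatic Clinical Trial",
    "Clinical Trial, Phase I", "Clinical Trial, Phase II",
    "Clinical Trial, Phase III", "Clinical Trial, Phase IV",
    "Equivalence Trial", "Comparative Study", "Observational Study",
    "Adaptive Clinical Trial",
})

# Any cited paper carrying one of these terms disqualifies the whole review.
EXCLUDE_TYPES = frozenset({
    "Randomized Controlled Trial, Veterinary", "Clinical Trial, Veterinary",
    "Observational Study, Veterinary", "Case Report",
})


def is_suitable_study(study):
    """True if a cited paper looks like a trial, cohort, or observational study."""
    if set(study.get("publication_types", ())) & INCLUDE_TYPES:
        return True
    # Hypothetical boolean standing in for an ArrowSmith trial label.
    if study.get("arrowsmith_rct", False):
        return True
    # Keyword heuristic; lowercasing is an assumption of this sketch.
    text = (study.get("title", "") + " " + study.get("abstract", "")).lower()
    return "randomized" in text


def keep_review(cited_studies):
    """Keep a review citing at least one suitable study and no excluded types."""
    if any(set(s.get("publication_types", ())) & EXCLUDE_TYPES
           for s in cited_studies):
        return False
    return any(is_suitable_study(s) for s in cited_studies)
```

Checking the exclusion list first means a single case report or veterinary study removes the review even when every other cited paper is a suitable trial.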

B Suitability Annotation

The annotation guidelines for review suitability are given below. Each annotator was tasked with an initial round of annotation, followed by a round of review, then further annotation.

B.1 Suitability Guidelines

There are many different types of reviews, and many types of documents that look like reviews. We need to identify only the "correct" types of reviews. Sometimes this can be done from the title alone, sometimes one has to read the review itself.

The reviews we are interested in:

• Must study a human population (no animal, veterinary, or environmental studies)

WHAT THE READER WILL GAIN Three prospect i ve r and omized studies , a systematic review by the Cochrane group and five prospect i ve cohort studies were found and provide evidence that oral cobalamin treatment may adequately treat cobalamin deficiency .

C Suitability Classifier

Four annotators with biomedical backgrounds labeled 879 reviews sampled from the candidate pool (572 suitable, 307 not, Cohen's Kappa: 0.55) according to the suitability criteria (guidelines in Appendix B). We aim to include reviews that perform an aggregation over existing results, such as reporting on how a medical or social intervention affects a group of people, while excluding reviews that make new observations, such as identifying novel disease co-morbidities or those that synthesize case studies. For our suitability classifier, we finetune SciBERT using standard parameters; using five-fold cross validation we find that a threshold of 0.75 provides a precision of greater than 80% while maintaining an adequate recall (Figure 4).

Figure 4: Five-fold cross-validation results from training a binary SciBERT classifier on the annotations. Precision increases following a logistic curve over threshold choices; recall decreases.

Though there are a fairly large number of false positives by this criterion, we note that these false positive documents are generally reviews; however, they may not investigate an intervention, or may not have suitable target statements. In the latter case, target identification described in § 3.2 helps us further refine and remove these false positives from the final dataset.
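The threshold choice for the suitability classifier can be illustrated with a small sketch that scans candidate thresholds and returns the lowest one whose precision exceeds a target. The function names, the candidate grid, and the convention of reporting precision 1.0 when nothing is predicted positive are assumptions of this example, not details from the paper.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    # Convention (assumed): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def lowest_threshold_meeting_precision(scores, labels, min_precision=0.8,
                                       candidates=None):
    """Return (threshold, precision, recall) for the lowest qualifying threshold."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    for threshold in sorted(candidates):
        precision, recall = precision_recall_at(scores, labels, threshold)
        if precision > min_precision:
            return threshold, precision, recall
    return None
```

Taking the lowest qualifying threshold keeps recall as high as possible subject to the precision target; in a cross-validation setting one would typically average precision and recall over folds before choosing.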

D Sentence Annotation

Sentence annotation guidelines and detailed scores are below. Each annotator was tasked with annotating 50-100 sentences, followed by a round of review, before being asked to annotate more.

D.1 Sentence Annotation Guidelines

A systematic review is a document resulting from an in-depth search and analysis of all the literature relevant to a particular topic. We are interested in systematic reviews of medical literature, specifically those that assess varying treatments and the outcomes associated with them. Ignore any existing labels; these are automatically produced and error prone. If something clearly fits into more than one category, separate the labels by commas (annoying, we know, but it can be important). For sentences that are incorrectly broken in a way that makes them difficult to label, skip them (you can fix them, but they'll be programmatically ignored). For reviews that don't meet suitability guidelines, also skip them. We want to identify sentences within these reviews as belonging to one of several categories:

• BACKGROUND: Any background information not including goals.

• GOAL: A high level goal sentence, describing the aims or purposes of the review.

• METHODS: Anything describing the particular strategies or techniques for conducting the review. This includes methods for finding and assessing appropriate studies to include, e.g., the databases searched or other characteristics of the searched literature. A characteristic might be a study type, it might be other details, such as criteria involving the study participants, what interventions (treatments) were studied or compared, or what outcomes are measured in those studies. This may also include whether or not a meta-analysis is performed.

• DETAILED_FINDINGS: Any sections reporting study results, often includes numbers, p-values, etc. These will frequently include statements about a subset of the trials or the populations.

• GENERAL FINDINGS: There are four types of general findings we would like you to label. These do not include things like number of patients, or a p-value (that's DETAILED_FINDINGS). Not all of these four subtypes will always be present in a paper's abstract. Some sentences will contain information about more than one subtype, and some sentences can contain information about some of these subtypes as well as DETAILED_FINDINGS.

- EFFECT: Effect of the intervention, may include a statement about significance. These can cover a wide range of topics, including public health or policy changes.

- EVIDENCE_QUALITY: Commentary about the strength or quality of evidence pertaining to the intervention.

- FURTHER_STUDY: These statements might call for more research in a particular area, and can include hedging statements, e.g.:

* "More rigorously designed longitudinal studies with standardized definitions of periodontal disease and vitamin D are necessary."

* "More research with larger sample size and high quality in different nursing educational contexts are required."

* "However, this finding largely relies on data from observational studies; high-quality RCTs are warranted because of the potential for subject selection bias."

- RECOMMENDATION: Any kind of clinical or policy recommendation, or recommendations for use in practice. This must contain an explicit recommendation, not a passive statement saying that a treatment is good. "Should" or "recommend" are good indicators. These may not always be present in an abstract. E.g.:

* "Public policy measures that can reduce inequity in health coverage, as well as improve economic and educational opportunities for the poor, will help in reducing the burden of malaria in SSA."

- ETC: Anything that doesn't fit into the categories above.

All sentences appear in the context of their review. Some of the selected reviews might not actually be reviews; these were identified by accident. These should be excluded from annotation: either make a comment on the side (preferred) or delete the rows belonging to the non-review. Examples follow. Please ask questions; these guidelines are likely not perfect and we'll have missed many edge cases.

Examples:

BACKGROUND A sizeable number of individuals who participate in population-based colorectal cancer (CRC) screening programs and have a positive fecal occult blood test (FOBT) do not have an identifiable lesion found at colonoscopy to account for their positive FOBT screen.

GOAL To determine the effect of integrating informal caregivers into discharge planning on postdischarge cost and resource use in older adults.

METHODS MAIN OUTCOMES Clinical status (eg, spirometric measures); functional status (eg, days lost from school); and health services use (eg, hospital admissions). Studies were included if they had measured serum vitamin D levels or vitamin D intake and any periodontal parameter.

DETAILED_FINDINGS Overall, 27 studies were included (13 cross-sectional studies, 6 case-control studies, 5 cohort studies, 2 randomized clinical trials and 1 case series study). Sixty-five percent of the cross-sectional studies reported significant associations between low vitamin D levels and poor periodontal parameters. Analysis of group cognitive-behavioural therapy (CBT) v. usual care alone (14 studies) showed a significant effect in favour of group CBT immediately post-treatment (standardised mean difference (SMD) -0.55 (95% CI -0.78 to -0.32)).

EFFECT This review identified short-term benefits of technology-supported self-guided interventions on the physical activity level and fatigue and some benefit on dietary behaviour and HRQoL in people with cancer. However, current literature demonstrates a lack of evidence for long-term benefit.

EVIDENCE_QUALITY Interpretation of findings was influenced by inadequate reporting of intervention description and compliance.

No meta-analysis was performed due to high variability across studies.

RECOMMENDATION

D.2 Detailed Sentence Breakdown Scores

Sentence classification scores for 9 classes are given in Table 9. The corresponding confusion matrix can be found in Table 10. Table 11 provides an example of sentence classification results over 3 classes.

E Dataset Contradiction Scores

The confusion matrix between review effect findings and input study effect findings is given in Table 12.
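A confusion matrix of this kind can be sketched as below, assuming each row is normalized so that it gives, for one review direction, the fraction of input studies falling into each study direction. The function name and the pair-based input format are assumptions for illustration, and the direction strings other than no_change are placeholders.

```python
from collections import Counter, defaultdict


def row_normalized_confusion(pairs):
    """Build {review_direction: {study_direction: fraction}} from label pairs.

    `pairs` is an iterable of (review_direction, study_direction) tuples,
    one per (review, input study) combination.
    """
    counts = defaultdict(Counter)
    for review_dir, study_dir in pairs:
        counts[review_dir][study_dir] += 1
    matrix = {}
    for review_dir, row in counts.items():
        total = sum(row.values())
        matrix[review_dir] = {s: c / total for s, c in row.items()}
    return matrix


# Toy example with placeholder labels.
example = row_normalized_confusion([
    ("increases", "increases"),
    ("increases", "no_change"),
    ("no_change", "no_change"),
])
```

With this normalization the diagonal entry of each row is the per-direction agreement rate, and off-diagonal mass shows where studies disagree with their review.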

F Hyperparameters And Modeling Details

Table 12: Confusion matrix between review effect findings and input study effect findings. Each row corresponds to the fraction of the effect direction found in the review with the fraction of that direction accounted for in the study. The most frequent confusion is with no_change, as opposed to flipping the overall direction of the finding.

Background: OBJECTIVE To explore the evidence for the effectiveness of acupuncture for nonspecific low back pain ( LBP ). SUMMARY OF BACKGROUND DATA Since the most recent systematic review s on RCTs on acupuncture for LBP, 6 RCTs have been published, which may impact on the previous conclusions.
Target: There is moderate evidence that acupuncture is more effective than no treatment, and strong evidence of no significant difference between acupuncture and sham acupuncture , for short-term pain relief.
Generated: The is insufficient evidence to support the use of acupuncture for LBP. CONCLUSIONS There is limited evidence for the effectiveness of acupuncture in LBP in the short term.

Background: Objectives : To provide a quantitative analysis of all r and omized controlled trials design ed to determine the effectiveness of physical interventions for people with spinal cord injury ( SCI ).
Target: There is initial evidence supporting the effectiveness of some physical interventions for people with SCI.
Generated: The Results : This systematic review provides evidence that physical interventions for people with SCI are effective in improving muscle strength and function in the short term.

Background: BACKGROUND Neuroendocrine tumours ( NET ) most commonly metastasize to the liver. Hepatic resection of NET hepatic metastases ( NETHM ) has been shown to improve symptomology and survival. METHODS A systematic review of clinical studies before September 2010 was performed to examine the efficacy of hepatic resection for NETHM.
Target: Poor histologic grade, extra-hepatic disease and a macroscopically incomplete resection were associated with a poor prognosis. CONCLUSION Hepatic resection for NETHM provides symptomatic benefit and is associated with favourable survival outcomes although the majority of patients invariably develop disease progression
Generated: Theatic resection of NETHM has been shown to improve survival in patients with advanced, well-differentiated NETs.

Background: The aim of this systematic review and meta-analysis was to assess the efficacy on an intervention on breastfeeding self-efficacy and perceived insufficient milk supply outcomes.
Target: Although significant effect of the interventions in improving maternal breastfeeding self-efficacy was revealed by this review, there is still a paucity of evidence on the mode, format, and intensity of interventions.
Generated: The findings of this systematic review and meta-analysis suggest that breastfeeding education is an effective intervention for improving breastfeeding self-efficacy and breastfeeding duration among primiparous women.

We implement our models using PyTorch (Paszke et al., 2019), the HuggingFace Transformers (Wolf et al., 2020) and PyTorch Lightning (Falcon, 2019) libraries, starting from the BART-base checkpoint (Lewis et al., 2020b). All models were trained using FP16 on NVIDIA RTX 8000 GPUs (GPUs with 40 GB or more of memory are required for most texts-to-text configurations). All models are trained for eight epochs as validation scores diminished over time; early experiments ran out to approximately fifty epochs and showed little sensitivity to other hyperparameters. We use gradient accumulation to reach an effective batch size of 32. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-5, an epsilon of 1e-8, and a linear learning rate schedule with 1000 steps of warmup. We ran a hyperparameter sweep over decoding parameters on the validation set for 4, 6, and 8 beams; maximum lengths of 64, 128, and 256 wordpieces; and length penalties of 1, 2, and 4. We find little qualitative or quantitative variation between runs and select the setting with the highest ROUGE-1 scores: 6 beams, a length penalty of 2, and a maximum output length of 128 tokens. We use an attention dropout (Srivastava et al., 2014) of 0.1. Optimizer hyperparameters, as well as any hyperparameters not mentioned, used defaults corresponding to their libraries. Training requires approximately one day on two GPUs. Due to memory constraints, we limit each review to 25 input documents, with a maximum of 1000 tokens per input document.
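The decoding sweep and input limits described in this appendix can be sketched as a small grid enumeration plus a truncation helper. This is an illustrative sketch only: the function names are hypothetical, the dictionary keys are chosen to resemble common generation arguments rather than taken from the released code, and the scoring function is left to the caller.

```python
from itertools import product

# Decoding values swept on the validation set, as listed in the text.
BEAMS = (4, 6, 8)
MAX_LENGTHS = (64, 128, 256)
LENGTH_PENALTIES = (1, 2, 4)


def decoding_grid():
    """Enumerate all 3 x 3 x 3 = 27 decoding configurations."""
    return [
        {"num_beams": b, "max_length": m, "length_penalty": p}
        for b, m, p in product(BEAMS, MAX_LENGTHS, LENGTH_PENALTIES)
    ]


def select_best(configs, score_fn):
    """Pick the configuration with the highest validation score.

    `score_fn` is a caller-supplied function (for example, one returning a
    ROUGE-1 score for a configuration); it is not defined here.
    """
    return max(configs, key=score_fn)


def truncate_inputs(documents, max_docs=25, max_tokens=1000):
    """Keep at most `max_docs` documents, each cut to `max_tokens` tokens.

    `documents` is a list of token lists; tokenization is assumed to have
    happened upstream.
    """
    return [doc[:max_tokens] for doc in documents[:max_docs]]
```

Keeping the first 25 documents is an assumption of this sketch; the text states the limit but not how documents are chosen when a review has more.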

We make use of NumPy (Harris et al., 2020) in our models and evaluation, as well as scikit-learn (Buitinck et al., 2013), and the general SciPy framework (Virtanen et al., 2020) for evaluation.

G Example Generated Summaries

See Table 14 : texts-to-text results on the validation set. We report ROUGE, ∆EI, and macro-averaged F1-scores. These are similar to test scores.

H Validation Results

We provide results on the validation set in Tables 14 and 15.

I A Brief Review Of Systematic Reviews

We provide a brief overview of the systematic review process for the reader. A systematic review is a thorough, evidence-based process to answer scientific questions. In the biomedical domain, a systematic review typically consists of five steps: defining the question, finding relevant studies, determining study quality, assessing the evidence (quantitative or qualitative analysis), and drawing final conclusions. For a detailed overview of the steps, see Khan et al. (2003) . While there are other definitions and aspects of the review process (Aromataris and Munn, 2020; Higgins et al., 2019), the five-step process above is sufficient for describing reviews in the context of this work. We emphasize that this work, indeed the approaches used in this field, cannot replace the labor done in a systematic review, and may instead be useful for scoping or exploratory reviews. The National Toxicology Program, 13 part of the United States Department of Health and Human Services, conducts scoping reviews for epidemiological studies. The National Toxicology Program has actively solicited help from the natural language processing community via the Text Analysis Conference. 14 Other groups conducting biomedical systematic reviews include the Cochrane Collaboration, 15 the Joanna Briggs Institute, 16 Guidelines International Network, 17 SickKids, 18 the University of York, 19 and the public health agencies of various countries, 20 to name a few. Systematic review methodologies have also been applied in fields outside of medicine, by organizations such as the Campbell Collaboration, 21 which conducts reviews over a wide range of areas: business, justice, education, and more.

I.1 Automation In Systematic Reviews

Automation in systematic reviews has typically focused on assisting in portions of the process: search and extraction, quality assessment, and interpreting findings. For a detailed analysis of automated approaches in aiding the systematic review process, see Norman (2020); Marshall and Wallace (2019).

Search and Extraction. Search, screening, and extracting the results of studies into a structured representation are several components of the systematic review process that have been the major focuses of natural language processing approaches. Several systems provide active-learning enhanced search (Howard et al., 2020; Schoot et al., 2020), or offer screening based on study type (Marshall et al., 2016). PICO (Participants, Interventions, Controls, and Outcomes) elements can be used to assist in search and screening (Znaidi et al., 2015; Ho et al., 2016; Boudin et al., 2010). To this end, several datasets have been introduced. EBM-NLP (Nye et al., 2018) is a dataset of crowd-sourced PICO elements in randomized control trial abstracts. Jin and Szolovits (2018) provides a large-scale dataset of sentence-level PICO labels that are automatically derived using the structured abstract headers in PubMed abstracts. The Chemical-Disease Relations challenge (Wei et al., 2015) offers data for some of the PICO classes and a related relation extraction task, as does the i2b2 2010 disease-relation task (Uzuner et al., 2011). Evidence Inference (Lehman et al., 2019; DeYoung et al., 2020) attempts to automate detecting the direction of conclusions given PICO elements of interest; e.g., Nye et al. (2020) starts from RCTs, finds PICO elements, and then finds conclusions associated with those PICO elements. Many review tools 22, 23, 24, 25 incorporate workflow management tools for manual extraction of these elements and associated conclusions.

Quality Assessment. Relatively few tools focus on quality assessment. The primary tool seems to be RobotReviewer (Marshall et al., 2016) , which assesses Risk of Bias in trial results, which is one aspect of quality. There are opportunities for quality assessment that focus on automatically assessing statistical power or study design.

Interpretation. The interpretation step of the systematic review process involves drawing overall conclusions about the interventions studied: how effective is the intervention, when should it be used, what is the overall strength of the evidence supporting the effectiveness and recommendations, and what else needs to be studied. It too has received relatively little attention from those developing assistive systems. Similar to this work, Wallace et al. (2020) takes advantage of structured Cochrane reviews to identify summary targets, and uses portions of the input documents as model inputs. Shah et al. (2021) extracts relations from nutritional literature, and uses content planning methods to generate summaries highlighting contradictions in the relevant literature.

1 https://duc.nist.gov

2 https://community.cochrane.org/review-production/production-resources/proposing-and-registering-new-cochrane-reviews

3 Of the heterogeneous study types, randomized control trials (RCTs) offer the highest quality of evidence. Around 120 RCTs are published per day as of this writing (https://ijmarshall.github.io/sote/), up from 75 in 2010 (Bastian et al., 2010).

4 EBM-NLP contains high quality, crowdsourced and expert-tagged PIO spans in clinical trial abstracts. See App. I for a comparison to other PICO datasets.

5 Nye et al. (2020) found that removing Comparator elements improved classification performance from 78.0 F1 to 81.4 F1 with no additional changes or hyper-parameter tuning.

6 Wallace et al. (2020) only report correlation of a related metric with human judgments.

7 https://clinicaltrials.gov/
8 https://www.clinicaltrialsregister.eu/
9 https://www.gsk-studyregister.com/

10 https://www.nlm.nih.gov/mesh/pubtypes.html
11 From a cursory inspection of a random sample of studies, this problem seems to be widespread.
12 RCTs provide the highest quality of evidence so we strive to include as many as possible as inputs in our dataset.

13 https://ntp.niehs.nih.gov/
14 https://tac.nist.gov/2018/SRIE/
15 https://www.cochrane.org/
16 https://jbi.global/
17 https://www.g-i-n.net
18 https://www.sickkids.ca/
19 https://www.york.ac.uk/crd/
20 https://www.canada.ca/en/public-health/services/reportspublications.html
21 https://www.campbellcollaboration.org

22 https://www.evidencepartners.com/
23 https://www.covidence.org/reviewers/
24 https://sysrev.com/
25 https://www.jbisumari.org/

Table 15: table-to-table results on the validation set. We report precision, recall, and macro-averaged F1-scores.