
On Generating Extended Summaries of Long Documents



Prior work in document summarization has mainly focused on generating short summaries of a document. While this type of summary helps get a high-level view of a given document, it is desirable in some cases to know more detailed information about its salient points that cannot fit in a short summary. This is typically the case for longer documents such as a research paper, legal document, or a book. In this paper, we present a new method for generating extended summaries of long papers. Our method exploits the hierarchical structure of the documents and incorporates it into an extractive summarization model through a multi-task learning approach. We then present our results on three long summarization datasets: arXiv-Long, PubMed-Long, and Longsumm. Our method outperforms or matches the performance of strong baselines. Furthermore, we perform a comprehensive analysis over the generated results, offering insights for future research on long-form summary generation. Our analysis shows that our multi-tasking approach can adjust the extraction probability distribution in favor of summary-worthy sentences across diverse sections. Our datasets and code are publicly available at https://github.com/Georgetown-IR-Lab/ExtendedSumm.


In the past few years, there has been significant progress on both extractive (e.g., Nallapati, Zhai, and Zhou 2017; Zhou et al. 2018; Liu and Lapata 2019; Xu et al. 2020; Jia et al. 2020) and abstractive (e.g., See, Liu, and Manning 2017; MacAvaney et al. 2019; Zhang et al. 2019; Sotudeh, Goharian, and Filice 2020; Dong et al. 2020) approaches for document summarization. These approaches generate a concise summary of a document, capturing its salient content. However, for a longer document containing numerous details, it is sometimes helpful to read an extended summary that provides details about its different aspects. Scientific papers are examples of such documents; while their abstracts provide a short summary of their main methods and findings, the abstract does not include details of the methods or experimental conditions. For those who seek more detailed information about a document without having to read the entire document, an extended or long summary can be desirable (Chandrasekaran et al. 2020; Sotudeh, Cohan, and Goharian 2020; Ghosh Roy et al. 2020).

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Many long documents, including scientific papers, follow a certain hierarchical structure where content is organized into multiple sections and sub-sections. For example, research papers often describe objectives, problem, methodology, experiments, and conclusions (Collins, Augenstein, and Riedel 2017). A few prior studies have noted the importance of document structure in shorter-form summary generation (Collins, Augenstein, and Riedel 2017). However, we are not aware of existing summarization methods that explicitly model the document structure when generating extended summaries.

We approach the problem of generating extended summaries by incorporating the document's hierarchical structure into the summarization model. Specifically, we hypothesize that integrating the processes of sentence selection and section prediction improves the summarization model's performance over existing baseline models on the extended summarization task. To test this hypothesis, we evaluate our proposed model on three extended summarization datasets, namely arXiv-Long, PubMed-Long, and Longsumm. We further provide comprehensive analyses over the generated results for two long datasets, demonstrating the qualities of our model over the baseline. Our analysis reveals that the multi-tasking model helps adjust the sentence extraction probability to the advantage of salient sentences scattered across different sections of the document. Our contributions are threefold:

1. A multi-task learning approach for leveraging document structure in generating extended summaries of long documents.

2. In-depth and comprehensive analyses over the generated results to explore the qualities of our model in comparison with the baseline model.

3. Collecting two large-scale extended summarization datasets with oracle labels for facilitating ongoing research in the extended summarization domain.

arXiv:2012.14136v1 [cs.CL] 28 Dec 2020

Related Work

Scientific document summarization Summarizing scientific papers has garnered vast attention from the research community in recent years, although it has been studied for decades. The characteristics of scientific papers, namely their length, writing style, and discourse structure, call for special modeling considerations when summarizing in the scientific domain. Researchers have utilized different approaches to address these challenges.

In earlier work, Teufel and Moens (2002) proposed a Naïve Bayes classifier to do content selection over the documents' sentences with regard to their rhetorical sentence role. More recent works have highlighted the importance of discourse structure and its usefulness in summarizing scientific papers. For example, Collins, Augenstein, and Riedel (2017) used the pre-defined section cluster in which a source sentence appears as a categorical feature to aid the model in identifying summary-worthy sentences. Cohan et al. (2018) introduced large-scale datasets of arXiv and PubMed (collected from public repositories), and used a hierarchical encoder to model the discourse structure of a paper, and then used an attentive decoder to generate the summary. More recently, Xiao and Carenini (2019) proposed a sequence-to-sequence model that incorporates both the global context of the entire document and the local context within the specified section. Inspired by the fact that discourse information is important when dealing with long documents, we utilize this structure in scientific summarization. Unlike prior works, we integrate sentence selection and sentence section labeling processes through a multi-task learning approach. In a different line of research, the use of citation context information has been shown to be quite effective at summarizing scientific papers (Abu-Jbara and Radev 2011). For instance, Goharian (2015, 2018) utilized a citation-based approach, denoting how the paper is cited in the reference papers, to form the summary. Here, we do not exploit any citation context information.

Extended summarization While summarization research has been extensively explored in the literature, extended summarization has only recently gained a great deal of attention from the research community. Among the first attempts to encourage ongoing research in this field, Chandrasekaran et al. (2020) set up the Longsumm shared task¹ on producing extended summaries from scientific documents and provided an extended summarization dataset called Longsumm over which participants were urged to generate extended summaries. To tackle this challenge, researchers used different methodologies. For instance, Sotudeh, Cohan, and Goharian (2020) proposed a multi-tasking approach to jointly learn sentence importance along with its section to be included in the summary. Herein, we aim at validating the multi-tasking model on a variety of extended summarization datasets and provide a comprehensive analysis to guide future research. Moreover, Ghosh Roy et al. (2020) utilized section-contribution pre-computations (on the training set) to assign weights via a budget module for generating extended summaries. After specifying the section contribution, an extractive summarizer is executed over each section separately to extract salient sentences. Unlike their work, we unify the sentence selection and sentence section prediction tasks to effectively aid the model in identifying summary-worthy sentences scattered around different sections. Furthermore, Reddy et al. (2020) proposed a CNN-based classification network for extracting salient sentences. Gidiotis, Stefanidis, and Tsoumakas (2020) proposed to use a divide-and-conquer (DANCER) approach (Gidiotis and Tsoumakas 2020) to identify the key sections of the paper to be summarized. The PEGASUS abstractive summarizer (Zhang et al. 2019) then runs over each section separately to produce section summaries, which are finally concatenated to form the extended summary. Beltagy, Peters, and Cohan (2020) proposed "Longformer", which utilizes "Dilated Sliding Windows", enabling the model to achieve better long-range coverage on long documents. Given all of the above, to the best of our knowledge, we are the first to conduct a comprehensive analysis over the generated summarization results in the extended summarization domain.


We use three extended summarization datasets in this research. The first one is the Longsumm dataset, which was provided in the Longsumm 2020 shared task (Chandrasekaran et al. 2020). To further validate the model, we collect two additional datasets called arXiv-Long and PubMed-Long by filtering the instances of the arXiv and PubMed corpora to retain those whose abstract contains at least 350 tokens. Also, to measure how our model works on a scientific dataset with summaries of mixed, varied lengths, we use the arXiv summarization dataset (Cohan et al. 2018).

Longsumm The Longsumm dataset was provided for the Longsumm challenge (Chandrasekaran et al. 2020) whose aim was to generate extended summaries for scientific papers. It consists of two types of summaries:

• Extractive summaries: these summaries come from the TalkSumm dataset (Lev et al. 2019), containing 1705 extractive summaries of scientific papers according to their video talks in conferences (e.g., ACL, NAACL). Each summary within this corpus is formed by appending the top 30 sentences of the paper.

• Abstractive summaries: an add-on dataset containing 531 abstractive summaries from several CS domains such as Machine Learning, NLP, and AI, that are written by NLP and ML researchers on their blogs. The length of summaries in this dataset ranges from 50-1500 words per paper.

In our experiments, we use the extractive set along with 50% of the abstractive set as our training set, containing 1969 papers, and 20% of the abstractive set as the validation set. Note that these splits are made for the purpose of our internal experiments, as the official test set containing 22 abstractive summaries is blind (Chandrasekaran et al. 2020).

arXiv-Long & PubMed-Long

To further test our methods on additional datasets, we construct two extended summarization datasets for our task. For creating the first dataset, we take the arXiv summarization dataset introduced by Cohan et al. (2018) and retain the instances whose abstract (i.e., ground-truth summary) contains at least 350 tokens. We call this dataset arXiv-Long. We repeat the same process on the PubMed papers obtained from the Open Access FTP service² and call this dataset PubMed-Long. The motivation is that we are interested in validating our model on extended summarization datasets to investigate its effects compared to existing works, and 350 is the length threshold that we use to characterize papers with "long" summaries. The resulting sets contain 11,149 instances for arXiv-Long and 88,035 instances for PubMed-Long. Note that the abstracts of the papers are used as ground-truth summaries in these two datasets. The overall statistics of the datasets are shown in Table 1. We release these datasets to facilitate future research in extended summarization.
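The length-based filtering described above can be sketched as follows. This is a minimal illustration rather than the released preprocessing code: the whitespace tokenizer, the `abstract` field name, and the function names are assumptions, since the text does not name the tokenizer or data schema.

```python
from typing import Dict, Iterable, List

MIN_SUMMARY_TOKENS = 350  # length threshold stated in the text


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokenization; the paper does not name its tokenizer.
    return len(text.split())


def filter_long_summaries(papers: Iterable[Dict[str, str]],
                          min_tokens: int = MIN_SUMMARY_TOKENS) -> List[Dict[str, str]]:
    # Keep only papers whose abstract (the ground-truth summary) is long enough.
    return [p for p in papers if count_tokens(p["abstract"]) >= min_tokens]


# Toy records (hypothetical): abstracts of 349, 350, and 900 tokens.
papers = [
    {"id": "a", "abstract": "word " * 349},
    {"id": "b", "abstract": "word " * 350},
    {"id": "c", "abstract": "word " * 900},
]
kept_ids = [p["id"] for p in filter_long_summaries(papers)]
```

With a different tokenizer the boundary cases would shift, so the exact instance counts reported above should not be expected from this sketch.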

Table 1: Statistics on arXiv (Cohan et al. 2018), Longsumm (Chandrasekaran et al. 2020), and two extended summarization datasets (arXiv-Long, PubMed-Long), collected by this work.


In this section, we discuss our proposed method that aims at jointly learning to predict sentence importance and its corresponding section. Before discussing the details of our summarization model, we investigate the preliminary background that provides a fair basis for implementing our method.


Summarization The extractive summarization system aims at extracting salient sentences to be included in the summary. Formally, let $P$ denote a scientific paper containing sentences $[s_1, s_2, s_3, \ldots, s_m]$, where $m$ is the number of sentences. Extractive summarization is then defined as the task of assigning a binary label ($\hat{y}_i \in \{0, 1\}$) to each sentence $s_i$ within the paper, signifying whether the sentence should be included in the summary.
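As a small illustration of this formulation, the summary is simply the subsequence of sentences that receive the label 1. The function name `build_summary` and the toy sentences below are hypothetical and serve only to make the definition concrete.

```python
from typing import List


def build_summary(sentences: List[str], labels: List[int]) -> List[str]:
    """Return the sentences whose predicted label is 1, in document order."""
    if len(sentences) != len(labels):
        raise ValueError("expected exactly one label per sentence")
    return [s for s, y in zip(sentences, labels) if y == 1]


# Toy paper with m = 3 sentences and hypothetical predicted labels.
paper = [
    "We study long documents.",
    "Related work is vast.",
    "Our model improves ROUGE.",
]
predicted = [1, 0, 1]
summary = build_summary(paper, predicted)
```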

BERTSUM: BERT for Summarization

As our base model we use the BERTSUM extractive summarization model (Liu and Lapata 2019), a BERT-based sentence classification model fine-tuned for summarization.

After BERTSUM outputs sentence representations within the input document, several inter-sentence Transformer layers are stacked upon the BERTSUM to collect document-level features. The final output layer is a linear classifier with Sigmoid activation function to decide whether the sentence should be included or not. The loss function is defined as below:

$$L_1 = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \quad (1)$$

where $N$ is the output size, $\hat{y}_i$ is the output of the model, and $y_i$ is the corresponding target value. In our experiments, we use this model to extract salient sentences (i.e., those with the positive label) to form the summary. We set this model as the baseline, called BERTSUMEXT (Liu and Lapata 2019).
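The sentence-selection loss is described in this paper as a binary cross-entropy over the $N$ outputs. A minimal pure-Python sketch is given below, assuming the loss is averaged over the $N$ sentences; the averaging, the clipping constant `eps`, and the function name are assumptions rather than details confirmed by the text.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_true: Sequence[int],
                         y_prob: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between 0/1 targets and sigmoid outputs."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log() never receives exactly 0.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# An uninformative model (all probabilities 0.5) yields log(2) per sentence.
uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
# A confident, correct model yields a much smaller loss.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```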

Our Model: A Section-Aware Summarizer

Inspired by few prior works that have studied the effect of document's hierarchical structure in summarization task (Conroy and Davis 2017; Cohan et al. 2018), we define a section prediction task, aiming at predicting the relevant section for each sentence in the document. Specifically, we add an additional linear classification layer on top of BERTSUM sentence representations to predict the relevant section to each sentence. The loss function for the section prediction network is defined as follows:

$$L_2 = -\sum_{i \in S} y_i \log \hat{y}_i \quad (2)$$

where $y_i$ and $\hat{y}_i$ are the ground-truth and the model scores for each section $i$ in $S$.
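The section-prediction loss is described in this paper as a categorical cross-entropy over the section set $S$. The sketch below assumes a softmax over the linear layer's outputs and a one-hot ground-truth section, in which case the loss reduces to the negative log-probability of the gold section. The section names and logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def section_cross_entropy(logits: Sequence[float], gold_index: int) -> float:
    """Categorical cross-entropy for one sentence with a one-hot gold section."""
    probs = softmax(logits)
    return -math.log(probs[gold_index])


# Hypothetical section inventory; the paper's actual label set is not listed here.
sections = ["introduction", "methods", "results", "conclusion"]
uniform_loss = section_cross_entropy([0.0, 0.0, 0.0, 0.0], sections.index("methods"))
peaked_loss = section_cross_entropy([0.0, 5.0, 0.0, 0.0], sections.index("methods"))
```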


Figure 1: The overview of the BERTSUMEXTMULTI model. The baseline model (i.e., BERTSUMEXT) is outlined with a dashed border. The extension to the baseline model is the addition of the Section Prediction linear layer (shown in the green box).

The entire extractive network is then trained to optimize both tasks (i.e., sentence selection and section prediction) in a multi-task setting:

$$L_{\text{Multi}} = \alpha L_1 + (1 - \alpha) L_2 \quad (3)$$

where $L_1$ is the binary cross-entropy loss from the sentence selection task, $L_2$ is the categorical cross-entropy loss from the section prediction network, and $\alpha$ is the weighting parameter that balances the learning procedure between the sentence and section prediction tasks.
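The weighted combination in Equation (3) can be sketched in a few lines. The function name and the example loss values are hypothetical; only the formula itself comes from the text.

```python
def multi_task_loss(l1: float, l2: float, alpha: float = 0.5) -> float:
    """Combine the sentence-selection loss l1 and section-prediction loss l2."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * l1 + (1.0 - alpha) * l2


# Hypothetical per-batch losses.
balanced = multi_task_loss(0.8, 1.2, alpha=0.5)        # equal weighting
selection_only = multi_task_loss(0.8, 1.2, alpha=1.0)  # ignores section prediction
```

Setting `alpha` to 1 recovers a purely sentence-selection objective, while values below 1 let the section-prediction signal shape the shared sentence representations.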

Experimental Setup

In this section, we give details about the preprocessing steps on the datasets and the parameters that we used for the experimented models. For our baseline, we used the pre-trained BERTSUM model and implementation provided by the authors (Liu and Lapata 2019).⁴ BERTSUMEXTMULTI is the model used in Sotudeh, Cohan, and Goharian (2020), but without the post-processing module at inference time, which utilizes trigram-blocking (Liu and Lapata 2019) to hinder repetitions in the final summary. We intentionally removed the post-processing part, as the model attained higher scores in the absence of this module throughout our experiments. In order to obtain ground-truth section labels associated with each sentence, we utilized the external sequential-sentence package⁵ by Cohan et al. (2019). To provide oracle labels for source sentences in our datasets, we use a greedy labelling approach (Liu and Lapata 2019) with a slight modification, labelling up to the top 30, 15, and 25 sentences for the Longsumm, arXiv-Long, and PubMed-Long datasets, respectively, since these numbers of oracle sentences yielded the highest oracle scores.⁶ For the joint model, we set α (the loss weighting parameter) to 0.5, as it resulted in the highest scores throughout our experiments. In all our experiments, we pick the checkpoint that achieves the best average of ROUGE-2 and ROUGE-L scores on the validation intervals as our best model for inference.
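The greedy oracle labelling mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: it scores candidate selections with a unigram-overlap F1 as a stand-in for ROUGE, and it does not implement the authors' modification that encourages oracle sentences from diverse sections. All function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import List


def unigram_f1(candidate: List[str], reference: List[str]) -> float:
    # Stand-in for ROUGE (assumption): clipped unigram overlap F1.
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences: List[str], reference: str, max_sentences: int) -> List[int]:
    """Greedily add the sentence that most improves the score; stop when none does."""
    ref_tokens = reference.lower().split()
    sent_tokens = [s.lower().split() for s in sentences]
    selected: List[int] = []
    best_score = 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            candidate = [tok for j in selected + [i] for tok in sent_tokens[j]]
            score = unigram_f1(candidate, ref_tokens)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return sorted(selected)


sentences = [
    "we propose a multi-task model",
    "the weather was nice",
    "results improve on long summaries",
]
reference = "a multi-task model improves results on long summaries"
oracle = greedy_oracle(sentences, reference, max_sentences=15)
labels = [1 if i in oracle else 0 for i in range(len(sentences))]
```

In the setup described above, `max_sentences` would correspond to 30, 15, or 25 depending on the dataset.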


In this section, we present the performance of the baseline and our model over the validation and test sets of the extended summarization datasets. We then discuss our proposed model's performance compared to the baseline on a summarization dataset with summaries of mixed, varied lengths (i.e., arXiv). As evaluation metrics, we report the summarization systems' performance in terms of ROUGE-1 (F1), ROUGE-2 (F1), and ROUGE-L (F1).

As shown in Table 2, incorporating the section predictor into the summarization model (i.e., the BERTSUMEXTMULTI model) performs fairly well compared to the baseline model. This is a particularly important finding since it highlights the importance of injecting document structure when summarizing a scientific paper. While the score gap is relatively larger on the arXiv-Long and Longsumm datasets, the two models perform similarly on the PubMed-Long dataset.

Table 2: ROUGE (F1) results of the baseline (i.e., BERTSUMEXT) and our proposed model (i.e., BERTSUMEXTMULTI) on extended summarization datasets. ∗ shows the statistically significant improvement (paired t-test, p < 0.01). The validation set for Longsumm refers to our internal validation set (20% of the abstractive set) as there was no official validation set provided for this dataset.

As observed in Chandrasekaran et al. (2020), while this model improves ROUGE-1 quite significantly over the other state-of-the-art systems, it stays competitive on the ROUGE-2 and ROUGE-L metrics. In terms of average ROUGE (F1), the BERTSUMEXTMULTI model ranks first by a large margin compared to the other systems.

Table 3: ROUGE (F1) results of our multi-tasking model on the blind test set of Longsumm shared task containing 22 abstractive summaries (Chandrasekaran et al. 2020), along with the performance of other participants’ systems. We only show top 5 participants in this table.

To test the model on a dataset with summaries of mixed, varied lengths, we trained and tested it on the arXiv dataset, which contains abstracts of varying length as ground-truth summaries. Table 4 shows that our model can achieve competitive performance on this dataset. While the model does not yield any improvement on the arXiv dataset, our aim was to investigate whether our model is superior to existing models on longer-form datasets, such as those we have used in this research, which we validated by presenting the evaluation results on long summarization datasets.

Table 4: ROUGE (F1) results of the baseline (i.e., BERTSUMEXT) and our proposed model (i.e., BERTSUMEXTMULTI) on arXiv summarization dataset.


In order to gain insights into how our multitasking approach works on different long datasets, we perform an extensive analysis in this section to explore the qualities of our multi-tasking system (i.e., BERTSUMEXTMULTI) over the baseline (i.e., BERTSUMEXT). Specifically, we perform two types of analyses: 1) quantitative analysis; 2) qualitative analysis.

For the first part, we use two metrics. $RG_{diff}$ denotes the average ROUGE (F1) difference (i.e., gap) between the baseline and our model.⁷ Positive values indicate improvement, while negative values denote a decline in scores. Similarly, $F_{diff}$ is the average difference in F1 score between the baseline and our model. We create three bins sorted by $RG_{diff}$: IMPROVED, which contains the papers whose average ROUGE (F1) score is improved by the multi-tasking model; TIED, which includes those whose average ROUGE (F1) score the multi-tasking model leaves unchanged; and DECLINED, which contains those whose average ROUGE (F1) score is decreased by the joint model.
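The binning by average ROUGE difference can be sketched as below. The tuple layout, the function names, and the small tolerance used to decide ties are assumptions made for illustration; the text only states that the average is taken over the three ROUGE (F1) variants.

```python
from typing import Tuple

# (ROUGE-1 F1, ROUGE-2 F1, ROUGE-L F1) for a single summary, in percent.
Rouge = Tuple[float, float, float]


def rg_diff(baseline: Rouge, multi: Rouge) -> float:
    """Average ROUGE (F1) of the multi-tasking model minus that of the baseline."""
    return sum(multi) / 3 - sum(baseline) / 3


def assign_bin(baseline: Rouge, multi: Rouge, tol: float = 1e-9) -> str:
    # Assumption: differences within `tol` are treated as unchanged (TIED).
    diff = rg_diff(baseline, multi)
    if diff > tol:
        return "IMPROVED"
    if diff < -tol:
        return "DECLINED"
    return "TIED"


# Hypothetical per-summary scores.
base_scores = (40.0, 15.0, 35.0)
multi_scores = (42.0, 16.0, 36.0)
bin_name = assign_bin(base_scores, multi_scores)
```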

For the qualitative analysis, we specifically aim at comparing the methods in terms of section distribution, since that is where our method's improvements are expected to come from. Furthermore, we conduct an additional length analysis over the results generated by the baseline versus our model.

Quantitative Analysis

We first perform the quantitative analysis over the test sets of the long summarization datasets in two parts: 1) metric analysis, which aims at comparing different bins based on the average ROUGE score difference between the baseline and our model; and 2) length analysis, which aims at finding the correlation between the summary length in different bins and the models' performance. Table 5 shows the overall quantities for the Longsumm and arXiv-Long datasets in terms of the average difference of ROUGE and F1 scores.

Table 5: IMPROVED, TIED, and DECLINED bins on the test set of Longsumm and arXiv-Long datasets. The numbers show the improvements (positive) and drops (negative) compared to the baseline model (i.e., BERTSUMEXT).

Metric Analysis

As shown, the multi-tasking approach is able to improve 76 summaries with an average ROUGE (F1) improvement of 2.05%. The improvement is even larger on the arXiv-Long dataset, with an average ROUGE improvement of 2.40%. Interestingly, our method consistently improves the F1 measure in general (see the total F1 scores in Table 5). Seemingly, the F1 metric directly correlates with the ROUGE (F1) metric on the arXiv-Long dataset, whereas this is not the case on the DECLINED bin of the Longsumm dataset. This might be due to the relatively small test set size of the Longsumm dataset. Note that the IMPROVED bin holds more summaries, with larger metric changes, than the DECLINED bin across both datasets in our evaluation.

Length Analysis

We analyze the results generated by both models to see if the summary length affects the models' performance, using the bar charts in Figure 2. The bar charts are intended to provide the basis for comparing both models on different length bins (x-axis), which are equal-frequency bins (i.e., each has the same number of papers). We used five bins (each with 31 summaries) for the Longsumm dataset and ten bins (each with 196 summaries) for the arXiv-Long dataset.
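Equal-frequency binning by summary length can be sketched as follows. How leftover items are distributed when the count is not divisible by the number of bins is an assumption here (the first bins receive one extra item); the function name and toy lengths are hypothetical.

```python
from typing import List


def equal_frequency_bins(lengths: List[int], num_bins: int) -> List[List[int]]:
    """Sort summary indices by length and split them into bins of equal size."""
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    base, extra = divmod(len(order), num_bins)
    bins: List[List[int]] = []
    start = 0
    for b in range(num_bins):
        # Assumption: any remainder is spread over the first `extra` bins.
        end = start + base + (1 if b < extra else 0)
        bins.append(order[start:end])
        start = end
    return bins


# Toy ground-truth summary lengths in tokens (hypothetical).
lengths = [500, 120, 900, 300, 700, 60]
bins = equal_frequency_bins(lengths, num_bins=3)
```

Average ROUGE per bin for each model can then be compared bin by bin, as in the length analysis above.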

Figure 2: Bar charts exhibiting the correlation of ground-truth summary length (in tokens) with the baseline (i.e., BERTSUMEXT) and our multi-tasking model’s (i.e., BERTSUMEXTMULTI) performance. The diagrams are shown for Longsumm and arXiv-Long datasets’ test set. Each bin contains 31 summaries for Longsumm, and 196 summaries for arXiv-Long. Length bins on the x-axis (in tokens) are 52-385, 389-591, 624-872, 895-1370, and 1375-1631 for Longsumm, and 350-370, 370-384, 384-399, 399-417, 417-447, 447-490, 490-569, 569-734, 734-1074, and 1074-2123 for arXiv-Long. As denoted, the multi-tasking model generally outperforms the baseline on later bins which include longer-form summaries.

As shown in Figure 2, as the length of the ground-truth summary increases, the multi-tasking model generally improves over the baseline on both datasets, except for the last bin on the Longsumm dataset (Figure 2 (a)), where it achieves comparable performance. This behaviour is also observed on ROUGE-1 and ROUGE-L for the Longsumm dataset. The ROUGE improvement is even more noticeable on the arXiv-Long dataset (see Figure 2 (b)). Thus, the length analysis supports our hypothesis that the multi-tasking model outperforms the baseline by a larger margin when the summary is of longer form.

Qualitative Analysis

In our qualitative analysis of the IMPROVED bin, we found that the multi-tasking model can effectively sample sentences from diverse sections when the ground-truth summary is also sampled from diverse sections. It improves significantly over the baseline when the extractive model can detect salient sentences from important sections.

By investigating the summaries from the DECLINED bin, we noticed that, while our multi-tasking approach can adjust the extraction probability distribution toward diverse sections, it has difficulty picking up salient sentences (i.e., positive sentences) from the corresponding section; thus, it leads to a relatively lower ROUGE score. This might be improved if the two networks (i.e., sentence selection and section prediction) are optimized in a more elegant way, such that the extractive summarizer can further select salient sentences from the specified sections when they can be identified. For example, improved multi-tasking methods could involve task prioritization (Guo et al. 2018) to dynamically balance the learning process between the two tasks during training, rather than using a fixed α parameter.

In the cases where the F1 score and ROUGE (F1) were not consistent with each other, we observed that adding non-salient sentences to the final summary hurts the final ROUGE (F1) scores. In other words, while the multi-tasking approach can achieve a higher F1 score compared to the baseline, since it chooses different non-salient (i.e., negative) sentences than the baseline, the overall ROUGE (F1) scores drop slightly. Conditioning the decoding length (i.e., the number of sentences) might help with this, as done in Mao et al. (2020).

Fig. 3 shows the extraction probabilities that each model outputs on the source sentences. The baseline model picks a large share of its sentences (47%) from the beginning of the paper, while the multi-tasking approach (b) can effectively shift the probability distribution toward summary-worthy sentences that are spread across different sections of the paper, and pick those with higher confidence. Our model achieves an overall F1 score of 53.33% on this sample paper, while the baseline's F1 score is 33.33%.
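The text does not spell out how the F1 score is computed; a plausible reading, assumed here, is a sentence-level F1 between the sentences a model selects and the oracle sentences. The sketch below uses made-up sentence indices: with 15 oracle sentences (the number used for arXiv-Long) and 15 selected sentences, 8 and 5 correct picks would yield 53.33% and 33.33%, respectively. These counts are hypothetical and are not taken from the paper.

```python
from typing import Set


def selection_f1(selected: Set[int], oracle: Set[int]) -> float:
    """Sentence-level F1 between selected and oracle sentence indices."""
    if not selected or not oracle:
        return 0.0
    hits = len(selected & oracle)
    if hits == 0:
        return 0.0
    precision = hits / len(selected)
    recall = hits / len(oracle)
    return 2 * precision * recall / (precision + recall)


# Hypothetical indices: sentences 0..14 are oracle; indices >= 100 are non-oracle.
oracle = set(range(15))
multi_selected = set(range(8)) | set(range(100, 107))  # 8 hits, 7 misses
base_selected = set(range(5)) | set(range(100, 110))   # 5 hits, 10 misses

multi_f1 = round(100 * selection_f1(multi_selected, oracle), 2)
base_f1 = round(100 * selection_f1(base_selected, oracle), 2)
```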

Figure 3: Heat-maps showing the extraction probabilities over the source sentences (Paper ID: astro-ph9807040 sampled from arXiv-Long dataset). For simplicity, we have only shown the sentences that gain over 15% extraction probability by the models. The cells bordered in black show the models’ final selection, and oracle sentences are indicated with *.

Conclusion & Future Work

In this paper, we approach the problem of generating extended summaries, given a long document. Our proposed model is a multi-task learning approach that unifies sentence selection and section prediction processes, extracting summary-worthy sentences. We further collect two large-scale extended summary datasets (arXiv-Long and PubMed-Long) from scientific papers. Our results on three datasets show the efficacy of the joint multi-task model in the extended summarization task. While it achieves fairly competitive performance with the baseline on one of three datasets, it consistently improves over the baseline in the other two evaluation datasets. We further performed extensive quantitative and qualitative analyses over the generated results by both models. These evaluations revealed our model's qualities compared to the baseline. Based on the error analysis, it could be noticed that the performance of this model highly depends on the multi-tasking objectives. Future studies could fruitfully explore this issue further by optimizing the multi-task objectives in a way that both sentence selection and section prediction tasks can benefit.


² https://www.ncbi.nlm.nih.gov/pmc/tools/ftp

³ https://github.com/Georgetown-IR-Lab/ExtendedSumm

⁴ https://github.com/nlpyang/PreSumm

⁵ https://github.com/allenai/sequential_sentence_classification

⁶ The modification was made to ensure that the oracle sentences are sampled from diverse sections.

⁷ The average is defined on ROUGE-1 (F1), ROUGE-2 (F1), and ROUGE-L (F1) scores.