Cooperative Generator-Discriminator Networks for Abstractive Summarization with Narrative Flow


Abstract

We introduce Cooperative Generator-Discriminator Networks (Co-opNet), a general framework for abstractive summarization with distinct modeling of the narrative flow in the output summary. Most current approaches to abstractive summarization, in contrast, are based on datasets whose target summaries are either a single sentence, or a bag of standalone sentences (e.g., extracted highlights of a story), neither of which allows for learning coherent narrative flow in the output summaries. To promote research toward abstractive summarization with narrative flow, we first introduce a new dataset, Scientific Abstract SummarieS (SASS), where the abstracts are used as proxy gold summaries for scientific articles. We then propose Co-opNet, a novel transformer-based framework where the generator works with the discourse discriminator to compose a long-form summary. Empirical results demonstrate that Co-opNet learns to summarize with considerably improved global coherence compared to competitive baselines.

1 Introduction

We study the task of abstractive summarization with narrative flow: given an input document, the distinct goal is to generate a paragraph-length abstractive summary that has a proper narrative flow. Our study contrasts with most previous work that focused on either extractive document-level summarization (Nenkova and McKeown, 2012; Allahyari et al., 2017) or abstractive sentence-level summarization (Rush et al., 2015; Grusky et al., 2019; Narayan et al., 2018a), where maintaining a good narrative flow in the output summary was not within the scope of the task definition.



Figure 1: Structure for introduction → abstract scientific paper generation. (The figure depicts the introduction and abstract with inter-sentence narrative flow, summary generation with the discourse-aware discriminator, and the discourse-aware transformer architecture.)

Our study also contrasts with recent work toward abstractive summarization at the document level. Importantly, most such studies could not directly model or evaluate abstractive summarization with narrative flow in the output summary. This is largely due to the inherent limitations of the existing datasets; the reference summaries available in most commonly used large-scale datasets, such as the CNN/DailyMail dataset (Hermann et al., 2015), are mainly the headlines of the news articles or stories, which are often sets of disconnected sentences. These summaries neither provide the inductive bias for models to learn the desired narrative flow in output summaries, nor enable researchers to measure the quality of narrative flow using reference summaries (Chen and Bansal, 2018). This lack of proper inductive bias also implies that the learned models often exhibit extractive tendency. They do not learn the abstractive generation capability necessary to achieve coherent narrative flow in the output summary (Hoang et al., 2019).

Accordingly, we present a new large-scale dataset, Scientific Abstract SummarieS (SASS), to promote research toward abstractive summarization with narrative flow. SASS provides over 700k samples to support three distinct abstractive summarization formulations: (1) intro-to-abstract, (2) abstract-to-title, and (3) intro-to-title. The first task formulation, intro-to-abstract, supports unique opportunities to study summaries with narrative flow, as abstracts in scientific papers are structured with highly coherent discourse flow. In addition, scientific paper abstracts maintain loose, and therefore abstractive, alignments with respect to the introduction (Figure 1), which challenges current models to learn to abstract rather than extract.

We then introduce Cooperative Generator-Discriminator Networks (Co-opNet), a new modeling framework for abstractive summarization with distinct modeling of the narrative flow in the output summary. In this framework, the generator, based on transformer language models that are fine-tuned for abstractive summarization, proposes a pool of candidate summaries. Then, the discriminator, also based on transformers, selects the summary that has the best narrative flow across adjacent sentences. Our work presents the first study that adapts the learning-to-write framework of Holtzman et al. (2018) , originally proposed for open-ended text generation, to abstractive summarization, along with comprehensive performance reports on the new SASS dataset.

Comprehensive empirical results demonstrate that Co-opNet learns to summarize with considerably improved global coherence compared to competitive baselines. In particular, based on the recently proposed BERTScore (Zhang et al., 2019a), our model outperforms all other models by 3.98 points on BERTScore-R. In addition, human judgments demonstrate that domain experts prefer Co-opNet over the base summarization model in over 64% of cases.

The rest of the paper is organized as follows. We first describe the generator network and the discriminator network in §2 and §3, respectively.

We then describe how we combine the two into a cooperative generator-discriminator network in §4. We introduce our new dataset in §5, followed by the experimental setup and empirical study in §6 and §7, and an analysis of how the discriminator improves coherence in §8. We discuss related work in §9 and conclude in §10.

2 Generator Networks

We use the transformer architecture of Radford et al. (2019) as our generator's architecture. Following Liu et al. (2018), we adapt a language model to the task of abstractive summarization by concatenating the article a, a delimiter token d, and the summary s into one fixed-length input:

x = (a_1, ..., a_{|a|}, d, s_1, ..., s_{|s|})

where |a| is the length of the gold article and |s| is the length of the gold summary. The transformer has the same block architecture as the model of Radford et al. (2019) and consists of L attention blocks, each composed of self-attention and feed-forward layers. At each time step i, the model produces an output probability distribution over the vocabulary for the next token w_i given all previous output tokens w_{<i}:

h_j^0 = W_e w_j + p_j    (1)

h_j^l = transformer_block(h_{\le j}^{l-1}), for l = 1, ..., L    (2)

where W_e is a word embedding matrix, p_j is the position embedding, h_j^0 is the initial representation, and h_{\le j}^{l-1} is the set of all preceding layer block outputs for tokens up to w_j. Finally, for the current position i in the sequence:

p(w_i | w_1, ..., w_{i-1}) = softmax(h_i^L W_e^T)    (3)

where W_e is the same word embedding matrix as in Equation 1 and h_i^L is the final-layer transformer block output at position i.

The model is trained to minimize the negative log-likelihood of the next word w_i given all preceding words:

L_gen = -\sum_{i=1}^{|a|+|s|} \log p(w_i | w_1, ..., w_{i-1})    (4)

where w_i is the i-th token of x. At test time, x consists only of the gold article and delimiter token (a_1, ..., a_{|a|}, d), and we decode generated summaries g starting from this input.
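To make the setup concrete, the following sketch shows one way the input packing and training loss above could be implemented with a Huggingface-style GPT-2 model. The delimiter string "<|delim|>", the helper names, and the assumption that the model's forward pass returns logits are illustrative, not the authors' released code.

import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
tok.add_special_tokens({"additional_special_tokens": ["<|delim|>"]})  # hypothetical delimiter d
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tok))

def pack_example(article, summary, max_article=800, max_summary=200):
    # x = (a_1, ..., a_|a|, d, s_1, ..., s_|s|), truncated as in Section 6.2
    a = tok.encode(article)[:max_article]
    s = tok.encode(summary)[:max_summary]
    d = tok.convert_tokens_to_ids("<|delim|>")
    return torch.tensor([a + [d] + s])

def generator_loss(x):
    # Equation 4: negative log-likelihood of each next token given its prefix
    logits = model(x[:, :-1]).logits
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1))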

3 Discriminator Networks

Because the autoregressive nature of the generator makes it unlikely to achieve narrative flow across long time horizons, we incorporate a discriminator model into the decoding process. Due to the difficulty of explicitly defining discourse properties to discriminate between generations with good and bad narrative flow, we rely on a parametrized scoring function to approximate this discourse property by scoring whether pairs of adjacent sentences are consistent with one another.

3.1 Discriminator Architecture

Sentence Pair Representation To model the likelihood of adjacency between two sentences s_u and s_v of length |u| and |v| respectively, we first compute a hidden representation of the sentence pair using BERT (Devlin et al., 2019). This representation allows us to better capture the fine-grained contextual information necessary for understanding the relationship between s_u and s_v. The initial input to the encoder is the concatenation of these sentences:

s = [CLS] + s_u + [SEP] + s_v + [SEP]

where [CLS] is a special token associated with the task and [SEP] is a sentence delimiter token. As in Devlin et al. (2019), each word in the sequence is encoded by a word embedding w_i and a positional embedding p_i that identifies its position in s. Additionally, we have a learned segmentation embedding q_i for each token in s to indicate which of the two sentences, s_u or s_v, the word belongs to. Therefore, our full input to each of the encoder layers is E = (e_1, ..., e_{|u|+|v|+3}), where e_i = w_i + p_i + q_i. We encode these representations using BERT (Devlin et al., 2019), and the final contextual representation used for adjacency prediction is the pooled hidden representation of the [CLS] token, which we denote h_cls.
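As a rough sketch of this encoding step (using bert-base-uncased as a stand-in for the SciBERT weights used in the paper; variable names are ours):

import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Passing a sentence pair yields [CLS] s_u [SEP] s_v [SEP], with token_type_ids
# serving as the segmentation embeddings q_i that separate s_u from s_v.
enc = tok("We propose a new model.", "It improves coherence.", return_tensors="pt")
out = bert(**enc)
h_cls = out.pooler_output  # pooled [CLS] representation used for adjacency prediction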

Adjacency Classification Once the sentence pair s = (s_u, s_v) has been encoded as the hidden representation h_cls, we obtain the probability of adjacency between the pair of sentences by a linear projection W_disc followed by a log-softmax:

P_adj(s) = log-softmax(W_disc h_cls)    (5)

In § 4, we describe how the adjacency scores are used to re-rank candidate summaries.

Sentence Selection For Discriminator Models

To train a discriminator model, we use a subset of adversarial and positive sentence pair examples extracted from the training set. The sentence pairs are extracted from gold abstracts containing at least five sentences using the following approach: for a randomly selected sentence s_u from the abstract, we randomly select an adjacent sentence, s_{u-1} or s_{u+1}, as a positive example and any non-adjacent sentence s_v with v ∉ {u-1, u+1} as a negative example.
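A minimal sketch of this sampling procedure (function and variable names are ours, not the paper's):

import random

def sample_pair_examples(sentences):
    # sentences: sentence-tokenized gold abstract with at least five sentences
    n = len(sentences)
    u = random.randrange(n)
    adjacent = [i for i in (u - 1, u + 1) if 0 <= i < n]
    non_adjacent = [i for i in range(n) if i not in (u - 1, u, u + 1)]
    pos = random.choice(adjacent)      # positive: a randomly chosen adjacent sentence
    neg = random.choice(non_adjacent)  # negative: any non-adjacent sentence
    return (sentences[u], sentences[pos], 1), (sentences[u], sentences[neg], 0)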

Adjacency Learning Objective

We define the training objective for the adjacency discriminator to minimize the negative log-likelihood of predicting whether two sentences are adjacent or not:

L_disc = -(ADJ(s) log P_adj(s) + (1 - ADJ(s)) log(1 - P_adj(s)))    (6)

where ADJ(s) is an indicator function for whether the two sentences in s are adjacent and P_adj(s) is the discriminator's estimate of adjacency from Equation 5.
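Equations 5 and 6 amount to a small classification head over h_cls; a sketch follows (the two-class output and the cross-entropy form are our reading of the text):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacencyHead(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.w_disc = nn.Linear(hidden_size, 2)  # W_disc: {non-adjacent, adjacent}

    def forward(self, h_cls):
        # Equation 5: linear projection followed by a log-softmax
        return F.log_softmax(self.w_disc(h_cls), dim=-1)

def adjacency_loss(log_probs, labels):
    # Equation 6: negative log-likelihood of the gold adjacency label ADJ(s)
    return F.nll_loss(log_probs, labels)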

4 Cooperative Generation

The discriminator re-ranks candidate summary generations based on the overall likelihood of sentences within the summary being adjacent. More specifically, for each candidate generated summary g and sentence pair s = (s_u, s_v) contained in g, we define the likelihood of sentence adjacency as the score that the discriminator assigns to that sentence pair (Equation 5). By defining discourse using this approach, we place minimal restrictions on the model's ability to home in on patterns pertaining to desirable narrative flow.

To incorporate this objective into our summarization framework, we use a modified decoding objective function. First, we generate a pool of candidate summaries from the base summarization model (§2) using an arbitrary decoding strategy (e.g., beam search, nucleus sampling, top-k sampling). Then, the discriminator is used to re-rank these candidates. Specifically, for each candidate summary g of length |g| tokens with S sentences, we maximize both the language model probability of each token p(w_i | w_1, ..., w_{i-1}) and the adjacency score of each pair of adjacent sentences:

p(g) = δ_gen \sum_{i=1}^{|g|} \log p(w_i | w_1, ..., w_{i-1}) + δ_disc \sum_{u=2}^{S} P_adj(s_u, s_{u-1})    (7)

where δ_gen and δ_disc are hyper-parameters controlling the contribution of the generator and the discourse discriminator to the final predicted summary. The score P_adj(s_u, s_{u-1}) is the adjacency score computed by the discourse discriminator model (Equation 5).
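A sketch of the re-ranking step in Equation 7 (we sum log-probabilities here; whether the generator term is accumulated in probability or log space is not fully specified by the text):

def coop_score(token_logprobs, adjacency_scores, delta_gen=0.5, delta_disc=0.5):
    # Equation 7: weighted combination of generator and discriminator scores
    return delta_gen * sum(token_logprobs) + delta_disc * sum(adjacency_scores)

def rerank(candidates):
    # candidates: list of (token_logprobs, adjacency_scores) pairs,
    # one per generated summary in the candidate pool
    return max(candidates, key=lambda c: coop_score(*c))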

5 Data

In the following section, we introduce SASS, a dataset of over 700K introduction-abstract pairs from arXiv [1]. We also describe AAN, an existing dataset of NLP papers published at top venues [2]. These datasets exhibit greater discourse structure and abstractiveness than previous summarization corpora. We specifically choose to focus on scientific papers incorporating a wide range of domain knowledge and subjects to rigorously test the generalization of our models across different fields of scientific endeavour.

5.1 Datasets

SASS The dataset is crawled from arxiv.org and contains over 700k introduction-abstract pairs from scientific articles. In our experiments we primarily focus on the CS [3] and Bio [4] domain subsets. The task in SASS presents a challenge to existing summarization models, since it requires models to learn relevant domain knowledge for the scientific domain of interest, as well as to recognize common discourse structure for papers written in that domain. While the work in this paper focuses mainly on Introduction → Abstract summarization, we also extracted text for two other summarization tasks (Introduction → Title and Abstract → Title) and provide results for those in Appendix A.

AAN In addition to the SASS dataset, we include an existing dataset of scientific articles that focuses on papers in the NLP computer science domain. This dataset consists of a 12k-paper subset of the ACL Anthology Network (AAN) (Radev et al., 2009), obtained after removing articles from the anthology without abstracts, as well as duplicates. As with SASS, we define the task for AAN as the generation of abstracts from introductions.

5.2 Narrative Flow Analysis

Since the focus of this work is on generating summaries with more coherent narrative flow, we concentrate on datasets requiring narrative structure to generate good summaries. Particular attributes of these datasets that connect to discourse structure are:

• Length of summaries → Are the summaries long enough to clearly show narrative flow properties?

• Abstractiveness of gold summaries → Do the summaries exhibit particular sentence-level flow, or are the summary sentences extracted highlights from the context?

As can be seen in Table 1, SASS and AAN have properties missing from existing summarization datasets based on Newswire data, such as XSum (Narayan et al., 2018a) and Newsroom (Grusky et al., 2019).

Table 1: Statistics of gold summaries in different summarization datasets.

6 Experimental Setup

In this section we outline comparison baselines and describe experimental setups for our generator and discriminator models.

6.1 Baselines

We train a 2-layer bi-LSTM sequence-to-sequence model with attention. The bi-LSTM is used to encode a given source article a, and a separate decoder LSTM produces the generated summary g. At each decoding time step, the decoder attends to all the context vectors produced by the encoder, as well as the maintained state from the previous decoder tokens, to produce the next token in the summary. We also implement a Pointer-Generator (PGen + Coverage) model (See et al., 2017) that extends the base LSTM model (LSTM + Coverage) to allow tokens to be copied from the input during generation. Baselines are trained for up to 40,000 steps with a batch size of 16. Following previous work, we decode from these baselines using beam search with a beam size of 4. We set a maximum decoding length of 200 tokens. The RNN baselines additionally have a minimum decoding length of 35 tokens, as in See et al. (2017).

6.2 Generator Model

Input We perform word and sentence-level tokenization using spaCy and NLTK (Loper and Bird, 2002). Because of the fixed input size of the transformer language model (Radford et al., 2019), the input context is truncated to a maximum of 800 tokens and summaries are truncated to a maximum of 200 tokens.

Implementation All our models are implemented in PyTorch. Our code is adapted from the Huggingface implementation of the 117M-parameter OpenAI GPT-2 language model [5] and uses the pre-trained weights of Radford et al. (2019).

Training We use a learning rate of 2e-5 for fine-tuning. We use a batch size of 1 with gradient accumulation to simulate a batch size of 16. On the AAN subset, we train the base summarization transformer model for 11 epochs. On the SASS CS and Bio subsets, we train the base summarization transformer model for 8 epochs. All experiments are run on a Titan-X GPU. Training time for the AAN and SASS Bio datasets is about 30 minutes per epoch; training time for the SASS CS dataset is 2.5 hours per epoch.
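The batch size of 1 with an effective batch size of 16 corresponds to a standard gradient-accumulation loop; a sketch, assuming an optimizer and data loader exist and reusing the generator_loss helper sketched in §2:

ACCUM_STEPS = 16  # per-step batch size 1, effective batch size 16

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    loss = generator_loss(batch) / ACCUM_STEPS  # average over the virtual batch
    loss.backward()                             # gradients accumulate across steps
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()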

6.3 Discriminator Model

Input We use a maximum sentence length of 200 tokens to accommodate the fixed input size of BERT (512 tokens), reduce inference time, and discourage the model from generating abnormally long run-on sentences that indicate the presence of coherence issues.

Implementation The discriminator models are adapted from the Huggingface implementation of the BERT next-sentence-prediction classifier [6]. We initialize the 12-layer BERT-base model with the pretrained weights of the SciBERT-uncased model, which was originally trained on 1.14 million scientific papers (Beltagy et al., 2019).

Training We fine-tune the discriminator using a learning rate of 2e-5 and batch size of 2. We train two discriminators: one is trained on AAN for decoding both SASS CS and AAN, while the other discriminator is trained on SASS Bio and used exclusively for decoding that subset. All discriminator models are trained for 17 epochs on a Titan-X GPU over a single day.

6.4 Generation Hyperparameters

During inference, we use top-k sampling with k=4 (Fan et al., 2018b) to generate 10 candidate summaries for each model. In the re-ranking objective (Equation 7), we weigh the generation and discriminator models equally for all experiments by setting δ_gen = δ_disc = 0.5, and we select the candidate out of the 10 that achieves the highest joint score in Equation 7 as the final summary. We filter candidate summaries from the hypothesis generation pool that contain sentences longer than a fixed maximum length of 200 tokens, a clear sign of coherence deterioration.
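A sketch of this candidate-generation loop (top-k sampling with k=4, 10 candidates, and the 200-token sentence filter; the helper names and the whitespace-token approximation in the filter are ours):

import torch
from nltk.tokenize import sent_tokenize  # requires NLTK's punkt data

def sample_next_token(logits, k=4):
    # top-k sampling: renormalize over the k most likely tokens and sample one
    topk = torch.topk(logits, k)
    probs = torch.softmax(topk.values, dim=-1)
    return topk.indices[torch.multinomial(probs, 1)].item()

def passes_length_filter(summary_text, limit=200):
    # drop candidates with any sentence over the max length (coherence deterioration)
    return all(len(s.split()) <= limit for s in sent_tokenize(summary_text))

def generate_pool(next_logits, decode, context_ids, n=10, max_len=200):
    # next_logits(ids) -> vocabulary logits; decode(ids) -> text
    # (both are assumed wrappers around the fine-tuned generator)
    pool = []
    for _ in range(n):
        ids = list(context_ids)
        for _ in range(max_len):
            ids.append(sample_next_token(next_logits(ids)))
        pool.append(decode(ids))
    return [c for c in pool if passes_length_filter(c)]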

7 Experiments

Recent work in abstractive summarization has reported issues with reference-based automatic metrics such as ROUGE (Hoang et al., 2019; Sun et al., 2019; Kilickaya et al., 2017). Due to these concerns, we first discuss human evaluation in §7.1 before reporting automatic metrics in §7.2. Finally, in §7.3 we explore model-based alternatives to traditional n-gram based metrics.

7.1 Human Evaluation

Since coherence of generated text is difficult to measure automatically, we conduct human evaluations to assess how the discriminator affects generation quality. We randomly sample 34 abstracts from the AAN test set of NLP scientific papers.

Then, we present a side-by-side blind comparison of the Base-generated and Co-opNet-generated versions of the abstract, along with the gold introduction context, to one of 10 in-domain experts whose experience in the field of NLP ranges from 3 to 10 years. To reduce bias, the ordering of generated abstracts is randomized, and experts are not told that the abstracts are machine-generated. Each pairwise comparison is evaluated by three unique experts. The experts are asked to assess generation quality based on three key criteria:

• Flow → Which abstract does a better job of presenting a coherent summary that displays correct discourse properties?

• Relevance → Which abstract does a better job of summarizing the main ideas presented in the gold introduction?

• Overall → Which abstract is better overall?

Each expert casts a vote for which abstract is preferred on each of these criteria, or "No Preference" if there is no distinguishable difference between the abstracts based on a given criterion.

Results The results of this expert human evaluation (shown in Table 2) show that Co-opNet clearly improves the quality of generated abstracts over the Base transformer model. In particular, Co-opNet is selected as best in 64.2% of cases for the Flow criterion, while the relevance score is less clearly superior, indicating that improved discourse is the primary reason for the Overall preference.

Table 2: Expert human evaluation results. The base model is the transformer model that does not use the discriminator to evaluate its generations. Our model is the transformer model that uses the discriminator.

7.2 Traditional Automatic Evaluations

To match previous work on summarization, we use the ROUGE metric (Lin, 2004) for automatic evaluation. Specifically, we report ROUGE-1, ROUGE-2 and ROUGE-L F-1 scores. Table 3 shows the results of our experiments on the AAN dataset. The results of experiments on the SASS CS and Bio subsets are shown in Tables 4 and 5, respectively. Notably, the model using the discriminator (Co-opNet) outperforms the generator-only (Base) model across ROUGE metrics on all datasets, except for ROUGE-2 on the Bio and CS SASS subsets. Co-opNet also outperforms the baseline models on the SASS subsets. On the more domain-specific AAN subset, results are mixed: the PGen + Coverage baseline (See et al., 2017) achieves higher performance on ROUGE-1 and ROUGE-2, but our model is over 12% better on ROUGE-L, which is generally considered the best summarization metric. Our hypothesis is that the performance gains achieved with the inclusion of the SciBERT discriminator are caused by three main factors:

Table 3: ROUGE scores for generating abstracts for scientific paper introductions for the AAN dataset
Table 4: ROUGE scores for generating abstracts for scientific paper introductions for the CS portion of the SASS dataset
Table 5: ROUGE scores for generating abstracts for scientific paper introductions for the Bio portion of the SASS dataset

Flow Since SciBERT was fine-tuned on a large corpus of scientific papers, the model introduces external contextual information about scientific documents that enhances the ability of the summarization model to distinguish between "good" and "bad" narrative flow for in-domain text. Additionally, Table 6 shows that abstracts generated from Co-opNet achieve a ROUGE-L precision score that is 2.87 points lower than the score for abstracts generated from the base model when using the introduction as a reference. This indicates that using a discriminator leads to more abstractive summaries than the base model, as less of the summary can be pulled in the same order from the document (Hoang et al., 2019).

Table 6: ROUGE-L comparison for generated and gold AAN abstracts compared to introductions. The ROUGE-L precision measures the average extractiveness of the abstracts.
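The extractiveness measure in Table 6 can be reproduced, approximately, by scoring a generated abstract against the introduction and reading off ROUGE-L precision; a sketch using the rouge-score package (our choice, the paper does not name its ROUGE implementation):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def extractiveness(introduction, generated_abstract):
    # ROUGE-L precision against the source: the share of the abstract that
    # can be pulled, in order, from the introduction
    return scorer.score(introduction, generated_abstract)["rougeL"].precision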

Content The discriminator encourages selection of more contentful generations that include salient keyphrases and domain-specific information. Table 6 shows that abstracts generated from the base model achieve a ROUGE-L recall score that is 6.94 points lower than the score for abstracts generated from Co-opNet. This indicates that Co-opNet can pull more relevant content from the introduction compared to the base model.

Repetition The adjacency task modeled by the discriminator assigns a high likelihood to strongly correlated sentences that follow naturally from each other, but are not exact copies or paraphrases of one another. This can be attributed to the fact that adjacent sentences tend to contain related information instead of repeating information, reducing the overall repetitiveness of discriminator model generations.

Table 7: BERTScore results for AAN subset using both BERT and SciBERT as the evaluation model.

7.3 Alternatives To ROUGE

Despite our generally superior results on the ROUGE metric, past work (Schluter, 2017; Hoang et al., 2019; Sun et al., 2019) has explored the limitations of ROUGE for evaluating summarization tasks, including the issue that ROUGE measures n-gram overlap without distinguishing between salient n-grams and non-contentful tokens. This failure means that models with a higher probability of generating generic, frequent terms such as "the" and "how" can potentially outperform models that better capture conceptual information, but may paraphrase rather than extract common terms.

To overcome these limitations, we explore using contextualized evaluation to measure how well models learn to extract conceptual meaning from scientific paper introductions and generate high-quality abstracts. The BERTScore metric (Zhang et al., 2019a), which has been shown to correlate well with human judgments on other NLP tasks, measures similarity between contextual embeddings of the generated summary and the reference summary. For a generated and gold abstract, we first convert the sequences of tokens into sequences of contextual vector representations using either the uncased version of BERT-base or the SciBERT variant. Following Zhang et al. (2019a), we then compute precision, recall and F1 scores from these contextual representations.
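A sketch of this evaluation with the bert-score package (the model identifier below is a stand-in; a SciBERT checkpoint can be substituted for the SciBERT variant of the metric):

from bert_score import score

# generated, references: parallel lists of abstract strings (illustrative data)
generated = ["We present a discourse-aware summarization model for scientific papers."]
references = ["We introduce a framework for abstractive summarization with narrative flow."]

P, R, F1 = score(generated, references, model_type="bert-base-uncased")
print(P.mean().item(), R.mean().item(), F1.mean().item())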

Results Our results on the BERTScore evaluation reinforce our observations from the human and ROUGE evaluations. Co-opNet generally outperforms the PGen + Coverage baseline across all metrics, regardless of the evaluation model used (BERT, SciBERT). Results are more mixed when comparing Co-opNet to the base model with no discriminator. Overall, Co-opNet is less precise than the Base model at capturing contextual meaning, potentially because Co-opNet tends to produce longer summaries (86.4 tokens vs. 49.11 tokens for Base). This observation is supported by the fact that Co-opNet achieves the best results in terms of recall, beating the Base model by 3.98 points on BERT-R and 2.43 points on SciBERT-R.

8 How does the discriminator improve coherence?

Figure 2: Model Architecture

Abstracts generated from the non-discriminator models tend to lack completeness. While generations from these models are adept at introducing the task presented in the scientific paper, they do not provide a full summary of the paper's contents. Generations from the Base and PGen + Coverage models lack details about final results and end abruptly instead of coming to a natural conclusion. As shown in Figure 3, the Base and PGen + Coverage generations are also often over-specific (for example, mentioning the CRAFT corpus without specifying that it is a set of published scientific articles). In contrast, Co-opNet-generated summaries have a clear narrative flow that captures each key part of a coherent abstract: 1) introduction of the task, 2) methods used to address the problem, and 3) main findings and results. Despite overcoming these limitations to modeling discourse structure and coherence, our analysis reveals two key types of errors in Co-opNet generations:

Semantic Repetition Despite improvements in reducing repetition over the baselines, Figure 3 provides one example of how the model repeats information within the same sentence without exact copying, generating "blogs", then "blog posts."

Veracity Co-opNet also hallucinates inaccurate information, like the nonexistent term "bi-diag sections" or erroneous acronyms like "(TEU)." This indicates the model could still benefit from more external domain knowledge for grounding.

These findings indicate that the candidate summaries selected by the discriminator still pose an interesting challenge for future research on improving the quality of generated scientific summaries.

Figure 3: Analysis of common errors and coherence issues in generations of scientific abstracts.

[Figure 3 presents side-by-side example generations without and with the discriminator, from the Base transformer, Pointer-Generator + Coverage, and Transformer + Discriminator (Co-opNet) models, illustrating the error types discussed above.]

9 Related Work

Generation with Narrative Flow Due to the need for accurate understanding of long-distance dependencies and narrative structure, modeling coherent narrative flow has proved to be a major challenge in the field of text generation, particularly for scientific documents (Koncel-Kedziorski et al., 2019; Nikolov et al., 2018). A number of solutions have been proposed in recent years for improving coherence in text generation, including global tracking of entities and discourse-aware neural rewards (Kiddon et al., 2016; Holtzman et al., 2018; Fan et al., 2018b). In particular, the work of Cohan et al. (2018) incorporates narrative structure into Pointer-Generator networks (See et al., 2017) by using a discourse-aware attention mechanism for abstractive summarization of scientific papers. We expand upon this previous work, using the global context provided by our transformer-based cooperative generator-discriminators to better capture far-distant information useful for learning "good" narrative flow.

Neural Abstractive Summarization In the past, abstractive summarization models (Rush et al., 2015; Chopra et al., 2016; Gehrmann et al., 2018) have relied upon a seq2seq encoder-decoder architecture that follows the generation framework of Sutskever et al. (2014). Recently, new challenges in abstractive summarization, such as topic-aware and controllable summarization, have encouraged exploration of other model architectures like convolutional neural networks (CNNs) (Allamanis et al., 2016; Fan et al., 2018a; Narayan et al., 2018b) as an alternative to RNNs. Reinforcement-learning based approaches (Celikyilmaz et al., 2018; Chen and Bansal, 2018) have also enhanced the overall quality and conciseness of abstractive summaries. In addition to CNNs, transformer models have emerged as a promising architecture for text generation and achieved state-of-the-art results across several NLP tasks (Vaswani et al., 2017; Radford et al., 2019). Liu et al. (2018) proposed a decoder-only transformer model for summarization across multiple Wikipedia articles, and several later works have explored transformer-based architectures for abstractive summarization over Newswire (Khandelwal et al., 2019; Zhang et al., 2019b; Hoang et al., 2019). While our model builds upon this work, to our knowledge, it is the first transformer summarization model to explicitly model narrative flow in text.

10 Conclusion

In this work, we introduced Cooperative Generator-Discriminator Networks, a framework for more coherent natural language generation with transformer language models through the integration of a discriminator that can discern good narrative flow. We showed the limitations of current automatic metrics for evaluating summarization at capturing improvements in coherence, and proposed a new evaluation setup for summarization that takes into account contextual similarities between summaries. Through these analyses and elicited human judgments, we empirically showed that the discriminator model is better at selecting generations that are both more relevant and more narratively coherent.

A Appendices

A.1 Additional baselines for SASS benchmarks

ConvS2S For this model, we use the convolutional seq2seq architecture proposed by Gehring et al. (2017). A given source text x = (x_1, ..., x_n) and the positions of tokens within that source text p = (p_1, ..., p_n) are first encoded using embedding matrices. The embedded representations are passed to a 20-layer convolutional encoder with a kernel width of 3 and a hidden size of 512. For the convolutional decoder, we use the same hyper-parameters. During decoding, we penalize generation of UNK characters and use diverse beam search to handle commonly occurring long-sequence generation errors.

RNNExt The baseline we use for reinforcement learning is from the work of Chen and Bansal (2018). This model uses a hybrid extractive and abstractive summarization approach. First, each sentence in the source text is encoded using a temporal convolutional model, then a bi-directional LSTM is used to compute a context-aware representation. A set of salient sentences is extracted from among the encoded sentences using an LSTM pointer network. The seq2seq abstractor rewrites these sentences through compression and paraphrasing. The whole model is trained with the REINFORCE algorithm by formulating a Markov Decision Process in which the extractor is an RL agent that receives a reward each time the abstractor finishes summarizing an extracted sentence. [7]

Table 8: Example of gold and generated abstracts from the baseline Pointer-Generator + Coverage (See et al., 2017) (PGen + Coverage) and two of our proposed models, Base (top-k, k=4) and Co-opNet (top-k, k=4), on the NLP scientific domain. Coherency issues in the PGen + Coverage generated abstract, including incorrectly structured sentences and a lack of concluding details, are highlighted in red. Transitional phrases that contribute to coherent flow by properly delineating sections of abstracts are highlighted in purple.

Gold: This research is concerned with making recommendations to museum visitors based on their history within the physical environment, and textual information associated with each item in their history. (...) This study compares and analyses different methods of path prediction including an adapted naive Bayes method, document similarity, visitor feedback and measures of lexical similarity.

PGen + Coverage: In this paper we study the problem of recommendation systems for the task of visitors to an information rich environment such as a museum, are interested to convey predictions about what theme or topic a given visitor is interested in. We are interested to get the most out of their visit. As a behaviour to evaluate the relative impact of different factors in the physical space. (...)

Base: This paper proposes a novel approach to measuring the success of machine learning methods in a user's selection of a particular exhibit to be produced. An unsupervised framework is used to jointly compute the likelihood of the value of the best exhibit to be produced. (...) The experiments show that models produced by supervised methods improve user performance in selecting exhibits over unsupervised methods.

Co-opNet: This paper presents a user-centric perspective on the property of location, focusing on some relevant factors in deciding which exhibit a user intends to visit. We exploit variation and infrequency in data from the Victoria/Ausland Courtyard (VCA) Collection (...) We make three contributions: (1) Our experimental system provides empirical evidence for the effectiveness of supervised learning techniques in predicting a user's audience behavior; (2) Our structure based method allows unsupervised learning to be applied to multiple sets of related information. (3) Our experimental system uses unsupervised model adaptation in a supervised setting.

A.2 Additional SASS Tasks

We report summarization baseline results on the full SASS dataset for the Abstract → Title, Introduction → Title and Introduction → Abstract tasks.

Table 9: Abstract to Title (ROUGE-1 / ROUGE-2 / ROUGE-L)
PGen + Coverage (See et al., 2017): 0.4053 / 0.2180 / 0.3653
ConvS2S (Gehring et al., 2017): 0.3421 / 0.1976 / 0.3207

Table 10: Intro to Title (ROUGE-1 / ROUGE-2 / ROUGE-L)
PGen + Coverage (See et al., 2017): 0.3189 / 0.1444 / 0.2886
ConvS2S (Gehring et al., 2017): 0.3226 / 0.1794 / 0.3032

Table 11: Intro to Abstract (ROUGE-1 / ROUGE-2 / ROUGE-L)
PGen + Coverage (See et al., 2017): 0.3365 / 0.1083 / 0.2933
ConvS2S (Gehring et al., 2017): 0.2857 / 0.0965 / 0.2451
RNNExt (Chen and Bansal, 2018): 0.4000 / 0.1253 / 0.3683

[7] We only use this baseline for the intro-to-abstract task due to the fact that the convolutional encoder only accommodates multi-sentence summaries.

[1] https://arxiv.org
[2] https://www.aclweb.org/anthology
[3] https://arxiv.org/corr
[4] https://arxiv.org/archive/q-bio
[5, 6] https://github.com/huggingface/pytorch-pretrained-BERT

Table 12: BERTScore results for AAN subset
Table 13: Analysis of how often the discriminator changes the sequence selected as the top candidate.

References

  • Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D Trippe, Juan B Gutierrez, and Krys Kochut. 2017. Text summarization techniques: a brief survey. arXiv preprint arXiv:1707.02268.
  • Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, pages 2091-2100.
  • Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. ArXiv, abs/1903.10676.
  • Antoine Bosselut, Asli Çelikyilmaz, Xiaodong He, Jianfeng Gao, Po-Sen Huang, and Yejin Choi. 2018. Discourse-aware neural rewards for coherent text generation. In NAACL-HLT.
  • Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In NAACL-HLT.
  • Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL.
  • Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In HLT-NAACL.
  • Elizabeth Clark, Asli Celikyilmaz, and Noah A. Smith. 2019. Sentence mover's similarity: Automatic evaluation for multi-sentence texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. arXiv preprint arXiv:1804.05685.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Angela Fan, David Grangier, and Michael Auli. 2018a. Controllable abstractive summarization. ACL 2018, page 45.
  • Angela Fan, Mike Lewis, and Yann Dauphin. 2018b. Hierarchical neural story generation. In ACL.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proc. of ICML.
  • Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. ArXiv, abs/1808.10792.
  • Max Grusky, Mor Naaman, and Yoav Artzi. 2019. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In NAACL.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693-1701.
  • Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, and Yejin Choi. 2019. Efficient adaptation of pretrained transformers for abstractive summarization. ArXiv, abs/1906.00138.
  • Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638-1649.
  • Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. 2019. Sample efficient text summarization using a single pre-trained transformer. ArXiv, abs/1905.08836.
  • Chloé Kiddon, Luke S. Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In EMNLP.
  • Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. 2017. Re-evaluating automatic metrics for image captioning. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 199-209.
  • Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text generation from knowledge graphs with graph transformers. ArXiv, abs/1904.02342.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.
  • Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. ArXiv, abs/1801.10198.
  • Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP '02, pages 63-70, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018a. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP.
  • Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018b. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. ArXiv, abs/1808.08745.
  • Ani Nenkova and Kathleen McKeown. 2012. A survey of text summarization techniques. In Mining Text Data.
  • Nikola I. Nikolov, Michael Pfeiffer, and Richard H. R. Hahnloser. 2018. Data-driven summarization of scientific articles. ArXiv, abs/1804.08875.
  • Dragomir R. Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. 2009. The ACL Anthology Network corpus. Language Resources and Evaluation, 47:919-944.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. ArXiv, abs/1509.00685.
  • Natalie Schluter. 2017. The limits of automatic summarisation according to ROUGE. In EACL.
  • Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073-1083. Association for Computational Linguistics.
  • Simeng Sun, Ori Shapira, Ido Dagan, and Ani Nenkova. 2019. How to compare summarizers without target length? Pitfalls, solutions and re-examination of the neural summarization literature.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019a. BERTScore: Evaluating text generation with BERT. ArXiv, abs/1904.09675.
  • Xingxing Zhang, Furu Wei, and Ming Zhou. 2019b. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. ArXiv, abs/1905.06566.