Cooperative Generator-Discriminator Networks for Abstractive Summarization with Narrative Flow
Authors
Abstract
We introduce Cooperative Generator-Discriminator Networks (Co-opNet), a general framework for abstractive summarization with distinct modeling of the narrative flow in the output summary. Most current approaches to abstractive summarization, in contrast, are based on datasets whose target summaries are either a single sentence, or a bag of standalone sentences (e.g., extracted highlights of a story), neither of which allows for learning coherent narrative flow in the output summaries. To promote research toward abstractive summarization with narrative flow, we first introduce a new dataset, Scientific Abstract SummarieS (SASS), where the abstracts are used as proxy gold summaries for scientific articles. We then propose Co-opNet, a novel transformer-based framework where the generator works with the discourse discriminator to compose a long-form summary. Empirical results demonstrate that Co-opNet learns to summarize with considerably improved global coherence compared to competitive baselines.
1 Introduction
We study the task of abstractive summarization with narrative flow: given an input document, the distinct goal is to generate a paragraph-length abstractive summary that has a proper narrative flow. Our study contrasts with most previous work, which focused on either extractive document-level summarization (Nenkova and McKeown, 2012; Allahyari et al., 2017) or abstractive sentence-level summarization (Rush et al., 2015; Grusky et al., 2019; Narayan et al., 2018a), where maintaining a good narrative flow in the output summary was not within the scope of the task definition.
Figure 1: Structure for introduction → abstract scientific paper generation, showing the discourse-aware transformer architecture and summary generation with a discourse-aware discriminator that models inter-sentence narrative flow.
Our study also contrasts with recent work on abstractive summarization at the document level. Importantly, most such studies could not directly model or evaluate abstractive summarization with narrative flow in the output summary. This is largely due to the inherent limitations of the existing datasets: the reference summaries available in most commonly used large-scale datasets, such as the CNN/DailyMail dataset (Hermann et al., 2015), are mainly the headlines of the news articles or stories, which are often sets of disconnected sentences. These summaries neither provide the inductive bias for models to learn the desired narrative flow in output summaries, nor enable researchers to measure the quality of narrative flow using reference summaries (Chen and Bansal, 2018). This lack of proper inductive bias also implies that the learned models often exhibit an extractive tendency: they do not learn the abstractive generation capability necessary to achieve coherent narrative flow in the output summary (Hoang et al., 2019).
Accordingly, we present a new large-scale dataset, Scientific Abstract SummarieS (SASS), to promote research toward abstractive summarization with narrative flow. SASS provides over 700k samples to support three distinct abstractive summarization task formulations: (1) intro-to-abstract, (2) abstract-to-title, and (3) intro-to-title. The first task formulation, intro-to-abstract, provides unique opportunities to study summaries with narrative flow, as abstracts in scientific papers are structured with a highly coherent discourse flow. In addition, scientific paper abstracts maintain loose, and therefore abstractive, alignments with respect to the introduction (Figure 1), which challenges current models to learn to abstract rather than extract.
We then introduce Cooperative Generator-Discriminator Networks (Co-opNet), a new modeling framework for abstractive summarization with distinct modeling of the narrative flow in the output summary. In this framework, the generator, based on transformer language models that are fine-tuned for abstractive summarization, proposes a pool of candidate summaries. Then, the discriminator, also based on transformers, selects the summary that has the best narrative flow across adjacent sentences. Our work presents the first study that adapts the learning-to-write framework of Holtzman et al. (2018) , originally proposed for open-ended text generation, to abstractive summarization, along with comprehensive performance reports on the new SASS dataset.
Comprehensive empirical results demonstrate that Co-opNet learns to summarize with a considerably improved global coherence compared to competitive baselines. Based on the recently proposed BERTScore (Zhang et al., 2019a) , in particular, our model outperforms all other models by 3.98 points on BERTScore-R. In addition, human judgments demonstrate that domain experts prefer Co-opNet over the base summarization model in over 64% of cases.
The rest of the paper is organized as follows. We first describe the generator network and the discriminator network in §2 and §3 respectively.
We then describe how we combine the two into a cooperative generator-discriminator network in §4. We introduce our new dataset in §5, followed by the experimental setup and empirical results in §6 and §7. We analyze how the discriminator improves coherence in §8, discuss related work in §9, and conclude in §10.
2 Generator Networks
We use the transformer architecture of Radford et al. (2019) as our generator's architecture. Following the work of Liu et al. (2018), we adapt a language model to the task of abstractive summarization by concatenating the article $a$, a delimiter token $d$, and the summary $s$ into one fixed-length input

$$x = (a_1, \ldots, a_{|a|}, d, s_1, \ldots, s_{|s|}),$$

where $|a|$ is the length of the gold article and $|s|$ is the length of the gold summary. The transformer has the same block architecture as the model of Radford et al. (2019) and consists of $L$ attention blocks, each composed of self-attention and feed-forward layers. At each time step $i$, the model produces an output probability distribution over the vocabulary for the next token $w_i$ given all previous output tokens $w_{<i}$:
$$h^0_j = W_e w_j + p_j \qquad (1)$$
$$h^l_j = \mathrm{block}\big(\{h^{l-1}\}_{\le j}\big), \quad l \in [1, L] \qquad (2)$$
where $W_e$ is a word embedding matrix, $p_j$ is the position embedding, $h^0_j$ is the initial representation, and $\{h^{l-1}\}_{\le j}$ is the set of all preceding layer-block outputs for tokens up to $w_j$. Finally, for the current position $i$ in the sequence:
$$p(w_i \mid w_1, \ldots, w_{i-1}) = \mathrm{softmax}\big(h^L_i W_e^{\top}\big) \qquad (3)$$
where $W_e$ is the same word embedding matrix as in Equation 1 and $h^L_i$ is the final-layer transformer block output at position $i$.
The model is trained to minimize the negative log-likelihood of the next word $w_i$ given all preceding words:

$$\mathcal{L}_{gen} = -\sum_{i=1}^{|a|+|s|} \log p(w_i \mid w_1, \ldots, w_{i-1}) \qquad (4)$$

where $w_i$ is the $i$-th token of $x$. At test time, $x$ consists only of the gold article and delimiter token $(a_1, \ldots, a_{|a|}, d)$, and we decode generated summaries $g$ starting from this input.
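To make this concrete, the following is a minimal sketch of the input construction and the objective in Equation 4 around a generic autoregressive transformer language model in PyTorch; the `lm` interface, the delimiter id, and the truncation lengths are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch (assumed interfaces): fine-tuning an autoregressive LM on
# article -> summary pairs with the concatenated-input format of Section 2.
import torch
import torch.nn.functional as F

def build_input(article_ids, summary_ids, delim_id, max_article=800, max_summary=200):
    """x = (a_1..a_|a|, d, s_1..s_|s|); truncation lengths mirror Section 6.2."""
    a = article_ids[:max_article]
    s = summary_ids[:max_summary]
    return torch.tensor(a + [delim_id] + s)

def generator_loss(lm, x):
    """L_gen = -sum_i log p(w_i | w_<i)  (Equation 4).

    `lm` is a placeholder for any model mapping a (1, seq_len) tensor of token
    ids to logits of shape (1, seq_len, vocab); it is not a specific library API.
    """
    logits = lm(x[:-1].unsqueeze(0)).squeeze(0)   # predict token i from the prefix < i
    return F.cross_entropy(logits, x[1:], reduction="sum")
```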
3 Discriminator Networks
Because the autoregressive nature of the generator makes it unlikely to achieve narrative flow across long time horizons, we incorporate a discriminator model into the decoding process. Due to the difficulty of explicitly defining discourse properties to discriminate between generations with good and bad narrative flow, we rely on a parametrized scoring function to approximate this discourse property by scoring whether pairs of adjacent sentences are consistent with one another.
3.1 Discriminator Architecture
Sentence Pair Representation To model the likelihood of adjacency between two sentences $s_u$ and $s_v$ of length $|u|$ and $|v|$ respectively, we first compute a hidden representation of the sentence pair using BERT (Devlin et al., 2019). This representation allows us to better capture the fine-grained contextual information necessary for understanding the relationship between $s_u$ and $s_v$. The initial input to the encoder is the concatenation of these sentences:

$$s = \mathrm{[CLS]} + s_u + \mathrm{[SEP]} + s_v + \mathrm{[SEP]},$$

where [CLS] is a special token associated with the task and [SEP] is a sentence delimiter token. As in Devlin et al. (2019), each word in the sequence is encoded by a word embedding $w_i$ and a positional embedding $p_i$ that identifies its position in $s$. Additionally, we use a learned segmentation embedding $q_i$ for each token in $s$ to indicate which of the two sentences, $s_u$ or $s_v$, the word belongs to. Therefore, our full input to each of the encoder layers is $E = (e_1, \ldots, e_{|u|+|v|+3})$, where $e_i = w_i + p_i + q_i$. We encode these representations using BERT (Devlin et al., 2019), and the final contextual representation used for adjacency prediction is the pooled hidden representation of the [CLS] token, which we denote $h_{cls}$.
Adjacency Classification Once the sentence pair $s = (s_u, s_v)$ has been encoded as the hidden representation $h_{cls}$, we obtain the probability of adjacency between the pair of sentences $s_u$ and $s_v$ by a linear projection $W_{disc}$ followed by a log-softmax:
$$P_{adj}(s) = \mathrm{softmax}\big(W_{disc}\, h_{cls}\big) \qquad (5)$$
In § 4, we describe how the adjacency scores are used to re-rank candidate summaries.
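As an illustration, a minimal sketch of this adjacency classifier is given below; the `encoder` stands in for a BERT-style model returning the pooled [CLS] vector, and its interface and hidden size are assumptions rather than the exact Huggingface API used in our implementation.

```python
# Minimal sketch of the adjacency scorer (Section 3.1). `encoder` is a
# placeholder for a BERT-style model returning the pooled [CLS] vector.
import torch
import torch.nn as nn

class AdjacencyDiscriminator(nn.Module):
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder                   # returns h_cls of shape (batch, hidden)
        self.w_disc = nn.Linear(hidden_size, 2)  # linear projection W_disc

    def forward(self, token_ids, segment_ids):
        h_cls = self.encoder(token_ids, segment_ids)            # pooled [CLS] representation
        log_probs = torch.log_softmax(self.w_disc(h_cls), -1)   # Equation 5 (log-softmax)
        return log_probs                                        # [:, 1] = log P_adj(s_u, s_v)
```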
Sentence Selection For Discriminator Models
To train a discriminator model, we use a subset of adversarial and positive sentence-pair examples extracted from the training set. The sentence pairs are extracted from gold abstracts containing at least five sentences using the following approach: for a randomly selected sentence $s_u$ from the abstract, we randomly select an adjacent sentence, $s_{u-1}$ or $s_{u+1}$, as a positive example, and any non-adjacent sentence $s_v$ with $v \notin [u-1, u+1]$ as a negative example.
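A small sketch of this sampling procedure, assuming `abstract` is a list of already tokenized sentences with at least five entries:

```python
# Sketch of positive/negative sentence-pair sampling for discriminator training.
import random

def sample_pairs(abstract):
    """Return one positive (adjacent) and one negative (non-adjacent) pair."""
    n = len(abstract)
    u = random.randrange(n)
    # positive example: a randomly chosen neighbour of s_u
    neighbours = [i for i in (u - 1, u + 1) if 0 <= i < n]
    pos = (abstract[u], abstract[random.choice(neighbours)])
    # negative example: any sentence outside [u-1, u+1]
    candidates = [i for i in range(n) if abs(i - u) > 1]
    neg = (abstract[u], abstract[random.choice(candidates)])
    return pos, neg
```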
Adjacency Learning Objective
We define the training objective for the adjacency discriminator as minimizing the negative log-likelihood of predicting whether two sentences are adjacent or not:

$$\mathcal{L}_{disc} = -\big[\mathrm{ADJ}(s)\log P_{adj}(s) + (1 - \mathrm{ADJ}(s))\log(1 - P_{adj}(s))\big] \qquad (6)$$

where $\mathrm{ADJ}(s)$ is an indicator function for whether the two sentences in $s$ are adjacent and $P_{adj}(s)$ is the discriminator's estimate of adjacency from Equation 5.
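In code, Equation 6 amounts to a two-class negative log-likelihood over the classifier's log-softmax output; a minimal sketch under the same placeholder interfaces as above:

```python
# Sketch of Equation 6: negative log-likelihood of the adjacency label, where
# adj_labels is 1 for adjacent pairs and 0 for non-adjacent pairs.
import torch.nn.functional as F

def discriminator_loss(log_probs, adj_labels):
    # log_probs: (batch, 2) log-softmax output of the adjacency classifier
    # adj_labels: (batch,) tensor of 0/1 labels
    return F.nll_loss(log_probs, adj_labels, reduction="sum")
```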
4 Cooperative Generation
The discriminator reranks candidate summary generations based on the overall likelihood of sentences within the summary being adjacent. More specifically, for each candidate generated summary $g$ and sentence pair $s = (s_u, s_v)$ contained in $g$, we define the likelihood of sentence adjacency as the score that the discriminator assigns to that sentence pair (Equation 5). By defining discourse in this way, we place minimal restrictions on the model's ability to home in on patterns pertaining to desirable narrative flow.
To incorporate this objective into our summarization framework, we use a modified decoding objective function. First, we generate a pool of candidate summaries from the base summarization model (§2) using an arbitrary decoding strategy (e.g., beam search, nucleus sampling, top-k sampling). Then, the discriminator is used to re-rank these candidates. Specifically, for each candidate summary $g$ of length $|g|$ tokens with $S$ sentences, we maximize both the language model probability of each token $p(w_i \mid w_1, \ldots, w_{i-1})$ and the adjacency score of each pair of consecutive sentences in $g$:

$$p(g) = \delta_{gen} \sum_{i=1}^{|g|} p(w_i \mid w_1, \ldots, w_{i-1}) + \delta_{disc} \sum_{u=2}^{S} P_{adj}(s_u, s_{u-1}) \qquad (7)$$

where $\delta_{gen}$ and $\delta_{disc}$ are hyper-parameters controlling the contribution of the generator and the discourse discriminator to the final predicted summary. The score $P_{adj}(s_u, s_{u-1})$ is the adjacency score computed by the discourse discriminator model (Equation 5).
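The re-ranking step then reduces to scoring each candidate with Equation 7 and keeping the argmax. A minimal sketch, where `token_scores` are the generator's per-token scores for a candidate and `p_adj` is the discriminator score from Equation 5 (both placeholders for the actual model calls):

```python
# Sketch of the cooperative re-ranking objective (Equation 7).
def coop_score(token_scores, sentences, p_adj, delta_gen=0.5, delta_disc=0.5):
    """token_scores: per-token generator scores for candidate summary g;
    sentences: g split into sentences; p_adj(s_u, s_prev) -> adjacency score."""
    gen_term = sum(token_scores)
    disc_term = sum(p_adj(sentences[u], sentences[u - 1])
                    for u in range(1, len(sentences)))
    return delta_gen * gen_term + delta_disc * disc_term

def rerank(candidates, p_adj):
    """candidates: list of (token_scores, sentences) tuples; return the best one."""
    return max(candidates, key=lambda c: coop_score(c[0], c[1], p_adj))
```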
5 Data
In the following section, we introduce SASS, a dataset of over 700K introduction-abstract pairs from arXiv 1. We also describe AAN, an existing dataset of NLP papers published at top venues 2. These datasets require greater discourse structure and abstractiveness than previous summarization corpora. We specifically focus on scientific papers incorporating a wide range of domain knowledge and subjects in order to rigorously test the generalization of our models across different fields of scientific endeavour.
5.1 Datasets
SASS The dataset is crawled from arxiv.org and contains over 700k introduction-abstract pairs from scientific articles. In our experiments we primarily focus on the CS 3 and Bio 4 domain subsets. The task in SASS presents a challenge to existing summarization models, since it requires models to learn relevant domain knowledge for the scientific domain of interest, as well as to recognize common discourse structure for papers written in that domain. While the work in this paper focuses mainly on Introduction → Abstract summarization, we also extracted text for two other summarization tasks (Introduction → Title and Abstract → Title) and provide results for those tasks in Appendix A.
AAN In addition to the SASS dataset, we include an existing dataset of scientific articles focused on papers in the NLP domain. This dataset consists of a 12k-paper subset of the ACL Anthology Network (AAN) (Radev et al., 2009), obtained after removing articles without abstracts and removing duplicates. As with SASS, we define the task for AAN as the generation of abstracts from introductions.
5.2 Narrative Flow Analysis
Since the focus of this work is on generating summaries with more coherent narrative flow, we concentrate on datasets requiring narrative structure to generate good summaries. Particular attributes of these datasets that connect to discourse structure are:
• Length of summaries → Are the summaries long enough to clearly show narrative flow properties?
• Abstractiveness of gold summaries → Do the summaries exhibit particular sentence-level flow, or are the summary sentences extracted highlights from the context?
As can be seen in Table 1, SASS and AAN have properties missing from existing summarization datasets based on newswire data, such as XSum (Narayan et al., 2018a) and Newsroom (Grusky et al., 2019).
6 Experimental Setup
In this section we outline comparison baselines and describe experimental setups for our generator and discriminator models.
6.1 Baselines
We train a 2-layer bi-LSTM sequence-to-sequence model with attention. The bi-LSTM is used to encode a given source article a and a separate decoder LSTM produces the generated summary g. At each decoding time step, the decoder attends to all the context vectors produced by the encoder as well as the maintained state from the previous decoder tokens to produce the next token in the summary. We also implement a Pointer-Generator (PGEN + Coverage) model (See et al., 2017) that extends the base LSTM model (LSTM + Coverage) to allow tokens to be copied from the input during generation. Baselines are trained for up to 40000 steps with a batch size of 16. Following previous work, we decode from these baselines using beam search with a beam size of 4. We set a maximum decoding length of 200 tokens. The RNN baselines additionally have a minimum decoding length of 35 tokens as in See et al. (2017) .
6.2 Generator Model
Input We perform word- and sentence-level tokenization using spaCy and NLTK (Loper and Bird, 2002). Because of the fixed input size of the transformer language model (Radford et al., 2019), the input context is truncated to a maximum of 800 tokens and summaries are truncated to a maximum of 200 tokens.
Implementation All our models are implemented in PyTorch. Our code is adapted from the Huggingface implementation of the OpenAI 117M-parameter GPT-2 language model 5 and uses the pre-trained weights of Radford et al. (2019).
Training We use a learning rate of 2e-5 for fine-tuning. We use a batch size of 1 with gradient accumulation to simulate a batch size of 16. On the AAN subset, we train the base summarization transformer model for 11 epochs. On the SASS CS and Bio subsets, we train the base summarization transformer model for 8 epochs. All experiments are run on a Titan-X GPU. Training time for the AAN and SASS Bio datasets is about 30 minutes per epoch. Training time for the SASS CS dataset is 2.5 hours per epoch.
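For reference, simulating a batch size of 16 from single-example batches is standard gradient accumulation; the sketch below uses placeholder `model`, `optimizer`, `batches`, and `loss_fn` objects (e.g., the $\mathcal{L}_{gen}$ sketch from §2) and is illustrative rather than our exact training loop.

```python
# Illustrative sketch of gradient accumulation: per-step batch size of 1,
# effective batch size of 16 (Section 6.2).
ACCUM_STEPS = 16

def train_epoch(model, optimizer, batches, loss_fn):
    optimizer.zero_grad()
    for step, x in enumerate(batches, start=1):
        loss = loss_fn(model, x) / ACCUM_STEPS  # scale so accumulated gradients average
        loss.backward()                         # accumulate gradients across examples
        if step % ACCUM_STEPS == 0:
            optimizer.step()                    # one optimizer step per 16 examples
            optimizer.zero_grad()
```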
6.3 Discriminator Model
Input We use a maximum sentence length of 200 tokens to accommodate the fixed input size of BERT (512 tokens), reduce inference time, and discourage the model from generating abnormally long run-on sentences that indicate the presence of coherence issues.
Implementation The discriminator models are adapted from the Huggingface implementation of the BERT next-sentence-prediction classifier 6. We initialize the 12-layer BERT-base model with the pre-trained weights of the SciBERT-uncased model, which was originally trained on 1.14 million scientific papers (Beltagy et al., 2019).
Training We fine-tune the discriminator using a learning rate of 2e-5 and batch size of 2. We train two discriminators: one is trained on AAN for decoding both SASS CS and AAN, while the other discriminator is trained on SASS Bio and used exclusively for decoding that subset. All discriminator models are trained for 17 epochs on a Titan-X GPU over a single day.
6.4 Generation Hyperparameters
During inference time, we use top-k sampling with k=4 (Fan et al., 2018b) to generate 10 candidate summaries for each model. In the re-ranking objective (Equation 7), we weigh the generation and discriminator models equally for all experiments by setting $\delta_{gen} = \delta_{disc} = 0.5$, and we select the candidate out of the 10 that achieves the highest joint score in Equation 7 as the final summary. We filter candidate summaries from the hypothesis generation pool that contain sentences longer than a fixed maximum length of 200 tokens, a clear sign of coherence deterioration.
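The decoding and filtering described here amount to repeated top-k sampling followed by a per-sentence length check; the sketch below assumes a `lm_next_logits` function returning next-token logits for a prefix and a `split_sentences` helper, both hypothetical stand-ins for the actual implementation.

```python
# Sketch of candidate generation and filtering (Section 6.4): top-k sampling
# with k=4 to draw 10 candidates, dropping any candidate with a run-on sentence.
import torch

def sample_topk(lm_next_logits, prefix, k=4, max_len=200, eos_id=None):
    out = list(prefix)
    for _ in range(max_len):
        logits = lm_next_logits(out)                     # (vocab,) next-token logits
        topk_vals, topk_idx = torch.topk(logits, k)      # restrict to the k best tokens
        probs = torch.softmax(topk_vals, dim=-1)
        next_id = topk_idx[torch.multinomial(probs, 1)].item()
        out.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return out[len(prefix):]

def candidate_pool(lm_next_logits, prefix, split_sentences, n=10, max_sent_len=200):
    pool = [sample_topk(lm_next_logits, prefix) for _ in range(n)]
    # filter candidates showing clear coherence deterioration (run-on sentences)
    return [g for g in pool if all(len(s) <= max_sent_len for s in split_sentences(g))]
```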
7 Experiments
Recent work in abstractive summarization has reported issues with reference-based automatic metrics such as ROUGE (Hoang et al., 2019; Sun et al., 2019; Kilickaya et al., 2017). Due to these concerns, we first discuss human evaluation in §7.1 before reporting automatic metrics in §7.2. Finally, in §7.3 we explore model-based alternatives to traditional n-gram-based metrics.
7.1 Human Evaluation
Since coherence of generated text is difficult to measure automatically, we conduct human evaluations to evaluate how the discriminator affects generation quality. We randomly sample 34 abstracts from the AAN test set of NLP scientific papers.
Then, we present a side-by-side blind comparison of the Base-generated version of the abstract and Co-opNet-generated version along with the gold introduction context to 1 of 10 in-domain experts whose experience in the field of NLP ranges from 3 to 10 years. To reduce bias, the ordering of generated abstracts is also randomized and experts are not told that the abstracts are machine-generated. Each pairwise comparison is evaluated by three unique experts. The experts are asked to assess generation quality based on three key criteria:
• Flow → Which abstract does a better job of presenting a coherent summary that displays correct discourse properties?
• Relevance → Which abstract does a better job of summarizing the main ideas presented in the gold introduction?
• Overall → Which abstract is better overall?
Each expert casts a vote for which abstract is preferred on each of these criteria, or "No Preference" if there is no distinguishable difference between the abstracts on a given criterion.
Results The results of this expert human evaluation (shown in Table 2) indicate that Co-opNet clearly improves the quality of generated abstracts over the Base transformer model. In particular, Co-opNet is selected as best in 64.2% of cases for the Flow criterion, while the relevance score is less clearly superior, indicating that improved discourse is the primary reason for the Overall preference.
7.2 Traditional Automatic Evaluations
To match previous work on summarization, we use the ROUGE metric (Lin, 2004) for automatic evaluation. Specifically, we report ROUGE-1, ROUGE-2, and ROUGE-L F-1 scores. Table 3 shows the results of our experiments on the AAN dataset. The results of experiments on the SASS CS and Bio subsets are shown in Tables 4 and 5, respectively. Notably, the model using the discriminator (Co-opNet) outperforms the generator-only (Base) model across ROUGE metrics on all datasets except for ROUGE-2 on the Bio and CS SASS subsets. Co-opNet also outperforms the baseline models on the SASS subsets. On the more domain-specific AAN subset, results are mixed: the PGen + Coverage baseline (See et al., 2017) achieves higher performance on ROUGE-1 and ROUGE-2, but our model is over 12% better on ROUGE-L, which is often considered the most informative ROUGE metric for summarization. We hypothesize that the performance gains achieved with the inclusion of the SciBERT discriminator are caused by three main factors:
Flow Since SciBERT was pretrained on a large corpus of scientific papers, the model introduces external contextual information about scientific documents that enhances the summarization model's ability to distinguish between "good" and "bad" narrative flow for in-domain text. Additionally, Table 6 shows that abstracts generated from Co-opNet achieve a ROUGE-L precision score that is 2.87 points lower than the score for abstracts generated from the base model when using the introduction as the reference. This indicates that using a discriminator leads to more abstractive summaries than the base model, as less of the summary can be pulled in the same order from the document (Hoang et al., 2019).
Content The discriminator encourages selection of more contentful generations that include salient keyphrases and domain-specific information. Table 6 shows that abstracts generated from the base model achieve a ROUGE-L recall score that is 6.94 points lower than the score for abstracts generated from Co-opNet. This indicates that Co-opNet can pull more relevant content from the introduction compared to the base model.
Repetition The adjacency task modeled by the discriminator assigns a high likelihood to strongly correlated sentences that follow naturally from each other, but are not exact copies or paraphrases of one another. This can be attributed to the fact that adjacent sentences tend to contain related information rather than repeating it, reducing the overall repetitiveness of discriminator model generations.
Table 7: BERTScore results for the AAN subset using both BERT and SciBERT as the evaluation model.
7.3 Alternatives To ROUGE
Despite our generally superior results on the ROUGE metric, past work (Schluter, 2017; Hoang et al., 2019; Sun et al., 2019) has explored the limitations of ROUGE for evaluating summarization tasks, including the issue that ROUGE measures n-gram overlap without distinguishing between salient n-grams and non-contentful tokens. This failure means that models with a higher probability of generating generic, frequent terms such as "the" and "how" can potentially outperform models that better capture conceptual information, but may paraphrase rather than extract common terms.
To overcome these limitations, we explore using contextualized evaluation to measure how well models learn to extract conceptual meaning from scientific paper introductions and generate high-quality abstracts. The BERTScore metric (Zhang et al., 2019a), which has been shown to correlate well with human judgments on other NLP tasks, measures similarity between contextual embeddings of the generated summary and the reference summary. For a generated and gold abstract, we first convert the sequences of tokens into sequences of contextual vector representations using either the uncased version of BERT-base or the SciBERT variant. Following Zhang et al. (2019a), we then compute precision, recall, and F1 scores from these contextual representations.
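At its core, BERTScore performs greedy cosine-similarity matching between contextual token embeddings; the sketch below operates on precomputed embedding matrices (the embedding step with BERT-base or SciBERT is assumed but not shown) and omits optional components such as IDF weighting.

```python
# Simplified sketch of BERTScore (Zhang et al., 2019a) from contextual token
# embeddings: greedy cosine matching between candidate and reference tokens.
import torch

def bert_score(candidate_emb, reference_emb):
    """candidate_emb, reference_emb: (num_tokens, hidden) contextual embeddings."""
    c = torch.nn.functional.normalize(candidate_emb, dim=-1)
    r = torch.nn.functional.normalize(reference_emb, dim=-1)
    sim = c @ r.T                                 # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()      # each candidate token -> best reference match
    recall = sim.max(dim=0).values.mean()         # each reference token -> best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()
```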
Results Our results on the BERTScore evaluation reinforce our observations from the human and ROUGE evaluations. Co-opNet generally outperforms the PGen + Coverage baseline across all metrics, regardless of the evaluation model used (BERT or SciBERT). Results are more mixed when comparing Co-opNet to the base model with no discriminator. Overall, Co-opNet is less precise than the Base model at capturing contextual meaning, potentially because Co-opNet tends to produce longer summaries (86.4 tokens vs. 49.11 tokens for Base). This observation is supported by the fact that Co-opNet achieves the best results in terms of recall, beating the Base model by 3.98 points on BERT-R and 2.43 points on SciBERT-R.
8 How does the discriminator improve coherence?
Abstracts generated from the non-discriminator models tend to lack completeness. While generations from these models are adept at introducing the task presented in the scientific paper, they do not provide a full summary of the paper's contents. Generations from the Base and PGen + Coverage models lack details about final results and end abruptly instead of coming to a natural conclusion. As shown in Figure 3, the Base and PGen + Coverage generations are also often over-specific (for example, mentioning the CRAFT corpus without specifying that it is a set of published scientific articles). In contrast, Co-opNet-generated summaries have a clear narrative flow that captures each key part of a coherent abstract: 1) introduction of the task, 2) methods used to address the problem, and 3) main findings and results. Despite overcoming these limitations in modeling discourse structure and coherence, our analysis reveals two key types of errors in Co-opNet generations:
Semantic Repetition Despite improvements in reducing repetition over baselines, Figure 3 provides one example of how the model repeats information within the same sentence without exact copying, generating "blogs" and then "blog posts."
Veracity Co-opNet also hallucinates inaccurate information, such as the nonexistent term "bi-diag sections" or erroneous acronyms like "(TEU)." This indicates the model could still benefit from more external domain knowledge for grounding. These findings indicate that the candidate summaries selected by the discriminator still pose an interesting challenge for future research on improving the quality of generated scientific summaries.
Figure 3: Example abstract generations without and with the discriminator, comparing the Base Transformer, Pointer-Generator + Coverage, and Transformer + Discriminator (Co-opNet) models on a paper about annotating rhetorical roles of sentences in the CRAFT corpus.
9 Related Work
Generation with Narrative Flow Due to the need for accurate understanding of long-distance dependencies and narrative structure, modeling coherent narrative flow has proved to be a major challenge in text generation, particularly for scientific documents (Koncel-Kedziorski et al., 2019; Nikolov et al., 2018). A number of solutions have been proposed in recent years for improving coherence in text generation, including global tracking of entities and discourse-aware neural rewards (Kiddon et al., 2016; Holtzman et al., 2018; Fan et al., 2018b). In particular, the work of Cohan et al. incorporates narrative structure into Pointer-Generator networks (See et al., 2017) by using a discourse-aware attention mechanism for abstractive summarization of scientific papers. We expand upon this previous work, using the global context provided by our transformer-based Cooperative Generator-Discriminator to better capture long-distance information useful for learning "good" narrative flow.
Neural Abstractive Summarization In the past, abstractive summarization models (Rush et al., 2015; Chopra et al., 2016; Gehrmann et al., 2018) have relied upon a seq2seq encoder-decoder architecture following the generation framework of Sutskever et al. (2014). Recently, new challenges in abstractive summarization, such as topic-aware and controllable summarization, have encouraged exploration of other model architectures like convolutional neural networks (CNNs) (Allamanis et al., 2016; Fan et al., 2018a; Narayan et al., 2018b) as an alternative to RNNs. Reinforcement-learning-based approaches (Celikyilmaz et al., 2018; Chen and Bansal, 2018) have also enhanced the overall quality and conciseness of abstractive summaries. In addition to CNNs, transformer models have emerged as a promising architecture for text generation and achieved state-of-the-art results across several NLP tasks (Vaswani et al., 2017; Radford et al., 2019). Liu et al. proposed a decoder-only transformer model for summarization across multiple Wikipedia articles, and several later works have explored transformer-based architectures for abstractive summarization over newswire data (Khandelwal et al., 2019; Zhang et al., 2019b; Hoang et al., 2019). While our model builds upon this work, to our knowledge it is the first transformer summarization model to explicitly model narrative flow in text.
10 Conclusion
In this work, we introduced Cooperative Generator-Discriminator Networks (Co-opNet), a framework for more coherent natural language generation with transformer language models through the integration of a discriminator that can discern good narrative flow. We showed the limitations of current automatic metrics for evaluating summarization at capturing improvements in coherence, and proposed a new evaluation setup for summarization that takes into account contextual similarities between summaries. Through these analyses and elicited human judgments, we empirically showed that the discriminator model is better at selecting generations that are both more relevant and more narratively coherent.
Figure 2: Example abstracts generated by the Pointer-Generator + Coverage model (See et al., 2017) (PGen + Coverage) and two of our proposed models, Base (top-k, k=4) and Co-opNet (top-k, k=4), on the NLP scientific domain, shown alongside the gold abstract. Coherency issues in the PGen + Coverage generated abstract, including incorrectly structured sentences and lack of concluding details, are highlighted in red. Transitional phrases that contribute to coherent flow by properly delineating sections of abstracts are highlighted in purple.

Gold: This research is concerned with making recommendations to museum visitors based on their history within the physical environment, and textual information associated with each item in their history. (...) This study compares and analyses different methods of path prediction including an adapted naive Bayes method, document similarity, visitor feedback and measures of lexical similarity.

PGen + Coverage: In this paper we study the problem of recommendation systems for the task of visitors to an information rich environment such as a museum, are interested to convey predictions about what theme or topic a given visitor is interested in. We are interested to get the most out of their visit. As a behaviour to evaluate the relative impact of different factors in the physical space. (...)

Base: This paper proposes a novel approach to measuring the success of machine learning methods in a user's selection of a particular exhibit to be produced. An unsupervised framework is used to jointly compute the likelihood of the value of the best exhibit to be produced. (...) The experiments show that models produced by supervised methods improve user performance in selecting exhibits over unsupervised methods.

Co-opNet: This paper presents a user-centric perspective on the property of location, focusing on some relevant factors in deciding which exhibit a user intends to visit. We exploit variation and infrequency in data from the Victoria/Ausland Courtyard (VCA) Collection (...) We make three contributions: (1) Our experimental system provides empirical evidence for the effectiveness of supervised learning techniques in predicting a user's audience behavior; (2) Our structure based method allows unsupervised learning to be applied to multiple sets of related information. (3) Our experimental system uses unsupervised model adaptation in a supervised setting.
A Appendices
A.1 Additional baselines for SASS benchmarks
ConvS2S For this model, we use the convolutional seq2seq architecture proposed by Gehring et al. (2017). A given source text x = (x_1, ..., x_n) and the positions of tokens within that source text p = (p_1, ..., p_n) are first encoded using embedding matrices. The embedded representations are passed to a 20-layer convolutional encoder with a kernel width of 3 and a hidden size of 512. For the convolutional decoder, we use the same hyper-parameters. During decoding, we penalize generation of UNK characters and use diverse beam search to handle commonly occurring long-sequence generation errors.
RNNExt The baseline we use for reinforcement learning is from the work of Chen and Bansal (2018). This model uses a hybrid extractive and abstractive summarization approach. First, each sentence in the source text is encoded using a temporal convolutional model, then a bi-directional LSTM is used to compute a context-aware representation. A set of salient sentences is extracted from among the encoded sentences using an LSTM pointer network. The seq2seq abstractor rewrites these sentences through compression and paraphrasing. The whole model is trained using the REINFORCE algorithm by formulating a Markov Decision Process where the extractor is an RL agent that receives a reward each time the abstractor finishes summarizing an extracted sentence. We only use this baseline for the intro-to-abstract task, because the convolutional encoder only accommodates multi-sentence summaries.
A.2 Additional SASS Tasks
We report summarization baseline results on the full SASS dataset for the Abstract → Title, Introduction → Title, and Introduction → Abstract tasks.

Table 9: Abstract → Title (ROUGE-1 / ROUGE-2 / ROUGE-L)
PGen + Coverage (See et al., 2017): 0.4053 / 0.2180 / 0.3653
ConvS2S (Gehring et al., 2017): 0.3421 / 0.1976 / 0.3207

Introduction → Abstract (ROUGE-1 / ROUGE-2 / ROUGE-L)
PGen + Coverage (See et al., 2017): 0.3365 / 0.1083 / 0.2933
ConvS2S (Gehring et al., 2017): 0.2857 / 0.0965 / 0.2451
RNNExt (Chen and Bansal, 2018): 0.4000 / 0.1253 / 0.3683

Introduction → Title (ROUGE-1 / ROUGE-2 / ROUGE-L)
PGen + Coverage (See et al., 2017): 0.3189 / 0.1444 / 0.2886
ConvS2S (Gehring et al., 2017): 0.3226 / 0.1794 / 0.3032
1 https://arxiv.org
2 https://www.aclweb.org/anthology
3 https://arxiv.org/corr
4 https://arxiv.org/archive/q-bio
https://github.com/huggingface/pytorch-pretrained-BERT