
Efficient Adaptation of Pretrained Transformers for Abstractive Summarization


Abstract

Large-scale learning of transformer language models has yielded improvements on a variety of natural language understanding tasks. Whether they can be effectively adapted for summarization, however, has been less explored, as the learned representations are less seamlessly integrated into existing neural text production architectures. In this work, we propose two solutions for efficiently adapting pretrained transformer language models as text summarizers: source embeddings and domain-adaptive training. We test these solutions on three abstractive summarization datasets, achieving new state of the art performance on two of them. Finally, we show that these improvements are achieved by producing more focused summaries with fewer superfluous details, and that performance improvements are more pronounced on more abstractive datasets.

1 Introduction

Recent work in large-scale language models [19; 20; 5] has allowed pretrained contextual representations to be easily adapted for a variety of downstream tasks, yielding improvements on many benchmarks evaluating natural language understanding [27] . Less explored, however, has been the effect of these pretrained representations on text production tasks, such as abstractive summarization, where state of the art performance is still achieved with sequence to sequence (seq2seq) models [1; 6] .

These sequence-to-sequence methods typically use an encoder and decoder model with separate parameters to represent the input article and produce the output summary, and the most successful solutions [1; 6; 22] use attention mechanisms that learn an alignment between encoder and decoder states. Pretrained language models, however, do not learn the parameters for such a task-specific alignment, making it challenging to integrate their learned representations into a summarization architecture at a higher level of abstraction than the word embedding.

In this work, we adapt full transformer language models for abstractive summarization. Building on the work of Liu et al. [12], who first proposed concatenating input and output text into a joint sequence and using a common transformer to encode both, we use a language model as a summarizer (rather than an encoder-decoder). With this approach, representations from a pretrained transformer language model (in this case, GPT [20]) can be used to fully initialize the parameters of the summarization model, allowing it to leverage the representational power of a model trained at much larger scale.

To accomplish this effectively, we outline two strategies for adapting pretrained representations for abstractive summarization. In the first, we augment the input representation of the summarization model by instantiating source embeddings that encode the token type of the text being read. This change allows the model to recognize whether a given token belongs to the input article or the output summary, thereby learning how to distinguish both types of text when encoding. In the second, we introduce a domain-adaptive training procedure that fine-tunes the transformer toward understanding general newswire text before training on the summarization end task directly, allowing the model to learn the general structure and language distribution of newswire text before being fine-tuned to produce summaries.

Figure 1: The embedding process for inputs to the Transformer-SM model.

A comprehensive empirical study across three datasets, CNN/DailyMail [8] , XSum [16] , and Newsroom [7] , shows that transformer language models can be used to train abstractive summarizers, producing summaries that are more concise and focused than state of the art baselines. Our investigation also empirically validates several observations about the abstractive summarization task. First, echoing the results of Sun et al. [24] , the most common summarization evaluation metric, ROUGE [10] , is highly sensitive to summary length, providing an advantage to methods that produce longer summaries, either through learning or with minimum summary length constraints. Second, achieving higher ROUGE scores is not strongly consistent with human assessments of abstractive summary quality. Finally, despite being conceived as abstractive summarizers, most current state of the art models are highly extractive, copying phrases and even sentences verbatim from the document.

2 Model

In this paper, we focus on a variant of the Transformer [26] that has been pretrained on a large corpus of natural language stories: the GPT model [20] . As our architecture is practically identical to the one proposed in Radford et al. [20] , we point readers to that work for background on the architecture of the model, and focus below on the enhancements to the input representation made in our approach.

2.1 Input Representation

Each article is represented as a sequence of $M$ tokens $X^a = \{x^a_m\}_{m=1}^{M} = x^a_1, \ldots, x^a_M$, and its corresponding summary is a sequence of $N$ tokens $X^s = \{x^s_n\}_{n=1}^{N} = x^s_1, \ldots, x^s_N$.

As outlined in Figure 1, the input structure for each training example is an article and its corresponding summary concatenated into a single sequence, similar to [12]:

$$X = \{x_t\}_{t=1}^{T} = x^a_1, \ldots, x^a_M, \langle \mathrm{delim} \rangle, x^s_1, \ldots, x^s_N, \langle \mathrm{end} \rangle \qquad (1)$$

where $T = M + N + 2$, and $\langle \mathrm{delim} \rangle$ and $\langle \mathrm{end} \rangle$ are special tokens identifying the delimitation and end of the sequence. Below, we define the process of encoding these sequences as inputs to the transformer.
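As a concrete illustration of Eq. (1) with hypothetical lengths, an article of $M = 3$ tokens and a summary of $N = 2$ tokens are packed into

$$X = x^a_1, x^a_2, x^a_3, \langle \mathrm{delim} \rangle, x^s_1, x^s_2, \langle \mathrm{end} \rangle,$$

for a total length of $T = M + N + 2 = 7$.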

Word Embedding First, each token $x_t$ in the concatenated sequence $X$ indexes a word embedding $e[x_t] \in \mathbb{R}^h$ from a joint vocabulary for the article and summary (and special tokens).

Position Embedding Second, since the transformer (a self-attention model) has no inherent notion of token order, a position embedding $p_t \in \mathbb{R}^h$ is initialized for each absolute position in the sequence [26]. The embedding for each position is added to the word embedding of the token occupying that position, augmenting the final representation of the input. For example, each token in the article is represented as $w^a_m = e[x^a_m] + p_m$. Once the delimitation token is reached, the position counter is reset: the first token of the article, $x^a_1$, and the first token of the summary, $x^s_1$, both receive $p_1$ as their positional embedding.

Source Embedding Finally, because the transformer must recognize pragmatic differences between the text of the article it reads and the text of the summary it learns to produce, an additional, source-specific embedding $d \in \mathbb{R}^h$ is initialized. The source embedding encodes whether a token belongs to the article portion ($d^a$) or the summary portion ($d^s$) of the concatenated input. For any article token (Eq. 2) or summary token (Eq. 3), the final encoding is:

$$w^a_m = e[x^a_m] + p_m + d^a \qquad (2)$$

$$w^s_n = e[x^s_n] + p_n + d^s \qquad (3)$$

In contrast to the other embeddings in the model, the source embeddings are not pretrained, so they could dominate the pretrained word and position embeddings when summed (Eq. 2, 3). To avoid this, we normalize the random initialization of the source embeddings so that their norm equals half of the average norm of the word embeddings.
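To make the full input encoding of Section 2.1 concrete, the sketch below builds the summed word, position, and source embeddings for one article-summary pair. It is a minimal illustration under our own assumptions (plain PyTorch embedding tables, the delimiter grouped with the article segment and the end token with the summary segment), not the authors' implementation.

```python
import torch
import torch.nn as nn

def build_input_embeddings(article_ids, summary_ids, delim_id, end_id,
                           tok_emb, pos_emb, src_emb):
    """Minimal sketch of Section 2.1: word + position + source embeddings.

    article_ids, summary_ids: 1-D LongTensors of token ids (hypothetical inputs).
    tok_emb, pos_emb: pretrained nn.Embedding tables (word and absolute position).
    src_emb: nn.Embedding(2, h) holding d^a (index 0) and d^s (index 1).
    """
    device = article_ids.device
    delim = torch.tensor([delim_id], device=device)
    end = torch.tensor([end_id], device=device)

    # Concatenate article, delimiter, summary, end token (Eq. 1).
    tokens = torch.cat([article_ids, delim, summary_ids, end])

    # Position ids restart after the delimiter, so the first article token and
    # the first summary token share the same position embedding.
    m = article_ids.numel() + 1          # article tokens + <delim>
    n = summary_ids.numel() + 1          # summary tokens + <end>
    positions = torch.cat([torch.arange(m, device=device),
                           torch.arange(n, device=device)])

    # Source ids: 0 for the article segment, 1 for the summary segment (Eq. 2-3).
    sources = torch.cat([torch.zeros(m, dtype=torch.long, device=device),
                         torch.ones(n, dtype=torch.long, device=device)])

    return tok_emb(tokens) + pos_emb(positions) + src_emb(sources)

def init_source_embeddings(tok_emb, hidden=768):
    """Randomly initialize the two source embeddings, then rescale them so each
    has norm equal to half the average word-embedding norm (Section 2.1)."""
    src_emb = nn.Embedding(2, hidden)
    with torch.no_grad():
        target = 0.5 * tok_emb.weight.norm(dim=-1).mean()
        src_emb.weight.mul_(target / src_emb.weight.norm(dim=-1, keepdim=True))
    return src_emb
```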

3 Training

The model is initialized with pretrained parameters from the GPT model [20] that was trained on the BooksCorpus [28] . Following this initialization, we pursue two additional training procedures: domain-adaptive training and end task training.

3.1 Domain-Adaptive Training

Despite the benefit of using pretrained representations from the GPT model to initialize a summarizer, there is a language shift between the storybooks data on which the GPT model was trained and the type of language found in newswire summarization datasets [8; 16; 7] . Additionally, there are structural differences between how articles are written (usually expressing salient points early on, followed by details later) and how stories unfold (less front-loading of key information).

To address this discrepancy, we propose domain-adaptive training (DAT) to adapt the transformer summarization model to the language distribution of newswire text by maximizing the conditional log-likelihood of the article tokens and summary tokens given all previous tokens in their concatenated input representation (see Figure 1):

$$\mathcal{L}_{\mathrm{dat}} = -\sum_{m=1}^{M} \log P(x^a_m \mid \{x^a\}_{<m}) - \sum_{n=1}^{N} \log P(x^s_n \mid \{x^a\}_M, \{x^s\}_{<n}) \qquad (4)$$

where $M$ is the length of the article, $N$ is the length of the summary, $\{x^a\}_{<m}$ denotes the article tokens preceding $x^a_m$, $\{x^a\}_M$ denotes the full set of article tokens, and $\{x^s\}_{<n}$ denotes the summary tokens preceding $x^s_n$.

3.2 End Task Training

During end task training (ETT), the model is trained specifically to produce a summary given a document, constraining the loss function toward maximizing the conditional log-likelihood of producing only the correct summary tokens given the set of article tokens $\{x^a\}_M$:

$$\mathcal{L}_{\mathrm{ett}} = -\sum_{n=1}^{N} \log P(x^s_n \mid \{x^a\}_M, \{x^s\}_{<n}) \qquad (5)$$

where $\{x^s\}_{<n}$ again denotes the summary tokens preceding $x^s_n$.
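As an illustration of how the two objectives differ in practice, the sketch below scores the model's next-token predictions over the concatenated sequence and masks the loss accordingly: domain-adaptive training (Eq. 4) scores article and summary tokens alike, while end task training (Eq. 5) scores only the tokens after the delimiter. The helper names, and the choice to include the special tokens in the masks, are our assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, tokens, loss_mask):
    """Masked negative log-likelihood of tokens[1:] given logits over tokens[:-1].

    logits: (T, V) next-token logits from the transformer.
    tokens: (T,) concatenated article/summary token ids (Eq. 1).
    loss_mask: (T-1,) 1.0 where the target token should be scored, 0.0 elsewhere.
    """
    nll = F.cross_entropy(logits[:-1], tokens[1:], reduction="none")
    return (nll * loss_mask).sum() / loss_mask.sum()

def dat_mask(article_len, summary_len):
    # Domain-adaptive training (Eq. 4): score article and summary tokens alike.
    total = article_len + summary_len + 2          # + <delim>, <end>
    return torch.ones(total - 1)

def ett_mask(article_len, summary_len):
    # End task training (Eq. 5): score only the targets that follow the <delim>
    # position, i.e. the summary tokens (and, here, the <end> token).
    total = article_len + summary_len + 2
    mask = torch.zeros(total - 1)
    mask[article_len:] = 1.0
    return mask
```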

4 Experimental Setup

Datasets We evaluate on three datasets: CNN/DailyMail [8], XSum [16], and Newsroom [7]. The CNN/DailyMail dataset [8] pairs news articles with multi-sentence highlight summaries. The XSum dataset [16] consists of ∼230k article summary pairs taken from the BBC. Each summary is a single sentence long and is professionally written (usually by the author), making the dataset exhibit more abstractive content than typical summarization datasets, such as CNN/DailyMail [8]. The Newsroom dataset [7] consists of ∼1.2M article summary pairs scraped from the Internet Archive. The articles come from a set of 38 publishers and cover diverse topics. We provide statistics about each dataset in Table 1.

Table 1: Comparison of summarization datasets with respect to dataset size, proportion of unique n-grams, mean article length in words, and mean summary length in words.

Data Preprocessing

We used byte-pair encoding (BPE) for tokenization. For each summarization dataset, we use the BPE to tokenize each article and summary, and then truncate the articles to a maximum length of 512 tokens and each summary to a maximum length of 110 tokens. We then format each article-summary pair into the format outlined in Figure 1.
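A minimal sketch of this preprocessing step, where bpe_encode stands in for whichever byte-pair encoder is used and the special-token ids are placeholders:

```python
MAX_ARTICLE_TOKENS = 512
MAX_SUMMARY_TOKENS = 110

def format_example(article_text, summary_text, bpe_encode, delim_id, end_id):
    """Sketch of the preprocessing described above.

    bpe_encode: any byte-pair-encoding tokenizer mapping text -> list of ids
                (placeholder for the actual BPE vocabulary used).
    """
    article_ids = bpe_encode(article_text)[:MAX_ARTICLE_TOKENS]   # truncate article
    summary_ids = bpe_encode(summary_text)[:MAX_SUMMARY_TOKENS]   # truncate summary
    # Concatenate into the single sequence of Figure 1 / Eq. 1.
    return article_ids + [delim_id] + summary_ids + [end_id]
```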

Model Specifications We used a transformer decoder with N = 12 blocks and h = 12 masked self-attention heads in each block, and a model dimensionality of d_model = 768. Unless stated otherwise, we use the pretrained weights of Radford et al. [20] to initialize the parameters of the model. Special tokens added to the vocabulary (i.e., the end token, start token, and delimiter token) are initialized by sampling from the standard normal distribution. Our full model with source embeddings (§2.1) is denoted Transformer-SM, and we also train an ablation, Transformer-LM, that does not use source embeddings.

Training Details All models were trained with a learning rate of 6.25 × 10−5 and a minibatch size of 64. When domain-adaptive training (DAT) is used, we train for 10 epochs using DAT and then for an additional 10 epochs using end task training (ETT). Without DAT, we train on the end task for 20 epochs. Unless specified otherwise, the final model trained for each dataset uses both domain-adaptive training and end task training. We did not tune hyperparameters. All models were trained using the PyTorch package 1 and the HuggingFace implementation of GPT. 2 We trained each model on 8 Tesla V100-SXM2 GPUs. Training for a total of 20 epochs took approximately 1 day of wall-clock time for the XSum and CNN/DailyMail datasets, and 3 days for the Newsroom dataset. Our source code is publicly available. 3

Generation We perform generation using beam search with a beam size of 3 and the trigram trick [18]. Each summary token is generated by decoding from the distribution the model yields after processing an input tensor that is the concatenation of the article tokens, the delimiter token, and any previously generated summary tokens.
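The sketch below illustrates the decoding loop and the trigram blocking described above. For brevity it uses a single greedy beam rather than the beam size of 3 used in the paper, and the model interface is an assumption (a callable mapping a token-id tensor to per-position next-token logits):

```python
import torch

def blocked_trigram(candidate, prev_tokens):
    """Trigram trick [18]: reject a candidate token if it would repeat
    any trigram already present in the partial summary."""
    if len(prev_tokens) < 2:
        return False
    new_trigram = (prev_tokens[-2], prev_tokens[-1], candidate)
    seen = {tuple(prev_tokens[i:i + 3]) for i in range(len(prev_tokens) - 2)}
    return new_trigram in seen

@torch.no_grad()
def greedy_decode(model, article_ids, delim_id, end_id, max_len=110):
    """Illustrative single-beam decoding loop. article_ids is a list of ids;
    model is assumed to return (T, V) next-token logits for a 1-D id tensor."""
    summary = []
    for _ in range(max_len):
        # Input is article + <delim> + summary-so-far, as described in the text.
        inp = torch.tensor(article_ids + [delim_id] + summary)
        logits = model(inp)[-1]
        # Pick the highest-scoring token that does not repeat a trigram.
        for tok in torch.argsort(logits, descending=True).tolist():
            if not blocked_trigram(tok, summary):
                break
        if tok == end_id:
            break
        summary.append(tok)
    return summary
```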

Evaluation We evaluate our system with common summarization metrics: ROUGE-1 (R-1), a measure of unigram recall between the generated and reference summaries; ROUGE-2 (R-2), a similar measure of bigram recall; and ROUGE-L (R-L), a measure based on the longest common subsequence between the generated and reference summaries [11]. We also report the length of the summaries in tokens produced. For each dataset, we selected the model for test-set evaluation by the largest ROUGE-1 score on a subset of 500 samples from the validation set.

5 Results

5.1 CNN/DailyMail

Automatic Metrics We report our results using automatic metrics in Table 2. On this dataset, our main model, Transformer-SM, performs slightly worse than other state of the art models. We note that our model tends to generate shorter summaries than the gold summaries (∼20% shorter), which could lower ROUGE recall performance.

Table 2: ROUGE F1 results on the test set of CNN/Daily Mail. Best model results are bolded.

In Figure 2, we investigate the correlation of ROUGE-L scores with summary length, and note that the minimum decoding length used by state-of-the-art algorithms places baseline-generated summaries in length bins of higher average ROUGE-L performance. When Transformer-SM produces summaries in these same length bins (i.e., more than 30 tokens), its performance is only consistently beaten by the DCA model, which was fine-tuned with RL.

Human Evaluation Each worker was presented with two model-generated summaries, one produced by the Transformer-SM model and one by either the DCA model [1] or the PGen+Cov model [22]. Workers were asked to select the better summary for four different quality metrics from Celikyilmaz et al. [1]: non-redundancy (fewer of the same ideas are repeated), coherence (ideas are expressed clearly), focus (the main ideas of the document are shared while avoiding superfluous details), and overall.

Figure 2: Average ROUGE-L for summaries in different length bins. Scatter points correspond to ROUGE-L scores for each bin, while solid lines correspond to the number of summaries in each bin.
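The binning analysis behind Figure 2 can be reproduced with a few lines, assuming per-summary ROUGE-L scores and summary lengths are already computed (the bin width of 10 tokens is our assumption):

```python
from collections import defaultdict

def rouge_by_length_bin(lengths, rouge_l_scores, bin_width=10):
    """Group summaries into length bins and report the mean ROUGE-L and the
    number of summaries per bin, as in Figure 2."""
    bins = defaultdict(list)
    for length, score in zip(lengths, rouge_l_scores):
        bins[(length // bin_width) * bin_width].append(score)
    return {b: (sum(s) / len(s), len(s)) for b, s in sorted(bins.items())}
```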

The results are presented in Table 3. Interestingly, the summaries from Transformer-SM are consistently preferred by humans across all four evaluation dimensions compared to those from the DCA and PGen+Cov models, indicating that Transformer-SM's lower ROUGE scores in Table 2 are not necessarily correlated with human judgments of quality.

Efficiency Due to the large improvements over the baseline models in the human evaluation categories of non-redundancy and focus, and the generally shorter summaries produced by Transformer-SM, we investigate whether Transformer-SM is able to express the key ideas of a document more efficiently. To evaluate the efficiency of each model, we remove non-content words from the model-generated summaries and articles, and compute the ROUGE score between them. This measure serves as a proxy for the rate at which ideas expressed in the summary can be found in the document.
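A sketch of this efficiency measure, using NLTK's English stop-word list and the rouge-score package as stand-ins for whichever stop-word list and ROUGE implementation the authors used:

```python
from nltk.corpus import stopwords            # requires: nltk.download("stopwords")
from rouge_score import rouge_scorer         # pip install rouge-score

STOP = set(stopwords.words("english"))

def content_words(text):
    """Drop stop words so that only content words remain."""
    return " ".join(w for w in text.split() if w.lower() not in STOP)

def efficiency_scores(article, summary):
    """ROUGE-L precision/recall/F1 between the stop-word-filtered summary and
    article, the proxy for idea coverage reported in Table 4."""
    scorer = rouge_scorer.RougeScorer(["rougeL"])
    score = scorer.score(content_words(article), content_words(summary))["rougeL"]
    return score.precision, score.recall, score.fmeasure
```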

Table 3: Head-to-head comparison between test set outputs of (left) DCA [1] and Transformer-SM and (right) PGen+Cov [22] and Transformer-SM. Analyses are done on summaries for CNN/DailyMail.

We report these results in Table 4 and observe that Transformer-SM achieves ROUGE-L recall scores comparable to the other baselines when evaluated with respect to the article, despite producing summaries that are, on average, 27% shorter. Meanwhile, ROUGE-L precision is also very similar to the baseline models, indicating that the summaries of all models exhibit a similar degree of information relevance. 4 Combined with the results from Table 3, we conjecture that Transformer-SM is able to more efficiently express key ideas from the document. While other models may produce longer summaries that yield higher ROUGE performance (Table 2), the additional tokens may reflect redundant and unsalient information, which human evaluators penalize.

Table 4: ROUGE-L precision (R-L P), recall (RL R), and F1 (R-L F1) scores computed between generated summaries and input CNN/DailyMail articles after removing stop words

Analysis of domain-adaptive training and source embeddings Our approach involved two strategies for efficiently using transformer language models for abstractive summarization: domain-adaptive training and source embeddings. To assess their individual impact, we evaluate multiple training schedule permutations (e.g., various combinations of using pretrained representations from the GPT model and using domain-adaptive training), as well as the impact of source embeddings. Our results in Table 5 yield multiple interesting conclusions. First, in general, domain-adaptive training (+DAT in Table 5 ) provides a clear improvement over training directly on the end task, irrespective of whether pretrained representations are used. Similarly, using source embeddings (T-SM in Table 5 ) provides a repeated improvement over the T-LM ablation. Surprisingly, when pretrained initializations, DAT, and source embeddings are used in tandem, performance drops slightly compared to not using DAT or not using source embeddings. We note, however, that this observation does not hold true for the XSum dataset ( §5.2), and conjecture that the extractive nature of the CNN/DailyMail dataset may make these approaches have redundant effects in this setting.

Table 5. Not extracted; please refer to original document.

5.2 XSum

A study on the quality of abstractive summaries is best performed on the XSum dataset [16] , which is specifically designed with gold summaries that are less extractive than the other datasets (Table 1) .

Baselines We report the performance of Transformer-SM on this dataset in comparison to baselines originally reported in Narayan et al. [16]: an attention-based sequence-to-sequence model (AttnS2S); a pointer-generator model capable of generating words and copying directly from the input (PGen); a second pointer-generator model with a coverage mechanism to prevent repetition (PGen+Cov); and the top performing variant of the topic-aware convolutional sequence-to-sequence model (T-ConvS2S), in which the encoder and decoder are provided with word-topic and document-topic distributions obtained using LDA as additional inputs. Our final baseline is the Multi-Level Memory Network (MMN) [9], which applies attention over multiple memory layers for varying levels of abstraction.

Results We report our results in Table 6. Our models significantly outperform the comparison baselines across all three variants of the ROUGE metric. Interestingly, Transformer-SM achieves a noticeable improvement over the Transformer-LM model, suggesting that both source embeddings and domain-adaptive training are helpful when target summaries are more abstractive. Examples of model-generated summaries from the XSum dataset illustrate the improvement over baselines qualitatively in Table 7. In support of the results presented earlier, the model produces abstractive summaries that provide focused information about the main points of the articles.

Table 6: Comparison results on the XSum test set using the F1 variants of ROUGE
Table 7: XSum samples for the baseline T-ConvS2S model and Transformer-SM along with the gold summary. Articles are shortened for brevity. Capitalization was manually added for ease of reading.

Example 1
Article snippet: Officials said the attack happened at the Europa shopping centre in the capital Minsk. ... Police later arrested the 18-year-old suspect. ... "He cut one woman with the chainsaw and hit her with a hammer. She died. He also attacked others." The injured woman was taken to a local hospital. The attacker had brought the chainsaw and the axe to the shopping centre ...
T-ConvS2S: A man has been arrested on suspicion of attempted murder by after a knife attack on a shopping centre in central London.
Transformer-SM: A teenage girl has been killed by a chainsaw attack at a shopping centre in central Russia, police say.
Gold: A young man has attacked people with a chainsaw and an axe at a shopping centre in Belarus, killing one woman and injuring another.

Example 2
Article snippet: The 34-year-old Sweden striker's contract with the french champions expires in the summer, and he has been linked with Manchester United, LA Galaxy and AC Milan. ... PSG said Ibrahimovic leaves as "the greatest striker and one of the very best players in the club's history". ...
T-ConvS2S: Paris St-Germain have completed the signing of Zlatan Ibrahimovic from Paris St-Germain for an undisclosed fee.
Transformer-SM: Zlatan Ibrahimovic says he will leave Paris St-Germain at the end of the season to return to the club.
Gold: Zlatan Ibrahimovic will leave Paris St-Germain at the end of the season.

Example 3
Article snippet: ... The animal was taken from Lathom pets and aquatics in Ormskirk on Tuesday afternoon, Lancashire police said. The shop's owner said CCTV showed a man taking the tortoise - which needs calcium supplements - out of the tank. ...
T-ConvS2S: A puppy's pet shop has been stolen from a shop in Lancashire.
Transformer-SM: A tortoise has been stolen from a pet shop.
Gold: A baby tortoise has been stolen from a pet shop in Lancashire.

5.3 Newsroom

Finally, we report the performance of our model on the Newsroom dataset [7] , the largest of the evaluation datasets. Due to the large cost of training, only the Transformer-SM model was evaluated.

Baselines As baselines, we report the performance of models released by the authors of the Newsroom dataset [7]. These models include an attentive encoder-decoder (Attn-S2S) and a pointer-generator network (PGen). We also compare against C10110 [23], a complex encoder-decoder that uses LSTMs, encoder attention, intra-decoder attention, and pointer-generation to produce summaries, and against the Multi-Level Memory Network (MMN) [9] mentioned earlier. The authors of this baseline only evaluated on the abstractive subset of the Newsroom dataset.


Results We report our results with ROUGE-style automatic metrics in Table 8, showing that Transformer-SM outperforms the previous best model, C10110 [23], across all metrics. Interestingly, our model achieves its largest performance increase over baseline models on ROUGE-L, the metric usually considered most strongly correlated with strong summaries. Furthermore, an analysis of different validation subsets of the Newsroom dataset in Table 9 (split by the level of extractiveness of the gold summaries) shows that Transformer-SM performs better than baseline approaches on all varieties of summary types.

Table 8. Not extracted; please refer to original document.
Table 9: ROUGE F1 results on validation subsets and full validation set for Newsroom



6 Related Work

Abstractive Summarization There has been a wide variety of work exploring different methods for neural abstractive document summarization. Attention mechanisms have been shown to improve a variety of models [14; 25; 3], and are one of the motivating factors for this work. Pointer-generator networks, introduced in See et al. [22], have been shown to increase summary veracity, and inspired the tangential usage of copy mechanisms in Transformers for document summarization in Gehrmann et al. [6]. Other works have also explored the use of reinforcement learning to directly optimize summarization models on the ROUGE metric [17; 18; 2].

Contextualized Representations Our approach is also relevant to recent work on contextualized language representations that are pretrained on large-scale language corpora. These representations can then be simply integrated or fine-tuned for improved performance on many downstream tasks [27]. SSL [4], CoVe [13], and ELMo [19] all learned contextualized representations by training RNN language models and encoder-decoders. Follow-up work extended these ideas but replaced the RNN with a deep transformer [20] trained to learn language patterns on a large story dataset. BERT [5] extended the idea of using Transformers for language modeling by making the encoded representations bidirectional and adding two new loss functions: a masked token loss and a next-sentence prediction loss for more accurate discourse representations. More recently, GPT-2 [21] expanded the scale of pretrained language models and showed promising results on zero-shot tasks.

7 Conclusion

In this work, we introduce two approaches for effectively adapting pretrained language model representations to abstractive summarization: domain-adaptive training and source embeddings. We evaluate the effect of both approaches across three abstractive summarization testbeds, CNN/DailyMail, XSum, and Newsroom, achieving state of the art ROUGE-L results on two of them while showing superior human evaluation performance on the third. In the process, we show that the ROUGE-L metric often used for abstractive summarization evaluation is quite sensitive to summary length, making it exploitable by approaches that use heuristics to control summary length.

1 https://pytorch.org/
2 https://github.com/huggingface/pytorch-openai-transformer-lm
3 https://github.com/Andrew03/transformer-abstractive-summarization

4 The high precision scores across all models confirm that, despite being conceived as abstractive generators, these models display highly extractive behavior.

5 https://www.cnn.com; https://www.dailymail.co.uk
6 https://archive.org/
7 https://www.bbc.com/
8 https://github.com/Andrew03/transformer-abstractive-summarization