
Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts


Abstract

For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming. The most common automatic metrics, like BLEU and ROUGE, depend on exact word matching, an inflexible approach for measuring semantic similarity. We introduce methods based on sentence mover’s similarity; our automatic metrics evaluate text in a continuous space using word and sentence embeddings. We find that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries (average length of 3.4 sentences) and human-authored essays (average length of 7.5). We also show that sentence mover’s similarity can be used as a reward when learning a generation model via reinforcement learning; we present both automatic and human evaluations of summaries learned in this way, finding that our approach outperforms ROUGE.

1 Introduction

Automatic text evaluation reduces the need for human evaluations, which can be expensive and time-consuming to collect, particularly when evaluating long, multi-sentence texts. Automatic metrics allow faster measures of progress when training and testing models and easier development of text generation systems.

However, existing automatic metrics for evaluating text are problematic. Due to their computational efficiency, metrics based on word-matching are common, such as ROUGE (Lin, 2004) for summarization, BLEU (Papineni et al., 2002) for machine translation, and METEOR (Banerjee and Lavie, 2005) or CIDER (Vedantam et al., 2015) for image captioning. Nevertheless, these metrics often fail to capture information that has been reworded or reordered from the reference text, as shown in Kilickaya et al. (2017) and Table 1. 1 They have also been found to correlate weakly with human judgments (Liu et al., 2016; Novikova et al., 2017).

Figure 1: An illustration of S+WMS (a sentence mover similarity metric that uses both word and sentence embeddings) between two documents. This metric finds the minimal cost of “moving” both the word embeddings (orange) and the sentence embeddings (blue) in Document A to those in Document B. An arrow’s width is the proportion of the embedding’s weight being moved, and its label is the Euclidean distance. Here we show only the highest weighted connections.
Table 1: A comparison of scores for three different summaries for a reference passage (the first lines of a news article). The human summary has been permuted with its clauses rearranged (Word order) and repeated (Repetition). Word order changes negatively affect ROUGE-L more than repetition; the other metrics are unaffected by word order choices but, to varying degrees, penalize repetition.

To avoid these shortcomings, word mover's distance (WMD; Kusner et al., 2015) can be used to evaluate text in a continuous space using pretrained word embeddings instead of relying on exact word matching. WMD has been used successfully for tasks including image caption evaluation (Kilickaya et al., 2017), automatic essay evaluation (Tashu and Horváth, 2018), and affect detection (Alshahrani et al., 2017). This bag-of-embeddings approach is flexible but fails to reflect the grouping of words and ideas, a shortcoming that becomes more problematic as the length of the document grows.

Table 1 content (the ROUGE-L, WMS, SMS, and S+WMS scores were not extracted; please refer to the original document):

Reference passage. the only thing crazier than a guy in snowbound massachusetts boxing up the powdery white stuff and offering it for sale online ? people are actually buying it . for $ 89 , self-styled entrepreneur kyle waring will ship you 6 pounds of boston-area snow in an insulated styrofoam box - enough for 10 to 15 snowballs , he says .

Human summary. a man in suburban boston is selling snow online to customers in warmer states . for $ 89 , he will ship 6 pounds of snow in an insulated styrofoam box .

Repetition. a man in suburban boston is selling snow is selling snow online to customers in warmer states in warmer states . for $ 89 , he will ship he will ship 6 pounds 6 pounds of snow in an insulated styrofoam box in a styrofoam box .

We modify WMD for evaluating multi-sentence texts by basing the score on sentence embeddings (§3), giving it access to higher-level representations of the text. We introduce two new metrics: sentence mover's similarity (SMS), which relies only on sentence embeddings, and sentence and word mover's similarity (S+WMS), which uses word and sentence embeddings, as in Figure 1. In §4, we find that sentence mover's similarity metrics significantly improve correlation with human evaluations over ROUGE-L (the longest common subsequence variant of ROUGE) and WMD when scoring automatically generated summaries (averaging 3.4 sentences). We also automatically evaluate human-authored essays (averaging 7.5 sentences) and find smaller but significant gains. We compute sentence mover's similarity metrics with type-based embeddings and contextual embeddings and find these results hold regardless of embedding type, with no significant difference caused by the choice of embedding.

Finally, we show in §5 that sentence mover's similarity metrics can also be used when learning to generate text. Generating summaries using reinforcement learning with sentence mover's similarity as the reward results in higher quality summaries than those generated using a ROUGE-L or WMD reward, according to both automatic metrics and human evaluations.

2 Background: Word Mover's Distance

Earth mover's distance (EMD, also known as the Wasserstein metric; Rubner and Guibas, 1998) is a measure of the distance between two probability distributions. Word mover's distance (WMD; Kusner et al., 2015 ) is a discrete version of EMD that evaluates the distance between two sequences (e.g., sentences, paragraphs, etc.), each represented with relative word frequencies. It combines (1) item similarity 2 on bag-of-word (BOW) histogram representations of text (Goldberg et al., 2018) with (2) word embedding similarity.

For any two documents A and B, WMD is defined as the minimum cost of transforming one document into the other. Each document is represented by the relative frequencies of words it contains, i.e., for the ith word type,

$d_{A,i} = \mathrm{count}(i) / |A| \quad (1)$

where |A| is the total word count of document A, and $d_{B,i}$ is defined similarly. Now let the ith word be represented by $v_i \in \mathbb{R}^m$, i.e., an m-length embedding, 3 allowing us to define distances between the ith and jth words, denoted $\Delta(i, j)$. V is the vocabulary size. We follow Kusner et al. (2015) and use the Euclidean distance $\Delta(i, j) = \|v_i - v_j\|_2$.

The WMD is then the solution to the linear program:

$$
\begin{aligned}
\mathrm{WMD}(A, B) = \min_{T \geq 0} \; & \sum_{i=1}^{V} \sum_{j=1}^{V} T_{i,j}\, \Delta(i, j) \qquad && (2a) \\
\text{s.t.} \quad & \sum_{j=1}^{V} T_{i,j} = d_{A,i} \quad \forall i, && (2b) \\
& \sum_{i=1}^{V} T_{i,j} = d_{B,j} \quad \forall j. && (2c)
\end{aligned}
$$

$T \in \mathbb{R}^{V \times V}$ is a nonnegative matrix, where each $T_{i,j}$ denotes how much of word i (across all its tokens) in A is assigned to tokens of word j in B, and the constraints ensure the flow of a given word cannot exceed its weight. Specifically, WMD ensures that the entire outgoing flow from word i equals $d_{A,i}$, i.e., $\sum_j T_{i,j} = d_{A,i}$. Additionally, the amount of incoming flow to word j must match $d_{B,j}$, i.e., $\sum_i T_{i,j} = d_{B,j}$.

Following the example of Kilickaya et al. (2017), we transform WMD into a similarity (WMS):

$\mathrm{WMS}(A, B) = \exp\!\big(-\mathrm{WMD}(A, B)\big) \quad (3)$

WMS measures two documents' similarity by minimizing the total distance to move words between two documents, combining the strengths of BOW and word embedding-based similarity metrics. In Figure 1 , WMS would calculate the cost of moving from Document A to Document B using only the word embeddings, denoted in orange. WMS is symmetric, and WMS(A, A) = 1 when word embeddings are deterministic.
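As an illustration of the optimization above, the following is a minimal sketch that computes WMD and WMS with NumPy and SciPy's linear-programming solver. It is not the authors' released implementation (which extends the wmd-relax library; see footnote 4): the helper names word_bag, mover_distance, and mover_similarity, and the word-to-vector lookup emb, are illustrative assumptions.

from collections import Counter

import numpy as np
from scipy.optimize import linprog


def word_bag(tokens, emb):
    """Return (embeddings, weights) for a document: d_{A,i} = count(i)/|A| (Eq. 1)."""
    counts = Counter(t for t in tokens if t in emb)
    total = sum(counts.values())
    types = sorted(counts)
    vectors = np.stack([emb[t] for t in types])
    weights = np.array([counts[t] / total for t in types])
    return vectors, weights


def mover_distance(vec_a, w_a, vec_b, w_b):
    """Solve the transport LP of Eq. 2: min sum_ij T_ij * ||v_i - v_j||_2."""
    # Pairwise Euclidean distances between the two bags of embeddings.
    delta = np.linalg.norm(vec_a[:, None, :] - vec_b[None, :, :], axis=-1)
    n, m = delta.shape
    # Equality constraints: rows of T sum to w_a (Eq. 2b), columns to w_b (Eq. 2c).
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([w_a, w_b])
    res = linprog(delta.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun


def mover_similarity(vec_a, w_a, vec_b, w_b):
    """WMS(A, B) = exp(-WMD(A, B)) (Eq. 3)."""
    return np.exp(-mover_distance(vec_a, w_a, vec_b, w_b))

For example, mover_similarity(*word_bag(tokens_a, emb), *word_bag(tokens_b, emb)) scores two tokenized documents; a dedicated optimal-transport solver would scale better than this generic dense LP for large vocabularies.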

Empirically, WMD has improved the performance of NLP tasks (see §6), specifically sentence-level tasks, such as image caption generation (Kilickaya et al., 2017) and natural language inference (Sulea, 2017) . However, its cost grows prohibitively as the length of the documents increases, and the BOW approach can be problematic when documents become large as the relation between sentences is lost. By only measuring word distances, the metric cannot capture information conveyed by the grouping of words, for which we need higher-level document representations (Dai et al., 2015; Wu et al., 2018) .

3 Sentence Mover's Similarity Metrics

We modify WMS to measure the similarity between two documents using sentence embeddings, which we call a sentence mover's similarity approach. We introduce two new metrics: Sentence Mover's Similarity (SMS) and Sentence and Word Mover's Similarity (S+WMS). SMS replaces the word embeddings in WMS with sentence embeddings ( §3.1), while S+WMS combines the two metrics and uses both word and sentence embeddings ( §3.2). Our code (an extension of an existing WMD implementation 4 ) and datasets are publicly available. 5

3.1 Sentence Mover's Similarity

Sentence Mover's Similarity (SMS) performs the same linear optimization problem in Eq. 2a as WMS, except now each document is represented as a bag of sentence embeddings rather than a bag of word embeddings. In Figure 1 , SMS considers only the sentence embeddings, denoted in blue.

To get the representation of a sentence in a document, we combine the sentence's word embeddings. Sentence representations based on averaging or pooling word embeddings perform competitively on tasks including sentence classification, recognizing textual entailment, and paraphrase detection (Conneau and Kiela, 2018) . We use sentence representations that are the average of their word embeddings, as this approach outperformed pooling methods in preliminary results.

While in WMS word embeddings are weighted according to their frequency in the document (see Eq. 1), SMS weights each sentence embedding by the number of words it contains, normalized by the document's total word count |A|. 6 So a sentence i in document A will receive a weight of:

$d_{A,i} = |i| / |A| \quad (4)$

We solve the same linear program as in Eq. 2, calculating the cumulative distance of moving a document's sentences to match another document. Now the vocabulary is the set of sentences in the documents instead of the words, as in Figure 2.

Figure 2: The S+WMS T matrix for documents A and B from Figure 1 (with empty rows/columns removed). Contrarily, WMS’s T matrix only maps between words and has the dimensions of the dashed region labeled “Words,” and SMS’s maps between sentences in the shape of the dashed region “Sentences.” Best viewed in color.
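As a concrete illustration (under the same assumptions as the WMS sketch in §2, i.e., the hypothetical emb lookup and the mover_similarity helper), the sentence bag for SMS can be built by averaging each sentence's word vectors and weighting each sentence by its length:

import numpy as np


def sentence_bag(sentences, emb):
    """Return (embeddings, weights) with one entry per sentence.

    Each sentence embedding is the average of its word embeddings, and its
    weight is |i| / |A| (Eq. 4), where |A| is the document's word count.
    """
    total_words = sum(len(s) for s in sentences)
    vectors, weights = [], []
    for sent in sentences:
        known = [emb[t] for t in sent if t in emb]
        if not known:
            continue  # skip sentences with no in-vocabulary words
        vectors.append(np.mean(known, axis=0))
        weights.append(len(sent) / total_words)
    return np.stack(vectors), np.array(weights)


# SMS reuses the same transport LP, now over sentence embeddings:
# sms = mover_similarity(*sentence_bag(doc_a, emb), *sentence_bag(doc_b, emb))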

3.2 Sentence and Word Mover's Similarity

Sentence and Word Mover's Similarity (S+WMS) combines WMS and SMS and represents each document as a collection of both words and sentences. Each document is now a bag of both word and sentence embeddings (as seen in Figure 1), where each word embedding is weighted according to its frequency and each sentence embedding is weighted according to its length. Now the bag of words and sentences representing document A is normalized by 2|A|, so that:

$d_{A,i} = \begin{cases} \mathrm{count}(i) / (2|A|) & \text{if } i \text{ is a word} \\ |i| / (2|A|) & \text{if } i \text{ is a sentence} \end{cases} \quad (5)$

so that the word weights and the sentence weights each sum to 1/2, and the total weight of the bag is 1.

As in WMS and SMS, the same linear program as in Eq. 2 is solved, this time calculating the cumulative distance of moving both a document's words and sentences to match another document. The vocabulary is the set of sentences and words in the documents (see Figure 2). The sentence embeddings are treated the same as word embeddings in the optimization; the only difference is their length-based weights.

This means a sentence embedding can be mapped to a word embedding (e.g., "They have fun." maps to "play" in Figure 1 ) or vice versa. It also means that a sentence's words do not have to move to the same word or sentence embedding(s) that their sentence moves to (as seen in Figure 1 ); a sentence in document A could be transported to an embedding in document B and have none of its words moved to the same embedding. More constraints could be introduced to further control the flow between documents, which we leave to future work.
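Continuing the same illustrative sketch (and reusing the word_bag and sentence_bag helpers from the previous snippets), the S+WMS bag simply concatenates the word bag and the sentence bag and halves each side's weights, which reproduces the 2|A| normalization of Eq. 5:

import numpy as np


def s_wms_bag(sentences, emb):
    """Combined bag for S+WMS: word and sentence embeddings, normalized by 2|A|."""
    word_vecs, word_w = word_bag([t for s in sentences for t in s], emb)
    sent_vecs, sent_w = sentence_bag(sentences, emb)
    vectors = np.concatenate([word_vecs, sent_vecs])
    # Words carry half of the total mass, sentences the other half.
    weights = np.concatenate([word_w / 2.0, sent_w / 2.0])
    return vectors, weights


# s_wms = mover_similarity(*s_wms_bag(doc_a, emb), *s_wms_bag(doc_b, emb))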

4 Intrinsic Evaluation

To test the performance of the SMS and S+WMS metrics, we first examine their usefulness as evaluation metrics. (In §5, we evaluate their performance as cost functions for an extrinsic task, abstractive summarization.)

We measure the correlations between the scores assigned to texts by various automatic metrics (ROUGE-L, WMS, SMS, S+WMS) and the scores assigned by human judges. We are interested in multi-sentence texts, both machine- and human-generated. Therefore, we consider subsets of two corpora that have been judged by humans: a collection of automatically generated summaries of articles in the CNN/Daily Mail news dataset (alongside reference summaries; see Section 4.1; Chaganty et al., 2018; Hermann et al., 2015; Nallapati et al., 2016) and student essays from the Hewlett Foundation's Automated Student Assessment Prize (Section 4.2). 7 Statistics describing the datasets are in A.1.

Because the word and sentence mover's similarity metrics are based on pretrained representations, we explore the effect of varying the word embedding method. We present results for two different types of word embeddings: GloVe embeddings (Pennington et al., 2014) and ELMo embeddings 8 . We obtain GloVe embeddings, which are type-based, 300-dimensional embeddings trained on Common Crawl, 9 using spaCy, 10 while the ELMo embeddings are character-based, 1,024-dimensional, contextual embeddings trained on the 1B Word Benchmark (Chelba et al., 2013) . We use ELMo to embed each sentence, which produces three vectors for each word, one from each layer of the model. We average the vectors to get a single embedding for each word in the sentence.
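For reference, one way to obtain such embeddings is sketched below; the spaCy model name follows footnote 10 and the ElmoEmbedder interface is the allennlp 0.x API, so treat both as assumptions about the tooling rather than a prescription from the paper.

import spacy
from allennlp.commands.elmo import ElmoEmbedder  # allennlp 0.x interface

nlp = spacy.load("en_core_web_md")   # ships 300-dimensional GloVe vectors (footnote 10)
elmo = ElmoEmbedder()                # 1,024-dimensional contextual ELMo embeddings


def glove_vectors(tokens):
    """Type-based vectors: one 300-d GloVe vector per token, via spaCy's vocab."""
    return [nlp.vocab[t].vector for t in tokens]


def elmo_vectors(tokens):
    """Contextual vectors: average ELMo's three layers for each token."""
    layers = elmo.embed_sentence(tokens)   # numpy array of shape (3, len(tokens), 1024)
    return list(layers.mean(axis=0))       # one averaged vector per token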

All correlations are Spearman correlations (Elliott and Keller, 2014; Kilickaya et al., 2017) , and significance in the improvement between two metrics' correlations with human judgment is calculated using the Williams (1959) significance test. 11
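Concretely, the correlation step amounts to a Spearman correlation between each metric's scores and the averaged human judgments, for example via SciPy (the function and argument names here are illustrative); the Williams test for comparing two dependent correlations is available in the repository linked in footnote 11.

from scipy.stats import spearmanr


def metric_human_correlation(metric_scores, human_scores):
    """Spearman correlation between one metric's scores and human judgments.

    Both arguments are parallel sequences with one entry per evaluated text.
    """
    rho, p_value = spearmanr(metric_scores, human_scores)
    return rho, p_value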

4.1 Summaries Dataset Evaluation

To understand how the sentence mover's similarity metrics evaluate automatically generated text, we use the subset of the CNN/Daily Mail dataset for which Chaganty et al. (2018) collected human annotations. Annotators evaluated summaries (generated with four different neural models) on a scale from -1 to 1. We consider the subset of summaries scored by two or more judges, taking the average to be the summary's score. The automatic evaluation metrics score each generated summary's similarity to the human-authored reference summary from the CNN/Daily Mail dataset.

Table 2 shows each metric's correlation with the human judgments. SMS correlates best with human judgments, and both sentence-based metrics outperform ROUGE-L and WMS. We find that the difference between GloVe and ELMo's scores is not significant. 12

Discussion. Two examples of generated summaries and their scores are shown in Table 3. Because the scores cannot be directly compared between metrics, we distinguish scores that are in the top quartile for their metric (i.e., the highest rated) and in the bottom quartile (i.e., the lowest rated).

Table 2: Spearman correlation of metrics with human evaluations. Asterisks indicate significant improvement over ROUGE-L, with (*) for p < 0.05 and (**) for p < 0.01.
Table 3: Two examples from the Summaries dataset along with the scores they received (using GloVe) comparing reference (human summary) to hypothesis (model generated summary). Scores that are in the top quartile for a given metric are in green and bold. Scores in the bottom quartile are in red and italics. Human scores range from –1 to 1. Please see A.2 for details.

The first example in Table 3 is highly rated by metrics using word and sentence embeddings, but judged to be a poor summary by ROUGE-L because information is reworded and reordered from the reference. For example, the phrase "asked for medical help" is worded as "sought medical attention" in the hypothesis summary. Nevertheless, exact word matching can be important for ensuring factual correctness. While the generated hypothesis summary states "six officers have been suspended with pay", the reference states they were actually "suspended without pay."

The second example, which was generated with a seq2seq model, was one of the best summaries according to ROUGE-L but one of the worst according to SMS and S+WMS. It also received low human judgments, most likely due to its nonsensical repetitions. While the short, repeated phrases like "three different flavours" match the reference summary well enough to score well with ROUGE-L, the overall sentence representations are distant from those in the reference summary, resulting in low SMS and S+WMS scores.

4.2 Essays Dataset Evaluation

To test the metrics on human-authored text, we use a dataset of graded student essays that consists of responses to standardized test questions for tenth graders. We use a subset of Question #3 from the exam, which asks the test-taker to synthesize information from a reading passage; student responses contain 5-15 sentences. Graders assigned the student-authored responses scores ranging from 0 to 3. For the reference essay, we use a top-scoring sample essay, which the graders had access to as a reference while assigning scores. The full reference essay is in A.2.

Table 2 shows the correlation of each metric with the evaluators' scores. As in the summarization task, SMS outperforms both ROUGE-L and WMS. However, in this case, combining word and sentence representations gives the best result, with S+WMS correlating best with human scores, significantly better than ROUGE-L. This is consistent across embedding type; once again, the choice of embedding does not create a significant difference between the sentence mover's metrics. 13

Discussion. Aside from the length of the text, the Essays dataset presents the metrics with several challenges not found in the Summaries dataset. For example, the dataset contains a large number of spelling mistakes, due to both author misspellings and errors in the transcription process. One essay begins, "The setting of the story had effected the cycle's becuse if it was sub earbs he could have stoped any where and got water ..."

The tone and style of the essay can also vary from the reference essay. (For example, the author of Sample #3 in A.2 ends their essay by reflecting on how they would respond in the protagonist's place.) Embedding-based metrics may be more forgiving to deviations in writing style from the reference essay, such as the use of first person.

While Table 2 indicates sentence mover's similarity metrics significantly improve correlation with human judgments over standard methods, there is still enough disagreement that we believe automatic metrics should not replace human evaluations. Rather, they should complement human evaluations as an automatic proxy that can be used when training and developing text generation systems.

5 Extrinsic Evaluation

In addition to automatically evaluating text, we can also use sentence mover's metrics as rewards while learning text generation models. To demonstrate this, we train an encoder-decoder model on the CNN/Daily Mail dataset to generate summaries using reinforcement learning (RL). Instead of maximizing likelihood, policy gradient RL methods can directly optimize discrete target evaluation metrics that are non-differentiable, such as ROUGE (Paulus et al., 2018; Jaques et al., 2017; Pasunuru and Bansal, 2017; Wu et al., 2016; Celikyilmaz et al., 2018; Edunov et al., 2018) . Here, we learn policies to maximize WMS/SMS/S+WMS metrics, guiding the model to learn semantic similarities, while policies trained using ROUGE rely only on word n-gram matches between generated and ground-truth text.

Model. We encode the input document using 2-layered bidirectional LSTM networks and a 2-layered LSTM network for the decoder. We use the attention mechanism (Bahdanau et al., 2015; See et al., 2017) to force the decoder model to learn to focus (i.e., attend) on specific parts of the input sequence when decoding, instead of relying only on the hidden vector of the decoder's LSTM. We also include pointer networks (See et al., 2017; Cheng and Lapata, 2016), which point to elements of the input sequence at each decoding step. To train our policy-based generator, we use a mixed training objective that jointly optimizes multiple losses, which we describe below.

MLE. Our baseline model uses maximum likelihood training for sequence generation. Given $y^* = \{y^*_1, y^*_2, \ldots, y^*_T\}$ as the ground-truth summary for a given input document d, we compute the loss as:

$\mathcal{L}_{\mathrm{MLE}} = -\sum_{t=1}^{T} \log p(y^*_t \mid y^*_1, \ldots, y^*_{t-1}, d) \quad (6)$

by taking the negative log-likelihood of the target word sequence.

Reinforcement Learning (RL) Loss. The decoder generates the summary sequence $\hat{y}$, which is then compared against the ground-truth sequence $y^*$ to compute the reward $r(\hat{y})$. Our model learns using a self-critical training approach (Rennie et al., 2016), by exploring new sequences and comparing them against the best greedily decoded sequence. For each training example d, we generate two output sequences: $\hat{y}$, which is sampled from the probability distribution at each time step, $p(\hat{y}_t \mid \hat{y}_1 \ldots \hat{y}_{t-1}, d)$, and $\tilde{y}$, the baseline output, which is greedily generated by argmax decoding from $p(\tilde{y}_t \mid \tilde{y}_1 \ldots \tilde{y}_{t-1}, d)$. The RL training objective is then to minimize:

$\mathcal{L}_{\mathrm{RL}} = -\big(r(\hat{y}) - r(\tilde{y})\big) \sum_{t} \log p(\hat{y}_t \mid \hat{y}_1, \ldots, \hat{y}_{t-1}, d) \quad (7)$

It ensures that, with better exploration, the model learns to generate sequences $\hat{y}$ that receive higher rewards than the baseline $\tilde{y}$, increasing the overall reward expectation of the model.

Mixed Loss. While training with only MLE loss will learn a better language model, it may not guarantee better results on discrete performance measures such as WMS and SMS. Similarly, optimizing with only RL loss using SMS as a reward may increase the reward gathered at the expense of diminished readability and fluency of the generated summary. A combination of the two objectives can yield improved task-specific scores while maintaining a good language model:

$\mathcal{L}_{\mathrm{MIXED}} = \gamma \mathcal{L}_{\mathrm{RL}} + (1 - \gamma) \mathcal{L}_{\mathrm{MLE}} \quad (8)$

where γ is a hyperparameter balancing the two objective functions. We pre-train models with MLE loss, and then continue with the mixed loss. We train four different models on the CNN/Daily Mail dataset using mixed loss (MLE+RL) with ROUGE-L, WMS, SMS, and S+WMS as the reward functions. Training details are in A.3 and A.4.
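As a sketch of how the reward enters training, the snippet below assumes a PyTorch decoder that exposes per-token log-probabilities for the sampled summary and the ground-truth summary, along with rewards computed (e.g., with SMS) against the reference; the function and argument names are illustrative, not the released code.

import torch


def mixed_loss(
    sample_logprobs: torch.Tensor,  # log p(ŷ_t | ŷ_<t, d) for the sampled summary ŷ
    mle_logprobs: torch.Tensor,     # log p(y*_t | y*_<t, d) for the ground-truth summary y*
    sample_reward: float,           # e.g., SMS(ŷ, y*)
    greedy_reward: float,           # e.g., SMS(ỹ, y*) for the greedily decoded baseline ỹ
    gamma: float = 0.97,
) -> torch.Tensor:
    """Mixed MLE + self-critical RL objective (Eqs. 6-8)."""
    # Eq. 6: negative log-likelihood of the reference summary.
    loss_mle = -mle_logprobs.sum()
    # Eq. 7: self-critical loss; the advantage r(ŷ) − r(ỹ) is treated as a constant.
    advantage = sample_reward - greedy_reward
    loss_rl = -advantage * sample_logprobs.sum()
    # Eq. 8: interpolate the two objectives.
    return gamma * loss_rl + (1.0 - gamma) * loss_mle

Because the advantage is constant with respect to the parameters, backpropagating through this loss yields the self-critical gradient of Eq. 11.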

5.1 Generated Summary Evaluation

We evaluate the generated summaries from each model with ROUGE-L, WMS, SMS, and S+WMS in Table 4. While we include previously reported numbers, we re-trained the mixed loss models using ROUGE-L and use those as our baseline, since previously reported models may be more heavily optimized and use more complex networks than ours. For fair comparison, we kept the encoder-decoder network type, structure, hyperparameters, and initialization the same for each model, changing only the reward. We pre-trained an MLE model ("MLE+Pgen (no reward) (re-trained baseline)" in Table 4) and used it to initialize the mixed loss models with different reward functions.

Table 4: Evaluation on summarization task when various metrics are used as rewards during learning. Columns show average score of each model’s generated summaries according to various metrics. Previously reported results (upper block): [1] MLE training with pointer networks (Pgen) (See et al., 2017) ; [2] Mixed MLE and RL training with Pgen (Celikyilmaz et al., 2018), [3] Mixed MLE and RL training with Pgen and intra-decoder attention (Paulus et al., 2018). The lower block reports re-trained baselines and our models with new metrics. Bold indicates best among the lower block.

Across all metrics, the models trained using WMS and SMS metrics as the reward outperform models trained with ROUGE-L as the reward function. S+WMS models lag behind ROUGE-L. The SMS model outperforms all other models across all metrics on the abstractive summarization task, consistent with SMS's performance at evaluating summaries in §4.1. Table 5 shows summaries generated from each of the mixed loss models.

Table 5: Summaries generated from the mixed MLE+RL loss models with ROUGE-L, WMS, S+WMS, and SMS metrics as rewards, along with the corresponding human-authored reference summary.

5.2 Human Evaluation

We collected human evaluations for 100 summaries generated by the mixed loss models to compare ROUGE-L as a reward to WMS, SMS, and S+WMS. Amazon Mechanical Turkers chose between two generated summaries, one from the ROUGE-L model and one from the WMS, SMS, or S+WMS model.

Table 6: Human evaluations on a random subset of 100 summaries. The frequencies from the head-to-head comparison of models trained with ROUGE-L against WMS/SMS/S+WMS are shown. Each summary is evaluated by 3 judges (300 judgments per criterion). '=' indicates no difference. All improvements are statistically significant at p < 0.001.

They selected one of the two summaries based on: (1) non-redundancy, fewer repeated ideas; (2) coherence, clearly expressed ideas; (3) focus, ideas free of superfluous details; and (4) overall, whether the summary effectively communicates the article's content. These criteria help evaluate the impact of the metrics used as rewards. (Task details are in A.5.)

Results. We asked human judges to evaluate the output of the mixed loss model trained with a ROUGE-L reward versus models trained with WMS, SMS, and S+WMS as the reward. The results are shown in Table 6.

Human judges significantly prefer summaries produced by models optimized with WMS, SMS, and S+WMS over ROUGE-L. SMS and S+WMS were preferred over ROUGE-L more often than WMS was. There is no significant difference between the evaluations of SMS and S+WMS. Among all the metrics, SMS was rated the highest on the non-redundancy question (69% improvement over the ROUGE-L score), indicating that the model learns to generate summaries that contain less repetition between sentences.

While the SMS model's output was highly scored by both the automatic and human evaluations, removing word-level scoring does come with a cost, as seen in the example in Table 5. The SMS summary contains a mistake, stating that "priscilla will tie the knot" instead of "serve as a witness". This issue may be mitigated by a better encoder for the summarization task and better sentence and word representations. As future work, we will investigate summarization models with more complex sentence embeddings and encoder structures (e.g., self-attention models).

6 Related Work

Evaluation has been among the most discussed topics of the natural language generation (NLG) research area (Lapata and Barzilay, 2005; Barzilay and Lapata, 2008; Reiter and Belz, 2009; Reiter, 2011; Novikova et al., 2017). There are three main ways to evaluate NLG methods: (1) automatic metrics to compare NLG texts against reference texts, (2) task-based (extrinsic) evaluation to measure the impact of an NLG system on a downstream task, and (3) human evaluations, which ask people to rate generated texts. In this work we introduce new automatic evaluation metrics for long text generation and evaluation.

Automatic evaluation metrics compare generated text against reference texts using word overlap metrics such as: BLEU (Papineni et al., 2002); ROUGE (Lin, 2004); NIST (Doddington, 2002), a version of BLEU; METEOR (Lavie and Agarwal, 2007), unigram precision and recall; CIDER (Vedantam et al., 2015), the average n-gram cosine similarity; cosine similarity between the average word embedding; and WMD, which calculates the word embedding-based "travel cost". Though all have strengths and weaknesses, ROUGE metrics (particularly ROUGE-L) are common for multi-sentence text evaluations. Textual metrics that consider specific qualities in the system outputs, like complexity and diversity, are also used to evaluate NLG systems (Dusek et al., 2019; Hashimoto et al., 2019; Sagarkar et al., 2018; Purdy et al., 2018).

Word mover's distance has recently been used for NLP tasks like learning word embeddings (Zhang et al., 2017; Wu et al., 2018), textual entailment (Sulea, 2017), document similarity and classification (Kusner et al., 2015; Huang et al., 2016; Atasu et al., 2017), image captioning (Kilickaya et al., 2017), document retrieval (Balikas et al., 2018), clustering for semantic word-rank (Zhang and Wang, 2018), and as an additional loss for text generation that measures the optimal transport between the generated hypothesis and reference text (Chen et al., 2019). We investigate WMD for multi-sentence text evaluation and generation and introduce sentence embedding-based metrics.

7 Conclusion

We present SMS and S+WMS, sentence mover's similarity metrics for automatically evaluating multi-sentence texts. We find including sentence embeddings in automatic metrics significantly improves scores' correlation with human judgments, both on automatically generated and human-authored texts. The metrics' gain over ROUGE-L is consistent across word embedding types; there is no significant difference between type-based and contextual embeddings. Moreover, we find these metrics can be used to generate text; summaries generated with SMS as a reward are of better quality than ones generated with ROUGE-L, according to both automatic and human evaluations.

A Appendix

A.1 Datasets

Summaries and Essays: For the intrinsic tasks in §4, we use two types of human-evaluated texts: machine-generated summaries and human-authored essays. We follow Kusner et al. (2015) and remove punctuation and stopwords. (For contextual embeddings, these are removed after the embeddings are obtained.) The details of the subsets we used are in Table 7.

CNN/Daily Mail: The CNN/Daily Mail dataset (Nallapati et al., 2017; Hermann et al., 2015) is a collection of online news articles along with multi-sentence summaries. We use the same data splits as in Nallapati et al. (2017). Earlier work anonymized entities by replacing each named entity with a unique identifier (e.g., Dominican Republic → entity15). In this work we used the non-anonymized version.

Table 7: Corpora statistics.
Table 8: Summary statistics of CNN/Daily Mail (CNN/DM) Datasets.

A.2 More Examples

In Table 9 , we show samples of the summaries that we used to perform intrinsic evaluations in the main text.

Table 9: Examples of human-generated and model-generated summaries from the Summaries and Essays datasets.

A.3 Extrinsic Model Training Details

We use 128-dimensional bidirectional 2-layered LSTMs for the encoder and 128-dimensional unidirectional LSTMs for the decoder. For both datasets, we limit the input and output vocabulary size to the 30,000 most frequent tokens in the training set.

We initialize word embeddings with FastText 14 (Mikolov et al., 2018) 300-dimensional vectors and finetune them during training. For the WMS, SMS, and S+WMS rewards, we use the GloVe word embeddings described in §4. We train using Adam with a learning rate of 0.001 for the MLE models and $10^{-5}$ for the MLE+RL models. We select the MLE models with the lowest cross-entropy loss and the MLE+RL models with the highest reward on a sample of validation data to evaluate on the test set. At test time, we use beam search of width 5 on all our models to generate final predictions. For the mixed MLE+RL models, we initialize the weights with the pre-trained MLE model, and we start with γ = 0.97 and gradually increase its value. We train our models for ∼25 epochs, which took 1-2 days on an NVIDIA V100 GPU machine.

A.4 Policy Gradient Reinforce Training

Maximum likelihood-based training of sequence generation models poses exposure bias issues, since during training the model is compared to the empirical distribution, whereas at test time we use automatic metrics to evaluate the model-generated text (Ranzato et al., 2015). A REINFORCE-based policy gradient approach addresses this issue by learning to optimize discrete target evaluation metrics that are non-differentiable. We use REINFORCE (Williams, 1992) to learn a policy $p_\theta$ defined by the model parameters $\theta$ to predict the next action (word). The RL loss function is defined as:

$\mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{\hat{y} \sim p_\theta}\big[r(\hat{y})\big] \quad (9)$

where $\hat{y}$ is the sequence of sampled words. The derivative of the objective function based on Monte Carlo sampling yields:

$\nabla_\theta \mathcal{L}_{\mathrm{RL}} = -\big(r(\hat{y}) - b\big)\, \nabla_\theta \log p_\theta(\hat{y}) \quad (10)$

The baseline b is used for variance reduction in RL training. In this work we use self-critical training and take the reward obtained from a sequence generated by greedy decoding, $\tilde{y}$, as the baseline:

$\nabla_\theta \mathcal{L}_{\mathrm{RL}} = -\big(r(\hat{y}) - r(\tilde{y})\big)\, \nabla_\theta \log p_\theta(\hat{y}) \quad (11)$

A.5 Human Evaluations

Evaluation Procedure. We randomly selected 100 samples from the CNN/Daily Mail test set

Footnotes

1. For readability, we scale ROUGE scores by a factor of 100 and sentence mover's metrics by a factor of 1000.
2. The similarity can be defined as cosine, Jaccard, Euclidean, etc.
3. Our evaluation scores depend on pretrained word embeddings, which can be type-based or contextual. Our experiments consider both; see §4 and §5. When using contextual embeddings, we treat each token as its own type, as each word will have a different embedding depending on its context.
4. https://github.com/src-d/wmd-relax
5. https://github.com/eaclark07/sms
6. Preliminary results showed count-based sentence weightings performed better than uniform weightings. Other weighting options, such as frequency-based weighting as done in BERTScore, are a direction for extending this work.
7. https://www.kaggle.com/c/asap-sas
8. https://allennlp.org/elmo
9. http://commoncrawl.org/the-data/
10. https://spacy.io/models/en#en_core_web_md
11. https://github.com/ygraham/nlp-williams
12. Williams test: p = 0.35 (SMS) and p = 0.16 (S+WMS).
13. Williams test: p = 0.33 (SMS) and p = 0.46 (S+WMS).
14. https://fasttext.cc/docs/en/english-vectors.html

Figure 3: Spearman correlation with each metric and human evaluations using GloVe and ELMo embeddings on the Summaries and Essays datasets. (Best viewed in color.)