Evaluating Text GANs as Language Models


Abstract

Generative Adversarial Networks (GANs) are a promising approach for text generation that, unlike traditional language models (LM), does not suffer from the problem of "exposure bias". However, a major hurdle in understanding the potential of GANs for text generation is the lack of a clear evaluation metric. In this work, we propose to approximate the distribution of text generated by a GAN, which permits evaluating GANs with traditional probability-based LM metrics. We apply our approximation procedure to several GAN-based models and show that they currently perform substantially worse than state-of-the-art LMs. Our evaluation procedure promotes better understanding of the relation between GANs and LMs, and can accelerate progress in GAN-based text generation.

1 Introduction

Neural networks have revolutionized the field of text generation, including machine translation (Sutskever et al., 2014; Neubig, 2017; Luong et al., 2015; Chen et al., 2018), summarization (See et al., 2017), image captioning (You et al., 2016), and many other applications (Goldberg, 2017).

Traditionally, text generation models are trained by going over a gold sequence of symbols (characters or words) from left-to-right, and maximizing the probability of the next symbol given the history, namely, a language modeling (LM) objective. A commonly discussed drawback of such LM-based text generation is exposure bias (Ranzato et al., 2015) : during training, the model predicts the next token conditioned on the ground truth history, while at test time prediction is based on predicted tokens, causing a train-test mismatch. Models trained in this manner often struggle to overcome previous prediction errors.

Generative Adversarial Networks (Goodfellow et al., 2014) offer a solution for exposure bias. Originally introduced for images, GANs leverage a discriminator, which is trained to discriminate between real images and generated images via an adversarial loss. In such a framework, the generator is not directly exposed to the ground truth data, but instead learns to imitate it using global feedback from the discriminator. This has led to several attempts to use GANs for text generation, with a generator using either a recurrent neural network (RNN) (Yu et al., 2017; Guo et al., 2017; Press et al., 2017; Rajeswar et al., 2017) or a convolutional neural network (CNN) (Gulrajani et al., 2017; Rajeswar et al., 2017).

However, evaluating GANs is more difficult than evaluating LMs. While in language modeling, evaluation is based on the log-probability of a model on held-out text, this cannot be straightforwardly extended to GAN-based text generation, because the generator outputs discrete tokens, rather than a probability distribution. Currently, there is no single evaluation metric for GAN-based text generation, and existing metrics that are based on n-gram overlap are known to lack robustness and have low correlation with semantic coherence (Semeniuta et al., 2018) .

In this paper, we propose a method for evaluating GANs with standard probability-based evaluation metrics. We show that the expected prediction of a GAN generator can be viewed as a LM, and suggest a simple Monte-Carlo method for approximating it. The approximated probability distribution can then be evaluated with standard LM metrics such as perplexity or Bits Per Character (BPC).

To empirically establish our claim, we implement our evaluation on several RNN-based GANs (Press et al., 2017; Yu et al., 2017; Guo et al., 2017). We find that all models achieve substantially worse (higher) BPC than state-of-the-art LMs. By directly comparing to LMs, we put in perspective the current performance of RNN-based GANs for text generation. Our results are also in line with recent concurrent work by Caccia et al. (2018), who reached a similar conclusion by comparing the performance of textual GANs to that of LMs using metrics suggested for GAN evaluation.

Our code is available at: http://github.com/GuyTevet/SeqGAN-eval and http://github.com/GuyTevet/rnn-gan-eval.

2 Background

Following the success of GANs in image generation, several works applied the same idea to text using convolutional neural networks (Gulrajani et al., 2017; Rajeswar et al., 2017), and later using RNNs (Press et al., 2017; Yu et al., 2017). RNNs enable generating variable-length sequences, conditioning each token on the tokens generated in previous time steps. We leverage this characteristic in our approximation model (§4.1).

A main challenge in applying GANs to text is that generating discrete symbols is a non-differentiable operation. One solution is to perform a continuous relaxation of the GAN output, which leads to generators that emit a nearly discrete continuous distribution (Press et al., 2017). This keeps the model differentiable and enables end-to-end training through the discriminator. Alternatively, SeqGAN (Yu et al., 2017) and LeakGAN (Guo et al., 2017) used policy gradient methods to overcome the differentiability requirement. We apply our approximation to both model types.

3 Evaluating GANs and LMs

LM Evaluation. Text generation from LMs is commonly evaluated using probabilistic metrics. Specifically, given a test sequence of symbols $(t_1, \ldots, t_n)$ and a LM $q$, the average cross-entropy (ACE) over the entire test set is computed:

$$\mathrm{ACE} = -\frac{1}{n}\sum_{i=1}^{n} \log_2 q(t_i \mid t_1 \ldots t_{i-1}).$$

For word-based models, the standard metric is perplexity, $PP = 2^{\mathrm{ACE}}$, while for character-based models it is directly $\mathrm{BPC} = \mathrm{ACE}$.

Intrinsic improvement in perplexity does not guarantee an improvement in an extrinsic downstream task that uses a language model. However, perplexity often correlates with extrinsic measures (Jurafsky and Martin, 2018) , and is the de-facto metric for evaluating the quality of language models today.
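To make these metrics concrete, here is a minimal sketch (ours, not from the paper's released code) that computes ACE, perplexity, and BPC from per-token probabilities; `token_probs` is a hypothetical array holding $q(t_i \mid t_1 \ldots t_{i-1})$ for each test symbol.

```python
import numpy as np

def lm_metrics(token_probs):
    """Compute average cross-entropy, perplexity, and BPC.

    token_probs: array of q(t_i | t_1..t_{i-1}) for every symbol
    in the test sequence, as assigned by the language model.
    """
    ace = -np.mean(np.log2(token_probs))  # average cross-entropy (bits)
    perplexity = 2 ** ace                 # word-level metric: PP = 2^ACE
    bpc = ace                             # character-level metric: BPC = ACE
    return ace, perplexity, bpc

# Example: a uniform model over 27 characters yields BPC = log2(27) ~ 4.75,
# matching the "uniform distribution" baseline in Table 1.
print(lm_metrics(np.full(1000, 1 / 27)))
```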

GAN-based Text Generation Evaluation. By definition, a text GAN outputs a discrete sequence of symbols rather than a probability distribution. As a result, LM metrics cannot be applied to evaluate the generated text. Consequently, other metrics have been proposed:

• N-gram overlap (Yu et al., 2017; Press et al., 2017): Inspired by BLEU (Papineni et al., 2002), this measures whether n-grams generated by the model appear in a held-out corpus. A major drawback is that this metric favors conservative models that always generate very common text (e.g., "it is"). To mitigate this, self-BLEU has been proposed (Lu et al., 2018) as an additional metric, where overlap is measured between two independently sampled texts from the model (see the sketch after this list).

• LM score: The probability of generated text according to a pre-trained LM. This has the same problem of favoring conservative models.

• Zhao et al. (2017) suggested an indirect score obtained by training a LM on GAN-generated text and evaluating it using perplexity. The drawback of this setting is that it couples the performance of the GAN with that of the proxy LM.

• Heusel et al. (2017) proposed the Fréchet Inception Distance (FID) for image GANs; Semeniuta et al. (2018) adapted it to text.

In parallel to this work, Caccia et al. (2018) proposed a temperature sweep method that trades off quality for diversity using a single parameter. Similar to our findings, they concluded that GANs perform worse than LMs on this metric.
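As an illustration of the self-BLEU idea (a sketch under our own assumptions, not code from any of the cited papers), each sample is scored with BLEU against the remaining samples; `samples` is a hypothetical list of tokenized sequences drawn independently from the model.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(samples, n=4):
    """Average BLEU of each sample against all other samples.

    High self-BLEU indicates low diversity: the model keeps
    generating similar (conservative) text.
    """
    smooth = SmoothingFunction().method1
    weights = tuple(1.0 / n for _ in range(n))
    scores = []
    for i, hypothesis in enumerate(samples):
        references = samples[:i] + samples[i + 1:]  # all other samples
        scores.append(sentence_bleu(references, hypothesis,
                                    weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

# Usage sketch: samples = [generator.sample().split() for _ in range(10)]
# where generator.sample() is a hypothetical sampling routine.
```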

Figure 1: Generator recurrent connections. $\{h_t\}$ is the internal state sequence and $\{o_t\}$ is the generator prediction sequence (one-hot). During inference, the outputs $\{o_t\}$ are fed back as the input for the next time step (dashed lines). During LM approximation, the input $\{x_t\}$ is a sequence of one-hot vectors from the test set.

Overall, current evaluation methods cannot fully capture the performance of GAN-based text generation models. While reporting various scores as proposed by Semeniuta et al. (2018) is possible, it is preferable to have a single measure of progress when comparing different text generation models.

4 Proposed Method

We propose a method for approximating a distribution over tokens from a GAN, and then evaluate the model with standard LM metrics. We describe our approach for an RNN-based generator, which is the most commonly-used architecture, but the approximation can be applied to other auto-regressive models (Vaswani et al., 2017).

4.1 Language Model Approximation

The inputs to an RNN at time step $t$ are the state vector $h_t$ and the current input token $x_t$. The output token (one-hot) is denoted by $o_t$. In RNN-based GANs, the previous output token is used at inference time as the input $x_t$ (Yu et al., 2017; Guo et al., 2017; Press et al., 2017; Rajeswar et al., 2017). In contrast, when evaluating with BPC or perplexity, the gold token $x_t$ is given as input. Hence, LM-based evaluation neutralizes the problem of exposure bias addressed by GANs. Nevertheless, it allows us to compare the quality of text produced by GANs and LMs on an equal footing. Figure 1 illustrates the difference between inference and LM approximation.

We can therefore define the generator output at time step $t$ as a function of the initial state $h_0$ and the past tokens $(x_0 \ldots x_t)$, which we denote as $o_t = G_t(h_0, x_0 \ldots x_t)$ ($x_0$ is a start token). Given a past sequence $(x_0 \ldots x_t)$, stochasticity can be gained either by using a noise vector as the initial state $h_0$ (Press et al., 2017), or by sampling from the GAN's internal distribution over possible output tokens (Yu et al., 2017; Guo et al., 2017). Since $h_0$ is constant or a noise vector that makes $G_t$ stochastic, we can omit it to get $G_t(x_0 \ldots x_t)$. In such a setup, the expected value

$$\mathbb{E}[G_t(x_0 \ldots x_t)]$$

is a distribution $q$ over the next vocabulary token $a_t$:

$$q(a_t \mid a_0 \ldots a_{t-1}) = \{\mathbb{E}[G_t(x_0 \ldots x_t)]\}_{a_t}$$

To empirically approximate $q$, we draw $N$ i.i.d. samples from it and compute the approximation $\tilde{G}_{t,N} = \frac{1}{N}\sum_{n=1}^{N} g_{t,n}$, where $g_{t,n}$ is a single sample of $G_t(x_0 \ldots x_t)$. Then, according to the strong law of large numbers:

$$\tilde{G}_{t,N} \xrightarrow[N \to \infty]{\text{a.s.}} \mathbb{E}[G_t(x_0 \ldots x_t)] \qquad (1)$$

Given this approximate LM distribution, we can evaluate a GAN using perplexity or BPC. We summarize the evaluation procedure in Algorithm 1.
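A minimal sketch of this procedure (our reading of Algorithm 1, not the released implementation) follows; `sample_token(prefix)` is a hypothetical callback that draws one next-token index from the generator conditioned on the gold prefix, and the $10^{-12}$ probability floor is our addition to avoid $\log(0)$ when the gold token was never sampled.

```python
import numpy as np

def approximate_lm_bpc(test_seq, vocab_size, sample_token, N=2000):
    """Monte-Carlo approximation of a GAN generator as a LM.

    test_seq: gold token indices (t_1 .. t_n) from the test set.
    sample_token(prefix): draws one next-token sample from the
        generator conditioned on the gold prefix (teacher forcing).
    Returns the approximate Bits Per Character.
    """
    total_bits = 0.0
    for t in range(1, len(test_seq)):
        prefix = test_seq[:t]
        counts = np.zeros(vocab_size)
        for _ in range(N):                   # N i.i.d. samples of G_t
            counts[sample_token(prefix)] += 1
        q = counts / N                       # empirical distribution G~_{t,N}
        p_gold = max(q[test_seq[t]], 1e-12)  # floor to avoid log(0)
        total_bits += -np.log2(p_gold)
    return total_bits / (len(test_seq) - 1)
```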

4.2 Approximation Bound

We provide a theoretical bound for choosing a number of samples $N$ that results in a good approximation of $\tilde{G}_{t,N}$ to $\mathbb{E}[G_t]$.

Perplexity and BPC rely on the log-probability of the ground truth token. Since the ground truth token is unknown, we conservatively define the bad event $B$ in which there exists $v \in V$ such that $|\{\mathbb{E}[G_t]\}_v - \{\tilde{G}_{t,N}\}_v| > \gamma$, where $V$ is the vocabulary. We can then bound the probability of $B$ by some $\epsilon$. We define the following notation:

1. The probability of a token $a_t$ being $v$ is $p_v \triangleq q(a_t = v \mid a_0 \ldots a_{t-1}) = \{\mathbb{E}[G_t(x_0 \ldots x_t)]\}_v$.

2. $\chi_{v,n} \triangleq \{g_{t,n}\}_v$ is a random variable representing the binary value of the $v$'th index of $g_{t,n}$, which is a single sample of $G_t$. Note that the average of $\chi_{v,n}$ over $N$ samples is $X_v \triangleq \frac{1}{N}\sum_{n=1}^{N} \chi_{v,n} = \frac{1}{N}\sum_{n=1}^{N} \{g_{t,n}\}_v = \{\tilde{G}_{t,N}\}_v$.

Using the above notation, we can re-define the probability of the bad event B with respect to the individual coordinates in the vectors:

$$Pr(B) = Pr\left(\left\|\mathbb{E}[G_t] - \tilde{G}_{t,N}\right\|_\infty > \gamma\right) = Pr\left(\bigcup_{v \in V} |p_v - X_v| > \gamma\right) < \epsilon \qquad (2)$$

We note that $\chi_{v,n} \sim \mathrm{Bernoulli}(p_v)$, and given that $\{\chi_{v,n}\}_{n=1}^{N}$ are i.i.d., we can apply the Chernoff-Hoeffding theorem (Chernoff, 1952; Hoeffding, 1963). According to the theorem, for every $v \in V$, $Pr(|X_v - p_v| > \gamma) < 2e^{-2N\gamma^2}$.

Taking the union bound over V implies:

$$Pr(B) = Pr\left(\bigcup_{v \in V} |X_v - p_v| > \gamma\right) < 2|V| e^{-2N\gamma^2} < \epsilon \qquad (3)$$

Hence, we get a lower bound on $N$:

$$N > \frac{\ln(2|V| / \epsilon)}{2\gamma^2} \qquad (4)$$

As a numerical example, choosing $\gamma = 10^{-3}$ and $\epsilon = 10^{-2}$ for a character-based LM over the text8 dataset, with $|V| = 27$, we get the bound $N > 4.3 \cdot 10^6$. With the same $\gamma$ and $\epsilon$, a typical word-based LM with vocabulary size $|V| = 50{,}000$ would require $N > 8.1 \cdot 10^6$. In practice, probability vectors of LMs tend to be sparse (Kim et al., 2016). Thus, we argue that a much smaller $N$ suffices for a good approximation $\tilde{G}_{t,N}$. Since the sparsity of LMs is difficult to bound, as it differs between models, we suggest an empirical method for choosing $N$.
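The bound in Equation 4 is easy to evaluate numerically; the snippet below (ours) reproduces the paper's two examples.

```python
import math

def sample_bound(vocab_size, gamma=1e-3, eps=1e-2):
    """Chernoff-Hoeffding bound on N: N > ln(2|V|/eps) / (2*gamma^2)."""
    return math.log(2 * vocab_size / eps) / (2 * gamma ** 2)

print(sample_bound(27))      # ~4.3e6: character-based LM over text8
print(sample_bound(50000))   # ~8.1e6: typical word-based vocabulary
```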

The approximation $\tilde{G}_{t,N}$ is a converging sequence, in particular under $\|\cdot\|_\infty$ (see Equation 1). Hence, we can empirically choose an $N$ which satisfies $\|\tilde{G}_{t,N-\alpha} - \tilde{G}_{t,N}\|_\infty < \gamma'$ for some $\alpha \in \mathbb{N}$. In Section 5 we empirically measure $\|\tilde{G}_{t,N-\alpha} - \tilde{G}_{t,N}\|_\infty$ as a function of $N$ to choose $N$. We choose a global $N$ for a model, rather than for every $t$, by averaging over a subset of the evaluation set.
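A possible implementation of this stopping criterion (our sketch; the paper chooses one global $N$ by averaging this measurement over a subset of the evaluation set, while the function below operates on a single position $t$):

```python
import numpy as np

def choose_N(sample_token, prefix, vocab_size,
             alpha=10, gamma_prime=1e-3, max_N=10**6):
    """Grow N until ||G~_{t,N-alpha} - G~_{t,N}||_inf < gamma_prime."""
    counts = np.zeros(vocab_size)
    prev_q = np.zeros(vocab_size)  # approximation at N - alpha
    N = 0
    while N < max_N:
        for _ in range(alpha):               # draw alpha more samples
            counts[sample_token(prefix)] += 1
        N += alpha
        q = counts / N                       # approximation at N
        if N > alpha and np.max(np.abs(q - prev_q)) < gamma_prime:
            return N
        prev_q = q.copy()
    return max_N
```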

5 Experiments

5.1 Models

We focus on character-based GANs as a test-case for our method. We evaluate two RNN-based GANs with different characteristics. As opposed to the original GAN model (Goodfellow et al., 2014) , in which the generator is initialized with random noise, the GANs we evaluated both leverage gold standard text to initialize the generator, as detailed below.

Recurrent GAN (Press et al., 2017) is a continuous RNN-based generator which minimizes the improved WGAN loss (Gulrajani et al., 2017). To guide the generator during training, it is initialized with the first $i-1$ characters from the ground truth, and starts predicting at the $i$-th character. Stochasticity is obtained by feeding the generator a noise vector $z$ as the hidden state. At each time step, the input to the RNN generator is the output distribution of the previous step.

SeqGAN (Yu et al., 2017) is a discrete RNN-based generator. To guide the generator, it is pre-trained as a LM on ground truth text. Stochasticity is obtained by sampling tokens from an internal distribution over the vocabulary. To overcome the differentiability problem, it is trained using a policy gradient objective (Sutton et al., 2000).

We also evaluated LeakGAN (Guo et al., 2017) , another discrete RNN-based generator, but since it is similar to SeqGAN and performed worse, we omit it for brevity.

5.2 Evaluation Settings

To compare to prior work in LM, we follow the common setup and train on the text8 dataset. The dataset is derived from Wikipedia, and includes 26 English characters plus spaces. We use the standard 90/5/5 split for train/validation/test. Finally, we measure performance with BPC.

We tuned hyper-parameters on the validation set, including the sequence length to generate at test time (7 for Press et al. (2017), 1000 for SeqGAN (Yu et al., 2017)). We chose the number of samples $N$ empirically for each model, as described in Section 4.2, setting $\alpha$ to 10 and the boundary to $\gamma' = 10^{-3}$ as a good trade-off between accuracy and run-time (see Figure 2). To be safe, we used $N = 2000$.

Figure 2: Approximation error $\|\tilde{G}_{t,N-\alpha} - \tilde{G}_{t,N}\|_\infty$ as a function of the number of samples $N$, with $\alpha = 10$ and $\gamma' = 10^{-3}$.

5.3 Results

Table 1: Test set evaluation of different character-based models on the text8 dataset. State-of-the-art results are taken from https://github.com/sebastianruder/NLP-progress/blob/master/language_modeling.md. The uniform distribution is equivalent to guessing the next character out of $|V| = 27$ characters.

Model | BPC | Approximate BPC
State-of-the-art LMs:
mLSTM + dynamic eval (Krause et al., 2017) | 1.19 | -
Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | -
Large RHN (Zilly et al., 2016) | 1.27 | -
LayerNorm HM-LSTM (Chung et al., 2016) | 1.29 | -
BN LSTM (Cooijmans et al., 2016) | 1.36 | -
Unregularised mLSTM (Krause et al., 2016) | 1.40 | -
GANs (LM approximation):
SeqGAN - pre-trained LM | 1.85 | 1.95
SeqGAN - full adversarial training | 1.99 | 2.08
Recurrent GAN without pre-training (Press et al., 2017) | - | 3.31
Baseline:
Uniform distribution | 4.75 | -

Table 2: Randomly generated samples from each model.

SeqGAN - pre-trained LM:
1. rics things where a weeks thered databignand jacob reving the imprisoners could become poveran brown
2. nine other set of of one eight one two by belarigho and singing signal theus to accept natural corp
3. ragems the downran maintain the lagar linear stream hegels p in five six f march one nine nine nine

SeqGAN - full adversarial training:
1. four zero five two memaire in afulie war formally dream the living of the centuries to quickly can f
2. part of the pract the name in one nine seven were mustring of the airports tex works to eroses exten
3. eight four th jania lpa ore nine zero zero zero sport for tail concents englished a possible for po

Recurrent GAN:
1. nteractice computer may became were the generally treat he were computer may became were the general
2. lnannnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnne and and and and and and and and and and and and and and and a
3. perors as as seases as as as as as as as as as selected see see see see see see see see see see see

Because SeqGAN models output a distribution over tokens at every time step, we can measure the true BPC and assess the quality of our approximation. Indeed, we observe that the approximate BPC is only slightly higher than the true BPC.

GAN-based models perform worse than state-of-the-art LMs by a large margin. Moreover, in SeqGAN, the pre-trained LM performs better than the fully trained model, with approximate BPC scores of 1.95 and 2.06, respectively, and the BPC deteriorates as adversarial training continues.

Finally, we note that generating sequences longer than 7 characters hurts the BPC of Press et al. (2017). It is difficult to assess the quality of generation with such short sequences.


In Table 2 we present a few randomly generated samples from each model. We indeed observe that adversarial training slightly reduces the quality of generated text for SeqGAN, and find that the quality of 100-character sequences generated by Press et al. (2017) is low.


6 Conclusions

We propose an evaluation procedure for text GANs that is based on approximating the GAN output distribution and using standard LM metrics. We provide a bound for the number of samples required for the approximation, and empirically show that in practice as few as 2000 samples per time step suffice. We evaluate character-based GAN models using our procedure, and show that their performance is substantially lower than that of state-of-the-art LMs. We hope our simple evaluation method leads to progress in GAN-based text generation by shedding light on the quality of such models.

Our evaluation algorithm is linear in the length of the test set and in the number of samples N .