All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Human evaluations are typically considered the gold standard in natural language generation, but as models’ fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts’ ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore three approaches for quickly training evaluators to better identify GPT3-authored text (detailed instructions, annotated examples, and paired examples) and find that while evaluators’ accuracy improved up to 55%, it did not significantly improve across the three domains. Given the inconsistent results across text domains and the often contradictory reasons evaluators gave for their judgments, we examine the role untrained human evaluations play in NLG evaluation and provide recommendations to NLG researchers for improving human evaluations of text generated from state-of-the-art models.
Human-quality text has long been a holy grail for the output of natural language generation (NLG) systems, serving as an upper bound on their performance. Since we lack a good way of encoding many aspects of what constitutes human-quality output in an automated method, we often must rely on human evaluation for our models. Though evaluations with end-users in an applied setting are encouraged (Belz and Reiter, 2006), in practice, most human evaluations instead ask people to rate generated text's intrinsic quality (van der Lee et al., 2019; Howcroft et al., 2020). Sometimes the generated text is explicitly compared to human-authored text (e.g., Liu et al., 2016; Zellers et al., 2021; et al., 2020), but even when no human-authored text is evaluated, evaluators implicitly compare the generated text to their knowledge of language and norms within specific domains.
Figure 1: Excerpts from human evaluators' explanations for why they believe a GPT3-generated story (also excerpted) was written by a human (left) or a machine (right). The evaluators point to a wide range of text attributes to make their decisions, sometimes using the same aspect of the text to come to opposite conclusions.
Evaluators are often asked to assess a text holistically, e.g., based on its overall quality, naturalness, or humanlikeness (van der Lee et al., 2021; Howcroft et al., 2020), where the exact evaluation criteria are left to the discretion of the evaluator. Though other evaluations are broken down along specific dimensions of text quality (e.g., grammaticality, coherence, etc.), Novikova et al. (2017, 2018) and Callison-Burch et al. (2007) found that these dimensions are often correlated and may be conflated in some evaluation settings. This is concerning because, as NLG models improve, evaluators are asked to read longer passages of text conditioned on large amounts of context. In these cases, fluency-related aspects of quality (i.e., the ones that don't require careful reading of the context and meaning of the passage) are the easiest to assess, particularly in small-batch evaluations with non-expert evaluators where speed is incentivized. This poses a challenge when collecting human evaluations for state-of-the-art language models, as errors are often content-based (e.g., factual inaccuracies or inconsistencies with the context) rather than fluency-based (Brown et al., 2020), so a superficial read may not be sufficient to catch model errors. For accurate assessments of generated text, we need human evaluations that are designed to encourage a sufficiently careful reading of the text to examine these subtler aspects of text quality.
We asked non-expert evaluators to assess the humanlikeness (operationalized as how believably human an evaluator finds a text) of text generated by current NLG models (GPT2 and GPT3) to test what current human evaluation practices can reveal about the models' quality (§2). We found that evaluators were unable to distinguish between GPT3- and human-authored text across story, news, and recipe domains. However, when we categorized the aspects of text the evaluators used to make their judgments, we found they primarily focused on the grammar, spelling, and style of the text. The evaluators' responses also indicated that they underestimated the quality of text current models are capable of generating (as seen in Figure 1). To our knowledge, this paper is the first to evaluate human evaluations of GPT3-generated text across multiple domains.
We then looked at three different evaluator training methods (providing detailed instructions, annotated examples, and human-machine paired examples) to test whether we could improve evaluators' accuracy (§3). While we found including examples in the task increased the set of texts evaluators thought could be machine-generated and increased their focus on textual content, no training method significantly increased evaluators' performance consistently across domains.
Based on our results (discussed in §4), we recommend moving away from small-batch evaluations with little training when collecting human evaluations of NLG models (§5). We also encourage practitioners to consider alternative evaluation frameworks that capture the usefulness of generated text in downstream settings rather than its humanlikeness.
2 How well can untrained evaluators identify machine-generated text?
In our first study, we ask how well untrained evaluators can distinguish between human- and machine-generated text. This task format, inspired by the Turing (1950) Test, is used to compare the quality of machine-generated text to human-authored text and, as models' fluency improves, to analyze NLG models' ability to "fool" readers (Garbacea et al., 2019; Brown et al., 2020). By asking evaluators to assess the humanlikeness of the text with only minimal instructions (see Figure 2), we observe how well untrained evaluators can detect state-of-the-art machine-generated text and which attributes evaluators focus on and think are important for detecting machine-generated text.
2.1 The Task
We gave evaluators 5 text passages, some of which were written by people and some generated by a model. We asked them to rate the text on a 4-point scale:
1. Definitely human-written
2. Possibly human-written
3. Possibly machine-generated
4. Definitely machine-generated
If they selected option 1, we asked them: "Why did you select this rating?" Otherwise, they were asked, "What would you change to make it seem more human-like?" The interface is shown in Figure 2.
We considered human- and machine-generated text in three different domains: stories, news articles, and recipes. In all three cases, we collected 50 human-authored texts in English and generated 50 texts from both the 175B parameter GPT3 model (also known as Davinci; Brown et al., 2020) 1 and GPT2-XL (Radford et al., 2019). 2 Evaluators were assigned to one domain and one model; the texts read by any given evaluator included some human-authored texts and some texts generated by their assigned model. We only considered texts 100 words or longer, and after reaching 100 words, all texts were truncated at the end of the next sentence. 3 To generate text, we used the "three-shot" setting described in Brown et al. (2020), conditioning the text on three additional samples of in-domain, human-authored text, which we refer to as the priming texts (all priming texts are in the supplementary materials and at ark.cs.washington.edu/human_evals_ACL21). While this setting is not typically how GPT2 is used in practice, we held this approach constant to directly compare how model quality changes evaluators' ability to distinguish between texts. For each domain, each generated text was conditioned on the same set of priming texts. The texts were delimited with an EOS token and generated using the default GPT3 generation settings (i.e., sampling with temperature = 0.7).
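The length filter and truncation rule can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, a simple regular expression stands in for a real sentence tokenizer (the footnote mentions NLTK), and "the end of the next sentence" is read as the end of the sentence in which the running word count first reaches 100.

```python
import re
from typing import Optional

# Assumption: a naive splitter (whitespace after ., ! or ?) stands in for a
# proper sentence tokenizer such as the NLTK one mentioned in the footnote.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def truncate_after_min_words(text: str, min_words: int = 100) -> Optional[str]:
    """Keep whole sentences until at least `min_words` words are included.

    Returns None for texts shorter than `min_words`, mirroring the
    "100 words or longer" filter described above.
    """
    if len(text.split()) < min_words:
        return None
    kept, word_count = [], 0
    for sentence in _SENTENCE_BOUNDARY.split(text.strip()):
        kept.append(sentence)
        word_count += len(sentence.split())
        if word_count >= min_words:
            break  # stop at the end of the sentence that crosses the limit
    return " ".join(kept)
```

With a toy limit of 5 words, "One two three. Four five six seven. Eight nine." is cut after the second sentence, since that is where the running count first reaches the limit.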
2.2.1 Stories
The human-authored texts came from the Reddit WritingPrompts dataset (Fan et al., 2018). 4 We collected all the stories that began with Once upon a time (255 stories total) and randomly chose 50 human-authored stories from this set. For the machine-generated text, we conditioned the models on the three priming texts and on the phrase Once upon a time. We removed generated stories that directly copied a priming text (with > 80% overlap) and regenerated those texts (9 instances with GPT2, 2 with GPT3). This is the most open-ended of the three domains, as the story's content is virtually unrestricted, and the only creative domain. It is also the noisiest of the human-authored datasets, as the stories were originally collected from social media comments with no quality-based filtering.
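A copy check of this kind might look like the sketch below. The overlap measure is not specified in the text, so the choice here is an assumption: overlap is taken as the fraction of a generated story's whitespace-separated tokens that fall inside in-order matching blocks with a priming text, computed with Python's `difflib`. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable


def overlap_fraction(generated: str, priming: str) -> float:
    """Fraction of generated tokens that also appear, in order, in `priming`.

    Assumption: this token-level matching-block measure is one plausible
    reading of "overlap"; the study may have used a different one.
    """
    gen_tokens = generated.split()
    if not gen_tokens:
        return 0.0
    matcher = SequenceMatcher(a=gen_tokens, b=priming.split(), autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gen_tokens)


def copies_priming_text(generated: str, priming_texts: Iterable[str],
                        threshold: float = 0.8) -> bool:
    """True if the generated text overlaps any priming text above the threshold."""
    return any(overlap_fraction(generated, p) > threshold for p in priming_texts)
```

Texts flagged by `copies_priming_text` would then be discarded and regenerated.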
2.2.2 News Articles
We collected 2,111 recent local news articles from 15 different newspapers using Newspaper3k 5 (details in Appendix A.1). After filtering out articles under 100 words, we manually filtered out articles that weren't local news or that referenced the coronavirus pandemic. We randomly chose 50 articles to use as our human-authored news articles and another 50 to use as prompts for our generation models. We conditioned each generated text on the headline and first sentence from the prompt articles, along with the three priming texts.
Because the title and the first sentence of a news article often summarize its contents, the generated content must adhere to the topics they introduce. By using local, recent news, we also limit the models' ability to copy from their training data. The models seemed to have the most trouble with this dataset structurally, e.g., generating new headlines without ending the current article or outputting invalid end-of-file tags.
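As a rough illustration of the three-shot conditioning, the prompt for one generated text could be assembled as below. The delimiter string, the newline between headline and first sentence, and both function names are assumptions made for this sketch; the text only states that the priming texts were delimited with an EOS token.

```python
from typing import Sequence

# Assumption: the GPT-2-style end-of-text marker is used as the EOS delimiter.
EOS = "<|endoftext|>"


def build_prompt(priming_texts: Sequence[str], prefix: str) -> str:
    """Concatenate three priming texts and the start of the text to generate."""
    if len(priming_texts) != 3:
        raise ValueError("the three-shot setting expects exactly 3 priming texts")
    return EOS.join(list(priming_texts) + [prefix])


def news_prefix(headline: str, first_sentence: str) -> str:
    """Prefix for a news article: its headline followed by its first sentence.

    Assumption: the two parts are separated by a newline.
    """
    return f"{headline}\n{first_sentence}"
```

For stories, the prefix would simply be the phrase Once upon a time; for news, it would be `news_prefix(headline, first_sentence)` taken from a prompt article.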
2.2.3 Recipes
We collected 50 human-authored recipes from the RecipeNLG dataset (Bień et al., 2020), which contains 2,231,142 recipes scraped from the web. We randomly chose an additional 50 recipes and used their titles and ingredient lists as prompts, appending them to the end of the priming texts. This is the most closed of the three domains, as the recipe must incorporate the listed ingredients and result in the dish described by the title. Recipes are typically written in clear commands, leaving little room for surprising or unexpected text.
We used Amazon Mechanical Turk (AMT) to collect the text evaluations from non-expert evaluators, who are commonly used in NLG evaluations (van der Lee et al., 2019). To have adequate power in our analyses (based on a power analysis with β = 0.8; Card et al., 2020), we had 130 different evaluators for each of the 6 task settings (3 domains × 2 models). Each participant evaluated 5 texts, giving us a total of 780 participants and 3,900 text evaluations.
We paid evaluators US$1.25 for completing the task. Following common best practice on AMT (Berinsky et al., 2012), evaluators had to have over a 95% acceptance rate, be in the United States, and have completed over 1,000 HITs (AMT tasks). We excluded evaluators' work if their explanations were directly copied text from the task, did not match their responses, did not follow the instructions, or were short, vague, or otherwise uninterpretable. Across experiments, 445 participants (18.6%) were rejected and not included in the §2 results (780 approved participants) and §3 results (1,170 approved participants).
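The participant counts above follow from simple arithmetic, reproduced here as a check. The constants are taken from the text (the three training conditions belong to the second study, described in §3); the variable names are ours.

```python
DOMAINS = 3                    # stories, news articles, recipes
MODELS = 2                     # GPT2 and GPT3 (first study, §2)
TRAININGS = 3                  # training conditions of the second study (§3)
EVALUATORS_PER_SETTING = 130
TEXTS_PER_EVALUATOR = 5
REJECTED = 445                 # rejected participants across both studies

study1_evaluators = DOMAINS * MODELS * EVALUATORS_PER_SETTING
study2_evaluators = DOMAINS * TRAININGS * EVALUATORS_PER_SETTING
study1_evaluations = study1_evaluators * TEXTS_PER_EVALUATOR
study2_evaluations = study2_evaluators * TEXTS_PER_EVALUATOR

approved = study1_evaluators + study2_evaluators
# Rejection rate is computed over everyone who submitted work,
# i.e. approved plus rejected participants.
rejection_rate = REJECTED / (approved + REJECTED)

print(study1_evaluators, study1_evaluations)   # 780 3900
print(study2_evaluators, study2_evaluations)   # 1170 5850
print(f"{rejection_rate:.1%}")                 # 18.6%
```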
Overall, evaluators choosing between human and GPT2-generated text correctly identified the author of the text 57.9% of the time, 6 but the evaluators choosing between human- and GPT3-generated text only guessed correctly 49.9% of the time (Table 1), compared to 50% random chance. While the accuracy of classifying GPT2- vs. human-authored text is significantly 7 different from chance, evaluators' accuracy distinguishing GPT3- and human-authored text is not. 8 This remains the case regardless of text domain; we failed to find any evidence that evaluators' accuracy on any one domain for GPT3 differs from the overall GPT3 accuracy of ≈ 50%. 9 The story texts saw the biggest drop in evaluator accuracy from GPT2 to GPT3 (62% to 48%, Cohen's d = 0.57). The distribution of evaluators' scores is shown in Appendix A.2.
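For readers unfamiliar with the effect size reported above, Cohen's d for two independent groups can be computed as in the following sketch. It uses the common pooled-standard-deviation form; which variant was used in the study is not stated, and the accuracies in the usage note are toy values, not study data.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
```

Here each group would hold per-evaluator accuracies for one condition; for example, `cohens_d([0.6, 0.8, 1.0], [0.4, 0.6, 0.8])` compares two toy groups whose means differ by one pooled standard deviation.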
In Table 1, we see other statistics worsen as well between GPT2 and GPT3: how well evaluators identified the machine-generated text (F1, precision, and recall), evaluators' agreement (Krippendorff's α, a measure of annotator agreement that corrects for the probability of random agreement), and the percent of guesses that the text was human-written (% human). Given that the texts are equally likely to be human- and machine-written, there are disproportionately many human guesses, making up two thirds of the responses in the GPT3 experiments. Despite the significantly lower scores, evaluators' confidence (the percent of Definitely responses) remains fairly constant across conditions.
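The statistics named above can be derived from the 4-point ratings roughly as sketched below. The binarization is our reading of the task (ratings 1 and 2 count as a human guess, 3 and 4 as a machine guess, and 1 or 4 as a confident response), machine-generated is treated as the positive class, and the function name and output keys are hypothetical.

```python
from typing import Dict, Sequence


def summarize_ratings(ratings: Sequence[int], is_machine: Sequence[bool]) -> Dict[str, float]:
    """Summarize 4-point ratings against the true source of each text."""
    if len(ratings) != len(is_machine) or not ratings:
        raise ValueError("need equally sized, non-empty inputs")
    n = len(ratings)
    guessed_machine = [r >= 3 for r in ratings]  # 3 or 4 = machine guess
    pairs = list(zip(guessed_machine, is_machine))
    tp = sum(g and m for g, m in pairs)          # machine texts caught
    fp = sum(g and not m for g, m in pairs)      # human texts called machine
    fn = sum(m and not g for g, m in pairs)      # machine texts called human
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": sum(g == m for g, m in pairs) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "pct_human_guesses": sum(not g for g in guessed_machine) / n,
        "pct_confident": sum(r in (1, 4) for r in ratings) / n,
    }
```

A low recall together with a high share of human guesses is the pattern the paragraph above describes for the GPT3 conditions.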
Taken on its own, the evaluators' difficulty identifying GPT3-generated text compared to GPT2 points to the improvement of new NLG models. However, it also points to concerns about extending current human evaluation methodologies to state-of-the-art text generation. In particular, the evaluators' explanations reveal underlying confusion and misconceptions about state-of-the-art NLG.
To better understand what untrained evaluators focused on in the text to make their decisions, the authors annotated 150 random responses from the evaluators who distinguished between human- and GPT3-generated text (see Appendix A.3 for annotation details). We divided the text annotation labels into three categories: form, content, and machine capabilities. Form qualities focus on the format, style, and tone of the text, while content focuses on the text's meaning. We also coded for comments that explicitly referenced people's perceptions of what types of language machines are capable (or incapable) of generating (machine capabilities).
We found nearly twice as many comments about the form of the text as about its content (form: 47% of labels, content: 25%). Evaluators in our sample focused most on the spelling, grammar, or punctuation of the texts (45 out of 150 comments) and the style or tone of the text (24 out of 150 comments). However, these dimensions of text are unlikely to be helpful in identifying text generated by current models, considering that GPT3 has already been shown to generate fluent text and to adapt easily to new generation domains (Brown et al., 2020).
We also found that the reasons evaluators gave for their answers often contradicted each other. The formality of the text, spelling and grammar errors, and clarity were all cited to justify both human and machine judgments. This was also reflected in the low agreement scores between evaluators, with Krippendorff's α ≈ 0 across domains.
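For reference, Krippendorff's α for nominal labels can be computed from a coincidence matrix as in the sketch below. It assumes each text (unit) carries a list of nominal labels from different evaluators, for instance binarized human/machine guesses; whether the study computed α on binarized or 4-point ratings is not stated, and the function name is ours.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha for nominal data.

    `units` holds, for each rated text, the labels it received. Units with
    fewer than two labels are skipped because they contain no pairs.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from different raters adds 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = 0.0
    if n > 1:
        expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return float("nan")  # undefined when only one label value occurs
    return 1.0 - observed / expected
```

A value near 0 means the observed disagreement is about what chance alone would produce, which is the situation reported above; a value of 1 indicates perfect agreement.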
Table 1: §2 results, broken down by domain and model, along with the F1, precision, and recall at identifying machine-generated text, Krippendorff's α, % human-written guesses, and % confident guesses (i.e., Definitely machine- or human-authored). * indicates the accuracies significantly better than random (two-sided t-test, for Bonferroni-corrected p < 0.00333).
Evaluators' beliefs about what NLG models are capable of ranged from thinking their text is already indistinguishable from human-authored text ("I have no idea if a human wrote anything these days. No idea at all.") to doubting machines' ability to use basic language ("Usually AI has terrible grammer [sic] and messes up."). But overall we found most evaluators' beliefs about generated language underestimated or misunderstood current NLG models, as seen in Appendix A.4.
3 Can we train evaluators to better identify machine-generated text?
Given evaluators' inability to distinguish GPT3- and human-authored text and their inconsistent reasoning for their decisions, we investigated whether there were simple ways of improving evaluators' ability to spot attributes of GPT3-generated text. Inspired by crowdsourcing research on guiding workers on writing or other subjective tasks (Kim et al., 2017; Mitra et al., 2015), we tested three lightweight evaluator-training methods to see if we could improve people's ability to identify machine-generated text while maintaining the short, low-cost nature of the evaluations.
3.1 Evaluator Training Methods
We considered 3 evaluator trainings that can be added to the beginning of a human evaluation task, at most requiring only 3 extra samples of human- and machine-generated text. To test the effectiveness of each type of training, we re-ran the experiments from §2, but this time, we prepended one of three evaluator-training methods to the evaluation task: an instruction-based training, an example-based training, and a comparison-based training. Screenshots of the training interfaces are in Appendix A.6; the full set of training materials is in the supplementary materials and at ark.cs.washington.edu/human_evals_ACL21.
Other than the training, the task setup was identical to the GPT3-based tasks in §2. We again ran the task on Amazon Mechanical Turk across three domains (stories, news, and recipes), using the same texts. As each individual participant was only permitted to complete one set of evaluations, the set of evaluators who received these trainings was completely disjoint from the set of evaluators from our first study. The participants were subject to the same restrictions described in §2.3 and excluded according to the same criteria; we did not use the trainings to filter out evaluators. For each domain and training method pair, we had 130 unique evaluators complete the task, giving us 5,850 text annotations from 1,170 evaluators.
3.1.1 Training With Instructions
To give evaluators a better sense of which parts of the text to pay attention to, we extended the original task instructions to include dimensions of the text that could be helpful for identifying machine-generated text (repetition and factuality) and ones that could be misleading (grammar, spelling, and style). We chose these dimensions based on previous work and evaluators' comments in a pilot study (see Appendix A.5).
The Instructions training was the simplest of our 3 evaluator training methods. It was general enough to be applied across the 3 domains but provided little information about the quality and domain of text the evaluator would be rating. It did not increase the cost of collecting evaluations (US$1.25 per HIT) because it does not require any extra work on the part of the evaluator, though this also made it the easiest training to ignore. The instruction-based training is the most prescriptive of the training methods, as the researcher has to choose the dimensions they want the evaluators to focus on.
3.1.2 Training With Examples
Our Examples training consisted of 3 practice rounds of the actual task: given a text, guess if it is machine- or human-authored. We collected 3 additional texts in the same manner described in §2.2 and wrote a short explanation of which aspects of the text hinted at its source. After an evaluator makes their guess, the correct answer and explanation are shown. Each domain had its own set of examples and explanations.
By showing examples, this training helps set the evaluators' expectations about the quality of the human-and machine-generated text. We paid evaluators more for completing this task (US$1.75 per HIT) to compensate for the extra texts they needed to read. As with the instruction-based training, while pointing out specific text dimensions can help evaluators focus on important features, it may also restrict their search space.
3.1.3 Training With Comparison
In the Comparison training, we took the example passages from the Examples training and paired them with a text from the opposite source (machine or human) that began with the same prompt. We asked evaluators to guess which of the two texts was the machine-generated one. We then provided the correct answer to the evaluator, along with the same explanations used in the Examples training.
This training allows evaluators to directly compare human and machine texts written from the same prompt. It is also the most expensive training, as it required evaluators to read three more passages than the Examples training; we paid evaluators US$2.25 per HIT.
We found that while all 3 training methods improved evaluators' accuracy at identifying machine- vs. human-authored text over the no-training accuracy, the Examples training was the only one that showed significant improvement (see Table 2). 10 Breaking down the results by domain, however, we find the Examples accuracy did not significantly increase over the no-training accuracy when considering any of the three domains individually. Even so, the significant difference in overall performance is driven mainly by the story domain; when comparing evaluators' performance with no training to its Examples training counterpart, we see a change of 0.019 and 0.062 mean accuracy in the news and recipe domains, respectively, versus 0.086 on the story domain. This is perhaps due to the examples helping override the preconception that machines cannot generate "creative" text.
Across all 3 domains, the Examples and Comparison trainings produced the highest recall and F1 scores for evaluators' judgments and decreased the percentage of texts they guessed were human-written, which indicates that evaluators were willing to consider a broader set of texts to be machine-generated than the evaluators in §2. However, despite the trainings and the increased proportion of confident responses, evaluator agreement remained low across domain and training settings (α ≤ 0.11), and higher agreement did not correspond to higher accuracy.
We again annotated 150 comments along the dimensions listed in Appendix A.3, divided into form, content, and machine capabilities categories, this time from evaluators who received the best-performing Examples training. As shown in Table 3, we found that the proportion of form comments dropped in the sample of evaluators who went through the Examples training, while the proportion of content comments doubled. We also saw a drop in the number of comments mentioning evaluators' expectations of machine-generated text. While this change in focus doesn't necessarily correspond to correct judgments, content reasons are more in line with current NLG model capabilities (Brown et al., 2020).
Overall, none of our three training methods significantly improved evaluators' ability to detect machine-generated text reliably across text domains while still maintaining the small-batch nature of Amazon Mechanical Turk. This speaks to the improving quality of NLG models, but we also found that untrained evaluators mainly focused on the format of the text, deciding if it was human or machine-generated based on whether the text was grammatically or stylistically correct. This, combined with the high percentage of human guesses, the low recall scores for the machine guesses, and the evaluators' comments on their expectations of NLG models, indicates a systematic underestimation by the evaluators of the quality of machine-generated text. Evaluators who were trained with examples had higher expectations of machine-generated text and focused more on the text's content; however, the training was not sufficient to significantly raise evaluators' scores across all three domains.
Table 2: §3 results, broken down by domain and training method, along with the F1, precision, and recall at identifying machine-generated text, Krippendorff's α, % human-written guesses, and % confident guesses (i.e., Definitely machine- or human-authored). "None" training refers to the GPT3 results from §2. * indicates accuracies significantly better than None (no training; two-sided t-test, for Bonferroni-corrected p < 0.00333).
Training    Form    Content    Machine Capabilities
None        47.1    24.6       28.3
Examples    32.5    50.0       17.5
Table 3: % of annotation labels that reference the text's form and content and the evaluator's perception of machines' capabilities.
Many of the explanations given by evaluators included references to the text that reflected human attributes or intent that they suspected machines could not generate (e.g., "personal description a machine wouldn't understand, [like a pirate] wanting to be home with his wife and son" from Figure 1 and the examples in Appendix A.4). However, current NLG models are capable of generating text with at least superficial reference to human attributes or intent, as seen in the generated story in Figure 1. This assumption that machines can't generate text with these aspects of humanlikeness led many evaluators astray, and we suspect it is one cause of the low accuracy we found.
Crowdsourcing studies dealing only with human-authored texts often include extensive training, quality checks, or coordination (Kittur and Kraut, 2008; Kim et al., 2017; Bernstein et al., 2010). NLG evaluations usually forgo such structures, based, we suspect, on the assumption that evaluating machine-generated text requires only fluency in the language the text is generated in. Our results suggest otherwise. Evaluators often mistook machine-generated text for human-authored text, citing superficial textual features that machine generation has surpassed (Brown et al., 2020). One potential remedy for this is to focus evaluator training on debunking this misconception. We did see evidence that the increase in accuracy we saw with our Examples training was associated with fewer explanations mistakenly referencing machine capabilities, even though the training did not specifically focus on this.
Based on our findings, if NLG researchers must run human evaluations as small-batch evaluations on Amazon Mechanical Turk or similar platforms, we recommend they train evaluators with examples. This will help calibrate the evaluators' expectations of generated text and indicate the careful reading they may need to do to properly assess the text's quality. Our experiments also indicate the importance of confirming with evaluators why they have made the decisions they have, as the criteria they might implicitly be evaluating may be mismatched with researchers' intended criteria. However, other evaluation setups may be more successful on Amazon Mechanical Turk, such as long-term evaluations with qualified evaluators who have gone through an extended training (like those in Kittur and Kraut, 2008; Zellers et al., 2019a) or third-party evaluator quality tools (e.g., Positly, used by Brown et al., 2020).
However, given the increasing length of text NLG models can handle and the careful reading needed to detect many errors in generated text, we encourage NLG researchers to move away from standalone, intrinsic human evaluation tasks. We found that, by default, our evaluators in this evaluation setting were most likely to focus on surface-level, fluency-related aspects of quality. We join past work (Belz and Reiter, 2006; van der Lee et al., 2021) in recommending a move towards evaluation settings where evaluators are better motivated to carefully consider the content and usefulness of generated text. For example, TuringAdvice (Zellers et al., 2021) asks evaluators to rate NLG models by their ability to generate helpful advice, and RoFT (Dugan et al., 2020) engages evaluators through a guessing game to determine the boundary between human- and machine-generated text. Other evaluation methods ask the evaluators to directly interact with the generated text; for example, Choose Your Own Adventure (Clark and Smith, 2021) and Storium (Akoury et al., 2020) evaluate story generation models by having people write stories with the help of generated text. 11 We see that GPT3 can successfully mimic human-authored text across several domains, renewing the importance of evaluations that push beyond surface-level notions of quality and consider whether a text is helpful in a downstream setting or has attributes that people would want from machine-generated text.
11 Note that we initially tried a fourth training condition along these lines, where we asked evaluators to directly interact with the generated text by rewriting it to be more humanlike. We found we were unable to successfully recruit evaluators to complete this task. The rate of retention was less than 30%, and the rejection rate was over 50%. We found AMT was not a good platform for this type of task, at least not for the format and the price point we explored in this work.
Finally, given the mixed effect we found different trainings can have on evaluators' performance and the lack of human evaluation details typically presented in NLG papers (van der Lee et al., 2019; Howcroft et al., 2020), we encourage NLG researchers to include details of any instructions and training they gave evaluators in their publications. This, along with efforts to standardize human evaluation design (Belz et al., 2020; Howcroft et al., 2020) and deployment (Khashabi et al., 2021; Gehrmann et al., 2021), will support future development of evaluator training procedures and the comparison of human evaluation results in future NLG evaluation work.
6 Related Work
A subfield of NLG analyzes the role of human evaluations, including discussions of the tradeoffs of human and automatic evaluations (Belz and Reiter, 2006; Hashimoto et al., 2019). There are critiques and recommendations for different aspects of human evaluations, like the evaluation design (Novikova et al., 2018; Santhanam and Shaikh, 2019), question framing (Schoch et al., 2020), and evaluation measures like agreement (Amidei et al., 2018), as well as analyses of past NLG papers' human evaluations (van der Lee et al., 2021; Howcroft et al., 2020). Additionally, crowdsourcing literature has work on effectively using platforms like Amazon Mechanical Turk (e.g., Daniel et al., 2018; Oppenheimer et al., 2009; Weld et al., 2014; Mitra et al., 2015). In this work, we focus on the role evaluator training can play for producing better accuracy at distinguishing human- and machine-generated text, though other quality control methods are worth exploring.
Previous work has asked evaluators to distinguish between human- and machine-authored text. For example, found that trained evaluators were able to detect open-ended GPT2-L-generated text 71.4% of the time, Garbacea et al. (2019) reported that individual evaluators guessed correctly 66.6% of the time when evaluating product reviews, and Brown et al. (2020) found evaluators could guess GPT3-davinci-generated news articles' source with 52% accuracy, though these results are not directly comparable to ours due to differences in the evaluation setup, data, and participants.
Finally, our findings that untrained evaluators are not well equipped to detect machine-generated text point to the importance of researching the safe deployment of NLG systems. Gehrmann et al. (2019) proposed visualization techniques to help readers detect generated text, and work like Zellers et al. (2019b) , , and Uchendu et al. (2020) investigated large language models' ability to detect generated text.
We found that untrained evaluators were unable to distinguish between human- and GPT3-generated text from three domains. However, we also found that the evaluators focused on surface-level text qualities to make these decisions and underestimated current NLG models' capabilities. We experimented with three methods for training evaluators, and while example-based trainings led to increases in recall and the amount of content-based evaluations, they did not lead to significant improvements in accuracy across all domains. Given that evaluators struggled to distinguish between human- and machine-generated text in this setting, we should shift how we think about collecting human evaluations for current NLG models.
1 beta.openai.com/
2 huggingface.co/gpt2-xl
3 Using NLTK; www.nltk.org/
4 github.com/pytorch/fairseq/tree/master/examples/stories
10 Tukey's HSD adjusted p < 0.003 for distinguishing between the Examples training and no training, d = 0.25