Evaluating Machines by their Real-World Language Use


Abstract

There is a fundamental gap between how humans understand and use language -- in open-ended, real-world situations -- and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use -- which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must generate helpful advice. We make our challenge concrete by introducing RedditAdvice, a dataset and leaderboard for measuring progress. Though we release a training set with 600k examples, our evaluation is dynamic, continually evolving with the language people use: models must generate helpful advice for recently-written situations. Empirical results show that today's models struggle at our task, even those with billions of parameters. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 9% of cases. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

1 Introduction

In October 2019, a team from Google surprised many in the natural language processing (NLP) community by announcing T5, a new 11-billion parameter model pretrained on hundreds of gigabytes of language text (Raffel et al., 2019) . Like many other large models released in the last few years, T5 showed impressive gains on a variety of NLP benchmarks, adding to a growing list of "solved" datasets on which machines outperform humans.

Yet, when T5 generates language, we observe clear gaps between machine-level and human-level language understanding. Consider the example in Figure 1, in which a woman asks for advice. She is assigned to dissect an animal for her class project, but has extreme anxiety about dead animals - and her teacher refused to give her another assignment. Humans can respond with helpful advice, reflecting our unique ability of real-world language use: to communicate and tackle open-ended issues.


The helpful advice in this example - though not the only helpful response possible - suggests that she escalate the situation slightly by sending a short email to her guidance counselor.

On the other hand, not only is T5's advice unhelpful, it also reveals key misunderstandings of the situation. It seems to believe that the student is asking the teacher to do a class project involving dead animals. This reading comprehension error is particularly strange, as T5 outperforms humans on a variety of reading comprehension benchmarks. Others in the community have observed similar issues, raising concerns about what today's benchmark datasets measure (Yogatama et al., 2019; Kryscinski et al., 2019; McClelland et al., 2019; Gardner et al., 2019) .

We argue that there is a deep underlying issue: a gap between how humans use language in the real world and what our evaluation methodology can measure. Today's dominant paradigm is to study static datasets, and to grade machines by the similarity of their output with predefined correct answers. For example, we score multiple choice exams by how often the correct answers are chosen, and evaluate generative tasks like machine translation by similarity with respect to correct translations. However, when we use language in the real world to communicate with each other - such as when we give advice, or teach a concept to someone - there is rarely a universal correct answer to compare with, just a loose goal we want to achieve.

We introduce a framework to narrow this gap between benchmarks and real-world language use. We propose to evaluate machines by their success in using language to (1) communicate with humans in (2) tackling complex, open-ended, real-world situations. Our goal is a machine that, like a human, can generate language that is useful and helpful. Doing so necessarily requires a deep understanding of language and the world, as per a line of thought that the complete meaning representation is one that suffices to complete a task (Artzi et al., 2013) .

As a case-study of our framework, we introduce TuringAdvice as a new grand challenge for AI systems. A machine reads a situation written by a person seeking advice, like Figure 1 , and must then write advice that is helpful to the advice-seeker. Like a Turing Test (Turing, 1950) , we establish a simple condition required for a model to 'pass': model-generated advice must be at least as helpful to the advice-seeker as human-written advice.

We make our challenge concrete by introducing a new dataset, RedditAdvice, and an accompanying leaderboard. We tie our dataset to the Reddit community, which resolves two additional sources of bias. First, Reddit users are intrinsically motivated, seeking advice about highly complex real issues - which past work suggests differ from the hypothetical issues that crowd workers might come up with (e.g. Kwiatkowski et al., 2019; Gurari et al., 2018). Second, we make our dataset and leaderboard dynamic, rather than static - evaluating models on Reddit situations posted over the previous two weeks, at the time of submission. Models therefore must tackle the same language task as humans, generalizing to new situations and patterns of language.

Experimental results show that RedditAdvice is incredibly challenging for today's machines. Today's largest model, T5, with 11 billion parameters (Raffel et al., 2019), produces advice that is preferable to human-written advice only 9% of the time - after being finetuned for our task on a training dataset with 600k pieces of advice. What's more, our experimental setup finds statistically significant differences between current models, allowing us to meaningfully grade varying levels of performance.

We also study our task from the perspective of today's standard 'core' NLP tasks. Broadly, we find that machines frequently confuse who is who, contradict themselves, or seem to miss important world knowledge. However, these mistakes tend not to fall into the neat categories defined by standard task definitions. We address this by introducing diagnostic questions, which systematically measure these language understanding errors.

In summary, our paper makes three major contributions. First, we introduce a new framework for measuring language understanding through directly tackling real-world language problems. Second, we introduce TuringAdvice as a new challenge for AI systems, along with a dynamic dataset and leaderboard. Third, we connect our task to existing atomic language understanding tasks, introducing a new setting that reveals areas where progress is still needed.

2 Real World Language Use

Our key proposal is to evaluate machines by their success at real-world language use: using language to communicate with a human, in response to a naturally occurring situation, in order to achieve a desired outcome. Our approach is inspired by Wittgenstein's notion of semantics, that "meaning is use": language is grounded in our desire to make sense of one another and cooperate to meet our needs (Wittgenstein, 1953) .

As machines do not have humanlike needs or desires, we propose to evaluate machines' success at a task by how well it serves a human who is interested in the outcome. For example, if a machine orders food on my behalf, then I can evaluate it based on whether I enjoy the dish it ordered. Though this requires careful task selection in order to make things feasible for current models, as we will show in Section 3, it results in a powerful and reliable human evaluation.

2.1.1 Pragmatics in NLP

Our evaluation relates to pragmatics in NLP, where communication is modeled through both speakers and listeners (Golland et al., 2010; Frank and Goodman, 2012). One approach is to introduce a communication game with an explicit objective. For example, Wang et al. (2016) study a blocks world where humans must build a structure by giving commands to a block-placing machine. The machine is then graded on accuracy. Our proposed evaluation instead covers complex everyday scenarios faced by a human, where the objective is to help them as much as possible.

Pragmatics can also be studied through machine-machine communication, e.g., through emergent language (Lazaridou et al., 2017). Recent work uses pretrained question-answering models to evaluate summarization models (Chen et al., 2018; Scialom et al., 2019; Eyal et al., 2019; Vasilyev et al., 2020). However, ensuring that machines communicate in standard English is difficult, as there is usually a more efficient machine-language coding scheme for the task (Kottur et al., 2017).

2.1.2 Two Major Approaches For Evaluation

Today, we see two major approaches for model evaluation, which we discuss below.

Quality of generations. The first approach studies generative tasks like chit-chat dialogue or story writing, and measures the inherent quality of generations, often through individual attributes such as "sensibleness" and "specificity" (e.g., Venkatesh et al., 2018; Hashimoto et al., 2019; Adiwardana et al., 2020). This approach is orthogonal to ours: though these attributes might be desirable, they are often not sufficient to guarantee task success.

Correctness. The second (and perhaps more common) approach is to evaluate tasks through correctness over static datasets. For example, machines can be graded by the similarity of their generated translation to correct translations, 1 or by how often they choose the correct answer on a multiple-choice exam. Many goal-oriented dialogue and semantics tasks are also evaluated in this way, as a model is evaluated by whether it makes the correct API call, or produces a correct parse.

1 Models submitted to the 2019 Conference on Machine Translation were evaluated (by humans) on how well the model's translations agreed with either (1) human-written translations, or (2) the original source text (Barrault et al., 2019).

Since many language tasks cannot be evaluated through correctness, researchers often introduce proxy tasks that are easy to evaluate, while (hopefully) correlating with the underlying true task. For example, SWAG (Zellers et al., 2018) is a multiple-choice proxy task and dataset introduced to study the true task of commonsense reasoning.

However, there are gaps between datasets for proxy tasks (e.g. multiple choice), and the core tasks they seek to represent (e.g. commonsense reasoning), which we discuss in the next sections.

2.2 Can language use really be measured through correctness over proxy tasks?

When we reduce a complex language task to a simplified setup with a small label space (like multiple-choice classification), we run the risk of introducing artifacts and biases: patterns that can be exploited in the simplified setup, but that are not representative of the true task (Gururangan et al., 2018; Zellers et al., 2019a). Artifacts can even enable machines to outperform humans on the final benchmark without solving the underlying task.

While the problem of artifacts has recently taken the spotlight in the NLP community, partially because large Transformer models (Vaswani et al., 2017) are very good at picking up on artifacts, there is a deeper underlying issue. The key assumption behind a simplified language task is that, by correctly mapping from inputs X to labels Y, a machine must necessarily learn a set of attributes A that are also representative of the 'true' task. We can upper-bound the information contained in A through the information bottleneck principle of Tishby et al. (1999). An efficient model minimizes the following objective, for some β > 0:

I(X; A) − β I(A; Y)    (1)

where I is mutual information. In other words, the model will learn attributes A that maximally compress the inputs X (minimizing I(X; A)), while also remaining good predictors of the labels Y (maximizing I(A; Y)). However, the label prediction term is bounded by the information (or entropy, H) of the label space:

I(A; Y) = H(Y) − H(Y|A) ≤ H(Y).

This means that an efficient model, trained on a task with a small label space, might have attributes with low information content. This phenomenon has been observed empirically, with deep models iteratively discarding information at each layer (Tishby and Zaslavsky, 2015 ).
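To make the bound concrete, here is a worked illustration (ours, not from the paper) for an n-way multiple-choice task: no matter how rich the input X is, the label-relevant information in A is capped by the entropy of the label space.

    % Worked bound (illustration): for an n-way multiple-choice task,
    % the label entropy, and hence I(A; Y), is at most log2(n) bits.
    \begin{align*}
      I(A; Y) \le H(Y) = -\sum_{y=1}^{n} P(y)\,\log_2 P(y) \le \log_2 n
    \end{align*}
    % e.g., n = 4 gives at most log_2 4 = 2 bits of label-relevant
    % information per example, regardless of how complex the input X is.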

If we wish to evaluate language understanding via proxy tasks, then the information-discarding strategy of efficient models poses an issue. Models are encouraged to forget linguistically useful information that is not directly relevant to predicting Y (Pereira, 2000) . In fact, models might exclusively learn the artifacts in the data, provided that these artifacts are enough to solve the task.

An alternate approach is to make datasets harder adversarially, so as to have fewer artifacts (Zellers et al., 2018, 2019a). However, it might be impossible to make a dataset with no artifacts, or to know if one has been created.

Our proposal, to evaluate models through their real-world usage of language, addresses the information-discarding issue in two ways. First, by using real-world language over open-ended tasks, the mapping between possible inputs and outputs is allowed to be highly complex. For example, the space of possible advice is vast, and many pieces of advice might be equally helpful given a situation. Second, our proposal tackles language problems directly, without introducing a correctness-based proxy that machines might overfit to.

2.3 Static Datasets In A Dynamic World

To evaluate performance on a real-world task by means of a dataset, we must (implicitly) assume that the dataset is a good representation of the world (Torralba and Efros, 2011) . This assumption might be questionable when it comes to real-world language use, as static datasets necessarily capture historic patterns of language. For instance, in our field, we commonly evaluate syntactic understanding using the Penn Treebank dataset, which contains news articles from 1989 (Marcus et al., 1993) . However, the world is constantly evolving, along with the language that we use.

To bridge this gap, we propose to evaluate machines by their interactions with humans in the present. Models therefore must learn to perform the underlying language tasks, even for novel situations, rather than fitting to the historic distribution of a fixed test set. We make this notion concrete in the next section, where we introduce a dynamic dataset and leaderboard for evaluating advice.

3 TuringAdvice: a new challenge for natural language understanding

As a case study of our framework, we introduce TuringAdvice, a new challenge task for AI systems to test language understanding. The format is simple: given a situation expressed in natural language, a machine must respond with helpful advice. To pass the challenge, machine-written advice must be at least as helpful to the advice-seeker as human-written advice, in aggregate.

We choose to focus on advice for a few reasons. First, people ask for and give advice as a part of their daily lives, encompassing settings as diverse as relationship advice and tech support (Bonaccio and Dalal, 2006). This means that we as humans have inherent familiarity with the task, and with what it means for advice to be helpful. Thus, as we will later show empirically, advice is easy to evaluate through how much it helps the advice-seeker - even though it is highly diverse. 2

Second, giving advice overlaps with core NLP tasks, such as reading comprehension and natural language inference (Section 5.4). We hypothesize that generating advice that truly helps someone requires a deep understanding of their situation.

Third, good advice is important to people. Its importance has even led to the creation of internet communities oriented around advice, making data plentiful. Likewise, we hypothesize that progress on TuringAdvice might have high impact for good. An AI capable of writing consistently helpful advice - perhaps, a virtual therapist (DeVault et al., 2014) - could greatly help people in need.

Situation (Figure 2): I've been together with my BF (we'll call him Kyle) for a little over 8 months now. We don't live together but he only lives about a 5 minute walk from me. I would have described the relationship before this week as pretty slow. Neither of us really wanted any big commitments yet so outside of date nights, netflix and occasional hook ups the relationship has been pretty laid back. That was until this last weekend. My birthday was Saturday and we were having a party with about 9 people. Kyle made a big show about getting everyone together because he wanted to give me her present in front of everyone. Well, this is where things get crazy. For my "birthday present", Kyle got a MASSIVE tattoo on his back. Of my face. Underneath my face there is text saying "Mine forever". The silence was deafening, it didn't help that the tattoo was not even half done. This is completely out of my comfort zone and I have no clue what to do. My sister has been telling me just to break up with him and ignore him. But I just can't do that. Before Saturday I did feel a spark with him. I did like him a lot. But this is all just way to much. Any advice on what I should or can do here would be appreciated.

Top-scoring advice: You gotta at least talk to him and tell her why everyone reacted like that did. "I feel like the gift was making a huge commitment that we hadn't actually discussed yet. We aren't married, engaged or even living together so I'm not sure why you thought this was a good idea. I'm sure you only had good intentions, but I'm not prepared for the type of commitment that tattoo entails."

Figure 2: An example situation, along with two pieces of top-scoring community authored advice. A machine must generate advice that is at least as helpful to the advice-seeker as the reference advice.

3.1 RedditAdvice: A Dynamic Dataset for Evaluating Advice

We propose to evaluate models dynamically, through new situations and advice that are posted to Reddit. We call our dynamic dataset RedditAdvice. Many of Reddit's subcommunities (or 'subreddits') are devoted to asking for and giving advice, with subreddits for legal, relationship, and general life advice. 3 During evaluation time, we will retrieve new situations from Reddit as a new test set for models. Workers on Mechanical Turk then grade the model-written advice versus the Reddit-endorsed human-written advice.

3.1.1 How Advice-Giving Works On Reddit

Suppose a Reddit user faces an issue that they are seeking advice about. First, they write up their situation. The writing is typically detailed, and usually includes a question (often implicitly). They then post their situation to an advice-oriented subreddit. Users who follow that subreddit then reply to the situation, offering advice. 4 Importantly, any user can 'upvote' or 'downvote' the advice as well as the situation itself - changing its score slightly. Top-scoring advice is deemed by the wisdom of the crowd as being the most helpful, while top-scoring situations are often the most detailed. 5 See Figure 2 for an example. Key to the functioning of this online advice community is that users want to participate and are thus intrinsically motivated - situation posters need advice, repliers desire to have their advice recognized, and readers enjoy passing judgement on such advice (Chiu et al., 2006; Wasko and Faraj, 2005).

3.1.2 The Ideal Evaluation - Through Reddit?

In a sense, human advice-givers are 'evaluated' on Reddit by the score of their advice - representing how well their advice has been received by the community. Similarly, the ideal model evaluation might be to post advice on Reddit directly. If the model consistently understands written situations - enough to produce helpful advice - then its advice will in turn be consistently upvoted.

However, there is a significant ethical problem with this approach. The users who post advice questions are real people, with real problems. A user might read advice that was originally written by a machine, think it was human-endorsed, and do something harmful as a result. 6 For this reason, we take an alternate crowdsourcing approach.

3.1.3 A Crowdsourced, Hybrid Evaluation through Mechanical Turk

We propose a hybrid approach for dynamic evaluation of models. While the situations and reference advice come from Reddit, we hire workers on Mechanical Turk to rate the relative helpfulness of machine-written advice. Not only is this format more ethical, it also lets us collect diagnostic ratings, allowing us to quantitatively track the natural language understanding errors made by machines.

One possible concern, however, is that crowd workers might be more extrinsically motivated - performing our task to earn income, as opposed to the Reddit users who are intrinsically motivated. To address this, we made our crowdsourcing task as fulfilling as possible: using popular situations from Reddit, and pitching the work in terms of helping people. We received feedback from many workers that our tasks were entertaining and fun. This suggests that our workers are intrinsically motivated, and thus are good judges of language use - a finding we confirm empirically in Section 4.1.1.

Figure 3: Crowdsourcing workflow. Workers on Mechanical Turk are given a situation, and two pieces of advice. First, they choose which is most helpful - in this example, B is selected. Second, they rate the helpfulness of the worse advice (A); last, they answer an additional diagnostic question that depends on whether A was rated Slightly helpful or not.

3.1.4 Mechanical Turk Annotation Setup

In a single round of evaluation, we retrieve 200 popular Reddit situations that were posted in the last two weeks. 7 For each situation, we retrieve the top-rated human-written advice, and generate one piece of advice per model. Workers on Mechanical Turk then compare the helpfulness of the model-generated advice with the human-written advice, and provide diagnostic ratings. We show an overview of our Mechanical Turk task in Figure 3. A worker is given a situation, as well as two pieces of advice: A and B. One is the top-scoring advice from Reddit, and the other is model-generated advice; the worker is not told which is which. The worker first chooses the more helpful piece of advice, then provides diagnostic information for the less helpful advice - rating it Slightly helpful, Not helpful, or Dangerous. If the worse piece of advice was Slightly helpful, they choose whether it is worse than the better advice due to a Meaning problem or a Writing problem. Otherwise, they choose whether the worse advice could be Possibly helpful in some other situation, or Never helpful in any situation.

Overall, three workers rate each model-situation pair, and their ratings are combined using a majority vote. We follow best practices for Mechanical Turk, including the use of a qualification exam.

4 Experimental Results on RedditAdvice

In this section, we report results from one round of dynamic evaluation on RedditAdvice. We evaluate the following selection of NLP models:

a. Grover (Zellers et al., 2019b): a left-to-right transformer model. Grover was pretrained on news articles with multiple fields, perhaps making it a good fit for our task, which also has multiple fields of context (like the subreddit, date, and title). Sizes: We study the two largest Grover models: Grover-Large, with 0.3 billion parameters, and Grover-Mega, with 1.5 billion parameters.

b. T5 (Raffel et al., 2019): a sequence-to-sequence model with a bidirectional encoder and a left-to-right generator. T5 was trained on a large dataset of cleaned web text. At the time of writing, T5 is the top-scoring model on the GLUE and SuperGLUE benchmarks (Wang et al., 2019b,a), scoring above human performance on GLUE (90 vs. 87) and near human performance on SuperGLUE (89.3 vs. 89.8).

Sizes: We study the two largest T5 models. As their names suggest, T5-3B has 3 billion parameters, and T5-11B has 11 billion parameters.

c. TF-IDF retrieval: we additionally consider a simple baseline built around retrieval rather than generation. We first precompute bag-of-words TF-IDF vectors for all situations in the training set. Given a new situation, we compute its TF-IDF vector and retrieve the most similar situation from the training set. We then reply with the top-scoring advice for that situation (see the sketch after this list).

Last, to quantify the measurement error of our evaluation, we additionally evaluate:

d. the second-highest rated Reddit advice for each situation. We send this advice through the same pipeline as machine-written advice.

We train our models using cross-entropy, and generate using Nucleus Sampling (Holtzman et al., 2020). We provide additional training and generation details for our models in Appendix B.
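A minimal sketch of the TF-IDF retrieval baseline (item c) using scikit-learn; the variable names are illustrative rather than the paper's actual code.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def build_retrieval_baseline(train_situations, train_top_advice):
        """Precompute bag-of-words TF-IDF vectors for all training situations.

        train_situations: list of situation strings from the training set.
        train_top_advice: the top-scoring advice string for each training situation.
        """
        vectorizer = TfidfVectorizer(lowercase=True)
        train_matrix = vectorizer.fit_transform(train_situations)

        def retrieve_advice(new_situation):
            # Vectorize the new situation and find the most similar training situation.
            query_vec = vectorizer.transform([new_situation])
            sims = cosine_similarity(query_vec, train_matrix)[0]
            # Reply with the top-scoring advice written for that situation.
            return train_top_advice[sims.argmax()]

        return retrieve_advice

Because the returned function maps a newly posted situation straight to existing human-written advice, its outputs stay fluent even when they are off-topic, a property that matters for the diagnostics in Section 5.1.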

In our study, we do not consider purely bidirectional models, such as BERT (Devlin et al., 2019). While these models can be adapted to generate text, their generations are generally worse than those of left-to-right models (Wang and Cho, 2019); moreover, T5 tends to outperform these models even on discriminative tasks. We also do not consider GPT (Radford et al., 2018), another left-to-right model, because making it controllably generate advice would require more extensive changes during finetuning than Grover.

4.1 Quantitative Results

In Figure 4, we show overall results for one evaluation trial, which featured 200 situations posted on Reddit from February 1 to February 12, 2020. As our key metric for measuring the relative usefulness of model-written advice, we evaluate the frequency with which workers prefer the model-written advice over the Reddit-written reference advice. If a model's advice were just as helpful as human advice in aggregate, then that model would score 50%.

Figure 4: Helpfulness of evaluated models, relative to top-scoring Reddit advice. We show results over 200 shared situations; we also show bootstrapped 95% confidence intervals. Advice from the biggest model, T5-11B, is preferred 9% of the time over Reddit advice.

Model performance is quite low. The best model, T5 with 11 billion parameters, scores 9%. Other models, with fewer parameters, do worse - with Grover-Large (0.3B parameters) scoring 3.5%. In comparison, the second-highest scoring Reddit advice scores 40%, and the highest scoring advice is (by definition) 50%. However, in theory, a model could score above 50%, if it writes advice that is truly helpful and thus gets consistently chosen.
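The bootstrapped confidence intervals reported in Figure 4 can be computed roughly as follows; this is a sketch assuming each situation contributes a single 0/1 outcome after the majority vote, and the exact resampling procedure used for the paper's figures may differ.

    import numpy as np

    def bootstrap_preference_ci(preferred, n_boot=10_000, seed=0):
        """Percentile-bootstrap 95% CI for the rate at which model advice is preferred.

        preferred: array of 0/1 outcomes, one per evaluated situation
                   (1 = workers preferred the model's advice over the Reddit advice).
        """
        rng = np.random.default_rng(seed)
        preferred = np.asarray(preferred)
        n = len(preferred)
        # Resample situations with replacement and recompute the preference rate.
        boot_means = np.array(
            [preferred[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)]
        )
        lower, upper = np.percentile(boot_means, [2.5, 97.5])
        return preferred.mean(), (lower, upper)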

4.1.1 Measurement Error

To investigate the measurement error of our evaluation, in Figure 5 we report the statistical significance between pairs of models; details about how this is computed are in Appendix C. For pairs of models with a greater difference in parameter count, we similarly see a large (and statistically significant) difference in performance. For instance, while the improvement of T5-11B over T5-3B was 3%, and not found to be statistically significant, the improvement of T5-11B over Grover-Mega was 5% and highly statistically significant. Overall, the statistical significance results suggest that our evaluation can stably rank model performance. This, along with the finding that model performance is low on our task (≤ 9%), suggests that there is ample room for growth on RedditAdvice.

Figure 5: Improvement (in absolute percentage points) between pairs of models, along with statistical significance as measured by a paired t-test. The improvement of large models over smaller ones is highly significant, such as T5-11B over Grover-Mega (5% gap, p < .01).

5 Analysis And Discussion

So far, we have shown that we are able to reliably evaluate models in our dynamic setup, and that doing so results in model performance that is significantly lower than human performance.

To break down what this gap in performance means, we show a qualitative example in Figure 6 , legaladvice: Employer just informed me I was never put on payroll and "don't have a w2" Very unusual situation. My former employer who laid all of us off suddenly in mid November is just today telling me that I was never "entered into payroll" and don't have a w2. He is suggesting that instead he can write me a check for the taxes already withheld from my pay. This immediately rang alarm bells because I'm not an idiot. I said replied neutrally that I needed time to make a decision. Here's the thing, throughout my employment there I was not provided with paystubs with each paycheck. I repeatedly asked for my paystubs and it was always blamed on "the bookkeeper". I looked it up and the state I'm in does not require employers to provide paystubs. Fine. But when I was laid off I wanted to apply for unemployment and needed a final paystub. After almost a month of daily texts I finally got a final paystub from him, reviewing it now I had almost $1500 withheld in taxes from a gross pay of around $2500 over a period of about 3.5 months. This amount of tax now seems excessively high AND if I "have no w2" does that mean my employer withheld taxes from my pay and did not actually pay that money forward as taxes. What do I need to do here to both stay in the clear tax wise AND potentially recover the money withheld perhaps wrongly?

Top Reddit advice: If you live in a state with a labor board or labor commission start there. This is a big deal and they will be able to provide the most assistance. If your state does not have that, you really might want to call your local district attorney and see if anyone does payroll fraud. After that, the IRS might actually be able to help, they have records on the other side for your pay, if your boss withheld and sent in the tax payment which I doubt. You are actually the victim of a crime here, not just a lazy accounting. Your employer could feed all the payroll info into business software and get it all straightened out, but they aren't. I am not recommending a CPA or tax person because you're going to spend half your total pay on them, as your total income is well below the free tax filing software threshold.

TF-IDF retrieval: This is a company that is big enough to have a payroll department and they did this? Jesus - I don't know where to even start with this. Swalsh is right - it's not only illegal it's potentially jail time illegal.

Grover-Large: Is there a method by which you can "dismiss" this information as false? If so, dispute it with the IRS.

Grover-Mega: If you can prove that they never actually paid you for work you did, you can sue them for that money.

T5-3B: Is your employer US based? Because if they were (you'll need to make a claim to the IRS), then you would have had a W2 issued to you with the taxes withheld from your pay, at minimum. If you were truly not on payroll, you still have a W2 (since you were still earning money), but that's probably for taxes withheld from your paychecks as well as any state and federal income taxes that were withheld. Neither of those numbers look good.

Second-best Reddit advice: So it definitely is not quite adding up. Talk with a CPA, they can probably guide you through dealing with this. They would be able to keep you in the clear. It probably won't be free but shouldn't be too costly. Do this soon because they get bombarded with work between now and April so keep that in mind. Hiring a lawyer or going after them legally doesn't seem worth it. And CPA's are certified to deal with the IRS on your behalf if they decide to come calling or asking questions.

Figure 6: A qualitative example; more are in Appendix E. Though machine-generated advice matches keywords from the situation, upon a close read it is frequently not helpful or even self-contradictory. The issues are due to critical errors in natural language understanding, such as reading comprehension, entailment, and coreference.

Figure 7: Distribution of ratings for three evaluated models: the retrieval baseline, Grover-Mega, and T5-11B; along with ratings for the second-best rated Reddit advice. Though generative models like Grover and T5 are preferred more often than the TF-IDF retrieval baseline, they also often struggle to generate coherent advice. Of note, 31% of the advice from T5 would never be helpful in any situation, versus 4% from the retrieval model.

5.1 Problems With Machine-Written Advice

As part of our evaluation, we wish to quantitatively measure problems with machine-written advice. Recall that in our crowdsourcing setup (Section 3.1.3), we ask workers not only to select which advice is better, but also to annotate problems with the worse piece of advice. We find workers have high agreement throughout the diagnostic annotation process; moreover, we use three workers per piece of advice for additional consistency. 9 In Figure 7, we show the distribution of ratings for model-written versus human-written advice. Machine-written advice that was not preferred over human-written advice can receive the following ratings. It can be rated as Slightly helpful (but worse mainly due to a Meaning problem or a Writing problem), or as either Not helpful or Dangerous (in which case it could be Possibly helpful in some other situation, or Never helpful in any situation). 10 The diagnostics show several interesting patterns.

9 For classifying machine-written advice as 'helpful' versus 'not helpful' or 'dangerous' (combining the two latter categories into one), we have κ = 0.689. For breaking down helpful advice into 'meaning problem' versus 'writing problem', we have Cohen's κ = 0.646; for rating unhelpful advice as 'possibly helpful' versus 'never helpful,' we have κ = 0.636.

10 We found workers rarely chose Dangerous (2%), so for ease of visualization, we combined it with Not helpful.
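Agreement values like the κ statistics above can be computed from paired worker labels; a small sketch with scikit-learn, where the worker labels shown are hypothetical.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical paired labels from two workers rating the same pieces of advice
    # as 'helpful' versus 'not helpful or dangerous'.
    worker_1 = ["helpful", "not_helpful", "helpful", "not_helpful", "helpful"]
    worker_2 = ["helpful", "not_helpful", "not_helpful", "not_helpful", "helpful"]

    print(cohen_kappa_score(worker_1, worker_2))  # chance-corrected agreement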

First, stronger generators improve over weaker generators: in comparing T5-11B to the weaker machine model, Grover-Mega, we find it tends to produce less 'Not helpful/Dangerous' advice. 26% of T5-11B's advice is Never helpful in any situation, versus 29% for Grover-Mega; 21% is unhelpful but could be Possibly helpful in another situation, versus 32% for Grover-Mega.

Second, and perhaps most surprising, we find that all generators frequently commit natural language understanding errors during generation, including internal contradiction. Because of this, we find that our simple baseline - TF-IDF bag-of-words retrieval - is competitive with deep generators with billions of parameters. While its advice is often irrelevant (84% of the time), it is almost never complete gibberish - since it is retrieved from top-scoring advice. In fact, only 3% of its advice was rated Never helpful in any situation, versus 26% for T5.

5.2 A Leaderboard For Advice Evaluation

So far, we have presented the results from one round of a dynamic evaluation. We propose to keep that evaluation ongoing, through a dynamic leaderboard at rowanzellers.com/advice.

At the time of writing, the leaderboard works as follows. Users submit a model API to be dynamically evaluated. The new model, along with the highest rated previously-evaluated model, will be evaluated for an additional round -using a new set of 200 situations posted over the last two weeks, using the same approach as in Section 3.1.3.

One potential concern, however, is price. In our Mechanical Turk workflow, we paid workers 52 cents per HIT. 11 After the Mechanical Turk fee, and using 3 workers per piece of advice, this costs $1.86 per piece of advice, or $372 for 200 pieces of advice. We argue that this cost should not prohibit a human evaluation, particularly when compared with the electricity cost of model development (Strubell et al., 2019) . To ensure that the cost is fairly distributed among the community, we propose that submitters to our leaderboard pay the Mechanical Turk bill. 12

5.3 Length And Complexity

One interesting aspect of RedditAdvice is that its situations are long and complex. We show a distribution of lengths in Figure 8 . Not only are the inputs and outputs complex, but we argue that the mapping between them is complex as well: it is necessarily one-to-many, as there might be many possible kinds of good advice for a situation. We believe evaluating by task success -how much the advice helps a user -is the key reason why a task like advice-giving can be scaled up. First, intrinsic motivation rewards high data quality: users who post situations are motivated to add relevant details, and advice-givers are motivated to help the user. Second, task success can stably evaluate long passages. Two pieces of advice rarely mean exactly the same thing, but this does not mean we cannot evaluate which is more helpful.

Figure 8: Length distribution of RedditAdvice, compared with other common NLU benchmarks (HellaSWAG, Zellers et al. 2019a; GLUE, Wang et al. 2019b; SuperGLUE, Wang et al. 2019a). The examples in RedditAdvice are significantly longer, representing highly complex situations.

5.4 Relation to Existing NLP Tasks

Shared "core" tasks such as reading comprehension and natural language inference are of considerable interest to the NLP community. Many datasets have been proposed for these tasks, and progress on them is often measured through auto-gradeable correctness metrics. However, large models have started to outperform humans on these datasets, raising doubt that further progress on them brings us closer to human-level language understanding.

We argue two things: first, that many NLP tasks are necessary components of giving advice, and second, that because giving advice remains far from solved, these tasks are also far from solved. In Appendix E, we study problems with advice from T5-11B from the point of view of existing NLP tasks. For instance, machine advice often contradicts itself, suggesting that today's systems struggle with the general task of natural language inference.

One interesting line of research would be to transfer knowledge from supervised NLP datasets into a generative setting. 13 However, one difficulty is that datasets are necessarily curated, in terms of both the label space and the data distribution. Paragraphs of machine-written advice that exhibit many kinds of language understanding errors might be significantly out-of-distribution.

We propose another way forward. Predicting the advice ratings themselves (such as whether advice could ever be helpful) is itself a language task, one that might provide signal for better generation. 14 Overall, this suggests that evaluating machines by their language use might lead to progress on existing NLP tasks in two ways. First, by studying a generative setting, we necessarily adopt a broad and inclusive definition of the task at hand; second, we can turn common challenges into small discriminative tasks to study further.

5.5 How Can We Build Models That Are Better At Giving Advice?

Over the last few years, a major trend in NLP has been towards developing bigger models, while making fewer changes to the neural architecture and the training objective. Almost all of today's leaderboard models are Transformers (Vaswani et al., 2017) trained through the maximum-likelihood objective of predicting masked-out (or next) words. Our experimental results suggest that even though these models struggle with RedditAdvice, we are still able to measure small gains. At the same time, our results suggest that scaling up parameter counts might not be enough. With 11 billion parameters, machines score 9% on our benchmark, versus 50% for humans (Figure 4).

We hypothesize that a key challenge for our field will be to move away from training models through word prediction objectives. We hypothesize that word prediction makes a model uniquely suited for correctness-based tasks, as word prediction itself is a task with a single correct answer. However, word prediction might not necessarily lead to the formation of mental models about the world. Other ideas include different modeling choices (Bengio, 2017) and learning paradigms (Mao et al., 2019). These and other directions seem promising for building better models that are not just better advice-givers, but better at real-world language use in general.

13 Another idea is crowdsourcing data specifically to improve generation models, e.g. dialogue (Welleck et al., 2019).

14 To allow for study of this problem, we make the full evaluation results - including the advice ratings - public.

5.6 Ethical Implications; Possible Dual Use

One benefit of our proposal is that evaluating machines by their language use, on tasks with intrinsic motivation, might enable progress towards social good applications. For example, machines might one day help people who need advice -potentially on sensitive topics.

However, we do not claim that our approach is a panacea. We should approach purely technological solutions to societal problems (such as mental health care) with a grain of salt. Moreover, progress on using language effectively might yield models that cause harm, such as through generating disinformation (Zellers et al., 2019b) . We as a community should (continue to) study and be mindful of these kinds of dual use issues (Hovy and Spruit, 2016; Green and Viljoen, 2020) .

6 Conclusion

In our work, we introduced new methodology for evaluating language tasks, reducing the gap between our benchmarks and the real world. We also introduced a new challenge for the community, TuringAdvice, with an accompanying dataset and dynamic leaderboard, RedditAdvice. Today's largest models struggle on RedditAdvice, so we are excited to see what new models get developed.

A.1 Dynamic Filtering Criteria

To retrieve valid situations, we apply the following criteria:

a. ... and LegalAdvice.

b. We skip 'update' posts, in which a user refers to an older situation that they posted, and 'meta' posts, in which subreddit rules are discussed.

c. We skip any post that has an HTML link, since today's models (presumably) would not be able to visit such a link.

d. We skip any post with a score of less than 20.

e. We do our best to clean the text of the post.

Many posts include valid situations, but are then edited to include updates that took place afterwards, in response to advice that was given. These are typically delimited by dashed lines and the word EDIT or UPDATE.

f. Posts in some of the subreddits (Dating_Advice, Dating, Love, Marriage) are often in the form of tips and general suggestions, rather than situations. We skip any posts from these subreddits that do not include a question mark.

g. We filter out posts that contain sensitive topics, such as assault, suicide, and abuse.

h. Last, we skip any post that in total is fewer than 128 spaCy tokens, or longer than 1280 spaCy tokens.

For a retrieved situation, we do the following to extract valid advice:

a. Given a post that contains a valid situation, we order the comments from highest- to lowest-scoring. We perform the following checks to determine if we can extract valid advice. Once we find valid advice, we stop iterating.

b. We skip any comment that was posted by a moderator, posted by the Reddit user who posted the original situation, or edited.

c. We skip any comment with a score of less than 20.

d. We skip any comment that contains fewer than 32 spaCy tokens.

e. One corner case is highly-scoring advice comments that refer implicitly to others. For instance, a comment might say 'You should listen to the other commenters and...' These references make sense inside a Reddit post; however, they are somewhat nonsensical when we pull the comment out of context. We thus skip any comment that seems to refer to others.

Once we retrieve a situation that has at least one piece of valid advice, we are done, and we move on to the next situation. We loop over the top-scoring 1000 posts in total, and randomly select 200 valid situations from this pool.
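A sketch of the situation filter described above, assuming each Reddit post is a dict with 'title', 'selftext', and 'score' keys; the field names, regexes, and 'update'/'meta' checks are illustrative assumptions rather than the paper's code.

    import re
    import spacy

    # A blank English pipeline is enough for counting spaCy tokens.
    nlp = spacy.blank("en")

    EDIT_MARKER = re.compile(r"(?m)^\s*(?:[-_*]{3,}|EDIT\b|UPDATE\b)", re.IGNORECASE)
    LINK = re.compile(r"https?://|\[[^\]]+\]\([^)]+\)")

    def clean_post_text(text):
        # Drop trailing EDIT / UPDATE sections, typically delimited by dashed lines.
        match = EDIT_MARKER.search(text)
        return text[: match.start()].strip() if match else text.strip()

    def is_valid_situation(post, min_score=20, min_tokens=128, max_tokens=1280):
        title, body = post["title"], clean_post_text(post["selftext"])
        if post["score"] < min_score:
            return False
        if title.lower().startswith(("update", "[update]", "meta", "[meta]")):
            return False                              # skip 'update' and 'meta' posts
        if LINK.search(body):
            return False                              # skip posts containing links
        n_tokens = len(nlp(title + "\n" + body))
        return min_tokens <= n_tokens <= max_tokens   # length limits in spaCy tokens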

A.2 Static Filtering Criteria for RedditAdvice2019

As mentioned in the main text of the paper, we used less stringent requirements to retrieve the static training dataset RedditAdvice2019. We did this because we hypothesize that today's neural generators are data-hungry: though we could retrieve the top-scoring situations and advice for each two-week span, this might not be enough to sufficiently train a model. Moreover, a single post (situation) on Reddit might have several comments that constitute reasonable advice. We use the following static filtering criteria. For efficiency, we were able to retrieve all of the static training data from the PushShift Reddit dump that was posted before August 1, 2019. 15 We list the changes we make to the dynamic filtering criteria listed in Appendix A.1.

a. We use all posts that were posted to one of:

Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Love, Marriage, InternetParents, TechSupport,

B.1 Input Format

Each example consists of the following fields:

i. The subreddit on which the situation was posted,
ii. The date on which it was posted,
iii. The title of the situation post,
iv. The body of the situation post,
v. The advice posted in response to the situation.

We adapt Grover to this setting by giving the model all of these fields in the given order (from i-v). Similar to how the model was pretrained, we include a field-specific start and end token in each field, which allows the model to generate advice conditioned on the other fields.

In T5, the authors handle diverse tasks by prepending each field with its name (like Situation:) and concatenating the resulting fields. We do the same here. We place the context fields i-iv in the bidirectional encoder, and the target field (advice) is generated by the left-to-right decoder.
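A sketch of this input construction for T5: each context field is prepended with its name and the fields are concatenated into the encoder input, with the advice alone as the decoder target. The exact field-name strings here are assumptions.

    def build_t5_example(subreddit, date, title, body, advice):
        """Format one example for a text-to-text model such as T5.

        Context fields (i-iv) go to the bidirectional encoder; the advice
        (field v) is the target generated by the left-to-right decoder.
        """
        encoder_input = " ".join([
            f"Subreddit: {subreddit}",
            f"Date: {date}",
            f"Title: {title}",
            f"Situation: {body}",
        ])
        return encoder_input, advice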

For the retrieval model, we combine the context fields (i-iv) into the same TF-IDF bag-of-words representation.

B.2 Length Adaptation

As shown in Figure 8, our task contains inputs that are much longer than what has usually been explored in prior NLU work. For comparison, Grover (Zellers et al., 2019b) was trained on shorter texts (up to 1024 tokens) with absolute position embeddings. We thus pretrained Grover for 20k additional steps on three million news articles, using a new maximum length of 1536. We then finetuned Grover on RedditAdvice using a sequence length of 1536. We hypothesized that this extra step might be unnecessary for T5, as it uses relative position embeddings (Shaw et al., 2018). We finetuned T5 on RedditAdvice, using a context length of 1280 and a target length of 512.

Nevertheless, in 6% of cases, contexts are still too long. If this happens, we divide contexts into paragraphs and trim the middle ones, as often the first and last paragraphs contain important information (such as a summary or a question).
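A sketch of the trimming heuristic just described: when a context is still too long, paragraphs are dropped from the middle while the first and last paragraphs are kept, since they often contain a summary or the question. The token counter here is a simple stand-in for the model's real tokenizer.

    def trim_middle_paragraphs(text, max_tokens, count_tokens=lambda s: len(s.split())):
        """Drop middle paragraphs until the text fits within max_tokens."""
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        kept = list(paragraphs)
        # Remove the paragraph closest to the middle, repeatedly, keeping the
        # first and last paragraphs intact.
        while len(kept) > 2 and count_tokens("\n\n".join(kept)) > max_tokens:
            kept.pop(len(kept) // 2)
        return "\n\n".join(kept)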

B.3 Training Generative Models

We finetune our learned models using a cross-entropy loss. We trained Grover to predict all fields, 16 whereas we only trained T5 to predict the advice field (v), as the context is encoded bidirectionally.

We optimized our models using AdaFactor (Shazeer and Stern, 2018) . We validated the number of epochs and the learning rate using a small grid search over the validation set. We kept other hyperparameters to be the same as how the models were originally pretrained. For Grover-Large, we finetuned for 20 epochs with a learning rate of 1e-5 and batch size 512; for Grover-Mega, we finetuned for 20 epochs with a learning rate of 5e-6 and batch size 512; for T5-3B, we finetuned for 10 epochs with a learning rate of 2e-3 and batch size 128; for T5-11B, we finetuned for 5 epochs with a learning rate of 1e-3 and batch size 128.

B.4 Generation Through Nucleus Sampling

For open-ended generation tasks such as ours, past work has shown that straightforward sampling - along with maximization approaches like beam search - tends to result in degenerate text (Holtzman et al., 2020). In our work, we use Nucleus Sampling (Holtzman et al., 2020) to limit the variance of generated text. We use a threshold of p = 0.95, meaning that at each timestep we only sample from the most probable 95% of the distribution.
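A minimal, model-agnostic sketch of Nucleus (top-p) Sampling with p = 0.95: at each timestep, keep the smallest set of most-probable tokens whose cumulative probability reaches p, renormalize, and sample from that set. This is a generic illustration, not the paper's implementation.

    import numpy as np

    def nucleus_sample(probs, p=0.95, rng=None):
        """Sample one token id from a next-token distribution with top-p filtering.

        probs: 1-D array of next-token probabilities that sums to 1.
        """
        if rng is None:
            rng = np.random.default_rng()
        order = np.argsort(probs)[::-1]              # most probable tokens first
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix with mass >= p
        nucleus = order[:cutoff]
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize
        return int(rng.choice(nucleus, p=nucleus_probs))

In practice this filtering is applied to the model's next-token distribution at every decoding step.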

C Measuring Statistical Significance

Here, we describe how we compute statistical significance for Figure 5. For measuring statistical significance, we use a continuous version of the advice preference. The machine advice gets 1.0 points from a worker if it is chosen as Definitely more helpful, and 0.5 points if it is Slightly more helpful. We use point values of -1.0 and -0.5 for advice that is rated as Definitely less helpful and Slightly less helpful, respectively. For a single piece of advice, we average together the point values from all workers that agreed with the majority vote.

For example, suppose that for a single pair, Workers 1 and 2 prefer the human-written advice, and Worker 3 prefers the machine-written advice. We only use the responses from Workers 1 and 2, who agree with the majority vote. If Worker 1 rates the machine-written advice as Definitely less helpful, and Worker 2 as Slightly less helpful, then the score of the machine advice is ((-0.5) + (-1.0)) / 2 = -0.75.

We can then use these scores to compare two different machines, using a paired t-test.
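A sketch of this scoring scheme and the paired t-test; the rating strings are hypothetical stand-ins for the worker labels described above.

    from collections import Counter
    from scipy.stats import ttest_rel

    POINTS = {
        "definitely_more_helpful": 1.0,
        "slightly_more_helpful": 0.5,
        "slightly_less_helpful": -0.5,
        "definitely_less_helpful": -1.0,
    }

    def advice_score(worker_ratings):
        """Continuous score for one piece of machine advice: average the point
        values of the workers who agree with the majority preference."""
        signs = [1 if POINTS[r] > 0 else -1 for r in worker_ratings]
        majority = Counter(signs).most_common(1)[0][0]
        points = [POINTS[r] for r, s in zip(worker_ratings, signs) if s == majority]
        return sum(points) / len(points)

    # The worked example from the text: two workers prefer the human advice.
    advice_score(["definitely_less_helpful",
                  "slightly_less_helpful",
                  "slightly_more_helpful"])   # -> -0.75

    def compare_models(scores_model_a, scores_model_b):
        """Paired t-test over per-situation scores for two models."""
        return ttest_rel(scores_model_a, scores_model_b)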


Figure 10: Helpfulness of evaluated models, separated by domain. The format is the same as Figure 4, except here we separate results by the type of subreddit - covering relationship advice (relationships, relationship_advice, dating_advice, dating, Marriage, love); legal advice (legaladvice), or life advice (internetparents, needadvice, techsupport). The results don’t show a clear pattern of some domains being harder than others.

D Miscellaneous Analysis

D.1 Workers

We plot the number of annotations done per Mechanical Turk worker in Figure 9 , for the Feb 1 to Feb 12 evaluation. Overall, 22 workers participated in our evaluation, though this number also includes workers who completed very few HITs. The top 15 workers annotated 98.5% of the data.

Figure 9: Distribution of the number of annotations for each worker in our Mechanical Turk evaluation.

D.2 Are Some Domains Harder Than Others?

One question might be whether some advice domains are inherently more challenging than others. We present results in Figure 10 that do not seem to suggest a clear pattern of this. Over all advice domains, we see the same trend of human performance being high, and machine performance being low. However, it seems for all generators, 'Legal' advice is slightly less preferred than 'Relationship' advice -though the error bars overlap. This result might be somewhat surprising, as Mechanical Turk workers are (probably) not lawyers, but are still able to reliably spot model-written legal nonsense.

E Additional Qualitative Analysis

In this section, we provide additional qualitative examples. A second qualitative example, with generations from all models, is shown in Figure 11 .


In Figures 12, 13 , and 14, we categorize problems with machine-written advice under the framework of other core NLP tasks. Figure 12 is an unabridged version of the teaser figure.


The generated advice has key issues that fall under the purview of many language tasks, broadly:

a. Natural Language Inference (e.g. Dagan et al., 2006; Bowman et al., 2015): determine whether a passage entails or contradicts another (or neither). Generated advice often contradicts the provided situation, or even itself.

b. Reading Comprehension (e.g. Rajpurkar et al., 2016): read and understand a passage (possibly, to be able to answer questions). Good advice requires us to first understand the situation at hand.

c. Coreference Resolution (e.g. Pradhan et al., 2012): identify repeated entities in a document. Good advice requires us to identify who is who in a document, and not to mix people up.

d. Social Commonsense Reasoning (e.g. Sap et al., 2019): identify people's intentions, feelings, and motivations in social interactions. Many of these situations are inherently social, so good advice often requires reasoning about social situations.

e. Physical Commonsense Reasoning (e.g. Zellers et al., 2018; Bisk et al., 2020): have some notion of intuitive physics, and apply it to new situations. Many of these situations relate to physical settings, so writing good advice requires some physical commonsense reasoning.

However, since the data distribution of these problems is complex in nature - as they manifest over long passages of advice - they might not overlap well with past (clean) datasets for these tasks. This suggests that giving advice is a promising setting in which we can study and make progress on these language tasks.

relationships: My (31F) downstairs neighbor (65/70ish F) is relying too much on me and I don't know how to draw a limit without disrespecting her. Hi! I've been living alone in my apartment for about two and a half years, and I have very few neighbors (it's a small building of 12 apartments, and two out of the four apartments that surround mine are empty). The neighbors upstairs are two very aggressive and addicted men who everyone hates, so we don't talk to them. My downstairs neighbor is a widow in her late 60s early 70s , her husband died while they were sleeping about two years ago, and they had no children, she has no family and her only real life "friends" is an elderly couple who live on the other side of the building (they just say hi and talk for a bit when they meet on the hallway). I say "friends" because she says she has many, but all online, she kind of works selling stuff on Facebook and Twitter. She has a few mobility issues (she can walk but used a cane and is very slow), so she hired a girl to help her do basic stuff like walking down to the grocery store, going to the bank with her,etc. The girl is on summer vacation so she asked me if I could buy some things for her when I go to the store, I said yeah sure No problem. But then she said "great, so you can come with me to the bank on Monday, and then take me to pay the bills" she didn't ask it, she said it. I helped her a few times (walked her down the street, or went to the store for myself and got her a few things on the way). We share my WiFi and she's super responsible when she has to pay. The thing is, that's about as much as I can help, because I have a full time job, and I'm barely in my house. I feel like shit having to say no to her many requests to: help her fix her cellphone, or explain to her why her Facebook doesn't work because someone "reported" her account (I don't use Facebook so I wouldn't know how to help her), she texts me constantly telling me to drop by, but every time I do she talks for hours about her medication and depression, and her husband... I get it, I know she's lonely and I swear it breaks my heart. But I can't be her friend or her companion. I can be a helpful neighbor once in a while, and I help her (selflessly, I think) as much as I can. It's peak summer here and the power has been out for the last two or three days in the neighborhood, so I haven't been home much. Right now I'm writing this because she's begging me to go help her with something she didn't say what was, but I know it's just an excuse to talk to someone. Again, this breaks my heart but I have commitments to attend to. How do I politely tell her that I can't be there all the time, that I am not her companion? I don't want to offend her, she's sweet and very kind, but I feel really uncomfortable when she asks me to be a companion. Tl;dr: My elder neighbor is dependant of me and I feel bad having to put a stop on it.

Second-best Reddit advice

You may have to become untrustworthy. I've been in this situation with people who simply would not hear the word no, and eventually I just failed to show up, failed to do the thing, whatever. It sucked, and I felt terrible, but I swear it was the only way to get my point across.

Figure 11: A second qualitative example, following the same format as Figure 6, along with generations from all the models. (Annotation labels shown in the figure: Definitely less helpful; Slightly helpful; Meaning problem.)

internetparents: How do I tell a teacher I am unable to do a project due to past trauma?

I'm a senior in High School and have been recently assigned to do a at home dissection and create a displayable animal skeleton. In my freshman year i tried dissecting a pig in class but had a severe panic attack in front of my class and I was unable to complete it and had to do an alternate assignment, as well as other experiences in my childhood that have left me distressed at the thought of dead animals. I tried talking to my teacher this past week to opt out of the project and do an alternate one, but she said "i'm making more out of it than necessary" and that I'll be fine. She then showed me photos of dismembered animals from other projects, and I started crying and was unable to continue the conversation. This project is going to be a major impact on my final grade in this class, so im desperate to do an alternate project, but she doesn't seem willing to do so. I don't want to play a 'victim' card, or seem like im just trying to get out of a hard project, but I don't know how to approach her. I'm really just anxious about the whole thing, and im feeling miserable that i'm even doing this, but i just cant cope with dead animals, especially first hand. Any advice is very appreciated, thank you!

Top Reddit advice

For what it's worth, creating alternate assignments in moments like these is often difficult. Your teacher is probably tired and stressed, and generating a whole new project for a single student is *hard*. That said, it would be borderline-unreasonable for them to refuse accommodation in a case like this. Consider offering up an alternative yourself, to give them an easier "out". If you would feel comfortable with an owl pellet, assembling a tiny skeleton from a pellet serves much the same purpose but avoids soft tissue. Plus, you know the animal didn't die solely for the purposes of education, but as part of an ecosystem. If speaking with your counselor or principal doesn't initially help, you should involve your parents. They can take the same steps outlined above, and have a bit more swing. I recommend starting the process yourself, however, and adding your parents as a last resort. Good luck! Feel free to shoot me a message if you have further questions. I can't promise I'm any help, but I can say that I'll *try*. u/Mzieg's advice is right on the money also!

T5-11B

>this past week to opt out of the project and do an alternate one, but she said "i'm making more out of it than necessary" and that I'll be fine.

If you've already approached her and she's dismissing your concerns, then I think you should go back to her and state again -"I'm asking you to do a project that requires me to see dead animals -I'm afraid that this project is a dealbreaker for me." If she refuses to concede, you should go to your principal. He or she should know about your past trauma.

Figure 12: An example with advice from T5. This is the unabridged version of the example from the teaser (Figure 1), which was shortened there for length. It has the following NLU problems (annotation labels shown in the figure: Definitely less helpful; Not helpful; Never helpful):

1. Coreference: T5 suggests the advice-seeker say "I'm asking you to do a project that requires me to see dead animals." This confuses who is who in the context. The science teacher was the one who asked the advice-seeker to do a project requiring them to see dead animals.

2. Social Commonsense Reasoning, Reading Comprehension, and Natural Language Inference: T5 suggests that the advice-seeker go to the principal, but says that they "should know about [the advice-seeker's] past trauma." However, it's likely a bad idea to tell the principal about personal details such as the advice-seeker's past trauma, for two reasons. First, the human-written advice suggests that the most effective strategy is to "be succinct" and to summarize those feelings as "moral and emotional reasons." Second, the advice-seeker specifically says that they "don't want to play a 'victim' card." Telling the advice-seeker to describe their trauma to the principal, without acknowledging their concerns, seems like a contradiction here.

legaladvice: Kids threw a block of ice at my car January 20th I was driving down a residential road past a house where three boys about aged 10/11 were playing at the end of the driveway. One grabbed a sizeable block of ice and hurled it into the side of my car as I passed. I stopped and the boy who threw it was profusely apologizing. I rang the doorbell, mom comes out, and I tell her what happened. She says, boys, "come inside"! And then, "which one did it?" I told her, he admitted that he did it. Then, she closed the door. I live about 3 min away so I drove home and had my neighbor look at my car with me. There are 3 dents where the block of ice hit. I just bought this car certified pre-owned and had only made one payment at this time so I know for a fact that the damage is from the block of ice. I drove back to the house and said hey look, my car is damaged so I'd like to exchange info so I can get this fixed. She said, 1) how can you be sure that it's from them? and 2) a little snow doesn't do that much damage. I explained again that it was a block of ice. She told me to call the police. Police come, he takes it very seriously and lectures both the boys and the parents about how dangerous that is and asks me if I want citations issued for the boys. I say no, I just want my car fixed. He tells them they are lucky that's all I want, agrees to write the police report. The mother gives her side, lies to the police and tells them her kid's friend who is no longer present is the one who threw the ice. It was actually her younger son. I didn't correct her because I was so shocked and confused and the other boy wasn't there at the time. Over the next two days, I go get an estimate for the repair ($1327) and a copy of the police report. The officer asks me to try to resolve it with them and if I'm unable to do that, to let him know and he'll circle back and issue citations to compel them to pay restitution. At this point, the parents are not returning my calls so I'm just wondering what comes next. I expect this will land in small claims court because the parents seem positioned to ignore and then say that I can't prove anything, plus they are placing blame on the boys' friend who won't speak to the police. Will the violations actually be enough to draw restitution so I can avoid court? If not, and I end up in court, what is the likelihood that the judge will side with them and say I can't prove the damage is specifically from the ice, even though the police report indicates that the parents and boys admit to it happening?

Top Reddit advice

>The officer asks me to try to resolve it with them and if I'm unable to do that, to let him know and he'll circle back and issue citations to compel them to pay restitution.
>
>At this point, the parents are not returning my calls so I'm just wondering what comes next.

Not sure why you're wondering what comes next -call the police back and tell them they're not being cooperative. You can always just sue them in Small Claims Court.

T5-11B

I would also call CPS. Kids need supervision. They threw a heavy block of ice and damaged your car. It's negligence to allow kids to play with ice (which is literally a hardened snow/water but that depends on where you live)

Figure 13: An example with advice from T5. It has the following NLU problems (annotation labels shown in the figure: Definitely less helpful; Dangerous; Never helpful):


An advice-giver can recommend or advise against a particular action; they can also provide information about options, or offer support (Dalal and Bonaccio, 2010).

We use advice from the following subreddits: Love, Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Marriage, InternetParents, TechSupport, and LegalAdvice.

Users on Reddit can also reply to the replies themselves, in a hierarchical way. For simplicity, however, we don't incorporate these nested replies in our dataset.
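As a minimal sketch of this flattening step (assuming pushshift-style comment records with `link_id` and `parent_id` fields; the helper name is ours), keeping only top-level replies amounts to:

```python
def top_level_replies(comments):
    """Keep only direct replies to the submission, dropping nested replies.

    Assumes pushshift-style comment dicts; a top-level comment's `parent_id`
    is the submission itself (a "t3_..." id), i.e. it equals `link_id`.
    """
    return [c for c in comments if c.get("parent_id") == c.get("link_id")]

# Example (hypothetical records):
# comments = [{"link_id": "t3_abc", "parent_id": "t3_abc", "body": "top-level"},
#             {"link_id": "t3_abc", "parent_id": "t1_xyz", "body": "nested"}]
# top_level_replies(comments)  # -> only the first comment
```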

This is somewhat of a simplification, as other factors also influence what gets upvoted (Anderson et al., 2012; Lakkaraju et al., 2013; Muchnik et al., 2013; Jaech et al., 2015).

One alternative might be to post advice, but add a disclaimer that the advice was AI-written, and so should be taken with a grain of salt. We tried posting advice in this way, for a filtered set of non-volatile situations where no one was at imminent risk, but subreddit moderators banned our account.

See Appendix A.1 for information about the selection.

Our training set contains 600k pieces of advice from July 2009 to June 14, 2019; validation contains 8k from June 14 to July 9th 2019.
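A minimal sketch of this date-based split, assuming Reddit-style `created_utc` Unix timestamps on each example (the cutoffs come from the footnote above; the helper itself is illustrative, not our released preprocessing code):

```python
from datetime import datetime, timezone

TRAIN_CUTOFF = datetime(2019, 6, 14, tzinfo=timezone.utc)  # end of training period
VAL_CUTOFF = datetime(2019, 7, 9, tzinfo=timezone.utc)     # end of validation period

def split_by_date(examples):
    """Split examples into train/validation sets by posting date.

    Assumes each example dict carries a Reddit-style `created_utc` timestamp.
    Situations posted after VAL_CUTOFF are left for the dynamic evaluation.
    """
    train, val = [], []
    for example in examples:
        posted = datetime.fromtimestamp(example["created_utc"], tz=timezone.utc)
        if posted < TRAIN_CUTOFF:
            train.append(example)
        elif posted < VAL_CUTOFF:
            val.append(example)
    return train, val
```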

We chose this to pay workers at least $15 per hour.

This model is used for the HYPE leaderboard in computer vision.


Available at https://pushshift.io/.

The finetuning over the context fields i-iv is not strictly necessary, as we never need to generate those fields at test time. However, we opted to finetune on them anyway, in order to provide more signal during training. We scaled the loss on the context fields to be 1/10th as large, to encourage the model to primarily learn how to generate advice.
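A minimal sketch of this loss weighting, assuming a standard token-level cross-entropy and a boolean mask marking context-field tokens (the function and tensor names are illustrative, not the actual T5 training code):

```python
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, targets, is_context, context_weight=0.1):
    """Cross-entropy where context-field tokens contribute 1/10th as much loss.

    logits: (batch, seq_len, vocab) model outputs
    targets: (batch, seq_len) gold token ids
    is_context: (batch, seq_len) bool mask, True for tokens in the context fields
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    weights = torch.where(
        is_context,
        torch.full_like(per_token, context_weight),  # context fields: scaled down
        torch.ones_like(per_token),                  # advice tokens: full weight
    )
    return (per_token * weights).sum() / weights.sum()
```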

Natural Language Inference: The second sentence contradicts the first. It says that the boyfriend shouldn't be "running his mouth about [his insecurities] in front of people he doesn't even know that well"; however, the first sentence says that the boyfriend was bragging to his friends.