My Computer Is an Honor Student - but How Intelligent Is It? Standardized Tests as a Measure of AI

Authors

Peter Clark, Oren Etzioni

Abstract

Given the well-known limitations of the Turing Test, there is a need for objective tests to both focus attention on, and measure progress towards, the goals of AI. In this paper we argue that machine performance on standardized tests should be a key component of any new measure of AI, because attaining a high level of performance requires solving significant AI problems involving language understanding and world modeling - critical skills for any machine that lays claim to intelligence. In addition, standardized tests have all the basic requirements of a practical test: they are accessible, easily comprehensible, clearly measurable, and offer a graduated progression from simple tasks to those requiring deep understanding of the world. Here we propose this task as a challenge problem for the community, summarize our state-of-the-art results on math and science tests, and provide supporting datasets.

Science And Math As Challenge Areas

Standardized tests have been proposed as challenge problems for AI, for example, by Bringsjord and Schimanski (2003), Bringsjord (2011), Beyer et al. (2005), and Fujita et al. (2014), as they appear to require significant advances in AI technology while also being accessible, measurable, understandable, and motivating. They also enable us to easily compare AI performance with that of humans. In our own work, we have chosen to focus on elementary and high school tests (for 6-18 year olds) because the basic language-processing requirements are surmountable, while the questions still present formidable challenges to solve. Similarly, we are focusing on science and math tests, and have recently achieved some baseline results on these tasks (Seo et al. 2015, Koncel-Kedziorski et al. 2015, Khot et al. 2015, Li and Clark 2015, Clark et al. 2016). Other groups have attempted higher-level exams, such as the University of Tokyo entrance exam (Strickland 2013), and more specialized psychometric tests such as SAT word analogies (Turney 2006), GRE word antonyms (Mohammad et al. 2013), and TOEFL synonyms (Landauer and Dumais 1997).

We also stipulate that the exams are taken exactly as written (no reformulation or rewording), so that the task is clear, standard, and cannot be manipulated or gamed. Typical questions from the New York Regents 4th grade (9-10 year olds) science exams, SAT math questions, and more are shown in the next section. We have also made a larger collection of challenge questions, drawn from these and other exams, available on our website. 1 We propose to leverage standardized tests, rather than synthetic tests such as the Winograd schema (Levesque, Davis, and Morgenstern 2012) or MCTest (Richardson, Burges, and Renshaw 2013), because they provide a natural sample of problems and more directly suggest real-world applications in the areas of education and science.

Exams And Intelligence

One pertinent question concerning the suitability of exams is whether they are gameable, that is, answerable without requiring any real understanding of the world. For example, questions might be answered with a simple search-engine query or through simple corpus statistics, without requiring any understanding of the underlying material. Our experience is that while some questions are answerable in this way, many are not. There is a continuum from (computationally) easy to difficult questions, where more difficult questions require increasingly sophisticated internal models of the world. This continuum is highly desirable, as it means that there is a low barrier to entry, allowing researchers to make initial inroads into the task, while significant AI challenges need to be solved to do well in the exam. The diversity of questions also ensures a variety of skills are tested for, and guards against finding a simple shortcut that may answer them all without requiring any depth of understanding. (This contrasts with the more homogeneous Winograd schema challenge [Levesque, Davis, and Morgenstern 2012], where the highly stylized question format risks producing specialized solution methods that have little generality.) We illustrate these properties throughout this article.

In addition, 45-65 percent of the Regents science exam questions (depending on the exam), and virtually all SAT geometry questions, contain diagrams that are necessary for solving the problem. Similarly, the answers to algebraic word problems are typically numbers (see, for example, table 1). In all these cases, a Google search or simple corpus statistics will not answer these questions with any degree of reliability.

Table 1. Not extracted; please refer to original document.

A second important question, raised by Davis in his critique of standardized tests for measuring AI (Davis 2014), is whether the tests are measuring the right thing. He notes that standardized tests are authored for people, not machines, and thus will be testing for skills that people find difficult to master, skipping over things that are easy for people but challenging for machines. In particular, Davis conjectures that "standardized tests do not test knowledge that is obvious for people; none of this knowledge can be assumed in AI systems." However, our experience is generally contrary to this conjecture: although questions do not typically test basic world knowledge directly, basic commonsense knowledge is frequently required to answer them. We will illustrate this in detail throughout this article.

The New York Regents Science Exams

One of the most interesting and appealing aspects of elementary science exams is their graduated and multifaceted nature: Different questions explore different types of knowledge and vary substantially in difficulty (for a computer), from a simple lookup to those requiring extensive understanding of the world. This allows incremental progress while still demanding significant advances for the most difficult questions. Information retrieval and bag-of-words methods work well for a subset of questions but eventually reach a limit, leaving a collection of questions requiring deeper understanding. We illustrate some of this variety here, using (mainly) the multiple choice part of the New York Regents 4th Grade Science exams 2 (New York State Education Department 2014). For a more detailed analysis, see Clark, Harrison, and Balasubramanian (2013) . A similar analysis can be made of exams at other grade levels and in other subjects.

Basic Questions

Part of the New York Regents exam tests for relatively straightforward knowledge, such as taxonomic ("isa") knowledge, definitional (terminological) knowledge, and basic facts about the world. Example questions include the following.

(1) Which object is the best conductor of electricity? (A) a wax crayon (B) a plastic spoon (C) a rubber eraser (D) an iron nail

(2) The movement of soil by wind or water is called (A) condensation (B) evaporation (C) erosion (D) friction

(3) Which part of a plant produces the seeds? (A) flower (B) leaves (C) stem (D) roots

This style of question is amenable to solution by information-retrieval methods and/or use of existing ontologies or fact databases, coupled with linguistic processing.
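To make this concrete, here is a minimal sketch (not the authors' Aristo system) of such an information-retrieval baseline: each answer option is scored by counting corpus sentences that mention both the option and some keyword from the question. The tiny corpus, stopword list, and function names are illustrative assumptions only.

```python
# A minimal sketch of an information-retrieval baseline for basic
# multiple-choice questions.  The corpus below is purely illustrative.

CORPUS = [
    "an iron nail is a good conductor of electricity",
    "wax and rubber are poor conductors of electricity",
    "erosion is the movement of soil by wind or water",
    "condensation is the change of a gas into a liquid",
    "the flower is the part of a plant that produces seeds",
]

STOPWORDS = {"the", "is", "a", "an", "of", "by", "or", "which", "best", "called"}

def keywords(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def score_option(question, option):
    """Count corpus sentences mentioning the option plus a question keyword."""
    q_words = keywords(question)
    o_words = keywords(option)
    hits = 0
    for sentence in CORPUS:
        s_words = keywords(sentence)
        if o_words & s_words and q_words & s_words:
            hits += 1
    return hits

def answer(question, options):
    return max(options, key=lambda opt: score_option(question, opt))

if __name__ == "__main__":
    q = "The movement of soil by wind or water is called"
    opts = ["condensation", "evaporation", "erosion", "friction"]
    print(answer(q, opts))  # -> "erosion" on this toy corpus
```

A baseline of this kind works only when the relevant fact happens to be stated somewhere in the corpus; the following sections illustrate why that assumption quickly breaks down.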

Simple Inference

Many questions are unlikely to have answers explicitly written down anywhere; they range from questions requiring a relatively simple leap from what might already be known to questions requiring complex modeling and understanding. An example requiring (simple) inference follows:

(4) Which example describes an organism taking in nutrients? (A) A dog burying a bone (B) A girl eating an apple (C) An insect crawling on a leaf (D) A boy planting tomatoes in the garden

Answering this question requires knowledge that eating involves taking in nutrients, and that an apple contains nutrients.
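One way to picture that inference is as a two-step lookup over background facts, as in the minimal sketch below. The toy facts and the reduction of each option to an (event, object) pair are hand-coded assumptions standing in for what a language-understanding component would have to produce.

```python
# A minimal sketch of the simple inference behind question (4), over a
# hand-written toy knowledge base of (subject, relation, object) triples.

FACTS = {
    ("eating", "involves", "taking-in-nutrients"),
    ("apple", "contains", "nutrients"),
    ("burying", "involves", "moving-soil"),
    ("crawling", "involves", "moving"),
    ("planting", "involves", "moving-soil"),
}

# Each answer option reduced (by hand, for illustration) to the event it
# describes and the object involved.
OPTIONS = {
    "A": ("burying", "bone"),
    "B": ("eating", "apple"),
    "C": ("crawling", "leaf"),
    "D": ("planting", "tomatoes"),
}

def describes_taking_in_nutrients(event, obj):
    """Two-step inference: the event must involve taking in nutrients,
    and the object must contain nutrients."""
    return (event, "involves", "taking-in-nutrients") in FACTS and \
           (obj, "contains", "nutrients") in FACTS

best = [label for label, (event, obj) in OPTIONS.items()
        if describes_taking_in_nutrients(event, obj)]
print(best)  # -> ['B']
```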

More Complex World Knowledge

Many questions appear to require both richer knowledge of the world, and appropriate linguistic knowledge to apply it to a question. As an example, consider the following question:

(5) Fourth graders are planning a roller-skate race. Which surface would be the best for this race? (A) gravel (B) sand (C) blacktop (D) grass

Strong cooccurrences between sand and surface, grass and race, and gravel and graders (road-smoothing machines) throw off information-retrieval-based guesses. Rather, a more reliable answer requires knowing that a roller-skate race involves roller skating, that roller skating is done on a surface, that skating is best on a smooth surface, and that blacktop is smooth. Obtaining these fragments of world knowledge and integrating them correctly is a substantial challenge.

As a second example, consider the following question:

(6) A student puts two identical plants in the same type and amount of soil. She gives them the same amount of water. She puts one of these plants near a sunny window and the other in a dark room. This experiment tests how the plants respond to (A) light (B) air (C) water (D) soil

Again, information-retrieval methods and word correlations do poorly. Rather, a reliable answer requires recognizing a model of experimentation (perform two tasks, differing in only one condition), knowing that being near a sunny window will expose the plant to light, and that a dark room has no light in it.
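The experimental-design pattern itself is simple once the situations have been extracted; the hard part is the extraction and the commonsense mapping (sunny window implies light, dark room implies no light). A minimal sketch of the pattern, with the extracted conditions hand-coded as an assumption, might look like this:

```python
# Toy representation of the two setups described in question (6).
# Reading the question and producing these condition/value pairs is the
# hard, unmodeled part; here they are hand-coded for illustration.
plant_1 = {"soil": "same", "water": "same", "light": "sunny window (light)"}
plant_2 = {"soil": "same", "water": "same", "light": "dark room (no light)"}

# A controlled experiment varies exactly one condition; that condition
# is what the experiment tests.
differing = [cond for cond in plant_1 if plant_1[cond] != plant_2[cond]]
print(differing)  # -> ['light'], that is, option (A)
```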

As a third example, consider this question:

(7) A student riding a bicycle observes that it moves faster on a smooth road than on a rough road. This happens because the smooth road has (A) less gravity (B) more gravity (C) less friction (D) more friction

Reliably processing this question requires envisioning and comparing two different situations, overlaying a simple qualitative model on the situations described (smoother → less friction → faster). It also requires basic knowledge that bicycles move, and that riding propels a bicycle.
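The qualitative chain (smoother → less friction → faster) can be made explicit with a small influence table, as in the sketch below; the table and its sign conventions are illustrative assumptions, not an actual system's model.

```python
# A minimal sketch of the qualitative comparison behind question (7).
# +1 means two quantities move together, -1 means they move oppositely.
INFLUENCES = {
    ("road_smoothness", "friction"): -1,   # smoother road -> less friction
    ("friction", "bicycle_speed"): -1,     # less friction -> faster bicycle
}

def propagate(source, target, delta, influences):
    """Follow a chain of qualitative influences from source to target."""
    if source == target:
        return delta
    for (a, b), sign in influences.items():
        if a == source:
            result = propagate(b, target, delta * sign, influences)
            if result is not None:
                return result
    return None

# The smooth road has higher smoothness (+1) relative to the rough road.
print(propagate("road_smoothness", "bicycle_speed", +1, INFLUENCES))  # +1: faster
print(propagate("road_smoothness", "friction", +1, INFLUENCES))       # -1: less friction, option (C)
```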

All the aforementioned examples require general knowledge of the world, as well as simple science knowledge. In addition, some questions more directly test basic commonsense knowledge, such as the following:

(8) A student reaches one hand into a bag filled with smooth objects. The student feels the objects but does not look into the bag. Which property of the objects can the student most likely identify? (A) shape (B) color (C) ability to reflect light (D) ability to conduct electricity

This question requires, among other things, knowing that touch detects shape, and that sight detects color.

Some questions require selecting the best explanation for a phenomenon, requiring a degree of metareasoning. For example, consider the following question:

(9) Apple trees can live for many years, but bean plants usually live for only a few months. This statement suggests that (A) different plants have different life spans (B) plants depend on other plants (C) plants produce many offspring (D) seasonal changes help plants grow

This requires not just determining whether the statement in each answer option is true (here, several of them are), but also whether it explains the statement given in the body of the question. Again, this kind of question would be challenging for a retrieval-based solution.

As a final example, consider the following question from the Texas Assessment of Knowledge and Skills exam 3 (Texas Education Agency 2014):

(10) Not extracted; please refer to original document.

Diagrams

A common feature of many elementary grade exams is the use of diagrams in questions. We choose to include these in the challenge because of their ubiquity in tests, and because spatial interpretation and reasoning is such a fundamental aspect of intelligence. Diagrams introduce several new dimensions to question-answering, including spatial interpretation and correlating spatial and textual knowledge. Diagrammatic (nontextual) entities in elementary exams include sketches, maps, graphs, tables, and diagrammatic representations (for example, a food chain). Reasoning requirements include sketch interpretation, correlating textual and spatial elements, and mapping diagrammatic representations (graphs, bar charts, and so on) to a form supporting computation. Again, while there are many challenges, the level of difficulty varies widely, allowing a graduated plan of attack. Two examples are shown. The first, question 11 (figure 1), requires sketch interpretation, part identification, and label/part correlation. The second, question 12 (figure 2), requires recognizing and interpreting a spatial representation.

Figure 1. Question 11.
Figure 2. Question 12.

Mathematics And Geometry

We also include elementary mathematics in our challenge scope, as these questions intrinsically require mapping to mathematical models, a key requirement for many real-world tasks. These questions are particularly interesting as they combine elements of language processing, (often) story interpretation, mapping to an internal representation (for example, algebra), and symbolic computation. For example (from ixl.com):

(13) Molly owns the Wafting Pie Company. This morning, her employees used 816 eggs to bake pumpkin pies. If her employees used a total of 1339 eggs today, how many eggs did they use in the afternoon?

Such questions clearly cannot be answered by information retrieval, and instead require symbolic processing and alignment of textual and algebraic elements (for example, Hosseini et al. 2014; Koncel-Kedziorski et al. 2015; Seo et al. 2014, 2015), followed by inference. Additional examples are shown in table 1.

Note that, in addition to simple arithmetic capabilities, some capacity for world modeling is often needed. Consider, for example, the following two questions:

(14) Sara's high school won 5 basketball games this year. They lost 3 games. How many games did they play in all?

(15) John has 8 orange balloons, but lost 2 of them. How many orange balloons does John have now?

Both questions use the word "lost", but the first question maps to an addition problem (5 + 3) while the second maps to a subtraction problem (8 - 2). This illustrates how modeling the entities, events, and event sequences is required, in addition to basic algebraic skills.
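A minimal sketch of this point, loosely in the spirit of verb-categorization approaches to arithmetic word problems (for example, Hosseini et al. 2014), is shown below. The event tuples and the update rules are hand-coded assumptions standing in for what a parser and learned verb model would provide.

```python
def solve(events, asked_quantity):
    """Apply each (verb, quantity, amount) event to a running total for the
    asked-for quantity.  The update rules are illustrative assumptions:
    losing a game still counts toward games played, while losing a balloon
    removes it from what John has."""
    total = 0
    for verb, quantity, amount in events:
        if asked_quantity == "games played" and quantity == "game":
            total += amount              # every game, won or lost, was played
        elif asked_quantity == "balloons held" and quantity == "balloon":
            total += amount if verb == "have" else -amount
    return total

# Question 14: "won 5 games ... lost 3 games ... how many games did they play?"
print(solve([("win", "game", 5), ("lose", "game", 3)], "games played"))          # 8

# Question 15: "has 8 balloons, but lost 2 ... how many does John have now?"
print(solve([("have", "balloon", 8), ("lose", "balloon", 2)], "balloons held"))  # 6
```

The same surface verb ("lost") produces opposite updates depending on what the question asks about, which is why modeling events and their effects, rather than matching keywords, is needed.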

Finally, we also include geometry questions, as these combine arithmetic and diagrammatic reasoning in challenging ways; see, for example, question 16 (figure 3).

Testing For Commonsense

Possessing and using commonsense knowledge is a central property of intelligence (Davis and Marcus 2015). However, Davis (2015) and Weston et al. (2015) have both argued that standardized tests do not test "obvious" commonsense knowledge, and hence are less suitable as a test of machine intelligence. For instance, using their examples, the following questions are unlikely to occur in a standardized test:

Can you make a watermelon fit into a bag by folding the watermelon?

If you look at the moon then shut your eyes, can you still see the moon?

If John is in the playground and Bob is in the office, then where is John?

Can you make a salad out of a polyester shirt?

However, although such questions may not be directly posed in standardized tests, many questions indirectly require at least some of this commonsense knowledge in order to answer them. For example, question (6) (about plants) in the previous section requires knowing (among other things) that if you put a plant near X (a window), then the plant will be near X. This is a flavor of blocks-world-style knowledge very similar to that tested in many of Weston et al.'s examples. Similarly question (8) (about objects in a bag) requires knowing that touch detects shape, and that not looking implies not being able to detect color. It also requires knowing that a bag filled with objects contains those objects; a smooth object is smooth; and if you feel something, you touch it. These commonsense requirements are similar in style to many of Davis's examples. In short, at least some of the standardized test questions seem to require the kind of obvious commonsense knowledge that Davis describes.

Explanation

Tests (particularly at higher grade levels) typically include questions that ask not only for answers but also for explanations of those answers. So, at least to some degree, the ability to explain an answer is required.

Learning And Reading

Reddy (1996) proposed the grand AI challenge of reading a chapter of a textbook and answering the questions at the end of the chapter. While standardized tests do not directly test textbook reading, they do require question comprehension, sometimes of long story questions. In addition, acquiring the knowledge necessary to pass a test will arguably require breakthroughs in learning and machine reading; attempts to encode the requisite knowledge by hand have to date been unsuccessful.

Dealing With Novel Problems

As our examples illustrate, test taking is not a monolithic skill. Rather, it requires a battery of capabilities and the ability to deploy them in potentially novel and unanticipated ways. In this sense, test taking requires, to some degree, the versatility and ability to handle new and surprising problems that we would expect of an intelligent machine.

State Of The Art On Standardized Tests

How well do current systems perform on these tests? While any performance figure will be exam specific, we can provide some example data points from our own research.

On nondiagram, multiple choice science questions (NDMC), our Aristo system currently scores on average 75 percent (4th grade), 63 percent (8th grade), and 41 percent (12th grade) on (previously unseen) New York Regents science exams (NDMC questions only, typically four-way multiple choice). As can be seen, questions become considerably more challenging at higher grade levels. On a broader multistate collection of 4th grade NDMC questions, Aristo scores 65 percent (unseen questions). The data sets are available at allenai.org/aristo.html. Note that these are the easier questions (no diagrams, multiple choice); other question types pose additional challenges as we have described. No system to date comes even close to passing a full 4th grade science exam.

On algebraic story problems such as those in table 1, our AlgeS system scores over 70 percent accuracy on story problems that translate into single equations (Koncel-Kedziorski et al. 2015) . Kushman et al. (2014) report results on story problems that translate to simultaneous algebraic equations. On geometry problems such as those in table 2, our GeoS system achieves a 49 percent score on (previously unseen) official SAT questions, and a score of 61 percent on a data set of (previously unseen) SAT-like practice questions. The relevant questions, data, and software are available on the Allen Institute's website. 4

Table 2. Examples of Problems That Current Systems Have Solved.

Summary

If a computer were able to pass standardized tests, would it be intelligent? Not necessarily, but it would demonstrate that the computer had several critical skills we associate with intelligence, including the ability to answer sophisticated questions, handle natural language, and solve tasks requiring extensive commonsense knowledge of the world. In short, it would mark a significant achievement in the quest toward intelligent machines. Despite the successes of data-driven AI systems, it is imperative that we make progress in these broader areas of knowledge, modeling, reasoning, and language if we are to make the next generation of knowledgeable AI systems a reality. Standardized tests can help to drive and measure progress in this direction, as they present many of these challenges yet are also accessible, comprehensible, incremental, and easily measurable. To help with this, we are releasing data sets related to this challenge.

In addition, in October 2015 we launched the Allen AI Science Challenge, 5 a competition run on kaggle.com to build systems that answer eighth-grade science questions. The competition attracted over 700 participating teams, and scores jumped from 32.5 percent initially to 58.8 percent by the end of January 2016. Although the winner is not yet known at press time, this impact demonstrates the efficacy of standardized tests in focusing attention and research on these important AI problems.

Of course, some may claim that existing data-driven techniques are all that is needed, given enough data and computing power; if that were so, that in itself would be a startling result. Whatever your bias or philosophy, we encourage you to prove your case and take these challenges! AI2's data sets are available on the Allen Institute's website. 5

Notes

Figure 3. Question 16.