Diagram Understanding in Geometry Questions

Minjoon Seo
Hannaneh Hajishirzi
Ali Farhadi
Oren Etzioni
AAAI
2014
View in Semantic Scholar

Abstract

Automatically solving geometry questions is a long-standing AI problem. A geometry question typically includes a textual description accompanied by a diagram. The first step in solving geometry questions is diagram understanding, which consists of identifying visual elements in the diagram, their locations, their geometric properties, and aligning them to corresponding textual descriptions. In this paper, we present a method for diagram understanding that identifies visual elements in a diagram while maximizing agreement between textual and visual data. We show that the method's objective function is submodular; thus we are able to introduce an efficient method for diagram understanding that is close to optimal. To empirically evaluate our method, we compile a new dataset of geometry questions (textual descriptions and diagrams) and compare with baselines that utilize standard vision techniques. Our experimental evaluation shows an F1 boost of more than 17% in identifying visual elements and 25% in aligning visual elements with their textual descriptions.

1 Introduction

Designing algorithms that can automatically solve math and science questions is a long-standing problem in AI (Feigenbaum and Feldman 1963) . In this paper, we focus on geometry questions where the question text is accompanied by a diagram. More specifically, we address the problem of diagram understanding in geometry questions (Figure 1 ), a prelude to more sophisticated diagram understanding in scientific textbooks.

Figure 1: Diagram understanding: identifying visual elements in the diagram and aligning them with their textual mentions. Visual elements and their corresponding textual mentions are color coded. This Figure is best viewed in color.

Diagram understanding is the problem of discovering visual elements, their locations, their geometric properties in the diagram, and their alignment to text. For example, understanding the diagram in Figure 1 entails identifying the location and the area of the circle O, secant AB, their geometric relations, and aligning pixels in the diagram to the corresponding textual mentions (color coded in the figure).

By and large, previous work in diagram understanding has studied the problems of text analysis and diagram understanding separately. Several algorithms have identified individual shapes such as circles (Zhang and Sun 2011) , lines (Climer and Bhatia 2003) , triangles or rectangles (Li et al. 2013; Jung and Schramm 2004) from images, but ignore other shapes in the diagram, and do not attempt to discover them as we do. Furthermore, little attention has been paid to identifying shapes in a diagram by also utilizing the corresponding text. Inspired by the growing body of work that has coupled textual and visual signals (e.g., (Gupta and Mooney 2010) ), we present a novel method G-ALIGNER for diagram understanding in geometry questions by discovering visual elements and aligning them with their corresponding textual mentions. Our G-ALIGNER method identifies visual elements by maximizing the coverage of explained pixels of the diagram, the agreement between visual elements and their textual mentions, and the coherence of the identified elements. G-ALIGNER can identify a wide range of shapes including lines, circles, polygons, and other shapes composed from visual primitives (see Section 5).

We show that G-ALIGNER's objective function is submodular. This observation allows us to devise a greedy but accurate approximation procedure to identify visual elements in diagrams and align them with text. G-ALIGNER has another key advantage in being much more robust than standard vision techniques like the Hough transform (Stockman and Shapiro 2001) . Whereas standard vision techniques require parameter tuning when moving from one diagram to the next, based on factors like the number of shapes in the diagram and their size, G-ALIGNER does not.

To evaluate G-ALIGNER, we manually compiled a dataset of geometry questions (textual descriptions and diagrams) that includes ground truth labels for visual elements and their correct alignments to textual mentions. To our knowledge, no comparable dataset existed previously. We evaluate G-ALIGNER on two tasks of identifying visual elements and aligning them to mentions in text. Our experiments show that for both tasks G-ALIGNER significantly outperforms baselines that use standard vision techniques. Moreover, our experiments show the benefit of incorporating textual information.

Our contributions are three-fold: (a) We present G-ALIGNER, a method for diagram understanding that both discovers visual elements in diagrams and aligns them to textual mentions; (b) We introduce a submodular optimization formulation and a greedy but accurate approximation procedure for diagram understanding; (c) We introduce a new dataset for geometry questions that includes ground truth labels for visual elements and their alignment to textual mentions. Our experiments show improvement of at least 25% in F1 over baselines in identifying visual elements, and of 17% in aligning visual elements to textual mentions. 1

2 Related Work

Diagram understanding has been explored since early days in AI (e.g., (Srihari 1994; Lin et al. 1985; Ferguson and Forbus 1998; Hegarty and Just 1989; Ferguson and Forbus 2000; Novak 1995) ). Space does not allow comprehensive review of original attempts at the problem. We refer interested readers to (O'Gorman and Kasturi 1995) . Most previous work differ from our method because they address two problems of geometry understanding and text understanding in isolation. Our paper is related to early work on coupling over textual and visual data (Bulko 1988; Novak and Bulko 1990; Srihari 1994) , however these methods assume that the visual primitives of diagrams are manually identified. This paper aims at revisiting the problem of diagram understanding by coupling two tasks of visual understanding of diagrams and detecting alignments between text and diagrams.

The most common approach to diagram understanding is a bottom up method where primitives can be linked together (Lin and Nevatia 1998) to form larger elements such as rectangles (Li et al. 2013) or general shapes (Moon, Chellappa, and Rosenfeld 2002) . Using Hough transform is another popular alternative in detecting visual elements (Zhang and Sun 2011; Jung and Schramm 2004) . What is common among almost all conventional methods of visual element identification is thresholding of a scoring function that determines the existence of visual elements. Although being considered as a well studied subject, our experiments reveal that the thresholding step hinders applications of such techniques on real-world geometry questions. Our data suggests that there is no single threshold that results in a reliable discovery of visual elements across different diagrams. In this paper, we propose a method that initially overestimates the visual elements, but then benefits from submodular opti-1 Our dataset and a demo of G-ALIGNER are publicly available at:http://cs.washington.edu/research/ai/ geometry mization coupled with textual information to home in on the correct elements.

Coupling visual and textual information has recently attracted attention in both vision and NLP (Farhadi et al. 2010; Kulkarni et al. 2011; Gupta and Mooney 2010) . We build on this powerful paradigm, but utilize it for the more manageable task of understanding diagrams in geometry questions. Understanding these diagrams is more manageable because diagrams are less ambiguous, expose less visual variance, have smaller vocabulary of elements than images typically studied in machine vision. This easier task allows us to have more reliable estimates of visual elements and focus on interactions between textual mentions and visual elements.

3 Problem Definition

This paper addresses the problem of understanding diagrams (Figure 1 ) by coupling discovering visual elements in the diagram with aligning them with textual mentions. Before giving the formal description of the problem, we first define keywords that we use throughout the paper.

L = {L 1 , L 2 , ..., L n }.

Definition 2. A visual element is a combination of primitives that has specific properties. For instance, a triangle is a visual element that consists of three connected lines in a specific way. The vocabulary of all visual elements and their corresponding geometric properties is represented with V . The primitives in V includes: line, segment, chord, diameter, secant, tangent, radius, circle, arc, point, intersection, triangle, rectangle, trapezoid, square, altitude, base. For their geometric properties, please refer to our project web page.

Definition 3. A textual mention is a word or phrase that corresponds to a visual element. For instance, the word circle is the textual mention of the visual element circle. The set of all textual mentions extracted from the question is

T = {T 1 , T 2 , ..., T m }.

The input to our method is an image of a diagram with non-white pixels D accompanied with the text of the question that includes textual mentions T . The output is a subset of primitives along with their alignments to textual mentions. Figure 1 shows examples of detections and alignments established by our method.

4 Optimization For Primitive Identification And Alignment

Our key insight is to benefit from coupling textual and visual information available in geometry questions. This problem is a search for the best subsetL of all initial primitives L extracted from the diagram. An ideal subsetL should contain primitives that: (1) explain all important pixels in the diagram, (2) are visually coherent, and (3) form visual elements that align well with textual mentions in the question.

4.1 Formulation

We first intuitively define a set function F that measures the quality of a subsetL based on the above properties (Equation 1). We then introduce the formal definition (Equation 2). First, F has a component P to ensure that the ideal subset L has good coverage of the diagram image. That is, most of the non-white pixels in the diagram D should be explained by the subset of primitivesL.

Second, F has a component C to encourage the selection of primitivesL that can form a bigger and coherent visual element. This can be encoded by the visual agreement between primitives in terms of distances between identified primitives inL and the actual corners C (corners are extracted from the diagram image and explained in Section 5.1).

Third, F has a component S to model the alignment between textual mentions T in the question and visual elements discovered from the diagram.

For any given subsetL ⊆ L, we define:

EQUATION (1): Not extracted; please refer to original document.

where D is the diagram image, T is the set of all textual mentions in the question. The best subset is the one that maximizes the set function F.

Here, we present an optimization for identifying primitives and aligning them with textual mentions. We introduce a binary matrix W ∈ {0, 1} |L|×|T | where W i,j identifies whether the i th primitive l i is aligned with the j th textual mention T j , or not. In particular, each row i in the identifier matrix W represents textual mentions that include the primitive l i , and each column j in the identifier matrix represents primitives that are aligned with the textual mention T j . Therefore, a matrix W can represent both the set of primitives as well as alignment between textual mentions and visual elements.

We reformulate the problem of searching for the best subsetL that maximizes F in Equation 1 as the problem of finding an identifier matrix W ∈ {0, 1} |L|×|T | . Optimizing forŴ results in a unified solution for both problems of primitive identification and alignment. In

this set- ting,L = L × (Ŵ × 1 |T |×1 ) where the binary vector W × 1 |T |×1 represents what primitives in L are included inŴ . Therefore, P(D,L) in Equation 1 is represented as P(D, L × (Ŵ × 1))

in the new setting. Finally, the optimization in equation 1 is reformulated as follows:

F(W, L, D, T ) = (2) P(D, L × (W × 1)) + C(C, L × (W × 1)) + S(T, W )

The best subset of primitives and alignments are derived by maximizing for the identifier matrixŴ = arg max W F. Here we formally define each component in the equation.

Definition 4. Let D be the set of pixels in the diagram, L be the set of all the primitives initially identified in the diagram, W be the identifier matrix, andL = L × (Ŵ × 1) be the set of identified primitives.

• Coverage function P: If DL represents the set of pixels covered by the identified primitivesL then P :

D × L → R is P(D,L) = |DL| |D| .

• Visual coherence function C: Let C be the set of corners initially detected in the diagram. We consider a corner c ∈ C to be matching, if there exists a point e ∈ DL that is close enough to the detected corner c (i.e., |c − e| < for a fixed ). If CL is the set of matched corners, then C :

C × L → R is C(C,L) = |CL| |C| .

• Alignment constraint function S: Let T be the set of textual mentions in the text of the question. The vocabulary V consists of geometric descriptions of each visual element. For example, Circle corresponds to the set of points that have the same distance to the center, etc. A textual mention in the text is aligned if our method can find a corresponding model from the primitives. For example, to align a textual mention like Triangle ABC, our method needs to find three lines that mutually intersect at corners close to labels A, B, and C in the diagram. 2 A visual element like a triangle can be textually described in multiple different ways. For example, Triangle ABC or three lines AB, BC, AC. To avoid redundancy between the visual elements, we need to penalize our model for predicting overlapping visual elements. We define redundancy between two lines l 1 , l 2 as a function of the intersection of the projection of l 2 to l 1 over their union. For arcs we do the same with the convex area of the arcs. If TŴ is the set of textual mentions covered inŴ , and rŴ is the redundancy among the primitives inŴ that are not mentioned in TŴ then S :

T × W → R is S(T,Ŵ ) = |TŴ | |T | − rŴ .

Optimizing Equation 2 is a combinatorial optimization that requires 2 |L| evaluations of F. In the next section we show how to optimize Equation 2.

4.2 Optimization

Optimizing for Equation 2 is NP-hard by reduction from weighted set cover problem. However, the objective function is submodular. This means that there exists a greedy method that can accurately approximate the optimal solution. Lemma 1. The objective function F in Equation 2 is submodular. Proof sketch. To show that the objective F in Equation 2 is submodular we need to show that for L ⊆ L ⊆ L, and for

EQUATION (3): Not extracted; please refer to original document.

We compare components of F in two sides of inequality 3:

(|D L ∪l | − |D L |)/|D| ≥ (|D L ∪l | − |D L |)/|D| (|T L ∪l | − |T L |)/|T | ≥ (|T L ∪l | − |T L |)/|T | (|C L ∪l | − |C L |)/|C| ≥ (|C L ∪l | − |C L |)/|C| −(|r L ∪l − r L ) ≥ −(|r L ∪l − r L )

Inputs:

• V : the set of known visual elements and their geometric properties.

• D: the set of non-white pixels in the diagram.

•L: the set of identified primitives 1. Initialization (section 5.1) (a) Initialize primitives L in the diagram i. Run Hough transform to initialize lines and circles segments ii. set L ← top n picks from the output of the line and circle detection where n is generously high (b) Initialize corners C in the diagram (c) Initialize mentions T in the text 2. Optimize Equation 2 to identify primitives and alignments given the diagram and text (section 5.2) (a) LetL ← ∅ (b) Repeat i. For every primitive l ∈ L: Adding a primitive to the smaller set does not decrease the coverage of pixels, corners, and alignments. The intuition behind the last inequality is that adding a primitive to a larger set of primitives will result in more redundancy. Summing over these inequalities proves that the inequality 3 holds.

A. Compute G(l) ← F(L ∪ l) − F(L) using P, C, S from Equation 2 ii. select l ← arg max l∈L G(l) iii. add l to the set of primitivesL (c) until l ∈ L such that G(l) > 0.

The objective function F in Equation 2 is also monotone until all the textual mentions in the text are covered (adding new primitives does not decrease the value of the objective function).

The objective function (Equation 2) is monotone and submodular. This means that there exist a greedy method that finds a (1 − 1/e)-approximation of the global optimum (Nemhauser, Wolsey, and Fisher 1978; Sviridenko 2004 ). In the next section we explain the greedy method to identify primitives and alignments. Figure 2 explains the steps in our method G-ALIGNER for diagram understanding. Submodularity of the objective function implies that we can introduce the following iterative greedy method with proven bounds. We first initialize the set of possible primitives (Section 5.1, Step 1 in Figure 2) and then iteratively add the primitives that maximize gain (Section 5.2, Step 2 in Figure 2 ). Figure 3 schematically depicts steps of G-ALIGNER.

Figure 2: G-ALIGNER: Method for coupling primitive identification and alignment.

Figure 3: This figure shows steps of the method. It starts with an over-generation of primitives and at each iteration adds a primitive that provides the biggest gain based on Equation 4. Red line segments correspond to primitives that are added at each iteration. Blue crosses correspond to detected corners.

5.1 Initialization

The left image in Figure 3 shows an example of initial sets of primitives from which our method starts. Initialize primitives: For noise removal, we apply a weak Gaussian blur on the raw image and then binarize it using Otsu's method (Otsu 1975) . We then use Hough transform (Stockman and Shapiro 2001) to extract primitives (line and circle segments) for a given diagram. The result of Hough transform has no information about the start and end points of the lines or arcs. Only the parametric representation of the primitive is known. Therefore, post-processing is required in order to determine endpoints. We detect and connect continuous binary points that lie on the same line or arc. For each primitive of interest, the result will be a few independent segments where the start and end of the segments are stored.

Standard application of Hough transform is not applicable to our problem. This is mainly due to (1) inaccuracies at the intersections (2) confusions between circles and polygons composed of several small lines; and (3) sensitivity to parameters of Hough transform and all post processing techniques. Our experimental evaluations show that there is no single set of parameters that work well on large number of samples. To overcome these issues, we set the threshold to a low number to over-generate a large number of primitives. This way, the right set of primitives are most likely among the over-generated set of primitives. We typically obtain 20 to 60 primitives in each diagram. Figure 3 shows an example of overproduced primitives L. We then use the optimization in Equation 2 to select the right set of primitivesL from a big pool of noisy estimates of primitives L. Initialize Corners: To enforce coherent visual elements, we need to encourage the set of selected primitives to be visually coherent. We use corners in diagrams as a gluing function. Two primitives that share endpoints very close to a corner in an image are preferred to primitives that are far away from corners. We use Harris Corner detectors (Harris and Stephens 1988) to identify possible locations of corners. Corners are scored based on how close they are to primitives. Initialize mentions: We extract mentions from textual descriptions by keyword search using the list of visual elements in V .

5.2 Iterative Optimization

We initialize the optimal set of primitives as an empty set L = ∅. Then we repeat the following step. At every iteration k + 1, we select the primitive l that maximizes the following equation given that L k is the best subset at iteration k. Figure 3 shows three steps of this greedy method on a sample diagram.

EQUATION (4): Not extracted; please refer to original document.

6 Experiments

To experimentally evaluate our method we build a dataset of geometry questions along with annotations about visual elements and alignments. We test our method on how well it can identify visual elements and how accurate are the alignments established by our method. We compare our method G-ALIGNER with baselines that use standard techniques of diagram understanding. To also better understand our model we perform ablation studies for our model.

6.1 Experimental Setup

Dataset We build a dataset of high school plane geometry questions where every question has a textual description in English accompanied by a diagram. For evaluation purposes, we annotate diagrams with correct visual elements as well as alignments between textual mentions and visual elements.

Questions are compiled from four test preparation websites (RegentsPrepCenter; EdHelper; SATMath; SATPractice) for high school geometry. Ground truth labels are collected by manually annotating all the primitives in the diagram. In addition, we annotate all the alignments between textual mentions and visual elements in the diagram. In total, our dataset consists of 100 questions and 482 ground truth alignments. The dataset is publicly available in our project web page. Tasks and metrics We evaluate our method G-ALIGNER in two tasks of (1) identifying primitives in diagrams and (2) aligning visual elements with textual mentions.

For task 1, we compare detected primitives by G-ALIGNER against the ground truth dataset. For every identified primitive l and the corresponding ground truth primitive l, we measure the amount of overlap. If the primitive is a line segment, we project l onto l and measure the ratio of area of intersection over the union. For arc primitives we use the convex area and measure the amount of area of intersection over union. A detection l is considered as a correct prediction if the amount of intersection over union is bigger than a threshold α (called overlap-to-ground-truth ratio). For evaluation purposes, we vary α ∈ [0.7 : 1] and report F1 scores. Precision is the number of correctly identified primitives divided by total number of identified primitives. Recall is the number of correctly identified primitives divided by the number of primitives in ground truth.

For task 2, we evaluate if G-ALIGNER correctly aligns a textual mention with a visual element. For that, we establish alignments and report the accuracy of G-ALIGNER compared to ground truth annotations.

Baselines We compare our method G-ALIGNER with a baseline that uses Hough transform for identifying primitives. Similar to almost all of the Hough-based methods, this baseline requires a set of sensitive parameters: two thresholds for picking elements in Hough line and circle space, respectively, and three non-maximum suppression neighborhood parameters (two for line and one for circle). To find the best possible baseline, we perform a sampled grid search. This way, we can find the best set of parameters as if we have tuned the parameters on the test set. This baseline works as an upper bound on how well one can detect visual primitives using standard Hough-based methods. The overall distribution of F1 for several samples of parameters is shown in 5.

For task 1, the baseline identifies primitives that score higher than a threshold. This threshold is manually set to produce the best possible detection in our dataset. For task 2, we use identified primitives from task 1 and align them with the mention corresponding to the closest visual element.

Parameters Our method G-ALIGNER also uses Hough transform to extract initial primitives out of diagrams. However, G-ALIGNER method is not sensitive to the choice of parameters in Hough transform. We set the parameters so that we always overly generate primitives. Our optimization method reasons about what primitives to select.

6.2 Results

Identifying primitives We report the performance of G-ALIGNER in identifying primitives and compare it with the best baseline explained above. Figure 4 shows the F 1 score at different overlap-to-the-ground-truth ratios. G-ALIGNER significantly outperforms the baseline because (1) G-ALIGNER couples visual and textual elements (2) G-ALIGNER enforces diagram coherency (3) G-ALIGNER does not require parameter tuning. The baseline typically maintains relatively high recall but low precision. For example, at α = 0.8 the baseline achieves precision of 69% at the recall of 86% compared to our precision of 95% at the recall of 93%. Figure 5 reports the distribution of the F1 scores for the baseline for 500 samples of parameters. This figure also shows where the "best baseline" (whose parameters we manually tuned for the entire dataset) and

Figure 4: Comparison between G-ALIGNER and the baseline in task 1 in terms of F1 by varying overlap to the ground truth ratio α. This threshold is used to evaluate correct predictions of primitives.

Figure 5: Normalized histogram of 500 F1 scores for the baseline obtained by randomly chosen parameters. We observe a normal distribution centered at 0.5 with standard deviation 0.04. The F1 scores of best baseline parameters and G-ALIGNER also drawn.

F1#

Overlap#with#GT#Threshold# Baseline# G@ALIGNER# Figure 4 : Comparison between G-ALIGNER and the baseline in task 1 in terms of F1 by varying overlap to the ground truth ratio α. This threshold is used to evaluate correct predictions of primitives.

G--Aligner

Best baseline, tuned on en4re dataset Figure 5 : Normalized histogram of 500 F1 scores for the baseline obtained by randomly chosen parameters. We observe a normal distribution centered at 0.5 with standard deviation 0.04. The F1 scores of best baseline parameters and G-ALIGNER also drawn.

G-ALIGNER stand. The comparison clearly demonstrates how G-ALIGNER outperforms the baseline with any combination of parameters.

Ablation study We study the importance of each component in G-ALIGNER (optimization Equation 2). To study the effect of enforcing agreement between visual elements and textual mentions, we remove the term S from the optimization in Equation 2. In addition, to study the effect of enforcing diagram coherency we remove the term C from the optimization in Equation 2. We also show the effects of removing both S, C from the equation. Table 1 shows the precision, recall, and F1 scores of identifying primitives for α = 0.8. Removing both components decreases both precision and recall. The effect of removing the S is higher than that of C implying the importance of coupling textual and visual data.

Table 1: Ablation study on Task 1 (identifying primitives).

Aligning textual mentions and visual elements To study the performance of G-ALIGNER in aligning textual mentions and visual elements, we compare our alignment results with baseline alignments. G-ALIGNER achieves an accuracy of 90% and baseline obtains the accuracy of 64% for the overlap ratio of α = 0.8. This approves that coupling tex- Qualitative results Figure 6 shows examples of alignments between visual elements and textual mentions produced by G-ALIGNER . Mentions and their corresponding visual elements are color coded. For example, G-ALIGNER establishes an alignment between the textual mention base AFEDC and the corresponding line (red line) in Figure 6 . To show the effects of S, C in G-ALIGNER, Figure 7 shows examples of mistakes in primitive identification that happens when either S or C are removed from G-ALIGNER. In Figure 7 , black pixels correspond to the actual diagram and red lines and circles correspond to the detected elements. Removing S in (a) results in wrong detection of an extra circle. Without S, G-ALIGNER does not know what to expect and therefore picks a circle whose coverage is larger than the rest of the lines in the diagram. Removing C in (b) results in estimates of an incorrect line on the pixels on the word tangent in the diagram. By considering agreements between corners, G-ALIGNER correctly discards this false detection. Note that, one might come up with heuristics to avoid any of these specific cases. However, this paper provides a unified method that reasons about diagrams and can handle these cases without any specific heuristic.

Figure 6: Examples of alignments produced by G-ALIGNER. Textual mentions and visual elements are color coded. This figure is best viewed in color.

Figure 7: Examples of mistakes when S or C are removed from G-ALIGNER. In this figure, black pixels correspond to the actual pixels in the diagram and red lines or circles are detected elements. In (a), removing S results in adding a wrong detection of an extra circle whose coverage is actually bigger than some of the correct lines in the figure. In (b), removing C results in an incorrect detection of an incorrect line on the word tangent. G-ALIGNER correctly understands both of the diagrams above.

Limitations Our method shows promising results in our experiments. However, there are cases in which our model fails to identify diagram elements. The benefits of our model over the standard baselines is marginal if the text of the ques- tion does not mention any of diagram elements. Our method doesn't recognize out of the vocabulary visual elements and fails if the scale of one visual elements is out of the range.

7 Conclusion And Future Directions

Our ultimate goal is to build an automated system that can solve geometry questions. The very first step toward this goal is to build a method that can understand diagrams. To our surprise, despite a large body of work in diagram understanding, the literature lacks a unified framework that does not require parameter tuning and problem specific heuristics. This paper is one step toward building such systems. We introduce G-ALIGNER that understands diagrams by coupling primitive detection and their alignments to text. The output of G-ALIGNER is a set of detected visual elements, their location, their geometric properties, and their corresponding textual mentions. Further, G-ALIGNER can exhaustively enumerate all possible visual elements (even the ones that are not explicitly mentioned in the text).

A direct extension of G-ALIGNER allows us to solve geometry problems with drawn-to-scale diagrams, through a scaling method. For example, G-ALIGNER identifies several visual primitives (i.e., lines BD, BO, DO and CA and circle O) in the problem in Figure 8 . Additionally, G-ALIGNER can enumerate all possible visual elements, their geometric properties, and their geometric relations (Figure 8 (c) ). For instance, G-ALIGNER identifies line CE and its length in pixels. Moreover, G-ALIGNER can capture geometric relations. For example, G-ALIGNER identifies line BO as the radius of the circle O using the visual information that one end of the line BO is near the center of the circle O and the other end of the line is on the circumference of the circle.

Figure 8: Using G-ALIGNER to solve a geometry problem.

With simple textual processing, we also extract numerical relations from the question text and what the question is looking for (Figure 8 (b) ). Textual information for the example question includes (1) the radius of circle O = 5, (2) line CE = 2, and (3) the question is looking for the length of line BD. By combining textual information (1) with the extracted visual information that states "BO is the radius of Circle O", we infer that the length of line BO = 5.

This enables us to compute the scale between the units in the question (BO=5) and pixels in the diagram (Length of line BO is 75 pixels). Using textual information (2), we can solve the example problem in a slightly different way (See Figure 8 (d) for step-by-step demonstration of how G-ALIGNER can solve this problem).

G-ALIGNER, with the scaling, finds correct answers for problems with well-drawn diagrams. In future, we plan to solve the problems using more complex mathematical and logical reasoning for geometry theorem proving. We also plan to link several uncertain visual and textual relations to formulate a collective probabilistic model that will output the most probable answer to the question. In addition, we intend to use the results of diagram understanding to help understand the semantics of sentences. This is feasible because diagrams are easier to understand compared to real images. Diagram understanding should also help recognition of out-of-vocabulary visual elements and disambiguation of textual mentions. The current formulation can handle extensions to diagrams that are composed of well-defined primitives. We also plan to extend G-ALIGNER to understand more complex diagrams in other science areas. The demo and the dataset are available on our project page.

For finding positions of labels we use an off-the-shelf OCR package of Tesseract.