Learning Biological Processes with Global Constraints

Authors

  • A. T. Scaria
  • Jonathan Berant
  • M. Wang
  • P. Clark
  • Justin Lewis
  • Brittany Harding
  • Christopher D. Manning

EMNLP 2013

* Both first authors contributed equally to the paper.

Abstract

Biological processes are complex phenomena involving a series of events that are related to one another through various relationships. Systems that can understand and reason over biological processes would dramatically improve the performance of semantic applications involving inference, such as question answering (QA) - specifically "How?" and "Why?" questions. In this paper, we present the task of process extraction, in which events within a process and the relations between the events are automatically extracted from text. We represent processes by graphs whose edges describe a set of temporal, causal and co-reference event-event relations, and characterize the structural properties of these graphs (e.g., the graphs are connected). Then, we present a method for extracting relations between the events, which exploits these structural properties by performing joint inference over the set of extracted relations. On a novel dataset containing 148 descriptions of biological processes (released with this paper), we show significant improvement compared with baselines that disregard process structure.

1 Introduction

A process is defined as a series of inter-related events that involve multiple entities and lead to an end result. Product manufacturing, economic developments, and various phenomena in the life and social sciences can all be viewed as types of processes. Processes are complicated objects; consider for example the biological process of ATP synthesis described in Figure 1. This process involves 12 entities and 8 events. Additionally, it describes relations between events and entities, and relationships between events (e.g., the second occurrence of the event 'enter' causes the event 'changing').

Automatically extracting the structure of processes from text is crucial for applications that require reasoning, such as non-factoid QA. For instance, answering a question on ATP synthesis, such as "How do H+ ions contribute to the production of ATP?", requires a structure that links H+ ions (Figure 1, sentence 1) to ATP (Figure 1, sentence 4) through a sequence of intermediate events. Such "How?" questions are common on FAQ websites, which further supports the importance of process extraction.

Figure 1: Partial annotation of the ATP synthesis process. Most of the semantic roles have been removed for simplicity. The process text reads: "H+ ions flowing down their gradient enter a half channel in a stator, which is anchored in the membrane. H+ ions enter binding sites within a rotor, changing the shape of each subunit so that the rotor spins within the membrane. Spinning of the rotor causes an internal rod to spin as well. Turning of the rod activates catalytic sites in the knob that can produce ATP from ADP and P_i."

Process extraction is related to two recent lines of work in Information Extraction - event extraction and timeline construction. Traditional event extraction focuses on identifying a closed set of events within a single sentence. For example, the BioNLP 2009 and 2011 shared tasks (Kim et al., 2009; Kim et al., 2011) consider nine event types related to proteins. In practice, events are currently almost always extracted from a single sentence. Process extraction, on the other hand, is centered around discovering relations between events that span multiple sentences. The set of possible event types in process extraction is also much larger.

Timeline construction involves identifying temporal relations between events (Do et al., 2012; McClosky and Manning, 2012; D'Souza and Ng, 2013), and is thus related to process extraction, as both focus on event-event relations spanning multiple sentences. However, events in processes are tightly coupled in ways that go beyond simple temporal ordering, and these dependencies are central to the process extraction task. Hence, capturing process structure requires modeling a larger set of relations that includes temporal, causal and co-reference relations.

In this paper, we formally define the task of process extraction and present automatic extraction methods. Our approach handles an open set of event types and works over multiple sentences, extracting a rich set of event-event relations. Furthermore, we characterize a set of global properties of process structure that can be utilized during process extraction. For example, all events in a process are somehow connected to one another. Also, processes usually exhibit a "chain-like" structure reflecting process progression over time. We show that incorporating such global properties into our model and performing joint inference over the extracted relations significantly improves the quality of the predicted process structures. We conduct experiments on a novel dataset of process descriptions from the textbook "Biology" (Campbell and Reece, 2005) that were annotated by trained biologists. Our method does not require any domain-specific knowledge and can be easily adapted to non-biology domains. The main contributions of this paper are:

1. We define process extraction and characterize processes' structural properties.

2. We model global structural properties in processes and demonstrate significant improvement in extraction accuracy.

3. We publicly release a novel dataset of 148 fully annotated biological process descriptions along with the source code for our system. The dataset and code can be downloaded from http://nlp.stanford.edu/software/bioprocess/.

2 Process Definition And Dataset

We define a process description as a paragraph or sequence of tokens x = {x_1, ..., x_|x|} that describes a series of events related by temporal and/or causal relations. For example, in ATP synthesis (Figure 1), the event of the rotor spinning causes the event where an internal rod spins. We model the events within a process and their relations by a directed graph P = (V, E), where the nodes V = {1, ..., |V|} represent event mentions and labeled edges E correspond to event-event relations. An event mention v ∈ V is defined by a trigger t_v, which is a span of words x_i, x_{i+1}, ..., x_j; and by a set of argument mentions A_v, where each argument mention a_v ∈ A_v is also a span of words labeled by a semantic role l taken from a set L. For example, in the last event mention of ATP synthesis, t_v = produce, and one of the argument mentions is a_v = (ATP, RESULT). A labeled edge (u, v, r) in the graph describes a relation r ∈ R between the event mentions u and v. The task of process extraction is to extract the graph P from the text x. [1]

A natural way to break process extraction down into sub-parts is to first perform semantic role labeling (SRL), that is, identify triggers and predict argument mentions with their semantic roles, and then extract event-event relations between pairs of event mentions. In this paper, we focus on the second step: given a set of event triggers T, we find all event-event relations, where a trigger represents the entire event. For completeness, we now describe the semantic roles L used in our dataset, and then present the set of event-event relations R.
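To make these definitions concrete, the following is a minimal Python sketch of the graph representation (our illustration, not the authors' released code; the token offsets in the example are invented):

```python
from dataclasses import dataclass, field

@dataclass
class ArgumentMention:
    span: tuple       # (start, end) token offsets of a_v within x
    role: str         # semantic role l from L, e.g. "RESULT"

@dataclass
class EventMention:
    trigger: tuple                                 # token span of the trigger t_v
    arguments: list = field(default_factory=list)  # the argument set A_v

@dataclass
class ProcessGraph:
    tokens: list                              # the paragraph x as a token list
    events: list                              # the nodes V (event mentions)
    edges: set = field(default_factory=set)   # E: (u, v, relation) index triples

# The last event of ATP synthesis in Figure 1 (offsets are hypothetical):
produce = EventMention(trigger=(52, 53),
                       arguments=[ArgumentMention((55, 56), "RESULT"),        # ATP
                                  ArgumentMention((58, 59), "RAW-MATERIAL")]) # ADP
```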

The set L contains standard semantic roles such as AGENT, THEME, ORIGIN, DESTINATION and LOCATION. Two additional semantic roles were employed that are relevant for biological text: RESULT corresponds to an entity that is the result of an event, and RAW-MATERIAL describes an entity that is used or consumed during an event. For example, the last event 'produce' in Figure 1 has 'ATP' as the RESULT and 'ADP' as the RAW-MATERIAL.

The event-event relation set R contains the following (assuming a labeled edge (u, v, r)):

1. PREV denotes that u is an event immediately before v. Thus, the edges (u, v, PREV) and (v, w, PREV) preclude the edge (u, w, PREV). For example, in "When a photon strikes . . . energy is passed . . . until it reaches . . .", there is no edge (strikes, reaches, PREV) due to the intervening event 'passed'.

2. COTEMP denotes that events u and v overlap in time (e.g., the first two event mentions flowing and enter in Figure 1).

3. SUPER denotes that event u includes event v. For instance, in "During DNA replication, DNA polymerases proofread each nucleotide . . ." there is an edge (DNA replication, proofread, SUPER).

4. CAUSES denotes that event u causes event v (e.g., the relation between changing and spins in sentence 2 of Figure 1).

5. ENABLES denotes that event u creates preconditions that allow event v to take place. For example, the description ". . . cause cancer cells to lose attachments to neighboring cells . . . , allowing them to spread into nearby tissues" has the edge (lose, spread, ENABLES). An intuitive way to think about the difference between CAUSES and ENABLES is the following: if u causes v, then if u happens, v happens; if u enables v, then if u does not happen, v does not happen.

6. SAME denotes that u and v both refer to the same event (spins and Spinning in Figure 1).

Early work on temporal logic (Allen, 1983) contained more temporal relations than are used in our relation set R. We chose a relation set R that captures the essential aspects of temporal relations between events in a process, while keeping the annotation as simple as possible. For instance, we include the SUPER relation, which appears in temporal annotations such as the TimeBank corpus (Pustejovsky et al., 2003) and in Allen's work, but in practice has not been considered by many temporal ordering systems (Chambers and Jurafsky, 2008; Yoshikawa et al., 2009; Do et al., 2012). Importantly, our relation set also includes the relations CAUSES and ENABLES, which are fundamental to modeling processes and go beyond simple temporal ordering. We also added event coreference (SAME) to R. Do et al. (2012) used event coreference information in a temporal ordering task to modify probabilities provided by pairwise classifiers prior to joint inference. In this paper, we simply treat SAME as another event-event relation, which allows us to easily perform joint inference and employ structural constraints that combine both coreference and temporal relations simultaneously. For example, if u and v are the same event, then there can exist no w such that u is before w but v is after w (see Section 3.3).

We annotated 148 process descriptions based on the aforementioned definitions. Further details on annotation and dataset statistics are provided in Section 4 and Table 1.

Table 1: Process statistics over 148 process descriptions. NONE is used to indicate no relation.

Structural properties of processes Coherent processes exhibit many structural properties. For example, two argument mentions related to the same event cannot overlap - a constraint that has been used in the past in SRL (Toutanova et al., 2008). In this paper we focus on three main structural properties of the graph P. First, in a coherent process, all events mentioned are related to one another, and hence the graph P must be connected. Second, processes tend to have a "chain-like" structure where one event follows another, and thus we expect node degrees to generally be ≤ 2. Indeed, 90% of event mentions have degree ≤ 2, as demonstrated by the Gold column of Table 2. Last, if we consider relations between all possible triples of events in a process, clearly some configurations are impossible, while others are common (illustrated in Figure 2). In Section 3.3, we show that modeling these properties using a joint inference framework improves the quality of process extraction significantly.

Table 2: Node degree distribution for event mentions on the training set. Predictions for the Local and Global models were obtained using 10-fold cross validation.

Figure 2: Relation triangles (a)-(c) are common in the gold standard while (d)-(e) are impossible.

3 Joint Model For Process Extraction

Given a paragraph x and a trigger set T , we wish to extract all event-event relations E. Similar to Do et al. (2012) , our model consists of a local pairwise classifier and global constraints. We first introduce a classifier that is based on features from previous work. Next, we describe novel features specific for process extraction. Last, we incorporate global constraints into our model using an ILP formulation.

3.1 Local Pairwise Classifier

The local pairwise classifier predicts relations between all event mention pairs. In order to model the direction of relations, we expand the set R to include the reverse of four directed relations: PREV-NEXT, SUPER-SUB, CAUSES-CAUSED, ENABLES-ENABLED. After adding NONE to indicate no relation, and including the undirected relations COTEMP and SAME, R contains 11 relations. The classifier is hence a function f : T × T → R. As an example, f(t_i, t_j) = PREV iff f(t_j, t_i) = NEXT.

Let n be the number of triggers in a process, and t_i be the i-th trigger in its description. Since f(t_i, t_j) completely determines f(t_j, t_i), it suffices to consider only pairs with i < j. Note that the process graph P is undirected under the new definition of R. Table 3 describes the features from previous work (Chambers and Jurafsky, 2008; Do et al., 2012) extracted for a trigger pair (t_i, t_j). Some features were omitted since they did not yield improvement in performance on a development set (e.g., lemmas and part-of-speech tags of context words surrounding t_i and t_j), or they require gold annotations provided in TimeBank, which we do not have (e.g., tense and aspect of triggers). To reduce sparseness, we convert nominalizations into their verbal forms when computing word lemmas, using WordNet's (Fellbaum, 1998) derivation links.
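For concreteness, the expanded relation set and the symmetry between f(t_i, t_j) and f(t_j, t_i) can be written as a small Python sketch (ours, not the released implementation):

```python
# Directed relations and their reverses; COTEMP and SAME are undirected.
REVERSE = {"PREV": "NEXT", "SUPER": "SUB",
           "CAUSES": "CAUSED", "ENABLES": "ENABLED"}
REVERSE.update({v: k for k, v in REVERSE.items()})

RELATIONS = sorted(REVERSE) + ["COTEMP", "SAME", "NONE"]  # |R| = 11

def flip(relation):
    """f(t_j, t_i) as completely determined by f(t_i, t_j)."""
    return REVERSE.get(relation, relation)  # undirected labels map to themselves

assert flip("PREV") == "NEXT" and flip("SAME") == "SAME"
assert len(RELATIONS) == 11
```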

3.2 Classifier Extensions

A central source of information for extracting event-event relations from text is connectives such as after, during, etc. However, there is variability in the occurrence of these connectives, as demonstrated by the following two sentences (connectives in boldface, triggers in italics):

1. Because alleles are exchanged during gene flow, genetic differences are reduced.
2. During gene flow, alleles are exchanged, and genetic differences are hence reduced.

Even though both sentences express the same relation (exchanged, reduced, CAUSES), the connectives used and their linear positions with respect to the triggers are different. Also, in sentence 1, gene flow intervenes between exchanged and reduced. Since our dataset is small, we wish to identify the triggers related to each connective and share features between such sentences. We do this using the syntactic structure and by clustering the connectives. Sentence 1 presents a typical case where, by walking up the dependency tree from the marker because, we can find the triggers related by this marker:

    because ←mark← exchanged ←advcl← reduced

Whenever a trigger is the head of an adverbial clause and marked by a mark dependency label, we walk up the dependency tree and look for the trigger in the main clause that is closest to the root (or the root itself, as in this example). By utilizing the syntactic structure, we can correctly spot that the trigger gene flow is not related to the trigger exchanged through the connective because, even though they are linearly closer. In order to reduce the sparseness of connectives, we created a hand-made clustering of 30 connectives that maps words into clusters [2] (e.g., because, since and hence map to a "causality" cluster). After locating the relevant pair of triggers, we use these clusters to fire the same feature for connectives belonging to the same cluster. We perform a similar procedure whenever a trigger is part of a prepositional phrase (imagine sentence 1 starting with "due to allele exchange during gene flow . . ."), by walking up the constituency tree, but details are omitted for brevity. In sentence 2, the connective hence is an adverbial modifier of the trigger reduced. We look up the cluster for the connective hence and fire the same feature for the adjacent triggers exchanged and reduced.
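The following Python sketch illustrates the idea on a toy dependency tree given as a child → (head, label) map; the cluster names and the simplified walking procedure are our own illustration of the method described above:

```python
# Hand-made connective clustering (cluster names here are invented).
CONNECTIVE_CLUSTER = {"because": "causality", "since": "causality",
                      "hence": "causality", "after": "order", "during": "overlap"}

def connective_feature(connective, head_of, triggers):
    """If `connective` is a `mark` of an adverbial clause headed by a trigger,
    pair that trigger with the closest trigger towards the root and fire one
    feature per connective *cluster*, not per connective word."""
    head, label = head_of[connective]
    if label != "mark" or head not in triggers:
        return None
    node = head                       # trigger heading the adverbial clause
    while node in head_of:            # walk up towards the root
        node = head_of[node][0]
        if node in triggers:          # trigger in the main clause
            cluster = CONNECTIVE_CLUSTER.get(connective, connective)
            return (head, node, "connective=" + cluster)
    return None

# Sentence 1: "Because alleles are exchanged ..., genetic differences are reduced."
head_of = {"because": ("exchanged", "mark"), "exchanged": ("reduced", "advcl")}
print(connective_feature("because", head_of, {"exchanged", "reduced"}))
# -> ('exchanged', 'reduced', 'connective=causality')
```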

We further extend our features to handle the rich relation set necessary for process extraction. The first event of a process is often expressed as a nominalization and includes subsequent events (SUPER relation), e.g., "The Calvin cycle begins by incorporating...". To capture this, we add a feature that fires when the first event of the process description is a noun. We also add two features targeted at the SAME relation: one indicating whether the lemmas of t_i and t_j are the same, and another specifying the determiner of t_j, if it exists. Certain determiners indicate that an event trigger has already been mentioned; e.g., the determiner this hints at a SAME relation in "The next steps decompose citrate back to oxaloacetate. This regeneration makes . . .". Last, we add as a feature the dependency path between t_i and t_j, if one exists; e.g., in "meiosis produces cells that divide . . .", the feature dobj→ rcmod→ is fired for the trigger pair produces and divide. In Section 4.1 we empirically show that our extensions to the local classifier substantially improve performance.

For our pairwise classifier, we train a maximum entropy classifier that computes a probability p_ijr for every trigger pair (t_i, t_j) and relation r. Hence, f(t_i, t_j) = argmax_r p_ijr.
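Since a maximum entropy classifier is multinomial logistic regression, the pairwise classifier could look like the following sketch (feature extraction is stubbed out with dictionaries; this is our illustration, not the authors' code):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)  # multinomial maxent

def train(pair_feature_dicts, relation_labels):
    """pair_feature_dicts: one {feature_name: value} dict per trigger pair."""
    clf.fit(vec.fit_transform(pair_feature_dicts), relation_labels)

def scores(pair_feature_dicts):
    """Return p_ijr for every candidate pair: one probability row per pair,
    one column per relation in clf.classes_."""
    return clf.predict_proba(vec.transform(pair_feature_dicts))
```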

3.3 Global Constraints

Naturally, pairwise classifiers are local models that can violate global properties of the process structure. Figure 3 (left) presents an example of predictions made by the pairwise classifier that leave two triggers (deleted and duplicated) isolated from the rest of the triggers. In this section, we discuss how we incorporate constraints into our model to generate coherent global process structures.

Figure 3: Process graph fragments. Black edges (dotted) are predictions of Local, green (solid) are predictions of Global, and gold (dashed) are gold standard edges. To reduce clutter, we present the predictions of Global only when it disagrees with Local. In all other cases, the predictions of Global and Local are identical. Original text, Left: “... the template shifts . . . , and a part of the template strand is either skipped by the replication machinery or used twice as a template. As a result, a segment of DNA is deleted or duplicated.” Right: “Cells of mating type A secrete a signaling molecule, which can bind to specific receptor proteins on nearby cells. At the same time, cells secrete factor, which binds to receptors on A cells.”

Let θ_ijr be the score for a relation r between the trigger pair (t_i, t_j) (e.g., θ_ijr = log p_ijr), and y_ijr be the corresponding indicator variable. Our goal is to find an assignment for the indicators y = {y_ijr | 1 ≤ i < j ≤ n, r ∈ R}. With no global constraints this can be formulated as the following ILP:

    max_y Σ_{i<j} Σ_r θ_ijr y_ijr    (1)
    s.t. ∀ i,j: Σ_r y_ijr = 1,

where the constraint ensures exactly one relation between each event pair. We now describe constraints that result in a coherent global process structure.
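Before turning to those constraints, objective (1) alone can be sketched with the open-source PuLP/CBC toolkit (the paper solves its ILP exactly with Gurobi; the function and variable names below are ours):

```python
import itertools
import pulp

def base_ilp(n, theta, relations):
    """theta: dict (i, j, r) -> score, for 0 <= i < j < n and r in relations."""
    prob = pulp.LpProblem("process_extraction", pulp.LpMaximize)
    y = {(i, j, r): pulp.LpVariable(f"y_{i}_{j}_{r}", cat="Binary")
         for i, j in itertools.combinations(range(n), 2) for r in relations}
    prob += pulp.lpSum(theta[k] * y[k] for k in y)        # objective (1)
    for i, j in itertools.combinations(range(n), 2):      # exactly one relation per pair
        prob += pulp.lpSum(y[i, j, r] for r in relations) == 1
    return prob, y
```

Calling `prob.solve(pulp.PULP_CBC_CMD(msg=0))` and reading `y[k].value()` then recovers the predicted relation for every pair; without further constraints this reproduces the local classifier's argmax.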

Connectivity Our ILP formulation for enforcing connectivity is a minor variation of the one suggested by Martins et al. (2009) for dependency parsing. In our setup, we want P to be a connected undirected graph, not a directed tree. However, an undirected graph P is connected iff there exists a directed tree that is a subgraph of P when edge directions are ignored. Thus the resulting formulation is almost identical and is based on flow constraints, which ensure that there is a path from a designated root in the graph to all other nodes. Let R̄ be the set R \ {NONE}. An edge (t_i, t_j) is in E iff there is some non-NONE relation between t_i and t_j, i.e., iff y_ij := Σ_{r∈R̄} y_ijr equals 1. For each variable y_ij we define two auxiliary binary variables z_ij and z_ji that correspond to edges of the directed tree that is a subgraph of P. We ensure that the edges of the tree also exist in P by tying each auxiliary variable to its corresponding ILP variable:

    z_ij ≤ y_ij,  z_ji ≤ y_ij    ∀ 1 ≤ i < j ≤ n    (2)

Next, we add constraints ensuring that the graph structure induced by the auxiliary variables is a tree rooted at an arbitrary node, say node 1 (the choice of root does not affect connectivity). We add for every i ≠ j a flow variable φ_ij, which specifies the amount of flow on the directed edge z_ij:

    Σ_{i≠j} z_ij = 1  ∀ j ≠ 1,    Σ_{i≠1} z_i1 = 0    (3)

    Σ_{j≠1} φ_1j = n − 1    (4)

    Σ_{i≠j} φ_ij − Σ_{k≠j} φ_jk = 1  ∀ j ≠ 1    (5)

    φ_ij ≤ (n − 1) · z_ij  ∀ i ≠ j    (6)

Equation 3 says that every node in the graph has exactly one parent, except for the root, which has no parent. Equation 4 ensures that the outgoing flow from the root is n − 1, and Equation 5 states that each of the other n − 1 nodes consumes exactly one unit of flow. Last, Equation 6 ties the auxiliary variables to the flow variables, making sure that flow occurs only on tree edges. The combination of these constraints guarantees that the graph induced by the variables z_ij is a directed tree, and consequently the graph induced by the objective variables y is connected.
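A sketch of these flow constraints in PuLP, building on the base ILP above; `y_edge[(i, j)]` (with i < j) is assumed to be a binary variable tied to Σ_r y_ijr over non-NONE relations:

```python
import pulp

def add_connectivity(prob, y_edge, n):
    """Add Eqns. (2)-(6), rooting the tree at node 0."""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    z = {p: pulp.LpVariable(f"z_{p[0]}_{p[1]}", cat="Binary") for p in pairs}
    phi = {p: pulp.LpVariable(f"phi_{p[0]}_{p[1]}", lowBound=0) for p in pairs}
    for i, j in pairs:                       # (2): tree edges must exist in P
        prob += z[i, j] <= y_edge[min(i, j), max(i, j)]
    for j in range(1, n):                    # (3): every non-root has one parent
        prob += pulp.lpSum(z[i, j] for i in range(n) if i != j) == 1
    prob += pulp.lpSum(z[i, 0] for i in range(1, n)) == 0         # (3): root has none
    prob += pulp.lpSum(phi[0, j] for j in range(1, n)) == n - 1   # (4): root emits n-1
    for j in range(1, n):                    # (5): each non-root consumes one unit
        prob += (pulp.lpSum(phi[i, j] for i in range(n) if i != j)
                 - pulp.lpSum(phi[j, k] for k in range(n) if k != j)) == 1
    for i, j in pairs:                       # (6): flow only on tree edges
        prob += phi[i, j] <= (n - 1) * z[i, j]
    return z, phi
```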

Chain structure A chain is a connected graph where the degree of all nodes is ≤ 2. Table 2 presents node degrees and demonstrates that process graphs are indeed close to being chains. The following constraint bounds each node's degree by 2:

    Σ_{j≠i} y_ij ≤ 2  ∀ i    (7)

Since graph structures are not always chains, we add this as a soft constraint; that is, we penalize the objective for each node with degree > 2. The chain structure is one of several soft constraints we enforce. Thus, our modified objective function is Σ_{ijr} θ_ijr y_ijr + Σ_{k∈K} α_k C_k, where K is the set of soft constraints, α_k is the penalty (or reward for desirable structures), and C_k indicates whether a constraint is violated (or satisfied). Note that under this formulation our model is simply a constrained conditional model (Chang et al., 2012). The parameters α_k are tuned on a development set (see Section 4).
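One way to encode the soft chain constraint in the same PuLP sketch is with a penalized slack variable per node (the default penalty below is arbitrary; the paper tunes these values on the development set):

```python
import pulp

def chain_penalties(prob, y_edge, n, alpha=-0.5):
    """Return terms alpha * C_i (alpha < 0) to add to the objective before
    solving; the binary slack C_i is forced to 1 whenever node i has degree > 2,
    so the solver pays |alpha| for every chain violation."""
    terms = []
    for i in range(n):
        degree = pulp.lpSum(y_edge[min(i, j), max(i, j)]
                            for j in range(n) if j != i)
        c = pulp.LpVariable(f"chain_violated_{i}", cat="Binary")
        prob += degree - 2 <= (n - 1) * c   # degree > 2 forces c = 1
        terms.append(alpha * c)
    return terms
```

Because the objective is maximized and alpha is negative, the solver sets each C_i to 0 whenever the degree bound holds, so only genuine violations are penalized.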

Relation triads A relation triad (or relation triangle) for any three triggers t_i, t_j and t_k in a process is a 3-tuple of relations (f(t_i, t_j), f(t_j, t_k), f(t_i, t_k)). Clearly, some triads are impossible while others are quite common. To find triads that could improve process extraction, we counted the frequency of all possible triads in both the training set and the output of the pairwise classifier, and focused on those for which the classifier and the gold standard disagree. We are interested in triads that never occur in the training data but are predicted by the classifier, and vice versa. Figure 2 illustrates some of the triads found, and Equations 8-12 provide the corresponding ILP formulations. Equations 8-10 were formulated as soft constraints (expanding the set K) and were incorporated by defining a reward α_k for each triad type. [3] On the other hand, Equations 11-12 were formulated as hard constraints to prevent certain structures.

1. SAME transitivity (Figure 2a, Eqn. 8): Coreference transitivity has been used in past work (Finkel and Manning, 2008), and we incorporate it by a constraint that encourages triads that respect transitivity.

2. Common cause (Figure 2b, Eqn. 9): If t_i causes both t_j and t_k, then often t_j and t_k are co-temporal. E.g., in "genetic drift has led to a loss of genetic variation and an increase in the frequency of . . .", a single event causes two subsequent events that occur simultaneously.

3. COTEMP transitivity (Figure 2c, Eqn. 10): If t_i is co-temporal with t_j and t_j is co-temporal with t_k, then usually t_i and t_k are either co-temporal or denote the same event.

4. SAME contradiction (Figure 2d, Eqn. 11): If t_i is the same event as t_k, then their temporal ordering with respect to a third trigger t_j may result in a contradiction, e.g., if t_j is after t_i but before t_k. We define 5 temporal categories that generate 5² possible contradictions, but for brevity present just one representative hard constraint. This constraint depends on predicting temporal and co-reference relations jointly.

5. PREV contradiction (Figure 2e, Eqn. 12): As mentioned (Section 3.3), if t_i is immediately before t_j, and t_j is immediately before t_k, then t_i cannot be immediately before t_k:

    y_{ij,PREV} + y_{jk,PREV} + y_{ik,PREV} ≤ 2    (12)
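The sketch below illustrates two of these triads in the PuLP formulation used earlier: the hard PREV contradiction of Eqn. 12, and one plausible (our own) encoding of the soft SAME-transitivity reward, which pays a bonus only for fully transitive SAME triangles:

```python
import pulp

def add_triad_constraints(prob, y, i, j, k, alpha_same=0.3):
    """y: dict (i, j, relation) -> binary variable; alpha_same > 0 is a
    reward that the paper would tune on the development set."""
    # Eqn. (12), hard: t_i PREV t_j and t_j PREV t_k preclude t_i PREV t_k.
    prob += y[i, j, "PREV"] + y[j, k, "PREV"] + y[i, k, "PREV"] <= 2
    # Eqn. (8), soft: binary s can be 1 only if all three SAME edges are on,
    # so maximizing alpha_same * s rewards transitive SAME triangles.
    s = pulp.LpVariable(f"same_triangle_{i}_{j}_{k}", cat="Binary")
    for key in ((i, j, "SAME"), (j, k, "SAME"), (i, k, "SAME")):
        prob += s <= y[key]
    return alpha_same * s  # add this term to the objective
```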

We used the Gurobi optimization package [4] to find an exact solution for our ILP, which contains O(n²|R|) variables and O(n³) constraints. We also developed an equivalent formulation amenable to dual decomposition, which is a faster approximate method. In practice, however, solving the ILP exactly with Gurobi was quite fast (average/median time per process: 0.294 sec / 0.152 sec on a standard laptop).

4 Experimental Evaluation

We extracted 148 process descriptions by going through chapters of the textbook "Biology" and marking any contiguous sequence of sentences that describes a process, i.e., a series of events that lead towards some objective. Each process description was then annotated by a biologist. The annotator was first presented with annotation guidelines and annotated 20 descriptions. These annotations were discussed with the authors, after which all process descriptions were annotated. After training a second biologist, we measured an inter-annotator agreement of κ = 0.69 on 30 random process descriptions.

Process descriptions were parsed with the Stanford constituency and dependency parsers (Klein and Manning, 2003; de Marneffe et al., 2006), and 35 process descriptions were set aside as a test set (number of training set trigger pairs: 1932; number of test set trigger pairs: 906). We performed 10-fold cross validation over the training set for feature selection and tuning of constraint parameters. For each constraint type (connectivity, chain structure, and five triad constraints) we introduced a parameter and tuned the seven parameters by coordinate-wise ascent, where for hard constraints a binary parameter controls whether the constraint is used, and for soft constraints we attempted 10 different reward/penalty values. For our global model we defined θ_ijr = log p_ijr, where p_ijr is the probability of label r at edge (t_i, t_j), given by the pairwise classifier.
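The coordinate-wise ascent loop is straightforward; a generic sketch (ours, with hypothetical names) sweeps each parameter in turn and keeps any value that improves development F_1:

```python
def coordinate_ascent(param_grid, dev_f1):
    """param_grid: {param_name: candidate values}; dev_f1: params dict -> score.
    Repeats full sweeps until no single-parameter change improves the score."""
    params = {name: values[0] for name, values in param_grid.items()}
    best = dev_f1(params)
    improved = True
    while improved:
        improved = False
        for name, values in param_grid.items():
            for v in values:
                trial = dict(params, **{name: v})
                score = dev_f1(trial)
                if score > best:
                    best, params, improved = score, trial, True
    return params, best
```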

We test the following systems:

(a) All-Prev: Since the most common process structure was chain-like, we simply predict PREV for every two adjacent triggers in the text.

(b) Local_base: A pairwise classifier with features from previous work (Section 3.1).

(c) Local: A pairwise classifier with all features (Section 3.2).

(d) Chain: For every two adjacent triggers, choose the non-NONE relation with the highest probability according to Local. This baseline heuristically combines our structural assumptions with the pairwise classifier: we deterministically choose a connected chain structure, and then use the classifier to label the edges.

(e) Global: Our full model, which uses ILP inference.

Table 3: Features extracted for a trigger pair (t_i, t_j). Asterisks (*) indicate features that are duplicated, once for each trigger.

To evaluate system performance we compare the set of predictions on all trigger pairs to the gold standard annotations and compute micro-averaged precision, recall and F_1. We perform two types of evaluation: (a) Full: evaluation on our full set of 11 relations; (b) Temporal: evaluation on temporal relations only, obtained by collapsing PREV, CAUSES and ENABLES into a single category, and similarly NEXT, CAUSED and ENABLED (inter-annotator agreement κ = 0.75). We computed statistical significance of our results with the paired bootstrap resampling method using 2000 iterations (Efron and Tibshirani, 1993), where the units resampled are trigger-trigger-relation triples.

Table 4 presents the performance of all systems. We see that using global constraints improves performance almost invariably on all measures in both the full and temporal evaluations. In particular, in the full evaluation Global improves recall by 12%, and overall F_1 improves significantly, by 3.7 points, over Local (p < 0.01). The recall improvement suggests that modeling connectivity allowed Global to add correct relations in cases where some events were not connected to one another.
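For reference, micro-averaged P/R/F_1 over trigger pairs and the paired bootstrap test can be sketched as follows (our illustration; `gold` and the prediction maps are assumed to be dicts from trigger pairs to relation labels, with NONE meaning "no edge"):

```python
import random

def micro_prf(gold, pred):
    """Micro-averaged precision/recall/F1, counting non-NONE predictions only."""
    tp = sum(1 for k in pred if pred[k] != "NONE" and pred[k] == gold[k])
    p_denom = sum(1 for k in pred if pred[k] != "NONE")
    r_denom = sum(1 for k in gold if gold[k] != "NONE")
    p = tp / p_denom if p_denom else 0.0
    r = tp / r_denom if r_denom else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def paired_bootstrap(gold, pred_a, pred_b, iters=2000, seed=0):
    """Approximate p-value: fraction of resamples (over trigger-trigger-relation
    triples) in which system A fails to beat system B on F1."""
    rng, keys, losses = random.Random(seed), list(gold), 0
    for _ in range(iters):
        sample = [rng.choice(keys) for _ in keys]
        g = {i: gold[k] for i, k in enumerate(sample)}
        a = {i: pred_a[k] for i, k in enumerate(sample)}
        b = {i: pred_b[k] for i, k in enumerate(sample)}
        if micro_prf(g, a)[2] <= micro_prf(g, b)[2]:
            losses += 1
    return losses / iters
```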

Table 4: Test set results on all experiments. The best number in each column is bolded. † and ‡ denote statistical significance (p < 0.01) against the Local_base and Local baselines, respectively.

4.1 Results

The Local classifier substantially outperforms Local_base. This indicates that our novel features (Section 3.2) are important for discriminating between process relations. Specifically, in the full evaluation Local improves precision more than in the temporal evaluation, suggesting that designing syntactic and semantic features for connectives is useful for distinguishing PREV, CAUSES, and ENABLES when the amount of training data is small.

The Chain baseline performs only slightly worse than our global model. This demonstrates the strong tendency of processes to proceed linearly from one event to the next, which is a known property of discourse structure (Schegloff and Sacks, 1973). However, since the structure is deterministically fixed, Chain is highly inflexible and does not allow any extensions or the incorporation of other structural constraints or domain knowledge. Thus, it can be used as a simple and efficient approximation, but it is not a good candidate for a real system. Further support for the linear nature of process structure is provided by the All-Prev baseline, which performs poorly in the full evaluation but works reasonably well in the temporal evaluation.

Table 2 presents the degree distributions of Local and Global on the development set compared with the gold standard. The degree distribution of Global is more similar to the gold standard than that of Local. In particular, the connectivity constraint ensures that there are no isolated nodes and shifts mass from nodes with degree 0 and 1 to nodes with degree 2.

Table 5 presents the order in which constraints were introduced into the global model using coordinate ascent on the development set. Connectivity is the first constraint to be introduced, and it improves performance considerably. The chain constraint, on the other hand, is included third, and its improvement in F_1 score is relatively smaller. This can be explained by the distribution of degrees in Table 2, which shows that the predictions of Local do not contain many nodes with degree > 2. As for triad constraints, we see that four are important and included in the model, while one is discarded.

Table 5: Order in which constraint parameters were set using coordinate ascent on the development set. For each parameter, the value chosen and the F_1 score after including the constraint are provided. Negative values correspond to penalties, positive values to rewards, and a value of ∞ indicates a hard constraint.

Last, we examined the results of Global when macro-averaging over processes, i.e., assigning each process the same weight by computing recall, precision and F_1 for each process and averaging those scores. We found that the results are quite similar (with a slight improvement): in the full evaluation Global obtains R/P/F_1 of 56.4/55.0/55.7, and in the temporal evaluation Global obtains R/P/F_1 of 63.8/62.3/63.1.

Figure 3 shows two examples where global constraints corrected the predictions of Local. In Figure 3, left, Local failed to predict the causal relations skipped-deleted and used-duplicated, possibly because they are not in the same sentence and are not adjacent to one another. By enforcing the connectivity constraint, Global adds the correct relations and connects deleted and duplicated to the other triggers in the process. In Figure 3, right, Local predicts a structure that results in a "SAME contradiction". The triggers bind and binds cannot denote the same event if a third trigger secrete is temporally between them. However, Local predicts they are the same event, as they share a lemma. Global prohibits this structure and correctly predicts the relation as NONE.

4.2 Qualitative Analysis

To better understand the performance of Local, we analyzed the confusion matrix generated from its predictions. Although this is a challenging 11-class classification task, most of the mass is concentrated on the matrix diagonal, as desired. Error analysis reveals that 17.5% of all errors are confusions between NONE and PREV, 11.1% between PREV and CAUSES, and 8.6% between PREV and COTEMP. This demonstrates that distinguishing the classes PREV, CAUSES and COTEMP is challenging for Local. Our current global constraints do not address this type of error, and thus an important direction for future work is to improve the local model.

The global model depends on the predictions of the local classifier, and so enforcing global constraints does not guarantee an improvement in performance. For instance, if Local produces a graph that is disconnected (e.g., deleted in Figure 3, left), then Global will add an edge. However, the label of the edge is determined by scores computed from the local classifier, and if this prediction is wrong, we are now penalized both for the false negative of the correct class (just as before) and for the false positive of the predicted class. Despite this, Global improves overall performance by 3.7 F_1 points on the test set.

5 Related Work

A related line of work is biomedical event extraction, as in the recent BioNLP shared tasks (Kim et al., 2009; Kim et al., 2011). Earlier work employed a pipeline architecture where events are found first and their arguments identified afterwards (Miwa et al., 2010; Björne et al., 2011). Subsequent methods predicted events and arguments jointly using Markov logic (Poon and Vanderwende, 2010) and dependency parsing algorithms (McClosky et al., 2011). Riedel and McCallum (2011) further improved performance by capturing correlations between events and enforcing consistency across arguments.

Temporal event-event relations have been extensively studied (Chambers and Jurafsky, 2008; Yoshikawa et al., 2009; Denis and Muller, 2011; Do et al., 2012; McClosky and Manning, 2012; D'Souza and Ng, 2013), and we leverage such techniques in our work (Section 3.1). However, we extend beyond temporal relations alone and strongly rely on dependencies between process events. Chambers and Jurafsky (2011) learned event templates (or frames) that capture events related to one another along with their semantic roles. Recently, Cheung et al. (2013) proposed an unsupervised generative model for inducing such templates. A major difference in our work is that we do not learn typical event relations from a large and redundant corpus; rather, we are given a single paragraph and have a "one-shot" chance to extract the process structure.

We showed in this paper that global structural properties lead to significant improvements in extraction accuracy, and that ILP is an effective framework for modeling global constraints. Similar observations and techniques have been proposed in other information extraction tasks. Reichart and Barzilay (2012) tied information from multiple sequence models that describe the same event by using global higher-order potentials. Berant et al. (2011) proposed a global inference algorithm to identify entailment relations. There is an abundance of examples of enforcing global constraints in other NLP tasks, such as coreference resolution (Finkel and Manning, 2008), parsing (Rush et al., 2012) and named entity recognition (Wang et al., 2013).

6 Conclusion

Developing systems that understand process descriptions is an important step towards building applications that require deeper reasoning, such as constructing biological process models from text, intelligent tutoring systems, and non-factoid QA systems. In this paper we have presented the task of process extraction and developed methods for extracting relations between process events. Processes contain events that are tightly coupled through strong dependencies. We have shown that exploiting these structural dependencies and performing joint inference over all event mentions can significantly improve accuracy over several baselines. We have also released a new dataset containing 148 fully annotated descriptions of biological processes. Though the models we built were trained on biological processes, they do not encode domain-specific information and hence should be extensible to other domains.

In this paper we assumed that event triggers are given as input. In future work, we want to perform trigger identification jointly with the extraction of event-event relations. As explained in Section 4.2, the performance of our system is limited by the performance of the local classifier, which is trained on relatively small amounts of data. Since data annotation is expensive, it is important to improve the local classifier without increasing the annotation burden. For example, one could use unsupervised methods that learn narrative chains (Chambers and Jurafsky, 2011) to provide a prior on the typical order of events. Alternatively, we could search the web for redundant descriptions of the same process and use this redundancy to improve classification. Last, we would like to integrate our method into QA systems and allow non-factoid questions that require deeper reasoning to be answered by matching the questions against the learned process structures.

Footnotes

[1] Argument mentions are also related by coreference relations, but we neglect those since they are not central to this paper.

[2] The full set of connectives and their clustering are provided as part of our publicly released package.

[3] We experimented with a reward for certain triads or a penalty for others, and empirically found that using rewards results in better performance on the development set.

[4] www.gurobi.com