Abstract
We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study the challenging task of semantic role labeling in videos, or VidSRL, we introduce the VidSitu benchmark, a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds. Entities are co-referenced across events within a movie clip and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) as well as diverse (∼200 verbs have more than 100 annotations each). We provide a comprehensive analysis of the dataset in comparison to other publicly available video understanding benchmarks, present several illustrative baselines, and evaluate a range of standard video recognition models. Our code and dataset are available at vidsitu.org.
Figure 1 (caption): Following nomenclature introduced in ImSitu [83], every verb ("deflect") has a set of roles (Arg0: deflector, Arg1: thing deflected) which are realized by noun values; here, "value" refers to the free-form text used to describe the roles ("woman with shield", "boulder"). VidSRL entails producing a verb for the salient activity within each 2-second interval, predicting the entities that fulfill various roles related to that event, and relating these events across time.
1. Introduction
Videos record events in our lives with both short and long temporal horizons. These recordings frequently relate multiple events separated geographically and temporally and capture a wide variety of situations involving human beings interacting with other humans, objects and their environment. Extracting such rich and complex information from videos can drive numerous downstream applications such as describing videos [35, 82, 77], answering queries about them [85, 81], retrieving visual content [50], building knowledge graphs [48], and even teaching embodied agents to act and interact with the real world [84].
Parsing video content is an active area of research, with much of the focus centered around tasks such as action classification [31], localization [24], and spatio-temporal detection [21]. Although parsing human actions is a critical component of understanding videos, actions by themselves paint an incomplete picture, missing critical pieces such as the agent performing the action, the object being acted upon, the tool or instrument used to perform the action, the location where the action is performed, and more. Expository tasks such as video captioning and story-telling provide a more holistic understanding of the visual content; but akin to their counterparts in the image domain, they lack a clear definition of the type of information being extracted, making them notoriously hard to evaluate [32, 74].
Recent work in the image domain [83, 58, 22] has attempted to move beyond action classification via the task of visual semantic role labeling: producing not just the primary activity in an image or region, but also the entities participating in that activity via different roles. Building upon this line of research, we propose VidSRL, the task of recognizing spatio-temporal situations in video content. As illustrated in Figure 1, VidSRL involves recognizing and temporally localizing salient events across the video, identifying participating actors, objects, and locations involved within these events, co-referencing these entities across events over the duration of the video, and relating how events affect each other over time. We posit that VidSRL, a considerably more detailed and involved task than action classification, with more precise definitions of the extracted information than video captioning, is a step towards obtaining a holistic understanding of complex videos.
To study VidSRL, we present VidSitu, a large video understanding dataset of over 29K videos drawn from a diverse set of 3K movies. Videos in VidSitu are exactly 10 seconds long and are annotated with 5 verbs, corresponding to the most salient event taking place within each of the five 2-second intervals in the video. Each verb annotation is accompanied by a set of roles whose values are annotated using free-form text. In contrast to verb annotations, which are drawn from a fixed vocabulary, the free-form role annotations allow the use of referring expressions (e.g. boy wearing a blue jacket) to disambiguate entities in the video. An entity that occurs in any of the five clips within a video is consistently referred to using the same expression, allowing us to develop and evaluate models with co-referencing capability. Finally, the dataset also contains event-relation annotations capturing causation (Event Y is Caused By/Reaction To Event X) and contingency (Event X is a pre-condition for Event Y). The key highlights of VidSitu include: (i) Diverse Situations: VidSitu enjoys a large vocabulary of verbs (1500 unique verbs curated from PropBank [54], with 200 verbs having at least 100 event annotations) and entities (5600 unique nouns, with 350 nouns occurring in at least 100 videos); (ii) Complex Situations: each video is annotated with 5 inter-related events and has an average of 4.2 unique verbs and 6.5 unique entities; and (iii) Rich Annotations: VidSitu provides structured event representations (3.8 roles per event) with entity co-referencing and event-relation labels.
To facilitate further research on VidSRL, we provide a comprehensive benchmark that supports partwise evaluation of the various capabilities required for solving VidSRL, and we create baselines for each capability using state-of-the-art architectural components to serve as a point of reference for future work. We also carefully choose metrics that provide a meaningful signal of progress towards achieving competency on each capability. Finally, we perform a human-agreement analysis that reveals significant room for improvement on the VidSitu benchmark.
Our main contributions are: (i) the VidSRL task formalism for understanding complex situations in videos; (ii) curating the richly annotated VidSitu dataset, which consists of diverse and complex situations for studying VidSRL; and (iii) establishing an evaluation methodology for assessing crucial capabilities needed for VidSRL, along with baselines for each using state-of-the-art components. The dataset and code are publicly available at vidsitu.org.
2. Related Work
Video Understanding, a fundamental goal of computer vision, is an incredibly active area of research involving a wide variety of tasks such as action classification [8, 16, 75], localization [44, 43], spatio-temporal detection [19], video description [77, 35], question answering [85], and object grounding [61]. Tasks like detecting atomic actions at 1-second intervals [19, 79, 67] are short-horizon tasks, whereas ones like summarizing 180-second-long videos [91] are extremely long-horizon tasks. In contrast, our proposed task of VidSRL operates on 10-second videos at 2-second intervals.
In support of these tasks, the community has also proposed numerous datasets [31, 24, 21] over the past few years. While early datasets were small, with several hundred or a few thousand examples [65, 36], recent datasets are massive [50], enabling researchers to train large neural models and also employ pre-training strategies [49, 92, 40]. Section 4, Table 3, and Figure 2 provide a comparison of our proposed dataset to several relevant datasets in the field. Due to space constraints, we are unable to provide a thorough description of all relevant work. Instead, we point the reader to relevant surveys on video understanding [1, 34, 87] and present a holistic overview of tasks and datasets in Table 1.
Visual Semantic Role Labeling has been primarily explored in the image domain under situation recognition [83, 58], visual semantic role labeling [22, 41, 64], and human-object interaction [10, 9]. Compared to images, visual semantic role labeling in videos requires not just recognizing actions and arguments at a single time step but also aggregating information about interacting entities across frames and co-referencing the entities participating across events.
Movies for Video Understanding: The movie domain serves as a rich data source for tasks such as spatio-temporal detection [21], movie description [60], movie question answering [70], story-based retrieval [3], generating social graphs [72], and classifying shot style [28]. In contrast to much of this prior work, we focus only on the visual activity of the various actors and objects in the scene, i.e. no additional modalities such as movie scripts, subtitles, or audio are provided in our dataset.
3. VidSRL: The Task
State-of-the-art video analysis capabilities like video activity recognition and object detection yield a fairly impoverished understanding of videos by reducing complex events involving interactions of multiple actors, objects, and locations to a bag of activity and object labels. While video captioning promises rich descriptions of videos, the open-ended task definition of captioning lends itself poorly to a systematic representation of such events and evaluation thereof. The motivation behind VidSRL is to expand the video analysis toolbox with vision models that produce richer yet structured representations of complex events in videos than currently possible through video activity recognition, object detection, or captioning.
Formal task definition. Given a video $V$, VidSRL requires a model to predict a set of related salient events $\{E_i\}_{i=1}^{k}$ constituting a situation. Each event $E_i$ consists of a verb $v_i$ chosen from a set of verbs $\mathcal{V}$ and values (entities, location, or other details pertaining to the event, described in text) assigned to the various roles relevant to the verb. We denote the roles or arguments of a verb $v$ as $\{A^{v}_{j}\}_{j=1}^{m}$, and $A^{v}_{j} \leftarrow a$ implies that the $j$-th role of verb $v$ is assigned the value $a$. In Fig. 1 for instance, event $E_1$ consists of verb $v$ = "deflect (block, avoid)" with Arg0 (deflector) $\leftarrow$ "woman with shield". The roles for the verbs are obtained from PropBank [54] (see the PropBank annotation guidelines at http://clear.colorado.edu/compsem/documents/propbank_guidelines.pdf). Finally, we denote the relationship between any two events $E$ and $E'$ by $l(E, E') \in \mathcal{L}$, where $\mathcal{L}$ is an event-relations label set. We now discuss simplifying assumptions and trade-offs in designing the task.
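To make the task output concrete, the sketch below shows one possible in-memory representation of a predicted situation: each event carries a verb sense and a role-to-value mapping, and a situation groups events with labeled pairwise relations. The class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical container for a single VidSRL event: a verb sense from the
# fixed vocabulary plus free-form text values for its PropBank-style roles.
@dataclass
class Event:
    verb: str                                             # e.g. "deflect (block, avoid)"
    roles: Dict[str, str] = field(default_factory=dict)   # e.g. {"Arg0": "woman with shield"}

# A situation groups the five 2-second events of a 10-second clip together
# with labeled relations between ordered event pairs (caused by, enabled by, ...).
@dataclass
class Situation:
    events: List[Event]
    relations: Dict[Tuple[int, int], str] = field(default_factory=dict)

# Toy example mirroring Event 1 of Fig. 1 (role assignments are illustrative).
e1 = Event(verb="deflect (block, avoid)",
           roles={"Arg0": "woman with shield", "Arg1": "boulder"})
situation = Situation(events=[e1], relations={})
```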
Timescale of Salient Events. What constitutes a salient event in a video is often ambiguous and subjective. For instance, given the 10-sec clip in Fig. 1, one could define fine-grained events around atomic actions such as "turning" (Event 2, third frame) or take a more holistic view of the sequence as involving a "fight". This ambiguity, due to the lack of constraints on timescales of events, makes annotation and evaluation challenging. We resolve it by restricting the choice of salient events to one event per fixed time interval. Previous work on recognizing atomic actions [21] relied upon 1-sec intervals. An appropriate choice of time interval for annotating events is one that enables rich descriptions of complex videos while avoiding incidental atomic actions. We observed qualitatively that a 2-sec interval strikes a good balance between obtaining descriptive events and the objectiveness needed for a systematic evaluation. Therefore, for each 10-sec clip, we annotate 5 events $\{E_i\}_{i=1}^{5}$. Appendix B.1 elaborates on this choice.
Describing an Event. We describe an event through a verb and its arguments. For verbs, we follow recent work in action recognition like ActivityNet [24] and Moments in Time [51] that chooses a verb label for each video segment from a curated list of verbs. To allow for the description of a wide variety of events, we select a large vocabulary of 2.2K visual verbs from PropBank [54]. Verbs in PropBank are diverse, distinguish between homonyms using verb senses (e.g. "strike (hit)" vs "strike (a pose)"), and provide a set of roles for each verb. We allow values of arguments for the verb to be free-form text. This allows disambiguation between different entities in the scene using referring expressions such as "man with trident" or "shirtless man" (Fig. 1). Understanding a video may require consolidating partial information across multiple views or shots. In VidSRL, while the 2-sec clip is sufficient to assign the verb, roles may require information from the whole video, since some entities involved in the event may be occluded or lie outside the camera view for those 2 secs but are visible before or after.
Co-Referencing Entities Across Events. Within a video, an entity may be involved in more than one event; for instance, "woman with shield" is involved in Events 1, 2, and 5 and "man with trident" is involved in Events 2, 3, and 4. In such cases, we expect VidSRL models to understand co-referencing, i.e. a model must be able to recognize that the entity participating across those events is the same even though the entity may be playing different roles in those events. Ideally, evaluating co-referencing capability requires grounding entities in the video (e.g. using bounding boxes). Since grounding entities in videos is an expensive process, we currently require the phrases referring to the same entity across multiple events within each 10-sec clip to match exactly for coreference assessment. See supp. for details on how coreference is enforced in our annotation pipeline.
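Since coreference is assessed via exact string match of role values within a clip, coreference chains can be recovered by grouping identical phrases across the five events. The helper below is a minimal sketch of that grouping; it takes one role-to-value dictionary per event, a hypothetical representation rather than the dataset's released format, and the toy role assignments are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def coreference_chains(event_roles: List[Dict[str, str]]) -> Dict[str, List[Tuple[int, str]]]:
    """Map each entity phrase to the (event index, role) slots it fills,
    keeping only phrases that recur in two or more events (exact match)."""
    slots = defaultdict(list)
    for i, roles in enumerate(event_roles):
        for role, value in roles.items():
            slots[value].append((i, role))
    return {phrase: occ for phrase, occ in slots.items()
            if len({idx for idx, _ in occ}) > 1}

# Toy usage mirroring Fig. 1: "woman with shield" recurs in Events 1, 2, and 5,
# "man with trident" in Events 2, 3, and 4 (role labels here are made up).
chains = coreference_chains([
    {"Arg0": "woman with shield", "Arg1": "boulder"},            # Event 1
    {"Arg0": "man with trident", "Arg1": "woman with shield"},   # Event 2
    {"Arg0": "man with trident"},                                # Event 3
    {"Arg0": "man with trident"},                                # Event 4
    {"Arg0": "woman with shield"},                               # Event 5
])
```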
Event Relations. Understanding a video requires not only recognizing individual events but also how events affect one another. Since event relations in videos are not yet well explored, we propose a taxonomy of event relations as a first step, inspired by prior work on a schema for event relations in natural language [26] that includes "Causation" and "Contingency". In particular, if Event B follows (occurs after) Event A, we have the following relations: (i) Event B is caused by Event A (Event B is a direct result of Event A); (ii) Event B is enabled by Event A (Event A does not cause Event B, but Event B would not occur in the absence of Event A); (iii) Event B is a reaction to Event A (Event B is a response to Event A); and (iv) Event B is unrelated to Event A (examples are provided in the supplementary).
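The four relation labels above form a small closed set, so they can be represented directly as an enumeration; the enum name and member spellings below are illustrative assumptions rather than the dataset's released label strings.

```python
from enum import Enum

class EventRelation(Enum):
    """Relation of a later Event B to an earlier Event A (member names are illustrative)."""
    CAUSED_BY = "B is caused by A"        # B is a direct result of A
    ENABLED_BY = "B is enabled by A"      # A does not cause B, but B would not occur without A
    REACTION_TO = "B is a reaction to A"  # B is a response to A
    UNRELATED = "B is unrelated to A"

# Example: annotate the relation of Event 2 (B) to Event 1 (A) in a clip.
relations = {(1, 2): EventRelation.REACTION_TO}
```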
4. VidSitu Dataset
To study VidSRL, we introduce the VidSitu dataset that offers videos with diverse and complex situations (a collection of related events) and rich annotations with verbs, semantic roles, entity co-references, and event relations. We describe our dataset curation decisions (Section 4.1) followed by analysis of the dataset (Section 4.2).
4.1. Dataset Curation
We briefly describe the main steps in the data curation process and provide more information in Appendix B.
Video Source Selection. Videos from movies are well suited for VidSRL since they are naturally diverse (covering a wide range of movie genres) and often involve multiple interacting entities. Also, scenarios in movies typically play out over multiple shots, which makes movies a challenging testbed for video understanding. We use videos from Condensed-Movies [3], which collates videos from MovieClips, a licensed YouTube channel containing engaging movie scenes.
Video Selection. Within the roughly 1000 hours of MovieClips videos, we select 30K diverse and interesting 10-sec videos to annotate, while avoiding visually uneventful segments common in movies, such as actors merely engaged in dialogue. This selection is performed using a combination of human detection, object detection, and atomic action prediction, followed by sampling no more than 3 videos per movie clip after discarding inappropriate content.
Curating Verb Senses. We begin with the entire PropBank [54] vocabulary of ∼6K verb senses. We manually remove fine-grained and non-visual verb senses and further discard verbs that do not appear in the MPII Movie Description (MP2D) dataset [60] (verbs extracted using a semantic-role parser [62]). This gives us a set of 2154 verb senses.
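The curation step above amounts to filtering the PropBank sense inventory against verbs observed in MP2D descriptions; the helper below is a rough sketch of that filter under the assumption that verb senses are keyed as "lemma.NN" strings. All names are illustrative, and the actual curation also involved manual judgments of visualness and granularity.

```python
from typing import Dict, Set

def filter_verb_senses(propbank_senses: Set[str],
                       nonvisual_senses: Set[str],
                       mp2d_verb_counts: Dict[str, int]) -> Set[str]:
    """Drop manually flagged non-visual / fine-grained senses, then keep only
    senses whose lemma occurs at least once in MP2D descriptions."""
    kept = set()
    for sense in propbank_senses - nonvisual_senses:
        lemma = sense.split(".")[0]            # e.g. "deflect.01" -> "deflect"
        if mp2d_verb_counts.get(lemma, 0) > 0:
            kept.add(sense)
    return kept
```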
Curating Argument Roles. We wish to establish a set of argument roles for each verb-sense. We initialize the argument list for each verb-sense using Arg0, Arg1, Arg2 arguments provided by PropBank and then expand this using frequently used (automatically extracted) arguments present in descriptions provided by the MP2D dataset.
Annotations. Annotations for the verbs, roles, and relations are obtained via Amazon Mechanical Turk (AMT). The annotation interface enables efficient annotation while encouraging rich descriptions of entities and enabling reuse of entity descriptions throughout the video (to preserve co-referencing). See Appendix B.2 for details.
Dataset splits. VidSitu is split into train, validation, and test sets via an 80:5:15 split, ensuring that videos from the same movie end up in exactly one of those sets (a minimal sketch of such a movie-disjoint split appears at the end of this subsection). Table 2 summarizes the statistics of these splits.
Table 3 (caption): Dataset statistics across video description datasets. We highlight key differences from previous datasets such as explicit SRL, co-reference, and event-relation annotations, and greater diversity and density of verbs, entities, and semantic roles. For a fair comparison, for all datasets we use a single description per video segment when more than one is available.
Multiple Annotations for Evaluation Sets. Via controlled trials (see Sec 6.1), we measured the annotation disagreement rate for the train set. Based on this data, we obtain multiple annotations for the validation and test sets using a 2-stage annotation process. In the first stage, we collect 10 verbs for each 2-second clip (1 verb per worker). In the second stage, we get role labels for the verb with the highest agreement from 3 different workers.
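As noted above, clips from the same movie must never cross split boundaries; the sketch below shows one way to realize such a movie-disjoint 80:5:15 split. Function and variable names are illustrative, and the split released with VidSitu should be used for comparability.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

def movie_disjoint_split(clip_to_movie: Dict[str, str],
                         ratios: Tuple[float, float, float] = (0.80, 0.05, 0.15),
                         seed: int = 0) -> Dict[str, List[str]]:
    """Assign whole movies (not individual clips) to train/val/test so that
    clips from the same movie never appear in more than one split."""
    movies = sorted(set(clip_to_movie.values()))
    random.Random(seed).shuffle(movies)
    n = len(movies)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    split_of_movie = {}
    for i, movie in enumerate(movies):
        split_of_movie[movie] = ("train" if i < n_train
                                 else "val" if i < n_train + n_val
                                 else "test")
    splits: Dict[str, List[str]] = defaultdict(list)
    for clip, movie in clip_to_movie.items():
        splits[split_of_movie[movie]].append(clip)
    return splits
```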
4.2. Dataset Analysis And Statistics
We present an extensive analysis of VidSitu, focusing on three key elements: (i) the diversity of events represented in the dataset; (ii) the complexity of the situations; and (iii) the richness of annotations. We provide comparisons to four prominent video datasets containing text descriptions: MSR-VTT [82], MPII Movie Description [60], ActivityNet-Captions [35], and VATEX-en [77] (the subset of descriptions in English). Table 3 summarizes basic statistics for all datasets. For consistency, we use one description per video segment whenever multiple annotations are available, as is the case for VATEX-en, MSR-VTT, the validation set of ActivityNet-Captions, and both the validation and test sets of VidSitu. For datasets without explicit verb or semantic role labels, we extract these using a semantic role parser [62].
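For reference, verb and role extraction from plain-text captions can be done with an off-the-shelf SRL model; the snippet below is a sketch using AllenNLP's predictor interface, with the model archive path left as a placeholder assumption (the paper uses the parser of [62], which may differ from this exact setup).

```python
# Requires: pip install allennlp allennlp-models
from allennlp.predictors.predictor import Predictor

# Placeholder path: substitute a released AllenNLP SRL model archive.
SRL_MODEL_PATH = "path/to/structured-prediction-srl-model.tar.gz"

predictor = Predictor.from_path(SRL_MODEL_PATH)

def extract_verbs_and_roles(caption: str):
    """Return a list of (verb, {role: phrase}) tuples parsed from one caption."""
    output = predictor.predict(sentence=caption)
    parses = []
    for frame in output["verbs"]:
        roles = {}
        # Collapse BIO tags (e.g. B-ARG0, I-ARG0) into role -> phrase strings.
        for word, tag in zip(output["words"], frame["tags"]):
            if tag != "O":
                role = tag.split("-", 1)[1]
                roles.setdefault(role, []).append(word)
        parses.append((frame["verb"], {r: " ".join(ws) for r, ws in roles.items()}))
    return parses
```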
Diversity of Events. To assess the diversity of events represented in the dataset, we consider cumulative distributions of verbs and nouns (see Fig. 2-a,b); as a fair comparison to datasets which do not have senses associated with verbs, we collapse verb senses into a single unit for this analysis. For any point n on the horizontal axis, the curves show the number of verbs or nouns with at least n annotations. VidSitu not only offers greater diversity in verbs and nouns compared to other datasets, but a large number of verbs and nouns also occur sufficiently frequently to enable learning useful representations. For instance, 224 verbs and 336 nouns have at least 100 annotations. In general, since movies inherently intend to engage viewers, movie datasets such as MPII and VidSitu are more diverse than open-domain datasets like ActivityNet-Captions and VATEX-en.
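The cumulative curves described above are straightforward to compute from annotation counts; the helper below is a small sketch (names are illustrative) that counts how many labels have at least n annotations for each n.

```python
from collections import Counter
from typing import Iterable, List

def cumulative_label_curve(labels: Iterable[str], max_n: int) -> List[int]:
    """curve[n-1] = number of distinct labels (verbs or nouns) that have
    at least n annotations, matching the x-axis convention of Fig. 2-a,b."""
    counts = Counter(labels)
    return [sum(1 for c in counts.values() if c >= n) for n in range(1, max_n + 1)]

# Toy usage: three labels with 3, 1, and 2 annotations respectively.
print(cumulative_label_curve(["run", "run", "run", "jump", "throw", "throw"], 3))
# -> [3, 2, 1]
```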
Complexity of Situations. We refer to a situation as complex if it consists of inter-related events with multiple entities fulfilling different roles across those events. To evaluate complexity, Figs. 2-c,d compare the number of unique verbs and entities per video across datasets. Approximately 80% of videos in VidSitu have at least 4 unique verbs and 70% have 6 or more unique entities, in comparison to 20% and 30%, respectively, for VATEX-en. Further, Fig. 2-e shows that 90% of events in VidSitu have at least 4 semantic roles, in comparison to only 55% in VATEX-en. Thus, situations in VidSitu are considerably more complex than those in existing datasets.
Richness of Annotations. While existing video description datasets only have unstructured text descriptions, VidSitu is annotated with rich structured representations of events that include verbs, semantic role labels, entity coreferences, and event relations. Such rich annotations not only allow for a more thorough evaluation of video analysis techniques but also enable researchers to study relatively unexplored problems in video understanding, such as entity coreference and relational understanding of events in videos. Fig. 2-f shows the fraction of entity coreference chains of various lengths.
5. Baselines
For a given video, VidSRL requires predicting verbs and semantic roles for each event as well as event relations. We provide powerful baselines to serve as a point of comparison for these crucial capabilities. These models leverage architectures from state-of-the-art video recognition models.
Verb Prediction. Given a 2-sec clip, we require a model to predict the verb corresponding to the most salient event in the clip. As baselines, we provide state-of-the-art action recognition models such as I3D [8] and SlowFast [16] networks (Step 1 in Fig. 3). We consider variants of I3D both with and without Non-Local blocks [76], and for SlowFast networks, we consider variants with and without the Fast channel. For each architecture, we train a model from scratch as well as a model fine-tuned after pre-training on Kinetics [31]. All models are trained with a cross-entropy loss over the set of action labels. For subsequent stages, these verb classification models are frozen and used as feature extractors.
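A minimal training step for such a verb classifier is sketched below, assuming a generic pretrained video backbone (e.g. an I3D or SlowFast trunk from a video library) wrapped behind a linear classification head; the backbone, feature dimension, and number of verb classes are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

NUM_VERB_CLASSES = 1560  # placeholder; set to the size of the verb vocabulary used

class VerbClassifier(nn.Module):
    """Pretrained video backbone + linear head, trained with cross-entropy."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int = NUM_VERB_CLASSES):
        super().__init__()
        self.backbone = backbone          # returns clip-level features of shape (batch, feat_dim)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(clips)
        return self.head(feats)

def train_step(model: VerbClassifier, clips: torch.Tensor,
               verb_labels: torch.Tensor, optimizer: torch.optim.Optimizer) -> float:
    """One optimization step over a batch of 2-second clips."""
    logits = model(clips)
    loss = nn.functional.cross_entropy(logits, verb_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After training, the classifier can be frozen and reused as a feature extractor:
# for p in model.parameters(): p.requires_grad_(False)
```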