Abstract
We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study the challenging task of semantic role labeling in videos, or VidSRL, we introduce the VidSitu benchmark, a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds. Entities are co-referenced across events within a movie clip and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) as well as diverse (∼200 verbs have more than 100 annotations each). We provide a comprehensive analysis of the dataset in comparison to other publicly available video understanding benchmarks, present several illustrative baselines, and evaluate a range of standard video recognition models. Our code and dataset are available at vidsitu.org.
Figure 1 (caption): Following nomenclature introduced in ImSitu [83], every verb ("deflect") has a set of roles (Arg0: deflector, Arg1: thing deflected) which are realized by noun values; here, "value" refers to the free-form text used to describe the roles ("woman with shield", "boulder"). VidSRL entails producing a verb for the salient activity within each 2-second interval, predicting the entities that fulfill various roles related to that event, and relating these events across time.
1. Introduction
Videos record events in our lives with both short and long temporal horizons. These recordings frequently relate multiple events separated geographically and temporally and capture a wide variety of situations involving human beings interacting with other humans, objects and their environment. Extracting such rich and complex information from videos can drive numerous downstream applications such as describing videos [35, 82, 77], answering queries about them [85, 81], retrieving visual content [50], building knowledge graphs [48], and even teaching embodied agents to act and interact with the real world [84].
Parsing video content is an active area of research, with much of the focus centered around tasks such as action classification [31], localization [24], and spatio-temporal detection [21]. Although parsing human actions is a critical component of understanding videos, actions by themselves paint an incomplete picture, missing critical pieces such as the agent performing the action, the object being acted upon, the tool or instrument used to perform the action, the location where the action is performed, and more. Expository tasks such as video captioning and story-telling provide a more holistic understanding of the visual content; but akin to their counterparts in the image domain, they lack a clear definition of the type of information being extracted, making them notoriously hard to evaluate [32, 74].
Recent work in the image domain [83, 58, 22] has attempted to move beyond action classification via the task of visual semantic role labeling: producing not just the primary activity in an image or region, but also the entities participating in that activity via different roles. Building upon this line of research, we propose VidSRL, the task of recognizing spatio-temporal situations in video content. As illustrated in Figure 1, VidSRL involves recognizing and temporally localizing salient events across the video, identifying participating actors, objects, and locations involved within these events, co-referencing these entities across events over the duration of the video, and relating how events affect each other over time. We posit that VidSRL, a considerably more detailed and involved task than action classification, with more precise definitions of the extracted information than video captioning, is a step towards obtaining a holistic understanding of complex videos.
To study VidSRL, we present VidSitu, a large video understanding dataset of over 29K videos drawn from a diverse set of 3K movies. Videos in VidSitu are exactly 10 seconds long and are annotated with 5 verbs, corresponding to the most salient event taking place within each of the five 2-second intervals in the video. Each verb annotation is accompanied by a set of roles whose values are annotated using free-form text. In contrast to verb annotations, which are drawn from a fixed vocabulary, the free-form role annotations allow the use of referring expressions (e.g. boy wearing a blue jacket) to disambiguate entities in the video. An entity that occurs in any of the five clips within a video is consistently referred to using the same expression, allowing us to develop and evaluate models with co-referencing capability. Finally, the dataset also contains event-relation annotations capturing causation (Event Y is Caused By/Reaction To Event X) and contingency (Event X is a pre-condition for Event Y). The key highlights of VidSitu include: (i) Diverse Situations: VidSitu enjoys a large vocabulary of verbs (1500 unique verbs curated from PropBank [54], with 200 verbs having at least 100 event annotations) and entities (5600 unique nouns, with 350 nouns occurring in at least 100 videos); (ii) Complex Situations: each video is annotated with 5 inter-related events and has an average of 4.2 unique verbs and 6.5 unique entities; and (iii) Rich Annotations: VidSitu provides structured event representations (3.8 roles per event) with entity co-referencing and event-relation labels.
To facilitate further research on VidSRL, we provide a comprehensive benchmark that supports partwise evaluation of the various capabilities required for solving VidSRL, and we create baselines for each capability using state-of-the-art architectural components to serve as a point of reference for future work. We also carefully choose metrics that provide a meaningful signal of progress towards achieving competency on each capability. Finally, we perform a human-agreement analysis that reveals significant room for improvement on the VidSitu benchmark.
Our main contributions are: (i) the VidSRL task formalism for understanding complex situations in videos; (ii) curating the richly annotated VidSitu dataset, which consists of diverse and complex situations for studying VidSRL; and (iii) establishing an evaluation methodology for assessing crucial capabilities needed for VidSRL, along with baselines for each using state-of-the-art components. The dataset and code are publicly available at vidsitu.org.
2. Related Work
Video Understanding, a fundamental goal of computer vision, is an incredibly active area of research involving a wide variety of tasks such as action classification [8, 16, 75], localization [44, 43], spatio-temporal detection [19], video description [77, 35], question answering [85], and object grounding [61]. Tasks like detecting atomic actions at 1-second intervals [19, 79, 67] are short-horizon tasks, whereas ones like summarizing 180-second-long videos [91] are extremely long-horizon tasks. In contrast, our proposed task of VidSRL operates on 10-second videos at 2-second intervals.
In support of these tasks, the community has also proposed numerous datasets [31, 24, 21] over the past few years. While early datasets were small, with several hundred or a few thousand examples [65, 36], recent datasets are massive [50], enabling researchers to train large neural models and also employ pre-training strategies [49, 92, 40]. Section 4, Table 3, and Figure 2 provide a comparison of our proposed dataset to several relevant datasets in the field. Due to space constraints, we are unable to provide a thorough description of all relevant work. Instead, we point the reader to relevant surveys on video understanding [1, 34, 87] and present a holistic overview of tasks and datasets in Table 1.
Visual Semantic Role Labeling has been primarily explored in the image domain under situation recognition [83, 58], visual semantic role labeling [22, 41, 64], and human-object interaction [10, 9]. Compared to images, visual semantic role labeling in videos requires not just recognizing actions and arguments at a single time step but also aggregating information about interacting entities across frames and co-referencing the entities participating across events.
Movies for Video Understanding: The movie domain serves as a rich data source for tasks such as spatio-temporal detection [21], movie description [60], movie question answering [70], story-based retrieval [3], generating social graphs [72], and classifying shot style [28]. In contrast to much of this prior work, we focus only on the visual activity of the various actors and objects in the scene, i.e. no additional modalities such as movie scripts, subtitles, or audio are provided in our dataset.
3. VidSRL: The Task
State-of-the-art video analysis capabilities like video activity recognition and object detection yield a fairly impoverished understanding of videos by reducing complex events involving interactions of multiple actors, objects, and locations to a bag of activity and object labels. While video captioning promises rich descriptions of videos, the open-ended task definition of captioning lends itself poorly to a systematic representation of such events and evaluation thereof. The motivation behind VidSRL is to expand the video analysis toolbox with vision models that produce richer yet structured representations of complex events in videos than currently possible through video activity recognition, object detection, or captioning.
Formal task definition. Given a video $V$, VidSRL requires a model to predict a set of related salient events $\{E_i\}_{i=1}^{k}$ constituting a situation. Each event $E_i$ consists of a verb $v_i$ chosen from a set of verbs $\mathcal{V}$ and values (entities, location, or other details pertaining to the event, described in text) assigned to the various roles relevant to the verb. We denote the roles or arguments of a verb $v$ as $\{A^{v}_{j}\}_{j=1}^{m}$, and $A^{v}_{j} \leftarrow a$ implies that the $j$-th role of verb $v$ is assigned the value $a$. In Fig. 1 for instance, event $E_1$ consists of verb $v$ = "deflect (block, avoid)" with Arg0 (deflector) $\leftarrow$ "woman with shield". The roles for the verbs are obtained from PropBank [54] (see the PropBank annotation guidelines at http://clear.colorado.edu/compsem/documents/propbank_guidelines.pdf). Finally, we denote the relationship between any two events $E$ and $E'$ by $l(E, E') \in \mathcal{L}$, where $\mathcal{L}$ is an event-relations label set. We now discuss simplifying assumptions and trade-offs in designing the task.
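To make the task output concrete, the sketch below shows one possible in-memory representation of a predicted situation: each event carries a verb sense and a role-to-value mapping, and a situation groups events with labeled pairwise relations. The class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical container for a single VidSRL event: a verb sense from the
# fixed vocabulary plus free-form text values for its PropBank-style roles.
@dataclass
class Event:
    verb: str                                             # e.g. "deflect (block, avoid)"
    roles: Dict[str, str] = field(default_factory=dict)   # e.g. {"Arg0": "woman with shield"}

# A situation groups the five 2-second events of a 10-second clip together
# with labeled relations between ordered event pairs (caused by, enabled by, ...).
@dataclass
class Situation:
    events: List[Event]
    relations: Dict[Tuple[int, int], str] = field(default_factory=dict)

# Toy example mirroring Event 1 of Fig. 1 (role assignments are illustrative).
e1 = Event(verb="deflect (block, avoid)",
           roles={"Arg0": "woman with shield", "Arg1": "boulder"})
situation = Situation(events=[e1], relations={})
```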
Timescale of Salient Events. What constitutes a salient event in a video is often ambiguous and subjective. For instance, given the 10-sec clip in Fig. 1, one could define fine-grained events around atomic actions such as "turning" (Event 2, third frame) or take a more holistic view of the sequence as involving a "fight". This ambiguity, due to the lack of constraints on timescales of events, makes annotation and evaluation challenging. We resolve it by restricting the choice of salient events to one event per fixed time interval. Previous work on recognizing atomic actions [21] relied upon 1-sec intervals. An appropriate choice of time interval for annotating events is one that enables rich descriptions of complex videos while avoiding incidental atomic actions. We observed qualitatively that a 2-sec interval strikes a good balance between obtaining descriptive events and the objectiveness needed for a systematic evaluation. Therefore, for each 10-sec clip, we annotate 5 events $\{E_i\}_{i=1}^{5}$. Appendix B.1 elaborates on this choice.
Describing an Event. We describe an event through a verb and its arguments. For verbs, we follow recent work in action recognition like ActivityNet [24] and Moments in Time [51] that chooses a verb label for each video segment from a curated list of verbs. To allow for the description of a wide variety of events, we select a large vocabulary of 2.2K visual verbs from PropBank [54]. Verbs in PropBank are diverse, distinguish between homonyms using verb senses (e.g. "strike (hit)" vs "strike (a pose)"), and provide a set of roles for each verb. We allow values of arguments for the verb to be free-form text. This allows disambiguation between different entities in the scene using referring expressions such as "man with trident" or "shirtless man" (Fig. 1). Understanding a video may require consolidating partial information across multiple views or shots. In VidSRL, while the 2-sec clip is sufficient to assign the verb, roles may require information from the whole video, since some entities involved in the event may be occluded or lie outside the camera view for those 2 secs but are visible before or after.
Co-Referencing Entities Across Events. Within a video, an entity may be involved in more than one event; for instance, "woman with shield" is involved in Events 1, 2, and 5 and "man with trident" is involved in Events 2, 3, and 4. In such cases, we expect VidSRL models to understand co-referencing, i.e. a model must be able to recognize that the entity participating across those events is the same even though the entity may be playing different roles in those events. Ideally, evaluating co-referencing capability requires grounding entities in the video (e.g. using bounding boxes). Since grounding entities in videos is an expensive process, we currently require the phrases referring to the same entity across multiple events within each 10-sec clip to match exactly for coreference assessment. See supp. for details on how coreference is enforced in our annotation pipeline.
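Since coreference is assessed via exact string match of role values within a clip, coreference chains can be recovered by grouping identical phrases across the five events. The helper below is a minimal sketch of that grouping; it takes one role-to-value dictionary per event, a hypothetical representation rather than the dataset's released format, and the toy role assignments are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def coreference_chains(event_roles: List[Dict[str, str]]) -> Dict[str, List[Tuple[int, str]]]:
    """Map each entity phrase to the (event index, role) slots it fills,
    keeping only phrases that recur in two or more events (exact match)."""
    slots = defaultdict(list)
    for i, roles in enumerate(event_roles):
        for role, value in roles.items():
            slots[value].append((i, role))
    return {phrase: occ for phrase, occ in slots.items()
            if len({idx for idx, _ in occ}) > 1}

# Toy usage mirroring Fig. 1: "woman with shield" recurs in Events 1, 2, and 5,
# "man with trident" in Events 2, 3, and 4 (role labels here are made up).
chains = coreference_chains([
    {"Arg0": "woman with shield", "Arg1": "boulder"},            # Event 1
    {"Arg0": "man with trident", "Arg1": "woman with shield"},   # Event 2
    {"Arg0": "man with trident"},                                # Event 3
    {"Arg0": "man with trident"},                                # Event 4
    {"Arg0": "woman with shield"},                               # Event 5
])
```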
Event Relations. Understanding a video requires not only recognizing individual events but also how events affect one another. Since event relations in videos are not yet well explored, we propose a taxonomy of event relations as a first step, inspired by prior work on a schema for event relations in natural language [26] that includes "Causation" and "Contingency". In particular, if Event B follows (occurs after) Event A, we have the following relations: (i) Event B is caused by Event A (Event B is a direct result of Event A); (ii) Event B is enabled by Event A (Event A does not cause Event B, but Event B would not occur in the absence of Event A); (iii) Event B is a reaction to Event A (Event B is a response to Event A); and (iv) Event B is unrelated to Event A (examples are provided in the supplementary).
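The four relation labels above form a small closed set, so they can be represented directly as an enumeration; the enum name and member spellings below are illustrative assumptions rather than the dataset's released label strings.

```python
from enum import Enum

class EventRelation(Enum):
    """Relation of a later Event B to an earlier Event A (member names are illustrative)."""
    CAUSED_BY = "B is caused by A"        # B is a direct result of A
    ENABLED_BY = "B is enabled by A"      # A does not cause B, but B would not occur without A
    REACTION_TO = "B is a reaction to A"  # B is a response to A
    UNRELATED = "B is unrelated to A"

# Example: annotate the relation of Event 2 (B) to Event 1 (A) in a clip.
relations = {(1, 2): EventRelation.REACTION_TO}
```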
4. VidSitu Dataset
To study VidSRL, we introduce the VidSitu dataset that offers videos with diverse and complex situations (a collection of related events) and rich annotations with verbs, semantic roles, entity co-references, and event relations. We describe our dataset curation decisions (Section 4.1) followed by analysis of the dataset (Section 4.2).
4.1. Dataset Curation
We briefly describe the main steps in the data curation process and provide more information in Appendix B.
Video Source Selection. Videos from movies are well suited for VidSRL since they are naturally diverse (covering a wide range of movie genres) and often involve multiple interacting entities. Also, scenarios in movies typically play out over multiple shots, which makes movies a challenging testbed for video understanding. We use videos from Condensed-Movies [3], which collates videos from MovieClips, a licensed YouTube channel containing engaging movie scenes.
Video Selection. Within the roughly 1000 hours of MovieClips videos, we select 30K diverse and interesting 10-sec videos to annotate, while avoiding visually uneventful segments common in movies, such as actors merely engaged in dialogue. This selection is performed using a combination of human detection, object detection, and atomic action prediction, followed by sampling no more than 3 videos per movie clip after discarding inappropriate content.
Curating Verb Senses. We begin with the entire PropBank [54] vocabulary of ∼6K verb senses. We manually remove fine-grained and non-visual verb senses and further discard verbs that do not appear in the MPII Movie Description (MP2D) dataset [60] (verbs extracted using a semantic-role parser [62]). This gives us a set of 2154 verb senses.
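The curation step above amounts to filtering the PropBank sense inventory against verbs observed in MP2D descriptions; the helper below is a rough sketch of that filter under the assumption that verb senses are keyed as "lemma.NN" strings. All names are illustrative, and the actual curation also involved manual judgments of visualness and granularity.

```python
from typing import Dict, Set

def filter_verb_senses(propbank_senses: Set[str],
                       nonvisual_senses: Set[str],
                       mp2d_verb_counts: Dict[str, int]) -> Set[str]:
    """Drop manually flagged non-visual / fine-grained senses, then keep only
    senses whose lemma occurs at least once in MP2D descriptions."""
    kept = set()
    for sense in propbank_senses - nonvisual_senses:
        lemma = sense.split(".")[0]            # e.g. "deflect.01" -> "deflect"
        if mp2d_verb_counts.get(lemma, 0) > 0:
            kept.add(sense)
    return kept
```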
Curating Argument Roles. We wish to establish a set of argument roles for each verb-sense. We initialize the argument list for each verb-sense using Arg0, Arg1, Arg2 arguments provided by PropBank and then expand this using frequently used (automatically extracted) arguments present in descriptions provided by the MP2D dataset.
Annotations. Annotations for the verbs, roles, and relations are obtained via Amazon Mechanical Turk (AMT). The annotation interface enables efficient annotation while encouraging rich descriptions of entities and enabling reuse of entity descriptions throughout the video (to preserve co-referencing). See Appendix B.2 for details.
Dataset splits. VidSitu is split into train, validation, and test sets via an 80:5:15 split, ensuring that videos from the same movie end up in exactly one of those sets (a minimal sketch of such a movie-disjoint split appears at the end of this subsection). Table 2 summarizes the statistics of these splits.
Table 3 (caption): Dataset statistics across video description datasets. We highlight key differences from previous datasets such as explicit SRL, co-reference, and event-relation annotations, and greater diversity and density of verbs, entities, and semantic roles. For a fair comparison, for all datasets we use a single description per video segment when more than one is available.
Multiple Annotations for Evaluation Sets. Via controlled trials (see Sec 6.1), we measured the annotation disagreement rate for the train set. Based on this data, we obtain multiple annotations for the validation and test sets using a 2-stage annotation process. In the first stage, we collect 10 verbs for each 2-second clip (1 verb per worker). In the second stage, we get role labels for the verb with the highest agreement from 3 different workers.
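As noted above, clips from the same movie must never cross split boundaries; the sketch below shows one way to realize such a movie-disjoint 80:5:15 split. Function and variable names are illustrative, and the split released with VidSitu should be used for comparability.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

def movie_disjoint_split(clip_to_movie: Dict[str, str],
                         ratios: Tuple[float, float, float] = (0.80, 0.05, 0.15),
                         seed: int = 0) -> Dict[str, List[str]]:
    """Assign whole movies (not individual clips) to train/val/test so that
    clips from the same movie never appear in more than one split."""
    movies = sorted(set(clip_to_movie.values()))
    random.Random(seed).shuffle(movies)
    n = len(movies)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    split_of_movie = {}
    for i, movie in enumerate(movies):
        split_of_movie[movie] = ("train" if i < n_train
                                 else "val" if i < n_train + n_val
                                 else "test")
    splits: Dict[str, List[str]] = defaultdict(list)
    for clip, movie in clip_to_movie.items():
        splits[split_of_movie[movie]].append(clip)
    return splits
```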
4.2. Dataset Analysis And Statistics
We present an extensive analysis of VidSitu, focusing on three key elements: (i) the diversity of events represented in the dataset; (ii) the complexity of the situations; and (iii) the richness of annotations. We provide comparisons to four prominent video datasets containing text descriptions: MSR-VTT [82], MPII Movie Description [60], ActivityNet-Captions [35], and VATEX-en [77] (the subset of descriptions in English). Table 3 summarizes basic statistics for all datasets. For consistency, we use one description per video segment whenever multiple annotations are available, as is the case for VATEX-en, MSR-VTT, the validation set of ActivityNet-Captions, and both the validation and test sets of VidSitu. For datasets without explicit verb or semantic role labels, we extract these using a semantic role parser [62].
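For reference, verb and role extraction from plain-text captions can be done with an off-the-shelf SRL model; the snippet below is a sketch using AllenNLP's predictor interface, with the model archive path left as a placeholder assumption (the paper uses the parser of [62], which may differ from this exact setup).

```python
# Requires: pip install allennlp allennlp-models
from allennlp.predictors.predictor import Predictor

# Placeholder path: substitute a released AllenNLP SRL model archive.
SRL_MODEL_PATH = "path/to/structured-prediction-srl-model.tar.gz"

predictor = Predictor.from_path(SRL_MODEL_PATH)

def extract_verbs_and_roles(caption: str):
    """Return a list of (verb, {role: phrase}) tuples parsed from one caption."""
    output = predictor.predict(sentence=caption)
    parses = []
    for frame in output["verbs"]:
        roles = {}
        # Collapse BIO tags (e.g. B-ARG0, I-ARG0) into role -> phrase strings.
        for word, tag in zip(output["words"], frame["tags"]):
            if tag != "O":
                role = tag.split("-", 1)[1]
                roles.setdefault(role, []).append(word)
        parses.append((frame["verb"], {r: " ".join(ws) for r, ws in roles.items()}))
    return parses
```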
Diversity of Events. To assess the diversity of events represented in the dataset, we consider cumulative distributions of verbs and nouns (see Fig. 2-a,b); as a fair comparison to datasets which do not have senses associated with verbs, we collapse verb senses into a single unit for this analysis. For any point n on the horizontal axis, the curves show the number of verbs or nouns with at least n annotations. VidSitu not only offers greater diversity in verbs and nouns compared to other datasets, but a large number of verbs and nouns also occur sufficiently frequently to enable learning useful representations. For instance, 224 verbs and 336 nouns have at least 100 annotations. In general, since movies inherently intend to engage viewers, movie datasets such as MPII and VidSitu are more diverse than open-domain datasets like ActivityNet-Captions and VATEX-en.
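The cumulative curves described above are straightforward to compute from annotation counts; the helper below is a small sketch (names are illustrative) that counts how many labels have at least n annotations for each n.

```python
from collections import Counter
from typing import Iterable, List

def cumulative_label_curve(labels: Iterable[str], max_n: int) -> List[int]:
    """curve[n-1] = number of distinct labels (verbs or nouns) that have
    at least n annotations, matching the x-axis convention of Fig. 2-a,b."""
    counts = Counter(labels)
    return [sum(1 for c in counts.values() if c >= n) for n in range(1, max_n + 1)]

# Toy usage: three labels with 3, 1, and 2 annotations respectively.
print(cumulative_label_curve(["run", "run", "run", "jump", "throw", "throw"], 3))
# -> [3, 2, 1]
```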
Complexity of Situations. We refer to a situation as complex if it consists of inter-related events with multiple entities fulfilling different roles across those events. To evaluate complexity, Figs. 2-c,d compare the number of unique verbs and entities per video across datasets. Approximately 80% of videos in VidSitu have at least 4 unique verbs and 70% have 6 or more unique entities, in comparison to 20% and 30%, respectively, for VATEX-en. Further, Fig. 2-e shows that 90% of events in VidSitu have at least 4 semantic roles, in comparison to only 55% in VATEX-en. Thus, situations in VidSitu are considerably more complex than those in existing datasets.
Richness of Annotations. While existing video description datasets only have unstructured text descriptions, VidSitu is annotated with rich structured representations of events that include verbs, semantic role labels, entity coreferences, and event relations. Such rich annotations not only allow for a more thorough evaluation of video analysis techniques but also enable researchers to study relatively unexplored problems in video understanding, such as entity coreference and relational understanding of events in videos. Fig. 2-f shows the fraction of entity coreference chains of various lengths.
5. Baselines
For a given video, VidSRL requires predicting verbs and semantic roles for each event as well as event relations. We provide powerful baselines to serve as a point of comparison for these crucial capabilities. These models leverage architectures from state-of-the-art video recognition models.
Verb Prediction. Given a 2-sec clip, we require a model to predict the verb corresponding to the most salient event in the clip. As baselines, we provide state-of-the-art action recognition models such as I3D [8] and SlowFast [16] networks (Step 1 in Fig. 3). We consider variants of I3D both with and without Non-Local blocks [76], and for SlowFast networks, we consider variants with and without the Fast channel. For each architecture, we train a model from scratch as well as a model fine-tuned after pre-training on Kinetics [31]. All models are trained with a cross-entropy loss over the set of action labels. For subsequent stages, these verb classification models are frozen and used as feature extractors.
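A minimal training step for such a verb classifier is sketched below, assuming a generic pretrained video backbone (e.g. an I3D or SlowFast trunk from a video library) wrapped behind a linear classification head; the backbone, feature dimension, and number of verb classes are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

NUM_VERB_CLASSES = 1560  # placeholder; set to the size of the verb vocabulary used

class VerbClassifier(nn.Module):
    """Pretrained video backbone + linear head, trained with cross-entropy."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int = NUM_VERB_CLASSES):
        super().__init__()
        self.backbone = backbone          # returns clip-level features of shape (batch, feat_dim)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(clips)
        return self.head(feats)

def train_step(model: VerbClassifier, clips: torch.Tensor,
               verb_labels: torch.Tensor, optimizer: torch.optim.Optimizer) -> float:
    """One optimization step over a batch of 2-second clips."""
    logits = model(clips)
    loss = nn.functional.cross_entropy(logits, verb_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After training, the classifier can be frozen and reused as a feature extractor:
# for p in model.parameters(): p.requires_grad_(False)
```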