Abstract
Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions. CRAFT explicitly predicts a temporal layout of mentioned entities (characters and objects), retrieves spatio-temporal entity segments from a video database and fuses them to generate scene videos. Our contributions include sequential training of components of CRAFT while jointly modeling layout and appearances, and losses that encourage learning compositional representations for retrieval. We evaluate CRAFT on semantic fidelity to caption, composition consistency, and visual quality. CRAFT outperforms direct pixel generation approaches and generalizes well to unseen captions and to unseen video databases with no text annotations. We demonstrate CRAFT on FLINTSTONES, a new richly annotated video-caption dataset with over 25000 videos. For a glimpse of videos generated by CRAFT, see the project page: https://prior.allenai.org/projects/craft
1 Introduction
Consider the scene description: Fred is wearing a blue hat and talking to Wilma in the living room. Wilma then sits down on a couch. Picturing the scene in our mind requires the knowledge of plausible locations, appearances, actions, and interactions of characters and objects being described, as well as an ability to understand and translate the natural language description into a plausible visual instantiation. In this work, we introduce Semantic Scene Generation (SSG), the task of generating complex scene videos from rich natural language descriptions which requires jointly modeling the layout and appearances of entities mentioned in the description. SSG models are trained using a densely annotated video dataset with scene descriptions and entity bounding boxes. During inference, the models must generate videos for novel descriptions (unseen during training).
Modelling the layout and appearances of entities for descriptions like the one above poses several challenges: (a) Entity Recall: the video must contain the relevant characters (Fred, Wilma), objects (blue hat, couch) and background (a setting that resembles a living room); (b) Layout Feasibility: characters and objects must be placed at plausible locations and scales (Fred, Wilma and the couch should be placed on the ground plane, the hat must lie on top of Fred's head); (c) Appearance Fidelity: entity appearance, which may be affected by identity, pose, action, attributes and layout, should respect the scene description; (d) Interaction Consistency: the appearance of characters and objects must be consistent with each other given the described, sometimes implicit, interaction (Fred and Wilma should face each other, as people do when they talk to each other); (e) Language Understanding: the system must be able to understand and translate a natural language description into a plausible visual instantiation.
Currently, the dominant approaches to conditional generation of visual data from text rely on directly learning distributions in a high dimensional pixel space. While these approaches have shown impressive results for aligned images of objects (faces, birds, flowers, etc.), they are often inadequate for addressing the above challenges, due to the combinatorial explosion of the image space arising from multiple characters and objects with significant appearance variations arranged in a large number of possible layouts. In contrast, our proposed Composition, Retrieval and Fusion Network (Craft) explicitly models the spatio-temporal layout of characters and objects in the scene jointly with entity appearances. Unlike pixel generation approaches, our appearance model is based on text to entity segment retrieval from a video database. Spatio-temporal segments are extracted from the retrieved videos and fused together to generate the final video. The layout composition and entity retrieval work in a sequential manner which is determined by the language input. Factorization of our model into composition and retrieval stages alleviates the need to directly model pixel spaces, results in an architecture that exploits location and appearance contextual cues, and renders an interpretable output.
Towards the goal of SSG, we introduce Flintstones, a densely annotated dataset based on The Flintstones animated series, consisting of over 25000 videos, each 75 frames long. Flintstones has several advantages over using a random sample of internet videos. First, in a closed world setting such as a television series, the most frequent characters are present in a wide variety of settings, which serves as a more manageable learning problem than a sparse set obtained in an open world setting. Second, the flat textures in animations are easier to model than real world videos. Third, in comparison to other animated series, The Flintstones has a good balance between having fairly complex interactions between characters and objects while not having overly complicated, cluttered scenes. For these reasons, we believe that the Flintstones dataset is semantically rich, preserves all the challenges of text to scene generation and is a good stepping stone towards real videos. Flintstones consists of an 80-10-10 train-val-test split. The train and val sets are used for learning and model selection respectively. Test captions serve as novel descriptions to generate videos at test time. To quantitatively evaluate our model, we use two sets of metrics. The first measures semantic fidelity of the generated video to the desired description using entity noun, adjective, and verb recalls. The second measures composition consistency, i.e. the consistency of the appearances, poses and layouts of entities with respect to other entities in the video and the background.
We use Flintstones to evaluate Craft and provide a detailed ablation analysis. Craft outperforms baselines that generate pixels directly from captions as well as a whole video retrieval approach (as opposed to modeling entities). It generalizes well to unseen captions as well as unseen videos in the target database. Our quantitative and qualitative results show that for simpler descriptions, Craft exploits location and appearance contextual cues and outputs videos that have consistent layouts and appearances of described entities. However, there is tremendous scope for improvement. Craft can fail catastrophically for complex descriptions (containing a large number of entities, especially infrequent ones). The adjective and verb recalls are also fairly low. We believe SSG on Flintstones presents a challenging problem for future research.
2 Related Work
Generative models Following pioneering work on Variational Autoencoders [15] and Generative Adversarial Networks [9] , there has been tremendous interest in generative modelling of visual data in a high dimensional pixel space. Early approaches focused on unconditional generation [2, 4, 10, 24] , whereas recent works have explored models conditioned on simple textual inputs describing objects [20, 26, 27, 33, 34] . While the visual quality of images generated by these models has been steadily improving [14, 22] , success stories have been limited to generating images of aligned objects (e.g. faces, birds, flowers), often training one model per object class. In contrast, our work deals with generating complex scenes which requires modelling the layout and appearances of multiple entities in the scene.
Of particular relevance is the work by Hong et al. [12] who first generate a coarse semantic layout of bounding boxes, refine that to segmentation masks and then generate an image using an image-to-image translation model [6, 13]. A limitation of this approach is that it assumes a fixed number of object classes (80 in their experiments) and struggles with the usual challenges of modeling high dimensional pixel spaces, such as generating coherent entities. Formulating appearance generation in terms of entity retrieval from a database allows our model to scale to a large number of entity categories, guarantees intra-entity coherence, and lets us focus on the semantic aspects of scene generation and inter-entity consistency. The retrieval approach also lends itself to generating videos without significant modification. There have been attempts at extending GANs for unconditional [30, 31] as well as text conditional [18, 21] video generation, but the quality of generated videos is usually worse than that of GAN generated images unless used in very restrictive settings. A relevant generative modelling approach is by Kwak et al. [17] who proposed a model in which parts of the image are generated sequentially and combined using alpha blending. However, this work does not condition on text and has not been demonstrated on complex scenes. Another relevant body of work is by Zitnick et al. [35, 36, 37] who compose static images from descriptions with clipart images using a Conditional Random Field formulation.
To control the structure of the output image, a growing body of literature conditions image generation on a wide variety of inputs ranging from keypoints [25] and sketches [19] to semantic segmentation maps [13]. In contrast to these approaches, which condition on a provided location, our model generates a plausible scene layout and then conditions entity retrieval on this layout.

Phrase Grounding and Caption-Image Retrieval. The entity retriever in Craft is related to caption based image retrieval models. The caption-image embedding space is typically learned by minimizing a ranking loss such as a triplet loss [7, 8, 16, 32]. Phrase grounding [23] is another closely related task where the goal is to localize a region in an image described by a phrase.
One of our contributions is enriching the semantics of embeddings learned through a triplet loss by simultaneously minimizing an auxiliary classification loss based on the noun, adjective and verb words associated with an entity in the text description. This is similar in principle to [29], where auxiliary autoencoding losses were used in addition to a primary binary prediction loss to learn robust visual semantic embeddings. Learning shared representations across multiple related tasks is a key concept in multitask learning [5, 11].

3 Method

[Figure 2: Overview of Craft. For a caption such as "Fred is talking to Wilma in a kitchen.", the Layout Composer, Entity Retriever, and Background Retriever are applied sequentially to compose the output video.]

Figure 2 presents an overview of the Composition, Retrieval and Fusion Network, which consists of three parts: Layout Composer, Entity Retriever, and Background Retriever. Each is a neural network that is trained independently using ground truth supervision. During inference, Craft begins with an empty video and adds entities to the scene sequentially, based on their order of appearance in the description. At each step, the Layout Composer predicts a location and scale for an entity given the text and the video constructed so far. Then, conditioned on the predicted location, the text, and the partially constructed video, the Entity Retriever produces a query embedding that is looked up against the embeddings of entities in the target video database. The entity is cropped from the retrieved video and placed at the predicted location and scale in the video being generated. Alternating between the Layout Composer and Entity Retriever allows the model to condition the layout of entities on their appearance and vice versa. Similar to the Entity Retriever, the Background Retriever produces a query embedding for the desired scene from the text and retrieves the closest background video from the target database. The retrieved spatio-temporal entity segments and background are fused to generate the final video. We now present the notation used in the rest of the paper, followed by architecture and training details for the three components.
Notation.
Caption:
- T: caption of length |T|
- {E_i}_{i=1}^n: the n entities in T, in order of appearance
- {e_i}_{i=1}^n: entity noun positions in T
Video:
- F: number of frames in a video
- {(l_i, s_i)}_{i=1}^n: positions of entities in the video
- l_i: entity bounding box in each frame, {(x_{if}, y_{if}, w_{if}, h_{if})}_{f=1}^F
- s_i: entity pixel segmentation mask in each frame
- V_{i-1}: partially constructed video containing entities {E_j}_{j=1}^{i-1}
- V (= V_n): full video containing all entities
- {(V^{[m]}, T^{[m]})}_{m=1}^M: training data points, where M is the number of data points
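To make the notation concrete, the following is a minimal Python sketch of how a single annotated data point (V^[m], T^[m]) might be represented in code; the class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class EntityAnnotation:
    noun_position: int        # e_i: index of the entity noun in the tokenized caption T
    boxes: List[Tuple[float, float, float, float]]  # l_i: per-frame (x, y, w, h), f = 1..F
    masks: List[np.ndarray]   # s_i: per-frame binary segmentation mask

@dataclass
class AnnotatedClip:
    caption: str                      # T, with |T| tokens
    entities: List[EntityAnnotation]  # {E_i}_{i=1..n}, in order of appearance in T
    num_frames: int = 75              # F (3-second clips)
```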
3.1 Layout Composer
The layout composer is responsible for generating a plausible layout of the scene, consisting of the locations and scales of each character and object mentioned in the scene description. Jointly modeling the locations of all entities in a scene presents fundamentally unique challenges for spatial knowledge representation. Figure 3 presents a schematic of the layout composer. Given the varying number of entities across videos, the layout composer is set up to run in a sequential manner over the set of distinct entities mentioned in a given description. At each step, a text embedding of the desired entity along with the partially constructed video (consisting of entities fused into the video at previous steps) are input to the model, which predicts distributions for the location and scale of the desired entity.
The layout composer models P(l_i | V_{i-1}, T, e_i; θ_loc, θ_sc), the conditional distribution of the location and scale (width and height normalized by image size) of the i-th entity given the text, the entity noun position in the tokenized text, and the partial video containing the previous entities. Let C_i denote the conditioning information (V_{i-1}, T, e_i). We factorize the position distribution into location and scale components as follows:
$$P(l_i \mid C_i; \theta_{loc}, \theta_{sc}) = \prod_{f=1}^{F} P^{f}_{loc}\big(x_{if}, y_{if} \mid C_i; \theta^{f}_{loc}\big) \cdot P^{f}_{sc}\big(w_{if}, h_{if} \mid x_{if}, y_{if}, C_i; \theta^{f}_{sc}\big) \qquad (1)$$
θ_loc = {θ^f_loc}_{f=1}^F and θ_sc = {θ^f_sc}_{f=1}^F are learnable parameters. P^f_loc is modelled using a network that takes C_i as input and produces a distribution over all pixel locations for the f-th image frame. We model P^f_sc using a Gaussian distribution whose mean μ_f and covariance Σ_f are predicted by a network given (x_i, y_i, C_i). Parameters θ_loc and θ_sc are learned from ground truth position annotations by minimizing the following maximum likelihood estimation loss:
$$-\sum_{m=1}^{M} \sum_{i=1}^{n^{[m]}} \sum_{f=1}^{F} \Big[ \log P^{f}_{loc}\big(x^{[m]}_{if}, y^{[m]}_{if} \mid C^{[m]}_{i}; \theta^{f}_{loc}\big) + \log \mathcal{N}\big(z^{[m]}_{if};\, \mu_{f}(D^{[m]}_{i}),\, \Sigma_{f}(D^{[m]}_{i})\big) \Big] \qquad (2)$$

where z_{if} = [w_{if}; h_{if}] and D^{[m]}_i = (x^{[m]}_i, y^{[m]}_i, C^{[m]}_i
). For simplicity, we manually set and freeze Σ to an isotropic diagonal covariance matrix with a variance of 0.005.

Feature Computation Backbone. The location and scale predictors have an identical feature computation backbone comprising a CNN and a bidirectional LSTM. The CNN encodes V_{i-1} (8 sub-sampled frames concatenated along the channel dimension) as a set of convolutional feature maps which capture the appearance and positions of previous entities in the scene. The LSTM is used to encode the entity E_i for which the prediction is to be made, along with the semantic context available in the caption. The caption is fed into the LSTM and the hidden output at the e_i-th word position is extracted as the entity text encoding. The text encoding is replicated spatially and concatenated with the convolutional features and a 2-D grid of coordinates to create a representation for each location in the convolutional feature grid that is aware of visual, spatial, temporal, and semantic context.

Location Predictor. P^f_loc is modelled using a Multi-Layer Perceptron (MLP) that produces a score for each location in the feature grid. This score map is bilinearly upsampled to the size of the input video frames. Then, a softmax over all pixels produces P^f_loc(x, y | C_i; θ^f_loc) for every pixel location (x, y) in the f-th video frame.

Scale Predictor. Features computed by the backbone at a particular (x, y) location are selected and fed into the scale MLP, which produces μ_f(x_i, y_i, C_i; θ^f_sc).

Feature sharing and multitask training. While it is possible to train a separate network for each {P^f_loc, μ_f}_{f=1}^F, we present a pragmatic way of sharing features and computation across frames and also between the location and scale networks. To share features and computation across frames, the location network produces F probability maps in a single forward pass. This is equivalent to sharing all layers across all the P^f_loc networks except for the last layer of the MLP that produces the location scores. Similarly, all the μ_f networks are combined into a single network. We refer to the combined networks as P_loc and μ.
In addition, we also share features across the location and scale networks. First, we share the feature computation backbone, whose output is then passed into location-specific and scale-specific layers. Second, we use a soft-attention mechanism to select likely positions for feeding into the scale layers. This conditions the scale prediction on plausible locations of the entity. We combine the F spatial probability maps into a single attention map through max pooling. This attention map is used to perform weighted average pooling on the backbone features, which are then fed into the scale MLP. Note that this is a differentiable, greedy approximation to finding the most likely location (by taking the argmax of the spatial probability maps) and scale (directly using the output of μ, the mode of a Gaussian distribution) in a single forward pass. To keep training consistent with inference, we use the soft-attention mechanism instead of feeding ground-truth locations into μ.
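As an illustration of the architecture described above, here is a minimal PyTorch sketch of the shared backbone with the location and scale heads and the soft-attention pooling. Layer widths, the dilation pattern, the sigmoid on the scale mean, and the omission of the bilinear upsampling step are simplifying assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutComposer(nn.Module):
    """Sketch: per-frame location distributions and a scale mean for one entity."""

    def __init__(self, vocab_size, num_frames=8, text_dim=128, feat_dim=128):
        super().__init__()
        self.num_frames = num_frames
        # Backbone CNN over the partial video (num_frames RGB frames stacked on channels).
        self.cnn = nn.Sequential(
            nn.Conv2d(3 * num_frames, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=2, dilation=2), nn.ReLU(),  # dilated convs
            nn.Conv2d(feat_dim, feat_dim, 3, padding=4, dilation=4), nn.ReLU(),
        )
        # Text encoder: bidirectional LSTM over the caption.
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.lstm = nn.LSTM(text_dim, text_dim, batch_first=True, bidirectional=True)
        fused_dim = feat_dim + 2 * text_dim + 2  # conv + text + (x, y) coordinate grid
        self.loc_head = nn.Conv2d(fused_dim, num_frames, 1)   # F location score maps
        self.scale_head = nn.Sequential(                      # mean of (w, h) for each frame
            nn.Linear(fused_dim, 256), nn.ReLU(), nn.Linear(256, 2 * num_frames))

    def forward(self, partial_video, caption_ids, noun_pos):
        # partial_video: (B, 3*num_frames, H, W); caption_ids: (B, |T|); noun_pos: (B,)
        B = partial_video.size(0)
        feats = self.cnn(partial_video)                        # (B, C, h, w)
        h, w = feats.shape[2:]
        words, _ = self.lstm(self.embed(caption_ids))          # (B, |T|, 2*text_dim)
        ent = words[torch.arange(B), noun_pos]                 # entity text encoding at e_i
        ent = ent[:, :, None, None].expand(-1, -1, h, w)       # replicate spatially
        ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
        grid = torch.stack([xs, ys]).expand(B, -1, -1, -1).to(feats.device)
        fused = torch.cat([feats, ent, grid], dim=1)

        # Location: softmax over all pixels per frame (the paper additionally upsamples
        # these maps bilinearly to frame resolution before the softmax).
        loc_scores = self.loc_head(fused)                      # (B, F, h, w)
        loc_probs = F.softmax(loc_scores.flatten(2), dim=-1).view(B, self.num_frames, h, w)

        # Soft attention: max over frames, then weighted average pooling of backbone features.
        attn = loc_probs.max(dim=1).values
        attn = attn / attn.sum(dim=(1, 2), keepdim=True)
        pooled = (fused * attn.unsqueeze(1)).sum(dim=(2, 3))   # (B, fused_dim)
        scale_mu = torch.sigmoid(self.scale_head(pooled)).view(B, self.num_frames, 2)
        return loc_probs, scale_mu
```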
3.2 Entity Retriever

[Figure 4: The Entity Retriever retrieves spatio-temporal patches from a target database that match the entity description as encoded by the query embedding network.]
The task of the entity retriever is to find a spatio-temporal patch within a target database that matches an entity in the description and is consistent with the video constructed thus far: the video with all previous entities retrieved and placed at the locations predicted by the layout network. We adopt an embedding based lookup approach for entity retrieval. This presents several challenges beyond traditional image retrieval tasks. Not only does the retrieved entity need to match the semantics of the description, but it also needs to respect the implicit relational constraints, or context, imposed by the appearance and locations of other entities. For example, for Fred is talking to Wilma, it is not sufficient to retrieve a Wilma; we need one who is also facing in the right direction, i.e. towards Fred.

The Entity Retriever is shown in Figure 4 and consists of two parts: (i) a query embedding network Q, and (ii) a target embedding network R. Q and R are learned using the query-target pairs {((T^{[m]}, e^{[m]}_i, l^{[m]}_i, V^{[m]}_{i-1}), (V^{[m]}, l^{[m]}_i, s^{[m]}_i))}_{i,m} in the training data. For clarity, we abbreviate Q(T^{[m]}, e^{[m]}_i, l^{[m]}_i, V^{[m]}_{i-1}) as q^{[m]}_i and R(V^{[m]}, l^{[m]}_i, s^{[m]}_i) as r^{[m]}_i. At each training iteration, we sample a mini-batch of B pairs without replacement and compute embeddings {(q^{[m_b]}_{i_b}, r^{[m_b]}_{i_b})}_{b=1}^{B},
where each of q and r is a sequence of F embeddings corresponding to the F video frames. The model is trained using a triplet loss computed over all possible triplets in the mini-batch. Let δ_b denote the set of all indices from 1 to B except b. The loss can then be defined as
$$\mathcal{L} = \sum_{b=1}^{B}\sum_{b' \in \delta_b}\Big[\max\big(0,\; \gamma - q^{[m_b]}_{i_b} \odot r^{[m_b]}_{i_b} + q^{[m_b]}_{i_b} \odot r^{[m_{b'}]}_{i_{b'}}\big) + \max\big(0,\; \gamma - q^{[m_b]}_{i_b} \odot r^{[m_b]}_{i_b} + q^{[m_{b'}]}_{i_{b'}} \odot r^{[m_b]}_{i_b}\big)\Big] \qquad (3)$$

where $q \odot r = \frac{1}{F}\sum_{f=1}^{F} q[f] \cdot r[f]$
is the average dot product between corresponding query and target frame embeddings. We use a margin of γ = 0.1.

Auxiliary Multi-label Classification Loss. We found that models trained using the triplet loss alone could simply learn a one-to-one mapping between ground truth text and entity video segments, with poor generalization to unseen captions and database videos. To guide the learning to exploit the compositional nature of text and improve generalization, we add an auxiliary classification loss on top of the embeddings. The key idea is to enrich the semantics of the embedding vectors by predicting the noun, adjective, and action words directly associated with the entity in the description. For example, Wilma's embedding produced by the query and target embedding networks for Fred is talking to a happy Wilma who is sitting on a chair is forced to predict Wilma, happy, and sitting, ensuring their representation in the embeddings. A vocabulary W is constructed of all nouns, adjectives and verbs appearing in the training data. Then, for each sample in the mini-batch, an MLP is used as a multi-label classifier to predict the associated words from the query and target embeddings. Note that a single MLP is used to make these noun, adjective and verb predictions on both query and target embeddings.

Query Embedding Network (Q). Similar to the layout composer's feature computation backbone, Q consists of a CNN to independently encode every frame of V_{i-1} and an LSTM to encode (T, e_i), which are concatenated together along with a 2-D coordinate grid to get per-frame feature maps. However, unlike the layout composer, the query embedding network also needs to be conditioned on the position l_i where entity E_i is to be inserted into V_{i-1}. To get location- and scale-specific query embeddings, we use a simplified RoIAlign mechanism (RoIPool with the RoI quantization replaced by bilinear interpolation) to crop out the per-frame feature maps using the corresponding bounding box l^f_i and scale them to a 7 × 7 grid. The RoIAlign features are then averaged along the spatial dimensions to get a vector representation for each time step independently. An LSTM applied over the sequence of these embeddings is used to capture temporal context. The hidden output of the LSTM at each time step is normalized and used as the frame query embedding q[f].

Target Embedding Network (R). Since, during inference, R needs to embed entities in the target database which do not have text annotations, it does not use T as an input. Thus, R is similar to Q but without the LSTM used to encode the text. In our experiments we found that using 2-D coordinate features in both the query and target networks made the network susceptible to ignoring all other features, since they provide an easy signal for matching ground truth query-target pairs during training. This in turn leads to poor generalization. Thus, R has no 2-D coordinate features.
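To summarize the training objective of the entity retriever, the sketch below computes the in-batch triplet loss of Eq. (3) together with the auxiliary multi-label word classification loss. The mean-pooling over frames before the classifier, the use of binary cross-entropy, and the equal weighting of the loss terms are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def retrieval_losses(q, r, word_targets, word_mlp, margin=0.1):
    """Sketch of the entity retriever losses.

    q, r: (B, F, D) query / target frame embeddings for B matched pairs.
    word_targets: (B, W) multi-hot float labels (nouns/adjectives/verbs) per entity.
    word_mlp: shared multi-label classifier applied to query and target embeddings.
    """
    B = q.size(0)
    # Similarity q (.) r averaged over frames, for every query-target pair in the batch.
    sim = torch.einsum('bfd,cfd->bc', q, r) / q.size(1)          # (B, B)
    pos = sim.diag()                                             # matched pairs
    off_diag = ~torch.eye(B, dtype=torch.bool, device=q.device)
    # Hinge over all in-batch negatives, in both retrieval directions.
    neg_targets = F.relu(margin - pos[:, None] + sim)[off_diag]  # wrong targets for a query
    neg_queries = F.relu(margin - pos[None, :] + sim)[off_diag]  # wrong queries for a target
    triplet = (neg_targets.sum() + neg_queries.sum()) / (B * (B - 1))

    # Auxiliary multi-label classification on frame-averaged query and target embeddings.
    logits_q = word_mlp(q.mean(dim=1))                           # (B, W)
    logits_r = word_mlp(r.mean(dim=1))
    aux = F.binary_cross_entropy_with_logits(logits_q, word_targets) + \
          F.binary_cross_entropy_with_logits(logits_r, word_targets)
    return triplet + aux
```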
3.3 Background Retriever
The task of the background retriever is to find a background scene that matches the setting described in the description. To construct a database of backgrounds without characters in them, we remove characters from videos (given bounding boxes) and perform hole filling using PatchMatch [3]. The background retriever model is similar to the entity retriever, with two main differences. First, since the whole background scene is retrieved instead of entity segments, the conditioning on position is removed from both the query and database embedding networks, replacing RoI pooling with global average pooling. Second, while ideally we would like scene and entity retrieval to be conditioned on each other, for simplicity we leave this to future work and currently treat them independently. These modifications essentially reduce the query embedding network to a text Bi-LSTM whose output at the background word location in the description is used as the query embedding, and the target embedding network to a video Bi-LSTM without RoI pooling. The model is trained using just the triplet loss.
4 The Flintstones Dataset
Composition. The Flintstones dataset is composed of 25184 densely annotated video clips derived from the animated sitcom The Flintstones. Clips are chosen to be 3 seconds (75 frames) long to capture relatively small action sequences, limit the number of sentences needed to describe them, and avoid scene and shot changes. Clip annotations contain the clip's characters, setting, and objects being interacted with, marked in the text as well as with bounding boxes in all frames. Flintstones has an 80-10-10 train-val-test split (see footnote 5).

Clip Annotation. Dense annotations are obtained in a multi-step process: identification and localization of characters in keyframes, identification of the scene setting, scene captioning, object annotation, and entity tracking to provide annotations for all frames. The dataset also contains segmentation masks for characters and objects. First, a rough segmentation mask is produced using SLIC [1] followed by hierarchical merging. This mask is then used to initialize GrabCut [28], which further refines the segmentation. The dataset also contains a clean background for each clip: foreground characters and objects are excised, and the resulting holes are filled using PatchMatch [3].
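A rough sketch of the mask refinement and background filling steps using off-the-shelf implementations (scikit-image SLIC, OpenCV GrabCut, and OpenCV inpainting as a stand-in for PatchMatch); the superpixel-to-box heuristic and all hyperparameters are assumptions, and the hierarchical merging step is omitted.

```python
import cv2
import numpy as np
from skimage.segmentation import slic

def rough_mask_from_superpixels(frame_bgr, entity_box):
    """Very coarse initial mask: superpixels whose centers fall inside the entity box."""
    segments = slic(frame_bgr[:, :, ::-1], n_segments=200, compactness=10)  # expects RGB
    x, y, w, h = entity_box
    mask = np.zeros(segments.shape, np.uint8)
    for label in np.unique(segments):
        ys, xs = np.nonzero(segments == label)
        if x <= xs.mean() <= x + w and y <= ys.mean() <= y + h:
            mask[segments == label] = 1
    return mask

def refine_entity_mask(frame_bgr, rough_mask):
    """Refine a rough character/object mask with GrabCut, initialized from the mask."""
    gc_mask = np.where(rough_mask > 0, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame_bgr, gc_mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)
    return np.isin(gc_mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)

def fill_background(frame_bgr, union_fg_mask):
    """Stand-in for PatchMatch hole filling: inpaint the excised foreground region."""
    return cv2.inpaint(frame_bgr, union_fg_mask.astype(np.uint8), 3, cv2.INPAINT_TELEA)
```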
5 Experiments

5.1 Layout Composer Evaluation
Training. We use the Adam optimizer (learning rate = 0.001, decay factor = 0.5 per epoch, weight decay = 0.0001) and a batch size of 32.

Metrics. We evaluate the layout composer using two metrics: (a) the negative log-likelihood (NLL) of the ground truth (GT) entity positions under the predicted distribution, and (b) the average normalized pixel distance (coordinates normalized by image height and width) of the ground truth location from the most likely predicted entity location. While NLL captures both location and scale, pixel distance only measures location accuracy. We report metrics on unseen test descriptions using ground truth locations and appearances for the previous entities in the partial video.

Feature Ablation. The ablation study in Table 1 shows that the layout composer benefits from each of its three input features: text, scene context (partial video), and the 2-D coordinate grid. The significant degradation in NLL without text features indicates the importance of entity identity, especially in predicting scale. The lack of spatial awareness in the convolutional feature maps without the 2-D coordinate grid causes the pixel distance to approximately double. The performance drop on removing scene context is indicative of the relevance of knowing which entities are where in the scene when predicting the location of the next entity. Finally, replacing vanilla convolutions with dilated convolutions, which increases the spatial receptive field without increasing the number of parameters, improves performance, corroborating the usefulness of scene context in layout prediction.
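For concreteness, the two metrics can be computed roughly as below, assuming the predicted location distribution is given as a per-frame probability map over pixels and the scale distribution is a Gaussian with the fixed diagonal covariance from Section 3.1; the per-frame averaging is an assumption.

```python
import numpy as np

def layout_metrics(loc_probs, scale_mu, gt_xy, gt_wh, var=0.005):
    """loc_probs: (F, H, W) per-frame pixel distributions; scale_mu: (F, 2) predicted (w, h) means.
    gt_xy: (F, 2) ground-truth pixel locations; gt_wh: (F, 2) normalized ground-truth (w, h)."""
    F, H, W = loc_probs.shape
    nll, pix_dist = 0.0, 0.0
    for f in range(F):
        x, y = gt_xy[f]
        # Negative log-likelihood of the GT location and of the (Gaussian) GT scale.
        nll -= np.log(loc_probs[f, int(y), int(x)] + 1e-12)
        nll += 0.5 * np.sum((gt_wh[f] - scale_mu[f]) ** 2) / var + np.log(2 * np.pi * var)
        # Normalized distance from the GT location to the mode of the location distribution.
        py, px = np.unravel_index(np.argmax(loc_probs[f]), (H, W))
        pix_dist += np.hypot((px - x) / W, (py - y) / H)
    return nll / F, pix_dist / F
```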
5.2 Entity Retriever Evaluation
Training. We use the Adam optimizer (learning rate=0.001, decay factor=0.5 every 10 epochs) and a batch size of 30.
Metrics. To evaluate the semantic fidelity of retrieved entities to the query caption, we measure noun, adjective, and verb recalls (@1 and @10) averaged across entities in the test set. The captions are automatically parsed to identify the nouns, adjectives and verbs associated with each entity, both in the query captions and in the target database (using GT database captions for evaluation only). Note that captions often contain limited adjective and verb information. For example, a red hat in the video may only be referred to as a hat in the caption, and Fred standing and talking may be described as Fred is talking. We also do not take synonyms (talking-speaking) and hypernyms (person-woman) into account. Thus the proposed metrics underestimate the performance of the entity retriever.
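A sketch of how such recall metrics can be computed with spaCy part-of-speech tags (assuming the en_core_web_sm model is installed); the exact word-to-entity association and the lemmatization used in the original evaluation are assumptions here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def content_words(text):
    """Lemmatized nouns, adjectives, and verbs in a caption."""
    doc = nlp(text)
    return {"NOUN": {t.lemma_ for t in doc if t.pos_ in ("NOUN", "PROPN")},
            "ADJ":  {t.lemma_ for t in doc if t.pos_ == "ADJ"},
            "VERB": {t.lemma_ for t in doc if t.pos_ == "VERB"}}

def recall_at_k(query_caption, retrieved_captions, k, pos):
    """Fraction of query words of the given POS covered by the top-k retrieved entities'
    ground-truth captions (no synonym or hypernym matching, as in the paper)."""
    wanted = content_words(query_caption)[pos]
    if not wanted:
        return None
    found = set()
    for cap in retrieved_captions[:k]:
        found |= content_words(cap)[pos]
    return len(wanted & found) / len(wanted)
```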
Feature Ablation. Table 2 shows that text and location features are critical to noun, adjective, and verb recall (see footnote 6). Scene context only marginally affects noun recall but causes a significant drop in adjective and verb recalls.

Effect of Auxiliary Loss. Table 3 shows that the triplet loss alone does significantly worse than in combination with the auxiliary classification loss. Adding the auxiliary classification loss on either the query or the target embeddings alone improves over triplet-only training but is worse than using all three losses. Interestingly, using both auxiliary losses outperforms the triplet loss with a single auxiliary loss (and triplet-only training) on adjective and verb recall. This strongly suggests the benefits of multi-task training in entity retrieval.

Background Retriever. Similar to the entity recall evaluation, we computed a top-1 background recall of 57.5 for Craft.

Generalization to unseen videos. A key advantage of the embedding based text-to-entity video retrieval approach over text-only methods is that it can use unseen video databases without any text annotations, potentially in entirely new domains (e.g. learning from synthetic video-caption datasets and applying the knowledge to generate real videos). However, this requires a model that generalizes well to unseen captions as well as unseen videos.
In Table 4 we compare entity recall when using the train set (seen) videos as the target database vs. using the test set (unseen) videos as the target database.

OHEM vs. All Mini-Batch Triplets. We experimented with online hard example mining (OHEM), where only the negative samples that most violate the triplet constraints are used in the loss. Using all triplets achieved similar or higher top-1 noun, adjective and verb recall than OHEM when querying against seen videos (1.8, 75.3, 8.5% relative gain) and unseen videos (1.7, 42.8, −5.0% relative gain). This indicates that novel captions often do not find a match in the target database with all entities and their attributes present in the same video; however, it is more likely that each entity and attribute combination appears in some video in the database. Note that text-to-text matching also prevents extension to unseen video databases without text annotations.
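The difference between the two mining strategies comes down to which in-batch negatives enter the hinge. A minimal sketch over a query-target similarity matrix (one retrieval direction only, for brevity):

```python
import torch
import torch.nn.functional as F

def triplet_from_similarities(sim, margin=0.1, ohem=False):
    """sim: (B, B) query-target similarity matrix; diagonal entries are the matched pairs."""
    B = sim.size(0)
    pos = sim.diag()[:, None]                                   # (B, 1) positive similarities
    hinge = F.relu(margin - pos + sim)                          # hinge for every candidate negative
    hinge = hinge.masked_fill(torch.eye(B, dtype=torch.bool, device=sim.device), 0.0)
    if ohem:
        return hinge.max(dim=1).values.mean()                   # hardest negative per query
    return hinge.sum() / (B * (B - 1))                          # all in-batch negatives
```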
5.3 Human Evaluation
Metrics. In addition to the automated recall metrics, which capture the semantic fidelity of the generated videos to the captions, we run a human evaluation study to estimate the compositional consistency of entities in the scene (given the description) and the overall visual quality (independent of the description). The consistency metric requires humans to rate each entity in the video on a 0-4 scale on three aspects: (a) position in the scene, (b) size relative to other entities or the background, and (c) appearance and consistency of the described interactions with other entities in the scene. The visual quality metric measures the aesthetics and realism of the generated scenes on a 0-4 scale along three axes: (a) foreground quality, (b) background quality, and (c) sharpness. See the supplementary material for the design of these experiments.

Modelling Pixels vs. Retrieval. We experimented extensively with text conditioned whole-video generation using models with and without adversarial losses and obtained poor results. Since generative models tend to work better on images with single entities, we swapped out the target embedding network in the entity retriever for a generator. Given the query embedding at each of the F time steps, the generator produces an appearance image and a segmentation mask. The model is trained using an L1 loss between the masked appearance image and the masked ground truth image, and an L1 loss between the generated and ground truth masks. See the supplementary material for more details. This baseline produced blurry results, with recognizable colors and shapes for the most common characters like Fred, Wilma, Barney, and Betty at best. We also tried GAN and VAE based approaches and obtained only slightly less blurry results. Table 5 shows that this model performs poorly on the visual quality metric compared to Craft. Moreover, since the visual quality of the previously generated entities affects the performance of the layout composer, this also translates into poor ratings on the composition consistency metric. Since the semantic fidelity metrics cannot be computed for this pixel generation approach, we ran a human evaluation to compare this model to ours. Humans were asked to mark the nouns, adjectives and verbs in the sentence that were missing in the generated video. Craft significantly outperformed the pixel generation approach on noun, adjective, and verb recall (Craft: 61.0, 54.5, 67.8; L1: 37.8, 45.9, 48.1).
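For reference, the reconstruction objective of this pixel-generation baseline can be sketched as follows; masking with the ground truth mask and the equal weighting of the two terms are assumptions.

```python
import torch.nn.functional as F

def generation_baseline_loss(pred_rgb, pred_mask, gt_rgb, gt_mask):
    """pred_rgb/gt_rgb: (B, F, 3, H, W); pred_mask/gt_mask: (B, F, 1, H, W) in [0, 1]."""
    appearance = F.l1_loss(pred_rgb * gt_mask, gt_rgb * gt_mask)  # L1 on masked appearance
    mask = F.l1_loss(pred_mask, gt_mask)                          # L1 on the segmentation mask
    return appearance + mask
```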
Joint vs. Independent Modelling of Layout. We compare Craft to a model that uses the same entity retriever but with ground truth (GT) positions. Using GT positions performed worse than Craft on noun, adjective, and verb Recall@1 (GT: 62.2, 18.1, 12.4; Full: 62.3, 21.7, 16.0). This is also reflected in the composition consistency metric (GT: 1.69, 1.69, 1.34; Full: 1.78, 1.89, 1.46). This emphasizes the need to model layout composition and entity retrieval jointly. When using GT layouts, the retrieval gets conditioned on the layout but not vice versa.
Footnote 5: See https://prior.allenai.org/projects/craft for more details on the dataset split, annotation visualization, and dataset statistics.
Footnote 6: For context, most-frequent-entity prediction baselines are provided on our project page.