Newtonian Image Understanding: Unfolding the Dynamics of Objects in Static Images
In this paper, we study the challenging problem of predicting the dynamics of objects in static images. Given a query object in an image, our goal is to provide a physical understanding of the object in terms of the forces acting upon it and its long-term motion in response to those forces. Direct and explicit estimation of the forces and the motion of objects from a single image is extremely challenging. We define intermediate physical abstractions called Newtonian scenarios and introduce the Newtonian Neural Network (N3), which learns to map a single image to a state in a Newtonian scenario. Our evaluations show that our method can reliably predict the dynamics of a query object from a single image. In addition, our approach can provide physical reasoning that supports the predicted dynamics in terms of velocity and force vectors. To spur research in this direction, we compiled the Visual Newtonian Dynamics (VIND) dataset, which includes more than 6000 videos aligned with Newtonian scenarios represented using game engines, and more than 4500 still images with their ground truth dynamics.
A key capability in human perception is the ability to proactively predict what happens next in a scene. Humans reliably use these predictions for planning their actions, making everyday decisions, and even correcting visual interpretations. Examples include the predictions involved in passing a busy street, catching a frisbee, or hitting a tennis ball with a racket. Performing these tasks requires a rich understanding of the dynamics of objects moving in a scene. For example, hitting a tennis ball with a racket requires knowing the dynamics of the ball: when it hits the ground, how it bounces back from the ground, and what form of motion it follows.
Rich physical understanding in human perception even allows prediction of dynamics from only a single image. Most people, for example, can reliably predict the dynamics of the volleyball shown in Figure 1. Theories in perception and cognition attribute this capability, among many explanations, to previous experience and the existence of an underlying physical abstraction.

Figure 1. Given a static image, our goal is to infer the dynamics of a query object (the forces that are acting upon the object and the expected motion of the object in response to those forces). In this paper, we show an algorithm that learns to map an image to a state in a physical abstraction called a Newtonian scenario. Our method provides a rich physical understanding of an object in an image that allows prediction of the long-term motion of the object and reasoning about the direction of the net force and velocity vectors.
In this paper, we address the problem of physical understanding of objects in images in terms of the forces acting upon them and their long-term motions in response to those forces. Our goal is to unfold the dynamics of objects in still images. Figure 1 shows an example of a long-term motion predicted by our approach, along with the physical reasoning that supports the predicted dynamics.
The motion of objects and its relation to various physical quantities (mass, friction, external forces, geometry, etc.) has been extensively studied in Mechanics. In schools, classical mechanics is taught using basic Newtonian scenarios that explain a large number of simple motions in the real world: inclined surfaces, falling, swinging, external forces, projectiles, etc. To infer the dynamics of an object, students need to figure out the Newtonian scenario that explains the situation, find the physical quantities that contribute to the motion, and then plug them into the corresponding equations that relate the contributing physical quantities to the motion. Estimating physical quantities from an image is an extremely challenging problem. For example, the computer vision literature does not provide a reliable solution for direct estimation of mass, friction, or the angle of an inclined plane from an image. Instead of directly estimating physical quantities from images, we formulate the problem of physical understanding as a mapping from an image to a physical abstraction. We follow the same principles of classical Mechanics and use Newtonian scenarios as our physical abstraction. These scenarios are depicted in Figure 2. We choose to learn this mapping in the visual space and thus render the Newtonian scenarios using game engines.
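As a concrete textbook illustration (standard mechanics, not an equation from this paper): for a block sliding down an inclined plane of angle θ with friction coefficient μ, the acceleration along the plane is

```latex
a = g(\sin\theta - \mu\cos\theta)
```

Predicting the motion therefore requires knowing θ and μ, and it is precisely such quantities that are hard to estimate reliably from a single image.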
Mapping a single image to a state in a Newtonian scenario allows us to borrow the rich Newtonian interpretation offered by game engines. This enables predicting the long-term motion of the object along with rich physical reasoning that supports the predicted motion in terms of velocity and force vectors. Learning such a mapping requires reasoning about subtle visual and contextual cues and common knowledge of motion. For example, to predict the expected motion of the ball in Figure 1, one needs to rely on previous experience, visual cues (the subtle hand posture of the player at the net, the line of sight of other players, their pose, the scene configuration), and knowledge about how objects move in a volleyball scene. To perform this mapping, we adopt a data-driven approach and introduce the Newtonian Neural Network (N3), which learns the complex interplay between visual cues and the motions of objects.
To facilitate research in this challenging direction, we compiled the VIsual Newtonian Dynamics (VIND) dataset, which contains 6806 videos with corresponding game engine videos for training, and 4516 still images with ground truth motions for testing.
Our experimental evaluations show promising results in Newtonian understanding of objects in images and enable prediction of long-term motions of objects backed by abstract Newtonian explanations of the predicted dynamics. This allows us to unfold the dynamics of moving objects in static images. Our experimental evaluations also show the benefits of using an intermediate physical abstraction compared to competitive baselines that make direct predictions of the motion.
2. Related Work
Cognitive studies: Recent studies in computational cognitive science show that humans approximate the principles of Newtonian dynamics and simulate the future states of the world using these principles [14, 5] . Our use of Newtonian scenarios as an intermediate representation is inspired by these studies.
Motion prediction: The problem of predicting future movements and trajectories has been tackled from different perspectives. Data-driven approaches have been proposed in [38, 25] to predict the motion field in a single image. Other work infers the future trajectories of people, or the most likely path for objects. In contrast, our method focuses on the physics of the motion and estimates a 3D long-term motion for objects. Recent methods address the prediction of optical flow in static images [28, 35]. Flow does not carry semantics and represents very short-term motion in 2D, whereas our method can infer long-term 3D motion using force and velocity information. Physics-based human motion modeling was studied by [8, 6, 7, 32], who employed human movement dynamics to predict the future pose of humans. In contrast, we estimate the dynamics of objects.
Scene understanding: Reasoning about the stability of a scene has been addressed by work that uses physical constraints to reason about the stability of objects modeled as 3D volumes. Our work is different in that we reason about the dynamics of both stable and moving objects. Another approach computes the probability that an object falls based on inferring disturbances caused naturally or by human actions. In contrast, we do not explicitly encode physics equations; we rely on images and direct perception. The early work of Mann et al. studies the perception of scene dynamics to interpret image sequences. Their method, unlike ours, requires a complete geometric specification of the scene. A rich set of experiments on sliding motion in lab settings has been performed to estimate object mass and friction coefficients. Our method is not limited to sliding and works on a wide range of physical scenarios in various types of scenes.

Figure 3. Viewpoint annotation. We ask the annotators to choose the game engine video (among 8 different views of the Newtonian scenario) that best describes the view of the object in the image. The object in the game engine video is shown in red, and its direction of movement is shown in yellow. The video with a green border is the selected viewpoint. These videos correspond to Newtonian scenario (1).
Human object interaction: Prediction of human actions based on object interactions has been studied before, as has prediction of human behavior based on functional objects in a scene. The relative motion of objects in a scene has also been inferred. Our work is related to this line of thought in terms of predicting future events from still images, but our objective is quite different: we do not predict the next action; we aim to understand the underlying physics that justifies future motions in still images.
Tracking: Note that our approach is quite different from tracking [17, 11, 10], since tracking methods are not designed for single-image reasoning. Some tracking work incorporates simulation to properly model human motion and prevent physically impossible hypotheses during tracking.
3. Problem Statement & Overview
Given a static image, our goal is to reason about the expected long-term motion of a query object in 3D. To this end, we use an intermediate physical abstraction called Newtonian scenarios (Figure 2), rendered by a game engine. We learn a mapping from a single image to a state in a Newtonian scenario with our proposed Newtonian Neural Network (N3). A state in a Newtonian scenario corresponds to a specific moment in the video generated by the game engine and includes a set of rich physical quantities (force, velocity, 3D motion) for that moment. Mapping to a state in a Newtonian scenario allows us to borrow the corresponding physical quantities and use them to make predictions about the long-term motion of the query object in a single image.
Mapping from a single image to a state in a Newtonian scenario involves solving two problems: (a) figuring out which Newtonian scenario best explains the dynamics of the image; (b) finding the correct moment in the scenario that matches the state of the object in motion. There are strong contextual and visual cues that can help solve the first problem. The second problem, however, involves reasoning about subtle visual cues and is hard even for human annotators. For example, to predict the expected motion and the current state of the ball in Figure 1, one needs to reason from previous experience, visual cues, and knowledge about the motion of the object. N3 adopts a data-driven approach that uses visual cues and abstract knowledge of motion to learn (a) and (b) at the same time. To encode the visual cues, N3 uses a 2D Convolutional Neural Network (CNN) to represent the image. To learn about motions, N3 uses 3D CNNs to represent the game engine videos of Newtonian scenarios. Through joint embedding, N3 learns to map visual cues to exact states in Newtonian scenarios.
4. VIND Dataset
We collect the VIsual Newtonian Dynamics (VIND) dataset, which contains game engine videos, natural videos, and static images corresponding to the Newtonian scenarios. The Newtonian scenarios that we consider are inspired by the way Mechanics is taught in school and cover commonly seen simple motions of objects (Figure 2). A few factors distinguish these scenarios from each other: (a) the path of the object, e.g., scenario (3) describes a projectile motion, while scenario (4) describes a linear motion; (b) whether the applied force is continuous, e.g., in scenario (8) the external force is applied continuously, while in scenario (4) the force is applied only at the beginning; (c) whether the object is in contact with a support surface, e.g., this factor distinguishes scenario (10) from scenario (4).

Newtonian Scenarios: Representing a Newtonian scenario by a natural video is not ideal due to the noise caused by camera motion, object clutter, irrelevant visual nuisances, etc. To abstract away the Newtonian dynamics from the noise and clutter of the real world, we construct the Newtonian scenarios (shown in Figure 2) using a game engine. A game engine takes a scene configuration as input (e.g., a ball above the ground plane) and simulates it forward in time according to the laws of motion in physics. For each Newtonian scenario, we render its corresponding game engine scenario from different viewpoints. In total, we obtain 66 game engine videos. For each game engine video, we store its depth map, surface normals, and optical flow information in addition to the RGB image; in total, each frame in a game engine video has 10 channels.

Images and Videos: We also collect a dataset of natural videos and images depicting moving objects. The current datasets for action or object recognition are not suitable for our task: they either show complicated movements that go beyond classical dynamics (e.g., head massage or make up in UCF-101, HMDB-51) or they show no motion (most images in PASCAL or COCO).

Annotations: We provide three types of annotations for each image/frame: (1) bounding box annotations for objects that are described by at least one of our Newtonian scenarios; (2) viewpoint information, i.e., which viewpoint of the game engine videos best describes the direction of movement in the image/video; (3) state annotations. By state, we mean how far the object has moved along the expected scenario (e.g., is it at the beginning of the projectile motion, or at the peak point?). More details about the collection of the dataset and the annotation procedure can be found in Section 6. Example game engine videos corresponding to Newtonian scenario (1) are shown in Figure 3.

Figure 4. The motion row of N3, whose architecture is similar to C3D, processes the video inputs from the game engine. The dimensions of the convolutional outputs in this row are Channels, Frames, Height, Width. The filters in the motion row are convolved across Frames, Width, and Height. The image row and the motion row meet at a cosine similarity layer that measures the similarity between the input image and each frame of the game engine videos. The maximum of these similarities within each Newtonian scenario is used as the confidence score for that scenario describing the motion of the object in the input image.
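The role of the game engine can be illustrated with a minimal sketch (illustrative constants and function names, not the engine used in the paper): given a scene configuration, it integrates the laws of motion forward in time and records one state per rendered frame, here for the projectile scenario:

```python
# Minimal sketch of forward simulation for a projectile scenario.
# All constants are illustrative; the paper's game engine is not specified here.

def simulate_projectile(pos, vel, g=-9.8, dt=1.0 / 30, n_frames=10):
    """Forward-Euler integration of a point mass under gravity.

    pos, vel : (x, y) tuples; gravity acts along y.
    Returns the list of positions, one per rendered frame.
    """
    frames = []
    x, y = pos
    vx, vy = vel
    for _ in range(n_frames):
        frames.append((x, y))
        x += vx * dt              # constant horizontal velocity
        y += vy * dt
        vy += g * dt              # only the vertical velocity changes
    return frames

path = simulate_projectile(pos=(0.0, 1.5), vel=(2.0, 3.0))
```

Each returned position corresponds to one "state" of the scenario, which is what the annotations above refer to.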
5. Newtonian Neural Network
N3 is composed of two parallel convolutional neural networks (CNNs): one to encode visual cues and another to represent Newtonian motions. The input to N3 is a static image with four channels (RGBM, where M is an object mask channel that specifies the location of the query object by a bounding-box mask smoothed with a Gaussian kernel) and the 66 videos of Newtonian scenarios (as described in Section 4), where each video has 10 frames (equally spaced frames sampled from the entire video) and each frame has 10 channels (RGB, flow, depth, and surface normals). The output of N3 is a 66-dimensional vector where each dimension is the confidence of the input image being assigned to a viewpoint of a Newtonian scenario. N3 learns the mapping by enforcing similarity between the vector representations of static images and those of the video frames corresponding to Newtonian scenarios. State prediction is achieved by finding the frame most similar to the static image in the Newtonian space. Figure 4 depicts a schematic illustration of N3.

The first row resembles a standard CNN architecture for image classification; we refer to this row as the image row. The image row has five 2D CONV layers (convolutional layers) and two FC layers (fully connected layers). The second row is a volumetric convolutional neural network inspired by C3D; we refer to this row as the motion row. The motion row has six 3D CONV layers and one FC layer. The input to the motion row is a batch of 66 videos (corresponding to the 66 Newtonian scenarios rendered by game engines). The motion row generates a 4096×10 matrix as output for each video, where each column of this matrix can be seen as a descriptor for a frame in the video. To preserve the same number of frames in the output, we eliminate max pooling over the temporal dimension for all CONV layers in the motion row. The two rows are joined by a matching layer that uses cosine similarity as the matching measure.
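The matching layer can be sketched as follows, with the shapes stated above (a 4096-d image descriptor and a 4096×10 descriptor matrix per video); this is an illustrative reimplementation, not the authors' code, and it works for any descriptor dimension:

```python
import numpy as np

def n3_confidences(image_desc, video_descs):
    """Sketch of N3's cosine-similarity matching layer.

    image_desc  : (D,) output of the image row (D = 4096 in the paper).
    video_descs : (N, D, T) array; one D-dim descriptor per frame of each
                  of the N game-engine videos (N = 66, T = 10 in the paper).
    Returns (scores, states): per-video confidence = max frame similarity,
    and the index of the best-matching frame (the predicted state).
    """
    img = image_desc / np.linalg.norm(image_desc)
    scores, states = [], []
    for frames in video_descs:                            # frames: (D, T)
        frames = frames / np.linalg.norm(frames, axis=0, keepdims=True)
        sims = img @ frames                               # (T,) cosine sims
        scores.append(sims.max())                         # scenario confidence
        states.append(int(sims.argmax()))                 # predicted state
    return np.array(scores), states
```

The argmax over the 66 scores picks the scenario/viewpoint, and the corresponding state index gives the moment in that scenario's video.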
The input to the image row is an RGBM image, and the output is a 4096-dimensional vector (the activations of the FC7 layer). This vector can be seen as a visual descriptor for the input image.
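The RGBM input can be assembled as below; this is a minimal sketch assuming a binary box mask blurred with a separable Gaussian, with the blur width as an illustrative choice not specified in the text:

```python
import numpy as np

def make_rgbm(rgb, box, sigma=2.0):
    """Stack an RGB image with a smoothed object-mask channel (RGBM).

    rgb   : (H, W, 3) float array
    box   : (x0, y0, x1, y1) bounding box of the query object
    sigma : Gaussian blur width in pixels (illustrative, not from the paper)
    """
    h, w, _ = rgb.shape
    mask = np.zeros((h, w), dtype=rgb.dtype)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = 1.0                        # binary box mask

    # Separable Gaussian blur implemented with plain numpy.
    r = int(3 * sigma)
    t = np.arange(-r, r + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    k /= k.sum()                                     # normalized 1D kernel
    blur = lambda v: np.convolve(v, k, mode="same")
    mask = np.apply_along_axis(blur, 0, mask)        # blur columns
    mask = np.apply_along_axis(blur, 1, mask)        # blur rows

    return np.dstack([rgb, mask])                    # (H, W, 4) RGBM
```

The smoothed mask tells the network where the query object is without a hard cutoff at the box boundary.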