Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects
Authors
Abstract
When we humans look at a video of human-object interaction, we can not only infer what is happening but also extract actionable information and imitate those interactions. On the other hand, current recognition and geometric approaches lack the physicality of action representation. In this paper, we take a step towards a more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects, and enforce that the estimated forces must lead to the same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the motions observed, (b) by jointly optimizing for contact point and force prediction, we can improve the performance on both tasks in comparison to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few-shot examples.
1. Introduction
What does it mean to understand a video of human-object interaction such as the one shown in Figure 1? One popular answer would be to recognize the nouns (objects) and verbs (actions), e.g., in this case lifting a pot. But such an understanding is quite limited in nature. For example, simply recognizing 'lifting' does not tell one anything about how the pot was grasped, or how high it was lifted. To address these shortcomings, there has been a recent push towards a deeper geometric understanding of videos. From estimating contact points on the object [2] to estimating human and object poses [21], these approaches tend to estimate the visible geometric structure. While both the high-level semantic labeling ('lifting') and the geometric inferences (human and object pose estimation) provide an answer to what happened in the video, they lack the true physical substance needed for actionable understanding. For example, just knowing how the pot moved is not sufficient for a robot to imitate the action; it also needs to understand how the act was accomplished.

In order to obtain a more actionable understanding, we argue that one must account for the physical nature of the task. The answer to the question of how the act was done is then rather straightforward from a physical perspective: the object was in contact with the human hand on two sides, and a combination of inward and upward forces applied at these contact points allowed it to be lifted up against gravity. This understanding of physical forces is not only directly useful for an active agent but also completely represents the interaction from the object's perspective, as only external forces cause its motion. In this work, we take a step towards developing such an understanding and present a system that can infer the contact points and forces applied to a known object from an interaction video.

While the goal of being able to infer these forces is desirable, it is unfortunately tedious (if not impossible) to acquire direct supervision for this task. Existing force sensors [22] are neither precise enough to provide accurate direction and magnitude measurements, nor compact enough to keep the interaction natural. So, how do we get the supervision? We note that if we can infer the physical forces applied to an object, we can also recover a full geometric understanding by simulating the effect of those forces on that object. We build on this insight and present an approach that learns to predict physical forces not by directly supervising the predicted forces, but by enforcing that their effects match the observations throughout the interaction video.

To train our system, we collect a dataset of videos recorded from multiple participants grabbing and moving objects. We then use Mechanical Turk to annotate the keypoints of the objects and the contact points in each frame, and use this limited information in the camera frame to infer the object's 6DOF pose in world coordinates and the person's contact points on the object mesh.

We observe that our approach of learning to predict forces by supervising their effects allows us to learn meaningful estimates that can explain the observed interaction in terms of reproducing the observed motion. Our experiments show that our model learns to infer human contact points on the object mesh and to estimate the corresponding forces. By applying these forces at the predicted contact points at each time step in a physics simulation, we can reproduce the behavior depicted in the video. We also show that contact point and force prediction are highly correlated, and that jointly optimizing them improves the performance on both tasks. Finally, we provide evidence that the representation we learn encodes rich geometric and physical understanding, which enables generalization to interactions with novel objects using only a few examples.
2. Related Work
Pose estimation. In order to understand physical motions, a network needs to implicitly reason about the object pose. There is a long line of work in this area, with two different approaches: category-based [12, 28, 29] and instance-based [8, 19, 30, 33]. Our work is more aligned with the latter, and we use the YCB object set [3], which provides richly textured objects. While some of our model design decisions are inspired by these works, e.g. iterative pose estimation [20], our final goal of reasoning about the physics of the observed motions is different.
Contact point prediction. Predicting contact points and estimating hand pose for object manipulation have been studied in the context of detecting plausible tool grasps [1], human action recognition [10], hand tracking [14], and common grasp pattern recognition [17]. Brahmbhatt et al. [2] collected a dataset of detailed contact maps using a thermal camera, and introduced a model to predict diverse contact patterns from object shape. While our model also reasons about contact points, this is only one of the components towards better understanding physical actions. Moreover, we show that we benefit from force prediction to improve contact point estimation.
Human-object interaction. Typical approaches for understanding human-object interaction use high-level semantic labels [11, 13, 34]. Recently there have been some works on understanding the physical aspects of the interaction. Pham et al. [26] used data from force and motion sensors to reason about human-object interactions. They also used off-the-shelf tracking, pose estimation, and analytical calculations to infer forces [25]. Hwang et al. [18] studied the forces applied to deformable objects and the changes they cause in object shape. Li et al. [21] reasoned about human body pose and the forces on the joints when a person interacts with rigid, stick-like hand tools. While these methods show encouraging results in this direction, our work concentrates on interaction scenarios with complex object meshes and more diverse contact point patterns.
Predicting physical motions. Recently, learning physical dynamics has been widely studied: classifying the dynamics of objects in static images [23], applying external forces in synthetic environments [24], predicting the post-bounce trajectory of a ball [27], simulating billiard games [9], and using generative models to produce plausible human-object interactions [31]. These works are broadly related to understanding the physical environment; however, their goal of predicting how scenes evolve in the future is different from ours. We instead tackle the problem of physically reasoning about the motions observed in videos.
Recovering physical properties. In recent years there have been efforts in building differentiable physics simulators [6, 16]. Wu et al. [32] use physics engines to estimate physical properties of objects from visual inputs. In contrast to these approaches, which aim to retrieve the properties of the physical world, we assume these are known and examine the problem of interacting with it. Our real-to-sim method is more aligned with the path taken by [4]; that said, our goal is not to bring the simulated and real-world distributions closer, but rather to replicate the observed trajectory in simulation.
3. Approach
Given a video depicting a human interacting with an object, our goal is to infer the physical forces applied over the course of the interaction. This is an extremely challenging task in its most general form. For example, the object geometry may vary wildly and even change over the interaction (e.g. picking up a cloth), or the form of contact may be challenging (e.g. from elbowing a door to playing a guitar). We therefore restrict the setup to make the task tractable, and assume that the interaction is with a known rigid object (given its 3D model) and only involves a single hand (five fingers apply the force). Given such an interaction video, our goal is then to infer the forces applied to the object at each time step along with the corresponding contact points.

Formally, given a sequence of images $\{I_t\}$ depicting an interaction with a known object, and additional annotation for the (approximate) initial object pose, we predict the person's contact points in the object coordinate frame $C_t = (c^0_t, \ldots, c^4_t)$ (each representing one finger), and the forces applied at each contact point $(f^0_t, \ldots, f^4_t)$.

As alluded to earlier, it is not possible to acquire supervision in the form of ground-truth forces over a set of training interactions. Our key insight is that we can enable learning despite this by instead acquiring an indirect supervisory signal: we enforce that the simulated effect of the predicted forces matches the motion observed across the video. We first describe in Section 3.1 a dataset we collect to allow learning using this procedure. We then describe in Section 3.2 how we can extract a supervisory signal from this data, and finally present our overall learning procedure in Section 3.3.
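To make the inputs and outputs concrete, the sketch below spells out the per-timestep prediction target as a small data structure; the names and shapes are illustrative assumptions, not the paper's actual code.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical per-timestep prediction target: k = 5 contact points (one per finger)
# in the object coordinate frame, and one 3D force vector applied at each of them.
@dataclass
class TimestepPrediction:
    contact_points: np.ndarray  # C_t, shape (5, 3), points on the object mesh
    forces: np.ndarray          # (f_t^0, ..., f_t^4), shape (5, 3)

# A full interaction is a sequence of such predictions, one per input frame I_t.
```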
3.1. Interaction Dataset
We collect a dataset of object manipulation videos representing a diverse set of objects, motions, and grasping variations. We leverage objects from the YCB set [3], as their underlying geometry is available, and record a set of videos showing participants manipulating 8 objects. To enable learning from these videos, we collect additional annotations in the form of semantic keypoints and pixel locations of contact points, which (indirectly) allow recovering the motion of the objects as well as the 3D contact points for each interaction. We describe the data collection procedure in more detail in the supplementary material; in summary, we obtain: a) annotations of the 2D locations of visible keypoints in each frame, b) the 6D pose of the object w.r.t. the camera in each frame, though these are noisy due to partial visibility, co-planar keypoints, etc., and c) the 3D contact points on the object mesh for each interaction video. There are 174 distinct interaction videos in our dataset (111 train, 31 test, 32 validation), with 13K frames in total. The 8 objects used are: pitcher, bleach bottle, skillet, drill, hammer, toy airplane, tomato soup can, and mustard bottle. We show some examples from the dataset in Figure 2. We will publicly release the dataset and believe that it will also encourage future research on understanding physical interactions.
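The per-video annotations listed above can be pictured as a simple record; the following schema is a hypothetical illustration of items (a)-(c), not the released dataset's actual format.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical per-video annotation record mirroring items (a)-(c) above.
@dataclass
class InteractionAnnotation:
    object_name: str               # one of the 8 YCB objects, e.g. "pitcher"
    keypoints_2d: np.ndarray       # (T, K, 2) pixel locations of annotated keypoints
    visibility: np.ndarray         # (T, K) mask of keypoints visible in each frame
    object_pose_6d: np.ndarray     # (T, 7) noisy pose w.r.t. camera (translation + quaternion)
    contact_points_3d: np.ndarray  # (5, 3) contact points on the object mesh
```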
3.2. Supervisory Signal Via Physical Simulation
Given the per-timestep predicted forces $f_t$ and corresponding contact points $C_t$, we show that these can receive a supervisory signal by simulating their effects and comparing the resulting simulated motion against the observed one.
Discrepancy Between Simulated And Observed Motions.
A rigid body's state can be succinctly captured by its 6D pose, linear velocity, and angular velocity. Given a current state $s_t$ and the specification of the applied forces, one can compute, using a physics simulator $P$, the resulting state $s_{t+1} \equiv P(s_t, f_t, C_t)$. Therefore, given the initial state and the predicted forces and contact points, we can simulate the entire trajectory under these forces and obtain a resulting state at each timestep. One possible way to measure the discrepancy between this simulated motion and the observed one is to penalize the difference between the corresponding 6D poses. However, our annotated 'ground-truth' 6D poses are often not accurate (due to partial visibility, etc.), and this measure of error is not robust. Instead, we note that we can directly use the annotated 2D keypoint locations to measure the error, by penalizing the re-projection error between the keypoints projected under the simulated pose and the annotated locations of the observed ones. We define a loss function that penalizes this error:
$$L_{keypoint} = \sum_{t} \left\| \pi\!\left(x^{kp};\, R_t, T_t\right) - l^{kp}_t \right\| \qquad (1)$$
Here, $l^{kp}_t$ denotes the annotated 2D keypoint locations, $x^{kp}$ the 3D keypoints on the object model, and $\pi$ the projection operator that transforms the 3D keypoints to the camera frame under the (simulated) rotation and translation at $s_t = (R_t, T_t)$.

Differentiable Physics Simulation. To allow learning using the objective above, we require a differentiable physics simulator $P$. While typical general-purpose simulators are unfortunately not differentiable, we note that the number of input variables (state $s_t$, forces $f_t$, and contact points $C_t$) in our scenario is low-dimensional. We can therefore use the finite-difference method to calculate the gradients of the outputs of the simulation with respect to its inputs. In order to calculate the derivative of the output with respect to the inputs, we need the partial derivatives $\frac{\partial s_{t+1}}{\partial s_t}$, $\frac{\partial s_{t+1}}{\partial f_t}$, and $\frac{\partial s_{t+1}}{\partial C_t}$. We use the approximation
$$\frac{\partial f(x)}{\partial x_i} \approx \frac{f(x + h\,e_i) - f(x)}{h} \qquad (2)$$
where $h$ is a small constant and $e_i$ is the $i$-th standard basis vector. As $s_t \in \mathbb{R}^{13}$, $f_t \in \mathbb{R}^{k \times 3}$, and $C_t \in \mathbb{R}^{k \times 3}$ ($k = 5$ is the number of contact points), we can compute the gradients w.r.t. the inputs using $(13 + 3k + 3k) + 1$ calls to the simulator $P$ (the last call is for computing $f(x)$). We use the PyBullet simulator [5] for our work, and find that each (differentiable) call takes only 0.12 seconds.

Figure 3: Training schema. Given a video as input, the model predicts the forces and their corresponding contact points. We then apply these forces on the object mesh in a physics simulation and jointly optimize the keypoint projection loss $L_{keypoint}$ and the contact point prediction loss $L_{cp}$, with the aim of imitating the motion observed in the video.
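As a concrete illustration of this finite-difference scheme, the sketch below computes the Jacobians of one simulation step. Here `simulate_step` is a hypothetical black-box wrapper around the simulator (e.g. PyBullet) that applies the forces at the contact points for one timestep and returns the next 13-D state; this is an illustrative sketch under that assumption, not the authors' implementation.

```python
import numpy as np

def finite_difference_jacobians(simulate_step, s_t, f_t, c_t,
                                h_s=0.01, h_f=0.01, h_c=0.05):
    """Forward-difference Jacobians of one step s_{t+1} = P(s_t, f_t, C_t) (Eq. 2)."""
    s_next = simulate_step(s_t, f_t, c_t)               # the extra "+1" call for f(x)

    def jacobian(x, h, rebuild):
        flat = x.reshape(-1)
        J = np.zeros((s_next.size, flat.size))
        for i in range(flat.size):                       # one simulator call per input dim
            perturbed = flat.copy()
            perturbed[i] += h
            s_pert = simulate_step(*rebuild(perturbed.reshape(x.shape)))
            J[:, i] = (s_pert - s_next) / h              # (f(x + h e_i) - f(x)) / h
        return J

    dS_dS = jacobian(s_t, h_s, lambda x: (x, f_t, c_t))  # 13 calls (state)
    dS_dF = jacobian(f_t, h_f, lambda x: (s_t, x, c_t))  # 3k calls (forces)
    dS_dC = jacobian(c_t, h_c, lambda x: (s_t, f_t, x))  # 3k calls (contact points)
    return s_next, dS_dS, dS_dF, dS_dC                   # (13 + 3k + 3k) + 1 calls total
```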
3.3. Putting It Together: Joint Learning Of Forces And Contact Points
Given a video and the initial state of the object, we encode the sequence into a visual embedding, and use this embedding to predict the contact points and their corresponding forces. We then apply these forces in the physics simulation to infer the updated state of the object, and use it, in addition to the sequence embedding, to iteratively predict the subsequent forces to be applied. This helps the network adapt to possible mistakes it might have made, and change the forces in the next steps accordingly (Figure 4). To train our model we have two objectives: (1) minimizing the keypoint re-projection error, which reduces the discrepancy between the object trajectory in simulation and the one seen in the video (Equation 1), and (2) minimizing the error in contact point prediction compared to the ground truth (Figure 3). The objective we use for optimizing the contact point estimation is defined as,
$$L_{cp} = \sum_{t} \frac{1}{k} \left\| \hat{C}_t - C_t \right\|_1 \qquad (3)$$
where $k$ is the number of contact points, and $C_t$ and $\hat{C}_t$ are the ground-truth and predicted contact points at time $t$.
We note that the contact point decoder receives supervisory signals from both the contact point loss and the keypoint loss. We believe this constrains contact point prediction to generate physically plausible motions, as seen in the videos. In our experiments, we show that this joint loss leads to improvements even in contact point prediction.
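To make the joint objective concrete, the following schematic rolls the predictions forward through the simulator and accumulates both losses. The encoder, decoders, simulator wrapper, and projection helper are hypothetical placeholders for the modules described above; this is a sketch of the procedure, not the released training code.

```python
import numpy as np

def rollout_losses(frames, init_state, gt_keypoints_2d, gt_contacts, keypoints_3d,
                   encoder, contact_decoder, force_decoder,
                   simulate_step, project_keypoints, k=5):
    """Accumulate the keypoint loss (Eq. 1) and contact point loss (Eq. 3) over a clip."""
    video_emb = encoder(frames)                  # sequence embedding of the input video
    state = init_state                           # 13-D object state (pose + velocities)
    loss_kp, loss_cp = 0.0, 0.0
    for t in range(len(frames)):
        # Predictions are conditioned on the current simulated state, so later steps
        # can compensate for earlier mistakes.
        contacts = contact_decoder(video_emb, state, t)     # C_t, shape (k, 3)
        forces = force_decoder(video_emb, state, t)         # f_t, shape (k, 3)
        state = simulate_step(state, forces, contacts)      # s_{t+1} = P(s_t, f_t, C_t)
        projected = project_keypoints(keypoints_3d, state)  # pi(.) under (R_t, T_t)
        loss_kp += np.linalg.norm(projected - gt_keypoints_2d[t], axis=-1).sum()  # Eq. (1)
        loss_cp += np.abs(contacts - gt_contacts[t]).sum() / k                    # Eq. (3)
    return loss_kp, loss_cp      # both terms are minimized jointly in end-to-end training
```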
Training details. The backbone for obtaining primary image features is a ResNet18 [15] pre-trained on ImageNet [7]. We take the features (of size 512×7×7) before the average pooling. For all experiments we use a batch size of 64, the Adam optimizer with a learning rate of 0.001, videos of length 10 frames at a frequency of 30 fps, and $h_f, h_s = 0.01$ and $h_c = 0.05$ for approximating the gradients (see Equation 2) with respect to force, state, and contact points, respectively. We calculate the direction of gravity in world coordinates and use it in the physics simulation to ensure realistic behavior. To train our model, we first train each branch in isolation, using $L_{cp}$ for the contact point prediction module and $L_{keypoint}$ for the force prediction module; we then jointly optimize both objectives and train end-to-end.
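A minimal sketch of this backbone setup, assuming a standard torchvision ResNet18 truncated before its average-pooling layer; it mirrors the stated configuration rather than the authors' actual code.

```python
import torch
import torchvision

# ResNet18 pre-trained on ImageNet, truncated before average pooling so each
# frame yields a 512x7x7 feature map, as described in the training details.
resnet = torchvision.models.resnet18(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc

clip = torch.randn(10, 3, 224, 224)   # a 10-frame clip sampled at 30 fps
features = backbone(clip)             # shape (10, 512, 7, 7)

# Optimizer configuration stated in the paper (applied here only to the backbone
# for illustration; in practice the full model's parameters would be optimized).
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)
```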
4. Experiments
The area of physical understanding of human-object interaction is largely unexplored, and there are no established benchmark datasets or evaluation metrics. In this work, we use our novel dataset to provide empirical evaluations, and in particular to show that: (a) our results are qualitatively meaningful; (b) individual components and loss terms are quantitatively meaningful (ablations), and a physical understanding of human-object interactions leads to improvements even in individual components such as contact point estimation and object pose estimation; and (c) our network learns a rich representation that can generalize to manipulating novel objects from only a few examples, demonstrating its generalization power.

Figure 4: Model overview. Given a video of a person moving an object, along with the initial pose of the object, our network predicts the human contact points and the forces applied at them for each time step. The implied effects of these forces can then be recovered by applying them in a physics simulation. Using the gradients through this simulated interaction, our model learns how to optimize for its two objectives: minimizing the error in the projection of the object to the camera frame, and predicting accurate contact points.
Evaluation Metric: The goal of our work is to obtain a physical understanding of human-object interactions. However, obtaining ground-truth forces is hard, which makes it impossible to quantitatively measure performance by force values alone. So instead of measuring forces directly, we evaluate whether our predicted forces lead to motions similar to those depicted in the videos. For evaluating performance, we use the CP and KP errors (Equations 3 and 1) as evaluation metrics. The former measures the L1 distance between the predicted and ground-truth contact points (in object coordinates), and the latter measures the error in keypoint projection (in the image frame). The original image size is 1920 × 1080, and the keypoint projection error is reported in pixels at the original image dimensions. The rotation and translation errors are the angular difference (quaternion distance) in rotation and the L2 distance (in meters) in translation, respectively.
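For clarity, the metrics above can be computed as in the sketch below; these are assumed formulations consistent with the description (e.g. the exact distance used for the keypoint error), not the authors' evaluation code.

```python
import numpy as np

def keypoint_error(pred_kp_2d, gt_kp_2d):
    """KP error: distance between projected and annotated keypoints,
    reported in pixels of the original 1920x1080 image."""
    return np.linalg.norm(pred_kp_2d - gt_kp_2d, axis=-1).mean()

def contact_point_error(pred_cp, gt_cp):
    """CP error: L1 distance between predicted and ground-truth contact points
    in object coordinates."""
    return np.abs(pred_cp - gt_cp).sum(axis=-1).mean()

def rotation_error(q_pred, q_gt):
    """Angular difference between two unit quaternions (radians)."""
    dot = np.clip(np.abs(np.dot(q_pred, q_gt)), -1.0, 1.0)
    return 2.0 * np.arccos(dot)

def translation_error(t_pred, t_gt):
    """L2 distance between predicted and ground-truth translations (meters)."""
    return np.linalg.norm(t_pred - t_gt)
```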
4.1. Qualitative Evaluation
We first show qualitative results of our full model: joint contact point and force prediction. The qualitative results for force prediction and contact point prediction are shown in Figure 5 and Figure 6, respectively. As the figures show, our predicted forces are quite meaningful. For example, in the case of the plane at t = 0 (see Figure 5(a)), the yellow and blue arrows initially push to the left while the purple arrow on the other side pushes to the right. This creates a twisting motion and, as seen at t = 4, rotates the plane. Also, in the case of the skillet (Figure 5(b)), there is a large change in the orientation of the object from t = 0 to t = 4, so forces of larger magnitude are required. However, after the initial momentum is gained, the forces decrease to the minimum needed to maintain the current state. Refer to the project page * for more visualizations. Next, we show a few examples of contact point prediction. If contact points are predicted in isolation, small differences can lead to fundamentally different grasps. By enforcing physical meaning on these grasps (via forces), our approach ensures more meaningful predictions. An example is at the top of Figure 5, where isolated prediction leads to a grasp on the rim of the pitcher, but joint prediction leads to a grasp on the handle.