Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects
Authors
Abstract
When we humans look at a video of human-object interaction, we can not only infer what is happening but also extract actionable information and imitate those interactions. On the other hand, current recognition and geometric approaches lack the physicality of action representation. In this paper, we take a step towards a more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects, and enforce that the estimated forces must lead to the same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the motions observed, (b) by jointly optimizing for contact point and force prediction, we can improve the performance on both tasks in comparison to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few-shot examples.
1. Introduction
What does it mean to understand a video of human-object interaction such as the one shown in Figure 1? One popular answer would be to recognize the nouns (objects) and verbs (actions), e.g., in this case lifting a pot. But such an understanding is quite limited in nature. For example, simply recognizing 'lifting' does not tell one anything about how the pot was grasped, or how high it was lifted. To address these shortcomings, there has been a recent push towards a deeper geometric understanding of videos. From estimating contact points on the object [2] to estimating human and object poses [21], these approaches tend to estimate the visible geometric structure. While both the high-level semantic labeling ('lifting') and the geometric inferences (human and object pose estimation) provide an answer to what happened in the video, they lack the true physical substance needed for actionable understanding. For example, just knowing how the pot moved is not sufficient for a robot to imitate the action; it also needs to understand how the act was accomplished.

In order to obtain a more actionable understanding, we argue that one must account for the physical nature of the task. The answer to the question of how the act was done is then rather straightforward from a physical perspective: the object was in contact with the human hand on two sides, and a combination of inward and upward forces applied at these contact points allowed it to be lifted up against gravity. This understanding of physical forces is not only directly useful for an active agent but also completely represents the interaction from the object's perspective, as only external forces cause its motion. In this work, we take a step towards developing such an understanding and present a system that can infer the contact points and forces applied to a known object from an interaction video.

While the goal of being able to infer these forces is desirable, it is unfortunately tedious (if not impossible) to acquire direct supervision for this task. Existing force sensors [22] are neither precise enough to provide accurate direction and magnitude measurements, nor compact enough to keep the interaction natural. So, how do we get the supervision? We note that if we can infer the physical forces applied to an object, we can also recover a full geometric understanding by simulating the effect of those forces on that object. We build on this insight and present an approach that learns to predict physical forces not by directly supervising the predicted forces, but by enforcing that their effects match the observations throughout the interaction video.

To train our system, we collect a dataset of videos recorded from multiple participants grabbing and moving objects. We then use Mechanical Turk to annotate the keypoints of the objects and the contact points in each frame, and use this limited information in the camera frame to infer the object's 6DOF pose in world coordinates and the person's contact points on the object mesh.

We observe that our approach of learning to predict forces by supervising their effects allows us to learn meaningful estimates that can explain the observed interaction in terms of reproducing the observed motion. Our experiments show that our model learns to infer human contact points on the object mesh and to estimate the corresponding forces. By applying these forces at the predicted contact points at each time step in a physics simulation, we can reproduce the behavior depicted in the video. We also show that contact point and force prediction are highly correlated, and that jointly optimizing them improves the performance on both tasks. Finally, we provide evidence that the representation we learn encodes rich geometric and physical understanding, which enables generalization to interactions with novel objects using only a few examples.
2. Related Work
Pose estimation. In order to understand physical motions, a network needs to implicitly reason about the object pose. There is a long line of work in this area, with two different approaches: category-based [12, 28, 29] and instance-based [8, 19, 30, 33]. Our work is more aligned with the latter, and we use the YCB object set [3], which provides richly textured objects. While some of our model design decisions are inspired by these works, e.g. iterative pose estimation [20], our final goal of reasoning about the physics of the observed motions is different.
Contact point prediction. Predicting contact points and estimating hand pose for object manipulation have been studied in the context of detecting plausible tool grasps [1], human action recognition [10], hand tracking [14], and common grasp pattern recognition [17]. Brahmbhatt et al. [2] collected a dataset of detailed contact maps using a thermal camera, and introduced a model to predict diverse contact patterns from object shape. While our model also reasons about contact points, this is only one of the components towards better understanding physical actions. Moreover, we show that we benefit from force prediction to improve contact point estimation.
Human-object interaction. Typical approaches for understanding human-object interaction use high-level semantic labels [11, 13, 34]. Recently there have been some works on understanding the physical aspects of the interaction. Pham et al. [26] used data from force and motion sensors to reason about human-object interactions. They also used off-the-shelf tracking, pose estimation, and analytical calculations to infer forces [25]. Hwang et al. [18] studied the forces applied to deformable objects and the changes they cause in object shape. Li et al. [21] reasoned about human body pose and the forces on the joints when a person interacts with rigid, stick-like hand tools. While these methods show encouraging results in this direction, our work concentrates on interaction scenarios with complex object meshes and more diverse contact point patterns.
Predicting physical motions. Recently, learning physical dynamics has been widely studied: classifying the dynamics of objects in static images [23], applying external forces in synthetic environments [24], predicting the post-bounce trajectory of a ball [27], simulating billiard games [9], and using generative models to produce plausible human-object interactions [31]. These works are broadly related to understanding the physical environment; however, their goal of predicting how scenes evolve in the future is different from ours. We instead tackle the problem of physically reasoning about the motions observed in videos.
Recovering physical properties. In recent years there have been efforts in building differentiable physics simulators [6, 16]. Wu et al. [32] use physics engines to estimate physical properties of objects from visual inputs. In contrast to these approaches, which aim to retrieve the properties of the physical world, we assume these are known and examine the problem of interacting with it. Our real-to-sim method is more aligned with the path taken by [4]; that said, our goal is not to bring the simulated and real-world distributions closer, but rather to replicate the observed trajectory in simulation.
3. Approach
Given a video depicting a human interacting with an object, our goal is to infer the physical forces applied over the course of the interaction. This is an extremely challenging task in its most general form. For example, the object geometry may vary wildly and even change over the interaction (e.g. picking up a cloth), or the form of contact may be challenging (e.g. from elbowing a door to playing a guitar). We therefore restrict the setup to make the task tractable, and assume that the interaction is with a known rigid object (given its 3D model) and only involves a single hand (five fingers apply the force). Given such an interaction video, our goal is then to infer the forces applied to the object at each time step along with the corresponding contact points.

Formally, given a sequence of images $\{I_t\}$ depicting an interaction with a known object, and additional annotation for the (approximate) initial object pose, we predict the person's contact points in the object coordinate frame $C_t = (c^0_t, \ldots, c^4_t)$ (each representing one finger), and the forces applied at each contact point $(f^0_t, \ldots, f^4_t)$.

As alluded to earlier, it is not possible to acquire supervision in the form of ground-truth forces over a set of training interactions. Our key insight is that we can enable learning despite this by instead acquiring an indirect supervisory signal: we enforce that the simulated effect of the predicted forces matches the motion observed across the video. We first describe in Section 3.1 a dataset we collect to allow learning using this procedure. We then describe in Section 3.2 how we can extract a supervisory signal from this data, and finally present our overall learning procedure in Section 3.3.
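To make the inputs and outputs concrete, the sketch below spells out the per-timestep prediction target as a small data structure; the names and shapes are illustrative assumptions, not the paper's actual code.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical per-timestep prediction target: k = 5 contact points (one per finger)
# in the object coordinate frame, and one 3D force vector applied at each of them.
@dataclass
class TimestepPrediction:
    contact_points: np.ndarray  # C_t, shape (5, 3), points on the object mesh
    forces: np.ndarray          # (f_t^0, ..., f_t^4), shape (5, 3)

# A full interaction is a sequence of such predictions, one per input frame I_t.
```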
3.1. Interaction Dataset
We collect a dataset of object manipulation videos representing a diverse set of objects, motions, and grasping variations. We leverage objects from the YCB set [3], as their underlying geometry is available, and record a set of videos showing participants manipulating 8 objects. To enable learning from these videos, we collect additional annotations in the form of semantic keypoints and pixel locations of contact points, which (indirectly) allow recovering the motion of the objects as well as the 3D contact points for each interaction. We describe the data collection procedure in more detail in the supplementary material; in summary, we obtain: a) annotations of the 2D locations of visible keypoints in each frame, b) the 6D pose of the object w.r.t. the camera in each frame, though these are noisy due to partial visibility, co-planar keypoints, etc., and c) the 3D contact points on the object mesh for each interaction video. There are 174 distinct interaction videos in our dataset (111 train, 31 test, 32 validation), with 13K frames in total. The 8 objects used are: pitcher, bleach bottle, skillet, drill, hammer, toy airplane, tomato soup can, and mustard bottle. We show some examples from the dataset in Figure 2. We will publicly release the dataset and believe that it will also encourage future research on understanding physical interactions.
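The per-video annotations listed above can be pictured as a simple record; the following schema is a hypothetical illustration of items (a)-(c), not the released dataset's actual format.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical per-video annotation record mirroring items (a)-(c) above.
@dataclass
class InteractionAnnotation:
    object_name: str               # one of the 8 YCB objects, e.g. "pitcher"
    keypoints_2d: np.ndarray       # (T, K, 2) pixel locations of annotated keypoints
    visibility: np.ndarray         # (T, K) mask of keypoints visible in each frame
    object_pose_6d: np.ndarray     # (T, 7) noisy pose w.r.t. camera (translation + quaternion)
    contact_points_3d: np.ndarray  # (5, 3) contact points on the object mesh
```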
3.2. Supervisory Signal Via Physical Simulation
Given the per-timestep predicted forces $f_t$ and corresponding contact points $C_t$, we show that these can receive a supervisory signal by simulating their effects and comparing the resulting simulated motion against the observed one.
Discrepancy Between Simulated And Observed Motions.
A rigid body's state can be succinctly captured by its 6D pose, linear velocity, and angular velocity. Given a current state $s_t$ and the specification of the applied forces, one can compute, using a physics simulator $P$, the resulting state $s_{t+1} \equiv P(s_t, f_t, C_t)$. Therefore, given the initial state and the predicted forces and contact points, we can simulate the entire trajectory under these forces and obtain a resulting state at each timestep. One possible way to measure the discrepancy between this simulated motion and the observed one is to penalize the difference between the corresponding 6D poses. However, our annotated 'ground-truth' 6D poses are often not accurate (due to partial visibility, etc.), and this measure of error is not robust. Instead, we note that we can directly use the annotated 2D keypoint locations to measure the error, by penalizing the re-projection error between the keypoints projected under the simulated pose and the annotated locations of the observed ones. We define a loss function that penalizes this error:
$$L_{keypoint} = \sum_{t} \left\| \pi\!\left(x^{kp};\, R_t, T_t\right) - l^{kp}_t \right\| \qquad (1)$$
Here, $l^{kp}_t$ denotes the annotated 2D keypoint locations, $x^{kp}$ the 3D keypoints on the object model, and $\pi$ the projection operator that transforms the 3D keypoints to the camera frame under the (simulated) rotation and translation at $s_t = (R_t, T_t)$.

Differentiable Physics Simulation. To allow learning using the objective above, we require a differentiable physics simulator $P$. While typical general-purpose simulators are unfortunately not differentiable, we note that the number of input variables (state $s_t$, forces $f_t$, and contact points $C_t$) in our scenario is low-dimensional. We can therefore use the finite-difference method to calculate the gradients of the outputs of the simulation with respect to its inputs. In order to calculate the derivative of the output with respect to the inputs, we need the partial derivatives $\frac{\partial s_{t+1}}{\partial s_t}$, $\frac{\partial s_{t+1}}{\partial f_t}$, and $\frac{\partial s_{t+1}}{\partial C_t}$. We use the approximation
$$\frac{\partial f(x)}{\partial x_i} \approx \frac{f(x + h\,e_i) - f(x)}{h} \qquad (2)$$
where $h$ is a small constant and $e_i$ is the $i$-th standard basis vector. As $s_t \in \mathbb{R}^{13}$, $f_t \in \mathbb{R}^{k \times 3}$, and $C_t \in \mathbb{R}^{k \times 3}$ ($k = 5$ is the number of contact points), we can compute the gradients w.r.t. the inputs using $(13 + 3k + 3k) + 1$ calls to the simulator $P$ (the last call is for computing $f(x)$). We use the PyBullet simulator [5] for our work, and find that each (differentiable) call takes only 0.12 seconds.

Figure 3: Training schema. Given a video as input, the model predicts the forces and their corresponding contact points. We then apply these forces on the object mesh in a physics simulation and jointly optimize the keypoint projection loss $L_{keypoint}$ and the contact point prediction loss $L_{cp}$, with the aim of imitating the motion observed in the video.
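As a concrete illustration of this finite-difference scheme, the sketch below computes the Jacobians of one simulation step. Here `simulate_step` is a hypothetical black-box wrapper around the simulator (e.g. PyBullet) that applies the forces at the contact points for one timestep and returns the next 13-D state; this is an illustrative sketch under that assumption, not the authors' implementation.

```python
import numpy as np

def finite_difference_jacobians(simulate_step, s_t, f_t, c_t,
                                h_s=0.01, h_f=0.01, h_c=0.05):
    """Forward-difference Jacobians of one step s_{t+1} = P(s_t, f_t, C_t) (Eq. 2)."""
    s_next = simulate_step(s_t, f_t, c_t)               # the extra "+1" call for f(x)

    def jacobian(x, h, rebuild):
        flat = x.reshape(-1)
        J = np.zeros((s_next.size, flat.size))
        for i in range(flat.size):                       # one simulator call per input dim
            perturbed = flat.copy()
            perturbed[i] += h
            s_pert = simulate_step(*rebuild(perturbed.reshape(x.shape)))
            J[:, i] = (s_pert - s_next) / h              # (f(x + h e_i) - f(x)) / h
        return J

    dS_dS = jacobian(s_t, h_s, lambda x: (x, f_t, c_t))  # 13 calls (state)
    dS_dF = jacobian(f_t, h_f, lambda x: (s_t, x, c_t))  # 3k calls (forces)
    dS_dC = jacobian(c_t, h_c, lambda x: (s_t, f_t, x))  # 3k calls (contact points)
    return s_next, dS_dS, dS_dF, dS_dC                   # (13 + 3k + 3k) + 1 calls total
```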
3.3. Putting It Together: Joint Learning Of Forces And Contact Points
Given a video and the initial state of the object, we encode the sequence into a visual embedding, and use this embedding to predict the contact points and their corresponding forces. We then apply these forces in the physics simulation to infer the updated state of the object, and use it, in addition to the sequence embedding, to iteratively predict the subsequent forces to be applied. This helps the network adapt to possible mistakes it might have made, and change the forces in the next steps accordingly (Figure 4). To train our model we have two objectives: (1) minimizing the keypoint re-projection error, which reduces the discrepancy between the object trajectory in simulation and the one seen in the video (Equation 1), and (2) minimizing the error in contact point prediction compared to the ground truth (Figure 3). The objective we use for optimizing the contact point estimation is defined as,
$$L_{cp} = \sum_{t} \frac{1}{k} \left\| \hat{C}_t - C_t \right\|_1 \qquad (3)$$
where $k$ is the number of contact points, and $C_t$ and $\hat{C}_t$ are the ground-truth and predicted contact points at time $t$.
We note that the contact point decoder receives supervisory signals from both the contact point loss and the keypoint loss. We believe this constrains contact point prediction to generate physically plausible motions, as seen in the videos. In our experiments, we show that this joint loss leads to improvements even in contact point prediction.
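To make the joint objective concrete, the following schematic rolls the predictions forward through the simulator and accumulates both losses. The encoder, decoders, simulator wrapper, and projection helper are hypothetical placeholders for the modules described above; this is a sketch of the procedure, not the released training code.

```python
import numpy as np

def rollout_losses(frames, init_state, gt_keypoints_2d, gt_contacts, keypoints_3d,
                   encoder, contact_decoder, force_decoder,
                   simulate_step, project_keypoints, k=5):
    """Accumulate the keypoint loss (Eq. 1) and contact point loss (Eq. 3) over a clip."""
    video_emb = encoder(frames)                  # sequence embedding of the input video
    state = init_state                           # 13-D object state (pose + velocities)
    loss_kp, loss_cp = 0.0, 0.0
    for t in range(len(frames)):
        # Predictions are conditioned on the current simulated state, so later steps
        # can compensate for earlier mistakes.
        contacts = contact_decoder(video_emb, state, t)     # C_t, shape (k, 3)
        forces = force_decoder(video_emb, state, t)         # f_t, shape (k, 3)
        state = simulate_step(state, forces, contacts)      # s_{t+1} = P(s_t, f_t, C_t)
        projected = project_keypoints(keypoints_3d, state)  # pi(.) under (R_t, T_t)
        loss_kp += np.linalg.norm(projected - gt_keypoints_2d[t], axis=-1).sum()  # Eq. (1)
        loss_cp += np.abs(contacts - gt_contacts[t]).sum() / k                    # Eq. (3)
    return loss_kp, loss_cp      # both terms are minimized jointly in end-to-end training
```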
Training details. The backbone for obtaining primary image features is a ResNet18 [15] pre-trained on ImageNet [7]. We take the features (of size 512×7×7) before the average pooling. For all experiments we use a batch size of 64, the Adam optimizer with a learning rate of 0.001, videos of length 10 frames at a frequency of 30 fps, and $h_f, h_s = 0.01$ and $h_c = 0.05$ for approximating the gradients (see Equation 2) with respect to force, state, and contact points, respectively. We calculate the direction of gravity in world coordinates and use it in the physics simulation to ensure realistic behavior. To train our model, we first train each branch in isolation, using $L_{cp}$ for the contact point prediction module and $L_{keypoint}$ for the force prediction module; we then jointly optimize both objectives and train end-to-end.
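A minimal sketch of this backbone setup, assuming a standard torchvision ResNet18 truncated before its average-pooling layer; it mirrors the stated configuration rather than the authors' actual code.

```python
import torch
import torchvision

# ResNet18 pre-trained on ImageNet, truncated before average pooling so each
# frame yields a 512x7x7 feature map, as described in the training details.
resnet = torchvision.models.resnet18(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc

clip = torch.randn(10, 3, 224, 224)   # a 10-frame clip sampled at 30 fps
features = backbone(clip)             # shape (10, 512, 7, 7)

# Optimizer configuration stated in the paper (applied here only to the backbone
# for illustration; in practice the full model's parameters would be optimized).
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)
```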
4. Experiments
The area of physical understanding of human-object interaction is largely unexplored, and there are no established benchmark datasets or evaluation metrics. In this work, we use our novel dataset to provide empirical evaluations, and in particular to show that: (a) our results are qualitatively meaningful; (b) individual components and loss terms are quantitatively meaningful (ablations), and a physical understanding of human-object interactions leads to improvements even in individual components such as contact point estimation and object pose estimation; and (c) our network learns a rich representation that can generalize to manipulating novel objects from only a few examples, demonstrating its generalization power.

Figure 4: Model overview. Given a video of a person moving an object, along with the initial pose of the object, our network predicts the human contact points and the forces applied at them for each time step. The implied effects of these forces can then be recovered by applying them in a physics simulation. Using the gradients through this simulated interaction, our model learns how to optimize for its two objectives: minimizing the error in the projection of the object to the camera frame, and predicting accurate contact points.
Evaluation Metric: The goal of our work is to obtain a physical understanding of human-object interactions. However, obtaining ground-truth forces is hard, which makes it impossible to quantitatively measure performance by force values alone. So instead of measuring forces directly, we evaluate whether our predicted forces lead to motions similar to those depicted in the videos. For evaluating performance, we use the CP and KP errors (Equations 3 and 1) as evaluation metrics. The former measures the L1 distance between the predicted and ground-truth contact points (in object coordinates), and the latter measures the error in keypoint projection (in the image frame). The original image size is 1920 × 1080, and the keypoint projection error is reported in pixels at the original image dimensions. The rotation and translation errors are the angular difference (quaternion distance) in rotation and the L2 distance (in meters) in translation, respectively.
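For clarity, the metrics above can be computed as in the sketch below; these are assumed formulations consistent with the description (e.g. the exact distance used for the keypoint error), not the authors' evaluation code.

```python
import numpy as np

def keypoint_error(pred_kp_2d, gt_kp_2d):
    """KP error: distance between projected and annotated keypoints,
    reported in pixels of the original 1920x1080 image."""
    return np.linalg.norm(pred_kp_2d - gt_kp_2d, axis=-1).mean()

def contact_point_error(pred_cp, gt_cp):
    """CP error: L1 distance between predicted and ground-truth contact points
    in object coordinates."""
    return np.abs(pred_cp - gt_cp).sum(axis=-1).mean()

def rotation_error(q_pred, q_gt):
    """Angular difference between two unit quaternions (radians)."""
    dot = np.clip(np.abs(np.dot(q_pred, q_gt)), -1.0, 1.0)
    return 2.0 * np.arccos(dot)

def translation_error(t_pred, t_gt):
    """L2 distance between predicted and ground-truth translations (meters)."""
    return np.linalg.norm(t_pred - t_gt)
```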
4.1. Qualitative Evaluation
We first show qualitative results of our full model: joint contact point and force prediction. The qualitative results for force prediction and contact point prediction are shown in Figure 5 and Figure 6, respectively. As the figures show, our predicted forces are quite meaningful. For example, in the case of the plane at t = 0 (see Figure 5(a)), the yellow and blue arrows initially push to the left while the purple arrow on the other side pushes to the right. This creates a twisting motion and, as seen at t = 4, rotates the plane. Also, in the case of the skillet (Figure 5(b)), there is a large change in the orientation of the object from t = 0 to t = 4, so forces of larger magnitude are required. However, after the initial momentum is gained, the forces decrease to the minimum needed to maintain the current state. Refer to the project page * for more visualizations. Next, we show a few examples of contact point prediction. If contact points are predicted in isolation, small differences can lead to fundamentally different grasps. By enforcing physical meaning on these grasps (via forces), our approach ensures more meaningful predictions. An example is at the top of Figure 5, where isolated prediction leads to a grasp on the rim of the pitcher, but joint prediction leads to a grasp on the handle.