Who Let the Dogs Out? Modeling Dog Behavior from Visual Data
We study the task of directly modelling a visually intelligent agent. Computer vision typically focuses on solving various subtasks related to visual intelligence. We depart from this standard approach to computer vision; instead we directly model a visually intelligent agent. Our model takes visual information as input and directly predicts the actions of the agent. Toward this end we introduce DECADE, a dataset of ego-centric videos from a dog's perspective as well as her corresponding movements. Using this data we model how the dog acts and how the dog plans her movements. We show under a variety of metrics that given just visual input we can successfully model this intelligent agent in many situations. Moreover, the representation learned by our model encodes distinct information compared to representations trained on image classification, and our learned representation can generalize to other domains. In particular, we show strong results on the task of walkable surface estimation and scene classification by using this dog modelling task as representation learning.
1. Introduction
Computer vision research typically focuses on a few well defined tasks including image classification, object recognition, object detection, image segmentation, etc. These tasks have organically emerged and evolved over time as proxies for the actual problem of visual intelligence. Visual intelligence spans a wide range of problems and is hard to formally define or evaluate. As a result, the proxy tasks have served the community as the main point of focus and indicators of progress.
We value the undeniable impact of these proxy tasks in computer vision research and advocate the continuation of research on these fundamental problems. There is, however, a gap between the ideal outcome of these proxy tasks and the expected functionality of visually intelligent systems. In this paper, we take a direct approach to the problem of visual intelligence. Inspired by recent work that explores the role of action and interaction in visual understanding [56, 3, 31], we define the problem of visual intelligence as understanding visual data to the extent that an agent can take actions and perform tasks in the visual world. Under this definition, we propose to learn to act like a visually intelligent agent in the visual world.
Figure 1. We address three problems: (1) Acting like a dog: where the goal is to predict the future movements of the dog given a sequence of previously seen images. (2) Planning like a dog: where the goal is to find a sequence of actions that move the dog between the locations of the given pair of images. (3) Learning from a dog: where we use the learned representation for a third task (e.g., walkable surface estimation).
Learning to act like visually intelligent agents is, in general, an extremely challenging and hard-to-define problem. Actions correspond to a wide range of movements with complicated semantics. In this paper, we take a small step towards the problem of learning to directly act like intelligent agents by considering actions in their most basic and semantic-free form: simple movements.
We choose to model a dog as the visual agent. Dogs have a much simpler action space than, say, a human, making the task more tractable. However, they clearly demonstrate visual intelligence, recognizing food, obstacles, other humans and animals, and reacting to those inputs. Yet their goals and motivations are often unknown a priori. They simply exist as sovereign entities in our world. Thus we are modelling a black box where we only know the inputs and outputs of the system.
In this paper, we study the problem of learning to act and plan like a dog from visual input. We compile the Dataset of Ego-Centric Actions in a Dog Environment (DECADE), which includes ego-centric videos of a dog with her corresponding movements. To record movements we mount Inertial Measurement Units (IMU) on the joints and the body of the dog. We record the absolute position and can calculate the relative angle of the dog's main limbs and body.
Using DECADE, we explore three main problems in this paper (Figure 1): (1) learning to act like a dog; (2) learning to plan like a dog; and (3) using the dog's movements as a supervisory signal for representation learning.
In learning to act like a dog, we study the problem of predicting the dog's future moves, in terms of all the joint movements, by observing what the dog has observed up to the current time. In learning to plan like a dog, we address the problem of estimating a sequence of movements that take the state of the dog's world from what is observed at a given time to a desired observed state. In using dogs as supervision, we explore the potential of using the dog's movements for representation learning.
Our evaluations show interesting and promising results. Our models can predict how the dog moves in various scenarios (act like a dog) and how she decides to move from one state to another (plan like a dog). In addition, we show that the representation our model learns on dog behavior generalizes to other tasks. In particular, we see accuracy improvements using our dog model as pretraining for walkable surface estimation and scene recognition.
2. Related Work
To the best of our knowledge there is little to no work that directly models dog behavior. We mention the past work that is most relevant.
Visual prediction. [51, 30] predict the motion of objects in a static image using a large collection of videos. Other methods infer the goals of people and their intended actions; infer future activities from a stream of video; improve tracking by considering multiple hypotheses for future plans of people; recognize partial events, which enables early detection of events; perform activity forecasting by integrating semantic scene understanding with optimal control theory; use object affordances to predict the future activities of people; localize functional objects by predicting people's intent; propose an unsupervised approach to predict possible motions and appearance of objects in the future; propose a hierarchical approach to predict a set of actions that happen in the future; generate the future frames of a video; predict the future paths of pedestrians from a vehicle camera; predict future trajectories of a person in an egocentric setting; predict the future trajectories of objects according to Newtonian physics; predict visual representations for future images; and forecast future frames by learning a policy to reproduce natural video sequences. Our work differs from these works since our goal is to predict the behavior of a dog and the movement of her joints from an ego-centric camera that captures the dog's viewpoint.
Sequence to sequence models. Sequence to sequence learning has been used for different applications in computer vision such as representation learning, video captioning [40, 50], human pose estimation, motion prediction, and body pose labeling and forecasting [8, 44]. Our model fits into this paradigm since we map the frames in a video to joint movements of the dog.
Ego-centric vision. Our work is along the lines of ego-centric vision (e.g., [7, 32, 18, 19]) since we study the dog's behavior from the perspective of the dog. However, dogs have less complex actions compared to humans, which makes the problem more manageable. Prior work explores future prediction in the context of ego-centric vision: inferring the temporal ordering of two snippets of ego-centric videos and predicting what will happen next; predicting plausible future trajectories of ego-motion in ego-centric stereo images; estimating the 3D joint position of unseen body joints using ego-centric videos; and using online reinforcement learning to forecast the future goals of the person wearing the camera. In contrast, our work focuses on predicting future joint movements given a stream of video.
Ego-motion estimation. Our planning approach shares similarities with ego-motion learning. Prior methods include an unsupervised approach for camera motion estimation; a combination of CNNs and RNNs to perform ego-motion estimation for cars; a network that estimates the relative pose of two cameras; and a CNN trained to learn the depth map and the motion of the camera from two consecutive images. In contrast to these approaches, which estimate the translation and rotation of the camera, we predict a sequence of joint movements. Note that the joint movements are constrained by the structure of the dog's body, so the predictions are constrained as well.
Action inference & Planning. Our dog planning model infers the action sequence for the dog given a pair of images showing before and after action execution. Related work learns the mapping between actions of a robot and changes in the visual state for the task of pushing objects, or optimizes for actions that capture the state changes in an exploration setting.
Inverse Reinforcement Learning. Several works (e.g., [1, 4, 34]) have used Inverse Reinforcement Learning (IRL) to infer the agent's reward function from the observed behavior. IRL is not directly applicable to our problem since our action space is large and we do not have multiple training examples for each goal.
Self-supervision. Various research explores representation learning by different self-supervisory signals such as ego-motion [2, 12], spatial location, tracking in video, colorization, physical robot interaction, inpainting, sound, etc. As a side product, we show that we learn a useful representation using embeddings of joint movements and visual signals.
3. Dataset
We introduce DECADE, a dataset of ego-centric dog video and joint movements. The dataset includes 380 video clips from a camera mounted on the dog's head. It also includes corresponding information about body position and movement. Overall we have 24500 frames. We use 21000 of them for training, 1500 for validation, and 2000 for testing. Train, validation, and test splits consist of disjoint video clips.
We use a GoPro camera on the dog's head to capture the ego-centric videos. We sub-sample frames at the rate of 5 fps. The camera applies video stabilization to the captured stream. We use inertial measurement units (IMUs) to measure body position and movement. Four IMUs measure the position of the dog's limbs, one measures the tail, and one measures the body position. The IMUs enable us to capture the movements in terms of angular displacements.
For each frame, we have the absolute angular displacement of the six IMUs. Each angular displacement is represented as a 4-dimensional quaternion vector. More details about angular calculations in this domain and the method for quantizing the data are given in Section 7. The absolute angular displacements of the IMUs depend on what direction the dog is facing. For that reason, we compute the difference between angular displacements of the joints, also in quaternion space. The difference of the angular displacements between two consecutive frames (that is, 0.2s apart in time) represents the action of the dog in that timestep.
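As an illustration of why a difference in quaternion space removes the dependence on the dog's facing direction, the following minimal sketch computes the relative rotation between two consecutive orientation readings. The helper names (`q_conj`, `q_mul`, `relative_rotation`, `rot_z`), the `(w, x, y, z)` component order, and the choice of `q_prev^{-1} * q_next` as the "difference" are assumptions made for this example; the exact convention used for DECADE is not specified in this section.

```python
import math

def q_conj(q):
    """Conjugate of a quaternion (w, x, y, z); equals the inverse for unit quaternions."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_mul(a, b):
    """Hamilton product a * b of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def relative_rotation(q_prev, q_next):
    """Assumed convention: the rotation r such that q_prev * r == q_next."""
    return q_mul(q_conj(q_prev), q_next)

def rot_z(deg):
    """Unit quaternion for a rotation of `deg` degrees about the z axis."""
    half = math.radians(deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

# Two consecutive readings of one IMU, 0.2s apart (made-up values).
q_t, q_t1 = rot_z(30), rot_z(50)
action = relative_rotation(q_t, q_t1)

# Apply the same global heading change h to both readings (the dog faces
# another direction); the relative rotation is unchanged.
h = (math.cos(0.35), math.sin(0.35), 0.0, 0.0)
action_turned = relative_rotation(q_mul(h, q_t), q_mul(h, q_t1))
```

Under this convention, any common left-multiplied heading cancels, so the same limb movement yields the same action regardless of the direction the dog faces.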
An Arduino on the dog's back connects to the IMUs and records the positional information. It also collects audio data via a microphone mounted on the dog's back. We synchronize the GoPro with the IMU measurements using audio information. This allows us to synchronize the video stream with the IMU readings with millisecond precision.
We collect the data in various outdoor and indoor scenes: living room, stairs, balcony, street, and dog park are examples of these scenes. The data is recorded in more than 50 different locations. We recorded the behavior of the dog while she was involved in certain activities such as walking, following, fetching, interacting with other dogs, and tracking objects. No annotations are provided for the video frames; we use the raw data for our experiments.
4. Acting Like A Dog
We predict how the dog acts in the visual world in response to various situations. Specifically, we model the future actions of the dog given a sequence of previously seen images. The input is a sequence of image frames $(I_1, I_2, \ldots, I_t)$, and the output is the future actions (movements) of each joint $j$ at each timestep $t < t' \le t+N$: $(a^j_{t+1}, a^j_{t+2}, \ldots, a^j_{t+N})$. Timesteps are spaced evenly by 0.2s in time. The action $a^j_t$ is the movement of joint $j$ that, along with the movements of the other joints, takes us from image frame $I_t$ to $I_{t+1}$. For instance, $a^2_3$ represents the movement of the second joint that takes place between image frames $I_3$ and $I_4$. Each action is the change in the orientation of a joint in 3D space.
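The indexing of inputs and outputs can be made concrete with a small sketch. It assumes a clip is stored as a list of frames and a list of per-timestep actions, where the action with (1-based) index t is the movement between frames t and t+1, so a clip with T frames has T-1 actions. The function name `make_sample` and the string placeholders are hypothetical and used only for illustration.

```python
def make_sample(frames, actions, t, n_future):
    """Build one (input, target) pair from a clip.

    frames  : list of frames, frames[k - 1] holds frame k (1-based k)
    actions : list of actions, actions[k - 1] is the movement between
              frame k and frame k + 1
    t       : number of observed frames
    n_future: number of future actions to predict
    """
    if len(actions) != len(frames) - 1:
        raise ValueError("expected one action between each pair of frames")
    if t < 1 or t + n_future > len(actions):
        raise ValueError("not enough frames/actions for this window")
    observed = frames[:t]                    # frames 1 .. t
    future = actions[t:t + n_future]         # actions t+1 .. t+n_future
    return observed, future

# Placeholder clip with 8 frames and 7 actions.
frames = [f"I{k}" for k in range(1, 9)]
actions = [f"a{k}" for k in range(1, 8)]
observed, future = make_sample(frames, actions, t=3, n_future=2)
```

In the real data each action entry would hold one class label per joint rather than a string.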
We formulate the problem as classification, i.e., we quantize joint angular movements and label each joint movement with a ground-truth action class. To obtain action classes, we cluster the changes in IMU readings (joint angular movements) with K-means, using the quaternion angular distance as the distance between quaternions. Each cluster centroid represents a possible movement of that joint.
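A minimal sketch of this quantization step follows. It assumes unit quaternions in `(w, x, y, z)` order and uses a common approximation for the centroid update (sign-aligned average, then renormalization); the centroid update, the initialization, and all function names here are assumptions for illustration, not a description of the exact procedure used for DECADE.

```python
import math
import random

def quat_angle(p, q):
    """Angular distance (radians) between unit quaternions; q and -q are the same rotation."""
    dot = abs(sum(a * b for a, b in zip(p, q)))
    return 2.0 * math.acos(min(1.0, dot))

def mean_quat(quats, ref):
    """Approximate mean rotation: align signs with `ref`, average, renormalize."""
    acc = [0.0, 0.0, 0.0, 0.0]
    for q in quats:
        sign = 1.0 if sum(a * b for a, b in zip(q, ref)) >= 0 else -1.0
        for i in range(4):
            acc[i] += sign * q[i]
    norm = math.sqrt(sum(c * c for c in acc))
    return tuple(c / norm for c in acc)

def assign(quats, centroids):
    """Index of the nearest centroid for each quaternion."""
    return [min(range(len(centroids)), key=lambda c: quat_angle(q, centroids[c]))
            for q in quats]

def kmeans_quat(quats, k, iters=20, seed=0):
    """K-means over joint movements with the quaternion angular distance."""
    rng = random.Random(seed)
    centroids = rng.sample(quats, k)         # assumed: random initialization
    for _ in range(iters):
        labels = assign(quats, centroids)
        for c in range(k):
            members = [q for q, l in zip(quats, labels) if l == c]
            if members:                      # keep the old centroid if a cluster is empty
                centroids[c] = mean_quat(members, centroids[c])
    return centroids, assign(quats, centroids)

def rot_z(deg):
    """Unit quaternion for a rotation of `deg` degrees about the z axis."""
    half = math.radians(deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

# Toy movements of one joint: three small rotations and three large ones.
movements = [rot_z(d) for d in (4, 5, 6, 89, 90, 91)]
centroids, labels = kmeans_quat(movements, k=2)
```

Each entry of `labels` then serves as the ground-truth action class of the corresponding movement for that joint.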
Our movement prediction model is based on an encoder-decoder architecture, where the goal is to find a mapping between input images and future actions. For instance, if the dog sees her owner with a bag of treats, there is a high probability that the dog will sit and wait for a treat, or if the dog sees her owner throwing a ball, the dog will likely track the ball and run toward it. Figure 2 shows our model. The encoder part of the model consists of a CNN and an LSTM. At each timestep, the CNN receives a pair of consecutive images as input and provides an embedding, which is used as the input to the LSTM. That is, the LSTM cell receives the features from frames t and t + 1 as the input in one timestep, and receives the features from frames t + 1 and t + 2 in the next timestep. Our experimental results show that observing two frames in each timestep of the LSTM improves the performance of the model.
Figure 2. Model architecture for acting. The model is an encoder-decoder style neural network. The encoder receives a stream of image pairs, and the decoder outputs future actions for each joint. There is a fully connected layer (FC) between the encoder and decoder parts to better capture the change in the domain (change from images to actions). In the decoder, the output probability of actions at each timestep is passed to the next timestep. We share the weights between the two ResNet towers.
The CNN consists of two towers of ResNet-18, one for each frame, whose weights are shared.
The decoder's goal is to predict the future joint movements of the dog given the embedding of the input frames. The decoder receives its initial hidden state and cell state from the encoder. At each timestep, the decoder outputs the action class for each of the joints. The input to the decoder at the first timestep is all zeros; at all other timesteps, we feed in the prediction of the previous timestep, embedded by a linear transformation. Since we train the model with a fixed output length, no stop token is required and we always stop at a fixed number of steps. Note that there are a total of six joints; hence our model outputs six action classes, one per joint, at each timestep.
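The decoding loop described above can be sketched as follows. To keep the example dependency-free, a plain tanh recurrent cell stands in for the LSTM, the weights are random rather than trained, and the initial hidden state is passed in directly instead of coming from an encoder; these simplifications, the layer sizes, and all names (`init_params`, `decode`, `matvec`) are assumptions made for illustration only.

```python
import math
import random

NUM_JOINTS = 6

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_params(rng, num_classes, embed_dim, hidden_dim):
    """Random stand-in weights (a trained model would supply these)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    out_dim = NUM_JOINTS * num_classes
    return (mat(embed_dim, out_dim),    # W_embed: linear embedding of previous prediction
            mat(hidden_dim, embed_dim), # W_in
            mat(hidden_dim, hidden_dim),# W_h
            mat(out_dim, hidden_dim))   # W_out: per-joint class scores

def decode(h0, params, num_classes, steps):
    """Fixed-length autoregressive decoding; returns one class per joint per step."""
    W_embed, W_in, W_h, W_out = params
    h = list(h0)
    prev = [0.0] * (NUM_JOINTS * num_classes)    # all-zero input at the first timestep
    outputs = []
    for _ in range(steps):                       # fixed length, so no stop token
        x = matvec(W_embed, prev)
        h = [math.tanh(a + b) for a, b in zip(matvec(W_in, x), matvec(W_h, h))]
        scores = matvec(W_out, h)
        probs = [softmax(scores[j * num_classes:(j + 1) * num_classes])
                 for j in range(NUM_JOINTS)]
        outputs.append([max(range(num_classes), key=p.__getitem__) for p in probs])
        prev = [p for joint in probs for p in joint]  # feed probabilities to next step
    return outputs

rng = random.Random(0)
params = init_params(rng, num_classes=4, embed_dim=5, hidden_dim=3)
predicted = decode([0.0, 0.0, 0.0], params, num_classes=4, steps=5)
```

Each element of `predicted` is a list of six class indices, one per joint, for one future timestep.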
Each image is given to a ResNet tower individually, and the features for the two images are concatenated. The combined features are embedded into a smaller space by a linear transformation. The embedded features are fed into the encoder LSTM. We use a ResNet pre-trained on ImageNet and fine-tune it under a Siamese setting to estimate the joint movements between two consecutive frames. We use the fine-tuned ResNet in our encoder-decoder model.
We use an average of weighted cross-entropy losses, one for each joint, to train our encoder-decoder. Our loss function can be formulated as follows:
$$L(o, g) = -\frac{1}{K N} \sum_{t=1}^{N} \sum_{i=1}^{K} \frac{1}{f^i_{g_i}} \log o(t)^i_{g_i} \qquad (1)$$
where $g(t)_i$ (abbreviated $g_i$) is the ground-truth class for the $i$-th joint at timestep $t$, $o(t)^i_{g_i}$ is the predicted probability score for the $g_i$-th class of the $i$-th joint at timestep $t$, $f^i_{g_i}$ is the number of data points whose $i$-th joint is labeled with $g_i$, $K$ is the number of joints, and $N$ is the number of timesteps. The $1/f^i_{g_i}$ factor up-weights the ground-truth labels that are underrepresented in the training data.
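The following sketch computes a loss of this kind from the quantities defined in the text: the negative log-probability of the ground-truth class of each joint, scaled by the inverse frequency of that class, and averaged over joints and timesteps. The exact form is an assumption based on those definitions, and the function name and the nested-list layout are hypothetical.

```python
import math

def weighted_ce_loss(probs, targets, class_counts):
    """Average inverse-frequency-weighted cross-entropy.

    probs[t][i][c]     : predicted probability of class c for joint i at timestep t
    targets[t][i]      : ground-truth class of joint i at timestep t
    class_counts[i][c] : number of training points whose joint i is labeled c
    """
    n_steps = len(probs)
    n_joints = len(probs[0])
    total = 0.0
    for t in range(n_steps):
        for i in range(n_joints):
            g = targets[t][i]
            total += -math.log(probs[t][i][g]) / class_counts[i][g]
    return total / (n_steps * n_joints)

# Toy case: one timestep, two joints, two classes per joint.
probs = [[[0.5, 0.5], [0.25, 0.75]]]
targets = [[0, 1]]
class_counts = [[2, 8], [4, 4]]
loss = weighted_ce_loss(probs, targets, class_counts)

# With the same predicted probability, a rare class contributes more than a common one.
rare = weighted_ce_loss([[[0.5, 0.5]]], [[0]], [[2, 8]])
common = weighted_ce_loss([[[0.5, 0.5]]], [[1]], [[2, 8]])
```

In practice the same weighting is usually applied through a per-class weight vector in a deep learning framework's cross-entropy loss.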
5. Planning Like A Dog
Another goal is to model how dogs plan actions to accomplish a task. To achieve this, we design a task as follows: Given a pair of non-consecutive image frames, plan a sequence of joint movements that the dog would take to get from the first frame (starting state) to the second frame (ending state). Note that a traditional motion estimator would not work here. Motion estimators infer a translation and rotation for the camera that can take us from one image to another; in contrast, here we expect the model to plan for the actuator, with its set of feasible actions, to traverse from one state to another. More formally, the task can be defined as follows. Given a pair of images $(I_1, I_N)$, output an action sequence of length $N - 1$ for each joint that results in the movement of the dog from the starting point, where $I_1$ is observed, to the end point, where $I_N$ is observed.
Each action that the dog takes changes the state of the world, and therefore affects the planning of the next steps. Thus, we design a recurrent neural network containing an LSTM that observes the actions taken by the model in previous timesteps when predicting the action for the next timestep. Figure 3 shows an overview of our model. We feed image frames $I_1$ and $I_N$ forward through individual ResNet-18 towers, concatenate the features from the last layer, and feed them to the LSTM. At each timestep, the LSTM cell outputs planned actions for all six joints. We pass the planned actions for a timestep as the input to the next timestep. This enables the network to plan the next movements conditioned on the previous actions. As opposed to making hard decisions about the previously taken actions, we pass the action probabilities as the input to the LSTM in the next timestep. A low-probability action at the current timestep might result in a high-probability trajectory further along in the sequence. Using action probabilities prevents early pruning and keeps all possibilities open for future actions.
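The motivation for avoiding hard decisions can be seen in a small worked example. The numbers below are made up, and the exhaustive search is only a way to expose the best two-step trajectory; it is not part of the model, which passes action probabilities forward rather than enumerating trajectories.

```python
# Hypothetical probabilities for a two-step plan with two possible actions.
first_step = {"A": 0.6, "B": 0.4}
second_step = {                      # P(second action | first action)
    "A": {"A": 0.5, "B": 0.5},
    "B": {"A": 0.9, "B": 0.1},
}

def greedy_path(first, second):
    """Commit to the most likely action at every step (hard decisions)."""
    a1 = max(first, key=first.get)
    a2 = max(second[a1], key=second[a1].get)
    return (a1, a2), first[a1] * second[a1][a2]

def best_path(first, second):
    """Score every two-step trajectory and keep the most likely one."""
    scores = {(a1, a2): first[a1] * second[a1][a2]
              for a1 in first for a2 in second[a1]}
    best = max(scores, key=scores.get)
    return best, scores[best]

greedy, greedy_prob = greedy_path(first_step, second_step)
best, best_prob = best_path(first_step, second_step)
```

Here the greedy choice commits to action "A" at the first step, while the most likely trajectory starts with the lower-probability action "B". Keeping the full probability vector as the next input leaves such trajectories available to the network.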