See the Glass Half Full: Reasoning About Liquid Containers, Their Volume and Content
Humans have a rich understanding of liquid containers and their contents; for example, we can effortlessly pour water from a pitcher to a cup. Doing so requires estimating the volume of the cup, approximating the amount of water in the pitcher, and predicting the behavior of water when we tilt the pitcher. Very little attention in computer vision has been paid to liquids and their containers. In this paper, we study liquid containers and their contents, and propose methods to estimate the volume of containers, approximate the amount of liquid in them, and perform comparative volume estimation, all from a single RGB image. Furthermore, we show the results of the proposed model for predicting the behavior of liquids inside containers when one tilts the containers. We also introduce a new dataset of Containers Of liQuid contEnt (COQE) that contains more than 5,000 images of 10,000 liquid containers in context, labelled with volume, amount of content, bounding box annotation, and corresponding similar 3D CAD models.
Recent advancements in visual recognition have enabled researchers to start exploring tasks that go beyond categorization and entail high-level reasoning in visual domains. Visual reasoning, an essential component for a visually intelligent agent, has recently attracted computer vision researchers [46, 31, 32, 23, 34, 52, 51, 6, 1, 24]. Almost all the efforts in visual recognition and reasoning have been devoted to solid objects: how to detect [14, 36] and segment [30, 8] them, how to reason about the physics of a solid world [31, 46], and how forces would affect their behavior [32, 23]. Very little attention, however, has been paid to liquid containers and the behavior of their content.
Humans, on the other hand, deal with liquids and their containers on a daily basis. We can comfortably pour water from a pitcher to a cup knowing how much water is already in the cup and having an estimate of the volume of the cup and the amount of water in the pitcher. We effortlessly estimate the angle by which to tilt the pitcher to pour the right amount of water into the cup. In fact, five-month-old infants develop a rich understanding of liquids and their containers and can predict whether water will pour or tumble from a cup if the cup is upended. Other species such as orangutans can also estimate the volume of liquids inside a container and can predict if the liquid in one container can fit into the other one.
Figure 1. Our goal is to estimate the volume of the container (Volume Estimation), approximate what fraction of the volume is filled (Content Estimation), infer whether we can pour the content of one container into another (Comparative Volume Estimation), and predict how much liquid will remain in a container over time if it is tilted to a certain angle (Pouring Prediction). Our inference is based on a single RGB image. Example query shown for Comparative Volume Estimation: "Can we pour the content of the container in the yellow box into the green one?"
In this paper, we study liquid containers (Figure 1) and propose methods to estimate the volume of containers and their content in absolute and relative senses. We also show, for the first time, that we can predict the amount of liquid that remains in a container if it is tilted by a certain angle, all from a single image. We introduce Containers Of liQuid contEnt (COQE), a dataset of images with containers annotated for their volume, the volume of their content, bounding box, and corresponding similar 3D models. Estimating the volume of containers is extremely challenging and requires reasoning about the size of the container, its shape, and contextual cues surrounding the container. The volume of the liquid content of a container can be estimated using subtle visual cues like the line at the edge of the liquid inside the container. We propose a deep-learning-based method to estimate the volume of containers and their content using the contextual cues from the surrounding objects. In addition, by integrating Convolutional Neural Networks and Recurrent Neural Networks, we can predict the behavior of liquid contents inside containers in reaction to tilting the container.
Our experimental evaluations on the COQE dataset show that incorporating contextual cues improves the estimation of the volume of the containers and the amount of their content. Furthermore, we show results for predicting, from a single RGB image, how much liquid will remain inside a container over time if it is tilted by a certain angle.
2. Related Work
In this section, we describe the work relevant to ours. To the best of our knowledge, there is little to no work that directly addresses the same problem. Below, we mention the past work that is most related.
A hybrid discriminative-generative approach has been proposed to detect transparent objects such as bottles and glasses. Other work proposes a method for detection, 3D pose estimation, and 3D reconstruction of glassware, as well as a method for reconstruction of 3D scenes that include transparent objects. Our work goes beyond detection and reconstruction since we perform reasoning about higher-level tasks such as content estimation or pouring prediction.
Object sizes have been inferred using a combination of visual and linguistic cues. In this paper, we focus only on visual cues. Size estimates have also been used by [19, 41] to better estimate the geometry of scenes.
The result of 3D object detectors [27, 40, 15] can be used to obtain a rough estimate of the volume of the containers. However, they are typically designed for RGBD images. Moreover, the output of these detectors cannot be used for estimation of the amount of content or pouring prediction. Depth estimation methods from single RGB images [11, 29, 12] can also be used for computing the relative size of containers.
The affordance of containing liquids has been inferred in prior work, which additionally reasons about the best filling and transfer directions. The problem that we address is different, and we use RGB images during inference (as opposed to RGBD images). Physical simulation has also been used to infer the affordance of containers and the containment relationship between objects. Our work is different since we reason about liquid content estimation, pouring prediction, etc.
Our pouring prediction task shares similarities with prior work that predicts the sequence of movement of rigid objects for a given force. In this work, we are concerned with liquids, which have different dynamics and appearance statistics than solid objects.
There are a number of works in the robotics community that tackle the problem of liquid pouring [43, 4, 37, 47, 22, 38] . However, these approaches either have been designed for synthetic environments [47, 22] or they have been tested in lab settings and with additional sensors [43, 37, 4, 38] . Fluid simulation is a popular topic in computer graphics [33, 20, 5] . Our problem is different since we predict the liquid behavior from a single image and are not concerned about rendering.
In this paper, we focus on four important tasks related to liquids and their containers:
• Container volume estimation: Our goal in this task is to infer the volume of the container (i.e., the volume of the liquid inside the container when the container is full). The input is a single RGB image and the query container, and the output is the volume estimate (e.g., 50 mL, 200 mL, etc.).
• Content estimation: In this task, the goal is to estimate how full a container is given a single RGB image and a query container. The example outputs are empty, 10% full, 50% full, etc.
• Comparative volume estimation: The task is to infer if we can pour the entire content of one container into another container. The input is a single RGB image and a pair of query containers in that image, and the output is yes, no, or can't tell (since we have opaque containers in the dataset). This is more complex than the previous two tasks since it requires reasoning about the size of the two containers and the amount of liquid in them simultaneously.
• Pouring prediction: The goal is to infer the amount of liquid in a container over time after tilting the container by a given angle. The inputs are a single RGB image, a query object, and a tilt angle. The output is a variable length sequence that determines the amount of liquid at each time step. The sequence has a variable length since some containers become empty much faster than other containers depending on the initial amount of liquid in them, the size of the container, and the tilt angle.
4. COQE Dataset
There is no dataset to train and evaluate models on the four tasks defined above. Hence, we introduce a new dataset called Containers Of liQuid contEnt (COQE).
The COQE dataset includes more than 5,000 images, where in each image there are at least two containers. The containers belong to different categories such as bottle, glass, pitcher, bowl, kettle, pot, etc. Figure 2 shows some example images in the dataset.
It is infeasible to use the web for collecting this dataset since obtaining accurate groundtruth volume estimates for arbitrary web images is not trivial. To overcome this problem, we used a commercial crowd-sourcing platform to collect images and their corresponding annotations. The annotators took pictures using their cameras or cellphones and measured the container volume using a measuring cup or reported the volume printed on the container label.
The data collectors were instructed to meet certain requirements. First, the images should include the context around the container since estimating the size from an image that only shows the container is an ambiguous task. To impose this constraint, we asked the annotators to take pictures that have at least 4 objects in each image. Second, the dataset should include annotations only for containers that had a bounding box whose larger side is larger than 30 pixels. We had this requirement because the content of the containers is not visible if the containers appear very small in the image. Finally, the dataset should include images that have objects in a natural setting to better capture background clutter, different illumination conditions, occlusion, etc.
Each container in our dataset has been annotated with its bounding box, its volume, and the amount of liquid inside the container. Additionally, we downloaded 34 CAD models from Trimble 3D Warehouse and we specify which 3D CAD model is most similar to each container in the images. Finding the correspondence with the CAD models enables us to run pouring simulations. For pouring simulation, we rescale the CAD models to the annotated volume and consider the annotated amount of liquid in the CAD model. Then, we tilt the CAD model by x degrees and record how much liquid remains in the CAD model for each tilt angle. Section 6.5 provides more details about pouring simulations.
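The rescaling step can be illustrated with a short sketch. It assumes the CAD model is scaled uniformly along all three axes, which the text does not state explicitly; the function names are hypothetical and are not part of the dataset tooling. Since volume grows with the cube of a uniform scale factor, the factor is the cube root of the volume ratio.

```python
def uniform_scale_for_volume(model_volume_ml: float, target_volume_ml: float) -> float:
    """Return the per-axis scale factor that maps the CAD model's volume
    to the annotated container volume (assumes uniform scaling)."""
    if model_volume_ml <= 0 or target_volume_ml <= 0:
        raise ValueError("volumes must be positive")
    # Volume scales with the cube of a uniform scale factor.
    return (target_volume_ml / model_volume_ml) ** (1.0 / 3.0)


def initial_liquid_ml(container_volume_ml: float, fill_fraction: float) -> float:
    """Amount of liquid placed in the rescaled model, given the annotated
    fill level expressed as a fraction in [0, 1]."""
    if not 0.0 <= fill_fraction <= 1.0:
        raise ValueError("fill_fraction must lie in [0, 1]")
    return container_volume_ml * fill_fraction
```

For example, a CAD model of 1,000 mL that must represent an 8,000 mL pot would be scaled by a factor of 2 along each axis.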
5.1. Volume And Content Estimation
We now describe the model for estimation of container volume and content volume for a query container in an image.
We use a Convolutional Neural Network (CNN), where the input has 4 channels. The first three channels are the RGB channels of the input image, and the fourth channel represents the bounding box of the query container as a bounding box mask smoothed by a Gaussian kernel. An additional input to our model is a set of masks generated by an object detector. These masks enable us to capture contextual information: the surrounding objects typically provide a rough prior for the volume of the container of interest. We use the Multipath network as our object detector, which is a state-of-the-art instance segmentation method that generates a mask for objects along with the category information. We use Multipath trained on the COCO dataset, so it generates masks for the 80 COCO categories. We create a binary image for each category, where the pixels of all masks for that category are set to 1. Then, we resize each binary image to 28 × 28. We obtain a 28 × 28 × 81 cube, referred to as the context tensor, since the object detector has 80 categories and we consider one category for the background (areas not covered by the masks of the 80 categories). For efficiency, we do not feed these masks into the input layer; instead, we use them at a higher level of the model. The architecture of our model is shown in Figure 3. We concatenate the context tensor with the input of the conv4_1 layer of ResNet-18, whose input size is 28 × 28 × 128. As a result, the input to conv4_1 is of size 28 × 28 × 209. We refer to this network as Contextual ResNet for Containers (CRC) throughout the paper.
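The construction of the context tensor can be sketched in plain Python. This is only an illustration under stated assumptions: the detector output format (a list of category index and binary mask pairs), nearest-neighbour resizing, and computing the background channel after resizing are all choices made here, not details given in the text. The sketch stores the tensor channel-first (81 channels of 28 × 28) for readability.

```python
NUM_CATEGORIES = 80  # object categories produced by the detector
GRID = 28            # spatial size of the context tensor


def resize_nearest(mask, size=GRID):
    """Nearest-neighbour resize of a 2D 0/1 mask (list of rows) to size x size."""
    h, w = len(mask), len(mask[0])
    return [[mask[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def build_context_tensor(detections, height, width):
    """detections: iterable of (category_id, mask) pairs, where category_id is
    in [0, 80) and mask is a height x width list of 0/1 values (assumed format).
    Returns 81 channels of 28 x 28; the last channel marks the background."""
    full = [[[0] * width for _ in range(height)] for _ in range(NUM_CATEGORIES)]
    for cat, mask in detections:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    full[cat][r][c] = 1  # union of all masks of this category
    channels = [resize_nearest(ch) for ch in full]
    # Background: cells not covered by any of the 80 category channels.
    background = [[int(not any(channels[k][r][c] for k in range(NUM_CATEGORIES)))
                   for c in range(GRID)] for r in range(GRID)]
    channels.append(background)
    return channels
```

Concatenating these 81 channels with a 128-channel feature map of the same 28 × 28 spatial size gives the 209 input channels mentioned above.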
We formulate volume and content estimation as classification. We change the layer before the classification layer of ResNet based on the number of classes in each task. The loss for this network is the cross-entropy loss, which is typically used for multi-class classification. We weight each class according to its inverse frequency in the training data. We could alternatively formulate these tasks as a regression problem. However, we obtained better performance using the classification formulation. Note that we train the network separately for the volume and content estimation tasks (i.e., the classification layer has a different output size depending on the task).
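The class weighting can be illustrated as follows. This is a minimal sketch: the text only says that weights follow the inverse class frequency, so the exact normalisation (raw reciprocal counts here) and the function names are assumptions.

```python
import math
from collections import Counter


def inverse_frequency_weights(train_labels, num_classes):
    """One weight per class, equal to 1 / (number of training examples of
    that class). Classes absent from the training data get weight 0."""
    counts = Counter(train_labels)
    return [1.0 / counts[k] if counts[k] else 0.0 for k in range(num_classes)]


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class.
    probs is the softmax output (a list of class probabilities)."""
    return -weights[label] * math.log(probs[label])
```

With this weighting, an error on a rare volume class contributes more to the loss than an error on a frequent one, which matches the use of average per-class accuracy as the evaluation metric.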
5.2. Comparative Volume Estimation
Here, we answer the following question: "Can we pour the entire content of container 1 into container 2 in the same image?". Basically, the model needs to estimate the volume for the two containers and infer the current amount of liquid in each of them to answer the question. Our approach is implicit in that we let the network figure out these quantities and do not provide explicit supervision.
Our model for this task is a Siamese network with two branches, one for each of the two containers in question. Similar to the previous tasks, each branch of the model receives a 4-channel input, where the first 3 channels are the RGB image and the 4th channel is the bounding box mask for the query container. We concatenate the outputs of the layers before the classification layers of the two branches (the concatenated output is 1024-dimensional). A fully connected (FC) layer follows the concatenation and provides the input to a Log-Softmax layer. Alternatively, we tried a 5-channel input (i.e., 3 RGB channels, one channel for the mask of container 1, and another channel for the mask of container 2). The performance for this scenario was worse than the performance of the proposed model. We also tried two scenarios for the Siamese network, with shared and non-shared weights. The performance for the shared-weight case was better. The loss for this task is also the cross-entropy loss since we formulate it as classification, where the labels are yes, no, and can't tell (which happens when at least one of the containers is opaque and its content is not visible).
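To make the three labels concrete, the following sketch shows one plausible way to derive the answer from the annotated quantities. The text does not spell out the labelling rule, so the comparison used here (the liquid in container 1 must fit into the free space of container 2) is an assumption, and the network itself learns this reasoning implicitly rather than computing it.

```python
def can_pour(volume_1_ml, fill_1, opaque_1, volume_2_ml, fill_2, opaque_2):
    """Answer "Can we pour the entire content of container 1 into container 2?"
    fill_1 and fill_2 are fill levels as fractions in [0, 1].
    Returns "yes", "no", or "can't tell"."""
    # If either container is opaque, its content is not visible.
    if opaque_1 or opaque_2:
        return "can't tell"
    liquid_1 = volume_1_ml * fill_1
    free_space_2 = volume_2_ml * (1.0 - fill_2)
    # Assumed rule: the content fits if it does not exceed the free space.
    return "yes" if liquid_1 <= free_space_2 else "no"
```

For instance, a half-full 500 mL glass holds 250 mL, which fits into the 500 mL of free space in a half-full 1,000 mL pitcher.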
5.3. Pouring Prediction
In this task, we predict how much liquid will remain in the container if we tilt it by x degrees. The output of this task is a function of a few factors: (1) The initial amount of liquid in the container. For example, if a bottle is 10% full, tilting it by a few degrees will not have any effect on the amount of liquid that remains inside the container. (2) The geometry of the container. For example, a large tilt angle is required to pour the liquid from a container that has a narrow mouth. (3) The volume of the container. For example, it takes longer to pour the content of a larger container compared to a tiny container. Estimating each of these factors is a challenging task by itself.
We formulate this task as sequence prediction, where our goal is to generate the sequence of the amount of liquid in the container over time given a single RGB image, a query container, and a tilt angle x.
The amount of the liquid at each time step is dependent on the previous time steps so we use a recurrent network to capture these dependencies. Our architecture is a Convolutional Neural Network (CNN) followed by a Recurrent Neural Network (RNN).
The CNN part of the network has the same architecture as that of CRC (shown in Figure 3) with two differences. The first difference is that we have an additional input channel to encode the angle x. This channel has the same height and width as the input image and it is concatenated with the input image. All elements of this channel are set to x. The second difference is that we remove the classification layer of CRC so we can feed the output of the CNN into the recurrent part. We denote the output of the CNN by f, which is a 512-dimensional vector. We use f as the input to the recurrent part of the network. The architecture for this task is shown in Figure 4. We consider a 100-dimensional hidden unit for the recurrent network. Note that the problem at each time step is a classification problem, where the RNN generates one of |R| classes. As described in Section 4, there is a 3D CAD model associated with each example. Therefore, we can simulate tilting for each container given an initial amount of liquid and obtain the groundtruth for this task. Note that the 3D CAD models are only used during training and not for inference. The detailed procedure for obtaining the groundtruth sequence is described in Section 6.5. The RNN stops the sequence if it generates $r_0$, which is the empty state, or p, which corresponds to the opaque container case. The reason is that the rest of the sequence would be the same if it generates either of these two labels. We consider a maximum length of 5 for the sequences in our experiments.
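The stopping rule can be sketched as a simple decoding loop. The `step_fn` callable is a hypothetical stand-in for one step of the trained recurrent network; feeding f at every step, the zero initial hidden state, and the string label names are assumptions made for illustration only.

```python
EMPTY = "r0"   # label of the empty state
OPAQUE = "p"   # label of the opaque-container case
MAX_LEN = 5    # maximum sequence length used in the experiments


def decode_sequence(step_fn, f, hidden_size=100, max_len=MAX_LEN):
    """Generate the sequence of liquid-amount labels for one query.

    step_fn(f, h) -> (label, new_h) is a hypothetical wrapper around one
    RNN step: it takes the CNN feature f and the hidden state h, and
    returns the predicted class label and the updated hidden state."""
    h = [0.0] * hidden_size  # assumed zero initial hidden state
    sequence = []
    for _ in range(max_len):
        label, h = step_fn(f, h)
        sequence.append(label)
        # After "empty" or "opaque" the remaining labels would not change.
        if label in (EMPTY, OPAQUE):
            break
    return sequence
```

The loop yields variable-length outputs: a nearly empty container tilted sharply may terminate after one or two steps, while a large, full container can use all five.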
The loss function is defined over the output sequence. Suppose we denote the groundtruth and output sequences by $S = (s_0, s_1, \dots, s_t)$ and $O = (o_0, o_1, \dots, o_{t'})$, respectively. The loss is defined as:
$$E(S, O) = -\frac{1}{T} \sum_{t=0}^{T} w_t(s_t) \, \log o_t[s_t] \qquad (1)$$
where T is the maximum length of the sequence, and $w_t(s_t)$ is the weight for each class (i.e., the inverse frequency of the amount $s_t$ at time t in the training data). Also, $o_t[s_t]$ is the $s_t$-th element of $o_t$. Recall that $o_t$ is |R|-dimensional. Also, note that $o_t = \mathrm{SoftMax}(g(h_t))$, where $h_t$ is the hidden unit of the RNN at time step t, and g is a linear function followed by a ReLU non-linearity. Hence, the loss is a cross-entropy loss defined over the sequence. If the output sequence and the groundtruth sequence have different lengths (i.e., $t \neq t'$), we pad them with the last element of the sequence to make them the same length.
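The padding rule and the weighted cross-entropy over the sequence can be sketched as follows. This is an illustrative sketch, not the training code: the per-step weight table `weights[t][k]` and the averaging over the number of time steps are assumptions, and the function names are hypothetical.

```python
import math


def pad_to(seq, length):
    """Pad a sequence to the given length by repeating its last element."""
    return list(seq) + [seq[-1]] * (length - len(seq))


def sequence_loss(target, outputs, weights):
    """Weighted cross-entropy over a sequence.

    target:  list of groundtruth class indices s_t
    outputs: list of softmax outputs o_t (each a list of class probabilities)
    weights: weights[t][k] is the weight of class k at time step t
    The shorter of target/outputs is padded with its last element."""
    n = max(len(target), len(outputs))
    target = pad_to(target, n)
    outputs = pad_to(outputs, n)
    total = 0.0
    for t in range(n):
        s_t = target[t]
        total += weights[t][s_t] * math.log(outputs[t][s_t])
    return -total / n  # averaging over time steps is an assumption
```

Padding with the last element is natural here because a sequence that ends in the empty or opaque state would simply repeat that state at all later time steps.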
We evaluate our models on the different tasks that we defined: estimating the volume of a query container, estimating how full the container is (content estimation), comparative volume estimation that infers if we can pour the entire content of one query container into another, and pouring prediction that provides a temporal estimate of how much liquid will remain inside the query container if we tilt it. The first three tasks are mainly related to estimating the geometry of the container and its content, while the fourth task addresses the estimation of the behavior of the liquid inside the container.
Dataset: Our dataset consists of more than 5,000 images that include more than 10,000 annotated containers. We use 6,386 containers for training, 1,000 for validation, and 3,000 for test. Each container is annotated with the volume, the amount of content, a bounding box, and a corresponding 3D CAD model.
6.1. Implementation Details
We use Torch to implement the proposed neural networks. We run the experiments on a Tesla K40 GPU. We feed the training images into the network in batches of size 96, where each batch contains RGB images, the mask images for the query container (or two masks for the comparative volume estimation task), and the context tensor (described in Section 5.1).
Our learning rate is $10^{-3}$ for all experiments. We use ResNet-18 for the ResNet part of the networks. The ResNet is pre-trained on ImageNet. We randomly initialize the mask channels of the input layer and the additional channels of conv4_1 in CRC by sampling from a Gaussian distribution with mean 0 and standard deviation 0.01. We train the proposed models and the baselines for 20,000 iterations and choose the model that has the highest performance on the validation set.
6.2. Volume Estimation
We first provide evaluations for the volume estimation task. We divide the space of volumes into 10 classes, where the maximum volumes in each class are: 50, 100, 200, 300, 500, 750, 1000, 2000, 3000, ∞. The unit of measurement is the milliliter (mL). For example, the first class contains all containers that are smaller than 50 mL, the second class contains containers whose volume is between 50 mL and 100 mL, and so on. The range is not uniform in order to have better visual separation of examples. We could alternatively formulate the problem as a regression problem since volume is a continuous quantity, but the performance was worse. Prior work [45, 44, 32] also formulated a continuous variable estimation problem as classification for the same reason. The baselines for this task are: (1) a naive regression that takes the width and height of the container bounding box (normalized by the image width and height, respectively) as features and regresses the volume; (2) classification using AlexNet, where we replace the FC7 layer of AlexNet and its classification layer to adapt them to 10-class classification; (3) the CRC model without the contextual information. We use the same number of iterations for training these networks. Table 1 shows the results for this task. Our evaluation metric is average per-class accuracy. The chance performance for this task is 10%. Our model provides about 2.5% improvement over the case where we do not use contextual information. The results suggest that the information about the surrounding objects can help volume estimation. The overall low performance of these state-of-the-art CNNs shows how challenging the task is. Figure 5 shows qualitative examples of volume estimation.
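The volume discretization and the evaluation metric can be sketched as follows. The handling of volumes that fall exactly on a class boundary (assigned to the higher class here) and the choice to average only over classes that appear in the groundtruth are assumptions, as the text does not specify them.

```python
from bisect import bisect_right

# Upper bounds of the first nine classes in mL; the tenth class is open-ended.
UPPER_BOUNDS_ML = [50, 100, 200, 300, 500, 750, 1000, 2000, 3000]


def volume_class(volume_ml):
    """Map a volume in mL to a class index in 0..9."""
    return bisect_right(UPPER_BOUNDS_ML, volume_ml)


def average_per_class_accuracy(true_labels, predicted_labels):
    """Mean over classes of the fraction of correctly classified examples
    within each class (classes absent from true_labels are skipped)."""
    correct, total = {}, {}
    for y, p in zip(true_labels, predicted_labels):
        total[y] = total.get(y, 0) + 1
        correct[y] = correct.get(y, 0) + int(y == p)
    return sum(correct[k] / total[k] for k in total) / len(total)
```

Unlike overall accuracy, this metric gives each volume class the same influence regardless of how many containers it contains, so a classifier that always predicts the most frequent class scores about 10% with 10 classes.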