Structured Set Matching Networks for One-Shot Part Labeling
Diagrams often depict complex phenomena and serve as a good test bed for visual and textual reasoning. However, understanding diagrams using natural image understanding approaches requires large training datasets of diagrams, which are very hard to obtain. Instead, diagram understanding can be addressed as a matching problem against labeled diagrams, natural images, or both. This problem is very challenging, since the absence of significant color and texture renders local cues ambiguous, demanding global reasoning. We consider the problem of one-shot part labeling: labeling multiple parts of an object in a target image given only a single source image of that category. For this set-to-set matching problem, we introduce the Structured Set Matching Network (SSMN), a structured prediction model that incorporates convolutional neural networks. The SSMN is trained using global normalization to maximize local match scores between corresponding elements and a global consistency score among all matched elements, while also enforcing a matching constraint between the two sets. The SSMN significantly outperforms several strong baselines on three label transfer scenarios: diagram-to-diagram, evaluated on a new diagram dataset of over 200 categories; image-to-image, evaluated on a dataset built on top of the Pascal Part dataset; and image-to-diagram, evaluated on transferring labels across these datasets.
A considerable portion of visual data consists of illustrations, including diagrams, maps, sketches, paintings, and infographics, which pose unique challenges from a computer vision perspective. While computer vision research has largely focused on understanding natural images, there has been a recent renewal of interest in understanding visual illustrations [24, 31, 51, 47, 52, 33, 55, 28]. Science and math diagrams are a particularly interesting subset of visual illustrations, because they often depict complex phenomena grounded in well-defined curricula, and serve as a good test bed for visual and textual reasoning [24, 35, 36, 26, 18].

(Indicates equal contribution. The majority of this work was done while JK was at AI2.)

Figure 1. Matching results from our SSMN model. Given source images annotated with points and text labels, our model transfers the labels to the points in the target images. Colors indicate labels; black (gray in row 2) dots indicate unlabeled points. The SSMN correctly labels the target points in spite of significant geometric transformations and appearance differences between object parts in the source and target images of categories unobserved in training.

Understanding diagrams using natural image understanding approaches requires training models for diagram categories, object categories, part categories, etc., which in turn requires large training corpora that are particularly hard to obtain for diagrams. Instead, this can be addressed by transferring labels from smaller labeled datasets of diagrams (within-domain) as well as from labeled datasets of natural images (cross-domain). Label transfer has previously shown impressive results in the within-domain natural image setting. It is interesting to note that young children are able to correctly identify diagrams of objects and their parts, having seen just a few diagrammatic and natural image examples in story books and textbooks.
The task of label transfer is quite challenging, especially in diagrams. First, it requires a fine-grained analysis of a diagram, but the absence of significant color or textural information renders local appearance cues inherently ambiguous. Second, overcoming these local ambiguities requires reasoning about the entire structure of the diagram, which is challenging. Finally, large datasets of object diagrams with fine-grained part annotations, spanning the entire set of objects we are interested in, are expensive to acquire. Motivated by these challenges, we present the One-Shot Part Labeling task: labeling object parts in a diagram having seen only one labeled image from that category.
One-Shot Part Labeling is the task of matching elements of two sets: the fully-labeled parts of a source image and the unlabeled parts of a target image. Although previous work has considered matching a single target to a set of sources [25, 46], there is little prior work on set-to-set matching, which poses the additional challenge that the model must predict a one-to-one matching. For this setting, we propose the Structured Set Matching Network (SSMN), a model that leverages the matching structure to improve accuracy. Our key observation is that a matching implies a transformation between the source and target objects, and not all transformations are equally likely. For example, in Figure 1 (top), the matching would be highly implausible if we swapped the labels of "wing" and "tail," as this would imply a strange deformation of the depicted bird. However, transformations such as rotations and perspective shifts are common. The SSMN learns which transformations are likely and uses this information to improve its predictions.
The Structured Set Matching Network (SSMN) is an end-to-end learning model for matching the elements of two sets. The model combines convolutional neural networks (CNNs) with a structured prediction model. The CNNs extract local appearance features of parts from the source and target images. The structured prediction model maximizes local matching scores (derived from the CNNs) between corresponding elements along with a global consistency score among all matched elements that represents whether the source-to-target transformation is reasonable. Crucially, the model is trained with global normalization to reduce errors from label bias (roughly, that model scores for points later in a sequence of predictions matter less), which we show is guaranteed to occur for RNNs and other locally-normalized models in set-to-set matching (Sec. 4).
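To make global normalization over matchings concrete, the following toy sketch (an illustration only, not the paper's implementation) enumerates all one-to-one matchings for a small set size K and normalizes over their total scores; `local_scores` and `consistency_fn` are hypothetical stand-ins for the CNN-derived appearance scores and the global consistency factor.

```python
import itertools

import numpy as np


def global_matching_distribution(local_scores, consistency_fn):
    """Globally-normalized distribution over one-to-one matchings.

    local_scores: K x K array; local_scores[i, j] is the appearance match
        score between source part i and target point j (a stand-in for the
        SSMN's CNN-based similarity scores).
    consistency_fn: maps a full matching (a permutation tuple) to a global
        consistency score (a stand-in for the SSMN's consistency factor).

    The partition function sums over all K! permutations, so the one-to-one
    matching constraint is enforced exactly. Exhaustive enumeration is only
    feasible for small K; it is used here purely for illustration.
    """
    K = local_scores.shape[0]
    perms = list(itertools.permutations(range(K)))
    scores = np.array([
        sum(local_scores[i, p[i]] for i in range(K)) + consistency_fn(p)
        for p in perms
    ])
    # Softmax over *whole matchings*: no per-step local normalization,
    # hence no label bias among later predictions.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return perms, probs
```

For instance, with a diagonal score matrix and a trivial consistency function, the identity matching receives the highest probability, while every candidate still competes against all other complete matchings rather than being scored step by step.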
Off-the-shelf CNNs perform poorly at extracting features from diagrams [24, 52], owing to the fact that diagrams are very sparse and have little to no texture. Our key insight for overcoming this is to convert diagrams to distance transform images. The distance transform introduces meaningful texture into the images that captures the location and orientation of nearby edges. Our experiments show that this introduced texture improves performance and enables the use of model architectures built for natural images.
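A minimal sketch of this preprocessing step is shown below (a brute-force version for illustration; in practice one would use an optimized routine such as `scipy.ndimage.distance_transform_edt`). Each output pixel holds its distance to the nearest stroke pixel, so empty regions become smooth gradients rather than flat zeros.

```python
import numpy as np


def diagram_to_distance_image(stroke_mask):
    """Euclidean distance transform of a binary line drawing.

    stroke_mask: 2-D boolean array, True where the diagram has a stroke.
    Returns an image normalized to [0, 1] in which each pixel holds its
    distance to the nearest stroke, turning large blank regions into a
    smooth 'texture' that encodes the location of nearby edges.
    """
    ys, xs = np.nonzero(stroke_mask)
    strokes = np.stack([ys, xs], axis=1).astype(float)          # (S, 2)
    h, w = stroke_mask.shape
    pixels = np.indices((h, w)).reshape(2, -1).T.astype(float)  # (h*w, 2)
    # Distance from every pixel to its nearest stroke pixel (brute force,
    # O(h*w*S); fine for small examples, too slow for real images).
    dists = np.sqrt(((pixels[:, None, :] - strokes[None, :, :]) ** 2).sum(-1))
    dist_img = dists.min(axis=1).reshape(h, w)
    return dist_img / dist_img.max() if dist_img.max() > 0 else dist_img
```

The normalized result can then be fed to a standard CNN in place of the raw binary diagram.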
We compile three datasets: (1) a new diagram dataset named Diagram Part Labeling (DiPART), which consists of 4,921 diagram images across 200 object categories, each annotated with 10 parts; (2) a natural image part labeling dataset named Pascal Part Matching (PPM), built on top of the popular Pascal Part dataset; and (3) a combination of the above two datasets (Cross-DiPART-PPM) to evaluate the task of cross-domain label transfer. The SSMN significantly outperforms several strong baselines on all three datasets.
In summary, our contributions include: (a) presenting the task of One-Shot Diagram Part Labeling; (b) proposing the Structured Set Matching Network, an end-to-end combination of CNNs and structured prediction for matching elements in two sets; (c) proposing to convert diagrams into distance transform images prior to passing them through a CNN; (d) presenting DiPART, a new diagram dataset for the task of one-shot part labeling; and (e) obtaining state-of-the-art results on three challenging setups: diagram-to-diagram, image-to-image, and image-to-diagram.
2. Related Work
One-Shot Learning. Early work on one-shot learning includes Fei-Fei et al. [15, 16], who showed that one can take advantage of knowledge coming from previously learned categories, regardless of how different those categories might be. Koch et al. proposed using a Siamese network for one-shot learning and demonstrated their model on the Omniglot dataset for character recognition. More recently, Vinyals et al. proposed a matching network for one-shot learning, which incorporates additional context into the representation of each element and into the similarity function using LSTMs. The SSMN builds on matching networks by incorporating a global consistency model that improves accuracy in the set-to-set case.

Visual Illustrations. There is a large body of work on sketch-based image retrieval (SBIR) [51, 33, 47, 55, 14]. SBIR has several applications, including online product search. The key challenge in SBIR is embedding sketches and natural images into a common space, and it is often solved with variants of Siamese networks. In the SSMN, each pair of source and target encoders together with the corresponding similarity network (Section 3.1) can be thought of as a Siamese network. There has also been work on sketch classification; more recently, a CNN architecture was proposed that produces state-of-the-art results on this task, whose authors noted that off-the-shelf CNN architectures do not work well for sketches and proposed a few modifications. Our analysis shows that converting diagrams to distance transform images allows us to use architectures resembling ones designed for natural images.

Figure 2. Factor graph for the structured prediction; appearance matching network.

Work on understanding diagrams for question answering includes the domains of science [24, 26], geometry [35, 36], and reasoning. Abstract scenes have also been analyzed to learn semantics and common sense.

Part Recognition.
There is a large body of work in detecting parts of objects as a step towards detecting the entire object including [6, 17, 57, 37, 2, 40] to name a few. In contrast to these efforts (which learn part classifiers from many training images), we focus on one-shot labeling.
Learning Correspondences and Similarity Metrics. Labeling parts from a source image can be cast as a correspondence problem, which has received a lot of attention over the years. Recently, deep learning models have been employed for finding dense correspondences [8, 20, 54, 21] and patch correspondences [53, 22]. The SSMN differs from most of these in its ability to jointly reason over the sets of elements in the source and target. There has also been a fair amount of work on learning a similarity metric [4, 7, 39]; the appearance similarity factor in the SSMN builds on past work in this area. Recently, Han et al. proposed incorporating geometric plausibility into a model for semantic correspondence, a notion that is also strongly leveraged by the SSMN.

Global Normalization with Neural Networks. Most work on structured prediction with neural networks uses locally-normalized models, e.g., for caption generation. Such models are less expressive than globally-normalized models (e.g., CRFs) and suffer from label bias, which, as we show in Sec. 4, is a significant problem in set-to-set matching. A few recent works have explored global normalization with neural networks for pose estimation and semantic image segmentation [34, 56]. Models that permit inference via a dynamic program, such as linear-chain CRFs, can be trained with log-likelihood by implementing the inference algorithm (which consists only of sums and products) as part of the neural network's computation graph and then performing backpropagation [19, 12, 50, 11, 32]. Some work has also considered using approximate inference during training [5, 42, 48]. Search-based learning objectives, such as the early-update Perceptron and LaSO, are alternative training schemes for globally-normalized models that have an advantage over log-likelihood: they do not require computing marginals during training.
This approach has recently been applied to syntactic parsing and machine translation, and we also use it to train the SSMN.
3. Structured Set Matching Network
The structured set matching network (SSMN) is a model for matching the elements of two sets. It aims to maximize local match scores between corresponding elements and a global consistency score among all matched elements, while also enforcing a matching constraint between the two sets. We describe the SSMN in the context of one-shot part labeling, though the model is applicable to any instance of set-to-set matching.
The one-shot part labeling problem is to label the parts of an object having seen only one example image from its category. We formulate this as a label transfer problem from a source image to a target image. Both images are annotated with K parts, each of which is a single point marked within the part, as shown in Figure 2. Each part of the source image is further labeled with its name, e.g., "tail." The model's output is an assignment of part names to the points marked in the target image, where each name must be uniquely drawn from the source image.
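As a concrete illustration of this output space (a hedged sketch, not the SSMN itself): given only pairwise local similarity scores, the one-to-one constraint can be enforced with the classical Hungarian algorithm. This appearance-only baseline ignores the global consistency score that the SSMN additionally models; the `sim` matrix below is an arbitrary stand-in for CNN-derived scores.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def transfer_labels(part_names, sim):
    """Appearance-only matching baseline (not the full SSMN).

    part_names: names of the K labeled source parts.
    sim: K x K array, sim[i, j] = local similarity between source part i
         and target point j (in the SSMN such scores come from CNNs).
    Returns labels, where labels[j] is the part name assigned to target
    point j under the one-to-one matching constraint.
    """
    # linear_sum_assignment minimizes cost, so negate to maximize similarity.
    src, tgt = linear_sum_assignment(-sim)
    labels = [None] * len(part_names)
    for i, j in zip(src, tgt):
        labels[j] = part_names[i]
    return labels
```

Because every source name is used exactly once, no two target points can receive the same label, matching the uniqueness requirement described above.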