Finding Domain-Specific Grounding in Noisy Visual-Textual Documents
Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations. Such granular annotation is rare, expensive, and unavailable in most domain-specific contexts. In contrast, unlabeled multi-image, multi-sentence documents are abundant. Can lexical grounding be learned from such documents, even though they have significant lexical and visual overlap? Working with a case study dataset of real estate listings, we demonstrate the challenge of distinguishing highly correlated grounded terms, such as “kitchen” and “bedroom”, and introduce metrics to assess this document similarity. We present a simple unsupervised clustering-based method that increases precision and recall beyond object detection and image tagging baselines when evaluated on labeled subsets of the dataset. The proposed method is particularly effective for local contextual meanings of a word, for example associating “granite” with countertops in the real estate dataset and with rocky landscapes in a Wikipedia dataset.
Multimodal data consisting of text and images is not only ubiquitous but increasingly diverse: libraries are digitizing visual-textual collections (British Library Labs, 2016; The Smithsonian, 2020); news organizations release over 1M images per year to accompany news articles (The Associated Press, 2020); and social media messages are rarely sent without visual accompaniment. In this work, we focus on one such specialized, multimodal domain: New York City real estate listings from the website StreetEasy.
To effectively index image-text datasets for search, retrieval, and other tasks, we need algorithms that learn connections between modalities, doing so from data that is naturally abundant. In documents that contain multiple images and sentences, there may be no explicit annotations for image-sentence associations or bounding box-word associations. As a result, existing image captioning/tagging methods are difficult to adapt to unlabeled multi-image, multi-sentence documents. Indeed, most prior image captioning work has focused on rare and expensive single-image, single-caption collections such as MSCOCO, which focuses on literal, context-free descriptions for 80 object types (Lin et al., 2014). Similarly, off-the-shelf object detectors may not account for contextual factors: to an ImageNet classifier, "pool" refers to a pool table (Russakovsky et al., 2015). In the specialized real estate context, "pool" commonly refers to a swimming pool.
The apartment features a private balcony, dark hardwood floors and stunning floor-to-ceiling windows. The separate kitchen comes with a deluxe appliance package. There is also a washer.
The entire main floor is an open living area complete with half bath, a refined and stunning kitchen. Pass through the kitchen onto an ample patio, which overlooks the idyllic garden.
Large bedroom, kitchen, updated modern bathroom. Close to bike and subway.
Figure 2: [Plot placing StreetEasy, MSCOCO, Wikipedia, and other datasets along two axes: more similar to more distinct images, and more similar to more distinct text.] Documents in the StreetEasy dataset are much more visually similar to each other than documents in seven multimodal image-text datasets spanning storytelling, cooking, travel blogs, captioning, etc. (Lin et al., 2014; Huang et al., 2016; Yagcioglu et al., 2018; Hessel et al., 2018, 2019; Nag Chowdhury et al., 2020). Examples from StreetEasy show that words like "kitchen" are frequent and grounded. Black lines represent 99.99% CI.
Consider the task of lexical grounding: given a word, which images in the corpus depict that word? Consider the difficulty in learning a visual grounding for "kitchen" in StreetEasy. First, documents are multi-image, multi-sentence rather than single-image, single-sentence. Second, almost all documents picture a kitchen, a living room, and a dining room. Finally, "kitchen" co-occurs with more than two-thirds of all images, the majority of which are not kitchens. Is this task even possible?
Our first contribution is to map out a landscape of multimodal datasets, placing our real estate case study in relation to existing corpora. We operationalize this notion in Figure 2 by plotting average across-document visual+textual similarity for our StreetEasy case study compared to several existing multimodal corpora;¹ indeed, images in StreetEasy have very low diversity compared to other corpora. As a result of this self-similarity, in §3, we find that image-text grounding is difficult for off-the-shelf image tagging methods like multinomial/softmax regression, which leverage variation in both lexical and visual features across documents.² Our second contribution is a simple but performant clustering algorithm for this setting, EntSharp.³ We intend this method to learn from image-word co-occurrences collected from multi-image, multi-sentence document collections.

1 We compute text similarity between documents with a length-controlled version of word mover's distance (WMD) (Kusner et al., 2015) on word2vec token features. We compute visual similarity between documents with "image mover's" distance, which is identical to WMD, but with a CNN feature for each image. More details are given in Appendix A.
2 Existing unsupervised approaches for this setting (Hessel et al., 2019; Nag Chowdhury et al., 2020) learn within-document matchings of whole sentences/paragraphs; in contrast, we learn cross-document matchings of word types to images.
3 Code is at https://github.com/gyauney/domain-specific-lexical-grounding.
The training process iteratively "sharpens" the estimated Pr(word | image) distributions so that words "compete" to claim responsibility for images. We show that EntSharp outperforms both object detection and image tagging baselines at retrieving relevant images for given word types. We then qualitatively explore EntSharp's predictions on both StreetEasy and a multimodal Wikipedia dataset (Hessel et al., 2018). The algorithm is often able to learn corpus-specific relations: as shown in Figure 1, in the context of NYC real estate, "chrysler" refers to a prominent building and "granite" to a kitchen surface, while in Wikipedia the same words are grounded in cars and rocky outcroppings.
Related work. Learning image-text relationships is central to many applications, including image captioning/tagging (Kulkarni et al., 2013; Mitchell et al., 2013; Karpathy and Fei-Fei, 2015) and cross-modal retrieval/search (Jeon et al., 2003; Rasiwasia et al., 2010). While most captioning work assumes a supervised one-to-one corpus, recent works consider documents containing multiple images/sentences (Park and Kim, 2015; Shin et al., 2016; Chu and Kao, 2017; Hessel et al., 2019; Nag Chowdhury et al., 2020). Furthermore, compared to crowd-annotated captioning datasets, web corpora are more challenging, as image-text relationships often transcend literal description (Marsh and White, 2003; Alikhani and Stone, 2019).
2 Task And Models
We consider a direct image-text grounding task: for each word type, we aim to retrieve images most associated with that word. Models are evaluated by their capacity to compute word-image similarities that align with human judgment.
EntSharp. For each image in a document we iteratively infer a probability distribution over the words present in the document. During training, these distributions are encouraged to have low entropy. The output is an embedding of each word into image space: the model computes word-image similarities in this joint space. This can be thought of as a soft clustering, such that each word type is equivalent to a cluster but only certain clusters are available to certain images. This approach could also be situated within the framework of multiple-instance learning (Carbonneau et al., 2018).
Each image $i$ starts with a fixed feature vector $\mathbf{i} \in \mathbb{R}^d$. Let $I$ be the set of these image embeddings. For each word $w$ we initialize a cluster centroid $\mathbf{w} \in \mathbb{R}^d$ to the average of co-occurring images' embeddings. Let $\mathbb{1}_{i,w}$ be 1 if image $i$ co-occurs with word $w$ in any document and 0 otherwise. Each image $i$ is assumed to have a membership distribution $p_i$ over words, where $p_i$ is initially uniform over co-occurring words. At each iteration, cluster centroids are updated to the weighted average of co-occurring images' embeddings: $\mathbf{w} := \sum_{i \in I} p_i(w) \cdot \mathbf{i}$, followed by normalization. Each image's distribution over clusters is updated by taking a softmax of the cosine similarity between pairs of image and word embeddings, first multiplying similarities by a sharpness coefficient⁴ equal to the iteration number, and finally masking for co-occurrence:

$$p_i(w) \propto \mathbb{1}_{i,w} \cdot \exp\big(\text{sharpness} \cdot (\mathbf{i} \cdot \mathbf{w})\big).$$
After training, we calculate the cosine similarity between image embeddings and the learned word-cluster embeddings.
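The updates described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the function name `entsharp`, the toy two-dimensional features, and the iteration count are hypothetical, and the first centroid update (computed from the uniform memberships) is treated here as the initialization.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def entsharp(images, cooc, n_iters=10):
    """images: list of feature vectors; cooc[i]: set of words co-occurring with image i.

    Returns (centroids, memberships), where memberships[i][w] estimates p_i(w).
    """
    images = [normalize(v) for v in images]  # unit vectors, so dot product = cosine
    dim = len(images[0])
    vocab = sorted(set().union(*cooc))
    # p_i starts uniform over the words that co-occur with image i.
    p = [{w: 1.0 / len(ws) for w in ws} for ws in cooc]
    centroids = {}
    for iteration in range(1, n_iters + 1):
        # Centroid update: membership-weighted sum of image embeddings, then normalize.
        for w in vocab:
            acc = [0.0] * dim
            for vec, p_i in zip(images, p):
                weight = p_i.get(w, 0.0)  # zero when image and word never co-occur
                acc = [a + weight * x for a, x in zip(acc, vec)]
            centroids[w] = normalize(acc)
        # Membership update: softmax over co-occurring words only (the mask),
        # with similarities scaled by the sharpness coefficient.
        sharpness = iteration  # assumption: coefficient equals the iteration number
        for i, (vec, ws) in enumerate(zip(images, cooc)):
            scores = {w: math.exp(sharpness * dot(vec, centroids[w])) for w in ws}
            total = sum(scores.values())
            p[i] = {w: s / total for w, s in scores.items()}
    return centroids, p

# Toy data (hypothetical): images 0 and 2 look like kitchens, 1 and 3 like bedrooms.
toy_images = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.0], [0.0, 1.0]]
toy_cooc = [{"kitchen", "bedroom"}, {"kitchen", "bedroom"}, {"kitchen"}, {"bedroom"}]
toy_centroids, toy_p = entsharp(toy_images, toy_cooc)
print(toy_p[0])  # membership of image 0 concentrates on "kitchen"
```

In this toy corpus the two single-word documents break the symmetry between "kitchen" and "bedroom"; if both words co-occurred with exactly the same images, their centroids would remain identical and no sharpening could separate them.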
Untrained EntSharp baseline. We consider a simple averaging baseline, corresponding to the cluster center initializations of EntSharp: each word embedding is set to the mean of the features for all its co-occurring images.
Object detection baselines. We can use ImageNet to identify objects, but most words in the full vocabulary are not in the ImageNet labels. We implement two object detection baselines that map images to object names and then match object names to words in documents (Hessel et al., 2019). For each image, we first get the image's top class predictions from DenseNet169 (Huang et al., 2017) pretrained on the ImageNet classification task (Russakovsky et al., 2015). These predictions are for a whole image and are restricted to the 1000 ImageNet labels. We bridge the gap between ImageNet labels and the vocabulary by then creating an image vector by averaging the word vectors corresponding to these predictions. Finally, for each word in the full vocabulary, we rank images by the cosine similarity between the word's vector and these image vectors. Words are represented in one baseline by word2vec embeddings (Mikolov et al., 2013) and in the other by the output of RoBERTa (Liu et al., 2019) when fed a single token as input.
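The label-to-vocabulary bridge can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the three-dimensional word vectors, the image identifiers, and the predicted labels are made up, standing in for real word2vec or RoBERTa vectors and real DenseNet169 predictions.

```python
import math

def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is empty or all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def image_vector(labels, word_vecs):
    """Average the word vectors of an image's predicted class labels."""
    vecs = [word_vecs[label] for label in labels if label in word_vecs]
    return [sum(column) / len(vecs) for column in zip(*vecs)]

def rank_images(query, predictions, word_vecs):
    """Rank image ids by cosine similarity between the query word and each image vector."""
    q = word_vecs[query]
    scores = {img: cosine(q, image_vector(labels, word_vecs))
              for img, labels in predictions.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical word vectors and top class predictions for two images.
toy_vecs = {
    "kitchen": [1.0, 0.2, 0.0],
    "stove": [0.9, 0.1, 0.0],
    "microwave": [0.8, 0.3, 0.1],
    "bed": [0.0, 0.2, 1.0],
    "quilt": [0.1, 0.1, 0.9],
}
toy_predictions = {"img_a": ["stove", "microwave"], "img_b": ["bed", "quilt"]}
print(rank_images("kitchen", toy_predictions, toy_vecs))
```

Because the query word only needs a vector, this scheme can rank images for words such as "kitchen" that are not themselves among the detector's labels.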
Image tagging baselines. Inspired by Mahajan et al. (2018), we implement softmax and multinomial regression models. The former, softmax regression, takes image features and predicts a distribution over the words in the vocabulary with a softmax loss. It computes the word type indicator vector for each document, i.e., 1 if word w was in the document else 0, and then ℓ1-normalizes. Multinomial regression computes the word type indicator vector and, instead of normalizing, computes the logistic sigmoid loss treating the labels as 0/1 indicators. This is equivalent to training a separate logistic regression for each word type to predict the presence/absence of a word type in each document, given the image features. Both models finally use the predicted conditional distributions to produce a ranking of images for each word.
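The difference between the two training targets can be made concrete with a short sketch. It assumes the per-word logits come from some model over image features (for instance a linear layer, which is omitted here); the function names and the three-word vocabulary are hypothetical.

```python
import math

def indicator(doc_words, vocab):
    """1.0 for each vocabulary word present in the document, else 0.0."""
    return [1.0 if w in doc_words else 0.0 for w in vocab]

def l1_normalize(v):
    """Divide by the sum so that the entries form a distribution."""
    total = sum(v)
    return [x / total for x in v] if total > 0 else list(v)

def softmax_loss(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target_dist, logits))

def sigmoid_loss(logits, labels):
    """Sum of independent binary cross-entropies, one per word type."""
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
    return sum(max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
               for z, y in zip(logits, labels))

toy_vocab = ["kitchen", "bedroom", "pool"]
labels = indicator({"kitchen", "bedroom"}, toy_vocab)  # [1.0, 1.0, 0.0]
logits = [0.0, 0.0, 0.0]                               # an uninformative prediction
print(softmax_loss(logits, l1_normalize(labels)))      # softmax target: [0.5, 0.5, 0.0]
print(sigmoid_loss(logits, labels))                    # sigmoid target: 0/1 per word
```

Under the softmax loss the words in a document share a single unit of probability mass, whereas under the sigmoid loss each word type is scored independently.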
StreetEasy dataset. The StreetEasy dataset comprises 29,347 real estate listings in New York City in June 2019. Document excerpts are shown in Figure 2: each consists of both images and English-language sentences. Documents contain an average of 128 word tokens and 10 images, for totals of 3,773,608 word tokens and 294,279 images. There are no image-specific captions or labels. For our quantitative word-image retrieval evaluations, we augment StreetEasy with 17,658 human relevance judgements. After initial experiments, we selected words with a variety of frequencies and degrees of lexical/visual overlap with ImageNet categories: "kitchen" (co-occurs with 200k images), "bedroom" (175k), "washer" (65k), "outdoor" (50k), "fitness" (49k), and "pool" (29k). For each of these words of interest, we labeled a different random 1% subset of all images (2,943 images each): an image in a sample was labeled true if it corresponded with any sense of the associated word and false otherwise. For each model, we rank images for each query

Figure 3: Top images for EntSharp and object detection baselines on the StreetEasy dataset. Images in each word's section come from the same evaluation set, and each row is ranked in decreasing order from left to right. For example, the three rows in the "kitchen" section are different orderings of the same 2,943 images. Images with dark blue borders were labeled true with respect to the word, and those with light red borders were labeled false. E: EntSharp. W: word2vec object detection baseline. R: RoBERTa object detection baseline.
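Given a model's ranking of a labeled evaluation set, retrieval quality can be summarized with precision and recall at a cutoff. The sketch below is an assumption about how such a score might be computed, since the exact evaluation protocol is not spelled out in this section; the function name `precision_recall_at_k` and the toy labels are hypothetical.

```python
def precision_recall_at_k(ranked_labels, k):
    """ranked_labels: true/false relevance judgements, ordered from the
    highest-scoring image to the lowest. Returns (precision, recall) at cutoff k."""
    top = ranked_labels[:k]
    hits = sum(top)                      # relevant images among the top k
    total_relevant = sum(ranked_labels)  # relevant images in the whole set
    precision = hits / len(top) if top else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall

# Toy ranking for one query word: three of the five images are relevant.
toy_ranking = [True, True, False, True, False]
print(precision_recall_at_k(toy_ranking, 3))
```

Each row of Figure 3 corresponds to one such ranking of the same labeled images, so different models can be compared at the same cutoff.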