Spatially Aware Multimodal Transformers for TextVQA


  • Yash Kant
  • Dhruv Batra
  • Peter Anderson
  • A. Schwing
  • Devi Parikh
  • Jiasen Lu
  • Harsh Agrawal
  • ECCV
  • 2020


Abstract

Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity only looks at neighboring entities defined by a spatial graph. Further, each head in our multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing the attention amongst all visual entities; (2) we avoid learning redundant features. We show that our model improves the absolute accuracy of current state-of-the-art methods on TextVQA by 2.2% overall over an improved baseline, and 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly on ST-VQA, we improve the absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding.

1 Introduction

The promise of assisting visually-impaired users gives us a compelling reason to study Visual Question Answering (VQA) [3] tasks. A dominant class of questions (∼20%) asked by visually-impaired users on images of their surroundings involves reading text in the image [5] . Naturally, the ability to reason about text in the image to answer questions such as "Is this medicine going to expire?", "Where is this bus going?" is of paramount importance for these systems. To benchmark a model's capability to reason about text in the image, new datasets [7, 32, 38] have been introduced for the task of Text Visual Question Answering (TextVQA).

Answering questions involving text in an image often requires reasoning about the relative spatial positions of objects and text. For instance, many questions such as "What is written on the player's jersey?" or "What is the next destination for the bus?" ask about text associated with a particular visual object. Similarly, the question asked in Fig. 1, "What sponsor is to the right of the players?", explicitly asks the answerer to look to the right of the players. Unsurprisingly, ∼13% of the questions in the TextVQA dataset use one or more spatial prepositions.

Fig. 1: (a) Compared to previous approaches [14, 38], our model can reason about spatial relations between visual entities to answer questions correctly. (b) We construct a spatial graph that encodes different spatial relationships between a pair of visual entities and use it to guide the self-attention layers present in multi-modal transformer architectures.

Fig. 1: Qualitative Examples: The figure shows the output of M4C and our method on several image-question pairs. Bold italicized text denotes words chosen from the OCR tokens; all other words were chosen from the vocabulary. The VQA score for each prediction is shown in parentheses.

Existing methods for TextVQA reason jointly over three modalities: the input question, the visual content, and the text in the image. LoRRA [38] uses an off-the-shelf Optical Character Recognition (OCR) system [9] to detect OCR tokens and extends previous VQA models [2] to select single OCR tokens from the images as answers. The more recently proposed Multimodal Multi-Copy Mesh (M4C) model [14] captures intra- and inter-modality interactions over all the inputs (question words, visual objects, and OCR tokens) by using a multimodal transformer architecture that iteratively decodes the answer, choosing words from either the OCR tokens or a fixed vocabulary. The superior performance of M4C is attributed to its use of multi-head self-attention layers [44], which have become the de facto standard for modeling vision and language tasks [10, 27, 28, 39, 43].

While these approaches take advantage of detected text, they are limited in how they use spatial relations. For instance, LoRRA [38] does not use any location information, while M4C [14] merely encodes the absolute location of objects and text as input to the model. By default, self-attention layers are fully-connected, dispersing attention across the entire global context and disregarding the importance of the local context around a given object or text token. As a result, the onus is on existing models to implicitly learn to reason about the relative spatial relations between objects and text. In contrast, in the Natural Language Processing community it has proven beneficial to explicitly encode semantic structure between input tokens [41, 47, 49]. Moreover, while multiple independent heads in self-attention layers model different context, each head independently looks at the same global context and learns redundant features [28] that can be pruned away without substantially harming a model's performance [30, 46].

We address the above limitations by proposing a novel spatially aware self-attention layer for multimodal transformers. First, we follow [23, 50] to build a spatial graph that represents the relative spatial relations between all visual entities, i.e., all objects and OCR tokens. We then use this spatial graph to guide the self-attention layers in the multimodal transformer. We modify the attention computation in each head such that each entity attends to just the neighboring entities defined by the spatial graph, and we restrict each head to only look at a subset of relations, which prevents the learning of redundant features.

Empirically, we evaluate the efficacy of our proposed approach on the challenging TextVQA [38] and Scene-Text VQA (ST-VQA) [7] datasets. We first improve the absolute accuracy of the baseline M4C model on TextVQA by 3.4% with improved features and hyperparameter optimization. We then show that replacing the fully-connected self-attention layers in the M4C model with our spatially aware self-attention layers improves absolute accuracy by a further 2.2% (or 4.62% for the ∼14% of TextVQA questions that include spatial prepositions and have a majority answer among the OCR tokens). On ST-VQA, our final model achieves an absolute 4.2% improvement in Average Normalized Levenshtein Similarity (ANLS). Finally, we show that our model is more visually grounded, as it picks the correct answer from the list of OCR tokens 8.8% more often than M4C.

2 Related Work

Models for TextVQA: Several datasets and methods [7, 14, 32, 38] have been proposed for the TextVQA task, i.e., answering questions which require models to explicitly reason about text present in the image. LoRRA [38] extends Pythia [15] with an OCR attention branch to reason over a combined list of answers from a static vocabulary and detected OCR tokens. Several other models have taken similar approaches to augmenting existing VQA models with OCR inputs [6, 7, 32]. Building on the success of transformers [44] and BERT [11], the Multimodal Multi-Copy Mesh (M4C) model [14] (which serves as our baseline) uses a multimodal transformer to jointly encode the question, image, and text, and employs an auto-regressive decoding mechanism to perform multi-step answer decoding. However, these methods are limited in how they leverage the relative spatial relations between visual entities such as objects and OCR tokens. Specifically, early models [6, 7, 32] proposed for the TextVQA task did not encode any explicit spatial information, while M4C [14] simply adds an embedding of the absolute location to the input features. We improve the performance of these models by proposing a general framework to effectively utilize the relative spatial structure between visual entities within the transformer architecture.

Multimodal Representation Learning for Vision and Language: Recently, several general architectures for vision and language [10, 14, 22, 24, 27, 28, 33, 35, 39, 42, 43, 52] were proposed that reduce architectural differences across tasks. These models (including M4C) typically fuse the vision and language modalities by applying either self-attention [4] or co-attention [29] mechanisms to capture intra- and inter-modality interactions. They achieve superior performance on many vision and language tasks due to their strong representation power and their ability to pre-train visual grounding in a self-supervised manner. Similar to M4C, these methods add a location embedding to their inputs, but do not explicitly encode relative spatial information, which is crucial for visual reasoning. Our work takes a first step towards modeling relative spatial locations within the multimodal transformer architecture.

Leveraging Explicit Relationships For Visual Reasoning:

Prior work has used Graph Convolutional Networks (GCNs) [19] and Graph Attention Networks (GATs) [45] to leverage explicit relations for image captioning [50] and VQA [23]. Both of these methods construct spatial and semantic graphs to relate different objects. Although our relative spatial relations are inspired by [23, 50], our encoding differs greatly: [23, 50] look at all the spatial relations in every attention head, whereas each self-attention head in our model looks at a different subset of the relations, i.e., each head is only responsible for a certain number of relations. This important distinction prevents the spreading of attention over the entire global context and reduces redundancy amongst the heads.

Context Aware Transformers For Language Modeling:

Related to the use of spatial structure for visual reasoning tasks, there has been a body of work on modeling the underlying structure in input sequences for language modeling tasks. Previous approaches have considered encoding the relative position difference between sentence tokens [37] as well as encoding the depth of each word in a parse tree and the distance between word pairs [47] . Other approaches learn to adapt the attention span for each attention head [41, 49] , rather than explicitly modeling context for attention. While these methods work well for sequential input like natural language sentences, they cannot be directly applied to our task since our visual representations are non-sequential.

3 Background: Multimodal Transformers

Following the success of transformer [44] and BERT [11] based architectures on language modeling and sequence-to-sequence tasks, multi-modal transformer-style models [10, 14, 22, 24, 27, 28, 33, 35, 39, 42, 43, 52] have shown impressive results on several vision-and-language tasks. Instead of using a single input modality (i.e., text), multiple modalities are encoded as sequences of input tokens and appended together to form a single input sequence. Additionally, a type embedding unique to each modality is added to distinguish amongst input tokens of different modalities.
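As a schematic sketch of this input construction (the dimensions, random "features", and modality names below are ours, for illustration only, not the paper's implementation), each modality's token features are appended into one sequence after adding a per-modality type embedding:

```python
import numpy as np

# Schematic only: toy dimensions and random stand-ins for real features.
rng = np.random.default_rng(3)
d = 8

# Token features for three modalities: question words, objects, OCR tokens.
question = rng.normal(size=(4, d))
objects = rng.normal(size=(3, d))
ocr = rng.normal(size=(2, d))

# One learned type embedding per modality, broadcast over its tokens.
type_emb = {name: rng.normal(size=(d,)) for name in ("ques", "obj", "ocr")}

# Append everything into a single input sequence for the transformer.
sequence = np.concatenate([
    question + type_emb["ques"],
    objects + type_emb["obj"],
    ocr + type_emb["ocr"],
], axis=0)
print(sequence.shape)  # (9, 8)
```

The type embedding lets downstream self-attention layers tell which modality each row came from, even though all rows share one sequence.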

The core building block of the transformer architecture is a self-attention layer followed by a feed-forward network. The self-attention layer aims at capturing the direct relationships between the different input tokens. In this section, we first briefly recap the attention computation in the multi-head self-attention layer of the transformer and highlight some issues with classical self-attention layers.

3.1 Self-Attention Layer

A self-attention (SA) layer operates on an input sequence of $N$ $d_x$-dimensional vectors $X = (x_1, \ldots, x_N) \in \mathbb{R}^{d_x \times N}$ and computes the attended sequence $\hat{X} = (\hat{x}_1, \ldots, \hat{x}_N) \in \mathbb{R}^{d_x \times N}$.

For this, self-attention employs $H$ independent attention heads and applies the attention mechanism of Bahdanau et al. [4] to its own input. Each head $h$ in a self-attention layer transforms the input sequence $X$ into query $Q^h = [q^h_1, \ldots, q^h_N] \in \mathbb{R}^{d_h \times N}$, key $K^h = [k^h_1, \ldots, k^h_N] \in \mathbb{R}^{d_h \times N}$, and value $V^h = [v^h_1, \ldots, v^h_N] \in \mathbb{R}^{d_h \times N}$ vectors via learnable linear projections parameterized by $W^h_Q, W^h_K, W^h_V \in \mathbb{R}^{d_x \times d_h}$:

$$(q^h_i, k^h_i, v^h_i) = (x_i W^h_Q, \; x_i W^h_K, \; x_i W^h_V) \quad \forall i \in [1, \ldots, N].$$

Generally, $d_h$ is set to $d_x / H$. Each attended sequence element $\hat{x}^h_i$ is then computed via a weighted sum of the value vectors, i.e.,

$$\hat{x}^h_i = \sum_{j=1}^{N} \alpha^h_{ij} \, v^h_j. \qquad (1)$$

The weight coefficient $\alpha^h_{ij}$ is computed via a Softmax over a compatibility function that compares the query vector $q^h_i$ with the key vectors of all the input tokens:

$$\alpha^h_{ij} = \frac{\exp\left(q^h_i \cdot k^h_j / \sqrt{d_h}\right)}{\sum_{j'=1}^{N} \exp\left(q^h_i \cdot k^h_{j'} / \sqrt{d_h}\right)}. \qquad (2)$$

The computation in Eq. (1) and Eq. (2) can be more compactly written as:

$$\mathrm{head}^h = A^h(Q^h, K^h, V^h) = \mathrm{Softmax}\left(\frac{Q^h (K^h)^\top}{\sqrt{d_h}}\right) V^h \quad \forall h \in [1, \ldots, H]. \qquad (3)$$

The outputs of all heads are then concatenated, followed by a linear transformation with weights $W_O \in \mathbb{R}^{(d_h \cdot H) \times d_x}$. In the case of multi-head attention, we therefore obtain the attended sequence

$$\hat{X} = \mathrm{concat}\left(\mathrm{head}^1, \ldots, \mathrm{head}^H\right) W_O. \qquad (4)$$
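As a concrete sketch, the per-head attention and the final concatenation described by Eqs. (1)-(4) can be written in a few lines of NumPy. This is an illustrative toy (rows are tokens, names and dimensions are ours), not the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """X: (N, d_x) token sequence (rows are tokens); Wq/Wk/Wv: lists of
    H projection matrices of shape (d_x, d_h); Wo: (H*d_h, d_x)."""
    heads = []
    for WQ, WK, WV in zip(Wq, Wk, Wv):
        Q, K, V = X @ WQ, X @ WK, X @ WV         # (N, d_h) each
        d_h = Q.shape[-1]
        alpha = softmax(Q @ K.T / np.sqrt(d_h))  # Eq. (2): (N, N) weights
        heads.append(alpha @ V)                  # Eq. (1): weighted value sum
    return np.concatenate(heads, axis=-1) @ Wo   # Eq. (4): concat + project

rng = np.random.default_rng(0)
N, d_x, H = 5, 8, 2
d_h = d_x // H
Wq = [rng.normal(size=(d_x, d_h)) for _ in range(H)]
Wk = [rng.normal(size=(d_x, d_h)) for _ in range(H)]
Wv = [rng.normal(size=(d_x, d_h)) for _ in range(H)]
Wo = rng.normal(size=(H * d_h, d_x))
X = rng.normal(size=(N, d_x))
print(multi_head_self_attention(X, Wq, Wk, Wv, Wo).shape)  # (5, 8)
```

Note that every row of `alpha` spans all N tokens; this fully-connected attention pattern is exactly what the spatially aware layer later restricts.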

Application to multi-modal tasks: For multi-modal tasks, the self-attention is often modified to model cross-attention from one modality $U_i$ to another modality $U_j$ as $A(Q_{U_i}, K_{U_j}, V_{U_j})$, or intra-modality attention as $A(Q_{U_i}, K_{U_i}, V_{U_i})$. Note that $U_i, U_j$ are simply sets of indices which are used to construct sub-matrices. Some architectures like M4C [14] use the classical self-attention layer to model attention between the tokens of all modalities as $A(Q_U, K_U, V_U)$, where $U = U_1 \cup U_2 \cup \cdots \cup U_M$ is the union of all $M$ input modalities.
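To make the index-set notation concrete, here is a small sketch (ours, not the authors' code) in which cross- and intra-modality attention are obtained simply by slicing the query/key/value matrices with the index sets before the usual attention computation; the split into `U1`/`U2` is hypothetical:

```python
import numpy as np

def attend(Q, K, V):
    """Scaled dot-product attention A(Q, K, V); rows are tokens."""
    d_h = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_h)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
N, d_h = 6, 4
Q, K, V = [rng.normal(size=(N, d_h)) for _ in range(3)]

U1 = [0, 1, 2]  # e.g. indices of question tokens (hypothetical split)
U2 = [3, 4, 5]  # e.g. indices of object/OCR tokens

cross = attend(Q[U1], K[U2], V[U2])  # A(Q_U1, K_U2, V_U2): U1 attends to U2
intra = attend(Q[U1], K[U1], V[U1])  # A(Q_U1, K_U1, V_U1): within U1
joint = attend(Q, K, V)              # M4C-style: U = U1 ∪ U2, all tokens
print(cross.shape, intra.shape, joint.shape)  # (3, 4) (3, 4) (6, 4)
```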

3.2 Limitations

The aforementioned self-attention layer has two limitations. (1) Self-attention layers model the global context by encoding relations between every single pair of input tokens. This disperses the attention across every input token and overlooks the importance of semantic structure in the sequence. For instance, in the case of language modeling it has proven beneficial to capture the local context [49] or the hierarchical structure of the input sentence by encoding the depth of each word in a parse tree [47]. (2) Multiple heads allow self-attention layers to jointly attend to different context in different heads. However, each head independently looks at the entire global information, and there is no explicit mechanism to ensure that different attention heads capture different context. Indeed, it has been shown that heads can be pruned away without substantially hurting a model's performance [30, 46] and that different heads learn redundant features [28].

4 Approach

To address both limitations, we extend the self-attention layer to utilize a graph over the input tokens. Instead of looking at the entire global context, an entity attends to just the neighboring entities defined by a relationship graph. Moreover, different heads consider different types of relations, which encodes different context and avoids learning redundant features. In what follows, we introduce the notation for input token representations. Next, we formally define the heterogeneous graph over tokens from multiple modalities which are connected by different edge types. Finally, we describe our approach to adapting the attention span of each head in the self-attention layer by utilizing this graph. While our framework is general and easily extensible to other tasks, we present our approach for the TextVQA task.
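A minimal sketch of this idea, assuming integer edge-type labels between tokens (our own simplification, not the released implementation): each head receives a boolean mask derived from the subset of relation types it is responsible for, and masked-out pairs get large negative logits before the softmax so each entity only attends to its graph neighbors:

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Self-attention where mask[i, j] == True iff token i may attend to j."""
    d_h = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_h)
    logits = np.where(mask, logits, -1e9)  # block non-neighboring entities
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def head_mask(edge_type, head_relations):
    """edge_type[i, j]: integer relation label between tokens i and j
    (0 = no edge). Each head only sees its own subset of relations."""
    mask = np.isin(edge_type, list(head_relations))
    np.fill_diagonal(mask, True)  # a token can always attend to itself
    return mask

rng = np.random.default_rng(2)
N, d_h = 4, 8
Q, K, V = [rng.normal(size=(N, d_h)) for _ in range(3)]
edge_type = rng.integers(0, 5, size=(N, N))  # toy spatial-graph labels

# Two heads responsible for disjoint relation subsets.
out_h1 = masked_attention(Q, K, V, head_mask(edge_type, {1, 2}))
out_h2 = masked_attention(Q, K, V, head_mask(edge_type, {3, 4}))
print(out_h1.shape, out_h2.shape)  # (4, 8) (4, 8)
```

Because each head's mask admits a different relation subset, the heads are forced to attend to different local contexts rather than the same global one.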

4.1 Graph Over Input Tokens

Let us define a directed cyclic heterogeneous graph $G = (X, E)$ where each node corresponds to an input token $x_i \in X$, and $E$ is the set of all edges $e_{i \to j}$, $\forall x_i, x_j \in X$.

Additionally, we define a mapping function $\Phi_x : X \to T_x$ that maps a node $x_i \in X$ to one of the modalities. Consequently, the number of node types is equal to the number of input modalities, i.e., $|T_x| = M$. We also define a mapping function $\Phi_e : E \to T_e$ that maps an edge $e_{i \to j} \in E$ to a relationship type $t_l \in T_e$. We represent the question as a set of tokens, i.e., $X^{ques} = \{x \in X : \Phi_x(x) = ques\}$. The visual content in the image is represented via a list of object region features $X^{obj} = \{x \in X : \Phi_x(x) = obj\}$. Similarly, the list of OCR tokens present in the image is referred to as $X^{ocr} = \{x \in X : \Phi_x(x) = ocr\}$. Following M4C, the model decodes a multi-word answer $Y^{ans} = (y^{ans}_1, \ldots, y^{ans}_T)$ over $T$ time-steps.

Spatial Relationship Graph: Answering questions about text in the image involves reasoning about the spatial relations between the various OCR tokens and objects present in the image. For instance, the question "What is the letter on the player's hat?" requires first detecting a hat in the image and then reasoning about the 'contains' relationship between the letter and the player's hat.
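For intuition, rule-based edge labels of this kind can be derived from bounding-box geometry. The toy classifier below is a rough approximation we wrote for illustration only; the paper's actual mapping follows the rules of Yao et al. [50]:

```python
import math

def spatial_relation(box_a, box_b):
    """Classify box_b's relation to box_a; boxes are (x1, y1, x2, y2).
    A toy approximation for intuition only -- the paper's actual edge
    labels follow the rules of Yao et al. [50]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    if ax1 <= bx1 and ay1 <= by1 and ax2 >= bx2 and ay2 >= by2:
        return "contains"
    if bx1 <= ax1 and by1 <= ay1 and bx2 >= ax2 and by2 >= ay2:
        return "inside"
    # Otherwise, bucket the angle between box centers into 8 directions.
    acx, acy = (ax1 + ax2) / 2, (ay1 + ay2) / 2
    bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
    angle = math.degrees(math.atan2(bcy - acy, bcx - acx)) % 360
    directions = ["right", "lower-right", "below", "lower-left",
                  "left", "upper-left", "above", "upper-right"]
    return directions[int(((angle + 22.5) % 360) // 45)]

print(spatial_relation((0, 0, 10, 10), (2, 2, 4, 4)))  # contains
print(spatial_relation((0, 0, 2, 2), (8, 0, 10, 2)))   # right
```

For a question like "What is the letter on the player's hat?", a `contains` edge between the hat box and the letter's OCR box is precisely the relation the model needs to attend over.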

Fig. 2: In our spatially aware self-attention layer, object and OCR tokens attend to each other based on a subset of spatial relations $T_h \subseteq T^{spa}$. They also attend to the question tokens via the $t_{imp}$ relation. Input tokens $x \in X$ do not attend to answer tokens $y^{ans} \in Y$, while $y^{ans}$ can attend to tokens in $X$ as well as to previous answer tokens.

To encode these spatial relationships between all the objects $X^{obj}$ and OCR tokens $X^{ocr}$ present in the image, i.e., all the regions $r \in R = X^{obj} \cup X^{ocr}$, we construct a spatial graph $G^{spa} = (R, E^{spa})$ with nodes corresponding to the union of all objects and OCR tokens. The mapping function $\Phi_{spa} : E^{spa} \to T^{spa}$ assigns a spatial relationship $t_l \in T^{spa}$ to an edge $e = (r_i, r_j) \in E^{spa}$. The mapping function utilizes the rules introduced by Yao et al. [50], which we illustrate in Fig. 2(a). We use a total of twelve types of spatial relations (e.g.,