Spatially Aware Multimodal Transformers for TextVQA

Yash Kant
Dhruv Batra
Peter Anderson
A. Schwing
Devi Parikh
Jiasen Lu
Harsh Agrawal
ECCV
2020
View in Semantic Scholar

Abstract

Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity only looks at neighboring entities defined by a spatial graph. Further, each head in our multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing the attention amongst all visual entities; (2) we avoid learning redundant features. We show that our model improves the absolute accuracy of current state-of-the-art methods on TextVQA by 2.2% overall over an improved baseline, and 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly on ST-VQA, we improve the absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding.

1 Introduction

The promise of assisting visually-impaired users gives us a compelling reason to study Visual Question Answering (VQA) [3] tasks. A dominant class of questions (∼20%) asked by visually-impaired users on images of their surroundings involves reading text in the image [5] . Naturally, the ability to reason about text in the image to answer questions such as "Is this medicine going to expire?", "Where is this bus going?" is of paramount importance for these systems. To benchmark a model's capability to reason about text in the image, new datasets [7, 32, 38] have been introduced for the task of Text Visual Question Answering (TextVQA).

Answering questions involving text in an image often requires reasoning about the relative spatial positions of objects and text. For instance, many questions such as "What is written on the player's jersey?" or "What is the next destination for compared to previous approaches [14, 38] , our model can reason about spatial relations between visual entities to answer questions correctly. (b) We construct a spatial-graph that encodes different spatial relationships between a pair of visual entities and use it to guide the self-attention layers present in multi-modal transformer architectures.

the bus?" ask about text associated with a particular visual object. Similarly, the question asked in Fig. 1 , "What sponsor is to the right of the players?", explicitly asks the answerer to look to the right of the players. Unsurprisingly, ∼13% of the questions in the TextVQA dataset use one or more spatial prepositions 4 .

Fig. 1: Qualitative Examples: The figure shows the output of M4C and our method on several image-question pairs. Bold and italics text denote words chosen from OCR tokens, otherwise it was chosen from the vocabulary. The VQA score for each prediction is mentioned inside parenthesis.

Existing methods for TextVQA reason jointly over 3 modalities -the input question, the visual content and the text in the image. LoRRA [38] uses an off-the-shelf Optical Character Recognition (OCR) system [9] to detect OCR tokens and extends previous VQA models [2] to select single OCR tokens from the images as answers. The more recently proposed Multimodal Multi-Copy Mesh (M4C) model [14] captures intra-and inter-modality interactions over all the inputs -question words, visual objects and OCR tokens -by using a multimodal transformer architecture that iteratively decodes the answer by choosing words from either the OCR tokens or some fixed vocabulary. The superior performance of M4C is attributed to the use of multi-head self-attention layers [44] which has become the defacto standard for modeling vision and language tasks [10, 27, 28, 39, 43] .

While these approaches take advantage of detected text, they are limited in how they use spatial relations. For instance, LoRRA [38] does not use any location information while M4C [14] merely encodes the absolute location of objects and text as input to the model. By default, self-attention layers are fullyconnected, dispersing attention across the entire global context and disregarding the importance of the local context around a certain object or text. As a result, in existing models the onus is on them to implicitly learn to reason about the relative spatial relations between objects and text. In contrast, in the Natural Language Processing community, it has proven beneficial to explicitly encode semantic structure between input tokens [41, 47, 49] . Moreover, while multiple independent heads in self-attention layers model different context, each head independently looks at the same global context and learns redundant features [28] that can be pruned away without substantially harming a model's performance [30, 46] .

We address the above limitations by proposing a novel spatially aware selfattention layer for multimodal transformers. First, we follow [23, 50] to build a spatial graph to represent relative spatial relations between all visual entities, i.e., all objects and OCR tokens. We then use this spatial graph to guide the self-attention layers in the multimodal transformer. We modify the attention computation in each head such that each entity attends to just the neighboring entities as defined by the spatial graph, and we restrict each head to only look at a subset of relations which prevents learning of redundant features.

Empirically, we evaluate the efficacy of our proposed approach on the challenging TextVQA [38] and Scene-Text VQA (ST-VQA) [7] datasets. We first improve the absolute accuracy of the baseline M4C model on TextVQA by 3.4% with improved features and hyperparameter optimization. We then show that replacing the fully-connected self-attention layers in the M4C model with our spatially aware self-attention layers improves absolute accuracy by a further 2.2% (or 4.62% for the ∼14% of TextVQA questions that include spatial prepositions and has a majority answer in OCR tokens). On ST-VQA our final model achieves an absolute 4.2% improvement in Average Normalized Levenshtein Similarity (ANLS). Finally, we show that our model is more visually grounded as it picks the correct answer from the list of OCR tokens 8.8% more often than M4C.

2 Related Work

Models for TextVQA: Several datasets and methods [7, 14, 32, 32, 38] have been proposed for the TextVQA task -i.e., answering questions which require models to explicitly reason about text present in the image. LoRRA [38] extends Pythia [15] with an OCR attention branch to reason over a combined list of answers from a static vocabulary and detected OCR tokens. Several other models have taken similar approaches to augmenting existing VQA models with OCR inputs [6, 7, 32] . Building on the success of transformers [44] and BERT [11] , the Multimodal Multi-Copy Mesh (M4C) model [14] (which serves as our baseline) uses a multimodal transformer to jointly encode the question, image and text and employs an auto-regressive decoding mechanism to perform multi-step answer decoding. However, these methods are limited in how they leverage the relative spatial relations between visual entities such as objects and OCR tokens. Specifically, early models [6, 7, 32] proposed for the TextVQA task did not encode any explicit spatial information while M4C [14] simply adds a location embedding of the absolute location to the input feature. We improve the performance of these models by proposing a general framework to effectively utilize the relative spatial structure between visual entities within the transformer architecture. Multimodal representation learning for Vision and Language: Recently, several general architectures for vision and language [10, 14, 22, 24, 27, 28, 33, 35, 39, 42, 43, 52] were proposed that reduce architectural differences across tasks. These models (including M4C) typically fuse vision and language modalities by applying either self-attention [4] or co-attention [29] mechanisms to capture intra-and inter-modality interactions. They achieve superior performance on many vision and language tasks due to their strong representation power and their ability to pre-train visual grounding in a self-supervised manner. Similar to M4C, these methods add a location embedding to their inputs, but do not explicitly encode relative spatial information (which is crucial for visual reasoning). Our work takes the first step towards modeling relative spatial locations within the multimodal transformer architecture.

Leveraging Explicit Relationships For Visual Reasoning:

Prior work has used Graph Convolutional Nets (GCN) [19] and Graph Attention Networks (GAT) [45] to leverage explicit relations for image captioning [50] and VQA [23] . Both these methods construct a spatial and semantic graph to relate different objects. Although our relative spatial relations are inspired from [23, 50] , our encoding differs greatly. First, [23, 50] looks at all the spatial relations in every attention head, whereas each self attention head in our model looks at different subset of the relations, i.e., each head is only responsible for a certain number of relations. This important distinction prevents spreading of attention over the entire global context and reduces redundancy amongst multiple heads.

Context Aware Transformers For Language Modeling:

Related to the use of spatial structure for visual reasoning tasks, there has been a body of work on modeling the underlying structure in input sequences for language modeling tasks. Previous approaches have considered encoding the relative position difference between sentence tokens [37] as well as encoding the depth of each word in a parse tree and the distance between word pairs [47] . Other approaches learn to adapt the attention span for each attention head [41, 49] , rather than explicitly modeling context for attention. While these methods work well for sequential input like natural language sentences, they cannot be directly applied to our task since our visual representations are non-sequential.

3 Background: Multimodal Transformers

Following the success of transformer [44] and BERT [11] based architectures on language modeling and sequence-to-sequence tasks, multi-modal transformerstyle models [10, 14, 22, 24, 27, 28, 33, 35, 39, 42, 43, 52] have shown impressive results on several vision-and-language tasks. Instead of using a single input modality (i.e., text), multiple modalities are encoded as a sequence of input tokens and appended together to form a single input sequence. Additionally, a type embedding unique to each modality is added to distinguish amongst input token of different modalities.

The core building block of the transformer architecture is a self-attention layer followed by a feed-forward network. The self-attention layer aims at capturing the direct relationships between the different input tokens. In this section, we first briefly recap the attention computation in the multi-head self-attention layer of the transformer and highlight some issues with classical self-attention layers.

3.1 Self-Attention Layer

A self-attention (SA) layer operates on an input sequence represented by

N d x -dimensional vectors X = (x 1 , . . . , x N ) ∈ R dx×N and computes the attended sequenceX = ( x 1 , . . . , x N ) ∈ R dx×N .

For this, self-attention employs h independent attention heads and applies the attention mechanism of Bahdanau et al. [4] to its own input. Each head in a self-attention layer transforms the input sequence

X into query Q h = [q h 1 , . . . , q h N ] ∈ R d h ×N , key K h = [k h 1 , . . . , k h N ] ∈ R d h ×N , and value V = [v h 1 , . . . , v h N ] ∈ R d h ×N vectors via learnable linear projections parameterized by W h Q , W h K , W h V ∈ R dx×d h : (q h i , k h i , v h i ) = (x i W h Q , x i W h K , x i W h V ) ∀i ∈ [1, . . . , N ].

Generally, d h is set to d x /H. Each attended sequence element x h i is then computed via a weighted sum of value vectors, i.e.,

EQUATION (1): Not extracted; please refer to original document.

The weight coefficient α h ij is computed via a Softmax over a compatibility function that compares the query vector q h i with key vectors of all the input tokens

EQUATION (2): Not extracted; please refer to original document.

The computation in Eq. (1) and Eq. (2) can be more compactly written as:

head h = A h (Q h , K h , V h ) = Softmax Q h (K h ) T √ d h V h ∀h = [1, . . . , H]. (3)

The output of all heads are then concatenated followed by a linear transformation with weights W O ∈ R (d h •H)×dx . Therefore, in the case of multi-head attention, we obtain the attended sequenceX = (

EQUATION (4): Not extracted; please refer to original document.

Application to multi-modal tasks: For multi-modal tasks, the self-attention is often modified to model cross-attention from one modality

U i to another modality U j as A(Q Ui , K Uj , V Uj ) or intra-modality attention A(Q Ui , K Ui , V Ui ).

Note, U i , U j are simply sets of indices which are used to construct sub-matrices. Some architectures like M4C [14] use the classical self-attention layer to model attention between tokens of all the modalities as

A(Q U , K U , V U ) where U = U 1 ∪ U 2 ∪ • • • ∪ U M

is the union of all M input modalities.

3.2 Limitations

The aforementioned self-attention layer exposes two limitations: (1) self-attention layers model the global context by encoding relations between every single pair of input tokens. This disperses the attention across every input token and overlooks the importance of semantic structure in the sequence. For instance, in the case of language modeling, it has proven beneficial to capture local-context [49] or the hierarchical structure of the input sentence by encoding the depth of each word in a parsing tree [47] , (2) multiple heads allow self-attention layers to jointly attend to different context in different heads. However, each head independently looks at the entire global information and there is no explicit mechanism to ensure that different attention heads capture different context. Indeed, it has been shown that the heads can be pruned away without substantially hurting a model's performance [30, 46] and that different heads learn redundant features [28] .

4 Approach

To address both limitations, we extend the self-attention layer to utilize a graph over the input tokens. Instead of looking at the entire global context, an entity attends to just the neighboring entities as defined by a relationship graph. Moreover, heads consider different types of relations which encodes different context and avoids learning redundant features. In what follows, we introduce the notation for input token representations. Next, we formally define the heterogeneous graph over tokens from multiple modalities which are connected by different edge types. Finally, we describe our approach to adapt the attention span of each head in the self-attention layer by utilizing this graph. While our framework is general and easily extensible to other tasks, we present our approach for the TextVQA task.

4.1 Graph Over Input Tokens

Let us define a directed cyclic heterogeneous graph G = (X, E) where each node corresponds to an input token x i ∈ X. E is a set of all edges e i→j , ∀x i , x j ∈ X.

Additionally, we define a mapping function Φ x : X → T x that maps a node x i ∈ X to one of the modalities. Consequently the number of node types is equal to the number of input modalities, i.e., |T x | = M . We also define a mapping function Φ e : E → T e that maps an edge e i→j ∈ E to a relationship type t l ∈ T e . We represent the question as a set of tokens, i.e., X ques = {x ∈ X : Φ x (x) = ques}. The visual content in the image is represented via a list of object region features X obj = {x ∈ X : Φ x (x) = obj}. Similarly, the list of OCR tokens present in the image is referred to as X ocr = {x ∈ X : Φ x (x) = ocr}. Following M4C, the model decodes multi-word answer Y ans = (y ans 1 , . . . , y ans T ) for T time-steps. Spatial Relationship Graph: Answering questions about text in the image involves reasoning about the spatial relations between various OCR tokens and objects present in the image. For instance, the question "What is the letter on the player's hat?" requires to first detect a hat in the image and then reason about the 'contains' relationship between the letter and the player's hat.

To encode these spatial relationships between all the objects X obj and OCR tokens X ocr present in the image, i.e., all the regions r ∈ R = X obj ∪ X ocr , we construct a spatial graph G spa = (R, E spa ) with nodes corresponding to the union of all objects and OCR tokens. The mapping function Φ spa : E spa → T spa In our spatially aware self-attention layer, object and OCR tokens attend to each other based on a subset of spatial relations T h ⊆ T spa . They also attend to question tokens via timp relation. Any input token x ∈ X do not attend to answer token y ans ∈ Y while y ans can attend to tokens in X as well as previous answer tokens y ans

assigns a spatial relationship t l ∈ T spa to an edge e = (r i , r j ) ∈ E spa . The mapping function utilizes the rules introduced by Yao et al. [50] which we illustrate in Fig. 2(a) . We use a total of twelve types of spatial relations (e.g.,

Fig. 2: Qualitative Examples: The figure shows the output of M4C and our method on several image-question pairs. Bold and italics text denote words chosen from OCR tokens, otherwise it was chosen from the vocabulary. The VQA score for each prediction is mentioned with parenthesis.

r i − contains − r j , r i − is-inside − r j as well as a 'self-relation').

Note that G spa is a symmetric directed graph, i.e., for every edge e i→j there is a reverse edge e j→i .

Implicit Relationship Between Objects, Ocr And Question Tokens:

For the TextVQA task, different types of spatial relations might be useful for different question types. For instance, a question asking about 'what is written on the player's jersey' might focus on the contains relationship, whereas a question asking about 'what sponsor is to the right of the player' might utilize the right relationship. Thus, to inject semantic information from the question into the object and OCR representation, we allow object and OCR tokens to attend to question tokens. In our general framework, we accomplish this via a bipartite graph G imp (R, X ques , E imp ) connecting all the object and OCR tokens r i ∈ R to all question tokens x j ∈ X ques via an implicit edge e i→j of type t imp . Thus, by attending to question tokens, each object and OCR token learns to implicitly incorporate useful semantic information from the question into its representation.

4.2 Spatially Aware Self-Attention Layer

As mentioned in Section 3, attention in a single-head h of a self-attention layer can be computed by the compatibility function defined in Equation 2. The compatibility function computes a similarity between a query q h i corresponding to input x i , and the key vector k h j of input token x j . Within a single head we want attention to only look at relevant tokens. We model it by allowing each head to focus on only a subset of edge types T h ⊆ T e . In other words, we want each token x i to only focus on tokens x j when they are connected via an edge e i→j of type Φ e (e i→j ) ∈ T h . In the context of TextVQA, we use the combination of two graphs, G spa ∪G imp , defined over tokens from all the input data modalities x ∈ X. The subset of relations T h each head h attends to is subset of c spatial relationships between (x i , x j ) and one implicit relationship between question and image tokens, i.e.,

T h = {t imp , t h , t h+1 , • • • , t (h+c) mod |T spa | }, t ∈ T e = T spa ∪ t imp .

When c > 1, multiple heads are aware of a given spatial relationship and we are encouraging the models to jointly attend to information from different representation subspaces [44] . When c = 1, each head only focuses on a one type of spatial relationship. Similarly when c = |T spa | + 1, each head attends to all the spatial relationships as well as the implicit relationship t imp . We empiricaly observed that c = 2 works best for our setting.

As illustrated in Fig. 3 , to weigh the attention in each head based on the subset of spatial relationships T h , we introduce a bias term defined as

Fig. 3: Qualitative Examples: The figure shows the output of M4C and our method on several image-question pairs. Bold and italics text denote words chosen from OCR tokens, otherwise it was chosen from the vocabulary. The VQA score for each prediction is mentioned with parenthesis.

EQUATION (5): Not extracted; please refer to original document.

to modify the computation of the attention weights α h ij over different tokens. Specifically, we compute attention weights as follows 5 :

EQUATION (6): Not extracted; please refer to original document.

Intuitively, as illustrated in Fig. 3 if there is no edge e i→j of type t l ∈ T h between nodes x i and x j , then the compatibility score q h i (k h j ) T + b h i,j is negative infinity and the attention weights α h ij become zero. Otherwise, the attention weights can be modulated based on the specific edge type t l = Φ e (e i→j ) by learning a bias term for each edge type β h t l ∈ {β h t1 , . . . , β h |T e | }. Alternatively, we can set β h t l to zero if we do not want to modulate attention based on the edge type between a pair of tokens. Classical self-attention layers described in Section 3.1 are hence a special case which is obtained when |T e | = 1 and G is a fully connected graph.

Specifically, for TextVQA, as illustrated in Fig. 2(b) , all object and OCR tokens attend to each other based on that subset of relations T h that a head is responsible for. Since we want the representations of object and OCR tokens to contain information about the question, all object and OCR tokens attend to all question tokens via the edge of type t imp . For simplicity, we don't learn this relation-dependent bias and run all our experiments with β h t l set to zero. Importantly, our graph-aware self attention layer overcomes the aforementioned two limitations of classical self-attention layers. First, each head is able to focus on a subset of relations T h ⊆ T e . Consequently, the attention is not distributed over the entire sequence of tokens and each token gathers information from only a specific subset of tokens. Second, we are forcing each head to look at a different context which prevents the heads from learning redundant features.

Causal Attention For Answer Tokens:

During decoding, the M4C model generates answer tokens y ans t ∀t one step at a time. Inspired by the success of several text-to-text models [12, 26] , the M4C architecture uses a causal attention mask where y ans t attends to all question, image, and OCR tokens x ∈ X along with entries in the answer y ans [14, 38] to generate the answer tokens. During decoding, at each step the model transforms the predicted token from the previous step to a d-dimensional vector z t . We use z t to compute similarity with all OCR-tokens and vocabulary words and pick the the most similar one. We iteratively decode the answer over 12 time steps.

4.3 Implementation Details

Following M4C [14] , the input to our multimodal transformer consists of three different modalities -1) 20 Question tokens, 2) 100 Object tokens, and 3) 50 OCR tokens. Below, we briefly describe the construction of each of the modal features. For a detailed discussion we refer the reader to M4C [14] .

Question Features:

We encode the question text using three layers of a BERT [11] model pre-trained on English Wikipedia and Book-Corpus [53] datasets. We finetune this model during training.

Object Features:

We encode the object regions by extracting features from a ResNeXT-152 [48] based Faster R-CNN model [36] trained on Visual Genome [20] with attribute loss. We then add an absolute-location embedding to these features by using the bounding box coordinates.

OCR Features: Similarly, for OCR, we extract region features using the same object detector and we append an embedding obtained from FastText [8] and PHOC features [1] of the ocr-text. We also add an absolute-location embedding by using the bounding box coordinates of the OCR token similar to object features.

5 Experiments

We evaluate our model on the TextVQA dataset [38] and the ST-VQA dataset [7] . Our model outperforms previous work by a significant margin and sets the new state-of-the-art on both datasets.

5.1 Evaluation On Textvqa Dataset

The TextVQA dataset [38] contains 28,408 images from the Open Images dataset [21] , with human-written questions asking about text in the image. Following VQAv2 [3] , each question in the TextVQA dataset has 10 free-response answers, and the final accuracy is measured via soft voting of the 10 answers (VQA Accuracy). Following the M4C model [14] , we collect the top 5000 frequent words from the answers in the training set as our answer vocabulary. We compare our method with the recent proposed LoRRA [38] , M4C [14] and 2019 TextVQA Challenge leaderboard entires [25, 40] . Improved M4C baseline (M4C † ). To establish a strong baseline we further improve M4C by replacing the Rosetta-en OCR system with the Google OCR system 6 which we qualitatively find to be more accurate, detecting text with higher recall and having fewer spelling errors. This improves the performance from 39.4% to 41.8% (Rows 4 and 6 in Table 1 ). Next, we replace the ResNet-101 [13] backbone of the Faster R-CNN [36] feature-extractor with a ResNeXt-151 backbone [48] as recommended by [28] . This further improves the performance from 41.8% to 42.0 % (Rows 6 and 7). Finally, we add two additional transformer layers (Row 8), jointly train M4C on ST-VQA [7] (Row 9) and use beam search decoding (Row 10) to establish the final improved baselines 43.8% and 42.4% on validation and test set respectively. Our Results (SA-M4C). Our model consist of 2 normal self-attention layers and 4 spatially aware self-attention layers (2N¡4S). As shown in Table. 1 Row 13, our model is 2.2% (absolute) higher than it's counterpart in Row 10 and 4.4% better than the baseline M4C model (Row 5). Note that the improved M4C model in Row 10 and our method use the same input features, equal number of transformer layers and have the same number of parameters. Next, we perform model ablations to analyze the source of the gains in our method. Model structure ablations. We answer the question, "How many spatially aware self-attention layers are helpful?", by incrementally replacing the selfattention layers in M4C with the proposed spatially aware self-attention layers. Table. 2 Row 1, 2 and 3 show that the performance improves as we replace normal self-attention layers with spatially aware self-attention. We achieve the best performance after replacing 4 out of 6 self-attention layers (43.19% vs. 43.80% ). It's important to note that keeping the bottom self-attention layer is critical to model attention across modalities since attention for question tokens are masked in spatially aware self-attention. Table 1 : Results on TextVQA [38] dataset. We compare our model (rows 11-13) against the prior works (row 1-5) and the improved baselines (rows 6-10). † Indicates our ablations for improved baseline.

Table 1: Results on TextVQA [38] dataset. We compare our model (rows 11-13) against the prior works (row 1-5) and the improved baselines (rows 6-10). † Indicates our ablations for improved baseline. †† Indicates the best model from improved baseline.

† † Indicates the best model from improved baseline. Span of spatially aware self-attention head. Recall that the context-size parameters (c) is the number of relationships |T h | each attention head looks at, and controls the sparsity of each head in spatially aware self-attention. When c > 1, multiple heads are aware of a given spatial relationships which jointly attend information from different representation subspaces. Sweeping over the context-size (c) we find that c = 2 works the best (Row 7 in Table 2 ).

Table 2. Not extracted; please refer to original document.

Method

Comparing with other methods that induce sparsity into transformer. We further compare our approach with other formulations [23, 51] that induce sparsity in the Transformer architectures as well as randomly mask attention heads. We describe each setting as follows.

-Random masking (M4C-Random). We randomly initialize a spatial graph by assigning an edge of a given type between two nodes including a no-edge with equal probability. We use this graph as input to our spatially aware selfattention layer. Through this comparison, we want to establish the importance of spatial graph induced sparsity vs random sparsity in self-attention layers. We report this baseline by averaging across 5 different seeds. -Top-k Attention (M4C-Top-k). Instead of masking the attention weights based on a graph, we explicitly select the top-k attention weights and mask the rest [51] . We use k = 9 which corresponds to inducing the same level of sparsity as our baseline model. This helps to establish the need of guiding the attention based on spatial relationships. -Graph Attention (M4C-ReGAT). We implement ReGAT-based attention layer Li et al. [23] which endows Graph Attention Network [45] encoding with spatial information by adding a bias term specific to each relation. 7 . The goal is to establish the improvements of our spatially aware self-attention layer compared to prior work. Table. 2 Row 4 -Row 7 show these comparisons. We observe that random masking decreases the performance on TextVQA dataset by 1.2% which verifies the importance of a correct spatial relationship graph. By selecting the top-k connections (M4C-Top-9), we observe an improvement of 0.66% compared to M4C-Random. However, M4C-Top-9 still underperforms compared to our proposed SA-M4C model by 0.64%. Similarly, our proposed SA-M4C model outperforms the graph attention version (M4C-ReGAT) by 0.7%.

5.2 Evaluation On St-Vqa

We also report results on the ST-VQA [7] dataset which is another recently proposed dataset for the TextVQA task. ST-VQA contains 18,921 training and 2,971 test images sourced from several datasets [5, 16, 17, 20, 31] . Following M4C, we report results on the Open Dictionary (Task-3) as it matches the TextVQA setting where no answer candidates are provided at test time.

The ST-VQA dataset adopts Average Normalized Levenshtein Similarity (ANLS) defined as 1−d L (a pred , a gt )/ max(|a pred |, |a gt |) averaged over all questions. a pred and a gt refer to prediction and ground-truth answers respectively while d L is edit distance. The metric truncates scores lower than 0.5 to 0 before averaging. We use both VQA accuracy and ANLS as the evaluation metric to facilitate comparison with prior work.

For training and validation on ST-VQA we use the same splits used by M4C [14] generated by randomly selecting 17,028 images for training and the remaining 1,893 for validation. We train the improved baseline model and our best model (spatially aware self-attention) on ST-VQA and report results in Table 3 . Following prior works [7, 14] we show VQA Accuracy and ANLS both on validation set and only the latter on the test set. On the validation set our improved baseline achieves an accuracy of 40.71% and an ANLS of 0.499 improving by 2.66% and 0.027 absolute. Further, the final model with spatially aware self-attention layers achieves an accuracy of 42.23% and an ANLS of 0.512 improving by 1.52% and 0.013 in absolute gains on the validation set. On the test set, our best model achieves state-of-the-art performance of of 0.504 ANLS.

6 Analysis

Spatial reasoning: We look at the source of improvements in our model both quantitatively and qualitatively. First, we look at the performance of our model on subset of questions from TextVQA validation dataset that involve spatial reasoning. For this, we carefully curate a list of spatial-prepositions (see Supplementary for detail), and filter questions based on occurrence of one or more of these spatial-prepositions. After applying this filter, We observe that ∼ 14% of the questions (709/5000) are retained. On this subset D spa , our model perform 2.83% better than M4C. Since, OCR tokens can answer only ∼ 65% of the questions in the validation set, we also look at the subset of questions that require spatial reasoning and has a majority answer in OCR tokens. On this subset D spa+ocr (409/5000 questions), our model performs 4.62% better than M4C.

Visual Grounding:

As a proxy to analyze visual grounding of our model, we look at instances in which models predict the answer using the list of OCR tokens without relying on the vocabulary. Our model picks an answer from the list of OCR tokens on 368/701 questions from the D spa subset, and achieves 52.85% accuracy. This greatly improves the performance over M4C which only achieves 44.05% accuracy on a similar number (398/709) of questions that were answered using OCR tokens. The increase in performance is similar on D spa+ocr where we achieve a score of 67.95% on 260/401 questions compared to 59.27% achieved by M4C over 273/401 questions.

Qualitative Analysis: In Fig. 4 , we can qualitatively see how our models can reason about relative positions of object and text in the image. Our model picks the correct answer in Fig. 4(a, b, f, g ) by reasoning about relations like 'right', 'top-left'. Fig. 4(c) shows another examples where our model can reason about spatial relations between object ('green square') and text('lime'). We can also see several instances in Fig. 4(i, j, k, l) where based on the type of spatial relationship mentioned in the question, our model changes the answer.

Fig. 4: The figure shows examples where we flipped the spatial relation in the original question to see whether the models change their answers. We observe that our spatially aware multimodal transformer correctly reasons about the spatial relationships mentioned in the question and predict the answer more accurately than M4C. Green text denote correct predictions. Red text denote incorrect predictions while orange text denote partially correct answers.

Potential Sources Of Error:

While our model improves multi-modal transformer models by encoding spatial relationships, we are still far away from human baseline. Our models are not robust to spelling mistakes in the OCR tokens Fig. 4(h) . As the models become more visually grounded, they rely on the OCR tokens more often to answer the question. This reduces their ability to learn to pick the right spelling from the static vocabulary in case that word is present. Secondly, these models have trouble generating the stop condition during decoding. As we can see in Fig. 4(d) , our model predicts 'stop global warming' as the answer whereas the correct answer is 'stop.' Finally, while our model can encode relative spatial relationships, for reasoning about absolute positions in the image, our model can benefit from stronger cues about absolute locations.

7 Conclusion

We developed a spatially aware self-attention layer that encodes different types of relations between input entities via a graph. In our proposed method, each input entity only looks at neighboring entities as defined by a spatial graph. This allows each input to focus on a local context instead of dispersing attention amongst all other entities. Each head also focuses on a different subset of the spatial relations which avoids learning redundant features. We apply our general framework on the task of TextVQA by constructing a spatial graph between object and OCR tokens and utilizing it in the spatially aware self-attention layers. We found this graph-based attention to significantly improve results achieving state-of-the-art performance on the TextVQA and ST-VQA dataset. Finally, we present our analysis showing how our method improves visual grounding. Fig. 1 : Qualitative Examples: The figure shows the output of M4C and our method on several image-question pairs. Bold and italics text denote words chosen from OCR tokens, otherwise it was chosen from the vocabulary. The VQA score for each prediction is mentioned inside parenthesis. Fig. 2 : Qualitative Examples: The figure shows the output of M4C and our method on several image-question pairs. Bold and italics text denote words chosen from OCR tokens, otherwise it was chosen from the vocabulary. The VQA score for each prediction is mentioned with parenthesis. Fig. 3 : Qualitative Examples: The figure shows the output of M4C and our method on several image-question pairs. Bold and italics text denote words chosen from OCR tokens, otherwise it was chosen from the vocabulary. The VQA score for each prediction is mentioned with parenthesis. Fig. 4 : The figure shows examples where we flipped the spatial relation in the original question to see whether the models change their answers. We observe that our spatially aware multimodal transformer correctly reasons about the spatial relationships mentioned in the question and predict the answer more accurately than M4C. Green text denote correct predictions. Red text denote incorrect predictions while orange text denote partially correct answers.

We use several prepositions such as 'right', 'top', 'contains', etc. to filter questions that involve spatial reasoning.

If for a given xi, Φe(ei→j) / ∈ T h , ∀j ∈ [1, N ], then we explicitly set α h i,j = 0,.

https://cloud.google.com/products/ai/

We use the code released by the authors https://github.com/linjieli222/VQA_ ReGAT/