Abstract
A special purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation, such as adding an output head for each new task or dataset. In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text. The system supports a wide range of vision tasks such as classification, localization, question answering, captioning, and more. We evaluate the system's ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting.
1. Introduction
Since machine learning revolutionized face recognition [53] and face detection [48, 51] , the computer vision field has produced fantastic special purpose learning-based systems for everything from image synthesis to human pose estimation. Yet these systems do not in aggregate come close to the ability of human generalists to learn and perform virtually any task afforded by our senses and motor controls. To broaden the applicability and effectiveness of computer vision, we need to make progress in creating more general purpose vision systems that continually learn and perform a broad range of tasks.
In modern computer vision, tasks are specified through the system architecture. For example, in digit classification, neural network architectures accept an image as input and produce ten outputs, each representing the probability of one of ten digit classes. In such systems, adding more tasks, such as new types of predictions or even new datasets, often involves expanding the architecture with additional output heads. Several multi-task vision systems capable of performing classical computer vision tasks like detection, segmentation, and depth estimation have been proposed, e.g. [63, 11, 27, 16]. More recent works [35, 36, 5] extend multi-task training to vision-language tasks, including retrieval. While such multi-task systems provide evidence that training across a suite of datasets yields impressive results on the individual ones, the tasks they perform must be itemized and defined in advance of training. Such systems are special purpose learners: designed, trained, and limited to specific predetermined tasks.
Figure 1: We propose a general purpose vision system, GPV-I, that takes an image and a natural language task description and outputs bounding boxes, confidences, and text. GPV-I can be trained end-to-end on any task that demands a box or text output without any architecture modifications such as adding a new task head. More importantly, GPV-I is a step towards building and evaluating general purpose learning systems. Results correspond to a model trained to perform VQA, captioning, localization, and classification tasks. Star indicates the output modality supervised during training for each task.
In this paper, we propose a task-agnostic architecture, GPV-I, that can perform vision-language tasks not prescribed by the architecture design or outputs. Instead, our system takes as input an image along with a piece of text that defines the task. Given the task, the system outputs relevant bounding boxes, confidences, and text. A user can input an image and query the system with a variety of requests such as "How many cars are in the image?", "Locate all the sedans", "What make is the blue car?", and "Describe the image", each request eliciting a different response using a combination of output modalities. We show that this system outperforms and out-generalizes special purpose architectures that learn to perform each task individually, because it can re-use the same encoders and decoders across tasks.
Our work is a small step toward general purpose vision. Ultimately, a general purpose system should achieve three forms of generality:
• Generality of architecture: can learn and perform any task within a broad domain without change to network structure (e.g., learn to classify bird species, without adding new output heads, by re-using the ability to encode images, interpret the task from text, and produce words).
• Generality of concepts across skills: can perform tasks in skill-concept combinations not seen during training (e.g., localize "muskrat" after learning to answer questions about "muskrats").
• Generality of learning: can learn new tasks sample-efficiently with minimal loss to performance on previously learned tasks.
To test generality of architecture, we train and evaluate our system's ability to perform visual question answering (VQA), captioning, object classification, and object localization on the COCO dataset [33], as well as extend to additional tasks, without any modification of architecture or task-specific outputs. To test generality of concepts across skills, we present a new split of the COCO images and corresponding task annotations, called COCO-SCE (Skill-Concept Evaluation), in which some words are never mentioned in VQA or caption training annotations and others are excluded from classification and localization; we then test on the examples with the held-out words. To test generality of learning, we fine-tune our system on a referring expressions task and measure its learning curve and the extent of forgetting of previously learned tasks. Our results demonstrate types of generality not shown in previous systems, but also point to major challenges to be addressed in future work.
In summary our main contributions include: (1) First task-agnostic vision-language architecture for learning and performing classification, grounding, question answering, captioning, and other tasks that involve image, text and pointing (via bounding boxes) modalities. (2) Evaluation that tests generality of architecture, concepts across skills, and learning ability.
2. Related Work
Our architecture is motivated and enabled by several significant advances in joint vision-language understanding. Over the last decade, specialized and effective approaches have been developed for vision-language tasks, including image captioning [12, 22, 29, 38, 56, 62], phrase grounding [42, 41, 47], referring expression comprehension [23, 39], visual question answering (VQA) [2, 14], fact-based VQA [58], knowledge- and reasoning-based VQA [20, 60, 64], visual dialog [8], and even text-to-image generation [7]. Cross-modal embeddings are particularly important. These originated with CCA-based methods [15, 13, 18], evolved to neural models [26, 57], and more recently have benefited from transformer architectures [54]. State-of-the-art self-supervised language models like BERT [9] and its extensions [35, 50, 5, 52, 31, 32, 25, 65] have further improved the effectiveness of vision and language representations. Recent systems like 12-in-1 [37] and UniT [19] have leveraged such representations to attempt to unify a number of different vision-language tasks, but their architectures still rely on task-specific heads. MultiModel [21] is a multi-modal (image, audio, text) architecture that jointly performs speech recognition, text translation, image classification, and captioning using shared input encoders, cross-modal encoders, and decoders. However, it requires the task to be identified via an input token and substantially underperforms the state of the art. In contrast, GPV-I infers the task from a natural language prompt and is competitive with the state of the art by benefiting from the latest architectural innovations.
In the natural language domain, several works try to blur or erase artificial task boundaries. Kumar et al. [30] show that multiple tasks, such as part-of-speech tagging, question answering, and classification, can be formulated as a sequence-to-sequence transformation solved with a single task-agnostic architecture, though separate parameters are trained for each task. Recent works such as DecaNLP [40] and UnifiedQA [24] have trained single models to perform multiple tasks by reformulating each task as question answering, allowing the individual tasks to benefit from more data with diverse supervision while sharing model parameters. Works such as T5 [45] and GPT [44, 3] have also highlighted the transfer learning capabilities of unified models, especially in zero-shot and few-shot scenarios. The idea of specifying tasks through natural language descriptions is also being actively explored [59].
Though released too recently to influence our work, some recent works also make progress towards more general vision capabilities. CLIP [43] is trained contrastively on 400M image/text pairs and achieves impressive zero-shot classification performance by framing classification as an image-text matching task. Most similar to our work, Cho et al. [6] propose a unified architecture that treats VQA, grounding, and other tasks as sequence-to-sequence generation with regions referenced through text tokens. Our system has a similar goal but differs in technical aspects, including the ability to train all encoders and decoders (including region generators) end-to-end. Our evaluation is also unique in its focus on skill-concept and learning generalization.
3. The GPV-I Model
3.1. Architecture Overview
The most distinct aspect of our GPV-I system is that tasks are defined through natural language text input, instead of multi-head outputs. For example, most systems that perform ImageNet [10] classification and COCO detection would have one 1000-class confidence output head and another head producing boxes and confidences for 80 classes. More tasks or more datasets would require more output heads. Once trained, such a system will always produce 1,080 types of confidence and 80 classes of bounding boxes.
GPV-I does not have explicit task boundaries and instead takes in a natural language task description such as "What is sitting on the sofa?" (VQA), "Find all instances of dogs" (localization), "What is going on in the image?" (captioning), or "What kind of object is this?" (classification). GPV-I interprets and performs all tasks using the same language, vision, and cross-modal encoders and decoders. In training and evaluation, tasks such as localization and referring expressions have bounding box ground truth, while others such as classification, question answering, and captioning have text ground truth. Yet all tasks involve common skills such as interpreting the task description, localizing objects, representing image regions, and determining relevance to the task. A new task, such as referring expressions, can be defined simply by providing new inputs ("Find the man wearing a green shirt") and output supervision (bounding box). Like humans, GPV-I can learn to perform a wide range of tasks, limited only by the modalities that it can sense and produce.
Fig. 2 provides an overview of GPV-I's architecture, which consists of a visual encoder, a language encoder, a vision-language co-attention module, and output heads for the supported output modalities: boxes, relevance scores, and text. We use the CNN backbone and the transformer encoder-decoder from DETR [4], an end-to-end trainable object detector. The natural language task description is encoded with BERT [9]. To cross-contextualize representations from the visual and language encoders, we use ViLBERT's co-attention module [35]. The box and objectness heads predict task-agnostic bounding boxes and scores. The relatedness head predicts a task-specific score for each output box that is combined with the objectness scores to obtain relevance scores. The text decoder is a transformer decoder that autoregressively generates text output while using relevance-conditioned representations produced by the cross-modal module as memory.
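The following is a minimal, runnable sketch of this data flow. DETR, BERT, and ViLBERT's co-attention are replaced by simple stand-in modules, and all layer sizes and module names are illustrative assumptions, not the authors' implementation.

```python
# Structural sketch of the GPV-I forward pass (Sec. 3.1). Stand-in modules
# are used in place of DETR, BERT, and ViLBERT; shapes are assumptions.
import torch
import torch.nn as nn

R, D = 100, 256  # number of object queries / shared feature dimension (assumed)

class GPVSketch(nn.Module):
    def __init__(self, vocab_size=30522):
        super().__init__()
        self.visual_encoder = nn.Linear(2048, D)              # stand-in for DETR encoder-decoder
        self.language_encoder = nn.Embedding(vocab_size, D)   # stand-in for BERT
        self.co_attention = nn.MultiheadAttention(D, 8, batch_first=True)  # stand-in for ViLBERT
        self.box_head = nn.Linear(D, 4)                       # task-agnostic boxes
        self.objectness_head = nn.Linear(D, 1)                # task-agnostic objectness logit
        self.relatedness_head = nn.Linear(D, 1)               # task-specific relatedness logit
        decoder_layer = nn.TransformerDecoderLayer(D, 8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.word_logits = nn.Linear(D, vocab_size)           # stand-in for the BERT-tied output layer

    def forward(self, region_feats, task_tokens, text_so_far):
        v = self.visual_encoder(region_feats)                 # (B, R, D) region descriptors
        t = self.language_encoder(task_tokens)                # (B, T, D) task-description tokens
        v_ctx, _ = self.co_attention(v, t, t)                 # cross-contextualized regions
        boxes = self.box_head(v_ctx).sigmoid()                # (B, R, 4)
        relevance = (self.objectness_head(v_ctx) + self.relatedness_head(v_ctx)).sigmoid()
        memory = torch.cat([v_ctx, t], dim=1)                 # regions + task tokens as memory
        w = self.language_encoder(text_so_far)                # words generated so far
        dec = self.text_decoder(w, memory)
        return boxes, relevance.squeeze(-1), self.word_logits(dec)

model = GPVSketch()
boxes, rel, logits = model(torch.randn(2, R, 2048),
                           torch.randint(0, 30522, (2, 12)),
                           torch.randint(0, 30522, (2, 5)))
```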
3.2. Vision Modules
We use a DETR-based visual encoder. A Resnet-50 backbone [17] extracts a convolutional feature map that is fed into DETR's transformer encoder to obtain contextualized features for every grid location. The transformer decoder takes as input R (= 100) object queries (learned constant vectors) and the contextualized grid features and produces one region descriptor per object query. The main intuition is that the object queries serve as learnable anchors, and the transformer encoder-decoder trained on detection eliminates the need for non-maximum suppression as a post-processing step. The complete region encoding is obtained by concatenating DETR's transformer features, which encode location and limited appearance information, with RoI-pooled features from the CNN backbone.
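As an illustration of this region encoding, the sketch below concatenates per-query DETR features with RoI-pooled backbone features via torchvision's roi_align; the shapes, pooling window, and spatial scale are assumptions for illustration only.

```python
# Sketch of the complete region encoding (Sec. 3.2): DETR's per-query
# transformer features concatenated with RoI-pooled CNN backbone features.
import torch
from torchvision.ops import roi_align

B, R = 2, 100
detr_feats = torch.randn(B, R, 256)            # per-query features from DETR's decoder
backbone_fmap = torch.randn(B, 2048, 25, 38)   # Resnet-50 feature map (stride 32, assumed)
boxes_xyxy = [torch.rand(R, 4).sort(dim=1).values * 800 for _ in range(B)]  # boxes in image coords

# Pool a fixed-size window from the backbone map for every predicted box.
pooled = roi_align(backbone_fmap, boxes_xyxy, output_size=(7, 7), spatial_scale=1 / 32)
pooled = pooled.mean(dim=(2, 3)).view(B, R, 2048)   # average-pool to one vector per region

region_descriptors = torch.cat([detr_feats, pooled], dim=-1)  # (B, R, 256 + 2048)
```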
As a vision decoder, GPV-I uses DETR's box head to predict bounding boxes from region descriptors, resulting in R region proposals. These bounding boxes are used for grounding and detection tasks as well as for RoI pooling from the CNN backbone. We also replace DETR's 80-way object classification layer with a binary objectness classification layer, which contributes to determining relevance.
3.3. Language Modules
The language encoder is used to encode the task description. We use the WordPiece tokenizer [61] to obtain sub-word tokens for the language input and a pre-trained BERT model to compute representations. Sub-word tokenization provides robustness to out-of-vocabulary words, and large scale language model pretraining allows GPV-I to better handle paraphrases of language queries and zero-shot generalization to novel task descriptions, assuming semantic similarity to previously seen descriptions in the BERT embedding space.
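For concreteness, the snippet below encodes task descriptions with a WordPiece tokenizer and a pre-trained BERT. The use of the HuggingFace transformers library and the bert-base-uncased checkpoint are assumptions, since the paper does not name a specific implementation.

```python
# Encoding the natural language task description (Sec. 3.3) with WordPiece
# tokenization and a pre-trained BERT (HuggingFace implementation assumed).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

prompts = ["Find all instances of dogs", "What is sitting on the sofa?"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    token_feats = bert(**inputs).last_hidden_state   # (batch, sub-word tokens, 768)
```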
The language decoder outputs one or more words to classify, describe, or answer the input. The sequence of co-attended region representations and the language query's token representations are concatenated into a single sequence that serves as memory for the transformer text decoder. At each generation step, the sequence of words generated thus far is fed into the decoder along with the memory, and a distribution over the vocabulary is predicted to sample the next word. The input words are encoded using trainable word embeddings. The output logit for a vocabulary word is obtained by taking the dot product between the embedding vector output by the decoder and a linearly transformed BERT encoding of the word.
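A minimal sketch of this output layer follows; the dimensions, the precomputed per-word BERT encodings, and the greedy choice of the next word are illustrative assumptions.

```python
# Sketch of the text decoder's output layer (Sec. 3.3): the logit for each
# vocabulary word is the dot product between the decoder's output embedding
# and a linearly transformed BERT encoding of that word.
import torch
import torch.nn as nn

V, d_bert, d_dec = 30522, 768, 256
bert_word_enc = torch.randn(V, d_bert)        # precomputed BERT encoding per vocabulary word
proj = nn.Linear(d_bert, d_dec)               # trainable linear transform

decoder_out = torch.randn(2, 5, d_dec)        # decoder embedding at each generated position
vocab_logits = decoder_out @ proj(bert_word_enc).t()    # (2, 5, V)
next_word = vocab_logits[:, -1].argmax(dim=-1)          # greedy choice of the next word
```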
3.4. Cross-Modal Modules
The region descriptors from the vision modules and the sub-word token representations from the language module are transformed by linear layers to vectors of equal dimension and fed into ViLBERT's co-attention layers for cross-contextualization. The relatedness head uses the co-attended region features to predict logits that indicate the relevance of regions to the task description. These logits are added to the logits from the objectness head and transformed into region-relevance scores by a sigmoid activation. The relevance scores are an output of the system, used, for example, to rank bounding boxes or to indicate the importance of regions to performing the task.
Relevance conditioning modulates the co-attended visual features with relevance scores. Specifically, the relevance score s of each region is used to weight learned vectors v_rel and v_nrel, which are added to the region features before feeding them to the decoder. This conditioning enables supervision from the text decoder to affect the relatedness and objectness heads. In this way, a model trained to produce captions for images of peacocks may learn to localize peacocks, and, conversely, the ability to localize peacocks may translate to improved caption quality.
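The sketch below illustrates one way to implement relevance scoring and relevance conditioning as described; weighting v_nrel by (1 - s) is an assumption consistent with, but not spelled out in, the text.

```python
# Sketch of relevance scoring and relevance conditioning (Sec. 3.4):
# s = sigmoid(relatedness_logit + objectness_logit), then region features
# are modulated by s-weighted learned vectors before decoding.
import torch
import torch.nn as nn

B, R, d = 2, 100, 256
region_feats = torch.randn(B, R, d)        # co-attended region features
relatedness_logits = torch.randn(B, R, 1)  # from the relatedness head
objectness_logits = torch.randn(B, R, 1)   # from the objectness head

s = torch.sigmoid(relatedness_logits + objectness_logits)   # region-relevance scores

v_rel = nn.Parameter(torch.randn(d))    # learned "relevant" vector
v_nrel = nn.Parameter(torch.randn(d))   # learned "not relevant" vector
conditioned = region_feats + s * v_rel + (1 - s) * v_nrel   # memory for the text decoder
```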
3.5. Training
A general purpose architecture needs a general purpose learning algorithm. Each training sample consists of an image, a task description, and targets. Depending on the task, the targets may consist of ground truth bounding boxes, text, or both. In each training iteration, we uniformly draw samples across all tasks to construct mini-batches. For all samples that contain a text target, we maximize the log-likelihood of the ground truth text. For all samples that contain bounding boxes as targets, we use DETR's Hungarian loss to train the box and relevance prediction.
Initialization. We initialize all vision modules except the last linear layer in the objectness head with weights from DETR pretrained on object detection data. BERT is pretrained on BooksCorpus [67] and English Wikipedia.
Optimization. We train GPV-I with a batch size of 120 using the AdamW optimizer [34]. We keep the DETR-initialized weights frozen for the first 10 epochs and finetune all modules except BERT for 30 more epochs. For the learning rate (LR), we use a gradual warm-up over the first 4 epochs to a maximum LR of 10^-4 followed by linear decay to 0. Following DETR, we apply gradient clipping to the visual module parameters and use a maximum learning rate of 10^-5 for the CNN backbone. While we strive for task-agnostic learning, in practice we use a 0.05x text loss weight for captioning because its targets contain more words than those of other tasks.
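As a concrete example of the schedule described above, the sketch below builds an AdamW optimizer with a 4-epoch warm-up to a peak LR of 10^-4 followed by linear decay to 0. Stepping the schedule once per epoch (rather than per iteration) and the placeholder model are assumptions.

```python
# Sketch of the optimization schedule in Sec. 3.5: AdamW, 4-epoch warm-up
# to a peak LR of 1e-4, then linear decay to 0 over 40 total epochs.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 10)          # placeholder for GPV-I's trainable modules
optimizer = AdamW(model.parameters(), lr=1e-4)

warmup_epochs, total_epochs = 4, 40      # 10 frozen + 30 finetuning epochs

def lr_lambda(epoch):
    if epoch < warmup_epochs:            # linear warm-up to the peak LR
        return (epoch + 1) / warmup_epochs
    # linear decay from the peak LR down to 0 over the remaining epochs
    return max(0.0, (total_epochs - epoch) / (total_epochs - warmup_epochs))

scheduler = LambdaLR(optimizer, lr_lambda)
for epoch in range(total_epochs):
    # ... one epoch of mixed-task training (with gradient clipping on the
    #     visual modules) would go here ...
    optimizer.step()                     # placeholder step so scheduler ordering is valid
    scheduler.step()
```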
4. Tasks and Data
Our experiments involve 5 tasks using images from the COCO dataset and annotations from the COCO, VQA V2 [14] , and REFCOCO+ [23] datasets. Sec. 4.1 describes how these tasks are posed to our general purpose system along with respective losses and metrics used for training and evaluation. Sec. 4.2 details how samples are created for each task from the original annotations and introduces our COCO-SCE split for testing the generalization of concepts across skills.
4.1. Tasks
VQA aims to answer a question given an image. The input is an image/text pair, and the output is text. The training loss is the negative log-likelihood of the ground truth answer text. We use the standard VQA evaluation metric (annotator-agreement weighted answer accuracy) [2] to report results.
Captioning aims to produce a description of an image. The input is an image and a prompt, such as "Describe the image" or "What is going on in the image?", and the output is text. The training loss is the negative log-likelihood of the annotated caption. The reported evaluation metric is CIDEr-D [55], which measures the similarity of the generated and ground truth captions.
Localization aims to produce a tightly fitting bounding box around an object. The input is an image and a prompt, such as "Find all instances of dogs" or "Locate the chairs", and the output is a set of ranked bounding boxes. Training uses DETR's Hungarian loss. Evaluation is an average of per-image average precision (AP) at a 0.5 bounding-box intersection-over-union (IoU) threshold. For example, if an image contains two target objects and the correctness of the top four ranked boxes is {True, False, False, True}, the AP is (1/1 + 2/4)/2 = 0.75 (every-point interpolation; a worked sketch of this computation appears after the task descriptions below). The reported number is AP averaged over samples.
Classification aims to assign a category to a region. The input is an image patch and a prompt such as "What is this thing?" or "What object is this?", and the output is text. In principle, GPV-I can produce any category label within the large vocabulary of the text decoder, including words it has not seen in its classification training data. However, for evaluation, K-way classification is performed by suppressing outputs that do not correspond to any of the K applicable categories. The training loss is the negative log-likelihood of the text output, and the evaluation metric is accuracy averaged over samples.
Referring expressions (RefExp) aims to localize a single region that corresponds to a phrase. The input is an image and a referring expression such as "the man wearing a green shirt", and the output is one bounding box. While the training loss and evaluation are the same as for localization, the key distinction is disambiguation of the referred instance from other instances of the same object category in the image.
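The following is a small worked implementation of the localization AP example given above (two targets, ranked correctness {True, False, False, True}). It reproduces the paper's arithmetic and is not the official evaluation code.

```python
# Per-image average precision at a 0.5 IoU threshold (Sec. 4.1 example),
# given the correctness of the ranked predicted boxes.
def average_precision(ranked_correct, num_targets):
    """ranked_correct: list of booleans, one per ranked predicted box."""
    precisions, true_positives = [], 0
    for rank, correct in enumerate(ranked_correct, start=1):
        if correct:
            true_positives += 1
            precisions.append(true_positives / rank)  # precision at each recall point
    return sum(precisions) / num_targets if num_targets else 0.0

# Two target objects; top-4 ranked boxes are {True, False, False, True}:
print(average_precision([True, False, False, True], num_targets=2))  # 0.75
```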
4.2. Data
We present experiments using images from the richly annotated COCO dataset. We use question and answer annotations from the VQA V2 dataset, referring expressions from REFCOCO+, and COCO annotations for the other tasks.
Data samples. VQA samples consist of the original questions as prompts paired with the most-agreed answer among annotators. For captioning, COCO provides 5 captions per image, each of which is treated as a different sample paired with one of 14 captioning prompt templates. We generate localization samples for each object category in the image using one of 18 prompt templates paired with all instances of the category. For classification, we create a sample for each object category in the image by choosing one of the instances (cropped using the ground truth box) paired with one of 4 prompt templates. RefExp samples consist of referring expressions as prompts with corresponding boxes.
Data splits. We present results for GPV-I and baselines on two data splits. First, we train and evaluate models using the standard data splits for the corresponding tasks. This provides results for GPV-I in the context of past work. Then, to test the ability of vision systems to generalize concepts across skills, we present a new split of the above annotations, named COCO-SCE (Skill-Concept Evaluation).
COCO-SCE. Fig. 3 presents a schematic of the proposed COCO-SCE splits. The 80 classes of COCO are split into 3 disjoint sets, specifying which tasks can use them for training and validation:
• H_vqa,cap: 10 classes held out from the VQA and captioning tasks in the train/val sets
• H_cls,loc: 10 different classes held out from the classification and localization tasks in the train/val sets
• S: the 60 remaining classes, which are not held out from any task
Figure 3: COCO-SCE: A split of COCO images and annotations to test the generalization of concepts across skills. The schematic shows the train, val, and test samples used for VQA.
When a category is held out, any annotations containing that word are not used for training or validation. For example, if boat is a held-out category for VQA, then the annotation {"What color is the boat?", "Blue"} would be excluded from the train/val set. Other annotations from the same image may still be used, e.g. {"Is it a sunny day?", "Yes"}. Also, the classification and localization annotations for boat would still be included in train/val for those tasks. The assignment of categories to H_vqa,cap, H_cls,loc, and S is random, except that we assign person to S, because it is so common.
Images in the COCO-SCE train and val sets come from the COCO train set, and images in the COCO-SCE test set are those in the COCO validation set (as COCO test annotations are hidden). The COCO-SCE train and val splits are created by first making an 80-20 partition of COCO train images and then, for each task, discarding samples whose annotations expose the held-out categories for that task. On the test set we report performance separately for samples belonging to "seen" (e.g. S ∪ H_cls,loc for VQA) and "unseen" (e.g. H_vqa,cap for VQA) categories for each task.
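The snippet below sketches this exclusion rule using the boat example from above. The data structures and the simple word-matching test are assumptions; the actual split may use more careful matching (e.g., plural forms).

```python
# Illustrative sketch of the COCO-SCE filtering rule (Sec. 4.2): drop an
# annotation from a task's train/val data if it mentions a held-out category.
H_VQA_CAP = {"boat", "horse"}   # illustrative held-out categories for VQA/captioning

def keep_for_training(annotation_text, held_out):
    words = set(annotation_text.lower().replace("?", "").split())
    return not (words & held_out)

print(keep_for_training("What color is the boat?", H_VQA_CAP))   # False -> excluded
print(keep_for_training("Is it a sunny day?", H_VQA_CAP))        # True  -> kept
```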
5. Experiments
Our experiments evaluate GPV-I for its effectiveness compared to specialized models (Sec. 5.1), its ability to apply learned skills to concepts unseen for those skills (Sec. 5.2), and its efficiency at learning new skills while retaining previously learned ones (Sec. 5.3). Sec. 5.4 provides ablations.
5.1. Generality vs. Effectiveness
Is the generality of GPV-I at the cost of effectiveness? We compare GPV-I to competitive special purpose models designed for each task: ViLBERT [35] (VQA), VLP [66] (captioning), Faster-RCNN [46] (localization), and Resnet-50 [17] (classification). To avoid conflating the effectiveness of an architecture with the availability of more data, we retrain these models to use only COCO and VQA V2 annotations. For ViLBERT and VLP this requires replacing Visual Genome [28] bottom-up features [1] with Faster-RCNN features trained only on COCO, and forgoing pretraining on Conceptual Captions [49].
Table 1: Comparison to special purpose baselines (COCO-SCE and COCO splits): Our jointly trained GPV-I compares well to specialized single-task baselines as well as to GPV-I trained on individual task data. On the COCO split, we report test-server results for VQA and captioning and validation results for localization and classification, as the annotations for test images are hidden. On the COCO-SCE split, we report test results for all tasks.
Tab. 1 shows that on the COCO-SCE split, the general purpose GPV-I architecture trained on individual tasks compares favorably to each special purpose model (rows a vs b). Also, the generality of GPV-I enables it to be jointly trained on all 4 tasks, leading to sizeable gains on 2 tasks and comparable results on the others (rows b vs c). The same trends hold when we compare models on the original COCO data splits (rows d vs e), validating that these trends are not merely a product of our proposed splits. Together, these results establish that the generality of GPV-I does not come at the expense of effectiveness.
5.2. Skill-Concept Generalization
One of the key tenets of general purpose learning is the generality of concepts across skills, i.e. the ability of a model to perform well on novel skill-concept combinations that were not seen during training. When training on a single task on COCO-SCE splits, a model does not have access to any annotation on held-out concepts. For example, a model trained only on VQA will never see a question or answer about "horse" ∈ H vqa,cap . However, when training on all tasks, the model learns to localize and classify horse images. Therefore we expect the model to apply the acquired skill of question answering to answer questions about horse without explicitly being trained on horse VQA data.
Tab. 2 shows the performance of the specialized models and the 1-Task and Multitask GPV-I models on the COCO-SCE full test split as well as separately on the subset of test data categorized as "seen" and "unseen" (see Fig. 3 for a schematic of these subsets for the VQA task). The 1-Task GPV-I (row b) trained on individual tasks serves as a baseline to account for learned priors and dataset biases by the GPV-I architecture. We observe significant gains by Multitask GPV-I (row c) on the "unseen" subset across all tasks, particularly over the specialized models (row c vs row a), indicating that the general purpose architecture is better suited at learning skills and then applying them to concepts that were unseen for that skill. We also report the performance of Multitask GPV-I trained on the COCO training split (row d). Since this split exposes the model to held-out concepts for all tasks, it can serve as a loose upper bound for the "unseen" split. Fig. 4 shows an analysis for VQA, highlighting concepts and question types (VQA sub-skills) that benefit most from multitask training. Interestingly, 2 of the top-3 concepts that benefit the most belong to the "unseen" set (H_vqa,cap) for VQA. Also, the top question types that benefit are ones that require skills from localization and classification.
5.3. Learning Generalization
A system exhibits good learning generalization if it can learn new skills sample-efficiently without forgetting previously learned skills.
Learning ability. Fig. 5 (left) shows the learning curve of GPV-I finetuned on the referring expressions task after pretraining either on only the localization task (the only other task with bounding-box supervision) or on all four tasks. Multitask GPV-I demonstrates much better zero-shot performance as well as better sample-efficiency in the low-data regime. The learning of attributes and additional nouns provides a better starting point for referring expressions; e.g., while the localization-trained model starts with the ability to localize person, the multitask model is also familiar with red and sweater through captions and VQA and may better localize "the person wearing a red sweater".
Retention. Fig. 5 (right) shows the percentage of performance retained on the original tasks as GPV-I is trained with increasing amounts of RefExp training data. Interestingly, Multitask GPV-I forgets more slowly than GPV-I-Loc on the localization task. Localization and captioning suffer the most from catastrophic forgetting, while classification shows robust retention. GPV-I does not include explicit mechanisms for addressing forgetting, but our results highlight the importance of such mechanisms for general purpose learning.
5.4. Ablations
Tab. 3 ablates key factors that make GPV-I effective. Finetuning end-to-end (as opposed to keeping the DETR-initialized weights frozen) contributes to performance across all tasks (rows a vs c). RoI pooling significantly boosts performance for VQA and slightly for captioning, but leads to a slight drop for localization and classification (rows a vs b).