# ELASTIC: Improving CNNs with Instance Specific Scaling Policies

## Authors

## Abstract

Scale variation has been a challenge from traditional to modern approaches in computer vision. Most solutions to scale issues share a similar theme: a set of intuitive and manually designed policies that are generic and fixed (e.g., SIFT or feature pyramid). We argue that the scale policy should be learned from data. In this paper, we introduce ELASTIC, a simple, efficient, and yet very effective approach to learn an instance-specific scale policy from data. We formulate the scaling policy as a non-linear function inside the network's structure that (a) is learned from data, (b) is instance specific, (c) does not add extra computation, and (d) can be applied to any network architecture. We applied ELASTIC to several state-of-the-art network architectures and showed consistent improvement without extra (sometimes even lower) computation on ImageNet classification, MSCOCO multi-label classification, and PASCAL VOC semantic segmentation. Our results show major improvement for images with scale challenges, e.g., images with several small objects or objects with large scale variations. Our code and models will be publicly available soon.

## 1. Introduction

Scale variation has been one of the main challenges in computer vision. There is a rich literature on different approaches to encode scale variations in computer vision algorithms [19]. In feature engineering, there have been manually prescribed solutions that offer scale robustness. For example, searching for scale first and then extracting features at a known scale, as in SIFT, and using feature pyramids are both prescribed solutions of this kind. Some of these ideas have also migrated to feature learning with deep networks in modern recognition solutions.

The majority of the solutions in old-school and even modern approaches to encode scale are manually designed and fixed solutions. For example, most state-of-the-art image classification networks [15, 30, 9, 13, 37, 41] use the feature pyramid policy, where a network looks at the larger resolution first and then goes to smaller ones as it proceeds through the layers. Although this common practice seems a natural and intuitive choice, we argue that this scale policy is not necessarily the best one for all possible scale variations in images. We claim that an ideal scale policy should (1) be learned from the data; (2) be instance specific; (3) not add extra computational burden; and (4) be applicable to any network architecture. For example, if we process the images in Figure 1 with a learned, instance-specific policy instead of the feature pyramid policy, we see improved performance. In images with scale challenges, like the golf ball image in Figure 1, the learned scale policy might differ dramatically from a pyramid policy, resulting in correct classification of that instance. The learned policy for this instance starts by looking at the image at a large scale (dark blue color), then goes immediately to a smaller scale, then back to a large scale followed by a smaller scale, and so on.

Figure 1 (caption fragment): The early layers receive eXtra-large resolutions and in the following layers resolutions decrease as Large, Medium, and Small. We argue that scaling policy in CNNs should be instance-specific. Our Elastic model (the third row) allows different scaling policies for different input images and learns from the training data how to pick the best policy. For scale-challenging images, e.g., images with lots of small (or diverse-scale) objects, it is crucial that the network can adapt its scale policy based on the input. As can be seen in this figure, Elastic gives a better prediction for these scale-challenging images.

\* Work done while an intern at AI2.

In this paper we introduce ELASTIC, an approach to learn instance-specific and not-necessarily-pyramidal scale policies with no extra (or lower) computational cost. Our solution is simple, efficient, and very effective on a wide range of network architectures for image classification and segmentation. Our Elastic model can be applied to any CNN architecture simply by adding down-samplings and up-samplings in parallel branches at each layer and letting the network learn from data a scaling policy in which inputs are processed at different resolutions in each layer. We named our model ELASTIC because each layer in the network is flexible in terms of choosing the best scale by a soft policy.

Our experimental evaluations show improvements in image classification on ImageNet [28], multi-label classification on MSCOCO [18], and semantic segmentation on PASCAL VOC for ResNeXt [34], SE-ResNeXt [11], DenseNet [13], and Deep Layer Aggregation (DLA) [37] architectures. Furthermore, our results show major improvements (about 4%) on images with scale challenges (lots of small objects or large variation across scales within the same image) and smaller improvements for images without scale challenges. Our qualitative analysis shows that images with similar scaling policies (over the layers of the network) share a similar complexity pattern in terms of the scales of the objects appearing in the image.

## 2. Related Work

The underlying idea behind Elastic is conceptually simple, and there are several approaches in the literature using similar concepts. Therefore, we study all the categories of related CNN models and clarify the differences and similarities to our model. There are several approaches to fusing information at different visual resolutions. The majority of these approaches fall into four categories (depicted in Figure 2 (b-e)).

Image pyramid: An input image is passed through a model multiple times at different resolutions and predictions are made independently at all levels. The final output is computed as an ensemble of outputs from all resolutions. This approach has been a common practice in [4, 5, 29] .

Loss pyramid: This method enforces multiple loss functions at different resolutions. [32] uses this approach to improve the utilization of computing resources inside the network. SSD [20] and MS-CNN [2] also use losses at multiple layers of the feature hierarchy.

Filter pyramid: Each layer is divided into multiple branches of convolution with different filter sizes (typically referred to as the split-transform-merge architecture). The variation in filter sizes results in capturing features at different resolutions, but at the cost of additional parameters and operations. The Inception family of networks [32, 33, 31] uses this approach. To further reduce the complexity of the filter pyramid, [24, 35, 36] use dilated convolutions to cover a larger receptive field with the same number of FLOPs.

Feature pyramid: This is the most common approach to incorporate multiple scales in a CNN architecture. This approach fuses features from different resolutions in a network by either concatenation or summation. Fully convolutional networks [22] add up the scores from multiple scales to compute the final class score. Hypercolumns [7] use earlier layers in the network to capture low-level information to describe a pixel in a vector. Several other approaches (HyperNet [14], ParseNet [21], and ION [1]) concatenate the outputs from multiple layers to compute the output for the last layer. Several recent methods, including SharpMask [26] and U-Net [27] for segmentation, Stacked Hourglass networks [25] for keypoint estimation, and Recombinator networks [10] for face detection, have used skip connections to incorporate low-level feature maps on multiple resolutions and semantic levels. [12] extends the principles of DenseNet [13] to fuse features across different resolution blocks. Feature pyramid networks (FPNs) [17] are designed to normalize resolution and equalize semantics across the levels of a pyramidal feature resolution hierarchy through top-down and lateral connections. Likewise, [37] proposes an iterative deep aggregation that raises resolution, but further deepens the representation by nonlinear and progressive fusion. DLA [38] proposes to aggregate information across different scales in the CNN in a hierarchical fashion.

Elastic resembles models from the Filter pyramid family as well as the Feature pyramid family, in that it introduces parallel branches of computation (a la Filter pyramid) and also fuses information from different scales (a la Feature pyramid). The major difference from the feature pyramid models is that in Elastic every layer in the network considers information at multiple scales by itself, whereas in a feature pyramid the information for higher or lower resolution is injected from other layers. Elastic provides an exponential number of scaling policies across the layers and yet keeps the computational complexity the same as (or even lower than) the base model. The major difference from the filter pyramid is that the number of FLOPs needed to cover a larger receptive field in Elastic is proportionally lower due to the downsampling, whereas in the filter pyramid the FLOPs are higher than or the same as the original convolution, depending on filter size or dilation parameters.

## 3. Model

In this section, we describe the structure of our proposed Elastic and illustrate how standard CNN architectures are augmented with it. We also contrast our model with other multi-scale approaches.

## 3.1. Scale Policy In CNN Blocks

Formally, a layer in a convolutional neural network can be expressed as

$$F(x) = \sigma\left(\sum_{i=1}^{q} T_i(x)\right) \tag{1}$$

where $q$ is the number of branches to be aggregated, $T_i(x)$ can be an arbitrary function (normally a combination of convolution, batch normalization and activation), and $\sigma$ is a nonlinearity. A few $F(x)$ are stacked into a block to process information at one spatial resolution. Blocks with decreasing spatial resolutions are stacked to integrate a pyramid scale policy in the network architecture. A network example of 3 blocks with 2 layers in each block can be expressed as

$$N = F_{32} \circ F_{31} \circ D_{r_2} \circ F_{22} \circ F_{21} \circ D_{r_1} \circ F_{12} \circ F_{11} \tag{2}$$

where $D_{r_i}$ indicates the resolution decrease by ratio $r_i > 1$ after a few blocks. $D_{r_i}$ can be simply implemented by increasing the stride of the convolution that follows. For example, ResNeXt [34] stacks bottleneck layers at each resolution and uses convolution with stride 2 to reduce spatial resolution. This leads to a fixed scaling policy that enforces a linear relation between the number of layers and the effective receptive field of those layers. The parameters of $T_i(x)$ and the elements of the input tensors $x$ are the tangible ingredients in a CNN that define the computational capacity of the model. Under a fixed computational capacity measured by FLOPs, to improve the accuracy of such a model, we can either increase the number of parameters in $T_i(x)$ and decrease the resolution of $x$, or increase the resolution of $x$ and decrease the number of parameters in $T_i(x)$. By adjusting the input resolution and the number of parameters at each layer, we can define a scaling policy across the network. We argue that finding the optimal scaling policy (a trade-off between the resolution and the number of parameters in each layer) is not trivial. There are several model designs aimed at increasing accuracy by manually injecting variations of the feature pyramid, but most of them come at the cost of higher FLOPs and more parameters in the network. In the next section, we explain our solution, which can learn an optimal scaling policy and maintain or reduce the number of parameters and FLOPs while improving accuracy.

## 3.2. The Elastic Structure

In order to learn image features at different scales, we propose to add down-samplings and up-samplings in parallel branches at each layer and let the network make decision on adjusting its process toward various resolutions at each layer. Networks can learn this policy from training data. We add down-samplings and up-samplings in parallel branches at each layer and divide all the parameters across these branches as follows:

$$F(x) = \sigma\left(\sum_{i=1}^{q} U_{r_i}\big(T_i(D_{r_i}(x))\big)\right) \tag{3}$$

$$N = F_{32} \circ F_{31} \circ F_{22} \circ F_{21} \circ F_{12} \circ F_{11} \tag{4}$$

where $D_{r_i}(x)$ and $U_{r_i}(x)$ are respectively downsampling and upsampling functions which change the spatial resolution of features in a layer. Unlike in equation 2, a few $F$ are applied sequentially without downsampling the main stream, and $N(x)$ has exactly the same resolution as the original $x$. Note that the learned scaling policy in this formulation will be instance-specific, i.e., for different image instances, the network may activate branches at different resolutions in each block. In section 4 we show that this instance-specific scaling policy improves prediction on images with scale challenges, e.g., images that consist of lots of small objects or highly diverse object sizes.
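The structure of equation 3 can be illustrated with a minimal, dependency-free sketch on a single-channel 2D feature map. It is an illustration under stated assumptions, not the authors' implementation: `avg_pool` stands in for $D_{r_i}$, `upsample_nearest` is an assumed nearest-neighbour stand-in for $U_{r_i}$, ReLU is assumed for $\sigma$, and `scale_by` is a hypothetical placeholder for the learned transform $T_i$.

```python
def avg_pool(x, r):
    """D_r: average non-overlapping r x r windows (sides must be divisible by r)."""
    h, w = len(x), len(x[0])
    return [
        [
            sum(x[i * r + di][j * r + dj] for di in range(r) for dj in range(r)) / (r * r)
            for j in range(w // r)
        ]
        for i in range(h // r)
    ]


def upsample_nearest(x, r):
    """U_r: repeat every value r times along both axes (assumed upsampling choice)."""
    h, w = len(x), len(x[0])
    return [[x[i // r][j // r] for j in range(w * r)] for i in range(h * r)]


def elastic_layer(x, branches):
    """F(x) = relu(sum_i U_ri(T_i(D_ri(x)))), with branches given as (r_i, T_i) pairs."""
    h, w = len(x), len(x[0])
    total = [[0.0] * w for _ in range(h)]
    for r, transform in branches:
        y = upsample_nearest(transform(avg_pool(x, r)), r)
        for i in range(h):
            for j in range(w):
                total[i][j] += y[i][j]
    return [[max(0.0, v) for v in row] for row in total]  # ReLU assumed for sigma


def scale_by(a):
    """Hypothetical stand-in for T_i: multiply every element by a constant."""
    return lambda m: [[a * v for v in row] for row in m]


# 4x4 input; one full-resolution branch (r = 1) and one half-resolution branch (r = 2).
x = [[float(4 * i + j) for j in range(4)] for i in range(4)]
out = elastic_layer(x, [(1, scale_by(0.5)), (2, scale_by(0.5))])
```

The output keeps the spatial size of the input, which mirrors the point that the main stream is never downsampled while individual branches work at lower resolution.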

Conceptually, we propose a new structure where information is always kept at a high spatial resolution, and each block or branch processes information at a lower or equal resolution. In this way we decouple feature processing resolution (T i processes information at different resolutions) from feature storage resolution (the main stream resolution of the network).

This way of expanding the layers encourages the network to process different resolutions separately in different branches of a layer and thus capture cross-scale information. More interestingly, since the network has multiple resolution options at each layer, the number of different scaling policies grows exponentially with the number of layers. In fact, this intuition is aligned with our experimental evaluation, where we have observed that different categories of images adopt different scaling policies (see section 4.1.1). For example, categories with clean and uniform background images mostly choose the low-resolution paths across the network, and categories with complex and cluttered objects and background mostly choose the high-resolution paths across the network.

The computational cost of our Elastic model is equal to or lower than that of the base model, because at each layer the maximum resolution is the original resolution of the input tensor. We apply downsampling on some branches before the next convolution, which reduces the computation and gives us extra room for adding more layers to match the computation of the original model. This simple add-on of downsamplings and upsamplings (Elastic) can be applied to any CNN layers $T_i(x)$ in any architecture to improve the accuracy of a model. Our applications are introduced in the next section.

Figure 2: Multi-scaling model structures. This figure illustrates different approaches to multi-scaling in CNN models and our Elastic model. The solid-line rectangles show the input size and the dashed-line rectangles show the filter size.

## 3.3. Augmenting Models With Elastic

Now, we show how to apply Elastic to different network architectures. To showcase the power of Elastic, we apply it to some state-of-the-art network architectures: ResNeXt [34], Deep Layer Aggregation (DLA) [38], and DenseNet [13]. A natural way of applying Elastic to current classification models is to augment bottleneck layers with multiple branches. This makes our modifications to ResNeXt and DLA almost identical. At each layer we apply downsampling and upsampling to a portion of the branches, as shown in Figure 3 (left). In DenseNet we compile an equivalent version by parallelizing a single branch into two branches and then applying downsampling and upsampling on some of the branches, as shown in Figure 3 (right). Note that applying Elastic reduces FLOPs in each layer. To match the original FLOPs, we increase the number of layers in the network while dividing a similar number of FLOPs across resolutions. The detailed architecture of our model can be found in Appendix B.

Relation to other multi-scaling approaches. As discussed in section 2, most of the current multi-scaling approaches can be categorized into four different categories: (1) image pyramid, (2) loss pyramid, (3) filter pyramid, and (4) feature pyramid. Figure 2 (b-e) demonstrates the structure of these categories.

All of these models can improve accuracy, usually under a higher computational budget. Elastic (Figure 2) guarantees no extra computational cost while achieving better accuracy. Filter pyramid is the most similar model to Elastic's structure. The major difference from the filter pyramid is that the number of FLOPs needed to cover a larger receptive field in Elastic is proportionally lower due to the downsampling, whereas in the filter pyramid the FLOPs are higher than or the same as the original convolution, depending on filter size or dilation parameters. Table 1 compares the FLOPs and number of parameters between Elastic and feature/filter pyramid for a single convolutional operation. Note that the FLOPs and parameters in Elastic are always (under any branching $q$ and scaling ratio $r$) lower than or equal to those of the original model, whereas in filter/feature pyramid they are higher or equal. Feature pyramid methods are usually applied on top of an existing classification model, by concatenating features from different resolutions. Such a method is capable of merging features from different scales in the backbone model and shows improvements on various tasks, but it does not intrinsically change the scaling policy. Our Elastic structure can be viewed as a feature pyramid inside a layer, which is able to model different scaling policies. Spatial pyramid pooling and atrous (dilated) spatial pyramid pooling share the same limitation as feature pyramid methods.

| Multi-Scaling Method | FLOPs | Parameters |
| --- | --- | --- |
| Single Scale | $n^2 c k^2$ | $c k^2$ |
| Feature Pyramid (concat) | $n^2 (qc) k^2$ | $(qc) k^2$ |
| Feature Pyramid (add) | $n^2 c k^2$ | $c k^2$ |
| Filter Pyramid (standard) | $\sum_{i=1}^{q} n^2 c (k r_i)^2 b_i$ | $\sum_{i=1}^{q} c (k r_i)^2 b_i$ |
| Filter Pyramid (dilated) | $n^2 c k^2$ | $c k^2$ |
| Elastic | $\sum_{i=1}^{q} \left(\frac{n}{r_i}\right)^2 c k^2 b_i$ | $c k^2$ |

Table 1 (caption fragment): $\sum_{i=1}^{q} b_i = 1$, and $b_i > 0$ and $r_i > 1$ denote the branching and scaling ratio respectively.
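The formulas in Table 1 can be checked numerically with a short sketch. The interpretation of the symbols is an assumption here, since the surviving caption does not define them: `n` is taken as the spatial side length of the input, `c` as the channel count, and `k` as the filter side length. The function names and the example values (a 56 × 56 × 64 input, 3 × 3 filters, two equal branches) are hypothetical, and the example includes one full-resolution branch with $r = 1$.

```python
def single_scale(n, c, k):
    """Table 1, 'Single Scale' row: returns (FLOPs, parameters)."""
    return n**2 * c * k**2, c * k**2


def filter_pyramid_standard(n, c, k, ratios, shares):
    """Table 1, 'Filter Pyramid (standard)' row: filters grow by r_i per branch."""
    flops = sum(n**2 * c * (k * r) ** 2 * b for r, b in zip(ratios, shares))
    params = sum(c * (k * r) ** 2 * b for r, b in zip(ratios, shares))
    return flops, params


def elastic(n, c, k, ratios, shares):
    """Table 1, 'Elastic' row: inputs shrink by r_i per branch, filters stay k x k."""
    flops = sum((n / r) ** 2 * c * k**2 * b for r, b in zip(ratios, shares))
    return flops, c * k**2


# Hypothetical example values; the branch shares b_i must sum to 1.
n, c, k = 56, 64, 3
ratios, shares = (1, 2), (0.5, 0.5)

base_flops, base_params = single_scale(n, c, k)
fp_flops, fp_params = filter_pyramid_standard(n, c, k, ratios, shares)
el_flops, el_params = elastic(n, c, k, ratios, shares)

print(el_flops / base_flops)  # fraction of the single-scale FLOPs used by Elastic
print(fp_flops / base_flops)  # the same fraction for the standard filter pyramid
```

With these values the Elastic layer uses 62.5% of the single-scale FLOPs at the same parameter count, while the standard filter pyramid uses 2.5 times the FLOPs and parameters; the saved FLOPs are what leave room for the extra layers mentioned in section 3.3.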

## 4. Experiments

In this section, we present experiments on applying Elastic to current strong classification models. We evaluate their performance on ImageNet classification, and we show consistent improvements over current models. Furthermore, in order to show the generality of our approach, we transfer our pretrained Elastic models to multi-label image classification and semantic segmentation. We use ResNeXt [34], DenseNet [13] and DLA [37] as our base models to be augmented with Elastic.

Implementation details. We use the official PyTorch ImageNet codebase with random crop augmentation but no color or lighting augmentation, and we report standard 224×224 single-crop error on the validation set. We train our model with 8 workers (GPUs) and 32 samples per worker. Following DLA [37], all models are trained for 120 epochs with an initial learning rate of 0.1, which is divided by 10 at epochs 30, 60, and 90. We initialize our models using normal He initialization [8]. Stride-2 average poolings are adopted as our downsamplings unless otherwise noted, since most of our downsamplings are 2× downsamplings, in which case bilinear downsampling is equivalent to average pooling. Also, the Elastic add-on is applied to all blocks except stride-2 ones and high-level blocks operating at resolution 7.
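The training schedule described above can be written down as a small sketch. It assumes zero-indexed epochs with each drop taking effect at the start of epochs 30, 60, and 90, and it assumes synchronous data parallelism so that the per-worker batches add up to one global batch; the function and constant names are hypothetical.

```python
NUM_WORKERS = 8          # GPUs, as stated in the text
SAMPLES_PER_WORKER = 32  # per-GPU batch size, as stated in the text
TOTAL_EPOCHS = 120


def learning_rate(epoch, base_lr=0.1, milestones=(30, 60, 90), factor=0.1):
    """Step schedule: multiply base_lr by `factor` once per milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor**drops


# Global batch size under the synchronous data-parallel assumption.
global_batch = NUM_WORKERS * SAMPLES_PER_WORKER

# Learning rate for every epoch of the run (epochs 0 .. 119).
schedule = [learning_rate(epoch) for epoch in range(TOTAL_EPOCHS)]
```

Under these assumptions the global batch is 256 images and the learning rate takes four values over the run, 0.1, 0.01, 0.001, and 0.0001, each held for 30 epochs.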

## 4.1. ImageNet Classification

We evaluate Elastic on the ImageNet [28] 1000-way classification task (ILSVRC2012). The ILSVRC 2012 dataset contains 1.2 million training images and 50 thousand validation images. In this experiment we show that our Elastic add-on consistently improves the accuracy of the state-of-the-art models without introducing extra computation and parameters. Table 2 compares the top-1 and top-5 error rates of all of the base models with the Elastic augmentation (indicated by '+Elastic') and shows the number of parameters and FLOPs used for a single inference. In all the tables, "*" denotes our implementation of the model. Besides DenseNet, ResNeXt, and DLA, we also report SE-ResNeXt50+Elastic, which shows that our improvement is almost orthogonal to the channel calibration proposed in [11]. In addition, we include ResNeXt50x2+Elastic to show that our improvement does not come from more depth added to ResNeXt101. In Figure 4 we project the numbers in Table 2 into two plots: accuracy vs. number of parameters (Figure 4, left) and accuracy vs. FLOPs (Figure 4, right). These plots show that our Elastic model can reach a higher accuracy without any extra (or with lower) computational cost.

## 4.1.1 Scale Policy Analysis

To analyze the learned scale policy of our Elastic model, we define a simple score that shows, at each block, the resolution level (high or low) at which the input tensor was processed. We formally define this scale policy score at each block as the difference of activations in the high-resolution and low-resolution branches, $S = A_{\text{highres}} - A_{\text{lowres}}$, where $A_{\text{highres}}$ and $A_{\text{lowres}}$ are the mean activations after the ReLUs following the 3 × 3 convolutions in the high-resolution and low-resolution branches respectively. Figure 5 shows all of the categories in ImageNet validation sorted by the average scale policy score $S$ (averaged over all of the layers for all images within a category). As can be seen, categories with more complex images appear to have a larger $S$, i.e., they mostly go through the high-resolution branches in each block, and images with simpler patterns appear to have a smaller $S$, which means they mostly go through the low-resolution branches in each block. To analyze the impact of the scale policy on the accuracy of Elastic, we represent each image (in the ImageNet validation set) by a 17-dimensional vector whose elements are the scale policy scores $S$ for the 17 Elastic blocks in a ResNeXt50+Elastic model. Then we apply t-SNE [23] to all these vectors to get a two-dimensional visualization. In Figure 6 (left) we draw all the images in the t-SNE coordinates. It can be seen that images are clustered based on their complexity pattern. In Figure 6 (middle) we show, for all of the images, the 17 scale policy scores $S$ in the 17 blocks. As can be seen, most of the images go through the high-resolution branches in the early layers and the low-resolution branches in the later layers, but some images break this pattern. For example, images pointed to by the green circle activate high-resolution branches in the 13th block of the network. These images usually contain a complex pattern that the network needs to process at high resolution to classify correctly. Images pointed to by the purple circle activate low-resolution branches at early layers, in the 4th block of the network. These images usually contain a simple pattern that the network can classify at low resolution early on. In Fig