ELASTIC: Improving CNNs with Instance Specific Scaling Policies

Huiyu Wang
Aniruddha Kembhavi
Ali Farhadi
A. Yuille
Mohammad Rastegari
ArXiv
2018

Abstract

Scale variation has been a challenge from traditional to modern approaches in computer vision. Most solutions to scale issues share a similar theme: a set of intuitive and manually designed policies that are generic and fixed (e.g. SIFT or feature pyramid). We argue that the scale policy should be learned from data. In this paper, we introduce ELASTIC, a simple, efficient and yet very effective approach to learn an instance-specific scale policy from data. We formulate the scaling policy as a non-linear function inside the network's structure that (a) is learned from data, (b) is instance specific, (c) does not add extra computation, and (d) can be applied to any network architecture. We applied ELASTIC to several state-of-the-art network architectures and showed consistent improvement without extra (sometimes even lower) computation on ImageNet classification, MSCOCO multi-label classification, and PASCAL VOC semantic segmentation. Our results show major improvement for images with scale challenges, e.g. images with several small objects or objects with large scale variations. Our code and models will be publicly available soon.

1. Introduction

Scale variation has been one of the main challenges in computer vision. There is a rich literature on different approaches to encode scale variations in computer vision algorithms [19]. In feature engineering, there have been manually prescribed solutions that offer scale robustness. Examples include searching for scale first and then extracting features based on a known scale, as used in SIFT, and using feature pyramids. Some of these ideas have also been migrated to feature learning using deep learning in modern recognition solutions.

The majority of the solutions in old-school and even modern approaches to encode scale are manually designed and fixed. For example, most state-of-the-art image classification networks [15, 30, 9, 13, 37, 41] use the feature pyramid policy, where a network looks at the larger resolution first and then goes to smaller ones as it proceeds through the layers. Although this common practice seems to be a natural and intuitive choice, we argue that this scale policy is not necessarily the best one for all possible scale variations in images. We claim that an ideal scale policy should (1) be learned from the data; (2) be instance specific; (3) not add extra computational burden; and (4) be applicable to any network architecture. For example, if we process the images in Figure 1 based on a learned and instance-specific policy, instead of looking at the scales according to the feature pyramid policy, we see an improved performance. In images with scale challenges, like the golf ball image in Figure 1, the learned scale policy might differ dramatically from a pyramid policy, resulting in correct classification of that instance. The learned policy for this instance starts by looking at the image at a large scale (dark blue color), then goes immediately to a smaller scale, then back to a large scale, followed by a smaller scale, and so on.

* Work done while an intern at AI2.

Figure 1: Instance-specific scale policy. The scaling policy in CNNs is typically integrated in the network architecture manually in a pyramidal fashion. The color bar in this figure (second row) shows the scales at different blocks of the ResNeXt50 architecture. The early layers receive eXtra-large resolutions, and in the following layers resolutions decrease to Large, Medium, and Small. We argue that the scaling policy in CNNs should be instance-specific. Our Elastic model (the third row) allows different scaling policies for different input images, and it learns from the training data how to pick the best policy. For scale challenging images, e.g. images with lots of small (or diverse scale) objects, it is crucial that the network can adapt its scale policy based on the input. As can be seen in this figure, Elastic gives a better prediction for these scale challenging images.

In this paper we introduce ELASTIC, an approach to learn instance-specific and not-necessarily-pyramidal scale policies with no extra (or lower) computational cost. Our solution is simple, efficient and very effective on a wide range of network architectures for image classification and segmentation. Our Elastic model can be applied to any CNN architecture simply by adding down-samplings and up-samplings in parallel branches at each layer and letting the network learn from data a scaling policy in which inputs are processed at different resolutions in each layer. We named our model ELASTIC because each layer in the network is flexible in terms of choosing the best scale by a soft policy.

Our experimental evaluations show improvements in image classification on ImageNet [28], multi-label classification on MSCOCO [18], and semantic segmentation on PASCAL VOC for ResNeXt [34], SE-ResNeXt [11], DenseNet [13], and Deep Layer Aggregation (DLA) [37] architectures. Furthermore, our results show major improvements (about 4%) on images with scale challenges (lots of small objects or large variation across scales within the same image) and lower improvements for images without scale challenges. Our qualitative analysis shows that images with similar scaling policies (over the layers of the network) share a similar complexity pattern in terms of the scales of the objects appearing in the image.

2. Related Work

The underlying idea behind Elastic is conceptually simple, and there are several approaches in the literature using similar concepts. Therefore, we study all the categories of related CNN models and clarify the differences and similarities to our model. There are several approaches to fusing information at different visual resolutions. The majority of these approaches fall into four categories (depicted in Figure 2 (b-e)).

Figure 2: Multi-scaling model structures. This figure illustrates different approaches to multi-scaling in CNN models and our Elastic model. The solid-line rectangles show the input size and the dashed-line rectangles show the filter size.

Image pyramid: An input image is passed through a model multiple times at different resolutions and predictions are made independently at all levels. The final output is computed as an ensemble of outputs from all resolutions. This approach has been a common practice in [4, 5, 29] .

Loss pyramid: This method enforces multiple loss functions at different resolutions. [32] uses this approach to improve the utilization of computing resources inside the network. SSD [20] and MS-CNN [2] also use losses at multiple layers of the feature hierarchy.

Filter pyramid: Each layer is divided into multiple branches of convolution with different filter sizes (typically referred to as the split-transform-merge architecture). The variation in filter sizes results in capturing features at different resolutions, but with additional parameters and operations. The Inception family of networks [32, 33, 31] uses this approach. To further reduce the complexity of the filter pyramid, [24, 35, 36] use dilated convolutions to cover a larger receptive field with the same number of FLOPs.

Feature pyramid: This is the most common approach to incorporate multiple scales in a CNN architecture. This approach fuses features from different resolutions in a network by either concatenation or summation. Fully convolutional networks [22] add up the scores from multiple scales to compute the final class score. Hypercolumns [7] use earlier layers in the network to capture low-level information to describe a pixel in a vector. Several other approaches (HyperNet [14], ParseNet [21], and ION [1]) concatenate the outputs from multiple layers to compute the output for the last layer. Several recent methods, including SharpMask [26] and U-Net [27] for segmentation, Stacked Hourglass networks [25] for keypoint estimation, and Recombinator networks [10] for face detection, have used skip connections to incorporate low-level feature maps on multiple resolutions and semantic levels. [12] extends the principles of DenseNet [13] to fuse features across different resolution blocks. Feature pyramid networks (FPNs) [17] are designed to normalize resolution and equalize semantics across the levels of a pyramidal feature resolution hierarchy through top-down and lateral connections. Likewise, [37] proposes an iterative deep aggregation that raises resolution, but further deepens the representation by nonlinear and progressive fusion. DLA [38] proposes to aggregate information across different scales in the CNN in a hierarchical fashion.

Elastic resembles models from the Filter pyramid family as well as the Feature pyramid family, in that it introduces parallel branches of computation (a la Filter pyramid) and also fuses information from different scales (a la Feature pyramid). The major difference from the feature pyramid models is that in Elastic every layer in the network considers information at multiple scales on its own, whereas in a feature pyramid the information for higher or lower resolution is injected from other layers. Elastic provides an exponential number of scaling policies across the layers and yet keeps the computational complexity the same as (or even lower than) the base model. The major difference from the filter pyramid is that the number of FLOPs needed to cover a larger receptive field in Elastic is proportionally lower due to the downsampling, whereas in the filter pyramid the FLOPs are higher than or the same as the original convolution, depending on the filter size or dilation parameters.

3. Model

In this section, we elaborate on the structure of our proposed Elastic model and illustrate how standard CNN architectures are augmented with it. We also contrast our model with other multi-scale approaches.

3.1. Scale Policy in CNN Blocks

Formally, a layer in a convolutional neural network can be expressed as

$$F(x) = \sigma\left(\sum_{i=1}^{q} T_i(x)\right) \quad (1)$$

where $q$ is the number of branches to be aggregated, $T_i(x)$ can be an arbitrary function (normally it is a combination of convolution, batch normalization and activation), and $\sigma$ are nonlinearities. A few $F(x)$ are stacked into a block to process information in one spatial resolution. Blocks with decreasing spatial resolutions are stacked to integrate a pyramid scale policy in the network architecture. A network example of 3 blocks with 2 layers in each block can be expressed as

$$N = F_{32} \circ F_{31} \circ D_{r_2} \circ F_{22} \circ F_{21} \circ D_{r_1} \circ F_{12} \circ F_{11} \quad (2)$$

where $D_{r_i}$ indicates the resolution decrease by ratio $r_i > 1$ after a few blocks. $D_{r_i}$ can be simply implemented by increasing the stride in the convolution right after. For example, ResNeXt [34] stacks bottleneck layers in each resolution and uses convolution with stride 2 to reduce spatial resolution. This leads to a fixed scaling policy that enforces a linear relation between the number of layers and the effective receptive field of those layers. The parameters of $T_i(x)$ and the elements in the input tensors $x$ are all of the tangible ingredients in a CNN that define the computational capacity of the model. Under a fixed computational capacity measured by FLOPs, to improve the accuracy of such a model, we can either increase the number of parameters in $T_i(x)$ and decrease the resolution of $x$, or increase the resolution of $x$ and decrease the number of parameters in $T_i(x)$. By adjusting the input resolution and the number of parameters at each layer, we can define a scaling policy across the network. We argue that finding the optimal scaling policy (a trade-off between the resolution and the number of parameters in each layer) is not trivial. There are several model designs that increase accuracy by manually injecting variations of the feature pyramid, but most of them come at the cost of higher FLOPs and more parameters in the network. In the next section, we explain our solution, which can learn an optimal scaling policy and maintain or reduce the number of parameters and FLOPs while improving the accuracy.
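To make the fixed pyramid policy of Eq. 2 concrete, the following minimal sketch traces the resolution seen by each layer in a network of three blocks with two layers each. The helper name `pyramid_resolutions`, the starting resolution of 56, and the ratios r_1 = r_2 = 2 are illustrative assumptions, not values taken from the paper.

```python
def pyramid_resolutions(input_size, layers_per_block, ratios):
    """Spatial size seen by each layer under the fixed pyramid policy of Eq. 2.

    `ratios[i]` is the downsampling ratio D_r applied after block i.
    Hypothetical helper; the numbers used below are illustrative only.
    """
    sizes = []
    size = input_size
    for block, n_layers in enumerate(layers_per_block):
        sizes.extend([size] * n_layers)  # all layers F in a block share one resolution
        if block < len(ratios):
            size //= ratios[block]       # D_r: resolution drops between blocks
    return sizes


# Three blocks of two layers, with 2x downsampling between blocks.
print(pyramid_resolutions(56, [2, 2, 2], [2, 2]))
```

Every input follows the same monotonically decreasing sequence of resolutions, which is the sense in which the policy is fixed rather than instance specific.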

3.2. The Elastic Structure

In order to learn image features at different scales, we propose to add down-samplings and up-samplings in parallel branches at each layer and let the network decide how to adjust its processing toward various resolutions at each layer. Networks can learn this policy from training data. We divide all the parameters of a layer across these branches as follows:

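$$F(x) = \sigma\left(\sum_{i=1}^{q} U_{r_i}\left(T_i\left(D_{r_i}(x)\right)\right)\right) \quad (3)$$

$$N = F_{32} \circ F_{31} \circ F_{22} \circ F_{21} \circ F_{12} \circ F_{11} \quad (4)$$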

where $D_{r_i}(x)$ and $U_{r_i}(x)$ are respectively downsampling and upsampling functions, which change the spatial resolution of features in a layer. Unlike in Equation 2, a few $F$ are applied sequentially without downsampling the main stream, and $N(x)$ has exactly the same resolution as the original $x$. Note that the learned scaling policy in this formulation is instance-specific, i.e. for different image instances, the network may activate branches in different resolutions at each block¹. In Section 4 we show that this instance-specific scaling policy improves prediction on images with scale challenges, e.g. images consisting of lots of small objects or highly diverse object sizes.
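The structure of Eq. 3 can be illustrated with a toy, single-channel sketch in plain Python. This is not the authors' implementation: the helper names are hypothetical, each T_i is an arbitrary grid-to-grid function supplied by the caller, downsampling is average pooling, and upsampling is nearest-neighbour for brevity (the experiments in Section 4 use bilinear up/downsampling). The grid height and width are assumed to be divisible by every ratio.

```python
def avg_pool(x, r):
    """D_r: downsample a 2-D grid (list of rows) by averaging r x r tiles."""
    h, w = len(x), len(x[0])
    return [[sum(x[i * r + di][j * r + dj] for di in range(r) for dj in range(r)) / (r * r)
             for j in range(w // r)]
            for i in range(h // r)]


def upsample_nearest(x, r):
    """U_r: upsample a 2-D grid by repeating each value r times along both axes."""
    return [[v for v in row for _ in range(r)] for row in x for _ in range(r)]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def elastic_layer(x, branches):
    """Toy form of Eq. 3: F(x) = relu(sum_i U_ri(T_i(D_ri(x)))).

    `branches` is a list of (r, T) pairs; r is the scaling ratio of the branch
    and T maps a grid to a grid of the same size (a stand-in for conv + BN).
    """
    h, w = len(x), len(x[0])
    total = [[0.0] * w for _ in range(h)]
    for r, transform in branches:
        low = avg_pool(x, r) if r > 1 else x             # process at reduced resolution
        out = transform(low)
        up = upsample_nearest(out, r) if r > 1 else out  # return to main-stream resolution
        for i in range(h):
            for j in range(w):
                total[i][j] += up[i][j]
    return relu(total)


identity = lambda g: [row[:] for row in g]
x = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 4, 4],
     [0, 0, 4, 4]]
y = elastic_layer(x, [(1, identity), (2, identity)])
print(len(y), len(y[0]))  # output keeps the input resolution
```

The point of the sketch is that the output has the same resolution as the input regardless of the ratios used inside the branches, so layers can be stacked as in Eq. 4 without downsampling the main stream.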

Conceptually, we propose a new structure where information is always kept at a high spatial resolution, and each block or branch processes information at a lower or equal resolution. In this way we decouple feature processing resolution (T i processes information at different resolutions) from feature storage resolution (the main stream resolution of the network).

This way of expanding the layers encourages the network to process different resolutions separately in different branches of a layer and thus to capture cross-scale information. More interestingly, since the network has multiple resolution options at each layer, this leads to an exponential relation between the number of layers and the number of different scaling policies. In fact, this intuition is aligned with our experimental evaluation, where we have observed that different categories of images adopt different scaling policies (see Section 4.1.1). For example, categories with clean and uniform background images mostly choose the low-resolution paths across the network, and categories with complex and cluttered objects and background mostly choose the high-resolution paths across the network. The computational cost of our Elastic model is equal to or lower than that of the base model, because at each layer the maximum resolution is the original resolution of the input tensor. We apply downsampling on some branches before they are processed by the next convolution, which reduces the computation and gives us extra room for adding more layers to match the computation of the original model. This simple add-on of downsamplings and upsamplings (Elastic) can be applied to any CNN layers $T_i(x)$ in any architecture to improve the accuracy of a model. Our applications are introduced in the next section.
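As a rough illustration of the exponential relation mentioned above, the sketch below counts the discrete extremes of the policy space, assuming each layer picks exactly one of its available resolutions. Elastic itself uses a soft policy that mixes branches, so this count is only a simplified view; the function name is hypothetical, and the 17 blocks correspond to the ResNeXt50+Elastic model analyzed in Section 4.1.1.

```python
def num_hard_policies(num_layers, resolutions_per_layer):
    """Number of distinct layer-wise resolution choices if each layer picked one branch."""
    return resolutions_per_layer ** num_layers


# Two resolutions (high / low) per block, for a few network depths.
for layers in (1, 4, 17):
    print(layers, num_hard_policies(layers, 2))
```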

3.3. Augmenting Models With Elastic

Now, we show how to apply Elastic to different network architectures. To showcase the power of Elastic, we apply it to some state-of-the-art network architectures: ResNeXt [34], Deep Layer Aggregation (DLA) [38], and DenseNet [13]. A natural way of applying Elastic to current classification models is to augment bottleneck layers with multiple branches. This makes our modifications to ResNeXt and DLA almost identical. At each layer we apply downsampling and upsampling to a portion of the branches, as shown in Figure 3-left. In DenseNet we construct an equivalent version by parallelizing a single branch into two branches and then apply downsampling and upsampling on some of the branches, as shown in Figure 3-right. Note that applying Elastic reduces the FLOPs in each layer. To match the original FLOPs, we increase the number of layers in the network while dividing a similar number of FLOPs across resolutions. The detailed architecture of our model can be found in Appendix B.

Figure 3: Left: ResNeXt bottleneck vs. Elastic bottleneck. Right: DenseNet block vs. its equivalent form vs. Elastic block. Elastic blocks spend half of the paths processing downsampled inputs at a low resolution; the processed features are then upsampled and added back to the features at the original resolution. Elastic blocks have the same number of parameters and fewer FLOPs than the original blocks.

Relation to other multi-scaling approaches. As discussed in Section 2, most of the current multi-scaling approaches can be categorized into four different categories: (1) image pyramid, (2) loss pyramid, (3) filter pyramid, and (4) feature pyramid. Figure 2 (b-e) demonstrates the structure of these categories.

All of these models can improve the accuracy, usually under a higher computational budget. Elastic (Figure 2) guarantees no extra computational cost while achieving better accuracy. The filter pyramid is the model most similar to Elastic's structure. The major difference from the filter pyramid is that the number of FLOPs needed to cover a larger receptive field in Elastic is proportionally lower due to the downsampling, whereas in the filter pyramid the FLOPs are higher than or the same as the original convolution, depending on the filter size or dilation parameters. Table 1 compares the FLOPs and number of parameters between Elastic and feature/filter pyramid for a single convolutional operation. Note that the FLOPs and parameters in Elastic are always (under any branching $q$ and scaling ratio $r$) lower than or equal to those of the original model, whereas in the filter/feature pyramid they are higher or equal. Feature pyramid methods are usually applied on top of an existing classification model, by concatenating features from different resolutions. Such a method is capable of merging features from different scales in the backbone model and shows improvements on various tasks, but it does not intrinsically change the scaling policy. Our Elastic structure can be viewed as a feature pyramid inside a layer, which is able to model different scaling policies. Spatial pyramid pooling and atrous (dilated) spatial pyramid pooling share the same limitation as feature pyramid methods.

Multi-Scaling Method | FLOPs | Parameters
Single Scale | $n^2 c k^2$ | $c k^2$
Feature Pyramid (concat) | $n^2 (qc) k^2$ | $(qc) k^2$
Feature Pyramid (add) | $n^2 c k^2$ | $c k^2$
Filter Pyramid (standard) | $\sum_{i=1}^{q} \frac{n^2 c (k r_i)^2}{b_i}$ | $\sum_{i=1}^{q} \frac{c (k r_i)^2}{b_i}$
Filter Pyramid (dilated) | $n^2 c k^2$ | $c k^2$
Elastic | $\sum_{i=1}^{q} \left(\frac{n}{r_i}\right)^2 \frac{c k^2}{b_i}$ | $c k^2$

Table 1: Computation in multi-scaling models. This table compares the FLOPs and number of parameters between Elastic and feature/filter pyramid for a single convolutional operation, where the input tensor is $n \times n \times c$ and the filter size is $k \times k$. $q$ denotes the number of branches in the layer, where $\sum_{i=1}^{q} \frac{1}{b_i} = 1$, and $b_i > 1$ and $r_i > 1$ denote the branching and scaling ratio respectively. Note that the FLOPs and parameters in Elastic are always (under any branching $q$ and scaling ratio $r$) lower than or equal to those of the original model, whereas in the feature/filter pyramid they are higher or equal.
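The comparison in Table 1 can be checked numerically with a short sketch. It assumes the reading of the table in which each branch receives a fraction 1/b_i of the channels (so the fractions sum to 1) and processes an input of side n/r_i; the function names are hypothetical. For illustration only, the example uses r = 1 for the branch that stays at full resolution and n = 56, c = 64, k = 3, none of which are values from the paper.

```python
def single_scale(n, c, k):
    """FLOPs and parameters of one k x k convolution on an n x n x c input."""
    return n * n * c * k * k, c * k * k


def filter_pyramid(n, c, k, ratios, b):
    """Standard filter pyramid: branch i uses a (k * r_i) x (k * r_i) filter on 1/b_i of the channels."""
    flops = sum(n * n * c * (k * r) ** 2 / bi for r, bi in zip(ratios, b))
    params = sum(c * (k * r) ** 2 / bi for r, bi in zip(ratios, b))
    return flops, params


def elastic(n, c, k, ratios, b):
    """Elastic: branch i keeps the k x k filter but sees an (n / r_i) x (n / r_i) input."""
    flops = sum((n / r) ** 2 * c * k * k / bi for r, bi in zip(ratios, b))
    params = sum(c * k * k / bi for bi in b)
    return flops, params


base_flops, base_params = single_scale(56, 64, 3)
e_flops, e_params = elastic(56, 64, 3, ratios=(1, 2), b=(2, 2))
f_flops, f_params = filter_pyramid(56, 64, 3, ratios=(1, 2), b=(2, 2))

print("Elastic FLOPs relative to single scale:", e_flops / base_flops)
print("Filter pyramid FLOPs relative to single scale:", f_flops / base_flops)
print("Parameters (single, elastic, filter pyramid):", base_params, e_params, f_params)
```

Under these assumptions, splitting the channels evenly between a full-resolution branch and a 2x-downsampled branch leaves the parameter count unchanged and lowers the FLOPs, while the filter pyramid with the same split increases both.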

4. Experiments

In this section, we present experiments on applying Elastic to current strong classification models. We evaluate their performance on ImageNet classification, and we show consistent improvements over current models. Furthermore, in order to show the generality of our approach, we transfer our pretrained Elastic models to multi-label image classification and semantic segmentation. We use ResNeXt [34], DenseNet [13] and DLA [37] as our base models to be augmented with Elastic.

Implementation details. We use the official PyTorch ImageNet codebase with random crop augmentation but no color or lighting augmentation, and we report standard 224×224 single-crop error on the validation set. We train our model with 8 workers (GPUs) and 32 samples per worker. Following DLA [37], all models are trained for 120 epochs with an initial learning rate of 0.1, which is divided by 10 at epochs 30, 60, and 90. We initialize our models using normal He initialization [8]. Stride-2 average poolings are adopted as our downsamplings unless otherwise noted, since most of our downsamplings are 2× downsamplings, in which case bilinear downsampling is equivalent to average pooling. Also, the Elastic add-on is applied to all blocks except stride-2 ones and high-level blocks operating at resolution 7.
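The equivalence between 2× bilinear downsampling and stride-2 average pooling can be verified with a small plain-Python sketch. It assumes the half-pixel-center sampling convention (the one used, for example, by PyTorch's `F.interpolate` with `mode='bilinear'` and `align_corners=False`), no anti-aliasing, and even input sizes; the function names are hypothetical.

```python
def bilinear_downsample_2x(x):
    """2x bilinear downsampling of a 2-D grid with half-pixel-center sampling."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(h // 2):
        row = []
        for j in range(w // 2):
            ys = (i + 0.5) * 2 - 0.5  # source row coordinate, equals 2i + 0.5
            xs = (j + 0.5) * 2 - 0.5  # source column coordinate, equals 2j + 0.5
            y0, x0 = int(ys), int(xs)
            wy, wx = ys - y0, xs - x0  # both weights are exactly 0.5
            top = (1 - wx) * x[y0][x0] + wx * x[y0][x0 + 1]
            bottom = (1 - wx) * x[y0 + 1][x0] + wx * x[y0 + 1][x0 + 1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def avg_pool_2x(x):
    """Stride-2 average pooling over non-overlapping 2 x 2 tiles."""
    h, w = len(x), len(x[0])
    return [[(x[2 * i][2 * j] + x[2 * i][2 * j + 1]
              + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4
             for j in range(w // 2)]
            for i in range(h // 2)]


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(bilinear_downsample_2x(grid) == avg_pool_2x(grid))
```

Each output sample falls exactly at the center of a 2 × 2 tile, so the four bilinear weights are all 1/4 and the result matches the tile average.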

4.1. ImageNet Classification

We evaluate Elastic on the ImageNet [28] 1000-way classification task (ILSVRC2012). The ILSVRC 2012 dataset contains 1.2 million training images and 50 thousand validation images. In this experiment we show that our Elastic add-on consistently improves the accuracy of state-of-the-art models without introducing extra computation or parameters. Table 2 compares the top-1 and top-5 error rates of all of the base models with their Elastic-augmented counterparts (indicated by '+Elastic') and shows the number of parameters and FLOPs used for a single inference. In all the tables, "*" denotes our implementation of the model. Besides DenseNet, ResNeXt, and DLA, we also report SE-ResNeXt50+Elastic, which shows that our improvement is almost orthogonal to the channel calibration proposed in [11]. In addition, we include ResNeXt50x2+Elastic to show that our improvement does not come from more depth added to ResNeXt101. In Figure 4 we project the numbers in Table 2 into two plots: accuracy vs. number of parameters (Figure 4-left) and accuracy vs. FLOPs (Figure 4-right). These plots show that our Elastic model can reach a higher accuracy without any extra (or with lower) computational cost.

Figure 4: ImageNet accuracy vs. FLOPs and parameters. This figure shows that our Elastic model can achieve a lower error without any extra (or with lower) computational cost.

Table 2: State-of-the-art model comparisons on ImageNet validation set. Base models (DenseNet, ResNeXt, and DLA) are augmented by Elastic (indicated by ’+Elastic’). * indicates our implementation of these models. Note that augmenting with Elastic always improves accuracy across the board.

4.1.1 Scale Policy Analysis

To analyze the learned scale policy of our Elastic model, we define a simple score that shows, at each block, the resolution level (high or low) at which the input tensor was processed. We formally define this scale policy score at each block as the difference of activations in the high-resolution and low-resolution branches, $S = A_{highres} - A_{lowres}$, where $A_{highres}$ and $A_{lowres}$ are the mean activations after the ReLUs following the 3 × 3 convolutions in the high-resolution and low-resolution branches respectively. Figure 5 shows all of the categories in the ImageNet validation set sorted by the average scale policy score $S$ (averaged over all of the layers for all images within a category). As can be seen, categories with more complex images tend to have a larger $S$, i.e. they mostly go through the high-resolution branches in each block, and images with simpler patterns tend to have a smaller $S$, which means they mostly go through the low-resolution branches in each block.

To analyze the impact of the scale policy on the accuracy of Elastic, we represent each image in the ImageNet validation set by a 17-dimensional vector whose elements are the scale policy scores $S$ for the 17 Elastic blocks in a ResNeXt50+Elastic model. Then we apply t-SNE [23] to all these vectors to get a two-dimensional visualization. In Figure 6-(left) we draw all the images in the t-SNE coordinates. It can be seen that images are clustered based on their complexity pattern. In Figure 6-(middle) we show, for all of the images, the 17 scale policy scores $S$ in the 17 blocks. As can be seen, most of the images go through the high-resolution branches in the early layers and the low-resolution branches in the later layers, but some images break this pattern. For example, images indicated by the green circle activate high-resolution branches in the 13th block of the network. These images usually contain a complex pattern from which the network needs to extract features in high resolution to classify correctly. Images indicated by the purple circle activate low-resolution branches at early layers, in the 4th block of the network. These images usually contain a simple pattern that the network can classify at low resolution early on. In Fig

Figure 5: Scale policy for complex vs. simple image categories. This figure shows the overall block scale policy score over all ImageNet categories. It shows that categories with complex image patterns mostly go through the high-resolution branches in the network and categories with simpler image patterns go through the low-resolution branches.

Figure 6: Scale policy analysis. This figure shows the impact of the scale policy on the accuracy of our Elastic model. (left) shows all the ImageNet validation set images clustered using t-SNE by their scale policy pattern in ResNeXt50+Elastic, as discussed in Section 4.1.1. (middle) shows the scale policy score of all the images at the 17 blocks of the network. Most of the images use high-resolution features at early layers and low-resolution features at later layers, but some images break this pattern. Images indicated by the green circle use high-resolution features in the 13th block. Images indicated by the purple circle use low-resolution features in the 4th block. These images usually contain a simpler pattern. (right)-bottom shows the density of images in the t-SNE space and (right)-top shows the density of the images that were correctly classified by the Elastic model but misclassified by the base ResNeXt model. This shows that Elastic can improve prediction when images are challenging in terms of their scale information. Some samples are indicated by the yellow circle. Best viewed in color.
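The scale policy score of Section 4.1.1 can be sketched as follows, assuming the post-ReLU activations of the high- and low-resolution branches of each block have already been collected as flat lists of numbers. The function names and the toy activation values are hypothetical; in the paper's analysis each image yields one score per Elastic block (17 for ResNeXt50+Elastic), and these vectors are then embedded with t-SNE.

```python
def mean(values):
    return sum(values) / len(values)


def scale_policy_score(high_res_acts, low_res_acts):
    """S = A_highres - A_lowres for one block, using mean post-ReLU activations."""
    return mean(high_res_acts) - mean(low_res_acts)


def policy_vector(blocks):
    """One score per block; `blocks` is a list of (high_res_acts, low_res_acts) pairs."""
    return [scale_policy_score(high, low) for high, low in blocks]


# Toy activations for a two-block network (made-up numbers).
toy_blocks = [
    ([1.0, 3.0], [0.5, 0.5]),  # block dominated by the high-resolution branch
    ([0.0, 0.0], [2.0, 4.0]),  # block dominated by the low-resolution branch
]
print(policy_vector(toy_blocks))
```

A positive score indicates that the block relied more on its high-resolution branch for that image, and a negative score indicates the opposite.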

4.2. MS COCO Multi-Label Classification

To further investigate the generality of our model, we finetune our ImageNet pretrained model and evaluate it on the MS COCO multi-label classification task. The MSCOCO images are far more complicated in that there exist multiple objects from different categories and scales in each image.

Implementation details. All models that we report are finetuned from the ImageNet pre-trained model for 36 epochs, with the learning rate starting at 0.001 and being divided by 10 at epochs 24 and 30. We train on 4 workers with 24 images per worker, using SGD and a weight decay of 0.0005. We train our models with binary cross entropy (BCE) loss, which is usually used as a baseline for domain-specific works that explicitly model spatial or semantic relations. We use the same data augmentations as in our ImageNet training, and adopt standard multi-label testing on images resized to 224 × 224.
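The step schedules described in the implementation details can be written as a small helper. This is a sketch under the assumption that epochs are counted from 0 and that the rate drops at the start of each milestone epoch; the function name is hypothetical.

```python
def step_lr(epoch, base_lr, milestones):
    """Learning rate at `epoch`, divided by 10 for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (10 ** drops)


# MS COCO fine-tuning: 36 epochs, drops at epochs 24 and 30.
coco_schedule = [step_lr(e, 0.001, (24, 30)) for e in range(36)]

# ImageNet training (Section 4): 120 epochs, drops at epochs 30, 60 and 90.
imagenet_schedule = [step_lr(e, 0.1, (30, 60, 90)) for e in range(120)]

# Total batch size is workers x samples per worker.
coco_batch = 4 * 24
imagenet_batch = 8 * 32

print(coco_schedule[0], coco_schedule[24], coco_schedule[30])
print(coco_batch, imagenet_batch)
```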

Evaluation metrics. Following the literature on multi-label classification [40, 6, 39, 16], results are evaluated using macro/micro evaluations. After training the models with BCE loss, labels with a probability greater than 0.5 are considered positive. Then, micro and macro F1-scores are calculated to measure the overall performance and the average of per-class performances respectively.
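A minimal sketch of this evaluation is given below. It uses one common definition, in which micro-F1 pools true positives, false positives and false negatives over all classes and macro-F1 averages the per-class F1 scores; some multi-label papers instead derive the per-class F1 from averaged precision and recall, so the exact variant used for the reported numbers is an assumption here. The function names and the toy data are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 score from counts; defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(probs, labels, threshold=0.5):
    """Micro (overall) and macro (per-class average) F1 for multi-label predictions.

    `probs` and `labels` are lists of per-image rows with one entry per class.
    """
    num_classes = len(labels[0])
    tp, fp, fn = [0] * num_classes, [0] * num_classes, [0] * num_classes
    for p_row, y_row in zip(probs, labels):
        for c in range(num_classes):
            predicted = p_row[c] > threshold  # label is positive above 0.5 probability
            if predicted and y_row[c]:
                tp[c] += 1
            elif predicted:
                fp[c] += 1
            elif y_row[c]:
                fn[c] += 1
    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in range(num_classes)) / num_classes
    return micro, macro


# Three toy images, two classes (made-up values).
toy_probs = [[0.9, 0.2], [0.8, 0.6], [0.1, 0.7]]
toy_labels = [[1, 0], [1, 1], [1, 0]]
print(micro_macro_f1(toy_probs, toy_labels))
```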

Results. Table 3 shows that Elastic consistently improves per-class F1 and overall F1. In the case of DLA, Elastic augmentation even reduces the FLOPs and number of parameters by a large margin.

Table 3: MSCOCO multi-label classification. This table shows the generality of our Elastic model by finetuning the pretrained ImageNet model on MSCOCO multi-label images with binary cross entropy loss. Elastic improves the F1 scores across the board.

Scale challenging images. We claimed that Elastic is very effective on scale challenging images. We now show empirically that a large portion of the accuracy improvement of our Elastic model is rooted in better scale policy learning. To illustrate this, we investigated the relative accuracy improvement of the ResNeXt model augmented with Elastic only on images with specific object sizes. The MSCOCO dataset labels object sizes in three categories: small, medium, and large. Table 4 shows the accuracy improvement of Elastic over all possible object scales. As can be seen, the best improvement of Elastic comes from images with small objects, and the second best from images with a combination of small and large objects. This supports our hypothesis that Elastic helps to better understand scale challenging images through scale policy learning.

Table 4: Scale challenging images. This table shows the relative accuracy improvement of our Elastic models on images with different object scales in MSCOCO. The improvement from Elastic comes mainly from images with small objects. This shows that Elastic is very effective on scale challenging images.

4.3. PASCAL VOC Semantic Segmentation

To show the strength of our Elastic model on a pixel-level classification task, we report experiments on PASCAL VOC semantic segmentation. ResNeXt models use weight decay 0.0005 instead of the 0.0001 used in ResNet. All models are trained for 50 epochs and we report mean intersection-over-union (IOU) on the validation set. Other implementation details follow [3], with MG(1, 2, 4), ASPP(6, 12, 18), image pooling, OS=16, and a batch size of 16 for both training and validation, without bells and whistles. Our ResNet101 reproduces the mIOU of 77.21% reported in [3]. Our DLA models use the original iterative deep aggregation as a decoder and are trained with the same scheduling as [3]. In Table 5, Elastic shows a large margin of improvement. This supports the view that Elastic finds a scale policy that allows processing high-level semantic information and low-level boundary information together, which is critical in the task of semantic segmentation.
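For reference, mean intersection-over-union can be computed as in the following sketch, which accumulates per-class intersections and unions over flat lists of predicted and ground-truth labels. The function name is hypothetical; the `ignore_index` of 255 follows the PASCAL VOC convention for void pixels, and averaging only over classes that appear in the prediction or the ground truth is an assumption of this sketch.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes, from flat lists of predicted and ground-truth labels."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue              # void pixels do not count toward any class
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1         # false positive for the predicted class
            union[t] += 1         # false negative for the true class
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Six toy pixels, three classes; the last pixel is void.
toy_pred = [0, 0, 1, 1, 2, 2]
toy_target = [0, 1, 1, 1, 2, 255]
print(mean_iou(toy_pred, toy_target, num_classes=3))
```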

Table 5: PASCAL VOC semantic segmentation. This table compares the accuracy of semantic image segmentation (mIOU%) using Elastic models vs. the original models. Elastic models outperform the original models by a large margin. This supports the view that Elastic learns a scale policy that allows processing high-level semantic information and low-level boundary information together.

Table 6: Ablation study of up(down)sampling methods. In this table we show the accuracy of ImageNet classification using Elastic by different choices of up(down)sampling methods. w/ AP indicates average pooling. Our experiment shows Elastic with bilinear up(down)sampling is the best choice with reduced FLOPs.

4.4. Ablation Study

In this section we study the effect of different elements in our Elastic model. We chose DLA-X60 as our base model and applied Elastic to perform the ablation experiments.

Table 7: Ablation study of high(low) resolution branching rates. In this table we evaluate different branching rate across high and low resolutions at each block. We observe that the best trade-off is when we equally divide the branches into high and low resolutions. Independent of the ratio, all variations of branching are better than the base model.

Upsampling/Downsampling methods. We carried out our experiments with bilinear up(down)sampling on DLA-X60+Elastic. In Table 6 we show the accuracy of ImageNet classification using Elastic with different choices of up(down)sampling methods: Bilinear, Nearest, Trained filters and Trained Dilated filters, with and without average pooling (indicated by w/ AP). Our experiment shows that Elastic with bilinear up(down)sampling is the best choice.

High/Low resolution branching rate. We sweep over different ways of dividing the parallel branches in the blocks between the high and low resolutions. In Table 7 we compare variations in the percentage of branches allocated to high and low resolutions at each block. This experiment shows that the best trade-off is achieved when we divide the branches equally between high and low resolutions. Interestingly, all of the branching options outperform the vanilla model (without Elastic). This shows that our Elastic model is quite robust to this parameter.

5. Conclusion

We proposed Elastic, a model that captures scale variations in images by learning the scale policy from data. Our Elastic model is simple, efficient and very effective. Our model can easily be applied to any CNN architecture and improves accuracy while maintaining the same computation as (or lower than) the original model. We applied Elastic to several state-of-the-art network architectures and showed consistent improvement on ImageNet classification, MSCOCO multi-label classification, and PASCAL VOC semantic segmentation. Our results show major improvement for images with scale challenges, e.g. images consisting of several small objects or objects with large scale variations.

¹ We use the terms "block" and "layer" interchangeably. Each block is a combination of 3 × 3 and 1 × 1 convolutions, batch normalizations and activation functions.