Butterfly Transform: An Efficient FFT Based Neural Architecture Design

Keivan Alizadeh-Vahid
Ali Farhadi
Mohammad Rastegari
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
2020

Abstract

In this paper, we show that extending the butterfly operations from the FFT algorithm to a general Butterfly Transform (BFT) can be beneficial in building an efficient block structure for CNN designs. Pointwise convolutions, which we refer to as channel fusions, are the main computational bottleneck in state-of-the-art efficient CNNs (e.g., MobileNets). We introduce a set of criteria for channel fusion and prove that BFT yields an asymptotically optimal FLOP count with respect to these criteria. By replacing pointwise convolutions with BFT, we reduce the computational complexity of these layers from O(n^2) to O(n log n) with respect to the number of channels. Our experimental evaluations show that our method results in significant accuracy gains across a wide range of network architectures, especially at low FLOP ranges. For example, BFT results in up to a 6.75% absolute Top-1 improvement for MobileNetV1, 4.4% for ShuffleNet V2, and 5.4% for MobileNetV3 on ImageNet under a similar number of FLOPs. Notably, ShuffleNet V2+BFT outperforms state-of-the-art architecture search methods MNasNet, FBNet, and MobileNetV3 in the low FLOP regime.

1 Introduction

Devising convolutional neural networks (CNNs) that can run efficiently on resource-constrained edge devices has attracted considerable research attention. Current state-of-the-art efficient architecture designs are mainly structured to reduce the overparameterization of CNNs [25, 16]. A common design choice is to reduce the FLOPs and parameters of a network by factorizing convolutional layers [18, 32, 28, 41], using a separable depth-wise convolution, into two components: (1) spatial fusion, where each spatial channel is convolved independently by a depth-wise convolution; and (2) channel fusion, where all the spatial channels are linearly combined by 1 × 1 convolutions, known as point-wise convolutions. During spatial fusion, the network learns features from the spatial planes, and during channel fusion, the network learns relations between these features across channels. This is often implemented using 3 × 3 filters for the spatial fusion and 1 × 1 filters for the channel fusion. Inspecting the computational profile of these networks at inference time reveals that the computational burden of the spatial fusion is small compared to that of the channel fusion. In fact, the computational complexity of the point-wise convolutions in the channel fusion is quadratic in the number of channels: O(n^2), where n is the number of channels.
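To make the imbalance between the two components concrete, the following minimal sketch counts multiply-accumulate operations for one depth-wise and one point-wise layer. It assumes stride 1, unchanged spatial size, and equal input and output channel counts; the function names and the example layer size (256 channels on a 14 × 14 feature map) are illustrative choices, not values taken from the paper.

```python
def depthwise_macs(n, h, w, kernel=3):
    """Multiply-accumulates of a depth-wise conv: one kernel x kernel filter per channel."""
    return n * h * w * kernel * kernel


def pointwise_macs(n_in, n_out, h, w):
    """Multiply-accumulates of a 1x1 conv: every output channel reads every input channel."""
    return n_in * n_out * h * w


# Illustrative layer size (an assumption, not a figure from the paper).
n, h, w = 256, 14, 14
dw = depthwise_macs(n, h, w)
pw = pointwise_macs(n, n, h, w)
print(f"depth-wise: {dw}, point-wise: {pw}, ratio: {pw / dw:.1f}")
```

Under these assumptions the ratio is n / 9, so the point-wise share grows linearly with the number of channels.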

These expensive point-wise convolutions during channel fusion are the main focus of this paper. The point-wise convolutions form a fully connected structure between neurons and can be efficiently implemented using a matrix multiplication. The literature on efficient matrix multiplication suggests imposing a structure on this matrix. Low-rank [22] and circulant [2, 9] structures are a few examples of structures that make matrix multiplication more efficient. In the context of representing point-wise convolutions in a neural network, an ideal structure, we argue, should have the following properties. First, the structure should not impose significant limitations on the capacity of the network. In other words, an ideal structure should maintain high information flow through the network; this can be thought of as having a large bottleneck size. Second, the structure should offer efficiency gains; this is often done by minimizing the FLOPs, which in our case translates into having fewer edges in the network's graph. Finally, in an ideal network structure, there should be at least one path from every input node to all output nodes. This enables cross-talk across channels during the fusion; without this property, some input nodes may not receive crucial signals during back-propagation.

In this paper, we introduce the Butterfly Transform (BFT), a light-weight channel fusion method with a complexity of O(n log n) with respect to the number of channels. BFT fuses all the channels in log n layers with O(n) operations at each layer. We show that BFT's network structure is an optimal structure (in terms of FLOPs) that satisfies all of the aforementioned properties of an ideal channel-fusion network. The structure of the BFT network is inspired by the butterfly operations in the Fast Fourier Transform (FFT). These butterfly operations have been heavily optimized on several hardware/software platforms [12, 5, 10], making BFT readily usable in a wide variety of applications.

Our experimental evaluations show that simply replacing the point-wise convolutions with BFT offers significant gains. We have observed that, under a similar number of FLOPs, the butterfly transform consistently improves the accuracy of the original efficient CNN architectures. For example, using BFT in MobileNetV1 0.25 with 37M FLOPs achieves 53.6% top-1 accuracy on the ImageNet dataset [7], and using BFT in ShuffleNet V2 0.5 with 41M FLOPs achieves 61.33% top-1 accuracy.

2 Related Work

Deep neural networks suffer from intensive computations. Several approaches have been proposed to address efficient training and inference in deep neural networks.

Efficient CNN architecture designs: Recent successes in visual recognition tasks, including object classification, detection, and segmentation, can be attributed to the exploration of different CNN designs [24, 33, 15, 23, 35, 20]. To make these network designs more efficient, researchers have factorized convolutions into different steps that enforce distinct focuses on spatial and channel fusion [18, 32]. Other approaches extended the factorization scheme with a sparse structure either in channel fusion [28, 41] or in spatial fusion [29]. [19] forced more connections between the layers of the network but reduced the computation by designing smaller layers. Our method follows the same direction of designing a sparse structure on channel fusion that enables lower computation with a minimal loss in accuracy.

Network pruning: This line of work focuses on reducing the substantial number of redundant parameters in CNNs by pruning out either neurons or weights [13, 14]. Due to their unstructured sparsity, the models learned by these methods cannot be used efficiently on standard compute platforms such as CPUs and GPUs. Therefore, other pruning approaches focus only on pruning out channels rather than individual neurons or weights [17, 43, 11]. These methods drop a channel by monitoring either the average weight values or the average activation values on each channel during training. Our method differs from these methods in that we enforce a predefined sparse channel structure from the start and do not change the structure of the network during training.

Low-rank network design: To reduce the computation in CNNs, [37, 25, 8, 22] exploit the fact that CNNs are over-parameterized. These models learn a linear low-rank representation of the parameters in the network either by post-processing the trained weight tensors or by enforcing a linear low-rank structure during training. A few works enforce a non-linear low-rank structure using a circulant matrix design [2, 9]. These low-rank network structures achieve efficiency at the cost of lowering the information flow from the input channels to the output channels (i.e., they have few bottleneck nodes), whereas our butterfly transform is a non-linear structured low-rank representation that maximizes information flow.

Quantization: Another approach to improving the efficiency of deep networks is the low-bit representation of network weights and neurons using quantization [34, 30, 39, 4, 42, 21, 1]. These approaches use fewer bits (instead of 32-bit high-precision floating points) to represent weights and neurons in the standard training procedure of a network. In the case of extremely low bitwidth (1-bit), [30] had to modify the training procedure to find the discrete binary values for the weights and the neurons in the network. Our method is orthogonal to this line of work, and these methods are complementary to our network.

Neural architecture search: Recently, neural search methods, including reinforcement learning and genetic algorithms, have been proposed to automatically construct network architectures [44, 40, 31, 45, 36, 27]. These methods search over a huge network space (e.g., MNasNet [36] searches over 8K different design choices) using a dictionary of pre-defined search space parameters, including different types of convolutional layers and kernel sizes, to identify a network structure, usually non-homogeneous, that satisfies optimization constraints such as inference time. Recent search-based methods [36, 3, 38] use MobileNetV2 [32] as a basic search block for automatic network design. The main computational bottleneck in most of the search-based methods is in the channel fusion, and our butterfly structure does not exist in any of the predefined blocks of these methods. Our efficient channel fusion can be combined with these models to further improve the efficiency of these networks. Our experiments show that our proposed butterfly structure outperforms recent architecture-search-based models on small network design.

3 Model

In this section, we outline the details of our proposed model. As discussed above, the main computational bottleneck in current efficient neural architecture design is the channel fusion step, which is implemented by a point-wise convolution layer. The input to this layer is a tensor X of size $n_{in} \times h \times w$, where $n_{in}$ is the number of input channels and $w$, $h$ are the width and height, respectively. The size of the weight tensor W is $n_{out} \times n_{in} \times 1 \times 1$, and the output tensor Y is of size $n_{out} \times h \times w$. Without loss of generality, we assume $n = n_{in} = n_{out}$. The complexity of a point-wise convolution layer is $O(n^2 wh)$, and this is mainly influenced by the number of channels $n$. We propose a new layer design, the Butterfly Transform, that has $O((n \log n)wh)$ complexity. This design is inspired by the Fast Fourier Transform (FFT) algorithm, which has been widely used in computational engines for a variety of applications; optimized hardware/software designs exist for the key operations of this algorithm, and they are applicable to our method. In the following subsections, we explain the problem formulation and the structure of our butterfly transform.

3.1 Point-Wise Convolution As Matrix-Vector Products

A point-wise convolution can be defined as a function P as follows:

EQUATION (1): Not extracted; please refer to original document.

This can be written as a matrix product by reshaping the input tensor X to a 2-D matrix $\hat{X}$ of size $n \times (hw)$ (each column vector in $\hat{X}$ corresponds to a spatial vector X[:, i, j]) and reshaping the weight tensor to a 2-D matrix $\hat{W}$ of size $n \times n$,

$$\hat{Y} = \hat{W}\hat{X} \qquad (2)$$

where $\hat{Y}$ is the matrix representation of the output tensor Y. This can be seen as a linear transformation of the vectors in the columns of $\hat{X}$ using $\hat{W}$ as a transformation matrix. For each column, the linear transformation is a matrix-vector product, and its complexity is $O(n^2)$. By enforcing structure on this transformation matrix, one can reduce the complexity of the transformation. However, to be effective as a channel fusion transform, it is critical that this transformation respects the desirable characteristics detailed below.
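The reshaping argument above can be checked on a tiny example. The sketch below assumes the usual definition of a 1 × 1 convolution, Y[o, i, j] = Σ_c W[o, c] · X[c, i, j] (equation 1 is not available in this text, so this definition is an assumption), and uses nested Python lists instead of a tensor library. The function names are hypothetical.

```python
def pointwise_conv(X, W):
    """Direct 1x1 convolution: Y[o][i][j] = sum_c W[o][c] * X[c][i][j]."""
    n_in, h, w = len(X), len(X[0]), len(X[0][0])
    return [[[sum(W[o][c] * X[c][i][j] for c in range(n_in))
              for j in range(w)] for i in range(h)] for o in range(len(W))]


def pointwise_conv_as_matmul(X, W):
    """Same result via reshaping X to an n_in x (h*w) matrix and multiplying by W."""
    n_in, h, w = len(X), len(X[0]), len(X[0][0])
    # Each column of X_hat is the channel vector X[:, i, j] at one pixel.
    X_hat = [[X[c][i][j] for i in range(h) for j in range(w)] for c in range(n_in)]
    Y_hat = [[sum(W[o][c] * X_hat[c][p] for c in range(n_in))
              for p in range(h * w)] for o in range(len(W))]
    # Undo the flattening: column p corresponds to pixel (p // w, p % w).
    return [[[Y_hat[o][i * w + j] for j in range(w)] for i in range(h)]
            for o in range(len(W))]


X = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 channels, 2x2 spatial grid
W = [[1, 0], [1, -1]]                     # 2x2 channel-mixing matrix
print(pointwise_conv(X, W) == pointwise_conv_as_matmul(X, W))
```

Each of the h·w columns costs n² multiplications here, which is the per-column cost that a structured transformation matrix aims to reduce.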

Ideal characteristics of a fusion network:

1) Every-to-all connectivity: There must be at least one path between every input channel and all of the output channels.

2) Maximum bottleneck size: The bottleneck size is defined as the minimum number of nodes in the network that, if removed, would completely cut off the information flow from input channels to output channels (i.e., there would be no path from any input channel to any output channel). The largest possible bottleneck size in a multi-layer network is n.

3) Small edge count: To reduce computation, we expect the network to have as few edges as possible.

4) Equal out-degree within each layer: To enable an efficient matrix implementation of the network, all nodes within each layer must have the same out-degree.

Claim: A multi-layer network with these properties has at least $n \log_2 n$ edges, i.e., $\Omega(n \log n)$ edges.

Proof: Suppose there are $n_i$ nodes in the $i$-th layer. Removing all the nodes in one layer disconnects the inputs from the outputs. Since the network attains the maximum possible bottleneck size $n$, every layer must contain at least $n$ nodes, i.e., $n_i \geq n$. Now suppose that the out-degree of each node at layer $i$ is $d_i$. The number of nodes in layer $i$ that are reachable from a given input channel is at most

$$\prod_{j=0}^{i-1} d_j .$$

Because of the every-to-all connectivity, all of the $n$ nodes in the output layer are reachable. Writing $m$ for the number of layers, we therefore have

$$\prod_{j=0}^{m-1} d_j \geq n, \quad \text{which implies} \quad \sum_{j=0}^{m-1} \log_2(d_j) \geq \log_2(n).$$

The total number of edges is

$$\sum_{j=0}^{m-1} n_j d_j \;\geq\; n \sum_{j=0}^{m-1} d_j \;\geq\; n \sum_{j=0}^{m-1} \log_2(d_j) \;\geq\; n \log_2 n .$$
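As a sanity check on the bound, the sketch below brute-forces small cases. It assumes every layer has exactly n nodes and searches over out-degree sequences whose product is at least n, which is the connectivity condition used in the proof; the function name `min_edges` and the cap on the number of layers are illustrative assumptions.

```python
import itertools
import math


def min_edges(n, max_layers=4):
    """Smallest n * sum(d_j) over out-degree sequences with prod(d_j) >= n.

    Assumes every layer has exactly n nodes and a uniform out-degree d_j >= 2
    (a layer with out-degree 1 adds edges without extending reachability).
    Returns (edge_count, degree_sequence) for the first minimum found.
    """
    best = None
    for m in range(1, max_layers + 1):
        for degrees in itertools.product(range(2, n + 1), repeat=m):
            if math.prod(degrees) >= n:
                edges = n * sum(degrees)
                if best is None or edges < best[0]:
                    best = (edges, degrees)
    return best


for n in (8, 16):
    edges, degrees = min_edges(n)
    print(n, edges, degrees, n * math.log2(n))
```

In these small cases the minimum found is twice $n \log_2 n$, so the inequality holds with room to spare: the bound fixes the growth rate, not the constant. For example, out-degree 2 in each of $\log_2 n$ layers gives $2n \log_2 n$ edges.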

In the following section we present a network structure that satisfies all the ideal characteristics of a fusion network.

3.2 Butterfly Transform (BFT)

As mentioned above, we can reduce the complexity of a matrix-vector product by enforcing structure on the matrix. There are several ways to do so. Here we introduce a family of structured matrices that leads to O(n log n) complexity in both operations and parameters while maintaining accuracy.

Butterfly Matrix: We define $B^{(n,k)}$ as a butterfly matrix of order $n$ and base $k$, where $B^{(n,k)} \in \mathbb{R}^{n \times n}$:

EQUATION (4): Not extracted; please refer to original document.

where $x_i \in \mathbb{R}^{n/k}$ is a subsection of $x$ obtained by breaking $x$ into $k$ equal-sized vectors. Therefore, the product can be simplified by factoring out $M$ as follows:

$$
B^{(n,k)} x =
\begin{bmatrix}
M_1^{(\frac{n}{k},k)} \sum_{j=1}^{k} D_{1j} x_j \\
\vdots \\
M_i^{(\frac{n}{k},k)} \sum_{j=1}^{k} D_{ij} x_j \\
\vdots \\
M_k^{(\frac{n}{k},k)} \sum_{j=1}^{k} D_{kj} x_j
\end{bmatrix}
=
\begin{bmatrix}
M_1^{(\frac{n}{k},k)} y_1 \\
\vdots \\
M_i^{(\frac{n}{k},k)} y_i \\
\vdots \\
M_k^{(\frac{n}{k},k)} y_k
\end{bmatrix}
\qquad (5)
$$

where $y_i = \sum_{j=1}^{k} D_{ij} x_j$. Note that $M_i^{(\frac{n}{k},k)} y_i$ is a smaller product between a butterfly matrix of order $\frac{n}{k}$ and a vector of size $\frac{n}{k}$; therefore, we can use divide-and-conquer to recursively calculate the product $B^{(n,k)} x$. Let $T(n, k)$ denote the computational complexity of the product between an $(n, k)$ butterfly matrix and an $n$-dimensional vector. From equation 5, the product can be calculated by $k$ products of butterfly matrices of order $\frac{n}{k}$, whose total complexity is $kT(n/k, k)$. The complexity of calculating $y_i$ for all $i \in \{1, \dots, k\}$ is $O(kn)$; therefore:

$$T(n, k) = k\,T(n/k, k) + O(kn) \qquad (6)$$

[Figure: diagram of BFT(n, 2) with inputs x_1, ..., x_n and outputs y_1, ..., y_n, containing two BFT(n/2, 2) blocks; layers labelled 1-BFLayer, 2-BFLayer, ..., log-BFLayer.]
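Unrolling the recurrence in equation 6 over $\log_k n$ levels gives $T(n, k) = O(kn \log_k n)$, which is $O(n \log n)$ for a fixed base $k$. The following is a minimal Python sketch of the divide-and-conquer product in equation 5, not the authors' implementation. The names `random_butterfly`, `butterfly_matvec`, and `to_dense` are hypothetical. The sketch assumes that $n$ is a power of $k$, that an order-1 butterfly matrix is the scalar 1, and that block $(i, j)$ of the dense matrix equals $M_i D_{ij}$, a layout inferred from equation 5. It counts scalar multiplications and compares the fast product against a dense one.

```python
import random


def random_butterfly(n, k, rng):
    """Random parameters for a butterfly matrix of order n and base k.

    Assumption: an order-1 butterfly matrix is the scalar 1 (returned as None).
    """
    if n == 1:
        return None
    m = n // k
    # D[i][j] holds the diagonal of the (n/k) x (n/k) diagonal matrix D_ij.
    D = [[[rng.uniform(-1, 1) for _ in range(m)] for _ in range(k)] for _ in range(k)]
    M = [random_butterfly(m, k, rng) for _ in range(k)]
    return D, M


def butterfly_matvec(params, x, k):
    """Return (B x, number of scalar multiplications), following equation 5."""
    if params is None:
        return list(x), 0
    D, M = params
    m = len(x) // k
    chunks = [x[j * m:(j + 1) * m] for j in range(k)]
    out, mults = [], 0
    for i in range(k):
        # y_i = sum_j D_ij x_j: an elementwise scale-and-add over the k chunks.
        y_i = [sum(D[i][j][t] * chunks[j][t] for j in range(k)) for t in range(m)]
        mults += k * m
        z_i, sub_mults = butterfly_matvec(M[i], y_i, k)  # M_i y_i, recursively
        out.extend(z_i)
        mults += sub_mults
    return out, mults


def to_dense(params, n, k):
    """Materialise the n x n matrix whose (i, j) block is M_i D_ij."""
    if params is None:
        return [[1.0]]
    D, M = params
    m = n // k
    B = [[0.0] * n for _ in range(n)]
    for i in range(k):
        Mi = to_dense(M[i], m, k)
        for j in range(k):
            for r in range(m):
                for c in range(m):
                    # Right-multiplying by a diagonal matrix scales column c.
                    B[i * m + r][j * m + c] = Mi[r][c] * D[i][j][c]
    return B


rng = random.Random(0)
n, k = 8, 2
params = random_butterfly(n, k, rng)
x = [rng.uniform(-1, 1) for _ in range(n)]
fast, mults = butterfly_matvec(params, x, k)
dense = to_dense(params, n, k)
slow = [sum(dense[r][c] * x[c] for c in range(n)) for r in range(n)]
max_err = max(abs(a - b) for a, b in zip(fast, slow))
print(mults, n * n, max_err)
```

With these assumptions the multiplication count satisfies T(n) = k·T(n/k) + k·n with T(1) = 0, so it equals k·n·log_k n: 48 for n = 8 and k = 2, compared with 64 for the dense product. The gap widens as n grows.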