
Butterfly Transform: An Efficient FFT Based Neural Architecture Design

Authors

  • Keivan Alizadeh-Vahid
  • Ali Farhadi
  • Mohammad Rastegari
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020

Abstract

In this paper, we show that extending the butterfly operations from the FFT algorithm to a general Butterfly Transform (BFT) can be beneficial in building an efficient block structure for CNN designs. Pointwise convolutions, which we refer to as channel fusions, are the main computational bottleneck in state-of-the-art efficient CNNs (e.g. MobileNets). We introduce a set of criteria for channel fusion, and prove that BFT yields an asymptotically optimal FLOP count with respect to these criteria. By replacing pointwise convolutions with BFT, we reduce the computational complexity of these layers from O(n^2) to O(n log n) with respect to the number of channels. Our experimental evaluations show that our method results in significant accuracy gains across a wide range of network architectures, especially at low FLOP ranges. For example, BFT results in up to a 6.75% absolute top-1 improvement for MobileNetV1, 4.4% for ShuffleNet V2 and 5.4% for MobileNetV3 on ImageNet under a similar number of FLOPs. Notably, ShuffleNet-V2+BFT outperforms the state-of-the-art architecture search methods MNasNet, FBNet and MobileNetV3 in the low FLOP regime.

1 Introduction

Devising convolutional neural networks (CNNs) that can run efficiently on resource-constrained edge devices has attracted many researchers. Current state-of-the-art efficient architecture designs are mainly structured to reduce the over-parameterization of CNNs [25, 16]. A common design choice is to reduce the FLOPs and parameters of a network by factorizing convolutional layers [18, 32, 28, 41], using a separable depth-wise convolution, into two components: (1) spatial fusion, where each channel is convolved independently by a depth-wise convolution; and (2) channel fusion, where all the channels are linearly combined by 1 × 1 convolutions, known as point-wise convolutions. During spatial fusion, the network learns features from the spatial planes, and during channel fusion the network learns relations between these features across channels. This is often implemented using 3 × 3 filters for the spatial fusion and 1 × 1 filters for the channel fusion. Inspecting the computational profile of these networks at inference time reveals that the computational burden of the spatial fusion is relatively negligible compared to that of the channel fusion. In fact, the computational complexity of the point-wise convolutions in the channel fusion is quadratic in the number of channels (O(n^2), where n is the number of channels).

These expensive point-wise convolutions during channel fusion are the main focus of this paper. The point-wise convolutions form a fully connected structure between neurons and can be efficiently implemented using a matrix multiplication. The literature on efficient matrix multiplication suggests imposing a structure on this matrix. Low-rank [22] or circulant [2, 9] structures are a few examples of structures that offer efficiencies in matrix multiplication. In the context of representing point-wise convolutions in a neural network, an ideal structure, we argue, should have the following properties. First, the structure should not impose significant limitations on the capacity of the network; in other words, it should maintain high information flow through the network, which can be thought of as having a large bottleneck size¹. Second, the structure should offer efficiency gains; this is often done by minimizing the FLOPs, which in our case translates into having fewer edges in the network's graph. Finally, in an ideal network structure, there should be at least one path from every input node to all output nodes. This enables cross-talk across channels during fusion; without this property, some input nodes may not receive crucial signals during back-propagation.

In this paper, we introduce the Butterfly Transform (BFT), a light-weight channel fusion method with O(n log n) complexity with respect to the number of channels. BFT fuses all the channels in log n layers with O(n) operations at each layer. We show that BFT's network structure is optimal (in terms of FLOPs) among structures that satisfy all of the aforementioned properties of an ideal channel-fusion network. The structure of the BFT network is inspired by the butterfly operations in the Fast Fourier Transform (FFT). These butterfly operations have been heavily optimized on several hardware/software platforms [12, 5, 10], making BFT readily usable in a wide variety of applications.

Our experimental evaluations show that simply replacing the point-wise convolutions with BFT offers significant gains. We observe that, under a similar number of FLOPs, the butterfly transform consistently improves the accuracy of the efficient variants of the original CNN architectures. For example, using BFT in MobileNet v1 0.25 with 37M FLOPs achieves 53.6% top-1 accuracy on the ImageNet dataset [7], and using BFT in ShuffleNet v2 0.5 with 41M FLOPs achieves 61.33% top-1 accuracy.

2 Related Work

Deep neural networks require intensive computation. Several approaches have been proposed to enable efficient training and inference in deep neural networks.

Efficient CNN architecture designs: Recent successes in visual recognition tasks, including object classification, detection, and segmentation, can be attributed to the exploration of different CNN designs [24, 33, 15, 23, 35, 20]. To make these network designs more efficient, some approaches factorize convolutions into separate steps, enforcing distinct focuses on spatial and channel fusion [18, 32]. Other approaches extend this factorization schema with sparse structure, either in channel fusion [28, 41] or in spatial fusion [29]. [19] forces more connections between the layers of the network but reduces the computation by designing smaller layers. Our method follows the same direction of designing a sparse structure for channel fusion that enables lower computation with a minimal loss in accuracy.

Network pruning: This line of work focuses on reducing the substantial redundancy in CNN parameters by pruning out either neurons or weights [13, 14]. Due to the unstructured sparsity of the resulting models, the learned models from these methods cannot be used efficiently on standard compute platforms such as CPUs and GPUs. Therefore, other pruning approaches focus on pruning out channels rather than individual neurons or weights [17, 43, 11]. These methods drop a channel by monitoring either the average weight values or the average activation values of each channel during training. Our method differs from these methods in that we enforce a predefined sparse channel structure from the start and do not change the structure of the network during training.

Low-rank network design: To reduce the computation in CNNs, [37, 25, 8, 22] exploit the fact that CNNs are over-parameterized. These models learn a linear low-rank representation of the parameters in the network, either by post-processing the trained weight tensors or by enforcing a linear low-rank structure during training. A few works enforce a non-linear low-rank structure using a circulant matrix design [2, 9]. These low-rank network structures achieve efficiency at the cost of lowering the information flow from input channels to output channels (i.e. they have few bottleneck nodes), whereas our butterfly transform is a non-linear structured low-rank representation that maximizes information flow.

Quantization: Another approach to improving the efficiency of deep networks is the low-bit representation of network weights and neurons using quantization [34, 30, 39, 4, 42, 21, 1]. These approaches use fewer bits (instead of 32-bit high-precision floating points) to represent weights and neurons in the standard training procedure of a network. In the case of extremely low bitwidth (1-bit), [30] modifies the training procedure to find the discrete binary values for the weights and the neurons in the network. Our method is orthogonal to this line of work, and these methods are complementary to our network.

Neural architecture search: Recently, neural architecture search methods, including reinforcement learning and genetic algorithms, have been proposed to automatically construct network architectures [44, 40, 31, 45, 36, 27]. These methods search over a huge network space (e.g. MNASNet [36] searches over 8K different design choices) using a dictionary of pre-defined search-space parameters, including different types of convolutional layers and kernel sizes, to identify a network structure, usually non-homogeneous, that satisfies optimization constraints such as inference time. Recent search-based methods [36, 3, 38] use MobileNetV2 [32] as a basic search block for automatic network design. The main computational bottleneck in most search-based methods is in the channel fusion, and our butterfly structure does not exist in any of the predefined blocks of these methods. Our efficient channel fusion can be augmented with these models to further improve their efficiency. Our experiments show that our proposed butterfly structure outperforms recent architecture-search-based models on small network design.

3 Model

In this section, we outline the details of our proposed model. As discussed above, the main computational bottleneck in current efficient neural architecture designs is the channel fusion step, which is implemented with a point-wise convolution layer. The input to this layer is a tensor X of size n_in × h × w, where n_in is the number of channels and w, h are the width and height respectively. The size of the weight tensor W is n_out × n_in × 1 × 1 and the output tensor Y is n_out × h × w. Without loss of generality, we assume n = n_in = n_out. The complexity of a point-wise convolution layer is O(n^2 wh), and it is mainly driven by the number of channels n. We propose a new layer design, the Butterfly Transform, that has O((n log n)wh) complexity. This design is inspired by the Fast Fourier Transform (FFT) algorithm, which has been widely used in computational engines for a variety of applications, and there exist optimized hardware/software designs for the key operations of this algorithm that are applicable to our method. In the following subsections we explain the problem formulation and the structure of our butterfly transform.
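As a rough worked comparison (the channel count n = 1024 is chosen only for illustration), the per-spatial-position cost drops as follows:

$$n = 1024:\qquad n^2 = 1{,}048{,}576 \quad \text{vs.} \quad 2\,n\log_2 n = 2 \cdot 1024 \cdot 10 = 20{,}480,$$

i.e. roughly a 51× reduction in multiply-accumulate operations for a base-2 butterfly transform.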

3.1 Point-Wise Convolution As Matrix-Vector Products

A point-wise convolution can be defined as a function P as follows:

$$\mathbf{Y}[c, i, j] \;=\; \mathcal{P}(\mathbf{X};\mathbf{W})[c, i, j] \;=\; \sum_{c'=1}^{n_{in}} \mathbf{W}[c, c', 1, 1]\; \mathbf{X}[c', i, j] \qquad (1)$$

This can be written as a matrix product by reshaping the input tensor X to a 2-D matrix X̂ of size n × (hw) (each column vector in X̂ corresponds to a spatial vector X[:, i, j]) and reshaping the weight tensor to a 2-D matrix Ŵ of size n × n,

$$\hat{\mathbf{Y}} = \hat{\mathbf{W}} \hat{\mathbf{X}} \qquad (2)$$

where Ŷ is the matrix representation of the output tensor Y. This can be seen as a linear transformation of the column vectors of X̂ using Ŵ as the transformation matrix. This linear transformation is a matrix-vector product with complexity O(n^2). By enforcing structure on the transformation matrix, one can reduce the complexity of the transformation. However, to be effective as a channel fusion transform, it is critical that this transformation respects the desirable characteristics detailed below.
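Before turning to those characteristics, a minimal NumPy check makes the equivalence above concrete; the tensor shapes are arbitrary and chosen only for illustration:

```python
import numpy as np

# Point-wise (1x1) convolution as a matrix product -- illustrative shapes only.
n, h, w = 8, 4, 4                      # channels, height, width
X = np.random.randn(n, h, w)           # input tensor
W = np.random.randn(n, n)              # point-wise weights (n_out = n_in = n)

# Direct definition: Y[:, i, j] = W @ X[:, i, j] at every spatial position.
Y_direct = np.einsum('oc,chw->ohw', W, X)

# Matrix form: reshape X to the n x (hw) matrix X_hat and multiply once.
X_hat = X.reshape(n, h * w)
Y_hat = W @ X_hat                      # O(n^2 * hw) multiply-accumulates
Y_matrix = Y_hat.reshape(n, h, w)

assert np.allclose(Y_direct, Y_matrix)
```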

Ideal characteristics of a fusion network: 1) every-to-all connectivity: there must be at least one path between every input channel and all of the output channels; 2) maximum bottleneck size: the bottleneck size is defined as the minimum number of nodes in the network that, if removed, would completely cut off the information flow from input channels to output channels (i.e. there would be no path from any input channel to any output channel); the largest possible bottleneck size in a multi-layer network is n; 3) small edge count: to reduce computation, we expect the network to have as few edges as possible; 4) equal out-degree within each layer: to enable an efficient matrix implementation of the network, all nodes within each layer must have the same out-degree².

Claim: A multi-layer network with these properties has at least O(n log n) edges.

Proof: Suppose there are $n_i$ nodes in the $i$-th layer. Removing all the nodes in a single layer disconnects the inputs from the outputs, so every layer is a valid cut; since the bottleneck size (the minimum such cut) must attain the maximum possible value $n$, every layer must have $n_i \geq n$ nodes. Now suppose that the out-degree of each node at layer $i$ is $d_i$. The number of nodes in layer $i$ that are reachable from a given input channel is at most

$$\prod_{j=0}^{i-1} d_j.$$

Because of the every-to-all connectivity, all $n$ nodes in the output layer are reachable from every input channel, therefore

$$\prod_{j=0}^{m-1} d_j \;\geq\; n, \qquad \text{which implies} \qquad \sum_{j=0}^{m-1} \log_2(d_j) \;\geq\; \log_2(n).$$

The total number of edges is therefore

$$\sum_{j=0}^{m-1} n_j d_j \;\geq\; n \sum_{j=0}^{m-1} d_j \;\geq\; n \sum_{j=0}^{m-1} \log_2(d_j) \;\geq\; n \log_2 n.$$
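For concreteness, a small worked instance of this bound (n = 8 is chosen only for illustration):

$$n = 8,\; k = 2:\qquad \log_2 8 = 3 \text{ layers},\quad 8 \text{ nodes per layer with out-degree } 2 \;\Rightarrow\; 2 \cdot 8 \cdot 3 = 48 \text{ edges},$$

which matches the $n \log_2 n = 24$ lower bound up to the constant factor $k = 2$, compared with $n^2 = 64$ edges for a fully connected (point-wise) fusion.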

In the following section we present a network structure that satisfies all the ideal characteristics of a fusion network.

3.2 Butterfly Transform (Bft)

As mentioned above, we can reduce the complexity of a matrix-vector product by enforcing structure on the matrix. There are several ways to do so. Here we introduce a family of structured matrices that leads to O(n log n) complexity in both operations and parameters while maintaining accuracy.

Butterfly Matrix: We define $B^{(n,k)}$ as a butterfly matrix of order $n$ and base $k$, where $B^{(n,k)} \in \mathbb{R}^{n\times n}$:

$$B^{(n,k)} = \begin{bmatrix} M^{(\frac{n}{k},k)}_{1} D_{11} & \cdots & M^{(\frac{n}{k},k)}_{1} D_{1k} \\ \vdots & \ddots & \vdots \\ M^{(\frac{n}{k},k)}_{k} D_{k1} & \cdots & M^{(\frac{n}{k},k)}_{k} D_{kk} \end{bmatrix} \qquad (4)$$

where $M^{(\frac{n}{k},k)}_{i}$ is a butterfly matrix of order $\frac{n}{k}$ and base $k$, and $D_{ij}$ is an arbitrary diagonal $\frac{n}{k} \times \frac{n}{k}$ matrix. Consider the matrix-vector product between a butterfly matrix $B^{(n,k)}$ and a vector $x \in \mathbb{R}^{n}$, where $x_i \in \mathbb{R}^{\frac{n}{k}}$ is the $i$-th subsection obtained by breaking $x$ into $k$ equal-sized vectors. The product can be simplified by factoring out the $M_i$ as follows:

$$B^{(n,k)}x = \begin{bmatrix} M^{(\frac{n}{k},k)}_{1} \sum_{j=1}^{k} D_{1j} x_j \\ \vdots \\ M^{(\frac{n}{k},k)}_{i} \sum_{j=1}^{k} D_{ij} x_j \\ \vdots \\ M^{(\frac{n}{k},k)}_{k} \sum_{j=1}^{k} D_{kj} x_j \end{bmatrix} = \begin{bmatrix} M^{(\frac{n}{k},k)}_{1} y_1 \\ \vdots \\ M^{(\frac{n}{k},k)}_{i} y_i \\ \vdots \\ M^{(\frac{n}{k},k)}_{k} y_k \end{bmatrix} \qquad (5)$$

where $y_i = \sum_{j=1}^{k} D_{ij} x_j$. Note that $M^{(\frac{n}{k},k)}_{i} y_i$ is a smaller product between a butterfly matrix of order $\frac{n}{k}$ and a vector of size $\frac{n}{k}$; therefore, we can use divide-and-conquer to recursively calculate the product $B^{(n,k)}x$. Let $T(n, k)$ denote the computational complexity of the product between an $(n, k)$ butterfly matrix and an $n$-dimensional vector. From equation 5, the product can be calculated by $k$ products of butterfly matrices of order $\frac{n}{k}$, whose total complexity is $kT(n/k, k)$. The complexity of calculating $y_i$ for all $i \in \{1, \ldots, k\}$ is $O(kn)$; therefore:

$$T(n, k) = kT(n/k, k) + O(kn) \qquad (6)$$

Solving this recurrence gives

$$T(n, k) = O(k\, n \log_k n).$$

With a smaller choice of $k$ ($2 \leq k \leq n$) we achieve a lower complexity. Algorithm 1 illustrates the recursive procedure of a butterfly transform for $k = 2$.

DFT is a specific case of BFT: It can be shown that the Discrete Fourier Transform (DFT) is a specific case of our butterfly transform. A DFT is an $(n, 2)$ butterfly transform whose transformation matrix $B^{(n,2)} \in \mathbb{C}^{n\times n}$: the elements of the output vector are permuted by a radix-2 shuffle, and the diagonal elements of the $D_{ij}$ are drawn from the $n$-th roots of unity $z^n = 1$, $z \in \mathbb{C}$. Note that the complexity of the butterfly transform is independent of row permutation. Therefore, the Fast Fourier Transform (FFT) is a specific case of our efficient butterfly transform of order $n$ and base 2, whose complexity is $O(n \log n)$. BFT can be seen as a generalization of the DFT in which the parameters are learned during the training of the network. In the next section we discuss the network structure of BFT.
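Algorithm 1 itself is not reproduced in this extraction. The following NumPy sketch shows one plausible way to implement the recursive product for base k = 2; the function names and the tree-shaped parameter layout are our own illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def butterfly_product(params, x):
    """Recursive butterfly product B^(n,2) x for base k = 2 (an illustrative
    sketch, not the paper's exact Algorithm 1).

    `params` is a tree (D, left, right): D is a 2x2 grid of length-(n/2)
    diagonals (the learned D_ij weights at this level), and the two sub-trees
    parameterize the independent butterfly matrices M_1 and M_2 of order n/2.
    """
    n = x.shape[0]
    if n == 1:                         # base case: treat the 1x1 block as identity
        return x
    D, left, right = params
    half = n // 2
    x1, x2 = x[:half], x[half:]
    # y_i = sum_j D_ij x_j  -- O(n) multiply-adds at this recursion level
    y1 = D[0][0] * x1 + D[0][1] * x2
    y2 = D[1][0] * x1 + D[1][1] * x2
    # Two independent sub-products of order n/2: T(n) = 2 T(n/2) + O(n)
    return np.concatenate([butterfly_product(left, y1),
                           butterfly_product(right, y2)])

def random_params(n):
    """Random parameter tree for an order-n, base-2 butterfly matrix."""
    if n == 1:
        return None
    D = [[np.random.randn(n // 2) for _ in range(2)] for _ in range(2)]
    return (D, random_params(n // 2), random_params(n // 2))

x = np.random.randn(8)
y = butterfly_product(random_params(8), x)   # O(n log n) instead of O(n^2)
```

Each recursion level performs O(n) multiply-adds and there are log_2 n levels, matching the recurrence T(n, 2) = 2T(n/2, 2) + O(n) above.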

3.3 Butterfly Neural Network

The procedure explained in Algorithm 1 can be represented by a butterfly graph similar to the FFT's graph. The butterfly network structure has been used for function representation [26] and for fast factorizations approximating linear transformations [6]. We adopt this graph as an architecture design for the layers of a neural network. Figure 1 illustrates the architecture of a butterfly network of base k = 2 applied to an input tensor of size n × h × w. The left part of the figure shows the recursive structure of BFT as a network, and the right part shows the constructed multi-layer network, which has log n butterfly layers (BFLayers). Note that the complexity of each butterfly layer is O(n) (2n operations); therefore, the total complexity of the BFT architecture is O(n log n). Each butterfly layer can be augmented with batch norm and non-linearity functions (e.g. ReLU, Sigmoid). In section 4.2 we study the effect of different choices of these functions. We found that neither batch norm nor non-linear functions (ReLU and Sigmoid) are effective within BFLayers. Batch norm is not effective mainly because its complexity is the same as that of a BFLayer, O(n), so it would double the computation of the entire transform; we therefore use batch norm only at the end of the transform. The non-linear activations ReLU and Sigmoid zero out almost half of the values in each BFLayer, and the multiplication of these values throughout the forward propagation destroys all the information. The BFLayers can be internally connected with residual connections in different ways. In our experiments, we found that the best residual connection is the one that connects the input of the first BFLayer to the output of the last BFLayer.

Figure 1: BFT Architecture: This figure illustrates the graph structure of the proposed butterfly transform. The left figure shows the recursive procedure of the BFT applied to an input tensor, and the right figure shows the expanded version of the recursive procedure as log n butterfly layers in the network.
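As a concrete illustration of this multi-layer view, the following PyTorch sketch stacks log2(n) butterfly layers of base 2, with batch norm only at the end and a first-to-last residual connection; the channel-pairing pattern, parameter layout, and initialization bound are our own reading of the paper, not a reference implementation:

```python
import math
import torch
import torch.nn as nn

class ButterflyTransform(nn.Module):
    """log2(n) butterfly layers (base k = 2) standing in for a 1x1 convolution.

    A sketch of the expanded (multi-layer) view in Figure 1. Assumes n is a
    power of two; the init range follows our reading of the paper's
    weight-initialization heuristic.
    """

    def __init__(self, n):
        super().__init__()
        assert n & (n - 1) == 0, "this sketch assumes n is a power of two"
        self.n = n
        self.num_layers = int(math.log2(n))
        # 4 weights per channel pair per layer -> 2n parameters per BFLayer.
        bound = (2.0 / math.sqrt(n)) ** (1.0 / self.num_layers)  # per-edge init range
        self.weights = nn.Parameter(
            torch.empty(self.num_layers, 4, n // 2).uniform_(-bound, bound))
        self.bn = nn.BatchNorm2d(n)        # batch norm only after the whole transform

    def forward(self, x):                  # x: (batch, n, h, w)
        out = x
        idx = torch.arange(self.n, device=x.device)
        for l in range(self.num_layers):
            stride = self.n >> (l + 1)
            top = idx[(idx // stride) % 2 == 0]   # FFT-style channel pairing
            bot = top + stride
            w = self.weights[l]
            a, b = out[:, top], out[:, bot]
            new = torch.empty_like(out)
            new[:, top] = w[0, :, None, None] * a + w[1, :, None, None] * b
            new[:, bot] = w[2, :, None, None] * a + w[3, :, None, None] * b
            out = new
        return self.bn(out + x)            # first-to-last residual connection
```

Under these assumptions, a ButterflyTransform(256) module could stand in for an nn.Conv2d(256, 256, kernel_size=1), reducing the channel-fusion cost from O(n^2) to O(n log n) per spatial position.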

The butterfly network is an optimal structure satisfying all the characteristics of an ideal fusion network: there exists exactly one path from every input channel to every output channel, the out-degree of each node in the graph is exactly k, the bottleneck size is the maximum possible (n), and the number of edges is O(n log n).

We use the BFT architecture as a replacement for the point-wise convolution layers (1 × 1 convs) in different CNN architectures, including MobileNet [18] and ShuffleNet [28]. Our experimental results show that, under the same number of FLOPs, spending the budget on BFT is more effective in terms of accuracy than using the original model with a smaller channel width. We observe consistent accuracy improvements across several architecture settings.

Fusing channels using BFT instead of point-wise convolutions reduces the size of the computational bottleneck by a large margin. Figure 2 illustrates the percentage of operations performed by each block type throughout a forward pass in the network. Note that when BFT is applied, the percentage of FLOPs spent in depth-wise convolutions increases by 8×.

Figure 2: Distribution of FLOPs: This figure shows that replacing the point-wise convolution with BFT reduces the size of the computational bottleneck.

4 Experiment

In this section, we demonstrate the performance of the proposed butterfly transform on the image classification task. To showcase the strength of our method for designing very small networks, we compare the performance of the Butterfly Transform with point-wise convolutions in two state-of-the-art efficient architectures: (1) MobileNet and (2) ShuffleNet V2. We also compare our results with other types of transforms that have O(n log n) computation (e.g. the low-rank transform and the circulant transform). Finally, we show that at a small number of FLOPs (14M), our method outperforms state-of-the-art architecture search methods.

4.1.1 Implementation And Dataset Details:

Following common practice, we evaluate the performance of the Butterfly Transform on the ImageNet dataset at different levels of complexity, ranging from 14 MFLOPs to 41 MFLOPs. ImageNet contains 1.2M training samples and 50K validation samples.

For each architecture, we substitute point-wise convolutions with butterfly transforms. To have a fair comparison between BFT and point-wise convolutions, we adjust the channel counts in the architectures (MobileNet and ShuffleNet) such that the number of FLOPs in both methods (BFT and point-wise) remains equal. We optimize our network by minimizing a cross-entropy loss using SGD. We found that using no weight decay gives better accuracy. We use k = 4 because it has the same number of FLOPs as base k = 2 (4n log_4 n = 2n log_2 n) while the corresponding BFT has fewer BFLayers (log_4 n < log_2 n). Note that if k = n, the complexity of BFT is the same as that of a point-wise convolution.
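Spelling out the FLOP equality cited above:

$$4n \log_4 n \;=\; 4n \cdot \frac{\log_2 n}{2} \;=\; 2n \log_2 n, \qquad \text{while} \qquad \log_4 n = \tfrac{1}{2}\log_2 n,$$

so base 4 keeps the operation count of base 2 but halves the number of BFLayers.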

Weight initialization: A common heuristic for weight initialization is to draw values uniformly from the range (−x, x) where x = 2/√n. We instead initialize each edge such that the product of all the edges on a path from an input node to an output node falls in this range: since every path consists of log(n) edges, we initialize each edge in the range (−x^(1/log(n)), x^(1/log(n))).
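This keeps the per-path scale consistent with the standard heuristic, since the product of the log n per-edge bounds recovers x:

$$\underbrace{x^{\frac{1}{\log n}} \cdot x^{\frac{1}{\log n}} \cdots x^{\frac{1}{\log n}}}_{\log n \text{ edges on a path}} \;=\; \left(x^{\frac{1}{\log n}}\right)^{\log n} \;=\; x.$$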

4.1.2 Mobilenet + Bft

In this model we replace the point-wise convolutions in MobileNetV1 with BFT. We compare MobileNet with width-multiplier 0.25 (using 128, 160, 224 resolutions) against MobileNet+BFT with width-multiplier 1.0 (using 96, 128, 160 resolutions). Table 1a shows that using BFT we outperform point-wise convolutions by 5% in top-1 accuracy at 14 MFLOPs. Note that MobileNet+BFT at 24 MFLOPs has almost the same accuracy as MobileNet at 41 MFLOPs, which means it reaches the same accuracy with almost half the FLOPs. This is achieved without changing the architecture at all, simply by replacing the point-wise convolutions.

4.1.3 Shufflenet V2 + Bft

Table 1b shows the results for ShuffleNet V2. We slightly adjusted the number of output channels to build ShuffleNet V2-1.25, and we compare ShuffleNet V2-1.25+BFT with ShuffleNet V2-0.5 at different input resolutions (128, 160, 224), which gives FLOPs ranging from 14M to 41M. ShuffleNet V2-1.25+BFT achieves about 2.5% better accuracy than our implementation of ShuffleNet V2-0.5, which uses point-wise convolutions, and 1% better accuracy than the reported number for ShuffleNet V2 at 41 MFLOPs.

Table 1: This table compares the top-1 accuracy on ImageNet using MobileNet-v1 and ShuffleNet-v2 when replacing the point-wise convolution with butterfly transform

4.1.4 Comparison With Neural Architecture Search

Our model built on ShuffleNet V2 achieves higher accuracy than the state-of-the-art architecture search methods MNasNet [36] and FBNet [38] in the efficient-network setting (∼14M FLOPs). This is because those methods only search among a set of predefined blocks, and the most efficient channel fusion block available to them is the point-wise convolution, which is more expensive than BFT. Table 2a shows the top-1 accuracy comparison on the ImageNet dataset. One could further improve the accuracy of the architecture search models by including BFT as a searchable block.

Table 2: These tables compare BFT with other efficient network design approaches. Our BFT model augmented with ShuffleNet V2 outperforms the state-of-the-art neural architecture search methods (MNasNet [36], FBNet [38]). BFT also achieves higher accuracy compared to efficient network designs based on the low-rank transform and the circulant matrix transform.

4.1.5 Comparison With Other O(N Log N) Architectures

To further illustrate the benefits of BFLayers, we compare them with other architectures that reduce the complexity to O(n log n). Here we study circulant [9] and low-rank architectures [22].

Circulant architecture: In this architecture, the matrix that represents the point-wise convolution is a circulant matrix, in which each row is a cyclic shift of the previous row [9]. The product of a circulant matrix with a column vector can be computed efficiently in O(n log n) using the fast Fourier transform (FFT).
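This is a standard identity; the following NumPy check (not code from the paper) illustrates it:

```python
import numpy as np
from scipy.linalg import circulant

# Multiplying by an n x n circulant matrix in O(n log n) via the FFT:
# C @ x equals the circular convolution of C's first column c with x.
n = 8
c = np.random.randn(n)                 # first column defines the whole matrix
x = np.random.randn(n)

y_direct = circulant(c) @ x                                # O(n^2)
y_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real   # O(n log n)

assert np.allclose(y_direct, y_fft)
```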

Low-rank matrix: In this architecture, the matrix that represents the point-wise convolution is the product of an n × log n matrix and a log n × n matrix. Therefore, the point-wise convolution can be performed by two consecutive small matrix products, and the total complexity is O(n log n). As can be seen in Table 2b, the butterfly transform outperforms both the circulant and the low-rank structures under the same number of FLOPs.
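A minimal sketch of this factorization, with shapes chosen only for illustration:

```python
import numpy as np

# Low-rank point-wise convolution: factor the n x n weight matrix as U @ V with
# rank r = log2(n), so the fusion costs O(n log n) per spatial position.
n, hw = 256, 49                        # channels and number of spatial positions
r = int(np.log2(n))                    # r = 8
U = np.random.randn(n, r)
V = np.random.randn(r, n)
X_hat = np.random.randn(n, hw)         # reshaped input (channels x spatial)

Y_hat = U @ (V @ X_hat)                # two small products: 2 * n * r * hw MACs
```

Note that the rank-log n intermediate layer is exactly the small bottleneck criticized in Section 3.1, whereas BFT keeps the bottleneck size at n.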

4.2 Ablation Study

Now we study the different elements of our BFT model. As mentioned earlier, residual connections and non-linear activations can be added to our BFLayers. Here we show the performance of these elements in isolation on the CIFAR-10 dataset, using MobileNetV1 as the base network.

With/without non-linearity: As studied by [32], adding a non-linearity function like ReLU or Sigmoid to a narrow layer (with few channels) reduces the accuracy because it zeroes out about half of the values of an internal layer. In BFT, the effect of an input channel i on an output channel o is determined by the multiplication of all the edges on the path between i and o. Dropping a single value to zero destroys all the information transferred between the two channels, so dropping half of the values of each internal layer destroys almost all the information in the entire layer. Figure 3b compares the learning curves of BFT models with and without non-linear activation functions.

Figure 3: In BFT we should not apply weight decay, because it significantly reduces the effect of input channels on output channels. Similarly, we should not apply the common non-linear activations, because these functions zero out almost half of the values in the intermediate BFLayers, which leads to a significant drop in the information flow from input channels to output channels.

With/without weight-decay: We found that BFT is very sensitive to weight decay, because in BFT there is only one path from an input channel i to an output channel o, and the effect of i on o is determined by the multiplication of all the intermediate edges along that path. Pushing all weight values toward zero therefore significantly reduces the effect of i on o, making weight decay very destructive for BFT. Figure 3a illustrates the learning curves with and without weight decay on BFT.

Residual Connections:

The graphs obtained by replacing point-wise convolutions with the butterfly transform are very deep, and a good practice for training such deep networks is to add residual connections. We experimented with three different types of residual connections: (1) First-to-Last, which connects the input of the first BFLayer to the output of the last BFLayer; (2) Every-other-Layer, which connects every other BFLayer; and (3) No-residual, with no residual connections. We found that First-to-Last is the most effective type of residual connection; Table 3 compares these types of connections.

Table 3: Residual connections. (Table contents not extracted; please refer to the original document.)

5 Conclusion

In this paper, we introduced BFT, an efficient transformation that can simply replace point-wise convolutions in various neural architectures to significantly improve the accuracy of the networks while maintaining the same number of FLOPs. The structure employed by BFT has properties that ensure the efficacy of the proposed solution. The main focus of this paper is on small networks designed for resource-constrained devices: BFT shows larger gains in smaller networks, and its utility diminishes for larger ones. The butterfly operations in BFT are extremely hardware friendly, and there exist several optimized hardware/software platforms for these operations.

¹ The bottleneck size of a network is defined as the minimum number of nodes that need to be removed to ensure that no path exists between any input and output channel.


² In this way, one can represent each layer as a fixed number of matrix multiplications, which is supported by fast linear algebra libraries (e.g. BLAS); otherwise, we would need to use sparse matrix operations, which are not as well optimized.