Soft Threshold Weight Reparameterization for Learnable Sparsity

Authors

  • Aditya Kusupati
  • Vivek Ramanujan
  • Raghav Somani
  • Mitchell Wortsman
  • Prateek Jain
  • S. Kakade
  • Ali Farhadi
  • ICML, 2020

Abstract

Sparsity in Deep Neural Networks (DNNs) is studied extensively with the focus of maximizing prediction accuracy given an overall parameter budget. Existing methods rely on uniform or heuristic non-uniform sparsity budgets which have sub-optimal layer-wise parameter allocation resulting in a) lower prediction accuracy or b) higher inference cost (FLOPs). This work proposes Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on DNN weights. STR smoothly induces sparsity while learning pruning thresholds thereby obtaining a non-uniform sparsity budget. Our method achieves state-of-the-art accuracy for unstructured sparsity in CNNs (ResNet50 and MobileNetV1 on ImageNet-1K), and, additionally, learns non-uniform budgets that empirically reduce the FLOPs by up to 50%. Notably, STR boosts the accuracy over existing results by up to 10% in the ultra sparse (99%) regime and can also be used to induce low-rank (structured sparsity) in RNNs. In short, STR is a simple mechanism which learns effective sparsity budgets that contrast with popular heuristics. Code, pretrained models and sparsity budgets are at this https URL.

1. Introduction

Deep Neural Networks (DNNs) are the state-of-the-art models for many important tasks in the domains of Computer Vision, Natural Language Processing, etc. To enable highly accurate solutions, DNNs require large model sizes resulting in huge inference costs, which often become the main bottleneck in the real-world deployment of these solutions. During inference, a typical DNN model stresses the following aspects of the compute environment: 1) RAM (working memory), 2) processor compute (floating point operations, FLOPs), and 3) flash (model size). Various techniques have been proposed to make DNNs efficient, including model pruning (sparsity) (Han et al., 2015), knowledge distillation (Buciluǎ et al., 2006), efficient model architectures (Howard et al., 2017) and quantization (Rastegari et al., 2016).

Sparsity of the model, in particular, has potential for impact across a variety of inference settings as it reduces the model size and inference cost (FLOPs) without significant changes to training pipelines. Naturally, several interesting projects address inference speed-ups via sparsity on existing frameworks (Liu et al., 2015; Elsen et al., 2019) and commodity hardware (Ashby et al.). On-premise or edge computing is another domain where sparse DNNs have potential for deep impact, as it is governed by billions of battery-limited devices with single-core CPUs. These devices, including mobile phones (Anguita et al., 2012) and IoT sensors (Patil et al., 2019; Roy et al., 2019), can benefit significantly from sparsity as it can enable real-time on-device solutions, thus preserving user privacy, reducing latency, improving energy efficiency and improving reliability.

Sparsity in DNNs, surveyed extensively in Section 2, has been the subject of several papers where new algorithms are designed to obtain DNNs with a given parameter budget. But state-of-the-art DNN models tend to have a large number of layers with a highly non-uniform distribution, both in terms of the number of parameters and the FLOPs required per layer. Most existing methods rely either on uniform sparsity across all parameter tensors (layers) or on heuristically obtained non-uniform sparsity budgets, leading to a sub-optimal weight allocation across layers that can cause a significant loss in accuracy. Furthermore, as the budget is set at a global level, some of the layers with a small number of parameters (like the initial convolution layers) would be fully dense as their contribution to the budget is insignificant. However, those layers can have a significant FLOP count; e.g., in an initial convolution layer, a simple tiny 3×3 kernel is applied over the entire image. Hence, while such models might decrease the number of non-zeros significantly, their FLOP count could still be large.

Motivated by the above-mentioned challenges, this work addresses the following question: "Can we design a method to learn a non-uniform sparsity budget across layers that is optimized per layer, is stable, and is accurate?".

Most existing methods for learning sparse DNNs have their roots in the long celebrated literature of high-dimensional statistics and, in particular, sparse regression. These methods are mostly based on the well-known Hard and Soft Thresholding techniques, which are essentially projected gradient methods with an explicit projection onto the set of sparse parameters. However, these methods require a priori knowledge of the sparsity level, and, as mentioned above, mostly heuristic methods are used to set the sparsity levels per layer.

We propose Soft Threshold Reparameterization (STR) to address the aforementioned issues. We use the fact that the projection onto the sparse sets is available in closed form and propose a novel reparameterization of the problem. That is, for the forward pass of the DNN, we use a soft-thresholded version (Donoho, 1995) of the weight tensor W_l of the l-th layer:

$$\mathcal{S}(W_l, \alpha_l) := \text{sign}(W_l) \cdot \text{ReLU}(|W_l| - \alpha_l),$$

where α l is the pruning threshold for the l-th layer. As the DNN loss can be written as a continuous function of α l 's, we can use backpropagation and sub-gradients to learn layer-specific α l to smoothly induce sparsity.
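As a toy illustration (ours, not taken from the paper's released code), the operator is a couple of lines of PyTorch; with a threshold α = 0.2, entries whose magnitude is below α are zeroed and the surviving entries are shrunk towards zero by α:

import torch

def soft_threshold(w, alpha):
    # S(w, alpha) = sign(w) * ReLU(|w| - alpha)
    return torch.sign(w) * torch.relu(torch.abs(w) - alpha)

w = torch.tensor([0.5, -0.15, 0.05, -0.8])
print(soft_threshold(w, alpha=0.2))   # tensor([ 0.3000, -0.0000,  0.0000, -0.6000])

The full STRConv layer used in the experiments is given in Algorithm 1 of the Appendix.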

Due to layer-specific thresholds and sparsity, STR is able to achieve state-of-the-art accuracy for unstructured sparsity in CNNs across various sparsity regimes. STR makes even small-parameter layers sparse which results in models with significantly lower inference costs (FLOPs) than the baselines. For example, STR for 90% sparse MobileNetV1 on ImageNet-1K results in a 0.3% boost in accuracy with 50% fewer FLOPs. Empirically, STR's learnt non-uniform budget makes it a very effective choice for ultra (99%) sparse ResNet50 architecture as well where it is ∼10% more accurate than baselines when applied to ImageNet-1K. STR can also be trivially modified to induce structured sparsity (e.g. low-rank), demonstrating its generalizability to a variety of DNN architectures in different domains. Finally, STR's learnt non-uniform sparsity budget can translate across tasks thus discovering an efficient sparse backbone of the model. The 3 major contributions of this paper are:

• Soft Threshold Reparameterization (STR), for the weights in DNNs, to induce sparsity via learning the per-layer pruning thresholds thereby obtaining a better non-uniform sparsity across parameter tensors (layers).

• Extensive experimentation showing that STR achieves the state-of-the-art accuracy for sparse CNNs (ResNet50 and MobileNetV1 on ImageNet-1K) along with a significant reduction in inference FLOPs.

• Extension of STR to structured sparsity, that is useful for direct implementation of fast inference in practice.

2. Related Work

This section covers the spectrum of work on sparsity in DNNs. The sparsity in the discussion can be characterized as (a) unstructured and (b) structured, while sparsification techniques can be (i) dense-to-sparse and (ii) sparse-to-sparse. Finally, the sparsity budget in DNNs can either be (a) uniform or (b) non-uniform across layers. This will be a key focus of this paper, as different budgets result in different inference compute costs as measured by FLOPs.

2.1. Unstructured And Structured Sparsity

Unstructured sparsity does not take the structure of the model (e.g., channels, rank, etc.) into account. Typically, unstructured sparsity is induced in DNNs by making the parameter tensors sparse directly based on heuristics (e.g., weight magnitude), thereby creating sparse tensors which might not be capable of leveraging the speed-ups provided by commodity hardware during training and inference. Unstructured sparsity has been extensively studied and includes methods which use gradient, momentum and Hessian based heuristics (Evci et al., 2019; Lee et al., 2019; Dettmers & Zettlemoyer, 2019; LeCun et al., 1990; Hassibi & Stork, 1993), and magnitude-based pruning (Han et al., 2015; Guo et al., 2016; Zhu & Gupta, 2017; Frankle & Carbin, 2019; Gale et al., 2019; Mostafa & Wang, 2019; Wortsman et al., 2019; Bellec et al., 2018; Mocanu et al., 2018; Narang et al., 2019; Kusupati et al., 2018). Unstructured sparsity can also be induced by L_0 and L_1 regularization (Louizos et al., 2018), and Variational Dropout (VD) (Molchanov et al., 2017).

Gradual Magnitude Pruning (GMP), proposed in (Zhu & Gupta, 2017) and studied further in (Gale et al., 2019), is a simple magnitude-based weight pruning applied gradually over the course of the training. Discovering Neural Wirings (DNW) (Wortsman et al., 2019) also relies on magnitude-based pruning while utilizing a straight-through estimator for the backward pass. GMP and DNW are the state-of-the-art for unstructured pruning in DNNs (especially in CNNs), demonstrating the effectiveness of magnitude pruning. VD gets accuracy comparable to GMP (Gale et al., 2019) for CNNs but at a cost of 2× memory and 4× compute during training, making it hard to use ubiquitously.

Structured sparsity takes structure into account, making the models scalable on commodity hardware with standard computation techniques/architectures. Structured sparsity includes methods which make parameter tensors low-rank (Jaderberg et al., 2014; Lu et al., 2016; Alizadeh et al., 2019), prune out channels and filters, and induce block/group sparsity (Liu et al., 2019; He et al., 2017; Wen et al., 2016; Li et al., 2017; Luo et al., 2017). Even though structured sparsity can leverage the speed-ups provided by multi-core processors, the highest levels of sparsity and model pruning are only possible with unstructured sparsity techniques.

2.2. Dense-To-Sparse And Sparse-To-Sparse Training

Until recently, most sparsification methods were dense-to-sparse, i.e., the DNN starts fully dense and is made sparse by the end of the training. Dense-to-sparse training in DNNs encompasses the techniques presented in (Han et al., 2015; Zhu & Gupta, 2017; Molchanov et al., 2017; Frankle & Carbin, 2019; Renda et al., 2020).

The lottery ticket hypothesis (Frankle & Carbin, 2019) sparked an interest in training sparse neural networks end-to-end. This is referred to as sparse-to-sparse training, and a lot of recent work (Mostafa & Wang, 2019; Bellec et al., 2018; Evci et al., 2019; Lee et al., 2019; Dettmers & Zettlemoyer, 2019) aims to do sparse-to-sparse training using techniques which include re-allocation of weights to improve accuracy. RigL (Evci et al., 2019) uses the magnitude to drop and the periodic dense gradients to regrow weights. SNFS and RigL are state-of-the-art in sparse-to-sparse training but fall short of GMP for the same experimental settings. It should be noted that, even though sparse-to-sparse training can reduce the training cost, the existing frameworks (Paszke et al., 2019; Abadi et al., 2016) treat the models as dense, resulting in minimal gains.

DNW (Wortsman et al., 2019) and Dynamic Pruning with Feedback (DPF) (Lin et al., 2020) fall in between the two regimes, as DNW uses a fully dense gradient in the backward pass and DPF maintains a copy of the dense model in parallel to optimize the sparse model through feedback. Note that DPF is complementary to most of the techniques discussed here.

2.3. Uniform And Non-Uniform Sparsity

Uniform sparsity implies that all the layers in the DNN have the same amount of sparsity in proportion. Quite a few works have used uniform sparsity (Gale et al., 2019) , given its ease and lack of hyperparameters. However, some works keep parts of the model dense, including the first or the last layers (Lin et al., 2020; Mostafa & Wang, 2019; Zhu & Gupta, 2017) . In general, making the first or the last layers dense benefits all the methods. GMP typically uses uniform sparsity and achieves state-of-the-art results.

Non-uniform sparsity permits different layers to have different sparsity budgets. Weight re-allocation heuristics have been used for non-uniform sparsity in DSR and SNFS. It can be a fixed budget like the ERK (Erdos-Renyi-Kernel) heuristic described in RigL (Evci et al., 2019) . A global pruning threshold (Han et al., 2015) can also induce non-uniform sparsity and has been leveraged in Iterative Magnitude Pruning (IMP) (Frankle & Carbin, 2019; Renda et al., 2020) . A good non-uniform sparsity budget can help in maintaining accuracy while also reducing the FLOPs due to a better parameter distribution. Aforementioned methods with non-uniform sparsity do not reduce the FLOPs compared to uniform sparsity in practice. Very few techniques like AMC (He et al., 2018) , using expensive reinforcement learning, minimize FLOPs with non-uniform sparsity.

Most of the discussed techniques rely on intelligent heuristics to obtain non-uniform sparsity. Learning the pruning thresholds and in-turn learning the non-uniform sparsity budget is the main contribution of this paper.

3. Method - STR

Optimization under sparsity constraint on the parameter set is a well studied area spanning more than three decades (Donoho, 1995; Candes et al., 2007; Jain et al., 2014) , and is modeled as:

$$\min_{W} \; L(W; D) \quad \text{s.t.} \quad \|W\|_0 \le k,$$

where $D := \{(x_i \in \mathbb{R}^d,\; y_i \in \mathbb{R})\}_{i \in [n]}$ is the observed data, L is the loss function, W are the parameters to be learned, $\|\cdot\|_0$ denotes the $L_0$-norm (i.e., the number of non-zeros), and k is the parameter budget. Due to the non-convexity and combinatorial structure of the $L_0$-norm constraint, its convex relaxation, the $L_1$-norm, has been studied for a long time and has been at the center of a large literature on high-dimensional learning. In particular, several methods have been proposed to solve the two problems, including projected gradient descent, forward/backward pruning, etc.

Projected gradient descent (PGD) in particular has been popular for both problems, as the projection onto the $L_0$ as well as the $L_1$ ball is computable in almost closed form (Beck & Teboulle, 2009; Jain et al., 2014); the $L_0$-ball projection is called Hard Thresholding while the $L_1$-ball projection is known as Soft Thresholding. Further, these methods have been the guiding principle for many modern DNN model pruning (sparsity) techniques (Han et al., 2015; Zhu & Gupta, 2017; Narang et al., 2019).
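For concreteness, the two projection-style operators can be sketched in a few lines (a minimal NumPy illustration with helper names of our choosing; the exact $L_1$-ball projection applies the same shrinkage with α chosen so that the result lands on the ball):

import numpy as np

def hard_threshold(w, k):
    # L0 "projection": keep the k largest-magnitude entries, zero out the rest.
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-k:]
    out[keep] = w[keep]
    return out

def soft_threshold(w, alpha):
    # Soft thresholding: shrink every entry towards 0 by alpha and clip at 0.
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

w = np.array([0.9, -0.4, 0.05, -0.7, 0.2])
print(hard_threshold(w, k=2))      # [ 0.9  0.   0.  -0.7  0. ]
print(soft_threshold(w, 0.3))      # [ 0.6 -0.1  0.  -0.4  0. ]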

However, projection-based methods suffer from the problem of dense gradient and intermediate parameter structure, as the gradient descent iterate can be arbitrarily out of the set and is then projected back onto L 0 or L 1 ball. At a scale of billions of parameters, computing such dense gradients and updates can be daunting. More critically, the budget parameter k is set at the global level, so it is not clear how to partition the budget for each layer, as the importance of each layer can be significantly different.

In this work, we propose a reparameterization trick, Soft Threshold Reparameterization (STR), based on the soft-threshold operator (Donoho, 1995), to alleviate both of the above mentioned concerns. That is, instead of first updating W via gradient descent and then computing its projection, we directly optimize over the projected W. Let $\mathcal{S}_g(W; s)$ be the projection of W parameterized by s and the function g. $\mathcal{S}_g$ is applied elementwise to each entry of W and is defined as:

$$\mathcal{S}_g(w; s) := \text{sign}(w) \cdot \text{ReLU}(|w| - g(s)) \qquad (1)$$

where s is a learnable parameter, g : R → R, and α = g(s) is the pruning threshold. ReLU(a) = max(a, 0). That is, if |w| ≤ g(s), then S g (w, s) sets it to 0.

Reparameterizing the optimization problem with $\mathcal{S}_g$ modifies it (note that it is not an equivalent problem) to:

$$\min_{W,\, s} \; L\big(\mathcal{S}_g(W, s),\, D\big) \qquad (2)$$

For L-layer DNN architectures, we divide W into layer-wise parameter tensors, $W = [W_l]_{l=1}^{L}$, where W_l is the parameter tensor for the l-th layer. As mentioned earlier, different layers of DNNs can have significantly different numbers of parameters, and different layers might need different sparsity budgets for the best accuracy. So, we set a trainable pruning parameter s_l for each layer; that is, $s = [s_1, \ldots, s_L]$.

Now, using the above mentioned reparameterization for each W_l and adding a standard $L_2$ regularization per layer, we get the following gradient descent (GD) update equation at the t-th step for $W_l$, $\forall\, l \in [L]$:

$$W_l^{(t+1)} \leftarrow (1 - \eta_t \cdot \lambda)\, W_l^{(t)} \;-\; \eta_t\, \nabla_{W_l} \mathcal{S}_g(W_l^{(t)}, s_l) \odot \nabla_{\mathcal{S}_g(W_l, s_l)} L\big(\mathcal{S}_g(W^{(t)}, s),\, D\big), \qquad (3)$$

where $\eta_t$ is the learning rate at the t-th step, λ is the $L_2$ regularization (weight-decay) hyper-parameter, and $\nabla_{W_l} \mathcal{S}_g(W_l, s_l)$ is the gradient of $\mathcal{S}_g(W_l, s_l)$ w.r.t. $W_l$.

Now, S is non-differentiable, so we use a popular subgradient which leads to the following update equation:

$$W_l^{(t+1)} \leftarrow (1 - \eta_t \cdot \lambda)\, W_l^{(t)} \;-\; \eta_t\, \mathbf{1}\big\{\mathcal{S}_g(W_l^{(t)}, s_l) \neq 0\big\} \odot \nabla_{\mathcal{S}_g(W_l, s_l)} L\big(\mathcal{S}_g(W^{(t)}, s),\, D\big), \qquad (4)$$

where $\mathbf{1}\{\cdot\}$ is the indicator function and $A \odot B$ denotes the element-wise (Hadamard) product of tensors A and B.

Now, if g is a continuous function, then using the STR formulation in (2) and (1), it is clear that $L(\mathcal{S}_g(W, s), D)$ is a continuous function of s. Further, the sub-gradient of L w.r.t. s can be computed and used for gradient descent on s as well; see Appendix A.2. Algorithm 1 in the Appendix shows the implementation of STR for 2D convolution. STR can also be applied to the eigenvalues of a parameter tensor, instead of the individual entries mentioned above, resulting in low-rank tensors; see Section 4.2.1 for further details. Note that s also has the same weight-decay parameter λ.
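The sub-gradient in (4) is exactly what automatic differentiation produces for the reparameterized forward pass; the toy check below (ours, with a stand-in quadratic loss rather than the actual training loss) shows that pruned entries receive zero gradient while surviving entries pass the loss gradient through unchanged:

import torch

def soft_threshold(w, s, g=torch.sigmoid):
    # S_g(w, s) = sign(w) * ReLU(|w| - g(s)), as in (1).
    return torch.sign(w) * torch.relu(torch.abs(w) - g(s))

w = torch.tensor([0.9, -0.2, 0.05, -1.3, 0.4, -0.6], requires_grad=True)
s = torch.tensor(0.0)                     # threshold alpha = sigmoid(0) = 0.5, kept fixed here

sw = soft_threshold(w, s)
loss = (sw ** 2).sum()                    # toy stand-in for L(S_g(W, s), D)
loss.backward()

print(w.grad)                               # zero for the pruned entries, 2 * S_g(w, s) for the rest
print((w.grad == 0) == (sw.detach() == 0))  # all True: the gradient is masked by the indicator in (4)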

Naturally, g plays a critical role here, as a sharp g can lead to an arbitrary increase in the threshold, leading to poor accuracy, while a flat g can lead to slow learning and wasted gradient iterations. Practical considerations for the choice of g are discussed in Appendix A.1. For the experiments, g is set to the Sigmoid function when training CNNs and to the exponential function for RNNs. Typically, {s_l}_{l∈[L]} are initialized with s_init to ensure that the thresholds {α_l = g(s_l)}_{l∈[L]} start close to 0. Figure 1 shows that the threshold dynamics are guided by a combination of the gradients from L and the weight-decay on s. Further, the overall sparsity budget for STR is not set explicitly; instead, it is controlled by the weight-decay parameter (λ) and can be further fine-tuned using s_init. Interestingly, this curve is similar to that of the handcrafted heuristic for the thresholds defined in (Narang et al., 2019). Figure 2 shows the overall learnt sparsity budget for ResNet50 during training. The curve looks similar to GMP's (Zhu & Gupta, 2017) handcrafted sparsification heuristic; however, STR learns it via backpropagation. Finally, each parameter tensor learns a different threshold value, {α_l}_{l∈[L]}, resulting in unique final thresholds across the layers, as shown in Figure 3 for ResNet50. This, in turn, results in the non-uniform sparsity budget which is empirically shown to be effective in increasing prediction accuracy while reducing the inference cost (FLOPs). Moreover, (4) shows that the gradient update itself is sparse, as the gradient of L is multiplied with the indicator function of $\mathcal{S}_g(W_l, s_l) \neq 0$, which gets sparser over iterations (Figure 2). So STR addresses both of the issues with standard PGD methods (Hard/Soft Thresholding) mentioned above.
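The per-layer quantities plotted in Figures 1-3 can be read off a trained model with a small helper like the one below (illustrative only; it assumes each STR layer stores its dense weight in `weight` and its threshold parameter in `s`, as in the STRConv sketch in the Appendix):

import torch

def layerwise_threshold_and_sparsity(model, g=torch.sigmoid):
    # Returns (layer name, alpha_l = g(s_l), fraction of pruned weights) for every STR layer.
    stats = []
    for name, m in model.named_modules():
        if hasattr(m, "weight") and hasattr(m, "s") and isinstance(m.s, torch.nn.Parameter):
            alpha = g(m.s).item()
            sw = torch.sign(m.weight) * torch.relu(torch.abs(m.weight) - g(m.s))
            stats.append((name, alpha, (sw == 0).float().mean().item()))
    return stats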

Figure 1. The learnt threshold parameter, α = g(s), for layer 10 in 90% sparse ResNet50 on ImageNet-1K over the course of training.
Figure 2. The progression of the learnt overall budget for 90% sparse ResNet50 on ImageNet-1K over the course of training.
Figure 3. The final learnt threshold values, $[\alpha_l]_{l=1}^{54} = [g(s_l)]_{l=1}^{54}$, for all the layers in 90% sparse ResNet50 on ImageNet-1K.

3.1. Analysis

The reparameterization trick using the projection operator's functional form can be used for standard constrained optimization problems as well (assuming the projection operator has a closed form). However, it is easy to show that, in general, such a method need not converge to the optimal solution even for convex functions over convex sets. This raises a natural question about the effectiveness of the technique for the sparse weight learning problem. It turns out that for sparsity constrained problems, STR is very similar to backward pruning (Hastie et al., 2009), which is a well-known technique for sparse regression. Note that, similar to Hard/Soft Thresholding, standard backward pruning also does not support differentiable tuning of the thresholds, which makes it challenging to apply to DNNs.

To further establish this connection, consider a standard sparse regression problem where y = Xw*, X_{ij} ∼ N(0, 1), and X ∈ R^{n×d}. w* ∈ {0, 1}^d has r ≪ d non-zeros, and d ≫ n ≫ r log d. Due to the initialization, g(s) ≈ 0 in the initial few iterations, so gradient descent converges to the least ℓ2-norm regression solution, ŵ = U U^⊤ w*, where U ∈ R^{d×n} is the right singular vector matrix of X and spans a random n-dimensional subspace. Since n ≫ r log d, U_S U_S^⊤ ≈ (n/d)·I, where S = supp(w*) and U_S indexes the rows of U corresponding to S. That is, for j ∈ S the entries |U_j · U^⊤ w*| concentrate around n/d up to a 1 − o(1) factor, while |U_j · U_S^⊤ w*_S| ≲ (√(nr)/d)·√(log d) with high probability for j ∉ S. As n ≫ r log d, almost all the elements of supp(w*) will be among the top O(n) elements of ŵ. Furthermore, X·S_g(ŵ, s) ≈ y, so the loss gradient w.r.t. s is negligible, |s| decreases significantly via the weight-decay, and hence g(s) becomes large enough to prune all but, say, O(n) elements. Using a similar argument as above leads to further pruning of ŵ, while ensuring recovery of almost all elements in supp(w*).
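This argument is easy to check numerically; the small simulation below (ours, with arbitrary values of d, n and r satisfying the stated scalings) verifies that the support of w* lands among the largest-magnitude entries of the minimum ℓ2-norm solution:

import numpy as np

rng = np.random.default_rng(0)
d, n, r = 4000, 400, 5                       # d >> n >> r log d
X = rng.standard_normal((n, d))
S = rng.choice(d, size=r, replace=False)
w_star = np.zeros(d)
w_star[S] = 1.0
y = X @ w_star

# Minimum l2-norm interpolating solution: the point gradient descent converges to while g(s) ~ 0.
w_hat = np.linalg.pinv(X) @ y

top = np.argsort(np.abs(w_hat))[-n:]         # top O(n) entries by magnitude
print(np.isin(S, top).mean())                # typically 1.0: supp(w*) survives thresholding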

4. Experiments

This section presents the experiments and the resulting observations from applying STR for (a) unstructured sparsity in CNNs and (b) structured sparsity in RNNs.

4.1. Unstructured Sparsity in CNNs

4.1.1. Experimental Setup

ImageNet-1K (Deng et al., 2009) is a widely used large-scale image classification dataset with 1K classes. All the CNN experiments presented are on ImageNet-1K. ResNet50 (He et al., 2016) and MobileNetV1 (Howard et al., 2017) are two popular CNN architectures. ResNet50 is extensively used in the literature to show the effectiveness of sparsity in CNNs. Experiments on MobileNetV1 argue for the generalizability of the proposed technique (STR). Dataset and model details can be found in Appendix A.6.

STR was compared against strong state-of-the-art baselines in various sparsity regimes, including GMP (Gale et al., 2019), DSR (Mostafa & Wang, 2019), DNW (Wortsman et al., 2019), SNFS (Dettmers & Zettlemoyer, 2019), RigL (Evci et al., 2019) and DPF (Lin et al., 2020). GMP and DNW always use a uniform sparsity budget. RigL, SNFS, DSR, and DPF were compared in their original form. Exceptions for the uniform sparsity are marked in Table 1. The "+ ERK" suffix implies the usage of the ERK budget (Evci et al., 2019) instead of the original sparsity budget. Even though VD (Molchanov et al., 2017) achieves state-of-the-art results, it is omitted due to the 2× memory and 4× compute footprint during training. Typically VD and IMP use a global threshold for sparsity, which can also be learnt using STR. The unstructured sparsity experiments presented compare the techniques which induce layer-wise sparsity; note that STR is generalizable to other scenarios as well. Open-source implementations/models of the available techniques were used as the baselines. CNN experiments were run on a machine with 4 NVIDIA Titan X (Pascal) GPUs.

Table 1. STR is the state-of-the-art for unstructured sparsity in ResNet50 on ImageNet-1K while having lesser inference cost (FLOPs) than the baselines across all the sparsity regimes. ∗ and # imply that the first and last layer are dense respectively.

All baselines use the hyperparameter settings defined in their implementations/papers. The experiments for STR use a batch size of 256 and a cosine learning-rate schedule, and are trained for 100 epochs following the hyperparameter settings in (Wortsman et al., 2019). STR has the weight-decay (λ) and s_init hyperparameters to control the overall sparsity in CNNs; the values used can be found in Appendix A.5. GMP 1.5× (Gale et al., 2019) and RigL 5× (Evci et al., 2019) show that training the networks longer increases accuracy. However, due to the limited compute resources and environmental concerns (Schwartz et al., 2019), all the experiments were run only for around 100 epochs (∼3 days each). Unstructured sparsity in CNNs with STR is enforced by learning one threshold per layer, as shown in Figure 3. The PyTorch STRConv code can be found in Algorithm 1 of the Appendix.
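For reference, the recipe described above roughly corresponds to the optimizer setup sketched below (hyperparameter values are placeholders, not the released configuration). Note that passing a single weight-decay λ to the optimizer also decays the threshold parameters s of the STR layers, which is what controls the overall sparsity:

import torch

def build_optimizer(model, lr=0.1, momentum=0.9, weight_decay=1e-4, epochs=100):
    # SGD with momentum and a cosine learning-rate schedule over the full training run.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum,
                                weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler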

4.1.2. ResNet50 on ImageNet-1K

A fully dense ResNet50 trained on ImageNet-1K has 77% top-1 validation accuracy. STR is compared extensively to the aforementioned baselines, and the results are summarized in Table 1. STR comfortably beats all the baselines across all the sparsity regimes, as seen in Table 1, and is the state-of-the-art for unstructured sparsity. Figure 4 shows that STR forms a frontier curve encompassing all the baselines at all levels of sparsity. Very few methods are stable in the ultra sparse regime of 98-99% sparsity and only GMP can achieve 99% sparsity. STR is very stable even in the ultra sparse regime, as shown in Table 1 and Figure 4, while being at least 10% higher in accuracy than GMP at 99% sparsity.

Figure 4. STR forms a frontier curve over all the baselines in all sparsity regimes showing that it is the state-of-the-art for unstructured sparsity in ResNet50 on ImageNet-1K.

STR induces non-uniform sparsity across layers. Table 1 and Figure 5 show that STR produces models which have lower or similar inference FLOPs compared to the baselines while having better prediction accuracy in all the sparsity regimes. This hints at the fact that STR could be redistributing the parameters, thereby reducing the FLOPs. In the 80% sparse models, STR is at least 0.19% better in accuracy than the baselines while having at least 60M (6.5%) lesser FLOPs. Similarly, STR has state-of-the-art accuracy in the 90%, 95% and 96.5% sparse regimes while having at least 68M (16.5%), 45M (22%) and 140M (54%) lesser FLOPs than the best baselines respectively. In the ultra sparse regime of 98% and 99% sparsity, STR has similar or slightly higher FLOPs compared to the baselines but is at least 4.6% and 10% better in accuracy respectively. Table 1 shows that the non-uniform sparsity baselines like SNFS, SNFS+ERK and RigL+ERK can have up to 2-4× higher inference cost (FLOPs) due to a non-optimal layer-wise distribution of the parameter weights.

Observations: STR on ResNet50 leads to some interesting observations related to sparsity and inference cost (FLOPs), listed below and further discussed in Section 5:

Figure 5. STR results in ResNet50 models on ImageNet-1K which have the lowest inference cost (FLOPs) for any given accuracy.

1. STR is state-of-the-art for unstructured sparsity.

2. STR minimizes inference cost (FLOPs) while maintaining accuracy in the 80-95% sparse regime.

3. STR maximizes accuracy while maintaining inference cost (FLOPs) in the 98-99% ultra sparse regime.

4. STR learns a non-uniform layer-wise sparsity, shown in Figure 6, where the initial layers of the CNN can be sparser than those of the existing non-uniform sparsity methods. All the non-uniform budgets learnt through STR can be found in Appendix A.3.

5. Figure 6 also shows that the last layers through STR are denser than those of the other methods, which is contrary to the understanding in the literature of non-uniform sparsity (Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2019; Gale et al., 2019). This leads to a sparser backbone for transfer learning. The backbone sparsities can be found in Appendix A.3.

6. Figure 7 shows the layer-wise FLOPs distribution for the non-uniform sparsity methods. STR adjusts the FLOPs across layers such that it has lower FLOPs than the baselines. Note that the other non-uniform sparsity budgets lead to heavy compute overhead in the initial layers due to denser parameter tensors.

Figure 6. Layer-wise sparsity budget for the 90% sparse ResNet50 models on ImageNet-1K using various sparsification techniques.
Figure 7. Layer-wise FLOPs budget for the 90% sparse ResNet50 models on ImageNet-1K using various sparsification techniques.

4.1.3. MobileNetV1 on ImageNet-1K

MobileNetV1 was trained on ImageNet-1K for unstructured sparsity with STR to ensure generalizability. Since GMP is the state-of-the-art baseline as shown earlier, STR was only compared to GMP for the 75% and 90% sparsity regimes. A fully dense MobileNetV1 has a top-1 accuracy of 70.60% on ImageNet-1K. GMP (Zhu & Gupta, 2017) keeps the first layer and the depthwise convolution layers dense for MobileNetV1 to ensure training stability and maximize accuracy. Table 2 shows that STR is at least 0.65% better than GMP for 75% sparsity, while having at least 62M (38%) lesser FLOPs. More interestingly, STR has state-of-the-art accuracy while having up to 50% (40M) lesser FLOPs than GMP in the 90% sparsity regime. All the observations made for ResNet50 hold for MobileNetV1 as well. The sparsity and FLOPs distributions across layers can be found in Appendix A.4.

Table 2. STR is up to 3% higher in accuracy while having 33% lesser inference cost (FLOPs) for MobileNetV1 on ImageNet-1K.

4.2. Structured Sparsity in RNNs

4.2.1. Experimental Setup

Google-12 is a speech recognition dataset that has 12 classes made from the Google Speech Commands dataset (Warden, 2018). HAR-2 is a binarized version of the 6-class Human Activity Recognition dataset (Anguita et al., 2012) . These two datasets stand as compelling cases for on-device resource-efficient machine learning at the edge. Details about the datasets can be found in Appendix A.6.

FastGRNN (Kusupati et al., 2018) was proposed to enable powerful RNN models on resource-constrained devices. FastGRNN relies on making the RNN parameter matrices low-rank, sparse and quantized. As low-rank is a form of structured sparsity, experiments were done to show the effectiveness of STR for structured sparsity. The input vector to the RNN at each timestep and the hidden state have D and D̂ dimensionality respectively. FastGRNN has two parameter matrices, W ∈ R^{D̂×D} and U ∈ R^{D̂×D̂}, which are reparameterized as products of low-rank matrices, W = W₁W₂ and U = U₁U₂, where W₁ ∈ R^{D̂×r_W}, W₂ ∈ R^{r_W×D}, and U₁ᵀ, U₂ ∈ R^{r_U×D̂}; r_W and r_U are the ranks of the respective matrices. To make the ranks learnable with STR, the reparameterization is modified to W = W₁ diag(m_W) W₂ and U = U₁ diag(m_U) U₂, where m_W = 1_D, m_U = 1_{D̂}, W₁ ∈ R^{D̂×D}, W₂ ∈ R^{D×D}, and U₁, U₂ ∈ R^{D̂×D̂}.

To learn the low-rank structure, STR is applied on the m_W and m_U vectors. Learning low-rank with STR on m_W and m_U can be thought of as inducing unstructured sparsity on the two trainable vectors, aiming for the right r_W and r_U.
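A minimal sketch of this construction is shown below (class and attribute names are ours, not from the released code): soft-thresholding the gating vector zeroes some of its entries, which caps the rank of the product, so the surviving entries determine the learnt rank.

import torch
import torch.nn as nn

def soft_threshold(x, s, g=torch.sigmoid):
    return torch.sign(x) * torch.relu(torch.abs(x) - g(s))

class LowRankSTR(nn.Module):
    # W is reparameterized as W1 @ diag(S_g(m, s)) @ W2; STR is applied to the vector m.
    def __init__(self, d_hidden, d_input, s_init=-10.0):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(d_hidden, d_input) / d_input ** 0.5)
        self.W2 = nn.Parameter(torch.randn(d_input, d_input) / d_input ** 0.5)
        self.m = nn.Parameter(torch.ones(d_input))     # starts as the all-ones vector (full rank)
        self.s = nn.Parameter(torch.tensor(s_init))

    def weight(self):
        return self.W1 @ torch.diag(soft_threshold(self.m, self.s)) @ self.W2

    def learnt_rank(self):
        return int((soft_threshold(self.m, self.s) != 0).sum())

Entries of m whose magnitude falls below the learnt threshold g(s) are zeroed during training, which reduces the effective rank.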

The baseline is low-rank FastGRNN where the ranks of the matrices are preset (Kusupati et al., 2018). The EdgeML (Dennis et al.) implementation of FastGRNN was used for the experiments with the hyperparameters suggested in the paper. This is referred to as vanilla training. Hyperparameters for the models can be found in Appendix A.5.

Table 3. STR can induce learnt low-rank in FastGRNN resulting in up to 2.47% higher accuracy than the vanilla training.

4.2.2. FastGRNN on Google-12 and HAR-2

Table 3 presents the results for low-rank FastGRNN with vanilla training and STR. Full-rank non-reparameterized FastGRNN has an accuracy of 92.60% and 96.10% on Google-12 and HAR-2 respectively. STR outperforms vanilla training by up to 1.67% in four different model-size-reducing rank settings on Google-12. Similarly, on HAR-2, STR is better than vanilla training in all the rank settings by up to 2.47%. Note that the accuracy of the low-rank models obtained by STR is either better than or on par with the full-rank models while being around 50% and 70% smaller in size (low-rank) for Google-12 and HAR-2 respectively.

These experiments for structured sparsity in RNNs show that STR can be applied appropriately to obtain low-rank parameter tensors. Similarly, STR can be extended and applied to methods for channel pruning and block sparsity (Liu et al., 2019; He et al., 2017; Huang & Wang, 2018) in DNNs.

5. Discussion And Drawbacks

STR's usage for unstructured sparsity leads to interesting observations as noted in Section 4.1.2. It is clear from Table 1 and Figures 4, 5 that STR achieves state-of-the-art accuracy for all the sparsity regimes and also reduces the FLOPs in doing so. STR helps in learning non-uniform sparsity budgets which are intriguing to study as an optimal non-uniform sparsity budget can ensure minimization of FLOPs while maintaining accuracy. Although it is not clear why STR's learning dynamics result in a non-uniform budget that minimizes FLOPs, the reduction in FLOPs is due to the better redistribution of parameters across layers.

Non-uniform sparsity budgets learnt by STR have the initial and middle layers sparser than the other methods while making the last layers denser. Conventional wisdom suggests that the initial layers should be denser, as the early loss of information would be hard to recover; this drives the existing non-uniform sparsity heuristics. As most of the parameters are present in the deeper layers, the existing methods tend to make them sparser while not affecting the FLOPs by much. STR, on the other hand, balances the FLOPs and sparsity across the layers, as shown in Figures 6 and 7, making it a lucrative and efficient choice. The denser final layers along with sparser initial and middle layers point to sparser CNN backbones obtained using STR. These sparse backbones can be viable options for efficient representation/transfer learning for downstream tasks. Table 4 shows the effectiveness/transferability of the learnt non-uniform budget through STR for 90% sparse ResNet50 on ImageNet-1K. DNW typically takes in a uniform sparsity budget and has an accuracy of 74% for a 90% sparse ResNet50. Using the ERK non-uniform budget for 90% sparsity results in a 0.1% increase in accuracy at the cost of 2.35× inference FLOPs. Training DNW with the learnt budget from STR results in a 0.53% accuracy boost while reducing FLOPs by 66M (16%). Note that the learnt non-uniform budgets can also be obtained using smaller representative datasets instead of expensive large-scale experiments.

Table 4. Effect of various layer-wise sparsity budgets when used with DNW for ResNet50 on ImageNet-1K.

The major drawback of STR is the tuning of the weight-decay parameter λ, and the finer tuning with s_init, to obtain the targeted overall sparsity. One way to circumvent this issue is to freeze the non-uniform sparsity distribution in the middle of training, once the overall sparsity constraints are met, and train for the remaining epochs. This might not give the best results, but it can give a similar budget which can then be transferred to methods like GMP or DNW. Another drawback of STR is the choice of the function g for the threshold. The stability, expressivity and sparsification capability of STR depend on g. However, it should be noted that the Sigmoid and exponential functions work just fine as g for STR.
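One possible realization of the freezing workaround (a sketch under the assumption that STR layers expose `weight` and `s` as in the Appendix code) is to monitor the overall sparsity and stop updating the threshold parameters once the target is reached:

import torch

def maybe_freeze_thresholds(model, target_sparsity, g=torch.sigmoid):
    # Compute the overall sparsity of the soft-thresholded weights.
    total, zeros = 0, 0
    for m in model.modules():
        if hasattr(m, "weight") and hasattr(m, "s"):
            sw = torch.sign(m.weight) * torch.relu(torch.abs(m.weight) - g(m.s))
            total += sw.numel()
            zeros += (sw == 0).sum().item()
    # Once the target is met, freeze every s so the learnt budget stops changing.
    if total and zeros / total >= target_sparsity:
        for m in model.modules():
            if hasattr(m, "s") and isinstance(m.s, torch.nn.Parameter):
                m.s.requires_grad_(False)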

6. Conclusions

This paper proposed Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on the weights in DNNs, to smoothly induce sparsity while learning pruning thresholds, thereby obtaining a non-uniform sparsity budget. Extensive experimentation showed that STR is the state-of-the-art technique for unstructured sparsity in CNNs (ResNet50 and MobileNetV1 on ImageNet-1K) while also being effective for structured sparsity in RNNs. Our method results in sparse models that have significantly lesser inference costs than the baselines. In particular, STR achieves the same accuracy as the baselines for 90% sparse MobileNetV1 with 50% lesser FLOPs. STR has ∼10% higher accuracy than the existing methods in the ultra sparse regime (99% sparse) for ResNet50, showing the effectiveness of the learnt non-uniform sparsity budget across layers. Finally, STR can also induce a low-rank structure in RNNs while increasing prediction accuracy, showing the generalizability of the proposed reparameterization.

References

Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9, 2018.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024-8035, 2019.

Patil, S. G., Dennis, D. K., Pabbaraju, C., Shaheer, N., Simhadri, H. V., Seshadri, V., Varma, M., and Jain, P. GesturePod: Enabling on-device gesture-based interaction for white cane users. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, 2019.

Roy, D., Srivastava, S., Kusupati, A., Jain, P., Varma, M., and Arora, A. One size does not fit all: Multi-scale, cascaded RNNs for radar classification. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, pp. 1-10, 2019.

Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. Green AI. arXiv preprint arXiv:1907.10597, 2019.

Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018. URL http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz.

Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074-2082, 2016.

Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In Advances in Neural Information Processing Systems, pp. 2680-2690, 2019.

Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.

A. Appendix

A.1. Characterization of g

For the training dynamics of s, we propose some desired properties for choosing g : R → R++:

  • ∃ G ∈ R++ such that 0 < g′(s) ≤ G ∀ s ∈ R.
  • 0 < g(s) ∀ s ∈ R, with lim_{s→−∞} g(s) = 0.
  • g′(s_init) < 1, providing us a handle on the dynamics of s.

For simplicity, the choices of g were the logistic sigmoid function, g(s) = k/(1 + e^{−s}), and the exponential function, g(s) = k·e^{s}, for a scaling constant k > 0, since in most of the experimental scenarios we almost always have s < 0 throughout the training, making both choices satisfy all the desired properties on R−. One can choose k as an appropriate scaling factor based on the final weight distribution of a given DNN. All the CNN experiments in this paper use the logistic sigmoid function with k = 1, as the final learnt weight values typically have magnitude at most 1, and the low-rank RNN experiments use the exponential function with k = 1. It should be noted that better functional choices might exist for g and can affect the expressivity and dynamics of the STR parameterization for inducing sparsity.
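A quick check (ours) of the two choices at a typical initialization such as s_init = −10 shows that both start with a threshold of essentially zero, consistent with the initialization described in Section 3:

import math

def g_sigmoid(s, k=1.0):
    return k / (1.0 + math.exp(-s))

def g_exp(s, k=1.0):
    return k * math.exp(s)

s_init = -10.0
print(g_sigmoid(s_init), g_exp(s_init))   # both ~4.5e-05, i.e., the pruning thresholds start near 0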

A.2. Gradient w.r.t. {s_l}_{l∈[L]}

The gradient w.r.t. $s_l$, $\forall\, l \in [L]$, takes an even more interesting form. Applying the chain rule to (1) gives

$$\nabla_{s_l} L\big(\mathcal{S}_g(W, s),\, D\big) = -\,g'(s_l)\, \Big\langle \text{sign}(W_l) \odot \mathbf{1}\big\{\mathcal{S}_g(W_l, s_l) \neq 0\big\},\; \nabla_{\mathcal{S}_g(W_l, s_l)} L\big(\mathcal{S}_g(W, s),\, D\big) \Big\rangle, \qquad (5)$$

where $\langle \cdot, \cdot \rangle$ denotes the sum of the element-wise products of its arguments. The corresponding gradient descent update for $s_l$ at the t-th step is

$$s_l^{(t+1)} \leftarrow (1 - \eta_t \cdot \lambda)\, s_l^{(t)} \;-\; \eta_t\, \nabla_{s_l} L\big(\mathcal{S}_g(W^{(t)}, s^{(t)}),\, D\big), \qquad (6)$$

where λ is the $L_2$ regularization (weight-decay) hyperparameter for s.
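Equation (5) can be verified numerically against automatic differentiation; the sketch below (ours) uses a toy quadratic loss in place of L:

import torch

w = torch.tensor([0.9, -0.2, 0.05, -1.3, 0.4, -0.6])
s = torch.tensor(0.0, requires_grad=True)            # alpha = sigmoid(0) = 0.5
g = torch.sigmoid

sw = torch.sign(w) * torch.relu(torch.abs(w) - g(s))
loss = (sw ** 2).sum()                               # toy stand-in for L
loss.backward()

grad_S = 2 * sw.detach()                             # dL/dS_g for the toy loss
mask = (sw.detach() != 0).float()                    # indicator of surviving entries
analytic = -(g(s) * (1 - g(s))).detach() * (torch.sign(w) * mask * grad_S).sum()
print(torch.allclose(s.grad, analytic))              # True: matches the closed form in (5)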

A.3. ResNet50 Learnt Budgets and Backbone Sparsities

Table 5 lists the non-uniform sparsity budgets learnt through STR across the sparsity regimes of 80%, 90%, 95%, 96.5%, 98% and 99% for ResNet50 on ImageNet-1K, along with the backbone sparsity of every budget. It is clear that STR results in a higher than expected sparsity in the backbones of CNNs, resulting in efficient backbones for transfer learning. Table 6 summarizes all the sparsity budgets for 90% sparse ResNet50 on ImageNet-1K obtained using various methods; it also shows that the backbone sparsities learnt through STR are considerably higher than those of the baselines. One can use these budgets directly with techniques like GMP and DNW for a variety of datasets and obtain significant accuracy gains, as shown in Table 4.

Table 5. The non-uniform sparsity budgets for various sparsity ranges learnt through STR for ResNet50 on ImageNet-1K. The FLOPs of layer i can be computed as (100 − s_i)% of that layer's dense FLOPs, where s_i is the layer's sparsity.
Table 6. The non-uniform sparsity budgets learnt by multiple methods for 90% sparse ResNet50 on ImageNet-1K. The FLOPs of layer i can be computed as (100 − s_i)% of that layer's dense FLOPs, where s_i is the layer's sparsity.

A.4. MobileNetV1 Sparsity and FLOPs Budget Distributions

Table 7 summarizes all the sparsity budgets for 90% sparse MobileNetV1 on ImageNet-1K obtained using various methods. Note that GMP here keeps the first and the depthwise (dw) convolution layers dense, hence it is not the standard uniform sparsity. The table also shows that the backbone sparsities learnt through STR are considerably higher than those of GMP. Figure 8 shows the sparsity distribution across layers and Figure 9 shows the FLOPs distribution across layers, compared to GMP, for the 90% sparse MobileNetV1 models on ImageNet-1K.

Figure 8. Layer-wise sparsity budget for the 90% sparse MobileNetV1 models on ImageNet-1K using various sparsification techniques.
Figure 9. Layer-wise FLOPs distribution for the 90% sparse MobileNetV1 models on ImageNet-1K using various sparsification techniques.
Table 7. The non-uniform sparsity budgets learnt by multiple methods for 90% sparse MobileNetV1 on ImageNet-1K. The FLOPs of layer i can be computed as (100 − s_i)% of that layer's dense FLOPs, where s_i is the layer's sparsity.

Algorithm 1 PyTorch code for STRConv with per-layer threshold.

import torch
import torch.nn as nn
import torch.nn.functional as F
from args import args as parser_args

def softThreshold(x, s, g=torch.sigmoid):
    # STR on a weight x (can be a tensor) with "s" (typically a scalar, but can be a tensor) and function "g".
    return torch.sign(x) * torch.relu(torch.abs(x) - g(s))

class STRConv(nn.Conv2d):
    # Overloaded Conv2d which can replace nn.Conv2d.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # "g" can be chosen appropriately, but torch.sigmoid works fine.
        self.g = torch.sigmoid
        # parser_args gets arguments from the command line. sInitValue is the initialization of "s"
        # for all layers. It can take in different values per layer as well.
        self.s = nn.Parameter(parser_args.sInitValue * torch.ones([1, 1]))
        # "s" can be per-layer (a scalar), per-channel/filter (a vector) or per individual weight
        # (a tensor of the size of self.weight). All the experiments in the paper use a per-layer "s" (a scalar).

    def forward(self, x):
        self.sparseWeight = softThreshold(self.weight, self.s, self.g)
        # Parameters other than "x" and "self.sparseWeight" can be chosen appropriately.
        # All the experiments use the default PyTorch arguments.
        x = F.conv2d(x, self.sparseWeight, self.bias, self.stride,
                     self.padding, self.dilation, self.groups)
        return x

# The FC layer is implemented as a 1x1 Conv2d and STRConv is used for the FC layer as well.

A.5. Hyperparameters

Table 8. The hyperparameters for various sparse ResNet50 models on ImageNet-1K using STR. λ is the weight-decay parameter and s_init is the initialization of all s_l for all the layers in ResNet50.
Table 9. The hyperparameters for various sparse MobileNetV1 models on ImageNet-1K using STR. λ is the weight-decay parameter and s_init is the initialization of all s_l for all the layers in MobileNetV1.

Table 10. Hyperparameters for the low-rank FastGRNN with STR. The same weight-decay parameter λ is applied on both m_W and m_U. Multiple rank settings can be achieved during the training course of the FastGRNN model. g(s_init) ≈ 0, i.e., s_init ≤ −10 for all the experiments.

Google-12                              HAR-2
(r_W, r_U)    Weight-decay (λ)         (r_W, r_U)    Weight-decay (λ)
(12, 40)      0.001                    (9, 8)        0.001
(11, 35)      0.001                    (9, 7)        0.001
(10, 31)      0.002                    (8, 7)        0.001
(9, 24)       0.005

A.6. Dataset and Model Details

Google-12: Google-12 is built from the Google Speech Commands dataset. The 10 keyword classes were used directly, and the remaining two classes were noise and background sounds (taken randomly from the remaining 20 short word utterances). The datasets were zero mean - unit variance normalized during training and prediction. Google-12 has 22,246 training points and 3,081 testing points. Each datapoint has 99 timesteps with each input being 32 dimensional, making the datapoint 3,168 dimensional.

HAR-2: The Human Activity Recognition (HAR) dataset was collected from an accelerometer and gyroscope on a Samsung Galaxy S3 smartphone. The features available on the repository were directly used for the experiments. The 6 activities were merged to get the binarized version: the classes {Sitting, Laying, Walking Upstairs} and {Standing, Walking, Walking Downstairs} were merged to obtain the two classes. The dataset was zero mean - unit variance normalized during training and prediction. HAR-2 has 7,352 training points and 2,947 test points. Each datapoint has 1,152 dimensions, which are split into 128 timesteps, leading to 9-dimensional per-timestep inputs.

ResNet50: ResNet50 is a very popular CNN architecture and is widely used to showcase the effectiveness of sparsification techniques. ResNet50 has 54 parameter layers (including fc) and a couple of pooling layers (which contribute minimally to FLOPs). All the batchnorm parameters are left dense and are learnt during the training. STR can be applied per-layer, per-channel and even per-weight to obtain unstructured sparsity and the aggressiveness of sparsification increases in the same order. This paper only uses per-layer STR which makes it have 54 additional learnable scalars. The layer-wise parameters and FLOPs can be seen in Tables 6 and 5 . All the layers had no bias terms.

MobileNetV1: MobileNetV1 is a popular efficient CNN architecture. It is used to showcase the generalizability of sparsification techniques. MobileNetV1 has 28 parameter layers (including fc) and a couple of pooling layers (which contribute minimally to FLOPs). All the batchnorm parameters are left dense and are learnt during the training. STR can be applied per-layer, per-channel and even per-weight to obtain unstructured sparsity, and the aggressiveness of sparsification increases in the same order. This paper only uses per-layer STR, which adds 28 learnable scalars. The layer-wise parameters and FLOPs can be seen in Table 7. All the layers had no bias terms.

FastGRNN: FastGRNN's update equations can be found in (Kusupati et al., 2018) . FastGRNN, in general, benefits a lot from the low-rank reparameterization and this enables it to be deployed on tiny devices without losing any accuracy. FastGRNN's biases and final classifier are left untouched in all the experiments and only the input and hidden projection matrices are made low-rank. All the hyperparameters were set specific to the datasets according to (Kusupati et al., 2018) .