
## Neurocomputing

journal homepage: www.elsevier.com/locate/neucom

# Recent advances in convolutional neural network acceleration

Qianru Zhang<sup>a</sup>, Meng Zhang<sup>a,\*</sup>, Tinghuan Chen<sup>a,b</sup>, Zhifei Sun<sup>a</sup>, Yuzhe Ma<sup>b</sup>, Bei Yu<sup>b</sup>

<sup>a</sup> National ASIC System Engineering Technology Research Center, Southeast University, Nanjing, China <sup>b</sup> Department of Computer Science & Engineering, The Chinese University of Hong Kong, Hong Kong, China

#### ARTICLE INFO

Article history: Received 31 December 2017 Revised 28 July 2018 Accepted 17 September 2018 Available online 21 September 2018

Keywords: Convolutional neural network Model compression Algorithm optimization Hardware acceleration

## ABSTRACT

In recent years, convolutional neural networks (CNNs) have shown great performance in various fields such as image classification, pattern recognition, and multi-media compression. Two of their characteristic properties, local connectivity and weight sharing, reduce the number of parameters and increase processing speed during training and inference. However, as the dimension of data becomes higher and the CNN architecture becomes more complicated, CNNs, whether used end-to-end or in combination with other methods, become computationally intensive, which limits their further deployment. Therefore, it is necessary and urgent to implement CNNs in a faster way. In this paper, we first summarize the acceleration methods that contribute to, but are not limited to, CNNs by reviewing a broad variety of research papers. We propose a taxonomy in terms of three levels, i.e., structure level, algorithm level, and implementation level, for acceleration methods. We also analyze the acceleration methods in terms of CNN architecture compression, algorithm optimization, and hardware-based improvement. Finally, we discuss these acceleration and optimization methods from different perspectives within each level. The discussion shows that the methods in each level still leave a large space for exploration. By incorporating such a wide range of disciplines, we expect to provide a comprehensive reference for researchers who are interested in CNN acceleration.

© 2018 Elsevier B.V. All rights reserved.

## 1. Introduction

Convolutional neural network (CNN) architectures have been around for over two decades. Compared with other neural network models such as the multilayer perceptron (MLP), a CNN is designed to take multiple arrays as input and to process them with convolution operators over local fields, mimicking the way eyes perceive images. Therefore, it shows excellent performance in solving computer vision problems such as image classification, recognition and understanding [1–3]. It is also effective in a wide range of other fields, such as speech recognition, which requires correlated speech spectral representations [4], VLSI physical design [5], multi-media compression [6], where it is compared with the traditional DCT transformation and compressive sensing methods [7,8], and cancer detection from a series of condition-changing images [9]. Moreover, AlphaGo, which has recently drawn many top players into Go matches against it, is implemented with CNNs.

However, in order to achieve good prediction performance and accomplish more difficult goals, CNN architectures become deeper and more complicated. At the same time, more pixels are packed into one image thanks to high-resolution acquisition devices. As a result, CNN training and inference are computationally expensive, and their slow speed limits practical implementation. Although acceleration and optimization of CNNs have been explored since CNNs were first proposed, interest has recently grown because of their strong industrial impact.

\* Corresponding author. *E-mail address:* zmeng@seu.edu.cn (M. Zhang).

https://doi.org/10.1016/j.neucom.2018.09.038 0925-2312/© 2018 Elsevier B.V. All rights reserved.

Some companies have unveiled accelerators for deep learning inference that can be extensively used for CNNs. Google's second generation Tensor Processing Unit (TPU) is designed for the TensorFlow framework and offers an increased peak performance of 92 TFLOPS and 28 MiB of on-chip memory. It supports not only integer but also floating-point calculations, which makes it more powerful in deep learning training [10]. NVIDIA has launched an open source project called NVIDIA Deep Learning Accelerator (NVDLA), along with an open license, aimed at people who are interested in data-intensive automotive products. It includes Verilog and a C-model for the chip, Linux drivers, test suites, and kernel- and user-mode software with development tools [11]. Intel's Nervana Neural Network Processor (NNP) has been announced recently for handling neural network matrix multiplications and convolutions. Its memory architecture is designed for efficient operations and data movement without a cache hierarchy, which improves the training performance of deep learning [12].









Fig. 1. Illustration of LeNet-5.

In this paper, we review many recent works and summarize acceleration methods not only at the structure level and algorithm level, but also at the implementation level. This paper differs from other deep neural network review papers in three aspects. (1) Topic: Some review papers summarize the relevant work on deep neural networks in different applications such as computer vision, natural language processing and acoustic signal processing, among which CNN is only one component [13–18]. They give a systematic introduction to the various kinds of neural networks that fit specific applications and provide a guide for people who want to implement deep neural networks in their fields. However, few of them mention acceleration methods, while our paper focuses on CNN and its acceleration methods. (2) Time: Recent deep learning review papers are mostly historical [19–21]. They usually trace the origins back over more than fifteen years to form a big picture of neural network development, which is an inspiring way to reflect on those origins. Our paper focuses on recent research, from a period when hardware has become a limiting factor and efficiency has become the priority. (3) Taxonomy: There are no reviews that incorporate hardware together with algorithms, since they belong to different disciplines. In this paper, we discuss the acceleration methods at all three levels, because they are interwoven and highly dependent.

This survey paper is organized as follows. In Section 2, an overview of the modern CNN structure is given, covering the typical layers on which the improvements focus. In Section 3 we present our taxonomy for recent CNN acceleration methods, followed by an overview of the three categories: CNN compression in Section 4, algorithm optimization in Section 5, and hardware-oriented acceleration in Section 6. After that, Section 7 discusses these methods from different perspectives. Finally, Section 8 concludes this paper with some future challenges.

## 2. Convolutional neural network

The modern convolutional neural network proposed by LeCun [22] is LeNet-5, a 5-layer structure (excluding the input and subsampling layers). It has the structure C1, S2, C3, S4, C5, F6, OUTPUT, as shown in Fig. 1, where C indicates a convolutional layer, S indicates a subsampling layer, and F indicates a fully-connected layer. There are many modifications of the CNN structure that handle more complicated datasets and problems, such as AlexNet (8 layers) [23], GoogLeNet (22 layers) [27], VGG-16 (16 layers) [25], and ResNet (152 layers) [26]. Table 1 summarizes the state-of-the-art CNNs. In this table, the Feature column summarizes the most important parts of each model [29–31]. The Application column lists the fields for which each model was first proposed. The fully-connected layer is followed by a Softmax layer except in LeNet and NIN. As we can see from the table, the number of parameters in modern CNNs is large, which usually means long training and inference times. In addition, higher dimensional input, a large number of parameters, and complex CNN configurations challenge hardware in terms of processing element efficiency, memory bandwidth, off-chip memory, communication and so on.

These different structures share four key features: weight sharing, local connection, pooling, and the use of many layers [20]. They also share some commonly used layers such as convolutional layers, subsampling layers (pooling layers), and fully-connected layers. Usually, there is a convolutional layer after the input. The convolutional layer is often followed by a subsampling layer. This combination repeats several times to increase the depth of the CNN. The fully-connected layers are placed as the last few layers in order to map the extracted features to labels. These layers are introduced as follows.

(a) Input layer: In CNNs, input layers usually take multiple arrays and often have a fixed size. Compared to ordinary fully-connected neural networks, the CNN input does not need size normalization and centering, because CNN enjoys the characteristic of translation invariance [32].

(b) Convolutional layer: As the key layer that distinguishes CNNs from ordinary neural networks, the convolutional layer first computes its neuron units by a convolution operation over small local patches of the input and then applies an activation function (tanh, sigmoid, ReLU, etc.) to form a 2D feature map (one channel of a 3D feature map). In general, we have that

$$\mathbf{Z}_j = \sum_i \mathbf{X}_i * \mathbf{K}_{ij} + \mathbf{B}_j,\tag{1}$$

$$\mathbf{A}_{j} = f(\mathbf{Z}_{j}),\tag{2}$$

where $\mathbf{Z}_j$ represents the output from the convolution operation, $\mathbf{X}_i$ denotes the input to the convolutional layer, $\mathbf{K}_{ij}$ is the convolution kernel, and $\mathbf{B}_j$ is the additive bias. In Eq. (2), $\mathbf{A}_j$ is the output feature map of the convolutional layer and $f(\cdot)$ is an activation function.

Table 1. CNN model summary. C: convolutional layer, S: subsampling layer, F: fully-connected layer.

| Model          | Layer Size                    | Configuration                                                            | Feature                                           | Parameter Size                              | Application                            |
|----------------|-------------------------------|--------------------------------------------------------------------------|---------------------------------------------------|---------------------------------------------|----------------------------------------|
| LeNet [22]     | 5 layers                      | 3C-2S-1F-RBF output layer                                                |                                                   | 60,000                                      | Document recognition                   |
| AlexNet [23]   | 8 layers                      | 5C-3S-3F                                                                 | Local response normalization                      | 60,000,000                                  | Image classification                   |
| NIN [24]       | -                             | 3mlpconv-global average pooling (S can be added in between the mlpconv) | mlpconv layer: 1C-3MLP; global average pooling    | -                                           | Image classification                   |
| VGG [25]       | 11-19 layers                  | VGG-16: 13C-5S-3F                                                        | Increased depth with stacked $3 \times 3$ kernels | 133,000,000 to 144,000,000                  | Image classification and localization  |
| ResNet [26]    | Can be very deep (152 layers) | ResNet-152: 151C-2S-1F                                                   | Residual module                                   | ResNet-20: 270,000; ResNet-1202: 19,400,000 | Image classification, object detection |
| GoogLeNet [27] | 22 layers                     | 3C-9Inception-5S-1F                                                      | Inception module                                  | 6,797,700                                   | Image classification, object detection |
| Xception [28]  | 37 layers                     | 36C-5S-1F                                                                | Depth-wise separable convolutions                 | 22,855,952                                  | Image classification                   |

Table 2. Activation function summary.

| Function  | Saturation    | Definition                                                                | Parameter $\alpha$                 | Plot |
|-----------|---------------|---------------------------------------------------------------------------|------------------------------------|------|
| Sigmoid   | Saturated     | $f(x) = 1/(1 + e^{-x})$                                                   | -                                  | (a)  |
| Tanh      | Saturated     | $f(x) = 2/(1 + e^{-2x}) - 1$                                              | -                                  | (b)  |
| ReLU      | Non-saturated | $f(x) = \begin{cases} x & x \ge 0\\ 0 & x < 0 \end{cases}$                | -                                  | (c)  |
| LeakyReLU | Non-saturated | $f(x) = \begin{cases} x & x \ge 0\\ \alpha x & x < 0 \end{cases}$         | $\alpha \in (0, 1)$                | (d)  |
| PReLU     | Non-saturated | $f(x) = \begin{cases} x & x \ge 0\\ \alpha x & x < 0 \end{cases}$         | $\alpha$ is a learned parameter    | (d)  |
| RReLU     | Non-saturated | $f(x) = \begin{cases} x & x \ge 0\\ \alpha x & x < 0 \end{cases}$         | $\alpha \sim$ uniform(a, b)        | (d)  |
| ELU       | Non-saturated | $f(x) = \begin{cases} x & x \ge 0\\ \alpha (e^x - 1) & x < 0 \end{cases}$ | $\alpha$ is a predefined parameter | (e)  |

Activation functions are mathematical operations over the input that introduce non-linearity into neural networks and help capture non-linear features of the input data. The various types of activation functions are summarized in Table 2 and Fig. 2.

Sigmoid and Tanh are called saturated functions. As we can see from their definitions or plots, when the input is very small or very large, the output saturates at 0 or 1 for Sigmoid and at -1 or 1 for Tanh. There are two problems with saturation. First, the gradients in saturated regions are almost zero, which dramatically weakens the error signal backpropagated through the neurons and makes it difficult to converge in the training phase. Second, more attention needs to be paid to weight initialization when using saturated activation functions, because otherwise the network may not learn at all. To alleviate the saturation problem, many non-saturated activations have been proposed, such as Rectified Linear Unit (ReLU) [33], Leaky ReLU [34], Parametric ReLU (PReLU) [35], Randomized Leaky ReLU (RReLU) [36], and Exponential Linear Unit (ELU) [37].
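The saturation effect can be checked numerically. The following is a minimal NumPy sketch of some of the functions in Table 2, together with the gradients of Sigmoid and ReLU. The default values `alpha=0.01` for Leaky ReLU and `alpha=1.0` for ELU are illustrative assumptions, not values taken from the cited works.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Same form as in Table 2: 2 / (1 + e^{-2x}) - 1
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def relu(x):
    return np.where(x >= 0, x, 0.0)

def leaky_relu(x, alpha=0.01):  # alpha in (0, 1); 0.01 is an assumed example value
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):  # alpha is predefined; 1.0 is an assumed example value
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    return np.where(x >= 0, 1.0, 0.0)

x = np.array([-10.0, 0.0, 10.0])
print("sigmoid'(x):", sigmoid_grad(x))  # nearly zero at both ends (saturation)
print("relu'(x):   ", relu_grad(x))     # stays at 1 for positive inputs
```

At an input of 10 the Sigmoid gradient is already below 1e-4, while the ReLU gradient remains 1, which is the behaviour the non-saturated functions are meant to provide.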

Convolution plays a very important role in CNN. On the one hand, through weight sharing, neurons in the same feature map share the same parameters, which dramatically reduces the total number of parameters. At different spatial locations, the input may contain the same features, such as edges, points, and angles. Weight sharing makes the CNN less sensitive to location and shifting. On the other hand, since each convolution operation targets a small patch of the input, the extracted features retain the intrinsic topology of the input, which helps recognize patterns.
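To make Eqs. (1) and (2) concrete, the following is a naive NumPy sketch of one convolutional layer. It assumes stride 1, no padding, and a true (kernel-flipped) convolution; the array layouts and the function names `conv2d_valid` and `conv_layer` are choices made for this illustration only.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2D convolution of one channel x (H, W) with one kernel k (kh, kw)."""
    kh, kw = k.shape
    k_flipped = k[::-1, ::-1]  # a true convolution flips the kernel
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k_flipped)
    return out

def conv_layer(X, K, B, f):
    """X: (C_in, H, W), K: (C_in, C_out, kh, kw), B: (C_out,). Returns A: (C_out, H', W')."""
    c_in, c_out = K.shape[:2]
    A = []
    for j in range(c_out):
        # Eq. (1): Z_j = sum_i X_i * K_ij + B_j
        Z_j = sum(conv2d_valid(X[i], K[i, j]) for i in range(c_in)) + B[j]
        # Eq. (2): A_j = f(Z_j)
        A.append(f(Z_j))
    return np.stack(A)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8))      # 3 input channels of size 8 x 8
K = rng.standard_normal((3, 4, 3, 3))   # 4 output feature maps, 3 x 3 kernels
B = np.zeros(4)
A = conv_layer(X, K, B, lambda z: np.maximum(z, 0.0))  # ReLU as f
print(A.shape)          # (4, 6, 6)
print(K.size + B.size)  # 112 shared parameters
```

In this toy setting the layer has 112 parameters, whereas a fully-connected mapping between the same input (3 × 8 × 8 = 192 units) and output (4 × 6 × 6 = 144 units) would need 192 × 144 = 27,648 weights, which illustrates the saving from weight sharing and local connectivity.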

(c) Subsampling layer (pooling layer): Convolutional layers are usually followed by subsampling layers to reduce the feature map resolution. The number of parameters and the amount of computation are reduced accordingly. More formally,

$$\mathbf{Z}_i = \operatorname{down}(\mathbf{X}_i),\tag{3}$$

where  $down(\cdot)$  represents a subsampling method.

Maximum and average operations are two typical subsampling methods that have been implemented in CNNs. Besides max pooling and average pooling, methods that better mitigate overfitting in CNNs have been proposed, such as Lp pooling [38], stochastic pooling [39], and mixed pooling [40]. He et al. propose a pooling method called spatial pyramid pooling (SPP) that outputs a fixed-length feature map and can therefore deal with various input image sizes [41]. Spectral pooling reduces dimensionality in the frequency domain, which preserves more information than pooling in the spatial domain, and can be implemented in Fast Fourier Transform (FFT) based CNNs [42]. Multi-scale orderless pooling, proposed by Gong et al., outperforms other methods in highly variable scene matching [43].

Unlike convolution kernels, subsampling kernels are often hand-picked and remain unchanged during training and inference. There are two main reasons for subsampling. One is that taking the maximum or average over the previous feature map reduces the feature map size. The other is that subsampling makes the output feature map more robust to distortions and errors of individual neuron units [44].
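As an illustration of Eq. (3), the sketch below implements max pooling and average pooling over non-overlapping windows. It assumes that the feature map height and width are divisible by the window size; the function name `pool2d` is chosen for this example.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling of a (H, W) feature map; H and W must be divisible by size."""
    h, w = x.shape
    # Split the map into (H/size) x (W/size) blocks of shape size x size
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode}")

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, 2, "max"))  # values: [[5, 7], [13, 15]]
print(pool2d(x, 2, "avg"))  # values: [[2.5, 4.5], [10.5, 12.5]]
```

Each 2 × 2 window is reduced to a single value, so the 4 × 4 map shrinks to 2 × 2 and no parameters are learned.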

(d) Fully-connected layer: After several layers, high-level features have been extracted and need to be mapped to labels. In a fully-connected layer, neuron units are flattened from 2D into 1D. Each unit in the current layer is connected to all the units in the previous layer, as in regular neural networks. This layer not only extracts features in a more complex way to dig out more information, but also connects patterns in different locations.



Fig. 2. Activation function plot.

(e) Output layer: Since a CNN is a feed-forward neural network, the output layer neuron units are fixed. They are usually linked to the previous neurons in a fully-connected way and act as the final threshold for prediction.

In general, the meaning behind the combination of these different layers has attracted a lot of research interest. The advantages brought by the structure of CNNs include a reduced number of parameters and translation invariance.

## 3. Acceleration method taxonomy

Our taxonomy is shown in Fig. 3. The philosophy behind the taxonomy follows the order from designing a CNN, to training it, and finally to implementing it on hardware. In the CNN structure, there is redundancy in both the weights and the number of bits used for representation. For the redundancy in weights, layer decomposition, network pruning, block-circulant projection and knowledge distillation methods can be applied. For the redundancy in representation, fixed-point representation is the mainstream. Once the structure is decided, the CNN is trained with algorithms that are generally used in other neural networks. The most popular training method is the gradient descent based back-propagation algorithm. By propagating errors back from output to input and adjusting the weights in the network, errors can be reduced to an acceptable degree. The criterion for algorithm optimization is convergence speed with proper stability. Considering that convolutional layers are computationally intensive, we are also interested in the complexity of the convolution operation. Therefore, we also summarize some efficient convolution methods that are adopted in CNNs. At the implementation level, the mainstream GPU, FPGA, and ASIC platforms are discussed. Recently, as neuromorphic engineering develops, people see a promising future for fast implementation of CNNs, so some new devices are also presented in this paper. The acceleration approaches of each level are orthogonal and can be combined with those in other levels. By researching such a wide range of methods, we expect to provide a general idea of CNN acceleration, especially for deep CNNs, from the perspectives of structure, algorithm, and hardware.

## 4. Structure level

Training and inference can often be accelerated by reducing redundancy in network structures. There is redundancy both in the weights and in the way weights are represented. Acceleration methods from these two perspectives, redundancy in weights and redundancy in representations, are summarized as follows.

### 4.1. Redundancy in weights

There is significant redundancy in the parameterization of some neural networks. As Denil et al. and Sainath et al. observe that some weights learned in networks are correlated with each other, they demonstrate that some of the weights can either be predicted or be unnecessary to learn [45,46].

#### 4.1.1. Layer decomposition

Low-rank approximation can be adopted to reduce redundancy in weights [47]. For one layer, the input-output relationship can be described by

$$\mathbf{y} = g(\mathbf{x} \cdot \mathbf{W}), \tag{4}$$

where **W** is the weight matrix with size $m \times n$. **W** can be replaced by the product of two full-rank matrices **U** · **V** with sizes $m \times r$ and $r \times n$, respectively. The number of parameters in **W** is reduced to less than 1/d of the original if the inequality $d(mr + rn) < mn$ holds. An efficient low-rank approximation of kernels can be applied in the first few convolutional layers of a CNN to exploit the linear structure of the over-parameterization within a filter. For example, Denton et al. reduce the computation by exploiting redundancy within kernels and achieve $2 \sim 2.5 \times$ speedup with less than 1% drop in classification performance for a single convolutional layer. They use singular value decomposition to approximate the kernels, under the assumption that the singular values of the kernels decay rapidly so that the size of the kernels can be reduced significantly [48].

Fig. 3. Taxonomy of CNN acceleration methods.
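The low-rank factorization of a single weight matrix can be sketched with a truncated singular value decomposition. The example below is only an illustration under stated assumptions: the weight matrix is synthetic (rank 32 plus small noise, so that its singular values decay rapidly), and the sizes `m`, `n`, `r` and the function name `low_rank_factors` are hypothetical choices, not the setup of [48].

```python
import numpy as np

def low_rank_factors(W, r):
    """Return U (m x r) and V (r x n) whose product is the best rank-r approximation of W."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    U = u[:, :r] * s[:r]   # absorb the r largest singular values into U
    V = vt[:r, :]
    return U, V

rng = np.random.default_rng(0)
m, n, r = 256, 512, 32
# Synthetic weights with rapidly decaying singular values (assumption for this sketch)
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) \
    + 0.01 * rng.standard_normal((m, n))
U, V = low_rank_factors(W, r)

x = rng.standard_normal((1, m))
y_full = x @ W           # m * n multiplications
y_low = (x @ U) @ V      # m * r + r * n multiplications
rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)

params_full, params_low = m * n, m * r + r * n
print(params_full, params_low)  # 131072 24576
print(rel_err)                  # small when the singular values decay rapidly
```

Here $mr + rn = 24576$ against $mn = 131072$, so the inequality $d(mr + rn) < mn$ holds for any $d$ up to 5. For real layers the achievable rank depends on how quickly the singular values actually decay.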

Instead of treating kernel filters as different matrices, kernels in one layer can be treated as a 3D tensor with two spatial dimensions and a third dimension representing channels. Lebedev et al. use CP-decomposition for convolutional layers, which achieves $8.5 \times$ CPU speedup at the cost of a 1% error increase [49]. Tai et al. utilize tensor decomposition to remove the redundancy in the convolution kernels, which achieves a twofold improvement in inference efficiency for VGG-16 [50]. Wang et al. propose to use group sparse tensor decomposition for each convolutional layer, which achieves $6.6 \times$ speedup on PC and $5.91 \times$ speedup on a mobile device with less than 1% error rate increase [51]. Tucker decomposition has also been used recently to decompose pre-trained weights, with fine-tuning afterwards [52,53].

Weight matrix decomposition can be applied not only to convolutional layers but also to fully-connected layers. Applying the low-rank approximation to the fully-connected layer weights can achieve a $30 \sim 50\%$ reduction in the number of parameters with little loss in accuracy, which corresponds to a roughly equivalent reduction in training time [46]. Instead of using two full-rank matrices

$$\mathbf{W} = \mathbf{U} \cdot \mathbf{V},\tag{5}$$

some works have proposed different decomposition forms

$$\mathbf{W} = \mathbf{D}_1 \cdot \mathbf{H} \cdot \mathbf{P} \cdot \mathbf{D}_2 \cdot \mathbf{H} \cdot \mathbf{D}_3, \tag{6}$$

with diagonal matrices **D**<sub>1, 2, 3</sub> and Hadamard matrix **H** [54], and

$$\mathbf{W} = \mathbf{A} \cdot \mathbf{C} \cdot \mathbf{D} \cdot \mathbf{C}^{-1},\tag{7}$$

with diagonal matrices **A**, **D** and DCT matrix **C** [55]. This method can be used while training the CNN, which is valuable because the CNN efficiency can be further improved if the training complexity is reduced as well. Ioannou et al. propose to learn from scratch a basis of small filters that can describe more complex filters. By carefully choosing the initialization, the new method can be used during training [56]. Wen et al. force more weight information into the filters to get more efficient CNNs. With this approach, the training process converges faster during the fine-tuning phase. In the experiments, it obtains $2 \times$ speedup on GPU without accuracy loss [57].

The decomposition technique is layer oriented and can be interleaved with other modules in a CNN, such as ReLU modules. It can also be applied to the structure of neural networks. Rigamonti et al. apply this technique to general frameworks and reduce the computational complexity by using linear combinations of fewer separable filters [58]. This method can be extended to multiple layers (e.g. > 10) by utilizing low-rank approximation for both weights and input [59]. It achieves $4 \times$ speedup with a 0.3% error increase for the deep network model VGG-16 by focusing on reducing the accumulated error across layers using generalized singular value decomposition.

The methods above can be generalized as layer decomposition for filter weight matrix dimension reduction, while pruning is another method for dimension reduction.

#### 4.1.2. Network pruning

Network pruning originated as a method to reduce the size and over-fitting of a neural network. As neural network implementation on hardware becomes more popular, it is necessary to mitigate limitations such as intensive computation and large memory bandwidth requirements. Nowadays, pruning is usually adopted as a method to reduce the network size and to increase the network inference speed so that the network can be deployed on specific hardware such as embedded systems.

There are many pruning methods, targeting weights, connections, filters, channels, feature maps, and so on. Unlike layer decomposition, in which computational complexity is reduced by reducing the total size of layers, pruning removes selected neurons. For weight pruning, unimportant connections whose weights have magnitudes smaller than a given threshold are dropped. Experiments conducted on NVIDIA TitanX and GTX980 GPUs achieve $9 \times$ and $13 \times$ parameter reduction for the AlexNet and VGG-16 models, respectively, with no loss of accuracy [60]. Zhou et al. incorporate sparse constraints to decimate the number of neurons during training, which reduces the number of neurons by 70% without sacrificing accuracy [62]. Besides eliminating the least influential neurons, another approach is to merge the selected neurons with the remaining ones to maintain diversity in the information. Mariet et al. succeed in merging the qualified neurons with unqualified ones and reduce the network complexity [63]. By trading off the training error against the number of remaining hidden neurons, they achieve a 25% reduction of the original number of parameters with a 0.04 accuracy reduction on the MNIST dataset. Channel pruning eliminates channels with low activity, which means filters are applied to fewer channels in each layer. Polyak et al. propose a channel-pruning based method, Inbound Prune, to compress a redundant network. Their experiment is conducted on a Samsung Galaxy S6 and achieves $1.59 \times$ speedup [64]. Recently, pruning has been combined with other acceleration techniques to achieve speedup. For example, Han et al. combine pruning with trained quantization and Huffman coding to deeply compress neural networks in three steps. This achieves $3 \times$ layer-wise speedup on the fully-connected layer over the benchmark on CPU [60].
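Threshold-based weight pruning as described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited work: the threshold is set to the 90th percentile of the weight magnitudes as an assumed example, the weights are random, and any fine-tuning after pruning is not shown.

```python
import numpy as np

def prune_by_magnitude(W, threshold):
    """Zero out weights whose magnitude is below the threshold; also return the keep-mask."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
# Hypothetical choice: keep roughly the largest 10% of weights by magnitude
threshold = np.quantile(np.abs(W), 0.9)
W_pruned, mask = prune_by_magnitude(W, threshold)
print("fraction of weights kept:", mask.mean())
```

The resulting zeros are scattered irregularly across the matrix, which is the unstructured sparsity that motivates the structured pruning methods discussed next.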

Some of these pruning methods result in structured sparsity [61], while others, such as weight-based pruning, cause unstructured sparsity. Many techniques have been proposed to deal with the problem that unstructured sparsity is unfriendly to hardware. Wen et al. propose a method called Structured Sparsity Learning (SSL) for regularizing compressed structures of deep CNNs and speeding up convolutional computation by group Lasso regularization and locality optimization, respectively. It improves convolutional layer computation speed by $5.1 \times$ and $3.1 \times$ on CPU and GPU, respectively [65]. He et al. propose a channel pruning method that iteratively reduces redundant channels by solving LASSO and reconstructs the outputs with linear least squares. It achieves $5 \times$ speedup in VGG-16 and $2 \times$ speedup in ResNet/Xception [66]. Liu et al. also apply channel-based pruning. They use L1 regularization and achieve $20 \times$ reduction in model size and $5 \times$ reduction in computing operations for the VGG model [67]. Li et al. prune whole filters as well as their related feature maps and reduce the inference cost of VGG-16 by 34% and of ResNet-110 by 38% [68]. Their method uses the sum of a filter's absolute values as a measure of filter importance, which is filter-based and avoids sparse connectivity. Based on the Taylor expansion of the cost function between the pruned and non-pruned situations, Molchanov et al. remove feature maps from convolutional layers and implement the iterative pruning method in a transfer learning setting [69]. ASIC based methods dealing with irregular sparsity have been proposed as well and will be discussed in Section 6.3.

#### 4.1.3. Block-circulant projection

A square matrix could be represented by a one-block-circulant matrix, while a non-square matrix could be represented by a block-circulant matrix. The block-circulant matrix is one of the structured matrices commonly used in paradigms such as dimension reduction [70], since it can represent an unstructured matrix with a vector. A one-block-circulant matrix is defined as

$$\mathbf{R} = \operatorname{circ}(\mathbf{r}) := \begin{bmatrix} r_0 & r_{d-1} & \cdots & r_2 & r_1 \\ r_1 & r_0 & r_{d-1} & \cdots & r_2 \\ \vdots & & & \ddots & \vdots \\ r_{d-1} & r_{d-2} & \cdots & r_1 & r_0 \end{bmatrix},\tag{8}$$

which can be represented by a vector $\mathbf{r} = (r_0, r_1, \dots, r_{d-1})$. Block-circulant based CNNs have recently been explored because of their small storage requirements.
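The storage saving of Eq. (8) is visible in code: only the vector $\mathbf{r}$ needs to be kept. The sketch below also uses a standard property of circulant matrices that is not stated in the text above, namely that the product $\operatorname{circ}(\mathbf{r})\,\mathbf{x}$ is a circular convolution and can therefore be computed with FFTs. The function names are chosen for this illustration.

```python
import numpy as np

def circ(r):
    """Build the d x d circulant matrix of Eq. (8): entry (i, j) is r[(i - j) mod d]."""
    d = len(r)
    idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
    return r[idx]

def circ_matvec_fft(r, x):
    """Compute circ(r) @ x as a circular convolution, in O(d log d) instead of O(d^2)."""
    return np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))

rng = np.random.default_rng(0)
d = 8
r = rng.standard_normal(d)   # d stored values instead of d * d
x = rng.standard_normal(d)

dense = circ(r) @ x
fast = circ_matvec_fft(r, x)
print(np.allclose(dense, fast))  # True
```

The dense matrix is built here only to check the result; in practice the layer would store `r` alone.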

Cheng et al. apply the circulant matrix in the fully-connected layer and achieve a significant gain in efficiency with little decrease in accuracy [71]. Yang et al. focus on reducing the computational time spent in the fully-connected layer by imposing the circulant structure on the weight matrix for dimension reduction with little loss in performance [72]. Ding et al. propose to use the block-circulant structure in both fully-connected layers and convolutional layers in non-square-matrix situations to further reduce storage waste. They also mathematically prove that using fewer weights in circulant form does not harm the capability of a deep CNN compared with one without weight redundancy reduction [73].

Fig. 4. Illustration of knowledge distillation.

#### 4.1.4. Knowledge distillation

Knowledge distillation is the concept that information obtained from a large, complex ensemble of neural networks can be utilized to form a compact neural network [74]. The way that knowledge is transferred is depicted in Fig. 4. Information flows from a complex network to a simpler one by training the latter with data labeled by the former. Using synthetic data generated from a complex network to train a compact model is less likely to cause overfitting and can approximate the functions very well. More importantly, it provides a new perspective for model compression and for the acceleration of complicated neural networks.

Synthetic data is very important for successful model compression. If it matches well the true distribution of the functions of a complex model, it usually takes less training data to mimic the model with high fidelity. Furthermore, the compact model has good generalization characteristics in some tasks as it reduces overfitting. Buciluǎ et al. lay a foundation for mimicking a large machine learning model by experimenting with three ways to generate pseudo data: random, naive Bayes estimation, and MUNGE [75]. Some studies propose a teacher-student format, which also adopts knowledge distillation concepts with different methods for synthesizing data [76]. For example, Hinton et al. compress a deep teacher network into a student network using data that combines the teacher network outcome with the true labeled data [77]. The student network can achieve very high accuracy on the MNIST dataset with less inference run time. Romero et al. mimic a wider and shallower teacher neural network with a thinner and deeper network, called a student network, by learning an intermediate representation that is predicted by the teacher network [78]. The depth of the student network ensures its performance, while its thinness reduces the computational complexity.
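One common way to combine the teacher outcome with the true labels is a weighted sum of two cross-entropy terms, one against the teacher's softened output distribution and one against the ground-truth labels. The sketch below is a hedged illustration of that idea only: the temperature `T`, the weight `alpha`, their default values, and the function names are assumptions made for this example and are not taken from the text above.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * CE(teacher soft targets, student) + (1 - alpha) * CE(true labels, student)."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T))
    soft_loss = -(p_teacher * log_p_student_T).sum(axis=-1).mean()

    log_p_student = np.log(softmax(student_logits))
    hard_loss = -log_p_student[np.arange(len(labels)), labels].mean()
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

teacher = np.array([[5.0, 1.0, 0.0], [0.0, 4.0, 1.0]])
labels = np.array([0, 1])
good_student = teacher.copy()   # a student that reproduces the teacher's logits
bad_student = -teacher          # a student that ranks the classes in reverse
print(distillation_loss(good_student, teacher, labels))
print(distillation_loss(bad_student, teacher, labels))
```

A student that matches the teacher obtains the lower loss of the two, so minimizing this quantity pushes the compact model toward the behaviour of the complex one.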

## 4.2. Redundancy in representations

Many weights in neural networks have very small values. For example, the first non-zero digit of many weights occurs in the eighth decimal place, which requires a high-precision format to record them. Most arithmetic operations in neural networks use 32-bit floating-point representation in order to achieve good accuracy. As a trade-off, this increases the computation workload and memory size of the neural networks. However, arithmetic operations in fixed-point instead of floating-point can achieve sufficiently good performance for neural networks [79]. A 16-bit fixed-point representation with stochastic rounding is proposed for training on the CIFAR-10 dataset [80]. A further compression to 10-bit dynamic fixed-point is also explored [81]. Han et al. quantize pruned CNNs to 8-bit and achieve further storage reduction with no loss of accuracy [60].
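To make the idea of fixed-point representation with stochastic rounding concrete, the following is a minimal sketch. It is not the implementation of [80]; the word length, the number of fractional bits, and the function name `to_fixed_point` are assumptions chosen for illustration.

```python
import math
import random

def to_fixed_point(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to a signed fixed-point value using stochastic rounding."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder,
    # so that the rounding is unbiased in expectation.
    if rng.random() < (scaled - lower):
        lower += 1
    # Saturate to the representable signed range.
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    q = max(min_int, min(max_int, lower))
    return q / scale

rng = random.Random(0)
weight = 0.123456
samples = [to_fixed_point(weight, rng=rng) for _ in range(20000)]
mean_quantized = sum(samples) / len(samples)  # close to the original weight
```

Each individual sample is one of the two neighbouring grid points, while the average over many roundings stays close to the original value. This is why small gradient contributions are not systematically lost, as they would be with round-to-nearest.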

For now, representation in one bit is the simplest form. In terms of binarization, there are three forms: binary input, binary weights of the network, and binary operations. Courbariaux et al. propose the BinaryConnect method, which uses 1-bit fixed-point weights to train a neural network [82]. Rastegari et al. propose XNOR-Net, with binary weights and binary input fed to convolutional layers [83]. It results in a $58 \times$ speedup of convolutional operations. Kim et al. propose a Bitwise Neural Network, which treats everything as binary, including weights, bias terms, input, and output, and uses basic logic operations instead of floating-point or fixed-point arithmetic operations [84]. Zhou et al. propose to train CNNs using binary and stochastically quantized low bit-width gradients and achieve performance comparable to that of 32-bit counterparts [85]. Hubara et al. propose training methods for quantized neural networks that use low precision, including 1-bit weights and activations, and replace most arithmetic operations with bit-wise operations [86]. Kim et al. compress binary-weight CNNs by decomposing kernels into sub-kernels with common parts. They reduce the number of operations per image by 47.7% [87]. Ternary CNNs have recently been proposed as a more expressive alternative to binary CNNs. They seek a balance between binary networks and full-precision networks in terms of compression rate and accuracy [88-90].
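The reason binary weights and inputs allow arithmetic to be replaced with bit-wise operations can be sketched as follows. This is a simplified illustration rather than the kernel of any cited method; the helper names `binarize`, `pack_bits`, and `xnor_dot` are hypothetical, and scaling factors such as those used in XNOR-Net are omitted.

```python
def binarize(values):
    """Map real values to {-1, +1} by sign (zero is treated as +1)."""
    return [1 if v >= 0 else -1 for v in values]

def pack_bits(signs):
    """Pack a {-1, +1} vector into an integer: bit 1 for +1, bit 0 for -1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def xnor_dot(word_a, word_b, n):
    """Dot product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1
    # XNOR marks the positions where the signs agree; popcount counts them.
    agreements = bin(~(word_a ^ word_b) & mask).count("1")
    return 2 * agreements - n

weights = [0.7, -1.2, 0.1, -0.4, 2.0, -0.3, 0.9, -0.8]
inputs = [1.5, 0.2, -0.6, -0.9, 0.3, 0.8, -0.1, -2.0]
bw, bx = binarize(weights), binarize(inputs)

reference = sum(w * x for w, x in zip(bw, bx))          # multiply-accumulate
fast = xnor_dot(pack_bits(bw), pack_bits(bx), len(bw))  # XNOR + popcount
```

Both paths give the same result, but the second one needs only one XNOR and one population count per machine word instead of one multiplication and one addition per element.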

Stochastic computing (SC) is a technique that simplifies numerical computations into bit-wise operations by representing continuous values with random bit streams. It provides many benefits for neural networks, such as a low computation footprint, error tolerance, simple implementation in circuits, and a better trade-off between time and accuracy [91]. Many works contribute to exploring potential space in optimization and in deep belief networks [92–94]. Recently, it has started to gain attention in the CNN field and is regarded as a promising technique for deep CNN implementation on ASIC (Section 6.3) and on embedded portable devices, as it can significantly reduce resource consumption with high accuracy.
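A minimal sketch of how SC turns multiplication into a bit-wise operation is given below. It assumes the unipolar encoding, in which a value in [0, 1] is the probability of a 1 in the stream, and independent streams; the function names and the stream length are illustrative and not taken from the cited works.

```python
import random

def encode(p, length, rng):
    """Unipolar encoding: each bit is 1 with probability p, for p in [0, 1]."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def decode(stream):
    """Estimate the encoded value as the fraction of 1s in the stream."""
    return sum(stream) / len(stream)

def sc_multiply(stream_a, stream_b):
    """Multiply two independent unipolar streams with a bit-wise AND."""
    return [a & b for a, b in zip(stream_a, stream_b)]

rng = random.Random(42)
a, b = 0.6, 0.5
product_stream = sc_multiply(encode(a, 4096, rng), encode(b, 4096, rng))
estimate = decode(product_stream)  # approximates a * b = 0.3
```

A single AND gate per bit replaces a full multiplier, at the cost of a result that is only approximate and whose accuracy improves with longer streams. This is the time-accuracy trade-off mentioned above.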

SC is first adopted in deep CNNs by Ren et al. with a proposed method called SC-DCNN. They design both function blocks and feature extraction blocks that help SC to be implemented efficiently in deep CNNs. With optimized configurations, it achieves the lowest resource consumption for LeNet5 among many state-of-the-art software and hardware platforms [95]. Li et al. further improve SC-based DCNNs by introducing normalization in the hardware implementation and dropout in DCNN software training. They design the stochastic normalization circuit by decoupling complex normalization into three units, namely, square and summation, activation, and division. Their proposed method improves the SC-based DCNN by 3.26% in top-1 accuracy and 3.05% in top-5 accuracy [96].

Although errors may accumulate due to representation approximation, hardware implementations of these reduced representations can achieve much faster speed and lower energy consumption.

## 5. Algorithm level

In the training process, gradient-based methods are widely used in multi-layer feedforward neural networks (FNN), while some other models use analytically determined methods to minimize the cost function [97]. In the forward pass, the output of the CNN is calculated, while in the backward pass, weights and biases are adjusted. By reducing the number of iterations needed to converge, training time can be decreased. Therefore, optimizing the gradient descent algorithm is very important for improving training performance. For CNNs, convolution reduces the number of weights dramatically because it focuses on a local receptive field. But repeated additions and multiplications increase the computation intensity. Therefore, in the forward process, convolution workloads are computationally intensive and become a constraint for implementation. In the following, we discuss algorithm optimization in the two directions of data flow, which are gradient-based backward training methods and convolution-based forward inference methods. We summarize distributed gradient descent methods, hybrid variants of gradient descent, and improvements in terms of self-adaptive learning rates, momentum factors, and partial gradients. We also give an overview of im2col-based algorithms, Winograd-based algorithms, and FFT, all of which address the convolution cost problem in CNNs.

#### 5.1. Gradient descent optimization

Gradient descent is one of the most popular algorithms for optimization. It has been widely used to search for minima of error functions when training neural networks, because it is simple to implement. The core of the gradient descent algorithm is the update rule $\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)$, where $\eta$ is the learning rate and the parameters $\theta$ are updated in the opposite direction of the gradient of the error function $\nabla_{\theta} J(\theta)$.
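The update rule can be illustrated on a toy problem. The sketch below assumes a hypothetical convex error function $J(\theta) = (\theta_0 - 3)^2 + 2(\theta_1 + 1)^2$, chosen only because its minimum is known; the function names and the learning rate are illustrative.

```python
def gradient_descent(grad, theta0, learning_rate=0.1, steps=100):
    """Repeatedly apply theta <- theta - eta * grad(theta)."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta

# Gradient of the assumed error function
# J(theta) = (theta_0 - 3)^2 + 2 * (theta_1 + 1)^2.
def grad_J(theta):
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

theta_star = gradient_descent(grad_J, [0.0, 0.0])  # approaches [3, -1]
```

On this convex surface the iterates contract towards the unique minimum at every step. For the non-convex error functions of neural networks, the same rule only guarantees progress towards a stationary point.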

Distributed gradient descent methods have been proposed to alleviate hardware workload. Take Google's CNN training [77] as an example. As illustrated in Fig. 5, there are two types of parallelism for the distribution. (a) Replicas of the CNN are trained on different batches of data, and a server averages their gradients. Parameters are updated based on the averaged gradients, so the new parameters reflect features from the whole data. (b) Each replica of the CNN distributes its computation across different cores, each handling a different subset of neurons. The implementation will be introduced in Section 6.1.
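Parallelism of type (a) can be sketched in a few lines. The example below is a sequential simulation under simplifying assumptions: a one-parameter model $y \approx w x$, three hypothetical replicas with equally sized batches, and synchronous averaging on a parameter server. It does not reproduce the system in [77].

```python
def local_gradient(w, batch):
    """Gradient of the mean squared error of y ~ w * x on one replica's batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def parameter_server_step(w, batches, learning_rate):
    """One synchronous step: replicas compute gradients, the server averages."""
    grads = [local_gradient(w, batch) for batch in batches]  # parallel in practice
    avg_grad = sum(grads) / len(grads)
    return w - learning_rate * avg_grad

# Toy data for y = 2x, split evenly across three hypothetical replicas.
data = [(float(x), 2.0 * x) for x in range(1, 10)]
batches = [data[0:3], data[3:6], data[6:9]]

w = 0.0
for _ in range(200):
    w = parameter_server_step(w, batches, learning_rate=0.01)
```

With equally sized batches, the averaged gradient equals the gradient over the whole dataset, which is why the updated parameters reflect all the data even though each replica sees only its own share.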

The back-propagation algorithm is a form of gradient descent implemented in neural networks. Some hybrid variants of back-propagation have been proposed in order to take advantage of the benefits of other algorithms. For example, combining it with the cuckoo search algorithm can increase the searching speed for optimal solutions [98]. The combination with the ant colony algorithm can decrease the computational cost and increase the stability of convergence [99]. Pan et al. introduce three stages in back-propagation, combining genetic algorithms and steepest descent methods to achieve fast and stable convergence [100]. Ding et al. use genetic algorithms to optimize the weights of a back-propagation neural network by encoding and thresholding the connection weights [101].

For deep CNNs, as errors accumulate layer by layer, the gradient either decays rapidly to zero or grows out of bound. In the 1990s, researchers focused on changing error functions and learning rates and on incorporating momentum [102] to reduce the vanishing-derivative effect and to improve the speed of convergence. In the past seven years, combining various factors with momentum, introducing self-adaptive learning rates, and using partial gradients have become the mainstream ways to improve gradient algorithms.



Fig. 5. Illustration of CNN distributed system.

For example, adapting the learning rate to each parameter with an exponentially decaying average of squared gradients leads to a varying learning rate, which depends on the current and past gradients of each parameter instead of being a constant [103–105]. Hamid et al. incorporate the momentum factor and give control over it, which accelerates convergence, especially for oscillating situations in ravines [106]. Nesterov et al. have proposed to use partial gradients to update each parameter rather than the whole gradient [107,108]. They randomly collect feature dimensions by sampling a block of coordinates and taking partial derivatives over this block, which can dramatically reduce the complexity of gradient computation. As a result, it is much faster than the regular stochastic gradient descent method, especially for high-dimensional datasets.
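A self-adaptive learning rate based on an exponentially decaying average of squared gradients can be sketched as follows. The update is written in the commonly used RMSprop-like form, which is an assumption and not necessarily the exact rule of [103–105]; the ravine-shaped error function and all hyper-parameter values are illustrative.

```python
import math

def rmsprop(grad, theta0, learning_rate=0.01, decay=0.9, eps=1e-8, steps=2000):
    """Scale each parameter's step by a decaying average of its squared gradients."""
    theta = list(theta0)
    avg_sq = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        for i, gi in enumerate(g):
            avg_sq[i] = decay * avg_sq[i] + (1.0 - decay) * gi * gi
            theta[i] -= learning_rate * gi / (math.sqrt(avg_sq[i]) + eps)
    return theta

# Assumed ill-conditioned error function J = 50 * t0^2 + 0.5 * t1^2 (a ravine):
# the curvature along t0 is 100 times the curvature along t1.
def grad_ravine(theta):
    return [100.0 * theta[0], 1.0 * theta[1]]

theta_final = rmsprop(grad_ravine, [1.0, 1.0])
```

Because every step is normalized by the recent gradient magnitude of its own parameter, both coordinates move at a similar pace despite the large difference in curvature. With a constant `learning_rate`, the iterates settle in a small neighbourhood of the minimum rather than converging exactly.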

For deep neural networks, gradient descent with back-propagation is not guaranteed to find the global minimum of the error function and is subject to vanishing or exploding gradients. The former issue is due to the non-convexity of error functions in neural networks. Some works focus on non-gradient-based methods, such as ant bee colony algorithms and genetic algorithms. They are usually applied to simple datasets, such as Boolean datasets, and to simple neural network structures with one or two hidden layers. In practice, the local minimum problem can be alleviated by a deep architecture [20]. The second issue, that gradients vanish or explode as the number of layers grows, is still an open problem with much potential to explore.

## 5.2. Feed-forward efficient convolution

Three methods are summarized for feed-forward efficient convolution: the im2col-based algorithm, the Winograd-based method, and the FFT-based method, with the most commonly used one introduced first. In direct convolution, the convolution kernels slide over the two dimensions of the input, and the output is obtained by the dot product between the kernels and the input. In im2col-based algorithms, by contrast, the input matrix is linearized into multiple lowered vectors, which can later be computed efficiently [109–111]. Cho et al. further reduce the memory overhead of linearization and improve the computational efficiency by modifying both the lowered vectors and the vectorized kernels [112]. Winograd-based methods incorporate Winograd's minimal filtering algorithms to compute convolutions over small filters with a minimal number of multiplications. The Coppersmith-Winograd algorithm is known as a fast matrix multiplication algorithm. Winograd-based convolution reduces the number of multiplications by increasing the number of additions, and it reduces memory consumption on GPUs [113,114]. Winograd's minimal filtering algorithms can help reduce convolution computation at the expense of memory bandwidth. Xiao et al. combine Winograd's minimal filtering theory with heterogeneous algorithms in a fusion architecture to mitigate the memory bandwidth problem [115].
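The two ideas above can be sketched on small examples. The first part lowers a single-channel image into flattened patches so that convolution becomes a matrix-vector product. The second part applies the one-dimensional Winograd F(2,3) minimal filtering algorithm, which produces two outputs of a 3-tap filter with four multiplications instead of six. Both are simplified illustrations with hypothetical function names (single channel, stride 1, no padding), not the implementations of [109–115].

```python
def im2col(image, k):
    """Lower a 2D image into rows, one flattened k x k patch per output position."""
    h, w = len(image), len(image[0])
    return [[image[i + di][j + dj] for di in range(k) for dj in range(k)]
            for i in range(h - k + 1) for j in range(w - k + 1)]

def conv2d_im2col(image, kernel):
    """Valid cross-correlation (stride 1) as a matrix-vector product."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(image[0]) - k + 1
    flat_out = [sum(p * q for p, q in zip(patch, flat_kernel))
                for patch in im2col(image, k)]
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]

def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter using 4 multiplications."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2.0
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2.0
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

image = [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 2], [1, 0, 2, 3]]
kernel = [[1, 0], [0, -1]]
feature_map = conv2d_im2col(image, kernel)

taps = [1.0, 2.0, 3.0]
signal = [4.0, 5.0, 6.0, 7.0]
fast_outputs = winograd_f23(signal, taps)
```

In `im2col`, overlapping patches are stored repeatedly, which is the memory overhead of linearization. In `winograd_f23`, the filter-dependent factors such as `(g[0] + g[1] + g[2]) / 2.0` can be precomputed once per filter, so the savings in multiplications are paid for with extra additions on the input side.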

Based on the experiment showing that FFT can be applied in MLP to accelerate first-layer inference [116], Mathieu et al. first apply FFT to the weights of a CNN and achieve good performance and speedup for large perceptive areas [117]. Using FFT in a CNN requires transforming back and forth between the spatial domain and the frequency domain, which consumes a lot of resources and takes time. Therefore, a delicate balance is needed between the benefits of computing in the frequency domain and the cost of transforming back and forth. Large perceptive areas benefit more, which limits the method for neural networks with small convolution filters. One solution to this problem is to train weights directly in the frequency domain [118]. Ko et al. train CNNs entirely in the frequency domain with approximate frequency-domain nonlinear operations, sinc interpolation, and Hermitian symmetry. By eliminating Fourier transforms at each layer, they achieve a significant reduction in training time for CIFAR-10 recognition [119].
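The principle behind FFT-based convolution, and the transform overhead it introduces, can be sketched in one dimension. The example uses a textbook recursive radix-2 FFT and computes a true (flipped-kernel) linear convolution, whereas CNN layers typically compute cross-correlation; it is an illustrative sketch and not the method of [117].

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unnormalized)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft_convolve(signal, kernel):
    """Linear convolution: zero-pad, transform, multiply pointwise, invert."""
    out_len = len(signal) + len(kernel) - 1
    n = 1
    while n < out_len:
        n *= 2
    fa = fft(list(signal) + [0.0] * (n - len(signal)))
    fb = fft(list(kernel) + [0.0] * (n - len(kernel)))
    product = [a * b for a, b in zip(fa, fb)]
    result = fft(product, inverse=True)
    return [v.real / n for v in result[:out_len]]

def direct_convolve(signal, kernel):
    """Reference linear convolution with one multiplication per pair."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.5, -1.0, 0.25]
via_fft = fft_convolve(signal, kernel)
via_direct = direct_convolve(signal, kernel)
```

Note that the short kernel has to be zero-padded to the full transform length and that two forward transforms and one inverse transform are needed. For small filters this overhead outweighs the savings of the pointwise product, which is consistent with the observation above that large perceptive areas benefit more.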

## 6. Implementation level

Neural networks have recently regained their vigor thanks to high-performance hardware. CPUs used to be the mainstream platform for implementing machine learning algorithms about twenty years ago, because matrix multiplication and factorization techniques were not popular back then. Nowadays, GPUs, FPGAs, and ASICs are utilized to accelerate training and prediction. Besides, many new device technologies are proposed to meet the requirements of very large models and large training datasets. In the following, hardware-based accelerators are summarized in terms of GPU, FPGA, ASIC, and frontier new devices that are promising for accelerating deep convolutional neural networks.

## 6.1. GPU

In terms of GPU, clusters of GPUs can accelerate very large neural networks with over one billion parameters in parallel. Mainstream GPU-cluster neural networks usually work with distributed SGD algorithms as illustrated in Section 5.1. Many studies further exploit the parallelism and work on communication among different clusters. For example, Baidu Heterogeneous Computing Group uses two types of parallelism, called model-data parallelism and data parallelism, to extend CNN architectures to 36 servers, each with 4 NVIDIA Tesla K40m GPUs and 12 GB memory. The strategies include butterfly synchronization and lazy update, which make good use of the overlap between computation and communication [120]. Coates et al. propose a cluster of GPU servers using Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology and high-speed communication infrastructure for parallelism in the distributed gradient descent algorithm, which reduces the number of machines used for training by 98% [121]. In terms of non-distributed SGD algorithms, Imani et al. propose a nearest content addressable memory block called NNCAM, which stores highly frequent patterns for reuse. It accelerates CNNs over a general-purpose GPU with a 40% speedup [122].

## 6.2. FPGA

There are many parallelism levels in hardware acceleration, such as coarse-grain, medium-grain, fine-grain, and massive [123]. FPGAs excel in their fine-grain and coarse-grain reconfiguration ability, and their hierarchical storage structure and scheduling mechanism can be optimized flexibly. Flexible hierarchical memory systems can support the complex data access modes of CNNs. They are often used to improve the efficiency of on-chip memory and to reduce energy consumption.

Peemen et al. experiment on a Virtex 6 FPGA board and show that the accelerator design can achieve an $11 \times$ speedup with very complicated address mapping of data access [124]. Zhang et al. take data reuse, parallel processing, and off-chip memory bandwidth into consideration in their FPGA accelerator. The accelerator is $17.42 \times$ faster than a CPU on the AlexNet CNN architecture [125]. Martínez et al. take advantage of the FPGA reconfiguration characteristics by unfolding the loop execution on different cascading stages. As the number of multipliers for convolution increases, the proposed method can achieve 12 GOPS at most [126]. A hardware acceleration method for CNN is proposed by combining fine-grain operator-level parallelism and coarse-grain parallelism. Compared with 4xIntel Xeon 2.3 GHz, 1.35 GHz C870, and a 200 MHz FPGA, the proposed design achieves a $4 \times$ to $8 \times$ speed boost [127]. Wang et al. propose an on-chip memory design called Memsqueezer that can be implemented on FPGA. They shrink the memory size by compressing data, weights, and intermediate data from the hardware perspective, which achieves 80% energy reduction compared with conventional buffer designs [128]. Zhang et al. design an FPGA accelerator engine called Caffeine that reduces the underutilization of memory bandwidth. It reorganizes memory accesses according to their proposed matrix-multiplication representation, which is applied to both convolutional layers and fully-connected layers. Caffeine's implementation on Xilinx KU060 and Virtex-7 690t FPGAs achieves very high peak performance of 365 GOPS and 636 GOPS, respectively [129]. Rahman et al. present a 3D array architecture, which can benefit all layers in CNNs. With optimization of on-chip buffer sizes for FPGAs, it can outperform the state-of-the-art solutions by 22% in terms of MAC [130]. Alwani et al. explore the design space of dataflow across multiple convolutional layers, where a fused-layer accelerator is designed that reduces feature map data transfer to and from off-chip memory [131].

## 6.3. ASIC

For ASIC design, besides using structure-level methods such as block-circulant projection in Section 4.1.3 and SC in Section 4.2, memory can be expanded and locality can be increased at the implementation level to reduce data movement within systems for deep neural network acceleration. The Tensor Processing Unit (TPU) is designed for low-precision computation with high efficiency. It uses a large on-chip memory of 28 MiB to execute neural network applications and can be up to $30 \times$ faster than an Nvidia K80 GPU [10]. TETRIS is an architecture using 3D memory proposed by Gao et al. It saves more area for processing elements and leaves more space for accelerator design [132].

Luo et al. create a 64-chip system architecture that minimizes data movement between synapses and neurons by storing them close together. It reduces the burden on external memory bandwidth and achieves a speedup of $450 \times$ over a GPU with $150 \times$ energy reduction [133]. Wang et al. propose to group adjacent processing engines (PEs) into dual-channel PEs, called Chain-NN, to mitigate the huge amount of data movement. They simulate it under a TSMC 28 nm process and achieve a peak throughput of 806.4 GOPS on AlexNet [134]. Single instruction multiple data (SIMD) processors are used on a 32-bit CPU to design a system targeted for ASIC synthesis to perform real-time detection, recognition and segmentation of mega-pixel images. The operations in CNN are optimized with the available parallelism in hardware. The ASIC implementations outperform conventional CPU methods in terms of frames/s [135].

Recently, some ASIC designs target sparse networks with irregularity. For example, Zhang et al. propose an accelerator called Cambricon-X that can reach 544 GOP/s in 6.38 mm<sup>2</sup> [136]. It includes an Indexing Module, which can efficiently schedule processing elements that store irregular and compressed synapses. Kwon et al. design a reconfigurable accelerator called MAERI to adapt to various layer dataflow patterns. It efficiently utilizes compute resources and provides a $6.9 \times$ speedup at 50% sparsity [137]. Network pruning can induce sparsity and irregularity, as discussed in Section 4.1.2. Better performance is expected when pruning is combined with such designs.

#### 6.4. New devices

As new device technologies and circuits arise, deep convolutional neural networks can potentially be accelerated by orders of magnitude. In terms of new devices, very large scale integration systems are explored to mimic complex biological neuron architectures.

Some of them are still at the stage of theoretical demonstration for training deep neural networks. For example, Gokmen and Vlasov from the IBM research center propose a resistive processing unit (RPU) device, which can both store and compute parameters in the same unit. It promises extremely high processing speed, $30,000 \times$ higher than state-of-the-art microprocessors (84,000 GigaOps/s/W) [138]. As neuromorphic engineering develops, more new devices emerge to handle high-frequency and high-volume information transmission through synapses. Some are still theoretical and have not been implemented on neural networks for classification and recognition, such as nano-scale phase change devices [139] and ferroelectric memristors [140].

Resistive memories are treated as one of the promising solutions for deep neural network acceleration due to their nonvolatility, high storage density, and low power consumption [141]. Their architecture mimics neural networks, where weight storage and computation can be done simultaneously [142,143]. As CMOS memories become larger, their scaling becomes limited. Therefore, besides mainstream CMOS-based memory, nonvolatile memory is becoming more popular for storing weights, such as resistive random access memory (RRAM) [144–147] and spin-transfer torque random access memory (STT-RAM) [148].

Memristor crossbar array structures can handle computationally expensive matrix multiplication and have been explored in CNN hardware implementations. For example, Hu et al. develop a Dot-Product Engine (DPE) utilizing a memristor crossbar, which achieves a $1000 \times$ to $10,000 \times$ speed-efficiency product compared with a digital ASIC [149]. Xia et al. address the energy consumption problem between crossbars and ADC/DAC and can save more than 95% energy with similar CNN accuracy [150]. Ankit et al. propose a hierarchical reconfigurable architecture with memristive crossbar arrays called RESPARC, which is $15 \times$ more energy efficient and has $60 \times$ more throughput for deep CNNs [151].

#### Table 3

Layer decomposition and pruning methods analysis.

|                     | Method | Target layer         | Pre-training | Performance                                                         |
|---------------------|--------|----------------------|--------------|---------------------------------------------------------------------|
| Layer decomposition | [47]   | Convolutional layers | Required     | $2.5 \times$ speedup with no loss in accuracy                       |
|                     | [48]   | Convolutional layers | Required     | $2 \times$ speedup with $< 1\%$ accuracy drop                       |
|                     | [52]   | Whole network        | Required     | $1.09 \times$ reduction in weights & $4.93 \times$ speedup in VGG-16 |
|                     | [56]   | Convolutional layers | Not required | 76% reduction in weights in VGG-11                                  |
| Pruning             | [152]  | Whole network        | Required     | Prune 90% of the parameters of the convolutional kernels            |
|                     | [153]  | Whole network        | Required     | Prune 92.3% of the parameters in VGG-16                             |
|                     | [65]   | Whole network        | Not required | $5.1 \times$ (CPU) & $3.1 \times$ (GPU) speedup in convolutional layers |
|                     | [68]   | Whole network        | Required     | 34% inference FLOP reduction in VGG-16                              |

In general, for any CNN hardware implementation, there are a lot of potential solutions to be explored in design space. It is not trivial to design a general hardware architecture that can be applied to every CNN, especially when limitations on computation resource and memory bandwidth are considered.

## 7. Discussion

Studies differ in their choice of datasets, models, and implementation platforms. Many datasets and models are treated as benchmarks based on previous research. But different benchmarks and their combinations make it difficult to compare a method in one level with one in another level. In the following discussion, we constrain our comparison and analysis within each level.

#### 7.1. Structure level

We have summarized methods of layer decomposition and pruning in Table 3. Some of the layer decomposition and pruning methods focus on inference, because pre-trained CNNs are required before applying the corresponding methods. It is a limitation compared with other acceleration methods. For example, some large-scale networks still need weeks or months of training before a layer decomposition method can be applied [47,48]. Pruning by sparsifying weights and their connections requires pre-training on the original full model and fine-tuning [152,153].

Many layer decomposition and pruning methods are layer-wise and optimized for a specific layer when first proposed. For example, Sainath et al. demonstrate a significant reduction in parameters in the output softmax layer [46]. Mariet et al. successfully prune 25% of the parameters in the fully connected layer with good performance [63]. Denton et al. successfully remove a large number of parameters in the convolutional layer [48]. As these methods focus on different types of layers, there is exploration space in how to combine them for further acceleration.

After reducing redundancy in representation, the size of neural networks can shrink dramatically. However, specific hardware is required to achieve a speedup in training and testing, since most current GPUs are optimized for floating-point performance. For example, using the BinaryConnect method [154] to train a Torch7-based ConvNet on a GPU takes more time. The time complexity can theoretically be reduced by 60% with dedicated hardware.

We have summarized methods of reducing redundancy in representation in Tables 4 and 5. The Note column in these two tables provides important information that distinguishes the different methods. Low-bit representation methods are targeted at both large and small models. Different bit-widths result in different performance depending on the model and dataset. Many experiments are conducted on small datasets with low resolution (e.g. $32 \times 32$) such as CIFAR-10, and usually show an error rate increase of less than 5%. For image classification at large scale (e.g. ImageNet), it is difficult for low-bit representation methods to achieve the same performance as on small datasets.

### 7.2. Algorithm level

## 7.2.1. Information for updating

According to the derivative information of the cost function that is used for updating, there are two categories of gradient descent algorithms: first-order and second-order derivative methods. Compared with the first-order algorithm, the second-order one converges faster. But second-order information is more complex to use, which makes it prohibitive in practice for large deep neural networks. Therefore, more emphasis has been put on how to approximate the Hessian matrix, which consists of the second-order derivatives, for simplicity [155].

#### 7.2.2. Data for training

According to the amount of data used per update, there are three variants of the algorithm: batch, mini-batch, and online. The batch method uses the whole dataset to update the gradient in one iteration. The mini-batch method uses a small, randomly picked amount of data to update the gradient, while the online method uses each new incoming subset of data once to update the gradient and can stop at any time.

Batch gradient descent is guaranteed to converge to the global minimum for convex surfaces. But it can be very slow and requires very large memory storage. The mini-batch method can avoid redundant gradient computation by using shuffled examples. As a result, it usually reaches the minimum faster than batch gradient descent. With a carefully chosen learning rate, its fluctuation decreases. The online method can be used to design an algorithm that is lightweight in terms of memory and speed for handling a stream of data. Since the data is updated frequently, it can be used to predict the most recent state of a trend. But as data is discarded after each gradient update, the online method is considered to be more difficult and unreliable [156].
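The three variants differ only in how many samples contribute to each update, which the following sketch makes explicit. It assumes a one-parameter model $y \approx w x$ on noise-free toy data; the function names and hyper-parameters are illustrative. Setting `batch_size=1` only imitates the online setting, since a real data stream would not be revisited over several epochs.

```python
import random

def gradient(w, samples):
    """Mean-squared-error gradient of the model y ~ w * x over the samples."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)

def train(data, batch_size, epochs, learning_rate, rng):
    """batch_size == len(data): batch; smaller: mini-batch; 1: online-style."""
    w = 0.0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)  # shuffled examples, as in mini-batch training
        for start in range(0, len(order), batch_size):
            w -= learning_rate * gradient(w, order[start:start + batch_size])
    return w

rng = random.Random(0)
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 21)]  # noise-free y = 3x

w_batch = train(data, batch_size=len(data), epochs=200, learning_rate=0.1, rng=rng)
w_mini = train(data, batch_size=4, epochs=200, learning_rate=0.1, rng=rng)
w_online = train(data, batch_size=1, epochs=200, learning_rate=0.1, rng=rng)
```

For the same number of epochs, the batch variant performs one update per epoch, while the mini-batch and online-style variants perform five and twenty updates, respectively. On noisy data the smaller batches would additionally introduce the fluctuation discussed above.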

#### 7.2.3. Asynchronous updating

Asynchronous training algorithms help improve efficiency on large-scale clusters of machines for distributed training and inference. Proposed asynchronous stochastic gradient descent algorithms such as Downpour SGD and AASGD help improve neural network performance in classification problems with a large amount of high-dimensional data. But more attention is needed on communication among different workers in the clusters, since suboptimal communication can result in diverging parameters [157,158].

#### Table 4

Representation reduction methods (CIFAR-10). C: convolutional layer, S: subsampling layer, F: fully-connected layer.

| Bit-width | Method               | Model                                  | Error rate increase        | Note                                                                          |
|-----------|----------------------|----------------------------------------|----------------------------|-------------------------------------------------------------------------------|
| Binary    | BinaryConnect [82]   | Self-designed (e.g.<br>6C-3S-2F-L2SVM) | $\sim$ 2% error rate drop  | Binary weights during training and testing                                    |
|           | Binarynet [154]      | Self-designed                          | 10.15% absolute error rate | Binary weights & activations during forward pass                              |
| Ternary   | TWN [88]             | VGG-7                                  | < 1%                       | Ternary weights in forward & backward pass                                    |
|           | Ternary Connect [89] | 6C-1F-1softmax                         | $\sim$ 3% error rate drop  | Ternary weights during training                                               |
|           | TNN [90]             | VGG-variant                            | 12.11% absolute error rate | Teacher-student approach based ternary input &<br>activations during training |
| Others    | 12/14/16-bit [80]    | 3C-3S-1softmax                         | < 5%                       | Fixed-point number representation with stochastic<br>rounding                 |
|           | 10-bit [81]          | Maxout networks                        | < 5%                       | Dynamic 10-bit fixed point precision during training                          |

#### Table 5

Representation reduction methods (ImageNet).

| Bit-width | Method          | Model                                | Error rate increase | Note                                                     |
|-----------|-----------------|--------------------------------------|---------------------|----------------------------------------------------------|
| Binary    | QNN [86]        | GoogLeNet, AlexNet                   | > 10%               | Binary weights & activations during training and testing |
|           | XNOR-net [83]   | AlexNet, ResNet, GoogLeNet-variation | > 10%               | Binary weights & input to convolutional layers           |
|           | BWN [83]        | AlexNet, ResNet, GoogLeNet-variation | < 10%               | Binary weights in forward & backward pass                |
|           | DoReFa-Net [85] | AlexNet                              | around 10%          | Binary weights & 2-bit activations & 6-bit gradients     |
| Ternary   | TWN [88]        | ResNet                               |                     | Ternary weights in forward & backward pass               |

#### Table 6

FFT-based method analysis.

| Method            | Platform                     | Feature                                              | Additional<br>memory | Complexity of Fourier<br>transform & inverse                 | Complexity of add & Mul<br>in frequency domain &<br>Extra Complexity                                                                      |
|-------------------|------------------------------|------------------------------------------------------|----------------------|--------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| Mathieu [117]     | GeForce GTX Titan GPU        | Perform convolutions as products in frequency domain | Yes                  | $(2C \cdot n^2 \log n)(S \cdot f + f' \cdot f + S \cdot f')$ | $4S \cdot f' \cdot f \cdot n^2$                                                                                                           |
| Rippel [159]      | Xeon Phi coprocessor         | Pooling in frequency domain                          | Yes                  |                                                              |                                                                                                                                           |
| Ko [119]          | ASIC                         | Train the network entirely in frequency domain       | No                   | $2Cn^2\log n(S \cdot f)$                                     | $(3/2 \cdot (n^2 - \alpha) + \alpha) \cdot S \cdot f' \cdot f + 3/2 \cdot n^2 \cdot k^2 \cdot f' \cdot f$ |

S: mini-batch size, f: input feature map depth, f': output feature map depth, n: feature map dimension, k: kernel dimension, C: hidden constant in the O notation, $\alpha$: 1 for odd and 4 for even n.
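To make the entries of Table 6 easier to compare, the sketch below evaluates the complexity expressions for given layer sizes. It is only an illustration of the formulas as printed in the table: the function names are hypothetical, the logarithm is assumed to be base 2, and the hidden constant C defaults to 1, neither of which is fixed by the table.

```python
import math

def alpha(n):
    """alpha from the Table 6 footnote: 1 for odd n, 4 for even n."""
    return 1 if n % 2 == 1 else 4

def mathieu_cost(S, f, f_out, n, C=1.0):
    """Return (transform cost, frequency-domain cost) for the Mathieu [117] row.

    Assumption: log is taken base 2 and C is a free constant.
    """
    transform = 2 * C * n**2 * math.log2(n) * (S * f + f_out * f + S * f_out)
    pointwise = 4 * S * f_out * f * n**2
    return transform, pointwise

def ko_cost(S, f, f_out, n, k, C=1.0):
    """Return (transform cost, frequency-domain plus extra cost) for the Ko [119] row."""
    a = alpha(n)
    transform = 2 * C * n**2 * math.log2(n) * (S * f)
    pointwise = (1.5 * (n**2 - a) + a) * S * f_out * f + 1.5 * n**2 * k**2 * f_out * f
    return transform, pointwise
```

Calling, for instance, `mathieu_cost(S=128, f=96, f_out=256, n=32)` and `ko_cost(S=128, f=96, f_out=256, n=32, k=5)` with the same layer sizes shows how the transform term shrinks when only the input feature maps are transformed; the sizes here are arbitrary example values.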

#### Table 7

Performance comparison among GPU, FPGA, and ASIC.

| Method                                 | Platform          | Memory                                    | Frequency | Performance                         |
|----------------------------------------|-------------------|-------------------------------------------|-----------|-------------------------------------|
| Minwa [120]                            | GPUs              | 6.9 TB host memory & 1.7 TB device memory | -         | 0.6 PFlops single precision at peak |
| Roofline-model-based accelerator [125] | VC707 FPGA        | -                                         | 100 MHz   | 61.62 GFLOPS                        |
| Caffeine [129]                         | Xilinx KU060 FPGA | -                                         | 200 MHz   | 365 GOPS                            |
| ICAN accelerator [130]                 | Virtex-7 FPGA     | Memory bandwidth 6.2 GB/s                 | 160 MHz   | 147.82 GOPS                         |
| Dadiannao [133]                        | ASIC              | 36 MB node eDRAM                          | 606 MHz   | 2.09 TeraOps/s of a node at peak    |
| Chain-NN [134]                         | ASIC              | 352 KB on-chip memory                     | 700 MHz   | 806.4 GOPS at peak                  |
| Cambricon-X [136]                      | ASIC              | 56 KB on-chip SRAM                        | 1 GHz     | 544 GOPS                            |

## 7.2.4. From frequency perspective

Table 6 summarizes FFT-based CNN methods. The idea behind implementing a CNN in the frequency domain is to replace the convolution operation in the time domain with element-wise multiplication in the frequency domain. Because transforming back and forth takes time, the approach pays off mainly on large feature maps, where the savings in convolution outweigh the transform overhead. Further developments adapt it to small feature maps, for example by training the network directly in the frequency domain. Compared with other algorithms, the FFT method requires additional memory for padding the filters to the size of the input and for storing intermediate results in the frequency domain. This leads to a trade-off in hardware implementation. On one hand, it can exploit GPU parallelism to speed up convolution computation dramatically. On the other hand, it requires more careful GPU memory allocation because memory is limited.
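The equivalence between convolution and multiplication in the frequency domain can be checked with a minimal NumPy sketch. It is not the implementation of any method in Table 6; the function names are illustrative, a single 2-D feature map and a single kernel are assumed, and a true (flipped-kernel) full convolution is computed, whereas CNN layers usually apply cross-correlation.

```python
import numpy as np

def direct_conv2d(x, w):
    """Full 2-D linear convolution computed directly in the spatial domain."""
    h, wd = x.shape
    kh, kw = w.shape
    out = np.zeros((h + kh - 1, wd + kw - 1))
    for i in range(kh):
        for j in range(kw):
            # Each kernel tap adds a shifted, scaled copy of the input.
            out[i:i + h, j:j + wd] += w[i, j] * x
    return out

def fft_conv2d(x, w):
    """The same convolution computed as a product in the frequency domain."""
    size = (x.shape[0] + w.shape[0] - 1, x.shape[1] + w.shape[1] - 1)
    # Zero-padding both operands to the output size is the extra memory
    # cost mentioned in the text; it also avoids circular wrap-around.
    X = np.fft.rfft2(x, s=size)
    W = np.fft.rfft2(w, s=size)
    return np.fft.irfft2(X * W, s=size)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))   # feature map
w = rng.standard_normal((5, 5))     # kernel
same = np.allclose(direct_conv2d(x, w), fft_conv2d(x, w))
```

The padded spectra `X` and `W` are as large as the output regardless of the kernel size, which illustrates why small kernels benefit less and why memory allocation matters.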

## 7.3. Implementation level

As CNNs become more and more complex, general purpose processors cannot exploit the inherent parallelism of matrix or tensor operations and therefore become a bottleneck when running large deep convolutional neural networks. Various designs for accelerating network training and inference have been proposed based on GPU, ASIC, and FPGA. Table 7 gives a performance comparison of different methods implemented on GPU, FPGA, and ASIC. Fig. 6 shows the comparison among CPU, GPU, FPGA, and ASIC in terms of power and throughput [160–164]. GPU supports several teraFLOPS of throughput and large memory access, but consumes a lot of energy. In terms of economy, GPU is costly to set up for large deep convolutional neural networks.

Fig. 6. Power and throughput among CPU, GPU, FPGA, and ASIC.

Compared with GPU, ASIC is specialized hardware and can be carefully designed to maximize benefits such as power efficiency and high throughput in CNN implementation. However, once CNN algorithms are implemented on ASIC, it is difficult to change the hardware design. FPGA, on the other hand, is easy to program and reconfigure, which makes it more convenient for prototyping.

Compared with GPU, FPGA delivers a throughput of tens of gigaFLOPS and has limited memory access. In addition, it does not support floating-point natively, but it is more power-efficient. Due to its limited memory access, many proposed methods focus on accelerating neural network inference, since inference requires less memory access than training. Others emphasize external memory optimization for accelerating large neural networks. Different models need different hardware optimizations, and even for the same model, different designs result in quite different acceleration performance [60]. In terms of economy, FPGA is reconfigurable, which makes it easier to evolve hardware, frameworks, and software. Especially for the wide variety of neural network models, its flexibility shortens the design cycle and lowers cost.
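Since floating-point is not natively supported, a common workaround is fixed-point arithmetic, in which a real value is stored as an integer with an implied binary scale. The sketch below illustrates the idea in software only; the 16-bit word with 8 fractional bits and all function names are assumptions chosen for the example, not parameters taken from any accelerator in Table 7.

```python
def to_fixed(x, frac_bits=8, total_bits=16):
    """Quantize a real number to a signed fixed-point integer with saturation."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale)))

def to_float(q, frac_bits=8):
    """Convert a fixed-point integer back to a real number."""
    return q / (1 << frac_bits)

def fixed_mul(a, b, frac_bits=8):
    """Multiply two fixed-point values; the shift restores the scale.

    The right shift rounds toward negative infinity.
    """
    return (a * b) >> frac_bits

def fixed_dot(weights, activations, frac_bits=8):
    """Multiply-accumulate over fixed-point vectors using integer operations only."""
    acc = 0
    for w, a in zip(weights, activations):
        acc += w * a          # accumulate at double scale
    return acc >> frac_bits   # rescale once at the end

w_q = [to_fixed(v) for v in (0.5, -0.25, 1.0)]
a_q = [to_fixed(v) for v in (1.5, 2.0, 0.75)]
result = to_float(fixed_dot(w_q, a_q))
```

Accumulating at the wider scale and shifting once at the end mirrors how a hardware multiply-accumulate unit would keep a wider accumulator, which limits the rounding error per product.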

#### 8. Conclusion

In this paper, we have summarized recent advances in CNN acceleration methods at the structure level, algorithm level, and implementation level. At the structure level, CNNs can be compressed without losing significant accuracy since there is redundancy in most CNN architectures. For training algorithms, besides convergence speed, convolution calculation is also an important factor for CNN. The FFT method introduces a frequency perspective for training neural networks. At the implementation level, the characteristics of different hardware such as FPGA and GPU are explored in combination with CNN features. CNN performs better in the computer vision field as its structure goes deeper and the amount of data becomes larger, which makes it time consuming and computationally expensive. It is therefore imperative to accelerate CNN for its further deployment in everyday applications. For now, there is no generalized evaluation system for comparing the acceleration performance of different methods across levels. Researchers use case-by-case dataset benchmarks and different criteria at each level. Therefore, evaluating acceleration performance remains a challenge as well.

#### Acknowledgment

This research work was partly supported by the Natural Science Foundation of China and Jiangsu (Project no. 61750110529, 61850410535, 61671148, BK20161147), the Research and Innovation Program for Graduate Students in Universities of Jiangsu Province (Grant SJCX17\_0048, SJCX18\_0058), and the Research Grants Council of Hong Kong SAR (Project no. CUHK24209017).

#### References

- [1] S. Yu, S. Jia, C. Xu, Convolutional neural networks for hyperspectral image classification, Neurocomputing 219 (2017) 88–98.
- [2] C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1915–1929.
- [3] K. Wang, C. Gou, N. Zheng, J.M. Rehg, F.Y. Wang, Parallel vision for perception and understanding of complex scenes: methods, framework, and perspectives, Artif. Intell. Rev. 48 (3) (2017) 299–329.
- [4] P. Qin, W. Xu, J. Guo, An empirical convolutional neural network approach for semantic relation classification, Neurocomputing 190 (2016) 1–9.
- [5] B. Yu, D.Z. Pan, T. Matsunawa, X. Zeng, Machine learning and pattern matching in physical design, in: Proceedings of the IEEE/ACM Asia and South Pacific Design Automation Conference (ASPDAC), 2015, pp. 286–293.
- [6] L. Theis, W. Shi, A. Cunningham, F. Huszár, Lossy image compression with compressive autoencoders, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.

- [7] M. Zhang, T. Chen, X. Shi, P. Cao, Image arbitrary-ratio down-and up-sampling scheme exploiting DCT low frequency components and sparsity in high frequency components, IEICE Trans. Inf. Syst. 99 (2) (2016) 475–487.
- [8] T. Chen, M. Zhang, J. Wu, C. Yuen, Y. Tong, Image encryption and compression based on Kronecker compressed sensing and elementary cellular automata scrambling, Opt. Laser Technol. 84 (2016) 118–133.
- [9] J.A.A. Jothi, V.M.A. Rajam, A survey on automated cancer diagnosis from histopathology images, Artif. Intell. Rev. 48 (1) (2017) 31–81.
- [10] N.P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., In-data center performance analysis of a tensor processing unit, in: Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA), 2017, pp. 1–12.
- [11] NVDLA (http://nvdla.org).
- [12] (https://www.intelnervana.com).
- [13] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48.
- [14] M. Ghayoumi, A quick review of deep learning in facial expression, J. Commun. Comput. 14 (2017) 34–38.
- [15] A. Carrio, C. Sampedro, A. Rodriguez-Ramos, P. Campoy, A review of deep learning methods and applications for unmanned aerial vehicles, J. Sensors 2017 (2017).
- [16] J. Zhang, C. Zong, Deep neural networks in machine translation: an overview, IEEE Intell. Syst. 30 (5) (2015) 16–25.
- [17] S.P. Singh, A. Kumar, H. Darbari, L. Singh, A. Rastogi, S. Jain, Machine translation using deep learning: an overview, in: Proceedings of the International Conference on Computer, Communications and Electronics (Comptelix), 2017, pp. 162–167.
- [18] Z.H. Ling, S.Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H.M. Meng, L. Deng, Deep learning for acoustic modeling in parametric speech generation: a systematic review of existing techniques and future trends, IEEE Signal Process. Mag. 32 (3) (2015) 35–52.
- [19] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. 61 (2015) 85–117.
- [20] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436-444.
- [21] X. Du, Y. Cai, S. Wang, L. Zhang, Overview of deep learning, in: Proceedings of the Youth Academic Annual Conference of Chinese Association of Automation (YAC), IEEE, 2016, pp. 159–164.
- [22] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
- [23] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
- [24] M. Lin, Q. Chen, S. Yan, Network in network, arXiv:1312.4400 (2013).
- [25] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proceedings of the International Conference on Learning Representations (ICLR), 2015.
- [26] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
- [28] F. Chollet, Xception: deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [29] N. Aloysius, M. Geetha, A review on deep convolutional neural networks, in: Proceedings of the International Conference on Communication and Signal Processing (ICCSP), 2017, IEEE, 2017, pp. 0588–0592.
- [30] A.A.M. Al-Saffar, H. Tao, M.A. Talab, Review of deep convolution neural network in image classification, in: Proceedings of the International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), 2017, IEEE, 2017, pp. 26–31.
- [31] W. Rawat, Z. Wang, Deep convolutional neural networks for image classification: a comprehensive review, Neural Comput. 29 (9) (2017) 2352–2449.
- [32] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
- [33] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the International Conference on Machine Learning (ICML), 2010, pp. 807–814.
- [34] A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: Proceedings of the International Conference on Machine Learning (ICML), 2013.
- [35] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
- [36] B. Xu, N. Wang, T. Chen, M. Li, Empirical evaluation of rectified activations in convolutional network, in: Proceedings of the International Conference on Machine Learning Workshop, 2015.
- [37] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUS), in: Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- [38] P. Sermanet, S. Chintala, Y. LeCun, Convolutional neural networks applied to house numbers digit classification, in: Proceedings of the IEEE International Conference on Pattern Recognition (ICPR), 2012, pp. 3288–3291.

- [39] M.D. Zeiler, R. Fergus, Stochastic pooling for regularization of deep convolutional neural networks, in: Proceedings of the International Conference on Learning Representations (ICLR), 2013.
- [40] D. Yu, H. Wang, P. Chen, Z. Wei, Mixed pooling for convolutional neural networks, in: Proceedings of the International Conference on Rough Sets and Knowledge Technology, Springer, 2014, pp. 364–375.
- [41] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916.
- [42] O. Rippel, J. Snoek, R.P. Adams, Spectral representations for convolutional neural networks, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2015, pp. 2449–2457.
- [43] Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in: Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2014, pp. 392–407.
- [44] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
- [45] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al., Predicting parameters in deep learning, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2013, pp. 2148–2156.
- [46] T.N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, B. Ramabhadran, Low-rank matrix factorization for deep neural network training with high-dimensional output targets, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6655–6659.
- [47] M. Jaderberg, A. Vedaldi, A. Zisserman, Speeding up convolutional neural networks with low rank expansions, in: Proceedings of the British Machine Vision Conference (BMVC), 2014.
- [48] E.L. Denton, W. Zaremba, J. Bruna, Y. LeCun, R. Fergus, Exploiting linear structure within convolutional networks for efficient evaluation, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2014, pp. 1269–1277.
- [49] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, V. Lempitsky, Speeding-up convolutional neural networks using fine-tuned CP-decomposition, in: Proceedings of the International Conference on Learning Representations (ICLR), 2014.
- [50] C. Tai, T. Xiao, Y. Zhang, X. Wang, W. E, Convolutional neural networks with low-rank regularization, in: International Conference on Learning Representations (ICLR), 2016.
- [51] P. Wang, J. Cheng, Accelerating convolutional neural networks for mobile applications, in: Proceedings of the ACM International Multimedia Conference (MM), 2016, pp. 541–545.
- [52] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, D. Shin, Compression of deep convolutional neural networks for fast and low power mobile applications, in: Proceedings of the International Conference on Learning Representations, 2016.
- [53] H. Ding, K. Chen, Y. Yuan, M. Cai, L. Sun, S. Liang, Q. Huo, A compact CNN-D-BLSTM based character model for offline handwriting recognition with tucker decomposition, in: Proceedings of the IAPR International Conference on Document Analysis and Recognition (ICDAR), 1, IEEE, 2017, pp. 507–512.
- [54] Q. Le, T. Sarlós, A. Smola, Fastfood-approximating kernel expansions in loglinear time, in: Proceedings of the International Conference on Machine Learning (ICML), 85, 2013.
- [55] M. Moczulski, M. Denil, J. Appleyard, N. de Freitas, ACDC: a structured efficient linear layer, in: Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- [56] Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, A. Criminisi, Training CNNs with low-rank filters for efficient image classification, in: Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- [57] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, H. Li, Coordinating filters for faster deep neural networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
- [58] A. Sironi, B. Tekin, R. Rigamonti, V. Lepetit, P. Fua, Learning separable filters, IEEE Trans. Pattern Anal. Mach. Intell. 37 (1) (2015) 94–106.
- [59] X. Zhang, J. Zou, K. He, J. Sun, Accelerating very deep convolutional networks for classification and detection, IEEE Trans. Pattern Anal. Mach. Intell. 38 (10) (2016) 1943–1955.
- [60] S. Han, H. Mao, W.J. Dally, Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding, in: Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- [61] H. Wang, Q. Zhang, Y. Wang, H. Hu, Structured probabilistic pruning for convolutional neural network acceleration, in: British Machine Vision Conference (BMVC), 2018.
- [62] H. Zhou, J.M. Alvarez, F. Porikli, Less is more: towards compact CNNs, in: Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2016, pp. 662–677.
- [63] Z. Mariet, S. Sra, Diversity networks, in: Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- [64] A. Polyak, L. Wolf, Channel-level acceleration of deep face representations, IEEE Access 3 (2015) 2163–2175.
- [65] W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2016, pp. 2074–2082.
- [66] Y. He, X. Zhang, J. Sun, Channel pruning for accelerating very deep neural networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2, 2017, p. 6.

- [67] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, C. Zhang, Learning efficient convolutional networks through network slimming, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2755–2763.
- [68] H. Li, A. Kadav, I. Durdanovic, H. Samet, H.P. Graf, Pruning filters for efficient convnets, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- [69] A. Vedaldi, K. Lenc, Matconvnet: convolutional neural networks for MATLAB, in: Proceedings of the ACM International Multimedia Conference (MM), 2015, pp. 689–692.
- [70] S. Oymak, C. Thrampoulidis, B. Hassibi, Near-optimal sample complexity bounds for circulant binary embedding, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 6359–6363.
- [71] Y. Cheng, F.X. Yu, R.S. Feris, S. Kumar, A. Choudhary, S.F. Chang, An exploration of parameter redundancy in deep networks with circulant projections, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2857–2865.
- [72] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, Z. Wang, Deep fried ConvNets, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1476–1483.
- [73] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, et al., CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices, in: Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017, pp. 395–408.
- [74] R. Caruana, A. Niculescu Mizil, G. Crew, A. Ksikes, Ensemble selection from libraries of models, in: Proceedings of the International Conference on Machine Learning (ICML), 2004, p. 18.
- [75] C. Bucilu, R. Caruana, A. Niculescu Mizil, Model compression, in: Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2006, pp. 535–541.
- [76] Z. Xu, Y. Hsu, J. Huang, Training student networks for acceleration with conditional adversarial networks, in: British Machine Vision Conference (BMVC), 2018.
- [77] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv:1503.02531(2015).
- [78] A. Romero, N. Ballas, S.E. Kahou, A. Chassang, C. Gatta, Y. Bengio, Fitnets: hints for thin deep nets, in: Proceedings of the International Conference on Learning Representations (ICLR), 2015.
- [79] D. Hammerstrom, A VLSI architecture for high-performance, low-cost, on-chip learning, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), IEEE, 1990, pp. 537–544.
- [80] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, Deep learning with limited numerical precision., in: Proceedings of the International Conference on Machine Learning (ICML), 2015, pp. 1737–1746.
- [81] M. Courbariaux, Y. Bengio, J.P. David, Training deep neural networks with low precision multiplications, in: Proceedings of the International Conference on Learning Representations (ICLR), 2015a.
- [82] M. Courbariaux, Y. Bengio, J.P. David, Binaryconnect: Training deep neural networks with binary weights during propagations, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2015b, pp. 3123–3131.
- [83] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: Imagenet classification using binary convolutional neural networks, in: Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2016, pp. 525–542.
- [84] M. Kim, P. Smaragdis, Bitwise neural networks, in: Proceedings of the International Conference on Machine Learning (ICML), 2016.
- [85] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, Y. Zou, DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv:1606.06160(2018).
- [86] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: Training neural networks with low precision weights and activations, J. Mach. Learn. Res. 18 (1) (2017) 6869–6898.
- [87] H. Kim, J. Sim, Y. Choi, L.-S. Kim, A kernel decomposition architecture for binary-weight convolutional neural networks, in: Proceedings of the ACM/IEEE Design Automation Conference (DAC), 2017, p. 60.
- [88] F. Li, B. Zhang, B. Liu, Ternary weight networks, arXiv:1605.04711(2016).
- [89] Z. Lin, M. Courbariaux, R. Memisevic, Y. Bengio, Neural networks with few multiplications, in: Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- [90] H. Alemdar, V. Leroy, A. Prost-Boucle, F. Pétrot, Ternary neural networks for resource-efficient AI applications, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2017, pp. 2547–2554.
- [91] B.D. Brown, H.C. Card, Stochastic neural computation. i. computational elements, IEEE Trans. Comput. 50 (9) (2001) 891–905.
- [92] Z. Li, A. Ren, J. Li, Q. Qiu, B. Yuan, J. Draper, Y. Wang, Structural design optimization for deep convolutional neural networks using stochastic computing, in: Proceedings of the IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE), 2017, pp. 250–253.
- [93] K. Kim, J. Kim, J. Yu, J. Seo, J. Lee, K. Choi, Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks, in: Proceedings of the ACM/IEEE Design Automation Conference (DAC), 2016, pp. 124:1–124:6.
- [94] Y. Ji, F. Ran, C. Ma, D.J. Lilja, A hardware implementation of a radial basis function neural network using stochastic logic, in: Proceedings of the IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE), 2015, pp. 880–883.
- [95] A. Ren, Z. Li, C. Ding, Q. Qiu, Y. Wang, J. Li, X. Qian, B. Yuan, SC-DCNN: highly-scalable deep convolutional neural network using stochastic computing, in: Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017, pp. 405–418.

- [96] J. Li, Z. Yuan, Z. Li, A. Ren, C. Ding, J. Draper, S. Nazarian, Q. Qiu, B. Yuan, Y. Wang, Normalization and dropout for stochastic computing-based deep convolutional neural networks, Integr. VLSI J. (2017), doi:10.1016/j.vlsi.2017.11.002.
- [97] G. Li, P. Niu, X. Duan, X. Zhang, Fast learning network: a novel artificial neural network with a fast learning speed, Neural Comput. Appl. 24 (7-8) (2014) 1683–1695.
- [98] N.M. Nawi, A. Khan, M.Z. Rehman, A new back-propagation neural network optimized with cuckoo search algorithm, in: Proceedings of the International Conference on Computational Science and Its Applications, Springer, 2013, pp. 413–426.
- [99] Y.P. Liu, M.G. Wu, J.X. Qian, Evolving neural networks using the hybrid of ant colony optimization and bp algorithms, in: Proceedings of the International Symposium on Neural Networks, Springer, 2006, pp. 714–722.
- [100] S.T. Pan, M.L. Lan, An efficient hybrid learning algorithm for neural network-based speech recognition systems on FPGA chip, Neural Comput. Appl. 24 (7-8) (2014) 1879–1885.
- [101] S. Ding, C. Su, J. Yu, An optimizing bp neural network algorithm based on genetic algorithm, Artif. Intell. Rev. 36 (2) (2011) 153–162.
- [102] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, Technical Report, 1985.
- [103] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (Jul) (2011) 2121–2159.
- [104] M.D. Zeiler, ADADELTA: an adaptive learning rate method, arXiv:1212.5701 (2012).
- [105] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on Learning Representations (ICLR), 2015.
- [106] N.A. Hamid, N.M. Nawi, R. Ghazali, M.N.M. Salleh, Accelerating learning performance of back propagation algorithm by using adaptive gain together with adaptive momentum and adaptive learning rate on classification problems, in: Proceedings of the International Conference on Ubiquitous Computing and Multimedia Applications, Springer, 2011, pp. 559–570.
- [107] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM J. Optim. 22 (2) (2012) 341–362.
- [108] P. Richtárik, M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Math. Prog. 144 (1-2) (2014) 1–38.
- [109] cuBLAS (http://docs.nvidia.com/cuda/cublas).
- [110] MKL (https://software.intel.com/en-us/intel-mkl).
- [111] OpenBLAS (http://www.openblas.net).
- [112] M. Cho, D. Brand, Mec: Memory-efficient convolution for deep neural network, in: Proceedings of the International Conference on Machine Learning (ICML), 2017, pp. 815–824.
- [113] A. Lavin, S. Gray, Fast algorithms for convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4013–4021.
- [114] H. Park, D. Kim, J. Ahn, S. Yoo, Zero and data reuse-aware fast convolution for deep neural networks on GPU, in: Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, 2016, pp. 1–10.
- [115] Q. Xiao, Y. Liang, L. Lu, S. Yan, Y.-W. Tai, Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAS, in: Proceedings of the ACM/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.
- [116] S. Ben Yacoub, B. Fasel, J. Luettin, Fast face detection using MLP and FFT, in: Proceedings of the International Conference on Audio and Video-based Biometric Person Authentication, 1999, pp. 31–36.
- [117] M. Mathieu, M. Henaff, Y. LeCun, Fast training of convolutional networks through ffts, arXiv:1312.5851(2013).
- [118] T. Brosch, R. Tam, Efficient training of convolutional deep belief networks in the frequency domain for application to high-resolution 2D and 3D images, Neural Comput. 27 (1) (2015) 211–227.
- [119] J.H. Ko, B. Mudassar, T. Na, S. Mukhopadhyay, Design of an energy-efficient accelerator for training of convolutional neural networks using frequency-domain computation, in: Proceedings of the ACM/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.
- [120] R. Wu, S. Yan, Y. Shan, Q. Dang, G. Sun, Deep image: Scaling up image recognition (2015). arXiv: 1501.02876.
- [121] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, A. Ng, Deep learning with COTS HPC systems, in: Proceedings of the International Conference on Machine Learning (ICML), 2013, pp. 1337–1345.
- [122] M. Imani, D. Peroni, Y. Kim, A. Rahimi, T. Rosing, Efficient neural network acceleration on GPGPU using content addressable memory, in: Proceedings of the IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE), 2017, pp. 1026–1031.
- [123] N. Izeboudjen, C. Larbes, A. Farah, A new classification approach for neural networks hardware: from standards chips to embedded systems on chip, Artif. Intell. Rev. 41 (4) (2014) 491–534.
- [124] M. Peemen, A.A. Setio, B. Mesman, H. Corporaal, Memory-centric accelerator design for convolutional neural networks, in: Proceedings of the IEEE International Conference on Computer Design (ICCD), 2013, pp. 13–19.
- [125] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, J. Cong, Optimizing FPGA-based accelerator design for deep convolutional neural networks, in: Proceedings of the ACM Symposium on FPGAs, 2015, pp. 161–170.
- [126] J.J. Martínez, J. Garrigós, J. Toledo, J.M. Ferrández, An efficient and expandable hardware implementation of multilayer cellular neural networks, Neurocomputing 114 (2013) 54–62.

- [127] S. Chakradhar, M. Sankaradas, V. Jakkula, S. Cadambi, A dynamically configurable coprocessor for convolutional neural networks, in: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2010.
- [128] Y. Wang, H. Li, X. Li, Re-architecting the on-chip memory sub-system of machine-learning accelerator for embedded devices, in: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2016, p. 13.
- [129] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks, in: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2016, pp. 1–8.
- [130] A. Rahman, J. Lee, K. Choi, Efficient FPGA acceleration of convolutional neural networks using logical-3D compute array, in: Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE), 2016, pp. 1393–1398.
- [131] M. Alwani, H. Chen, M. Ferdman, P. Milder, Fused-layer CNN accelerators, in: Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
- [132] M. Gao, J. Pu, X. Yang, M. Horowitz, C. Kozyrakis, Tetris: scalable and efficient neural network acceleration with 3D memory, in: Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017, pp. 751–764.
- [133] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, Y. Chen, Dadiannao: a neural network supercomputer, IEEE Trans. Comput. 66 (1) (2017) 73–88.
- [134] S. Wang, D. Zhou, X. Han, T. Yoshimura, Chain-NN: an energy-efficient 1D chain architecture for accelerating deep convolutional neural networks, in: Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE), 2017, pp. 1032–1037.
- [135] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, E. Culurciello, Hardware accelerated convolutional neural networks for synthetic vision systems, in: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2010, pp. 257–260.
- [136] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, Y. Chen, Cambricon-X: an accelerator for sparse neural networks, in: Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
- [137] H. Kwon, A. Samajdar, T. Krishna, MAERI: enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects, in: Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018, pp. 461–475.
- [138] T. Gokmen, Y. Vlasov, Acceleration of deep neural network training with resistive cross-point devices: design considerations, Front. Neurosci. 10 (2016) 333.
- [139] B.L. Jackson, B. Rajendran, G.S. Corrado, M. Breitwisch, G.W. Burr, R. Cheek, K. Gopalakrishnan, S. Raoux, C.T. Rettner, A. Padilla, et al., Nanoscale electronic synapses using phase change devices, ACM J. Emerg. Technol. Comput. Syst. (JETC) 9 (2) (2013) 12.
- [140] S. Saïghi, C.G. Mayr, T. Serrano Gotarredona, H. Schmidt, G. Lecerf, J. Tomas, J. Grollier, S. Boyn, A.F. Vincent, D. Querlioz, et al., Plasticity in memristive devices for spiking neural networks, Front. Neurosci. 9 (2015) 51.
- [141] J. Seo, B. Lin, M. Kim, P.Y. Chen, D. Kadetotad, Z. Xu, A. Mohanty, S. Vrudhula, S. Yu, J. Ye, et al., On-chip sparse learning acceleration with CMOS and resistive synaptic devices, IEEE Trans. Nanotechnol. (TNANO) 14 (6) (2015) 969–979.
- [142] X. Zeng, S. Wen, Z. Zeng, T. Huang, Design of memristor-based image convolution calculation in convolutional neural network, Neural Comput. Appl. (2016) 1–6.
- [143] Y. Shim, A. Sengupta, K. Roy, Low-power approximate convolution computing unit with domain-wall motion based spin-memristor for image processing applications, in: Proceedings of the ACM/IEEE Design Automation Conference (DAC), 2016, pp. 1–6.
- [144] L. Ni, H. Huang, H. Yu, On-line machine learning accelerator on digital RRAM-crossbar, in: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2016, pp. 113–116.
- [145] Z. Xu, A. Mohanty, P.Y. Chen, D. Kadetotad, B. Lin, J. Ye, S. Vrudhula, S. Yu, J. Seo, Y. Cao, Parallel programming of resistive cross-point array for synaptic plasticity, Procedia Comput. Sci. 41 (2014) 126–133.
- [146] M. Prezioso, F. Merrikh Bayat, B. Hoskins, G. Adam, K.K. Likharev, D.B. Strukov, Training and operation of an integrated neuromorphic network based on metal-oxide memristors, Nature 521 (7550) (2015) 61–64.
- [147] M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Xie, Y. Wang, H. Yang, TIME: A training-in-memory architecture for memristor-based deep neural networks, in: Proceedings of the ACM/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.
- [148] L. Song, Y. Wang, Y. Han, H. Li, Y. Cheng, X. Li, STT-RAM buffer design for precision-tunable general-purpose neural network accelerator, IEEE Trans. Very Large Scale Integr. Syst. (TVLSI) 25 (4) (2017) 1285–1296.
- [149] M. Hu, J.P. Strachan, Z. Li, E.M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J.J. Yang, R.S. Williams, Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication, in: Proceedings of the ACM/IEEE Design Automation Conference (DAC), 2016, p. 19.
- [150] L. Xia, T. Tang, W. Huangfu, M. Cheng, X. Yin, B. Li, Y. Wang, H. Yang, Switched by input: power efficient structure for RRAM-based convolutional neural network, in: Proceedings of the ACM/IEEE Design Automation Conference (DAC), 2016, pp. 1–6.

- [151] A. Ankit, A. Sengupta, P. Panda, K. Roy, RESPARC: a reconfigurable and energy-efficient architecture with memristive crossbars for deep spiking neural networks, in: Proceedings of the ACM/IEEE Design Automation Conference (DAC), 2017, pp. 27:1–27:6.
- [152] B. Liu, M. Wang, H. Foroosh, M. Tappen, M. Pensky, Sparse convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 806–814.
- [153] S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.
- [154] M. Courbariaux, I. Hubara, D. Soudry, R. El Yaniv, Y. Bengio, Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or −1, arXiv:1602.02830 (2016).
- [155] S.S. Liew, M. Khalil Hani, R. Bakhteri, An optimized second order stochastic learning algorithm for neural network training, Neurocomputing 186 (2016) 74–89.
- [156] H. Cho, M.K. An, Co-clustering algorithm: batch, mini-batch, and online, Int. J. Inf. Electron. Eng. 4 (5) (2014) 340.
- [157] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q.V. Le, et al., Large scale distributed deep networks, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2012, pp. 1223–1231.
- [158] Q. Meng, W. Chen, J. Yu, T. Wang, Z.M. Ma, T.-Y. Liu, Asynchronous accelerated stochastic gradient descent, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2016.
- [159] O. Rippel, J. Snoek, R.P. Adams, Spectral representations for convolutional neural networks, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2015, pp. 2449–2457.
- [160] Z. Du, S. Liu, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, Q. Guo, X. Feng, Y. Chen, et al., An accelerator for high efficient vision processing, IEEE Trans. Comput.-Aid. Des. Integr. Circuits Syst. (TCAD) 36 (2) (2017) 227–240.
- [161] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M.A. Horowitz, W.J. Dally, EIE: efficient inference engine on compressed deep neural network, in: Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA), 2016, pp. 243–254.
- [162] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., Going deeper with embedded FPGA platform for convolutional neural network, in: Proceedings of the ACM Symposium on FPGAs, 2016, pp. 26–35.
- [163] B. Moons, M. Verhelst, An energy-efficient precision-scalable ConvNet processor in 40-nm CMOS, IEEE J. Solid-State Circuits 52 (4) (2017) 903–914.
- [164] Y.H. Chen, T. Krishna, J.S. Emer, V. Sze, EYERISS: an energy-efficient reconfigurable accelerator for deep convolutional neural networks, IEEE J. Solid-State Circuits 52 (1) (2017) 127–138.



**Qianru Zhang** is a Ph.D. student at the National ASIC Center in the School of Electronic Science and Engineering, Southeast University, China. She obtained her M.S. in Electrical and Computer Engineering from the University of California, Irvine in 2016. Her research interests include digital signal processing, big data analysis and deep learning techniques.



**Meng Zhang** received the B.S. degree in Electrical Engineering from the China University of Mining and Technology, Xuzhou, China, in 1986, and the M.S. and Ph.D. degrees in Bioelectronics Engineering from Southeast University, Nanjing, China, in 1993 and 2014, respectively. He is currently a professor in the National ASIC System Research Center, College of Electronic Science and Engineering of Southeast University, Nanjing, PR China. He is a faculty adviser of Ph.D. graduates. His research interests include digital signal and image processing, digital communication systems, wireless sensor networks, information security and assurance, cryptography, digital integrated circuit design, and machine learning. He is an author or coauthor of more than 50 refereed journal and international conference papers and a holder of more than 60 patents, including some PCT and US patents.



**Tinghuan Chen** is a Ph.D. student at the Department of Computer Science and Engineering, The Chinese University of Hong Kong. He received his M.Eng. in Electronics Engineering from National ASIC Center, Southeast University, China in 2017 and B.Eng. from the same university in 2014. He is interested in convex optimization and machine learning with applications in Design Automation and Cyber-Physical Systems.



**Zhifei Sun** received the B.Eng. degree in electronics engineering from Nanjing University of Posts and Telecommunications, China, in 2015, and is currently pursuing a Master's degree at the National ASIC Center, Southeast University, China. He is interested in machine learning and digital communication.



**Yuzhe Ma** received his B.E. degree from the Department of Microelectronics, Sun Yat-sen University in 2016. He is currently pursuing his Ph.D. degree at the Department of Computer Science and Engineering, The Chinese University of Hong Kong. His research interests include VLSI design for manufacturing, physical design and machine learning on chips.



**Bei Yu** received his Ph.D. degree from the Department of Electrical and Computer Engineering, University of Texas at Austin in 2014. He is currently an Assistant Professor in the Department of Computer Science and Engineering, The Chinese University of Hong Kong. He has served on the editorial boards of Integration, the VLSI Journal and IET Cyber-Physical Systems: Theory & Applications. He received four Best Paper Awards at the International Symposium on Physical Design (ISPD) 2017, the SPIE Advanced Lithography Conference 2016, the International Conference on Computer Aided Design (ICCAD) 2013, and the Asia and South Pacific Design Automation Conference (ASPDAC) 2012, plus three additional Best Paper Award nominations at DAC/ICCAD/ASPDAC, and four ICCAD/ISPD contest awards.