

香港中文大學 The Chinese University of Hong Kong

# CMSC5743 L02: CNN Accurate Speedup I

Bei Yu

(Latest update: September 28, 2020)

Fall 2020




#### These slides contain/adapt materials developed by

- Minsik Cho and Daniel Brand (2017). "MEC: memory-efficient convolution for deep neural network". In: *Proc. ICML*
- Asit K. Mishra et al. (2017). "Fine-grained accelerators for sparse machine learning workloads". In: *Proc. ASPDAC*, pp. 635–640
- Jongsoo Park et al. (2017). "Faster CNNs with direct sparse convolutions and guided pruning". In: *Proc. ICLR*
- UC Berkeley EE290: "Hardware for Machine Learning" https://inst.eecs.berkeley.edu/~ee290-2/sp20/





GEMM

Sparse Convolution

**Direct Convolution** 

**Further Discussions** 


















- H: Height of Input Activation
- W: Width of Input Activation
- R: Height of Weight
- S: Width of Weight
- P: Height of Output Activation
- Q: Width of Output Activation
- stride: # of rows/columns traversed per step














- padding: # of zero rows/columns added
- C: # of Input Channels





- K: # of Output Channels





- N: Batch size
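The symbols above are linked by the usual output-size relation. The following is a minimal sketch, assuming symmetric zero padding and floor division; the helper name `conv_output_size` is hypothetical and does not come from the slides.

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    """Output extent along one axis: P from (H, R), or Q from (W, S)."""
    span = in_size + 2 * padding - kernel_size
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    return span // stride + 1


# Example: a 4x4 input with a 2x2 weight, stride 1, no padding.
P = conv_output_size(4, 2)  # height of output activation
Q = conv_output_size(4, 2)  # width of output activation
```

The same function is applied once per spatial axis, so non-square inputs or weights only need different arguments.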





Direct convolution: No extra memory overhead

- Low performance
- Poor memory access pattern due to geometry-specific constraint
- Relatively short dot product
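As an illustration of the points above, direct convolution can be sketched as a seven-deep loop nest over plain nested lists. This is only a sketch: the function name `direct_conv` and the `W[k][c][r][s]` weight layout are assumptions made for the example. No extra buffer is allocated, but each output is a dot product of only C·R·S terms and the reads of `IA` jump between rows and channels.

```python
def direct_conv(IA, W, stride=1, pad=0):
    """IA[n][c][h][w] and W[k][c][r][s] (assumed layouts) -> OA[n][k][p][q]."""
    N, C, H, Wd = len(IA), len(IA[0]), len(IA[0][0]), len(IA[0][0][0])
    K, R, S = len(W), len(W[0][0]), len(W[0][0][0])
    P = (H + 2 * pad - R) // stride + 1
    Q = (Wd + 2 * pad - S) // stride + 1
    OA = [[[[0.0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    acc = 0.0  # one short dot product of length C*R*S
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                h = p * stride - pad + r
                                w = q * stride - pad + s
                                # Positions outside the input act as zero padding.
                                if 0 <= h < H and 0 <= w < Wd:
                                    acc += IA[n][c][h][w] * W[k][c][r][s]
                    OA[n][k][p][q] = acc
    return OA


IA = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]  # N=1, C=1, H=W=3
W = [[[[1, 0], [0, 1]]]]                    # K=1, C=1, R=S=2
OA = direct_conv(IA, W)
```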

## Background: Memory System





(Relative) size of the memory at each level



#### Temporal Locality





#### GEMM

# Im2col (Image2Column) Convolution





Large extra memory overhead

- Good performance
- BLAS-friendly memory layout to enjoy SIMD/locality/parallelism
- Applicable for any convolution configuration on any platform

# Im2col (Image2Column) Convolution





Transform convolution to matrix multiplication

Unified calculation for both convolution and fully-connected layers

# Im2col (Image2Column): Another View



#### Image to column operation (im2col)

Slide over the input image as a convolution would, but turn each patch into a column vector.

The real performance gain comes when the kernel has a large number of filters (e.g., F=4) and/or the input is a batch of images (e.g., N=4). Example: an input batch of shape [4x4x3x4] convolved with 4 filters of shape [2x2x3]:

- Reshaped kernel: [4x12]
- Converted input batch: [12x36]

The only problem with this approach is the amount of memory it requires.



<sup>1</sup>https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine\_learning/deep\_learning/convolution\_layer/making\_faster
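The lowering can be sketched in a few lines of plain Python. The sizes are chosen so that the matrices come out as [4x12] and [12x36]; the function names `im2col` and `matmul`, the `IA[n][c][h][w]` layout, and the test data are assumptions made for the illustration (stride 1 by default, no padding).

```python
def im2col(IA, R, S, stride=1):
    """IA[n][c][h][w] -> matrix with C*R*S rows and N*P*Q columns (no padding)."""
    N, C, H, W = len(IA), len(IA[0]), len(IA[0][0]), len(IA[0][0][0])
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    cols = []
    for n in range(N):
        for p in range(P):
            for q in range(Q):
                # One patch, flattened in (c, r, s) order, becomes one column.
                cols.append([IA[n][c][p * stride + r][q * stride + s]
                             for c in range(C) for r in range(R) for s in range(S)])
    # cols holds one list per column; transpose so that rows index (c, r, s).
    return [list(row) for row in zip(*cols)]


def matmul(A, B):
    """Plain matrix product standing in for a BLAS GEMM call."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# 4 images of 4x4 pixels with 3 channels, and 4 filters of 2x2x3.
N, C, H, W, K, R, S = 4, 3, 4, 4, 4, 2, 2
IA = [[[[(n + c + h * W + w) % 5 for w in range(W)] for h in range(H)]
       for c in range(C)] for n in range(N)]
filters = [[[[(k + c + r + s) % 3 for s in range(S)] for r in range(R)]
            for c in range(C)] for k in range(K)]

lowered = im2col(IA, R, S)                       # 12 x 36
kernel_matrix = [[f[c][r][s] for c in range(C) for r in range(R) for s in range(S)]
                 for f in filters]               # 4 x 12
out = matmul(kernel_matrix, lowered)             # 4 x 36
```

Each input pixel is copied into up to R·S columns of `lowered`, which is where the extra memory goes.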





- Sub-matrices of the lowered matrix are multiplied by separate `sgemm` calls in parallel
- Smaller memory footprint, cache locality, and explicit parallelism

<sup>2</sup>Minsik Cho and Daniel Brand (2017). "MEC: memory-efficient convolution for deep neural network". In: *Proc. ICML*.
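A minimal sketch of this kind of lowering, assuming a single channel and no padding: each row of the lowered matrix stores one full-height, S-wide vertical strip of the input, and every output row is computed from a contiguous column slice of that matrix. The names `mec_lower` and `mec_conv2d` are hypothetical, and the per-row products are written as plain Python sums rather than real `sgemm` calls.

```python
def mec_lower(inp, S, stride=1):
    """Row q holds the full-height, S-wide strip starting at column q*stride."""
    H, W = len(inp), len(inp[0])
    Q = (W - S) // stride + 1
    return [[inp[h][q * stride + s] for h in range(H) for s in range(S)]
            for q in range(Q)]


def mec_conv2d(inp, ker, stride=1):
    H = len(inp)
    R, S = len(ker), len(ker[0])
    P = (H - R) // stride + 1
    lowered = mec_lower(inp, S, stride)
    kflat = [ker[r][s] for r in range(R) for s in range(S)]
    out = []
    for p in range(P):
        # Output row p reads columns [start, start + R*S) of the lowered matrix.
        # The P products are independent, so they could run in parallel.
        start = p * stride * S
        out.append([sum(a * b for a, b in zip(row[start:start + R * S], kflat))
                    for row in lowered])
    return out


inp = [[(3 * h + w) % 7 for w in range(7)] for h in range(7)]  # 7x7 input
ker = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]                     # 3x3 weight
out = mec_conv2d(inp, ker)

low = mec_lower(inp, 3)
mec_elems = len(low) * len(low[0])     # 5 rows of 21 elements
im2col_elems = (5 * 5) * (3 * 3)       # 25 patches of 9 elements
```

For this 7x7 input and 3x3 weight the lowered matrix is smaller than the im2col matrix because overlapping rows of vertically adjacent patches are stored once.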








Over  $2 \times$  memory saving<sup>3</sup>:



<sup>3</sup>Minsik Cho and Daniel Brand (2017). "MEC: memory-efficient convolution for deep neural network". In: *Proc. ICML*.





#### Sparse Convolution



- Our DNN may be redundant, and sometimes the filters may be sparse
- Sparsity can be helpful to overcome over-fitting



## Sparse Convolution: Naive Implementation 1







- **Algorithm 1** Sparse Convolution Naive 1
- 1: **for all** *w*[i] **do**
- 2: **if** *w*[i] = 0 **then**
- 3: continue;
- 4: **end if**
- 5: output feature map $Y \leftarrow X \times w[i]$;
- 6: **end for**
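The loop above can be sketched in Python for a single-channel 2D case. This assumes stride 1, no padding, and that step 5 accumulates the shifted input scaled by the weight into Y; the function name `sparse_conv_naive` is hypothetical.

```python
def sparse_conv_naive(X, w):
    """Accumulate X shifted by (r, s) times w[r][s], skipping zero weights."""
    H, W = len(X), len(X[0])
    R, S = len(w), len(w[0])
    P, Q = H - R + 1, W - S + 1
    Y = [[0] * Q for _ in range(P)]
    for r in range(R):
        for s in range(S):
            if w[r][s] == 0:  # data-dependent branch, taken once per weight
                continue
            for p in range(P):
                for q in range(Q):
                    Y[p][q] += X[p + r][q + s] * w[r][s]
    return Y


X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, 2]]  # two of the four weights are zero
Y = sparse_conv_naive(X, w)
```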

# Sparse Convolution: Naive Implementation 1


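This is a bad implementation for the instruction pipeline: the branch on *w*[i] depends on the data, so the processor cannot know which instruction to fetch next until the comparison resolves, which can stall or flush the pipeline.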

| Instr. No. | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
|------------|---------|---------|---------|---------|---------|---------|---------|
| 1          | IF      | ID      | EX      | MEM     | WB      |         |         |
| 2          |         | IF      | ID      | EX      | MEM     | WB      |         |
| 3          |         |         | IF      | ID      | EX      | MEM     | WB      |
| 4          |         |         |         | IF      | ID      | EX      | MEM     |
| 5          |         |         |         |         | IF      | ID      | EX      |


# Sparse Matrix Representation





- CSR (compressed sparse row): good for operations on feature maps
- CSC (compressed sparse column): good for operations on filters
- We have better control over the filters, so CSC is the usual choice.
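For concreteness, both formats can be built from a small dense matrix as follows. This is a minimal sketch; the helper names `to_csr` and `to_csc` are hypothetical and the example matrix is made up.

```python
def to_csr(dense):
    """Return (values, colidx, rowptr): non-zeros stored row by row."""
    values, colidx, rowptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                colidx.append(j)
        rowptr.append(len(values))  # where the next row starts
    return values, colidx, rowptr


def to_csc(dense):
    """Return (values, rowidx, colptr): the CSR form of the transpose."""
    return to_csr([list(col) for col in zip(*dense)])


dense = [[5, 0, 0],
         [0, 0, 3],
         [2, 4, 0]]
csr = to_csr(dense)
csc = to_csc(dense)
```

Row `i` of the matrix occupies `values[rowptr[i]:rowptr[i + 1]]` in CSR, so walking one row is contiguous; CSC gives the same property for columns.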



# Sparse Convolution: Naive Implementation 2





- BAD implementation for Spatial Locality!
- Poor memory access patterns

### SOTA 2: Sparse Convolution





Figure 1: Conceptual view of the direct sparse convolution algorithm. Computation of output value at (y, x)th position of *n*th output channel is highlighted.

```
for each output channel n {
  // Non-zero weights of channel n live in [rowptr[n], rowptr[n+1]).
  for j in [W.rowptr[n], W.rowptr[n+1]) {
    off = W.colidx[j]; coeff = W.value[j]
    for (int y = 0; y < H_OUT; ++y) {
      for (int x = 0; x < W_OUT; ++x) {
        out[n][y][x] += coeff*in[off+f(0,y,x)]
      }
    }
  }
}
```

Figure 2: Sparse convolution pseudocode. Matrix W is stored in *compressed sparse row* (CSR) format, where rowptr[n] points to the first non-zero weight of the *n*th output channel. For the *j*th non-zero weight at (n, c, r, s), W.colidx[j] contains the offset to the (c, r, s)th element of tensor in, which is pre-computed by the layout function as f(c, r, s). If in has CHW format, $f(c, r, s) = (cH_{in} + r)W_{in} + s$. The "virtual" dense matrix is formed on the fly by shifting in by (0, y, x).<sup>4</sup>

<sup>4</sup>Jongsoo Park et al. (2017). "Faster CNNs with direct sparse convolutions and guided pruning". In: *Proc. ICLR*.
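The pseudocode can be turned into a runnable sketch under a few assumptions: stride 1, no padding, a flat CHW input list, and CSR arrays built from a dense (mostly zero) weight tensor. The function name `direct_sparse_conv` and the dense-to-CSR step are additions for illustration, not part of the cited paper's code.

```python
def direct_sparse_conv(inp, C, H_in, W_in, weights, R, S):
    """inp: flat CHW list; weights[n][c][r][s]: dense filters, mostly zero."""
    H_out, W_out = H_in - R + 1, W_in - S + 1

    def f(c, r, s):
        # Layout function for CHW: offset of element (c, r, s) in inp.
        return (c * H_in + r) * W_in + s

    # CSR over output channels; colidx stores the pre-computed offsets f(c, r, s).
    value, colidx, rowptr = [], [], [0]
    for filt in weights:
        for c in range(C):
            for r in range(R):
                for s in range(S):
                    if filt[c][r][s] != 0:
                        value.append(filt[c][r][s])
                        colidx.append(f(c, r, s))
        rowptr.append(len(value))

    out = [[[0] * W_out for _ in range(H_out)] for _ in weights]
    for n in range(len(weights)):
        for j in range(rowptr[n], rowptr[n + 1]):
            off, coeff = colidx[j], value[j]
            for y in range(H_out):
                for x in range(W_out):
                    # off + f(0, y, x) addresses in[c][r + y][s + x].
                    out[n][y][x] += coeff * inp[off + f(0, y, x)]
    return out


C, H_in, W_in = 2, 3, 3
inp = list(range(C * H_in * W_in))              # in[c][h][w] = 9c + 3h + w
weights = [[[[1, 0], [0, 0]], [[0, 0], [0, 2]]]]  # one filter, two non-zeros
out = direct_sparse_conv(inp, C, H_in, W_in, weights, 2, 2)
```

The inner two loops contain no branch and read `inp` with unit stride in `x`, which is the contrast with the naive versions above.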

## **Discussion: Sparse-Sparse Convolution**



- Sparsity is a desired property for computation acceleration. (cuSPARSE library, direct sparse convolution, etc.)
- Sometimes not only the filters but also the input feature maps are sparse.



# **Discussion: Sparse-Sparse Convolution**





An efficient implementation is required (to improve pipeline efficiency).

With sparsity(*input*) = 0.9 and sparsity(*weight*) = 0.8, the speedup is more than $10 \times$.

#### Some other issues:

- How to be compatible with pooling layer?
- Transform between dense & sparse formats






#### **Direct Convolution**





# 1D Convolution Example





```
// Output Stationary (OS) dataflow
for (q=0; q<Q; q++) {
  for (s=0; s<S; s++) {
    OA[q] += IA[q+s] * W[s];
  }
}
```

```
// Weight Stationary (WS) dataflow
for (s=0; s<S; s++) {
  for (q=0; q<Q; q++) {
    OA[q] += IA[q+s] * W[s];
  }
}
```
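The two loop orders compute the same result and differ only in which operand stays put. The sketch below makes that visible by counting buffer traffic; the function names and the counters are illustrative assumptions, not a model of any particular accelerator.

```python
def conv1d_os(IA, W):
    """Output stationary: each partial sum stays in a local register."""
    Q, S = len(IA) - len(W) + 1, len(W)
    OA, out_writes = [0] * Q, 0
    for q in range(Q):
        acc = 0
        for s in range(S):
            acc += IA[q + s] * W[s]
        OA[q] = acc          # one write-back per output
        out_writes += 1
    return OA, out_writes


def conv1d_ws(IA, W):
    """Weight stationary: each weight is loaded once and reused Q times."""
    Q, S = len(IA) - len(W) + 1, len(W)
    OA, weight_loads, out_updates = [0] * Q, 0, 0
    for s in range(S):
        w = W[s]             # weight stays in a local register
        weight_loads += 1
        for q in range(Q):
            OA[q] += IA[q + s] * w   # partial sums go through the buffer
            out_updates += 1
    return OA, weight_loads, out_updates


IA = [1, 2, 3, 4, 5]
W = [1, 0, -1]
os_result = conv1d_os(IA, W)
ws_result = conv1d_ws(IA, W)
```

With Q = 3 and S = 3, the output-stationary order writes each output once, while the weight-stationary order loads each weight once but updates the output buffer Q·S times.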

# Buffer Access Pattern 1: Output Stationary





















## Buffer Access Pattern 2: Weight Stationary






# Weight Stationary in 3D Convolution Scenario










#### Dataflow

Defines the execution order of the DNN operations in hardware

- Computation Order
- Data Movement Order
- Loop nest is a compact way to describe the execution order, i.e., dataflow, supported in hardware.
  - for: temporal for, describes the temporal execution order
  - spatial\_for: describes parallel execution















#### • Apply spatial parallelism





#### • Apply temporal tiling

```
for (n=0; n<N; n++) {
  for (r=0; r<R; r++) {
    for (s=0; s<S; s++) {
      for (c_t=0; c_t<C/16; c_t++) {        // temporal tiles over C
        for (k_t=0; k_t<K/64; k_t++) {      // temporal tiles over K
          spatial_for (c_s=0; c_s<16; c_s++) {
            spatial_for (k_s=0; k_s<64; k_s++) {
              int curr_c = c_t * 16 + c_s;
              int curr_k = k_t * 64 + k_s;
              float curr_w = W[r][s][curr_c][curr_k];
              for (p=0; p<P; p++) {
                for (q=0; q<Q; q++) {
                  // Out-of-range (h, w) are assumed to read as zero padding.
                  h = p * stride - pad + r;
                  w = q * stride - pad + s;
                  OA[n][curr_k][p][q] +=
                      IA[n][curr_c][h][w] * curr_w;
}}}}}}}}}
```
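The loop nest above can be emulated sequentially to check that tiling does not change the result. In this sketch the tile sizes are parameters (`c_tile`, `k_tile`, both hypothetical names) instead of the fixed 16 and 64, the `spatial_for` loops are replaced by ordinary loops, and out-of-range input positions are treated as zero padding.

```python
def tiled_ws_conv(IA, W, stride=1, pad=0, c_tile=2, k_tile=2):
    """IA[n][c][h][w], W[r][s][c][k] (layout of the loop nest above)."""
    N, C, H, Wd = len(IA), len(IA[0]), len(IA[0][0]), len(IA[0][0][0])
    R, S, K = len(W), len(W[0]), len(W[0][0][0])
    assert C % c_tile == 0 and K % k_tile == 0, "tiles must divide C and K"
    P = (H + 2 * pad - R) // stride + 1
    Q = (Wd + 2 * pad - S) // stride + 1
    OA = [[[[0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for r in range(R):
            for s in range(S):
                for c_t in range(C // c_tile):      # temporal tiles
                    for k_t in range(K // k_tile):
                        # Stand-ins for spatial_for: on hardware each
                        # (c_s, k_s) pair would map to its own unit.
                        for c_s in range(c_tile):
                            for k_s in range(k_tile):
                                curr_c = c_t * c_tile + c_s
                                curr_k = k_t * k_tile + k_s
                                curr_w = W[r][s][curr_c][curr_k]  # stationary
                                for p in range(P):
                                    for q in range(Q):
                                        h = p * stride - pad + r
                                        w = q * stride - pad + s
                                        if 0 <= h < H and 0 <= w < Wd:
                                            OA[n][curr_k][p][q] += (
                                                IA[n][curr_c][h][w] * curr_w)
    return OA


# Small made-up problem: N=1, C=2, K=2, H=W=3, R=S=2.
IA = [[[[c * 9 + h * 3 + w for w in range(3)] for h in range(3)] for c in range(2)]]
W = [[[[(r + 2 * s + c + k) % 3 - 1 for k in range(2)] for c in range(2)]
      for s in range(2)] for r in range(2)]
OA = tiled_ws_conv(IA, W)
```

Because every (c, k) pair is visited exactly once per (r, s), any tile sizes that divide C and K give the same output; only the execution order changes.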





#### Further Discussions

# Example: Halide, SIGGRAPH '2019



- https://youtu.be/3uiEyEKji0M
- "We generate schedules for Halide programs using tree search over the space of schedules guided by a learned cost model and optional autotuning. The cost model is trained by benchmarking thousands of randomly-generated Halide programs and schedules. The resulting code significantly outperforms prior work and human experts."<sup>5</sup>

<sup>5</sup>Andrew Adams et al. (2019). "Learning to optimize halide with tree search and random programs". In: *ACM Trans. Graph.* 38.4, 121:1–121:12. DOI: 10.1145/3306346.3322967. URL: https://doi.org/10.1145/3306346.3322967.

# Example: FlexFlow, SysML'2019



"The optimizer uses a MCMC search algorithm to explore the space of possible parallelization strategies and iteratively proposes candidate strategies that are evaluated by a execution simulator."




# Example: AutoTVM v1.0, NeurIPS '2018



"We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search using effective model transfer across workloads."



<sup>7</sup>Tianqi Chen et al. (2018). "Learning to Optimize Tensor Programs". In: *Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada*. Ed. by Samy Bengio et al., pp. 3393–3404. URL: http://papers.nips.cc/paper/7599-learning-to-optimize-tensor-programs.

## Example: Ansor: AutoTVM v2.0, arXiv



"We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores much more optimization combinations by sampling programs from a hierarchical representation of the search space."



<sup>8</sup>Lianmin Zheng et al. (2020). "Ansor : Generating High-Performance Tensor Programs for Deep Learning". In: *CoRR* abs/2006.06762. arXiv: 2006.06762. URL: https://arxiv.org/abs/2006.06762.