

### Lecture 09: TVM

### Bei Yu

(Latest update: December 19, 2020)

Spring 2021

・ロト・国・・国・・国・ シック・



#### These slides contain/adapt materials developed by

Chen, Tianqi, et al. "TVM: An automated end-to-end optimizing compiler for deep learning." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018.



◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 - のへで

## Beginning of Story



◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 - のへで

## Beginning of Story



З



### Beginning of Story





З



### Beginning of Story









seudo-code for convolution program for the VIA accelerat



| 0+06: LOAD(PARAM[ 0-71])<br>0+02: LOAD(PARAM[ 0-71])<br>0+02: LOAD(LOBEF[ 0-71])<br>0+02: LOAD(LOBEF[ 0-71])<br>0+02: LOAD(LOBEF[ 0-71])<br>0+03: EXE (ACTIV[ 0-24],PARAM[ 0-<br>0+07: PUSH(EX->5T)<br>0+03: STOR(STBUF[ 0-7])<br>0+03: STOR(                                                 | 71],LDBUF[ 0-31],STBUF[ 0- 7]) | // LD@TID0<br>// LD@TID0<br>// LD@TID0<br>// LD@TID0<br>// EX@TID0<br>// EX@TID0<br>// EX@TID0<br>// ST@TID0<br>// ST@TID0<br>// ST@TID0<br>// ST@TID0 |  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| 9+08: LOAD(ACTIV[5-60])<br>0+08: LOAD(ACTIV[5-60])<br>0+08: LOAD(AD(00F[31-63])<br>0+08: PUSH(L0-FEX)<br>0+08: EXE (ACTIV[25-50],PARAM[0-<br>0+18: PUSH(EX-51)<br>0+11: PUSH(EX-51)<br>0+11: PUSH(EX-51)<br>0+11: PUSH(EX-51)<br>0+12: PUSH(EX-51)<br>0+13: PUSH(EX-5 | 71],LDBUF[32-63],STBUF[32-39]) | // LD@TID1<br>// LD@TID1<br>// EX@TID1<br>// EX@TID1<br>// EX@TID1<br>// EX@TID1<br>// ST@TID1<br>// ST@TID1<br>// ST@TID1                             |  |
| <pre>btl: POP (EX-&gt;LD)<br/>btl: DOP (EX-&gt;LD)<br/>btl: LOAD(ACTIV[ e&gt;-24])<br/>btl: LOAD(ACTIV[ e&gt;-24])<br/>btl: LOAD(LOBWE[ e&gt;-24])<br/>btl: LOAD(LOBWE[ e&gt;-24])<br/>btl: DOBWE[ e&gt;-24])<br/>btl: EXE (ACTIV[ e&gt;-24], PARAM[ e-<br/>btl: EXE (ACTIV[ e&gt;-24], PARAM[ e-<br/>btl: EXE (ACTIV[ e&gt;-57])<br/>btl: EXE (ACTIV[ e&gt;-57])</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | 71],LDBUF[ 0-31],STBUF[ 0- 7]) | // LD@TID2<br>// LD@TID2<br>// LD@TID2<br>// LD@TID2<br>// LD@TID2<br>// EX@TID2<br>// EX@TID2<br>// EX@TID2<br>// ST@TID2<br>// ST@TID2               |  |
| // vitual intend 3<br>8x28: POP (EX->LD)<br>8x21: LOAD(ACTIV[25-59])<br>8x21: LOAD(ACTIV[25-59])<br>8x22: LOAD(LBBF[32-53])<br>8x23: PUSH(LB->EX)<br>8x25: PUSH(LB->EX)<br>8x26: EXE (ACTIV[25-58],PARAM[ 0-<br>8x27: PUSH(EX->57)<br>8x28: POP (EX->57)<br>8x28: POP (EX->57)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 71],LDBUF[32-63],STBUF[32-39]) | // LD@TID3<br>// LD@TID3<br>// LD@TID3<br>// LD@TID3<br>// EXQTID3<br>// EXQTID3<br>// EXQTID3<br>// EXQTID3<br>// ST@TID3<br>// ST@TID3               |  |





// micro op 0 fields.
for y in [0\_1)

(c) Max pool, batch norm and activation

micro-coded program

◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 のへで

// Convolution access pattern dictated by micro-coded program. // Each register index is derived as a 2-D affine function. // e.g. idx,= a.g.yb.gxtc,#, where c.g. is specified by



▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のへで

### Goal: Deploy Deep Learning Everywhere







・ロト ・四ト ・ヨト ・ヨト

æ

# Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks



イロト イヨト イヨト イヨト

# Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends



イロト イヨト イヨト イヨト

## Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends





イロト イヨト イヨト イヨト

## Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends





## Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

#### Explosion of hardware backends





Explosion of models and frameworks

#### Explosion of hardware backends







<ロト <回ト < 回ト < 回ト



《曰》 《聞》 《臣》 《臣》

## Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

#### Explosion of hardware backends





Explosion of models and frameworks



◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 のへぐ



Explosion of models and frameworks





Explosion of models and frameworks

























◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 のへで





◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 のへで





《曰》 《聞》 《臣》 《臣》





《曰》 《聞》 《臣》 《臣》







### Existing Approach: Engineer Optimized Tensor Operators





Vanilla Code

```
for y in range(1024):
    for x in range(1024):
        C[y][x] = 0
        for k in range(1024):
            C[y][x] += A[k][y] * B[k][x]
```



<ロ> (四) (四) (三) (三) (三)

### Existing Approach: Engineer Optimized Tensor Operators





### Loop Tiling for Locality



・ロト ・ 日 ・ ・ ヨ ・ ・ ヨ ・ ・

### Existing Approach: Engineer Optimized Tensor Operators





### Map to Accelerators

```
inp_buffer AL[8][8], BL[8][8]
acc_buffer CL[8][8]
for yo in range(128):
    vdla.fill_zero(CL)
    for ko in range(128):
       vdla.dma_copy2d(AL, A[ko*8:ko*8+8][yo*8:yo*8+8])
       vdla.dma_copy2d(BL, B[ko*8:ko*8+8][xo*8:xo*8+8])
       vdla.fused_gemm8x8_add(CL, AL, BL)
       vdla.dma_copy2d(C[yo*8:yo*8+8,xo*8:xo*8+8], CL)
```

Human exploration of optimized code



# Limitations of Existing Approach



### cuDNN





# Limitations of Existing Approach



#### cuDNN







# Limitations of Existing Approach

9







# Limitations of Existing Approach







# Limitations of Existing Approach















## Limitations of Existing Approach



New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup

・ロト ・日下・・日下・・











## Limitations of Existing Approach



9

New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup




# Limitations of Existing Approach



New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup















#### Limitations of Existing Approach Frameworks m New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup **Engineering intensive cuDNN** $\mathcal{P}_{\mathcal{A}}$































▲ロト ▲御 ▶ ▲ 臣 ▶ ▲ 臣 ▶ ● 臣 ● のへ(









・ロト ・四ト ・ヨト ・ヨト





<ロト <回ト < 回ト < 回ト

#### Hardware-aware Search Space







イロト イポト イヨト イヨト

#### Hardware-aware Search Space





# Computational Graph as IR

#### Represent High level Deep Learning Computations

Effective Equivalent Transformations to Optimize the Graph



Approach taken by: TensorFlow XLA, Intel NGraph, Nvidia TensorRT





# need to build and optimize operators for each hardware, variant of layout, precision, threading pattern ...



◆□▶ ◆□▶ ◆目▶ ◆目▶ 目 のへで





◆□▶ ◆□▶ ◆臣▶ ◆臣▶ 臣 のへで



▲日▼ ▲□▼ ▲目▼ ▲目▼ 目 ろんの

#### **Tensor Index Expression**

Compute C = dot(A, B.T)

```
import tvm
```

m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')
k = tvm.reduce\_axis((0, h), name='k')
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] \* B[j, k], axis=k))
Shape of C
Computation Rule



▲□▶ ▲□▶ ▲□▶ ▲□▶ ▲□ ● つんの

# Tensor Expressions are Expressive

#### Affine Transformation

```
out = tvm.compute((n, m), lambda i, j: tvm.sum(data[i, k] * w[j, k], k))
out = tvm.compute((n, m), lambda i, j: out[i, j] + bias[i])
```

#### Convolution

```
out = tvm.compute((c, h, w),
    lambda i, x, y: tvm.sum(data[kc,x+kx,y+ky] * w[i,kx,ky], [kx,ky,kc]))
```

#### ReLU

out = tvm.compute(shape, lambda \*i: tvm.max(0, out(\*i))



・ロト ・四ト ・ヨト ・ 王

#### Emerging Tools Using Tensor Expression Language

Halide: Image processing language

Loopy: python based kernel generator

TACO: sparse tensor code generator

Tensor Comprehension



# Schedule: Tensor Expression to Code



◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 - のへで



# Example Schedule Transformation

- C = tvm.compute((n,), lambda i: A[i] + B[i])
- s = tvm.create\_schedule(C.op)

```
for (int i = 0; i < n; ++i) {
    C[i] = A[i] + B[i];
}</pre>
```





イロト イポト イヨト イヨト 三日

#### Example Schedule Transformation

```
C = tvm.compute((n,), lambda i: A[i] + B[i])
```

```
s = tvm.create_schedule(C.op)
```

xo, xi = s[C].split(s[C].axis[0], factor=32)

```
for (int xo = 0; xo < ceil(n / 32); ++xo) {
  for (int xi = 0; xi < 32; ++xi) {
    int i = xo * 32 + xi;
    if (i < n) {
        C[i] = A[i] + B[i];
      }
   }
}</pre>
```



イロト イポト イヨト イヨト 三日

### Example Schedule Transformation

```
C = tvm.compute((n,), lambda i: A[i] + B[i])
s = tvm.create_schedule(C.op)
xo, xi = s[C].split(s[C].axis[0], factor=32)
s[C].recorder(xi, xo)
```

```
for (int xi = 0; xi < 32; ++xi) {
  for (int xo = 0; xo < ceil(n / 32); ++xo) {
    int i = xo * 32 + xi;
    if (i < n) {
        C[i] = A[i] + B[i];
      }
   }
}</pre>
```



イロト イポト イヨト イヨト 三日

## Example Schedule Transformation

```
C = tvm.compute((n,), lambda i: A[i] + B[i])
s = tvm.create_schedule(C.op)
xo, xi = s[C].split(s[C].axis[0], factor=32)
s[C].recorder(xi, xo)
s[C].bind(xo, tvm.thread_axis("blockIdx.x")
s[C].bind(xi, tvm.thread_axis("threadIdx.x")
```

```
int i = threadIdx.x * 32 + blockIdx.x;
if (i < n) {
    C[i] = A[i] + B[i];
}</pre>
```



・ロト ・雪ト ・ヨト ・ヨト

# Key Challenge: Good Space of Schedule

Should contain any knobs that produces a logically equivalent program that runs well on backend models

Must contain the common manual optimization patterns

Need to actively evolve to incorporate new techniques



イロト イヨト イヨト イヨト

- 22

# Hardware-aware Search Space



Reuse primitives from prior work: Halide, Loopy



#### Challenge to Support Diverse Hardware Backends

CPUs



GPUs





TPU-like specialized Accelerators





臣





## Hardware-aware Search Space

| GPUs |
|------|
|------|





#### **Compute Primitives**

scalar



vector

Memory Subsystem





## Hardware-aware Search Space

| GPUs |  |
|------|--|
| Ĩ    |  |





vector

L2 SM SM TX/L1 TX/L1 RF RF RF RF mixed

Shared memory among compute cores

▲□▶ ▲圖▶ ▲厘▶ ▲厘▶



# Hardware-aware Search Space

|       | Compute Pr              | Memory Subsystem |                                               |                  |
|-------|-------------------------|------------------|-----------------------------------------------|------------------|
| GF 03 |                         |                  |                                               |                  |
|       | scalar                  | vector<br>Sha    | L2<br>SM SI<br>TX/L1 TX/<br>RF RF RF<br>mixed | y among<br>cores |
|       | Use of Shared<br>Memory | Threa<br>Coopera | ad<br>ation                                   |                  |



#### Hardware-aware Search Space

#### TPU-like Specialized Accelerators





#### **Compute Primitives**



tensor



explicitly managed

臣



## Hardware-aware Search Space

#### TPU-like Specialized Accelerators











◆□▶ ◆□▶ ◆三▶ ◆三▶ 三 のへで

#### **Tensorization Challenge**

Compute primitives



◆□▶ ◆□▶ ◆三▶ ◆三▶ 三 のへで

# **Tensorization Challenge**





◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 のへで

#### **Tensorization Challenge**





◆□▶ ◆御▶ ◆臣▶ ◆臣▶ 臣 の�?

#### **Tensorization Challenge**





・ロト ・四ト ・ヨト ・ヨト

#### **Tensorization Challenge**



#### Hardware designer: declare tensor instruction interface with Tensor Expression

w, x = t.placeholder((8, 8)), t.placeholder((8, 8)) declare behavior k = t, reduce axis((0, 8)) y = t.compute((8, 8), lambda i, j: t.sum(w[i, k] \* x[j, k], axis=k)) lowering rule to generate def gemm\_intrin\_lower(inputs, outputs): hardware intrinsics to carry ww ptr = inputs[0].access ptr("r") xx ptr = inputs[1].access ptr("r") out the computation zz ptr = outputs[0].access ptr("w") compute = t.hardware\_intrin("gemm8x8", ww\_ptr, xx\_ptr, zz\_ptr) reset = t.hardware intrin("fill zero", zz ptr) update = t.hardware intrin("fuse gemm8x8 add", ww ptr. xx ptr. zz ptr) return compute, reset, update

gemm8x8 = t.decl\_tensor\_intrin(y.op, gemm\_intrin\_lower)



#### **Tensorization Challenge**



#### Hardware designer: declare tensor instruction interface with Tensor Expression

w, x = t.placeholder((8, 8)), t.placeholder((8, 8)) declare behavior k = t, reduce axis((0, 8)) v = t.compute((8, 8), lambda i, i: t.sum(w[i, k] \* x[j, k], axis=k)) lowering rule to generate def gemm\_intrin\_lower(inputs, outputs): hardware intrinsics to carry ww ptr = inputs[0].access ptr("r") xx ptr = inputs[1].access ptr("r") out the computation zz ptr = outputs[0].access ptr("w") compute = t.hardware\_intrin("gemm8x8", ww\_ptr, xx\_ptr, zz\_ptr) reset = t.hardware intrin("fill zero", zz ptr) update = t.hardware intrin("fuse gemm8x8 add", ww ptr. xx ptr. zz ptr) return compute, reset, update

gemm8x8 = t.decl\_tensor\_intrin(y.op, gemm\_intrin\_lower)

#### Tensorize: transform program to use tensor instructions



◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 のへで


### Hardware-aware Search Space

#### TPU-like Specialized Accelerators









tensor



explicitly managed

臣



### Hardware-aware Search Space

#### TPU-like Specialized Accelerators





**Compute Primitives** 



tensor



イロト イヨト イヨト イヨト

2



《曰》 《聞》 《臣》 《臣》

э.

## Software Support for Latency Hiding





## Software Support for Latency Hiding



◆□▶ ◆□▶ ◆三▶ ◆三▶ ● のへで



## Software Support for Latency Hiding



◆□▶ ◆□▶ ◆三▶ ◆三▶ ◆□▶



<ロト <回ト < 回ト < 回ト















ъ





3









・ロト ・四ト ・ヨト ・ヨト









### Global View of TVM Stack



《曰》 《圖》 《臣》 《臣》 - 22



### High Level Compilation Frontend

module = runtime.create(graph, lib, tvm.gpu(0))
module.set\_input(\*\*params)
module.run(data=data\_array)
output = tvm.nd.empty(out\_shape, ctx=tvm.gpu(0))
module.get\_output(0, output)

import tvm
import nnvm.frontend
import nnvm.compiler

On languages and platforms you choose



▲日 > ▲ 圖 > ▲ 圖 > - ▲ 圖 > -





#### Program Your Phone with Python from Your Laptop

**RPC** Server on Compiler Stack Embedded Device lib = t.build(s, [A, B], 'llvm -target=armv7l-none-linux-gnueabihf', name='mvfunc') remote = t.rpc.connect(host. port) lib.save('myfunc.o') upload module to remote remote.upload('myfunc.o') get remote function f = remote.load\_module('myfunc.o') ctx = remote.cpu(0)copy data to remote a = t.nd.array(np.random.uniform(size=1024), ctx) get remote array handle b = t.nd.array(np.zeros(1024), ctx)run function on remote remote\_timer = f.time\_evaluator('myfunc', ctx, number=10) time\_cost = remote\_timer(a, b) get profile statistics back np.testing.assert equal(b.asnumpy(), expected) copy data back to host for correctness verification

▲ロト ▲団ト ▲ヨト ▲ヨト 三回 - の々で



### Learning-based Program Optimizer





《曰》 《聞》 《臣》 《臣》

臣



(日) (四) (里) (里)

### Learning-based Program Optimizer





### Learning-based Program Optimizer



**Runtime Measurements** 

<ロ> (四) (四) (三) (三) (三)



# Learning-based Program Optimizer Program **Program Optimizer** Code Generator **Runtime Measurements**

High experiment cost, each trial costs ~1second



(日) (四) (里) (里)

### Learning-based Program Optimizer





《曰》 《聞》 《臣》 《臣》

臣

### Learning-based Program Optimizer





<ロ> (四) (四) (三) (三) (三)

### Learning-based Program Optimizer



#### Need reliable cost model per hardware



(日) (四) (里) (里)

### Learning-based Program Optimizer





▲□▶ ▲圖▶ ▲厘▶ ▲厘≯

# Learning-based Program Optimizer Program **Program Optimizer** Code Generator $\mathcal{T}$ Training data



### Learning-based Program Optimizer





<ロ> (四) (四) (三) (三) (三) (三)

### Learning-based Program Optimizer



Adapt to hardware type by learning Make prediction in 1ms level



▲日 > ▲ 圖 > ▲ 圖 > - ▲ 圖 > -

### Effectiveness of ML based Model





▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のへで

### Effectiveness of ML based Model



27



◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 のへで

### Effectiveness of ML based Model





### Effectiveness of ML based Model





### Effectiveness of ML based Model



▲□▶ ▲□▶ ▲目▶ ▲目▶ 目 のなぐ



▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のへで

### Effectiveness of ML based Model



**Relative Speedup** 



・ロト ・四ト ・ヨト ・ヨト





#### End to End Inference Performance (Nvidia Titan X)



▲□▶ ▲圖▶ ▲厘▶ ▲厘▶ 2



## End to End Inference Performance (Nvidia Titan X)










◆ロト ◆母 ▶ ◆臣 ▶ ◆臣 ▶ ◆ ● ● ● ● ●





◆ロト ◆母 ▶ ◆臣 ▶ ◆臣 ▶ ◆ ● ● ● ● ●







#### End to End Performance(ARM Cortex-A53)



34

◆□▶ ◆□▶ ◆臣▶ ◆臣▶ 臣 のへの



#### End to End Performance(ARM Cortex-A53)



34

◆□▶ ◆□▶ ◆臣▶ ◆臣▶ 臣 のな()



▲□▶ ▲圖▶ ▲厘▶ ▲厘▶

.

# End to End Performance(ARM GPU)





## Supporting New Specialized Accelerators



Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer





# Supporting New Specialized Accelerators



Hardware aware Search Space of Optimized Tensor Programs

#### Machine Learning based Program Optimizer





・ロ・ ・回・ ・ヨ・ ・ヨ・

2

#### TVM/VTA: Full Stack Open Source System



ML-based Optimize







《曰》 《聞》 《臣》 《臣》

2

#### TVM/VTA: Full Stack Open Source System

37



VTA MicroArchitecture







イロト 不同ト 不同ト 不同ト

臣

#### TVM/VTA: Full Stack Open Source System



**ML-based Optimizer** 

VTA Hardware/Software Interface (ISA)

#### VTA MicroArchitecture





webservices





▲□▶ ▲圖▶ ▲厘▶ ▲厘▶

臣

#### TVM/VTA: Full Stack Open Source System







amazon webservices

37



イロト イヨト イヨト イヨト

æ

## TVM/VTA: Full Stack Open Source System





## TVM/VTA: Full Stack Open Source System



- JIT compile accelerator micro code
- Support heterogenous devices, 10x better than CPU on the same board.
- Move hardware complexity to software

イロト イヨト イヨト イヨト

æ



## TVM/VTA: Full Stack Open Source System



- JIT compile accelerator micro code
- Support heterogenous devices, 10x better than CPU on the same board.
- Move hardware complexity to software
  compiler, driver,
  hardware design
  full stack open source



# TVM: Learning-based Learning System



# Check it out!

▲□▶ ▲圖▶ ▲厘▶ ▲厘▶

æ