

# Placement in Advanced Technology Nodes

Bei Yu Department of Computer Science & Engineering Chinese University of Hong Kong byu@cse.cuhk.edu.hk

September 30, 2021

### Placement in Design Flow





### Placement in Design Flow





# EDA Toy Example [Jens Vygen,2006]





# EDA Toy Example [Jens Vygen,2006]





# EDA Toy Example [Jens Vygen,2006]









### An Inverter Example





































- Huge problem sizes: tens of millions of cells
- Huge solution space: larger than  $1K \times 1K$  grids in a layout







#states: ~10<sup>123</sup>

#states: ~10360

#states: >10<sup>100,000</sup>

Google AlphaGo Train 40 days using 176 GPUs







230 cells in FPGA (design e 64 in the MCNC benchmark suite)







1. Global placement





1. Global placement

WL: 1.05e+6



2. Legalization





1. Global placement



2. Legalization

WL: 1.02e+6



3. Detailed Placement

# **Global Placement Algorithms**





efficiency









# Quadratic Placement – Optimizing Wirelength Objective









Euclidean-distance model



|            |                      | Pessimistic  |           | Optimistic |            |
|------------|----------------------|--------------|-----------|------------|------------|
| Model      | Min. Steiner<br>Tree | Clique Model |           | HPWL       | Star Model |
| Distance   | Manhattan            | Manhattan    | Euclidean | Manhattan  | Euclidean  |
| Accuracy   | ****                 | *            | *         | ****       | ***        |
| Smoothness | *                    | **           | ****<br>* | **         | ****       |











# Quadratic Placement – Satisfying Constraints









# Quadratic Placement - Overall Optimization Flow





#### 16/50

- Iterative optimization
- Wirelength models: HPWL, clique model, star model
- QP solver is fast!
- Rough legalization







- Mathematical formulation
  - $d_i$  denotes the density of bin i

 $\begin{array}{l} \min_{\boldsymbol{x},\boldsymbol{y}} \quad WL(\boldsymbol{x},\boldsymbol{y}), \\ s.t. \quad d_b\left(\boldsymbol{x},\boldsymbol{y}\right) \leq t_d, \forall b \in Bins \end{array}$ 

- Nonlinear placement objective
  - Lagrangian relaxation





# Nonlinear Placement – Wirelength Smoothing<sup>1</sup>





<sup>1</sup>William C. Naylor, Ross Donelly, and Lu Sha (2001). *Non-linear optimization system and method for wire length and delay optimization for an automatic electric circuit placer*. US Patent 216,632. 19/50

# Nonlinear Placement – Wirelength Smoothing<sup>2</sup>







<sup>2</sup>Meng-Kai Hsu, Valeriy Balabanov, and Yao-Wen Chang (2013). "TSV-aware analytical placement for 3-D IC designs based on a novel weighted-average wirelength model". In: *IEEE TCAD* 32.4, pp. 497–509.

**2.3.1** Numerical Stability and Approximation Error. Although multivariate wirelength models (*LSE*, *WA*) are able to approximate maximum (minimum) as close as possible in theory, numerical instability may happen when  $\gamma$  is too small. On a modern computer, a double-precision floating-point cannot handle  $e^z$  when z > 710 (overflow issue), or z < -746 (precision limit). Thus, an appropriate value of  $\gamma$  needs to be determined carefully for *LSE* and *WA*. On the other hand, there is no exponential function in a bivariate function, so the value of  $\gamma$  can be much smaller, as long as it is representable and appropriate. As a result, the bivariate wirelength model is able to achieve a lower approximation error of *HPWL*, when compared to the multivariate wirelength model [14]. However, a lower approximation error does not always imply better solution quality, since

**2.3.2** Total Runtime. In an optimization process, the total runtime is roughly equal to the number of iterations times the runtime per iteration. The runtime per iteration can be further divided into the intrinsic cost of  $f_Y$  and the cost of calling the function. The bivariate wirelength model has a smaller intrinsic computational cost, which is apparent since the multivariate wirelength model uses more complex functions such as logarithm and exponentiation, whereas the bivariate wirelength model only uses square and square root. However, the calculation process of a bivariate wirelength model requires recursive function calls, which induces considerable computational overheads. The number of iterations to convergence is difficult to analyze in a placement problem since there are overlapping and density constraints. However, it has been observed that the bivariate wirelength model needs more iterations to converge [14]. Combining all these factors, the multivariate wirelength model could be

#### Algorithm: BiG wirelength model Input: $X = [x_1, \ldots, x_n]$ : variables (array) f: smooth maximum function Output: $G = [g_1, \ldots, g_n]$ : derivatives for each variables (array) max': estimated maximum value (scalar) 01. $max_1 \leftarrow -\infty$ //first largest value 02. $max_2 \leftarrow -\infty$ //second largest value 03. for each x<sub>i</sub>: if $x_i > max_1$ : 04. 05. $max_2 \leftarrow max_1$ $max_1 \leftarrow x_i$ 06. else if $x_i > max_2$ : 07. 08. $max_2 \leftarrow x_i$ 09. $sum_a \leftarrow 0$ 10. for each $x_i$ : if $x_i == max_1$ : 11. $g_i \leftarrow \frac{\partial}{\partial x_i} f(x_i, max_2)$ 12. 13. else: $g_i \leftarrow \frac{\partial}{\partial x_i} f(x_i, max_1)$ 14. $sum_a \leftarrow sum_a + q_i$ 15. 16. for each a:: $g_i \leftarrow \frac{g_i}{sum_a}$ 17. 18. $max' \leftarrow max_1$ 19. return G. max'



3

21/50

<sup>3</sup>Fan-Keng Sun and Yao-Wen Chang (2019). "BiG: A bivariate gradient-based wirelength model for analytical circuit placement". In: *Proc. DAC*.



### • Potential function for standard cells

- $P_x(b,v)$  and  $P_y(b,v)$  are the overlap functions between bin b and cell v
- $D_b(\mathbf{x}, \mathbf{y}) = \sum_{v \in V} P_x(b, v) P_y(b, v)$



<sup>4</sup>Andrew B Kahng and Qinke Wang (2005). "Implementation and extensibility of an analytic placer". In: *IEEE TCAD* 24.5, pp. 734–747.

<sup>5</sup>Sheng Chou, Meng-Kai Hsu, and Yao-Wen Chang (2012). "Structure-aware placement for datapath-intensive circuit designs". In: *Proc. DAC*, pp. 762–767.



- Potential function for standard cells
  - Smoothed potential function
  - $\widehat{D_b}(\boldsymbol{x}, \boldsymbol{y}) = \sum_{v \in V} \widehat{P_x}(b, v) \widehat{P_y}(b, v)$
- $\min_{x,y} WL(x, y) + \lambda D(x, y)$

$$\lambda \sum_{b} \left( \widehat{D_{b}}(\boldsymbol{x}, \boldsymbol{y}) - t_{d} \right)^{2}$$

- Challenges
  - Gradient only has local view
  - Need multi-level bins



Multi-level bins



### • Potential function for fixed macros

- Bell-shape smoothing works well for standard cells
- For fixed macros, P'(x, y) = G(x, y) \* P(x, y)



# **Detailed Placement Algorithms**





WL: 1.02e+6

3. Detailed Placement

• Multi-bin Knapsack?

• TSP?

# Single-Row Problem





$$\begin{split} \min & \sum_{n \in N} w(n) \left( \overline{x}(n) - \underline{x}(n) \right) & \text{for all } i, j ) \\ \text{s.t.} & \underline{x}(n) \leq x(c(p)) + xoffs(p) \leq \overline{x}(n) & \text{for all } p \in n ) \\ x_{\min}^{i} \leq x(c_{1}^{i}) & (\text{for all } i, j) \\ x(c_{m}^{i}) \leq x_{\max}^{i} & (\text{for all } i, j) \\ x(c_{m}^{i}) \leq x_{\max}^{i} & (\text{for all } i) & f_{ij} > 0 \end{split}$$

This LP is the dual of a minimum-cost flow problem. To see this, we introduce an auxiliary variable  $v_0$  with the value zero. Then every constraint has the form  $v_i - v_j \ge a_{ij}$ , where  $a_{ij}$  is a constant and  $v_i$  and  $v_i$  are variables. Each of these constraints corresponds id to a dual variable in

s.t. 
$$\sum_{i} f_{ij} - \sum_{k} f_{jk} = \begin{cases} -w(n) & \text{if } v_j \text{ is } \underline{x}(n) \\ w(n) & \text{if } v_j \text{ is } \overline{x}(n) \\ 0 & \text{otherwise} \end{cases}$$

$$(\text{for all } j)$$

$$f_{ij} \ge 0 \qquad (\text{for all } i, j)$$

- Dual of Min-Cost Flow
- Unimodular matrix •

<sup>&</sup>lt;sup>6</sup>Jens Vygen (1998). "Algorithms for detailed placement of standard cells". In: Proc. DATE, pp. 321–324.

# Single-Row Placement<sup>7</sup>,<sup>8</sup>





Figure 4: Shortest path computation for legalizing a row placement.

<sup>&</sup>lt;sup>7</sup>Andrew B Kahng, Paul Tucker, and Alex Zelikovsky (1999). "Optimization of linear placements for wirelength minimization with free sites". In: *Proc. ASPDAC*, pp. 241–244.

<sup>&</sup>lt;sup>8</sup>Andrew B. Kahng, Igor L. Markov, and Sherief Reda (2004). "On legalization of row-based placements". In: *Proc. GLSVLSI*, pp. 214–219.



29/50

#### Single-Segment Clustering Algorithm

```
num old cluster n
Initialize old cluster[i] as standard cell Ci, i =1, 2, ..., num old cluster.
do
 Find the bounds list and the Optimal Region Center Xic for Ki,
      and set X(old_cluster[i]) = Xic
 newcount 1 // the count for the number of new clusters
 new_cluster[1] old_cluster[1] // initialize the first new cluster
 i 1
 while(j < num_old_cluster )
   do
      if new cluster[newcount] and old cluster[i+1] has overlap
       Cluster new cluster[newcount] and old cluster[i+1] to form the
              new new cluster[newcount]
        Merge the bounds list for new cluster(newcount) and old cluster(i+1)
              to get the new bounds list for new cluster [newcount]
        Find the Optimal Region Center Xc for new cluster[newcount]
              based on the new bounds list
       X(new_cluster[newcount]) Xc
      else
        newcount
                   newcount + 1 //begin a new cluster new_cluster[newcount+1]
       i i+1
 num_old_cluster newcount
 old_cluster[i] new_cluster[i] (i =1, ..., newcount)
until no overlap among old_cluster[i], (i=1, ..., num_old_cluster)
Assign the Ci (i=1, 2, ..., n) to the positions according to the positions of the
old cluster[j] (j=1, 2, ..., num old cluster) they belong to
```

<sup>9</sup>Min Pan, Natarajan Viswanathan, and Chris Chu (2005). "An efficient and effective detailed placement algorithm". In: *Proc. ICCAD*, pp. 48–55.



- Sung Woo Hur and John Lillis (2000). "Mongrel: hybrid techniques for standard cell placement". In: *Proc. ICCAD*, pp. 165–170
- Yuelin Du and Martin D. F. Wong (2014). "Optimization of standard cell based detailed placement for 16 nm FinFET process". In: *Proc. DATE*, 357:1–357:6

# DREAMPlace





# **Challenges of Nonlinear Placement**

### Low efficiency

■ >3h for 10M-cell design

### Limited acceleration

■ Limited speedup, e.g. mPL, due to clustering **Huge development effort** 

■ >1year for ePlace/RePlAce







### **DREAMPlace Strategies**

- Cast the non-linear placement problem into a neural network training problem.
- Leverage deep learning hardware (GPU) and software toolkit (e.g. Pytorch)
- Enable ultra-high parallelism and acceleration while getting the state-of-the-art results.

<sup>&</sup>lt;sup>10</sup>Yibo Lin et al. (2019). "DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement". In: *Proc. DAC*.

Analogy between NN Training and Placement [DAC'19]<sup>11</sup>





<sup>&</sup>lt;sup>11</sup>Yibo Lin et al. (2019). "DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VI SI Placement". In: *Proc. DAC* 

# Analogy between NN Training and Placement [DAC'19]<sup>11</sup>



### Casting the placement problem into neural network training

Net instances



$$\min_{\mathbf{w}} \sum_{i}^{n} WL(e_{i}; \mathbf{w}) + \lambda D(\mathbf{w})$$
Forward Propagation
Compute obj
Net
Instance
(e\_{i}, 0)
Backward Propagation
Compute gradient  $\frac{\partial obj}{\partial \mathbf{w}}$ 

### Solve a placement

<sup>&</sup>lt;sup>11</sup>Yibo Lin et al. (2019). "DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement". In: *Proc. DAC*.



### Leverage highly optimized deep learning toolkit PyTorch



<sup>&</sup>lt;sup>12</sup>C. Cheng et al. (2019). "RePlAce: Advancing Solution Quality and Routability Validation in Global Placement". In: *IEEE TCAD* 38.9, pp. 1717–1730.





#### DREAMPlace

- CPU: Intel E5-2698 v4 @2.20GHz
- GPU: 1 NVIDIA Tesla V100
- · Single CPU thread was used

#### RePIAce [TCAD'18, Cheng+]

- CPU: 24-core 3.0 GHz Intel Xeon
- · 64GB memory allocated

37/50





# Ex: Routability Estimation $\times$ DREAMPlace [DATE'21]<sup>13</sup>





<sup>&</sup>lt;sup>13</sup>Siting Liu et al. (2021). "Global Placement with Deep Learning-Enabled Explicit Routability Optimization". In: *Proc. DATE*.



### Huawei Ascend 910







Adder is one of the most important component!





(b) Physical synthesis perspective

# EDA Challenges: How to Design an AI Chip Component?





(c) Current EDA tool output



(d) Manual design



- Classical idea: bit-sliced DSP datapaths [Cai, DAC'90]
  - Decide ordering of linearly placed blocks
  - Solved by A\* in the search space
- Standard cell placement [Tsay, TCAD'95]
  - Strongly connected subcircuits (cones) are extracted
  - BFS + heuristics
  - Placed as macro cells



- Placer has little control of exact locations if datapath is generated separately
- Abstract physical model [Ye, ICCAD'00]
  - Bit-sliced abstraction
  - Compiled from HDL
  - Blocks are placed abutted
- Two-step heuristic for linear placement
  - quadratic assignment
  - sliding window optimization



APM of a booth multiplier [Ye, ICCAD'00]

# **Regularity Extraction**

- Consider cells with the same bit-slice are lined up horizontally [Nijssen IWLS'96]
  - geometric regularity: circuit is fitted onto a matrix of rectangular buckets
  - interconnect regularity: most nets are within one slice/one stage
- Typical methods for regularity extraction
  - Search-wave expansion [Nijssen IWLS'96]
  - Signature-based [Arikati, ICCAD'97]
  - Template covering [Chowdhary, TCAD'99]
  - Network flow [Xiang, ISPD'13]
  - Bipartite graph vertex covering [Huang, DAC'17]





# Regularity Extraction: Network Flow Approach



- Datapath main frame (DMF) [Xiang, ISPD'13]
  - a set of *n* disjoint paths from input to output
  - maximize the number of datapath gates on these paths
- Can be optimally identified by the min-cost max-flow algorithm



DMF identification can be transformed to a network flow problem [Xiang, ISPD'13]

# Datapath Driven Systolic Array Placement

- Systolic arrays are a popular choice to support neural network computations
- Current FPGA CAD tools cannot synthesize them in high quality
- One solution: restrict fixed locations for PEs [Zhang, ISCAS'19]
  - Sufficient DSPs, close to used I/O banks



PE placement with floorplan constraints [Zhang, ISCAS'19]



- Detailed placement [Serdar, DATE'01]
- SOC placement [Tong, JOS'02] [Jing, ICCCAS'02]
- Parallel multiplier design [Bae, ISPD'15]
- Genaral ASIC design [Ye, ISCAS'02] [Chou, DAC'12] [Wang, IETCDS'17]

• ...

### These slides contain/adapt materials developed by

- Course "Optimization and Machine Learning in VLSI Design Automation", Peking Univ, 2021.
- Yibo Lin et al. (2019). "DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement". In: *Proc. DAC*
- Zhuolun He et al. (2021). "Physical synthesis for advanced neural network processors". In: *Proc. ASPDAC*, pp. 833–840

Draw Placement Together