## IMpress: Large Integer Multiplication Expression Rewriting for FPGA HLS

Ecenur Üstün<sup>1</sup>, Ismail San<sup>2,1</sup>, Jiaqi Yin<sup>3</sup>, Cunxi Yu<sup>3</sup>, Zhiru Zhang<sup>1</sup>

<sup>1</sup> Cornell University <sup>2</sup> Eskisehir Technical University

<sup>3</sup> University of Utah





### Large Integer Multiplication in Cryptography



Kaplan-Meier survival analysis and genome-wide association study [1]

Privacy-preserving machine learning [2][3]

[1] D. Froelicher *et al.* Truly Privacy-Preserving Federated Analytics for Precision Medicine with Multiparty Homomorphic Encryption. Nature Communications, 2021.

[2] N. Dowlin *et al.* CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy. ICML, 2016.
[3] Microsoft SEAL <u>github.com/Microsoft/SEAL</u>

### **FPGA-Based Integer Multiplication**

Heterogeneity



#### **Default HLS Results**

| BW    | DSP<br>(DSP%)         | LUT<br>(LUT%)         | CARRY<br>(CARRY%)    | FF<br>(FF%)             | Latency<br>(cycles) | Frequency<br>(MHz) |
|-------|-----------------------|-----------------------|----------------------|-------------------------|---------------------|--------------------|
| 256   | <b>225</b><br>(1.83%) | <b>225</b><br>(0.01%) | <b>6</b><br>(<0.01%) | <b>692</b><br>(0.02%)   | 4                   | 18                 |
| 512   | <b>900</b><br>(7.32%) | <b>434</b><br>(0.03%) | <b>6</b><br>(<0.01%) | <b>1,329</b><br>(0.04%) | 4                   | 5                  |
| 1,024 | NA                    | NA                    | NA                   | NA                      | NA                  | NA                 |
| 2,048 | NA                    | NA                    | NA                   | NA                      | NA                  | NA                 |

Target: AMD Xilinx Virtex Ultrascale+, 300 MHz

### **Existing Approaches to Optimize Multiplication**



| n     | Implemen<br>tation | DSP   | LUT    |
|-------|--------------------|-------|--------|
|       | HLS Def            | 225   | 225    |
| 256   | Schl               | 200   | 3,483  |
|       | Karat              | 164   | 4,339  |
|       | HLS Def            | 900   | 434    |
| 512   | Schl               | 900   | 3,368  |
|       | Karat              | 675   | 5,182  |
|       | HLS Def            | NA    | NA     |
| 1,024 | Schl               | 3,600 | 7,759  |
|       | Karat              | 2,700 | 12,012 |



4

#### **Equivalence Graph**

Combine different ways of implementing an expression in one graph



### Equivalence Graph (E-Graph)

**Common sub-expressions** 



C. G. Nelson. Techniques for Program Verification. Stanford University, 1980. R. Nieuwenhuis *et al.* Proof-Producing Congruence Closure. RTA, 2005.

### **E-Graphs and Equality Saturation**

#### Given

- An input program: a + b × 2
- A rewrite rule:  $t \times 2 \rightarrow t \ll 1$



#### Integer multiplication with equality saturation:

| BW    | E-Nodes | E-Classes | Expressions |
|-------|---------|-----------|-------------|
| 64    | 252     | 194       | 1.08e6      |
| 128   | 1,171   | 878       | 1.37e24     |
| 256   | 5,546   | 4,078     | 3.55e96     |
| 512   | 26,769  | 19,426    | 1.58e386    |
| 1,024 | 130,936 | 94,218    | 6.31e1544   |
| 2,048 | 645,935 | 462,342   | 1.59e6179   |



<u>E-class</u> is a set of <u>e-nodes</u>. Represents equivalent expressions.  $c ::= \{n_1, n_2, ...\}$ 

<u>E-node</u> is a function symbol paired with a list of children e-classes. Represents expression(s).  $n ::= f(c_1, c_2, ...)$ 

### IMpress: Large Integer Multiplication Expression Rewriting

- Optimizes large integer multiplication
  - Rewrite at arithmetic level & DSP block level
  - Rich design space through equality saturation
  - Scalable due to efficient data structure
- Extracts optimal expression(s) for multiple FPGA resource objectives
- Translates final expression to HLS C++
- Fits more instances of cryptographic applications on FPGA



### **Equality Saturation with IMpress**



schoolbook(mul<sub>n</sub>)

karatsuba(mul<sub>n</sub>)

### **Equality Saturation with IMpress**

- $\langle mul_{32}, schoolbook(mul_{32}) \rangle$
- $\langle mul_{32}, karatsuba(mul_{32}) \rangle$
- $\blacktriangleright \quad \langle mul_{17}, mul_{16} \rangle$
- Inter- and intra-rule subexpression sharing



#### **Extraction and Code Generation**

- DSP minimization egg (heuristic) & IMpress (exact)
- DSP-constrained LUT minimization IMpress (exact)
- DSP and LUT co-minimization IMpress (exact)



#### Scalability

| D\//  | E-Nodes | E-Classes | Expressions | Rewrite<br>Rules | Saturation<br>(s) | Extraction (s) |         |
|-------|---------|-----------|-------------|------------------|-------------------|----------------|---------|
| D V V |         |           |             |                  |                   | egg            | IMpress |
| 64    | 252     | 194       | 1.08e6      | 10               | 0.01              | 0.01           | 0.01    |
| 128   | 1,171   | 878       | 1.37e24     | 13               | 0.06              | 0.02           | 0.03    |
| 256   | 5,546   | 4,078     | 3.55e96     | 16               | 0.29              | 0.11           | 0.09    |
| 512   | 26,769  | 19,426    | 1.58e386    | 19               | 1.51              | 0.63           | 0.50    |
| 1,024 | 130,936 | 94,218    | 6.31e1544   | 22               | 8.60              | 2.72           | 4.07    |
| 2,048 | 645,935 | 462,342   | 1.59e6179   | 25               | 49.19             | 18.26          | 45.33   |

| Post-Place<br>Cost | 64    | 128   | 256    | 512    | 1,024   | 2,048     |
|--------------------|-------|-------|--------|--------|---------|-----------|
| egg                | 1,823 | 6,698 | 25,359 | 85,117 | 278,265 | 1,101,986 |
| IMpress            | 1,823 | 6,466 | 22,399 | 78,434 | 264,610 | 883,931   |

#### Extraction quality

#### Pareto-optimality analysis

#### Flexibility in controlling DSP-LUT



#### Placement-constrained million-bit multiplication for FHE

|               |      | BW          | DSP<br>(DSP%)     | LUT<br>(LUT%)       | Latency<br>(cycles) |
|---------------|------|-------------|-------------------|---------------------|---------------------|
|               | SLR3 | HLS Default | NA                | NA                  | NA                  |
| LUTs: 432,000 |      | K(32, 64)   | 3,950<br>(32.15%) | 339,999<br>(19.68%) | 13,123,773          |
| DSPs: 3,072   | SLR2 | K(32, 32)   | 3,227<br>(26.26%) | 408,301<br>(23.63%) | 12,927,158          |
|               |      | K(16, 32)   | 2,984<br>(24.28%) | 439,775<br>(25.45%) | 13,222,078          |
|               | SLR1 | K(16, 16)   | 2,431<br>(19.78%) | 511,916<br>(29.62%) | 13,811,922          |
|               | SLR0 | IMpress     | 3,071<br>(24.99%) | 381,697<br>(22.09%) | 12,746,927          |

xcu250-figd2104-2L-e (AMD Xilinx U250)

Frequency

(MHz)

NA

212

239

239

229

225

#### RSA

| BW        | DSP<br>(DSP%)     | LUT<br>(LUT%)       | Latency<br>(cycles) | Frequency<br>(MHz) | # Instances |
|-----------|-------------------|---------------------|---------------------|--------------------|-------------|
| Vitis-1   | 0<br>(0%)         | 17,055<br>(0.99%)   | 39,565              | 238                | 101         |
| Vitis-2   | 4,530<br>(36.87%) | 41,882<br>(2.42%)   | 3,638               | 18                 | 2           |
| 2-level K | 3,072<br>(25.00%) | 156,339<br>(9.05%)  | 5,192               | 223                | 4           |
| 3-level K | 2,352<br>(19.14%) | 223,964<br>(12.96%) | 4,772               | 241                | 5           |
| 4-level K | 1,782<br>(14.50%) | 291,507<br>(16.87%) | 5,150               | 236                | 5           |
| IMpress   | 2,030<br>(16.52%) | 247,706<br>(14.33%) | 5,024               | 231                | 6           |

AMD Xilinx Vitis Security Library, available at <u>github.com/Xilinx/Vitis\_Libraries/tree/master/security</u>.

## IMpress: Large Integer Multiplication Expression Rewriting for FPGA HLS

#### Ecenur Üstün, Ismail San, Jiaqi Yin, Cunxi Yu, Zhiru Zhang

- Optimize large integer multiplication at the HLS level
- Equality saturation based rewriting Bich design space
  - Rich design space
- Extraction with multiple objectives
  - Flexibility for application requirements
  - Increase number of design instances

# **Thank you!**

