# Efficient Methods and Hardware for Deep Learning

## Song Han

#### **Stanford University**

May 25, 2017



# Intro





- Song Han: PhD Candidate, Stanford
- Bill Dally: Chief Scientist, NVIDIA; Professor, Stanford

# **Deep Learning is Changing Our Lives**

- Self-Driving Car
- AlphaGo
- Machine Translation
- Smart Robots

#### **Models are Getting Larger**



Dally, NIPS'2016 workshop on Efficient Methods for Deep Neural Networks

### The First Challenge: Model Size

Hard to distribute large models through over-the-air updates.

Example download prompt on a phone: "This item is over 100MB. Microsoft Excel will not download until you connect to Wi-Fi."

### The Second Challenge: Speed

| Network   | Error rate | Training time |
|-----------|------------|---------------|
| ResNet18  | 10.76%     | 2.5 days      |
| ResNet50  | 7.02%      | 5 days        |
| ResNet101 | 6.21%      | 1 week        |
| ResNet152 | 6.16%      | 1.5 weeks     |

Such long training times limit ML researchers' productivity.

Training time benchmarked with fb.resnet.torch using four M40 GPUs

### The Third Challenge: Energy Efficiency

AlphaGo: 1920 CPUs and 280 GPUs, \$3000 electric bill per game

- On mobile: drains battery
- In the data center: increases TCO (total cost of ownership)

#### Where is the Energy Consumed?

Larger model => more memory references => more energy


# Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

#### **Application as a Black Box**



### **Open the Box before Hardware Design**



Breaks the boundary between algorithm and hardware









### Algorithm





#### Hardware

#### Agenda






#### Hardware 101: the Family



\* including GPGPU

#### **Hardware 101: Number Representation**



Dally, High Performance Hardware for Machine Learning, NIPS'2015

### Hardware 101: Number Representation



Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are from synthesis results using Design Compiler under the TSMC 45nm tech node. FP units used the DesignWare Library.




### Part 1: Algorithms for Efficient Inference

1. Pruning
2. Weight Sharing
3. Quantization
4. Low Rank Approximation
5. Binary / Ternary Net
6. Winograd Transformation


#### **Pruning Neural Networks**



[Lecun et al. NIPS'89] [Han et al. NIPS'15]

6M connections after pruning: 10x fewer connections.

#### **Retrain to Recover Accuracy**

Iteratively retrain to recover accuracy. [Han et al. NIPS'15]
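The prune-then-retrain loop can be sketched as below. This is a minimal illustration, assuming magnitude-based pruning on a flat list of weights. The helper names (`prune_by_magnitude`, `masked_update`), the toy weights, the placeholder gradients, and the learning rate are all hypothetical and are not taken from the cited papers.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights; return (pruned, mask)."""
    k = int(len(weights) * sparsity)  # number of connections to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:k]:
        mask[i] = 0
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask


def masked_update(weights, grads, mask, lr=0.1):
    """One retraining step; pruned connections are held at zero by the mask."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


# Toy example (values are illustrative only).
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9, 0.04, -0.03]
pruned, mask = prune_by_magnitude(weights, sparsity=0.5)
grads = [0.1] * len(weights)  # placeholder gradients
retrained = masked_update(pruned, grads, mask)
```

Repeating the prune and retrain steps with a gradually higher `sparsity` gives the iterative variant named on the slide.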




### **Pruning RNN and LSTM**



- **Original**: a basketball player in a white uniform is playing with a ball
- **Pruned 90%**: a basketball player in a white uniform is playing with a basketball
- **Original**: a brown dog is running through a grassy field
- **Pruned 90%**: a brown dog is running through a grassy area
- **Original**: a man is riding a surfboard on a wave
- **Pruned 90%**: a man in a wetsuit is riding a wave on a beach
- **Original**: a soccer player in red is running in the field
- **Pruned 95%**: a man in a red shirt and black and white black shirt is running through a field

#### Pruning Changes Weight Distribution

[Han et al. NIPS'15]

Conv5 layer of AlexNet. Representative of other network layers as well.


[Han et al. ICLR'16]

#### **Trained Quantization**



32 bit → 4 bit: 8x less memory footprint

Weights are clustered, and each weight stores only the index of its centroid:

| weights (32 bit float)   | cluster index (2 bit uint) | centroids |
|--------------------------|----------------------------|-----------|
| 2.09, -0.98, 1.48, 0.09  | 3, 0, 2, 1                 | 3: 2.00   |
| 0.05, -0.14, -1.08, 2.12 | 1, 1, 0, 3                 | 2: 1.50   |
| -0.91, 1.92, 0, -1.03    | 0, 3, 1, 0                 | 1: 0.00   |
| 1.87, 0, 1.53, 1.49      | 3, 1, 2, 2                 | 0: -1.00  |

Gradients are grouped by cluster index:

| gradient                  | group by cluster index                   |
|---------------------------|------------------------------------------|
| -0.03, -0.01, 0.03, 0.02  | cluster 3: -0.03, 0.12, 0.02, -0.07      |
| -0.01, 0.01, -0.02, 0.12  | cluster 2: 0.03, 0.01, -0.02             |
| -0.01, 0.02, 0.04, 0.01   | cluster 1: 0.02, -0.01, 0.01, 0.04, -0.02 |
| -0.07, -0.02, 0.01, -0.02 | cluster 0: -0.01, -0.02, -0.01, 0.01     |
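The worked example above can be replayed in a few lines. This is a sketch under stated assumptions: the weights, centroids, and gradients are the slide's numbers, but summing each gradient group and subtracting it with a learning rate of 1.0 is an assumption made here for illustration, and the function names are hypothetical.

```python
# Slide values, flattened row by row.
weights = [2.09, -0.98, 1.48, 0.09,
           0.05, -0.14, -1.08, 2.12,
           -0.91, 1.92, 0.0, -1.03,
           1.87, 0.0, 1.53, 1.49]
gradient = [-0.03, -0.01, 0.03, 0.02,
            -0.01, 0.01, -0.02, 0.12,
            -0.01, 0.02, 0.04, 0.01,
            -0.07, -0.02, 0.01, -0.02]
centroids = [-1.00, 0.00, 1.50, 2.00]  # position in the list = cluster index


def assign_clusters(weights, centroids):
    """Map every weight to the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: abs(w - centroids[k]))
            for w in weights]


def update_centroids(centroids, indices, grads, lr):
    """Sum the gradients of each cluster, then take one gradient step."""
    sums = [0.0] * len(centroids)
    for idx, g in zip(indices, grads):
        sums[idx] += g
    return [c - lr * s for c, s in zip(centroids, sums)]


indices = assign_clusters(weights, centroids)
fine_tuned = update_centroids(centroids, indices, gradient, lr=1.0)  # lr is assumed
quantized = [centroids[i] for i in indices]  # what the layer computes with
```

Only the four centroids are trained; the 2-bit indices stay fixed, so the storage cost of the layer does not change during fine-tuning.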


### **Before Trained Quantization: Continuous Weight**



### After Trained Quantization: Discrete Weight



### After Trained Quantization: Discrete Weight after Training



[Han et al. ICLR'16]

### How Many Bits do We Need?


### **Pruning + Trained Quantization Work Together**

AlexNet on ImageNet

## **Huffman Coding**



- Infrequent weights: use more bits to represent
- Frequent weights: use fewer bits to represent
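A small sketch of the idea, assuming the symbols being coded are the cluster indices from trained quantization. The index stream below is hypothetical, and `huffman_code_lengths` is an illustrative helper rather than code from the cited work.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes all of their symbols one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical stream of 2-bit cluster indices with a skewed distribution.
indices = [1] * 8 + [0] * 4 + [3] * 2 + [2] * 2
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[s] for s in indices)
fixed_bits = 2 * len(indices)
```

For this stream the most frequent index gets a 1-bit code and the two rare indices get 3-bit codes, so the total drops from 32 bits to 28 bits.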

# **Summary of Deep Compression**




# **Results: Compression Ratio**

| Network   | Original Size | Compressed Size | Compression Ratio | Original Accuracy | Compressed Accuracy |
|-----------|---------------|-----------------|-------------------|-------------------|---------------------|
| LeNet-300 | 1070KB        | 27KB            | **40x**           | 98.36%            | 98.42%              |
| LeNet-5   | 1720KB        | 44KB            | 39x               | 99.20%            | 99.26%              |
| AlexNet   | 240MB         | 6.9MB           | 35x               | 80.27%            | 80.30%              |
| VGGNet    | 550MB         | 11.3MB          | **49x**           | 88.68%            | 89.09%              |
| GoogleNet | 28MB          | 2.8MB           | 10x               | 88.90%            | 88.92%              |
| ResNet-18 | 44.6MB        | 4.0MB           | 11x               | 89.24%            | 89.28%              |

Can we make compact models to begin with?


### SqueezeNet



Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016

# **Compressing SqueezeNet**

| Network    | Approach         | Size   | Ratio | Top-1 Accuracy | Top-5 Accuracy |
|------------|------------------|--------|-------|----------------|----------------|
| AlexNet    | -                | 240MB  | 1x    | 57.2%          | 80.3%          |
| AlexNet    | SVD              | 48MB   | 5x    | 56.0%          | 79.4%          |
| AlexNet    | Deep Compression | 6.9MB  | 35x   | 57.2%          | 80.3%          |
| SqueezeNet | -                | 4.8MB  | 50x   | 57.5%          | 80.3%          |
| SqueezeNet | Deep Compression | 0.47MB | 510x  | 57.5%          | 80.3%          |

Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016

# **Results: Speedup**




# **Results: Energy Efficiency**




# **Deep Compression Applied to Industry**





# **Quantizing the Weight and Activation**



- Train with float
- Quantizing the weight and activation:
  - Gather the statistics for weight and activation
  - Choose proper radix point position
- Fine-tune in float format
- Convert to fixed-point format

Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA'16
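The "gather statistics, choose the radix point, convert to fixed point" steps can be sketched as follows. This is a simplified illustration: it picks the radix point from the largest magnitude only, which is an assumption made here and not necessarily the selection rule of the cited paper. The function names and sample values are hypothetical.

```python
import math


def choose_frac_bits(values, total_bits=8):
    """Pick the radix point so the largest magnitude fits in a signed word."""
    max_abs = max(abs(v) for v in values)
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1) if max_abs > 0 else 0
    return total_bits - 1 - int_bits  # one bit is reserved for the sign


def quantize(values, frac_bits, total_bits=8):
    """Convert floats to signed fixed-point integers with saturation."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return [min(hi, max(lo, round(v * scale))) for v in values]


def dequantize(codes, frac_bits):
    """Map fixed-point integers back to floats for comparison."""
    return [c / 2 ** frac_bits for c in codes]


layer_weights = [0.75, -1.5, 0.1, 2.3]       # illustrative values
frac_bits = choose_frac_bits(layer_weights)  # statistics: max magnitude is 2.3
codes = quantize(layer_weights, frac_bits)
restored = dequantize(codes, frac_bits)
```

With 8-bit words this example ends up with 5 fractional bits, so every value is stored as a multiple of 1/32.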

## **Quantization Result**



Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA'16


# Low Rank Approximation for Conv

- Layer responses lie in a low-rank subspace
- Decompose a convolutional layer with $d$ filters of size $k \times k \times c$ into:
  - A layer with $d'$ filters of size $k \times k \times c$
  - A layer with $d$ filters of size $1 \times 1 \times d'$



Zhang et al Efficient and Accurate Approximations of Nonlinear Convolutional Networks CVPR'15
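To see why the decomposition saves work, the multiply-accumulate counts of the two forms can be compared. The layer dimensions below are hypothetical and chosen only for illustration; the function names are not from the cited paper.

```python
def conv_macs(num_filters, k, in_channels, out_h, out_w):
    """Multiply-accumulates of a k x k convolution over an out_h x out_w output."""
    return num_filters * k * k * in_channels * out_h * out_w


def decomposed_macs(d, d_prime, k, c, out_h, out_w):
    """Cost of d' filters (k x k x c) followed by d filters (1 x 1 x d')."""
    first = conv_macs(d_prime, k, c, out_h, out_w)
    second = conv_macs(d, 1, d_prime, out_h, out_w)
    return first + second


# Hypothetical layer: d = 256 filters, c = 256 channels, k = 3, 14 x 14 output.
original = conv_macs(256, 3, 256, 14, 14)
low_rank = decomposed_macs(256, 64, 3, 256, 14, 14)  # d' = 64
speedup = original / low_rank
```

For these numbers the theoretical reduction is 3.6x; the accuracy cost of a given rank is what the table on the next slide reports.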

## Low Rank Approximation for Conv

| speedup | rank sel. | Conv1 | Conv2 | Conv3 | Conv4 | Conv5 | Conv6 | Conv7 | err. ↑ % |
|---------|-----------|-------|-------|-------|-------|-------|-------|-------|----------|
| 2×      | no        | 32    | 110   | 199   | 219   | 219   | 219   | 219   | 1.18     |
| 2×      | yes       | 32    | 83    | 182   | 211   | 239   | 237   | 253   | 0.93     |
| 2.4×    | no        | 32    | 96    | 174   | 191   | 191   | 191   | 191   | 1.77     |
| 2.4×    | yes       | 32    | 74    | 162   | 187   | 207   | 205   | 219   | 1.35     |
| 3×      | no        | 32    | 77    | 139   | 153   | 153   | 153   | 153   | 2.56     |
| 3×      | yes       | 32    | 62    | 138   | 149   | 166   | 162   | 167   | 2.34     |
| 4×      | no        | 32    | 57    | 104   | 115   | 115   | 115   | 115   | 4.32     |
| 4×      | yes       | 32    | 50    | 112   | 114   | 122   | 117   | 119   | 4.20     |
| 5×      | no        | 32    | 46    | 83    | 92    | 92    | 92    | 92    | 6.53     |
| 5×      | yes       | 32    | 41    | 94    | 93    | 98    | 92    | 90    | 6.47     |

Zhang et al Efficient and Accurate Approximations of Nonlinear Convolutional Networks CVPR'15

### **Low Rank Approximation for FC**

Build a mapping from the row / column indices of matrix $W = [W(x, y)]$ to vectors $i$ and $j$: $x \leftrightarrow i = (i_1, \ldots, i_d)$ and $y \leftrightarrow j = (j_1, \ldots, j_d)$.

TT-format for matrix $W$:

$$W(i_1, \ldots, i_d; j_1, \ldots, j_d) = W(x(i), y(j)) = \underbrace{G_1[i_1, j_1]}_{1 \times r} \underbrace{G_2[i_2, j_2]}_{r \times r} \ldots \underbrace{G_d[i_d, j_d]}_{r \times 1}$$

| Type                      | 1 im. time (ms) | 100 im. time (ms) |
|---------------------------|-----------------|-------------------|
| CPU fully-connected layer | 16.1            | 97.2              |
| CPU TT-layer              | 1.2             | 94.7              |
| GPU fully-connected layer | 2.7             | 33                |
| GPU TT-layer              | 1.9             | 12.9              |

Novikov et al Tensorizing Neural Networks, NIPS'15
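The TT-format formula can be made concrete with a tiny example. The sketch below assumes $d = 2$ and rank $r = 2$, with small hand-picked cores; the core values and the helper names are hypothetical. Each element of $W$ is recovered as a product of one small matrix per core.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def tt_element(cores, i, j):
    """W(i; j) as the product G_1[i_1, j_1] ... G_d[i_d, j_d]."""
    result = cores[0][i[0]][j[0]]  # a 1 x r matrix
    for core, ik, jk in zip(cores[1:], i[1:], j[1:]):
        result = matmul(result, core[ik][jk])  # r x r, or r x 1 for the last core
    return result[0][0]


# Hand-picked factors (illustrative only).
A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]
C = [[1, 0], [0, 1]]
D = [[2, 2], [2, 2]]

# G1[i1][j1] is 1 x 2 and G2[i2][j2] is 2 x 1, so the TT-rank is 2.
G1 = [[[[A[p][q], C[p][q]]] for q in range(2)] for p in range(2)]
G2 = [[[[B[p][q]], [D[p][q]]] for q in range(2)] for p in range(2)]

value = tt_element([G1, G2], i=(1, 1), j=(1, 0))
```

Here the 4 x 4 matrix $W$ is never stored; only the two cores are kept, and `value` equals `A[1][1] * B[1][0] + C[1][1] * D[1][0]`.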


# **Binary / Ternary Net: Motivation**



# **Trained Ternary Quantization**



Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR'17


# **Weight Evolution during Training**



Figure 2: Ternary weights value (above) and distribution (below) with iterations for different layers of ResNet-20 on CIFAR-10.

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR'17

# Visualization of the TTQ Kernels




# **Error Rate on ImageNet**





Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR'17



# **3x3 DIRECT Convolutions**

**Compute Bound** 



Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs

Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA

## **3x3 WINOGRAD Convolutions**

Transform Data to Reduce Math Intensity



- Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs
- Winograd convolution: we need 16xC FMAs for 4 outputs: 2.25x fewer FMAs

See A. Lavin & S. Gray, "Fast Algorithms for Convolutional Neural Networks"

Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA
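The 16-multiply figure can be checked with a single-channel sketch of the 2x2-output, 3x3-kernel Winograd transform. The transform matrices are the commonly published ones for this tile size; the helper names and the test tile are illustrative. For C input channels the elementwise products are accumulated over channels, which gives the 16xC count on the slide.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]  # data transform
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]     # kernel transform
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                                # output transform


def winograd_2x2_3x3(tile, kernel):
    """4x4 input tile and 3x3 kernel -> 2x2 output with 16 elementwise multiplies."""
    u = matmul(matmul(G, kernel), transpose(G))  # transformed kernel (4x4)
    v = matmul(matmul(BT, tile), transpose(BT))  # transformed data (4x4)
    m = [[u[r][c] * v[r][c] for c in range(4)] for r in range(4)]  # 16 multiplies
    return matmul(matmul(AT, m), transpose(AT))


def direct_2x2_3x3(tile, kernel):
    """Reference: 9 multiplies per output, 36 for the 2x2 output."""
    return [[sum(tile[i + a][j + b] * kernel[a][b]
                 for a in range(3) for b in range(3))
             for j in range(2)] for i in range(2)]


tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
fast = winograd_2x2_3x3(tile, kernel)
slow = direct_2x2_3x3(tile, kernel)
```

The kernel transform `u` depends only on the weights, so at inference time it can be computed once and reused for every tile.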

# **Speedup of Winograd Convolution**

#### VGG16, Batch Size 1 - Relative Performance



Measured on Maxwell TITAN X

Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA




## **Hardware for Efficient Inference**

#### a common goal: minimize memory access









| Accelerator | Institution | Key technique          |
|-------------|-------------|------------------------|
| Eyeriss     | MIT         | RS Dataflow            |
| DaDianNao   | CAS         | eDRAM                  |
| TPU         | Google      | 8-bit Integer          |
| EIE         | Stanford    | Compression / Sparsity |

On the TPU: "This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs."

## **Google TPU**



TPU Card to replace a disk

Up to 4 cards / server

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

# Google TPU

- The Matrix Unit: 65,536 (256x256) 8-bit multiply-accumulate units
- 700 MHz clock rate
- Peak: 92T operations/second (65,536 \* 2 \* 700M)
- >25X as many MACs vs GPU
- >100X as many MACs vs CPU
- 4 MiB of on-chip Accumulator memory
- 24 MiB of on-chip Unified Buffer (activation memory)
- 3.5X as much on-chip memory vs GPU
- Two 2133MHz DDR3 DRAM channels
- 8 GiB of off-chip weight DRAM memory
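The peak figure in the list follows from the unit count and the clock rate. A quick check, counting each multiply-accumulate as two operations:

```python
mac_units = 256 * 256  # 65,536 multiply-accumulate units in the Matrix Unit
ops_per_mac = 2        # one multiply plus one add per cycle
clock_hz = 700e6       # 700 MHz

peak_ops = mac_units * ops_per_mac * clock_hz
print(f"{peak_ops / 1e12:.2f} T operations/second")
```

The product is about 91.75 T operations/second, which the slide rounds to 92T.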

### TPU: High-level Chip Architecture



David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

# Google TPU

| Processor                  | Die size (mm²) | Clock (MHz) | TDP (Watts) | Idle (Watts) | Memory (GB/sec) | Peak TOPS/chip (8b int.) | Peak TOPS/chip (32b FP) |
|----------------------------|----------------|-------------|-------------|--------------|-----------------|--------------------------|-------------------------|
| CPU: Haswell (18 core)     | 662            | 2300        | 145         | 41           | 51              | 2.6                      | 1.3                     |
| GPU: Nvidia K80 (2 / card) | 561            | 560         | 150         | 25           | 160             |                          | 2.8                     |
| TPU                        | <331\*         | 700         | 75          | 28           | 34              | 91.8                     |                         |

\*TPU is less than half the die size of the Intel Haswell processor

K80 and TPU in 28 nm process; Haswell fabbed in Intel 22 nm process

These chips and platforms were chosen for comparison because they are widely deployed in Google data centers

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

## **Inference Datacenter Workload**

| Name  | LOC  | FC layers | Conv layers | Vector layers | Pool layers | Total layers | Nonlinear function | Weights | TPU Ops / Weight Byte | TPU Batch Size | % Deployed |
|-------|------|-----------|-------------|---------------|-------------|--------------|--------------------|---------|-----------------------|----------------|------------|
| MLP0  | 0.1k | 5         |             |               |             | 5            | ReLU               | 20M     | 200                   | 200            | 61%        |
| MLP1  | 1k   | 4         |             |               |             | 4            | ReLU               | 5M      | 168                   | 168            | 61%        |
| LSTM0 | 1k   | 24        |             | 34            |             | 58           | sigmoid, tanh      | 52M     | 64                    | 64             | 29%        |
| LSTM1 | 1.5k | 37        |             | 19            |             | 56           | sigmoid, tanh      | 34M     | 96                    | 96             | 29%        |
| CNN0  | 1k   |           | 16          |               |             | 16           | ReLU               | 8M      | 2888                  | 8              | 5%         |
| CNN1  | 1k   | 4         | 72          |               | 13          | 89           | ReLU               | 100M    | 1750                  | 32             | 5%         |

% Deployed is reported per network type: MLPs 61%, LSTMs 29%, CNNs 5%.

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

#### **Roofline Model: Identify Performance Bottleneck**



Arithmetic Intensity: FLOPs/Byte Ratio

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
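The roofline bound itself is one line: attainable performance is the smaller of the compute peak and the memory bandwidth times the arithmetic intensity. The machine numbers below are hypothetical and are not taken from the TPU paper; the function names are illustrative.

```python
def attainable_ops_per_sec(peak_ops, bandwidth_bytes_per_sec, intensity):
    """Roofline bound: capped by compute or by memory traffic, whichever is lower."""
    return min(peak_ops, bandwidth_bytes_per_sec * intensity)


def ridge_point(peak_ops, bandwidth_bytes_per_sec):
    """Intensity (ops per byte) at which the two limits meet."""
    return peak_ops / bandwidth_bytes_per_sec


# Hypothetical machine: 10 tera-ops/s peak, 100 GB/s memory bandwidth.
peak, bandwidth = 10e12, 100e9
memory_bound = attainable_ops_per_sec(peak, bandwidth, intensity=20)
compute_bound = attainable_ops_per_sec(peak, bandwidth, intensity=500)
ridge = ridge_point(peak, bandwidth)
```

A workload left of the ridge point is limited by memory bandwidth, so raising its ops per byte (for example through larger batches or a smaller model) is what moves it up the slanted part of the roof.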

## **TPU Roofline**



Operational Intensity: Ops/weight byte (log scale)

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

# Log Rooflines for CPU, GPU, TPU



TeraOps/sec (log scale)



David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

# Linear Rooflines for CPU, GPU, TPU



David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit


## Why so far below Rooflines?

Low latency requirement => Can't batch more => low ops/byte

## How to Solve this?

Smaller memory footprint => need to compress the model

## **Challenge:**

Hardware that can infer on a compressed model

[Han et al. ISCA'16]

#### EIE: the First DNN Accelerator for Sparse, Compressed Model


## **EIE: Reduce Memory Access by Compression**





Physically, the sparse weights are stored as:

| Virtual Weight | $W_{0,0}$ | $W_{0,1}$ | $W_{4,2}$ | $W_{0,3}$ | $W_{4,3}$ |
|----------------|-----------|-----------|-----------|-----------|-----------|
| Relative Index | 0         | 1         | 2         | 0         | 0         |
| Column Pointer | 0         | 1         | 2         | 3         |           |

Han et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA 2016, Hotchips 2016
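A software sketch of this kind of storage and of the sparse matrix-vector product it enables. It assumes a column-major layout in which each relative index counts the zeros skipped since the previous stored entry in the same column; the example matrix, the function names, and the single-threaded loop are illustrative simplifications that omit details of the real design such as fixed-width indices and the distribution of rows across processing elements.

```python
def encode_csc_relative(dense):
    """Column-major sparse encoding with relative row indices."""
    values, rel_index, col_ptr = [], [], [0]
    rows, cols = len(dense), len(dense[0])
    for c in range(cols):
        zeros = 0
        for r in range(rows):
            w = dense[r][c]
            if w == 0:
                zeros += 1
            else:
                values.append(w)
                rel_index.append(zeros)  # zeros skipped since the previous entry
                zeros = 0
        col_ptr.append(len(values))      # where the next column starts
    return values, rel_index, col_ptr


def sparse_matvec(values, rel_index, col_ptr, activations, rows):
    """y = W a, touching only nonzero weights and nonzero activations."""
    out = [0.0] * rows
    for c, a in enumerate(activations):
        if a == 0:                 # W * 0 = 0: skip the whole column
            continue
        r = -1
        for k in range(col_ptr[c], col_ptr[c + 1]):
            r += rel_index[k] + 1  # accumulate the address to get the absolute row
            out[r] += values[k] * a
    return out


W = [[2, 0, 0],
     [0, 0, 3],
     [0, 1, 0],
     [4, 0, 5]]
values, rel_index, col_ptr = encode_csc_relative(W)
y = sparse_matvec(values, rel_index, col_ptr, activations=[1, 0, 2], rows=4)
```

Zero weights are never stored, so they cost neither memory traffic nor multiplies, and the zero activation in the example skips its column entirely.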



Rule of thumb: $0 \times A = 0$ and $W \times 0 = 0$


## **EIE Architecture**

#### Weight decode



#### **Address Accumulate**

Rule of thumb: $0 \times A = 0$ (sparse weights), $W \times 0 = 0$ (sparse activations), and $2.09, 1.92 \Rightarrow 2$ (weight sharing)

[Han et al. ISCA'16]

## **Micro Architecture for each PE**






#### [Han et al. ISCA'16]

## **Speedup on EIE**




## **Energy Efficiency on EIE**

[Han et al. ISCA'16]

# **Comparison: Throughput**



[Han et al. ISCA'16]

# **Comparison: Energy Efficiency**






# Part 3: Efficient Training — Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training


#### Moore's law made CPUs 300x faster than in 1990. But it's over...



Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Dotted line extrapolations by C. Moore.

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

#### Data Parallel – Run multiple inputs in parallel






- Doesn't affect latency for one input
- Requires a P-fold larger batch size
- For training, requires a coordinated weight update

Dally, High Performance Hardware for Machine Learning, NIPS'2015
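The "coordinated weight update" can be sketched with a toy one-parameter model. Each worker computes a gradient on its shard of the batch, and the gradients are averaged before a single update is applied. The model, the data, and the function names are hypothetical, and the workers run sequentially here for simplicity.

```python
def gradient(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, batch, num_workers, lr):
    """Split the batch, compute local gradients, average them, update once."""
    shards = [batch[i::num_workers] for i in range(num_workers)]
    local = [gradient(w, shard) for shard in shards]  # would run in parallel
    # Weight each local gradient by its shard size so the average is exact.
    avg = sum(g * len(s) for g, s in zip(local, shards)) / len(batch)
    return w - lr * avg


batch = [(1, 2), (2, 4), (3, 6), (4, 8)]  # toy data for y = 2x
single = 0.0 - 0.01 * gradient(0.0, batch)
parallel = data_parallel_step(0.0, batch, num_workers=2, lr=0.01)
```

The parallel step matches the single-worker step on the full batch, which is why data parallelism needs a batch P times larger to keep P workers busy.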

#### **Parameter Update**



Large Scale Distributed Deep Networks, Jeff Dean et al., 2013

#### **Model Parallel** Split up the Model – i.e. the network

Dally, High Performance Hardware for Machine Learning, NIPS'2015

#### Model-Parallel Convolution – by output region (x,y)



Dally, High Performance Hardware for Machine Learning, NIPS'2015

# Model-Parallel Convolution – By output map j (filter)



Dally, High Performance Hardware for Machine Learning, NIPS'2015

## Model Parallel Fully-Connected Layer (M x V)





Dally, High Performance Hardware for Machine Learning, NIPS'2015
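A sketch of splitting a fully-connected layer by output activation: each worker owns a block of rows of the weight matrix and produces the matching slice of the output vector. The block partitioning and the function names are illustrative assumptions, and the workers run sequentially here.

```python
def matvec(rows, x):
    """Dense matrix-vector product for a list of weight rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def model_parallel_fc(weight, x, num_workers):
    """Each worker computes the output activations for its own block of rows."""
    n = len(weight)
    block = (n + num_workers - 1) // num_workers  # rows per worker, rounded up
    outputs = []
    for p in range(num_workers):  # each iteration could run on its own processor
        outputs.extend(matvec(weight[p * block:(p + 1) * block], x))
    return outputs


weight = [[1, 0, 2], [0, 1, 0], [3, 1, 1], [0, 0, 4]]  # toy 4 x 3 layer
x = [1, 2, 3]
reference = matvec(weight, x)
split = model_parallel_fc(weight, x, num_workers=2)
```

Every worker needs the full input vector `x` but only its own rows of the weights, so the weight memory is divided across processors.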


#### Hyper-Parameter Parallel – Try many alternative networks in parallel

Dally, High Performance Hardware for Machine Learning, NIPS'2015

## **Summary of Parallelism**

- Lots of parallelism in DNNs
  - 16M independent multiplies in one FC layer
  - Limited by overhead to exploit a fraction of this
- Data parallel
  - Run multiple training examples in parallel
  - Limited by batch size
- Model parallel
  - Split model over multiple processors
  - By layer
  - Conv layers by map region
  - Fully connected layers by output activation
- Easy to get 16-64 GPUs training one model in parallel

Dally, High Performance Hardware for Machine Learning, NIPS'2015


### **Mixed Precision**



https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

### **Mixed Precision Training**



Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, "Training with mixed precision", NVIDIA GTC 2017
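One reason mixed-precision training commonly keeps an FP32 copy of the weights can be shown with the standard library alone. The sketch below is an illustration under assumptions: the update size and step count are made up, and `struct` rounding stands in for real FP16 and FP32 hardware arithmetic.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


update = 1e-4    # a small weight update (illustrative)
w_fp16 = to_fp16(1.0)
w_master = to_fp32(1.0)

for _ in range(1000):
    # Near 1.0 the FP16 spacing is about 0.001, so this update rounds away.
    w_fp16 = to_fp16(w_fp16 + update)
    # The FP32 master copy has enough resolution to accumulate it.
    w_master = to_fp32(w_master + update)
```

After 1000 steps the FP16 weight is still exactly 1.0 while the FP32 master weight has moved to about 1.1.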

## **Inception V1**



Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, "Training with mixed precision", NVIDIA GTC 2017

#### ResNet



Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, "Training with mixed precision", NVIDIA GTC 2017

#### **AlexNet**

| Mode                     | Top1 accuracy, % | Top5 accuracy, % |
|--------------------------|------------------|------------------|
| FP32                     | 58.62            | 81.25            |
| Mixed precision training | 58.12            | 80.71            |

#### **Inception V3**

| Mode                     | Top1 accuracy, % | Top5 accuracy, % |
|--------------------------|------------------|------------------|
| FP32                     | 71.75            | 90.52            |
| Mixed precision training | 71.17            | 90.10            |

#### **ResNet-50**

| Mode                     | Top1 accuracy, % | Top5 accuracy, % |
|--------------------------|------------------|------------------|
| FP32                     | 73.85            | 91.44            |
| Mixed precision training | 73.6             | 91.11            |

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, "Training with mixed precision", NVIDIA GTC 2017


### **Model Distillation**



student model has much smaller model size

#### Softened outputs reveal the dark knowledge

|                              | cow       | dog | cat | car       |
|------------------------------|-----------|-----|-----|-----------|
| original hard targets        | 0         | 1   | 0   | 0         |
| output of geometric ensemble | $10^{-6}$ | .9  | .1  | $10^{-9}$ |
| softened output of ensemble  | .05       | .3  | .2  | .005      |

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network

$$p_i = \frac{\exp\left(\frac{z_i}{T}\right)}{\sum_j \exp\left(\frac{z_j}{T}\right)}$$

- Method: divide the scores $z_i$ by a "temperature" $T$ to get a much softer distribution
- Result: start with a trained model that classifies 58.9% of the test frames correctly. The new model converges to 57.0% correct even when it is trained on only 3% of the data

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network
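The temperature formula can be tried directly. The logits below are hypothetical scores for the four classes in the table (cow, dog, cat, car), and the temperature value is chosen only for illustration.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 8.0, 6.0, -2.0]  # hypothetical scores: cow, dog, cat, car
hard = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=5.0)
```

Raising `T` keeps the ranking of the classes but moves probability mass from the top class to the others, which exposes how similar the teacher considers the wrong classes to be.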


## **DSD: Dense Sparse Dense Training**



DSD produces the same model architecture but can find a better optimization solution, arrive at a better local minimum, and achieve higher prediction accuracy across a wide range of deep neural networks (CNNs, RNNs, and LSTMs).

Han et al. "DSD: Dense-Sparse-Dense Training for Deep Neural Networks", ICLR 2017
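The three phases in the name can be sketched as a training schedule: train dense, prune the small weights and retrain under a sparsity mask, then drop the mask and retrain dense. The toy quadratic loss, the targets, the step counts, and the function names below are assumptions made for illustration; the sketch shows only the control flow, not the accuracy gains reported in the paper.

```python
def train(weights, grad_fn, steps, lr, mask=None):
    """Plain gradient descent; an optional mask holds pruned weights at zero."""
    for _ in range(steps):
        grads = grad_fn(weights)
        weights = [w - lr * g for w, g in zip(weights, grads)]
        if mask is not None:
            weights = [w * m for w, m in zip(weights, mask)]
    return weights


def dsd(weights, grad_fn, sparsity, steps, lr):
    dense = train(weights, grad_fn, steps, lr)                  # Dense
    k = int(len(dense) * sparsity)
    smallest = set(sorted(range(len(dense)), key=lambda i: abs(dense[i]))[:k])
    mask = [0 if i in smallest else 1 for i in range(len(dense))]
    sparse = train([w * m for w, m in zip(dense, mask)],
                   grad_fn, steps, lr, mask)                    # Sparse
    final = train(sparse, grad_fn, steps, lr)                   # Dense again, no mask
    return final, mask


# Toy loss: squared distance to a target vector (illustrative only).
target = [1.0, 0.05, -2.0, 0.01]
grad_fn = lambda ws: [2 * (w - t) for w, t in zip(ws, target)]
final, mask = dsd([0.0] * 4, grad_fn, sparsity=0.5, steps=50, lr=0.1)
```

In the last phase the pruned connections restart from zero and are free to train again, so the final model has the same dense architecture as the original.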

### **DSD:** Intuition





Learn the trunk first, then learn the leaves.

Han et al. "DSD: Dense-Sparse-Dense Training for Deep Neural Networks", ICLR 2017

### DSD is General Purpose: Vision, Speech, Natural Language

| Network      | Domain  | Dataset   | Type | Baseline | DSD   | Abs. Imp. | Rel. Imp. |
|--------------|---------|-----------|------|----------|-------|-----------|-----------|
| GoogleNet    | Vision  | ImageNet  | CNN  | 31.1%    | 30.0% | 1.1%      | 3.6%      |
| VGG-16       | Vision  | ImageNet  | CNN  | 31.5%    | 27.2% | 4.3%      | 13.7%     |
| ResNet-18    | Vision  | ImageNet  | CNN  | 30.4%    | 29.3% | 1.1%      | 3.7%      |
| ResNet-50    | Vision  | ImageNet  | CNN  | 24.0%    | 23.2% | 0.9%      | 3.5%      |
| NeuralTalk   | Caption | Flickr-8K | LSTM | 16.8     | 18.5  | 1.7       | 10.1%     |
| DeepSpeech   | Speech  | WSJ'93    | RNN  | 33.6%    | 31.6% | 2.0%      | 5.8%      |
| DeepSpeech-2 | Speech  | WSJ'93    | RNN  | 14.5%    | 13.4% | 1.1%      | 7.4%      |

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD

The baseline results of AlexNet, VGG16, GoogleNet, and SqueezeNet are from the Caffe Model Zoo. ResNet18 and ResNet50 are from fb.resnet.torch.


## DSD Model Zoo

DSD model zoo: higher-accuracy models from DSD training on ImageNet with the same model architecture.

This repo contains models pre-trained with Dense-Sparse-Dense (DSD) training on ImageNet.

Compared to conventional training, dense→sparse→dense (DSD) training yields higher accuracy with the same model architecture.

Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.

#### **Download:**

#### https://songhan.github.io/DSD

### **DSD on Caption Generation**



Baseline model: Andrej Karpathy, Neural Talk model zoo.

Han et al. "DSD: Dense-Sparse-Dense Training for Deep Neural Networks", ICLR 2017

### **DSD on Caption Generation**

#### A. Supplementary Material: More Examples of DSD Training Improving the Performance of the NeuralTalk Auto-Caption System



- **Baseline**: a boy is swimming in a pool.
- **Sparse**: a small black dog is jumping into a pool.
- **DSD**: a black and white dog is swimming in a pool.


- **Baseline**: a group of people are standing in front of a building.
- **Sparse**: a group of people are standing in front of a building.
- **DSD**: a group of people are walking in a park.


- **Baseline**: two girls in bathing suits are playing in the water.
- **Sparse**: two children are playing in the sand.
- **DSD**: two children are playing in the sand.


- **Baseline**: a man in a red shirt and jeans is riding a bicycle down a street.
- **Sparse**: a man in a red shirt and a woman in a wheelchair.
- **DSD**: a man and a woman are riding on a street.


- **Baseline**: a group of people sit on a bench in front of a building.
- **Sparse**: a group of people are standing in front of a building.
- **DSD**: a group of people are standing in a fountain.


- **Baseline**: a man in a black jacket and a black jacket is smiling.
- **Sparse**: a man and a woman are standing in front of a mountain.
- **DSD**: a man in a black jacket is standing next to a man in a black shirt.


- **Baseline**: a group of football players in red uniforms.
- **Sparse**: a group of football players in a field.
- **DSD**: a group of football players in red and white uniforms.


- **Baseline**: a dog runs through the grass.
- **Sparse**: a dog runs through the grass.
- **DSD**: a white and brown dog is running through the grass.

Baseline model: Andrej Karpathy, Neural Talk model zoo.



### Agenda



## **CPUs for Training**

#### Intel Knights Landing (2016)



- 7 TFLOPS FP32
- 16 GB MCDRAM, 400 GB/s
- 245W TDP
- 29 GFLOPS/W (FP32)
- 14nm process

#### Knights Mill: next gen Xeon Phi "optimized for deep learning"

In October 2016, Intel announced new vector instructions for deep learning (AVX512-4VNNIW and AVX512-4FMAPS).

Slide Source: Sze et al Survey of DNN Hardware, MICRO'16 Tutorial. Image Source: Intel, Data Source: Next Platform

## **GPUs for Training**

#### Nvidia PASCAL GP100 (2016)



- 10/20 TFLOPS FP32/FP16
- 16 GB HBM2, 750 GB/s
- 300W TDP
- 67 GFLOPS/W (FP16)
- 16nm process
- 160 GB/s NVLink

Slide Source: Sze et al Survey of DNN Hardware, MICRO'16 Tutorial. Data Source: NVIDIA
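The efficiency figures quoted for Knights Landing and for GP100 follow from dividing peak throughput by TDP. As a quick check, assuming the peak numbers listed on the two slides (FP32 for Knights Landing, FP16 for GP100):

$$\frac{7000\ \text{GFLOPS}}{245\ \text{W}} \approx 28.6\ \text{GFLOPS/W}, \qquad \frac{20000\ \text{GFLOPS}}{300\ \text{W}} \approx 66.7\ \text{GFLOPS/W}$$

These round to the 29 GFLOPS/W and 67 GFLOPS/W shown. Note that the two values are not directly comparable, since they use different precisions.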

## **GPUs for Training**

#### Nvidia Volta GV100 (2017)



- 15 FP32 TFLOPS
- 120 Tensor TFLOPS
- 16GB HBM2 @ 900GB/s
- 300W TDP
- 12nm process
- 21B Transistors
- Die size: 815 mm²
- 300GB/s NVLink

Data Source: NVIDIA

### What's new in Volta: Tensor Core



- A new instruction that performs 4x4x4 FMA mixed-precision operations per clock
- 12X increase in throughput for the Volta V100 compared to the Pascal P100

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
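To make the 4x4x4 operation concrete, the sketch below emulates one matrix multiply-accumulate D = A x B + C on 4x4 tiles in NumPy. It assumes FP16 inputs with FP32 accumulation, as in the "FP16 Input, FP32 compute" mode mentioned in these slides; the function name `tensor_core_fma` is hypothetical, and the emulation illustrates the arithmetic only, not the hardware's exact rounding behavior.

```python
import numpy as np

def tensor_core_fma(a, b, c):
    """Emulate D = A x B + C on 4x4 tiles: FP16 inputs A and B, FP32 accumulation."""
    a16 = np.asarray(a, dtype=np.float16)  # inputs are rounded to FP16
    b16 = np.asarray(b, dtype=np.float16)
    acc = np.asarray(c, dtype=np.float32)  # the accumulator stays in FP32
    # Each of the 16 outputs needs 4 multiply-adds: 4 * 4 * 4 = 64 FMAs in total.
    return a16.astype(np.float32) @ b16.astype(np.float32) + acc

rng = np.random.default_rng(0)
a, b, c = (rng.normal(size=(4, 4)) for _ in range(3))
d = tensor_core_fma(a, b, c)

fmas_per_op = 4 * 4 * 4
print(d.dtype, fmas_per_op)
```

Keeping the accumulator in FP32 limits the precision loss to the rounding of the inputs, which is why the result stays close to the full-precision product.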

### Pascal vs. Volta



cuBLAS Mixed Precision (FP16 Input, FP32 compute)

Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations.

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

### Pascal vs. Volta



Left: Tesla V100 trains the ResNet-50 deep neural network 2.4x faster than Tesla P100. Right: Given a target latency per image of 7ms, Tesla V100 is able to perform inference using the ResNet-50 deep neural network 3.7x faster than Tesla P100.

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

[Figure: Volta GV100 streaming multiprocessor (SM) block diagram. A shared L1 instruction cache feeds four processing blocks, each with an L0 instruction cache, a warp scheduler (32 thread/clk), a dispatch unit (32 thread/clk), a register file (16,384 x 32-bit), FP64, INT, and FP32 cores, two Tensor Cores, LD/ST units, and an SFU.]

The GV100 SM is partitioned into four processing blocks, each with:

- 8 FP64 Cores
- 16 FP32 Cores
- 16 INT32 Cores
- two of the new mixed-precision Tensor Cores for deep learning
- a new L0 instruction cache
- one warp scheduler
- one dispatch unit
- a 64 KB Register File.

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
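The per-block unit counts above are enough to estimate the peak throughput of the whole chip. The sketch below is a back-of-the-envelope calculation under two assumptions that are not stated on this slide: that Tesla V100 has 80 SMs enabled, and that each core or Tensor Core retires its FMAs every clock at the 1455 MHz boost clock listed in the Tesla product table.

```python
SMS = 80                      # assumption: SM count of Tesla V100, not given on the slide
BLOCKS_PER_SM = 4             # processing blocks per SM
FP32_CORES_PER_BLOCK = 16
TENSOR_CORES_PER_BLOCK = 2
FMAS_PER_TENSOR_CORE_CLOCK = 4 * 4 * 4  # one 4x4x4 FMA operation per clock
FLOPS_PER_FMA = 2             # one multiply plus one add
BOOST_CLOCK_HZ = 1455e6       # boost clock from the Tesla product table

fp32_tflops = (SMS * BLOCKS_PER_SM * FP32_CORES_PER_BLOCK
               * FLOPS_PER_FMA * BOOST_CLOCK_HZ / 1e12)
tensor_tflops = (SMS * BLOCKS_PER_SM * TENSOR_CORES_PER_BLOCK
                 * FMAS_PER_TENSOR_CORE_CLOCK * FLOPS_PER_FMA
                 * BOOST_CLOCK_HZ / 1e12)

print(f"FP32:   {fp32_tflops:.1f} TFLOP/s")    # 14.9
print(f"Tensor: {tensor_tflops:.1f} TFLOP/s")  # 119.2
```

Under these assumptions both results round to the 15 and 120 TFLOP/s peak figures quoted for Tesla V100.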

| Tesla Product             | Tesla K40      | Tesla M40       | Tesla P100     | Tesla V100    |
|---------------------------|----------------|-----------------|----------------|---------------|
| GPU                       | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GV100 (Volta) |
| GPU Boost Clock           | 810/875 MHz    | 1114 MHz        | 1480 MHz       | 1455 MHz      |
| Peak FP32 TFLOP/s*        | 5.04           | 6.8             | 10.6           | 15            |
| Peak Tensor Core TFLOP/s* | -              | -               | -              | 120           |
| Memory Interface          | 384-bit GDDR5  | 384-bit GDDR5   | 4096-bit HBM2  | 4096-bit HBM2 |
| Memory Size               | Up to 12 GB    | Up to 24 GB     | 16 GB          | 16 GB         |
| TDP                       | 235 Watts      | 250 Watts       | 300 Watts      | 300 Watts     |
| Transistors               | 7.1 billion    | 8 billion       | 15.3 billion   | 21.1 billion  |
| GPU Die Size              | 551 mm²        | 601 mm²         | 610 mm²        | 815 mm²       |
| Manufacturing Process     | 28 nm          | 28 nm           | 16 nm FinFET+  | 12 nm FFN     |

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

### **GPU / TPU**

|                                 | K80 (2012) | TPU (2015) | P40 (2016) |
|---------------------------------|------------|------------|------------|
| Inferences/sec (<10 ms latency) | 1/13X      | 1X         | 2X         |
| Training TOPS                   | 6 FP32     | NA         | 12 FP32    |
| Inference TOPS                  | 6 FP32     | 90 INT8    | 48 INT8    |
| On-chip Memory                  | 16 MB      | 24 MB      | 11 MB      |
| Power                           | 300 W      | 75 W       | 250 W      |
| Bandwidth                       | 320 GB/s   | 34 GB/s    | 350 GB/s   |

https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/

### **Google Cloud TPU**



Cloud TPU delivers up to 180 teraflops to train and run machine learning models.

source: Google Blog

### **Google Cloud TPU**



A "TPU pod" built with 64 second-generation TPUs delivers up to 11.5 petaflops of machine learning acceleration.

"One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod."— Google Blog
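The pod figure is consistent with the per-device number on the previous slide. As a rough check, assuming peak throughput simply adds up across devices:

$$64 \times 180\ \text{TFLOPS} = 11{,}520\ \text{TFLOPS} \approx 11.5\ \text{PFLOPS}$$

By the same reasoning, the one eighth of a pod mentioned in the quote corresponds to 8 devices, or about 1.44 PFLOPS of peak throughput. Sustained throughput on a real training job would be lower than this peak.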

## Wrap-Up



#### Future



Smart

Low Latency

Privacy

Mobility

**Energy-Efficient** 

#### **Outlook: the Focus for Computation**







**PC Era** 

**Mobile-First Era** 

**AI-First Era** 



Brain-Inspired Cognitive Computing

Sundar Pichai, Google IO, 2016

## Thank you!

stanford.edu/~songhan

