

## Introduction to Inference

Micah Villmow, Principal TensorRT Engineer. Hot Chips Tutorial on ML-Inference, 8/27/2023





# Agenda

- Inference Introduction
- Ecosystem
- Optimization
- Execution





## Inference Introduction



## What Is Inference?

"Inference is a conclusion reached on the basis of evidence and reasoning"<sup>1</sup>

— Oxford Dictionary



### Conclusion:

The output of the network

### Evidence:

- Prior knowledge is the trained weights
- Current knowledge is the input activations

### Reasoning:

The reasoning is implicit in the network

## Difference From Training

- No backward passes.
- Weights are read-only.
- Activations do not need to be stored.
- The dataset is unknown.
- The data can be, but is not required to be, normalized.
- Optimizes for different metrics, whereas training optimizes for throughput.







## AI Inference Drives Modern Applications

Demand for fast, easy inference deployment is greater than ever

- Computer Vision
- Conversational AI
- Recommender Systems
- Fraud Detection





## Ecosystem





## Hardware Ecosystem

- Embedded
- VPU
- CPU
- GPU
- Gaming
- Professional
- SoC
- Wafer-scale





## Inference Software Ecosystem

A simplified model

- Application Layer
- Extended Framework Layer (Hugging Face, PyTorch Lightning)
- Framework Layer (PyTorch, TF, JAX)
- Graph Compiler Layer (XLA, FX, TensorRT)
- System Layer (CUDA, NCCL, OpenMPI)
- Platform Layer
- Hardware Layer











## Cambrian Model Explosion

- GANs
- Capsule Nets
- Reinforcement Learning



## Optimizations



## Why Optimize?



### Reduce



## Precision

What is the best format?

| FP format            | Exponent bits | Significand bits |
|----------------------|---------------|------------------|
| IEEE-754 64bit       | 11            | 52               |
| IEEE-754 32bit       | 8             | 23               |
| TensorFloat-32 19bit | 8             | 10               |
| IEEE-754 16bit       | 5             | 10               |
| BFloat 16bit         | 8             | 7                |
| FP8E4M3              | 4             | 3                |
| FP8E5M2              | 5             | 2                |

| Signed integer formats | Unsigned integer formats |
|------------------------|--------------------------|
| Int64                  | Uint64                   |
| Int32                  | Uint32                   |
| Int16                  | Uint16                   |
| Int8                   | Uint8                    |
| Int4                   | Uint4                    |
| Int2                   | Uint2                    |
|                        | Uint1                    |
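The exponent and significand widths in the table determine each format's storage size and its relative precision. The sketch below derives both, assuming an IEEE-754-style layout (one sign bit, a biased exponent, and an implicit leading significand bit); the function names are illustrative only, and the FP8 formats may treat special values differently than this simple model suggests.

```python
# Floating-point formats from the table: name -> (exponent bits, significand bits).
FP_FORMATS = {
    "IEEE-754 64bit": (11, 52),
    "IEEE-754 32bit": (8, 23),
    "TensorFloat-32 19bit": (8, 10),
    "IEEE-754 16bit": (5, 10),
    "BFloat 16bit": (8, 7),
    "FP8E4M3": (4, 3),
    "FP8E5M2": (5, 2),
}


def total_bits(exponent_bits, significand_bits):
    """Storage width: one sign bit plus the exponent and significand fields."""
    return 1 + exponent_bits + significand_bits


def relative_step(significand_bits):
    """Spacing between 1.0 and the next representable value (machine epsilon)."""
    return 2.0 ** -significand_bits


def exponent_bias(exponent_bits):
    """Bias of an IEEE-754-style exponent field (assumed layout)."""
    return 2 ** (exponent_bits - 1) - 1


for name, (e, m) in FP_FORMATS.items():
    print(f"{name:22s} bits={total_bits(e, m):2d} "
          f"step={relative_step(m):.3g} bias={exponent_bias(e)}")
```

More exponent bits widen the representable range, while more significand bits shrink the relative step, which is the trade-off behind the question of which format is best.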









## Layouts

What is the best representation?

Figure: the same FP16 (F16) elements arranged in two layouts, Linear FP16 and 2xVec4 FP16.

## Memory: Formats

Combination of Precision and Layout

Figure: element grids for the following formats:

- Linear Int8
- Mat4 FP16
- Mat4 Int8
- 2xMat4 Int8
- 4xMat4 Int4



## Type and Shape Inference

- What are the input and output types of the purple layer?
- What are the output dimensions of the Softmax?

Figure: a three-layer network with the unknowns marked by question marks:

- Conv: Linear FP32 -> ?, dimensions (4, 32, 128, 128)
- Activation: ?, dimensions (?, ?, ?, ?)
- Softmax: ? -> Vec2 FP16, dimensions (?, ?, ?, ?)
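Shape inference can be sketched as propagating dimensions layer by layer. The example below assumes the dimensions shown for Conv are its output, and that the activation and the softmax are shape-preserving, which holds for pointwise activations and for softmax in general. The input shape and the convolution parameters are hypothetical, chosen only so that the convolution output matches the figure.

```python
def conv2d_output_shape(input_shape, out_channels, kernel, stride=1, padding=0, dilation=1):
    """Output shape of a 2D convolution on an (N, C, H, W) input."""
    n, _, h, w = input_shape

    def out_dim(size):
        return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

    return (n, out_channels, out_dim(h), out_dim(w))


def pointwise_output_shape(input_shape):
    """Pointwise activations and softmax keep the input shape."""
    return tuple(input_shape)


# Hypothetical input and convolution parameters, chosen so that the
# convolution output matches the (4, 32, 128, 128) shown for Conv.
conv_out = conv2d_output_shape((4, 3, 128, 128), out_channels=32, kernel=3, padding=1)
act_out = pointwise_output_shape(conv_out)
softmax_out = pointwise_output_shape(act_out)
print(conv_out, act_out, softmax_out)
```

Type inference works the same way, except that the propagated property is the precision and layout of each tensor rather than its dimensions.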

## Dynamic Dimensions

Variable sized inputs for a network

- How many video formats are there?
- Image sizes?
- Sentence lengths?
- Dictionary sizes?
- Object counts?
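One way to reason about variable-sized inputs is to declare, per dimension, the range of sizes the network must accept and to validate each incoming shape against it. The sketch below is only an illustration of that idea: `DimRange`, `fixed`, and `shape_is_supported` are hypothetical names, and the ranges are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimRange:
    """Inclusive range of sizes one dimension may take at run time."""
    minimum: int
    maximum: int

    def contains(self, size):
        return self.minimum <= size <= self.maximum


def fixed(size):
    """A static dimension is a range with a single allowed size."""
    return DimRange(size, size)


def shape_is_supported(shape, ranges):
    """True when every dimension of `shape` falls inside its declared range."""
    return len(shape) == len(ranges) and all(
        r.contains(s) for s, r in zip(shape, ranges)
    )


# Hypothetical image input: batch 1 to 8, 3 channels, frames up to 2560x1600.
image_input = (DimRange(1, 8), fixed(3), DimRange(240, 1600), DimRange(320, 2560))

print(shape_is_supported((4, 3, 480, 640), image_input))     # a 640x480 frame
print(shape_is_supported((4, 3, 2160, 3840), image_input))   # larger than declared
```

Declaring ranges up front lets an optimizer specialize for the sizes it expects while still rejecting shapes it was never built for.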





Figure: common video and display resolutions grouped by aspect ratio, including SXGA, QSXGA, and 16:9 HD formats. The 4:3, 3:2, and 16:10 entries are:

| 4:3             | 3:2          | 16:10            |
|-----------------|--------------|------------------|
| QVGA 320x240    | NTSC 720x480 | CGA 320x200      |
| VGA 640x480     | 1152x768     | WSXGA+ 1680x1050 |
| PAL 768x576     | 1280x854     | WUXGA 1920x1200  |
| SVGA 800x600    | 1440x960     | WQXGA 2560x1600  |
| XGA 1024x768    |              |                  |
| 1280x960        |              |                  |
| SXGA+ 1400x1050 |              |                  |
| UXGA 1600x1200  |              |                  |
| QXGA 2048x1536  |              |                  |



## Dynamic Data-Dependent Shapes

How many objects are in each image?





## Quantization

Do almost the same with less

- Decreases latency and storage
- Balances between truncation and discretization



Figure: floating-point values x<sub>f</sub> mapped to signed Int8 values x<sub>q</sub>.
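The mapping from x<sub>f</sub> to x<sub>q</sub> can be sketched with symmetric linear quantization, which is one common scheme; the slides do not say which scheme is meant, so treat the scale choice and the clamp range below as assumptions.

```python
def quantize(x_f, scale):
    """Map floating-point values to signed Int8 with a symmetric linear scale."""
    return [max(-128, min(127, round(v / scale))) for v in x_f]


def dequantize(x_q, scale):
    """Map Int8 values back to approximate floating-point values."""
    return [q * scale for q in x_q]


x_f = [-1.0, -0.5, 0.0, 0.26, 0.9, 3.0]

# Clipping threshold chosen by hand for illustration: values beyond +/-1.0 are
# truncated, values inside are rounded to the nearest Int8 step.
threshold = 1.0
scale = threshold / 127

x_q = quantize(x_f, scale)
x_hat = dequantize(x_q, scale)
print(x_q)
print([round(v, 4) for v in x_hat])
```

A smaller threshold gives finer steps inside the range but truncates more outliers (here 3.0 saturates), while a larger threshold does the opposite; choosing it is the balance between truncation and discretization.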





## Activations: Sparsity

Types of sparsity

- Generated as part of training or via post-training optimization
- Sparsity in inference can improve performance and reduce the size of weights
- Structured: the number of weights that are zero is the same every N values
  - N:M: N zeros every M elements
  - Block: various blocks are sparse
- Unstructured: a percentage of the memory is 0
  - 10%: 1 out of 10 elements over the entire memory block is a zero
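The difference between structured and unstructured sparsity can be sketched as follows. The example uses a 2:4 pattern (2 zeros in every group of 4) and magnitude-based pruning; both the pattern and the pruning heuristic are assumptions made for illustration, and the function names are hypothetical.

```python
def prune_n_of_m(weights, n_zeros=2, m=4):
    """Zero the n_zeros smallest-magnitude weights in every group of m."""
    pruned = list(weights)
    for start in range(0, len(pruned), m):
        group = range(start, min(start + m, len(pruned)))
        smallest = sorted(group, key=lambda i: abs(pruned[i]))[:n_zeros]
        for i in smallest:
            pruned[i] = 0.0
    return pruned


def is_n_of_m_sparse(weights, n_zeros=2, m=4):
    """True when every full group of m weights holds at least n_zeros zeros."""
    return all(
        sum(1 for w in weights[s:s + m] if w == 0.0) >= n_zeros
        for s in range(0, len(weights) - m + 1, m)
    )


def sparsity(weights):
    """Unstructured sparsity: the fraction of elements that are zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_n_of_m(weights)
print(pruned)
print(is_n_of_m_sparse(pruned), sparsity(pruned))
```

The structured check constrains where the zeros fall, which is what lets hardware skip them predictably; the unstructured measure only counts them.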





























## Quantization Methods

- Post-Training Quantization
- Quantization-Aware Training





## Weight Optimizations

Transforming weights to different types

- Conversion to different data types lowers memory requirements
- Conversion to different formats allows efficient algorithms
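The memory saving from a data type conversion, and the rounding error it introduces, can be seen with Python's standard `struct` module, whose `f` and `e` format codes pack IEEE-754 32-bit and 16-bit values. The weight values below are made up for illustration.

```python
import struct


def to_fp32(value):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def to_fp16(value):
    """Round a Python float to the nearest IEEE-754 16-bit value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


weights_fp32 = [to_fp32(w) for w in (0.1, -1.5, 3.14159, 1024.0)]
weights_fp16 = [to_fp16(w) for w in weights_fp32]

bytes_fp32 = len(weights_fp32) * struct.calcsize("<f")
bytes_fp16 = len(weights_fp16) * struct.calcsize("<e")
print(bytes_fp32, bytes_fp16)  # storage in bytes for each data type

for w32, w16 in zip(weights_fp32, weights_fp16):
    print(f"{w32:.7f} -> {w16:.7f} (error {abs(w32 - w16):.2e})")
```

Values that are exactly representable in FP16, such as -1.5 and 1024.0, convert without error; the others pick up a small rounding error in exchange for half the storage.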











### FP32 -> FP16 Conversion





### FP32 -> Int8 Conversion





## FP32 to FP16 w/Padding




## Layer Fusion

Optimizes use of GPU memory and bandwidth by fusing nodes in a kernel

- Combines nodes into a single node, so the fused nodes execute as a single kernel.
- Significantly reduces the number of layers to compute, resulting in faster performance.
- Eliminates unnecessary memory traffic by not spilling intermediate results to memory.

developer.nvidia.com/tensorrt
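The effect of fusion can be modeled in plain Python, with lists standing in for tensors in memory. This is a sketch under that assumption, not how a GPU compiler generates fused kernels: the unfused version materializes an intermediate tensor after every node, while the fused version makes one pass and keeps intermediates in local values.

```python
def scale(xs, factor):
    return [x * factor for x in xs]


def add_bias(xs, bias):
    return [x + bias for x in xs]


def relu(xs):
    return [max(x, 0.0) for x in xs]


def unfused(xs, factor, bias):
    """Three separate nodes: each one reads and writes a full intermediate tensor."""
    t1 = scale(xs, factor)
    t2 = add_bias(t1, bias)
    return relu(t2)


def fused(xs, factor, bias):
    """One fused node: each element is read once, and intermediates stay in locals."""
    return [max(x * factor + bias, 0.0) for x in xs]


xs = [-2.0, -0.5, 0.0, 1.5]
print(unfused(xs, 2.0, 1.0))
print(fused(xs, 2.0, 1.0))
```

Both versions produce the same result; the fused one touches memory once per element instead of three times, which is the traffic the bullets above refer to.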



## Time Fusion

Optimizes recurrent neural networks over time steps with dynamically generated kernels

- Recurrent neural network optimizations
- Deploy highly optimized ASR and TTS
- Compiler fuses pointwise ops, fuses GEMMs, and computes efficiently across time steps






## Memory — Tiling

- Segment a graph to get better computation locality
- Fits more of the graph computation into L1/L2
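Tiling a segment of the graph can be sketched with two chained 3-tap 1D convolutions. The layers and tile size are hypothetical; the point is that running both layers one tile at a time keeps the intermediate tensor small enough to stay in a fast memory level, at the cost of re-reading a small halo of input per tile.

```python
def conv3(xs, taps=(0.25, 0.5, 0.25)):
    """Valid 3-tap 1D convolution: output has len(xs) - 2 elements."""
    return [
        taps[0] * xs[i] + taps[1] * xs[i + 1] + taps[2] * xs[i + 2]
        for i in range(len(xs) - 2)
    ]


def two_layers(xs):
    """Run both layers over the whole input; the intermediate is full size."""
    return conv3(conv3(xs))


def two_layers_tiled(xs, tile=4):
    """Run both layers one output tile at a time.

    Each tile of `tile` outputs needs tile + 4 inputs (a halo of 2 per layer),
    so the intermediate never exceeds tile + 2 elements.
    """
    out_len = len(xs) - 4
    out = []
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        window = xs[start:stop + 4]
        out.extend(conv3(conv3(window)))
    return out


signal = [float(i * i % 7) for i in range(16)]
print(two_layers(signal) == two_layers_tiled(signal))
```

The tiled version computes the same values as the untiled one, because each output depends only on the five inputs inside its window.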






## Memory Scheduling

Minimizes memory footprint and reuses memory for tensors efficiently

- Reduces memory footprint and improves memory re-use
- Two tensors with disjoint lifetimes can share the same memory
- Becomes an instance of the "dynamic storage allocation problem"
- Similar to traditional register allocation

Figure: Tensors A through E assigned to shared memory blocks.
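Sharing memory between tensors with disjoint lifetimes can be sketched with a greedy first-fit heuristic. This is one simple approach to the allocation problem, not necessarily what any particular compiler does, and it does not guarantee an optimal footprint. The `Tensor` class, the sizes, and the lifetimes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size: int    # bytes
    first: int   # index of the first layer that uses the tensor
    last: int    # index of the last layer that uses the tensor


def lifetimes_overlap(a, b):
    return a.first <= b.last and b.first <= a.last


def assign_offsets(tensors):
    """Greedy first-fit: place larger tensors first at the lowest free offset."""
    placed = {}  # Tensor -> byte offset
    for t in sorted(tensors, key=lambda t: -t.size):
        # Address ranges already taken by tensors alive at the same time.
        busy = sorted(
            (off, off + other.size)
            for other, off in placed.items()
            if lifetimes_overlap(t, other)
        )
        offset = 0
        for lo, hi in busy:
            if offset + t.size <= lo:
                break  # the gap before this range is large enough
            offset = max(offset, hi)
        placed[t] = offset
    return {t.name: off for t, off in placed.items()}


# Hypothetical tensors; sizes and lifetimes are made up for illustration.
tensors = [
    Tensor("A", 4096, 0, 1),
    Tensor("B", 2048, 1, 2),
    Tensor("C", 4096, 2, 3),
    Tensor("D", 1024, 3, 4),
]
offsets = assign_offsets(tensors)
peak = max(offsets[t.name] + t.size for t in tensors)
print(offsets, peak, sum(t.size for t in tensors))
```

In this example A and C never live at the same time, so they share offset 0, and B and D share the region above them; the peak footprint is well below the sum of all tensor sizes.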










## Kernel Selection

Selects best data layers and algorithms based on the target platform

- Specialized kernels optimized for every operation
- Combination of static and dynamically generated kernels
- Kernel selection uses timing information to choose the combination of formats, precisions, and implementations that minimizes a network property
- Strives for the best performance for a specific deployment platform and a specific neural network
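Timing-based selection can be sketched as running every candidate implementation on a representative input and keeping the fastest. The three "kernels" below are hypothetical pure-Python stand-ins that compute the same result in different ways; a real system would time GPU kernels across formats and precisions.

```python
import time


def sum_squares_loop(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total


def sum_squares_builtin(xs):
    return sum(x * x for x in xs)


def sum_squares_map(xs):
    return sum(map(lambda x: x * x, xs))


def select_kernel(candidates, sample_input, repeats=5):
    """Time every candidate on a sample input and return the fastest one."""
    timings = {}
    for kernel in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(sample_input)
            best = min(best, time.perf_counter() - start)
        timings[kernel.__name__] = best
    fastest = min(candidates, key=lambda k: timings[k.__name__])
    return fastest, timings


candidates = [sum_squares_loop, sum_squares_builtin, sum_squares_map]
kernel, timings = select_kernel(candidates, [float(i) for i in range(10_000)])
print(kernel.__name__, timings)
```

Because the winner depends on measured time, the same network can end up with different kernels on different platforms, which is why selection is done for a specific deployment target.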







## Multi-Device Segmentation

How to run large language models?

- Single Device Model
- Multi Device Model



## Execution







## Execution Modes



## Multi-Stream Concurrent Execution

Uses a scalable design to process multiple input streams in parallel

- Better performance and improved utilization through multi-stream concurrent execution
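Processing several input streams in parallel can be sketched with a thread pool, where each worker thread stands in for one execution stream. The `infer` function and the stream names are hypothetical placeholders; on a GPU the concurrency would come from device streams rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor


def infer(frame):
    """Stand-in for running the network on one input (hypothetical model)."""
    return sum(frame) / len(frame)


def run_stream(frames):
    """Process one input stream in order."""
    return [infer(frame) for frame in frames]


# Three hypothetical input streams, each with its own sequence of frames.
streams = {
    "camera0": [[1, 2, 3], [4, 5, 6]],
    "camera1": [[10, 20], [30, 40], [50, 60]],
    "camera2": [[7, 7, 7, 7]],
}

# One worker per stream so the streams progress concurrently.
with ThreadPoolExecutor(max_workers=len(streams)) as pool:
    futures = {name: pool.submit(run_stream, frames) for name, frames in streams.items()}
    results = {name: future.result() for name, future in futures.items()}

print(results)
```

Ordering is preserved within each stream, while work from different streams can overlap and keep the device busier than a single stream would.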





## Activations — Async Buffering

- Relies on multiple buffers to pipeline execution
- Requires notification of when a buffer is ready
- Allows pipelining of execution and copying inputs for next iteration
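The three points above can be sketched with two buffers and a pair of queues: one queue hands free buffers to the copy stage, and the other notifies the execution stage that a buffer is ready. Threads and list copies stand in for asynchronous device copies and network execution, so this is an illustration of the pattern rather than a device-level implementation.

```python
import queue
import threading


def copy_inputs(batches, free_buffers, ready_buffers):
    """Producer: fill a free buffer, then notify the consumer that it is ready."""
    for batch in batches:
        buffer = free_buffers.get()      # wait for a buffer that is not in use
        buffer[:] = batch                # stand-in for the host-to-device copy
        ready_buffers.put(buffer)        # notification: buffer is ready
    ready_buffers.put(None)              # no more input


def execute(free_buffers, ready_buffers, outputs):
    """Consumer: run on a ready buffer while the producer fills the other one."""
    while True:
        buffer = ready_buffers.get()
        if buffer is None:
            break
        outputs.append(sum(buffer))      # stand-in for network execution
        free_buffers.put(buffer)         # hand the buffer back for reuse


free_buffers, ready_buffers = queue.Queue(), queue.Queue()
for _ in range(2):                       # two buffers: double buffering
    free_buffers.put([0, 0, 0])

batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
outputs = []
producer = threading.Thread(target=copy_inputs, args=(batches, free_buffers, ready_buffers))
producer.start()
execute(free_buffers, ready_buffers, outputs)
producer.join()
print(outputs)
```

With two buffers, the copy for iteration i+1 can proceed while iteration i executes; a buffer is only overwritten after the execution stage has handed it back.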








## Delivering High Performance Across Frameworks

NVIDIA Triton's architecture

Figure: Triton architecture diagram showing the Model Analyzer; backends for Custom, PyTorch, ONNX, TensorFlow, and OpenVINO; and utilization, throughput, and latency metrics.









## AI Inference Workflow

Collaboration between multiple teams





