TensorRT Advanced Optimization Techniques for NVIDIA Jetson: A Comprehensive Technical Guide
Deep dive into dynamic shapes, custom plugins, INT8 calibration, sparsity, and production-grade inference optimization
Plain English Summary
What is TensorRT?
TensorRT is NVIDIA's secret weapon for making AI models run fast. Think of it as a translator and optimizer: it takes your trained model and rewrites it to run dramatically faster on NVIDIA hardware, with NVIDIA citing speedups of up to 40x over CPU-only inference.
Why is this important?
| Without TensorRT | With TensorRT |
|---|---|
| 5 frames/second | 120 frames/second |
| Model won't fit in memory | Model runs smoothly |
| GPU power wasted on generic kernels | Uses the GPU efficiently |
| Generic code | Hardware-optimized code |
Key concepts explained simply:
| Concept | Simple Explanation | Real-World Analogy |
|---|---|---|
| Dynamic Shapes | Model handles different input sizes | A parking spot that fits any car size |
| INT8 Calibration | Teaching the model to use smaller numbers accurately | Training someone to estimate weights instead of using a scale |
| Custom Plugins | Adding your own special operations | Adding custom apps to your phone |
| Sparsity | Skipping unnecessary calculations | Not reading blank pages in a book |
| Multi-Profile | Different settings for different situations | Sport mode vs eco mode in a car |
Typical performance gains (illustrative numbers; actual results depend on the model and Jetson module):
| Model Type | FP32 | FP16 | INT8 | Sparse INT8 |
|---|---|---|---|---|
| YOLOv8 | 15 FPS | 45 FPS | 95 FPS | 120 FPS |
| ResNet-50 | 100 FPS | 250 FPS | 500 FPS | 650 FPS |
What will you learn?
- Handle variable batch sizes and image resolutions
- Build custom operations when standard ones aren't enough
- Calibrate INT8 models for maximum speed with minimal accuracy loss
- Use structured sparsity for 2x additional speedup
- Profile and debug performance issues
The bottom line: TensorRT is essential for production AI on Jetson. This guide takes you from basic optimization to advanced techniques used by NVIDIA themselves.
Table of Contents
- Introduction
- Dynamic Shapes and Optimization Profiles
- Custom Layer Implementation with IPluginV3
- Plugin Development Best Practices
- INT8 Calibration Strategies
- Sparsity and Structured Pruning
- Multi-Profile Engines for Variable Workloads
- TensorRT-LLM and Edge-LLM for Jetson
- Streaming and Async Inference
- Memory Pooling and Allocation Strategies
- Profiling with Nsight Systems
- Benchmark Comparisons
- Conclusion
Introduction
NVIDIA TensorRT is the premier SDK for high-performance deep learning inference, delivering up to 40x faster inference compared to CPU-only platforms. For edge deployments on NVIDIA Jetson devices (AGX Orin, Orin Nano, and AGX Thor), mastering advanced TensorRT optimization techniques is essential for production-grade AI applications.
The Jetson AGX Orin delivers up to 170 INT8 Sparse TOPS with Tensor Cores and 85 FP16 TFLOPS, making it a powerful platform for real-time inference in robotics, automotive, and industrial applications. This guide covers the advanced techniques needed to fully exploit this hardware capability.
Dynamic Shapes and Optimization Profiles
Understanding Dynamic Shapes
When working with variable input dimensions (batch size, sequence length, image resolution), TensorRT requires optimization profiles that specify permitted dimension ranges at build time.
import tensorrt as trt
def build_engine_with_dynamic_shapes(onnx_path: str, engine_path: str):
"""Build TensorRT engine with dynamic batch size support."""
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
# Parse ONNX model
with open(onnx_path, 'rb') as f:
if not parser.parse(f.read()):
for error in range(parser.num_errors):
print(parser.get_error(error))
return None
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30) # 1GB
# Create optimization profile for dynamic shapes
profile = builder.create_optimization_profile()
# Input tensor: [batch, channels, height, width]
# Define min, optimal, and max shapes
profile.set_shape(
"input",
min=(1, 3, 224, 224), # Minimum shape
opt=(8, 3, 224, 224), # Optimal shape for auto-tuning
max=(32, 3, 224, 224) # Maximum shape
)
config.add_optimization_profile(profile)
# Build serialized engine
serialized_engine = builder.build_serialized_network(network, config)
with open(engine_path, 'wb') as f:
f.write(serialized_engine)
    return serialized_engine
Shape Tensors and Data-Dependent Shapes
For models with data-dependent output shapes (like NonMaxSuppression or NonZero operators), TensorRT requires special handling through shape tensors:
# Setting shape tensor values at runtime
profile.set_shape_input(
"shape_input",
min=(1,),
opt=(4,),
max=(16,)
)
Multiple Optimization Profiles
For variable workloads, create multiple profiles optimized for different input ranges:
def create_multi_profile_engine():
"""Create engine with multiple optimization profiles for different batch sizes."""
config = builder.create_builder_config()
# Profile 1: Small batches (real-time inference)
profile1 = builder.create_optimization_profile()
profile1.set_shape("input", min=(1, 3, 224, 224), opt=(1, 3, 224, 224), max=(4, 3, 224, 224))
config.add_optimization_profile(profile1)
# Profile 2: Medium batches (balanced throughput/latency)
profile2 = builder.create_optimization_profile()
profile2.set_shape("input", min=(4, 3, 224, 224), opt=(8, 3, 224, 224), max=(16, 3, 224, 224))
config.add_optimization_profile(profile2)
# Profile 3: Large batches (maximum throughput)
profile3 = builder.create_optimization_profile()
profile3.set_shape("input", min=(16, 3, 224, 224), opt=(32, 3, 224, 224), max=(64, 3, 224, 224))
config.add_optimization_profile(profile3)
    return config
At runtime, select the appropriate profile based on actual input dimensions:
# Select the profile that matches the incoming batch, then set the concrete input shape
context.set_optimization_profile_async(1, cuda_stream)  # args: profile index, CUDA stream handle
context.set_input_shape("input", (8, 3, 224, 224))      # actual shape within that profile's range
Custom Layer Implementation with IPluginV3
The IPluginV3 Interface
Starting with TensorRT 10.0, IPluginV3 is the only recommended plugin interface. It provides three capability interfaces:
- IPluginV3OneCore: Plugin attributes common to build and runtime
- IPluginV3OneBuild: Build-time capabilities
- IPluginV3OneRuntime: Runtime execution capabilities
C++ Implementation Example
#include "NvInferPlugin.h"
#include <cuda_runtime.h>
class CustomActivationPlugin : public nvinfer1::IPluginV3,
public nvinfer1::IPluginV3OneCore,
public nvinfer1::IPluginV3OneBuild,
public nvinfer1::IPluginV3OneRuntime
{
public:
// IPluginV3 interface
nvinfer1::IPluginCapability* getCapabilityInterface(
nvinfer1::PluginCapabilityType type) noexcept override
{
if (type == nvinfer1::PluginCapabilityType::kCORE) {
return static_cast<IPluginV3OneCore*>(this);
}
if (type == nvinfer1::PluginCapabilityType::kBUILD) {
return static_cast<IPluginV3OneBuild*>(this);
}
if (type == nvinfer1::PluginCapabilityType::kRUNTIME) {
return static_cast<IPluginV3OneRuntime*>(this);
}
return nullptr;
}
// IPluginV3OneCore interface
char const* getPluginName() const noexcept override { return "CustomActivation"; }
char const* getPluginVersion() const noexcept override { return "1"; }
char const* getPluginNamespace() const noexcept override { return ""; }
// IPluginV3OneBuild interface
int32_t getNbOutputs() const noexcept override { return 1; }
int32_t configurePlugin(
nvinfer1::DynamicPluginTensorDesc const* in, int32_t nbInputs,
nvinfer1::DynamicPluginTensorDesc const* out, int32_t nbOutputs) noexcept override
{
mInputDims = in[0].desc.dims;
return 0;
}
bool supportsFormatCombination(
int32_t pos, nvinfer1::DynamicPluginTensorDesc const* inOut,
int32_t nbInputs, int32_t nbOutputs) noexcept override
{
// Support FP32 and FP16
bool valid = inOut[pos].desc.format == nvinfer1::PluginFormat::kLINEAR;
valid &= (inOut[pos].desc.type == nvinfer1::DataType::kFLOAT ||
inOut[pos].desc.type == nvinfer1::DataType::kHALF);
return valid;
}
    // IPluginV3OneBuild uses getOutputDataTypes/getOutputShapes (the
    // getOutputDimensions signature belongs to the older IPluginV2DynamicExt)
    int32_t getOutputDataTypes(
        nvinfer1::DataType* outputTypes, int32_t nbOutputs,
        nvinfer1::DataType const* inputTypes, int32_t nbInputs) const noexcept override
    {
        outputTypes[0] = inputTypes[0]; // Output type matches input type
        return 0;
    }
    int32_t getOutputShapes(
        nvinfer1::DimsExprs const* inputs, int32_t nbInputs,
        nvinfer1::DimsExprs const* shapeInputs, int32_t nbShapeInputs,
        nvinfer1::DimsExprs* outputs, int32_t nbOutputs,
        nvinfer1::IExprBuilder& exprBuilder) noexcept override
    {
        outputs[0] = inputs[0]; // Same shape as input
        return 0;
    }
// IPluginV3OneRuntime interface
int32_t enqueue(
nvinfer1::PluginTensorDesc const* inputDesc,
nvinfer1::PluginTensorDesc const* outputDesc,
void const* const* inputs, void* const* outputs,
void* workspace, cudaStream_t stream) noexcept override
{
// Launch custom CUDA kernel
int32_t numElements = 1;
for (int i = 0; i < inputDesc[0].dims.nbDims; ++i) {
numElements *= inputDesc[0].dims.d[i];
}
if (inputDesc[0].type == nvinfer1::DataType::kFLOAT) {
customActivationKernel<float><<<
(numElements + 255) / 256, 256, 0, stream>>>(
static_cast<const float*>(inputs[0]),
static_cast<float*>(outputs[0]),
numElements
);
        }
        // A matching kHALF (__half) launch would go here; omitted for brevity.
        return 0;
    }

    // Note: a production IPluginV3 plugin must also implement clone(),
    // onShapeChange(), attachToContext(), and getFieldsToSerialize();
    // they are omitted here to keep the example focused.

private:
    nvinfer1::Dims mInputDims;
};
Plugin Creator Registration
class CustomActivationPluginCreator : public nvinfer1::IPluginCreatorV3One
{
public:
    char const* getPluginName() const noexcept override { return "CustomActivation"; }
    char const* getPluginVersion() const noexcept override { return "1"; }
    char const* getPluginNamespace() const noexcept override { return ""; }
    nvinfer1::PluginFieldCollection const* getFieldNames() noexcept override
    {
        return &mFieldCollection; // No plugin fields in this example
    }
    nvinfer1::IPluginV3* createPlugin(
        char const* name, nvinfer1::PluginFieldCollection const* fc,
        nvinfer1::TensorRTPhase phase) noexcept override
    {
        return new CustomActivationPlugin();
    }
private:
    nvinfer1::PluginFieldCollection mFieldCollection{0, nullptr};
};
// Register plugin
REGISTER_TENSORRT_PLUGIN(CustomActivationPluginCreator);
INT8 Calibration Strategies
Calibration Algorithm Comparison
| Algorithm | Best For | Accuracy | Speed |
|---|---|---|---|
| MinMax | NLP models, stable distributions | Good | Fast |
| Entropy (v2) | CNNs, general-purpose | Best | Medium |
| Percentile | Data with outliers | Good | Fast |
| MSE | High-precision requirements | Excellent | Slow |
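In the TensorRT Python API, the algorithm you get is determined by which calibrator base class you derive from; the data-feeding methods you implement (get_batch, get_batch_size, and the cache read/write hooks) are identical in every case. A minimal sketch with placeholder class names, assuming a recent TensorRT release; the full entropy implementation follows below:
import tensorrt as trt

# Only the base class changes; the methods you implement stay the same.
class MyEntropyCalibrator(trt.IInt8EntropyCalibrator2):   # Entropy (v2): CNNs, general purpose
    pass

class MyMinMaxCalibrator(trt.IInt8MinMaxCalibrator):      # MinMax: NLP, stable distributions
    pass

class MyLegacyCalibrator(trt.IInt8LegacyCalibrator):      # Legacy: percentile-style calibration
    pass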
Python Calibrator Implementation
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pathlib import Path
class Int8EntropyCalibrator(trt.IInt8EntropyCalibrator2):
"""INT8 calibration using entropy algorithm."""
def __init__(self, data_loader, cache_file: str = "calibration.cache"):
trt.IInt8EntropyCalibrator2.__init__(self)
self.data_loader = data_loader
self.cache_file = Path(cache_file)
self.batch_size = data_loader.batch_size
self.current_index = 0
# Allocate GPU memory for calibration batch
self.device_input = cuda.mem_alloc(
self.batch_size * 3 * 224 * 224 * np.float32().itemsize
)
# Pre-load calibration data
self.calibration_data = list(data_loader)
def get_batch_size(self):
return self.batch_size
def get_batch(self, names):
"""Return a batch for calibration."""
if self.current_index >= len(self.calibration_data):
return None
batch = self.calibration_data[self.current_index]
self.current_index += 1
# Handle different input types (torch tensors, numpy arrays)
if hasattr(batch, 'numpy'):
batch = batch.numpy()
# Ensure contiguous memory layout
batch = np.ascontiguousarray(batch.astype(np.float32))
cuda.memcpy_htod(self.device_input, batch)
return [int(self.device_input)]
def read_calibration_cache(self):
"""Load cached calibration data if available."""
if self.cache_file.exists():
with open(self.cache_file, 'rb') as f:
return f.read()
return None
def write_calibration_cache(self, cache):
"""Save calibration data for reuse."""
with open(self.cache_file, 'wb') as f:
f.write(cache)
print(f"Calibration cache saved to {self.cache_file}")
def build_int8_engine(onnx_path: str, calibrator, engine_path: str):
"""Build INT8 TensorRT engine with calibration."""
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open(onnx_path, 'rb') as f:
parser.parse(f.read())
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)
# Enable INT8 mode
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = calibrator
# Also enable FP16 for layers that don't benefit from INT8
config.set_flag(trt.BuilderFlag.FP16)
# Create optimization profile
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
config.add_optimization_profile(profile)
# Build and serialize
serialized_engine = builder.build_serialized_network(network, config)
with open(engine_path, 'wb') as f:
f.write(serialized_engine)
    return serialized_engine
Calibration Data Preparation Script
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from PIL import Image
import glob
class CalibrationDataset(Dataset):
"""Dataset for INT8 calibration - use representative samples."""
def __init__(self, image_dir: str, num_samples: int = 500):
self.images = glob.glob(f"{image_dir}/*.jpg")[:num_samples]
self.transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
])
def __len__(self):
return len(self.images)
def __getitem__(self, idx):
img = Image.open(self.images[idx]).convert('RGB')
return self.transform(img)
def create_calibration_dataloader(image_dir: str, batch_size: int = 8):
"""Create dataloader for calibration."""
dataset = CalibrationDataset(image_dir, num_samples=500)
return DataLoader(
dataset,
batch_size=batch_size,
shuffle=True,
num_workers=4,
pin_memory=True
)
# Usage example
if __name__ == "__main__":
cal_loader = create_calibration_dataloader("/path/to/calibration/images")
calibrator = Int8EntropyCalibrator(cal_loader, "resnet50_int8.cache")
    engine = build_int8_engine("resnet50.onnx", calibrator, "resnet50_int8.engine")
Sparsity and Structured Pruning
2:4 Structured Sparsity
The NVIDIA Ampere architecture (including Jetson AGX Orin) supports 2:4 structured sparsity through Sparse Tensor Cores, achieving up to 2x throughput for qualifying layers.
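Pruning the weights is only half of the job: TensorRT must also be told to look for the pattern when it builds the engine. A minimal sketch of the build-time switch (assuming a standard builder config; layers whose weights do not comply silently fall back to dense kernels):
import tensorrt as trt

def enable_sparse_tensor_cores(config: trt.IBuilderConfig) -> None:
    """Allow TensorRT to pick Sparse Tensor Core kernels for 2:4-compliant weights."""
    config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)

The trtexec equivalent is --sparsity=enable (or --sparsity=force to have trtexec prune the weights itself, useful for quick performance experiments). The PyTorch-level pruning below produces weights that satisfy the pattern: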
import torch
from torch.nn.utils import prune
def apply_2_4_sparsity(model: torch.nn.Module):
"""Apply 2:4 structured sparsity pattern to model weights."""
def apply_structured_pruning(tensor: torch.Tensor) -> torch.Tensor:
"""Prune 2 smallest values in every 4 consecutive elements."""
shape = tensor.shape
flat = tensor.flatten()
# Reshape to groups of 4
num_groups = flat.numel() // 4
groups = flat[:num_groups * 4].reshape(-1, 4)
# Get indices of 2 smallest per group
_, indices = torch.topk(groups.abs(), k=2, dim=1, largest=False)
# Create mask
mask = torch.ones_like(groups)
mask.scatter_(1, indices, 0)
# Apply mask
result = flat.clone()
result[:num_groups * 4] = (groups * mask).flatten()
return result.reshape(shape)
for name, module in model.named_modules():
if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
with torch.no_grad():
module.weight.data = apply_structured_pruning(module.weight.data)
return model
def verify_sparsity(model: torch.nn.Module) -> dict:
"""Verify 2:4 sparsity pattern compliance."""
results = {}
for name, module in model.named_modules():
if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
weight = module.weight.data.flatten()
num_groups = weight.numel() // 4
groups = weight[:num_groups * 4].reshape(-1, 4)
# Count zeros per group
zeros_per_group = (groups == 0).sum(dim=1)
compliant = (zeros_per_group >= 2).float().mean().item()
results[name] = {
'sparsity': (weight == 0).float().mean().item(),
'2:4_compliance': compliant
}
    return results
NVIDIA Model Optimizer for Sparsity
# Using NVIDIA Model Optimizer for production sparsity
import modelopt.torch.sparsity as mts
def apply_modelopt_sparsity(model, data_loader):
"""Apply 2:4 sparsity using NVIDIA Model Optimizer."""
# Configure sparsity
sparsity_config = {
"data_loader": data_loader,
"collect_func": lambda x: x[0], # Extract input from batch
}
# Apply sparsity
    # Mode names follow recent Model Optimizer releases ("sparse_magnitude" or
    # "sparsegpt"); both target the 2:4 pattern expected by Sparse Tensor Cores.
    sparse_model = mts.sparsify(
        model,
        mode="sparse_magnitude",
        config=sparsity_config
    )
# Fine-tune sparse model
# ... training loop ...
# Export for TensorRT
sparse_model = mts.export(sparse_model)
    return sparse_model
Multi-Profile Engines for Variable Workloads
Production Configuration
class MultiProfileEngine:
"""Manage TensorRT engine with multiple optimization profiles."""
def __init__(self, engine_path: str):
self.logger = trt.Logger(trt.Logger.WARNING)
with open(engine_path, 'rb') as f:
self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(f.read())
self.contexts = {}
self.streams = {}
        # Create one execution context and one CUDA stream per profile
        for i in range(self.engine.num_optimization_profiles):
            stream = cuda.Stream()
            ctx = self.engine.create_execution_context()
            # Bind each context to its profile on that context's own stream
            ctx.set_optimization_profile_async(i, stream.handle)
            self.contexts[i] = ctx
            self.streams[i] = stream
def select_profile(self, batch_size: int) -> int:
"""Select optimal profile based on batch size."""
# Profile selection logic based on your profile configuration
if batch_size <= 4:
return 0 # Low-latency profile
elif batch_size <= 16:
return 1 # Balanced profile
else:
return 2 # High-throughput profile
def infer(self, input_data: np.ndarray) -> np.ndarray:
"""Run inference with automatic profile selection."""
batch_size = input_data.shape[0]
profile_idx = self.select_profile(batch_size)
context = self.contexts[profile_idx]
stream = self.streams[profile_idx]
# Set input shape for selected profile
context.set_input_shape("input", input_data.shape)
# Allocate buffers and run inference
# ... buffer allocation and execution ...
        return output_data
trtexec Multi-Profile Build
# Build an engine with trtexec. Note: repeating --minShapes/--optShapes/--maxShapes
# does not create multiple profiles in most trtexec versions; the later flags simply
# override the earlier ones. For true multi-profile engines, use the builder API shown
# above, or check `trtexec --help` on your JetPack release for per-profile shape options.
trtexec \
    --onnx=model.onnx \
    --saveEngine=model_multiprofile.engine \
    --minShapes=input:1x3x224x224 \
    --optShapes=input:16x3x224x224 \
    --maxShapes=input:64x3x224x224 \
    --fp16 \
    --int8 \
    --calib=calibration.cache \
    --memPoolSize=workspace:4096M \
    --verbose
TensorRT-LLM and Edge-LLM for Jetson
TensorRT Edge-LLM Overview
NVIDIA introduced TensorRT Edge-LLM in JetPack 7.1, specifically designed for LLM and VLM inference on Jetson platforms:
# Illustrative configuration only: TensorRT Edge-LLM is a C++ framework, so treat
# this Python-style config as pseudocode for the options it exposes
from tensorrt_edge_llm import EdgeLLMEngine
def setup_edge_llm():
"""Configure TensorRT Edge-LLM for Jetson deployment."""
config = {
"model_path": "llama-3.2-3b-instruct",
"quantization": "nvfp4", # NVFP4 for memory efficiency
"kv_cache_config": {
"max_tokens": 4096,
"page_size": 64
},
"speculative_decoding": {
"enabled": True,
"algorithm": "eagle3"
}
}
engine = EdgeLLMEngine(config)
    return engine
TensorRT-LLM on Jetson AGX Orin
# Install TensorRT-LLM for Jetson (JetPack 6.1+)
git clone -b v0.12.0-jetson https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
# Build a Llama engine for Jetson. Note: the build.py flow below matches older
# TensorRT-LLM releases; newer releases split this into convert_checkpoint.py
# followed by trtllm-build, so check the README of the branch you cloned.
python examples/llama/build.py \
--model_dir ./llama-3.2-3b \
--output_dir ./llama_trt \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--max_batch_size 4 \
--max_input_len 512 \
--max_output_len 256 \
    --paged_kv_cache
Memory-Efficient LLM Configuration
# TensorRT-LLM memory configuration for Jetson. Depending on the TensorRT-LLM version,
# kv_cache_config and build_config may need to be KvCacheConfig / BuildConfig objects
# rather than plain dicts.
from tensorrt_llm import LLM, SamplingParams
def configure_llm_for_jetson():
"""Configure TensorRT-LLM for memory-constrained Jetson devices."""
llm = LLM(
model="meta-llama/Llama-3.2-3B-Instruct",
tensor_parallel_size=1,
kv_cache_config={
"free_gpu_memory_fraction": 0.85, # Reserve 15% for other ops
"enable_block_reuse": True
},
build_config={
"max_batch_size": 4,
"max_num_tokens": 2048,
"plugin_config": {
"paged_kv_cache": True,
"remove_input_padding": True,
"context_fmha": True
}
}
)
    return llm
Streaming and Async Inference
CUDA Streams for Pipelined Inference
import tensorrt as trt
import pycuda.driver as cuda
import numpy as np
from threading import Thread
from queue import Queue
class AsyncTensorRTInference:
"""Asynchronous TensorRT inference with CUDA streams."""
def __init__(self, engine_path: str, num_streams: int = 2):
self.logger = trt.Logger(trt.Logger.WARNING)
with open(engine_path, 'rb') as f:
self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(f.read())
# Create multiple streams for pipelining
self.num_streams = num_streams
self.streams = [cuda.Stream() for _ in range(num_streams)]
self.contexts = [self.engine.create_execution_context() for _ in range(num_streams)]
# Pre-allocate buffers per stream
self.buffers = []
for _ in range(num_streams):
self.buffers.append(self._allocate_buffers())
self.current_stream = 0
self.result_queue = Queue()
def _allocate_buffers(self):
"""Allocate input/output buffers."""
buffers = {}
for i in range(self.engine.num_io_tensors):
name = self.engine.get_tensor_name(i)
dtype = trt.nptype(self.engine.get_tensor_dtype(name))
shape = self.engine.get_tensor_shape(name)
            # Dynamic dimensions come back as -1; use 1 as a placeholder here. For real
            # dynamic-shape engines, size buffers from the profile's maximum shape
            # (engine.get_tensor_profile_shape) instead.
            shape = tuple(max(1, s) for s in shape)
size = int(np.prod(shape) * np.dtype(dtype).itemsize)
buffers[name] = {
'host': cuda.pagelocked_empty(shape, dtype),
'device': cuda.mem_alloc(size),
'shape': shape
}
return buffers
def infer_async(self, input_data: np.ndarray, callback=None):
"""Queue async inference request."""
stream_idx = self.current_stream
self.current_stream = (self.current_stream + 1) % self.num_streams
stream = self.streams[stream_idx]
context = self.contexts[stream_idx]
buffers = self.buffers[stream_idx]
# Copy input to device asynchronously
np.copyto(buffers['input']['host'], input_data)
cuda.memcpy_htod_async(
buffers['input']['device'],
buffers['input']['host'],
stream
)
# Set tensor addresses
for name, buf in buffers.items():
context.set_tensor_address(name, int(buf['device']))
# Execute inference
context.execute_async_v3(stream.handle)
# Copy output back asynchronously
cuda.memcpy_dtoh_async(
buffers['output']['host'],
buffers['output']['device'],
stream
)
        # Register a completion callback. Note: Stream.add_callback is not exposed in
        # every PyCUDA build; polling stream.is_done() is a portable fallback.
        if callback:
            stream.add_callback(
                lambda s, e, d: callback(buffers['output']['host'].copy()),
                None
            )
return stream_idx
def synchronize(self, stream_idx: int = None):
"""Wait for inference completion."""
if stream_idx is not None:
self.streams[stream_idx].synchronize()
else:
for stream in self.streams:
                stream.synchronize()
CUDA Graphs for Reduced Launch Overhead
def capture_cuda_graph(context, stream, input_buffer, output_buffer):
    """Capture inference as a CUDA graph to reduce kernel-launch overhead.

    Note: PyCUDA does not expose the graph-capture API; the cu* calls below are
    pseudocode for the CUDA driver API (cuStreamBeginCapture / cuStreamEndCapture /
    cuGraphInstantiate / cuGraphLaunch). In Python, use the cuda-python bindings,
    whose calls return (error, result) tuples, or simply pass --useCudaGraph to trtexec.
    """
    # Warm-up run so TensorRT finishes lazy initialization before capture
    context.execute_async_v3(stream.handle)
    stream.synchronize()
    # Begin capture
    cuda.cuStreamBeginCapture(stream.handle, cuda.CU_STREAM_CAPTURE_MODE_GLOBAL)
    # Execute inference (this gets recorded into the graph, not executed)
    context.execute_async_v3(stream.handle)
    # End capture and instantiate an executable graph
    graph = cuda.cuStreamEndCapture(stream.handle)
    graph_exec = cuda.cuGraphInstantiate(graph)
    return graph_exec

def run_with_cuda_graph(graph_exec, stream):
    """Execute the captured CUDA graph."""
    cuda.cuGraphLaunch(graph_exec, stream.handle)
    stream.synchronize()
Memory Pooling and Allocation Strategies
Custom GPU Allocator
import tensorrt as trt
import pycuda.driver as cuda
class PooledGpuAllocator(trt.IGpuAllocator):
"""Custom GPU allocator with memory pooling."""
def __init__(self, pool_size: int = 1 << 30): # 1GB default
super().__init__()
self.pool_size = pool_size
self.pool = cuda.mem_alloc(pool_size)
self.offset = 0
self.allocations = {}
def allocate(self, size: int, alignment: int, flags: int) -> int:
"""Allocate from pool with alignment."""
# Align offset
aligned_offset = (self.offset + alignment - 1) & ~(alignment - 1)
if aligned_offset + size > self.pool_size:
            # Pool exhausted: fall back to a dedicated allocation. Keep the
            # DeviceAllocation object alive, otherwise PyCUDA frees the memory
            # as soon as it goes out of scope.
            alloc = cuda.mem_alloc(size)
            self.allocations[int(alloc)] = ('external', size, alloc)
            return int(alloc)
        ptr = int(self.pool) + aligned_offset
        self.allocations[ptr] = ('pool', size, None)
        self.offset = aligned_offset + size
        return ptr
    def deallocate(self, ptr: int) -> bool:
        """Deallocate memory."""
        if ptr in self.allocations:
            alloc_type, size, alloc = self.allocations.pop(ptr)
            if alloc_type == 'external':
                alloc.free()  # Explicitly release the fallback allocation
            # Pool allocations are not freed individually; the pool itself is reused
            return True
        return False
def reallocate(self, ptr: int, size: int, alignment: int) -> int:
"""Reallocate memory."""
self.deallocate(ptr)
return self.allocate(size, alignment, 0)
# Usage with builder
def build_with_custom_allocator():
config = builder.create_builder_config()
allocator = PooledGpuAllocator(pool_size=2 << 30) # 2GB pool
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    # Note: the custom allocator attaches to the IRuntime (or IBuilder) object via its
    # gpu_allocator property, not to the builder config
    runtime = trt.Runtime(logger)
    runtime.gpu_allocator = allocator
Pre-allocated Output Buffers
class PreallocatedInference:
"""Inference with pre-allocated output buffers for reduced latency."""
def __init__(self, engine_path: str):
# ... engine loading ...
# Pre-allocate output buffers
self.output_buffers = []
for i in range(self.num_outputs):
shape = self.engine.get_tensor_shape(f"output_{i}")
dtype = trt.nptype(self.engine.get_tensor_dtype(f"output_{i}"))
# Allocate pinned memory for faster transfers
host_buf = cuda.pagelocked_empty(shape, dtype)
device_buf = cuda.mem_alloc(host_buf.nbytes)
self.output_buffers.append({
'host': host_buf,
'device': device_buf
})
def infer(self, input_data: np.ndarray) -> list:
"""Run inference with pre-allocated buffers."""
# Input is copied, but output buffers are reused
# This overlaps GPU execution with memory operations
# Previous output can be processed while new inference runs
        return [buf['host'] for buf in self.output_buffers]
Profiling with Nsight Systems
Command-Line Profiling
# Basic TensorRT profiling
nsys profile \
--trace=cuda,nvtx,osrt \
--gpu-metrics-device=all \
--output=tensorrt_profile \
python inference.py
# Profile specific iterations
export TLLM_PROFILE_START_STOP=10-20
nsys profile \
--trace=cuda,nvtx,cudnn,cublas \
-c cudaProfilerApi \
--output=trt_llm_profile \
python llm_inference.py
# Generate detailed report
nsys stats tensorrt_profile.nsys-rep --report cuda_gpu_trace
Programmatic Profiling
import tensorrt as trt
import ctypes
class TensorRTProfiler(trt.IProfiler):
"""Custom profiler for layer-by-layer timing."""
def __init__(self):
super().__init__()
self.layer_times = {}
self.total_time = 0
def report_layer_time(self, layer_name: str, ms: float):
"""Record layer execution time."""
if layer_name not in self.layer_times:
self.layer_times[layer_name] = []
self.layer_times[layer_name].append(ms)
self.total_time += ms
def print_summary(self):
"""Print profiling summary."""
print("\n" + "="*60)
print("TensorRT Layer Profiling Summary")
print("="*60)
sorted_layers = sorted(
self.layer_times.items(),
key=lambda x: sum(x[1]),
reverse=True
)
for name, times in sorted_layers[:20]:
avg_ms = sum(times) / len(times)
total_ms = sum(times)
pct = (total_ms / self.total_time) * 100
print(f"{name[:40]:40s} | {avg_ms:8.3f}ms | {pct:5.1f}%")
print("="*60)
print(f"Total time: {self.total_time:.3f}ms")
# Enable profiling
profiler = TensorRTProfiler()
context.profiler = profiler
# Run inference
for _ in range(100):
context.execute_async_v3(stream.handle)
stream.synchronize()
# Print results
profiler.print_summary()
trtexec Profiling
# Detailed layer profiling with trtexec
trtexec \
--loadEngine=model.engine \
--dumpProfile \
--dumpLayerInfo \
--profilingVerbosity=detailed \
--iterations=100 \
--avgRuns=100 \
--warmUp=1000 \
--useCudaGraph
# Export timing data
trtexec \
--loadEngine=model.engine \
--exportProfile=timing.json \
    --exportLayerInfo=layers.json
Benchmark Comparisons
Performance Comparison (Jetson AGX Orin 64GB, per-inference latency; lower is better)
| Model | FP32 | FP16 | INT8 | INT8 Sparse |
|---|---|---|---|---|
| ResNet-50 | 8.5ms | 3.2ms | 1.8ms | 1.2ms |
| YOLOv8s | 12.4ms | 4.8ms | 2.6ms | 1.9ms |
| YOLOv8x | 45.2ms | 18.6ms | 9.8ms | 6.7ms |
| BERT-Base | 24.3ms | 11.2ms | 6.4ms | 4.8ms |
| Llama-3.2-3B | 892ms/tok | 156ms/tok | N/A | 98ms/tok |
Benchmark Script
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import time
def benchmark_engine(engine_path: str, input_shape: tuple,
warmup: int = 50, iterations: int = 200):
"""Benchmark TensorRT engine performance."""
logger = trt.Logger(trt.Logger.WARNING)
with open(engine_path, 'rb') as f:
engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
stream = cuda.Stream()
    # Allocate buffers (assumes I/O tensors named "input" and "output")
    input_data = np.random.randn(*input_shape).astype(np.float32)
    d_input = cuda.mem_alloc(input_data.nbytes)
    # For dynamic-shape engines, the input shape must be set before the
    # output shape can be resolved
    context.set_input_shape("input", input_shape)
    output_shape = tuple(context.get_tensor_shape("output"))
    output_data = np.empty(output_shape, dtype=np.float32)
    d_output = cuda.mem_alloc(output_data.nbytes)
# Set tensor addresses
context.set_tensor_address("input", int(d_input))
context.set_tensor_address("output", int(d_output))
# Warm-up
cuda.memcpy_htod_async(d_input, input_data, stream)
for _ in range(warmup):
context.execute_async_v3(stream.handle)
stream.synchronize()
# Benchmark
latencies = []
for _ in range(iterations):
cuda.memcpy_htod_async(d_input, input_data, stream)
start = cuda.Event()
end = cuda.Event()
start.record(stream)
context.execute_async_v3(stream.handle)
end.record(stream)
stream.synchronize()
latencies.append(start.time_till(end))
# Statistics
latencies = np.array(latencies)
results = {
'mean_ms': np.mean(latencies),
'std_ms': np.std(latencies),
'min_ms': np.min(latencies),
'max_ms': np.max(latencies),
'p50_ms': np.percentile(latencies, 50),
'p95_ms': np.percentile(latencies, 95),
'p99_ms': np.percentile(latencies, 99),
'throughput_fps': 1000 / np.mean(latencies)
}
return results
def compare_precisions(onnx_path: str, input_shape: tuple):
"""Compare FP32, FP16, and INT8 performance."""
precisions = {
'FP32': {'fp16': False, 'int8': False},
'FP16': {'fp16': True, 'int8': False},
'INT8': {'fp16': True, 'int8': True}
}
results = {}
for name, config in precisions.items():
engine_path = f"model_{name.lower()}.engine"
# Build engine (simplified)
# ... engine building code ...
results[name] = benchmark_engine(engine_path, input_shape)
print(f"\n{name} Results:")
print(f" Mean Latency: {results[name]['mean_ms']:.2f}ms")
print(f" P99 Latency: {results[name]['p99_ms']:.2f}ms")
print(f" Throughput: {results[name]['throughput_fps']:.1f} FPS")
return results
if __name__ == "__main__":
results = compare_precisions(
"resnet50.onnx",
input_shape=(1, 3, 224, 224)
    )
Conclusion
Mastering TensorRT optimization for NVIDIA Jetson requires understanding the full optimization pipeline:
- Dynamic Shapes: Use optimization profiles to handle variable input dimensions while maintaining peak performance
- Custom Plugins: Implement IPluginV3 for unsupported operators with CUDA kernel optimization
- INT8 Calibration: Choose the right calibration strategy (Entropy v2 for CNNs, MinMax for NLP) with representative data
- 2:4 Sparsity: Leverage structured pruning for up to 2x throughput on Ampere/Orin architectures
- Multi-Profile Engines: Build engines optimized for different workload characteristics
- TensorRT-LLM/Edge-LLM: Deploy optimized LLMs on Jetson with paged KV cache and speculative decoding
- Async Inference: Use CUDA streams and graphs to maximize GPU utilization
- Memory Management: Implement custom allocators and pre-allocated buffers for consistent latency
- Profiling: Use Nsight Systems and trtexec to identify and resolve bottlenecks
For production deployments on Jetson AGX Orin, combining INT8 quantization with FP16 fallback, 2:4 sparsity, and CUDA graphs can achieve 5-10x speedup over naive PyTorch inference while maintaining model accuracy.
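As a compact recap, here is a hedged sketch of a builder configuration that combines those build-time switches (assuming a calibrator like the one built earlier; exact flag availability depends on your TensorRT/JetPack release):
import tensorrt as trt

def production_builder_config(builder: trt.Builder, calibrator) -> trt.IBuilderConfig:
    """Combine INT8, FP16 fallback, and 2:4 sparsity in one build configuration."""
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)  # 2GB workspace
    config.set_flag(trt.BuilderFlag.INT8)             # INT8 kernels where calibration allows
    config.int8_calibrator = calibrator
    config.set_flag(trt.BuilderFlag.FP16)             # FP16 fallback for INT8-unfriendly layers
    config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)   # Sparse Tensor Cores for 2:4 weights
    return config

CUDA graphs are applied at runtime rather than at build time; see the streaming section above, or pass --useCudaGraph to trtexec when benchmarking.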
References and Further Reading
- NVIDIA TensorRT Documentation
- TensorRT Best Practices
- Working with Dynamic Shapes
- Extending TensorRT with Custom Layers
- NVIDIA Model Optimizer
- TensorRT-LLM GitHub
- TensorRT Edge-LLM Documentation
- Accelerating Inference with Sparsity
- INT8 Calibration with TensorRT
- Nsight Systems User Guide
- TensorRT-LLM Performance Analysis
- Lei Mao's TensorRT Custom Plugin Example
- Jetson AGX Orin Performance Benchmarks
Last updated: January 2026