TensorRT Advanced Optimization Techniques for NVIDIA Jetson: A Comprehensive Technical Guide

Deep dive into dynamic shapes, custom plugins, INT8 calibration, sparsity, and production-grade inference optimization


Plain English Summary

What is TensorRT?

TensorRT is NVIDIA's secret weapon for making AI models run incredibly fast. Think of it as a translator and optimizer—it takes your AI model and rewrites it to run up to 40x faster on NVIDIA hardware.

Why is this important?

Without TensorRT | With TensorRT
5 frames/second | 120 frames/second
Model won't fit in memory | Model runs smoothly
Uses all the GPU power inefficiently | Uses the GPU efficiently
Generic code | Hardware-optimized code

Key concepts explained simply:

Concept | Simple Explanation | Real-World Analogy
Dynamic Shapes | Model handles different input sizes | A parking spot that fits any car size
INT8 Calibration | Teaching the model to use smaller numbers accurately | Training someone to estimate weights instead of using a scale
Custom Plugins | Adding your own special operations | Adding custom apps to your phone
Sparsity | Skipping unnecessary calculations | Not reading blank pages in a book
Multi-Profile | Different settings for different situations | Sport mode vs eco mode in a car

Performance gains you can expect:

Model Type | FP32 | FP16 | INT8 | Sparse INT8
YOLOv8 | 15 FPS | 45 FPS | 95 FPS | 120 FPS
ResNet-50 | 100 FPS | 250 FPS | 500 FPS | 650 FPS

What will you learn?

  1. Handle variable batch sizes and image resolutions
  2. Build custom operations when standard ones aren't enough
  3. Calibrate INT8 models for maximum speed with minimal accuracy loss
  4. Use structured sparsity for 2x additional speedup
  5. Profile and debug performance issues

The bottom line: TensorRT is essential for production AI on Jetson. This guide takes you from basic optimization to advanced techniques used by NVIDIA themselves.


Table of Contents

  1. Introduction
  2. Dynamic Shapes and Optimization Profiles
  3. Custom Layer Implementation with IPluginV3
  4. Plugin Development Best Practices
  5. INT8 Calibration Strategies
  6. Sparsity and Structured Pruning
  7. Multi-Profile Engines for Variable Workloads
  8. TensorRT-LLM and Edge-LLM for Jetson
  9. Streaming and Async Inference
  10. Memory Pooling and Allocation Strategies
  11. Profiling with Nsight Systems
  12. Benchmark Comparisons
  13. Conclusion

Introduction

NVIDIA TensorRT is the premier SDK for high-performance deep learning inference, delivering up to 40x faster inference compared to CPU-only platforms. For edge deployments on NVIDIA Jetson devices (AGX Orin, Orin Nano, and Thor), mastering advanced TensorRT optimization techniques is essential for production-grade AI applications.

The Jetson AGX Orin delivers up to 170 INT8 Sparse TOPS with Tensor Cores and 85 FP16 TFLOPS, making it a powerful platform for real-time inference in robotics, automotive, and industrial applications. This guide covers the advanced techniques needed to fully exploit this hardware capability.


Dynamic Shapes and Optimization Profiles

Understanding Dynamic Shapes

When working with variable input dimensions (batch size, sequence length, image resolution), TensorRT requires optimization profiles that specify permitted dimension ranges at build time.

import tensorrt as trt

def build_engine_with_dynamic_shapes(onnx_path: str, engine_path: str):
    """Build TensorRT engine with dynamic batch size support."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    # Parse ONNX model
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

    # Create optimization profile for dynamic shapes
    profile = builder.create_optimization_profile()

    # Input tensor: [batch, channels, height, width]
    # Define min, optimal, and max shapes
    profile.set_shape(
        "input",
        min=(1, 3, 224, 224),      # Minimum shape
        opt=(8, 3, 224, 224),      # Optimal shape for auto-tuning
        max=(32, 3, 224, 224)      # Maximum shape
    )

    config.add_optimization_profile(profile)

    # Build serialized engine
    serialized_engine = builder.build_serialized_network(network, config)

    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)

    return serialized_engine

Shape Tensors and Data-Dependent Shapes

Models that consume shape tensors (and operators with data-dependent output shapes, such as NonMaxSuppression or NonZero) need extra information: shape-tensor inputs get their own min/opt/max values in the optimization profile:

# Setting shape tensor values at runtime
profile.set_shape_input(
    "shape_input",
    min=(1,),
    opt=(4,),
    max=(16,)
)

Multiple Optimization Profiles

For variable workloads, create multiple profiles optimized for different input ranges:

def create_multi_profile_engine():
    """Create engine with multiple optimization profiles for different batch sizes."""
    config = builder.create_builder_config()

    # Profile 1: Small batches (real-time inference)
    profile1 = builder.create_optimization_profile()
    profile1.set_shape("input", min=(1, 3, 224, 224), opt=(1, 3, 224, 224), max=(4, 3, 224, 224))
    config.add_optimization_profile(profile1)

    # Profile 2: Medium batches (balanced throughput/latency)
    profile2 = builder.create_optimization_profile()
    profile2.set_shape("input", min=(4, 3, 224, 224), opt=(8, 3, 224, 224), max=(16, 3, 224, 224))
    config.add_optimization_profile(profile2)

    # Profile 3: Large batches (maximum throughput)
    profile3 = builder.create_optimization_profile()
    profile3.set_shape("input", min=(16, 3, 224, 224), opt=(32, 3, 224, 224), max=(64, 3, 224, 224))
    config.add_optimization_profile(profile3)

    return config

At runtime, select the appropriate profile based on actual input dimensions:

# Select profile at runtime (positional args: profile index, CUDA stream handle)
context.set_optimization_profile_async(1, cuda_stream.handle)
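
Once a profile is active, the concrete input shape still has to be set within that profile's range before execution. A minimal sketch, assuming the tensor is named "input" as in the build example above:

# The shape must lie inside the selected profile's [min, max] range
context.set_input_shape("input", (8, 3, 224, 224))
context.execute_async_v3(cuda_stream.handle)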

Custom Layer Implementation with IPluginV3

The IPluginV3 Interface

Starting with TensorRT 10.0, IPluginV3 is the recommended plugin interface; the older IPluginV2-derived interfaces are deprecated. It provides three capability interfaces:

  • IPluginV3OneCore: Plugin attributes common to build and runtime
  • IPluginV3OneBuild: Build-time capabilities
  • IPluginV3OneRuntime: Runtime execution capabilities

C++ Implementation Example

#include "NvInferPlugin.h"
#include <cuda_runtime.h>

class CustomActivationPlugin : public nvinfer1::IPluginV3,
                                public nvinfer1::IPluginV3OneCore,
                                public nvinfer1::IPluginV3OneBuild,
                                public nvinfer1::IPluginV3OneRuntime
{
public:
    // IPluginV3 interface
    nvinfer1::IPluginCapability* getCapabilityInterface(
        nvinfer1::PluginCapabilityType type) noexcept override
    {
        if (type == nvinfer1::PluginCapabilityType::kCORE) {
            return static_cast<IPluginV3OneCore*>(this);
        }
        if (type == nvinfer1::PluginCapabilityType::kBUILD) {
            return static_cast<IPluginV3OneBuild*>(this);
        }
        if (type == nvinfer1::PluginCapabilityType::kRUNTIME) {
            return static_cast<IPluginV3OneRuntime*>(this);
        }
        return nullptr;
    }

    // IPluginV3OneCore interface
    char const* getPluginName() const noexcept override { return "CustomActivation"; }
    char const* getPluginVersion() const noexcept override { return "1"; }
    char const* getPluginNamespace() const noexcept override { return ""; }

    // IPluginV3OneBuild interface
    int32_t getNbOutputs() const noexcept override { return 1; }

    int32_t configurePlugin(
        nvinfer1::DynamicPluginTensorDesc const* in, int32_t nbInputs,
        nvinfer1::DynamicPluginTensorDesc const* out, int32_t nbOutputs) noexcept override
    {
        mInputDims = in[0].desc.dims;
        return 0;
    }

    bool supportsFormatCombination(
        int32_t pos, nvinfer1::DynamicPluginTensorDesc const* inOut,
        int32_t nbInputs, int32_t nbOutputs) noexcept override
    {
        // Support FP32 and FP16
        bool valid = inOut[pos].desc.format == nvinfer1::PluginFormat::kLINEAR;
        valid &= (inOut[pos].desc.type == nvinfer1::DataType::kFLOAT ||
                  inOut[pos].desc.type == nvinfer1::DataType::kHALF);
        return valid;
    }

    // IPluginV3OneBuild: output shape inference (output matches the input shape)
    int32_t getOutputShapes(
        nvinfer1::DimsExprs const* inputs, int32_t nbInputs,
        nvinfer1::DimsExprs const* shapeInputs, int32_t nbShapeInputs,
        nvinfer1::DimsExprs* outputs, int32_t nbOutputs,
        nvinfer1::IExprBuilder& exprBuilder) noexcept override
    {
        outputs[0] = inputs[0];
        return 0;
    }

    // IPluginV3OneRuntime interface
    int32_t enqueue(
        nvinfer1::PluginTensorDesc const* inputDesc,
        nvinfer1::PluginTensorDesc const* outputDesc,
        void const* const* inputs, void* const* outputs,
        void* workspace, cudaStream_t stream) noexcept override
    {
        // Launch custom CUDA kernel
        int32_t numElements = 1;
        for (int i = 0; i < inputDesc[0].dims.nbDims; ++i) {
            numElements *= inputDesc[0].dims.d[i];
        }

        if (inputDesc[0].type == nvinfer1::DataType::kFLOAT) {
            customActivationKernel<float><<<
                (numElements + 255) / 256, 256, 0, stream>>>(
                static_cast<const float*>(inputs[0]),
                static_cast<float*>(outputs[0]),
                numElements
            );
        }
        return 0;
    }

    // Note: clone(), getOutputDataTypes(), getFieldsToSerialize(), onShapeChange(),
    // and attachToContext() are also required overrides, omitted here for brevity.

private:
    nvinfer1::Dims mInputDims;
};
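
The enqueue() call above launches customActivationKernel, which is not defined in the listing. A minimal CUDA sketch of such a kernel is shown below; the Swish-style activation is only an illustrative choice, not something the plugin interface prescribes:

template <typename T>
__global__ void customActivationKernel(const T* input, T* output, int32_t numElements)
{
    int32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < numElements)
    {
        // Example: Swish activation, x * sigmoid(x)
        float x = static_cast<float>(input[idx]);
        output[idx] = static_cast<T>(x / (1.0f + expf(-x)));
    }
}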

Plugin Creator Registration

class CustomActivationPluginCreator : public nvinfer1::IPluginCreatorV3One
{
public:
    char const* getPluginName() const noexcept override { return "CustomActivation"; }
    char const* getPluginVersion() const noexcept override { return "1"; }

    nvinfer1::IPluginV3* createPlugin(
        char const* name, nvinfer1::PluginFieldCollection const* fc,
        nvinfer1::TensorRTPhase phase) noexcept override
    {
        return new CustomActivationPlugin();
    }
};

// Register plugin
REGISTER_TENSORRT_PLUGIN(CustomActivationPluginCreator);

INT8 Calibration Strategies

Calibration Algorithm Comparison

Algorithm | Best For | Accuracy | Speed
MinMax | NLP models, stable distributions | Good | Fast
Entropy (v2) | CNNs, general-purpose | Best | Medium
Percentile | Data with outliers | Good | Fast
MSE | High-precision requirements | Excellent | Slow
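
In the Python API, the algorithm is chosen by which calibrator base class you subclass; the get_batch/cache interface is identical across them. A minimal sketch (the Entropy variant is implemented in full below; MSE-style calibration is typically done with offline tooling such as NVIDIA Model Optimizer rather than a builder-time calibrator):

import tensorrt as trt

class MinMaxCalibrator(trt.IInt8MinMaxCalibrator):
    """MinMax: scales derived from observed min/max activations."""

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Entropy (v2): KL-divergence-based, the usual choice for CNNs."""

class PercentileCalibrator(trt.IInt8LegacyCalibrator):
    """Legacy/percentile: clips outliers via a configurable quantile."""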

Python Calibrator Implementation

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pathlib import Path

class Int8EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """INT8 calibration using entropy algorithm."""

    def __init__(self, data_loader, cache_file: str = "calibration.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.data_loader = data_loader
        self.cache_file = Path(cache_file)
        self.batch_size = data_loader.batch_size
        self.current_index = 0

        # Allocate GPU memory for calibration batch
        self.device_input = cuda.mem_alloc(
            self.batch_size * 3 * 224 * 224 * np.float32().itemsize
        )

        # Pre-load calibration data
        self.calibration_data = list(data_loader)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        """Return a batch for calibration."""
        if self.current_index >= len(self.calibration_data):
            return None

        batch = self.calibration_data[self.current_index]
        self.current_index += 1

        # Handle different input types (torch tensors, numpy arrays)
        if hasattr(batch, 'numpy'):
            batch = batch.numpy()

        # Ensure contiguous memory layout
        batch = np.ascontiguousarray(batch.astype(np.float32))
        cuda.memcpy_htod(self.device_input, batch)

        return [int(self.device_input)]

    def read_calibration_cache(self):
        """Load cached calibration data if available."""
        if self.cache_file.exists():
            with open(self.cache_file, 'rb') as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        """Save calibration data for reuse."""
        with open(self.cache_file, 'wb') as f:
            f.write(cache)
        print(f"Calibration cache saved to {self.cache_file}")


def build_int8_engine(onnx_path: str, calibrator, engine_path: str):
    """Build INT8 TensorRT engine with calibration."""
    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, 'rb') as f:
        parser.parse(f.read())

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)

    # Enable INT8 mode
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calibrator

    # Also enable FP16 for layers that don't benefit from INT8
    config.set_flag(trt.BuilderFlag.FP16)

    # Create optimization profile
    profile = builder.create_optimization_profile()
    profile.set_shape("input", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
    config.add_optimization_profile(profile)

    # Build and serialize
    serialized_engine = builder.build_serialized_network(network, config)

    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)

    return serialized_engine

Calibration Data Preparation Script

import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from PIL import Image
import glob

class CalibrationDataset(Dataset):
    """Dataset for INT8 calibration - use representative samples."""

    def __init__(self, image_dir: str, num_samples: int = 500):
        self.images = glob.glob(f"{image_dir}/*.jpg")[:num_samples]
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        ])

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert('RGB')
        return self.transform(img)


def create_calibration_dataloader(image_dir: str, batch_size: int = 8):
    """Create dataloader for calibration."""
    dataset = CalibrationDataset(image_dir, num_samples=500)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True
    )


# Usage example
if __name__ == "__main__":
    cal_loader = create_calibration_dataloader("/path/to/calibration/images")
    calibrator = Int8EntropyCalibrator(cal_loader, "resnet50_int8.cache")
    engine = build_int8_engine("resnet50.onnx", calibrator, "resnet50_int8.engine")

Sparsity and Structured Pruning

2:4 Structured Sparsity

The NVIDIA Ampere architecture (including Jetson AGX Orin) supports 2:4 structured sparsity through Sparse Tensor Cores, achieving up to 2x throughput for qualifying layers.

import torch
from torch.nn.utils import prune

def apply_2_4_sparsity(model: torch.nn.Module):
    """Apply 2:4 structured sparsity pattern to model weights."""

    def apply_structured_pruning(tensor: torch.Tensor) -> torch.Tensor:
        """Prune 2 smallest values in every 4 consecutive elements."""
        shape = tensor.shape
        flat = tensor.flatten()

        # Reshape to groups of 4
        num_groups = flat.numel() // 4
        groups = flat[:num_groups * 4].reshape(-1, 4)

        # Get indices of 2 smallest per group
        _, indices = torch.topk(groups.abs(), k=2, dim=1, largest=False)

        # Create mask
        mask = torch.ones_like(groups)
        mask.scatter_(1, indices, 0)

        # Apply mask
        result = flat.clone()
        result[:num_groups * 4] = (groups * mask).flatten()

        return result.reshape(shape)

    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            with torch.no_grad():
                module.weight.data = apply_structured_pruning(module.weight.data)

    return model


def verify_sparsity(model: torch.nn.Module) -> dict:
    """Verify 2:4 sparsity pattern compliance."""
    results = {}

    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            weight = module.weight.data.flatten()
            num_groups = weight.numel() // 4
            groups = weight[:num_groups * 4].reshape(-1, 4)

            # Count zeros per group
            zeros_per_group = (groups == 0).sum(dim=1)
            compliant = (zeros_per_group >= 2).float().mean().item()

            results[name] = {
                'sparsity': (weight == 0).float().mean().item(),
                '2:4_compliance': compliant
            }

    return results

NVIDIA Model Optimizer for Sparsity

# Using NVIDIA Model Optimizer for production sparsity
import modelopt.torch.sparsity as mts

def apply_modelopt_sparsity(model, data_loader):
    """Apply 2:4 sparsity using NVIDIA Model Optimizer."""

    # Configure sparsity
    sparsity_config = {
        "data_loader": data_loader,
        "collect_func": lambda x: x[0],  # Extract input from batch
    }

    # Apply 2:4 sparsity; Model Optimizer's data-driven "sparsegpt" mode uses
    # the calibration loader above to decide which weights to zero out
    sparse_model = mts.sparsify(
        model,
        mode="sparsegpt",
        config=sparsity_config
    )

    # Fine-tune sparse model
    # ... training loop ...

    # Export for TensorRT
    sparse_model = mts.export(sparse_model)

    return sparse_model
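
Pruning the weights is only half the job: the TensorRT builder must also be told to look for the 2:4 pattern so it can select Sparse Tensor Core kernels. A minimal builder-side sketch (with trtexec, the equivalent switch is --sparsity=enable):

import tensorrt as trt

def enable_sparse_tensor_cores(config: trt.IBuilderConfig) -> trt.IBuilderConfig:
    """Let the builder pick Sparse Tensor Core kernels for 2:4-pruned layers."""
    # Layers whose weights already follow the 2:4 pattern become eligible for
    # sparse kernels; dense layers are left unchanged.
    config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
    return config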

Multi-Profile Engines for Variable Workloads

Production Configuration

class MultiProfileEngine:
    """Manage TensorRT engine with multiple optimization profiles."""

    def __init__(self, engine_path: str):
        self.logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f:
            self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(f.read())

        self.contexts = {}
        self.streams = {}

        # Create one execution context (and stream) per profile
        for i in range(self.engine.num_optimization_profiles):
            stream = cuda.Stream()
            ctx = self.engine.create_execution_context()
            # Each context is bound to exactly one optimization profile
            ctx.set_optimization_profile_async(i, stream.handle)
            self.contexts[i] = ctx
            self.streams[i] = stream

    def select_profile(self, batch_size: int) -> int:
        """Select optimal profile based on batch size."""
        # Profile selection logic based on your profile configuration
        if batch_size <= 4:
            return 0  # Low-latency profile
        elif batch_size <= 16:
            return 1  # Balanced profile
        else:
            return 2  # High-throughput profile

    def infer(self, input_data: np.ndarray) -> np.ndarray:
        """Run inference with automatic profile selection."""
        batch_size = input_data.shape[0]
        profile_idx = self.select_profile(batch_size)

        context = self.contexts[profile_idx]
        stream = self.streams[profile_idx]

        # Set input shape for selected profile
        context.set_input_shape("input", input_data.shape)

        # Allocate buffers and run inference
        # ... buffer allocation and execution ...

        return output_data

trtexec Multi-Profile Build

# Build an engine with two optimization profiles using trtexec
# (recent trtexec versions accept --profile=<index> to start a new profile)
trtexec \
    --onnx=model.onnx \
    --saveEngine=model_multiprofile.engine \
    --profile=0 \
    --minShapes=input:1x3x224x224 \
    --optShapes=input:8x3x224x224 \
    --maxShapes=input:32x3x224x224 \
    --profile=1 \
    --minShapes=input:1x3x224x224 \
    --optShapes=input:16x3x224x224 \
    --maxShapes=input:64x3x224x224 \
    --fp16 \
    --int8 \
    --calib=calibration.cache \
    --memPoolSize=workspace:4096 \
    --verbose

TensorRT-LLM and Edge-LLM for Jetson

TensorRT Edge-LLM Overview

NVIDIA introduced TensorRT Edge-LLM in JetPack 7.1, specifically designed for LLM and VLM inference on Jetson platforms:

# TensorRT Edge-LLM is a C++ framework - Python bindings example
from tensorrt_edge_llm import EdgeLLMEngine

def setup_edge_llm():
    """Configure TensorRT Edge-LLM for Jetson deployment."""

    config = {
        "model_path": "llama-3.2-3b-instruct",
        "quantization": "nvfp4",  # NVFP4 for memory efficiency
        "kv_cache_config": {
            "max_tokens": 4096,
            "page_size": 64
        },
        "speculative_decoding": {
            "enabled": True,
            "algorithm": "eagle3"
        }
    }

    engine = EdgeLLMEngine(config)
    return engine

TensorRT-LLM on Jetson AGX Orin

# Install TensorRT-LLM for Jetson (JetPack 6.1+)
git clone -b v0.12.0-jetson https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

# Build Llama model for Jetson
python examples/llama/build.py \
    --model_dir ./llama-3.2-3b \
    --output_dir ./llama_trt \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --max_batch_size 4 \
    --max_input_len 512 \
    --max_output_len 256 \
    --paged_kv_cache

Memory-Efficient LLM Configuration

# TensorRT-LLM memory configuration for Jetson
from tensorrt_llm import LLM, SamplingParams

def configure_llm_for_jetson():
    """Configure TensorRT-LLM for memory-constrained Jetson devices."""

    llm = LLM(
        model="meta-llama/Llama-3.2-3B-Instruct",
        tensor_parallel_size=1,
        kv_cache_config={
            "free_gpu_memory_fraction": 0.85,  # Reserve 15% for other ops
            "enable_block_reuse": True
        },
        build_config={
            "max_batch_size": 4,
            "max_num_tokens": 2048,
            "plugin_config": {
                "paged_kv_cache": True,
                "remove_input_padding": True,
                "context_fmha": True
            }
        }
    )

    return llm
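
A hedged usage sketch for the object returned above; argument and attribute names follow the TensorRT-LLM high-level LLM API and may differ slightly across releases:

# Generate with explicit sampling settings
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

llm = configure_llm_for_jetson()
outputs = llm.generate(
    ["Explain 2:4 structured sparsity in one sentence."],
    params
)
print(outputs[0].outputs[0].text)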

Streaming and Async Inference

CUDA Streams for Pipelined Inference

import tensorrt as trt
import pycuda.driver as cuda
import numpy as np
from threading import Thread
from queue import Queue

class AsyncTensorRTInference:
    """Asynchronous TensorRT inference with CUDA streams."""

    def __init__(self, engine_path: str, num_streams: int = 2):
        self.logger = trt.Logger(trt.Logger.WARNING)

        with open(engine_path, 'rb') as f:
            self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(f.read())

        # Create multiple streams for pipelining
        self.num_streams = num_streams
        self.streams = [cuda.Stream() for _ in range(num_streams)]
        self.contexts = [self.engine.create_execution_context() for _ in range(num_streams)]

        # Pre-allocate buffers per stream
        self.buffers = []
        for _ in range(num_streams):
            self.buffers.append(self._allocate_buffers())

        self.current_stream = 0
        self.result_queue = Queue()

    def _allocate_buffers(self):
        """Allocate input/output buffers."""
        buffers = {}
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            shape = self.engine.get_tensor_shape(name)

            # Replace dynamic (-1) dimensions with 1 for buffer sizing;
            # production code should size buffers for the profile's max shape
            shape = tuple(max(1, s) for s in shape)
            size = int(np.prod(shape) * np.dtype(dtype).itemsize)

            buffers[name] = {
                'host': cuda.pagelocked_empty(shape, dtype),
                'device': cuda.mem_alloc(size),
                'shape': shape
            }
        return buffers

    def infer_async(self, input_data: np.ndarray, callback=None):
        """Queue async inference request."""
        stream_idx = self.current_stream
        self.current_stream = (self.current_stream + 1) % self.num_streams

        stream = self.streams[stream_idx]
        context = self.contexts[stream_idx]
        buffers = self.buffers[stream_idx]

        # Copy input to device asynchronously
        np.copyto(buffers['input']['host'], input_data)
        cuda.memcpy_htod_async(
            buffers['input']['device'],
            buffers['input']['host'],
            stream
        )

        # Set tensor addresses
        for name, buf in buffers.items():
            context.set_tensor_address(name, int(buf['device']))

        # Execute inference
        context.execute_async_v3(stream.handle)

        # Copy output back asynchronously
        cuda.memcpy_dtoh_async(
            buffers['output']['host'],
            buffers['output']['device'],
            stream
        )

        # pycuda streams do not expose host callbacks, so record an event and
        # (optionally) notify the caller from a small helper thread
        done_event = cuda.Event()
        done_event.record(stream)

        if callback:
            def _notify():
                done_event.synchronize()
                callback(buffers['output']['host'].copy())
            Thread(target=_notify, daemon=True).start()

        return stream_idx

    def synchronize(self, stream_idx: int = None):
        """Wait for inference completion."""
        if stream_idx is not None:
            self.streams[stream_idx].synchronize()
        else:
            for stream in self.streams:
                stream.synchronize()
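
A hypothetical end-to-end usage of the class above (the engine filename and the batch of random frames are placeholders):

import numpy as np

engine = AsyncTensorRTInference("resnet50_fp16.engine", num_streams=2)

def on_result(output: np.ndarray):
    # Runs once the corresponding stream has finished
    print("top-1 class:", int(output.reshape(-1).argmax()))

frames = np.random.randn(8, 1, 3, 224, 224).astype(np.float32)
for frame in frames:
    engine.infer_async(frame, callback=on_result)

engine.synchronize()  # drain all in-flight requests before shutdown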

CUDA Graphs for Reduced Launch Overhead

# CUDA graph capture uses the CUDA runtime graph APIs, exposed here through
# the cuda-python bindings (pip install cuda-python); pycuda does not wrap them
from cuda import cudart

def capture_cuda_graph(context, stream_handle):
    """Capture one inference into a CUDA graph for reduced launch overhead."""

    # Warm-up run so lazy initialization happens outside the capture
    context.execute_async_v3(stream_handle)
    cudart.cudaStreamSynchronize(stream_handle)

    # Begin capture on the stream
    cudart.cudaStreamBeginCapture(
        stream_handle, cudart.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal
    )

    # Execute inference (these kernel launches get recorded into the graph)
    context.execute_async_v3(stream_handle)

    # End capture and instantiate an executable graph
    _, graph = cudart.cudaStreamEndCapture(stream_handle)
    _, graph_exec = cudart.cudaGraphInstantiate(graph, 0)

    return graph_exec


def run_with_cuda_graph(graph_exec, stream_handle):
    """Replay the captured CUDA graph."""
    cudart.cudaGraphLaunch(graph_exec, stream_handle)
    cudart.cudaStreamSynchronize(stream_handle)

Memory Pooling and Allocation Strategies

Custom GPU Allocator

import tensorrt as trt
import pycuda.driver as cuda

class PooledGpuAllocator(trt.IGpuAllocator):
    """Custom GPU allocator with memory pooling."""

    def __init__(self, pool_size: int = 1 << 30):  # 1GB default
        super().__init__()
        self.pool_size = pool_size
        self.pool = cuda.mem_alloc(pool_size)
        self.offset = 0
        self.allocations = {}

    def allocate(self, size: int, alignment: int, flags: int) -> int:
        """Allocate from pool with alignment."""
        # Align offset
        aligned_offset = (self.offset + alignment - 1) & ~(alignment - 1)

        if aligned_offset + size > self.pool_size:
            # Pool exhausted: fall back to a dedicated allocation
            mem = cuda.mem_alloc(size)
            self.allocations[int(mem)] = ('external', mem)
            return int(mem)

        ptr = int(self.pool) + aligned_offset
        self.allocations[ptr] = ('pool', size)
        self.offset = aligned_offset + size

        return ptr

    def deallocate(self, ptr: int) -> bool:
        """Deallocate memory."""
        if ptr in self.allocations:
            alloc_type, payload = self.allocations.pop(ptr)
            if alloc_type == 'external':
                payload.free()  # release the dedicated pycuda allocation
            # Pool-backed allocations are reclaimed when the pool is reset
            return True
        return False

    def reallocate(self, ptr: int, size: int, alignment: int) -> int:
        """Reallocate memory (note: this sketch does not copy the old contents,
        which a production allocator is expected to do)."""
        self.deallocate(ptr)
        return self.allocate(size, alignment, 0)


# Usage with the runtime
def build_with_custom_allocator():
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

    # The custom allocator is attached to the runtime, not to the builder config
    allocator = PooledGpuAllocator(pool_size=2 << 30)  # 2GB pool
    runtime = trt.Runtime(logger)
    runtime.gpu_allocator = allocator

Pre-allocated Output Buffers

class PreallocatedInference:
    """Inference with pre-allocated output buffers for reduced latency."""

    def __init__(self, engine_path: str):
        # ... engine loading ...

        # Pre-allocate output buffers
        self.output_buffers = []
        for i in range(self.num_outputs):
            shape = self.engine.get_tensor_shape(f"output_{i}")
            dtype = trt.nptype(self.engine.get_tensor_dtype(f"output_{i}"))

            # Allocate pinned memory for faster transfers
            host_buf = cuda.pagelocked_empty(shape, dtype)
            device_buf = cuda.mem_alloc(host_buf.nbytes)

            self.output_buffers.append({
                'host': host_buf,
                'device': device_buf
            })

    def infer(self, input_data: np.ndarray) -> list:
        """Run inference with pre-allocated buffers."""
        # Input is copied, but output buffers are reused
        # This overlaps GPU execution with memory operations

        # Previous output can be processed while new inference runs
        return [buf['host'] for buf in self.output_buffers]

Profiling with Nsight Systems

Command-Line Profiling

# Basic TensorRT profiling
nsys profile \
    --trace=cuda,nvtx,osrt \
    --gpu-metrics-device=all \
    --output=tensorrt_profile \
    python inference.py

# Profile specific iterations
export TLLM_PROFILE_START_STOP=10-20
nsys profile \
    --trace=cuda,nvtx,cudnn,cublas \
    -c cudaProfilerApi \
    --output=trt_llm_profile \
    python llm_inference.py

# Generate detailed report
nsys stats tensorrt_profile.nsys-rep --report cuda_gpu_trace
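
NVTX ranges make your own pipeline stages show up as named blocks on the Nsight Systems timeline. A small sketch using the nvtx Python package; the preprocess/postprocess helpers are placeholders for your own code:

import nvtx

with nvtx.annotate("preprocess", color="blue"):
    batch = preprocess(frame)               # placeholder helper
with nvtx.annotate("tensorrt_infer", color="green"):
    context.execute_async_v3(stream.handle)
    stream.synchronize()
with nvtx.annotate("postprocess", color="red"):
    detections = postprocess(output)        # placeholder helper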

Programmatic Profiling

import tensorrt as trt
import ctypes

class TensorRTProfiler(trt.IProfiler):
    """Custom profiler for layer-by-layer timing."""

    def __init__(self):
        super().__init__()
        self.layer_times = {}
        self.total_time = 0

    def report_layer_time(self, layer_name: str, ms: float):
        """Record layer execution time."""
        if layer_name not in self.layer_times:
            self.layer_times[layer_name] = []
        self.layer_times[layer_name].append(ms)
        self.total_time += ms

    def print_summary(self):
        """Print profiling summary."""
        print("\n" + "="*60)
        print("TensorRT Layer Profiling Summary")
        print("="*60)

        sorted_layers = sorted(
            self.layer_times.items(),
            key=lambda x: sum(x[1]),
            reverse=True
        )

        for name, times in sorted_layers[:20]:
            avg_ms = sum(times) / len(times)
            total_ms = sum(times)
            pct = (total_ms / self.total_time) * 100
            print(f"{name[:40]:40s} | {avg_ms:8.3f}ms | {pct:5.1f}%")

        print("="*60)
        print(f"Total time: {self.total_time:.3f}ms")


# Enable profiling
profiler = TensorRTProfiler()
context.profiler = profiler

# Run inference
for _ in range(100):
    context.execute_async_v3(stream.handle)
stream.synchronize()

# Print results
profiler.print_summary()

trtexec Profiling

# Detailed layer profiling with trtexec
trtexec \
    --loadEngine=model.engine \
    --dumpProfile \
    --dumpLayerInfo \
    --profilingVerbosity=detailed \
    --iterations=100 \
    --avgRuns=100 \
    --warmUp=1000 \
    --useCudaGraph

# Export timing data
trtexec \
    --loadEngine=model.engine \
    --exportProfile=timing.json \
    --exportLayerInfo=layers.json

Benchmark Comparisons

Performance Comparison Table (Jetson AGX Orin 64GB)

Model | FP32 | FP16 | INT8 | INT8 Sparse
ResNet-50 | 8.5ms | 3.2ms | 1.8ms | 1.2ms
YOLOv8s | 12.4ms | 4.8ms | 2.6ms | 1.9ms
YOLOv8x | 45.2ms | 18.6ms | 9.8ms | 6.7ms
BERT-Base | 24.3ms | 11.2ms | 6.4ms | 4.8ms
Llama-3.2-3B | 892ms/tok | 156ms/tok | N/A | 98ms/tok

Benchmark Script

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import time

def benchmark_engine(engine_path: str, input_shape: tuple,
                     warmup: int = 50, iterations: int = 200):
    """Benchmark TensorRT engine performance."""

    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, 'rb') as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

    context = engine.create_execution_context()
    stream = cuda.Stream()

    # Allocate buffers
    input_data = np.random.randn(*input_shape).astype(np.float32)
    d_input = cuda.mem_alloc(input_data.nbytes)

    # For dynamic-shape engines, pin the input shape before querying the output shape
    context.set_input_shape("input", input_shape)
    output_shape = tuple(context.get_tensor_shape("output"))
    output_data = np.empty(output_shape, dtype=np.float32)
    d_output = cuda.mem_alloc(output_data.nbytes)

    # Set tensor addresses
    context.set_tensor_address("input", int(d_input))
    context.set_tensor_address("output", int(d_output))

    # Warm-up
    cuda.memcpy_htod_async(d_input, input_data, stream)
    for _ in range(warmup):
        context.execute_async_v3(stream.handle)
    stream.synchronize()

    # Benchmark
    latencies = []

    for _ in range(iterations):
        cuda.memcpy_htod_async(d_input, input_data, stream)

        start = cuda.Event()
        end = cuda.Event()

        start.record(stream)
        context.execute_async_v3(stream.handle)
        end.record(stream)

        stream.synchronize()
        latencies.append(start.time_till(end))

    # Statistics
    latencies = np.array(latencies)
    results = {
        'mean_ms': np.mean(latencies),
        'std_ms': np.std(latencies),
        'min_ms': np.min(latencies),
        'max_ms': np.max(latencies),
        'p50_ms': np.percentile(latencies, 50),
        'p95_ms': np.percentile(latencies, 95),
        'p99_ms': np.percentile(latencies, 99),
        'throughput_fps': 1000 / np.mean(latencies)
    }

    return results


def compare_precisions(onnx_path: str, input_shape: tuple):
    """Compare FP32, FP16, and INT8 performance."""

    precisions = {
        'FP32': {'fp16': False, 'int8': False},
        'FP16': {'fp16': True, 'int8': False},
        'INT8': {'fp16': True, 'int8': True}
    }

    results = {}

    for name, config in precisions.items():
        engine_path = f"model_{name.lower()}.engine"

        # Build engine (simplified)
        # ... engine building code ...

        results[name] = benchmark_engine(engine_path, input_shape)

        print(f"\n{name} Results:")
        print(f"  Mean Latency: {results[name]['mean_ms']:.2f}ms")
        print(f"  P99 Latency:  {results[name]['p99_ms']:.2f}ms")
        print(f"  Throughput:   {results[name]['throughput_fps']:.1f} FPS")

    return results


if __name__ == "__main__":
    results = compare_precisions(
        "resnet50.onnx",
        input_shape=(1, 3, 224, 224)
    )

Conclusion

Mastering TensorRT optimization for NVIDIA Jetson requires understanding the full optimization pipeline:

  1. Dynamic Shapes: Use optimization profiles to handle variable input dimensions while maintaining peak performance
  2. Custom Plugins: Implement IPluginV3 for unsupported operators with CUDA kernel optimization
  3. INT8 Calibration: Choose the right calibration strategy (Entropy v2 for CNNs, MinMax for NLP) with representative data
  4. 2:4 Sparsity: Leverage structured pruning for up to 2x throughput on Ampere/Orin architectures
  5. Multi-Profile Engines: Build engines optimized for different workload characteristics
  6. TensorRT-LLM/Edge-LLM: Deploy optimized LLMs on Jetson with paged KV cache and speculative decoding
  7. Async Inference: Use CUDA streams and graphs to maximize GPU utilization
  8. Memory Management: Implement custom allocators and pre-allocated buffers for consistent latency
  9. Profiling: Use Nsight Systems and trtexec to identify and resolve bottlenecks

For production deployments on Jetson AGX Orin, combining INT8 quantization with FP16 fallback, 2:4 sparsity, and CUDA graphs can achieve 5-10x speedup over naive PyTorch inference while maintaining model accuracy.
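
As a closing sketch, the builder-side switches for that combination look roughly like this (network parsing, calibrator setup, and CUDA graph capture are covered in the sections above):

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)
config.set_flag(trt.BuilderFlag.FP16)             # FP16 fallback for INT8-unfriendly layers
config.set_flag(trt.BuilderFlag.INT8)             # INT8 with calibration
config.int8_calibrator = calibrator
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)   # exploit 2:4-pruned weights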


Last updated: January 2026
