A Developer Guide to ONNX

Harnessing the Power of ONNX Runtime

Part 2 of a series on ONNX.

In this section, we cover:

  • The architecture of ONNX Runtime and the concept of execution providers
  • How linking strategies (static vs. dynamic) affect deployment and performance
  • Practical integration considerations for Python or Rust projects
  • Balancing flexibility, efficiency, and maintainability across different platforms

3: How ONNX Runtime Works

3.1: Execution Providers and Their Role in Inference

A cornerstone of ONNX Runtime’s architecture is the concept of Execution Providers (EPs). An execution provider is a pluggable module that implements the execution of ONNX operators for a particular hardware platform or acceleration library. In simpler terms, an EP is like a driver or backend that ONNX Runtime can use to run parts of the model on a specific platform (CPU, CUDA GPU, TensorRT, FPGA, etc.). ONNX Runtime’s core is designed to be extensible: instead of hard-coding all computation, it delegates to whatever EPs are available on the system. This allows ORT to take advantage of specialized hardware without needing to know the low-level details of that hardware.

Each execution provider defines a set of kernels (implementations) for certain ONNX operators, possibly optimized or accelerated for its target device. For example, the CPU Execution Provider includes efficient implementations of all ops on CPU, the CUDA EP provides GPU implementations for many ops using CUDA, and the TensorRT EP can execute whole subgraphs of ops via NVIDIA’s TensorRT engine. At runtime, ONNX Runtime will query each available EP to see which parts of the model they can handle – this is done via a GetCapability() interface where an EP tells ORT which nodes (operations) in the computation graph it can execute.

ONNX Runtime orchestrates inference by partitioning the model’s computational graph among the available execution providers. It splits the graph into subgraphs, each assigned to one provider. For instance, consider a model with various neural network layers: ORT might assign a batch of convolution and activation nodes to the CUDA EP (GPU) as one subgraph, and some post-processing ops to the CPU EP as another subgraph if the GPU EP doesn’t support those ops. These subgraphs are then executed by their respective providers. The partitioning is dynamic and based on what each EP reports as supported. Importantly, ONNX Runtime always has a default CPU provider that can execute any op (ensuring no model is left unsupported), whereas more specialized EPs might only support a subset of ops. The default CPU EP is used as a fallback for any part of the model that isn’t taken by another EP.
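If you want to see this partitioning for a specific model, one option (a minimal sketch, assuming the standard Python package) is to enable verbose logging on the session; ONNX Runtime’s verbose output typically reports which nodes were assigned to which provider during session creation:

import onnxruntime as ort

# Verbose logging makes ONNX Runtime print graph partitioning details,
# including node-to-provider assignments, while the session is created.
so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE
session = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)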

The execution providers are considered in a priority order. ORT will try to “push” as much of the computation as possible onto the most specialized (and usually fastest) provider first, then the next, and so on, with the CPU provider typically last as the catch-all. The goal is to maximize usage of accelerators: for example, if a GPU is present, let it handle everything it can, and only send to CPU what the GPU cannot do. This design allows heterogeneous execution across multiple devices in one seamless flow. The necessary data transfers (e.g. moving tensors between CPU and GPU memory) are handled by ORT so that from the user’s perspective it just works.

To summarize, execution providers in ONNX Runtime abstract the hardware-specific execution of model operations. ORT uses EPs to achieve both flexibility (you can run on different hardware by plugging in different EPs) and performance (by offloading work to faster specialized hardware when available). This modular architecture means as new hardware or libraries emerge (say a new Neural Processing Unit or a new optimization library), support for it can be added to ONNX Runtime as a new EP without disturbing the rest of the system. For developers, it means an ONNX model can automatically benefit from accelerators present on the system, simply by enabling the corresponding execution provider.

3.2: Switching Between CPU and GPU Providers (e.g., CUDA, TensorRT)

One of the practical tasks for library developers using ONNX Runtime is selecting and configuring the execution providers for a given deployment. The standard onnxruntime Python package ships with the CPU execution provider; the separate onnxruntime-gpu package additionally includes GPU providers such as CUDA (and TensorRT, given the required NVIDIA libraries). Using the CPU provider requires no special configuration: it is the fallback and will always be there. However, if you want to leverage a GPU (CUDA) or other accelerators like TensorRT, you need to ensure those providers are available in your ONNX Runtime build and then explicitly configure the runtime to use them.
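A quick way to check what your installed build actually supports (a small sketch, assuming the standard Python package) is to ask the runtime itself:

import onnxruntime as ort

# Which providers were compiled into this build, and is it a CPU or GPU package?
print(ort.get_available_providers())  # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']
print(ort.get_device())               # 'GPU' for GPU-enabled builds, otherwise 'CPU'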

In Python, switching or specifying execution providers is straightforward. When creating an InferenceSession, you can pass a list of providers in order of priority. For example:

import onnxruntime as ort
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

This instructs ONNX Runtime to try the CUDA provider first and use CPU as a fallback. In this configuration, CUDAExecutionProvider uses the system’s NVIDIA GPU (assuming a GPU-enabled build such as the onnxruntime-gpu package is installed), and CPUExecutionProvider is the built-in default. If an operation in the model is supported by the CUDA EP, it runs on the GPU; if not, it runs on the CPU. As of ONNX Runtime 1.10+, the API requires the GPU provider to be specified explicitly; omitting the providers argument will not silently enable the GPU. This is a safety measure that avoids unexpected GPU usage and makes the developer’s intention clear.
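After creating the session you can confirm which providers were actually registered and then run inference as usual; the input name and shape below are placeholders for whatever your model expects:

import numpy as np

# Confirm which providers the session ended up with, then run the model.
print(session.get_providers())  # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example shape; model-specific
outputs = session.run(None, {input_name: dummy})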

For NVIDIA’s TensorRT (another execution provider which uses optimized inference engines), the idea is similar: you need an ONNX Runtime build that includes the TensorRT EP (or you compile ORT with --use_tensorrt). Then you could specify providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]. In such a configuration, ORT will first try to delegate model parts to TensorRT (which can yield even faster GPU inference for supported models), then fall back to general CUDA, then CPU. It’s worth noting that using TensorRT might require that you provide static shapes or have TensorRT engines built ahead of time, but ONNX Runtime abstracts much of that – it will attempt to build a TensorRT engine for subgraphs it hands off.
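Providers can also be passed as (name, options) pairs to tune a specific EP. The sketch below shows a TensorRT configuration with engine caching and FP16 enabled; the option names follow the TensorRT EP documentation, but verify them against the ONNX Runtime version you ship, since they can change between releases:

import onnxruntime as ort

# The tuple form attaches provider-specific options; unsupported nodes
# still fall through to CUDA and then to CPU.
trt_options = {
    "trt_fp16_enable": True,           # allow FP16 engines where the hardware supports it
    "trt_engine_cache_enable": True,   # cache built engines to avoid rebuilding on startup
    "trt_engine_cache_path": "./trt_cache",
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)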

In the Rust ecosystem, the ort crate similarly allows selecting execution providers, though the choice is made partly at build time (via crate features) and partly when a session is initialized. The crate enables the CPU provider by default, and additional providers such as CUDA or TensorRT are compiled in when the corresponding crate features are enabled. When using the Rust API, one typically registers the desired providers on the session builder, or relies on the defaults, which may already prioritize a GPU provider if it was compiled in. Under the hood it does the same thing as the Python API: it calls into the ORT C API to set the list of desired providers.

Switching between CPU and GPU thus involves: (1) having ONNX Runtime built with the desired providers, and (2) specifying them in the preferred order when initializing inference. Library developers should ensure they ship or depend on the correct ONNX Runtime package: for instance, use the onnxruntime-gpu Python package if GPU is needed (the CPU-only package won’t have CUDA support), or in Rust enable the cuda feature and link against an ORT build that has CUDA enabled. It’s also a good practice to include CPU as a fallback provider in the list, unless you explicitly want to fail when a GPU op is unsupported. Including CPU ensures that if the accelerator can’t handle something, it will still run (albeit slower) on CPU rather than erroring out.

In summary, ONNX Runtime makes it easy to switch execution providers: it’s often a one-line configuration change. The heavy lifting of partitioning the model and coordinating between CPU/GPU is handled internally by ORT. For a library developer, the main considerations are building/distributing the runtime with the needed providers and exposing a way for users to choose or configure the provider (for example, allowing a user to select “use GPU” vs “CPU only” in a high-level API). The flexibility of ONNX Runtime here means the same ONNX model can run in a variety of environments – from CPU-only servers to GPU clusters – by simply toggling providers.

3.3: Model Optimization Features in ONNX Runtime

Beyond hardware acceleration, ONNX Runtime provides a suite of model optimization techniques that can significantly improve inference performance. These optimizations operate on the ONNX computation graph (often called graph optimizations or graph transformations). The idea is to transform the model graph into a more efficient but functionally equivalent form before executing it. ORT performs many of these optimizations automatically when you load a model, and it also offers APIs to control or run optimizations explicitly.

Some of the key optimization features are:

  • Graph optimization levels: ORT groups its transformations into levels (basic, extended, and layout/all). Basic optimizations include constant folding and elimination of redundant or identity nodes; extended optimizations apply operator fusions such as merging a convolution with a following batch normalization or activation; layout optimizations adjust tensor layouts for the target hardware.
  • Operator fusion: combining several small ops into one larger kernel reduces memory traffic and per-op overhead. ORT ships fusions for common patterns, including transformer building blocks such as attention and GELU.
  • Online and offline optimization: by default the graph is optimized when the session is created (online), but the optimized graph can also be saved to disk and loaded later as a pre-optimized model to reduce startup time (offline).

All these optimization features mean that simply by choosing ONNX Runtime, a lot of performance tuning is handled for you under the hood. A library developer doesn’t necessarily need to implement custom fusions or worry about manually optimizing the computation graph – ORT will handle general best practices. However, it is useful to be aware of these features. For example, if you know your model has segments that could be fused, you can check if ORT is already fusing them (it often does). If you require absolute maximum performance, you might ship an offline-optimized model or use ORT’s API to disable certain optimizations (rarely needed, but there are cases like debugging where you might turn them off). In practice, enabling ORT’s highest optimization level (which is the default ORT_ENABLE_ALL in C++ API or just default behavior in Python) is recommended for production inference. It’s also worth noting that as ONNX Runtime evolves, new optimization passes (like emerging compiler techniques, better operator fusion for transformer models, etc.) are added, and users automatically benefit by upgrading ORT. This continuous improvement is one reason ONNX Runtime often shows better performance with each release.
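In the Python API, the optimization level and the offline-optimization output path are both controlled through SessionOptions; a minimal sketch:

import onnxruntime as ort

# ORT_ENABLE_ALL is already the default; it is set explicitly here for clarity.
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "model.optimized.onnx"  # save the optimized graph for offline reuse
session = ort.InferenceSession("model.onnx", sess_options=so, providers=["CPUExecutionProvider"])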

4: Deployment Considerations and Linking Strategies

4.1: Static vs. Dynamic Linking of ONNX Runtime

When integrating ONNX Runtime into a library or application (especially in languages like C++ or Rust), one important consideration is how the ONNX Runtime library is linked: statically or dynamically.

For Python users, this choice is abstracted away – using the pip package means you’re effectively using dynamic linking (the package ships a shared library). For Rust or C++ library developers, deciding static vs dynamic is an important build configuration. Static linking tends to be easier for end-users of your library (no external dependency to install), but it can increase your binary size. Dynamic linking can keep binaries smaller and allow swapping out the ORT library without recompiling your app (for instance, updating ORT separately), but introduces more points of failure in deployment.

There is also a hybrid approach in Rust that the ort crate supports: a feature called load-dynamic. With it, you compile your Rust code without linking directly against ORT and instead, at runtime, dynamically load (dlopen) the ONNX Runtime library. This gives you flexibility to decide which ORT binary to load (perhaps different ones on different machines) and avoids the application failing to start if the library isn’t present; you can catch that error at runtime and handle it gracefully. The load-dynamic feature is recommended when dynamic linking is needed because it lets you control the loading process and path, rather than letting the OS loader handle it blindly. In C/C++, one could implement a similar plugin-loading approach, but it’s more manual.

In summary, static linking of ONNX Runtime yields a self-contained artifact and is often preferred for maximum reliability (especially in Rust), whereas dynamic linking is necessary in some cases and can be convenient for sharing libraries or updating them independently. The decision may depend on the execution providers you need (if all your needed EPs support static linking, that path is attractive; if not, dynamic can’t be avoided). It’s important for library developers to provide guidance or options to users: for instance, the Rust crate allows both methods via features, and a C++ library might offer CMake options for static vs dynamic linkage. One must also consider license implications (ORT is MIT licensed, so not problematic to static link) and compliance (some environments might require dynamic linking for security updates). But purely technically, both linking strategies are supported by ONNX Runtime.

4.2: How Linking Choices Affect Portability and Performance

Linking strategy has implications for portability, distribution size, and, to a smaller extent, performance:

  • Portability: a statically linked binary carries ONNX Runtime with it and runs without any separately installed library; a dynamically linked build needs the matching shared library (.so/.dll/.dylib) present on every target machine.
  • Distribution size: static linking produces larger binaries, since ORT (and any bundled EPs) is baked in; dynamic linking keeps individual binaries small and lets several applications share one copy of the library.
  • Performance: steady-state inference speed is essentially the same either way; differences are limited to startup time (loading a shared library) and memory, where a shared library can be mapped once and reused by many processes.

In essence, the linking choice doesn’t affect the core inference speed but does affect the deployment characteristics. Library developers should weigh these trade-offs. Often, providing both options is best: e.g., publish a Python wheel (dynamic) for ease of use, and also publish a statically linked binary for C++ or Rust usage where single-file deployment is valued.

From a user perspective, transparency and documentation are key. If your library uses ONNX Runtime, document whether users need to install anything. For example, a Rust library might say “this comes bundled with ONNX Runtime statically linked for CPU – if you want GPU support, enable this feature and ensure you have the appropriate CUDA libraries available.” Or a C++ application might check at startup for the presence of onnxruntime.dll and give a clear error message if not found. ONNX Runtime’s own documentation and the ort crate maintainers explicitly call out these linking nuances so developers can avoid common pitfalls (like missing DLLs or unresolved symbols).
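In Python, the equivalent startup check is short; a sketch along these lines gives users an actionable message instead of a cryptic loader error:

import sys

# Fail early, with a clear message, if the runtime (or the GPU provider) is missing.
try:
    import onnxruntime as ort
except ImportError:
    sys.exit("onnxruntime is not installed; install onnxruntime or onnxruntime-gpu.")
if "CUDAExecutionProvider" not in ort.get_available_providers():
    print("No CUDA provider available; running on CPU only.", file=sys.stderr)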

In summary, static linking tends to simplify portability at the cost of larger binaries, while dynamic linking can reduce footprint and allow shared usage at the cost of more complicated deployment. Performance differences are usually minor, except in memory usage when scaling out many processes or slight startup latency differences. Knowing these factors, you can make an informed decision that aligns with your project’s needs (e.g., a command-line tool might favor static for simplicity, whereas a system package might favor dynamic to integrate with system libraries).

4.3: Deploying ONNX Models Across Different Architectures

One of ONNX Runtime’s strengths is that it is cross-platform and supports various processor architectures – but to leverage that, library developers need to plan for how they’ll deploy ORT on different targets (x86, ARM, etc.). An ONNX model (.onnx file) is itself platform-neutral (it’s just data – a graph of ops). The question is ensuring that an ONNX Runtime implementation is available on the architecture where you want to run that model.

ONNX Runtime supports architectures like x86_64, ARM64, and others. For example, you may want to run on desktop CPUs (x86_64 Windows/Linux/macOS), mobile devices (ARM64 Android or iOS), or even specialized chips (like NVIDIA Jetson’s ARM64 CPU with its CUDA GPU). The official ONNX Runtime project provides pre-built binaries for the most common combinations (typically x64 Windows/Linux with CPU and GPU, plus some ARM builds), and the community (or the Rust crate maintainers) provides others. If a pre-built package is not available for your target (say you need ONNX Runtime on a Raspberry Pi with ARM32), you have the option to compile from source for that architecture. The build system supports cross-compilation (the project documents cross-compiling to ARM and other platforms).

When deploying across architectures, consider the following:

  • Binary availability: check whether an official or community pre-built ONNX Runtime exists for the target OS and CPU architecture; if not, plan for building or cross-compiling from source.
  • Execution provider availability: EPs are platform-specific (CUDA/TensorRT require NVIDIA hardware, NNAPI is Android-only, CoreML is Apple-only), so the provider list you configure may need to differ per target.
  • CPU capabilities: performance on CPU depends on the instruction sets available (e.g. AVX on x86_64, NEON on ARM), so expect different throughput characteristics across architectures.
  • Testing: validate the model’s outputs and performance on each architecture you ship for, rather than assuming parity with your development machine.

In practical terms, many library developers lean on containerization or pre-built distributions. For instance, if deploying to an NVIDIA Jetson (ARM64 with a CUDA GPU), one might use a Jetson-compatible ONNX Runtime build with GPU support, or use Docker containers that have ORT built in. If targeting iOS, one would include ORT as a static library in the Xcode project (the ORT project provides an XCFramework for iOS). These platform-specific deployment steps need to be documented for your library’s users.

To sum up, ONNX Runtime is designed to be cross-platform, but the onus is on the library/application developer to package or compile the runtime for each target architecture they intend to support. Static vs dynamic linking, as discussed, also plays a role here: static linking can simplify deploying to different OS/arch by baking in the bits, whereas dynamic linking means you need to include the right .so for each platform. The good news is the ONNX model itself doesn’t need changes – it’s all about the runtime support. By planning for multi-architecture support (perhaps using continuous integration to build for several platforms and verifying the inference correctness and performance), a library developer can deliver a robust solution that uses ONNX Runtime to run models anywhere from cloud servers to mobile phones, fulfilling ONNX’s promise of flexible deployment.

5: Execution Providers and Performance Trade-offs

5.1: Overview of Available Execution Providers

ONNX Runtime boasts a wide array of execution providers, each tapping into different hardware or software accelerators. The CPU Execution Provider is the default built-in provider that uses the host CPU (with optimizations like multi-threading and vectorized instructions). In addition, there are many others for hardware acceleration:

  • CUDA EP: runs ops on NVIDIA GPUs via CUDA/cuDNN.
  • TensorRT EP: hands whole subgraphs to NVIDIA TensorRT to build highly optimized GPU engines.
  • DirectML EP: GPU acceleration on Windows across GPU vendors, via DirectX 12.
  • OpenVINO EP: acceleration on Intel CPUs, integrated GPUs, and VPUs.
  • NNAPI EP: uses Android’s Neural Networks API to reach mobile accelerators.
  • CoreML EP: uses Apple’s Core ML stack on iOS and macOS devices.
  • ROCm EP: runs ops on AMD GPUs.
  • Others: additional providers target specific libraries and chips (for example Arm-based accelerators and various NPUs), and the list keeps growing.

In practice, the most commonly used EPs (at the time of writing) are CPU, CUDA, TensorRT, and DirectML, with OpenVINO and NNAPI also popular in their respective domains. The ONNX Runtime documentation provides a summary table of supported EPs, indicating which are stable and which are in preview. The variety of EPs means ORT is very flexible: the same ONNX model could run with different EPs on different devices, for example the CUDA EP on an NVIDIA server, the OpenVINO EP on an Intel-based edge device, or NNAPI on an Android phone.

For library developers, it’s important to know which EPs you want to support or include. Including every single EP is often not feasible because that could bloat the binary (each EP adds code and sometimes third-party dependencies). Instead, you’d pick based on your audience: e.g., if your library is for server-side inference with Nvidia GPUs, you’d include CUDA and TensorRT EPs. If it’s for mobile, you’d include NNAPI or CoreML EPs. The onnxruntime Python packages are split (there’s onnxruntime-gpu for CUDA/TensorRT, etc.) to keep size manageable. In Rust, you enable features for the EPs you need (the ort crate might mark some EP support as optional features, so you don’t pay the cost if you don’t use them).

5.2: Impact of Execution Provider Selection on Performance

Choosing the right execution provider can have a dramatic impact on performance. The general rule is: use the most specialized provider available for your hardware, as long as it supports your model’s operations. Here are a few considerations and trade-offs:

  • Operator coverage: specialized EPs support a subset of ONNX ops; if many nodes fall back to CPU, the graph gets split and data moves back and forth, which can erode the speedup.
  • Transfer overhead: moving tensors between host and device memory has a fixed cost, so very small or very fast models can be quicker on CPU than on a GPU.
  • Startup cost: providers such as TensorRT spend time building optimized engines when the session is created (mitigated by engine caching), which matters for short-lived processes.
  • Precision and quantization: some EPs can exploit FP16 or INT8 execution for large gains, at the price of small accuracy differences that should be validated.
  • Model size and batch size: the benefit of an accelerator generally grows with model and batch size; lightweight classical models often see little or no gain.

In practice, benchmarking is crucial. ONNX Runtime provides performance-tuning guidelines, but a lot comes down to trying different EPs with your specific model. For example, a BERT-like transformer model might benefit greatly from NVIDIA’s TensorRT or from OpenVINO with int8, whereas a simple logistic regression sees no benefit beyond CPU. A combination often used in deployment is running multiple providers in a prioritized list, e.g. ["CUDAExecutionProvider", "CPUExecutionProvider"] as mentioned, which gives the best of both: use the GPU when possible, otherwise the CPU. This ensures flexibility (it works on machines with or without a GPU) and performance (it takes advantage of a GPU if present). The ONNX Runtime team suggests always including CPU as a backup to guarantee execution.
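A benchmark does not need to be elaborate; a rough sketch like the following (with a model-specific input shape substituted in) is usually enough to compare provider configurations on your target hardware:

import time
import numpy as np
import onnxruntime as ort

def average_latency(model_path, providers, runs=50):
    # Load once, warm up once, then measure steady-state latency.
    sess = ort.InferenceSession(model_path, providers=providers)
    name = sess.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example shape; model-specific
    sess.run(None, {name: x})  # warm-up (engine building, lazy allocations)
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {name: x})
    return (time.perf_counter() - start) / runs

print(average_latency("model.onnx", ["CUDAExecutionProvider", "CPUExecutionProvider"]))
print(average_latency("model.onnx", ["CPUExecutionProvider"]))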

One trade-off to highlight: including many EPs can increase load time and memory usage. If you enable everything (CUDA, TensorRT, OpenVINO, etc.) in one runtime, the initialization will load various drivers and libraries which has overhead. If you know your environment (e.g., this build is for GPU machines only), you can trim down to just needed EPs for efficiency.

In summary, execution provider selection is a key lever for performance tuning in ONNX Runtime. The fastest option is to use the most optimized EP available for the target hardware – usually a vendor-specific accelerator. But you must consider coverage and overhead. Often a combination like “specialized EP + CPU fallback” is used to balance performance with completeness. The ONNX Runtime architecture ensures that if an EP can speed up even part of the model, it will do so, which generally yields performance improvements as long as the accelerated part is significant enough to outweigh any transition overhead. Best practice is to test your particular model with different EP configurations to see which yields the best performance on your target hardware, and then deploy with that configuration.

5.3: Best Practices for Balancing Performance and Flexibility

Balancing performance and flexibility when using ONNX Runtime often means making your solution run fast on high-end hardware while still being able to run (perhaps more slowly) on lower-end hardware or in different environments. Here are some best practices for library developers:

  • Always list providers in priority order and keep CPUExecutionProvider last as the universal fallback.
  • Detect what is available at runtime rather than assuming a GPU is present, and degrade gracefully (see the sketch at the end of this section).
  • Expose a simple configuration surface to users (for example a “device” or “use GPU” option) instead of hard-coding a provider.
  • Keep a single ONNX model artifact and vary only the runtime configuration per environment, so behavior stays consistent.
  • Test both the accelerated path and the CPU-only path, for correctness as well as performance.

Balancing performance and flexibility is essentially about graceful degradation: get maximum acceleration when available, but don’t break or behave poorly when it isn’t. ONNX Runtime, by virtue of its design, helps a lot with this by allowing multiple providers in priority order and by always having a safe CPU fallback that supports all ops. By following the above practices, a library can deliver great performance on, say, a powerful GPU workstation, while still functioning on a laptop with no GPU (perhaps slower, but correctly). Users can then decide to upgrade hardware or adjust settings if they need more speed, without needing a different codebase or model.
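The graceful-degradation pattern referenced above fits in a few lines; a sketch, assuming the standard Python package:

import onnxruntime as ort

# Request accelerators in priority order, but only those this build actually provides.
preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
available = set(ort.get_available_providers())
providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]
session = ort.InferenceSession("model.onnx", providers=providers)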

6: ONNX Runtime in Practice

6.1: Real-world Usage in Libraries and Applications

ONNX Runtime has been widely adopted in both open-source libraries and commercial applications as the inference engine of choice for ONNX models. Some notable examples:

  • Microsoft uses ONNX Runtime in many of its own products and services (for example in Office and Bing), and Windows exposes it through Windows ML for local inference.
  • Hugging Face’s Optimum library integrates ONNX Runtime to accelerate transformer models exported from PyTorch.
  • ML.NET can score ONNX models through ONNX Runtime, bringing trained models into .NET applications.
  • Many production teams export models from PyTorch or TensorFlow to ONNX and serve them with ORT on servers, edge devices, and mobile apps.

These real-world usages demonstrate that ONNX Runtime isn’t just an academic or experimental tool; it’s battle-tested in scenarios ranging from cloud services to mobile apps. For library developers, leveraging ORT means you are building on the same engine that powers large-scale systems, giving confidence in its stability and performance. It also means there’s a community of practice and knowledge – you can find examples of similar use cases to learn from. For instance, if you’re developing a cross-platform app in Flutter and want to use ORT, Microsoft even published a blog on how Pieces.app used ONNX Runtime in a Flutter app, indicating community knowledge on mobile integration.

6.2: Integration Challenges and Cross-Platform Deployment Solutions

Integrating ONNX Runtime into a library or application, especially across multiple platforms, can introduce some challenges. However, there are known solutions and patterns for each:

  • Shipping native binaries: each OS/architecture needs its own ORT binary; per-platform packages, crate features, or download-at-build steps are the usual answers.
  • Missing shared libraries at runtime: bundling the library next to the executable, setting the load path, or checking at startup with a clear error message avoids cryptic loader failures.
  • Version mismatches: a model using a newer ONNX opset than the runtime supports will fail to load; pinning the ORT version and testing against your models prevents surprises.
  • GPU and driver dependencies: CUDA/TensorRT EPs depend on specific driver and toolkit versions; documenting requirements or shipping containers keeps environments reproducible.
  • Mobile constraints: binary size and restricted APIs matter on Android/iOS; ONNX Runtime’s mobile-oriented, reduced-size build options address this.

In essence, cross-platform deployment with ONNX Runtime is a solved problem in many ways – numerous projects have done it, so there are established patterns. Using continuous integration to build/test on each target platform is extremely helpful: it catches issues early. For example, if your library CI builds on Linux, Windows, Mac, and runs a sample ONNX model through ORT on each, you can verify that all needed pieces are in place. Containerization is another helpful tool: you can ship a Docker image with ORT set up properly for those who want a turnkey solution.
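The CI check mentioned above can be as simple as a smoke test that loads a bundled model and verifies the output is sane; a sketch (substituting 1 for dynamic dimensions and assuming a float32 input, both of which are properties of the hypothetical test model):

import numpy as np
import onnxruntime as ort

def smoke_test(model_path="model.onnx"):
    # Runs on every CI platform: load the model on CPU and check the output is finite.
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # use 1 for dynamic dims
    x = np.random.rand(*shape).astype(np.float32)
    outputs = sess.run(None, {inp.name: x})
    assert all(np.isfinite(o).all() for o in outputs)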

Community forums (GitHub issues, ONNX Runtime’s GitHub, StackOverflow) are full of Q&A on integrating ORT in various scenarios, which can be a resource. As a library developer, embracing ORT’s cross-platform nature means you can support a wide user base – from cloud VMs to smartphones – but it does require careful planning of packaging and thorough testing in each environment.

6.3: Best Practices for Efficient and Maintainable ONNX Runtime Usage

To wrap up, here are some best practices for using ONNX Runtime in a way that is efficient (performance-wise) and maintainable (easy to support and update) within libraries or applications:

  • Create an InferenceSession once and reuse it for every request; session creation (loading, optimizing, allocating) is far more expensive than a single inference call (a small wrapper sketch follows this list).
  • Centralize configuration (providers, threading, optimization level) in one place so it can be changed without touching call sites.
  • Pin the ONNX Runtime version and upgrade deliberately, re-running your benchmarks and accuracy checks when you do.
  • Log which providers are actually in use at startup; it makes performance issues in the field much easier to diagnose.
  • Validate model outputs against a reference implementation as part of CI so optimizations or upgrades never silently change results.
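The session-reuse point above, as a small hypothetical wrapper (the class and parameter names are illustrative, not part of any ORT API):

import onnxruntime as ort

class OnnxModel:
    """Create the session once and reuse it for every prediction."""
    def __init__(self, path, providers=("CPUExecutionProvider",), threads=0):
        so = ort.SessionOptions()
        so.intra_op_num_threads = threads  # 0 lets ONNX Runtime choose a sensible default
        self.session = ort.InferenceSession(path, sess_options=so, providers=list(providers))
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, x):
        return self.session.run(None, {self.input_name: x})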

In conclusion, using ONNX Runtime in a library involves considerations at both development-time (setting up the build, choosing EPs, writing the integration code) and run-time (performance tuning, handling different environments). By understanding its architecture and using its features as intended, you can create a solution that is both fast and robust. ONNX Runtime effectively abstracts away a lot of complexity of multi-hardware inference, letting you focus on the higher-level functionality of your library. By following the outlined practices – from correct provider configuration to cross-platform testing and optimization – library developers can harness the full power of ONNX Runtime’s efficient inference while keeping their codebase clean and maintainable for the long term. The end result is delivering to users a flexible AI inference capability: they can take a model from anywhere, and with your library and ONNX Runtime, run it everywhere.

Key References