Benchmarking On-Device Inference: Best Practices Using Raspberry Pi 5 and AI HAT+2

trainmyai
2026-01-25
10 min read

Practical guide to benchmarking Raspberry Pi 5 + AI HAT+2 for latency, throughput and energy—actionable steps, scripts, and 2026 trends.

Hook — Why your Pi 5 inference numbers probably lie (and how to fix that)

If your team is evaluating on-device AI for production—image classification at the edge, real-time audio transcription in kiosks, or private LLM-based assistants—raw model accuracy is only half the story. The painful questions are usually: how fast does the model respond under load, how many requests per second can a single unit handle, and how much energy will it consume when deployed at scale?

In 2026 the mix of new small NPUs, more aggressive memory pricing, and refreshed edge hardware makes accurate benchmarking essential. This guide gives developers and IT teams a practical, repeatable benchmarking plan for the Raspberry Pi 5 + AI HAT+2 that reliably measures latency, throughput, and energy use, and shows you how to interpret and improve results.

The 2026 context: why Pi 5 + HAT+2 benchmarks matter now

Two trends are driving edge AI decisions in 2026: widespread NPU availability on cheap SBCs and continued memory/price pressures across the supply chain. At CES 2026 several reports noted higher memory costs that indirectly affect model sizing and deployment choices; at the same time new accessory NPUs like the AI HAT+2 are enabling generative and transformer-based workloads on small devices.

"Memory chip scarcity is driving up prices for laptops and PCs" — industry reporting from early 2026 highlights how hardware cost could influence model choices and on-device strategies.

Benchmarks with the Pi 5 and AI HAT+2 are therefore no longer academic experiments—they inform procurement, capacity planning, and energy budgets for fleets of edge devices. A rigorous methodology will avoid costly surprises when you scale from a proof-of-concept to hundreds or thousands of deployed units.

What this guide covers

  • Hardware and software prerequisites for reproducible testing
  • Definitions: latency, throughput, and energy metrics you must report
  • Step-by-step benchmark workflows (commands and example scripts)
  • Profiling, bottleneck analysis and practical optimisations
  • How to present results for procurement and DevOps teams
  • Advanced strategies and 2026 trends to watch

Prerequisites: hardware, software and a repeatable lab

Hardware

  • Raspberry Pi 5 (latest firmware and cooling solution). Use identical units for comparison.
  • AI HAT+2 (latest firmware/driver). Confirm vendor SDK for NPU acceleration is installed.
  • High-quality SD card or USB storage (use same model across tests).
  • External power meter (recommended) — USB power meters or inline DC meters (e.g., Monsoon, INA219 + Pi) for energy measurement.
  • Network isolation (run benchmarks offline or on a private network to avoid variability).

Software

  • Raspberry Pi OS (64-bit) with updated kernel and firmware (late-2025/early-2026 updates to ARM support improve NPU drivers).
  • Python 3.11+, pip packages: numpy, psutil, onnxruntime (or vendor runtime), tflite-runtime if applicable.
  • Vendor SDK for AI HAT+2 (follow official install steps; verify the NPU device is visible; a quick runtime check follows this list).
  • Profiling tools: perf, top, mpstat, and a lightweight tracer like py-spy for Python code. For broader telemetry and cache-level analysis consider the playbook in Monitoring and Observability for Caches.
  • Optional: ONNX/TF/Torch conversion tooling for constrained models.
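
If you use onnxruntime as a vendor-neutral runtime, a quick sanity check confirms the install and shows which execution providers are registered; the exact NPU provider name exposed by the AI HAT+2 SDK is vendor-specific and not assumed here:

# sanity check: confirm the runtime is installed and list its execution providers
import onnxruntime as ort

print('onnxruntime version:', ort.__version__)
print('available providers:', ort.get_available_providers())
# a CPU-only install typically shows only 'CPUExecutionProvider';
# if the vendor SDK registered an NPU provider it should appear here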

Key metrics and definitions

Before running tests, agree on what you will measure and report. Make all tests reproducible and report confidence intervals.

  • Latency: time between input arrival and output availability. Report p50, p95 and p99 (median and tail latencies). Measure cold-start latency separately if your deploys may restart processes.
  • Throughput: inference requests per second (RPS) the device can sustain with acceptable tail latency (e.g., p95 < target SLA).
  • Energy: joules (watt-seconds) per inference and average watts during steady-state. For fleet cost estimates, convert to kWh/year.
  • Utilisation: CPU, NPU, memory bandwidth, and thermal behaviour (throttling). These explain bottlenecks; tie these metrics into broader edge analytics and gateway telemetry where appropriate.

Benchmark design principles

  1. Isolate variables: run one change at a time (model, batch size, thread count, power profile).
  2. Warm up: include a warm-up phase (e.g., 20-50 inferences) to avoid JIT/compiler effects and NPU cold caches.
  3. Repeat and aggregate: run 5–10 trials and report mean ± stddev plus p95/p99 (a short aggregation sketch follows this list).
  4. Realistic inputs: use representative input data (image sizes, audio frame lengths, token counts) rather than synthetic tiny inputs that skew results.
  5. Measure end-to-end: include pre/post-processing time that your app performs in production.
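
A minimal sketch of the repeat-and-aggregate principle; run_trial() is a hypothetical helper that performs one warm-up plus timed run and returns its latency samples in milliseconds:

# aggregate several independent trials; run_trial() is a hypothetical helper
# that performs one warm-up + timed run and returns a list of latencies (ms)
import numpy as np

trials = [run_trial() for _ in range(5)]          # 5-10 trials recommended
per_trial_means = [np.mean(t) for t in trials]
all_samples = np.concatenate(trials)

print('mean latency: %.1f ms +/- %.1f ms' % (np.mean(per_trial_means),
                                             np.std(per_trial_means)))
print('p95: %.1f ms  p99: %.1f ms' % (np.percentile(all_samples, 95),
                                      np.percentile(all_samples, 99)))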

Step-by-step: Latency measurement

We present a minimal reproducible Python pattern you can adapt. This example assumes model loading with a vendor runtime exposing a .run() call.

# pseudo-code; adapt to your runtime
import time
import numpy as np

# load model via vendor runtime (load_model is a placeholder for your SDK's API)
model = load_model('model.onnx', use_npu=True)

# prepare inputs matching the production payload (shape/dtype should mirror real traffic)
inputs = [np.random.randn(1, 3, 224, 224).astype('float32') for _ in range(200)]

# warm-up: absorbs JIT/compiler effects and NPU cold caches
for i in range(30):
    model.run(inputs[i % len(inputs)])

# timed runs
latencies = []
for i in range(100):
    t0 = time.perf_counter()
    model.run(inputs[i % len(inputs)])
    t1 = time.perf_counter()
    latencies.append((t1 - t0) * 1000)  # ms

# report median and tail latencies
print('p50', np.percentile(latencies, 50))
print('p95', np.percentile(latencies, 95))
print('p99', np.percentile(latencies, 99))

Practical notes:

  • Run this either with the process pinned to a small set of CPU cores or with the default scheduler, and document the choice.
  • Log system metrics concurrently with mpstat and top to correlate CPU/NPU utilisation (a psutil-based sampler sketch follows this list).
  • Measure both cold start (first load and first inference) and warm latency if you will frequently restart processes.
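
A lightweight way to capture system metrics from Python is a psutil sampler running in a background thread; the sampling interval and fields below are just a starting point:

# background sampler: logs CPU and memory utilisation while the benchmark runs
import threading
import time
import psutil

samples = []
stop = threading.Event()

def sample_system(interval=0.5):
    while not stop.is_set():
        samples.append({
            'ts': time.time(),
            # note: the first cpu_percent() call may return 0.0 (no prior sample)
            'cpu_percent': psutil.cpu_percent(interval=None),
            'mem_percent': psutil.virtual_memory().percent,
        })
        time.sleep(interval)

sampler = threading.Thread(target=sample_system, daemon=True)
sampler.start()
# ... run the latency/throughput loop here ...
stop.set()
sampler.join()
print('collected %d system samples' % len(samples))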

Step-by-step: Throughput testing

Throughput tests simulate continuous load. Use a simple loop that sends requests as fast as possible and records per-request latency; alternatively use a multi-threaded client that mimics your production runtime.

# simple synchronous loop measuring sustained RPS
# assumes `model` from the latency example; next_input() is a placeholder
# that returns a production-like payload
import time

start = time.time()
count = 0
end_time = start + 60  # 60 s sustained run
while time.time() < end_time:
    model.run(next_input())
    count += 1

elapsed = time.time() - start
rps = count / elapsed
print('sustained RPS', rps)

For asynchronous systems or batched inference, measure effective inferences per second and latency distribution under that load. Important variations to test:

  • Single-request mode (batch size = 1)
  • Micro-batch mode (batch sizes 2–8; see the sketch after this list)
  • Batching with different queuing delays (to simulate real arrival patterns)
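
A minimal micro-batching sketch, assuming the runtime accepts a stacked (batch, 3, 224, 224) input; whether batching is supported, and which sizes help, depends on the HAT+2 runtime:

# micro-batch throughput: effective inferences/sec = batches/sec * batch_size
# assumes model.run() accepts a stacked (batch, 3, 224, 224) input
import time
import numpy as np

for batch_size in (1, 2, 4, 8):
    batch = np.random.randn(batch_size, 3, 224, 224).astype('float32')
    # warm-up
    for _ in range(10):
        model.run(batch)
    n_batches = 100
    start = time.perf_counter()
    for _ in range(n_batches):
        model.run(batch)
    elapsed = time.perf_counter() - start
    print('batch %d: %.1f inferences/sec, %.1f ms per batch' % (
        batch_size, n_batches * batch_size / elapsed,
        1000 * elapsed / n_batches))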

Step-by-step: Energy and power measurement

Energy is the least standardised metric but often the most important for fleet TCO. Two practical approaches:

  1. External power meter: Use an inline USB/DC power meter and log watts continuously. For a 60s workload, compute joules per inference = (average watts * seconds) / number of inferences.
  2. Controlled wall-power baseline: Measure idle power for 60s, then active power for your workload for 60s; subtract idle to get net inference energy.

Example calculation:

  • Idle: 3.5 W
  • Active during load: 6.5 W
  • Net: 3.0 W additional
  • If sustained RPS = 10, per-inference energy = (3.0 W / 10 RPS) = 0.3 J per inference.
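
The same arithmetic as a short script, assuming you have logged meter readings (in watts) for an idle window and an active window of the same length; the sample values below are illustrative:

# net joules per inference from averaged power samples
# idle_watts / active_watts are lists of meter readings sampled over 60 s
import numpy as np

idle_watts = [3.4, 3.5, 3.6]        # replace with your idle-window samples
active_watts = [6.4, 6.5, 6.6]      # replace with your active-window samples
duration_s = 60.0
inferences = 600                    # inferences completed in the active window

net_watts = np.mean(active_watts) - np.mean(idle_watts)
joules_per_inference = net_watts * duration_s / inferences
print('net power: %.2f W' % net_watts)
print('energy per inference: %.2f J' % joules_per_inference)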

Practical tips:

  • Make sure peripherals (HDMI, USB devices) are consistent across runs; they change power draw. If you're planning field deployments, consider the recommendations in portable power and solar-backed kits such as the Host Pop-Up Kit or the Jackery/EcoFlow comparisons when sizing batteries.
  • Measure for long enough to average out fluctuations (60–300 seconds depending on variability).
  • If you lack a hardware meter, consider a calibrated USB power delivery meter or measure wall socket using a Kill A Watt style device on the Pi's power supply.

Profiling and bottleneck analysis

Once you have basic metrics, use profiling to understand bottlenecks:

  • CPU vs NPU: vendor SDKs often expose utilisation counters. If NPU % is low but CPU usage is high, preprocessing or data movement is the bottleneck.
  • Memory bandwidth: large models may be memory-bound. Check swap usage and page faults. Use cache and observability guidance from monitoring and observability for caches to instrument memory-sensitive workloads.
  • Thermal throttling: monitor temperature and frequency; if throttling occurs, results are not reproducible. Document cooling used.

Tools and commands:

  • top, htop, mpstat — CPU utilisation trends
  • perf stat — hardware counters (cycles, cache misses)
  • vendor profiler — NPU utilisation and operator timings
  • psutil (Python) — process-level CPU and memory usage in parallel with latency logging

Optimisations that change the numbers

When you tune a model for on-device use, always re-run the same benchmark suite. Common optimisations with high ROI:

  • Quantisation (8-bit / 16-bit): reduces memory bandwidth and NPU compute; expect lower latency and energy per inference if supported by the HAT+2 runtime.
  • Operator fusion: use a runtime that fuses operators to reduce memory writes.
  • Model pruning / distillation: smaller model size decreases memory traffic and may improve throughput.
  • Batching: trade-off between latency and throughput—micro-batches can raise RPS but increase tail latency.
  • Pipeline partitioning: keep pre/post processing on CPU and heavy ops on NPU for better resource utilisation.
  • Threading and core pinning: pin CPU threads to dedicated cores to reduce jitter. For field setups and micro-events where low-jitter is critical, see portable edge kit recommendations in portable edge kits.

Note: quantisation sometimes reduces accuracy. Include accuracy checks in the benchmarking loop to confirm the model remains within acceptable thresholds.
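
A minimal sketch of such an accuracy check, assuming a small labelled holdout set kept on-device and that both models expose the same run() call; baseline_model, quantised_model, holdout and holdout_labels are placeholders:

# compare quantised model accuracy against the fp32 baseline on a holdout set
# baseline_model / quantised_model / holdout / holdout_labels are placeholders
import numpy as np

def top1_accuracy(model, samples, labels):
    correct = 0
    for x, y in zip(samples, labels):
        logits = model.run(x)
        correct += int(np.argmax(logits) == y)
    return correct / len(labels)

acc_fp32 = top1_accuracy(baseline_model, holdout, holdout_labels)
acc_int8 = top1_accuracy(quantised_model, holdout, holdout_labels)
print('fp32: %.3f  int8: %.3f  drop: %.3f' % (acc_fp32, acc_int8,
                                              acc_fp32 - acc_int8))
assert acc_fp32 - acc_int8 < 0.01, 'quantisation accuracy drop exceeds threshold'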

Reproducibility checklist

Record the following with every benchmark:

  • Hardware: Pi model, RAM, HAT+2 firmware version, cooling, serial numbers if needed.
  • Software: OS version, kernel, Python version, vendor SDK version, model file hash.
  • Benchmark parameters: input dataset, batch size, warm-up count, trial count, CPU governor, frequency limits.
  • Power measurement method and device calibration.
  • Raw logs: latency samples, system metrics, profiler outputs.
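
A small sketch that captures part of this checklist automatically with each run; the output filename and the extra fields you record are up to you:

# record run metadata next to the raw results so every benchmark is traceable
import hashlib
import json
import platform
import sys
import time

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

metadata = {
    'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
    'platform': platform.platform(),
    'python': sys.version,
    'model_sha256': file_sha256('model.onnx'),
    # add vendor SDK version, CPU governor, batch size, warm-up count, etc.
}

with open('run_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)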

How to present results to stakeholders

Translate microbenchmark metrics into business-relevant figures for procurement and operations:

  • Per-device energy per inference → yearly energy cost at your expected request volume (e.g., 10k requests/day per device) across the fleet (see the conversion sketch after this list).
  • SLA mapping: determine the maximum concurrent users per device while meeting p95 < SLA threshold.
  • Hardware TCO: combine unit cost, expected lifespan, energy cost and support overhead to compare Pi 5 + HAT+2 vs alternatives. If you’re buying battery backups or field power, consult power-station comparisons like Jackery vs EcoFlow to estimate real-world runtime and replacement schedules.
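
To turn the energy figure into a yearly cost, a back-of-the-envelope conversion is enough; the request volume, electricity price, and fleet size below are illustrative assumptions, not measurements:

# convert joules per inference into yearly energy cost per device and per fleet
# note: this counts net inference energy only; add idle power for total draw
joules_per_inference = 0.3        # from the energy benchmark
requests_per_day = 10_000         # assumed traffic per device
price_per_kwh = 0.28              # assumed electricity price (GBP/kWh)
fleet_size = 500                  # assumed number of deployed units

kwh_per_year = joules_per_inference * requests_per_day * 365 / 3_600_000
print('per device: %.2f kWh/year, GBP %.2f/year' % (
    kwh_per_year, kwh_per_year * price_per_kwh))
print('fleet: %.0f kWh/year, GBP %.0f/year' % (
    kwh_per_year * fleet_size, kwh_per_year * fleet_size * price_per_kwh))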

Advanced strategies and 2026 trends to watch

As of early 2026, there are a few trends and tools to incorporate into your strategy:

  • Vendor runtimes improving ARM NPU support: many vendors released updates in late 2025 to increase operator coverage—retest after each SDK update. Also consider integrating benchmark checks into CI; see CI/CD patterns in CI/CD for generative video models for ideas on reproducible pipelines.
  • Model compilers and graph-level optimisers (e.g., TVM/Glow variants for ARM) gained optimized passes for small NPUs—these can significantly reduce latency.
  • Memory-price pressures (reported at CES 2026) mean smaller models and aggressive quantisation are more attractive for cost-constrained projects.
  • Edge orchestration: lightweight container runtimes and device management platforms now include health and metric collectors—integrate benchmark scripts into CI or lightweight orchestration. For low-latency retail and pop-up scenarios where device trust and quick updates matter, the edge-enabled pop-up retail guide has practical overlap with these patterns.

Example: summarised benchmark report (template)

Include a short one-page summary for decision-makers, and attach raw results for engineering teams.

  • Device: Raspberry Pi 5 + AI HAT+2 (HAT firmware v1.2.0)
  • Model: ResNet50-distilled (onnx, quantised int8)
  • Workload: 224x224 RGB images, steady traffic
  • Latency (p50/p95/p99): 45ms / 120ms / 200ms
  • Throughput (sustained RPS): 18 RPS
  • Energy per inference: 0.25 J (net)
  • Notes: CPU-bound pre-processing; no thermal throttling with passive heatsink + fan

Common pitfalls and how to avoid them

  • Running microbenchmarks with batch size=1 synthetic inputs: they often overestimate real-world performance.
  • Neglecting warm-up and JIT effects: always include warm-up to stabilise runtimes and NPU caches.
  • Forgetting system noise: disable background jobs and network traffic during tests.
  • Comparing non-equivalent models: ensure the same architecture, accuracy and optimisations before comparing numbers.

Security and compliance brief (UK focus)

When benchmarking models on real user data, anonymise or synthesise inputs. For UK data protection compliance (UK GDPR), document data minimisation steps and retention policies, and ensure devices are configured to avoid data exfiltration during testing. Keep local test datasets on-device and encrypt storage where required.

Final checklist before you ship

  1. Run the full benchmark suite (latency, throughput, energy) using production-like inputs.
  2. Validate model accuracy after quantisation/pruning.
  3. Document hardware/software versions and attach raw logs.
  4. Include energy cost in the 3-year TCO for fleet procurement decisions.
  5. Integrate smoke tests into CI for any model or runtime updates.
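
A CI smoke test can be as simple as asserting on the headline numbers; run_latency_benchmark() and the thresholds below are placeholders to adapt to your SLA:

# CI smoke test sketch: fail the pipeline if a model or SDK update regresses the SLA
# run_latency_benchmark() is a hypothetical helper returning latency samples in ms
import numpy as np

def test_inference_sla():
    latencies = run_latency_benchmark(n=100)
    p95 = np.percentile(latencies, 95)
    assert p95 < 150, 'p95 latency %.1f ms exceeds the 150 ms SLA' % p95
    # extend with an accuracy assertion (see the quantisation check above)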

Why this methodology saves time and money

Rigorously measuring latency, throughput and energy on representative Pi 5 + AI HAT+2 units avoids the common trap of optimistic bench numbers that collapse under real-world load. Accurate benchmarks help you:

  • Choose the right device configuration and cooling for your SLA
  • Size fleet and energy budgets correctly
  • Detect regression risks from SDK updates or model changes

Next steps and call-to-action

Start with a small reproducible benchmark: pick one representative model, run the latency/throughput/energy suite above, and document results. If you want a ready-made, vendor-neutral benchmarking kit for Pi 5 + AI HAT+2 that includes scripts, dashboards and reporting templates tuned for UK deployments, our team can provide a hands-on workshop or an automated benchmark pack that plugs into your CI.

Action: Run one 60-second throughput test and one 100-inference latency run; if you share the logs we’ll provide a free evaluation and recommendations scoped to your use case.
