Quickstart

End-to-end walkthrough: take an ONNX model that does not fit in your target’s SRAM, tile it with TiGrIS, and deploy it.

We’ll use MobileNetV1 (int8 quantized, 128x128 input, 3.2M parameters). Its naive peak activation memory is 256 KiB, but we’ll compile it for a 64 KB SRAM budget on an ESP32-S3. TiGrIS tiles it into 9 stages and runs it in ~1.4 seconds. For full benchmark results, see Introducing TiGrIS.

Prerequisites

  • Python 3.10+ with tigris-ml installed
  • An ONNX model (f32, int8, or any other quantization)
pip install tigris-ml

Any ONNX model works. This walkthrough uses MobileNetV1 (128x128). To generate it and the other benchmark models, run python models/prepare.py from tigris-bench.

Step 1: Analyze

Check whether the model fits within a 64 KB SRAM + 8 MB PSRAM budget (typical for an ESP32-S3):

tigris analyze mobilenet_v1_i8.onnx -m 64K -m 8M -f 16M
╭────────────────────── TiGrIS - mobilenet_v1_i8 ──────────────────────╮
│ Operators            30                                              │
│ Tensors              114 (31 activations)                            │
│ Peak memory (naive)  256.00 KiB                                      │
│ Largest tensor       1x64x64x64 (256.00 KiB)                         │
│ Quantization         INT8 (QDQ)                                      │
╰──────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── SRAM ────────────────────────────────╮
│ Budget              64.00 KiB                                        │
│   pool 2 (slow)     8.00 MiB                                         │
│ Scheduled peak      64.00 KiB (25.0% of naive peak)                  │
│ Stages              9                                                │
│ Spill / reload I/O  992.01 KiB / 1.02 MiB                            │
│                                                                      │
│ Need tiling         6 of 9 stages                                    │
│   tileable          6 (18 tiles, max halo 2)                         │
╰────────────────  PASS - tiling resolves all stages  ─────────────────╯
╭─────────────────────────────── Flash ────────────────────────────────╮
│ Budget         16.00 MiB                                             │
│ Weight data     3.09 MiB                                             │
│ Plan overhead   0.01 MiB                                             │
│ Plan (est.)     3.10 MiB                                             │
╰─────────────────────────  PASS - plan fits  ─────────────────────────╯

The model’s naive peak is 256 KiB, but the SRAM budget is only 64 KiB. The compiler partitions the graph into 9 stages, 6 of which need spatial tiling (18 tiles total). Intermediate results spill to the 8 MB PSRAM pool between stages. No hardware is required; analyze runs entirely on the host.
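To see where these numbers come from: an int8 NCHW activation occupies one byte per element, so the sizes in the analyze report can be checked by hand. A quick sketch (the helper below is illustrative, not part of the TiGrIS API):

```c
#include <stddef.h>

/* Bytes occupied by an int8 NCHW activation: one byte per element.
   Illustrative helper, not part of the TiGrIS API. */
static size_t tensor_bytes_i8(size_t n, size_t c, size_t h, size_t w) {
    return n * c * h * w;
}
```

The 1x3x128x128 input is 48 KiB, conv0's output (1x32x64x64) is 128 KiB, and the largest activation (1x64x64x64) is 256 KiB, four times the 64 KiB budget; hence the tiling.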

Step 2: Compile

Generate a binary execution plan:

tigris compile mobilenet_v1_i8.onnx -m 64K -m 8M -f 16M --xip -o mobilenet.tgrs
Binary plan written to mobilenet.tgrs
  30 ops, 9 stages @ 64.00 KiB budget
  plan size: 3.10 MiB
  flash 16.00 MiB: fits
  • -m 64K -m 8M: memory pools, fast to slow. First is the SRAM budget, second is PSRAM. The compiler decides what goes where.
  • -f 16M: flash budget. Warns if the plan doesn’t fit.
  • --xip: execute-in-place. Weights are read from flash at runtime, not copied to SRAM.
  • -o mobilenet.tgrs: output path for the binary plan.

PSRAM is required for multi-stage models. Without it, only single-stage models (where the full model fits in one SRAM arena) are supported.

Step 3: Generate C code

Use codegen to produce a backend-specific C harness:

tigris codegen mobilenet.tgrs --backend esp-nn -o mobilenet.c
  • reference: portable C99 (any platform)
  • esp-nn: Espressif optimized kernels (ESP32 family)
  • cmsis-nn: Arm optimized kernels (Cortex-M family)

Same plan, different kernels. Alternatively, skip codegen and load the .tgrs file at runtime from a flash partition. See Runtime Integration for both approaches.

Step 4: Simulate (optional)

Inspect the execution trace before deploying:

tigris simulate mobilenet_v1_i8.onnx -m 64K -m 8M

This prints a step-by-step trace of what the runtime would do. No actual inference runs.

╭───────────────── TiGrIS Simulate - mobilenet_v1_i8 ──────────────────╮
│ 30 ops, 9 stages, 64.00 KiB budget, 256.00 KiB peak                  │
╰──────────────────────────────────────────────────────────────────────╯

──────────────────────────  Stage 0 (ops 0-0)  ─────────────────────────
  Peak: 48.00 KiB | Fits budget

  Reload inputs:
    input  [1, 3, 128, 128]  48.00 KiB  <- slow memory

 Step  Op              Type   In shape           Out shape          Live
    0  conv0_conv      Conv   [1, 3, 128, 128]   [1, 32, 64, 64]   48 KiB

  Spill outputs:
    conv0_out  [1, 32, 64, 64]  128.00 KiB  -> slow memory

──────────────────────────  Stage 1 (ops 1-1)  ─────────────────────────
  Peak: 128.00 KiB | Tiled: 3 tiles, 30 rows + 2 halo (RF 3)

  Reload inputs:
    conv0_out  [1, 32, 64, 64]  128.00 KiB  <- slow memory

 Step  Op              Type           In shape           Out shape
    1  b1_dw_conv      DepthwiseConv  [1, 32, 64, 64]   [1, 32, 64, 64]

  Spill outputs:
    b1_dw_out  [1, 32, 64, 64]  128.00 KiB  -> slow memory

  ... (7 more stages)

Each stage shows what gets reloaded from slow memory, which ops run, and what gets spilled back. Stage 1’s peak (128 KiB) exceeds the 64 KiB budget, so the compiler tiles it into 3 passes of 30 rows with a 2-row halo overlap.
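The halo arithmetic can be sketched as follows: for a stride-1 convolution with receptive field rf, a tile producing output rows [r0, r1) must also read rf/2 extra input rows on each side, clamped at the tensor edges. This is a simplified sketch; tile_input_rows is illustrative, not the planner’s actual code.

```c
#include <stddef.h>

/* Input rows a tile must read for a stride-1 conv with receptive
   field rf: its output rows [r0, r1) plus rf/2 halo rows on each
   side, clamped to the tensor height h. Illustrative sketch, not
   the planner's actual code. */
typedef struct { size_t begin, end; } row_span;

static row_span tile_input_rows(size_t r0, size_t r1, size_t rf, size_t h) {
    size_t halo = rf / 2;                       /* rf 3 -> 1 row per side */
    row_span s;
    s.begin = (r0 > halo) ? r0 - halo : 0;      /* clamp at the top edge */
    s.end   = (r1 + halo < h) ? r1 + halo : h;  /* clamp at the bottom edge */
    return s;
}
```

For stage 1 (rf 3, 64 rows, tile height 30), an interior tile covering output rows 30 to 60 reads input rows 29 to 61: its 30 output rows plus the 2-row halo.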

Step 5: Deploy

The .tgrs plan contains the operator schedule, memory map, tiling parameters, and weights. On your target:

  1. Flash the plan to a partition, or embed it via codegen.
  2. Load with tigris_plan_load().
  3. Initialize arenas with tigris_mem_init().
  4. Run inference with tigris_run().

Minimal C example:

#include "tigris.h"
#include "tigris_loader.h"
#include "tigris_mem.h"
#include "tigris_executor.h"
#include "tigris_kernels_s8.h"

#include <string.h>

extern const uint8_t plan_data[];
extern const uint32_t plan_size;

static uint8_t fast_buf[64 * 1024];   /* 64K SRAM arena */
static uint8_t slow_buf[512 * 1024];  /* PSRAM for spills */

void run_inference(const int8_t *input, size_t input_size) {
    tigris_plan_t plan;
    tigris_plan_load(plan_data, plan_size, &plan);

    void *tensor_ptrs[128];  /* max tensors */
    tigris_mem_t mem;
    tigris_mem_init(&mem, (void **)tensor_ptrs, plan.header->num_tensors,
                    fast_buf, sizeof(fast_buf),
                    slow_buf, sizeof(slow_buf));

    uint16_t in_idx = plan.model_inputs[0];
    tigris_mem_alloc_slow(&mem, in_idx, plan.tensors[in_idx].size_bytes);
    memcpy(mem.tensor_ptrs[in_idx], input, input_size);

    tigris_exec_stats_t stats;
    tigris_run(&plan, &mem, tigris_dispatch_kernel_s8, NULL, &stats);

    uint16_t out_idx = plan.model_outputs[0];
    const int8_t *output = (const int8_t *)mem.tensor_ptrs[out_idx];
    (void)output;  /* consume the result here, e.g. argmax over the logits */
}

The runtime is about 8 KB of code, requires no heap allocation, and works on any C99 target. See Runtime Integration for error handling, ESP-IDF setup, and ESP-NN backend configuration.

What’s next