Milestone 3: Training a Lightweight Clarity Model + Exporting to RKNN (RV1126)


Milestone: 3 — Train → Export ONNX → Convert RKNN → Ready for On-board Validation

1) What Milestone 3 Is About

Milestone 3’s goal is straightforward:

  1. Train a lightweight clarity classifier/regressor (PyTorch) on a real dataset pipeline (Milestone 2 output).
  2. Export the best checkpoint to ONNX (Rockchip toolchain compatibility requires opset ≤ 12 in my case).
  3. Convert ONNX to RKNN using rknn-toolkit inside an Ubuntu VM.
  4. Produce a deployable .rknn model for RV1126.

At the end of this milestone, I should have a clean, reproducible conversion pipeline and a ready-to-test model on board.


2) Current Status (What’s Done)

  • Training completed (baseline clarity_v0)
  • Evaluation completed (val/test accuracy ~0.46–0.47; the confusion matrix shows a “collapse”/class bias toward the middle bins)
  • ONNX export succeeded (confirmed opset = 12)
  • RKNN conversion succeeded (generated clarity_fp.rknn)
  • Next: board-side deployment & runtime verification (SSH is unstable on the campus network, so I may temporarily fall back to UART)


3) Training Snapshot (Baseline Reality Check)

I’m using Milestone 2’s processed dataset + manifest to train a small model on Windows (PyTorch). The baseline accuracy is not perfect, and the confusion matrix shows the model tends to predict a few classes heavily.

This is not a failure — it’s useful signal:

  • The dataset’s “clarity” distribution may be imbalanced.
  • Labels generated from traditional metrics (e.g., Laplacian/Tenengrad) can be noisy, especially near decision boundaries.
  • Resolution/compression in processed images can reduce perceptual clarity differences.
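For reference, the traditional metrics mentioned above boil down to simple gradient statistics. A minimal NumPy sketch (function names are mine, not the actual Milestone 2 code):

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian response; higher = sharper."""
    g = gray.astype(np.float64)
    # up + down + left + right - 4 * center, on the interior pixels
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def tenengrad(gray: np.ndarray) -> float:
    """Mean squared Sobel gradient magnitude; higher = sharper."""
    g = gray.astype(np.float64)
    # Sobel x: (right column, weights 1/2/1) - (left column, weights 1/2/1)
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    # Sobel y: (bottom row) - (top row)
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    return float((gx ** 2 + gy ** 2).mean())
```

Both scores collapse a whole image into one number, so frames whose score sits near a bin boundary get essentially arbitrary labels; that is exactly where the label noise comes from.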

Key takeaway: I can still proceed to deployment testing, because the real question is:

“Does the NPU-friendly model beat traditional metrics on the board scenario?”


4) Exporting ONNX (Windows)

Why opset=12 matters

During conversion, the RKNN toolchain rejected opset 18+ (a common default in recent exporters), so I forced opset=12.

Practical pain points I hit

  • Missing Python deps (onnx, onnxscript, pyyaml, etc.)
  • ModuleNotFoundError: host / import paths breaking when running scripts from different working directories
  • Model builder signature mismatch (num_classes missing) after refactors
  • Exported ONNX accidentally had dynamic input shape, which RKNN refused

The outcome

I eventually exported:

  • models/clarity_v0/clarity_op12.onnx (opset=12, static input)

Checklist after export:

  • Confirm opset:
    • python -c "import onnx; m=onnx.load('...onnx'); print([(o.domain,o.version) for o in m.opset_import])"

5) Converting to RKNN (Ubuntu VM)

Conversion happens in an Ubuntu VM using rknn-toolkit 1.7.5.

Common dependency traps

I hit a few classic conflicts:

  • TensorFlow missing → toolkit internal modules crash
  • Torch missing → some optimize paths import torch internals
  • typing_extensions mismatch (some torch builds require newer versions)
  • “Unknown level: ‘WARNING’” (a bad env var / logging level parsing issue in that environment)
  • RKNN rejected ONNX with dynamic input: shape ['N', 3, 224, 224] not support

The fix that mattered most: static input shape

RKNN doesn’t like dynamic dims in the exported ONNX. The conversion script must load ONNX with explicit input shape / fixed dims.

After fixing the conversion script and ensuring ONNX opset=12 + a static input, I got:

models/clarity_v0/clarity_fp.rknn
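For reference, the conversion follows the shape of the rknn-toolkit 1.x Python API. This is a sketch, not my exact script: the normalization values here are the common ImageNet-style ones and are an assumption; they must match whatever preprocessing Milestone 2 actually used.

```python
from rknn.api import RKNN

rknn = RKNN()

# Normalization must mirror the training-time preprocessing.
# (Values below are placeholders, not necessarily what I trained with.)
rknn.config(mean_values=[[123.675, 116.28, 103.53]],
            std_values=[[58.395, 57.12, 57.375]],
            target_platform=['rv1126'])

# The ONNX must already be opset<=12 with a static [1, 3, 224, 224] input.
ret = rknn.load_onnx(model='models/clarity_v0/clarity_op12.onnx')
assert ret == 0, 'load_onnx failed'

# Floating-point build first (no quantization) to validate the pipeline
# end-to-end; INT8 + a calibration dataset can come later.
ret = rknn.build(do_quantization=False)
assert ret == 0, 'build failed'

ret = rknn.export_rknn('models/clarity_v0/clarity_fp.rknn')
assert ret == 0, 'export_rknn failed'
rknn.release()
```

Building without quantization first is deliberate: it separates "does the graph convert at all" from "does the quantized model still predict sensibly", so each failure mode is debugged on its own.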


6) Deliverables Produced in Milestone 3

  • Trained checkpoint:
    host/outputs/clarity_v0/best.pt
  • Exported ONNX (opset 12):
    models/clarity_v0/clarity_op12.onnx
  • Converted RKNN:
    models/clarity_v0/clarity_fp.rknn

This means the “model artifact chain” is now complete.


7) Lessons Learned (Hard-Won)

  1. Keep ONNX opset conservative for embedded toolchains (≤ 12 here).
  2. Dynamic shape kills conversion — lock input to fixed [1,3,H,W].
  3. Don’t trust GUI tools; a CLI pipeline is more reproducible and debuggable.
  4. Split environments by responsibility:
    • training env (PyTorch, metrics, augment)
    • export env (onnx/onnxscript stable)
    • convert env (rknn-toolkit + its pinned deps)
  5. Don’t block on “model is not perfect yet” — deploy early to validate end-to-end feasibility.
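The environment split in lesson 4 can be sketched as separate virtual environments. Names and package pins here are illustrative, not the exact ones I used:

```shell
# Illustrative layout: one isolated env per responsibility.
python -m venv envs/train      # PyTorch, metrics, augmentation
python -m venv envs/export     # onnx / onnxscript pinned to known-good versions
envs/export/bin/pip install onnx onnxscript pyyaml

# The convert env lives in the Ubuntu VM, built around rknn-toolkit 1.7.5;
# its own requirements file dictates the pinned numpy/tensorflow versions there.
```

The point is that a dependency bump needed by one stage (say, a newer onnxscript for export) can never silently break another stage's pinned toolchain.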

8) Next Step: On-board Validation (Milestone 4 Preview)

Milestone 4 will focus on:

  • Upload .rknn + minimal inference binary/script to the RV1126
  • Verify:
    • RKNN runtime loads the model
    • inference runs on NPU
    • latency/fps
    • compare with traditional metrics (same frames)
  • If SSH is unstable (campus network), fall back to UART for reliable access.
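The runtime check can take roughly this shape, again via the rknn-toolkit 1.x API with a connected target. The frame source, run count, and the assumption that the board is reachable through the toolkit are all placeholders:

```python
import time
import numpy as np
from rknn.api import RKNN

rknn = RKNN()
rknn.load_rknn('models/clarity_v0/clarity_fp.rknn')

# target='rv1126' runs inference on the connected board's NPU instead of
# simulating on the host.
ret = rknn.init_runtime(target='rv1126')
assert ret == 0, 'runtime init failed'

# Placeholder frame; real validation would feed the same frames that the
# traditional metrics were scored on.
img = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)

# Warm up once, then average a handful of runs for a rough latency/fps figure.
rknn.inference(inputs=[img])
t0 = time.time()
n = 20
for _ in range(n):
    outputs = rknn.inference(inputs=[img])
dt = (time.time() - t0) / n
print(f'avg latency: {dt * 1000:.1f} ms  (~{1 / dt:.1f} fps)')
rknn.release()
```

Feeding the model and the traditional metrics the same frames keeps the Milestone 4 comparison apples-to-apples.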
