Milestone: 3 — Train → Export ONNX → Convert RKNN → Ready for On-board Validation
1) What Milestone 3 Is About
Milestone 3’s goal is straightforward:
- Train a lightweight clarity classifier/regressor (PyTorch) on a real dataset pipeline (Milestone 2 output).
- Export the best checkpoint to ONNX (Rockchip toolchain compatibility requires opset ≤ 12 in my case).
- Convert ONNX to RKNN using rknn-toolkit inside an Ubuntu VM.
- Produce a deployable `.rknn` model for RV1126.
At the end of this milestone, I should have a clean, reproducible conversion pipeline and a ready-to-test model on board.
2) Current Status (What’s Done)
- Training completed (baseline `clarity_v0`)
- Evaluation completed (val/test accuracy ~0.46–0.47; the confusion matrix indicates partial "collapse"/class bias toward the middle bins)
- ONNX export succeeded (confirmed opset = 12)
- RKNN conversion succeeded (generated `clarity_fp.rknn`)
- Next: board-side deployment & runtime verification (SSH is unstable on the campus network; I may temporarily fall back to UART)
3) Training Snapshot (Baseline Reality Check)
I’m using Milestone 2’s processed dataset + manifest to train a small model on Windows (PyTorch). The baseline accuracy is not perfect, and the confusion matrix shows the model tends to predict a few classes heavily.
This is not a failure — it’s useful signal:
- The dataset’s “clarity” distribution may be imbalanced.
- Labels generated from traditional metrics (e.g., Laplacian/Tenengrad) can be noisy, especially near decision boundaries.
- Resolution/compression in processed images can reduce perceptual clarity differences.
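Since the labels come from a traditional sharpness metric, it helps to see how little signal that metric actually carries near bin boundaries. Below is a minimal NumPy sketch of Laplacian-variance sharpness (a stand-in for the usual OpenCV `cv2.Laplacian(img, cv2.CV_64F).var()`; the function name is mine):

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 3x3 Laplacian response -- a classic sharpness proxy.

    Higher variance => more high-frequency content => "sharper" image.
    """
    # 3x3 Laplacian kernel [[0,1,0],[1,-4,1],[0,1,0]] applied via array slicing
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())
```

Thresholding this score into clarity bins is exactly where label noise creeps in: two frames straddling a threshold can look perceptually identical but land in different classes.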
Key takeaway: I can still proceed to deployment testing, because the real question is:
“Does the NPU-friendly model beat traditional metrics on the board scenario?”
4) Exporting ONNX (Windows)
Why opset=12 matters
During conversion, the RKNN toolchain rejected opset 18+ (the default in some recent exporters), so I forced opset=12.
Practical pain points I hit
- Missing Python deps (`onnx`, `onnxscript`, `pyyaml`, etc.) → `ModuleNotFoundError`
- `host/` import paths breaking when running scripts from different working directories
- Model builder signature mismatch (`num_classes` missing) after refactors
- Exported ONNX accidentally had a dynamic input shape, which RKNN refused
The outcome
I eventually exported:
`models/clarity_v0/clarity_op12.onnx` (opset=12, static input)
Checklist after export:
- Confirm opset: `python -c "import onnx; m=onnx.load('...onnx'); print([(o.domain, o.version) for o in m.opset_import])"`
5) Converting to RKNN (Ubuntu VM)
Conversion happens in an Ubuntu VM using rknn-toolkit 1.7.5.
Common dependency traps
I hit a few classic conflicts:
- TensorFlow missing → toolkit internal modules crash
- Torch missing → some optimize paths import torch internals
- `typing_extensions` mismatch (some torch builds require newer versions)
- "Unknown level: 'WARNING'" (a bad env var / logging-level parsing issue in that environment)
- RKNN rejected ONNX with dynamic input: `shape ['N', 3, 224, 224] not support`
Fix that mattered most: “Static input shape”
RKNN doesn’t like dynamic dims in the exported ONNX. The conversion script must load ONNX with explicit input shape / fixed dims.
After fixing the conversion script and ensuring ONNX opset=12 plus a static input, I got:
`models/clarity_v0/clarity_fp.rknn`
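For completeness, the conversion boils down to a short rknn-toolkit script. A sketch based on the 1.7.x Python API; the normalization values and config flags here are assumptions that must match your actual preprocessing, and this only runs inside the VM with the toolkit installed:

```python
from rknn.api import RKNN

rknn = RKNN()
# mean/std are placeholders -- set them to match training preprocessing
rknn.config(
    reorder_channel='0 1 2',
    mean_values=[[123.675, 116.28, 103.53]],
    std_values=[[58.395, 57.12, 57.375]],
    target_platform=['rv1126'],
)
assert rknn.load_onnx(model='models/clarity_v0/clarity_op12.onnx') == 0
# fp model: no quantization, so no calibration dataset is needed yet
assert rknn.build(do_quantization=False) == 0
assert rknn.export_rknn('models/clarity_v0/clarity_fp.rknn') == 0
rknn.release()
```

Quantization (`do_quantization=True` plus a calibration dataset) is deliberately deferred; validating the fp model end-to-end first keeps accuracy questions separate from deployment questions.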
6) Deliverables Produced in Milestone 3
- Trained checkpoint: `host/outputs/clarity_v0/best.pt`
- Exported ONNX (opset 12): `models/clarity_v0/clarity_op12.onnx`
- Converted RKNN: `models/clarity_v0/clarity_fp.rknn`
This means the “model artifact chain” is now complete.
7) Lessons Learned (Hard-Won)
- Keep ONNX opset conservative for embedded toolchains (≤ 12 here).
- Dynamic shape kills conversion: lock the input to a fixed `[1,3,H,W]`.
- Don't trust GUI tools. A CLI pipeline is more reproducible and debuggable.
- Split environments by responsibility:
- training env (PyTorch, metrics, augment)
- export env (onnx/onnxscript stable)
- convert env (rknn-toolkit + its pinned deps)
- Don’t block on “model is not perfect yet” — deploy early to validate end-to-end feasibility.
8) Next Step: On-board Validation (Milestone 4 Preview)
Milestone 4 will focus on:
- Upload the `.rknn` model plus a minimal inference binary/script to the RV1126
- Verify:
  - RKNN runtime loads the model
  - inference runs on the NPU
  - latency/fps
  - comparison with traditional metrics (same frames)
- If SSH is unstable (campus network), fall back to UART for reliable access.
