Milestone: 3 — Train → Export ONNX → Convert RKNN → Ready for On-board Validation
1) What Milestone 3 Is About
Milestone 3’s goal is straightforward:
- Train a lightweight clarity classifier/regressor (PyTorch) on a real dataset pipeline (Milestone 2 output).
- Export the best checkpoint to ONNX (Rockchip toolchain compatibility requires opset ≤ 12 in my case).
- Convert ONNX to RKNN using rknn-toolkit inside an Ubuntu VM.
- Produce a deployable `.rknn` model for RV1126.
At the end of this milestone, I should have a clean, reproducible conversion pipeline and a ready-to-test model on board.
2) Current Status (What’s Done)
- Training completed (baseline `clarity_v0`)
- Evaluation completed (val/test accuracy ~0.46–0.47; the confusion matrix indicates partial "collapse"/class bias toward the middle bins)
- ONNX export succeeded (confirmed opset = 12)
- RKNN conversion succeeded (generated `clarity_fp.rknn`)
- Next: board-side deployment & runtime verification (SSH is unstable on the campus network; I may temporarily fall back to UART)
3) Training Snapshot (Baseline Reality Check)
I’m using Milestone 2’s processed dataset + manifest to train a small model on Windows (PyTorch). The baseline accuracy is not perfect, and the confusion matrix shows the model tends to predict a few classes heavily.
This is not a failure — it’s useful signal:
- The dataset’s “clarity” distribution may be imbalanced.
- Labels generated from traditional metrics (e.g., Laplacian/Tenengrad) can be noisy, especially near decision boundaries.
- Resolution/compression in processed images can reduce perceptual clarity differences.
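Since the labels come from a traditional sharpness metric, it helps to see how little signal that metric actually carries near bin boundaries. Below is a minimal NumPy sketch of Laplacian-variance sharpness (a stand-in for the usual OpenCV `cv2.Laplacian(img, cv2.CV_64F).var()`; the function name is mine):

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 3x3 Laplacian response -- a classic sharpness proxy.

    Higher variance => more high-frequency content => "sharper" image.
    """
    # 3x3 Laplacian kernel [[0,1,0],[1,-4,1],[0,1,0]] applied via array slicing
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())
```

Thresholding this score into clarity bins is exactly where label noise creeps in: two frames straddling a threshold can look perceptually identical but land in different classes.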
Key takeaway: I can still proceed to deployment testing, because the real question is:
“Does the NPU-friendly model beat traditional metrics on the board scenario?”
4) Exporting ONNX (Windows)
Why opset=12 matters
During conversion, the RKNN toolchain rejected opset 18+ (the default in some recent exporters), so I forced opset=12.
Practical pain points I hit
- Missing Python deps (`onnx`, `onnxscript`, `pyyaml`, etc.) → `ModuleNotFoundError`
- `host/` import paths breaking when running scripts from different working directories
- Model builder signature mismatch (`num_classes` missing) after refactors
- Exported ONNX accidentally had a dynamic input shape, which RKNN refused
The outcome
I eventually exported:
`models/clarity_v0/clarity_op12.onnx` (opset=12, static input)
Checklist after export:
- Confirm opset: `python -c "import onnx; m=onnx.load('...onnx'); print([(o.domain, o.version) for o in m.opset_import])"`
5) Converting to RKNN (Ubuntu VM)
Conversion happens in an Ubuntu VM using rknn-toolkit 1.7.5.
Common dependency traps
I hit a few classic conflicts:
- TensorFlow missing → toolkit internal modules crash
- Torch missing → some optimize paths import torch internals
- `typing_extensions` mismatch (some torch builds require newer versions)
- "Unknown level: 'WARNING'" (a bad env var / logging-level parsing issue in that environment)
- RKNN rejected ONNX with dynamic input: `shape ['N', 3, 224, 224] not support`
Fix that mattered most: “Static input shape”
RKNN doesn’t like dynamic dims in the exported ONNX. The conversion script must load ONNX with explicit input shape / fixed dims.
After fixing the conversion script and ensuring ONNX opset=12 plus a static input, I got:
`models/clarity_v0/clarity_fp.rknn`
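For completeness, the conversion boils down to a short rknn-toolkit script. A sketch based on the 1.7.x Python API; the normalization values and config flags here are assumptions that must match your actual preprocessing, and this only runs inside the VM with the toolkit installed:

```python
from rknn.api import RKNN

rknn = RKNN()
# mean/std are placeholders -- set them to match training preprocessing
rknn.config(
    reorder_channel='0 1 2',
    mean_values=[[123.675, 116.28, 103.53]],
    std_values=[[58.395, 57.12, 57.375]],
    target_platform=['rv1126'],
)
assert rknn.load_onnx(model='models/clarity_v0/clarity_op12.onnx') == 0
# fp model: no quantization, so no calibration dataset is needed yet
assert rknn.build(do_quantization=False) == 0
assert rknn.export_rknn('models/clarity_v0/clarity_fp.rknn') == 0
rknn.release()
```

Quantization (`do_quantization=True` plus a calibration dataset) is deliberately deferred; validating the fp model end-to-end first keeps accuracy questions separate from deployment questions.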
6) Deliverables Produced in Milestone 3
- Trained checkpoint: `host/outputs/clarity_v0/best.pt`
- Exported ONNX (opset 12): `models/clarity_v0/clarity_op12.onnx`
- Converted RKNN: `models/clarity_v0/clarity_fp.rknn`
This means the “model artifact chain” is now complete.
7) Lessons Learned (Hard-Won)
- Keep ONNX opset conservative for embedded toolchains (≤ 12 here).
- Dynamic shape kills conversion: lock the input to a fixed `[1,3,H,W]`.
- Don't trust GUI tools. A CLI pipeline is more reproducible and debuggable.
- Split environments by responsibility:
- training env (PyTorch, metrics, augment)
- export env (onnx/onnxscript stable)
- convert env (rknn-toolkit + its pinned deps)
- Don’t block on “model is not perfect yet” — deploy early to validate end-to-end feasibility.
8) Next Step: On-board Validation (Milestone 4 Preview)
Milestone 4 will focus on:
- Upload the `.rknn` model plus a minimal inference binary/script to the RV1126
- Verify:
  - RKNN runtime loads the model
  - inference runs on the NPU
  - latency/fps
  - comparison with traditional metrics (same frames)
- If SSH is unstable (campus network), fall back to UART for reliable access.
