Comprehensive guide to optimizing YOLOv8 models for mobile and edge deployment.
The optimization script provides multiple techniques to reduce model size and improve inference speed for mobile devices:
- INT8 Quantization: Reduce model size by ~75% with minimal accuracy loss
- FP16 Precision: Reduce model size by ~50%
- TensorFlow Lite: Optimized for Android/iOS
- CoreML: Native iOS framework
- ONNX: Cross-platform deployment
- NCNN: Optimized for ARM processors
| Format | Size Reduction | Accuracy Loss | Best For |
|---|---|---|---|
| TFLite INT8 | ~75% | 1-3% | Android |
| CoreML INT8 | ~75% | 1-3% | iOS |
| ONNX INT8 | ~75% | 1-3% | Cross-platform |
| NCNN | ~70% | <1% | ARM devices |
| PyTorch FP16 | ~50% | <0.5% | Python/C++ |
- Original: ~6 MB (FP32)
- TFLite INT8: ~1.5 MB (75% reduction)
- ONNX INT8: ~1.6 MB (73% reduction)
- CoreML INT8: ~1.5 MB (75% reduction)
- PyTorch FP16: ~3 MB (50% reduction)
```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize all \
    --output ./mobile_models
```

This will generate optimized models for all platforms and print a comparison report.
```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize tflite \
    --output ./mobile_models
```

Output: `best_int8.tflite` (~1.5 MB)
Integration:
```kotlin
// Android Kotlin
import org.tensorflow.lite.Interpreter
import java.nio.MappedByteBuffer

val interpreter = Interpreter(loadModelFile("best_int8.tflite"))
```

```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize coreml \
    --output ./mobile_models
```

Output: `best_int8.mlpackage` (~1.5 MB)
Integration:
```swift
// iOS Swift
import CoreML
import Vision

guard let model = try? VNCoreMLModel(for: best_int8().model) else {
    return
}
let request = VNCoreMLRequest(model: model)
```

```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize onnx \
    --output ./mobile_models
```

Output: `best_int8.onnx` (~1.6 MB)
Integration:
```python
import onnxruntime as ort

session = ort.InferenceSession("best_int8.onnx")
outputs = session.run(None, {"images": input_tensor})
```

```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize ncnn \
    --output ./mobile_models
```

Output: `best_ncnn/` directory with `.param` and `.bin` files
Integration:
```cpp
// C++
#include "net.h"

ncnn::Net net;
net.load_param("best_ncnn/model.param");
net.load_model("best_ncnn/model.bin");
```

Converts 32-bit floating-point weights to 8-bit integers.
Benefits:
- 75% size reduction
- 2-4x faster inference on mobile CPUs
- Lower memory usage
Trade-offs:
- 1-3% accuracy loss (typically acceptable)
- Requires post-training quantization
How it works:
```
FP32 weight: 0.123456789 (4 bytes)
        ↓ Quantization
INT8 weight: 123 (1 byte)
Scale factor: 0.001
```
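The mapping above can be sketched in plain Python. This is a simplified symmetric-quantization sketch; real exporters derive the scale factor per tensor or per channel from calibration data:

```python
def quantize(weight_fp32, scale):
    """Map a float weight to an 8-bit integer using a scale factor."""
    q = round(weight_fp32 / scale)
    return max(-128, min(127, q))  # Clamp to the signed INT8 range

def dequantize(weight_int8, scale):
    """Recover an approximate float from the stored integer."""
    return weight_int8 * scale

scale = 0.001
q = quantize(0.123456789, scale)   # -> 123, stored in 1 byte instead of 4
approx = dequantize(q, scale)      # -> 0.123, quantization error ~0.0005
```

The accuracy loss comes from exactly this rounding: every weight is snapped to the nearest multiple of the scale factor.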
Uses 16-bit floating point instead of 32-bit.
Benefits:
- 50% size reduction
- 1.5-2x faster on GPUs with FP16 support
- Minimal accuracy loss (<0.5%)
Best for:
- Devices with GPU acceleration
- When accuracy is critical
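The size halving is easy to see with Python's standard-library `struct` module, which can pack IEEE half-precision (`'e'`) as well as single-precision (`'f'`) floats — a framework-independent sketch:

```python
import struct

value = 0.123456789
fp32 = struct.pack('f', value)   # 4 bytes
fp16 = struct.pack('e', value)   # 2 bytes: half the storage

# Round-tripping through FP16 shows the (small) precision cost
recovered = struct.unpack('e', fp16)[0]
print(len(fp32), len(fp16))      # 4 2
print(abs(recovered - value))    # rounding error on the order of 1e-5
```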
Removes dynamic shapes for faster inference.
Benefits:
- Faster model loading
- Better mobile CPU optimization
- Reduced memory overhead
```groovy
dependencies {
    implementation 'org.tensorflow:tensorflow-lite:2.14.0'
    implementation 'org.tensorflow:tensorflow-lite-gpu:2.14.0'
    implementation 'org.tensorflow:tensorflow-lite-support:0.4.4'
}
```

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.support.image.TensorImage
import java.nio.ByteBuffer

class UIElementDetector(context: Context) {
    private val interpreter: Interpreter

    init {
        val model = loadModelFile(context, "best_int8.tflite")
        interpreter = Interpreter(model, Interpreter.Options().apply {
            setNumThreads(4) // Use 4 CPU threads
        })
    }

    fun detect(bitmap: Bitmap): List<Detection> {
        val tensorImage = TensorImage.fromBitmap(bitmap)
        val output = Array(1) { Array(25200) { FloatArray(25) } }
        interpreter.run(tensorImage.buffer, output)
        return parseDetections(output)
    }
}
```

```kotlin
import org.tensorflow.lite.gpu.GpuDelegate

val options = Interpreter.Options().apply {
    addDelegate(GpuDelegate())
}
val interpreter = Interpreter(model, options)
```

- Drag `best_int8.mlpackage` into your Xcode project
- Xcode automatically generates a Swift interface
```swift
import CoreML
import Vision
import UIKit

class UIElementDetector {
    private var model: VNCoreMLModel?

    init() {
        guard let mlModel = try? best_int8(configuration: MLModelConfiguration()) else {
            fatalError("Failed to load model")
        }
        model = try? VNCoreMLModel(for: mlModel.model)
    }

    func detect(image: UIImage, completion: @escaping ([Detection]) -> Void) {
        guard let model = model else { return }
        let request = VNCoreMLRequest(model: model) { request, error in
            guard let results = request.results as? [VNRecognizedObjectObservation] else {
                return
            }
            let detections = self.parseResults(results)
            completion(detections)
        }
        let handler = VNImageRequestHandler(cgImage: image.cgImage!)
        try? handler.perform([request])
    }
}
```

CoreML automatically uses the Metal GPU when available.
```javascript
import { InferenceSession, Tensor } from 'onnxruntime-react-native';

const session = await InferenceSession.create('./best_int8.onnx');

const detect = async (imageData) => {
  const feeds = { images: new Tensor('float32', imageData, [1, 3, 640, 640]) };
  const results = await session.run(feeds);
  return parseDetections(results);
};
```

```bash
# Install dependencies
pip install onnxruntime opencv-python

# Run benchmark
python models/yolov8/benchmark_mobile.py \
    --original runs/train/best.pt \
    --optimized mobile_models/best_int8.onnx \
    --test_images ./test_images/ \
    --runs 100
```

| Device | Format | Inference Time | FPS |
|---|---|---|---|
| iPhone 13 Pro | CoreML INT8 | 15ms | 66 |
| Pixel 6 | TFLite INT8 | 18ms | 55 |
| Raspberry Pi 4 | NCNN | 45ms | 22 |
| Desktop CPU | ONNX INT8 | 8ms | 125 |
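The FPS column above is just the reciprocal of the per-frame latency, truncated to whole frames — a quick sanity check:

```python
def fps(latency_ms):
    """Frames per second from a per-frame latency in milliseconds."""
    return int(1000 / latency_ms)

print(fps(15))   # iPhone 13 Pro, CoreML INT8 -> 66
print(fps(18))   # Pixel 6, TFLite INT8      -> 55
print(fps(45))   # Raspberry Pi 4, NCNN      -> 22
print(fps(8))    # Desktop CPU, ONNX INT8    -> 125
```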
For better accuracy, provide calibration data:
```python
from ultralytics import YOLO

model = YOLO('runs/train/best.pt')

# Export with calibration data
model.export(
    format='tflite',
    int8=True,
    data='./data/dataset.yaml',  # Use your dataset for calibration
    imgsz=640
)
```

If you always use a specific input size, optimize for it:
```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize onnx \
    --input_size 320  # Smaller = faster
```

Remove redundant weights for even smaller models:
```python
from ultralytics import YOLO

# Train with pruning
model = YOLO('yolov8n.pt')
model.train(
    data='./data/dataset.yaml',
    epochs=100,
    prune=0.3  # Remove 30% of weights
)
```

Generate a visual comparison of all formats:
```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize all
```

Output:
```
================================================================================
OPTIMIZATION RESULTS SUMMARY
================================================================================
Format          Size (MB)    Reduction    Best For
────────────────────────────────────────────────────────────────────────────
Original        6.00         -            Reference
TFLite INT8     1.50         ↓75.0%       Android, iOS
ONNX INT8       1.60         ↓73.3%       Cross-platform
CoreML INT8     1.55         ↓74.2%       iOS, macOS
NCNN            1.80         ↓70.0%       Android, iOS (ARM)
PyTorch FP16    3.00         ↓50.0%       Python, C++
────────────────────────────────────────────────────────────────────────────
Smallest model: TFLite INT8 (1.50 MB)
Size reduction: 75.0%
```
```
Need to deploy on mobile?
├── Yes
│   ├── Android only? → TFLite INT8
│   ├── iOS only? → CoreML INT8
│   ├── Both? → TFLite INT8 + CoreML INT8
│   └── React Native? → ONNX INT8
│
└── No
    ├── Edge device (ARM)? → NCNN
    ├── Python inference? → PyTorch FP16
    └── Cross-platform? → ONNX INT8
```
| Use Case | Recommended Format | Reason |
|---|---|---|
| Android App | TFLite INT8 | Native framework, smallest size |
| iOS App | CoreML INT8 | Native framework, GPU acceleration |
| Cross-platform App | ONNX INT8 | Works everywhere, good size |
| Raspberry Pi | NCNN or ONNX | ARM optimized |
| Web Browser | ONNX (ONNX.js) | Browser compatible |
| Server Inference | PyTorch FP16 | Good balance |
Problem: CoreML export fails with BlobWriter error.
Solution:
```bash
# Option 1: Update coremltools
pip install --upgrade coremltools

# Option 2: Use TFLite instead (works on iOS too!)
python models/yolov8/optimize_for_mobile.py --model runs/train/best.pt --optimize tflite

# Option 3: Use ONNX (works on iOS with ONNX Runtime)
python models/yolov8/optimize_for_mobile.py --model runs/train/best.pt --optimize onnx
```

Why this happens: CoreML export in YOLOv8 sometimes has compatibility issues with certain coremltools versions.

Good news: TFLite and ONNX also work on iOS and give the same size reduction.
Problem: ONNX quantization fails with unexpected keyword error.
Solution: This is fixed in the script; the `optimize_model` parameter has been removed. Update your script if you see this error.
- Use a smaller base model: YOLOv8n instead of YOLOv8m
- Apply pruning: Remove 30-50% of weights during training
- Reduce input size: Use 320x320 instead of 640x640
- Distillation: Train a smaller model to mimic larger one
- Use FP16 instead of INT8: Better accuracy, still 50% smaller
- Provide calibration data: Better quantization
- Post-training fine-tuning: Fine-tune after quantization
- QAT (Quantization-Aware Training): Train with quantization in mind
- Enable GPU acceleration: Use GPU delegate (Android) or Metal (iOS)
- Reduce input size: Smaller images = faster inference
- Use multi-threading: set the `num_threads` parameter
- Optimize model architecture: use YOLOv8n instead of YOLOv8m
- `examples/android_app/` - Android TFLite integration
- `examples/ios_app/` - iOS CoreML integration
- `examples/react_native/` - React Native ONNX integration
- `models/yolov8/benchmark_mobile.py` - Performance testing
- `models/yolov8/accuracy_test.py` - Accuracy comparison
Add to your requirements.txt:
```text
# Model optimization dependencies
onnx>=1.14.0
onnxruntime>=1.16.0
onnxruntime-tools>=1.7.0
coremltools>=7.0      # For CoreML export (macOS only)
tensorflow>=2.14.0    # For TFLite export
```

Install:
```bash
pip install onnx onnxruntime onnxruntime-tools tensorflow

# macOS only for CoreML:
pip install coremltools
```

After optimization, validate your model:
```bash
# Test on sample images
python models/yolov8/validate_optimized.py \
    --original runs/train/best.pt \
    --optimized mobile_models/best_int8.onnx \
    --test_dir ./test_images/ \
    --compare_accuracy
```

This ensures your optimized model maintains acceptable accuracy.
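Accuracy comparison for detectors typically matches boxes from the original and optimized models by intersection-over-union (IoU). A minimal pure-Python sketch of that metric (an illustrative helper, not part of the script):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# An INT8 detection overlapping the FP32 detection with IoU >= 0.5
# would typically count as a match.
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes  -> 0.0
```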
Ready to deploy? Choose your platform and follow the integration guide above!