Comprehensive guide to optimizing YOLOv8 models for mobile and edge deployment.
The optimization script provides multiple techniques to reduce model size and improve inference speed for mobile devices:
- INT8 Quantization: Reduce model size by ~75% with minimal accuracy loss
- FP16 Precision: Reduce model size by ~50%
- TensorFlow Lite: Optimized for Android/iOS
- CoreML: Native iOS framework
- ONNX: Cross-platform deployment
- NCNN: Optimized for ARM processors
| Format | Size Reduction | Accuracy Loss | Best For |
|---|---|---|---|
| TFLite INT8 | ~75% | 1-3% | Android |
| CoreML INT8 | ~75% | 1-3% | iOS |
| ONNX INT8 | ~75% | 1-3% | Cross-platform |
| NCNN | ~70% | <1% | ARM devices |
| PyTorch FP16 | ~50% | <0.5% | Python/C++ |
- Original: ~6 MB (FP32)
- TFLite INT8: ~1.5 MB (75% reduction)
- ONNX INT8: ~1.6 MB (73% reduction)
- CoreML INT8: ~1.5 MB (75% reduction)
- PyTorch FP16: ~3 MB (50% reduction)
```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize all \
    --output ./mobile_models
```

This will generate optimized models for all platforms and print a comparison report.
```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize tflite \
    --output ./mobile_models
```

Output: `best_int8.tflite` (~1.5 MB)
Integration:
```kotlin
// Android Kotlin
import org.tensorflow.lite.Interpreter
import java.nio.MappedByteBuffer

val interpreter = Interpreter(loadModelFile("best_int8.tflite"))
```

```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize coreml \
    --output ./mobile_models
```

Output: `best_int8.mlpackage` (~1.5 MB)
Integration:
```swift
// iOS Swift
import CoreML
import Vision

guard let model = try? VNCoreMLModel(for: best_int8().model) else {
    return
}
let request = VNCoreMLRequest(model: model)
```

```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize onnx \
    --output ./mobile_models
```

Output: `best_int8.onnx` (~1.6 MB)
Integration:
```python
import onnxruntime as ort

session = ort.InferenceSession("best_int8.onnx")
outputs = session.run(None, {"images": input_tensor})
```

```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize ncnn \
    --output ./mobile_models
```

Output: `best_ncnn/` directory with `.param` and `.bin` files
Integration:
```cpp
// C++
#include "net.h"

ncnn::Net net;
net.load_param("best_ncnn/model.param");
net.load_model("best_ncnn/model.bin");
```

Converts 32-bit floating-point weights to 8-bit integers.
Benefits:
- 75% size reduction
- 2-4x faster inference on mobile CPUs
- Lower memory usage
Trade-offs:
- 1-3% accuracy loss (typically acceptable)
- Requires post-training quantization
How it works:
```
FP32 weight: 0.123456789 (4 bytes)
        ↓ Quantization
INT8 weight: 123 (1 byte)
Scale factor: 0.001
```
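The mapping above can be sketched in plain Python. This is a simplified symmetric-quantization sketch; real exporters derive the scale factor per tensor or per channel from calibration data:

```python
def quantize(weight_fp32, scale):
    """Map a float weight to an 8-bit integer using a scale factor."""
    q = round(weight_fp32 / scale)
    return max(-128, min(127, q))  # Clamp to the signed INT8 range

def dequantize(weight_int8, scale):
    """Recover an approximate float from the stored integer."""
    return weight_int8 * scale

scale = 0.001
q = quantize(0.123456789, scale)   # -> 123, stored in 1 byte instead of 4
approx = dequantize(q, scale)      # -> 0.123, quantization error ~0.0005
```

The accuracy loss comes from exactly this rounding: every weight is snapped to the nearest multiple of the scale factor.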
Uses 16-bit floating point instead of 32-bit.
Benefits:
- 50% size reduction
- 1.5-2x faster on GPUs with FP16 support
- Minimal accuracy loss (<0.5%)
Best for:
- Devices with GPU acceleration
- When accuracy is critical
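The size halving is easy to see with Python's standard-library `struct` module, which can pack IEEE half-precision (`'e'`) as well as single-precision (`'f'`) floats — a framework-independent sketch:

```python
import struct

value = 0.123456789
fp32 = struct.pack('f', value)   # 4 bytes
fp16 = struct.pack('e', value)   # 2 bytes: half the storage

# Round-tripping through FP16 shows the (small) precision cost
recovered = struct.unpack('e', fp16)[0]
print(len(fp32), len(fp16))      # 4 2
print(abs(recovered - value))    # rounding error on the order of 1e-5
```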
Removes dynamic shapes for faster inference.
Benefits:
- Faster model loading
- Better mobile CPU optimization
- Reduced memory overhead
```groovy
dependencies {
    implementation 'org.tensorflow:tensorflow-lite:2.14.0'
    implementation 'org.tensorflow:tensorflow-lite-gpu:2.14.0'
    implementation 'org.tensorflow:tensorflow-lite-support:0.4.4'
}
```

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.support.image.TensorImage
import java.nio.ByteBuffer

class UIElementDetector(context: Context) {
    private val interpreter: Interpreter

    init {
        val model = loadModelFile(context, "best_int8.tflite")
        interpreter = Interpreter(model, Interpreter.Options().apply {
            setNumThreads(4) // Use 4 CPU threads
        })
    }

    fun detect(bitmap: Bitmap): List<Detection> {
        val tensorImage = TensorImage.fromBitmap(bitmap)
        val output = Array(1) { Array(25200) { FloatArray(25) } }
        interpreter.run(tensorImage.buffer, output)
        return parseDetections(output)
    }
}
```

```kotlin
import org.tensorflow.lite.gpu.GpuDelegate

val options = Interpreter.Options().apply {
    addDelegate(GpuDelegate())
}
val interpreter = Interpreter(model, options)
```

- Drag `best_int8.mlpackage` into your Xcode project
- Xcode automatically generates a Swift interface
```swift
import CoreML
import Vision
import UIKit

class UIElementDetector {
    private var model: VNCoreMLModel?

    init() {
        guard let mlModel = try? best_int8(configuration: MLModelConfiguration()) else {
            fatalError("Failed to load model")
        }
        model = try? VNCoreMLModel(for: mlModel.model)
    }

    func detect(image: UIImage, completion: @escaping ([Detection]) -> Void) {
        guard let model = model else { return }
        let request = VNCoreMLRequest(model: model) { request, error in
            guard let results = request.results as? [VNRecognizedObjectObservation] else {
                return
            }
            let detections = self.parseResults(results)
            completion(detections)
        }
        let handler = VNImageRequestHandler(cgImage: image.cgImage!)
        try? handler.perform([request])
    }
}
```

CoreML automatically uses the Metal GPU when available.
```javascript
import { InferenceSession, Tensor } from 'onnxruntime-react-native';

const session = await InferenceSession.create('./best_int8.onnx');

const detect = async (imageData) => {
  const feeds = { images: new Tensor('float32', imageData, [1, 3, 640, 640]) };
  const results = await session.run(feeds);
  return parseDetections(results);
};
```

```bash
# Install dependencies
pip install onnxruntime opencv-python

# Run benchmark
python models/yolov8/benchmark_mobile.py \
    --original runs/train/best.pt \
    --optimized mobile_models/best_int8.onnx \
    --test_images ./test_images/ \
    --runs 100
```

| Device | Format | Inference Time | FPS |
|---|---|---|---|
| iPhone 13 Pro | CoreML INT8 | 15ms | 66 |
| Pixel 6 | TFLite INT8 | 18ms | 55 |
| Raspberry Pi 4 | NCNN | 45ms | 22 |
| Desktop CPU | ONNX INT8 | 8ms | 125 |
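The FPS column above is just the reciprocal of the per-frame latency, truncated to whole frames — a quick sanity check:

```python
def fps(latency_ms):
    """Frames per second from a per-frame latency in milliseconds."""
    return int(1000 / latency_ms)

print(fps(15))   # iPhone 13 Pro, CoreML INT8 -> 66
print(fps(18))   # Pixel 6, TFLite INT8      -> 55
print(fps(45))   # Raspberry Pi 4, NCNN      -> 22
print(fps(8))    # Desktop CPU, ONNX INT8    -> 125
```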
For better accuracy, provide calibration data:
```python
from ultralytics import YOLO

model = YOLO('runs/train/best.pt')

# Export with calibration data
model.export(
    format='tflite',
    int8=True,
    data='./data/dataset.yaml',  # Use your dataset for calibration
    imgsz=640
)
```

If you always use a specific input size, optimize for it:
```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize onnx \
    --input_size 320  # Smaller = faster
```

Remove redundant weights for even smaller models:
```python
from ultralytics import YOLO

# Train with pruning
model = YOLO('yolov8n.pt')
model.train(
    data='./data/dataset.yaml',
    epochs=100,
    prune=0.3  # Remove 30% of weights
)
```

Generate a visual comparison of all formats:
```bash
python models/yolov8/optimize_for_mobile.py \
    --model runs/train/best.pt \
    --optimize all
```

Output:
```
================================================================================
OPTIMIZATION RESULTS SUMMARY
================================================================================
Format          Size (MB)    Reduction    Best For
────────────────────────────────────────────────────────────────────────────
Original        6.00         -            Reference
TFLite INT8     1.50         ↓75.0%       Android, iOS
ONNX INT8       1.60         ↓73.3%       Cross-platform
CoreML INT8     1.55         ↓74.2%       iOS, macOS
NCNN            1.80         ↓70.0%       Android, iOS (ARM)
PyTorch FP16    3.00         ↓50.0%       Python, C++
────────────────────────────────────────────────────────────────────────────
Smallest model: TFLite INT8 (1.50 MB)
Size reduction: 75.0%
```
```
Need to deploy on mobile?
├── Yes
│   ├── Android only? → TFLite INT8
│   ├── iOS only? → CoreML INT8
│   ├── Both? → TFLite INT8 + CoreML INT8
│   └── React Native? → ONNX INT8
│
└── No
    ├── Edge device (ARM)? → NCNN
    ├── Python inference? → PyTorch FP16
    └── Cross-platform? → ONNX INT8
```
| Use Case | Recommended Format | Reason |
|---|---|---|
| Android App | TFLite INT8 | Native framework, smallest size |
| iOS App | CoreML INT8 | Native framework, GPU acceleration |
| Cross-platform App | ONNX INT8 | Works everywhere, good size |
| Raspberry Pi | NCNN or ONNX | ARM optimized |
| Web Browser | ONNX (ONNX.js) | Browser compatible |
| Server Inference | PyTorch FP16 | Good balance |
Problem: CoreML export fails with BlobWriter error.
Solution:
```bash
# Option 1: Update coremltools
pip install --upgrade coremltools

# Option 2: Use TFLite instead (works on iOS too!)
python models/yolov8/optimize_for_mobile.py --model runs/train/best.pt --optimize tflite

# Option 3: Use ONNX (works on iOS with ONNX Runtime)
python models/yolov8/optimize_for_mobile.py --model runs/train/best.pt --optimize onnx
```

Why this happens: CoreML export in YOLOv8 sometimes has compatibility issues with certain coremltools versions.

Good news: TFLite and ONNX also work on iOS and give the same size reduction.
Problem: ONNX quantization fails with unexpected keyword error.
Solution: This is fixed in the script; the `optimize_model` parameter has been removed. Update your script if you see this error.
- Use a smaller base model: YOLOv8n instead of YOLOv8m
- Apply pruning: Remove 30-50% of weights during training
- Reduce input size: Use 320x320 instead of 640x640
- Distillation: Train a smaller model to mimic larger one
- Use FP16 instead of INT8: Better accuracy, still 50% smaller
- Provide calibration data: Better quantization
- Post-training fine-tuning: Fine-tune after quantization
- QAT (Quantization-Aware Training): Train with quantization in mind
- Enable GPU acceleration: Use GPU delegate (Android) or Metal (iOS)
- Reduce input size: Smaller images = faster inference
- Use multi-threading: set the `num_threads` parameter
- Optimize model architecture: use YOLOv8n instead of YOLOv8m
- `examples/android_app/` - Android TFLite integration
- `examples/ios_app/` - iOS CoreML integration
- `examples/react_native/` - React Native ONNX integration
- `models/yolov8/benchmark_mobile.py` - Performance testing
- `models/yolov8/accuracy_test.py` - Accuracy comparison
Add to your requirements.txt:
```text
# Model optimization dependencies
onnx>=1.14.0
onnxruntime>=1.16.0
onnxruntime-tools>=1.7.0
coremltools>=7.0      # For CoreML export (macOS only)
tensorflow>=2.14.0    # For TFLite export
```

Install:
```bash
pip install onnx onnxruntime onnxruntime-tools tensorflow

# macOS only for CoreML:
pip install coremltools
```

After optimization, validate your model:
```bash
# Test on sample images
python models/yolov8/validate_optimized.py \
    --original runs/train/best.pt \
    --optimized mobile_models/best_int8.onnx \
    --test_dir ./test_images/ \
    --compare_accuracy
```

This ensures your optimized model maintains acceptable accuracy.
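Accuracy comparison for detectors typically matches boxes from the original and optimized models by intersection-over-union (IoU). A minimal pure-Python sketch of that metric (an illustrative helper, not part of the script):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# An INT8 detection overlapping the FP32 detection with IoU >= 0.5
# would typically count as a match.
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes  -> 0.0
```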
Ready to deploy? Choose your platform and follow the integration guide above!