A modular video detection toolkit that produces a stable det-v1 JSON output schema, with a pluggable backend (currently Ultralytics) and optional model export.
- Backend: Ultralytics (YOLO families, RT-DETR, YOLO-World/YOLOE, SAM/FastSAM, depending on your installed `ultralytics` version)
- Default behavior: no files are written unless you opt in (JSON / frames / annotated video)
Every run returns a det-v1 payload in memory (and the CLI prints it to stdout).
Top-level keys:
- `schema_version`: always `"det-v1"`
- `video`: `{path, fps, frame_count, width, height}`
- `detector`: configuration used for the run (name/weights/conf/imgsz/device/half, plus task and optional prompts/topk)
- `frames`: list of per-frame records
Per-frame record:
- `frame`: 0-based frame index
- `file`: standard frame filename (e.g. `000000.jpg`), assigned even if frames aren't saved
- `detections`: list of detections
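The frame filename pattern can be reproduced outside the library. A minimal sketch, assuming six-digit zero-padding (inferred from the `000000.jpg` example, not stated explicitly in the docs):

```python
def frame_filename(frame_index: int) -> str:
    # Six-digit zero-padded index matching the "000000.jpg" example;
    # the exact padding width is an assumption inferred from the docs.
    return f"{frame_index:06d}.jpg"

print(frame_filename(0))   # 000000.jpg
print(frame_filename(42))  # 000042.jpg
```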
Detection fields:
- boxes: `bbox = [x1, y1, x2, y2]`
- pose: `keypoints = [[x, y, score], ...]`
- segmentation: `segments = [[[x, y], ...], ...]` (polygons)
- oriented boxes (best-effort): `obb = [cx, cy, w, h, angle_degrees]`, plus an axis-aligned `bbox`
Example payload:

```json
{
  "schema_version": "det-v1",
  "video": {"path": "in.mp4", "fps": 30.0, "frame_count": 120, "width": 1920, "height": 1080},
  "detector": {"name": "ultralytics", "weights": "yolo26n", "conf_thresh": 0.25, "imgsz": 640, "device": "cpu", "half": false, "task": "detect"},
  "frames": [
    {
      "frame": 0,
      "file": "000000.jpg",
      "detections": [
        {"det_ind": 0, "bbox": [100.0, 50.0, 320.0, 240.0], "score": 0.91, "class_id": 0, "class_name": "person"}
      ]
    }
  ]
}
```

Requires Python 3.11+.
```shell
pip install detect-lib
```

Optional extras (only if you need them):
```shell
pip install "detect-lib[export]"    # ONNX / export helpers
pip install "detect-lib[coreml]"    # CoreML export (macOS)
pip install "detect-lib[openvino]"  # OpenVINO export
pip install "detect-lib[tf]"        # TensorFlow export paths (heavy)
```

Or from source with uv:

```shell
git clone https://github.com/Surya-Rayala/VisionPipeline-detection.git
cd VisionPipeline-detection
uv sync
```

Extras:
```shell
uv sync --extra export
uv sync --extra coreml
uv sync --extra openvino
uv sync --extra tf
```

All CLI commands run as `python -m ...` (pip install) or `uv run python -m ...` (uv).

Help:

```shell
python -m detect.cli.detect_video -h
```

List models (registry + installed):

```shell
python -m detect.cli.detect_video --list-models
```

1) Bounding boxes (typical YOLO / RT-DETR)
```shell
python -m detect.cli.detect_video \
  --video in.mp4 \
  --detector ultralytics \
  --weights yolo26n \
  --task detect \
  --json \
  --save-video annotated.mp4 \
  --out-dir out --run-name yolo26n_detect
```

2) Instance segmentation (polygons)
```shell
python -m detect.cli.detect_video \
  --video in.mp4 \
  --detector ultralytics \
  --weights yolo26n-seg \
  --task segment \
  --json \
  --save-video annotated.mp4 \
  --out-dir out --run-name yolo26n_seg
```

3) Pose (keypoints)
```shell
python -m detect.cli.detect_video \
  --video in.mp4 \
  --detector ultralytics \
  --weights yolo26n-pose \
  --task pose \
  --json \
  --save-video annotated.mp4 \
  --out-dir out --run-name yolo26n_pose
```

4) Open-vocabulary (YOLO-World / YOLOE)
```shell
python -m detect.cli.detect_video \
  --video in.mp4 \
  --detector ultralytics \
  --weights yolov8s-worldv2 \
  --task openvocab \
  --text "person,car,dog" \
  --json \
  --save-video annotated.mp4 \
  --out-dir out --run-name worldv2_openvocab
```

Open-vocabulary + polygons (YOLOE `-seg`)

Use a YOLOE segmentation weight and `--task segment` when you want polygons.
```shell
python -m detect.cli.detect_video \
  --video in.mp4 \
  --detector ultralytics \
  --weights yoloe-11s-seg \
  --task segment \
  --text "person,car,dog" \
  --json \
  --save-video annotated.mp4 \
  --out-dir out --run-name yoloe_seg_openvocab
```

The task names `detect | segment | pose | obb | classify | sam | sam2 | sam3 | fastsam` describe the output type you want. `openvocab` is a prompt mode for YOLO-World/YOLOE; the output type follows the model (boxes vs masks). If you want polygons, use a `*-seg` model and `--task segment`.
You can supply prompts via:
- `--text "a,b,c"` (open-vocabulary label list)
- `--box "x1,y1,x2,y2"` (repeatable)
- `--point "x,y"` or `--point "x,y,label"` (repeatable; label 1=foreground, 0=background)
- `--prompts prompts.json` (combined)
Example `prompts.json`:

```json
{
  "text": ["person", "car", "dog"],
  "boxes": [[100, 100, 500, 500]],
  "points": [[320, 240, 1], [100, 120, 0]],
  "topk": 5
}
```

Export note (open-vocab): exported formats (ONNX/CoreML/etc.) may not support changing the vocabulary at runtime. If prompts don't take effect, run the `.pt` weights for true open-vocabulary prompting, or post-filter detections.
- `--json` writes `out/<run-name>/detections.json`
- `--frames` writes `out/<run-name>/frames/*.jpg`
- `--save-video NAME.mp4` writes `out/<run-name>/NAME.mp4`
If you don’t enable any artifacts, no output directory is created.
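Once `detections.json` exists, consuming it is plain JSON work. A minimal sketch that assumes only the det-v1 fields documented above (`frames`, `detections`, `class_name`) and the `out/<run-name>/detections.json` layout; the helper names are illustrative, not part of detect-lib:

```python
import json
from collections import Counter
from pathlib import Path

def class_counts(payload: dict) -> Counter:
    # Tally detections per class_name across all frames of a det-v1 payload.
    counts: Counter = Counter()
    for frame in payload["frames"]:
        for det in frame["detections"]:
            counts[det["class_name"]] += 1
    return counts

def load_run(run_dir: str) -> dict:
    # Path layout follows the artifact docs: out/<run-name>/detections.json
    return json.loads((Path(run_dir) / "detections.json").read_text())

# Demo with the example payload shown earlier in this README:
payload = {
    "schema_version": "det-v1",
    "frames": [
        {"frame": 0, "file": "000000.jpg", "detections": [
            {"det_ind": 0, "bbox": [100.0, 50.0, 320.0, 240.0],
             "score": 0.91, "class_id": 0, "class_name": "person"},
        ]},
    ],
}
print(class_counts(payload))  # Counter({'person': 1})
```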
Python uses snake_case keyword arguments. The CLI uses kebab-case flags. The values are the same, but the names differ.
Common mapping:
- CLI `--video` → Python `video`
- CLI `--detector` → Python `detector`
- CLI `--weights` → Python `weights`
- CLI `--classes "0,2"` → Python `classes=[0, 2]`
- CLI `--conf-thresh` → Python `conf_thresh`
- CLI `--imgsz` → Python `imgsz`
- CLI `--device` → Python `device`
- CLI `--half` → Python `half=True`
- CLI `--task` → Python `task`
Prompts:
- CLI `--text "a,b"` → Python `prompts={"text": ["a", "b"]}`
- CLI `--box "x1,y1,x2,y2"` (repeatable) → Python `prompts={"boxes": [[x1, y1, x2, y2], ...]}`
- CLI `--point "x,y,label"` (repeatable) → Python `prompts={"points": [[x, y, label], ...]}`
- CLI `--topk N` → Python `topk=N` (or `prompts={"topk": N}`)
Artifacts (all opt-in):
- CLI `--json` → Python `save_json=True`
- CLI `--frames` → Python `save_frames=True`
- CLI `--save-video NAME.mp4` → Python `save_video="NAME.mp4"`
- CLI `--out-dir DIR` → Python `out_dir="DIR"`
- CLI `--run-name NAME` → Python `run_name="NAME"`
- CLI `--no-progress` → Python `progress=False`
- CLI `--display` → Python `display=True`
Note: the Python API also accepts an advanced `artifacts=ArtifactOptions(...)` object, but the convenience args above are easiest for most usage.
```python
from detect import detect_video

res = detect_video(
    video="in.mp4",
    detector="ultralytics",
    weights="yolo26n",
    task="detect",
    classes=None,          # e.g. [0, 2] to filter class ids
    conf_thresh=0.25,
    imgsz=640,
    device="auto",
    half=False,
    # prompts={"text": ["person", "car", "dog"]},  # for open-vocabulary models
    save_json=True,
    save_video="annotated.mp4",
    out_dir="out",
    run_name="py_detect",
)
print(res.payload["schema_version"], len(res.payload["frames"]))
print(res.paths)
```

Note: legacy detector aliases (`yolo_bbox`, `yolo_seg`, `yolo_pose`) are still accepted for backward compatibility, but the docs use `ultralytics` everywhere.
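The export note above suggests post-filtering detections when exported open-vocab models ignore prompts. A sketch of that over the documented det-v1 schema; this is plain dict manipulation, and `filter_classes` is not part of the library API:

```python
def filter_classes(payload: dict, keep: set[str]) -> dict:
    # Return a copy of a det-v1 payload keeping only detections whose
    # class_name is in `keep`; metadata and frame records pass through.
    out = dict(payload)
    out["frames"] = [
        {**frame, "detections": [d for d in frame["detections"]
                                 if d["class_name"] in keep]}
        for frame in payload["frames"]
    ]
    return out

# Demo payload in det-v1 shape (class ids here are illustrative):
payload = {
    "schema_version": "det-v1",
    "frames": [
        {"frame": 0, "file": "000000.jpg", "detections": [
            {"det_ind": 0, "bbox": [100.0, 50.0, 320.0, 240.0],
             "score": 0.91, "class_id": 0, "class_name": "person"},
            {"det_ind": 1, "bbox": [10.0, 10.0, 50.0, 40.0],
             "score": 0.55, "class_id": 16, "class_name": "dog"},
        ]},
    ],
}
filtered = filter_classes(payload, {"person"})
print(len(filtered["frames"][0]["detections"]))  # 1
```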
Export is currently implemented for the Ultralytics backend.
```shell
python -m detect.cli.export_model -h
```

```shell
python -m detect.cli.export_model \
  --weights yolo26n \
  --formats onnx \
  --out-dir models/exports --run-name y26_onnx
```

Python export also uses snake_case args (e.g., `out_dir`, `run_name`) and accepts `formats` as a list or a comma-separated string.
```python
from detect.backends.ultralytics.export import export_model_ultralytics

res = export_model_ultralytics(
    weights="yolo26n",
    formats=["onnx"],
    imgsz=640,
    out_dir="models/exports",
    run_name="y26_onnx_py",
)
print("run_dir:", res["run_dir"])
for p in res["artifacts"]:
    print("-", p)
```

Compatibility notes:
- Some model families do not support export (e.g., MobileSAM and SAM/SAM2/SAM3 per Ultralytics docs). The export CLI will warn and exit.
- YOLO-World v1 weights (`*-world.pt`) do not support export; use YOLO-World v2 (`*-worldv2.pt`) weights instead.
- YOLOv10 supports export, but only to a restricted set of formats; unsupported formats will warn and exit.
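In your own scripts you can turn the compatibility notes above into a pre-flight check before invoking export. This hypothetical helper (not part of detect-lib) encodes only the name-based rules stated here, so treat it as a heuristic sketch:

```python
def export_supported(weights: str) -> tuple[bool, str]:
    # Name-based pre-flight check derived from the compatibility notes
    # above; heuristic only, not a substitute for the export CLI's checks.
    stem = weights.lower().removesuffix(".pt")
    if stem.endswith("-world"):
        return False, "YOLO-World v1 weights do not export; use *-worldv2.pt"
    if stem.startswith(("sam", "mobile_sam")):
        return False, "SAM/SAM2/SAM3/MobileSAM do not support export"
    return True, "ok"

for w in ("yolo26n", "yolov8s-world.pt", "sam2.1_b.pt"):
    print(w, export_supported(w))
```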
MIT License. See LICENSE.