
feat(runtime): add standalone edge runtime for Pi/Jetson deployment#799

Open
rachmlenig wants to merge 30 commits into main from feat-runtime-edge-standalone
Conversation

@rachmlenig
Contributor

Break the edge runtime's dependency on runtimes/universal/ by copying and trimming the needed files into runtimes/edge/. The edge runtime is now fully self-contained with its own pyproject.toml, Dockerfile, and zero imports from universal.

Key changes from universal:

  • models/__init__.py exports only 4 model types (was 12)
  • vision router includes only detection/classification/streaming (no training, evaluation, tracking, OCR, or document extraction)
  • chat_completions/service.py makes heavy utils optional (context summarizer, history compressor, tool calling, thinking)
  • file_handler.py rewritten without PyMuPDF (no PDF processing)
  • context_calculator.py makes torch import lazy for GGUF-only deploys
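
The lazy torch import in the last bullet follows a pattern like this (a sketch only; the helper names and checks are illustrative, not the actual `context_calculator.py` code):

```python
# Sketch of the lazy-import pattern: torch is imported only when a
# GPU query is actually needed, so GGUF-only deployments that never
# install torch can still import this module.
from functools import lru_cache


@lru_cache(maxsize=1)
def _torch():
    """Import torch on first use; return None if it isn't installed."""
    try:
        import torch
        return torch
    except ImportError:
        return None


def available_vram_bytes() -> int:
    """Best-effort free VRAM; 0 when torch or CUDA are absent."""
    torch = _torch()
    if torch is None or not torch.cuda.is_available():
        return 0
    free, _total = torch.cuda.mem_get_info()
    return free
```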

@github-actions
Contributor

github-actions bot commented Mar 12, 2026

All E2E Tests Passed!

Test Results by Platform

OS Mode Status
ubuntu-latest source ✅ Passed
ubuntu-latest binary ✅ Passed
macos-latest source ✅ Passed
macos-latest binary ✅ Passed
windows-latest source ✅ Passed
windows-latest binary ✅ Passed

This comment was automatically generated by the E2E Tests workflow.

Move 4 identical utility modules from both runtimes into
llamafarm_common so bugs fixed in one place apply everywhere:

- utils/safe_home.py → llamafarm_common/safe_home.py
- utils/device.py → llamafarm_common/device.py
- utils/model_cache.py → llamafarm_common/model_cache.py
- utils/model_format.py → llamafarm_common/model_format.py

Both runtimes now have thin re-export shims that import from
llamafarm_common, so all internal `from utils.X import Y`
statements continue to work unchanged.
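
The shim pattern can be demonstrated stand-alone (module names are shortened and `get_home` is a hypothetical helper; the real shims re-export the actual `safe_home` etc. contents):

```python
# Stand-alone demo of the thin re-export shim. "common" plays the
# role of llamafarm_common.safe_home and "shim" the role of
# runtimes/*/utils/safe_home.py.
import sys
import types

# The single source of truth, as in llamafarm_common:
common = types.ModuleType("demo_llamafarm_common_safe_home")
common.get_home = lambda: "/home/llamafarm"
sys.modules[common.__name__] = common

# The thin shim: explicit re-export, no `import *`.
shim = types.ModuleType("demo_utils_safe_home")
shim.get_home = sys.modules["demo_llamafarm_common_safe_home"].get_home
shim.__all__ = ["get_home"]
sys.modules[shim.__name__] = shim

# Existing `from utils.safe_home import ...` call sites keep working
# and resolve to the common implementation.
assert sys.modules["demo_utils_safe_home"].get_home() == "/home/llamafarm"
```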

Also:
- Add cachetools dep to llamafarm_common (needed by model_cache)
- Consolidate pidfile.py to use safe_home instead of duplicating
  home directory resolution logic
- Fix model_format.py internal import to use relative import
  within the common package

Removes ~1,100 lines of duplicated code across the two runtimes.

Candidates identified but not moved (would add heavy deps to common):
- core/logging.py (needs structlog)
- services/error_handler.py (needs fastapi)
- models/base.py, vision_base.py (architectural scope change)
- All router files (deeply coupled to FastAPI app structure)
Add HailoYOLOModel that runs YOLO inference on the Hailo-10H AI
accelerator using pre-compiled .hef models from the Hailo Model Zoo.

The server auto-detects Hailo hardware at startup:
- Checks for hailo_platform package
- Checks for PCI device ID 1e60 (Hailo-10H) via lspci
- Falls back to /dev/hailo0 device node
- Falls back to CPU/ultralytics if Hailo not available
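
The detection chain above might look roughly like this (a sketch; the actual function name and lspci parsing in `server.py` may differ):

```python
# Startup probe for Hailo hardware, mirroring the fallbacks in the
# commit message: package import -> PCI ID 1e60 -> /dev/hailo0.
import os
import shutil
import subprocess


def detect_hailo() -> bool:
    """Return True if a Hailo accelerator looks usable."""
    if os.environ.get("FORCE_CPU_VISION") == "1":
        return False
    # 1. Is the hailo_platform package importable at all?
    try:
        import hailo_platform  # noqa: F401
    except ImportError:
        return False
    # 2. Look for PCI vendor ID 1e60 (Hailo) via lspci, if present.
    if shutil.which("lspci"):
        out = subprocess.run(
            ["lspci", "-n"], capture_output=True, text=True
        ).stdout
        if "1e60:" in out:
            return True
    # 3. Fall back to the device node.
    return os.path.exists("/dev/hailo0")
```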

New file: models/hailo_model.py
- HailoYOLOModel with same interface as YOLOModel
- Letterbox preprocessing for aspect-ratio-preserving resize
- NMS output parsing (Hailo .hef models include built-in NMS)
- COCO 80-class label mapping
- Configurable .hef directory via HAILO_HEF_DIR env var
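
The letterbox preprocessing reduces to a small geometry computation; here is an illustrative pure-arithmetic version (the real code also resizes and pads the pixel array):

```python
# Letterbox math: scale the image to fit a square canvas, then pad
# symmetrically so the aspect ratio is preserved.
def letterbox_params(w: int, h: int, size: int = 640):
    """Return (new_w, new_h, pad_x, pad_y) for a size x size canvas."""
    scale = min(size / w, size / h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    pad_x = (size - new_w) // 2
    pad_y = (size - new_h) // 2
    return new_w, new_h, pad_x, pad_y
```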

Server changes:
- load_detection_model() selects backend based on hardware detection
- FORCE_CPU_VISION=1 env var to skip Hailo and force CPU
- hailo_platform import is fully optional (try/except)
Build from repo root with -f flag so COPY can reach common/ and
packages/. Use --no-sources to skip [tool.uv.sources] relative
paths that don't apply inside the container.

Usage: docker build -t edge-runtime -f runtimes/edge/Dockerfile .
Two bugs prevented llama.cpp from loading on ARM64 Linux (Pi/Jetson):

1. Version mismatch: _get_llamafarm_release_version() read the
   llamafarm-llama package version (0.1.0) but the ARM64 binary is
   published under the main monorepo release tag (v0.0.28). These
   versions are decoupled. Now queries GitHub API for the latest
   release, with LLAMAFARM_RELEASE_VERSION env var override and
   v0.0.28 hardcoded fallback.

2. Extension mismatch: manifest template used .tar.gz but the
   actual published asset is .zip. Fixed to match.
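
The resolution order from fix 1 — env override, then GitHub API, then hardcoded fallback — can be sketched as follows (the repo slug is a placeholder and error handling is simplified):

```python
# Release-version resolution: LLAMAFARM_RELEASE_VERSION env override
# first, then the GitHub latest-release API, then a pinned fallback.
import json
import os
import urllib.request

FALLBACK_VERSION = "v0.0.28"


def resolve_release_version(repo: str = "owner/llamafarm") -> str:
    override = os.environ.get("LLAMAFARM_RELEASE_VERSION")
    if override:
        return override
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp)["tag_name"]
    except Exception:
        return FALLBACK_VERSION
```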
The pre-built llama.cpp ARM64 binary requires GLIBC 2.38+ but
python:3.12-slim-bookworm only has GLIBC 2.36. Switch to
ubuntu:24.04 (GLIBC 2.39) and install Python via apt.

Also add --break-system-packages to uv pip install since Ubuntu
24.04 marks system Python as externally managed (PEP 668). This
is safe inside a container.
Address CodeQL review comments:
- Remove `from module import *` from all 8 re-export shims (edge +
  universal). The explicit imports already cover everything needed.
- Remove unused `get_file_images` import from edge server.py
Dockerfile:
- Install vision extra (ultralytics, transformers) and pi-heif
- Add system libs for OpenCV (libgl1, libglib2.0-0, libxcb1)
- Set YOLO_AUTOINSTALL=false to prevent runtime pip installs

Vision hardening:
- Strip whitespace from base64 input before decoding (fixes
  newlines from curl piping with jq/base64 tools)
- Wrap PIL Image.open() with proper error handling in vision_base,
  detect_classify — returns clear error instead of raw traceback
- Pre-register HEIF plugin in yolo_model.py to prevent import
  errors on some ultralytics builds
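
The base64 hardening amounts to removing whitespace before decoding and converting decode failures into a clear error; a sketch (function name illustrative):

```python
# Tolerate newlines/spaces injected by curl piping through jq or the
# base64 CLI, and raise a clean ValueError instead of a raw traceback.
import base64
import binascii


def decode_image_b64(data: str) -> bytes:
    """Decode base64 image data, tolerating embedded whitespace."""
    cleaned = "".join(data.split())  # drop newlines, spaces, tabs
    try:
        return base64.b64decode(cleaned, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError(f"invalid base64 image data: {exc}") from exc
```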
…edge

- Re-export HfApi and _check_local_cache_for_model from the
  universal model_format shim so tests that mock
  utils.model_format.HfApi continue to work
- Add project.json for edge runtime so Nx can process the project
  graph without failing on the unnamed directory
…/rag

The model_format tests mocked utils.model_format.HfApi, but after the
refactor to a re-export shim, detect_model_format lives in
llamafarm_common.model_format and uses its own HfApi reference. Fix mock
targets to patch at the source module.
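
The "patch where it is looked up" rule can be demonstrated with stand-in modules (names shortened; `HfApi`/`detect` stand in for the real symbols):

```python
# A function resolves names from its *defining* module's globals, so
# patching a re-export shim has no effect on it.
import sys
import types
from unittest import mock

# Stand-in for llamafarm_common.model_format (the source module):
src = types.ModuleType("demo_common_model_format")
src.HfApi = lambda: "real"
exec("def detect(): return HfApi()", src.__dict__)
sys.modules[src.__name__] = src

# Stand-in for the utils.model_format shim re-exporting the names:
shim = types.ModuleType("demo_utils_model_format")
shim.HfApi = src.HfApi
shim.detect = src.detect
sys.modules[shim.__name__] = shim

# Patching the shim does NOT affect detect() ...
with mock.patch("demo_utils_model_format.HfApi", lambda: "mocked"):
    assert src.detect() == "real"

# ... patching the source module does.
with mock.patch("demo_common_model_format.HfApi", lambda: "mocked"):
    assert src.detect() == "mocked"
```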

The E2E source tests failed because UV_EXTRA_INDEX_URL and
UV_INDEX_STRATEGY leaked from the CI environment into server/rag
processes via os.Environ(). The PyTorch CPU index only has cp314 wheels
for markupsafe, causing install failures on Python 3.12. Strip these
vars from the base process environment so only services that explicitly
declare them (universal-runtime) receive them.
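
The scrubbing described above boils down to filtering the parent environment before spawning child services (the real code lives in the Go orchestrator; this is a Python rendering of the same idea):

```python
# Build the base process environment without the UV index vars; only
# services that explicitly declare them get them re-added.
UV_VARS = ("UV_EXTRA_INDEX_URL", "UV_INDEX_STRATEGY")


def base_process_env(parent: dict[str, str]) -> dict[str, str]:
    """Parent environment minus UV index settings."""
    return {k: v for k, v in parent.items() if k not in UV_VARS}
```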
@rachmlenig rachmlenig marked this pull request as ready for review March 17, 2026 20:31
@qodo-free-for-open-source-projects
Contributor

Review Summary by Qodo

Add standalone edge runtime for Pi/Jetson deployment with GGUF inference and KV cache management

✨ Enhancement


Walkthroughs

Description
• Introduces a fully self-contained edge runtime for Pi/Jetson deployment with zero dependencies on
  runtimes/universal/
• Implements GGUF language model inference via llama-cpp with memory-optimized quantized model
  support for edge devices
• Adds multi-tier KV cache management (VRAM → RAM → disk) with segment-level validation for
  efficient prefix caching
• Provides OpenAI-compatible chat completions service with optional heavy utilities (context
  summarizer, history compressor, tool calling)
• Implements vision routers for detection, classification, and streaming with Hailo-10H accelerator
  support
• Adds comprehensive context management with multiple truncation strategies (sliding_window,
  keep_system, middle_out, summarize)
• Introduces thinking/reasoning model support with budget allocation and chain-of-thought utilities
• Consolidates device detection and model format utilities into common llamafarm_common package
  for code reuse
• Includes GPU allocation with VRAM estimation and SSRF-safe remote cascade support
• Provides GGML logging integration and metadata caching to optimize performance on constrained
  hardware
Diagram
flowchart LR
  A["Edge Runtime<br/>FastAPI Server"] --> B["GGUF Language<br/>Model"]
  A --> C["Vision Models<br/>Detection/Classification"]
  A --> D["KV Cache<br/>Manager"]
  B --> E["llama-cpp<br/>Inference"]
  B --> F["Context<br/>Calculator"]
  D --> G["Multi-tier Cache<br/>VRAM/RAM/Disk"]
  A --> H["Chat Completions<br/>Service"]
  H --> I["Optional Utils<br/>Summarizer/Compressor"]
  H --> J["Tool Calling<br/>& Thinking"]
  C --> K["Hailo-10H<br/>Accelerator"]
  L["Common Package"] -.->|Device Detection| A
  L -.->|Model Format| A


File Changes

1. runtimes/edge/models/gguf_language_model.py ✨ Enhancement +1640/-0

GGUF language model wrapper with llama-cpp integration

• New 1640-line GGUF language model wrapper using llama-cpp for quantized model inference on edge
 devices
• Implements unified memory GPU detection for Jetson/Tegra platforms with synchronous inference
 optimization
• Provides chat completion, streaming, audio input, and tool calling support with native Jinja2
 template rendering
• Includes context management, token counting, KV cache support, and comprehensive error handling
 for memory-constrained devices

runtimes/edge/models/gguf_language_model.py


2. runtimes/edge/routers/chat_completions/service.py ✨ Enhancement +1402/-0

Chat completions service with optional utilities and KV cache

• New 1402-line chat completions service for edge runtime with optional heavy utilities (context
 summarizer, history compressor, tool calling)
• Implements streaming and non-streaming responses with incremental tool call detection via state
 machine
• Supports native audio input for multimodal models and STT transcription fallback
• Includes KV cache management, context validation/truncation, thinking budget allocation, and
 comprehensive logging

runtimes/edge/routers/chat_completions/service.py


3. runtimes/edge/utils/gguf_metadata_cache.py ✨ Enhancement +302/-0

Centralized GGUF metadata caching for performance

• New 302-line shared GGUF metadata cache to avoid redundant file reads (~4-5 seconds per read)
• Caches file size, context length, chat template, special tokens, and architecture parameters for
 KV cache estimation
• Thread-safe implementation with fallback for newer GGUF quantization types not yet supported by
 Python gguf library
• Provides cache statistics and selective cache clearing functionality

runtimes/edge/utils/gguf_metadata_cache.py


4. runtimes/edge/utils/kv_cache_manager.py ✨ Enhancement +703/-0

Multi-tier KV cache manager with segment validation

• Implements multi-tier KV cache management (VRAM → RAM → disk) with segment-level validation for
 multi-agent model sharing
• Provides cache entry lifecycle (prepare, lookup, restore, save_after_generation) with
 deduplication and TTL support
• Includes budget enforcement, garbage collection, and background cleanup task for memory management
• Supports partial cache hits when only part of conversation has changed (system prompt, tools, or
 history turns)

runtimes/edge/utils/kv_cache_manager.py
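
The VRAM → RAM → disk lookup order can be illustrated with a toy tiered map (tier names and the promote-on-hit behavior are assumptions about the design, not the actual implementation):

```python
# Toy multi-tier cache: check tiers fastest-first, and promote a hit
# toward the fastest tier so hot prefixes stay cheap to restore.
from collections import OrderedDict


class TieredCache:
    def __init__(self):
        # Fastest tier first; each tier is its own key -> value map.
        self.tiers = OrderedDict(vram={}, ram={}, disk={})

    def put(self, key, value, tier="ram"):
        self.tiers[tier][key] = value

    def get(self, key):
        for _name, tier in self.tiers.items():
            if key in tier:
                value = tier.pop(key)
                self.tiers["vram"][key] = value  # promote on hit
                return value
        return None
```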


5. runtimes/edge/utils/context_calculator.py ✨ Enhancement +477/-0

Context size calculator with memory-aware optimization

• Computes optimal context window size based on available memory, model architecture, and user
 configuration
• Implements four-tier priority system: user config → model training context → pattern defaults →
 computed max
• Calculates exact KV cache bytes per token from GGUF metadata (n_layer, n_head_kv, head sizes)
• Provides lazy torch import for GGUF-only deployments without GPU dependencies

runtimes/edge/utils/context_calculator.py


6. runtimes/edge/utils/tool_calling.py ✨ Enhancement +555/-0

Prompt-based tool calling with XML tag parsing

• Implements prompt-based tool calling with XML tag injection and detection for model outputs
• Supports multiple tool_choice modes: auto, none, required, and specific function forcing
• Provides incremental streaming utilities to extract tool names and arguments from partial JSON
• Includes tool schema validation and comprehensive error handling for malformed tool calls

runtimes/edge/utils/tool_calling.py


7. runtimes/edge/utils/context_manager.py ✨ Enhancement +506/-0

Context window management with truncation strategies

• Manages context window with multiple truncation strategies: sliding_window, keep_system,
 middle_out, summarize
• Validates messages fit within context budget and applies truncation when needed
• Implements content truncation as fallback when message removal is insufficient
• Provides context usage tracking including truncation metadata for API responses

runtimes/edge/utils/context_manager.py
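
Two of the strategies named above can be illustrated in miniature (messages are (role, text) pairs and the budget counts messages rather than tokens, purely to keep the example small):

```python
# Toy versions of the sliding_window and keep_system strategies.
def sliding_window(messages, budget):
    """Keep only the most recent `budget` messages."""
    return messages[-budget:]


def keep_system(messages, budget):
    """Keep system messages, then fill the rest from the newest end."""
    system = [m for m in messages if m[0] == "system"]
    rest = [m for m in messages if m[0] != "system"]
    return system + rest[len(rest) - (budget - len(system)):]
```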


8. runtimes/edge/models/hailo_model.py ✨ Enhancement +357/-0

Hailo-10H YOLO detection model integration

• Implements YOLO detection model for Hailo-10H AI accelerator using pre-compiled .hef models
• Provides letterboxing preprocessing and NMS output parsing for bounding box extraction
• Supports model variant mapping (yolov8n, yolov11n, etc.) with fallback to VISION_MODELS_DIR
• Includes async load/unload and inference with thread pool execution to avoid event loop blocking

runtimes/edge/models/hailo_model.py


9. runtimes/edge/server.py ✨ Enhancement +437/-0

Edge runtime FastAPI server with hardware detection

• Minimal FastAPI server for on-device inference on constrained hardware (Raspberry Pi, Jetson)
• Implements hardware detection for Hailo-10H accelerator with CPU fallback for vision models
• Provides model lifecycle management with TTL-based unloading and background cleanup task
• Integrates KV cache manager, chat completions, health checks, and vision routers
 (detection/classification/streaming only)

runtimes/edge/server.py


10. runtimes/edge/routers/chat_completions/types.py ✨ Enhancement +307/-0

Chat completion types with audio and cache support

• Defines OpenAI-compatible chat completion request/response types with audio and tool calling
 support
• Includes audio content extraction and STT transcription fallback utilities
• Adds KV cache parameters (cache_key, return_cache_key) and context management options
 (auto_truncate, truncation_strategy)
• Provides thinking/reasoning model support with separate thinking content and token tracking

runtimes/edge/routers/chat_completions/types.py


11. runtimes/edge/routers/vision/__init__.py ✨ Enhancement +32/-0

Edge vision router aggregation (detection/classification only)

• Combines edge-specific vision routers (detection, classification, detect_classify, streaming)
• Excludes OCR, document extraction, training, evaluation, tracking, and sample data endpoints
• Exports loader setters and session cleanup functions for dependency injection

runtimes/edge/routers/vision/__init__.py


12. runtimes/universal/utils/model_format.py Refactoring +4/-152

Consolidate model format detection to common library

• Simplifies to re-export model format utilities from llamafarm_common as single source of truth
• Removes local caching and HuggingFace API logic previously duplicated in universal runtime
• Maintains backward compatibility by re-exporting GGUF utilities and quantization preferences

runtimes/universal/utils/model_format.py


13. runtimes/edge/utils/gpu_allocator.py ✨ Enhancement +349/-0

GPU allocation and VRAM estimation for edge runtime

• New GPU allocation module for multi-model, multi-GPU GGUF inference
• Implements single-GPU placement preference with multi-GPU fallback strategy
• Estimates VRAM requirements using model size, context window, and KV cache calculations
• Provides SSRF-safe remote cascade support with allowlist validation

runtimes/edge/utils/gpu_allocator.py


14. runtimes/edge/routers/vision/streaming.py ✨ Enhancement +385/-0

Streaming vision detection with cascade chain support

• New streaming vision router with cascade detection chain support
• Implements session management with TTL-based cleanup for orphaned streams
• Supports both local and remote model cascading with SSRF protection
• Provides endpoints for stream start/stop, frame processing, and session listing

runtimes/edge/routers/vision/streaming.py


15. runtimes/edge/utils/thinking.py ✨ Enhancement +268/-0

Thinking model utilities for chain-of-thought reasoning

• New utilities for parsing and controlling thinking/reasoning in models like Qwen3
• Implements parse_thinking_response() to extract <think> tags from model output
• Provides inject_thinking_control() to inject /think and /no_think soft switches
• Includes ThinkingBudgetProcessor logits processor to enforce token budget limits

runtimes/edge/utils/thinking.py
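
A minimal sketch of what `parse_thinking_response()` likely does for the basic case (the real utility also handles streaming and unclosed tags):

```python
# Split a Qwen3-style <think>...</think> block from the visible answer.
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_thinking_response(text: str) -> tuple[str, str]:
    """Return (thinking, answer) extracted from model output."""
    match = THINK_RE.search(text)
    if not match:
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = THINK_RE.sub("", text, count=1).strip()
    return thinking, answer
```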


16. runtimes/edge/routers/cache.py ✨ Enhancement +243/-0

KV cache management API for prefix caching

• New KV cache API with prepare, validate, list, evict, stats, and GC endpoints
• Supports warm cache preparation with model loading and KV state pre-computation
• Implements cache validation without consuming the cache entry
• Provides cache statistics, garbage collection, and per-entry eviction

runtimes/edge/routers/cache.py


17. runtimes/edge/utils/context_summarizer.py ✨ Enhancement +239/-0

LLM-based conversation history summarization

• New context summarization module using LLM-based compression of conversation history
• Preserves recent messages while summarizing older ones to reduce token count
• Integrates with server's model caching mechanism for efficient summarization model loading
• Configurable model selection and number of recent exchanges to preserve

runtimes/edge/utils/context_summarizer.py


18. runtimes/edge/utils/history_compressor.py ✨ Enhancement +259/-0

Conversation history compression utilities

• New lossless and near-lossless compression techniques for conversation history
• Applies whitespace normalization, tool result truncation, code block compression
• Removes duplicate/near-duplicate content while preserving recent messages
• Reduces token usage before truncation without losing semantic meaning

runtimes/edge/utils/history_compressor.py


19. runtimes/edge/models/yolo_model.py ✨ Enhancement +175/-0

YOLO object detection model wrapper

• New YOLO object detection model wrapper supporting YOLOv8/v11 via ultralytics
• Implements detection, training, and export functionality with device auto-detection
• Includes path validation for security and pi_heif import error suppression
• Supports class filtering and confidence threshold customization

runtimes/edge/models/yolo_model.py


20. runtimes/edge/models/language_model.py ✨ Enhancement +222/-0

HuggingFace language model wrapper

• New language model wrapper for HuggingFace causal language models
• Implements both non-streaming and streaming text generation with chat templates
• Uses asyncio.to_thread() to avoid blocking the FastAPI event loop during loading
• Supports temperature, top-p sampling, and stop sequences

runtimes/edge/models/language_model.py


21. runtimes/edge/models/clip_model.py ✨ Enhancement +185/-0

CLIP image classification and embedding model

• New CLIP-based image classification and embedding model wrapper
• Implements zero-shot classification with pre-computed class embeddings
• Provides image and text embedding generation with caching of class embeddings
• Supports multiple CLIP variants (ViT-base, ViT-large, SigLIP)

runtimes/edge/models/clip_model.py


22. runtimes/edge/routers/vision/detect_classify.py ✨ Enhancement +180/-0

Combined detection and classification endpoint

• New detect+classify combo endpoint combining YOLO detection with CLIP classification
• Crops detected regions and classifies each crop in a single round-trip
• Includes minimum crop size filtering and RGB mode conversion for JPEG encoding
• Returns unified results with both detection and classification confidence scores

runtimes/edge/routers/vision/detect_classify.py


23. runtimes/universal/utils/device.py Refactoring +9/-195

Device utilities refactored to common package

• Refactored to re-export device utilities from llamafarm_common package
• Establishes single source of truth for device detection across runtimes
• Maintains backward compatibility with existing imports

runtimes/universal/utils/device.py


24. common/llamafarm_common/device.py ✨ Enhancement +195/-0

Common device detection and optimization utilities

• New common device detection module with lazy PyTorch loading
• Implements optimal device selection (CUDA, MPS, CPU) with environment variable overrides
• Provides detailed device info including per-GPU memory statistics
• Includes GGUF-specific GPU layer configuration independent of PyTorch

common/llamafarm_common/device.py


25. runtimes/edge/utils/ggml_logging.py ✨ Enhancement +184/-0

GGML/llama.cpp logging integration

• New GGML logging management module routing llama.cpp logs through Python logging
• Implements three modes: capture (default), suppress, and passthrough
• Handles log level mapping and buffers partial lines for complete message logging
• Downgrades known false-error messages to DEBUG level

runtimes/edge/utils/ggml_logging.py


26. runtimes/edge/models/vision_base.py ✨ Enhancement +188/-0

Base classes for vision models

• New base classes for vision models (detection and classification)
• Defines result dataclasses for detection, classification, and embeddings
• Provides image conversion utilities (bytes/numpy to PIL/numpy)
• Implements device resolution and model info retrieval

runtimes/edge/models/vision_base.py


27. runtimes/edge/models/base.py ✨ Enhancement +156/-0

Abstract base class for all model types

• New abstract base class for all HuggingFace models (transformers, diffusers)
• Implements common lifecycle methods (load, unload) with GPU cache clearing
• Provides dtype selection and tensor device movement utilities
• Includes platform-specific optimizations for MPS and CUDA

runtimes/edge/models/base.py


28. cli/cmd/orchestrator/python_env.go Additional files +21/-15

...

cli/cmd/orchestrator/python_env.go


29. cli/cmd/orchestrator/services.go Additional files +2/-2

...

cli/cmd/orchestrator/services.go


30. common/llamafarm_common/__init__.py Additional files +8/-0

...

common/llamafarm_common/__init__.py


31. common/llamafarm_common/model_cache.py Additional files +188/-0

...

common/llamafarm_common/model_cache.py


32. common/llamafarm_common/model_format.py Additional files +172/-0

...

common/llamafarm_common/model_format.py


33. common/llamafarm_common/pidfile.py Additional files +3/-7

...

common/llamafarm_common/pidfile.py


34. common/llamafarm_common/safe_home.py Additional files +34/-0

...

common/llamafarm_common/safe_home.py


35. common/pyproject.toml Additional files +1/-0

...

common/pyproject.toml


36. packages/llamafarm-llama/src/llamafarm_llama/_binary.py Additional files +40/-9

...

packages/llamafarm-llama/src/llamafarm_llama/_binary.py


37. runtimes/edge/Dockerfile Additional files +69/-0

...

runtimes/edge/Dockerfile


38. runtimes/edge/config/model_context_defaults.yaml Additional files +34/-0

...

runtimes/edge/config/model_context_defaults.yaml


39. runtimes/edge/core/__init__.py Additional files +0/-0

...

runtimes/edge/core/__init__.py


40. runtimes/edge/core/logging.py Additional files +156/-0

...

runtimes/edge/core/logging.py


41. runtimes/edge/models/__init__.py Additional files +45/-0

...

runtimes/edge/models/__init__.py


42. runtimes/edge/openapi.json Additional files +1/-0

...

runtimes/edge/openapi.json


43. runtimes/edge/project.json Additional files +31/-0

...

runtimes/edge/project.json


44. runtimes/edge/pyproject.toml Additional files +88/-0

...

runtimes/edge/pyproject.toml


45. runtimes/edge/routers/__init__.py Additional files +0/-0

...

runtimes/edge/routers/__init__.py


46. runtimes/edge/routers/chat_completions/__init__.py Additional files +3/-0

...

runtimes/edge/routers/chat_completions/__init__.py


47. runtimes/edge/routers/chat_completions/router.py Additional files +26/-0

...

runtimes/edge/routers/chat_completions/router.py


48. runtimes/edge/routers/health/__init__.py Additional files +5/-0

...

runtimes/edge/routers/health/__init__.py


49. runtimes/edge/routers/health/router.py Additional files +75/-0

...

runtimes/edge/routers/health/router.py


50. runtimes/edge/routers/vision/classification.py Additional files +61/-0

...

runtimes/edge/routers/vision/classification.py


51. runtimes/edge/routers/vision/detection.py Additional files +76/-0

...

runtimes/edge/routers/vision/detection.py


52. runtimes/edge/routers/vision/utils.py Additional files +22/-0

...

runtimes/edge/routers/vision/utils.py


53. runtimes/edge/services/__init__.py Additional files +0/-0

...

runtimes/edge/services/__init__.py


54. runtimes/edge/services/error_handler.py Additional files +143/-0

...

runtimes/edge/services/error_handler.py


55. runtimes/edge/utils/__init__.py Additional files +0/-0

...

runtimes/edge/utils/__init__.py


56. runtimes/edge/utils/device.py Additional files +9/-0

...

runtimes/edge/utils/device.py


57. runtimes/edge/utils/file_handler.py Additional files +213/-0

...

runtimes/edge/utils/file_handler.py


58. runtimes/edge/utils/jinja_tools.py Additional files +192/-0

...

runtimes/edge/utils/jinja_tools.py


59. runtimes/edge/utils/model_cache.py Additional files +4/-0

...

runtimes/edge/utils/model_cache.py


60. runtimes/edge/utils/model_format.py Additional files +24/-0

...

runtimes/edge/utils/model_format.py


61. runtimes/edge/utils/safe_home.py Additional files +4/-0

...

runtimes/edge/utils/safe_home.py


62. runtimes/edge/utils/token_counter.py Additional files +153/-0

...

runtimes/edge/utils/token_counter.py


63. runtimes/universal/tests/test_model_format.py Additional files +12/-12

...

runtimes/universal/tests/test_model_format.py


64. runtimes/universal/utils/model_cache.py Additional files +3/-187

...

runtimes/universal/utils/model_cache.py


65. runtimes/universal/utils/safe_home.py Additional files +3/-33

...

runtimes/universal/utils/safe_home.py




@qodo-free-for-open-source-projects
Contributor

qodo-free-for-open-source-projects bot commented Mar 17, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (4) 📎 Requirement gaps (0)



Action required

1. Undocumented UV_INDEX_STRATEGY env var 📘 Rule violation ⛯ Reliability
Description
The PR introduces UV_INDEX_STRATEGY as a configurable environment variable for
universal-runtime, but it is not documented in .env.example. This can lead to misconfiguration
(especially in CI) and violates the requirement to document new configuration keys.
Code

cli/cmd/orchestrator/services.go[R151-154]

			"HF_TOKEN":                 "",
			// In CI environments, use CPU-only PyTorch to avoid downloading 3GB+ of CUDA packages
-			"UV_EXTRA_INDEX_URL":  "${UV_EXTRA_INDEX_URL}",
-			"UV_INDEX_STRATEGY":   "", // Inherit from parent env (e.g. unsafe-best-match in CI)
+			"UV_EXTRA_INDEX_URL": "${UV_EXTRA_INDEX_URL}",
+			"UV_INDEX_STRATEGY":  "${UV_INDEX_STRATEGY}",
Evidence
UV_INDEX_STRATEGY is now read from the parent environment and passed into the universal-runtime
service, which makes it a user/CI-configurable key. .env.example contains no entry for
UV_INDEX_STRATEGY (or UV index settings), so the new key is not documented as required.

AGENTS.md
AGENTS.md
cli/cmd/orchestrator/services.go[149-155]
.env.example[43-55]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The PR adds a new configurable environment variable (`UV_INDEX_STRATEGY`) used by the orchestrator when launching `universal-runtime`, but `.env.example` was not updated to include it.

## Issue Context
Users/CI may need to set this variable for installs (e.g., CPU-only PyTorch index behavior). The repo compliance rules require new configuration keys to be documented in `.env.example` and kept consistent with configuration changes.

## Fix Focus Areas
- cli/cmd/orchestrator/services.go[149-155]
- .env.example[43-55]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. One-line stub function defs 📘 Rule violation ✓ Correctness
Description
Several fallback stub functions are defined on a single line, which commonly violates Ruff/PEP8
rules (e.g., E701) and can fail repo lint checks. This reduces readability and consistency with
repository Python style conventions.
Code

runtimes/edge/routers/chat_completions/service.py[R60-67]

+    # No-op stubs — edge doesn't support tool calling
+    def detect_probable_tool_call(*a, **kw): return False  # type: ignore[misc]
+    def detect_tool_call_in_content(*a, **kw): return None  # type: ignore[misc]
+    def extract_arguments_progress(*a, **kw): return ""  # type: ignore[misc]
+    def extract_tool_name_from_partial(*a, **kw): return None  # type: ignore[misc]
+    def is_tool_call_complete(*a, **kw): return False  # type: ignore[misc]
+    def parse_tool_choice(*a, **kw): return None  # type: ignore[misc]
+    def strip_tool_call_from_content(*a, **kw): return a[0] if a else ""  # type: ignore[misc]
Evidence
Repository Python style requires Ruff-compatible formatting; single-line def ...: return ...
statements are typically flagged by Ruff/PEP8 (E701) and are inconsistent with the stated
conventions.

AGENTS.md
runtimes/edge/routers/chat_completions/service.py[58-67]
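
A lint-clean version of the flagged stubs might look like this (a sketch of the suggested fix, abbreviated to three of the seven stubs; type-ignore comments omitted):

```python
# No-op stubs for when optional tool-calling support is unavailable.
# Multi-line bodies avoid E701 while keeping behavior identical.
def detect_probable_tool_call(*a, **kw):
    return False


def detect_tool_call_in_content(*a, **kw):
    return None


def strip_tool_call_from_content(*a, **kw):
    return a[0] if a else ""
```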

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Single-line stub function definitions in the ImportError fallback are likely to violate Ruff/PEP8 (e.g., E701) and reduce readability.

## Issue Context
These stubs are used when optional tool-calling support is unavailable. They should still conform to repository Python style so CI linting passes.

## Fix Focus Areas
- runtimes/edge/routers/chat_completions/service.py[60-67]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. export() signature too long 📘 Rule violation ✓ Correctness
Description
The export() method signature exceeds the repository’s 88-character line length limit. This is
likely to trigger Ruff line-length checks and violates the repo’s Python style conventions.
Code

runtimes/edge/models/vision_base.py[R160-162]

+    async def export(self, format: Literal["onnx", "coreml", "tensorrt", "tflite", "openvino"],
+                     output_path: str, **kwargs) -> str:
+        raise NotImplementedError(f"{self.__class__.__name__} does not support export to {format}")
Evidence
The compliance checklist requires Python code to respect the 88-character line length; the
export() definition line is substantially longer than that limit.

AGENTS.md
runtimes/edge/models/vision_base.py[156-162]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The `export()` method definition exceeds the repository’s 88-character line limit.

## Issue Context
Repo style conventions (Ruff/PEP8 variant) enforce line length, and long signatures should be wrapped for readability and lint compliance.

## Fix Focus Areas
- runtimes/edge/models/vision_base.py[160-162]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. openapi.json missing final newline 📘 Rule violation ✓ Correctness
Description
The newly added runtimes/edge/openapi.json is committed without a final newline. This violates
.editorconfig-style newline handling expectations and can cause formatting inconsistencies across
tools.
Code

runtimes/edge/openapi.json[1]

+{"openapi":"3.1.0","info":{"title":"LlamaFarm Edge Runtime","description":"Minimal on-device inference API for drones and edge hardware","version":"0.1.0"},"paths":{"/health":{"get":{"tags":["health"],"summary":"Health Check","description":"Health check endpoint with device information.","operationId":"health_check_health_get","responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{}}}}}}},"/v1/models":{"get":{"tags":["health"],"summary":"List Models","description":"List currently loaded models.","operationId":"list_models_v1_models_get","responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{}}}}}}},"/v1/chat/completions":{"post":{"summary":"Chat Completions","description":"OpenAI-compatible chat completions endpoint.\n\nSupports any HuggingFace causal language model.","operationId":"chat_completions_v1_chat_completions_post","requestBody":{"content":{"application/json":{"schema":{"$ref":"#/components/schemas/ChatCompletionRequest"}}},"required":true},"responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{}}}},"422":{"description":"Validation Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/v1/vision/detect":{"post":{"tags":["vision","vision-detection"],"summary":"Detect Objects","description":"Detect objects in an image using YOLO.","operationId":"detect_objects_v1_vision_detect_post","requestBody":{"content":{"application/json":{"schema":{"$ref":"#/components/schemas/DetectRequest"}}},"required":true},"responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"$ref":"#/components/schemas/DetectResponse"}}}},"422":{"description":"Validation Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/v1/vision/classify":{"post":{"tags":["vision","vision-classification"],"summary":"Classify 
Image","description":"Classify an image using CLIP (zero-shot).","operationId":"classify_image_v1_vision_classify_post","requestBody":{"content":{"application/json":{"schema":{"$ref":"#/components/schemas/ClassifyRequest"}}},"required":true},"responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ClassifyResponse"}}}},"422":{"description":"Validation Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/v1/vision/detect_classify":{"post":{"tags":["vision","vision-detect-classify"],"summary":"Detect And Classify","description":"Detect objects then classify each crop — single round-trip.\n\nRuns YOLO detection → crops each bounding box → CLIP classifies each crop.\nReturns unified results with both detection and classification info.","operationId":"detect_and_classify_v1_vision_detect_classify_post","requestBody":{"content":{"application/json":{"schema":{"$ref":"#/components/schemas/DetectClassifyRequest"}}},"required":true},"responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"$ref":"#/components/schemas/DetectClassifyResponse"}}}},"422":{"description":"Validation Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/v1/vision/stream/start":{"post":{"tags":["vision","vision-streaming"],"summary":"Start Stream","description":"Start a streaming detection session with cascade config.","operationId":"start_stream_v1_vision_stream_start_post","requestBody":{"content":{"application/json":{"schema":{"$ref":"#/components/schemas/StreamStartRequest"}}},"required":true},"responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"$ref":"#/components/schemas/StreamStartResponse"}}}},"422":{"description":"Validation 
Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/v1/vision/stream/frame":{"post":{"tags":["vision","vision-streaming"],"summary":"Process Frame","description":"Process a frame through the cascade chain.","operationId":"process_frame_v1_vision_stream_frame_post","requestBody":{"content":{"application/json":{"schema":{"$ref":"#/components/schemas/StreamFrameRequest"}}},"required":true},"responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"$ref":"#/components/schemas/StreamFrameResponse"}}}},"422":{"description":"Validation Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/v1/vision/stream/stop":{"post":{"tags":["vision","vision-streaming"],"summary":"Stop Stream","description":"Stop a streaming session.","operationId":"stop_stream_v1_vision_stream_stop_post","requestBody":{"content":{"application/json":{"schema":{"$ref":"#/components/schemas/StreamStopRequest"}}},"required":true},"responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"$ref":"#/components/schemas/StreamStopResponse"}}}},"422":{"description":"Validation Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/v1/vision/stream/sessions":{"get":{"tags":["vision","vision-streaming"],"summary":"List Sessions","description":"List active streaming sessions.","operationId":"list_sessions_v1_vision_stream_sessions_get","responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"$ref":"#/components/schemas/SessionsListResponse"}}}}}}},"/v1/models/unload":{"post":{"tags":["models"],"summary":"Unload All Models","description":"Unload all loaded models to free memory.","operationId":"unload_all_models_v1_models_unload_post","responses":{"200":{"description":"Successful 
Response","content":{"application/json":{"schema":{}}}}}}}},"components":{"schemas":{"Audio":{"properties":{"id":{"type":"string","title":"Id"}},"type":"object","required":["id"],"title":"Audio","description":"Data about a previous audio response from the model.\n[Learn more](https://platform.openai.com/docs/guides/audio)."},"BoundingBox":{"properties":{"x1":{"type":"number","title":"X1"},"y1":{"type":"number","title":"Y1"},"x2":{"type":"number","title":"X2"},"y2":{"type":"number","title":"Y2"}},"type":"object","required":["x1","y1","x2","y2"],"title":"BoundingBox"},"CascadeConfigRequest":{"properties":{"chain":{"items":{"type":"string"},"type":"array","title":"Chain","description":"Model chain, can include 'remote:http://...'","default":["yolov8n"]},"confidence_threshold":{"type":"number","maximum":1.0,"minimum":0.0,"title":"Confidence Threshold","default":0.7}},"type":"object","title":"CascadeConfigRequest"},"ChatCompletionAssistantMessageParam":{"properties":{"role":{"type":"string","const":"assistant","title":"Role"},"audio":{"anyOf":[{"$ref":"#/components/schemas/Audio"},{"type":"null"}]},"content":{"anyOf":[{"type":"string"},{"items":{"anyOf":[{"$ref":"#/components/schemas/ChatCompletionContentPartTextParam"},{"$ref":"#/components/schemas/ChatCompletionContentPartRefusalParam"}]},"type":"array"},{"type":"null"}],"title":"Content"},"function_call":{"anyOf":[{"$ref":"#/components/schemas/FunctionCall"},{"type":"null"}]},"name":{"type":"string","title":"Name"},"refusal":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"Refusal"},"tool_calls":{"items":{"anyOf":[{"$ref":"#/components/schemas/ChatCompletionMessageFunctionToolCallParam"},{"$ref":"#/components/schemas/ChatCompletionMessageCustomToolCallParam"}]},"type":"array","title":"Tool Calls"}},"type":"object","required":["role"],"title":"ChatCompletionAssistantMessageParam","description":"Messages sent by the model in response to user 
messages."},"ChatCompletionContentPartImageParam":{"properties":{"image_url":{"$ref":"#/components/schemas/ImageURL"},"type":{"type":"string","const":"image_url","title":"Type"}},"type":"object","required":["image_url","type"],"title":"ChatCompletionContentPartImageParam","description":"Learn about [image inputs](https://platform.openai.com/docs/guides/vision)."},"ChatCompletionContentPartInputAudioParam":{"properties":{"input_audio":{"$ref":"#/components/schemas/InputAudio"},"type":{"type":"string","const":"input_audio","title":"Type"}},"type":"object","required":["input_audio","type"],"title":"ChatCompletionContentPartInputAudioParam","description":"Learn about [audio inputs](https://platform.openai.com/docs/guides/audio)."},"ChatCompletionContentPartRefusalParam":{"properties":{"refusal":{"type":"string","title":"Refusal"},"type":{"type":"string","const":"refusal","title":"Type"}},"type":"object","required":["refusal","type"],"title":"ChatCompletionContentPartRefusalParam"},"ChatCompletionContentPartTextParam":{"properties":{"text":{"type":"string","title":"Text"},"type":{"type":"string","const":"text","title":"Type"}},"type":"object","required":["text","type"],"title":"ChatCompletionContentPartTextParam","description":"Learn about [text inputs](https://platform.openai.com/docs/guides/text-generation)."},"ChatCompletionDeveloperMessageParam":{"properties":{"content":{"anyOf":[{"type":"string"},{"items":{"$ref":"#/components/schemas/ChatCompletionContentPartTextParam"},"type":"array"}],"title":"Content"},"role":{"type":"string","const":"developer","title":"Role"},"name":{"type":"string","title":"Name"}},"type":"object","required":["content","role"],"title":"ChatCompletionDeveloperMessageParam","description":"Developer-provided instructions that the model should follow, regardless of\nmessages sent by the user. 
With o1 models and newer, `developer` messages\nreplace the previous `system` messages."},"ChatCompletionFunctionMessageParam":{"properties":{"content":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"Content"},"name":{"type":"string","title":"Name"},"role":{"type":"string","const":"function","title":"Role"}},"type":"object","required":["content","name","role"],"title":"ChatCompletionFunctionMessageParam"},"ChatCompletionFunctionToolParam":{"properties":{"function":{"$ref":"#/components/schemas/FunctionDefinition"},"type":{"type":"string","const":"function","title":"Type"}},"type":"object","required":["function","type"],"title":"ChatCompletionFunctionToolParam","description":"A function tool that can be used to generate a response."},"ChatCompletionMessageCustomToolCallParam":{"properties":{"id":{"type":"string","title":"Id"},"custom":{"$ref":"#/components/schemas/Custom"},"type":{"type":"string","const":"custom","title":"Type"}},"type":"object","required":["id","custom","type"],"title":"ChatCompletionMessageCustomToolCallParam","description":"A call to a custom tool created by the model."},"ChatCompletionMessageFunctionToolCallParam":{"properties":{"id":{"type":"string","title":"Id"},"function":{"$ref":"#/components/schemas/Function"},"type":{"type":"string","const":"function","title":"Type"}},"type":"object","required":["id","function","type"],"title":"ChatCompletionMessageFunctionToolCallParam","description":"A call to a function tool created by the 
model."},"ChatCompletionRequest":{"properties":{"model":{"type":"string","title":"Model"},"messages":{"items":{"anyOf":[{"$ref":"#/components/schemas/ChatCompletionDeveloperMessageParam"},{"$ref":"#/components/schemas/ChatCompletionSystemMessageParam"},{"$ref":"#/components/schemas/ChatCompletionUserMessageParam"},{"$ref":"#/components/schemas/ChatCompletionAssistantMessageParam"},{"$ref":"#/components/schemas/ChatCompletionToolMessageParam"},{"$ref":"#/components/schemas/ChatCompletionFunctionMessageParam"}]},"type":"array","title":"Messages"},"temperature":{"anyOf":[{"type":"number"},{"type":"null"}],"title":"Temperature","default":1.0},"top_p":{"anyOf":[{"type":"number"},{"type":"null"}],"title":"Top P","default":1.0},"max_tokens":{"anyOf":[{"type":"integer"},{"type":"null"}],"title":"Max Tokens"},"stream":{"anyOf":[{"type":"boolean"},{"type":"null"}],"title":"Stream","default":false},"stop":{"anyOf":[{"type":"string"},{"items":{"type":"string"},"type":"array"},{"type":"null"}],"title":"Stop"},"logprobs":{"anyOf":[{"type":"boolean"},{"type":"null"}],"title":"Logprobs"},"top_logprobs":{"anyOf":[{"type":"integer","maximum":20.0,"minimum":0.0},{"type":"null"}],"title":"Top Logprobs"},"presence_penalty":{"anyOf":[{"type":"number"},{"type":"null"}],"title":"Presence Penalty","default":0.0},"frequency_penalty":{"anyOf":[{"type":"number"},{"type":"null"}],"title":"Frequency Penalty","default":0.0},"user":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"User"},"n_ctx":{"anyOf":[{"type":"integer"},{"type":"null"}],"title":"N Ctx"},"n_batch":{"anyOf":[{"type":"integer"},{"type":"null"}],"title":"N Batch"},"n_gpu_layers":{"anyOf":[{"type":"integer"},{"type":"null"}],"title":"N Gpu Layers"},"n_threads":{"anyOf":[{"type":"integer"},{"type":"null"}],"title":"N Threads"},"flash_attn":{"anyOf":[{"type":"boolean"},{"type":"null"}],"title":"Flash Attn"},"use_mmap":{"anyOf":[{"type":"boolean"},{"type":"null"}],"title":"Use 
Mmap"},"use_mlock":{"anyOf":[{"type":"boolean"},{"type":"null"}],"title":"Use Mlock"},"cache_type_k":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"Cache Type K"},"cache_type_v":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"Cache Type V"},"extra_body":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"title":"Extra Body"},"tools":{"anyOf":[{"items":{"$ref":"#/components/schemas/ChatCompletionFunctionToolParam"},"type":"array"},{"type":"null"}],"title":"Tools"},"tool_choice":{"anyOf":[{"type":"string"},{"additionalProperties":true,"type":"object"},{"type":"null"}],"title":"Tool Choice"},"think":{"anyOf":[{"type":"boolean"},{"type":"null"}],"title":"Think"},"thinking_budget":{"anyOf":[{"type":"integer"},{"type":"null"}],"title":"Thinking Budget"},"cache_key":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"Cache Key"},"return_cache_key":{"anyOf":[{"type":"boolean"},{"type":"null"}],"title":"Return Cache Key"},"auto_truncate":{"anyOf":[{"type":"boolean"},{"type":"null"}],"title":"Auto Truncate","default":true},"truncation_strategy":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"Truncation Strategy"}},"type":"object","required":["model","messages"],"title":"ChatCompletionRequest","description":"OpenAI-compatible chat completion request."},"ChatCompletionSystemMessageParam":{"properties":{"content":{"anyOf":[{"type":"string"},{"items":{"$ref":"#/components/schemas/ChatCompletionContentPartTextParam"},"type":"array"}],"title":"Content"},"role":{"type":"string","const":"system","title":"Role"},"name":{"type":"string","title":"Name"}},"type":"object","required":["content","role"],"title":"ChatCompletionSystemMessageParam","description":"Developer-provided instructions that the model should follow, regardless of\nmessages sent by the user. 
With o1 models and newer, use `developer` messages\nfor this purpose instead."},"ChatCompletionToolMessageParam":{"properties":{"content":{"anyOf":[{"type":"string"},{"items":{"$ref":"#/components/schemas/ChatCompletionContentPartTextParam"},"type":"array"}],"title":"Content"},"role":{"type":"string","const":"tool","title":"Role"},"tool_call_id":{"type":"string","title":"Tool Call Id"}},"type":"object","required":["content","role","tool_call_id"],"title":"ChatCompletionToolMessageParam"},"ChatCompletionUserMessageParam":{"properties":{"content":{"anyOf":[{"type":"string"},{"items":{"anyOf":[{"$ref":"#/components/schemas/ChatCompletionContentPartTextParam"},{"$ref":"#/components/schemas/ChatCompletionContentPartImageParam"},{"$ref":"#/components/schemas/ChatCompletionContentPartInputAudioParam"},{"$ref":"#/components/schemas/File"}]},"type":"array"}],"title":"Content"},"role":{"type":"string","const":"user","title":"Role"},"name":{"type":"string","title":"Name"}},"type":"object","required":["content","role"],"title":"ChatCompletionUserMessageParam","description":"Messages sent by an end user, containing prompts or additional context\ninformation."},"ClassifiedDetection":{"properties":{"box":{"$ref":"#/components/schemas/BoundingBox"},"detection_class":{"type":"string","title":"Detection Class"},"detection_confidence":{"type":"number","title":"Detection Confidence"},"classification":{"type":"string","title":"Classification"},"classification_confidence":{"type":"number","title":"Classification Confidence"},"all_scores":{"additionalProperties":{"type":"number"},"type":"object","title":"All Scores"}},"type":"object","required":["box","detection_class","detection_confidence","classification","classification_confidence","all_scores"],"title":"ClassifiedDetection","description":"A detection with classification results."},"ClassifyRequest":{"properties":{"image":{"type":"string","title":"Image","description":"Base64-encoded 
image"},"model":{"type":"string","title":"Model","default":"clip-vit-base"},"classes":{"items":{"type":"string"},"type":"array","title":"Classes","description":"Classes for zero-shot classification"},"top_k":{"type":"integer","maximum":100.0,"minimum":1.0,"title":"Top K","default":5}},"type":"object","required":["image","classes"],"title":"ClassifyRequest"},"ClassifyResponse":{"properties":{"class_name":{"type":"string","title":"Class Name"},"class_id":{"type":"integer","title":"Class Id"},"confidence":{"type":"number","title":"Confidence"},"all_scores":{"additionalProperties":{"type":"number"},"type":"object","title":"All Scores"},"model":{"type":"string","title":"Model"},"inference_time_ms":{"type":"number","title":"Inference Time Ms"}},"type":"object","required":["class_name","class_id","confidence","all_scores","model","inference_time_ms"],"title":"ClassifyResponse"},"Custom":{"properties":{"input":{"type":"string","title":"Input"},"name":{"type":"string","title":"Name"}},"type":"object","required":["input","name"],"title":"Custom","description":"The custom tool that the model called."},"DetectClassifyRequest":{"properties":{"image":{"type":"string","title":"Image","description":"Base64-encoded image"},"detection_model":{"type":"string","title":"Detection Model","description":"YOLO model for detection","default":"yolov8n"},"classification_model":{"type":"string","title":"Classification Model","description":"CLIP model for classification","default":"clip-vit-base"},"classes":{"items":{"type":"string"},"type":"array","title":"Classes","description":"Classes for zero-shot classification of each crop"},"confidence_threshold":{"type":"number","maximum":1.0,"minimum":0.0,"title":"Confidence Threshold","description":"Detection confidence threshold","default":0.5},"detection_classes":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"title":"Detection Classes","description":"Filter detections to these YOLO 
classes"},"top_k":{"type":"integer","maximum":100.0,"minimum":1.0,"title":"Top K","description":"Top-K classification results per crop","default":3},"min_crop_px":{"type":"integer","minimum":1.0,"title":"Min Crop Px","description":"Minimum crop dimension in pixels (skip tiny detections)","default":16}},"type":"object","required":["image","classes"],"title":"DetectClassifyRequest"},"DetectClassifyResponse":{"properties":{"results":{"items":{"$ref":"#/components/schemas/ClassifiedDetection"},"type":"array","title":"Results"},"total_detections":{"type":"integer","title":"Total Detections"},"classified_count":{"type":"integer","title":"Classified Count"},"detection_model":{"type":"string","title":"Detection Model"},"classification_model":{"type":"string","title":"Classification Model"},"detection_time_ms":{"type":"number","title":"Detection Time Ms"},"classification_time_ms":{"type":"number","title":"Classification Time Ms"},"total_time_ms":{"type":"number","title":"Total Time Ms"}},"type":"object","required":["results","total_detections","classified_count","detection_model","classification_model","detection_time_ms","classification_time_ms","total_time_ms"],"title":"DetectClassifyResponse"},"DetectRequest":{"properties":{"image":{"type":"string","title":"Image","description":"Base64-encoded image"},"model":{"type":"string","title":"Model","default":"yolov8n"},"confidence_threshold":{"type":"number","maximum":1.0,"minimum":0.0,"title":"Confidence Threshold","default":0.5},"classes":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"title":"Classes"}},"type":"object","required":["image"],"title":"DetectRequest"},"DetectResponse":{"properties":{"detections":{"items":{"$ref":"#/components/schemas/Detection"},"type":"array","title":"Detections"},"model":{"type":"string","title":"Model"},"inference_time_ms":{"type":"number","title":"Inference Time 
Ms"}},"type":"object","required":["detections","model","inference_time_ms"],"title":"DetectResponse"},"Detection":{"properties":{"box":{"$ref":"#/components/schemas/BoundingBox"},"class_name":{"type":"string","title":"Class Name"},"class_id":{"type":"integer","title":"Class Id"},"confidence":{"type":"number","title":"Confidence"}},"type":"object","required":["box","class_name","class_id","confidence"],"title":"Detection"},"DetectionItem":{"properties":{"x1":{"type":"number","title":"X1"},"y1":{"type":"number","title":"Y1"},"x2":{"type":"number","title":"X2"},"y2":{"type":"number","title":"Y2"},"class_name":{"type":"string","title":"Class Name"},"class_id":{"type":"integer","title":"Class Id"},"confidence":{"type":"number","title":"Confidence"}},"type":"object","required":["x1","y1","x2","y2","class_name","class_id","confidence"],"title":"DetectionItem"},"File":{"properties":{"file":{"$ref":"#/components/schemas/FileFile"},"type":{"type":"string","const":"file","title":"Type"}},"type":"object","required":["file","type"],"title":"File","description":"Learn about [file inputs](https://platform.openai.com/docs/guides/text) for text generation."},"FileFile":{"properties":{"file_data":{"type":"string","title":"File Data"},"file_id":{"type":"string","title":"File Id"},"filename":{"type":"string","title":"Filename"}},"type":"object","title":"FileFile"},"Function":{"properties":{"arguments":{"type":"string","title":"Arguments"},"name":{"type":"string","title":"Name"}},"type":"object","required":["arguments","name"],"title":"Function","description":"The function that the model called."},"FunctionCall":{"properties":{"arguments":{"type":"string","title":"Arguments"},"name":{"type":"string","title":"Name"}},"type":"object","required":["arguments","name"],"title":"FunctionCall","description":"Deprecated and replaced by `tool_calls`.\n\nThe name and arguments of a function that should be called, as generated by the 
model."},"FunctionDefinition":{"properties":{"name":{"type":"string","title":"Name"},"description":{"type":"string","title":"Description"},"parameters":{"additionalProperties":true,"type":"object","title":"Parameters"},"strict":{"anyOf":[{"type":"boolean"},{"type":"null"}],"title":"Strict"}},"type":"object","required":["name"],"title":"FunctionDefinition"},"HTTPValidationError":{"properties":{"detail":{"items":{"$ref":"#/components/schemas/ValidationError"},"type":"array","title":"Detail"}},"type":"object","title":"HTTPValidationError"},"ImageURL":{"properties":{"url":{"type":"string","title":"Url"},"detail":{"type":"string","enum":["auto","low","high"],"title":"Detail"}},"type":"object","required":["url"],"title":"ImageURL"},"InputAudio":{"properties":{"data":{"type":"string","title":"Data"},"format":{"type":"string","enum":["wav","mp3"],"title":"Format"}},"type":"object","required":["data","format"],"title":"InputAudio"},"SessionInfo":{"properties":{"session_id":{"type":"string","title":"Session Id"},"frames_processed":{"type":"integer","title":"Frames Processed"},"actions_triggered":{"type":"integer","title":"Actions Triggered"},"escalations":{"type":"integer","title":"Escalations"},"chain":{"items":{"type":"string"},"type":"array","title":"Chain"},"idle_seconds":{"type":"number","title":"Idle Seconds"},"duration_seconds":{"type":"number","title":"Duration Seconds"}},"type":"object","required":["session_id","frames_processed","actions_triggered","escalations","chain","idle_seconds","duration_seconds"],"title":"SessionInfo"},"SessionsListResponse":{"properties":{"sessions":{"items":{"$ref":"#/components/schemas/SessionInfo"},"type":"array","title":"Sessions"},"count":{"type":"integer","title":"Count"}},"type":"object","required":["sessions","count"],"title":"SessionsListResponse"},"StreamFrameRequest":{"properties":{"session_id":{"type":"string","title":"Session Id"},"image":{"type":"string","title":"Image","description":"Base64-encoded 
image"}},"type":"object","required":["session_id","image"],"title":"StreamFrameRequest"},"StreamFrameResponse":{"properties":{"status":{"type":"string","title":"Status"},"detections":{"anyOf":[{"items":{"$ref":"#/components/schemas/DetectionItem"},"type":"array"},{"type":"null"}],"title":"Detections"},"confidence":{"anyOf":[{"type":"number"},{"type":"null"}],"title":"Confidence"},"resolved_by":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"Resolved By"}},"type":"object","required":["status"],"title":"StreamFrameResponse"},"StreamStartRequest":{"properties":{"config":{"$ref":"#/components/schemas/CascadeConfigRequest"},"target_fps":{"type":"number","title":"Target Fps","default":1.0},"action_classes":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"title":"Action Classes"},"cooldown_seconds":{"type":"number","title":"Cooldown Seconds","default":5.0}},"type":"object","title":"StreamStartRequest"},"StreamStartResponse":{"properties":{"session_id":{"type":"string","title":"Session Id"}},"type":"object","required":["session_id"],"title":"StreamStartResponse"},"StreamStopRequest":{"properties":{"session_id":{"type":"string","title":"Session Id"}},"type":"object","required":["session_id"],"title":"StreamStopRequest"},"StreamStopResponse":{"properties":{"session_id":{"type":"string","title":"Session Id"},"frames_processed":{"type":"integer","title":"Frames Processed"},"actions_triggered":{"type":"integer","title":"Actions Triggered"},"escalations":{"type":"integer","title":"Escalations"},"duration_seconds":{"type":"number","title":"Duration Seconds"}},"type":"object","required":["session_id","frames_processed","actions_triggered","escalations","duration_seconds"],"title":"StreamStopResponse"},"ValidationError":{"properties":{"loc":{"items":{"anyOf":[{"type":"string"},{"type":"integer"}]},"type":"array","title":"Location"},"msg":{"type":"string","title":"Message"},"type":{"type":"string","title":"Error 
Type"},"input":{"title":"Input"},"ctx":{"type":"object","title":"Context"}},"type":"object","required":["loc","msg","type"],"title":"ValidationError"}}}}
Evidence
The diff explicitly indicates openapi.json has no newline at end of file; a final newline is a standard .editorconfig requirement for consistent newline handling across tools.

AGENTS.md
pr_files_diffs/runtimes_edge_openapi_json.patch[8-8]
runtimes/edge/openapi.json[1-1]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`runtimes/edge/openapi.json` is missing a final newline, violating `.editorconfig` newline handling conventions.

## Issue Context
Many tooling chains and style checks expect a final newline to avoid noisy diffs and ensure consistent file formatting.

## Fix Focus Areas
- runtimes/edge/openapi.json[1-1]
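A sketch of the remediation, demonstrated on a throwaway file standing in for `openapi.json` rather than the real artifact (`ensure_final_newline` is an illustrative helper, not part of the codebase):

```python
import tempfile
from pathlib import Path


def ensure_final_newline(path: Path) -> bool:
    """Append a trailing newline if the file lacks one; return True if changed."""
    data = path.read_bytes()
    if data and not data.endswith(b"\n"):
        path.write_bytes(data + b"\n")
        return True
    return False


# Demonstrate on a temporary file written without a final newline.
with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "openapi.json"
    f.write_text('{"openapi":"3.1.0"}')  # no final newline
    changed = ensure_final_newline(f)
    print(changed, f.read_text().endswith("\n"))  # → True True
```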



5. ARM64 tag not validated 🐞 Bug ⛯ Reliability
Description
_get_llamafarm_release_version() returns the GitHub “latest” release tag_name without validating
that the expected ARM64 binary asset exists. download_binary() then builds the ARM64 download URL
from that tag and raises RuntimeError on HTTP error, which can break ARM64 installs whenever the
latest release is missing the artifact.
Code

packages/llamafarm-llama/src/llamafarm_llama/_binary.py[R64-77]

+    # 2. Query GitHub API for latest release
    try:
-        version = metadata.version("llamafarm-llama")
-        if version and not version.startswith("0.0.0"):
-            return f"v{version}"
-    except metadata.PackageNotFoundError:
-        pass
-    # Fallback for dev installs
-    return "v0.0.1"
+        import json
+
+        req = Request(
+            "https://api.github.com/repos/llama-farm/llamafarm/releases/latest",
+            headers={"User-Agent": "llamafarm-llama", "Accept": "application/vnd.github.v3+json"},
+        )
+        with urlopen(req, timeout=10) as response:
+            data = json.loads(response.read())
+            tag = data.get("tag_name")
+            if tag:
+                logger.info(f"Using latest LlamaFarm release: {tag}")
+                return tag
Evidence
The version-selection function returns the latest release tag without checking assets, and the
downloader uses that tag to construct the ARM64 URL; any missing asset results in a hard failure
during download.

packages/llamafarm-llama/src/llamafarm_llama/_binary.py[44-84]
packages/llamafarm-llama/src/llamafarm_llama/_binary.py[562-627]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
ARM64 binary downloads can fail because `_get_llamafarm_release_version()` returns the GitHub `releases/latest` tag without verifying that the ARM64 artifact exists for that release. `download_binary()` then constructs the ARM64 URL using that tag and raises when the asset is missing.

### Issue Context
The code comment/docstring states the function queries for the “latest release with the ARM64 binary”, but the implementation only reads `tag_name` and never checks `assets` or performs a lightweight existence check on the expected URL.

### Fix Focus Areas
- packages/llamafarm-llama/src/llamafarm_llama/_binary.py[44-84]
- packages/llamafarm-llama/src/llamafarm_llama/_binary.py[586-627]

### Suggested implementation direction
- Fetch `releases/latest` JSON and scan `assets` for the expected filename (`llama-{version}-bin-linux-arm64.zip`) OR
- Perform a `HEAD`/`GET` probe against the constructed artifact URL and only return the tag if it succeeds.
- If validation fails, fall back to the hardcoded `fallback` (or iterate through the most recent N releases until a match is found).
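The asset-scan option above can be sketched as a pure function over the release JSON. Field names are assumed to follow the GitHub REST API shape (an `assets` list whose entries carry a `name`); `release_has_arm64_asset` and the sample payloads are illustrative, not part of the codebase:

```python
def release_has_arm64_asset(release: dict, fallback: str = "v0.0.1") -> str:
    """Return the release tag only if the expected ARM64 zip is attached;
    otherwise fall back to a known-good version."""
    tag = release.get("tag_name")
    if not tag:
        return fallback
    # Expected artifact name per the review: llama-{version}-bin-linux-arm64.zip
    expected = f"llama-{tag.lstrip('v')}-bin-linux-arm64.zip"
    names = {a.get("name") for a in release.get("assets", [])}
    return tag if expected in names else fallback


# Example payloads (shapes assumed from the GitHub REST API).
good = {
    "tag_name": "v1.2.3",
    "assets": [{"name": "llama-1.2.3-bin-linux-arm64.zip"}],
}
bad = {"tag_name": "v1.2.4", "assets": [{"name": "llama-1.2.4-bin-macos.zip"}]}
print(release_has_arm64_asset(good))  # → v1.2.3
print(release_has_arm64_asset(bad))   # → v0.0.1
```

The same predicate could gate a loop over the most recent N releases until a match is found.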



6. Tool stub unpack crash 🐞 Bug ✓ Correctness
Description
In edge chat completions, the ImportError fallback for utils.tool_calling defines
parse_tool_choice() to return None, but the streaming/non-streaming paths unconditionally unpack its
return into two variables. If the optional import fails, requests will crash with a TypeError during
tool_choice parsing.
Code

runtimes/edge/routers/chat_completions/service.py[R59-67]

+except ImportError:
+    # No-op stubs — edge doesn't support tool calling
+    def detect_probable_tool_call(*a, **kw): return False  # type: ignore[misc]
+    def detect_tool_call_in_content(*a, **kw): return None  # type: ignore[misc]
+    def extract_arguments_progress(*a, **kw): return ""  # type: ignore[misc]
+    def extract_tool_name_from_partial(*a, **kw): return None  # type: ignore[misc]
+    def is_tool_call_complete(*a, **kw): return False  # type: ignore[misc]
+    def parse_tool_choice(*a, **kw): return None  # type: ignore[misc]
+    def strip_tool_call_from_content(*a, **kw): return a[0] if a else ""  # type: ignore[misc]
Evidence
The ImportError branch defines a stub that returns None, while call sites always expect a 2-tuple
(mode, function_name). The real implementation returns a tuple, so the stub is incompatible and will
crash when that fallback is used.

runtimes/edge/routers/chat_completions/service.py[49-67]
runtimes/edge/routers/chat_completions/service.py[802-805]
runtimes/edge/utils/tool_calling.py[136-156]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
If `utils.tool_calling` fails to import, `parse_tool_choice` is stubbed to return `None`, but the service always unpacks its return value into `(tool_choice_mode, _)`, which will raise `TypeError: cannot unpack non-iterable NoneType object`.

### Issue Context
Even if tool calling is intended to be optional on edge, the fallback implementation must remain interface-compatible or the code must short-circuit tool logic when tool-calling support is unavailable.

### Fix Focus Areas
- runtimes/edge/routers/chat_completions/service.py[49-67]
- runtimes/edge/routers/chat_completions/service.py[802-806]

### Suggested implementation direction
- In the ImportError branch, define `parse_tool_choice` to return a valid default tuple like `("none", None)` (or `("auto", None)` depending on desired behavior).
- Consider introducing a module-level flag like `TOOL_CALLING_AVAILABLE = True/False` and set `should_detect_tools = False` when unavailable, to avoid relying on multiple stubs staying perfectly compatible over time.
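Both suggestions combined look roughly like this (hedged sketch; module and call-site names mirror the review discussion, not necessarily the shipped `service.py`):

```python
# Import the real implementation when available; otherwise install an
# interface-compatible stub and remember that tool calling is disabled.
try:
    from utils.tool_calling import parse_tool_choice  # type: ignore[import-not-found]
    TOOL_CALLING_AVAILABLE = True
except ImportError:
    TOOL_CALLING_AVAILABLE = False

    def parse_tool_choice(*args, **kwargs):
        # Stub keeps the 2-tuple shape call sites unpack: (mode, function_name).
        return ("none", None)

# Call sites can now unpack safely and short-circuit tool logic:
tool_choice_mode, _function_name = parse_tool_choice("auto")
should_detect_tools = TOOL_CALLING_AVAILABLE and tool_choice_mode != "none"
```

The module-level flag means new call sites do not have to rely on every stub staying perfectly compatible with the real implementation.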



Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


40 issues found across 66 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="runtimes/edge/routers/chat_completions/service.py">

<violation number="1" location="runtimes/edge/routers/chat_completions/service.py:46">
P0: The fallback stub returns a plain tuple `(text, None)`, but callers access `.content` and `.thinking` attributes (line 1223+). This will raise `AttributeError` for every non-streaming request when `utils.thinking` is absent — which is the expected edge deployment case.</violation>

<violation number="2" location="runtimes/edge/routers/chat_completions/service.py:398">
P1: `HistoryCompressor` can be `None` (line 39) but is called unconditionally here. This raises `TypeError` on every GGUF request with a context manager when the module is absent. Guard with a `None` check.</violation>
</file>

<file name="runtimes/edge/utils/context_calculator.py">

<violation number="1" location="runtimes/edge/utils/context_calculator.py:87">
P1: This branch uses system RAM as the CUDA memory budget when PyTorch is missing, so GGUF-only CUDA deployments can overestimate available memory and pick an unsafe context size.</violation>

<violation number="2" location="runtimes/edge/utils/context_calculator.py:445">
P1: When memory limits the safe context below 2048, this fallback still forces `2048`, which can overshoot the computed limit and trigger OOM on low-memory devices.</violation>
</file>

<file name="runtimes/edge/routers/cache.py">

<violation number="1" location="runtimes/edge/routers/cache.py:94">
P1: This router is never included in the FastAPI app, so all of the new `/v1/cache/*` endpoints are dead code.</violation>
</file>

<file name="runtimes/edge/server.py">

<violation number="1" location="runtimes/edge/server.py:271">
P1: Path traversal bypass: `Path('..').name` returns `'..'`, so `model_id='..'` passes the basename check and causes `VISION_MODELS_DIR / '..' / 'current.pt'` to escape the vision models directory. Reject `.` and `..` explicitly, and add a resolved-path containment check.

(Based on your team's feedback about combining basename checks with resolved-path containment.) [FEEDBACK_USED]</violation>
</file>
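The basename-plus-containment check suggested for `server.py` could be sketched like this (hypothetical helper; the real endpoint and directory names may differ):

```python
from pathlib import Path


def resolve_model_path(models_dir: Path, model_id: str, filename: str = "current.pt") -> Path:
    """Reject traversal-style model ids: an explicit '.'/'..' check plus a
    basename check, then a resolved-path containment check as defense in depth."""
    if model_id in (".", "..") or Path(model_id).name != model_id:
        raise ValueError(f"invalid model id: {model_id!r}")
    candidate = (models_dir / model_id / filename).resolve()
    root = models_dir.resolve()
    if root not in candidate.parents:
        raise ValueError(f"model id escapes models dir: {model_id!r}")
    return candidate
```

The containment check catches anything the string checks miss (e.g. symlinks inside the models directory), which is why the two are combined rather than relying on either alone.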

<file name="runtimes/edge/models/yolo_model.py">

<violation number="1" location="runtimes/edge/models/yolo_model.py:59">
P1: Replace the prefix check with a real resolved-path containment check; `startswith()` still accepts sibling directories outside the approved roots.

(Based on your team's feedback about validating identifiers used in filesystem paths with resolved-path containment.) [FEEDBACK_USED]</violation>

<violation number="2" location="runtimes/edge/models/yolo_model.py:96">
P2: Check `classes` explicitly. An empty list currently falls through as `None`, so detection runs without any class filter and returns classes the caller did not request.

(Based on your team's feedback about distinguishing `classes=None` from an explicit empty list.) [FEEDBACK_USED]</violation>
</file>
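The `classes=None` vs `classes=[]` distinction flagged above can be made explicit with a small normalizer (sketch only; the edge `detect()` signature may differ):

```python
def normalize_class_filter(classes):
    """Distinguish 'no filter requested' (None) from an explicit empty list,
    which currently falls through as no-filter in the detection backends."""
    if classes is None:
        return None  # no filtering: every class is returned
    if len(classes) == 0:
        raise ValueError("classes must be None or a non-empty list")
    return list(classes)
```

Callers then pass the normalized value straight through, so an explicit `[]` fails fast instead of silently returning all detections.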

<file name="runtimes/edge/utils/context_manager.py">

<violation number="1" location="runtimes/edge/utils/context_manager.py:344">
P2: `middle_out` only keeps the first non-system message, so it can drop the assistant half of the initial exchange the strategy promises to preserve.</violation>

<violation number="2" location="runtimes/edge/utils/context_manager.py:413">
P1: Handle multimodal `content` parts here instead of passing list-valued message content to string-only token truncation.</violation>
</file>

<file name="runtimes/edge/services/error_handler.py">

<violation number="1" location="runtimes/edge/services/error_handler.py:118">
P1: Do not return raw exception text from the catch-all handler; it leaks internal server details to clients.

(Based on your team's feedback about avoiding raw exception text in client-facing errors.) [FEEDBACK_USED]</violation>
</file>

<file name="runtimes/edge/models/clip_model.py">

<violation number="1" location="runtimes/edge/models/clip_model.py:77">
P1: Guard the class cache with `_class_embeddings is not None`; after `unload()`, the stale cache key can skip re-encoding and make `classify()` fail on `self._class_embeddings.T`.</violation>

<violation number="2" location="runtimes/edge/models/clip_model.py:80">
P1: Avoid storing per-request classes in shared instance state. Concurrent `classify()` calls can overwrite `self.class_names` and `_class_embeddings`, returning scores for the wrong labels.</violation>
</file>

<file name="runtimes/edge/routers/chat_completions/router.py">

<violation number="1" location="runtimes/edge/routers/chat_completions/router.py:21">
P1: Avoid logging the full chat request body here; this endpoint accepts raw message contents and audio/tool payloads, so DEBUG logs would capture user data and large request bodies.</violation>
</file>

<file name="runtimes/edge/routers/vision/streaming.py">

<violation number="1" location="runtimes/edge/routers/vision/streaming.py:212">
P1: Remote cascade results bypass the confidence threshold check. `_RemoteResult` is always truthy, so `if result:` accepts any remote response regardless of confidence — breaking the cascade escalation logic. The local model path correctly gates on `det_result.confidence >= session.cascade.confidence_threshold`; the remote path should do the same.</violation>
</file>

<file name="packages/llamafarm-llama/src/llamafarm_llama/_binary.py">

<violation number="1" location="packages/llamafarm-llama/src/llamafarm_llama/_binary.py:77">
P1: Selecting the latest monorepo tag without verifying the matching `llama-{version}-bin-linux-arm64.zip` asset exists can make ARM64 downloads 404 when the release tag and llama.cpp version drift.</violation>
</file>

<file name="runtimes/edge/utils/kv_cache_manager.py">

<violation number="1" location="runtimes/edge/utils/kv_cache_manager.py:683">
P1: `gc()` mutates `_entries` and `_content_index` without holding `self._lock`, while all other async mutators acquire it. The `_gc_loop` background task should acquire the lock before calling `gc()`, otherwise concurrent `prepare()`/`save_after_generation()` calls can observe inconsistent state (e.g., a `_content_index` pointing to an already-evicted `_entries` key).</violation>
</file>

<file name="runtimes/edge/utils/history_compressor.py">

<violation number="1" location="runtimes/edge/utils/history_compressor.py:139">
P1: Skip deduplication for `tool` messages. Removing a repeated tool result by content can leave an unmatched `tool_call_id` in history and make the next chat request fail protocol validation.</violation>
</file>

<file name="runtimes/edge/routers/vision/detect_classify.py">

<violation number="1" location="runtimes/edge/routers/vision/detect_classify.py:60">
P2: Reject an explicit empty `detection_classes` list here. Right now `[]` falls through as “no filter” in both detection backends, so this endpoint can return all detections instead of none/an error.

(Based on your team's feedback about differentiating an explicit empty classes list from a missing one.) [FEEDBACK_USED]</violation>

<violation number="2" location="runtimes/edge/routers/vision/detect_classify.py:118">
P1: Validate `classification_model` before loading it. `load_classification_model` passes this value straight into `from_pretrained(...)`, which accepts local directory paths, so this endpoint currently allows path-like model ids to reach arbitrary filesystem locations.</violation>
</file>

<file name="runtimes/edge/utils/thinking.py">

<violation number="1" location="runtimes/edge/utils/thinking.py:136">
P1: This fallback drops the original user prompt when content can't be converted to a list.</violation>

<violation number="2" location="runtimes/edge/utils/thinking.py:248">
P2: This budget check is off by one and starts forcing `</think>` one token too early.</violation>
</file>

<file name="common/llamafarm_common/model_format.py">

<violation number="1" location="common/llamafarm_common/model_format.py:128">
P1: Don't treat a cache with no `.gguf` files as proof the repo is transformers-only. `scan_cache_dir()` only sees files already downloaded locally, so partially cached GGUF repos will be misclassified and loaded with the wrong model class.</violation>
</file>

<file name="common/llamafarm_common/model_cache.py">

<violation number="1" location="common/llamafarm_common/model_cache.py:56">
P1: `TTLCache(ttl=ttl * 10)` still gives each entry a hard internal expiry, so frequently accessed models can disappear after ~10×ttl anyway.</violation>
</file>

<file name="runtimes/edge/models/language_model.py">

<violation number="1" location="runtimes/edge/models/language_model.py:142">
P2: `generate()` ignores the caller's `stop` sequences and can return text past the requested boundary.</violation>
</file>

<file name="runtimes/edge/core/logging.py">

<violation number="1" location="runtimes/edge/core/logging.py:125">
P1: Enable propagation after clearing Uvicorn handlers, or `uvicorn.error` logs can be dropped instead of reaching the root structlog handler.</violation>
</file>

<file name="runtimes/edge/utils/jinja_tools.py">

<violation number="1" location="runtimes/edge/utils/jinja_tools.py:116">
P2: This substring check can misclassify templates as tool-aware and skip the fallback injector, which silently drops tool definitions for affected models.</violation>

<violation number="2" location="runtimes/edge/utils/jinja_tools.py:128">
P1: Use `ImmutableSandboxedEnvironment` here; the regular sandbox still lets untrusted GGUF templates mutate caller-owned `messages`/`tools` lists and dicts during rendering.</violation>
</file>
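The sandbox upgrade suggested above behaves as follows (minimal sketch, not the runtime's actual template-loading code):

```python
from jinja2.sandbox import ImmutableSandboxedEnvironment
from jinja2.exceptions import SecurityError

# An untrusted chat template that tries to mutate caller-owned state.
# Under ImmutableSandboxedEnvironment, calling list.append from the
# template raises SecurityError instead of silently appending.
env = ImmutableSandboxedEnvironment()
template = env.from_string("{{ messages.append({'role': 'user'}) }}")

messages = [{"role": "system", "content": "hi"}]
try:
    template.render(messages=messages)
    blocked = False
except SecurityError:
    blocked = True
```

The regular `SandboxedEnvironment` would allow the `append` call, which is exactly the mutation path the finding describes for untrusted GGUF templates.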

<file name="runtimes/edge/utils/tool_calling.py">

<violation number="1" location="runtimes/edge/utils/tool_calling.py:19">
P1: Regex tag parsing breaks valid tool calls when an argument string contains `</tool_call>`.</violation>
</file>

<file name="runtimes/edge/routers/chat_completions/types.py">

<violation number="1" location="runtimes/edge/routers/chat_completions/types.py:227">
P2: Multiple audio parts in one message are collapsed into a single transcription key, so STT fallback overwrites earlier segments and reuses the last transcript for every audio part.</violation>
</file>

<file name="runtimes/edge/utils/gguf_metadata_cache.py">

<violation number="1" location="runtimes/edge/utils/gguf_metadata_cache.py:69">
P2: `_cache_lock` is held during a slow `GGUFReader` I/O call in the monkey-patch fallback path, blocking all cache reads/writes for the entire retry duration (potentially seconds). This contradicts the explicit comment on line 76: "outside lock to avoid blocking". Use a separate lock to serialize the monkey-patch so cache lookups for already-cached paths aren't stalled.</violation>
</file>

<file name="runtimes/edge/routers/vision/detection.py">

<violation number="1" location="runtimes/edge/routers/vision/detection.py:58">
P2: Decode and validate the image before calling `_load_fn()`. Right now malformed requests can still trigger a full YOLO load before the endpoint returns 400.</violation>

<violation number="2" location="runtimes/edge/routers/vision/detection.py:61">
P2: Reject an explicit empty `classes` list before calling `model.detect()`. `[]` currently falls through as "no filter" and returns detections for every class.

(Based on your team's feedback about validating empty `classes` lists explicitly.) [FEEDBACK_USED]</violation>
</file>

<file name="runtimes/edge/pyproject.toml">

<violation number="1" location="runtimes/edge/pyproject.toml:67">
P2: This wheel config omits the non-Python context defaults file that `utils/context_calculator.py` expects, so installed edge runtimes will always fall back to the generic 2048-token context calculation path.</violation>
</file>

<file name="runtimes/edge/utils/context_summarizer.py">

<violation number="1" location="runtimes/edge/utils/context_summarizer.py:144">
P2: Count recent conversation turns by turn boundaries, not `keep_recent * 2` raw messages, or tool-using chats will summarize part of the latest turns.</violation>
</file>

<file name="runtimes/edge/Dockerfile">

<violation number="1" location="runtimes/edge/Dockerfile:39">
P2: Copy the edge runtime sources before installing the local `.[vision]` package.</violation>

<violation number="2" location="runtimes/edge/Dockerfile:56">
P2: Use `LF_RUNTIME_PORT` in the health check instead of hard-coding 11540.</violation>
</file>

<file name="cli/cmd/orchestrator/python_env.go">

<violation number="1" location="cli/cmd/orchestrator/python_env.go:167">
P2: This filter still lets other uv index env vars leak through. `UV_INDEX_URL`/`UV_DEFAULT_INDEX` and `UV_INDEX` also affect package resolution, so a user-exported index can still override installs for services that were supposed to run without the PyTorch index.</violation>
</file>

<file name="cli/cmd/orchestrator/services.go">

<violation number="1" location="cli/cmd/orchestrator/services.go:154">
P2: This now forces `UV_INDEX_STRATEGY` into the child env even when it is unset, which can override uv's default with an empty invalid value.</violation>
</file>

<file name="runtimes/edge/config/model_context_defaults.yaml">

<violation number="1" location="runtimes/edge/config/model_context_defaults.yaml:24">
P2: This `*Llama-3*` fallback also catches Llama 3.1 models and downgrades them to 8K context. Add a more specific Llama 3.1 rule before the generic Llama 3 entry.</violation>
</file>
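Assuming rules are matched top-down, the ordering fix could look like this (hypothetical sketch; the real keys and values in `model_context_defaults.yaml` may differ):

```yaml
# More specific rule first, so Llama 3.1 models are not caught by the
# generic Llama 3 fallback below. Context values are illustrative.
- match: "*Llama-3.1*"
  context_length: 131072   # Llama 3.1 supports 128K context
- match: "*Llama-3*"
  context_length: 8192     # Llama 3.0 is limited to 8K
```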

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

lspci -d 1e60: already filters by vendor ID, but the old check looked
for "1e60" in the output text. lspci resolves vendor IDs to names, so
the output contains "Hailo Technologies Ltd." instead of "1e60",
causing detection to always fail. Check for any non-empty output instead.
ConfiguredInferModel doesn't expose .output() in HailoRT 5.2.0.
Use self._infer_model.output().shape instead of
self._configured.output().shape to get the output buffer dimensions.
ConfiguredInferModel.run() does not accept timeout_ms in HailoRT 5.2.0.
HailoRT 5.2.0 accepts timeout as a positional arg, not keyword.
… bots

- Log warning on failed CPU offload instead of silent pass (base.py)
- Log debug on unified memory detection failure instead of silent pass
  (gguf_language_model.py)
- Remove unused _timing_start variable (gguf_language_model.py)
- Add path traversal validation in load_language() (server.py)
- Add path traversal validation in _read_gguf_metadata() (gguf_metadata_cache.py)
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="runtimes/edge/server.py">

<violation number="1" location="runtimes/edge/server.py:218">
P2: Path-traversal validation is incomplete: it misses Windows absolute/UNC path forms, so crafted `model_id` values can bypass this guard.</violation>
</file>

<file name="runtimes/edge/utils/gguf_metadata_cache.py">

<violation number="1" location="runtimes/edge/utils/gguf_metadata_cache.py:91">
P2: Substring matching ".." will reject valid GGUF paths that contain ".." in a filename or directory name. Check path segments for ".." instead of the raw string so legitimate paths aren’t blocked.</violation>
</file>
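The segment-based check suggested here is a one-liner (sketch of the suggested fix, not the shipped code):

```python
from pathlib import Path


def has_traversal_segment(path_str: str) -> bool:
    """Check path *segments* for '..' rather than substring-matching the raw
    string, so a legitimate file like 'model..v2.gguf' is not rejected."""
    return ".." in Path(path_str).parts
```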


Hailo NMS outputs normalized coordinates (0.0–1.0) relative to the
letterboxed input, but the rescaling math treated them as pixel values.
This caused coordinates like (0.5 - 134px) = -133.5, producing garbage
bounding boxes.

Convert normalized coords to pixel space first (multiply by input
dimensions), then subtract padding and divide by scale.
Log raw detection values and mapped coordinates to diagnose coordinate
order mismatch (inverted boxes suggest y1/x1/y2/x2 order may be wrong).
The Hailo Model Zoo NMS postprocess outputs a per-class tensor of shape
(num_classes, max_det, 5) where each detection is [y_min, x_min, y_max,
x_max, score]. The class ID is implicit from the first dimension index.

The old parser assumed a flat (N, 6) layout with [y1, x1, y2, x2, conf,
class_id] which misaligned the data, producing inverted/out-of-bounds
bounding boxes (e.g. det[0]=4.0 from misaligned class boundaries).

Also adds debug logging for output tensor shapes and per-detection
coordinate mapping to aid further diagnosis.
The Hailo NMS output for YOLO models is a flat buffer of size
num_classes × (1 + max_det × 5). For 80 COCO classes with 100 max
detections this is 40080 floats.

Per-class layout: [count, y1, x1, y2, x2, score, y1, x1, ...].
The count field gives the number of valid detections for that class.

The previous parser expected (num_classes, max_det, 5) which never
matched the actual flat (40080,) shape, returning zero detections.
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="runtimes/edge/models/hailo_model.py">

<violation number="1" location="runtimes/edge/models/hailo_model.py:142">
P1: The `nc=1` fallback can mis-parse valid multi-class NMS buffers as single-class output, leading to incorrect detections.</violation>
</file>


- Harden path traversal check to reject Windows absolute/UNC paths
  (backslash, drive letter) in model_id (server.py)
- Check path segments for ".." instead of raw substring match so
  legitimate paths with ".." in filenames aren't rejected
  (gguf_metadata_cache.py)
- Remove nc=1 fallback in Hailo NMS parser to prevent mis-parsing
  multi-class buffers as single-class (hailo_model.py)
Allow other drone services (vision, comms, flight-control) to request
LLM inference over the Zenoh pub/sub bus instead of HTTP, matching the
IPC pattern used across all other on-drone services.

- Subscribe to local/llm/request for inference requests
- Publish results to local/llm/response
- Publish heartbeat to local/llm/status
- Graceful degradation: logs warning and continues HTTP-only if Zenoh
  socket is unavailable or eclipse-zenoh is not installed
- Configurable via ZENOH_ENDPOINT and ZENOH_ENABLED env vars
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="runtimes/edge/services/zenoh_ipc.py">

<violation number="1" location="runtimes/edge/services/zenoh_ipc.py:152">
P2: Return a generic error string instead of `str(exc)` to avoid exposing internal exception details to IPC clients.

(Based on your team's feedback about avoiding raw exception text in client-facing responses.) [FEEDBACK_USED]</violation>
</file>


The eclipse-zenoh Python API's zenoh.open() returns a Session
synchronously. Using await caused a TypeError at runtime.
The eclipse-zenoh Python package exposes a synchronous API, not async.
Replace all awaited Zenoh calls with synchronous equivalents and switch
the subscriber from an async iterator to a callback pattern using
asyncio.run_coroutine_threadsafe to bridge back into the event loop.
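The callback-to-event-loop bridge described above can be sketched as follows (`handle_request` and the payload are illustrative stand-ins for the Zenoh handler):

```python
import asyncio


async def handle_request(payload: str) -> str:
    # Stand-in for the real LLM request handler.
    return payload.upper()


def make_callback(loop: asyncio.AbstractEventLoop, futures: list):
    def on_sample(payload: str) -> None:
        # Called from Zenoh's own thread: schedule the coroutine on the
        # main event loop instead of awaiting it here.
        fut = asyncio.run_coroutine_threadsafe(handle_request(payload), loop)
        futures.append(fut)  # track so shutdown can await/cancel in-flight work
    return on_sample


async def main() -> str:
    loop = asyncio.get_running_loop()
    futures: list = []
    cb = make_callback(loop, futures)
    # Simulate Zenoh delivering a sample from another thread:
    await asyncio.to_thread(cb, "ping")
    while not futures:
        await asyncio.sleep(0.01)
    return await asyncio.wrap_future(futures[0])


result = asyncio.run(main())
```

Tracking the returned futures also addresses the later review finding about in-flight handlers outliving shutdown.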
get_optimal_device() was called every 30s by the health check, logging
"Using CPU (no GPU acceleration)" each time. Cache the result so
detection and its log messages only run once on startup.
Add edge-runtime service to all Docker workflow stages: build-amd64,
build-arm64, create-manifest, test-compose, security-scan, and the
release retagging workflow. Uses repo root as build context with
runtimes/edge/Dockerfile. No special runners, disk cleanup, or
PyTorch handling needed — it's a lightweight image.
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="runtimes/edge/services/zenoh_ipc.py">

<violation number="1" location="runtimes/edge/services/zenoh_ipc.py:128">
P1: Track and await/cancel Futures returned by `run_coroutine_threadsafe`; otherwise in-flight request handlers can outlive shutdown and fail after the session is closed.</violation>
</file>

<file name=".github/workflows/docker.yml">

<violation number="1" location=".github/workflows/docker.yml:379">
P1: Adding `edge-runtime` makes the current `grep "$SERVICE"` image lookup ambiguous for `runtime`, so the workflow can tag the wrong image and validate the wrong container.</violation>
</file>


rachmlenig and others added 2 commits March 20, 2026 14:42
The lockfile was stale, causing `uv sync --locked` to fail in CI.
Adds an OpenAI-compatible text completions endpoint that accepts a raw
prompt string without applying any chat template. Needed for fine-tuned
models whose GGUF chat template doesn't match their training format.
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="runtimes/edge/routers/completions.py">

<violation number="1" location="runtimes/edge/routers/completions.py:55">
P2: `max_tokens` defaulting uses a falsy check, so an explicit `0` is overwritten to `512` instead of being preserved.</violation>

<violation number="2" location="runtimes/edge/routers/completions.py:63">
P2: `temperature`/`top_p` use falsy-default logic, which overrides explicit numeric values like `0` and changes generation behavior.</violation>
</file>
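Both falsy-default findings share the same fix: test `is None` so explicit zeros survive (defaults here are illustrative, not necessarily the shipped ones):

```python
def resolve_sampling(max_tokens=None, temperature=None, top_p=None):
    """Preserve explicit zeros by testing `is None` instead of falsiness,
    the bug flagged for completions.py."""
    return {
        "max_tokens": 512 if max_tokens is None else max_tokens,
        "temperature": 0.7 if temperature is None else temperature,
        "top_p": 1.0 if top_p is None else top_p,
    }
```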



# (Reconstructed ordering and indentation; the diff captured only changed lines.)
if os.getenv("FORCE_CPU_VISION", "").lower() in ("1", "true", "yes"):
    logger.info("Hailo detection skipped (FORCE_CPU_VISION=1)")
    _use_hailo = False
    return False

try:
    import hailo_platform  # noqa: F401
except ImportError:
    logger.info("hailo_platform not installed, using CPU backend for vision")
    _use_hailo = False
    return False

result = subprocess.run(...)  # lspci -d 1e60: probe; full call elided in the diff
if result.stdout.strip():
    logger.info("Hailo-10H detected, using Hailo backend for vision")
    _use_hailo = True
    return True

# Fallback: check for /dev/hailo0
if os.path.exists("/dev/hailo0"):
    logger.info("Hailo device found at /dev/hailo0, using Hailo backend")
    _use_hailo = True
    return True

logger.info("Hailo not detected, using CPU backend for vision")
_use_hailo = False
return False
@rachmlenig
Contributor Author

Fixes for Qodo Code Review findings

Bugs:

  • Tool stub unpack crash (issue 6): parse_tool_choice stub now returns ("none", None) tuple instead of None, matching the expected 2-tuple interface. Also reformatted all single-line stubs to multi-line for PEP8 compliance.
  • ARM64 tag not validated (issue 5): _get_llamafarm_release_version() now checks the release assets for an ARM64 binary before returning the tag. Falls through to the hardcoded fallback if the latest release has no ARM64 asset.

Rule violations:

  • One-line stub function defs (issue 2): Reformatted all ImportError fallback stubs to proper multi-line def blocks.
  • UV_INDEX_STRATEGY not in .env.example (issue 1): Deferred — documentation-only change.

@rachmlenig
Contributor Author

Fixes for cubic code review findings

Addressed the following issues from the 40-issue review:

- P0: `parse_thinking_response` stub returned a tuple while callers use `.content`/`.thinking` — Fixed: fallback now returns a `_FallbackThinkingResponse` dataclass with proper attributes
- P1: `HistoryCompressor` can be `None` but was called unconditionally — Fixed: added an `if HistoryCompressor is not None` guard
- P1: Path traversal bypass in vision model loading — Fixed: reject `.`/`..` plus a resolved-path containment check in `load_detection_model`
- P1: `gc()` mutated state without holding `_lock` — Fixed: `_gc_loop` now acquires `manager._lock` before calling `gc()`
- P1: `ImmutableSandboxedEnvironment` for Jinja2 templates — Fixed: upgraded from `SandboxedEnvironment`
- P1: Error handler leaked raw exception text — Fixed: catch-all now returns a generic "An internal server error occurred."
- P1: CLIP model concurrent `classify()` race condition — Fixed: `classify()` now uses request-local variables for class names/embeddings instead of shared instance state
- P1: Remote cascade bypassed the confidence threshold — Fixed: remote results now gated on `result.confidence >= session.cascade.confidence_threshold`
- P1: `classification_model` not validated before loading — Fixed: `load_classification_model` validates against known CLIP variants or the HuggingFace repo ID format
- P1: ARM64 binary tag not validated against assets — Fixed: release assets are now checked for an ARM64 binary before using the tag
- P1: Wrong `log_file` kwarg in `setup_logging()` — Fixed: removed the invalid keyword argument
- P1: `parse_tool_choice` stub returned `None` instead of a tuple — Fixed: stub now returns `("none", None)`

@qodo-free-for-open-source-projects
Contributor

CI Feedback 🧐

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: Security Scan (edge-runtime)

Failed stage: Run Trivy vulnerability scanner [❌]

Failed test name: ""

Failure summary:

The action failed during the Trivy container image scan because the image ghcr.io/llama-farm/llamafarm/edge-runtime:latest could not be found or accessed:

- Docker: the image is not present locally (No such image).
- Remote (GHCR): the tag `latest` does not exist for this image (MANIFEST_UNKNOWN when fetching the manifest).
- Containerd/Podman fallbacks also failed due to missing/unauthorized runtimes on the runner (containerd socket permission denied; podman socket not found).

As a result, Trivy exited with code 1 after "unable to find the specified image" / MANIFEST_UNKNOWN.

Relevant error logs:
1:  ##[group]Runner Image Provisioner
2:  Hosted Compute Agent
...

398:  INPUT_TRIVYIGNORES: 
399:  INPUT_GITHUB_PAT: 
400:  INPUT_LIMIT_SEVERITIES_FOR_SARIF: 
401:  TRIVY_CACHE_DIR: /home/runner/work/llamafarm/llamafarm/.cache/trivy
402:  ##[endgroup]
403:  Building SARIF report with all severities
404:  Running Trivy with options: trivy image ghcr.io/llama-farm/llamafarm/edge-runtime:latest
405:  INFO	[vulndb] Need to update DB
406:  INFO	[vulndb] Downloading vulnerability DB...
407:  INFO	[vulndb] Downloading artifact...	repo="mirror.gcr.io/aquasec/trivy-db:2"
408:  88.14 MiB / 88.14 MiB downloaded (100.00%)  INFO	[vulndb] Artifact successfully downloaded	repo="mirror.gcr.io/aquasec/trivy-db:2"
409:  INFO	[vuln] Vulnerability scanning is enabled
410:  INFO	[secret] Secret scanning is enabled
411:  INFO	[secret] If your scanning is slow, please try '--scanners vuln' to disable secret scanning
412:  INFO	[secret] Please see https://trivy.dev/docs/v0.69/guide/scanner/secret#recommendation for faster secret detection
413:  FATAL	Fatal error	run error: image scan error: scan error: unable to initialize a scan service: unable to initialize artifact: unable to initialize container image: unable to find the specified image "ghcr.io/llama-farm/llamafarm/edge-runtime:latest" in ["docker" "containerd" "podman" "remote"]: 4 errors occurred:
414:  * docker error: unable to inspect the image (ghcr.io/llama-farm/llamafarm/edge-runtime:latest): Error response from daemon: No such image: ghcr.io/llama-farm/llamafarm/edge-runtime:latest
415:  * containerd error: failed to list images from containerd client: connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: permission denied"
416:  * podman error: unable to initialize Podman client: no podman socket found: stat /run/user/1001/podman/podman.sock: no such file or directory
417:  * remote error: GET https://ghcr.io/v2/llama-farm/llamafarm/edge-runtime/manifests/latest: MANIFEST_UNKNOWN: manifest unknown
418:  ##[error]Process completed with exit code 1.
419:  ##[group]Run rm -f trivy_envs.txt
