IMX500 docs #3860
Merged
Commits (7):
- cebfc6b AI-Camera initial draft (naushir)
- c08575b Copy edit initial ai camera documentation (nathan-contino)
- fecad7a AI camera documentation updates (davidplowman)
- db10474 Copy edit Intro (nathan-contino)
- 8d2c994 Make picamera2 prerequisite introduction sequence consistent (nathan-contino)
- 25af7ef Apply suggestions from code review (nathan-contino)
- be30b94 Update documentation/asciidoc/accessories/ai-camera/model-conversion.… (nathan-contino)
@@ -0,0 +1,7 @@

include::ai-camera/about.adoc[]

include::ai-camera/getting-started.adoc[]

include::ai-camera/details.adoc[]

include::ai-camera/model-conversion.adoc[]
@@ -0,0 +1,9 @@

[[ai-camera]]
== About

The Raspberry Pi AI Camera uses the Sony IMX500 imaging sensor to provide low-latency, high-performance AI capabilities to any camera application. The tight integration with https://www.raspberrypi.com/documentation/computers/camera_software.adoc[Raspberry Pi's camera software stack] allows users to deploy their own neural network models with minimal effort.

image::images/ai-camera.png[The Raspberry Pi AI Camera]

This section demonstrates how to run either a pre-packaged or custom neural network model on the camera. It also includes the steps required to interpret inference data generated by neural networks running on the IMX500 in https://github.com/raspberrypi/rpicam-apps[`rpicam-apps`] and https://github.com/raspberrypi/picamera2[Picamera2].
documentation/asciidoc/accessories/ai-camera/details.adoc (270 additions & 0 deletions)

@@ -0,0 +1,270 @@

== Under the Hood

=== Overview

The Raspberry Pi AI Camera works rather differently to more traditional AI-based camera image processing systems, as shown in the diagram below.

image::images/imx500-comparison.svg[Traditional versus IMX500 AI camera systems]

On the left is a diagram of a more traditional AI camera system. Here, the camera delivers only images to the Raspberry Pi. The Raspberry Pi processes the images and then performs the AI inferencing itself. This may use an optional external AI accelerator, as shown, or it may happen (more slowly) on the CPU.

On the right is the IMX500-based system. The camera module contains a small ISP which turns the raw camera image data into an _input tensor_ that is fed directly to the AI accelerator within the camera. The accelerator in turn produces an _output tensor_, containing the inferencing results, which is fed back to the Raspberry Pi itself. There is no need for an external accelerator, nor for the Raspberry Pi to run neural network software on its CPU.

Some concepts that may be helpful to understand include:

==== The _Input Tensor_

This is the part of the sensor image that is passed to the AI engine for inferencing. It is produced by a small on-board ISP, which also crops and scales the camera image to the dimensions expected by the neural network that has been loaded. The input tensor is not normally made available to applications, though it is possible to access it for debugging purposes.

==== The _Region of Interest_

The Region of Interest (or _ROI_) specifies exactly which part of the sensor image is cropped out before being rescaled to the size demanded by the neural network. It can be queried and set by an application. The units used are always pixels in the full resolution sensor output.

By default, the ROI is set to the full image received from the sensor, so nothing is actually cropped out.
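
For example, using the Picamera2 helper class described later in this section, an application could restrict inferencing to a central crop of the sensor image. A minimal sketch, with illustrative coordinates and model path:

[source,python]
----
from picamera2.devices.imx500 import IMX500

imx500 = IMX500("/path/to/network.rpk")  # illustrative model path

# Crop a 2028x1520 region at offset (1006, 756), expressed in
# full-resolution sensor pixels, before the on-board ISP rescales it
# to the network's input tensor size.
imx500.set_inference_roi_abs((1006, 756, 2028, 1520))
----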

==== The _Output Tensor_

These are the results of inferencing performed by the neural network. The precise number and shape of the outputs will depend on the neural network, and application code will need to understand how to handle them.
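
For instance, a simple classification network might produce a single vector of per-class scores. A sketch of how an application could interpret such an output (the `output_tensor` variable is illustrative):

[source,python]
----
import numpy as np

# output_tensor: the flat array of floats delivered for one frame,
# assumed here to come from a simple classification network.
scores = np.array(output_tensor, dtype=np.float32)
top = int(np.argmax(scores))
print(f"Best class index: {top}, score: {scores[top]:.2f}")
----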

=== System Architecture

The diagram below shows the various camera software components (in green) used during our imaging/inference use case with the Raspberry Pi AI Camera module hardware (in red).

image::images/imx500-block-diagram.svg[IMX500 block diagram]

At startup, the IMX500 sensor module loads firmware to run a particular neural network model. During streaming, the IMX500 generates _both_ an image stream and an inference stream. This inference stream holds the inputs and outputs of the neural network model, also known as input/output **tensors**.

=== Device drivers

At the lowest level, the IMX500 sensor kernel driver configures the camera module over the I2C bus. The CSI2 driver (`CFE` on Pi 5, `Unicam` on all other Pi platforms) sets up the receiver to write the image data stream into a frame buffer, and the embedded data and inference data streams into another buffer in memory.

Firmware files are also transferred over the I2C bus. On most devices this uses the standard I2C protocol, but Raspberry Pi 5 uses a custom high-speed protocol. Because the transfer goes through the RP2040 microcontroller, the RP2040 SPI driver in the kernel handles it: the microcontroller bridges the I2C transfers from the kernel to the IMX500 over a SPI bus. Additionally, the RP2040 caches firmware files in on-board storage. This avoids the need to transfer entire firmware blobs over the I2C bus, significantly speeding up loading of firmware you have already used.

=== `libcamera`

Once `libcamera` dequeues the image and inference data buffers from the kernel, the IMX500-specific `cam-helper` library (part of the Raspberry Pi IPA within `libcamera`) parses the inference buffer to access the input/output tensors. These tensors are packaged as Raspberry Pi vendor-specific https://libcamera.org/api-html/namespacelibcamera_1_1controls.html[`libcamera` controls]. `libcamera` returns the following controls:

[%header,cols="a,a"]
|===
| Control
| Description

| `CnnOutputTensor`
| Floating point array storing the output tensor.

| `CnnInputTensor`
| Floating point array storing the input tensor.

| `CnnOutputTensorInfo`
| Network-specific parameters describing the structure of the output tensors:

[source,c]
----
struct OutputTensorInfo {
    uint32_t tensorDataNum;
    uint32_t numDimensions;
    uint16_t size[MaxNumDimensions];
};

struct CnnOutputTensorInfo {
    char networkName[NetworkNameLen];
    uint32_t numTensors;
    OutputTensorInfo info[MaxNumTensors];
};
----

| `CnnInputTensorInfo`
| Network-specific parameters describing the structure of the input tensor:

[source,c]
----
struct CnnInputTensorInfo {
    char networkName[NetworkNameLen];
    uint32_t width;
    uint32_t height;
    uint32_t numChannels;
};
----

|===
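
As a sketch of how these controls surface to applications, Picamera2 reports vendor controls through each request's metadata. Assuming a model has already been loaded, the output tensor could be read back like this (the model path is illustrative):

[source,python]
----
import numpy as np
from picamera2 import Picamera2
from picamera2.devices.imx500 import IMX500

imx500 = IMX500("/path/to/network.rpk")  # illustrative model path
picam2 = Picamera2()
picam2.start()

# The control name matches the table above; it is only present when a
# network is loaded and producing results.
metadata = picam2.capture_metadata()
output = metadata.get("CnnOutputTensor")
if output is not None:
    tensor = np.array(output, dtype=np.float32)
    print(f"Output tensor with {tensor.size} elements")
----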

=== `rpicam-apps`

`rpicam-apps` provides an IMX500 post-processing stage base class that implements helpers for IMX500 post-processing stages: https://github.com/raspberrypi/rpicam-apps/blob/post_processing_stages/imx500_post_processing_stage.hpp[`IMX500PostProcessingStage`]. Use this base class to derive a new post-processing stage for any neural network model running on the IMX500. For an example, see https://github.com/raspberrypi/rpicam-apps/blob/post_processing_stages/imx500_mobilenet_ssd.cpp[`imx500_mobilenet_ssd.cpp`]:

[source,cpp]
----
class ObjectInference : public IMX500PostProcessingStage
{
public:
    ObjectInference(RPiCamApp *app) : IMX500PostProcessingStage(app) {}

    char const *Name() const override;

    void Read(boost::property_tree::ptree const &params) override;

    void Configure() override;

    bool Process(CompletedRequestPtr &completed_request) override;
};
----

For every frame received by the application, the `Process()` function is called (`ObjectInference::Process()` in the above case). In this function, you can extract the output tensor for further processing or analysis:

[source,cpp]
----
auto output = completed_request->metadata.get(controls::rpi::CnnOutputTensor);
if (!output)
{
    LOG_ERROR("No output tensor found in metadata!");
    return false;
}

std::vector<float> output_tensor(output->data(), output->data() + output->size());
----

Once processing is complete, the final results can either be visualised or saved in metadata, to be consumed by another downstream stage or by the top-level application itself. In the object inference case:

[source,cpp]
----
if (objects.size())
    completed_request->post_process_metadata.Set("object_detect.results", objects);
----

The `object_detect_draw_cv` post-processing stage running downstream fetches these results from the metadata and draws the bounding boxes onto the image in the `ObjectDetectDrawCvStage::Process()` function:

[source,cpp]
----
std::vector<Detection> detections;
completed_request->post_process_metadata.Get("object_detect.results", detections);
----

The following table contains a full list of helper functions provided by `IMX500PostProcessingStage`:

[%header,cols="a,a"]
|===
| Function
| Description

| `Read()`
| Typically called from `<Derived Class>::Read()`, this function reads the config parameters for input tensor parsing and saving.

This function also reads the neural network model file string (`"network_file"`) and sets up the firmware to be loaded on camera open.

| `Process()`
| Typically called from `<Derived Class>::Process()`, this function processes and saves the input tensor to a file if required by the JSON config file.

| `SetInferenceRoiAbs()`
| Sets an absolute region of interest (ROI) crop rectangle on the sensor image to use for inferencing on the IMX500.

| `SetInferenceRoiAuto()`
| Automatically calculates the region of interest (ROI) crop rectangle on the sensor image to preserve the input tensor aspect ratio for a given neural network.

| `ShowFwProgressBar()`
| Displays a progress bar on the console showing the progress of the neural network firmware upload to the IMX500.

| `ConvertInferenceCoordinates()`
| Converts coordinates from the input tensor coordinate space to the final ISP output image space.

A number of scaling, cropping, and translation operations occur between the original sensor image and the fully processed ISP output image. This function converts coordinates provided by the output tensor to the equivalent coordinates after these operations.

|===

=== Picamera2

IMX500 integration in Picamera2 is very similar to what is available in `rpicam-apps`. Picamera2 has an IMX500 helper class that provides the same functionality as the `rpicam-apps` `IMX500PostProcessingStage` base class. Import it into any Python script with:

[source,python]
----
from picamera2.devices.imx500 import IMX500

# This must be called before instantiation of Picamera2
imx500 = IMX500(model_file)
----

To retrieve the output tensors, fetch them from the controls. You can then apply additional processing in your Python script.
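
For instance, a minimal sketch using the `IMX500.get_outputs()` helper documented at the end of this section (the model path is illustrative):

[source,python]
----
from picamera2 import Picamera2
from picamera2.devices.imx500 import IMX500

imx500 = IMX500("/path/to/network.rpk")  # illustrative model path
picam2 = Picamera2()
picam2.start()

# Fetch the output tensors for one frame from the image metadata;
# returns one array per output tensor, or None if no inference ran.
metadata = picam2.capture_metadata()
outputs = imx500.get_outputs(metadata)
----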

For example, in an object inference use case such as https://github.com/raspberrypi/picamera2/tree/main/examples/imx500/imx500_object_detection_demo.py[imx500_object_detection_demo.py], the object bounding boxes and confidence values are extracted in `parse_detections()`, and the boxes are drawn onto the image in `draw_detections()`:

[source,python]
----
class Detection:
    def __init__(self, coords, category, conf, metadata):
        """Create a Detection object, recording the bounding box, category and confidence."""
        self.category = category
        self.conf = conf
        obj_scaled = imx500.convert_inference_coords(coords, metadata, picam2)
        self.box = (obj_scaled.x, obj_scaled.y, obj_scaled.width, obj_scaled.height)

def draw_detections(request, detections, stream="main"):
    """Draw the detections for this request onto the ISP output."""
    labels = get_labels()
    with MappedArray(request, stream) as m:
        for detection in detections:
            x, y, w, h = detection.box
            label = f"{labels[int(detection.category)]} ({detection.conf:.2f})"
            cv2.putText(m.array, label, (x + 5, y + 15), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
            cv2.rectangle(m.array, (x, y), (x + w, y + h), (0, 0, 255, 0))
        if args.preserve_aspect_ratio:
            b = imx500.get_roi_scaled(request)
            cv2.putText(m.array, "ROI", (b.x + 5, b.y + 15), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)
            cv2.rectangle(m.array, (b.x, b.y), (b.x + b.width, b.y + b.height), (255, 0, 0, 0))

def parse_detections(request, stream='main'):
    """Parse the output tensor into a number of detected objects, scaled to the ISP output."""
    metadata = request.get_metadata()
    outputs = imx500.get_outputs(metadata)
    boxes, scores, classes = outputs[0][0], outputs[1][0], outputs[2][0]
    detections = [Detection(box, category, score, metadata)
                  for box, score, category in zip(boxes, scores, classes) if score > threshold]
    draw_detections(request, detections, stream)
----

Unlike the `rpicam-apps` example, this example applies no additional hysteresis or temporal filtering.
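
As an illustration of the kind of temporal filtering the `rpicam-apps` stage applies, detections could be smoothed across frames along these lines (this helper is hypothetical, not part of Picamera2):

[source,python]
----
from collections import defaultdict

# Hypothetical smoothing helper: only report a category once it has been
# detected in `min_frames` consecutive frames, suppressing one-frame flickers.
seen_counts = defaultdict(int)

def filter_detections(detections, min_frames=3):
    current = {int(d.category) for d in detections}
    for category in list(seen_counts):
        if category not in current:
            seen_counts[category] = 0
    for category in current:
        seen_counts[category] += 1
    return [d for d in detections if seen_counts[int(d.category)] >= min_frames]
----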

The `IMX500` class in Picamera2 provides the following helper functions:

[%header,cols="a,a"]
|===
| Function
| Description

| `IMX500.get_full_sensor_resolution()`
| Returns the full sensor resolution of the IMX500.

| `IMX500.config`
| Returns a dictionary of the neural network configuration.

| `IMX500.convert_inference_coords(coords, metadata, picamera2)`
| Converts the coordinates _coords_ from the input tensor coordinate space to the final ISP output image space. Must be passed Picamera2's image metadata for the image, and the Picamera2 object.

A number of scaling, cropping, and translation operations occur between the original sensor image and the fully processed ISP output image. This function converts coordinates provided by the output tensor to the equivalent coordinates after these operations.

| `IMX500.show_network_fw_progress_bar()`
| Displays a progress bar on the console showing the progress of the neural network firmware upload to the IMX500.

| `IMX500.get_roi_scaled(request)`
| Returns the region of interest (ROI) in the ISP output image coordinate space.

| `IMX500.get_isp_output_size(picamera2)`
| Returns the ISP output image size.

| `IMX500.get_input_size()`
| Returns the input tensor size based on the neural network model used.

| `IMX500.get_outputs(metadata)`
| Returns the output tensors from the Picamera2 image metadata.

| `IMX500.get_output_shapes(metadata)`
| Returns the shapes of the output tensors from the Picamera2 image metadata for the neural network model used.

| `IMX500.set_inference_roi_abs(rectangle)`
| Sets the region of interest (ROI) crop rectangle which determines which part of the sensor image is converted to the input tensor used for inferencing on the IMX500. The region of interest should be specified in units of pixels at the full sensor resolution, as an `(x_offset, y_offset, width, height)` tuple.

| `IMX500.set_inference_aspect_ratio(aspect_ratio)`
| Automatically calculates the region of interest (ROI) crop rectangle on the sensor image to preserve the given aspect ratio. To make the ROI aspect ratio exactly match the input tensor for this network, use `imx500.set_inference_aspect_ratio(imx500.get_input_size())`.

| `IMX500.get_kpi_info(metadata)`
| Returns the frame-level performance indicators logged by the IMX500 for the given image metadata.

|===
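
Putting a few of these helpers together, a sketch of a typical startup sequence might look like this (the model path is illustrative):

[source,python]
----
from picamera2 import Picamera2
from picamera2.devices.imx500 import IMX500

# The IMX500 object must be created before Picamera2 is instantiated.
imx500 = IMX500("/path/to/network.rpk")
imx500.show_network_fw_progress_bar()  # report firmware upload progress

picam2 = Picamera2()
# Preserve the network's input tensor aspect ratio when cropping the ROI.
imx500.set_inference_aspect_ratio(imx500.get_input_size())
picam2.start()

# Inspect the frame-level performance indicators for one frame.
metadata = picam2.capture_metadata()
print(imx500.get_kpi_info(metadata))
----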