|
54 | 54 | "id": "d8133140",
|
55 | 55 | "metadata": {},
|
56 | 56 | "source": [
|
57 |
| - "We have seen how the object detection models such as [SSD](https://developers.arcgis.com/python/guide/how-ssd-works/), [RetinaNet](https://developers.arcgis.com/python/guide/how-retinanet-works/), and [YOLOv3](https://developers.arcgis.com/python/latest/guide/yolov3-object-detector/) work. However, these models primarily are Convolution Neaural Network (CNN) based architectures. Eversince the development of the revolutionary [Transformer](https://arxiv.org/pdf/1706.03762) architecture in the language domain, researchers had been trying to incorporate the core idea of the transformers - the self attention, to vision domain. This led to the development of [Vision Transformer (ViT)](https://arxiv.org/pdf/2010.11929) model, which used a fully transformer based architecture for the first time to classify images and outperformed it's contemporary CNN based architectures.\n", |
| 57 | + "We have seen how the object detection models such as [SSD](https://developers.arcgis.com/python/guide/how-ssd-works/), [RetinaNet](https://developers.arcgis.com/python/guide/how-retinanet-works/), and [YOLOv3](https://developers.arcgis.com/python/latest/guide/yolov3-object-detector/) work. However, these models primarily are Convolution Neaural Network (CNN) based architectures. Eversince the development of the revolutionary [Transformer](https://arxiv.org/pdf/1706.03762) architecture in the language domain, researchers had been trying to incorporate the core idea of the transformers - the self attention, to vision domain. \n", |
58 | 58 | "\n",
|
59 |
| - "Subsequently, the Vision Transformer architecture inspired the development of a fully transformer based Object Detection model called the [Detection Transformer (DETR)](https://arxiv.org/pdf/2005.12872). The DETR model however, did not beat the popular YOLO family models in terms of both accuracy and speed, but it was a beginning in the direction of fully transformer based Object detection models. The idea of self attention was attractive because it gave global context to the model from the very first layer, however being an O(n<sup>2</sup>) complexity algoithm meant high compute time which, unlike the YOLO family, rendered DETR (much like other Vision Transformer based architectures) slow and unable to process images at real-time. \n", |
| 59 | + "Subsequently, this inspired the development of a CNN-transformer hybrid Object Detection model called the [Detection Transformer (DETR)](https://arxiv.org/pdf/2005.12872). The DETR extracted it's features through a Convolutional backbone and applied a transformer on top of those features to directly predict a set of detections. The DETR model however, did not beat the popular YOLO family models in terms of both accuracy and speed, but it was a beginning in the direction of fully transformer based Object detection models. The idea of self attention was attractive because it gave global context to the model from the very first layer, however being an O(n<sup>2</sup>) complexity algoithm meant high compute time which, unlike the YOLO family, rendered DETR (much like other Vision Transformer based architectures) slow and unable to process images at real-time. \n", |
60 | 60 | "\n",
|
61 | 61 | "With the recent development of the [Real time DETR (RT-DETR)](https://arxiv.org/pdf/2304.08069) and its successor, [RT-DETRv2](https://arxiv.org/pdf/2407.17140), DETRs have finally outperformed YOLO family in both speed and accuracy. In this guide, we will learn more about RT-DETRv2 and how we can use it for your own tasks using `arcgis.learn`."
|
62 | 62 | ]
|
|
91 | 91 | "id": "87b3984c",
|
92 | 92 | "metadata": {},
|
93 | 93 | "source": [
|
94 |
| - "The YOLO series had been the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However,the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. The Real-Time DEtection TRansformer (RT-DETR) became the first end-to-end real-time object detector to addresses this issue by removing the need for NMS and making the architecture computationally efficient, outperforms YOLO series in both speed and accuracy. It's architecture consists of three main components: a hybrid encoder, a query selection mechanism, and a Transformer decoder.\n", |
| 94 | + "The YOLO series had been the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However,the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. The Real-Time DEtection TRansformer (RT-DETR) became the first end-to-end real-time object detector to addresses this issue by removing the need for NMS and making the architecture computationally efficient, outperforms YOLO series in both speed and accuracy. It's architecture (Figure 2) consists of three main components: a hybrid encoder, a query selection mechanism, and a Transformer decoder.\n", |
95 | 95 | "\n",
|
96 |
| - "RT-DETR employs a **hybrid encoder** to extract rich, multi-scale features while maintaining real-time efficiency. Unlike traditional Transformer-based architectures that process features across scales simultaneously (leading to high computation costs), RT-DETR decouples intra-scale interactions and cross-scale fusion. The intra-scale interaction step refines features at each level independently, while the cross-scale fusion step aggregates information across different scales efficiently. This design allows for high-quality feature representation with reduced computational overhead.\n", |
| 96 | + "RT-DETR employs a **hybrid encoder** to extract rich, multi-scale features while maintaining real-time efficiency. Unlike traditional Transformer-based architectures that process features across scales simultaneously (leading to high computation costs), RT-DETR decouples intra-scale interactions and cross-scale fusion (Figure 2). Specifically, The features from the last three stages of a CNN backbone are fed into the encoder. The efficient hybrid encoder transforms multi-scale features into a sequence of image features through the Attention-based Intra-scale Feature Interaction (AIFI) and the CNN-based Cross-scale Feature Fusion (CCFF). The intra-scale interaction step refines features at each level independently, while the cross-scale fusion step aggregates information across different scales efficiently. This design allows for high-quality feature representation with reduced computational overhead.\n", |
97 | 97 | "\n",
|
98 | 98 | "A critical innovation in RT-DETR is its **uncertainty-minimal query selection** mechanism. Instead of generating random object queries, RT-DETR selects high-quality initial queries based on an uncertainty metric, ensuring that the most relevant object representations are passed to the decoder. This step improves detection accuracy and convergence speed.\n",
|
99 | 99 | "\n",
|
|
104 | 104 | },
|
105 | 105 | {
|
106 | 106 | "cell_type": "markdown",
|
107 |
| - "id": "689e52aa", |
| 107 | + "id": "601a868f", |
108 | 108 | "metadata": {},
|
109 | 109 | "source": [
|
110 |
| - "<center><img src=\"../../static/img/rtdetr_architecture.png\" /></center>\n", |
111 |
| - "<center>Figure 2. RT-DETR Architecture </center>" |
| 110 | + "<div style=\"text-align: center;\">\n", |
| 111 | + " <img src=\"../../static/img/rtdetr_architecture.png\" style=\"width:70%;\" />\n", |
| 112 | + " <div style=\"max-width: 1000px; text-align: justify; margin: auto;\">\n", |
| 113 | + " <strong>Figure 2.</strong> Overview of RT-DETR Architecture. The features from the last three stages of a CNN backbone are fed into the encoder. \n", |
| 114 | + " The efficient hybrid encoder transforms multi-scale features into a sequence of image features through the Attention-based Intra-scale Feature \n", |
| 115 | + " Interaction (AIFI) and the CNN-based Cross-scale Feature Fusion (CCFF). Then, the uncertainty-minimal query selection selects a fixed number \n", |
| 116 | + " of encoder features to serve as initial object queries for the decoder. Finally, the decoder with auxiliary prediction heads iteratively optimizes \n", |
| 117 | + " object queries to generate categories and boxes.\n", |
| 118 | + " </div>\n", |
| 119 | + "</div>\n" |
112 | 120 | ]
|
113 | 121 | },
|
114 | 122 | {
|
|