
Commit 590aa75 — "post" (1 parent: e945ea8)

File tree: 7 files changed (+311 −5 lines)

_includes/footer.html — 5 additions & 5 deletions

@@ -1,9 +1,9 @@
  <footer class="mt-auto py-3 text-center">
-
-   <small class="text-muted mb-2">
-     <i class="fas fa-code"></i> by <strong>{{ site.author.name }} &copy;</strong>
+   <small class="text-muted mb-2 d-block">
+     2025 &copy; (and <i class="fas fa-code"></i> by) <strong>{{ site.author.name }}</strong>. Under the <a href="https://github.com/agarnung/agarnung.github.io/blob/main/LICENSE" target="_blank">MIT License</a>.
    </small>

-   {% include social.html %}
+   <div class="mt-2">
+     {% include social.html %}
+   </div>
  </footer>
New file (the post below) — 306 additions & 0 deletions

@@ -0,0 +1,306 @@

---
title: Fast 3D processing
tags: [programming, C++]
style: fill
color: danger
description: Pushing the limits of fast 3D applications
---

# Introduction

In the field of analysis, inspection, metrology... whether 2D or 3D, it is crucial to implement computationally efficient systems that can operate in real time without introducing large latencies that accumulate and cause delays and desynchronizations in systems where response time is critical.

Creating efficient applications in this regard depends not only on the hardware and the speed of our CPU, the frequency of our RAM, or GPU capacity, but also on how we design our software. In this regard, let's imagine a very typical situation in real-world applications: we want to perform a 3D transformation on a point cloud. Programming in C++, what tools/libraries can and should we use? In this post, we will experimentally explore this question.

For the curious, this is my machine:
```bash
$ uname -a
```
```text
Linux pop-os 6.9.3-76060903-generic #202405300957~1732141768~22.04~f2697e1 SMP PREEMPT_DYNAMIC Wed N x86_64 x86_64 x86_64 GNU/Linux
```
```bash
$ lscpu | grep -E 'Architecture|Cache|CPU|Model|NUMA|cache|L1|L2|L3'
```
```text
CPU operating mode(s): 32-bit, 64-bit
CPU(s): 20
Online CPU(s) list: 0-19
CPU family: 6
Model: 154
Max CPU MHz: 4700.0000
Min CPU MHz: 400.0000
L1d Cache: 544 KiB (14 instances)
L1i Cache: 704 KiB (14 instances)
L2 Cache: 11.5 MiB (8 instances)
L3 Cache: 24 MiB (1 instance)
NUMA node(s): 1
NUMA node 0 CPU(s): 0-19
L1tf vulnerability: Not affected
```
```bash
$ g++ --version
```
```text
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

# Hands-on code

## 1. OpenCV direct product

First, we will test a regular matrix multiplication of ```cv::Mat``` directly as a matrix product.
```cpp
auto transform_opencv = [=](const cv::Mat& cloud, const cv::Mat& transform, cv::Mat& result) {
    result = transform * cloud;
};
```
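
Note that, for the direct product, the cloud must already be a 4xN single-channel matrix in homogeneous coordinates (a 1xN ```CV_64FC3``` cloud cannot be multiplied by the 4x4 transform). As a minimal usage sketch, using the ```cloud_homogeneous``` and ```transform``` matrices built in the initialization section below:
```cpp
// The product allocates the output itself: result_direct ends up as a 4 x num_points CV_64F matrix,
// whose first three rows hold the transformed X, Y, Z coordinates.
cv::Mat result_direct;
transform_opencv(cloud_homogeneous, transform, result_direct);
```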

## 2. OpenCV by pointers I

We will apply a transformation using OpenCV "manually" via pointers.
```cpp
auto transform_opencv_pointers_double = [=](const cv::Mat& cloud, const cv::Mat& transform, cv::Mat& result) {
    const double* cloudPtr = cloud.ptr<double>(0);
    const double* transformPtr = transform.ptr<double>(0);
    double* resultPtr = result.ptr<double>(0);
    for (int i = 0; i < cloud.cols; ++i)
    {
        double x = cloudPtr[i * 3 + 0];
        double y = cloudPtr[i * 3 + 1];
        double z = cloudPtr[i * 3 + 2];

        resultPtr[i * 3 + 0] = transformPtr[0] * x + transformPtr[1] * y + transformPtr[2] * z + transformPtr[3];
        resultPtr[i * 3 + 1] = transformPtr[4] * x + transformPtr[5] * y + transformPtr[6] * z + transformPtr[7];
        resultPtr[i * 3 + 2] = transformPtr[8] * x + transformPtr[9] * y + transformPtr[10] * z + transformPtr[11];
    }
};
```
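
Unlike the direct product, the pointer-based versions write into ```result``` in place, so the output must be allocated with the same size and type as the cloud before the lambda is called. A minimal sketch, reusing the ```cloud```, ```transform``` and ```num_points``` variables from the initialization section below:
```cpp
// Pre-allocate the output with the same layout as the input cloud (1 x num_points, CV_64FC3)
cv::Mat result_pointers(1, num_points, CV_64FC3);
transform_opencv_pointers_double(cloud, transform, result_pointers);
```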

## 3. OpenCV by pointers II

Instead of raw ```double``` pointers, we will use pointers to ```cv::Vec3d```.
```cpp
auto transform_opencv_pointers_Vec3d = [=](const cv::Mat& cloud, const cv::Mat& transform, cv::Mat& result) {
    const cv::Vec3d* cloudPtr = cloud.ptr<cv::Vec3d>(0);
    const double* transformPtr = transform.ptr<double>(0);
    cv::Vec3d* resultPtr = result.ptr<cv::Vec3d>(0);
    for (int i = 0; i < cloud.cols; ++i)
    {
        const cv::Vec3d& point = cloudPtr[i];

        double x = point[0];
        double y = point[1];
        double z = point[2];

        resultPtr[i][0] = transformPtr[0] * x + transformPtr[1] * y + transformPtr[2] * z + transformPtr[3];
        resultPtr[i][1] = transformPtr[4] * x + transformPtr[5] * y + transformPtr[6] * z + transformPtr[7];
        resultPtr[i][2] = transformPtr[8] * x + transformPtr[9] * y + transformPtr[10] * z + transformPtr[11];
    }
};
```

## 4. Eigen

We will use the widely recognized Eigen library to perform the transformation directly as a matrix product.
```cpp
auto transform_eigen = [=](const Eigen::MatrixXd& cloud, const Eigen::Matrix4d& transform, Eigen::MatrixXd& result) {
    result = transform * cloud;
};
```

Each tool has its own formats... Eigen requires its own matrix representation, so the cloud and the transform are converted first:
```cpp
Eigen::MatrixXd cloud_eigen(4, num_points);
Eigen::Matrix4d transformation_eigen;
for (int i = 0; i < num_points; ++i) {
    cloud_eigen(0, i) = cloud.at<double>(0, i * 3 + 0);
    cloud_eigen(1, i) = cloud.at<double>(0, i * 3 + 1);
    cloud_eigen(2, i) = cloud.at<double>(0, i * 3 + 2);
    cloud_eigen(3, i) = 1.0;
}

transformation_eigen << transform.at<double>(0, 0), transform.at<double>(0, 1), transform.at<double>(0, 2), transform.at<double>(0, 3),
                        transform.at<double>(1, 0), transform.at<double>(1, 1), transform.at<double>(1, 2), transform.at<double>(1, 3),
                        transform.at<double>(2, 0), transform.at<double>(2, 1), transform.at<double>(2, 2), transform.at<double>(2, 3),
                        transform.at<double>(3, 0), transform.at<double>(3, 1), transform.at<double>(3, 2), transform.at<double>(3, 3);
```
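
As a side note, the element-by-element copy above could be avoided by mapping the existing OpenCV buffer directly; the following is only a sketch, assuming the 4xN homogeneous cloud from the initialization section below is continuous in memory (row-major, ```CV_64F```):
```cpp
// Wrap the cv::Mat buffer without copying. cv::Mat stores data row-major, while Eigen
// defaults to column-major, so the mapped type must be declared row-major explicitly.
using RowMajorMatrixXd = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
Eigen::Map<const RowMajorMatrixXd> cloud_map(cloud_homogeneous.ptr<double>(0), 4, cloud_homogeneous.cols);

Eigen::MatrixXd result_eigen_map = transformation_eigen * cloud_map; // same product, no conversion copy
```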

## 5. Parallel transform

We will use OpenMP to parallelize the transformation loop across CPU threads.
```cpp
auto transform_parallel = [=](const cv::Mat& cloud, const cv::Mat& transform, cv::Mat& result) {
    const double* cloudPtr = cloud.ptr<double>(0);
    const double* transformPtr = transform.ptr<double>(0);
    double* resultPtr = result.ptr<double>(0);
    #pragma omp parallel for
    for (int i = 0; i < cloud.cols; ++i)
    {
        double x = cloudPtr[i * 3 + 0];
        double y = cloudPtr[i * 3 + 1];
        double z = cloudPtr[i * 3 + 2];

        resultPtr[i * 3 + 0] = transformPtr[0] * x + transformPtr[1] * y + transformPtr[2] * z + transformPtr[3];
        resultPtr[i * 3 + 1] = transformPtr[4] * x + transformPtr[5] * y + transformPtr[6] * z + transformPtr[7];
        resultPtr[i * 3 + 2] = transformPtr[8] * x + transformPtr[9] * y + transformPtr[10] * z + transformPtr[11];
    }
};
```
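
Remember that the ```#pragma omp parallel for``` above only has an effect if OpenMP is enabled at compile time (with g++ that means building with the -fopenmp flag); otherwise the pragma is silently ignored and the loop runs sequentially. The thread count can also be set explicitly; a small sketch, where the value 8 is just an illustrative choice:
```cpp
// Requires <omp.h> and compiling/linking with -fopenmp
omp_set_num_threads(8); // illustrative; by default OpenMP typically uses all available hardware threads
std::cout << "OpenMP will use up to " << omp_get_max_threads() << " threads" << std::endl;
```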

## 6. SIMD

We will program our own SIMD "kernel" using AVX intrinsics (256-bit registers, i.e. 4 doubles per instruction) to process multiple points at a time, paying special attention to the data size:
```cpp
auto transformWithSIMD = [=](const double* cloud, const double* transform, double* result, int num_points) {
    int num_simd_points = (num_points / 4) * 4; // Points handled with SIMD, multiple of 4

    // Broadcast each transformation matrix element into its own SIMD register
    __m256d t0 = _mm256_set1_pd(transform[0]);
    __m256d t1 = _mm256_set1_pd(transform[1]);
    __m256d t2 = _mm256_set1_pd(transform[2]);
    __m256d t3 = _mm256_set1_pd(transform[3]);

    __m256d t4 = _mm256_set1_pd(transform[4]);
    __m256d t5 = _mm256_set1_pd(transform[5]);
    __m256d t6 = _mm256_set1_pd(transform[6]);
    __m256d t7 = _mm256_set1_pd(transform[7]);

    __m256d t8 = _mm256_set1_pd(transform[8]);
    __m256d t9 = _mm256_set1_pd(transform[9]);
    __m256d t10 = _mm256_set1_pd(transform[10]);
    __m256d t11 = _mm256_set1_pd(transform[11]);

    // Process blocks of 4 points at a time
    for (int i = 0; i < num_simd_points; i += 4) {
        // Load X, Y, and Z coordinates of 4 points
        __m256d x = _mm256_set_pd(cloud[(i + 3) * 3], cloud[(i + 2) * 3], cloud[(i + 1) * 3], cloud[i * 3]);
        __m256d y = _mm256_set_pd(cloud[(i + 3) * 3 + 1], cloud[(i + 2) * 3 + 1], cloud[(i + 1) * 3 + 1], cloud[i * 3 + 1]);
        __m256d z = _mm256_set_pd(cloud[(i + 3) * 3 + 2], cloud[(i + 2) * 3 + 2], cloud[(i + 1) * 3 + 2], cloud[i * 3 + 2]);

        // Perform the transformation
        __m256d res_x = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(t0, x), _mm256_mul_pd(t1, y)),
                                      _mm256_add_pd(_mm256_mul_pd(t2, z), t3));
        __m256d res_y = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(t4, x), _mm256_mul_pd(t5, y)),
                                      _mm256_add_pd(_mm256_mul_pd(t6, z), t7));
        __m256d res_z = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(t8, x), _mm256_mul_pd(t9, y)),
                                      _mm256_add_pd(_mm256_mul_pd(t10, z), t11));

        // Store the results interleaved (X1, Y1, Z1, X2, Y2, Z2, ...)
        alignas(32) double temp_x[4], temp_y[4], temp_z[4]; // 32-byte alignment required by _mm256_store_pd
        _mm256_store_pd(temp_x, res_x);
        _mm256_store_pd(temp_y, res_y);
        _mm256_store_pd(temp_z, res_z);

        for (int j = 0; j < 4; ++j) {
            result[(i + j) * 3] = temp_x[j];
            result[(i + j) * 3 + 1] = temp_y[j];
            result[(i + j) * 3 + 2] = temp_z[j];
        }
    }

    // Handle remaining points (if num_points is not a multiple of 4)
    for (int i = num_simd_points; i < num_points; ++i) {
        int base = i * 3;
        double x = cloud[base];
        double y = cloud[base + 1];
        double z = cloud[base + 2];

        result[base] = transform[0] * x + transform[1] * y + transform[2] * z + transform[3];
        result[base + 1] = transform[4] * x + transform[5] * y + transform[6] * z + transform[7];
        result[base + 2] = transform[8] * x + transform[9] * y + transform[10] * z + transform[11];
    }
};
```
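
As an aside, each multiply-add chain in the kernel could also be expressed with fused multiply-add intrinsics, which compute ```a*b + c``` in a single instruction. This is only a sketch of the X component and assumes the CPU and build flags support FMA (e.g. -mfma):
```cpp
// res_x = t0*x + t1*y + t2*z + t3, written with FMA intrinsics from <immintrin.h>
__m256d res_x = _mm256_fmadd_pd(t0, x, _mm256_fmadd_pd(t1, y, _mm256_fmadd_pd(t2, z, t3)));
```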

# Point cloud and transformation initialization

The first step is to create a random point cloud (stored both as a 1xN three-channel matrix and as a 4xN matrix in homogeneous coordinates) and a 4x4 transformation matrix, specified using Euler angles and a 3D translation vector.
```cpp
// Create a point cloud as a single row
int num_points = (int)1e6;
cv::Mat cloud(1, num_points, CV_64FC3);
cv::Mat transform(4, 4, CV_64F);

// Initialize point cloud and transformation matrix
cv::randu(cloud, cv::Scalar(0.0, 0.0, 0.0), cv::Scalar(100.0, 100.0, 100.0));

cv::Mat cloud_homogeneous(4, cloud.cols, CV_64FC1);
for (int i = 0; i < cloud.cols; ++i)
{
    const cv::Vec3d& pt = cloud.at<cv::Vec3d>(0, i);
    cloud_homogeneous.at<double>(0, i) = pt[0];
    cloud_homogeneous.at<double>(1, i) = pt[1];
    cloud_homogeneous.at<double>(2, i) = pt[2];
    cloud_homogeneous.at<double>(3, i) = 1.0;
}

transform = cv::Mat::eye(4, 4, CV_64F);
{
    double roll = M_PI / 6;  // Rotation around the X-axis
    double pitch = M_PI / 4; // Rotation around the Y-axis
    double yaw = M_PI / 3;   // Rotation around the Z-axis
    cv::Mat Rx = (cv::Mat_<double>(3, 3) <<
                  1, 0, 0,
                  0, std::cos(roll), -std::sin(roll),
                  0, std::sin(roll), std::cos(roll));
    cv::Mat Ry = (cv::Mat_<double>(3, 3) <<
                  std::cos(pitch), 0, std::sin(pitch),
                  0, 1, 0,
                  -std::sin(pitch), 0, std::cos(pitch));
    cv::Mat Rz = (cv::Mat_<double>(3, 3) <<
                  std::cos(yaw), -std::sin(yaw), 0,
                  std::sin(yaw), std::cos(yaw), 0,
                  0, 0, 1);
    cv::Mat R = Rz * Ry * Rx;
    R.copyTo(transform(cv::Rect(0, 0, 3, 3)));
    transform.at<double>(0, 3) = 10; // Translation in x
    transform.at<double>(1, 3) = -5; // Translation in y
    transform.at<double>(2, 3) = 3;  // Translation in z
}
```

# Objective comparison

We will run the code around 1000 times, measure the elapsed time for each of the considered methods, and plot the averaged results for a visual inspection.

We'll use ```std::chrono``` to measure time, but we could also use the CPU clock counter, as mentioned in our [previous post](./DOPvsOOP).
```cpp
auto measure_time = [=](const std::function<void()>& func) {
    auto start = std::chrono::high_resolution_clock::now();
    func();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(end - start).count(); // seconds
};
```
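
With this helper, the averages can be obtained by simply accumulating the measured times over the iterations. A minimal sketch for one of the methods (the iteration count and the method being timed are just placeholders):
```cpp
const int iterations = 1000; // placeholder; match whatever iteration count is being benchmarked
double total_s = 0.0;
cv::Mat result_direct;
for (int it = 0; it < iterations; ++it)
    total_s += measure_time([&]() { transform_opencv(cloud_homogeneous, transform, result_direct); });
std::cout << "transform_opencv mean time: " << (total_s / iterations) * 1000.0 << " ms" << std::endl;
```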

Before anything else, let's verify that the transformation is performed correctly across all tools!

<img src="../assets/blog_images/2025-03-07-fast-3D-processing/points.png" alt="points" width="350" style="display: block; margin-left: auto; margin-right: auto;">

Indeed, the result is the same in all cases, up to the seventh significant decimal digit (the typical precision of the ```float``` format), at least for the evaluated point.
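
For reference, the check behind the figure above can be as simple as printing the same sample point after each method; a sketch comparing two of the outputs, using the (hypothetical) result matrices from the usage sketches above and an arbitrary index:
```cpp
int idx = 0; // arbitrary point to inspect
std::cout << "OpenCV direct  : " << result_direct.at<double>(0, idx) << " "
          << result_direct.at<double>(1, idx) << " "
          << result_direct.at<double>(2, idx) << std::endl;
std::cout << "OpenCV pointers: " << result_pointers.at<cv::Vec3d>(0, idx) << std::endl;
```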

Now, let's look at the results after a high number of iterations.

First, using **no optimization** flags:
```.pro
QMAKE_CXXFLAGS += -O0
```

<img src="../assets/blog_images/2025-03-07-fast-3D-processing/10000_non_optimized/podium_comparison_ms.png" alt="podium_comparison_ms" style="display: block; margin-left: auto; margin-right: auto;">

And then, using compiler **optimization** flags:
```.pro
QMAKE_CXXFLAGS += -O3 -march=native -funroll-loops -fomit-frame-pointer -finline-functions -ftree-vectorize -mavx -mavx2 -msse4.2
QMAKE_LFLAGS += -Wl,--strip-all # Strip symbols in the linking step
```

<img src="../assets/blog_images/2025-03-07-fast-3D-processing/10000_optimized/podium_comparison_ms.png" alt="podium_comparison_ms" style="display: block; margin-left: auto; margin-right: auto;">

# Future work

At first glance, it seems surprising that OpenCV's direct method (and also Eigen's) is by far the slowest of all. Without delving deeply into the library's code, given how optimized it is, one would logically expect the overloaded `*` operator to be tied at a lower level to high-performance vectorized instructions. However, this may have a lot to do with how OpenCV is compiled. An interesting task would be to evaluate this same test in Python to see whether these optimizations are actually performed "under the hood", whether something is missing in the compilation, or whether the direct method is indeed the slowest of all.

The results are somewhat predictable, since OpenMP is really the only approach (as far as I know) that is fully parallelized across threads. In smaller tasks (or with fewer iterations) it may be outperformed, but not at such a large scale.

It would be interesting to program a **CUDA** kernel and test parallel processing on the GPU to see how it compares to CPU-based methods in terms of handling large-scale point cloud operations.
