
Commit 5283e4c

Allow use of branch/no-branch calculation (#31)
When calculating the force, allow for branch or predication

* Added new parameters and kernels with new and old code
* Adjusted condition for kernel calls
* Readjusted condition for kernel calls
* Changed to a templated function, modified how the calculation parameter is passed to the program and modified the README
1 parent 25957a0 commit 5283e4c

19 files changed, with +1408 and -1283 lines changed

README.md

Lines changed: 55 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -134,10 +134,15 @@ A drag factor (`damping`) is used to regulate the velocity. At each timestep, th
134134

135135
The `parameters` described in this section can all be adjusted via command line arguments, as follows:
136136

137-
`./nbody_cuda numParticles simIterationsPerFrame damping dt distEps G numFrames`
137+
`./nbody_cuda numParticles simIterationsPerFrame damping dt distEps G numFrames gwSize calcMethod`
138138

139139
Note that `numParticles` specifies the number of particles simulated, divided by blocksize (i.e. setting `numParticles` to 50 produces 50*256 particles). `simIterationsPerFrame` specifies how many steps of the simulation to take before rendering the next frame and `numFrames` specifies the total number of simulation steps before the program exits. For default values for all of these parameters, refer to `sim_param.cpp`.
140140

141+
`gwSize`: This parameter sets the work-group size, overriding the default of 64.
142+
143+
`calcMethod`: This string parameter selects how the force calculation excludes a particle's interaction with itself. The default value, `BRANCH`, uses a branch instruction; `PREDICATED` uses an arithmetic expression instead. Refer to the [performance](#sycl-vs-cuda-performance) section for details.
144+
145+
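As an illustration of how a `calcMethod` string might select between the two code paths through a template parameter, here is a minimal host-only C++ sketch. The names `CalcMethod`, `parseCalcMethod`, `pairTerm`, and `pairTermDispatch` are hypothetical and are not taken from the repository; the actual kernels live in the CUDA and SYCL sources and may be organised differently.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical enum mirroring the two documented values of calcMethod.
enum class CalcMethod { Branch, Predicated };

// Map the command-line string to the enum; BRANCH is the documented default.
inline CalcMethod parseCalcMethod(const std::string& arg = "BRANCH") {
  if (arg == "BRANCH") return CalcMethod::Branch;
  if (arg == "PREDICATED") return CalcMethod::Predicated;
  throw std::invalid_argument("calcMethod must be BRANCH or PREDICATED");
}

// Scalar stand-in for one pair interaction: the contribution of particle i
// to the force on particle id. The method is fixed at compile time.
template <CalcMethod M>
float pairTerm(float r, float inv_dist_cube, int i, int id) {
  if (M == CalcMethod::Branch) {
    if (i == id) return 0.0f;  // skip self-interaction with a branch
    return r * inv_dist_cube;
  }
  // Mask the self-interaction arithmetically: (i != id) is 0 or 1.
  return r * inv_dist_cube * (i != id);
}

// Runtime dispatch from the parsed parameter to the compile-time variant.
inline float pairTermDispatch(CalcMethod m, float r, float inv_dist_cube,
                              int i, int id) {
  return m == CalcMethod::Branch
             ? pairTerm<CalcMethod::Branch>(r, inv_dist_cube, i, id)
             : pairTerm<CalcMethod::Predicated>(r, inv_dist_cube, i, id);
}
```

Selecting the variant through a template argument keeps the choice out of the inner loop, so each instantiation contains only one of the two forms.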
141146
### Modifying Simulation Behaviour
142147

143148
You can get quite a wide range of 'galactic' behaviours by playing with the parameters described above.
@@ -222,3 +227,52 @@ in the main loop in simulation.dp.cpp. Whereas NVCC handles this via instruction
222227
force += r * inv_dist_cube * (i != id);
223228
```
224229
in both the CUDA & SYCL code, we get comparable performance between the two using our hardware set up (RTX 3060). For 5 steps of the physical simulation (1 rendered frame) with 12,800 particles, both CUDA and SYCL take ~5.05ms (RTX 3060).
230+
231+
## Update 2024
232+
233+
Being able to run the nbody code without rendering made it easier to run on different platforms. The results of these runs brought to light some issues related to the runtime and compilers. As stated before, the original code was modified by substituting:
234+
235+
```
236+
// Original code
237+
if (i == id) continue;
238+
239+
force += r * inv_dist_cube;
240+
```
241+
242+
with
243+
244+
```
245+
// Modified code
246+
force += r * inv_dist_cube * (i != id);
247+
```
248+
249+
in order to address the 40% decrease in SYCL performance compared to the CUDA code. With this change, the performance was almost the same for both compilers on the RTX 3060.
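To see why the two forms give the same force, the following host-only C++ sketch accumulates the force on one particle both ways. It is not the repository's kernel: `Vec3`, `invDistCube`, `forceBranch`, and `forcePredicated` are hypothetical names, and adding `eps` to the squared distance is an assumption about how `distEps` softens the interaction. With softening, the self term has `r = 0` and a finite weight, so multiplying by `(i != id)` contributes exactly zero.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Softened inverse cube distance. eps plays the role of distEps (assumed)
// and keeps the value finite when r is the zero vector.
inline float invDistCube(const Vec3& r, float eps) {
  float d2 = r.x * r.x + r.y * r.y + r.z * r.z + eps;
  float inv = 1.0f / std::sqrt(d2);
  return inv * inv * inv;
}

// Original form: skip the self-interaction with a branch.
inline Vec3 forceBranch(const std::vector<Vec3>& pos, std::size_t id, float eps) {
  Vec3 f{0.0f, 0.0f, 0.0f};
  for (std::size_t i = 0; i < pos.size(); ++i) {
    if (i == id) continue;
    Vec3 r{pos[i].x - pos[id].x, pos[i].y - pos[id].y, pos[i].z - pos[id].z};
    float w = invDistCube(r, eps);
    f.x += r.x * w; f.y += r.y * w; f.z += r.z * w;
  }
  return f;
}

// Modified form: every iteration does the same work; the self term is
// multiplied by 0 and all other terms by 1.
inline Vec3 forcePredicated(const std::vector<Vec3>& pos, std::size_t id, float eps) {
  Vec3 f{0.0f, 0.0f, 0.0f};
  for (std::size_t i = 0; i < pos.size(); ++i) {
    Vec3 r{pos[i].x - pos[id].x, pos[i].y - pos[id].y, pos[i].z - pos[id].z};
    float w = invDistCube(r, eps) * static_cast<float>(i != id);
    f.x += r.x * w; f.y += r.y * w; f.z += r.z * w;
  }
  return f;
}
```

Which form is faster depends on how a given compiler and GPU handle divergent branches versus the extra multiply, which is what the measurements in this section compare.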
250+
251+
We have found that while this also holds on the A100 (CUDA 8.60294 ms vs. SYCL 7.99054 ms), it does not on the RTX 2060, where CUDA is heavily penalized (CUDA 10.7281 ms vs. SYCL 8.52349 ms). Even on the A100, the change slowed the CUDA code (7.95778 ms for the original code).
252+
253+
The code change also roughly doubled performance on the Max 1100 GPU, with the mean kernel time dropping from 21.6555 ms to 10.7633 ms.
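To make the size of this change concrete, the ratio of the two Max 1100 kernel times quoted above can be worked out directly. This is only an arithmetic check on those two figures, not an additional measurement:

$$
\frac{t_{\text{BRANCH}}}{t_{\text{PREDICATED}}} = \frac{21.6555\ \text{ms}}{10.7633\ \text{ms}} \approx 2.01
$$

In other words, the predicated kernel takes about half the time of the branching one on this GPU.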
254+
Below are the best results from executing the code on the three different platforms.
255+
256+
```
257+
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2060 7.5 [CUDA 12.3]
258+
==================== WORK GROUP SIZE 512 BRANCH ========================
259+
CUDA - At step 10000 kernel time is 8.48516 and mean is 8.53952 and stddev is: 0.0884324
260+
DPC - At step 10000 kernel time is 8.23865 and mean is 8.30511 and stddev is: 0.0788344
261+
==================== WORK GROUP SIZE 512 PREDICATED ====================
262+
CUDA - At step 10000 kernel time is 10.7281 and mean is 10.7601 and stddev is: 0.0630959
263+
DPC - At step 10000 kernel time is 8.52349 and mean is 8.5992 and stddev is: 0.078034
264+
265+
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-PCIE-40GB 8.0 [CUDA 12.2]
266+
==================== WORK GROUP SIZE 128 BRANCH ========================
267+
CUDA - At step 10000 kernel time is 7.95778 and mean is 7.95753 and stddev is: 0.000680384
268+
DPC - At step 10000 kernel time is 10.051 and mean is 10.0506 and stddev is: 0.00181166
269+
==================== WORK GROUP SIZE 128 PREDICATED ====================
270+
CUDA - At step 10000 kernel time is 8.60294 and mean is 8.60151 and stddev is: 0.00077172
271+
DPC - At step 10000 kernel time is 7.99054 and mean is 7.99109 and stddev is: 0.0041852
272+
273+
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26516]
274+
==================== WORK GROUP SIZE 32 BRANCH ========================
275+
At step 10000 kernel time is 21.5747 and mean is 21.6555 and stddev is: 0.0734683
276+
==================== WORK GROUP SIZE 32 PREDICATED ====================
277+
At step 10000 kernel time is 10.6649 and mean is 10.7633 and stddev is: 0.0507969
278+
```
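Each log line above reports the latest kernel time together with a running mean and standard deviation. The source does not show how these statistics are computed; one common way to maintain them without storing every sample is Welford's algorithm, sketched below. `RunningStats` is a hypothetical helper, and the choice of the population (divide by n) rather than the sample (divide by n - 1) standard deviation is an assumption.

```cpp
#include <cmath>
#include <cstddef>

// Running mean and standard deviation using Welford's algorithm.
class RunningStats {
 public:
  // Fold one new kernel time (in ms) into the running statistics.
  void add(double x) {
    ++n_;
    double delta = x - mean_;
    mean_ += delta / static_cast<double>(n_);
    m2_ += delta * (x - mean_);  // uses the updated mean
  }

  std::size_t count() const { return n_; }
  double mean() const { return mean_; }

  // Population standard deviation (assumed); returns 0 with no samples.
  double stddev() const {
    return n_ > 0 ? std::sqrt(m2_ / static_cast<double>(n_)) : 0.0;
  }

 private:
  std::size_t n_ = 0;
  double mean_ = 0.0;
  double m2_ = 0.0;  // sum of squared deviations from the running mean
};
```

Calling `add` once per timed step and printing `mean()` and `stddev()` would produce output of the same shape as the log lines, under the assumptions stated above.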

src/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,7 @@ target_compile_definitions(${BINARY_NAME} PRIVATE ${RENDER_FLAG} COMPILER_NAME="
4848
target_link_libraries(${BINARY_NAME} PRIVATE ${RENDER_LIB})
4949
target_compile_features(${BINARY_NAME} PRIVATE cxx_auto_type cxx_nullptr cxx_range_for)
5050
target_include_directories(${BINARY_NAME} PRIVATE ${CUDA_INCLUDE_DIRS})
51+
target_compile_options(${BINARY_NAME} PRIVATE -use_fast_math)
5152

5253
add_custom_target(debug DEPENDS ${BINARY_NAME}_d)
5354
add_executable(${BINARY_NAME}_d ${SOURCE_FILES})

src/camera.cpp

Lines changed: 30 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -10,59 +10,59 @@ const float PI = 3.14159265358979323846;
1010
using namespace std;
1111

1212
Camera::Camera() {
13-
position.x = 0;
14-
position.y = PI / 4;
15-
position.z = 50.0;
13+
position.x = 0;
14+
position.y = PI / 4;
15+
position.z = 50.0;
1616

17-
velocity = {0.0, 0.0, 0.0};
18-
look_at = {0.0, 0.0, 0.0};
19-
look_at_vel = {0.0, 0.0, 0.0};
17+
velocity = {0.0, 0.0, 0.0};
18+
look_at = {0.0, 0.0, 0.0};
19+
look_at_vel = {0.0, 0.0, 0.0};
2020
}
2121

2222
void Camera::step() {
23-
position.x -= velocity.x;
24-
position.y -= velocity.y;
25-
position.z *= (1.0 - velocity.z);
26-
look_at += look_at_vel;
27-
28-
velocity *= 0.72; // damping
29-
look_at_vel *= 0.90;
30-
31-
// limits
32-
if (position.x < 0) position.x += 2 * PI;
33-
if (position.x >= 2 * PI) position.x -= 2 * PI;
34-
position.y =
35-
max(-(float)PI / 2 + 0.001f, min(position.y, (float)PI / 2 - 0.001f));
23+
position.x -= velocity.x;
24+
position.y -= velocity.y;
25+
position.z *= (1.0 - velocity.z);
26+
look_at += look_at_vel;
27+
28+
velocity *= 0.72; // damping
29+
look_at_vel *= 0.90;
30+
31+
// limits
32+
if (position.x < 0) position.x += 2 * PI;
33+
if (position.x >= 2 * PI) position.x -= 2 * PI;
34+
position.y =
35+
max(-(float)PI / 2 + 0.001f, min(position.y, (float)PI / 2 - 0.001f));
3636
}
3737

3838
glm::mat4 Camera::getProj(int width, int height) {
39-
return glm::infinitePerspective(glm::radians(30.0f), width / (float)height,
40-
1.f);
39+
return glm::infinitePerspective(glm::radians(30.0f), width / (float)height,
40+
1.f);
4141
}
4242

4343
glm::vec3 getCartesianCoordinates(glm::vec3 v) {
44-
return glm::vec3(cos(v.x) * cos(v.y), sin(v.x) * cos(v.y), sin(v.y)) * v.z;
44+
return glm::vec3(cos(v.x) * cos(v.y), sin(v.x) * cos(v.y), sin(v.y)) * v.z;
4545
}
4646

4747
glm::mat4 Camera::getView() {
48-
// polar to cartesian coordinates
49-
glm::vec3 view_pos = getCartesianCoordinates(position);
48+
// polar to cartesian coordinates
49+
glm::vec3 view_pos = getCartesianCoordinates(position);
5050

51-
return glm::lookAt(view_pos + look_at, look_at, glm::vec3(0, 0, 1));
51+
return glm::lookAt(view_pos + look_at, look_at, glm::vec3(0, 0, 1));
5252
}
5353

5454
glm::vec3 Camera::getForward() {
55-
return glm::normalize(-getCartesianCoordinates(position));
55+
return glm::normalize(-getCartesianCoordinates(position));
5656
}
5757

5858
glm::vec3 Camera::getRight() {
59-
return glm::normalize(
60-
glm::cross(getCartesianCoordinates(position), glm::vec3(0, 0, 1)));
59+
return glm::normalize(
60+
glm::cross(getCartesianCoordinates(position), glm::vec3(0, 0, 1)));
6161
}
6262

6363
glm::vec3 Camera::getUp() {
64-
return glm::normalize(
65-
glm::cross(getCartesianCoordinates(position), getRight()));
64+
return glm::normalize(
65+
glm::cross(getCartesianCoordinates(position), getRight()));
6666
}
6767

6868
void Camera::addVelocity(glm::vec3 vel) { velocity += vel; }

src/camera.hpp

Lines changed: 37 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -6,43 +6,43 @@
66

77
class Camera {
88
public:
9-
Camera();
10-
11-
/**
12-
* Computes next step of camera parameters
13-
* @param c camera at step n
14-
* @return camera at step n+1
15-
*/
16-
void step();
17-
18-
/**
19-
* Computes projection matrix from camera parameters
20-
* @param c camera parameters
21-
* @param width viewport width
22-
* @param height viewport height
23-
* @return projection matrix
24-
*/
25-
glm::mat4 getProj(int width, int height);
26-
27-
/**
28-
* Computes view matrix from camera parameters
29-
* @param c camera parameters
30-
* @param view matrix
31-
*/
32-
glm::mat4 getView();
33-
34-
glm::vec3 getForward();
35-
glm::vec3 getRight();
36-
glm::vec3 getUp();
37-
38-
glm::vec3 getPosition();
39-
40-
void addVelocity(glm::vec3 vel);
41-
void addLookAtVelocity(glm::vec3 vel);
9+
Camera();
10+
11+
/**
12+
* Computes next step of camera parameters
13+
* @param c camera at step n
14+
* @return camera at step n+1
15+
*/
16+
void step();
17+
18+
/**
19+
* Computes projection matrix from camera parameters
20+
* @param c camera parameters
21+
* @param width viewport width
22+
* @param height viewport height
23+
* @return projection matrix
24+
*/
25+
glm::mat4 getProj(int width, int height);
26+
27+
/**
28+
* Computes view matrix from camera parameters
29+
* @param c camera parameters
30+
* @param view matrix
31+
*/
32+
glm::mat4 getView();
33+
34+
glm::vec3 getForward();
35+
glm::vec3 getRight();
36+
glm::vec3 getUp();
37+
38+
glm::vec3 getPosition();
39+
40+
void addVelocity(glm::vec3 vel);
41+
void addLookAtVelocity(glm::vec3 vel);
4242

4343
private:
44-
glm::vec3 position; ///< Polar coordinates in radians
45-
glm::vec3 velocity; ///< dp/dt of polar coordinates
46-
glm::vec3 look_at; ///< Where is the camera looking at
47-
glm::vec3 look_at_vel; ///< dp/dt of lookat position
44+
glm::vec3 position; ///< Polar coordinates in radians
45+
glm::vec3 velocity; ///< dp/dt of polar coordinates
46+
glm::vec3 look_at; ///< Where is the camera looking at
47+
glm::vec3 look_at_vel; ///< dp/dt of lookat position
4848
};

src/gen.cpp

Lines changed: 30 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -13,39 +13,39 @@ mt19937 rng;
1313
uniform_real_distribution<> dis(0, 1);
1414

1515
glm::vec4 randomParticlePos() {
16-
// Random position on a 'thick disk'
17-
glm::vec4 particle;
18-
float t = dis(rng) * 2 * PI;
19-
float s = dis(rng) * 100;
20-
particle.x = cos(t) * s;
21-
particle.y = sin(t) * s;
22-
particle.z = dis(rng) * 4;
23-
24-
particle.w = 1.f;
25-
return particle;
16+
// Random position on a 'thick disk'
17+
glm::vec4 particle;
18+
float t = dis(rng) * 2 * PI;
19+
float s = dis(rng) * 100;
20+
particle.x = cos(t) * s;
21+
particle.y = sin(t) * s;
22+
particle.z = dis(rng) * 4;
23+
24+
particle.w = 1.f;
25+
return particle;
2626
}
2727

2828
glm::vec4 randomParticleVel(glm::vec4 pos) {
29-
// Initial velocity is 'orbital' velocity from position
30-
glm::vec3 vel = glm::cross(glm::vec3(pos), glm::vec3(0, 0, 1));
31-
float orbital_vel = sqrt(2.0 * glm::length(vel));
32-
vel = glm::normalize(vel) * orbital_vel;
33-
return glm::vec4(vel, 0.0);
29+
// Initial velocity is 'orbital' velocity from position
30+
glm::vec3 vel = glm::cross(glm::vec3(pos), glm::vec3(0, 0, 1));
31+
float orbital_vel = sqrt(2.0 * glm::length(vel));
32+
vel = glm::normalize(vel) * orbital_vel;
33+
return glm::vec4(vel, 0.0);
3434
}
3535

3636
std::vector<float> genFlareTex(int tex_size) {
37-
std::vector<float> pixels(tex_size * tex_size);
38-
float sigma2 = tex_size / 2.0;
39-
float A = 1.0;
40-
for (int i = 0; i < tex_size; ++i) {
41-
float i1 = i - tex_size / 2;
42-
for (int j = 0; j < tex_size; ++j) {
43-
float j1 = j - tex_size / 2;
44-
// gamma corrected gauss
45-
pixels[i * tex_size + j] = pow(
46-
A * exp(-((i1 * i1) / (2 * sigma2) + (j1 * j1) / (2 * sigma2))),
47-
2.2);
48-
}
49-
}
50-
return pixels;
51-
}
37+
std::vector<float> pixels(tex_size * tex_size);
38+
float sigma2 = tex_size / 2.0;
39+
float A = 1.0;
40+
for (int i = 0; i < tex_size; ++i) {
41+
float i1 = i - tex_size / 2;
42+
for (int j = 0; j < tex_size; ++j) {
43+
float j1 = j - tex_size / 2;
44+
// gamma corrected gauss
45+
pixels[i * tex_size + j] = pow(
46+
A * exp(-((i1 * i1) / (2 * sigma2) + (j1 * j1) / (2 * sigma2))),
47+
2.2);
48+
}
49+
}
50+
return pixels;
51+
}

src/gen.hpp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,4 +18,4 @@ glm::vec4 randomParticlePos();
1818
*/
1919
glm::vec4 randomParticleVel(glm::vec4 pos);
2020

21-
std::vector<float> genFlareTex(int size);
21+
std::vector<float> genFlareTex(int size);
