diff --git a/README.md b/README.md index d63a6a1..c177cf2 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,145 @@ -**University of Pennsylvania, CIS 565: GPU Programming and Architecture, -Project 1 - Flocking** +# University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Di Lu + * [LinkedIn](https://www.linkedin.com/in/di-lu-0503251a2/) + * [personal website](https://www.dluisnothere.com/) +* Tested on: Windows 11, i7-12700H @ 2.30GHz 32GB, NVIDIA GeForce RTX 3050 Ti -### (TODO: Your README) +## Introduction -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +In this project, I simulate flocking behavior for a 200 x 200 x 200 cube of scattered boids by using CUDA kernel functions +to calculate their position and velocity on each dT. Based on Craig Reynold's artificial life program, for which a SIGGRAPH paper was written in 1989, +the following three behaviors are implemented: + +1. cohesion - boids move towards the perceived center of mass of their neighbors +2. separation - boids avoid getting to close to their neighbors +3. alignment - boids generally try to move with the same direction and speed as +their neighbors + +In the simulation results, the color of each particle is a representation of its velocity. + +![Coherent Grid Flocking with 50,000 boids](images/headerResized.gif) + +_Coherent Grid Flocking with 50,000 boids_ + +## Implementation and Results +To measure the performance of my code, I ran my program on release mode with VSync disabled. There are +three implementations: with the first being naive neighbor search, and each subsequent part +utilizing more optimizations. + +#### Part 1. Naive Boids Simulation + +The first simulation is a naive neighbor search, where each boid searches every other boid in existence and checks +whether they are within distance for cohesion, separation, or alignment. If a non-self boid is within any such distance, +then its position and velocity will be taken into account for the respective rule. + +![Naive Boids Simulation](images/naive.gif) + +_Naive Grid Flocking with 5,000 boids_ + +#### Part 2. Uniform Grid Boids + +The second simulation is a neighbor search that takes into account the largest neighborhood distance among the 3 rules. +The simulation space is divided into grid cubes. Using these cubes, Each boid only needs to check the cubes that overlap +with its spherical neighborhood. + +Each boid calculates the extremities of its reach by using its own radius and position. With these extremities, I can calculate +the maximum and minimum of my desired cells to scan. Hence, the number of useless boid scans are reduced, resulting in a much +faster simulation! + +![Uniform Boids Simulation](images/uniform.gif) + +_Uniform Grid Flocking with 5,000 boids_ + +#### Part 3. Coherent Grid Boids + +The third simulation builds on the second simulation. This time, we also rearrange the position and velocity information such that +boids that are in a cell together are also contiguous in memory. + +![Coherent Boids Simulation](images/coherent.gif) + +_Coherent Grid Flocking with 5,000 boids_ + +## Part 3. Overall Performance Analysis + +* Number of Boids vs. Performance + +![Graph 1](images/graph1.png) + +* BlockSize vs. Performance (N = 100,000) + +![Graph 2](images/graph2.png) + +**For each implementation, how does changing the number of boids affect +performance? Why do you think this is?** + +Across all three implementations, increasing the number of boids will decrease FPS. +If the scene scale is the same, then each cell will become more dense as boids increase, +which means each boid will need to run more sequential operations to check the increased +number of valid neighbors. + +**For each implementation, how does changing the block count and block size +affect performance? Why do you think this is?** + +As seen from the graph, smaller block count usually results in poorer performance before a certain +blockSize is hit. For example, there is a big difference between blockSize == 16 and blockSize == 4, however, +above blockSize 32 is relatively similar in FPS performance. This behavior is not easily seen in simulations +with fewer boids (N = 5,000). This is likely because when blockSize is small, there are less threads that can run +in parallel. When blockSize is large, more and more threads can run the same program at the same time. However, +if the number of threads exceed the number of parallel operations, then there won't be much performance enahancement +anymore. + +**For the coherent uniform grid: did you experience any performance improvements +with the more coherent uniform grid? Was this the outcome you expected? +Why or why not?** + +I experienced a significant performance improvement with coherent uniform grid when the boid number +is very high. For example, on 500,000 boids, coherent grid can comfortably give me ~130 FPS, while +uniform grid just about dies at ~6 FPS. For fewer boids, the performance improvement is not that obvious (for example, +FPS is fairly similar between coherent and uniform grids when we have 5,000 boids.) This outcome is expected, because +when the number of boids is higher, each cell also contains a lot more boids. If the information is not contiguous in +memory, the impact of checking many boids whose information is further apart is more noticeable. + +**Did changing cell width and checking 27 vs 8 neighboring cells affect performance? +Why or why not? Be careful: it is insufficient (and possibly incorrect) to say +that 27-cell is slower simply because there are more cells to check!** + +My understanding is that given scene scale 100 and 100,000 boids, decreasing cell size +(thus checking more cells) will increase performance. This is because given the same density +of boids in a cell, we can check more cells in parallel and run less sequential operations +for each cell. However, if the scene scale were larger with the same number of boids, then this could potentially +decrease efficiency, because we are checking more cells that could be "useless". + +To test my thinking, I used the following two scenarios: + +_Constants:_ + +* _Coherent grid_ +* _Number of boids: 100,000_ + +1. Scene Scale: 100 + +I observed around 100 FPS **increase** when I used cell width == neighborhood size. + +2. Scene Scale: 200 + +I observed nearly 200 FPS **decrease** when I used cell width == neighbordhood size. + + + +## Part 4: More Images and Results! +![](images/naive50k.png) + +_Naive Flocking with 50,000 boids_ + +![](images/big2.gif) + +_Coherent Flocking with 100,000 boids_ + +![](images/big.png) + +_Coherent Flocking with 500,000 boids, could not get a gif of this onto github_ + +![](images/sceneScale200with100kBoids.png) + +_100,000 boids on 200 scene scale instead of 100_ diff --git a/images/27cells.png b/images/27cells.png new file mode 100644 index 0000000..4099326 Binary files /dev/null and b/images/27cells.png differ diff --git a/images/big.png b/images/big.png new file mode 100644 index 0000000..934b8cd Binary files /dev/null and b/images/big.png differ diff --git a/images/big2.gif b/images/big2.gif new file mode 100644 index 0000000..4aab83c Binary files /dev/null and b/images/big2.gif differ diff --git a/images/coherent.gif b/images/coherent.gif new file mode 100644 index 0000000..c2eb40e Binary files /dev/null and b/images/coherent.gif differ diff --git a/images/coherent2.gif b/images/coherent2.gif new file mode 100644 index 0000000..9a1ef63 Binary files /dev/null and b/images/coherent2.gif differ diff --git a/images/graph1.png b/images/graph1.png new file mode 100644 index 0000000..0d1bb74 Binary files /dev/null and b/images/graph1.png differ diff --git a/images/graph2.png b/images/graph2.png new file mode 100644 index 0000000..2b0e0b0 Binary files /dev/null and b/images/graph2.png differ diff --git a/images/header.gif b/images/header.gif new file mode 100644 index 0000000..42dc757 Binary files /dev/null and b/images/header.gif differ diff --git a/images/header2.gif b/images/header2.gif new file mode 100644 index 0000000..1b6b05c Binary files /dev/null and b/images/header2.gif differ diff --git a/images/headerResized.gif b/images/headerResized.gif new file mode 100644 index 0000000..1e01b9b Binary files /dev/null and b/images/headerResized.gif differ diff --git a/images/naive.gif b/images/naive.gif new file mode 100644 index 0000000..580cfa6 Binary files /dev/null and b/images/naive.gif differ diff --git a/images/naive2.gif b/images/naive2.gif new file mode 100644 index 0000000..a106708 Binary files /dev/null and b/images/naive2.gif differ diff --git a/images/naive50k.png b/images/naive50k.png new file mode 100644 index 0000000..ea38d71 Binary files /dev/null and b/images/naive50k.png differ diff --git a/images/sceneScale200with100kBoids.png b/images/sceneScale200with100kBoids.png new file mode 100644 index 0000000..5e227ef Binary files /dev/null and b/images/sceneScale200with100kBoids.png differ diff --git a/images/uniform.gif b/images/uniform.gif new file mode 100644 index 0000000..8f110a0 Binary files /dev/null and b/images/uniform.gif differ diff --git a/images/uniform2.gif b/images/uniform2.gif new file mode 100644 index 0000000..fbd867d Binary files /dev/null and b/images/uniform2.gif differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..cb24ae8 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -5,8 +5,10 @@ #include #include "utilityCore.hpp" #include "kernel.h" +#include // LOOK-2.1 potentially useful for doing grid-based neighbor search +// Di: Returns max of the two or min of the two elements for comparison. #ifndef imax #define imax( a, b ) ( ((a) > (b)) ? (a) : (b) ) #endif @@ -41,9 +43,9 @@ void checkCUDAError(const char *msg, int line = -1) { // LOOK-1.2 Parameters for the boids algorithm. // These worked well in our reference implementation. -#define rule1Distance 5.0f -#define rule2Distance 3.0f -#define rule3Distance 5.0f +#define rule1Distance 5.0f // 5.0f +#define rule2Distance 3.0f // 3.0f +#define rule3Distance 5.0f // 5.0f #define rule1Scale 0.01f #define rule2Scale 0.1f @@ -52,7 +54,7 @@ void checkCUDAError(const char *msg, int line = -1) { #define maxSpeed 1.0f /*! Size of the starting area in simulation space. */ -#define scene_scale 100.0f +#define scene_scale 200.0f /*********************************************** * Kernel state (pointers are device pointers) * @@ -74,8 +76,10 @@ glm::vec3 *dev_vel2; // pointers on your own too. // For efficient sorting and the uniform grid. These should always be parallel. +// Di all initialized on GPU int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle? int *dev_particleGridIndices; // What grid cell is this particle in? + // needed for use with thrust thrust::device_ptr dev_thrust_particleArrayIndices; thrust::device_ptr dev_thrust_particleGridIndices; @@ -85,6 +89,11 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +// glm::vec3* dev_posSortedPrev; +glm::vec3* dev_posSorted; +glm::vec3* dev_vel1Sorted; +// thrust::device_ptr dev_thrust_particleArrayIndicesValue; + // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -125,6 +134,7 @@ __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) { */ __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) { int index = (blockIdx.x * blockDim.x) + threadIdx.x; + // some threads are extra and thus won't enter this case. if (index < N) { glm::vec3 rand = generateRandomVec3(time, index); arr[index].x = scale * rand.x; @@ -151,13 +161,21 @@ void Boids::initSimulation(int N) { cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); + // di added + cudaMalloc((void**)&dev_posSorted, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_posSorted failed!"); + + cudaMalloc((void**)&dev_vel1Sorted, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_posSortedVel1 failed!"); + // LOOK-1.2 - This is a typical CUDA kernel invocation. kernGenerateRandomPosArray<<>>(1, numObjects, dev_pos, scene_scale); checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); // LOOK-2.1 computing grid params - gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); + // remove 2.0 * + gridCellWidth = 2.0 * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; gridSideCount = 2 * halfSideCount; @@ -169,6 +187,23 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + dev_thrust_particleArrayIndices = thrust::device_ptr(dev_particleArrayIndices); + dev_thrust_particleGridIndices = thrust::device_ptr(dev_particleGridIndices); + + // dev_posSortedPrev = dev_pos; + cudaDeviceSynchronize(); } @@ -210,8 +245,8 @@ __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) { dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); - kernCopyPositionsToVBO << > >(numObjects, dev_pos, vbodptr_positions, scene_scale); - kernCopyVelocitiesToVBO << > >(numObjects, dev_vel1, vbodptr_velocities, scene_scale); + kernCopyPositionsToVBO <<>>(numObjects, dev_pos, vbodptr_positions, scene_scale); + kernCopyVelocitiesToVBO <<>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale); checkCUDAErrorWithLine("copyBoidsToVBO failed!"); @@ -233,7 +268,57 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves // Rule 2: boids try to stay a distance d away from each other // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + //// return glm::vec3(0.0f, 0.0f, 0.0f); + glm::vec3 velocityChange = glm::vec3(0.f); + glm::vec3 selfPos = pos[iSelf]; + + // Rule 1 + // find the average position of all the birds + glm::vec3 center = glm::vec3(0.f); + int numValid1 = 0; + + glm::vec3 c = glm::vec3(0.f); + + glm::vec3 perceivedVelocity = glm::vec3(0.f); + int numValid3 = 0; + + for (int i = 0; i < N; i++) { + glm::vec3 bPos = pos[i]; + float dist = glm::distance(selfPos, bPos); + + // Rule 1 + if (i != iSelf && dist < rule1Distance) { + center += pos[i]; + numValid1 += 1; + } + + // Rule 2 + if (i != iSelf && dist < rule2Distance) { + c -= (bPos - selfPos); + } + + // Rule 3 + if (i != iSelf && dist < rule3Distance) { + perceivedVelocity += vel[i]; + numValid3 += 1; + } + + } + + if (numValid1 != 0) { + center /= numValid1; + velocityChange += (center - selfPos) * rule1Scale; + } + + velocityChange += c * rule2Scale; + + if (numValid3 != 0) { + perceivedVelocity /= numValid3; + velocityChange += perceivedVelocity * rule3Scale; + } + + return velocityChange; + } /** @@ -244,7 +329,23 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { // Compute a new velocity based on pos and vel1 // Clamp the speed - // Record the new velocity into vel2. Question: why NOT vel1? + // Record the new velocity into vel2. Question: why NOT vel1? -> + + // we don't want to overwrite original velocities because other boids still need to refer to it to compute their vels. + // and vel1 and vel2 are all being accessed at the same time so if info in vel1 gets changed while another thread is accessing + // that would be bad. + // once we finish computing these vels, vel1 becomes our "write" array and vel2 becomes our "read" array + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + glm::vec3 velChange = computeVelocityChange(N, index, pos, vel1); + glm::vec3 finalVel = vel1[index] + velChange; + float speed = glm::length(finalVel); + if (speed > maxSpeed) { + finalVel = finalVel * maxSpeed / speed; + } + vel2[index] = finalVel; } /** @@ -258,8 +359,14 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { return; } glm::vec3 thisPos = pos[index]; + + // debug value + glm::vec3 thisVel = vel[index]; + thisPos += vel[index] * dt; + // printf("vel x: %f, vel y: %f, vel z: %f", vel[index].x, vel[index].y, vel[index].z); + // Wrap the boids around so we don't lose them thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; @@ -278,21 +385,44 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { // for(x) // for(y) // for(z)? Or some other order? -__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { - return x + y * gridResolution + z * gridResolution * gridResolution; +__device__ int gridIndex3Dto1D(int x, int y, int z, int gridSideCount) { + return x + y * gridSideCount + z * gridSideCount * gridSideCount; } -__global__ void kernComputeIndices(int N, int gridResolution, +__global__ void kernComputeIndices(int N, int gridSideCount, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) { // TODO-2.1 // - Label each boid with the index of its grid cell. // - Set up a parallel array of integer indices as pointers to the actual // boid data in pos and vel1/vel2 + + // this kern function is called in parallel by each boid so it only needs to be + int boidIdx = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (boidIdx < N) { + // compute components + glm::vec3 boidPos = pos[boidIdx]; + + int ix = (int) (std::floor((boidPos.x - gridMin.x) * inverseCellWidth)); + int iy = (int) (std::floor((boidPos.y - gridMin.y) * inverseCellWidth)); + int iz = (int) (std::floor((boidPos.z - gridMin.z) * inverseCellWidth)); + + // compute 1D cell idx from 3D + int cellIdx = gridIndex3Dto1D(ix, iy, iz, gridSideCount); + + // fill dev_particleGridIndices at boidIdx + gridIndices[boidIdx] = cellIdx; + + // fill dev_particleArrayIndices, starts exactly as boidIdx initially. + indices[boidIdx] = boidIdx; + } + } // LOOK-2.1 Consider how this could be useful for indicating that a cell // does not enclose any boids +// pass in an invalid array value such as -1 for cells that don't contain boids __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { int index = (blockIdx.x * blockDim.x) + threadIdx.x; if (index < N) { @@ -306,10 +436,156 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" + + // post sorting + // also looks like it's called in parallel by boids. + // note that idx is NOT boidIdx + // for current idx, check its own cell in GridIndices and one before it. + // if idx 0, must be start cell, and set its start location to itself. then check for end cell status + // if idx N - 1, must be end cell, and set its end location to itself. then check for start cell status + // otherwise, if different from idx - 1, set as start cell. + // if different from idx + 1, set as end cell. + // not sure if most efficient way to do it honestly. + + int accessIdx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (accessIdx < N) { + int gridCell = particleGridIndices[accessIdx]; + + if (accessIdx == 0) { + gridCellStartIndices[gridCell] = accessIdx; + + // check for next grid cell as well + int nextGridCell = particleGridIndices[accessIdx + 1]; + if (gridCell != nextGridCell) { + gridCellEndIndices[gridCell] = accessIdx; + } + } + else if (accessIdx == N - 1) { + gridCellEndIndices[gridCell] = accessIdx; + + // check if gridCell == prevGridCell. + int prevGridCell = particleGridIndices[accessIdx - 1]; + if (gridCell != prevGridCell) { + gridCellStartIndices[gridCell] = accessIdx; + } + } + else { + int prevGridCell = particleGridIndices[accessIdx - 1]; + int nextGridCell = particleGridIndices[accessIdx + 1]; + if (gridCell != prevGridCell) { + gridCellStartIndices[gridCell] = accessIdx; + } + if (gridCell != nextGridCell) { + gridCellEndIndices[gridCell] = accessIdx; + } + } + } +} + +// device helper function +__device__ glm::vec3 kernComputeVelocityChangeScattered(int N, glm::vec3 gridMin, + float cellWidth, int iSelf, glm::vec3* pos, + glm::vec3* vel1, int* gridCellStartIndices, + int* gridCellEndIndices, int* particleArrayIndices, int sideCount) { + + //int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x; + int selfIdx = iSelf; + if (selfIdx < N) { + glm::vec3 selfPos = pos[selfIdx]; + glm::vec3 velocityChange = glm::vec3(0.f); + + float radius = imax(imax(rule1Distance, rule2Distance), rule3Distance); + + int numValid1 = 0; + int numValid3 = 0; + + glm::vec3 center = glm::vec3(0.f); + glm::vec3 c = glm::vec3(0.f); + glm::vec3 perceivedVelocity = glm::vec3(0.f); + + // find maximum of entire grid in cell space coordinates. + glm::vec3 cellSpaceMax = glm::vec3(sideCount - 1, sideCount - 1, sideCount - 1); + + // create a bounding box based on the location of boid within the grid in cell space coordinates. + float xMax, xMin, yMax, yMin, zMax, zMin = 0; + + // xMax and co. are all cell space coordinates and not world space coordaintes. + // find the grid cell associated with radMaxx, radMaxy, radMaxz, radMinx, radMiny, radMinz + // set xMax, xMin, yMax, yMin, zMax, zMin etc by clamping between that and + int ixRadMax = std::floor((selfPos.x + radius - gridMin.x) / cellWidth); + int iyRadMax = std::floor((selfPos.y + radius - gridMin.y) / cellWidth); + int izRadMax = std::floor((selfPos.z + radius - gridMin.z) / cellWidth); + int ixRadMin = std::floor((selfPos.x - radius - gridMin.x) / cellWidth); + int iyRadMin = std::floor((selfPos.y - radius - gridMin.y) / cellWidth); + int izRadMin = std::floor((selfPos.z - radius - gridMin.z) / cellWidth); + + xMax = imin(ixRadMax, cellSpaceMax.x); + yMax = imin(iyRadMax, cellSpaceMax.y); + zMax = imin(izRadMax, cellSpaceMax.z); + xMin = imax(ixRadMin, gridMin.x); + yMin = imax(iyRadMin, gridMin.y); + zMin = imax(izRadMin, gridMin.z); + + // loop within the bounding box. + // these coordinates have already been clamped + // without consideration of which axis is best to check first + // started with x,y,z + + for (float z = zMin; z <= zMax; z += 1) { + for (float y = yMin; y <= yMax; y += 1) { + for (float x = xMin; x <= xMax; x += 1) { + int gridIdx = gridIndex3Dto1D(x, y, z, sideCount); + int startIdx = gridCellStartIndices[gridIdx]; + int endIdx = gridCellEndIndices[gridIdx]; + + if (startIdx > -1) { + // if start idx is > -1, then boids exist + for (int curIdx = startIdx; curIdx <= endIdx; curIdx++) { + int nBoidIdx = particleArrayIndices[curIdx]; + glm::vec3 nBoidPos = pos[nBoidIdx]; + + float dist = glm::distance(selfPos, nBoidPos); + + // Rule 1 + if (nBoidIdx != selfIdx && dist < rule1Distance) { + center += pos[nBoidIdx]; + numValid1 += 1; + } + + // Rule 2 + if (nBoidIdx != selfIdx && dist < rule2Distance) { + c -= (nBoidPos - selfPos); + } + + // Rule 3 + if (nBoidIdx != selfIdx && dist < rule3Distance) { + perceivedVelocity += vel1[nBoidIdx]; + numValid3 += 1; + } + } + } + } + } + } + + if (numValid1 != 0) { + center /= numValid1; + velocityChange += (center - selfPos) * rule1Scale; + } + + velocityChange += c * rule2Scale; + + if (numValid3 != 0) { + perceivedVelocity /= numValid3; + velocityChange += perceivedVelocity * rule3Scale; + } + + return velocityChange; + } } __global__ void kernUpdateVelNeighborSearchScattered( - int N, int gridResolution, glm::vec3 gridMin, + int N, int gridSideCount, glm::vec3 gridMin, float inverseCellWidth, float cellWidth, int *gridCellStartIndices, int *gridCellEndIndices, int *particleArrayIndices, @@ -322,13 +598,135 @@ __global__ void kernUpdateVelNeighborSearchScattered( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + + // for a given boid, try all cells first, then optimize by doing quadrant search later. + // 1. split cell into octants, identify the octant it belongs in + // 2. get 8 neighboring octants. figure out which octants contain cells. --> get all filled cells regardless of distance + // 2. Scan the appropriate 8 or so octants based on the result + // 3. for each cell, read the start/end endices and read boid in each cell. apply velocity changes. + // 4. clamp speed as usual + + int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (selfIdx < N) { + glm::vec3 velocityChange = kernComputeVelocityChangeScattered(N, gridMin, cellWidth, selfIdx, pos, vel1, gridCellStartIndices, + gridCellEndIndices, particleArrayIndices, gridSideCount); + + // clamp speed and set new velocity + glm::vec3 finalVel = vel1[selfIdx] + velocityChange; + float speed = glm::length(finalVel); + if (speed > maxSpeed) { + finalVel = finalVel * maxSpeed / speed; + } + + vel2[selfIdx] = finalVel; + } +} + +// helper device +__device__ glm::vec3 kernComputeVelocityChangeCoherent(int N, glm::vec3 gridMin, + float cellWidth, int iSelf, glm::vec3* posSorted, + glm::vec3* vel1Sorted, int* gridCellStartIndices, + int* gridCellEndIndices, int sideCount) { + // int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (iSelf < N) { + glm::vec3 selfPos = posSorted[iSelf]; + glm::vec3 velocityChange = glm::vec3(0.f); + + float radius = imax(imax(rule1Distance, rule2Distance), rule3Distance); + + int numValid1 = 0; + int numValid3 = 0; + + glm::vec3 center = glm::vec3(0.f); + glm::vec3 c = glm::vec3(0.f); + glm::vec3 perceivedVelocity = glm::vec3(0.f); + + // find maximum of entire grid in cell space coordinates. + glm::vec3 cellSpaceMax = glm::vec3(sideCount - 1, sideCount - 1, sideCount - 1); + + // create a bounding box based on the location of boid within the grid in cell space coordinates. + float xMax, xMin, yMax, yMin, zMax, zMin = 0; + + // xMax and co. are all cell space coordinates and not world space coordaintes. + // find the grid cell associated with radMaxx, radMaxy, radMaxz, radMinx, radMiny, radMinz + // set xMax, xMin, yMax, yMin, zMax, zMin etc by clamping between that and + int ixRadMax = std::floor((selfPos.x + radius - gridMin.x) / cellWidth); + int iyRadMax = std::floor((selfPos.y + radius - gridMin.y) / cellWidth); + int izRadMax = std::floor((selfPos.z + radius - gridMin.z) / cellWidth); + int ixRadMin = std::floor((selfPos.x - radius - gridMin.x) / cellWidth); + int iyRadMin = std::floor((selfPos.y - radius - gridMin.y) / cellWidth); + int izRadMin = std::floor((selfPos.z - radius - gridMin.z) / cellWidth); + + xMax = imin(ixRadMax, cellSpaceMax.x); + yMax = imin(iyRadMax, cellSpaceMax.y); + zMax = imin(izRadMax, cellSpaceMax.z); + xMin = imax(ixRadMin, gridMin.x); + yMin = imax(iyRadMin, gridMin.y); + zMin = imax(izRadMin, gridMin.z); + + // loop within the bounding box. + // these coordinates have already been clamped + // todo consider which axis is the best for checking + for (float x = xMin; x <= xMax; x += 1) { + for (float y = yMin; y <= yMax; y += 1) { + for (float z = zMin; z <= zMax; z += 1) { + int gridIdx = gridIndex3Dto1D(x, y, z, sideCount); + int startIdx = gridCellStartIndices[gridIdx]; + int endIdx = gridCellEndIndices[gridIdx]; + + if (startIdx > -1) { + // if start idx is > -1, then boids exist + for (int curIdx = startIdx; curIdx <= endIdx; curIdx++) { + // now curIdx is the same as nBoidIdx + glm::vec3 nBoidPos = posSorted[curIdx]; + glm::vec3 vel = vel1Sorted[curIdx]; + + float dist = glm::distance(selfPos, nBoidPos); + //printf("dist: %f \n", dist); + + // Rule 1 + if (curIdx != iSelf && dist < rule1Distance) { + center += posSorted[curIdx]; + numValid1 += 1; + } + + // Rule 2 + if (curIdx != iSelf && dist < rule2Distance) { + c -= (nBoidPos - selfPos); + } + + // Rule 3 + if (curIdx != iSelf && dist < rule3Distance) { + perceivedVelocity += vel1Sorted[curIdx]; + numValid3 += 1; + } + } + } + } + } + } + + if (numValid1 != 0) { + center /= numValid1; + velocityChange += (center - selfPos) * rule1Scale; + } + + velocityChange += c * rule2Scale; + + if (numValid3 != 0) { + perceivedVelocity /= numValid3; + velocityChange += perceivedVelocity * rule3Scale; + } + + return velocityChange; + } } __global__ void kernUpdateVelNeighborSearchCoherent( - int N, int gridResolution, glm::vec3 gridMin, + int N, int gridSideCount, glm::vec3 gridMin, float inverseCellWidth, float cellWidth, int *gridCellStartIndices, int *gridCellEndIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + glm::vec3 *posSorted, glm::vec3 *vel1Sorted, glm::vec3 *vel2) { // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, // except with one less level of indirection. // This should expect gridCellStartIndices and gridCellEndIndices to refer @@ -341,6 +739,42 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (selfIdx < N) { + glm::vec3 velocityChange = kernComputeVelocityChangeCoherent(N, gridMin, + cellWidth, selfIdx, posSorted, + vel1Sorted, gridCellStartIndices, + gridCellEndIndices, gridSideCount); + + // clamp speed and set new velocity + glm::vec3 finalVel = vel1Sorted[selfIdx] + velocityChange; + float speed = glm::length(finalVel); + if (speed > maxSpeed) { + finalVel = finalVel * maxSpeed / speed; + } + // printf("finalVel: x: %f, y: %f, z: %f \n", finalVel.x, finalVel.y, finalVel.z); + + // vel2 is still the write buffer + vel2[selfIdx] = finalVel; + } +} + +// helper global which rearranges information in pos and vel based on particleArrayIndices O(N) time + +__global__ void kernRearrangePosVel(int N, int* particleArrayIndices, glm::vec3* posSorted, glm::vec3* vel1Sorted, glm::vec3* pos, glm::vec3* vel1) { + int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (selfIdx < N) { + int requiredIdx = particleArrayIndices[selfIdx]; + + //printf("vel x: %f, y: %f, z: %f \n ", vel1Prev[requiredIdx].x, vel1Prev[requiredIdx].y, vel1Prev[requiredIdx].z); + + // essentially swap the contents + glm::vec3 requiredPos = pos[requiredIdx]; + glm::vec3 requiredVel = vel1[requiredIdx]; + + posSorted[selfIdx] = requiredPos; + vel1Sorted[selfIdx] = requiredVel; + } } /** @@ -349,6 +783,12 @@ __global__ void kernUpdateVelNeighborSearchCoherent( void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. // TODO-1.2 ping-pong the velocity buffers + int fullBlocksPerGrid = (numObjects + blockSize - 1) / blockSize; + + kernUpdateVelocityBruteForce<<>>(numObjects, dev_pos, dev_vel1, dev_vel2); + kernUpdatePos<<>>(numObjects, dt, dev_pos, dev_vel1); + + std::swap(dev_vel1, dev_vel2); } void Boids::stepSimulationScatteredGrid(float dt) { @@ -364,6 +804,35 @@ void Boids::stepSimulationScatteredGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed + int numCells = gridSideCount * gridSideCount * gridSideCount; + + // For each boid, label its array index and grid index + int fullBlocksPerGrid = (numObjects + blockSize - 1) / blockSize; + int fullCellsPerGrid = (numCells + blockSize - 1) / blockSize; + + // fill with -1s as default values + // Per cell basis + kernResetIntBuffer<<>>(numCells, dev_gridCellStartIndices, -1); + kernResetIntBuffer<<>>(numCells, dev_gridCellEndIndices, -1); + + // per boid basis + kernComputeIndices<<>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, + dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + + // pointer to first key, pointer to last key, pointer to first value + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); + + kernIdentifyCellStartEnd<<>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + // per boid basis + kernUpdateVelNeighborSearchScattered << > >(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, + dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2); + + // update position per boid + kernUpdatePos<<>>(numObjects, dt, dev_pos, dev_vel1); + + // ping pong + std::swap(dev_vel1, dev_vel2); } void Boids::stepSimulationCoherentGrid(float dt) { @@ -382,6 +851,43 @@ void Boids::stepSimulationCoherentGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + int numCells = gridSideCount * gridSideCount * gridSideCount; + + // For each boid, label its array index and grid index + int fullBlocksPerGrid = (numObjects + blockSize - 1) / blockSize; + int fullCellsPerGrid = (numCells + blockSize - 1) / blockSize; + + // fill with -1s as default values + // Per cell basis + kernResetIntBuffer << > > (numCells, dev_gridCellStartIndices, -1); + kernResetIntBuffer << > > (numCells, dev_gridCellEndIndices, -1); + + // per boid basis + kernComputeIndices << > > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, + dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + + // sort dev_array_indices values + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); + + // rearrange pos and vel + // we never change vel1 and pos. only use them as reads. + kernRearrangePosVel << > > (numObjects, dev_particleArrayIndices, dev_posSorted, dev_vel1Sorted, dev_pos, dev_vel1); + + // find start and end indices +// per boid basis because we read from dev_particleGridIndices + kernIdentifyCellStartEnd << > > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + // per boid basis + kernUpdateVelNeighborSearchCoherent << > > (numObjects, gridSideCount, gridMinimum, + gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, + dev_gridCellEndIndices, dev_posSorted, dev_vel1Sorted, dev_vel2); + + // update position per boid + kernUpdatePos << > > (numObjects, dt, dev_posSorted, dev_vel2); + + // ping pong + std::swap(dev_vel1, dev_vel2); + std::swap(dev_posSorted, dev_pos); } void Boids::endSimulation() { @@ -390,6 +896,12 @@ void Boids::endSimulation() { cudaFree(dev_pos); // TODO-2.1 TODO-2.3 - Free any additional buffers here. + cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); + + // cannot free the thrust pointers } void Boids::unitTest() { @@ -439,6 +951,7 @@ void Boids::unitTest() { thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values); // How to copy data back to the CPU side from the GPU + cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost); cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost); checkCUDAErrorWithLine("memcpy back failed!"); diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..eee3475 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -14,12 +14,12 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 +#define UNIFORM_GRID 1 #define COHERENT_GRID 0 // LOOK-1.2 - change this to adjust particle count in the simulation -const int N_FOR_VIS = 5000; -const float DT = 0.2f; +const int N_FOR_VIS = 100000; +const float DT = 0.1f; /** * C main function.