diff --git a/README.md b/README.md
index d63a6a1..1e6ab5b 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,123 @@
+Boids Flocking Simulation
+====================
+
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Bowen Deng
+  * [LinkedIn](www.linkedin.com/in/bowen-deng-7dbw13), [twitter](https://twitter.com/7Dbw13)
+* Tested on: Windows 10, AMD Ryzen 9 5900HX with Radeon Graphics @ 3.30GHz 16GB, GeForce RTX 3070 Laptop GPU 8GB (Personal Computer)
+
+
+## Abstract
+
+A flocking simulation based on the Reynolds Boids algorithm, along with two levels of optimization: a uniform grid, and a uniform grid with semi-coherent memory access.
+
+![](images/representative_image.gif)
+
+> Simulation of 10000 boids
+
+![](images/representative_image2.gif)
+
+> Simulation of 100000 boids
+
+In the simulation, particles representing birds or fish (boids) move around the simulation space according to three rules:
+
+1. cohesion - boids move towards the perceived center of mass of their neighbors
+2. separation - boids avoid getting too close to their neighbors
+3. alignment - boids generally try to match the direction and speed of their neighbors
+
+These three rules specify a boid's velocity change in a timestep. Different velocities are rendered as different colors in the simulation.
+
+## Highlights
+
+Up to 50 million boids can be simulated at once, at roughly 0.1 fps. A visualization of 10 million boids is shown below (at 50 million, the motion is too slow for the change to be visible).
+
+![](images/max.gif)
+
+## Performance Analysis
+
+### Measurement Metric
+
+Besides the provided framerate meter, custom CUDA events are used to measure only the time spent in the simulation CUDA kernels. An average kernel FPS is then computed as follows:
+```
+// display the average FPS of every 5 seconds
+if (time_elapse >= 5.0f) {
+  kernel_fps = frame / time_elapse;
+  frame = 0;
+  time_elapse = 0;
+}
+```
+
+Here, `time_elapse` is the total time consumed by the simulation kernels alone. For all experiments presented below, the average kernel FPS over the first 5 seconds is used as the measurement metric.
+
+To use this kernel FPS instead of the original FPS display, toggle this macro defined in `src/main.cpp`:
+```
+// toggle for measuring just the simulation CUDA kernel
+#define KERNEL_FPS 1
+```
+
+### Benchmark
+
+The naive, brute-force neighbor search is chosen as the baseline. In this method, each boid simply checks every other boid at every timestep.
+
+For all experiments presented below, parameters are left at their defaults unless mentioned otherwise.
+
+### Number of Boids
+
+The framerate drops dramatically as the number of boids grows:
+
+![](images/num1.png)
+
+This is expected: the more boids are simulated, the more computation the GPU must perform at each timestep.
+
+### Coherent Uniform Grid
+
+The previous figure also shows the impressive performance gain from applying the uniform grid. In addition, the coherent uniform grid improves performance further, especially when the number of boids is large (50000 ~ 500000). A likely reason is that the coherent layout turns neighbor lookups into contiguous memory reads, which caches serve much more efficiently, as sketched below.
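+A minimal sketch of the difference, mirroring the inner neighbor loops of the two kernels in `src/kernel.cu`:
+
+```
+// scattered grid: one extra indirection per neighbor, so pos[] is read in an
+// essentially arbitrary order, which defeats caching
+for (int i = start_index; i <= end_index; i++) {
+  glm::vec3 neighbor_pos = pos[particleArrayIndices[i]];
+  // ... apply the three rules ...
+}
+
+// coherent grid: pos[] was reshuffled into cell order beforehand, so boids in
+// the same cell sit in adjacent memory and the reads are contiguous
+for (int i = start_index; i <= end_index; i++) {
+  glm::vec3 neighbor_pos = pos[i];
+  // ... apply the three rules ...
+}
+```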
+As the number of boids grows, each boid has more neighbors, so more contiguous reads are issued and the coherent uniform grid benefits even more.
+
+### Visualization
+
+Turning off the visualization improves the framerate, but the difference almost vanishes for both the baseline and the grid methods at extremely large boid counts:
+
+![](images/num2.png)
+
+Likely, once the number of boids becomes large enough, the velocity computation becomes the bottleneck, making the cost of the visualization comparatively insignificant.
+
+### Block Size
+
+The framerate is not significantly affected by the block size; it drops slightly when the block size is very large:
+
+![](images/size.png)
+
+With a very large block size, performance may suffer from scheduling constraints, so the computation may not be parallelized as well.
+
+### Cell Width
+
+Halving the cell width in the grid methods, which makes it equal to the maximal search range of each boid, improves the framerate:
+
+![](images/cell.png)
+
+The original cell width is twice the maximal search range of each boid. Halving it means that 27 neighbor cells must be checked per boid instead of 8, but it does not follow that performance must drop: each cell becomes smaller and contains fewer boids, so fewer candidates are tested overall, and the framerate actually improves a little instead.
+
+To use the half cell width, toggle this macro defined in `src/kernel.cu`:
+```
+// toggle for halving cell width
+#define HALF_CELL_WIDTH 1
+```
+
+## Extra Credit
+
+### Grid-Looping Optimization
+
+Instead of hard-coding a search over a designated area (8 or 27 neighbor cells), this optimization limits the search to exactly those grid cells that have any part of them within the maximal search range of each boid.
+
+To enable this feature, toggle this macro defined in `src/kernel.cu`:
+
+```
+// toggle for grid-looping optimization
+#define GRID_LOOPING 1
+```
-### (TODO: Your README)
+The performance test shows an improvement from applying this optimization; a sketch of the cell-range computation and the measured results follow below.
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
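+In rough terms (reusing the variable names from `src/kernel.cu`), the searched cell range is derived from the boid's own position and the maximal rule distance instead of a fixed 8- or 27-cell pattern:
+
+```
+// cells overlapped by the search sphere of radius max_distance
+glm::vec3 offset = (this_pos - gridMin) * inverseCellWidth;
+int lower_x = int(offset.x - max_distance * inverseCellWidth);
+int upper_x = int(offset.x + max_distance * inverseCellWidth);
+// ... likewise for y and z, then clamp everything to [0, gridResolution - 1]
+```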
+![](images/grid.png)
\ No newline at end of file
diff --git a/images/cell.png b/images/cell.png
new file mode 100644
index 0000000..115d9e4
Binary files /dev/null and b/images/cell.png differ
diff --git a/images/grid.png b/images/grid.png
new file mode 100644
index 0000000..5c2883b
Binary files /dev/null and b/images/grid.png differ
diff --git a/images/max.gif b/images/max.gif
new file mode 100644
index 0000000..eca2b67
Binary files /dev/null and b/images/max.gif differ
diff --git a/images/num1.png b/images/num1.png
new file mode 100644
index 0000000..b68efae
Binary files /dev/null and b/images/num1.png differ
diff --git a/images/num2.png b/images/num2.png
new file mode 100644
index 0000000..deffb8a
Binary files /dev/null and b/images/num2.png differ
diff --git a/images/representative_image.gif b/images/representative_image.gif
new file mode 100644
index 0000000..8722391
Binary files /dev/null and b/images/representative_image.gif differ
diff --git a/images/representative_image2.gif b/images/representative_image2.gif
new file mode 100644
index 0000000..99e5238
Binary files /dev/null and b/images/representative_image2.gif differ
diff --git a/images/size.png b/images/size.png
new file mode 100644
index 0000000..9218a87
Binary files /dev/null and b/images/size.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..2dbb14d 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -54,6 +54,15 @@ void checkCUDAError(const char *msg, int line = -1) {
 /*! Size of the starting area in simulation space. */
 #define scene_scale 100.0f

+// indicate an empty cell
+#define NO_BOID -1
+
+// toggle for halving cell width
+#define HALF_CELL_WIDTH 0
+
+// toggle for grid-looping optimization
+#define GRID_LOOPING 1
+
 /***********************************************
 * Kernel state (pointers are device pointers) *
 ***********************************************/
@@ -85,6 +94,7 @@
 int *dev_gridCellEndIndices;   // to this cell?

 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_pos2;

 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -157,7 +167,11 @@ void Boids::initSimulation(int N) {
   checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");

   // LOOK-2.1 computing grid params
+#if HALF_CELL_WIDTH
+  gridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+#else
   gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+#endif
   int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
   gridSideCount = 2 * halfSideCount;

@@ -169,6 +183,25 @@
   gridMinimum.z -= halfGridWidth;

   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
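+  // Needed for TODO-2.1/2.3: per-boid index pairs for the key/value sort,
+  // per-cell start/end tables, and a scratch position buffer used when
+  // reshuffling boid data into cell order for the coherent grid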
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_pos2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos2 failed!");
+
+  // Wrap device vectors in thrust iterators
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
   cudaDeviceSynchronize();
 }
@@ -231,9 +264,53 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 */
 __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+  glm::vec3 pos_center(0.0f, 0.0f, 0.0f);
+  int rule1_num_neighbor = 0;
+
+  for (int i = 0; i < N; i++) {
+    if (i != iSelf && glm::distance(pos[i], pos[iSelf]) < rule1Distance) {
+      pos_center += pos[i];
+      rule1_num_neighbor++;
+    }
+  }
+
+  // no rule1 velocity when there are no neighbors
+  glm::vec3 rule1_vel(0.0f, 0.0f, 0.0f);
+  if (rule1_num_neighbor > 0) {
+    pos_center /= rule1_num_neighbor;
+    rule1_vel = (pos_center - pos[iSelf]) * rule1Scale;
+  }
+
   // Rule 2: boids try to stay a distance d away from each other
+  glm::vec3 rule2_vel(0.0f, 0.0f, 0.0f);
+
+  for (int i = 0; i < N; i++) {
+    if (i != iSelf && glm::distance(pos[i], pos[iSelf]) < rule2Distance) {
+      rule2_vel -= (pos[i] - pos[iSelf]);
+    }
+  }
+
+  rule2_vel *= rule2Scale;
+
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 avg_vel(0.0f, 0.0f, 0.0f);
+  int rule3_num_neighbor = 0;
+
+  for (int i = 0; i < N; i++) {
+    if (i != iSelf && glm::distance(pos[i], pos[iSelf]) < rule3Distance) {
+      avg_vel += vel[i];
+      rule3_num_neighbor++;
+    }
+  }
+
+  if (rule3_num_neighbor > 0) {
+    avg_vel /= rule3_num_neighbor;
+  }
+
+  // BUG: self velocity not subtracted
+  glm::vec3 rule3_vel = avg_vel * rule3Scale;
+
+  return rule1_vel + rule2_vel + rule3_vel;
 }

 /**
@@ -243,8 +320,21 @@
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
   // Compute a new velocity based on pos and vel1
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  glm::vec3 new_vel = vel1[index] + computeVelocityChange(N, index, pos, vel1);
+
   // Clamp the speed
+  float speed = glm::length(new_vel);
+  if (speed > maxSpeed) {
+    new_vel = new_vel / speed * maxSpeed;
+  }
+
   // Record the new velocity into vel2. Question: why NOT vel1?
+  vel2[index] = new_vel;
 }

 /**
@@ -285,10 +375,21 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
 __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 gridMin, float inverseCellWidth,
   glm::vec3 *pos, int *indices, int *gridIndices) {
-  // TODO-2.1
-  // - Label each boid with the index of its grid cell.
-  // - Set up a parallel array of integer indices as pointers to the actual
-  //   boid data in pos and vel1/vel2
+  // TODO-2.1
+  // - Label each boid with the index of its grid cell.
+  // - Set up a parallel array of integer indices as pointers to the actual
+  //   boid data in pos and vel1/vel2
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  // init the data pointers
+  indices[index] = index;
+
+  // find the cell index from the position relative to the grid minimum
+  glm::vec3 offset = (pos[index] - gridMin) * inverseCellWidth;
+  gridIndices[index] = gridIndex3Dto1D(int(offset.x), int(offset.y), int(offset.z), gridResolution);
 }

 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,6 +407,28 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  // a cell starts where the grid index differs from the previous boid's;
+  // no early returns, so a boid at either end of the sorted array can both
+  // start and end its cell
+  if (index == 0 || particleGridIndices[index] != particleGridIndices[index - 1]) {
+    gridCellStartIndices[particleGridIndices[index]] = index;
+  }
+
+  // a cell ends where the grid index differs from the next boid's
+  if (index == N - 1 || particleGridIndices[index] != particleGridIndices[index + 1]) {
+    gridCellEndIndices[particleGridIndices[index]] = index;
+  }
 }

 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -322,6 +445,135 @@
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  // find the position of this boid
+  int this_index = particleArrayIndices[index];
+  glm::vec3 this_pos = pos[this_index];
+
+#if GRID_LOOPING
+  // find the neighbor cells within max_distance (a float max; imax would truncate)
+  float max_distance = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+  glm::vec3 offset = (this_pos - gridMin) * inverseCellWidth;
+  int lower_x = int(offset.x - max_distance * inverseCellWidth);
+  int upper_x = int(offset.x + max_distance * inverseCellWidth);
+  int lower_y = int(offset.y - max_distance * inverseCellWidth);
+  int upper_y = int(offset.y + max_distance * inverseCellWidth);
+  int lower_z = int(offset.z - max_distance * inverseCellWidth);
+  int upper_z = int(offset.z + max_distance * inverseCellWidth);
+#elif HALF_CELL_WIDTH
+  // find the 27 neighbor cells
+  glm::vec3 offset = (this_pos - gridMin) * inverseCellWidth;
+  int lower_x = int(offset.x) - 1;
+  int upper_x = int(offset.x) + 1;
+  int lower_y = int(offset.y) - 1;
+  int upper_y = int(offset.y) + 1;
+  int lower_z = int(offset.z) - 1;
+  int upper_z = int(offset.z) + 1;
+#else
+  // find the 8 neighbor cells, picking the half of each axis the boid lies in
+  glm::vec3 offset = (this_pos - gridMin) * inverseCellWidth;
+  int lower_x = (offset.x - int(offset.x)) > 0.5f ? int(offset.x) : int(offset.x) - 1;
+  int upper_x = lower_x + 1;
+  int lower_y = (offset.y - int(offset.y)) > 0.5f ? int(offset.y) : int(offset.y) - 1;
+  int upper_y = lower_y + 1;
+  int lower_z = (offset.z - int(offset.z)) > 0.5f ? int(offset.z) : int(offset.z) - 1;
+  int upper_z = lower_z + 1;
+#endif
+
+  // clamp the cell indices to the grid
+  lower_x = imax(lower_x, 0);
+  upper_x = imin(upper_x, gridResolution - 1);
+  lower_y = imax(lower_y, 0);
+  upper_y = imin(upper_y, gridResolution - 1);
+  lower_z = imax(lower_z, 0);
+  upper_z = imin(upper_z, gridResolution - 1);
+
+  glm::vec3 pos_center(0.0f, 0.0f, 0.0f);
+  int rule1_num_neighbor = 0;
+  glm::vec3 rule2_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 avg_vel(0.0f, 0.0f, 0.0f);
+  int rule3_num_neighbor = 0;
+
+  // iterate z, then y, then x: x-adjacent cells have consecutive 1D indices,
+  // so the innermost loop walks contiguous cell data
+  for (int z = lower_z; z <= upper_z; z++) {
+    for (int y = lower_y; y <= upper_y; y++) {
+      for (int x = lower_x; x <= upper_x; x++) {
+        int cell_index = gridIndex3Dto1D(x, y, z, gridResolution);
+        int start_index = gridCellStartIndices[cell_index];
+        int end_index = gridCellEndIndices[cell_index];
+
+        // skip empty cells
+        if (start_index == NO_BOID || end_index == NO_BOID) {
+          continue;
+        }
+
+        // visit every boid in the cell and accumulate its velocity contribution
+        for (int i = start_index; i <= end_index; i++) {
+          int neighbor_index = particleArrayIndices[i];
+          float distance = glm::distance(pos[neighbor_index], pos[this_index]);
+
+          // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+          if (neighbor_index != this_index && distance < rule1Distance) {
+            pos_center += pos[neighbor_index];
+            rule1_num_neighbor++;
+          }
+
+          // Rule 2: boids try to stay a distance d away from each other
+          if (neighbor_index != this_index && distance < rule2Distance) {
+            rule2_vel -= (pos[neighbor_index] - pos[this_index]);
+          }
+
+          // Rule 3: boids try to match the speed of surrounding boids
+          if (neighbor_index != this_index && distance < rule3Distance) {
+            avg_vel += vel1[neighbor_index];
+            rule3_num_neighbor++;
+          }
+        }
+      }
+    }
+  }
+
+  // no rule1 velocity when there are no neighbors
+  glm::vec3 rule1_vel(0.0f, 0.0f, 0.0f);
+  if (rule1_num_neighbor > 0) {
+    pos_center /= rule1_num_neighbor;
+    rule1_vel = (pos_center - pos[this_index]) * rule1Scale;
+  }
+
+  rule2_vel *= rule2Scale;
+
+  if (rule3_num_neighbor > 0) {
+    avg_vel /= rule3_num_neighbor;
+  }
+  // BUG: self velocity not subtracted
+  glm::vec3 rule3_vel = avg_vel * rule3Scale;
+
+  glm::vec3 new_vel = vel1[this_index] + rule1_vel + rule2_vel + rule3_vel;
+
+  // Clamp the speed
+  float speed = glm::length(new_vel);
+  if (speed > maxSpeed) {
+    new_vel = new_vel / speed * maxSpeed;
+  }
+
+  vel2[this_index] = new_vel;
+}
+
+// reshuffle all the particle data by the rearranged array index buffer
+__global__ void kernReshufflePosVel(int N, int* particleArrayIndices,
+  glm::vec3* pos, glm::vec3* pos2, glm::vec3* vel1, glm::vec3* vel2) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  pos2[index] = pos[particleArrayIndices[index]];
+  vel2[index] = vel1[particleArrayIndices[index]];
 }

 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,6 +593,123 @@
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
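+  // Same search as the scattered kernel, with one key difference: pos and
+  // vel1 were already reshuffled into cell order, so each boid is addressed
+  // directly by its sorted index and the particleArrayIndices indirection
+  // disappears, keeping neighbor reads contiguous.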
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  // find the position of this boid; the data is already in sorted cell order
+  int this_index = index;
+  glm::vec3 this_pos = pos[this_index];
+
+#if GRID_LOOPING
+  // find the neighbor cells within max_distance (a float max; imax would truncate)
+  float max_distance = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+  glm::vec3 offset = (this_pos - gridMin) * inverseCellWidth;
+  int lower_x = int(offset.x - max_distance * inverseCellWidth);
+  int upper_x = int(offset.x + max_distance * inverseCellWidth);
+  int lower_y = int(offset.y - max_distance * inverseCellWidth);
+  int upper_y = int(offset.y + max_distance * inverseCellWidth);
+  int lower_z = int(offset.z - max_distance * inverseCellWidth);
+  int upper_z = int(offset.z + max_distance * inverseCellWidth);
+#elif HALF_CELL_WIDTH
+  // find the 27 neighbor cells
+  glm::vec3 offset = (this_pos - gridMin) * inverseCellWidth;
+  int lower_x = int(offset.x) - 1;
+  int upper_x = int(offset.x) + 1;
+  int lower_y = int(offset.y) - 1;
+  int upper_y = int(offset.y) + 1;
+  int lower_z = int(offset.z) - 1;
+  int upper_z = int(offset.z) + 1;
+#else
+  // find the 8 neighbor cells, picking the half of each axis the boid lies in
+  glm::vec3 offset = (this_pos - gridMin) * inverseCellWidth;
+  int lower_x = (offset.x - int(offset.x)) > 0.5f ? int(offset.x) : int(offset.x) - 1;
+  int upper_x = lower_x + 1;
+  int lower_y = (offset.y - int(offset.y)) > 0.5f ? int(offset.y) : int(offset.y) - 1;
+  int upper_y = lower_y + 1;
+  int lower_z = (offset.z - int(offset.z)) > 0.5f ? int(offset.z) : int(offset.z) - 1;
+  int upper_z = lower_z + 1;
+#endif
+
+  // clamp the cell indices to the grid
+  lower_x = imax(lower_x, 0);
+  upper_x = imin(upper_x, gridResolution - 1);
+  lower_y = imax(lower_y, 0);
+  upper_y = imin(upper_y, gridResolution - 1);
+  lower_z = imax(lower_z, 0);
+  upper_z = imin(upper_z, gridResolution - 1);
+
+  glm::vec3 pos_center(0.0f, 0.0f, 0.0f);
+  int rule1_num_neighbor = 0;
+  glm::vec3 rule2_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 avg_vel(0.0f, 0.0f, 0.0f);
+  int rule3_num_neighbor = 0;
+
+  // iterate z, then y, then x: x-adjacent cells have consecutive 1D indices,
+  // so the innermost loop walks contiguous cell data
+  for (int z = lower_z; z <= upper_z; z++) {
+    for (int y = lower_y; y <= upper_y; y++) {
+      for (int x = lower_x; x <= upper_x; x++) {
+        int cell_index = gridIndex3Dto1D(x, y, z, gridResolution);
+        int start_index = gridCellStartIndices[cell_index];
+        int end_index = gridCellEndIndices[cell_index];
+
+        // skip empty cells
+        if (start_index == NO_BOID || end_index == NO_BOID) {
+          continue;
+        }
+
+        // visit every boid in the cell and accumulate its velocity contribution
+        for (int i = start_index; i <= end_index; i++) {
+          int neighbor_index = i;
+          float distance = glm::distance(pos[neighbor_index], pos[this_index]);
+
+          // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+          if (neighbor_index != this_index && distance < rule1Distance) {
+            pos_center += pos[neighbor_index];
+            rule1_num_neighbor++;
+          }
+
+          // Rule 2: boids try to stay a distance d away from each other
+          if (neighbor_index != this_index && distance < rule2Distance) {
+            rule2_vel -= (pos[neighbor_index] - pos[this_index]);
+          }
+
+          // Rule 3: boids try to match the speed of surrounding boids
+          if (neighbor_index != this_index && distance < rule3Distance) {
+            avg_vel += vel1[neighbor_index];
+            rule3_num_neighbor++;
+          }
+        }
+      }
+    }
+  }
+
+  // no rule1 velocity when there are no neighbors
+  glm::vec3 rule1_vel(0.0f, 0.0f, 0.0f);
+  if (rule1_num_neighbor > 0) {
+    pos_center /= rule1_num_neighbor;
+    rule1_vel = (pos_center - pos[this_index]) * rule1Scale;
+  }
+
+  rule2_vel *= rule2Scale;
+
+  if (rule3_num_neighbor > 0) {
+    avg_vel /= rule3_num_neighbor;
+  }
+  // BUG: self velocity not subtracted
+  glm::vec3 rule3_vel = avg_vel * rule3Scale;
+
+  glm::vec3 new_vel = vel1[this_index] + rule1_vel + rule2_vel + rule3_vel;
+
+  // Clamp the speed
+  float speed = glm::length(new_vel);
+  if (speed > maxSpeed) {
+    new_vel = new_vel / speed * maxSpeed;
+  }
+
+  vel2[this_index] = new_vel;
 }

 /**
@@ -348,7 +717,16 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 */
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!");
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  // TODO-1.2 ping-pong the velocity buffers
+  cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
 }

 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +742,36 @@
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 fullBlocksPerGrid_cell((gridCellCount + blockSize - 1) / blockSize);
+
+  // assign boids to cells
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  // sort boids by cell index
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  // init the cell start & end arrays
+  kernResetIntBuffer<<<fullBlocksPerGrid_cell, blockSize>>>(gridCellCount, dev_gridCellStartIndices, NO_BOID);
+  checkCUDAErrorWithLine("kernResetIntBuffer failed!");
+  kernResetIntBuffer<<<fullBlocksPerGrid_cell, blockSize>>>(gridCellCount, dev_gridCellEndIndices, NO_BOID);
+  checkCUDAErrorWithLine("kernResetIntBuffer failed!");
+
+  // identify cell start & end
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  // update velocities
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+
+  // update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  // ping-pong the velocity buffers
+  cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
 }

 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +790,42 @@
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
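+  // The difference from the scattered step: after sorting, kernReshufflePosVel
+  // gathers pos/vel1 into cell order via dev_pos2/dev_vel2, and the results are
+  // copied back before the neighbor search so it can index both arrays directly.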
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 fullBlocksPerGrid_cell((gridCellCount + blockSize - 1) / blockSize);
+
+  // assign boids to cells
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  // sort boids by cell index
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  // init the cell start & end arrays
+  kernResetIntBuffer<<<fullBlocksPerGrid_cell, blockSize>>>(gridCellCount, dev_gridCellStartIndices, NO_BOID);
+  checkCUDAErrorWithLine("kernResetIntBuffer failed!");
+  kernResetIntBuffer<<<fullBlocksPerGrid_cell, blockSize>>>(gridCellCount, dev_gridCellEndIndices, NO_BOID);
+  checkCUDAErrorWithLine("kernResetIntBuffer failed!");
+
+  // identify cell start & end
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  // reshuffle positions & velocities into cell order
+  kernReshufflePosVel<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_pos, dev_pos2, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernReshufflePosVel failed!");
+  cudaMemcpy(dev_pos, dev_pos2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+  cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+
+  // update velocities
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!");
+
+  // update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  // ping-pong the velocity buffers
+  cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
 }

 void Boids::endSimulation() {
@@ -390,11 +834,16 @@
   cudaFree(dev_pos);

   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  cudaFree(dev_pos2);
 }

 void Boids::unitTest() {
   // LOOK-1.2 Feel free to write additional tests here.
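+  // The test below exercises thrust::sort_by_key on a small key/value array,
+  // the same primitive the grid steps use to sort boids by cell index.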
-  // test unstable sort
   int *dev_intKeys;
   int *dev_intValues;
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..6a2bc5e 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -17,6 +17,14 @@
 #define UNIFORM_GRID 0
 #define COHERENT_GRID 0

+// toggle for measuring just the simulation CUDA kernel
+#define KERNEL_FPS 1
+
+// CUDA events for recording time stamps
+#if KERNEL_FPS
+cudaEvent_t kernel_start, kernel_stop;
+#endif
+
 // LOOK-1.2 - change this to adjust particle count in the simulation
 const int N_FOR_VIS = 5000;
 const float DT = 0.2f;
@@ -196,6 +204,10 @@ void initShaders(GLuint * program) {
     cudaGLMapBufferObject((void**)&dptrVertVelocities, boidVBO_velocities);

     // execute the kernel
+#if KERNEL_FPS
+    cudaEventRecord(kernel_start);
+#endif
+
 #if UNIFORM_GRID && COHERENT_GRID
     Boids::stepSimulationCoherentGrid(DT);
 #elif UNIFORM_GRID
@@ -204,6 +216,11 @@
     Boids::stepSimulationNaive(DT);
 #endif

+#if KERNEL_FPS
+    cudaEventRecord(kernel_stop);
+    cudaEventSynchronize(kernel_stop);
+#endif
+
 #if VISUALIZE
     Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities);
 #endif
@@ -217,6 +234,13 @@
   double timebase = 0;
   int frame = 0;

+#if KERNEL_FPS
+  cudaEventCreate(&kernel_start);
+  cudaEventCreate(&kernel_stop);
+  double time_elapse = 0;
+  double kernel_fps = 0;
+#endif
+
   Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                      // your CUDA development setup is ready to go.
@@ -224,6 +248,7 @@
     glfwPollEvents();

     frame++;
+#if !KERNEL_FPS
     double time = glfwGetTime();

     if (time - timebase > 1.0) {
@@ -231,13 +256,31 @@
       timebase = time;
       frame = 0;
     }
+#endif

     runCUDA();
+
+#if KERNEL_FPS
+    float milliseconds = 0;
+    cudaEventElapsedTime(&milliseconds, kernel_start, kernel_stop);
+    time_elapse += milliseconds / 1000;
+
+    // display the average FPS of every 5 seconds
+    if (time_elapse >= 5.0f) {
+      kernel_fps = frame / time_elapse;
+      frame = 0;
+      time_elapse = 0;
+    }
+#endif

     std::ostringstream ss;
     ss << "[";
     ss.precision(1);
+#if KERNEL_FPS
+    ss << std::fixed << kernel_fps;
+#else
     ss << std::fixed << fps;
+#endif
     ss << " fps] " << deviceName;
     glfwSetWindowTitle(window, ss.str().c_str());