diff --git a/README.md b/README.md
index d63a6a1..c177cf2 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,145 @@
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
-Project 1 - Flocking**
+# University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Di Lu
+  * [LinkedIn](https://www.linkedin.com/in/di-lu-0503251a2/)
+  * [personal website](https://www.dluisnothere.com/)
+* Tested on: Windows 11, i7-12700H @ 2.30GHz 32GB, NVIDIA GeForce RTX 3050 Ti
 
-### (TODO: Your README)
+## Introduction
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+In this project, I simulate flocking behavior for a 200 x 200 x 200 cube of scattered boids by using CUDA kernel functions
+to calculate their position and velocity on each dT. Based on Craig Reynold's artificial life program, for which a SIGGRAPH paper was written in 1989,
+the following three behaviors are implemented:
+
+1. cohesion - boids move towards the perceived center of mass of their neighbors
+2. separation - boids avoid getting to close to their neighbors
+3. alignment - boids generally try to move with the same direction and speed as
+their neighbors
+
+In the simulation results, the color of each particle is a representation of its velocity.
+
+![Coherent Grid Flocking with 50,000 boids](images/headerResized.gif)
+
+_Coherent Grid Flocking with 50,000 boids_
+
+## Implementation and Results
+To measure the performance of my code, I ran my program on release mode with VSync disabled. There are 
+three implementations: with the first being naive neighbor search, and each subsequent part 
+utilizing more optimizations.
+
+#### Part 1. Naive Boids Simulation
+
+The first simulation is a naive neighbor search, where each boid searches every other boid in existence and checks 
+whether they are within distance for cohesion, separation, or alignment. If a non-self boid is within any such distance,
+then its position and velocity will be taken into account for the respective rule. 
+
+![Naive Boids Simulation](images/naive.gif)
+
+_Naive Grid Flocking with 5,000 boids_
+
+#### Part 2. Uniform Grid Boids
+
+The second simulation is a neighbor search that takes into account the largest neighborhood distance among the 3 rules. 
+The simulation space is divided into grid cubes. Using these cubes, Each boid only needs to check the cubes that overlap
+with its spherical neighborhood.
+
+Each boid calculates the extremities of its reach by using its own radius and position. With these extremities, I can calculate
+the maximum and minimum of my desired cells to scan. Hence, the number of useless boid scans are reduced, resulting in a much
+faster simulation!
+
+![Uniform Boids Simulation](images/uniform.gif)
+
+_Uniform Grid Flocking with 5,000 boids_
+
+#### Part 3. Coherent Grid Boids
+
+The third simulation builds on the second simulation. This time, we also rearrange the position and velocity information such that 
+boids that are in a cell together are also contiguous in memory. 
+
+![Coherent Boids Simulation](images/coherent.gif)
+
+_Coherent Grid Flocking with 5,000 boids_
+
+## Part 3. Overall Performance Analysis
+
+* Number of Boids vs. Performance
+
+![Graph 1](images/graph1.png)
+
+* BlockSize vs. Performance (N = 100,000)
+
+![Graph 2](images/graph2.png)
+
+**For each implementation, how does changing the number of boids affect
+performance? Why do you think this is?**
+
+Across all three implementations, increasing the number of boids will decrease FPS. 
+If the scene scale is the same, then each cell will become more dense as boids increase,
+which means each boid will need to run more sequential operations to check the increased
+number of valid neighbors. 
+
+**For each implementation, how does changing the block count and block size
+affect performance? Why do you think this is?**
+
+As seen from the graph, smaller block count usually results in poorer performance before a certain 
+blockSize is hit. For example, there is a big difference between blockSize == 16 and blockSize == 4, however,
+above blockSize 32 is relatively similar in FPS performance. This behavior is not easily seen in simulations 
+with fewer boids (N = 5,000). This is likely because when blockSize is small, there are less threads that can run 
+in parallel. When blockSize is large, more and more threads can run the same program at the same time. However,
+if the number of threads exceed the number of parallel operations, then there won't be much performance enahancement
+anymore.
+
+**For the coherent uniform grid: did you experience any performance improvements
+with the more coherent uniform grid? Was this the outcome you expected?
+Why or why not?**
+
+I experienced a significant performance improvement with coherent uniform grid when the boid number
+is very high. For example, on 500,000 boids, coherent grid can comfortably give me ~130 FPS, while
+uniform grid just about dies at ~6 FPS. For fewer boids, the performance improvement is not that obvious (for example, 
+FPS is fairly similar between coherent and uniform grids when we have 5,000 boids.) This outcome is expected, because 
+when the number of boids is higher, each cell also contains a lot more boids. If the information is not contiguous in 
+memory, the impact of checking many boids whose information is further apart is more noticeable.
+
+**Did changing cell width and checking 27 vs 8 neighboring cells affect performance?
+Why or why not? Be careful: it is insufficient (and possibly incorrect) to say
+that 27-cell is slower simply because there are more cells to check!**
+
+My understanding is that given scene scale 100 and 100,000 boids, decreasing cell size
+(thus checking more cells) will increase performance. This is because given the same density
+of boids in a cell, we can check more cells in parallel and run less sequential operations 
+for each cell. However, if the scene scale were larger with the same number of boids, then this could potentially 
+decrease efficiency, because we are checking more cells that could be "useless".
+
+To test my thinking, I used the following two scenarios:
+
+_Constants:_ 
+
+* _Coherent grid_
+* _Number of boids: 100,000_ 
+
+1. Scene Scale: 100
+
+I observed around 100 FPS **increase** when I used cell width == neighborhood size.
+
+2. Scene Scale: 200
+
+I observed nearly 200 FPS **decrease** when I used cell width == neighbordhood size.
+
+
+
+## Part 4: More Images and Results!
+![](images/naive50k.png)
+
+_Naive Flocking with 50,000 boids_
+
+![](images/big2.gif)
+
+_Coherent Flocking with 100,000 boids_
+
+![](images/big.png)
+
+_Coherent Flocking with 500,000 boids, could not get a gif of this onto github_
+
+![](images/sceneScale200with100kBoids.png)
+
+_100,000 boids on 200 scene scale instead of 100_
diff --git a/images/27cells.png b/images/27cells.png
new file mode 100644
index 0000000..4099326
Binary files /dev/null and b/images/27cells.png differ
diff --git a/images/big.png b/images/big.png
new file mode 100644
index 0000000..934b8cd
Binary files /dev/null and b/images/big.png differ
diff --git a/images/big2.gif b/images/big2.gif
new file mode 100644
index 0000000..4aab83c
Binary files /dev/null and b/images/big2.gif differ
diff --git a/images/coherent.gif b/images/coherent.gif
new file mode 100644
index 0000000..c2eb40e
Binary files /dev/null and b/images/coherent.gif differ
diff --git a/images/coherent2.gif b/images/coherent2.gif
new file mode 100644
index 0000000..9a1ef63
Binary files /dev/null and b/images/coherent2.gif differ
diff --git a/images/graph1.png b/images/graph1.png
new file mode 100644
index 0000000..0d1bb74
Binary files /dev/null and b/images/graph1.png differ
diff --git a/images/graph2.png b/images/graph2.png
new file mode 100644
index 0000000..2b0e0b0
Binary files /dev/null and b/images/graph2.png differ
diff --git a/images/header.gif b/images/header.gif
new file mode 100644
index 0000000..42dc757
Binary files /dev/null and b/images/header.gif differ
diff --git a/images/header2.gif b/images/header2.gif
new file mode 100644
index 0000000..1b6b05c
Binary files /dev/null and b/images/header2.gif differ
diff --git a/images/headerResized.gif b/images/headerResized.gif
new file mode 100644
index 0000000..1e01b9b
Binary files /dev/null and b/images/headerResized.gif differ
diff --git a/images/naive.gif b/images/naive.gif
new file mode 100644
index 0000000..580cfa6
Binary files /dev/null and b/images/naive.gif differ
diff --git a/images/naive2.gif b/images/naive2.gif
new file mode 100644
index 0000000..a106708
Binary files /dev/null and b/images/naive2.gif differ
diff --git a/images/naive50k.png b/images/naive50k.png
new file mode 100644
index 0000000..ea38d71
Binary files /dev/null and b/images/naive50k.png differ
diff --git a/images/sceneScale200with100kBoids.png b/images/sceneScale200with100kBoids.png
new file mode 100644
index 0000000..5e227ef
Binary files /dev/null and b/images/sceneScale200with100kBoids.png differ
diff --git a/images/uniform.gif b/images/uniform.gif
new file mode 100644
index 0000000..8f110a0
Binary files /dev/null and b/images/uniform.gif differ
diff --git a/images/uniform2.gif b/images/uniform2.gif
new file mode 100644
index 0000000..fbd867d
Binary files /dev/null and b/images/uniform2.gif differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..cb24ae8 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -5,8 +5,10 @@
 #include <glm/glm.hpp>
 #include "utilityCore.hpp"
 #include "kernel.h"
+#include <device_launch_parameters.h>
 
 // LOOK-2.1 potentially useful for doing grid-based neighbor search
+// Di: Returns max of the two or min of the two elements for comparison.
 #ifndef imax
 #define imax( a, b ) ( ((a) > (b)) ? (a) : (b) )
 #endif
@@ -41,9 +43,9 @@ void checkCUDAError(const char *msg, int line = -1) {
 
 // LOOK-1.2 Parameters for the boids algorithm.
 // These worked well in our reference implementation.
-#define rule1Distance 5.0f
-#define rule2Distance 3.0f
-#define rule3Distance 5.0f
+#define rule1Distance 5.0f // 5.0f
+#define rule2Distance 3.0f // 3.0f
+#define rule3Distance 5.0f // 5.0f
 
 #define rule1Scale 0.01f
 #define rule2Scale 0.1f
@@ -52,7 +54,7 @@ void checkCUDAError(const char *msg, int line = -1) {
 #define maxSpeed 1.0f
 
 /*! Size of the starting area in simulation space. */
-#define scene_scale 100.0f
+#define scene_scale 200.0f
 
 /***********************************************
 * Kernel state (pointers are device pointers) *
@@ -74,8 +76,10 @@ glm::vec3 *dev_vel2;
 // pointers on your own too.
 
 // For efficient sorting and the uniform grid. These should always be parallel.
+// Di all initialized on GPU
 int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle?
 int *dev_particleGridIndices; // What grid cell is this particle in?
+
 // needed for use with thrust
 thrust::device_ptr<int> dev_thrust_particleArrayIndices;
 thrust::device_ptr<int> dev_thrust_particleGridIndices;
@@ -85,6 +89,11 @@ int *dev_gridCellEndIndices;   // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+// glm::vec3* dev_posSortedPrev;
+glm::vec3* dev_posSorted;
+glm::vec3* dev_vel1Sorted;
+// thrust::device_ptr<int> dev_thrust_particleArrayIndicesValue;
+
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -125,6 +134,7 @@ __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) {
 */
 __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) {
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  // some threads are extra and thus won't enter this case.
   if (index < N) {
     glm::vec3 rand = generateRandomVec3(time, index);
     arr[index].x = scale * rand.x;
@@ -151,13 +161,21 @@ void Boids::initSimulation(int N) {
   cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3));
   checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
 
+  // di added
+  cudaMalloc((void**)&dev_posSorted, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_posSorted failed!");
+
+  cudaMalloc((void**)&dev_vel1Sorted, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_posSortedVel1 failed!");
+
   // LOOK-1.2 - This is a typical CUDA kernel invocation.
   kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
     dev_pos, scene_scale);
   checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
 
   // LOOK-2.1 computing grid params
-  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  // remove 2.0 *
+  gridCellWidth = 2.0 * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
   int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
   gridSideCount = 2 * halfSideCount;
 
@@ -169,6 +187,23 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+  // dev_posSortedPrev = dev_pos;
+
   cudaDeviceSynchronize();
 }
 
@@ -210,8 +245,8 @@ __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float
 void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) {
   dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
 
-  kernCopyPositionsToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_pos, vbodptr_positions, scene_scale);
-  kernCopyVelocitiesToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
+  kernCopyPositionsToVBO <<<fullBlocksPerGrid, blockSize >>>(numObjects, dev_pos, vbodptr_positions, scene_scale);
+  kernCopyVelocitiesToVBO <<<fullBlocksPerGrid, blockSize >>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
 
   checkCUDAErrorWithLine("copyBoidsToVBO failed!");
 
@@ -233,7 +268,57 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  //// return glm::vec3(0.0f, 0.0f, 0.0f);
+    glm::vec3 velocityChange = glm::vec3(0.f);
+    glm::vec3 selfPos = pos[iSelf];
+ 
+    // Rule 1
+    // find the average position of all the birds
+    glm::vec3 center = glm::vec3(0.f);
+    int numValid1 = 0;
+
+    glm::vec3 c = glm::vec3(0.f);
+
+    glm::vec3 perceivedVelocity = glm::vec3(0.f);
+    int numValid3 = 0;
+
+    for (int i = 0; i < N; i++) {
+        glm::vec3 bPos = pos[i];
+        float dist = glm::distance(selfPos, bPos);
+
+        // Rule 1
+        if (i != iSelf && dist < rule1Distance) {
+            center += pos[i];
+            numValid1 += 1;
+        }
+
+        // Rule 2
+        if (i != iSelf && dist < rule2Distance) {
+            c -= (bPos - selfPos);
+        }
+
+        // Rule 3
+        if (i != iSelf && dist < rule3Distance) {
+            perceivedVelocity += vel[i];
+            numValid3 += 1;
+        }
+
+    }
+
+    if (numValid1 != 0) {
+        center /= numValid1;
+        velocityChange += (center - selfPos) * rule1Scale;
+    }
+
+    velocityChange += c * rule2Scale;
+
+    if (numValid3 != 0) {
+        perceivedVelocity /= numValid3;
+        velocityChange += perceivedVelocity * rule3Scale;
+    }
+
+    return velocityChange;
+
 }
 
 /**
@@ -244,7 +329,23 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
   // Compute a new velocity based on pos and vel1
   // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+  // Record the new velocity into vel2. Question: why NOT vel1? -> 
+ 
+    // we don't want to overwrite original velocities because other boids still need to refer to it to compute their vels.
+    // and vel1 and vel2 are all being accessed at the same time so if info in vel1 gets changed while another thread is accessing
+    // that would be bad.
+    // once we finish computing these vels, vel1 becomes our "write" array and vel2 becomes our "read" array
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) {
+        return;
+    }
+    glm::vec3 velChange = computeVelocityChange(N, index, pos, vel1);
+    glm::vec3 finalVel = vel1[index] + velChange;
+    float speed = glm::length(finalVel);
+    if (speed > maxSpeed) {
+        finalVel = finalVel * maxSpeed / speed;
+    }
+    vel2[index] = finalVel;
 }
 
 /**
@@ -258,8 +359,14 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
     return;
   }
   glm::vec3 thisPos = pos[index];
+
+  // debug value
+  glm::vec3 thisVel = vel[index];
+
   thisPos += vel[index] * dt;
 
+  // printf("vel x: %f, vel y: %f, vel z: %f", vel[index].x, vel[index].y, vel[index].z);
+
   // Wrap the boids around so we don't lose them
   thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x;
   thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y;
@@ -278,21 +385,44 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
 //          for(x)
 //            for(y)
 //             for(z)? Or some other order?
-__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
-  return x + y * gridResolution + z * gridResolution * gridResolution;
+__device__ int gridIndex3Dto1D(int x, int y, int z, int gridSideCount) {
+  return x + y * gridSideCount + z * gridSideCount * gridSideCount;
 }
 
-__global__ void kernComputeIndices(int N, int gridResolution,
+__global__ void kernComputeIndices(int N, int gridSideCount,
   glm::vec3 gridMin, float inverseCellWidth,
   glm::vec3 *pos, int *indices, int *gridIndices) {
     // TODO-2.1
     // - Label each boid with the index of its grid cell.
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+    
+    // this kern function is called in parallel by each boid so it only needs to be 
+    int boidIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    if (boidIdx < N) {
+        // compute components
+        glm::vec3 boidPos = pos[boidIdx];
+
+        int ix = (int) (std::floor((boidPos.x - gridMin.x) * inverseCellWidth));
+        int iy = (int) (std::floor((boidPos.y - gridMin.y) * inverseCellWidth));
+        int iz = (int) (std::floor((boidPos.z - gridMin.z) * inverseCellWidth));
+
+        // compute 1D cell idx from 3D
+        int cellIdx = gridIndex3Dto1D(ix, iy, iz, gridSideCount);
+
+        // fill dev_particleGridIndices at boidIdx
+        gridIndices[boidIdx] = cellIdx;
+
+        // fill dev_particleArrayIndices, starts exactly as boidIdx initially.
+        indices[boidIdx] = boidIdx;
+    }
+
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
 //          does not enclose any boids
+// pass in an invalid array value such as -1 for cells that don't contain boids
 __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if (index < N) {
@@ -306,10 +436,156 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+
+    // post sorting
+    // also looks like it's called in parallel by boids.
+    // note that idx is NOT boidIdx
+    // for current idx, check its own cell in GridIndices and one before it.
+    // if idx 0, must be start cell, and set its start location to itself. then check for end cell status
+    // if idx N - 1, must be end cell, and set its end location to itself. then check for start cell status
+    // otherwise, if different from idx - 1, set as start cell.
+    // if different from idx + 1, set as end cell.
+    // not sure if most efficient way to do it honestly.
+
+    int accessIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (accessIdx < N) {
+        int gridCell = particleGridIndices[accessIdx];
+
+        if (accessIdx == 0) {
+            gridCellStartIndices[gridCell] = accessIdx;
+
+            // check for next grid cell as well
+            int nextGridCell = particleGridIndices[accessIdx + 1];
+            if (gridCell != nextGridCell) {
+                gridCellEndIndices[gridCell] = accessIdx;
+            }
+        }
+        else if (accessIdx == N - 1) {
+            gridCellEndIndices[gridCell] = accessIdx;
+
+            // check if gridCell == prevGridCell. 
+            int prevGridCell = particleGridIndices[accessIdx - 1];
+            if (gridCell != prevGridCell) {
+                gridCellStartIndices[gridCell] = accessIdx;
+            }
+        }
+        else {
+            int prevGridCell = particleGridIndices[accessIdx - 1];
+            int nextGridCell = particleGridIndices[accessIdx + 1];
+            if (gridCell != prevGridCell) {
+                gridCellStartIndices[gridCell] = accessIdx;
+            }
+            if (gridCell != nextGridCell) {
+                gridCellEndIndices[gridCell] = accessIdx;
+            }
+        }
+    }
+}
+
+// device helper function
+__device__ glm::vec3 kernComputeVelocityChangeScattered(int N, glm::vec3 gridMin, 
+    float cellWidth, int iSelf, glm::vec3* pos, 
+    glm::vec3* vel1, int* gridCellStartIndices, 
+    int* gridCellEndIndices, int* particleArrayIndices, int sideCount) {
+
+    //int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
+    int selfIdx = iSelf;
+    if (selfIdx < N) {
+        glm::vec3 selfPos = pos[selfIdx];
+        glm::vec3 velocityChange = glm::vec3(0.f);
+
+        float radius = imax(imax(rule1Distance, rule2Distance), rule3Distance);
+
+        int numValid1 = 0;
+        int numValid3 = 0;
+
+        glm::vec3 center = glm::vec3(0.f);
+        glm::vec3 c = glm::vec3(0.f);
+        glm::vec3 perceivedVelocity = glm::vec3(0.f);
+
+        // find maximum of entire grid in cell space coordinates.
+        glm::vec3 cellSpaceMax = glm::vec3(sideCount - 1, sideCount - 1, sideCount - 1);
+
+        // create a bounding box based on the location of boid within the grid in cell space coordinates.
+        float xMax, xMin, yMax, yMin, zMax, zMin = 0;
+
+        // xMax and co. are all cell space coordinates and not world space coordaintes.
+        // find the grid cell associated with radMaxx, radMaxy, radMaxz, radMinx, radMiny, radMinz
+        // set xMax, xMin, yMax, yMin, zMax, zMin etc by clamping between that and 
+        int ixRadMax = std::floor((selfPos.x + radius - gridMin.x) / cellWidth);
+        int iyRadMax = std::floor((selfPos.y + radius - gridMin.y) / cellWidth);
+        int izRadMax = std::floor((selfPos.z + radius - gridMin.z) / cellWidth);
+        int ixRadMin = std::floor((selfPos.x - radius - gridMin.x) / cellWidth);
+        int iyRadMin = std::floor((selfPos.y - radius - gridMin.y) / cellWidth);
+        int izRadMin = std::floor((selfPos.z - radius - gridMin.z) / cellWidth);
+
+        xMax = imin(ixRadMax, cellSpaceMax.x);
+        yMax = imin(iyRadMax, cellSpaceMax.y);
+        zMax = imin(izRadMax, cellSpaceMax.z);
+        xMin = imax(ixRadMin, gridMin.x);
+        yMin = imax(iyRadMin, gridMin.y);
+        zMin = imax(izRadMin, gridMin.z);
+
+        // loop within the bounding box.
+        // these coordinates have already been clamped
+        // without consideration of which axis is best to check first
+        // started with x,y,z
+
+        for (float z = zMin; z <= zMax; z += 1) {
+            for (float y = yMin; y <= yMax; y += 1) {
+                for (float x = xMin; x <= xMax; x += 1) {
+                    int gridIdx = gridIndex3Dto1D(x, y, z, sideCount);
+                    int startIdx = gridCellStartIndices[gridIdx];
+                    int endIdx = gridCellEndIndices[gridIdx];
+
+                    if (startIdx > -1) {
+                        // if start idx is > -1, then boids exist
+                        for (int curIdx = startIdx; curIdx <= endIdx; curIdx++) {
+                            int nBoidIdx = particleArrayIndices[curIdx];
+                            glm::vec3 nBoidPos = pos[nBoidIdx];
+
+                            float dist = glm::distance(selfPos, nBoidPos);
+
+                            // Rule 1
+                            if (nBoidIdx != selfIdx && dist < rule1Distance) {
+                                center += pos[nBoidIdx];
+                                numValid1 += 1;
+                            }
+
+                            // Rule 2
+                            if (nBoidIdx != selfIdx && dist < rule2Distance) {
+                                c -= (nBoidPos - selfPos);
+                            }
+
+                            // Rule 3
+                            if (nBoidIdx != selfIdx && dist < rule3Distance) {
+                                perceivedVelocity += vel1[nBoidIdx];
+                                numValid3 += 1;
+                            }
+                        }
+                    }
+                }
+            }
+        }
+
+        if (numValid1 != 0) {
+            center /= numValid1;
+            velocityChange += (center - selfPos) * rule1Scale;
+        }
+
+        velocityChange += c * rule2Scale;
+
+        if (numValid3 != 0) {
+            perceivedVelocity /= numValid3;
+            velocityChange += perceivedVelocity * rule3Scale;
+        }
+
+        return velocityChange;
+    }
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
-  int N, int gridResolution, glm::vec3 gridMin,
+  int N, int gridSideCount, glm::vec3 gridMin,
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   int *particleArrayIndices,
@@ -322,13 +598,135 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+
+    // for a given boid, try all cells first, then optimize by doing quadrant search later.
+    // 1. split cell into octants, identify the octant it belongs in
+    // 2. get 8 neighboring octants. figure out which octants contain cells. --> get all filled cells regardless of distance
+    // 2. Scan the appropriate 8 or so octants based on the result
+    // 3. for each cell, read the start/end endices and read boid in each cell. apply velocity changes.
+    // 4. clamp speed as usual
+
+    int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (selfIdx < N) {
+        glm::vec3 velocityChange = kernComputeVelocityChangeScattered(N, gridMin, cellWidth, selfIdx, pos, vel1, gridCellStartIndices, 
+            gridCellEndIndices, particleArrayIndices, gridSideCount);
+
+        // clamp speed and set new velocity
+        glm::vec3 finalVel = vel1[selfIdx] + velocityChange;
+        float speed = glm::length(finalVel);
+        if (speed > maxSpeed) {
+            finalVel = finalVel * maxSpeed / speed;
+        }
+
+        vel2[selfIdx] = finalVel;
+    }
+}
+
+// helper device
+__device__ glm::vec3 kernComputeVelocityChangeCoherent(int N, glm::vec3 gridMin,
+    float cellWidth, int iSelf, glm::vec3* posSorted,
+    glm::vec3* vel1Sorted, int* gridCellStartIndices,
+    int* gridCellEndIndices, int sideCount) {
+    // int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (iSelf < N) {
+        glm::vec3 selfPos = posSorted[iSelf];
+        glm::vec3 velocityChange = glm::vec3(0.f);
+
+        float radius = imax(imax(rule1Distance, rule2Distance), rule3Distance);
+
+        int numValid1 = 0;
+        int numValid3 = 0;
+
+        glm::vec3 center = glm::vec3(0.f);
+        glm::vec3 c = glm::vec3(0.f);
+        glm::vec3 perceivedVelocity = glm::vec3(0.f);
+
+        // find maximum of entire grid in cell space coordinates.
+        glm::vec3 cellSpaceMax = glm::vec3(sideCount - 1, sideCount - 1, sideCount - 1);
+
+        // create a bounding box based on the location of boid within the grid in cell space coordinates.
+        float xMax, xMin, yMax, yMin, zMax, zMin = 0;
+
+        // xMax and co. are all cell space coordinates and not world space coordaintes.
+        // find the grid cell associated with radMaxx, radMaxy, radMaxz, radMinx, radMiny, radMinz
+        // set xMax, xMin, yMax, yMin, zMax, zMin etc by clamping between that and 
+        int ixRadMax = std::floor((selfPos.x + radius - gridMin.x) / cellWidth);
+        int iyRadMax = std::floor((selfPos.y + radius - gridMin.y) / cellWidth);
+        int izRadMax = std::floor((selfPos.z + radius - gridMin.z) / cellWidth);
+        int ixRadMin = std::floor((selfPos.x - radius - gridMin.x) / cellWidth);
+        int iyRadMin = std::floor((selfPos.y - radius - gridMin.y) / cellWidth);
+        int izRadMin = std::floor((selfPos.z - radius - gridMin.z) / cellWidth);
+
+        xMax = imin(ixRadMax, cellSpaceMax.x);
+        yMax = imin(iyRadMax, cellSpaceMax.y);
+        zMax = imin(izRadMax, cellSpaceMax.z);
+        xMin = imax(ixRadMin, gridMin.x);
+        yMin = imax(iyRadMin, gridMin.y);
+        zMin = imax(izRadMin, gridMin.z);
+
+        // loop within the bounding box.
+        // these coordinates have already been clamped
+        // todo consider which axis is the best for checking
+        for (float x = xMin; x <= xMax; x += 1) {
+            for (float y = yMin; y <= yMax; y += 1) {
+                for (float z = zMin; z <= zMax; z += 1) {
+                    int gridIdx = gridIndex3Dto1D(x, y, z, sideCount);
+                    int startIdx = gridCellStartIndices[gridIdx];
+                    int endIdx = gridCellEndIndices[gridIdx];
+
+                    if (startIdx > -1) {
+                        // if start idx is > -1, then boids exist
+                        for (int curIdx = startIdx; curIdx <= endIdx; curIdx++) {
+                            // now curIdx is the same as nBoidIdx
+                            glm::vec3 nBoidPos = posSorted[curIdx];
+                            glm::vec3 vel = vel1Sorted[curIdx];
+
+                            float dist = glm::distance(selfPos, nBoidPos);
+                            //printf("dist: %f \n", dist);
+
+                            // Rule 1
+                            if (curIdx != iSelf && dist < rule1Distance) {
+                                center += posSorted[curIdx];
+                                numValid1 += 1;
+                            }
+
+                            // Rule 2
+                            if (curIdx != iSelf && dist < rule2Distance) {
+                                c -= (nBoidPos - selfPos);
+                            }
+
+                            // Rule 3
+                            if (curIdx != iSelf && dist < rule3Distance) {
+                                perceivedVelocity += vel1Sorted[curIdx];
+                                numValid3 += 1;
+                            }
+                        }
+                    }
+                }
+            }
+        }
+
+        if (numValid1 != 0) {
+            center /= numValid1;
+            velocityChange += (center - selfPos) * rule1Scale;
+        }
+
+        velocityChange += c * rule2Scale;
+
+        if (numValid3 != 0) {
+            perceivedVelocity /= numValid3;
+            velocityChange += perceivedVelocity * rule3Scale;
+        }
+
+        return velocityChange;
+    }
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
-  int N, int gridResolution, glm::vec3 gridMin,
+  int N, int gridSideCount, glm::vec3 gridMin,
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
-  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+  glm::vec3 *posSorted, glm::vec3 *vel1Sorted, glm::vec3 *vel2) {
   // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
   // except with one less level of indirection.
   // This should expect gridCellStartIndices and gridCellEndIndices to refer
@@ -341,6 +739,42 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+    int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (selfIdx < N) {
+        glm::vec3 velocityChange = kernComputeVelocityChangeCoherent(N, gridMin, 
+            cellWidth, selfIdx, posSorted, 
+            vel1Sorted, gridCellStartIndices,
+            gridCellEndIndices, gridSideCount);
+
+        // clamp speed and set new velocity
+        glm::vec3 finalVel = vel1Sorted[selfIdx] + velocityChange;
+        float speed = glm::length(finalVel);
+        if (speed > maxSpeed) {
+            finalVel = finalVel * maxSpeed / speed;
+        }
+        // printf("finalVel: x: %f, y: %f, z: %f \n", finalVel.x, finalVel.y, finalVel.z);
+
+        // vel2 is still the write buffer
+        vel2[selfIdx] = finalVel;
+    }
+}
+
+// helper global which rearranges information in pos and vel based on particleArrayIndices O(N) time
+
+__global__ void kernRearrangePosVel(int N, int* particleArrayIndices, glm::vec3* posSorted, glm::vec3* vel1Sorted, glm::vec3* pos, glm::vec3* vel1) {
+    int selfIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (selfIdx < N) {
+        int requiredIdx = particleArrayIndices[selfIdx];
+
+        //printf("vel x: %f, y: %f, z: %f \n ", vel1Prev[requiredIdx].x, vel1Prev[requiredIdx].y, vel1Prev[requiredIdx].z);
+
+        // essentially swap the contents
+        glm::vec3 requiredPos = pos[requiredIdx];
+        glm::vec3 requiredVel = vel1[requiredIdx];
+
+        posSorted[selfIdx] = requiredPos;
+        vel1Sorted[selfIdx] = requiredVel;
+    }
 }
 
 /**
@@ -349,6 +783,12 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
   // TODO-1.2 ping-pong the velocity buffers
+    int fullBlocksPerGrid = (numObjects + blockSize - 1) / blockSize;
+
+    kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+    kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+
+    std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +804,35 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+    int numCells = gridSideCount * gridSideCount * gridSideCount;
+
+    // For each boid, label its array index and grid index
+    int fullBlocksPerGrid = (numObjects + blockSize - 1) / blockSize;
+    int fullCellsPerGrid = (numCells + blockSize - 1) / blockSize;
+
+    // fill with -1s as default values
+    // Per cell basis
+    kernResetIntBuffer<<<fullCellsPerGrid, blockSize>>>(numCells, dev_gridCellStartIndices, -1);
+    kernResetIntBuffer<<<fullCellsPerGrid, blockSize>>>(numCells, dev_gridCellEndIndices, -1);
+    
+    // per boid basis
+    kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+        dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+    // pointer to first key, pointer to last key, pointer to first value
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+    // per boid basis
+    kernUpdateVelNeighborSearchScattered << <fullBlocksPerGrid, blockSize >> >(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+        dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+
+    // update position per boid
+    kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+
+    // ping pong
+    std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +851,43 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+    int numCells = gridSideCount * gridSideCount * gridSideCount;
+
+    // For each boid, label its array index and grid index
+    int fullBlocksPerGrid = (numObjects + blockSize - 1) / blockSize;
+    int fullCellsPerGrid = (numCells + blockSize - 1) / blockSize;
+
+    // fill with -1s as default values
+    // Per cell basis
+    kernResetIntBuffer << <fullCellsPerGrid, blockSize >> > (numCells, dev_gridCellStartIndices, -1);
+    kernResetIntBuffer << <fullCellsPerGrid, blockSize >> > (numCells, dev_gridCellEndIndices, -1);
+
+    // per boid basis
+    kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+        dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+    // sort dev_array_indices values
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    // rearrange pos and vel
+    // we never change vel1 and pos. only use them as reads.
+    kernRearrangePosVel << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_posSorted, dev_vel1Sorted, dev_pos, dev_vel1);
+
+    // find start and end indices
+// per boid basis because we read from dev_particleGridIndices
+    kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+    // per boid basis  
+    kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, 
+        gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, 
+        dev_gridCellEndIndices, dev_posSorted, dev_vel1Sorted, dev_vel2);
+
+    // update position per boid
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_posSorted, dev_vel2);
+
+    // ping pong
+    std::swap(dev_vel1, dev_vel2);
+    std::swap(dev_posSorted, dev_pos);
 }
 
 void Boids::endSimulation() {
@@ -390,6 +896,12 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  // cannot free the thrust pointers
 }
 
 void Boids::unitTest() {
@@ -439,6 +951,7 @@ void Boids::unitTest() {
   thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
 
   // How to copy data back to the CPU side from the GPU
+
   cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
   cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
   checkCUDAErrorWithLine("memcpy back failed!");
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..eee3475 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,12 +14,12 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
+#define UNIFORM_GRID 1
 #define COHERENT_GRID 0
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
-const float DT = 0.2f;
+const int N_FOR_VIS = 100000;
+const float DT = 0.1f;
 
 /**
 * C main function.