diff --git a/README.md b/README.md index d63a6a1..9022aab 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,35 @@ **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yilin Li
+* Tested on: Windows 11, i7-12700H @ 2.3GHz 16GB, RTX 3060 (Personal Laptop)
-### (TODO: Your README)
+## Introduction
+This project implements a flocking simulation based on the Reynolds Boids algorithm, along with two levels of optimization: a uniform grid, and a uniform grid with semi-coherent memory access.
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+
+![](images/Naive_40000boids.gif)
+*Naive algorithm with 40,000 boids and block size 128.*
+
+![](images/Unifrom_40000boids.gif)
+*Uniform grid search algorithm with 40,000 boids and block size 128.*
+
+![](images/Coherent_40000boids.gif)
+*Coherent uniform grid algorithm with 40,000 boids and block size 128.*
+
+## Performance Analysis
+* For each implementation, how does changing the number of boids affect performance? Why do you think this is?
+  * As the figure below shows, increasing the number of boids sharply degrades the performance of the naive algorithm. It also degrades the performance of the uniform grid algorithm, though far less severely, while the coherent algorithm is barely affected within the tested range.
+  * The results indicate that 100,000 boids is still far from the limit of the coherent algorithm. In the naive algorithm, each thread checks every other boid in the environment (an O(N^2) search), so it performs worst. In the uniform grid algorithm, the number of neighbor candidates grows with the boid count, but the search is limited to at most 8 grid cells, so performance degrades far more gracefully. The coherent algorithm additionally reorders the boid data so that boids in the same cell are contiguous in memory, turning neighbor reads into mostly coalesced accesses; since most of the slowness comes from memory access, this helps substantially.
+
+![](images/p_boids.png)
+* For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
+  * Changing the block count did not affect performance in my testing. Changing the block size, however, influences performance in a non-obvious way, shown in the figure below.
+  * The results show that increasing the block size can hurt performance. One likely reason is that all threads in a block share a fixed budget of registers and shared memory, so very large blocks increase register pressure and force register and memory reuse, lowering effective occupancy.
+
+![](images/p_block.png)
+
+* For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+  * Performance is greatly improved by the more coherent uniform grid. The frame rate of the uniform grid algorithm drops to around 60 FPS with 300,000 boids, whereas the coherent version stays at 90 FPS even with 1,000,000 boids. For reference, the naive algorithm drops to 8 FPS with only 100,000 boids. This is the outcome I expected: removing the extra level of indirection makes the neighbor reads contiguous, and memory access dominates the cost. A sketch of the reshuffle step is shown below.
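The reshuffle at the heart of the coherent optimization is a single gather kernel plus a pointer swap each frame. Here is a minimal sketch in the spirit of `kernReshufflePosVel` from `src/kernel.cu` (the buffer names here are illustrative, not the exact ones in the patch):

```cuda
#define GLM_FORCE_CUDA
#include <glm/glm.hpp>

// Gather boid data into cell-sorted order, so boids that share a grid cell
// end up contiguous in memory. particleArrayIndices has already been sorted
// by grid-cell index, so afterwards the per-cell start/end tables index
// directly into coherentPos/coherentVel and neighbor reads are coalesced.
__global__ void reshufflePosVel(int N,
    const glm::vec3 *pos, glm::vec3 *coherentPos,
    const glm::vec3 *vel, glm::vec3 *coherentVel,
    const int *particleArrayIndices) {
  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
  if (index >= N) {
    return;
  }
  int src = particleArrayIndices[index]; // this boid's slot in the unsorted arrays
  coherentPos[index] = pos[src];
  coherentVel[index] = vel[src];
}
```

After the gather, swapping the coherent buffers into place (as the patch does with `std::swap(temp_pos, dev_pos)`) removes the index indirection from the neighbor-search hot loop entirely.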
+* Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
+  * With the cell width halved so that 27 neighboring cells are checked, the coherent uniform grid actually performs better than the 8-cell configuration. Although 27 cells means more cells to iterate over, the smaller cells enclose a much tighter volume around each boid, so fewer non-neighbor candidates are examined, and the per-boid logic for deciding which octant to search is eliminated, saving some memory accesses. The conclusion no longer holds, however, once the boid count gets very high (around 1,000,000); my guess is that with that many boids the data for the searched cells no longer fits in cache, so memory access overhead dominates again.
diff --git a/images/Coherent_100000boids.gif b/images/Coherent_100000boids.gif new file mode 100755 index 0000000..a281b74 Binary files /dev/null and b/images/Coherent_100000boids.gif differ diff --git a/images/Coherent_20000boids.gif b/images/Coherent_20000boids.gif new file mode 100755 index 0000000..7346cde Binary files /dev/null and b/images/Coherent_20000boids.gif differ diff --git a/images/Coherent_40000boids.gif b/images/Coherent_40000boids.gif new file mode 100755 index 0000000..dc6c620 Binary files /dev/null and b/images/Coherent_40000boids.gif differ diff --git a/images/Coherent_60000boids.gif b/images/Coherent_60000boids.gif new file mode 100755 index 0000000..cecc0f9 Binary files /dev/null and b/images/Coherent_60000boids.gif differ diff --git a/images/Coherent_80000boids.gif b/images/Coherent_80000boids.gif new file mode 100755 index 0000000..9535132 Binary files /dev/null and b/images/Coherent_80000boids.gif differ diff --git a/images/Naive_100000boids.gif b/images/Naive_100000boids.gif new file mode 100755 index 0000000..cb083aa Binary files /dev/null and b/images/Naive_100000boids.gif differ diff --git a/images/Naive_20000boids.gif b/images/Naive_20000boids.gif new file mode 100755 index 0000000..c82fa49 Binary files /dev/null and b/images/Naive_20000boids.gif differ diff --git a/images/Naive_40000boids.gif b/images/Naive_40000boids.gif new file mode 100755 index 0000000..fc926eb Binary files /dev/null and b/images/Naive_40000boids.gif differ diff --git a/images/Naive_60000boids.gif b/images/Naive_60000boids.gif new file mode 100755 index 0000000..65457c0 Binary files /dev/null and b/images/Naive_60000boids.gif differ diff --git a/images/Naive_80000boids.gif b/images/Naive_80000boids.gif new file mode 100755 index 0000000..320b99e Binary files /dev/null and b/images/Naive_80000boids.gif differ diff --git a/images/Unifrom_100000boids.gif b/images/Unifrom_100000boids.gif new file mode 100755 index 0000000..b3d9b9e Binary files /dev/null and b/images/Unifrom_100000boids.gif differ diff --git a/images/Unifrom_20000boids.gif b/images/Unifrom_20000boids.gif new file mode 100755 index 0000000..751c5f2 Binary files /dev/null and b/images/Unifrom_20000boids.gif differ diff --git a/images/Unifrom_40000boids.gif b/images/Unifrom_40000boids.gif new file mode 100755 index 0000000..09fee76 Binary files /dev/null and b/images/Unifrom_40000boids.gif differ diff --git a/images/Unifrom_60000boids.gif b/images/Unifrom_60000boids.gif new file mode 100755 index 0000000..d7c682d Binary files /dev/null and b/images/Unifrom_60000boids.gif differ diff --git a/images/Unifrom_80000boids.gif b/images/Unifrom_80000boids.gif new file mode 100755 index 0000000..743aa1f Binary files /dev/null and b/images/Unifrom_80000boids.gif differ diff --git a/images/p_block.png b/images/p_block.png new file mode 100644 index
0000000..9c6507a Binary files /dev/null and b/images/p_block.png differ diff --git a/images/p_boids.png b/images/p_boids.png new file mode 100644 index 0000000..0309442 Binary files /dev/null and b/images/p_boids.png differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..91eb1c4 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -1,457 +1,820 @@ -#define GLM_FORCE_CUDA -#include -#include -#include -#include -#include "utilityCore.hpp" -#include "kernel.h" - -// LOOK-2.1 potentially useful for doing grid-based neighbor search -#ifndef imax -#define imax( a, b ) ( ((a) > (b)) ? (a) : (b) ) -#endif - -#ifndef imin -#define imin( a, b ) ( ((a) < (b)) ? (a) : (b) ) -#endif - -#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) - -/** -* Check for CUDA errors; print and exit if there was a problem. -*/ -void checkCUDAError(const char *msg, int line = -1) { - cudaError_t err = cudaGetLastError(); - if (cudaSuccess != err) { - if (line >= 0) { - fprintf(stderr, "Line %d: ", line); - } - fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err)); - exit(EXIT_FAILURE); - } -} - - -/***************** -* Configuration * -*****************/ - -/*! Block size used for CUDA kernel launch. */ -#define blockSize 128 - -// LOOK-1.2 Parameters for the boids algorithm. -// These worked well in our reference implementation. -#define rule1Distance 5.0f -#define rule2Distance 3.0f -#define rule3Distance 5.0f - -#define rule1Scale 0.01f -#define rule2Scale 0.1f -#define rule3Scale 0.1f - -#define maxSpeed 1.0f - -/*! Size of the starting area in simulation space. */ -#define scene_scale 100.0f - -/*********************************************** -* Kernel state (pointers are device pointers) * -***********************************************/ - -int numObjects; -dim3 threadsPerBlock(blockSize); - -// LOOK-1.2 - These buffers are here to hold all your boid information. -// These get allocated for you in Boids::initSimulation. -// Consider why you would need two velocity buffers in a simulation where each -// boid cares about its neighbors' velocities. -// These are called ping-pong buffers. -glm::vec3 *dev_pos; -glm::vec3 *dev_vel1; -glm::vec3 *dev_vel2; - -// LOOK-2.1 - these are NOT allocated for you. You'll have to set up the thrust -// pointers on your own too. - -// For efficient sorting and the uniform grid. These should always be parallel. -int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle? -int *dev_particleGridIndices; // What grid cell is this particle in? -// needed for use with thrust -thrust::device_ptr dev_thrust_particleArrayIndices; -thrust::device_ptr dev_thrust_particleGridIndices; - -int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs -int *dev_gridCellEndIndices; // to this cell? - -// TODO-2.3 - consider what additional buffers you might need to reshuffle -// the position and velocity data to be coherent within cells. - -// LOOK-2.1 - Grid parameters based on simulation parameters. 
-// These are automatically computed for you in Boids::initSimulation -int gridCellCount; -int gridSideCount; -float gridCellWidth; -float gridInverseCellWidth; -glm::vec3 gridMinimum; - -/****************** -* initSimulation * -******************/ - -__host__ __device__ unsigned int hash(unsigned int a) { - a = (a + 0x7ed55d16) + (a << 12); - a = (a ^ 0xc761c23c) ^ (a >> 19); - a = (a + 0x165667b1) + (a << 5); - a = (a + 0xd3a2646c) ^ (a << 9); - a = (a + 0xfd7046c5) + (a << 3); - a = (a ^ 0xb55a4f09) ^ (a >> 16); - return a; -} - -/** -* LOOK-1.2 - this is a typical helper function for a CUDA kernel. -* Function for generating a random vec3. -*/ -__host__ __device__ glm::vec3 generateRandomVec3(float time, int index) { - thrust::default_random_engine rng(hash((int)(index * time))); - thrust::uniform_real_distribution unitDistrib(-1, 1); - - return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng)); -} - -/** -* LOOK-1.2 - This is a basic CUDA kernel. -* CUDA kernel for generating boids with a specified mass randomly around the star. -*/ -__global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) { - int index = (blockIdx.x * blockDim.x) + threadIdx.x; - if (index < N) { - glm::vec3 rand = generateRandomVec3(time, index); - arr[index].x = scale * rand.x; - arr[index].y = scale * rand.y; - arr[index].z = scale * rand.z; - } -} - -/** -* Initialize memory, update some globals -*/ -void Boids::initSimulation(int N) { - numObjects = N; - dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); - - // LOOK-1.2 - This is basic CUDA memory management and error checking. - // Don't forget to cudaFree in Boids::endSimulation. - cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); - - cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); - - cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); - - // LOOK-1.2 - This is a typical CUDA kernel invocation. - kernGenerateRandomPosArray<<>>(1, numObjects, - dev_pos, scene_scale); - checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); - - // LOOK-2.1 computing grid params - gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); - int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; - gridSideCount = 2 * halfSideCount; - - gridCellCount = gridSideCount * gridSideCount * gridSideCount; - gridInverseCellWidth = 1.0f / gridCellWidth; - float halfGridWidth = gridCellWidth * halfSideCount; - gridMinimum.x -= halfGridWidth; - gridMinimum.y -= halfGridWidth; - gridMinimum.z -= halfGridWidth; - - // TODO-2.1 TODO-2.3 - Allocate additional buffers here. - cudaDeviceSynchronize(); -} - - -/****************** -* copyBoidsToVBO * -******************/ - -/** -* Copy the boid positions into the VBO so that they can be drawn by OpenGL. 
-*/ -__global__ void kernCopyPositionsToVBO(int N, glm::vec3 *pos, float *vbo, float s_scale) { - int index = threadIdx.x + (blockIdx.x * blockDim.x); - - float c_scale = -1.0f / s_scale; - - if (index < N) { - vbo[4 * index + 0] = pos[index].x * c_scale; - vbo[4 * index + 1] = pos[index].y * c_scale; - vbo[4 * index + 2] = pos[index].z * c_scale; - vbo[4 * index + 3] = 1.0f; - } -} - -__global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float s_scale) { - int index = threadIdx.x + (blockIdx.x * blockDim.x); - - if (index < N) { - vbo[4 * index + 0] = vel[index].x + 0.3f; - vbo[4 * index + 1] = vel[index].y + 0.3f; - vbo[4 * index + 2] = vel[index].z + 0.3f; - vbo[4 * index + 3] = 1.0f; - } -} - -/** -* Wrapper for call to the kernCopyboidsToVBO CUDA kernel. -*/ -void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) { - dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); - - kernCopyPositionsToVBO << > >(numObjects, dev_pos, vbodptr_positions, scene_scale); - kernCopyVelocitiesToVBO << > >(numObjects, dev_vel1, vbodptr_velocities, scene_scale); - - checkCUDAErrorWithLine("copyBoidsToVBO failed!"); - - cudaDeviceSynchronize(); -} - - -/****************** -* stepSimulation * -******************/ - -/** -* LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce. -* __device__ code can be called from a __global__ context -* Compute the new velocity on the body with index `iSelf` due to the `N` boids -* in the `pos` and `vel` arrays. -*/ -__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); -} - -/** -* TODO-1.2 implement basic flocking -* For each of the `N` bodies, update its position based on its current velocity. -*/ -__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, - glm::vec3 *vel1, glm::vec3 *vel2) { - // Compute a new velocity based on pos and vel1 - // Clamp the speed - // Record the new velocity into vel2. Question: why NOT vel1? -} - -/** -* LOOK-1.2 Since this is pretty trivial, we implemented it for you. -* For each of the `N` bodies, update its position based on its current velocity. -*/ -__global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { - // Update position by velocity - int index = threadIdx.x + (blockIdx.x * blockDim.x); - if (index >= N) { - return; - } - glm::vec3 thisPos = pos[index]; - thisPos += vel[index] * dt; - - // Wrap the boids around so we don't lose them - thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; - thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; - thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; - - thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; - thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; - thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z; - - pos[index] = thisPos; -} - -// LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index. -// LOOK-2.3 Looking at this method, what would be the most memory efficient -// order for iterating over neighboring grid cells? -// for(x) -// for(y) -// for(z)? Or some other order? 
-__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { - return x + y * gridResolution + z * gridResolution * gridResolution; -} - -__global__ void kernComputeIndices(int N, int gridResolution, - glm::vec3 gridMin, float inverseCellWidth, - glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 -} - -// LOOK-2.1 Consider how this could be useful for indicating that a cell -// does not enclose any boids -__global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { - int index = (blockIdx.x * blockDim.x) + threadIdx.x; - if (index < N) { - intBuffer[index] = value; - } -} - -__global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, - int *gridCellStartIndices, int *gridCellEndIndices) { - // TODO-2.1 - // Identify the start point of each cell in the gridIndices array. - // This is basically a parallel unrolling of a loop that goes - // "this index doesn't match the one before it, must be a new cell!" -} - -__global__ void kernUpdateVelNeighborSearchScattered( - int N, int gridResolution, glm::vec3 gridMin, - float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - int *particleArrayIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce - // the number of boids that need to be checked. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 -} - -__global__ void kernUpdateVelNeighborSearchCoherent( - int N, int gridResolution, glm::vec3 gridMin, - float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, - // except with one less level of indirection. - // This should expect gridCellStartIndices and gridCellEndIndices to refer - // directly to pos and vel1. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // DIFFERENCE: For best results, consider what order the cells should be - // checked in to maximize the memory benefits of reordering the boids data. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 -} - -/** -* Step the entire N-body simulation by `dt` seconds. -*/ -void Boids::stepSimulationNaive(float dt) { - // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. - // TODO-1.2 ping-pong the velocity buffers -} - -void Boids::stepSimulationScatteredGrid(float dt) { - // TODO-2.1 - // Uniform Grid Neighbor search using Thrust sort. - // In Parallel: - // - label each particle with its array index as well as its grid index. - // Use 2x width grids. - // - Unstable key sort using Thrust. 
A stable sort isn't necessary, but you - // are welcome to do a performance comparison. - // - Naively unroll the loop for finding the start and end indices of each - // cell's data pointers in the array of boid indices - // - Perform velocity updates using neighbor search - // - Update positions - // - Ping-pong buffers as needed -} - -void Boids::stepSimulationCoherentGrid(float dt) { - // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid - // Uniform Grid Neighbor search using Thrust sort on cell-coherent data. - // In Parallel: - // - Label each particle with its array index as well as its grid index. - // Use 2x width grids - // - Unstable key sort using Thrust. A stable sort isn't necessary, but you - // are welcome to do a performance comparison. - // - Naively unroll the loop for finding the start and end indices of each - // cell's data pointers in the array of boid indices - // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all - // the particle data in the simulation array. - // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED - // - Perform velocity updates using neighbor search - // - Update positions - // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. -} - -void Boids::endSimulation() { - cudaFree(dev_vel1); - cudaFree(dev_vel2); - cudaFree(dev_pos); - - // TODO-2.1 TODO-2.3 - Free any additional buffers here. -} - -void Boids::unitTest() { - // LOOK-1.2 Feel free to write additional tests here. - - // test unstable sort - int *dev_intKeys; - int *dev_intValues; - int N = 10; - - std::unique_ptrintKeys{ new int[N] }; - std::unique_ptrintValues{ new int[N] }; - - intKeys[0] = 0; intValues[0] = 0; - intKeys[1] = 1; intValues[1] = 1; - intKeys[2] = 0; intValues[2] = 2; - intKeys[3] = 3; intValues[3] = 3; - intKeys[4] = 0; intValues[4] = 4; - intKeys[5] = 2; intValues[5] = 5; - intKeys[6] = 2; intValues[6] = 6; - intKeys[7] = 0; intValues[7] = 7; - intKeys[8] = 5; intValues[8] = 8; - intKeys[9] = 6; intValues[9] = 9; - - cudaMalloc((void**)&dev_intKeys, N * sizeof(int)); - checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!"); - - cudaMalloc((void**)&dev_intValues, N * sizeof(int)); - checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!"); - - dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); - - std::cout << "before unstable sort: " << std::endl; - for (int i = 0; i < N; i++) { - std::cout << " key: " << intKeys[i]; - std::cout << " value: " << intValues[i] << std::endl; - } - - // How to copy data to the GPU - cudaMemcpy(dev_intKeys, intKeys.get(), sizeof(int) * N, cudaMemcpyHostToDevice); - cudaMemcpy(dev_intValues, intValues.get(), sizeof(int) * N, cudaMemcpyHostToDevice); - - // Wrap device vectors in thrust iterators for use with thrust. 
- thrust::device_ptr dev_thrust_keys(dev_intKeys); - thrust::device_ptr dev_thrust_values(dev_intValues); - // LOOK-2.1 Example for using thrust::sort_by_key - thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values); - - // How to copy data back to the CPU side from the GPU - cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost); - cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost); - checkCUDAErrorWithLine("memcpy back failed!"); - - std::cout << "after unstable sort: " << std::endl; - for (int i = 0; i < N; i++) { - std::cout << " key: " << intKeys[i]; - std::cout << " value: " << intValues[i] << std::endl; - } - - // cleanup - cudaFree(dev_intKeys); - cudaFree(dev_intValues); - checkCUDAErrorWithLine("cudaFree failed!"); - return; -} +#define GLM_FORCE_CUDA +#include +#include +#include +#include +#include "utilityCore.hpp" +#include "kernel.h" + +// LOOK-2.1 potentially useful for doing grid-based neighbor search +#ifndef imax +#define imax( a, b ) ( ((a) > (b)) ? (a) : (b) ) +#endif + +#ifndef imin +#define imin( a, b ) ( ((a) < (b)) ? (a) : (b) ) +#endif + +#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) + +/** +* Check for CUDA errors; print and exit if there was a problem. +*/ +void checkCUDAError(const char *msg, int line = -1) { + cudaError_t err = cudaGetLastError(); + if (cudaSuccess != err) { + if (line >= 0) { + fprintf(stderr, "Line %d: ", line); + } + fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err)); + exit(EXIT_FAILURE); + } +} + + +/***************** +* Configuration * +*****************/ + +/*! Block size used for CUDA kernel launch. */ +#define blockSize 128 + +// LOOK-1.2 Parameters for the boids algorithm. +// These worked well in our reference implementation. +#define rule1Distance 5.0f +#define rule2Distance 3.0f +#define rule3Distance 5.0f + +#define rule1Scale 0.01f +#define rule2Scale 0.1f +#define rule3Scale 0.1f + +#define maxSpeed 1.0f + +/*! Size of the starting area in simulation space. */ +#define scene_scale 100.0f + +/*********************************************** +* Kernel state (pointers are device pointers) * +***********************************************/ + +int numObjects; +dim3 threadsPerBlock(blockSize); + +// LOOK-1.2 - These buffers are here to hold all your boid information. +// These get allocated for you in Boids::initSimulation. +// Consider why you would need two velocity buffers in a simulation where each +// boid cares about its neighbors' velocities. +// These are called ping-pong buffers. +glm::vec3 *dev_pos; +glm::vec3 *dev_vel1; +glm::vec3 *dev_vel2; + +// LOOK-2.1 - these are NOT allocated for you. You'll have to set up the thrust +// pointers on your own too. + +// For efficient sorting and the uniform grid. These should always be parallel. +int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle? +int *dev_particleGridIndices; // What grid cell is this particle in? +// needed for use with thrust +thrust::device_ptr dev_thrust_particleArrayIndices; +thrust::device_ptr dev_thrust_particleGridIndices; + +int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs +int *dev_gridCellEndIndices; // to this cell? + +// TODO-2.3 - consider what additional buffers you might need to reshuffle +// the position and velocity data to be coherent within cells. +glm::vec3 *temp_pos; +glm::vec3 *temp_vel; +// LOOK-2.1 - Grid parameters based on simulation parameters. 
+// These are automatically computed for you in Boids::initSimulation
+int gridCellCount;
+int gridSideCount;
+float gridCellWidth;
+float gridInverseCellWidth;
+glm::vec3 gridMinimum;
+
+/******************
+* initSimulation *
+******************/
+
+__host__ __device__ unsigned int hash(unsigned int a) {
+  a = (a + 0x7ed55d16) + (a << 12);
+  a = (a ^ 0xc761c23c) ^ (a >> 19);
+  a = (a + 0x165667b1) + (a << 5);
+  a = (a + 0xd3a2646c) ^ (a << 9);
+  a = (a + 0xfd7046c5) + (a << 3);
+  a = (a ^ 0xb55a4f09) ^ (a >> 16);
+  return a;
+}
+
+/**
+* LOOK-1.2 - this is a typical helper function for a CUDA kernel.
+* Function for generating a random vec3.
+*/
+__host__ __device__ glm::vec3 generateRandomVec3(float time, int index) {
+  thrust::default_random_engine rng(hash((int)(index * time)));
+  thrust::uniform_real_distribution<float> unitDistrib(-1, 1);
+
+  return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng));
+}
+
+/**
+* LOOK-1.2 - This is a basic CUDA kernel.
+* CUDA kernel for generating boids with a specified mass randomly around the star.
+*/
+__global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index < N) {
+    glm::vec3 rand = generateRandomVec3(time, index);
+    arr[index].x = scale * rand.x;
+    arr[index].y = scale * rand.y;
+    arr[index].z = scale * rand.z;
+  }
+}
+
+/**
+* Initialize memory, update some globals
+*/
+void Boids::initSimulation(int N) {
+  numObjects = N;
+  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+
+  // LOOK-1.2 - This is basic CUDA memory management and error checking.
+  // Don't forget to cudaFree in Boids::endSimulation.
+  cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos failed!");
+
+  cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!");
+
+  cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
+
+  // LOOK-1.2 - This is a typical CUDA kernel invocation.
+  kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
+    dev_pos, scene_scale);
+  checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
+
+  // LOOK-2.1 computing grid params
+  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
+  gridSideCount = 2 * halfSideCount;
+
+  gridCellCount = gridSideCount * gridSideCount * gridSideCount;
+  gridInverseCellWidth = 1.0f / gridCellWidth;
+  float halfGridWidth = gridCellWidth * halfSideCount;
+  gridMinimum.x -= halfGridWidth;
+  gridMinimum.y -= halfGridWidth;
+  gridMinimum.z -= halfGridWidth;
+
+  // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
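+  // The four int buffers below are the grid bookkeeping for TODO-2.1
+  // (particle/cell index pairs plus per-cell start/end ranges); temp_pos and
+  // temp_vel are the staging buffers used by the coherent reshuffle (TODO-2.3).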
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&temp_pos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc temp_pos failed!");
+
+  cudaMalloc((void**)&temp_vel, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc temp_vel failed!");
+
+  cudaDeviceSynchronize();
+}
+
+
+/******************
+* copyBoidsToVBO *
+******************/
+
+/**
+* Copy the boid positions into the VBO so that they can be drawn by OpenGL.
+*/
+__global__ void kernCopyPositionsToVBO(int N, glm::vec3 *pos, float *vbo, float s_scale) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+  float c_scale = -1.0f / s_scale;
+
+  if (index < N) {
+    vbo[4 * index + 0] = pos[index].x * c_scale;
+    vbo[4 * index + 1] = pos[index].y * c_scale;
+    vbo[4 * index + 2] = pos[index].z * c_scale;
+    vbo[4 * index + 3] = 1.0f;
+  }
+}
+
+__global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float s_scale) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+  if (index < N) {
+    vbo[4 * index + 0] = vel[index].x + 0.3f;
+    vbo[4 * index + 1] = vel[index].y + 0.3f;
+    vbo[4 * index + 2] = vel[index].z + 0.3f;
+    vbo[4 * index + 3] = 1.0f;
+  }
+}
+
+/**
+* Wrapper for calls to the kernCopy*ToVBO CUDA kernels.
+*/
+void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) {
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  kernCopyPositionsToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, vbodptr_positions, scene_scale);
+  kernCopyVelocitiesToVBO<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
+
+  checkCUDAErrorWithLine("copyBoidsToVBO failed!");
+
+  cudaDeviceSynchronize();
+}
+
+
+/******************
+* stepSimulation *
+******************/
+
+/**
+* LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce.
+* __device__ code can be called from a __global__ context
+* Compute the new velocity on the body with index `iSelf` due to the `N` boids
+* in the `pos` and `vel` arrays.
+*/
+__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
+  glm::vec3 perceivedCenter1(0.f, 0.f, 0.f);
+  glm::vec3 rule2vec(0.f, 0.f, 0.f);
+  glm::vec3 perceivedVec(0.f, 0.f, 0.f);
+  float numNeighbors1 = 0.0f;
+  float numNeighbors3 = 0.0f;
+  for (int i = 0; i < N; i++) {
+    float dist = glm::distance(pos[i], pos[iSelf]);
+    // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+    if ((i != iSelf) && (dist < rule1Distance)) {
+      perceivedCenter1 += pos[i];
+      numNeighbors1 += 1.0f;
+    }
+    // Rule 2: boids try to stay a distance d away from each other
+    if ((i != iSelf) && (dist < rule2Distance)) {
+      rule2vec -= (pos[i] - pos[iSelf]);
+    }
+    // Rule 3: boids try to match the speed of surrounding boids
+    if ((i != iSelf) && (dist < rule3Distance)) {
+      perceivedVec += vel[i];
+      numNeighbors3 += 1.0f;
+    }
+  }
+  // Rule 1 results
+  if (numNeighbors1 > 0) {
+    perceivedCenter1 /= numNeighbors1;
+  }
+  else {
+    perceivedCenter1 = pos[iSelf];
+  }
+  glm::vec3 v1 = (perceivedCenter1 - pos[iSelf]) * rule1Scale;
+  // Rule 2 results
+  glm::vec3 v2 = rule2vec * rule2Scale;
+  // Rule 3 results
+  if (numNeighbors3 > 0) {
+    perceivedVec /= numNeighbors3;
+  }
+  glm::vec3 v3 = perceivedVec * rule3Scale;
+
+  return v1 + v2 + v3;
+}
+
+/**
+* TODO-1.2 implement basic flocking
+* For each of the `N` bodies, update its velocity based on its neighbors.
+*/
+__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
+  glm::vec3 *vel1, glm::vec3 *vel2) {
+  int idx = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (idx >= N) {
+    return; // guard: threads past N must not read or write out of bounds
+  }
+  // Compute a new velocity based on pos and vel1
+  glm::vec3 changeVelocity = computeVelocityChange(N, idx, pos, vel1);
+  glm::vec3 newVelocity = vel1[idx] + changeVelocity;
+  // Clamp the speed
+  if (glm::length(newVelocity) > maxSpeed) {
+    newVelocity = glm::normalize(newVelocity) * maxSpeed;
+  }
+  // Record the new velocity into vel2. Question: why NOT vel1?
+  // (Other threads are still reading vel1 this step; writing it would race.)
+  vel2[idx] = newVelocity;
+}
+
+/**
+* LOOK-1.2 Since this is pretty trivial, we implemented it for you.
+* For each of the `N` bodies, update its position based on its current velocity.
+*/
+__global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
+  // Update position by velocity
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+  glm::vec3 thisPos = pos[index];
+  thisPos += vel[index] * dt;
+
+  // Wrap the boids around so we don't lose them
+  thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x;
+  thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y;
+  thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z;
+
+  thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x;
+  thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y;
+  thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z;
+
+  pos[index] = thisPos;
+}
+
+// LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index.
+// LOOK-2.3 Looking at this method, what would be the most memory efficient
+// order for iterating over neighboring grid cells?
+// for(x)
+// for(y)
+// for(z)? Or some other order?
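+// (Answer: x is the least-significant term in the index below, so cells that
+// differ only in x are adjacent in memory; iterating x innermost and z
+// outermost walks the neighboring cells in contiguous order.)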
+__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { + return x + y * gridResolution + z * gridResolution * gridResolution; +} + +__global__ void kernComputeIndices(int N, int gridResolution, + glm::vec3 gridMin, float inverseCellWidth, + glm::vec3 *pos, int *indices, int *gridIndices) { + // TODO-2.1 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + }; + // - Label each boid with the index of its grid cell. + int x = (int)(inverseCellWidth * (pos[index].x - gridMin.x)); + int y = (int)(inverseCellWidth * (pos[index].y - gridMin.y)); + int z = (int)(inverseCellWidth * (pos[index].z - gridMin.z)); + int gridIndice = gridIndex3Dto1D(x, y, z, gridResolution); + gridIndices[index] = gridIndice; + // - Set up a parallel array of integer indices as pointers to the actual + // boid data in pos and vel1/vel2 + indices[index] = index; +} + +// LOOK-2.1 Consider how this could be useful for indicating that a cell +// does not enclose any boids +__global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + intBuffer[index] = value; + } +} + +__global__ void kernIdentifyCellStartEnd(int N, int* particleGridIndices, + int* gridCellStartIndices, int* gridCellEndIndices) { + // TODO-2.1 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + }; + // Identify the start point of each cell in the gridIndices array. + // This is basically a parallel unrolling of a loop that goes + // "this index doesn't match the one before it, must be a new cell!" + if ((index > 0) && (particleGridIndices[index - 1] != particleGridIndices[index])) { + gridCellStartIndices[particleGridIndices[index]] = index; + } + if (index == 0) { + gridCellStartIndices[particleGridIndices[0]] = 0; + } + if ((index < N - 1) && (particleGridIndices[index + 1] != particleGridIndices[index])) { + gridCellEndIndices[particleGridIndices[index]] = index; + } + if (index == N - 1) { + gridCellEndIndices[particleGridIndices[index]] = N - 1; + } +} + +__global__ void kernUpdateVelNeighborSearchScattered( + int N, int gridResolution, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, + int* gridCellStartIndices, int* gridCellEndIndices, + int* particleArrayIndices, + glm::vec3* pos, glm::vec3* vel1, glm::vec3* vel2) { + // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + }; + // the number of boids that need to be checked. + // - Identify the grid cell that this particle is in + int x = (int)(inverseCellWidth * (pos[index].x - gridMin.x)); + int y = (int)(inverseCellWidth * (pos[index].y - gridMin.y)); + int z = (int)(inverseCellWidth * (pos[index].z - gridMin.z)); + float xPos = gridMin.x + x * cellWidth + 0.5 * cellWidth; + float yPos = gridMin.y + y * cellWidth + 0.5 * cellWidth; + float zPos = gridMin.z + z * cellWidth + 0.5 * cellWidth; + int xCandidate[2] = {x, -1}; + int yCandidate[2] = {y, -1}; + int zCandidate[2] = {z, -1}; + // - Identify which cells may contain neighbors. This isn't always 8. 
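+  // The cell width is twice the neighborhood radius, so along each axis the
+  // neighborhood can spill only into the one adjacent cell on the side of the
+  // cell center that this boid lies on: at most 2x2x2 = 8 candidate cells
+  // (fewer at the grid boundary, where the second candidate stays -1).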
+ if ((pos[index].x < xPos) && (x > 0)) { + xCandidate[1] = x - 1; + } + else if ((pos[index].x >= xPos) && (x < gridResolution - 1)) { + xCandidate[1] = x + 1; + } + if ((pos[index].y < yPos) && (y > 0)) { + yCandidate[1] = y - 1; + } + else if ((pos[index].y >= yPos) && (y < gridResolution - 1)) { + yCandidate[1] = y + 1; + } + if ((pos[index].z < zPos) && (z > 0)) { + zCandidate[1] = z - 1; + } + else if ((pos[index].z >= zPos) && (z < gridResolution - 1)) { + zCandidate[1] = z + 1; + } + // - For each cell, read the start/end indices in the boid pointer array. + // - Access each boid in the cell and compute velocity change from + int gridIndice; + int pointer; + int start; + int end; + glm::vec3 perceivedCenter1(0.f, 0.f, 0.f); + glm::vec3 rule2vec(0.f, 0.f, 0.f); + glm::vec3 perceivedVec(0.f, 0.f, 0.f); + float numNeighbors1 = 0.0; + float numNeighbors3 = 0.0; + for (auto& i : xCandidate) { + if (i == -1) { + continue; + } + for (auto& j : yCandidate) { + if (j == -1) { + continue; + } + for (auto& k : zCandidate) { + if (k == -1) { + continue; + } + gridIndice = gridIndex3Dto1D(i, j, k, gridResolution); + start = gridCellStartIndices[gridIndice]; + if (start == -1) { + continue; + } + end = gridCellEndIndices[gridIndice]; + for (int p = start; p <= end; p++) { + pointer = particleArrayIndices[p]; + float dist = glm::distance(pos[pointer], pos[index]); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if ((pointer != index) && (dist < rule1Distance)) { + perceivedCenter1 += pos[pointer]; + numNeighbors1 += 1.0; + }; + // Rule 2: boids try to stay a distance d away from each other + if ((pointer != index) && (dist < rule2Distance)) { + rule2vec -= (pos[pointer] - pos[index]); + }; + // Rule 3: boids try to match the speed of surrounding boids + if ((pointer != index) && (dist < rule3Distance)) { + perceivedVec += vel1[pointer]; + numNeighbors3 += 1.0; + }; + } + } + } + } + // Rule 1 results + if (numNeighbors1 > 0) { + perceivedCenter1 /= numNeighbors1; + } + else { + perceivedCenter1 = pos[index]; + }; + glm::vec3 v1 = (perceivedCenter1 - pos[index]) * rule1Scale; + // Rule 2 results + glm::vec3 v2 = rule2vec * rule2Scale; + // Rule 3 results + if (numNeighbors3 > 0) { + perceivedVec /= numNeighbors3; + } + glm::vec3 v3 = perceivedVec * rule3Scale; + + glm::vec3 rulevel = v1 + v2 + v3; + glm::vec3 newvel = rulevel + vel1[index]; + if (glm::length(newvel) > 1) { + newvel = glm::normalize(newvel); + } + vel2[index] = newvel; + // the boids rules, if this boid is within the neighborhood distance. + // - Clamp the speed change before putting the new speed in vel2 +} + +__global__ void kernUpdateVelNeighborSearchCoherent( + int N, int gridResolution, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, + int *gridCellStartIndices, int *gridCellEndIndices, + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, + // except with one less level of indirection. + // This should expect gridCellStartIndices and gridCellEndIndices to refer + // directly to pos and vel1. + // - Identify the grid cell that this particle is in + // - Identify which cells may contain neighbors. This isn't always 8. + // - For each cell, read the start/end indices in the boid pointer array. + // DIFFERENCE: For best results, consider what order the cells should be + // checked in to maximize the memory benefits of reordering the boids data. 
+ // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + // - Clamp the speed change before putting the new speed in vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + }; + // the number of boids that need to be checked. + // - Identify the grid cell that this particle is in + int x = (int)(inverseCellWidth * (pos[index].x - gridMin.x)); + int y = (int)(inverseCellWidth * (pos[index].y - gridMin.y)); + int z = (int)(inverseCellWidth * (pos[index].z - gridMin.z)); + float xPos = gridMin.x + x * cellWidth + 0.5 * cellWidth; + float yPos = gridMin.y + y * cellWidth + 0.5 * cellWidth; + float zPos = gridMin.z + z * cellWidth + 0.5 * cellWidth; + int xCandidate[2] = { x, -1 }; + int yCandidate[2] = { y, -1 }; + int zCandidate[2] = { z, -1 }; + // - Identify which cells may contain neighbors. This isn't always 8. + if ((pos[index].x < xPos) && (x > 0)) { + xCandidate[1] = x - 1; + } + else if ((pos[index].x >= xPos) && (x < gridResolution - 1)) { + xCandidate[1] = x + 1; + } + if ((pos[index].y < yPos) && (y > 0)) { + yCandidate[1] = y - 1; + } + else if ((pos[index].y >= yPos) && (y < gridResolution - 1)) { + yCandidate[1] = y + 1; + } + if ((pos[index].z < zPos) && (z > 0)) { + zCandidate[1] = z - 1; + } + else if ((pos[index].z >= zPos) && (z < gridResolution - 1)) { + zCandidate[1] = z + 1; + } + // - For each cell, read the start/end indices in the boid pointer array. + // - Access each boid in the cell and compute velocity change from + int gridIndice; + int start; + int end; + glm::vec3 perceivedCenter1(0.f, 0.f, 0.f); + glm::vec3 rule2vec(0.f, 0.f, 0.f); + glm::vec3 perceivedVec(0.f, 0.f, 0.f); + float numNeighbors1 = 0.0; + float numNeighbors3 = 0.0; + for (auto& i : zCandidate) { + if (i == -1) { + continue; + } + for (auto& j : yCandidate) { + if (j == -1) { + continue; + } + for (auto& k : xCandidate) { + if (k == -1) { + continue; + } + gridIndice = gridIndex3Dto1D(k, j, i, gridResolution); + start = gridCellStartIndices[gridIndice]; + if (start == -1) { + continue; + } + end = gridCellEndIndices[gridIndice]; + for (int p = start; p <= end; p++) { + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + float dist = glm::distance(pos[p], pos[index]); + if ((p != index) && (dist < rule1Distance)) { + perceivedCenter1 += pos[p]; + numNeighbors1 += 1.0; + }; + // Rule 2: boids try to stay a distance d away from each other + if ((p != index) && (dist < rule2Distance)) { + rule2vec -= (pos[p] - pos[index]); + }; + // Rule 3: boids try to match the speed of surrounding boids + if ((p != index) && (dist < rule3Distance)) { + perceivedVec += vel1[p]; + numNeighbors3 += 1.0; + }; + } + } + } + } + // Rule 1 results + if (numNeighbors1 > 0) { + perceivedCenter1 /= numNeighbors1; + } + else { + perceivedCenter1 = pos[index]; + }; + glm::vec3 v1 = (perceivedCenter1 - pos[index]) * rule1Scale; + // Rule 2 results + glm::vec3 v2 = rule2vec * rule2Scale; + // Rule 3 results + if (numNeighbors3 > 0) { + perceivedVec /= numNeighbors3; + } + glm::vec3 v3 = perceivedVec * rule3Scale; + + glm::vec3 rulevel = v1 + v2 + v3; + glm::vec3 newvel = rulevel + vel1[index]; + if (glm::length(newvel) > 1) { + newvel = glm::normalize(newvel); + } + vel2[index] = newvel; + // the boids rules, if this boid is within the neighborhood distance. 
+  // - Clamp the speed change before putting the new speed in vel2
+}
+
+__global__ void kernReshufflePosVel(int N,
+  glm::vec3 *pos, glm::vec3 *temp_pos,
+  glm::vec3 *vel, glm::vec3 *temp_vel,
+  int *dev_particleArrayIndices
+  )
+{
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+  int curr = dev_particleArrayIndices[index];
+  temp_pos[index] = pos[curr];
+  temp_vel[index] = vel[curr];
+}
+/**
+* Step the entire N-body simulation by `dt` seconds.
+*/
+void Boids::stepSimulationNaive(float dt) {
+  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  // TODO-1.2 ping-pong the velocity buffers
+  glm::vec3* temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
+}
+
+void Boids::stepSimulationScatteredGrid(float dt) {
+  // TODO-2.1
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 fullBlocksPerGridCell((gridCellCount + blockSize - 1) / blockSize);
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+    dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  // Wrap device vectors in thrust iterators for use with thrust.
+  thrust::device_ptr<int> dev_thrust_particleGridIndices(dev_particleGridIndices);
+  thrust::device_ptr<int> dev_thrust_particleArrayIndices(dev_particleArrayIndices);
+  // LOOK-2.1 Example for using thrust::sort_by_key
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+  kernResetIntBuffer<<<fullBlocksPerGridCell, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  glm::vec3* temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
+  // Uniform Grid Neighbor search using Thrust sort.
+  // In Parallel:
+  // - label each particle with its array index as well as its grid index.
+  //   Use 2x width grids.
+  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+  //   are welcome to do a performance comparison.
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  // - Perform velocity updates using neighbor search
+  // - Update positions
+  // - Ping-pong buffers as needed
+}
+
+void Boids::stepSimulationCoherentGrid(float dt) {
+  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+  // In Parallel:
+  // - Label each particle with its array index as well as its grid index.
+  //   Use 2x width grids
+  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+  //   are welcome to do a performance comparison.
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+  //   the particle data in the simulation array.
+  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+  // - Perform velocity updates using neighbor search
+  // - Update positions
+  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 fullBlocksPerGridCell((gridCellCount + blockSize - 1) / blockSize);
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+    dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  // Wrap device vectors in thrust iterators for use with thrust.
+  thrust::device_ptr<int> dev_thrust_particleGridIndices(dev_particleGridIndices);
+  thrust::device_ptr<int> dev_thrust_particleArrayIndices(dev_particleArrayIndices);
+  // LOOK-2.1 Example for using thrust::sort_by_key
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+  kernResetIntBuffer<<<fullBlocksPerGridCell, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAError("kernResetIntBuffer failed!");
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAError("kernIdentifyCellStartEnd failed!");
+  kernReshufflePosVel<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, temp_pos, dev_vel1, temp_vel, dev_particleArrayIndices);
+  checkCUDAError("kernReshufflePosVel failed!");
+  std::swap(temp_pos, dev_pos);
+  std::swap(temp_vel, dev_vel1);
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  glm::vec3* temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
+}
+
+void Boids::endSimulation() {
+  cudaFree(dev_vel1);
+  cudaFree(dev_vel2);
+  cudaFree(dev_pos);
+
+  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  cudaFree(temp_pos);
+  cudaFree(temp_vel);
+}
+
+void Boids::unitTest() {
+  // LOOK-1.2 Feel free to write additional tests here.
+
+  // test unstable sort
+  int *dev_intKeys;
+  int *dev_intValues;
+  int N = 10;
+
+  std::unique_ptr<int[]> intKeys{ new int[N] };
+  std::unique_ptr<int[]> intValues{ new int[N] };
+
+  intKeys[0] = 0; intValues[0] = 0;
+  intKeys[1] = 1; intValues[1] = 1;
+  intKeys[2] = 0; intValues[2] = 2;
+  intKeys[3] = 3; intValues[3] = 3;
+  intKeys[4] = 0; intValues[4] = 4;
+  intKeys[5] = 2; intValues[5] = 5;
+  intKeys[6] = 2; intValues[6] = 6;
+  intKeys[7] = 0; intValues[7] = 7;
+  intKeys[8] = 5; intValues[8] = 8;
+  intKeys[9] = 6; intValues[9] = 9;
+
+  cudaMalloc((void**)&dev_intKeys, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!");
+
+  cudaMalloc((void**)&dev_intValues, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!");
+
+  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+
+  std::cout << "before unstable sort: " << std::endl;
+  for (int i = 0; i < N; i++) {
+    std::cout << " key: " << intKeys[i];
+    std::cout << " value: " << intValues[i] << std::endl;
+  }
+
+  // How to copy data to the GPU
+  cudaMemcpy(dev_intKeys, intKeys.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
+  cudaMemcpy(dev_intValues, intValues.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
+
+  // Wrap device vectors in thrust iterators for use with thrust.
+  thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
+  thrust::device_ptr<int> dev_thrust_values(dev_intValues);
+  // LOOK-2.1 Example for using thrust::sort_by_key
+  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
+
+  // How to copy data back to the CPU side from the GPU
+  cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
+  cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
+  checkCUDAErrorWithLine("memcpy back failed!");
+
+  std::cout << "after unstable sort: " << std::endl;
+  for (int i = 0; i < N; i++) {
+    std::cout << " key: " << intKeys[i];
+    std::cout << " value: " << intValues[i] << std::endl;
+  }
+
+  // cleanup
+  cudaFree(dev_intKeys);
+  cudaFree(dev_intValues);
+  checkCUDAErrorWithLine("cudaFree failed!");
+  return;
+}