diff --git a/README.md b/README.md index d63a6a1..eb95c1d 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,85 @@ **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +![Boids Cover Image](images/results/50Kboids.png) -### (TODO: Your README) +* Linda Zhu + * [LinkedIn](https://www.linkedin.com/in/lindadaism/), [Portfolio](https://lindadaism.com/) +* Tested on: Windows 11, i7-12800H @ 2.40GHz 16GB, NVIDIA GeForce RTX 3070 Ti (Personal Laptop) -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +### Results + +*Figure 1: Coherent uniform grid method - 5K boids, 200 scene scale* + +![](images/results/200SceneScale_5Kboids.gif) + + +*Figure 2: Coherent uniform grid method - 5M boids, 2 max speed, 200 scene scale* + +![](images/results/200SceneScale_5Mboids_2MaxSpeed.gif) + + +*Figure 3: Coherent uniform grid method - 10M boids* + +![](images/results/default_10Mboids.gif) + + +## Overview +In this project, I implemented a flocking simulation based on the Reynolds Boids algorithm by writing some simple CUDA kernels, along with two levels of optimization: a scattered uniform grid, and a uniform grid with semi-coherent memory access. I also practiced implementing performance metrics in CUDA using CPU timers or CUDA events. + +### Boid Flocking + +In the Boids flocking simulation, particles representing birds or fish +(boids) move around the simulation space according to three rules: + +1. cohesion - boids move towards the perceived center of mass of their neighbors +2. separation - boids avoid getting too close to their neighbors +3. alignment - boids generally try to move with the same direction and speed as their neighbors + +These three rules specify a boid's velocity change in a timestep. +At every timestep, a boid thus has to look at each of its neighboring boids and compute the velocity change contribution from each of the three rules. Thus, a bare-bones boids implementation has each boid check every other boid in the simulation. + +#### Naive Neighbor Search + +For each of the `N` boids, check every other boid in the simulation, i.e. the other `(N-1)` boids, and use the three rules to compute its new velocity and update its position accordingly. + +#### Scattered Uniform Grid-based Search + +As the naive approach makes clear, having each boid check every other boid is very inefficient, especially if (as in our standard parameters) the number of boids is large and the neighborhood distance is much smaller than the full simulation space. We can cull a lot of neighbor checks using a data structure called a **uniform spatial grid**. + +A uniform grid is made up of cells that are at least as wide as the neighborhood distance and covers the entire simulation domain. Before computing the new velocities of the boids, we "bin" them into the grid in a preprocess step so that we can use the uniform grid to reduce the number of boids that need to be checked.
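To make the "binning" step concrete, the sketch below shows how a boid's position maps to a flattened grid cell index before sorting. This is a hypothetical standalone helper for illustration only; the actual work happens in `kernComputeIndices` and `gridIndex3Dto1D` in `src/kernel.cu`, and the sketch assumes the starter code's x-major cell layout.

```cpp
#include <glm/glm.hpp>

// Hypothetical helper mirroring what kernComputeIndices does per boid:
// quantize the position into 3D cell coordinates, then flatten them in
// x-major order (x + y * res + z * res * res), like gridIndex3Dto1D.
__device__ int positionToCellIndex(glm::vec3 p, glm::vec3 gridMin,
                                   float inverseCellWidth, int gridResolution) {
  glm::ivec3 cell = glm::ivec3(glm::floor((p - gridMin) * inverseCellWidth));
  return cell.x + cell.y * gridResolution
       + cell.z * gridResolution * gridResolution;
}
```

Once every boid is labeled with such a cell index, `thrust::sort_by_key` groups the boid indices by cell and `kernIdentifyCellStartEnd` records each cell's start/end range, so the search kernels only walk a few nearby cells instead of all `N` boids.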
+ +#### Coherent Uniform Grid-based Search + +Coherent grid search is very similar to scattered, except with one less level of indirection. One drawback of the above method is that the pointers to boids in a single cell are contiguous in memory, but the boid data itself (velocities and positions) is scattered all over the place (hence the name, scattered grid simulation). In coherent grid search, we rearrange the boid data itself with additional buffers so that all the velocities and positions of boids in one cell are also contiguous in memory. This lets us access the boid data directly using the grid cell index information, with no extra level of indirection. + + +## Performance Analysis + +### Framerate change with respect to increasing boid count + +![How boid count affects avgFPS and avgKernExecTime](/images/results/avgFPS&avgKernExecTime_numBoids_graph.png) + +From the graph we can see that, with or without visualization, the average FPS drops while the average kernel execution time increases as we increase the boid count. As expected from these optimizations, the coherent method has the best performance, then the scattered grid method, and naive the worst: naive is a brute-force search, scattered adds a filter that narrows down the neighboring search targets, and coherent stacks one more layer of optimization at the hardware level by making boid data accesses contiguous. + +## Questions + +**Q: For each implementation how does changing the number of boids affect performance? Why do you think this is?** + +Increasing the number of boids by 10 times, or exponentially, leads to a significant performance drop. This is likely because, given the same simulation configuration such as scene scale and search radius, the higher density of boids in each grid cell makes it necessary to perform more operations in each cell, such as condition checking and rule calculations. The same limitation applies to all three methods we implemented: naive, scattered uniform grid and coherent uniform grid (graph to be added later). This is expected because more hardware resources are required, e.g. more threads, to handle these boid computations -- optimization in the algorithm alone won't be enough. + +**Q: For each implementation, how does changing the block count and block size affect performance? Why do you think this is?** + +I did not observe any big impact from modifying the block count/size. Since GPU threads are grouped into warps, i.e. collections of 32 threads that can execute simultaneously, I chose multiples of 32 for the block size to maximize occupancy, starting from the default given value of 128. However, the average FPS and average kernel execution time stay relatively constant across all three approaches, especially in the range of [256, 1024] (graph to be added later). I think this is because changing the block count/size changes neither the number of threads actually executing in parallel nor the way we access data on the device. In theory, a block size that is not a multiple of 32 would leave some threads in those warps inactive and cause a slowdown. Had I tried some other values that are not multiples of 32, I might have had more observations to reach a convincing conclusion.
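As a small sanity check on that reasoning, the sketch below (a standalone toy program, not part of the project code) applies the same round-up formula the step functions use to size their launches: for the default `N_FOR_VIS = 5000`, every block size tested ends up launching the same 5120 threads, just grouped into a different number of blocks.

```cpp
#include <cstdio>

// Round-up division used throughout src/kernel.cu to size kernel launches.
static int numBlocks(int n, int blockSize) {
  return (n + blockSize - 1) / blockSize;
}

int main() {
  const int N = 5000;  // default N_FOR_VIS in src/main.cpp
  const int sizes[] = {128, 256, 512, 1024};
  for (int bs : sizes) {
    // e.g. 128 -> 40 blocks, 1024 -> 5 blocks; both launch 5120 threads total
    std::printf("blockSize %4d -> %3d blocks (%d threads)\n",
                bs, numBlocks(N, bs), numBlocks(N, bs) * bs);
  }
  return 0;
}
```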
+ +**Q: For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?** + +Yes, I did see an obvious performance boost with the coherent uniform grid compared to the original uniform grid method with scattered boid data, not to mention the naive brute-force method. This outcome is what I expected because the biggest difference of the coherent grid, and also its main advantage over the scattered grid, is the contiguous memory access of boid data. Reading and writing boid data stored in global memory is among the slowest operations on the GPU, so removing one level of indirection definitely helps make the execution faster. Although we (may) need to allocate space for additional buffers to rearrange the boid data, it is only one extra reshuffle pass per step and cost-wise worthwhile given the time we save from more coherent memory accesses. + +**Q: Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!** + +I've only seen a slight performance improvement from checking 8 neighboring cells (avgFPS: 450, 760, 830, 780, 800 and avgKernelExecTime = 0.2ms) to 27 neighboring cells (avgFPS: 650, 740, 800, 830, 790 and avgKernelExecTime = 0.18ms), given a 5K boid count and otherwise default simulation settings (graph to be added later). I think this might be because the density of my sample boids is too sparse, only in the thousands. In this case, the number of checks per grid cell is generally quite small, regardless of the cell width or search radius, so there is no big difference. I would expect tightening the grid cells and checking 27 neighbors to matter more when the boid count is much higher, because each cell then requires fewer checks. + + +#### References: +1. https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/ diff --git a/images/results/1024BlockSize_5Kboids_27Search.gif b/images/results/1024BlockSize_5Kboids_27Search.gif new file mode 100644 index 0000000..4232b6c Binary files /dev/null and b/images/results/1024BlockSize_5Kboids_27Search.gif differ diff --git a/images/results/1024BlockSize_5Kboids_8Search.gif b/images/results/1024BlockSize_5Kboids_8Search.gif new file mode 100644 index 0000000..542de19 Binary files /dev/null and b/images/results/1024BlockSize_5Kboids_8Search.gif differ diff --git a/images/results/200SceneScale_5Kboids.gif b/images/results/200SceneScale_5Kboids.gif new file mode 100644 index 0000000..b937977 Binary files /dev/null and b/images/results/200SceneScale_5Kboids.gif differ diff --git a/images/results/200SceneScale_5Mboids_2MaxSpeed.gif b/images/results/200SceneScale_5Mboids_2MaxSpeed.gif new file mode 100644 index 0000000..cfd0614 Binary files /dev/null and b/images/results/200SceneScale_5Mboids_2MaxSpeed.gif differ diff --git a/images/results/500SceneScale_5Mboids_2MaxSpeed_1.5DT.gif b/images/results/500SceneScale_5Mboids_2MaxSpeed_1.5DT.gif new file mode 100644 index 0000000..b68526a Binary files /dev/null and b/images/results/500SceneScale_5Mboids_2MaxSpeed_1.5DT.gif differ diff --git a/images/results/500SceneScale_5Mboids_5MaxSpeed_1DT.gif b/images/results/500SceneScale_5Mboids_5MaxSpeed_1DT.gif new file mode 100644 index 0000000..701c3a0 Binary files /dev/null and b/images/results/500SceneScale_5Mboids_5MaxSpeed_1DT.gif differ diff --git a/images/results/50Kboids.png b/images/results/50Kboids.png new file mode 100644 index 0000000..37ec849 Binary files /dev/null and b/images/results/50Kboids.png differ diff --git a/images/results/50Kboids_200SceneScale.png b/images/results/50Kboids_200SceneScale.png new file mode 100644 index 0000000..f7579e6 Binary files /dev/null and b/images/results/50Kboids_200SceneScale.png differ diff --git 
a/images/results/avgFPS&avgKernExecTime_numBoids_graph.png b/images/results/avgFPS&avgKernExecTime_numBoids_graph.png new file mode 100644 index 0000000..2a930ab Binary files /dev/null and b/images/results/avgFPS&avgKernExecTime_numBoids_graph.png differ diff --git a/images/results/default_10Mboids.gif b/images/results/default_10Mboids.gif new file mode 100644 index 0000000..0245e0d Binary files /dev/null and b/images/results/default_10Mboids.gif differ diff --git a/images/results/default_5Mboids.gif b/images/results/default_5Mboids.gif new file mode 100644 index 0000000..7a89eb8 Binary files /dev/null and b/images/results/default_5Mboids.gif differ diff --git a/images/results/default_coherent.png b/images/results/default_coherent.png new file mode 100644 index 0000000..183258f Binary files /dev/null and b/images/results/default_coherent.png differ diff --git a/images/results/lonely_boids.gif b/images/results/lonely_boids.gif new file mode 100644 index 0000000..95b68de Binary files /dev/null and b/images/results/lonely_boids.gif differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..269e93c 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -37,7 +37,7 @@ void checkCUDAError(const char *msg, int line = -1) { *****************/ /*! Block size used for CUDA kernel launch. */ -#define blockSize 128 +#define blockSize 128//1024// 512// 256 // LOOK-1.2 Parameters for the boids algorithm. // These worked well in our reference implementation. @@ -54,6 +54,10 @@ void checkCUDAError(const char *msg, int line = -1) { /*! Size of the starting area in simulation space. */ #define scene_scale 100.0f +/*! Search algorithm config. */ +#define NEIGHBOR_SEARCH_SIZE_8 0 +#define NEIGHBOR_SEARCH_SIZE_27 1 + /*********************************************** * Kernel state (pointers are device pointers) * ***********************************************/ @@ -85,6 +89,8 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3 *dev_reshuffled_pos; +glm::vec3 *dev_reshuffled_vel; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -142,13 +148,14 @@ void Boids::initSimulation(int N) { // LOOK-1.2 - This is basic CUDA memory management and error checking. // Don't forget to cudaFree in Boids::endSimulation. - cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3)); + size_t boidDataSize = N * sizeof(glm::vec3); + cudaMalloc((void**)&dev_pos, boidDataSize); checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); - cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); + cudaMalloc((void**)&dev_vel1, boidDataSize); checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); - cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); + cudaMalloc((void**)&dev_vel2, boidDataSize); checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); // LOOK-1.2 - This is a typical CUDA kernel invocation. 
@@ -157,7 +164,11 @@ void Boids::initSimulation(int N) { checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); // LOOK-2.1 computing grid params +#if NEIGHBOR_SEARCH_SIZE_8 gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); +#elif NEIGHBOR_SEARCH_SIZE_27 + gridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance); +#endif int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; gridSideCount = 2 * halfSideCount; @@ -169,6 +180,29 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + size_t particleIdxSize = N * sizeof(int); + cudaMalloc((void**)&dev_particleArrayIndices, particleIdxSize); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, particleIdxSize); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + size_t gridCellDataSize = gridCellCount * sizeof(int); + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellDataSize); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellDataSize); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices); + dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices); + + cudaMalloc((void**)&dev_reshuffled_pos, boidDataSize); + checkCUDAErrorWithLine("cudaMalloc dev_reshuffled_pos failed!"); + + cudaMalloc((void**)&dev_reshuffled_vel, boidDataSize); + checkCUDAErrorWithLine("cudaMalloc dev_reshuffled_vel failed!"); + cudaDeviceSynchronize(); } @@ -230,10 +264,54 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * in the `pos` and `vel` arrays. 
*/ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + glm::vec3 cohesionCenter(0.0f); + glm::vec3 cohesionVel(0.0f); + + glm::vec3 separationCenter(0.0f);; + glm::vec3 separationVel(0.0f);; + + glm::vec3 alignmentVel(0.0f);; + + int numCohesionNeighbors = 0, numAlignmentNeighbors = 0; + + for (int i = 0; i < N; i++) { + // exclude the target boid itself + if (i == iSelf) { + continue; + } + + float distance = glm::distance(pos[iSelf], pos[i]); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (distance < rule1Distance) { + cohesionCenter += pos[i]; + numCohesionNeighbors++; + } + + // Rule 2: boids try to stay a distance d away from each other, including other boids + if (distance < rule2Distance) { + separationCenter -= (pos[i] - pos[iSelf]); + } + + // Rule 3: boids try to match the speed of surrounding boids + if (distance < rule3Distance) { + alignmentVel += vel[i]; + numAlignmentNeighbors++; + } + } + + if (numCohesionNeighbors > 0) { + cohesionCenter /= numCohesionNeighbors; + cohesionVel = (cohesionCenter - pos[iSelf]) * rule1Scale; + } + + separationVel = separationCenter * rule2Scale; + + if (numAlignmentNeighbors > 0) { + alignmentVel /= numAlignmentNeighbors; + alignmentVel *= rule3Scale; + } + + return vel[iSelf] + cohesionVel + separationVel + alignmentVel; } /** @@ -243,8 +321,19 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { // Compute a new velocity based on pos and vel1 + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= N) { + return; + } + + glm::vec3 newVel = computeVelocityChange(N, index, pos, vel1); // Clamp the speed + float newSpeed = glm::length(newVel); + if (newSpeed > maxSpeed) { + newVel = newVel / newSpeed * maxSpeed; + } // Record the new velocity into vel2. Question: why NOT vel1? + vel2[index] = newVel; } /** @@ -287,8 +376,19 @@ __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 *pos, int *indices, int *gridIndices) { // TODO-2.1 // - Label each boid with the index of its grid cell. + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= N) { + return; + } + + glm::ivec3 boidGirdIdx3D = glm::floor((pos[index] - gridMin) * inverseCellWidth); + int boidGridIdx = gridIndex3Dto1D(boidGirdIdx3D.x, boidGirdIdx3D.y, + boidGirdIdx3D.z, gridResolution); + // - Set up a parallel array of integer indices as pointers to the actual // boid data in pos and vel1/vel2 + indices[index] = index; + gridIndices[index] = boidGridIdx; } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -306,6 +406,24 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" 
+ int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= N) { + return; + } + int gridIdx = particleGridIndices[index]; + + // check if it's the first boid, or the first boid of a new cell, start filling startIndices + if (index == 0) { + gridCellStartIndices[gridIdx] = index; + } + else if (gridIdx != particleGridIndices[index - 1]) { + gridCellStartIndices[gridIdx] = index; + gridCellEndIndices[particleGridIndices[index - 1]] = index - 1; + } + // the last boid always closes the cell it belongs to + if (index == N - 1) { + gridCellEndIndices[gridIdx] = index; + } } __global__ void kernUpdateVelNeighborSearchScattered( @@ -316,12 +434,134 @@ __global__ void kernUpdateVelNeighborSearchScattered( glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce // the number of boids that need to be checked. + + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= N) { + return; + } + + int boidIdx = particleArrayIndices[index]; + glm::vec3 boidPos = pos[boidIdx]; + // - Identify the grid cell that this particle is in + glm::vec3 boidGirdIdx3D = (boidPos - gridMin) * inverseCellWidth; + glm::ivec3 intGridIdx3D = glm::floor(boidGirdIdx3D); + // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. + int gridOffset; // search range in each direction, e.g. [-1, 1] means 3 cells + int minCells[3]; // minX/Y/Z + int maxCells[3]; // maxX/Y/Z + + boidGirdIdx3D = glm::fract(boidGirdIdx3D); + for (int i = 0; i < 3; ++i) { +#if NEIGHBOR_SEARCH_SIZE_8 + gridOffset = 0; +#elif NEIGHBOR_SEARCH_SIZE_27 + gridOffset = 1; +#endif + if (boidGirdIdx3D[i] < 0.5f) { + gridOffset = -1; + } + int start = intGridIdx3D[i]; + int end = intGridIdx3D[i] + gridOffset; + + // clamp cell search bounds + minCells[i] = imax(0, imin(start, end)); + maxCells[i] = imin(gridResolution - 1, imax(start, end)); + } + + // neighbor search for nearby boids + glm::vec3 cohesionCenter(0.0f); + glm::vec3 cohesionVel(0.0f); + + glm::vec3 separationCenter(0.0f);; + glm::vec3 separationVel(0.0f);; + + glm::vec3 alignmentVel(0.0f);; + + int numCohesionNeighbors = 0, numAlignmentNeighbors = 0; + int lastGridIdx = (gridResolution * gridResolution * gridResolution) - 1; + + for (int z = minCells[2]; z <= maxCells[2]; ++z) { + for (int y = minCells[1]; y <= maxCells[1]; ++y) { + for (int x = minCells[0]; x <= maxCells[0]; ++x) { + // - For each cell, read the start/end indices in the boid pointer array. + int gridIdx = gridIndex3Dto1D(x, y, z, gridResolution); + if (gridIdx < 0 || gridIdx > lastGridIdx) { + continue; // skip if past cell bounds + } + + int startBoid = gridCellStartIndices[gridIdx]; + int endBoid = gridCellEndIndices[gridIdx]; + if (startBoid == -1 || endBoid == -1) { + continue; // skip if cell is empty + } + + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. 
+ for (int i = startBoid; i <= endBoid; ++i) { + int neighborIdx = particleArrayIndices[i]; + if (neighborIdx == boidIdx) { + continue; + } + + glm::vec3 neighborPos = pos[neighborIdx]; + float distance = glm::distance(boidPos, neighborPos); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (distance < rule1Distance) { + cohesionCenter += neighborPos; + numCohesionNeighbors++; + } + + // Rule 2: boids try to stay a distance d away from each other, including other boids + if (distance < rule2Distance) { + separationCenter -= (neighborPos - boidPos); + } + + // Rule 3: boids try to match the speed of surrounding boids + if (distance < rule3Distance) { + alignmentVel += vel1[neighborIdx]; + numAlignmentNeighbors++; + } + } + } + } + } + + if (numCohesionNeighbors > 0) { + cohesionCenter /= numCohesionNeighbors; + cohesionVel = (cohesionCenter - boidPos) * rule1Scale; + } + + separationVel = separationCenter * rule2Scale; + + if (numAlignmentNeighbors > 0) { + alignmentVel /= numAlignmentNeighbors; + alignmentVel *= rule3Scale; + } + + glm::vec3 newVel = vel1[boidIdx] + cohesionVel + separationVel + alignmentVel; // - Clamp the speed change before putting the new speed in vel2 + float speed = glm::length(newVel); + if (speed > maxSpeed) { + newVel = newVel / speed * maxSpeed; + } + + vel2[boidIdx] = newVel; +} + +/** +* For 2.3: fill in additional buffers of boid data, i.e. pos&vel, +* such that their order matches pointers to boids in sorted grid cells. +*/ +__global__ void kernShuffleBoidDataPerGrid( + int N, glm::vec3* destBuffer, glm::vec3* sourceBuffer, + int* particleArrayIndices) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index < N) { + int boidIdx = particleArrayIndices[index]; + destBuffer[index] = sourceBuffer[boidIdx]; + } } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -341,47 +581,239 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + + int boidIdx = blockIdx.x * blockDim.x + threadIdx.x; + if (boidIdx >= N) { + return; + } + + glm::vec3 boidPos = pos[boidIdx]; + + // - Identify the grid cell that this particle is in + glm::vec3 boidGirdIdx3D = (boidPos - gridMin) * inverseCellWidth; + glm::ivec3 intGridIdx3D = glm::floor(boidGirdIdx3D); + + // - Identify which cells may contain neighbors. This isn't always 8. + int gridOffset; // search range in each direction, e.g. 
[-1, 1] means 3 cells + int minCells[3]; // minX/Y/Z + int maxCells[3]; // maxX/Y/Z + + boidGirdIdx3D = glm::fract(boidGirdIdx3D); + for (int i = 0; i < 3; ++i) { +#if NEIGHBOR_SEARCH_SIZE_8 + gridOffset = 0; +#elif NEIGHBOR_SEARCH_SIZE_27 + gridOffset = 1; +#endif + if (boidGirdIdx3D[i] < 0.5f) { + gridOffset = -1; + } + int start = intGridIdx3D[i]; + int end = intGridIdx3D[i] + gridOffset; + + // clamp cell search bounds + minCells[i] = imax(0, imin(start, end)); + maxCells[i] = imin(gridResolution - 1, imax(start, end)); + } + + // neighbor search for nearby boids + glm::vec3 cohesionCenter(0.0f); + glm::vec3 cohesionVel(0.0f); + + glm::vec3 separationCenter(0.0f);; + glm::vec3 separationVel(0.0f);; + + glm::vec3 alignmentVel(0.0f);; + + int numCohesionNeighbors = 0, numAlignmentNeighbors = 0; + int lastGridIdx = (gridResolution * gridResolution * gridResolution) - 1; + + for (int z = minCells[2]; z <= maxCells[2]; ++z) { + for (int y = minCells[1]; y <= maxCells[1]; ++y) { + for (int x = minCells[0]; x <= maxCells[0]; ++x) { + // - For each cell, read the start/end indices in the boid pointer array. + int gridIdx = gridIndex3Dto1D(x, y, z, gridResolution); + if (gridIdx < 0 || gridIdx > lastGridIdx) { + continue; // skip if past cell bounds + } + + int startBoid = gridCellStartIndices[gridIdx]; + int endBoid = gridCellEndIndices[gridIdx]; + if (startBoid == -1 || endBoid == -1) { + continue; // skip if cell is empty + } + + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + for (int neighborIdx = startBoid; neighborIdx <= endBoid; ++neighborIdx) { + if (neighborIdx == boidIdx) { + continue; + } + + glm::vec3 neighborPos = pos[neighborIdx]; + float distance = glm::distance(boidPos, neighborPos); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (distance < rule1Distance) { + cohesionCenter += neighborPos; + numCohesionNeighbors++; + } + + // Rule 2: boids try to stay a distance d away from each other, including other boids + if (distance < rule2Distance) { + separationCenter -= (neighborPos - boidPos); + } + + // Rule 3: boids try to match the speed of surrounding boids + if (distance < rule3Distance) { + alignmentVel += vel1[neighborIdx]; + numAlignmentNeighbors++; + } + } + } + } + } + + if (numCohesionNeighbors > 0) { + cohesionCenter /= numCohesionNeighbors; + cohesionVel = (cohesionCenter - boidPos) * rule1Scale; + } + + separationVel = separationCenter * rule2Scale; + + if (numAlignmentNeighbors > 0) { + alignmentVel /= numAlignmentNeighbors; + alignmentVel *= rule3Scale; + } + + glm::vec3 newVel = vel1[boidIdx] + cohesionVel + separationVel + alignmentVel; + // - Clamp the speed change before putting the new speed in vel2 + float speed = glm::length(newVel); + if (speed > maxSpeed) { + newVel = newVel / speed * maxSpeed; + } + + vel2[boidIdx] = newVel; } /** * Step the entire N-body simulation by `dt` seconds. */ void Boids::stepSimulationNaive(float dt) { - // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. - // TODO-1.2 ping-pong the velocity buffers + // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. 
+ int numBlocksPerGrid = (numObjects + blockSize - 1) / blockSize; + + kernUpdateVelocityBruteForce<<<numBlocksPerGrid, blockSize>>>(numObjects, dev_pos, + dev_vel1, dev_vel2); + checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!"); + + kernUpdatePos<<<numBlocksPerGrid, blockSize>>>(numObjects, dt, + dev_pos, dev_vel2); + checkCUDAErrorWithLine("kernUpdatePos failed!"); + + // TODO-1.2 ping-pong the velocity buffers + std::swap(dev_vel1, dev_vel2); } void Boids::stepSimulationScatteredGrid(float dt) { // TODO-2.1 // Uniform Grid Neighbor search using Thrust sort. + int numBlocksPerBoid = (numObjects + blockSize - 1) / blockSize; + int numBlocksPerGrid = (gridCellCount + blockSize - 1) / blockSize; + // In Parallel: // - label each particle with its array index as well as its grid index. // Use 2x width grids. + kernComputeIndices<<<numBlocksPerBoid, blockSize>>>(numObjects, + gridSideCount, gridMinimum, gridInverseCellWidth, + dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you // are welcome to do a performance comparison. + thrust::sort_by_key(dev_thrust_particleGridIndices, + dev_thrust_particleGridIndices + numObjects, + dev_thrust_particleArrayIndices); + + // reset boid data arrays for grid cells + kernResetIntBuffer<<<numBlocksPerGrid, blockSize>>>(gridCellCount, + dev_gridCellStartIndices, -1); + kernResetIntBuffer<<<numBlocksPerGrid, blockSize>>>(gridCellCount, + dev_gridCellEndIndices, -1); + // - Naively unroll the loop for finding the start and end indices of each // cell's data pointers in the array of boid indices + kernIdentifyCellStartEnd<<<numBlocksPerBoid, blockSize>>>(numObjects, + dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + // - Perform velocity updates using neighbor search + kernUpdateVelNeighborSearchScattered<<<numBlocksPerBoid, blockSize>>>( + numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, + gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, + dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2); + // - Update positions + kernUpdatePos<<<numBlocksPerBoid, blockSize>>>(numObjects, + dt, dev_pos, dev_vel2); + // - Ping-pong buffers as needed + std::swap(dev_vel1, dev_vel2); } void Boids::stepSimulationCoherentGrid(float dt) { // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid // Uniform Grid Neighbor search using Thrust sort on cell-coherent data. + + int numBlocksPerBoid = (numObjects + blockSize - 1) / blockSize; + int numBlocksPerGrid = (gridCellCount + blockSize - 1) / blockSize; + // In Parallel: - // - Label each particle with its array index as well as its grid index. - // Use 2x width grids + // - label each particle with its array index as well as its grid index. // Use 2x width grids. + kernComputeIndices<<<numBlocksPerBoid, blockSize>>>(numObjects, + gridSideCount, gridMinimum, gridInverseCellWidth, + dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you // are welcome to do a performance comparison. 
+ thrust::sort_by_key(dev_thrust_particleGridIndices, + dev_thrust_particleGridIndices + numObjects, + dev_thrust_particleArrayIndices); + + // reset boid data arrays for grid cells + kernResetIntBuffer<<<numBlocksPerGrid, blockSize>>>(gridCellCount, + dev_gridCellStartIndices, -1); + kernResetIntBuffer<<<numBlocksPerGrid, blockSize>>>(gridCellCount, + dev_gridCellEndIndices, -1); + // - Naively unroll the loop for finding the start and end indices of each // cell's data pointers in the array of boid indices + kernIdentifyCellStartEnd<<<numBlocksPerBoid, blockSize>>>(numObjects, + dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all // the particle data in the simulation array. - // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED + kernShuffleBoidDataPerGrid<<<numBlocksPerBoid, blockSize>>>(numObjects, + dev_reshuffled_pos, dev_pos, dev_particleArrayIndices); + checkCUDAErrorWithLine("kernShuffleBoidDataPerGrid - position failed!"); + + kernShuffleBoidDataPerGrid<<<numBlocksPerBoid, blockSize>>>(numObjects, + dev_reshuffled_vel, dev_vel1, dev_particleArrayIndices); + checkCUDAErrorWithLine("kernShuffleBoidDataPerGrid - velocity failed!"); + // - Perform velocity updates using neighbor search + kernUpdateVelNeighborSearchCoherent<<<numBlocksPerBoid, blockSize>>>( + numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, + gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, + dev_reshuffled_pos, dev_reshuffled_vel, dev_vel2); + checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!"); + // - Update positions + kernUpdatePos<<<numBlocksPerBoid, blockSize>>>(numObjects, + dt, dev_reshuffled_pos, dev_vel2); + checkCUDAErrorWithLine("Coherent cells: kernUpdatePos failed!"); + // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + std::swap(dev_vel1, dev_vel2); + std::swap(dev_pos, dev_reshuffled_pos); } void Boids::endSimulation() { @@ -390,6 +822,14 @@ void Boids::endSimulation() { cudaFree(dev_pos); // TODO-2.1 TODO-2.3 - Free any additional buffers here. + cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); + + cudaFree(dev_reshuffled_pos); + cudaFree(dev_reshuffled_vel); + checkCUDAErrorWithLine("cudaFree failed!"); } void Boids::unitTest() { @@ -455,3 +895,11 @@ void Boids::unitTest() { checkCUDAErrorWithLine("cudaFree failed!"); return; } + +int Boids::getBlockSize() { + return blockSize; +} + +float Boids::getSceneScale() { + return scene_scale; +} diff --git a/src/kernel.h b/src/kernel.h index 3d3da72..0ce747c 100644 --- a/src/kernel.h +++ b/src/kernel.h @@ -18,4 +18,7 @@ namespace Boids { void endSimulation(); void unitTest(); + + int getBlockSize(); + float getSceneScale(); } diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..67b1f21 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -14,8 +14,8 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 // LOOK-1.2 - change this to adjust particle count in the simulation const int N_FOR_VIS = 5000; @@ -43,6 +43,12 @@ int main(int argc, char* argv[]) { std::string deviceName; GLFWwindow *window; +// For performance analysis +double averageFps = 0.0; +double averageExecTime = 0.0; // average kernel execution +double totalFrames = 0.0; +double fpsFrames = 0.0; + /** * Initialization of CUDA and GLFW. 
*/ @@ -195,6 +201,14 @@ void initShaders(GLuint * program) { cudaGLMapBufferObject((void**)&dptrVertPositions, boidVBO_positions); cudaGLMapBufferObject((void**)&dptrVertVelocities, boidVBO_velocities); + // For performance analysis: + // use CUDA events to measure the time it takes for a kernel to run + cudaEvent_t startKern, endKern; + cudaEventCreate(&startKern); + cudaEventCreate(&endKern); + + cudaEventRecord(startKern); // mark start of kernel execution + // execute the kernel #if UNIFORM_GRID && COHERENT_GRID Boids::stepSimulationCoherentGrid(DT); @@ -204,6 +218,13 @@ void initShaders(GLuint * program) { Boids::stepSimulationNaive(DT); #endif + cudaEventRecord(endKern); // mark end of kernel execution + cudaEventSynchronize(endKern); // wait for results to be available + + float execTime; // in milliseconds + cudaEventElapsedTime(&execTime, startKern, endKern); + averageExecTime = (averageExecTime * (totalFrames - 1) + execTime) / totalFrames; + #if VISUALIZE Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities); #endif @@ -225,11 +246,15 @@ void initShaders(GLuint * program) { frame++; double time = glfwGetTime(); + totalFrames++; if (time - timebase > 1.0) { fps = frame / (time - timebase); timebase = time; frame = 0; + + fpsFrames++; + averageFps = (averageFps * (fpsFrames - 1) + fps) / fpsFrames; } runCUDA(); @@ -237,8 +262,18 @@ void initShaders(GLuint * program) { std::ostringstream ss; ss << "["; ss.precision(1); - ss << std::fixed << fps; - ss << " fps] " << deviceName; + ss << std::fixed << fps << "fps, "; + ss.precision(1); + ss << std::fixed << averageFps << "avgFps, "; + ss.precision(2); + ss << std::fixed << averageExecTime << "ms avgKernExec, "; + ss << N_FOR_VIS << " boids, "; + ss << "blockSize " << Boids::getBlockSize(); + ss << ", sceneScale "; + ss.precision(1); + ss << std::fixed << Boids::getSceneScale(); + ss << "] " << deviceName; + glfwSetWindowTitle(window, ss.str().c_str()); glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);