diff --git a/README.md b/README.md
index d63a6a1..ab2b970 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,167 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**

-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Vasu Mahesh
+  * [LinkedIn](http://linkedin.com/in/vasumahesh)
+  * [Code Blog](http://www.codeplaysleep.com)

-### (TODO: Your README)
+* Tested on a Laptop:
+  * Windows 10
+  * i7-8650U @ 1.90GHz
+  * 16GB RAM
+  * GTX 1060 6GB
+  * Visual Studio 2017 (with v140 toolset)
+  * CUDA v8.0

-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+* Tested on a Desktop:
+  * Windows 10
+  * i7-4790K @ 4.0GHz
+  * 16GB RAM
+  * GTX 970 4GB
+  * Visual Studio 2017 (with v140 toolset)
+  * CUDA v8.0
+
+![](images/intro.gif)
+
+## Build
+
+Build Command: `cmake -G "Visual Studio 15 2017 Win64" -DCUDA_TOOLKIT_ROOT_DIR="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0" -T v140,cuda=8.0 ..`
+
+I tested the code on both systems. A small timing harness in `main.cpp` computes the average milliseconds per frame: the program runs for roughly 30 seconds, reports the metric in the console, and stops.
+
+**Notes:** All graphs report milliseconds per frame, which is simply 1000 / FPS.
+
+## Results
+
+Results compare the time per frame, in milliseconds, on a GTX 970 and a GTX 1060.
+
+| Statistic               | GTX 970           | GTX 1060          |
+| ----------------------- |:-----------------:|:-----------------:|
+| Architecture            | Maxwell           | Pascal            |
+| Core Speed              | 1050 - 1178 MHz   | 1404 - 1670 MHz   |
+| Memory Speed            | 3500 MHz          | 8000 MHz          |
+| Memory Bus Width        | 256-bit           | 192-bit           |
+| Max. Amount of Memory   | 4096 MB           | 6144 MB           |
+| CUDA Cores              | 1664              | 1280              |
+
+The GTX 970 has more CUDA cores, but a noticeably lower clock speed. I expected the 970 to win; it turned out the 1060 was faster. One caveat: the 970 was running an older driver, which could also play a role, but I suspect the 1060's higher core and memory clocks and newer architecture mattered more.
+
+---
+
+### Algorithm Performance
+
+`For each implementation, how does changing the number of boids affect performance? Why do you think this is?`
+
+As the number of boids increases, frame time rises across all three implementations. The naive algorithm crashed (unspecified launch failure) at 500,000 boids and above.
+
+![](images/algorithm_gtx1060.PNG)
+
+| Boid Count  | Naive (ms)                 | Uniform Grid (ms)          | Coherent Grid (ms) |
+| ----------- | -------------------------- | -------------------------- | ------------------ |
+| 5,000       | 1.73356                    | 1.69154                    | 1.79578            |
+| 50,000      | 63.0121                    | 3.09724                    | 2.10838            |
+| 100,000     | 217.913                    | 8.21635                    | 2.12726            |
+| 500,000     | Unspecified Launch Failure | 293.805                    | 15.647             |
+| 1,000,000   | Unspecified Launch Failure | 1399.34                    | 54.7602            |
+| 5,000,000   | Unspecified Launch Failure | Unspecified Launch Failure | 1282.11            |
+
+`For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?`
+
+The coherent uniform grid gave a tremendous performance increase. Compared to the scattered uniform grid, it removes one level of indirection: the indices obtained from the grid-start and grid-end arrays address the boid data directly, so there is far less "jumping" across memory.
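+
+To make the indirection point concrete, here is a minimal sketch of the two access patterns (a simplification of the actual kernels in `src/kernel.cu`; `particleArrayIndices` is the boid-index array produced by the sort):
+
+```cpp
+// Scattered uniform grid: every neighbor lookup goes through an extra
+// index array, so the pos/vel reads land at scattered addresses.
+for (int cellIdx = gridStart; cellIdx <= gridEnd; ++cellIdx) {
+  const int dataIdx = particleArrayIndices[cellIdx]; // read 1: the index
+  const glm::vec3 neighborPos = pos[dataIdx];        // read 2: scattered
+}
+
+// Coherent uniform grid: pos/vel were reshuffled into cell order first,
+// so the cell range indexes the boid data directly and contiguously.
+for (int dataIdx = gridStart; dataIdx <= gridEnd; ++dataIdx) {
+  const glm::vec3 neighborPos = pos[dataIdx];        // one contiguous read
+}
+```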
+
+---
+
+### Block Size
+
+`For each implementation, how does changing the block count and block size
+affect performance? Why do you think this is?`
+
+Block sizes below 32 perform noticeably worse. A warp is 32 threads, so smaller blocks leave most of each warp idle and under-occupy the GPU. Since the block count here is derived from the boid count and block size, only the block size was varied. There is also a small, unexplained bump at a block size of 1024, observed on both Maxwell and Pascal.
+
+![](images/block_size_all.PNG)
+
+| Block Size  | GTX 970 (ms) | GTX 1060 (ms)   |
+| ----------- | ------------ | --------------- |
+| 4           | 8.75682      | 8.32634         |
+| 8           | 5.3235       | 5.14231         |
+| 16          | 3.34341      | 3.33034         |
+| 32          | 2.33586      | 2.3088          |
+| 64          | 2.2251       | 2.10187         |
+| 128         | 2.25017      | 2.10287         |
+| 256         | 2.25511      | 2.10359         |
+| 512         | 2.27528      | 2.1192          |
+| 1024        | 2.44437      | 2.40032         |
+
+---
+
+### Cell Neighbor Test
+
+`Did changing cell width and checking 27 vs 8 neighboring cells affect performance?
+Why or why not?`
+
+![](images/neighbor_cell_test_gtx1060.PNG)
+
+| Boid Count  | 8 cell test (ms) | 27 cell test (ms) |
+| ----------- | ---------------- | ----------------- |
+| 5,000       | 1.79578          | 1.7994            |
+| 50,000      | 2.10838          | 1.99957           |
+| 100,000     | 2.12726          | 1.45459           |
+| 500,000     | 15.647           | 7.48839           |
+| 1,000,000   | 54.7602          | 21.0801           |
+| 5,000,000   | 1282.11          | 366.668           |
+
+Up to 50,000 boids, the 8-cell and 27-cell tests are nearly identical. From 100,000 boids onward, the 27-cell test is clearly faster, dramatically so at 500,000 boids and above. This is because:
+
+```cpp
+// 8 Neighbor Cell Test
+// ------------------------------
+float CellWidth = 2.0f * SearchRadius;
+
+// 8 cells in a 2x2x2 arrangement give:
+float TotalSquareSide = 2.0f * CellWidth; // equals 4 x SearchRadius
+
+
+// 27 Neighbor Cell Test
+// ------------------------------
+float CellWidth = 1.0f * SearchRadius;
+
+// 27 cells in a 3x3x3 arrangement give:
+float TotalSquareSide = 3.0f * CellWidth; // equals 3 x SearchRadius
+```
+
+`TotalSquareSide` is shorter for the 27-cell neighborhood (3x the search radius) than for the 8-cell one (4x), so the 8-cell search box has (4/3)^3 ≈ 2.4 times the volume. With roughly uniform boid density, that means about 2.4 times as many candidate boids to distance-check per update in the 8-cell test, and the absolute gap grows as the boid count increases.
+
+![](images/neighbor_cell_test_gtx970.PNG)
+
+| Boid Count  | 8 cell test (ms) | 27 cell test (ms) |
+| ----------- | ---------------- | ----------------- |
+| 5,000       | 1.26449          | 1.14823           |
+| 50,000      | 1.45266          | 1.17393           |
+| 100,000     | 2.27812          | 1.62882           |
+| 500,000     | 16.778           | 7.63463           |
+| 1,000,000   | 59.6526          | 20.5757           |
+| 5,000,000   | 2395.84          | 345.515           |
+
+Surprisingly, the GTX 970 **performed better** than the GTX 1060 on the 27-cell test, perhaps because its higher core count pays off once each thread performs fewer checks. I ran this test several times, and the results were consistent.
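+
+Note that the neighbor-search kernels in this repo don't hard-code 8 or 27 cells; they derive the cell range from the search radius, so changing the cell width is enough to switch tests. A condensed excerpt of that bound computation (taken from `kernUpdateVelNeighborSearchCoherent`; clamping to the grid edges is omitted):
+
+```cpp
+// With cellWidth = 2 * R, the searched box of side 2R spans at most
+// 2x2x2 = 8 cells; with cellWidth = 1 * R, it spans at most 3x3x3 = 27.
+const glm::vec3 minGridIndex = glm::floor((pos[index] - gridMin - glm::vec3(maxSearchDistance)) * inverseCellWidth);
+const glm::vec3 maxGridIndex = glm::floor((pos[index] - gridMin + glm::vec3(maxSearchDistance)) * inverseCellWidth);
+// ...clamp each component to [0, gridResolution - 1], then iterate
+// every cell from minGridIndex to maxGridIndex inclusive.
+```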
+
+---
+
+### Algorithms with Visualization On
+
+![](images/algorithm_visualization_gtx1060.PNG)
+
+---
+
+### GPU vs GPU Graphs
+
+![](images/naive_vs.PNG)
+
+![](images/uniform_vs.PNG)
+
+![](images/coherent_vs.PNG)
\ No newline at end of file
diff --git a/images/Graphs.docx b/images/Graphs.docx
new file mode 100644
index 0000000..3570c89
Binary files /dev/null and b/images/Graphs.docx differ
diff --git a/images/algorithm_gtx1060.PNG b/images/algorithm_gtx1060.PNG
new file mode 100644
index 0000000..e730275
Binary files /dev/null and b/images/algorithm_gtx1060.PNG differ
diff --git a/images/algorithm_gtx970.PNG b/images/algorithm_gtx970.PNG
new file mode 100644
index 0000000..e11efe8
Binary files /dev/null and b/images/algorithm_gtx970.PNG differ
diff --git a/images/algorithm_visualization_gtx1060.PNG b/images/algorithm_visualization_gtx1060.PNG
new file mode 100644
index 0000000..734d3af
Binary files /dev/null and b/images/algorithm_visualization_gtx1060.PNG differ
diff --git a/images/block_size_all.PNG b/images/block_size_all.PNG
new file mode 100644
index 0000000..a13ffc1
Binary files /dev/null and b/images/block_size_all.PNG differ
diff --git a/images/coherent_vs.PNG b/images/coherent_vs.PNG
new file mode 100644
index 0000000..7f759c1
Binary files /dev/null and b/images/coherent_vs.PNG differ
diff --git a/images/intro.gif b/images/intro.gif
new file mode 100644
index 0000000..fb2d3b3
Binary files /dev/null and b/images/intro.gif differ
diff --git a/images/naive_vs.PNG b/images/naive_vs.PNG
new file mode 100644
index 0000000..569aebc
Binary files /dev/null and b/images/naive_vs.PNG differ
diff --git a/images/neighbor_cell_test_gtx1060.PNG b/images/neighbor_cell_test_gtx1060.PNG
new file mode 100644
index 0000000..7a9c0e0
Binary files /dev/null and b/images/neighbor_cell_test_gtx1060.PNG differ
diff --git a/images/neighbor_cell_test_gtx970.PNG b/images/neighbor_cell_test_gtx970.PNG
new file mode 100644
index 0000000..980c548
Binary files /dev/null and b/images/neighbor_cell_test_gtx970.PNG differ
diff --git a/images/uniform_vs.PNG b/images/uniform_vs.PNG
new file mode 100644
index 0000000..e7dc5e9
Binary files /dev/null and b/images/uniform_vs.PNG differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..8c6ae00 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50 -rdc=true
     )
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..e7add7a 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -2,7 +2,11 @@
 #include <stdio.h>
 #include <cuda.h>
 #include <cmath>
+#include <device_launch_parameters.h>
 #include <glm/glm.hpp>
+
+#include <glm/gtx/norm.hpp>  // glm::distance2 / glm::length2
+
 #include "utilityCore.hpp"
 #include "kernel.h"
@@ -85,6 +89,9 @@ int *dev_gridCellEndIndices; // to this cell?
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3* dev_alignedPos;
+glm::vec3* dev_alignedVel1;
+glm::vec3* dev_alignedVel2;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -168,7 +175,32 @@ void Boids::initSimulation(int N) {
   gridMinimum.y -= halfGridWidth;
   gridMinimum.z -= halfGridWidth;
 
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+
+  cudaMalloc((void**)&dev_alignedPos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_alignedPos failed!");
+
+  cudaMalloc((void**)&dev_alignedVel1, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_alignedVel1 failed!");
+
+  cudaMalloc((void**)&dev_alignedVel2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_alignedVel2 failed!");
+
   cudaDeviceSynchronize();
 }
@@ -233,7 +265,59 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+
+  const glm::vec3 boidPosition = pos[iSelf];
+
+  glm::vec3 perceivedCenter = glm::vec3(0);   // Rule 1 accumulator
+  glm::vec3 separation = glm::vec3(0);        // Rule 2 accumulator
+  glm::vec3 perceivedVelocity = glm::vec3(0); // Rule 3 accumulator
+  glm::vec3 velocityChange = glm::vec3(0);
+
+  int rule1Count = 0;
+  int rule3Count = 0;
+
+  // Accumulate all three rules in a single pass over the other boids.
+  for (int idx = 0; idx < N; ++idx)
+  {
+    if (idx == iSelf)
+    {
+      continue;
+    }
+
+    glm::vec3 targetPos = pos[idx];
+
+    if (glm::distance2(boidPosition, targetPos) <= rule1Distance * rule1Distance)
+    {
+      perceivedCenter += targetPos;
+      ++rule1Count;
+    }
+
+    if (glm::distance2(boidPosition, targetPos) <= rule2Distance * rule2Distance)
+    {
+      separation -= (targetPos - boidPosition);
+    }
+
+    if (glm::distance2(boidPosition, targetPos) <= rule3Distance * rule3Distance)
+    {
+      perceivedVelocity += vel[idx];
+      ++rule3Count;
+    }
+  }
+
+  if (rule1Count > 0) {
+    perceivedCenter = perceivedCenter / float(rule1Count);
+    velocityChange = (perceivedCenter - boidPosition) * rule1Scale;
+  }
+
+  velocityChange += (separation * rule2Scale);
+
+  if (rule3Count > 0) {
+    perceivedVelocity = perceivedVelocity / float(rule3Count);
+    velocityChange += (perceivedVelocity * rule3Scale);
+  }
+
+  return velocityChange;
 }
 
 /**
@@ -245,6 +329,20 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   // Compute a new velocity based on pos and vel1
   // Clamp the speed
   // Record the new velocity into vel2. Question: why NOT vel1?
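+  //
+  // Answer: other threads are still reading vel1 for this step's neighbor
+  // checks, so writing vel1 in place would be a read/write race. We write
+  // the result to vel2 and ping-pong the two buffers after the kernel.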
+ + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + const glm::vec3 velocityDelta = computeVelocityChange(N, index, pos, vel1); + vel2[index] = vel1[index] + velocityDelta; + + // Clamp Final Velocity + if (glm::length2(vel2[index]) > maxSpeed * maxSpeed) + { + vel2[index] = glm::normalize(vel2[index]) * maxSpeed; + } } /** @@ -289,6 +387,15 @@ __global__ void kernComputeIndices(int N, int gridResolution, // - Label each boid with the index of its grid cell. // - Set up a parallel array of integer indices as pointers to the actual // boid data in pos and vel1/vel2 + + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + indices[index] = index; + const glm::vec3 gridIndex = glm::floor((pos[index] - gridMin) * inverseCellWidth); + gridIndices[index] = gridIndex3Dto1D(gridIndex.x, gridIndex.y, gridIndex.z, gridResolution); } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -300,12 +407,53 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { } } +__global__ void kernReshuffleParticleData(int N, glm::vec3* pos, glm::vec3* vel1, glm::vec3* vel2, int* indices, glm::vec3* alignedPos, glm::vec3* alignedVel1, glm::vec3* alignedVel2) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + + // Index to move data from + const int sourceIndex = indices[index]; + + // Copy over data + alignedPos[index] = pos[sourceIndex]; + alignedVel1[index] = vel1[sourceIndex]; + alignedVel2[index] = vel2[sourceIndex]; +} + __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, int *gridCellStartIndices, int *gridCellEndIndices) { // TODO-2.1 // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + const int currentGrid = particleGridIndices[index]; + + if (index + 1 == N) + { + gridCellEndIndices[currentGrid] = index; + return; + } + + if (index == 0) + { + gridCellStartIndices[currentGrid] = 0; + } + + const int nextGrid = particleGridIndices[index + 1]; + + if (currentGrid != nextGrid) + { + gridCellEndIndices[currentGrid] = index; + gridCellStartIndices[nextGrid] = index + 1; + } } __global__ void kernUpdateVelNeighborSearchScattered( @@ -322,6 +470,99 @@ __global__ void kernUpdateVelNeighborSearchScattered( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. 
// - Clamp the speed change before putting the new speed in vel2 + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + const float maxSearchDistance = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance); + + const glm::vec3 minGridIndex = glm::floor((pos[index] - gridMin - glm::vec3(maxSearchDistance)) * inverseCellWidth); + const glm::vec3 maxGridIndex = glm::floor((pos[index] - gridMin + glm::vec3(maxSearchDistance)) * inverseCellWidth); + + const int minX = imax(0, int(minGridIndex.x)); + const int minY = imax(0, int(minGridIndex.y)); + const int minZ = imax(0, int(minGridIndex.z)); + + const int maxX = imin(gridResolution - 1, int(maxGridIndex.x)); + const int maxY = imin(gridResolution - 1, int(maxGridIndex.y)); + const int maxZ = imin(gridResolution - 1, int(maxGridIndex.z)); + + glm::vec3 perceivedCenter = glm::vec3(0); + + const glm::vec3 boidPosition = pos[index]; + + glm::vec3 tempVectorRule2 = glm::vec3(0); + glm::vec3 tempVectorRule3 = glm::vec3(0); + glm::vec3 velocityDelta = glm::vec3(0); + + int rule1Count = 0; + int rule3Count = 0; + + for (int gridX = minX; gridX <= maxX; ++gridX) + { + for (int gridY = minY; gridY <= maxY; ++gridY) + { + for (int gridZ = minZ; gridZ <= maxZ; ++gridZ) + { + const int gridId = gridIndex3Dto1D(gridX, gridY, gridZ, gridResolution); + const int gridStart = gridCellStartIndices[gridId]; + const int gridEnd = gridCellEndIndices[gridId]; + + if (gridStart == -1 || gridEnd == -1) + { + continue; + } + + for (int cellIdx = gridStart; cellIdx <= gridEnd; ++cellIdx) + { + const int dataIdx = particleArrayIndices[cellIdx]; + + if (dataIdx == index) + { + continue; + } + + if (glm::distance2(boidPosition, pos[dataIdx]) <= rule1Distance * rule1Distance) + { + perceivedCenter += pos[dataIdx]; + ++rule1Count; + } + + if (glm::distance2(boidPosition, pos[dataIdx]) <= rule2Distance * rule2Distance) + { + tempVectorRule2 -= (pos[dataIdx] - boidPosition); + } + + if (glm::distance2(boidPosition, pos[dataIdx]) <= rule3Distance * rule3Distance) + { + tempVectorRule3 += vel1[dataIdx]; + ++rule3Count; + } + } + } + } + } + + if (rule1Count > 0) { + perceivedCenter = perceivedCenter / float(rule1Count); + velocityDelta = (perceivedCenter - boidPosition) * rule1Scale; + } + + velocityDelta += (tempVectorRule2 * rule2Scale); + + if (rule3Count > 0) { + tempVectorRule3 = tempVectorRule3 / float(rule3Count); + velocityDelta += (tempVectorRule3 * rule3Scale); + } + + vel2[index] = vel1[index] + velocityDelta; + + // Clamp + if (glm::length2(vel2[index]) > maxSpeed * maxSpeed) + { + vel2[index] = glm::normalize(vel2[index]) * maxSpeed; + } } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -341,6 +582,97 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. 
// - Clamp the speed change before putting the new speed in vel2 + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + const float maxSearchDistance = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance); + + const glm::vec3 minGridIndex = glm::floor((pos[index] - gridMin - glm::vec3(maxSearchDistance)) * inverseCellWidth); + const glm::vec3 maxGridIndex = glm::floor((pos[index] - gridMin + glm::vec3(maxSearchDistance)) * inverseCellWidth); + + const int minX = imax(0, int(minGridIndex.x)); + const int minY = imax(0, int(minGridIndex.y)); + const int minZ = imax(0, int(minGridIndex.z)); + + const int maxX = imin(gridResolution - 1, int(maxGridIndex.x)); + const int maxY = imin(gridResolution - 1, int(maxGridIndex.y)); + const int maxZ = imin(gridResolution - 1, int(maxGridIndex.z)); + + glm::vec3 perceivedCenter = glm::vec3(0); + + const glm::vec3 boidPosition = pos[index]; + + glm::vec3 tempVectorRule2 = glm::vec3(0); + glm::vec3 tempVectorRule3 = glm::vec3(0); + glm::vec3 velocityDelta = glm::vec3(0); + + int rule1Count = 0; + int rule3Count = 0; + + for (int gridZ = minZ; gridZ <= maxZ; ++gridZ) + { + for (int gridY = minY; gridY <= maxY; ++gridY) + { + for (int gridX = minX; gridX <= maxX; ++gridX) + { + const int gridId = gridIndex3Dto1D(gridX, gridY, gridZ, gridResolution); + const int gridStart = gridCellStartIndices[gridId]; + const int gridEnd = gridCellEndIndices[gridId]; + + if (gridStart == -1 || gridEnd == -1) + { + continue; + } + + for (int dataIdx = gridStart; dataIdx <= gridEnd; ++dataIdx) + { + if (dataIdx == index) + { + continue; + } + + if (glm::distance2(boidPosition, pos[dataIdx]) <= rule1Distance * rule1Distance) + { + perceivedCenter += pos[dataIdx]; + ++rule1Count; + } + + if (glm::distance2(boidPosition, pos[dataIdx]) <= rule2Distance * rule2Distance) + { + tempVectorRule2 -= (pos[dataIdx] - boidPosition); + } + + if (glm::distance2(boidPosition, pos[dataIdx]) <= rule3Distance * rule3Distance) + { + tempVectorRule3 += vel1[dataIdx]; + ++rule3Count; + } + } + } + } + } + + if (rule1Count > 0) { + perceivedCenter = perceivedCenter / float(rule1Count); + velocityDelta = (perceivedCenter - boidPosition) * rule1Scale; + } + + velocityDelta += (tempVectorRule2 * rule2Scale); + + if (rule3Count > 0) { + tempVectorRule3 = tempVectorRule3 / float(rule3Count); + velocityDelta += (tempVectorRule3 * rule3Scale); + } + + vel2[index] = vel1[index] + velocityDelta; + + // Clamp + if (glm::length2(vel2[index]) > maxSpeed * maxSpeed) + { + vel2[index] = glm::normalize(vel2[index]) * maxSpeed; + } } /** @@ -349,6 +681,18 @@ __global__ void kernUpdateVelNeighborSearchCoherent( void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. 
  // TODO-1.2 ping-pong the velocity buffers
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!");
+
+  // Swap Buffers
+  glm::vec3* temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +708,35 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 blocksForGridCells((gridCellCount + blockSize - 1) / blockSize);
+
+  kernResetIntBuffer<<<blocksForGridCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<blocksForGridCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_particleArrayIndices,
+    dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+
+  // Swap Buffers
+  glm::vec3* temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +755,41 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
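+  //
+  // Difference from the scattered version: after sorting, pos/vel1/vel2 are
+  // reshuffled into cell order (the dev_aligned* buffers) and copied back,
+  // so the neighbor-search kernel reads boid data contiguously. The velocity
+  // ping-pong itself is unchanged.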
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 blocksForGridCells((gridCellCount + blockSize - 1) / blockSize);
+
+  kernResetIntBuffer<<<blocksForGridCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<blocksForGridCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  kernReshuffleParticleData<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2, dev_particleArrayIndices, dev_alignedPos, dev_alignedVel1, dev_alignedVel2);
+
+  cudaMemcpy(dev_pos, dev_alignedPos, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+  cudaMemcpy(dev_vel1, dev_alignedVel1, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+  cudaMemcpy(dev_vel2, dev_alignedVel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!");
+
+  // Swap Buffers
+  glm::vec3* temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
 }
 
 void Boids::endSimulation() {
@@ -389,6 +797,15 @@ void Boids::endSimulation() {
   cudaFree(dev_vel2);
   cudaFree(dev_pos);
 
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  cudaFree(dev_alignedPos);
+  cudaFree(dev_alignedVel1);
+  cudaFree(dev_alignedVel2);
+
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
 }
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..c7872c8 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -12,13 +12,17 @@
 // Configuration
 // ================
 
+double finalTimePerFrame = 0.0;
+
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
+
+#define RUN_TIMED_PERF
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 20000;
 const float DT = 0.2f;
 
 /**
@@ -30,6 +34,9 @@ int main(int argc, char* argv[]) {
   if (init(argc, argv)) {
     mainLoop();
     Boids::endSimulation();
+#ifdef RUN_TIMED_PERF
+    getchar();
+#endif
     return 0;
   } else {
     return 1;
@@ -39,7 +46,6 @@ int main(int argc, char* argv[]) {
 //-------------------------------
 //---------RUNTIME STUFF---------
 //-------------------------------
-
 std::string deviceName;
 GLFWwindow *window;
 
@@ -215,8 +221,15 @@ void initShaders(GLuint * program) {
 void mainLoop() {
   double fps = 0;
   double timebase = 0;
+  double timePerFrame = 0;
   int frame = 0;
 
+  const uint32_t maxPerfTicks = 30;
+  const uint32_t skipTicks = 5;
+  uint32_t perfTicks = 0;
+
+  double sumTimePerFrame = 0.0;
+
   Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                      // your CUDA development setup is ready to go.
@@ -230,6 +243,25 @@ void initShaders(GLuint * program) { fps = frame / (time - timebase); timebase = time; frame = 0; + timePerFrame = (1000.0 / fps); + +#ifdef RUN_TIMED_PERF + if (perfTicks > skipTicks) { + if (perfTicks < maxPerfTicks) + { + std::cout << "Recording Time Per Frame: " << timePerFrame << '\n'; + sumTimePerFrame += timePerFrame; + } + else + { + finalTimePerFrame = sumTimePerFrame / (maxPerfTicks - skipTicks - 1); + std::cout << "Final Recorded Time Per Frame (avg): " << finalTimePerFrame << '\n'; + break; + } + } + + ++perfTicks; +#endif } runCUDA(); @@ -238,7 +270,11 @@ void initShaders(GLuint * program) { ss << "["; ss.precision(1); ss << std::fixed << fps; - ss << " fps] " << deviceName; + ss << " fps] "; + ss << "["; + ss.precision(4); + ss << std::fixed << timePerFrame; + ss << " ms] " << deviceName; glfwSetWindowTitle(window, ss.str().c_str()); glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); diff --git a/stats_gtx_1060.txt b/stats_gtx_1060.txt new file mode 100644 index 0000000..0cfc0c3 --- /dev/null +++ b/stats_gtx_1060.txt @@ -0,0 +1,78 @@ +Boids Increase Test: + +Naive: +5000: 1.73356 +50000: 63.0121 +100000: 217.913 +500000: Unspecified Launch Failure +1000000: Unspecified Launch Failure +5000000: Unspecified Launch Failure + +Uniform Grid: +5000: 1.69154 +50000: 3.09724 +100000: 8.21635 +500000: 293.805 +1000000: 1399.34 +5000000: Unspecified Launch Failure + +Coherent Uniform Grid: +5000: 1.79578 +50000: 2.10838 +100000: 2.12726 +500000: 15.647 +1000000: 54.7602 +5000000: 1282.11 + +XYZ Ordering Perf: +5000: 1.85356 +50000: 2.04586 +100000: 2.13499 +500000: 15.7476 +1000000: 55.526 +5000000: 1324.72 + +Block Size Test: +100000 boids, Coherent +4: 8.32634 +8: 5.14231 +16: 3.33034 +32: 2.3088 +64: 2.10187 +128: 2.10287 +256: 2.10359 +512: 2.1192 +1024: 2.40032 + +27 Neighbor Test: +Coherent +5000: 1.7994 +50000: 1.99957 +100000: 1.45459 +500000: 7.48839 +1000000: 21.0801 +5000000: 366.668 + +Naive with vis: +5000: 3.07527 +50000: 66.1877 +100000: 221.937 +500000: +1000000: +5000000: + +Uniform Grid with vis: +5000: 2.89332 +50000: 4.78618 +100000: 10.1837 +500000: 296.752 +1000000: 1403.98 +5000000: Launch Failure + +Coherent with vis: +5000: 2.94602 +50000: 3.12959 +100000: 3.59659 +500000: 17.8983 +1000000: 57.9658 +5000000: 1294.47 \ No newline at end of file diff --git a/stats_gtx_970.txt b/stats_gtx_970.txt new file mode 100644 index 0000000..cc7319a --- /dev/null +++ b/stats_gtx_970.txt @@ -0,0 +1,47 @@ +Boids Increase Test: + +Naive: +5000: 2.13995 +50000: 54.5064 +100000: 218.609 +500000: Unspecified Launch Failure +1000000: Unspecified Launch Failure +5000000: Unspecified Launch Failure + +Uniform Grid: +5000: 1.07539 +50000: 3.36901 +100000: 7.32578 +500000: 332.125 +1000000: 2566.21 +5000000: Unspecified Launch Failure + +Coherent Grid: +5000: 1.26449 +50000: 1.45266 +100000: 2.27812 +500000: 16.778 +1000000: 59.6526 +5000000: 2395.84 + +Block Size Test: +100000 boids, Coherent +4: 8.75682 +8: 5.3235 +16: 3.34341 +32: 2.33586 +64: 2.2251 +128: 2.25017 +256: 2.25511 +512: 2.27528 +1024: 2.44437 + +27 Neighbor Test: +Coherent: +5000: 1.14823 +50000: 1.17393 +100000: 1.62882 +500000: 7.63463 +1000000: 20.5757 +5000000: 345.515 +