diff --git a/README.md b/README.md index d63a6a1..9f2bb49 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,59 @@ **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +## Flocking with CUDA on the GPU +### Connie Chang + * [LinkedIn](https://linkedin.com/in/conniechang44), [Demo Reel](https://vimeo.com/ConChang/DemoReel) +* Tested on: Windows 10, Intel Xeon CPU E5-1630 v4 @ 3.70 GHz, GTX 1070 8GB (SIG Lab) -### (TODO: Your README) +Introduction +------------- +This project consists of three flocking algorithms and a performance analysis. Everything is written in CUDA for the GPU. I was responsible for implementing the flocking rules, the neighborhood search, and the CUDA kernel invocations. -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +10,000 particles flocking together +![](images/10K.gif) + +Screenshot of 5,000 particles +![](images/Screenshot.PNG) + +The three algorithms are: +* Naive: Checking every other particle to find close neighbors. +* Uniform Grid: Breaking the space into voxels and only checking nearby voxels. +* Coherent Uniform Grid: The same as Uniform Grid, but with the particle data reshuffled so that boids in the same cell are contiguous in memory. + +Performance Graphs +------------------ +A graph comparing the performance of each algorithm. +![](images/Default_5000Boids_128BlockSize.png) + + +A graph comparing performance as the number of particles (boids) changes. +![](images/NumberBoidsComparison.png) + + +A graph comparing performance across different numbers of threads per block. +![](images/BlockSizeComparison.png) + + +A graph comparing performance for different uniform grid cell widths, relative to the neighbor search distance. +![](images/CellWidthComparison.png) + +Performance Questions and Answers +----------------------- +*For each implementation, how does changing the number of boids affect performance? Why do you think this is?* + +**Naive**: Increasing the number of boids slows down the simulation. This makes sense because every boid checks every other boid, so the work grows quadratically with the boid count. +**Uniform Grid**: Surprisingly, performance gets better as the number of boids increases. My guess is that a larger boid count keeps the GPU better occupied, with more threads in flight to hide memory latency, and at these counts that outweighs the extra work per boid. +**Coherent**: Similar to the uniform grid, performance improves as the number of boids increases. + +*For each implementation, how does changing the block count and block size affect performance? Why do you think this is?* + +For all implementations, changing the block size did not significantly affect performance. I think this is because we need far more than 1024 threads, the maximum block size, so the work is split across many blocks no matter which size we pick; the same total number of threads gets scheduled either way, and the blocks simply queue up on the hardware. + +*For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?* + +No, I did not see any performance improvement; in fact, the coherent uniform grid was sometimes slower. This was not the outcome I expected: I assumed that lining the boids up contiguously in memory would mean faster access to them. Seeing my results, my guess is that the extra work to reshuffle the position/velocity data every frame was too costly: it slowed each step down enough that the gain from contiguous memory access could not make up for it.
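For reference, the reshuffle in question is a single gather kernel, reproduced below from `kernReshuffleVec` in `src/kernel.cu`. The overhead it adds is two extra kernel launches per frame (one for positions, one for velocities) plus one extra gathered global read and write per boid for each buffer:

```cuda
__global__ void kernReshuffleVec(int N, int *order, glm::vec3 *copyFrom, glm::vec3 *copyTo) {
  int index = threadIdx.x + (blockIdx.x * blockDim.x);
  if (index >= N) {
    return;
  }
  // Gather: slot i of the cell-sorted ordering pulls its data from the boid's original slot.
  copyTo[index] = copyFrom[order[index]];
}
```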
+*Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!* + +No, changing the cell width did not noticeably affect performance. With the smaller cell width we check more neighboring cells (27 instead of 8), but each cell is smaller and holds fewer boids, so the total number of candidate boids examined is similar. In fact, for a search distance d, the 27 cells of width d cover (3d)³ = 27d³ of volume, while the 8 cells of width 2d cover (4d)³ = 64d³. In other words, a large cell has a higher chance of encompassing boids that are too far away to influence the current boid, while the small cells include fewer of them. diff --git a/images/10K.gif b/images/10K.gif new file mode 100644 index 0000000..198028c Binary files /dev/null and b/images/10K.gif differ diff --git a/images/5K.gif b/images/5K.gif new file mode 100644 index 0000000..2139cb3 Binary files /dev/null and b/images/5K.gif differ diff --git a/images/BlockSizeComparison.png b/images/BlockSizeComparison.png new file mode 100644 index 0000000..6f225ab Binary files /dev/null and b/images/BlockSizeComparison.png differ diff --git a/images/CellWidthComparison.png b/images/CellWidthComparison.png new file mode 100644 index 0000000..ce99815 Binary files /dev/null and b/images/CellWidthComparison.png differ diff --git a/images/Default_5000Boids_128BlockSize.png b/images/Default_5000Boids_128BlockSize.png new file mode 100644 index 0000000..3b60ea9 Binary files /dev/null and b/images/Default_5000Boids_128BlockSize.png differ diff --git a/images/NumberBoidsComparison.png b/images/NumberBoidsComparison.png new file mode 100644 index 0000000..705deca Binary files /dev/null and b/images/NumberBoidsComparison.png differ diff --git a/images/Screenshot.PNG b/images/Screenshot.PNG new file mode 100644 index 0000000..33667b7 Binary files /dev/null and b/images/Screenshot.PNG differ diff --git a/performance/output_10000boids_128blocks.txt b/performance/output_10000boids_128blocks.txt new file mode 100644 index 0000000..ebd4f4d --- /dev/null +++ b/performance/output_10000boids_128blocks.txt @@ -0,0 +1,21 @@ +Naive, VIS +Average fps: 184.358 + +Naive, NO VIS +Average fps: 233.616 + +Uniform Grid, VIS +Average fps: 604.607 + +Uniform Grid, NO VIS +Average fps: 1264.15 + +Uniform Grid, Coherent, VIS +Average fps: 562.452 + +Uniform Grid, Coherent, NO VIS +Average fps: 1261.27 + +Uniform Grid, Coherent, VIS +Average fps: 584.581 + diff --git a/performance/output_2500boids_128blocks.txt b/performance/output_2500boids_128blocks.txt new file mode 100644 index 0000000..5f7bbce --- /dev/null +++ b/performance/output_2500boids_128blocks.txt @@ -0,0 +1,18 @@ +Naive, VIS +Average fps: 477.27 + +Naive, NO VIS +Average fps: 1014.99 + +Uniform Grid, VIS +Average fps: 438.218 + +Uniform Grid, NO VIS +Average fps: 798.167 + +Uniform Grid, Coherent, VIS +Average fps: 440.226 + +Uniform Grid, Coherent, NO VIS +Average fps: 763.431 + diff --git a/performance/output_5000boids_1024blocks.txt b/performance/output_5000boids_1024blocks.txt new file mode 100644 index 0000000..49f3e8f --- /dev/null +++ b/performance/output_5000boids_1024blocks.txt @@ -0,0 +1,6 @@ +Naive, VIS +Average fps: 305.67 + +Uniform Grid, Coherent, VIS +Average fps: 454.431 + diff --git a/performance/output_5000boids_128block.txt b/performance/output_5000boids_128block.txt new file mode 100644 index 0000000..91c8097 --- /dev/null +++ b/performance/output_5000boids_128block.txt @@ -0,0 +1,54 @@ +Naive, VIS +Average fps: 357.328 + 
+Naive, NO VIS +Average fps: 554.034 + +Uniform Grid, VIS +Average fps: 440.004 + +Uniform Grid, NO VIS +Average fps: 801.031 + +Uniform Grid, Coherent, VIS +Average fps: 430.916 + +Uniform Grid, Coherent, NO VIS +Average fps: 759.633 + +Naive, VIS +Average fps: 340.335 + +Naive, NO VIS +Average fps: 550.725 + +Uniform Grid, VIS +Average fps: 428.152 + +Uniform Grid, NO VIS +Average fps: 813.542 + +Uniform Grid, Coherent, VIS +Average fps: 476.203 + +Uniform Grid, Coherent, NO VIS +Average fps: 763.405 + +Naive, VIS +Average fps: 342.638 + +Naive, NO VIS +Average fps: 547.646 + +Uniform Grid, VIS +Average fps: 435.811 + +Uniform Grid, NO VIS +Average fps: 761.638 + +Uniform Grid, Coherent, VIS +Average fps: 448.705 + +Uniform Grid, Coherent, NO VIS +Average fps: 779.181 + diff --git a/performance/output_5000boids_128blocks_1cellWidth.txt b/performance/output_5000boids_128blocks_1cellWidth.txt new file mode 100644 index 0000000..4550b49 --- /dev/null +++ b/performance/output_5000boids_128blocks_1cellWidth.txt @@ -0,0 +1,18 @@ +Naive, VIS +Average fps: 345.707 + +Uniform Grid, VIS +Average fps: 446.372 + +Uniform Grid, Coherent, VIS +Average fps: 449.218 + +Naive, NO VIS +Average fps: 545.644 + +Uniform Grid, NO VIS +Average fps: 747.858 + +Uniform Grid, Coherent, NO VIS +Average fps: 781.189 + diff --git a/performance/output_5000boids_128blocks_take2.txt b/performance/output_5000boids_128blocks_take2.txt new file mode 100644 index 0000000..53bd3c7 --- /dev/null +++ b/performance/output_5000boids_128blocks_take2.txt @@ -0,0 +1,24 @@ +Naive, VIS +Average fps: 337.955 + +Uniform Grid, VIS +Average fps: 460.023 + +Uniform Grid, Coherent, VIS +Average fps: 424.042 + +Naive, NO VIS +Average fps: 543.884 + +Uniform Grid, NO VIS +Average fps: 860.128 + +Uniform Grid, Coherent, NO VIS +Average fps: 786.732 + +Uniform Grid, NO VIS +Average fps: 831.391 + +Uniform Grid, Coherent, VIS +Average fps: 430.894 + diff --git a/performance/output_5000boids_256blocks.txt b/performance/output_5000boids_256blocks.txt new file mode 100644 index 0000000..47e04d7 --- /dev/null +++ b/performance/output_5000boids_256blocks.txt @@ -0,0 +1,18 @@ +Naive, NO VIS +Average fps: 539.576 + +Uniform Grid, NO VIS +Average fps: 714.914 + +Uniform Grid, Coherent, NO VIS +Average fps: 797.649 + +Naive, VIS +Average fps: 354.235 + +Uniform Grid, VIS +Average fps: 440.193 + +Uniform Grid, Coherent, VIS +Average fps: 444.447 + diff --git a/performance/output_5000boids_64blocks.txt b/performance/output_5000boids_64blocks.txt new file mode 100644 index 0000000..a775e0d --- /dev/null +++ b/performance/output_5000boids_64blocks.txt @@ -0,0 +1,18 @@ +Naive, NO VIS +Average fps: 541.066 + +Uniform Grid, NO VIS +Average fps: 790.12 + +Uniform Grid, Coherent, NO VIS +Average fps: 794.772 + +Naive, VIS +Average fps: 356.021 + +Uniform Grid, VIS +Average fps: 437.014 + +Uniform Grid, Coherent, VIS +Average fps: 451.163 + diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..74952fa 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -85,6 +85,12 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. 
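// Buffers added for 2.3: per-frame copies of pos and vel1 gathered into grid-cell order
// (filled by kernReshuffleVec each step), so the coherent neighbor search reads contiguous memory.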
+glm::vec3 *dev_posCoherent; +glm::vec3 *dev_velCoherent; +//thrust::device_ptr<glm::vec3> dev_thrust_particleGridIndicesPos; +//thrust::device_ptr<glm::vec3> dev_thrust_particleGridIndicesVel; +//thrust::device_ptr<glm::vec3> dev_thrust_particlePos; +//thrust::device_ptr<glm::vec3> dev_thrust_particleVel; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -169,6 +175,35 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + + // 2.1 Mallocs + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices); + dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices); + + // 2.3 Mallocs + cudaMalloc((void**)&dev_posCoherent, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_posCoherent failed!"); + + cudaMalloc((void**)&dev_velCoherent, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_velCoherent failed!"); + + //dev_thrust_particleGridIndicesPos = thrust::device_ptr<glm::vec3>(dev_particleGridIndicesPos); + //dev_thrust_particleGridIndicesVel = thrust::device_ptr<glm::vec3>(dev_particleGridIndicesVel); + //dev_thrust_particlePos = thrust::device_ptr<glm::vec3>(dev_pos); + //dev_thrust_particleVel = thrust::device_ptr<glm::vec3>(dev_vel1); + cudaDeviceSynchronize(); } @@ -230,10 +265,52 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * in the `pos` and `vel` arrays.
*/ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + + glm::vec3 delta(0.0f, 0.0f, 0.0f); + + glm::vec3 cm(0.0f, 0.0f, 0.0f); + glm::vec3 c(0.0f, 0.0f, 0.0f); + glm::vec3 perceived_velocity(0.0f, 0.0f, 0.0f); + unsigned int n_1 = 0; + //unsigned int n_2 = 0; + unsigned int n_3 = 0; + + for (unsigned int i = 0; i < N; i++) { + if (i != iSelf) { + float dist = glm::distance(pos[i], pos[iSelf]); + + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (dist < rule1Distance) { + cm += pos[i]; + n_1++; + } + // Rule 2: boids try to stay a distance d away from each other + if (dist < rule2Distance) { + c -= (pos[i] - pos[iSelf]); + //n_2++; + } + // Rule 3: boids try to match the speed of surrounding boids + if (dist < rule3Distance) { + perceived_velocity += vel[i]; + n_3++; + } + } + } + + // Rule 1 + if (n_1 != 0) { + cm /= n_1; + delta += (cm - pos[iSelf]) * rule1Scale; + } + // Rule 2 + delta += c * rule2Scale; + // Rule 3 + if (n_3 != 0) { + perceived_velocity /= n_3; + delta += perceived_velocity * rule3Scale; + } + + return delta; } /** @@ -242,9 +319,23 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po */ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // Compute a new velocity based on pos and vel1 + glm::vec3 vel = vel1[index] + computeVelocityChange(N, index, pos, vel1); + // Clamp the speed + float length = glm::length(vel); + if (length > maxSpeed) { + vel = (vel / length) * maxSpeed; + } + // Record the new velocity into vel2. Question: why NOT vel1? + vel2[index] = vel; } /** @@ -278,7 +369,11 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { // for(x) // for(y) // for(z)? Or some other order? +// Modified to check if cell is out of bounds. Return -1 if it is __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { + if (x < 0 || x >= gridResolution || y < 0 || y >= gridResolution || z < 0 || z >= gridResolution) { + return -1; + } return x + y * gridResolution + z * gridResolution * gridResolution; } @@ -286,9 +381,18 @@ __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) { // TODO-2.1 + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // - Label each boid with the index of its grid cell. + glm::vec3 iCell = (pos[index] - gridMin) * inverseCellWidth; + int iGrid = gridIndex3Dto1D(iCell.x, iCell.y, iCell.z, gridResolution); + gridIndices[index] = iGrid; // - Set up a parallel array of integer indices as pointers to the actual // boid data in pos and vel1/vel2 + indices[index] = index; } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -306,6 +410,24 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" 
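// Illustrative example (made-up data): if the sorted particleGridIndices are [2, 2, 2, 5, 5, 7],
// then start[2] = 0, end[2] = 2; start[5] = 3, end[5] = 4; start[7] = end[7] = 5 (inclusive ranges).
// Cells that contain no boids keep the -1 written by kernResetIntBuffer.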
+ int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + if (index == 0) { + gridCellStartIndices[particleGridIndices[index]] = index; + return; + } + + if (index == N - 1) { + gridCellEndIndices[particleGridIndices[index]] = index; + } + + if (particleGridIndices[index] != particleGridIndices[index - 1]) { + gridCellStartIndices[particleGridIndices[index]] = index; + gridCellEndIndices[particleGridIndices[index - 1]] = index - 1; + } } __global__ void kernUpdateVelNeighborSearchScattered( @@ -316,12 +438,100 @@ __global__ void kernUpdateVelNeighborSearchScattered( glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce // the number of boids that need to be checked. + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } // - Identify the grid cell that this particle is in // - Identify which cells may contain neighbors. This isn't always 8. + glm::vec3 iPos = pos[particleArrayIndices[index]] - gridMin; + + float neighborDistance = imax(imax(rule1Distance, rule2Distance), rule3Distance); + + glm::vec3 maxPos = glm::vec3(iPos.x + neighborDistance, iPos.y + neighborDistance, iPos.z + neighborDistance); + glm::vec3 minPos = glm::vec3(iPos.x - neighborDistance, iPos.y - neighborDistance, iPos.z - neighborDistance); + + maxPos = glm::clamp(maxPos, 0.f, 2 * scene_scale); + minPos = glm::clamp(minPos, 0.f, 2 * scene_scale); + + glm::vec3 maxCell = maxPos * inverseCellWidth; + glm::vec3 minCell = minPos * inverseCellWidth; + // - For each cell, read the start/end indices in the boid pointer array. // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. 
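// minCell/maxCell span the cells overlapping the axis-aligned bounding box of the neighborhood
// sphere; the loop variables below truncate them to integer cell coordinates, and
// gridIndex3Dto1D() returns -1 for any coordinate that still falls outside the grid.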
+ glm::vec3 cm; + glm::vec3 c; + glm::vec3 perceived_velocity; + unsigned int n_1 = 0; + unsigned int n_3 = 0; + + for (unsigned int i = minCell.x; i <= maxCell.x; i++) { + for (unsigned int j = minCell.y; j <= maxCell.y; j++) { + for (unsigned int k = minCell.z; k <= maxCell.z; k++) { + int cellIndex = gridIndex3Dto1D(i, j, k, gridResolution); + + if (cellIndex > -1 && gridCellStartIndices[cellIndex] > -1) { + for (unsigned int iter = gridCellStartIndices[cellIndex]; iter <= gridCellEndIndices[cellIndex]; iter++) { + int otherBoidIndex = particleArrayIndices[iter]; + + if (particleArrayIndices[index] != otherBoidIndex) { + + // Calculate velocity delta + float dist = glm::distance(pos[particleArrayIndices[index]], pos[otherBoidIndex]); + + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (dist < rule1Distance) { + cm += pos[otherBoidIndex]; + n_1++; + } + // Rule 2: boids try to stay a distance d away from each other + if (dist < rule2Distance) { + c -= (pos[otherBoidIndex] - pos[particleArrayIndices[index]]); + } + // Rule 3: boids try to match the speed of surrounding boids + if (dist < rule3Distance) { + perceived_velocity += vel1[otherBoidIndex]; + n_3++; + } + } + } + } + } + } + } + + glm::vec3 delta; + + // Rule 1 + if (n_1 != 0) { + cm /= n_1; + delta += (cm - pos[particleArrayIndices[index]]) * rule1Scale; + } + // Rule 2 + delta += c * rule2Scale; + // Rule 3 + if (n_3 != 0) { + perceived_velocity /= n_3; + delta += perceived_velocity * rule3Scale; + } + // - Clamp the speed change before putting the new speed in vel2 + glm::vec3 nVel = vel1[particleArrayIndices[index]] + delta; + float length = glm::length(nVel); + if (length > maxSpeed) { + nVel = (nVel / length) * maxSpeed; + } + + vel2[particleArrayIndices[index]] = nVel; +} + +__global__ void kernReshuffleVec(int N, int *order, glm::vec3 *copyFrom, glm::vec3 *copyTo) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + copyTo[index] = copyFrom[order[index]]; } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -333,14 +543,94 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // except with one less level of indirection. // This should expect gridCellStartIndices and gridCellEndIndices to refer // directly to pos and vel1. + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // - Identify the grid cell that this particle is in // - Identify which cells may contain neighbors. This isn't always 8. + glm::vec3 iPos = pos[index] - gridMin; + + float neighborDistance = imax(imax(rule1Distance, rule2Distance), rule3Distance); + + glm::vec3 maxPos = glm::vec3(iPos.x + neighborDistance, iPos.y + neighborDistance, iPos.z + neighborDistance); + glm::vec3 minPos = glm::vec3(iPos.x - neighborDistance, iPos.y - neighborDistance, iPos.z - neighborDistance); + + maxPos = glm::clamp(maxPos, 0.f, 2 * scene_scale); + minPos = glm::clamp(minPos, 0.f, 2 * scene_scale); + + glm::vec3 maxCell = maxPos * inverseCellWidth; + glm::vec3 minCell = minPos * inverseCellWidth; // - For each cell, read the start/end indices in the boid pointer array. // DIFFERENCE: For best results, consider what order the cells should be // checked in to maximize the memory benefits of reordering the boids data. // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. 
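// Cells are visited with z as the outer loop and x as the inner loop so that consecutive
// iterations hit consecutive 1D cell indices (index = x + y*res + z*res*res); since the boid
// data has been reshuffled into cell order, those reads are largely contiguous.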
+ glm::vec3 cm; + glm::vec3 c; + glm::vec3 perceived_velocity; + unsigned int n_1 = 0; + unsigned int n_3 = 0; + + for (unsigned int k = minCell.z; k <= maxCell.z; k++) { + for (unsigned int j = minCell.y; j <= maxCell.y; j++) { + for (unsigned int i = minCell.x; i <= maxCell.x; i++) { + int cellIndex = gridIndex3Dto1D(i, j, k, gridResolution); + + if (cellIndex > -1 && gridCellStartIndices[cellIndex] > -1) { + for (unsigned int otherBoidIndex = gridCellStartIndices[cellIndex]; + otherBoidIndex <= gridCellEndIndices[cellIndex]; otherBoidIndex++) { + + if (index != otherBoidIndex) { + + // Calculate velocity delta + float dist = glm::distance(pos[index], pos[otherBoidIndex]); + + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (dist < rule1Distance) { + cm += pos[otherBoidIndex]; + n_1++; + } + // Rule 2: boids try to stay a distance d away from each other + if (dist < rule2Distance) { + c -= (pos[otherBoidIndex] - pos[index]); + } + // Rule 3: boids try to match the speed of surrounding boids + if (dist < rule3Distance) { + perceived_velocity += vel1[otherBoidIndex]; + n_3++; + } + } + } + } + } + } + } + + glm::vec3 delta; + + // Rule 1 + if (n_1 != 0) { + cm /= n_1; + delta += (cm - pos[index]) * rule1Scale; + } + // Rule 2 + delta += c * rule2Scale; + // Rule 3 + if (n_3 != 0) { + perceived_velocity /= n_3; + delta += perceived_velocity * rule3Scale; + } + // - Clamp the speed change before putting the new speed in vel2 + glm::vec3 nVel = vel1[index] + delta; + float length = glm::length(nVel); + if (length > maxSpeed) { + nVel = (nVel / length) * maxSpeed; + } + + vel2[index] = nVel; } /** @@ -348,40 +638,117 @@ __global__ void kernUpdateVelNeighborSearchCoherent( */ void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2); + checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!"); + + kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2); + checkCUDAErrorWithLine("kernUpdatePos failed!"); + + // TODO-1.2 ping-pong the velocity buffers + glm::vec3 *temp_ptr = dev_vel1; + dev_vel1 = dev_vel2; + dev_vel2 = temp_ptr; } void Boids::stepSimulationScatteredGrid(float dt) { // TODO-2.1 // Uniform Grid Neighbor search using Thrust sort. + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // Reset the start/end markers for every grid cell (there are gridCellCount of them, which can exceed numObjects) + dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize); + kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1); + checkCUDAErrorWithLine("kernResetIntBuffer start indices failed!"); + kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1); + checkCUDAErrorWithLine("kernResetIntBuffer end indices failed!"); + // In Parallel: // - label each particle with its array index as well as its grid index. // Use 2x width grids. + kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, + gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + checkCUDAErrorWithLine("kernComputeIndices failed!"); + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you // are welcome to do a performance comparison.
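// thrust::sort_by_key sorts the cell indices (keys) and permutes the boid indices (values)
// along with them, so all boids belonging to the same cell become contiguous in dev_particleArrayIndices.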
+ thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); + // - Naively unroll the loop for finding the start and end indices of each // cell's data pointers in the array of boid indices + kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, + dev_gridCellStartIndices, dev_gridCellEndIndices); + checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!"); + // - Perform velocity updates using neighbor search + kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>( + numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, + dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, + dev_pos, dev_vel1, dev_vel2); + checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!"); + // - Update positions + kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2); + checkCUDAErrorWithLine("kernUpdatePos failed!"); + // - Ping-pong buffers as needed + glm::vec3 *temp_ptr = dev_vel1; + dev_vel1 = dev_vel2; + dev_vel2 = temp_ptr; } void Boids::stepSimulationCoherentGrid(float dt) { // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid // Uniform Grid Neighbor search using Thrust sort on cell-coherent data. + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // Reset the start/end markers for every grid cell (gridCellCount entries) + dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize); + kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1); + checkCUDAErrorWithLine("kernResetIntBuffer start indices failed!"); + kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1); + checkCUDAErrorWithLine("kernResetIntBuffer end indices failed!"); + // In Parallel: // - Label each particle with its array index as well as its grid index. // Use 2x width grids + kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, + gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + checkCUDAErrorWithLine("kernComputeIndices failed!"); + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you // are welcome to do a performance comparison. + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); + // - Naively unroll the loop for finding the start and end indices of each // cell's data pointers in the array of boid indices + kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, + dev_gridCellStartIndices, dev_gridCellEndIndices); + checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!"); + // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all // the particle data in the simulation array. // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED + kernReshuffleVec<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_pos, dev_posCoherent); + kernReshuffleVec<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_vel1, dev_velCoherent); + // - Perform velocity updates using neighbor search + kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>( + numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, + dev_gridCellStartIndices, dev_gridCellEndIndices, + dev_posCoherent, dev_velCoherent, dev_vel2); + checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!"); + // - Update positions + kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_posCoherent, dev_vel2); + checkCUDAErrorWithLine("kernUpdatePos failed!"); + // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
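// kernUpdatePos integrated the new positions into dev_posCoherent (not dev_pos), so dev_pos and
// dev_posCoherent are swapped below in addition to the usual vel1/vel2 ping-pong.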
+ glm::vec3 *temp_ptr1 = dev_vel1; + //glm::vec3 *temp_ptr2 = dev_vel2; + dev_vel1 = dev_vel2; + dev_vel2 = temp_ptr1; + + glm::vec3 *temp_ptr = dev_posCoherent; + dev_posCoherent = dev_pos; + dev_pos = temp_ptr; } void Boids::endSimulation() { @@ -390,6 +757,12 @@ void Boids::endSimulation() { cudaFree(dev_pos); // TODO-2.1 TODO-2.3 - Free any additional buffers here. + cudaFree(dev_gridCellEndIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_particleArrayIndices); + cudaFree(dev_posCoherent); + cudaFree(dev_velCoherent); } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..311ca03 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -14,11 +14,11 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 // LOOK-1.2 - change this to adjust particle count in the simulation -const int N_FOR_VIS = 5000; +const int N_FOR_VIS = 10000; const float DT = 0.2f; /** @@ -30,6 +30,30 @@ int main(int argc, char* argv[]) { if (init(argc, argv)) { mainLoop(); Boids::endSimulation(); + + std::string sim_type = ""; + if (!UNIFORM_GRID && !COHERENT_GRID) { + sim_type += "Naive, "; + } + if (UNIFORM_GRID) { + sim_type += "Uniform Grid, "; + } + if (COHERENT_GRID) { + sim_type += "Coherent, "; + } + + if (VISUALIZE) { + sim_type += "VIS"; + } + else { + sim_type += "NO VIS"; + } + std::string fileName = "output_10000boids_128blocks.txt"; + outputFile.open(fileName, std::ios::out | std::ios::app); + outputFile << sim_type << std::endl; + outputFile << "Average fps: " << fps_total / count << std::endl << std::endl; + outputFile.close(); + return 0; } else { return 1; @@ -230,6 +254,10 @@ void initShaders(GLuint * program) { fps = frame / (time - timebase); timebase = time; frame = 0; + + // Store variables to calculate average fps + fps_total += fps; + count++; } runCUDA(); diff --git a/src/main.hpp b/src/main.hpp index 88e9df7..40488fa 100644 --- a/src/main.hpp +++ b/src/main.hpp @@ -78,3 +78,10 @@ void runCUDA(); bool init(int argc, char **argv); void initVAO(); void initShaders(GLuint *program); + +//=================================== +// Performance Analysis +//=================================== +std::ofstream outputFile; //Output frame rate +float fps_total; +unsigned int count;