26 changes: 20 additions & 6 deletions README.md
@@ -1,11 +1,25 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Eric Chiu
* Tested on: Windows 10 Education, Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.60GHz 32GB, NVIDIA GeForce GTX 1070 (SIGLAB)

### (TODO: Your README)
## Result

Include screenshots, analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
![](./images/Boids.gif)

## Performance Analysis

![](./images/Boids-FPS-With-Visualization.png)

![](./images/Boids-FPS-Without-Visualization.png)

As we can see from the graphs, the frame rate generally decreases for all three implementations (naive, scattered uniform grid, and coherent uniform grid) as the number of boids increases. This is probably because a larger flock puts more boids within each boid's neighborhood distance, so more neighbors must be iterated over to compute a single boid's next position.
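
For reference, the sketch below illustrates what the naive approach does per boid; the kernel name, parameters, and the use of glm types are assumptions for illustration rather than the project's actual code. Every thread scans all N boids, so the inner loop grows linearly with the flock size.

```cuda
#include <glm/glm.hpp>

// Hypothetical naive kernel (cohesion rule only): each boid checks every
// other boid, so per-boid work is O(N) and total work is O(N^2).
__global__ void kernNaiveCohesionSketch(int N, const glm::vec3 *pos,
                                        glm::vec3 *vel, float neighborDist,
                                        float cohesionScale) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;

  glm::vec3 perceivedCenter(0.0f);
  int neighbors = 0;
  for (int j = 0; j < N; ++j) {   // every boid is a candidate neighbor
    if (j == i) continue;
    if (glm::distance(pos[i], pos[j]) < neighborDist) {
      perceivedCenter += pos[j];
      ++neighbors;
    }
  }
  if (neighbors > 0) {
    perceivedCenter /= (float)neighbors;
    vel[i] += (perceivedCenter - pos[i]) * cohesionScale;
  }
}
```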

There was a performance improvement from the scattered uniform grid to the coherent uniform grid, but only by a small margin (15% to 20%). I had expected the coherent grid to be at least twice as fast. After thinking about it further, I realized that cutting out the middleman (the extra index lookup) does make data access faster, but roughly the same number of operations is still needed to compute a single boid's next position. Given that, a 15% to 20% improvement makes sense.
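
To make the difference concrete, here is an illustrative device-side sketch; the identifiers are hypothetical, not the project's actual names. Both versions visit the same neighbors, but the scattered version pays an extra index lookup and a non-contiguous read for each one, while the loop count (and thus the arithmetic) stays the same.

```cuda
#include <glm/glm.hpp>

// Scattered uniform grid: each neighbor read goes through an index array
// that maps grid order back to the original boid order ("the middleman").
__device__ glm::vec3 perceivedCenterScattered(int cellStart, int cellEnd,
                                              const int *particleArrayIndices,
                                              const glm::vec3 *pos) {
  glm::vec3 center(0.0f);
  for (int k = cellStart; k < cellEnd; ++k) {
    int boidIdx = particleArrayIndices[k];  // extra indirection
    center += pos[boidIdx];                 // scattered, non-contiguous read
  }
  return center;
}

// Coherent uniform grid: pos was reshuffled into grid order beforehand,
// so neighbors in the same cell are read contiguously with no indirection.
__device__ glm::vec3 perceivedCenterCoherent(int cellStart, int cellEnd,
                                             const glm::vec3 *posSorted) {
  glm::vec3 center(0.0f);
  for (int k = cellStart; k < cellEnd; ++k) {
    center += posSorted[k];                 // contiguous read
  }
  return center;
}
```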

![](./images/Block-FPS-Without-Visualization.png)

When the block size increases from 16 to 32, the frame rate improves for all implementations: naive, scattered, and coherent. Increasing the block size further to 64, 128, and beyond barely affects performance. I suspect this is because the warp size is 32: a block of 16 threads leaves half of each warp idle, while blocks of 32 or more keep every warp fully populated.
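
As a rough illustration, block size only enters through the kernel launch configuration; `kernUpdateVelocity`, `dev_pos`, and `dev_vel` are placeholder names here, not the project's actual identifiers.

```cuda
// blockSize was the only knob varied in this experiment: 16, 32, 64, 128, ...
int blockSize = 128;
dim3 fullBlocksPerGrid((numBoids + blockSize - 1) / blockSize);  // ceil(N / blockSize)
kernUpdateVelocity<<<fullBlocksPerGrid, blockSize>>>(numBoids, dev_pos, dev_vel);
```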

Changing the cell width so that 27 rather than 8 neighboring cells are checked decreased performance for all implementations. I suspect this is because the more neighboring cells are checked, the more candidate boids must be tested against a boid's neighborhood distance.
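
The sketch below shows one way the cell range can be computed; the identifiers are illustrative, and it assumes the common setup where a cell width of twice the neighborhood distance yields the 8-cell check and a cell width equal to the neighborhood distance yields the 27-cell check. The search sphere then overlaps at most 2 cells per axis in the first case and up to 3 per axis in the second, so the 27-cell configuration walks more cells and therefore more candidate boids.

```cuda
#include <glm/glm.hpp>

__device__ int clampCellSketch(int v, int lo, int hi) {
  return v < lo ? lo : (v > hi ? hi : v);
}

// Hypothetical helper: walks every grid cell that the neighborhood sphere
// around position p can overlap.
__device__ void visitNeighborCellsSketch(glm::vec3 p, glm::vec3 gridMin,
                                         float cellWidth, float neighborDist,
                                         int gridResolution) {
  int lo[3], hi[3];
  for (int a = 0; a < 3; ++a) {
    lo[a] = clampCellSketch((int)floorf((p[a] - neighborDist - gridMin[a]) / cellWidth),
                            0, gridResolution - 1);
    hi[a] = clampCellSketch((int)floorf((p[a] + neighborDist - gridMin[a]) / cellWidth),
                            0, gridResolution - 1);
  }
  // cellWidth == 2 * neighborDist -> at most 2 cells per axis -> up to 8 cells
  // cellWidth ==     neighborDist -> at most 3 cells per axis -> up to 27 cells
  for (int z = lo[2]; z <= hi[2]; ++z)
    for (int y = lo[1]; y <= hi[1]; ++y)
      for (int x = lo[0]; x <= hi[0]; ++x) {
        // look up the boids stored in cell (x, y, z) and test each against
        // neighborDist before applying the flocking rules
      }
}
```
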
Binary file added images/Block-FPS-Without-Visualization.png
Binary file added images/Boids-FPS-With-Visualization.png
Binary file added images/Boids-FPS-Without-Visualization.png
Binary file added images/Boids.gif
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_61
)