diff --git a/README.md b/README.md
index 41b91f0..35d92b7 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,231 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
-
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-	* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Xiao Zhang
+  * [LinkedIn](https://www.linkedin.com/in/xiao-zhang-674bb8148/)
+* Tested on: Windows 10, i7-7700K @ 4.20GHz 16.0GB, GTX 1080 15.96GB (my own PC)
+
+Analysis 
+======================
+* When parallelize primitives, blocksize is set to 128 unchanged.
+
+* When parallelize pixels, blocksize is set to 25 x 25 unchanged.
+
+* For shared memory in tile based rasterizer, primitive bucket size is 256 unchanged.
+
+* Rendering time is measured in frame per second, higher is better.
+
+---
+
+## 0. First thing first
+
+### Scan line rasterizer will crash when mesh goes behind the camera but tile based rasterizer won't
+
+This is because the vertex behind the camera will be projected to the far end of the clip space as shown in the following picture. 
+
+![picture credit derhass@stackoverflow](img/0.JPG)
+
+###### picture credit [derhass@stackoverflow](https://stackoverflow.com/users/2327517/derhass)
+
+As a restul, the primitive will potentially occupy more pixels, which will result in longer looping time in scan line rasterizer. If the the execution time for a kernel is too long, GPU will throw a kernel launch failure error. There are two ways to avoid this. The standard way is to perform a depth clipping. But this requires a clipping stage and a clipping algorithm, which we do not have in our CUDA rasterizer(if you want to know more about clipping, check [this](https://stackoverflow.com/questions/41085117/why-does-gl-divide-gl-position-by-w-for-you-rather-than-letting-you-do-it-your) out). So I used the other option which is a tile based rasterizer. Tile based rasterizer solves this problem by parallelize the pixels. And since the size of the primitive bucket of each tile can be adjusted, we can actually control the execution time for the kernel to avoid crashing. 
+
+---
+
+## 1. Checkerboard Scene 
+
+### overview
+
+![](img/checkerboard_render.jpg)
+
+### rendering time
+
+![](img/checkerboard.JPG)
+
+### analysis
+
+In this scene, tile based rasterizer is better in all distance. This is because when we use scan line rasterizer, we parallelize the primitives and the primitives are so few in this scene (only two triangles) therefore we can not utilize parallel processing efficiently. Even worse, when the two triangles cover a lot of pixels, we are spending two much time in each thread looping through the pixels which will impact parallelism further. However, when we use tile based rasterizer, we parallelize the pixels. For each pixel, it only needs to worry about the primitive that overlaps the tile it belongs to. In this scene, there will be two in the worst scenario, which means the looping for each pixel will be very short and that means we are exploiting the most out of parallelism.
+
+### images
+
+* checkerboard far distance
+  
+![](img/checkerboard_far.JPG)
+
+* checkerboard mid distance
+  
+![](img/checkerboard_mid.JPG)
+
+* checkerboard near distance
+  
+![](img/checkerboard_near.JPG)
+
+---
+
+## 2. Box Scene 
+
+### overview
+
+![](img/box_render.jpg)
+
+### rendering time
+
+![](img/box.JPG)
+
+### analysis
+
+In the box scene, scan line rasterizer is doing better at far distance because each primitive takes less pixel and there are more primitives than in the checkerboard scene. But as the mesh moves closer to the camera, performance of tile based rasterizer remain the same whereas performance of scan line rasterizer is getting worse because the workload for each thread increases as the primitive covers more pixels.
+
+### images
+
+* box far distance
+  
+![](img/box_far.JPG)
+
+* box mid distance
+  
+![](img/box_mid.JPG)
+
+* box near distance
+  
+![](img/box_near.JPG)
+
+---
+
+## 3. Flower Scene 
+
+### overview
+
+![](img/flower_render.jpg)
+
+### rendering time
+
+![](img/flower.JPG)
+
+### analysis
+
+Just like the box scene, the performance of scan line rasterizer starts to get worse as the mesh moves closer to the camera whereas the performance of tile based rasterizer almost remains the same.
+
+### images
+
+* flower far distance
+  
+![](img/flower_far.JPG)
+
+* flower mid distance
+  
+![](img/flower_mid.JPG)
+
+* flower near distance
+  
+![](img/flower_near.JPG)
+
+---
+
+## 4. Duck Scene 
+
+### overview (tile based rasterizer is the second row)
+
+![](img/duck_render.jpg)
+
+### rendering time
+
+![](img/duck.JPG)
+
+### analysis
+
+Just like the box scene, the performance of scan line rasterizer starts to get worse as the mesh moves closer to the camera whereas the performance of tile based rasterizer almost remains the same. 
+
+But there is something else worth noticing. When the mesh is far away, you can spot some black triangles on the rendered image. This is because the shared memory being used to store primitives for each tile(block) is not large enough to hold all the potential triangles. Therefore, some triangles that contributes to the rendering are not cached in the shared memory. And if you run the program, you can see the black triangles are flickering, which means, for each frame, the triangles which belongs to a tile but not cached are not the same one. This is because the process of caching the potential triangles runs in parallel within each tile(block), and the run time of this process in one thread is not guaranteed to be the same for every frame, sometime slightly faster, sometime slightly slower. So when one thread is fast enough to use atomicadd to lock the slot and cache the primitve in one frame, it might not be as quick in the next frame and other threads who is faster may do the same thing before this thread gets a chance.
+
+One solution to solve the flickering triangle is to use a better algorithm to check whether a triangle will contribute to a tile or not. Currently, I’m only using AABB to check if they overlap. This is a conservative way to do it which means it will give you a triangle whose AABB overlaps the tile but itself doesn’t. As a result, they will occupy the limited shared memory slot which is supposed for triangles that actually overlaps the tile. This is also the reason why the black triangles disappear when the mesh is closer. When the mesh is closer, the AABB overlapping test rejects more easily resulting in less falsely occupied shared memory slot.
+
+Another solution is to use global memory to store primitives for each tile(block), just like a [previous implementation](https://github.com/Aman-Sachan-asach/CUDA-Rasterizer) of this project did. But I don’t think that makes sense at all. Because from what I read online, tile based rendering utilize “on chip” memory to compensate for scare global memory and its low accessing speed. And since “on chip” is equivalent to “shared memory” in CUDA lingo, I think it’s better to use it instead of global memory as the primitive buckets.
+
+### images
+
+* duck far distance scan line
+  
+![](img/duck_far_0.JPG)
+
+* duck far distance tile based
+  
+![](img/duck_far_1.JPG)
+
+* duck mid distance scan line
+  
+![](img/duck_mid_0.JPG)
+
+* duck mid distance tile based
+  
+![](img/duck_mid_1.JPG)
+
+* duck near distance scan line
+  
+![](img/duck_near_0.JPG)
+
+* duck near distance tile based
+  
+![](img/duck_near_1.JPG)
+
+---
+
+## 5. Cow Scene 
+
+### overview (tile based rasterizer is the second row)
+
+![](img/cow_render.jpg)
+
+### rendering time
+
+![](img/cow.JPG)
+
+### analysis
+
+But there is something new worth noticing. This is the only scene in all the five scenes where the performance of tile based rasterizer is worse than scan line rasterizer when the mesh is close to the camera. One of reasons might be the triangles are relatively small in terms of pixels they occupy. This is good when we parallelize the primitives since one thread will not loop for too long which means we are distributing works among appropriate amount of threads. On the other hand, for tile based rasterizer, small triangles means for each tile(block), there will be more primitives stored in their shared memories, therefore each thread will loop for a longer time. This means we are not distributing our works appropriately. 
+
+### images
+
+* cow far distance scan line
+  
+![](img/cow_far_0.JPG)
+
+* cow far distance tile based
+  
+![](img/cow_far_1.JPG)
+
+* cow mid distance scan line
+  
+![](img/cow_mid_0.JPG)
+
+* cow mid distance tile based
+  
+![](img/cow_mid_1.JPG)
+
+* cow near distance scan line
+  
+![](img/cow_near_0.JPG)
+
+* cow near distance tile based
+  
+![](img/cow_near_1.JPG)
+
+---
+
+## 6. Summary
+
+* Tile based rasterizer is better when primitives are few but occupied pixels are many.(Simple mesh near distance)
 
-### (TODO: Your README)
+* Scan line rasterizer is better when primitives are many but occupied pixels are few.(Complex mesh far distance)
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+* Using shared memory for tile based rasterizer requires a more accurate(less conservative) algorithm to check whether a primitive should be cached.
 
+---
 
 ### Credits
 
 * [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@soyoyo](https://github.com/syoyo)
 * [glTF Sample Models](https://github.com/KhronosGroup/glTF/blob/master/sampleModels/README.md)
+* [Why does GL divide gl_Position by W for you rather than letting you do it yourself?](https://stackoverflow.com/questions/41085117/why-does-gl-divide-gl-position-by-w-for-you-rather-than-letting-you-do-it-your)
+* [fabian's a trip through the graphics pipeline](https://fgiesen.wordpress.com/2011/07/06/a-trip-through-the-graphics-pipeline-2011-part-6/)
diff --git a/img/0.JPG b/img/0.JPG
new file mode 100644
index 0000000..9d1192c
Binary files /dev/null and b/img/0.JPG differ
diff --git a/img/box.JPG b/img/box.JPG
new file mode 100644
index 0000000..77f9423
Binary files /dev/null and b/img/box.JPG differ
diff --git a/img/box_far.JPG b/img/box_far.JPG
new file mode 100644
index 0000000..61f3373
Binary files /dev/null and b/img/box_far.JPG differ
diff --git a/img/box_mid.JPG b/img/box_mid.JPG
new file mode 100644
index 0000000..a99cbdd
Binary files /dev/null and b/img/box_mid.JPG differ
diff --git a/img/box_near.JPG b/img/box_near.JPG
new file mode 100644
index 0000000..bc14b9d
Binary files /dev/null and b/img/box_near.JPG differ
diff --git a/img/box_render.jpg b/img/box_render.jpg
new file mode 100644
index 0000000..66c6e41
Binary files /dev/null and b/img/box_render.jpg differ
diff --git a/img/checkerboard.JPG b/img/checkerboard.JPG
new file mode 100644
index 0000000..ed165a4
Binary files /dev/null and b/img/checkerboard.JPG differ
diff --git a/img/checkerboard_far.JPG b/img/checkerboard_far.JPG
new file mode 100644
index 0000000..f5cf819
Binary files /dev/null and b/img/checkerboard_far.JPG differ
diff --git a/img/checkerboard_mid.JPG b/img/checkerboard_mid.JPG
new file mode 100644
index 0000000..2a8b8bb
Binary files /dev/null and b/img/checkerboard_mid.JPG differ
diff --git a/img/checkerboard_near.JPG b/img/checkerboard_near.JPG
new file mode 100644
index 0000000..62cf228
Binary files /dev/null and b/img/checkerboard_near.JPG differ
diff --git a/img/checkerboard_render.jpg b/img/checkerboard_render.jpg
new file mode 100644
index 0000000..33915da
Binary files /dev/null and b/img/checkerboard_render.jpg differ
diff --git a/img/cow.JPG b/img/cow.JPG
new file mode 100644
index 0000000..2864c2f
Binary files /dev/null and b/img/cow.JPG differ
diff --git a/img/cow_far_0.JPG b/img/cow_far_0.JPG
new file mode 100644
index 0000000..0dab8bb
Binary files /dev/null and b/img/cow_far_0.JPG differ
diff --git a/img/cow_far_1.JPG b/img/cow_far_1.JPG
new file mode 100644
index 0000000..6016cc3
Binary files /dev/null and b/img/cow_far_1.JPG differ
diff --git a/img/cow_mid_0.JPG b/img/cow_mid_0.JPG
new file mode 100644
index 0000000..b76c626
Binary files /dev/null and b/img/cow_mid_0.JPG differ
diff --git a/img/cow_mid_1.JPG b/img/cow_mid_1.JPG
new file mode 100644
index 0000000..8719b06
Binary files /dev/null and b/img/cow_mid_1.JPG differ
diff --git a/img/cow_near_0.JPG b/img/cow_near_0.JPG
new file mode 100644
index 0000000..7015e96
Binary files /dev/null and b/img/cow_near_0.JPG differ
diff --git a/img/cow_near_1.JPG b/img/cow_near_1.JPG
new file mode 100644
index 0000000..a258581
Binary files /dev/null and b/img/cow_near_1.JPG differ
diff --git a/img/cow_render.jpg b/img/cow_render.jpg
new file mode 100644
index 0000000..ab1d35b
Binary files /dev/null and b/img/cow_render.jpg differ
diff --git a/img/duck.JPG b/img/duck.JPG
new file mode 100644
index 0000000..4f3c7e9
Binary files /dev/null and b/img/duck.JPG differ
diff --git a/img/duck_far_0.JPG b/img/duck_far_0.JPG
new file mode 100644
index 0000000..4553290
Binary files /dev/null and b/img/duck_far_0.JPG differ
diff --git a/img/duck_far_1.JPG b/img/duck_far_1.JPG
new file mode 100644
index 0000000..e6f56e2
Binary files /dev/null and b/img/duck_far_1.JPG differ
diff --git a/img/duck_mid_0.JPG b/img/duck_mid_0.JPG
new file mode 100644
index 0000000..1a7c795
Binary files /dev/null and b/img/duck_mid_0.JPG differ
diff --git a/img/duck_mid_1.JPG b/img/duck_mid_1.JPG
new file mode 100644
index 0000000..dd89b73
Binary files /dev/null and b/img/duck_mid_1.JPG differ
diff --git a/img/duck_near_0.JPG b/img/duck_near_0.JPG
new file mode 100644
index 0000000..1f835db
Binary files /dev/null and b/img/duck_near_0.JPG differ
diff --git a/img/duck_near_1.JPG b/img/duck_near_1.JPG
new file mode 100644
index 0000000..5b82a83
Binary files /dev/null and b/img/duck_near_1.JPG differ
diff --git a/img/duck_render.jpg b/img/duck_render.jpg
new file mode 100644
index 0000000..f324d83
Binary files /dev/null and b/img/duck_render.jpg differ
diff --git a/img/flower.JPG b/img/flower.JPG
new file mode 100644
index 0000000..79f93e2
Binary files /dev/null and b/img/flower.JPG differ
diff --git a/img/flower_far.JPG b/img/flower_far.JPG
new file mode 100644
index 0000000..b25794d
Binary files /dev/null and b/img/flower_far.JPG differ
diff --git a/img/flower_mid.JPG b/img/flower_mid.JPG
new file mode 100644
index 0000000..1c588eb
Binary files /dev/null and b/img/flower_mid.JPG differ
diff --git a/img/flower_near.JPG b/img/flower_near.JPG
new file mode 100644
index 0000000..8eedae2
Binary files /dev/null and b/img/flower_near.JPG differ
diff --git a/img/flower_render.jpg b/img/flower_render.jpg
new file mode 100644
index 0000000..c625866
Binary files /dev/null and b/img/flower_render.jpg differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a57f69f..3d9b740 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -6,5 +6,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61 -lineinfo
     )
diff --git a/src/main.cpp b/src/main.cpp
index 7986959..29c622a 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -20,9 +20,10 @@
 
 int main(int argc, char **argv) {
     if (argc != 2) {
-        cout << "Usage: [gltf file]. Press Enter to exit" << endl;
-		getchar();
-        return 0;
+        //cout << "Usage: [gltf file]. Press Enter to exit" << endl;
+		//getchar();
+        //return 0;
+		argv[1] = "../gltfs/cow/cow.gltf";//"../gltfs/box/box.gltf";//default gltf, relative to lauching directory which is ".....\build\"//
     }
 
 	tinygltf::Scene scene;
@@ -52,6 +53,10 @@ int main(int argc, char **argv) {
 
     frame = 0;
     seconds = time(NULL);
+	recordStart = time(NULL);
+	record = false;
+	recordFrame = 0;
+	mode = 0;//tile mode
     fpstracker = 0;
 
     // Launch CUDA/GL
@@ -63,12 +68,28 @@ int main(int argc, char **argv) {
     return 0;
 }
 
+glm::vec3 lightDir(0, 0, -1);
+float x_light_rot = 0.0f, y_light_rot = 0.0f, z_light_rot = 0.0f;
+
 void mainLoop() {
     while (!glfwWindowShouldClose(window)) {
         glfwPollEvents();
+
+		setLightDirDev(lightDir);//me
         runCuda();
 
         time_t seconds2 = time (NULL);
+		
+		//record after press
+		if (record) {
+			recordEnd = time(NULL);
+			if (recordEnd - recordStart >= 10)
+			{
+				recordFrame /= (recordEnd - recordStart);
+				record = false;
+			}
+			recordFrameFinal[mode] = recordFrame;
+		}
 
         if (seconds2 - seconds >= 1) {
 
@@ -77,7 +98,10 @@ void mainLoop() {
             seconds = seconds2;
         }
 
-        string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
+        string title = "m" + utilityCore::convertIntToString(mode) + " | m0 " + 
+			utilityCore::convertIntToString((int)recordFrameFinal[0]) + " FPS m1 " + 
+			utilityCore::convertIntToString((int)recordFrameFinal[1]) + " FPS | CIS565 Rasterizer | " + 
+			utilityCore::convertIntToString((int)fps) + " FPS";
         glfwSetWindowTitle(window, title.c_str());
 
         glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
@@ -99,6 +123,7 @@ void mainLoop() {
 float scale = 1.0f;
 float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
 float x_angle = 0.0f, y_angle = 0.0f;
+
 void runCuda() {
     // Map OpenGL buffer object for writing from CUDA on a single GPU
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
@@ -108,23 +133,35 @@ void runCuda() {
 		scale * ((float)width / (float)height),
 		-scale, scale, 1.0, 1000.0);
 
+	//cout << "start" << endl;
+	//for (int i = 0; i < 4; i++)
+	//{
+	//	for (int j = 0; j < 4; j++)
+	//	{
+	//		cout << P[i][j] << " ";
+	//	}
+	//	cout << endl;
+	//}
+
 	glm::mat4 V = glm::mat4(1.0f);
 
 	glm::mat4 M =
 		glm::translate(glm::vec3(x_trans, y_trans, z_trans))
 		* glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
-		* glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
+		* glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f))
+		* glm::scale(glm::vec3(3,3,3));
 
 	glm::mat3 MV_normal = glm::transpose(glm::inverse(glm::mat3(V) * glm::mat3(M)));
 	glm::mat4 MV = V * M;
 	glm::mat4 MVP = P * MV;
 
     cudaGLMapBufferObject((void **)&dptr, pbo);
-	rasterize(dptr, MVP, MV, MV_normal);
+	rasterize(dptr, MVP, MV, MV_normal, mode);
     cudaGLUnmapBufferObject(pbo);
 
     frame++;
     fpstracker++;
+	if(record) recordFrame++;
 }
 
 //-------------------------------
@@ -238,10 +275,15 @@ void initVAO(void) {
     };
 
     GLfloat texcoords[] = {
-        1.0f, 1.0f,
-        0.0f, 1.0f,
-        0.0f, 0.0f,
-        1.0f, 0.0f
+        //1.0f, 1.0f,
+        //0.0f, 1.0f,
+        //0.0f, 0.0f,
+        //1.0f, 0.0f
+		// YOU NAUGHTY LITTLE BOY !!! This is counter-intuitive !!! You should change this only in CUDA code !!!
+		0.0f, 0.0f,
+		1.0f, 0.0f,
+		1.0f, 1.0f,
+		0.0f, 1.0f
     };
 
     GLushort indices[] = { 0, 1, 3, 3, 1, 2 };
@@ -328,6 +370,34 @@ void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods
     if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) {
         glfwSetWindowShouldClose(window, GL_TRUE);
     }
+	//me
+	if (key == GLFW_KEY_X && action == GLFW_PRESS) {
+		x_light_rot += 0.01f;
+		lightDir = glm::normalize(glm::mat3(glm::rotate(x_light_rot, glm::vec3(1, 0, 0))) * lightDir);
+		printf("lightDir:%f,%f,%f\n", lightDir.x, lightDir.y, lightDir.z);
+	}
+	if (key == GLFW_KEY_Y && action == GLFW_PRESS) {
+		y_light_rot += 0.01f;
+		lightDir = glm::normalize(glm::mat3(glm::rotate(y_light_rot, glm::vec3(0, 1, 0))) * lightDir);
+		printf("lightDir:%f,%f,%f\n", lightDir.x, lightDir.y, lightDir.z);
+	}
+	if (key == GLFW_KEY_Z && action == GLFW_PRESS) {
+		z_light_rot += 0.01f;
+		lightDir = glm::normalize(glm::mat3(glm::rotate(z_light_rot, glm::vec3(0, 0, 1))) * lightDir);
+		printf("lightDir:%f,%f,%f\n", lightDir.x, lightDir.y, lightDir.z);
+	}
+	if (key == GLFW_KEY_R && action == GLFW_PRESS) {
+		record = !record;
+		if (record)
+		{
+			recordStart = time(NULL);
+			recordFrame = 0;
+			recordFrameFinal[mode] = 0;
+		}
+	}
+	if (key == GLFW_KEY_M && action == GLFW_PRESS) {
+		mode = (mode + 1) % MODE_MAX;
+	}
 }
 
 //----------------------------
@@ -395,6 +465,6 @@ void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 
 void mouseWheelCallback(GLFWwindow* window, double xoffset, double yoffset)
 {
-	const double s = 1.0;	// sensitivity
+	const double s = 0.1;	// sensitivity
 	z_trans += (float)(s * yoffset);
 }
diff --git a/src/main.hpp b/src/main.hpp
index 4816fa1..2d7f5c1 100644
--- a/src/main.hpp
+++ b/src/main.hpp
@@ -25,14 +25,25 @@
 
 #include "rasterize.h"
 
+#define MODE_MAX 2
+
 using namespace std;
 
+
 //-------------------------------
 //------------GL STUFF-----------
 //-------------------------------
 int frame;
 int fpstracker;
 double seconds;
+
+double recordStart;
+double recordEnd;
+double recordFrame;
+double recordFrameFinal[MODE_MAX];
+bool record;
+int mode;
+
 int fps = 0;
 GLuint positionLocation = 0;
 GLuint texcoordsLocation = 1;
@@ -61,6 +72,7 @@ int main(int argc, char **argv);
 //-------------------------------
 
 void runCuda();
+void setLightDirDev(const glm::vec3 &_lightDir);//me
 
 #ifdef __APPLE__
 void display();
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..93cc3ac 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -18,6 +18,10 @@
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 
+#include <device_functions.h>
+
+#define PRIM_PER_BLOCK_MAX 256
+
 namespace {
 
 	typedef unsigned short VertexIndex;
@@ -28,7 +32,7 @@ namespace {
 
 	typedef unsigned char BufferByte;
 
-	enum PrimitiveType{
+	enum PrimitiveType {
 		Point = 1,
 		Line = 2,
 		Triangle = 3
@@ -41,18 +45,54 @@ namespace {
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
 
-		 glm::vec3 eyePos;	// eye space position used for shading
-		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
-		// glm::vec3 col;
-		 glm::vec2 texcoord0;
-		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
+		glm::vec3 eyePos;	// eye space position used for shading
+		glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
+	   // glm::vec3 col;
+		glm::vec2 texcoord0;
 		// ...
 	};
 
 	struct Primitive {
 		PrimitiveType primitiveType = Triangle;	// C++ 11 init
+		TextureData* dev_diffuseTex = NULL;
+		int diffuseTexWidth;//me
+		int diffuseTexHeight;//me
+		int diffuseTexComponentCount;//me
+
 		VertexOut v[3];
+
+		__device__ __host__ Primitive()
+		{
+			primitiveType = Triangle;
+			dev_diffuseTex = NULL;
+			diffuseTexWidth = 0;
+			diffuseTexHeight = 0;
+			diffuseTexComponentCount = 0;
+		}
+
+		__device__ __host__ Primitive(const Primitive& p)
+		{
+			primitiveType = p.primitiveType;
+			dev_diffuseTex = p.dev_diffuseTex;
+			diffuseTexWidth = p.diffuseTexWidth;
+			diffuseTexHeight = p.diffuseTexHeight;
+			diffuseTexComponentCount = p.diffuseTexComponentCount;
+			v[0] = p.v[0];
+			v[1] = p.v[1];
+			v[2] = p.v[2];
+		}
+
+		__device__ __host__ void operator=(const Primitive& p)
+		{
+			primitiveType = p.primitiveType;
+			dev_diffuseTex = p.dev_diffuseTex;
+			diffuseTexWidth = p.diffuseTexWidth;
+			diffuseTexHeight = p.diffuseTexHeight;
+			diffuseTexComponentCount = p.diffuseTexComponentCount;
+			v[0] = p.v[0];
+			v[1] = p.v[1];
+			v[2] = p.v[2];
+		}
 	};
 
 	struct Fragment {
@@ -62,10 +102,10 @@ namespace {
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
 
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
+		glm::vec3 eyePos;	// eye space position used for shading
+		glm::vec3 eyeNor;
+		VertexAttributeTexcoord texcoord0;
+		TextureData* dev_diffuseTex;
 		// ...
 	};
 
@@ -86,6 +126,7 @@ namespace {
 		TextureData* dev_diffuseTex;
 		int diffuseTexWidth;
 		int diffuseTexHeight;
+		int diffuseTexComponentCount;//me
 		// TextureData* dev_specularTex;
 		// TextureData* dev_normalTex;
 		// ...
@@ -111,58 +152,82 @@ static glm::vec3 *dev_framebuffer = NULL;
 
 static int * dev_depth = NULL;	// you might need this buffer when doing depth test
 
+static glm::vec3 lightDir_dev = glm::vec3(0, 0, 0);
+
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
  */
-__global__ 
+__global__
 void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
-
-    if (x < w && y < h) {
-        glm::vec3 color;
-        color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
-        color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
-        color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
-        // Each thread writes one pixel location in the texture (textel)
-        pbo[index].w = 0;
-        pbo[index].x = color.x;
-        pbo[index].y = color.y;
-        pbo[index].z = color.z;
-    }
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+	//me
+	//since in OpenGL, texture is filled row by row starting from bottom left corner,
+	//we need to re-arrange the coordinates so that it maps from screen space to texture space
+	//so that we can display it using OpenGL by drawing a square whose uv origin is at the bottom left corner as well
+	// 0,0-------w,0                 0,1------1,1
+	//	|         |	                  |        |
+	//  |         |    ---------->    |        |
+	//	|         |					  |        |
+	// 0,h-------w,h                 0,0------1,0
+
+	int indexFrom = x + ((h - 1 - y) * w);
+
+	int index = x + (y * w);
+
+	if (x < w && y < h) {
+		glm::vec3 color;
+		color.x = glm::clamp(image[indexFrom].x, 0.0f, 1.0f) * 255.0;
+		color.y = glm::clamp(image[indexFrom].y, 0.0f, 1.0f) * 255.0;
+		color.z = glm::clamp(image[indexFrom].z, 0.0f, 1.0f) * 255.0;
+		// Each thread writes one pixel location in the texture (textel)
+		pbo[index].w = 0;
+		pbo[index].x = color.x;
+		pbo[index].y = color.y;
+		pbo[index].z = color.z;
+	}
 }
 
-/** 
+/**
 * Writes fragment colors to the framebuffer
 */
-__global__
-void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
 
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
+//!!!IMPORTANT!!!
+//When passing arguments from CPU to GPU using a __global__ kernal, the arguments are
+//either GPU device pointers or values. They can not be CPU pointers(references).
+//!!!IMPORTANT!!!
+__global__
+void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer, glm::vec3 lightDir) {
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = x + (y * w);
 
+	if (x < w && y < h) {
 		// TODO: add your fragment shader code here
 
-    }
+		Fragment frag = fragmentBuffer[index];//copy to reduce global memory access
+
+		//half Lambert
+		glm::vec3 color = frag.color * (glm::dot(glm::normalize(-lightDir), glm::normalize(frag.eyeNor)) * 0.5f + 0.5f);
+		
+		framebuffer[index] = color;
+	}
 }
 
 /**
  * Called once at the beginning of the program to allocate memory.
  */
 void rasterizeInit(int w, int h) {
-    width = w;
-    height = h;
+	width = w;
+	height = h;
 	cudaFree(dev_fragmentBuffer);
 	cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
 	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-    cudaFree(dev_framebuffer);
-    cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
-    cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
-    
+	cudaFree(dev_framebuffer);
+	cudaMalloc(&dev_framebuffer, width * height * sizeof(glm::vec3));
+	cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
+
 	cudaFree(dev_depth);
 	cudaMalloc(&dev_depth, width * height * sizeof(int));
 
@@ -187,9 +252,9 @@ void initDepth(int w, int h, int * depth)
 * kern function with support for stride to sometimes replace cudaMemcpy
 * One thread is responsible for copying one component
 */
-__global__ 
+__global__
 void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, int n, int byteStride, int byteOffset, int componentTypeByteSize) {
-	
+
 	// Attribute (vec3 position)
 	// component (3 * float)
 	// byte (4 * byte)
@@ -202,20 +267,20 @@ void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, in
 		int offset = i - count * n;	// which component of the attribute
 
 		for (int j = 0; j < componentTypeByteSize; j++) {
-			
-			dev_dst[count * componentTypeByteSize * n 
-				+ offset * componentTypeByteSize 
+
+			dev_dst[count * componentTypeByteSize * n
+				+ offset * componentTypeByteSize
 				+ j]
 
-				= 
+				=
 
-			dev_src[byteOffset 
-				+ count * (byteStride == 0 ? componentTypeByteSize * n : byteStride) 
-				+ offset * componentTypeByteSize 
+				dev_src[byteOffset
+				+ count * (byteStride == 0 ? componentTypeByteSize * n : byteStride)
+				+ offset * componentTypeByteSize
 				+ j];
 		}
 	}
-	
+
 
 }
 
@@ -235,7 +300,7 @@ void _nodeMatrixTransform(
 }
 
 glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) {
-	
+
 	glm::mat4 curMatrix(1.0);
 
 	const std::vector<double> &m = n.matrix;
@@ -247,7 +312,8 @@ glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) {
 				curMatrix[i][j] = (float)m.at(4 * i + j);
 			}
 		}
-	} else {
+	}
+	else {
 		// no matrix, use rotation, scale, translation
 
 		if (n.translation.size() > 0) {
@@ -275,12 +341,12 @@ glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) {
 	return curMatrix;
 }
 
-void traverseNode (
+void traverseNode(
 	std::map<std::string, glm::mat4> & n2m,
 	const tinygltf::Scene & scene,
 	const std::string & nodeString,
 	const glm::mat4 & parentMatrix
-	) 
+)
 {
 	const tinygltf::Node & n = scene.nodes.at(nodeString);
 	glm::mat4 M = parentMatrix * getMatrixFromNodeMatrixVector(n);
@@ -523,6 +589,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 					TextureData* dev_diffuseTex = NULL;
 					int diffuseTexWidth = 0;
 					int diffuseTexHeight = 0;
+					int diffuseTexComponentCount = 0;
 					if (!primitive.material.empty()) {
 						const tinygltf::Material &mat = scene.materials.at(primitive.material);
 						printf("material.name = %s\n", mat.name.c_str());
@@ -537,9 +604,10 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 									size_t s = image.image.size() * sizeof(TextureData);
 									cudaMalloc(&dev_diffuseTex, s);
 									cudaMemcpy(dev_diffuseTex, &image.image.at(0), s, cudaMemcpyHostToDevice);
-									
+
 									diffuseTexWidth = image.width;
 									diffuseTexHeight = image.height;
+									diffuseTexComponentCount = image.component;
 
 									checkCUDAError("Set Texture Image data");
 								}
@@ -554,7 +622,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 					// ---------Node hierarchy transform--------
 					cudaDeviceSynchronize();
-					
+
 					dim3 numBlocksNodeTransform((numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 					_nodeMatrixTransform << <numBlocksNodeTransform, numThreadsPerBlock >> > (
 						numVertices,
@@ -582,9 +650,10 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 						dev_diffuseTex,
 						diffuseTexWidth,
 						diffuseTexHeight,
+						diffuseTexComponentCount,
 
 						dev_vertexOut	//VertexOut
-					});
+						});
 
 					totalNumPrimitives += numPrimitives;
 
@@ -595,21 +664,21 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 		} // for each node
 
 	}
-	
+
 
 	// 3. Malloc for dev_primitives
 	{
 		cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
 	}
-	
+
 
 	// Finally, cudaFree raw dev_bufferViews
 	{
 
 		std::map<std::string, BufferByte*>::const_iterator it(bufferViewDevPointers.begin());
 		std::map<std::string, BufferByte*>::const_iterator itEnd(bufferViewDevPointers.end());
-			
-			//bufferViewDevPointers
+
+		//bufferViewDevPointers
 
 		for (; it != itEnd; it++) {
 			cudaFree(it->second);
@@ -621,13 +690,19 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 }
 
+__device__ bool isZero(float v)
+{
+	if (glm::abs(v) < EPSILON)
+		return true;
+	else
+		return false;
+}
 
-
-__global__ 
+__global__
 void _vertexTransformAndAssembly(
-	int numVertices, 
-	PrimitiveDevBufPointers primitive, 
-	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
+	int numVertices,
+	PrimitiveDevBufPointers primitive,
+	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal,
 	int width, int height) {
 
 	// vertex id
@@ -638,18 +713,46 @@ void _vertexTransformAndAssembly(
 		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
 		// Then divide the pos by its w element to transform into NDC space
 		// Finally transform x and y to viewport space
+		glm::vec4 tempP;
+		glm::vec4 tempEyeP;
+		glm::vec3 tempEyeN;
+		glm::vec2 tempTexcoord0;
+
+		if (primitive.dev_position != NULL)
+		{
+			tempP = MVP * glm::vec4(primitive.dev_position[vid], 1);
+			tempP.x = tempP.x / tempP.w;
+			tempP.y = tempP.y / tempP.w;
+			tempP.z = tempP.z / tempP.w;
+			//tempP.w remain the same for recovering from perspective correct interpolation
+			tempP.x = (tempP.x * 0.5 + 0.5) * width;//[-1,1] to [0,1] to [0, width]
+			tempP.y = (tempP.y * -0.5 + 0.5) * height;//[-1,1] to [1,0] to [height, 0], since in screen space, origin is at top left corner
+			tempP.z = (tempP.z * 0.5 + 0.5);//[-1,1] to [0,1]
+			tempEyeP = MV * glm::vec4(primitive.dev_position[vid], 1);
+		}
+		if (primitive.dev_normal != NULL)
+		{
+			tempEyeN = MV_normal * glm::vec3(primitive.dev_normal[vid]);
+
+		}
+		if (primitive.dev_texcoord0 != NULL)
+		{
+			tempTexcoord0 = primitive.dev_texcoord0[vid];
+		}
 
 		// TODO: Apply vertex assembly here
 		// Assemble all attribute arraies into the primitive array
-		
+
+		primitive.dev_verticesOut[vid].pos = tempP;
+		primitive.dev_verticesOut[vid].eyePos = glm::vec3(tempEyeP);//w will always be 1, no need to divide
+		primitive.dev_verticesOut[vid].eyeNor = glm::normalize(tempEyeN);
+		primitive.dev_verticesOut[vid].texcoord0 = tempTexcoord0;
 	}
 }
 
-
-
 static int curPrimitiveBeginId = 0;
 
-__global__ 
+__global__
 void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_primitives, PrimitiveDevBufPointers primitive) {
 
 	// index id
@@ -660,28 +763,590 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 		// TODO: uncomment the following code for a start
 		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+		int pid;	// id for cur primitives vector
+		
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+			pid = iid / (int)primitive.primitiveType;
+			dev_primitives[pid].primitiveType = Triangle;//me
+			dev_primitives[pid].dev_diffuseTex = primitive.dev_diffuseTex;//me
+			dev_primitives[pid].diffuseTexComponentCount = primitive.diffuseTexComponentCount;//me
+			dev_primitives[pid].diffuseTexHeight = primitive.diffuseTexHeight;//me
+			dev_primitives[pid].diffuseTexWidth = primitive.diffuseTexWidth;//me
+			dev_primitives[pid].v[iid % (int)primitive.primitiveType] = primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		}
 
 
 		// TODO: other primitive types (point, line)
 	}
+
+}
+
+__device__ bool intersectLineHorizontal(float horizontal, float x1, float y1, float x2, float y2, float* result)
+{
+	if (x1 == x2)
+	{
+		if ((horizontal >= y1 && horizontal <= y2) || (horizontal >= y2 && horizontal <= y1))
+		{
+			if (result != nullptr) *result = x1;//or x2
+			return true;
+		}
+		else
+		{
+			return false;
+		}
+	}
+	else
+	{
+		if (y1 == y2)
+		{
+			if (horizontal == y1)//or y2
+			{
+				if (result != nullptr) *result = x1 < x2 ? x1 : x2;
+				return true;
+			}
+			else
+			{
+				return false;
+			}
+		}
+		else
+		{
+			float temp = (x2 - x1)*(horizontal - y1) / (y2 - y1) + x1;
+			if (temp <= glm::max(x1, x2) && temp >= glm::min(x1, x2))//if intersection is within the line segment
+			{
+				if(result != nullptr) *result = temp;
+				return true;
+			}
+			else
+			{
+				return false;
+			}
+		}
+	}
+}
+
+template<class T>
+__device__ T barycentricInterpolate(const glm::vec3 &P,
+	const glm::vec3 &p0, const glm::vec3 &p1, const glm::vec3 &p2,
+	const T &v0, const T &v1, const T &v2)
+{
+	float S = 0.5f *glm::length(glm::cross(p0 - p1, p2 - p1));
+	float s0 = 0.5f *glm::length(glm::cross(p1 - P, p2 - P)) / S;
+	float s1 = 0.5f *glm::length(glm::cross(p2 - P, p0 - P)) / S;
+	float s2 = 0.5f*glm::length(glm::cross(p0 - P, p1 - P)) / S;
+	return v0 * s0 + v1 * s1 + v2 * s2;
+}
+
+__global__
+void scanLineZ(int primitiveCount, const Primitive* primitives, int width, int height, int* depths)
+{
+	int pid = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (pid < primitiveCount)
+	{
+		Primitive primitive = primitives[pid];//copy so that no global memory access in this function anymore
+		int yMin = INT_MAX;
+		int yMax = INT_MIN;
+		int vertCount = 0;
+
+		if (primitive.primitiveType == Triangle)
+		{
+			vertCount = 3;
+
+			//find the vertical boundary
+			for (int i = 0; i < vertCount; i++)
+			{
+				if (primitive.v[i].pos.y < yMin) yMin = primitive.v[i].pos.y;
+				if (primitive.v[i].pos.y > yMax) yMax = primitive.v[i].pos.y;
+			}
+
+			//clamp the vertical boundary
+			yMin = yMin < 0 ? 0 : yMin > height - 1 ? height - 1 : yMin;
+			yMax = yMax < 0 ? 0 : yMax > height - 1 ? height - 1 : yMax;
+
+			//save the fragment row by row
+			for (int i = yMin; i <= yMax; i++)
+			{
+				float xResult[3] = { -1, -1, -1 };
+				int xMin = INT_MAX;
+				int xMax = INT_MIN;
+				bool hit = false;
+
+				//find the horizontal boundary
+				for (int j = 0; j < 3; j++)
+				{
+					int jPlusOne = (j + 1) % 3;
+					if (intersectLineHorizontal(i, primitive.v[j].pos.x, primitive.v[j].pos.y, primitive.v[jPlusOne].pos.x, primitive.v[jPlusOne].pos.y, &xResult[j]))
+					{
+						if (xResult[j] > xMax) xMax = xResult[j];
+						if (xResult[j] < xMin) xMin = xResult[j];
+						hit = true;
+					}
+				}
+
+				if (hit)
+				{
+					//clamp the horizontal boundary
+					xMin = xMin < 0 ? 0 : xMin > width - 1 ? width - 1 : xMin;
+					xMax = xMax < 0 ? 0 : xMax > width - 1 ? width - 1 : xMax;
+
+					//loop from xMin to xMax
+					for (int j = xMin; j <= xMax; j++)
+					{
+						glm::vec2 fragmentPos(j, i);
+
+						Fragment fragment;
+
+						glm::vec3 P(fragmentPos, 0);
+						glm::vec3 p0(primitive.v[0].pos.x, primitive.v[0].pos.y, 0);
+						glm::vec3 p1(primitive.v[1].pos.x, primitive.v[1].pos.y, 0);
+						glm::vec3 p2(primitive.v[2].pos.x, primitive.v[2].pos.y, 0);
+
+						//[0,1]
+						float fragmentZ = barycentricInterpolate(P, p0, p1, p2,
+							primitive.v[0].pos.z,
+							primitive.v[1].pos.z,
+							primitive.v[2].pos.z);
+
+						if (fragmentZ > 0 && fragmentZ < 1)
+						{
+
+							int intZ = fragmentZ * INT_MAX;
+
+							//copy the fragment to fragment buffer
+							atomicMin(&depths[i * width + j], intZ);
+						}
+					}
+				}
+			}
+		}
+		else if (primitive.primitiveType == Line)
+		{
+			//do nothing
+		}
+		else if (primitive.primitiveType == Point)
+		{
+			//do nothing
+		}
+	}
+}
+
+//alpha is ignored
+__device__ glm::vec3 bilinearSample(const TextureData* dev_texture, int comp, int width, int height, const glm::vec2& uv)
+{
+	int size = width * height * comp;
+	float fracU = uv.x * width;
+	int floorU = fracU;
+	fracU -= floorU;
+	float fracV = uv.y * height;
+	int floorV = fracV;
+	fracV -= floorV;
+	int ceilU = glm::clamp(floorU + 1, 0, width - 1);
+	int ceilV = glm::clamp(floorV + 1, 0, height - 1);
+	int indexFUFV = glm::clamp(comp * (floorU + floorV * width), 0, size-1);
+	int indexFUCV = glm::clamp(comp * (floorU + ceilV * width), 0, size - 1);
+	int indexCUFV = glm::clamp(comp * (ceilU + floorV * width), 0, size - 1);
+	int indexCUCV = glm::clamp(comp * (ceilU + ceilV * width), 0, size - 1);
+	glm::vec3 colFUFV(dev_texture[indexFUFV] / 255.f, dev_texture[indexFUFV + 1] / 255.f, dev_texture[indexFUFV + 2] / 255.f);
+	glm::vec3 colFUCV(dev_texture[indexFUCV] / 255.f, dev_texture[indexFUCV + 1] / 255.f, dev_texture[indexFUCV + 2] / 255.f);
+	glm::vec3 colCUFV(dev_texture[indexCUFV] / 255.f, dev_texture[indexCUFV + 1] / 255.f, dev_texture[indexCUFV + 2] / 255.f);
+	glm::vec3 colCUCV(dev_texture[indexCUCV] / 255.f, dev_texture[indexCUCV + 1] / 255.f, dev_texture[indexCUCV + 2] / 255.f);
+	glm::vec3 colFU = (1.f - fracV) * colFUFV + fracV * colFUCV;
+	glm::vec3 colCU = (1.f - fracV) * colCUFV + fracV * colCUCV;
+	return (1.f - fracU) * colFU + fracU * colCU;
+}
+
+__global__
+void scanLine(int primitiveCount, const Primitive* primitives, int width, int height, int* depths, Fragment* fragments)
+{
+	int pid = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (pid < primitiveCount)
+	{
+		Primitive primitive = primitives[pid];//copy so that no global memory access in this function anymore
+		int yMin = INT_MAX;
+		int yMax = INT_MIN;
+		int vertCount = 0;
+
+		if (primitive.primitiveType == Triangle)
+		{
+			vertCount = 3;
+
+			//find the vertical boundary
+			for (int i = 0; i < vertCount; i++)
+			{
+				if (primitive.v[i].pos.y < yMin) yMin = primitive.v[i].pos.y;
+				if (primitive.v[i].pos.y > yMax) yMax = primitive.v[i].pos.y;
+			}
+
+			//clamp the vertical boundary
+			yMin = yMin < 0 ? 0 : yMin > height - 1 ? height - 1 : yMin;
+			yMax = yMax < 0 ? 0 : yMax > height - 1 ? height - 1 : yMax;
+
+			//save the fragment row by row
+			for (int i = yMin; i <= yMax; i++)
+			{
+				float xResult[3] = { -1, -1, -1 };
+				int xMin = INT_MAX;
+				int xMax = INT_MIN;
+				bool hit = false;
+
+				//find the horizontal boundary
+				for (int j = 0; j < 3; j++)
+				{
+					int jPlusOne = (j + 1) % 3;
+					if (intersectLineHorizontal(i, primitive.v[j].pos.x, primitive.v[j].pos.y, primitive.v[jPlusOne].pos.x, primitive.v[jPlusOne].pos.y, &xResult[j]))
+					{
+						if (xResult[j] > xMax) xMax = xResult[j];
+						if (xResult[j] < xMin) xMin = xResult[j];
+						hit = true;
+					}
+				}
+
+				if (hit)
+				{
+					//clamp the horizontal boundary
+					xMin = xMin < 0 ? 0 : xMin > width - 1 ? width - 1 : xMin;
+					xMax = xMax < 0 ? 0 : xMax > width - 1 ? width - 1 : xMax;
+
+					//loop from xMin to xMax
+					for (int j = xMin; j <= xMax; j++)
+					{
+						glm::vec2 fragmentPos(j, i);
+
+						Fragment fragment;
+
+						glm::vec3 P(fragmentPos, 0);
+						glm::vec3 p0(primitive.v[0].pos.x, primitive.v[0].pos.y, 0);
+						glm::vec3 p1(primitive.v[1].pos.x, primitive.v[1].pos.y, 0);
+						glm::vec3 p2(primitive.v[2].pos.x, primitive.v[2].pos.y, 0);
+						
+						float fragmentZ = barycentricInterpolate(P, p0, p1, p2,
+							primitive.v[0].pos.z,
+							primitive.v[1].pos.z,
+							primitive.v[2].pos.z);
+
+						//depth clip, OpenGL and D3D do it in clip space, we do it in screen space due to the structure of our pipeline
+						if (fragmentZ > 0 && fragmentZ < 1)
+						{
+
+							int intZ = fragmentZ * INT_MAX;
+
+							//depth test
+							if (intZ <= depths[i * width + j])
+							{
+								//for perspective correct interpolation
+								float w = 1.f / barycentricInterpolate(P, p0, p1, p2,
+									1.f / primitive.v[0].pos.w,
+									1.f / primitive.v[1].pos.w,
+									1.f / primitive.v[2].pos.w);
+
+								//set the attributes of the fragment
+								fragment.eyePos = w * barycentricInterpolate(P, p0, p1, p2,
+									primitive.v[0].eyePos / primitive.v[0].pos.w,
+									primitive.v[1].eyePos / primitive.v[1].pos.w,
+									primitive.v[2].eyePos / primitive.v[2].pos.w);
+
+								fragment.eyeNor = w * barycentricInterpolate(P, p0, p1, p2,
+									primitive.v[0].eyeNor / primitive.v[0].pos.w,
+									primitive.v[1].eyeNor / primitive.v[1].pos.w,
+									primitive.v[2].eyeNor / primitive.v[2].pos.w);
+
+								fragment.texcoord0 = w * barycentricInterpolate(P, p0, p1, p2,
+									primitive.v[0].texcoord0 / primitive.v[0].pos.w,
+									primitive.v[1].texcoord0 / primitive.v[1].pos.w,
+									primitive.v[2].texcoord0 / primitive.v[2].pos.w);
+
+								fragment.dev_diffuseTex = primitive.dev_diffuseTex;//this should be an uniform
+
+								//temp glm::vec3(1, 0, 1);
+								fragment.color = primitive.dev_diffuseTex == NULL ? glm::vec3(1, 1, 1) : 
+									bilinearSample(primitive.dev_diffuseTex, 
+										primitive.diffuseTexComponentCount, 
+										primitive.diffuseTexWidth, 
+										primitive.diffuseTexHeight, 
+										fragment.texcoord0);
+
+								//copy the fragment to fragment buffer
+								fragments[i * width + j] = fragment;
+							}
+						}
+					}
+				}
+			}
+		}
+		else if (primitive.primitiveType == Line)
+		{
+			//do nothing
+		}
+		else if (primitive.primitiveType == Point)
+		{
+			//do nothing
+		}
+	}
+}
+
+__device__
+bool intersectLineSegments(const glm::vec2& a1, const glm::vec2& a2, const glm::vec2& b1, const glm::vec2& b2)
+{
+	float x = 0;
+	float y = 0;
+	if (a1.x == a2.x)//no ka
+	{
+		x = a1.x;
+		if (b1.x == b2.x)//no kb, parallel
+		{
+			return false;//progressive
+		}
+		else//kb
+		{
+			float kb = (b1.y - b2.y) / (b1.x - b2.x);
+			float bb = b1.y - kb * b1.x;
+			y = kb * x + bb;
+		}
+	}
+	else//ka
+	{
+		float ka = (a1.y - a2.y) / (a1.x - a2.x);
+		float ba = a1.y - ka * a1.x;
+		if (b1.x == b2.x)//no kb
+		{
+			x = b1.x;
+			y = ka * x + ba;
+		}
+		else//kb
+		{
+			float kb = (b1.y - b2.y) / (b1.x - b2.x);
+			float bb = b1.y - kb * b1.x;
+			if (ka == kb)//parallel
+			{
+				return false;//progressive
+			}
+			else
+			{
+				x = (bb - ba) / (ka - kb);
+				y = ka * x + ba;
+			}
+		}
+	}
+
+	glm::vec2 minA = glm::min(a1, a2);
+	glm::vec2 maxA = glm::max(a1, a2);
+	glm::vec2 minB = glm::min(b1, b2);
+	glm::vec2 maxB = glm::max(b1, b2);
+
+	if (x > minA.x && y > minA.y && 
+		x < maxA.x && y < maxA.y &&
+		x > minB.x && y > minB.y &&
+		x < maxB.x && y < maxB.y)
+		return true;
+	else
+		return false;
+
+}
+
+__device__
+bool triangleInTile(const glm::vec3& v0, const glm::vec3& v1, const glm::vec3& v2,
+	float xMin, float xMax, float yMin, float yMax)
+{
+	glm::vec3 min = glm::min(v0, glm::min(v1, v2));
+	glm::vec3 max = glm::max(v0, glm::max(v1, v2));
+
+	//early rejection
+	if ((max.x < xMin) || (min.x > xMax) || (max.y < yMin) || (min.y > yMax))
+	{
+		return false;
+	}
+
+	return true;
+}
+
+template<int tileWidth, int tileHeight>
+__global__
+void scanLineTile(int primitiveCount, const Primitive* primitives, int width, int height, Fragment* fragments)
+{
+	__shared__ Primitive shared_primitives[PRIM_PER_BLOCK_MAX];
+	__shared__ int shared_primitives_count;
+
+	int pixelX = blockIdx.x * blockDim.x + threadIdx.x;
+	int pixelY = blockIdx.y * blockDim.y + threadIdx.y;
+	int pixelId = pixelX + pixelY * width;
+	int threadId = threadIdx.x + threadIdx.y * blockDim.x;
+
+	int depth = INT_MAX;
+
+	int threadCount = blockDim.x * blockDim.y;
+	int primitivePerThread = (primitiveCount - 1 + threadCount) / threadCount;
 	
+	if (threadIdx.x == 0 && threadIdx.y == 0) shared_primitives_count = 0;
+
+	__syncthreads();
+
+	for (int i = 0; i < primitivePerThread; i++)
+	{
+		int temp = threadId * primitivePerThread + i;
+		if (temp < primitiveCount)
+		{
+			Primitive tempPrimitive = primitives[temp];
+			//inside
+			if(triangleInTile(glm::vec3(tempPrimitive.v[0].pos), glm::vec3(tempPrimitive.v[1].pos), glm::vec3(tempPrimitive.v[2].pos), 
+				(float)blockIdx.x * blockDim.x, (blockIdx.x + 1.f) * blockDim.x, (float)blockIdx.y * blockDim.y, (blockIdx.y + 1.f) * blockDim.y))
+			{
+				int oldPrimitiveCount = atomicAdd(&shared_primitives_count, 1);
+				//printf("%d,%d:%d,%d:%d\n", blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, oldPrimitiveCount);
+				if (oldPrimitiveCount >= 0 && oldPrimitiveCount < PRIM_PER_BLOCK_MAX)
+				{
+					//printf("%d\n", oldPrimitiveCount);
+					shared_primitives[oldPrimitiveCount] = tempPrimitive;
+				}
+				else
+				{
+					break;
+				}
+			}
+		}
+	}
+
+	//if (threadIdx.x == 0 && threadIdx.y == 0 && shared_primitives_count>0) printf("%d,%d:%d\n", blockIdx.x, blockIdx.y, shared_primitives_count);
+
+	__syncthreads();
+
+	//printf("%d\n", shared_primitives_count);
+	int actual_shared_primitives_count = shared_primitives_count < PRIM_PER_BLOCK_MAX ? shared_primitives_count : PRIM_PER_BLOCK_MAX;
+
+
+	if (actual_shared_primitives_count > 0 && pixelX < width && pixelY < height)
+	{
+		glm::vec2 fragmentPos(pixelX, pixelY);
+		Fragment fragment;
+		glm::vec3 P(fragmentPos, 0);
+		bool bingo = false;
+
+		//depth pass
+		for (int i = 0; i < actual_shared_primitives_count; i++)
+		{
+			glm::vec3 v0(shared_primitives[i].v[0].pos);//copy so that no global memory access in this function anymore
+			glm::vec3 v1(shared_primitives[i].v[1].pos);
+			glm::vec3 v2(shared_primitives[i].v[2].pos);
+
+			glm::vec3 p0(v0.x, v0.y, 0);
+			glm::vec3 p1(v1.x, v1.y, 0);
+			glm::vec3 p2(v2.x, v2.y, 0);
+			
+			glm::vec3 tri[3];
+			tri[0] = v0;
+			tri[1] = v1;
+			tri[2] = v2;
+
+			glm::vec3 baryCoords = calculateBarycentricCoordinate(tri, glm::vec2(pixelX, pixelY));
+			bool isInsideTriangle = isBarycentricCoordInBounds(baryCoords);
+			if (isInsideTriangle)
+			{
+				float fragmentZ = barycentricInterpolate(P, p0, p1, p2,
+					v0.z,
+					v1.z,
+					v2.z);
+
+				//depth clip, OpenGL and D3D do it in clip space, we do it in screen space due to the structure of our pipeline
+				if (fragmentZ > 0 && fragmentZ < 1)
+				{
+
+					int intZ = fragmentZ * INT_MAX;
+					if (intZ < depth)
+					{
+						depth = intZ;
+					}
+				}
+			}
+		}
+		
+		//second pass
+		for (int i = 0; i < actual_shared_primitives_count; i++)
+		{
+			Primitive primitive = shared_primitives[i];//copy so that no global memory access in this function anymore
+			glm::vec3 p0(primitive.v[0].pos.x, primitive.v[0].pos.y, 0);
+			glm::vec3 p1(primitive.v[1].pos.x, primitive.v[1].pos.y, 0);
+			glm::vec3 p2(primitive.v[2].pos.x, primitive.v[2].pos.y, 0);
+			
+			glm::vec3 tri[3];
+			tri[0] = glm::vec3(primitive.v[0].pos);
+			tri[1] = glm::vec3(primitive.v[1].pos);
+			tri[2] = glm::vec3(primitive.v[2].pos);
+
+			glm::vec3 baryCoords = calculateBarycentricCoordinate(tri, glm::vec2(pixelX, pixelY));
+			bool isInsideTriangle = isBarycentricCoordInBounds(baryCoords);
+			if (isInsideTriangle)
+			{
+				float fragmentZ = barycentricInterpolate(P, p0, p1, p2,
+					primitive.v[0].pos.z,
+					primitive.v[1].pos.z,
+					primitive.v[2].pos.z);
+
+				//depth clip, OpenGL and D3D do it in clip space, we do it in screen space due to the structure of our pipeline
+				if (fragmentZ > 0 && fragmentZ < 1)
+				{
+
+					int intZ = fragmentZ * INT_MAX;
+
+					//depth test
+					if (intZ <= depth)//shared_depths[threadId])
+					{
+						depth = intZ;
+						//for perspective correct interpolation
+						float w = 1.f / barycentricInterpolate(P, p0, p1, p2,
+							1.f / primitive.v[0].pos.w,
+							1.f / primitive.v[1].pos.w,
+							1.f / primitive.v[2].pos.w);
+
+						//set the attributes of the fragment
+						fragment.eyePos = w * barycentricInterpolate(P, p0, p1, p2,
+							primitive.v[0].eyePos / primitive.v[0].pos.w,
+							primitive.v[1].eyePos / primitive.v[1].pos.w,
+							primitive.v[2].eyePos / primitive.v[2].pos.w);
+
+						fragment.eyeNor = w * barycentricInterpolate(P, p0, p1, p2,
+							primitive.v[0].eyeNor / primitive.v[0].pos.w,
+							primitive.v[1].eyeNor / primitive.v[1].pos.w,
+							primitive.v[2].eyeNor / primitive.v[2].pos.w);
+
+						fragment.texcoord0 = w * barycentricInterpolate(P, p0, p1, p2,
+							primitive.v[0].texcoord0 / primitive.v[0].pos.w,
+							primitive.v[1].texcoord0 / primitive.v[1].pos.w,
+							primitive.v[2].texcoord0 / primitive.v[2].pos.w);
+
+						fragment.dev_diffuseTex = primitive.dev_diffuseTex;
+
+						fragment.color = primitive.dev_diffuseTex == NULL ? glm::vec3(1, 1, 1) :
+							bilinearSample(primitive.dev_diffuseTex,
+								primitive.diffuseTexComponentCount,
+								primitive.diffuseTexWidth,
+								primitive.diffuseTexHeight,
+								fragment.texcoord0);
+
+						bingo = true;
+					}
+				}
+			}
+		}
+
+		//write back
+		if(bingo) fragments[pixelId] = fragment;
+	}
 }
 
 
+void setLightDirDev(const glm::vec3 &_lightDir)
+{
+	lightDir_dev = _lightDir;
+}
 
 /**
  * Perform rasterization.
  */
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
-    int sideLength2d = 8;
-    dim3 blockSize2d(sideLength2d, sideLength2d);
-    dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal, const int mode) {
+	int sideLength2d = 8;
+	dim3 blockSize2d(sideLength2d, sideLength2d);
+	dim3 blockCount2d((width - 1) / blockSize2d.x + 1,
 		(height - 1) / blockSize2d.y + 1);
 
 	// Execute your rasterization pipeline here
@@ -702,14 +1367,14 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 
-				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> > (p->numVertices, *p, MVP, MV, MV_normal, width, height);
 				checkCUDAError("Vertex Processing");
 				cudaDeviceSynchronize();
 				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
-					(p->numIndices, 
-					curPrimitiveBeginId, 
-					dev_primitives, 
-					*p);
+					(p->numIndices,
+						curPrimitiveBeginId,
+						dev_primitives,
+						*p);
 				checkCUDAError("Primitive Assembly");
 
 				curPrimitiveBeginId += p->numPrimitives;
@@ -718,20 +1383,38 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
-	
+
 	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
+	initDepth << <blockCount2d, blockSize2d >> > (width, height, dev_depth);
 	
 	// TODO: rasterize
+	cudaDeviceSynchronize();
+	if (mode == 0)
+	{
+		dim3 numTreadsPerBlock(128);
+		dim3 numBlocksForPremitives((curPrimitiveBeginId + numTreadsPerBlock.x - 1) / numTreadsPerBlock.x);
+		scanLineZ << <numBlocksForPremitives, numTreadsPerBlock >> > (curPrimitiveBeginId, dev_primitives, width, height, dev_depth);
+		checkCUDAError("scan line z");
+		cudaDeviceSynchronize();
+		scanLine << <numBlocksForPremitives, numTreadsPerBlock >> > (curPrimitiveBeginId, dev_primitives, width, height, dev_depth, dev_fragmentBuffer);
+		checkCUDAError("scan line");
 
+	}
+	else if (mode == 1)
+	{
+		dim3 tileSize2d(25, 25);
+		dim3 tileCount2d((width - 1 + tileSize2d.x) / tileSize2d.x, (height - 1 + tileSize2d.y) / tileSize2d.y);
+		scanLineTile<25, 25> << <tileCount2d, tileSize2d >> > (curPrimitiveBeginId, dev_primitives, width, height, dev_fragmentBuffer);
+		checkCUDAError("scan line tile");
+	}
+	cudaDeviceSynchronize();
 
-
-    // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
+	// Copy depthbuffer colors into framebuffer
+	render << <blockCount2d, blockSize2d >> > (width, height, dev_fragmentBuffer, dev_framebuffer, lightDir_dev);
 	checkCUDAError("fragment shader");
-    // Copy framebuffer into OpenGL buffer for OpenGL previewing
-    sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
-    checkCUDAError("copy render result to pbo");
+	// Copy framebuffer into OpenGL buffer for OpenGL previewing
+	sendImageToPBO << <blockCount2d, blockSize2d >> > (pbo, width, height, dev_framebuffer);
+	checkCUDAError("copy render result to pbo");
 }
 
 /**
@@ -739,7 +1422,7 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
  */
 void rasterizeFree() {
 
-    // deconstruct primitives attribute/indices device buffer
+	// deconstruct primitives attribute/indices device buffer
 
 	auto it(mesh2PrimitivesMap.begin());
 	auto itEnd(mesh2PrimitivesMap.end());
@@ -753,24 +1436,24 @@ void rasterizeFree() {
 
 			cudaFree(p->dev_verticesOut);
 
-			
+
 			//TODO: release other attributes and materials
 		}
 	}
 
 	////////////
 
-    cudaFree(dev_primitives);
-    dev_primitives = NULL;
+	cudaFree(dev_primitives);
+	dev_primitives = NULL;
 
 	cudaFree(dev_fragmentBuffer);
 	dev_fragmentBuffer = NULL;
 
-    cudaFree(dev_framebuffer);
-    dev_framebuffer = NULL;
+	cudaFree(dev_framebuffer);
+	dev_framebuffer = NULL;
 
 	cudaFree(dev_depth);
 	dev_depth = NULL;
 
-    checkCUDAError("rasterize Free");
+	checkCUDAError("rasterize Free");
 }
diff --git a/src/rasterize.h b/src/rasterize.h
index 560aae9..2e01c15 100644
--- a/src/rasterize.h
+++ b/src/rasterize.h
@@ -20,5 +20,5 @@ namespace tinygltf{
 void rasterizeInit(int width, int height);
 void rasterizeSetBuffers(const tinygltf::Scene & scene);
 
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal);
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal, int mode);
 void rasterizeFree();
diff --git a/util/CMakeLists.txt b/util/CMakeLists.txt
index c995fae..084fb39 100644
--- a/util/CMakeLists.txt
+++ b/util/CMakeLists.txt
@@ -8,5 +8,5 @@ set(SOURCE_FILES
 
 cuda_add_library(util
    ${SOURCE_FILES}
-   OPTIONS -arch=sm_52
+   OPTIONS -arch=sm_61
    )