diff --git a/README.md b/README.md
index cad1abd..7d56ebf 100644
--- a/README.md
+++ b/README.md
@@ -5,14 +5,78 @@ CUDA Rasterizer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Liang Peng
+* Tested on: Windows 10, i7-6700HQ @ 2.6GHz 2.6GHz 8GB, GTX 960M (Personal Laptop)
 
-### (TODO: Your README)
+## Features
+* [x] Primitive
+	* [x] Point
+	* [x] Line
+	* [x] Triangle
+* [x] Texture
+	* [x] Diffuse
+	* [x] Bilinear Filtering
+* [x] Lighting
+	* [x] Lambert
+	* [x] Blinn-Phong
+* [x] Normal Visualization
+* [x] Depth Visualization
+* [x] Depth Test with Mutex
+* [x] Perspective Corrected Texcoord
+* [x] Performance Analysis
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+## Overview
+### Basic Attributes
+Texture | Normal | Depth | Texcoord
+--- | --- | --- | ---
+![](img/cover_diffuse.gif) | ![](img/cover_normal.gif) | ![](img/cover_depth.gif) | ![](img/cover_texcoord.gif)
+_Note*_ Attributes displayed on each pixel are calculated by several steps. Vertex Shader processes each vertex by transforming vertex positions from model space to view space then to projection space, calculating eye positions, eye normals, texcoords. Next Primitive assembler will group all vertices in primitives, in this project, triangles. Then rasterizer will proccess every triangle by testing every pixel in its bounding box. If a pixel is on the triangle, its position, normal, texcoord are interpolated using barycentric weights on the threes vertices, and stored in fragment buffer. Last Fragment Shader will calculate color to shade each pixel using these attributes and store the color in frame buffer.
 
+### Lighting
+Lambert | Blinn-Phong
+--- | ---
+ ![](img/cover_lambert.gif) | ![](img/cover_blinnphong.gif)
+_Note*_ Lighting is calculated in Fragment Shader for each pixel. Lambert reflectance is proportional to the dot product of surface normal and light direction, while Blinn-Phong reflectance is proportional to the dot product of surface normal and halfway direction raised to the power of shininess.
+
+### Resterization Mode
+Point Cloud | Wireframe | Solid
+--- | --- | ---
+![](img/point.gif) | ![](img/line.gif) | ![](img/triangle.gif)
+ _Note*_ Point Cloud mode is implemented in a straight forward way, by shading the pixel where each pixel is located. Wireframe mode is most concerned with how to shade the pixels between two vertices. In this project, I implemented that by enforcing one pixel in each row if a segment is more vertical than 45 degree, or one pixel in each column if the segment is more horizontal than 45 degree.
+
+### Depth Test with Mutex
+Mutex OFF | Mutex ON
+:---:|:---:
+63 ms/frame | 74 ms/frame
+![](img/mutex_off.gif) | ![](img/mutex_on.gif)
+ _Note*_ We can see in the animated GIFs, noise appears in Mutex OFF mode, because of race condition. To avoid that, each pixel has its own mutex, which is essentially an unsigned integer. When a CUDA thread is trying to seize a pixel to write, it calls atomicCAS to try to change the value of mutex. If it succeeded, it can then safely write that pixel, or it would try again. It alse takes more time to process a frame with Mutex.
+
+### Texcoord Correction
+Perspective Correction OFF | Perspective Correction ON
+:---:|:---:
+ ![](img/texcoord0.gif) | ![](img/texcoord1.gif)
+ _Note*_ Texcoord Correction is unconditionally necessary since texture mapping would look wrong without correction. This is due to the fact that texture coordinates are interpolated in screen space, ignoring the depth information (Texcoords with deeper depth would be denser).
+
+### Texture Filtering
+Nearest Neighbor | Bilinear
+--- | ---
+ ![](img/nearestneighbor.gif) | ![](img/bilinear.gif)
+ _Note*_ Bilinear filtering is useful because texcoords are normalized thus cannot be mapped to a texel exactly after being scaled by texture width and height, thus we grab the four texels around the scaled texcoord and bilinear interpolate the color.
+
+### Performance Analysis
+* Resterization Mode
+
+![](img/chart1.png)
+
+_Note*_ From the stacked column chart we can see that point cloud mode has the lowest percentage of rasterization runtime, while solid mode has the highest percentage. This is because in solid mode, every pixel within the bounding box of each triangle is tested, which makes rasterization of triangles computation-consuming.
+
+* Different Models
+
+Cow | Duck | Truck
+--- | --- | ---
+<img src="img/cow.PNG" height="200"> | <img src="img/duck.PNG" height="200"> | <img src="img/truck.PNG" height="200">
+
+![](img/chart2.png)
 
 ### Credits
 
diff --git a/img/bilinear.gif b/img/bilinear.gif
new file mode 100644
index 0000000..0c51fee
Binary files /dev/null and b/img/bilinear.gif differ
diff --git a/img/chart1.png b/img/chart1.png
new file mode 100644
index 0000000..1601178
Binary files /dev/null and b/img/chart1.png differ
diff --git a/img/chart2.png b/img/chart2.png
new file mode 100644
index 0000000..723eb7f
Binary files /dev/null and b/img/chart2.png differ
diff --git a/img/cover_blinnphong.gif b/img/cover_blinnphong.gif
new file mode 100644
index 0000000..04f8111
Binary files /dev/null and b/img/cover_blinnphong.gif differ
diff --git a/img/cover_depth.gif b/img/cover_depth.gif
new file mode 100644
index 0000000..aa431a6
Binary files /dev/null and b/img/cover_depth.gif differ
diff --git a/img/cover_diffuse.gif b/img/cover_diffuse.gif
new file mode 100644
index 0000000..864ac51
Binary files /dev/null and b/img/cover_diffuse.gif differ
diff --git a/img/cover_lambert.gif b/img/cover_lambert.gif
new file mode 100644
index 0000000..f015a8f
Binary files /dev/null and b/img/cover_lambert.gif differ
diff --git a/img/cover_normal.gif b/img/cover_normal.gif
new file mode 100644
index 0000000..68dfa9d
Binary files /dev/null and b/img/cover_normal.gif differ
diff --git a/img/cover_texcoord.gif b/img/cover_texcoord.gif
new file mode 100644
index 0000000..e76e5d9
Binary files /dev/null and b/img/cover_texcoord.gif differ
diff --git a/img/cow.PNG b/img/cow.PNG
new file mode 100644
index 0000000..2e61342
Binary files /dev/null and b/img/cow.PNG differ
diff --git a/img/duck.PNG b/img/duck.PNG
new file mode 100644
index 0000000..08a8831
Binary files /dev/null and b/img/duck.PNG differ
diff --git a/img/dummy.gif b/img/dummy.gif
new file mode 100644
index 0000000..0bf4a36
Binary files /dev/null and b/img/dummy.gif differ
diff --git a/img/line.gif b/img/line.gif
new file mode 100644
index 0000000..2bcfd86
Binary files /dev/null and b/img/line.gif differ
diff --git a/img/mutex_off.gif b/img/mutex_off.gif
new file mode 100644
index 0000000..f26d467
Binary files /dev/null and b/img/mutex_off.gif differ
diff --git a/img/mutex_on.gif b/img/mutex_on.gif
new file mode 100644
index 0000000..5381504
Binary files /dev/null and b/img/mutex_on.gif differ
diff --git a/img/nearestneighbor.gif b/img/nearestneighbor.gif
new file mode 100644
index 0000000..4e003ec
Binary files /dev/null and b/img/nearestneighbor.gif differ
diff --git a/img/point.gif b/img/point.gif
new file mode 100644
index 0000000..8783788
Binary files /dev/null and b/img/point.gif differ
diff --git a/img/texcoord0.gif b/img/texcoord0.gif
new file mode 100644
index 0000000..5611356
Binary files /dev/null and b/img/texcoord0.gif differ
diff --git a/img/texcoord1.gif b/img/texcoord1.gif
new file mode 100644
index 0000000..aade0fe
Binary files /dev/null and b/img/texcoord1.gif differ
diff --git a/img/triangle.gif b/img/triangle.gif
new file mode 100644
index 0000000..5fe333b
Binary files /dev/null and b/img/triangle.gif differ
diff --git a/img/truck.PNG b/img/truck.PNG
new file mode 100644
index 0000000..cb2d492
Binary files /dev/null and b/img/truck.PNG differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a57f69f..ce4f1b6 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -6,5 +6,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50
     )
diff --git a/src/main.cpp b/src/main.cpp
index a36b955..646142b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,6 +13,7 @@
 #define STB_IMAGE_IMPLEMENTATION
 #define TINYGLTF_LOADER_IMPLEMENTATION
 #include <util/tiny_gltf_loader.h>
+#include <chrono>
 
 //-------------------------------
 //-------------MAIN--------------
@@ -65,18 +66,26 @@ int main(int argc, char **argv) {
 void mainLoop() {
     while (!glfwWindowShouldClose(window)) {
         glfwPollEvents();
+		// tic
+		auto tic = std::chrono::high_resolution_clock::now();
         runCuda();
+		cudaDeviceSynchronize();
+		// toc
+		auto toc = std::chrono::high_resolution_clock::now();
+		std::chrono::milliseconds tictoc =
+				std::chrono::duration_cast<std::chrono::milliseconds>(toc - tic);
 
         time_t seconds2 = time (NULL);
 
         if (seconds2 - seconds >= 1) {
-
             fps = fpstracker / (seconds2 - seconds);
             fpstracker = 0;
             seconds = seconds2;
         }
 
-        string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
+        string title = "CIS565 Rasterizer | "
+				+ utilityCore::convertIntToString((int)fps) + " FPS | ["
+				+ utilityCore::convertIntToString(tictoc.count()) + " ms]";
         glfwSetWindowTitle(window, title.c_str());
 
         glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
@@ -95,7 +104,7 @@ void mainLoop() {
 //-------------------------------
 //---------RUNTIME STUFF---------
 //-------------------------------
-float scale = 1.0f;
+float scale = 1.f;
 float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
 float x_angle = 0.0f, y_angle = 0.0f;
 void runCuda() {
diff --git a/src/main.hpp b/src/main.hpp
index 4816fa1..edc8ff7 100644
--- a/src/main.hpp
+++ b/src/main.hpp
@@ -101,4 +101,4 @@ std::string getFilePathExtension(const std::string &FileName);
 
 void mouseButtonCallback(GLFWwindow* window, int button, int action, int mods);
 void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos);
-void mouseWheelCallback(GLFWwindow* window, double xoffset, double yoffset);
\ No newline at end of file
+void mouseWheelCallback(GLFWwindow* window, double xoffset, double yoffset);
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..8f4c612 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -18,6 +18,10 @@
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 
+#define TRIANGLE 1
+#define LINE 1
+#define POINT 1
+
 namespace {
 
 	typedef unsigned short VertexIndex;
@@ -25,7 +29,6 @@ namespace {
 	typedef glm::vec3 VertexAttributeNormal;
 	typedef glm::vec2 VertexAttributeTexcoord;
 	typedef unsigned char TextureData;
-
 	typedef unsigned char BufferByte;
 
 	enum PrimitiveType{
@@ -38,7 +41,7 @@ namespace {
 		glm::vec4 pos;
 
 		// TODO: add new attributes to your VertexOut
-		// The attributes listed below might be useful, 
+		// The attributes listed below might be useful,
 		// but always feel free to modify on your own
 
 		 glm::vec3 eyePos;	// eye space position used for shading
@@ -46,7 +49,9 @@ namespace {
 		// glm::vec3 col;
 		 glm::vec2 texcoord0;
 		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
+		 int diffuseTexWidth;
+		 int diffuseTexHeight;
+		 int diffuseTexStride;
 		// ...
 	};
 
@@ -59,14 +64,15 @@ namespace {
 		glm::vec3 color;
 
 		// TODO: add new attributes to your Fragment
-		// The attributes listed below might be useful, 
+		// The attributes listed below might be useful,
 		// but always feel free to modify on your own
-
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
-		// ...
+		glm::vec3 eyePos;
+		glm::vec3 eyeNor;
+		glm::vec2 texcoord0;
+		TextureData* dev_diffuseTex = NULL;
+		int diffuseTexWidth;
+		int diffuseTexHeight;
+		int diffuseTexStride;
 	};
 
 	struct PrimitiveDevBufPointers {
@@ -86,6 +92,7 @@ namespace {
 		TextureData* dev_diffuseTex;
 		int diffuseTexWidth;
 		int diffuseTexHeight;
+		int diffuseTexStride;
 		// TextureData* dev_specularTex;
 		// TextureData* dev_normalTex;
 		// ...
@@ -108,13 +115,14 @@ static int totalNumPrimitives = 0;
 static Primitive *dev_primitives = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
+static unsigned int *dev_mutex = NULL;
 
 static int * dev_depth = NULL;	// you might need this buffer when doing depth test
 
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
  */
-__global__ 
+__global__
 void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
@@ -133,21 +141,51 @@ void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
     }
 }
 
-/** 
-* Writes fragment colors to the framebuffer
+/**
+* Writes fragment colors to the framebuffer, serve as a fragment shader
 */
 __global__
 void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
 
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
+    if (x >= w || y >= h) {
+		return;
+	}
 
-		// TODO: add your fragment shader code here
+	int index = x + (y * w);
+	glm::vec3 &buffer = framebuffer[index];
+	const Fragment &fragment = fragmentBuffer[index];
+	const glm::vec3 lightPos(50.f, 50.f, 100.f);
+
+	// clear frame buffer
+	buffer = glm::vec3(0.f);
+	// buffer = fragment.eyeNor; return;// display normal
+	// buffer = glm::vec3(fragment.texcoord0, 0.f); return;// display texcoord
+#if TRIANGLE
+	const glm::vec3 L = glm::normalize(lightPos - fragment.eyePos);
+	const glm::vec3 &N = fragment.eyeNor;
+	const glm::vec3 V = glm::normalize(-fragment.eyePos);
+	const glm::vec3 H = glm::normalize(V + L);
+	glm::vec3 baseColor(1.f);
+
+	if (fragment.dev_diffuseTex != NULL) {
+		// has color map
+		baseColor = getColorFromTextureAtCoordinate(
+				fragment.dev_diffuseTex, fragment.texcoord0, fragment.diffuseTexWidth,
+				fragment.diffuseTexHeight, fragment.diffuseTexStride);
+	}
 
-    }
+	// lambert
+	buffer = max(0.f, glm::dot(L, N)) * baseColor;
+
+	// Blinn-Phong
+	buffer += glm::vec3(pow(max(0.f, glm::dot(N, H)), 200.f));
+#elif LINE
+	buffer = fragment.color;
+#elif POINT
+	buffer = fragment.color;
+#endif
 }
 
 /**
@@ -162,7 +200,9 @@ void rasterizeInit(int w, int h) {
     cudaFree(dev_framebuffer);
     cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
     cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
-    
+	cudaMalloc(&dev_mutex, width * height * sizeof(unsigned int));
+	cudaMemset(dev_mutex, 0, width * height * sizeof(unsigned int));
+
 	cudaFree(dev_depth);
 	cudaMalloc(&dev_depth, width * height * sizeof(int));
 
@@ -187,9 +227,9 @@ void initDepth(int w, int h, int * depth)
 * kern function with support for stride to sometimes replace cudaMemcpy
 * One thread is responsible for copying one component
 */
-__global__ 
+__global__
 void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, int n, int byteStride, int byteOffset, int componentTypeByteSize) {
-	
+
 	// Attribute (vec3 position)
 	// component (3 * float)
 	// byte (4 * byte)
@@ -202,20 +242,20 @@ void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, in
 		int offset = i - count * n;	// which component of the attribute
 
 		for (int j = 0; j < componentTypeByteSize; j++) {
-			
-			dev_dst[count * componentTypeByteSize * n 
-				+ offset * componentTypeByteSize 
+
+			dev_dst[count * componentTypeByteSize * n
+				+ offset * componentTypeByteSize
 				+ j]
 
-				= 
+				=
 
-			dev_src[byteOffset 
-				+ count * (byteStride == 0 ? componentTypeByteSize * n : byteStride) 
-				+ offset * componentTypeByteSize 
+			dev_src[byteOffset
+				+ count * (byteStride == 0 ? componentTypeByteSize * n : byteStride)
+				+ offset * componentTypeByteSize
 				+ j];
 		}
 	}
-	
+
 
 }
 
@@ -235,7 +275,7 @@ void _nodeMatrixTransform(
 }
 
 glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) {
-	
+
 	glm::mat4 curMatrix(1.0);
 
 	const std::vector<double> &m = n.matrix;
@@ -280,7 +320,7 @@ void traverseNode (
 	const tinygltf::Scene & scene,
 	const std::string & nodeString,
 	const glm::mat4 & parentMatrix
-	) 
+	)
 {
 	const tinygltf::Node & n = scene.nodes.at(nodeString);
 	glm::mat4 M = parentMatrix * getMatrixFromNodeMatrixVector(n);
@@ -329,8 +369,8 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 
 
-	// 2. for each mesh: 
-	//		for each primitive: 
+	// 2. for each mesh:
+	//		for each primitive:
 	//			build device buffer of indices, materail, and each attributes
 	//			and store these pointers in a map
 	{
@@ -518,11 +558,12 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 					// ----------Materials-------------
 
-					// You can only worry about this part once you started to 
+					// You can only worry about this part once you started to
 					// implement textures for your rasterizer
 					TextureData* dev_diffuseTex = NULL;
 					int diffuseTexWidth = 0;
 					int diffuseTexHeight = 0;
+					int diffuseTexStride = 0;
 					if (!primitive.material.empty()) {
 						const tinygltf::Material &mat = scene.materials.at(primitive.material);
 						printf("material.name = %s\n", mat.name.c_str());
@@ -537,9 +578,10 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 									size_t s = image.image.size() * sizeof(TextureData);
 									cudaMalloc(&dev_diffuseTex, s);
 									cudaMemcpy(dev_diffuseTex, &image.image.at(0), s, cudaMemcpyHostToDevice);
-									
+
 									diffuseTexWidth = image.width;
 									diffuseTexHeight = image.height;
+									diffuseTexStride = image.component;
 
 									checkCUDAError("Set Texture Image data");
 								}
@@ -548,13 +590,13 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 						// TODO: write your code for other materails
 						// You may have to take a look at tinygltfloader
-						// You can also use the above code loading diffuse material as a start point 
+						// You can also use the above code loading diffuse material as a start point
 					}
 
 
 					// ---------Node hierarchy transform--------
 					cudaDeviceSynchronize();
-					
+
 					dim3 numBlocksNodeTransform((numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 					_nodeMatrixTransform << <numBlocksNodeTransform, numThreadsPerBlock >> > (
 						numVertices,
@@ -582,6 +624,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 						dev_diffuseTex,
 						diffuseTexWidth,
 						diffuseTexHeight,
+						diffuseTexStride,
 
 						dev_vertexOut	//VertexOut
 					});
@@ -595,20 +638,20 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 		} // for each node
 
 	}
-	
+
 
 	// 3. Malloc for dev_primitives
 	{
 		cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
 	}
-	
+
 
 	// Finally, cudaFree raw dev_bufferViews
 	{
 
 		std::map<std::string, BufferByte*>::const_iterator it(bufferViewDevPointers.begin());
 		std::map<std::string, BufferByte*>::const_iterator itEnd(bufferViewDevPointers.end());
-			
+
 			//bufferViewDevPointers
 
 		for (; it != itEnd; it++) {
@@ -623,62 +666,274 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 
 
-__global__ 
+__global__
 void _vertexTransformAndAssembly(
-	int numVertices, 
-	PrimitiveDevBufPointers primitive, 
-	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
-	int width, int height) {
-
+		int numVertices, PrimitiveDevBufPointers primitive, glm::mat4 MVP,
+		glm::mat4 MV, glm::mat3 MV_normal, int width, int height) {
 	// vertex id
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
-	if (vid < numVertices) {
-
-		// TODO: Apply vertex transformation here
-		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
-		// Then divide the pos by its w element to transform into NDC space
-		// Finally transform x and y to viewport space
+	if (vid >= numVertices) {
+		return;
+	}
 
-		// TODO: Apply vertex assembly here
-		// Assemble all attribute arraies into the primitive array
-		
+	VertexOut &out = primitive.dev_verticesOut[vid];
+	const glm::vec3 &pos = primitive.dev_position[vid];
+	const glm::vec3 &nor = primitive.dev_normal[vid];
+
+	// TODO: Apply vertex transformation here
+	// Multiply the MVP matrix for each vertex position, this will transform
+	// everything into clipping space
+	// Then divide the pos by its w element to transform into NDC space
+	// Finally transform x and y to viewport space
+	out.pos = MVP * glm::vec4(pos, 1.f);
+	out.pos /= out.pos.w;
+	out.pos.x = (1.f - out.pos.x) * .5f * width;
+	out.pos.y = (1.f - out.pos.y) * .5f * height;
+
+	// TODO: Apply vertex assembly here
+	// Assemble all attribute arraies into the primitive array
+	out.eyePos = multiplyMV(MV, glm::vec4(pos, 1.f));
+	out.eyeNor = glm::normalize(MV_normal * nor);
+
+	// texcoord
+	if (primitive.dev_texcoord0 != NULL) {
+		out.texcoord0 = primitive.dev_texcoord0[vid];
 	}
+
+	// diffuse texture data
+	out.dev_diffuseTex = primitive.dev_diffuseTex;
+	out.diffuseTexWidth = primitive.diffuseTexWidth;
+	out.diffuseTexHeight = primitive.diffuseTexHeight;
+	out.diffuseTexStride = primitive.diffuseTexStride;
 }
 
 
 
 static int curPrimitiveBeginId = 0;
 
-__global__ 
-void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_primitives, PrimitiveDevBufPointers primitive) {
-
+__global__
+void _primitiveAssembly(int numIndices, int curPrimitiveBeginId,
+			Primitive* dev_primitives, PrimitiveDevBufPointers primitive) {
 	// index id
 	int iid = (blockIdx.x * blockDim.x) + threadIdx.x;
 
-	if (iid < numIndices) {
+	if (iid >= numIndices) {
+		return;
+	}
 
-		// TODO: uncomment the following code for a start
-		// This is primitive assembly for triangles
+	// TODO: uncomment the following code for a start
+	// This is primitive assembly for triangles
+	int pid;	// id for cur primitives vector
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+	if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+		pid = iid / (int)primitive.primitiveType;
+		dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		dev_primitives[pid + curPrimitiveBeginId].primitiveType = Triangle;
+	} else if (primitive.primitiveMode == Line) {
+		// TODO: other primitive types (line)
 
+	} else if (primitive.primitiveMode == Point) {
+		// TODO: other primitive types (point)
 
-		// TODO: other primitive types (point, line)
 	}
-	
 }
 
+__global__
+void _rasterizePrimitive(int totalNumPrimitives, Primitive *dev_primitives,
+		Fragment *dev_fragmentBuffer, int *dev_depth, int width, int height,
+		unsigned int *dev_mutex) {
+	int idx = blockDim.x * blockIdx.x + threadIdx.x;
+
+	if (idx >= totalNumPrimitives) {
+		return;
+	}
+
+	const Primitive &primitive = dev_primitives[idx];
+	const glm::vec3 tri[3] = {
+		glm::vec3(primitive.v[0].pos),
+		glm::vec3(primitive.v[1].pos),
+		glm::vec3(primitive.v[2].pos),
+	};
+	const glm::vec3 eyePos[3] = {
+		primitive.v[0].eyePos,
+		primitive.v[1].eyePos,
+		primitive.v[2].eyePos
+	};
+	const glm::vec3 eyeNor[3] = {
+		primitive.v[0].eyeNor,
+		primitive.v[1].eyeNor,
+		primitive.v[2].eyeNor
+	};
+	const glm::vec2 texcoord0[3] = {
+		primitive.v[0].texcoord0,
+		primitive.v[1].texcoord0,
+		primitive.v[2].texcoord0
+	};
+	const float triDepth_1[3] = {
+		1.f / tri[0].z,
+		1.f / tri[1].z,
+		1.f / tri[2].z
+	};
 
+	if (primitive.primitiveType == Triangle) {
+		TextureData *pDiffuseTexData = primitive.v[0].dev_diffuseTex;
+		int diffuseTexWidth = 0;
+		int diffuseTexHeight = 0;
+		int diffuseTexStride = 0;
+
+		if (pDiffuseTexData != NULL) {
+			diffuseTexWidth = primitive.v[0].diffuseTexWidth;
+			diffuseTexHeight = primitive.v[0].diffuseTexHeight;
+			diffuseTexStride = primitive.v[0].diffuseTexStride;
+		}
+
+		// back facing triangle
+		if (calculateSignedArea(tri) >= 0.f) return;
+
+		const AABB aabb = getAABBForTriangle(tri);
+		const int left = max(0, (int)aabb.min.x - 1);
+		const int right = min(width, (int)aabb.max.x + 1);
+		const int bottom = max(0, (int)aabb.min.y - 1);
+		const int top = min(height, (int)aabb.max.y + 1);
+
+		// outsides screen
+		if (left >= right || bottom >= top) return;
+
+		for (int i = left; i < right; ++i) {
+			for (int j = bottom; j < top; ++j) {
+				const int pixelId = i + j * width;
+				const glm::vec2 p(i + .5f, j + .5f);
+				const glm::vec3 baryCoord = calculateBarycentricCoordinate(tri, p);
+
+				// outsides triangle
+				if (!isBarycentricCoordInBounds(baryCoord)) continue;
+
+				const float z = getZAtCoordinate(baryCoord, tri);
+				// too far or too near
+				if (z < 0.f || z > 1.f) continue;
+
+				// depth test, account for race condition when accessing depth buffer
+				const int depth = (int)(z * INT_MAX);
+				bool isOccluded = true;
+#ifndef MUTEX
+				bool isSet = false;
+				while (!isSet) {
+					isSet = atomicCAS(&dev_mutex[pixelId], 0, 1) == 0;
+					if (isSet) {
+						if (dev_depth[pixelId] > depth) {
+							dev_depth[pixelId] = depth;
+							isOccluded = false;
+						}
+						dev_mutex[pixelId] = 0;
+					}
+				}
+#else
+				if (dev_depth[pixelId] > depth) {
+					dev_depth[pixelId] = depth;
+					isOccluded = false;
+				}
+#endif
+				// occluded
+				if (isOccluded) continue;
+
+				Fragment &fragment = dev_fragmentBuffer[pixelId];
+				// assemble fragment attributes
+				fragment.eyePos = getVec3AtCoordinate(baryCoord, eyePos);
+				fragment.eyeNor = glm::normalize(getVec3AtCoordinate(baryCoord, eyeNor));
+				// fragment.texcoord0 = getVec2AtCoordinate(baryCoord, texcoord0);
+				// calculate perspective corrected texcoord
+				fragment.texcoord0 = getPerspectiveCorrectedTexcoordAtCoordinate(
+						baryCoord, texcoord0, triDepth_1);
+
+				if (pDiffuseTexData != NULL) {
+					// diffuse texture provided
+					fragment.dev_diffuseTex = pDiffuseTexData;
+					fragment.diffuseTexWidth = diffuseTexWidth;
+					fragment.diffuseTexHeight = diffuseTexHeight;
+					fragment.diffuseTexStride = diffuseTexStride;
+				}
+			}
+		}
+	}
+}
+
+__host__ __device__
+void _rasterizeLineHelper(const glm::vec3 &a, const glm::vec3 &b,
+		Fragment *dev_fragmentBuffer, int w, int h) {
+	const int left = max(0, (int)min(a.x, b.x));
+	const int right = min(w, (int)max(a.x, b.x) + 1);
+	const int bottom = max(0, (int)min(a.y, b.y));
+	const int top = min(h, (int)max(a.y, b.y) + 1);
+
+	// outsides window
+	if (left >= right && bottom >= top) return;
+
+	int begin = right - left >= top - bottom ? left : bottom;
+	int end = right - left >= top - bottom ? right : top;
+
+	for (int i = 0; i <= end - begin; ++i) {
+		const glm::vec3 p = getVec3AtU((i + 0.f) / (end - begin), a, b);
+		const int index = (int)p.x + (int)p.y * w;
+
+		dev_fragmentBuffer[index].color = glm::vec3(1.f);
+	}
+}
+
+__global__
+void _rasterizeLine(int totalNumPrimitives, Primitive *dev_primitives,
+		Fragment *dev_fragmentBuffer, int w, int h) {
+	int idx = blockDim.x * blockIdx.x + threadIdx.x;
+
+	if (idx >= totalNumPrimitives) {
+		return;
+	}
+
+	const Primitive &primitive = dev_primitives[idx];
+	const glm::vec3 &v0 = glm::vec3(primitive.v[0].pos);
+	const glm::vec3 &v1 = glm::vec3(primitive.v[1].pos);
+	const glm::vec3 &v2 = glm::vec3(primitive.v[2].pos);
+
+	if (primitive.primitiveType == Triangle) {
+		_rasterizeLineHelper(v0, v1, dev_fragmentBuffer, w, h);
+		_rasterizeLineHelper(v1, v2, dev_fragmentBuffer, w, h);
+		_rasterizeLineHelper(v2, v0, dev_fragmentBuffer, w, h);
+	}
+}
+
+__global__
+void _rasterizePoint(int totalNumPrimitives, Primitive *dev_primitives,
+		Fragment *dev_fragmentBuffer, int w, int h) {
+	int idx = blockDim.x * blockIdx.x + threadIdx.x;
+
+	if (idx >= totalNumPrimitives) {
+		return;
+	}
+
+	const Primitive &primitive = dev_primitives[idx];
+	const glm::vec3 tri[3] = {
+			glm::vec3(primitive.v[0].pos),
+			glm::vec3(primitive.v[1].pos),
+			glm::vec3(primitive.v[2].pos)
+	};
+
+	if (primitive.primitiveType == Triangle) {
+		for (int i = 0; i < 3; ++i) {
+			const int x = (int)tri[i].x;
+			const int y = (int)tri[i].y;
+			const int index = x + y * w;
+
+			if (x < 0 || x >= w || y < 0 || y >= h) continue;
+			dev_fragmentBuffer[index].color = glm::vec3(1.f);
+		}
+	}
+}
 
 /**
  * Perform rasterization.
  */
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV,
+		const glm::mat3 MV_normal) {
     int sideLength2d = 8;
     dim3 blockSize2d(sideLength2d, sideLength2d);
     dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
@@ -699,17 +954,17 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 			auto p = (it->second).begin();	// each primitive
 			auto pEnd = (it->second).end();
 			for (; p != pEnd; ++p) {
-				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1)
+				 		/ numThreadsPerBlock.x);
+				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1)
+						/ numThreadsPerBlock.x);
 
-				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+				_vertexTransformAndAssembly<<<numBlocksForVertices, numThreadsPerBlock>>>(
+						p->numVertices, *p, MVP, MV, MV_normal, width, height);
 				checkCUDAError("Vertex Processing");
 				cudaDeviceSynchronize();
-				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
-					(p->numIndices, 
-					curPrimitiveBeginId, 
-					dev_primitives, 
-					*p);
+				_primitiveAssembly<<<numBlocksForIndices, numThreadsPerBlock>>>(
+						p->numIndices, curPrimitiveBeginId, dev_primitives, *p);
 				checkCUDAError("Primitive Assembly");
 
 				curPrimitiveBeginId += p->numPrimitives;
@@ -718,19 +973,35 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
-	
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
-	// TODO: rasterize
-
 
+	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+	initDepth<<<blockCount2d, blockSize2d>>>(width, height, dev_depth);
 
+	// TODO: rasterize
+	dim3 numThreadsPerBlock(128);
+	dim3 numBlocksForPrimitives((totalNumPrimitives + numThreadsPerBlock.x - 1)
+			/ numThreadsPerBlock.x);
+#if TRIANGLE
+	_rasterizePrimitive<<<numBlocksForPrimitives, numThreadsPerBlock>>>(
+			totalNumPrimitives, dev_primitives, dev_fragmentBuffer, dev_depth,
+			width, height, dev_mutex);
+#elif LINE
+	_rasterizeLine<<<numBlocksForPrimitives, numThreadsPerBlock>>>(
+			totalNumPrimitives, dev_primitives, dev_fragmentBuffer,
+			width, height);
+#elif POINT
+	_rasterizePoint<<<numBlocksForPrimitives, numThreadsPerBlock>>>(
+			totalNumPrimitives, dev_primitives, dev_fragmentBuffer,
+			width, height);
+#endif
+	checkCUDAError("rasterize primitive");
     // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
+	render<<<blockCount2d, blockSize2d>>>(width, height, dev_fragmentBuffer,
+			dev_framebuffer);
 	checkCUDAError("fragment shader");
     // Copy framebuffer into OpenGL buffer for OpenGL previewing
-    sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
+    sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height,
+			dev_framebuffer);
     checkCUDAError("copy render result to pbo");
 }
 
@@ -750,10 +1021,8 @@ void rasterizeFree() {
 			cudaFree(p->dev_normal);
 			cudaFree(p->dev_texcoord0);
 			cudaFree(p->dev_diffuseTex);
-
 			cudaFree(p->dev_verticesOut);
 
-			
 			//TODO: release other attributes and materials
 		}
 	}
@@ -772,5 +1041,8 @@ void rasterizeFree() {
 	cudaFree(dev_depth);
 	dev_depth = NULL;
 
+	cudaFree(dev_mutex);;
+	dev_mutex = NULL;
+
     checkCUDAError("rasterize Free");
 }
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index 46c701e..b69bd8b 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -83,9 +83,9 @@ glm::vec3 calculateBarycentricCoordinate(const glm::vec3 tri[3], glm::vec2 point
  */
 __host__ __device__ static
 bool isBarycentricCoordInBounds(const glm::vec3 barycentricCoord) {
-    return barycentricCoord.x >= 0.0 && barycentricCoord.x <= 1.0 &&
-           barycentricCoord.y >= 0.0 && barycentricCoord.y <= 1.0 &&
-           barycentricCoord.z >= 0.0 && barycentricCoord.z <= 1.0;
+    return barycentricCoord.x >= 0.f && barycentricCoord.x <= 1.f &&
+           barycentricCoord.y >= 0.f && barycentricCoord.y <= 1.f &&
+           barycentricCoord.z >= 0.f && barycentricCoord.z <= 1.f;
 }
 
 // CHECKITOUT
@@ -95,7 +95,128 @@ bool isBarycentricCoordInBounds(const glm::vec3 barycentricCoord) {
  */
 __host__ __device__ static
 float getZAtCoordinate(const glm::vec3 barycentricCoord, const glm::vec3 tri[3]) {
-    return -(barycentricCoord.x * tri[0].z
+    return barycentricCoord.x * tri[0].z
            + barycentricCoord.y * tri[1].z
-           + barycentricCoord.z * tri[2].z);
+           + barycentricCoord.z * tri[2].z;
+}
+
+ /**
+  * For a given barycentric coordinate, compute the corresponding vec3
+  * on the triangle.
+  */
+  __host__ __device__ static
+  glm::vec3 getVec3AtCoordinate(const glm::vec3 barycentricCoord, const glm::vec3 input[3]) {
+ 	 return barycentricCoord.x * input[0]
+ 	 		+ barycentricCoord.y * input[1]
+ 			+ barycentricCoord.z * input[2];
+  }
+
+  /**
+   * For a given barycentric coordinate, compute a float on the triangle.
+  */
+  __host__ __device__ static
+  float getFloatAtCoordinate(const glm::vec3 barycentricCoord, const float input[3]) {
+  	return barycentricCoord.x * input[0]
+  		   + barycentricCoord.y * input[1]
+  		   + barycentricCoord.z * input[2];
+  }
+
+/**
+ * For a given barycentric coordinate, compute the corresponding vec2
+ * on the triangle.
+ */
+__host__ __device__ static
+glm::vec2 getVec2AtCoordinate(const glm::vec3 &baryCoord, const glm::vec2 input[3]) {
+	 return baryCoord.x * input[0]
+	 		+ baryCoord.y * input[1]
+			+ baryCoord.z * input[2];
+}
+
+/**
+ * For a given texture data pointer, compute color at spcified texcoord.
+*/
+__host__ __device__ static
+glm::vec3 getColorFromTextureAtCoordinate(const unsigned char *pTex,
+		const glm::vec2 texcoord, int w, int h, int stride) {
+	const int x = (int)((w - 1.f) * texcoord.x + .5f);
+	const int y = (int)((h - 1.f) * texcoord.y + .5f);
+	const float scale = 1.f / 255.f;
+	const int index = x + y * w;
+
+	return scale * glm::vec3(pTex[index * stride],
+			pTex[index * stride + 1],
+			pTex[index * stride + 2]);
+}
+
+/**
+ * For a given texture data pointer, compute color at spcified texcoord.
+*/
+__host__ __device__ static
+glm::vec3 getColorFromTextureAtCoordinateBilinear(const unsigned char *pTex,
+		const glm::vec2 texcoord, int w, int h, int stride) {
+	const float scale = 1.f / 255.f;
+	const float x = (w - 1.f) * texcoord.x;
+	const float y = (h - 1.f) * texcoord.y;
+	const int xi = (int)x;
+	const int yi = (int)y;
+	const float ux = x - xi;
+	const float uy = y - yi;
+	glm::vec3 c00(0.f);
+	glm::vec3 c01(0.f);
+	glm::vec3 c10(0.f);
+	glm::vec3 c11(0.f);
+
+	{
+		const int index = xi + yi * w;
+		c00 = glm::vec3(pTex[index * stride],
+				pTex[index * stride + 1],
+				pTex[index * stride + 2]);
+	}
+	if (yi < h - 1) {
+		const int index = xi + (yi + 1) * w;
+		c01 = glm::vec3(pTex[index * stride],
+				pTex[index * stride + 1],
+				pTex[index * stride + 2]);
+	}
+	if (xi < w - 1) {
+		const int index = (xi + 1) + yi * w;
+		c10 = glm::vec3(pTex[index * stride],
+				pTex[index * stride + 1],
+				pTex[index * stride + 2]);
+	}
+	if (yi < h - 1 && xi < w - 1) {
+		const int index = (xi + 1) + (yi + 1) * w;
+		c11 = glm::vec3(pTex[index * stride],
+				pTex[index * stride + 1],
+				pTex[index * stride + 2]);
+	}
+
+	return scale * ((1.f - ux) * ((1.f - uy) * c00 + uy * c01)
+			+ ux * ((1.f - uy) * c10 + uy * c11));
+}
+
+/**
+ * For a given barycentric coordinate, compute the corresponding perspective
+ * corrected texcoord on the triangle.
+ */
+__host__ __device__ static
+glm::vec2 getPerspectiveCorrectedTexcoordAtCoordinate(const glm::vec3 &baryCoord,
+		const glm::vec2 _texcoord[3], const float triDepth_1[3]) {
+	const glm::vec2 texcoord[3] = {
+		_texcoord[0] * triDepth_1[0],
+		_texcoord[1] * triDepth_1[1],
+		_texcoord[2] * triDepth_1[2]
+	};
+	const glm::vec2 numerator = getVec2AtCoordinate(baryCoord, texcoord);
+	const float denomenator = getFloatAtCoordinate(baryCoord, triDepth_1);
+
+	return numerator / denomenator;
+}
+
+/**
+ * Linear interpolate between 2 points.
+ */
+__host__ __device__ static
+glm::vec3 getVec3AtU(const float u, const glm::vec3 &a, const glm::vec3 &b) {
+	return (1.f - u) * a + u * b;
 }