diff --git a/README.md b/README.md index cad1abd..6210101 100644 --- a/README.md +++ b/README.md @@ -1,20 +1,116 @@ CUDA Rasterizer =============== -[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md) - **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Austin Eng +* Tested on: Windows 10, i7-4770K @ 3.50GHz 16GB, GTX 780 3072MB (Personal Computer) + + + + + + + + + + + + + + + + + + + + +
*Diffuse color | Testing texture coordinates | Incorrect texture reading | Fixed textures*
+ +## Tiled Rasterization + +Tiled rasterization significantly improves performance for scenes with very large triangles close to the camera. In the simple rasterizer, where there is one thread per primitive, a primitive that covers the entire screen forces its thread to walk every pixel on the screen. The problem compounds when there are multiple overlapping large triangles. This could be improved by subdividing triangles into smaller pieces so that multiple threads can handle one large triangle. To accomplish this, we divide the screen into multiple tiles so that no one thread has to take care of an area greater than `TILE_SIZE * TILE_SIZE`. + +In implementing this, because there are many more blocks, it became much more likely for a tile's primitives to be split across multiple blocks. As a result, the atomicMin solution for depth testing can no longer be used, because another block may have changed depth values between the `atomicMin` and the `depth == depths[i]` check immediately after. + +I instead used a separate mutex for each pixel. This fixed most of the race conditions, but I still ran into problems at very close distances that I could not track down. + +![](renders/tiled_race.png) + +*Race condition with tiled rasterization* + + +## Perspective Correct Interpolation + + + + +Interpolation that does not take perspective distortion into account results in distorted texture mapping. This is because perspective projection is not linear with respect to depth: attributes that vary linearly in eye space do not vary linearly in screen space. Instead, attributes divided by depth (`a/z`) and `1/z` itself do interpolate linearly in screen space, so we interpolate those and divide at the end. Without this correction, projected textures appear skewed. + + +## Bilinear Texture Filtering + +![](renders/bilinear_breakdown.png) + +Interpolated texture coordinates are rarely exactly aligned with texels. Texture artifacts can occur when the interpolated texel jumps to an adjacent one. To avoid this problem, we grab the four texels neighboring our interpolated texture coordinate and bilinearly interpolate between them.
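In host code, the four-texel blend described above can be sketched as follows (illustrative names only; this mirrors the kernel's RGB8-to-float `getRGB` helper and its edge clamping, but it is not the actual device code):

```cpp
#include <algorithm>
#include <cstdint>

struct RGB { float r, g, b; };

// Mirrors the kernel's getRGB: row-major RGB8 texture -> floats in [0, 1].
static RGB texel(const uint8_t* tex, int width, int x, int y) {
    int idx = 3 * (x + y * width);
    return { tex[idx] / 255.f, tex[idx + 1] / 255.f, tex[idx + 2] / 255.f };
}

static RGB lerp(RGB a, RGB b, float t) {
    return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t, a.b + (b.b - a.b) * t };
}

RGB sampleBilinear(const uint8_t* tex, int width, int height, float u, float v) {
    float tx = u * width, ty = v * height;
    int x0 = std::min((int)tx, width - 1);   // texel containing the sample
    int y0 = std::min((int)ty, height - 1);
    int x1 = std::min(x0 + 1, width - 1);    // clamp neighbors at the border
    int y1 = std::min(y0 + 1, height - 1);
    float fx = tx - x0, fy = ty - y0;        // fractional offsets in [0, 1)
    // Blend the four neighboring texels: horizontally, then vertically.
    RGB bottom = lerp(texel(tex, width, x0, y0), texel(tex, width, x1, y0), fx);
    RGB top    = lerp(texel(tex, width, x0, y1), texel(tex, width, x1, y1), fx);
    return lerp(bottom, top, fy);
}
```
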
This results in a much smoother image. + + +## Analysis + +![](renders/zoom.gif) + +*Performance tests were done using the above render. Unless stated otherwise, reported results are the average over all frames or a breakdown of per-frame data.* + + +### Tiled Rasterization + +![](profiling/tile_notile_comparison.png) + +*Comparison of untiled and different degrees of tiled rasterization* + +Tiled rasterization uses only three additional registers and yields significant performance improvements at close distances. With untiled rasterization, moving the camera closer to the object seems to increase the execution time of the rasterization kernel exponentially. Using tiles results in near-linear performance because the amount of time a thread can take is bounded by the size of the tile. + +![](profiling/tile_comparison.png) + +Comparing the performance of different tile sizes is interesting. Tiles that are too large seem to result in oscillatory performance behavior. This may just be due to how much the edges of the truck overlap the tiles; tiles with heavy overlap will probably take longer than those with little. + +### Constant Memory Camera Matrices + +![](profiling/constant_memory.png) + +Storing camera matrices in constant memory made barely any improvement in kernel execution time even though it drastically reduced register usage and improved occupancy. This may be because the vertex transform kernel is already relatively fast, taking just 13 microseconds. More on this later, but this cost may be cheap enough that it is worth doing additional vertex transform computation when multiple kinds of vertex data are needed (which would otherwise require multiple reads). + + +### Bilinear Interpolation + +![](profiling/bilinear.png) + +Even though adding bilinear interpolation nearly doubled the number of registers used (26 to 48), there was surprisingly no decrease in occupancy.
As a result, bilinear interpolation suffered only a small performance hit of about 40 microseconds. This is surprising because the kernel needs to make four global memory reads for the texture instead of just one. Perhaps this latency is hidden by the extra computation to blend the colors. One possible improvement would be to use shared memory to speed up the four texture reads, because fragments with adjacent UV coordinates look up the same texture values. + + +### Perspective Correct Interpolation + +![](profiling/perspective_correct.png) + +Perspective correct interpolation doubled the execution time of the interpolation kernel while using barely any additional resources (three registers). Because it adds just a few extra arithmetic operations, the additional latency probably comes from the extra global memory read of the primitive's eye-space position. If that is the case, there may be performance to gain by not storing transformed vertex positions and instead recomputing them in the kernel from matrices stored in constant memory. + + +### Optimized Barycentric Interpolation + +![](profiling/optim_bary.png) + +The optimized barycentric interpolation computes the next pixel's barycentric coordinates from the previous coordinates with three add operations, significantly fewer than the typical 5 adds and 2 multiplies. However, this comes at the cost of five additional registers and decreased occupancy. The graph shows that spending the extra resources to reduce computation is well worth it in this situation. Clever ways to decrease the number of registers may help even more. + + +### Separated Rasterization and Interpolation -### (TODO: Your README) +![](profiling/separate.png) -*DO NOT* leave the README to the last minute! It is a crucial part of the -project, and we will not be able to grade you without a good README.
+In my initial implementation of the project, I naively interpolated and copied primitive attributes to the fragment right after doing the depth test. This means that any time a thread won the depth test, all other threads had to wait for it to interpolate the data and write to memory. Separating the decision of which primitive wins a fragment from the actual interpolation and data copy yielded significant performance benefits. Even though the interpolation is very fast (on average just 0.09ms, not even visible on the graph), doing it during triangle projection kills performance. Separating them yields nearly a 10x speedup. ### Credits * [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@syoyo](https://github.com/syoyo) * [glTF Sample Models](https://github.com/KhronosGroup/glTF/blob/master/sampleModels/README.md) +* [Optimized barycentric interpolation](https://fgiesen.wordpress.com/2013/02/10/optimizing-the-basic-rasterizer/) diff --git a/profiling/bilinear.png b/profiling/bilinear.png new file mode 100644 index 0000000..941f406 Binary files /dev/null and b/profiling/bilinear.png differ diff --git a/profiling/constant_memory.png b/profiling/constant_memory.png new file mode 100644 index 0000000..3df2fee Binary files /dev/null and b/profiling/constant_memory.png differ diff --git a/profiling/optim_bary.png b/profiling/optim_bary.png new file mode 100644 index 0000000..31a185b Binary files /dev/null and b/profiling/optim_bary.png differ diff --git a/profiling/perspective_correct.png b/profiling/perspective_correct.png new file mode 100644 index 0000000..203b7b0 Binary files /dev/null and b/profiling/perspective_correct.png differ diff --git a/profiling/separate.png b/profiling/separate.png new file mode 100644 index 0000000..f90e8a5 Binary files /dev/null and b/profiling/separate.png differ diff --git a/profiling/tile_comparison.png b/profiling/tile_comparison.png new file mode 100644 index 0000000..4f89623 Binary files
/dev/null and b/profiling/tile_comparison.png differ diff --git a/profiling/tile_notile_comparison.png b/profiling/tile_notile_comparison.png new file mode 100644 index 0000000..2b71589 Binary files /dev/null and b/profiling/tile_notile_comparison.png differ diff --git a/renders/bilinear_breakdown.png b/renders/bilinear_breakdown.png new file mode 100644 index 0000000..ad71181 Binary files /dev/null and b/renders/bilinear_breakdown.png differ diff --git a/renders/blinerp.png b/renders/blinerp.png new file mode 100644 index 0000000..97f5c33 Binary files /dev/null and b/renders/blinerp.png differ diff --git a/renders/cow_turntable.gif b/renders/cow_turntable.gif new file mode 100644 index 0000000..df3a375 Binary files /dev/null and b/renders/cow_turntable.gif differ diff --git a/renders/duck_turntable.gif b/renders/duck_turntable.gif new file mode 100644 index 0000000..0c8491c Binary files /dev/null and b/renders/duck_turntable.gif differ diff --git a/renders/engine_turntable.gif b/renders/engine_turntable.gif new file mode 100644 index 0000000..03d50da Binary files /dev/null and b/renders/engine_turntable.gif differ diff --git a/renders/fixed_textures.png b/renders/fixed_textures.png new file mode 100644 index 0000000..112a20e Binary files /dev/null and b/renders/fixed_textures.png differ diff --git a/renders/milk_truck.png b/renders/milk_truck.png new file mode 100644 index 0000000..4a53480 Binary files /dev/null and b/renders/milk_truck.png differ diff --git a/renders/noblinerp.png b/renders/noblinerp.png new file mode 100644 index 0000000..fb54a2d Binary files /dev/null and b/renders/noblinerp.png differ diff --git a/renders/perspective_correct.png b/renders/perspective_correct.png new file mode 100644 index 0000000..551258b Binary files /dev/null and b/renders/perspective_correct.png differ diff --git a/renders/perspective_incorrect.png b/renders/perspective_incorrect.png new file mode 100644 index 0000000..97858f1 Binary files /dev/null and 
b/renders/perspective_incorrect.png differ diff --git a/renders/texturing_wrong.png b/renders/texturing_wrong.png new file mode 100644 index 0000000..44faec0 Binary files /dev/null and b/renders/texturing_wrong.png differ diff --git a/renders/tiled_race.png b/renders/tiled_race.png new file mode 100644 index 0000000..b0ee247 Binary files /dev/null and b/renders/tiled_race.png differ diff --git a/renders/truck_turntable.gif b/renders/truck_turntable.gif new file mode 100644 index 0000000..b3c10a3 Binary files /dev/null and b/renders/truck_turntable.gif differ diff --git a/renders/uv_coords.png b/renders/uv_coords.png new file mode 100644 index 0000000..1f056f2 Binary files /dev/null and b/renders/uv_coords.png differ diff --git a/renders/zoom.gif b/renders/zoom.gif new file mode 100644 index 0000000..0e9853f Binary files /dev/null and b/renders/zoom.gif differ diff --git a/src/rasterize.cu b/src/rasterize.cu index 1262a09..1e0ac52 100644 --- a/src/rasterize.cu +++ b/src/rasterize.cu @@ -18,6 +18,14 @@ #include #include +#define PERSPECTIVE_CORRECT +#define BILIN_INTERP +#define CONSTANT_MEM +#define SEPARATE_INTERP +#define OPTIM_BARY // https://fgiesen.wordpress.com/2013/02/10/optimizing-the-basic-rasterizer/ +//#define RASTERIZE_BY_PIXEL +#define TILE_SIZE 32 + namespace { typedef unsigned short VertexIndex; @@ -45,7 +53,7 @@ namespace { glm::vec3 eyeNor; // eye space normal used for shading, cuz normal will go wrong after perspective transformation // glm::vec3 col; glm::vec2 texcoord0; - TextureData* dev_diffuseTex = NULL; + // TextureData* dev_diffuseTex = NULL; // int texWidth, texHeight; // ... 
}; @@ -53,6 +61,8 @@ namespace { struct Primitive { PrimitiveType primitiveType = Triangle; // C++ 11 init VertexOut v[3]; + TextureData* dev_diffuseTex; + int texWidth, texHeight; }; struct Fragment { @@ -62,10 +72,13 @@ namespace { // The attributes listed below might be useful, // but always feel free to modify on your own - // glm::vec3 eyePos; // eye space position used for shading - // glm::vec3 eyeNor; - // VertexAttributeTexcoord texcoord0; - // TextureData* dev_diffuseTex; + glm::vec3 eyePos; // eye space position used for shading + glm::vec3 eyeNor; + VertexAttributeTexcoord texcoord0; + TextureData* dev_diffuseTex; + int texWidth, texHeight; + + Primitive* prim; // ... }; @@ -84,8 +97,7 @@ namespace { // Materials, add more attributes when needed TextureData* dev_diffuseTex; - int diffuseTexWidth; - int diffuseTexHeight; + int texWidth, texHeight; // TextureData* dev_specularTex; // TextureData* dev_normalTex; // ... @@ -106,10 +118,20 @@ static int height = 0; static int totalNumPrimitives = 0; static Primitive *dev_primitives = NULL; +static Primitive* *dev_primitiveBuffer = NULL; static Fragment *dev_fragmentBuffer = NULL; static glm::vec3 *dev_framebuffer = NULL; static int * dev_depth = NULL; // you might need this buffer when doing depth test +static unsigned int * dev_mutex = NULL; + +#ifdef CONSTANT_MEM +__constant__ float c_MVP[sizeof(glm::mat4) / sizeof(float)]; +__constant__ float c_MV[sizeof(glm::mat4) / sizeof(float)]; +__constant__ float c_MV_normal[sizeof(glm::mat3) / sizeof(float)]; +__constant__ int c_width[1]; +__constant__ int c_height[1]; +#endif /** * Kernel that writes the image to the OpenGL PBO directly. 
@@ -133,19 +155,80 @@ void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) { } } +// get RGB value at index as floats in [0, 1] +__host__ __device__ static +glm::vec3 getRGB(TextureData* tex, int width, int x, int y) +{ + int idx = 3 * (x + y * width); + return glm::vec3( + (float)(tex[idx]) / 255.f, + (float)(tex[idx + 1]) / 255.f, + (float)(tex[idx + 2]) / 255.f + ); +} + /** * Writes fragment colors to the framebuffer */ __global__ +#ifdef CONSTANT_MEM +void render(Fragment *fragmentBuffer, glm::vec3 *framebuffer) { +#else void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) { +#endif int x = (blockIdx.x * blockDim.x) + threadIdx.x; int y = (blockIdx.y * blockDim.y) + threadIdx.y; +#ifdef CONSTANT_MEM + int index = x + (y * c_width[0]); +#else int index = x + (y * w); - - if (x < w && y < h) { - framebuffer[index] = fragmentBuffer[index].color; - - // TODO: add your fragment shader code here +#endif + + const glm::vec3 lightVec = glm::normalize(glm::vec3(1, 3, 5)); + glm::vec3 col(0.f); // default to black so empty fragments shade to zero +#ifdef CONSTANT_MEM + if (x < c_width[0] && y < c_height[0] && x > 0 && y > 0) { +#else + if (x < w && y < h && x > 0 && y > 0) { +#endif + // TODO: add your fragment shader code here + if (fragmentBuffer[index].eyePos.z != 0) { + if (fragmentBuffer[index].dev_diffuseTex != NULL) { + + TextureData* tex = fragmentBuffer[index].dev_diffuseTex; + // read texture dimensions inside the bounds check to avoid out-of-range accesses + int twidth = fragmentBuffer[index].texWidth; + int theight = fragmentBuffer[index].texHeight; +#ifdef BILIN_INTERP + float tx = fragmentBuffer[index].texcoord0.x * twidth; + float ty = fragmentBuffer[index].texcoord0.y * theight; + int x_low = tx; + int x_high = tx + 1; + int y_low = ty; + int y_high = ty + 1; + int x_clamp = __min(x_high, twidth - 1); + int y_clamp = __min(y_high, theight - 1); + + col = + (y_high - ty) * ( + (x_high - tx) * getRGB(tex, twidth, x_low, y_low) + + (tx - x_low) * getRGB(tex, twidth, x_clamp, y_low) + ) + + (ty - y_low) * ( + (x_high - tx) * getRGB(tex, twidth, x_low, y_clamp) + + (tx
- x_low) * getRGB(tex, twidth, x_clamp, y_clamp) + ); +#else + int tx = fragmentBuffer[index].texcoord0.x * twidth; + int ty = fragmentBuffer[index].texcoord0.y * theight; + col = getRGB(tex, twidth, tx, ty); +#endif + } + else { + col = fragmentBuffer[index].eyeNor; + } + } + framebuffer[index] = glm::dot(lightVec, fragmentBuffer[index].eyeNor) * col; } } @@ -156,16 +239,30 @@ void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) { void rasterizeInit(int w, int h) { width = w; height = h; + +#ifdef CONSTANT_MEM + cudaMemcpyToSymbol(c_width, &width, sizeof(int)); + cudaMemcpyToSymbol(c_height, &height, sizeof(int)); +#endif + cudaFree(dev_fragmentBuffer); cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment)); cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment)); cudaFree(dev_framebuffer); cudaMalloc(&dev_framebuffer, width * height * sizeof(glm::vec3)); cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3)); + + cudaFree(dev_primitiveBuffer); + cudaMalloc(&dev_primitiveBuffer, width * height * sizeof(Primitive*)); + cudaMemset(dev_primitiveBuffer, 0, width * height * sizeof(Primitive*)); cudaFree(dev_depth); cudaMalloc(&dev_depth, width * height * sizeof(int)); + cudaFree(dev_mutex); + cudaMalloc(&dev_mutex, width * height * sizeof(unsigned int)); + cudaMemset(dev_mutex, 0, width * height * sizeof(unsigned int)); + checkCUDAError("rasterizeInit"); } @@ -365,7 +462,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) { const tinygltf::Mesh & mesh = scene.meshes.at(*itMeshName); auto res = mesh2PrimitivesMap.insert(std::pair>(mesh.name, std::vector())); - std::vector & primitiveVector = (res.first)->second; + std::vector & primitiveVector = (res.first)->second; // for each primitive for (size_t i = 0; i < mesh.primitives.size(); i++) { @@ -379,6 +476,8 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) { VertexAttributePosition* dev_position = NULL; VertexAttributeNormal* dev_normal = 
NULL; VertexAttributeTexcoord* dev_texcoord0 = NULL; + int texWidth; + int texHeight; // ----------Indices------------- @@ -521,8 +620,6 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) { // You can only worry about this part once you started to // implement textures for your rasterizer TextureData* dev_diffuseTex = NULL; - int diffuseTexWidth = 0; - int diffuseTexHeight = 0; if (!primitive.material.empty()) { const tinygltf::Material &mat = scene.materials.at(primitive.material); printf("material.name = %s\n", mat.name.c_str()); @@ -533,13 +630,14 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) { const tinygltf::Texture &tex = scene.textures.at(diffuseTexName); if (scene.images.find(tex.source) != scene.images.end()) { const tinygltf::Image &image = scene.images.at(tex.source); - + size_t s = image.image.size() * sizeof(TextureData); cudaMalloc(&dev_diffuseTex, s); cudaMemcpy(dev_diffuseTex, &image.image.at(0), s, cudaMemcpyHostToDevice); - diffuseTexWidth = image.width; - diffuseTexHeight = image.height; + // store the image size for the primitive's buffer pointers + texWidth = image.width; + texHeight = image.height; checkCUDAError("Set Texture Image data"); } @@ -580,8 +678,8 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) { dev_texcoord0, dev_diffuseTex, - diffuseTexWidth, - diffuseTexHeight, + texWidth, + texHeight, dev_vertexOut //VertexOut }); @@ -622,7 +720,39 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) { } +#ifdef CONSTANT_MEM +__global__ +void _vertexTransformAndAssembly( +int numVertices, +PrimitiveDevBufPointers primitive) { + + // vertex id + int vid = (blockIdx.x * blockDim.x) + threadIdx.x; + if (vid < numVertices) { + + // TODO: Apply vertex transformation here + // Multiply the MVP matrix for each vertex position, this will transform everything into clipping space + // Then divide the pos by its w element to transform into NDC space + // Finally transform x and y to viewport space + glm::mat4& MVP = 
*(glm::mat4*)c_MVP; + glm::mat4& MV = *(glm::mat4*)c_MV; + glm::mat3& MV_normal = *(glm::mat3*)c_MV_normal; + + glm::vec4 screenV = MVP * glm::vec4(primitive.dev_position[vid], 1.f); + glm::vec4 pixelV = screenV / screenV.w; + pixelV.x = (1.f - pixelV.x) / 2.f * c_width[0]; + pixelV.y = (1.f - pixelV.y) / 2.f * c_height[0]; + primitive.dev_verticesOut[vid].pos = pixelV; + + // TODO: Apply vertex assembly here + // Assemble all attribute arrays into the primitive array + primitive.dev_verticesOut[vid].eyePos = glm::vec3(MV * glm::vec4(primitive.dev_position[vid], 1.f)); + primitive.dev_verticesOut[vid].eyeNor = MV_normal * primitive.dev_normal[vid]; + if (primitive.dev_texcoord0) primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid]; + } +} +#else __global__ void _vertexTransformAndAssembly( int numVertices, @@ -638,12 +768,20 @@ void _vertexTransformAndAssembly( // Multiply the MVP matrix for each vertex position, this will transform everything into clipping space // Then divide the pos by its w element to transform into NDC space // Finally transform x and y to viewport space + glm::vec4 screenV = MVP * glm::vec4(primitive.dev_position[vid], 1.f); + glm::vec4 pixelV = screenV / screenV.w; + pixelV.x = (1.f - pixelV.x) / 2.f * width; + pixelV.y = (1.f - pixelV.y) / 2.f * height; + primitive.dev_verticesOut[vid].pos = pixelV; // TODO: Apply vertex assembly here // Assemble all attribute arrays into the primitive array - + primitive.dev_verticesOut[vid].eyePos = glm::vec3(MV * glm::vec4(primitive.dev_position[vid], 1.f)); + primitive.dev_verticesOut[vid].eyeNor = MV_normal * primitive.dev_normal[vid]; + if (primitive.dev_texcoord0) primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid]; } } +#endif @@ -660,12 +798,16 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_ // TODO: uncomment the following code for a start // This is primitive assembly for triangles -//int pid; // id for cur primitives vector -//if 
(primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) { - // pid = iid / (int)primitive.primitiveType; - // dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType] - // = primitive.dev_verticesOut[primitive.dev_indices[iid]]; -//} + int pid; // id for cur primitives vector + if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) { + pid = iid / (int)primitive.primitiveType; + // each .v holds primitiveType vertices; for triangles, each group of three consecutive indices fills one primitive + dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType] + = primitive.dev_verticesOut[primitive.dev_indices[iid]]; + dev_primitives[pid + curPrimitiveBeginId].dev_diffuseTex = primitive.dev_diffuseTex; + dev_primitives[pid + curPrimitiveBeginId].texWidth = primitive.texWidth; + dev_primitives[pid + curPrimitiveBeginId].texHeight = primitive.texHeight; + } // TODO: other primitive types (point, line) @@ -673,6 +815,316 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_ } +__global__ +void _rasterizeByPixel(int numPrimitives, Primitive* primitives, Fragment* fragments, int width, int height) { + int x = (blockIdx.x * blockDim.x) + threadIdx.x; + int y = (blockIdx.y * blockDim.y) + threadIdx.y; + if (x >= width || y >= height) return; // guard threads past the screen edge + int idx = x + y * width; + + int pid; + float closest = -1; + int pidx = -1; + for (pid = 0; pid < numPrimitives; ++pid) { + glm::vec3 triangle[3]; + triangle[0] = glm::vec3(primitives[pid].v[0].pos); + triangle[1] = glm::vec3(primitives[pid].v[1].pos); + triangle[2] = glm::vec3(primitives[pid].v[2].pos); + glm::vec3 bCoord = calculateBarycentricCoordinate(triangle, glm::vec2(x, y)); + if (isBarycentricCoordInBounds(bCoord)) { + float z = -getZAtCoordinate(bCoord, triangle); + if (z < closest || closest == -1) { + closest = z; + pidx = pid; + } + } + } + if (pidx >= 0) fragments[idx].prim = primitives + pidx; +} + +__global__ +#if TILE_SIZE > 0 +void _rasterizePrims(int numPrimitives, Primitive* primitives, int* depths, 
unsigned int* mutexes, Primitive** prims) { +#else +void _rasterizePrims(int numPrimitives, Primitive* primitives, int* depths, Primitive** prims) { +#endif + // primitive id +#if TILE_SIZE > 0 + // Apparently shared atomics are bad + // https://devtalk.nvidia.com/default/topic/514085/atomicadd-in-shared-memory-is-measured-slower-than-in-global-memory-timing-shared-memory-atomic-o/ + // or apparently they're fast? + // https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/ + //__shared__ int s_depths[TILE_SIZE*TILE_SIZE]; + //__shared__ Primitive* s_prims[TILE_SIZE*TILE_SIZE]; + //int tx = (blockIdx.x * blockDim.x);// +threadIdx.x; + //int ty = (blockIdx.y * blockDim.y);// +threadIdx.y; + int tx = blockIdx.y; + int ty = blockIdx.z; + int pid = (blockIdx.x * blockDim.x) + threadIdx.x; + + /*for (int i = TILE_SIZE * TILE_SIZE * (float)threadIdx.x / blockDim.x; + i < TILE_SIZE * TILE_SIZE * (float)(threadIdx.x + 1) / blockDim.x; + ++i) { + //s_depths[i] = INT_MAX; + s_prims[i] = NULL; + } + __syncthreads();*/ +#else + int pid = (blockIdx.x * blockDim.x) + threadIdx.x; +#endif + if (pid < numPrimitives) { + Primitive &prim = primitives[pid]; + +#ifdef OPTIM_BARY + glm::vec2 v0 = glm::vec2(prim.v[0].pos); + glm::vec2 v1 = glm::vec2(prim.v[1].pos); + glm::vec2 v2 = glm::vec2(prim.v[2].pos); + AABBScreen box = getAABBForTriangleScreen(v0, v1, v2); +#else + glm::vec3 triangle[3] = { glm::vec3(prim.v[0].pos), + glm::vec3(prim.v[1].pos), + glm::vec3(prim.v[2].pos) }; + AABB box = getAABBForTriangle(triangle); +#endif + + // discard if completely outside +#if TILE_SIZE > 0 + //if (box.max.x < tx*TILE_SIZE || box.min.x > (tx + 1) * TILE_SIZE || box.min.x > c_width[0] || + // box.max.y < ty*TILE_SIZE || box.min.y > (ty + 1) * TILE_SIZE || box.min.y > c_height[0]) return; +#else + if (box.max.x < 0 || box.min.x > c_width[0] || + box.max.y < 0 || box.min.y > c_height[0]) return; +#endif + + // clip bbox to screen +#if TILE_SIZE > 0 + 
box.min.x = __max(box.min.x, tx*TILE_SIZE); + box.max.x = __min(__min(box.max.x, (tx + 1) * TILE_SIZE), c_width[0]); + box.min.y = __max(box.min.y, ty*TILE_SIZE); + box.max.y = __min(__min(box.max.y, (ty + 1) * TILE_SIZE), c_height[0]); +#else + box.min.x = __max(box.min.x, 0); + box.max.x = __min(box.max.x, c_width[0]); + box.min.y = __max(box.min.y, 0); + box.max.y = __min(box.max.y, c_height[0]); +#endif + +#ifdef OPTIM_BARY + float A01 = v0.y - v1.y, B01 = v1.x - v0.x; + float A12 = v1.y - v2.y, B12 = v2.x - v1.x; + float A20 = v2.y - v0.y, B20 = v0.x - v2.x; + + glm::vec2 p = box.min; + + float w0_row = calculateSignedParallelogramArea(v1, p, v2); + float w1_row = calculateSignedParallelogramArea(v2, p, v0); + float w2_row = calculateSignedParallelogramArea(v0, p, v1); + + for (p.y = box.min.y; p.y <= box.max.y; p.y++) { + float w0 = w0_row; + float w1 = w1_row; + float w2 = w2_row; + + for (p.x = box.min.x; p.x <= box.max.x; p.x++) { + if (w0 >= 0 && w1 >=0 && w2 >= 0) { +#if TILE_SIZE > 0 + int intz = INT_MAX * (w0 * prim.v[0].pos.z + w1 * prim.v[1].pos.z + w2 * prim.v[2].pos.z) / (w0 + w1 + w2); + int pix_idx = p.x + p.y * c_width[0]; + unsigned int* mutex = &mutexes[pix_idx]; + bool isSet = 0; + do { + if (isSet = atomicCAS(mutex, 0, 1) == 0) { + if (intz < depths[pix_idx]) { + depths[pix_idx] = intz; + prims[pix_idx] = &prim; + } + } + if (isSet) { + *mutex = 0; + } + } while (!isSet); + //atomicMin((int*)&(s_depths[pix_idx]), intz); + //atomicMin(&depths[pix_idx], intz); + //int t_idx = ((p.x - tx*TILE_SIZE) + (p.y - ty*TILE_SIZE) * TILE_SIZE); + //if (depths[pix_idx] == intz) { + // prims[pix_idx] = &prim; + //s_prims[t_idx] = &prim; + //} +#else + int intz = INT_MAX * (w0 * prim.v[0].pos.z + w1 * prim.v[1].pos.z + w2 * prim.v[2].pos.z) / (w0 + w1 + w2); + int pix_idx = p.x + p.y * c_width[0]; + atomicMin(&depths[pix_idx], intz); + if (depths[pix_idx] == intz) { + prims[pix_idx] = &prim; + } +#endif + } + + w0 += A12; + w1 += A20; + w2 += A01; + } + + 
w0_row += B12; + w1_row += B20; + w2_row += B01; + } + +#else + for (int x = box.min.x; x < box.max.x; ++x) { + for (int y = box.min.y; y < box.max.y; ++y) { + int pix_idx = x + y * width; + glm::vec3 bCoord = calculateBarycentricCoordinate(triangle, glm::vec2(x, y)); + if (isBarycentricCoordInBounds(bCoord)) { + int intz = INT_MAX * -getZAtCoordinate(bCoord, triangle); + atomicMin(&depths[pix_idx], intz); + if (depths[pix_idx] == intz) { + prims[pix_idx] = &prim; + } + } + } + } +#endif + } +#if TILE_SIZE > 0 + //__syncthreads(); + // for (int i = 0; i < stride + 1; ++i) { + // int g_idx = TILE_SIZE * (blockIdx.y * blockDim.y + blockIdx.z * blockDim.y * blockDim.z); + // if (g_idx + i >= c_width[0] * c_height[0] || threadIdx.x * stride + i >= TILE_SIZE*TILE_SIZE) break; + //depths[g_idx + i] = s_depths[threadIdx.x * stride + i]; + //prims[g_idx + i] = s_prims[threadIdx.x * stride + i]; + //} +#endif +} + +__global__ +void _interpolateAttributes(Fragment* fragments, Primitive** prims, int width, int height) { + + int x = (blockIdx.x * blockDim.x) + threadIdx.x; + int y = (blockIdx.y * blockDim.y) + threadIdx.y; + int index = x + y * width; + + if (x < width && y < height) { + if (prims[index] == NULL) return; + Fragment& frag = fragments[index]; + Primitive& prim = *prims[index]; + + glm::vec3 triangle[3] = { glm::vec3(prim.v[0].pos), glm::vec3(prim.v[1].pos), glm::vec3(prim.v[2].pos) }; +#ifdef PERSPECTIVE_CORRECT + float eye_zs[3] = { prim.v[0].eyePos.z, prim.v[1].eyePos.z, prim.v[2].eyePos.z }; +#endif + glm::vec3 bCoord = calculateBarycentricCoordinate(triangle, glm::vec2(x, y)); +#ifdef PERSPECTIVE_CORRECT + float correct_z = getCorrectZAtCoordinate(bCoord, eye_zs); +#endif + +#ifdef PERSPECTIVE_CORRECT + frag.eyeNor = correct_z * ( + bCoord.x * prim.v[0].eyeNor / eye_zs[0] + + bCoord.y * prim.v[1].eyeNor / eye_zs[1] + + bCoord.z * prim.v[2].eyeNor / eye_zs[2] + ); + frag.eyePos = correct_z * ( + bCoord.x * prim.v[0].eyePos / eye_zs[0] + + bCoord.y * 
prim.v[1].eyePos / eye_zs[1] + + bCoord.z * prim.v[2].eyePos / eye_zs[2] + ); + frag.texcoord0 = correct_z * ( + bCoord.x * prim.v[0].texcoord0 / eye_zs[0] + + bCoord.y * prim.v[1].texcoord0 / eye_zs[1] + + bCoord.z * prim.v[2].texcoord0 / eye_zs[2] + ); +#else + frag.eyeNor = + bCoord.x * prim.v[0].eyeNor + + bCoord.y * prim.v[1].eyeNor + + bCoord.z * prim.v[2].eyeNor; + frag.eyePos = + bCoord.x * prim.v[0].eyePos + + bCoord.y * prim.v[1].eyePos + + bCoord.z * prim.v[2].eyePos; + frag.texcoord0 = + bCoord.x * prim.v[0].texcoord0 + + bCoord.y * prim.v[1].texcoord0 + + bCoord.z * prim.v[2].texcoord0; +#endif + frag.dev_diffuseTex = prim.dev_diffuseTex; + frag.texWidth = prim.texWidth; + frag.texHeight = prim.texHeight; + } +} + +__global__ +void _rasterize(int numPrimitives, Primitive* primitives, int* depths, Fragment* fragments, int width, int height) { + // primitive id + int pid = (blockIdx.x * blockDim.x) + threadIdx.x; + if (pid < numPrimitives) { + Primitive &prim = primitives[pid]; + glm::vec3 triangle[3] = { glm::vec3(prim.v[0].pos), + glm::vec3(prim.v[1].pos), + glm::vec3(prim.v[2].pos) }; +#ifdef PERSPECTIVE_CORRECT + float eye_zs[3] = { prim.v[0].eyePos.z, prim.v[1].eyePos.z, prim.v[2].eyePos.z }; +#endif + + AABB box = getAABBForTriangle(triangle); + box.min.x = __max(box.min.x, 0); + box.max.x = __min(box.max.x, width); + box.min.y = __max(box.min.y, 0); + box.max.y = __min(box.max.y, height); + + for (int x = box.min.x; x < box.max.x; ++x) { + for (int y = box.min.y; y < box.max.y; ++y) { + int pix_idx = x + y * width; + glm::vec3 bCoord = calculateBarycentricCoordinate(triangle, glm::vec2(x, y)); + if (isBarycentricCoordInBounds(bCoord)) { + + float z = getZAtCoordinate(bCoord, triangle); +#ifdef PERSPECTIVE_CORRECT + float correct_z = getCorrectZAtCoordinate(bCoord, eye_zs); +#endif + int intz = INT_MAX * -z; + atomicMin(&depths[pix_idx], intz); + if (depths[pix_idx] == intz) { + Fragment &frag = fragments[pix_idx]; +#ifdef PERSPECTIVE_CORRECT + 
frag.eyeNor = correct_z * ( + bCoord.x * prim.v[0].eyeNor / eye_zs[0] + + bCoord.y * prim.v[1].eyeNor / eye_zs[1] + + bCoord.z * prim.v[2].eyeNor / eye_zs[2] + ); + frag.eyePos = correct_z * ( + bCoord.x * prim.v[0].eyePos / eye_zs[0] + + bCoord.y * prim.v[1].eyePos / eye_zs[1] + + bCoord.z * prim.v[2].eyePos / eye_zs[2] + ); + frag.texcoord0 = correct_z * ( + bCoord.x * prim.v[0].texcoord0 / eye_zs[0] + + bCoord.y * prim.v[1].texcoord0 / eye_zs[1] + + bCoord.z * prim.v[2].texcoord0 / eye_zs[2] + ); +#else + frag.eyeNor = + bCoord.x * prim.v[0].eyeNor + + bCoord.y * prim.v[1].eyeNor + + bCoord.z * prim.v[2].eyeNor; + frag.eyePos = + bCoord.x * prim.v[0].eyePos + + bCoord.y * prim.v[1].eyePos + + bCoord.z * prim.v[2].eyePos; + frag.texcoord0 = + bCoord.x * prim.v[0].texcoord0 + + bCoord.y * prim.v[1].texcoord0 + + bCoord.z * prim.v[2].texcoord0; +#endif + frag.dev_diffuseTex = prim.dev_diffuseTex; + frag.texWidth = prim.texWidth; + frag.texHeight = prim.texHeight; + } + } + } + } + } +} /** @@ -681,12 +1133,18 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) { int sideLength2d = 8; dim3 blockSize2d(sideLength2d, sideLength2d); - dim3 blockCount2d((width - 1) / blockSize2d.x + 1, - (height - 1) / blockSize2d.y + 1); + dim3 blockCount2d((width - 1) / blockSize2d.x + 1, (height - 1) / blockSize2d.y + 1); // Execute your rasterization pipeline here // (See README for rasterization pipeline outline.) 
+#ifdef CONSTANT_MEM
+    // copy constants
+    cudaMemcpyToSymbol(c_MVP, &MVP, sizeof(glm::mat4));
+    cudaMemcpyToSymbol(c_MV, &MV, sizeof(glm::mat4));
+    cudaMemcpyToSymbol(c_MV_normal, &MV_normal, sizeof(glm::mat3));
+#endif
+
     // Vertex Process & primitive assembly
     {
         curPrimitiveBeginId = 0;
@@ -702,7 +1160,11 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
             dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
             dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 
+#ifdef CONSTANT_MEM
+            _vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p);
+#else
             _vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+#endif
             checkCUDAError("Vertex Processing");
             cudaDeviceSynchronize();
             _primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
@@ -720,14 +1182,46 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
     }
 
     cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+    cudaMemset(dev_primitiveBuffer, NULL, width * height * sizeof(Primitive*));
     initDepth << < blockCount2d, blockSize2d >> >(width, height, dev_depth);
-
-    // TODO: rasterize
-
+    // TODO: rasterize
+    dim3 numThreadsPerBlock(128);
+    dim3 numBlocksForPrimitives((totalNumPrimitives + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+#ifdef SEPARATE_INTERP
+#ifdef RASTERIZE_BY_PIXEL
+    START_PROFILE(rasterize_pixels)
+    _rasterizeByPixel << < blockCount2d, blockSize2d >> >(totalNumPrimitives, dev_primitives, dev_fragmentBuffer, width, height);
+    END_PROFILE(rasterize_pixels)
+#else
+#if TILE_SIZE > 0
+    dim3 tileSize(128);
+    dim3 tileBlocks((totalNumPrimitives + tileSize.x - 1) / tileSize.x, (width - 1) / TILE_SIZE + 1, (height - 1) / TILE_SIZE + 1);
+    START_PROFILE(rasterize_prims_tiled)
+    _rasterizePrims << < tileBlocks, tileSize >> >(totalNumPrimitives, dev_primitives, dev_depth, dev_mutex, dev_primitiveBuffer);
+    END_PROFILE(rasterize_prims_tiled)
+#else
+    START_PROFILE(rasterize_prims)
+    _rasterizePrims << < numBlocksForPrimitives, numThreadsPerBlock >> >(totalNumPrimitives, dev_primitives, dev_depth, dev_primitiveBuffer);
+    END_PROFILE(rasterize_prims)
+#endif
+#endif
+    START_PROFILE(interpolate)
+    _interpolateAttributes << < blockCount2d, blockSize2d >> >(dev_fragmentBuffer, dev_primitiveBuffer, width, height);
+    END_PROFILE(interpolate)
+#else
+    START_PROFILE(rasterize)
+    _rasterize << < numBlocksForPrimitives, numThreadsPerBlock >> >(totalNumPrimitives, dev_primitives, dev_depth, dev_fragmentBuffer, width, height);
+    END_PROFILE(rasterize)
+#endif
+    checkCUDAError("Rasterization");
 
     // Copy depthbuffer colors into framebuffer
-    render << < blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
+#ifdef CONSTANT_MEM
+    render << < blockCount2d, blockSize2d >> >(dev_fragmentBuffer, dev_framebuffer);
+#else
+    render << < blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
+#endif
     checkCUDAError("fragment shader");
     // Copy framebuffer into OpenGL buffer for OpenGL previewing
     sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index 46c701e..ec054d7 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -17,6 +17,11 @@ struct AABB {
     glm::vec3 max;
 };
 
+struct AABBScreen {
+    glm::vec2 min;
+    glm::vec2 max;
+};
+
 /**
  * Multiplies a glm::mat4 matrix and a vec4.
  */
@@ -43,6 +48,15 @@ AABB getAABBForTriangle(const glm::vec3 tri[3]) {
     return aabb;
 }
 
+__host__ __device__ static
+AABBScreen getAABBForTriangleScreen(const glm::vec2 &v0, const glm::vec2 &v1, const glm::vec2 &v2) {
+    AABBScreen aabb = {
+        glm::vec2((int)__min(__min(v0.x, v1.x), v2.x), (int)__min(__min(v0.y, v1.y), v2.y)),
+        glm::vec2((int)__max(__max(v0.x, v1.x), v2.x), (int)__max(__max(v0.y, v1.y), v2.y)),
+    };
+    return aabb;
+}
+
 // CHECKITOUT
 /**
  * Calculate the signed area of a given triangle.
@@ -52,6 +66,11 @@ float calculateSignedArea(const glm::vec3 tri[3]) {
     return 0.5 * ((tri[2].x - tri[0].x) * (tri[1].y - tri[0].y) - (tri[1].x - tri[0].x) * (tri[2].y - tri[0].y));
 }
 
+__host__ __device__ static
+float calculateSignedParallelogramArea(const glm::vec2& v0, const glm::vec2& v1, const glm::vec2& v2) {
+    return ((v2.x - v0.x) * (v1.y - v0.y) - (v1.x - v0.x) * (v2.y - v0.y));
+}
+
 // CHECKITOUT
 /**
  * Helper function for calculating barycentric coordinates.
@@ -99,3 +118,12 @@ float getZAtCoordinate(const glm::vec3 barycentricCoord, const glm::vec3 tri[3])
         + barycentricCoord.y * tri[1].z
         + barycentricCoord.z * tri[2].z);
 }
+
+__host__ __device__ static
+float getCorrectZAtCoordinate(const glm::vec3 barycentricCoord, const float zs[3]) {
+    return 1.f / (barycentricCoord.x / zs[0]
+        + barycentricCoord.y / zs[1]
+        + barycentricCoord.z / zs[2]);
+}
+
+#define BARY_INTERP(coord, val1, val2, val3) (coord.x * val1 + coord.y * val2 + coord.z * val3)
\ No newline at end of file
diff --git a/util/CMakeLists.txt b/util/CMakeLists.txt
index c995fae..4291838 100644
--- a/util/CMakeLists.txt
+++ b/util/CMakeLists.txt
@@ -8,5 +8,5 @@ set(SOURCE_FILES
 
 cuda_add_library(util
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_52
+    OPTIONS -arch=sm_50
     )
diff --git a/util/utilityCore.hpp b/util/utilityCore.hpp
index a67db68..c8c8315 100644
--- a/util/utilityCore.hpp
+++ b/util/utilityCore.hpp
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include <fstream>
 
 #define PI 3.1415926535897932384626422832795028841971
 #define TWO_PI 6.2831853071795864769252867665590057683943
@@ -24,6 +25,38 @@
 #define G 6.67384e-11
 #define EPSILON .000000001
 #define ZERO_ABSORPTION_EPSILON 0.00001
+#define PROFILING 0
+#define PROFILING_PREFIX "C:\\Users\\213re\\Code\\coursework\\565CIS\\Project4-CUDA-Rasterizer\\profiling\\"
+
+#ifndef PROFILE_KERNEL
+#define PROFILE_KERNEL
+#if PROFILING
+
+#define START_PROFILE(name) \
+    cudaEvent_t start_##name, stop_##name; \
+    cudaEventCreate(&start_##name); \
+    cudaEventCreate(&stop_##name); \
+    cudaEventRecord(start_##name);
+
+#define END_PROFILE(name) \
+    cudaEventRecord(stop_##name); \
+    cudaEventSynchronize(stop_##name); \
+    float milliseconds_##name = 0; \
+    cudaEventElapsedTime(&milliseconds_##name, start_##name, stop_##name); \
+    cudaEventDestroy(start_##name); \
+    cudaEventDestroy(stop_##name); \
+    std::ofstream out_##name; \
+    out_##name.open(PROFILING_PREFIX "PROFILE_" #name ".txt", std::ios::out | std::ios::app); \
+    out_##name << milliseconds_##name << std::endl; \
+    out_##name.close();
+
+#else
+
+#define START_PROFILE(name)
+#define END_PROFILE(name)
+
+#endif
+#endif
 
 namespace utilityCore {
 extern float clamp(float f, float min, float max);
@@ -41,4 +74,4 @@ extern std::istream &safeGetline(std::istream &is, std::string &t); //Thanks to
 extern void printMat4(const glm::mat4 &);
 extern void printVec4(const glm::vec4 &);
 extern void printVec3(const glm::vec3 &);
-}
+}
\ No newline at end of file