IMPORTANT NOTE
In this two weeks, I was trying to implement a BVH tree with GLSL shaders. And then, I was going to implement recursive ray tracing (I failed :D). I immediately initiated another RayTracer project which runs on CPU. My CPU project runs a lot faster than GPU implementation. This proves that I need to learn a lot more things related to shader programming.
I will mainly work on CPU part from now on. I will consider GPU part for the project part of the course. Learning concepts and implementing them comfortably is more important.
Within 3 late days, I have managed to implement the things related to HW2 and I will share my comments and thoughts on things that I’ve encountered in detail.
THINGS I DID WITH GPU
I first want to talk about my experience with compute shaders and the problems that I’ve encountered. I really like playing with shaders and I can implement most of the algorithms that I see with them. I really want to learn how to write clean an efficient shader code. However, it runs so sloooooowly and I cannot understand why. I guess I need to be an expert in order to fully understand what were my problems. While I was searching on the net, I saw people trying to understand what their shader code does by looking at the assembly codes created by shader compilers.
I guess being a newbie shader programmer is not enough for creating optimized shaders. I need to have a deeper understanding about the inner structure of GPUs. Things were working great in the beginning especially when I was working on the first homework. But when recursive algorithms started coming I started having problems.
GPU Code Does Not Have Recursion
I already knew that I could not implement recursive algorithms like BVH traversal and recursive ray tracing just like we normally do in CPU side. But I had solutions in my mind. Since speed was very important for us, I mainly focused on implementing a BVH traversal algorithm in compute shader.
We all have some solutions in our minds for implementing recursive algorithms iteratively. We use stacks. It is one of the most convenient ways of implementing recursive algorithms iteratively. So, How I did that in compute shader?
For BVH traversal, I implemented a stack
BVHNode nodeStack[MAX_STACK_LENGTH]
Here is my pseudo code for traversing the BVH:
IntersectionReport intersectBVH(Ray r, Mesh m)
{
BVHNode rootNode = BVHNodes[mesh.rootBVHNode];
nodeStack[0] = rootNode;
int stackPointer = 1;
BVHNode current = nodeStack[0];
while(stackPointer >=1)
{
// Pop the stack
stackPointer--
current = nodeStack[stackPointer];
bool aabbIntersect = intersectBox(r, current.aabb, tmin, tmax);
if(aabbIntersect == true)
{
// Current node is a leaf
if(current.isLeaf == true)
{
IntersectionReport rprt = intersectFace(r, current.indices, tmin,tmax)
// Things related to face intersection
...
}
// If it is not a leaf, we push children into the stack
else
{
nodeStack[stackPointer] = BVHNodes[current.rightNode];
stackPointer++;
nodeStack[stackPointer] = BVHNodes[current.leftNode];
stackPointer++;
}
}
}
return report;
}
This algorithm is straightforward. I first flattened the BVH tree that I’ve constructed in CPU and sent it to the shader by using Shader Storage Buffer Objects. I had a giant BVHNode array which contains all the nodes coming from all the meshes in the scene. I just kept their offsets and reached them by using these offsets.
However, this destroyed the speed gain. Shader started running ridiculously slow. Especially when tree heights are higher, shader becomes slower and slower. After searching through the net, I have learnt that checking things with if and else everywhere in the shader is not a proper approach. Shaders do not like branching because GPU is not created for these purposes. I needed to have another approach for BVH traversal. Implementing recursion with stacks is not efficient.
I searched for a solution for a week :D. And finally I implemented this algorithm and it worked better. I flatten the BVH again, but I keep the elements in the array sorted in pre-order form. Just like how we traverse the BVH tree. For each node, I was keeping how many child nodes they have.
So, my new algorithm was just looping that giant BVH node array and passing one index if bounding box intersects. Otherwise it jumps to the index of the other child at that level. This gave better results. But again, it was really really slow.
Here are my rendering time results for some scenes;
- Bunny : 14.719 ms
- ScienceTree : 35.859 ms
- Car : 115.773 ms
- Low Poly : 153.762 ms
- Tower : 140.304 ms
- Windmill : 114.857 ms
- Berserker : 51.798 ms
- ChineseDragon : 1201.54 ms
So as the tree grows larger. My runtime is not like it is O(logn). It is because I am not capable of writing highly optimized GPU code for now 😦
Here is a picture showing how much time I spend in intersection checking. More time is spent in brighter parts.
So, I dealt with BVH like that. After that, I started implementing recursive ray tracing. But how :D? Branching is a big sin. Stacks are highly inefficient. How should I do that. Again, only solution came into my mind was again implementing stacks. But here I could only make mirrors unfortunately. Dielectrics were really really hard to implement in GPU.
I could not finish this homework with shaders. Sağlık olsun :D. I will definitely return and clean the mess I made. But I had to finish my homework. In order to keep up with the class, I needed a solution. So I started a CPU based ray tracer. My experience with that was truly Raytracing In One Weekend.
CPU BASED RAYTRACER
I initiated another project immediately after I decided that dealing with GPU optimizations taking a lot of time. I moved my Ray Tracing algorithms in compute shaders to good old C++. Since I am comfortable with C++, I moved faster.
I first rendered our hw1 images starting from simple.xml. Had a some minor errors and I resolved them thanks to GDB. After I saw that I correctly produce the results. I moved on hw2 related topics.
I implemented the BVH tree without flattening. I used the traversal algorithm in our slides. I have used midpoint heuristics for splitting. I searched for some good and efficient ray-box intersection algorithms and found some on the internet. I guess that made my BVH traversal run faster. Furthermore, I also implemented the parallelism with C++ 11 features.
Parallelism
I searched on the internet for efficient parallelism in RayTracing and saw some codes in stackoverflow, github etc. I also read blog posts of the classmates and tried to understand their approaches. After all, I learned that I need to asynchronously process every pixel with N number of threads where N is the number of cores in CPU. Here is the parallel technique I use. (I simplified the code a little bit):
std::atomic<int> count(0);
std::vector<std::future<void> futureVector;
int cores = std::thread::hardware_concurrency();
while(cores--)
{
futureVector.pushBack(std::async(std::launch::async, [=]()
{
while(true)
{
int index = count++;
if(index >= worksize)
break;
vec2 coords = GiveCoords(index, _imageWidth);
Ray primaryRay = ComputePrimaryRay(index, _imageWidth);
vec3 pixel = RayTrace(primaryRay);
WritePixelCoord(coords.x, coords.y, pixel);
}
}));
}
for(auto& element: futureVector)
{
element.get();
}
I saw this pattern of code which uses future and async in a lot of places. By using parallelism and BVH trees, I was ready for calculating the times.
Low poly images gave great results. But chinese_dragon frustrated me a little bit. It was rendering it in 2 seconds while rendering others in 70 – 300 ms. Something was wrong. And I saw the problem after thinking for a day. I was just limiting the depth of my tree as 10. It was slow because of that. That tree depth was enough for small meshes but for large meshes, I needed deeper trees.
After I increased the depth limit. I got really fast results which shocked me.
Berserker | 77.247 ms |
Bunny | 18.162 ms |
Chinese Dragon | 73.838 ms (I did not expect that fast) |
Cornellbox Recursive | 38.083 ms |
Cornellbox | 34.564 ms |
Low Poly scene | 238.048 ms |
ScienceTree Recursive | 173.516 ms |
ScienceTree | 67.676 ms |
Spheres Recursive | 29.247 ms |
Tower | 260.641 ms |
Two Spheres | 21.928 ms |
Windmill | 225.047 ms |
Other Dragon | 240.724 ms |
Water Tap Frames | 100 – 300 ms on average |
Some Errors
Of course I had a lot of errors. For example;
I was trying to implement multithreaded rendering and getting this result. I was using index counter as std::atomic<int> but it was not syncronized for some reason. But From that image, we can see that threads spend a lot less time when not hitting bounding boxes of dragon’s BVH. Another result like this;
When handling dielectrics, I was hitting my head to the desk. I was getting close results to correct images but it was a little bit different. You can see that glass science tree is not correct. This panicked me a lot because it was already 20.00 PM and after 23.59, I was not allowed to resubmit my work in Odtuclass.
I found my mistake. It was really silly. While checking total reflection, I was calculating cosPhi as usual. But when calculating transmitted rays, I was just taking square root of cosPhi. This was the reason of my wrong results. I corrected it and everything worked smoothly after that.
Rendering Results
I am happy with my BVH implementation since it runs a lot faster than I’ve expected. Here are some rendering results for images specific to HW2
Conclusion
I panicked a lot in while implementing this homework because I thought that I was going to be left behind. But I managed to overcome the difficulties and kept up with my classmates I guess. I learned a good lesson from this. I should not take risks like that. Learning is more important. I think I will carry on with CPU implementation because I’m happy with its speed. But it does not mean that I will give up learning shader programming :D.
I was playing around with the scenes. I wanted to see how fast I can render massive images. I increased the resolution for a massive other-dragon image. 12000 * 9000 pixels. It is rendered in approx 90 seconds. In my opinion it looks cool. I will also implement soft shading shortly.
2 thoughts on “Ray Tracer: Revisited Part 2”