Saturday, September 15, 2012

Thoughts on Stackless KD-Tree Traversal

Edit: Erwin Coumans in the comments section has pointed me to a GDC 2009 talk by Takahiro Harada proposing something called Tree Traversal using History Flags, which is essentially the same as the idea proposed in this post, with the small exception that Harada's technique uses a bit field to track previously visited nodes on the up traverse. I think that Harada's technique is actually better than the pointer check I wrote about in this post, since keeping a bit field allows for tracking the previously visited node without having to go back to global memory to do a node check. In other words, the bit field method means less thrashing of global memory, which I should think gives it a nice performance edge. So, much as I suspected, the idea in this post is one that folks smarter than me have arrived upon previously, and my knowledge of the literature on this topic is indeed incomplete. Much thanks to Erwin for pointing me to the Harada talk! The original post is preserved below, in case anyone still has an interest in reading it.
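To make the bit field idea a bit more concrete, here's a rough sketch of what I imagine the bookkeeping looks like; this is my guess at the gist, not Harada's actual code:

```cpp
// Hypothetical sketch of history flags: one bit per tree depth, kept in a
// register, records which child was taken at that depth on the way down.
// On the up traverse, the bit answers "did I just come from the left or
// the right child?" without re-reading any nodes from global memory.
struct TraversalHistory {
    unsigned int wentRight; // bit d set: took the right child at depth d

    TraversalHistory() : wentRight(0) {}

    void recordDescent(int depth, bool tookRight) {
        if (tookRight) wentRight |=  (1u << depth);
        else           wentRight &= ~(1u << depth);
    }

    bool cameFromRight(int depth) const {
        return (wentRight >> depth) & 1u;
    }
};
```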



Of course, one of the biggest challenges to implementing a CUDA pathtracer is the lack of recursion on pre-Fermi GPUs. Since I intend for Takua-RT to be able to run on any CUDA-enabled GPU, I necessarily have to work with the assumption that I won't have recursion support. Getting around this problem in the core pathtracer is not actually a significant issue, as building raytracing systems that operate iteratively rather than recursively is a well-covered topic.
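As a quick illustration of what the iterative formulation looks like (all of the names below are made up for the example, not Takua's actual code): instead of a radiance function that calls itself at every bounce, the loop carries an accumulated path throughput along.

```cpp
// Minimal iterative pathtracing sketch. Everything here is illustrative:
// Color, Ray, Hit, and BRDFSample are placeholder types, and
// intersectScene() / sampleBRDF() are assumed routines.
struct Color { float r, g, b; };
Color  operator*(const Color& a, const Color& b) { return {a.r*b.r, a.g*b.g, a.b*b.b}; }
Color& operator+=(Color& a, const Color& b) { a.r += b.r; a.g += b.g; a.b += b.b; return a; }
Color& operator*=(Color& a, const Color& b) { a = a * b; return a; }

struct Ray { float origin[3]; float direction[3]; };
struct Hit { bool valid; Color emitted; /* position, normal, material */ };
struct BRDFSample { Color reflectance; Ray nextRay; };

Hit intersectScene(const Ray& ray);                     // assumed intersection routine
BRDFSample sampleBRDF(const Hit& hit, const Ray& ray);  // assumed BRDF sampler

Color pathtrace(Ray ray, int maxDepth) {
    Color radiance   = {0, 0, 0};
    Color throughput = {1, 1, 1};
    for (int depth = 0; depth < maxDepth; ++depth) {
        Hit hit = intersectScene(ray);
        if (!hit.valid) break;                 // ray escaped the scene
        radiance += throughput * hit.emitted;  // pick up any emission
        BRDFSample s = sampleBRDF(hit, ray);   // choose the next direction
        throughput *= s.reflectance;           // attenuate by the surface response
        ray = s.nextRay;                       // continue from the hit point
    }
    return radiance;
}
```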

Traversing a kd-tree without recursion, however, is a more challenging proposition. So far as I can tell from a very cursory glance at the existing literature on the topic, there are presently two major approaches: fully stackless methods that require some amount of pre-processing of the kd-tree, such as the rope-based method presented in Popov et al. [2007], and methods utilizing a short stack or something similar, such as the method presented in Zhou et al. [2008]. I'm in the process of reading both of these papers more carefully, and will probably explore at least one of these approaches soon. In the meantime, however, I thought it might be a fun exercise to try to come up with some solution of my own, which I'll summarize in this post. I have to admit that I have no idea if this is actually a novel approach, or if it's something that somebody else came up with and rejected a long time ago and I just haven't found yet. My coverage of the literature in this area is highly incomplete, so if you, the reader, are aware of a pre-existing version of this idea, please let me know so that I can attribute it properly!

The basic idea I'm starting with is that when traversing a KD-tree (or any other type of tree, for that matter), at a given node there are only a finite number of directions one can go in, and a finite number of previous nodes one could have arrived from. In other words, one could conceivably define a finite-state machine type system for traversing a KD-tree, given an input ray. I say finite-state machine type because what I shall define here isn't strictly an FSM, as this method requires storing information about the previous state in addition to the current state. So here we go:

We begin by tracking two pieces of information: which node we are currently at, and which direction we took from the previous node to get there. There are three possible directions we could have come from:

1. Down from the current node's parent node
2. Up from the current node's left child
3. Up from the current node's right child

Similarly, there are only three directions we can possibly travel in from the current node:

1. Up to the current node's parent node
2. Down to the current node's left child
3. Down to the current node's right child

When we travel up from the current node to its parent, we can easily figure out if we are traveling up from the right or the left by looking at whether the current node is the parent node's left or right child. 
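In code, this bookkeeping might look something like the following sketch; the node layout here is just what the scheme needs, not an actual node struct from Takua:

```cpp
// Hypothetical node layout and traversal state for the scheme described
// above. Leaves have null children; every node knows its parent.
struct KDNode {
    KDNode* parent;
    KDNode* left;
    KDNode* right;
    // bounds, split plane, and a primitive list for leaves would live here
};

// The three directions we could have arrived at the current node from.
enum CameFrom { FROM_PARENT, FROM_LEFT_CHILD, FROM_RIGHT_CHILD };

// Moving up: determine whether we are arriving from the left or the right
// by checking which child of the parent the current node is.
CameFrom directionAfterMovingUp(const KDNode* current) {
    return (current == current->parent->left) ? FROM_LEFT_CHILD
                                              : FROM_RIGHT_CHILD;
}
```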

Now we need a few rules on which direction to travel in given the direction we came from and some information on where our ray currently is in space:

1. If we came down from the parent node and if the current node is not a leaf node, intersection test our ray with both children of the current node. If the ray only intersects one of the children, traverse down to that child. If the ray intersects both of the children, traverse down to the left child.
2. If we came down from the parent node and if the current node is a leaf node, carry out intersection tests between the ray and the contents of the node, store the nearest intersection, and traverse back up to the parent.
3. If we came up from the left child, intersection test our ray with the right child of the current node. If we have an intersection, traverse down the right child. If we don't have an intersection, traverse upwards to the parent.
4. If we came up from the right child, traverse upwards to the parent.

That's it. With those four rules, we can traverse an entire KD-Tree in a DFS fashion, while skipping branches that our ray does not intersect for a more efficient traverse, and avoiding any use of recursion or the use of a stack in memory.

There is, of course, the edge case that our ray is coming in to the tree from the "back", so that the right child of each node is "in front" of the left child instead of "behind", but we can easily deal with this case by simply testing which side of the KD-tree we're entering from and swapping left and right in our ruleset accordingly.
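To make the four rules concrete, here's a rough, untested sketch of the whole traverse, building on the KDNode and CameFrom definitions from the snippet above; rayHitsNode() and intersectLeafContents() are hypothetical stand-ins for a ray/bounds test and for intersection testing a leaf's contents:

```cpp
struct Ray;
struct Intersection;
bool rayHitsNode(const Ray& ray, const KDNode* node);          // assumed ray/bounds test
void intersectLeafContents(const KDNode* leaf, const Ray& ray,
                           Intersection& nearest);             // assumed leaf test

// Move up to the parent, recording which child we are coming up from.
// A null parent means we just left the root, which ends the traverse.
void moveUp(KDNode*& current, CameFrom& state) {
    if (current->parent != nullptr) {
        state = directionAfterMovingUp(current);
    }
    current = current->parent;
}

// Untested sketch of the four-rule traverse. For rays entering from the
// "back" of the tree, left and right would be swapped throughout, per the
// edge case described above.
void traverse(KDNode* root, const Ray& ray, Intersection& nearest) {
    KDNode* current = root;
    CameFrom state = FROM_PARENT;

    while (current != nullptr) {
        if (state == FROM_PARENT) {
            if (current->left == nullptr) {
                // Rule 2: leaf node, so test the leaf's contents, then head up.
                intersectLeafContents(current, ray, nearest);
                moveUp(current, state);
            } else if (rayHitsNode(ray, current->left)) {
                // Rule 1: prefer the left child whenever the ray hits it.
                current = current->left;
                state = FROM_PARENT;
            } else if (rayHitsNode(ray, current->right)) {
                current = current->right;
                state = FROM_PARENT;
            } else {
                // Shouldn't happen if the children partition the node,
                // but be defensive and head back up.
                moveUp(current, state);
            }
        } else if (state == FROM_LEFT_CHILD) {
            // Rule 3: done with the left subtree, try the right child once.
            if (rayHitsNode(ray, current->right)) {
                current = current->right;
                state = FROM_PARENT;
            } else {
                moveUp(current, state);
            }
        } else {
            // Rule 4: done with both subtrees, keep heading up.
            moveUp(current, state);
        }
    }
}
```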

I haven't actually gotten around to implementing this idea yet (as of September 15th, when I started writing this post, although this post may get published much later), so I'm not sure what the performance looks like. There are some inefficiencies in how many nodes our traverse will end up visiting, but on the flipside, we won't need to keep much of anything in memory except for two pieces of state information and the actual KD-tree itself. On the GPU, I might run into implementation level problems that could impact performance, such as too many branching statements or memory thrashing if the KD-tree is kept in global memory and a gazillion threads try to traverse it at once, so these issues will need to be addressed later.

Again, if you, the reader, know of this idea from a pre-existing place, please let me know! Also, if you see a gaping hole in my logic, please let me know too!

Since this has been a very text heavy post, I'll close with some pictures of a KD-tree representing the scene from the Takua-RT post. They don't really have much to do with the traverse method presented in this post, but they are KD-tree related!


Vimeo's compression really does not like thin narrow lines, so here are some stills:




Monday, September 10, 2012

TAKUA/Avohkii Render


One question I've been asking myself ever since my friend Peter Kutz and I wrapped our little GPU Pathtracer experiment is "why am I writing Takua Render as a CPU-only renderer?" One of the biggest lessons learned from the GPU Pathtracer experiment was that GPUs can indeed provide vast quantities of compute suitable for pathtracing. After thinking for a bit at the beginning of the summer, I've decided that since I'm starting my renderer from scratch and don't have to worry about the tons of legacy that real-world big boy renderers like RenderMan have to deal with, there is no reason why I shouldn't architect my renderer to use whatever compute power is available.

With that being said, from this point forward, I will be concurrently developing CPU and GPU based implementations of Takua Render. I call this new overall project TAKUA/Avohkii, mainly because Avohkii is a cool name. Within this project, I will continue developing the C++ based x86 version of Takua, which will retain the name of just Takua, and I will also work on a CUDA based GPU version, called Takua-RT, with full feature parity. I'm also planning on investigating the idea of an ARM port, but that's an idea for later. I'm going to stick with CUDA for the GPU version now since I know CUDA better than OpenCL and since almost all of the hardware I have access to develop and test on right now is NVIDIA based (the SIG Lab runs on NVIDIA cards...), but that could change down the line. The eventual goal is to have a set of renderers that together cover as many hardware bases as possible, and can all interoperate and intercommunicate for farming purposes.

I've already gone ahead and finished the initial work of porting Takua Render to CUDA. One major lesson learned from the GPU Pathtracer experiment was that enormous CUDA kernels tend to run into a number of problems, much like massive monolithic GL shaders do. One problem in particular is that enormous kernels tend to take a long time to run and can result in the GPU driver terminating the kernel, since NVIDIA's drivers by default assume that kernels taking longer than about two seconds to run are hanging, and kill said kernels. In the GPU Pathtracer experiment, we used a giant monolithic kernel for a single ray bounce, which ran into problems as geometry counts went up and intersection testing, and therefore kernel execution time, increased accordingly. For Takua-RT, I've decided to split a single ray bounce into a sequence of micro-kernels that launch in succession. Basically, each operation is now a kernel; each intersection test is a kernel, BRDF evaluation is a kernel, etc. While I suppose I lose a bit of time in having numerous kernel launches, I get around the kernel time-out problem.
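To illustrate the general shape of this (with placeholder names and empty kernel bodies; this isn't Takua-RT's actual code), a single bounce now conceptually looks something like:

```cpp
// Hypothetical sketch of the micro-kernel structure; the PathState struct
// and kernel bodies are placeholders. The point is the shape: each stage
// of a bounce is its own short kernel, so no single launch runs long
// enough to trip the driver's watchdog timer.
struct PathState; // per-path ray, throughput, accumulated color, etc.

__global__ void intersectScene(PathState* paths, int numPaths) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPaths) return;
    // ... intersection test for path i against the scene ...
}

__global__ void evaluateBRDFs(PathState* paths, int numPaths) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPaths) return;
    // ... BRDF evaluation and next-ray generation for path i ...
}

void pathtraceBounce(PathState* d_paths, int numPaths) {
    int threadsPerBlock = 256;
    int blocks = (numPaths + threadsPerBlock - 1) / threadsPerBlock;

    // Kernels launched on the same stream execute in order, so each
    // micro-kernel sees the results of the previous one.
    intersectScene<<<blocks, threadsPerBlock>>>(d_paths, numPaths);
    evaluateBRDFs<<<blocks, threadsPerBlock>>>(d_paths, numPaths);
    // ... more micro-kernels: shadow rays, accumulation, and so on ...
}
```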

Another important lesson learned was that culling useless kernel launches is extremely important. I'm checking for empty threads at the end of each ray bounce and culling via stream compaction for now, but this can of course be further extended to the micro-kernels for intersection testing later.
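As a sketch of what this can look like, here's how the per-bounce cull might be written with Thrust, which ships with the CUDA toolkit; the PathState layout and the termination flag are assumptions for the example:

```cpp
#include <thrust/device_ptr.h>
#include <thrust/remove.h>

// Hypothetical per-path state; the real Takua-RT struct will differ.
struct PathState {
    bool terminated; // set when a path escapes the scene or is absorbed
    // ... ray, throughput, pixel index, etc. ...
};

// Predicate for thrust::remove_if: true for paths that should be culled.
struct IsTerminated {
    __host__ __device__ bool operator()(const PathState& p) const {
        return p.terminated;
    }
};

// After each bounce, compact the path pool so dead paths don't occupy
// threads in the next round of kernel launches.
int compactPaths(PathState* d_paths, int numPaths) {
    thrust::device_ptr<PathState> begin(d_paths);
    thrust::device_ptr<PathState> end =
        thrust::remove_if(begin, begin + numPaths, IsTerminated());
    return static_cast<int>(end - begin); // number of still-active paths
}
```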

Anyhow, enough text. Takua-RT, even in its super-naive unoptimized CUDA-port state right now, is already so much faster than the CPU version that I can render frames with fairly high convergence in seconds to minutes. That means the renderer is now fast enough for use on rendering animations, such as the one at the top of this post. No post-processing whatsoever was applied, aside from my name watermark in the lower right hand corner. The following images are raw output frames from Takua-RT, this time straight from the renderer, without even watermarking:
 







Each of these frames represents 5000 iterations of convergence, and took about a minute to render on an NVIDIA GeForce GTX 480. The flickering in the glass ball in the animated version comes from having a low trace depth of 3 bounces, including for glass surfaces.

Sunday, September 9, 2012

Jello KD-Tree

I've started an effort to clean up, rewrite, and enhance my ObjCore library, and part of that effort includes taking my KD-Tree viewer from Takua Render and making it a standard component of ObjCore. As a result, I can now plug the latest version of ObjCore into any of my projects and quickly wire up support for viewing that project's KD-Tree. Here's the jello sim from a few months back visualized as a KD-Tree:


I've adopted a new standard grey background for OpenGL tests, since I've found that the higher contrast this darker grey provides plays nicer with Vimeo's compression, for a clearer result. But of course I'll still post stills too.

Hopefully at the end of this clean up process, I'll have ObjCore in a solid enough of a state to post to Github.

Wednesday, September 5, 2012

Volumetric Renderer Revisited

I've been meaning to add animation support to my volume renderer for demoreel purposes for a while now, so I did that this week! Here's a rotating cloud animation:


...and of course, a still or two:


Instead of just rotating the camera around the cloud, I wanted the cloud itself to rotate while the noise field it samples stays stationary, resulting in a cool kind of morphing effect with the cloud's actual shape. In order to author animations easily, I implemented a fairly rough, crude version of Maya integration. I wrote a script that takes spheres and point lights in Maya and builds a scene file for my volume renderer, using the Maya spheres to define cloud clusters and the point lights to define... well... lights. With an easy bit of scripting, I can do this for each frame in a keyframed animation in Maya and then simply call the volume renderer once per frame. Here's what the above animation's Maya scene file looks like:



Also, since my pseudo-blackbody trick was originally intended to simulate the appearance of a fireball, I tried creating an animation of a fireball by just scaling a sphere:


...and as usual again, stills:

So that's that for the volume renderer for now! I think this might be the end of the line for this particular incarnation of the volume renderer (it remains the only piece of tech I'm keeping around that is more or less unmodified from its original CIS460/560 state). I think the next time I revisit the volume renderer, I'm either going to port it entirely to CUDA, as my good friend Adam Mally did with his, or I'm going to integrate it into my renderer project, Peter Kutz style.