Friday, February 22, 2013

Stackless KD-Tree Traversal

I have a working, reasonably optimized, speedy GPU stackless kd-tree traversal implementation! Over the past few days, I implemented the history flag-esque approach I outlined in this post, and it works quite well!

The following image is a heatmap of a kd-tree built for the Stanford Dragon, showing the cost of tracing a ray through each pixel in the image. Brighter values mean more node traversals and intersection tests had to be done for that particular ray. The image was rendered entirely using Takua Render's CUDA pathtracing engine, and took roughly 100 milliseconds to complete:

...and a similar heatmap, this time generated for a scene containing two mesh cows, two mesh helices, and some cubes and spheres in a box:

Although room for even further optimization still exists, as it always does, I am quite happy with the results so far. My new kd-tree construction system and stackless traversal system are both several orders of magnitude faster and more efficient than my older attempts.

Here's a bit of a cool image: in my OpenGL debugging view, I can now follow the kd-tree traversal for a single ray at a time and visualize the exact path and nodes encountered. This tool has been extremely useful for optimizing... without a visual debugging tool, no wonder my previous implementations had so many problems! The scene here is the same cow/helix scene, but rotated 90 degrees. The bluish green line coming in from the left is the ray, and the green boxes outline the nodes of the kd-tree that traversal had to check to get the correct intersection.

...and here's the same image as above, but with all nodes that were skipped drawn in red. As you can see, the system is now efficient enough to cull the vast majority of the scene for each ray:

The size of the nodes relative to the density of the geometry in their vicinity also speaks to the efficiency of the new kd-tree construction system: empty space is quickly skipped through with enormous bounding boxes, whereas high-density areas have much smaller bounding boxes to allow for efficient culling.

Over the next day or so, I fully expect I'll be able to reintegrate the actual pathtracing core, and have some nice images! Since the part of Takua that needed rewriting the most was the underlying scene and kd-tree system, I will be able to reuse a lot of the BRDF/emittance/etc. stuff from Takua Rev 2.

Friday, February 15, 2013

Revision 3 KD-Tree/ObjCore

The one piece of Takua Render that I've been proudest of so far has been the KD-Tree and OBJ mesh processing systems that I built. So of course, over the past week I threw away the old versions of KdCore and ObjCore and rewrote both entirely from scratch. The motivation for this rewrite came mostly from the fact that over the past year I've learned a lot more about KD-Trees and about programming in general; as a result, I'm pleased to report that the new versions of KdCore and ObjCore are significantly faster and more memory-efficient than their predecessors. KdCore3 is now able to process a million objects into an efficient, optimized KD-Tree with a depth of 20 and a minimum of 5 objects per leaf node in roughly one second.

Here's my kitchen scene, exported to Takua Render's AvohkiiScene format, and processed through KdCore3. White lines are the wireframe lines for the geometry itself, red lines represent KD-Tree node boundaries:


...and the same image as above, but with only the KD-Tree. You can use the lightbox to switch between the two images for comparisons:


One of the most noticeable improvements in KdCore3 over KdCore2, aside from the speed increases, is in how KdCore3 manages empty space. In the older versions of KdCore, empty space was often repeatedly split into multiple nodes, meaning that ray traversal through empty space was very inefficient, since repeated intersection tests would be required only for a ray to pass through the KD-Tree without actually hitting anything. The images in this old post demonstrate what I mean. The main source of this problem came from how splits were being chosen in KdCore2; in KdCore2, the chosen split was the lowest cost split regardless of axis. As a result, splits were often chosen that resulted in long, narrow nodes going through empty space. In KdCore3, the best split is chosen as the lowest cost split on the longest axis of the node. As a result, empty space is culled much more efficiently.

Another major change to KdCore3 is that the KD-Tree is no longer built recursively. Instead, KdCore3 builds the tree layer by layer through an iterative approach that is well suited for adaptation to the GPU. Instead of attempting to guess how deep to build the KD-Tree, KdCore3 now simply takes a maximum depth from the user and builds the tree no deeper than the given depth. The tree is also no longer stored as a series of nodes with pointers to each other; instead, all nodes are stored in a flat array with a clever indexing scheme that lets nodes implicitly know where their parent and child nodes are within the array, and the tree builds directly into this array format. The array storage format makes KdCore3 more suitable for a GPU adaptation, and also makes serializing the KD-Tree out to disk for memory caching purposes significantly easier.
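As a rough sketch of what implicit indexing buys you (the exact scheme KdCore3 uses may differ — this is the classic binary-heap layout, with my own illustrative names):

```cpp
#include <cstddef>

// Implicit binary-heap layout for a tree stored in a flat array: no
// parent/child pointers are stored at all; positions are computed from
// a node's own array index.
inline std::size_t parentIndex(std::size_t i) { return (i - 1) / 2; }
inline std::size_t leftChildIndex(std::size_t i) { return 2 * i + 1; }
inline std::size_t rightChildIndex(std::size_t i) { return 2 * i + 2; }

// A tree with a user-specified maximum depth d occupies at most
// 2^(d+1) - 1 slots, so a layer-by-layer build can preallocate the
// whole array up front.
inline std::size_t maxNodeCount(std::size_t maxDepth) {
    return (std::size_t(1) << (maxDepth + 1)) - 1;
}
```

A nice side effect is that serialization becomes a single contiguous write of the array.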

Another major change is how split candidates are chosen; in KdCore2, the candidates along each axis were the median of all contained object center-points, the middle of the axis, and some randomly chosen candidates. In KdCore3, the user can specify a number of split candidates to try along each axis, and then KdCore3 will simply divide each axis into that number of equally spaced points and use those points as candidates. As a result, KdCore3 is far more efficient than KdCore2 at calculating split candidates, can often find a better candidate with more deterministic results due to the removal of random choices, and offers the user more control over the quality of the final split.
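Combined with the longest-axis rule from above, the candidate generation can be sketched like this (struct and function names here are illustrative, not KdCore3's actual code):

```cpp
#include <vector>

// A node's axis-aligned bounds.
struct Bounds {
    float min[3];
    float max[3];
};

// Pick the axis along which the node is longest.
inline int longestAxis(const Bounds& b) {
    int axis = 0;
    float best = b.max[0] - b.min[0];
    for (int i = 1; i < 3; ++i) {
        float extent = b.max[i] - b.min[i];
        if (extent > best) { best = extent; axis = i; }
    }
    return axis;
}

// Return numCandidates equally spaced split planes strictly inside the
// node along the given axis.
inline std::vector<float> splitCandidates(const Bounds& b, int axis,
                                          int numCandidates) {
    std::vector<float> candidates;
    float step = (b.max[axis] - b.min[axis]) / float(numCandidates + 1);
    for (int i = 1; i <= numCandidates; ++i) {
        candidates.push_back(b.min[axis] + step * float(i));
    }
    return candidates;
}
```

Each candidate is then scored with the split cost function, and the lowest-cost plane wins.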

The following series of images demonstrates KD-Trees built by KdCore3 for the Stanford Dragon with various settings. Again, feel free to use the lightbox for comparisons.

Max depth 2, min objects per node 20, min volume .0001% of whole tree
Max depth 5, min objects per node 20, min volume .0001% of whole tree
Max depth 10, min objects per node 20, min volume .0001% of whole tree
Max depth 15, min objects per node 20, min volume .0001% of whole tree
Max depth 20, min objects per node 20, min volume .0001% of whole tree
Max depth 20, min objects per node 5, min volume .0001% of whole tree


KdCore3 is also capable of detecting when the number of nodes in the tree would make traversing the tree more expensive than brute-force intersection testing all of the objects in the tree, and will stop tree construction beyond that point. I've also given KdCore3 an experimental method for finding the best splits based on a semi-Monte-Carlo approach. In this mode, instead of using evenly spaced candidates, KdCore3 makes three random guesses, and then, based on the relative costs of those guesses, begins making additional guesses with a probability distribution weighted toward wherever the lower relative cost lies. With this approach, KdCore3 will eventually converge on the optimal-cost split, although getting there may take some time. The number of guesses KdCore3 will attempt can, of course, be limited by the user.
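Roughly, the weighted-guessing scheme looks something like this sketch (the cost function, distribution shape, and names are my own stand-ins, not KdCore3's actual implementation):

```cpp
#include <cmath>
#include <functional>
#include <random>

// Semi-Monte-Carlo split search: a few uniform guesses first, then new
// guesses drawn from a distribution concentrated around the best
// (lowest-cost) split found so far, with the spread shrinking over time.
inline float monteCarloSplitSearch(std::function<float(float)> cost,
                                   float lo, float hi, int maxGuesses,
                                   unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> uniform(lo, hi);
    // Three initial uniform guesses.
    float best = uniform(rng);
    float bestCost = cost(best);
    for (int i = 0; i < 2; ++i) {
        float g = uniform(rng);
        float c = cost(g);
        if (c < bestCost) { best = g; bestCost = c; }
    }
    // Subsequent guesses cluster around the current best.
    for (int i = 3; i < maxGuesses; ++i) {
        float spread = (hi - lo) / float(i);
        std::normal_distribution<float> local(best, spread);
        float g = std::fmin(hi, std::fmax(lo, local(rng)));
        float c = cost(g);
        if (c < bestCost) { best = g; bestCost = c; }
    }
    return best;
}
```

With a smooth cost function this homes in on the minimum; the guess cap bounds the work per node.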

Finally, another major improvement in KdCore3 is simply better use of C++. Over the past two years, my knowledge of how to write fast, effective C++ has evolved immensely, and I now write code very differently than I did when I wrote KdCore2 and KdCore1. For example, KdCore3 avoids relying on class inheritance and virtual method table lookups (KdCore2 relied on inheritance quite heavily). Normally, virtual method lookup doesn't add a noticeable amount of execution time to a single call, but when repeated for a few million objects, the slowdown becomes extremely apparent. In talking with my friend Robert Mead, I realized that virtual method table lookup in most, if not all, implementations today means a minimum of three pointer lookups in memory to find a function, whereas a direct function call needs just one.
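As a contrived example of the kind of refactor this implies (the names here are purely illustrative, not Takua's actual code): instead of a base class with a virtual method, store a type tag and branch directly.

```cpp
// Instead of a Shape base class with a virtual surfaceArea() method
// (a vtable lookup per call), use a plain struct with a type tag and a
// direct, non-virtual function that switches on the tag.
enum class ShapeType { Sphere, Box };

struct Shape {
    ShapeType type;
    float radius;   // used when type == Sphere
    float halfSide; // used when type == Box
};

inline float surfaceArea(const Shape& s) {
    switch (s.type) {
        case ShapeType::Sphere:
            return 4.0f * 3.14159265f * s.radius * s.radius;
        case ShapeType::Box: {
            float side = 2.0f * s.halfSide;
            return 6.0f * side * side;
        }
    }
    return 0.0f;
}
```

The tradeoff is flexibility for speed: adding a new shape means touching the switch, but hot loops over millions of objects no longer pay for dynamic dispatch.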

If I have time later, I'll post some benchmarks of KdCore3 versus KdCore2. For now, here's a final pair of images showcasing a scene with highly variable density processed through KdCore3. Note the heavy clustering of nodes where large amounts of geometry exist, and the near-total emptiness of the KD-Tree in areas where the scene is sparse:


Next up: implementing some form of highly efficient stackless KD-Tree traversal, possibly even using that history based approach I wrote about before!

Friday, February 8, 2013

Bounding Boxes for Ellipsoids

Warning: this post is going to be pretty math-heavy.

Let's talk about spheres, or more generally, ellipsoids. Specifically, let's talk about calculating axis-aligned bounding boxes for arbitrarily transformed ellipsoids, which is an interesting problem I recently stumbled upon while working on Takua Rev 3. Finding a solution took a lot of searching, and I didn't find any single collected source of information on this problem, so I figured I'd write it up for both my own reference and for anyone else who may find it useful.

So what's so hard about calculating tight axis aligned bounding boxes for arbitrary ellipsoids?

Well, consider a basic, boring sphere. The easiest way to calculate a tight axis-aligned bounding box (or AABB) for a mesh is to simply min/max all of the vertices in the mesh to get the two points representing the min and max corners of the AABB. Similarly, getting a tight AABB for a box is easy: just run the box's eight vertices through the same min/max process. A naive approach to getting a tight AABB for a sphere then seems simple: place one point on the surface of the sphere at each end of each of the sphere's three axes, and min/max those. Figure 1 shows a 2D example of this naive approach; to extend the example to 3D, simply add two more points for the Z axis (I drew the illustrations quickly in Photoshop, so apologies for only showing 2D examples):

Figure 1.
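The mesh min/max approach mentioned above is as straightforward in code as it sounds; a minimal sketch (types and names are my own, not Takua's):

```cpp
#include <algorithm>
#include <vector>

struct Vec3 {
    float x, y, z;
};

struct AABB {
    Vec3 min, max;
};

// Componentwise min/max over all vertices gives the tight AABB of a
// mesh. Assumes verts is non-empty.
inline AABB boundVertices(const std::vector<Vec3>& verts) {
    AABB box = { verts[0], verts[0] };
    for (const Vec3& v : verts) {
        box.min.x = std::min(box.min.x, v.x);
        box.min.y = std::min(box.min.y, v.y);
        box.min.z = std::min(box.min.z, v.z);
        box.max.x = std::max(box.max.x, v.x);
        box.max.y = std::max(box.max.y, v.y);
        box.max.z = std::max(box.max.z, v.z);
    }
    return box;
}
```

The trouble, as the rest of this post shows, is that a sphere has no vertices to min/max, so we need the analytic route instead.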

This naive approach, however, quickly fails if we rotate the sphere such that its axes no longer line up nicely with the world axes. In Figure 2, our sphere is rotated, and min/maxing points on the sphere's axes produces an AABB that is far too small:

Figure 2.

If we scale the sphere into an ellipsoid, the same problem persists, since a sphere is just a special case of an ellipsoid. Figures 3 and 4 illustrate the same problem from Figures 1 and 2 with an ellipsoid:

Figure 3.
Figure 4.

One possible solution is to continue using the naive min/max axes approach, but simply expand the resultant AABB by some percentage such that it encompasses the whole sphere. However, we have no way of knowing what percentage will give an exact bound, so the only feasible way to use this fix is by making the AABB always larger than a tight fit would require. As a result, this solution is almost as undesirable as the naive solution, since the whole point of this exercise is to create as tight of an AABB as possible for as efficient intersection culling as possible!

Instead of min/maxing the axes, we need to use some more advanced math to get a tight AABB for ellipsoids.

We begin by noting our transformation matrix, which we'll call M. We'll also need the transpose of M, which we'll call MT. Next, we define a sphere S using a 4x4 matrix:

[ 1/(r*r)  0        0        0 ]
[ 0        1/(r*r)  0        0 ]
[ 0        0        1/(r*r)  0 ]
[ 0        0        0       -1 ]

where r is the radius of the sphere, so for a unit diameter sphere, r = .5. (This is just the sphere x^2/r^2 + y^2/r^2 + z^2/r^2 - 1 = 0 written as a quadric matrix.) Once we have built S, we'll take its inverse, which we'll call SI; since S is diagonal, SI is simply the diagonal matrix with entries r*r, r*r, r*r, -1.

We now calculate a new 4x4 matrix R = M*SI*MT. R should be symmetric when we're done, such that R = transpose(R). We'll name R's entries as follows, with indices running from 0 to 3 to match the code below:

R = [ r00 r01 r02 r03 ] 
      [ r01 r11 r12 r13 ] 
      [ r02 r12 r22 r23 ] 
      [ r03 r13 r23 r33 ] 

Using R, we can now get our bounds. One thing to watch out for: r33 works out to be negative for any affine M, so dividing by it flips which root is which; the minus root gives the max and the plus root gives the min:

zmax = (r23 - sqrt(pow(r23,2) - (r33*r22)) ) / r33; 
zmin = (r23 + sqrt(pow(r23,2) - (r33*r22)) ) / r33; 
ymax = (r13 - sqrt(pow(r13,2) - (r33*r11)) ) / r33; 
ymin = (r13 + sqrt(pow(r13,2) - (r33*r11)) ) / r33; 
xmax = (r03 - sqrt(pow(r03,2) - (r33*r00)) ) / r33; 
xmin = (r03 + sqrt(pow(r03,2) - (r33*r00)) ) / r33; 

...and we're done!
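For reference, here's the whole recipe collected into one self-contained sketch. The matrix struct and function names are illustrative rather than Takua's actual code; I build SI directly as diag(r*r, r*r, r*r, -1), and I min/max the two roots rather than relying on the sign of r33:

```cpp
#include <algorithm>
#include <cmath>

// Row-major 4x4 matrix with the translation in the last column.
struct Mat4 {
    float m[4][4];
};

Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 c = {};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                c.m[i][j] += a.m[i][k] * b.m[k][j];
    return c;
}

Mat4 transpose(const Mat4& a) {
    Mat4 t = {};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            t.m[i][j] = a.m[j][i];
    return t;
}

// Computes the tight AABB of a sphere of radius r transformed by M.
void ellipsoidBounds(const Mat4& M, float r,
                     float outMin[3], float outMax[3]) {
    // SI = inverse of the sphere matrix S = diag(1/r^2, 1/r^2, 1/r^2, -1).
    Mat4 SI = {};
    SI.m[0][0] = SI.m[1][1] = SI.m[2][2] = r * r;
    SI.m[3][3] = -1.0f;
    Mat4 R = multiply(multiply(M, SI), transpose(M));
    for (int axis = 0; axis < 3; ++axis) {
        // The two roots of the tangent-plane quadratic are the two
        // sides of the bounding box along this axis.
        float d = std::sqrt(R.m[axis][3] * R.m[axis][3]
                            - R.m[3][3] * R.m[axis][axis]);
        float a = (R.m[axis][3] + d) / R.m[3][3];
        float b = (R.m[axis][3] - d) / R.m[3][3];
        outMin[axis] = std::min(a, b);
        outMax[axis] = std::max(a, b);
    }
}
```

As a sanity check, a sphere of radius .5 translated to (1, 2, 3) comes out with min (0.5, 1.5, 2.5) and max (1.5, 2.5, 3.5), exactly center plus/minus radius.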

Just to prove that it works, here's a screenshot of a transformed ellipsoid inside of a tight AABB in 3D, from Takua Rev 3's GL debug view:


I've totally glossed over the mathematical rationale behind this method in this post and focused just on how to quickly get a working implementation, but if you want to read more about the actual math behind how it works, these are the two sources I pulled this from:

Stack Overflow post by user fd
Article by Inigo Quilez

In other news, Takua Rev 3's new scene system is now complete and I am working on a brand new, better, faster, stackless KD-tree implementation. More on that later!