YINING KARL LI | Code & Visuals

Saturday, November 16, 2013

Code & Visuals has moved!

Just a quick new announcement: this blog has moved! I've decided to move Code & Visuals to a Jekyll/Github Pages based system, located at blog.yiningkarlli.com. For more details, see this new post.

I'll leave this Blogger version around, but from now on, new posts will be at the new place. RSS/Atom feeds should transition over automatically.

See you on the other side!

Saturday, July 27, 2013

Pixar Optix Lighting Preview Demo

For the past two months or so, I've been working at Pixar Animation Studio as a summer intern with Pixar's Research Group. The project I'm on for the summer is a realtime, GPU based lighting preview tool implemented on top of NVIDIA's OptiX framework, entirely inside of The Foundry's Katana. I'm incredibly pleased to be able to say that our project was demoed at SIGGRAPH 2013 at the NVIDIA booth, and that NVIDIA has a recording of the entire demo online!

The demo was done by our project's lead, Danny Nahmias, and got an overwhelmingly positive reception. Check out the recording here:

FXGuide also did a podcast about our demo! Check it out here.

I'm just an intern, and the vast majority of the cool work being done on this project is from Danny Nahmias, Phillip Rideout, Mark Meyer, and others, but I'm very very proud, and consider myself extraordinarily lucky, to be part of this team!

Edit: I've replaced the original Ustream embed with a Vimeo mirror since the Ustream embed was crashing Chrome for some people. The original Ustream link is here.

Monday, April 29, 2013

Giant Mesh Test

My friend/schoolmate Zia Zhu is an amazing modeler, and recently she was kind enough to lend me a ZBrush sculpt she did for use as a high-poly test model for Takua Render. The model is a sculpture of Venus, and is made up of slightly over a million quads, or about two million triangles once triangulated inside of Takua Render.

Here are some nice, pretty test renders I did. As usual, everything was rendered with Takua Render, and there has been absolutely zero post-processing:

Each one of these renders was lit using a single, large area light (with importance sampled direct lighting, of course). The material on the model is just standard lambert diffuse white; I'll do another set of test renders once I've finished rewriting my subsurface scatter system. Each render was set to 2800 samples per pixels and took about 20 minutes to render on a single GTX480. In other words, not spectacular, but not bad either.

The key takeaway from this series of tests was that Takua's performance still suffers significantly when datasets become extremely large; while the render took about 20 minutes, setup time (including memory transfer, etc) took nearly 5 minutes, which I'm not happy about. I'll be taking some time to rework Takua's memory manager.

On a happier note, KD-tree construction performed well! The KD-tree for the Venus sculpt was built out to a depth of 30 and took less than a second to build.

Here's a bonus image of what the sculpt looks like in the GL preview mode:

Again, all credit for the actual model goes to the incredibly talented Zia Zhu!

Friday, April 26, 2013

Importance Sampled Direct Lighting

Takua Render now has correct, fully working importance sampled direct lighting, supported for any type of light geometry! More importantly, the importance sampled direct lighting system is now fully integrated with the overall GI pathtracing integrator.

A naive, standard pathtracing implementation shoots out rays and accumulates colors until a light source is reached, upon which the total accumulated color is multiplied by the emittance of the light source and added to the framebuffer. As a result, even the simplest pathtracing integrator does account for both the indirect and direct illumination within a scene, but since sampling light sources is entirely dependent on the BRDF at each point, correctly sampling the direct illumination component in the scene is extremely inefficient. The canonical example of this inefficiency is a scene with a single very small, very intense, very far away light source. Since the probability of hitting such a small light source is so small, convergence is extremely slow.

To demonstrate/test this property, I made a simple test scene with an extremely bright sun-like object illuminating the scene from a huge distance away:

Using naive pathtracing without importance sampled direct lighting produces an image like this after 16 samples per pixel:

Mathematically, the image is correct, but is effectively useless since so few contributing ray paths have actually been found. Even after 5120 samples, the image is still pretty useless:

Instead, a much better approach is to accumulate colors just like before, but not bother waiting until a light source is hit by the ray path through pure BRDF sampling to multiply emittance. Instead, at each ray bounce, a new indirect ray is generated via the BRDF like before, AND to generate a new direct ray towards a randomly chosen light source via multiple importance sampling and multiply the accumulated color by the resultant emittance. Multiple importance sampled direct lighting works by balancing two different sampling strategies: sampling by light source and sampling by BRDF, and then weighting the two results with some sort of heuristic (such as the power heuristic described in Eric Veach's thesis).

Sampling by light source is the trickier part of this technique. The idea is to generate a ray that we know will hit a light source, and then weight the contribution from that ray by the probability of generating that ray to remove the bias introduced by artificially choosing a ray direction. There's a few good ways to do this: one way is to generate an evenly distributed random point on a light source as the target for the direct lighting ray, and then weight the result using the probability distribution function with respect to surface area, transformed into a PDF with respect to solid angle.

Takua Render at the moment uses a slightly different approach, for the sake of simplicity. The approach I'm using is similar to the one described in my earlier post on the topic, but with a disk instead of a sphere. The approach works like this:

1. Figure out a bounding sphere for the light source
2. Construct a ray from the point to be lit to the center of the bounding sphere. Let's call the direction of this ray D.
3. Find a great circle on the bounding sphere with a normal N, such that N is lined up exactly with D. 4. Move the great circle along its normal towards the point to be lit by a distance of exactly the radius of the bounding sphere
5. Treat the great circle as a disk and generate uniformly distributed random points on the disk to shoot rays towards.
6. Weight light samples by the projected solid angle of the disk on the point being lit.

Alternatively, the weighting can simply be based on the normal solid angle instead of the projected solid angle is the random points are chosen with a cosine weighted distribution.

The nice thing about this approach is that it allows for importance sampled direct lighting even for shapes that are difficult to sample random points on; effectively, the problem of sampling light sources is abstracted away, at the cost of a slight loss in efficiency since some percentage of rays fired at the disk have to miss the light in order for the weighting to remain unbiased.

I also started work on the surface area PDF to solid angle PDF method, so I might post about that later too. But for now, everything works! With importance sampled direct lighting, the scene from above is actually renderable in a reasonable amount of time. With just 16 samples per pixel, Takua Render now can generate this image:

...and after 5120 samples per pixel, a perfectly clean render:

The other cool thing about this scene is that most of the scene is actually being lit through pure indirect illumination. With only direct illumination and no GI, the render looks like this:

Saturday, April 20, 2013

Quick Update on Future Plans

Just a super quick update on my future plans:

Next year, starting in September, I'll be joining Dr. Don Greenberg and Dr. Joseph T. Kider and others at Cornell's Program for Computer Graphics. I'll be pursuing a Master of Science in Computer Graphics there, and will most likely be working on something involving rendering (which I suppose is not surprising).

Between the end of school and September, I'll be spending the summer at Pixar Animation Studios once again, this time as part of Pixar's Research Group.

Obviously I'm quite excited by all of this!

Now, back to working on my renderer.

Working Towards Importance Sampled Direct Lighting

I haven't made a post in a few weeks now since I've been working on a number of different things all of which aren't quite done yet. Since its been a few weeks, here's a writeup of one of the things I'm working on and where I am with that.

One of the major features I've been working towards for the past few weeks is full multiple importance sampling, which will serve a couple of purposes. First, importance sampling the direct lighting contribution in the image should allow for significantly higher convergence rates for the same amount of compute, allowing for much smoother renders for the same render time. Second, MIS will serve as groundwork for future bidirectional integration schemes, such as Metropolis transport and photon mapping. I've been working with my friend Xing Du on understanding the math behind MIS and figuring out how exactly the math should translate into implementation.

So first off, some ground truth tests. All ground truth tests are rendered using brute force pathtracing with hundreds of thousands of iterations per pixel. Here is the test scene I've been using lately, with all surfaces reduced to lambert diffuse for the sake of simplification:

Ground truth global illumination render, representing 512000 samples per pixel. All lights sampled by BRDF only.

Ground truth for direct lighting contribution only, with all lights sampled by BRDF only.

Ground truth for indirect lighting contribution only.

The motivation behind importance sampling lights by directly sampling objects with emissive materials comes from the difficulty of finding useful samples from the BRDF only; for example, for the lambert diffuse case, since sampling from only the BRDF produces outgoing rays in totally random (or, slightly better, cosine weighted random) directions, the probability of any ray coming from a diffuse surface actually hitting a light is relatively low, meaning that the contribution of each sample is likely to be low as well. As a result, finding the direct lighting contribution via just BRDF sampling.

For example, here's the direct lighting contribution only, after 64 samples per pixel with only BRDF sampling:

Direct lighting contribution only, all lights sampled by BRDF only, 64 samples per pixel.

Instead of sampling direct lighting contribution by shooting a ray off in a random direction and hoping that maybe it will hit a light, a much better strategy would be to... shoot the ray towards the light source. This way, the contribution from the sample is guaranteed to be useful. There's one hitch though: the weighting for a sample chosen using the BRDF is relatively simple to determine. For example, in the lambert diffuse case, since the probability of any particular random sample within a hemisphere is the same as any other sample, the weighting per sample is even with all other samples. Once we selectively choose the ray direction specifically towards the light though, the weighting per sample is no longer even. Instead, we must weight each sample by the probability of a ray going in that particular direction towards the light, which we can calculate by the solid angle subtended by the light source divided by the total solid angle of the hemisphere.

So, a trivial example case would be if a point was being lit by a large area light subtending exactly half of the hemisphere visible from the point. In this case, the area light subtends Pi steradians, making its total weight Pi/(2*Pi), or one half.

The tricky part of calculating the solid angle weighting is in calculating the fractional unit-spherical surface area projection for non-uniform light sources. In other words, figuring out what solid angle a sphere subtends is easy, but figuring out what solid angle a Stanford Bunny subtends is.... less easy.

The initial approach that Xing and I arrived at was to break complex meshes down into triangles and treat each triangle as a separate light, since calculating the solid angle subtended by a triangle is once again easy. However, since treating a mesh as a cloud of triangle area lights is potentially very expensive, for each iteration the direct lighting contribution from all lights in the scene becomes potentially untenable, meaning that each iteration of the render will have to randomly select a small number of lights to directly sample.

As a result, we brainstormed some ideas for potential shortcuts. One shortcut idea we came up with was that instead of choosing an evenly distributed point on the surface of the light to serve as the target for our ray, we could instead shoot a ray at the bounding sphere for the target light and weight the resulting sample by the solid angle subtended not by the light itself, but by the bounding sphere. Our thinking was that this approach would dramatically simplify the work of calculating the solid angle weighting, while still maintaining mathematical correctness and unbiasedness since the number of rays fired at the bounding sphere that will miss the light should exactly offset the overweighting produced by using the bounding sphere's subtended solid angle.

I went ahead and tried out this idea, and produced the following image:

Direct lighting contribution only, all lights sampled by direct sampling weighted by subtended solid angle, 64 samples per pixel.

First off, for the most part, it works! The resulting direct illumination matches the ground truth and the BRDF-sampling render, but is significantly more converged than the BRDF-sampling render for the same number of samples. BUT, there is a critical flaw: note the black circle around the light source on the ceiling. That black circle happens to fall exactly within the bounding sphere for the light source, and results from a very simple mathematical fact: calculating the solid angle subtended by the bounding sphere for a point INSIDE of the bounding sphere is undefined. In other words, this shortcut approach will fail for any points that are too close to a light source.

One possible workaround I tried was to have any points inside of a light's bounding sphere to fall back to pure BRDF sampling, but this approach is also undesirable, as a highly visible discontinuity between the differently sampled area develops due to vastly different convergence rates.

So, while the overall solid angle weighting approach checks out, our shortcut does not. I'm now working on implementing the first approach described above, which should produce a correct result, and will post in the next few days.

Wednesday, March 6, 2013

Stratified versus Uniform Sampling

As part of Takua Render's new pathtracing core, I've implemented a system allowing for multiple sampling methods instead of just uniform sampling. The first new sampling method I've added in addition to uniform sampling is stratified sampling. Basically, in stratified sampling, instead of spreading samples per iteration across the entire probability region, the probability region is first divided into a number of equal sized, non-overlapping subregions, and then for each iteration, a sample is drawn with uniform probability from within a single subregion, called a strata. The result of stratified sampling is that samples are guaranteed to be more evenly spread across the entire probability domain instead of clustered within a single area, resulting in less visible noise for the same number of samples compared to uniform sampling. At the same time, since stratified sampling still maintains a random distribution within each strata, the aliasing problems associated with a totally even sample distribution are still avoided.

Here's a video showing a scene rendered in Takua Render with uniform and then stratified sampling. The video also shows a side-by-side comparison in its last third.

In the renders in the above video, stratified sampling is being used to choose new ray directions from diffuse surface bounces; instead of choosing a random point over the entire cosine-weighted hemisphere at an intersection point, the renderer first chooses a strata with the same steradian as all other strata, and then chooses a random sample within that solid angle. The strata is chosen sequentially for primary bounces, and then chosen randomly for all secondary bounces to maintain unbiased sampling over the whole render. As a result of the sequential strata selection for primary bounces, images rendered in Takua Render will not converged to an unbiased solution until N iterations have elapsed, where N is the number of strata the probability region is divided into. The number of strata can be set by the user as a value in the scene description which is then squared to get the total strata number. So, if a user specifies a strata level of 16, then the probability region will be divided into 256 strata and a unbiased result will not be reached until 256 or more samples per pixel have been taken.

Here's the Lamborghini model from last post at 256 samples per pixel with stratified (256 strata) and uniform sampling, to demonstrate how much less perceptible noise there is with the stratified sampler. From a distance, the uniform sampler renders may seem slightly darker side by side due to the higher percentage of noise, but if you compare them using the lightbox, you can see that the lighting and brightness is the same.

Stratified sampling, 256 strata, 256 samples per pixel

Uniform sampling, 256 samples per pixel

...and up-close crops with 400% zoom:

Stratified sampling, 256 strata, 256 samples per pixel, 400% crop

Uniform sampling, 256 samples per pixel, 400% crop

At some point soon I will also be implementing Halton sequence sampling and [0,2]-sequence sampling, but for the time being, stratified sampling is already providing a huge visual boost over uniform! In fact, I have a small secret to confess: all of the renders in the last post were rendered with the stratified sampler!

Monday, March 4, 2013

First Progress on New Pathtracing Core

I've started work on a completely new pathtracing core to replace the one used in Rev 2. The purpose of totally rewriting the entire pathtracing integrator and brdf systems is to produce something much more modular and robust; as much as possible, I am now decoupling brdf and new ray direction calculation from the actual pathtracing loop.

I'm still in the earliest stages of this rewrite, but I have some test images! Each of the following images was rendered out to somewhere around 25000 samples per pixel (a lot!), at about 5/6 samples per pixel per second. I let the renders run without a hard ending point and terminated them after I walked away for a while and came back, hence the inexact but enormous samples per pixel counts. Each scene was lit with my standard studio-styled lighting setup and in addition to the showcased model, uses a smooth backdrop that consists of about 10000 triangles.

Approximately 100000 face Stanford Dragon:

Approximately 150000 face Deloreon model:

Approximately 250000 face Lamborghini Aventador model:

Friday, March 1, 2013

Short-stack KD-Tree Traversal

In my last post, I talked about implementing history flag based kd-tree traversal. While the history flag based stackless traverse worked perfectly fine in terms of traversing the tree and finding the nearest intersection, I discovered over the past week that its performance is... less than thrilling. Unfortunately, the history flag system results in a huge amount of redundant node visits, since the entire system is state based and therefore necessarily needs to visit every node in a branch of the tree both to move down and up the branch.

So instead, I decided to try out a short-stack based approach. My initial concern with short-stack based approaches was the daunting memory requirements that keeping a short stack for a few hundred threads, however, I realized that realistically, a short stack never needs to be any larger than the maximum depth of the kd-tree being traversed. Since I haven't yet had a need to test a tree with a depth beyond 100, the memory usage required for keeping short stacks is reasonably predictable and manageable; as a precaution, however, I've also decided to allow for the system to fall back to a stackless traverse in the case that a tree's depth causes short stack memory usage to become unreasonable.

The actual short-stack traverse I'm using is a fairly standard while-while traverse based on the 2008 Kun Zhou realtime kd-tree paper and the 2007 Daniel Horn GPU kd-tree paper. I've added one small addition though: in addition to keeping a short stack for traversing the kd-tree, I've also added an optional second short stack that tracks the last N intersection test objects. The reason for keeping this second short stack is that kd-trees allow for objects to be split across multiple nodes; by tracking which objects we have already encountered, we can safely detect and skip objects that have already been tested. The object tracking short stack is meant to be rather small (say, no more than 10 to 15 objects at a time), and simply loops back and overwrites the oldest values in the stack when it overflows.

The new while-while traversal is significantly faster than the history flag approach, to the order of a 10x or better performance increase in some cases.

In order to validate that the entire kd traversal system works, I did a quick and dirty port of the old Rev 2 pathtracing integrator to run on top of the new Rev 3 framework. The following test images contain about 20000 faces and objects, and clocked in at about 6 samples per pixel per second with a tree depth of 15. Each image was rendered to 1024 samples per pixel:

I also attempted to render these images without any kd-tree acceleration as a control. Without kd-tree acceleration, each sample per pixel took upwards of 5 seconds, and I wound up terminating the renders before they got even close to completion.

The use of my old Rev 2 pathtracing core is purely temporary, however. The next task I'll be tackling is a total rewrite of the entire pathtracing system and associated lighting and brdf evaluation systems. Previously, this systems have basically been monolithic blocks of code, but with this rewrite, I want to create a more modular, robust system that can recycle as much code as possible between GPU and CPU implementations, the GL debugger, and eventually other integration methods, such as photon mapping.

Friday, February 22, 2013

Stackless KD-Tree Traversal

I have a working, reasonably optimized, speedy GPU stackless kd-tree traversal implementation! Over the past few days, I implemented the history flag-esque approach I outlined in this post, and it works quite well!

The following image is a heatmap of a kd-tree built for the Stanford Dragon, showing the cost of tracing a ray through each pixel in the image. Brighter values mean more node traversals and intersection tests had to be done for that particular ray. The image was rendered entirely using Takua Render's CUDA pathtracing engine, and took roughly 100 milliseconds to complete:

...and a similar heatmap, this time generated for a scene containing two mesh cows, two mesh helixes, and some cubes and spheres in a box:

Although room for even further optimization still exists, as it always does, I am quite happy with the results so far. My new kd-tree construction system and stackless traversal system are both several orders of magnitude faster and more efficient than my older attempts.

Here's a bit of a cool image: in my OpenGL debugging view, I can now follow the kd-tree traversal for a single ray at a time and visualize the exact path and nodes encountered. This tool has been extremely useful for optimizing... without a visual debugging tool, no wonder my previous implementations had so many problems! The scene here is the same cow/helix scene, but rotated 90 degrees. The bluish green line coming in from the left is the ray, and the green boxes outline the nodes of the kd-tree that traversal had to check to get the correct intersection.

...and here's the same image as above, but with all nodes that were skipped drawn in red. As you can see, the system is now efficient enough to cull the vast vast majority of the scene for each ray:

The size of the nodes relative to the density of the geometry in their vicinity also speaks towards the efficiency of the new kd-tree construction system: empty spaces are quickly skipped through with enormous bounding boxes, whereas high density areas have much smaller bounding boxes to allow for efficient culling.

Over the next day or so, I fully expect I'll be able to reintegrate the actual pathtracing core, and have some nice images! Since the part of Takua that needed rewriting the most was the underlying scene and kd-tree system, I will be able to reuse a lot of the BRDF/emittance/etc. stuff from Takua Rev 2.

Friday, February 15, 2013

Revision 3 KD-Tree/ObjCore

The one piece of Takua Render that I've been proudest of so far has been the KD-Tree and obj mesh processing systems that I built. So of course, over the past week I completely threw away the old versions of KdCore and ObjCore and totally rewrote new versions entirely from scratch. The motive behind this rewrite came mostly from the fact that over the past year, I've learned a lot more about KD-Trees and programming in general; as a result, I'm pleased to report that the new versions of KdCore and ObjCore are significantly faster and more memory efficient than previous versions. KdCore3 is now able to process a million objects into an efficient, optimized KD-Tree with a depth of 20 and a minimum of 5 objects per leaf node in roughly one second.

Here's my kitchen scene, exported to Takua Render's AvohkiiScene format, and processed through KdCore3. White lines are the wireframe lines for the geometry itself, red lines represent KD-Tree node boundaries:

...and the same image as above, but with only the KD-Tree. You can use the lightbox to switch between the two images for comparisons:

One of the most noticeable improvements in KdCore3 over KdCore2, aside from the speed increases, is in how KdCore3 manages empty space. In the older versions of KdCore, empty space was often repeatedly split into multiple nodes, meaning that ray traversal through empty space was very inefficient, since repeated intersection tests would be required only for a ray to pass through the KD-Tree without actually hitting anything. The images in this old post demonstrate what I mean. The main source of this problem came from how splits were being chosen in KdCore2; in KdCore2, the chosen split was the lowest cost split regardless of axis. As a result, splits were often chosen that resulted in long, narrow nodes going through empty space. In KdCore3, the best split is chosen as the lowest cost split on the longest axis of the node. As a result, empty space is culled much more efficiently.

Another major change to KdCore3 is that the KD-Tree is no longer built recursively. Instead, KdCore3 builds the KD-Tree layer by layer through an iterative approach that is well suited for adaptation to the GPU. Instead of attempting to guess how deep to build the KD-Tree, KdCore3 now just takes a maximum depth from the user and builds the tree no deeper than the given depth. The entire tree is also no longer stored as a series of nodes with pointers to each other, but instead all nodes are stored in a flat array with a clever indexing scheme to allow nodes to implicitly know where their parent and child nodes are within the array. Furthermore, instead of building as a series of nodes with pointers, the tree builds directly into the array format. This array storage format again makes KdCore3 more suitable to a GPU adaptation, and also makes serializing the Kd-Tree out to disk significantly easier for memory caching purposes.

Another major change is how split candidates are chosen; in KdCore2, the candidates along each axis were the median of all contained object center-points, the middle of the axis, and some randomly chosen candidates. In KdCore3, the user can specify a number of split candidates to try along each axis, and then KdCore3 will simply divide each axis into that number of equally spaced points and use those points as candidates. As a result, KdCore3 is far more efficient than KdCore2 at calculating split candidates, can often find a better candidate with more deterministic results due to the removal of random choices, and offers the user more control over the quality of the final split.

The following series of images demonstrate KD-Trees built by KdCore3 for the Stanford Dragon with various settings. Again, feel free to use the lightbox for comparisons.

Max depth 2, min objects per node 20, min volume .0001% of whole tree

Max depth 5, min objects per node 20, min volume .0001% of whole tree

Max depth 10, min objects per node 20, min volume .0001% of whole tree

Max depth 15, min objects per node 20, min volume .0001% of whole tree

Max depth 20, min objects per node 20, min volume .0001% of whole tree

Max depth 20, min objects per node 5, min volume .0001% of whole tree

KdCore3 is also capable of figuring out when the number of nodes in the tree makes traversing the tree more expensive than brute force intersection testing all of the objects in the tree, and will stop tree construction beyond that point. I've also given KdCore3 an experiment method for finding best splits based on a semi-Monte-Carlo approach. In this mode, instead of using evenly split candidates, KdCore3 will make three random guesses, and then based on the relative costs of the guesses, begin making additional guesses with a probability distribution weighted towards where ever the lower relative cost is. With this approach, KdCore3 will eventually arrive at the absolute optimal cost split, although getting to this point may take some time. The number of guesses KdCore3 will attempt can be limited by the user, of course.

Finally, another one of the major improvements I made in KdCore3 was simply better use of C++. Over the past two years, my knowledge of how to write fast, effective C++ has evolved immensely, and I now write code very differently than how I did when I wrote KdCore2 and KdCore1. For example, KdCore3 avoids relying on class inheritance and virtual method table lookup (KdCore2 relied on inheritance quite heavily). Normally, virtual method lookup doesn't add a noticeable amount of execution time to a single virtual method, but when repeated for a few million objects, the slowdown becomes extremely apparent. In talking with my friend Robert Mead, I realized that virtual method table lookup in almost, if not all implementations today necessarily means a minimum of three pointer lookups in memory to find a function, whereas a direct function call is a single pointer lookup.

If I have time to later, I'll post some benchmarks of KdCore3 versus KdCore2. However, for now, here's a final pair of images showcasing a scene with highly variable density processed through KdCore3. Note the keavy amount of nodes clustered where large amounts of geometry exist, and the near total emptyness of the KD-Tree in areas where the scene is sparse:

Next up: implementing some form of highly efficient stackless KD-Tree traversal, possibly even using that history based approach I wrote about before!

Friday, February 8, 2013

Bounding Boxes for Ellipsoids

Warning: this post is going to be pretty math-heavy.

Let's talk about spheres, or more generally, ellipsoids. Specifically, let's talk about calculating axis aligned bounding boxes for arbitrarily transformed ellipsoids, which is a bit of an interesting problem I recently stumbled upon while working on Takua Rev 3. I'm making this post because finding a solution took a lot of searching and I didn't find any single collected source of information on this problem, so I figured I'd post it for both my own reference and for anyone else who may find this useful.

So what's so hard about calculating tight axis aligned bounding boxes for arbitrary ellipsoids?

Well, consider a basic, boring sphere. The easiest way to calculate a tight axis aligned bounding box (or AABB) for a mesh is to simply min/max all of the vertices in the mesh to get two points representing the min and max points of the AABB. Similarly, getting a tight AABB for a box is easy: just use the eight vertices of the box for the min/max process. A naive approach to getting a tight AABB for a sphere seems simple then: along the three axes of the sphere, have one point on each end of the axis on the surface of the sphere, and then min/max. Figure 1. shows a 2D example of this naive approach, to extend the example to 3D, simply add two more points for the Z axis (I drew the illustrations quickly in Photoshop, so apologies for only showing 2D examples):

Figure 1.

This naive approach, however, quickly fails if we rotate the sphere such that its axes are no longer lined up nicely with the world axes. In Figure 2, our sphere is rotated, resulting in a way too small AABB if we min/max points on the sphere axes:

Figure 2.

If we scale the sphere such that it becomes an ellipsoid, the same problem persists, as the sphere is just a subtype of ellipsoid. In Figures 3 and 4, the same problem found in Figures 1/2 is illustrated with an ellipsoid:

Figure 3.

Figure 4.

One possible solution is to continue using the naive min/max axes approach, but simply expand the resultant AABB by some percentage such that it encompasses the whole sphere. However, we have no way of knowing what percentage will give an exact bound, so the only feasible way to use this fix is by making the AABB always larger than a tight fit would require. As a result, this solution is almost as undesirable as the naive solution, since the whole point of this exercise is to create as tight of an AABB as possible for as efficient intersection culling as possible!

Instead of min/maxing the axes, we need to use some more advanced math to get a tight AABB for ellipsoids.

We begin by noting our transformation matrix, which we'll call M. We'll also need the transpose of M, which we'll call MT. Next, we define a sphere S using a 4x4 matrix:

[ r 0 0 0 ]

[ 0 r 0 0 ]

[ 0 0 r 0 ]

[ 0 0 0 -1]

where r is the radius of the sphere. So for a unit diameter sphere, r = .5. Once we have built S, we'll take its inverse, which we'll call SI.

We now calculate a new 4x4 matrix R = M*SI*MT. R should be symmetric when we're done, such that R = transpose(R). We'll assign R's indices the following names:

R = [ r11 r12 r13 r14 ]

[ r12 r22 r23 r24 ]

[ r13 r23 r23 r24 ]

[ r14 r24 r24 r24 ]

Using R, we can now get our bounds:

zmax = (r23 + sqrt(pow(r23,2) - (r33*r22)) ) / r33;

zmin = (r23 - sqrt(pow(r23,2) - (r33*r22)) ) / r33;

ymax = (r13 + sqrt(pow(r13,2) - (r33*r11)) ) / r33;

ymin = (r13 - sqrt(pow(r13,2) - (r33*r11)) ) / r33;

xmax = (r03 + sqrt(pow(r03,2) - (r33*r00)) ) / r33;

xmin = (r03 - sqrt(pow(r03,2) - (r33*r00)) ) / r33;

...and we're done!

Just to prove that it works, a screenshot of a transformed ellipse inside of a tight AABB in 3D from Takua Rev 3's GL Debug view:

I've totally glossed over the mathematical rationale behind this method in this post and focused just on how to quickly get a working implementation, but if you want to read more about the actual math behind how it works, these are the two sources I pulled this from:

Stack Overflow post by user fd
Article by Inigo Quilez

In other news, Takua Rev 3's new scene system is now complete and I am working on a brand new, better, faster, stackless KD-tree implementation. More on that later!

Friday, January 18, 2013

Revision 3, Old Renders

At the beginning of the semester, I decided to re-architect Takua again, hence the lack of updates for a couple of weeks now. I'll talk more in-depth about the details of how this new architecture works in a later post, so for now I'll just quickly describe the motivation behind this second round of re-architecting. As I wrote about before, I've been keeping parallel CPU and GPU branches of my renderer so far, but the two branches have increasingly diverged. On top of that, the GPU branch of my renderer, although significantly better organized than the experimental CUDA renderer from spring 2012, still is rather suboptimal; after TAing CIS565 for a semester, I've developed what I think are some better ways of architecting CUDA code. Over winter break, I began to wonder if merging the CPU and GPU branches might be possible, and if such a task could be done, how I might go about doing it.

This newest re-structuring of Takua accomplishes that goal. I'm calling this new version of Takua "Revision 3", as it is the third major rewrite.

My new architecture centers around a couple of observations. First, we can observe that the lowest common denominator (so to speak) for structured data in CUDA and C++ is... a struct. Similarly, the easiest way to recycle code between CUDA and C++ is to implement code as inlineable, C style functions that can either be embedded in a CUDA kernel at compile time, or wrapped within a C++ class for use in C++. Therefore, one possible way to unify CPU C++ and GPU CUDA codebases could be to implement core components of the renderer using structs and C-style, inlineable functions, allowing easy integration into CUDA kernels, and then write thin wrapper classes around said structs and functions to allow for nice, object oriented C++ code. This exact system is how I am building Takua Revision 3; the end result should be a unified codebase that can compile to both CPU and GPU versions, and allow for both versions to develop in near lockstep.

Again, I'll go into a more detailed explanation once this process is complete.

I'll leave this post with a slightly orthogonal note; whilst in the process of merging code, I found some images from Takua Revision 1 that I never posted for some reason. Here's a particularly cool pair of images from when I was implementing depth of field. The first image depicts a glass Stanford dragon without any depth of field, and the second image depicts the same exact scene with some crazy shallow aperture (I don't remember the exact settings). You can tell these are from the days of Takua Revision 1 by the ceiling; I often made the entire ceiling a light source to speed up renders back then, until Revision 2's huge performance increases rendered cheats like that unnecessary.

Glass Stanford dragon without depth of field

Glass Stanford dragon with depth of field

Tuesday, December 18, 2012

Texture Mapping

A few weeks back I started work on another piece of super low-hanging fruit: texture mapping! Before I delve into the details, here's a test render showing three texture mapped spheres with varying degrees of glossiness in a glossy-walled Cornell box. I was also playing with logos for Takua render and put a test logo idea on the back wall for fun:

...and the same scene with the camera tilted down just to show off the glossy floor (because I really like the blurry glossy reflections):

My texturing system can, of course, support textures of arbitrary resolution. The black and white grid and colored UV tile textures in the above render are square 1024x1024, while the Earth texture is rectangular 1024x512. Huge textures are handled just fine, as demonstrated by the following render using a giant 2048x2048, color tweaked version of Simon Page's Space Janus wallpaper:

Of course UV transformations are supported. Here's the same texture with a 35 degree UV rotation applied and tiling switched on:

Since memory is always at a premium, especially on the GPU, I've implemented textures in a fashion inspired by geometry instancing and node based material systems, such as the system for Maya. Inside of my renderer, I represent texture files as a file node containing the raw image data, streamed from disk via stb_image. I then apply transformations, UV operations, etc through a texture properties node, which maintains a pointer to the relevant texture file node, and then materials point to whatever texture properties nodes they need. This way, texture data can be read and stored once in memory and recycled as many times as needed, meaning that a well formatted scene file can altogether eliminate the need for redundant texture read/storage in memory. This system allows me to create amusing scenes like the following one, where a single striped texture is reused in a number of materials with varied properties:

Admittedly I made that stripe texture really quickly in Photoshop without too much care for straightness of lines, so it doesn't actually tile very well. Hence why the sphere in the lower front shows a discontinuity in its texture... that's not glitchy UVing, just a crappy texture!

I've also gone ahead and extended my materials system to allow any material property to be driven with a texture. In fact, the stripe room render above is using the same stripe texture to drive reflectiveness on the side walls, resulting in reflective surfaces where the texture is black and diffuse surfaces where the texture is white. Here's another example of texture driven material properties showing emission being driven using the same color-adjusted Space Janus texture from before:

Even refractive and reflective index of refraction can be driven with textures, which can yield some weird/interesting results. Here are a pair of renders showing a refractive red cube with uniform IOR, and with IOR driven with a Perlin noise map:

Uniform refractive IOR

Refractive IOR driven with a Perlin noise texture map

The nice thing about a node-style material representation is that I should be able to easily plug in procedural functions in place of textures whenever I get around to implementing some (that way I can use procedural Perlin noise instead of using a noise texture).

Here's an admittedly kind of ugly render using the color UV grid texture to drive refractive color:

For some properties, I've had to add a requirement to specify a range of valid values by the user when using a texture map, since RGB values don't map well to said properties. An example would be glossiness, where a gloss value range of 0% to 100% leaves little room for detailed adjustment. Of course this issue can be fixed by adding support for floating point image formats such as OpenEXR, which is coming very soon! In the following render, the back wall's glossiness is being driven using the stripe texture (texture driven IOR is also still in effect on the red refractive cube):

Of course, even with nice instancing schemes, textures potentially can take up a gargantuan amount of memory, which poses a huge problem in the GPU world where onboard memory is at a premium. I still need to think more about how I'm going to deal with memory footprints larger than on-device memory, but at the moment my plan is to let the renderer allocate and overflow into pinned host memory whenever it detects that the needed footprint is within some margin of total available device memory. This concern is also a major reason why I've decided to stick with CUDA for now... until OpenCL gets support for a unified address space for pinned memory, I'm not wholly sure how I'm supposed to deal with memory overflow issues in OpenCL. I haven't reexamine OpenCL in a little while now though, so perhaps it is time to take another look.

Unfortunately, something I discovered while in the process of extending my material system to support texture driven properties is that my renderer could probably use a bit of refactoring for the sake of organization and readability. Since I now have some time over winter break and am planning on making my Github repo for Takua-RT public soon, I'll probably undertake a bit of code refactoring over the next few weeks.