Wednesday 3 February 2016

Let's Write A Raytracer Part 2 : (CPU) kd Trees

Let's back away from GPU ray tracing for a second. Noting the performance of the (oh so basic) ray tracer from the last article, I grew curious about just how much better, or worse, CPUs are at tracing rays than GPUs. In addition, because CPUs let you write the software in a full, unrestricted language like C++, it's a much more enjoyable experience. Let's get started then.

I've created a GitHub repo for this project, which you can find here.

CPU Raytracer Architecture
I'll describe the ray tracer architecture as it is at this point in time, but check the git log itself for any further updates.

Let's start with the main component, the abstract Raytracer class. This class defines the interface that the application uses to render scenes. It also contains pointers to the camera, frame buffer and scene objects. Why is it abstract? Because there are several ways we can generate rays to create the final image. For instance, you can take multiple samples per pixel, you can process the framebuffer in a tiled manner, you can process rays using one or more CPU cores, etc. It currently has one concrete implementation, BasicRaytracer, which generates and processes pixels on a single thread, in scanline order, with one sample per pixel (hey, it's called "basic" for a reason!).
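To make that concrete, here's a minimal sketch of what the interface could look like. The names and signatures are illustrative assumptions, not copied from the repository:

// Illustrative sketch only -- the names and signatures are assumptions,
// not the repository's actual headers.
class Camera;
class FrameBuffer;
class IScene;

class Raytracer
{
public:
    Raytracer(Camera* camera, FrameBuffer* frameBuffer, IScene* scene)
        : m_camera(camera), m_frameBuffer(frameBuffer), m_scene(scene) {}
    virtual ~Raytracer() {}

    // Concrete implementations decide how rays are generated: sampling
    // strategy, tile vs. scanline order, number of threads, and so on.
    virtual void Render() = 0;

protected:
    Camera*      m_camera;
    FrameBuffer* m_frameBuffer;
    IScene*      m_scene;
};

// Single thread, scanline order, one sample per pixel.
class BasicRaytracer : public Raytracer
{
public:
    using Raytracer::Raytracer;
    void Render() override;
};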

The FrameBuffer class contains the memory that will be rendered into, and methods for saving its contents to a .tga file afterwards. Nothing special here.
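For the curious, here's a minimal sketch of what an uncompressed 24-bit .tga writer can look like. The 18-byte header layout follows the TGA spec, but the class shape is my own assumption rather than what's in the repo:

#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of a simple RGB frame buffer with an uncompressed 24-bit TGA writer.
class FrameBuffer
{
public:
    FrameBuffer(uint16_t width, uint16_t height)
        : m_width(width), m_height(height), m_pixels(width * height * 3) {}

    void SetPixel(uint16_t x, uint16_t y, uint8_t r, uint8_t g, uint8_t b)
    {
        size_t i = (static_cast<size_t>(y) * m_width + x) * 3;
        m_pixels[i + 0] = b;  // TGA stores pixels in BGR order
        m_pixels[i + 1] = g;
        m_pixels[i + 2] = r;
    }

    bool SaveToTga(const char* path) const
    {
        FILE* file = std::fopen(path, "wb");
        if (!file) return false;
        // 18-byte header: image type 2 = uncompressed true colour,
        // rows are stored bottom-up by default.
        uint8_t header[18] = {0};
        header[2]  = 2;
        header[12] = m_width & 0xFF;   header[13] = m_width >> 8;
        header[14] = m_height & 0xFF;  header[15] = m_height >> 8;
        header[16] = 24;  // bits per pixel
        std::fwrite(header, 1, sizeof(header), file);
        std::fwrite(m_pixels.data(), 1, m_pixels.size(), file);
        std::fclose(file);
        return true;
    }

private:
    uint16_t m_width, m_height;
    std::vector<uint8_t> m_pixels;
};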

Now for the tricky part, and this is best explained with the help of a class diagram:


Any object that accepts a ray and does some work with it implements the ITraceable interface.
As you can see, IScene is traceable, but also defines the interface for a collection of ITraceable objects. This makes sense if you consider that a ray can then just be passed into a scene, which contains other objects that are themselves traceable.
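In code, the relationship looks something like this. It's only a sketch; the method names and signatures are my assumptions rather than the repository's exact API:

// Sketch of the interfaces from the class diagram above.
struct Ray;
struct Intersection;

class ITraceable
{
public:
    virtual ~ITraceable() {}
    // Returns true and fills in 'hit' if the ray intersects this object.
    virtual bool Trace(const Ray& ray, Intersection& hit) const = 0;
};

// A scene is itself traceable, and is also a collection of traceable objects.
class IScene : public ITraceable
{
public:
    virtual void AddObject(ITraceable* object) = 0;
};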

The BoundedTraceable class simply defines a traceable object with an axis-aligned bounding box added for culling and intersection purposes; what space (object or world) these bounding boxes represent varies depending on context. Unlike the GPU raytracer, we don't simply transform all the triangles that make up the scene into world space prior to intersection testing. Here the core geometry objects themselves are defined in object space in the standard way, and we transform the rays from world space into object space prior to any triangle intersection tests. This is what the GeometryInstance class does: it contains the world-space orientation, position and scale (represented in a Basis class) of a piece of geometry, and transforms rays into object space before passing them off to the contained *Geometry class. There are currently two of these: the BasicGeometry class, which simply stores a list of triangles to intersect with, and the KdTreeGeometry class, which stores the triangles in a kd tree acceleration structure. More on that later.
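To illustrate the transform step, here's a rough sketch (not the repo's code; the Basis methods and the Ray/Intersection members are placeholders for whatever the real classes provide, and the BoundedTraceable base is elided):

// Sketch of the world-to-object transform step, building on the interface
// sketch above. Basis and its methods are assumed placeholders.
class GeometryInstance : public ITraceable  // a BoundedTraceable in the real hierarchy
{
public:
    bool Trace(const Ray& worldRay, Intersection& hit) const override
    {
        // Transform the ray into the geometry's object space instead of
        // transforming every triangle into world space.
        Ray objectRay;
        objectRay.origin    = m_basis.WorldToObjectPoint(worldRay.origin);
        objectRay.direction = m_basis.WorldToObjectDirection(worldRay.direction);

        if (!m_geometry->Trace(objectRay, hit))
            return false;

        // Bring the results back into world space for shading.
        hit.position = m_basis.ObjectToWorldPoint(hit.position);
        hit.normal   = m_basis.ObjectToWorldNormal(hit.normal);
        return true;
    }

private:
    Basis       m_basis;     // world-space orientation, position and scale
    ITraceable* m_geometry;  // BasicGeometry or KdTreeGeometry
};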

Let's evaluate the performance of the most basic combination in the new system: with the BasicGeometry structure, every ray has to test against the entire set of triangles that make up the Blender Monkey mesh. For the 1280x720 scene shown here:



rendering takes around 18.5 seconds. As this is on an AMD Bulldozer CPU, we can pretty much divide that result by 6 (or 8) for a parallel implementation. In the best case that's still around 2.3 seconds for a single frame. Hardly real time, I'm sure we all agree.

Kd Tree Acceleration
The next thing to do is investigate acceleration structures for the triangle meshes that comprise our scene geometry. There are a number of them out there, but kd trees are the generally accepted best solution.

Kd trees are simple enough in concept. They are binary space partitioning (BSP) trees that restrict the choice of splitting planes to be aligned with one of the three axes, and where only the leaf nodes of the tree contain any geometry. You have a k-dimensional region of space, and you subdivide it by splitting it along one of its cardinal axes at each level of a binary tree. If you're not familiar with the concept it takes a little bit of visualization. Imagine you have a rectangular region of space, and you want to carve it up so you only consider smaller regions of space at a time. You start by picking an axis to subdivide along, say the x-axis. You carve space into two regions along that axis and store them in the child nodes of the root node of the tree. Then for each of those child nodes, pick an axis, say the y-axis, and subdivide again. See below:



You repeat this until you reach some termination criterion. This can be that fewer than a certain number of triangles are contained in the node, or that the maximum tree depth has been reached. More on the construction algorithms and their resulting performance later.
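To make the structure concrete, here's one possible (unoptimised) node layout. The field names are my own, and a production tree would pack this far tighter (PBRT squeezes each node into 8 bytes):

#include <vector>

// One possible (unoptimised) kd tree node layout.
struct KdTreeNode
{
    int   splitAxis     = -1;    // 0 = x, 1 = y, 2 = z, -1 = leaf
    float splitPosition = 0.0f;  // position of the splitting plane on that axis

    KdTreeNode* children[2] = { nullptr, nullptr };  // below / above the plane

    std::vector<int> triangleIndices;  // only populated for leaf nodes

    bool IsLeaf() const { return splitAxis < 0; }
};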

Kd Tree Traversal
How do kd trees accelerate the time it takes to determine the closest ray intersection?
For each kd tree node you have the following pieces of information at hand:

  • The bounding box encapsulating the node and all of its children;
  • The splitting plane;
  • The incoming ray.

The basic idea is to use the above to compute three t values for the incoming ray that correspond to points of interest in relation to the kd tree node (where the points are defined by ray_position + ray_direction * t). The t values are where the ray enters the bounding volume of the kd tree node, where it intersects the splitting plane of the kd tree node (if it does indeed intersect the splitting plane), and finally where it exits the bounding volume of the kd tree node. Using these t values we can determine the optimal ordering for traversing the nodes, and can possibly even remove entire nodes from further consideration. Consider the below configurations:

tEntry < tSplit < tExit:
The ray crosses the splitting plane inside the node, so traverse the current side first, before traversing the opposite side.

tSplit > tExit (or the ray doesn't intersect the splitting plane at all):
The ray leaves the node before ever reaching the plane, so there's no need to consider the opposite side.

tSplit < tEntry:
The ray has already crossed the plane by the time it enters the node, so there's no need to consider the current side.

Note how the "current side" is defined as the side of the splitting plane that the ray position is in.
When you reach a leaf node, you run through its list of triangles and intersect against them. When the current side child node reports an intersection, you usually don't need to descend into the opposite side node, as by definition the current side child is closer than the opposite side child. The only gotcha you need to watch out for is when a child node reports an intersection, but the returned intersection point is on the opposite side of the splitting plane. This can happen when a triangle extends into both child nodes. It's possible for a triangle in the opposite node to actually be obscuring the current point, so you need to descend into that node as well to be sure.

In the codebase, the intersection routine is found as an implementation of the IKdTreeTraversal interface, called KdTreeStackTraversal. It's a recursive function, which is probably sub-optimal but good enough for now.
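For reference, here's a minimal recursive traversal sketch that follows the case analysis above. It is not the repository's KdTreeStackTraversal code; Ray, Intersection and the triangle intersection helper are assumed to look roughly like the earlier sketches:

// Assumed helper: intersects the ray against the given triangles, updating
// 'hit' only if a closer intersection (smaller hit.t) is found.
bool IntersectTriangles(const std::vector<int>& triangleIndices,
                        const Ray& ray, Intersection& hit);

// The initial call uses the t values where the ray enters and exits the
// bounding box of the root node.
bool Traverse(const KdTreeNode* node, const Ray& ray,
              float tEntry, float tExit, Intersection& hit)
{
    if (node->IsLeaf())
        return IntersectTriangles(node->triangleIndices, ray, hit);

    int   axis   = node->splitAxis;
    float tSplit = (node->splitPosition - ray.origin[axis]) / ray.direction[axis];

    // The "current" (near) side is the side of the plane the ray origin is on.
    bool originBelow = ray.origin[axis] < node->splitPosition;
    const KdTreeNode* nearChild = node->children[originBelow ? 0 : 1];
    const KdTreeNode* farChild  = node->children[originBelow ? 1 : 0];

    if (tSplit > tExit || tSplit <= 0.0f)
    {
        // The ray exits the node before reaching the plane (or never reaches
        // it at all): only the current side matters.
        return Traverse(nearChild, ray, tEntry, tExit, hit);
    }
    if (tSplit < tEntry)
    {
        // The ray has already crossed the plane before entering the node:
        // only the opposite side matters.
        return Traverse(farChild, ray, tEntry, tExit, hit);
    }

    // The plane is crossed inside the node: current side first.
    bool hitNear = Traverse(nearChild, ray, tEntry, tSplit, hit);
    if (hitNear && hit.t <= tSplit)
        return true;  // the hit is on the near side, so the far side can't be closer

    // Either nothing was hit, or the hit point lies beyond the plane (a
    // triangle straddling both children), so the far side must be checked too.
    bool hitFar = Traverse(farChild, ray, tSplit, tExit, hit);
    return hitNear || hitFar;
}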

Kd Tree Construction
The choice of splitting plane axis and position, as well as how deep the resulting kd tree is for a particular dataset, has a substantial impact on the final performance of the algorithm. I've implemented two construction algorithms that you can toggle between in the codebase. The first I call naive spatial median, and the second is based on the surface area heuristic (SAH).

The naive spatial median algorithm simply constructs a kd tree to a specified depth by dividing the node in half at every iteration, and then, for the final leaf nodes, determining which triangles overlap them. It's simple to implement and results in a vast improvement over the flat list of triangles approach. However, evenly splitting the dataset along an axis isn't optimal, as it doesn't deal with uneven distributions of geometry. And providing the tree depth as an input to the algorithm can result in trees that are too shallow or too deep (and this will vary from dataset to dataset). Traversal has a cost at each level of the tree, and that cost can make a deep tree slower to search than a shallower one.
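As a rough sketch (not the repo's code), the recursion looks something like this. It filters the triangle list at each level rather than only at the leaves, which ends up with the same leaf contents, and the Aabb type and the two helper functions are assumptions:

// Assumed helpers: split a box in two at the given plane, and test whether a
// triangle (by index) overlaps a box.
void SplitBox(const Aabb& box, int axis, float position, Aabb& below, Aabb& above);
bool TriangleOverlapsBox(int triangleIndex, const Aabb& box);

const size_t kMinTrianglesPerLeaf = 4;  // tunable termination criterion

KdTreeNode* BuildMedianTree(const Aabb& bounds, const std::vector<int>& triangles,
                            int depth, int maxDepth)
{
    KdTreeNode* node = new KdTreeNode();

    if (depth == maxDepth || triangles.size() <= kMinTrianglesPerLeaf)
    {
        node->triangleIndices = triangles;  // leaf: store the overlapping triangles
        return node;
    }

    // Cycle through x, y, z as we descend, splitting the node exactly in half.
    int axis = depth % 3;
    node->splitAxis     = axis;
    node->splitPosition = 0.5f * (bounds.min[axis] + bounds.max[axis]);

    Aabb belowBounds, aboveBounds;
    SplitBox(bounds, axis, node->splitPosition, belowBounds, aboveBounds);

    // A triangle straddling the plane goes into both child lists.
    std::vector<int> below, above;
    for (int tri : triangles)
    {
        if (TriangleOverlapsBox(tri, belowBounds)) below.push_back(tri);
        if (TriangleOverlapsBox(tri, aboveBounds)) above.push_back(tri);
    }

    node->children[0] = BuildMedianTree(belowBounds, below, depth + 1, maxDepth);
    node->children[1] = BuildMedianTree(aboveBounds, above, depth + 1, maxDepth);
    return node;
}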

The surface area heuristic algorithm attempts to create an optimal tree by factoring in, at each iteration of the construction, whether it would be more beneficial to subdivide the node or to create a leaf node. Let me see if I can sum this up in a paragraph. The core idea is that, at any level of the kd tree, there is a choice between subdividing further or simply creating a leaf node for the current region. If you choose to create a leaf node, the cost of processing a ray that reaches the node is simply the estimated time it takes to intersect a ray with a triangle, multiplied by the number of triangles in the leaf. The second option is to subdivide, in which case the cost of processing a ray can be estimated as the cost of the traversal step, plus, for each child node, the probability that the ray will intersect that child multiplied by the cost of making it a leaf (we don't consider the cost of subdividing the children further at this level, as that would complicate the algorithm too much). To sum up:

leaf_cost = triangle_intersection_cost * num_triangles;

subdivide_cost = traversal_cost
               + probability_a * (num_triangles_a * triangle_intersection_cost)
               + probability_b * (num_triangles_b * triangle_intersection_cost);

You don't have to provide "accurate" (whatever that means in the context of differing CPUs anyway) values for traversal_cost and triangle_intersection_cost; it's the relative cost that is important. Is your triangle intersection code faster than your traversal step? Then set it to an appropriate fraction of the traversal cost.

The probabilities are where the surface area part of the algorithm comes into play. Geometric probability theory tells us the following:
  • Assume you have a random ray that passes through a convex volume A.
  • Assume you have another convex volume B that is contained in convex volume A.
  • Then the probability that the random ray will also pass through convex volume B is the surface area of B divided by the surface area of A.

For an axis aligned bounding box, its surface area can be computed as 2 * ((width * height) + (width * length) + (height * length)).
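In code that's just the following (a sketch, assuming a simple Aabb type with min/max vectors):

// Surface area of an axis-aligned bounding box: 2 * (wh + wl + hl).
float SurfaceArea(const Aabb& box)
{
    float w = box.max.x - box.min.x;
    float h = box.max.y - box.min.y;
    float l = box.max.z - box.min.z;
    return 2.0f * (w * h + w * l + h * l);
}

// Probability that a ray known to pass through 'parent' also passes through
// 'child', where 'child' is contained in 'parent'.
float Probability(const Aabb& child, const Aabb& parent)
{
    return SurfaceArea(child) / SurfaceArea(parent);
}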

So when you choose your splitting plane, you get two resulting sub-volumes of the current node, and can compute their surface areas, the number of overlapping triangles, and their corresponding probabilities. That's great, but how do we choose the correct splitting plane? There are infinitely many possible positions for it within the bounding volume of the node you're currently evaluating. But again, smart people who have come before us have already determined that the only candidate splitting positions that can actually minimise the final cost are the minimum and maximum triangle vertex positions on our axis of choice. (Tip: compute the bounding volume for each of the triangles of the mesh at the beginning and use their min/max elements.)
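Putting the pieces together, the candidate evaluation along one axis can look roughly like this (again only a sketch; SplitBox is the helper declared in the earlier sketch, CountOverlaps is another assumed helper, and SurfaceArea/Probability are the functions above):

// Sketch of SAH split selection along a single axis; not the repository's code.
struct SplitChoice
{
    bool  makeLeaf = true;
    float position = 0.0f;
    float cost     = 0.0f;
};

// Assumed helper: counts how many of the given triangles overlap the box.
size_t CountOverlaps(const std::vector<int>& triangles, const Aabb& box);

SplitChoice ChooseSplit(const Aabb& nodeBounds, int axis,
                        const std::vector<int>& triangles,
                        const std::vector<float>& candidatePositions,
                        float traversalCost, float intersectionCost)
{
    SplitChoice best;
    best.cost = intersectionCost * triangles.size();  // option one: just make a leaf

    for (float position : candidatePositions)
    {
        Aabb below, above;
        SplitBox(nodeBounds, axis, position, below, above);

        size_t numBelow = CountOverlaps(triangles, below);
        size_t numAbove = CountOverlaps(triangles, above);

        float cost = traversalCost
                   + Probability(below, nodeBounds) * numBelow * intersectionCost
                   + Probability(above, nodeBounds) * numAbove * intersectionCost;

        // Option two: subdivide here, if it's cheaper than anything seen so far.
        if (cost < best.cost)
        {
            best.makeLeaf = false;
            best.position = position;
            best.cost     = cost;
        }
    }
    return best;
}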

I'm describing the SAH algorithm in the context of accelerating a triangle mesh here, but it can be extended to the entire scene as well if you use kd trees as your scene subdivision. I haven't done that yet, but check the repo at a later stage.

I might have confused you more than helped you with this explanation. Sorry about that, but it's hard to convey complicated algorithms in a couple of sentences. If you're really interested in writing something yourself, you're going to be reading the proper academic papers alongside this, and then it will hopefully make much more sense. Speaking of which, most of my knowledge for building this thing came from a mighty tome called "Physically Based Rendering: From Theory to Implementation". Do yourself a favour and go and purchase this book (link), as it contains the most useful and easy to digest knowledge on the subject of raytracing, kd tree construction included.

Performance
Performance improvements with kd trees are substantial. With the naive spatial median kd tree, rendering time fell to ~790 milliseconds on a single core (~23.5x faster). This was for a hand-tuned kd tree depth, as fast as I could make it. Switching to the SAH brings that number down further, to ~615 milliseconds (~30x faster). Were you to parallelise that across 6 or 8 cores, you'd have a result that is competitive with the GPU raytracer results on a GTX760 from the last article. How interesting...