Sunday, 31 August 2014

Rendering Post: Cascaded Shadow Maps

I'm going to talk about the initial implementation of shadows I've done for the engine so far. To be honest I'm not even sure if this exact method is actually used in games, because I just sat down and coded something up over a weekend. But here goes! If you have never heard of cascaded shadow maps before, an excellent starting point is here.

Because the engine I'm building only needs to support wide outdoor environments (it's an RTS), the most important light type I need to cater for is the directional light from the sun. So no omni-directional or spot lights need to be done for now.

Let's outline the basics of the algorithm.
Part 1. Generate the shadow maps.
Bind a framebuffer with render-to-texture for the depth buffer and switch the GPU to double-speed z-only rendering.
for each cascade 
1). Calculate the truncated view frustum of the main camera in world space and use it to determine the position, bounds, and projection matrix of the light. 
2). Render all relevant geometry of the scene into the shadow map.

Part 2. The directional light shading pass during the deferred lighting stage.
for each pixel
1). Read the corresponding depth, linearize it, and calculate the world space position of the pixel.
2). Project the world space position into the current cascade's light space, and run the depth comparison and filtering against the depth stored in the shadow map.

That's literally it. The beautiful thing about shadow mapping, as opposed to something like shadow volumes, is the conceptual simplicity of the technique. But it comes at the expense of grievances like having to manage shadow map offsets and deal with issues like duelling frusta etc. Another benefit of doing it this way (during the deferred lighting stage) is that you get free self shadowing that's completely generic over the entire world. If there's a point in the world, it will get shadowed correctly (provided your shadow maps are generated correctly as well).

Now let's get into the details.

Calculate the view volume of the light for each cascade slice.
The most important thing you need to do here is ensure that all points enclosed in the frustum slice of the main camera will be included in the light's viewing volume. Not only that, but the light's viewing volume should extend from the camera frustum back towards the light itself, so that any points that are outside the camera's viewing volume but potentially casting shadows onto the points inside it are also accounted for.

This simple diagram should explain.

The way I defined the light's viewing volume was to set up the light's camera space and then transform all eight corners of the cascade slice frustum into that space. Since we're dealing with an orthographic projection, that makes it easy to find the maximum and minimum extents along each axis (with the exception of the near z plane, which is always set to 1.0f). From that, you can create your light basis clipping planes and then, for each element in the world, if it's inside both pairs of clipping planes, it gets rendered into the shadow map.
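To make that concrete, here's a minimal sketch of building one cascade's light matrices from the slice's eight world-space corners. I'm using glm purely for illustration (the engine has its own math types), and the pull-back distance is an assumption you'd tune so that off-screen casters between the light and the slice still land inside the volume.

 #include <glm/glm.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 #include <limits>

 // Build the light's view and orthographic projection matrices for one
 // cascade from the eight world-space corners of the camera frustum slice.
 void BuildCascadeMatrices(const glm::vec3 corners[8], const glm::vec3& lightDir,
                           glm::mat4& lightView, glm::mat4& lightProj)
 {
     // Centre of the slice anchors the light's camera space.
     glm::vec3 centre(0.0f);
     for (int i = 0; i < 8; ++i)
         centre += corners[i];
     centre /= 8.0f;

     // Pull the light's eye point back along the light direction so that
     // casters outside the slice can still fall into the volume (assumed value).
     const float pullBack = 500.0f;
     lightView = glm::lookAt(centre - lightDir * pullBack, centre,
                             glm::vec3(0.0f, 1.0f, 0.0f));

     // Transform the slice corners into light space and take min/max extents.
     glm::vec3 mins(std::numeric_limits<float>::max());
     glm::vec3 maxs(-std::numeric_limits<float>::max());
     for (int i = 0; i < 8; ++i)
     {
         glm::vec3 p = glm::vec3(lightView * glm::vec4(corners[i], 1.0f));
         mins = glm::min(mins, p);
         maxs = glm::max(maxs, p);
     }

     // Orthographic volume over those extents; near plane fixed at 1.0 as in
     // the post, far plane pushed out to the farthest point of the slice.
     lightProj = glm::ortho(mins.x, maxs.x, mins.y, maxs.y, 1.0f, -mins.z);
 }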

A note here on level of detail:
You want to make sure that the level of detail selected for the mesh or terrain node you're rendering into the shadow map is based off of the main camera (not the light) to avoid artifacts. But depending on how you do things, light-based LOD selection may be tolerable, and it will certainly be faster.

Reconstructing the world space position.

If you read the previous article, this is an example of why you need to be able to correctly read the linear depth from the depth buffer. If you don't, your reconstructed world space positions will be swimming around and you'll notice horribly weird artifacts. A good debug view is to visualize the generated world space positions normalized to the bounds of the current map (for me that meant calculating the world space position and dividing by roughly 4000, since the map is 4 km squared).
You should get something like this:



And you should note that as the camera moves around the scene, the colors in the scene should not change... at all. If they do, that's the first sign you're doing something wrong.

The way I reconstructed the world space position was to calculate the vectors from the camera frustum near corners to the far corners, transform those vectors into world space, and then pass them as attributes to be interpolated over the screen during the full screen pass. From there you take the interpolated view vector, multiply it by the linear depth read from the depth buffer, add the camera position, and you've got your world space position.
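Here's a small sketch of the CPU side of that, using glm for illustration. I build the corner vectors from the camera position out to the far-plane corners rather than from the near corners; since the near plane sits right on top of the camera it's effectively the same ray, and it lines up with the shader below, which scales the interpolated vector by viewZ / far before adding the camera position.

 #include <glm/glm.hpp>

 // Build the four world-space frustum vectors that are attached to the
 // corners of the full-screen quad and interpolated as vFrustumVector.
 void BuildFrustumVectors(const glm::mat4& invViewProj,
                          const glm::vec3& cameraPosition, glm::vec3 out[4])
 {
     const glm::vec2 ndc[4] = { glm::vec2(-1.0f, -1.0f), glm::vec2(1.0f, -1.0f),
                                glm::vec2(1.0f,  1.0f), glm::vec2(-1.0f,  1.0f) };
     for (int i = 0; i < 4; ++i)
     {
         // Unproject each far-plane corner (NDC z = 1) back into world space.
         glm::vec4 farPoint = invViewProj * glm::vec4(ndc[i], 1.0f, 1.0f);
         farPoint /= farPoint.w;

         // Vector from the camera to that corner; at a pixel, scaling this by
         // the normalised linear depth (viewZ / far) and adding the camera
         // position reproduces the pixel's world space position.
         out[i] = glm::vec3(farPoint) - cameraPosition;
     }
 }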

Doing the final lighting pass.
After you've got your shadow maps and you have your world space positions for each pixel, inside the final shader all you need to do is transform the pixel from world space into light space and run your comparison. Here's the really basic fragment shader I used. There are a bunch of optimizations (and modernizations; it's running off of the old GLSL 1.10 spec) still to do, and the filtering is only the hardware PCF at the moment, but it should be a good reference point to just get something on screen.

 #version 110   
   
 //#define DEBUG   
   
 /// Texture units.  
 uniform sampler2D albedoBuffer;  
 uniform sampler2D normalBuffer;  
 uniform sampler2D depthBuffer;  
   
 uniform sampler2DShadow shadowCascade0;  
 uniform sampler2DShadow shadowCascade1;  
 uniform sampler2DShadow shadowCascade2;  
 uniform sampler2DShadow shadowCascade3;  
   
 varying vec2 vTexCoord0;  
   
 /// Contains the components A, B, n, f in that order.  
 /// Used for depth linearization.  
 uniform vec4 ABnf;  
   
 /// World space reconstruction.  
 varying vec3 vFrustumVector;  
 uniform vec3 cameraPosition;  
   
 /// Lighting values.  
 uniform vec3 viewspaceDirection;  
 uniform vec3 lightColor;  
   
 uniform mat4 cascade0WVP;  
 uniform mat4 cascade1WVP;  
 uniform mat4 cascade2WVP;  
 uniform mat4 cascade3WVP;  
   
 void ProcessCascade0(in vec4 cascadeClipSpacePosition)  
 {  
     #ifdef DEBUG  
     gl_FragColor.rgb *= vec3(2.0, 0.5, 0.5);  
     #endif  
   
     // Get depth in light space.  
     cascadeClipSpacePosition.xyz += 1.0;  
     cascadeClipSpacePosition.xyz *= 0.5;  
   
     cascadeClipSpacePosition.z -= 0.0005;  
     cascadeClipSpacePosition.w = 1.0;  
     float multiplier = shadow2DProj(shadowCascade0, cascadeClipSpacePosition).r;  
     gl_FragColor.rgb *= multiplier;      
 }  
   
 void ProcessCascade1(in vec4 cascadeClipSpacePosition)  
 {  
     #ifdef DEBUG  
     gl_FragColor.rgb *= vec3(0.5, 2.0, 0.5);  
     #endif  
   
     // Get depth in light space.  
     cascadeClipSpacePosition.xyz += 1.0;  
     cascadeClipSpacePosition.xyz *= 0.5;  
   
     cascadeClipSpacePosition.z -= 0.001;  
     cascadeClipSpacePosition.w = 1.0;  
     float multiplier = shadow2DProj(shadowCascade1, cascadeClipSpacePosition).r;  
     gl_FragColor.rgb *= multiplier;      
 }  
   
 void ProcessCascade2(in vec4 cascadeClipSpacePosition)  
 {  
     #ifdef DEBUG  
     gl_FragColor.rgb *= vec3(0.5, 0.5, 2.0);  
     #endif  
       
     // Get depth in light space.  
     cascadeClipSpacePosition.xyz += 1.0;  
     cascadeClipSpacePosition.xyz *= 0.5;  
   
     cascadeClipSpacePosition.z -= 0.002;  
     cascadeClipSpacePosition.w = 1.0;  
     float multiplier = shadow2DProj(shadowCascade2, cascadeClipSpacePosition).r;  
     gl_FragColor.rgb *= multiplier;      
 }  
   
 void ProcessCascade3(in vec4 cascadeClipSpacePosition)  
 {  
     #ifdef DEBUG  
     gl_FragColor.rgb *= vec3(2.0, 2.0, 0.5); // distinct debug tint for cascade 3  
     #endif  
   
     // Get depth in light space.  
     cascadeClipSpacePosition.xyz += 1.0;  
     cascadeClipSpacePosition.xyz *= 0.5;  
   
     cascadeClipSpacePosition.z -= 0.0025;  
     cascadeClipSpacePosition.w = 1.0;  
     float multiplier = shadow2DProj(shadowCascade3, cascadeClipSpacePosition).r;  
     gl_FragColor.rgb *= multiplier;      
 }  
   
 void StartCascadeSampling(in vec4 worldSpacePosition)  
 {  
     vec4 cascadeClipSpacePosition;  
     cascadeClipSpacePosition = cascade0WVP * worldSpacePosition;  
     if (abs(cascadeClipSpacePosition.x) < 1.0 &&   
         abs(cascadeClipSpacePosition.y) < 1.0 &&   
         abs(cascadeClipSpacePosition.z) < 1.0)  
     {  
         ProcessCascade0(cascadeClipSpacePosition);  
         return;  
     }  
       
     cascadeClipSpacePosition = cascade1WVP * worldSpacePosition;  
     if (abs(cascadeClipSpacePosition.x) < 1.0 &&   
         abs(cascadeClipSpacePosition.y) < 1.0 &&   
         abs(cascadeClipSpacePosition.z) < 1.0)  
     {  
         ProcessCascade1(cascadeClipSpacePosition);  
         return;  
     }  
     cascadeClipSpacePosition = cascade2WVP * worldSpacePosition;  
     if (abs(cascadeClipSpacePosition.x) < 1.0 &&   
         abs(cascadeClipSpacePosition.y) < 1.0 &&   
         abs(cascadeClipSpacePosition.z) < 1.0)  
     {  
         ProcessCascade2(cascadeClipSpacePosition);  
         return;  
     }  
       
     cascadeClipSpacePosition = cascade3WVP * worldSpacePosition;  
     if (abs(cascadeClipSpacePosition.x) < 1.0 &&   
         abs(cascadeClipSpacePosition.y) < 1.0 &&   
         abs(cascadeClipSpacePosition.z) < 1.0)  
     {  
         ProcessCascade3(cascadeClipSpacePosition);  
         return;  
     }  
 }  
   
 void main()  
 {  
     float A = ABnf.x;  
     float B = ABnf.y;  
     float n = ABnf.z;  
     float f = ABnf.w;  
   
     // Get the initial z value of the pixel.  
     float z = texture2D(depthBuffer, vTexCoord0).x;  
     z = (2.0 * z) - 1.0;  
       
     // Transform into view space.      
     float zView = -B / (z + A);  
     zView /= -f;  
   
     // Normalize zView.  
     vec3 intermediate = (vFrustumVector * zView);  
       
     vec4 worldSpacePosition = vec4(intermediate + cameraPosition, 1.0);  
       
     vec3 texColor = texture2D(albedoBuffer, vTexCoord0).rgb;  
       
     // Do lighting calculation.  
     vec3 normal = texture2D(normalBuffer, vTexCoord0).rgb * 2.0 - 1.0;  
       
     float dotProduct = dot(viewspaceDirection, normal);  
   
     gl_FragColor.rgb = (max(texColor * dotProduct, 0.0));  
     gl_FragColor.rgb *= lightColor;  
       
     // Now we can transform the world space position of the pixel into the shadow map spaces and   
     // see if they're in shadow.  
     StartCascadeSampling(worldSpacePosition);  
       
     gl_FragColor.rgb += texColor.rgb * 0.2;  
 }  

The final result.

Here's a couple of screenshots:


Test scenario showing how the buildings cast shadows onto one another. 

Close-in details are still preserved with cascaded shadow maps. Self shadowing Tiger tank FTW!


Here you can see cascades 1-3 illustrated.


And here is cascade 0 included as well for the finer details close to the camera.
Just a thanks here to everyone from Turbosquid and the internet for the free models! :)

Friday, 29 August 2014

Rendering Post: Linearizing your depth buffer and simple fog.

Right so enough about tools for now, let's do some rendering work. I hadn't really devoted much time to the renderer for the project, since I've been focusing on the back end and tools. But, tools can be boring to talk about and not terribly useful to the reader, so I added a couple of features to the engine. Let's get started.

Linearizing Your Depth Buffer
A fundamental step in a lot of graphics techniques is being able to read the depth of the current pixel from your depth buffer. Well... no shit, but you need to be able to do it properly. When sampling from your depth buffer in the shader, you get a value back in the range [0, 1], and you might assume that 0 is the near plane and 1 is the far plane distance... but nope, there's a non-linearity here that we need to be aware of. What we need is to understand the transformation that the z component of a vertex will undergo, and how and why it's interpolated across the surface of the triangle the way it is.

Now here is where it gets potentially confusing. When it comes to the handling of z values, there are two places where 1/z shows up, and they are not related. The first is in interpolating vertex attributes correctly during rasterization. The second is a consequence of the perspective divide operation.

Let's start with the first (I'm going to assume that you guys are familiar with perspective, our perception of the world, and why we need projection to replicate it in computer graphics. If not, see here). You see, when we go from the abstract three-dimensional description of a triangle to the two-dimensional triangle that gets interpolated across the screen, we suffer from a loss of dimensionality that makes it hard to determine the correct values we need for rendering. We project the three-dimensional triangle onto a two-dimensional plane and then fill it in. But we need to be able to reconstruct the three-dimensional information from the interpolated two-dimensional information.

Here on the left is the standard similar triangles diagram that explains why you divide by z to shrink the x value the farther the point is from the camera. Easy stuff.


For this quick explanation let's assume that your near plane is set to 1.0, which means we can remove it from the equation. We have:

x' = x / z

which we can manipulate to get the equation

x = x' * z

which is great because it means that we can reconstruct the original view space x from the projected x' by multiplying by the original z value. So as we're interpolating x' during the rasterization process, if we can find z we can recalculate the original x value. We'd be able to recover a 3D attribute from a 2D interpolation. But this assumes that you have z handy, which we don't. We need to find something involving z that we can interpolate linearly across the screen. A linear equation is an equation of the form y = Ax + B.

Ignoring what A and B's actual values are for the moment, and substituting for x and z we get:

x = Az + B

but x = x' * z, so

x' * z = Az + B

which, after some manipulation, gets us

z = B / (x' - A)

Which is hardly linear.  But now for the magic trick. Take the reciprocal of both sides.

(1 / z) = (1 / B) * x' - (A / B)

Hey, that's linear!

So, we can interpolate z's reciprocal in terms of x' (we get x' by interpolating across the screen during rasterization). From there, we can just take the interpolated value, take its reciprocal again, and we'll have z. And from that we can recalculate the interpolated x and y in 3D space. So we've overcome the dimension loss of 2D rasterization. This is applied to all attributes associated with a vertex, things like texture coordinates etc.
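To make the 1/z trick concrete, here's a tiny standalone example (with made-up numbers) interpolating a texture coordinate halfway across a projected edge in screen space, both the naive way and the perspective-correct way:

 #include <cstdio>

 int main()
 {
     // Two view-space endpoints of an edge, each with a texture coordinate u.
     float z0 = 2.0f, u0 = 0.0f;
     float z1 = 8.0f, u1 = 1.0f;

     float t = 0.5f; // halfway along the edge in *screen* space

     // Naive: interpolate u directly in screen space.
     float uNaive = u0 + t * (u1 - u0);                           // 0.5

     // Perspective correct: interpolate u/z and 1/z, then divide.
     float uOverZ   = u0 / z0 + t * (u1 / z1 - u0 / z0);          // 0.0625
     float oneOverZ = 1.0f / z0 + t * (1.0f / z1 - 1.0f / z0);    // 0.3125
     float u = uOverZ / oneOverZ;                                 // 0.2
     float z = 1.0f / oneOverZ;                                   // 3.2

     // The screen-space midpoint actually lands much closer to the near end
     // of the edge (z = 3.2, u = 0.2), not at the 3D midpoint (z = 5, u = 0.5).
     printf("naive u = %f, correct u = %f at depth %f\n", uNaive, u, z);
     return 0;
 }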

Secondly, when we read from the depth buffer after a perspective projection, we get an odd value back. We need to understand where it comes from and how it's reversed to recover the proper depth value. Look at the summary below:

We can see the journey of the vertex from model space to view space. Here it gets projected into a unit cube centered on the origin where clipping is performed (hence the name). As an aside, it's also here that the GPU will generate more triangles if it needs to from the clipping process.

I use the stock standard OpenGL transformation system, so a right-handed coordinate system with the visible part of view space in the negative z range. Looking at the symmetric perspective projection matrix definition for OpenGL we have the following:



Renaming the items in the third row to A and B and multiplying a 4D homogeneous vector
(x, y, z, 1.0) by this matrix, our vector would be

(don't care, don't care, Az + B, -z)

Two things to note: Az + B is still a linear transform of z (combined with the upcoming divide it maps the visible z range to [-1, 1], with -1 at the near plane and 1 at the far plane), but w has become -z. The negation is just because in our right handed view space the visible portion is on the negative z axis.
When the perspective w divide occurs we'll have a value of the form:

-(Az + B) / z

which is decidedly non-linear in z. I'd also add that somewhere here the value is sneakily remapped from the range [-1, 1] to [0, 1], so what finally lands in the depth buffer is:

(-(Az + B) / z) * 0.5 + 0.5

And then, depending on your depth buffer format, that value might be converted to an integer value, but on modern cards it may well be floating point, and either way the conversion is hidden away by the API for you.

Using this information we can reconstruct the proper depth when we read from the depth buffer.
Firstly, we need to calculate some values and pass them into our shader as uniforms.
We'll calculate:

     float n = camera.GetNearClipPlaneDistance();  
     float f = camera.GetFarClipPlaneDistance();  
     float A = -(f + n) / (f - n);  
     float B = (-2.0f * f * n) / (f - n);  
     shader->SetUniform4f("ABnf", A, B, n, f);  

And in the shader, we firstly reverse the offset, taking the value from [0, 1] back to [-1, 1] so that we're in NDC space again. Then we invert the equation above: from z_ndc = -(A * z_view + B) / z_view, a little rearranging gives z_view = -B / (z_ndc + A), which is what the diagram below illustrates:


So there's our reverse formula that we use to go from normalized device coordinates back to view space and we're done! In GLSL this is:
 #version 110  

uniform sampler2D textureUnit0;  

 /**  
 * Contains the components A, B, n, f in that order.  
 */  
 uniform vec4 ABnf;  

 varying vec2 vTexCoord0;  

 void main()  
 {  
     float A = ABnf.x;  
     float B = ABnf.y;  
     float n = ABnf.z;  
     float f = ABnf.w;  

     // Get the initial z value of the pixel.  
     float z = texture2D(textureUnit0, vTexCoord0).x;  
     z = (2.0 * z) - 1.0; 
 
     // Transform into view space.      
     float zView = -B / (z + A);  

     // Get normalized value.  
     zView /= -f;  

     gl_FragColor = vec4(zView, zView, zView, 1.0);  
 }  

I negate it because, remember, in OpenGL's view space visible z values lie along the negative z axis (hence the division by -f, which also normalizes the result).
Here's a screenshot of what this outputs in a test level:



To sum up the findings.
1) The odd nonlinear values you read from the depth buffer HAVE NOTHING TO DO WITH PERSPECTIVE CORRECT INTERPOLATION. They are caused entirely by the perspective divide that's part of perspective projection.

2) There still needs to be an interpolation of 1/z somewhere in order to enable perspective correct interpolation of vertex attributes; we just don't have to care about it. Your vertex attributes go in on one end and come out perspective correct on the other, and you don't need to divide by z again.

Bonus Feature: Simple exponential fog.
So what are the uses of linear depth? Well, as you'll see in future posts, it's really used all over the place, but for starters I just coded up a simple exponential fog function. All it really does is take the normalized linear distance (a value in [0, 1]) and square it, with some artist-configurable parameters of course, like starting distance and a density multiplier. It's as simple a shader as you can get, and takes about 30 minutes to get integrated... ok, maybe a bit more, because I coded up the beginnings of the post processing framework at the same time. Anyway, the point is that for a simple shader, the effects can be quite dramatic:
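For reference, the fog factor boils down to something like the following (sketched here in C++; the real thing lives in the post-process shader, and the parameter names are just illustrative):

 #include <algorithm>

 // normalizedLinearDepth is the viewZ / far value from the previous section.
 float FogFactor(float normalizedLinearDepth, float fogStart, float density)
 {
     // Distance past the artist-set start, scaled by the density multiplier.
     float d = std::max(normalizedLinearDepth - fogStart, 0.0f) * density;
     d = std::min(d, 1.0f);

     // Square it for the non-linear ramp; blend the scene towards the fog
     // colour by this factor.
     return d * d;
 }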


Useful resources:

http://www.songho.ca/opengl/gl_projectionmatrix.html
A beautifully comprehensive guide to the derivations of both orthographic and perspective projection matrices for OpenGL camera systems. Also where I grabbed the projection matrix image from.

http://chrishecker.com/Miscellaneous_Technical_Articles#Perspective_Texture_Mapping
All hail! The original and unsurpassed articles on perspective texture mapping. A little outdated now but if you're looking for the fundamentals of texture mapping and manage to survive them, they're still very informative. In fact, while you're at it, go read all of Chris Hecker's articles, he's up there with Michael Abrash in terms of ability to bring extremely technical topics down to a casual conversation level without dumbing it down.

http://www.amazon.com/Mathematics-Programming-Computer-Graphics-Edition/dp/1584502770
I actually have two copies of this book. One I used to keep at work and one for home.  Chapters 1-3 are what you need if you need an introduction to matrices, vectors, affine transformations, projections, quaternions, and so on. Integrates nicely with OpenGL engine development as all the conventions are based off of that API.




Saturday, 9 August 2014

Tool Post: Static Mesh Compiler

Ok another post, and another tool to cover :)

We've done textures, so let's look at another asset type that we'll need to provide to the game.
Defined in the terminology of my engine, static meshes are essentially non-deformable meshes, i.e. their vertex attributes won't change; they're uploaded to the GPU once and never modified again. You see examples of these in almost any game out there today. A rock mesh or a cup on a table, for example.

Let's think on the tool pipeline here. An artist/modeler is going to use their preferred tool, Blender, 3ds Max, or whatever, to generate the mesh, and it's going to spit out one of a variety of formats. So your tool has to have an extensible importer framework set up to handle it all. Right now I have that set up, but I've only written the importer for Wavefront .obj. When I need another format it's easy to add.

So, similarly to the Texture Compiler, the Static Mesh Compiler maintains an asset file, which contains imported static meshes in an intermediate format that allows for adding, removing, renaming, and re-positioning of the meshes. Then there is a publishing front end which takes the intermediate format database and outputs a platform optimized, compressed database that is optimal for the engine to load at runtime.

Level Of Detail
I need to branch away for a second and talk about level of detail and why it's important.
Level of detail is essentially a performance optimization. The key point is to not spend precious GPU time drawing something at its highest possible fidelity when it's so far away you're not going to actually see and appreciate that detail; a simplified representation should suffice. Observe:


On the left we have your standard Stanford bunny representation; this one clocks in at 4968 triangles. On the right, however, you have a far simpler mesh, at 620 triangles. There's definitely a noticeable difference in quality here. But what happens when the camera is sufficiently far from the model?


Hard to see the difference, right? At this point the object's contribution to the scene in terms of pixel output is too small to discern the detail. Voila, we don't need to care about drawing the detailed model any more.

It's worth mentioning that there are several different level of detail techniques. The first and most common is discrete level of detail, whereby a chain of meshes is generated for every initial mesh, with each mesh in the chain being of a lower level of detail. This is quite common in games today and is characterized by a visible "pop" as a mesh transitions between detail levels. Secondly, continuous level of detail generates a progressive data structure that can be sampled in a continuous fashion at runtime. And thirdly, there exist view-dependent level of detail algorithms that take into account the angle of view that the mesh has with the camera in order to optimize detail. In addition, there are probably a few dozen different types that I'm missing out on; notably, I'm interested in whether there are any level of detail techniques using the newer tessellation hardware available. For all of that, however, it must be said that discrete level of detail is still the most popular, as the other techniques are either too slow at runtime or rely on the presence of special hardware in order to work and/or be performant. For instance, I know that one of the keystones of the game LAIR for PS3 was its progressive mesh technology, which required the Cell Processor and (probably) the ability of the GPU to read from system memory to be feasible. For our uses, we'll stick with good old discrete levels of detail. The only drawback is the potentially noticeable pop between LOD levels... and the added artist time to create all of the different levels of detail. Which leads to the next topic.
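At runtime, discrete LOD selection is about as simple as it sounds. A minimal sketch, with illustrative types rather than the engine's actual ones, assuming the chain is sorted by increasing switch distance:

 #include <cstddef>
 #include <vector>

 struct LodLevel
 {
     float switchDistance;   // use this level once the camera is at least this far away
     int   meshHandle;       // handle to the simplified mesh for this level
 };

 // Returns the index of the level to draw; level 0 is the full-detail mesh.
 std::size_t SelectLod(const std::vector<LodLevel>& chain, float distanceToCamera)
 {
     std::size_t selected = 0;
     for (std::size_t i = 0; i < chain.size(); ++i)
         if (distanceToCamera >= chain[i].switchDistance)
             selected = i;
     return selected;
 }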

Automated LOD Generation Library
Another thing that would be ideal for our tool is some form of automated level of detail generation.
In an ideal world, maybe you'd have your artists go ahead and lovingly hand craft every single LOD level to be the optimum for a given triangle budget. But for scenarios like ours, where you have one, maybe two artists available, you really want to take as much off their hands as possible. Luckily there are several libraries available that cater for automated level of detail generation. After some searching around, what I found most suitable for my immediate needs was GLOD (http://www.cs.jhu.edu/~graphics/GLOD/index.html), a powerful level of detail library developed by a team from Johns Hopkins University and the University of Virginia. The thing that makes GLOD desirable for our needs is that it's designed to be closely integrated with OpenGL; for instance, it uses OpenGL vertex arrays as its unified geometry interface, meaning that it is by design capable of working with any per-vertex attributes you care to use. If it works in OpenGL, GLOD can simplify/adapt it. In this respect, true to their design goals, it more closely resembles an extension to the OpenGL driver than a standalone LOD toolkit.
Also of benefit is that GLOD is separated into three modules, meaning you are free to use whatever subset of the provided functionality you need. We're just using it as a geometry simplification tool for our discrete level of detail code; however, it has additional level of detail features that I need to experiment with when I have the time.

The Tool Interface
Here's a screenshot of the main tool interface.


So, typical fare for the UI: you have your list database view and publishing tool on the top left, the gizmos for manipulating your mesh in space (good for view testing the LODs etc), and on the bottom left you have the level of detail tool for manipulating the mesh's levels of detail; you can set each level's distance to the camera and its maximum allowed triangle budget (usually the tool spits out a mesh that's a few triangles under it, depending on circumstance). Of course, you can also move about the scene and set rendering options, like take a peek into the GBuffer, and see the wireframe and bounding volume of your mesh. For most of the tools that have a 3D scene widget that uses the renderer, those options are always available.

I should just add here that most of the tools I'll write about are under active development so they may get features added and/or removed and things may change. Which is great really because then I have new post material!


Thursday, 17 July 2014

Tool Post: Texture Compiler

All engines run off of generated assets. The most advanced renderer on the planet is meaningless if all you can do is draw a programmer defined cube with it. These assets are created by artist tools, such as Maya or 3ds Max, but aren't necessarily loaded into the game in those formats. Try parsing a Wavefront .obj model every time you want to load an asset and you'll see what I mean: it's damn slow. Engines tend to run off their own optimized formats that are created from source material by a resource pipeline, a set of tools that converts artist-created models, audio clips, textures etc. into a format that is optimal for the engine to locate, load and work with. In addition, the resource pipeline may bundle engine and game specific data into these formats for use later on.

The first tool I created was a texture compiler. Now loading in raw .png files and using them as textures isn't the most horrible thing that could be done. But it does have problems as you'll see later on in this post. It appears trivial at first, but there's a bit of work that needs to be done with source textures before you're ready to render with them properly. Chief among the concerns is the issue of Gamma Correction.

Gamma Correction
There are TONS of resources on this subject now, but I'll include the briefest explanation. 
So, from what I can gather, the story goes like this. Once upon a time we had CRT monitors, and it was noted that the physical output of the beam varied non-linearly with the input voltage applied. What this means is that if you wanted to display the middle grey between pitch black and pure white, and you input the RGB signal (0.5, 0.5, 0.5), you wouldn't get the middle grey you would expect. If you measured the output brightness, you got something along the lines of (0.22, 0.22, 0.22). Worse still, you actually get colour shifting(!), observe... I enter (0.9, 0.5, 0.1) and I get (0.79, 0.21, 0.006); the red becomes far more dominant in the result.

When plotted on a graph, the relationship could be viewed thus:


Note the blue line; this is the monitor's natural gamma curve. Also note that I've used the power factor 2.2 to represent the exponent that the monitors have. The exponent actually varies; however, 2.2 is generally close enough to correct that it can be used in most cases.

Nowadays, most displays aren't CRT. But, in the interest of backwards compatibility, modern monitors emulate this gamma curve.

But how come all the images you see aren't darker than they should be?
Well that's because a lot of image formats these days are pre-gamma-corrected (jpeg and png are two). That means that the internal image values are mapped along the green line in the graph, basically raised to the power of 1 / 2.2. This has the effect of cancelling out the monitor's gamma when displayed to the user. So at the end you see the image values as they were intended. Which is great when all you're doing is viewing images, but it causes some serious (and subtle) issues when rendering, because all of the operations that occur during rendering assume linear relationships. Obvious examples are texture filtering, mipmap generation, alpha-blending, and lighting.

Why didn't they keep the image formats linear and just pre-gamma correct before outputting to the display? What's with the inverse gamma curve on the images? The answer is another history lesson: it turns out, by lucky coincidence (which was actually purposeful engineering), that raising the image values to the reciprocal gamma exponent had the side effect of allocating more bits to representing darker values. This makes sense, as humans are more adept at seeing differences between dark tones than differences between light tones. In a way, it makes sense to still have it around.

What this all comes down to is that we have to deal with these non-linear inputs and outputs on either side of our linear rendering pipeline.

The Texture Compiler



Wow, a seriously complicated tool yeah? It's about as basic an interface as you can get for working with textures. Most of what's interesting happens in the background. It basically works as follows.

What the tool does is maintain a list of textures in an intermediate format, which can be saved out to a texture asset file (*.taf). This enables you to load up an existing asset file, add images, remove images, rename, change a parameter and so on, then save again. Then, when you want to export to the format the engine consumes, you select which platforms you want to publish to (right now it's just PC OpenGL) and hit the Publish button. This then generates a very simple database; it's basically a two-part file: the index part and the data part.

When the engine loads up, it only loads the texture database's index. Then, when an asset is encountered that requests a texture, the following process occurs. The engine queries the resident texture set; if the texture has been loaded onto the GPU already, it's returned immediately. If it hasn't been loaded yet, then the texture database index is queried for the offset of the desired texture within the texture database file. The raw database entry is loaded and, if it was compressed by the publishing tool, it's decompressed into the raw image data. Then that raw image data is compressed to whatever GPU-friendly compression format is supported and sent off to the GPU. If a texture is requested that isn't inside the texture database, then a blank white texture is returned instead.
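In code, the request path looks roughly like this. Everything here is a stand-in declared purely so the flow reads top to bottom; none of these are the engine's real types or functions.

 #include <string>
 #include <vector>

 struct RawImage { std::vector<unsigned char> bytes; };
 struct GpuTexture;

 // Stand-in declarations for the engine's own facilities.
 GpuTexture* FindResidentTexture(const std::string& name);
 bool        FindIndexEntry(const std::string& name, long& offset, bool& compressed);
 RawImage    ReadDatabaseEntry(long offset);
 RawImage    Decompress(const RawImage& raw);
 GpuTexture* CompressForGpuAndUpload(const std::string& name, const RawImage& raw);
 GpuTexture* BlankWhiteTexture();

 GpuTexture* RequestTexture(const std::string& name)
 {
     // Already resident on the GPU? Hand it back immediately.
     if (GpuTexture* resident = FindResidentTexture(name))
         return resident;

     // Look the name up in the database index that was loaded at startup.
     long offset = 0;
     bool compressed = false;
     if (!FindIndexEntry(name, offset, compressed))
         return BlankWhiteTexture();   // not in the database -> blank white fallback

     // Pull the raw entry from the data part of the file, decompress it if the
     // publisher compressed it, then recompress to a GPU format and upload.
     RawImage raw = ReadDatabaseEntry(offset);
     if (compressed)
         raw = Decompress(raw);
     return CompressForGpuAndUpload(name, raw);
 }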




It should be noted that the textures inside the texture database file are already in linear space. If you look at the tool's screenshot, you'll see that there's a "Gamma correct on publish." option. That simply tells the tool to raise the texture values to the desired power (in this case 2.2) on publish, bringing the values back into linear space. Then all of the automatic mipmap generation and texture filtering in the API and on the GPU will be correct from the get go. It's an option specifically because for some textures you don't want to gamma correct. Normal maps, for instance, tend to not be pre-corrected, i.e. are already linear. Because our inputs are now linear, and our internal operations are linear, all that's required at the end of the rendering pipeline is to apply the inverse gamma correction to the framebuffer and... that's a bingo!
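The publish-time correction itself is just a per-channel power function. A sketch, assuming 8-bit RGBA texels and the 2.2 approximation from above (real sRGB has a slightly more involved curve):

 #include <cmath>
 #include <cstdint>
 #include <vector>

 void GammaCorrectOnPublish(std::vector<std::uint8_t>& rgba)
 {
     for (std::size_t i = 0; i < rgba.size(); ++i)
     {
         // Alpha is already linear, leave it alone.
         if ((i & 3) == 3)
             continue;

         // Raise the colour channel to 2.2 to undo the encoding gamma.
         float linear = std::pow(rgba[i] / 255.0f, 2.2f);
         rgba[i] = static_cast<std::uint8_t>(linear * 255.0f + 0.5f);
     }
 }

Quantizing the result back to 8 bits per channel is exactly where the precision concern mentioned in the addendum below comes from.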

Just as an addendum on the whole linear pipeline topic, note that the alpha channel of gamma corrected (sRGB) textures will also be linear and therefore needs no correction. Aaaaand also note that while storing linear textures has its advantages, you won't be allocating as many bits of precision to the lower intensity light values. There are a few ways to go about fixing this (such as moving from 8 bits per channel to 16). Having said that, I haven't really noticed any glaring artifacts, as the textures we're using for our game are all bright and colorful, so it's alright :)






Monday, 9 September 2013

Project Post: Tyro
Here's a project that I worked on for a while back in 2011. If you're wondering why it's called something strange like Tyro, it's because the definition of tyro is a beginner/novice, so it fit nicely. It's a deferred renderer written in OpenGL 2.0 with GLSL. It also had an "almost" complete super-basic rigid body physics library branch in it that I started working on as well... but things got in the way and I couldn't finish that. I was reluctant to put a non-complete project up but there's plenty of awesome and hopefully someone can still learn/use something.

Still, the cool stuff is the renderer. So I'll post some screenshots, write a bit about it and then post the source code for you all to browse through if you're interested. 

Some Cool Things It Had
The Deferred Renderer
The project initially started as a deferred renderer experiment. After watching the Killzone 2 tech demos from 2005 to 2009 I was thoroughly enamored with the concept. For all of you who don't know what deferred rendering is, here's the short version.

Traditionally, in game engines up to that time, lighting was performed on geometry as the geometry itself was rendered. So (this will be a VERY basic representation), something similar to this would occur:


Render Function
Initialize renderer, do pre draw stuff;

Clear back buffer(s);

Set back buffer mode to accumulate;

foreach light do
    find geometry that intersects light volume;    
    render geometry lit with that light into back buffer;
end;

Swap back buffer

You can see that if an object in the world was lit by multiple lights, you would have to re-render the geometry every time, for each light. You could do some things to mitigate this, like batch multiple lights into a single run of the shader for the object... but the core problem still holds. There was a dependency between performing the lighting calculation and transforming and rendering the actual geometry itself. So, as your light count increased, your draw call count increased, and your triangle count increased as well. Enter deferred shading.

Deferred shading is called deferred shading because it defers shading calculations until after the geometry itself has been drawn. The core concept is that you write the 'properties' of a scene to various offscreen buffers (also called Geometry Buffers or collectively, the G-Buffer) during what is called a material pass. After this pass is done you perform shading passes using the properties stored in the buffers. This shading could be anything you want it to be but it has predominantly been associated with the lighting calculation(s).
G-Buffer visualization. View space normal(TL). Texture albedo(TR). Depth buffer(BL). Specular reflectivity(BR).
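To contrast with the forward pseudocode above, here's a structural sketch of a deferred frame. The types and functions are placeholders I've declared just for the sketch; they aren't Tyro's actual API.

 #include <vector>

 struct Object {};
 struct Light {};

 // Placeholder declarations standing in for the renderer's real functions.
 void BindGBufferForMaterialPass();
 void DrawObjectProperties(const Object& obj);  // writes normal/albedo/depth/specular
 void BindBackBufferAdditive();
 void DrawLightPass(const Light& light);        // reads the G-Buffer, accumulates lighting

 void RenderDeferredFrame(const std::vector<Object>& objects,
                          const std::vector<Light>& lights)
 {
     // Material pass: every object is drawn exactly once, no matter how many
     // lights touch it.
     BindGBufferForMaterialPass();
     for (const Object& obj : objects)
         DrawObjectProperties(obj);

     // Shading pass: lighting cost now scales with lights and shaded pixels,
     // not with scene geometry.
     BindBackBufferAdditive();
     for (const Light& light : lights)
         DrawLightPass(light);
 }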

Oh, and before the terminology becomes too weird: a renderer that uses deferred shading tends to be called a deferred renderer, not a deferred shader. Yeah, I know.

This approach is fundamentally cool because it clearly and cleanly separates lighting calculations from geometry and material calculations. They become two distinct parts of the pipeline. This helps not only performance but architectural cleanliness as well.

I'm going to digress here and add that I think this is one of the core technological approaches that really saved Sony's bacon this generation. The Playstation 3 had really poor geometry throughput compared to its contemporary, the Xbox 360. Traditionally architected engines pretty much without fail performed much better on Microsoft's console, with its more PC-like CPU, unified memory, and (for the time) powerful, unified-shader based GPU. Deferred shading helped level the graphical playing field because it removes a large portion of geometry transformations from the pipeline. Not only that, but the fact that you had all these material properties written to memory buffers meant that they were conveniently in a format that the PS3's SPUs could operate on efficiently. This meant that the wily developer could alleviate the pressure on the weak GPU by moving entire portions of the graphics pipeline over to the satellite processors. Examples of this were titles like Split Second and Battlefield 3, which performed the entirety of their lighting calculations on SPUs, and titles like Killzone 2 and Uncharted 2, which performed the majority (or all) of their image post processing on the SPUs. Later titles used the G-Buffer to perform post-process anti-aliasing, removing that from the geometry pipeline as well. So we can see the benefits that deferred shading brings to the table.

It's not all great though. The very same characteristics that make Deferred Shading a good thing also introduce some problems too. For one, the memory footprint and bandwidth required for the G-Buffer can get quite large. And if you want sub-pixel AA it gets even scarier.
Transparencies become a problem too, and what you generally have to have is a fallback forward-renderer to draw transparent elements of the scene.
Introducing new material properties means introducing new offscreen buffers. And so on.

A thorough introduction and/or overview of Deferred Shading is way beyond the scope of this post. I'd advise any interested readers to check out the myriad sources available. Some useful links are:

http://www.cg.tuwien.ac.at/courses/Seminar/WS2010/deferred_shading.pdf

http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/lectures/12_deferred_shading.pdf

http://www.guerrilla-games.com/publications/dr_kz2_rsx_dev07.pdf

http://www.dennisfx.com/wp-content/uploads/2013/02/Report_DRendering_Toufexis_D.pdf

http://developer.amd.com/wordpress/media/2012/10/D3DTutorial_DeferredShading.pdf

http://www.slideshare.net/guerrillagames/the-rendering-technology-of-killzone-2

Normal mapping
Tyro also supported tangent space normal mapping. I wasn't then, and am still not, aware of any tools that generate a tangent space for your meshes (links anyone??), so I ended up writing my own tangent space mesh parser. This turns out to be a very tricky thing to get right, and that part of Tyro's code base is something I don't really want to look at ever again. In addition, it really does require an artist's touch to go through and make sure that you don't average facet details when you shouldn't etc... the things an algorithm tends to not get perfectly right for every kind of mesh.

Also, it's very annoying when you have to wait for the .exe to parse a mesh every single time you run the damn thing. The importance of a good back-end that gives you pre-parsed meshes in a format your front-end can read quickly...

Here are some links for tangent space normal mapping:


http://crytek.com/download/Triangle_mesh_tangent_space_calculation.pdf

http://www.ozone3d.net/tutorials/bump_mapping.php

Bloom
I added a bloom effect to the result when the light values reached a certain threshold and above.
This is accomplished by taking the lighting result buffer, examining the values and, if they are above your threshold, moving them over to another buffer, call it the bright-pass buffer. Then you downsample that buffer and blur it a few times using a separable filter or something. To make it a little more effective, I think I had the hardware generate some mipmaps from the bright-pass buffer and then blurred those as well to get the wide kernel effect.
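Structurally the whole chain is just a handful of full-screen passes. A rough sketch, with placeholder types and helpers standing in for the real render-target plumbing:

 struct Texture {};

 // Placeholder declarations for the full-screen passes described above.
 Texture ExtractBrightPixels(const Texture& lighting, float threshold);
 Texture Downsample(const Texture& src, int factor);
 Texture BlurSeparable(const Texture& src);
 void    AddBlendOnto(Texture& dest, const Texture& src);

 void ApplyBloom(Texture& backBuffer, const Texture& lightingResult)
 {
     // 1. Keep only the pixels above the brightness threshold.
     Texture bright = ExtractBrightPixels(lightingResult, 1.0f /* threshold */);

     // 2. Downsample and blur a few times; lower resolutions (or mip levels)
     //    widen the effective blur kernel cheaply.
     Texture blurred = Downsample(bright, 4);
     for (int i = 0; i < 3; ++i)
         blurred = BlurSeparable(blurred);

     // 3. Additively composite the blurred bright pass over the scene.
     AddBlendOnto(backBuffer, blurred);
 }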

I got the idea from here:
http://kalogirou.net/2006/05/20/how-to-do-good-bloom-for-hdr-rendering/

The standard lighting output.

Brightpass result: pixels above a certain  brightness threshold selected.

Brightpass buffer downsampled and blurred repeatedly.

Final composite result. Add the blurred brightpass buffer to the final image.

Some Things It Should Have Had, But Didn't

Anti-Aliasing

Running off of a DirectX 9 level setup, you can't get any kind of sub-pixel anti-aliasing with deferred rendering, I believe. So it would have had to be a post-process method, something along the lines of MLAA or FXAA.

Gamma-Correction

A no-brainer. I wrote this before I had even heard the term. But gamma correction is an important topic all on its own and vital for any proper lighting/rendering solution.

Shadows

I was planning on this, but ran out of time to finish it before other things got in the way. Again, shadows are a huge topic all on their own.

In Conclusion

I think I'll leave it there. But I'll leave a couple more screenshots of it in action below and I'll post the whole project online so that you can all download it and try it out.
The project was written in CodeBlocks, runs on Windows only, and needs the Direct-X SDK setup to compile properly :) 










I used the SOIL library for loading pngs into OpenGL textures.
http://www.lonesock.net/soil.html

Here is the URL to download the zipped project
https://docs.google.com/file/d/0B7HWlmfutdmxY3B4WDJvNko5UXc/edit?usp=sharing

Extract the folder to your C drive and open up the Codeblocks workspace. It should compile and run from there :) 

As a disclaimer, I take no responsibility for how you all use the code.
Also, if there are resources (textures, models, libraries) that I've used that you own the rights to and you want me to remove it and/or give you due credit, by all means let me know and I'll stick it in there.


Monday, 29 April 2013

Indie Game...Of Thrones

Three others from Celestial and I entered the Indie Speed Run late last year. We managed to hack out a pretty cool project by the end of it. Our starting keywords were "afterlife" and "throne", so we decided to make a crazy MMO hack-and-slash isometric pen-art style game. Or, as it will be known in the future, a CMMOHSIPASG. *Cough*

Here's the link to it on the indiedb site: http://www.indiedb.com/games/throne

The team was
- Alison McAlinden, who did all the awesome pen art and animations for the characters, as well as the story.
- Travis Bulford, who did the network programming, UI programming, maze generation code and the AI.
- Cobus Saunderson, who did the user interface graphics (more work than it sounds like!)

My contribution was the isometric engine that powered the display, the particle system engine as well as the particle system scripts (no, the quality was higher than just "programmer art" thank you), and animation engine code and tie-ins for the characters.





It was an awesome experience, albeit a brutal one that I'm not too eager to try again until my degree is finished!

I'm going to talk a little bit about the engine, seeing as that's the component I spent the most time with.
The isometric component of the engine runs on top of the j4game framework (which handles all of the nasty set up tasks), but it's actually almost a completely separate entity. Aside from providing the graphics context and input, they are fairly detached from one another. It's better to think of it as being a plug-in into the j4game framework itself.

The engine had to be able to render the huge maps in the game and had to have sprites that could shear (trees blowing in the wind!) and enter a transparent state when an object of importance moved behind them (the player for example). To that end, it was a quadtree based design with a tricky sprite sorting and merging algorithm in the core loop that would dynamically check which sprites fell in front of priority sprites and do the appropriate blending etc.

The particle system represented particles in a true three-dimensional space (in fact, all coordinates in the engine were in true three dimensions; the engine would cull the quadtree nodes, transform visible sprites into camera space and then sort and draw), so they would have x, y, and z position components, and velocity and acceleration which would get integrated etc. The particle engine was flexible and allowed for custom code to be scripted in (there was no scripting language though; its scripting was done by deriving from the vanilla particle system, overriding an update method and then writing whatever behavior you would need. The particles themselves could be derived from and customized, so it all worked out to be quick and flexible given the time constraints).

The animation system let you set animations, store animations, loop animations, and generate engine-wide events during certain frames, like cast a fireball or start a particle system during frame n for example.





Wednesday, 17 October 2012

Simple Tile Based Volume Lighting In Toxic Bunny

Introduction
My company (Celestial Games) recently released Toxic Bunny HD, an old-school 2D platforming-shooter remake involving a crazy bunny called Toxic and his quest for revenge. You can read about it here:

http://www.toxicbunny.co.za/

Towards the end of the project I had a little time to try and squeeze in one of the features I had been wanting for a long time, dynamic lighting.

Whilst not a complicated effect to achieve normally, when we start taking it in the context of, "We wrote the game in Java using the Java2D API", it becomes more challenging. What we found during development was that the Java2D API was not well suited to the kinds of graphics workloads we were placing on it; and this was never more true than when we looked at dynamic lighting. A few onerous things about the API (for games) are:

- Lack of useful blending modes (no multiplicative blending, no additive blending).
- Absolutely NO shading support whatsoever.
- Modifying images can result in the API marking them as unaccelerated, destroying performance.
- Bloated implicit object construction for image(sprite) drawing.
- Convolution operations provided by the API are slow.

Whilst we were able to overcome most of these in various ways, they certainly left us with the strong impression that we chose the wrong API to develop a game on. This was a direct result of the fact that the game started as a weekend experiment and wasn't considered to be a "serious" product until after we had laid the foundations out. It certainly forced us to push the API farther than we thought we could.

Goals
In order for the dynamic lighting to be used in the game, it had to look good and not abuse our frame budget.
Toxic Bunny HD runs at a tick rate of 24Hz; that means that our budget is, conservatively, 40 milliseconds. Compared to, say, a 60Hz, 16 millisecond game, that's an absolutely enormous frame budget, which worked in our favor as a 2D game (a high frame rate doesn't mean as much in a 2D game environment as in a 3D one).

So, I wanted a lighting pipeline that would definitely push a high end machine at the highest quality setting, but would perform at around 5ms on the lowest quality setting.

The Lighting Pipeline
The lighting pipeline, at its most conceptual level, evaluates to

For every light:
Step 1: Gather tiles around the light source. From a collision detection perspective in the game, our playing area is divided up into 16x16 tiles. It is these tiles that perform the role of light occluders in the lighting system. Every 16x16 tile is marked with a specific collision type, whether it be empty space, a slope of some angle, or a solid (fully filled) space. The lighting system is only concerned with the solid tiles. For every light, the lighting system extracts the solid tiles that fall within the light's radius.

Solid collision tiles for the map are highlighted in red here. 

Step 2: Use the gathered tiles to generate half-spaces.
We then process the gathered collision tiles one at a time. We calculate the light's position relative to the center of the tile (working in a tile-space, so to speak) and use that relative position and the rectangular features of the tile to generate a set of half-spaces. If a particular pixel falls behind all of the generated half-spaces for a given tile, it is occluded by that tile and therefore not shaded. If it is in front of one of the half-spaces, we run the standard point light attenuation equation on it and additively blend it into the back (lighting) buffer (I'll explain why this is possible in Java2D later).

My quick diagram of how the half spaces are generated. There are 9 regions around the tile that the light can be in.
Step 3: Shade pixels.
For every pixel in the back buffer, run through each tile's generated half-spaces and do the visibility test (there's a small code sketch of this test after the note below).

After all that's done, and you have the light's contribution to the scene in a buffer, add it to the scene.

Note:
The lighting buffer is at a much lower resolution than the screen for performance reasons; we blur the lighting buffer a few times and then stretch it over the screen using bilinear filtering to produce a coarse lighting result. In addition to the huge performance savings, which make the algorithm feasible, this looks more volumetric than the high resolution version, which is desirable.
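Here's the small sketch of the per-pixel test from steps 2 and 3 (written in C++ for readability; the game itself is Java, and the attenuation curve shown is just the usual linear falloff, not necessarily the exact one we shipped):

 #include <algorithm>
 #include <cmath>
 #include <vector>

 struct HalfSpace { float a, b, c; };                  // a*x + b*y + c >= 0 means "in front"
 struct TileOccluder { std::vector<HalfSpace> halfSpaces; };

 // A tile occludes a pixel only if the pixel is behind every one of the
 // half-spaces generated for that tile.
 bool OccludedByTile(const TileOccluder& tile, float px, float py)
 {
     for (const HalfSpace& h : tile.halfSpaces)
         if (h.a * px + h.b * py + h.c >= 0.0f)
             return false;                              // in front of one plane -> not occluded
     return true;
 }

 // Returns the light's contribution for one pixel of the lighting buffer.
 float LightContribution(const std::vector<TileOccluder>& tiles,
                         float px, float py, float lx, float ly, float radius)
 {
     for (const TileOccluder& tile : tiles)
         if (OccludedByTile(tile, px, py))
             return 0.0f;                               // in shadow of some tile

     // Standard point light attenuation with distance from the light.
     float dx = px - lx, dy = py - ly;
     float dist = std::sqrt(dx * dx + dy * dy);
     return std::max(0.0f, 1.0f - dist / radius);       // additively blended into the buffer
 }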

The Implementation Evolution
In order to implement this system we required per-pixel access and additive blending functionality that proved too slow, cumbersome and/or impossible to do with the existing API. Instead we decided to implement a software-based system that essentially rendered everything on the system side using the CPU only and then sent the results over the bus to the graphics card and merged it with the existing back buffer.

When I first implemented the naive algorithm the performance was upwards of 60 milliseconds for a 64x64 pixel lighting buffer, with one TINY light and without the convolution (used to blur the lighting buffer) (!). Clearly unacceptable. So, it was time to put on the optimization hat, and dig in.

Optimizing the half-space count
Profiling the code, one of the obvious bottlenecks was the number of half-space sets being generated: one set for each 16x16 pixel tile. There were a few things I implemented to reduce the total tile count and therefore the half-space count.

The first thing was to process tiles in two sets. The first set consisted of tiles that were above the light position on the screen and the second set consisted of tiles that were below the light position on the screen.
The first set of tiles was processed from the light position heading upward (the first tiles processed were in the row of tiles just above the light and the last processed were at the top of the screen) and vice versa for the tiles below the light.

The tiles were processed in rows. Why? This was to essentially merge the tiles together, reducing the tile count. If two or more tiles were adjacent in the row (without any gaps) then they were merged into one large horizontal tile and sent off for further processing. This further processing involved testing the new tile against the previously generated tiles' half-spaces. If it wasn't visible, then its half-space contribution would be irrelevant to the pixel occlusion stage, and it would be ignored. If it was visible in relation to the previously accepted tiles, then we would run a vertical merge stage. In this stage, if two large tiles were of the same position and width and were above each other by one row, then they would be merged together.
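The horizontal merge itself is a simple run-length pass over one row of the collision grid; something along these lines (illustrative types again):

 #include <vector>

 struct Run { int firstColumn; int lastColumn; };       // one merged horizontal occluder

 // Fuse adjacent solid tiles in a single row into wide runs.
 std::vector<Run> MergeRow(const std::vector<bool>& solid)
 {
     std::vector<Run> runs;
     int start = -1;
     for (int x = 0; x <= static_cast<int>(solid.size()); ++x)
     {
         bool isSolid = x < static_cast<int>(solid.size()) && solid[x];
         if (isSolid && start < 0)
             start = x;                                 // a new run begins
         else if (!isSolid && start >= 0)
         {
             runs.push_back({ start, x - 1 });          // run ended, emit one wide tile
             start = -1;
         }
     }
     return runs;
 }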

These steps drastically reduced the number of tiles we needed to generate half-spaces for. But while this increased the speed, we were still very far off our target.

Optimizing the rasterization step
The second thing was the actual pixel processing. The light sources are generally much smaller than the entire screen, and so processing each individual pixel was an unnecessary waste of time. Instead a coarse tile rasterization system was implemented. In this system, the back buffer was divided up into blocks of pixels (8x8 pixel blocks for the high spec setting), and each of these was tested against the half-spaces. If the block was entirely occluded then we discarded all of the pixels inside that block. If the block was entirely inside the visible area, then we trivially computed the lighting equation for all pixels inside that block. Only when the block of pixels was partially obscured by the half-spaces were the individual pixels considered.
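Classifying a block against one tile can reuse the structures from the earlier sketch. One way to do it is to test the block's four corners: the occluded region behind a tile is convex, so "all corners behind everything" means the whole block is occluded, while "all corners in front of some single plane" means the block can't touch the occluded region at all.

 enum class BlockResult { FullyOccluded, FullyVisible, Partial };

 // Classify an NxN pixel block (corners x0,y0 .. x1,y1) against one tile.
 BlockResult ClassifyBlock(const TileOccluder& tile,
                           float x0, float y0, float x1, float y1)
 {
     const float cx[4] = { x0, x1, x0, x1 };
     const float cy[4] = { y0, y0, y1, y1 };

     // Every corner behind every half-space: the whole block is occluded,
     // discard all of its pixels.
     bool allBehind = true;
     for (int i = 0; i < 4 && allBehind; ++i)
         allBehind = OccludedByTile(tile, cx[i], cy[i]);
     if (allBehind)
         return BlockResult::FullyOccluded;

     // Some half-space with the whole block in front of it: the block cannot
     // intersect the occluded region, so light every pixel trivially.
     for (const HalfSpace& h : tile.halfSpaces)
     {
         bool allInFront = true;
         for (int i = 0; i < 4; ++i)
             allInFront = allInFront && (h.a * cx[i] + h.b * cy[i] + h.c >= 0.0f);
         if (allInFront)
             return BlockResult::FullyVisible;
     }

     return BlockResult::Partial;                       // fall back to per-pixel tests
 }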

Again, we saw a huge speed improvement. But again, we were still off target. Specifically, the rasterization and lighting buffer blurring steps were still taking too long.

Utilizing the power of multi-core processors
At this point, without any more algorithmic optimizations to try out, we turned away from algorithms and focused on hardware utilization.

We implemented a job based threading system that allowed us to parallelize the workload across up to 16 processors. On implementing this for the rasterization stage, we reached our performance target. Finally.

One stage of the pipeline was still taking too damn long though. The blurring of the lighting buffer. This stage we initially left up to the Java2D convolution API. But that sucked. So instead we reimplemented it in software (seeing a pattern here?) using our job system. With these optimizations in place, we had reached our goal.
It certainly wasn't easy, but where the API failed us, we wrote our own functionality. We did this in numerous areas throughout the code base.

Visualization of the final optimizations. The cyan tiles are the above light source tiles, the purple ones are the below tiles. On the lower right you can see the coarse rasterization algorithm at work. Red tiles are culled, green are trivially accepted and yellow are partially culled.

There were still problems that proved unsolvable, to be fair, but we could live with them. One thing that remains a (really annoying) bottleneck is transferring the back buffer to an API construct and rendering it over the screen with bilinear filtering. To do that, and only that, on my development machine, still consumes around 1 millisecond(!!!).