Wednesday 17 October 2012

Simple Tile Based Volume Lighting In Toxic Bunny

Introduction
My company (Celestial Games) recently released Toxic Bunny HD, an old-school 2D platforming-shooter remake involving a crazy bunny called Toxic and his quest for revenge. You can read about it here:

http://www.toxicbunny.co.za/

Towards the end of the project I had a little time to try to squeeze in one of the features I had been wanting for a long time: dynamic lighting.

Whilst not a complicated effect to achieve normally, it becomes more challenging once you put it in the context of, "We wrote the game in Java using the Java2D API". What we found during development was that the Java2D API was not well suited to the kinds of graphics workloads we were placing on it, and this was never more true than when we looked at dynamic lighting. A few onerous things about the API (for games) are:

- Lack of useful blending modes (no multiplicative blending, no additive blending).
- Absolutely NO shading support whatsoever.
- Modifying images can result in the API marking them as unaccelerated, destroying performance.
- Bloated implicit object construction for image (sprite) drawing.
- Convolution operations provided by the API are slow.

Whilst we were able to overcome most of these in various ways, they certainly left us with the strong impression that we had chosen the wrong API to develop a game on. That was a direct result of the game starting as a weekend experiment; it wasn't considered a "serious" product until after we had laid the foundations. It certainly forced us to push the API farther than we thought we could.

Goals
In order for the dynamic lighting to be used in the game, it had to look good and not abuse our frame budget.
Toxic Bunny HD runs at a tick rate of 24Hz, which means our budget is, conservatively, 40 milliseconds per frame (1000ms / 24 ≈ 41.7ms). Compared to, say, a 60Hz game with a 16 millisecond frame budget, that's an absolutely enormous amount of time, which worked in our favor as a 2D game (a high frame rate doesn't mean as much in a 2D environment as in a 3D one).

So, I wanted a lighting pipeline that would definitely push a high end machine at the highest quality setting, but would perform at around 5ms on the lowest quality setting.

The Lighting Pipeline
The lighting pipeline, at its most conceptual level, works as follows.

For every light:
Step 1: Gather tiles around the light source. From a collision detection perspective, our playing area is divided up into 16x16 tiles, and it is these tiles that perform the role of light occluders in the lighting system. Every 16x16 tile is marked with a specific collision type, whether it be empty space, a slope of some angle, or a solid (fully filled) space. The lighting system is only concerned with the solid tiles. For every light, the lighting system extracts the solid tiles that fall within the light's radius.

Solid collision tiles for the map are highlighted in red here. 
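To make the gather step concrete, here is a minimal sketch of what it might look like. It isn't the actual game code: a boolean[][] grid and a list of java.awt.Point stand in for our real collision map and tile structures, and the test is a conservative square bound around the light rather than an exact circle test.

    import java.awt.Point;
    import java.util.ArrayList;
    import java.util.List;

    // Step 1 sketch: collect the solid 16x16 tiles inside the light's bounding square.
    class TileGather {
        static final int TILE_SIZE = 16;

        static List<Point> gatherSolidTiles(boolean[][] solid, float lightX, float lightY, float radius) {
            List<Point> result = new ArrayList<>();
            int rows = solid.length, cols = solid[0].length;
            int minTx = Math.max(0, (int) ((lightX - radius) / TILE_SIZE));
            int maxTx = Math.min(cols - 1, (int) ((lightX + radius) / TILE_SIZE));
            int minTy = Math.max(0, (int) ((lightY - radius) / TILE_SIZE));
            int maxTy = Math.min(rows - 1, (int) ((lightY + radius) / TILE_SIZE));
            for (int ty = minTy; ty <= maxTy; ty++) {
                for (int tx = minTx; tx <= maxTx; tx++) {
                    if (solid[ty][tx]) {                 // only fully solid tiles occlude light
                        result.add(new Point(tx, ty));   // tile coordinates, not pixels
                    }
                }
            }
            return result;
        }
    }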

Step 2: Use gathered tiles to generate half spaces.
We then process the gathered collision tiles one at a time. We calculate the light's position relative to the center of the tile (working in tile-space, so to speak), and use that relative position together with the rectangular features of the tile to generate a set of half-spaces. If a particular pixel falls behind all of the generated half-spaces for a given tile, it is occluded by that tile and therefore not shaded. If it is in front of one of the half-spaces, we run the standard point light attenuation equation on it and additively blend it into the back (lighting) buffer (I'll explain why this is possible in Java2D later).

My quick diagram of how the half spaces are generated. There are 9 regions around the tile that the light can be in.
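To give a feel for the half-space construction, here is a sketch of just one of those nine cases: the light above and to the left of the tile. The HalfSpace class and its orientation convention are mine for this example (not the game's internal representation), screen coordinates have y growing downward, and the java.util imports from the previous sketch are assumed.

    // Convention: a point is "behind" a half-space when nx*x + ny*y + d >= 0,
    // and a pixel is occluded by a tile only if it is behind every half-space in the list.
    class HalfSpace {
        final float nx, ny, d;
        HalfSpace(float nx, float ny, float d) { this.nx = nx; this.ny = ny; this.d = d; }
        boolean behind(float px, float py) { return nx * px + ny * py + d >= 0f; }
    }

    class HalfSpaceBuilder {
        // (lx, ly) is the light; (left, top, right, bottom) are the tile bounds.
        static List<HalfSpace> lightAboveLeftOfTile(float lx, float ly,
                                                    float left, float top,
                                                    float right, float bottom) {
            List<HalfSpace> hs = new ArrayList<>();
            // Behind the two edges of the tile that face the light.
            hs.add(new HalfSpace(0f, 1f, -top));    // below the tile's top edge
            hs.add(new HalfSpace(1f, 0f, -left));   // right of the tile's left edge
            // Between the two silhouette lines: light -> top-right corner and
            // light -> bottom-left corner, oriented so the shadow side tests positive.
            hs.add(through(lx, ly, right, top, left, bottom));   // reference: bottom-left corner
            hs.add(through(lx, ly, left, bottom, right, top));   // reference: top-right corner
            return hs;
        }

        // Half-space through points A and B, flipped so the reference point (rx, ry)
        // ends up on the "behind" side.
        static HalfSpace through(float ax, float ay, float bx, float by, float rx, float ry) {
            float nx = -(by - ay);
            float ny =  bx - ax;
            float d  = -(nx * ax + ny * ay);
            if (nx * rx + ny * ry + d < 0f) { nx = -nx; ny = -ny; d = -d; }
            return new HalfSpace(nx, ny, d);
        }
    }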
Step 3: Shade pixels.
For every pixel in the back buffer, run through each tile's generated half spaces and do the visibility test.

After all that's done, and you have the light's contribution sitting in a buffer, add it to the scene.
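Putting the pieces together, the shading loop might look roughly like this, building on the HalfSpace sketch above. The linear falloff and packed-RGB handling here are illustrative stand-ins for the game's actual attenuation and blending code.

    // Step 3 sketch: walk the low-resolution lighting buffer, skip pixels that fall
    // inside any occluder's shadow, and additively accumulate a simple linear falloff.
    static void shadeLight(int[] lightBuf, int bufWidth, int bufHeight,
                           float lx, float ly, float radius, int lightRGB,
                           List<List<HalfSpace>> occluders) {
        for (int y = 0; y < bufHeight; y++) {
            for (int x = 0; x < bufWidth; x++) {
                float dx = x - lx, dy = y - ly;
                float dist = (float) Math.sqrt(dx * dx + dy * dy);
                if (dist >= radius) continue;                     // outside the light's range

                boolean occluded = false;
                for (List<HalfSpace> tile : occluders) {
                    boolean behindAll = true;
                    for (HalfSpace h : tile) {
                        if (!h.behind(x, y)) { behindAll = false; break; }
                    }
                    if (behindAll) { occluded = true; break; }    // inside this tile's shadow
                }
                if (occluded) continue;

                float atten = 1f - dist / radius;                 // simple linear falloff
                int idx = y * bufWidth + x;
                int r = Math.min(255, ((lightBuf[idx] >> 16) & 0xFF) + (int) (((lightRGB >> 16) & 0xFF) * atten));
                int g = Math.min(255, ((lightBuf[idx] >>  8) & 0xFF) + (int) (((lightRGB >>  8) & 0xFF) * atten));
                int b = Math.min(255, ( lightBuf[idx]        & 0xFF) + (int) (( lightRGB        & 0xFF) * atten));
                lightBuf[idx] = (r << 16) | (g << 8) | b;         // additive blend, clamped
            }
        }
    }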

Note:
The lighting buffer is at a much lower resolution than the screen for performance reasons: we blur the lighting buffer a few times and then stretch it over the screen using bilinear filtering to produce a coarse lighting result. In addition to the huge performance savings, which make the algorithm feasible in the first place, this looks more volumetric than a full-resolution version, which is desirable.
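The stretch itself can be done with Java2D's interpolation hint, along these lines. In this snippet, lightBuffer is the small BufferedImage holding the blurred lighting, and screenGraphics / screenWidth / screenHeight are placeholders; how the result is combined with the scene is a separate step.

    // Draw the small lighting buffer over the whole frame with bilinear filtering.
    Graphics2D g = (Graphics2D) screenGraphics;
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                       RenderingHints.VALUE_INTERPOLATION_BILINEAR);
    g.drawImage(lightBuffer,                                   // e.g. a 64x64 BufferedImage
                0, 0, screenWidth, screenHeight,               // destination rectangle (full screen)
                0, 0, lightBuffer.getWidth(), lightBuffer.getHeight(),
                null);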

The Implementation Evolution
In order to implement this system we required per-pixel access and additive blending, which proved too slow, too cumbersome, or outright impossible with the existing API. Instead we decided to implement a software-based system that rendered everything on the system side, using the CPU only, and then sent the results over the bus to the graphics card to be merged with the existing back buffer.
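For the CPU-side work, one common Java2D approach is to back a small buffer with an INT_RGB image and work on its raw int[] directly. The exact buffer setup below is illustrative rather than our literal code, but it shows the idea (java.awt.image.BufferedImage and DataBufferInt).

    // Back the lighting buffer with a small INT_RGB image and grab its int[] once.
    // Note: touching the raster like this is exactly the kind of thing that can make
    // Java2D treat an image as unaccelerated, so a buffer like this stays software-side.
    BufferedImage lightImage = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
    int[] lightBuf = ((DataBufferInt) lightImage.getRaster().getDataBuffer()).getData();

    // lightBuf is a flat, row-major view of the image's pixels:
    // lightBuf[y * 64 + x] = 0x000000;   // e.g. clear to black before accumulating lights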

When I first implemented the naive algorithm, the performance was upwards of 60 milliseconds for a 64x64 pixel lighting buffer, with one TINY light and without the convolution used to blur the lighting buffer (!). Clearly unacceptable. So, it was time to put on the optimization hat and dig in.

Optimizing the half-space count
Profiling the code, one of the obvious bottlenecks was the number of half-space sets being generated: one set for each 16x16 pixel tile. There are a few things I implemented to reduce the total tile count and therefore the half-space count.

The first thing was to process tiles in two sets: the first set consisted of tiles above the light position on the screen, and the second set consisted of tiles below it.
The first set was processed from the light position heading upward (the first tiles processed were in the row just above the light and the last were at the top of the screen), and vice versa for the tiles below the light.

The tiles were processed in rows. Why? This was to essentially merge the tiles together, reducing the tile count. If two or more tiles were adjacent in a row (without any gaps), they were merged into one large horizontal tile and sent off for further processing. This further processing involved testing the new tile against the previously generated tiles' half-spaces. If it wasn't visible, then its half-space contribution would be irrelevant to the pixel occlusion stage, and it would be ignored. If it was visible in relation to the previously accepted tiles, then we would run a vertical merge stage: if two large tiles had the same position and width and were exactly one row apart, they were merged together.
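A rough sketch of the horizontal part of that merge, assuming the x-indices of the solid tiles in one row have already been gathered and sorted (again, illustrative names rather than the game's own structures):

    // Horizontal merge sketch: collapse unbroken runs of adjacent tiles into single
    // wide occluders. Each run is returned as { firstTileX, lastTileX, tileY }.
    static List<int[]> mergeRow(int[] sortedTileXs, int tileY) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < sortedTileXs.length) {
            int start = sortedTileXs[i];
            int end = start;
            // extend the run while the next tile is directly adjacent (no gap)
            while (i + 1 < sortedTileXs.length && sortedTileXs[i + 1] == end + 1) {
                end = sortedTileXs[++i];
            }
            runs.add(new int[] { start, end, tileY });
            i++;
        }
        return runs;
    }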

These steps drastically reduced the number of tiles we needed to generate half-spaces for. But while this increased the speed, we were still very far off our target.

Optimizing the rasterization step
The second thing was the actual pixel processing. The light sources are generally much smaller than the entire screen, so processing every individual pixel was an unnecessary waste of time. Instead, a coarse tile rasterization system was implemented. In this system, the back buffer was divided up into blocks of pixels (8x8 pixel blocks for the high spec setting), and each of these was tested against the half-spaces. If a block was entirely occluded, we discarded all of the pixels inside it. If the block was entirely inside the visible area, we trivially computed the lighting equation for all pixels inside it. Only when a block of pixels was partially obscured by the half-spaces were the individual pixels considered.
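The block test falls out of convexity: a block is fully in shadow with respect to an occluder if all four of its corners are behind every one of that occluder's half-spaces, and fully lit with respect to it if all four corners are in front of any single half-space. A sketch of the per-occluder classification, reusing the HalfSpace class from earlier (the game then combines these results across all occluders):

    // Coarse rasterization sketch: classify a block of pixels against one occluder
    // tile's half-spaces using only the block's four corners.
    enum BlockClass { FULLY_SHADOWED, FULLY_LIT, PARTIAL }

    static BlockClass classifyBlock(float x0, float y0, float x1, float y1,
                                    List<HalfSpace> tileHalfSpaces) {
        float[][] corners = { { x0, y0 }, { x1, y0 }, { x0, y1 }, { x1, y1 } };
        boolean allCornersBehindEverything = true;
        for (HalfSpace h : tileHalfSpaces) {
            int behind = 0;
            for (float[] c : corners) {
                if (h.behind(c[0], c[1])) behind++;
            }
            if (behind == 0) return BlockClass.FULLY_LIT;         // whole block on the lit side of this plane
            if (behind < 4) allCornersBehindEverything = false;   // block straddles this plane
        }
        return allCornersBehindEverything ? BlockClass.FULLY_SHADOWED : BlockClass.PARTIAL;
    }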

Again, we saw a huge speed improvement. But again, we were still off target. Specifically, the rasterization and lighting buffer blurring steps were still taking too long.

Utilizing the power of multi-core processors
At this point, without any more algorithmic optimizations to try out, we turned away from algorithms and focused on hardware utilization.

We implemented a job-based threading system that allowed us to parallelize the workload across up to 16 processors. On implementing this for the rasterization stage, we reached our performance target. Finally.
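Our job system is a general-purpose work queue, but its effect on the rasterization stage is roughly what you would get from splitting the lighting buffer into horizontal strips and farming them out to a thread pool. In this sketch, shadeStrip() is a hypothetical stand-in for "shade rows yStart to yEnd", and the java.util.concurrent imports (ExecutorService, Executors, Future) are assumed.

    // Parallel rasterization sketch: one strip per worker, pool created once at startup.
    static void shadeInParallel(ExecutorService pool, int workers,
                                int[] lightBuf, int bufWidth, int bufHeight) throws Exception {
        List<Future<?>> jobs = new ArrayList<>();
        int stripHeight = (bufHeight + workers - 1) / workers;    // ceiling division
        for (int w = 0; w < workers; w++) {
            final int yStart = w * stripHeight;
            final int yEnd = Math.min(bufHeight, yStart + stripHeight);
            if (yStart >= yEnd) break;
            jobs.add(pool.submit(() -> shadeStrip(lightBuf, bufWidth, yStart, yEnd)));
        }
        for (Future<?> job : jobs) {
            job.get();   // wait for every strip before moving on to the blur/upsample
        }
    }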

One stage of the pipeline was still taking too damn long though. The blurring of the lighting buffer. This stage we initially left up to the Java2D convolution API. But that sucked. So instead we reimplemented it in software (seeing a pattern here?) using our job system. With these optimizations in place, we had reached our goal.
It certainly wasn't easy, but where the API failed us, we wrote our own functionality. We did this in numerous areas throughout the code base.
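For reference, a single software blur pass over the lighting buffer looks roughly like the following horizontal box blur (a matching vertical pass follows, and the rows can be split across the job system). The kernel radius and edge clamping here are illustrative, not necessarily what shipped.

    // Horizontal box-blur pass sketch over a packed-RGB lighting buffer.
    static void boxBlurHorizontal(int[] src, int[] dst, int width, int height, int radius) {
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int r = 0, g = 0, b = 0, count = 0;
                for (int k = -radius; k <= radius; k++) {
                    int sx = Math.min(width - 1, Math.max(0, x + k));   // clamp at the edges
                    int rgb = src[y * width + sx];
                    r += (rgb >> 16) & 0xFF;
                    g += (rgb >>  8) & 0xFF;
                    b +=  rgb        & 0xFF;
                    count++;
                }
                dst[y * width + x] = ((r / count) << 16) | ((g / count) << 8) | (b / count);
            }
        }
    }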

Visualization of the final optimizations. The cyan tiles are the above light source tiles, the purple ones are the below tiles. On the lower right you can see the coarse rasterization algorithm at work. Red tiles are culled, green are trivially accepted and yellow are partially culled.

There were still problems that proved unsolvable, to be fair, but we could live with them. One thing that remains a (really annoying) bottleneck is transferring the back buffer to an API construct and rendering it over the screen with bilinear filtering. Doing that, and only that, on my development machine still consumes around 1 millisecond (!!!).