Thursday, 17 July 2014

Tool Post: Texture Compiler

All engines run off of generated assets. The most advanced renderer on the planet is meaningless if all you can do is draw a programmer-defined cube with it. These assets are created in artist tools such as Maya or 3ds Max, but aren't necessarily loaded into the game in those formats. Try parsing a Wavefront .obj model every time you want to load an asset and you'll see what I mean: it's damn slow. Engines tend to run off their own optimized formats, created from source material by a resource pipeline: a set of tools that converts artist-created models, audio clips, textures, etc., into a format that is optimal for the engine to locate, load, and work with. In addition, the resource pipeline may bundle engine- and game-specific data into these formats for use later on.

The first tool I created was a texture compiler. Now, loading raw .png files and using them directly as textures isn't the most horrible thing that could be done, but it does have problems, as you'll see later in this post. It appears trivial at first, but there's a bit of work that needs to be done on source textures before you're ready to render with them properly. Chief among the concerns is the issue of gamma correction.

Gamma Correction
There are TONS of resources on this subject now, but I'll include the briefest explanation. 
So, from what I can gather, the story goes like this. Once upon a time we had CRT monitors, and it was noted that the physical output of the beam varied non-linearly with the input voltage applied. This means that if you wanted to display the middle grey between pitch black and pure white and you input the RGB signal (0.5, 0.5, 0.5), you wouldn't get middle grey as you'd expect; if you measured the output brightness, you got something along the lines of (0.22, 0.22, 0.22). Worse still, this phenomenon actually causes colour shifting(!). Observe: I enter (0.9, 0.5, 0.1) and I get out roughly (0.79, 0.21, 0.006); the red becomes far more dominant in the result.

When plotted on a graph, the relationship could be viewed thus:


Note the blue line: this is the monitor's natural gamma curve. Also note that I've used the power factor 2.2 to represent the exponent of the monitor's response. The exponent actually varies; however, 2.2 is generally close enough to correct that it can be used in most cases.

Nowadays, most displays aren't CRT. But, in the interest of backwards compatibility, modern monitors emulate this gamma curve.

But how come all the images you see aren't darker than they should be?
Well, that's because a lot of image formats these days are pre-gamma-corrected (JPEG and PNG are two). That means the stored image values are mapped onto the green line in the graph, basically raised to the power of 1 / 2.2. This has the effect of cancelling out the monitor's gamma when displayed, so at the end you see the image values as they were intended. Which is great when all you're doing is viewing images, but it causes some serious (and subtle) issues when rendering, because all of the operations that occur during rendering assume linear relationships. Obvious examples are texture filtering, mipmap generation, alpha blending, and lighting.

Why didn't they keep the image formats linear and just gamma-correct right before outputting to the display? What's with the inverse gamma curve baked into the images? The answer is another history lesson. It turns out (a happy coincidence that was actually purposeful engineering) that raising the image values to the reciprocal gamma exponent has the side effect of allocating more bits to representing darker values. This makes sense, as humans are more adept at seeing differences between dark tones than differences between light tones. In a way, it makes sense to still have it around.

What this all comes down to is that we have to deal with these non-linear inputs and outputs on either side of our linear rendering pipeline.
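To make the point concrete, here's a tiny sketch of the two conversions and why naive averaging on stored values goes wrong. This uses the simple 2.2 power model discussed above (real sRGB has a linear toe near black), and all function names are my own:

```cpp
#include <cmath>

// Simple 2.2 power model (real sRGB has a linear toe near black, but 2.2 is
// "close enough", as discussed above). Function names are my own.
inline float decodeGamma(float v) { return std::pow(v, 2.2f); }        // display -> linear
inline float encodeGamma(float v) { return std::pow(v, 1.0f / 2.2f); } // linear -> display

// Averaging (think bilinear filtering or mipmapping) done naively on stored
// gamma-space values gives a different answer than averaging in linear space
// and re-encoding, which is the correct thing to do.
inline float averageWrong(float a, float b) { return 0.5f * (a + b); }
inline float averageRight(float a, float b) {
    return encodeGamma(0.5f * (decodeGamma(a) + decodeGamma(b)));
}
```

Averaging stored black (0.0) and white (1.0) gives 0.5 the naive way, but roughly 0.73 the correct way; the naive result displays visibly too dark.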

The Texture Compiler



Wow, a seriously complicated tool yeah? It's about as basic an interface as you can get for working with textures. Most of what's interesting happens in the background. It basically works as follows.

The tool maintains a list of textures in an intermediate format, which can be saved out to a texture asset file (*.taf). This enables you to load up an existing asset file, add images, remove images, rename, change a parameter and so on, then save again. Then, when you want to export to the format the engine consumes, you select which platforms you want to publish to (right now it's just PC OpenGL) and hit the Publish button. This generates a very simple database; it's basically a two-part file: the index part and the data part.
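An index-plus-data layout like the one described might look something like this. These struct names and fields are my guesses at such a format, not the tool's actual one:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>

// Hypothetical on-disk layout: a header, an index of entries, then raw blobs.
struct DatabaseHeader {
    uint32_t magic;        // file identification
    uint32_t entryCount;   // number of textures in the index
};

struct IndexEntry {
    char     name[64];     // texture name, null-terminated
    uint64_t dataOffset;   // byte offset of this texture's blob in the data part
    uint64_t dataSize;     // compressed size of the blob
    uint32_t width, height;
    uint32_t flags;        // e.g. "was gamma corrected", "is compressed"
};

// At load time only the index is kept in memory; blobs are fetched on demand.
class TextureIndex {
public:
    void add(const IndexEntry& e) { lookup_[e.name] = e; }
    // Returns nullptr when the texture isn't in the database.
    const IndexEntry* find(const std::string& name) const {
        auto it = lookup_.find(name);
        return it == lookup_.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<std::string, IndexEntry> lookup_;
};
```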

When the engine loads up, it only loads the texture database's index. Then, when an asset is encountered that requests a texture, the following process occurs. The engine queries the resident texture set, if the texture has been loaded onto the GPU already, it's returned immediately. If it hasn't been loaded yet, then the texture database index is queried for the offset into the texture database file of the desired texture. The raw database entry is loaded and, if it was successfully compressed by the publishing tool, it's decompressed into the raw image data. Then that raw image data is compressed to whatever GPU friendly compression format is supported and sent off to the GPU. If a texture is requested that isn't inside the texture database, then a blank white texture is returned instead.
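The request path above can be sketched as follows. This is my own stand-in for the engine's code (the decompress/recompress/upload work is reduced to a handle allocation), so all names are hypothetical:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

using TextureHandle = uint32_t;
constexpr TextureHandle kWhiteTexture = 0; // blank white fallback, always resident

class TextureCache {
public:
    explicit TextureCache(const std::unordered_map<std::string, uint64_t>& index)
        : index_(index) {}

    TextureHandle request(const std::string& name) {
        // 1. Already on the GPU? Return it immediately.
        auto hit = resident_.find(name);
        if (hit != resident_.end()) return hit->second;
        // 2. In the database index? "Load" it (really: seek to offset,
        //    decompress, recompress for the GPU, upload).
        auto entry = index_.find(name);
        if (entry == index_.end()) return kWhiteTexture; // 3. Unknown: blank white.
        TextureHandle handle = nextHandle_++;
        resident_[name] = handle; // stand-in for the real upload work
        return handle;
    }
private:
    const std::unordered_map<std::string, uint64_t>& index_; // name -> file offset
    std::unordered_map<std::string, TextureHandle> resident_;
    TextureHandle nextHandle_ = 1;
};
```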




It should be noted that the textures inside the texture database file are already in linear space. If you look at the tool's screenshot, you'll see that there's a "Gamma correct on publish." option. That simply tells the tool to raise the texture values to the desired power (in this case 2.2) on publish, bringing the values back into linear space. Then all of the automatic mipmap generation and texture filtering in the API and on the GPU will be correct from the get-go. It's an option specifically because for some textures you don't want to gamma correct; normal maps, for instance, tend not to be pre-corrected, i.e. are already linear. Because our inputs are now linear, and our internal operations are linear, all that's required at the end of the rendering pipeline is to apply the inverse gamma correction to the framebuffer and... that's a bingo!
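A sketch of what that publish-time linearization could look like (my own stand-in, not the tool's actual code). The alpha channel is stored linearly even in gamma-corrected images, so it's skipped, and the whole step is skipped for textures like normal maps:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical publish step: linearize RGB, leave alpha alone. Operates on
// tightly packed 8-bit RGBA pixels.
void linearizeOnPublish(std::vector<uint8_t>& rgba, bool gammaCorrect) {
    if (!gammaCorrect) return; // e.g. normal maps are already linear
    for (size_t i = 0; i + 3 < rgba.size(); i += 4) {
        for (int c = 0; c < 3; ++c) { // R, G, B only; skip A at i + 3
            float v = rgba[i + c] / 255.0f;
            rgba[i + c] = static_cast<uint8_t>(std::pow(v, 2.2f) * 255.0f + 0.5f);
        }
    }
}
```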

Just as an addendum on the whole linear-pipeline topic, note that the alpha channel of gamma-corrected (sRGB) textures is already linear and therefore needs no correction. Aaaaand also note that while storing linear textures has its advantages, you won't be allocating as many bits of precision to the lower-intensity light values. There are a few ways to go about fixing this (such as moving from 8 bits per channel to 16). Having said that, I haven't really noticed any glaring artifacts, as the textures we're using for our game are all bright and colorful, so it's alright :)






Monday, 9 September 2013

Project Post: Tyro
Here's a project that I worked on for a while back in 2011. If you're wondering why it's called something strange like Tyro: a tyro is a beginner or novice, so the name fit nicely. It's a deferred renderer written in OpenGL 2.0 with GLSL. It also had an "almost" complete, super-basic rigid body physics library branch that I started working on as well... but things got in the way and I couldn't finish it. I was reluctant to put a non-complete project up, but there's plenty of awesome in there and hopefully someone can still learn/use something.

Still, the cool stuff is the renderer. So I'll post some screenshots, write a bit about it and then post the source code for you all to browse through if you're interested. 

Some Cool Things It Had
The Deferred Renderer
The project initially started as a deferred renderer experiment. After watching the Killzone 2 tech demos from 2005 to 2009 I was thoroughly enamored with the concept. For those of you who don't know what deferred rendering is, here's the short version.

Traditionally, in game engines up to that time, lighting was performed on geometry as the geometry itself was rendered. So (this will be a VERY basic representation), something similar to this would occur:


Render Function
Initialize renderer, do pre-draw stuff;

Clear back buffer(s);

Set back buffer mode to accumulate;

foreach light do
    find geometry that intersects light volume;
    render geometry lit with that light into back buffer;
end;

Swap back buffers;

You can see that if an object in the world was lit by multiple lights, you would have to re-render the geometry for each light. You could do some things to mitigate this, like batching multiple lights into a single run of the object's shader... but the core problem still holds: there was a dependency between performing the lighting calculation and transforming and rendering the actual geometry itself. So as your light count increased, your draw call count increased, and your triangle count increased as well. Enter deferred shading.

Deferred shading is called deferred shading because it defers shading calculations until after the geometry itself has been drawn. The core concept is that you write the 'properties' of a scene to various offscreen buffers (also called Geometry Buffers or collectively, the G-Buffer) during what is called a material pass. After this pass is done you perform shading passes using the properties stored in the buffers. This shading could be anything you want it to be but it has predominantly been associated with the lighting calculation(s).
G-Buffer visualization. View space normal(TL). Texture albedo(TR). Depth buffer(BL). Specular reflectivity(BR).
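The two passes can be sketched on the CPU like so: a toy, single-texel version with made-up names, just to show the data flow from material pass to shading pass (the real thing writes these properties to offscreen render targets and lights them in a fullscreen pass):

```cpp
// Toy G-Buffer for one texel: the properties named in the caption above.
struct GBufferTexel {
    float normal[3];   // view-space normal
    float albedo[3];   // texture albedo
    float depth;       // depth value
    float specular;    // specular reflectivity
};

// Material pass: write the surface's properties; no lighting happens here.
GBufferTexel materialPass() {
    return { {0.0f, 0.0f, 1.0f}, {1.0f, 0.5f, 0.25f}, 0.5f, 0.2f };
}

// Shading pass: read the properties back and light them. The geometry that
// produced them is long gone; only its properties remain.
void shadingPass(const GBufferTexel& g, const float lightDir[3], float out[3]) {
    float ndotl = g.normal[0] * lightDir[0]
                + g.normal[1] * lightDir[1]
                + g.normal[2] * lightDir[2];
    if (ndotl < 0.0f) ndotl = 0.0f;                 // clamp back-facing light
    for (int i = 0; i < 3; ++i) out[i] = g.albedo[i] * ndotl;
}
```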

Oh, and before the terminology becomes too weird: a renderer that uses deferred shading tends to be called a deferred renderer, not a deferred shader. Yeah, I know.

This approach is fundamentally cool because it clearly and cleanly separates lighting calculations from geometry and material calculations. They become two distinct parts of the pipeline. This helps not only performance but architectural cleanliness as well.

I'm going to digress here and add that I think this is one of the core technological approaches that really saved Sony's bacon this generation. The PlayStation 3 had really poor geometry throughput compared to its contemporary, the Xbox 360. Traditionally architected engines pretty much without fail performed much better on Microsoft's console, with its more PC-like CPU, unified memory, and (for the time) powerful, unified-shader-based GPU. Deferred shading helped level the graphical playing field because it removes a large portion of geometry transformations from the pipeline. Not only that, but having all these material properties written to memory buffers meant they were conveniently in a format that the PS3's SPUs could operate on efficiently. This meant that the wily developer could alleviate the pressure on the weak GPU by moving entire portions of the graphics pipeline over to the satellite processors. Examples of this were titles like Split Second and Battlefield 3, which performed the entirety of their lighting calculations on SPUs, and titles like Killzone 2 and Uncharted 2, which performed the majority (or all) of their image post-processing on the SPUs. Later titles used the G-Buffer to perform post-process anti-aliasing, removing that from the geometry pipeline as well. So we can see the benefits that deferred shading brings to the table.

It's not all great though. The very same characteristics that make deferred shading a good thing also introduce some problems. For one, the memory footprint and bandwidth required for the G-Buffer can get quite large, and if you want sub-pixel AA it gets even scarier. Transparencies become a problem too; you generally need a fallback forward renderer to draw the transparent elements of the scene. Introducing new material properties means introducing new offscreen buffers. And so on.

A thorough introduction and/or overview of Deferred Shading is way beyond the scope of this post. I'd advise any interested readers to check out the myriad sources available. Some useful links are:

http://www.cg.tuwien.ac.at/courses/Seminar/WS2010/deferred_shading.pdf

http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/lectures/12_deferred_shading.pdf

http://www.guerrilla-games.com/publications/dr_kz2_rsx_dev07.pdf

http://www.dennisfx.com/wp-content/uploads/2013/02/Report_DRendering_Toufexis_D.pdf

http://developer.amd.com/wordpress/media/2012/10/D3DTutorial_DeferredShading.pdf

http://www.slideshare.net/guerrillagames/the-rendering-technology-of-killzone-2

Normal mapping
Tyro also supported tangent-space normal mapping. I wasn't then, and still am not, aware of any tools that generate a tangent space for your meshes (links, anyone?), so I ended up writing my own tangent-space mesh parser. This turns out to be a very tricky thing to get right, and that part of Tyro's code base is something I don't really want to look at ever again. In addition, it really does require an artist's touch to go through and make sure that you don't average facet details when you shouldn't, etc.; these are the things an algorithm tends not to get perfectly right for every kind of mesh.

Also, it's very annoying having to wait for the .exe to parse a mesh every single time you run the damn thing. Hence the importance of a good back-end that gives you pre-parsed meshes in a format your front-end can read quickly...

Here are some links for tangent space normal mapping:


http://crytek.com/download/Triangle_mesh_tangent_space_calculation.pdf

http://www.ozone3d.net/tutorials/bump_mapping.php

Bloom
I added a bloom effect where the light values reached a certain threshold and above. This is accomplished by taking the lighting result buffer, examining the values and, if they are above your threshold, moving them over to another buffer: call it the bright-pass buffer. Then you downsample that buffer and blur it a few times using a separable filter or something similar. To make it a little more effective, I think I had the hardware generate some mipmaps from the bright-pass buffer and then blurred those as well to get the wide-kernel effect.
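The bright-pass and downsample steps can be sketched like this. A hypothetical single-channel CPU version of mine, just to show the shape of the operations (the real thing runs on the GPU in a shader):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical bright-pass: keep only the energy above a threshold, so that
// blurring the result later produces the glow around bright areas.
std::vector<float> brightPass(const std::vector<float>& lighting, float threshold) {
    std::vector<float> bright(lighting.size());
    for (size_t i = 0; i < lighting.size(); ++i)
        bright[i] = std::max(0.0f, lighting[i] - threshold);
    return bright;
}

// Naive 2x downsample of a 1D row, as a stand-in for the real 2D version
// (averaging neighbors, like the hardware mipmap generation mentioned above).
std::vector<float> downsample(const std::vector<float>& src) {
    std::vector<float> dst(src.size() / 2);
    for (size_t i = 0; i < dst.size(); ++i)
        dst[i] = 0.5f * (src[2 * i] + src[2 * i + 1]);
    return dst;
}
```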

I got the idea from here:
http://kalogirou.net/2006/05/20/how-to-do-good-bloom-for-hdr-rendering/

The standard lighting output.

Brightpass result: pixels above a certain brightness threshold selected.

Brightpass buffer downsampled and blurred repeatedly.

Final composite result. Add the blurred brightpass buffer to the final image.

Some Things It Should Have Had, But Didn't

Anti-Aliasing

Running off a DirectX 9-level setup, you can't get any kind of sub-pixel deferred anti-aliasing, I believe. So it would have had to be a post-process method, something along the lines of MLAA or FXAA.

Gamma-Correction

A no-brainer. I wrote this before I had even heard the term. But gamma correction is an important topic all on its own and vital for any proper lighting/rendering solution.

Shadows

I was planning on this, but ran out of time to finish it before other things got in the way. Again, shadows are a huge topic all on their own.

In Conclusion

I think I'll leave it there. But I'll leave a couple more screenshots of it in action below and I'll post the whole project online so that you can all download it and try it out.
The project was written in Code::Blocks, runs on Windows only, and needs the DirectX SDK set up to compile properly :) 










I used the SOIL library for loading PNGs into OpenGL textures:
http://www.lonesock.net/soil.html

Here is the URL to download the zipped project
https://docs.google.com/file/d/0B7HWlmfutdmxY3B4WDJvNko5UXc/edit?usp=sharing

Extract the folder to your C drive and open up the Codeblocks workspace. It should compile and run from there :) 

As a disclaimer, I take no responsibility for how you all use the code.
Also, if there are resources (textures, models, libraries) that I've used that you own the rights to and you want me to remove it and/or give you due credit, by all means let me know and I'll stick it in there.


Monday, 29 April 2013

Indie Game...Of Thrones

Three others from Celestial and I entered the Indie Speed Run late last year. We managed to hack out a pretty cool project by the end of it. Our starting keywords were "afterlife" and "throne", so we decided to make a crazy MMO hack-and-slash isometric pen-art-style game. Or, as it will be known in the future, a CMMOHSIPASG. *Cough*

Here's the link to it on the indiedb site: http://www.indiedb.com/games/throne

The team was:
- Alison McAlinden, who did all the awesome pen art and animations for the characters, as well as the story.
- Travis Bulford, who did the network programming, UI programming, maze generation code and the AI.
- Cobus Saunderson, who did the user interface graphics (more work than it sounds like!)

My contribution was the isometric engine that powered the display, the particle system engine as well as the particle system scripts (no, the quality was higher than just "programmer art" thank you), and animation engine code and tie-ins for the characters.





It was an awesome experience, albeit a brutal one that I'm not too eager to try again until my degree is finished!

I'm going to talk a little bit about the engine, seeing as that's the component I spent the most time with.
The isometric component of the engine runs on top of the j4game framework (which handles all of the nasty setup tasks), but it's actually almost a completely separate entity. Aside from the framework providing the graphics context and input, they are fairly detached from one another. It's better to think of the isometric engine as a plug-in to the j4game framework.

The engine had to be able to render the huge maps in the game and had to have sprites that could shear (trees blowing in the wind!) and enter a transparent state when an object of importance moved behind them (the player for example). To that end, it was a quadtree based design with a tricky sprite sorting and merging algorithm in the core loop that would dynamically check which sprites fell in front of priority sprites and do the appropriate blending etc.

The particle system represented particles in a true three-dimensional space (in fact, all coordinates in the engine were in true three dimensions, the engine would cull the quadtree nodes, transform visible sprites into camera space and then sort and draw), so they would have x, y, and z position components, and velocity and acceleration which would get integrated etc. The particle engine was flexible and allowed for custom code to be scripted in (there was no scripting language though, it's scripting was done by deriving from the vanilla particle system, overriding an update method and then writing whatever behavior you would need. The particles themselves could be derived from and customized, so it all worked out to be quick and flexible given the time constraints).
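The "scripting by subclassing" idea described above might look like this, transposed into C++ for illustration (the original was Java, and all names here are invented):

```cpp
#include <vector>

struct Particle { float pos[3], vel[3], acc[3]; };

class ParticleSystem {
public:
    virtual ~ParticleSystem() = default;
    // Vanilla behavior: integrate acceleration into velocity, velocity into position.
    virtual void update(float dt) {
        for (Particle& p : particles) {
            for (int i = 0; i < 3; ++i) {
                p.vel[i] += p.acc[i] * dt;
                p.pos[i] += p.vel[i] * dt;
            }
        }
    }
    std::vector<Particle> particles;
};

// A "script": derive from the vanilla system and override update to add
// whatever custom behavior you need.
class RisingSparks : public ParticleSystem {
public:
    void update(float dt) override {
        ParticleSystem::update(dt); // keep the vanilla integration
        for (Particle& p : particles)
            p.pos[2] += 0.5f * dt;  // then drift upward in z
    }
};
```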

The animation system let you set animations, store animations, loop animations, and generate engine-wide events during certain frames, like cast a fireball or start a particle system during frame n for example.





Wednesday, 17 October 2012

Simple Tile Based Volume Lighting In Toxic Bunny

Introduction
My company (Celestial Games) recently released Toxic Bunny HD, an old-school 2D platforming-shooter remake involving a crazy bunny called Toxic and his quest for revenge; you can read about it here:

http://www.toxicbunny.co.za/

Towards the end of the project I had a little time to try and squeeze in one of the features I had been wanting for a long time, dynamic lighting.

Whilst not a complicated effect to achieve normally, when we take it in the context of "we wrote the game in Java using the Java2D API", it becomes more challenging. What we found during development was that the Java2D API was not well suited to the kinds of graphics workloads we were placing on it, and this was never more true than when we looked at dynamic lighting. A few onerous things about the API (for games) are:

- Lack of useful blending modes (no multiplicative blending, no additive blending).
- Absolutely NO shading support whatsoever.
- Modifying images can result in the API marking them as unaccelerated, destroying performance.
- Bloated implicit object construction for image(sprite) drawing.
- Convolution operations provided by the API are slow.

Whilst we were able to overcome most of these in various ways, they certainly left us with the strong impression that we chose the wrong API to develop a game on. This was a direct result of the fact that the game started as a weekend experiment and wasn't considered to be a "serious" product until after we had laid the foundations out. It certainly forced us to push the API farther than we thought we could.

Goals
In order for the dynamic lighting to be used in the game, it had to look good and not abuse our frame budget. Toxic Bunny HD runs at a tick rate of 24Hz, which means our budget is, conservatively, 40 milliseconds. Compared to, say, a 60Hz game's 16 milliseconds, that's an absolutely enormous frame budget, which worked in our favor (a high frame rate doesn't mean as much in a 2D game environment as in a 3D one).

So, I wanted a lighting pipeline that would definitely push a high end machine at the highest quality setting, but would perform at around 5ms on the lowest quality setting.

The Lighting Pipeline
The lighting pipeline, at its most conceptual level, evaluates to:

For every light:
Step 1: Gather tiles around the light source. From a collision detection perspective, our playing area is divided up into 16x16 tiles. It is these tiles that perform the role of light occluders in the lighting system. Every 16x16 tile is marked with a specific collision type, whether it be empty space, a slope of some angle, or a solid (fully filled) space. The lighting system is only concerned with the solid tiles. For every light, the lighting system extracts the solid tiles that fall within the light's radius.

Solid collision tiles for the map are highlighted in red here. 

Step 2: Use gathered tiles to generate half spaces.
We then process the gathered collision tiles one at a time. We calculate the light's position relative to the center of the tile (working in tile-space, so to speak) and use the relative position and the rectangular features of the tile to generate a set of half-spaces. If a particular pixel falls behind all of the generated half-spaces for a given tile, it is occluded by that tile and therefore not shaded. If it is in front of one of the half-spaces, we run the standard point-light attenuation equation on it and additively blend it into the back (lighting) buffer (I'll explain why this is possible in Java2D later).

My quick diagram of how the half spaces are generated. There are 9 regions around the tile that the light can be in.
Step 3: Shade pixels.
For every pixel in the back buffer, run through each tile's generated half spaces and do the visibility test.
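For an axis-aligned tile, the "behind all half-spaces" test is equivalent to asking whether the segment from the light to the pixel passes through the tile's rectangle. A minimal sketch of that occlusion query as a slab test; the names are mine, not the game's:

```cpp
struct Vec2 { float x, y; };
struct Tile { float minX, minY, maxX, maxY; }; // a solid 16x16 collision tile

// Returns true when the pixel is occluded by the solid tile, i.e. when the
// segment from the light to the pixel intersects the tile's rectangle.
// Equivalent to the "pixel falls behind all half-spaces" test described above.
bool occluded(Vec2 light, Vec2 pixel, const Tile& t) {
    float d[2]  = { pixel.x - light.x, pixel.y - light.y };
    float lo[2] = { t.minX, t.minY };
    float hi[2] = { t.maxX, t.maxY };
    float l[2]  = { light.x, light.y };
    float tmin = 0.0f, tmax = 1.0f; // clamp to the light -> pixel segment
    for (int a = 0; a < 2; ++a) {
        if (d[a] == 0.0f) {
            if (l[a] < lo[a] || l[a] > hi[a]) return false; // parallel, outside slab
        } else {
            float t1 = (lo[a] - l[a]) / d[a];
            float t2 = (hi[a] - l[a]) / d[a];
            if (t1 > t2) { float tmp = t1; t1 = t2; t2 = tmp; }
            if (t1 > tmin) tmin = t1;
            if (t2 < tmax) tmax = t2;
            if (tmin > tmax) return false; // slabs don't overlap: no hit
        }
    }
    return true;
}
```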

After all that's done, and you have your lighting contribution to the scene in a buffer, add it's contribution to the scene.

Note:
The lighting buffer is at a much lower resolution than the screen for performance reasons; we blur the lighting buffer a few times and then stretch it over the screen using bilinear filtering to produce a coarse lighting result. In addition to the huge performance savings (which make the algorithm feasible), this looks more volumetric than the high-resolution version, which is desirable.

The Implementation Evolution
In order to implement this system we required per-pixel access and additive blending functionality that proved too slow, cumbersome and/or impossible to do with the existing API. Instead we decided to implement a software-based system that essentially rendered everything on the system side using the CPU only and then sent the results over the bus to the graphics card and merged it with the existing back buffer.

When I first implemented the naive algorithm, the performance was upwards of 60 milliseconds for a 64x64-pixel lighting buffer, with one TINY light and without the convolution used to blur the lighting buffer(!). Clearly unacceptable. So it was time to put on the optimization hat and dig in.

Optimizing the half-space count
Profiling the code, one of the obvious bottlenecks was the number of half-space sets being generated: one set for each 16x16-pixel tile. There are a few things I implemented to reduce the total tile count and therefore the half-space count.

The first thing was to process tiles in two sets. The first set consisted of tiles that were above the light position on the screen and the second set consisted of tiles that were below the light position on the screen.
The first set of tiles was processed from the light position heading upward (the first tiles processed were in the row of tiles just above the light and the last processed were at the top of the screen) and vice versa for the tiles below the light.

The tiles were processed in rows. Why? This was essentially to merge the tiles together, reducing the tile count. If two or more tiles were adjacent in a row (without any gaps) then they were merged into one large horizontal tile and sent off for further processing. This further processing involved testing the new tile against the previously generated tiles' half-spaces. If it wasn't visible, then its half-space contribution would be irrelevant to the pixel occlusion stage, and it would be ignored. If it was visible in relation to the previously accepted tiles, then we would run a vertical merge stage: if two large tiles had the same position and width and sat directly above one another, they were merged together.
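The horizontal merge step might look like this. A hypothetical sketch of mine that collapses runs of adjacent solid tiles in one row into spans, so each span needs only one half-space set instead of one per tile:

```cpp
#include <utility>
#include <vector>

// Collapse runs of adjacent solid tiles in a row into (startColumn, tileCount)
// spans. Gaps break the run, exactly as described above.
std::vector<std::pair<int, int>> mergeRow(const std::vector<bool>& solid) {
    std::vector<std::pair<int, int>> spans;
    int start = -1;
    for (int i = 0; i <= static_cast<int>(solid.size()); ++i) {
        bool s = i < static_cast<int>(solid.size()) && solid[i];
        if (s && start < 0) {
            start = i;                            // run begins
        } else if (!s && start >= 0) {
            spans.push_back({start, i - start});  // run ends
            start = -1;
        }
    }
    return spans;
}
```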

These steps drastically reduced the number of tiles we needed to generate half-spaces for. But while this increased the speed, we were still very far off our target.

Optimizing the rasterization step
The second thing was the actual pixel processing. The light sources are generally much smaller than the entire screen, so processing each individual pixel was an unnecessary waste of time. Instead, a coarse tile rasterization system was implemented. In this system, the back buffer was divided up into blocks of pixels (8x8-pixel blocks on the high-spec setting), and each of these was tested against the half-spaces. If the block was entirely occluded, we discarded all of the pixels inside that block. If the block was entirely inside the visible area, we trivially computed the lighting equation for all pixels inside that block. Only when the block of pixels was partially obscured by the half-spaces were the individual pixels considered.
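The coarse classification can be sketched by testing a block's four corners against each half-space. This is my own reconstruction under the convention that a pixel is occluded when it is behind all of a tile's half-spaces, so a block is trivially lit as soon as any one half-space has the whole block on its lit side:

```cpp
#include <vector>

// A half-space: the lit side is where a*x + b*y + c >= 0.
struct HalfSpace { float a, b, c; };
enum class BlockClass { Culled, TriviallyLit, Partial };

BlockClass classify(float x0, float y0, float x1, float y1,
                    const std::vector<HalfSpace>& planes) {
    bool allBehindEvery = true; // block fully behind every half-space so far?
    const float xs[2] = {x0, x1}, ys[2] = {y0, y1};
    for (const HalfSpace& h : planes) {
        int front = 0;
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                if (h.a * xs[i] + h.b * ys[j] + h.c >= 0.0f) ++front;
        // All 4 corners on the lit side of ONE plane: no pixel in the block
        // can be behind all planes, so the whole block is trivially lit.
        if (front == 4) return BlockClass::TriviallyLit;
        if (front != 0) allBehindEvery = false; // block straddles this plane
    }
    return allBehindEvery ? BlockClass::Culled : BlockClass::Partial;
}
```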

Again, we saw a huge speed improvement. But again, we were still off target. Specifically, the rasterization and lighting buffer blurring steps were still taking too long.

Utilizing the power of multi-core processors
At this point, without any more algorithmic optimizations to try out, we turned away from algorithms and focused on hardware utilization.

We implemented a job-based threading system that allowed us to parallelize the workload across up to 16 processors. On implementing this for the rasterization stage, we reached our performance target. Finally.

One stage of the pipeline was still taking too damn long though. The blurring of the lighting buffer. This stage we initially left up to the Java2D convolution API. But that sucked. So instead we reimplemented it in software (seeing a pattern here?) using our job system. With these optimizations in place, we had reached our goal.
It certainly wasn't easy, but where the API failed us, we wrote our own functionality. We did this in numerous areas throughout the code base.

Visualization of the final optimizations. The cyan tiles are the above light source tiles, the purple ones are the below tiles. On the lower right you can see the coarse rasterization algorithm at work. Red tiles are culled, green are trivially accepted and yellow are partially culled.

There were still problems that proved unsolvable, to be fair, but we could live with them. One thing that remains a (really annoying) bottleneck is transferring the back buffer to an API construct and rendering it over the screen with bilinear filtering. To do that, and only that, on my development machine, still consumes around 1 millisecond(!!!).

Monday, 17 September 2012

SIMD Math Library Part 2: SSE With Matrices

Introduction
All right everyone, it's time to look at matrices. But first, a slight digression; a story, if you will. As I sat down to continue writing the SSE code one night, a question popped into my head: "Why the hell am I trying to optimize this function? I mean, I've used this particular operation maybe once or twice, and never where the performance actually mattered...". And there's a very important point there.
As I sat down to continue writing the SSE code one night, a question popped into my head, "Why the hell am I trying to optimize this function? I mean, I've used this particular operation maybe once or twice, and never where the performance actually mattered...". And there's a very important point there.

I am a terribly lazy programmer.

But more than that, it's okay that I am, because it's completely irrelevant from a performance perspective whether a function that accounts for maybe 0.01% of my run-time cost is optimized. Now, of course, there are probably a fair few people who may have need for the functions that I haven't optimized (pretty much only the matrix determinant and inverse functions), and my apologies for that. If I ever have some free time on my hands I'll probably go through and implement them. But there are cooler, more relevant things to dedicate my time to right now (you'll see in future articles ;) ).

Well if I'm going to only optimize things that truly matter (to me) then what did I optimize exactly?
There are two main types of operation that I focused on:
- Matrix-matrix multiplication.
- Matrix-vector multiplication (i.e. vector transformation).

Of these, the second is of particular importance if you're doing software pipelines for vertex processing and the like. To that end I wrote a batch multiply which transforms an entire array of vectors as efficiently as possible.
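For reference, here's what the batch transform computes, as a plain scalar sketch (my own stand-in, not the library's SSE code, which would keep the four matrix rows in registers across the whole batch). One 4x4 row-major matrix applied to an array of 4-float vectors:

```cpp
#include <cstddef>

// Scalar reference for the batch matrix-vector multiply: out[v] = m * in[v]
// for each 4-component vector in the array. Row-major 4x4 matrix.
void batchTransform(const float m[16], const float* in, float* out, std::size_t count) {
    for (std::size_t v = 0; v < count; ++v, in += 4, out += 4) {
        for (int r = 0; r < 4; ++r) {
            out[r] = m[r * 4 + 0] * in[0] + m[r * 4 + 1] * in[1]
                   + m[r * 4 + 2] * in[2] + m[r * 4 + 3] * in[3];
        }
    }
}
```

Hoisting the matrix out of the inner loop (and, in the SSE version, out of memory entirely) is what makes the batch form faster than calling a single-vector transform in a loop.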

Now that that's out of the way, lets get to code comparisons!

Matrix Addition:
MATHLIB_INLINE void matrix4x4_add(const matrix4x4& matrix1, const matrix4x4& matrix2, matrix4x4& result)
/// Performs a matrix addition and stores the matrix sum in result.
/// Corresponds to the mathematical statement result = matrix1 + matrix2.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	result.sse_row0 = _mm_add_ps(matrix1.sse_row0, matrix2.sse_row0);
	result.sse_row1 = _mm_add_ps(matrix1.sse_row1, matrix2.sse_row1);
	result.sse_row2 = _mm_add_ps(matrix1.sse_row2, matrix2.sse_row2);
	result.sse_row3 = _mm_add_ps(matrix1.sse_row3, matrix2.sse_row3);

#endif // (MATHLIB_SSE)
#else

	result._03  = matrix1._03 + matrix2._03;
	result._02  = matrix1._02 + matrix2._02;
	result._01  = matrix1._01 + matrix2._01;
	result._00  = matrix1._00 + matrix2._00;

	result._13  = matrix1._13 + matrix2._13;
	result._12  = matrix1._12 + matrix2._12;
	result._11  = matrix1._11 + matrix2._11;
	result._10  = matrix1._10 + matrix2._10;

	result._23  = matrix1._23 + matrix2._23;
	result._22  = matrix1._22 + matrix2._22;
	result._21  = matrix1._21 + matrix2._21;
	result._20  = matrix1._20 + matrix2._20;

	result._33  = matrix1._33 + matrix2._33;
	result._32  = matrix1._32 + matrix2._32;
	result._31  = matrix1._31 + matrix2._31;
	result._30  = matrix1._30 + matrix2._30;

#endif // (MATHLIB_SIMD)
}
Yeah, I know what you're all thinking: "What the hell is up with the addition order in the scalar version?". Well, it's a quirky thing, but I store the x element of a vector (or the first element of a matrix row) in the highest word (32 bits in this case) of an SSE vector. This means that the x element is actually stored at the highest memory address within the 128-bit vector, because little-endian platforms like x86 store higher-order bytes at higher memory addresses. So, basically, to keep the memory access pattern the same as usual, I end up having to do the operations in reverse.

Here's the assembly output:
FPU
004017EA push   ebp
004017EB mov    ebp,esp
004017ED mov    ecx,DWORD PTR [ebp+0x8]
004017F0 mov    edx,DWORD PTR [ebp+0xc]
004017F3 mov    eax,DWORD PTR [ebp+0x10]
004017F6 fld    DWORD PTR [ecx]
004017F8 fadd   DWORD PTR [edx]
004017FA fstp   DWORD PTR [eax]
004017FC fld    DWORD PTR [ecx+0x4]
004017FF fadd   DWORD PTR [edx+0x4]
00401802 fstp   DWORD PTR [eax+0x4]
00401805 fld    DWORD PTR [ecx+0x8]
00401808 fadd   DWORD PTR [edx+0x8]
0040180B fstp   DWORD PTR [eax+0x8]
0040180E fld    DWORD PTR [ecx+0xc]
00401811 fadd   DWORD PTR [edx+0xc]
00401814 fstp   DWORD PTR [eax+0xc]
00401817 fld    DWORD PTR [ecx+0x10]
0040181A fadd   DWORD PTR [edx+0x10]
0040181D fstp   DWORD PTR [eax+0x10]
00401820 fld    DWORD PTR [ecx+0x14]
00401823 fadd   DWORD PTR [edx+0x14]
00401826 fstp   DWORD PTR [eax+0x14]
00401829 fld    DWORD PTR [ecx+0x18]
0040182C fadd   DWORD PTR [edx+0x18]
0040182F fstp   DWORD PTR [eax+0x18]
00401832 fld    DWORD PTR [ecx+0x1c]
00401835 fadd   DWORD PTR [edx+0x1c]
00401838 fstp   DWORD PTR [eax+0x1c]
0040183B fld    DWORD PTR [ecx+0x20]
0040183E fadd   DWORD PTR [edx+0x20]
00401841 fstp   DWORD PTR [eax+0x20]
00401844 fld    DWORD PTR [ecx+0x24]
00401847 fadd   DWORD PTR [edx+0x24]
0040184A fstp   DWORD PTR [eax+0x24]
0040184D fld    DWORD PTR [ecx+0x28]
00401850 fadd   DWORD PTR [edx+0x28]
00401853 fstp   DWORD PTR [eax+0x28]
00401856 fld    DWORD PTR [ecx+0x2c]
00401859 fadd   DWORD PTR [edx+0x2c]
0040185C fstp   DWORD PTR [eax+0x2c]
0040185F fld    DWORD PTR [ecx+0x30]
00401862 fadd   DWORD PTR [edx+0x30]
00401865 fstp   DWORD PTR [eax+0x30]
00401868 fld    DWORD PTR [ecx+0x34]
0040186B fadd   DWORD PTR [edx+0x34]
0040186E fstp   DWORD PTR [eax+0x34]
00401871 fld    DWORD PTR [ecx+0x38]
00401874 fadd   DWORD PTR [edx+0x38]
00401877 fstp   DWORD PTR [eax+0x38]
0040187A fld    DWORD PTR [ecx+0x3c]
0040187D fadd   DWORD PTR [edx+0x3c]
00401880 fstp   DWORD PTR [eax+0x3c]
00401883 pop    ebp
00401884 ret

SSE
004016AB push   ebp
004016AC mov    ebp,esp
004016AE mov    ecx,DWORD PTR [ebp+0x8]
004016B1 mov    edx,DWORD PTR [ebp+0xc]
004016B4 mov    eax,DWORD PTR [ebp+0x10]
004016B7 movaps xmm0,XMMWORD PTR [ecx]
004016BA addps  xmm0,XMMWORD PTR [edx]
004016BD movaps XMMWORD PTR [eax],xmm0
004016C0 movaps xmm0,XMMWORD PTR [ecx+0x10]
004016C4 addps  xmm0,XMMWORD PTR [edx+0x10]
004016C8 movaps XMMWORD PTR [eax+0x10],xmm0
004016CC movaps xmm0,XMMWORD PTR [ecx+0x20]
004016D0 addps  xmm0,XMMWORD PTR [edx+0x20]
004016D4 movaps XMMWORD PTR [eax+0x20],xmm0
004016D8 movaps xmm0,XMMWORD PTR [ecx+0x30]
004016DC addps  xmm0,XMMWORD PTR [edx+0x30]
004016E0 movaps XMMWORD PTR [eax+0x30],xmm0
004016E4 leave
004016E5 ret
SSE again gets points for compactness. Subtraction is basically the same, so I won't waste valuable space by showing all of its assembly.

Matrix Scalar Multiplication:
MATHLIB_INLINE void matrix4x4_scalarMul(const matrix4x4& matrix, float scalar, matrix4x4& result)
/// Performs a matrix-scalar multiplication and stores the scaled matrix in result.
/// Corresponds to the mathematical statement result = matrix * scalar
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

    __m128 scaleVec = _mm_load1_ps(&scalar);
    result.sse_row0 = _mm_mul_ps(matrix.sse_row0, scaleVec);
    result.sse_row1 = _mm_mul_ps(matrix.sse_row1, scaleVec);
    result.sse_row2 = _mm_mul_ps(matrix.sse_row2, scaleVec);
    result.sse_row3 = _mm_mul_ps(matrix.sse_row3, scaleVec);

#endif // (MATHLIB_SSE)
#else

    result._03  = matrix._03 * scalar;
    result._02  = matrix._02 * scalar;
    result._01  = matrix._01 * scalar;
    result._00  = matrix._00 * scalar;

    result._13  = matrix._13 * scalar;
    result._12  = matrix._12 * scalar;
    result._11  = matrix._11 * scalar;
    result._10  = matrix._10 * scalar;

    result._23  = matrix._23 * scalar;
    result._22  = matrix._22 * scalar;
    result._21  = matrix._21 * scalar;
    result._20  = matrix._20 * scalar;

    result._33  = matrix._33 * scalar;
    result._32  = matrix._32 * scalar;
    result._31  = matrix._31 * scalar;
    result._30  = matrix._30 * scalar;

#endif // (MATHLIB_SIMD)
}

FPU
0040192D push   ebp
0040192E mov    ebp,esp
00401930 mov    edx,DWORD PTR [ebp+0x8]
00401933 fld    DWORD PTR [ebp+0xc]
00401936 mov    eax,DWORD PTR [ebp+0x10]
00401939 fld    st(0)
0040193B fmul   DWORD PTR [edx]
0040193D fstp   DWORD PTR [eax]
0040193F fld    st(0)
00401941 fmul   DWORD PTR [edx+0x4]
00401944 fstp   DWORD PTR [eax+0x4]
00401947 fld    st(0)
00401949 fmul   DWORD PTR [edx+0x8]
0040194C fstp   DWORD PTR [eax+0x8]
0040194F fld    st(0)
00401951 fmul   DWORD PTR [edx+0xc]
00401954 fstp   DWORD PTR [eax+0xc]
00401957 fld    st(0)
00401959 fmul   DWORD PTR [edx+0x10]
0040195C fstp   DWORD PTR [eax+0x10]
0040195F fld    st(0)
00401961 fmul   DWORD PTR [edx+0x14]
00401964 fstp   DWORD PTR [eax+0x14]
00401967 fld    st(0)
00401969 fmul   DWORD PTR [edx+0x18]
0040196C fstp   DWORD PTR [eax+0x18]
0040196F fld    st(0)
00401971 fmul   DWORD PTR [edx+0x1c]
00401974 fstp   DWORD PTR [eax+0x1c]
00401977 fld    st(0)
00401979 fmul   DWORD PTR [edx+0x20]
0040197C fstp   DWORD PTR [eax+0x20]
0040197F fld    st(0)
00401981 fmul   DWORD PTR [edx+0x24]
00401984 fstp   DWORD PTR [eax+0x24]
00401987 fld    st(0)
00401989 fmul   DWORD PTR [edx+0x28]
0040198C fstp   DWORD PTR [eax+0x28]
0040198F fld    st(0)
00401991 fmul   DWORD PTR [edx+0x2c]
00401994 fstp   DWORD PTR [eax+0x2c]
00401997 fld    st(0)
00401999 fmul   DWORD PTR [edx+0x30]
0040199C fstp   DWORD PTR [eax+0x30]
0040199F fld    st(0)
004019A1 fmul   DWORD PTR [edx+0x34]
004019A4 fstp   DWORD PTR [eax+0x34]
004019A7 fld    st(0)
004019A9 fmul   DWORD PTR [edx+0x38]
004019AC fstp   DWORD PTR [eax+0x38]
004019AF fmul   DWORD PTR [edx+0x3c]
004019B2 fstp   DWORD PTR [eax+0x3c]
004019B5 leave
004019B6 ret

SSE
00401538 push   ebp
00401539 mov    ebp,esp
0040153B mov    edx,DWORD PTR [ebp+0x8]
0040153E mov    eax,DWORD PTR [ebp+0x10]
00401541 movss  xmm0,DWORD PTR [ebp+0xc]
00401546 shufps xmm0,xmm0,0x0
0040154A movaps xmm1,xmm0
0040154D mulps  xmm1,XMMWORD PTR [edx]
00401550 movaps XMMWORD PTR [eax],xmm1
00401553 movaps xmm1,xmm0
00401556 mulps  xmm1,XMMWORD PTR [edx+0x10]
0040155A movaps XMMWORD PTR [eax+0x10],xmm1
0040155E movaps xmm1,xmm0
00401561 mulps  xmm1,XMMWORD PTR [edx+0x20]
00401565 movaps XMMWORD PTR [eax+0x20],xmm1
00401569 mulps  xmm0,XMMWORD PTR [edx+0x30]
0040156D movaps XMMWORD PTR [eax+0x30],xmm0
00401571 leave
00401572 ret
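The broadcast at the top of the SSE path is worth seeing in isolation (a standalone sketch, not MathLib code; the helper is hypothetical). `_mm_load1_ps` replicates one float into all four lanes, which is exactly the `movss` + `shufps 0x0` pair in the disassembly above:

```cpp
#include <xmmintrin.h>
#include <cassert>

// Hypothetical helper: scale one 4-float row by a scalar, the same way
// matrix4x4_scalarMul handles each of its four rows.
inline void broadcast_mul(const float row[4], float scalar, float out[4])
{
	__m128 s = _mm_load1_ps(&scalar);	// splat scalar into all four lanes
	__m128 r = _mm_mul_ps(_mm_loadu_ps(row), s);
	_mm_storeu_ps(out, r);
}
```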

Matrix Matrix Multiplication: 
This is an important function, and a large one; fortunately, it SIMD-izes pretty darn well.
MATHLIB_INLINE void matrix4x4_mul(const matrix4x4& matrix1, const matrix4x4& matrix2, matrix4x4& result)
/// Performs a matrix-matrix multiplication and stores the product in result.
/// Corresponds to the mathematical statement result = matrix1 * matrix2
/// @warning The three matrices must be distinct, the result will be incorrect if result is the same object as matrix1 or matrix2.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	__m128 row0;
	__m128 row1;
	__m128 row2;
	__m128 row3;
	__m128 value0;

	// Compute row 0 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(3, 3, 3, 3));
	row0 	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(2, 2, 2, 2));
	row0	= _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(1, 1, 1, 1));
	row0	= _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(0, 0, 0, 0));
	row0	= _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row3, value0));

	// Compute row 1 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(3, 3, 3, 3));
	row1 	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(2, 2, 2, 2));
	row1	= _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(1, 1, 1, 1));
	row1	= _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(0, 0, 0, 0));
	row1	= _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row3, value0));

	// Compute row 2 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(3, 3, 3, 3));
	row2	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(2, 2, 2, 2));
	row2	= _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(1, 1, 1, 1));
	row2	= _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(0, 0, 0, 0));
	row2	= _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row3, value0));

	// Compute row 3 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(3, 3, 3, 3));
	row3 	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(2, 2, 2, 2));
	row3	= _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(1, 1, 1, 1));
	row3	= _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(0, 0, 0, 0));
	row3	= _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row3, value0));

	// Write results back to memory:
	result.sse_row0	= row0;
	result.sse_row1	= row1;
	result.sse_row2	= row2;
	result.sse_row3	= row3;

#endif // (MATHLIB_SSE)
#else

	result._03	= (matrix1._03 * matrix2._33) + (matrix1._02 * matrix2._23) + (matrix1._01 * matrix2._13) + (matrix1._00 * matrix2._03);
	result._02	= (matrix1._03 * matrix2._32) + (matrix1._02 * matrix2._22) + (matrix1._01 * matrix2._12) + (matrix1._00 * matrix2._02);
	result._01	= (matrix1._03 * matrix2._31) + (matrix1._02 * matrix2._21) + (matrix1._01 * matrix2._11) + (matrix1._00 * matrix2._01);
	result._00	= (matrix1._03 * matrix2._30) + (matrix1._02 * matrix2._20) + (matrix1._01 * matrix2._10) + (matrix1._00 * matrix2._00);

	result._13	= (matrix1._13 * matrix2._33) + (matrix1._12 * matrix2._23) + (matrix1._11 * matrix2._13) + (matrix1._10 * matrix2._03);
	result._12	= (matrix1._13 * matrix2._32) + (matrix1._12 * matrix2._22) + (matrix1._11 * matrix2._12) + (matrix1._10 * matrix2._02);
	result._11	= (matrix1._13 * matrix2._31) + (matrix1._12 * matrix2._21) + (matrix1._11 * matrix2._11) + (matrix1._10 * matrix2._01);
	result._10	= (matrix1._13 * matrix2._30) + (matrix1._12 * matrix2._20) + (matrix1._11 * matrix2._10) + (matrix1._10 * matrix2._00);

	result._23	= (matrix1._23 * matrix2._33) + (matrix1._22 * matrix2._23) + (matrix1._21 * matrix2._13) + (matrix1._20 * matrix2._03);
	result._22	= (matrix1._23 * matrix2._32) + (matrix1._22 * matrix2._22) + (matrix1._21 * matrix2._12) + (matrix1._20 * matrix2._02);
	result._21	= (matrix1._23 * matrix2._31) + (matrix1._22 * matrix2._21) + (matrix1._21 * matrix2._11) + (matrix1._20 * matrix2._01);
	result._20	= (matrix1._23 * matrix2._30) + (matrix1._22 * matrix2._20) + (matrix1._21 * matrix2._10) + (matrix1._20 * matrix2._00);

	result._33	= (matrix1._33 * matrix2._33) + (matrix1._32 * matrix2._23) + (matrix1._31 * matrix2._13) + (matrix1._30 * matrix2._03);
	result._32	= (matrix1._33 * matrix2._32) + (matrix1._32 * matrix2._22) + (matrix1._31 * matrix2._12) + (matrix1._30 * matrix2._02);
	result._31	= (matrix1._33 * matrix2._31) + (matrix1._32 * matrix2._21) + (matrix1._31 * matrix2._11) + (matrix1._30 * matrix2._01);
	result._30	= (matrix1._33 * matrix2._30) + (matrix1._32 * matrix2._20) + (matrix1._31 * matrix2._10) + (matrix1._30 * matrix2._00);

#endif // (MATHLIB_SIMD)
}
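If you want to play with the splat-and-accumulate pattern outside the library, here's a minimal standalone version (a sketch, not MathLib code; it uses a plain layout where lane 0 holds the first element of each row, rather than my reversed convention):

```cpp
#include <xmmintrin.h>
#include <cassert>

// Row-major 4x4 multiply: for each row of a, splat each element across
// a register and multiply it by the corresponding row of b, accumulating
// the four partial products. Same pattern as matrix4x4_mul above, minus
// the reversed lane order.
inline void mat4_mul_sse(const __m128 a[4], const __m128 b[4], __m128 out[4])
{
	for (int i = 0; i < 4; ++i)
	{
		__m128 r = _mm_mul_ps(_mm_shuffle_ps(a[i], a[i], 0x00), b[0]);
		r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(a[i], a[i], 0x55), b[1]));
		r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(a[i], a[i], 0xaa), b[2]));
		r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(a[i], a[i], 0xff), b[3]));
		out[i] = r;
	}
}
```

Each row of the product needs 4 shuffles, 4 multiplies, and 3 adds, with no horizontal operations at all, which is why this function vectorizes so cleanly.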

FPU
004019B7 push   ebp
004019B8 mov    ebp,esp
004019BA mov    edx,DWORD PTR [ebp+0x8]
004019BD mov    eax,DWORD PTR [ebp+0xc]
004019C0 mov    ecx,DWORD PTR [ebp+0x10]
004019C3 fld    DWORD PTR [edx]
004019C5 fmul   DWORD PTR [eax+0x30]
004019C8 fld    DWORD PTR [edx+0x4]
004019CB fld    st(0)
004019CD fmul   DWORD PTR [eax+0x20]
004019D0 faddp  st(2),st
004019D2 fld    DWORD PTR [edx+0x8]
004019D5 fld    st(0)
004019D7 fmul   DWORD PTR [eax+0x10]
004019DA faddp  st(3),st
004019DC fld    DWORD PTR [edx+0xc]
004019DF fld    st(0)
004019E1 fmul   DWORD PTR [eax]
004019E3 faddp  st(4),st
004019E5 fxch   st(3)
004019E7 fstp   DWORD PTR [ecx]
004019E9 fld    DWORD PTR [edx]
004019EB fld    st(0)
004019ED fmul   DWORD PTR [eax+0x34]
004019F0 fxch   st(3)
004019F2 fmul   DWORD PTR [eax+0x24]
004019F5 faddp  st(3),st
004019F7 fld    st(1)
004019F9 fmul   DWORD PTR [eax+0x14]
004019FC faddp  st(3),st
004019FE fld    st(3)
00401A00 fmul   DWORD PTR [eax+0x4]
00401A03 faddp  st(3),st
00401A05 fxch   st(2)
00401A07 fstp   DWORD PTR [ecx+0x4]
00401A0A fld    st(1)
00401A0C fmul   DWORD PTR [eax+0x38]
00401A0F fld    DWORD PTR [edx+0x4]
00401A12 fld    st(0)
00401A14 fmul   DWORD PTR [eax+0x28]
00401A17 faddp  st(2),st
00401A19 fxch   st(2)
00401A1B fmul   DWORD PTR [eax+0x18]
00401A1E faddp  st(1),st
00401A20 fld    st(3)
00401A22 fmul   DWORD PTR [eax+0x8]
00401A25 faddp  st(1),st
00401A27 fstp   DWORD PTR [ecx+0x8]
00401A2A fxch   st(1)
00401A2C fmul   DWORD PTR [eax+0x3c]
00401A2F fxch   st(1)
00401A31 fmul   DWORD PTR [eax+0x2c]
00401A34 faddp  st(1),st
00401A36 fld    DWORD PTR [edx+0x8]
00401A39 fmul   DWORD PTR [eax+0x1c]
00401A3C faddp  st(1),st
00401A3E fxch   st(1)
00401A40 fmul   DWORD PTR [eax+0xc]
00401A43 faddp  st(1),st
00401A45 fstp   DWORD PTR [ecx+0xc]
00401A48 fld    DWORD PTR [edx+0x10]
00401A4B fmul   DWORD PTR [eax+0x30]
00401A4E fld    DWORD PTR [edx+0x14]
00401A51 fld    st(0)
00401A53 fmul   DWORD PTR [eax+0x20]
00401A56 faddp  st(2),st
00401A58 fld    DWORD PTR [edx+0x18]
00401A5B fld    st(0)
00401A5D fmul   DWORD PTR [eax+0x10]
00401A60 faddp  st(3),st
00401A62 fld    DWORD PTR [edx+0x1c]
00401A65 fld    st(0)
00401A67 fmul   DWORD PTR [eax]
00401A69 faddp  st(4),st
00401A6B fxch   st(3)
00401A6D fstp   DWORD PTR [ecx+0x10]
00401A70 fld    DWORD PTR [edx+0x10]
00401A73 fld    st(0)
00401A75 fmul   DWORD PTR [eax+0x34]
00401A78 fxch   st(3)
00401A7A fmul   DWORD PTR [eax+0x24]
00401A7D faddp  st(3),st
00401A7F fld    st(1)
00401A81 fmul   DWORD PTR [eax+0x14]
00401A84 faddp  st(3),st
00401A86 fld    st(3)
00401A88 fmul   DWORD PTR [eax+0x4]
00401A8B faddp  st(3),st
00401A8D fxch   st(2)
00401A8F fstp   DWORD PTR [ecx+0x14]
00401A92 fld    st(1)
00401A94 fmul   DWORD PTR [eax+0x38]
00401A97 fld    DWORD PTR [edx+0x14]
00401A9A fld    st(0)
00401A9C fmul   DWORD PTR [eax+0x28]
00401A9F faddp  st(2),st
00401AA1 fxch   st(2)
00401AA3 fmul   DWORD PTR [eax+0x18]
00401AA6 faddp  st(1),st
00401AA8 fld    st(3)
00401AAA fmul   DWORD PTR [eax+0x8]
00401AAD faddp  st(1),st
00401AAF fstp   DWORD PTR [ecx+0x18]
00401AB2 fxch   st(1)
00401AB4 fmul   DWORD PTR [eax+0x3c]
00401AB7 fxch   st(1)
00401AB9 fmul   DWORD PTR [eax+0x2c]
00401ABC faddp  st(1),st
00401ABE fld    DWORD PTR [edx+0x18]
00401AC1 fmul   DWORD PTR [eax+0x1c]
00401AC4 faddp  st(1),st
00401AC6 fxch   st(1)
00401AC8 fmul   DWORD PTR [eax+0xc]
00401ACB faddp  st(1),st
00401ACD fstp   DWORD PTR [ecx+0x1c]
00401AD0 fld    DWORD PTR [edx+0x20]
00401AD3 fmul   DWORD PTR [eax+0x30]
00401AD6 fld    DWORD PTR [edx+0x24]
00401AD9 fld    st(0)
00401ADB fmul   DWORD PTR [eax+0x20]
00401ADE faddp  st(2),st
00401AE0 fld    DWORD PTR [edx+0x28]
00401AE3 fld    st(0)
00401AE5 fmul   DWORD PTR [eax+0x10]
00401AE8 faddp  st(3),st
00401AEA fld    DWORD PTR [edx+0x2c]
00401AED fld    st(0)
00401AEF fmul   DWORD PTR [eax]
00401AF1 faddp  st(4),st
00401AF3 fxch   st(3)
00401AF5 fstp   DWORD PTR [ecx+0x20]
00401AF8 fld    DWORD PTR [edx+0x20]
00401AFB fld    st(0)
00401AFD fmul   DWORD PTR [eax+0x34]
00401B00 fxch   st(3)
00401B02 fmul   DWORD PTR [eax+0x24]
00401B05 faddp  st(3),st
00401B07 fld    st(1)
00401B09 fmul   DWORD PTR [eax+0x14]
00401B0C faddp  st(3),st
00401B0E fld    st(3)
00401B10 fmul   DWORD PTR [eax+0x4]
00401B13 faddp  st(3),st
00401B15 fxch   st(2)
00401B17 fstp   DWORD PTR [ecx+0x24]
00401B1A fld    st(1)
00401B1C fmul   DWORD PTR [eax+0x38]
00401B1F fld    DWORD PTR [edx+0x24]
00401B22 fld    st(0)
00401B24 fmul   DWORD PTR [eax+0x28]
00401B27 faddp  st(2),st
00401B29 fxch   st(2)
00401B2B fmul   DWORD PTR [eax+0x18]
00401B2E faddp  st(1),st
00401B30 fld    st(3)
00401B32 fmul   DWORD PTR [eax+0x8]
00401B35 faddp  st(1),st
00401B37 fstp   DWORD PTR [ecx+0x28]
00401B3A fxch   st(1)
00401B3C fmul   DWORD PTR [eax+0x3c]
00401B3F fxch   st(1)
00401B41 fmul   DWORD PTR [eax+0x2c]
00401B44 faddp  st(1),st
00401B46 fld    DWORD PTR [edx+0x28]
00401B49 fmul   DWORD PTR [eax+0x1c]
00401B4C faddp  st(1),st
00401B4E fxch   st(1)
00401B50 fmul   DWORD PTR [eax+0xc]
00401B53 faddp  st(1),st
00401B55 fstp   DWORD PTR [ecx+0x2c]
00401B58 fld    DWORD PTR [edx+0x30]
00401B5B fmul   DWORD PTR [eax+0x30]
00401B5E fld    DWORD PTR [edx+0x34]
00401B61 fld    st(0)
00401B63 fmul   DWORD PTR [eax+0x20]
00401B66 faddp  st(2),st
00401B68 fld    DWORD PTR [edx+0x38]
00401B6B fld    st(0)
00401B6D fmul   DWORD PTR [eax+0x10]
00401B70 faddp  st(3),st
00401B72 fld    DWORD PTR [edx+0x3c]
00401B75 fld    st(0)
00401B77 fmul   DWORD PTR [eax]
00401B79 faddp  st(4),st
00401B7B fxch   st(3)
00401B7D fstp   DWORD PTR [ecx+0x30]
00401B80 fld    DWORD PTR [edx+0x30]
00401B83 fld    st(0)
00401B85 fmul   DWORD PTR [eax+0x34]
00401B88 fxch   st(3)
00401B8A fmul   DWORD PTR [eax+0x24]
00401B8D faddp  st(3),st
00401B8F fld    st(1)
00401B91 fmul   DWORD PTR [eax+0x14]
00401B94 faddp  st(3),st
00401B96 fld    st(3)
00401B98 fmul   DWORD PTR [eax+0x4]
00401B9B faddp  st(3),st
00401B9D fxch   st(2)
00401B9F fstp   DWORD PTR [ecx+0x34]
00401BA2 fld    st(1)
00401BA4 fmul   DWORD PTR [eax+0x38]
00401BA7 fld    DWORD PTR [edx+0x34]
00401BAA fld    st(0)
00401BAC fmul   DWORD PTR [eax+0x28]
00401BAF faddp  st(2),st
00401BB1 fxch   st(2)
00401BB3 fmul   DWORD PTR [eax+0x18]
00401BB6 faddp  st(1),st
00401BB8 fld    st(3)
00401BBA fmul   DWORD PTR [eax+0x8]
00401BBD faddp  st(1),st
00401BBF fstp   DWORD PTR [ecx+0x38]
00401BC2 fxch   st(1)
00401BC4 fmul   DWORD PTR [eax+0x3c]
00401BC7 fxch   st(1)
00401BC9 fmul   DWORD PTR [eax+0x2c]
00401BCC faddp  st(1),st
00401BCE fld    DWORD PTR [edx+0x38]
00401BD1 fmul   DWORD PTR [eax+0x1c]
00401BD4 faddp  st(1),st
00401BD6 fxch   st(1)
00401BD8 fmul   DWORD PTR [eax+0xc]
00401BDB faddp  st(1),st
00401BDD fstp   DWORD PTR [ecx+0x3c]
00401BE0 leave
00401BE1 ret

SSE
0041DCDC push   ebp
0041DCDD mov    ebp,esp
0041DCDF mov    edx,DWORD PTR [ebp+0x8]
0041DCE2 mov    ecx,DWORD PTR [ebp+0xc]
0041DCE5 mov    eax,DWORD PTR [ebp+0x10]
0041DCE8 movaps xmm0,XMMWORD PTR [edx]
0041DCEB movaps xmm6,xmm0
0041DCEE shufps xmm6,xmm0,0xff
0041DCF2 movaps xmm5,XMMWORD PTR [ecx]
0041DCF5 mulps  xmm6,xmm5
0041DCF8 movaps xmm1,xmm0
0041DCFB shufps xmm1,xmm0,0xaa
0041DCFF movaps xmm4,XMMWORD PTR [ecx+0x10]
0041DD03 mulps  xmm1,xmm4
0041DD06 addps  xmm6,xmm1
0041DD09 movaps xmm1,xmm0
0041DD0C shufps xmm1,xmm0,0x55
0041DD10 movaps xmm3,XMMWORD PTR [ecx+0x20]
0041DD14 mulps  xmm1,xmm3
0041DD17 addps  xmm6,xmm1
0041DD1A shufps xmm0,xmm0,0x0
0041DD1E movaps xmm2,XMMWORD PTR [ecx+0x30]
0041DD22 mulps  xmm0,xmm2
0041DD25 addps  xmm6,xmm0
0041DD28 movaps xmm0,XMMWORD PTR [edx+0x10]
0041DD2C movaps xmm7,xmm0
0041DD2F shufps xmm7,xmm0,0xff
0041DD33 mulps  xmm7,xmm5
0041DD36 movaps xmm1,xmm0
0041DD39 shufps xmm1,xmm0,0xaa
0041DD3D mulps  xmm1,xmm4
0041DD40 addps  xmm1,xmm7
0041DD43 movaps xmm7,xmm0
0041DD46 shufps xmm7,xmm0,0x55
0041DD4A mulps  xmm7,xmm3
0041DD4D addps  xmm1,xmm7
0041DD50 shufps xmm0,xmm0,0x0
0041DD54 mulps  xmm0,xmm2
0041DD57 addps  xmm1,xmm0
0041DD5A movaps XMMWORD PTR [eax+0x10],xmm1
0041DD5E movaps xmm1,XMMWORD PTR [edx+0x20]
0041DD62 movaps xmm7,xmm1
0041DD65 shufps xmm7,xmm1,0xff
0041DD69 mulps  xmm7,xmm5
0041DD6C movaps xmm0,xmm1
0041DD6F shufps xmm0,xmm1,0xaa
0041DD73 mulps  xmm0,xmm4
0041DD76 addps  xmm7,xmm0
0041DD79 movaps xmm0,xmm1
0041DD7C shufps xmm0,xmm1,0x55
0041DD80 mulps  xmm0,xmm3
0041DD83 addps  xmm0,xmm7
0041DD86 shufps xmm1,xmm1,0x0
0041DD8A mulps  xmm1,xmm2
0041DD8D addps  xmm1,xmm0
0041DD90 movaps xmm0,XMMWORD PTR [edx+0x30]
0041DD94 movaps xmm7,xmm0
0041DD97 shufps xmm7,xmm0,0xff
0041DD9B mulps  xmm5,xmm7
0041DD9E movaps xmm7,xmm0
0041DDA1 shufps xmm7,xmm0,0xaa
0041DDA5 mulps  xmm4,xmm7
0041DDA8 addps  xmm5,xmm4
0041DDAB movaps xmm4,xmm0
0041DDAE shufps xmm4,xmm0,0x55
0041DDB2 mulps  xmm3,xmm4
0041DDB5 addps  xmm5,xmm3
0041DDB8 shufps xmm0,xmm0,0x0
0041DDBC mulps  xmm2,xmm0
0041DDBF addps  xmm2,xmm5
0041DDC2 movaps XMMWORD PTR [eax],xmm6
0041DDC5 movaps XMMWORD PTR [eax+0x20],xmm1
0041DDC9 movaps XMMWORD PTR [eax+0x30],xmm2
0041DDCD leave
0041DDCE ret
Two things here: yes, the SSE code is smaller, but at this point both functions are too large to be inlined by the compiler. Although, interestingly enough, my compiler decided that my FPU code could stay inlined (???)... anyone have an idea why? Regardless, I left the MATHLIB_INLINE macro in place, in the hope that the compiler knows best somehow.

Matrix Vector Multiplication [Single]:
MATHLIB_INLINE void matrix4x4_vectorMul(const matrix4x4& matrix, const vector4& vector, vector4& result)
/// Performs a multiplication of the given vector (column matrix) by the given matrix.
/// This method corresponds to the mathematical statement result = matrix * vector, i.e. the vector is a column vector.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	__m128 m0	= _mm_mul_ps(matrix.sse_row0, vector.sse_vec);
	__m128 m1	= _mm_mul_ps(matrix.sse_row1, vector.sse_vec);
	__m128 m2	= _mm_mul_ps(matrix.sse_row2, vector.sse_vec);
	__m128 m3	= _mm_mul_ps(matrix.sse_row3, vector.sse_vec);

	m0	= _mm_hadd_ps(m1, m0);
	m2	= _mm_hadd_ps(m3, m2);
	m0	= _mm_hadd_ps(m2, m0);

	result.sse_vec	= m0;

#endif // (MATHLIB_SSE)
#else

	float vectorX	= vector.extractX();
	float vectorY	= vector.extractY();
	float vectorZ	= vector.extractZ();
	float vectorW	= vector.extractW();

	result.setXYZW((matrix._03 * vectorW) + (matrix._02 * vectorZ) + (matrix._01 * vectorY) + (matrix._00 * vectorX),
				   (matrix._13 * vectorW) + (matrix._12 * vectorZ) + (matrix._11 * vectorY) + (matrix._10 * vectorX),
				   (matrix._23 * vectorW) + (matrix._22 * vectorZ) + (matrix._21 * vectorY) + (matrix._20 * vectorX),
				   (matrix._33 * vectorW) + (matrix._32 * vectorZ) + (matrix._31 * vectorY) + (matrix._30 * vectorX));

#endif // (MATHLIB_SIMD)
}
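The argument order in those `_mm_hadd_ps` calls looks backwards until you model what the intrinsic actually does. Here's a scalar sketch (illustrative only; the real intrinsic needs SSE3): the low two result lanes are the pairwise sums of the first operand and the high two come from the second, so passing `(m1, m0)` puts row 0's sums in the high lanes, matching the reversed element order.

```cpp
#include <cassert>

// Scalar model of _mm_hadd_ps(a, b): lanes 0-1 are the pairwise sums
// of a, lanes 2-3 are the pairwise sums of b.
inline void hadd_model(const float a[4], const float b[4], float r[4])
{
	r[0] = a[0] + a[1];
	r[1] = a[2] + a[3];
	r[2] = b[0] + b[1];
	r[3] = b[2] + b[3];
}
```

Three rounds of this reduce the four per-row products down to the four dot products that make up the transformed vector.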

FPU
00401C30	push   ebp
00401C31	mov    ebp,esp
00401C33	mov    eax,DWORD PTR [ebp+0x8]
00401C36	mov    ecx,DWORD PTR [ebp+0xc]
00401C39	mov    edx,DWORD PTR [ebp+0x10]
00401C3C	fld    DWORD PTR [ecx+0xc]
00401C3F	fld    DWORD PTR [ecx+0x8]
00401C42	fld    DWORD PTR [ecx+0x4]
00401C45	fld    DWORD PTR [ecx]
00401C47	fld    st(0)
00401C49	fmul   DWORD PTR [eax+0x30]
00401C4C	fld    st(2)
00401C4E	fmul   DWORD PTR [eax+0x34]
00401C51	faddp  st(1),st
00401C53	fld    st(3)
00401C55	fmul   DWORD PTR [eax+0x38]
00401C58	faddp  st(1),st
00401C5A	fld    st(4)
00401C5C	fmul   DWORD PTR [eax+0x3c]
00401C5F	faddp  st(1),st
00401C61	fld    st(1)
00401C63	fmul   DWORD PTR [eax+0x20]
00401C66	fld    st(3)
00401C68	fmul   DWORD PTR [eax+0x24]
00401C6B	faddp  st(1),st
00401C6D	fld    st(4)
00401C6F	fmul   DWORD PTR [eax+0x28]
00401C72	faddp  st(1),st
00401C74	fld    st(5)
00401C76	fmul   DWORD PTR [eax+0x2c]
00401C79	faddp  st(1),st
00401C7B	fld    st(2)
00401C7D	fmul   DWORD PTR [eax+0x10]
00401C80	fld    st(4)
00401C82	fmul   DWORD PTR [eax+0x14]
00401C85	faddp  st(1),st
00401C87	fld    st(5)
00401C89	fmul   DWORD PTR [eax+0x18]
00401C8C	faddp  st(1),st
00401C8E	fld    st(6)
00401C90	fmul   DWORD PTR [eax+0x1c]
00401C93	faddp  st(1),st
00401C95	fxch   st(3)
00401C97	fmul   DWORD PTR [eax]
00401C99	fxch   st(4)
00401C9B	fmul   DWORD PTR [eax+0x4]
00401C9E	faddp  st(4),st
00401CA0	fxch   st(4)
00401CA2	fmul   DWORD PTR [eax+0x8]
00401CA5	faddp  st(3),st
00401CA7	fxch   st(4)
00401CA9	fmul   DWORD PTR [eax+0xc]
00401CAC	faddp  st(2),st
00401CAE	fxch   st(1)
00401CB0	fstp   DWORD PTR [edx+0xc]
00401CB3	fstp   DWORD PTR [edx+0x8]
00401CB6	fstp   DWORD PTR [edx+0x4]
00401CB9	fstp   DWORD PTR [edx]
00401CBB	leave
00401CBC	ret

SSE
004014A8	push   ebp
004014A9	mov    ebp,esp
004014AB	mov    eax,DWORD PTR [ebp+0x8]
004014AE	mov    edx,DWORD PTR [ebp+0xc]
004014B1	movaps xmm1,XMMWORD PTR [edx]
004014B4	movaps xmm3,xmm1
004014B7	mulps  xmm3,XMMWORD PTR [eax]
004014BA	movaps xmm0,xmm1
004014BD	mulps  xmm0,XMMWORD PTR [eax+0x10]
004014C1	movaps xmm2,xmm1
004014C4	mulps  xmm2,XMMWORD PTR [eax+0x20]
004014C8	mulps  xmm1,XMMWORD PTR [eax+0x30]
004014CC	haddps xmm0,xmm3
004014D0	haddps xmm1,xmm2
004014D4	haddps xmm1,xmm0
004014D8	mov    eax,DWORD PTR [ebp+0x10]
004014DB	movaps XMMWORD PTR [eax],xmm1
004014DE	leave
004014DF	ret

Matrix Vector Multiplication [Batch]:
MATHLIB_INLINE void matrix4x4_vectorBatchMul(const matrix4x4& matrix, vector4 const * const vectorArray, uint32_t numVectors, vector4* const resultArray)
/// Performs a batch matrix vector multiplication (column matrix).
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	// Load matrix:
	__m128 matrix_row0	= matrix.sse_row0;
	__m128 matrix_row1	= matrix.sse_row1;
	__m128 matrix_row2	= matrix.sse_row2;
	__m128 matrix_row3	= matrix.sse_row3;

	for (uint32_t i = 0; i < numVectors; i++)
	{
		__m128 m0	= _mm_mul_ps(matrix_row0, vectorArray[i].sse_vec);
		__m128 m1	= _mm_mul_ps(matrix_row1, vectorArray[i].sse_vec);
		__m128 m2	= _mm_mul_ps(matrix_row2, vectorArray[i].sse_vec);
		__m128 m3	= _mm_mul_ps(matrix_row3, vectorArray[i].sse_vec);

		m0	= _mm_hadd_ps(m1, m0);
		m2	= _mm_hadd_ps(m3, m2);
		m0	= _mm_hadd_ps(m2, m0);

		resultArray[i].sse_vec	= m0;
	}

#endif // (MATHLIB_SSE)
#else

	for (uint32_t i = 0; i < numVectors; i++)
	{
		float vectorX	= vectorArray[i].extractX();
		float vectorY	= vectorArray[i].extractY();
		float vectorZ	= vectorArray[i].extractZ();
		float vectorW	= vectorArray[i].extractW();

		resultArray[i].setXYZW((matrix._03 * vectorW) + (matrix._02 * vectorZ) + (matrix._01 * vectorY) + (matrix._00 * vectorX),
							   (matrix._13 * vectorW) + (matrix._12 * vectorZ) + (matrix._11 * vectorY) + (matrix._10 * vectorX),
							   (matrix._23 * vectorW) + (matrix._22 * vectorZ) + (matrix._21 * vectorY) + (matrix._20 * vectorX),
							   (matrix._33 * vectorW) + (matrix._32 * vectorZ) + (matrix._31 * vectorY) + (matrix._30 * vectorX));
	}

#endif // (MATHLIB_SIMD)
}
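The one trick worth calling out is hoisting the four matrix rows into locals before the loop, so the compiler can keep them in xmm registers for every iteration; that's the block of `movaps` before the loop body in the SSE listing below. A standalone sketch of the same idea (hypothetical helper; it uses a plain column-vector layout rather than my reversed one, and multiply-add instead of hadd):

```cpp
#include <xmmintrin.h>
#include <cstdint>
#include <cassert>

// Transform `count` packed 4-float vectors by a matrix given as four
// column vectors. The columns are loaded into locals once, outside the
// loop, so they stay resident in registers across iterations.
inline void batch_transform(const __m128 cols[4], const float* in,
                            uint32_t count, float* out)
{
	__m128 c0 = cols[0], c1 = cols[1], c2 = cols[2], c3 = cols[3];
	for (uint32_t i = 0; i < count; ++i)
	{
		__m128 v = _mm_loadu_ps(in + 4 * i);
		// result = x*col0 + y*col1 + z*col2 + w*col3
		__m128 r = _mm_mul_ps(c0, _mm_shuffle_ps(v, v, 0x00));
		r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_shuffle_ps(v, v, 0x55)));
		r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_shuffle_ps(v, v, 0xaa)));
		r = _mm_add_ps(r, _mm_mul_ps(c3, _mm_shuffle_ps(v, v, 0xff)));
		_mm_storeu_ps(out + 4 * i, r);
	}
}
```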

FPU
00401CBD	push   ebp
00401CBE	mov    ebp,esp
00401CC0	push   edi
00401CC1	push   esi
00401CC2	push   ebx
00401CC3	mov    eax,DWORD PTR [ebp+0x8]
00401CC6	mov    ecx,DWORD PTR [ebp+0xc]
00401CC9	mov    edi,DWORD PTR [ebp+0x10]
00401CCC	mov    ebx,DWORD PTR [ebp+0x14]
00401CCF	test   edi,edi
00401CD1	je     0x401d74 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+183>
00401CD7	mov    edx,0x0
00401CDC	mov    esi,0x0
00401CE1	fld    DWORD PTR [ecx+edx*1+0xc]
00401CE5	fld    DWORD PTR [ecx+edx*1+0x8]
00401CE9	fld    DWORD PTR [ecx+edx*1+0x4]
00401CED	fld    DWORD PTR [ecx+edx*1]
00401CF0	fld    st(0)
00401CF2	fmul   DWORD PTR [eax+0x30]
00401CF5	fld    st(2)
00401CF7	fmul   DWORD PTR [eax+0x34]
00401CFA	faddp  st(1),st
00401CFC	fld    st(3)
00401CFE	fmul   DWORD PTR [eax+0x38]
00401D01	faddp  st(1),st
00401D03	fld    st(4)
00401D05	fmul   DWORD PTR [eax+0x3c]
00401D08	faddp  st(1),st
00401D0A	fld    st(1)
00401D0C	fmul   DWORD PTR [eax+0x20]
00401D0F	fld    st(3)
00401D11	fmul   DWORD PTR [eax+0x24]
00401D14	faddp  st(1),st
00401D16	fld    st(4)
00401D18	fmul   DWORD PTR [eax+0x28]
00401D1B	faddp  st(1),st
00401D1D	fld    st(5)
00401D1F	fmul   DWORD PTR [eax+0x2c]
00401D22	faddp  st(1),st
00401D24	fld    st(2)
00401D26	fmul   DWORD PTR [eax+0x10]
00401D29	fld    st(4)
00401D2B	fmul   DWORD PTR [eax+0x14]
00401D2E	faddp  st(1),st
00401D30	fld    st(5)
00401D32	fmul   DWORD PTR [eax+0x18]
00401D35	faddp  st(1),st
00401D37	fld    st(6)
00401D39	fmul   DWORD PTR [eax+0x1c]
00401D3C	faddp  st(1),st
00401D3E	fxch   st(3)
00401D40	fmul   DWORD PTR [eax]
00401D42	fxch   st(4)
00401D44	fmul   DWORD PTR [eax+0x4]
00401D47	faddp  st(4),st
00401D49	fxch   st(4)
00401D4B	fmul   DWORD PTR [eax+0x8]
00401D4E	faddp  st(3),st
00401D50	fxch   st(4)
00401D52	fmul   DWORD PTR [eax+0xc]
00401D55	faddp  st(2),st
00401D57	fxch   st(1)
00401D59	fstp   DWORD PTR [ebx+edx*1+0xc]
00401D5D	fstp   DWORD PTR [ebx+edx*1+0x8]
00401D61	fstp   DWORD PTR [ebx+edx*1+0x4]
00401D65	fstp   DWORD PTR [ebx+edx*1]
00401D68	inc    esi
00401D69	add    edx,0x10
00401D6C	cmp    edi,esi
00401D6E	ja     0x401ce1 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+36>
00401D74	pop    ebx
00401D75	pop    esi
00401D76	pop    edi
00401D77	leave
00401D78	ret

SSE
004014A8	push   ebp
004014A9	mov    ebp,esp
004014AB	push   ebx
004014AC	mov    eax,DWORD PTR [ebp+0x8]
004014AF	mov    ebx,DWORD PTR [ebp+0x10]
004014B2	movaps xmm3,XMMWORD PTR [eax]
004014B5	movaps xmm4,XMMWORD PTR [eax+0x10]
004014B9	movaps xmm5,XMMWORD PTR [eax+0x20]
004014BD	movaps xmm6,XMMWORD PTR [eax+0x30]
004014C1	test   ebx,ebx
004014C3	je     0x401502 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+90>
004014C5	mov    ecx,DWORD PTR [ebp+0xc]
004014C8	mov    edx,DWORD PTR [ebp+0x14]
004014CB	mov    eax,0x0
004014D0	movaps xmm1,XMMWORD PTR [ecx]
004014D3	movaps xmm7,xmm3
004014D6	mulps  xmm7,xmm1
004014D9	movaps xmm0,xmm4
004014DC	mulps  xmm0,xmm1
004014DF	movaps xmm2,xmm5
004014E2	mulps  xmm2,xmm1
004014E5	mulps  xmm1,xmm6
004014E8	haddps xmm0,xmm7
004014EC	haddps xmm1,xmm2
004014F0	haddps xmm1,xmm0
004014F4	movaps XMMWORD PTR [edx],xmm1
004014F7	inc    eax
004014F8	add    ecx,0x10
004014FB	add    edx,0x10
004014FE	cmp    ebx,eax
00401500	ja     0x4014d0 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+40>
00401502	pop    ebx
00401503	leave
00401504	ret

Benchmarks
Let's look at the performance numbers. The testing methodology is pretty much exactly the same as for the vectors (you can take a look at the source code, it's all there); I'm just going to post the text output from the program itself (the screenshot approach takes too long). Again, we're measuring single-threaded code, so it's an architecture, clock speed, and memory bandwidth game.
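For reference, the timing loop is shaped roughly like this (an assumed sketch; the actual harness lives in the project source):

```cpp
#include <chrono>
#include <cassert>

// Run a callable `iterations` times and return the average wall-clock
// time per call in milliseconds.
template <typename Fn>
double average_msecs(Fn&& fn, int iterations)
{
	auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < iterations; ++i)
		fn();
	auto stop = std::chrono::steady_clock::now();
	std::chrono::duration<double, std::milli> elapsed = stop - start;
	return elapsed.count() / iterations;
}
```

One caveat with any loop like this: if the compiler can see that the work has no observable effect, it may delete it, so the real benchmark has to consume the results somehow.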

The first machine I ran the test on was equipped with a Pentium E5200, overclocked to 2.75GHz:
MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.099916]
Subtraction benchmark           : average time (msecs) [0.089407]
Transposition benchmark         : average time (msecs) [0.049061]
Scalar mul benchmark            : average time (msecs) [0.057214]
Matrix mul benchmark            : average time (msecs) [0.169970]
Vector mul benchmark            : average time (msecs) [0.049327]
Vector batch mul benchmark      : average time (msecs) [0.036488]

Process returned 0 (0x0)   execution time : 55.196 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.152721]
Subtraction benchmark           : average time (msecs) [0.151200]
Transposition benchmark         : average time (msecs) [0.098960]
Scalar mul benchmark            : average time (msecs) [0.121864]
Matrix mul benchmark            : average time (msecs) [0.470418]
Vector mul benchmark            : average time (msecs) [0.117321]
Vector batch mul benchmark      : average time (msecs) [0.118026]

Process returned 0 (0x0)   execution time : 123.101 s
Press any key to continue.

The second machine was my good old (really old) laptop, equipped with a Pentium Dual T3400:
MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.623972]
Subtraction benchmark           : average time (msecs) [0.620934]
Transposition benchmark         : average time (msecs) [0.070971]
Scalar mul benchmark            : average time (msecs) [0.423544]
Matrix mul benchmark            : average time (msecs) [0.641712]
Vector mul benchmark            : average time (msecs) [0.108508]
Vector batch mul benchmark      : average time (msecs) [0.063231]

Process returned 0 (0x0)   execution time : 256.348 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.634588]
Subtraction benchmark           : average time (msecs) [0.632229]
Transposition benchmark         : average time (msecs) [0.097166]
Scalar mul benchmark            : average time (msecs) [0.439894]
Matrix mul benchmark            : average time (msecs) [0.676615]
Vector mul benchmark            : average time (msecs) [0.173299]
Vector batch mul benchmark      : average time (msecs) [0.152087]

Process returned 0 (0x0)   execution time : 281.617 s
Press any key to continue.


These results are pretty interesting: it appears the calculations become memory bandwidth limited on this machine. The SSE code is always ahead, but not by much; the only really notable speedups are in the Vector mul and Vector batch mul results.

The third machine was a Core 2 Quad Q8400:

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.089563]
Subtraction benchmark           : average time (msecs) [0.090273]
Transposition benchmark         : average time (msecs) [0.050641]
Scalar mul benchmark            : average time (msecs) [0.060286]
Matrix mul benchmark            : average time (msecs) [0.172725]
Vector mul benchmark            : average time (msecs) [0.051047]
Vector batch mul benchmark      : average time (msecs) [0.037580]

Process returned 0 (0x0)   execution time : 55.302 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.157504]
Subtraction benchmark           : average time (msecs) [0.158742]
Transposition benchmark         : average time (msecs) [0.102287]
Scalar mul benchmark            : average time (msecs) [0.126073]
Matrix mul benchmark            : average time (msecs) [0.484266]
Vector mul benchmark            : average time (msecs) [0.121134]
Vector batch mul benchmark      : average time (msecs) [0.121808]

Process returned 0 (0x0)   execution time : 127.281 s
Press any key to continue.

Conclusion
In line with the vector results, the SSE intrinsics deliver a comfortable speedup, roughly 1.5x to 2.4x across the element-wise operations (laptop results notwithstanding). The batch multiplication and matrix-matrix multiplication see closer to 3x, which is promising.

The source code for this article can be downloaded here:
https://hotfile.com/dl/167284774/730f206/MathLib.rar.html