Wednesday 17 October 2012

Simple Tile Based Volume Lighting In Toxic Bunny

Introduction
My company (Celestial Games) recently released Toxic Bunny HD, an old-school 2D platforming-shooter remake about a crazy bunny called Toxic and his quest for revenge. You can read about it here:

http://www.toxicbunny.co.za/

Towards the end of the project I had a little time to try and squeeze in one of the features I had been wanting for a long time: dynamic lighting.

Whilst not a complicated effect to achieve normally, it becomes more challenging in the context of "We wrote the game in Java using the Java2D API". What we found during development was that the Java2D API was not well suited to the kinds of graphics workloads we were placing on it, and this was never more true than when we looked at dynamic lighting. A few onerous things about the API (for games) are:

- Lack of useful blending modes (no multiplicative blending, no additive blending).
- Absolutely NO shading support whatsoever.
- Modifying images can result in the API marking them as unaccelerated, destroying performance.
- Bloated implicit object construction for image (sprite) drawing.
- Convolution operations provided by the API are slow.

Whilst we were able to overcome most of these in various ways, they certainly left us with the strong impression that we had chosen the wrong API to develop a game on. This was a direct result of the game starting as a weekend experiment; it wasn't considered a "serious" product until after we had laid out the foundations. It certainly forced us to push the API further than we thought we could.

Goals
In order for the dynamic lighting to be used in the game, it had to look good and not abuse our frame budget.
Toxic Bunny HD runs at a tick rate of 24Hz, which means our frame budget is, conservatively, 40 milliseconds (1000ms / 24 ticks ≈ 41.7ms). Compared to, say, a 60Hz 16 millisecond game, that's an absolutely enormous frame budget, which worked in our favor as a 2D game (a high frame rate doesn't mean as much in a 2D game environment as in a 3D one).

So, I wanted a lighting pipeline that would definitely push a high-end machine at the highest quality setting, but would run at around 5ms on the lowest quality setting.

The Lighting Pipeline
The lighting pipeline, at its most conceptual level, looks like this:

For every light:
Step 1: Gather tiles around the light source.
From a collision detection perspective, our playing area is divided up into 16x16 tiles, and it is these tiles that perform the role of light occluders in the lighting system. Every 16x16 tile is marked with a specific collision type, whether it be empty space, a slope of some angle, or a solid (fully filled) space. The lighting system is only concerned with the solid tiles: for every light, it extracts the solid tiles that fall within the light's radius. A sketch of this gather step follows the screenshot below.

Solid collision tiles for the map are highlighted in red here. 
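
To make the gather concrete, here is a minimal sketch of it (in C++ for consistency with the code in the later articles; the game itself is Java, and every name here is hypothetical rather than actual game code):

#include <algorithm>
#include <utility>
#include <vector>

const int TILE_SIZE = 16;

enum TileType { TILE_EMPTY, TILE_SLOPE, TILE_SOLID };

struct Light { float x, y, radius; };

struct TileMap
{
	int width, height;                 // map dimensions, in tiles
	std::vector<TileType> tiles;       // row-major collision grid
	TileType at(int tx, int ty) const { return tiles[ty * width + tx]; }
};

// Collect the solid tiles whose bounding square overlaps the light's radius.
std::vector<std::pair<int, int>> gatherSolidTiles(const Light& light, const TileMap& map)
{
	std::vector<std::pair<int, int>> solid;

	int minTx = std::max(0,              (int)((light.x - light.radius) / TILE_SIZE));
	int maxTx = std::min(map.width  - 1, (int)((light.x + light.radius) / TILE_SIZE));
	int minTy = std::max(0,              (int)((light.y - light.radius) / TILE_SIZE));
	int maxTy = std::min(map.height - 1, (int)((light.y + light.radius) / TILE_SIZE));

	for (int ty = minTy; ty <= maxTy; ty++)
		for (int tx = minTx; tx <= maxTx; tx++)
			if (map.at(tx, ty) == TILE_SOLID)
				solid.push_back(std::make_pair(tx, ty));

	return solid;
}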

Step 2: Use gathered tiles to generate half-spaces.
We then process the gathered collision tiles one at a time. We calculate the light's position relative to the center of the tile (working in tile-space, so to speak), and use the relative position and the rectangular features of the tile to generate a set of half-spaces. If a particular pixel falls behind all of the generated half-spaces for a given tile, it is occluded by that tile and therefore not shaded. If it is in front of at least one of the half-spaces, we run the standard point light attenuation equation on it and additively blend it into the back (lighting) buffer (I'll explain why this is possible in Java2D later). A sketch of the occlusion test follows the diagram below.

My quick diagram of how the half spaces are generated. There are 9 regions around the tile that the light can be in.
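
To illustrate the idea, here is a minimal sketch of a 2D half-space and the per-pixel occlusion test it implies (again hypothetical C++, not the game's Java code; which corners you feed into it depends on which of the 9 regions the light occupies):

#include <vector>

struct HalfSpace { float a, b, c; };	// the pixel is on the lit side when a*x + b*y + c > 0

// Build the half-space whose boundary line runs from (x0, y0) to (x1, y1).
HalfSpace halfSpaceThrough(float x0, float y0, float x1, float y1)
{
	HalfSpace h;
	h.a = y1 - y0;				// normal = perpendicular of the edge direction
	h.b = x0 - x1;
	h.c = -(h.a * x0 + h.b * y0);
	return h;
}

// A pixel is occluded by a tile when it sits behind ALL of that tile's half-spaces.
bool occludedByTile(const std::vector<HalfSpace>& tileHalfSpaces, float px, float py)
{
	for (size_t i = 0; i < tileHalfSpaces.size(); i++)
	{
		const HalfSpace& h = tileHalfSpaces[i];
		if (h.a * px + h.b * py + h.c > 0.0f)
			return false;		// in front of at least one plane, so lit
	}
	return true;
}

For a light directly to the left of a tile, for example, the set would contain the tile's left edge plus the two lines running from the light through the tile's top-left and bottom-left corners.
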
Step 3: Shade pixels.
For every pixel in the back buffer, run through each tile's generated half-spaces and do the visibility test.

After all that's done, and you have your lighting contribution to the scene in a buffer, add its contribution to the scene.
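
The "standard point light attenuation equation" isn't spelled out in this post, so as a stand-in here is a common clamped falloff and the additive write into the lighting buffer (a hypothetical sketch, reusing the Light struct from the gather sketch above):

#include <algorithm>
#include <cmath>

// A typical point-light falloff: full intensity at the light, zero at the radius.
float attenuate(float distance, float radius)
{
	float t = 1.0f - std::min(distance / radius, 1.0f);
	return t * t;
}

// Additively accumulate one light's contribution into an 8-bit lighting buffer.
void shadePixel(unsigned char* lightingBuffer, int index,
                float px, float py, const Light& light)
{
	float dx   = px - light.x;
	float dy   = py - light.y;
	float dist = std::sqrt(dx * dx + dy * dy);

	int contribution = (int)(255.0f * attenuate(dist, light.radius));
	int sum          = lightingBuffer[index] + contribution;
	lightingBuffer[index] = (unsigned char)std::min(sum, 255);	// saturate instead of wrapping
}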

Note:
The lighting buffer is at a much lower resolution than the screen for performance reasons: we blur the lighting buffer a few times and then stretch it over the screen using bilinear filtering to produce a coarse lighting result. In addition to the huge performance savings, which make the algorithm feasible at all, this looks more volumetric than the high resolution version, which is desirable.

The Implementation Evolution
In order to implement this system we required per-pixel access and additive blending, which proved too slow, too cumbersome, or outright impossible with the existing API. Instead we implemented a software-based system that rendered everything on the system side using the CPU only, then sent the results over the bus to the graphics card and merged them with the existing back buffer.

When I first implemented the naive algorithm, the performance was upwards of 60 milliseconds for a 64x64 pixel lighting buffer, with one TINY light and without the convolution used to blur the lighting buffer (!). Clearly unacceptable. So it was time to put on the optimization hat and dig in.

Optimizing the half-space count
Profiling the code, one of the obvious bottlenecks was the number of half-space sets being generated: one set for each 16x16 pixel tile. There are a few things I implemented to reduce the total tile count and therefore the half-space count.

The first thing was to process the tiles in two sets: the first set consisted of tiles above the light position on the screen, and the second of tiles below it. The first set was processed from the light position heading upward (the first tiles processed were in the row just above the light and the last were at the top of the screen), and vice versa for the tiles below the light.

The tiles were processed in rows. Why? This was to essentially merge the tiles together, reducing the tile count. If two or more tiles were adjacent in a row (without any gaps), they were merged into one large horizontal tile and sent off for further processing. This further processing involved testing the new tile against the previously generated tiles' half-spaces. If it wasn't visible, then its half-space contribution would be irrelevant to the pixel occlusion stage, and it would be ignored. If it was visible in relation to the previously accepted tiles, then we would run a vertical merge stage: if two large tiles had the same position and width and were exactly one row apart, they were merged together. Both merges are sketched below.
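
Here is a rough sketch of the two merge passes (hypothetical code; the visibility test against previously accepted tiles is elided, and the vertical comparison is shown for the downward set, so the upward set flips it):

struct TileRect { int x, y, w, h; };	// position and size, in tile units

// Horizontal merge: collapse each run of adjacent solid tiles in one row into one rectangle.
// solidColumns holds the sorted x coordinates of the solid tiles in this row.
std::vector<TileRect> mergeRow(const std::vector<int>& solidColumns, int row)
{
	std::vector<TileRect> runs;
	size_t i = 0;
	while (i < solidColumns.size())
	{
		size_t j = i + 1;
		while (j < solidColumns.size() && solidColumns[j] == solidColumns[j - 1] + 1)
			j++;
		TileRect r = { solidColumns[i], row, solidColumns[j - 1] - solidColumns[i] + 1, 1 };
		runs.push_back(r);
		i = j;
	}
	return runs;
}

// Vertical merge: a run that lines up exactly with an already-accepted rectangle
// in the previous row is folded into it instead of generating new half-spaces.
void mergeVertically(std::vector<TileRect>& accepted, const TileRect& run)
{
	for (size_t i = 0; i < accepted.size(); i++)
	{
		TileRect& r = accepted[i];
		if (r.x == run.x && r.w == run.w && run.y == r.y + r.h)
		{
			r.h += 1;
			return;
		}
	}
	accepted.push_back(run);
}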

These steps drastically reduced the number of tiles we needed to generate half-spaces for. But while this increased the speed, we were still very far off our target.

Optimizing the rasterization step
The second thing was the actual pixel processing. The light sources are generally much smaller than the entire screen, so processing each individual pixel was an unnecessary waste of time. Instead, a coarse tile rasterization system was implemented. In this system the back buffer was divided up into blocks of pixels (8x8 pixel blocks for the high spec setting), and each of these was tested against the half-spaces. If a block was entirely occluded, we discarded all of the pixels inside it. If it was entirely inside the visible area, we trivially computed the lighting equation for all pixels inside it. Only when a block was partially obscured by the half-spaces were the individual pixels considered.
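
A sketch of that block classification, reusing the HalfSpace and occludedByTile helpers from the earlier sketch (hypothetical code):

enum BlockResult { BLOCK_CULLED, BLOCK_ACCEPTED, BLOCK_PARTIAL };

// Classify a square pixel block against one tile's half-spaces using its four corners.
BlockResult classifyBlock(const std::vector<HalfSpace>& hs, float x, float y, float size)
{
	const float cx[4] = { x, x + size, x,        x + size };
	const float cy[4] = { y, y,        y + size, y + size };

	// Trivial accept: if all four corners are in front of the SAME half-space,
	// the whole (convex) block is in front of that plane and therefore lit.
	for (size_t i = 0; i < hs.size(); i++)
	{
		int inFront = 0;
		for (int c = 0; c < 4; c++)
			if (hs[i].a * cx[c] + hs[i].b * cy[c] + hs[i].c > 0.0f)
				inFront++;
		if (inFront == 4)
			return BLOCK_ACCEPTED;
	}

	// Trivial reject: the occluded region (behind every plane) is convex, so if
	// all four corners are inside it, the whole block is shadowed.
	int occluded = 0;
	for (int c = 0; c < 4; c++)
		if (occludedByTile(hs, cx[c], cy[c]))
			occluded++;
	if (occluded == 4)
		return BLOCK_CULLED;

	return BLOCK_PARTIAL;	// fall back to per-pixel testing
}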

Again, we saw a huge speed improvement. But again, we were still off target. Specifically, the rasterization and lighting buffer blurring steps were still taking too long.

Utilizing the power of multi-core processors
At this point, with no more algorithmic optimizations to try out, we turned our focus to hardware utilization.

We implemented a job based threading system that allowed us to parallelize the workload across up to 16 processors. On implementing this for the rasterization stage, we reached our performance target. Finally.
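
The game's job system is Java, but as a rough illustration of the idea in C++, here is a thread-per-strip split of the rasterization work (hypothetical and heavily simplified; a real job system reuses worker threads instead of spawning them every frame):

#include <algorithm>
#include <thread>
#include <vector>

// Split the lighting buffer into horizontal strips and rasterize them in parallel.
void rasterizeParallel(int bufferHeight, int workerCount,
                       void (*rasterizeRows)(int firstRow, int lastRow))
{
	std::vector<std::thread> workers;
	int rowsPerJob = (bufferHeight + workerCount - 1) / workerCount;

	for (int w = 0; w < workerCount; w++)
	{
		int first = w * rowsPerJob;
		int last  = std::min(bufferHeight, first + rowsPerJob);
		if (first >= last)
			break;
		workers.push_back(std::thread(rasterizeRows, first, last));	// one strip per worker
	}

	for (size_t i = 0; i < workers.size(); i++)
		workers[i].join();	// wait for every strip before compositing
}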

One stage of the pipeline was still taking too damn long though: the blurring of the lighting buffer. This stage we initially left up to the Java2D convolution API. But that sucked. So we reimplemented it in software (seeing a pattern here?) using our job system. With these optimizations in place, we had reached our goal.
It certainly wasn't easy, but where the API failed us, we wrote our own functionality. We did this in numerous areas throughout the code base.

Visualization of the final optimizations. The cyan tiles are the above light source tiles, the purple ones are the below tiles. On the lower right you can see the coarse rasterization algorithm at work. Red tiles are culled, green are trivially accepted and yellow are partially culled.

There were still problems that proved unsolvable, to be fair, but we could live with them. One thing that remains a (really annoying) bottleneck is transferring the back buffer to an API construct and rendering it over the screen with bilinear filtering. Doing that, and only that, on my development machine still consumes around 1 millisecond (!!!).

Monday 17 September 2012

SIMD Math Library Part 2: SSE With Matrices

Introduction
All right everyone, it's time to look at matrices. But first, a slight digression, a story if you will:
As I sat down to continue writing the SSE code one night, a question popped into my head, "Why the hell am I trying to optimize this function? I mean, I've used this particular operation maybe once or twice, and never where the performance actually mattered...". And there's a very important point there.

I am a terribly lazy programmer.

But more than that, it's okay that I am, because it's completely irrelevant from a performance perspective whether that function that accounts for maybe 0.01% of my run-time cost is optimized. Now of course there are probably a fair few people who may have need for the functions I haven't optimized (pretty much only the matrix determinant and inverse functions). My apologies for that; if I ever have some free time on my hands I'll probably go through and implement them. But there are cooler, more relevant things to dedicate my time to right now (you'll see in future articles ;) ).

Well, if I'm only going to optimize things that truly matter (to me), then what did I optimize exactly?
There are two main types of operation that I focused on:
- Matrix-matrix multiplication.
- Matrix-vector multiplication (i.e. vector transformation).

Of these, the second is of particular importance if you're doing software pipelines for vertex processing and the like. To that end I wrote a batch multiply which transforms an entire array of vectors as efficiently as possible.

Now that that's out of the way, let's get to the code comparisons!

Matrix Addition:
MATHLIB_INLINE void matrix4x4_add(const matrix4x4& matrix1, const matrix4x4& matrix2, matrix4x4& result)
/// Performs a matrix addition and stores the matrix sum in result.
/// Corresponds to the mathematical statement result = matrix1 + matrix2
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	result.sse_row0 = _mm_add_ps(matrix1.sse_row0, matrix2.sse_row0);
	result.sse_row1 = _mm_add_ps(matrix1.sse_row1, matrix2.sse_row1);
	result.sse_row2 = _mm_add_ps(matrix1.sse_row2, matrix2.sse_row2);
	result.sse_row3 = _mm_add_ps(matrix1.sse_row3, matrix2.sse_row3);

#endif // (MATHLIB_SSE)
#else

	result._03  = matrix1._03 + matrix2._03;
	result._02  = matrix1._02 + matrix2._02;
	result._01  = matrix1._01 + matrix2._01;
	result._00  = matrix1._00 + matrix2._00;

	result._13  = matrix1._13 + matrix2._13;
	result._12  = matrix1._12 + matrix2._12;
	result._11  = matrix1._11 + matrix2._11;
	result._10  = matrix1._10 + matrix2._10;

	result._23  = matrix1._23 + matrix2._23;
	result._22  = matrix1._22 + matrix2._22;
	result._21  = matrix1._21 + matrix2._21;
	result._20  = matrix1._20 + matrix2._20;

	result._33  = matrix1._33 + matrix2._33;
	result._32  = matrix1._32 + matrix2._32;
	result._31  = matrix1._31 + matrix2._31;
	result._30  = matrix1._30 + matrix2._30;

#endif // (MATHLIB_SIMD)
}
Yeah, I know what you're all thinking: "What the hell is up with the addition order in the scalar version?". Well, it's a quirky thing, but I store the x element of a vector (or the first element of a matrix row) in the highest word (32 bits in this case) of an SSE vector. This means that the x element of a vector is actually stored at the highest address in memory of the 128-bit vector, because little-endian platforms like x86 store higher-order bytes at higher memory addresses. So, basically, to keep the memory access pattern the same as usual, I end up doing the scalar operations in reverse.
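
To make the convention concrete, here is a sketch of what such a layout looks like (this mirrors the convention described above, not necessarily the library's literal declarations; anonymous structs in unions are a common compiler extension):

#include <xmmintrin.h>

// One vector / one matrix row. Fields are declared lowest-address-first, so the
// x element (_00 in the matrix naming) lands in the highest-addressed float,
// i.e. in lane 3 of the __m128.
union vector4_storage
{
	__m128 sse_vec;
	struct { float w, z, y, x; };
};

A consequence is that _MM_SHUFFLE(3, 3, 3, 3) broadcasts the x element, which is why the matrix multiply below starts each row with exactly that shuffle.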

Here's the assembly output:
FPU
004017EA push   ebp
004017EB mov    ebp,esp
004017ED mov    ecx,DWORD PTR [ebp+0x8]
004017F0 mov    edx,DWORD PTR [ebp+0xc]
004017F3 mov    eax,DWORD PTR [ebp+0x10]
004017F6 fld    DWORD PTR [ecx]
004017F8 fadd   DWORD PTR [edx]
004017FA fstp   DWORD PTR [eax]
004017FC fld    DWORD PTR [ecx+0x4]
004017FF fadd   DWORD PTR [edx+0x4]
00401802 fstp   DWORD PTR [eax+0x4]
00401805 fld    DWORD PTR [ecx+0x8]
00401808 fadd   DWORD PTR [edx+0x8]
0040180B fstp   DWORD PTR [eax+0x8]
0040180E fld    DWORD PTR [ecx+0xc]
00401811 fadd   DWORD PTR [edx+0xc]
00401814 fstp   DWORD PTR [eax+0xc]
00401817 fld    DWORD PTR [ecx+0x10]
0040181A fadd   DWORD PTR [edx+0x10]
0040181D fstp   DWORD PTR [eax+0x10]
00401820 fld    DWORD PTR [ecx+0x14]
00401823 fadd   DWORD PTR [edx+0x14]
00401826 fstp   DWORD PTR [eax+0x14]
00401829 fld    DWORD PTR [ecx+0x18]
0040182C fadd   DWORD PTR [edx+0x18]
0040182F fstp   DWORD PTR [eax+0x18]
00401832 fld    DWORD PTR [ecx+0x1c]
00401835 fadd   DWORD PTR [edx+0x1c]
00401838 fstp   DWORD PTR [eax+0x1c]
0040183B fld    DWORD PTR [ecx+0x20]
0040183E fadd   DWORD PTR [edx+0x20]
00401841 fstp   DWORD PTR [eax+0x20]
00401844 fld    DWORD PTR [ecx+0x24]
00401847 fadd   DWORD PTR [edx+0x24]
0040184A fstp   DWORD PTR [eax+0x24]
0040184D fld    DWORD PTR [ecx+0x28]
00401850 fadd   DWORD PTR [edx+0x28]
00401853 fstp   DWORD PTR [eax+0x28]
00401856 fld    DWORD PTR [ecx+0x2c]
00401859 fadd   DWORD PTR [edx+0x2c]
0040185C fstp   DWORD PTR [eax+0x2c]
0040185F fld    DWORD PTR [ecx+0x30]
00401862 fadd   DWORD PTR [edx+0x30]
00401865 fstp   DWORD PTR [eax+0x30]
00401868 fld    DWORD PTR [ecx+0x34]
0040186B fadd   DWORD PTR [edx+0x34]
0040186E fstp   DWORD PTR [eax+0x34]
00401871 fld    DWORD PTR [ecx+0x38]
00401874 fadd   DWORD PTR [edx+0x38]
00401877 fstp   DWORD PTR [eax+0x38]
0040187A fld    DWORD PTR [ecx+0x3c]
0040187D fadd   DWORD PTR [edx+0x3c]
00401880 fstp   DWORD PTR [eax+0x3c]
00401883 pop    ebp
00401884 ret

SSE
004016AB push   ebp
004016AC mov    ebp,esp
004016AE mov    ecx,DWORD PTR [ebp+0x8]
004016B1 mov    edx,DWORD PTR [ebp+0xc]
004016B4 mov    eax,DWORD PTR [ebp+0x10]
004016B7 movaps xmm0,XMMWORD PTR [ecx]
004016BA addps  xmm0,XMMWORD PTR [edx]
004016BD movaps XMMWORD PTR [eax],xmm0
004016C0 movaps xmm0,XMMWORD PTR [ecx+0x10]
004016C4 addps  xmm0,XMMWORD PTR [edx+0x10]
004016C8 movaps XMMWORD PTR [eax+0x10],xmm0
004016CC movaps xmm0,XMMWORD PTR [ecx+0x20]
004016D0 addps  xmm0,XMMWORD PTR [edx+0x20]
004016D4 movaps XMMWORD PTR [eax+0x20],xmm0
004016D8 movaps xmm0,XMMWORD PTR [ecx+0x30]
004016DC addps  xmm0,XMMWORD PTR [edx+0x30]
004016E0 movaps XMMWORD PTR [eax+0x30],xmm0
004016E4 leave
004016E5 ret
SSE again gets points for compactness. Subtraction is basically the same, so I won't waste valuable space by showing all of the assembly for it.

Matrix Scalar Multiplication:
MATHLIB_INLINE void matrix4x4_scalarMul(const matrix4x4& matrix, float scalar, matrix4x4& result)
/// Performs a matrix-scalar multiplication and stores the scaled matrix in result.
/// Corresponds to the mathematical statement result = matrix * scalar
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

    __m128 scaleVec = _mm_load1_ps(&scalar);
    result.sse_row0 = _mm_mul_ps(matrix.sse_row0, scaleVec);
    result.sse_row1 = _mm_mul_ps(matrix.sse_row1, scaleVec);
    result.sse_row2 = _mm_mul_ps(matrix.sse_row2, scaleVec);
    result.sse_row3 = _mm_mul_ps(matrix.sse_row3, scaleVec);

#endif // (MATHLIB_SSE)
#else

    result._03  = matrix._03 * scalar;
    result._02  = matrix._02 * scalar;
    result._01  = matrix._01 * scalar;
    result._00  = matrix._00 * scalar;

    result._13  = matrix._13 * scalar;
    result._12  = matrix._12 * scalar;
    result._11  = matrix._11 * scalar;
    result._10  = matrix._10 * scalar;

    result._23  = matrix._23 * scalar;
    result._22  = matrix._22 * scalar;
    result._21  = matrix._21 * scalar;
    result._20  = matrix._20 * scalar;

    result._33  = matrix._33 * scalar;
    result._32  = matrix._32 * scalar;
    result._31  = matrix._31 * scalar;
    result._30  = matrix._30 * scalar;

#endif // (MATHLIB_SIMD)
}

FPU
0040192D push   ebp
0040192E mov    ebp,esp
00401930 mov    edx,DWORD PTR [ebp+0x8]
00401933 fld    DWORD PTR [ebp+0xc]
00401936 mov    eax,DWORD PTR [ebp+0x10]
00401939 fld    st(0)
0040193B fmul   DWORD PTR [edx]
0040193D fstp   DWORD PTR [eax]
0040193F fld    st(0)
00401941 fmul   DWORD PTR [edx+0x4]
00401944 fstp   DWORD PTR [eax+0x4]
00401947 fld    st(0)
00401949 fmul   DWORD PTR [edx+0x8]
0040194C fstp   DWORD PTR [eax+0x8]
0040194F fld    st(0)
00401951 fmul   DWORD PTR [edx+0xc]
00401954 fstp   DWORD PTR [eax+0xc]
00401957 fld    st(0)
00401959 fmul   DWORD PTR [edx+0x10]
0040195C fstp   DWORD PTR [eax+0x10]
0040195F fld    st(0)
00401961 fmul   DWORD PTR [edx+0x14]
00401964 fstp   DWORD PTR [eax+0x14]
00401967 fld    st(0)
00401969 fmul   DWORD PTR [edx+0x18]
0040196C fstp   DWORD PTR [eax+0x18]
0040196F fld    st(0)
00401971 fmul   DWORD PTR [edx+0x1c]
00401974 fstp   DWORD PTR [eax+0x1c]
00401977 fld    st(0)
00401979 fmul   DWORD PTR [edx+0x20]
0040197C fstp   DWORD PTR [eax+0x20]
0040197F fld    st(0)
00401981 fmul   DWORD PTR [edx+0x24]
00401984 fstp   DWORD PTR [eax+0x24]
00401987 fld    st(0)
00401989 fmul   DWORD PTR [edx+0x28]
0040198C fstp   DWORD PTR [eax+0x28]
0040198F fld    st(0)
00401991 fmul   DWORD PTR [edx+0x2c]
00401994 fstp   DWORD PTR [eax+0x2c]
00401997 fld    st(0)
00401999 fmul   DWORD PTR [edx+0x30]
0040199C fstp   DWORD PTR [eax+0x30]
0040199F fld    st(0)
004019A1 fmul   DWORD PTR [edx+0x34]
004019A4 fstp   DWORD PTR [eax+0x34]
004019A7 fld    st(0)
004019A9 fmul   DWORD PTR [edx+0x38]
004019AC fstp   DWORD PTR [eax+0x38]
004019AF fmul   DWORD PTR [edx+0x3c]
004019B2 fstp   DWORD PTR [eax+0x3c]
004019B5 leave
004019B6 ret

SSE
00401538 push   ebp
00401539 mov    ebp,esp
0040153B mov    edx,DWORD PTR [ebp+0x8]
0040153E mov    eax,DWORD PTR [ebp+0x10]
00401541 movss  xmm0,DWORD PTR [ebp+0xc]
00401546 shufps xmm0,xmm0,0x0
0040154A movaps xmm1,xmm0
0040154D mulps  xmm1,XMMWORD PTR [edx]
00401550 movaps XMMWORD PTR [eax],xmm1
00401553 movaps xmm1,xmm0
00401556 mulps  xmm1,XMMWORD PTR [edx+0x10]
0040155A movaps XMMWORD PTR [eax+0x10],xmm1
0040155E movaps xmm1,xmm0
00401561 mulps  xmm1,XMMWORD PTR [edx+0x20]
00401565 movaps XMMWORD PTR [eax+0x20],xmm1
00401569 mulps  xmm0,XMMWORD PTR [edx+0x30]
0040156D movaps XMMWORD PTR [eax+0x30],xmm0
00401571 leave
00401572 ret

Matrix Matrix Multiplication: 
This is an important function, as well as a large one. Fortunately, it SIMD-izes pretty darn well.
MATHLIB_INLINE void matrix4x4_mul(const matrix4x4& matrix1, const matrix4x4& matrix2, matrix4x4& result)
/// Performs a matrix-matrix multiplication and stores the product in result.
/// Corresponds to the mathematical statement result = matrix1 * matrix2
/// @warning The three matrices must be distinct; the result will be incorrect if result is the same object as matrix1 or matrix2.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	__m128 row0;
	__m128 row1;
	__m128 row2;
	__m128 row3;
	__m128 value0;

	// Compute row 0 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(3, 3, 3, 3));
	row0 	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(2, 2, 2, 2));
	row0	= _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(1, 1, 1, 1));
	row0	= _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(0, 0, 0, 0));
	row0	= _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row3, value0));

	// Compute row 1 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(3, 3, 3, 3));
	row1 	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(2, 2, 2, 2));
	row1	= _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(1, 1, 1, 1));
	row1	= _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(0, 0, 0, 0));
	row1	= _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row3, value0));

	// Compute row 2 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(3, 3, 3, 3));
	row2	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(2, 2, 2, 2));
	row2	= _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(1, 1, 1, 1));
	row2	= _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(0, 0, 0, 0));
	row2	= _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row3, value0));

	// Compute row 3 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(3, 3, 3, 3));
	row3 	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(2, 2, 2, 2));
	row3	= _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(1, 1, 1, 1));
	row3	= _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(0, 0, 0, 0));
	row3	= _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row3, value0));

	// Write results back to memory:
	result.sse_row0	= row0;
	result.sse_row1	= row1;
	result.sse_row2	= row2;
	result.sse_row3	= row3;

#endif // (MATHLIB_SSE)
#else

	result._03	= (matrix1._03 * matrix2._33) + (matrix1._02 * matrix2._23) + (matrix1._01 * matrix2._13) + (matrix1._00 * matrix2._03);
	result._02	= (matrix1._03 * matrix2._32) + (matrix1._02 * matrix2._22) + (matrix1._01 * matrix2._12) + (matrix1._00 * matrix2._02);
	result._01	= (matrix1._03 * matrix2._31) + (matrix1._02 * matrix2._21) + (matrix1._01 * matrix2._11) + (matrix1._00 * matrix2._01);
	result._00	= (matrix1._03 * matrix2._30) + (matrix1._02 * matrix2._20) + (matrix1._01 * matrix2._10) + (matrix1._00 * matrix2._00);

	result._13	= (matrix1._13 * matrix2._33) + (matrix1._12 * matrix2._23) + (matrix1._11 * matrix2._13) + (matrix1._10 * matrix2._03);
	result._12	= (matrix1._13 * matrix2._32) + (matrix1._12 * matrix2._22) + (matrix1._11 * matrix2._12) + (matrix1._10 * matrix2._02);
	result._11	= (matrix1._13 * matrix2._31) + (matrix1._12 * matrix2._21) + (matrix1._11 * matrix2._11) + (matrix1._10 * matrix2._01);
	result._10	= (matrix1._13 * matrix2._30) + (matrix1._12 * matrix2._20) + (matrix1._11 * matrix2._10) + (matrix1._10 * matrix2._00);

	result._23	= (matrix1._23 * matrix2._33) + (matrix1._22 * matrix2._23) + (matrix1._21 * matrix2._13) + (matrix1._20 * matrix2._03);
	result._22	= (matrix1._23 * matrix2._32) + (matrix1._22 * matrix2._22) + (matrix1._21 * matrix2._12) + (matrix1._20 * matrix2._02);
	result._21	= (matrix1._23 * matrix2._31) + (matrix1._22 * matrix2._21) + (matrix1._21 * matrix2._11) + (matrix1._20 * matrix2._01);
	result._20	= (matrix1._23 * matrix2._30) + (matrix1._22 * matrix2._20) + (matrix1._21 * matrix2._10) + (matrix1._20 * matrix2._00);

	result._33	= (matrix1._33 * matrix2._33) + (matrix1._32 * matrix2._23) + (matrix1._31 * matrix2._13) + (matrix1._30 * matrix2._03);
	result._32	= (matrix1._33 * matrix2._32) + (matrix1._32 * matrix2._22) + (matrix1._31 * matrix2._12) + (matrix1._30 * matrix2._02);
	result._31	= (matrix1._33 * matrix2._31) + (matrix1._32 * matrix2._21) + (matrix1._31 * matrix2._11) + (matrix1._30 * matrix2._01);
	result._30	= (matrix1._33 * matrix2._30) + (matrix1._32 * matrix2._20) + (matrix1._31 * matrix2._10) + (matrix1._30 * matrix2._00);

#endif // (MATHLIB_SIMD)
}

FPU
004019B7 push   ebp
004019B8 mov    ebp,esp
004019BA mov    edx,DWORD PTR [ebp+0x8]
004019BD mov    eax,DWORD PTR [ebp+0xc]
004019C0 mov    ecx,DWORD PTR [ebp+0x10]
004019C3 fld    DWORD PTR [edx]
004019C5 fmul   DWORD PTR [eax+0x30]
004019C8 fld    DWORD PTR [edx+0x4]
004019CB fld    st(0)
004019CD fmul   DWORD PTR [eax+0x20]
004019D0 faddp  st(2),st
004019D2 fld    DWORD PTR [edx+0x8]
004019D5 fld    st(0)
004019D7 fmul   DWORD PTR [eax+0x10]
004019DA faddp  st(3),st
004019DC fld    DWORD PTR [edx+0xc]
004019DF fld    st(0)
004019E1 fmul   DWORD PTR [eax]
004019E3 faddp  st(4),st
004019E5 fxch   st(3)
004019E7 fstp   DWORD PTR [ecx]
004019E9 fld    DWORD PTR [edx]
004019EB fld    st(0)
004019ED fmul   DWORD PTR [eax+0x34]
004019F0 fxch   st(3)
004019F2 fmul   DWORD PTR [eax+0x24]
004019F5 faddp  st(3),st
004019F7 fld    st(1)
004019F9 fmul   DWORD PTR [eax+0x14]
004019FC faddp  st(3),st
004019FE fld    st(3)
00401A00 fmul   DWORD PTR [eax+0x4]
00401A03 faddp  st(3),st
00401A05 fxch   st(2)
00401A07 fstp   DWORD PTR [ecx+0x4]
00401A0A fld    st(1)
00401A0C fmul   DWORD PTR [eax+0x38]
00401A0F fld    DWORD PTR [edx+0x4]
00401A12 fld    st(0)
00401A14 fmul   DWORD PTR [eax+0x28]
00401A17 faddp  st(2),st
00401A19 fxch   st(2)
00401A1B fmul   DWORD PTR [eax+0x18]
00401A1E faddp  st(1),st
00401A20 fld    st(3)
00401A22 fmul   DWORD PTR [eax+0x8]
00401A25 faddp  st(1),st
00401A27 fstp   DWORD PTR [ecx+0x8]
00401A2A fxch   st(1)
00401A2C fmul   DWORD PTR [eax+0x3c]
00401A2F fxch   st(1)
00401A31 fmul   DWORD PTR [eax+0x2c]
00401A34 faddp  st(1),st
00401A36 fld    DWORD PTR [edx+0x8]
00401A39 fmul   DWORD PTR [eax+0x1c]
00401A3C faddp  st(1),st
00401A3E fxch   st(1)
00401A40 fmul   DWORD PTR [eax+0xc]
00401A43 faddp  st(1),st
00401A45 fstp   DWORD PTR [ecx+0xc]
00401A48 fld    DWORD PTR [edx+0x10]
00401A4B fmul   DWORD PTR [eax+0x30]
00401A4E fld    DWORD PTR [edx+0x14]
00401A51 fld    st(0)
00401A53 fmul   DWORD PTR [eax+0x20]
00401A56 faddp  st(2),st
00401A58 fld    DWORD PTR [edx+0x18]
00401A5B fld    st(0)
00401A5D fmul   DWORD PTR [eax+0x10]
00401A60 faddp  st(3),st
00401A62 fld    DWORD PTR [edx+0x1c]
00401A65 fld    st(0)
00401A67 fmul   DWORD PTR [eax]
00401A69 faddp  st(4),st
00401A6B fxch   st(3)
00401A6D fstp   DWORD PTR [ecx+0x10]
00401A70 fld    DWORD PTR [edx+0x10]
00401A73 fld    st(0)
00401A75 fmul   DWORD PTR [eax+0x34]
00401A78 fxch   st(3)
00401A7A fmul   DWORD PTR [eax+0x24]
00401A7D faddp  st(3),st
00401A7F fld    st(1)
00401A81 fmul   DWORD PTR [eax+0x14]
00401A84 faddp  st(3),st
00401A86 fld    st(3)
00401A88 fmul   DWORD PTR [eax+0x4]
00401A8B faddp  st(3),st
00401A8D fxch   st(2)
00401A8F fstp   DWORD PTR [ecx+0x14]
00401A92 fld    st(1)
00401A94 fmul   DWORD PTR [eax+0x38]
00401A97 fld    DWORD PTR [edx+0x14]
00401A9A fld    st(0)
00401A9C fmul   DWORD PTR [eax+0x28]
00401A9F faddp  st(2),st
00401AA1 fxch   st(2)
00401AA3 fmul   DWORD PTR [eax+0x18]
00401AA6 faddp  st(1),st
00401AA8 fld    st(3)
00401AAA fmul   DWORD PTR [eax+0x8]
00401AAD faddp  st(1),st
00401AAF fstp   DWORD PTR [ecx+0x18]
00401AB2 fxch   st(1)
00401AB4 fmul   DWORD PTR [eax+0x3c]
00401AB7 fxch   st(1)
00401AB9 fmul   DWORD PTR [eax+0x2c]
00401ABC faddp  st(1),st
00401ABE fld    DWORD PTR [edx+0x18]
00401AC1 fmul   DWORD PTR [eax+0x1c]
00401AC4 faddp  st(1),st
00401AC6 fxch   st(1)
00401AC8 fmul   DWORD PTR [eax+0xc]
00401ACB faddp  st(1),st
00401ACD fstp   DWORD PTR [ecx+0x1c]
00401AD0 fld    DWORD PTR [edx+0x20]
00401AD3 fmul   DWORD PTR [eax+0x30]
00401AD6 fld    DWORD PTR [edx+0x24]
00401AD9 fld    st(0)
00401ADB fmul   DWORD PTR [eax+0x20]
00401ADE faddp  st(2),st
00401AE0 fld    DWORD PTR [edx+0x28]
00401AE3 fld    st(0)
00401AE5 fmul   DWORD PTR [eax+0x10]
00401AE8 faddp  st(3),st
00401AEA fld    DWORD PTR [edx+0x2c]
00401AED fld    st(0)
00401AEF fmul   DWORD PTR [eax]
00401AF1 faddp  st(4),st
00401AF3 fxch   st(3)
00401AF5 fstp   DWORD PTR [ecx+0x20]
00401AF8 fld    DWORD PTR [edx+0x20]
00401AFB fld    st(0)
00401AFD fmul   DWORD PTR [eax+0x34]
00401B00 fxch   st(3)
00401B02 fmul   DWORD PTR [eax+0x24]
00401B05 faddp  st(3),st
00401B07 fld    st(1)
00401B09 fmul   DWORD PTR [eax+0x14]
00401B0C faddp  st(3),st
00401B0E fld    st(3)
00401B10 fmul   DWORD PTR [eax+0x4]
00401B13 faddp  st(3),st
00401B15 fxch   st(2)
00401B17 fstp   DWORD PTR [ecx+0x24]
00401B1A fld    st(1)
00401B1C fmul   DWORD PTR [eax+0x38]
00401B1F fld    DWORD PTR [edx+0x24]
00401B22 fld    st(0)
00401B24 fmul   DWORD PTR [eax+0x28]
00401B27 faddp  st(2),st
00401B29 fxch   st(2)
00401B2B fmul   DWORD PTR [eax+0x18]
00401B2E faddp  st(1),st
00401B30 fld    st(3)
00401B32 fmul   DWORD PTR [eax+0x8]
00401B35 faddp  st(1),st
00401B37 fstp   DWORD PTR [ecx+0x28]
00401B3A fxch   st(1)
00401B3C fmul   DWORD PTR [eax+0x3c]
00401B3F fxch   st(1)
00401B41 fmul   DWORD PTR [eax+0x2c]
00401B44 faddp  st(1),st
00401B46 fld    DWORD PTR [edx+0x28]
00401B49 fmul   DWORD PTR [eax+0x1c]
00401B4C faddp  st(1),st
00401B4E fxch   st(1)
00401B50 fmul   DWORD PTR [eax+0xc]
00401B53 faddp  st(1),st
00401B55 fstp   DWORD PTR [ecx+0x2c]
00401B58 fld    DWORD PTR [edx+0x30]
00401B5B fmul   DWORD PTR [eax+0x30]
00401B5E fld    DWORD PTR [edx+0x34]
00401B61 fld    st(0)
00401B63 fmul   DWORD PTR [eax+0x20]
00401B66 faddp  st(2),st
00401B68 fld    DWORD PTR [edx+0x38]
00401B6B fld    st(0)
00401B6D fmul   DWORD PTR [eax+0x10]
00401B70 faddp  st(3),st
00401B72 fld    DWORD PTR [edx+0x3c]
00401B75 fld    st(0)
00401B77 fmul   DWORD PTR [eax]
00401B79 faddp  st(4),st
00401B7B fxch   st(3)
00401B7D fstp   DWORD PTR [ecx+0x30]
00401B80 fld    DWORD PTR [edx+0x30]
00401B83 fld    st(0)
00401B85 fmul   DWORD PTR [eax+0x34]
00401B88 fxch   st(3)
00401B8A fmul   DWORD PTR [eax+0x24]
00401B8D faddp  st(3),st
00401B8F fld    st(1)
00401B91 fmul   DWORD PTR [eax+0x14]
00401B94 faddp  st(3),st
00401B96 fld    st(3)
00401B98 fmul   DWORD PTR [eax+0x4]
00401B9B faddp  st(3),st
00401B9D fxch   st(2)
00401B9F fstp   DWORD PTR [ecx+0x34]
00401BA2 fld    st(1)
00401BA4 fmul   DWORD PTR [eax+0x38]
00401BA7 fld    DWORD PTR [edx+0x34]
00401BAA fld    st(0)
00401BAC fmul   DWORD PTR [eax+0x28]
00401BAF faddp  st(2),st
00401BB1 fxch   st(2)
00401BB3 fmul   DWORD PTR [eax+0x18]
00401BB6 faddp  st(1),st
00401BB8 fld    st(3)
00401BBA fmul   DWORD PTR [eax+0x8]
00401BBD faddp  st(1),st
00401BBF fstp   DWORD PTR [ecx+0x38]
00401BC2 fxch   st(1)
00401BC4 fmul   DWORD PTR [eax+0x3c]
00401BC7 fxch   st(1)
00401BC9 fmul   DWORD PTR [eax+0x2c]
00401BCC faddp  st(1),st
00401BCE fld    DWORD PTR [edx+0x38]
00401BD1 fmul   DWORD PTR [eax+0x1c]
00401BD4 faddp  st(1),st
00401BD6 fxch   st(1)
00401BD8 fmul   DWORD PTR [eax+0xc]
00401BDB faddp  st(1),st
00401BDD fstp   DWORD PTR [ecx+0x3c]
00401BE0 leave
00401BE1 ret

SSE
0041DCDC push   ebp
0041DCDD mov    ebp,esp
0041DCDF mov    edx,DWORD PTR [ebp+0x8]
0041DCE2 mov    ecx,DWORD PTR [ebp+0xc]
0041DCE5 mov    eax,DWORD PTR [ebp+0x10]
0041DCE8 movaps xmm0,XMMWORD PTR [edx]
0041DCEB movaps xmm6,xmm0
0041DCEE shufps xmm6,xmm0,0xff
0041DCF2 movaps xmm5,XMMWORD PTR [ecx]
0041DCF5 mulps  xmm6,xmm5
0041DCF8 movaps xmm1,xmm0
0041DCFB shufps xmm1,xmm0,0xaa
0041DCFF movaps xmm4,XMMWORD PTR [ecx+0x10]
0041DD03 mulps  xmm1,xmm4
0041DD06 addps  xmm6,xmm1
0041DD09 movaps xmm1,xmm0
0041DD0C shufps xmm1,xmm0,0x55
0041DD10 movaps xmm3,XMMWORD PTR [ecx+0x20]
0041DD14 mulps  xmm1,xmm3
0041DD17 addps  xmm6,xmm1
0041DD1A shufps xmm0,xmm0,0x0
0041DD1E movaps xmm2,XMMWORD PTR [ecx+0x30]
0041DD22 mulps  xmm0,xmm2
0041DD25 addps  xmm6,xmm0
0041DD28 movaps xmm0,XMMWORD PTR [edx+0x10]
0041DD2C movaps xmm7,xmm0
0041DD2F shufps xmm7,xmm0,0xff
0041DD33 mulps  xmm7,xmm5
0041DD36 movaps xmm1,xmm0
0041DD39 shufps xmm1,xmm0,0xaa
0041DD3D mulps  xmm1,xmm4
0041DD40 addps  xmm1,xmm7
0041DD43 movaps xmm7,xmm0
0041DD46 shufps xmm7,xmm0,0x55
0041DD4A mulps  xmm7,xmm3
0041DD4D addps  xmm1,xmm7
0041DD50 shufps xmm0,xmm0,0x0
0041DD54 mulps  xmm0,xmm2
0041DD57 addps  xmm1,xmm0
0041DD5A movaps XMMWORD PTR [eax+0x10],xmm1
0041DD5E movaps xmm1,XMMWORD PTR [edx+0x20]
0041DD62 movaps xmm7,xmm1
0041DD65 shufps xmm7,xmm1,0xff
0041DD69 mulps  xmm7,xmm5
0041DD6C movaps xmm0,xmm1
0041DD6F shufps xmm0,xmm1,0xaa
0041DD73 mulps  xmm0,xmm4
0041DD76 addps  xmm7,xmm0
0041DD79 movaps xmm0,xmm1
0041DD7C shufps xmm0,xmm1,0x55
0041DD80 mulps  xmm0,xmm3
0041DD83 addps  xmm0,xmm7
0041DD86 shufps xmm1,xmm1,0x0
0041DD8A mulps  xmm1,xmm2
0041DD8D addps  xmm1,xmm0
0041DD90 movaps xmm0,XMMWORD PTR [edx+0x30]
0041DD94 movaps xmm7,xmm0
0041DD97 shufps xmm7,xmm0,0xff
0041DD9B mulps  xmm5,xmm7
0041DD9E movaps xmm7,xmm0
0041DDA1 shufps xmm7,xmm0,0xaa
0041DDA5 mulps  xmm4,xmm7
0041DDA8 addps  xmm5,xmm4
0041DDAB movaps xmm4,xmm0
0041DDAE shufps xmm4,xmm0,0x55
0041DDB2 mulps  xmm3,xmm4
0041DDB5 addps  xmm5,xmm3
0041DDB8 shufps xmm0,xmm0,0x0
0041DDBC mulps  xmm2,xmm0
0041DDBF addps  xmm2,xmm5
0041DDC2 movaps XMMWORD PTR [eax],xmm6
0041DDC5 movaps XMMWORD PTR [eax+0x20],xmm1
0041DDC9 movaps XMMWORD PTR [eax+0x30],xmm2
0041DDCD leave
0041DDCE ret
Two things: yes, the SSE code is smaller, but at this point both functions are too large to be inlined by the compiler. Although, interestingly enough, my compiler decided that my FPU code could stay inlined (???)... anyone have an idea why? Regardless, I left the MATHLIB_INLINE macro there, in the hope that the compiler knows best somehow.

Matrix Vector Multiplication [Single]:
MATHLIB_INLINE void matrix4x4_vectorMul(const matrix4x4& matrix, const vector4& vector, vector4& result)
/// Performs a multiplication of the given vector (column matrix) by the given matrix.
/// This method corresponds to the mathematical statement result = matrix * vector, i.e. the vector is a column vector.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	__m128 m0	= _mm_mul_ps(matrix.sse_row0, vector.sse_vec);
	__m128 m1	= _mm_mul_ps(matrix.sse_row1, vector.sse_vec);
	__m128 m2	= _mm_mul_ps(matrix.sse_row2, vector.sse_vec);
	__m128 m3	= _mm_mul_ps(matrix.sse_row3, vector.sse_vec);

	m0	= _mm_hadd_ps(m1, m0);	// pairwise sums: low half from the row1 products, high half from the row0 products
	m2	= _mm_hadd_ps(m3, m2);	// pairwise sums: low half from the row3 products, high half from the row2 products
	m0	= _mm_hadd_ps(m2, m0);	// full dot products; the row0 (x) result lands in the highest lane, matching the storage convention

	result.sse_vec	= m0;

#endif // (MATHLIB_SSE)
#else

	float vectorX	= vector.extractX();
	float vectorY	= vector.extractY();
	float vectorZ	= vector.extractZ();
	float vectorW	= vector.extractW();

	result.setXYZW((matrix._03 * vectorW) + (matrix._02 * vectorZ) + (matrix._01 * vectorY) + (matrix._00 * vectorX),
				   (matrix._13 * vectorW) + (matrix._12 * vectorZ) + (matrix._11 * vectorY) + (matrix._10 * vectorX),
				   (matrix._23 * vectorW) + (matrix._22 * vectorZ) + (matrix._21 * vectorY) + (matrix._20 * vectorX),
				   (matrix._33 * vectorW) + (matrix._32 * vectorZ) + (matrix._31 * vectorY) + (matrix._30 * vectorX));

#endif // (MATHLIB_SIMD)
}

FPU
00401C30	push   ebp
00401C31	mov    ebp,esp
00401C33	mov    eax,DWORD PTR [ebp+0x8]
00401C36	mov    ecx,DWORD PTR [ebp+0xc]
00401C39	mov    edx,DWORD PTR [ebp+0x10]
00401C3C	fld    DWORD PTR [ecx+0xc]
00401C3F	fld    DWORD PTR [ecx+0x8]
00401C42	fld    DWORD PTR [ecx+0x4]
00401C45	fld    DWORD PTR [ecx]
00401C47	fld    st(0)
00401C49	fmul   DWORD PTR [eax+0x30]
00401C4C	fld    st(2)
00401C4E	fmul   DWORD PTR [eax+0x34]
00401C51	faddp  st(1),st
00401C53	fld    st(3)
00401C55	fmul   DWORD PTR [eax+0x38]
00401C58	faddp  st(1),st
00401C5A	fld    st(4)
00401C5C	fmul   DWORD PTR [eax+0x3c]
00401C5F	faddp  st(1),st
00401C61	fld    st(1)
00401C63	fmul   DWORD PTR [eax+0x20]
00401C66	fld    st(3)
00401C68	fmul   DWORD PTR [eax+0x24]
00401C6B	faddp  st(1),st
00401C6D	fld    st(4)
00401C6F	fmul   DWORD PTR [eax+0x28]
00401C72	faddp  st(1),st
00401C74	fld    st(5)
00401C76	fmul   DWORD PTR [eax+0x2c]
00401C79	faddp  st(1),st
00401C7B	fld    st(2)
00401C7D	fmul   DWORD PTR [eax+0x10]
00401C80	fld    st(4)
00401C82	fmul   DWORD PTR [eax+0x14]
00401C85	faddp  st(1),st
00401C87	fld    st(5)
00401C89	fmul   DWORD PTR [eax+0x18]
00401C8C	faddp  st(1),st
00401C8E	fld    st(6)
00401C90	fmul   DWORD PTR [eax+0x1c]
00401C93	faddp  st(1),st
00401C95	fxch   st(3)
00401C97	fmul   DWORD PTR [eax]
00401C99	fxch   st(4)
00401C9B	fmul   DWORD PTR [eax+0x4]
00401C9E	faddp  st(4),st
00401CA0	fxch   st(4)
00401CA2	fmul   DWORD PTR [eax+0x8]
00401CA5	faddp  st(3),st
00401CA7	fxch   st(4)
00401CA9	fmul   DWORD PTR [eax+0xc]
00401CAC	faddp  st(2),st
00401CAE	fxch   st(1)
00401CB0	fstp   DWORD PTR [edx+0xc]
00401CB3	fstp   DWORD PTR [edx+0x8]
00401CB6	fstp   DWORD PTR [edx+0x4]
00401CB9	fstp   DWORD PTR [edx]
00401CBB	leave
00401CBC	ret

SSE
004014A8	push   ebp
004014A9	mov    ebp,esp
004014AB	mov    eax,DWORD PTR [ebp+0x8]
004014AE	mov    edx,DWORD PTR [ebp+0xc]
004014B1	movaps xmm1,XMMWORD PTR [edx]
004014B4	movaps xmm3,xmm1
004014B7	mulps  xmm3,XMMWORD PTR [eax]
004014BA	movaps xmm0,xmm1
004014BD	mulps  xmm0,XMMWORD PTR [eax+0x10]
004014C1	movaps xmm2,xmm1
004014C4	mulps  xmm2,XMMWORD PTR [eax+0x20]
004014C8	mulps  xmm1,XMMWORD PTR [eax+0x30]
004014CC	haddps xmm0,xmm3
004014D0	haddps xmm1,xmm2
004014D4	haddps xmm1,xmm0
004014D8	mov    eax,DWORD PTR [ebp+0x10]
004014DB	movaps XMMWORD PTR [eax],xmm1
004014DE	leave
004014DF	ret

Matrix Vector Multiplication [Batch]:
MATHLIB_INLINE void matrix4x4_vectorBatchMul(const matrix4x4& matrix, vector4 const * const vectorArray, uint32_t numVectors, vector4* const resultArray)
/// Performs a batch matrix vector multiplication (column matrix).
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	// Load matrix:
	__m128 matrix_row0	= matrix.sse_row0;
	__m128 matrix_row1	= matrix.sse_row1;
	__m128 matrix_row2	= matrix.sse_row2;
	__m128 matrix_row3	= matrix.sse_row3;

	for (uint32_t i = 0; i < numVectors; i++)
	{
		__m128 m0	= _mm_mul_ps(matrix_row0, vectorArray[i].sse_vec);
		__m128 m1	= _mm_mul_ps(matrix_row1, vectorArray[i].sse_vec);
		__m128 m2	= _mm_mul_ps(matrix_row2, vectorArray[i].sse_vec);
		__m128 m3	= _mm_mul_ps(matrix_row3, vectorArray[i].sse_vec);

		m0	= _mm_hadd_ps(m1, m0);
		m2	= _mm_hadd_ps(m3, m2);
		m0	= _mm_hadd_ps(m2, m0);

		resultArray[i].sse_vec	= m0;
	}

#endif // (MATHLIB_SSE)
#else

	for (uint32_t i = 0; i < numVectors; i++)
	{
		float vectorX	= vectorArray[i].extractX();
		float vectorY	= vectorArray[i].extractY();
		float vectorZ	= vectorArray[i].extractZ();
		float vectorW	= vectorArray[i].extractW();

		resultArray[i].setXYZW((matrix._03 * vectorW) + (matrix._02 * vectorZ) + (matrix._01 * vectorY) + (matrix._00 * vectorX),
							   (matrix._13 * vectorW) + (matrix._12 * vectorZ) + (matrix._11 * vectorY) + (matrix._10 * vectorX),
							   (matrix._23 * vectorW) + (matrix._22 * vectorZ) + (matrix._21 * vectorY) + (matrix._20 * vectorX),
							   (matrix._33 * vectorW) + (matrix._32 * vectorZ) + (matrix._31 * vectorY) + (matrix._30 * vectorX));
	}

#endif // (MATHLIB_SIMD)
}

FPU
00401CBD	push   ebp
00401CBE	mov    ebp,esp
00401CC0	push   edi
00401CC1	push   esi
00401CC2	push   ebx
00401CC3	mov    eax,DWORD PTR [ebp+0x8]
00401CC6	mov    ecx,DWORD PTR [ebp+0xc]
00401CC9	mov    edi,DWORD PTR [ebp+0x10]
00401CCC	mov    ebx,DWORD PTR [ebp+0x14]
00401CCF	test   edi,edi
00401CD1	je     0x401d74 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+183>
00401CD7	mov    edx,0x0
00401CDC	mov    esi,0x0
00401CE1	fld    DWORD PTR [ecx+edx*1+0xc]
00401CE5	fld    DWORD PTR [ecx+edx*1+0x8]
00401CE9	fld    DWORD PTR [ecx+edx*1+0x4]
00401CED	fld    DWORD PTR [ecx+edx*1]
00401CF0	fld    st(0)
00401CF2	fmul   DWORD PTR [eax+0x30]
00401CF5	fld    st(2)
00401CF7	fmul   DWORD PTR [eax+0x34]
00401CFA	faddp  st(1),st
00401CFC	fld    st(3)
00401CFE	fmul   DWORD PTR [eax+0x38]
00401D01	faddp  st(1),st
00401D03	fld    st(4)
00401D05	fmul   DWORD PTR [eax+0x3c]
00401D08	faddp  st(1),st
00401D0A	fld    st(1)
00401D0C	fmul   DWORD PTR [eax+0x20]
00401D0F	fld    st(3)
00401D11	fmul   DWORD PTR [eax+0x24]
00401D14	faddp  st(1),st
00401D16	fld    st(4)
00401D18	fmul   DWORD PTR [eax+0x28]
00401D1B	faddp  st(1),st
00401D1D	fld    st(5)
00401D1F	fmul   DWORD PTR [eax+0x2c]
00401D22	faddp  st(1),st
00401D24	fld    st(2)
00401D26	fmul   DWORD PTR [eax+0x10]
00401D29	fld    st(4)
00401D2B	fmul   DWORD PTR [eax+0x14]
00401D2E	faddp  st(1),st
00401D30	fld    st(5)
00401D32	fmul   DWORD PTR [eax+0x18]
00401D35	faddp  st(1),st
00401D37	fld    st(6)
00401D39	fmul   DWORD PTR [eax+0x1c]
00401D3C	faddp  st(1),st
00401D3E	fxch   st(3)
00401D40	fmul   DWORD PTR [eax]
00401D42	fxch   st(4)
00401D44	fmul   DWORD PTR [eax+0x4]
00401D47	faddp  st(4),st
00401D49	fxch   st(4)
00401D4B	fmul   DWORD PTR [eax+0x8]
00401D4E	faddp  st(3),st
00401D50	fxch   st(4)
00401D52	fmul   DWORD PTR [eax+0xc]
00401D55	faddp  st(2),st
00401D57	fxch   st(1)
00401D59	fstp   DWORD PTR [ebx+edx*1+0xc]
00401D5D	fstp   DWORD PTR [ebx+edx*1+0x8]
00401D61	fstp   DWORD PTR [ebx+edx*1+0x4]
00401D65	fstp   DWORD PTR [ebx+edx*1]
00401D68	inc    esi
00401D69	add    edx,0x10
00401D6C	cmp    edi,esi
00401D6E	ja     0x401ce1 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+36>
00401D74	pop    ebx
00401D75	pop    esi
00401D76	pop    edi
00401D77	leave
00401D78	ret

SSE
004014A8	push   ebp
004014A9	mov    ebp,esp
004014AB	push   ebx
004014AC	mov    eax,DWORD PTR [ebp+0x8]
004014AF	mov    ebx,DWORD PTR [ebp+0x10]
004014B2	movaps xmm3,XMMWORD PTR [eax]
004014B5	movaps xmm4,XMMWORD PTR [eax+0x10]
004014B9	movaps xmm5,XMMWORD PTR [eax+0x20]
004014BD	movaps xmm6,XMMWORD PTR [eax+0x30]
004014C1	test   ebx,ebx
004014C3	je     0x401502 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+90>
004014C5	mov    ecx,DWORD PTR [ebp+0xc]
004014C8	mov    edx,DWORD PTR [ebp+0x14]
004014CB	mov    eax,0x0
004014D0	movaps xmm1,XMMWORD PTR [ecx]
004014D3	movaps xmm7,xmm3
004014D6	mulps  xmm7,xmm1
004014D9	movaps xmm0,xmm4
004014DC	mulps  xmm0,xmm1
004014DF	movaps xmm2,xmm5
004014E2	mulps  xmm2,xmm1
004014E5	mulps  xmm1,xmm6
004014E8	haddps xmm0,xmm7
004014EC	haddps xmm1,xmm2
004014F0	haddps xmm1,xmm0
004014F4	movaps XMMWORD PTR [edx],xmm1
004014F7	inc    eax
004014F8	add    ecx,0x10
004014FB	add    edx,0x10
004014FE	cmp    ebx,eax
00401500	ja     0x4014d0 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+40>
00401502	pop    ebx
00401503	leave
00401504	ret

Benchmarks
Let's look at the performance numbers. The testing methodology is pretty much exactly the same as I had with the vectors (you can take a look at the source code, it's all there). I'm just going to post the text output from the program itself (the screenshot approach takes too long). Again, we're measuring single-threaded code; it's an architecture, clock speed, and memory bandwidth game.

The first machine I ran the test on was equipped with a Pentium E5200, overclocked to 2.75GHz:
MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.099916]
Subtraction benchmark           : average time (msecs) [0.089407]
Transposition benchmark         : average time (msecs) [0.049061]
Scalar mul benchmark            : average time (msecs) [0.057214]
Matrix mul benchmark            : average time (msecs) [0.169970]
Vector mul benchmark            : average time (msecs) [0.049327]
Vector batch mul benchmark      : average time (msecs) [0.036488]

Process returned 0 (0x0)   execution time : 55.196 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.152721]
Subtraction benchmark           : average time (msecs) [0.151200]
Transposition benchmark         : average time (msecs) [0.098960]
Scalar mul benchmark            : average time (msecs) [0.121864]
Matrix mul benchmark            : average time (msecs) [0.470418]
Vector mul benchmark            : average time (msecs) [0.117321]
Vector batch mul benchmark      : average time (msecs) [0.118026]

Process returned 0 (0x0)   execution time : 123.101 s
Press any key to continue.

The second machine was my good old (really old) laptop, equipped with a Pentium Dual T3400:
MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.623972]
Subtraction benchmark           : average time (msecs) [0.620934]
Transposition benchmark         : average time (msecs) [0.070971]
Scalar mul benchmark            : average time (msecs) [0.423544]
Matrix mul benchmark            : average time (msecs) [0.641712]
Vector mul benchmark            : average time (msecs) [0.108508]
Vector batch mul benchmark      : average time (msecs) [0.063231]

Process returned 0 (0x0)   execution time : 256.348 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.634588]
Subtraction benchmark           : average time (msecs) [0.632229]
Transposition benchmark         : average time (msecs) [0.097166]
Scalar mul benchmark            : average time (msecs) [0.439894]
Matrix mul benchmark            : average time (msecs) [0.676615]
Vector mul benchmark            : average time (msecs) [0.173299]
Vector batch mul benchmark      : average time (msecs) [0.152087]

Process returned 0 (0x0)   execution time : 281.617 s
Press any key to continue.


So these results are pretty interesting: it appears that the calculations become memory bandwidth limited... the SSE code is always ahead, but it's not far ahead. The only really notable speedups are the vector mul and vector batch mul results.

The third machine was a Core 2 Quad Q8400:

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.089563]
Subtraction benchmark           : average time (msecs) [0.090273]
Transposition benchmark         : average time (msecs) [0.050641]
Scalar mul benchmark            : average time (msecs) [0.060286]
Matrix mul benchmark            : average time (msecs) [0.172725]
Vector mul benchmark            : average time (msecs) [0.051047]
Vector batch mul benchmark      : average time (msecs) [0.037580]

Process returned 0 (0x0)   execution time : 55.302 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.157504]
Subtraction benchmark           : average time (msecs) [0.158742]
Transposition benchmark         : average time (msecs) [0.102287]
Scalar mul benchmark            : average time (msecs) [0.126073]
Matrix mul benchmark            : average time (msecs) [0.484266]
Vector mul benchmark            : average time (msecs) [0.121134]
Vector batch mul benchmark      : average time (msecs) [0.121808]

Process returned 0 (0x0)   execution time : 127.281 s
Press any key to continue.

Conclusion
In line with the vectors, we can see a comfortable 2x speedup (laptop results notwithstanding) from the use of SSE intrinsics. The batch multiplication and matrix-matrix multiplication receive around 3x speedups, which is promising.

The source code for this article can be downloaded here:
https://hotfile.com/dl/167284774/730f206/MathLib.rar.html

Thursday 16 August 2012

SIMD Math Library Part 1: SSE With Vectors

Introduction
I'm going to let you in on a personal secret: every single time I start writing a 3D project, the first thing I look to do is write a new math library. It's a fundamental building block, of course. And there's a clinical cleanliness to the resulting code that a lot of techie programmers find nigh-on-rapturous. My first attempt, when I had been programming in C for a sum total of around 3 months, was as basic an implementation as you're likely to see. I hadn't even heard of concepts like 'optimization', 'instruction-level parallelism', and 'data alignment' yet. And you know what? I didn't need them. Even the most basic C/C++ math library is going to perform just fine running on a modern x86 chip for the *majority* of your projects.

So what's the point of writing a SIMD math library? Well...because I want to, but if you look hard enough, there are benefits to it.

Let's start with the pragmatic: not every CPU is an x86 OoO super desktop chip with 35 floating-point execution units and 300 trillion transistors dedicated to implicit scheduling logic to extract your ILP for you, thank you very much. If you're like me and want to write something on, say, the Cell BE running inside your still-Linux-capable PS3 (ha Sony, you thought you'd gotten all of us?!), then for a non-trivial project SIMD optimization still bears fruit (I'm actually going to use SSE initially, but the concepts are directly transferable). Secondly, there are still certain (very specific) modules that we want to squeeze the most performance out of, such as when you're writing a software rasterizer or software vertex pipeline. Third, because it's educational: systems programmers should aspire to understand the hardware capabilities our CPUs give us and how to take advantage of them. And lastly, because it's fun! Low level coding is good for the programmer's soul.

But first let's get a few things out of the way. These articles are not meant to be a definitive resource for all things SIMD, SSE, and optimization. I'm very much like the next guy, trying to understand exactly how to do these things and what benefits they bring to my toolbox. To be perfectly honest, a lot of what I've heard about SIMD optimization on x86 (PowerPC and ARM are different) is that it tends to slow down your code! Let's investigate that!?! I'll probably do some (a lot of) stupid things, but I'm hoping that if you know something I don't, you'll drop me an email or a comment.

So what exactly am I going to write about?
I'm not going to post a line by line tutorial specifying how to write a math library, although the code I've written is available for anyone to download and use. These articles are just a summation of the work I did. That doesn't mean you can't learn from it; quite the contrary, you can sift through the code line by line and see how it fits together. A math library is a very simple thing with a very simple interface. What I am going to write about is a few of the main design choices, and then, for some functions, the code and assembly output of the two implementations (x87 vs SSE).
Finally, we'll run a test program to compare the execution times and see what performance, if any, is to be gained.

Design Choices
I decided not to have any sort of dynamic linking of function calls in the library, which means that a recompile is required for any different platform you're on. The reason is that dynamic dispatch requires that we first detect which CPU/OS combination we're running on, and both host configuration detection and dynamic linking methods are system dependent. It is much simpler to just use the preprocessor to determine which code gets compiled and set that as your 'min-spec', so to say. (Also, dynamic linking prevents code inlining, which is problematic from a performance viewpoint.) A sketch of the compile-time selection follows.
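
As a minimal sketch, the selection boils down to something like this (the MATHLIB_* macro names follow the listings in these articles; the detection macros shown are GCC-style and compiler specific):

// Master switch: build with SIMD acceleration, or fall back to plain C code.
#define MATHLIB_SIMD 1

#if (MATHLIB_SIMD)
	#if defined(__SSE3__)
		#define MATHLIB_SSE 1
		#include <pmmintrin.h>	// SSE3, needed for _mm_hadd_ps
	#elif defined(__ALTIVEC__)
		#define MATHLIB_ALTIVEC 1
		#include <altivec.h>
	#endif
#endif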

In addition, we will assume that any x86 CPU we come across supports up to SSE3 (see Steam's hardware survey), any PowerPC CPU supports AltiVec, and any ARM CPU supports NEON(?). However, for comparison and fallback reasons, native C code is provided as a default, so the library can run on any platform without a SIMD instruction set. For the native code, I considered it important that the behaviour match the vectorized variants as closely as possible, which results in the occasional redundant w component calculation. However, there are still minor inconsistencies that crop up, particularly with regard to the w component, so extra caution is required when relying on a w value.

All of the library methods are declared inline, ensuring that any performance we gain from using SIMD code won't be wasted on setting up a function stack-frame.

I use intrinsics instead of inline assembly or handwritten assembly modules. It's easier to read and modify, allows the compiler to schedule your instructions for you and doesn't require you to complicate your build process.

All data types I use have to be aligned. This complicates safely declaring them as stack objects inside a function or dynamically allocating them, but ensures maximum performance. I'll discuss ways of dealing with this in another article; a brief sketch of the usual options follows.
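
For reference, two common ways of getting the required 16-byte alignment (a sketch only; the library's own approach is the topic of that later article, and alignas is C++11 syntax):

#include <xmmintrin.h>

struct alignas(16) vec4data { float w, z, y, x; };	// C++11: forces 16-byte alignment

void alignmentExample()
{
	vec4data onTheStack;		// alignas covers stack objects
	onTheStack.x = 0.0f;

	// Heap objects need an aligned allocator; plain new/malloc guarantees less on many ABIs.
	vec4data* onTheHeap = (vec4data*)_mm_malloc(sizeof(vec4data), 16);
	_mm_free(onTheHeap);		// must be paired with _mm_malloc
}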

Code Comparison
Ok, let's look at some of the functions in the vector part of the math library and how they're implemented. Then we'll take a look at the differing assembly language outputs. If you take a glance at the library, you can see that the w component of vectors gets modified by the various vector functions, when what you really want is for it to be mathematically correct at either 0.0 (a vector representation) or 1.0 (a point representation). We correct this by having quick vector4_setToVector and vector4_setToPoint instructions that clear the w component to 0.0 or set it to 1.0 respectively. Use them in algorithms that require a valid w component to function.
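
For example (a hypothetical call sequence; vector4_cross stands in for any library operation that can leave garbage in w):

vector4 edge1, edge2, normal, position;

vector4_cross(edge1, edge2, normal);	// hypothetical op; w is now not meaningful
vector4_setToVector(normal);		// force w = 0.0 (direction semantics) before math that reads w

vector4_setToPoint(position);		// force w = 1.0 (point semantics)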

Vector Addition:
MATHLIB_INLINE void vector4_add(const vector4& vector1, const vector4& vector2, vector4& result)
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

 result.sse_vec = _mm_add_ps(vector1.sse_vec, vector2.sse_vec);

#endif // (MATHLIB_SSE)
#else

 result.setXYZW(vector1.extractX() + vector2.extractX(),
         vector1.extractY() + vector2.extractY(),
         vector1.extractZ() + vector2.extractZ(),
         vector1.extractW() + vector2.extractW());

#endif // (MATHLIB_SIMD)
}

Simple enough, right? Let's take a look at what the compiler spits out (I wrap the inlined functions into a normal function call so we get the assembly in isolation):

FPU
0040134C push   ebp
0040134D mov    ebp,esp
0040134F mov    ecx,DWORD PTR [ebp+0x8]
00401352 mov    edx,DWORD PTR [ebp+0xc]
00401355 mov    eax,DWORD PTR [ebp+0x10]
00401358 fld    DWORD PTR [ecx]
0040135A fadd   DWORD PTR [edx]
0040135C fld    DWORD PTR [ecx+0x4]
0040135F fadd   DWORD PTR [edx+0x4]
00401362 fld    DWORD PTR [ecx+0x8]
00401365 fadd   DWORD PTR [edx+0x8]
00401368 fld    DWORD PTR [ecx+0xc]
0040136B fadd   DWORD PTR [edx+0xc]
0040136E fstp   DWORD PTR [eax+0xc]
00401371 fstp   DWORD PTR [eax+0x8]
00401374 fstp   DWORD PTR [eax+0x4]
00401377 fstp   DWORD PTR [eax]
00401379 leave
0040137A ret

SSE
0040157D push   ebp
0040157E mov    ebp,esp
00401580 mov    eax,DWORD PTR [ebp+0x8]
00401583 movaps xmm0,XMMWORD PTR [eax]
00401586 mov    eax,DWORD PTR [ebp+0xc]
00401589 addps  xmm0,XMMWORD PTR [eax]
0040158C mov    eax,DWORD PTR [ebp+0x10]
0040158F movaps XMMWORD PTR [eax],xmm0
00401592 leave
00401593 ret

Well, regardless of performance, I think it's safe to say that SSE assembly (and vector assembly in general) is quite a bit more compact than scalar assembly, lessening the code-bloat effect of inlining all your math functions.

Vector Add Scaled Vector:
MATHLIB_INLINE void vector4_addScaledVector(const vector4& vector1, const vector4& vector2, float scaleFactor, vector4& result)
/// Adds the scaled vector2 to vector1 and stores the sum in result.
/// Corresponds to the expression result = vector1 + (scaleFactor * vector2).
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

 __m128 scaledVector;
 scaledVector = _mm_load1_ps(&scaleFactor);
 scaledVector = _mm_mul_ps(vector2.sse_vec, scaledVector);
 result.sse_vec = _mm_add_ps(vector1.sse_vec, scaledVector);

#endif // (MATHLIB_SSE)
#else

 result.setXYZW(vector1.extractX() + (vector2.extractX() * scaleFactor),
         vector1.extractY() + (vector2.extractY() * scaleFactor),
                vector1.extractZ() + (vector2.extractZ() * scaleFactor),
                vector1.extractW() + (vector2.extractW() * scaleFactor));

#endif // (MATHLIB_SIMD)
}
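
This operation maps directly onto things like an explicit Euler integration step; a usage sketch, with position, velocity and dt as illustrative names:

// position = position + (dt * velocity)
vector4_addScaledVector(position, velocity, dt, position);

Aliasing the result with an input is safe here, since both implementations read their inputs before writing the result.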

FPU
00401428 push   ebp
00401429 mov    ebp,esp
0040142B mov    edx,DWORD PTR [ebp+0x8]
0040142E mov    ecx,DWORD PTR [ebp+0xc]
00401431 fld    DWORD PTR [ebp+0x10]
00401434 mov    eax,DWORD PTR [ebp+0x14]
00401437 fld    st(0)
00401439 fmul   DWORD PTR [ecx]
0040143B fadd   DWORD PTR [edx]
0040143D fld    st(1)
0040143F fmul   DWORD PTR [ecx+0x4]
00401442 fadd   DWORD PTR [edx+0x4]
00401445 fld    st(2)
00401447 fmul   DWORD PTR [ecx+0x8]
0040144A fadd   DWORD PTR [edx+0x8]
0040144D fxch   st(3)
0040144F fmul   DWORD PTR [ecx+0xc]
00401452 fadd   DWORD PTR [edx+0xc]
00401455 fstp   DWORD PTR [eax+0xc]
00401458 fxch   st(2)
0040145A fstp   DWORD PTR [eax+0x8]
0040145D fxch   st(1)
0040145F fstp   DWORD PTR [eax+0x4]
00401462 fstp   DWORD PTR [eax]
00401464 leave
00401465 ret

SSE
00401533 push   ebp
00401534 mov    ebp,esp
00401536 movss  xmm0,DWORD PTR [ebp+0x10]
0040153B shufps xmm0,xmm0,0x0
0040153F mov    eax,DWORD PTR [ebp+0xc]
00401542 mulps  xmm0,XMMWORD PTR [eax]
00401545 mov    eax,DWORD PTR [ebp+0x8]
00401548 addps  xmm0,XMMWORD PTR [eax]
0040154B mov    eax,DWORD PTR [ebp+0x14]
0040154E movaps XMMWORD PTR [eax],xmm0
00401551 leave
00401552 ret

Vector Magnitude:
MATHLIB_INLINE float vector4_magnitude(const vector4& vector)
/// Returns the magnitude of the vector.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

 __m128 vector0;
 __m128 vector1;
 __m128 vector2;

 vector0 = _mm_mul_ps(vector.sse_vec, vector.sse_vec);   // Square the vector elements.
 vector1 = _mm_shuffle_ps(vector0, vector0, _MM_SHUFFLE(2, 1, 0, 3)); // Shuffle around, add elements
 vector2 = _mm_add_ps(vector0, vector1);     // (x * x) + (y * y)
 vector1 = _mm_shuffle_ps(vector0, vector0, _MM_SHUFFLE(1, 0, 3, 2));
 vector2 = _mm_add_ps(vector2, vector1);     // (x * x) + (y * y) + (z * z)
 vector2 = _mm_sqrt_ps(vector2);
 vector2 = _mm_shuffle_ps(vector2, vector2, _MM_SHUFFLE(3, 3, 3, 3));

 return (_mm_cvtss_f32(vector2));

#endif // (MATHLIB_SSE)
#else

 float x = vector.extractX();
 float y = vector.extractY();
 float z = vector.extractZ();

 return (sqrtf((x * x) + (y * y) + (z * z)));

#endif // (MATHLIB_SIMD)
}

FPU
0040193F push   ebp
00401940 mov    ebp,esp
00401942 mov    eax,DWORD PTR [ebp+0x8]
00401945 fld    DWORD PTR [eax+0xc]
00401948 fld    DWORD PTR [eax+0x8]
0040194B fld    DWORD PTR [eax+0x4]
0040194E fxch   st(2)
00401950 fmul   st,st(0)
00401952 fxch   st(1)
00401954 fmul   st,st(0)
00401956 faddp  st(1),st
00401958 fxch   st(1)
0040195A fmul   st,st(0)
0040195C faddp  st(1),st
0040195E fsqrt
00401960 leave
00401961 ret

SSE 
00401604 push   ebp
00401605 mov    ebp,esp
00401607 sub    esp,0x4
0040160A mov    eax,DWORD PTR [ebp+0x8]
0040160D movaps xmm0,XMMWORD PTR [eax]
00401610 mulps  xmm0,xmm0
00401613 movaps xmm1,xmm0
00401616 shufps xmm1,xmm0,0x93
0040161A addps  xmm1,xmm0
0040161D shufps xmm0,xmm0,0x4e
00401621 addps  xmm0,xmm1
00401624 sqrtps xmm0,xmm0
00401627 shufps xmm0,xmm0,0xff
0040162B movss  DWORD PTR [ebp-0x4],xmm0
00401630 fld    DWORD PTR [ebp-0x4]
00401633 leave
00401634 ret
Look at addresses 0040162B and 00401630. This is where the danger in using vectorization comes in. When we have to derive a scalar result from a vector calculation and then use that scalar result in some scalar work, we end up writing our result back to memory and loading it again; the 32-bit calling convention returns floats in st(0), the top of the FPU stack, so there's no way around the round trip here. This is NOT a good thing (why no instruction to move a value from an SSE register onto the FPU stack, Intel?).
And look, the generated code is actually more compact on the FPU side.
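
Since we're already assuming SSE3, one alternative worth experimenting with inside vector4_magnitude's SSE path is replacing the two shuffle-and-add pairs with horizontal adds. A sketch, not what the library currently does; note that the w lane has to be masked off first (using the library's vector4_clearW constant) because haddps sums all four lanes:

__m128 vec3 = _mm_and_ps(vector.sse_vec, vector4::vector4_clearW); // zero the w lane
__m128 sq   = _mm_mul_ps(vec3, vec3);                              // square the components
__m128 sum  = _mm_hadd_ps(sq, sq);                                 // pairwise partial sums
sum         = _mm_hadd_ps(sum, sum);                               // full sum in every lane
return (_mm_cvtss_f32(_mm_sqrt_ss(sum)));                          // scalar sqrt of lane 0

Whether haddps actually beats the shuffles is microarchitecture dependent, so profile before committing to it.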

Vector Dot Product:
MATHLIB_INLINE float vector4_dotProduct(const vector4& vector1, const vector4& vector2)
/// Returns the dot product of the two vectors. This does not take the w component into account.
{
 /// Native FPU code will be faster here because the result will likely be used in a scalar
 /// codepath, so we should avoid an unnecessary store:
 float xcomp = vector1.extractX() * vector2.extractX();
 float ycomp = vector1.extractY() * vector2.extractY();
 float zcomp = vector1.extractZ() * vector2.extractZ();

 return (xcomp + ycomp + zcomp);
}
Well, you can see here that I haven't even bothered to implement a SIMD path. A single dot product in isolation won't receive much of a speedup, and dot products are almost always consumed by scalar code afterward, guaranteeing a store-reload penalty.
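
A typical consumer looks like this illustrative snippet, where the result immediately feeds a scalar branch:

// Point-vs-plane test (names are illustrative): the result is branched on
// straight away, so keeping it on the FPU stack is a win.
float d = vector4_dotProduct(planeNormal, point);
if (d < planeDistance)
{
    // point is behind the plane
}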

Vector Cross Product:
MATHLIB_INLINE void vector4_crossProduct(const vector4& vector1, const vector4& vector2, vector4& result)
/// Computes the cross product of vector1 and vector2 and stores cross product in result.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

 result.sse_vec = _mm_sub_ps(_mm_mul_ps(_mm_shuffle_ps(vector1.sse_vec, vector1.sse_vec, _MM_SHUFFLE(3, 1, 2, 0)), _mm_shuffle_ps(vector2.sse_vec, vector2.sse_vec, _MM_SHUFFLE(2, 3, 1, 0))),
                _mm_mul_ps(_mm_shuffle_ps(vector1.sse_vec, vector1.sse_vec, _MM_SHUFFLE(2, 3, 1, 0)), _mm_shuffle_ps(vector2.sse_vec, vector2.sse_vec, _MM_SHUFFLE(3, 1, 2, 0))));

#endif // (MATHLIB_SSE)
#else

 float xcomp = (vector1.extractY() * vector2.extractZ()) - (vector1.extractZ() * vector2.extractY());
 float ycomp = (vector1.extractZ() * vector2.extractX()) - (vector1.extractX() * vector2.extractZ());
 float zcomp = (vector1.extractX() * vector2.extractY()) - (vector1.extractY() * vector2.extractX());

 result.setXYZ(xcomp, ycomp, zcomp);

#endif // (MATHLIB_SIMD)
}

FPU
00401497 push   ebp
00401498 mov    ebp,esp
0040149A mov    edx,DWORD PTR [ebp+0x8]
0040149D mov    ecx,DWORD PTR [ebp+0xc]
004014A0 mov    eax,DWORD PTR [ebp+0x10]
004014A3 fld    DWORD PTR [edx+0x8]
004014A6 fld    DWORD PTR [ecx+0x4]
004014A9 fld    DWORD PTR [edx+0x4]
004014AC fld    DWORD PTR [ecx+0x8]
004014AF fld    DWORD PTR [ecx+0xc]
004014B2 fld    DWORD PTR [edx+0xc]
004014B5 fld    st(5)
004014B7 fmul   st,st(5)
004014B9 fld    st(4)
004014BB fmul   st,st(4)
004014BD fsubrp st(1),st
004014BF fstp   DWORD PTR [eax+0xc]
004014C2 fxch   st(3)
004014C4 fmul   st,st(1)
004014C6 fxch   st(4)
004014C8 fmul   st,st(3)
004014CA fsubrp st(4),st
004014CC fxch   st(3)
004014CE fstp   DWORD PTR [eax+0x8]
004014D1 fmulp  st(1),st
004014D3 fxch   st(1)
004014D5 fmulp  st(2),st
004014D7 fsubp  st(1),st
004014D9 fstp   DWORD PTR [eax+0x4]
004014DC leave
004014DD ret

SSE
00401597 push   ebp
00401598 mov    ebp,esp
0040159A mov    eax,DWORD PTR [ebp+0xc]
0040159D movaps xmm1,XMMWORD PTR [eax]
004015A0 movaps xmm2,xmm1
004015A3 shufps xmm2,xmm1,0xd8
004015A7 mov    eax,DWORD PTR [ebp+0x8]
004015AA movaps xmm0,XMMWORD PTR [eax]
004015AD movaps xmm3,xmm0
004015B0 shufps xmm3,xmm0,0xb4
004015B4 mulps  xmm2,xmm3
004015B7 shufps xmm1,xmm1,0xb4
004015BB shufps xmm0,xmm0,0xd8
004015BF mulps  xmm0,xmm1
004015C2 subps  xmm0,xmm2
004015C5 mov    eax,DWORD PTR [ebp+0x10]
004015C8 movaps XMMWORD PTR [eax],xmm0
004015CB leave
004015CC ret
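
A quick usage sketch, computing a face normal from two triangle edges (variable names are illustrative):

vector4 edge1, edge2, normal;   // edge1/edge2 assumed already computed from the triangle's vertices
vector4_crossProduct(edge1, edge2, normal);
vector4_normalize(normal);      // see the next function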

Vector Normalize:
Funnily enough, this method received a HUGE speedup from SSE, not because of anything directly SIMD related, but because of the very fast rsqrtps instruction, which uses an approximation method and so sacrifices a little bit of precision for a whole lot of speed.
MATHLIB_INLINE void vector4_normalize(vector4& vector)
/// Normalizes the given vector.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

 __m128 vec0;
 __m128 vec1;

 vec0   = _mm_and_ps(vector.sse_vec, vector4::vector4_clearW);
 vec0   = _mm_mul_ps(vec0, vec0);
 vec1   = vec0;
 vec0   = _mm_shuffle_ps(vec0, vec0, _MM_SHUFFLE(2, 1, 0, 3));
 vec1   = _mm_add_ps(vec0, vec1);
 vec0   = vec1;
 vec1   = _mm_shuffle_ps(vec1, vec1, _MM_SHUFFLE(1, 0, 3, 2));
 vec0   = _mm_add_ps(vec0, vec1);
 vec0   = _mm_rsqrt_ps(vec0);
 vector.sse_vec = _mm_mul_ps(vec0, vector.sse_vec);

#endif // (MATHLIB_SSE)
#else

 float vector_magnitude = vector4_magnitude(vector);

 vector4_scale(vector, 1.0f / vector_magnitude, vector);

#endif // (MATHLIB_SIMD)
}

FPU
00401952 push   ebp
00401953 mov    ebp,esp
00401955 push   ebx
00401956 sub    esp,0x14
00401959 mov    ebx,DWORD PTR [ebp+0x8]
0040195C fld    DWORD PTR [ebx+0xc]
0040195F fld    DWORD PTR [ebx+0x8]
00401962 fld    DWORD PTR [ebx+0x4]
00401965 fxch   st(2)
00401967 fmul   st,st(0)
00401969 fxch   st(1)
0040196B fmul   st,st(0)
0040196D faddp  st(1),st
0040196F fxch   st(1)
00401971 fmul   st,st(0)
00401973 faddp  st(1),st
00401975 fstp   DWORD PTR [esp]
00401978 call   0x414dd0 <sqrtf>
0040197D fdivr  DWORD PTR ds:0x46f0c0
00401983 fld    st(0)
00401985 fmul   DWORD PTR [ebx]
00401987 fld    st(1)
00401989 fmul   DWORD PTR [ebx+0x4]
0040198C fld    st(2)
0040198E fmul   DWORD PTR [ebx+0x8]
00401991 fxch   st(3)
00401993 fmul   DWORD PTR [ebx+0xc]
00401996 fstp   DWORD PTR [ebx+0xc]
00401999 fxch   st(2)
0040199B fstp   DWORD PTR [ebx+0x8]
0040199E fxch   st(1)
004019A0 fstp   DWORD PTR [ebx+0x4]
004019A3 fstp   DWORD PTR [ebx]
004019A5 add    esp,0x14
004019A8 pop    ebx
004019A9 pop    ebp
004019AA ret
At 00401978, does anyone know why the compiler decided to call the function sqrtf rather than just use the fsqrt instruction like in the vector4_magnitude function? I tried restructuring the code a few times to get the compiler to remove it, but every time it would emit the call. Puzzling... (My best guess: unless you build with -fno-math-errno or -ffast-math, GCC may keep the sqrtf call so errno can be set for negative inputs, though that doesn't explain why vector4_magnitude got away with fsqrt.)

SSE
004016FA push   ebp
004016FB mov    ebp,esp
004016FD mov    eax,DWORD PTR [ebp+0x8]
00401700 movaps xmm1,XMMWORD PTR [eax]
00401703 movaps xmm0,xmm1
00401706 andps  xmm0,XMMWORD PTR ds:0x4f2f80
0040170D mulps  xmm0,xmm0
00401710 movaps xmm2,xmm0
00401713 shufps xmm2,xmm0,0x93
00401717 addps  xmm0,xmm2
0040171A movaps xmm2,xmm0
0040171D shufps xmm2,xmm0,0x4e
00401721 addps  xmm0,xmm2
00401724 rsqrtps xmm0,xmm0
00401727 mulps  xmm1,xmm0
0040172A movaps XMMWORD PTR [eax],xmm1
0040172D pop    ebp
0040172E ret
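
One caveat: rsqrtps is only accurate to roughly 12 bits of mantissa. If that precision loss ever shows up in practice, the standard remedy is a single Newton-Raphson refinement step at the cost of a few extra instructions. A sketch, not part of the library:

// Refined reciprocal square root: y' = 0.5 * y * (3 - x * y * y),
// where y = rsqrtps(x) is the initial ~12-bit estimate.
__m128 rsqrt_nr(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y   = _mm_rsqrt_ps(x);
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));
    return (_mm_mul_ps(_mm_mul_ps(half, y), _mm_sub_ps(three, xyy)));
}

One refinement step roughly doubles the number of accurate bits, which is plenty for normalizing vectors.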

Vector Distance:
MATHLIB_INLINE float vector4_distance(const vector4& vector1, const vector4& vector2)
/// Returns the distance between the two vectors.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

 __m128 vec0;
 __m128 vec1;

 vec0 = _mm_sub_ps(vector2.sse_vec, vector1.sse_vec);
 vec0 = _mm_mul_ps(vec0, vec0);          // vec0 = (x * x) (y * y) (z * z) (w * w)
 vec1 = _mm_shuffle_ps(vec0, vec0, _MM_SHUFFLE(2, 1, 0, 3));   // Shuffle around, add elements.
 vec1 = _mm_add_ps(vec0, vec1);          // r3 = (x * x) + (y * y)
 vec0 = _mm_shuffle_ps(vec0, vec0, _MM_SHUFFLE(1, 0, 3, 2));   // Shuffle around, add elements.
 vec1 = _mm_add_ps(vec0, vec1);          // r3 = (x * x) + (y * y) + (z * z)
 vec1 = _mm_sqrt_ps(vec1);           // square root vector, shuffle and return

 return (_mm_cvtss_f32(_mm_shuffle_ps(vec1, vec1, _MM_SHUFFLE(2, 1, 0, 3))));

#endif // (MATHLIB_SSE)
#else

 float xcomp = vector2.extractX() - vector1.extractX();
 float ycomp = vector2.extractY() - vector1.extractY();
 float zcomp = vector2.extractZ() - vector1.extractZ();

 return (sqrtf((xcomp * xcomp) + (ycomp * ycomp) + (zcomp * zcomp)));

#endif // (MATHLIB_SIMD)
}

FPU
0040191A push   ebp
0040191B mov    ebp,esp
0040191D sub    esp,0x18
00401920 mov    eax,DWORD PTR [ebp+0x8]
00401923 mov    edx,DWORD PTR [ebp+0xc]
00401926 fld    DWORD PTR [edx+0xc]
00401929 fsub   DWORD PTR [eax+0xc]
0040192C fld    DWORD PTR [edx+0x8]
0040192F fsub   DWORD PTR [eax+0x8]
00401932 fld    DWORD PTR [edx+0x4]
00401935 fsub   DWORD PTR [eax+0x4]
00401938 fxch   st(2)
0040193A fmul   st,st(0)
0040193C fxch   st(1)
0040193E fmul   st,st(0)
00401940 faddp  st(1),st
00401942 fxch   st(1)
00401944 fmul   st,st(0)
00401946 faddp  st(1),st
00401948 fstp   DWORD PTR [esp]
0040194B call   0x414de0 <sqrtf>
00401950 leave
00401951 ret
Again we see the appearance of the sqrtf function call...

SSE
0040165E push   ebp
0040165F mov    ebp,esp
00401661 sub    esp,0x4
00401664 mov    eax,DWORD PTR [ebp+0xc]
00401667 movaps xmm0,XMMWORD PTR [eax]
0040166A mov    eax,DWORD PTR [ebp+0x8]
0040166D subps  xmm0,XMMWORD PTR [eax]
00401670 mulps  xmm0,xmm0
00401673 movaps xmm1,xmm0
00401676 shufps xmm1,xmm0,0x93
0040167A addps  xmm1,xmm0
0040167D shufps xmm0,xmm0,0x4e
00401681 addps  xmm0,xmm1
00401684 sqrtps xmm0,xmm0
00401687 shufps xmm0,xmm0,0x93
0040168B movss  DWORD PTR [ebp-0x4],xmm0
00401690 fld    DWORD PTR [ebp-0x4]
00401693 leave
00401694 ret
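
On a related note, many callers only ever compare a distance against a radius. For those cases a squared-distance variant (an assumed addition, not something in the library) sidesteps both the sqrt and the store-reload problem entirely:

MATHLIB_INLINE float vector4_distanceSquared(const vector4& vector1, const vector4& vector2)
/// Returns the squared distance between the two vectors (sketch; scalar only,
/// for the same reason as vector4_dotProduct).
{
    float xcomp = vector2.extractX() - vector1.extractX();
    float ycomp = vector2.extractY() - vector1.extractY();
    float zcomp = vector2.extractZ() - vector1.extractZ();

    return ((xcomp * xcomp) + (ycomp * ycomp) + (zcomp * zcomp));
}

// Usage: if (vector4_distanceSquared(a, b) < (radius * radius)) { ... }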

Benchmarks
After having written all of this, I was curious to see the performance difference between the two implementations, so I set up "benchmarks" for each of the various vector4 functions. Each benchmark performs a function across a vector4 array of 10,000 elements (two separate 10,000-element arrays for functions that need two inputs) and stores the results in either a vector4 array or a float array, depending on the result type. The main function performs each benchmark 100,000 times and computes the average time taken per benchmark.
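
In code, each benchmark has roughly this shape (a sketch: the timer function and array names are my assumptions, not the actual harness):

static const int ARRAY_SIZE = 10000;
static const int RUNS       = 100000;

static vector4 inputA[ARRAY_SIZE];
static vector4 inputB[ARRAY_SIZE];
static vector4 output[ARRAY_SIZE];

double benchmark_add()
{
    double start = readTimerSeconds();   // hypothetical high-resolution timer
    for (int i = 0; i < ARRAY_SIZE; ++i)
        vector4_add(inputA[i], inputB[i], output[i]);
    return (readTimerSeconds() - start);
}

// main() accumulates benchmark_add() over RUNS iterations and divides by RUNS.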

Test machine(s):
For testing I used two machines, my desktop which has a Core 2 Quad Q8400 and my laptop which has a far less powerful Pentium Dual T3400. Obviously core count means nothing for this particular application, so it's down to architecture and clock speed.

Desktop Results
[benchmark timing chart]

Laptop Results
[benchmark timing chart]

Well, the results speak for themselves: about a 200% speed improvement on both machines, with some types of calculation receiving a modest improvement and others a large one. The main thing for me was that using SSE was never slower than using x87 code, and for some workloads (like normalization) there is the potential for a significant speedup.

Conclusion 
Well, I've tested the concept of SIMD acceleration of vector operations on the platform that supposedly rewards the programmer the least for it, and I've come away with a generally favorable opinion. It's not a 4x speedup, sure, but I don't consider it a waste of time. The potential for a speedup is definitely there; we just have to exercise a little caution when we program, and always profile and understand our code.

We still aren't done with SSE yet. Up next: matrices.



The source code for this article can be downloaded here:
https://hotfile.com/dl/167284774/730f206/MathLib.rar.html