Monday, 17 September 2012

SIMD Math Library Part 2: SSE With Matrices

Introduction
All right everyone, it's time to look at matrices. But first, a slight digression, a story if you will;
As I sat down to continue writing the SSE code one night, a question popped into my head, "Why the hell am I trying to optimize this function? I mean, I've used this particular operation maybe once or twice, and never where the performance actually mattered...". And there's a very important point there.

I am a terribly lazy programmer.

But more than that, it's okay that I am, because it's completely irrelevant from a performance perspective that that function I use that accounts for maybe 0.01% of my run time time cost is optimized. Now of course there's probably a fair few people who may have need for those functions that I haven't optimized (pretty much only the matrix determinant and inverse functions). And my apologies for that. If I ever have some free time on my hands I'll probably go through and implement them. But there's cooler, more relevant things to dedicate my time to right now (you'll see in future articles ;) ).

Well if I'm going to only optimize things that truly matter (to me) then what did I optimize exactly?
There are two main types of operation that I focused on:
- Matrix-matrix multiplication.
- Matrix-vector multiplication (i.e vector transformation).

Of these the second is of particular importance if you're doing software pipelines for vertex processing etc. To that end I wrote a batch multiply which basically transforms an entire array of vectors as efficiently as possible.

Now that that's out of the way, lets get to code comparisons!

Matrix Addition:
MATHLIB_INLINE void matrix4x4_add(const matrix4x4& matrix1, const matrix4x4& matrix2, matrix4x4& result)
/// Performs a matrix addition and stores matrix sum in reuslt.
/// Corresponds to the mathematical statement result = matrix1 + matrix2
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	result.sse_row0 = _mm_add_ps(matrix1.sse_row0, matrix2.sse_row0);
	result.sse_row1 = _mm_add_ps(matrix1.sse_row1, matrix2.sse_row1);
	result.sse_row2 = _mm_add_ps(matrix1.sse_row2, matrix2.sse_row2);
	result.sse_row3 = _mm_add_ps(matrix1.sse_row3, matrix2.sse_row3);

#endif // (MATHLIB_SSE)
#else

	result._03  = matrix1._03 + matrix2._03;
	result._02  = matrix1._02 + matrix2._02;
	result._01  = matrix1._01 + matrix2._01;
	result._00  = matrix1._00 + matrix2._00;

	result._13  = matrix1._13 + matrix2._13;
	result._12  = matrix1._12 + matrix2._12;
	result._11  = matrix1._11 + matrix2._11;
	result._10  = matrix1._10 + matrix2._10;

	result._23  = matrix1._23 + matrix2._23;
	result._22  = matrix1._22 + matrix2._22;
	result._21  = matrix1._21 + matrix2._21;
	result._20  = matrix1._20 + matrix2._20;

	result._33  = matrix1._33 + matrix2._33;
	result._32  = matrix1._32 + matrix2._32;
	result._31  = matrix1._31 + matrix2._31;
	result._30  = matrix1._30 + matrix2._30;

#endif // (MATHLIB_SIMD)
}
Yeah I know what you're all thinking, "What the hell is up with the addition order in the scalar version?". Well, it's a quirky thing, but I store the x element of a vector (or the first element of a matrix row) in the highest word (32 bits in this case) of an SSE vector. This means that the x element of a vector is actually stored at the highest address in memory of a 128-bit vector, because little-endian platforms like x86 store higher order bytes at higher memory addresses. So, basically, to keep the memory access the same as usual, I end up having to do operations in reverse.

Here's the assembly output:
FPU
004017EA push   ebp
004017EB mov    ebp,esp
004017ED mov    ecx,DWORD PTR [ebp+0x8]
004017F0 mov    edx,DWORD PTR [ebp+0xc]
004017F3 mov    eax,DWORD PTR [ebp+0x10]
004017F6 fld    DWORD PTR [ecx]
004017F8 fadd   DWORD PTR [edx]
004017FA fstp   DWORD PTR [eax]
004017FC fld    DWORD PTR [ecx+0x4]
004017FF fadd   DWORD PTR [edx+0x4]
00401802 fstp   DWORD PTR [eax+0x4]
00401805 fld    DWORD PTR [ecx+0x8]
00401808 fadd   DWORD PTR [edx+0x8]
0040180B fstp   DWORD PTR [eax+0x8]
0040180E fld    DWORD PTR [ecx+0xc]
00401811 fadd   DWORD PTR [edx+0xc]
00401814 fstp   DWORD PTR [eax+0xc]
00401817 fld    DWORD PTR [ecx+0x10]
0040181A fadd   DWORD PTR [edx+0x10]
0040181D fstp   DWORD PTR [eax+0x10]
00401820 fld    DWORD PTR [ecx+0x14]
00401823 fadd   DWORD PTR [edx+0x14]
00401826 fstp   DWORD PTR [eax+0x14]
00401829 fld    DWORD PTR [ecx+0x18]
0040182C fadd   DWORD PTR [edx+0x18]
0040182F fstp   DWORD PTR [eax+0x18]
00401832 fld    DWORD PTR [ecx+0x1c]
00401835 fadd   DWORD PTR [edx+0x1c]
00401838 fstp   DWORD PTR [eax+0x1c]
0040183B fld    DWORD PTR [ecx+0x20]
0040183E fadd   DWORD PTR [edx+0x20]
00401841 fstp   DWORD PTR [eax+0x20]
00401844 fld    DWORD PTR [ecx+0x24]
00401847 fadd   DWORD PTR [edx+0x24]
0040184A fstp   DWORD PTR [eax+0x24]
0040184D fld    DWORD PTR [ecx+0x28]
00401850 fadd   DWORD PTR [edx+0x28]
00401853 fstp   DWORD PTR [eax+0x28]
00401856 fld    DWORD PTR [ecx+0x2c]
00401859 fadd   DWORD PTR [edx+0x2c]
0040185C fstp   DWORD PTR [eax+0x2c]
0040185F fld    DWORD PTR [ecx+0x30]
00401862 fadd   DWORD PTR [edx+0x30]
00401865 fstp   DWORD PTR [eax+0x30]
00401868 fld    DWORD PTR [ecx+0x34]
0040186B fadd   DWORD PTR [edx+0x34]
0040186E fstp   DWORD PTR [eax+0x34]
00401871 fld    DWORD PTR [ecx+0x38]
00401874 fadd   DWORD PTR [edx+0x38]
00401877 fstp   DWORD PTR [eax+0x38]
0040187A fld    DWORD PTR [ecx+0x3c]
0040187D fadd   DWORD PTR [edx+0x3c]
00401880 fstp   DWORD PTR [eax+0x3c]
00401883 pop    ebp
00401884 ret

SSE
004016AB push   ebp
004016AC mov    ebp,esp
004016AE mov    ecx,DWORD PTR [ebp+0x8]
004016B1 mov    edx,DWORD PTR [ebp+0xc]
004016B4 mov    eax,DWORD PTR [ebp+0x10]
004016B7 movaps xmm0,XMMWORD PTR [ecx]
004016BA addps  xmm0,XMMWORD PTR [edx]
004016BD movaps XMMWORD PTR [eax],xmm0
004016C0 movaps xmm0,XMMWORD PTR [ecx+0x10]
004016C4 addps  xmm0,XMMWORD PTR [edx+0x10]
004016C8 movaps XMMWORD PTR [eax+0x10],xmm0
004016CC movaps xmm0,XMMWORD PTR [ecx+0x20]
004016D0 addps  xmm0,XMMWORD PTR [edx+0x20]
004016D4 movaps XMMWORD PTR [eax+0x20],xmm0
004016D8 movaps xmm0,XMMWORD PTR [ecx+0x30]
004016DC addps  xmm0,XMMWORD PTR [edx+0x30]
004016E0 movaps XMMWORD PTR [eax+0x30],xmm0
004016E4 leave
004016E5 ret
SSE again gets points for compactness. Subtraction is basically the same so I won't waste valuable space by showing all of the assembly for it.

Matrix Scalar Multiplication:
MATHLIB_INLINE void matrix4x4_scalarMul(const matrix4x4& matrix, float scalar, matrix4x4& result)
/// Performs a matrix-scalar multiplication and stores the scaled matrix in result.
/// Corresponds to the mathematical statement reuslt = matrix * scalar
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

    __m128 scaleVec = _mm_load1_ps(&scalar);
    result.sse_row0 = _mm_mul_ps(matrix.sse_row0, scaleVec);
    result.sse_row1 = _mm_mul_ps(matrix.sse_row1, scaleVec);
    result.sse_row2 = _mm_mul_ps(matrix.sse_row2, scaleVec);
    result.sse_row3 = _mm_mul_ps(matrix.sse_row3, scaleVec);

#endif // (MATHLIB_SSE)
#else

    result._03  = matrix._03 * scalar;
    result._02  = matrix._02 * scalar;
    result._01  = matrix._01 * scalar;
    result._00  = matrix._00 * scalar;

    result._13  = matrix._13 * scalar;
    result._12  = matrix._12 * scalar;
    result._11  = matrix._11 * scalar;
    result._10  = matrix._10 * scalar;

    result._23  = matrix._23 * scalar;
    result._22  = matrix._22 * scalar;
    result._21  = matrix._21 * scalar;
    result._20  = matrix._20 * scalar;

    result._33  = matrix._33 * scalar;
    result._32  = matrix._32 * scalar;
    result._31  = matrix._31 * scalar;
    result._30  = matrix._30 * scalar;

#endif // (MATHLIB_SIMD)
}

FPU
0040192D push   ebp
0040192E mov    ebp,esp
00401930 mov    edx,DWORD PTR [ebp+0x8]
00401933 fld    DWORD PTR [ebp+0xc]
00401936 mov    eax,DWORD PTR [ebp+0x10]
00401939 fld    st(0)
0040193B fmul   DWORD PTR [edx]
0040193D fstp   DWORD PTR [eax]
0040193F fld    st(0)
00401941 fmul   DWORD PTR [edx+0x4]
00401944 fstp   DWORD PTR [eax+0x4]
00401947 fld    st(0)
00401949 fmul   DWORD PTR [edx+0x8]
0040194C fstp   DWORD PTR [eax+0x8]
0040194F fld    st(0)
00401951 fmul   DWORD PTR [edx+0xc]
00401954 fstp   DWORD PTR [eax+0xc]
00401957 fld    st(0)
00401959 fmul   DWORD PTR [edx+0x10]
0040195C fstp   DWORD PTR [eax+0x10]
0040195F fld    st(0)
00401961 fmul   DWORD PTR [edx+0x14]
00401964 fstp   DWORD PTR [eax+0x14]
00401967 fld    st(0)
00401969 fmul   DWORD PTR [edx+0x18]
0040196C fstp   DWORD PTR [eax+0x18]
0040196F fld    st(0)
00401971 fmul   DWORD PTR [edx+0x1c]
00401974 fstp   DWORD PTR [eax+0x1c]
00401977 fld    st(0)
00401979 fmul   DWORD PTR [edx+0x20]
0040197C fstp   DWORD PTR [eax+0x20]
0040197F fld    st(0)
00401981 fmul   DWORD PTR [edx+0x24]
00401984 fstp   DWORD PTR [eax+0x24]
00401987 fld    st(0)
00401989 fmul   DWORD PTR [edx+0x28]
0040198C fstp   DWORD PTR [eax+0x28]
0040198F fld    st(0)
00401991 fmul   DWORD PTR [edx+0x2c]
00401994 fstp   DWORD PTR [eax+0x2c]
00401997 fld    st(0)
00401999 fmul   DWORD PTR [edx+0x30]
0040199C fstp   DWORD PTR [eax+0x30]
0040199F fld    st(0)
004019A1 fmul   DWORD PTR [edx+0x34]
004019A4 fstp   DWORD PTR [eax+0x34]
004019A7 fld    st(0)
004019A9 fmul   DWORD PTR [edx+0x38]
004019AC fstp   DWORD PTR [eax+0x38]
004019AF fmul   DWORD PTR [edx+0x3c]
004019B2 fstp   DWORD PTR [eax+0x3c]
004019B5 leave
004019B6 ret

SSE
00401538 push   ebp
00401539 mov    ebp,esp
0040153B mov    edx,DWORD PTR [ebp+0x8]
0040153E mov    eax,DWORD PTR [ebp+0x10]
00401541 movss  xmm0,DWORD PTR [ebp+0xc]
00401546 shufps xmm0,xmm0,0x0
0040154A movaps xmm1,xmm0
0040154D mulps  xmm1,XMMWORD PTR [edx]
00401550 movaps XMMWORD PTR [eax],xmm1
00401553 movaps xmm1,xmm0
00401556 mulps  xmm1,XMMWORD PTR [edx+0x10]
0040155A movaps XMMWORD PTR [eax+0x10],xmm1
0040155E movaps xmm1,xmm0
00401561 mulps  xmm1,XMMWORD PTR [edx+0x20]
00401565 movaps XMMWORD PTR [eax+0x20],xmm1
00401569 mulps  xmm0,XMMWORD PTR [edx+0x30]
0040156D movaps XMMWORD PTR [eax+0x30],xmm0
00401571 leave
00401572 ret

Matrix Matrix Multiplication: 
This is an important function, as well as a large one, fortunately, it SIMD-izes pretty darn well.
MATHLIB_INLINE void matrix4x4_mul(const matrix4x4& matrix1, const matrix4x4& matrix2, matrix4x4& result)
/// Performs a matrix-matrix multiplication and stores the product in result.
/// Corresponds to the mathematical statement reuslt = matrix1 * matrix2
/// @warning The three matrices must be distinct, the result will be incorrect if result is the same object as matrix1 or matrix2.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	__m128 row0;
	__m128 row1;
	__m128 row2;
	__m128 row3;
	__m128 value0;

	// Compute row 0 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(3, 3, 3, 3));
	row0 	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(2, 2, 2, 2));
	row0	= _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(1, 1, 1, 1));
	row0	= _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(0, 0, 0, 0));
	row0	= _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row3, value0));

	// Compute row 1 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(3, 3, 3, 3));
	row1 	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(2, 2, 2, 2));
	row1	= _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(1, 1, 1, 1));
	row1	= _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(0, 0, 0, 0));
	row1	= _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row3, value0));

	// Compute row 2 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(3, 3, 3, 3));
	row2	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(2, 2, 2, 2));
	row2	= _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(1, 1, 1, 1));
	row2	= _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(0, 0, 0, 0));
	row2	= _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row3, value0));

	// Compute row 3 of the matrix product:
	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(3, 3, 3, 3));
	row3 	= _mm_mul_ps(matrix2.sse_row0, value0);

	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(2, 2, 2, 2));
	row3	= _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row1, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(1, 1, 1, 1));
	row3	= _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row2, value0));

	value0	= _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(0, 0, 0, 0));
	row3	= _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row3, value0));

	// Write results back to memory:
	result.sse_row0	= row0;
	result.sse_row1	= row1;
	result.sse_row2	= row2;
	result.sse_row3	= row3;

#endif // (MATHLIB_SSE)
#else

	result._03	= (matrix1._03 * matrix2._33) + (matrix1._02 * matrix2._23) + (matrix1._01 * matrix2._13) + (matrix1._00 * matrix2._03);
	result._02	= (matrix1._03 * matrix2._32) + (matrix1._02 * matrix2._22) + (matrix1._01 * matrix2._12) + (matrix1._00 * matrix2._02);
	result._01	= (matrix1._03 * matrix2._31) + (matrix1._02 * matrix2._21) + (matrix1._01 * matrix2._11) + (matrix1._00 * matrix2._01);
	result._00	= (matrix1._03 * matrix2._30) + (matrix1._02 * matrix2._20) + (matrix1._01 * matrix2._10) + (matrix1._00 * matrix2._00);

	result._13	= (matrix1._13 * matrix2._33) + (matrix1._12 * matrix2._23) + (matrix1._11 * matrix2._13) + (matrix1._10 * matrix2._03);
	result._12	= (matrix1._13 * matrix2._32) + (matrix1._12 * matrix2._22) + (matrix1._11 * matrix2._12) + (matrix1._10 * matrix2._02);
	result._11	= (matrix1._13 * matrix2._31) + (matrix1._12 * matrix2._21) + (matrix1._11 * matrix2._11) + (matrix1._10 * matrix2._01);
	result._10	= (matrix1._13 * matrix2._30) + (matrix1._12 * matrix2._20) + (matrix1._11 * matrix2._10) + (matrix1._10 * matrix2._00);

	result._23	= (matrix1._23 * matrix2._33) + (matrix1._22 * matrix2._23) + (matrix1._21 * matrix2._13) + (matrix1._20 * matrix2._03);
	result._22	= (matrix1._23 * matrix2._32) + (matrix1._22 * matrix2._22) + (matrix1._21 * matrix2._12) + (matrix1._20 * matrix2._02);
	result._21	= (matrix1._23 * matrix2._31) + (matrix1._22 * matrix2._21) + (matrix1._21 * matrix2._11) + (matrix1._20 * matrix2._01);
	result._20	= (matrix1._23 * matrix2._30) + (matrix1._22 * matrix2._20) + (matrix1._21 * matrix2._10) + (matrix1._20 * matrix2._00);

	result._33	= (matrix1._33 * matrix2._33) + (matrix1._32 * matrix2._23) + (matrix1._31 * matrix2._13) + (matrix1._30 * matrix2._03);
	result._32	= (matrix1._33 * matrix2._32) + (matrix1._32 * matrix2._22) + (matrix1._31 * matrix2._12) + (matrix1._30 * matrix2._02);
	result._31	= (matrix1._33 * matrix2._31) + (matrix1._32 * matrix2._21) + (matrix1._31 * matrix2._11) + (matrix1._30 * matrix2._01);
	result._30	= (matrix1._33 * matrix2._30) + (matrix1._32 * matrix2._20) + (matrix1._31 * matrix2._10) + (matrix1._30 * matrix2._00);

#endif // (MATHLIB_SIMD)
}

FPU
004019B7 push   ebp
004019B8 mov    ebp,esp
004019BA mov    edx,DWORD PTR [ebp+0x8]
004019BD mov    eax,DWORD PTR [ebp+0xc]
004019C0 mov    ecx,DWORD PTR [ebp+0x10]
004019C3 fld    DWORD PTR [edx]
004019C5 fmul   DWORD PTR [eax+0x30]
004019C8 fld    DWORD PTR [edx+0x4]
004019CB fld    st(0)
004019CD fmul   DWORD PTR [eax+0x20]
004019D0 faddp  st(2),st
004019D2 fld    DWORD PTR [edx+0x8]
004019D5 fld    st(0)
004019D7 fmul   DWORD PTR [eax+0x10]
004019DA faddp  st(3),st
004019DC fld    DWORD PTR [edx+0xc]
004019DF fld    st(0)
004019E1 fmul   DWORD PTR [eax]
004019E3 faddp  st(4),st
004019E5 fxch   st(3)
004019E7 fstp   DWORD PTR [ecx]
004019E9 fld    DWORD PTR [edx]
004019EB fld    st(0)
004019ED fmul   DWORD PTR [eax+0x34]
004019F0 fxch   st(3)
004019F2 fmul   DWORD PTR [eax+0x24]
004019F5 faddp  st(3),st
004019F7 fld    st(1)
004019F9 fmul   DWORD PTR [eax+0x14]
004019FC faddp  st(3),st
004019FE fld    st(3)
00401A00 fmul   DWORD PTR [eax+0x4]
00401A03 faddp  st(3),st
00401A05 fxch   st(2)
00401A07 fstp   DWORD PTR [ecx+0x4]
00401A0A fld    st(1)
00401A0C fmul   DWORD PTR [eax+0x38]
00401A0F fld    DWORD PTR [edx+0x4]
00401A12 fld    st(0)
00401A14 fmul   DWORD PTR [eax+0x28]
00401A17 faddp  st(2),st
00401A19 fxch   st(2)
00401A1B fmul   DWORD PTR [eax+0x18]
00401A1E faddp  st(1),st
00401A20 fld    st(3)
00401A22 fmul   DWORD PTR [eax+0x8]
00401A25 faddp  st(1),st
00401A27 fstp   DWORD PTR [ecx+0x8]
00401A2A fxch   st(1)
00401A2C fmul   DWORD PTR [eax+0x3c]
00401A2F fxch   st(1)
00401A31 fmul   DWORD PTR [eax+0x2c]
00401A34 faddp  st(1),st
00401A36 fld    DWORD PTR [edx+0x8]
00401A39 fmul   DWORD PTR [eax+0x1c]
00401A3C faddp  st(1),st
00401A3E fxch   st(1)
00401A40 fmul   DWORD PTR [eax+0xc]
00401A43 faddp  st(1),st
00401A45 fstp   DWORD PTR [ecx+0xc]
00401A48 fld    DWORD PTR [edx+0x10]
00401A4B fmul   DWORD PTR [eax+0x30]
00401A4E fld    DWORD PTR [edx+0x14]
00401A51 fld    st(0)
00401A53 fmul   DWORD PTR [eax+0x20]
00401A56 faddp  st(2),st
00401A58 fld    DWORD PTR [edx+0x18]
00401A5B fld    st(0)
00401A5D fmul   DWORD PTR [eax+0x10]
00401A60 faddp  st(3),st
00401A62 fld    DWORD PTR [edx+0x1c]
00401A65 fld    st(0)
00401A67 fmul   DWORD PTR [eax]
00401A69 faddp  st(4),st
00401A6B fxch   st(3)
00401A6D fstp   DWORD PTR [ecx+0x10]
00401A70 fld    DWORD PTR [edx+0x10]
00401A73 fld    st(0)
00401A75 fmul   DWORD PTR [eax+0x34]
00401A78 fxch   st(3)
00401A7A fmul   DWORD PTR [eax+0x24]
00401A7D faddp  st(3),st
00401A7F fld    st(1)
00401A81 fmul   DWORD PTR [eax+0x14]
00401A84 faddp  st(3),st
00401A86 fld    st(3)
00401A88 fmul   DWORD PTR [eax+0x4]
00401A8B faddp  st(3),st
00401A8D fxch   st(2)
00401A8F fstp   DWORD PTR [ecx+0x14]
00401A92 fld    st(1)
00401A94 fmul   DWORD PTR [eax+0x38]
00401A97 fld    DWORD PTR [edx+0x14]
00401A9A fld    st(0)
00401A9C fmul   DWORD PTR [eax+0x28]
00401A9F faddp  st(2),st
00401AA1 fxch   st(2)
00401AA3 fmul   DWORD PTR [eax+0x18]
00401AA6 faddp  st(1),st
00401AA8 fld    st(3)
00401AAA fmul   DWORD PTR [eax+0x8]
00401AAD faddp  st(1),st
00401AAF fstp   DWORD PTR [ecx+0x18]
00401AB2 fxch   st(1)
00401AB4 fmul   DWORD PTR [eax+0x3c]
00401AB7 fxch   st(1)
00401AB9 fmul   DWORD PTR [eax+0x2c]
00401ABC faddp  st(1),st
00401ABE fld    DWORD PTR [edx+0x18]
00401AC1 fmul   DWORD PTR [eax+0x1c]
00401AC4 faddp  st(1),st
00401AC6 fxch   st(1)
00401AC8 fmul   DWORD PTR [eax+0xc]
00401ACB faddp  st(1),st
00401ACD fstp   DWORD PTR [ecx+0x1c]
00401AD0 fld    DWORD PTR [edx+0x20]
00401AD3 fmul   DWORD PTR [eax+0x30]
00401AD6 fld    DWORD PTR [edx+0x24]
00401AD9 fld    st(0)
00401ADB fmul   DWORD PTR [eax+0x20]
00401ADE faddp  st(2),st
00401AE0 fld    DWORD PTR [edx+0x28]
00401AE3 fld    st(0)
00401AE5 fmul   DWORD PTR [eax+0x10]
00401AE8 faddp  st(3),st
00401AEA fld    DWORD PTR [edx+0x2c]
00401AED fld    st(0)
00401AEF fmul   DWORD PTR [eax]
00401AF1 faddp  st(4),st
00401AF3 fxch   st(3)
00401AF5 fstp   DWORD PTR [ecx+0x20]
00401AF8 fld    DWORD PTR [edx+0x20]
00401AFB fld    st(0)
00401AFD fmul   DWORD PTR [eax+0x34]
00401B00 fxch   st(3)
00401B02 fmul   DWORD PTR [eax+0x24]
00401B05 faddp  st(3),st
00401B07 fld    st(1)
00401B09 fmul   DWORD PTR [eax+0x14]
00401B0C faddp  st(3),st
00401B0E fld    st(3)
00401B10 fmul   DWORD PTR [eax+0x4]
00401B13 faddp  st(3),st
00401B15 fxch   st(2)
00401B17 fstp   DWORD PTR [ecx+0x24]
00401B1A fld    st(1)
00401B1C fmul   DWORD PTR [eax+0x38]
00401B1F fld    DWORD PTR [edx+0x24]
00401B22 fld    st(0)
00401B24 fmul   DWORD PTR [eax+0x28]
00401B27 faddp  st(2),st
00401B29 fxch   st(2)
00401B2B fmul   DWORD PTR [eax+0x18]
00401B2E faddp  st(1),st
00401B30 fld    st(3)
00401B32 fmul   DWORD PTR [eax+0x8]
00401B35 faddp  st(1),st
00401B37 fstp   DWORD PTR [ecx+0x28]
00401B3A fxch   st(1)
00401B3C fmul   DWORD PTR [eax+0x3c]
00401B3F fxch   st(1)
00401B41 fmul   DWORD PTR [eax+0x2c]
00401B44 faddp  st(1),st
00401B46 fld    DWORD PTR [edx+0x28]
00401B49 fmul   DWORD PTR [eax+0x1c]
00401B4C faddp  st(1),st
00401B4E fxch   st(1)
00401B50 fmul   DWORD PTR [eax+0xc]
00401B53 faddp  st(1),st
00401B55 fstp   DWORD PTR [ecx+0x2c]
00401B58 fld    DWORD PTR [edx+0x30]
00401B5B fmul   DWORD PTR [eax+0x30]
00401B5E fld    DWORD PTR [edx+0x34]
00401B61 fld    st(0)
00401B63 fmul   DWORD PTR [eax+0x20]
00401B66 faddp  st(2),st
00401B68 fld    DWORD PTR [edx+0x38]
00401B6B fld    st(0)
00401B6D fmul   DWORD PTR [eax+0x10]
00401B70 faddp  st(3),st
00401B72 fld    DWORD PTR [edx+0x3c]
00401B75 fld    st(0)
00401B77 fmul   DWORD PTR [eax]
00401B79 faddp  st(4),st
00401B7B fxch   st(3)
00401B7D fstp   DWORD PTR [ecx+0x30]
00401B80 fld    DWORD PTR [edx+0x30]
00401B83 fld    st(0)
00401B85 fmul   DWORD PTR [eax+0x34]
00401B88 fxch   st(3)
00401B8A fmul   DWORD PTR [eax+0x24]
00401B8D faddp  st(3),st
00401B8F fld    st(1)
00401B91 fmul   DWORD PTR [eax+0x14]
00401B94 faddp  st(3),st
00401B96 fld    st(3)
00401B98 fmul   DWORD PTR [eax+0x4]
00401B9B faddp  st(3),st
00401B9D fxch   st(2)
00401B9F fstp   DWORD PTR [ecx+0x34]
00401BA2 fld    st(1)
00401BA4 fmul   DWORD PTR [eax+0x38]
00401BA7 fld    DWORD PTR [edx+0x34]
00401BAA fld    st(0)
00401BAC fmul   DWORD PTR [eax+0x28]
00401BAF faddp  st(2),st
00401BB1 fxch   st(2)
00401BB3 fmul   DWORD PTR [eax+0x18]
00401BB6 faddp  st(1),st
00401BB8 fld    st(3)
00401BBA fmul   DWORD PTR [eax+0x8]
00401BBD faddp  st(1),st
00401BBF fstp   DWORD PTR [ecx+0x38]
00401BC2 fxch   st(1)
00401BC4 fmul   DWORD PTR [eax+0x3c]
00401BC7 fxch   st(1)
00401BC9 fmul   DWORD PTR [eax+0x2c]
00401BCC faddp  st(1),st
00401BCE fld    DWORD PTR [edx+0x38]
00401BD1 fmul   DWORD PTR [eax+0x1c]
00401BD4 faddp  st(1),st
00401BD6 fxch   st(1)
00401BD8 fmul   DWORD PTR [eax+0xc]
00401BDB faddp  st(1),st
00401BDD fstp   DWORD PTR [ecx+0x3c]
00401BE0 leave
00401BE1 ret

SSE
0041DCDC push   ebp
0041DCDD mov    ebp,esp
0041DCDF mov    edx,DWORD PTR [ebp+0x8]
0041DCE2 mov    ecx,DWORD PTR [ebp+0xc]
0041DCE5 mov    eax,DWORD PTR [ebp+0x10]
0041DCE8 movaps xmm0,XMMWORD PTR [edx]
0041DCEB movaps xmm6,xmm0
0041DCEE shufps xmm6,xmm0,0xff
0041DCF2 movaps xmm5,XMMWORD PTR [ecx]
0041DCF5 mulps  xmm6,xmm5
0041DCF8 movaps xmm1,xmm0
0041DCFB shufps xmm1,xmm0,0xaa
0041DCFF movaps xmm4,XMMWORD PTR [ecx+0x10]
0041DD03 mulps  xmm1,xmm4
0041DD06 addps  xmm6,xmm1
0041DD09 movaps xmm1,xmm0
0041DD0C shufps xmm1,xmm0,0x55
0041DD10 movaps xmm3,XMMWORD PTR [ecx+0x20]
0041DD14 mulps  xmm1,xmm3
0041DD17 addps  xmm6,xmm1
0041DD1A shufps xmm0,xmm0,0x0
0041DD1E movaps xmm2,XMMWORD PTR [ecx+0x30]
0041DD22 mulps  xmm0,xmm2
0041DD25 addps  xmm6,xmm0
0041DD28 movaps xmm0,XMMWORD PTR [edx+0x10]
0041DD2C movaps xmm7,xmm0
0041DD2F shufps xmm7,xmm0,0xff
0041DD33 mulps  xmm7,xmm5
0041DD36 movaps xmm1,xmm0
0041DD39 shufps xmm1,xmm0,0xaa
0041DD3D mulps  xmm1,xmm4
0041DD40 addps  xmm1,xmm7
0041DD43 movaps xmm7,xmm0
0041DD46 shufps xmm7,xmm0,0x55
0041DD4A mulps  xmm7,xmm3
0041DD4D addps  xmm1,xmm7
0041DD50 shufps xmm0,xmm0,0x0
0041DD54 mulps  xmm0,xmm2
0041DD57 addps  xmm1,xmm0
0041DD5A movaps XMMWORD PTR [eax+0x10],xmm1
0041DD5E movaps xmm1,XMMWORD PTR [edx+0x20]
0041DD62 movaps xmm7,xmm1
0041DD65 shufps xmm7,xmm1,0xff
0041DD69 mulps  xmm7,xmm5
0041DD6C movaps xmm0,xmm1
0041DD6F shufps xmm0,xmm1,0xaa
0041DD73 mulps  xmm0,xmm4
0041DD76 addps  xmm7,xmm0
0041DD79 movaps xmm0,xmm1
0041DD7C shufps xmm0,xmm1,0x55
0041DD80 mulps  xmm0,xmm3
0041DD83 addps  xmm0,xmm7
0041DD86 shufps xmm1,xmm1,0x0
0041DD8A mulps  xmm1,xmm2
0041DD8D addps  xmm1,xmm0
0041DD90 movaps xmm0,XMMWORD PTR [edx+0x30]
0041DD94 movaps xmm7,xmm0
0041DD97 shufps xmm7,xmm0,0xff
0041DD9B mulps  xmm5,xmm7
0041DD9E movaps xmm7,xmm0
0041DDA1 shufps xmm7,xmm0,0xaa
0041DDA5 mulps  xmm4,xmm7
0041DDA8 addps  xmm5,xmm4
0041DDAB movaps xmm4,xmm0
0041DDAE shufps xmm4,xmm0,0x55
0041DDB2 mulps  xmm3,xmm4
0041DDB5 addps  xmm5,xmm3
0041DDB8 shufps xmm0,xmm0,0x0
0041DDBC mulps  xmm2,xmm0
0041DDBF addps  xmm2,xmm5
0041DDC2 movaps XMMWORD PTR [eax],xmm6
0041DDC5 movaps XMMWORD PTR [eax+0x20],xmm1
0041DDC9 movaps XMMWORD PTR [eax+0x30],xmm2
0041DDCD leave
0041DDCE ret
Two things, yes, the SSE code is smaller, but at this point both functions are too large to be inlined by the compiler. Although, interestingly enough, my compiler decided that my FPU code could stay inlined (???)... anyone have an idea why? Regardless, I left the MATHLIB_INLINE macro there, with the hope that the compiler knows best somehow.

 Matrix Vector Multiplication [Single]:
MATHLIB_INLINE void matrix4x4_vectorMul(const matrix4x4& matrix, const vector4& vector, vector4& result)
/// Performs an multiplication of the given vector (column matrix) by the given matrix.
/// This method corresponds to the mathematical statement result = matrix * vector i.e the vector is a column vector.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	__m128 m0	= _mm_mul_ps(matrix.sse_row0, vector.sse_vec);
	__m128 m1	= _mm_mul_ps(matrix.sse_row1, vector.sse_vec);
	__m128 m2	= _mm_mul_ps(matrix.sse_row2, vector.sse_vec);
	__m128 m3	= _mm_mul_ps(matrix.sse_row3, vector.sse_vec);

	m0	= _mm_hadd_ps(m1, m0);
	m2	= _mm_hadd_ps(m3, m2);
	m0	= _mm_hadd_ps(m2, m0);

	result.sse_vec	= m0;

#endif // (MATHLIB_SSE)
#else

	float vectorX	= vector.extractX();
	float vectorY	= vector.extractY();
	float vectorZ	= vector.extractZ();
	float vectorW	= vector.extractW();

	result.setXYZW((matrix._03 * vectorW) + (matrix._02 * vectorZ) + (matrix._01 * vectorY) + (matrix._00 * vectorX),
				   (matrix._13 * vectorW) + (matrix._12 * vectorZ) + (matrix._11 * vectorY) + (matrix._10 * vectorX),
				   (matrix._23 * vectorW) + (matrix._22 * vectorZ) + (matrix._21 * vectorY) + (matrix._20 * vectorX),
				   (matrix._33 * vectorW) + (matrix._32 * vectorZ) + (matrix._31 * vectorY) + (matrix._30 * vectorX));

#endif // (MATHLIB_SIMD)
}

FPU
00401C30	push   ebp
00401C31	mov    ebp,esp
00401C33	mov    eax,DWORD PTR [ebp+0x8]
00401C36	mov    ecx,DWORD PTR [ebp+0xc]
00401C39	mov    edx,DWORD PTR [ebp+0x10]
00401C3C	fld    DWORD PTR [ecx+0xc]
00401C3F	fld    DWORD PTR [ecx+0x8]
00401C42	fld    DWORD PTR [ecx+0x4]
00401C45	fld    DWORD PTR [ecx]
00401C47	fld    st(0)
00401C49	fmul   DWORD PTR [eax+0x30]
00401C4C	fld    st(2)
00401C4E	fmul   DWORD PTR [eax+0x34]
00401C51	faddp  st(1),st
00401C53	fld    st(3)
00401C55	fmul   DWORD PTR [eax+0x38]
00401C58	faddp  st(1),st
00401C5A	fld    st(4)
00401C5C	fmul   DWORD PTR [eax+0x3c]
00401C5F	faddp  st(1),st
00401C61	fld    st(1)
00401C63	fmul   DWORD PTR [eax+0x20]
00401C66	fld    st(3)
00401C68	fmul   DWORD PTR [eax+0x24]
00401C6B	faddp  st(1),st
00401C6D	fld    st(4)
00401C6F	fmul   DWORD PTR [eax+0x28]
00401C72	faddp  st(1),st
00401C74	fld    st(5)
00401C76	fmul   DWORD PTR [eax+0x2c]
00401C79	faddp  st(1),st
00401C7B	fld    st(2)
00401C7D	fmul   DWORD PTR [eax+0x10]
00401C80	fld    st(4)
00401C82	fmul   DWORD PTR [eax+0x14]
00401C85	faddp  st(1),st
00401C87	fld    st(5)
00401C89	fmul   DWORD PTR [eax+0x18]
00401C8C	faddp  st(1),st
00401C8E	fld    st(6)
00401C90	fmul   DWORD PTR [eax+0x1c]
00401C93	faddp  st(1),st
00401C95	fxch   st(3)
00401C97	fmul   DWORD PTR [eax]
00401C99	fxch   st(4)
00401C9B	fmul   DWORD PTR [eax+0x4]
00401C9E	faddp  st(4),st
00401CA0	fxch   st(4)
00401CA2	fmul   DWORD PTR [eax+0x8]
00401CA5	faddp  st(3),st
00401CA7	fxch   st(4)
00401CA9	fmul   DWORD PTR [eax+0xc]
00401CAC	faddp  st(2),st
00401CAE	fxch   st(1)
00401CB0	fstp   DWORD PTR [edx+0xc]
00401CB3	fstp   DWORD PTR [edx+0x8]
00401CB6	fstp   DWORD PTR [edx+0x4]
00401CB9	fstp   DWORD PTR [edx]
00401CBB	leave
00401CBC	ret

SSE
004014A8	push   ebp
004014A9	mov    ebp,esp
004014AB	mov    eax,DWORD PTR [ebp+0x8]
004014AE	mov    edx,DWORD PTR [ebp+0xc]
004014B1	movaps xmm1,XMMWORD PTR [edx]
004014B4	movaps xmm3,xmm1
004014B7	mulps  xmm3,XMMWORD PTR [eax]
004014BA	movaps xmm0,xmm1
004014BD	mulps  xmm0,XMMWORD PTR [eax+0x10]
004014C1	movaps xmm2,xmm1
004014C4	mulps  xmm2,XMMWORD PTR [eax+0x20]
004014C8	mulps  xmm1,XMMWORD PTR [eax+0x30]
004014CC	haddps xmm0,xmm3
004014D0	haddps xmm1,xmm2
004014D4	haddps xmm1,xmm0
004014D8	mov    eax,DWORD PTR [ebp+0x10]
004014DB	movaps XMMWORD PTR [eax],xmm1
004014DE	leave
004014DF	ret

 Matrix Vector Multiplication [Batch]: 
MATHLIB_INLINE void matrix4x4_vectorBatchMul(const matrix4x4& matrix, vector4 const * const vectorArray, uint32_t numVectors, vector4* const resultArray)
/// Performs a batch matrix vector multiplication (column matrix).
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)

	// Load matrix:
	__m128 matrix_row0	= matrix.sse_row0;
	__m128 matrix_row1	= matrix.sse_row1;
	__m128 matrix_row2	= matrix.sse_row2;
	__m128 matrix_row3	= matrix.sse_row3;

	for (uint32_t i = 0; i < numVectors; i++)
	{
		__m128 m0	= _mm_mul_ps(matrix_row0, vectorArray[i].sse_vec);
		__m128 m1	= _mm_mul_ps(matrix_row1, vectorArray[i].sse_vec);
		__m128 m2	= _mm_mul_ps(matrix_row2, vectorArray[i].sse_vec);
		__m128 m3	= _mm_mul_ps(matrix_row3, vectorArray[i].sse_vec);

		m0	= _mm_hadd_ps(m1, m0);
		m2	= _mm_hadd_ps(m3, m2);
		m0	= _mm_hadd_ps(m2, m0);

		resultArray[i].sse_vec	= m0;
	}

#endif // (MATHLIB_SSE)
#else

	for (uint32_t i = 0; i < numVectors; i++)
	{
		float vectorX	= vectorArray[i].extractX();
		float vectorY	= vectorArray[i].extractY();
		float vectorZ	= vectorArray[i].extractZ();
		float vectorW	= vectorArray[i].extractW();

		resultArray[i].setXYZW((matrix._03 * vectorW) + (matrix._02 * vectorZ) + (matrix._01 * vectorY) + (matrix._00 * vectorX),
							   (matrix._13 * vectorW) + (matrix._12 * vectorZ) + (matrix._11 * vectorY) + (matrix._10 * vectorX),
							   (matrix._23 * vectorW) + (matrix._22 * vectorZ) + (matrix._21 * vectorY) + (matrix._20 * vectorX),
							   (matrix._33 * vectorW) + (matrix._32 * vectorZ) + (matrix._31 * vectorY) + (matrix._30 * vectorX));
	}

#endif // (MATHLIB_SIMD)
}

FPU
00401CBD	push   ebp
00401CBE	mov    ebp,esp
00401CC0	push   edi
00401CC1	push   esi
00401CC2	push   ebx
00401CC3	mov    eax,DWORD PTR [ebp+0x8]
00401CC6	mov    ecx,DWORD PTR [ebp+0xc]
00401CC9	mov    edi,DWORD PTR [ebp+0x10]
00401CCC	mov    ebx,DWORD PTR [ebp+0x14]
00401CCF	test   edi,edi
00401CD1	je     0x401d74 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+183>
00401CD7	mov    edx,0x0
00401CDC	mov    esi,0x0
00401CE1	fld    DWORD PTR [ecx+edx*1+0xc]
00401CE5	fld    DWORD PTR [ecx+edx*1+0x8]
00401CE9	fld    DWORD PTR [ecx+edx*1+0x4]
00401CED	fld    DWORD PTR [ecx+edx*1]
00401CF0	fld    st(0)
00401CF2	fmul   DWORD PTR [eax+0x30]
00401CF5	fld    st(2)
00401CF7	fmul   DWORD PTR [eax+0x34]
00401CFA	faddp  st(1),st
00401CFC	fld    st(3)
00401CFE	fmul   DWORD PTR [eax+0x38]
00401D01	faddp  st(1),st
00401D03	fld    st(4)
00401D05	fmul   DWORD PTR [eax+0x3c]
00401D08	faddp  st(1),st
00401D0A	fld    st(1)
00401D0C	fmul   DWORD PTR [eax+0x20]
00401D0F	fld    st(3)
00401D11	fmul   DWORD PTR [eax+0x24]
00401D14	faddp  st(1),st
00401D16	fld    st(4)
00401D18	fmul   DWORD PTR [eax+0x28]
00401D1B	faddp  st(1),st
00401D1D	fld    st(5)
00401D1F	fmul   DWORD PTR [eax+0x2c]
00401D22	faddp  st(1),st
00401D24	fld    st(2)
00401D26	fmul   DWORD PTR [eax+0x10]
00401D29	fld    st(4)
00401D2B	fmul   DWORD PTR [eax+0x14]
00401D2E	faddp  st(1),st
00401D30	fld    st(5)
00401D32	fmul   DWORD PTR [eax+0x18]
00401D35	faddp  st(1),st
00401D37	fld    st(6)
00401D39	fmul   DWORD PTR [eax+0x1c]
00401D3C	faddp  st(1),st
00401D3E	fxch   st(3)
00401D40	fmul   DWORD PTR [eax]
00401D42	fxch   st(4)
00401D44	fmul   DWORD PTR [eax+0x4]
00401D47	faddp  st(4),st
00401D49	fxch   st(4)
00401D4B	fmul   DWORD PTR [eax+0x8]
00401D4E	faddp  st(3),st
00401D50	fxch   st(4)
00401D52	fmul   DWORD PTR [eax+0xc]
00401D55	faddp  st(2),st
00401D57	fxch   st(1)
00401D59	fstp   DWORD PTR [ebx+edx*1+0xc]
00401D5D	fstp   DWORD PTR [ebx+edx*1+0x8]
00401D61	fstp   DWORD PTR [ebx+edx*1+0x4]
00401D65	fstp   DWORD PTR [ebx+edx*1]
00401D68	inc    esi
00401D69	add    edx,0x10
00401D6C	cmp    edi,esi
00401D6E	ja     0x401ce1 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+36>
00401D74	pop    ebx
00401D75	pop    esi
00401D76	pop    edi
00401D77	leave
00401D78	ret

SSE
004014A8	push   ebp
004014A9	mov    ebp,esp
004014AB	push   ebx
004014AC	mov    eax,DWORD PTR [ebp+0x8]
004014AF	mov    ebx,DWORD PTR [ebp+0x10]
004014B2	movaps xmm3,XMMWORD PTR [eax]
004014B5	movaps xmm4,XMMWORD PTR [eax+0x10]
004014B9	movaps xmm5,XMMWORD PTR [eax+0x20]
004014BD	movaps xmm6,XMMWORD PTR [eax+0x30]
004014C1	test   ebx,ebx
004014C3	je     0x401502 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+90>
004014C5	mov    ecx,DWORD PTR [ebp+0xc]
004014C8	mov    edx,DWORD PTR [ebp+0x14]
004014CB	mov    eax,0x0
004014D0	movaps xmm1,XMMWORD PTR [ecx]
004014D3	movaps xmm7,xmm3
004014D6	mulps  xmm7,xmm1
004014D9	movaps xmm0,xmm4
004014DC	mulps  xmm0,xmm1
004014DF	movaps xmm2,xmm5
004014E2	mulps  xmm2,xmm1
004014E5	mulps  xmm1,xmm6
004014E8	haddps xmm0,xmm7
004014EC	haddps xmm1,xmm2
004014F0	haddps xmm1,xmm0
004014F4	movaps XMMWORD PTR [edx],xmm1
004014F7	inc    eax
004014F8	add    ecx,0x10
004014FB	add    edx,0x10
004014FE	cmp    ebx,eax
00401500	ja     0x4014d0 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+40>
00401502	pop    ebx
00401503	leave
00401504	ret

Benchmarks
Let's look at the performance numbers, the testing methodology is pretty much exactly the same as I had with the vectors (you can take a look at the source code, it's all there). I'm just going to post the text output from the program itself (the screenshot approach takes to long). Again, we're measuring single threaded code, it's an architecture, clock speed, and memory bandwidth game.

The first machine I ran the test on was was equipped with a Pentium E5200, overclocked to 2.75GHz:
MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.099916]
Subtraction benchmark           : average time (msecs) [0.089407]
Transposition benchmark         : average time (msecs) [0.049061]
Scalar mul benchmark            : average time (msecs) [0.057214]
Matrix mul benchmark            : average time (msecs) [0.169970]
Vector mul benchmark            : average time (msecs) [0.049327]
Vector batch mul benchmark      : average time (msecs) [0.036488]

Process returned 0 (0x0)   execution time : 55.196 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.152721]
Subtraction benchmark           : average time (msecs) [0.151200]
Transposition benchmark         : average time (msecs) [0.098960]
Scalar mul benchmark            : average time (msecs) [0.121864]
Matrix mul benchmark            : average time (msecs) [0.470418]
Vector mul benchmark            : average time (msecs) [0.117321]
Vector batch mul benchmark      : average time (msecs) [0.118026]

Process returned 0 (0x0)   execution time : 123.101 s
Press any key to continue.

The second machine was my good old (really old) laptop, equipped with a Pentium Dual T3400:
MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.623972]
Subtraction benchmark           : average time (msecs) [0.620934]
Transposition benchmark         : average time (msecs) [0.070971]
Scalar mul benchmark            : average time (msecs) [0.423544]
Matrix mul benchmark            : average time (msecs) [0.641712]
Vector mul benchmark            : average time (msecs) [0.108508]
Vector batch mul benchmark      : average time (msecs) [0.063231]

Process returned 0 (0x0)   execution time : 256.348 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.634588]
Subtraction benchmark           : average time (msecs) [0.632229]
Transposition benchmark         : average time (msecs) [0.097166]
Scalar mul benchmark            : average time (msecs) [0.439894]
Matrix mul benchmark            : average time (msecs) [0.676615]
Vector mul benchmark            : average time (msecs) [0.173299]
Vector batch mul benchmark      : average time (msecs) [0.152087]

Process returned 0 (0x0)   execution time : 281.617 s
Press any key to continue.


So these results are pretty interesting, it appears that the calculations become memory bandwidth limited... the SSE code is always ahead, but it's not far ahead. The only really notable speedups are the Vector mul and Vector batch mul results.

The third machine was a Core 2 Quad Q8400:

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.089563]
Subtraction benchmark           : average time (msecs) [0.090273]
Transposition benchmark         : average time (msecs) [0.050641]
Scalar mul benchmark            : average time (msecs) [0.060286]
Matrix mul benchmark            : average time (msecs) [0.172725]
Vector mul benchmark            : average time (msecs) [0.051047]
Vector batch mul benchmark      : average time (msecs) [0.037580]

Process returned 0 (0x0)   execution time : 55.302 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].

Running benchmark!
Matrix array sizes      : 10000
Number of iterations    : 100000
----------------------------------------
Addition benchmark              : average time (msecs) [0.157504]
Subtraction benchmark           : average time (msecs) [0.158742]
Transposition benchmark         : average time (msecs) [0.102287]
Scalar mul benchmark            : average time (msecs) [0.126073]
Matrix mul benchmark            : average time (msecs) [0.484266]
Vector mul benchmark            : average time (msecs) [0.121134]
Vector batch mul benchmark      : average time (msecs) [0.121808]

Process returned 0 (0x0)   execution time : 127.281 s
Press any key to continue.

Conclusion
In line with the vectors, we can see a comfortable 2x speedup (laptop results notwithstanding) from the use of SSE intrinsics. The batch multiplication and matrix-matrix multiplication receive around 3x speedups, which is promising.

The source code for this article can be downloaded here:
https://hotfile.com/dl/167284774/730f206/MathLib.rar.html