All right everyone, it's time to look at matrices. But first, a slight digression: a story, if you will.
As I sat down to continue writing the SSE code one night, a question popped into my head, "Why the hell am I trying to optimize this function? I mean, I've used this particular operation maybe once or twice, and never where the performance actually mattered...". And there's a very important point there.
I am a terribly lazy programmer.
But more than that, it's okay that I am, because it's completely irrelevant from a performance perspective whether a function that accounts for maybe 0.01% of my run-time cost is optimized. Now of course there are probably a fair few people who may have need for the functions I haven't optimized (pretty much only the matrix determinant and inverse functions), and my apologies for that. If I ever have some free time on my hands I'll probably go through and implement them. But there are cooler, more relevant things to dedicate my time to right now (you'll see in future articles ;) ).
Well, if I'm only going to optimize the things that truly matter (to me), then what did I optimize exactly?
There are two main types of operation that I focused on:
- Matrix-matrix multiplication.
- Matrix-vector multiplication (i.e. vector transformation).
Of these, the second is of particular importance if you're writing a software pipeline for vertex processing and the like. To that end I wrote a batch multiply, which transforms an entire array of vectors as efficiently as possible.
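The batch idea is simple enough to sketch in scalar code (hypothetical names here, not the library's actual API): read the matrix once, then stream through the vector array. Hoisting the matrix out of the loop is exactly what the SSE version below does by holding the four rows in registers.

```cpp
#include <cstddef>

// Apply one row-major 4x4 matrix to n packed 4-float vectors.
// The matrix is read once; the loop body only touches the streams.
void batch_transform(const float m[16], const float* in, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        const float* v = in + 4 * i;   // i-th input vector
        float* r = out + 4 * i;        // i-th output vector
        for (int row = 0; row < 4; ++row) {
            r[row] = m[4 * row + 0] * v[0] + m[4 * row + 1] * v[1]
                   + m[4 * row + 2] * v[2] + m[4 * row + 3] * v[3];
        }
    }
}
```

The SSE version replaces the inner loop with four multiplies and a horizontal reduction, but the structure is the same.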
Now that that's out of the way, let's get to the code comparisons!
Matrix Addition:
MATHLIB_INLINE void matrix4x4_add(const matrix4x4& matrix1, const matrix4x4& matrix2, matrix4x4& result)
/// Performs a matrix addition and stores the matrix sum in result.
/// Corresponds to the mathematical statement result = matrix1 + matrix2
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)
    result.sse_row0 = _mm_add_ps(matrix1.sse_row0, matrix2.sse_row0);
    result.sse_row1 = _mm_add_ps(matrix1.sse_row1, matrix2.sse_row1);
    result.sse_row2 = _mm_add_ps(matrix1.sse_row2, matrix2.sse_row2);
    result.sse_row3 = _mm_add_ps(matrix1.sse_row3, matrix2.sse_row3);
#endif // (MATHLIB_SSE)
#else
    result._03 = matrix1._03 + matrix2._03;
    result._02 = matrix1._02 + matrix2._02;
    result._01 = matrix1._01 + matrix2._01;
    result._00 = matrix1._00 + matrix2._00;
    result._13 = matrix1._13 + matrix2._13;
    result._12 = matrix1._12 + matrix2._12;
    result._11 = matrix1._11 + matrix2._11;
    result._10 = matrix1._10 + matrix2._10;
    result._23 = matrix1._23 + matrix2._23;
    result._22 = matrix1._22 + matrix2._22;
    result._21 = matrix1._21 + matrix2._21;
    result._20 = matrix1._20 + matrix2._20;
    result._33 = matrix1._33 + matrix2._33;
    result._32 = matrix1._32 + matrix2._32;
    result._31 = matrix1._31 + matrix2._31;
    result._30 = matrix1._30 + matrix2._30;
#endif // (MATHLIB_SIMD)
}
Here's the assembly output:
FPU
004017EA push ebp
004017EB mov ebp,esp
004017ED mov ecx,DWORD PTR [ebp+0x8]
004017F0 mov edx,DWORD PTR [ebp+0xc]
004017F3 mov eax,DWORD PTR [ebp+0x10]
004017F6 fld DWORD PTR [ecx]
004017F8 fadd DWORD PTR [edx]
004017FA fstp DWORD PTR [eax]
004017FC fld DWORD PTR [ecx+0x4]
004017FF fadd DWORD PTR [edx+0x4]
00401802 fstp DWORD PTR [eax+0x4]
00401805 fld DWORD PTR [ecx+0x8]
00401808 fadd DWORD PTR [edx+0x8]
0040180B fstp DWORD PTR [eax+0x8]
0040180E fld DWORD PTR [ecx+0xc]
00401811 fadd DWORD PTR [edx+0xc]
00401814 fstp DWORD PTR [eax+0xc]
00401817 fld DWORD PTR [ecx+0x10]
0040181A fadd DWORD PTR [edx+0x10]
0040181D fstp DWORD PTR [eax+0x10]
00401820 fld DWORD PTR [ecx+0x14]
00401823 fadd DWORD PTR [edx+0x14]
00401826 fstp DWORD PTR [eax+0x14]
00401829 fld DWORD PTR [ecx+0x18]
0040182C fadd DWORD PTR [edx+0x18]
0040182F fstp DWORD PTR [eax+0x18]
00401832 fld DWORD PTR [ecx+0x1c]
00401835 fadd DWORD PTR [edx+0x1c]
00401838 fstp DWORD PTR [eax+0x1c]
0040183B fld DWORD PTR [ecx+0x20]
0040183E fadd DWORD PTR [edx+0x20]
00401841 fstp DWORD PTR [eax+0x20]
00401844 fld DWORD PTR [ecx+0x24]
00401847 fadd DWORD PTR [edx+0x24]
0040184A fstp DWORD PTR [eax+0x24]
0040184D fld DWORD PTR [ecx+0x28]
00401850 fadd DWORD PTR [edx+0x28]
00401853 fstp DWORD PTR [eax+0x28]
00401856 fld DWORD PTR [ecx+0x2c]
00401859 fadd DWORD PTR [edx+0x2c]
0040185C fstp DWORD PTR [eax+0x2c]
0040185F fld DWORD PTR [ecx+0x30]
00401862 fadd DWORD PTR [edx+0x30]
00401865 fstp DWORD PTR [eax+0x30]
00401868 fld DWORD PTR [ecx+0x34]
0040186B fadd DWORD PTR [edx+0x34]
0040186E fstp DWORD PTR [eax+0x34]
00401871 fld DWORD PTR [ecx+0x38]
00401874 fadd DWORD PTR [edx+0x38]
00401877 fstp DWORD PTR [eax+0x38]
0040187A fld DWORD PTR [ecx+0x3c]
0040187D fadd DWORD PTR [edx+0x3c]
00401880 fstp DWORD PTR [eax+0x3c]
00401883 pop ebp
00401884 ret
SSE
004016AB push ebp
004016AC mov ebp,esp
004016AE mov ecx,DWORD PTR [ebp+0x8]
004016B1 mov edx,DWORD PTR [ebp+0xc]
004016B4 mov eax,DWORD PTR [ebp+0x10]
004016B7 movaps xmm0,XMMWORD PTR [ecx]
004016BA addps xmm0,XMMWORD PTR [edx]
004016BD movaps XMMWORD PTR [eax],xmm0
004016C0 movaps xmm0,XMMWORD PTR [ecx+0x10]
004016C4 addps xmm0,XMMWORD PTR [edx+0x10]
004016C8 movaps XMMWORD PTR [eax+0x10],xmm0
004016CC movaps xmm0,XMMWORD PTR [ecx+0x20]
004016D0 addps xmm0,XMMWORD PTR [edx+0x20]
004016D4 movaps XMMWORD PTR [eax+0x20],xmm0
004016D8 movaps xmm0,XMMWORD PTR [ecx+0x30]
004016DC addps xmm0,XMMWORD PTR [edx+0x30]
004016E0 movaps XMMWORD PTR [eax+0x30],xmm0
004016E4 leave
004016E5 ret
Matrix Scalar Multiplication:
MATHLIB_INLINE void matrix4x4_scalarMul(const matrix4x4& matrix, float scalar, matrix4x4& result)
/// Performs a matrix-scalar multiplication and stores the scaled matrix in result.
/// Corresponds to the mathematical statement result = matrix * scalar
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)
    __m128 scaleVec = _mm_load1_ps(&scalar);

    result.sse_row0 = _mm_mul_ps(matrix.sse_row0, scaleVec);
    result.sse_row1 = _mm_mul_ps(matrix.sse_row1, scaleVec);
    result.sse_row2 = _mm_mul_ps(matrix.sse_row2, scaleVec);
    result.sse_row3 = _mm_mul_ps(matrix.sse_row3, scaleVec);
#endif // (MATHLIB_SSE)
#else
    result._03 = matrix._03 * scalar;
    result._02 = matrix._02 * scalar;
    result._01 = matrix._01 * scalar;
    result._00 = matrix._00 * scalar;
    result._13 = matrix._13 * scalar;
    result._12 = matrix._12 * scalar;
    result._11 = matrix._11 * scalar;
    result._10 = matrix._10 * scalar;
    result._23 = matrix._23 * scalar;
    result._22 = matrix._22 * scalar;
    result._21 = matrix._21 * scalar;
    result._20 = matrix._20 * scalar;
    result._33 = matrix._33 * scalar;
    result._32 = matrix._32 * scalar;
    result._31 = matrix._31 * scalar;
    result._30 = matrix._30 * scalar;
#endif // (MATHLIB_SIMD)
}
FPU
0040192D push ebp
0040192E mov ebp,esp
00401930 mov edx,DWORD PTR [ebp+0x8]
00401933 fld DWORD PTR [ebp+0xc]
00401936 mov eax,DWORD PTR [ebp+0x10]
00401939 fld st(0)
0040193B fmul DWORD PTR [edx]
0040193D fstp DWORD PTR [eax]
0040193F fld st(0)
00401941 fmul DWORD PTR [edx+0x4]
00401944 fstp DWORD PTR [eax+0x4]
00401947 fld st(0)
00401949 fmul DWORD PTR [edx+0x8]
0040194C fstp DWORD PTR [eax+0x8]
0040194F fld st(0)
00401951 fmul DWORD PTR [edx+0xc]
00401954 fstp DWORD PTR [eax+0xc]
00401957 fld st(0)
00401959 fmul DWORD PTR [edx+0x10]
0040195C fstp DWORD PTR [eax+0x10]
0040195F fld st(0)
00401961 fmul DWORD PTR [edx+0x14]
00401964 fstp DWORD PTR [eax+0x14]
00401967 fld st(0)
00401969 fmul DWORD PTR [edx+0x18]
0040196C fstp DWORD PTR [eax+0x18]
0040196F fld st(0)
00401971 fmul DWORD PTR [edx+0x1c]
00401974 fstp DWORD PTR [eax+0x1c]
00401977 fld st(0)
00401979 fmul DWORD PTR [edx+0x20]
0040197C fstp DWORD PTR [eax+0x20]
0040197F fld st(0)
00401981 fmul DWORD PTR [edx+0x24]
00401984 fstp DWORD PTR [eax+0x24]
00401987 fld st(0)
00401989 fmul DWORD PTR [edx+0x28]
0040198C fstp DWORD PTR [eax+0x28]
0040198F fld st(0)
00401991 fmul DWORD PTR [edx+0x2c]
00401994 fstp DWORD PTR [eax+0x2c]
00401997 fld st(0)
00401999 fmul DWORD PTR [edx+0x30]
0040199C fstp DWORD PTR [eax+0x30]
0040199F fld st(0)
004019A1 fmul DWORD PTR [edx+0x34]
004019A4 fstp DWORD PTR [eax+0x34]
004019A7 fld st(0)
004019A9 fmul DWORD PTR [edx+0x38]
004019AC fstp DWORD PTR [eax+0x38]
004019AF fmul DWORD PTR [edx+0x3c]
004019B2 fstp DWORD PTR [eax+0x3c]
004019B5 leave
004019B6 ret
SSE
00401538 push ebp
00401539 mov ebp,esp
0040153B mov edx,DWORD PTR [ebp+0x8]
0040153E mov eax,DWORD PTR [ebp+0x10]
00401541 movss xmm0,DWORD PTR [ebp+0xc]
00401546 shufps xmm0,xmm0,0x0
0040154A movaps xmm1,xmm0
0040154D mulps xmm1,XMMWORD PTR [edx]
00401550 movaps XMMWORD PTR [eax],xmm1
00401553 movaps xmm1,xmm0
00401556 mulps xmm1,XMMWORD PTR [edx+0x10]
0040155A movaps XMMWORD PTR [eax+0x10],xmm1
0040155E movaps xmm1,xmm0
00401561 mulps xmm1,XMMWORD PTR [edx+0x20]
00401565 movaps XMMWORD PTR [eax+0x20],xmm1
00401569 mulps xmm0,XMMWORD PTR [edx+0x30]
0040156D movaps XMMWORD PTR [eax+0x30],xmm0
00401571 leave
00401572 ret
Matrix Matrix Multiplication:
This is an important function, and a large one. Fortunately, it SIMD-izes pretty darn well.
MATHLIB_INLINE void matrix4x4_mul(const matrix4x4& matrix1, const matrix4x4& matrix2, matrix4x4& result)
/// Performs a matrix-matrix multiplication and stores the product in result.
/// Corresponds to the mathematical statement result = matrix1 * matrix2
/// @warning The three matrices must be distinct; the result will be incorrect if result is the same object as matrix1 or matrix2.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)
    __m128 row0;
    __m128 row1;
    __m128 row2;
    __m128 row3;
    __m128 value0;

    // Compute row 0 of the matrix product:
    value0 = _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(3, 3, 3, 3));
    row0 = _mm_mul_ps(matrix2.sse_row0, value0);
    value0 = _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(2, 2, 2, 2));
    row0 = _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row1, value0));
    value0 = _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(1, 1, 1, 1));
    row0 = _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row2, value0));
    value0 = _mm_shuffle_ps(matrix1.sse_row0, matrix1.sse_row0, _MM_SHUFFLE(0, 0, 0, 0));
    row0 = _mm_add_ps(row0, _mm_mul_ps(matrix2.sse_row3, value0));

    // Compute row 1 of the matrix product:
    value0 = _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(3, 3, 3, 3));
    row1 = _mm_mul_ps(matrix2.sse_row0, value0);
    value0 = _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(2, 2, 2, 2));
    row1 = _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row1, value0));
    value0 = _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(1, 1, 1, 1));
    row1 = _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row2, value0));
    value0 = _mm_shuffle_ps(matrix1.sse_row1, matrix1.sse_row1, _MM_SHUFFLE(0, 0, 0, 0));
    row1 = _mm_add_ps(row1, _mm_mul_ps(matrix2.sse_row3, value0));

    // Compute row 2 of the matrix product:
    value0 = _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(3, 3, 3, 3));
    row2 = _mm_mul_ps(matrix2.sse_row0, value0);
    value0 = _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(2, 2, 2, 2));
    row2 = _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row1, value0));
    value0 = _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(1, 1, 1, 1));
    row2 = _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row2, value0));
    value0 = _mm_shuffle_ps(matrix1.sse_row2, matrix1.sse_row2, _MM_SHUFFLE(0, 0, 0, 0));
    row2 = _mm_add_ps(row2, _mm_mul_ps(matrix2.sse_row3, value0));

    // Compute row 3 of the matrix product:
    value0 = _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(3, 3, 3, 3));
    row3 = _mm_mul_ps(matrix2.sse_row0, value0);
    value0 = _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(2, 2, 2, 2));
    row3 = _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row1, value0));
    value0 = _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(1, 1, 1, 1));
    row3 = _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row2, value0));
    value0 = _mm_shuffle_ps(matrix1.sse_row3, matrix1.sse_row3, _MM_SHUFFLE(0, 0, 0, 0));
    row3 = _mm_add_ps(row3, _mm_mul_ps(matrix2.sse_row3, value0));

    // Write results back to memory:
    result.sse_row0 = row0;
    result.sse_row1 = row1;
    result.sse_row2 = row2;
    result.sse_row3 = row3;
#endif // (MATHLIB_SSE)
#else
    result._03 = (matrix1._03 * matrix2._33) + (matrix1._02 * matrix2._23) + (matrix1._01 * matrix2._13) + (matrix1._00 * matrix2._03);
    result._02 = (matrix1._03 * matrix2._32) + (matrix1._02 * matrix2._22) + (matrix1._01 * matrix2._12) + (matrix1._00 * matrix2._02);
    result._01 = (matrix1._03 * matrix2._31) + (matrix1._02 * matrix2._21) + (matrix1._01 * matrix2._11) + (matrix1._00 * matrix2._01);
    result._00 = (matrix1._03 * matrix2._30) + (matrix1._02 * matrix2._20) + (matrix1._01 * matrix2._10) + (matrix1._00 * matrix2._00);
    result._13 = (matrix1._13 * matrix2._33) + (matrix1._12 * matrix2._23) + (matrix1._11 * matrix2._13) + (matrix1._10 * matrix2._03);
    result._12 = (matrix1._13 * matrix2._32) + (matrix1._12 * matrix2._22) + (matrix1._11 * matrix2._12) + (matrix1._10 * matrix2._02);
    result._11 = (matrix1._13 * matrix2._31) + (matrix1._12 * matrix2._21) + (matrix1._11 * matrix2._11) + (matrix1._10 * matrix2._01);
    result._10 = (matrix1._13 * matrix2._30) + (matrix1._12 * matrix2._20) + (matrix1._11 * matrix2._10) + (matrix1._10 * matrix2._00);
    result._23 = (matrix1._23 * matrix2._33) + (matrix1._22 * matrix2._23) + (matrix1._21 * matrix2._13) + (matrix1._20 * matrix2._03);
    result._22 = (matrix1._23 * matrix2._32) + (matrix1._22 * matrix2._22) + (matrix1._21 * matrix2._12) + (matrix1._20 * matrix2._02);
    result._21 = (matrix1._23 * matrix2._31) + (matrix1._22 * matrix2._21) + (matrix1._21 * matrix2._11) + (matrix1._20 * matrix2._01);
    result._20 = (matrix1._23 * matrix2._30) + (matrix1._22 * matrix2._20) + (matrix1._21 * matrix2._10) + (matrix1._20 * matrix2._00);
    result._33 = (matrix1._33 * matrix2._33) + (matrix1._32 * matrix2._23) + (matrix1._31 * matrix2._13) + (matrix1._30 * matrix2._03);
    result._32 = (matrix1._33 * matrix2._32) + (matrix1._32 * matrix2._22) + (matrix1._31 * matrix2._12) + (matrix1._30 * matrix2._02);
    result._31 = (matrix1._33 * matrix2._31) + (matrix1._32 * matrix2._21) + (matrix1._31 * matrix2._11) + (matrix1._30 * matrix2._01);
    result._30 = (matrix1._33 * matrix2._30) + (matrix1._32 * matrix2._20) + (matrix1._31 * matrix2._10) + (matrix1._30 * matrix2._00);
#endif // (MATHLIB_SIMD)
}
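The trick the SSE version uses is worth spelling out: row i of matrix1 * matrix2 is a linear combination of matrix2's rows, weighted by the four elements of matrix1's row i. Each `_mm_shuffle_ps` broadcasts one of those weights across a register so each multiply-add handles four columns at once. A scalar sketch of the same formulation (row-major, hypothetical names, not the library's API):

```cpp
// Row i of (A*B) = sum over k of A[i][k] * (row k of B).
// Each _mm_shuffle_ps in the SSE code broadcasts one weight A[i][k] across a
// register; the inner loop below is what one mulps+addps pair computes.
void mul4x4_rowcombine(const float a[16], const float b[16], float r[16])
{
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j)
            r[4 * i + j] = 0.0f;
        for (int k = 0; k < 4; ++k) {
            const float w = a[4 * i + k];          // broadcast weight
            for (int j = 0; j < 4; ++j)
                r[4 * i + j] += w * b[4 * k + j];  // weight * row k of B
        }
    }
}
```

Formulated this way there are no horizontal operations at all, which is why it maps so cleanly onto SSE.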
FPU
004019B7 push ebp
004019B8 mov ebp,esp
004019BA mov edx,DWORD PTR [ebp+0x8]
004019BD mov eax,DWORD PTR [ebp+0xc]
004019C0 mov ecx,DWORD PTR [ebp+0x10]
004019C3 fld DWORD PTR [edx]
004019C5 fmul DWORD PTR [eax+0x30]
004019C8 fld DWORD PTR [edx+0x4]
004019CB fld st(0)
004019CD fmul DWORD PTR [eax+0x20]
004019D0 faddp st(2),st
004019D2 fld DWORD PTR [edx+0x8]
004019D5 fld st(0)
004019D7 fmul DWORD PTR [eax+0x10]
004019DA faddp st(3),st
004019DC fld DWORD PTR [edx+0xc]
004019DF fld st(0)
004019E1 fmul DWORD PTR [eax]
004019E3 faddp st(4),st
004019E5 fxch st(3)
004019E7 fstp DWORD PTR [ecx]
004019E9 fld DWORD PTR [edx]
004019EB fld st(0)
004019ED fmul DWORD PTR [eax+0x34]
004019F0 fxch st(3)
004019F2 fmul DWORD PTR [eax+0x24]
004019F5 faddp st(3),st
004019F7 fld st(1)
004019F9 fmul DWORD PTR [eax+0x14]
004019FC faddp st(3),st
004019FE fld st(3)
00401A00 fmul DWORD PTR [eax+0x4]
00401A03 faddp st(3),st
00401A05 fxch st(2)
00401A07 fstp DWORD PTR [ecx+0x4]
00401A0A fld st(1)
00401A0C fmul DWORD PTR [eax+0x38]
00401A0F fld DWORD PTR [edx+0x4]
00401A12 fld st(0)
00401A14 fmul DWORD PTR [eax+0x28]
00401A17 faddp st(2),st
00401A19 fxch st(2)
00401A1B fmul DWORD PTR [eax+0x18]
00401A1E faddp st(1),st
00401A20 fld st(3)
00401A22 fmul DWORD PTR [eax+0x8]
00401A25 faddp st(1),st
00401A27 fstp DWORD PTR [ecx+0x8]
00401A2A fxch st(1)
00401A2C fmul DWORD PTR [eax+0x3c]
00401A2F fxch st(1)
00401A31 fmul DWORD PTR [eax+0x2c]
00401A34 faddp st(1),st
00401A36 fld DWORD PTR [edx+0x8]
00401A39 fmul DWORD PTR [eax+0x1c]
00401A3C faddp st(1),st
00401A3E fxch st(1)
00401A40 fmul DWORD PTR [eax+0xc]
00401A43 faddp st(1),st
00401A45 fstp DWORD PTR [ecx+0xc]
00401A48 fld DWORD PTR [edx+0x10]
00401A4B fmul DWORD PTR [eax+0x30]
00401A4E fld DWORD PTR [edx+0x14]
00401A51 fld st(0)
00401A53 fmul DWORD PTR [eax+0x20]
00401A56 faddp st(2),st
00401A58 fld DWORD PTR [edx+0x18]
00401A5B fld st(0)
00401A5D fmul DWORD PTR [eax+0x10]
00401A60 faddp st(3),st
00401A62 fld DWORD PTR [edx+0x1c]
00401A65 fld st(0)
00401A67 fmul DWORD PTR [eax]
00401A69 faddp st(4),st
00401A6B fxch st(3)
00401A6D fstp DWORD PTR [ecx+0x10]
00401A70 fld DWORD PTR [edx+0x10]
00401A73 fld st(0)
00401A75 fmul DWORD PTR [eax+0x34]
00401A78 fxch st(3)
00401A7A fmul DWORD PTR [eax+0x24]
00401A7D faddp st(3),st
00401A7F fld st(1)
00401A81 fmul DWORD PTR [eax+0x14]
00401A84 faddp st(3),st
00401A86 fld st(3)
00401A88 fmul DWORD PTR [eax+0x4]
00401A8B faddp st(3),st
00401A8D fxch st(2)
00401A8F fstp DWORD PTR [ecx+0x14]
00401A92 fld st(1)
00401A94 fmul DWORD PTR [eax+0x38]
00401A97 fld DWORD PTR [edx+0x14]
00401A9A fld st(0)
00401A9C fmul DWORD PTR [eax+0x28]
00401A9F faddp st(2),st
00401AA1 fxch st(2)
00401AA3 fmul DWORD PTR [eax+0x18]
00401AA6 faddp st(1),st
00401AA8 fld st(3)
00401AAA fmul DWORD PTR [eax+0x8]
00401AAD faddp st(1),st
00401AAF fstp DWORD PTR [ecx+0x18]
00401AB2 fxch st(1)
00401AB4 fmul DWORD PTR [eax+0x3c]
00401AB7 fxch st(1)
00401AB9 fmul DWORD PTR [eax+0x2c]
00401ABC faddp st(1),st
00401ABE fld DWORD PTR [edx+0x18]
00401AC1 fmul DWORD PTR [eax+0x1c]
00401AC4 faddp st(1),st
00401AC6 fxch st(1)
00401AC8 fmul DWORD PTR [eax+0xc]
00401ACB faddp st(1),st
00401ACD fstp DWORD PTR [ecx+0x1c]
00401AD0 fld DWORD PTR [edx+0x20]
00401AD3 fmul DWORD PTR [eax+0x30]
00401AD6 fld DWORD PTR [edx+0x24]
00401AD9 fld st(0)
00401ADB fmul DWORD PTR [eax+0x20]
00401ADE faddp st(2),st
00401AE0 fld DWORD PTR [edx+0x28]
00401AE3 fld st(0)
00401AE5 fmul DWORD PTR [eax+0x10]
00401AE8 faddp st(3),st
00401AEA fld DWORD PTR [edx+0x2c]
00401AED fld st(0)
00401AEF fmul DWORD PTR [eax]
00401AF1 faddp st(4),st
00401AF3 fxch st(3)
00401AF5 fstp DWORD PTR [ecx+0x20]
00401AF8 fld DWORD PTR [edx+0x20]
00401AFB fld st(0)
00401AFD fmul DWORD PTR [eax+0x34]
00401B00 fxch st(3)
00401B02 fmul DWORD PTR [eax+0x24]
00401B05 faddp st(3),st
00401B07 fld st(1)
00401B09 fmul DWORD PTR [eax+0x14]
00401B0C faddp st(3),st
00401B0E fld st(3)
00401B10 fmul DWORD PTR [eax+0x4]
00401B13 faddp st(3),st
00401B15 fxch st(2)
00401B17 fstp DWORD PTR [ecx+0x24]
00401B1A fld st(1)
00401B1C fmul DWORD PTR [eax+0x38]
00401B1F fld DWORD PTR [edx+0x24]
00401B22 fld st(0)
00401B24 fmul DWORD PTR [eax+0x28]
00401B27 faddp st(2),st
00401B29 fxch st(2)
00401B2B fmul DWORD PTR [eax+0x18]
00401B2E faddp st(1),st
00401B30 fld st(3)
00401B32 fmul DWORD PTR [eax+0x8]
00401B35 faddp st(1),st
00401B37 fstp DWORD PTR [ecx+0x28]
00401B3A fxch st(1)
00401B3C fmul DWORD PTR [eax+0x3c]
00401B3F fxch st(1)
00401B41 fmul DWORD PTR [eax+0x2c]
00401B44 faddp st(1),st
00401B46 fld DWORD PTR [edx+0x28]
00401B49 fmul DWORD PTR [eax+0x1c]
00401B4C faddp st(1),st
00401B4E fxch st(1)
00401B50 fmul DWORD PTR [eax+0xc]
00401B53 faddp st(1),st
00401B55 fstp DWORD PTR [ecx+0x2c]
00401B58 fld DWORD PTR [edx+0x30]
00401B5B fmul DWORD PTR [eax+0x30]
00401B5E fld DWORD PTR [edx+0x34]
00401B61 fld st(0)
00401B63 fmul DWORD PTR [eax+0x20]
00401B66 faddp st(2),st
00401B68 fld DWORD PTR [edx+0x38]
00401B6B fld st(0)
00401B6D fmul DWORD PTR [eax+0x10]
00401B70 faddp st(3),st
00401B72 fld DWORD PTR [edx+0x3c]
00401B75 fld st(0)
00401B77 fmul DWORD PTR [eax]
00401B79 faddp st(4),st
00401B7B fxch st(3)
00401B7D fstp DWORD PTR [ecx+0x30]
00401B80 fld DWORD PTR [edx+0x30]
00401B83 fld st(0)
00401B85 fmul DWORD PTR [eax+0x34]
00401B88 fxch st(3)
00401B8A fmul DWORD PTR [eax+0x24]
00401B8D faddp st(3),st
00401B8F fld st(1)
00401B91 fmul DWORD PTR [eax+0x14]
00401B94 faddp st(3),st
00401B96 fld st(3)
00401B98 fmul DWORD PTR [eax+0x4]
00401B9B faddp st(3),st
00401B9D fxch st(2)
00401B9F fstp DWORD PTR [ecx+0x34]
00401BA2 fld st(1)
00401BA4 fmul DWORD PTR [eax+0x38]
00401BA7 fld DWORD PTR [edx+0x34]
00401BAA fld st(0)
00401BAC fmul DWORD PTR [eax+0x28]
00401BAF faddp st(2),st
00401BB1 fxch st(2)
00401BB3 fmul DWORD PTR [eax+0x18]
00401BB6 faddp st(1),st
00401BB8 fld st(3)
00401BBA fmul DWORD PTR [eax+0x8]
00401BBD faddp st(1),st
00401BBF fstp DWORD PTR [ecx+0x38]
00401BC2 fxch st(1)
00401BC4 fmul DWORD PTR [eax+0x3c]
00401BC7 fxch st(1)
00401BC9 fmul DWORD PTR [eax+0x2c]
00401BCC faddp st(1),st
00401BCE fld DWORD PTR [edx+0x38]
00401BD1 fmul DWORD PTR [eax+0x1c]
00401BD4 faddp st(1),st
00401BD6 fxch st(1)
00401BD8 fmul DWORD PTR [eax+0xc]
00401BDB faddp st(1),st
00401BDD fstp DWORD PTR [ecx+0x3c]
00401BE0 leave
00401BE1 ret
SSE
0041DCDC push ebp
0041DCDD mov ebp,esp
0041DCDF mov edx,DWORD PTR [ebp+0x8]
0041DCE2 mov ecx,DWORD PTR [ebp+0xc]
0041DCE5 mov eax,DWORD PTR [ebp+0x10]
0041DCE8 movaps xmm0,XMMWORD PTR [edx]
0041DCEB movaps xmm6,xmm0
0041DCEE shufps xmm6,xmm0,0xff
0041DCF2 movaps xmm5,XMMWORD PTR [ecx]
0041DCF5 mulps xmm6,xmm5
0041DCF8 movaps xmm1,xmm0
0041DCFB shufps xmm1,xmm0,0xaa
0041DCFF movaps xmm4,XMMWORD PTR [ecx+0x10]
0041DD03 mulps xmm1,xmm4
0041DD06 addps xmm6,xmm1
0041DD09 movaps xmm1,xmm0
0041DD0C shufps xmm1,xmm0,0x55
0041DD10 movaps xmm3,XMMWORD PTR [ecx+0x20]
0041DD14 mulps xmm1,xmm3
0041DD17 addps xmm6,xmm1
0041DD1A shufps xmm0,xmm0,0x0
0041DD1E movaps xmm2,XMMWORD PTR [ecx+0x30]
0041DD22 mulps xmm0,xmm2
0041DD25 addps xmm6,xmm0
0041DD28 movaps xmm0,XMMWORD PTR [edx+0x10]
0041DD2C movaps xmm7,xmm0
0041DD2F shufps xmm7,xmm0,0xff
0041DD33 mulps xmm7,xmm5
0041DD36 movaps xmm1,xmm0
0041DD39 shufps xmm1,xmm0,0xaa
0041DD3D mulps xmm1,xmm4
0041DD40 addps xmm1,xmm7
0041DD43 movaps xmm7,xmm0
0041DD46 shufps xmm7,xmm0,0x55
0041DD4A mulps xmm7,xmm3
0041DD4D addps xmm1,xmm7
0041DD50 shufps xmm0,xmm0,0x0
0041DD54 mulps xmm0,xmm2
0041DD57 addps xmm1,xmm0
0041DD5A movaps XMMWORD PTR [eax+0x10],xmm1
0041DD5E movaps xmm1,XMMWORD PTR [edx+0x20]
0041DD62 movaps xmm7,xmm1
0041DD65 shufps xmm7,xmm1,0xff
0041DD69 mulps xmm7,xmm5
0041DD6C movaps xmm0,xmm1
0041DD6F shufps xmm0,xmm1,0xaa
0041DD73 mulps xmm0,xmm4
0041DD76 addps xmm7,xmm0
0041DD79 movaps xmm0,xmm1
0041DD7C shufps xmm0,xmm1,0x55
0041DD80 mulps xmm0,xmm3
0041DD83 addps xmm0,xmm7
0041DD86 shufps xmm1,xmm1,0x0
0041DD8A mulps xmm1,xmm2
0041DD8D addps xmm1,xmm0
0041DD90 movaps xmm0,XMMWORD PTR [edx+0x30]
0041DD94 movaps xmm7,xmm0
0041DD97 shufps xmm7,xmm0,0xff
0041DD9B mulps xmm5,xmm7
0041DD9E movaps xmm7,xmm0
0041DDA1 shufps xmm7,xmm0,0xaa
0041DDA5 mulps xmm4,xmm7
0041DDA8 addps xmm5,xmm4
0041DDAB movaps xmm4,xmm0
0041DDAE shufps xmm4,xmm0,0x55
0041DDB2 mulps xmm3,xmm4
0041DDB5 addps xmm5,xmm3
0041DDB8 shufps xmm0,xmm0,0x0
0041DDBC mulps xmm2,xmm0
0041DDBF addps xmm2,xmm5
0041DDC2 movaps XMMWORD PTR [eax],xmm6
0041DDC5 movaps XMMWORD PTR [eax+0x20],xmm1
0041DDC9 movaps XMMWORD PTR [eax+0x30],xmm2
0041DDCD leave
0041DDCE ret
Matrix Vector Multiplication [Single]:
MATHLIB_INLINE void matrix4x4_vectorMul(const matrix4x4& matrix, const vector4& vector, vector4& result)
/// Performs a multiplication of the given vector (column matrix) by the given matrix.
/// This method corresponds to the mathematical statement result = matrix * vector, i.e. the vector is a column vector.
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)
    __m128 m0 = _mm_mul_ps(matrix.sse_row0, vector.sse_vec);
    __m128 m1 = _mm_mul_ps(matrix.sse_row1, vector.sse_vec);
    __m128 m2 = _mm_mul_ps(matrix.sse_row2, vector.sse_vec);
    __m128 m3 = _mm_mul_ps(matrix.sse_row3, vector.sse_vec);

    m0 = _mm_hadd_ps(m1, m0);
    m2 = _mm_hadd_ps(m3, m2);
    m0 = _mm_hadd_ps(m2, m0);

    result.sse_vec = m0;
#endif // (MATHLIB_SSE)
#else
    float vectorX = vector.extractX();
    float vectorY = vector.extractY();
    float vectorZ = vector.extractZ();
    float vectorW = vector.extractW();

    result.setXYZW((matrix._03 * vectorW) + (matrix._02 * vectorZ) + (matrix._01 * vectorY) + (matrix._00 * vectorX),
                   (matrix._13 * vectorW) + (matrix._12 * vectorZ) + (matrix._11 * vectorY) + (matrix._10 * vectorX),
                   (matrix._23 * vectorW) + (matrix._22 * vectorZ) + (matrix._21 * vectorY) + (matrix._20 * vectorX),
                   (matrix._33 * vectorW) + (matrix._32 * vectorZ) + (matrix._31 * vectorY) + (matrix._30 * vectorX));
#endif // (MATHLIB_SIMD)
}
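The three `_mm_hadd_ps` calls are a reduction: each per-row `_mm_mul_ps` leaves four element products in a register, and the horizontal adds sum them into dot products. A scalar model of the intrinsic (assuming `_mm_hadd_ps(a, b)` yields {a0+a1, a2+a3, b0+b1, b2+b3}) shows where each row's dot product ends up:

```cpp
#include <array>

using vec4 = std::array<float, 4>;

// Scalar model of _mm_hadd_ps: pairwise sums of the first argument in the
// low half, pairwise sums of the second argument in the high half.
vec4 hadd(const vec4& a, const vec4& b)
{
    return { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
}

// Mirror of the SSE reduction above: mK holds row K of the matrix multiplied
// element-wise by the vector. Following the same argument order as the
// intrinsic code (hadd(m1, m0), hadd(m3, m2), then hadd of those), lane 0
// ends up with row 3's dot product and lane 3 with row 0's.
vec4 reduce(const vec4& m0, const vec4& m1, const vec4& m2, const vec4& m3)
{
    vec4 a = hadd(m1, m0);
    vec4 b = hadd(m3, m2);
    return hadd(b, a);
}
```

Note the reversed argument order in the intrinsic calls, which controls which lane each dot product lands in; it has to match the library's element ordering within a register.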
FPU
00401C30 push ebp
00401C31 mov ebp,esp
00401C33 mov eax,DWORD PTR [ebp+0x8]
00401C36 mov ecx,DWORD PTR [ebp+0xc]
00401C39 mov edx,DWORD PTR [ebp+0x10]
00401C3C fld DWORD PTR [ecx+0xc]
00401C3F fld DWORD PTR [ecx+0x8]
00401C42 fld DWORD PTR [ecx+0x4]
00401C45 fld DWORD PTR [ecx]
00401C47 fld st(0)
00401C49 fmul DWORD PTR [eax+0x30]
00401C4C fld st(2)
00401C4E fmul DWORD PTR [eax+0x34]
00401C51 faddp st(1),st
00401C53 fld st(3)
00401C55 fmul DWORD PTR [eax+0x38]
00401C58 faddp st(1),st
00401C5A fld st(4)
00401C5C fmul DWORD PTR [eax+0x3c]
00401C5F faddp st(1),st
00401C61 fld st(1)
00401C63 fmul DWORD PTR [eax+0x20]
00401C66 fld st(3)
00401C68 fmul DWORD PTR [eax+0x24]
00401C6B faddp st(1),st
00401C6D fld st(4)
00401C6F fmul DWORD PTR [eax+0x28]
00401C72 faddp st(1),st
00401C74 fld st(5)
00401C76 fmul DWORD PTR [eax+0x2c]
00401C79 faddp st(1),st
00401C7B fld st(2)
00401C7D fmul DWORD PTR [eax+0x10]
00401C80 fld st(4)
00401C82 fmul DWORD PTR [eax+0x14]
00401C85 faddp st(1),st
00401C87 fld st(5)
00401C89 fmul DWORD PTR [eax+0x18]
00401C8C faddp st(1),st
00401C8E fld st(6)
00401C90 fmul DWORD PTR [eax+0x1c]
00401C93 faddp st(1),st
00401C95 fxch st(3)
00401C97 fmul DWORD PTR [eax]
00401C99 fxch st(4)
00401C9B fmul DWORD PTR [eax+0x4]
00401C9E faddp st(4),st
00401CA0 fxch st(4)
00401CA2 fmul DWORD PTR [eax+0x8]
00401CA5 faddp st(3),st
00401CA7 fxch st(4)
00401CA9 fmul DWORD PTR [eax+0xc]
00401CAC faddp st(2),st
00401CAE fxch st(1)
00401CB0 fstp DWORD PTR [edx+0xc]
00401CB3 fstp DWORD PTR [edx+0x8]
00401CB6 fstp DWORD PTR [edx+0x4]
00401CB9 fstp DWORD PTR [edx]
00401CBB leave
00401CBC ret
SSE
004014A8 push ebp
004014A9 mov ebp,esp
004014AB mov eax,DWORD PTR [ebp+0x8]
004014AE mov edx,DWORD PTR [ebp+0xc]
004014B1 movaps xmm1,XMMWORD PTR [edx]
004014B4 movaps xmm3,xmm1
004014B7 mulps xmm3,XMMWORD PTR [eax]
004014BA movaps xmm0,xmm1
004014BD mulps xmm0,XMMWORD PTR [eax+0x10]
004014C1 movaps xmm2,xmm1
004014C4 mulps xmm2,XMMWORD PTR [eax+0x20]
004014C8 mulps xmm1,XMMWORD PTR [eax+0x30]
004014CC haddps xmm0,xmm3
004014D0 haddps xmm1,xmm2
004014D4 haddps xmm1,xmm0
004014D8 mov eax,DWORD PTR [ebp+0x10]
004014DB movaps XMMWORD PTR [eax],xmm1
004014DE leave
004014DF ret
Matrix Vector Multiplication [Batch]:
MATHLIB_INLINE void matrix4x4_vectorBatchMul(const matrix4x4& matrix, vector4 const * const vectorArray, uint32_t numVectors, vector4* const resultArray)
/// Performs a batch matrix vector multiplication (column matrix).
{
#if (MATHLIB_SIMD)
#if (MATHLIB_SSE)
    // Load matrix:
    __m128 matrix_row0 = matrix.sse_row0;
    __m128 matrix_row1 = matrix.sse_row1;
    __m128 matrix_row2 = matrix.sse_row2;
    __m128 matrix_row3 = matrix.sse_row3;

    for (uint32_t i = 0; i < numVectors; i++)
    {
        __m128 m0 = _mm_mul_ps(matrix_row0, vectorArray[i].sse_vec);
        __m128 m1 = _mm_mul_ps(matrix_row1, vectorArray[i].sse_vec);
        __m128 m2 = _mm_mul_ps(matrix_row2, vectorArray[i].sse_vec);
        __m128 m3 = _mm_mul_ps(matrix_row3, vectorArray[i].sse_vec);

        m0 = _mm_hadd_ps(m1, m0);
        m2 = _mm_hadd_ps(m3, m2);
        m0 = _mm_hadd_ps(m2, m0);

        resultArray[i].sse_vec = m0;
    }
#endif // (MATHLIB_SSE)
#else
    for (uint32_t i = 0; i < numVectors; i++)
    {
        float vectorX = vectorArray[i].extractX();
        float vectorY = vectorArray[i].extractY();
        float vectorZ = vectorArray[i].extractZ();
        float vectorW = vectorArray[i].extractW();

        resultArray[i].setXYZW((matrix._03 * vectorW) + (matrix._02 * vectorZ) + (matrix._01 * vectorY) + (matrix._00 * vectorX),
                               (matrix._13 * vectorW) + (matrix._12 * vectorZ) + (matrix._11 * vectorY) + (matrix._10 * vectorX),
                               (matrix._23 * vectorW) + (matrix._22 * vectorZ) + (matrix._21 * vectorY) + (matrix._20 * vectorX),
                               (matrix._33 * vectorW) + (matrix._32 * vectorZ) + (matrix._31 * vectorY) + (matrix._30 * vectorX));
    }
#endif // (MATHLIB_SIMD)
}
FPU
00401CBD push ebp
00401CBE mov ebp,esp
00401CC0 push edi
00401CC1 push esi
00401CC2 push ebx
00401CC3 mov eax,DWORD PTR [ebp+0x8]
00401CC6 mov ecx,DWORD PTR [ebp+0xc]
00401CC9 mov edi,DWORD PTR [ebp+0x10]
00401CCC mov ebx,DWORD PTR [ebp+0x14]
00401CCF test edi,edi
00401CD1 je 0x401d74 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+183>
00401CD7 mov edx,0x0
00401CDC mov esi,0x0
00401CE1 fld DWORD PTR [ecx+edx*1+0xc]
00401CE5 fld DWORD PTR [ecx+edx*1+0x8]
00401CE9 fld DWORD PTR [ecx+edx*1+0x4]
00401CED fld DWORD PTR [ecx+edx*1]
00401CF0 fld st(0)
00401CF2 fmul DWORD PTR [eax+0x30]
00401CF5 fld st(2)
00401CF7 fmul DWORD PTR [eax+0x34]
00401CFA faddp st(1),st
00401CFC fld st(3)
00401CFE fmul DWORD PTR [eax+0x38]
00401D01 faddp st(1),st
00401D03 fld st(4)
00401D05 fmul DWORD PTR [eax+0x3c]
00401D08 faddp st(1),st
00401D0A fld st(1)
00401D0C fmul DWORD PTR [eax+0x20]
00401D0F fld st(3)
00401D11 fmul DWORD PTR [eax+0x24]
00401D14 faddp st(1),st
00401D16 fld st(4)
00401D18 fmul DWORD PTR [eax+0x28]
00401D1B faddp st(1),st
00401D1D fld st(5)
00401D1F fmul DWORD PTR [eax+0x2c]
00401D22 faddp st(1),st
00401D24 fld st(2)
00401D26 fmul DWORD PTR [eax+0x10]
00401D29 fld st(4)
00401D2B fmul DWORD PTR [eax+0x14]
00401D2E faddp st(1),st
00401D30 fld st(5)
00401D32 fmul DWORD PTR [eax+0x18]
00401D35 faddp st(1),st
00401D37 fld st(6)
00401D39 fmul DWORD PTR [eax+0x1c]
00401D3C faddp st(1),st
00401D3E fxch st(3)
00401D40 fmul DWORD PTR [eax]
00401D42 fxch st(4)
00401D44 fmul DWORD PTR [eax+0x4]
00401D47 faddp st(4),st
00401D49 fxch st(4)
00401D4B fmul DWORD PTR [eax+0x8]
00401D4E faddp st(3),st
00401D50 fxch st(4)
00401D52 fmul DWORD PTR [eax+0xc]
00401D55 faddp st(2),st
00401D57 fxch st(1)
00401D59 fstp DWORD PTR [ebx+edx*1+0xc]
00401D5D fstp DWORD PTR [ebx+edx*1+0x8]
00401D61 fstp DWORD PTR [ebx+edx*1+0x4]
00401D65 fstp DWORD PTR [ebx+edx*1]
00401D68 inc esi
00401D69 add edx,0x10
00401D6C cmp edi,esi
00401D6E ja 0x401ce1 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+36>
00401D74 pop ebx
00401D75 pop esi
00401D76 pop edi
00401D77 leave
00401D78 ret
SSE
004014A8 push ebp
004014A9 mov ebp,esp
004014AB push ebx
004014AC mov eax,DWORD PTR [ebp+0x8]
004014AF mov ebx,DWORD PTR [ebp+0x10]
004014B2 movaps xmm3,XMMWORD PTR [eax]
004014B5 movaps xmm4,XMMWORD PTR [eax+0x10]
004014B9 movaps xmm5,XMMWORD PTR [eax+0x20]
004014BD movaps xmm6,XMMWORD PTR [eax+0x30]
004014C1 test ebx,ebx
004014C3 je 0x401502 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+90>
004014C5 mov ecx,DWORD PTR [ebp+0xc]
004014C8 mov edx,DWORD PTR [ebp+0x14]
004014CB mov eax,0x0
004014D0 movaps xmm1,XMMWORD PTR [ecx]
004014D3 movaps xmm7,xmm3
004014D6 mulps xmm7,xmm1
004014D9 movaps xmm0,xmm4
004014DC mulps xmm0,xmm1
004014DF movaps xmm2,xmm5
004014E2 mulps xmm2,xmm1
004014E5 mulps xmm1,xmm6
004014E8 haddps xmm0,xmm7
004014EC haddps xmm1,xmm2
004014F0 haddps xmm1,xmm0
004014F4 movaps XMMWORD PTR [edx],xmm1
004014F7 inc eax
004014F8 add ecx,0x10
004014FB add edx,0x10
004014FE cmp ebx,eax
00401500 ja 0x4014d0 <MatrixTests::matrix4x4_vectorMulBatchTest(MathLib::matrix4x4 const&, MathLib::vector4 const*, unsigned int, MathLib::vector4*)+40>
00401502 pop ebx
00401503 leave
00401504 ret
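One caveat worth noting about all the SSE listings above: the compiler emitted movaps, which faults if its memory operand is not 16-byte aligned, so the vector and matrix types (and any arrays you batch-transform) must guarantee that alignment. A sketch of what that looks like (hypothetical declarations, not the library's actual ones):

```cpp
#include <cstdint>

// movaps requires 16-byte-aligned addresses, so SIMD-friendly vector types
// are typically declared with alignas(16). With the struct padded to 16
// bytes, every element of an array of these is also correctly aligned.
struct alignas(16) vec4_aligned { float x, y, z, w; };

// Quick check that an address satisfies movaps' alignment requirement.
bool is_16_byte_aligned(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}
```

Hand the batch multiply an unaligned array (e.g. raw floats from a file loader) and it will crash rather than just run slowly.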
Benchmarks
Let's look at the performance numbers. The testing methodology is pretty much exactly the same as for the vectors (you can take a look at the source code; it's all there). I'm just going to post the text output from the program itself (the screenshot approach takes too long). Again, we're measuring single-threaded code, so it's an architecture, clock speed, and memory bandwidth game.
The first machine I ran the test on was equipped with a Pentium E5200, overclocked to 2.75GHz:
MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].
Running benchmark!
Matrix array sizes         : 10000
Number of iterations       : 100000
----------------------------------------
Addition benchmark         : average time (msecs) [0.099916]
Subtraction benchmark      : average time (msecs) [0.089407]
Transposition benchmark    : average time (msecs) [0.049061]
Scalar mul benchmark       : average time (msecs) [0.057214]
Matrix mul benchmark       : average time (msecs) [0.169970]
Vector mul benchmark       : average time (msecs) [0.049327]
Vector batch mul benchmark : average time (msecs) [0.036488]

Process returned 0 (0x0)   execution time : 55.196 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].
Running benchmark!
Matrix array sizes         : 10000
Number of iterations       : 100000
----------------------------------------
Addition benchmark         : average time (msecs) [0.152721]
Subtraction benchmark      : average time (msecs) [0.151200]
Transposition benchmark    : average time (msecs) [0.098960]
Scalar mul benchmark       : average time (msecs) [0.121864]
Matrix mul benchmark       : average time (msecs) [0.470418]
Vector mul benchmark       : average time (msecs) [0.117321]
Vector batch mul benchmark : average time (msecs) [0.118026]

Process returned 0 (0x0)   execution time : 123.101 s
Press any key to continue.
The second machine was my good old (really old) laptop, equipped with a Pentium Dual T3400:
MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].
Running benchmark!
Matrix array sizes   : 10000
Number of iterations : 100000
----------------------------------------
Addition benchmark         : average time (msecs) [0.623972]
Subtraction benchmark      : average time (msecs) [0.620934]
Transposition benchmark    : average time (msecs) [0.070971]
Scalar mul benchmark       : average time (msecs) [0.423544]
Matrix mul benchmark       : average time (msecs) [0.641712]
Vector mul benchmark       : average time (msecs) [0.108508]
Vector batch mul benchmark : average time (msecs) [0.063231]

Process returned 0 (0x0)   execution time : 256.348 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].
Running benchmark!
Matrix array sizes   : 10000
Number of iterations : 100000
----------------------------------------
Addition benchmark         : average time (msecs) [0.634588]
Subtraction benchmark      : average time (msecs) [0.632229]
Transposition benchmark    : average time (msecs) [0.097166]
Scalar mul benchmark       : average time (msecs) [0.439894]
Matrix mul benchmark       : average time (msecs) [0.676615]
Vector mul benchmark       : average time (msecs) [0.173299]
Vector batch mul benchmark : average time (msecs) [0.152087]

Process returned 0 (0x0)   execution time : 281.617 s
Press any key to continue.
These results are pretty interesting: it appears the calculations become memory bandwidth limited on this machine. The SSE code is always ahead, but not by much. The only really notable speedups are in the vector mul and vector batch mul results.
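A quick back-of-envelope check supports the bandwidth theory. The numbers below are my assumptions, not figures from the article: I'm assuming the addition benchmark streams two source arrays of 10000 matrices (64 bytes each) and writes one destination array per pass.

```cpp
// Back-of-envelope: effective memory bandwidth of a benchmark pass.
// Assumption (mine): two 10000-element matrix4x4 arrays read, one written,
// 64 bytes per matrix, no cache reuse between passes.
static double effective_gbps(double matrixCount, double msecs)
{
    const double bytesMoved = matrixCount * 64.0 * 3.0; // two reads + one write
    return bytesMoved / (msecs * 1e-3) / 1e9;           // bytes per second -> GB/s
}
```

Plugging in the laptop's 0.624 ms SSE addition time gives roughly 3 GB/s, which is plausibly about what a single thread can stream on a DDR2-era machine; if that's the ceiling, no amount of SSE will make the addition loop faster.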
The third machine was a Core 2 Quad Q8400:
MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [accelerated].
Running benchmark!
Matrix array sizes   : 10000
Number of iterations : 100000
----------------------------------------
Addition benchmark         : average time (msecs) [0.089563]
Subtraction benchmark      : average time (msecs) [0.090273]
Transposition benchmark    : average time (msecs) [0.050641]
Scalar mul benchmark       : average time (msecs) [0.060286]
Matrix mul benchmark       : average time (msecs) [0.172725]
Vector mul benchmark       : average time (msecs) [0.051047]
Vector batch mul benchmark : average time (msecs) [0.037580]

Process returned 0 (0x0)   execution time : 55.302 s
Press any key to continue.

MathLib test program.
MathLib running on platform: [WINDOWS], SIMD acceleration: [unaccelerated].
Running benchmark!
Matrix array sizes   : 10000
Number of iterations : 100000
----------------------------------------
Addition benchmark         : average time (msecs) [0.157504]
Subtraction benchmark      : average time (msecs) [0.158742]
Transposition benchmark    : average time (msecs) [0.102287]
Scalar mul benchmark       : average time (msecs) [0.126073]
Matrix mul benchmark       : average time (msecs) [0.484266]
Vector mul benchmark       : average time (msecs) [0.121134]
Vector batch mul benchmark : average time (msecs) [0.121808]

Process returned 0 (0x0)   execution time : 127.281 s
Press any key to continue.
Conclusion
In line with the vector results, we see a comfortable 2x speedup (laptop results notwithstanding) from the use of SSE intrinsics. The batch vector multiplication and the matrix-matrix multiplication get closer to 3x, which is promising.
The source code for this article can be downloaded here:
https://hotfile.com/dl/
Hello Bryan!
I started programming with SIMD (SSE only) recently, and I'm intrigued by your approach to value ordering within XMM registers.
In the matrix multiplication scalar code you say you "invert" the indices because your vectors have the X value in their highest memory address, so a float[4] would look like: [ W | Z | Y | X ]. Does that mean you've got all your vectors "reversed" in memory? Isn't that a lot of work? You'll get vectors in XYZW form from all mesh formats, textures, etc... unless you write your own loaders.
Then, how are your XMM registers laid out? Using _mm_load_ps would yield [X|Y|Z|W], but _mm_loadr_ps would give [W|Z|Y|X]. Alternatively, with your memory laid out as XYZW, using _mm_loadr_ps your XMM registers would again contain [X|Y|Z|W].
I'm curious about the reasons behind your decision, and I'd also like to ask whether you think there are any pros/cons to laying out the memory one way or the other.
Thank you!
Marc.
Hi Marc!
Yeah, I know it's confusing. Let me try to explain what I was thinking at the time, haha.
Most of the vector tutorials I had read beforehand presented vectors as [x | y | z | w], x on the left through w on the right, which is how we lay out vectors when we think about them and draw them on paper. So I took that mental image and (perhaps naively) mapped it directly onto how I pictured the values in the vector registers, so that the register diagram would look exactly the same. This is where the mapping becomes reversed: when a float[4] is loaded into an SSE register, the element at the lowest address lands in the lowest lane, and the conventional register diagram draws the low lane on the right.
Since we declare vector classes to be
class vector4
{
public:
float x, y, z, w;
...
};
most of the time, the x is actually stored at the lowest memory location. That means that once loaded into an x86 vector register it appears as [w | z | y | x] in the usual high-to-low register diagram. So I simply declared the members in reverse, all to preserve my mental model.
On big-endian platforms the behavior is more in line with the mental model. Since big-endian matches my preferred style, I chose this layout. So that was the rationale in a nutshell.
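The lane order being described can be verified directly. This is my own illustration, not code from the library: with members declared in the usual x, y, z, w order, x sits at the lowest address and therefore lands in lane 0 of the register, which the conventional high-to-low diagram draws on the far right as [w|z|y|x].

```cpp
#include <xmmintrin.h>

// Illustration only: a vector with members in the "natural" x,y,z,w order.
struct vec4_usual { float x, y, z, w; };

// Returns the contents of lane 0 after loading the struct into an XMM
// register; this is always the float at the lowest address, i.e. v.x.
static float lowest_lane(const vec4_usual& v)
{
    __m128 reg = _mm_loadu_ps(&v.x); // unaligned load: lane 0 <- lowest address
    float out[4];
    _mm_storeu_ps(out, reg);         // out[0..3] mirror lanes 0..3
    return out[0];
}
```

So to make the register read [x|y|z|w] left-to-right in the diagram, the members have to be declared in reverse, which is exactly the choice being explained here.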
Using the _mm_loadr_ps intrinsic is probably a nice alternative, but I believe it's really just a movaps instruction plus the appropriate shuffle. So until it becomes a dedicated single instruction, it'll probably be a tiny bit slower than plain _mm_load_ps.
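That equivalence is easy to spell out. The sketch below is my own, not the library's code: the reversed load is just an aligned load followed by a lane-reversing shuffle, which is roughly what compilers emit for _mm_loadr_ps.

```cpp
#include <xmmintrin.h>

// Sketch of what _mm_loadr_ps boils down to: an aligned load (movaps)
// followed by a lane-reversing shuffle (shufps).
static __m128 reverse_load(const float* p) // p must be 16-byte aligned
{
    __m128 fwd = _mm_load_ps(p); // lanes low..high: p[0], p[1], p[2], p[3]
    // _MM_SHUFFLE(0,1,2,3) selects lanes 3,2,1,0, reversing the order.
    return _mm_shuffle_ps(fwd, fwd, _MM_SHUFFLE(0, 1, 2, 3));
}
```

The extra shuffle is why _mm_loadr_ps tends to cost slightly more than _mm_load_ps, as noted above.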
As for using the vectors in a reversed way, it's all hidden behind the interface, and this extends to loading from modelling packages as well. If you load an array of floats from a mesh file, you simply set the values of the vector using the setXYZW() method I'd written (on that note, I just realized the download link is busted, so I need to fix that!) and it works just fine. Once the values are in the vector, the reversed storage doesn't make the math any slower.
As an aside, seeing as x86 will dominate the realtime 3D market for the next 10 years (PS4, XB1, and PC are all x86) this invalidates a fundamental design choice in my library. So I could and should go ahead and reverse the mental model for the next version! Again, as it's all hidden behind the interface, replacing the nuts and bolts won't break any existing code :)
Thanks for the reply Bryan!
I thought so, that it was just an arbitrary decision. I wanted to be sure, since I've spent some hours these last few days in GDB and its assembly window building a mental model of it all.
I'll go with the "easy" way of loading floats in xyzw, then x will be in the lower memory location :)
You're absolutely right, PS4, XB1 and PC all use x86 processors. I hope I'll be programming consoles some day! ;)
Hey, great article! Unfortunately, Hotfile recently got shutdown. Can you please re-upload the sources to somewhere else, possibly Dropbox or something similar? Thanks.
Hey there, yeah sure, I'll do that, but I need to make some changes to the library first and then do part three of the SIMD math library post...
Not sure quite how long that's going to take, so, if you want, send me your email and I'll send you the code as a zip directly?
My mail's bryanlawsmith@gmail.com, cheers!