I'm going to let you in on a personal secret: every single time I start a 3D project, the first thing I look to do is write a new math library. It's a fundamental building block, of course, and there's a clinical cleanliness to the resulting code that a lot of techie programmers find nigh-on rapturous. My first attempt, written when I had been programming in C for a sum total of around three months, was as basic an implementation as you're likely to see. I hadn't even heard of concepts like 'optimization', 'instruction-level parallelism', and 'data alignment' yet. And you know what? I didn't need them. Even the most basic C/C++ math library will perform just fine on a modern x86 chip for the *majority* of your projects.
So what's the point of writing a SIMD math library? Well... because I want to. But if you look hard enough, there are real benefits to it.
Let's start with the pragmatic: not every CPU is an out-of-order x86 super desktop chip with 35 floating-point execution units and 300 trillion transistors dedicated to implicit scheduling logic that extracts your instruction-level parallelism for you, thank you very much. If you're like me and want to write something for, say, the Cell BE inside your still-Linux-capable PS3 (ha Sony, you thought you'd gotten all of us?!), then for a non-trivial project SIMD optimization still bears fruit (I'm actually going to use SSE initially, but the concepts are directly transferable). Secondly, there are certain (very specific) modules we want to squeeze the most performance out of, for instance when writing a software rasterizer or a software vertex pipeline. Third, because it's educational: systems programmers should aspire to understand the hardware capabilities our CPUs give us and how to take advantage of them. And lastly, because it's fun! Low-level coding is good for the programmer's soul.
But first, let's get a few things out of the way. These articles are not meant to be a definitive resource for all things SIMD, SSE, and optimization. I'm very much like the next guy, trying to understand exactly how to do these things and what benefits they bring to my toolbox. To be perfectly honest, a lot of what I've heard about SIMD optimization on x86 (PowerPC and ARM are different) is that it tends to slow down your code! Let's investigate that. I'll probably do some (a lot of) stupid things along the way, so if you know something I don't, drop me an email or a comment.
So what exactly am I going to write about?
I'm not going to post a line-by-line tutorial on how to write a math library, although the code I've written is available for anyone to download and use. These articles are a summary of the work I did. That doesn't mean you can't learn from it; quite the contrary, you can sift through the code line by line and see how it fits together. A math library is a very simple thing with a very simple interface. What I am going to write about are a few of the main design choices, and then, for some functions, the code and assembly output of the two implementations (x87 vs SSE).
Finally, we'll run a test program to compare the execution times to see what, if any, performance is to be gained.
Design Choices
I decided not to have any sort of dynamic dispatch of function calls in the library, which means a recompile is required for each platform you target. The reason is that dynamic dispatch requires first detecting which CPU/OS combination you're running on, and both host configuration detection and dynamic linking methods are system dependent. It is much simpler to use the preprocessor to determine which code gets compiled and set that as your 'min-spec', so to speak. (Also, dynamic linking prevents code inlining, which is problematic from a performance viewpoint.)
In addition, we will assume that any x86 CPU we come across supports up to SSE3 (see Steam's hardware survey), any PowerPC CPU supports AltiVec, and any ARM CPU supports NEON (a bigger assumption, admittedly). For comparison and fallback purposes, native C code is provided as a default, so the library can run on any platform without a SIMD instruction set. For the native code I considered it important to keep the behaviour as close to the vectorized variants as possible, which results in the occasional redundant w component calculation. Even so, minor inconsistencies crop up, particularly around the w component, so extra caution is required when relying on a w value.
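To make that concrete, here is a minimal sketch of what the configuration header could look like. The MATHLIB_SIMD/MATHLIB_SSE switches mirror the ones you'll see in the listings below, but the exact compiler checks and header names here are my assumption, not lifted verbatim from the library:

#define MATHLIB_SIMD 1 // Set to 0 to force the native C fallback everywhere.

#if (MATHLIB_SIMD)
    #if defined(__SSE3__) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
        #define MATHLIB_SSE 1
        #include <pmmintrin.h>  // SSE3 intrinsics (pulls in everything below it too).
    #elif defined(__ALTIVEC__)
        #define MATHLIB_ALTIVEC 1
        #include <altivec.h>
    #elif defined(__ARM_NEON__)
        #define MATHLIB_NEON 1
        #include <arm_neon.h>
    #endif
#endif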
All of the library methods are declared inline, ensuring that any performance we gain from using SIMD code won't be wasted on setting up a function stack-frame.
I use intrinsics instead of inline assembly or handwritten assembly modules. Intrinsics are easier to read and modify, let the compiler schedule your instructions for you, and don't complicate your build process.
All the data types I use have to be 16-byte aligned. This complicates safely declaring them as stack objects inside a function or allocating them dynamically, but it ensures maximum performance. I'll discuss ways of dealing with this in another article.
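For reference, here's roughly what an aligned vector4 might look like; this is a sketch under my own assumptions (the actual layout in the downloadable code may differ), but it shows the union trick that lets the SSE path and the scalar fallback share one type:

#include <xmmintrin.h>

#if defined(_MSC_VER)
    #define MATHLIB_ALIGN(n) __declspec(align(n))
#else
    #define MATHLIB_ALIGN(n) __attribute__((aligned(n)))
#endif

struct MATHLIB_ALIGN(16) vector4
{
    union
    {
        __m128 sse_vec; // The whole vector as one SSE register-sized value.
        float  f[4];    // Scalar access for the native fallback path.
    };
    // extractX()/extractY()/setXYZW() and friends omitted here.
};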
Code Comparison
OK, let's look at some of the functions in the vector part of the math library and how they're implemented; then we'll compare the assembly output of each implementation. If you glance at the library you'll see that the w component of a vector gets modified by the various vector functions, when what you really want is for it to stay mathematically correct at either 0.0 (a vector representation) or 1.0 (a point representation). We correct this with quick vector4_setToVector and vector4_setToPoint functions that clear w to 0.0 and set w to 1.0 respectively. Use them before algorithms that require a valid w component.
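To give a flavour of that pair, here's a sketch of how they might be built on bitmask constants. vector4_clearW appears in the normalize listing further down; the w-lane constant I call vector4_wOne here is my own hypothetical naming, and the scalar setW() accessor is assumed too:

MATHLIB_INLINE void vector4_setToVector(vector4& vector)
/// Clears w to 0.0, turning the value into a direction.
{
#if (MATHLIB_SSE)
    vector.sse_vec = _mm_and_ps(vector.sse_vec, vector4::vector4_clearW);
#else
    vector.setW(0.0f);
#endif
}

MATHLIB_INLINE void vector4_setToPoint(vector4& vector)
/// Sets w to 1.0, turning the value into a point.
{
#if (MATHLIB_SSE)
    // vector4_wOne: hypothetical constant with 1.0f in the w lane, 0 elsewhere.
    vector.sse_vec = _mm_or_ps(_mm_and_ps(vector.sse_vec, vector4::vector4_clearW),
                               vector4::vector4_wOne);
#else
    vector.setW(1.0f);
#endif
}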
Vector Addition:
MATHLIB_INLINE void vector4_add(const vector4& vector1, const vector4& vector2, vector4& result)
/// Adds the two vectors and stores the sum in result.
{
#if (MATHLIB_SIMD)
    #if (MATHLIB_SSE)
        result.sse_vec = _mm_add_ps(vector1.sse_vec, vector2.sse_vec);
    #endif // (MATHLIB_SSE)
#else
    result.setXYZW(vector1.extractX() + vector2.extractX(),
                   vector1.extractY() + vector2.extractY(),
                   vector1.extractZ() + vector2.extractZ(),
                   vector1.extractW() + vector2.extractW());
#endif // (MATHLIB_SIMD)
}
Simple enough, right? Let's take a look at what the compiler spits out (I wrap each inlined function in a normal function call so we get its assembly in isolation):
FPU
0040134C push ebp
0040134D mov ebp,esp
0040134F mov ecx,DWORD PTR [ebp+0x8]
00401352 mov edx,DWORD PTR [ebp+0xc]
00401355 mov eax,DWORD PTR [ebp+0x10]
00401358 fld DWORD PTR [ecx]
0040135A fadd DWORD PTR [edx]
0040135C fld DWORD PTR [ecx+0x4]
0040135F fadd DWORD PTR [edx+0x4]
00401362 fld DWORD PTR [ecx+0x8]
00401365 fadd DWORD PTR [edx+0x8]
00401368 fld DWORD PTR [ecx+0xc]
0040136B fadd DWORD PTR [edx+0xc]
0040136E fstp DWORD PTR [eax+0xc]
00401371 fstp DWORD PTR [eax+0x8]
00401374 fstp DWORD PTR [eax+0x4]
00401377 fstp DWORD PTR [eax]
00401379 leave
0040137A ret
SSE
0040157D push ebp
0040157E mov ebp,esp
00401580 mov eax,DWORD PTR [ebp+0x8]
00401583 movaps xmm0,XMMWORD PTR [eax]
00401586 mov eax,DWORD PTR [ebp+0xc]
00401589 addps xmm0,XMMWORD PTR [eax]
0040158C mov eax,DWORD PTR [ebp+0x10]
0040158F movaps XMMWORD PTR [eax],xmm0
00401592 leave
00401593 ret
Well, regardless of performance, I think it's safe to say that SSE assembly (and vector assembly in general) is quite a bit more compact than scalar assembly, lessening the code-bloat effects of inlining all your math functions.
Vector Add Scaled Vector:
MATHLIB_INLINE void vector4_addScaledVector(const vector4& vector1, const vector4& vector2, float scaleFactor, vector4& result)
/// Adds the scaled vector2 to vector1 and stores the sum in result.
/// Corresponds to the expression result = vector1 + (scaleFactor * vector2).
{
#if (MATHLIB_SIMD)
    #if (MATHLIB_SSE)
        __m128 scaledVector;

        scaledVector = _mm_load1_ps(&scaleFactor);
        scaledVector = _mm_mul_ps(vector2.sse_vec, scaledVector);
        result.sse_vec = _mm_add_ps(vector1.sse_vec, scaledVector);
    #endif // (MATHLIB_SSE)
#else
    result.setXYZW(vector1.extractX() + (vector2.extractX() * scaleFactor),
                   vector1.extractY() + (vector2.extractY() * scaleFactor),
                   vector1.extractZ() + (vector2.extractZ() * scaleFactor),
                   vector1.extractW() + (vector2.extractW() * scaleFactor));
#endif // (MATHLIB_SIMD)
}
FPU
00401428 push ebp
00401429 mov ebp,esp
0040142B mov edx,DWORD PTR [ebp+0x8]
0040142E mov ecx,DWORD PTR [ebp+0xc]
00401431 fld DWORD PTR [ebp+0x10]
00401434 mov eax,DWORD PTR [ebp+0x14]
00401437 fld st(0)
00401439 fmul DWORD PTR [ecx]
0040143B fadd DWORD PTR [edx]
0040143D fld st(1)
0040143F fmul DWORD PTR [ecx+0x4]
00401442 fadd DWORD PTR [edx+0x4]
00401445 fld st(2)
00401447 fmul DWORD PTR [ecx+0x8]
0040144A fadd DWORD PTR [edx+0x8]
0040144D fxch st(3)
0040144F fmul DWORD PTR [ecx+0xc]
00401452 fadd DWORD PTR [edx+0xc]
00401455 fstp DWORD PTR [eax+0xc]
00401458 fxch st(2)
0040145A fstp DWORD PTR [eax+0x8]
0040145D fxch st(1)
0040145F fstp DWORD PTR [eax+0x4]
00401462 fstp DWORD PTR [eax]
00401464 leave
00401465 ret
SSE
00401533 push ebp
00401534 mov ebp,esp
00401536 movss xmm0,DWORD PTR [ebp+0x10]
0040153B shufps xmm0,xmm0,0x0
0040153F mov eax,DWORD PTR [ebp+0xc]
00401542 mulps xmm0,XMMWORD PTR [eax]
00401545 mov eax,DWORD PTR [ebp+0x8]
00401548 addps xmm0,XMMWORD PTR [eax]
0040154B mov eax,DWORD PTR [ebp+0x14]
0040154E movaps XMMWORD PTR [eax],xmm0
00401551 leave
00401552 ret
Vector Magnitude:
MATHLIB_INLINE float vector4_magnitude(const vector4& vector)
/// Returns the magnitude of the vector.
{
#if (MATHLIB_SIMD)
    #if (MATHLIB_SSE)
        __m128 vector0;
        __m128 vector1;
        __m128 vector2;

        vector0 = _mm_mul_ps(vector.sse_vec, vector.sse_vec);                // Square the vector elements.
        vector1 = _mm_shuffle_ps(vector0, vector0, _MM_SHUFFLE(2, 1, 0, 3)); // Shuffle around, add elements
        vector2 = _mm_add_ps(vector0, vector1);                              // (x * x) + (y * y)
        vector1 = _mm_shuffle_ps(vector0, vector0, _MM_SHUFFLE(1, 0, 3, 2));
        vector2 = _mm_add_ps(vector2, vector1);                              // (x * x) + (y * y) + (z * z)
        vector2 = _mm_sqrt_ps(vector2);
        vector2 = _mm_shuffle_ps(vector2, vector2, _MM_SHUFFLE(3, 3, 3, 3));

        return (_mm_cvtss_f32(vector2));
    #endif // (MATHLIB_SSE)
#else
    float x = vector.extractX();
    float y = vector.extractY();
    float z = vector.extractZ();

    return (sqrtf((x * x) + (y * y) + (z * z)));
#endif // (MATHLIB_SIMD)
}
FPU
0040193F push ebp
00401940 mov ebp,esp
00401942 mov eax,DWORD PTR [ebp+0x8]
00401945 fld DWORD PTR [eax+0xc]
00401948 fld DWORD PTR [eax+0x8]
0040194B fld DWORD PTR [eax+0x4]
0040194E fxch st(2)
00401950 fmul st,st(0)
00401952 fxch st(1)
00401954 fmul st,st(0)
00401956 faddp st(1),st
00401958 fxch st(1)
0040195A fmul st,st(0)
0040195C faddp st(1),st
0040195E fsqrt
00401960 leave
00401961 ret
SSE
00401604 push ebp
00401605 mov ebp,esp
00401607 sub esp,0x4
0040160A mov eax,DWORD PTR [ebp+0x8]
0040160D movaps xmm0,XMMWORD PTR [eax]
00401610 mulps xmm0,xmm0
00401613 movaps xmm1,xmm0
00401616 shufps xmm1,xmm0,0x93
0040161A addps xmm1,xmm0
0040161D shufps xmm0,xmm0,0x4e
00401621 addps xmm0,xmm1
00401624 sqrtps xmm0,xmm0
00401627 shufps xmm0,xmm0,0xff
0040162B movss DWORD PTR [ebp-0x4],xmm0
00401630 fld DWORD PTR [ebp-0x4]
00401633 leave
00401634 ret
And look, the generated code is actually more compact on the FPU side.
Vector Dot Product:
MATHLIB_INLINE float vector4_dotProduct(const vector4& vector1, const vector4& vector2)
/// Returns the dot product of the two vectors. This does not take the w component into account.
{
    /// Native FPU code will be faster here because the result will likely be used in a scalar
    /// codepath, so we should avoid an unnecessary store:
    float xcomp = vector1.extractX() * vector2.extractX();
    float ycomp = vector1.extractY() * vector2.extractY();
    float zcomp = vector1.extractZ() * vector2.extractZ();

    return (xcomp + ycomp + zcomp);
}
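That said, if you ever do want a vectorized dot product on our assumed SSE3 min-spec, haddps can replace the shuffle-and-add dance entirely. A sketch, not part of the library as written, assuming both inputs have already had w cleared to 0.0 (e.g. via vector4_setToVector):

#include <pmmintrin.h> // SSE3: _mm_hadd_ps.

MATHLIB_INLINE float vector4_dotProduct_sse3(const vector4& vector1, const vector4& vector2)
/// Hypothetical SSE3 variant; requires w == 0.0 in both inputs.
{
    __m128 products = _mm_mul_ps(vector1.sse_vec, vector2.sse_vec); // (x1*x2) (y1*y2) (z1*z2) 0
    products = _mm_hadd_ps(products, products);                     // Pairwise sums.
    products = _mm_hadd_ps(products, products);                     // All four lanes now hold the full sum.

    return (_mm_cvtss_f32(products));
}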
Vector Cross Product:
MATHLIB_INLINE void vector4_crossProduct(const vector4& vector1, const vector4& vector2, vector4& result)
/// Computes the cross product of vector1 and vector2 and stores the cross product in result.
{
#if (MATHLIB_SIMD)
    #if (MATHLIB_SSE)
        result.sse_vec = _mm_sub_ps(_mm_mul_ps(_mm_shuffle_ps(vector1.sse_vec, vector1.sse_vec, _MM_SHUFFLE(3, 1, 2, 0)),
                                               _mm_shuffle_ps(vector2.sse_vec, vector2.sse_vec, _MM_SHUFFLE(2, 3, 1, 0))),
                                    _mm_mul_ps(_mm_shuffle_ps(vector1.sse_vec, vector1.sse_vec, _MM_SHUFFLE(2, 3, 1, 0)),
                                               _mm_shuffle_ps(vector2.sse_vec, vector2.sse_vec, _MM_SHUFFLE(3, 1, 2, 0))));
    #endif // (MATHLIB_SSE)
#else
    float xcomp = (vector1.extractY() * vector2.extractZ()) - (vector1.extractZ() * vector2.extractY());
    float ycomp = (vector1.extractZ() * vector2.extractX()) - (vector1.extractX() * vector2.extractZ());
    float zcomp = (vector1.extractX() * vector2.extractY()) - (vector1.extractY() * vector2.extractX());

    result.setXYZ(xcomp, ycomp, zcomp);
#endif // (MATHLIB_SIMD)
}
FPU
00401497 push ebp
00401498 mov ebp,esp
0040149A mov edx,DWORD PTR [ebp+0x8]
0040149D mov ecx,DWORD PTR [ebp+0xc]
004014A0 mov eax,DWORD PTR [ebp+0x10]
004014A3 fld DWORD PTR [edx+0x8]
004014A6 fld DWORD PTR [ecx+0x4]
004014A9 fld DWORD PTR [edx+0x4]
004014AC fld DWORD PTR [ecx+0x8]
004014AF fld DWORD PTR [ecx+0xc]
004014B2 fld DWORD PTR [edx+0xc]
004014B5 fld st(5)
004014B7 fmul st,st(5)
004014B9 fld st(4)
004014BB fmul st,st(4)
004014BD fsubrp st(1),st
004014BF fstp DWORD PTR [eax+0xc]
004014C2 fxch st(3)
004014C4 fmul st,st(1)
004014C6 fxch st(4)
004014C8 fmul st,st(3)
004014CA fsubrp st(4),st
004014CC fxch st(3)
004014CE fstp DWORD PTR [eax+0x8]
004014D1 fmulp st(1),st
004014D3 fxch st(1)
004014D5 fmulp st(2),st
004014D7 fsubp st(1),st
004014D9 fstp DWORD PTR [eax+0x4]
004014DC leave
004014DD ret
SSE
00401597 push ebp
00401598 mov ebp,esp
0040159A mov eax,DWORD PTR [ebp+0xc]
0040159D movaps xmm1,XMMWORD PTR [eax]
004015A0 movaps xmm2,xmm1
004015A3 shufps xmm2,xmm1,0xd8
004015A7 mov eax,DWORD PTR [ebp+0x8]
004015AA movaps xmm0,XMMWORD PTR [eax]
004015AD movaps xmm3,xmm0
004015B0 shufps xmm3,xmm0,0xb4
004015B4 mulps xmm2,xmm3
004015B7 shufps xmm1,xmm1,0xb4
004015BB shufps xmm0,xmm0,0xd8
004015BF mulps xmm0,xmm1
004015C2 subps xmm0,xmm2
004015C5 mov eax,DWORD PTR [ebp+0x10]
004015C8 movaps XMMWORD PTR [eax],xmm0
004015CB leave
004015CC ret
Vector Normalize:
Funnily enough, this method received a HUGE speedup from SSE, not because of anything directly SIMD related but because of the very fast rsqrtps instruction, which uses an approximation method and so sacrifices a little precision for a whole lot of speed.
MATHLIB_INLINE void vector4_normalize(vector4& vector)
/// Normalizes the given vector.
{
#if (MATHLIB_SIMD)
    #if (MATHLIB_SSE)
        __m128 vec0;
        __m128 vec1;

        vec0 = _mm_and_ps(vector.sse_vec, vector4::vector4_clearW); // Clear w so it doesn't pollute the sum.
        vec0 = _mm_mul_ps(vec0, vec0);                              // Square the elements.
        vec1 = vec0;
        vec0 = _mm_shuffle_ps(vec0, vec0, _MM_SHUFFLE(2, 1, 0, 3)); // Shuffle around, add elements.
        vec1 = _mm_add_ps(vec0, vec1);
        vec0 = vec1;
        vec1 = _mm_shuffle_ps(vec1, vec1, _MM_SHUFFLE(1, 0, 3, 2));
        vec0 = _mm_add_ps(vec0, vec1);                              // All lanes now hold the squared magnitude.
        vec0 = _mm_rsqrt_ps(vec0);                                  // Approximate reciprocal square root.
        vector.sse_vec = _mm_mul_ps(vec0, vector.sse_vec);
    #endif // (MATHLIB_SSE)
#else
    float vector_magnitude = vector4_magnitude(vector);

    vector4_scale(vector, 1.0f / vector_magnitude, vector);
#endif // (MATHLIB_SIMD)
}
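The flip side is that rsqrtps is only accurate to roughly 12 bits of mantissa. If that ever bites you, the standard trick is to polish the estimate with one Newton-Raphson iteration, y' = 0.5 * y * (3 - x * y * y), at the cost of a few extra multiplies. A sketch of how that could replace the last two lines of the SSE path above; this is my addition, not part of the library as written, so you won't see it in the disassembly below:

// vec0 holds the squared magnitude in all four lanes, exactly as above.
__m128 estimate = _mm_rsqrt_ps(vec0);                        // ~12-bit estimate of 1/sqrt(x).
__m128 halfEst  = _mm_mul_ps(estimate, _mm_set1_ps(0.5f));   // 0.5 * y
__m128 estSq    = _mm_mul_ps(estimate, estimate);            // y * y
__m128 term     = _mm_sub_ps(_mm_set1_ps(3.0f),
                             _mm_mul_ps(vec0, estSq));       // 3 - x * y * y
estimate        = _mm_mul_ps(halfEst, term);                 // y' = 0.5 * y * (3 - x * y * y)
vector.sse_vec  = _mm_mul_ps(estimate, vector.sse_vec);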
FPU
00401952 push ebp
00401953 mov ebp,esp
00401955 push ebx
00401956 sub esp,0x14
00401959 mov ebx,DWORD PTR [ebp+0x8]
0040195C fld DWORD PTR [ebx+0xc]
0040195F fld DWORD PTR [ebx+0x8]
00401962 fld DWORD PTR [ebx+0x4]
00401965 fxch st(2)
00401967 fmul st,st(0)
00401969 fxch st(1)
0040196B fmul st,st(0)
0040196D faddp st(1),st
0040196F fxch st(1)
00401971 fmul st,st(0)
00401973 faddp st(1),st
00401975 fstp DWORD PTR [esp]
00401978 call 0x414dd0 <sqrtf>
0040197D fdivr DWORD PTR ds:0x46f0c0
00401983 fld st(0)
00401985 fmul DWORD PTR [ebx]
00401987 fld st(1)
00401989 fmul DWORD PTR [ebx+0x4]
0040198C fld st(2)
0040198E fmul DWORD PTR [ebx+0x8]
00401991 fxch st(3)
00401993 fmul DWORD PTR [ebx+0xc]
00401996 fstp DWORD PTR [ebx+0xc]
00401999 fxch st(2)
0040199B fstp DWORD PTR [ebx+0x8]
0040199E fxch st(1)
004019A0 fstp DWORD PTR [ebx+0x4]
004019A3 fstp DWORD PTR [ebx]
004019A5 add esp,0x14
004019A8 pop ebx
004019A9 pop ebp
004019AA ret
SSE
004016FA push ebp
004016FB mov ebp,esp
004016FD mov eax,DWORD PTR [ebp+0x8]
00401700 movaps xmm1,XMMWORD PTR [eax]
00401703 movaps xmm0,xmm1
00401706 andps xmm0,XMMWORD PTR ds:0x4f2f80
0040170D mulps xmm0,xmm0
00401710 movaps xmm2,xmm0
00401713 shufps xmm2,xmm0,0x93
00401717 addps xmm0,xmm2
0040171A movaps xmm2,xmm0
0040171D shufps xmm2,xmm0,0x4e
00401721 addps xmm0,xmm2
00401724 rsqrtps xmm0,xmm0
00401727 mulps xmm1,xmm0
0040172A movaps XMMWORD PTR [eax],xmm1
0040172D pop ebp
0040172E ret
Vector Distance:
MATHLIB_INLINE float vector4_distance(const vector4& vector1, const vector4& vector2)
/// Returns the distance between the two vectors.
{
#if (MATHLIB_SIMD)
    #if (MATHLIB_SSE)
        __m128 vec0;
        __m128 vec1;

        vec0 = _mm_sub_ps(vector2.sse_vec, vector1.sse_vec);
        vec0 = _mm_mul_ps(vec0, vec0);                              // vec0 = (x * x) (y * y) (z * z) (w * w)
        vec1 = _mm_shuffle_ps(vec0, vec0, _MM_SHUFFLE(2, 1, 0, 3)); // Shuffle around, add elements.
        vec1 = _mm_add_ps(vec0, vec1);                              // (x * x) + (y * y)
        vec0 = _mm_shuffle_ps(vec0, vec0, _MM_SHUFFLE(1, 0, 3, 2)); // Shuffle around, add elements.
        vec1 = _mm_add_ps(vec0, vec1);                              // (x * x) + (y * y) + (z * z)
        vec1 = _mm_sqrt_ps(vec1);                                   // Square root vector, shuffle and return.

        return (_mm_cvtss_f32(_mm_shuffle_ps(vec1, vec1, _MM_SHUFFLE(2, 1, 0, 3))));
    #endif // (MATHLIB_SSE)
#else
    float xcomp = vector2.extractX() - vector1.extractX();
    float ycomp = vector2.extractY() - vector1.extractY();
    float zcomp = vector2.extractZ() - vector1.extractZ();

    return (sqrtf((xcomp * xcomp) + (ycomp * ycomp) + (zcomp * zcomp)));
#endif // (MATHLIB_SIMD)
}
FPU
0040191A push ebp
0040191B mov ebp,esp
0040191D sub esp,0x18
00401920 mov eax,DWORD PTR [ebp+0x8]
00401923 mov edx,DWORD PTR [ebp+0xc]
00401926 fld DWORD PTR [edx+0xc]
00401929 fsub DWORD PTR [eax+0xc]
0040192C fld DWORD PTR [edx+0x8]
0040192F fsub DWORD PTR [eax+0x8]
00401932 fld DWORD PTR [edx+0x4]
00401935 fsub DWORD PTR [eax+0x4]
00401938 fxch st(2)
0040193A fmul st,st(0)
0040193C fxch st(1)
0040193E fmul st,st(0)
00401940 faddp st(1),st
00401942 fxch st(1)
00401944 fmul st,st(0)
00401946 faddp st(1),st
00401948 fstp DWORD PTR [esp]
0040194B call 0x414de0 <sqrtf>
00401950 leave
00401951 ret
SSE
0040165E push ebp
0040165F mov ebp,esp
00401661 sub esp,0x4
00401664 mov eax,DWORD PTR [ebp+0xc]
00401667 movaps xmm0,XMMWORD PTR [eax]
0040166A mov eax,DWORD PTR [ebp+0x8]
0040166D subps xmm0,XMMWORD PTR [eax]
00401670 mulps xmm0,xmm0
00401673 movaps xmm1,xmm0
00401676 shufps xmm1,xmm0,0x93
0040167A addps xmm1,xmm0
0040167D shufps xmm0,xmm0,0x4e
00401681 addps xmm0,xmm1
00401684 sqrtps xmm0,xmm0
00401687 shufps xmm0,xmm0,0x93
0040168B movss DWORD PTR [ebp-0x4],xmm0
00401690 fld DWORD PTR [ebp-0x4]
00401693 leave
00401694 ret
Benchmarks
After having written all of this, I was curious to see what the performance difference between the two implementations actually was. To this end I set up "benchmarks" for each of the vector4 functions. Each benchmark performs a function over a vector4 array of 10,000 elements (two separate 10,000-element arrays for functions that take two inputs) and stores the results in either a vector4 array or a float array, depending on the result type. The main function runs each benchmark 100,000 times and computes the average time per benchmark.
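As a rough illustration, the harness boils down to something like this; the names and the clock() timer here are mine, not the exact code from the download, and it assumes the math library header is already included:

#include <ctime>

static const int ARRAY_SIZE = 10000;
static vector4 array1[ARRAY_SIZE];      // Globals, so they get the required 16-byte alignment.
static vector4 array2[ARRAY_SIZE];
static vector4 results[ARRAY_SIZE];

static double getTimeSeconds()
{
    return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}

static void benchmark_vector4_add()
{
    for (int i = 0; i < ARRAY_SIZE; ++i)
    {
        vector4_add(array1[i], array2[i], results[i]);
    }
}

static double averageBenchmarkSeconds()
{
    const int iterations = 100000;
    const double start = getTimeSeconds();

    for (int i = 0; i < iterations; ++i)
    {
        benchmark_vector4_add();
    }

    return (getTimeSeconds() - start) / iterations;
}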
Test machine(s):
For testing I used two machines: my desktop, which has a Core 2 Quad Q8400, and my laptop, which has a far less powerful Pentium Dual-Core T3400. Obviously core count means nothing for this particular application, so it comes down to architecture and clock speed.
Desktop Results
Laptop Results
Conclusion
Well, I've tested the concept of SIMD acceleration of vector operations on the platform that supposedly rewards the programmer least for it, and I've come out with a generally favorable opinion. It's not a 4x speedup, sure, but I don't consider it a waste of time. The potential for a speedup is definitely there; we just have to exercise a little caution when we program, and always profile and understand our code.
We still aren't done with SSE yet. Up next: matrices.
The source code for this article can be downloaded here:
https://hotfile.com/dl/