*[In this reprinted #altdevblogaday in-depth piece, Valve Software programmer Bruce Dawson continues his series on the floating-point format with a new chapter on intermediate floating-point precision.]*

Riddle me this, Batman: how much precision are these calculations evaluated at?

```cpp
void MulDouble(double x, double y, double* pResult)
{
    *pResult = x * y;
}

void MulFloat(float x, float y, float* pResult)
{
    *pResult = x * y;
}
```

**Previously on this channel…**

If you're just joining us then you may find it helpful to read some of the earlier posts in this series.

- 1: Tricks With the Floating-Point Format – an overview of the float format
- 2: Stupid Float Tricks – incrementing the integer representation
- 3: Don't Store That in a Float – a cautionary tale about time
- 3b: They sure look equal… – special bonus post (not on altdevblogaday)
- 4: Comparing Floating Point Numbers, 2012 Edition
- 5: Float Precision—From Zero to 100+ Digits
- 5b: C++ 11 std::async for Fast Float Format Finding – special bonus post (not on altdevblogaday)
- 6: Intermediate Precision – (return *this;)

**Default x86 code**

The obvious place to start is with a default VC++ project. Let's look at the results for the release build of a Win32 (x86) project with the default compiler settings except for turning off Link Time Code Generation (LTCG). Here's the generated code:

```asm
; MulDouble
push  ebp
mov   ebp, esp
fld   QWORD PTR _x$[ebp]
mov   eax, DWORD PTR _pResult$[ebp]
fmul  QWORD PTR _y$[ebp]
fstp  QWORD PTR [eax]
pop   ebp
ret   0

; MulFloat
push  ebp
mov   ebp, esp
fld   DWORD PTR _x$[ebp]
mov   eax, DWORD PTR _pResult$[ebp]
fmul  DWORD PTR _y$[ebp]
fstp  DWORD PTR [eax]
pop   ebp
ret   0
```

It's interesting that the code for MulDouble and MulFloat is identical except for size specifiers on the load and store instructions. These size specifiers just indicate whether to load/store a float or a double from memory. Either way the fmul is done at "register precision".

**Register precision**

By default a 32-bit VC++ x86 project generates x87 code for floating-point operations, as shown above. The peculiar x87 FPU has eight registers, and people will usually tell you that 'register precision' on this FPU means 80-bit precision. These people are wrong. Mostly.

It turns out that the x87 FPU has a precision setting. This can be set to 24-bit (float), 53-bit (double), or 64-bit (long double). VC++ hasn't supported long double for a while, so it initializes the FPU to double precision during thread startup. This means that every floating-point operation on the x87 FPU – add, multiply, square root, etc. – is done to double precision by default, albeit in 80-bit registers. The registers are 80 bits, but every operation is rounded to double precision before being stored in the register.

Except even that is not quite true. The mantissa is rounded to a double-precision-compatible 53 bits, but the exponent is not, so the precision is double compatible but the range is not. In the x87 register diagram below, blue is the sign bit, pink is the exponent, and green is the mantissa. The full exponent is always used, but when the rounding is set to 24-bit only the light green mantissa bits are used. When the rounding is set to 53-bit, only the light and medium green mantissa bits are used, and when 64-bit precision is requested the entire mantissa is used.

In summary, as long as your results are between DBL_MIN and DBL_MAX (they should be), the larger exponent doesn't matter, and we can simplify and just say that register precision on x87 means double precision. Except when it doesn't. While the VC++ CRT initializes the x87 FPU's register precision to _PC_53 (53 bits of mantissa), this can be changed. By calling _controlfp_s you can increase the precision of register values to _PC_64 or lower it to _PC_24. And D3D9 has a habit of setting the x87 FPU to _PC_24 precision unless you specify D3DCREATE_FPU_PRESERVE.
This means that on the thread where you initialized D3D9 you should expect many of your calculations to give different results than on other threads. D3D9 does this to improve the performance of the x87's floating-point divide and square root. Nothing else runs faster or slower. So, the assembly code above does its intermediate calculations in:

- 64-bit precision (80-bit long double), if somehow you avoid the CRT's initialization or if you use _controlfp_s to set the precision to _PC_64
- or 53-bit precision (64-bit double), by default
- or 24-bit precision (32-bit float), if you've initialized D3D9 or used _controlfp_s to set the precision to _PC_24

**/arch:SSE/SSE2**

The acronym soup which is SSE changes all of this. Instructions from the SSE family all specify what precision to use and there is no precision override flag, so you can tell exactly what you're going to get (as long as nobody has changed the rounding flags, but let's pretend I didn't mention that, shall we?). If we enable /arch:SSE2 and otherwise leave our compiler settings unchanged (release build, link-time-code-gen off, /fp:strict) we will see these results in our x86 build:

```asm
; MulDouble
push  ebp
mov   ebp, esp
movsd xmm0, QWORD PTR _x$[ebp]
mov   eax, DWORD PTR _pResult$[ebp]
mulsd xmm0, QWORD PTR _y$[ebp]
movsd QWORD PTR [eax], xmm0
pop   ebp
ret   0

; MulFloat
push  ebp
mov   ebp, esp
movss xmm0, DWORD PTR _x$[ebp]
movss xmm1, DWORD PTR _y$[ebp]
mov   eax, DWORD PTR _pResult$[ebp]
cvtps2pd xmm0, xmm0
cvtps2pd xmm1, xmm1
mulsd xmm0, xmm1
cvtpd2ps xmm0, xmm0
movss DWORD PTR [eax], xmm0
pop   ebp
ret   0
```

The MulDouble code looks comfortably similar, with just some mnemonic and register changes on three of the instructions. Simple enough. The MulFloat code looks… longer. Weirder. Slower. The MulFloat code has four extra instructions. Two of these instructions convert the inputs from float to double, and a third converts the result from double back to float. The fourth is an explicit load of 'y' because SSE can't load-and-multiply in one instruction when combining float and double precision. Unlike the x87 instruction set, where conversions happen as part of the load/store process, SSE conversions must be explicit. This gives greater control, but it adds cost.

If we optimistically assume that the loads and float-to-double conversions of 'x' and 'y' happen in parallel then the dependency chain for the floating-point math varies significantly between these two functions:

MulDouble: movsd -> mulsd -> movsd

MulFloat: movss -> cvtps2pd -> mulsd -> cvtpd2ps -> movss

That means the MulFloat dependency chain is 66% longer than the MulDouble dependency chain. The movss and movsd instructions are somewhere between cheap and free, and the convert instructions have comparable cost to a floating-point multiply, so the actual latency increase could be even higher. Measuring such things is tricky at best, and the measurements extrapolate poorly, but in my crude tests I found that on optimized 32-bit /arch:SSE2 builds with VC++ 2010, MulFloat takes 35% longer to run than MulDouble. On tests where there is less function call overhead I've seen the float code take 78% longer than the double code. Ouch.

**To widen or not to widen**

So why does the compiler do this? Why does it waste three instructions to widen the calculations to double precision? Is this even legal? Is it perhaps required? The guidance for whether to do float calculations using double-precision intermediates has gone back and forth over the years, except when it was neither back nor forth. The IEEE 754 floating-point math standard chose not to rule on this issue:

> Annex B.1: The format of an anonymous destination is defined by language expression evaluation rules.

Okay, so it's up to the language – let's see what they say. In *The C Programming Language*, Copyright © 1978 (prior to the 1985 IEEE 754 standard for floating point), Kernighan and Ritchie wrote:

> All floating arithmetic in C is carried out in double-precision; whenever a float appears in an expression it is lengthened to double by zero-padding its fraction.

However the C++ 1998 standard does not appear to contain that language. The most obvious governing language is the usual arithmetic conversions (section 5.9 in the 1998 standard), which basically says that double precision is only used if the expression contains a double:

> 5.9: If the above condition is not met and either operand is of type double, the other operand is converted to type double.

But the C++ 1998 standard also appears to permit floating-point math to be done at higher precisions if a compiler feels like it:

> 5.10 The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.55)

The C99 standard also appears to permit but not require floating-point math to be done at higher precision. And Intel's compiler lets you choose whether intermediate results should use the precision of the source numbers, double precision, or extended precision:

> Evaluation type may be wider than semantic type – wide evaluation does not widen semantic type

> /fp:double: Rounds intermediate results to 53-bit (double) precision and enables value-safe optimizations.

Apparently compilers *may* use greater precision for intermediates, but *should* they? The classic Goldberg article points out errors that can occur from doing calculations on floats using double precision:

> This suggests that computing every expression in the highest precision available is not a good rule.

On the other hand, a well reasoned article by Microsoft's Eric Fleegal says:

> It is generally more desirable to compute the intermediate results in as high as [sic] precision as is practical.

The consensus is clear: sometimes having high-precision intermediates is great, and sometimes it is horrible. The IEEE math standard defers to the C++ standard, which defers to compiler writers, which sometimes then defer to developers.

**It's a design decision and a performance bug**

As Goldberg and Fleegal point out, there are no easy answers. Sometimes higher precision intermediates will preserve vital information, and sometimes they will lead to confusing discrepancies.

I think I've reverse engineered how the VC++ team made their decision to use double-precision intermediates with /arch:SSE2. For many years 32-bit code on Windows has used the x87 FPU, set to double precision, so for years the intermediate values have been (as long as they stay in the registers) double precision. It seems obvious that when SSE and SSE2 came along the VC++ team wanted to maintain consistency. This means explicitly coding double-precision temporaries. The compiler (spoiler alert: prior to VS 11) also has a strong tendency to use x87 instructions in any function that returns a float or double, presumably because the result has to be returned in an x87 register anyway, and I'm sure they wanted to avoid inconsistencies within the same program.

The frustrating thing about my test functions above is that the extra instructions are provably unneeded. They will make no difference to the result. The reason I can be so confident that the double-precision temporaries make no difference in the example above is that there is just one operation being done. The IEEE standard has always guaranteed that the basic operations (add, subtract, multiply, divide, square root, some others) give perfectly rounded results. For any single basic operation on two floats, if the result is stored into a float then widening the calculation to double is completely pointless because the result is already *perfect*. The optimizer should recognize this and remove the four extraneous instructions. Intermediate precision matters in complex calculations, but in a single multiply it doesn't.

In short, even though the VC++ policy in this case is to use double precision for float calculations, the "as-if" rule means that the compiler can (and should) use single precision if it is faster and if the results would be the same "as-if" double precision was used.

**64-bit changes everything**

Here's our test code again:

```cpp
void MulDouble(double x, double y, double* pResult)
{
    *pResult = x * y;
}

void MulFloat(float x, float y, float* pResult)
{
    *pResult = x * y;
}
```

And here's the code generated for x64:

```asm
; MulDouble
mulsd  xmm0, xmm1
movsdx QWORD PTR [r8], xmm0
ret    0

; MulFloat
mulss  xmm0, xmm1
movss  DWORD PTR [r8], xmm0
ret    0
```

The difference is almost comical. The shortest function for 32-bit was eight instructions, and these are three instructions. What happened?

The main differences come from the 64-bit ABI. On 32-bit we need "push ebp / mov ebp, esp / pop ebp" in order to set up and tear down a stack frame for fast stack walking. This is crucial for ETW/xperf profiling and for many other tools and, as dramatic as it looks in this example, is not actually expensive and should not be disabled. On 64-bit the stack walking is done using metadata rather than executed instructions, so we save three instructions. The 64-bit ABI also helps because the three parameters are all passed in registers instead of on the stack, and that saves us two instructions for loading parameters into registers. And with that we've accounted for all of the differences between the 32-bit and 64-bit versions of MulDouble.

The differences in MulFloat are more significant. x64 implies /arch:SSE2, so there's no need to worry about the x87 FPU. And the math is being done at float precision, which saves three conversion instructions. It appears that the VC++ team decided that double-precision temporaries were no longer a good idea for x64. In the sample code I've used so far in this post this is pure win – the results will be identical, only faster. For more complicated calculations – anything more than a single binary operator – the results may differ on 64-bit because of the move away from double-precision intermediates. The results may be better or worse, but watch out. Floating-point results can be sensitive to initial conditions: it's the butterfly effect, wherein a distant hurricane may cause a butterfly to flap its wings.

If you want to control the precision of intermediates then cast some of your input values to double, and store explicit temporaries in doubles. The cost of doing this can be minimal in many cases.

**64-bit FTW**

64-bit code-gen was a chance for the VC++ team to wipe the slate clean. The new ABI plus the new policy on intermediate precision means that the instruction count for these routines goes from eight to twelve instructions down to just three. Plus there are twice as many registers. No wonder x64 code can (if your memory footprint doesn't increase too much) be faster than x86 code.

**VS 11 changes everything**

Visual Studio 11 beta (which I'm guessing will eventually be called VS 2012) appears to have a lot of code-gen differences, and one of them is in this area. VS 11 no longer uses double precision for intermediate values when it is generating SSE/SSE2 code for float calculations.

**It's the journey, not the destination**

So far our test code has simply done one basic math operation on two floats and stored the result in a float, in which case higher precision does not affect the result. As soon as we do multiple operations, or store the result in a double, higher precision intermediates give us a different and more accurate answer. One could argue that the compiler should use higher precision whenever the result is stored in a double. However, this would diverge from the C++ policy for integer math, where the size of the destination doesn't affect the calculation. For example:

```cpp
int x = 1000000;
long long y = x * x;  // the multiply is done at int precision and overflows
```

**Comprehension test**

Here's some simple code. It just adds four numbers together. The numbers have been carefully chosen so that there are three possible results, depending on whether the precision of intermediate values is float, double, or long-double.

```cpp
float g_one     = 1.0f;
float g_small_1 = FLT_EPSILON * 0.5;
float g_small_2 = DBL_EPSILON * 0.5;
float g_small_3 = DBL_EPSILON * 0.5;

void AddThreeAndPrint()
{
    printf("Sum = %1.16e\n", ((g_one + g_small_1) + g_small_2) + g_small_3);
    PrintOne();
}
```

```cpp
// Here's how to use _controlfp_s to set the x87 register precision
// to 24 bits (float precision)
unsigned oldState;
_controlfp_s(&oldState, _PC_24, _MCW_PC);
```

**Bigger is not necessarily better**

I want to emphasize that more precise intermediates are not 'better'. As we have seen they can cause performance to be worse, but they can also cause unexpected results. Consider this code:

```cpp
float g_three = 3.0f;
float g_seven = 7.0f;

void DemoOfDanger()
{
    float calc = g_three / g_seven;
    if (calc == g_three / g_seven)
        printf("They're equal!\n");
    else
        printf("They're not equal!\n");
}
```

**Sometimes bigger *is* better**

Imagine this code:

```cpp
float g_three = 3.0f;
float g_seven = 7.0f;

double DemoOfAwesome()
{
    return g_three / g_seven;
}
```

```cpp
float g_pointOne = 0.1f;

void PrintOne()
{
    printf("One might equal %1.10f\n", g_pointOne * 10);
}
```