**UPDATE:**Just to make sure we see it in the same way: this post is about playing with bits and floats just for the sake of playing with bits and floats; it's not something you could or should use in anything serious though :)

**UPDATE 2:**There were two undefined behaviours pointed out in my code (one, two) - these are now fixed.

## The problem explained

Let's start by demonstrating a the problem by running the following code that compares subsequent integers with a floating point value:` float a = 100000000.0f;`

printf("...99 --> %i\n", a == 99999999);

printf("...00 --> %i\n", a == 100000000);

printf("...01 --> %i\n", a == 100000001);

printf("...02 --> %i\n", a == 100000002);

printf("...03 --> %i\n", a == 100000003);

printf("...04 --> %i\n", a == 100000004);

printf("...05 --> %i\n", a == 100000005);

The result:

`...99 --> 1`

...00 --> 1

...01 --> 1

...02 --> 1

...03 --> 1

...04 --> 1

...05 --> 0

Sadly this was to be expected in the floating point realm. However, while in this world both 99999999 and 100000004 might be equal to 100000000, this is sooo not true for common sense nor standard arithmetic.

Let's look at another example - an attempt to sort a collection of numbers by value in PHP:

`<?php`

$x = array(

20000000000000002,

20000000000000003,

20000000000000000.0,

);

**sort($x);**

foreach ($x as $i) {

if (is_float($i)) {

printf("%.0f\n", $i);

} else {

printf("%i\n", $i);

}

}

The "sorted" result (64-bit PHP):

`> php test.php`

20000000000000002

20000000000000000

20000000000000003

**Side note**: The code above must be executed using 64-bit PHP. The 32-bit PHP has integers limited to 32-bit, so the numbers I used in the example would exceed their limit and would get silently converted to doubles. This results in the following output:

20000000000000000

20000000000000000

20000000000000004

So, what's going on?

It all boils down to floats having to little precision for larger integers (this is a good time to look at this and this). For example, the 32-bit float has only 23 bits dedicated to the significand - this means that if an integer value that is getting converted to float needs more than 24 bits (sic!; keep in mind that in floats there is a hardcoded "1" at the top position, which is not present in the bit-level representation) to be represented, it will get truncated - i.e. the least significant bits will be treated as zeroes.

In the C-code case above the decimal value 100000001 actually requires 27 bits to be properly represented:

0b

**101111101011110000100000**~~001~~

However, since only the leading "1" and following 23-bits will fit inside a float, the "1" at the very end gets truncated. Therefore, this number actually becomes another number:

0b

**101111101011110000100000000**

Which in decimal is 100000000 and therefore is equal to the float constant of 100000000.0f.

Same problem exists between 64-bit integers and 64-bit doubles - the latter have only 52 bits dedicated for storing the value.

## A somewhat amusing side note

Actually, it gets even better. Let's re-write the first code shown above (the C one) to use a loop:` float a = 100000000.0f;`

int i;

for(i = 100000000 - 5; i <= 100000000 + 5; i++) {

printf("%11.1f == %9u --> %i\n", a, i, a == i);

}

As you can see, there are no big changes. Now let's compile it and run it:

`>gcc test.c`

> a

100000000.0 == 99999995 --> 0

100000000.0 == 99999996 --> 0

100000000.0 == 99999997 --> 0

100000000.0 == 99999998 --> 0

100000000.0 == 99999999 --> 0

100000000.0 == 100000000 --> 1

100000000.0 == 100000001 --> 0

100000000.0 == 100000002 --> 0

100000000.0 == 100000003 --> 0

100000000.0 == 100000004 --> 0

100000000.0 == 100000005 --> 0

The result is magically correct! How about we compile it with optimization then?

`>gcc test.c -O3`

> a

100000000.0 == 99999995 --> 0

100000000.0 == 99999996 --> 1

100000000.0 == 99999997 --> 1

100000000.0 == 99999998 --> 1

100000000.0 == 99999999 --> 1

100000000.0 == 100000000 --> 1

100000000.0 == 100000001 --> 1

100000000.0 == 100000002 --> 1

100000000.0 == 100000003 --> 1

100000000.0 == 100000004 --> 1

100000000.0 == 100000005 --> 0

Why is that? Well, in both cases the compiler needs to convert the integer to a float and then compare it with the second float value. This however can be done in two different ways:

**Option 1**: The integer is converted to a floating point value, then is stored in memory as a 32-bit float and then loaded into the FPU for the comparison OR (in case of constants) the integer constant can be converted to a 32-bit float constant at compilation time and then it will be loaded into the FPU for comparison at runtime.

**Option 2**: The integer is directly loaded into the FPU for comparison (using

*fild*FPU instruction or similar).

The difference here is related to the FPU internally operating on larger floating point values with more precision (by default it's 80-bits, though you can change this) - so the 32-bit integer isn't truncated on load, as it would happen if it gets converted explicitly to a 32-bit float (which, again, has only 24-bits for the actual value).

Which option is selected depends strictly on the compiler - it's mood, version, options used at compilation, etc.

## The perfect comparison

Of course, it's possible to do a perfect comparison.The simplest and most straightforward way is to cast both the int value and the float value to a double before comparing them - double has large enough significand to store all possible 32-bit int values. And for the 64-bit integers you can use the 80-bit long double which has exactly 64 bits dedicated for storing the value (plus the ever-present "1").

But that's too easy. Let's try to do the actual comparison without converting to larger types.

This can be done in two ways: the "mathematical" way (or: value-specific way) and the encoding-specific way. Both are presented below.

**UPDATE 3:**Actually there seems to be another way, as pointed out in the comments below and in this reddit post. It does make sense, but I still wonder if there is any counterexample (please note that I'm not saying there is; I'm just saying it never hurts to look for one ;>).

## The mathematical way

We basically do it the other way around - i.e. we try to convert the float to an integer. There are a couple of problems here which we need to deal with:1. The float value might be bigger than INT_MAX or smaller than INT_MIN. In such case this might happen and we wouldn't be able to catch it after the conversion, so we need to deal with it sooner.

2. The float value might have a non-zero fractional part. This would get truncated when converted to an int (e.g. (int)1.1f is equal to 1) - we don't want this to happen either.

The implementation of this method (with some comments) is presented below:

`bool IntFloatCompare(int i, float f) {`

// Simple case.

if ((float)i != f)

return false;

// Note: The constant used here CAN be represented as a float. Normally

// you would want to use INT_MAX here instead, but that value

// *cannot* be represented as a float.

const float TooBigForInt = (float)0x80000000u;

if (f >= TooBigForInt) {

return false;

}

if (f < -TooBigForInt) {

return false;

}

float ft = truncf(f);

if (ft != f) {

// Not an integer.

return false;

}

// It should be safe to cast float to integer now.

int fi = (int)f;

return fi == i;

}

## The encoding-specific way

This method relies on decoding the float value from the bit-level representation, checking if it's an integer, checking if it is in range and finally comparing the bits with the integer value. I'll just leave you with the code. If in doubt - refer to this wikipedia page.`bool IntFloatCompareBinary(int i, float f) {`

uint32_t fu32;

memcpy(&fu32, &f, 4);

uint32_t sign = fu32 >> 31;

uint32_t exp = (fu32 >>23) & 0xff;

uint32_t frac = fu32 & 0x7fffff;

// NaN? Inf?

if (exp == 0xff) {

return false;

}

// Subnormal representation?

if (exp == 0) {

// Check if fraction is 0. If so, it's true if "i" is 0 as well.

// Otherwise it's false in all cases.

return (frac == 0 && i == 0);

}

int exp_decoded = (int)exp - 127;

// If exponent is negative, the number has a fraction part, which means it's not equal.

if (exp_decoded < 0) {

return false;

}

// If exponenta is above or equal to 31, int cannot represent so big numbers.

if (exp_decoded > 31) {

return false;

}

// There is one case where exp_decoded equal to 31 makes sens - when float is

// equal to INT_MIN, i.e. sign is - and fraction part is 0.

if (exp_decoded == 31 && (sign != 1 || frac != 0)) {

return false;

}

// What is left is in range of integer, but still can have a fraction part.

// Check if any fraction part will be left.

uint32_t value_frac = (frac << exp_decoded) & 0x7fffff;

if (value_frac != 0) {

return false;

}

// Check the value.

int value = (1 << 23) | frac;

int shift_diff = exp_decoded - 23;

if (shift_diff <0) {

value >>= -shift_diff;

} else {

value <<= shift_diff;

}

if (sign) {

value = -value;

}

return i == value;

}

## Summary

The above functions can be used for a perfect comparison and they SeemToWork™ (at least on little endian x86). With some more work both functions could be converted to be perfect "less than" comparators which then could be used to fix the PHP sorting example.But... seriously, just cast the integer and float to something that has more precision ;>

P.S. Did you know that there are exactly 75'497'471 positive integer values that can be precisely represented as a float? Not a lot for the total of 2'147'483'647 positive integers.

## Comments:

float a = 100000000.0f;

printf("...99 --> %u\n", (int)a == 99999999);

printf("...00 --> %u\n", (int)a == 100000000);

printf("...01 --> %u\n", (int)a == 100000001);

printf("...02 --> %u\n", (int)a == 100000002);

printf("...03 --> %u\n", (int)a == 100000003);

printf("...04 --> %u\n", (int)a == 100000004);

printf("...05 --> %u\n", (int)a == 100000005);

Works fine on any optimization level.

http://codepad.org/ZT3HcN6k

Results in 2**32 == 2**31-1

return (int)f == f && (int)f == i;

Since every machine has SSE2 nowadays, gcc -mfpmath=sse -msse2 gets rid of x87 and its (inconsistent) use of extended precision. Which doesn't make floating point trivial, but at least all your builds will yield the same results - and be identical to ARM, PowerPC etc. builds if you ever need any of these.

Oh, I totally agree - using floats for anything that requires precise results sounds like a dangerous area.

That said, this post was about playing with bits and floats, and not about how a production-level code should look like :)

@chippydip

That does sound like something that would work. I need to think if there is any counterexample to this (I'm not saying there is, but... no harm in checking).

Similar approach: http://www.reddit.com/r/programming/comments/244xvq/the_perfect_int_float_comparison/ch3xl4v

So the problem is that you are referring to a platform (AVR/gcc) which is using non-standard doubles which are 32-bits only, same as float (i.e. there are no true doubles on that platform) - https://gcc.gnu.org/wiki/avr-gcc:

Deviations from the Standard

double

double is only 32 bits wide and implemented in the same way as float

Therefore on that platform there still is no real cast to a larger type than float in the place you've specified (even if the name of the type would change due to implementation specifics).

On platforms where "double" is actually a real "double sized float" there is no cast there. See e.g.:

http://linux.die.net/man/3/truncf

double trunc(double x);

float truncf(float x); <--- uses float type

I don't have the standard/draft at hand, but I'm pretty sure truncf uses floats in the official definition as well.

## Add a comment: