2010-11-24:

Various behaviors of scanf/atoi/strtol

c++:c:windows:linux
A few days ago, while discussing a piece of code with aps, we ran into some interesting (imho) differences in how atoi and [sf]scanf are implemented across different versions of msvcrt (Microsoft C Runtime Library), glibc (GNU C Library) and the libc used on OSX. The differences show up when the number in the provided string cannot be represented as an int, i.e. it's larger than INT_MAX (0x7fffffff, or 2147483647 decimal) or smaller than INT_MIN (0x80000000, or -2147483648 decimal).

Let's start with an example:
printf("%.8x\n", atoi("12345678901234567890"));
printf("%.8x\n", atoi("-12345678901234567890"));

The above code outputs:

* msvcrt (up to msvcr71 inclusive)
eb1f0ad2
14e0f52e


* msvcrt (from msvcr80 inclusive)
7fffffff
80000000


* glibc
7fffffff
80000000


So, as one can see, in the older versions of msvcrt an overflow occurs, and we are shown only the 32 least significant bits of the full number. With glibc and the newer msvcrt we get saturation instead of an overflow: if the number in the provided string is lower than INT_MIN or higher than INT_MAX, we get INT_MIN or INT_MAX respectively (since the variable "saturates" at these limits).

This behavior differs between functions (atoi/[fs]scanf %i/strtol), versions, implementations, etc.
A few of my readers and I have run some tests (with the same numbers as in the example code above) on different versions/libraries/architectures, and the results are quite interesting:

name          arch    ver             atoi-  atoi+  sscanf-  sscanf+  strtol-  strtol+
crtdll.dll    32-bit  ?               OF     OF     OF       OF       Sat.     Sat.
msvcrt.dll    32-bit  7.0.7600.16385  OF     OF     OF       OF       Sat.     Sat.
msvcrt.dll    32-bit  7.0.3790.3959   OF     OF     OF       OF       Sat.     Sat.
msvcrt20.dll  32-bit  2.12.0.0        OF     OF     OF       OF       Sat.     Sat.
msvcrt40.dll  32-bit  6.1.7600.16385  OF     OF     OF       OF       Sat.     Sat.
msvcr71.dll   32-bit  7.10.3052.4     OF     OF     OF       OF       Sat.     Sat.
msvcr80.dll   32-bit  8.0.50727.4053  Sat.   Sat.   OF       OF       Sat.     Sat.
msvcr90.dll   32-bit  9.0.30729.1     Sat.   Sat.   OF       OF       Sat.     Sat.
msvcr100.dll  32-bit  10.0.30319.1    Sat.   Sat.   OF       OF       Sat.     Sat.
GNU Lib. C    32-bit  2.7             Sat.   Sat.   Sat.     Sat.     Sat.     Sat.
GNU Lib. C    32-bit  2.7-10          Sat.   Sat.   Sat.     Sat.     Sat.     Sat.
GNU Lib. C    32-bit  2.11.1          Sat.   Sat.   Sat.     Sat.     Sat.     Sat.
GNU Lib. C    64-bit  2.9             -1     0      -1       0        -1       0
GNU Lib. C    64-bit  2.11.1          -1     0      -1       0        -1       0
GNU Lib. C    64-bit  2.11.2          -1     0      -1       0        -1       0
OSX ? Lib. C  64-bit  OSX 10.6.4      -1     0      -1       0        -1       0
OSX ? Lib. C  32-bit  OSX 10.5.8      Sat.   Sat.   -1       0        Sat.     Sat.

(OF = overflow, Sat. = saturation)

Thanks to Zarul Shahrin and Unavoweda for running these tests on OSX.
Thanks for the remarks and sharing results to: djstrong, przemoc, ppkt, Rolek, faramir
In the above table, strtol's result is always cast to an int.


The function strtol returns a long int, which is 32 bits wide on the 32-bit systems tested and 64 bits wide on 64-bit Linux and OSX (on 64-bit Windows, long stays 32-bit). In glibc, atoi and [sf]scanf %i rely on strtol to do the actual conversion, and the result is then truncated to its 32 least significant bits. So, even though strtol returns 0x7fffffffffffffff and 0x8000000000000000 on these 64-bit systems, the values are truncated to 0xffffffff and 0x00000000, which represent -1 and 0 in two's complement.

(Another thanks goes to Zarul here.) On the 32-bit OSX, sscanf %i internally uses the strtoimax function to do the conversion. This function returns intmax_t, which on a 32-bit x86 is defined as int64_t (meaning long long). So, even though the arch is 32-bit, the results in this case are similar to those of the 64-bit functions that rely on strtol.

It might be worth checking in which of the above cases ERANGE (see the quotes from the C99 draft below) is actually stored in the errno global variable.

In a comment on the Polish side of the mirror, Rolek ran a test (and shared the results, thanks ;>) covering, among others, the atoi/atol/wtoi/wtol functions, using "1234567890123456789012345678901234567890" and "-1234567890123456789012345678901234567890" as test strings (or wide-char versions of these in case of wto[li]). The results in case of crtdll.dll (version ?) and msvcrt20.dll (version ?) were different for the atoi/atol and wtoi/wtol pairs:

* crtdll.dll and msvcrt20.dll
atoi 0xCE3F0AD2 0x31C0F52E (OF)
atol 0xCE3F0AD2 0x31C0F52E (OF)
wtoi 0xEB1F0AD2 0x82167EEB (truncate(20) → OF)
wtol 0xEB1F0AD2 0x82167EEB (truncate(20) → OF)

Where does this difference come from? Well, it looks like the authors of the wto[li] functions in these libraries took a little shortcut. Instead of writing a proper wide string→int conversion function, they made wto[li] a converting wrapper around atol. The wrapper works like this: it first converts wide-char→ASCII (using WideCharToMultiByte with the char limit set to 20 chars; this is at most 20 digits in case of positive numbers, and a minus sign and at most 19 digits in case of negative numbers), and then calls atol. So, the string is truncated during the conversion, hence the difference in the returned results.

In the newer versions of msvcrt, wto[li] is no longer a wrapper but a fully functional converter that, in addition to ASCII digits (U+0030 to U+0039), supports a handful of other Unicode digits, e.g. Arabic-Indic Digits (U+0660 to U+0669) or Full Width Digits (U+FF10 to U+FF19).

printf("%i\n", _wtoi(L"\uFF11\u0663\u0C69\u17e7"));

The above code outputs:
1337

"\uFF11\u0663\u0C69\u17e7" are the codes of １٣౩៧

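For those without a Windows box at hand, the idea can be sketched in portable C. The digit_value and parse_digits helpers are made up for illustration, and the list of digit ranges covers only the blocks used in the example above; I don't know the exact set the real _wtoi recognizes.

```c
#include <stddef.h>
#include <stdio.h>

/* Decimal value of a Unicode digit code point, or -1 if it isn't one.
   The range list is an assumption made for this example only. */
static int digit_value(unsigned long cp)
{
    /* Code points of the digit zero in: ASCII, Arabic-Indic,
       Telugu, Khmer, Full Width. */
    static const unsigned long zero[] = {
        0x0030, 0x0660, 0x0C66, 0x17E0, 0xFF10
    };
    size_t i;

    for (i = 0; i < sizeof zero / sizeof zero[0]; i++)
        if (cp >= zero[i] && cp <= zero[i] + 9)
            return (int)(cp - zero[i]);
    return -1;
}

/* Convert a run of digit code points; stops at the first non-digit. */
static int parse_digits(const unsigned long *cps, size_t n)
{
    int v = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int d = digit_value(cps[i]);
        if (d < 0)
            break;
        v = v * 10 + d;
    }
    return v;
}

int main(void)
{
    const unsigned long cps[] = { 0xFF11, 0x0663, 0x0C69, 0x17E7 };

    printf("%i\n", parse_digits(cps, sizeof cps / sizeof cps[0]));
    return 0;
}
```
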
Let's end this post with a few quotes from a C99 standard draft:

atoi
The functions atof, atoi, atol, and atoll need not affect the value of the integer
expression errno on an error. If the value of the result cannot be represented, the
behavior is undefined.

sscanf %i
I didn't find any remark on what should happen when the number cannot be represented as an integer. However, there are some shy remarks that it should behave the same as strtol does. But it's UB (undefined behavior) afaic.

strtol
If the correct value
is outside the range of representable values, LONG_MIN, LONG_MAX, LLONG_MIN,
LLONG_MAX, ULONG_MAX, or ULLONG_MAX is returned (according to the return type
and sign of the value, if any), and the value of the macro ERANGE is stored in errno.

That's it for today :)

P.S. Btw, does anyone know how to check lib C version on OSX?

P.S.2. I include the test below. If you would like to share your results, please include the version of lib c, and the OS/architecture you've run the test on ;>
#include<stdio.h>
#include<stdlib.h>

int main(void)
{
 puts("atoi");
 printf("  %.8x\n", atoi("12345678901234567890"));
 printf("  %.8x\n", atoi("-12345678901234567890"));

 printf("sscanf %%i\n");
 { int res = 0; sscanf("12345678901234567890", "%i", &res); printf("  %.8x\n", res); }
 { int res = 0; sscanf("-12345678901234567890", "%i", &res); printf("  %.8x\n", res); }

 printf("strtol\n");
 printf("  %.8x\n", (unsigned int)strtol("12345678901234567890", NULL, 10));  /* long cast down to match %x */
 printf("  %.8x\n", (unsigned int)strtol("-12345678901234567890", NULL, 10));

 return 0;
}


P.S.3. It might be worth taking a look at the comments on the Polish side of the mirror (using a translator or sth).

Comments:

2010-11-24 11:25:25 = jweyrich
{
Very interesting post :-)

$ otool -L /usr/lib/libc.dylib
/usr/lib/libc.dylib:
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 125.2.1)
/usr/lib/system/libmathCommon.A.dylib (compatibility version 1.0.0, current version 315.0.0)

Mine is 125.2.1 (OSX 10.6.5). Note that /usr/lib/libc.dylib indirectly points to /usr/lib/libSystem.B.dylib.

And the test results:

$ gcc -o test test.c && ./test
bla.c: In function ‘main’:
bla.c:15: warning: format ‘%.8x’ expects type ‘unsigned int’, but argument 2 has type ‘long int’
bla.c:16: warning: format ‘%.8x’ expects type ‘unsigned int’, but argument 2 has type ‘long int’
atoi
ffffffff
00000000
sscanf %i
ffffffff
00000000
strtol
ffffffff
00000000
}
2010-11-24 11:33:49 = jweyrich
{
Forgot to mention the architecture of my previous results - it was x86_64.
Now in x86:
atoi
7fffffff
80000000
sscanf %i
ffffffff
00000000
strtol
7fffffff
80000000
}
2010-11-25 08:45:47 = Glenn Anderson
{
This was done with msvcr100.dll 64-bit, version 10.0.30319.1.

atoi
7fffffff
80000000
sscanf %i
eb1f0ad2
14e0f52e
strtol
7fffffff
80000000

So it is the same as the 32-bit version.
}
2011-01-12 11:29:51 = extraexploit
{
great post!!! thank you for share this kind of info.
}
