String-to-Integer vs Unicode additional digit groups

by gynvael.coldwind//vx (last update: 18 September 2011)

An interesting difference between ASCII and Unicode is that the former defines only one group of digits (30h to 39h), while the latter defines 42 decimal digit groups[1].

A common programming language operation is converting a sequence of digit characters (yes, a number) into a machine-understandable integer. Does any default in-language string-to-integer function support Unicode digits? Does any is-digit function return true for Unicode digits? This table addresses these questions.
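A minimal sketch of the two questions in Python 3, using ARABIC-INDIC digits as the sample group. The helper `ascii_only_int` is hypothetical and is only one possible way to reject non-ASCII digits; it is not something the tested languages provide.

```python
# ARABIC-INDIC digits one, two, three (U+0661..U+0663).
arabic_indic = "\u0661\u0662\u0663"

# Question 1: does the built-in string-to-integer accept them?
parsed = int(arabic_indic)

# Question 2: does the is-digit check return true for them?
is_digit = arabic_indic.isdigit()

print(parsed, is_digit)


def ascii_only_int(text):
    """Hypothetical strict parser: accept only ASCII 0-9 (no sign, no spaces)."""
    if not text or any(ch not in "0123456789" for ch in text):
        raise ValueError("non-ASCII or non-digit character in %r" % (text,))
    return int(text)
```

A strict wrapper like this may be useful where input is validated with an is-digit check and then parsed by a different component that only understands ASCII.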

This gets especially interesting when some non-ASCII digits are visually similar to ASCII digits but have different values. An example is the Bengali four and Oriya two shown in the picture below. Yes, they look like 89, but the value is actually 42 (thanks to Ange Albertini for pointing out this example):

[Image: 42 that looks like 89]
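The same example can be reproduced in Python 3 as a quick sketch. The code points U+09EA (BENGALI DIGIT FOUR) and U+0B68 (ORIYA DIGIT TWO) are taken from the Unicode character database; the variable names are only illustrative.

```python
import unicodedata

# BENGALI DIGIT FOUR (U+09EA) followed by ORIYA DIGIT TWO (U+0B68).
lookalike = "\u09ea\u0b68"

for ch in lookalike:
    # unicodedata.digit() returns the digit value assigned to the character.
    print("U+%04X %s = %d" % (ord(ch), unicodedata.name(ch), unicodedata.digit(ch)))

# int() mixes digits from different groups without complaint.
value = int(lookalike)
print(value)
```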

Important note: I've created this table out of curiosity so it might be a little chaotic. I tried to not make any errors but I cannot guarantee that there are none. If you have any comments or have tested another programming language/library/digit group/etc and would like to share the results, feel free to either e-mail me or leave a comment under this post on my blog.

[1] Unicode Characters in the 'Number, Decimal Digit' Category (I think there are more groups actually, but didn't have time to check).
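One way to check the number of groups is to count them in the Unicode database bundled with a Python 3 interpreter. This is only a sketch: it treats every code point in category Nd with decimal value 0 as the start of one group, and the result depends on the Unicode version the interpreter ships with, so it may well differ from 42.

```python
import sys
import unicodedata

# Every code point in category Nd whose decimal value is 0 starts one group.
zero_digits = [
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd" and unicodedata.decimal(chr(cp)) == 0
]

print("Unicode data version:", unicodedata.unidata_version)
print("decimal digit groups:", len(zero_digits))

# Show the first few groups by the name of their digit zero.
for cp in zero_digits[:5]:
    print("U+%04X %s" % (cp, unicodedata.name(chr(cp))))
```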

Notes:

There are additional notes and test cases after the table.

Table of support:

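Each row lists the digit group, its type (BMP or SMP), and then 17 results in the following column order:

   1. Python 2.6 (narrow)
   2. Python 2.7 (narrow)
   3. Python 3.2
   4. C, Win32 (msvcrt20)
   5. C, Win32 (msvcrt40)
   6. C, Win32 (msvcr71)
   7. C, Win32 (msvcr100)
   8. C, glibc (2.13)
   9. Java 1.7 (.pI)
  10. Java 1.7 (.gNV)
  11. Perl 5.12 (int)
  12. Perl 5.12 (\d)
  13. Javascript (pI)
  14. Javascript (\d)
  15. Ruby 1.9
  16. C# .NET 4.0 (TI32)
  17. C# .NET 4.0 (ID)

Digit group Type 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17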
ASCII BMP yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes
ARABIC-INDIC BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
EXTENDED ARABIC-INDIC BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
NKO BMP yes yes yes no no no no no yes yes no yes no no no no yes
DEVANAGARI BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
BENGALI BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
GURMUKHI BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
GUJARATI BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
ORIYA BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
TAMIL BMP yes yes yes no no no no no yes yes no yes no no no no yes
TELUGU BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
KANNADA BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
MALAYALAM BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
THAI BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
LAO BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
TIBETAN BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
MYANMAR BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
MYANMAR SHAN BMP yes yes yes no no no no no yes yes no yes no no no no yes
KHMER BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
MONGOLIAN BMP yes yes yes no yes yes yes no yes yes no yes no no no no yes
LIMBU BMP yes yes yes no no no no no yes yes no yes no no no no yes
NEW TAI LUE BMP yes yes yes no no no no no yes yes no yes no no no no yes
TAI THAM HORA BMP no yes yes no no no no no yes yes no yes no no no no no
TAI THAM THAM BMP no yes yes no no no no no yes yes no yes no no no no no
BALINESE BMP yes yes yes no no no no no yes yes no yes no no no no yes
SUNDANESE BMP yes yes yes no no no no no yes yes no yes no no no no yes
LEPCHA BMP yes yes yes no no no no no yes yes no yes no no no no yes
OL CHIKI BMP yes yes yes no no no no no yes yes no yes no no no no yes
VAI BMP yes yes yes no no no no no yes yes no yes no no no no yes
SAURASHTRA BMP yes yes yes no no no no no yes yes no yes no no no no yes
KAYAH LI BMP yes yes yes no no no no no yes yes no yes no no no no yes
JAVANESE BMP no yes yes no no no no no yes yes no yes no no no no no
CHAM BMP yes yes yes no no no no no yes yes no yes no no no no yes
MEETEI MAYEK BMP no yes yes no no no no no yes yes no yes no no no no no
FULLWIDTH BMP yes yes yes yes yes yes yes no yes yes no yes no no no no yes
OSMANYA SMP no no no no no no no no no no no yes no no no no no
BRAHMI SMP no no no no no no no no no no no no no no no no no
MATHEMATICAL BOLD SMP no no no no no no no no no no no yes no no no no no
MATHEMATICAL DOUBLE-STRUCK SMP no no no no no no no no no no no yes no no no no no
MATHEMATICAL SANS-SERIF SMP no no no no no no no no no no no yes no no no no no
MATHEMATICAL SANS-SERIF BOLD SMP no no no no no no no no no no no yes no no no no no
MATHEMATICAL MONOSPACE SMP no no no no no no no no no no no yes no no no no no
ROMAN (I V X etc) BMP no no no no no no no no no yes no no no no no no no
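A single cell of the table can be re-checked with a small probe like the one below, written for Python 3. It is a sketch under two assumptions: each group occupies ten consecutive code points starting at its digit zero, and "support" means that `int()` parses the group's digits to the expected value. The function names and the chosen groups are illustrative; results on a current interpreter may differ from the 2011 columns above.

```python
def digits_from(zero_code_point):
    """Spell 1234567890 with the digit group that starts at zero_code_point."""
    return "".join(chr(zero_code_point + int(d)) for d in "1234567890")


def int_supports(zero_code_point):
    """Return True if int() parses the group to the expected value."""
    try:
        return int(digits_from(zero_code_point)) == 1234567890
    except ValueError:
        return False


# A few groups from the table, keyed by the code point of their digit zero.
groups = {
    "ASCII": 0x0030,
    "ARABIC-INDIC": 0x0660,
    "FULLWIDTH": 0xFF10,
    "OSMANYA": 0x104A0,
    "MATHEMATICAL BOLD": 0x1D7CE,
}

results = {name: int_supports(zero) for name, zero in groups.items()}
for name, ok in results.items():
    print("%-20s %s" % (name, "yes" if ok else "no"))
```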

Additional notes

Test cases

Most of the scripts/programs I've written for the purpose of making this table used one of these files as input:

Please note that the above files don't contain any ROMAN digits (their character codes are U+2160 to U+216F and U+2170 to U+217F; e.g. U+216C is Ⅼ aka decimal 50).
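For comparison, this is how Python 3 treats the U+216C example. It is a sketch, not one of the original test cases: the character has a numeric value but is not a decimal digit, so the number-related functions disagree about it.

```python
import unicodedata

roman_fifty = "\u216c"  # ROMAN NUMERAL FIFTY

numeric_value = unicodedata.numeric(roman_fifty)  # numeric value as a float
category = unicodedata.category(roman_fifty)      # general category of the character

# int() only accepts decimal digits, so it is expected to reject this character.
try:
    int(roman_fifty)
    int_accepts = True
except ValueError:
    int_accepts = False

print(numeric_value, category)
print(roman_fifty.isdigit(), roman_fifty.isnumeric(), int_accepts)
```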

Further research

There are some more things that could be checked, e.g.:

Other notes

Thanks to the following people for pointing me to various languages, etc. (some I tested, some I didn't): Roi Martin, Tomasz Dąbrowski, Maciej Tebecha, Ange Albertini, himn1, argasek, dfgg, meal, nathell and Unavowed.

Teh end.