String-to-Integer vs Unicode additional digit groups table


An interesting difference between ASCII and Unicode is that ASCII defines only one group of digits (30h to 39h), while Unicode defines 42 decimal digit groups (I think it actually defines more, but never mind). A common operation in any programming language is converting a sequence of digit characters (yes, a number) to a machine-readable integer. Does any default in-language string-to-integer function support Unicode digits? Does any is-digit function return true for Unicode digits? I did some checking and created a table (programming language/version/library vs digit group) that addresses these questions.

String-to-Integer vs Unicode additional digit groups table

Important note: I created this table out of curiosity, so it might be a little chaotic. I tried not to make any errors, but I cannot guarantee that there are none. If you have any comments, or have tested another programming language/library/digit group/etc. and would like to share the results, feel free to either e-mail me or leave a comment under this post.

As for test cases, most of the scripts/programs I've written for the purpose of making this table used one of these files as input:
* test_case_string.txt - format: <HEX CODE> <CHARACTER NAME>
* test_case_utf8.txt - format: <UTF-8 ENCODED CHARACTER> <CHARACTER NAME>
Please note that the above files don't contain any Roman numerals (their code points are U+2160 to U+216F and U+2170 to U+217F; e.g. U+216C is Ⅼ, i.e. decimal 50).
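For illustration, here is a minimal Python sketch of what such a test script could look like. It is not one of the scripts actually used for the table. The `SAMPLE` data, the `check_line` helper and the assumption that the hex code is a bare hexadecimal number followed by a single space are all hypothetical, so the real test_case_string.txt may need slightly different parsing.

```python
# Hypothetical sample in the "<HEX CODE> <CHARACTER NAME>" layout.
# The real test_case_string.txt may differ in detail.
SAMPLE = """\
0030 DIGIT ZERO
0663 ARABIC-INDIC DIGIT THREE
0E53 THAI DIGIT THREE
FF17 FULLWIDTH DIGIT SEVEN
"""


def check_line(line):
    """Return (name, is-digit result, string-to-integer result or None)."""
    hex_code, name = line.split(" ", 1)
    ch = chr(int(hex_code, 16))
    try:
        value = int(ch)  # the default string-to-integer conversion
    except ValueError:
        value = None     # this digit group is not supported
    return name.strip(), ch.isdigit(), value


results = [check_line(line) for line in SAMPLE.splitlines() if line.strip()]
for name, is_digit, value in results:
    print(f"{name}: isdigit={is_digit}, int()={value}")
```

The same loop can be pointed at a file instead of `SAMPLE`, and the two checks can be swapped for whatever is-digit and string-to-integer functions the tested language offers.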

There are some more things that could be checked, e.g.:
* Other programming languages (like Go, Objective-C or Delphi), libraries and functions could be tested.
* Does this have any security implications (filter bypassing, perhaps)? See also Unicode Security Considerations.
* Are non-decimal digits supported in some cases?
* Did I miss any digit groups?
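The last two questions can be explored with Python's standard `unicodedata` module. The sketch below is only one possible approach: it treats every character in general category Nd with decimal value 0 as the start of a digit group, which is my assumption about how to count groups, and the result depends on the Unicode version bundled with the interpreter. The helper name `decimal_digit_groups` is hypothetical.

```python
import sys
import unicodedata


def decimal_digit_groups():
    """Return the code points of all decimal zeros (category Nd, value 0)."""
    zeros = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch, None) == 0:
            zeros.append(cp)
    return zeros


groups = decimal_digit_groups()
print(len(groups), "decimal digit groups in Unicode", unicodedata.unidata_version)

# Non-decimal digits: SUPERSCRIPT TWO counts as a digit but not as a decimal,
# and ROMAN NUMERAL FIFTY only has a numeric value.
superscript_two = "\u00b2"
roman_fifty = "\u216c"
print(superscript_two.isdigit(), superscript_two.isdecimal())
print(unicodedata.numeric(roman_fifty))
```

In Python, `int()` accepts only decimal digits, so a character can pass `isdigit()` and still be rejected by the conversion.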

By the way...
If you want to improve your binary file and protocol skills, check out the workshop I'll be running between April and June → Mastering Binary Files and Protocols: The Complete Journey

Also, I would like to thank the following people for pointing me to various languages (some I tested, some I didn't): Roi Martin, Tomasz Dąbrowski, Maciej Tebecha, himn1, argasek, dfgg, meal and nathell.

Cheers,

Comments:

2011-09-17 21:17:05 = lRem

{

Perl's partial support looks *very* weird to me...

}

2011-09-17 21:20:44 = Gynvael Coldwind

{

@lRem
Agreed, it does look inconsistent.

Maybe some Perl expert could look into it?;)

}

2011-09-17 22:23:53 = jduck

{

FYI, Ruby doesn't support alternate encodings until version 1.9.

}

2011-09-17 23:26:00 = Gynvael Coldwind

{

@jduck
That actually was a typo in the description - I've used 1.9.2 for tests (the default bundle for Windows).

}

2024-11-20 05:11:54 = sixtyvividtails

{

Interesting fact: rundll32.exe on all win10/win11 uses _wtoi function from msvcrt.dll to resolve provided ordinals - presumably for compatibility reasons. This means all 17 extra alphabets^W digit groups are supported. Combine it with positive (negative) or mod 64k _wtoi peculiarity (as discovered by Hexacorn: https://www.hexacorn.com/blog/2020/02/05/stay-positive-lolbins-not/), and you'll get something like a spellbinding.

For example, applying Witch Ordinals to well-known Poweliks mshtml trick:
rundll32.exe javascript:alert('๓ē໓นkค-iŞ-๖ēคนtฯ');window.close();"\..\mshtml #৩੧೪໑၅৯២៦୫໓໕໘९൭៩೩٢۳๘൪၆๒୬༤୩৫৬၉੯൯"

}
