PHP preg_match and UTF-8

A few days ago I received a piece of PHP 5 code and was asked whether it was OK. Basically, the code validated user input by checking that only letters were used: both Latin letters (A-Z) and the additional Polish diacritized letters (i.e. Ą Ż Ś Ź Ę Ć Ń Ó Ł and their lowercase versions: ą ż ś ź ę ć ń ó ł). Additionally, there was a relatively small size limit on the input. And, as you might have already guessed, the code was not OK, hence this post.

Important information: PHP 5 was used, and both the page and source files were UTF-8 encoded.

This is what the check looked like (in simplified form, i.e. I've changed the size limit to 5 and reduced the whole regexp to check for just a single word):
preg_match('/^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/', $input_text)
So, basically, the author's intentions are clear:
- only Latin and Polish letters are allowed, and nothing else
- the minimum allowed letter count is 1, and the maximum allowed letter count is 5

I guess at first sight everything looks OK. Alas, by default preg_match does not work in UTF-8 mode: it treats both the pattern and the input as sequences of single bytes, so it is not aware that some of the given characters take more than one byte.
Let's take a look at how the given diacritized letters are encoded in UTF-8:
ą: C4 85    ż: C5 BC    ś: C5 9B    ź: C5 BA
ę: C4 99    ć: C4 87    ń: C5 84    ó: C3 B3
ł: C5 82    Ą: C4 84    Ż: C5 BB    Ś: C5 9A
Ź: C5 B9    Ę: C4 98    Ć: C4 86    Ń: C5 83
Ó: C3 93    Ł: C5 81
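
The table above can be reproduced with a few lines of PHP. This is only a quick sketch: it assumes the source file is saved as UTF-8 (as in this post), and utf8_hex() is a helper name made up for the example, not a PHP function.

```php
<?php
// Assumption: this file is saved as UTF-8, so each literal below
// is stored as its UTF-8 byte sequence.
$letters = array('ą', 'ż', 'ś', 'ź', 'ę', 'ć', 'ń', 'ó', 'ł',
                 'Ą', 'Ż', 'Ś', 'Ź', 'Ę', 'Ć', 'Ń', 'Ó', 'Ł');

// Hypothetical helper: returns the bytes of $s as upper-case hex pairs.
function utf8_hex($s)
{
    // bin2hex() works on raw bytes; str_split() groups the hex digits in pairs.
    return strtoupper(implode(' ', str_split(bin2hex($s), 2)));
}

foreach ($letters as $letter) {
    echo $letter, ': ', utf8_hex($letter), "\n";
}
```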

Yes... all of these are two-byte characters in UTF-8. And, as I've said, preg_match is not aware of that, so an obvious consequence is that a regexp like /[Ł]/ is in fact treated as /[\xC5\x81]/: both the \xC5 byte and the \x81 byte are a match on their own, even if they stand alone!
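
To see the byte-wise behaviour directly, here is a minimal sketch (again assuming a UTF-8 source file). The anchors ^ and $ are added only to make the effect obvious: the class matches exactly one byte.

```php
<?php
// Without the u modifier the class [Ł] is really the byte class [\xC5\x81].
$pattern = '/^[Ł]$/';

$lone_first  = preg_match($pattern, "\xC5");      // first byte of Ł on its own
$lone_second = preg_match($pattern, "\x81");      // second byte of Ł on its own
$whole       = preg_match($pattern, "\xC5\x81");  // the complete letter Ł (two bytes)

// Each lone byte matches; the full letter does not, because the
// class consumes only one byte and a second one is left over.
var_dump($lone_first, $lone_second, $whole);
```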

So, let's summarize what will not work the way the code's author intended:
1. We are allowed to use these non-ASCII bytes (sorted by value): 81 82 83 84 85 86 87 93 98 99 9A 9B B3 B9 BA BB BC C3 C4 C5. So, any combination of these bytes will get through the filter.
Basically, we can feed the input with sequences that are not proper UTF-8 strings, e.g. ones that start with a byte from the 80-BF range, which is reserved for the second, third or fourth byte of a UTF-8 sequence (and never the first), or ones made only of C4 and C5 bytes, which may only be used as the first byte of a sequence. Maybe this would trigger some database warning or error later on? Who knows :)
Or... we can pass a proper UTF-8 character, starting with C4 or C5 and followed by a proper second byte from the allowed list (C3, which comes from Ó and ó, works as a first byte in the same way, e.g. C3 81 is Á). For C4 and C5 this gives us the following characters:
C4 group:
ā: C4 81    Ă: C4 82    ă: C4 83    Ą: C4 84
ą: C4 85    Ć: C4 86    ć: C4 87    ē: C4 93
Ę: C4 98    ę: C4 99    Ě: C4 9A    ě: C4 9B
ĳ: C4 B3    Ĺ: C4 B9    ĺ: C4 BA    Ļ: C4 BB
ļ: C4 BC
C5 group:
Ł: C5 81    ł: C5 82    Ń: C5 83    ń: C5 84
Ņ: C5 85    ņ: C5 86    Ň: C5 87    œ: C5 93
Ř: C5 98    ř: C5 99    Ś: C5 9A    ś: C5 9B
ų: C5 B3    Ź: C5 B9    ź: C5 BA    Ż: C5 BB
ż: C5 BC
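
A short sketch of what this means in practice, assuming a UTF-8 source file. old_check() is a made-up wrapper around the original pattern, and the sample inputs are picked from the cases described above.

```php
<?php
// Hypothetical wrapper around the original, byte-oriented check (no u modifier).
function old_check($input_text)
{
    return preg_match('/^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/', $input_text) === 1;
}

$inputs = array(
    'Czech letter R with caron' => "\xC5\x98", // valid UTF-8, but never on the allowed list
    'two first bytes in a row'  => "\xC4\xC5", // not valid UTF-8
    'lone continuation byte'    => "\x81",     // not valid UTF-8
    'a digit'                   => "7",        // the only input here that is rejected
);

foreach ($inputs as $description => $bytes) {
    echo $description, ': ', (old_check($bytes) ? 'accepted' : 'rejected'), "\n";
}
```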

2. The other thing was the allowed size. The author's intention was that up to 5 letters would be allowed. However, if the input is e.g. ŁŁŁŁŁ (5 letters, as you can see), it will be rejected. That's because each 'Ł' is actually 2 bytes, so the string is 10 bytes long, not 5, and preg_match, which counts bytes here, will not find a match.
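
The size problem can be sketched the same way (UTF-8 source file assumed; old_check() is again a made-up wrapper around the original pattern). Since the {1,5} limit counts bytes, no more than two Polish letters fit.

```php
<?php
// Hypothetical wrapper around the original, byte-oriented check (no u modifier).
function old_check($input_text)
{
    return preg_match('/^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/', $input_text) === 1;
}

$five_letters = 'ŁŁŁŁŁ';
echo strlen($five_letters), "\n";          // length in bytes, not in letters

var_dump(old_check($five_letters));        // 10 bytes: over the limit
var_dump(old_check('ŁŁŁ'));                // 6 bytes: still over the limit
var_dump(old_check('ŁŁ'));                 // 4 bytes: fits
var_dump(old_check('ŁŁa'));                // 5 bytes: fits
```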

OK... so, how to fix this? There are two ways:
1. use the mb_* functions which are multi-byte-characters aware, e.g. mb_ereg (remember to set mb_regex_encoding or mb_internal_encoding to UTF-8!)
2. It seems that adding (*UTF8) to the beginning of the pattern (in this case it would be preg_match('/(*UTF8)^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/', $input_text)) will make preg UTF-8 aware. However, whether this works seems to depend on how PHP was actually built (i.e. it doesn't work on all servers).
(update) 3. As the first comment states, adding the u modifier to the regexp will also work (i.e. it turns on Unicode mode, I guess), e.g. preg_match('/^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/u', $input_text), but I guess it's the same thing as the (*UTF8) idea above.
(update 2) 4. PHP 6 is supposed to support unicode natively... but it will take a while before it reaches a stable release...
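
Here is a rough sketch of two of the fixes listed above, the u modifier and mb_ereg, assuming a UTF-8 source file. is_word_pcre() and is_word_mb() are names invented for this example. The mb_ereg variant needs the mbstring extension, and it uses \A and \z instead of ^ and $ on the assumption that mbstring is in its default Ruby-style regex mode, where ^ and $ anchor to lines rather than to the whole string.

```php
<?php
// Hypothetical helper: the original check with the u modifier added.
function is_word_pcre($input_text)
{
    // With /u both pattern and subject are treated as UTF-8, so {1,5}
    // counts characters. For a subject that is not valid UTF-8,
    // preg_match() returns false, which the === 1 test also rejects.
    return preg_match('/^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/u', $input_text) === 1;
}

// Hypothetical helper: the same check via mbstring.
function is_word_mb($input_text)
{
    mb_regex_encoding('UTF-8');
    // mb_ereg() takes the pattern without delimiters.
    // The cast covers both the bool and the int/false return styles.
    return (bool) mb_ereg('\A[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}\z', $input_text);
}

var_dump(is_word_pcre('ŁŁŁŁŁ'));     // five letters
var_dump(is_word_pcre("\xC5\x98"));  // Ř, not on the allowed list
var_dump(is_word_pcre("\xC4\xC5"));  // not valid UTF-8

if (function_exists('mb_ereg')) {
    var_dump(is_word_mb('ŁŁŁŁŁ'));
    var_dump(is_word_mb("\xC5\x98"));
}
```

Either way it's worth re-running the problematic inputs from above against the fixed check on the actual server, since the behaviour depends on the extensions that are available there.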

OK, that's that I guess.

P.S. tests and test code I've used for this post can be found here:
Tests (source code)

P.S.2. You know that strlen('a utf-8 string') will not return the number of letters, but the number of bytes, right? Right??? That's why there is an mb_strlen() function.
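
A tiny illustration, assuming a UTF-8 source file; the sample word is just an example, and mb_strlen() needs the mbstring extension.

```php
<?php
$word = 'zażółć';  // 6 letters, 4 of which take two bytes in UTF-8

$byte_count = strlen($word);   // counts bytes
echo $byte_count, "\n";

if (function_exists('mb_strlen')) {
    // Passing the encoding explicitly avoids depending on mb_internal_encoding().
    $letter_count = mb_strlen($word, 'UTF-8');
    echo $letter_count, "\n";
}
```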


2010-10-16 18:31:51 = .
/^[a-zążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/iu RTFM :)
2010-10-16 18:44:57 = Gynvael Coldwind
I've added the /u to the 'fix list'. I'm sure the author of the code is now well aware of the problem, so 'RTFM' is too harsh ;)
