Important information: PHP 5 was used, and both the page and source files were UTF-8 encoded.
This is what the check looks like (in simplified form, i.e. I've changed the size limit to 5 and reduced the whole regexp to check for just a single word):
preg_match('/^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/', $input_text)
So, basically, the author's intentions are clear:
- only Latin and Polish letters are allowed, and nothing else
- the minimum allowed letter count is 1, and the maximum allowed letter count is 5
I guess at first sight everything looks OK. Alas, preg_match does not work in UTF-8 mode by default, hence it is not aware that some of the given characters take more than one byte.
Let's take a look at how the given diacritized letters are encoded in UTF-8:
ą: C4 85 ż: C5 BC ś: C5 9B ź: C5 BA
ę: C4 99 ć: C4 87 ń: C5 84 ó: C3 B3
ł: C5 82 Ą: C4 84 Ż: C5 BB Ś: C5 9A
Ź: C5 B9 Ę: C4 98 Ć: C4 86 Ń: C5 83
Ó: C3 93 Ł: C5 81
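If you want to reproduce the table above yourself, a minimal sketch could look like this. The helper name utf8_hex is mine (hypothetical), and the snippet assumes the source file itself is saved as UTF-8:

```php
<?php
// Assumes this source file is saved as UTF-8.
$letters = 'ążśźęćńółĄŻŚŹĘĆŃÓŁ';

// Hypothetical helper: hex dump of a string's bytes, e.g. "C5 81".
function utf8_hex($s)
{
    return strtoupper(implode(' ', str_split(bin2hex($s), 2)));
}

// The u modifier makes preg_split cut between characters, not between bytes.
$chars = preg_split('//u', $letters, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chars as $letter) {
    echo $letter, ': ', utf8_hex($letter), "\n";
}
```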
Yes... all of these are two-byte characters in UTF-8. And, as I've said, preg_match is not aware of that, so an obvious consequence is that a regexp like /[Ł]/ is in fact treated as /[\xC5\x81]/: both the byte \xC5 and the byte \x81 are a match, even if they stand alone!
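The byte-wise behaviour described above can be checked with a small sketch like the one below. The pattern is built from explicit byte escapes, so the result does not depend on how the source file is encoded. The variable names are mine.

```php
<?php
// "\xC5\x81" is the UTF-8 encoding of Ł; without the u modifier the class
// [Ł] is simply a set containing the two single bytes C5 and 81.
$pattern = '/^[' . "\xC5\x81" . ']$/';

$lone_lead  = preg_match($pattern, "\xC5");      // first byte of Ł on its own
$lone_trail = preg_match($pattern, "\x81");      // second byte of Ł on its own
$whole      = preg_match($pattern, "\xC5\x81");  // the complete letter Ł

var_dump($lone_lead, $lone_trail, $whole);
// Prints int(1), int(1), int(0): each stray byte matches, while the
// complete letter does not, because the class consumes exactly one byte.
```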
So, let's summarize what will not work the way the code's author intended:
1. We are allowed to use these bytes (sorted by value): 81 82 83 84 85 86 87 93 98 99 9A 9B B3 B9 BA BB BC C3 C4 C5. So any combination of these bytes will get through the filter.
Basically, we can feed the input sequences that are not proper UTF-8 strings: e.g. ones that start with a byte from the 80-BF range, which is reserved for the second, third or fourth byte of a UTF-8 sequence (and never the first), or ones made only of C4 and C5 bytes, which may only be used as the first byte of a sequence. Maybe this would trigger some database warning/error later on? Who knows :)
Or... we can pass a proper UTF-8 character, starting with C4 or C5, followed by a proper second byte from the allowed list (C3, the first byte of ó and Ó, can be used the same way). For C4 and C5 this gives us the following characters:
C4 group:
ā: C4 81 Ă: C4 82 ă: C4 83 Ą: C4 84
ą: C4 85 Ć: C4 86 ć: C4 87 ē: C4 93
Ę: C4 98 ę: C4 99 Ě: C4 9A ě: C4 9B
ĳ: C4 B3 Ĺ: C4 B9 ĺ: C4 BA Ļ: C4 BB
ļ: C4 BC
C5 group:
Ł: C5 81 ł: C5 82 Ń: C5 83 ń: C5 84
Ņ: C5 85 ņ: C5 86 Ň: C5 87 œ: C5 93
Ř: C5 98 ř: C5 99 Ś: C5 9A ś: C5 9B
ų: C5 B3 Ź: C5 B9 ź: C5 BA Ż: C5 BB
ż: C5 BC
2. The other thing is the allowed size. The author's intent was that up to 5 letters would be allowed. However, if the input is e.g. ŁŁŁŁŁ (5 letters, as you can see), it will not be allowed. That's because each 'Ł' is actually 2 bytes, so the length of the string in bytes is 10, not 5, and preg_match will not find a match in this case.
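Both problems from the list above can be seen in one small sketch. The helper passes_filter and the test inputs are my own (hypothetical) choices, and the snippet assumes the source file is saved as UTF-8:

```php
<?php
// Assumes this source file is saved as UTF-8.
// The simplified check from this post, with no UTF-8 awareness.
$pattern = '/^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/';

// Hypothetical helper wrapping the check.
function passes_filter($pattern, $input)
{
    return preg_match($pattern, $input) === 1;
}

$cases = array(
    'Czech letters (not Polish)' => "\xC5\x99\xC4\x9B",   // the two letters ř and ě
    'invalid UTF-8'              => "\xC4\xC5\x81\x81",   // stray lead and trail bytes
    'two Polish letters'         => 'ŁŁ',                 // 4 bytes
    'five Polish letters'        => 'ŁŁŁŁŁ',              // 10 bytes
);

foreach ($cases as $label => $input) {
    printf(
        "%-28s %2d bytes  %s\n",
        $label,
        strlen($input),
        passes_filter($pattern, $input) ? 'accepted' : 'rejected'
    );
}
```

The first three inputs are accepted, because every byte is in the class and there are at most 5 of them. The last one, which is what the author actually wanted to allow, is rejected.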
OK... so, how to fix this? There are a few ways:
1. Use the mb_* functions, which are aware of multi-byte characters, e.g. mb_ereg (remember to set mb_regex_encoding or mb_internal_encoding to UTF-8!)
2. It seems that adding (*UTF8) to the beginning of the pattern (in this case it would be preg_match('/(*UTF8)^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/', $input_text)) will make preg UTF-8-aware. However, whether this works seems to depend on how PHP was actually built (i.e. it doesn't work on all servers).
(update) 3. As the first comment states, adding the u modifier to the regexp also works (it turns on UTF-8 mode), e.g. preg_match('/^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/u', $input_text), but I guess it's the same as the (*UTF8) idea.
(update 2) 4. PHP 6 is supposed to support unicode natively... but it will take a while before it reaches a stable release...
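Fixes 1 and 3 from the list above could be sketched roughly as follows. The function names are hypothetical, the snippet assumes the source file is saved as UTF-8, and the mb_ereg variant needs the mbstring extension. In the mb_ereg pattern I've used \A and \z instead of ^ and $. That is my own substitution, made because, as far as I know, mbstring's default (Ruby) regex syntax treats ^ and $ as line anchors.

```php
<?php
// Assumes this source file is saved as UTF-8.

// Fix 3: the u modifier makes PCRE read pattern and subject as UTF-8,
// so {1,5} counts characters; invalid UTF-8 input makes preg_match fail.
function valid_with_u($input)
{
    return preg_match('/^[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}$/u', $input) === 1;
}

// Fix 1: mb_ereg takes the pattern without delimiters.
// \A and \z anchor to the whole string (my substitution, see above).
function valid_with_mb_ereg($input)
{
    mb_regex_encoding('UTF-8');
    return (bool) mb_ereg('\A[a-zA-ZążśźęćńółĄŻŚŹĘĆŃÓŁ]{1,5}\z', $input);
}

$inputs = array('ŁŁŁŁŁ', 'ŁŁŁŁŁŁ', "\xC5\x99\xC4\x9B"); // last one is the two Czech letters ř and ě
foreach ($inputs as $input) {
    echo $input, ': /u ', valid_with_u($input) ? 'accepted' : 'rejected';
    if (function_exists('mb_ereg')) {
        echo ', mb_ereg ', valid_with_mb_ereg($input) ? 'accepted' : 'rejected';
    }
    echo "\n";
}
```

With either variant, five Polish letters should be accepted, while six letters or non-Polish letters should be rejected.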
OK, that's that I guess.
P.S. tests and test code I've used for this post can be found here:
Tests
Tests (source code)
P.S.2. you know that strlen('a utf-8 string') will not return the number of letters, but the number of bytes, right? right??? that's why there is a mb_strlen() function.
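A quick sketch of the difference, using explicit byte escapes so it does not depend on the file encoding. The preg_match_all fallback for servers without mbstring is my own suggestion, not something from the original code:

```php
<?php
$word = str_repeat("\xC5\x81", 5); // ŁŁŁŁŁ encoded as UTF-8

$byte_count = strlen($word);                      // counts bytes
$char_count = preg_match_all('/./su', $word, $m); // counts UTF-8 characters

echo $byte_count, "\n"; // 10
echo $char_count, "\n"; // 5

if (function_exists('mb_strlen')) {
    echo mb_strlen($word, 'UTF-8'), "\n"; // 5
}
```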
Comments:
I've added the /u to the 'fix list'. I'm sure the author of the code is now well aware of the problem, so 'RTFM' is too harsh ;)