Null Byte Poison is a neat little attack that can usually be applied when "length+data"-type strings get converted into "zero-terminated"-type strings. It's a well-known problem, though, one that haunted PHP scripts for several years and even visited the browser world. Nowadays a lot of languages (or rather: runtime environments of these languages) have built-in protections against it (including PHP!). For instance, see this Python example:

>>> open("\0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: embedded null character

Unicode brought another similar problem to the table in the form of two ways (one invalid-yet-working) to encode a Null Byte without using an actual \x00 byte. Depending on the scenario, this allows one to either bypass a Null Byte Poison detection or actually inject a Null Byte into a "zero-terminated"-type string at a later processing stage (which is sometimes useful):

  • UTF-8 overlong sequence: \xC0\x80
  • UTF-7 being UTF-7: +AAA-

In the above cases, when the strings get decoded to Unicode, we (might) see Null Bytes popping up in the data. Thankfully, all decent UTF-8 decoders deal properly with overlong sequences, and nothing modern uses UTF-7 anyway (with the notable exception of Express.JS in some scenarios).
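To make the two encodings a bit more tangible, here is a minimal Python sketch. The `naive_decode_two_byte` helper is hypothetical and stands in for a lenient decoder that skips the shortest-form check; it is not taken from any real library. The strict path uses Python's built-in codecs.

```python
def strict_decoder_rejects(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def naive_decode_two_byte(seq: bytes) -> int:
    """Hypothetical lenient decoder for a two-byte UTF-8 sequence.

    It only strips the marker bits and never checks that the value
    actually needed two bytes, so overlong forms slip through.
    """
    lead, cont = seq
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


overlong = b"\xC0\x80"   # overlong encoding of U+0000
utf7 = b"+AAA-"          # UTF-7 encoding of U+0000

# Neither input contains a literal 0x00 byte.
print(b"\x00" in overlong, b"\x00" in utf7)

# A strict decoder rejects the overlong form, a naive one yields 0.
print(strict_decoder_rejects(overlong))
print(naive_decode_two_byte(overlong))

# UTF-7 decoding legitimately produces U+0000.
print(repr(utf7.decode("utf-7")))
```

A filter that looks for \x00 before decoding would pass both inputs, which is the detection bypass mentioned above.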

While playing with a path traversal bug in mutool (details below) I've found yet another Unicode-related way to inject a Null Byte into a string. This method actually relies on a decoder bug and is pretty case-specific, but I think it's worth testing for as I wouldn't be surprised to find it again in similar scenarios in the future.

The bug in question resided in this code (mupdf/source/fitz/time.c):

wchar_t *
fz_wchar_from_utf8(const char *s)
{
    wchar_t *d, *r;
    int c;
    r = d = malloc((strlen(s) + 1) * sizeof(wchar_t));
    if (!r)
        return NULL;
    while (*s) {
        s += fz_chartorune(&c, s);
        *d++ = c;
    }
    *d = 0;
    return r;
}

The problematic part is related to the size of the wchar_t type: while everything is fine on Linux (where the size is 4 bytes), on Windows wchar_t is only 2 bytes. This means that the *d++ = c; part might actually truncate the decoded code point if it's larger than U+FFFF, and Unicode's code points span up to U+10FFFF.

So basically, when the above code runs on Windows, the characters U+10000, U+20000, ..., U+F0000 and U+100000 will be converted to a Null Byte.
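The arithmetic behind that list can be checked with a short Python sketch. It only models the 16-bit truncation; the function names are made up for illustration and none of this is MuPDF code. For contrast, `to_utf16_units` shows the conventional UTF-16 surrogate pair split for code points above U+FFFF, which is an assumption about what a correct conversion could look like, not a description of the actual MuPDF patch.

```python
from typing import List


def truncate_to_wchar16(code_point: int) -> int:
    """Model assigning an int code point to a 16-bit wchar_t."""
    return code_point & 0xFFFF


def code_points_truncated_to_zero() -> List[int]:
    """All non-zero Unicode code points whose low 16 bits are zero."""
    return [cp for cp in range(1, 0x110000) if truncate_to_wchar16(cp) == 0]


def to_utf16_units(code_point: int) -> List[int]:
    """Conventional UTF-16 encoding: surrogate pair above U+FFFF."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]


affected = code_points_truncated_to_zero()
print(len(affected))                        # 16
print([f"U+{cp:X}" for cp in affected])     # U+10000 ... U+100000

# With surrogate pairs no code unit is zero, so nothing terminates early.
print([hex(u) for u in to_utf16_units(0x100000)])
```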

Note: Out of the code points above only U+10000 and U+20000 are defined in Unicode, though my browser shows only one of them correctly. Let's see how yours does:

𐀀 𠀀

As far as the MuPDF mutool path traversal vulnerability (and eventual DLL spoofing RCE) goes, the original report is quoted below. For some context, mutool is a command-line utility that (among other things) can extract fonts and images from a PDF file (a tool that's pretty useful when doing things like Paged Out!). And now, the report:

Hey folks,

*** Summary:
Using mutool extract on an attacker-provided PDF can lead to path traversal and creating a file with attacker-provided content at a location chosen by the attacker (with the file name fully-controlled by the attacker). This on the other hand commonly can be used to trick some other part of the OS to execute the dumped file (e.g. due to DLL spoofing, binary planting, or other similar techniques), so in practice it's a deferred RCE.

Note: I've tested this on Windows; it should work on Linux too, but I'm not sure if in full extent (I suspect the file suffix cut-off trick might not work).

Attached PoC which dumps (on Windows) a "" file in the root of the currently used disk.

[Tested on 1.17.0]

*** More details:
When doing:

    mutool extract filename.pdf

the tool extracts images and font files. The fonts are the interesting part.

The name of the extracted font file is taken from the /FontName entry directly, and without any sanitization is passed to fz_snprintf to create the final name:

    (pdfextract.c:savefont)
    fontname = pdf_to_name(ctx, obj);
    ...
    fz_snprintf(namebuf, sizeof(namebuf), "%s-%04d.%s", fontname, pdf_to_num(ctx, dict), ext);
    ...
    out = fz_new_output_with_path(ctx, namebuf, 0);

If the /FontName is set to e.g.

    ..\..\..\..\..\..\..\somefile.ext

(tested this on Windows thus the backslashes), then the file will put outside of the current working directory of mutool.

The created file would of course be named somefile.ext-1234.ttf, where "1234" is the object id, and the "ttf" is one of a set of constant fonts.

However, the "-1234.ttf" suffix can be tricked into being removed. This is due to the fz_new_output_with_path function supporting UTF-8. However, since Windows does not support UTF-8 directly, a conversion needs to be made to wide characters.

This conversion is done using the fz_wchar_from_utf8() function:

    wchar_t *
    fz_wchar_from_utf8(const char *s)
    {
        wchar_t *d, *r;
        int c;
        r = d = malloc((strlen(s) + 1) * sizeof(wchar_t));
        if (!r)
            return NULL;
        while (*s) {
            s += fz_chartorune(&c, s);
            *d++ = c;   <----------- IMPORTANT
        }
        *d = 0;
        return r;
    }

The problem is located in the marked above line - the fz_chartorune() function returns a Unicode character code, which can be between U+0000 and U+10FFFF, and this is saved to "int c" variable. However, in the marked line the int is assigned to wchar_t, which on Windows is 16-bit only. So if the UTF-8 sequence would be e.g. for U+10000, then after a downcast to wchar_t we get a truncated 16-bit value of 0x0000 - technically a null byte.

Given the above, the attacker can set a /FontName like this:

    /FontName /..\..\..\..\..\some\dir\evil.dll#f4#80#80#80

In the first step this will then be converted to:

    ..\..\..\..\..\some\dir\evil.dll\xf4\x80\x80\x80-1234.ttf

And in the second step to:

    ..\..\..\..\..\some\dir\evil.dll\0-1234.ttf

Effectively becoming:

    ..\..\..\..\..\some\dir\evil.dll

While the examples above use a classic path traversal, technically the path can be absolute, to even reach to other SMB shares:

    \file_on_the_root_of_disk#f4#80#80#80
    \\.\D:\file_on_selected_disk#f4#80#80#80
    \\another_computer\file_on_another_computer#f4#80#80#80

A typical way of exploiting this kind of vulnerabilities in archive extractors is dumping a DLL file somewhere on the disk in hope it will be loaded by mistake by another application (i.e. DLL spoofing). Commonly naming it "version.dll" or "ws2_32.dll" and dumping it in current users Download directory does the trick (a downloaded installer will load it).

*** Proposed Fix:
The more aggressive fix is to just skip the /FontName entirely and just use the object number & extension (same as while dumping images).

A less aggressive fix would be to change ALL characters apart from 0-9a-zA-Z_ to _ character.

Please let me know if you have any questions.
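The two conversion steps from the report, plus the "less aggressive" proposed fix, can be simulated in a few lines of Python. This is only a model under stated assumptions: the helper names are hypothetical, the PDF name unescaping is reduced to the #XX form used in the report, and the wide-character step mimics a 16-bit wchar_t followed by a cut at the first zero unit. It does not call or reproduce any MuPDF code.

```python
import re


def decode_pdf_name(raw: bytes) -> bytes:
    """Expand #XX escapes in a PDF name into raw bytes (simplified)."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), raw)


def build_output_name(fontname: bytes, obj_num: int, ext: bytes) -> bytes:
    """Mimic the "%s-%04d.%s" format string quoted in the report."""
    return b"%s-%04d.%s" % (fontname, obj_num, ext)


def truncating_wchar_path(utf8_path: bytes) -> str:
    """Model a 16-bit wchar_t conversion, then zero-termination."""
    units = [ord(ch) & 0xFFFF for ch in utf8_path.decode("utf-8")]
    if 0 in units:
        units = units[:units.index(0)]   # the OS stops at the first zero
    return "".join(chr(u) for u in units)


def sanitize_font_name(fontname: bytes) -> bytes:
    """Proposed fix: replace everything outside 0-9a-zA-Z_ with '_'."""
    return re.sub(rb"[^0-9a-zA-Z_]", b"_", fontname)


font = decode_pdf_name(rb"..\..\some\dir\evil.dll#f4#80#80#80")

step1 = build_output_name(font, 1234, b"ttf")
step2 = truncating_wchar_path(step1)
print(step1)   # suffix still present, U+100000 encoded as f4 80 80 80
print(step2)   # suffix gone after truncation

safe = build_output_name(sanitize_font_name(font), 1234, b"ttf")
print(truncating_wchar_path(safe))   # no traversal, suffix kept
```

With the sanitized name, neither the backslashes nor the U+100000 bytes survive, so both the traversal and the suffix cut-off stop working in this model.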


  • 2020-06-06: Me: Reported.
  • 2020-06-24: Me: Ping.
  • 2020-06-25: MuPDF folks: Pong. Proposed fixes.
  • 2020-06-25: Me: Response to proposed fixes.
  • 2020-07-13: Me: Ping.
  • 2020-07-13: MuPDF folks: Pong. Turns out a different patch was made on 2020-06-24 and was committed on 2020-06-26. Bug is resolved.
  • 2020-07-13: Me: Info about planned publication.
  • 2020-08-10: Me: This blogpost is published.

Nothing major, but a fun little vulnerability nonetheless ;>


2020-08-11 01:59:57 = vedsoydeik
Fun indeed. Current release is still vulnerable. It is worth noting that font file path traversal alone can potentially be used in deadly social engineering attacks.
2020-08-11 02:05:23 = vedsoydeik
Oh, nevermind, there is a suffix...
2020-08-11 05:36:09 = Bourbon
Interesting & nice blog. I can definitely see this getting used for creating a HackTheBox machine... Thank you!
2020-08-19 17:04:31 = nu11sec
Great research sir!
2020-11-19 09:34:18 = Nism0
When I see "Just another..." in a post title on this blog, I grab a popcorn as it's going to be a nice reading.