BSWAP + 66h prefix

In the last few days I've been playing with osdev again (the last time I coded anything more than a boot menu (sorry, PL) was in 2003), so expect a few posts about assembler, x86 emulators and similar institutions. Today's post is about the bswap reg16 instruction running in protected mode, which, as it turns out, can be used, for example, to detect bochs or QEMU.

The bswap reg16 instruction is in fact a bswap reg32 with the 66h prefix, also known as the operand-size override prefix (it switches the operand size between 32 and 16 bits, with 32 being the default in PMODE of course). As one can read in the Intel manuals, using bswap with the 66h prefix results in undefined behavior:

When the BSWAP instruction references a 16-bit register, the result is undefined.

As one can read in 86BUGS.LST by Harald Feldmann, Revision 04 (it's in the inter61d.zip pack), this is not new behavior: the instruction is known to act similarly (identically) on the old 80486 processors (and newer):

Mnemonic: BSWAP reg32
Opcode  : 0F C8+reg# (00001111 11001rrr)
Bug in  : 486
Do not use this instruction with 16 bit registers as operand.
Results are undefined in that case.
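
To make the encoding concrete, here is a minimal C sketch that builds the three bytes of bswap reg16 from the opcode listed above plus the 66h prefix. The function name encode_bswap16 is made up for this illustration, and the register numbering (0 = AX, 1 = CX, 2 = DX, 3 = BX, ...) is the usual x86 order rather than something taken from the listing itself.

```c
#include <stddef.h>

/* Hypothetical helper (not part of any library): writes the encoding of
   "bswap reg16" as seen from 32-bit protected mode into out[] and returns
   its length. reg is the x86 register number:
   0=AX, 1=CX, 2=DX, 3=BX, 4=SP, 5=BP, 6=SI, 7=DI. */
size_t encode_bswap16(unsigned reg, unsigned char out[3])
{
    out[0] = 0x66;                               /* operand-size override prefix */
    out[1] = 0x0F;                               /* two-byte opcode escape       */
    out[2] = (unsigned char)(0xC8 + (reg & 7));  /* BSWAP + register number      */
    return 3;
}
```

For register 0 this gives 66 0F C8, which is the same byte sequence the test code in this post produces with ".byte 0x66" followed by "bswap eax".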

Of course, when a researcher sees the word "undefined", he immediately thinks of a way to use it, check it, and define it. And so a piece of code came to be (I've used GCC-style naked functions; not sure if you're familiar with them, since they are a lot uglier than the VC++ ones):

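// gcc -m32 bswap.c -masm=intel
// (the _test symbol name follows the MinGW convention; on ELF targets use test)
#include <stdio.h>

extern unsigned int test(unsigned int a);
__asm__("\n\
  .global _test\n\
  _test:\n\
     mov eax, [esp+4]\n\
     .byte 0x66\n\
     bswap eax\n\
     ret\n\
");

int main(void)
{
  unsigned int i;
  for(i = 0; i < 0xf0000000; i += 1237)
    printf("%.8x -> %.8x\n", i, test(i));

  return 0;
}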

As I've found out (thanks to MeMeK and oshogbo for testing the above code on AMD processors), bswap reg16 on both newer Intel CPUs (tested on Core 2 Duo and Core 2 Quad) and AMD CPUs (Duron 800 MHz and Athlon XP 2000+) sets the reg16 to zero (in the above case, it set the lower 16 bits of EAX (aka AX) to 0).

A question arises: does the same happen on x86 emulators like bochs or QEMU?

Let's check! Bochs first (cpu/bit.cc):

void BX_CPP_AttrRegparmN(1) BX_CPU_C::BSWAP_ERX(bxInstruction_c *i)
{
#if BX_CPU_LEVEL >= 4
  Bit32u val32, b0, b1, b2, b3;

  if (i->os32L() == 0) {
    BX_ERROR(("BSWAP with 16-bit opsize: undefined behavior !"));
  }

  val32 = BX_READ_32BIT_REG(i->opcodeReg());
  b0 = val32 & 0xff; val32 >>= 8;
  b1 = val32 & 0xff; val32 >>= 8;
  b2 = val32 & 0xff; val32 >>= 8;
  b3 = val32;
  val32 = (b0<<24) | (b1<<16) | (b2<<8) | b3;

  BX_WRITE_32BIT_REGZ(i->opcodeReg(), val32);
#else
  BX_INFO(("BSWAP_ERX: required CPU >= 4, use --enable-cpu-level=4 option"));
  exception(BX_UD_EXCEPTION, 0, 0);
#endif
}

Well, as one can see in the above listing, the code checks whether a 32-bit operand size is in use (which is not the case when the 66h prefix is used in PMODE, or when the same prefix is missing in RMODE). If it is not, a relevant error message is written to the logs, but execution continues. So bswap reg16 acts just like the bswap reg32 instruction: even though a reg16 is specified, the whole reg32 is modified (and not zeroed).

Time to check QEMU (target-i386/translate.c):

   case 0x1c8 ... 0x1cf: /* bswap reg */
       reg = (b & 7) | REX_B(s);
#ifdef TARGET_X86_64
           gen_op_mov_TN_reg(OT_LONG, 0, reg);
           tcg_gen_bswap_i32(cpu_T[0], cpu_T[0]);
           gen_op_mov_reg_T0(OT_LONG, reg);

It seems that QEMU assumes that the bswap is always used with a 32-bit register, and it ignores the 66h prefix. So, in the end, it works the same as bochs.

To sum it up: it is possible to detect execution in an emulated environment (bochs, QEMU) using the bswap reg16 instruction.
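
As a rough sketch of how such a check could be written down, the C snippet below compares an observed result against the two behaviours described in this post: real CPUs zeroing the low 16 bits, and the emulators doing a full 32-bit swap. The names bswap32, bswap16_verdict and classify_bswap16 are invented for this example, and it assumes that real hardware leaves the upper half of the register untouched. It only classifies a before/after pair; executing the instruction itself is left to code like the test program above.

```c
#include <stdint.h>

/* Plain 32-bit byte swap, i.e. what the emulators do when they ignore 66h. */
static uint32_t bswap32(uint32_t v)
{
    return (v << 24) | ((v & 0x0000ff00u) << 8) |
           ((v >> 8) & 0x0000ff00u) | (v >> 24);
}

enum bswap16_verdict { LOOKS_LIKE_HARDWARE, LOOKS_LIKE_EMULATOR, INCONCLUSIVE };

/* Hypothetical classifier: in is the register value before "66h + bswap",
   out is the value observed afterwards. */
enum bswap16_verdict classify_bswap16(uint32_t in, uint32_t out)
{
    uint32_t hw  = in & 0xffff0000u;  /* low 16 bits zeroed (assumed: upper half kept) */
    uint32_t emu = bswap32(in);       /* prefix ignored, full 32-bit swap              */

    if (hw == emu)  return INCONCLUSIVE;  /* happens only for in == 0 */
    if (out == hw)  return LOOKS_LIKE_HARDWARE;
    if (out == emu) return LOOKS_LIKE_EMULATOR;
    return INCONCLUSIVE;                  /* some third behaviour */
}
```

For example, with in = 0x11223344 a result of 0x11220000 matches the hardware model, while 0x44332211 matches the bochs/QEMU behaviour.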

But why would anyone want to detect emulators? Well, bochs is a very handy tool when it comes to analyzing malware that infects the early boot phase (like Sinowal MBR aka Win32/Mebroot). So a reverse engineer should know that something that looks like an innocent bswap invocation may in fact be an emulator detection that will derail the analysis (for example: calculate invalid offsets and crash the OS).

Btw, I don't think this detection will work against virtualization software, since the virtualizers (Virtual PC, VirtualBox, etc.) execute bswap directly on the real hardware.

And that's it.

UPDATE: On the Polish side of the mirror, reader "..." wrote that the Intel/AMD behaviour comes from internally zero-extending the value of the smaller (16-bit) register to a full machine word (32 bits in this case). The bswap is then applied to the 32-bit value "00 00 AH AL", which after the swap becomes "AL AH 00 00", and this gets truncated to the lower 16 bits, which are "00 00". Sounds logical ;>
It's also worth checking out the comments on this side of the blog.
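
The explanation from the update can be walked through step by step in C. This is only a model of the reader's hypothesis, not a description of what the silicon really does; the function name model_bswap_reg16 is made up, and keeping the upper half of EAX unchanged is an assumption consistent with the results reported in this post.

```c
#include <stdint.h>

/* Model of the zero-extension explanation: AX is zero-extended into a
   32-bit temporary, the temporary is byte-swapped, and only its low
   16 bits are written back to AX. Illustration only. */
uint32_t model_bswap_reg16(uint32_t eax)
{
    uint32_t tmp = eax & 0xffffu;                  /* 00 00 AH AL           */

    tmp = (tmp << 24) | ((tmp & 0xff00u) << 8) |   /* AL AH 00 00           */
          ((tmp >> 8) & 0xff00u) | (tmp >> 24);

    uint16_t ax = (uint16_t)tmp;                   /* truncated: 00 00      */

    return (eax & 0xffff0000u) | ax;               /* upper half left alone */
}
```

Whatever AX holds, the two low bytes of the swapped temporary come from the zero-extended upper half, so the value written back is always zero. For instance, 0x11112222 becomes 0x11110000.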
UPDATE 2: Peter Ferrie, in the comments to this post on openrce, mentioned a DOSBox bug related to the 16-bit bswap that he recently reported (not fixed yet; at least I can't find a fix in the current release source code). He also writes that the behavior of bswap has not changed since the 486 (when it was introduced), so it's pretty much unofficially defined.
It's also worth taking a look at the "Insanely Low-Level" blog and the discussion there about decoding the 16-bit bswap instruction (in the context of the diStorm engine).


2009-12-29 17:22:31 = Rolf Rolles
VMProtect uses this; I had to code special support in my emulator for BSWAP/16.
2009-12-30 06:35:18 = Gynvael Coldwind
@Rolf Rolles
Ah, interesting! Thanks for commenting ;>
2010-01-04 09:26:26 = arkon
Everything is defined after all, obviously.
Anyway Peter is not 100% correct.
Even if you set the high half of EAX, you still get 0. Which means the instruction truly doesn't work for 16-bit registers.
Check this in debug.exe:
-a 100
0AFE:0100 db 66
0AFE:0101 mov ax, 1111
0AFE:0104 dw 2222
0AFE:0106 db 0f
0AFE:0107 db c8

(mov eax, 11112222; bswap ax)
2010-01-05 17:12:06 = Ron
Your link to the DOSBox bug is broken -- it has <a hreg> instead of <a href> :)
2010-01-08 11:01:11 = Gynvael Coldwind
Thx! Fixed :)

Hi! Thanks for commenting ;>
You are correct, of course. Peter's statement ("the top 16 bits are zero in 16-bit mode") seems to have been imprecise.
The correct form is (from what I know) that the 16-bit register is zero-extended to 32 bits before the bswap takes place (in some temporary register of course, without overwriting the original register), hence the zeroes get swapped into the lower 16-bit part of the temporary and are then copied to the ?x register.
2015-01-10 17:44:15 = Hans Trollhoff
The 486 (or possibly only particular steppings of it) does something different:
