GXemul: Dynamic Translation Internals

Back to the index.


This page describes the internal workings of the GXemul dynamic translation system (usually referred to as just "dyntrans").


Static vs. dynamic:

In order to support guest operating systems, it is important to take into account the fact that code pages in memory can be overwritten with new code, e.g. when loading new programs inside the guest OS. Statically doing a one-time translation of the code to run will thus not work, it is necessary to translate code dynamically *. Self-modifying code and Just-in-Time compilers running inside the emulator are other things that would not work with a static translator. GXemul is a dynamic translator. However, it does not necessarily translate into native code, like some other emulators.

* Actually this is not necessary. Instead of translating the code ahead of execution, the instructions could be interpreted one-at-a-time. But this is slow.


Executable Intermediate Representation:

Dynamic translators usually translate from the emulated architecture (e.g. MIPS) into a kind of intermediate representation (IR), and then to native code (e.g. AMD64 or x86 code) which can be executed by the host. Since one of the main goals for GXemul is to keep everything as portable as possible, the IR is something which can be executed regardless of whether the final step (translation from IR to native code) has been implemented or not.

The IR in GXemul consists of arrays of pointers to functions, and a few arguments which are passed along to those functions. This is usually referred to as threaded code. The functions are implemented in either manually hand-coded C code, or generated C code.

Here is a simplified diagram of how these arrays work.

There is one instruction call slot for every possible program counter location. For example, in the traditional MIPS case, instruction words are 32 bits long, and pages are 4 KB large, resulting in 1024 instructions per page. In GXemul, this is represented using 1025 or 1026 instruction call slots.

After the last of the regular instruction call slots, there is one or two additional slots, which are special "end of page" slots. These do not count as executed instructions. Instead, they jump to the first instruction on the next virtual page (which might cause exceptions, etc).

(For architectures with 32-bit long instructions and 4 KB page size without a delay slot, there are 1025 entries, and for those with a delay slot there are 1026 entries.)

The core dyntrans loop, which executes the instructions in the instruction call slots, looks something like this:

	/*  The normal instruction execution core:  */
	#define I	ic = cpu->cd.DYNTRANS_ARCH.next_ic ++; ic->f(cpu, ic);

	...

	n_instrs = 0;

	for (;;) {
		struct DYNTRANS_IC *ic;

		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;

		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;

		cpu->n_translated_instrs += 120;
		if (cpu->n_translated_instrs >= N_SAFE_DYNTRANS_LIMIT)
			break;
	}

Why the number 120? The answer is: this number is evenly divisible by 1, 2, 3, 4, 5, 6, 8, 10, 12 (and some other numbers). Long emulated instruction loops of those lenghts will thus result in the same host branch being taken from the same place multiple times, which seems to fit with some modern CPUs' branch prediction mechanisms.

There are two important pointers: next_ic points to the next instruction call, and cur_ic_page points to the whole page of such instructions. Thus, when the main loop exits, we can check what next_ic points to. If it points to one of the slots inside the current page (cur_ic_page), then the program counter's lowest bits should be adjusted to be in sync with the next_ic pointer. It can also happen that the next_ic points to something else: a special case is when one wants to break out of the dyntrans loop, e.g. if there was an error. Then, next_ic does not point to any of the instruction slots in that page, and thus the program counter's lowest bits are not synced from the next_ic value.

The complexity of individual instructions vary. A simple example of what an instruction can look like is the MIPS addu instruction:

	/*
	 *  3-register arithmetic instructions:
	 *
	 *  arg[0] = ptr to rs
	 *  arg[1] = ptr to rt
	 *  arg[2] = ptr to rd
	 */
	X(addu)
	{
		reg(ic->arg[2]) = (int32_t)(reg(ic->arg[0]) + reg(ic->arg[1]));
	}

It stores the result of a 32-bit addition of the register at arg[0] with the register at arg[1] into the register at arg[2]. If the emulated CPU is a 64-bit CPU, then this will store a correctly sign-extended value. If it is a 32-bit CPU, then only the lowest 32 bits will be stored, and the high part ignored. X(addu) is expanded to mips_instr_addu in the 64-bit case, and mips32_instr_addu in the 32-bit case. Both are compiled into the GXemul executable; no code is created during run-time.

Another reasonably simple example is the M88K bsr (branch to subroutine) instruction. This is a branch which also sets the return address register to the instruction following the branch. Here is the implementation for "samepage" offsets, i.e. branch offsets which are inside the same dyntrans page:

	X(bsr_samepage)
	{
		cpu->cd.m88k.r[M88K_RETURN_REG] = (cpu->pc &
		    ~((M88K_IC_ENTRIES_PER_PAGE-1) << M88K_INSTR_ALIGNMENT_SHIFT))
		    + ic->arg[2];
		cpu->cd.m88k.next_ic = (struct m88k_instr_call *) ic->arg[0];
	}

Don't be scared by the logic operators. Here, the M88K r1 register (the return register) is set to the return address, which is calculated by taking the current program counter, clearing the lowest bits, and adding arg[2] which is the offset within the page of the next instruction. Then the next instruction call to execute is set to arg[0], which needs to be on the same page (since cur_ic_page is not updated).

The compiled code may look like this (example from an AMD64 host):

000000000036dac0 :
  36dac0:       b8 03 f0 ff ff          mov    $0xfffff003,%eax
  36dac5:       23 87 a8 00 00 00       and    0xa8(%rdi),%eax
  36dacb:       03 46 18                add    0x18(%rsi),%eax
  36dace:       89 87 e4 00 00 00       mov    %eax,0xe4(%rdi)
  36dad4:       48 8b 46 08             mov    0x8(%rsi),%rax
  36dad8:       48 89 87 e0 03 00 00    mov    %rax,0x3e0(%rdi)
  36dadf:       c3                      retq   


Emulation of Loads and Stores:

The emulation of memory access instructions differ slightly between the various emulated architectures, but most of the principles in this chapter apply. The emulation of loads and stores when emulating a MIPS processor consist of a "happy case" shorter code path, and a fallback to a more general case if the preconditions for the happy case are not met.

Pseudo code for the short path:

Step: Overhead:
1. Calculate the final virtual address. 3 host memory accesses: read the Base and Offset arguments.
2. Preconditions check: a) find a corresponding host RAM page, if there is one, and b) check address alignment. 1 host memory access for 32 bit, 3 for 64 bit.
3. Perform native load or store. 1 host memory access (or multiple if doing byte-order-swap).
4. "Write-back" to emulated CPU's destination register (for loads). 2 host memory accesses.

Step 1 calculates the actual virtual address (usually a base register plus an immediate offset).

Step 2 looks up whether there is a corresponding RAM page in the host. If there is, we can potentially load or store directly from that page in the host's RAM.

However, before going ahead with the load or store, we need to check the alignment of the address as well. Different host architectures have different properties, for example when running on (some?) ARM hosts, the lower bits of an address may be ignored leading to the wrong address being used if we were to use the address as-is; when running on an x86-based host, unaligned loads and stores are silently accepted even though they shouldn't be on the hardware being emulated (in this case MIPS). Checking the lowest bits of the final address is thus important for correct emulation, so that exceptions can be generated etc.

For emulation of 32-bit address spaces, the lookup from emulated virtual address to host page is a single fetch from an array. For emulation of 64-bit address spaces, it is a three-level lookup. In either case, there are separate lookups for loads and stores. A page which is write-protected has just a mapping for load, no mapping for store. A page which is readable and writable has both a mapping for load and one for store.

If the alignment was ok, and there was a direct mapping to host page, we can go ahead (step 3) and perform a native load or store, possibly followed by byte order swap (which may be implemented by using multiple memory accesses) if the emulated CPU is of different endianness than the host.

(This mechanism is currently hardcoded for 4 KB base page size. The 1 KB page size of the VR41xx CPUs for example does not work, and neither would e.g. VAX' 512 byte pages. As long as the guest OS doesn't make use of smaller page sizes than 4 KB, it should work even when emulating such CPUs.)

If any of the checks fail in step 2, the code bails out and calls a more general memory access routine, which looks up addresses via the emulated MIPS TLB and depending on the outcome, may add host pages to the virtual-to-host arrays, or cause exceptions on the emulated cpu, or call out to memory mapped device routines.

This implementation is obviously not intended to result in any world speed records, but it strikes a reasonable balance between emulation correctness and host portability.

The actual code in src/cpus/cpu_mips_instr_loadstore.cc looks horrible due to being originally written in C with the intention of mimicking C++ templates. If it were to be written in C++ templates from scratch, it would likely be more readable.

When the TLB in the emulated CPU gets instructed to change its mapping, for example if virtual address 0x1000 pointed to physical address 0x1234000, but is changed to point to 0x2345000, then the virtual-to-physical-to-host mappings in the emulator must be updated to reflect this. This is known as invalidating the translation lookup tables.

Also, when performing a store to a page for which there are code translations (the IR described above), all such translated instruction slots are reset.


Performance:

The performance of using this kind of executable IR is obviously lower than what can be achieved by emulators using native code generation, but can be significantly higher than using a naive fetch-decode-execute interpretation loop. Using threaded code as the IR is an interesting compromise.

The overhead per emulated instruction is usually around or below approximately 10 host instructions. This is very much dependent on your host architecture and what compiler and compiler switches you are using. Added to this instruction count is (of course) also the C code used to implement each specific instruction.


Instruction Combinations:

Short, common instruction sequences can sometimes be replaced by a "combined" instruction. An example could be a compare instruction followed by a conditional branch instruction. The advantages of instruction combinations are that

The special cases where instruction combinations give the most gain are in the cores of string/memory manipulation functions such as memset() or strlen(). The core loop can then (at least to some extent) be replaced by a native call to the equivalent function.

The implementations of compound instructions still keep track of the number of executed instructions, etc. When single-stepping, these translations are invalidated, and replaced by normal instruction calls (one per emulated instruction).

An example of what such an instruction combination can look like is a 3-instruction memset loop on MIPS. The pattern to detect is a sequence of addiu, bne, and sw with certain registers. In this case, an attempt to identify the sequence is only necessary for certain sw instructions; otherwise, the attempt to find the pattern is not done. (Note: The sw instruction is in the delay slot of the bne instruction.)

If the sequence is detected, the instruction call slot with addiu is changed to point to the instruction combination.

	Original sequence:			Changed to:

	slot 4:  ...				slot 4:  ...
	slot 5:  addiu	r5, r3, 24		slot 5:  addiu	  r5, r3, 24
	slot 6:  addiu	r9, r9, 4		slot 6:  sw_loop  r9, r9, 4
	slot 7:  bne    r12, r9, slot 6		slot 7:  bne      r12, r9, slot 6
	slot 8:  sw	r6, -4(r9)		slot 8:  sw	  r6, -4(r9)
	slot 9:  subu	r2, r3, 32		slot 9:  subu	  r2, r3, 32      <--- (*)
	slot 10: ...				slot 10: ...

(*) If the sw_loop code is executed, then execution continues here (&ic[3]).

The sw_loop is then implemented as follows:

	/*
	 *  sw_loop:
	 *
	 *  s:	addiu	rX,rX,4			rX = arg[0] and arg[1]
	 *	bne	rY,rX,s  (or rX,rY,s)	rt=arg[1], rs=arg[0]
	 *	sw	rZ,-4(rX)		rt=arg[0], rs=arg[1]
	 */
	X(sw_loop)
	{
		MODE_uint_t rX = reg(ic->arg[0]), rZ = reg(ic[2].arg[0]);
		uint64_t *rYp = (uint64_t *) ic[1].arg[0];
		MODE_uint_t rY, bytes_to_write;
		unsigned char *page;
		int partial = 0;

		page = cpu->cd.mips.host_store[rX >> 12];

		/*  Fallback:  */
		if (cpu->delay_slot || page == NULL || (rX & 3) != 0 || rZ != 0) {
			instr(addiu)(cpu, ic);
			return;
		}

		if (rYp == (uint64_t *) ic->arg[0])
			rYp = (uint64_t *) ic[1].arg[1];

		rY = reg(rYp);

		bytes_to_write = rY - rX;
		if ((rX & 0xfff) + bytes_to_write > 0x1000) {
			bytes_to_write = 0x1000 - (rX & 0xfff);
			partial = 1;
		}

		memset(page + (rX & 0xfff), 0, bytes_to_write);

		reg(ic->arg[0]) = rX + bytes_to_write;

		cpu->n_translated_instrs += bytes_to_write / 4 * 3 - 1;
		cpu->cd.mips.next_ic = partial?
		    (struct mips_instr_call *) &ic[0] :
		    (struct mips_instr_call *) &ic[3];
	}

For large memsets, calling the native/system memset implementation will most likely be a lot faster than emulating the MIPS memset one instruction at a time.

The fallback check makes sure that the rest of the code in sw_loop will be able to run. The implementation only supports zero-fill memsets, so if rZ is non-zero (or some other dangerous condition is present), then we do fall-back to addiu. This is done by explicitly calling instr(addiu)(cpu, ic); and then returning.

(The example above is 32-bit specific, and only works if a direct page address can be obtained via cpu->cd.mips.host_store.)

To find out which such instruction combinations to implement, real-world code (such as a complete guest operating system with suitable user applications) should be executed with special instruction statistics gathering, and then the most common instructions can be retrieved from that statistics.


Native Code Generation Back-ends:

Native code generation was included in GXemul before 0.4.x, but only for i386 and Alpha hosts. That did not feel clean and portable enough to work as a long-term solution. Time has proved this right: i386 has been superseded by amd64, and the Alpha architecture has unfortunately been killed.

The current implementation instead focuses on an having an intermediate representation that can be executed regardless of what the native architecture is.

In theory, it would be possible to add native code generation again, as long as that generated code abides to the C ABI on the host.