This document outlines the design for multi-threaded TCG (a.k.a MTTCG) system-mode emulation. user-mode emulation has always mirrored the thread structure of the translated executable although some of the changes done for MTTCG system emulation have improved the stability of linux-user emulation.
The original system-mode TCG implementation was single threaded and dealt with multiple CPUs with simple round-robin scheduling. This simplified a lot of things but became increasingly limited as systems being emulated gained additional cores and per-core performance gains for host systems started to level off.
We introduce a new running mode where each vCPU will run on its own user-space thread. This is enabled by default for all FE/BE combinations where the host memory model is able to accommodate the guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the guest has had the required work done to support this safely (TARGET_SUPPORTS_MTTCG).
System emulation will fall back to the original round robin approach if:
forced by –accel tcg,thread=single
enabling –icount mode
64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)
In the general case of running translated code there should be no inter-vCPU dependencies and all vCPUs should be able to run at full speed. Synchronisation will only be required while accessing internal shared data structures or when the emulated architecture requires a coherent representation of the emulated machine state.
Between emulated guests and host systems there are a range of memory consistency models. Even emulating weakly ordered systems on strongly ordered hosts needs to ensure things like store-after-load re-ordering can be prevented when the guest wants to.
Barriers (sometimes known as fences) provide a mechanism for software to enforce a particular ordering of memory operations from the point of view of external observers (e.g. another processor core). They can apply to any memory operations as well as just loads or stores.
The Linux kernel has an excellent write-up on the various forms of memory barrier and the guarantees they can provide.
Barriers are often wrapped around synchronisation primitives to provide explicit memory ordering semantics. However they can be used by themselves to provide safe lockless access by ensuring for example a change to a signal flag will only be visible once the changes to payload are.
DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
This would enforce a strong load/store ordering so all loads/stores complete at the memory barrier. On single-core non-SMP strongly ordered backends this could become a NOP.
Aside from explicit standalone memory barrier instructions there are also implicit memory ordering semantics which comes with each guest memory access instruction. For example all x86 load/stores come with fairly strong guarantees of sequential consistency whereas Arm has special variants of load/store instructions that imply acquire/release semantics.
In the case of a strongly ordered guest architecture being emulated on a weakly ordered host the scope for a heavy performance impact is quite high.
- DESIGN REQUIREMENTS: Be efficient with use of memory barriers
host systems with stronger implied guarantees can skip some barriers
merge consecutive barriers to the strongest one
The system currently has a tcg_gen_mb() which will add memory barrier operations if code generation is being done in a parallel context. The tcg_optimize() function attempts to merge barriers up to their strongest form before any load/store operations. The solution was originally developed and tested for linux-user based systems. All backends have been converted to emit fences when required. So far the following front-ends have been updated to emit fences when required:
Memory Control and Maintenance
This includes a class of instructions for controlling system cache behaviour. While QEMU doesn’t model cache behaviour these instructions are often seen when code modification has taken place to ensure the changes take effect.
There are two broad types of synchronisation primitives found in modern ISAs: atomic instructions and exclusive regions.
The first type offer a simple atomic instruction which will guarantee some sort of test and conditional store will be truly atomic w.r.t. other cores sharing access to the memory. The classic example is the x86 cmpxchg instruction.
The second type offer a pair of load/store instructions which offer a guarantee that a region of memory has not been touched between the load and store instructions. An example of this is Arm’s ldrex/strex pair where the strex instruction will return a flag indicating a successful store only if no other CPU has accessed the memory region since the ldrex.
Traditionally TCG has generated a series of operations that work because they are within the context of a single translation block so will have completed before another CPU is scheduled. However with the ability to have multiple threads running to emulate multiple CPUs we will need to explicitly expose these semantics.
- DESIGN REQUIREMENTS:
Support classic atomic instructions
Support load/store exclusive (or load link/store conditional) pairs
Generic enough infrastructure to support all guest architectures
- CURRENT OPEN QUESTIONS:
How problematic is the ABA problem in general?
The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which can be used directly or combined to emulate other instructions like Arm’s ldrex/strex instructions. While they are susceptible to the ABA problem so far common guests have not implemented patterns where this may be a problem - typically presenting a locking ABI which assumes cmpxchg like semantics.
The code also includes a fall-back for cases where multi-threaded TCG ops can’t work (e.g. guest atomic width > host atomic width). In this case an EXCP_ATOMIC exit occurs and the instruction is emulated with an exclusive lock which ensures all emulation is serialised.
While the atomic helpers look good enough for now there may be a need to look at solutions that can more closely model the guest architectures semantics.