Articles posted November 2008

MAME 0.128u4 Changes

A number of people have asked about the changes that went into MAME 0.128u4, and why they were made. This article explains the motivation and reasoning behind these changes, and what still needs to be done.

CPU Context Switching

When MAME was originally designed, it wasn't envisioned that it would eventually be used to emulate multiple CPUs at the same time. MAME 0.1 was derived from Nicola's Multi-Pac emulator, which emulates Pac-Man-based hardware, consisting simply of a single Z80 CPU. Eventually, more arcade games were discovered that ran on similar hardware, and a large collection of Z80-based games was quickly supported.

Of course, it was soon discovered that arcade hardware is much more diverse than a single Z80. Support for additional 8-bit CPUs was added, including the 6502 and 6809. In addition, it quickly became apparent that many games had more than one CPU, and so support for multiple CPUs soon followed. There was a problem with running multiple CPUs of the same type, however: the code in the various CPU emulators was not designed to run multiple instances of the same type of CPU at the same time. So in order to make this work, a concept called context switching was introduced.

Let's take the Z80 as our example. The main problem with running more than one Z80 CPU at a time is that the Z80 CPU emulator stored the state of the CPU in global variables. At the time (remember this is back in the days of the 386 and DOS), it was faster to fetch data from global variables than to allocate memory for each Z80 and reference all the CPU state via a pointer.

So in order to allow the Z80 core to continue to work the way it did, we introduced context switching. With context switching, when MAME needs to stop executing one of the Z80's and begin executing one of the others, it asks the first Z80 to copy all of its relevant global variables to some temporary memory, and then copy the second Z80's state from temporary memory back into the global variables. At this point, the global variables contain the state of the second Z80, and it can be executed normally.
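The save-and-restore step described above can be sketched in a few lines of C. This is only an illustration of the idea, not the actual MAME Z80 core: the z80_state layout and the names z80_global, z80_saved, z80_execute, and z80_context_switch are all made up for this example.

```c
#include <string.h>

/* Hypothetical, heavily simplified Z80 state; a real core tracks far more. */
typedef struct {
    unsigned short pc;   /* program counter */
    unsigned short sp;   /* stack pointer */
    unsigned char  a;    /* accumulator */
} z80_state;

#define NUM_Z80 2

/* The core executes against this single global copy of the state. */
z80_state z80_global;

/* The "temporary memory": one saved context per emulated Z80. */
z80_state z80_saved[NUM_Z80];

/* Index of the CPU whose state currently lives in z80_global. */
int z80_active = 0;

/* Stand-in for real execution: it only advances the program counter. */
void z80_execute(int cycles)
{
    z80_global.pc = (unsigned short)(z80_global.pc + cycles);
}

/* Copy the outgoing CPU's globals out, then copy the incoming CPU's in. */
void z80_context_switch(int next)
{
    memcpy(&z80_saved[z80_active], &z80_global, sizeof(z80_global));
    memcpy(&z80_global, &z80_saved[next], sizeof(z80_global));
    z80_active = next;
}
```

Every switch costs two full copies of the state, regardless of how few cycles were executed in between.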

In practice, this worked great because MAME would execute the first Z80 for many thousands of cycles, perform a context switch, execute the second Z80 for many thousands of cycles, perform a context switch, etc. In this execution model, the context switch did not require a significant percentage of the total execution time, because it was only done once every few thousand emulated cycles.

The problems arise when we want to do "cycle accurate" execution of these two Z80's. In this case, we really want to run each Z80 for just one or two cycles at a time, and switch back and forth between them many, many times per second. When we do this, we execute the first Z80 for a cycle or two, context switch, execute the second Z80 for a cycle or two, switch again, etc. In this situation, it turns out, we spend far more time context switching than we do executing.

The solution to this problem is to get rid of the context switching. To do this, we need to allocate separate memory for each Z80 and instruct each CPU to reference all of its data from that memory, rather than accessing global variables. In the past, doing this carried a bit of a performance penalty, but on modern processors and with modern compilers, it is either a wash or marginally faster to do it the "right" way.
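As a rough sketch of the pointer-based approach, assuming the same kind of simplified, made-up z80_state as before (z80_create, z80_execute, and run_interleaved are hypothetical names, not MAME's interfaces), each CPU owns its own block of memory and the core takes a pointer to it:

```c
#include <stdlib.h>

/* Hypothetical per-instance state; nothing here is global. */
typedef struct {
    unsigned short pc;          /* program counter */
    int            cycles_run;  /* total cycles executed so far */
} z80_state;

/* Allocate separate, zeroed memory for one Z80. Returns NULL on failure. */
z80_state *z80_create(unsigned short start_pc)
{
    z80_state *cpu = calloc(1, sizeof(*cpu));
    if (cpu != NULL)
        cpu->pc = start_pc;
    return cpu;
}

/* Stand-in for real execution: all state is reached through the pointer. */
void z80_execute(z80_state *cpu, int cycles)
{
    cpu->pc = (unsigned short)(cpu->pc + cycles);
    cpu->cycles_run += cycles;
}

/* Alternate between two CPUs one cycle at a time; no state is copied. */
void run_interleaved(z80_state *a, z80_state *b, int slices)
{
    for (int i = 0; i < slices; i++) {
        z80_execute(a, 1);
        z80_execute(b, 1);
    }
}
```

Switching from one CPU to the other is now just a matter of passing a different pointer, so fine-grained interleaving no longer pays for a state copy on every switch.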

Thus, a major motivation behind the changes in MAME 0.128u4 is to make it possible to eventually get rid of the CPU context switching. To do this, the interfaces to all the CPU cores had to change, and each core must be modified to fetch its state from memory pointers instead of global variables. As of the 0.128u4 release, several important CPU cores have been converted, but more work is still pending in this direction.

Memory Context Switching

In addition to the way that CPUs do context switching, the memory system in MAME also does something similar. In MAME, when a CPU needs to read or write to memory, it works with the memory system to figure out whom to call to implement the necessary read/write behavior. Whenever MAME needed to stop executing one CPU and begin executing another, it had to perform a memory context switch in addition to the CPU context switch, so that memory accesses from the new CPU accessed the correct memory.

In order to remove memory context switching, the state of the memory subsystem needed to be moved out of global variables and into allocated memory. But it turned out that the memory system wasn't properly organized in this manner, so some significant changes had to be made. In the end, I decided to expose the concept of an "address space" to describe the state of the memory subsystem.

Each CPU can have one or more address spaces, and whenever the CPU needs to talk to the memory system, it hands a pointer to the relevant address space to the memory system so that the memory system knows how to map that memory access. In addition, when the memory system identifies a particular callback in a game driver to handle a read or a write, it also passes along the pointer to the address space structure, so that the game driver has the information it needs to know about handling that memory access.
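To make the flow concrete, here is a minimal sketch of an address space object that is handed to the memory system and then passed along to the handler. The example_space structure, its fields, and the function names are assumptions made for illustration; they are not the real address_space definition or handler signatures.

```c
/* Hypothetical address space: everything needed to map one CPU's accesses. */
typedef struct example_space example_space;

typedef unsigned char (*read8_fn)(const example_space *space, unsigned offset);
typedef void (*write8_fn)(const example_space *space, unsigned offset,
                          unsigned char data);

struct example_space {
    const char    *cpu_tag;  /* which CPU owns this space */
    unsigned char *ram;      /* backing memory for this space */
    unsigned       mask;     /* address mask applied to every access */
    read8_fn       read;     /* driver callback for reads */
    write8_fn      write;    /* driver callback for writes */
};

/* Driver-side handlers: they learn which space is involved from the pointer. */
unsigned char ram_read(const example_space *space, unsigned offset)
{
    return space->ram[offset & space->mask];
}

void ram_write(const example_space *space, unsigned offset, unsigned char data)
{
    space->ram[offset & space->mask] = data;
}

/* Memory-system entry points: the CPU core passes in its address space. */
unsigned char space_read8(const example_space *space, unsigned address)
{
    return space->read(space, address);
}

void space_write8(const example_space *space, unsigned address,
                  unsigned char data)
{
    space->write(space, address, data);
}
```

Because the handler receives the space pointer directly, it never has to consult global state to find out which CPU's memory it is servicing.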

To make all this work, the interfaces to all the read/write handlers in the system had to change, the memory system had to be rewired, and all of the CPU cores had to be modified to pass along these address space objects. As of 0.128u4, the memory context switch has been removed entirely. Completing this work was a necessary precursor to eliminating the CPU context switching.

The "Active" CPU

One thing that goes hand-in-hand with context switching is the notion of an "active" CPU, which is defined to be the CPU whose context is currently copied into the global variables. As we remove the need for CPU context switching, the notion of an active CPU becomes less well-defined and less meaningful.

Take, for example, a CPU (we'll call it CPU A) which can write to the memory space of another CPU (call it CPU B).

In a context switching system, when CPU A is executing, CPU A's state is loaded into the global variables, and all of CPU A's memory reads and writes use the global memory state information in order to know what happens when memory is accessed. Logically, CPU A is the "active" CPU. Now let's say CPU A performs some action which causes it to modify CPU B's memory space. In order to do this, we must save off CPU A's state (both CPU and memory state), and load up CPU B's state, thus making CPU B the "active" CPU. Then we perform the access to CPU B's memory space. When finished, we save off CPU B's state and restore CPU A's state so that it can continue executing.

In a more modern system, when CPU A is executing, CPU A's state lives off in memory somewhere and can be accessed at any time, and its address spaces are similarly configured so that they can be accessed at any time without context switching. Now CPU A performs the same action which causes it to modify CPU B's memory space. In this case, we don't need to do any context switching. Instead, CPU A can directly modify CPU B's memory space by passing in a pointer to CPU B's address space when it performs its memory operations.
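A small sketch of this second scenario, again with made-up names (example_space, example_cpu, and cpu_a_send_command are hypothetical and the 256-byte RAM is arbitrary): CPU A simply holds a pointer to CPU B's address space and writes through it.

```c
#include <stddef.h>

/* Hypothetical address space backed by a small block of RAM. */
typedef struct {
    unsigned char ram[256];
} example_space;

/* Hypothetical CPU: its own space, plus another CPU's space it can reach. */
typedef struct {
    example_space *program;  /* this CPU's own address space */
    example_space *shared;   /* another CPU's space, or NULL if none */
} example_cpu;

void space_write8(example_space *space, unsigned offset, unsigned char data)
{
    space->ram[offset & 0xff] = data;
}

unsigned char space_read8(const example_space *space, unsigned offset)
{
    return space->ram[offset & 0xff];
}

/* CPU A posts a command byte into CPU B's memory. Nothing is saved or
   restored, and CPU B never becomes "active" during the access. */
void cpu_a_send_command(example_cpu *cpu_a, unsigned char command)
{
    if (cpu_a->shared != NULL)
        space_write8(cpu_a->shared, 0x10, command);
}
```

The only thing that distinguishes an access to CPU B's memory from an access to CPU A's own memory is which space pointer is passed in.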

It should be obvious that the big difference in these two scenarios is that there is no context switching in the second case, which should make things faster. But even more importantly, there is no "active" CPU at any point in the second case.

One could argue that because CPU A is executing, it should be designated "active"; however, even in this case, the notion of active is significantly different from the one in the first example, because CPU B is never considered active during the transaction. Looking further down the line, it is clear that if we define a CPU as "active" while it is executing, then we forever consign ourselves to only executing a single CPU at a time, because only one CPU can ever be "active" at a time.

Thus, the right solution is to get rid of this "active" CPU notion. Unfortunately, the MAME drivers and core are peppered with references to the active CPU. These must go over time. Only when they are gone can we fully remove the context switching; as long as references to the active CPU still exist, we have to continue to context switch in order to keep up a valid definition of the active CPU.

What Does This All Mean?

For users, all the existing games should continue to work. Speed should be equivalent if not a bit better, especially in situations where there is aggressive context switching today. Once all the context switching is removed, MAME will be doing less work when it switches frequently between different CPUs.

Longer term, however, it is likely that some of these cases will get slower again, because for many early games, the multiple CPUs work closely in concert, and to achieve the most accurate behavior, we ideally should execute each CPU one instruction at a time, alternating back and forth very quickly. Today, this makes performance very bad; with the context switching changes in, I hope it will be bearable at least for many of the 8-bit systems.

For developers, these changes mean that MAME is more object-oriented. Even though MAME is written in straight C and not C++, you can imagine that core structures in the system such as running_machine, device_config, and address_space are objects, and pointers to these objects are passed in and out of most calls in the system. There should be very few, if any, cases where an "active" CPU is still needed, and references to the global Machine (capital "M") object are removed in favor of pointers that are passed into your functions.

The changes we are making aren't particularly revolutionary from a software architecture point of view, but they are revolutionary from a MAME design point of view. There are still many more changes to come as we push toward removing the CPU context switching altogether, but as of MAME 0.128u4 the core infrastructure is in place to make these changes happen. This particular update was the "biggie"; future updates along this path should be smaller until we are in a position to finally remove the CPU context switch for good.

(minor edits on 29-Nov-2008 for clarity)