
Part 6: Timing Is Everything

If I had to pick the one piece of code I've written for MAME that I think has had the biggest positive impact on the project as a whole, it would definitely be the timing system. Because MAME originally started out as a simple Pac-Man emulator, the original architecture of the system was based on the architecture of Pac-Man: a single Z80 CPU with a very straightforward video system and pretty simple sound playback. Many of the things we take for granted nowadays weren't supported until several revisions in: the graphics viewer was introduced in MAME 0.08, the first non-Z80 CPU was added in MAME 0.10, multiple CPU support was introduced in MAME 0.12, the debugger was introduced in MAME 0.30, etc. As MAME evolved, the original architecture was stretched more and more to accommodate a wider range of video hardware, CPUs, sound chips, and all sorts of weird things that were never dreamed of back when Nicola first started writing the system.

Around the time MAME 0.30 showed up, I was definitely seeing the cracks in the way MAME handled multiple CPUs and scheduled time-sensitive events. When a game had just a single CPU executing, MAME would divide each second into frames. For a game like Pac-Man, which ran at 60 frames/second, that meant dividing each second into 60 equal parts. MAME would execute the Z80 CPU for exactly 1/60th of a second, then it would compute the video and sound output for that 1/60th of a second, and output the result. Then it moved on to the next 1/60th of a second. Pretty simple, right?

[Diagram: Execute CPU A -> Update Video/Sound]
Execution of a single frame on a single CPU system in MAME

When multiple CPU support was added, the same architecture was extended. Take Galaga, for example. It runs three Z80 CPUs in parallel. In order to emulate this, MAME would run the first Z80 for 1/60th of a second, then the second Z80 for 1/60th of a second, then the third Z80, and then update the video and sound.

[Diagram: Execute CPU A -> Execute CPU B -> Execute CPU C -> Update Video/Sound]
Execution of a single frame on a multiple CPU system in MAME
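In rough code form, this frame-based approach boiled down to something like the following. This is a simplified sketch with made-up names, not MAME's actual source:

    /* Simplified sketch of the old frame-based scheduling model.
       All names here are illustrative; this is not MAME's actual source. */

    #define FRAMES_PER_SECOND 60

    struct cpu_info
    {
        double clock_hz;                 /* CPU clock rate in Hz */
        void (*execute)(int cycles);     /* run this CPU for the given number of cycles */
    };

    static void update_video(void) { /* render one frame of video */ }
    static void update_sound(void) { /* generate 1/60th of a second of audio */ }

    static void run_one_frame(struct cpu_info *cpus, int num_cpus)
    {
        int i;

        /* run each CPU, one after another, for 1/60th of a second */
        for (i = 0; i < num_cpus; i++)
        {
            int cycles = (int)(cpus[i].clock_hz / FRAMES_PER_SECOND);
            cpus[i].execute(cycles);
        }

        /* only after every CPU has had its turn do we update the outputs */
        update_video();
        update_sound();
    }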

However, there was a problem with this approach. In order to get three CPUs working together in parallel, they needed to communicate with each other. And that communication can happen in a number of ways. In many early games, multiple CPUs all shared a chunk of RAM, and so when one CPU wrote into that RAM, another CPU would see that something had changed and would act on it. This presented problems for MAME's way of executing. Let's take the example of CPU A sending a signal to CPU C at the end of the 1/60th of a second timeslice. Here's how it would work on the real arcade system:

[Diagram: CPU A, CPU B, and CPU C execute in parallel for the full frame; the signal is Sent at the end of CPU A's frame and Rec'd at the same moment at the end of CPU C's frame, followed by Update Video/Sound]
Example of CPU A signalling to CPU C at the end of a video frame on real hardware

Notice that when CPU A sends a signal, CPU C receives it right away. This is because all of the CPUs are in fact executing in parallel. As I have discovered while working on MAME, timing is very important. Unlike PC games, which have to be designed to run on everything from a low-end machine to the latest and greatest hardware, most arcade games were designed to run on a fixed hardware platform that never changed. Because of this, their developers tended to write code that was only ever tested on that fixed system. This means that any timing-sensitive code was tweaked until it worked on that system, and it was then safely assumed it would keep working, because every arcade PCB that was produced would have exactly the same hardware running at exactly the same clock speed.

However, once emulation enters the picture, the timing is no longer as precise, especially when attempting to simulate multiple CPUs by executing them one after the other, rather than in parallel. Once you flatten out the timeline, you can see where the problem lies:

[Diagram: Execute CPU A (Sent at end) -> Execute CPU B -> Execute CPU C (Rec'd at start) -> Update Video/Sound]
Example of CPU A signalling to CPU C at the end of a video frame in MAME

And this is the crux of the problem. CPU A sent the signal at the end of its timeslice (which is 1/60th of a second in this case). However, CPU C isn't running in parallel like it was on the original hardware. Instead, it gets its timeslice way after CPU A has finished, and so it sees the signal that CPU A sent at the beginning of its timeslice instead of at the end. In many cases, this didn't matter too much, but all too often, we would find cases where a signal that was received too early on CPU C would break the emulation.

One potential way of solving this problem is to break up the execution into smaller timeslices, so rather than running each CPU for 1/60th of a second, maybe we would go for a finer grain, like 1/300th of a second:

[Diagram: A -> B -> C repeated in five short slices per frame; Sent occurs at the end of A's final slice and Rec'd at the start of C's final slice, followed by Update Video/Sound]
Increasing the "interleave" of the CPUs by 5x reduces the timing gap between send/receive
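In terms of the earlier sketch, raising the interleave just means cutting each frame into more, shorter slices before the video and sound update (again, purely illustrative code):

    /* Same frame loop as the earlier sketch, but with an "interleave" factor:
       each CPU now runs INTERLEAVE times per frame in shorter bursts. */

    #define INTERLEAVE 5    /* 5 slices per frame = 1/300th of a second each at 60fps */

    static void run_one_frame_interleaved(struct cpu_info *cpus, int num_cpus)
    {
        int slice, i;

        for (slice = 0; slice < INTERLEAVE; slice++)
            for (i = 0; i < num_cpus; i++)
            {
                /* each slice covers 1/(60*5) = 1/300th of a second of CPU time */
                int cycles = (int)(cpus[i].clock_hz / (FRAMES_PER_SECOND * INTERLEAVE));
                cpus[i].execute(cycles);
            }

        update_video();
        update_sound();
    }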

You'll note, however, that this approach doesn't really solve the problem. CPU C still gets the signal at the beginning of its timeslice. It's just that it's not quite as early as it used to be in the old scheme, and thus it isn't quite as much of a problem. But it's still not sufficient to run games that require very tight synchronization between CPUs, like Zookeeper or the non-bootleg version of Arkanoid.

One additional problem that surfaced with the old method is that all events only occurred on boundaries between frames (or on boundaries between fractional frames, if you used a higher "interleave" factor). This presented a problem for games like Tapper, which had a timing chip that could be programmed to go off at nice, fine intervals that didn't necessarily fall in line with the frame rate of the game. Prior to the existence of any formal timing system in MAME, these kinds of timers were handled with some seriously gross hacks, and were never quite accurate.

Thus, the goals of the new timer system I proposed and ended up writing were twofold. First, provide an infrastructure that supported accurate interleaving of multiple CPUs. And second, provide a way to schedule events at arbitrary times in the future, not necessarily attached to frame boundaries. The best part about the timing system is that it solved both problems with one mechanism.

The basic concept of the timer system was to throw away all frame-based interleaving from the core system and provide an entirely new architecture, one based around timers. A timer was allocated by code somewhere in the system and set to go off at some time in the future. When a timer fired, it called a callback in the C code. The big trick to making this all work was that before a timer fired, the system would make sure that every CPU in the system had executed exactly up to that point in time. Sound confusing? It is, so let's look at an example.
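Before getting to that example, here is a minimal sketch of what such a timer object and its allocation might look like. The names are made up for illustration; this is not MAME's actual API:

    #include <stdlib.h>

    /* Conceptual sketch of a timer object; all names are made up. */

    typedef void (*timer_callback)(void);

    struct timer
    {
        double          fire_time;   /* absolute emulated time (seconds) when it fires */
        timer_callback  callback;    /* C function to call at that time */
        struct timer   *next;        /* pending timers kept sorted by fire_time */
    };

    static struct timer *timer_list;     /* head of the sorted list of pending timers */
    static double        current_time;   /* global emulated time, in seconds */

    /* allocate a timer that fires 'seconds' from now, keeping the list sorted */
    static void timer_set(double seconds, timer_callback callback)
    {
        struct timer *t = malloc(sizeof(*t)), **p;

        t->fire_time = current_time + seconds;
        t->callback  = callback;
        for (p = &timer_list; *p != NULL && (*p)->fire_time <= t->fire_time; p = &(*p)->next)
            ;
        t->next = *p;
        *p = t;
    }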

Let's say a game has 3 CPUs. An interrupt is supposed to fire in 240 microseconds (usecs). So, the timer system will compute how many cycles each CPU needs to execute to account for 240 usecs. It will then execute each CPU for the proper number of cycles, and when they have all finished, it will then fire the timer by calling the callback.
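A stripped-down scheduling loop for that might look like the following, reusing the cpu_info structure and timer list from the sketches above. Real life is messier (for one thing, a newly set timer has to be able to cut short the CPU that is currently executing, as described next), but the core idea is this:

    /* Simplified scheduling loop: run every CPU up to the next timer's firing
       time, then fire that timer's callback.  Illustrative code only. */

    static void run_timers(struct cpu_info *cpus, int num_cpus)
    {
        while (timer_list != NULL)
        {
            struct timer *t = timer_list;
            double interval = t->fire_time - current_time;   /* e.g. 240 usecs = 0.000240 */
            int i;

            /* each CPU executes exactly enough cycles to account for 'interval';
               e.g. a hypothetical 3 MHz CPU runs 3000000 * 0.000240 = 720 cycles */
            for (i = 0; i < num_cpus; i++)
            {
                int cycles = (int)(cpus[i].clock_hz * interval + 0.5);
                cpus[i].execute(cycles);
            }

            /* all CPUs now agree on the current time; fire the callback */
            current_time = t->fire_time;
            timer_list = t->next;
            t->callback();
            free(t);
        }
    }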

How does this help with synchronization? Well, let's say CPU A is the main game CPU, and CPU B is the sound CPU. There is an interrupt coming up in 240 usecs, and we are currently executing CPU A for the number of cycles it takes to make up those 240 usecs. However, in the middle, say 100 usecs into that timeslice, CPU A wants to send an important message to CPU B. Instead of sending the message right away, CPU A sets a timer to go off immediately (at the 100 usec mark). The timer system then says, "Whoa, there's a new timer that is set to go off right now. CPU A has already executed all the cycles necessary to account for the 100 usecs since it started, so we'll stop it right there in its tracks, and start executing CPU B for the number of cycles it needs to reach the 100 usec mark." Once CPU B has finished executing, both CPUs are in sync, and the new timer fires. In the timer callback, CPU A sends its message to CPU B as planned, and then executes the correct number of cycles to finish off its original timeslice of 240 usecs. When CPU B gets its turn to execute again, it receives the message that CPU A sent and then finishes executing its timeslice.
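In code, the "send it via a timer" trick might look something like the following hypothetical sketch, built on the timer_set() function from the earlier example. Note that in the real system, allocating a timer that fires immediately is also what cuts short the timeslice of the CPU that is currently executing, which the simplified scheduler above glosses over:

    /* Hypothetical example of CPU A handing a byte to CPU B through a
       zero-delay timer, so that both CPUs are synchronized at the moment
       of delivery.  All names and handlers are illustrative. */

    static unsigned char pending_command;   /* value CPU A wants to send */
    static unsigned char command_latch;     /* value CPU B will actually read */

    /* timer callback: by the time this runs, every CPU has caught up to "now" */
    static void deliver_command(void)
    {
        command_latch = pending_command;
        /* this is also where you might assert an interrupt on CPU B */
    }

    /* called from CPU A's memory map when it writes the command register */
    static void command_w(unsigned char data)
    {
        pending_command = data;
        timer_set(0.0, deliver_command);    /* fire "immediately", forcing a sync */
    }

    /* called from CPU B's memory map when it reads the command register */
    static unsigned char command_r(void)
    {
        return command_latch;
    }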

By performing this kind of detailed dance, we have effectively made it so that when CPU A decides to send a message at 100 usecs into a timeslice, CPU B will actually receive that message at 100 usecs into its own timeslice. In addition, all frame timing and CPU interleaving are now performed via timers as well, meaning that the whole business of computing how long to execute each CPU is completely in the hands of one system. Everyone else is just a client of the timer system, which is architecturally much cleaner than the old approach, in addition to being far more flexible.

Alas, while this system works well in a great many cases, there are still limitations. For example, if CPU B needs to signal back to CPU A, this approach doesn't work so well. However, in most games there is essentially a master/slave relationship between the CPUs, so you can make the master CPU A, and it will "lead" the timing for the other CPUs. Over time, a number of tweaks and adjustments have been made to the timing system, but the basic premise has remained the same for the past 5 years. It's definitely the most central and durable of my contributions to the project.

For the record, here is a copy of my original proposal for the timer system to the MAME development list:

>Now the custom I/O chip is emulated more properly, generating the NMIs it
>requires instead of just copying the data where the game expects it.

In Dig Dug, I did a similar thing, except that I still copied the data 
myself, and generated the NMI for the final byte.

Ideally, I think the way to get around this is to implement some sort of 
an event manager.  I've been meaning to take the time to do it, but 
here's a sketch of how it would work.  I'd appreciate any comments on 
this mechanism, but it solves a *lot* of multi-CPU and timing problems.  
Maybe once we hash it out, I can work on it for MAME 29.

Basically, there would be a global event manager.  Its job would be to 
process event requests and determine how many CPU cycles to execute until 
the next event.  Rather than running the CPUs based on how many 
interrupts per frame, and context switching every video frame, it would 
run the CPUs in a loop like this (pseudo-code, obviously):

while (!done)
{
  microseconds = event_time_till_next_event ();
  for (activecpu = 0; activecpu < totalcpu; activecpu++)
  {
    cycles = microsec_to_cycles (&Machine->cpu[activecpu], microseconds);
    cpu_execute (cycles);
  }
  event_process_next_event ();
}

Now in order to mimic the current system, at startup we would allocate a 
persistent event that triggered once per video frame.  We would also 
allocate an additional persistent event for each CPU that triggered at 
that appropriate interrupt rate.  To register an event, you basically 
call a registration function with the frequency of the event, and a 
pointer to a callback function:

void event_register_persistent (int microseconds, void (*callback)(void));
void event_register_oneshot    (int microseconds, void (*callback)(void));

The callback could be responsible for generating an interrupt on the 
appropriate CPU, updating the video display, or whatever.

At the same time, I would also propose adding the ability to specify a 
minimum timeslice, perhaps via another event_ function that can be called 
in init_machine:

void event_set_min_timeslice (int microseconds);

Since the event manager ultimately determines the number of cycles 
executed per context switch, it would be quite clean to add this 
functionality.

Some current problems this could avoid:

* The NMI problem above can be addressed by generating the first NMI, and 
then registering a oneshot event for a few microseconds later which would 
then generate the next NMI, and so on until all data has been sent

* The CTC emulation could be made very precise by registering a oneshot 
event for the precise length of the timer, and then generating the 
interrupt at that time.

* In theory, Qix could just register a oneshot event with an extremely 
short duration and give up the rest of the current timeslice in order to 
achieve what it's currently doing with the yield_cpu hack

Some issues that would still need to be resolved:

* What happens if CPU #1 executes for 1000 cycles, and then CPU #2 
registers an event during its execution that should have already 
happened?  I guess we already have problems like this in the current 
system; the event manager should probably just set the timer for this 
event to the start of the next timeslice.

* I'm sure there are others I haven't thought of.... :-)

Aaron
(Mon, 1 Sep 97)

 
