CPU Scheduling in MAME (part 2)
Part 1 of this article described the basic round robin scheduling algorithm used by MAME when running multiple CPUs. It can be summarized as follows (a code sketch appears just after the list):
- Determine when the next timer is going to fire
- Loop over each CPU:
  - Compute the difference between this time and the CPU's local time
  - Execute for at least that long
  - Count up the cycles executed and compute the CPU's local time
- Return to step 1
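For the code-minded, here is a minimal sketch of that loop in C. The names, the stub functions, and the use of doubles are mine for illustration only; the real scheduler keeps time in MAME's own representation and tracks far more state:

```c
#include <math.h>
#include <stdio.h>

#define NUM_CPUS 2

/* hypothetical per-CPU state; the real scheduler tracks much more */
typedef struct
{
    double clock;        /* clock rate in Hz */
    double local_time;   /* how far this CPU has executed, in seconds */
} cpu_state;

static cpu_state cpu[NUM_CPUS] = { { 14000000.0, 0.0 }, { 2000000.0, 0.0 } };

/* stand-in for a CPU core's execute function: pretend we overshoot by
   a couple of cycles, as real cores do on multi-cycle instructions */
static int execute(int cpunum, int cycles)
{
    (void)cpunum;
    return cycles + 2;
}

/* stand-in for the timer system: when does the next timer fire? */
static double next_timer_time(void)
{
    return 0.000150;
}

static void run_one_timeslice(void)
{
    double target = next_timer_time();
    int i;

    for (i = 0; i < NUM_CPUS; i++)
    {
        double delta = target - cpu[i].local_time;
        if (delta > 0)
        {
            /* round up so the CPU runs at least to the target time */
            int requested = (int)ceil(delta * cpu[i].clock);
            int ran = execute(i, requested);

            /* fold the cycles actually run (including any overshoot)
               back into this CPU's local time */
            cpu[i].local_time += ran / cpu[i].clock;
        }
    }

    /* all CPUs have now reached the target; the timers fire here */
}

int main(void)
{
    run_one_timeslice();
    printf("CPU #0 at %.9f, CPU #1 at %.9f\n", cpu[0].local_time, cpu[1].local_time);
    return 0;
}
```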
One very common event that isn't predictable, however, is communication between two CPUs. In fact, this is one of the main reasons I added the timer system to MAME in the first place. Let's look at an example of what can go wrong with CPU communication when there is no synchronization.
Going back to our previous example, we have CPU #0 running at 14MHz, and CPU #1 running at 2MHz. We also know there is a timer that is going to fire in 150usecs, or at time 0.000150. So we begin executing CPU #0 for 2100 cycles. This time, however, the code executing on CPU #0 decides to signal an interrupt to CPU #1 at a point 1500 cycles into its execution (at time = 1500 / 14,000,000 = 0.000107143). With no synchronization, the interrupt is signalled (generally some state is set in CPU #1's context indicating that an interrupt is pending), and CPU #0 continues executing until the end of its timeslice. When it's done, it has executed 2112 cycles like before, and its local time is 0.000150857.
Now it's time to execute CPU #1 for the first time. Right away, however, it notices that it has an interrupt pending, so it may process the interrupt, and then continue executing for the rest of its timeslice normally.
So what's the problem? Well, CPU #0 signalled the interrupt at local time = 0.000107143, but CPU #1 acted on the interrupt at local time = 0 (right at the start of its timeslice). Let's say CPU #1 is a sound CPU; maybe the music started just a fraction of a second too early as a result. Worse, let's say CPU #1 was busy doing something else and wasn't ready to handle the interrupt just then. Maybe you've crashed the code or altered its state in a bad way.
What might not be immediately obvious in looking at the timer system is that most of the timers that are created are set to fire instantly. If you've ever seen timer_set(TIME_NOW, ...) in MAME, you've seen a request for an instant timer. This is done because timers are effectively synchronization barriers. The whole scheduling algorithm is based upon executing all the CPUs until the time when the next timer is set to fire. Setting up an instant timer is essentially asking the scheduler and the timer system to work together to bring all the CPUs up to the current instant in time before calling the function you provide it.
How does this help the interrupt signalling issue described above? Let's say that when CPU #0 decides to signal an interrupt to CPU #1, it doesn't signal it immediately. Rather, it sets an instant timer to go off. Since the current time = 0.000107143, that is when the timer is scheduled to fire. But timers don't fire until the end of the round robin sequence, and then only after all CPUs have reached that time. So even if we were at the end of the round robin, it still wouldn't be able to fire just yet because CPU #1 is still sitting back at time = 0.
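To make that concrete, here is a sketch of the two approaches side by side. Everything here (the names and the trivial timer_set() stub) is a hypothetical stand-in for illustration, not MAME's actual implementation:

```c
#include <stdio.h>

/* ---- hypothetical stand-ins for the real MAME machinery ---- */

typedef void (*timer_callback)(int param);

#define TIME_NOW 0.0

/* the real timer_set() registers a timer with the timer system; the
   scheduler then brings every CPU up to the firing time before the
   callback runs. this stub just fires immediately for demonstration. */
static void timer_set(double duration, int param, timer_callback callback)
{
    (void)duration;
    callback(param);
}

static void signal_interrupt(int cpunum)
{
    printf("interrupt delivered to CPU #%d\n", cpunum);
}

/* ---- the two approaches ---- */

/* unsynchronized: CPU #0's write handler pokes CPU #1 directly, so the
   interrupt is seen at whatever (earlier) local time CPU #1 is at */
static void irq_w_unsynchronized(int data)
{
    (void)data;
    signal_interrupt(1);
}

/* synchronized: defer the signal through an instant timer; it will only
   be delivered once every CPU has caught up to the current time */
static void irq_timer_callback(int param)
{
    (void)param;
    signal_interrupt(1);
}

static void irq_w_synchronized(int data)
{
    timer_set(TIME_NOW, data, irq_timer_callback);
}

int main(void)
{
    irq_w_unsynchronized(0);
    irq_w_synchronized(0);
    return 0;
}
```

The only functional difference is the indirection through the timer system, but that indirection is exactly what forces the scheduler to bring every CPU up to time 0.000107143 before the callback, and thus the interrupt, happens.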
We have another issue here as well. CPU #0 has just set a timer to go off at time = 0.000107143, but it also still has 600 cycles left to execute before the end of its timeslice. We could potentially allow it to complete its run, ending up at time = 0.000150857. But then when we executed CPU #1, we would only execute up until the time the timer was set to go off (at 0.000107143), and the two CPUs would be significantly out of sync. Rather than letting that happen, whenever a new timer is created during a timeslice, and it is scheduled to fire before that timeslice is up, the timer system and the scheduler work together to abort the execution of that CPU. Generally, this means that the CPU will stop executing at the end of the current instruction, and return control back to the scheduler.
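Sketched in code, that cooperation might look like this (hypothetical names and bookkeeping; in the real source this happens between the timer system and the CPU interface):

```c
#include <stdio.h>

/* hypothetical bookkeeping for the current timeslice */
static double timeslice_end  = 0.000150;   /* when this timeslice was
                                              scheduled to end */
static double next_timer_due = 1.0;        /* effectively "far away" */

/* tell the running CPU core to stop at the end of the current
   instruction and report how many cycles it actually ran */
static void abort_current_timeslice(void)
{
    printf("aborting active CPU at the end of the current instruction\n");
}

/* called whenever a timer is created mid-timeslice */
static void note_new_timer(double expire_time)
{
    if (expire_time < next_timer_due)
        next_timer_due = expire_time;

    /* if the new timer fires before this timeslice was due to end, cut
       the active CPU short so no CPU runs past the firing time */
    if (expire_time < timeslice_end)
        abort_current_timeslice();
}

int main(void)
{
    /* CPU #0 sets an instant timer 1500 cycles (107.143 usec) into a
       timeslice that was going to run to 150 usec */
    note_new_timer(0.000107143);
    return 0;
}
```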
In this case, since CPU #0 set the timer during its timeslice, and since the timer is scheduled to fire during the timeslice, execution on CPU #0 is aborted with 600 cycles left to go. The scheduler knows that CPU #0 only executed 1500 of its requested 2100 cycles, and updates the local time of CPU #0 to be 1500 / 14,000,000 = 0.000107143.
Now CPU #1 gets its chance to execute. 0.000107143 * 2,000,000 = 214.29, which rounds up to 215 cycles so that CPU #1 reaches at least the target time. We execute CPU #1 for that long; when it is finished, maybe it actually executed 217 cycles, so its local time is 0.0001085.
The global time is updated to the minimum of the CPU times, in this case 0.000107143, and the timer system is asked to process timers. At this point, the callback for the instant timer we set is called, and in that callback is where we signal the interrupt to CPU #1.
Returning to the scheduler, we look into the future and see that there is still a timer set to go off at time = 0.000150, so we compute cycles for CPU #0 (600) and execute. After CPU #0 is done, we switch to CPU #1 and start executing. CPU #1 now has an interrupt pending, but it has been signalled at the correct local time (or close to it, at 0.0001085), and synchronization between the two CPUs has been achieved.
What happens when CPU #1 tries to talk back to CPU #0? Find out in part 3, which will deal with this complex issue.
CPU Scheduling in MAME (part 1)
This entry was prompted by a post over at MAME Testers, but I wanted to post a slightly more permanent response since this issue is somewhat confusing.
Multi-CPU games in MAME are scheduled in a round-robin fashion. The order of the round robin execution is strictly defined by the order of the CPUs in the machine driver. There is no way to alter this order; however, you can affect the scheduling by suspending CPUs or adjusting the granularity of the scheduling. That kind of stuff will be discussed more in part 2.
The scheduler relies on the timer system, which knows when the next timer is scheduled. All scheduling happens between firings of timers. Similarly, timers are never fired while a CPU is executing. This is important to keep in mind.
The scheduler queries the timer system to find out when the next timer is set to fire. It then loops over each CPU, computes how many cycles that CPU needs to execute to reach that time, and runs the CPU for that many cycles. When the CPU is finished executing, the CPU core returns how many cycles it actually executed. This information is accumulated and converted back into a "local CPU time", in order to account for overshooting or early exiting from the CPU core.
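In code, the two conversions at the heart of this look something like the following simplified sketch. The names are hypothetical and it uses plain doubles, whereas MAME's internals use their own time representation:

```c
#include <math.h>
#include <stdio.h>

/* how many cycles must a CPU run to reach a target time? round up,
   since the CPU must execute at least that long */
static int time_to_cycles(double duration, double clock)
{
    return (int)ceil(duration * clock);
}

/* convert cycles actually executed back into elapsed time; this is how
   overshoot gets folded into the CPU's local time */
static double cycles_to_time(int cycles, double clock)
{
    return (double)cycles / clock;
}

int main(void)
{
    /* the numbers from the example that follows: a 14MHz CPU running
       toward a timer at 150 usec, overshooting to 2112 cycles */
    printf("%d cycles requested\n", time_to_cycles(0.000150, 14000000.0));
    printf("%.9f local time after 2112 cycles\n", cycles_to_time(2112, 14000000.0));
    return 0;
}
```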
For example....
Let's say that CPU #0 is running at 14MHz, and CPU #1 is running at 2MHz. Let's also say that we're starting at time 0 (local CPU time for both CPUs is 0), and a timer is scheduled to go off in 150 microseconds (time = 0.000150).
The round robin logic will start with CPU #0 and compute how many cycles it needs to execute to reach 0.000150. Since we're starting from time 0, we need to execute for at least 150usec. 0.000150 * 14,000,000 = 2100 cycles. It then calls that CPU's execute function with 2100 cycles; when the execute function returns, it specifies how many cycles it actually ran. Let's say it returns saying that it ran 2112 cycles. (CPU cores generally overshoot because many instructions take more than 1 cycle each to execute.) 2112 cycles puts the local CPU time for CPU #0 at 0.000150857 (2112 / 14,000,000).
Now it's time for CPU #1 to execute. 0.000150 * 2,000,000 = 300 cycles. So we call execute(300), and get back 300 cycles. CPU #1 local time is now 0.000150.
At this point, both CPUs have executed, and both their local times are greater than or equal to the target time 0.000150. So the scheduler calls the timer system to let it process the timers. When finished, it again asks when the next timer will fire. Let's say it's set to fire exactly 150usec later at time = 0.000300.
Back to the scheduler, we start the round robin over again. CPU #0 needs to execute (0.000300 - 0.000150857) * 14,000,000 = 2088 cycles to reach a local time of 0.000300. Note that we took into account the extra cycles that we executed last time. So we call execute(2088), and we get back, say, 2091. That puts our local time at 0.000150857 + 0.000149357 = 0.000300214.
Now it's CPU #1's turn. (0.000300 - 0.000150) * 2,000,000 = 300 cycles again. Calling execute(300), we get back 302 cycles. This puts CPU #1's local time at 0.000150 + 0.000151 = 0.000301.
Again, both CPUs have executed, both their local times are greater than or equal to 0.000300, so we contact the timer system to let it run its timers. This procedure continues throughout the execution of the system.
Some things to note here. After the first round robin, CPU #0's local time was slightly ahead of CPU #1's local time. After the second pass, the opposite was true. Thus, at any given moment there is no guarantee that a particular CPU is ahead of or behind the others.
Also, be sure to keep in mind that each CPU has its own local time. The timer system also has a "global time". The global time is generally the minimum of all the CPU local times. Which time is used when calling the timer system depends entirely upon which CPU context is currently active. If a CPU context is active (generally true only while a CPU is executing; for example, in read/write callbacks), then all timer operations treat the "current time" as the CPU local time, accounting for all cycles that have been executed on that CPU so far in the current timeslice. If a CPU context is not active (all other times; for example, in timer callbacks), then the "current time" is the global time.
The global time is updated at the end of each timeslice before the scheduler calls into the timer system to let it run its timers, and that global time is used to dispatch the timers.
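A simplified sketch of how that decision might be expressed (hypothetical structure; the real code resolves this through the active CPU context):

```c
#include <stdio.h>

#define NUM_CPUS 2

static int    active_cpu = -1;        /* -1 means no CPU context active */
static double local_time[NUM_CPUS];   /* per-CPU local times */
static double global_time;            /* min of all local times, updated
                                         at the end of each timeslice */

static double current_time(void)
{
    /* with a CPU context active (e.g. inside a read/write handler),
       "current time" means that CPU's local time, including cycles run
       so far in the current timeslice; otherwise (e.g. in a timer
       callback), it means the global time */
    if (active_cpu >= 0)
        return local_time[active_cpu];
    return global_time;
}

int main(void)
{
    local_time[0] = 0.000150857;
    local_time[1] = 0.000150;
    global_time   = 0.000150;

    active_cpu = 0;
    printf("inside CPU #0: %.9f\n", current_time());
    active_cpu = -1;
    printf("in a timer callback: %.9f\n", current_time());
    return 0;
}
```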
There are a number of things that this post hasn't covered yet, including early exiting, suspending CPUs, spinning and yielding. I hope to follow up with details of how those all fit into the timing scheme shortly.
DMA Woes
So I thought I had everything figured out. I found a nice depth buffer bug that fixes numerous problems in the Voodoo games (and which happened to be there in the old implementation as well). I figured out the weird alpha blending problem. And the only thing that was left was understanding why Carnevil was no longer rendering its polygons correctly relative to the background movie. I looked everywhere for the problem, and it turned out to be much more subtle than I realized.
It turns out that a number of the games use the DMA functionality of the Galileo GT64010A chipset to send data to the Voodoo. This actually makes a lot of sense. The Voodoo has a limited-size FIFO where the data goes in, and when that is full, whoever is sending the data has to stall until the Voodoo has had a chance to finish whatever it was backed up doing. Far better to let the DMA engine stall than to let the main CPU stall when it could be doing real work.
In the old implementation, I made all commands to the Voodoo complete instantly. This meant that any DMA transfer, no matter how long, finished instantaneously, even if it was rendering tons and tons of polygons or clearing the screen. In the new implementation, I changed the code to actually approximate the amount of time each polygon and screen clear takes. The problem with this is that DMAs no longer complete instantly, and this has revealed a subtle problem.
Many of the games kick off a nice long DMA to the Voodoo and let it complete asynchronously while the main CPU is off busy doing other things. The problem is, there is code in most of the games that also directly accesses the Voodoo without checking to see if there is still a DMA in progress. When this happens, new commands are inserted into the rendering queue at the wrong spot. The most obvious side effect was in Carnevil, which would write a command in the middle of a DMA that turns off depth buffering, meaning that all subsequent triangles were rendered without doing any depth comparisons.
The trick is figuring out what the behavior should be. I've tried hacking it so that if the CPU writes to the Voodoo while the DMA is in progress, then the CPU stalls until the DMA completes, and only then does the write go through. This works but introduces other glitches in the system. So now I have to come up with another theory.
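For the curious, the experiment looked conceptually like this sketch (hypothetical names throughout, and remember this is the approach that turned out to introduce glitches):

```c
#include <stdio.h>

static int dma_in_progress = 1;   /* set while the Galileo is feeding the Voodoo */

/* stand-in: consume the CPU's remaining cycles (or otherwise stall it)
   until the DMA-complete callback clears the flag */
static void stall_cpu_until_dma_done(void)
{
    dma_in_progress = 0;
}

static void voodoo_queue_command(unsigned int data)
{
    printf("command %08X queued after the DMA finished\n", data);
}

/* the experimental write handler: if the CPU touches the Voodoo while a
   DMA to it is still in flight, hold the write until the DMA completes */
static void voodoo_direct_w(unsigned int data)
{
    if (dma_in_progress)
        stall_cpu_until_dma_done();
    voodoo_queue_command(data);
}

int main(void)
{
    voodoo_direct_w(0x12345678u);
    return 0;
}
```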
On the plus side, I finally found a nice PDF of the GT64010A datasheet over at http://www.datasheetarchive.com. I'd been working off of a Google HTML document that was auto-converted from a PDF, and it just never came out quite right!