Articles posted December 2005

DCS Help

I'm looking for someone out there who has a 2D Midway game with a DCS sound system. This would include Mortal Kombat 2, Mortal Kombat 3, NBA Hangtime, WWF Wrestlemania, 2 On 2 Open Ice Challenge, or Rampage: World Tour.

Basically, I need to know how much RAM there is for the ADSP chip. If the system is on an external card, this is pretty easy, just identify the RAM chips on the card and let me know what they are. For the Wolf unit games, I think the sound system is on-board, so you'll just need to identify the smaller RAM chips (they will probably be 8kx8 or 32kx8 parts).

In fact, knowing this info for the Cruis'n and Killer Instinct games would be useful as well.

More Vegas

Been trying to figure out what's wrong with the remaining two Voodoo 2 games. They are both frustratingly puzzling.

First is War: The Final Assault. This game used to work before I added support for the ethernet chip, but hasn't been working ever since. I hacked in some code to disable the chip and sure enough, it will work without it, but I know that the board has it hooked up so I can't hack around it like that. It boots through about halfway and then just stops. It sends 3 packets out on the wire, but they are all broadcast packets and since there's nobody listening, there are no responses, nor would I expect there to be. Maybe I'll try to trace what's going on in both cases and compare the two, though tracefiles from a 250MHz RISC chip are somewhat unwieldy to scan through. On the plus side, the graphics problems from my old Voodoo emulation are gone and the game looks great when it does run.

The other non-working game, Road Burners, is much closer to working now that I have the ADSP-2181 hooked up properly. The audio section of the system now works and I can boot into attract mode. There are some serious texture glitches on the in-game part of the attract mode, but apart from that it looks pretty good. The problem is that as soon as I start the game, it just hangs waiting for something. Haven't figured that one out yet.

There's a part of me that wants to try and start on the Voodoo Banshee emulation, but I also really want to get the Voodoo 2 games working first. Plus, the Banshee requires a whole bunch more glue to support the 2D side of things, so that means faking VGA grossness. I only hope the games don't actually use the 2D stuff for anything except setting the display mode! Fortunately, the 3D section is pretty compatible with the Voodoo 2.

Viva (Las) Vegas

Been taking another look at the Vegas drivers now that the Voodoo 2 emulation is working. Gauntlet Legends was working again in 0.102u4, which is a good sign. A couple more fixes and Gauntlet Dark Legacy was working again — at least as well as it had been before, which is to say fine until you actually got in-game, at which point all the environment graphics were drawn all-black.

Looking into this problem a bit more revealed that the Dark Legacy engine had added light maps to the rendering, using a multiplication between two textures to produce the final result. The Voodoo rasterizers support this just fine, but the game was relying on a mode new to the Voodoo 2 that specifically selects the texture color rather than a color computed through other means. Since this mode didn't exist in the original Voodoo, the code was just using '0' for this case and producing a black background. Adding support for this made the graphics appear. Yay!

Next task was to figure out why Tenth Degree no longer worked.

It used to, and I was sure it was due to the Voodoo rewrite. I spent quite a bit of time looking into that, assuming I was returning an incorrect value from the status register or something. Turns out I was completely wrong. Instead, an "optimization" I had made to the MIPS dynarec core a while back had a subtle side effect. The problem came down to this particular opcode:

     slti   r2,r3,$1

In the old dynarec core, that was translated as:

     mov   eax,[r3.lo]
     mov   edx,[r3.hi]
     sub   eax,1
     sbb   edx,0
     shr   edx,31
     mov   [r2.hi],0
     mov   [r2.lo],edx

The optimization I added was to convert code that subtracted 1 from a register to use the dec opcode instead, since it is more compact. So the new code was:
     mov   eax,[r3.lo]
     mov   edx,[r3.hi]
     dec   eax
     sbb   edx,0
     shr   edx,31
     mov   [r2.hi],0
     mov   [r2.lo],edx

which is 4 bytes smaller, taking up less instruction/trace cache space. Multiply this over a lot of generated code and it has an impact. The problem is that dec eax is not quite the same as sub eax,1. Specifically, dec leaves the carry flag untouched, so the sbb instruction that followed never saw the borrow from the low-word subtraction and instead used whatever stale carry value happened to be lying around, messing up the math.
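
To see the failure concretely, here's a small standalone C sketch (my own illustration, not anything from the MAME source) that mimics both translations using 32-bit halves. The only difference between the two is whether the high word honors the borrow coming out of the low-word subtraction:

     #include <stdio.h>
     #include <stdint.h>

     /* "sub eax,1 / sbb edx,0": the high word honors the borrow out of the low word. */
     static uint32_t slti_one_sub(uint64_t rs)
     {
         uint32_t lo = (uint32_t)rs, hi = (uint32_t)(rs >> 32);
         uint32_t borrow = (lo < 1);   /* carry out of the low-word subtract */
         lo -= 1;
         hi -= borrow;
         return hi >> 31;              /* sign bit of (rs - 1) */
     }

     /* "dec eax / sbb edx,0": dec leaves the carry flag alone, so the high word
        subtracts whatever stale carry happened to be lying around instead. */
     static uint32_t slti_one_dec(uint64_t rs, uint32_t stale_carry)
     {
         uint32_t lo = (uint32_t)rs, hi = (uint32_t)(rs >> 32);
         lo -= 1;
         hi -= stale_carry;
         return hi >> 31;
     }

     int main(void)
     {
         /* rs == 0: slti r2,r3,$1 should produce 1 (0 < 1), but with a stale
            carry of 0 the dec version never borrows and produces 0. */
         printf("sub/sbb: %u   dec/sbb: %u\n",
                (unsigned)slti_one_sub(0), (unsigned)slti_one_dec(0, 0));
         return 0;
     }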

So with that, Tenth Degree is working again, and better than ever. One thing I discovered in my recent research is that if certain values (red, green, blue, alpha, Z, and W) overflow during triangle rasterization, they are allowed to wrap in a slightly odd fashion. For example, if the red component increases from $FF to $100 to $101 over the course of several pixels, you would expect it to wrap from $FF to $00 to $01. But the internal microcode in the Voodoo actually checks explicitly for $100 and clamps it to $FF, while allowing $101 to wrap to $01. So what you actually get is a transition from $FF to $FF to $01.

Why is this important? Well, let's say you are drawing a triangle such that the leftmost pixel has a red value of 0.0 and the rightmost pixel has a red value of 1.0. Converting these values to integers between 0-255 means the left value is $00 and the right value is $100. If the Voodoo allowed simple wrapping, that last pixel would be drawn as $00, showing up as an ugly black pixel right next to a bright red one. The simple clamping logic allows for a bit of slop of 1 on either the high or low side without producing artifacts.
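
Here's roughly what that rule looks like for a single 8-bit channel, written out as C. This is just an illustration of the behavior described above, not the actual Voodoo microcode, and the low-side check is an assumption based on the "slop of 1 on either side" observation:

     /* Clamp/wrap rule for one 8-bit color channel, assuming the iterator only
        ever overshoots by a single count in either direction. */
     static int clamp_wrap_channel(int value)
     {
         if (value == 0x100)       /* exactly one over the top: clamp to $FF */
             return 0xff;
         if (value == -1)          /* assumed symmetric case: one under zero clamps to $00 */
             return 0x00;
         return value & 0xff;      /* everything else wraps, e.g. $101 -> $01 */
     }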

The upshot is that if you run an old build of MAME and look at Tenth Degree, you'll see lots of artifacts — unsightly black or white pixels that shouldn't be there. Implementing this clamping logic turns out to fix these problems. Mace: The Dark Age also suffered from the same problem to a lesser degree. I bet the Tenth Degree engine was based off of the Mace engine.

CPU Scheduling in MAME (part 4)

Part 3 of this series discussed the problems involved in scheduling communication from a "later" CPU (one scheduled later in the round-robin order) to an "earlier" CPU. Since the MAME scheduler does not support changing the order of CPU execution within the round-robin, the only way to improve latency is to increase the interleave factor, either globally or temporarily.

There are two other means of altering the scheduling of CPUs during execution. These are the cpu_yield() and cpu_spin() calls. Both of these methods have been abused often in the past due to a lack of understanding about how they actually work, so now is the time to set the record straight.

What cpu_yield() does is end the current CPU's timeslice early. It does not affect the execution of any other CPUs in the system.

Let's look at an example.

Again we'll say that CPU #0 is running at 14MHz, and CPU #1 is running at 2MHz. There is a scheduled timer that is set to fire at time 0.000150. So we begin executing CPU #0 for 2100 cycles, but this time, partway through its timeslice (say, 1250 cycles in), one of its read/write handlers calls cpu_yield(). This aborts the current timeslice, leaving CPU #0's local time set to 0.000089286.

Now CPU #1 gets its turn to execute. Normally it would have executed for its entire timeslice up to time 0.000150; however, since the previous CPU stopped early, the scheduler only schedules up to the time when the previous CPU stopped execution. This equates to 0.000089286 * 2,000,000 = 179 cycles. Let's say it executes for 180 cycles, giving it a local time of 0.00009.

At this point, we tell the timer system that the global time is 0.000089286, but there are no timers ready to fire until 0.000150, so nothing happens, and the round robin begins again.

So far so good, but there is an additional side-effect that is not immediately obvious: after a cpu_yield(), the CPU is descheduled until the next time the interleave timer fires. (Recall from the last part of this article that the interleave value causes a timer to be scheduled at the requested interleave rate.) This means that in future scheduling rounds, CPU #0 will not participate, until the interleave timer fires and allows it once again.

In fact, cpu_yield() belongs to a class of synchronization calls: cpu_yielduntil_trigger(), cpu_yielduntil_int(), cpu_yielduntil_time(). All of these perform the same basic operation as cpu_yield() — that is, they stop execution on the current CPU and remove it from scheduling — but each specifies a different event that will enable the CPU to be scheduled again. This descheduling of the CPU has some interesting consequences.
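
As a purely hypothetical example of where a driver might use one of these, here's a read handler in the 2005-era driver style that gives up its timeslice for 50 microseconds when the game reads a shared status word (the same duration used in the walkthrough below). The handler and variable names are made up; cpu_yielduntil_time() and TIME_IN_USEC() are the real calls being shown:

     #include "driver.h"      /* 2005-era MAME driver header */

     static UINT16 shared_status;

     /* Hypothetical shared-status read: the other CPU will update this value
        shortly, so deschedule ourselves for ~50us rather than repeatedly
        reading stale data. */
     static READ16_HANDLER( shared_status_r )
     {
         cpu_yielduntil_time(TIME_IN_USEC(50));
         return shared_status;
     }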

Let's look at the previous example again, with this additional knowledge in hand. To make things a little simpler to explain, we'll change the situation so that instead of calling cpu_yield(), we will call cpu_yielduntil_time(0.00005). This is essentially telling the scheduler to not only give up our timeslice, but remove us from the scheduling equation altogether for the next 50 microseconds. So:

CPU #0 executes as before, ending its timeslice early at time 0.000089286 by calling cpu_yielduntil_time(0.00005). This aborts its timeslice and also internally sets a timer to go off at the current local time (0.000089286) plus 0.00005 seconds, or at time 0.000139286.

Then CPU #1 executes up to the time when the previous CPU stopped execution, which is time 0.000089286. This is again 179 cycles, so we run the CPU, and it comes back claiming 180 cycles, making its local time now 0.00009.

The timer system is called, but nothing is ready to fire, so the round-robin starts over. This time, when we ask the timer system when the next timer is set to go off, it reports 0.000139286, due to the timer that was set in response to cpu_yielduntil_time().

Now the round-robin begins again, except that CPU #0 is completely removed from the scheduling, so we skip right to CPU #1. Computing cycles: ((0.000139286 - 0.00009) * 2,000,000) = 99 cycles. We run CPU #1 with that request, and get back 101 cycles, putting its local time at 0.0001405, and the round-robin ends.

At this point, the timer set in cpu_yielduntil_time() fires, and it enables CPU #0 to be scheduled for the next round. However, notice that the two CPUs are now significantly out of sync with respect to each other. CPU #0 is still stuck back at local time 0.000089286, while CPU #1 is already at 0.0001405. There is still a timer set to go off at 0.000150, so that will be used as the target for the next round-robin.

For CPU #0, that next pass at execution will run 0.000150 - 0.000089286 = 0.000060714 seconds, or 850 cycles, which is much longer than normal because it needs to execute more cycles to catch up to the target time. CPU #1, on the other hand, is almost at the target time, and only needs to execute 19 cycles to reach the target.

So the big side-effect of using cpu_yielduntil_time() in this manner is that you cause the yielding CPU to get behind in the scheduling process. Repeated use of the yield calls can put the CPU farther and farther behind, which can result in some seriously wacky behavior. For example, since CPU #0 is starting out at a time significantly behind CPU #1, if it sets a timer, that timer could already be in the past relative to CPU #1.

Analogous to the cpu_yield() calls are the cpu_spin() calls. These calls operate exactly the same as their yield counterparts, except that the local time of a spinning CPU is automatically bumped to the current global time at the end of each timeslice. This means that the CPU doesn't fall behind; rather, it effectively "burns" the remaining cycles, and the spinning CPU never gets them back.

The thing that is confusing to people is this: Yielding is a form of synchronization. Spinning is a hack. Even though they look similar, they are used for two very different purposes.

There is really only one legitimate reason for spinning, and that is to add spin loop optimizations to games that are doing no useful work while waiting for some event. If the event happens at some known time in the future, use cpu_spinuntil_time(). If that event is an interrupt, use cpu_spinuntil_int(). If that event is some other externally-driven factor, use cpu_spinuntil_trigger(), and then call cpu_trigger() when the event occurs.
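
For example, a status port that a game polls in a tight loop until the next interrupt might be handled something like this (the handler and variable names are made up; cpu_spinuntil_int() is the real call):

     static UINT8 vblank_status;

     /* The game sits in a loop reading this port until the next interrupt arrives,
        so burn the remaining cycles instead of emulating the idle loop. */
     static READ8_HANDLER( vblank_status_r )
     {
         cpu_spinuntil_int();
         return vblank_status;
     }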

If you find yourself using a cpu_spin() call for synchronization, you are masking some other problem in the emulation system.

Finally, a word about triggers. Triggers are simply a signalling mechanism that is used to indicate that a particular event has occurred. A trigger is identified by an integer, which is essentially a random number that is chosen by the person creating/using the trigger. There is no collision detection or means of allocating them (though there probably should be). To signal a trigger, you simply call cpu_trigger() with the trigger identifier. Triggers are mostly used in conjunction with the yield and spin calls to block execution of a CPU until the trigger is signalled. It's really no more complicated than that.
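
To make the flow concrete, here's a hypothetical sound-command handshake in the same 2005-era driver style. The trigger ID, handler names, and variables are all invented; cpu_spinuntil_trigger() and cpu_trigger() are the calls being illustrated:

     #define SOUND_CMD_TRIGGER  7001     /* arbitrary integer chosen for this trigger */

     static UINT8 sound_command;
     static UINT8 sound_command_pending;

     /* Sound CPU side: if no command is waiting, burn cycles until the main CPU
        signals the trigger; the game's polling loop re-reads once we wake up. */
     static READ8_HANDLER( sound_command_r )
     {
         if (!sound_command_pending)
             cpu_spinuntil_trigger(SOUND_CMD_TRIGGER);
         sound_command_pending = 0;
         return sound_command;
     }

     /* Main CPU side: latch the command and wake the spinning sound CPU. */
     static WRITE8_HANDLER( sound_command_w )
     {
         sound_command = data;
         sound_command_pending = 1;
         cpu_trigger(SOUND_CMD_TRIGGER);
     }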

And that concludes my discussion of CPU scheduling in MAME. If there are significant questions, I may follow up a little later with a part 5 to answer some of the remaining questions. Thanks for hanging in there — I realize a lot of this can be difficult to follow!

(Edited to fix incorrect information about cpu_yield()).

CPU Scheduling in MAME (part 3)

Part 2 of this article discussed the cooperation between the scheduler and timers that enables events to be synchronized between multiple CPUs. In short, when a CPU needs to signal an event to another CPU, it sets an "instant" timer which causes all CPUs to execute up to that point in time. Then, the timer callback is fired and the event can be signalled safely and accurately.
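
In driver code, that idiom looks something like the following sketch. The handler, callback, and latch names are invented, and the exact signatures of timer_set(), TIME_NOW, and the input-line call are my best recollection of the 2005-era APIs rather than gospel:

     #include "driver.h"

     static int command_latch;

     /* Timer callback: by the time this fires, every CPU has caught up to the
        moment the command was written, so it is safe to touch CPU #1. */
     static void deliver_command(int param)
     {
         command_latch = param;
         cpunum_set_input_line(1, 0, HOLD_LINE);   /* assert CPU #1's IRQ 0 (assumed call) */
     }

     /* CPU #0 side: rather than signalling CPU #1 directly, schedule a zero-length
        timer that fires once all CPUs have executed up to the current time. */
     static WRITE8_HANDLER( command_w )
     {
         timer_set(TIME_NOW, data, deliver_command);
     }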

The big problem with this approach is that it really only works well if the target CPU has a local time that is less than the current time. The example we looked at last time had CPU #0 signalling an interrupt to CPU #1. Since the round-robin ordering causes CPU #0 to execute first, we were (pretty much) guaranteed that when CPU #0 wanted to issue its signal, its local time would be greater than the local time of CPU #1.

But what happens when CPU #1 wants to send a signal back to CPU #0?

Taking the naive approach, and returning back to our previous example, let's say that we have CPU #0 running at 14MHz, and CPU #1 running at 2MHz. There is a scheduled timer that is set to fire in 150usecs, or at time 0.000150. So we begin executing CPU #0 for 2100 cycles, and it completes its timeslice normally, returning 2112 cycles, meaning its local time ends up being 0.000150857.

Now we execute CPU #1 for its timeslice, which turns out to be 300 cycles as in the example from Part 1. However, this time, at 50 cycles into execution, CPU #1 decides that it needs to send a signal to CPU #0. So, rather than sending the signal directly, it sets an instant timer to go off at the current time, 50 / 2,000,000 = 0.000025. This also has the side effect of aborting the execution of CPU #1 and ending the current round robin.

So, at this point, the scheduler contacts the timer system and indicates that the global time should be updated to the minimum of all the CPUs' local times, which would be 0.000025. The timer system sees that there is a timer scheduled to go off at 0.000025, and fires it; in the callback, we send our signal to CPU #0.

But wait, isn't CPU #0's local time already way off in the future at 0.000150857? Yep. This means that the signal arrives much later than it should have (0.000150857 - 0.000025 = 0.000125857 seconds, or 1762 CPU cycles too late), and we have lost our synchronization.

How do we solve this problem? Well, we could have swapped the execution order, making the scheduler execute CPU #1 first. But in order to do that, we would have needed to predict the future once again. If the communication details between two CPUs are well-understood and follow strict rules, then it might be possible to make this sort of fine-grain scheduler tweaking work. Up to now, however, there has not been a good case made for running out-of-order like this. And so the round-robin order remains fixed.

Traditionally in MAME — in fact, even before I ever wrote the timer system and the scheduler — the way this sort of communication issue was resolved was to increase the interleave factor between CPUs. The interleave factor is a number that indicates how many times per video frame the CPUs in MAME are forced to re-synchronize their execution times. (This was specified in terms of video frames originally because all timing in MAME was done relative to video frames before the timers existed.)

The interleave factor is specified globally in MAME's machine driver structure. It is implemented by computing how many synchronizations per second it implies (for example, an interleave of 100 with a game that runs at 60Hz implies 6000 synchronizations per second) and simply setting a timer with a NULL callback to go off at that rate. No callback is needed because no action needs to be performed; rather, the mere existence of a timer firing at that rate effectively brings all CPUs into sync at that frequency.
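
A machine driver fragment in the 2005-era MDRV style might declare it like this (the CPU types and clocks are invented to match the running example, and the memory maps and video section are omitted):

     static MACHINE_DRIVER_START( example )
         MDRV_CPU_ADD(M68000, 14000000)      /* "CPU #0" at 14MHz */
         MDRV_CPU_ADD(Z80, 2000000)          /* "CPU #1" at 2MHz */

         MDRV_FRAMES_PER_SECOND(60)
         MDRV_INTERLEAVE(100)                /* 100 syncs/frame * 60 frames/sec = 6000 syncs per second */
     MACHINE_DRIVER_END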

Back to our example: let's say our game runs at a 60Hz frame rate, and we bump the interleave factor to 500. That ensures 500 * 60 = 30,000 synchronizations per second, or one every 0.000033333 seconds. This means that there will be a timer set to fire every 0.000033333 seconds like clockwork. So let's re-evaluate what happens and why the interleave improves things.

Remember that the timer system figures out when the first timer is set to fire. Previously, our first timer was going to go off at 0.000150, but now we have this interleave timer which is going to go off even earlier at 0.000033333, so that determines what our first timeslice will be. Taking 0.000033333 * 14,000,000 gives us 467 cycles, so we execute CPU #0 for 467 cycles. Let's say it comes back having executed 470 cycles. That puts our local time at 0.000033571.

Now we execute CPU #1 for its timeslice, which turns out to be 67 cycles with the new timer in place. Again, at 50 cycles into execution, CPU #1 decides that it needs to send a signal to CPU #0. So we set an instant timer to go off immediately, at time = 0.000025. This ends the round robin, and the timer system is notified just as before.

This time, however, CPU #0 receives the signal at 0.000033571, which is only 0.000008571 seconds or 120 CPU cycles too late. This is a big improvement over 1762 cycles, but it's still not perfect. By increasing the interleave we could make it even better if we wanted. In fact, the interleave factor effectively determines the worst case latency for a signal from one CPU to another.

Could we make it perfect? Well, actually, we could. If we set up a timer to run at a frequency of 2,000,000 times per second (the clock speed of the 2nd fastest CPU), then we would get as close as possible to perfect interleave. CPU #0 would never execute longer than a single clock on CPU #1, so when CPU #1 sent a signal, it would hit CPU #0 at the same time as the end of that particular clock cycle on CPU #1.

To set the interleave that high would require specifying an interleave of 33333 in the game's machine driver. Try it sometime. Things get very very slow. This is because a context switch between two CPUs is not free, and when you try to set up a timer to run that frequently, you spend all your time context switching and very little time actually executing any code on the CPUs.

The ideal solution to this is to detect when it is likely that CPU #1 needs to signal CPU #0, and temporarily boost the interleave so that, at least for a while, synchronization is guaranteed. This is the purpose of the cpu_boost_interleave call. It takes two parameters. The first parameter is how frequently the timer should fire — note that it is not specified in terms of video frames, but rather as an absolute time. You can also pass 0 here, which causes the system to automatically pick the clock rate of the 2nd fastest CPU, which will give you ideal synchronization. The second parameter specifies how long, in seconds, you want to maintain this level of interleave. Generally, you don't want it on too long.

Most commonly, games in MAME are set up so that "master" CPUs are run first, and communication tends to go from earlier CPUs to later ones in the round-robin order. Interleave boosting is used in these cases when the "slave" CPUs need to send some information back, and the master is sitting there waiting for a response.
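
A hypothetical slave-side handler using it might look like this (the names and the 50-microsecond boost window are invented; cpu_boost_interleave() and TIME_IN_USEC() are the calls being shown):

     static UINT16 response_latch;

     /* Slave CPU is handing a result back to the waiting master: a first parameter
        of 0 syncs at the 2nd-fastest CPU's clock, held for the next 50 microseconds. */
     static WRITE16_HANDLER( slave_response_w )
     {
         cpu_boost_interleave(0, TIME_IN_USEC(50));
         COMBINE_DATA(&response_latch);
     }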

Keep in mind that none of these systems are perfect, yet they have been successfully used for many thousands of different platforms. In Part 4, I'll wrap this series up with a bit about spinning, yielding, and triggers.