
The Universal Platform, Part 2

In the previous article, I described how the Galaxian video hardware was designed around the concepts of a tilemap and sprites. This article goes into more detail about how the hardware renders the tilemap, and where column scrolling fits in.

To recap: a tilemap is a two-dimensional array of memory that describes how the video system displays the screen. Each "tile" in a tilemap is a fixed size (traditionally 8x8 pixels) and so to cover a 256x256 pixel screen, you need an array of 32x32 tiles.
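
To make that concrete, here is a minimal sketch (the names and constants are mine, not the hardware's) of the tilemap as a data structure:

    /* A 32x32 tilemap covering a 256x256 pixel screen.                   */
    /* Each entry holds the index of the 8x8 tile displayed in that cell. */
    #define SCREEN_WIDTH  256
    #define SCREEN_HEIGHT 256
    #define TILE_SIZE     8
    #define MAP_WIDTH     (SCREEN_WIDTH / TILE_SIZE)     /* 32 */
    #define MAP_HEIGHT    (SCREEN_HEIGHT / TILE_SIZE)    /* 32 */

    static unsigned char tilemap_ram[MAP_HEIGHT][MAP_WIDTH];   /* 1,024 tile indices */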

Now, on the Galaxian video hardware, the visible area of the screen is actually smaller than 256x256 — some of the top and bottom pixels are "blanked" to reduce the overall screen height to 224 pixels. This doesn't affect the underlying tilemap, which is still 32x32 tiles. But it does mean that some of those tiles are not visible.

Now think about how the video signal is transmitted to the monitor. First, keep in mind that the video signal is generated one row at a time from top to bottom, left to right, in order. This means that in order to make the tilemap visible on the screen, the video hardware must, at each pixel location, look up which tile is specified by the tilemap RAM corresponding to that pixel (this is known as the tile index). Once it knows which tile index to display, it then must look up the actual tile graphics data in the tile ROMs, and output the appropriate pixel from the 8x8 tile graphics.
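
In C-like pseudocode, the brute-force version of that lookup might look something like this (tile_rom_pixel and output_pixel are hypothetical helpers standing in for the ROM lookup and the video output; tilemap_ram comes from the sketch above):

    /* Illustrative only: the "do everything at every pixel" approach. */
    for (int y = 0; y < SCREEN_HEIGHT; y++)
        for (int x = 0; x < SCREEN_WIDTH; x++)
        {
            int tile_index = tilemap_ram[y / 8][x / 8];            /* which tile covers this pixel? */
            int color = tile_rom_pixel(tile_index, y % 8, x % 8);  /* 2-bit color within that tile  */
            output_pixel(x, y, color);
        }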

This sounds like a lot of work, and it is. In fact, it's too much work to really do all of that on each pixel. So the hardware was optimized to be able to do it in realtime, by making some fundamental assumptions.

On Galaxian hardware, each pixel of each tile can be one of four colors, requiring two bits of ROM data. The data for graphics can be stored in many different ways, but on Galaxian, they opted for a logical arrangement where each bit is stored in a separate ROM, and the bits are ordered in the same left-to-right, top-to-bottom order that the screen is rendered. The advantage of storing the graphics this way is that each row of tile graphics (remember, the tiles are 8x8 pixels) is exactly 8 bits wide, or one byte. And you can read one byte from a ROM all at once.
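
Given that layout, fetching one 8-pixel row of a tile is just a matter of reading one byte from each of the two bitplane ROMs and pairing up the bits. A rough sketch, with the caveat that the exact addressing, and whether the leftmost pixel lives in bit 7 or bit 0, are assumptions on my part:

    /* Two bitplane ROMs; each tile occupies 8 consecutive bytes in each, one per row. */
    extern unsigned char tile_rom_plane0[];   /* low bit of each pixel  */
    extern unsigned char tile_rom_plane1[];   /* high bit of each pixel */

    void fetch_tile_row(int tile_index, int row, unsigned char *plane0, unsigned char *plane1)
    {
        int offset = tile_index * 8 + row;    /* 8 bytes per tile, one byte per row */
        *plane0 = tile_rom_plane0[offset];
        *plane1 = tile_rom_plane1[offset];
    }

    /* the 2-bit color of pixel 'col' (0 = leftmost) within a fetched row */
    int row_pixel(unsigned char plane0, unsigned char plane1, int col)
    {
        int bit0 = (plane0 >> (7 - col)) & 1;
        int bit1 = (plane1 >> (7 - col)) & 1;
        return (bit1 << 1) | bit0;
    }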

Thus, on each scanline, the Galaxian tilemap hardware gets away with only looking up the tile index from the 32x32 tilemap RAM once every 8 pixels. It then uses that tile index to look up the appropriate row of tile graphics data once every 8 pixels, reading an entire row of data from the tile ROMs in one operation. Then, over the course of the next 8 pixels, it spools the data out one bit at a time from an internal data store. While it is doing that, it is also taking the time to figure out which tile is coming next so that it can immediately start outputting that tile once the current tile is finished.
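
Putting the pieces together, one scanline looks roughly like the loop below: one tilemap read and one ROM read at the start of each 8-pixel group, with the fetched bytes spooled out bit by bit in between. This reuses the hypothetical helpers from the previous sketch and, for simplicity, ignores the prefetch of the next tile that the real hardware performs while the current one is being shifted out:

    void render_scanline(int y)
    {
        unsigned char plane0 = 0, plane1 = 0;

        for (int x = 0; x < SCREEN_WIDTH; x++)
        {
            if ((x & 7) == 0)   /* start of a new 8-pixel group: do the two lookups */
            {
                int tile_index = tilemap_ram[y / 8][x / 8];
                fetch_tile_row(tile_index, y % 8, &plane0, &plane1);
            }
            output_pixel(x, y, row_pixel(plane0, plane1, x & 7));
        }
    }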

So, we are down to only needing to read tilemap RAM once every 8 pixels as we work our way across a given scanline. Further, we only need to look up the tile graphics once every 8 pixels as well. In order to do this, we compute which entry in tilemap RAM to look up by dividing the X and Y coordinates by 8 and rounding down, so that we fetch the tilemap entry at (X/8, Y/8). Once we read the tile index from RAM, we then need to look up the tile graphics from the appropriate row in the tile ROMs. The row number is just the remainder from dividing Y by 8. For example, if we are at pixel location (48, 17), then we would fetch the tilemap entry at (48/8, 17/8), or (6, 2). And we would fetch tile graphics from row number (remainder(17/8)) = 1 of the tile.

Because of the way this works, the hardware designers realized that it was very easy to allow you to specify a number (let's call it the vertical scroll value) to be added to the value of Y before looking up the tile index and tile graphics. Doing this adds just a little bit of hardware, but gives you the ability to control the vertical scrolling of the tilemap. Take the example above again, but this time, let's add a vertical scroll value of 1 to all the Y coordinates. We are still at pixel location (48, 17) on the screen, but we will add 1 to 17 and use Y=18 in our calculations. So we look up the tilemap entry at (48/8, 18/8), or (6, 2) yet again — same as last time. But when we compute the remainder of 18/8, we get 2 instead of 1, meaning that we will display row 2 of the tile. This effectively shifts the tilemap upwards by 1 pixel.

The first question you might ask is, what happens when you have Y=255 and you add 1? You will end up with Y=256 and then where will you access the data for the tilemap? Without going into the finer details, the answer is that the value of 256 "wraps" around back to 0. Thus, if you slowly increase the vertical scroll value by 1 each frame, the screen will scroll upwards one pixel each frame and you will eventually see what used to be at the top of the screen appear at the bottom. This is due to the wrapping effect, where values of 256 and above have 256 subtracted from them. Because of this wrapping, it is often true that you don't want to have the entire tilemap visible, because once you scroll, you will immediately see the top appear at the bottom. This is why it is nice to have the extra non-visible tiles in the blanked out region of the screen.
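
The whole calculation, scroll and wrap included, boils down to a couple of lines slotted into the scanline loop above; the comments retrace the (48, 17) example:

    int yeff = (y + vscroll) & 0xff;   /* the "& 0xff" is the wrap: 255 + 1 comes back around to 0 */

    /* x = 48, y = 17, vscroll = 0:  entry (48/8, 17/8) = (6, 2), graphics row 17 % 8 = 1            */
    /* x = 48, y = 17, vscroll = 1:  yeff = 18, entry still (6, 2), graphics row 18 % 8 = 2          */
    int tile_index = tilemap_ram[yeff / 8][x / 8];
    int gfx_row = yeff % 8;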

Now, having a vertical scroll value for the whole tilemap is pretty nifty, but on Galaxian they went a step further, and allowed you to specify a different vertical scroll value for each group of 8 pixels across. The reason this works well is clear if you look once again at what the video hardware is doing. It has to look up a new tilemap index and new tile graphics for each 8 pixel group as it scans horizontally across a scanline. Since it has to do these computations each time anyway, it doesn't add much complexity to pick a different vertical scroll value for each group, and it gives you a lot of flexibility. This is called "column scroll" because each column of the tilemap can have its own independent scroll value.
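
Mechanically, that just means the scroll value in the sketch above comes from a small table indexed by the column (X/8) instead of from a single register. Something like:

    unsigned char column_scroll[MAP_WIDTH];           /* 32 entries, one per 8-pixel column */

    int col  = x / 8;
    int yeff = (y + column_scroll[col]) & 0xff;       /* per-column effective Y */
    int tile_index = tilemap_ram[yeff / 8][col];
    int gfx_row    = yeff % 8;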

However, even column scroll on a tilemap doesn't really provide enough flexibility for a complete game. So, in part 3 of this article, I'll dive into the sprites.

The Universal Platform, Part 1

It's interesting looking back at some of the early arcade game platforms and realizing just how much they have in common. Certainly, Taito's classic Space Invaders hardware was a very popular generic platform which was adopted, copied, hacked, and modified to support a large number of different games. Following on the heels of that success, Namco's Galaxian hardware design became its heir apparent.

On the surface, both Space Invaders and Galaxian share a number of similarities. Both games were vertically oriented space shoot-em-ups. They both had a large number of invaders approaching from above, raining bombs on the player. They both had player ships that could only move left or right, and fired missiles up at the incoming invaders. Certainly when I was a kid and knew nothing about the hardware involved, I thought of Galaxian as a sort of super Space Invaders, since the Galaxians would actually swoop down to attack you, which seemed like a neat trick.

From a hardware perspective, however, the two platforms were quite different. Sure, they both used the Zilog Z80 to power the gameplay (technically, Space Invaders was powered by an Intel 8080 CPU, which was the Z80's architectural starting point). But the video architecture was very different between the two.

Space Invaders was based on a very computer-like concept of a bitmapped display, where each pixel was carefully drawn one at a time by the main CPU. This had some nice advantages in that the game code had full flexibility on what to draw where. It also had the disadvantage that it required a lot of work to erase and redraw the player and enemy graphics each frame. If you ever look closely at the invaders, you will notice that they don't smoothly move all at once, but rather you can watch a ripple effect happen as they are updated in columns. This is because there wasn't enough CPU power to erase and redraw all those invaders on each frame. Keep in mind that a typical arcade game is approximately 256 pixels by 256 pixels, so even just clearing the screen requires 256x256 = 65,536 operations. If each operation takes 4 clock cycles, then that is about 1/4 of a second on a 1 MHz CPU! Now, in reality it's not quite that bad: because the display is black & white, each pixel only requires 1 bit, so you can pack 8 of them into one byte and reduce the number of operations by a factor of 8. Even so, 1/32 of a second is still two whole video frames.

In contrast, instead of a bitmapped display, Galaxian's video was built out of tilemaps and sprites, a video design that eventually became the norm for arcade games throughout the 80's and 90's. A tilemap essentially divides the screen area into a number of 8 pixel by 8 pixel squares. Rather than being able to draw each pixel independently, the game CPU instead chooses which one of 256 possible pre-designed 8x8 graphics tiles will appear in each square. These graphics tiles are stored in a separate ROM from the game code. Of course, one of the tiles is always a blank square, so to clear the screen, you would need to set each of the tiles to display the blank square. Taking the Space Invaders example above, a game with 256 by 256 pixels would need a 32 by 32 tilemap to cover that area with 8x8 graphics tiles. To clear the screen (i.e., set each tile to display the blank square), it requires 32x32 = 1,024 operations. At 4 cycles per operation, clearing the screen happens in a brisk 1/250th of a second for a 1 MHz CPU.
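
For the curious, the back-of-the-envelope arithmetic behind both of those numbers, assuming the same hypothetical 4 cycles per write on a 1 MHz CPU:

    bitmapped:  (256 x 256) / 8 pixels per byte = 8,192 writes
                8,192 x 4 cycles = 32,768 cycles, or about 1/30 of a second at 1 MHz
                (roughly the 1/32 quoted above; about two 60 Hz video frames either way)

    tilemapped: 32 x 32 = 1,024 writes
                1,024 x 4 cycles = 4,096 cycles, or roughly 1/250 of a second at 1 MHz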

Of course, you lose a lot of flexibility when switching from a bitmapped display to a tilemapped display, but you gain a lot in speed. The biggest limitation of tilemaps is that you can only position your tiles on 8-pixel boundaries. This makes it hard to smoothly animate anything because you can only move objects across the screen 8 pixels at a time, which is very chunky and distracting. To overcome these limitations, the Galaxian hardware included two additional features that gave the designers the flexibility they needed to produce smooth animation: column scrolling and sprites. I'll talk about those in Part 2.

Slaying the Beast

It is a thing of legend, known for years, and yet never confronted. It has grown out of control ever since the earliest days of 1997. It has branched and forked, taking over dozens of source files. Years have passed and many features have been added to ease its burdens. And yet, this code remained largely as it was since its inception.

Occasional brave adventurers have stepped in from time to time, seeking to add yet another game to its hideous bloat. But they all quickly recognized the horror and futility, frantically ignoring similarities to other games, instead just focusing on getting in and getting out as quickly as possible.

And so the behemoth grew.

But something stirred within me. I noticed many games using colortables needlessly. And that included every single game intertwined with this terrible monster. The thought occurred to me to fix up the video timing as well. Just a simple change, really.

And then I saw stars. And they looked kind of like the stars of Astrocade, except done badly. Consulting the schematics revealed fundamental flaws in the driver's video design. The potential impact was too great! What should I do? Turn away yet again? Leave it for a distant future, or another soul braver than I?

No. Not this time.

Schematics in hand, years of history against me, I stand now to face this beast. And I will make it clean.

Yes, the foul galaxianscramblefroggeramidarscobra beast will be tamed. Or I will die trying.

Wish me luck.

Good News/Bad News

Having recently touched all of the TMS34010-based games led me to start looking into some long-standing issues with some of the Williams/Midway Z/Y/X/Wolf-unit games. The first thing to do was to add save state support to them, because playing Revolution X more than once to get to the place where the video craps out is not for the faint of heart. This was relatively straightforward, and also had the side benefit of making some useful data available in the debugger (such as the local_videoram array and the dma_registers array).


Side note: any array or pointer you register with state_save_register_global_array/_pointer automatically becomes a viewable option in a memory window in the debugger. This is very handy if you need to expose any internal state.
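
As a concrete illustration, registering that kind of state looks something like the sketch below. Treat it as a sketch only: the variable names come from the post above, but the sizes and the surrounding code are guesses on my part rather than a copy of the actual driver.

    static UINT16 *local_videoram;      /* allocated elsewhere at driver init */
    static UINT16 dma_registers[32];    /* size here is a guess               */

    static void register_save_state(void)
    {
        /* a statically-sized array can be registered directly...              */
        state_save_register_global_array(dma_registers);

        /* ...while dynamically allocated memory is registered as a pointer    */
        /* plus an element count (the count here is purely illustrative)       */
        state_save_register_global_pointer(local_videoram, 0x80000 / sizeof(UINT16));

        /* both now show up as viewable regions in the debugger's memory windows */
    }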

Anyway, it turned out that the Revolution X bug was caused by over-aggressive masking in the blitter code. It was masking out one too many bits in the clipping window and this led to the right edge of the clip window being less than the left edge, effectively clipping out everything during the blit. Hence the black screen.

Save states also helped find the Rampage World Tour bug that led to a hang. It turns out that there was code that would read the wrong status register in the blitter chip to see if it was done. The top bit is the important one, and it happened to be 0 most of the time, so most of these incorrect checks happened to work. Occasionally, however, a certain blit would be issued that caused this bit to be set to 1, and then the check would fail and just hang indefinitely, waiting for a phantom blit to finish.

Finally, I took a look at the Mortal Kombat 3 bug where the palette on the character selection screen or the intro screen is all wrong. This turns out to be a cycle counting issue. The code that builds up these screens sets up a queue of palette entries to change. Each character that is displayed has its own palette, so if there are a lot of characters, there are a lot of palette changes to queue. But palettes are only changed in the VBLANK routine, so it's possible for the queue to get too full, depending on how quickly the game accumulates entries in the queue. When it hits its limit, it fortunately doesn't corrupt memory (good), but instead throws everything out of the queue (bad), leaving you with a bunch of missed palette changes.
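
The pattern is easy to picture with a generic sketch. To be clear, this is purely an illustration of the behavior described above, not the game's actual code; the names and the queue size are invented:

    #define QUEUE_MAX 64                  /* hypothetical limit */

    static int queue_count;
    static int queued_palette[QUEUE_MAX];

    void queue_palette_change(int palette)
    {
        if (queue_count >= QUEUE_MAX)
        {
            queue_count = 0;              /* overflow: throw everything out (the "bad" part) */
            return;
        }
        queued_palette[queue_count++] = palette;   /* at least memory isn't corrupted (the "good" part) */
    }

    void vblank_handler(void)
    {
        int i;
        for (i = 0; i < queue_count; i++)
            apply_palette_change(queued_palette[i]);   /* hypothetical helper */
        queue_count = 0;                  /* the queue only drains once per VBLANK */
    }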

This queueing behavior is why the problem gets worse if you have unlocked a bunch of characters, because more characters are displayed on the screen, and thus more palettes get queued.

The reason this worked on a real machine is probably due to the TMS34010's cache. We don't emulate the cache behavior in MAME; rather, we act like all instructions are in the cache, and count the minimum number of cycles for each instruction. This is generally the right approach, but in this case it works against us because we allow the code that sets up the palette queues to run too fast, overflowing the queues before the VBLANK comes in and clears everything out.

So for now, we'll just have to live with it. The 34010 cache is described in gory detail in the manual, so it is possible to simulate it eventually. But for now, a little color glitch isn't going to hurt anything.

How Slow Can You Go?

I get tired of reading people just blindly saying that MAME gets slower with each release. A lot of people attribute it to dumb things like adding more games, or making some completely benign change in the core, or supporting Mahjong games, or other silly nonsense.

Yes, you can look at the history of MAME over time and see that the system is overall slower than it used to be. However, the performance is also quite stable over long stretches of time. What you really see is a few abrupt performance drops here and there. Generally this is due to jettisoning features or hacks that were optimized for lower-end systems, in exchange for simpler, clearer code.

In an effort to get a handle on MAME's performance curve over the last few years, I've done some benchmarking of the Pac-Man driver over all of the native Windows ports of MAME since I did the first one back in version 0.37b15. The main reason I did not include DOS versions was that they don't support the -nothrottle and -ftr/-str command line options that make benchmarking possible.
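
For reference, a benchmarking run is just something along these lines (the exact executable name and option spellings vary a bit from version to version):

    mame.exe pacman -nothrottle -str 90
        (run unthrottled for a fixed number of seconds and report the average speed)

    mame.exe pacman -nothrottle -ftr 5400
        (older versions took a frame count instead of a second count)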

I picked Pac-Man because the driver itself really hasn't changed all that much over the years. The chart at the right shows the results of benchmarking on a year 2000-era computer (Dell Dimension 4100, 933MHz Pentium III) with a year 2000-era graphics card (nVidia GeForce 256) running Windows XP. If you look at the chart, you can see periods of performance stability followed by drops here and there due to specific changes to the MAME core that oriented it more toward modern systems and simpler code.

The first drop came with the 0.53 release (about a 10% speed hit). That was the first release where we dropped support for 8-bit palette modes. This meant that all internal bitmaps for rendering were 16 bits or greater. Although Pac-Man doesn't need 16-bit rendering, the simplification of removing the tons of rendering code needed to support both formats made this a good tradeoff.

After that, performance was roughly stable until the 0.61 release, where it took another 6% hit. This was the release where we removed dirty bitmap support. Previously, drivers could optimize their rendering by telling the OS-dependent code which 8x8 areas of the screen had changed and which were static. As you can see from Pac-Man, much of the screen is static most of the time, so this optimization helped. Again, though, it added a lot of complexity to the code and obscured a lot of the real behaviors.

You'll note another 8% drop in the 0.63 release. This one I haven't yet investigated enough to understand; I'll update this post once I figure out what happened here. One change that happened is that we upgraded to a newer GCC compiler; it's possible that in the case of Pac-Man, this impacted performance.

From that point on, right up until 0.106, Pac-Man performance was pretty flat. It varied a bit release to release, but really was quite steady until the big video rewrite. At that point, performance took a pretty significant hit (about 30%). The primary reason for this was that the Windows code was changed to always choose a 32-bit video mode if available, rather than supporting 16-bit video modes. The reasoning behind this decision is that palette values are computed to full 24-bit resolution internally, and even though Pac-Man doesn't use more than 32 colors itself, those 32 colors aren't perfectly represented by a 16-bit video mode.

Interestingly, things got about 15% better in 0.110. Again, I haven't yet done the analysis to figure out why.

So, what to take away from this? Really, it's more just an interesting set of datapoints I've wanted to see for a while. Certainly, lower-end games like Pac-Man have suffered the most from the various core changes over the years. However, in addition to code cleanup, another reason for this is that we have shifted the "optimization point" (i.e., the games we target to run most efficiently) away from the lower-end games and more toward the mid-to-late 80's games. This means that we generally take for granted that lower-end games will run fine on today's machines (or even 7-year-old machines like my Dell), but the later games are the ones that need more attention.

One follow-up test I'd like to do is to try the same analysis with a driver that is more in the middle, perhaps something like Rastan, or a game that required 32-bit color support back in version 0.37b15. These games won't see any hit from the removal of dirty support or 8-bit palette modes, or even from the later switch to 32-bit video modes. I think the result would be a much more consistent graph.