Making Bad Apple!! on the Game Boy - Part 1
A little over two weeks ago as of writing, I uploaded a new video on my YouTube channel. It's a recording of my Game Boy debut - a cover version of the Bad Apple!! music video, as rendered from an "actual" (emulated) cartridge.
This marked the end of a three-month journey of familiarizing myself with the hardware on a more practical scale, rather than just reading and researching it. And I'm very happy with what I've made here. So happy, in fact, that I had one of my friends react to it live on a stream and his reaction was absolutely delightful. It didn't take long until the question on everyone's mind was asked:
"Do you think you could make a video explaining how it was done? I think people would watch that video!"
Well, I'm not going to make a video, but I'm itching for a chance to talk about it! This is going to be a multi-part series of posts, each one focusing on a different subject. In this first part I'd like to mainly go into the relevant hardware of the Game Boy. Knowing what I had to work with is important for understanding certain decisions I made in the development of the ROM and it'll help to appreciate just how underpowered and unsuited the console is from a modern point of view. It's quite amazing that anything could run on a Game Boy at all!
Before we get started, I'll use plural pronouns from now on, so "we", "us", etc. Since you're reading this I'm just gonna assume you might be interested in doing something like this yourself (or maybe you already have) and there's nothing you can do to stop me. Without any further ado, let's get this started already!
Chapter 1 - The Chip that Overdelivers
There aren't that many integrated circuits on the Game Boy's circuit board. In fact there are only four, two of which we'll take a closer look at. In this section, let's focus on the chip that's labeled "DMG-CPU", DMG standing for "Dot Matrix Game" - the internal name for what we know as the Game Boy. There's a lot more inside than just a central processing unit! This one chip is essentially responsible for everything. It includes the CPU as expected, but also the Pixel-Processing Unit (PPU), the Audio Processing Unit (APU) as well as a small amount of its own memory! The ROM uses everything on that list, so let's go over them one after the other, starting with...
I. The CPU
The one component that everyone has heard of. Designed by Sharp specifically for the Game Boy, the CPU is responsible for reading and processing instructions as it encounters them in memory. It runs at a clock speed of 4,194,304 Hz (in fact, all units on the chip operate at this speed). That is, every second there are 4,194,304 "T-cycles", the smallest unit of time on the GB hardware. Instructions can take various amounts of T-cycles to complete. However, the count is always a multiple of 4, so typically we take a bit of a shortcut, divide the T-cycle count by 4, and speak of instructions as taking a certain amount of "M-cycles" instead, usually 1 to 3.
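To make the T-cycle to M-cycle shortcut concrete, here is a tiny sketch of the arithmetic. The function and constant names are made up for illustration; only the clock speed and the factor of 4 come from the text above.

```python
# Clock figures taken from the text; the names are illustrative only.
T_CYCLES_PER_SECOND = 4_194_304
T_CYCLES_PER_M_CYCLE = 4


def t_to_m_cycles(t_cycles: int) -> int:
    """Convert a T-cycle count to M-cycles; the count must be a multiple of 4."""
    if t_cycles % T_CYCLES_PER_M_CYCLE != 0:
        raise ValueError("instruction timings are always a multiple of 4 T-cycles")
    return t_cycles // T_CYCLES_PER_M_CYCLE


# How many M-cycles fit into one second of CPU time.
M_CYCLES_PER_SECOND = t_to_m_cycles(T_CYCLES_PER_SECOND)

if __name__ == "__main__":
    print(M_CYCLES_PER_SECOND)  # 1048576
    print(t_to_m_cycles(12))    # 3
```

So we have roughly a million M-cycles per second to spend, which sounds like a lot until we start dividing it up between video and audio.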
Typically, instructions act on one or more of the CPU's internal registers, small 8-bit portions of memory that the unit has direct and very fast access to. The registers are usually labeled with single letters, specifically A, B, C, D, E, H and L. With a "load" instruction it's possible to set any register to any value (for example, "ld l, 5" sets the value in the L register to 5) or to the value stored in another register ("ld b, c" copies the value in C into B). Out of these 8-bit registers, A is a little special - it is the "accumulator" and a lot of instructions implicitly use it as a source, a destination, or both. For example, the add instruction "add c" takes the value currently stored in A, adds the value in C to it, and writes the result back into A. The same goes for the subtract instruction sub, bitwise operations (or, and, xor) and a few others.
The CPU can also interact with memory locations and read from or write to them. But there's a small problem here. We know how to use our 8-bit registers now, but with 8 bits we can only store values between 0 and 255. Without spoiling too much, the Game Boy has over 65 thousand unique memory addresses. How do we access all of them? The solution is quite simple - use two registers together! 16 bits is enough to store any value between 0 and 65,535 which covers our entire address space. Specifically, we have access to the AF, BC, DE and HL register pairs. Don't worry about that mysterious F register for now; it's the flag register and we'll get to it at the end of this section. Aside from AF, each of the three remaining register pairs can be used to retrieve a value from memory and place it in the accumulator (e.g. "ld a, [bc]") or write from the accumulator into it (e.g. "ld [de], a"). Once again we have a special register, this time it's the HL pair. HL has the unique power to load to and from memory with any 8-bit register, not just the accumulator. Yes, that includes H and L themselves. It's also a source and the destination of 16-bit additions (there's no 16-bit subtract instruction).
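If the idea of gluing two 8-bit registers together feels abstract, this small sketch shows the arithmetic behind a register pair. It is only a model: the helper names and the flat 64 KiB bytearray standing in for the address space are assumptions for illustration, not how the ROM does anything.

```python
def make_pair(high: int, low: int) -> int:
    """Combine two 8-bit registers (e.g. H and L) into one 16-bit value."""
    return ((high & 0xFF) << 8) | (low & 0xFF)


def split_pair(value: int) -> tuple[int, int]:
    """Split a 16-bit value back into its (high, low) 8-bit halves."""
    return (value >> 8) & 0xFF, value & 0xFF


# Hypothetical flat memory covering every address a 16-bit pair can reach.
memory = bytearray(0x10000)

if __name__ == "__main__":
    h, l = 0xC1, 0x23
    hl = make_pair(h, l)
    print(hex(hl))            # 0xc123

    memory[hl] = 0x42         # roughly what "ld [hl], a" does with A = $42
    a = memory[hl]            # roughly what "ld a, [hl]" does
    print(hex(a))             # 0x42
    print(split_pair(0xFFFF)) # (255, 255)
```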
And because that's not enough registers to wrap your head around, there are two more that are exclusively usable as 16-bit registers. They are the program counter (PC) and the stack pointer (SP) and they have their own rules. PC always contains the location in memory that the current instruction is read from. That's it. We can change its value in limited ways with jump and return instructions, but without our own intervention it'll increment on its own as it goes through memory and executes our code one instruction at a time. SP is a very useful, but also very limited register. Aside from a few interactions between SP and HL, the stack pointer usually acts as a way to store important values outside of a register (for "safekeeping" during calculations, for example) by offering quick reads and writes to memory. As a result, pretty much none of the usual instructions work for SP. Instead, the two most important instructions that can be used with SP (and only SP) are "push" and "pop". "push" reduces the value in SP by two and writes any of the familiar register pairs to wherever SP now points. "pop" meanwhile does the opposite - it copies the value stored in memory at SP into the specified register pair and then increments SP by two.
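Here's a rough model of what push and pop do to SP and memory. On the hardware the decrement happens before the write, and the read before the increment, which is what the sketch mirrors. The class name, the flat bytearray and the starting SP of $FFFE are assumptions for illustration.

```python
class StackModel:
    """Toy model of the stack pointer; not an emulator."""

    def __init__(self, sp: int = 0xFFFE):
        self.memory = bytearray(0x10000)  # hypothetical flat address space
        self.sp = sp                      # assumed initial stack pointer

    def push(self, pair_value: int) -> None:
        # SP goes down by two in total; the high byte lands at the higher address.
        self.sp = (self.sp - 1) & 0xFFFF
        self.memory[self.sp] = (pair_value >> 8) & 0xFF
        self.sp = (self.sp - 1) & 0xFFFF
        self.memory[self.sp] = pair_value & 0xFF

    def pop(self) -> int:
        # Read the low byte, then the high byte, moving SP back up by two.
        low = self.memory[self.sp]
        self.sp = (self.sp + 1) & 0xFFFF
        high = self.memory[self.sp]
        self.sp = (self.sp + 1) & 0xFFFF
        return (high << 8) | low


if __name__ == "__main__":
    stack = StackModel()
    stack.push(0x1234)            # e.g. "push bc" with BC = $1234
    print(hex(stack.sp))          # 0xfffc
    print(hex(stack.pop()))       # 0x1234, e.g. "pop de"
    print(hex(stack.sp))          # 0xfffe
```

Note that the value comes back out into whichever pair we name in the pop, so push followed by pop is also a handy way to copy one register pair into another.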
Finally for this section, let's tackle the F register. It's not a register in the usual sense; in fact, we can't interact with it directly at all. Instead, certain bits in F (normally labeled Z, N, H and C) can be changed or used by certain instructions. Most important are the zero flag Z and the carry flag C. N and H serve essentially no purpose outside of a single instruction (DAA) and they're never even used in the ROM. We'll go over how Z and C are used as they're needed, otherwise this section would become even longer than it already is. All we need to know for now is that certain instructions can modify or use their status and they will help us in controlling the program flow.
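As a small taste of what Z and C report, here is a sketch of an 8-bit addition like "add c" together with the two flags it would leave behind. The function name and the tuple it returns are made up for illustration; N and H are left out on purpose.

```python
def add_to_accumulator(a: int, operand: int) -> tuple[int, bool, bool]:
    """Model an 8-bit add into A, returning (new A, zero flag, carry flag)."""
    total = (a & 0xFF) + (operand & 0xFF)
    result = total & 0xFF      # A can only hold 8 bits, so the sum wraps around
    zero = result == 0         # Z: the result is exactly zero
    carry = total > 0xFF       # C: the sum did not fit into 8 bits
    return result, zero, carry


if __name__ == "__main__":
    print(add_to_accumulator(1, 2))        # (3, False, False)
    print(add_to_accumulator(0xFF, 1))     # (0, True, True)
    print(add_to_accumulator(0x80, 0x90))  # (16, False, True)
```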
II. The PPU
The CPU was a bit of a handful, so let's go over something a little simpler. The PPU's job is to draw to the screen, scanline by scanline. Each scanline takes exactly 456 T-cycles (look, there they are again), though they're called "dots" here, and there are four distinct modes of operation that last various amounts of dots:
Mode 2 (OAM scan) - 80 dots
Mode 3 (drawing) - 172 dots
Mode 0 (HBlank) - 204 dots
Mode 1 (VBlank) - 4560 dots (10 scanlines of 456 dots each)
Note that the only time the screen is actually drawn to is during mode 3. VRAM, where the graphics data is located, is locked during this time. Attempting to read VRAM during mode 3 returns all 1s and writes have no effect. During all other modes, VRAM is accessible for reading and writing like any other memory region. Mode 0 is called "HBlank", here the PPU just stalls for whatever the remaining number of dots is until the next scanline starts. Mode 2 is the "OAM scan", during which OAM (Object Attribute Memory) is locked while the PPU fetches object data for the scanline. The ROM never uses objects, so modes 0 and 2 are identical for our intents and purposes. That is, our "HBlank" is a combined 284 dots long - 71 M-cycles. Finally, mode 1 is "VBlank". During this 4560 dot period, all graphics memory is freely accessible and nothing is drawn to the screen.
Depending on the number and position of objects on the scanline, mode 3 may be lengthened (and mode 0 shortened to compensate). Again though, we don't use any objects. Modes 3 and 0 have the lengths given above all the time no matter what and there are no mode 3 dot penalties.
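Since these numbers decide how much time we get to touch VRAM, it's worth spelling out the arithmetic once. The sketch below assumes the penalty-free case: 80 dots of mode 2, 172 dots of mode 3, 144 visible scanlines and 10 VBlank scanlines. The constant names are made up for illustration.

```python
T_CYCLES_PER_SECOND = 4_194_304
DOTS_PER_SCANLINE = 456
VISIBLE_SCANLINES = 144   # height of the screen in pixels
VBLANK_SCANLINES = 10     # 4560 dots / 456 dots per scanline

MODE_2_DOTS = 80          # OAM scan
MODE_3_DOTS = 172         # drawing, assuming no dot penalties at all

# Everything outside mode 3 is VRAM-safe: mode 0 plus mode 2 of the next line.
VRAM_SAFE_DOTS_PER_LINE = DOTS_PER_SCANLINE - MODE_3_DOTS
VRAM_SAFE_M_CYCLES_PER_LINE = VRAM_SAFE_DOTS_PER_LINE // 4
MODE_0_DOTS = VRAM_SAFE_DOTS_PER_LINE - MODE_2_DOTS

VBLANK_DOTS = VBLANK_SCANLINES * DOTS_PER_SCANLINE
DOTS_PER_FRAME = (VISIBLE_SCANLINES + VBLANK_SCANLINES) * DOTS_PER_SCANLINE
FRAMES_PER_SECOND = T_CYCLES_PER_SECOND / DOTS_PER_FRAME

if __name__ == "__main__":
    print(VRAM_SAFE_DOTS_PER_LINE)      # 284
    print(VRAM_SAFE_M_CYCLES_PER_LINE)  # 71
    print(VBLANK_DOTS)                  # 4560
    print(DOTS_PER_FRAME)               # 70224
    print(round(FRAMES_PER_SECOND, 2))  # 59.73
```

In other words, one full frame is 70,224 dots, which is where the Game Boy's slightly odd refresh rate of just under 60 frames per second comes from.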
III. The APU
Last but certainly not least, let's talk about sound. The Game Boy's Audio Processing Unit gives us four channels to play music with.
The first two are pulse channels that can play four square waves with varying duty cycles - 12.5%, 25%, 50% and 75%. The duty cycle basically relates the length of the "high" portion of the wave to the length of the entire period. For example, a 50% duty cycle splits the low portion and high portion of each period 50:50.
The way the duty cycles are implemented in the APU is a bit more complicated, but for now that's all we need to care about.
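For a quick picture of what the four duty cycles mean, here is an idealised sketch that builds one period of each wave out of 8 steps. The step order is a simplification chosen for readability and is not meant to match the hardware's internal patterns; the names are made up for illustration.

```python
STEPS_PER_PERIOD = 8

# How many of the 8 steps are "high" for each duty cycle setting.
HIGH_STEPS = {"12.5%": 1, "25%": 2, "50%": 4, "75%": 6}


def square_wave(duty: str) -> list[int]:
    """Return one idealised period: high steps first, then low steps."""
    high = HIGH_STEPS[duty]
    return [1] * high + [0] * (STEPS_PER_PERIOD - high)


def duty_ratio(wave: list[int]) -> float:
    """Fraction of the period that the wave spends high."""
    return sum(wave) / len(wave)


if __name__ == "__main__":
    for name in HIGH_STEPS:
        wave = square_wave(name)
        print(name, wave, duty_ratio(wave))
```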
Channel 3 is somewhat more versatile. Instead of being limited to any particular waveform, it can play any wave of our choosing! ...provided it fits into a 32-entry wave table. Each sample takes 4 bits to describe, so the whole table is 16 bytes long. In exchange for this ability though, the volume options are seriously limited. The wave channel can only play at 100%, 50%, 25% or 0% volume, compared to the 16 steps available for the pulse channels.
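To see how 32 samples end up in 16 bytes, here is a sketch that packs two 4-bit samples into each byte. It assumes the first sample of each pair goes into the upper nibble; the function name is made up for illustration.

```python
def pack_wave_table(samples: list[int]) -> bytes:
    """Pack 32 samples (each 0..15) into 16 bytes, upper nibble first."""
    if len(samples) != 32:
        raise ValueError("the wave table holds exactly 32 samples")
    if any(not 0 <= s <= 15 for s in samples):
        raise ValueError("each sample must fit into 4 bits")
    return bytes((samples[i] << 4) | samples[i + 1] for i in range(0, 32, 2))


if __name__ == "__main__":
    # A simple rising ramp: 0, 0, 1, 1, 2, 2, ... 15, 15
    ramp = [i // 2 for i in range(32)]
    table = pack_wave_table(ramp)
    print(len(table))   # 16
    print(table.hex())  # 00112233445566778899aabbccddeeff
```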
Finally, channel 4 is a noise channel. It's powered by a simple linear feedback shift register, which is good at quickly providing pseudorandom numbers. It also has 16 volume steps, just like the pulse channels.
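In case you've never met a linear feedback shift register before, here is a generic textbook sketch of one. The 15-bit width, the tap positions and the all-ones starting state are assumptions picked for illustration, so don't treat this as an exact model of the noise channel.

```python
LFSR_BITS = 15
LFSR_MASK = (1 << LFSR_BITS) - 1


def lfsr_step(state: int) -> int:
    """Advance the register by one step: XOR the two lowest bits,
    shift everything right, and feed the XOR result in at the top."""
    feedback = (state ^ (state >> 1)) & 1
    return ((state >> 1) | (feedback << (LFSR_BITS - 1))) & LFSR_MASK


def noise_bits(count: int, state: int = LFSR_MASK) -> list[int]:
    """Collect `count` pseudorandom bits by reading the lowest bit each step."""
    bits = []
    for _ in range(count):
        bits.append(state & 1)
        state = lfsr_step(state)
    return bits


if __name__ == "__main__":
    print(hex(lfsr_step(0x7FFF)))  # 0x3fff
    print(noise_bits(32))
```

The appeal is that a single step is just a shift and an XOR, which is about as cheap as "random" gets in hardware.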
The ROM uses both pulse channels; wave and noise are never touched. But if you listen to the ROM on a Game Boy, it plays back the actual song of the music video instead of playing back a cover of it. How could we possibly use the pulse channels to play back sampled audio? Let's look at how they work in a bit more detail.
Just like the wave channel, we can think of the pulse channels as having an internal wave table of 8 samples - it's just not exposed to us. When we trigger the channel to start a note, we start a timer that counts down from whatever we set the channel's frequency to. Once this timer reaches zero, it advances the position in this internal wave table and resets the timer. It turns out that there's a nice, exploitable quirk in the trigger logic: triggering a channel resets the timer, but it does not reset the position in the wave. Simply put, if we constantly retrigger a pulse channel before the timer runs out, we can keep the channel at a single sample for as long as we want to! We can then use the channel's volume setting to approximate any arbitrary wave we desire. Finally, we can use both pulse channels and supply them with the same volume. That way the audio is twice as loud as it would be with only a single channel.
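The retrigger quirk is easier to believe once you see it in motion, so here is a toy model of it. It is a sketch under simplifying assumptions: the class, the tick granularity, the 50% pattern and the idea that the channel happens to sit on a "high" step are all made up for illustration and are not taken from the ROM.

```python
class PulseChannelModel:
    """Toy pulse channel: a timer, a position in an 8-step wave, a volume."""

    def __init__(self, period: int):
        self.pattern = (1, 1, 1, 1, 0, 0, 0, 0)  # assumed 50% duty wave
        self.period = period
        self.timer = period
        self.position = 0
        self.volume = 0

    def trigger(self, volume: int) -> None:
        # Triggering reloads the timer but leaves the wave position alone.
        self.volume = volume
        self.timer = self.period

    def tick(self) -> None:
        self.timer -= 1
        if self.timer == 0:
            self.timer = self.period
            self.position = (self.position + 1) % len(self.pattern)

    def output(self) -> int:
        return self.pattern[self.position] * self.volume


def play_samples(samples: list[int], ticks_per_sample: int, period: int = 100) -> list[int]:
    """Retrigger before the timer can expire, using volume as the sample value."""
    channel = PulseChannelModel(period)
    heard = []
    for sample in samples:
        channel.trigger(sample)
        for _ in range(ticks_per_sample):
            channel.tick()
        heard.append(channel.output())
    return heard


if __name__ == "__main__":
    # As long as ticks_per_sample < period, the wave never advances.
    print(play_samples([3, 15, 0, 8], ticks_per_sample=50))  # [3, 15, 0, 8]
```

Because the position never moves, the channel's output simply follows whatever volume we last triggered it with, and that's our 4-bit sample player.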
Now, one question you may have is "Why not just use the wave channel?". And it's a very good question. The fact of the matter is that the wave channel is a bit of a mess to configure. The wave table is locked while the channel is playing, so we'd need to stop it, write new samples, then trigger it again. It's also easy to accidentally corrupt the wave by retriggering it without stopping the channel first. This is a DMG-only issue that was fixed on Game Boy Color, but since the DMG is our primary target we'd need to worry about that. And this corruption bug actually happens in a few DMG games that didn't pay attention to it! Another issue is that the first sample in the wave table is never actually played until the wave loops. This is just a quirk of how the currently playing sample is determined after a trigger; think of it as the last played sample being "stuck in the buffer". Since we'd preferably provide the channel with new samples just after the final sample in the table is reached, this means that every 32nd sample would be a corrupted, held version of the previous one. So instead of having to worry about all of this and more, it's just a lot more convenient to use the pulse channel quirk to our advantage. That way we get to decide when each sample is pushed, which is much more comfortable than preparing 32 samples at once. We'll later find out that we wouldn't even reasonably have the time to do the latter.
Today's Conclusion
I'll have to admit, this was quite a bit longer than I originally prepared for. Initially I wanted to include the memory layout in today's post as well, but honestly considering the length of this one I'm going to move that subject to the next post, so it'll (hopefully) be a shorter one. Once that is all done, I'd like to explain the format that I came up with to store each frame and the sample data on the Game Boy end of things. Then, to wrap the series up, I want to show how the actual conversion from an mp4 to a GB-compatible video went. It's going to be easier to work backwards from the decompression of the format to its compression.
I have no planned rhythm for these, but I'll try to write the next part of this series over the following week. See you then, and enjoy the rest of your day!















