I’ve been noticing a lot of fun stories lately about bugs in old software that suddenly showed up in newer Windows versions. For example, here’s an excellent writeup by Silent about a bug in Grand Theft Auto: San Andreas that lay dormant until Windows 11 24H2 came out. MattKC also recently posted a cool video about the massive project of decompiling LEGO Island, which also solved the mystery of the “exit glitch” that happened in newer versions of Windows. Nathan Baggs has also been at it again, fixing a modern compatibility issue with Sid Meier’s Alpha Centauri this time.

I won’t spoil these stories for you, but they all reminded me of a bug that I fixed twelve years ago in Basilisk II but never wrote about until now. Basilisk II is one of the more popular 68k Mac emulators, allowing you to run an old Mac system on your modern machine. Nowadays, you can even run it in your browser using Infinite Mac! Here’s a screenshot of Basilisk II running on my Windows 10 machine.

The bug was: when you launched it, the emulated Mac would just sit there with a black screen rather than booting up. It didn’t happen every time, which really confused everybody. The problem seemed to be way more common on newer Windows versions, which were Vista and 7 at the time, but people also occasionally saw it on XP too. It definitely failed most of the time for me with Windows 7. Nobody was seeing this issue on Mac OS X or Linux.

To re-familiarize myself with this bug for the purposes of writing this post, I downloaded the broken version from the Internet Archive and tried it out in some virtual machines. Windows 2000 and XP ran it without any trouble on the first try, but Vista and 7 didn’t:

Basilisk II has a UI quirk that’s really annoying in this particular situation: the close button doesn’t work. You have to cleanly shut down the emulated machine in order to exit, which is impossible to do when you’re stuck with a black screen. This functionality is useful to protect you from losing data in the emulator when it’s working correctly, but it meant that whenever it started with a black screen, I had to go into the Task Manager, right-click on the process in the list, and choose “End Task”. How irritating! It took me about 10 tries before I was able to convince it to run properly without the black screen. No wonder I had tortured myself with this bug fix back in 2013.

Back in the day, there were all kinds of interesting theories and solutions posted by users about this problem. One person blamed a Bluetooth-related “BTTray.exe” service. Someone else found that opening the hard disk image with HFVExplorer before running Basilisk would allow it to work. Another person observed that running it as Administrator fixed the issue. Compatibility Mode settings were also a common workaround. Somebody was even using Safe Mode to get around it. People had been complaining about this as far back as 2005. Given that there were so many differing explanations with varying success, it seemed likely that none of them could truly be the answer.

The only solution that worked for everyone was to revert to an outdated version of Basilisk II from 2001, known as “build 142”. This was sometimes referred to as the pre-JIT version, because it came out before a just-in-time compiler was added to the 68k emulation to drastically improve performance. The old version worked fine, but it lacked all of the modern (at the time) improvements such as JIT.

Anyway, in 2013 I was also affected by this problem on my Windows 7 computer, and decided to take a stab at fixing it. This bug extermination tale isn’t quite as epic as the three linked above, because Basilisk II is open-source. I had access to all the code to see what was going on. But still, even having the source code, I had no idea where to start looking. Why would the behavior randomly change between runs? Maybe an uninitialized variable? Why were modern Windows versions more likely to cause it to fail?

Since I was able to reproduce both a failure case and a success case, I added a bunch of debug trace output to Basilisk II. I wanted to see what was changing between a successful run and an unsuccessful run. I focused on the video code. Was video actually working internally and failing to be displayed in the window, or was something deeper screwing up the video and/or causing the emulated machine to fail to boot?

This iterative testing and debug tracing process revealed that redraw_func() in BasiliskII/src/SDL/video_sdl.cpp was periodically called during both failures and successes, but the display was only ever found to be “dirty” and needing an update in video_refresh_window_vosf() when video worked correctly. When the black screen bug was happening, the display was never dirty.

Of course, this led me to keep tracing backwards to try to figure out why the display wasn’t being marked as dirty. I added more checks to see all the code that was running. My big discovery ended up being that SDL_monitor_desc::video_open() was only being called once if there was a black screen, but three times if the video was working. Going backwards from there, I found that this traced all the way back to VideoDriverControl() in BasiliskII/src/video.cpp. It was never being called during “black screen” runs.

The VideoDriverControl() function is special. It’s only ever called from the 68k CPU emulator’s opcode parsing! It’s hooked up to CPU opcode 0x7119. There’s a big table of opcodes in this same range relating to disk, CD, floppy, display, sound, external filesystem access, and much more, starting at 0x7100. The Basilisk II source code identifies these as “Extended opcodes (illegal moveq form)”.

If you look at the Motorola M68000 Family Programmer’s Reference Manual, sure enough, 0x71xx is a family of invalid MOVEQ instructions. Bits 15-8 being 0x71 make it look like a MOVEQ involving register D0, but bit 8 is supposed to be 0, so it’s invalid — the instruction format specifically says it should be 0.

You can see this for yourself in your favorite 68000-series disassembler. 0x7119 fails to disassemble, but 0x7019 decodes as MOVEQ #25, D0.
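To make the bit layout concrete, here's a minimal C++ sketch of a MOVEQ decoder. It isn't code from Basilisk II or from any real disassembler; the MoveqDecode struct and the decode_moveq() and example_moveq_decoding() functions are names I made up for illustration. The only thing it relies on is the encoding described above: bits 15-12 are 0111, bits 11-9 select the data register, bit 8 must be 0, and bits 7-0 hold the immediate.

```cpp
#include <cassert>
#include <cstdint>

// Fields of a candidate MOVEQ opcode word (hypothetical helper type).
struct MoveqDecode {
    bool valid;        // true only when the word is a legal MOVEQ
    unsigned reg;      // destination data register number (0 means D0)
    int8_t immediate;  // 8-bit immediate, interpreted as signed
};

// Decode a 16-bit opcode word as MOVEQ. The word is legal only when the
// top nibble is 0111 and bit 8 is clear.
inline MoveqDecode decode_moveq(uint16_t opcode) {
    const bool is_moveq_group = (opcode & 0xF000) == 0x7000;
    const bool bit8_clear = (opcode & 0x0100) == 0;
    return MoveqDecode{
        is_moveq_group && bit8_clear,
        static_cast<unsigned>((opcode >> 9) & 0x7),
        static_cast<int8_t>(opcode & 0xFF),
    };
}

// Usage example with the two opcodes discussed in the text.
inline void example_moveq_decoding() {
    const MoveqDecode legal = decode_moveq(0x7019);
    assert(legal.valid && legal.reg == 0 && legal.immediate == 25);

    const MoveqDecode illegal = decode_moveq(0x7119);
    assert(!illegal.valid);  // bit 8 is set, so this is not a real MOVEQ
}
```

The two opcodes differ only in bit 8, which is exactly the bit that makes 0x7119 illegal on real hardware.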

The whole point of this analysis is to show how Basilisk II cleverly uses this invalid range of instructions as its mechanism for communicating between the emulated CPU and the host machine. The CPU emulation looks for these invalid opcodes and calls various functions inside the codebase for handling disks, audio, displays, and stuff like that. In some ways, it’s kind of similar to the A-line instruction mechanism that classic Mac programs use for communicating with the operating system, except it only works in this particular emulator.
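Conceptually, this trap mechanism boils down to a lookup keyed by opcode. Here's a rough C++ sketch of that idea. To be clear, this is not how Basilisk II's UAE-based CPU core is actually written: the ExtendedOpcodeTable class, its install() and dispatch() methods, and the example handler are all hypothetical. The only details taken from the text are the 0x71xx range and the fact that 0x7119 leads to the video driver control code.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical host-side callback invoked when the emulated CPU hits an
// extended opcode.
using HostHandler = std::function<void()>;

// Hypothetical table mapping illegal 0x71xx opcodes to host handlers.
class ExtendedOpcodeTable {
public:
    void install(uint16_t opcode, HostHandler handler) {
        handlers_[opcode] = std::move(handler);
    }

    // Returns true if the opcode was in the 0x71xx range and a handler ran.
    // Anything else is left for the normal instruction decoder.
    bool dispatch(uint16_t opcode) const {
        if ((opcode & 0xFF00) != 0x7100) {
            return false;
        }
        const auto it = handlers_.find(opcode);
        if (it == handlers_.end()) {
            return false;
        }
        it->second();
        return true;
    }

private:
    std::map<uint16_t, HostHandler> handlers_;
};

// Usage example: a stand-in for the video driver control hook on 0x7119.
inline void example_dispatch() {
    int video_calls = 0;
    ExtendedOpcodeTable table;
    table.install(0x7119, [&video_calls] { ++video_calls; });

    assert(table.dispatch(0x7119));   // handled by the host
    assert(!table.dispatch(0x7019));  // a real MOVEQ, not an extended opcode
    assert(video_calls == 1);
}
```

In the black screen case, the equivalent of dispatch(0x7119) simply never happened, because the emulated Mac never executed that instruction.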

So in a successful boot with video, the 0x7119 instruction was being executed at some point by the emulated CPU. In a boot with a black screen it wasn’t. In other words, the emulated machine itself was the source of the problem. Yikes!

Wait a minute. This doesn’t make sense. How would the emulated machine even know to call such an instruction in the first place? I wasn’t even supplying a boot disk to Basilisk II, so how could it possibly be loading a driver that executes Basilisk II’s custom 0x7119 instruction?

This is where Basilisk II differs from other emulators like MAME. It’s not trying to perfectly reproduce the way that a stock machine runs. Instead, it patches the ROM you supply (I used a Quadra/LC/Performa 630 ROM) so that it bypasses things that would cause crashes in the emulated environment, and also injects its own code to tell the emulated machine what to do for video, audio, keyboard and mouse input, and so on. So it does make sense after all.

This is all detailed in BasiliskII/src/rom_patches.cpp and BasiliskII/src/slot_rom.cpp. In particular, the part of the code I was dealing with involved the function InstallSlotROM(), which creates a declaration ROM containing two drivers: Display_Video_Apple_Basilisk and Network_Ethernet_Apple_BasiliskII. It sticks this DeclROM at the end of the Mac’s ROM so that it’ll be automatically detected at startup.

I verified that InstallSlotROM() was indeed being called, both during successes and failures. So the driver was definitely being added to the ROM. The problem was caused by something executing differently inside the emulated machine. When the video worked, the Display_Video_Apple_Basilisk driver was loading and running. When there was a black screen, it wasn’t. Furthermore, it became apparent by looking at a CPU trace that during a black screen failure, the emulated machine was running fine otherwise! It just didn’t have any video.

So why was the emulated machine often failing to load the driver in newer versions of Windows? Why would the version of Windows even matter for this? This is an emulator, for crying out loud. Shouldn’t the internal state be the same every time I run it?

The big breakthrough for this problem came as I examined the InstallSlotROM() function in more detail, adding more debug output to try to discern differences. I noticed that whenever the black screen problem occurred, the value of the variable ROMBaseHost, used by InstallSlotROM(), looked much different than it did during successes.

| Success values of ROMBaseHost | Failure (black screen) values of ROMBaseHost |
|---|---|
| 0x04C90000 | 0x02970000 |
| 0x04C40000 | 0x02730000 |
| 0x04C80000 | 0x02720000 |
| 0x04CA0000 | 0x02710000 |
| 0x04C50000 | 0x025C0000 |

That’s odd. ROMBaseHost is the address from the host machine’s perspective where the emulated machine’s ROM lives. Why would the host address of the ROM even matter inside of the emulated machine? Was this just a coincidence? (Narrator: no, it wasn’t.)

I looked at the code that allocated ROMBaseHost in the platform-specific Windows directory. First, it allocated space for the emulated RAM, and then allocated one megabyte for ROM:

// Create areas for Mac RAM and ROM
RAMBaseHost = (uint8 *)vm_acquire_mac(RAMSize);
ROMBaseHost = (uint8 *)vm_acquire_mac(0x100000);
if (RAMBaseHost == VM_MAP_FAILED || ROMBaseHost == VM_MAP_FAILED) {
	ErrorAlert(STR_NO_MEM_ERR);
	QuitEmulator();
}

vm_acquire_mac() is a function that goes through a few layers, but eventually it ends up calling VirtualAlloc() to do its job on Windows. The same section of code in the Unix port looked like this instead:

uint8 *ram_rom_area = (uint8 *)vm_acquire_mac(RAMSize + 0x100000);
if (ram_rom_area == VM_MAP_FAILED) {	
	ErrorAlert(STR_NO_MEM_ERR);
	QuitEmulator();
}
RAMBaseHost = ram_rom_area;
ROMBaseHost = RAMBaseHost + RAMSize;

The difference is that this code allocates both RAM and ROM at the same time, rather than through two separate allocation calls. For a little more perspective here, in all of the test cases I listed above, regardless of success or failure, RAMBaseHost was somewhere in the range 0x03xxxxxx. Here are two examples:

| | Success | Failure (black screen) |
|---|---|---|
| RAMBaseHost | 0x03C90000 | 0x03B80000 |
| ROMBaseHost | 0x04C90000 | 0x02970000 |

Was it as simple as that? ROMBaseHost being below RAMBaseHost in the host machine’s memory space prevented the emulated computer from loading the video driver? The equivalent Unix code prevented that situation from ever happening.

As it turns out, yes. That was the problem. My fix ended up being to port the Unix version of the code over to Windows.

I was slightly nervous that the separate vm_acquire_mac() allocations were an intentional thing on the Windows port, but as soon as I combined them into one, the black screen went away and everything worked perfectly every time.

To explain the fix in more detail, the individual calls to vm_acquire_mac() for allocating RAM and ROM meant that sometimes the address of ROM from the host’s perspective was below RAM, and sometimes it was above RAM. It would fail whenever the ROM was below RAM. This was what caused the problem to be so random. It was also probably a decent explanation for why newer Windows versions seemed to experience the problem more often. My theory is that sometime around Vista, the behavior of Windows’ memory allocator changed, and it became much more likely for the second allocation’s address to be lower than the first. Experimentally, it seems like XP usually just kept going upward with addresses when running this code.
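The idea behind the fix can be sketched in a few lines of standalone C++. This is only an illustration, not the actual patch: std::malloc stands in for vm_acquire_mac(), and the MacMemory struct plus the allocate_mac_memory(), release_mac_memory(), and example_allocation() functions are hypothetical names. The point is that a single allocation pins down the relative position of RAM and ROM, so it no longer depends on what the OS allocator feels like doing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Host pointers for the emulated Mac's memory. Both live in one allocation.
struct MacMemory {
    uint8_t *ram_base;  // start of emulated RAM (and of the whole block)
    uint8_t *rom_base;  // start of emulated ROM, always ram_base + ram_size
};

// Hypothetical helper: reserve RAM and ROM with a single allocation so the
// ROM can never land below the RAM. std::malloc is only a stand-in for the
// emulator's real allocator.
inline MacMemory allocate_mac_memory(std::size_t ram_size, std::size_t rom_size) {
    uint8_t *block = static_cast<uint8_t *>(std::malloc(ram_size + rom_size));
    if (block == nullptr) {
        return MacMemory{nullptr, nullptr};
    }
    return MacMemory{block, block + ram_size};
}

// One allocation means exactly one matching release.
inline void release_mac_memory(MacMemory &mem) {
    std::free(mem.ram_base);
    mem = MacMemory{nullptr, nullptr};
}

// Usage example: 16 MB of RAM followed by 1 MB of ROM.
inline void example_allocation() {
    const std::size_t ram_size = 16 * 1024 * 1024;
    const std::size_t rom_size = 0x100000;
    MacMemory mem = allocate_mac_memory(ram_size, rom_size);
    if (mem.ram_base != nullptr) {
        assert(mem.rom_base == mem.ram_base + ram_size);
        release_mac_memory(mem);
    }
}
```

With two separate allocations, nothing guarantees that the second pointer is above the first, which is precisely what the original Windows code was quietly relying on.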

What the heck though? Why would the host machine’s address for the ROM even matter to begin with? Wouldn’t the emulated computer have its own completely independent address space anyway?

Looking through some of the documentation included with the source code, you can see that Basilisk II has a few different addressing modes. I checked, and the Windows version uses DIRECT_ADDRESSING:

Emulated CPU, “direct” addressing (EMULATED_68K = 1, DIRECT_ADDRESSING = 1):
As in the virtual addressing mode, the 68k processor is emulated with the UAE CPU engine and two memory areas are set up for RAM and ROM. Mac RAM starts at address 0 for the emulated 68k, but it may start at a different address for the host CPU. Besides, the virtual memory areas seen by the emulated 68k are separated by exactly the same amount of bytes as the corresponding memory areas allocated on the host CPU. This means that address translation simply implies the addition of a constant offset (MEMBaseDiff). Therefore, the memory banks are no longer used and the memory access functions are replaced by inline memory accesses.

What this means is host addresses are easily translated to virtual addresses in the emulated machine by subtracting an offset (MEMBaseDiff), which is simply the same value as RAMBaseHost. And likewise, to convert from an emulator address to a host machine address, you add MEMBaseDiff instead. This effectively makes the RAM always mapped to virtual address 0, and the ROM ends up mapped in virtual address space at ROMBaseHost – RAMBaseHost.
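Here's a small C++ sketch of that translation, assuming 32-bit host addresses like the ones in my examples. The DirectAddressing struct and its host_to_mac() and mac_to_host() methods are hypothetical names and are not the emulator's actual inline accessors; only the add/subtract-a-constant idea comes from the documentation quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of direct addressing on a 32-bit host: translation is
// just adding or subtracting one constant, with 32-bit wraparound.
struct DirectAddressing {
    uint32_t mem_base_diff;  // host address of emulated address 0 (RAMBaseHost)

    uint32_t host_to_mac(uint32_t host_addr) const {
        return host_addr - mem_base_diff;
    }

    uint32_t mac_to_host(uint32_t mac_addr) const {
        return mac_addr + mem_base_diff;
    }
};

// Usage example with the RAMBaseHost and ROMBaseHost values from the
// successful run.
inline void example_translation() {
    const DirectAddressing map{0x03C90000};
    assert(map.host_to_mac(0x03C90000) == 0x00000000);  // RAM starts at 0
    assert(map.host_to_mac(0x04C90000) == 0x01000000);  // ROM right after RAM
    assert(map.mac_to_host(0x01000000) == 0x04C90000);  // and back again
}
```

Because the offset is fixed by wherever RAM happens to live on the host, the emulated address of the ROM is entirely at the mercy of where the host put it.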

I find this whole setup quite confusing, but I will admit that I’m not highly experienced in writing emulator code. I’m assuming the reasoning behind this setup, as opposed to just using “if” statements to check if a virtual address is inside of RAM or ROM, has something to do with performance. I didn’t spend much time looking further into it. I did notice that the old pre-JIT version without the black screen bug didn’t have this direct addressing mode, so that’s why it didn’t have this problem.

Let’s think about what this offset subtraction means in the example success and failure scenarios I listed above. In the success case, RAMBaseHost was 0x03C90000 and ROMBaseHost was 0x04C90000. This means the virtual ROM address was 0x04C90000 – 0x03C90000 = 0x01000000. That result actually makes a whole lot of sense, because I had Basilisk II set up to use 16 MB of RAM, so Windows’ allocator did exactly what you might expect and allocated the ROM directly after the RAM. This is also what my patch guaranteed the behavior would always be in the Windows version of Basilisk II going forward. Makes sense.

On the other hand, in the failure scenario, RAMBaseHost was 0x03B80000 and ROMBaseHost was 0x02970000. The subtraction to determine the virtual ROM address ended up wrapping around below 0: 0x02970000 – 0x03B80000 = 0xFEDF0000. So the ROM was being mapped to a really high virtual address inside of the emulated machine. Here is a visual aid to show what was happening:

That’s definitely a big difference from the emulated machine’s perspective. It still doesn’t really explain why it failed in this case, though. It’s just an address, right? The Mac’s ROM is relocatable. Who cares if it’s at 0xFEDF0000 instead of 0x01000000? I traced through the emulated CPU instructions and found where they began to differ. The problem was inside of the ROM’s Slot Manager code. During failures, the ROM’s virtual address was at 0xFExxxxxx, which is the standard slot space for slot E. On the other hand, when it succeeded, it was in slot 0 because of the high nibble being 0. What it comes down to is the ROM didn’t expect itself to be mapped at 0xFExxxxxx, so the Slot Manager failed when attempting to load the DeclROM that Basilisk II placed at the end of the ROM.
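The wraparound and the slot collision are easy to check with a few lines of C++. This is just a worked example, not Slot Manager code: the function names are hypothetical, and I'm assuming the conventional 32-bit layout in which standard slot space for slot s covers 0xFs000000 through 0xFsFFFFFF.

```cpp
#include <cassert>
#include <cstdint>

// Emulated ROM address under direct addressing. Unsigned 32-bit subtraction
// wraps modulo 2^32 when the ROM sits below the RAM on the host.
inline uint32_t virtual_rom_address(uint32_t rom_base_host, uint32_t ram_base_host) {
    return rom_base_host - ram_base_host;
}

// Assumed layout: standard slot space occupies 0xF0000000 and up.
inline bool in_standard_slot_space(uint32_t mac_addr) {
    return (mac_addr >> 28) == 0xF;
}

// Within standard slot space, the second hex digit is the slot number.
inline unsigned standard_slot_number(uint32_t mac_addr) {
    return (mac_addr >> 24) & 0xF;
}

// Usage example with the success and failure values from the text.
inline void example_slot_collision() {
    const uint32_t good = virtual_rom_address(0x04C90000, 0x03C90000);
    assert(good == 0x01000000);
    assert(!in_standard_slot_space(good));

    const uint32_t bad = virtual_rom_address(0x02970000, 0x03B80000);
    assert(bad == 0xFEDF0000);
    assert(in_standard_slot_space(bad));
    assert(standard_slot_number(bad) == 0xE);  // lands in slot E's space
}
```

So a host-side ordering difference of a few megabytes is enough to move the emulated ROM from just above RAM into slot E's address range.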

Basically, it’s risky to allow the Mac’s ROM to be placed anywhere in the address space of the emulated machine, and newer versions of Windows just so happened to be allocating memory in a way that caused the emulated Mac’s Slot Manager to dislike the virtual address of its ROM. This caused it to bail out when it should have been loading the video driver.

To really confirm that I had figured out the problem, I modified the Linux version of Basilisk II to force it to put the ROM below RAM, just like I was seeing when the Windows version didn’t work. This caused the video to successfully fail in Linux every time. And with that, I was confident that I had tracked down and eliminated the bug for good. Two days after I submitted the fix, my pull request was merged, and Basilisk II has worked great on Windows ever since then.

The funny thing is that this actually used to be a bug in the Unix version too, and it had already been fixed in 2005 — I wasn’t even the first one to track this problem down. The person who fixed it for Unix didn’t apply the same fix to the Windows version. As I look back on this today, I’m realizing that when I fixed it in the Windows port in 2013, I forgot to update the corresponding deallocation code to only be a single call to vm_release(). Whoops! That little mistake is probably harmless, but I should submit another PR to fix it for consistency.

TL;DR: As usual, this compatibility issue with newer Windows versions was not the fault of Windows. If you call malloc (or something equivalent) twice in a row back-to-back, you can’t assume the second pointer you get back is going to be greater than the first. The broken code in Basilisk II mostly got away with it in Windows XP, but Vista must have changed something pretty significantly under the hood.

I want to close this off by saying that looking at my pull request from 2013 — only the second one I ever opened on GitHub, by the way — embarrasses me a little bit. It’s a nice reminder of just how green I was with Git at the time. I was apologizing in the PR for accidentally including a minor unrelated file naming case sensitivity fix, which is something I easily should have been able to separate into individual branches and PRs myself if needed, but I had no idea what I was doing. It’s fun to have a window into the past to see how far I’ve come in the past twelve years! I wonder what the next decade will bring?
