Temptation of the Apple: Dolphin on macOS M1

Ever since the announcement on November 10th, 2020, users have had high hopes for the new Apple M1 devices. With the powerful Apple Silicon processor smashing benchmarks all over the place, users and developers alike were asking if a native Dolphin build would be possible. Now we have the answer.

Apple's M1 hardware is incredibly powerful and excels at running Dolphin. This announcement has been in the works for some time; eagle-eyed users may have noticed that earlier this month macOS builds started being designated as "Intel". That's because delroth and Skyler had set up a new buildbot using a service called MacStadium for creating Universal macOS binaries. These builds are available immediately and natively support both M1 and Intel macOS devices.

Tackling macOS on ARM

It is an understatement to say that Apple dropped a bomb on the PC industry with the M1 ARM processor. ARM is a Reduced Instruction Set Computing (RISC) architecture that was specifically designed for efficiency in portable devices. With a tight instruction set instead of the ever-ballooning mess that is x86, ARM was able to get away with literally less processor while performing optimized tasks, giving it exceptional power efficiency. However, given unoptimized workloads, an ARM processor would need many more cycles to perform them than an x86 CPU. All combined, ARM was the processor of choice for battery life in portable devices, but when pushed it had poor overall performance compared to Intel's x86 processors. It was a processor for casual things like phones, and not really meant for "real work". But that is the past.

Intel's iron grip on process superiority has long since slipped, and the ARM instruction set has carefully expanded to handle more tasks efficiently without sacrificing power efficiency. Yet even with ARM reaching datacenters, and even with some interesting hardware giving us a glimpse at what could be, ARM's reputation as being weaker than x86 has remained firmly entrenched.

But with the M1, Apple has completely shattered this foolish notion. Not only can the M1 perform the same tasks as the Intel processors it replaces, it can do them faster even when running through the Rosetta 2 translation layer! All of this comes with considerably better single-threaded performance than Intel. Let's just say Apple had gotten our attention.

We were very excited.

We immediately put it through its paces. Using the Rosetta 2 translation layer with Dolphin's x86-64 JIT, the M1 easily ran most games at full speed and handily outran like-class Intel Macs. The experience wasn't entirely smooth due to jitter from JITting a JIT, yet the processor proved itself more than capable of handling Dolphin. Still, having to go through a translation layer was a huge performance bottleneck. Developers thought, why not just use Dolphin's AArch64 JIT for native support? And thus the race was on, as several people tried to clear the hurdles of getting Dolphin's AArch64 JIT to run on the M1.

Unfortunately, getting the AArch64 JIT to work wasn't exactly trivial. Apple requires W^X (Write Xor Execute) conformance for native macOS M1 applications. Under W^X, a region of memory can be marked as writable or as executable, but never both at the same time! Because it's easier and hadn't been forbidden on any of the prior platforms that Dolphin supports, the emulator previously just marked the memory regions used by the JIT as both writable and executable. This requirement from Apple is mostly a security feature to prevent bugs in programs that read untrusted data from being exploited to run malware. Outside of emulators, the primary place that you'll actually see self-modifying code is web browsers, which are often a vector for attack on a computer.

This was thankfully a lot less strict than on iOS devices, which forbid mapping memory as executable at all, making iOS untenable for us to officially support. Apple even provides documentation for helping developers port JITs to macOS on ARM. Skyler used a method described in the documentation that switches the mapped memory between writable while emitting code and executable while running it. Since Dolphin wasn't designed for this, there were a few hiccups along the way, but eventually everything was massaged into working with the new restrictions.
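The write-then-execute toggle can be sketched with plain POSIX calls. This is only a minimal illustration under stated assumptions, not Dolphin's actual code: `WxRegion` and its methods are hypothetical names, and the sketch uses `mmap` and `mprotect` so that it builds on any POSIX system. On macOS for Apple Silicon, the approach in Apple's documentation instead maps the region with `MAP_JIT` and flips the calling thread's access with `pthread_jit_write_protect_np`. The sketch never executes the bytes it stores.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Dolphin's code): a JIT code region that is either
// writable or executable at any given moment, never both.
class WxRegion
{
public:
  explicit WxRegion(std::size_t size) : m_size(size)
  {
    // Start out readable and writable so code can be emitted right away.
    void* ptr = mmap(nullptr, m_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
    m_base = (ptr == MAP_FAILED) ? nullptr : static_cast<std::uint8_t*>(ptr);
  }

  ~WxRegion()
  {
    if (m_base)
      munmap(m_base, m_size);
  }

  WxRegion(const WxRegion&) = delete;
  WxRegion& operator=(const WxRegion&) = delete;

  bool IsValid() const { return m_base != nullptr; }

  // Call before emitting code: the region becomes writable, not executable.
  bool BeginWrite() { return Protect(PROT_READ | PROT_WRITE, true); }

  // Call before running code: the region becomes executable, not writable.
  bool EndWrite() { return Protect(PROT_READ | PROT_EXEC, false); }

  // Copies machine code into the region; refuses while it is executable.
  bool Emit(std::size_t offset, const void* code, std::size_t length)
  {
    if (!m_base || !m_writable || offset > m_size || length > m_size - offset)
      return false;
    std::memcpy(m_base + offset, code, length);
    return true;
  }

  std::uint8_t ReadByte(std::size_t offset) const
  {
    assert(m_base && offset < m_size);
    return m_base[offset];
  }

private:
  bool Protect(int prot, bool writable)
  {
    if (!m_base || mprotect(m_base, m_size, prot) != 0)
      return false;
    m_writable = writable;
    return true;
  }

  std::uint8_t* m_base = nullptr;
  std::size_t m_size = 0;
  bool m_writable = true;
};
```

A caller would wrap every emit in `BeginWrite()` and `EndWrite()`, and only jump into the region after `EndWrite()` succeeds. A real AArch64 JIT would also have to invalidate the instruction cache for the freshly written range before executing it, which this sketch leaves out.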

Once that was out of the way, the focus shifted towards maintainability and setting up the infrastructure. Beyond getting it to run correctly, this was by far the hardest challenge to official M1 support. Dolphin's infrastructure is rather complicated and sensitive to changes. Moving macOS builds over to a universal binary (x86-64 and AArch64 all in one) along with getting the hardware necessary to build macOS universal binaries was a challenge and could have proven to be an expensive endeavor. In the end, MacStadium made the move extremely inexpensive by providing us with free access to M1 hardware, so we were able to focus on making Dolphin's buildbot infrastructure handle the new builds.

Putting the M1 Hardware To The Test

So now that it runs, you're probably wondering how well it runs. There are a few things we need to keep in mind. Dolphin's AArch64 JIT isn't quite as mature as the x86-64 JIT. While things aren't as bad as they were a couple of years ago and compatibility should be roughly the same thanks to efforts from JosJuice, it is still the less complete of the two JITs.

One of the differences is instruction coverage. Any PowerPC instruction that isn't implemented in the JIT has to fall back to the interpreter, which carries a huge performance penalty. Most common instructions are covered by both JITs at this point. There is one important feature missing in the AArch64 JIT, though: memchecks. Thankfully, this only affects Full MMU games such as Star Wars Rogue Squadron II and III and Spider-Man 2. There are some niceties missing from the AArch64 JIT, too, like the JitCache space reuse that prevents spurious JitCache flushes.
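The fallback decision can be pictured as a table lookup. The sketch below is a simplified illustration rather than Dolphin's real dispatch: `JitCoverage`, `CompilePath`, and the method names are hypothetical, and it only looks at the PowerPC primary opcode (the top 6 bits of each 32-bit instruction), whereas a real decoder also has to consider extended opcodes.

```cpp
#include <array>
#include <cstdint>

enum class CompilePath
{
  Jit,
  InterpreterFallback,
};

// Hypothetical coverage table, indexed by the PowerPC primary opcode.
class JitCoverage
{
public:
  // Records that the JIT has a native emitter for this primary opcode.
  void MarkImplemented(std::uint32_t primary_opcode) { m_implemented.at(primary_opcode) = true; }

  // Decides how an instruction is handled and counts the slow-path cases.
  CompilePath Classify(std::uint32_t instruction)
  {
    const std::uint32_t primary_opcode = instruction >> 26;  // always 0..63
    if (m_implemented[primary_opcode])
      return CompilePath::Jit;

    ++m_fallback_count;
    return CompilePath::InterpreterFallback;
  }

  std::uint32_t FallbackCount() const { return m_fallback_count; }

private:
  std::array<bool, 64> m_implemented{};
  std::uint32_t m_fallback_count = 0;
};
```

For example, after `MarkImplemented(14)` (the primary opcode of `addi`), the instruction word `0x38600001` is classified as `CompilePath::Jit`, while a word with an unmarked opcode takes the interpreter path and bumps the fallback counter.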

Even with missing memchecks in the AArch64 JIT, Rogue Squadron 2 runs admirably.

AArch64 does have its advantages, though. Namely, AArch64 processors have 31 general-purpose registers, compared to the 16 available in x86-64 processors. The PowerPC processor we are emulating has 32 registers, and while it is rare for all of them to be used within a single code block, more registers are always nice to have. Another difference is that AArch64 and PowerPC have three-operand instructions, while most x86-64 instructions take only two operands.

PPC:     A = B + C  
AArch64: A = B + C  
x86-64:  A = B, A = A + C
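To make the comparison concrete, here is a small sketch of how a code generator might lower `A = B + C` for each style. It is a toy model under stated assumptions, not either of Dolphin's JITs: the function names are hypothetical, registers are plain strings, and the output is assembly-like text rather than machine code.

```cpp
#include <string>
#include <vector>

// Hypothetical lowering of "dst = lhs + rhs" to a two-operand form, where each
// instruction overwrites its first operand.
std::vector<std::string> LowerAddTwoOperand(const std::string& dst, const std::string& lhs,
                                            const std::string& rhs)
{
  if (dst == lhs)
    return {"add " + dst + ", " + rhs};  // destination already holds lhs
  if (dst == rhs)
    return {"add " + dst + ", " + lhs};  // addition is commutative
  // General case: copy one input into the destination first, then add.
  return {"mov " + dst + ", " + lhs, "add " + dst + ", " + rhs};
}

// With a three-operand form, the same operation is always one instruction.
std::vector<std::string> LowerAddThreeOperand(const std::string& dst, const std::string& lhs,
                                              const std::string& rhs)
{
  return {"add " + dst + ", " + lhs + ", " + rhs};
}
```

Calling `LowerAddTwoOperand("eax", "ebx", "ecx")` yields a `mov` followed by an `add`, while `LowerAddThreeOperand("x0", "x1", "x2")` yields a single `add`. The extra copy only disappears in the two-operand case when the destination happens to match one of the inputs.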

As you can see, it makes emulating some instructions much cleaner and easier than on our x86-64 JIT. Alright, enough with the boring details. How does the M1 hardware perform when put up against some of the beasts of the GameCube and Wii library? We also included data from two computers featured in Progress Reports previously for comparison.

There's no denying it; macOS M1 hardware kicks some serious ass. It absolutely obliterates a two-and-a-half-year-old Intel MacBook Pro that was over three times its price, all while keeping within arm's reach of a powerful desktop computer. We were so impressed, we decided to make a second graph to express it.

The efficiency is almost literally off the chart. Compared to an absolute monstrosity of a desktop PC, it uses less than 1/10th of the energy while providing ~65% of the performance, which works out to more than six times the performance per unit of energy. And the poor Intel MacBook Pro just can't compare.

Taking Things a (Lock)Step Further

After doing strenuous performance testing on the macOS M1 and its Apple Silicon, it was clear that it was powerful. The problem is that if you give developers a new toy, they eventually decide to push things further and further. This was the first time we got to see Dolphin's AArch64 JIT really stretch its legs on something other than a phone or tablet with an ultra-aggressive governor that's also limited by graphics drivers. What is the absolute worst idea that we could come up with given this newfound power? Netplay.

This was the real test to see if the AArch64 JIT and x86-64 JIT were truly equals. We couldn't exactly test this before because the Android GUI lacks netplay support, but macOS runs the desktop version with no compromises. That includes having full netplay support. Now, testing this was mostly a joke because there are tons of differences between the JITs, everything from instruction coverage to known rounding errors. The chances of this working were next to zero. But there was no reason to stop and think if we should - technology had made it so we could.

Sometimes testing yields unexpected results!

And it actually worked! We just can't be certain exactly how well yet due to limited testing. Every single game we've tested on netplay so far has managed to synchronize, albeit with Dolphin's desync checker giving a false positive. Testers have tried everything from Super Smash Bros. Melee and Mario Party 5 to things like spectating The Legend of Zelda: The Wind Waker. All of the sessions stayed in sync.

This might not be true for all games. Up until earlier this month, games like Mario Kart: Double Dash!!, F-Zero GX, and Mario Kart Wii would immediately desync due to physics differences. Thanks to the work of JosJuice, those rounding bugs in the AArch64 JIT and interpreter (...we'll get to that in the Progress Report) are now fixed, meaning these games should at least have a chance to sync on netplay.
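Why a single rounding bug is enough to break netplay can be shown with a toy desync check. This is a hypothetical sketch, not Dolphin's actual desync checker: `HashState` and `InSync` are made-up names, the "state" is just a list of floats, and the hash is a plain 64-bit FNV-1a over their raw bit patterns.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit floats");

// Hashes the raw bit patterns of the floats with 64-bit FNV-1a, so a difference
// of even one bit in one value produces a different hash.
std::uint64_t HashState(const std::vector<float>& state)
{
  std::uint64_t hash = 14695981039346656037ull;  // FNV-1a offset basis
  for (float value : state)
  {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    for (int shift = 0; shift < 32; shift += 8)
    {
      hash ^= (bits >> shift) & 0xFF;
      hash *= 1099511628211ull;  // FNV-1a prime
    }
  }
  return hash;
}

// Two peers are considered in sync when their state hashes match.
bool InSync(const std::vector<float>& local, const std::vector<float>& remote)
{
  return HashState(local) == HashState(remote);
}
```

If one peer's value is off by a single unit in the last place, for instance `std::nextafter(2.5f, 3.0f)` instead of `2.5f`, `InSync` reports a mismatch. In a lockstep session, a difference that small in a physics calculation keeps feeding into later frames, which is why the two JITs have to round identically.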

Because our testers' game libraries are limited, we don't have a great idea of which games will work and which are problematic. As a stress test, Techjar and Skyler played the Super Mario Sunshine Co-op Mod. The physics calculations in Super Mario Sunshine are extremely sensitive to CPU rounding bugs, so it provided a tough test for both JITs. Oh yeah, they also enabled the 60 FPS hack just to make things even more interesting.

Not only did the games sync up, the MacBook Air M1 was able to handle Super Mario Sunshine's 60 FPS hack.

Everyone knowledgeable on Dolphin's JITs thought that cross-JIT netplay would be impossible, at least without tons of dedicated fixes. Yet here we are, able to experience it first hand. And it can only get better from here, as we are now able to monitor and test JIT determinism on netplay. While you might be excited to dive right in, it's important to note that we were only able to test a few games and we have no idea what compatibility will look like when unleashed on the wider library.

Note: Yes, we're aware that Windows and Linux AArch64 devices existed before the M1. There was no allure to testing netplay on those because they could not run Dolphin reasonably. We really didn't expect this to work or we probably would have tried it sooner.

In Conclusion

There's little else we can say: the M1 hardware is fantastic, and higher tiers are on the way promising even better performance. But what we have is already efficient, powerful, and gives us a mainstream AArch64 device that isn't Android and uses our AArch64 JIT to its fullest potential. The only big downside is the proprietary graphics API present in macOS that prevents us from using the latest versions of OpenGL and forces us to use MoltenVK in order to take advantage of Vulkan. That is a very small price to pay to get a glimpse at some really cool hardware that redefines what an ARM processor can do. There's undeniable excitement to see how much further the next generation of AArch64 hardware can go.

EDITOR'S NOTE: A small error was noticed in our 9900K performance testing. This has been corrected. However, the differences are very minor and do not affect our conclusion.

You can continue the discussion in the forum thread for this article.
