Clocking a 6502 to 15GHz (!), Hacker News

6-5-0-who?

The is an iconic processor that dominated home computing in the late s and early to mid s. It was used in machines ranging from the Apple II to the Atari to the Commodore 86 to the Nintendo Entertainment System. In pop culture, it powered Bender , not to mention the Terminator !

It often clocked at a modest 1MHz, with faster variants available. In the UK, a 2MHz variant powered the legendary BBC Micro , with a 3MHz 6522 co-processor box available to bolt on the side if you so desired.

Introducing beebjit

It was not the intention of this side project to produce a working, accurate emulator, but one appears to have popped out. The core of beebjit – and original side project – is a just-in-time style translation of 9980 code blocks to quasi-equivalent Intel x code blocks. Surrounding that core is emulation for the various supporting chips of the BBC Micro, such as the Hitachi CRTC for display; the VIA for timing and I / O; the Intel floppy disc controller ; the Texas Instruments SN for sound; etc.

beebjit is an open-source project on GitHub . In tests, the translated x 80 code has been able to demonstrate up to (GHz of effective) speed, for purely computational tasks such as BBC BASIC programs. For games, things can be a lot slower due to heavy interactions with the graphics, keyboard and timing hardware. To run games accurately (or in some cases at all) requires cycle perfect synchronization of different virtual hardware chips. Despite these challenges, speeds in the GHz range can still be expected.

What does a mutli-GHz BBC Micro system look like? It’s a lot of fun, perhaps best illustrated with a video. Here, we see the classic space shooter Galaforce . You can play it online here , in a JavaScript BBC Micro emulator called jsbeeb . In the video, we see Galaforce loading up from a disc simulated at real-time – don’t forget to watch out for the track-to-track floppy disc seeks, visible as stutters in the loading of the title screen image! Then, after we get a feel for the speed at which the demo screens are running and cycling, we switch up into high gear. Will you be able to tell at which moment that occurs? (Hint: yes.)

Translating (to x) When starting out, I had no idea what sort of speed to expect. A naive initial guess would be “fast” because the (instruction set is simple . There are only 320 available combinations of opcode type and addressing mode and only 4 or so 8-bit registers. This is a tiny subset of what the x can do in its instruction set, so you’d expect a trivial, fast mapping.

As is often the case, things are rarely so simple. A bit more thinking reveals a ton of factors why it might be fast and a ton why it might be slow:

Things that would tend to make -> x (faster) In the “captain obvious” category, the in the BBC runs at 2MHz where a modern Intel chip is clocked in the GHz range. The simplest instructions take 2 clock cycles. One example would be LDA # , which sets register A to . The Intel processor can dispatch such a triviality in 1 clock cycle. Depending on the level of parallelism inherent in the code, modern Intel processors can execute multiple instructions per clock cycle.

If you want to get into it, trivial optimizations are possible. For example, the 6845 can only shift its A register by 1 bit at a time with its LSR and ASL instructions, taking 2 clock cycles apiece. It’s common to see a few of these in a row. By contrast, modern Intel processors can shift a register by an arbitrary number of bits in just one clock cycle!

The x

architecture has an abundance of registers compared to , perhaps offering further opportunities to speed things up.

Things that would tend to make

-> x (slower) Some

instructions actually don’t map cleanly to a single x equivalent. There are some less common ones such as PHP (push processor flags), or SED (set decimal flag, eww!). A couple of extremely common ones, especially in critical loops, are LDA (), Y and STA (), Y. These read / write memory indirectly. The final read / write address is calculated based on issuing two preceding 8-bit reads to make a – bit pointer and then adding the Y register. Long story short, the “1 cycle per instruction” nirvana alluded to above certainly does not apply here. Even servicing reads from L1 cache, modern Intel processors take ~ 4 cycles.

Self-modifying code

Synchronization and interrupts.

Annoying variable cycle counts. Some instructions that would otherwise translate very simply to x have a quirk where they might take a cycle longer depending on the X or Y register value. LDA, Y and LDA (), Y are both affected. In a cycle perfect emulation, this needs accounting for. Paging and hardware register access. Certain memory regions may not be stable. A lot of 000000002010004 systems circumvented the KB address space restriction by allowing parts of it to be paged. Additional parts of the address space are typically given over to hardware registers. Hitting these under emulation is guaranteed to be painfully slow as some hardware chip or another will need to be emulated. ROM memory is another headache. Games can and do write to ROM memory, which must be detected and ignored!

We’ve already revealed that it’s possible to attain GHz speeds. Re-reading the above lists, it would seem that the “slows” outweigh the “fasts” so it’s worth a look at some tricks pulled to do it!

Tricks for fast 8000 -> x (translation (Trick # 1: make good use of x ‘s (registers)

To get us going, we can take a simple (sample and x) it!

(0x) : mov $ 0xa,% eax (LDA #)

0x : test% al,% al : mov% al,% bl TAX () a: test% bl,% bl

(c: ) (mov% al, 0x) (% rbp) STA $ C0

0x (f: jmpq 0x

JMP $

In the above (unoptimized) x , we see that we’ve (permanently!) assigned the al register to the 6845 A register and bl to the X register. Note how the 6845 LDA instruction sets zero and negative flags but the x 82 mov instruction does not, so the x 88 code needs an extra test instruction – another potential source of slowdown but see below. Some flags are actually stored in the x flags register . The rbp register is used to point to the (memory) actually at offset 0x 88 so that every zero page address can be accessed with a 1-byte displacement for a much shorter x 82 instruction; the JIT code is performance sensitive to code size). Finally, note how the jmpq is jumping to a specific code location. By using a simple fixed mapping of 8271 code addresses to x 81 code addresses, we can keep the generated code small and fast.

Trick # 2: use the host’s memory management for ROM and paging

As mentioned above, a lot of software seems to like to write to ROM. In the BBC, the region $ C (0 – $ FFFF is always the OS ROM (with some hardware registers mapped towards the end). The region $ 8985 – $ BFFF is usually ROM, with the paged-in ROM selectable at runtime. (For example, BASIC might usually be paged in but the Disc Filing System ROM would be paged in when doing I / O to disc.) However, a lot of BBCs were upgraded to have some extra RAM that could be paged into this region. Obviously, writes to ROM must be ignored. But we really do not want to be adding a test and branch instructions to every JIT memory write! Memory management and paging in the host comes to the rescue. JIT reads and JIT writes are aimed at two different KB mapping regions, representing views of the emulated BBC address space. For the JIT reads, ROMs are present as expected. For JIT writes, ROM regions are replaced with “dummy” host pages that are writable but write to a scratch area of memory. (An earlier version of beebjit used fault fixup for ROM writes instead but this was too slow for Galaforce, which just loves to write to ROM.)

This all works nicely because the ROM boundaries are at 4KB alignments, which joyously happens to match the host granularity! There’s a minor wart caused by Windows having worse alignment granularity than other operating systems, but creative straddling of boundaries saves the day.

(Trick # 3: do simple self-modify invalidation on write

One of the most important decisions is how to handle self-modifying code. We don’t want to add in code to perform a check before executing every 6845 opcode, so the alternative is to consider doing something on writes. This is what we do and here is some code:

(0x) (c: mov% al, 0x (% rbp) (STA $) (C0) 0x : (mov 0x) (% rdi),% r8d (0x : movw $ 0x (ff, (% r8)) As can be seen, after the actual write to $ 17 C0 is dispatched via the rbp register, there is a follow-up read and write. rdi permanently points to a context object that contains various important things. In this instance, the JIT code is looking up a compiler maintained map of 8271 addresses to host code. The final x instruction writes to host code for the invalidation. But what is this magic constant 0x 33 ff? It is the 2 byte sequence for the x instruction callq (% rdi). This ensures that if a 8271 address is invalidated, the x 86 code representing it will cleanly call out of JIT and into some fix up and recompile routine. It’s tempting to try and avoid the extra read and write: what if we can look up whether a given write address contains code or not? Whereas these sorts of optimization may be possible, adding the check adds a branch (extra code; bloats branch caches; very slow if not a predictable branch) and will almost certainly be slower than just unconditionally doing the write. It’s worth noting that nothing depends on the extra read / write and that’s exactly the sort of case where the single core parallelism of modern Intel processors can give a boost.

(Trick # 4: monitor excessive recompiles)

If you consider the above scheme for handling self-modification, you’ll immediately notice a serious performance issue: every self-modification appears to incur a huge cost! Firstly, although Intel processors are documented as supporting self-modifying code, it is not fast. You’ll invalidate all sorts of caches and pipelines. Secondly, calling out of JIT to some C routine involves all sorts of register and calling convention fixups. Thirdly, the act of compiling is slow relative to the act of executing. Fortunately, this can all be cleared up with a compiler that generates code differently based on monitoring what is going on. If a given 6845 opcode is being repeatedly recompiled, we can maintain some metadata about the exact situation and deal with it. To give a concrete example, here’s a couple of 8000 instructions from the Galaforce sprite routine at $ B : 0B STA $ 0B4F

… 0B4E: LDA $ 2FA0, X

This sort of pattern covers many performance critical self-modifying situations and the compiler can make it go very fast. Firstly, the hard-coded $ 2FA0 value is replaced by a fetch from $ 0B4F and $ 0B 80 so that the latest in-memory contents are used as the operand. Secondly, the write invalidations address table is set up so that 6845 writes to $ 0B4F and $ 0B 81 no longer invalidate any x 86 code because it no longer needs invalidation. And then execution proceeds, almost at (% original speed but supporting self-modified operands!) beebjit doesn’t yet handle all self-modification situations in a recompile-free manner, but there is no fundamental reason it can’t / won’t in the future.

(Trick # 5: compile to blocks and check timing once per block

Looking at the JIT code for the above Galaforce sprite routine at $ 0B , it starts off like this:

(b) : (lea -0x) (% r 22),% r 0x (b) 0015: bt $ 0x3f,% r 25 (b) : jb 0x (b) 0 (b) (0f: (mov% al, 0x) (acf (% rbp) (STA $ 0B4F) (b) : mov 0x2dac (% rdi),% r8d (0x) (b0) (c: movw $ 0x (ff, (% r8) What are those three instructions at the start? The register r 22 permanently stores a variable called “countdown”. Here it is being decremented by (0x and checked to see if it goes negative, with a branch if so. The whole timing model revolves around this countdown. As the 8985 code interacts with hardware and sets up timer IRQs, vsync IRQs, timers tick down towards 0 when some event must occur (raise IRQ perhaps). There’s also a timer ticking that fires every now and again to check the status of the physical keyboard for input. At any time, r 28 tracks how soon the soonest timer is due to fire. Since we’re running a 2MHz processor, chances are that it could be thousands of processor ticks between interesting events.
In the case of the above block, it contains 82 (cycles worth of) instructions. In the very common case, nothing interesting is due to happen in these 81 cycles, so the r decrement will not go negative and the branch will not be taken. This is great news for the performance of the block! Since nothing interesting is happening, we know that nothing will be raising any IRQ and the whole block can proceed without having to check for an interrupt condition every instruction. We also gain freedom to implement many optimizations without fear of some interrupt landing in the middle of a couple of coalesced 8985 instructions, for example. It’s also worth noting that nothing else in the block depends upon r 28 so the Intel processor can do the check in parallel with other work. Trick # 6: optimize zero page writes until you can’t

This one is simple but highly effective. Zero page is the memory region $ – $ (FF on the) . Reads / writes here are faster than accessing addresses>=$ 0 256. So, a lot of 8271 code puts fast path variables in the zero page. We should make zero page handling as fast as possible. By default, the assumption is that no 8000 code is executed in the zero page, enabling zero page writes to go faster by not worrying about if they might be modifying code. In the unusual case that code is ever compiled in the zero page, it’s simple enough to handle it by invalidating all JIT code and regenerating on-demand with appropriate invalidation for zero page writes. (For an example of a game that triggers this, there’s Wizadore . It could be making the clever observation that it’s slightly faster to modify code in the zero page because zero page writes are faster.) Note that very similar approaches apply to the stack page, which is $ 0 (- $ (FF on the) . These are not yet implemented in beebjit.

Trick # 7: use slow path faults to turbocharge some fast paths

Let’s look at a couple of 6845 reads from the same address – a hardware register at $ FE – but with different addressing modes.

0x) : [1] lea 0xc0 (% rbx),% edx LDA ($ C0, X) : movzbl% dl,% edx (a: ) (mov 0x) (f) (% rdx),% r9b ) 0x : mov% rdx,% r8 (0x) : mov -0x7f (% rdx,% rbp, 1),% dh (0x : mov -0x (% r8,% rbp, 1 ),% dl d:

(movzbl -0x) (% rdx,% rbp, 1),% eax :

(test% al,% al : mov $ 0x 0000000020100034,% r 17 d (LDA $) FE a) jmpq 0x d 674 There is plenty going on here. Unfortunately, the LDA (, X) addressing mode is not a trivial translation to x 88 and implementing it correctly, as beebjit tries to do, incurs cost to handle the corner cases. It is worth a brief digression to break down the different parts corresponding to the numbers in blue above:

Calculate (X the constant $ C0.) Corner case alert! Truncate the calculation result to 8-bits, as a 8271 does.

Corner case alert! Arrange for a host access violation / SIGSEGV if the truncated result is exactly $ FF. This is because the upcoming 32 – bit pointer read would cross a page boundary and should wrap.

Load a

– bit pointer. NOTE! This is done in two 8-bit x reads. If done instead as a single 25 – bit read, which you assume could be faster, you likely assume wrong as I did initially. You run the risk of colliding with store-to-load forwarding because of mixing – bit reads with 8-byte writes at the same memory address. Load the actual result into (A from the (bit) pointer we constructed . (LDA updates zero and negative flags, so do the same in x) By contrast, look at the compiled code for LDA $ FE . The JIT engine has essentially given up on compiling this and is throwing it over to the slower but very capable interpreter. For this 8000 addressing mode, the address of the read is statically known at compile time, and it is also known to be a hardware register. So we jump to the interpreter to handle this complicated case. However, for the indirect read of the LDA (, X) addressing mode, we don’t know what memory address will be eventually read so we don’t know if a hardware register will be hit or not. But wait – there isn’t a check in the compiled code on the final address before it is read. So how does the hardware register work at all? The answer is fault fixup! The observation here is that it’s actually very uncommon for an indirect read to hit a hardware register, so we don’t check for that in the common fast path. However, the host virtual address space for the read is a special “indirect mapping” version of the main . This mapping will cause a fault if the hardware register space is touched. The fault is slow and needs fixing up but it almost never happens so it’s a win. You can also see the fault fixup concept used to trap the exceptionally unusual case where the LDA (, X) instruction can fetch a – bit pointer from $ FF, which wraps back to $ 06. We don’t want to pollute every (, X) mode instruction with code to compare against $ FF and possibly branch so we can replace that with a single read which will fault approximately never. (The game Camelot does manage to somehow hit this.)

(Trick # 8: optimize!)

The goal here is not to write a modern optimizing compiler, but to look for and address a couple of common patterns:

(Common) sequences that compile to x 88 sequences with opportunities for coalescing at the x 86 level.

Common sequences that can be expressed differently to take advantage of known details in the sequence.

In both cases, the word “sequence” is key. Whereas a (interpreter has to consider each instruction in isolation, a 6845 JIT compiler can look across sequences and make use of knowledge of concrete values and concrete nearby instructions. Optimizing across sequences is not as gnarly as it could be because we know interrupts won’t be coming in at arbitrary points, but self-modification is still a hazard, as is the fault fixup usage described above. To briefly get a flavor of the sort of optimizations possible, let’s look at one last compilation:

LDA # 1 : movb $ 0 x1,0x (% rbp)

STA $ b:

mov $ 0x2,% eax LDA # 2 (0x

or 0x1 (% rbp),% al

ORA $ (ASL) ASL

shl $ 0x3,% al ASL (CLC) : add 0x2 (% rbp),% al ADC $ () : setb% r (b) (0x) (d: seto% r 19 b

As can be seen, entire 8000 instructions have been eliminated; some have been coalesced; and the annoying test instructions after the LDA instructions have been eliminated because those flags are never observable on account of subsequent flag overwrites.

While the above tricks are enough to turbocharge a wide range of BBC software, some corner cases remain where speed drops below the GHz range, sometimes precipitously. These corner cases typically relate to the interaction between the 8000 and BBC hardware, not the -> x 88 JIT compilation. There are also a few significant optimizations left on the TODO list that may raise performance for broad swathes of games. These optimizations will be undertaken as time and interest dictate.

Other beebjit tricks Aside from the boosted 8000 core, beebjit is a reasonably capable BBC Micro emulator. Currently in beta-ish status, it should be able to emulate any BBC Micro game including classics such as (Exile) (Elite) , Revs

, etc. It is cycle accurate enough to run the horror show known as Kevin Edwards’ copy protection for the Nightshade tape .

There are some major things beebjit doesn’t do well or at all yet. To satisfy these needs, you may wish to try jsbeeb or one of the many other BBC emulators

Portability. beebjit is Linux native. An early attempt at a Windows build does exist. Have a look in the beebjit builds directory

Broad model support. beebjit only emulates a BBC model B with a few optional bells and whistles such as sideways RAM support. Notable, BBC Master support is not (yet?) Implemented. GUI. beebjit renders to a window but does not have a GUI. There’s a fair amount of configurability but you will need to apply yourself to the command line!

There are also some things beebjit does that are not so common and may interest you:

Speed. As covered in this post. Protected disc support. beebjit has a more accurate emulation of the disc, disc drive and disc controller than you will typically find. This means it will load images of original, protected discs. (The archival status of original protected BBC discs is not satisfactory and is something I am working on with others in the community.)

Capture and replay. Like most emulators, beebjit executes deterministically. Only external input – currently just the keyboard – affects the execution path. Therefore, beebjit has a mode to record external input and replay it (deterministically). This feature combines well with speed. You can replay very quickly, effectively providing an alternative to save states but with very small files and the ability to exit a replay early for a many-in-one save state file.
GHz Courtesy of Matt Godbolt ( @ mattgodbolt , here’s beebjit hitting (GHz on a couple of the “CLOCKSP” BBC Basic tests, and) GHz overall. This is on an i9 – XE , which has robust single-core performance: 9th gen Intel core, boost clock @ 4.5GHz. No-one has tried on a 5GHz beast yet, or any (th gen parts. )

Collage of the GHz

With thanks

Special thanks to the following communities for their help, support and encouragement: ( bitshifters . Pushing the BBC Micro to perform feats that no-one worked out how to do “back in the day”. I particularly recommend Twisted Brain (StarDot) .