8-bit optimization for Z80 and 6502 in 2026

by Oscar Toledo G. Mar/26/2026

Recently, I made CVBasic, a BASIC compiler for the Z80, 6502, and TMS9900 processors. Besides the challenges of code optimization, there is also the problem of libraries. The Z80 and the TMS9900 have 16-bit addition and subtraction instructions, while the 6502 can only do these with a sequence of instructions. As for multiplication and division, only the TMS9900 supports both operations directly.

This means that if a program for the Z80 or the 6502 requires multiplication or division, a subroutine must be called to do the work. The operations are too complex to be inlined.

Short but slow

For the initial release of CVBasic, I prepared reasonable subroutines for multiplication and division. For example, here is the original code for the multiplication routine:


	; Fast 16-bit multiplication.
_mul16:
	ld b,h	; 5
	ld c,l	; 5
	ld a,16	;  8
	ld hl,0	; 11
.1:
	srl d	; 10
	rr e	; 10
	jr nc,.2	; 8/13
	add hl,bc	; 12
.2:	sla c	; 10
	rl b	; 10
	dec a	; 5
	jp nz,.1	; 11
	ret	; 11

This subroutine does the operation HL = HL x DE. It loops 16 times, each time shifting a multiplier bit and if it is one, it adds the multiplicand. The multiplicand is shifted each time to the left to account for the different value at each bit position.
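
As a rough illustration, the shift-and-add loop can be modeled in a few lines of Python. This is only a sketch of the technique, not the library code; the function name mul16_shift_add is made up for this example, and the 16-bit wraparound is an assumption about how overflow behaves.

```python
def mul16_shift_add(hl, de):
    """Model of HL = HL x DE with 16 shift-and-add steps (16-bit wraparound)."""
    multiplicand = hl & 0xFFFF    # plays the role of BC
    multiplier = de & 0xFFFF      # plays the role of DE
    result = 0                    # plays the role of HL
    for _ in range(16):
        bit = multiplier & 1      # bit shifted out by SRL D / RR E
        multiplier >>= 1
        if bit:
            result = (result + multiplicand) & 0xFFFF   # ADD HL,BC
        multiplicand = (multiplicand << 1) & 0xFFFF     # SLA C / RL B
    return result


print(mul16_shift_add(300, 7))
```

The loop always runs 16 times, even when the multiplier has already become zero.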

It is a small subroutine, and it looks efficient. However, it is slow. For starters, it always runs 16 times. Let's suppose the worst case where all multiplier bits are 1 (that is, DE = $ffff). The cycles used are 5 + 5 + 8 + 11 + 16 * (10 + 10 + 8 + 12 + 10 + 10 + 5 + 11) + 11 for a total of 1256 cycles.

The cycle counts for each instruction are taken from Grauw.nl (see the Z80+M1 column). CVBasic was originally written for MSX and Colecovision, and both platforms add one wait state to each M1 cycle.

Why is 1256 cycles too much? The MSX and Colecovision are based on the same video processor, and typically render 60 frames per second on a TV screen. The Z80 processor runs at 3.58 MHz. This means the Z80 runs for approximately 59659 cycles in each video frame.

These sixty thousand cycles are all the time available to a game before the next video frame starts to be rendered. The simplest Z80 instructions, such as NOP or LD A,B, use 5 cycles. This means a maximum of 11931 instructions per frame, and some instructions are even slower. For example, ADD A,5 uses 8 cycles, and LD HL,5000 uses 11 cycles.

Now 59659 / 1256 cycles = 47 multiplications per frame. This is a very low value, and it doesn't even take into account VRAM updating, video interrupt overhead, a music player in the background, and the game logic.
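
The frame budget arithmetic can be reproduced with a short Python sketch. The exact clock value of 3,579,545 Hz is my assumption for the "3.58" figure, and the variable names are only illustrative.

```python
# Assumption: the Z80 clock is the NTSC-derived 3,579,545 Hz.
CPU_HZ = 3_579_545
FRAMES_PER_SECOND = 60

# Worst case of the original routine (every multiplier bit is 1).
setup = 5 + 5 + 8 + 11
loop_pass = 10 + 10 + 8 + 12 + 10 + 10 + 5 + 11
worst_case = setup + 16 * loop_pass + 11          # + 11 for the final RET

cycles_per_frame = CPU_HZ // FRAMES_PER_SECOND
max_instructions = cycles_per_frame // 5          # fastest instruction is 5 cycles
max_multiplications = cycles_per_frame // worst_case

print(worst_case, cycles_per_frame, max_instructions, max_multiplications)
```

Under these assumptions the script reproduces the figures used in the text: 1256 cycles, 59659 cycles per frame, 11931 instructions, and 47 multiplications.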

Of course, almost no Z80 games do real multiplication operations. Instead, most games resort to precalculated tables, or to power-of-two operations done with bit shifts.

Early exit

However, there are a few cases where this operation is worth using, and for these a better subroutine is needed.

My first idea was to unroll the loop to save some cycles, exit early if the multiplier becomes zero, and swap the operands so the multiplier is always the smaller value.


; Fast 16-bit multiplication.
_mul16:
	or a	; 5
	sbc hl,de	; 17
	add hl,de	; 12
	jr nc,$+3	; 8/13
	ex de,hl	; 5 Smallest operand in DE.

	ld b,h	; 5
	ld c,l	; 5
	ld hl,0	; 11
.1:
	srl d	; 10
	rr e	; 10
	jp nc,$+4	; 11
	add hl,bc	; 12
	sla c	; 10
	rl b	; 10
	srl d	; 10
	rr e	; 10
	jp nc,$+4	; 11
	add hl,bc	; 12
	sla c	; 10
	rl b	; 10
	ld a,d	; 5
	or e	; 5
	jp nz,.1	; 11
	ret	; 11

Let's suppose that it will multiply only by zero, one, two or three. The prologue will take 5+17+12+13+5+5+11 = 68 cycles.
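
The operand swap in the prologue depends only on the carry flag left by the subtract-then-add pair. Here is a minimal Python sketch of that comparison; the function name order_operands is hypothetical, and the carry is modeled as a plain overflow test rather than a real flag register.

```python
def order_operands(hl, de):
    """Model of OR A / SBC HL,DE / ADD HL,DE / JR NC,$+3 / EX DE,HL."""
    diff = (hl - de) & 0xFFFF     # SBC HL,DE with carry cleared by OR A
    total = diff + de             # ADD HL,DE restores the original HL
    carry = total > 0xFFFF        # carry is set only when HL was below DE
    hl = total & 0xFFFF
    if carry:
        hl, de = de, hl           # EX DE,HL puts the smaller operand in DE
    return hl, de


print(order_operands(5, 1000))
print(order_operands(1000, 5))
```

Going by the cycle comments in the listing, both paths cost the same: a taken JR is 13 cycles, and a JR that falls through (8) plus EX DE,HL (5) is also 13 cycles, so the 68-cycle prologue holds whether or not the operands are swapped.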

The cycle table for the different operands:

by 0 = 10 + 10 + 11 + 10 + 10 + 10 + 10 + 11 + 10 + 10 + 5 + 5 + 11 + 11 = 134 cycles.
by 1 = 10 + 10 + 11 + 12 + 10 + 10 + 10 + 10 + 11 + 10 + 10 + 5 + 5 + 11 + 11 = 146 cycles.
by 2 = 10 + 10 + 11 + 10 + 10 + 10 + 10 + 11 + 12 + 10 + 10 + 5 + 5 + 11 + 11 = 146 cycles.
by 3 = 10 + 10 + 11 + 12 + 10 + 10 + 10 + 10 + 11 + 12 + 10 + 10 + 5 + 5 + 11 + 11 = 158 cycles.
most complicated case = 8 * 147 + 11 = 1187 cycles.

After adding the prologue, these are 202, 214, 214, and 226 cycles. This is a six-fold improvement over the 1256-cycle worst case of the previous subroutine, which always runs all 16 iterations. The most complicated case takes 1255 cycles (1 cycle less than the simpler routine).
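
These totals can be double-checked with a small Python cycle counter. It is a sketch that follows the cycle comments in the listing; the function name early_exit_cycles and the default 68-cycle prologue are assumptions for this example.

```python
def early_exit_cycles(multiplier, prologue=68):
    """Cycles used by the early-exit routine for a given multiplier in DE."""
    de = multiplier & 0xFFFF
    cycles = prologue
    while True:
        for _ in range(2):                # the loop body is unrolled twice
            cycles += 10 + 10 + 11        # SRL D / RR E / JP NC
            if de & 1:
                cycles += 12              # ADD HL,BC
            de >>= 1
            cycles += 10 + 10             # SLA C / RL B
        cycles += 5 + 5 + 11              # LD A,D / OR E / JP NZ
        if de == 0:
            return cycles + 11            # RET


for value in (0, 1, 2, 3, 0xFFFF):
    print(value, early_exit_cycles(value))
```

The counter gives 202, 214, 214, and 226 cycles for multipliers 0 to 3, and 1255 cycles for $ffff.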

Strength reduction

So this subroutine fares almost the same as the original when the multiplier is a big number. What about a more optimized subroutine? I noticed that the routine was doing 16-bit operations for the high byte of the multiplier, calculating bits that were never used. For example, $100 x $ff = $ff00, but $100 x $100 = $10000, which doesn't fit in 16 bits. This means that for the high byte of the multiplier, only the low byte of the multiplicand contributes to the result, and only to its high byte.

I divided the loop into two parts, one for the high-byte that uses only 8-bit operations, and one for the low-byte having the early exit code.


_mul16:
	or a		; 5
	sbc hl,de	; 17
	add hl,de	; 12
	jr nc,$+3	; 13/8
	ex de,hl	; 5 Smallest operand in DE.

	ld b,h		; 5
	ld c,l		; 5
	ld hl,0		; 11
	ld a,d	; 5
	or a		; 5 High-byte is zero?
	jp z,.2		; 11, Yes, jump.
	xor a		; 5
	sla d		; 10
	jp nc,$+4	; 11
	add a,c		; 5
	add a,a		; 5
	sla d		; 10
	jp nc,$+4	; 11
	add a,c		; 5
	add a,a		; 5
	sla d		; 10
	jp nc,$+4	; 11
	add a,c		; 5
	add a,a		; 5
	sla d		; 10
	jp nc,$+4	; 11
	add a,c		; 5
	add a,a		; 5
	sla d		; 10
	jp nc,$+4	; 11
	add a,c		; 5
	add a,a		; 5
	sla d		; 10
	jp nc,$+4	; 11
	add a,c		; 5
	add a,a		; 5
	sla d		; 10
	jp nc,$+4	; 11
	add a,c		; 5
	add a,a		; 5
	sla d		; 10
	jp nc,$+4	; 11
	add a,c		; 5
	ld h,a		; 5
.2:			;
.3:	srl e		; 10
	jp nc,$+4	; 11
	add hl,bc	; 12
	ret z		; 6/12
	sla c		; 10
	rl b		; 10
	srl e		; 10
	jp nc,$+4	; 11
	add hl,bc	; 12
	ret z		; 6/12
	sla c		; 10
	rl b		; 10
	jp .3		; 11
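
To convince myself that the split is arithmetically sound, the listing above can be modeled in Python. This is only a sketch of the idea: the function name mul16_split is hypothetical, registers are plain integers, and the result is compared against the ordinary product truncated to 16 bits.

```python
def mul16_split(hl, de):
    """Model of the split routine: 8-bit high-byte part, early-exit low-byte part."""
    hl &= 0xFFFF
    de &= 0xFFFF
    if hl < de:
        hl, de = de, hl                   # smallest operand in DE
    bc = hl
    d, e = de >> 8, de & 0xFF
    c = bc & 0xFF
    result = 0
    if d != 0:
        a = 0
        for i in range(8):
            carry = d >> 7                # SLA D
            d = (d << 1) & 0xFF
            if carry:
                a = (a + c) & 0xFF        # ADD A,C
            if i < 7:
                a = (a + a) & 0xFF        # ADD A,A
        result = a << 8                   # LD H,A (L is still zero)
    while True:
        carry = e & 1                     # SRL E
        e >>= 1
        if carry:
            result = (result + bc) & 0xFFFF   # ADD HL,BC
        if e == 0:
            return result                 # RET Z
        bc = (bc << 1) & 0xFFFF           # SLA C / RL B


samples = (0, 1, 2, 3, 255, 256, 257, 1000, 0x8000, 0xFFFF)
print(all(mul16_split(x, y) == (x * y) & 0xFFFF for x in samples for y in samples))
```

The high-byte part never touches B, which is why 8-bit additions into A are enough there.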

Let's calculate again the times for multiplying by 0, 1, 2 and 3:

Prologue cycles = 5+17+12+13+5+5+11+5+5+11 = 89 cycles.
by 0 = 10 + 11 + 12 = 33 cycles.
by 1 = 10 + 11 + 12 + 12 = 45 cycles.
by 2 = 10 + 11 + 6 + 10 + 10 + 10 + 11 + 12 + 12 = 92 cycles.
by 3 = 10 + 11 + 12 + 6 + 10 + 10 + 10 + 11 + 12 + 12 = 104 cycles.

Total cycles for each case: 122, 134, 181, and 193 cycles. This means the extra test for high-byte zero is compensated by the optimization.

Now for the most complicated case, both registers with the highest value 65535:

First part = 5 + 8 * (10 + 11 + 5 + 5) = 253 cycles.
Second part = 4 * (10 + 11 + 12 + 6 + 10 + 10 + 10 + 11 + 12 + 6 + 10 + 10 + 11) - (10 + 10 + 11) + 6 = 491 cycles.

The total is 89 + 253 + 491 = 833 cycles, about 34% fewer cycles than the original code (roughly 1.5 times faster).
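
The same kind of cycle counter can be written for the split routine. Again this is a Python sketch based on the cycle comments in the listing; the function name split_cycles and the default 89-cycle prologue are assumptions for this example.

```python
def split_cycles(multiplier, prologue=89):
    """Cycles used by the split routine for a given multiplier in DE."""
    d, e = (multiplier >> 8) & 0xFF, multiplier & 0xFF
    cycles = prologue
    if d:
        cycles += 5                       # XOR A
        for _ in range(8):
            cycles += 10 + 11             # SLA D / JP NC
            if d & 0x80:
                cycles += 5               # ADD A,C
            d = (d << 1) & 0xFF
            cycles += 5                   # ADD A,A (LD H,A after the last bit)
    steps = 0
    while True:
        cycles += 10 + 11                 # SRL E / JP NC
        if e & 1:
            cycles += 12                  # ADD HL,BC
        e >>= 1
        if e == 0:
            return cycles + 12            # RET Z taken
        cycles += 6 + 10 + 10             # RET Z not taken / SLA C / RL B
        steps += 1
        if steps % 2 == 0:
            cycles += 11                  # JP .3 after every second step


for value in (0, 1, 2, 3, 0xFFFF):
    print(value, split_cycles(value))
```

It gives 122, 134, 181, and 193 cycles for multipliers 0 to 3, and 833 cycles for $ffff.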

Given the good efficiency of multiplying by 0, 1, 2, and 3, it becomes a reasonable alternative to tables. In a real-world example, my game Metro Wars benefits greatly from this when calculating the origin positions for copying the stage background tiles for the pixel-by-pixel scrolling. It also leaves more time for the game loop, avoiding slowdowns when many enemies and bullets appear.

Of course, the subroutine is larger, but at 96 bytes versus the original 23 bytes, I think the speed is more important and it is a worthy addition to a programmer's toolkit.

On a side note, the _div16 routine apparently couldn't be optimized, until I discovered that if the high-byte of the dividend is zero, you can avoid executing half of the routine! Almost double speed-up.

Let's look at the 6502 routines

As I said before, CVBasic supports three different processors. One of these is the 6502. The 6502 library comes in two flavors: Creativision (using the same video processor as a Colecovision), and NES/Famicom (a completely different VDP).

Both have the _mul16 and _div16 routines. This is the _mul16 routine:


	; 16-bit multiplication.
_mul16:
	PLA
	STA result
	PLA
	STA result+1
	PLA
	STA temp2+1
	PLA
	STA temp2
	LDA result+1
	PHA
	LDA result
	PHA
	LDA #0
	STA result
	STA result+1
	LDX #15
.1:
	LSR temp2+1
	ROR temp2
	BCC .2
	LDA result
	CLC
	ADC temp
	STA result
	LDA result+1
	ADC temp+1
	STA result+1
.2:	ASL temp
	ROL temp+1
	DEX
	BPL .1
	LDA result
	LDY result+1
	RTS

The operation assumes one operand is already in memory (in temp), and pulls the other operand from the stack into temp2. Notice it uses result to temporarily save the return address.

The 6502 is a RISC-style processor. There are only 3 main registers: A (or accumulator), X and Y. Common instructions use 2 cycles, memory access instructions use 3 cycles, and indexed instructions use 4 or 5 cycles.

It takes the same approach of shifting the complete 16-bit value to calculate the result. However, the zero detection test would take too much time because the operand is in memory. Besides, this is already faster than the Z80 subroutine, because 6502 instructions take far fewer cycles even though the processor has a slower clock. For example, the INX instruction takes 2 cycles on the 6502, while INC on a Z80 register takes 4 cycles. This means that a 6502 running at half the clock speed (2 MHz is typical) is as fast as a Z80, or faster.

After a brief look at the main loop, I discovered the Y register is never used. The 6502 cannot do most operations directly with X or Y. Instead, I could use Y like memory, reading it with TYA (copying Y to A), and saving it with TAY.


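	LDA #0
	STA result	; Low byte of the product starts at zero.
	TAY		; High byte of the product is kept in Y.
	LDX #15		; 16 iterations (15 down to 0).
.1:
	LSR temp2+1	; Shift the next multiplier bit into carry.
	ROR temp2
	BCC .2		; Bit is zero, skip the addition.
	LDA result
	CLC
	ADC temp
	STA result
	TYA		; Get the high byte from Y instead of memory.
	ADC temp+1
	TAY		; Save the high byte back into Y.
.2:	ASL temp	; Multiplicand x 2 for the next bit position.
	ROL temp+1
	DEX
	BPL .1
	LDA result	; Result in A (low byte) and Y (high byte).
	RTS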

The new code is faster, and I'm still evaluating the possibility of unrolling the loop into two 8-bit loops, saving one instruction in each loop (3 cycles * 16 times = 48 cycles * 2 = total 96 cycles) at the cost of extra memory.
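
To put a rough number on the gain from keeping the high byte in Y, here is a small Python tally of the addition step. It assumes temp and result live in zero page (my assumption; absolute addressing would add one cycle to each memory access) and uses the standard NMOS 6502 timings for those instructions.

```python
# Assumed timings: zero-page LDA/STA/ADC take 3 cycles, CLC/TYA/TAY take 2.
CYCLES = {"LDA zp": 3, "STA zp": 3, "ADC zp": 3, "CLC": 2, "TYA": 2, "TAY": 2}

original_add = ["LDA zp", "CLC", "ADC zp", "STA zp", "LDA zp", "ADC zp", "STA zp"]
with_y_add = ["LDA zp", "CLC", "ADC zp", "STA zp", "TYA", "ADC zp", "TAY"]


def total(sequence):
    """Sum the cycles of a list of instruction names."""
    return sum(CYCLES[op] for op in sequence)


saved_per_set_bit = total(original_add) - total(with_y_add)
max_saved = 16 * saved_per_set_bit        # all 16 multiplier bits set

print(total(original_add), total(with_y_add), saved_per_set_bit, max_saved)
```

Under these assumptions the addition step drops from 20 to 18 cycles, so the saving grows with the number of one bits in the multiplier, up to 32 cycles inside the loop.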

I also optimized the division routine. Let's look only at its inner loop:


	LDX #15
.2:
	ROL temp2
	ROL temp2+1
	ROL result
	ROL result+1
	LDA result
	SEC
	SBC temp
	STA result
	LDA result+1
	SBC temp+1
	STA result+1
	BCS .3
	LDA result
	ADC temp
	STA result
	LDA result+1
	ADC temp+1
	STA result+1
	CLC
.3:
	DEX
	BPL .2

Again a loop running 16 times, shifting the dividend, and checking whether the divisor can be subtracted. Each time it can do a successful subtraction, it puts a one in the result, else it adds back the divisor to restore the value. This is a direct port of the Z80 subroutine (SBC HL,DE / JR NC / ADD HL,DE).

The Y register isn't used. The intermediate result can be calculated in A and Y, and only saved if the subtraction is successful:


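	LDX #15		; 16 iterations (15 down to 0).
.2:
	ROL temp2	; Shift in the previous quotient bit,
	ROL temp2+1	; and shift out the next dividend bit
	ROL result	; into the partial remainder.
	ROL result+1
	LDA result
	SEC
	SBC temp
	TAY		; Low byte of the trial subtraction in Y.
	LDA result+1
	SBC temp+1	; High byte of the trial subtraction in A.
	BCC .3		; Divisor doesn't fit, discard the trial.
	STY result	; Divisor fits, save the new remainder.
	STA result+1
.3:	DEX		; Carry is the quotient bit for the next ROL.
	BPL .2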

The low-byte is saved into the Y register, and the high-byte is kept in the accumulator. The carry flag does the comparison, and if the subtraction is possible, the Y and A values are saved into the result. result is effectively the remainder, while the actual result of the division is available in temp2. Technically, it is doing a comparison with subtraction, and the result is available if needed; a completely different way of thinking versus Z80.
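
The difference between the two division loops can be sketched in Python. Both functions are simplified models (the names are hypothetical, the quotient bit is set directly instead of travelling through the carry flag, and the divisor is assumed to be nonzero), but they show the restore-on-failure approach next to the compare-then-commit approach.

```python
def div16_restoring(dividend, divisor):
    """Always store the subtraction, then add the divisor back on failure."""
    quotient, remainder = dividend & 0xFFFF, 0
    for _ in range(16):
        remainder = (remainder << 1) | (quotient >> 15)
        quotient = (quotient << 1) & 0xFFFF
        remainder -= divisor              # result is written back every time
        if remainder < 0:
            remainder += divisor          # restore the previous value
        else:
            quotient |= 1
    return quotient, remainder


def div16_compare_commit(dividend, divisor):
    """Keep the trial subtraction in registers, store it only on success."""
    quotient, remainder = dividend & 0xFFFF, 0
    for _ in range(16):
        remainder = (remainder << 1) | (quotient >> 15)
        quotient = (quotient << 1) & 0xFFFF
        trial = remainder - divisor       # held in A and Y only
        if trial >= 0:
            remainder = trial             # STY result / STA result+1
            quotient |= 1
    return quotient, remainder


print(div16_restoring(1000, 7), div16_compare_commit(1000, 7))
```

Both models return the same quotient and remainder; the second one simply never needs the add-back step.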

The speed-up is significant.

I hope you have enjoyed this article. I could spend all day optimizing code, but other work needs to be done, and while writing this article I've seen other chances to optimize the code! This story never ends.

Edit (Mar/29/2026): For the 6502 target of CVBasic, I redesigned the calls to avoid passing one argument on the stack. This made the code shorter and faster, and I've implemented the double loop for _mul16, getting a 28% speed-up.

The source code for the CVBasic compiler and libraries is available at https://github.com/nanochess/cvbasic

