\section{Discussion}
\label{sec:discussion}

\begin{multicols}{2}

In this section, we discuss three optimisation vectors that have not yet been considered.
First, we look at subroutine calls, which are expensive in Thumb compared to ARM.
In \cref{sec:discussion:branches}, we consider branch optimisation,
	which concerns moving code blocks around to make as many branch instructions as possible narrow by minimising their offsets.
The last optimisation vector is about instructions that exist in Thumb but not in ARM.
These are discussed in \cref{sec:discussion:thumb-only}.

We also look at the results from \cref{sec:results} in more detail, in \cref{sec:discussion:matching}.

\subsection{Optimising subroutine calls}
\label{sec:discussion:subroutines}
In \cref{sec:storing-pc}, we saw that subroutine calls are slower and larger in Thumb than in ARM,
	because an extra \ual{add} instruction has to be inserted.
We discussed that a more efficient solution would exploit the link register:
	upon a \ual{bl} and \ual{blx} instruction, the link register is loaded with the address of the instruction after the branch.
If the callee were responsible for saving this address, rather than the caller,
	we could delay storing the return address and exploit these instructions to eliminate the \ual{add} instruction.
However, we also mentioned that tail recursion makes this difficult (see \cref{sec:storing-pc:other-solutions}).

\subsubsection{Tail recursive entry points}
To improve on this, two things can be done.
First, we could add a \emph{tail recursive entry point} to function blocks.
The code for the \clean{length} function from \cref{sec:storing-pc:other-solutions} could then look like this:

\begin{minted}{ual}
slength_P2:
	str   lr,[sp,#-4]!    @ Store the return address
slength_P2_tr:              @ The tail recursive entry point
	ldr   r0,[r6]         @ Load top of A-stack (the list)

	@ ... etc

e_0:                        @ If / after the Cons part has been evaluated
	add   r4,r4,#1        @ Increase counter
	b     slength_P2_tr   @ Tail recursively jump back to start of function
	.ltorg
\end{minted}

This way, the code to store the return address is only executed once.
The return address is saved on the stack, as before.
To call the subroutine, the following block can be used:

\begin{minted}{ual}
	bl    slength_P2
\end{minted}

This is an improvement of between two and six bytes per subroutine call:
	in any case, we win two bytes since the \ual{add} instruction is eliminated.
In the minimal case that a function is called only once, we save only these two bytes.
If the function is called often, many \ual{str} instructions can be removed at the cost of one extra \ual{str} instruction in the function itself,
	and the space saved per subroutine call asymptotically approaches six bytes.

As for running time, we win one instruction cycle per subroutine call.
The \ual{add} instruction is removed, but the \ual{str} instruction is still executed
	(whether it is placed in the caller's or the callee's block does not matter).
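For comparison, a sketch of the current caller-saves sequence, in which each call site computes and stores the return address itself (the exact \ual{pc}-relative offset is illustrative; the encoding is described in \cref{sec:storing-pc}):

\begin{minted}{ual}
	add   lr,pc,#5        @ Compute the return address (offset illustrative)
	str   lr,[sp,#-4]!    @ Store it on the stack
	b     slength_P2      @ Branch to the subroutine
\end{minted}

With a callee-saves scheme, each call site shrinks to a single \ual{bl} instruction, at the cost of one \ual{str} in the callee.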

In the current code generator, branches are often removed by placing the callee directly behind the caller.
Adding a tail recursive entry point before the callee complicates this:
	we would need to jump behind that entry point
	or place the entry point somewhere else
		and jump from the tail recursive entry point to the start of the function.
It should be tested whether any of these options is an improvement over the current slow subroutine calls.

To implement tail recursive entry points,
	one would need to be able to recognise tail recursion, such that the tail recursive entry point is used only for tail recursive subroutine calls
	and not for `plain' recursive subroutine calls.
Also, more research is needed to see if a tail recursive entry point is sufficient in all cases.

\subsubsection{Mixed calling convention}
A second possibility would be to have a \emph{mixed calling convention}.
Currently, in every subroutine call, the caller is responsible for storing the return address on the stack.
In a mixed calling convention,
	the callee would be responsible for this,
	except for functions that exploit tail recursion or that are not always called but also jumped to (as described above).
Recognising tail recursion can be done on a relatively high level of abstraction,
	so annotations can be added to the intermediate ABC-code to indicate what calling convention should be used for a certain function.

An advantage of this approach might be that there are fewer cases to consider.
The obvious disadvantage is that it makes the calling convention overly complex.

\subsection{Branch optimisation}
\label{sec:discussion:branches}
A second optimisation vector that is yet to be considered is branch optimisation.
The Thumb instruction set has four variants of the simple branch instruction \ual{b}~\parencite[A6.7.12]{armv7m}:

\begin{itemize}[itemsep=0pt]
	\item
		16-bit, conditional, with an offset between $-256$ and $254$.
	\item
		16-bit, unconditional, with an offset between $-2,048$ and $2,046$.
	\item
		32-bit, conditional, with an offset between $-1,048,576$ and $1,048,574$.
	\item
		32-bit, unconditional, with an offset between $-16,777,216$ and $16,777,214$.
\end{itemize}
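As an illustration (with hypothetical labels), the assembler can be forced to pick a particular encoding with the \ual{.n} (narrow) and \ual{.w} (wide) width suffixes; when no suffix is given, it chooses the narrowest encoding that can reach the target:

\begin{minted}{ual}
	beq.n near            @ 16-bit conditional branch: offset within -256..254
	b.n   close           @ 16-bit unconditional branch: offset within -2,048..2,046
	beq.w far             @ 32-bit conditional branch: offset within +-1 MiB
	b.w   distant         @ 32-bit unconditional branch: offset within +-16 MiB
\end{minted}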

By reorganising the code, the code generator could make as many branch instructions as possible 16-bit.

This optimisation is restricted to the \ual{b} instruction:
	the branch with link (\ual{bl}) is always 32-bit;
	the branch and exchange (\ual{bx}) and branch with link and exchange (\ual{blx}) instructions are always 16-bit.
Since the \ual{b} instruction is used relatively little by the code generator,
	we cannot hope for much improvement.
The Clean compiler, with a code size of 3,827,868 bytes (Thumb optimised, \cref{fig:reg-alloc:results}),
	counts 56,009 simple branches, of which 13,752 ($25\%$) are already narrow.
In the best case we can make all the wide branches narrow
	and win $(56,009 - 13,752)\cdot 2 = 84,514$ bytes, that is $2.2\%$.
The \ual{bl}, \ual{blx} and \ual{bx} instructions are used $42,436$, $23,070$ and $374$ times in the Clean compiler, respectively.

% $ objdump -d a.out | grep -w 'b\(en\|eq\|cs\|hs\|cc\|lo\|mi\|pl\|vs\|vc\|hi\|ls\|ge\|lt\|gt\|le\|al\)\?\.w' | wc -l
% 42257
% $ objdump -d a.out | grep -w 'b\(en\|eq\|cs\|hs\|cc\|lo\|mi\|pl\|vs\|vc\|hi\|ls\|ge\|lt\|gt\|le\|al\)\?\.n' | wc -l
% 13752
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bl | wc -l
% 42436
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF blx | wc -l
% 23070
% $ objdump -d a.out | grep -w 'bx\(en\|eq\|cs\|hs\|cc\|lo\|mi\|pl\|vs\|vc\|hi\|ls\|ge\|lt\|gt\|le\|al\)\?' | wc -l
% 374

\subsection{Thumb-only instructions}
\label{sec:discussion:thumb-only}
The Thumb instruction set introduces some instructions that are not available on ARM.
These include:

\begin{itemize}
	\item \ual{cbz} and \ual{cbnz},
		\enquote{Compare and Branch on (Non-)Zero}, can conditionally branch forward an even number of bytes between 0 and 126 when a register is (un)equal to zero~\parencite[A6.7.21]{armv7m}.
		This is an improvement over combining a \ual{tst} and a \ual{beq} or \ual{bne} instruction.
	\item \ual{tbb} and \ual{tbh},
		\enquote{Table Branch Byte} and \enquote{Table Branch Halfword}, can branch forward using a table of single-byte or halfword offsets,
			when given a pointer to and an index into the table~\parencite[A6.7.139]{armv7m}.
\end{itemize}
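For example, a zero test followed by a conditional branch can be collapsed into a single 16-bit instruction (a sketch with a hypothetical label; note that \ual{cbz} can only branch forward, and at most 126 bytes):

\begin{minted}{ual}
	@ Before: two instructions
	tst   r0,r0           @ Set flags based on r0
	beq   is_zero         @ Branch if zero

	@ After: one 16-bit instruction
	cbz   r0,is_zero      @ Compare and branch if r0 is zero
\end{minted}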

Where they are applicable, these instructions will be more efficient than the equivalent ARM code that is currently generated.
There do not seem to be direct use cases for \ual{tbb} and \ual{tbh}, but \ual{cbz} and \ual{cbnz} can improve performance.
It has not yet been measured how much can be gained.

The \ual{it} instruction allows for conditional execution in the Thumb instruction set.
It does not exist in the ARM instruction set, which instead has a four-bit condition field in every instruction~(see \cref{sec:intro:conditionals}).
However, although \ual{it} is new in Thumb, it does not introduce a performance optimisation, as there is no slower alternative to \ual{it} blocks that it could replace.
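A sketch of an \ual{it} block (registers chosen for illustration): the \ual{ite} variant makes the next instruction conditional on the stated condition and the instruction after that on its inverse:

\begin{minted}{ual}
	cmp   r0,#0           @ Compare r0 with zero
	ite   eq              @ If-Then-Else: next instruction if eq, the one after if ne
	moveq r1,#1           @ Executed if r0 == 0
	movne r1,#0           @ Executed if r0 != 0
\end{minted}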

\begin{figure*}[b!]
	\centering
	\begin{tikzpicture}
		\begin{axis}
			[ width=.5\textwidth
			, xlabel={Size decrease (\%)}
			, ylabel={Running time increase (\%)}
			, mark options={scale=2}
			, scatter/classes={%
				a={mark=x,draw=\ifcolours red\else gray\fi},
				b={mark=+,draw=black}}
			]
			\addplot[scatter,only marks,scatter src=explicit symbolic]
				table[meta=label] {
					x     y    label
					12.8  11.9 a
					18.6  3.6  b
					9.8   5.1  a
					18.9  5.9  b
					18.9  3.7  b
					20.1  5.4  a
					22.9  0.9  a
					0.0   2.6  a
					20.0  5.2  b
					22.0  0.0  b
					12.1  8.0  a
				};
		\end{axis}
	\end{tikzpicture}
	\caption{
		Comparison of code size decrease and running time increase.
		{\ifcolours Red\else Grey\fi} crosses indicate programs that are smaller than 1,000 bytes in ARM;
			black pluses correspond to larger programs.
		\label{fig:results:scatter}}
\end{figure*}

\subsection{Matching programs and instruction sets}
\label{sec:discussion:matching}
If we compare the code size decrease with the running time increase,
	we get the plot in \cref{fig:results:scatter}.
It seems that code size decrease correlates negatively with increase in running time,
	suggesting that Thumb is more suitable for some programs than others.
However, we do not have enough data to see if this is significant.
More research is needed to see if there actually is such a relationship,
	and if so, what programs are more suitable for Thumb.

The outliers in \cref{fig:results:scatter} are
	RFib $(0.0,2.6)$,
	Ack $(12.8,11.9)$,
	Fib $(9.8,5.1)$ and
	Tak $(12.1,8.0)$,
	all small programs that rely heavily on recursion
	(for a description of the programs, see \cref{sec:results:testsuite} above).
This is explained by the relatively high number of jumps:
	in Thumb we had to add several instructions, introducing extra code size and running time (see \cref{sec:storing-pc}).

\end{multicols}