\section{Optimising register allocation}
\label{sec:reg-alloc}

\begin{figure}[b!]
	\small
	\begin{subfigure}{.5\linewidth}
		\begin{tikzpicture}
			\ifcolours\else\selectcolormodel{gray}\fi
			\begin{axis}
				[ xbar
				, xmin=0
				, width=.9\linewidth
				, height=1.1\linewidth
				, symbolic y coords={S0,A ptr.,A0,Heap ptr.,A1,B0,A2,B ptr.,B1,S1,B2,B3,B4,Heap ctr.,A3}
				, ytick=data
				, scaled x ticks=real:1
				, xtick scale label code/.code={}
				, axis lines*=left
				, compat={1.3}
				]
				\addplot coordinates {
					(378618,S0)
					(270274,A ptr.)
					(218018,A0)
					(166284,Heap ptr.)
					(152821,A1)
					(110390,B0)
					(107481,A2)
					(102640,B ptr.)
					( 64496,B1)
					( 55526,S1)
					( 41092,B2)
					( 25924,B3)
					( 15699,B4)
					( 14930,Heap ctr.)
					(  9413,A3)
				};
				\draw ($({axis cs:85000,S0})-(0,2em)$) -- ($({axis cs:85000,A3})+(0,2em)$);
			\end{axis}
		\end{tikzpicture}
		\subcaption{In the Clean compiler.}
	\end{subfigure}%
	\begin{subfigure}{.5\linewidth}
		\begin{tikzpicture}
			\ifcolours\else\selectcolormodel{gray}\fi
			\begin{axis}
				[ xbar
				, xmin=0
				, width=.9\linewidth
				, height=1.1\linewidth
				, symbolic y coords={S0,A ptr.,A0,Heap ptr.,A1,B0,A2,B ptr.,B1,S1,B2,B3,B4,Heap ctr.,A3}
				, ytick=data
				, axis lines*=left
				, compat={1.3}
				]
				\addplot coordinates {
					(387,S0)
					(372,A ptr.)
					(283,B1)
					(257,A3)
					(217,B0)
					(207,A0)
					(185,A2)
					(185,A1)
					(137,S1)
					(115,B ptr.)
					(113,Heap ptr.)
					( 24,Heap ctr.)
					( 13,B2)
					(  2,B3)
					(  0,B4)
				};
				\draw ($({axis cs:155,S0})-(0,2em)$) -- ($({axis cs:155,A3})+(0,2em)$);
			\end{axis}
		\end{tikzpicture}
		\subcaption{In the RTS.\label{fig:reg-usage:rts}}
	\end{subfigure}%
	\caption{Register usage with the Thumb-2 backend.\label{fig:reg-usage}}
\end{figure}

% pi@rasppi:~/clean/src/compiler$ ~/count-regs.sh 
% r0:     15699
% r1:     25924
% r2:     41092
% r3:     64496
% r4:     110390
% r5:     14930
% r6:     218018
% r7:     152821
% r8:     107481
% r9:     270274
% sb:     0
% r10:    166284
% sl:     0
% r11:    9413
% fp:     0
% r12:    378618
% ip:     0
% r13:    0
% sp:     102640
% r14:    55526
% lr:     0
% r15:    0
% pc:     61145

% pi@rasppi:~/rts$ ~/count-regs.sh thumb2*.s
% BSTACK_0:       217
% BSTACK_1:       283
% BSTACK_2:       13
% BSTACK_3:       2
% BSTACK_4:       0
% BSTACK_PTR:     0
% ASTACK_0:       207
% ASTACK_1:       185
% ASTACK_2:       350
% ASTACK_3:       257
% ASTACK_PTR:     372
% HEAP_FREE:      24
% HEAP_PTR:       113
% SCRATCH_REG:    387
% LINK_REG:       0
% sp:     115
% lr:     137
% pc:     72

\begin{multicols}{2}

\subsection{Introduction}
\label{sec:reg-alloc:intro}
In Thumb-1, i.e., the 16-bit subset of Thumb-2, register fields are 3 bits wide,
	allowing the programmer to access \ual{r0} through \ual{r7}.
Special encoding variants are defined to access \ual{sp} (\ual{r13}) and \ual{pc} (\ual{r15}),
	though these instructions have limited functionality compared to those working on the eight low registers.
The 32-bit instructions defined by Thumb-2 have access to all sixteen registers.

This introduces an interesting code size optimisation vector.
If we put the most-used registers in the eight low registers, we may save code size.

\subsection{Usage in Clean}
\label{sec:reg-alloc:clean}
In Clean programs, the ARM registers are used as follows~\parencite[\texttt{armstartup.s}]{armcg}:

\begin{itemize}[itemsep=0pt]
	\item Five registers for the top of the B-stack,
		which we call B0 through B4.
	\item Four registers for the top of the A-stack,
		A0 through A3.
	\item Three registers for the A-stack, B-stack and heap pointers:
		A ptr., B ptr., Heap ptr., respectively.
		The B-stack is interleaved with the C-stack and uses SP as its pointer.
	\item One register for the number of free words on the heap, Heap ctr.
	\item Two scratch registers S0 and S1.
	\item The program counter PC.
\end{itemize}

\begin{table*}[b!]
	\centering
	\newcommand{\highlight}[1]{\textbf{\ifcolours\textcolor{red}{#1}\else\underline{#1}\fi}}
	\begin{subfigure}{.5\linewidth}
		\centering
		\begin{tabular}{l >{\tt\footnotesize}c >{\tt\footnotesize}c}
			& & \\ % Dummy row to align with other table
			Register & {\normalsize\normalfont ARM} & {\normalsize\normalfont Thumb-2} \\\hline
			A3           & r11 & \highlight{r12} \\
			Heap counter & r5  & \highlight{r9} \\
			B4           & r0  & \highlight{r10} \\
			B3           & r1  & \highlight{r11} \\
			B2           & r2  & \highlight{r8} \\
			S1           & r14 & r14 \\
			B1           & r3  & r3 \\
		\end{tabular}
	\end{subfigure}%
	\begin{subfigure}[t]{.5\linewidth}
		\centering
		\begin{tabular}{l >{\tt\footnotesize}c >{\tt\footnotesize}c}
			Register & {\normalsize\normalfont ARM} & {\normalsize\normalfont Thumb-2} \\\hline
			B pointer    & sp  & sp \\
			A2           & r8  & \highlight{r2} \\
			B0           & r4  & r4 \\
			A1           & r7  & r7 \\
			Heap pointer & r10 & \highlight{r1} \\
			A0           & r6  & r6 \\
			A pointer    & r9  & \highlight{r5} \\
			S0           & r12 & \highlight{r0} \\
		\end{tabular}
	\end{subfigure}%
	\caption{Register allocation in the ARM and Thumb backends.\label{tab:reg-alloc:arm-and-thumb}}
\end{table*}

In \cref{fig:reg-usage}, we count for each register the number of instructions it occurs in, in the code generated for the Clean compiler~\parencite{cocl}.
When a register is used multiple times in the same instruction, this counts as one occurrence.
The vertical line at indicates the number of occurrences that a variable should have to justify putting it in the lower registers,
	if we were to put the most-used variables in the lower registers
	(e.g., there are eight registers with over 85,000 occurrences in the Clean compiler).

The RTS shows an entirely different pattern (\cref{fig:reg-usage:rts}),
	because many registers have another meaning in the RTS than in generated code.
For large programs, register usage in the RTS is negligible.
For small programs it may not be negligible,
	but these programs are not likely to need rigorous optimisation for code size,
	considering that they will be small anyway.

The counting method used here is rather simplistic:
	basing a register allocation on these counts alone means assuming that every instruction has a 16-bit variant, which is not the case.
A more accurate method would be to only count those instructions where using a high or low register actually makes a difference.
This is much more complicated, and for a rough estimate the simplistic method used here will already allow us to shrink down the code size.

\subsection{Optimisation}
\label{sec:reg-alloc:optimisation}
The register allocation used in the ARM backend is inefficient for Thumb-2 on several points (\cref{tab:reg-alloc:arm-and-thumb}).
A number of often-used registers (the main scratch register, the A-stack pointer and the heap pointer) are in the upper half,
	while registers that are used less often are in the lower half (the heap counter and B2 through B4).

\Cref{fig:reg-usage} suggests that we put all variables that occur more than 85,000 times in the low registers.
However, the B-stack pointer has to be SP:
	the B- and C-stacks are combined into one
	and the C-stack pointer should be the system pointer to allow for an efficient foreign function interface.

The third column in \cref{tab:reg-alloc:arm-and-thumb} shows the proposed allocation for Thumb.
The heap counter and the A-stack pointer are swapped, as are A2 and B2.
The registers for S0, B4, the heap pointer, B3 and A3 have been rotated.
This way, all eight most often used registers are in the lower half except the B-stack pointer.

\subsection{The foreign function interface}
\label{sec:reg-alloc:ffi}
Changing the register allocation has its impact on the foreign function interface.
Clean provides mechanisms to export Clean functions, so that they can be called from other software,
	and to call other functions from Clean,
	as long as the other software respects a certain high-level infrastructure \parencite[chp.~11]{cleanlangrep}.

The low-level interface (which registers are used, for example) is platform-dependent.
For ARM, it is defined in \cite{armcallstd}.

Some registers have a special function:
	\ual{r15} is the program counter;
	\ual{r13} the stack pointer.
The Clean backend cannot use these registers in another way.
There are four argument / result / scratch registers, \ual{r0} through \ual{r3}.
These are not guaranteed to be preserved upon a function call.
The link register, \ual{r14}, and \ual{r12}, cannot be used freely either:
	a subroutine jumps to the address in the link register when it is done
		(and can use \ual{r14} as a local variable if it stores its value upon entering on the stack);
		and \ual{r12} can be used by the linker when extra instructions are needed
			when a branch instruction attempts to jump to a label so far away that it does not fit in the instruction any more \parencite[5.3.1.1]{armcallstd}.
For local variables, \ual{r4} through \ual{r8}, \ual{r10} and \ual{r11} can be used:
	these have to be preserved by subroutines.
The last register, \ual{r9}, is platform-dependent.
Some quick tests indicate that it is callee-saved on our test setup (see \cref{sec:system}).

The register allocation in the ARM backend is optimised for the foreign function interface:
	the B-stack registers are in the argument registers, because the B-stack is usually empty during function calls.
The link register \ual{r14} and \ual{r12} are used as scratch registers,
	because they are only used for short term storage.
All other variables have to be kept over subroutine calls and are kept in the other registers, which are callee-saved.

Changing the register allocation in the way proposed above means that this foreign function interface will be less efficient.
Whenever a foreign function needs to be called, the caller-saved registers that need to be preserved have to be saved.
With the allocation as proposed in \cref{tab:reg-alloc:arm-and-thumb},
	these are the heap pointer (\ual{r1}), A2 (\ual{r2}) and A3 (\ual{r12}).
Before every call, a wide \ual{push} instruction needs to be inserted;
	after every return, a wide \ual{pop} instruction.
This introduces an 8-byte overhead per foreign function call
	and also means that every call will be slightly slower.
We deem the foreign function interface to be less important than the actual Clean code,
	so this is acceptable.

\subsection{Results}
\label{sec:reg-alloc:results}
To measure the code size improvement introduced by this optimisation,
	we again take the Clean compiler and compile it in three different ways:

\begin{itemize}[itemsep=0pt]
	\item
		With the existing ARM backend \parencite{armcg}.
	\item
		With the new Thumb backend, leaving the register allocation as in the ARM backend.
	\item
		With the new Thumb backend, with the optimised register allocation as in the third column of \cref{tab:reg-alloc:arm-and-thumb}.
	\item
		With the new Thumb backend, with the optimised register allocation and leaving out any \ual{.align} directives to approximate object-code level efficiency.
\end{itemize}

\begin{figure*}[b]
	\small
	\centering
	\begin{tikzpicture}
		\ifcolours\else\selectcolormodel{gray}\fi
		\begin{axis}
			[ anchor=south east
			, height=.25\linewidth
			, width=.45\linewidth
			, xbar
			, xmin=0, xmax=6000000
			, xlabel={Bytes}
			, ytick=\empty
			, reverse stacked plots
			, ylabel={Clean compiler}
			, ymin=-1, ymax=1
			, axis lines*=left
			, nodes near coords, nodes near coords align=west
			, reverse legend
			, every axis plot/.append style={point meta=explicit symbolic}
			, legend style={at={($(1,-0.5)+(1.5em,0pt)$)},anchor=north,legend columns=-1}
			]
			\addplot coordinates { (3785144,0) [80.7\%] };
			\addplot coordinates { (3827868,0) [81.6\%] };
			\addplot coordinates { (4385964,0) [93.5\%] };
			\addplot coordinates { (4692628,0) [100\%] };
			\legend{Thumb object-code (approximation), Thumb optimised, Thumb, ARM}
		\end{axis}
		\begin{axis}
			[ anchor=south west
			, height=.25\linewidth
			, width=.45\linewidth
			, xshift=3em
			, xbar
			, xmin=0, xmax=80000
			, xlabel={Bytes}
			, ytick=\empty
			, ymin=-1, ymax=1
			, ylabel={\texttt{frontend/scanner.icl}}
			, axis lines*=left
			, every axis plot/.append style={point meta=explicit symbolic}
			, nodes near coords, nodes near coords align=west
			]
			\addplot coordinates { (57296,0) [82.0\%] };
			\addplot coordinates { (57864,0) [82.8\%] };
			\addplot coordinates { (64588,0) [92.4\%] };
			\addplot coordinates { (69896,0) [100\%] };
		\end{axis}
	\end{tikzpicture}
	\caption{Code size for different backends.\label{fig:reg-alloc:results}}
\end{figure*}

The object code generator has not been finished yet at the moment these measurements were done,
	so we can only estimate its efficiency.
A modification described in \cref{sec:storing-pc:solution} required us to add \ual{.align} to jump instructions, adding a \ual{nop} instruction in approximately half the instances.
In the last measurement, we approximate object-code level efficiency by leaving out these \ual{.align} directives, as they will not be needed in the object code generator.

We only measure the size of generated code, that is, all \texttt{.text} segments except those that belong to the RTS or external libraries.
These measurements were done on the system described in \cref{sec:system}.
One may argue that the Clean compiler is not representative code because of some peculiarities like large lookup tables and arrays that are rarely used in practice.
For this reason, we compare both the total code size and the size of a part of the lexer \parencite[\texttt{frontend/scanner.icl}]{cocl},
	which can be considered representative.
The results are in \cref{fig:reg-alloc:results}.

We see that even without the optimisation discussed in this section, some code size is saved.
After all, at least some instructions can be made 16-bit, and in the sections above we have hardly ever had to add extra instructions.
However, optimising register allocation allows us to save much more, up to almost 20\%.

\end{multicols}