1 files changed, 122 insertions, 21 deletions
diff --git a/thesis/reg-alloc.tex b/thesis/reg-alloc.tex
index 0fc764e..5918b59 100644
--- a/thesis/reg-alloc.tex
+++ b/thesis/reg-alloc.tex
@@ -14,7 +14,8 @@
 				, ytick=data
 				, scaled x ticks=real:1
 				, xtick scale label code/.code={}
-				, ytick pos=left
+				, axis lines*=left
+				, compat={1.3}
 				]
 				\addplot coordinates {
 					(378618,S0)
@@ -47,7 +48,8 @@
 				, height=1.1\linewidth
 				, symbolic y coords={S0,A ptr.,A0,Heap ptr.,A1,B0,A2,B ptr.,B1,S1,B2,B3,B4,Heap ctr.,A3}
 				, ytick=data
-				, ytick pos=left
+				, axis lines*=left
+				, compat={1.3}
 				]
 				\addplot coordinates {
 					(387,S0)
@@ -149,6 +151,30 @@ In Clean programs, the ARM registers are used as follows~\parencite[\texttt{arms
 	\item The program counter PC.
 \end{itemize}
 
+\begin{table*}[b!]
+	\centering
+	\newcommand{\highlight}[1]{\textcolor{red}{\textbf{#1}}}
+	\begin{tabular}{l >{\tt\footnotesize}c >{\tt\footnotesize}c}
+		Register & {\normalsize\normalfont ARM} & {\normalsize\normalfont Thumb-2} \\\hline
+		A3           & r11 & \highlight{r12} \\
+		Heap counter & r5  & \highlight{r9} \\
+		B4           & r0  & \highlight{r10} \\
+		B3           & r1  & \highlight{r11} \\
+		B2           & r2  & \highlight{r8} \\
+		S1           & r14 & r14 \\
+		B1           & r3  & r3 \\
+		B pointer    & sp  & sp \\
+		A2           & r8  & \highlight{r2} \\
+		B0           & r4  & r4 \\
+		A1           & r7  & r7 \\
+		Heap pointer & r10 & \highlight{r1} \\
+		A0           & r6  & r6 \\
+		A pointer    & r9  & \highlight{r5} \\
+		S0           & r12 & \highlight{r0} \\
+	\end{tabular}
+	\captionof{table}{Register allocation in the ARM and Thumb backends\label{tab:reg-alloc:arm-and-thumb}}
+\end{table*}
+
 In \cref{fig:reg-usage}, we count for each register the number of instructions it occurs in, in the code generated for the Clean compiler~\parencite{cocl}.
 When a register is used multiple times in the same instruction, this counts as one occurrence.
 The vertical line at indicates the number of occurrences that a variable should have to justify putting it in the lower registers,
@@ -162,31 +188,106 @@ For small programs it may not be negligible,
 	but these programs are not likely to need rigorous optimisation for code size,
 	considering that they will be small anyway.
 
+The counting method used here is rather simplistic:
+	basing a register allocation on these counts alone means assuming that every instruction has a 16-bit variant, which is not the case.
+A more accurate method would only count those instructions where using a high or low register actually makes a difference.
+This is much more complicated, and for a rough estimate the simplistic method used here will already allow us to shrink down the code size.
+
+\begin{figure*}[b!]
+	\small
+	\centering
+	\begin{tikzpicture}
+		\begin{axis}
+			[ anchor=south east
+			, height=.4\linewidth
+			, xbar
+			, xmin=0, xmax=6000000
+			, xlabel={Bytes}
+			, ytick=\empty
+			, reverse stacked plots
+			, ylabel={Clean compiler}
+			, ymin=-1, ymax=1
+			, axis lines*=left
+			, nodes near coords, nodes near coords align=west
+			, reverse legend
+			, every axis plot/.append style={point meta=explicit symbolic}
+			, legend style={at={($(1,-0.3)+(1.5em,0pt)$)},anchor=north,legend columns=-1}
+			]
+			\addplot coordinates { (3785144,0) [80.7\%] };
+			\addplot coordinates { (3827868,0) [81.6\%] };
+			\addplot coordinates { (4385964,0) [93.5\%] };
+			\addplot coordinates { (4692628,0) [100\%] };
+			\legend{Thumb object-code (approximation), Thumb optimised, Thumb, ARM}
+		\end{axis}
+		\begin{axis}
+			[ anchor=south west
+			, height=.4\linewidth
+			, xshift=3em
+			, xbar
+			, xmin=0, xmax=80000
+			, xlabel={Bytes}
+			, ytick=\empty
+			, ymin=-1, ymax=1
+			, ylabel={\texttt{frontend/scanner.icl}}
+			, axis lines*=left
+			, every axis plot/.append style={point meta=explicit symbolic}
+			, nodes near coords, nodes near coords align=west
+			]
+			\addplot coordinates { (57296,0) [82.0\%] };
+			\addplot coordinates { (57864,0) [82.8\%] };
+			\addplot coordinates { (64588,0) [92.4\%] };
+			\addplot coordinates { (69896,0) [100\%] };
+		\end{axis}
+	\end{tikzpicture}
+	\caption{Code size for different backends\label{fig:reg-alloc:results}}
+\end{figure*}
+
 \subsection{Optimisation}
 \label{sec:reg-alloc:optimisation}
+The register allocation used in the ARM backend is inefficient for Thumb-2 on several points (\cref{tab:reg-alloc:arm-and-thumb}).
+A number of often-used registers (the main scratch register, the A-stack pointer and the heap pointer) are in the upper half,
+	while registers that are used less often are in the lower half (the heap counter and B2 through B4).
+
 \Cref{fig:reg-usage} suggests that we put all variables that occur more than 85,000 times in the low registers.
 However, the B-stack pointer has to be SP:
 	the B- and C-stacks are combined into one
-	(an optimisation suggested early on by \cite{execspecs} \todo{exact ref.})
 	and the C-stack pointer should be the system pointer to allow for an efficient foreign function interface.
 
-% Before optimisation:
-% pi@rasppi:~/clean/src/compiler$ ls -l a.out 
-% -rwxr-xr-x 1 pi pi 9987016 Dec  7 21:46 a.out
-% pi@rasppi:~/clean/src/compiler$ strip --strip-all a.out 
-% pi@rasppi:~/clean/src/compiler$ ls -l a.out 
-% -rwxr-xr-x 1 pi pi 5034316 Dec  7 21:46 a.out
-
-% After optimisation as suggested by John:
-%  1 r9 <-> r5
-%  2 r0 -> r10
-%  3 r1 -> r11
-%  4 r10 -> r1
-%  5 r11 -> r12
-%  6 r12 -> r0
-% static char *register_name[15]={"sp","r1","r5","r12","r8","r7","r6","r4","r3","r2","r11","r10","r9","r0","r14"};
-% static int register_order[15]={13,1,5,12,8,7,6,4,3,2,11,10,9,0,14};
-% pi@rasppi:~/clean/src/compiler$ ls -l a.out 
-% -rwxr-xr-x 1 pi pi 7591584 Dec  7 22:16 a.out
+The third column in \cref{tab:reg-alloc:arm-and-thumb} shows the proposed allocation for Thumb.
+The heap counter and the A-stack pointer are swapped, as are A2 and B2.
+The registers for S0, B4, the heap pointer, B3 and A3 have been rotated.
+This way, all eight most often used registers are in the lower half except the B-stack pointer.
+
+\subsection{Results}
+\label{sec:reg-alloc:results}
+To measure the code size improvement introduced by this optimisation,
+	we again take the Clean compiler and compile it in three different ways:
+
+\begin{itemize}[itemsep=0pt]
+	\item
+		With the existing ARM backend \parencite{armcg}.
+	\item
+		With the new Thumb backend, leaving the register allocation as in the ARM backend.
+	\item
+		With the new Thumb backend, with the optimised register allocation as in the third column of \cref{tab:reg-alloc:arm-and-thumb}.
+	\item
+		With the new Thumb backend, with the optimised register allocation and leaving out any \ual{.align} directives to approximate object-code level efficiency.
+\end{itemize}
+
+The object code generator has not been finished yet at the moment these measurements were done,
+	so we can only estimate its efficiency.
+A modification described in \cref{sec:storing-pc:solution} required us to add \ual{.align} to jump instructions, adding a \ual{nop} instruction in approximately half the instances.
+In the last measurement, we approximate object-code level efficiency by leaving out these \ual{.align} directives, as they will not be needed in the object code generator.
+
+We only measure the size of generated code, that is, all \texttt{.text} segments except those that belong to the RTS or external libraries.
+These measurements were done on the system described in \cref{sec:system}.
+One may argue that the Clean compiler is not representative code because of some peculiarities like large lookup tables and arrays that are rarely used in practice.
+For this reason, we compare both the total code size and the size of a part of the lexer \parencite[\texttt{frontend/scanner.icl}]{cocl},
+	which can be considered representative.
+The results are in \cref{fig:reg-alloc:results}.
+
+We see that even without the optimisation discussed in this section, some code size is saved.
+After all, at least some instructions can be made 16-bit, and in the sections above we have hardly ever had to add extra instructions.
+However, optimising register allocation allows us to save much more, up to almost 20\%.
 
 \end{multicols}