6 files changed, 172 insertions, 108 deletions
diff --git a/thesis/discussion.tex b/thesis/discussion.tex
new file mode 100644
index 0000000..ae72de3
--- /dev/null
+++ b/thesis/discussion.tex
@@ -0,0 +1,169 @@
+\section{Discussion}
+\label{sec:discussion}
+
+\begin{multicols}{2}
+
+\subsection{Optimising subroutine calls}
+\label{sec:discussion:subroutines}
+In \cref{sec:storing-pc}, we saw that subroutine calls are slower and larger in Thumb than in ARM,
+	because an extra \ual{add} instruction has to be inserted.
+We discussed that a more efficient solution would exploit the link register:
+	upon a \ual{bl} and \ual{blx} instruction, the link register is loaded with the address of the instruction after the branch.
+If the callee would be responsible for saving this address, rather than the caller,
+	we could delay storing the return address and exploit these instructions to eliminate the \ual{add} instruction.
+However, we also mentioned that tail recursion makes this difficult (see \cref{sec:storing-pc:other-solutions}).
+
+\subsubsection{Tail recursive entry points}
+To improve on this, two things can be done.
+First, we could add a \emph{tail recursive entry point} to function blocks.
+The code for the \clean{length} function from \cref{sec:storing-pc:other-solutions} could then look like this:
+
+\begin{minted}{ual}
+slength_P2:
+	str   lr,[sp,#-4]!    @ Store the return address
+slength_P2_tr:              @ The tail recursive entry point
+	ldr   r0,[r6]         @ Load top of A-stack (the list)
+
+	@ ... etc
+
+e_0:                        @ If / after the Cons part has been evaluated
+	add   r4,r4,#1        @ Increase counter
+	b     slength_P2_tr   @ Tail recursively jump back to start of function
+	.ltorg
+\end{minted}
+
+This way, the code to store the return address is only executed once.
+The return address is saved on the stack, as before.
+To call the subroutine, the following block can be used:
+
+\begin{minted}{ual}
+	bl    slength_P2
+\end{minted}
+
+This is an improvement of between two and six bytes per subroutine call:
+	in any case, we win two bytes since the \ual{add} instruction is eliminated.
+In the minimal case that a function is called only once, we save only these two bytes.
+If the function is called often, many \ual{str} instructions can be removed at the cost of one extra \ual{str} instruction in the function itself,
+	and the space saved per subroutine call asymptotically approaches six bytes.
+
+As for running time, we win one instruction cycle per subroutine call.
+The \ual{add} instruction is removed, but the \ual{str} instruction is still executed
+	(whether it is placed in the caller's or the callee's block does not matter).
+
+To implement tail recursive entry points,
+	one would need to be able to recognise tail recursion such that the tail recursive entry point is used only for these subroutine calls,
+	but not for the `plain' recursive subroutine calls.
+Also, more research is needed to see if a tail recursive entry point is sufficient in all cases.
+
+\subsubsection{Mixed calling convention}
+A second possibility would be to have a \emph{mixed calling convention}.
+Currently, in every subroutine call, the caller is responsible for storing the return address on the stack.
+In a mixed calling convention,
+	the callee would be responsible for this,
+	except for functions that exploit tail recursion.
+Recognising tail recursion can be done on a relatively high level of abstraction,
+	so annotations can be added to the intermediate ABC-code to indicate what calling convention should be used for a certain function.
+
+An advantage of this approach might be that there are less cases to consider.
+The obvious disadvantage is that it makes the calling convention overly complex.
+
+\begin{figure*}[b!]
+	\centering
+	\begin{tikzpicture}
+		\begin{axis}
+			[ width=.5\textwidth
+			, xlabel={Size decrease (\%)}
+			, ylabel={Running time increase (\%)}
+			, mark options={scale=2}
+			, scatter/classes={%
+				a={mark=x,draw=\ifcolours red\else gray\fi},
+				b={mark=+,draw=black}}
+			]
+			\addplot[scatter,only marks,scatter src=explicit symbolic]
+				table[meta=label] {
+					x     y    label
+					12.8  11.9 a
+					18.6  3.6  b
+					9.8   5.1  a
+					18.9  5.9  b
+					18.9  3.7  b
+					20.1  5.4  a
+					22.9  0.9  a
+					0.0   2.6  a
+					20.0  5.2  b
+					22.0  0.0  b
+					12.1  8.0  a
+				};
+		\end{axis}
+	\end{tikzpicture}
+	\caption{
+		Comparison of code size decrease and running time increase.
+		{\ifcolours Red\else Grey\fi} crosses indicate programs that are smaller than 1000b in ARM;
+			black pluses correspond to larger programs.
+		\label{fig:results:scatter}}
+\end{figure*}
+
+\subsection{Branch optimisation}
+\label{sec:discussion:branches}
+A second optimisation vector that is yet to be considered is branch optimisation.
+The Thumb instruction set has four variants of the simple branch instruction \ual{b}~\parencite[A6.7.12]{armv7m}:
+
+\begin{itemize}[itemsep=0pt]
+	\item
+		16-bits, conditional, with an offset between $-256$ and $254$.
+	\item
+		16-bits, not conditional, with an offset between $-2,048$ and $2,046$.
+	\item
+		32-bits, conditional, with an offset between $-1,048,576$ and $1,048,574$.
+	\item
+		32-bits, not conditional, with an offset between $-16,777,216$ and $16,777,214$.
+\end{itemize}
+
+By reorganising the code, the code generator could make as many branch instructions as possible 16-bit.
+
+This optimisation is restricted to the \ual{b} instruction:
+	the branch with link (\ual{bl}) is always 32-bit;
+	the branch and exchange (\ual{bx}) and branch with link and exchange (\ual{blx}) instructions always 16-bit.
+Since the \ual{b} instruction is used relatively little by the code generator,
+	we cannot hope for much improvement.
+The Clean compiler, with a code size of 3,827,868 bytes (Thumb optimised, \cref{fig:reg-alloc:results}),
+	counts only 29,982 simple branches, of which 6,979 ($23\%$) are already narrow.
+In the best case we can make all the wide branches narrow
+	and win $(29,982 - 6,979)\cdot 2 = 46,004$ bytes, only $1.2\%$.
+The \ual{bl}, \ual{blx} and \ual{bx} instructions are used $42,436$, $23,070$ and $306$ times in the Clean compiler, respectively.
+
+% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.w | wc -l
+% 23003
+% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.n | wc -l
+% 6979
+% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bl | wc -l
+% 42436
+% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF blx | wc -l
+% 23070
+% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bx | wc -l
+% 306
+
+\subsection{Matching programs and instruction sets}
+\label{sec:discussion:matching}
+If we compare the code size decrease with the running time increase,
+	we get the plot in \cref{fig:results:scatter}.
+It seems that code size decrease correlates negatively with increase in running time,
+	suggesting that Thumb is more suitable for some programs than others.
+However, we do not have enough data to see if this is significant.
+More research is needed to see if there actually is such a relationship,
+	and if so, what programs are more suitable for Thumb.
+
+The outliers in \cref{fig:results:scatter} are
+	RFib $(0.0,2.6)$,
+	Ack $(12.8,11.9)$,
+	Fib $(9.8,5.1)$ and
+	Tak $(12.1,8.0)$,
+	all small programs that rely heavily on recursion
+	(for a description of the programs, see \cref{sec:results:testsuite} above).
+This is explained by the relatively high number of jumps:
+	in Thumb we had to add several instructions, introducing extra code size and running time (see \cref{sec:storing-pc}).
+
+\bigskip
+\todo{more}
+
+\end{multicols}
diff --git a/thesis/further.tex b/thesis/further.tex
deleted file mode 100644
index e499b7f..0000000
--- a/thesis/further.tex
+++ /dev/null
@@ -1,11 +0,0 @@
-\section{Further work}
-\label{sec:further}
-
-\begin{multicols}{2}
-
-\todo{
-	There are some things that are still to be done and some optimisation vectors that are not yet considered.
-	What will be written depends on how far I get in the coming weeks with this.
-}
-
-\end{multicols}
diff --git a/thesis/intro.tex b/thesis/intro.tex
index 3d6347c..ccffb4c 100644
--- a/thesis/intro.tex
+++ b/thesis/intro.tex
@@ -308,22 +308,12 @@ In much of the rest of this thesis we discuss differences between ARM and Thumb,
 In \cref{sec:storing-pc}, we consider an issue arising from halfword-aligned instructions and the way a read of PC is interpreted under Thumb.
 \Cref{sec:code-addresses} discusses problems related to the fact that Thumb instruction addresses use bit 1 and should have bit 0 set for interworking,
 	while under ARM a branch will automatically clear both these bits.
-There was a minor problem with negative offsets to the \ual{ldr} instruction,
-	that cannot be as large in Thumb as they can in ARM.
-\Cref{sec:load-offsets} deals with this.
+For some instructions, the constants in the ARM variant can be larger than those in the Thumb variant.
+\Cref{sec:load-offsets} deals with related issues in the \ual{ldr} instruction.
 
 Moving to Thumb introduces a number of interesting optimisation vectors.
 One of them, register allocation, is discussed in \cref{sec:reg-alloc}.
 
 We benchmark the Thumb backend for Clean and discuss the results in \cref{sec:results}.
 
-\subsection{Terminology}
-\label{sec:intro:terminology}
-As mentioned, we will usually use \enquote{Thumb} where the Thumb-2 extension is meant,
-	and only distinguish Thumb and Thumb-2 when this distinction is relevant ---
-	for \enquote{ARM}, context makes out whether the architecture or the instruction set is meant.
-
-For brevity, a number of common abbreviations is used.
-An overview can be found in \cref{sec:terminology}.
-
 \end{multicols}
diff --git a/thesis/results.tex b/thesis/results.tex
index 76afb27..b826138 100644
--- a/thesis/results.tex
+++ b/thesis/results.tex
@@ -91,42 +91,6 @@ They are:
 		A sequential version of Warshall's shortest path algorithm on a $6\times6$ matrix.
 \end{description}
 
-\begin{figure*}[b!]
-	\centering
-	\begin{tikzpicture}
-		\begin{axis}
-			[ width=.5\textwidth
-			, xlabel={Size decrease (\%)}
-			, ylabel={Running time increase (\%)}
-			, mark options={scale=2}
-			, scatter/classes={%
-				a={mark=x,draw=red},
-				b={mark=+,draw=black}}
-			]
-			\addplot[scatter,only marks,scatter src=explicit symbolic]
-				table[meta=label] {
-					x     y    label
-					12.8  11.9 a
-					18.6  3.6  b
-					9.8   5.1  a
-					18.9  5.9  b
-					18.9  3.7  b
-					20.1  5.4  a
-					22.9  0.9  a
-					0.0   2.6  a
-					20.0  5.2  b
-					22.0  0.0  b
-					12.1  8.0  a
-				};
-		\end{axis}
-	\end{tikzpicture}
-	\caption{
-		Comparison of code size decrease and running time increase.
-		Red crosses indicate programs that are smaller than 1000b in ARM;
-			black pluses correspond to larger programs.
-		\label{fig:results:scatter}}
-\end{figure*}
-
 \subsection{Code size}
 \label{sec:results:size}
 We compared the code size for the code generated by the ARM and Thumb backends for all programs in the test suite.
@@ -155,25 +119,6 @@ The increase in running time is relatively low compared to the decrease in code
 For the larger programs, this is
 	3.7\% with a standard deviation of 2.04pp. (though this is based on only five programs).
 
-\subsection{Discussion}
-\label{sec:results:discussion}
-If we compare the code size decrease with the running time increase,
-	we get the plot in \cref{fig:results:scatter}.
-It seems that a high code size decrease often comes together with a low increase in running time,
-	suggesting that Thumb is more suitable for some programs than others.
-However, we do not have enough data to see if this is significant.
-More research is needed to see if there actually is such a relationship,
-	and if so, what programs are more suitable for Thumb.
-
-The outliers in \cref{fig:results:scatter} are
-	RFib $(0.0,2.6)$,
-	Ack $(12.8,11.9)$,
-	Fib $(9.8,5.1)$ and
-	Tak $(12.1,8.0)$,
-	all small programs that rely heavily on recursion.
-This is explained by the relatively high number of jumps:
-	in Thumb we had to add several instructions, introducing extra code size and running time (see \cref{sec:storing-pc}).
-
 \end{multicols}
 
 % vim: nowrap
diff --git a/thesis/terms.tex b/thesis/terms.tex
deleted file mode 100644
index 4104875..0000000
--- a/thesis/terms.tex
+++ /dev/null
@@ -1,28 +0,0 @@
-\section{Terminology}
-\label{sec:terminology}
-
-\begin{multicols}{2}
-
-\begin{description}
-	\item[HNF]
-		Head normal form, the state of a graph node that cannot be rewritten itself (though it children may).
-	\item[ISA]
-		Instruction set architecture.
-	\item[IT block]
-		If-then block, a Thumb construct for conditional execution.
-	\item[LSB(s)]
-		Least significant bit(s), the bit(s) with the lowest value in a word.
-	\item[PC]
-		Program counter, \ual{r15} on ARM.
-	\item[RTS]
-		Run-time system.
-		Clean's run-time system ensures that the \clean{Start} node is reduced and printed.
-		It also takes care of garbage collecting.
-	\item[SP]
-		Stack pointer, \ual{r13} on ARM.
-	\item[UAL]
-		Unified Assembler Language, an assembly language syntax that
-			\enquote{provides a canonical form for all ARM and Thumb instructions} \parencite[A4.2]{armv7m}.
-\end{description}
-
-\end{multicols}
diff --git a/thesis/thesis.tex b/thesis/thesis.tex
index 95abdee..273694e 100644
--- a/thesis/thesis.tex
+++ b/thesis/thesis.tex
@@ -54,14 +54,13 @@ It produces on average 20\% smaller code than the ARM code generator, which is o
 \input{load-offsets}
 \input{reg-alloc}
 \input{results}
-\input{further}
+\input{discussion}
 
 \ifdraft\else
 \cleardoublepage
 \fi
 \let\section\oldsection
 \appendix
-\input{terms}
 \input{system}
 \input{status}