commit 163929496ea077a7882632cd99a98c806b02591a (patch)
author    Camil Staps  2017-01-02 21:50:11 +0100
committer Camil Staps  2017-01-02 21:50:49 +0100
tree      89660ec8fe7a12738ea5146f1af7dce5c5a63306 /thesis/discussion.tex
parent    Make columns optional (diff)

Remove terminology appendix; write discussion

Diffstat (limited to 'thesis/discussion.tex'):
 -rw-r--r--  thesis/discussion.tex | 169
 1 file changed, 169 insertions(+), 0 deletions(-)
thesis/discussion.tex (new file, 169 lines):

\section{Discussion}
\label{sec:discussion}

\begin{multicols}{2}

\subsection{Optimising subroutine calls}
\label{sec:discussion:subroutines}
In \cref{sec:storing-pc}, we saw that subroutine calls are slower and larger in Thumb than in ARM,
	because an extra \ual{add} instruction has to be inserted.
We discussed that a more efficient solution would exploit the link register:
	upon a \ual{bl} or \ual{blx} instruction, the link register is loaded with the address of the instruction after the branch.
If the callee were responsible for saving this address, rather than the caller,
	we could delay storing the return address and exploit these instructions to eliminate the \ual{add} instruction.
However, we also mentioned that tail recursion makes this difficult (see \cref{sec:storing-pc:other-solutions}).

\subsubsection{Tail recursive entry points}
To improve on this, two things can be done.
First, we could add a \emph{tail recursive entry point} to function blocks.
The code for the \clean{length} function from \cref{sec:storing-pc:other-solutions} could then look like this:

\begin{minted}{ual}
slength_P2:
	str lr,[sp,#-4]!  @ Store the return address
slength_P2_tr:        @ The tail recursive entry point
	ldr r0,[r6]       @ Load top of A-stack (the list)

	@ ... etc.

e_0:                  @ If / after the Cons part has been evaluated
	add r4,r4,#1      @ Increase counter
	b slength_P2_tr   @ Tail recursive jump back to the start of the function
	.ltorg
\end{minted}

This way, the code to store the return address is executed only once.
The return address is saved on the stack, as before.
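A function using this scheme can then return by popping the saved address straight into the program counter. As a hypothetical sketch (this return sequence is an assumption for illustration, not taken from the generated code):

\begin{minted}{ual}
	ldr pc,[sp],#4    @ Pop the saved return address into pc, returning
\end{minted}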
To call the subroutine, the following block can be used:

\begin{minted}{ual}
	bl slength_P2
\end{minted}

This is an improvement of between two and six bytes per subroutine call:
	in any case, we win two bytes, since the \ual{add} instruction is eliminated.
In the minimal case that a function is called only once, we save only these two bytes.
If the function is called often, many \ual{str} instructions can be removed at the cost of one extra \ual{str} instruction in the function itself,
	and the space saved per subroutine call asymptotically approaches six bytes.

As for running time, we win one instruction cycle per subroutine call.
The \ual{add} instruction is removed, but the \ual{str} instruction is still executed
	(whether it is placed in the caller's or the callee's block does not matter).

To implement tail recursive entry points,
	one would need to be able to recognise tail recursion, such that the tail recursive entry point is used only for tail recursive subroutine calls
	and not for `plain' recursive subroutine calls.
Also, more research is needed to see whether a tail recursive entry point is sufficient in all cases.

\subsubsection{Mixed calling convention}
A second possibility would be to use a \emph{mixed calling convention}.
Currently, in every subroutine call, the caller is responsible for storing the return address on the stack.
In a mixed calling convention,
	the callee would be responsible for this,
	except for functions that exploit tail recursion.
Recognising tail recursion can be done on a relatively high level of abstraction,
	so annotations can be added to the intermediate ABC-code to indicate which calling convention should be used for a certain function.

An advantage of this approach might be that there are fewer cases to consider.
The obvious disadvantage is that it makes the calling convention overly complex.

\begin{figure*}[b!]
	\centering
	\begin{tikzpicture}
		\begin{axis}
			[ width=.5\textwidth
			, xlabel={Size decrease (\%)}
			, ylabel={Running time increase (\%)}
			, mark options={scale=2}
			, scatter/classes={%
				a={mark=x,draw=\ifcolours red\else gray\fi},
				b={mark=+,draw=black}}
			]
			\addplot[scatter,only marks,scatter src=explicit symbolic]
				table[meta=label] {
					x    y    label
					12.8 11.9 a
					18.6 3.6  b
					9.8  5.1  a
					18.9 5.9  b
					18.9 3.7  b
					20.1 5.4  a
					22.9 0.9  a
					0.0  2.6  a
					20.0 5.2  b
					22.0 0.0  b
					12.1 8.0  a
				};
		\end{axis}
	\end{tikzpicture}
	\caption{
		Comparison of code size decrease and running time increase.
		{\ifcolours Red\else Grey\fi} crosses indicate programs that are smaller than 1,000 bytes in ARM;
		black pluses correspond to larger programs.
		\label{fig:results:scatter}}
\end{figure*}

\subsection{Branch optimisation}
\label{sec:discussion:branches}
A second optimisation vector that is yet to be considered is branch optimisation.
The Thumb instruction set has four variants of the simple branch instruction \ual{b}~\parencite[A6.7.12]{armv7m}:

\begin{itemize}[itemsep=0pt]
	\item
		16-bit, conditional, with an offset between $-256$ and $254$.
	\item
		16-bit, unconditional, with an offset between $-2,048$ and $2,046$.
	\item
		32-bit, conditional, with an offset between $-1,048,576$ and $1,048,574$.
	\item
		32-bit, unconditional, with an offset between $-16,777,216$ and $16,777,214$.
\end{itemize}

By reorganising the code, the code generator could make as many branch instructions as possible 16-bit.

This optimisation is restricted to the \ual{b} instruction:
	the branch with link (\ual{bl}) is always 32-bit;
	the branch and exchange (\ual{bx}) and branch with link and exchange (\ual{blx}) instructions are always 16-bit.
Since the \ual{b} instruction is used relatively rarely by the code generator,
	we cannot hope for much improvement.
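For illustration, UAL lets one force a particular encoding with the \texttt{.n} (narrow) and \texttt{.w} (wide) qualifiers; normally, the assembler chooses an encoding itself. A sketch with hypothetical labels, one line per variant from the list above:

\begin{minted}{ual}
	beq.n l_close   @ 16-bit, conditional:   offset in [-256, 254]
	b.n   l_near    @ 16-bit, unconditional: offset in [-2,048, 2,046]
	beq.w l_far     @ 32-bit, conditional:   offset in [-1,048,576, 1,048,574]
	b.w   l_distant @ 32-bit, unconditional: offset in [-16,777,216, 16,777,214]
\end{minted}

If reorganising the code brings a branch target within the narrow range, the assembler can then emit the 16-bit encoding automatically.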
The Clean compiler, with a code size of 3,827,868 bytes (Thumb optimised, \cref{fig:reg-alloc:results}),
	contains only 29,982 simple branches, of which 6,979 ($23\%$) are already narrow.
In the best case, we can make all the wide branches narrow
	and win $(29,982 - 6,979)\cdot 2 = 46,006$ bytes, only $1.2\%$.
The \ual{bl}, \ual{blx} and \ual{bx} instructions are used $42,436$, $23,070$ and $306$ times in the Clean compiler, respectively.

% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.w | wc -l
% 23003
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.n | wc -l
% 6979
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bl | wc -l
% 42436
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF blx | wc -l
% 23070
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bx | wc -l
% 306

\subsection{Matching programs and instruction sets}
\label{sec:discussion:matching}
If we compare the code size decrease with the running time increase,
	we get the plot in \cref{fig:results:scatter}.
Code size decrease seems to correlate negatively with running time increase,
	suggesting that Thumb is more suitable for some programs than for others.
However, we do not have enough data to see whether this correlation is significant.
More research is needed to see whether there actually is such a relationship,
	and if so, which programs are more suitable for Thumb.

The outliers in \cref{fig:results:scatter} are
	RFib $(0.0,2.6)$,
	Ack $(12.8,11.9)$,
	Fib $(9.8,5.1)$ and
	Tak $(12.1,8.0)$,
	all small programs that rely heavily on recursion
	(for a description of the programs, see \cref{sec:results:testsuite} above).
This is explained by the relatively high number of jumps:
	in Thumb, we had to add several instructions, increasing both code size and running time (see \cref{sec:storing-pc}).

\bigskip
\todo{more}

\end{multicols}