diff options
Diffstat (limited to 'thesis')
-rw-r--r-- | thesis/discussion.tex | 169 | ||||
-rw-r--r-- | thesis/further.tex | 11 | ||||
-rw-r--r-- | thesis/intro.tex | 14 | ||||
-rw-r--r-- | thesis/results.tex | 55 | ||||
-rw-r--r-- | thesis/terms.tex | 28 | ||||
-rw-r--r-- | thesis/thesis.tex | 3 |
6 files changed, 172 insertions, 108 deletions
diff --git a/thesis/discussion.tex b/thesis/discussion.tex new file mode 100644 index 0000000..ae72de3 --- /dev/null +++ b/thesis/discussion.tex @@ -0,0 +1,169 @@ +\section{Discussion} +\label{sec:discussion} + +\begin{multicols}{2} + +\subsection{Optimising subroutine calls} +\label{sec:discussion:subroutines} +In \cref{sec:storing-pc}, we saw that subroutine calls are slower and larger in Thumb than in ARM, + because an extra \ual{add} instruction has to be inserted. +We discussed that a more efficient solution would exploit the link register: + upon a \ual{bl} and \ual{blx} instruction, the link register is loaded with the address of the instruction after the branch. +If the callee would be responsible for saving this address, rather than the caller, + we could delay storing the return address and exploit these instructions to eliminate the \ual{add} instruction. +However, we also mentioned that tail recursion makes this difficult (see \cref{sec:storing-pc:other-solutions}). + +\subsubsection{Tail recursive entry points} +To improve on this, two things can be done. +First, we could add a \emph{tail recursive entry point} to function blocks. +The code for the \clean{length} function from \cref{sec:storing-pc:other-solutions} could then look like this: + +\begin{minted}{ual} +slength_P2: + str lr,[sp,#-4]! @ Store the return address +slength_P2_tr: @ The tail recursive entry point + ldr r0,[r6] @ Load top of A-stack (the list) + + @ ... etc + +e_0: @ If / after the Cons part has been evaluated + add r4,r4,#1 @ Increase counter + b slength_P2_tr @ Tail recursively jump back to start of function + .ltorg +\end{minted} + +This way, the code to store the return address is only executed once. +The return address is saved on the stack, as before. +To call the subroutine, the following block can be used: + +\begin{minted}{ual} + bl slength_P2 +\end{minted} + +This is an improvement of between two and six bytes per subroutine call: + in any case, we win two bytes since the \ual{add} instruction is eliminated. +In the minimal case that a function is called only once, we save only these two bytes. +If the function is called often, many \ual{str} instructions can be removed at the cost of one extra \ual{str} instruction in the function itself, + and the space saved per subroutine call asymptotically approaches six bytes. + +As for running time, we win one instruction cycle per subroutine call. +The \ual{add} instruction is removed, but the \ual{str} instruction is still executed + (whether it is placed in the caller's or the callee's block does not matter). + +To implement tail recursive entry points, + one would need to be able to recognise tail recursion such that the tail recursive entry point is used only for these subroutine calls, + but not for the `plain' recursive subroutine calls. +Also, more research is needed to see if a tail recursive entry point is sufficient in all cases. + +\subsubsection{Mixed calling convention} +A second possibility would be to have a \emph{mixed calling convention}. +Currently, in every subroutine call, the caller is responsible for storing the return address on the stack. +In a mixed calling convention, + the callee would be responsible for this, + except for functions that exploit tail recursion. +Recognising tail recursion can be done on a relatively high level of abstraction, + so annotations can be added to the intermediate ABC-code to indicate what calling convention should be used for a certain function. + +An advantage of this approach might be that there are less cases to consider. +The obvious disadvantage is that it makes the calling convention overly complex. + +\begin{figure*}[b!] + \centering + \begin{tikzpicture} + \begin{axis} + [ width=.5\textwidth + , xlabel={Size decrease (\%)} + , ylabel={Running time increase (\%)} + , mark options={scale=2} + , scatter/classes={% + a={mark=x,draw=\ifcolours red\else gray\fi}, + b={mark=+,draw=black}} + ] + \addplot[scatter,only marks,scatter src=explicit symbolic] + table[meta=label] { + x y label + 12.8 11.9 a + 18.6 3.6 b + 9.8 5.1 a + 18.9 5.9 b + 18.9 3.7 b + 20.1 5.4 a + 22.9 0.9 a + 0.0 2.6 a + 20.0 5.2 b + 22.0 0.0 b + 12.1 8.0 a + }; + \end{axis} + \end{tikzpicture} + \caption{ + Comparison of code size decrease and running time increase. + {\ifcolours Red\else Grey\fi} crosses indicate programs that are smaller than 1000b in ARM; + black pluses correspond to larger programs. + \label{fig:results:scatter}} +\end{figure*} + +\subsection{Branch optimisation} +\label{sec:discussion:branches} +A second optimisation vector that is yet to be considered is branch optimisation. +The Thumb instruction set has four variants of the simple branch instruction \ual{b}~\parencite[A6.7.12]{armv7m}: + +\begin{itemize}[itemsep=0pt] + \item + 16-bits, conditional, with an offset between $-256$ and $254$. + \item + 16-bits, not conditional, with an offset between $-2,048$ and $2,046$. + \item + 32-bits, conditional, with an offset between $-1,048,576$ and $1,048,574$. + \item + 32-bits, not conditional, with an offset between $-16,777,216$ and $16,777,214$. +\end{itemize} + +By reorganising the code, the code generator could make as many branch instructions as possible 16-bit. + +This optimisation is restricted to the \ual{b} instruction: + the branch with link (\ual{bl}) is always 32-bit; + the branch and exchange (\ual{bx}) and branch with link and exchange (\ual{blx}) instructions always 16-bit. +Since the \ual{b} instruction is used relatively little by the code generator, + we cannot hope for much improvement. +The Clean compiler, with a code size of 3,827,868 bytes (Thumb optimised, \cref{fig:reg-alloc:results}), + counts only 29,982 simple branches, of which 6,979 ($23\%$) are already narrow. +In the best case we can make all the wide branches narrow + and win $(29,982 - 6,979)\cdot 2 = 46,004$ bytes, only $1.2\%$. +The \ual{bl}, \ual{blx} and \ual{bx} instructions are used $42,436$, $23,070$ and $306$ times in the Clean compiler, respectively. + +% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.w | wc -l +% 23003 +% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.n | wc -l +% 6979 +% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bl | wc -l +% 42436 +% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF blx | wc -l +% 23070 +% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bx | wc -l +% 306 + +\subsection{Matching programs and instruction sets} +\label{sec:discussion:matching} +If we compare the code size decrease with the running time increase, + we get the plot in \cref{fig:results:scatter}. +It seems that code size decrease correlates negatively with increase in running time, + suggesting that Thumb is more suitable for some programs than others. +However, we do not have enough data to see if this is significant. +More research is needed to see if there actually is such a relationship, + and if so, what programs are more suitable for Thumb. + +The outliers in \cref{fig:results:scatter} are + RFib $(0.0,2.6)$, + Ack $(12.8,11.9)$, + Fib $(9.8,5.1)$ and + Tak $(12.1,8.0)$, + all small programs that rely heavily on recursion + (for a description of the programs, see \cref{sec:results:testsuite} above). +This is explained by the relatively high number of jumps: + in Thumb we had to add several instructions, introducing extra code size and running time (see \cref{sec:storing-pc}). + +\bigskip +\todo{more} + +\end{multicols} diff --git a/thesis/further.tex b/thesis/further.tex deleted file mode 100644 index e499b7f..0000000 --- a/thesis/further.tex +++ /dev/null @@ -1,11 +0,0 @@ -\section{Further work} -\label{sec:further} - -\begin{multicols}{2} - -\todo{ - There are some things that are still to be done and some optimisation vectors that are not yet considered. - What will be written depends on how far I get in the coming weeks with this. -} - -\end{multicols} diff --git a/thesis/intro.tex b/thesis/intro.tex index 3d6347c..ccffb4c 100644 --- a/thesis/intro.tex +++ b/thesis/intro.tex @@ -308,22 +308,12 @@ In much of the rest of this thesis we discuss differences between ARM and Thumb, In \cref{sec:storing-pc}, we consider an issue arising from halfword-aligned instructions and the way a read of PC is interpreted under Thumb. \Cref{sec:code-addresses} discusses problems related to the fact that Thumb instruction addresses use bit 1 and should have bit 0 set for interworking, while under ARM a branch will automatically clear both these bits. -There was a minor problem with negative offsets to the \ual{ldr} instruction, - that cannot be as large in Thumb as they can in ARM. -\Cref{sec:load-offsets} deals with this. +For some instructions, the constants in the ARM variant can be larger than those in the Thumb variant. +\Cref{sec:load-offsets} deals with related issues in the \ual{ldr} instruction. Moving to Thumb introduces a number of interesting optimisation vectors. One of them, register allocation, is discussed in \cref{sec:reg-alloc}. We benchmark the Thumb backend for Clean and discuss the results in \cref{sec:results}. -\subsection{Terminology} -\label{sec:intro:terminology} -As mentioned, we will usually use \enquote{Thumb} where the Thumb-2 extension is meant, - and only distinguish Thumb and Thumb-2 when this distinction is relevant --- - for \enquote{ARM}, context makes out whether the architecture or the instruction set is meant. - -For brevity, a number of common abbreviations is used. -An overview can be found in \cref{sec:terminology}. - \end{multicols} diff --git a/thesis/results.tex b/thesis/results.tex index 76afb27..b826138 100644 --- a/thesis/results.tex +++ b/thesis/results.tex @@ -91,42 +91,6 @@ They are: A sequential version of Warshall's shortest path algorithm on a $6\times6$ matrix. \end{description} -\begin{figure*}[b!] - \centering - \begin{tikzpicture} - \begin{axis} - [ width=.5\textwidth - , xlabel={Size decrease (\%)} - , ylabel={Running time increase (\%)} - , mark options={scale=2} - , scatter/classes={% - a={mark=x,draw=red}, - b={mark=+,draw=black}} - ] - \addplot[scatter,only marks,scatter src=explicit symbolic] - table[meta=label] { - x y label - 12.8 11.9 a - 18.6 3.6 b - 9.8 5.1 a - 18.9 5.9 b - 18.9 3.7 b - 20.1 5.4 a - 22.9 0.9 a - 0.0 2.6 a - 20.0 5.2 b - 22.0 0.0 b - 12.1 8.0 a - }; - \end{axis} - \end{tikzpicture} - \caption{ - Comparison of code size decrease and running time increase. - Red crosses indicate programs that are smaller than 1000b in ARM; - black pluses correspond to larger programs. - \label{fig:results:scatter}} -\end{figure*} - \subsection{Code size} \label{sec:results:size} We compared the code size for the code generated by the ARM and Thumb backends for all programs in the test suite. @@ -155,25 +119,6 @@ The increase in running time is relatively low compared to the decrease in code For the larger programs, this is 3.7\% with a standard deviation of 2.04pp. (though this is based on only five programs). -\subsection{Discussion} -\label{sec:results:discussion} -If we compare the code size decrease with the running time increase, - we get the plot in \cref{fig:results:scatter}. -It seems that a high code size decrease often comes together with a low increase in running time, - suggesting that Thumb is more suitable for some programs than others. -However, we do not have enough data to see if this is significant. -More research is needed to see if there actually is such a relationship, - and if so, what programs are more suitable for Thumb. - -The outliers in \cref{fig:results:scatter} are - RFib $(0.0,2.6)$, - Ack $(12.8,11.9)$, - Fib $(9.8,5.1)$ and - Tak $(12.1,8.0)$, - all small programs that rely heavily on recursion. -This is explained by the relatively high number of jumps: - in Thumb we had to add several instructions, introducing extra code size and running time (see \cref{sec:storing-pc}). - \end{multicols} % vim: nowrap diff --git a/thesis/terms.tex b/thesis/terms.tex deleted file mode 100644 index 4104875..0000000 --- a/thesis/terms.tex +++ /dev/null @@ -1,28 +0,0 @@ -\section{Terminology} -\label{sec:terminology} - -\begin{multicols}{2} - -\begin{description} - \item[HNF] - Head normal form, the state of a graph node that cannot be rewritten itself (though it children may). - \item[ISA] - Instruction set architecture. - \item[IT block] - If-then block, a Thumb construct for conditional execution. - \item[LSB(s)] - Least significant bit(s), the bit(s) with the lowest value in a word. - \item[PC] - Program counter, \ual{r15} on ARM. - \item[RTS] - Run-time system. - Clean's run-time system ensures that the \clean{Start} node is reduced and printed. - It also takes care of garbage collecting. - \item[SP] - Stack pointer, \ual{r13} on ARM. - \item[UAL] - Unified Assembler Language, an assembly language syntax that - \enquote{provides a canonical form for all ARM and Thumb instructions} \parencite[A4.2]{armv7m}. -\end{description} - -\end{multicols} diff --git a/thesis/thesis.tex b/thesis/thesis.tex index 95abdee..273694e 100644 --- a/thesis/thesis.tex +++ b/thesis/thesis.tex @@ -54,14 +54,13 @@ It produces on average 20\% smaller code than the ARM code generator, which is o \input{load-offsets} \input{reg-alloc} \input{results} -\input{further} +\input{discussion} \ifdraft\else \cleardoublepage \fi \let\section\oldsection \appendix -\input{terms} \input{system} \input{status} |