commit 163929496ea077a7882632cd99a98c806b02591a (patch)
author    Camil Staps  2017-01-02 21:50:11 +0100
committer Camil Staps  2017-01-02 21:50:49 +0100
tree      89660ec8fe7a12738ea5146f1af7dce5c5a63306 /thesis/discussion.tex
parent    Make columns optional (diff)

Remove terminology appendix; write discussion

Diffstat (limited to 'thesis/discussion.tex'):
 -rw-r--r--  thesis/discussion.tex | 169
 1 file changed, 169 insertions(+), 0 deletions(-)
thesis/discussion.tex (new file, 169 lines):

\section{Discussion}
\label{sec:discussion}

\begin{multicols}{2}

\subsection{Optimising subroutine calls}
\label{sec:discussion:subroutines}
In \cref{sec:storing-pc}, we saw that subroutine calls are slower and larger in Thumb than in ARM,
	because an extra \ual{add} instruction has to be inserted.
We discussed that a more efficient solution would exploit the link register:
	upon a \ual{bl} or \ual{blx} instruction, the link register is loaded with the address of the instruction after the branch.
If the callee were responsible for saving this address, rather than the caller,
	we could delay storing the return address and exploit these instructions to eliminate the \ual{add} instruction.
However, we also mentioned that tail recursion makes this difficult (see \cref{sec:storing-pc:other-solutions}).

\subsubsection{Tail recursive entry points}
To improve on this, two things can be done.
First, we could add a \emph{tail recursive entry point} to function blocks.
The code for the \clean{length} function from \cref{sec:storing-pc:other-solutions} could then look like this:

\begin{minted}{ual}
slength_P2:
	str lr,[sp,#-4]!  @ Store the return address
slength_P2_tr:        @ The tail recursive entry point
	ldr r0,[r6]       @ Load top of A-stack (the list)

	@ ... etc.

e_0:                  @ If / after the Cons part has been evaluated
	add r4,r4,#1      @ Increase counter
	b slength_P2_tr   @ Tail recursive jump back to the start of the function
	.ltorg
\end{minted}

This way, the code to store the return address is executed only once.
The return address is saved on the stack, as before.
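A function using this scheme can then return by popping the saved address straight into the program counter. As a hypothetical sketch (this return sequence is an assumption for illustration, not taken from the generated code):

\begin{minted}{ual}
	ldr pc,[sp],#4    @ Pop the saved return address into pc, returning
\end{minted}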
To call the subroutine, the following block can be used:

\begin{minted}{ual}
	bl slength_P2
\end{minted}

This is an improvement of between two and six bytes per subroutine call:
	in any case, we win two bytes, since the \ual{add} instruction is eliminated.
In the minimal case that a function is called only once, we save only these two bytes.
If the function is called often, many \ual{str} instructions can be removed at the cost of one extra \ual{str} instruction in the function itself,
	and the space saved per subroutine call asymptotically approaches six bytes.

As for running time, we win one instruction cycle per subroutine call.
The \ual{add} instruction is removed, but the \ual{str} instruction is still executed
	(whether it is placed in the caller's or the callee's block does not matter).

To implement tail recursive entry points,
	one would need to be able to recognise tail recursion, such that the tail recursive entry point is used only for tail recursive subroutine calls
	and not for `plain' recursive subroutine calls.
Also, more research is needed to see whether a tail recursive entry point is sufficient in all cases.

\subsubsection{Mixed calling convention}
A second possibility would be to use a \emph{mixed calling convention}.
Currently, in every subroutine call, the caller is responsible for storing the return address on the stack.
In a mixed calling convention,
	the callee would be responsible for this,
	except for functions that exploit tail recursion.
Recognising tail recursion can be done on a relatively high level of abstraction,
	so annotations can be added to the intermediate ABC-code to indicate which calling convention should be used for a certain function.

An advantage of this approach might be that there are fewer cases to consider.
The obvious disadvantage is that it makes the calling convention overly complex.

\begin{figure*}[b!]
	\centering
	\begin{tikzpicture}
		\begin{axis}
			[ width=.5\textwidth
			, xlabel={Size decrease (\%)}
			, ylabel={Running time increase (\%)}
			, mark options={scale=2}
			, scatter/classes={%
				a={mark=x,draw=\ifcolours red\else gray\fi},
				b={mark=+,draw=black}}
			]
			\addplot[scatter,only marks,scatter src=explicit symbolic]
				table[meta=label] {
					x    y    label
					12.8 11.9 a
					18.6 3.6  b
					9.8  5.1  a
					18.9 5.9  b
					18.9 3.7  b
					20.1 5.4  a
					22.9 0.9  a
					0.0  2.6  a
					20.0 5.2  b
					22.0 0.0  b
					12.1 8.0  a
				};
		\end{axis}
	\end{tikzpicture}
	\caption{
		Comparison of code size decrease and running time increase.
		{\ifcolours Red\else Grey\fi} crosses indicate programs that are smaller than 1,000 bytes in ARM;
		black pluses correspond to larger programs.
		\label{fig:results:scatter}}
\end{figure*}

\subsection{Branch optimisation}
\label{sec:discussion:branches}
A second optimisation vector that is yet to be considered is branch optimisation.
The Thumb instruction set has four variants of the simple branch instruction \ual{b}~\parencite[A6.7.12]{armv7m}:

\begin{itemize}[itemsep=0pt]
	\item
		16-bit, conditional, with an offset between $-256$ and $254$.
	\item
		16-bit, unconditional, with an offset between $-2,048$ and $2,046$.
	\item
		32-bit, conditional, with an offset between $-1,048,576$ and $1,048,574$.
	\item
		32-bit, unconditional, with an offset between $-16,777,216$ and $16,777,214$.
\end{itemize}

By reorganising the code, the code generator could make as many branch instructions as possible 16-bit.

This optimisation is restricted to the \ual{b} instruction:
	the branch with link (\ual{bl}) is always 32-bit;
	the branch and exchange (\ual{bx}) and branch with link and exchange (\ual{blx}) instructions are always 16-bit.
Since the \ual{b} instruction is used relatively rarely by the code generator,
	we cannot hope for much improvement.
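For illustration, UAL lets one force a particular encoding with the \texttt{.n} (narrow) and \texttt{.w} (wide) qualifiers; normally, the assembler chooses an encoding itself. A sketch with hypothetical labels, one line per variant from the list above:

\begin{minted}{ual}
	beq.n l_close   @ 16-bit, conditional:   offset in [-256, 254]
	b.n   l_near    @ 16-bit, unconditional: offset in [-2,048, 2,046]
	beq.w l_far     @ 32-bit, conditional:   offset in [-1,048,576, 1,048,574]
	b.w   l_distant @ 32-bit, unconditional: offset in [-16,777,216, 16,777,214]
\end{minted}

If reorganising the code brings a branch target within the narrow range, the assembler can then emit the 16-bit encoding automatically.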
The Clean compiler, with a code size of 3,827,868 bytes (Thumb optimised, \cref{fig:reg-alloc:results}),
	contains only 29,982 simple branches, of which 6,979 ($23\%$) are already narrow.
In the best case, we can make all the wide branches narrow
	and win $(29,982 - 6,979)\cdot 2 = 46,006$ bytes, only $1.2\%$.
The \ual{bl}, \ual{blx} and \ual{bx} instructions are used $42,436$, $23,070$ and $306$ times in the Clean compiler, respectively.

% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.w | wc -l
% 23003
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.n | wc -l
% 6979
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bl | wc -l
% 42436
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF blx | wc -l
% 23070
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bx | wc -l
% 306

\subsection{Matching programs and instruction sets}
\label{sec:discussion:matching}
If we compare the code size decrease with the running time increase,
	we get the plot in \cref{fig:results:scatter}.
Code size decrease seems to correlate negatively with running time increase,
	suggesting that Thumb is more suitable for some programs than for others.
However, we do not have enough data to see whether this correlation is significant.
More research is needed to see whether there actually is such a relationship,
	and if so, which programs are more suitable for Thumb.

The outliers in \cref{fig:results:scatter} are
	RFib $(0.0,2.6)$,
	Ack $(12.8,11.9)$,
	Fib $(9.8,5.1)$ and
	Tak $(12.1,8.0)$,
	all small programs that rely heavily on recursion
	(for a description of the programs, see \cref{sec:results:testsuite} above).
This is explained by the relatively high number of jumps:
	in Thumb, we had to add several instructions, increasing both code size and running time (see \cref{sec:storing-pc}).

\bigskip
\todo{more}

\end{multicols}