diff options
Diffstat (limited to 'thesis/reg-alloc.tex')
-rw-r--r-- | thesis/reg-alloc.tex | 143 |
1 files changed, 122 insertions, 21 deletions
diff --git a/thesis/reg-alloc.tex b/thesis/reg-alloc.tex index 0fc764e..5918b59 100644 --- a/thesis/reg-alloc.tex +++ b/thesis/reg-alloc.tex @@ -14,7 +14,8 @@ , ytick=data , scaled x ticks=real:1 , xtick scale label code/.code={} - , ytick pos=left + , axis lines*=left + , compat={1.3} ] \addplot coordinates { (378618,S0) @@ -47,7 +48,8 @@ , height=1.1\linewidth , symbolic y coords={S0,A ptr.,A0,Heap ptr.,A1,B0,A2,B ptr.,B1,S1,B2,B3,B4,Heap ctr.,A3} , ytick=data - , ytick pos=left + , axis lines*=left + , compat={1.3} ] \addplot coordinates { (387,S0) @@ -149,6 +151,30 @@ In Clean programs, the ARM registers are used as follows~\parencite[\texttt{arms \item The program counter PC. \end{itemize} +\begin{table*}[b!] + \centering + \newcommand{\highlight}[1]{\textcolor{red}{\textbf{#1}}} + \begin{tabular}{l >{\tt\footnotesize}c >{\tt\footnotesize}c} + Register & {\normalsize\normalfont ARM} & {\normalsize\normalfont Thumb-2} \\\hline + A3 & r11 & \highlight{r12} \\ + Heap counter & r5 & \highlight{r9} \\ + B4 & r0 & \highlight{r10} \\ + B3 & r1 & \highlight{r11} \\ + B2 & r2 & \highlight{r8} \\ + S1 & r14 & r14 \\ + B1 & r3 & r3 \\ + B pointer & sp & sp \\ + A2 & r8 & \highlight{r2} \\ + B0 & r4 & r4 \\ + A1 & r7 & r7 \\ + Heap pointer & r10 & \highlight{r1} \\ + A0 & r6 & r6 \\ + A pointer & r9 & \highlight{r5} \\ + S0 & r12 & \highlight{r0} \\ + \end{tabular} + \captionof{table}{Register allocation in the ARM and Thumb backends\label{tab:reg-alloc:arm-and-thumb}} +\end{table*} + In \cref{fig:reg-usage}, we count for each register the number of instructions it occurs in, in the code generated for the Clean compiler~\parencite{cocl}. When a register is used multiple times in the same instruction, this counts as one occurrence. The vertical line at indicates the number of occurrences that a variable should have to justify putting it in the lower registers, @@ -162,31 +188,106 @@ For small programs it may not be negligible, but these programs are not likely to need rigorous optimisation for code size, considering that they will be small anyway. +The counting method used here is rather simplistic: + basing a register allocation on these counts alone means assuming that every instruction has a 16-bit variant, which is not the case. +A more accurate method would only count those instructions where using a high or low register actually makes a difference. +This is much more complicated, and for a rough estimate the simplistic method used here will already allow us to shrink down the code size. + +\begin{figure*}[b!] + \small + \centering + \begin{tikzpicture} + \begin{axis} + [ anchor=south east + , height=.4\linewidth + , xbar + , xmin=0, xmax=6000000 + , xlabel={Bytes} + , ytick=\empty + , reverse stacked plots + , ylabel={Clean compiler} + , ymin=-1, ymax=1 + , axis lines*=left + , nodes near coords, nodes near coords align=west + , reverse legend + , every axis plot/.append style={point meta=explicit symbolic} + , legend style={at={($(1,-0.3)+(1.5em,0pt)$)},anchor=north,legend columns=-1} + ] + \addplot coordinates { (3785144,0) [80.7\%] }; + \addplot coordinates { (3827868,0) [81.6\%] }; + \addplot coordinates { (4385964,0) [93.5\%] }; + \addplot coordinates { (4692628,0) [100\%] }; + \legend{Thumb object-code (approximation), Thumb optimised, Thumb, ARM} + \end{axis} + \begin{axis} + [ anchor=south west + , height=.4\linewidth + , xshift=3em + , xbar + , xmin=0, xmax=80000 + , xlabel={Bytes} + , ytick=\empty + , ymin=-1, ymax=1 + , ylabel={\texttt{frontend/scanner.icl}} + , axis lines*=left + , every axis plot/.append style={point meta=explicit symbolic} + , nodes near coords, nodes near coords align=west + ] + \addplot coordinates { (57296,0) [82.0\%] }; + \addplot coordinates { (57864,0) [82.8\%] }; + \addplot coordinates { (64588,0) [92.4\%] }; + \addplot coordinates { (69896,0) [100\%] }; + \end{axis} + \end{tikzpicture} + \caption{Code size for different backends\label{fig:reg-alloc:results}} +\end{figure*} + \subsection{Optimisation} \label{sec:reg-alloc:optimisation} +The register allocation used in the ARM backend is inefficient for Thumb-2 on several points (\cref{tab:reg-alloc:arm-and-thumb}). +A number of often-used registers (the main scratch register, the A-stack pointer and the heap pointer) are in the upper half, + while registers that are used less often are in the lower half (the heap counter and B2 through B4). + \Cref{fig:reg-usage} suggests that we put all variables that occur more than 85,000 times in the low registers. However, the B-stack pointer has to be SP: the B- and C-stacks are combined into one - (an optimisation suggested early on by \cite{execspecs} \todo{exact ref.}) and the C-stack pointer should be the system pointer to allow for an efficient foreign function interface. -% Before optimisation: -% pi@rasppi:~/clean/src/compiler$ ls -l a.out -% -rwxr-xr-x 1 pi pi 9987016 Dec 7 21:46 a.out -% pi@rasppi:~/clean/src/compiler$ strip --strip-all a.out -% pi@rasppi:~/clean/src/compiler$ ls -l a.out -% -rwxr-xr-x 1 pi pi 5034316 Dec 7 21:46 a.out - -% After optimisation as suggested by John: -% 1 r9 <-> r5 -% 2 r0 -> r10 -% 3 r1 -> r11 -% 4 r10 -> r1 -% 5 r11 -> r12 -% 6 r12 -> r0 -% static char *register_name[15]={"sp","r1","r5","r12","r8","r7","r6","r4","r3","r2","r11","r10","r9","r0","r14"}; -% static int register_order[15]={13,1,5,12,8,7,6,4,3,2,11,10,9,0,14}; -% pi@rasppi:~/clean/src/compiler$ ls -l a.out -% -rwxr-xr-x 1 pi pi 7591584 Dec 7 22:16 a.out +The third column in \cref{tab:reg-alloc:arm-and-thumb} shows the proposed allocation for Thumb. +The heap counter and the A-stack pointer are swapped, as are A2 and B2. +The registers for S0, B4, the heap pointer, B3 and A3 have been rotated. +This way, all eight most often used registers are in the lower half except the B-stack pointer. + +\subsection{Results} +\label{sec:reg-alloc:results} +To measure the code size improvement introduced by this optimisation, + we again take the Clean compiler and compile it in three different ways: + +\begin{itemize}[itemsep=0pt] + \item + With the existing ARM backend \parencite{armcg}. + \item + With the new Thumb backend, leaving the register allocation as in the ARM backend. + \item + With the new Thumb backend, with the optimised register allocation as in the third column of \cref{tab:reg-alloc:arm-and-thumb}. + \item + With the new Thumb backend, with the optimised register allocation and leaving out any \ual{.align} directives to approximate object-code level efficiency. +\end{itemize} + +The object code generator has not been finished yet at the moment these measurements were done, + so we can only estimate its efficiency. +A modification described in \cref{sec:storing-pc:solution} required us to add \ual{.align} to jump instructions, adding a \ual{nop} instruction in approximately half the instances. +In the last measurement, we approximate object-code level efficiency by leaving out these \ual{.align} directives, as they will not be needed in the object code generator. + +We only measure the size of generated code, that is, all \texttt{.text} segments except those that belong to the RTS or external libraries. +These measurements were done on the system described in \cref{sec:system}. +One may argue that the Clean compiler is not representative code because of some peculiarities like large lookup tables and arrays that are rarely used in practice. +For this reason, we compare both the total code size and the size of a part of the lexer \parencite[\texttt{frontend/scanner.icl}]{cocl}, + which can be considered representative. +The results are in \cref{fig:reg-alloc:results}. + +We see that even without the optimisation discussed in this section, some code size is saved. +After all, at least some instructions can be made 16-bit, and in the sections above we have hardly ever had to add extra instructions. +However, optimising register allocation allows us to save much more, up to almost 20\%. \end{multicols} |