diff options
Diffstat (limited to 'thesis')
-rw-r--r-- | thesis/preamble.tex | 6 | ||||
-rw-r--r-- | thesis/reg-alloc.tex | 143 | ||||
-rw-r--r-- | thesis/storing-pc.tex | 11 | ||||
-rw-r--r-- | thesis/system.tex | 24 | ||||
-rw-r--r-- | thesis/terms.tex | 4 | ||||
-rw-r--r-- | thesis/thesis.tex | 3 |
6 files changed, 167 insertions, 24 deletions
diff --git a/thesis/preamble.tex b/thesis/preamble.tex index 34f2ba2..ea4725f 100644 --- a/thesis/preamble.tex +++ b/thesis/preamble.tex @@ -17,6 +17,7 @@ \usepackage[british]{babel} \usepackage[babel=true]{csquotes} \usepackage{enumitem} +\usepackage{array} \usepackage{amsmath} \usepackage[usenames,dvipsnames,svgnames,table]{xcolor} @@ -56,6 +57,7 @@ \usepackage{subcaption} \usepackage{pgfplots} +\pgfplotsset{compat=newest} \usepackage{tikz} \usetikzlibrary{positioning,calc} @@ -75,6 +77,10 @@ \let\oldsection\section \renewcommand{\section}{\ifdraft\else\clearpage\fi\oldsection} +\newenvironment{inlinefloat} + {\par\bigskip\noindent\minipage{\linewidth}} + {\endminipage\par\medskip} + \raggedcolumns \renewcommand{\baselinestretch}{1.1} diff --git a/thesis/reg-alloc.tex b/thesis/reg-alloc.tex index 0fc764e..5918b59 100644 --- a/thesis/reg-alloc.tex +++ b/thesis/reg-alloc.tex @@ -14,7 +14,8 @@ , ytick=data , scaled x ticks=real:1 , xtick scale label code/.code={} - , ytick pos=left + , axis lines*=left + , compat={1.3} ] \addplot coordinates { (378618,S0) @@ -47,7 +48,8 @@ , height=1.1\linewidth , symbolic y coords={S0,A ptr.,A0,Heap ptr.,A1,B0,A2,B ptr.,B1,S1,B2,B3,B4,Heap ctr.,A3} , ytick=data - , ytick pos=left + , axis lines*=left + , compat={1.3} ] \addplot coordinates { (387,S0) @@ -149,6 +151,30 @@ In Clean programs, the ARM registers are used as follows~\parencite[\texttt{arms \item The program counter PC. \end{itemize} +\begin{table*}[b!] + \centering + \newcommand{\highlight}[1]{\textcolor{red}{\textbf{#1}}} + \begin{tabular}{l >{\tt\footnotesize}c >{\tt\footnotesize}c} + Register & {\normalsize\normalfont ARM} & {\normalsize\normalfont Thumb-2} \\\hline + A3 & r11 & \highlight{r12} \\ + Heap counter & r5 & \highlight{r9} \\ + B4 & r0 & \highlight{r10} \\ + B3 & r1 & \highlight{r11} \\ + B2 & r2 & \highlight{r8} \\ + S1 & r14 & r14 \\ + B1 & r3 & r3 \\ + B pointer & sp & sp \\ + A2 & r8 & \highlight{r2} \\ + B0 & r4 & r4 \\ + A1 & r7 & r7 \\ + Heap pointer & r10 & \highlight{r1} \\ + A0 & r6 & r6 \\ + A pointer & r9 & \highlight{r5} \\ + S0 & r12 & \highlight{r0} \\ + \end{tabular} + \captionof{table}{Register allocation in the ARM and Thumb backends\label{tab:reg-alloc:arm-and-thumb}} +\end{table*} + In \cref{fig:reg-usage}, we count for each register the number of instructions it occurs in, in the code generated for the Clean compiler~\parencite{cocl}. When a register is used multiple times in the same instruction, this counts as one occurrence. The vertical line at indicates the number of occurrences that a variable should have to justify putting it in the lower registers, @@ -162,31 +188,106 @@ For small programs it may not be negligible, but these programs are not likely to need rigorous optimisation for code size, considering that they will be small anyway. +The counting method used here is rather simplistic: + basing a register allocation on these counts alone means assuming that every instruction has a 16-bit variant, which is not the case. +A more accurate method would only count those instructions where using a high or low register actually makes a difference. +This is much more complicated, and for a rough estimate the simplistic method used here will already allow us to shrink down the code size. + +\begin{figure*}[b!] + \small + \centering + \begin{tikzpicture} + \begin{axis} + [ anchor=south east + , height=.4\linewidth + , xbar + , xmin=0, xmax=6000000 + , xlabel={Bytes} + , ytick=\empty + , reverse stacked plots + , ylabel={Clean compiler} + , ymin=-1, ymax=1 + , axis lines*=left + , nodes near coords, nodes near coords align=west + , reverse legend + , every axis plot/.append style={point meta=explicit symbolic} + , legend style={at={($(1,-0.3)+(1.5em,0pt)$)},anchor=north,legend columns=-1} + ] + \addplot coordinates { (3785144,0) [80.7\%] }; + \addplot coordinates { (3827868,0) [81.6\%] }; + \addplot coordinates { (4385964,0) [93.5\%] }; + \addplot coordinates { (4692628,0) [100\%] }; + \legend{Thumb object-code (approximation), Thumb optimised, Thumb, ARM} + \end{axis} + \begin{axis} + [ anchor=south west + , height=.4\linewidth + , xshift=3em + , xbar + , xmin=0, xmax=80000 + , xlabel={Bytes} + , ytick=\empty + , ymin=-1, ymax=1 + , ylabel={\texttt{frontend/scanner.icl}} + , axis lines*=left + , every axis plot/.append style={point meta=explicit symbolic} + , nodes near coords, nodes near coords align=west + ] + \addplot coordinates { (57296,0) [82.0\%] }; + \addplot coordinates { (57864,0) [82.8\%] }; + \addplot coordinates { (64588,0) [92.4\%] }; + \addplot coordinates { (69896,0) [100\%] }; + \end{axis} + \end{tikzpicture} + \caption{Code size for different backends\label{fig:reg-alloc:results}} +\end{figure*} + \subsection{Optimisation} \label{sec:reg-alloc:optimisation} +The register allocation used in the ARM backend is inefficient for Thumb-2 on several points (\cref{tab:reg-alloc:arm-and-thumb}). +A number of often-used registers (the main scratch register, the A-stack pointer and the heap pointer) are in the upper half, + while registers that are used less often are in the lower half (the heap counter and B2 through B4). + \Cref{fig:reg-usage} suggests that we put all variables that occur more than 85,000 times in the low registers. However, the B-stack pointer has to be SP: the B- and C-stacks are combined into one - (an optimisation suggested early on by \cite{execspecs} \todo{exact ref.}) and the C-stack pointer should be the system pointer to allow for an efficient foreign function interface. -% Before optimisation: -% pi@rasppi:~/clean/src/compiler$ ls -l a.out -% -rwxr-xr-x 1 pi pi 9987016 Dec 7 21:46 a.out -% pi@rasppi:~/clean/src/compiler$ strip --strip-all a.out -% pi@rasppi:~/clean/src/compiler$ ls -l a.out -% -rwxr-xr-x 1 pi pi 5034316 Dec 7 21:46 a.out - -% After optimisation as suggested by John: -% 1 r9 <-> r5 -% 2 r0 -> r10 -% 3 r1 -> r11 -% 4 r10 -> r1 -% 5 r11 -> r12 -% 6 r12 -> r0 -% static char *register_name[15]={"sp","r1","r5","r12","r8","r7","r6","r4","r3","r2","r11","r10","r9","r0","r14"}; -% static int register_order[15]={13,1,5,12,8,7,6,4,3,2,11,10,9,0,14}; -% pi@rasppi:~/clean/src/compiler$ ls -l a.out -% -rwxr-xr-x 1 pi pi 7591584 Dec 7 22:16 a.out +The third column in \cref{tab:reg-alloc:arm-and-thumb} shows the proposed allocation for Thumb. +The heap counter and the A-stack pointer are swapped, as are A2 and B2. +The registers for S0, B4, the heap pointer, B3 and A3 have been rotated. +This way, all eight most often used registers are in the lower half except the B-stack pointer. + +\subsection{Results} +\label{sec:reg-alloc:results} +To measure the code size improvement introduced by this optimisation, + we again take the Clean compiler and compile it in three different ways: + +\begin{itemize}[itemsep=0pt] + \item + With the existing ARM backend \parencite{armcg}. + \item + With the new Thumb backend, leaving the register allocation as in the ARM backend. + \item + With the new Thumb backend, with the optimised register allocation as in the third column of \cref{tab:reg-alloc:arm-and-thumb}. + \item + With the new Thumb backend, with the optimised register allocation and leaving out any \ual{.align} directives to approximate object-code level efficiency. +\end{itemize} + +The object code generator has not been finished yet at the moment these measurements were done, + so we can only estimate its efficiency. +A modification described in \cref{sec:storing-pc:solution} required us to add \ual{.align} to jump instructions, adding a \ual{nop} instruction in approximately half the instances. +In the last measurement, we approximate object-code level efficiency by leaving out these \ual{.align} directives, as they will not be needed in the object code generator. + +We only measure the size of generated code, that is, all \texttt{.text} segments except those that belong to the RTS or external libraries. +These measurements were done on the system described in \cref{sec:system}. +One may argue that the Clean compiler is not representative code because of some peculiarities like large lookup tables and arrays that are rarely used in practice. +For this reason, we compare both the total code size and the size of a part of the lexer \parencite[\texttt{frontend/scanner.icl}]{cocl}, + which can be considered representative. +The results are in \cref{fig:reg-alloc:results}. + +We see that even without the optimisation discussed in this section, some code size is saved. +After all, at least some instructions can be made 16-bit, and in the sections above we have hardly ever had to add extra instructions. +However, optimising register allocation allows us to save much more, up to almost 20\%. \end{multicols} diff --git a/thesis/storing-pc.tex b/thesis/storing-pc.tex index 40da1d1..bfb8442 100644 --- a/thesis/storing-pc.tex +++ b/thesis/storing-pc.tex @@ -121,13 +121,18 @@ We then get, for the example from \texttt{armstartup.s:540-2}: We store the value \ual{0x2d}. This address points to the \ual{tst} instruction as before, with the LSB set to 1 to indicate Thumb mode. -In assembly code, we can solve this by adding labels for the addresses we want to store on the stack. -In object code, we need to keep track of the current alignment and add either 9 or 11 to the read program counter. - The offset, 9, is calculated as the number of bytes to the instruction after the branch plus one for Thumb mode. For \ual{b} and \ual{bl} instructions, this means an offset of 9, since these instructions are 32-bit. The \ual{bx} and \ual{blx} instructions are 16-bit, and require an offset of 7. +The second problem is that + when the \ual{add} instruction is located at the start of a halfword (e.g. \ual{0x22}), + the value that is read for PC will still be \ual{0x20}, as it is word-aligned~\parencite[A5.1.2]{armv7m} +When generating object code, we need to keep track of the current alignment and add either 7, 9 or 11 to the read program counter, + depending on both the alignment and the size of the branch instruction. +When generating assembly code, we are less concerned with efficiency. +In this code generator, we simply force-align the \ual{add} instruction (by adding an \ual{.align} directive, which will insert a \ual{nop} if necessary). + \subsection{Implementation details} \label{sec:storing-pc:implementation} A jump always jumps to either a label (with \ual{b} or \ual{bl}) or a register (with \ual{bx} or \ual{blx}). diff --git a/thesis/system.tex b/thesis/system.tex new file mode 100644 index 0000000..669329e --- /dev/null +++ b/thesis/system.tex @@ -0,0 +1,24 @@ +\section{System setup} +\label{sec:system} + +\begin{multicols}{2} + +Tests were run on a Raspberry Pi 3, Model B with 1GB RAM. +Its processor is listed in \texttt{/proc/cpuinfo} as an `ARMv7 Processor rev 4 (v7l)', + although the Hardware field is `BCM2709', which is an ARMv8 chip. + +To build programs, we used the following suite: + +\begin{itemize}[itemsep=0pt] + \item gcc (Raspbian 4.9.2-10) 4.9.2 + \item GNU assembler (GNU Binutils for Raspbian) 2.25 + \item GNU ld (GNU Binutils for Raspbian) 2.25 + \item A Clean compiler built from intermediate ABC files retrieved on December 7, 2016 \parencite{cocl}, + built with the ARM code generator revision 295 \parencite{armcg} + and the ARM RTS revision 387 \parencite{armrts}. +\end{itemize} + +These programs are given \texttt{-march=armv7-a} when applicable to optimise for the ARMv7 architecture. +The Thumb backend proposed in this thesis uses features that are deprecated by ARMv8-A (see \cref{sec:intro:conditionals}). + +\end{multicols} diff --git a/thesis/terms.tex b/thesis/terms.tex index 707b956..bb5e6a1 100644 --- a/thesis/terms.tex +++ b/thesis/terms.tex @@ -1,6 +1,8 @@ \section{Terminology} \label{sec:terminology} +\begin{multicols}{2} + \begin{description} \item[HNF] Head normal form, the state of a graph node that cannot be rewritten itself (though it children may). @@ -22,3 +24,5 @@ Unified Assembler Language. \todo{reference, more explanation.} \end{description} + +\end{multicols} diff --git a/thesis/thesis.tex b/thesis/thesis.tex index 2ab1667..fed073f 100644 --- a/thesis/thesis.tex +++ b/thesis/thesis.tex @@ -35,6 +35,7 @@ The code generator proposed here can be used in situations where small code size \cleardoublepage \fi +\setcounter{tocdepth}{2} \ifdraft \begin{multicols}{2} \tableofcontents @@ -58,8 +59,10 @@ The code generator proposed here can be used in situations where small code size \ifdraft\else \cleardoublepage \fi +\let\section\oldsection \appendix \input{terms} +\input{system} \ifdraft\else \cleardoublepage |