\section{Introduction} \label{sec:intro} \begin{multicols}{2} \subsection{ARM, Thumb and Thumb-2} \label{sec:intro:arm-thumb-thumb2} ARM is a RISC architecture with several enhancements to a basic RISC architecture allowing ARM processors to \enquote{achieve a good balance of high performance, small program size, low power consumption, and small silicon area}~\parencite[A1.1]{armv7ar}. Several instruction sets were designed for the ARM architecture. First of all, the 32-bit ARM ISA allows the programmer to easily make full use of all the architecture's features. The Thumb instruction set provides a 16-bit alternative to the ARM ISA, giving in on performance to achieve improved code density. Starting from ARMv6T2, an extension to the Thumb instruction set, known as Thumb-2, adds 32-bit instructions to Thumb to \enquote{achieve performance similar to ARM code, with code density better than that of earlier Thumb code}~\parencite[A1.2]{armv7ar}. This gives the ARM and Thumb instruction sets \enquote{almost identical functionality}~\parencite[A1.2]{armv7ar}, whereas Thumb gives a smaller code size. In this thesis, we will usually use \enquote{Thumb} where the Thumb-2 extension is meant. Only when the distinction with pre-ARMv6T2 Thumb is important will we distinguish between (early) Thumb and Thumb-2. As for \enquote{ARM}, it should be clear from the context whether the architecture or the instruction set is meant. Using the Unified Assembler Language (UAL), one can write assembly code for both the ARM and the Thumb instruction set and fix the target ISA only at assemble-time. The main differences between ARM and Thumb-2 are the following: \subsubsection{Conditional execution} \label{sec:intro:conditionals} In ARM, every instruction has a 4-bit conditional field that allows for conditional execution. In the Thumb instruction set, all conditional instructions except branches have to be in an \enquote{IT block}. A first \ual{it} instruction gives the base condition and a then-else pattern. The following instructions are executed conditionally. For example: \begin{minted}{ual} itte gt @ tte: then, then, else movgt r2,r3 @ mov if gt movgt r0,r1 @ mov if gt movle r0,#0 @ mov if le (= not gt) \end{minted} This is UAL syntax. When assembling for Thumb, an \ual{it} instruction with three \ual{mov} instructions (without conditional field) is generated. For ARM, the \ual{it} instruction is ignored and three conditional \ual{mov} instructions are generated. ARMv8-A deprecates some uses of the \ual{it} instruction for performance reasons~\parencite[J5.2]{armv8a}. \subsubsection{Register usage} \label{sec:intro:registers} ARM processors have sixteen registers. ARM instructions have 4-bit register fields to address them. Some 16-bit Thumb instructions have 3-bit register fields that can only address the lowest eight registers. For these instructions there exist 32-bit variants that can address all sixteen registers. \begin{figure*}[b] \centering \begin{subfigure}[b]{.2\linewidth} \centering \begin{tikzpicture}[clean] \node (start) {\clean{Start}}; \end{tikzpicture} \caption{Initially} \end{subfigure}% \begin{subfigure}[b]{.2\linewidth} \centering \begin{tikzpicture}[clean] \node[node args=1] (fac) {\clean{fac}}; \node[below=of fac.arg1.south] (4) {\clean{4}}; \draw (fac.arg1) -- (4.north); \end{tikzpicture} \caption{Applying \clean{Start}} \end{subfigure}% \begin{subfigure}[b]{.3\linewidth} \centering \begin{tikzpicture}[clean] \node[node args=2] (times) {\clean{*}}; \node[below=of times.arg1.south] (4) {\clean{4}}; \node[node args=1,right=of times.arg2.east] (fac) {\clean{fac}}; \node[node args=2,right=of fac.arg1.east] (minus) {\clean{-}}; \node[right=of minus.arg2.east] (1) {\clean{1}}; \draw (times.arg1) -- (4.north); \draw (times.arg2) -- (fac.west); \draw (fac.arg1) -- (minus.west); \draw (minus.arg1) |- (4.east); \draw (minus.arg2) -- (1.west); \end{tikzpicture} \caption{Applying \clean{fac n}} \end{subfigure}% \begin{subfigure}[b]{.2\linewidth} \centering \begin{tikzpicture}[clean] \node[node args=2] (times) {\clean{*}}; \node[below=of times.arg1.south] (4) {\clean{4}}; \node[node args=1,right=of times.arg2.east] (fac) {\clean{fac}}; \node[below=of fac.arg1.south] (3) {\clean{3}}; \draw (times.arg1) -- (4.north); \draw (times.arg2) -- (fac.west); \draw (fac.arg1) -- (3.north); \end{tikzpicture} \caption{Applying \clean{-}} \end{subfigure} \caption{Rewriting a Clean node\label{fig:intro:rewriting}} \end{figure*} \subsubsection{Interworking} \label{sec:intro:interworking} The ARM and Thumb instruction sets are designed to \emph{interwork}: different parts of a program can be assembled for different instruction sets and it is possible to switch instruction set when the program counter is written to~\parencite[A4.1]{armv7ar}. The Thumb-2 code generator proposed in this thesis does not produce ARM code, though the existence of the interworking facility has effects on the techniques that can be used in it. This will be covered in \cref{sec:code-addresses}. \subsection{Clean} \label{sec:intro:clean} Clean is \enquote{a general purpose, state-of-the-art, pure and lazy functional programming language designed for making real-world applications}~\parencite{clean}. It evolved from LEAN, an intermediate language based on graph rewriting~\parencite{lean}. A Clean program consists of a list of rewrite rules, for example: \begin{minted}{clean} fac 0 = 1 fac n = n * fac (n - 1) Start = fac 4 \end{minted} Executing a Clean program means rewriting the \clean{Start} rule until no rewriting is possible any more. The first rewriting steps of the above program are shown in \cref{fig:intro:rewriting}. \subsubsection{The ABC-machine} \label{sec:intro:clean:abc} A Clean program is compiled to ABC-code, a platform-independent language for the ABC-machine, an abstract graph rewriting machine~\parencite{execspecs}. A code generator is used to generate machine code equivalent to the ABC-code for concrete machines. This, again, is done in several steps, so that only a relatively small part of the compilation process is target-dependent. The ABC-machine is \enquote{an imperative abstract machine architecture for graph rewriting}~\parencite{execspecs}. It consists of three stacks: for arguments (A), basic values (B) and control information like return addresses (C). It also has a program store containing the graph rewriting rules that make up the program, a graph store that contains the graph to be rewritten, and several other elements that we need not mention here. We do not discuss the internals of the ABC-machine in much depth here --- they have been described in \cite{execspecs} and \cite{m680x0}. However, to get a better idea of the working of the ABC-machine, we consider the ABC-code for the factorial example above briefly. The two rules for \clean{fac 0} and \clean{fac n} are translated to the following ABC-code% \footnote{The actual code generated by the compiler is slightly larger, since it uses some unnecessary instructions. They are removed automatically by the code generator and are not considered here for brevity.}: \begin{minted}[tabsize=4]{abc} sfac.1 eqI_b 0 0 | Is argument on B-stack 0? jmp_true case.1 | Jump to first alternative jmp case.2 | Jump to second alternative case.1 | First alternative (fac 0) pop_b 1 | Pop argument from B-stack pushI 1 | Push result to B-stack rtn | Return case.2 | Second alternative (fac 1) pushI 1 | Push 1 to B-stack push_b 1 | Copy argument to B-stack subI | Subtract on B-stack jsr sfac.1 | Call factorial mulI | Multiply two stack elements rtn | Return \end{minted} The code under \abc{sfac.1} is for the pattern match on the first argument. In \abc{case.1}, we handle the case that it is zero. We only need to remove it from the stack and push the return value 1. In \abc{case.2}, we copy the argument, decrement it and recursively call \abc{sfac.1}. After the \abc{jsr} instruction, we will have two elements on the stack: the result of the recursive call and the original argument. We can the multiply them and return. This function uses the B-stack, for basic values, for its arguments and return values, because it uses only integers. More complex types would be passed on the A-stack. The \clean{Start} rule compiles to% \footnote{Again, the code has been shorted insignificantly for brevity.}: \begin{minted}[tabsize=4]{abc} __fac_Start build _ 0 nStart.2 | Build the Start node jmp _driver | Jump to the RTS nStart.2 | The Start node entry jsr eaStart.2 | Jump to actual code fillI_b 0 0 | Copy result to A-stack pop_b 1 | Pop B-stack rtn | Return eaStart.2 | Actual code pushI 4 | Push 4 to B-stack jmp sfac.1 | Call factorial \end{minted} When the program starts and \abc{__fac_Start} is executed, a node for \clean{Start} is created in the graph store. We then jump to the run-time system (discussed in \cref{sec:intro:clean:rts}), where the \abc{_driver} subroutine will make sure that this node is reduced and printed. In \abc{nStart.2}, we call \abc{eaStart.2}. This subroutine will execute \clean{fac 4} and return the result on the B-stack. Since \abc{nStart.2} is expected to return its result on the A-stack, we need to copy it and clean up the B-stack. In \abc{eaStart.2}, we only need to set the argument for the factorial, 4, on the stack, and we jump to \abc{sfac.1} to let it do the rest. \subsubsection{The code generator} \label{sec:intro:clean:cg} The code generator translates the abstract ABC-code into concrete machine code. While the ABC-machine can be implemented in a straightforward manner using macro expansion, Clean's code generator does more than that: it introduces several optimisations, some of which are target-dependent. The ARM code for the factorial example is as follows% \footnote{Some irrelevant peculiarities have been removed for brevity and clarity.}: \begin{minted}{ual} sfac_P1: cmp r4,#0 @ Is argument 0? bne case_P2 case_P1: mov r4,#1 @ Return 1 ldr pc,[sp],#4 case_P2: str r4,[sp,#-4]! @ Copy argument add r4,r4,#-1 @ Decrement argument str pc,[sp,#-4]! bl sfac_P1 @ Call factorial ldr r12,[sp],#4 @ Multiply with own argument mul r4,r12,r4 ldr pc,[sp],#4 \end{minted} We see that the argument is passed in \ual{r4}, and that register is also used for the result. In ARM, \ual{r4} through \ual{r0} are used for the top of the B-stack. The machine stack pointer \ual{sp} is used as the B-stack pointer. For intermediate results, \ual{r12} is used as a scratch register. When a reduction has finished, control is passed back to the callee by loading the PC from the stack (\ual{ldr pc,[sp],#4}). The \clean{Start} rule translates roughly to: \begin{minted}{ual} ____fac__Start: ldr r12,=nStart_P2 @ Store Start node on the heap str r12,[r10] mov r6,r10 @ Save node pointer on A-stack add r10,r10,#12 @ Reserve heap space b __driver @ Jump to RTS nStart_P2: str r6,[r9],#4 @ Save node entry str pc,[sp,#-4]! @ Save return address bl eaStart_P2 @ Jump to actual code ldr r6,[r9,#-4]! @ Restore node entry ldr r12,=INT+2 @ Make an Int node on A-stack str r12,[r6] str r4,[r6,#4] @ Give that node value r4 ldr pc,[sp],#4 @ Return to callee eaStart_P2: mov r4,#4 @ Push 4 to B-stack sfac_P1 @ A jump to sfac_P1 is not @ ... @ needed as it is right below \end{minted} For the A-stack, \ual{r6} through \ual{r8} and \ual{r11} are used, and \ual{r9} is the A-stack pointer. Nodes (the graph store of the ABC-machine) are stored on a heap which uses \ual{r10} as a pointer. To store the \clean{Start} node on the heap, we need to write its address and then increase \ual{r10} to reserve the space. The \ual{__driver} subroutine will reduce the top of the A-stack, so saving the pointer to the \clean{Start} node in \ual{r6}, the top of the A-stack, makes sure that \ual{__driver} jumps to \ual{nStart_P2}. There, we do some administration and jump to \ual{eaStart_P2}. In that routine, the literal \ual{4} is pushed to the B-stack and we \enquote{jump} to \ual{sfac_P1} --- the code generator optimises this by placing \ual{sfac_P1} right below and removing the branch instruction. When \ual{sfac_P1} returns, we continue in \ual{nStart_P2}. There we need to copy the result from the B-stack to the A-stack. Since the B-stack is untyped (the compiler makes sure that it is safe), while the A-stack is typed, we need to create a node on the heap with the \ual{INT+2} entry address. Note that ABC instructions like \abc{pushI 4} (push 4 to the B-stack) are translated to ARM code like \ual{mov r4,#4}: the content of the B-stack is not moved down, as one might expect. This is possible because the code generator knows that the B-stack will be empty at this point. Had it not been empty, but had one element, for example, a \ual{mov r3,r4} instruction would have preceded to move it down. The actual ABC-code contains annotations that tell the code generator how many elements the stacks hold, so that these optimisations can be made. \subsubsection{The run-time system} \label{sec:intro:clean:rts} After compilation, a Clean program is linked together with the Clean run-time system. The RTS ensures that that the \clean{Start} node is reduced and printed and takes care of garbage collecting. This system is written in Clean, C and assembly, so making Clean available on Thumb-2 inherently means adapting the platform-dependent parts of the RTS as well. \subsection{A Thumb backend for Clean} \label{sec:intro:clean-thumb} In this thesis, we propose a Thumb backend for Clean. The ARM code generator and RTS were taken as a starting point. Thanks to the Unified Assembler Language, only little time was needed to make these systems assemble for the Thumb instruction set. The proposed backend does not use ARM and Thumb interworking. The reason for this is threefold. First, there are several processors, like the ARMv7-M series~\parencite[A4.1]{armv7m}, that do support Thumb instructions but cannot run ARM code. By only using Thumb, we target the widest possible range of devices. Second, we doubt that using interworking can be done efficiently. In the run-time system, only minimal time overhead is introduced by using Thumb instructions. For generated code it would be complicated to detect if the ARM or Thumb instruction set would give better results, and this would give significantly better results only in specific cases. Third, the problem discussed in \cref{sec:code-addresses} could be solved efficiently only without interworking. Using interworking would introduce overhead at every branch instruction, since the solution to this problem would have to be adapted. \subsection{Organisation} \label{sec:intro:organisation} In much of the rest of this thesis we discuss differences between ARM and Thumb, their influences on code generation, and the way they were dealt with in the Thumb backend for Clean proposed in this thesis. In \cref{sec:storing-pc}, we consider an issue arising from halfword-aligned instructions and the way a read of PC is interpreted under Thumb. \Cref{sec:code-addresses} discusses problems related to the fact that Thumb instruction addresses use bit 1 and should have bit 0 set for interworking, while under ARM a branch will automatically clear both these bits. There was a minor problem with negative offsets to the \ual{ldr} instruction, that cannot be as large in Thumb as they can in ARM. \Cref{sec:load-offsets} deals with this. Moving to Thumb introduces a number of interesting optimisation vectors. One of them, register allocation, is discussed in \cref{sec:reg-alloc}. We benchmark the Thumb backend for Clean and discuss the results in \cref{sec:results}. \subsection{Terminology} \label{sec:intro:terminology} As mentioned, we will usually use \enquote{Thumb} where the Thumb-2 extension is meant, and only distinguish Thumb and Thumb-2 when this distinction is relevant --- for \enquote{ARM}, context makes out whether the architecture or the instruction set is meant. For brevity, a number of common abbreviations is used. An overview can be found in \cref{sec:terminology}. \end{multicols}