thesis/intro.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322

\section{Introduction}
\label{sec:intro}

\begin{multicols}{2}

\subsection{ARM, Thumb and Thumb-2}
\label{sec:intro:arm-thumb-thumb2}
ARM is a RISC architecture with several enhancements to a basic RISC architecture allowing ARM processors to
	\enquote{achieve a good balance of high performance, small program size, low power consumption, and small silicon area}~\parencite[A1.1]{armv7ar}.

Several instruction sets were designed for the ARM architecture.
First of all, the 32-bit ARM ISA allows the programmer to easily make full use of all the architecture's features.
The Thumb instruction set provides a 16-bit alternative to the ARM ISA, giving in on performance to achieve improved code density.
Starting from ARMv6T2, an extension to the Thumb instruction set, known as Thumb-2, adds 32-bit instructions to Thumb to
	\enquote{achieve performance similar to ARM code, with code density better than that of earlier Thumb code}~\parencite[A1.2]{armv7ar}.
This gives the ARM and Thumb instruction sets \enquote{almost identical functionality}~\parencite[A1.2]{armv7ar},
	whereas Thumb gives a smaller code size.

Using the Unified Assembler Language (UAL), one can write assembly code for both the ARM and the Thumb instruction set and fix the target ISA only at assemble-time.

The main differences between ARM and Thumb-2 are the following:

\subsubsection{Conditional execution}
\label{sec:intro:conditionals}
In ARM, every instruction has a 4-bit conditional field that allows for conditional execution.
In the Thumb instruction set, all conditional instructions except branches have to be in an \enquote{IT block}.
A first \ual{it} instruction gives the base condition and a then-else pattern.
The following instructions are executed conditionally.
For example:

\begin{minted}{ual}
	itte   gt     @ tte: then, then, else
	movgt  r2,r3  @ mov if gt
	movgt  r0,r1  @ mov if gt
	movle  r0,#0  @ mov if le (= not gt)
\end{minted}

This is UAL syntax.
When assembling for Thumb, an \ual{it} instruction with three \ual{mov} instructions (without conditional field) is generated.
For ARM, the \ual{it} instruction is ignored and three conditional \ual{mov} instructions are generated.

ARMv8-A deprecates some uses of the \ual{it} instruction for performance reasons~\parencite[J5.2]{armv8a}.

\subsubsection{Register usage}
\label{sec:intro:registers}
ARM processors have sixteen registers.
ARM instructions have 4-bit register fields to address them.
Some 16-bit Thumb instructions have 3-bit register fields that can only address the lowest eight registers.
For these instructions there exist 32-bit variants that can address all sixteen registers.

\begin{figure*}[t]
	\centering
	\begin{subfigure}[b]{.2\linewidth}
		\centering
		\begin{tikzpicture}[clean]
			\node (start) {\clean{Start}};
		\end{tikzpicture}
		\caption{Initially}
	\end{subfigure}%
	\begin{subfigure}[b]{.2\linewidth}
		\centering
		\begin{tikzpicture}[clean]
			\node[node args=1] (fac) {\clean{fac}};
			\node[below=of fac.arg1.south] (4) {\clean{4}};
			\draw (fac.arg1) -- (4.north);
		\end{tikzpicture}
		\caption{Applying \clean{Start}}
	\end{subfigure}%
	\begin{subfigure}[b]{.3\linewidth}
		\centering
		\begin{tikzpicture}[clean]
			\node[node args=2] (times) {\clean{*}};
			\node[below=of times.arg1.south] (4) {\clean{4}};
			\node[node args=1,right=of times.arg2.east] (fac) {\clean{fac}};
			\node[node args=2,right=of fac.arg1.east] (minus) {\clean{-}};
			\node[right=of minus.arg2.east] (1) {\clean{1}};
			\draw (times.arg1) -- (4.north);
			\draw (times.arg2) -- (fac.west);
			\draw (fac.arg1) -- (minus.west);
			\draw (minus.arg1) |- (4.east);
			\draw (minus.arg2) -- (1.west);
		\end{tikzpicture}
		\caption{Applying \clean{fac n}}
	\end{subfigure}%
	\begin{subfigure}[b]{.2\linewidth}
		\centering
		\begin{tikzpicture}[clean]
			\node[node args=2] (times) {\clean{*}};
			\node[below=of times.arg1.south] (4) {\clean{4}};
			\node[node args=1,right=of times.arg2.east] (fac) {\clean{fac}};
			\node[below=of fac.arg1.south] (3) {\clean{3}};
			\draw (times.arg1) -- (4.north);
			\draw (times.arg2) -- (fac.west);
			\draw (fac.arg1) -- (3.north);
		\end{tikzpicture}
		\caption{Applying \clean{-}}
	\end{subfigure}
	\caption{Rewriting a Clean node\label{fig:intro:rewriting}}
\end{figure*}

\subsubsection{Interworking}
\label{sec:intro:interworking}
The ARM and Thumb instruction sets are designed to \emph{interwork}:
	different parts of a program can be assembled for different instruction sets
	and it is possible to switch instruction set when the program counter is written to~\parencite[A4.1]{armv7ar}.

The Thumb-2 code generator proposed in this thesis does not produce ARM code,
	though the existence of the interworking facility has effects on the techniques that can be used in it.
This will be covered in \cref{sec:code-addresses}.

\subsection{Clean}
\label{sec:intro:clean}
Clean is \enquote{a general purpose, state-of-the-art, pure and lazy functional programming language designed for making real-world applications}~\parencite{clean}.
It evolved from LEAN, an intermediate language based on graph rewriting~\parencite{lean}.
A Clean program consists of a list of rewrite rules, for example:

\begin{minted}{clean}
	fac 0 = 1
	fac n = n * fac (n - 1)
	Start = fac 4
\end{minted}

Executing a Clean program means rewriting the \clean{Start} rule until no rewriting is possible any more.
The first rewriting steps of the above program are shown in \cref{fig:intro:rewriting}.

\subsubsection{The ABC-machine}
\label{sec:intro:clean:abc}
A Clean program is compiled to ABC-code, a platform-independent language for the ABC-machine, an abstract graph rewriting machine~\parencite{execspecs}.
A code generator is used to generate machine code equivalent to the ABC-code for concrete machines.
This, again, is done in several steps, so that only a relatively small part of the compilation process is target-dependent.

The ABC-machine is \enquote{an imperative abstract machine architecture for graph rewriting}~\parencite{execspecs}.
It consists of three stacks: for arguments (A), basic values (B) and control information like return addresses (C).
It also has a program store containing the graph rewriting rules that make up the program,
	a graph store that contains the graph to be rewritten,
	and several other elements that we need not mention here.

We do not discuss the internals of the ABC-machine in much depth here ---
	they have been described in \cite{execspecs} and \cite{m680x0}.
However, to get a better idea of the working of the ABC-machine,
	we consider the ABC-code for the factorial example above briefly.

The two rules for \clean{fac 0} and \clean{fac n} are translated to the following ABC-code%
	\footnote{The actual code generated by the compiler is slightly larger, since it uses some unnecessary instructions.
		They are removed automatically by the code generator and are not considered here for brevity.}:

\begin{minted}[tabsize=4]{abc}
	sfac.1
		eqI_b 0 0       | Is argument on B-stack 0?
		jmp_true case.1 | Jump to first alternative
		jmp case.2      | Jump to second alternative
	case.1              | First alternative (fac 0)
		pop_b 1         | Pop argument from B-stack
		pushI 1         | Push result to B-stack
		rtn             | Return
	case.2              | Second alternative (fac 1)
		pushI 1         | Push 1 to B-stack
		push_b 1        | Copy argument to B-stack
		subI            | Subtract on B-stack
		jsr sfac.1      | Call factorial
		mulI            | Multiply two stack elements
		rtn             | Return
\end{minted}

The code under \abc{sfac.1} is for the pattern match on the first argument.
In \abc{case.1}, we handle the case that it is zero.
We only need to remove it from the stack and push the return value 1.
In \abc{case.2}, we copy the argument, decrement it and recursively call \abc{sfac.1}.
After the \abc{jsr} instruction, we will have two elements on the stack:
	the result of the recursive call and the original argument.
We can the multiply them and return.

This function uses the B-stack, for basic values, for its arguments and return values, because it uses only integers.
More complex types would be passed on the A-stack.

The \clean{Start} rule compiles to%
	\footnote{Again, the code has been shorted insignificantly for brevity.}:

\begin{minted}[tabsize=4]{abc}
	__fac_Start
		build _ 0 nStart.2 | Build the Start node
		jmp _driver        | Jump to the RTS
	nStart.2               | The Start node entry
		jsr eaStart.2      | Jump to actual code
		fillI_b 0 0        | Copy result to A-stack
		pop_b 1            | Pop B-stack
		rtn                | Return
	eaStart.2              | Actual code
		pushI 4            | Push 4 to B-stack
		jmp sfac.1         | Call factorial
\end{minted}

When the program starts and \abc{__fac_Start} is executed, a node for \clean{Start} is created in the graph store.
We then jump to the run-time system (discussed in \cref{sec:intro:clean:rts}), where the \abc{_driver} subroutine will make sure that this node is reduced and printed.
In \abc{nStart.2}, we call \abc{eaStart.2}.
This subroutine will execute \clean{fac 4} and return the result on the B-stack.
Since \abc{nStart.2} is expected to return its result on the A-stack, we need to copy it and clean up the B-stack.
In \abc{eaStart.2}, we only need to set the argument for the factorial, 4, on the stack, and we jump to \abc{sfac.1} to let it do the rest.

\subsubsection{The code generator}
\label{sec:intro:clean:cg}
The code generator translates the abstract ABC-code into concrete machine code.
While the ABC-machine can be implemented in a straightforward manner using macro expansion,
	Clean's code generator does more than that:
	it introduces several optimisations, some of which are target-dependent.

The ARM code for the factorial example is as follows%
	\footnote{Some irrelevant peculiarities have been removed for brevity and clarity.}:

\begin{minted}{ual}
	sfac_P1:
	  cmp  r4,#0        @ Is argument 0?
	  bne  case_P2
	case_P1:
	  mov  r4,#1        @ Return 1
	  ldr  pc,[sp],#4
	case_P2:
	  str  r4,[sp,#-4]! @ Copy argument
	  add  r4,r4,#-1    @ Decrement argument
	  str  pc,[sp,#-4]!
	  bl   sfac_P1      @ Call factorial
	  ldr  r12,[sp],#4  @ Multiply with own argument
	  mul  r4,r12,r4
	  ldr  pc,[sp],#4
\end{minted}

We see that the argument is passed in \ual{r4}, and that register is also used for the result.
In ARM, \ual{r4} through \ual{r0} are used for the top of the B-stack.
The machine stack pointer \ual{sp} is used as the B-stack pointer.
For intermediate results, \ual{r12} is used as a scratch register.
When a reduction has finished, control is passed back to the callee by loading the PC from the stack (\ual{ldr pc,[sp],#4}).

The \clean{Start} rule translates roughly to:

\begin{minted}{ual}
	____fac__Start:
	  ldr  r12,=nStart_P2 @ Store Start node on the heap
	  str  r12,[r10]
	  mov  r6,r10         @ Save node pointer on A-stack
	  add  r10,r10,#12    @ Reserve heap space
	  b    __driver       @ Jump to RTS
	nStart_P2:
	  str  r6,[r9],#4     @ Save node entry
	  str  pc,[sp,#-4]!   @ Save return address
	  bl   eaStart_P2     @ Jump to actual code
	  ldr  r6,[r9,#-4]!   @ Restore node entry
	  ldr  r12,=INT+2     @ Make an Int node on A-stack
	  str  r12,[r6]
	  str  r4,[r6,#4]     @ Give that node value r4
	  ldr  pc,[sp],#4     @ Return to callee
	eaStart_P2:
	  mov  r4,#4          @ Push 4 to B-stack
	sfac_P1               @ A jump to sfac_P1 is not
	  @ ...               @ needed as it is right below
\end{minted}

For the A-stack, \ual{r6} through \ual{r8} and \ual{r11} are used, and \ual{r9} is the A-stack pointer.
Nodes (the graph store of the ABC-machine) are stored on a heap which uses \ual{r10} as a pointer.
To store the \clean{Start} node on the heap, we need to write its address and then increase \ual{r10} to reserve the space.
The \ual{__driver} subroutine will reduce the top of the A-stack, so saving the pointer to the \clean{Start} node in \ual{r6}, the top of the A-stack, makes sure that \ual{__driver} jumps to \ual{nStart_P2}.
There, we do some administration and jump to \ual{eaStart_P2}.
In that routine, the literal \ual{4} is pushed to the B-stack and we \enquote{jump} to \ual{sfac_P1} ---
	the code generator optimises this by placing \ual{sfac_P1} right below and removing the branch instruction.
When \ual{sfac_P1} returns, we continue in \ual{nStart_P2}.
There we need to copy the result from the B-stack to the A-stack.
Since the B-stack is untyped (the compiler makes sure that it is safe), while the A-stack is typed,
	we need to create a node on the heap with the \ual{INT+2} entry address.

Note that ABC instructions like \abc{pushI 4} (push 4 to the B-stack) are translated to ARM code like \ual{mov r4,#4}:
	the content of the B-stack is not moved down, as one might expect.
This is possible because the code generator knows that the B-stack will be empty at this point.
Had it not been empty, but had one element, for example, a \ual{mov r3,r4} instruction would have preceded to move it down.
The actual ABC-code contains annotations that tell the code generator how many elements the stacks hold, so that these optimisations can be made.

\subsubsection{The run-time system}
\label{sec:intro:clean:rts}
After compilation, a Clean program is linked together with the Clean run-time system.
The RTS ensures that that the \clean{Start} node is reduced and printed and takes care of garbage collecting.
This system is written in Clean, C and assembly, so making Clean available on Thumb-2 inherently means adapting the platform-dependent parts of the RTS as well.

\subsection{A Thumb backend for Clean}
\label{sec:intro:clean-thumb}
In this thesis, we propose a Thumb backend for Clean.
The ARM code generator and RTS were taken as a starting point.
Thanks to the Unified Assembler Language, only little time was needed to make these systems assemble for the Thumb instruction set.

The proposed backend does not use ARM and Thumb interworking.
The reason for this is threefold.
First, there are several processors, like the ARMv7-M series~\parencite[A4.1]{armv7m}, that do support Thumb instructions but cannot run ARM code.
By only using Thumb, we target the widest possible range of devices.
Second, we doubt that using interworking can be done efficiently.
In the run-time system, only minimal time overhead is introduced by using Thumb instructions.
For generated code it would be complicated to detect if the ARM or Thumb instruction set would give better results,
	and this would give significantly better results only in specific cases.
Third, the problem discussed in \cref{sec:code-addresses} could be solved efficiently only without interworking.
Using interworking would introduce overhead at every branch instruction,
	since the solution to this problem would have to be adapted.

\subsection{Terminology}
\label{sec:intro:terminology}
In this thesis, I will usually write \enquote{Thumb} where the Thumb-2 extension is meant.
Only when the distinction with pre-ARMv6T2 Thumb is important will I distinguish between (early) Thumb and Thumb-2.
As for \enquote{ARM}, it should be clear from the context whether the architecture or the instruction set is meant.

For brevity, a number of common abbreviations is used.
An overview can be found in \cref{sec:abbreviations}.

\subsection{Organisation}
\label{sec:intro:organisation}
In much of the rest of this thesis we discuss differences between ARM and Thumb,
	their influences on code generation,
	and the way they were dealt with in the Thumb backend for Clean proposed in this thesis.
In \cref{sec:storing-pc}, we consider an issue arising from halfword-aligned instructions and the way a read of PC is interpreted under Thumb.
\Cref{sec:code-addresses} discusses problems related to the fact that Thumb instruction addresses use bit 1 and should have bit 0 set for interworking,
	while under ARM a branch will automatically clear both these bits.
There was a minor problem with negative offsets to the \ual{ldr} instruction,
	that cannot be as large in Thumb as they can in ARM.
\Cref{sec:load-offsets} deals with this.

We benchmark the Thumb backend for Clean and discuss the results in \cref{sec:results}.

\end{multicols}