thesis/storing-pc.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226

\section{Storing the program counter}
\label{sec:storing-pc}

\begin{multicols}{2}

\subsection{Introduction}
\label{sec:storing-pc:intro}
Storing the program counter on the stack is something done commonly in many languages during a function call.
Usually, an offset to the actual program counter at the moment of the store instruction is stored,
	to allow for a branch after that instruction, and have the callee return to the address after that branch.
The ARM architecture accommodates for this: an ARM instruction that reads the program counter, actually reads
	\enquote{the address of the current instruction plus 8}~\parencite[A2.3]{armv7ar}.
The following ARM assembly example illustrates this (\texttt{armstartup.s:540-2},~\cite{armrts}):

\begin{minted}{ual}
	str     pc,[sp,#-4]!  @ 0x20
	bl      init_clean    @ 0x24
	tst     r4,r4         @ 0x28
\end{minted}

Dummy addresses have been indicated in comments.
When execution arrives at \ual{0x20}, the program counter is set to \ual{0x20}.
Per the above documentation, \ual{str} stores \ual{0x20+8} on the stack
	(to be precise: to the address indicated by the stack pointer with pre-indexed offset \ual{-4}).
The processor branches to \ual{init_clean},
	which loads the value from the stack back into the program counter to return to the caller.
The program counter is then \ual{0x28}.
For the next instruction cycle, the \ual{tst} command is executed.

There are two reasons why the above cannot be used in Thumb-2 code.
First, \ual{pc} is not allowed as the first operand of a \ual{str} instruction.

The second problem we meet is that the instruction to store the program counter may be halfword-aligned rather than word-aligned.
We saw above that a read of the program counter in ARM mode reads as PC+8.
In Thumb mode this is more complicated.
In this case, we 
	\enquote{[r]ead the word-aligned PC value, that is,
		the address of the current instruction + 4,
		with bits [1:0] forced to zero}~\parencite[A5.1.2]{armv7m}.
This means that when the \ual{add} instruction above is at \ual{0x22},
	we will still store \ual{0x2d} on the stack,
	since the word-aligned program counter is \ual{0x20} as before.
However, in this case \ual{bl} is located at \ual{0x2a}, and since this is a 32 bits instruction we point to the middle of that instruction.

\subsection{Usage in Clean}
\label{sec:storing-pc:clean}
Storing the PC is required for jumping to subroutines.
We see a clear example of this when a Clean function definition uses pattern matching.
Consider the following example:

\begin{minted}{clean}
	isEmpty :: [a] -> Bool
	isEmpty [] = True
	isEmpty _  = False
\end{minted}

The generated ABC-code looks as below%
	\footnote{The current Clean compiler misses the \abc{jsr_eval 0} line:
			the strictness analyser recognises that \clean{isEmpty} is strict in its first argument,
			so evaluating the argument to head normal form is not needed any more.
		In cases where the strictness analyser cannot derive strictness,
			code similar to this example is generated.
		The code given here can be reproduced with \bash{clm -nsa}.}.
Since \clean{isEmpty} pattern matches on its first argument, it needs to be evaluated to head normal form.
This is done with \abc{jsr_eval 0}.
Only after that can we check if its constructor is \abc{_Nil} (the empty list), which is done with \abc{eq_desc _Nil 0 0}.

\begin{minted}[tabsize=4]{abc}
	sisEmpty.1
		jsr_eval 0        | Evaluate argument
		eq_desc _Nil 0 0  | If it equals []
		jmp_true case.1   | .. jump to case.1
		jmp case.2        | [else] to case.2
	case.1
		pop_a 1           | Pop argument
		pushB TRUE        | Return True
		rtn
	case.2
		pop_a 1           | Pop argument
		pushB FALSE       | Return False
		rtn
\end{minted}

Evaluating the argument is done by jumping to the subroutine indicated by the node entry of that argument.
Hence, we store the PC, jump to that address, and continue with \abc{eq_desc _Nil 0 0} when the node has been evaluated to head normal form.
The ARM code for the pattern match is:

\begin{minted}{ual}
	sisEmpty_P1:
	  ldr  r12,[r6]      @ Load node entry point
	  tst  r12,#2        @ If in HNF already
	  bne  e_0           @ Skip evaluation
	  str  pc,[sp,#-4]!  @ Store PC
	  blx  r12           @ Evaluate argument
	e_0:
	  ldr  r12,[r6]      @ If it does not equal []
	  ldr  r14,=__Nil+2
	  cmp  r12,r14
	  bne  case_P2
\end{minted}

We can see here
	that evaluating a node requires a \abc{jsr_eval} ABC instruction, and
	that \abc{jsr} ABC instructions require storing the PC in ARM assembly.

Of course, evaluating nodes is something that happens throughout the source code and has to be done all the time during the execution of a Clean program.
We therefore need a fast, small Thumb alternative for the ARM code.

\subsection{Solution}
\label{sec:storing-pc:solution}
To solve the first problem, we need to first move \ual{pc} to an auxiliary register, and then push that on the stack.
We then get, for the example from \texttt{armstartup.s:540-2}:

\begin{minted}{ual}
	add     lr,pc,#9      @ 0x20
	str     lr,[sp,#-4]!  @ 0x24
	bl      init_clean    @ 0x28
	tst     r4,r4         @ 0x2c
\end{minted}

We store the value \ual{0x2d}.
This address points to the \ual{tst} instruction as before, with the LSB set to 1 to indicate Thumb mode.

The offset, 9, is calculated as the number of bytes to the instruction after the branch plus one for Thumb mode.
For \ual{b} and \ual{bl} instructions, this means an offset of 9, since these instructions are 32-bit.
The \ual{bx} and \ual{blx} instructions are 16-bit, and require an offset of 7.

The second problem is that
	when the \ual{add} instruction is located at the start of a halfword (e.g. \ual{0x22}),
	the value that is read for PC will still be \ual{0x20}, as it is word-aligned~\parencite[A5.1.2]{armv7m}
When generating object code, we need to keep track of the current alignment and add either 7, 9 or 11 to the read program counter,
	depending on both the alignment and the size of the branch instruction.
When generating assembly code, we are less concerned with efficiency.
In this code generator, we simply force-align the \ual{add} instruction (by adding an \ual{.align} directive, which will insert a \ual{nop} if necessary).

\subsection{Implementation details}
\label{sec:storing-pc:implementation}
A jump always jumps to either a label (with \ual{b} or \ual{bl}) or a register (with \ual{bx} or \ual{blx}).
The latter occurs for example in the case of a \abc{jsr_eval} ABC instruction.
This instruction is used to evaluate a node on the A-stack.
First, the node entry address has to be fetched; then we jump to that address.

In the first case, we need one scratch register to store the PC temporarily.
In the second case, we need two scratch registers: also one for the address we are jumping to.
For the ARM instruction set we needed zero and one scratch register(s), respectively.
The ARM backend uses two scratch registers, S0 and S1 (for more details, see \cref{sec:reg-alloc:clean}).
The latter is used only when two scratch registers are needed, so S0 is used in this case.

The Thumb-2 code generator uses S0 to store the PC temporarily in the first case,
	and S1 in the second case (where S0 is still used for the address we are jumping to).
This makes sure that S0 is used as much as possible.
The slightly easier implementation would use S1 in both cases.
However, in Thumb-2 it is convenient to have great variation in register usage:
	this allows for a massive code size optimisation (see \cref{sec:reg-alloc}).
For this reason it is better to use S0 whenever possible.

\subsection{Comparison}
\label{sec:storing-pc:comparison}
Assuming the worst case, that all instructions in the jump block are wide, we need four more bytes in Thumb than in ARM.
As a benchmark, the Clean compiler has 41,006 jumps of this kind in 1,253,978 instructions, or approximately 3.27\%.
The four extra bytes in Thumb mean a size increase of $41006\cdot4\approx160$KiB on the 5.3MiB file, an increase of 3.00\%.

As for the time complexity: every jump requires an extra instruction cycle.
In particular, every reduction needs an extra cycle.
It is hard to tell what effect this has on Clean programs in general,
	and it may well be very dependent on the kind of program.
A general comparison of running time under ARM and Thumb is made in \cref{sec:results:time}.

% pi@rasppi:~/clean/exe$ objdump -d cocl | grep -E '^\s+[0123456789abcdef]{5,8}:\s+[0123456789abcdef]{8}\s+push\s+\{pc\}' | wc -l
% 41006
% pi@rasppi:~/clean/exe$ objdump -d cocl | grep -E '^\s+[0123456789abcdef]{5,8}:\s+[0123456789abcdef]{8}' | wc -l
% 1253978

\subsection{Other solutions}
\label{sec:storing-pc:other-solutions}
Another solution than the one we have considered here makes use of the link register.
Some branch instructions, like \ual{bl}, store the address of the next instruction in the link register.
We could therefore imagine a setup where the callee gets the return address from that register rather than from the stack.
This is the approach taken by GCC.
The code of a typical C subroutine starts with \ual{push {...,lr}} and ends with \ual{pop {...,pc}}.

When generating code for a functional language, it is not straightforward to do this, due to tail recursion.
An example of this can be found in the following basic function:

\begin{minted}{clean}
	length :: !Int ![a] -> Int
	length n []     = n
	length n [_:xs] = length (n+1) xs
\end{minted}

The current Thumb backend generates the following code:

\begin{minted}{ual}
slength_P2:
	ldr   r0,[r6]         @ Load top of A-stack (the list)
	ldr   r14,=__Nil+2    @ Check if it is Nil
	cmp   r0,r14
	bne   case_P2
case_P1:                    @ If the list is Nil
	ldr   pc,[sp],#4      @ Simply return
	.ltorg
case_P2:                    @ If the list is not Nil
	ldr   r6,[r6,#8]      @ Load the Cons part of the list
	ldr   r0,[r6]         @ Check if it has been evaluated already
	tst   r0,#2
	bne   e_0
	str   r4,[sp,#-4]!    @ If the Cons part has not been evaluated
	.align
	add.w r14,pc,#7       @ Evaluate the cons part
	str.w r14,[sp,#-4]!
	blx   r0
	ldr   r4,[sp],#4
e_0:                        @ If / after the Cons part has been evaluated
	add   r4,r4,#1        @ Increase counter
	b     slength_P2      @ Tail recursively jump back to start of function
	.ltorg
\end{minted}

In this setup, it is not easily possible to let the callee store the return address,
	since we enter the function as many times as there are elements in the list,
	while we only return from it once, in \texttt{case\_P1}.
A more straightforward solution to have the caller responsible for storing the return address,
	which is why this approach is taken in Clean's ARM code generator~\parencite{armcg}
	and why we continue along these lines for the Thumb backend.

\end{multicols}