1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
|
\section{Storing the program counter}
\label{sec:storing-pc}
\begin{multicols}{2}
\subsection{Introduction}
\label{sec:storing-pc:intro}
Storing the program counter on the stack is something done commonly in many languages during a function call.
Usually, an offset to the actual program counter at the moment of the store instruction is stored,
to allow for a branch after that instruction, and have the callee return to the address after that branch.
The ARM architecture accommodates for this: an ARM instruction that reads the program counter, actually reads
\enquote{the address of the current instruction plus 8}~\parencite[A2.3]{armv7ar}.
The following ARM assembly example illustrates this (\texttt{armstartup.s:540-2},~\cite{armrts}):
\begin{minted}{ual}
str pc,[sp,#-4]! @ 0x20
bl init_clean @ 0x24
tst r4,r4 @ 0x28
\end{minted}
Dummy addresses have been indicated in comments.
When execution arrives at \ual{0x20}, the program counter is set to \ual{0x20}.
Per the above documentation, \ual{str} stores \ual{0x20+8} on the stack
(to be precise: to the address indicated by the stack pointer with pre-indexed offset \ual{-4}).
The processor branches to \ual{init_clean},
which loads the value from the stack back into the program counter to return to the caller.
The program counter is then \ual{0x28}.
For the next instruction cycle, the \ual{tst} command is executed.
There are two reasons why the above cannot be used in Thumb-2 code.
First, \ual{pc} is not allowed as the first operand of a \ual{str} instruction.
The second problem we meet is that the instruction to store the program counter may be halfword-aligned rather than word-aligned.
We saw above that a read of the program counter in ARM mode reads as PC+8.
In Thumb mode this is more complicated.
In this case, we
\enquote{[r]ead the word-aligned PC value, that is,
the address of the current instruction + 4,
with bits [1:0] forced to zero}~\parencite[A5.1.2]{armv7m}.
This means that when the \ual{add} instruction above is at \ual{0x22},
we will still store \ual{0x2d} on the stack,
since the word-aligned program counter is \ual{0x20} as before.
However, in this case \ual{bl} is located at \ual{0x2a}, and since this is a 32 bits instruction we point to the middle of that instruction.
\subsection{Usage in Clean}
\label{sec:storing-pc:clean}
Storing the PC is required for jumping to subroutines.
We see a clear example of this when a Clean function definition uses pattern matching.
Consider the following example:
\begin{minted}{clean}
isEmpty :: [a] -> Bool
isEmpty [] = True
isEmpty _ = False
\end{minted}
The generated ABC-code looks as below%
\footnote{The current Clean compiler misses the \abc{jsr_eval 0} line:
the strictness analyser recognises that \clean{isEmpty} is strict in its first argument,
so evaluating the argument to head normal form is not needed any more.
In cases where the strictness analyser cannot derive strictness,
code similar to this example is generated.
The code given here can be reproduced with \bash{clm -nsa}.}.
Since \clean{isEmpty} pattern matches on its first argument, it needs to be evaluated to head normal form.
This is done with \abc{jsr_eval 0}.
Only after that can we check if its constructor is \abc{_Nil} (the empty list), which is done with \abc{eq_desc _Nil 0 0}.
\begin{minted}[tabsize=4]{abc}
sisEmpty.1
jsr_eval 0 | Evaluate argument
eq_desc _Nil 0 0 | If it equals []
jmp_true case.1 | .. jump to case.1
jmp case.2 | [else] to case.2
case.1
pop_a 1 | Pop argument
pushB TRUE | Return True
rtn
case.2
pop_a 1 | Pop argument
pushB FALSE | Return False
rtn
\end{minted}
Evaluating the argument is done by jumping to the subroutine indicated by the node entry of that argument.
Hence, we store the PC, jump to that address, and continue with \abc{eq_desc _Nil 0 0} when the node has been evaluated to head normal form.
The ARM code for the pattern match is:
\begin{minted}{ual}
sisEmpty_P1:
ldr r12,[r6] @ Load node entry point
tst r12,#2 @ If in HNF already
bne e_0 @ Skip evaluation
str pc,[sp,#-4]! @ Store PC
blx r12 @ Evaluate argument
e_0:
ldr r12,[r6] @ If it does not equal []
ldr r14,=__Nil+2
cmp r12,r14
bne case_P2
\end{minted}
We can see here
that evaluating a node requires a \abc{jsr_eval} ABC instruction, and
that \abc{jsr} ABC instructions require storing the PC in ARM assembly.
Of course, evaluating nodes is something that happens throughout the source code and has to be done all the time during the execution of a Clean program.
We therefore need a fast, small Thumb alternative for the ARM code.
\subsection{Solution}
\label{sec:storing-pc:solution}
To solve the first problem, we need to first move \ual{pc} to an auxiliary register, and then push that on the stack.
We then get, for the example from \texttt{armstartup.s:540-2}:
\begin{minted}{ual}
add lr,pc,#9 @ 0x20
str lr,[sp,#-4]! @ 0x24
bl init_clean @ 0x28
tst r4,r4 @ 0x2c
\end{minted}
We store the value \ual{0x2d}.
This address points to the \ual{tst} instruction as before, with the LSB set to 1 to indicate Thumb mode.
The offset, 9, is calculated as the number of bytes to the instruction after the branch plus one for Thumb mode.
For \ual{b} and \ual{bl} instructions, this means an offset of 9, since these instructions are 32-bit.
The \ual{bx} and \ual{blx} instructions are 16-bit, and require an offset of 7.
The second problem is that
when the \ual{add} instruction is located at the start of a halfword (e.g. \ual{0x22}),
the value that is read for PC will still be \ual{0x20}, as it is word-aligned~\parencite[A5.1.2]{armv7m}
When generating object code, we need to keep track of the current alignment and add either 7, 9 or 11 to the read program counter,
depending on both the alignment and the size of the branch instruction.
When generating assembly code, we are less concerned with efficiency.
In this code generator, we simply force-align the \ual{add} instruction (by adding an \ual{.align} directive, which will insert a \ual{nop} if necessary).
\subsection{Implementation details}
\label{sec:storing-pc:implementation}
A jump always jumps to either a label (with \ual{b} or \ual{bl}) or a register (with \ual{bx} or \ual{blx}).
The latter occurs for example in the case of a \abc{jsr_eval} ABC instruction.
This instruction is used to evaluate a node on the A-stack.
First, the node entry address has to be fetched; then we jump to that address.
In the first case, we need one scratch register to store the PC temporarily.
In the second case, we need two scratch registers: also one for the address we are jumping to.
For the ARM instruction set we needed zero and one scratch register(s), respectively.
The ARM backend uses two scratch registers, S0 and S1 (for more details, see \cref{sec:reg-alloc:clean}).
The latter is used only when two scratch registers are needed, so S0 is used in this case.
The Thumb-2 code generator uses S0 to store the PC temporarily in the first case,
and S1 in the second case (where S0 is still used for the address we are jumping to).
This makes sure that S0 is used as much as possible.
The slightly easier implementation would use S1 in both cases.
However, in Thumb-2 it is convenient to have great variation in register usage:
this allows for a massive code size optimisation (see \cref{sec:reg-alloc}).
For this reason it is better to use S0 whenever possible.
\subsection{Comparison}
\label{sec:storing-pc:comparison}
Assuming the worst case, that all instructions in the jump block are wide, we need four more bytes in Thumb than in ARM.
As a benchmark, the Clean compiler has 41,006 jumps of this kind in 1,253,978 instructions, or approximately 3.27\%.
The four extra bytes in Thumb mean a size increase of $41006\cdot4\approx160$KiB on the 5.3MiB file, an increase of 3.00\%.
As for the time complexity: every jump requires an extra instruction cycle.
In particular, every reduction needs an extra cycle.
It is hard to tell what effect this has on Clean programs in general,
and it may well be very dependent on the kind of program.
A general comparison of running time under ARM and Thumb is made in \cref{sec:results:time}.
% pi@rasppi:~/clean/exe$ objdump -d cocl | grep -E '^\s+[0123456789abcdef]{5,8}:\s+[0123456789abcdef]{8}\s+push\s+\{pc\}' | wc -l
% 41006
% pi@rasppi:~/clean/exe$ objdump -d cocl | grep -E '^\s+[0123456789abcdef]{5,8}:\s+[0123456789abcdef]{8}' | wc -l
% 1253978
\subsection{Other solutions}
\label{sec:storing-pc:other-solutions}
Another solution than the one we have considered here makes use of the link register.
Some branch instructions, like \ual{bl}, store the address of the next instruction in the link register.
We could therefore imagine a setup where the callee gets the return address from that register rather than from the stack.
This is the approach taken by GCC.
The code of a typical C subroutine starts with \ual{push {...,lr}} and ends with \ual{pop {...,pc}}.
When generating code for a functional language, it is not straightforward to do this, due to tail recursion.
An example of this can be found in the following basic function:
\begin{minted}{clean}
length :: !Int ![a] -> Int
length n [] = n
length n [_:xs] = length (n+1) xs
\end{minted}
The current Thumb backend generates the following code:
\begin{minted}{ual}
slength_P2:
ldr r0,[r6] @ Load top of A-stack (the list)
ldr r14,=__Nil+2 @ Check if it is Nil
cmp r0,r14
bne case_P2
case_P1: @ If the list is Nil
ldr pc,[sp],#4 @ Simply return
.ltorg
case_P2: @ If the list is not Nil
ldr r6,[r6,#8] @ Load the Cons part of the list
ldr r0,[r6] @ Check if it has been evaluated already
tst r0,#2
bne e_0
str r4,[sp,#-4]! @ If the Cons part has not been evaluated
.align
add.w r14,pc,#7 @ Evaluate the cons part
str.w r14,[sp,#-4]!
blx r0
ldr r4,[sp],#4
e_0: @ If / after the Cons part has been evaluated
add r4,r4,#1 @ Increase counter
b slength_P2 @ Tail recursively jump back to start of function
.ltorg
\end{minted}
In this setup, it is not easily possible to let the callee store the return address,
since we enter the function as many times as there are elements in the list,
while we only return from it once, in \texttt{case\_P1}.
A more straightforward solution to have the caller responsible for storing the return address,
which is why this approach is taken in Clean's ARM code generator~\parencite{armcg}
and why we continue along these lines for the Thumb backend.
\end{multicols}
|