summaryrefslogtreecommitdiff
path: root/thesis/storing-pc.tex
blob: 81a688b80be6c398c47e6040d07488961a2ad296 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
\section{Storing the program counter}
\label{sec:storing-pc}

\begin{multicols}{2}

\subsection{Introduction}
\label{sec:storing-pc:intro}
Storing the program counter on the stack is something done commonly in many languages during a function call.
Usually, an offset to the actual program counter at the moment of the store instruction is stored,
	to allow for a branch after that instruction, and have the callee return to the address after that branch.
The ARM architecture accommodates for this: an ARM instruction that reads the program counter, actually reads
	\enquote{the address of the current instruction plus 8}~\parencite[A2.3]{armv7ar}.
The following ARM assembly example illustrates this (\texttt{armstartup.s:540-2},~\cite{armrts}):

\begin{minted}{ual}
	str     pc,[sp,#-4]!  @ 0x20
	bl      init_clean    @ 0x24
	tst     r4,r4         @ 0x28
\end{minted}

Dummy addresses have been indicated in comments.
When execution arrives at \ual{0x20}, the program counter is set to \ual{0x20}.
Per the above documentation, \ual{str} stores \ual{0x20+8} on the stack
	(to be precise: to the address indicated by the stack pointer with pre-indexed offset \ual{-4}).
The processor branches to \ual{init_clean},
	which loads the value from the stack back into the program counter to return to the caller.
The program counter is then \ual{0x28}.
For the next instruction cycle, the \ual{tst} command is executed.

There are two reasons why the above cannot be used in Thumb-2 code.
First, \ual{pc} is not allowed as the first operand of a \ual{str} instruction.

The second problem we meet is that the instruction to store the program counter may be halfword-aligned rather than word-aligned.
We saw above that a read of the program counter in ARM mode reads as PC+8.
In Thumb mode this is more complicated.
In this case, we 
	\enquote{[r]ead the word-aligned PC value, that is,
		the address of the current instruction + 4,
		with bits [1:0] forced to zero}~\parencite[A5.1.2]{armv7m}.
This means that when the \ual{add} instruction above is at \ual{0x22},
	we will still store \ual{0x2d} on the stack,
	since the word-aligned program counter is \ual{0x20} as before.
However, in this case \ual{bl} is located at \ual{0x2a}, and since this is a 32 bits instruction we point to the middle of that instruction.

\subsection{Usage in Clean}
\label{sec:storing-pc:clean}
Storing the PC is required for jumping to subroutines.
We see a clear example of this when a Clean function definition uses pattern matching.
Consider the following example:

\begin{minted}{clean}
	isEmpty :: [a] -> Bool
	isEmpty [] = True
	isEmpty _  = False
\end{minted}

The generated ABC-code looks as below%
	\footnote{The current Clean compiler misses the \abc{jsr_eval 0} line:
			the strictness analyser recognises that \clean{isEmpty} is strict in its first argument,
			so evaluating the argument to head normal form is not needed any more.
		In cases where the strictness analyser cannot derive strictness,
			code similar to this example is generated.
		The code given here can be reproduced with \bash{clm -nsa}.}.
Since \clean{isEmpty} pattern matches on its first argument, it needs to be evaluated to head normal form.
This is done with \abc{jsr_eval 0}.
Only after that can we check if its constructor is \abc{_Nil} (the empty list), which is done with \abc{eq_desc _Nil 0 0}.

\begin{minted}[tabsize=4]{abc}
	sisEmpty.1
		jsr_eval 0        | Evaluate argument
		eq_desc _Nil 0 0  | If it equals []
		jmp_true case.1   | .. jump to case.1
		jmp case.2        | [else] to case.2
	case.1
		pop_a 1           | Pop argument
		pushB TRUE        | Return True
		rtn
	case.2
		pop_a 1           | Pop argument
		pushB FALSE       | Return False
		rtn
\end{minted}

Evaluating the argument is done by jumping to the subroutine indicated by the node entry of that argument.
Hence, we store the PC, jump to that address, and continue with \abc{eq_desc _Nil 0 0} when the node has been evaluated to head normal form.
The ARM code for the pattern match is:

\begin{minted}{ual}
	sisEmpty_P1:
	  ldr  r12,[r6]      @ Load node entry point
	  tst  r12,#2        @ If in HNF already
	  bne  e_0           @ Skip evaluation
	  str  pc,[sp,#-4]!  @ Store PC
	  blx  r12           @ Evaluate argument
	e_0:
	  ldr  r12,[r6]      @ If it does not equal []
	  ldr  r14,=__Nil+2
	  cmp  r12,r14
	  bne  case_P2
\end{minted}

We can see here
	that evaluating a node requires a \abc{jsr_eval} ABC instruction, and
	that \abc{jsr} ABC instructions require storing the PC in ARM assembly.

Of course, evaluating nodes is something that happens throughout the source code and has to be done all the time during the execution of a Clean program.
We therefore need a fast, small Thumb alternative for the ARM code.

\subsection{Solution}
\label{sec:storing-pc:solution}
To solve the first problem, we need to first move \ual{pc} to an auxiliary register, and then push that on the stack.
We then get, for the example from \texttt{armstartup.s:540-2}:

\begin{minted}{ual}
	add     lr,pc,#9      @ 0x20
	str     lr,[sp,#-4]!  @ 0x24
	bl      init_clean    @ 0x28
	tst     r4,r4         @ 0x2c
\end{minted}

We store the value \ual{0x2d}.
This address points to the \ual{tst} instruction as before, with the LSB set to 1 to indicate Thumb mode.

In assembly code, we can solve this by adding labels for the addresses we want to store on the stack.
In object code, we need to keep track of the current alignment and add either 9 or 11 to the read program counter.

The offset, 9, is calculated as the number of bytes to the instruction after the branch plus one for Thumb mode.
For \ual{b} and \ual{bl} instructions, this means an offset of 9, since these instructions are 32-bit.
The \ual{bx} and \ual{blx} instructions are 16-bit, and require an offset of 7.

\subsection{Comparison}
\label{sec:storing-pc:comparison}
Assuming the worst case, that all instructions in the jump block are wide, we need four more bytes in Thumb than in ARM.
As a benchmark, the Clean compiler has 41,006 jumps of this kind in 1,253,978 instructions, a rough 3.27\%.
The four extra bytes in Thumb mean a size increase of $41006\cdot4\approx160$KiB on the 5.3MiB file, an increase of 3.00\%.

As for the time complexity: every jump requires an extra instruction cycle.
In particular, every reduction needs an extra cycle.
It is hard to tell what effect this has on Clean programs in general,
	and it may well be very dependent on the kind of program.
A general comparison of running time under ARM and Thumb is made in \cref{sec:results:time}.

% pi@rasppi:~/clean/exe$ objdump -d cocl | grep -E '^\s+[0123456789abcdef]{5,8}:\s+[0123456789abcdef]{8}\s+push\s+\{pc\}' | wc -l
% 41006
% pi@rasppi:~/clean/exe$ objdump -d cocl | grep -E '^\s+[0123456789abcdef]{5,8}:\s+[0123456789abcdef]{8}' | wc -l
% 1253978

\subsection{Other solutions}
\label{sec:storing-pc:other-solutions}
Another solution than the one we present makes use of the link register.
Some branch instructions, like \ual{bl}, store the address of the next instruction in the link register.
We could therefore imagine a setup where the callee gets the return address from that register rather than from the stack.
This is the approach taken by GCC.
The code of a typical C subroutine starts with \ual{push {...,lr}} and ends with \ual{pop {...,pc}}.

When generating code for a functional language, it is not straightforward to do this, due to tail recursion.
It is an easier solution to have the caller responsible for storing the return address,
	which is why this approach is taken in Clean's ARM code generator~\parencite{armcg}
	and why we continue along these lines for the Thumb backend.

\end{multicols}