\section{Storing the program counter}
\label{sec:storing-pc}

\begin{multicols}{2}

\subsection{Introduction}
\label{sec:storing-pc:intro}
Storing the program counter on the stack is something done commonly in many languages during a function call.
Usually, an offset to the actual program counter at the moment of the store instruction is stored,
	to allow for a branch after that instruction, and have the callee return to the address after that branch.
The ARM architecture accommodates for this: an ARM instruction that reads the program counter, actually reads
	\enquote{the address of the current instruction plus 8}~\parencite[A2.3]{armv7ar}.
The following ARM assembly example illustrates this (\texttt{armstartup.s:540-2},~\cite{armrts}):

\begin{minted}{ual}
	str     pc,[sp,#-4]!  @ 0x20
	bl      init_clean    @ 0x24
	tst     r4,r4         @ 0x28
\end{minted}

Dummy addresses have been indicated in comments.
When execution arrives at \ual{0x20}, the program counter is set to \ual{0x20}.
Per the above documentation, \ual{str} stores \ual{0x20+8} on the stack
	(to be precise: to the address indicated by the stack pointer with pre-indexed offset \ual{-4}).
The processor branches to \ual{init_clean},
	which loads the value from the stack back into the program counter to return to the caller.
The program counter is then \ual{0x28}.
For the next instruction cycle, the \ual{tst} command is executed.

There are two reasons why the above cannot be used in Thumb-2 code.
First, \ual{pc} is not allowed as the first operand of a \ual{str} instruction.

The second problem we meet is that the instruction to store the program counter may be halfword-aligned rather than word-aligned.
We saw above that a read of the program counter in ARM mode reads as PC+8.
In Thumb mode this is more complicated.
In this case, we 
	\enquote{[r]ead the word-aligned PC value, that is,
		the address of the current instruction + 4,
		with bits [1:0] forced to zero}~\parencite[A5.1.2]{armv7m}.
This means that when the \ual{add} instruction above is at \ual{0x22},
	we will still store \ual{0x2d} on the stack,
	since the word-aligned program counter is \ual{0x20} as before.
However, in this case \ual{bl} is located at \ual{0x2a}, and since this is a 32 bits instruction we point to the middle of that instruction.

\subsection{Usage in Clean}
\label{sec:storing-pc:clean}
Storing the PC is required for jumping to subroutines.
We see a clear example of this when a Clean function definition uses pattern matching.
Consider the following example:

\begin{minted}{clean}
	isEmpty :: [a] -> Bool
	isEmpty [] = True
	isEmpty _  = False
\end{minted}

The generated ABC-code looks as below%
	\footnote{The current Clean compiler misses the \abc{jsr_eval 0} line:
			the strictness analyser recognises that \clean{isEmpty} is strict in its first argument,
			so evaluating the argument to head normal form is not needed any more.
		In cases where the strictness analyser cannot derive strictness,
			code similar to this example is generated.
		The code given here can be reproduced with \bash{clm -nsa}.}.
Since \clean{isEmpty} pattern matches on its first argument, it needs to be evaluated to head normal form.
This is done with \abc{jsr_eval 0}.
Only after that can we check if its constructor is \abc{_Nil} (the empty list), which is done with \abc{eq_desc _Nil 0 0}.

\begin{minted}[tabsize=4]{text}
	sisEmpty.1
		jsr_eval 0        || Evaluate argument
		eq_desc _Nil 0 0  || If it equals []
		jmp_true case.1   || .. jump to case.1
		jmp case.2        || [else] to case.2
	case.1
		pop_a 1           || Pop argument
		pushB TRUE        || Return True
		rtn
	case.2
		pop_a 1           || Pop argument
		pushB FALSE       || Return False
		rtn
\end{minted}

Evaluating the argument is done by jumping to the subroutine indicated by the node entry of that argument.
Hence, we store the PC, jump to that address, and continue with \abc{eq_desc _Nil 0 0} when the node has been evaluated to head normal form.
The ARM code for pattern matching is:

\begin{minted}{ual}
	sisEmpty_P1:
	  ldr  r12,[r6]      @ Load node entry point
	  tst  r12,#2        @ If in HNF already
	  bne  e_0           @ Skip evaluation
	  str  pc,[sp,#-4]!  @ Store PC
	  blx  r12           @ Evaluate argument
	e_0:
	  ldr  r12,[r6]      @ If it does not equal []
	  ldr  r14,=__Nil+2
	  cmp  r12,r14
	  bne  case_P2
\end{minted}

We can see here
	(a) that evaluating a node requires a \abc{jsr_eval} ABC instruction, and
	(b) that \abc{jsr} ABC instructions require storing the PC in ARM assembly.

Of course, evaluating nodes is something that happens throughout the source code and has to be done all the time during the execution of a Clean program.
We therefore need a fast, small Thumb alternative for the ARM code.

\subsection{Solution}
\label{sec:storing-pc:solution}
To solve the first problem, we need to first move \ual{pc} to an auxiliary register, and then push that on the stack.
We then get, for the example from \texttt{armstartup.s:540-2}:

\begin{minted}{ual}
	add     lr,pc,#9      @ 0x20
	str     lr,[sp,#-4]!  @ 0x24
	bl      init_clean    @ 0x28
	tst     r4,r4         @ 0x2c
\end{minted}

We store the value \ual{0x2d}.
This address points to the \ual{tst} instruction as before, with the LSB set to 1 to indicate Thumb mode.

In assembly code, we can solve this by adding labels for the addresses we want to store on the stack.
In object code, we need to keep track of the current alignment and add either 9 or 11 to the read program counter.

The offset, 9, is calculated as the number of bytes to the instruction after the branch plus one for Thumb mode.
For \ual{b} and \ual{bl} instructions, this means an offset of 9, since these instructions are 32-bit.
The \ual{bx} and \ual{blx} instructions are 16-bit, and require an offset of 7.

\subsection{Other solutions}
\label{sec:storing-pc:other-solutions}
Another solution than the one we present makes use of the link register.
Some branch instructions, like \ual{bl}, store the address of the next instruction in the link register.
We could therefore imagine a setup where the callee gets the return address from that register rather than from the stack.
This is the approach taken by GCC.
The code of a typical C subroutine starts with \ual{push {...,lr}} and ends with \ual{pop {...,pc}}.

When generating code for a functional language, it is not straightforward to do this, due to tail recursion.
It is an easier solution to have the caller responsible for storing the return address,
	which is why this approach is taken in Clean's ARM code generator~\parencite{armcg}
	and why we continue along these lines for the Thumb backend.

\subsection{Comparison}
To be done. %TODO

For every occurrence: +1 word code size;
for every function call: +1 instruction.

\end{multicols}