\section{Reduced offset space for memory load instructions}
\label{sec:load-offsets}

\begin{multicols}{2}

\subsection{Introduction}
\label{sec:load-offsets:intro}
The \ual{LDR} (immediate) instruction loads a word from memory into a register.
The address is specified by a base register and an offset
	(which can possibly be added to the base register before or after the load).
On ARM, this offset is encoded in twelve bits, and one bit is used for its sign.
That allows for any offset between $-4095$ and $4095$ bytes.
For details, see \cite[A8.8.63]{armv7ar}.

The Thumb instruction set defines four different variants of the immediate \ual{LDR} instruction~\parencite[6.7.42]{armv7m}:

\begin{itemize}[itemsep=0pt]
	\item
		16-bits, any base register, offsets between $0$ and $124$; $0$ modulo $4$.
	\item
		16-bits, SP as base, offsets between $0$ and $1020$; $0$ modulo $4$.
	\item
		32-bits, any base register, offsets between $0$ and $4095$.
	\item
		32-bits, any base register, offsets between ${-255}$ and $255$.
\end{itemize}

Only the last variant can add the offset to the base register, either before or after the load instruction.

While in some cases a narrow \ual{LDR} instructions can be used in Thumb,
	some uses in ARM cannot be used directly in Thumb:
	the offset can now only be between $-255$ and $4095$ bytes.

\subsection{Usage in Clean}
\label{sec:load-offsets:clean}
For most Clean programs, the generated code will not attempt to load memory with large negative offsets.
However, there are some modules with exceptionally large right hand sides, for which large negative offsets are generated.
The largest in the Clean compiler is $-600$, in the \texttt{frontend/predef} module.

It is worth noting that the ARM code generator attempts to use as few clock cycles as possible to modify the A-stack.
Instead of updating the A-stack pointer every time a load is done,
	the code generator keeps track of the difference between the stored A-stack pointer and the actual top of the stack.
So, when three arguments are popped off the A-stack, we do not generate this code%
	\footnote{This is a hypothetical example, so we do not show ABC-code here.}:

\begin{minted}{ual}
	ldr   r0,[r9,#4]!
	ldr   r1,[r9,#4]!
	ldr   r2,[r9,#4]!
\end{minted}

Instead, the code is optimised to:

\begin{minted}{ual}
	ldr   r0,[r9,#4]
	ldr   r1,[r9,#8]
	ldr   r2,[r9,#12]!
\end{minted}

Instead of updating \ual{r9}, the A-stack pointer, after every pop,
	the code generator modifies the offset and updates \ual{r9} only after the last load.

\subsection{Solution}
\label{sec:load-offsets:solution}
We extend the optimisation scheme described in the previous paragraph to ensure no large negative offsets are generated.
When generating code for a basic block, we now compute the minimum offset to the A-stack.
If it is lower than $-255$, a \ual{sub} instruction is added to modify the A-stack pointer before the load is executed.

In this case, we optimise the offset for code size.
Let $O$ be the set of offsets used in the basic block and $o_{\mathit{min}}\in O$ the lowest offset in that set.
If we call the offset used in the \ual{sub} instruction $o_{\mathit{start}}$, this expression gives the total size in bytes of all load instructions:
$$h\left(o_{\mathit{start}}\right) = \sum_{o\in O} s(o - o_{\mathit{start}}),$$
where
$$s(x) = \begin{cases}
	16 & \text{if $4\mid x$ and $0 \le x \le 124$;}\\
	32 & \text{otherwise.}
\end{cases}$$
We therefore want to calculate
$$\min_{o_{\mathit{min}}-3 \le o_{\mathit{start}} \le o_{\mathit{min}} + 255} h\left(o_{\mathit{start}}\right).$$

There is no reason to go lower than $o_{\mathit{min}}-3$.
In this case, all load offsets are positive, and decreasing $o_{\mathit{start}}$ further will make less instruction fall into the first 16-bit variant listed above.
We cannot go higher than $o_{\mathit{min}}+255$, because we would then need a load offset lower than $-255$.

The \ual{sub} instruction is inserted right before the first \ual{ldr} instruction with a too large negative offset.
This way, as many instructions as possible before the \ual{sub} instruction fall into the 16-bit variant.

For example, if we have to load data with the offsets $0,4,8,12,-512,16,20,24,28$, the following code is generated:

\begin{minted}{ual}
	ldr   ..,[r9]
	ldr   ..,[r9,#4]
	ldr   ..,[r9,#8]
	ldr   ..,[r9,#12]
	sub   r9,r9,#-512
	ldr   ..,[r9]
	ldr   ..,[r9,#528]
	ldr   ..,[r9,#532]
	ldr   ..,[r9,#536]
	ldr   ..,[r9,#540]
\end{minted}

The first four loads are 16-bit, while putting the \ual{sub} instruction at the start of the block would make them 32-bit.

\subsection{Comparison}
\label{sec:load-offsets:comparison}
We have assumed that the largest difference between A-stack offsets used in one basic block is at most $4095$.
If it would be more, we would have to modify the A-stack pointer not only to avoid large negative offsets,
	but also to avoid large positive offsets.
We have not found any module that uses large positive offsets, so the assumption is reasonable for the moment.

This assumption allows us to add only one \ual{sub} instruction per basic block.
With this, we lose one clock cycle.
In the worst case, all \ual{ldr} instructions are 32-bit.
In that case we also lose 4 bytes of code size.
However, in practice, at least some of the \ual{ldr} instructions will be 16-bit, and we will in fact save some space.

Negative offsets occur so infrequently that they do not influence the overall results significantly.

\subsection{Other solutions}
\label{sec:load-offsets:other-solutions}
Note that the solution proposed here makes a rigorous choice to minimise clock cycles.
In the above example, we could optimise for code size by restoring \ual{r9} when the offset is not needed any more:

\begin{minted}{ual}
	ldr   ..,[r9]
	ldr   ..,[r9,#4]
	ldr   ..,[r9,#8]
	ldr   ..,[r9,#12]
	sub   r9,r9,#-512
	ldr   ..,[r9]
	add   r9,r9,#-512
	ldr   ..,[r9,#16]
	ldr   ..,[r9,#20]
	ldr   ..,[r9,#24]
	ldr   ..,[r9,#28]
\end{minted}

Here, all \ual{ldr} instructions are 16-bit, at the cost of one extra \ual{add} instruction.
We choose to optimise for speed, because it is easier to implement and this situation is so rare that a different scheme would not give significantly smaller code.

\end{multicols}