Finish load offsets

author: Camil Staps 2016-12-05 12:24:45 +0100
committer: Camil Staps 2016-12-05 12:24:45 +0100
commit: 462b94ff7c0e2288f2dcbb81dbedaef0f7c17899 (patch)
tree: de89f91fb29e762a4061d5257fb8df4d58dfae4f
parent: load-offsets log (diff)
2 files changed, 110 insertions, 7 deletions
diff --git a/thesis/load-offsets.tex b/thesis/load-offsets.tex
index bfd09ef..eea6efb 100644
--- a/thesis/load-offsets.tex
+++ b/thesis/load-offsets.tex
@@ -16,15 +16,17 @@ The Thumb instruction set defines four different variants of the immediate \ual{
 
 \begin{itemize}[itemsep=0pt]
 	\item
-		16-bits, any base register, a 5-bit positive offset.
+		16-bits, any base register, offsets between $0$ and $124$; $0$ modulo $4$.
 	\item
-		16-bits, SP as base, a 8-bit positive offset.
+		16-bits, SP as base, offsets between $0$ and $1020$; $0$ modulo $4$.
 	\item
-		32-bits, any base register, a 12-bit positive offset.
+		32-bits, any base register, offsets between $0$ and $4095$.
 	\item
-		32-bits, any base register, a 8-bit negative offset.
+		32-bits, any base register, offsets between $-255$ and $255$.
 \end{itemize}
 
+Only the last variant can add the offset to the base register, either before or after the load instruction.
+
 While in some cases a narrow \ual{LDR} instructions can be used in Thumb,
 	some uses in ARM cannot be used directly in Thumb:
 	the offset can now only be between $-255$ and $4095$ bytes.
@@ -32,9 +34,109 @@ While in some cases a narrow \ual{LDR} instructions can be used in Thumb,
 \subsection{Usage in Clean}
 \label{sec:load-offsets:clean}
 For most Clean programs, the generated code will not attempt to load memory with large negative offsets.
-However, it can occur when a function has to look far down the A-stack.
-In practice, this happens when a function has a high arity or many complex expressions in its right hand side.
+However, there are some modules with exceptionally large right hand sides, for which large negative offsets are generated.
+The largest in the Clean compiler is $-600$, in the \texttt{frontend/predef} module.
+
+It is worth noting that the ARM code generator attempts to use as few clock cycles as possible to modify the A-stack.
+Instead of updating the A-stack pointer every time a load is done,
+	the code generator keeps track of the difference between the stored A-stack pointer and the actual top of the stack.
+So, when three arguments are popped off the A-stack, we do not generate this code%
+	\footnote{This is a hypothetical example, so we do not show ABC-code here.}:
+
+\begin{minted}{ual}
+	ldr   r0,[r9,#4]!
+	ldr   r1,[r9,#4]!
+	ldr   r2,[r9,#4]!
+\end{minted}
+
+Instead, the code is optimised to:
+
+\begin{minted}{ual}
+	ldr   r0,[r9,#4]
+	ldr   r1,[r9,#8]
+	ldr   r2,[r9,#12]!
+\end{minted}
+
+Instead of updating \ual{r9}, the A-stack pointer, after every pop,
+	the code generator modifies the offset and updates \ual{r9} only after the last load.
+
+\subsection{Solution}
+\label{sec:load-offsets:solution}
+We extend the optimisation scheme described in the previous paragraph to ensure no large negative offsets are generated.
+When generating code for a basic block, we now compute the minimum offset to the A-stack.
+If it is lower than $-255$, a \ual{sub} instruction is added to modify the A-stack pointer before the load is executed.
+
+In this case, we optimise the offset for code size.
+Let $O$ be the set of offsets used in the basic block and $o_{\mathit{min}}\in O$ the lowest offset in that set.
+If we call the offset used in the \ual{sub} instruction $o_{\mathit{start}}$, this expression gives the total size in bytes of all load instructions:
+$$h\left(o_{\mathit{start}}\right) = \sum_{o\in O} s(o - o_{\mathit{start}}),$$
+where
+$$s(x) = \begin{cases}
+	16 & \text{if $4\mid x$ and $0 \le x \le 124$;}\\
+	32 & \text{otherwise.}
+\end{cases}$$
+We therefore want to calculate
+$$\min_{o_{\mathit{min}}-3 \le o_{\mathit{start}} \le o_{\mathit{min}} + 255} h\left(o_{\mathit{start}}\right).$$
+
+There is no reason to go lower than $o_{\mathit{min}}-3$.
+In this case, all load offsets are positive, and decreasing $o_{\mathit{start}}$ further will make less instruction fall into the first 16-bit variant listed above.
+We cannot go higher than $o_{\mathit{min}}+255$, because we would then need a load offset lower than $-255$.
+
+The \ual{sub} instruction is inserted right before the first \ual{ldr} instruction with a too large negative offset.
+This way, as many instructions as possible before the \ual{sub} instruction fall into the 16-bit variant.
+
+For example, if we have to load data with the offsets $0,4,8,12,-512,16,20,24,28$, the following code is generated:
+
+\begin{minted}{ual}
+	ldr   ..,[r9]
+	ldr   ..,[r9,#4]
+	ldr   ..,[r9,#8]
+	ldr   ..,[r9,#12]
+	sub   r9,r9,#-512
+	ldr   ..,[r9]
+	ldr   ..,[r9,#528]
+	ldr   ..,[r9,#532]
+	ldr   ..,[r9,#536]
+	ldr   ..,[r9,#540]
+\end{minted}
+
+The first four loads are 16-bit, while putting the \ual{sub} instruction at the start of the block would make them 32-bit.
+
+\subsection{Comparison}
+\label{sec:load-offsets:comparison}
+We have assumed that the largest difference between A-stack offsets used in one basic block is at most $4095$.
+If it would be more, we would have to modify the A-stack pointer not only to avoid large negative offsets,
+	but also to avoid large positive offsets.
+We have not found any module that uses large positive offsets, so the assumption is reasonable for the moment.
+
+This assumption allows us to add only one \ual{sub} instruction per basic block.
+With this, we lose one clock cycle.
+In the worst case, all \ual{ldr} instructions are 32-bit.
+In that case we also lose 4 bytes of code size.
+However, in practice, at least some of the \ual{ldr} instructions will be 16-bit, and we will in fact save some space.
+
+Negative offsets occur so infrequently that they do not influence the overall results significantly.
+
+\subsection{Other solutions}
+\label{sec:load-offsets:other-solutions}
+Note that the solution proposed here makes a rigorous choice to minimise clock cycles.
+In the above example, we could optimise for code size by restoring \ual{r9} when the offset is not needed any more:
+
+\begin{minted}{ual}
+	ldr   ..,[r9]
+	ldr   ..,[r9,#4]
+	ldr   ..,[r9,#8]
+	ldr   ..,[r9,#12]
+	sub   r9,r9,#-512
+	ldr   ..,[r9]
+	add   r9,r9,#-512
+	ldr   ..,[r9,#16]
+	ldr   ..,[r9,#20]
+	ldr   ..,[r9,#24]
+	ldr   ..,[r9,#28]
+\end{minted}
 
-\todo{rest}
+Here, all \ual{ldr} instructions are 16-bit, at the cost of one extra \ual{add} instruction.
+We choose to optimise for speed, because it is easier to implement and this situation is so rare that a different scheme would not give significantly smaller code.
 
 \end{multicols}
diff --git a/thesis/preamble.tex b/thesis/preamble.tex
index 342cf02..7650d20 100644
--- a/thesis/preamble.tex
+++ b/thesis/preamble.tex
@@ -17,6 +17,7 @@
 \usepackage[british]{babel}
 \usepackage[babel=true]{csquotes}
 \usepackage{enumitem}
+\usepackage{amsmath}
 
 \usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
 \definecolor{linkcolor}{rgb}{0,0.65,0}
author	Camil Staps	2016-12-05 12:24:45 +0100
committer	Camil Staps	2016-12-05 12:24:45 +0100
commit	462b94ff7c0e2288f2dcbb81dbedaef0f7c17899 (patch)
tree	de89f91fb29e762a4061d5257fb8df4d58dfae4f
parent	load-offsets log (diff)