diff options
author | Camil Staps | 2016-12-05 12:24:45 +0100 |
---|---|---|
committer | Camil Staps | 2016-12-05 12:24:45 +0100 |
commit | 462b94ff7c0e2288f2dcbb81dbedaef0f7c17899 (patch) | |
tree | de89f91fb29e762a4061d5257fb8df4d58dfae4f | |
parent | load-offsets log (diff) |
Finish load offsets
-rw-r--r-- | thesis/load-offsets.tex | 116 | ||||
-rw-r--r-- | thesis/preamble.tex | 1 |
2 files changed, 110 insertions, 7 deletions
diff --git a/thesis/load-offsets.tex b/thesis/load-offsets.tex index bfd09ef..eea6efb 100644 --- a/thesis/load-offsets.tex +++ b/thesis/load-offsets.tex @@ -16,15 +16,17 @@ The Thumb instruction set defines four different variants of the immediate \ual{ \begin{itemize}[itemsep=0pt] \item - 16-bits, any base register, a 5-bit positive offset. + 16-bits, any base register, offsets between $0$ and $124$; $0$ modulo $4$. \item - 16-bits, SP as base, a 8-bit positive offset. + 16-bits, SP as base, offsets between $0$ and $1020$; $0$ modulo $4$. \item - 32-bits, any base register, a 12-bit positive offset. + 32-bits, any base register, offsets between $0$ and $4095$. \item - 32-bits, any base register, a 8-bit negative offset. + 32-bits, any base register, offsets between $-255$ and $255$. \end{itemize} +Only the last variant can add the offset to the base register, either before or after the load instruction. + While in some cases a narrow \ual{LDR} instructions can be used in Thumb, some uses in ARM cannot be used directly in Thumb: the offset can now only be between $-255$ and $4095$ bytes. @@ -32,9 +34,109 @@ While in some cases a narrow \ual{LDR} instructions can be used in Thumb, \subsection{Usage in Clean} \label{sec:load-offsets:clean} For most Clean programs, the generated code will not attempt to load memory with large negative offsets. -However, it can occur when a function has to look far down the A-stack. -In practice, this happens when a function has a high arity or many complex expressions in its right hand side. +However, there are some modules with exceptionally large right hand sides, for which large negative offsets are generated. +The largest in the Clean compiler is $-600$, in the \texttt{frontend/predef} module. + +It is worth noting that the ARM code generator attempts to use as few clock cycles as possible to modify the A-stack. +Instead of updating the A-stack pointer every time a load is done, + the code generator keeps track of the difference between the stored A-stack pointer and the actual top of the stack. +So, when three arguments are popped off the A-stack, we do not generate this code% + \footnote{This is a hypothetical example, so we do not show ABC-code here.}: + +\begin{minted}{ual} + ldr r0,[r9,#4]! + ldr r1,[r9,#4]! + ldr r2,[r9,#4]! +\end{minted} + +Instead, the code is optimised to: + +\begin{minted}{ual} + ldr r0,[r9,#4] + ldr r1,[r9,#8] + ldr r2,[r9,#12]! +\end{minted} + +Instead of updating \ual{r9}, the A-stack pointer, after every pop, + the code generator modifies the offset and updates \ual{r9} only after the last load. + +\subsection{Solution} +\label{sec:load-offsets:solution} +We extend the optimisation scheme described in the previous paragraph to ensure no large negative offsets are generated. +When generating code for a basic block, we now compute the minimum offset to the A-stack. +If it is lower than $-255$, a \ual{sub} instruction is added to modify the A-stack pointer before the load is executed. + +In this case, we optimise the offset for code size. +Let $O$ be the set of offsets used in the basic block and $o_{\mathit{min}}\in O$ the lowest offset in that set. +If we call the offset used in the \ual{sub} instruction $o_{\mathit{start}}$, this expression gives the total size in bytes of all load instructions: +$$h\left(o_{\mathit{start}}\right) = \sum_{o\in O} s(o - o_{\mathit{start}}),$$ +where +$$s(x) = \begin{cases} + 16 & \text{if $4\mid x$ and $0 \le x \le 124$;}\\ + 32 & \text{otherwise.} +\end{cases}$$ +We therefore want to calculate +$$\min_{o_{\mathit{min}}-3 \le o_{\mathit{start}} \le o_{\mathit{min}} + 255} h\left(o_{\mathit{start}}\right).$$ + +There is no reason to go lower than $o_{\mathit{min}}-3$. +In this case, all load offsets are positive, and decreasing $o_{\mathit{start}}$ further will make less instruction fall into the first 16-bit variant listed above. +We cannot go higher than $o_{\mathit{min}}+255$, because we would then need a load offset lower than $-255$. + +The \ual{sub} instruction is inserted right before the first \ual{ldr} instruction with a too large negative offset. +This way, as many instructions as possible before the \ual{sub} instruction fall into the 16-bit variant. + +For example, if we have to load data with the offsets $0,4,8,12,-512,16,20,24,28$, the following code is generated: + +\begin{minted}{ual} + ldr ..,[r9] + ldr ..,[r9,#4] + ldr ..,[r9,#8] + ldr ..,[r9,#12] + sub r9,r9,#-512 + ldr ..,[r9] + ldr ..,[r9,#528] + ldr ..,[r9,#532] + ldr ..,[r9,#536] + ldr ..,[r9,#540] +\end{minted} + +The first four loads are 16-bit, while putting the \ual{sub} instruction at the start of the block would make them 32-bit. + +\subsection{Comparison} +\label{sec:load-offsets:comparison} +We have assumed that the largest difference between A-stack offsets used in one basic block is at most $4095$. +If it would be more, we would have to modify the A-stack pointer not only to avoid large negative offsets, + but also to avoid large positive offsets. +We have not found any module that uses large positive offsets, so the assumption is reasonable for the moment. + +This assumption allows us to add only one \ual{sub} instruction per basic block. +With this, we lose one clock cycle. +In the worst case, all \ual{ldr} instructions are 32-bit. +In that case we also lose 4 bytes of code size. +However, in practice, at least some of the \ual{ldr} instructions will be 16-bit, and we will in fact save some space. + +Negative offsets occur so infrequently that they do not influence the overall results significantly. + +\subsection{Other solutions} +\label{sec:load-offsets:other-solutions} +Note that the solution proposed here makes a rigorous choice to minimise clock cycles. +In the above example, we could optimise for code size by restoring \ual{r9} when the offset is not needed any more: + +\begin{minted}{ual} + ldr ..,[r9] + ldr ..,[r9,#4] + ldr ..,[r9,#8] + ldr ..,[r9,#12] + sub r9,r9,#-512 + ldr ..,[r9] + add r9,r9,#-512 + ldr ..,[r9,#16] + ldr ..,[r9,#20] + ldr ..,[r9,#24] + ldr ..,[r9,#28] +\end{minted} -\todo{rest} +Here, all \ual{ldr} instructions are 16-bit, at the cost of one extra \ual{add} instruction. +We choose to optimise for speed, because it is easier to implement and this situation is so rare that a different scheme would not give significantly smaller code. \end{multicols} diff --git a/thesis/preamble.tex b/thesis/preamble.tex index 342cf02..7650d20 100644 --- a/thesis/preamble.tex +++ b/thesis/preamble.tex @@ -17,6 +17,7 @@ \usepackage[british]{babel} \usepackage[babel=true]{csquotes} \usepackage{enumitem} +\usepackage{amsmath} \usepackage[usenames,dvipsnames,svgnames,table]{xcolor} \definecolor{linkcolor}{rgb}{0,0.65,0} |