summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorCamil Staps2017-01-02 21:52:39 +0100
committerCamil Staps2017-01-02 21:52:39 +0100
commit5b2c98a29aac6d1e45b3f54b51e30c5acf695976 (patch)
tree7c074635e4b1fefecf43005010444dcb5d4d5ca0
parentExpand on note about tail recursion (diff)
Commentaar Mart
-rw-r--r--thesis/code-addresses.tex5
-rw-r--r--thesis/intro.tex11
-rw-r--r--thesis/load-offsets.tex2
-rw-r--r--thesis/preamble.tex14
-rw-r--r--thesis/reg-alloc.tex109
-rw-r--r--thesis/storing-pc.tex6
6 files changed, 82 insertions, 65 deletions
diff --git a/thesis/code-addresses.tex b/thesis/code-addresses.tex
index 4dc7338..29d6257 100644
--- a/thesis/code-addresses.tex
+++ b/thesis/code-addresses.tex
@@ -20,7 +20,8 @@ When the LSB is set, it switches to Thumb mode~\parencite[A2.3.2]{armv7ar}.
The fact that in ARM mode the two lowest bits of a value written to the PC are cleared automatically has been exploited in the ARM code generator:
these two bits are used to store information.
Bit 1 is used to mark a node as having reached head normal form%
- \footnote{A node is in \emph{head} or \emph{root normal form} when it cannot be rewritten itself (though its children may).}.
+ \footnote{A node is in \emph{head} or \emph{root normal form} when it cannot be rewritten itself (though its children may).}
+ (HNF).
Bit 0 is used in Clean's default garbage collector, the copying collector, to record whether a node has been copied to the other semispace already:
when cleared (the normal state of an address as stored on the heap), it has not been copied yet.
@@ -49,7 +50,7 @@ Flipping the meaning of the LSB in the garbage collector amounts to swapping \ua
By word-aligning all node entry addresses we lose one alignment byte per node entry address on average
(assuming that half of the node entry points are word-aligned already).
This increases code size slightly, but since many instructions that were 32-bit in ARM are now 16-bit, the overall code size is still smaller.
-Aligning node entries has no effect on the program's efficiency, since the \ual{nop} instruction that is inserted above the entry is never executed.
+Aligning node entries has no effect on the program's efficiency since the \ual{nop} instruction that is inserted above the entry is never executed.
\subsection{Other solutions}
\label{sec:code-addresses:other-solutions}
diff --git a/thesis/intro.tex b/thesis/intro.tex
index ccffb4c..924973c 100644
--- a/thesis/intro.tex
+++ b/thesis/intro.tex
@@ -5,7 +5,10 @@
\subsection{ARM, Thumb and Thumb-2}
\label{sec:intro:arm-thumb-thumb2}
-ARM is a RISC architecture with several enhancements to a basic RISC architecture allowing ARM processors to
+ARM is a RISC architecture%
+ \footnote{RISC stands for \emph{Reduced Instruction Set Computing}.
+ RISC architectures provide relatively few, basic instructions that can be executed fast to improve performance compared to large, slow architectures.}
+ with several enhancements to a basic RISC architecture allowing ARM processors to
\enquote{achieve a good balance of high performance, small program size, low power consumption, and small silicon area}~\parencite[A1.1]{armv7ar}.
Several instruction sets were designed for the ARM architecture.
@@ -20,7 +23,7 @@ In this thesis, we will usually use \enquote{Thumb} where the Thumb-2 extension
Only when the distinction with pre-ARMv6T2 Thumb is important will we distinguish between (early) Thumb and Thumb-2.
As for \enquote{ARM}, it should be clear from the context whether the architecture or the instruction set is meant.
-Using the Unified Assembler Language (UAL), one can write assembly code for both the ARM and the Thumb instruction set and fix the target ISA only at assemble-time.
+Using the Unified Assembler Language (UAL), one can write assembly code for both the ARM and the Thumb instruction set and fix the target ISA only at assemble-time~\parencite[A4.2]{armv7m}.
The main differences between ARM and Thumb-2 are the following:
@@ -232,7 +235,7 @@ We see that the argument is passed in \ual{r4}, and that register is also used f
In ARM, \ual{r4} through \ual{r0} are used for the top of the B-stack.
The machine stack pointer \ual{sp} is used as the B-stack pointer.
For intermediate results, \ual{r12} is used as a scratch register.
-When a reduction has finished, control is passed back to the callee by loading the PC from the stack (\ual{ldr pc,[sp],#4}).
+When a reduction has finished, control is passed back to the callee by loading the program counter from the stack (\ual{ldr pc,[sp],#4}).
The \clean{Start} rule translates roughly to:
@@ -278,7 +281,7 @@ The actual ABC-code contains annotations that tell the code generator how many e
\subsubsection{The run-time system}
\label{sec:intro:clean:rts}
-After compilation, a Clean program is linked together with the Clean run-time system.
+After compilation, a Clean program is linked together with the Clean run-time system (RTS).
The RTS ensures that that the \clean{Start} node is reduced and printed and takes care of garbage collecting.
This system is written in Clean, C and assembly, so making Clean available on Thumb-2 inherently means adapting the platform-dependent parts of the RTS as well.
diff --git a/thesis/load-offsets.tex b/thesis/load-offsets.tex
index eea6efb..2ef25d0 100644
--- a/thesis/load-offsets.tex
+++ b/thesis/load-offsets.tex
@@ -22,7 +22,7 @@ The Thumb instruction set defines four different variants of the immediate \ual{
\item
32-bits, any base register, offsets between $0$ and $4095$.
\item
- 32-bits, any base register, offsets between $-255$ and $255$.
+ 32-bits, any base register, offsets between ${-255}$ and $255$.
\end{itemize}
Only the last variant can add the offset to the base register, either before or after the load instruction.
diff --git a/thesis/preamble.tex b/thesis/preamble.tex
index 7e25555..5eeeb74 100644
--- a/thesis/preamble.tex
+++ b/thesis/preamble.tex
@@ -2,9 +2,10 @@
\newif\ifdraft
\draftfalse
-
\newif\ifcolumns
\columnsfalse
+\newif\ifcolours
+\coloursfalse
\ifdraft
\usepackage[inner=2cm,outer=2cm,top=2cm,bottom=25mm]{geometry}
@@ -34,12 +35,16 @@
\definecolor{linkcolor}{rgb}{0,0.65,0}
\definecolor{citecolor}{rgb}{0.65,0,0}
\definecolor{urlcolor}{rgb}{0,0,0.65}
-\usepackage[colorlinks=true,linkcolor=linkcolor,urlcolor=urlcolor,citecolor=citecolor]{hyperref}
+\ifcolours
+ \usepackage[colorlinks=true,linkcolor=linkcolor,urlcolor=urlcolor,citecolor=citecolor]{hyperref}
+\else
+ \usepackage[hidelinks=true]{hyperref}
+\fi
\usepackage{cleveref}
\crefname{figure}{figure}{figures}
-\usepackage[style=authoryear,backend=biber]{biblatex}
+\usepackage[style=ieee,backend=biber]{biblatex}
% See http://tex.stackexchange.com/a/40750/23992
\DeclareRobustCommand{\VAN}[1]{Van #1}
\bibliography{thesis}
@@ -60,6 +65,7 @@
style=lovelace,
bgcolor=mintedbg
}
+\ifcolours\else\setminted{style=bw}\fi
\BeforeBeginEnvironment{minted}{\noindent\begin{minipage}{\linewidth}}
\AfterEndEnvironment{minted}{\end{minipage}}
\setmintedinline{fontsize=\footnotesize,bgcolor={}}
@@ -76,7 +82,7 @@
\let\oldtexttt\texttt
\def\texttt#1{\oldtexttt{\footnotesize #1}}
\let\oldurl\url
-\renewcommand{\url}[1]{{\small\oldurl{#1}}}
+\renewcommand{\url}[1]{{\footnotesize\oldurl{#1}}}
\newcommand*{\blankpage}{%
\vspace*{\fill}
diff --git a/thesis/reg-alloc.tex b/thesis/reg-alloc.tex
index beb83c5..e90b36e 100644
--- a/thesis/reg-alloc.tex
+++ b/thesis/reg-alloc.tex
@@ -5,6 +5,7 @@
\small
\begin{subfigure}{.5\linewidth}
\begin{tikzpicture}
+ \ifcolours\else\selectcolormodel{gray}\fi
\begin{axis}
[ xbar
, xmin=0
@@ -41,6 +42,7 @@
\end{subfigure}%
\begin{subfigure}{.5\linewidth}
\begin{tikzpicture}
+ \ifcolours\else\selectcolormodel{gray}\fi
\begin{axis}
[ xbar
, xmin=0
@@ -153,7 +155,7 @@ In Clean programs, the ARM registers are used as follows~\parencite[\texttt{arms
\begin{table*}[b!]
\centering
- \newcommand{\highlight}[1]{\textcolor{red}{\textbf{#1}}}
+ \newcommand{\highlight}[1]{\textbf{\ifcolours\textcolor{red}{#1}\else\underline{#1}\fi}}
\begin{subfigure}{.5\linewidth}
\centering
\begin{tabular}{l >{\tt\footnotesize}c >{\tt\footnotesize}c}
@@ -200,60 +202,9 @@ For small programs it may not be negligible,
The counting method used here is rather simplistic:
basing a register allocation on these counts alone means assuming that every instruction has a 16-bit variant, which is not the case.
-A more accurate method would only count those instructions where using a high or low register actually makes a difference.
+A more accurate method would be to only count those instructions where using a high or low register actually makes a difference.
This is much more complicated, and for a rough estimate the simplistic method used here will already allow us to shrink down the code size.
-\begin{figure*}[b]
- \small
- \centering
- \begin{tikzpicture}
- \begin{axis}
- [ anchor=south east
- , height=.25\linewidth
- , width=.45\linewidth
- , xbar
- , xmin=0, xmax=6000000
- , xlabel={Bytes}
- , ytick=\empty
- , reverse stacked plots
- , ylabel={Clean compiler}
- , ymin=-1, ymax=1
- , axis lines*=left
- , nodes near coords, nodes near coords align=west
- , reverse legend
- , every axis plot/.append style={point meta=explicit symbolic}
- , legend style={at={($(1,-0.5)+(1.5em,0pt)$)},anchor=north,legend columns=-1}
- ]
- \addplot coordinates { (3785144,0) [80.7\%] };
- \addplot coordinates { (3827868,0) [81.6\%] };
- \addplot coordinates { (4385964,0) [93.5\%] };
- \addplot coordinates { (4692628,0) [100\%] };
- \legend{Thumb object-code (approximation), Thumb optimised, Thumb, ARM}
- \end{axis}
- \begin{axis}
- [ anchor=south west
- , height=.25\linewidth
- , width=.45\linewidth
- , xshift=3em
- , xbar
- , xmin=0, xmax=80000
- , xlabel={Bytes}
- , ytick=\empty
- , ymin=-1, ymax=1
- , ylabel={\texttt{frontend/scanner.icl}}
- , axis lines*=left
- , every axis plot/.append style={point meta=explicit symbolic}
- , nodes near coords, nodes near coords align=west
- ]
- \addplot coordinates { (57296,0) [82.0\%] };
- \addplot coordinates { (57864,0) [82.8\%] };
- \addplot coordinates { (64588,0) [92.4\%] };
- \addplot coordinates { (69896,0) [100\%] };
- \end{axis}
- \end{tikzpicture}
- \caption{Code size for different backends.\label{fig:reg-alloc:results}}
-\end{figure*}
-
\subsection{Optimisation}
\label{sec:reg-alloc:optimisation}
The register allocation used in the ARM backend is inefficient for Thumb-2 on several points (\cref{tab:reg-alloc:arm-and-thumb}).
@@ -329,6 +280,58 @@ To measure the code size improvement introduced by this optimisation,
With the new Thumb backend, with the optimised register allocation and leaving out any \ual{.align} directives to approximate object-code level efficiency.
\end{itemize}
+\begin{figure*}[b]
+ \small
+ \centering
+ \begin{tikzpicture}
+ \ifcolours\else\selectcolormodel{gray}\fi
+ \begin{axis}
+ [ anchor=south east
+ , height=.25\linewidth
+ , width=.45\linewidth
+ , xbar
+ , xmin=0, xmax=6000000
+ , xlabel={Bytes}
+ , ytick=\empty
+ , reverse stacked plots
+ , ylabel={Clean compiler}
+ , ymin=-1, ymax=1
+ , axis lines*=left
+ , nodes near coords, nodes near coords align=west
+ , reverse legend
+ , every axis plot/.append style={point meta=explicit symbolic}
+ , legend style={at={($(1,-0.5)+(1.5em,0pt)$)},anchor=north,legend columns=-1}
+ ]
+ \addplot coordinates { (3785144,0) [80.7\%] };
+ \addplot coordinates { (3827868,0) [81.6\%] };
+ \addplot coordinates { (4385964,0) [93.5\%] };
+ \addplot coordinates { (4692628,0) [100\%] };
+ \legend{Thumb object-code (approximation), Thumb optimised, Thumb, ARM}
+ \end{axis}
+ \begin{axis}
+ [ anchor=south west
+ , height=.25\linewidth
+ , width=.45\linewidth
+ , xshift=3em
+ , xbar
+ , xmin=0, xmax=80000
+ , xlabel={Bytes}
+ , ytick=\empty
+ , ymin=-1, ymax=1
+ , ylabel={\texttt{frontend/scanner.icl}}
+ , axis lines*=left
+ , every axis plot/.append style={point meta=explicit symbolic}
+ , nodes near coords, nodes near coords align=west
+ ]
+ \addplot coordinates { (57296,0) [82.0\%] };
+ \addplot coordinates { (57864,0) [82.8\%] };
+ \addplot coordinates { (64588,0) [92.4\%] };
+ \addplot coordinates { (69896,0) [100\%] };
+ \end{axis}
+ \end{tikzpicture}
+ \caption{Code size for different backends.\label{fig:reg-alloc:results}}
+\end{figure*}
+
The object code generator has not been finished yet at the moment these measurements were done,
so we can only estimate its efficiency.
A modification described in \cref{sec:storing-pc:solution} required us to add \ual{.align} to jump instructions, adding a \ual{nop} instruction in approximately half the instances.
diff --git a/thesis/storing-pc.tex b/thesis/storing-pc.tex
index d3fc474..dda32ab 100644
--- a/thesis/storing-pc.tex
+++ b/thesis/storing-pc.tex
@@ -108,6 +108,10 @@ We therefore need a fast, small Thumb alternative for the ARM code.
\subsection{Solution}
\label{sec:storing-pc:solution}
+Recall from \cref{sec:storing-pc:intro} that we meet two issues when generating Thumb code.
+First, that \ual{pc} cannot be the first operand of a \ual{str} instruction;
+ second, that the value read for the program counter is word-aligned while the read instruction may be halfword-aligned.
+
To solve the first problem, we need to first move \ual{pc} to an auxiliary register, and then push that on the stack.
We then get, for the example from \texttt{armstartup.s:540-2}:
@@ -160,7 +164,7 @@ Assuming the worst case, that all instructions in the jump block are wide, we ne
As a benchmark, the Clean compiler has 41,006 jumps of this kind in 1,253,978 instructions, or approximately 3.27\%.
The four extra bytes in Thumb mean a size increase of $41006\cdot4\approx160$KiB on the 5.3MiB file, an increase of 3.00\%.
-As for the time complexity: every jump requires an extra instruction cycle.
+As for the time complexity: every subroutine call requires an extra instruction cycle.
In particular, every reduction needs an extra cycle.
It is hard to tell what effect this has on Clean programs in general,
and it may well be very dependent on the kind of program.