Discussion

author: Camil Staps 2017-01-19 17:05:08 +0100
committer: Camil Staps 2017-01-19 17:05:08 +0100
commit: 89d745b862c06ba3e73e3bfab6ba608c4583bab9 (patch)
tree: 68e8f75dfc9de589ecd516a5be6004c76778aec9
parent: Presentation (diff)
1 files changed, 80 insertions, 44 deletions
diff --git a/thesis/discussion.tex b/thesis/discussion.tex
index ae72de3..d26af5b 100644
--- a/thesis/discussion.tex
+++ b/thesis/discussion.tex
@@ -3,6 +3,15 @@
 
 \begin{multicols}{2}
 
+In this section, we discuss three optimisation vectors that are not yet considered.
+First, we look at subroutine calls, which are expensive in Thumb, compared to ARM.
+In \cref{sec:discussion:branches}, we consider branch optimisation,
+	which concerns moving code blocks around to make as many branch instructions narrow by minimising their offsets.
+The last optimisation vector is about instructions that exist in Thumb but not in ARM.
+These are discussed in \cref{sec:discussion:thumb-only}.
+
+We also look at the results from \cref{sec:results} in more detail, in \cref{sec:discussion:matching}.
+
 \subsection{Optimising subroutine calls}
 \label{sec:discussion:subroutines}
 In \cref{sec:storing-pc}, we saw that subroutine calls are slower and larger in Thumb than in ARM,
@@ -50,6 +59,13 @@ As for running time, we win one instruction cycle per subroutine call.
 The \ual{add} instruction is removed, but the \ual{str} instruction is still executed
 	(whether it is placed in the caller's or the callee's block does not matter).
 
+In the current code generator, branches are often removed by putting the callee direct behind the caller.
+Adding a tail recursive entry point before the callee complicates this:
+	we would need to jump behind that entry point
+	or place the entry point somewhere else
+		and jump from the tail recursive entry point to the start of the function.
+It should be tested if any of these options is an improvement over the current slow subroutine calls.
+
 To implement tail recursive entry points,
 	one would need to be able to recognise tail recursion such that the tail recursive entry point is used only for these subroutine calls,
 	but not for the `plain' recursive subroutine calls.
@@ -60,13 +76,76 @@ A second possibility would be to have a \emph{mixed calling convention}.
 Currently, in every subroutine call, the caller is responsible for storing the return address on the stack.
 In a mixed calling convention,
 	the callee would be responsible for this,
-	except for functions that exploit tail recursion.
+	except for functions that exploit tail recursion or that are not always called but also jumped to (as described above).
 Recognising tail recursion can be done on a relatively high level of abstraction,
 	so annotations can be added to the intermediate ABC-code to indicate what calling convention should be used for a certain function.
 
 An advantage of this approach might be that there are less cases to consider.
 The obvious disadvantage is that it makes the calling convention overly complex.
 
+\subsection{Branch optimisation}
+\label{sec:discussion:branches}
+A second optimisation vector that is yet to be considered is branch optimisation.
+The Thumb instruction set has four variants of the simple branch instruction \ual{b}~\parencite[A6.7.12]{armv7m}:
+
+\begin{itemize}[itemsep=0pt]
+	\item
+		16-bits, conditional, with an offset between $-256$ and $254$.
+	\item
+		16-bits, not conditional, with an offset between $-2,048$ and $2,046$.
+	\item
+		32-bits, conditional, with an offset between $-1,048,576$ and $1,048,574$.
+	\item
+		32-bits, not conditional, with an offset between $-16,777,216$ and $16,777,214$.
+\end{itemize}
+
+By reorganising the code, the code generator could make as many branch instructions as possible 16-bit.
+
+This optimisation is restricted to the \ual{b} instruction:
+	the branch with link (\ual{bl}) is always 32-bit;
+	the branch and exchange (\ual{bx}) and branch with link and exchange (\ual{blx}) instructions always 16-bit.
+Since the \ual{b} instruction is used relatively little by the code generator,
+	we cannot hope for much improvement.
+The Clean compiler, with a code size of 3,827,868 bytes (Thumb optimised, \cref{fig:reg-alloc:results}),
+	counts 56,009 simple branches, of which 13752 ($25\%$) are already narrow.
+In the best case we can make all the wide branches narrow
+	and win $(56,009 - 13,752)\cdot 2 = 84,514$ bytes, that is $2.2\%$.
+The \ual{bl}, \ual{blx} and \ual{bx} instructions are used $42,436$, $23,070$ and $374$ times in the Clean compiler, respectively.
+
+% $ objdump -d a.out | grep -w 'b\(en\|eq\|cs\|hs\|cc\|lo\|mi\|pl\|vs\|vc\|hi\|ls\|ge\|lt\|gt\|le\|al\)\?\.w' | wc -l
+% 42257
+% $ objdump -d a.out | grep -w 'b\(en\|eq\|cs\|hs\|cc\|lo\|mi\|pl\|vs\|vc\|hi\|ls\|ge\|lt\|gt\|le\|al\)\?\.n' | wc -l
+% 13752
+% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bl | wc -l
+% 42436
+% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF blx | wc -l
+% 23070
+% $ objdump -d a.out | grep -w 'bx\(en\|eq\|cs\|hs\|cc\|lo\|mi\|pl\|vs\|vc\|hi\|ls\|ge\|lt\|gt\|le\|al\)\?' | wc -l
+% 374
+
+\subsection{Thumb-only instructions}
+\label{sec:discussion:thumb-only}
+The Thumb instruction set introduces some instructions that are not available on ARM.
+These include:
+
+\begin{itemize}
+	\item \ual{cbz} and \ual{cbnz},
+		\enquote{Compare and Branch on (Non-)Zero} can conditionally branch forward an even number of bytes between 0 and 126 when a register is (un)equal to zero~\parencite[A6.7.21]{armv7m}.
+		This is an improvement over combining a \ual{tst} and a \ual{beq} or \ual{bne} instruction.
+	\item \ual{tbb} and \ual{tbh},
+		\enquote{Table Branch Byte} and \enquote{Table Branch Halfword} can branch forward using a table of single byte or halfword offsets,
+			when given a pointer to and an index in the table~\parencite[A6.7.139]{armv7m}.
+\end{itemize}
+
+Where they are applicable, using these functions will be more efficient than the equivalent ARM code, which is being used now.
+There do not seem to be direct use cases for \ual{tbb} and \ual{tbh}, but \ual{cbz} and \ual{cbnz} can improve performance.
+It has not been checked yet how much can be won.
+
+The \ual{it} instruction allows for conditional execution in the Thumb instruction set.
+It does not exist in the ARM instruction set, which has a four-bit conditional field in every instruction~(see \cref{sec:intro:conditionals}).
+As such, the \ual{it} instruction is new in Thumb.
+However, it does not introduce a performance optimisation, as there is no slower alternative to \ual{it} blocks.
+
 \begin{figure*}[b!]
 	\centering
 	\begin{tikzpicture}
@@ -103,46 +182,6 @@ The obvious disadvantage is that it makes the calling convention overly complex.
 		\label{fig:results:scatter}}
 \end{figure*}
 
-\subsection{Branch optimisation}
-\label{sec:discussion:branches}
-A second optimisation vector that is yet to be considered is branch optimisation.
-The Thumb instruction set has four variants of the simple branch instruction \ual{b}~\parencite[A6.7.12]{armv7m}:
-
-\begin{itemize}[itemsep=0pt]
-	\item
-		16-bits, conditional, with an offset between $-256$ and $254$.
-	\item
-		16-bits, not conditional, with an offset between $-2,048$ and $2,046$.
-	\item
-		32-bits, conditional, with an offset between $-1,048,576$ and $1,048,574$.
-	\item
-		32-bits, not conditional, with an offset between $-16,777,216$ and $16,777,214$.
-\end{itemize}
-
-By reorganising the code, the code generator could make as many branch instructions as possible 16-bit.
-
-This optimisation is restricted to the \ual{b} instruction:
-	the branch with link (\ual{bl}) is always 32-bit;
-	the branch and exchange (\ual{bx}) and branch with link and exchange (\ual{blx}) instructions always 16-bit.
-Since the \ual{b} instruction is used relatively little by the code generator,
-	we cannot hope for much improvement.
-The Clean compiler, with a code size of 3,827,868 bytes (Thumb optimised, \cref{fig:reg-alloc:results}),
-	counts only 29,982 simple branches, of which 6,979 ($23\%$) are already narrow.
-In the best case we can make all the wide branches narrow
-	and win $(29,982 - 6,979)\cdot 2 = 46,004$ bytes, only $1.2\%$.
-The \ual{bl}, \ual{blx} and \ual{bx} instructions are used $42,436$, $23,070$ and $306$ times in the Clean compiler, respectively.
-
-% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.w | wc -l
-% 23003
-% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.n | wc -l
-% 6979
-% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bl | wc -l
-% 42436
-% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF blx | wc -l
-% 23070
-% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bx | wc -l
-% 306
-
 \subsection{Matching programs and instruction sets}
 \label{sec:discussion:matching}
 If we compare the code size decrease with the running time increase,
@@ -163,7 +202,4 @@ The outliers in \cref{fig:results:scatter} are
 This is explained by the relatively high number of jumps:
 	in Thumb we had to add several instructions, introducing extra code size and running time (see \cref{sec:storing-pc}).
 
-\bigskip
-\todo{more}
-
 \end{multicols}
author	Camil Staps	2017-01-19 17:05:08 +0100
committer	Camil Staps	2017-01-19 17:05:08 +0100
commit	89d745b862c06ba3e73e3bfab6ba608c4583bab9 (patch)
tree	68e8f75dfc9de589ecd516a5be6004c76778aec9
parent	Presentation (diff)