\section{Discussion}
\label{sec:discussion}

\begin{multicols}{2}

\subsection{Optimising subroutine calls}
\label{sec:discussion:subroutines}
In \cref{sec:storing-pc}, we saw that subroutine calls are slower and larger in Thumb than in ARM,
	because an extra \ual{add} instruction has to be inserted.
We discussed that a more efficient solution would exploit the link register:
	upon a \ual{bl} or \ual{blx} instruction, the link register is loaded with the address of the instruction after the branch.
If the callee, rather than the caller, were responsible for saving this address,
	we could delay storing the return address and exploit these instructions to eliminate the \ual{add} instruction.
However, we also mentioned that tail recursion makes this difficult (see \cref{sec:storing-pc:other-solutions}).

\subsubsection{Tail recursive entry points}
To improve on this, two things can be done.
First, we could add a \emph{tail recursive entry point} to function blocks.
The code for the \clean{length} function from \cref{sec:storing-pc:other-solutions} could then look like this:

\begin{minted}{ual}
slength_P2:
	str   lr,[sp,#-4]!    @ Store the return address
slength_P2_tr:              @ The tail recursive entry point
	ldr   r0,[r6]         @ Load top of A-stack (the list)

	@ ... etc

e_0:                        @ If / after the Cons part has been evaluated
	add   r4,r4,#1        @ Increase counter
	b     slength_P2_tr   @ Tail recursively jump back to start of function
	.ltorg
\end{minted}

This way, the code to store the return address is only executed once.
The return address is saved on the stack, as before.
To call the subroutine, the following block can be used:

\begin{minted}{ual}
	bl    slength_P2
\end{minted}

This is an improvement of between two and six bytes per subroutine call:
	in any case, we win two bytes since the \ual{add} instruction is eliminated.
In the minimal case that a function is called only once, we save only these two bytes.
If the function is called often, many \ual{str} instructions can be removed at the cost of one extra \ual{str} instruction in the function itself,
	and the space saved per subroutine call asymptotically approaches six bytes.
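In terms of a worked sum, assuming (as above) two bytes for the eliminated \ual{add} and four bytes for the relocated \ual{str} with write-back (a 32-bit encoding),
	a function that is called $n$ times saves
\[
	2n + 4(n - 1) = 6n - 4 \text{ bytes},
\]
so the saving per call is $(6n - 4)/n$: two bytes when $n = 1$, approaching six bytes asymptotically as $n$ grows.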

As for running time, we win one instruction cycle per subroutine call.
The \ual{add} instruction is removed, but the \ual{str} instruction is still executed
	(whether it is placed in the caller's or the callee's block does not matter).

To implement tail recursive entry points,
	one would need to recognise tail calls,
	so that the tail recursive entry point is used only for those subroutine calls,
	and not for `plain' recursive subroutine calls.
Also, more research is needed to see whether a tail recursive entry point is sufficient in all cases.
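
As a sketch of what this recognition amounts to (the call sites shown here are schematic),
	a plain recursive call should go through the storing entry point,
	whereas a call in tail position should bypass it, so that no second return address is pushed:

\begin{minted}{ual}
	@ Plain (non-tail) recursive call: the callee stores lr
	bl    slength_P2

	@ Call in tail position: reuse the return address that is already
	@ on the stack by jumping to the tail recursive entry point
	b     slength_P2_tr
\end{minted}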

\subsubsection{Mixed calling convention}
A second possibility would be to have a \emph{mixed calling convention}.
Currently, in every subroutine call, the caller is responsible for storing the return address on the stack.
In a mixed calling convention,
	the callee would be responsible for this,
	except for functions that exploit tail recursion.
Recognising tail recursion can be done at a relatively high level of abstraction,
	so annotations can be added to the intermediate ABC-code to indicate which calling convention should be used for a given function.

An advantage of this approach might be that there are fewer cases to consider.
The obvious disadvantage is that it makes the calling convention overly complex.
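
As a rough sketch, using a hypothetical function \ual{sfoo_P1},
	a function compiled with the callee-saves convention would store the return address itself,
	so that its call sites reduce to a single \ual{bl}:

\begin{minted}{ual}
sfoo_P1:
	str   lr,[sp,#-4]!    @ the callee stores the return address
	@ ... body ...

	@ A call site elsewhere:
	bl    sfoo_P1
\end{minted}

Functions that exploit tail recursion would instead keep the current convention,
	in which the caller pushes the return address before branching,
	so that a tail call can keep jumping back to the start of the function without pushing a second return address.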

\begin{figure*}[b!]
	\centering
	\begin{tikzpicture}
		\begin{axis}
			[ width=.5\textwidth
			, xlabel={Size decrease (\%)}
			, ylabel={Running time increase (\%)}
			, mark options={scale=2}
			, scatter/classes={%
				a={mark=x,draw=\ifcolours red\else gray\fi},
				b={mark=+,draw=black}}
			]
			\addplot[scatter,only marks,scatter src=explicit symbolic]
				table[meta=label] {
					x     y    label
					12.8  11.9 a
					18.6  3.6  b
					9.8   5.1  a
					18.9  5.9  b
					18.9  3.7  b
					20.1  5.4  a
					22.9  0.9  a
					0.0   2.6  a
					20.0  5.2  b
					22.0  0.0  b
					12.1  8.0  a
				};
		\end{axis}
	\end{tikzpicture}
	\caption{
		Comparison of code size decrease and running time increase.
		{\ifcolours Red\else Grey\fi} crosses indicate programs that are smaller than 1,000 bytes in ARM;
			black pluses correspond to larger programs.
		\label{fig:results:scatter}}
\end{figure*}

\subsection{Branch optimisation}
\label{sec:discussion:branches}
A second optimisation that has not yet been considered is branch optimisation.
The Thumb instruction set has four variants of the simple branch instruction \ual{b}~\parencite[A6.7.12]{armv7m}:

\begin{itemize}[itemsep=0pt]
	\item
		16-bit, conditional, with an offset between $-256$ and $254$.
	\item
		16-bit, unconditional, with an offset between $-2,048$ and $2,046$.
	\item
		32-bit, conditional, with an offset between $-1,048,576$ and $1,048,574$.
	\item
		32-bit, unconditional, with an offset between $-16,777,216$ and $16,777,214$.
\end{itemize}

By reorganising the code, the code generator could ensure that as many branch instructions as possible use the 16-bit encoding.
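
As an illustration (the labels here are made up),
	UAL allows the encoding width to be forced with the \ual{.n} (narrow) and \ual{.w} (wide) qualifiers,
	which is also how the two encodings appear in a disassembly:

\begin{minted}{ual}
	b.n   near_label      @ 16-bit encoding: target within roughly 2 KiB
	@ ...
	b.w   far_label       @ 32-bit encoding: target up to roughly 16 MiB away
\end{minted}

Normally the assembler selects the width itself;
	the optimisation would consist of ordering the code such that the narrow form is applicable as often as possible.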

This optimisation is restricted to the \ual{b} instruction:
	the branch with link (\ual{bl}) is always 32-bit;
	the branch and exchange (\ual{bx}) and branch with link and exchange (\ual{blx}) instructions are always 16-bit.
Since the \ual{b} instruction is used relatively little by the code generator,
	we cannot hope for much improvement.
The Clean compiler, with a code size of 3,827,868 bytes (Thumb optimised, \cref{fig:reg-alloc:results}),
	contains only 29,982 simple branches, of which 6,979 ($23\%$) are already narrow.
In the best case we can make all the wide branches narrow
	and win $(29,982 - 6,979)\cdot 2 = 46,006$ bytes, only $1.2\%$.
The \ual{bl}, \ual{blx} and \ual{bx} instructions are used $42,436$, $23,070$ and $306$ times in the Clean compiler, respectively.

% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.w | wc -l
% 23003
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF b.n | wc -l
% 6979
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bl | wc -l
% 42436
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF blx | wc -l
% 23070
% pi@rasppi:~/clean/src/compiler$ objdump -d a.out | grep -wF bx | wc -l
% 306

\subsection{Matching programs and instruction sets}
\label{sec:discussion:matching}
If we compare the code size decrease with the running time increase,
	we get the plot in \cref{fig:results:scatter}.
The code size decrease appears to correlate negatively with the increase in running time,
	suggesting that Thumb is more suitable for some programs than for others.
However, we do not have enough data to determine whether this correlation is significant.
More research is needed to see whether there actually is such a relationship,
	and if so, which programs are more suitable for Thumb.

The outliers in \cref{fig:results:scatter} are
	RFib $(0.0,2.6)$,
	Ack $(12.8,11.9)$,
	Fib $(9.8,5.1)$ and
	Tak $(12.1,8.0)$,
	all small programs that rely heavily on recursion
	(for a description of the programs, see \cref{sec:results:testsuite} above).
This is explained by their relatively high number of jumps:
	for these, Thumb requires several extra instructions, which add both code size and running time (see \cref{sec:storing-pc}).

\bigskip
\todo{more}

\end{multicols}