*This is part 2 in a series on running the [Clean][] compiler in
[WebAssembly][], with the proof of concept in the [Clean Sandbox][]. In this
part I discuss the compilation pipeline and the program used to rebuild
generated files. See the [introduction][] for a high-level overview.*

[[toc]]

# The compilation pipeline

When we compile a Clean program to a native executable, the pipeline looks like this:

1. For each Clean file, generate intermediate ABC code.
2. For each ABC file, generate machine code.
3. Link the machine code together with the system linker into an executable.

When we compile for interpretation, this looks a little different:

1. For each Clean file, generate intermediate ABC code.
2. For each ABC file, generate ABC bytecode.

   Normal ABC code is human-readable. ABC bytecode is easier to parse.
   Furthermore, the ABC bytecode defines some custom instructions to speed up
   interpretation, like frequent instruction combinations (e.g. a `pop`
   followed by a `return`).
3. Link the ABC bytecode together with the ABC linker.
4. Prelink the ABC bytecode for use in the WebAssembly interpreter.

   Prelinking removes symbol names and relocations, assuming the code segment
   starts at address 0. This is possible because the WebAssembly memory always
   starts at address 0. This way, the WebAssembly interpreter does not need to
   deal with relocations.
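
To make the prelinking step a bit more concrete, here is a minimal JavaScript sketch. The object format (an array of code words, a symbol table, and a list of relocations) is a simplification made up for illustration; it is not the actual ABC bytecode format, and `prelink` is a hypothetical name.

```js
// Hypothetical, simplified format: `code` is an array of words, `symbols`
// maps a name to an offset in the code segment, and every relocation says
// "the word at `offset` must hold the address of `symbol`".
function prelink(code, symbols, relocations, codeBase = 0) {
	const out = code.slice();
	for (const {offset, symbol} of relocations) {
		if (!(symbol in symbols))
			throw new Error(`undefined symbol: ${symbol}`);
		// The final address is known up front, because the base is fixed
		out[offset] = codeBase + symbols[symbol];
	}
	// Only plain code is returned: no symbol names, no relocations
	return out;
}

const prelinked = prelink(
	[10, 0, 20, 0],
	{start: 0, loop: 2},
	[{offset: 1, symbol: 'loop'}, {offset: 3, symbol: 'start'}]
);
console.log(prelinked); // [ 10, 2, 20, 0 ]
```

Because the base address is fixed at 0, all addresses can be filled in ahead of time, and the loader only has to copy the result into memory.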

For step 1, we use the Clean compiler, written in Clean and C. The other tools
are maintained in the [abc-interpreter][] repository, and are written in C. We
compile the compiler frontend to prelinked bytecode, and the C tools to
WebAssembly using Emscripten.

# The file system

Because the Clean compilation pipeline expects to read and write files (rather
than `stdin` and `stdout`), we need to set up a file system. Luckily,
[Emscripten][] already includes one with the
[`FS` library](https://emscripten.org/docs/api_reference/Filesystem-API.html).
I use the `FS` library for file access, and the `IDBFS` backend, which is based
on IndexedDB, for persistent storage.

Unfortunately, it is [currently not possible to let the different C tools (code
generator, linker, etc.) share the same file
system](https://groups.google.com/g/emscripten-discuss/c/_K61fo-9oKY/m/4m6LYrcPo5sJ).
To circumvent this problem, I link the file system of each tool to the same
IndexedDB. When switching from one tool to another (e.g., going from the code
generator to the linker), I sync the file systems with the IndexedDB. This
incurs some overhead, but makes sure that the contents are the same everywhere.
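
Switching from one tool to the next could be sketched roughly as follows. This assumes only Emscripten's `FS.syncfs(populate, callback)`; the helper names `syncfs` and `switchTool` are hypothetical and not taken from the sandbox.

```js
// Promise wrapper around Emscripten's FS.syncfs(populate, callback).
const syncfs = (tool, populate) => new Promise((resolve, reject) => {
	tool.FS.syncfs(populate, err => err ? reject(err) : resolve());
});

// Flush the files of `from` to IndexedDB, then load them into `to`.
async function switchTool(from, to) {
	await syncfs(from, false); // in-memory file system -> IndexedDB
	await syncfs(to, true);    // IndexedDB -> in-memory file system
}

// For example: switchTool(c_bcgen, c_bclink).then(() => { /* run the linker */ });
```

The two syncs are what cause the overhead: every switch writes to and reads from IndexedDB.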

There is a `make`-like tool which checks which files need to be rebuilt based
on timestamps. This tool is written in Clean, compiled to prelinked bytecode,
and runs in the WebAssembly interpreter.
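
The real tool is written in Clean, but the timestamp check itself can be illustrated in a few lines of JavaScript. I am assuming the usual `make` rule here (rebuild when the target is missing or older than one of its sources); `needsRebuild` and the `mtime` lookup function are hypothetical names.

```js
// `mtime` maps a path to its modification time, or undefined if it is missing.
function needsRebuild(target, sources, mtime) {
	const targetTime = mtime(target);
	if (targetTime === undefined)
		return true; // the target was never built
	// Rebuild if any source is newer than the target
	return sources.some(source => mtime(source) > targetTime);
}

const times = new Map([['a.icl', 20], ['a.abc', 10], ['b.icl', 5], ['b.abc', 8]]);
const mtime = path => times.get(path);

console.log(needsRebuild('a.abc', ['a.icl'], mtime)); // true: source is newer
console.log(needsRebuild('b.abc', ['b.icl'], mtime)); // false: up to date
console.log(needsRebuild('c.abc', ['c.icl'], mtime)); // true: target missing
```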

# Building the C tools

Each C tool is compiled to its own JavaScript and WebAssembly files using
[Emscripten][]. This is similar to building these tools as native executables,
except that we use `emcc` instead of `gcc` and add some extra options. For
example, the bytecode generator is normally built like this:

```make
CLIBS:=-lm

BCGEN:=bcgen

SRC_BCGEN:=abc_instructions.c bcgen.c bcgen_instructions.c bcgen_instruction_table.c bytecode.c parse_abc.c util.c
DEP_BCGEN:=$(subst .c,.h,$(SRC_BCGEN)) settings.h

$(BCGEN): $(SRC_BCGEN) $(DEP_BCGEN)
	$(CC) $(CFLAGS) $(SRC_BCGEN) $(CLIBS) -DBCGEN -o $@
```

With Emscripten we use:

```make
EMCC:=emcc

JS_BCGEN:=bcgen.js

$(JS_BCGEN): $(SRC_BCGEN)
	$(EMCC) -O2 -DPOSIX -DBCGEN\
		-s ENVIRONMENT=web\
		-s INVOKE_RUN=0\
		-s MODULARIZE -s EXPORT_NAME=bcgen\
		-lidbfs.js -s FORCE_FILESYSTEM=1 -s EXPORTED_RUNTIME_METHODS=[FS,IDBFS]\
		$^ -o $@
```

This compiles the C sources to WebAssembly and generates a file `bcgen.js` that
acts as a wrapper around it. Loading `bcgen.js` defines the function `bcgen`,
which returns a promise resolving to an Emscripten module with the property
`_main`, corresponding to the C function `main`.

This sounds complicated, but in practice we only have to load the scripts and
call the generated functions to instantiate the Emscripten modules:

```html
<script type="text/javascript" src="js/bcgen.js"></script>
<script type="text/javascript" src="js/bclink.js"></script>
<script type="text/javascript" src="js/bcprelink.js"></script>
```

```js
var c_bcgen, c_bclink, c_bcprelink;
Promise.all([
  bcgen().then(instance => c_bcgen = instance),
  bclink().then(instance => c_bclink = instance),
  bcprelink().then(instance => c_bcprelink = instance),
]);
```
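
If you prefer not to use global variables, the same idea can be wrapped in a small helper. This is only a sketch: `loadTools` is a hypothetical name, and the only assumption is that each factory returns a promise, as the `MODULARIZE` wrappers above do.

```js
// Call every factory and collect the resulting instances by name.
async function loadTools(factories) {
	const names = Object.keys(factories);
	const instances = await Promise.all(names.map(name => factories[name]()));
	return Object.fromEntries(names.map((name, i) => [name, instances[i]]));
}

// In the browser: loadTools({bcgen, bclink, bcprelink}).then(tools => { /* ... */ });
```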

# Building the compiler

The Clean frontend of the compiler is relatively easy to build. We take the
normal project file (Clean's `Makefile`) and use `cpm` (Clean's `make`) to
build the prelinked bytecode.

The C backend is a little trickier. We use the same `make` recipe as above.
However, the compiler expects some global functions to be defined. These
functions are normally provided by the native Clean run-time system or a
supporting C library, but are not present in the interpreter. We need to
define:

- `set_return_code` which takes an `int` and sets the process exit code.
- `ArgEnvCopyCStringToCleanStringC` which takes a null-terminated C string and
  builds a Clean string from it (which is not null-terminated but includes the
  number of bytes).
- `ArgEnvGetCommandLineArgumentC` which takes an `int` *n* and yields the *n*th
  command-line argument as a C string; in the JavaScript below, it writes the
  size and address of that string through two output pointers.
- `ArgEnvGetCommandLineCountC` which returns the number of command-line
  arguments (`argc`).
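
The difference between the two string representations can be illustrated on a plain byte array. This is a sketch of the layout only (a 4-byte size followed by the bytes, which is how the JavaScript below reads it); the helper names are hypothetical, and the array merely stands in for the WebAssembly memory.

```js
// A small stand-in for the WebAssembly memory (little-endian, like wasm).
const heap = new Uint8Array(64);
const view = new DataView(heap.buffer);

// Write a null-terminated C string at address p.
function writeCString(p, s) {
	const bytes = new TextEncoder().encode(s);
	heap.set(bytes, p);
	heap[p + bytes.length] = 0;
}

// Convert the C string at `cs` to a size-prefixed string at `dest`.
function toSizePrefixed(cs, dest) {
	let size = 0;
	while (heap[cs + size] !== 0)
		size++;
	view.setInt32(dest, size, true);          // the size comes first
	heap.copyWithin(dest + 4, cs, cs + size); // then the bytes, without a null
	return size;
}

writeCString(0, 'abc');
toSizePrefixed(0, 16);
console.log(view.getInt32(16, true));           // 3
console.log(Array.from(heap.subarray(20, 23))); // [ 97, 98, 99 ]
```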

To implement these functions, we make use of custom properties on the
Emscripten module: `clean_argc`, `clean_argv`, and `clean_return_code`. The
JavaScript looks like this:

```js
var c_compiler;
compiler().then(instance => {
	c_compiler = instance;

	c_compiler._ArgEnvGetCommandLineCountC = () => c_compiler.clean_argc;

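	c_compiler._ArgEnvGetCommandLineArgumentC = (i, sizep, sp) => {
		/* argv[i]; pointers are 4 bytes wide, hence the division by 4 */
		const arg = c_compiler.HEAP32[c_compiler.clean_argv/4+i];
		/* Find the terminating null byte to compute the length */
		var p = arg;
		for (; c_compiler.HEAPU8[p] != 0; p++);
		/* Return the size and the address through the output pointers */
		c_compiler.HEAP32[sizep/4] = p-arg;
		c_compiler.HEAP32[sp/4] = arg;
	};

	c_compiler._ArgEnvCopyCStringToCleanStringC = (cs, cleans) => {
		/* The Clean string starts with its size, followed by the bytes */
		const size = c_compiler.HEAP32[cleans/4];
		cleans += 4;
		for (var i = 0; i < size; i++)
			c_compiler.HEAPU8[cleans++] = c_compiler.HEAPU8[cs++];
	};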
	c_compiler._ArgEnvCopyCStringToCleanStringC.sync_strings_back_to_clean = true;

	c_compiler._set_return_code = (c) => {c_compiler.clean_return_code = c;};
});
```

# Preparing the file system

With all the tools set up, we need to create the main directories in the file
system. We do this by adding them to the file system of the `c_compiler`, and
then syncing the other file systems.

```js
c_compiler.FS.mkdir('/clean');
c_compiler.FS.mount(c_compiler.IDBFS, {}, '/clean');
c_compiler.FS.mkdir('/clean/lib');
c_compiler.FS.mkdir('/clean/src');

const mount = backend => new Promise((resolve, reject) => {
	backend.FS.mkdir('/clean');
	backend.FS.mount(backend.IDBFS, {}, '/clean');
	/* Populate the in-memory file system from IndexedDB */
	backend.FS.syncfs(true, err => {
		if (err)
			return reject(err);
		backend.FS.chdir('/clean/src');
		resolve();
	});
});

return new Promise((resolve, reject) => {
	/* Persist the new directories to IndexedDB first */
	c_compiler.FS.syncfs(false, err => {
		if (err)
			return reject(err);
		Promise.all([
			mount(c_bcgen),
			mount(c_bclink),
			mount(c_bcprelink),
		]).then(resolve, reject);
	});
});
```

In the real sandbox, we also add some standard libraries to the file system on
start-up.

# The `make`-like tool

I had to build a simple `make`-like tool to call the compiler and regenerate
files when needed. This tool is a separate Clean program, which is compiled to
prelinked bytecode and interpreted in WebAssembly.

The following snippet sets everything up: it sets the global variable
`sandbox.compile` to a Clean function that builds the sandbox project. In the
[last part][integration] I show how everything is put together; here I want to
focus on the implementation.

`wrapInitFunction` makes this a Clean program that can run in the browser. The
`me` parameter can be ignored here. The `compile` function calls `make`, which
contains the actual logic. `make` is a higher-order function which takes
functions for compilation, code generation, etc., as parameters. Here, these are
provided as `do_compile`, `call_main c_bcgen`, etc.

```clean
Start = wrapInitFunction start

sandbox :== jsGlobal "sandbox"
c_compiler :== jsGlobal "c_compiler"
clean_compiler :== jsGlobal "clean_compiler"
c_bcgen :== jsGlobal "c_bcgen"
c_bclink :== jsGlobal "c_bclink"
c_bcprelink :== jsGlobal "c_bcprelink"

start :: !JSVal !*JSWorld -> *JSWorld
start me w
	# (cb,w) = jsWrapFun (compile me) me w
	# w = (sandbox .# "compile" .= cb) w
	= w
where
	compile me {[0]=paths,[1]=main,[2]=callback} w
		# (paths,w) = jsValToList` paths (fromJS "") w
		= make
			/* .. */
			do_compile
			(call_main c_bcgen) (call_main c_bclink) (call_main c_bcprelink)
			/* .. */
			paths (fromJS "" main)
			w
```

We can call the C tools relatively easily: we copy `argv` into the memory of
the process, set `argc`, and call `main`. Afterwards, we need to free the
memory:

```clean
	call_main backend args w
		# (argc,args,argv,w) = copy_argv args backend w
		# (res,w) = (backend .# "_main" .$? (argc, argv)) (-1, w)
		= (res == 0, free_args_and_argv args argv backend w)
```
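
For readers who do not know Clean, the same steps look roughly like this in JavaScript. This is a sketch and not the sandbox's code: `copyArgv` and `callMain` are hypothetical names, and I assume that the module exposes `_malloc` and `_free` (which may require listing them in Emscripten's `EXPORTED_FUNCTIONS`) next to `HEAPU8`, `HEAP32` and `_main`.

```js
// Copy `args` into the module's memory as a C-style argv array.
function copyArgv(module, args) {
	const encoder = new TextEncoder();
	const argv = module._malloc(4 * args.length); // one 4-byte pointer per argument
	const pointers = args.map((arg, i) => {
		const bytes = encoder.encode(arg);
		const p = module._malloc(bytes.length + 1);
		module.HEAPU8.set(bytes, p);
		module.HEAPU8[p + bytes.length] = 0; // null terminator
		module.HEAP32[argv / 4 + i] = p;
		return p;
	});
	return {argc: args.length, argv, pointers};
}

// Call `main` with the given arguments; true means exit code 0.
function callMain(module, args) {
	const {argc, argv, pointers} = copyArgv(module, args);
	try {
		return module._main(argc, argv) === 0;
	} finally {
		// Free the memory, also when `main` throws
		pointers.forEach(p => module._free(p));
		module._free(argv);
	}
}
```

Unlike the Clean version, which falls back to `-1` when the call fails, this sketch lets exceptions propagate to the caller.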

Compilation is a little trickier, because the compiler is a Clean program. For
this, the `clean_compiler` object, which is a WebAssembly interpreter, has a
`.compile()` method. This method makes sure that no global state is kept between
runs (by clearing CAFs), and that a fresh instance is created if compilation
fails, since the memory may then be corrupted:

```js
clean_compiler.compile = function () {
	clean_compiler.clear_cafs();
	c_compiler.clean_stdio_open = false;

	var res = 0;
	try {
		/* Interpret from the main entry point */
		clean_compiler.interpreter.instance.exports.set_pc(clean_compiler.start);
		clean_compiler.interpreter.instance.exports.interpret();
		res = c_compiler.clean_return_code;
	} catch (e) {
		console.warn(e);
		res = e;
		/* Memory may be badly corrupted, create a new instance to be sure */
		create_clean_compiler();
	}

	/* Clear output buffers */
	if (sandbox.stdout_buffer.length > 0)
		sandbox.stdout('\n');
	if (sandbox.stderr_buffer.length > 0)
		sandbox.stderr('\n');

	return res;
};
```

The Clean wrapper also checks the exit code of the process, and catches any
errors:

```clean
	do_compile args w
		# (_,args,argv,w) = copy_argv args c_compiler w
		# (res_or_err,w) = (clean_compiler .# "compile" .$ ()) w
		# w = free_args_and_argv args argv c_compiler w
		# res = jsValToInt res_or_err
		| isJust res
			= (fromJust res==0, w)
			# (err,w) = (res_or_err .# "toString" .$? ()) ("Compiler crashed.", w)
			= (False, feedback False err w)
```
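
The distinction the wrapper makes between an exit code and a caught exception can be sketched in JavaScript as follows. `interpretResult` is a hypothetical helper that mirrors the logic above; it is not part of the sandbox.

```js
// `resOrErr` is what `clean_compiler.compile()` returned: either an integer
// exit code, or the exception that was caught.
function interpretResult(resOrErr) {
	if (Number.isInteger(resOrErr))
		return {ok: resOrErr === 0, message: null};
	const message = resOrErr != null && typeof resOrErr.toString === 'function'
		? resOrErr.toString()
		: 'Compiler crashed.';
	return {ok: false, message};
}

console.log(interpretResult(0).ok); // true
console.log(interpretResult(1).ok); // false
console.log(interpretResult(new RangeError('out of memory')).message);
// RangeError: out of memory
```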

The make tool itself is now relatively simple. It keeps a queue of modules to
be compiled. It takes a module from the queue, compiles it, and checks the
generated code for dependencies. Dependencies that have not been compiled yet
are added to the queue. Once the queue is empty, the same is done for bytecode
generation. Finally, the generated code is linked into a single bytecode file
and a prelinked bytecode file.
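
The queue-based algorithm can be sketched in JavaScript like this. It is an illustration only: the real tool is written in Clean, and the callback signatures (for instance, `compile` returning the list of dependencies) are assumptions I made to keep the example short.

```js
// `compile(mod)` compiles a module and returns the modules it depends on;
// `generate(mod)` generates bytecode; `link(mods)` links everything.
function make(main, compile, generate, link) {
	const queue = [main];
	const seen = new Set(queue);
	const compiled = [];
	while (queue.length > 0) {
		const mod = queue.shift();
		const dependencies = compile(mod);
		compiled.push(mod);
		for (const dep of dependencies) {
			if (!seen.has(dep)) { // only add each module once
				seen.add(dep);
				queue.push(dep);
			}
		}
	}
	for (const mod of compiled)
		generate(mod);
	return link(compiled);
}

// A made-up dependency graph to try it out
const imports = {main: ['StdEnv', 'util'], util: ['StdEnv'], StdEnv: []};
const log = [];
make('main',
	mod => { log.push('compile ' + mod); return imports[mod]; },
	mod => { log.push('generate ' + mod); },
	mods => { log.push('link ' + mods.join(' ')); });
console.log(log.join('\n'));
```

Every module is compiled exactly once, even when several modules depend on it, because the `seen` set is checked before a module enters the queue.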

*That's all about the pipeline. You can continue reading about [how it's all
put together][integration].*

[abc-interpreter]: https://gitlab.com/clean-and-itasks/abc-interpreter/
[Clean]: http://clean.cs.ru.nl/
[Clean Sandbox]: https://camilstaps.gitlab.io/clean-sandbox/
[Emscripten]: https://emscripten.org/
[WebAssembly]: https://webassembly.org/

[introduction]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-1-introduction.html
[pipeline]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.html
[integration]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-3-putting-it-all-together.html