*This is part 2 in a series on running the [Clean][] compiler in [WebAssembly][], with the proof of concept in the [Clean Sandbox][]. In this part I discuss the compilation pipeline and the program used to rebuild generated files. See the [introduction][] for a high-level overview.*

[[toc]]

# The compilation pipeline

Basically, this is what we do when we compile a Clean program:

1. For each Clean file, generate intermediate ABC code.
2. For each ABC file, generate machine code.
3. Link the machine code together with the system linker into an executable.

When we compile for interpretation, this looks a little different:

1. For each Clean file, generate intermediate ABC code.
2. For each ABC file, generate ABC bytecode.
   Normal ABC code is human-readable. ABC bytecode is easier to parse.
   Furthermore, the ABC bytecode defines some custom instructions to speed up interpretation, like frequent instruction combinations (e.g. a `pop` followed by a `return`).
3. Link the ABC bytecode together with the ABC linker.
4. Prelink the ABC bytecode for use in the WebAssembly interpreter.
   Prelinking removes symbol names and relocations, assuming the code segment starts at address 0. This is possible because the WebAssembly memory always starts at address 0. This way, the WebAssembly interpreter does not need to deal with relocations.

For step 1, we use the Clean compiler, written in Clean and C. The other tools are maintained in the [abc-interpreter][] repository, and are written in C. We compile the compiler frontend to prelinked bytecode, and the C tools to WebAssembly using [Emscripten][].

# The file system

Because the Clean compilation pipeline expects to read and write files (rather than `stdin` and `stdout`), we need to set up a file system. Luckily, [Emscripten][] already includes one with the [`FS` library](https://emscripten.org/docs/api_reference/Filesystem-API.html).
I use the `FS` library for file access, and the `IDBFS` backend, which is based on IndexedDB, for persistent storage.

Unfortunately, it is [currently not possible to let the different C tools (code generator, linker, etc.) share the same file system](https://groups.google.com/g/emscripten-discuss/c/_K61fo-9oKY/m/4m6LYrcPo5sJ). To circumvent this problem, I link the file system of each tool to the same IndexedDB. When switching from one tool to another (e.g., going from the code generator to the linker), I sync the file systems with the IndexedDB. This incurs some overhead, but makes sure that the contents are the same everywhere.

There is a `make`-like tool which checks which files need to be rebuilt based on timestamps. This tool is written in Clean, compiled to prelinked bytecode, and runs in the WebAssembly interpreter.

# Building the C tools

Each C tool is built to its own JavaScript and WebAssembly source using [Emscripten][]. This is similar to building these tools to a native executable, except that we use `emcc` instead of `gcc` and add some other options. For example, the bytecode generator is normally built like this:

```make
CLIBS:=-lm

BCGEN:=bcgen
SRC_BCGEN:=abc_instructions.c bcgen.c bcgen_instructions.c bcgen_instruction_table.c bytecode.c parse_abc.c util.c
DEP_BCGEN:=$(subst .c,.h,$(SRC_BCGEN)) settings.h

$(BCGEN): $(SRC_BCGEN) $(DEP_BCGEN)
	$(CC) $(CFLAGS) $(SRC_BCGEN) $(CLIBS) -DBCGEN -o $@
```

With Emscripten we use:

```make
EMCC:=emcc

JS_BCGEN:=bcgen.js

$(JS_BCGEN): $(SRC_BCGEN)
	$(EMCC) -O2 -DPOSIX -DBCGEN\
		-s ENVIRONMENT=web\
		-s INVOKE_RUN=0\
		-s MODULARIZE -s EXPORT_NAME=bcgen\
		-lidbfs.js -s FORCE_FILESYSTEM=1 -s EXPORTED_RUNTIME_METHODS=[FS,IDBFS]\
		$^ -o $@
```

This compiles the C sources to WebAssembly and generates a file `bcgen.js` that acts as a wrapper around it.
Loading `bcgen.js` will then define the function `bcgen`, which returns a promise resolving to an Emscripten module with the property `_main`, corresponding to the C function `main`.

This sounds complicated, but we can instantiate the Emscripten modules like this:

```html
<script type="text/javascript" src="js/bcgen.js"></script>
<script type="text/javascript" src="js/bclink.js"></script>
<script type="text/javascript" src="js/bcprelink.js"></script>
```

```js
var c_bcgen, c_bclink, c_bcprelink;
Promise.all([
	bcgen().then(instance => c_bcgen = instance),
	bclink().then(instance => c_bclink = instance),
	bcprelink().then(instance => c_bcprelink = instance),
])
```

# Building the compiler

The Clean frontend of the compiler is relatively easy to build. We take the normal project file (Clean's `Makefile`) and use `cpm` (Clean's `make`) to build the prelinked bytecode.

The C backend is a little trickier. We use the same `make` recipe as above. However, the compiler expects some global functions to be defined. These functions are normally provided by the native Clean run-time system or a supporting C library, but are not present in the interpreter. We need to define:

- `set_return_code` which takes an `int` and sets the process exit code.
- `ArgEnvCopyCStringToCleanStringC` which takes a null-terminated C string and builds a Clean string from it (which is not null-terminated but includes the number of bytes).
- `ArgEnvGetCommandLineArgumentC` which takes an `int` and returns a pointer to a C string with the *n*th command-line argument.
- `ArgEnvGetCommandLineCountC` which returns the number of command-line arguments (`argc`).

To implement these functions, we make use of custom properties on the Emscripten module: `clean_argc`, `clean_argv`, and `clean_return_code`.
The JavaScript looks like this:

```js
var c_compiler;
compiler().then(instance => {
	c_compiler = instance;

	c_compiler._ArgEnvGetCommandLineCountC = () => c_compiler.clean_argc;
	c_compiler._ArgEnvGetCommandLineArgumentC = (i, sizep, sp) => {
		/* Look up the i-th entry of argv, measure it up to the terminating
		 * NUL, and store its length and address in the out-parameters */
		const arg = c_compiler.HEAP32[c_compiler.clean_argv/4+i];
		var p = arg;
		for (; c_compiler.HEAPU8[p] != 0; p++);
		c_compiler.HEAP32[sizep/4] = p-arg;
		c_compiler.HEAP32[sp/4] = arg;
	};
	c_compiler._ArgEnvCopyCStringToCleanStringC = (cs, cleans) => {
		/* The Clean string starts with its length (4 bytes), followed by
		 * the characters */
		const size = c_compiler.HEAP32[cleans/4];
		cleans += 4;
		for (var i = 0; i < size; i++)
			c_compiler.HEAPU8[cleans++] = c_compiler.HEAPU8[cs++];
	};
	c_compiler._ArgEnvCopyCStringToCleanStringC.sync_strings_back_to_clean = true;
	c_compiler._set_return_code = (c) => {c_compiler.clean_return_code = c;};
});
```

# Preparing the file system

With all the tools set up, we need to create the main directories in the file system. We do this by adding them in the file system of the `c_compiler`, and then syncing the other file systems.

```js
c_compiler.FS.mkdir('/clean');
c_compiler.FS.mount(c_compiler.IDBFS, {}, '/clean');
c_compiler.FS.mkdir('/clean/lib');
c_compiler.FS.mkdir('/clean/src');

const mount = backend => new Promise(resolve => {
	backend.FS.mkdir('/clean');
	backend.FS.mount(backend.IDBFS, {}, '/clean');
	backend.FS.syncfs(true, err => {
		if (err)
			throw err;
		backend.FS.chdir('/clean/src');
		resolve();
	});
});

return new Promise(resolve => {
	c_compiler.FS.syncfs(false, () => {
		Promise.all([
			mount(c_bcgen),
			mount(c_bclink),
			mount(c_bcprelink),
		]).then(resolve);
	});
});
```

In the real sandbox, we also add some standard libraries to the file system on start-up.

# The `make`-like tool

I had to build a simple `make`-like tool to call the compiler and regenerate files when needed. This tool is a separate Clean program, which is compiled to prelinked bytecode and interpreted in WebAssembly.
The following snippet sets everything up: it sets the global variable `sandbox.compile` to a Clean function that makes the sandbox project. In the [last part][integration] I show how everything is put together; here I want to focus on the implementation.

`wrapInitFunction` makes this a Clean program that can run in the browser. The `me` parameter can be ignored here. The `compile` function calls `make`, which contains the actual logic. `make` is a higher-order function which takes functions for compilation, code generation, etc., as parameters. Here, these are provided as `call_main c_bcgen` etc.

```clean
Start = wrapInitFunction start

sandbox :== jsGlobal "sandbox"
c_compiler :== jsGlobal "c_compiler"
clean_compiler :== jsGlobal "clean_compiler"
c_bcgen :== jsGlobal "c_bcgen"
c_bclink :== jsGlobal "c_bclink"
c_bcprelink :== jsGlobal "c_bcprelink"

start :: !JSVal !*JSWorld -> *JSWorld
start me w
	# (cb,w) = jsWrapFun (compile me) me w
	# w = (sandbox .# "compile" .= cb) w
	= w
where
	compile me {[0]=paths,[1]=main,[2]=callback} w
		# (paths,w) = jsValToList` paths (fromJS "") w
		= make /* .. */
			do_compile
			(call_main c_bcgen)
			(call_main c_bclink)
			(call_main c_bcprelink)
			/* .. */
			paths (fromJS "" main) w
```

The C tools we can call relatively easily, by copying `argv` into the memory of the process, setting `argc`, and calling `main`. Afterwards we need to free the memory:

```clean
call_main backend args w
	# (argc,args,argv,w) = copy_argv args backend w
	# (res,w) = (backend .# "_main" .$? (argc, argv)) (-1, w)
	= (res == 0, free_args_and_argv args argv backend w)
```

Compilation is a little trickier, because it is a Clean program. For this the `clean_compiler` object, which is a WebAssembly interpreter, has a `.compile()` method.
This method makes sure that no global state is kept between runs (by clearing CAFs), and that the memory does not get corrupted if compilation fails:

```js
clean_compiler.compile = function () {
	clean_compiler.clear_cafs();
	c_compiler.clean_stdio_open = false;

	var res = 0;

	try {
		/* Interpret from the main entry point */
		clean_compiler.interpreter.instance.exports.set_pc(clean_compiler.start);
		clean_compiler.interpreter.instance.exports.interpret();

		res = c_compiler.clean_return_code;
	} catch (e) {
		console.warn(e);
		res = e;

		/* Memory may be badly corrupted, create a new instance to be sure */
		create_clean_compiler();
	}

	/* Clear output buffers */
	if (sandbox.stdout_buffer.length > 0)
		sandbox.stdout('\n');
	if (sandbox.stderr_buffer.length > 0)
		sandbox.stderr('\n');

	return res;
};
```

The Clean wrapper also checks the exit code of the process, and catches any errors:

```clean
do_compile args w
	# (_,args,argv,w) = copy_argv args c_compiler w
	# (res_or_err,w) = (clean_compiler .# "compile" .$ ()) w
	# w = free_args_and_argv args argv c_compiler w
	# res = jsValToInt res_or_err
	| isJust res
		= (fromJust res==0, w)
	# (err,w) = (res_or_err .# "toString" .$? ()) ("Compiler crashed.", w)
	= (False, feedback False err w)
```

The make tool itself is now relatively simple. It keeps a queue of modules to be compiled. It takes a module from the queue, compiles it, and checks the generated code for any dependencies. These dependencies are added to the queue if they have not been compiled yet. When the queue is empty, the same is done for bytecode generation. Finally, the generated code is linked into a single bytecode file and a prelinked bytecode file.

*That's all about the pipeline.
You can continue reading about [how it's all put together][integration].*

[abc-interpreter]: https://gitlab.com/clean-and-itasks/abc-interpreter/
[Clean]: http://clean.cs.ru.nl/
[Clean Sandbox]: https://camilstaps.gitlab.io/clean-sandbox/
[Emscripten]: https://emscripten.org/
[WebAssembly]: https://webassembly.org/

[introduction]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-1-introduction.html
[pipeline]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.html
[integration]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-3-putting-it-all-together.html
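
As an addendum to the description of the make tool above, the queue-based build loop could be sketched in JavaScript roughly as follows. This is only an illustration, not the actual implementation: the real tool is written in Clean and also checks timestamps, which this sketch leaves out. The names `buildAll`, `compile`, and `dependenciesOf`, as well as the example dependency graph, are hypothetical and do not appear in the sandbox.

```javascript
// Hypothetical sketch of a queue-based build loop (not the sandbox's Clean code).
// `compile(mod)` returns true on success; `dependenciesOf(mod)` lists the
// modules that `mod` depends on. Both are stand-ins supplied by the caller.
function buildAll(main, compile, dependenciesOf) {
	const queue = [main];        // modules that still need to be compiled
	const seen = new Set(queue); // modules already queued or compiled
	const order = [];            // modules in the order they were compiled

	while (queue.length > 0) {
		const mod = queue.shift();
		if (!compile(mod))
			return {ok: false, failed: mod, order};
		order.push(mod);

		// Queue every dependency that has not been queued before
		for (const dep of dependenciesOf(mod)) {
			if (!seen.has(dep)) {
				seen.add(dep);
				queue.push(dep);
			}
		}
	}

	return {ok: true, order};
}

// Usage with a made-up dependency graph and a compiler that always succeeds
const deps = {main: ['StdEnv', 'Text'], Text: ['StdEnv'], StdEnv: []};
const result = buildAll('main', () => true, mod => deps[mod] || []);
console.log(result.order.join(' '));
```

The `seen` set is what keeps a module that several others depend on (here `StdEnv`) from being compiled more than once. The same loop could then be run a second time with a bytecode generation function in place of `compile`.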