author: Camil Staps, 2021-06-23 19:20:29 +0200
committer: Camil Staps, 2021-06-23 19:20:29 +0200
commit: bbc049b5c7749b8f9c4e1ade9cf1da4f91f37c0d
tree: 9fe20b0549b51a7165b847ac1a6a5f56a128ff92 /resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md
parent: Add author on articles
Add article series on the clean sandbox
Diffstat (limited to 'resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md')
-rw-r--r-- resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md | 328
1 files changed, 328 insertions, 0 deletions
diff --git a/resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md b/resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md
new file mode 100644
index 0000000..b766560
--- /dev/null
+++ b/resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md
@@ -0,0 +1,328 @@
+*This is part 2 in a series on running the [Clean][] compiler in
+[WebAssembly][], with the proof of concept in the [Clean Sandbox][]. In this
+part I discuss the compilation pipeline and the program used to rebuild
+generated files. See the [introduction][] for a high-level overview.*
+
+[[toc]]
+
+# The compilation pipeline
+
+Compiling a Clean program to a native executable takes three steps:
+
+1. For each Clean file, generate intermediate ABC code.
+2. For each ABC file, generate machine code.
+3. Link the machine code together with the system linker into an executable.
+
+When we compile for interpretation, this looks a little different:
+
+1. For each Clean file, generate intermediate ABC code.
+2. For each ABC file, generate ABC bytecode.
+
+ Normal ABC code is human-readable. ABC bytecode is easier to parse.
+ Furthermore, the ABC bytecode defines some custom instructions to speed up
+ interpretation, like frequent instruction combinations (e.g. a `pop`
+ followed by a `return`).
+3. Link the ABC bytecode together with the ABC linker.
+4. Prelink the ABC bytecode for use in the WebAssembly interpreter.
+
+ Prelinking removes symbol names and relocations, assuming the code segment
+ starts at address 0. This is possible because the WebAssembly memory always
+ starts at address 0. This way, the WebAssembly interpreter does not need to
+ deal with relocations.
+
+For step 1, we use the Clean compiler, written in Clean and C. The other tools
+are maintained in the [abc-interpreter][] repository, and are written in C. We
+compile the compiler frontend to prelinked bytecode, and the C tools to
+WebAssembly using [Emscripten][].
+
+# The file system
+
+Because the Clean compilation pipeline expects to read and write files (rather
+than `stdin` and `stdout`), we need to set up a file system. Luckily,
+[Emscripten][] already includes one with the
+[`FS` library](https://emscripten.org/docs/api_reference/Filesystem-API.html).
+I use the `FS` library for file access, and the `IDBFS` backend, which is based
+on IndexedDB, for persistent storage.
+
+Unfortunately, it is [currently not possible to let the different C tools (code
+generator, linker, etc.) share the same file
+system](https://groups.google.com/g/emscripten-discuss/c/_K61fo-9oKY/m/4m6LYrcPo5sJ).
+To circumvent this problem, I link the file system of each tool to the same
+IndexedDB. When switching from one tool to another (e.g., going from the code
+generator to the linker), I sync the file systems with the IndexedDB. This
+incurs some overhead, but makes sure that the contents are the same everywhere.
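+
+As a rough sketch of such a switch, assuming each tool is an Emscripten module
+that exposes `FS` with an IDBFS mount (the helper names `syncfs` and
+`switchTool` are hypothetical and not taken from the sandbox):
+
+```javascript
+// Promise wrapper around Emscripten's callback-based FS.syncfs.
+// populate=false saves the in-memory files to IndexedDB;
+// populate=true loads the files from IndexedDB into memory.
+const syncfs = (tool, populate) => new Promise((resolve, reject) => {
+  tool.FS.syncfs(populate, err => err ? reject(err) : resolve());
+});
+
+// Hypothetical helper: flush the files of the tool that just ran, then load
+// them into the tool that runs next.
+async function switchTool(from, to) {
+  await syncfs(from, false);
+  await syncfs(to, true);
+}
+```
+
+For example, `switchTool(c_bcgen, c_bclink)` would run between bytecode
+generation and linking.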
+
+There is a `make`-like tool which checks which files need to be rebuilt based
+on timestamps. This tool is written in Clean, compiled to prelinked bytecode,
+and runs in the WebAssembly interpreter.
+
+# Building the C tools
+
+Each C tool is built to its own JavaScript and WebAssembly source using
+[Emscripten][]. This is similar to building these tools to a native executable,
+except that we use `emcc` instead of `gcc` and add some other options. For
+example, the bytecode generator is normally built like this:
+
+```make
+CLIBS:=-lm
+
+BCGEN:=bcgen
+
+SRC_BCGEN:=abc_instructions.c bcgen.c bcgen_instructions.c bcgen_instruction_table.c bytecode.c parse_abc.c util.c
+DEP_BCGEN:=$(subst .c,.h,$(SRC_BCGEN)) settings.h
+
+$(BCGEN): $(SRC_BCGEN) $(DEP_BCGEN)
+ $(CC) $(CFLAGS) $(SRC_BCGEN) $(CLIBS) -DBCGEN -o $@
+```
+
+With Emscripten we use:
+
+```make
+EMCC:=emcc
+
+JS_BCGEN:=bcgen.js
+
+$(JS_BCGEN): $(SRC_BCGEN)
+ $(EMCC) -O2 -DPOSIX -DBCGEN\
+ -s ENVIRONMENT=web\
+ -s INVOKE_RUN=0\
+ -s MODULARIZE -s EXPORT_NAME=bcgen\
+ -lidbfs.js -s FORCE_FILESYSTEM=1 -s EXPORTED_RUNTIME_METHODS=[FS,IDBFS]\
+ $^ -o $@
+```
+
+This compiles the C sources to WebAssembly and generates a file `bcgen.js` that
+acts as a wrapper around it. Loading `bcgen.js` will then define the function
+`bcgen`, which returns a promise resolving to an Emscripten module with the
+property `_main`, corresponding to the C function `main`.
+
+This sounds complicated, but we can instantiate the Emscripten modules like
+this:
+
+```html
+<script type="text/javascript" src="js/bcgen.js"></script>
+<script type="text/javascript" src="js/bclink.js"></script>
+<script type="text/javascript" src="js/bcprelink.js"></script>
+```
+
+```js
+var c_bcgen, c_bclink, c_bcprelink;
+Promise.all([
+ bcgen().then(instance => c_bcgen = instance),
+ bclink().then(instance => c_bclink = instance),
+ bcprelink().then(instance => c_bcprelink = instance),
+])
+```
+
+# Building the compiler
+
+The Clean frontend of the compiler is relatively easy to build. We take the
+normal project file (Clean's `Makefile`) and use `cpm` (Clean's `make`) to
+build the prelinked bytecode.
+
+The C backend is a little trickier. We use the same `make` recipe as above.
+However, the compiler expects some global functions to be defined. These
+functions are normally provided by the native Clean run-time system or a
+supporting C library, but are not present in the interpreter. We need to
+define:
+
+- `set_return_code`, which takes an `int` and sets the process exit code.
+- `ArgEnvCopyCStringToCleanStringC`, which takes a null-terminated C string and
+ builds a Clean string from it (which is not null-terminated but stores its
+ length in bytes).
+- `ArgEnvGetCommandLineArgumentC`, which takes an `int` *n* and returns a
+ pointer to a C string with the *n*th command-line argument.
+- `ArgEnvGetCommandLineCountC`, which returns the number of command-line
+ arguments (`argc`).
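+
+To illustrate the difference between the two string representations, here is a
+small sketch that converts one into the other. It assumes that the length is a
+32-bit little-endian integer stored directly in front of the bytes; the
+function name is hypothetical, and this is not the sandbox's implementation:
+
+```javascript
+// Copy the null-terminated string at `src` to a length-prefixed string at
+// `dst` in the same Uint8Array. Returns the number of bytes in the string.
+function copyCStringToLengthPrefixed(heap, src, dst) {
+  let size = 0;
+  while (heap[src + size] !== 0) size++;        // find the terminator
+  new DataView(heap.buffer, heap.byteOffset)
+    .setUint32(dst, size, true);                // assumed 4-byte length prefix
+  heap.copyWithin(dst + 4, src, src + size);    // bytes, without terminator
+  return size;
+}
+```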
+
+To implement these functions, we make use of custom properties on the
+Emscripten module: `clean_argc`, `clean_argv`, and `clean_return_code`. The
+JavaScript looks like this:
+
+```js
+var c_compiler;
+compiler().then(instance => {
+ c_compiler = instance;
+
+ c_compiler._ArgEnvGetCommandLineCountC = () => c_compiler.clean_argc;
+
+ c_compiler._ArgEnvGetCommandLineArgumentC = (i, sizep, sp) => {
+ const arg = c_compiler.HEAP32[c_compiler.clean_argv/4+i];
+ var p = arg;
+ for (; c_compiler.HEAPU8[p] != 0; p++);
+ c_compiler.HEAP32[sizep/4] = p-arg;
+ c_compiler.HEAP32[sp/4] = arg;
+ };
+
+ c_compiler._ArgEnvCopyCStringToCleanStringC = (cs, cleans) => {
+ const size = c_compiler.HEAP32[cleans/4];
+ cleans += 4;
+ for (var i = 0; i < size; i++)
+ c_compiler.HEAPU8[cleans++] = c_compiler.HEAPU8[cs++];
+ };
+ c_compiler._ArgEnvCopyCStringToCleanStringC.sync_strings_back_to_clean = true;
+
+ c_compiler._set_return_code = (c) => {c_compiler.clean_return_code = c;};
+});
+```
+
+# Preparing the file system
+
+With all the tools set up, we need to create the main directories in the file
+system. We do this by adding them to the file system of `c_compiler`, and
+then syncing the other file systems.
+
+```js
+c_compiler.FS.mkdir('/clean');
+c_compiler.FS.mount(c_compiler.IDBFS, {}, '/clean');
+c_compiler.FS.mkdir('/clean/lib');
+c_compiler.FS.mkdir('/clean/src');
+
+const mount = backend => new Promise(resolve => {
+ backend.FS.mkdir('/clean');
+ backend.FS.mount(backend.IDBFS, {}, '/clean');
+ backend.FS.syncfs(true, err => {
+ if (err)
+ throw err;
+ backend.FS.chdir('/clean/src');
+ resolve();
+ });
+});
+
+return new Promise(resolve => {
+ c_compiler.FS.syncfs(false, () => {
+ Promise.all([
+ mount(c_bcgen),
+ mount(c_bclink),
+ mount(c_bcprelink),
+ ]).then(resolve);
+ });
+});
+```
+
+In the real sandbox, we also add some standard libraries to the file system on
+start-up.
+
+# The `make`-like tool
+
+I had to build a simple `make`-like tool to call the compiler and regenerate
+files when needed. This tool is a separate Clean program, which is compiled to
+prelinked bytecode and interpreted in WebAssembly.
+
+The following snippet sets everything up: it sets the global variable
+`sandbox.compile` to a Clean function that makes the sandbox project. In the
+[last part][integration] I show how everything is put together; here I want to
+focus on the implementation.
+
+`wrapInitFunction` makes this a Clean program that can run in the browser. The
+`me` parameter can be ignored here. The `compile` function calls `make`, which
+contains the actual logic. `make` is a higher-order function which takes
+functions for compilation, code generation, etc., as parameters. Here these are
+provided as `do_compile`, `call_main c_bcgen`, etc.
+
+```clean
+Start = wrapInitFunction start
+
+sandbox :== jsGlobal "sandbox"
+c_compiler :== jsGlobal "c_compiler"
+clean_compiler :== jsGlobal "clean_compiler"
+c_bcgen :== jsGlobal "c_bcgen"
+c_bclink :== jsGlobal "c_bclink"
+c_bcprelink :== jsGlobal "c_bcprelink"
+
+start :: !JSVal !*JSWorld -> *JSWorld
+start me w
+ # (cb,w) = jsWrapFun (compile me) me w
+ # w = (sandbox .# "compile" .= cb) w
+ = w
+where
+ compile me {[0]=paths,[1]=main,[2]=callback} w
+ # (paths,w) = jsValToList` paths (fromJS "") w
+ = make
+ /* .. */
+ do_compile
+ (call_main c_bcgen) (call_main c_bclink) (call_main c_bcprelink)
+ /* .. */
+ paths (fromJS "" main)
+ w
+```
+
+We can call the C tools relatively easily by copying `argv` into the memory of
+the process, setting `argc`, and calling `main`. Afterwards we need to free the
+memory:
+
+```clean
+ call_main backend args w
+ # (argc,args,argv,w) = copy_argv args backend w
+ # (res,w) = (backend .# "_main" .$? (argc, argv)) (-1, w)
+ = (res == 0, free_args_and_argv args argv backend w)
+```
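+
+The same idea can be sketched in JavaScript. This is not the `copy_argv` used
+above, only an illustration: it assumes that the module exports `_malloc` and
+`_free`, that pointers are 32 bits wide, and that all arguments are ASCII. The
+names `copyArgv` and `freeArgv` are hypothetical:
+
+```javascript
+// Copy the argument strings into the module's memory and build a C-style
+// argv array of 32-bit pointers. Returns everything that must be freed later.
+function copyArgv(mod, args) {
+  const ptrs = args.map(arg => {
+    const p = mod._malloc(arg.length + 1);
+    for (let i = 0; i < arg.length; i++)
+      mod.HEAPU8[p + i] = arg.charCodeAt(i);
+    mod.HEAPU8[p + arg.length] = 0; // null terminator
+    return p;
+  });
+  const argv = mod._malloc(4 * ptrs.length);
+  ptrs.forEach((p, i) => { mod.HEAP32[argv / 4 + i] = p; });
+  return {argc: args.length, argv, ptrs};
+}
+
+// Release the strings and the pointer array again.
+function freeArgv(mod, {argv, ptrs}) {
+  ptrs.forEach(p => mod._free(p));
+  mod._free(argv);
+}
+```
+
+With a real module, the call to `_main` with `argc` and `argv` would sit
+between these two functions.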
+
+Compilation is a little trickier, because the compiler is a Clean program. For
+this, the `clean_compiler` object, which is a WebAssembly interpreter, has a `.compile()`
+method. This method makes sure that no global state is kept between runs (by
+clearing CAFs), and that the memory does not get corrupted if compilation
+fails:
+
+```js
+clean_compiler.compile = function () {
+ clean_compiler.clear_cafs();
+ c_compiler.clean_stdio_open = false;
+
+ var res = 0;
+ try {
+ /* Interpret from the main entry point */
+ clean_compiler.interpreter.instance.exports.set_pc(clean_compiler.start);
+ clean_compiler.interpreter.instance.exports.interpret();
+ res = c_compiler.clean_return_code;
+ } catch (e) {
+ console.warn(e);
+ res = e;
+ /* Memory may be badly corrupted, create a new instance to be sure */
+ create_clean_compiler();
+ }
+
+ /* Clear output buffers */
+ if (sandbox.stdout_buffer.length > 0)
+ sandbox.stdout('\n');
+ if (sandbox.stderr_buffer.length > 0)
+ sandbox.stderr('\n');
+
+ return res;
+};
+```
+
+The Clean wrapper also checks the exit code of the process, and catches any
+errors:
+
+```clean
+ do_compile args w
+ # (_,args,argv,w) = copy_argv args c_compiler w
+ # (res_or_err,w) = (clean_compiler .# "compile" .$ ()) w
+ # w = free_args_and_argv args argv c_compiler w
+ # res = jsValToInt res_or_err
+ | isJust res
+ = (fromJust res==0, w)
+ # (err,w) = (res_or_err .# "toString" .$? ()) ("Compiler crashed.", w)
+ = (False, feedback False err w)
+```
+
+The make tool itself is now relatively simple. It keeps a queue of modules to
+be compiled. It takes a module from the queue, compiles it, and checks the
+generated code for any dependencies. These dependencies are added to the queue,
+if they were not compiled yet. When the queue is emptied, the same is done for
+bytecode generation. Finally, the generated code is linked into a single
+bytecode file and a prelinked bytecode file.
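+
+The queue can be sketched as follows. The real tool is written in Clean; this
+JavaScript version only illustrates the idea, and `compile` is a hypothetical
+callback that compiles one module and returns the names of its dependencies:
+
+```javascript
+// Compile `main` and, transitively, everything it depends on. Each module is
+// compiled once; the return value is the order of compilation.
+function compileAll(main, compile) {
+  const queue = [main];
+  const seen = new Set(queue);
+  const order = [];
+  while (queue.length > 0) {
+    const mod = queue.shift();
+    order.push(mod);
+    for (const dep of compile(mod)) {
+      if (!seen.has(dep)) { // not compiled or queued yet
+        seen.add(dep);
+        queue.push(dep);
+      }
+    }
+  }
+  return order;
+}
+```
+
+Bytecode generation would then run a second pass of the same shape over the
+generated ABC files.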
+
+*That's all about the pipeline. You can continue reading about [how it's all
+put together][integration].*
+
+[abc-interpreter]: https://gitlab.com/clean-and-itasks/abc-interpreter/
+[Clean]: http://clean.cs.ru.nl/
+[Clean Sandbox]: https://camilstaps.gitlab.io/clean-sandbox/
+[Emscripten]: https://emscripten.org/
+[WebAssembly]: https://webassembly.org/
+
+[introduction]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-1-introduction.html
+[pipeline]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.html
+[integration]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-3-putting-it-all-together.html