From bbc049b5c7749b8f9c4e1ade9cf1da4f91f37c0d Mon Sep 17 00:00:00 2001
From: Camil Staps
Date: Wed, 23 Jun 2021 19:20:29 +0200
Subject: Add article series on the clean sandbox

---
 ...browser-with-webassembly-part-2-the-pipeline.md | 328 +++++++++++++++++++++
 1 file changed, 328 insertions(+)
 create mode 100644 resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md

(limited to 'resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md')

diff --git a/resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md b/resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md
new file mode 100644
index 0000000..b766560
--- /dev/null
+++ b/resources/md/2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.md
@@ -0,0 +1,328 @@
+*This is part 2 in a series on running the [Clean][] compiler in
+[WebAssembly][], with the proof of concept in the [Clean Sandbox][]. In this
+part I discuss the compilation pipeline and the program used to rebuild
+generated files. See the [introduction][] for a high-level overview.*
+
+[[toc]]
+
+# The compilation pipeline
+
+When we compile a Clean program to a native executable, we do the following:
+
+1. For each Clean file, generate intermediate ABC code.
+2. For each ABC file, generate machine code.
+3. Link the machine code together with the system linker into an executable.
+
+When we compile for interpretation, this looks a little different:
+
+1. For each Clean file, generate intermediate ABC code.
+2. For each ABC file, generate ABC bytecode.
+
+   Normal ABC code is human-readable, whereas ABC bytecode is easier to parse.
+   Furthermore, the ABC bytecode defines some custom instructions to speed up
+   interpretation, such as frequent instruction combinations (e.g. a `pop`
+   followed by a `return`).
+3. Link the ABC bytecode together with the ABC linker.
+4. Prelink the ABC bytecode for use in the WebAssembly interpreter.
+
+   Prelinking removes symbol names and relocations, assuming the code segment
+   starts at address 0. This is possible because the WebAssembly memory always
+   starts at address 0. This way, the WebAssembly interpreter does not need to
+   deal with relocations.
+
+For step 1, we use the Clean compiler, written in Clean and C. The other tools
+are maintained in the [abc-interpreter][] repository, and are written in C. We
+compile the compiler frontend to prelinked bytecode, and the C tools to
+WebAssembly using [Emscripten][].
+
+# The file system
+
+Because the Clean compilation pipeline expects to read and write files (rather
+than `stdin` and `stdout`), we need to set up a file system. Luckily,
+[Emscripten][] already includes one with the
+[`FS` library](https://emscripten.org/docs/api_reference/Filesystem-API.html).
+I use the `FS` library for file access, and the `IDBFS` backend, which is based
+on IndexedDB, for persistent storage.
+
+Unfortunately, it is [currently not possible to let the different C tools (code
+generator, linker, etc.) share the same file
+system](https://groups.google.com/g/emscripten-discuss/c/_K61fo-9oKY/m/4m6LYrcPo5sJ).
+To circumvent this problem, I link the file system of each tool to the same
+IndexedDB database. When switching from one tool to another (e.g., going from
+the code generator to the linker), I sync the file systems with IndexedDB. This
+incurs some overhead, but makes sure that the contents are the same everywhere.
+
+There is a `make`-like tool which checks which files need to be rebuilt based
+on timestamps. This tool is written in Clean, compiled to prelinked bytecode,
+and runs in the WebAssembly interpreter.
+
+# Building the C tools
+
+Each C tool is built to its own JavaScript and WebAssembly source using
+[Emscripten][]. This is similar to building these tools to a native executable,
+except that we use `emcc` instead of `gcc` and add some other options.
+For example, the bytecode generator is normally built like this:
+
+```make
+CLIBS:=-lm
+
+BCGEN:=bcgen
+
+SRC_BCGEN:=abc_instructions.c bcgen.c bcgen_instructions.c bcgen_instruction_table.c bytecode.c parse_abc.c util.c
+DEP_BCGEN:=$(subst .c,.h,$(SRC_BCGEN)) settings.h
+
+$(BCGEN): $(SRC_BCGEN) $(DEP_BCGEN)
+	$(CC) $(CFLAGS) $(SRC_BCGEN) $(CLIBS) -DBCGEN -o $@
+```
+
+With Emscripten we use:
+
+```make
+EMCC:=emcc
+
+JS_BCGEN:=bcgen.js
+
+$(JS_BCGEN): $(SRC_BCGEN)
+	$(EMCC) -O2 -DPOSIX -DBCGEN\
+		-s ENVIRONMENT=web\
+		-s INVOKE_RUN=0\
+		-s MODULARIZE -s EXPORT_NAME=bcgen\
+		-lidbfs.js -s FORCE_FILESYSTEM=1 -s EXPORTED_RUNTIME_METHODS=[FS,IDBFS]\
+		$^ -o $@
+```
+
+This compiles the C sources to WebAssembly and generates a file `bcgen.js` that
+acts as a wrapper around it. Loading `bcgen.js` will then define the function
+`bcgen`, which returns a promise resolving to an Emscripten module with the
+property `_main`, corresponding to the C function `main`.
+
+This sounds complicated, but we can instantiate the Emscripten modules like
+this:
+
+```html
+
+
+
+```
+
+```js
+var c_bcgen, c_bclink, c_bcprelink;
+Promise.all([
+	bcgen().then(instance => c_bcgen = instance),
+	bclink().then(instance => c_bclink = instance),
+	bcprelink().then(instance => c_bcprelink = instance),
+]);
+```
+
+# Building the compiler
+
+The Clean frontend of the compiler is relatively easy to build. We take the
+normal project file (Clean's counterpart of a `Makefile`) and use `cpm`
+(Clean's counterpart of `make`) to build the prelinked bytecode.
+
+The C backend is a little trickier. We use the same `make` recipe as above.
+However, the compiler expects some global functions to be defined. These
+functions are normally provided by the native Clean run-time system or a
+supporting C library, but are not present in the interpreter. We need to
+define:
+
+- `set_return_code` which takes an `int` and sets the process exit code.
+- `ArgEnvCopyCStringToCleanStringC` which takes a null-terminated C string and
+  builds a Clean string from it (which is not null-terminated but is prefixed
+  with its length in bytes).
+- `ArgEnvGetCommandLineArgumentC` which takes an `int` *n* and returns a
+  pointer to a C string with the *n*th command-line argument.
+- `ArgEnvGetCommandLineCountC` which returns the number of command-line
+  arguments (`argc`).
+
+To implement these functions, we make use of custom properties on the
+Emscripten module: `clean_argc`, `clean_argv`, and `clean_return_code`. The
+JavaScript looks like this:
+
+```js
+var c_compiler;
+compiler().then(instance => {
+	c_compiler = instance;
+
+	c_compiler._ArgEnvGetCommandLineCountC = () => c_compiler.clean_argc;
+
+	c_compiler._ArgEnvGetCommandLineArgumentC = (i, sizep, sp) => {
+		const arg = c_compiler.HEAP32[c_compiler.clean_argv/4+i];
+		var p = arg;
+		for (; c_compiler.HEAPU8[p] != 0; p++); // find the terminating NUL byte
+		c_compiler.HEAP32[sizep/4] = p-arg; // length of the argument
+		c_compiler.HEAP32[sp/4] = arg; // pointer to the argument
+	};
+
+	c_compiler._ArgEnvCopyCStringToCleanStringC = (cs, cleans) => {
+		const size = c_compiler.HEAP32[cleans/4]; // the length comes first
+		cleans += 4;
+		for (var i = 0; i < size; i++)
+			c_compiler.HEAPU8[cleans++] = c_compiler.HEAPU8[cs++];
+	};
+	c_compiler._ArgEnvCopyCStringToCleanStringC.sync_strings_back_to_clean = true;
+
+	c_compiler._set_return_code = (c) => {c_compiler.clean_return_code = c;};
+});
+```
+
+# Preparing the file system
+
+With all the tools set up, we need to create the main directories in the file
+system. We do this by creating them in the file system of `c_compiler`, and
+then syncing the other file systems.
+
+```js
+c_compiler.FS.mkdir('/clean');
+c_compiler.FS.mount(c_compiler.IDBFS, {}, '/clean');
+c_compiler.FS.mkdir('/clean/lib');
+c_compiler.FS.mkdir('/clean/src');
+
+const mount = backend => new Promise((resolve, reject) => {
+	backend.FS.mkdir('/clean');
+	backend.FS.mount(backend.IDBFS, {}, '/clean');
+	backend.FS.syncfs(true, err => { // true: load the files from IndexedDB
+		if (err)
+			return reject(err);
+		backend.FS.chdir('/clean/src');
+		resolve();
+	});
+});
+
+return new Promise(resolve => {
+	c_compiler.FS.syncfs(false, () => { // false: write the files to IndexedDB
+		Promise.all([
+			mount(c_bcgen),
+			mount(c_bclink),
+			mount(c_bcprelink),
+		]).then(resolve);
+	});
+});
+```
+
+In the real sandbox, we also add some standard libraries to the file system on
+start-up.
+
+# The `make`-like tool
+
+I had to build a simple `make`-like tool to call the compiler and regenerate
+files when needed. This tool is a separate Clean program, which is compiled to
+prelinked bytecode and interpreted in WebAssembly.
+
+The following snippet sets everything up: it sets the global variable
+`sandbox.compile` to a Clean function that builds the sandbox project. In the
+[last part][integration] I show how everything is put together; here I want to
+focus on the implementation.
+
+`wrapInitFunction` makes this a Clean program that can run in the browser. The
+`me` parameter can be ignored here. The `compile` function calls `make`, which
+contains the actual logic. `make` is a higher-order function which takes
+functions for compilation, code generation, etc., as parameters. Here, these
+are provided as `do_compile`, `call_main c_bcgen`, etc.
+
+```clean
+Start = wrapInitFunction start
+
+sandbox :== jsGlobal "sandbox"
+c_compiler :== jsGlobal "c_compiler"
+clean_compiler :== jsGlobal "clean_compiler"
+c_bcgen :== jsGlobal "c_bcgen"
+c_bclink :== jsGlobal "c_bclink"
+c_bcprelink :== jsGlobal "c_bcprelink"
+
+start :: !JSVal !*JSWorld -> *JSWorld
+start me w
+	# (cb,w) = jsWrapFun (compile me) me w
+	# w = (sandbox .# "compile" .= cb) w
+	= w
+where
+	compile me {[0]=paths,[1]=main,[2]=callback} w
+		# (paths,w) = jsValToList` paths (fromJS "") w
+		= make
+			/* .. */
+			do_compile
+			(call_main c_bcgen) (call_main c_bclink) (call_main c_bcprelink)
+			/* .. */
+			paths (fromJS "" main)
+			w
+```
+
+We can call the C tools relatively easily, by copying `argv` into the memory of
+the process, setting `argc`, and calling `main`. Afterwards we need to free the
+memory:
+
+```clean
+	call_main backend args w
+		# (argc,args,argv,w) = copy_argv args backend w
+		# (res,w) = (backend .# "_main" .$? (argc, argv)) (-1, w)
+		= (res == 0, free_args_and_argv args argv backend w)
+```
+
+Compilation is a little trickier, because the compiler is a Clean program. For
+this, the `clean_compiler` object, which is a WebAssembly interpreter, has a
+`.compile()` method.
+This method makes sure that no global state is kept between runs (by clearing
+CAFs), and that a failed compilation does not leave corrupted memory behind:
+
+```js
+clean_compiler.compile = function () {
+	clean_compiler.clear_cafs();
+	c_compiler.clean_stdio_open = false;
+
+	var res = 0;
+	try {
+		/* Interpret from the main entry point */
+		clean_compiler.interpreter.instance.exports.set_pc(clean_compiler.start);
+		clean_compiler.interpreter.instance.exports.interpret();
+		res = c_compiler.clean_return_code;
+	} catch (e) {
+		console.warn(e);
+		res = e;
+		/* Memory may be badly corrupted, create a new instance to be sure */
+		create_clean_compiler();
+	}
+
+	/* Clear output buffers */
+	if (sandbox.stdout_buffer.length > 0)
+		sandbox.stdout('\n');
+	if (sandbox.stderr_buffer.length > 0)
+		sandbox.stderr('\n');
+
+	return res;
+};
+```
+
+The Clean wrapper also checks the exit code of the process, and catches any
+errors:
+
+```clean
+	do_compile args w
+		# (_,args,argv,w) = copy_argv args c_compiler w
+		# (res_or_err,w) = (clean_compiler .# "compile" .$ ()) w
+		# w = free_args_and_argv args argv c_compiler w
+		# res = jsValToInt res_or_err
+		| isJust res
+			= (fromJust res==0, w)
+		# (err,w) = (res_or_err .# "toString" .$? ()) ("Compiler crashed.", w)
+		= (False, feedback False err w)
+```
+
+The `make` tool itself is now relatively simple. It keeps a queue of modules
+to be compiled. It takes a module from the queue, compiles it, and checks the
+generated code for any dependencies. These dependencies are added to the queue
+if they have not been compiled yet. When the queue is empty, the same is done
+for bytecode generation. Finally, the generated code is linked into a single
+bytecode file and a prelinked bytecode file.
+
+*That's all about the pipeline.
+You can continue reading about [how it's all put together][integration].*
+
+[abc-interpreter]: https://gitlab.com/clean-and-itasks/abc-interpreter/
+[Clean]: http://clean.cs.ru.nl/
+[Clean Sandbox]: https://camilstaps.gitlab.io/clean-sandbox/
+[Emscripten]: https://emscripten.org/
+[WebAssembly]: https://webassembly.org/
+
+[introduction]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-1-introduction.html
+[pipeline]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-2-the-pipeline.html
+[integration]: 2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-3-putting-it-all-together.html
--
cgit v1.2.3