Compiling a database to WebAssembly: what broke
TL;DR
SlothDB is a C++20 analytical database. Over a week I ported it to WebAssembly via Emscripten. Final build: 1.3 MB wasm + 97 KB JS, boots in under a second, runs a 1,000-row GROUP BY in under 5 ms. The port itself was straightforward. The four things that ate real time were threads that compile but fail at runtime, exceptions that have to be enabled twice, HTTP being a complete fiction in the browser, and a duplicated CodeMirror module that cost two hours to diagnose. A stale wasm cache added one more surprise after the first deploy. Full working playground at slothdb.org/playground.
Why bother?
SlothDB's pitch is SQL where your files live — you point at a CSV or Parquet file on disk and query it. That works fine from the CLI and from Python, but the 2 minutes between "read a Hacker News post" and "install a CLI to try it" is where most projects lose readers. A URL that runs the same engine in the browser, against a file you drag-drop in, closes that loop.
There is a second reason. Compiling an analytical database to WebAssembly is a reasonable litmus test for the quality of its portability layer. If the code is riddled with #ifdef _WIN32, threads spawned via platform APIs, and syscalls baked into hot paths, porting to WASM will surface all of it in one go.
Setup (the easy part)
Installing emsdk and pointing CMake at it is a five-minute job:
git clone --depth 1 https://github.com/emscripten-core/emsdk.git
./emsdk/emsdk.bat install latest
./emsdk/emsdk.bat activate latest
source emsdk/emsdk_env.sh
emcmake cmake -S . -B build-wasm \
  -DCMAKE_BUILD_TYPE=Release \
  -DSLOTHDB_BUILD_TESTS=OFF \
  -DSLOTHDB_BUILD_SHELL=OFF
emmake cmake --build build-wasm --target slothdb_wasm
The slothdb_wasm target is a new CMake executable that links slothdb_lib with a thin embind wrapper exposing three functions: openDatabase(), runQuery(sql), and version(). The wrapper returns JSON strings so the JS side never has to touch C++ types.
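A wrapper of this shape can be sketched as follows. This is a minimal illustration, not SlothDB's actual source: the function bodies, the JSON layout, the json_escape helper, and the module name slothdb_sketch are all assumptions. Only the three exported names come from the description above. The embind part is guarded so the file also compiles natively.

```cpp
// Sketch of a thin wrapper; bodies are placeholders, not SlothDB's real code.
#include <string>

#ifdef __EMSCRIPTEN__
#include <emscripten/bind.h>
#endif

namespace {

bool g_open = false;  // hypothetical stand-in for a real database handle

// Escape quotes, backslashes and newlines (enough for this sketch).
std::string json_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            default:   out += c;
        }
    }
    return out;
}

}  // namespace

bool openDatabase() {
    g_open = true;
    return g_open;
}

std::string version() { return "0.0.0-sketch"; }

// Always returns a JSON string, so the JS side only needs JSON.parse().
std::string runQuery(const std::string& sql) {
    if (!g_open) return R"({"ok":false,"error":"database not open"})";
    // A real wrapper would call the engine here; this one echoes the SQL.
    return "{\"ok\":true,\"echo\":\"" + json_escape(sql) + "\"}";
}

#ifdef __EMSCRIPTEN__
// Expose the three free functions to JavaScript via embind.
EMSCRIPTEN_BINDINGS(slothdb_sketch) {
    emscripten::function("openDatabase", &openDatabase);
    emscripten::function("runQuery", &runQuery);
    emscripten::function("version", &version);
}
#endif
```

Returning strings keeps the binding surface to plain value types, at the cost of a JSON parse on every result.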
The first emmake invocation linked cleanly on the very first try. I was briefly optimistic.
Problem 1: threads that compile but don't exist
SlothDB uses std::thread in four places: the thread pool, parallel CSV parsing, parallel JSON parsing, and the Parquet row-group decoder. Without -pthread, Emscripten still compiles <thread> — std::thread::hardware_concurrency() returns a fixed value, std::thread's constructor exists. At runtime, instantiating one aborts the WASM instance.
Two ways to fix this, both ugly:
- Link with -pthread. This enables real multi-threading via Web Workers + SharedArrayBuffer. The price is that every page serving the wasm must send Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp. Simple static hosts (Python's http.server, the default GitHub Pages config) don't send these, and the wasm refuses to instantiate. Local development becomes a chore.
- Compile without -pthread and guard every std::thread use. No COOP/COEP requirement, but every hot-path parallel site needs an #ifdef __EMSCRIPTEN__ serial fallback.
I ran into a stroke of luck here: most of SlothDB's threading paths already had an if (nt <= 1) serial-fallback, written defensively for single-core machines. Forcing hardware_concurrency() to return 1 under Emscripten dodged most spawn sites for free. The parallel Parquet consumer-mode pool didn't have a fallback and needed three lines added. Total threading-specific code: about 20 lines of preprocessor guards.
The cost: browser-side queries are single-threaded. On 1,000 rows this is invisible. On a million rows you'd notice. For a playground it's the correct call.
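The combination of a clamped thread count and an nt <= 1 fallback might look roughly like the sketch below. The names effective_threads and chunked_sum are hypothetical, and the summation is only a stand-in for real work such as CSV parsing; SlothDB's own thread pool is not shown in this post.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Worker count the engine should plan for. Under Emscripten without
// -pthread, constructing a std::thread fails at runtime, so report 1.
inline unsigned effective_threads() {
#ifdef __EMSCRIPTEN__
    return 1;
#else
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;  // 0 means "unknown"
#endif
}

// Sum a vector in nt chunks. With nt <= 1 no std::thread is constructed.
long long chunked_sum(const std::vector<int>& v, unsigned nt) {
    if (nt <= 1 || v.size() < nt) {
        return std::accumulate(v.begin(), v.end(), 0LL);  // serial fallback
    }
    std::vector<long long> partial(nt, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = v.size() / nt;
    for (unsigned i = 0; i < nt; ++i) {
        const std::size_t lo = i * chunk;
        const std::size_t hi = (i + 1 == nt) ? v.size() : lo + chunk;
        // Each worker writes only its own slot, so no locking is needed.
        workers.emplace_back([&v, &partial, i, lo, hi] {
            partial[i] = std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

A call site would pass effective_threads() as nt, so the WASM build takes the serial branch without any further guards at that site.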
Problem 2: exceptions that compile but don't catch
SlothDB's C API wraps every internal call in try { ... } catch (const std::exception &e) { ... } and returns an error code. Straightforward. On the first wasm build, a query like SELECT * FROM not_a_table — which should return a clean "Table not found" string — instead threw:
CppException { excPtr: 5311112 }
Aborted(CppException)
Emscripten ships with C++ exception catching disabled by default as a code-size optimization. The binary can still throw. It just can't catch. The exception walks past every try block and aborts the program.
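For reference, the boundary pattern in question looks roughly like this. It is a sketch with hypothetical names (lookup_table, sketch_lookup), not SlothDB's real C API. Built natively, the catch block runs; built with Emscripten's default settings, the throw would abort before reaching it.

```cpp
#include <cstddef>
#include <cstdio>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical internal call that throws, standing in for the real binder.
static void lookup_table(const std::string& name) {
    if (name != "known_table") {
        throw std::runtime_error("Table not found: " + name);
    }
}

// C-style boundary: never lets an exception escape. Returns 0 on success,
// 1 on failure, and copies the message into err when a buffer is supplied.
extern "C" int sketch_lookup(const char* table, char* err, std::size_t err_len) {
    try {
        lookup_table(table);
        return 0;
    } catch (const std::exception& e) {
        if (err != nullptr && err_len > 0) {
            std::snprintf(err, err_len, "%s", e.what());
        }
        return 1;
    }
}
```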
The fix requires committing twice — once at compile, once at link:
target_compile_options(slothdb_lib PRIVATE "-fexceptions")
target_link_options(slothdb_lib INTERFACE "-fexceptions")
target_compile_options(slothdb_wasm PRIVATE "-fexceptions")
target_link_options(slothdb_wasm PRIVATE "-fexceptions")
Miss either, and you get the same abort. There's no warning; you just get a runtime crash on the first thrown exception. I lost an hour before checking the Emscripten docs on this.
If any target in the build is compiled or linked without -fexceptions, the final link silently wires in the "abort on throw" runtime. Add -fexceptions globally or not at all.
Problem 3: HTTP doesn't exist
SlothDB added SELECT * FROM 'https://…' support in v0.1.4. The implementation uses WinHTTP on Windows and raw POSIX sockets on Linux/macOS. Both are meaningless in a browser: <windows.h> doesn't exist under Emscripten, and <sys/socket.h> compiles but the sockets can only speak to a websockify-style proxy, not arbitrary HTTPS endpoints.
The fix was a single preprocessor guard in src/storage/http_client.cpp:
#ifdef _MSC_VER
// Windows: WinHTTP.
#include <winhttp.h>
#elif defined(__EMSCRIPTEN__)
// WASM: HTTP not supported. Files come from MEMFS.
#else
// POSIX: raw sockets.
#include <sys/socket.h>
#endif
…plus stubbing HTTPClient::Get() to return a friendly error. The playground handles file input through the browser's <input type="file"> and writes the bytes directly into Emscripten's MEMFS at /data/<name>. The rest of the database code reads via fopen / fread, which MEMFS emulates natively. Zero changes to the 12 storage readers.
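A stub along those lines could be sketched as below. The HttpResult struct, the http_get_sketch name, and the error wording are assumptions for illustration; the real HTTPClient::Get() signature is not shown in this post. The supported parameter exists only so the WASM branch can be exercised in a native build.

```cpp
#include <string>

// Hypothetical result type for this sketch.
struct HttpResult {
    bool ok;
    std::string body;
    std::string error;
};

#if defined(__EMSCRIPTEN__)
constexpr bool kHttpSupported = false;  // no usable sockets in the browser
#else
constexpr bool kHttpSupported = true;
#endif

// Returns a readable error instead of attempting a request when the
// platform has no transport.
HttpResult http_get_sketch(const std::string& url,
                           bool supported = kHttpSupported) {
    if (!supported) {
        return {false, "",
                "HTTP is not available in the WebAssembly build; "
                "load the file through the playground instead (requested: " +
                    url + ")"};
    }
    // The native transport (WinHTTP or sockets) is omitted from this sketch.
    return {false, "", "native transport omitted from this sketch"};
}
```

Returning an error value rather than throwing keeps the failure visible as a normal query error on the JS side.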
Problem 4: two copies of @codemirror/state
The playground's SQL editor uses CodeMirror 6. CodeMirror's architecture is explicitly plugin-based — every extension is a tiny npm package that shares identity checks on EditorState and EditorView classes.
The obvious wiring is six ESM imports from esm.sh:
import { EditorView, basicSetup } from 'https://esm.sh/codemirror@6.0.1';
import { keymap } from 'https://esm.sh/@codemirror/view@6.32.0';
import { EditorState } from 'https://esm.sh/@codemirror/state@6.4.1';
import { sql } from 'https://esm.sh/@codemirror/lang-sql@6.8.0';
// …
This failed with:
Error: Unrecognized extension value in extension set ([object Object]).
This sometimes happens because multiple instances of @codemirror/state
are loaded, breaking instanceof checks.
esm.sh was serving two different copies of @codemirror/state — one pulled in by codemirror@6.0.1's transitive dependency graph, another pinned directly via my import. Each version hashes into a different URL path. Modules load independently. EditorState from copy A is not EditorState from copy B, and CodeMirror's instanceof guards reject extensions that don't share its singleton.
I tried esm.sh's ?deps=@codemirror/state@6.5.2 URL parameter, bumped versions, matched semvers. Each attempt got closer but never fully deduped. Two hours in, I gave up on the CDN and bundled locally with esbuild:
mkdir /tmp/cm-bundle && cd /tmp/cm-bundle
cat > entry.js <<'EOF'
export { EditorView, basicSetup } from 'codemirror';
export { keymap } from '@codemirror/view';
export { EditorState } from '@codemirror/state';
export { sql } from '@codemirror/lang-sql';
export { oneDark } from '@codemirror/theme-one-dark';
EOF
npm install codemirror@6.0.1 @codemirror/view@6.41.1 \
  @codemirror/state@6.6.0 @codemirror/lang-sql@6.8.0 \
  @codemirror/theme-one-dark@6.1.2 esbuild
npx esbuild entry.js --bundle --format=esm --minify \
  --outfile=docs/playground/vendor/cm.js
Result: one 429 KB file, one @codemirror/state, zero identity checks failing. Checked into git. The playground imports ./vendor/cm.js and nothing else.
CDN ESM is fine until it isn't. When it fails, it fails at runtime with cryptic errors. Bundling the exact dependency graph you tested is five minutes of work and pays itself back the first time you ship.
The 90 KB cache-bust
After the first deploy I pushed a bug fix, redeployed, and the old bug was still there. Hard-refresh: fixed. Close the tab, reopen: broken again. The browser was serving slothdb.wasm from disk cache.
Standard cache-bust is a query string: slothdb.wasm?v=20260421-2. The catch is that Emscripten's generated JS computes the wasm URL itself, using new URL("slothdb.wasm", import.meta.url). Appending ?v=... to the JS import's URL flows into import.meta.url but not into the wasm path — the wasm stays cached.
The fix is the locateFile hook that Emscripten gives you precisely for this:
const mod = await createSlothDB({
  locateFile: (path) => path + '?v=' + BUILD_VERSION,
});
Bump BUILD_VERSION on every push, and all five assets — HTML, CSS, two JS files, and the wasm — refetch. One line.
What I did not have to do
A partial list of things that worked out of the box — which is the more interesting half of the story:
- File I/O. The FileHandle abstraction uses pure fopen / fread / fseeko. MEMFS emulates all of them. Every one of SlothDB's seven readers (CSV, Parquet, JSON, Avro, Arrow IPC, SQLite, Excel) works unmodified.
- mmap. Parquet and JSON readers use mmap for large files. Emscripten implements mmap as a memory-copy backed by read. Slower than native, but transparent to the reader code.
- Parser, binder, planner, executor. No platform-specific code. Zero changes.
- The C API. The stable C ABI is what embind bound to; the JS surface is 30 lines.
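To make the file I/O point concrete, here is a minimal sketch of a stdio-only handle. The class name FileHandleSketch and its read_at method are hypothetical; SlothDB's real FileHandle is not reproduced here. Because it touches nothing beyond fopen, fseeko, and fread, the same code should behave the same whether the path resolves to a native disk or to MEMFS.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <sys/types.h>  // off_t

// Hypothetical minimal handle built only on stdio calls.
class FileHandleSketch {
public:
    explicit FileHandleSketch(const std::string& path)
        : fp_(std::fopen(path.c_str(), "rb")) {}
    ~FileHandleSketch() {
        if (fp_) std::fclose(fp_);
    }
    FileHandleSketch(const FileHandleSketch&) = delete;
    FileHandleSketch& operator=(const FileHandleSketch&) = delete;

    bool is_open() const { return fp_ != nullptr; }

    // Read up to n bytes starting at byte offset off; returns bytes read.
    std::size_t read_at(off_t off, void* dst, std::size_t n) {
        if (!fp_ || fseeko(fp_, off, SEEK_SET) != 0) return 0;
        return std::fread(dst, 1, n, fp_);
    }

private:
    std::FILE* fp_;
};
```

In the playground, a path such as /data/<name> would be opened the same way once the uploaded bytes have been written into MEMFS.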
Final numbers
| Artifact | Size | Notes |
|---|---|---|
| slothdb.wasm | 1.3 MB | -O3, -fexceptions, single-threaded |
| slothdb.js (loader) | 97 KB | Emscripten glue + embind |
| vendor/cm.js | 429 KB | CodeMirror 6 bundled |
| Demo CSV | 25 KB | 1,000 rows, 5 columns |
| Demo Parquet | 10 KB | Same data, SNAPPY |
| Cold boot | ~750 ms | Over local 100 Mbps |
| 1,000-row GROUP BY | ~3 ms | After warm-up |
The playground is at slothdb.org/playground. The source is on GitHub. Each fix described in this post landed on the main branch as its own commit; look for the Playground: prefix in the commit log.
The overall lesson
Compiling to WebAssembly isn't the hard part. The hard part is everything that silently assumes threads, exceptions, a network stack, and a filesystem — the browser reverses the defaults on all four. If your code is organized around platform-agnostic abstractions (stdio for file I/O, a C API boundary, preprocessor-guarded syscall usage), the port is a day. If it isn't, no amount of -sALLOW_MEMORY_GROWTH will save you; the browser will keep telling you, one cryptic abort at a time, exactly which assumption you made.
Try it: Open the playground → GitHub
SlothDB is an MIT-licensed embedded SQL database built by a solo maintainer. It reads Parquet, CSV, JSON, Avro, Arrow, SQLite, and Excel directly from SQL — 1.1×–8.6× faster than DuckDB on every benchmarked format. If this post helped you, the maintainer would appreciate a star.