TPC-DS in the browser - all 99 queries on the real Spark engine

This demo runs all 99 official TPC-DS queries against a genuine TPC-DS dataset, using the real spark-rust Spark SQL engine compiled to WebAssembly -- parse, plan, optimize, execute -- entirely inside the browser tab. No server, no backend, no data leaving the sandbox.

It's the multi-table, multi-query sibling of ../browser (which is a single-query scratchpad). Where that demo proves "Spark dialect in a tab", this one proves the engine is complete enough to run an industry-standard analytical benchmark end to end.

+---------------------------- browser tab -----------------------------+
|  app.js                                  spark-wasm (WebAssembly)    |
|  +--------------------+  registerIpc()   +------------------------+  |
|  | 99-query gallery   | ---------------> | SparkEngine            |  |
|  | SQL viewer         |                  |  * 24 TPC-DS MemTables |  |
|  | result table       |  runJson(sql)    |  * 391 Spark UDFs      |  |
|  | "Run all 99 >"     | <--------------- |  * DataFusion executor |  |
|  +--------------------+   JSON rows      +------------------------+  |
|         ^ fetch + gunzip (DecompressionStream)                       |
|   data/*.arrow.gz  (24 tables, 3.6 MB gzipped Arrow IPC)             |
+----------------------------------------------------------------------+
        the network is touched twice, both from THIS origin:
        once for the engine .wasm, once for the gzipped fixtures.

Quick start

bash build.sh          # compile wasm + generate/validate/stage fixtures (~10 min cold)
bash serve.sh          # serve on http://localhost:8000/  (HOST=0.0.0.0 by default)
# open the URL, click "Run all 99 >"

Headless end-to-end check (no browser needed -- drives the same wasm):

node smoke.mjs         # representative subset, asserts row-count parity
node smoke.mjs --all   # all 99 queries vs the native reference

How the data problem is solved

TPC-DS is a big-schema benchmark: 24 tables, 7 of them facts. The naive "generate at a tiny scale factor and ship it" approach fails because three dimension tables are fixed-size at every scale:

tablerows at any SFafter FK-pruning (sf=0.01)
customer_demographics1,920,8005,379
time_dim86,4004,429
date_dim73,0492,779

The generator (gen-tpcds-wasm-fixtures) dsdgen's at sf=0.01, then FK-prunes each fixed dimension down to only the rows reachable from the (tiny) generated fact tables. A dimension row that no fact references can never affect any join result, so dropping it is loss-free. That collapses the working set from hundreds of MB to 21.7 MB of typed Arrow IPC, which gzips to 3.6 MB.

Data ships as Arrow IPC (typed), not JSON, so DATE/DECIMAL columns survive -- many TPC-DS queries filter on d_date BETWEEN cast('...' AS date), which JSON's stringly-typed coercion would break. The browser inflates the .arrow.gz files with the native DecompressionStream API (zero JS deps) and hands the raw IPC bytes to the engine.

Why some queries return 0 rows

At sf=0.01 the data is correct but sparse -- 27 of the 99 queries have predicates selective enough that the tiny sample yields an empty result. That's the dataset being small, not the engine being wrong: the UI shows the native reference row count next to each result, and "Run all 99" reports 72 with rows / 27 empty / 0 errored. (sf=0.1 reduces empties to 18 but balloons to 191 MB raw -- too heavy for a tab, so sf=0.01 is the shipped default.)

Architecture: the persistent SparkEngine

The engine exposes a SparkEngine class (see crates/spark-wasm/src/wasm.rs) that registers tables once and runs many queries against the live context -- the right primitive for a 99-query workload. The one-shot run_sql_json_multi re-registers every table on every call, which would be pathological here.

const engine = new SparkEngine();
for (const t of TABLES) engine.registerIpc(t, await fetchInflate(`./data/${t}.arrow.gz`));
const rows = JSON.parse(engine.runJson(sql));   // run any of the 99

Two real wasm engine bugs this demo surfaced

End-to-end browser testing caught two execution bugs that the native test harness (which has a full tokio runtime) masked:

  1. Parallel-plan panic. DataFusion parallelizes across target_partitions (>= CPU count) by default, spawning work onto a tokio::task::JoinSet. wasm32 is single-threaded with no such runtime, so any repartitioning plan trapped. Fixed by forcing target_partitions = 1 in build_context -- also the correct model for single-threaded wasm (mirrors DuckDB-WASM).
  2. Missing reactor. Even single-partition, window / grouping-sets / multi-join plans still spawn onto a JoinSet, which needs a runtime context. The old block_on used futures::executor (no context) -> 10 of the 99 queries trapped with unreachable. Fixed by driving execution on a tokio current-thread runtime (the rt feature is wasm-clean -- no mio/socket reactor), which installs the context spawned tasks need.

After both fixes: 99/99 queries execute, 0 errored, exact row-count parity with the native reference across the whole suite.

Files

filerole
index.htmlUI: query gallery, run-all progress, SQL viewer, result table
app.jsloads engine, inflates + registers fixtures, runs queries
build.shSTAGE=wasm compiles the engine; STAGE=data generates fixtures
serve.shstatic HTTP server (python3)
smoke.mjsheadless Node E2E -- the wasm path, asserted against the native ref
pkg/wasm-bindgen output (gitignored -- build.sh regenerates)
data/staged *.arrow.gz + queries.json + manifest.json (gitignored)

The fixtures and query catalog are produced by two native bins in crates/spark-benchmarks:

Privacy

The engine binary and the fixtures are fetched once, from this origin. Every query runs inside the wasm sandbox; results are computed in linear memory and rendered locally. Your queries and the data never leave the tab -- the same capability-based guarantee described in ~/dev-macrohard/jaaiworks/WASM-DUCKDB.md.

← back to the demo