TPC-DS in the browser - all 99 queries on the real Spark engine
This demo runs all 99 official TPC-DS queries against a genuine TPC-DS dataset, using the real spark-rust Spark SQL engine compiled to WebAssembly -- parse, plan, optimize, execute -- entirely inside the browser tab. No server, no backend, no data leaving the sandbox.
It's the multi-table, multi-query sibling of ../browser (which is a single-query scratchpad). Where that demo proves "Spark dialect in a tab", this one proves the engine is complete enough to run an industry-standard analytical benchmark end to end.
+---------------------------- browser tab -----------------------------+
| app.js spark-wasm (WebAssembly) |
| +--------------------+ registerIpc() +------------------------+ |
| | 99-query gallery | ---------------> | SparkEngine | |
| | SQL viewer | | * 24 TPC-DS MemTables | |
| | result table | runJson(sql) | * 391 Spark UDFs | |
| | "Run all 99 >" | <--------------- | * DataFusion executor | |
| +--------------------+ JSON rows +------------------------+ |
| ^ fetch + gunzip (DecompressionStream) |
| data/*.arrow.gz (24 tables, 3.6 MB gzipped Arrow IPC) |
+----------------------------------------------------------------------+
the network is touched twice, both from THIS origin:
once for the engine .wasm, once for the gzipped fixtures.
Quick start
bash build.sh # compile wasm + generate/validate/stage fixtures (~10 min cold)
bash serve.sh # serve on http://localhost:8000/ (HOST=0.0.0.0 by default)
# open the URL, click "Run all 99 >"
Headless end-to-end check (no browser needed -- drives the same wasm):
node smoke.mjs # representative subset, asserts row-count parity
node smoke.mjs --all # all 99 queries vs the native reference
How the data problem is solved
TPC-DS is a big-schema benchmark: 24 tables, 7 of them facts. The naive "generate at a tiny scale factor and ship it" approach fails because three dimension tables are fixed-size at every scale:
| table | rows at any SF | after FK-pruning (sf=0.01) |
|---|---|---|
customer_demographics | 1,920,800 | 5,379 |
time_dim | 86,400 | 4,429 |
date_dim | 73,049 | 2,779 |
The generator (gen-tpcds-wasm-fixtures) dsdgen's at sf=0.01, then FK-prunes each fixed dimension down to only the rows reachable from the (tiny) generated fact tables. A dimension row that no fact references can never affect any join result, so dropping it is loss-free. That collapses the working set from hundreds of MB to 21.7 MB of typed Arrow IPC, which gzips to 3.6 MB.
Data ships as Arrow IPC (typed), not JSON, so DATE/DECIMAL columns survive -- many TPC-DS queries filter on d_date BETWEEN cast('...' AS date), which JSON's stringly-typed coercion would break. The browser inflates the .arrow.gz files with the native DecompressionStream API (zero JS deps) and hands the raw IPC bytes to the engine.
Why some queries return 0 rows
At sf=0.01 the data is correct but sparse -- 27 of the 99 queries have predicates selective enough that the tiny sample yields an empty result. That's the dataset being small, not the engine being wrong: the UI shows the native reference row count next to each result, and "Run all 99" reports 72 with rows / 27 empty / 0 errored. (sf=0.1 reduces empties to 18 but balloons to 191 MB raw -- too heavy for a tab, so sf=0.01 is the shipped default.)
Architecture: the persistent SparkEngine
The engine exposes a SparkEngine class (see crates/spark-wasm/src/wasm.rs) that registers tables once and runs many queries against the live context -- the right primitive for a 99-query workload. The one-shot run_sql_json_multi re-registers every table on every call, which would be pathological here.
const engine = new SparkEngine();
for (const t of TABLES) engine.registerIpc(t, await fetchInflate(`./data/${t}.arrow.gz`));
const rows = JSON.parse(engine.runJson(sql)); // run any of the 99
Two real wasm engine bugs this demo surfaced
End-to-end browser testing caught two execution bugs that the native test harness (which has a full tokio runtime) masked:
- Parallel-plan panic. DataFusion parallelizes across
target_partitions(>= CPU count) by default, spawning work onto atokio::task::JoinSet. wasm32 is single-threaded with no such runtime, so any repartitioning plan trapped. Fixed by forcingtarget_partitions = 1inbuild_context-- also the correct model for single-threaded wasm (mirrors DuckDB-WASM). - Missing reactor. Even single-partition, window / grouping-sets / multi-join plans still
spawnonto aJoinSet, which needs a runtime context. The oldblock_onusedfutures::executor(no context) -> 10 of the 99 queries trapped withunreachable. Fixed by driving execution on a tokio current-thread runtime (thertfeature is wasm-clean -- no mio/socket reactor), which installs the context spawned tasks need.
After both fixes: 99/99 queries execute, 0 errored, exact row-count parity with the native reference across the whole suite.
Files
| file | role |
|---|---|
index.html | UI: query gallery, run-all progress, SQL viewer, result table |
app.js | loads engine, inflates + registers fixtures, runs queries |
build.sh | STAGE=wasm compiles the engine; STAGE=data generates fixtures |
serve.sh | static HTTP server (python3) |
smoke.mjs | headless Node E2E -- the wasm path, asserted against the native ref |
pkg/ | wasm-bindgen output (gitignored -- build.sh regenerates) |
data/ | staged *.arrow.gz + queries.json + manifest.json (gitignored) |
The fixtures and query catalog are produced by two native bins in crates/spark-benchmarks:
gen-tpcds-wasm-fixtures-- dsdgen + FK-prune + export parquet/IPC + manifest.validate-tpcds-wasm-- load the IPC fixtures into the realSparkEngine, run all 99 queries natively (the exact browser code path), and emitqueries.json(SQL + expected rows + table/feature tags) plus a pass-rate report. This is the source of truth the browser's "matches reference" check compares against.
Privacy
The engine binary and the fixtures are fetched once, from this origin. Every query runs inside the wasm sandbox; results are computed in linear memory and rendered locally. Your queries and the data never leave the tab -- the same capability-based guarantee described in ~/dev-macrohard/jaaiworks/WASM-DUCKDB.md.