TPC-DS — a guided tour · spark-rust-wasm

TPC-DS is the web-scale big-data decision-support benchmark: a 24-table retail data warehouse modeling a business that sells through three channels — physical stores, a catalog, and the web — stressed by 99 queries spanning reporting, ad-hoc analytics, OLAP roll-ups, and data mining. Where TPC-H models a traditional relational warehouse, TPC-DS was built for the modern big-data era: a far richer schema, deeply skewed data, and queries designed to push distributed query engines. It runs from sf=1 (≈1 GB) to sf=100000 (100 TB) and is the de-facto yardstick for Spark, BigQuery, Snowflake, and friends.

This demo runs all 99 queries against a genuine (browser-sized) TPC-DS dataset using the real spark-rust Spark SQL engine compiled to WebAssembly — entirely in your browser tab. This page explains the shape of the schema and what the queries are doing.

The schema: a multi-star, three-channel warehouse

TPC-DS is a galaxy (multi-star) schema. Each of the three sales channels has a matched pair of fact tables — a sales fact and a returns fact — plus inventory for the warehouse. All seven facts share a common pool of conformed dimensions (date_dim, item, customer, …). The diagram below collapses the shared dimensions into one cluster so the channel structure stays legible:

sales facts (the rows you aggregate) returns facts inventory shared dimensions

Schema · three sales channels over shared dimensions

flowchart TB
    subgraph STORE["🏬 Store channel"]
      direction TB
      SS["store_sales
fact"]
      SR["store_returns
fact"]
    end
    subgraph CATALOG["📖 Catalog channel"]
      direction TB
      CS["catalog_sales
fact"]
      CR["catalog_returns
fact"]
    end
    subgraph WEB["🌐 Web channel"]
      direction TB
      WS["web_sales
fact"]
      WR["web_returns
fact"]
    end
    INV["📦 inventory
item × warehouse × date"]
    DIMS["⭐ shared dimensions
date_dim · item · customer · customer_address
customer_demographics · household_demographics
promotion · store · catalog_page · web_site
warehouse · time_dim · … (17 total)"]

    SS --> DIMS
    SR --> DIMS
    CS --> DIMS
    CR --> DIMS
    WS --> DIMS
    WR --> DIMS
    INV --> DIMS
    SR -.->|references| SS
    CR -.->|references| CS
    WR -.->|references| WS

    classDef sales   fill:#e8590c,stroke:#ff8a4a,color:#fff,font-weight:600;
    classDef returns fill:#5a2a10,stroke:#8a4a22,color:#ffd9c2;
    classDef dim     fill:#222b38,stroke:#3a4658,color:#e6edf3;
    classDef inv     fill:#9a4a18,stroke:#c46a2a,color:#fff;
    class SS,CS,WS sales;
    class SR,CR,WR returns;
    class DIMS dim;
    class INV inv;

Each fact carries foreign keys (*_sk "surrogate keys") into the shared dimensions — e.g. store_sales.ss_item_sk → item.i_item_sk, ss_sold_date_sk → date_dim.d_date_sk. A returns fact references the same dimensions as its matching sales fact, which is why so many queries pair store_sales with store_returns on the same item/customer.

The 24 tables

7 fact tables

store_sales / store_returns
catalog_sales / catalog_returns
web_sales / web_returns
inventory (item × warehouse × date)

17 dimension tables

core: date_dim, time_dim, item, customer
who: customer_address, customer_demographics, household_demographics, income_band
where: store, call_center, catalog_page, web_site, web_page, warehouse
misc: promotion, reason, ship_mode

The "fixed-size mega-dimension" gotcha. Three dimensions are the same size at every scale factor: customer_demographics (1.92M rows), time_dim (86,400 — one per second of the day), and date_dim (~73K days). At small scale the demo FK-prunes these down to only the rows the facts actually reference, which is how a 99-query warehouse fits in a browser tab. See the README for the full data-sizing story.

The 99 queries, by class

The TPC-DS spec sorts its 99 queries into four business classes. First, two views of the whole suite — the SQL features it leans on, and which tables it hits most (a query can carry several tags / touch many tables):

SQL features across the 99 queries

subquery70

UNION20

window fn15

ROLLUP10

ranking8

INTERSECT3

GROUPING SETS2

CASE2

EXCEPT1

Most-referenced tables (× of 99 queries)

date_dim89

store_sales67

item53

catalog_sales39

store38

web_sales36

customer29

customer_address29

date_dim in 89 / 99 queries is the schema in a nutshell — almost everything is a time-sliced star-join. With that context, the four official classes and the SQL muscle each exercises:

Reporting

Well-known, scheduled business questions answered against a single channel — "what were store sales by category last quarter?". Straight star-joins (fact → a handful of dimensions) with GROUP BY + filters. The bread-and-butter shape, and the bulk of the store_sales-heavy queries above.

Ad-hoc / iterative

Exploratory questions a human asks interactively, often comparing channels or slicing many ways. These drive the 20 UNION queries (stack store + catalog + web results) and the 3 INTERSECT / 1 EXCEPT set-operations that find customers active in one channel but not another.

OLAP roll-ups & ranking

Multi-dimensional cube/subtotal analytics. The 15 window-function queries (running totals, LAG/LEAD, moving averages), the 8 ranking queries (RANK/top-N within a group), the 10 ROLLUP subtotal queries, and the 2 GROUPING SETS — the heaviest plans in the suite.

Data mining

The optimizer killers: deeply nested and correlated subqueries (tagged on 70 / 99 queries) and multi-fact self-joins — e.g. find customers whose store, catalog, and web spend all grew year-over-year. These need a customer's entire cross-channel history in one place, which is exactly what the demo's data generator preserves.

Why some queries return zero rows. TPC-DS mimics web-scale data — wide, sparse, and deeply skewed — so many queries are deliberately selective: a single manufacturer ID, or a customer who must be active across all three channels in two years. At a small scale factor those land on no rows, which is the correct result (the official dsdgen tiny data and Trino's tpcds.tiny catalog return the same empties). The gallery marks each query green (rows) or amber (empty) so you can tell them apart. Denser tiers recover more; the Generated tier runs 72 / 99 non-empty with zero errors.

Try it

Head back to the demo, pick a dataset tier (the zero-download Generated tier is the default), filter the gallery (try window, rollup, or a table name), and click a query — or Run all 99 to watch the whole suite execute in the wasm sandbox, with nothing leaving your tab.