TPC-DS is the web-scale big-data decision-support
benchmark: a 24-table retail data warehouse modeling a business that sells through
three channels — physical stores, a catalog, and the web — stressed by
99 queries spanning reporting, ad-hoc analytics, OLAP roll-ups, and data mining.
Where TPC-H models a traditional relational warehouse, TPC-DS was built
for the modern big-data era: a far richer schema, deeply skewed data, and queries
designed to push distributed query engines. It runs from sf=1
(≈1 GB) to sf=100000 (100 TB) and is the de-facto yardstick
for Spark, BigQuery, Snowflake, and friends.
This demo runs all 99 queries against a genuine (browser-sized) TPC-DS dataset using the real spark-rust Spark SQL engine compiled to WebAssembly — entirely in your browser tab. This page explains the shape of the schema and what the queries are doing.
The schema: a multi-star, three-channel warehouse
TPC-DS is a galaxy (multi-star) schema. Each of the three sales
channels has a matched pair of fact tables — a sales fact and a
returns fact — plus inventory for the warehouse. All seven
facts share a common pool of conformed dimensions
(date_dim, item, customer, …). The diagram
below collapses the shared dimensions into one cluster so the channel structure
stays legible:
Schema · three sales channels over shared dimensions
flowchart TB
subgraph STORE["🏬 Store channel"]
direction TB
SS["store_sales
fact"]
SR["store_returns
fact"]
end
subgraph CATALOG["📖 Catalog channel"]
direction TB
CS["catalog_sales
fact"]
CR["catalog_returns
fact"]
end
subgraph WEB["🌐 Web channel"]
direction TB
WS["web_sales
fact"]
WR["web_returns
fact"]
end
INV["📦 inventory
item × warehouse × date"]
DIMS["⭐ shared dimensions
date_dim · item · customer · customer_address
customer_demographics · household_demographics
promotion · store · catalog_page · web_site
warehouse · time_dim · … (17 total)"]
SS --> DIMS
SR --> DIMS
CS --> DIMS
CR --> DIMS
WS --> DIMS
WR --> DIMS
INV --> DIMS
SR -.->|references| SS
CR -.->|references| CS
WR -.->|references| WS
classDef sales fill:#e8590c,stroke:#ff8a4a,color:#fff,font-weight:600;
classDef returns fill:#5a2a10,stroke:#8a4a22,color:#ffd9c2;
classDef dim fill:#222b38,stroke:#3a4658,color:#e6edf3;
classDef inv fill:#9a4a18,stroke:#c46a2a,color:#fff;
class SS,CS,WS sales;
class SR,CR,WR returns;
class DIMS dim;
class INV inv;
Each fact carries foreign keys (*_sk "surrogate keys") into the
shared dimensions — e.g. store_sales.ss_item_sk → item.i_item_sk,
ss_sold_date_sk → date_dim.d_date_sk. A returns fact references the
same dimensions as its matching sales fact, which is why so many queries pair
store_sales with store_returns on the same item/customer.
The 24 tables
7 fact tables
store_sales/store_returnscatalog_sales/catalog_returnsweb_sales/web_returnsinventory(item × warehouse × date)
17 dimension tables
- core:
date_dim,time_dim,item,customer - who:
customer_address,customer_demographics,household_demographics,income_band - where:
store,call_center,catalog_page,web_site,web_page,warehouse - misc:
promotion,reason,ship_mode
The "fixed-size mega-dimension" gotcha. Three dimensions are
the same size at every scale factor:
customer_demographics (1.92M rows), time_dim (86,400 —
one per second of the day), and date_dim (~73K days). At small scale
the demo FK-prunes these down to only the rows the facts actually reference,
which is how a 99-query warehouse fits in a browser tab. See the
README for the full data-sizing story.
The 99 queries, by class
The TPC-DS spec sorts its 99 queries into four business classes. First, two views of the whole suite — the SQL features it leans on, and which tables it hits most (a query can carry several tags / touch many tables):
SQL features across the 99 queries
Most-referenced tables (× of 99 queries)
date_dim in 89 / 99 queries is the schema in a
nutshell — almost everything is a time-sliced star-join. With that context, the
four official classes and the SQL muscle each exercises:
Reporting
Well-known, scheduled business questions answered against a single channel —
"what were store sales by category last quarter?". Straight star-joins
(fact → a handful of dimensions) with GROUP BY + filters. The
bread-and-butter shape, and the bulk of the store_sales-heavy
queries above.
Ad-hoc / iterative
Exploratory questions a human asks interactively, often comparing channels or
slicing many ways. These drive the 20 UNION queries
(stack store + catalog + web results) and the 3 INTERSECT /
1 EXCEPT set-operations that find customers active in
one channel but not another.
OLAP roll-ups & ranking
Multi-dimensional cube/subtotal analytics. The 15 window-function
queries (running totals, LAG/LEAD, moving averages), the
8 ranking queries (RANK/top-N within a group), the
10 ROLLUP subtotal queries, and the
2 GROUPING SETS — the heaviest plans in the suite.
Data mining
The optimizer killers: deeply nested and correlated subqueries (tagged on 70 / 99 queries) and multi-fact self-joins — e.g. find customers whose store, catalog, and web spend all grew year-over-year. These need a customer's entire cross-channel history in one place, which is exactly what the demo's data generator preserves.
Why some queries return zero rows. TPC-DS mimics web-scale data
— wide, sparse, and deeply skewed — so many queries are deliberately
selective: a single manufacturer ID, or a customer who must be active across
all three channels in two years. At a small scale factor those land on no rows,
which is the correct result (the official dsdgen tiny data
and Trino's tpcds.tiny catalog return the same empties). The gallery
marks each query green (rows) or amber (empty) so you can tell them apart. Denser
tiers recover more; the Generated tier runs 72 / 99 non-empty
with zero errors.
Try it
Head back to the demo, pick a dataset tier (the
zero-download Generated tier is the default), filter the gallery
(try window, rollup, or a table name), and click a query
— or Run all 99 to watch the whole suite execute in the wasm sandbox,
with nothing leaving your tab.