
Methodology

How data is matched, merged, and reconciled across sources.

This page explains how data flows from raw federal sources to the platform: the matching logic, the enrichment process, the derived metrics, and the quality controls that keep what you see accurate.

Pipeline architecture

The pipeline is a one-directional flow through four stages:

Raw sources (EIA, GEM, USGS, FERC, ...)
  ↓ download & extract
Parquet files (columnar, typed, local)
  ↓ build dimensions
JSON export (one file per plant + index)
  ↓ push to Supabase
PostgreSQL (Supabase) → Vercel CDN → your browser

Each stage is independent and reproducible. The pipeline can be rebuilt from scratch in minutes. All intermediate files are Parquet format — columnar, typed, and efficient. The database is never edited directly; it is always a reflection of the pipeline output.

Dimension builds

Data is assembled in dimensions, each with its own build step. Every dimension has a schema definition, a build script, and optionally a query interface. Dimensions are independent — a failure in one does not block others from updating.
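The independence property can be sketched as a build loop that records failures instead of aborting. This is a minimal illustration, not the platform's actual build runner; the dimension names and return shapes are hypothetical.

```python
# Minimal sketch of an independent dimension-build loop: each build runs
# in isolation, so a failure in one dimension does not block the others.
def build_ownership():
    return {"dimension": "ownership", "rows": 3}

def build_boundary():
    raise RuntimeError("OSM fetch failed")  # simulated failure

BUILDS = {"ownership": build_ownership, "boundary": build_boundary}

def run_builds(builds):
    results, failures = {}, {}
    for name, build in builds.items():
        try:
            results[name] = build()
        except Exception as exc:
            failures[name] = str(exc)  # record the failure, keep going
    return results, failures

results, failures = run_builds(BUILDS)
```

With this structure, a broken OpenStreetMap fetch leaves the ownership dimension free to update on schedule.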

Base (asset_master)

EIA-860M + EIA-860 Annual → every plant and generator in the US inventory. This is the spine — all other dimensions reference it.

Ownership

Base + GEM trackers + Wikidata → parent company identification, entity chains, ownership agreement flags.

Boundary

Base + OpenStreetMap → plant boundary polygons (GeoJSON) and geographic centroids for map display.

Engineering

Base + USPVDB + USWTDB → solar panel specs, wind turbine specs, battery storage details.

Generation

EIA Generation API → monthly MWh time series, capacity factors, annual totals, and efficiency rankings.

Financial

FERC Form 1 + LBNL Solar + FERC EQR → capital costs, operating costs, installed cost, PPA prices, energy contracts.

Context

Wikipedia + news articles → plant summaries, external links, recent news coverage.

Matching across sources

The central challenge in multi-source assembly is matching records that refer to the same entity but use different identifiers. Here is how we handle each source join:

EIA ↔ GEM (ownership)

GEM trackers include EIA Plant IDs for US assets. The join is on plant_id. When a GEM record matches an EIA plant, the pipeline compares owner names between sources. If they disagree (common with SPV structures), the owner_agreement flag is set to false and both names are preserved.
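The owner comparison can be sketched as follows. Field names and the normalization rules here are assumptions for illustration; the real pipeline's columns may differ.

```python
# Hedged sketch of the EIA <-> GEM owner comparison: after joining on
# plant_id, compare normalized owner names and flag disagreements,
# preserving both values rather than picking a winner.
def normalize(name):
    # Crude normalization: lowercase, strip punctuation, collapse spaces.
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())

def merge_ownership(eia_row, gem_row):
    agree = normalize(eia_row["owner"]) == normalize(gem_row["owner"])
    return {
        "plant_id": eia_row["plant_id"],
        "eia_owner": eia_row["owner"],
        "gem_owner": gem_row["owner"],
        "owner_agreement": agree,  # False for SPV-style mismatches
    }

row = merge_ownership(
    {"plant_id": 57, "owner": "Acme Solar Holdings, LLC"},
    {"plant_id": 57, "owner": "Acme Energy Corp"},
)
```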

EIA ↔ USPVDB / USWTDB (engineering)

Both USGS databases include EIA Plant IDs and Generator IDs. The join is on the composite key (plant_id, generator_id). Generators without a match have null values for all specification columns — the UI distinguishes between "no data available" and "value is genuinely zero."
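A minimal sketch of that left join, with illustrative column names: unmatched generators get None for every spec column, which is what lets the UI tell "no data" apart from a real zero.

```python
# Sketch of the (plant_id, generator_id) composite-key left join:
# generators without a USGS match keep None for all spec columns.
def left_join_specs(generators, specs, spec_cols):
    index = {(s["plant_id"], s["generator_id"]): s for s in specs}
    out = []
    for g in generators:
        match = index.get((g["plant_id"], g["generator_id"]))
        row = dict(g)
        for col in spec_cols:
            row[col] = match[col] if match else None  # None = "no data"
        out.append(row)
    return out

rows = left_join_specs(
    [{"plant_id": 1, "generator_id": "G1"},
     {"plant_id": 2, "generator_id": "G1"}],
    [{"plant_id": 1, "generator_id": "G1", "hub_height_m": 90.0}],
    ["hub_height_m"],
)
```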

EIA ↔ Wikidata (identifiers)

Wikidata entries for US power plants include the EIA Plant ID as a property (P4157). The pipeline runs SPARQL queries against Wikidata to fetch Q-IDs for all plants that have entries. The Q-ID then links to Wikipedia articles, OpenStreetMap entries, and other knowledge bases.
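A query in that shape might look like the following. The property P4157 (EIA plant ID) is from the text above; the exact batching and the rest of the pipeline's query are assumptions.

```python
# Illustrative builder for the Wikidata SPARQL lookup: fetch Q-IDs for
# a batch of EIA plant IDs via the wdt:P4157 property.
def build_qid_query(plant_ids):
    values = " ".join(f'"{pid}"' for pid in plant_ids)
    return f"""
    SELECT ?plant ?eia_id WHERE {{
      VALUES ?eia_id {{ {values} }}
      ?plant wdt:P4157 ?eia_id .
    }}
    """

query = build_qid_query([57, 141])
```

The returned `?plant` bindings are the Q-IDs that then link out to Wikipedia, OpenStreetMap, and other knowledge bases.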

Plant ↔ pricing nodes (grid)

Pricing nodes are matched to plants through a multi-step crosswalk that combines EIA plant coordinates, OpenStreetMap substation data (voltage, operator), and curated ISO node databases. The crosswalk produces a list of nearby nodes sorted by distance for each plant in an ISO territory.
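The distance step of that crosswalk can be sketched with a great-circle calculation. This covers only the geographic part; the voltage/operator filtering and the curated ISO node lists are outside this sketch.

```python
import math

# Sketch of the distance ranking in the plant <-> pricing-node
# crosswalk: sort candidate nodes by great-circle distance.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearby_nodes(plant, nodes):
    return sorted(
        ({"node": n["name"],
          "distance_km": haversine_km(plant["lat"], plant["lon"],
                                      n["lat"], n["lon"])}
         for n in nodes),
        key=lambda d: d["distance_km"],
    )

ranked = nearby_nodes(
    {"lat": 35.0, "lon": -101.0},
    [{"name": "NODE_A", "lat": 35.1, "lon": -101.0},
     {"name": "NODE_B", "lat": 36.0, "lon": -101.5}],
)
```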

Derived metrics

Capacity factor

Capacity factor = actual generation / theoretical maximum generation. Computed from EIA monthly MWh data and nameplate capacity. Annual capacity factor uses the most recent 12 months of data. A plant generating at 100% of nameplate capacity for all hours in a year would have a CF of 1.0. Typical values: solar ~0.20–0.28, wind ~0.25–0.45, nuclear ~0.90–0.93, natural gas ~0.10–0.55.
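The formula reduces to a one-liner; the numbers below are illustrative, not real plant data.

```python
# Capacity factor = actual generation / theoretical maximum generation,
# computed over the most recent 12 months of monthly MWh data.
def capacity_factor(monthly_mwh, nameplate_mw, hours=8760):
    return sum(monthly_mwh) / (nameplate_mw * hours)

# A 100 MW plant generating 21,900 MWh every month for a year:
# 262,800 MWh actual / 876,000 MWh theoretical = 0.30
cf = capacity_factor([21_900] * 12, nameplate_mw=100)
```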

Rankings

State and national rankings are computed at export time. Each plant is ranked by nameplate capacity within its primary fuel type — separately for state and national scope. Rankings are recomputed on every export to reflect the latest data, not stored as static values.
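The export-time ranking can be sketched as a group-and-sort. Field names here are assumptions; the idea is the same grouping by fuel type, once nationally and once per state.

```python
from collections import defaultdict

# Sketch of export-time ranking: rank plants by nameplate capacity
# within each fuel-type group, optionally subdivided by state.
def rank_by_fuel(plants, scope_key=None):
    groups = defaultdict(list)
    for p in plants:
        key = (p[scope_key], p["fuel"]) if scope_key else p["fuel"]
        groups[key].append(p)
    ranks = {}
    for members in groups.values():
        members.sort(key=lambda p: p["capacity_mw"], reverse=True)
        for i, p in enumerate(members, start=1):
            ranks[p["plant_id"]] = i
    return ranks

plants = [
    {"plant_id": 1, "state": "TX", "fuel": "solar", "capacity_mw": 250},
    {"plant_id": 2, "state": "TX", "fuel": "solar", "capacity_mw": 400},
    {"plant_id": 3, "state": "CA", "fuel": "solar", "capacity_mw": 300},
]
national = rank_by_fuel(plants)        # plant 2 is #1 solar nationally
state = rank_by_fuel(plants, "state")  # plant 3 is #1 solar in CA
```

Because the function runs over whatever plant list the export sees, the ranks always reflect the latest data rather than stored values.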

Quality controls

Every dimension has a schema definition (Python dataclass) that validates column names, types, and expected value ranges. The test suite (94 tests) covers schema validation, build logic, and data integrity checks. Tests run after every rebuild and before any push to production.
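A schema check in that spirit might look like the following. The dataclass fields and column names are hypothetical, not the platform's real schema definitions.

```python
from dataclasses import dataclass
from typing import Optional

# Minimal sketch of a dataclass-driven schema check: validate types
# and expected value ranges, collecting every violation.
@dataclass
class ColumnSpec:
    name: str
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None

def validate(rows, schema):
    errors = []
    for i, row in enumerate(rows):
        for col in schema:
            value = row.get(col.name)
            if not isinstance(value, col.dtype):
                errors.append(f"row {i}: {col.name} has type {type(value).__name__}")
            elif col.min_value is not None and value < col.min_value:
                errors.append(f"row {i}: {col.name}={value} below {col.min_value}")
            elif col.max_value is not None and value > col.max_value:
                errors.append(f"row {i}: {col.name}={value} above {col.max_value}")
    return errors

schema = [ColumnSpec("capacity_mw", float, min_value=0.0)]
errors = validate([{"capacity_mw": 120.5}, {"capacity_mw": -3.0}], schema)
```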

The pipeline is fully deterministic — the same source inputs always produce the same output. There is no manual editing of production data. If something looks wrong, the fix goes into the pipeline code, not into the database.

Incremental updates

The export and push stages are incremental by default. The export script computes a SHA-256 hash of each plant's JSON and skips writing files that haven't changed. The push script compares against Supabase and only upserts changed rows. A typical monthly refresh touches a fraction of the 15,053 plants — making the process fast and minimizing write load on the database.
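The hash-skip step can be sketched like this. The real script writes JSON files and talks to Supabase; here an in-memory dict stands in for the previous export's hashes.

```python
import hashlib
import json

# Sketch of incremental export: hash each plant's canonical JSON and
# skip any plant whose SHA-256 digest matches the previous export.
def export_incremental(plants, previous_hashes):
    written = []
    for plant in plants:
        payload = json.dumps(plant, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        if previous_hashes.get(plant["plant_id"]) != digest:
            previous_hashes[plant["plant_id"]] = digest
            written.append(plant["plant_id"])  # only changed plants
    return written

prev = {}
plants = [{"plant_id": 1, "name": "Alpha"}, {"plant_id": 2, "name": "Beta"}]
first = export_incremental(plants, prev)   # everything is new: write both
second = export_incremental(plants, prev)  # nothing changed: write none
```

Sorting keys before hashing matters: it keeps the digest stable across runs even if dict insertion order varies, which is what makes the skip check reliable.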