Spatial Equity Index Calculation for Public Health Infrastructure

This guide is part of the Healthcare Access & Network Analysis Automation section and covers how to turn harmonized population, facility, and network layers into a validated, reproducible Spatial Equity Index (SEI) that quantifies disparities in healthcare access across census-defined geographies. The SEI is the metric that converts catchment geometry and facility capacity into a single accessibility score per population unit; in operational public health it becomes compliance-critical infrastructure the moment it informs resource allocation, shortage-area designation, or interagency funding decisions.

At a glance, the calculation proceeds through five deterministic stages, which the sections below cover in order.

Concept & Epidemiological Alignment

A Spatial Equity Index measures realized accessibility: for each population unit it answers “how much clinically-weighted supply is reachable, discounted by the travel cost of reaching it, and contested by every other resident competing for the same providers.” It is distinct from raw provider-to-population ratios (which ignore where people and clinics actually are) and from simple nearest-facility distance (which ignores capacity and competition). The dominant family of estimators is the Two-Step Floating Catchment Area (2SFCA) and its Enhanced variant (E2SFCA); the latter adds distance-decay weighting inside the catchment so a clinic at the edge of a 30-minute band counts for less than one across the street.

For a population unit $i$, the E2SFCA index $A_i$ is:

$$A_i = \sum_{j \in \{d_{ij} \le d_0\}} R_j\, W(d_{ij}), \qquad R_j = \frac{S_j}{\displaystyle\sum_{k \in \{d_{kj} \le d_0\}} P_k\, W(d_{kj})}$$

where $S_j$ is clinically-weighted supply at facility $j$, $P_k$ is the population (demand) of unit $k$, $d_{ij}$ is the network travel cost, $d_0$ is the catchment threshold, and $W(\cdot)$ is the distance-decay kernel. Step one builds each facility’s supply-to-demand ratio $R_j$; step two sums those ratios back onto residents.
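
To make the two steps concrete, here is a minimal worked sketch on a toy problem. It assumes three population units, two facilities, a Gaussian kernel, and made-up travel costs in minutes; `e2sfca_toy` and all input numbers are illustrative, not drawn from any real dataset.

```python
import math

def gaussian(d, bandwidth):
    """Gaussian decay; d and bandwidth share one unit (minutes here)."""
    return math.exp(-(d ** 2) / (2.0 * bandwidth ** 2))

def e2sfca_toy(cost, population, supply, d0, bandwidth):
    """cost[i][j] is the travel cost from population unit i to facility j."""
    n_pop, n_fac = len(population), len(supply)
    # Step 1: R_j = S_j / sum_k P_k W(d_kj), over units within d0 of facility j
    ratios = []
    for j in range(n_fac):
        demand = sum(population[k] * gaussian(cost[k][j], bandwidth)
                     for k in range(n_pop) if cost[k][j] <= d0)
        ratios.append(supply[j] / demand if demand > 0 else 0.0)
    # Step 2: A_i = sum_j R_j W(d_ij), over facilities within d0 of unit i
    return [sum(ratios[j] * gaussian(cost[i][j], bandwidth)
                for j in range(n_fac) if cost[i][j] <= d0)
            for i in range(n_pop)]

# Hypothetical inputs: 3 population units, 2 facilities, costs in minutes
cost = [[5.0, 40.0], [10.0, 20.0], [35.0, 10.0]]
population = [1000, 2000, 1500]
supply = [4.0, 6.0]          # e.g. weighted FTEs
scores = e2sfca_toy(cost, population, supply, d0=30.0, bandwidth=15.0)

# Sanity check: population-weighted access adds back up to total supply
total = sum(p * a for p, a in zip(population, scores))
```

Because both steps use the same catchment and kernel, `total` should equal the summed supply (10.0 here) whenever every facility has at least one resident in range. That identity is a cheap regression test for any E2SFCA implementation.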

Use the SEI when you must defend who gets a new clinic, mobile unit, or grant on accessibility grounds. Prefer a simpler metric when the question is purely descriptive. The estimator’s validity rests on assumptions that frequently break in surveillance data: catchments are mobility-realistic (not Euclidean buffers); demand is correctly de-duplicated across overlapping catchments; supply is measured in clinically meaningful units rather than raw bed counts; and the decay kernel reflects observed care-seeking behavior for the service type in question. Each of these assumptions maps to a validation gate later in the pipeline.

Method-Selection Table

Scenario	Recommended estimator	Decay kernel	Why
Urban primary care, dense providers	E2SFCA, multi-band	Gaussian or stepwise	Competition and edge-of-catchment discounting matter most where catchments overlap heavily
Rural / frontier, sparse providers	E2SFCA, wide single band	Exponential, gentle slope	Few facilities; avoid over-penalizing the only reachable clinic
Emergency / time-critical service	2SFCA on isochrone bands	Stepwise (cliff at threshold)	Care is binary inside/outside a response-time guarantee
Specialty referral (low volume)	E2SFCA with capacity weighting	Power-law	Long travel is normal; weight by true service capacity, not headcount
Screening / descriptive report only	Provider-to-population ratio	none	When no allocation decision rides on it, the simpler metric is defensible and cheaper
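
The four kernel families in the table can be sketched as plain functions. The parameter names (`scale`, `beta`, `d_min`) and the floor on the power-law distance are assumptions made for illustration; the source does not prescribe a parameterization.

```python
import math

def gaussian(d, bandwidth):
    """Smooth fall-off; weight 1.0 at d = 0."""
    return math.exp(-(d ** 2) / (2.0 * bandwidth ** 2))

def exponential(d, scale):
    """Gentler tail than Gaussian; a larger scale gives a flatter slope."""
    return math.exp(-d / scale)

def stepwise(d, d0):
    """Hard cliff: full weight inside the threshold, none outside."""
    return 1.0 if d <= d0 else 0.0

def power_law(d, beta, d_min=1.0):
    """Heavy tail for long-travel services; d_min (assumed) avoids d = 0."""
    return max(d, d_min) ** (-beta)

# Keys mirror the kernel names used in the configuration file
KERNELS = {
    "gaussian": gaussian,
    "exponential": exponential,
    "stepwise": stepwise,
    "power": power_law,
}
```

Keeping the kernels in a lookup keyed by the configured name means the choice stays a reviewed configuration value rather than a code change.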

Spatial Data Prerequisites

Reliable indexing begins with harmonized inputs. Population denominators, facility locations, and network topology must share a single projected CRS optimized for distance preservation within the study area — enforce canonical Coordinate Reference Systems for Public Health before any distance computation. For U.S. deployments, an EPSG:269xx UTM zone or the appropriate State Plane zone is mandatory; multi-state or national analyses require an equal-area projection (e.g. EPSG:5070 CONUS Albers) to control areal distortion in demand weighting.
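
A small worked example shows why degrees cannot stand in for metres. The sketch below assumes a spherical Earth and uses the haversine formula to measure the same 0.1° step of longitude at two latitudes; the coordinates are arbitrary.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# The same 0.1 degree step in longitude, measured at 30N and at 45N
at_30n = haversine_m(30.0, -90.0, 30.0, -89.9)
at_45n = haversine_m(45.0, -90.0, 45.0, -89.9)
```

Under these assumptions the step is roughly 9.6 km at 30°N and roughly 7.9 km at 45°N, so a catchment radius expressed in degrees would cover different ground in different parts of the same study area.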

Minimum requirements before the pipeline runs:

Geometry types: population units as polygons (census blocks/block groups) reduced to representative points; facilities as points snapped to routable network nodes.
CRS: one documented projected CRS for all layers; no lat/long arithmetic anywhere downstream.
Topology: valid, non-self-intersecting geometries; floating-point coordinate drift snapped to sub-meter tolerance.
Minimum sample size: at least one facility reachable within $d_0$ for the bulk of population units; flag and report units with zero reachable supply rather than imputing.
Covariates: facility capacity in clinically meaningful units (FTEs, weighted beds), not raw counts; population de-duplicated against overlapping administrative boundaries.

Use geopandas and shapely to enforce valid geometries, remove self-intersections, and snap drift before any spatial operation. When fusing multi-agency datasets, implement a deterministic join key strategy on standardized identifiers (FIPS, NPI, facility-registry IDs) and log every unmatched record to an audit table — silent data loss here corrupts downstream scoring and breaks reproducibility.
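
The join-and-audit step described above might look like the following sketch. It works on plain dictionaries so the logic is visible; `normalize_id`, `audited_join`, the 11-character key width, and the `svi` column are hypothetical names chosen for the example.

```python
def normalize_id(value, width):
    """Zero-pad an identifier so 1001020100 and '01001020100' compare equal."""
    return str(value).strip().zfill(width)

def audited_join(left, right, key, width):
    """Inner-join two record lists on a normalized key.

    Returns (matched, audit): matched rows sorted by key, plus one audit row
    for every record that failed to match or repeated a key.
    """
    def index(rows, side, audit):
        by_key = {}
        for row in rows:
            k = normalize_id(row[key], width)
            if k in by_key:
                audit.append({"key": k, "reason": f"duplicate_in_{side}"})
            else:
                by_key[k] = row
        return by_key

    audit = []
    left_by = index(left, "left", audit)
    right_by = index(right, "right", audit)
    matched = [{**left_by[k], **right_by[k], key: k}
               for k in sorted(left_by.keys() & right_by.keys())]
    audit += [{"key": k, "reason": "left_only"}
              for k in sorted(left_by.keys() - right_by.keys())]
    audit += [{"key": k, "reason": "right_only"}
              for k in sorted(right_by.keys() - left_by.keys())]
    return matched, audit

# Hypothetical records from two agencies; one ID lost its leading zero
left = [{"geoid": 1001020100, "population": 1200},
        {"geoid": "01001020200", "population": 800}]
right = [{"geoid": "01001020100", "svi": 0.4},
         {"geoid": "01001020300", "svi": 0.7}]
matched, audit = audited_join(left, right, key="geoid", width=11)
```

Sorting both outputs by key keeps the result independent of input order, and the audit list is what would be persisted to the audit table.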

# geopandas==0.14.4, shapely==2.0.4, pyproj==3.6.1
import geopandas as gpd
import shapely
from shapely.validation import make_valid

def harmonize_and_validate(gdf: gpd.GeoDataFrame, target_epsg: int) -> gpd.GeoDataFrame:
    """Deterministic geometry cleaning + reprojection before any distance op."""
    # drop null/empty FIRST: make_valid cannot be applied to a missing geometry
    gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()
    gdf["geometry"] = gdf.geometry.apply(make_valid)          # repair invalid rings
    gdf = gdf.to_crs(epsg=target_epsg)                        # project before snapping
    # snap sub-meter drift AFTER projection so the 1 mm grid is in metres;
    # set_precision rounds coordinates to the grid (simplify() only drops vertices)
    snapped = shapely.set_precision(gdf.geometry.to_numpy(), grid_size=0.001)
    gdf["geometry"] = gpd.GeoSeries(snapped, index=gdf.index, crs=gdf.crs)
    return gdf.sort_index()                                   # stable order = reproducible

Production Implementation: Network-Constrained Catchments

Catchments must reflect real-world mobility, not Euclidean buffers. Build them from Drive-Time Isochrone Generation so the travel surface accounts for road hierarchy, speed limits, turn restrictions, and one-way constraints. In production, never generate isochrones on-the-fly per run: precompute a facility-to-population travel-time matrix (or cache pandana/osmnx graph traversals) and consume it as a fixed input so the SEI is reproducible bit-for-bit.
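
One way to confirm that a cached travel-time matrix really is the same fixed input across runs is to fingerprint it. The sketch below assumes the matrix is available as `(pop_id, facility_id, minutes)` tuples; `matrix_fingerprint` and the three-decimal rounding are illustrative choices.

```python
import hashlib

def matrix_fingerprint(rows):
    """SHA-256 over the matrix in canonical (sorted) order.

    rows: iterable of (pop_id, facility_id, minutes) tuples.
    Sorting first makes the digest independent of file or query order.
    """
    h = hashlib.sha256()
    for pop_id, fac_id, minutes in sorted(rows):
        h.update(f"{pop_id}|{fac_id}|{minutes:.3f}\n".encode())
    return h.hexdigest()

# Hypothetical cached matrix
rows = [("b1", "f1", 12.5), ("b2", "f1", 7.0), ("b1", "f2", 28.25)]
fingerprint = matrix_fingerprint(rows)
```

Storing the digest next to the output lets a reviewer verify that a re-run consumed an identical matrix before comparing scores.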

Parameterize catchment thresholds via configuration files (YAML/TOML), never hard-coded constants — every parameter must be justifiable in a compliance review:

# config/catchment.yml
catchment:
  travel_time_bands: [5, 10, 15, 20, 30]   # minutes; nested E2SFCA zones
  mode: "drive"
  impedance_multiplier: 1.0
  max_edges: 50000
decay:
  kernel: "gaussian"        # gaussian | exponential | stepwise | power
  bandwidth_minutes: 15.0
  d0_minutes: 30.0

Load and schema-validate the configuration at pipeline init, then hash it (see Compliance & Audit Controls) so the exact parameter set is bound to every output artifact.
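
A minimal validation sketch for the parsed configuration follows. It assumes the YAML has already been loaded into a dictionary; `validate_config` is a hypothetical helper, and the rule that `d0_minutes` may not exceed the widest band is an assumption, not something the configuration format itself enforces.

```python
ALLOWED_KERNELS = {"gaussian", "exponential", "stepwise", "power"}

def _positive_number(x):
    return isinstance(x, (int, float)) and not isinstance(x, bool) and x > 0

def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    bands = cfg.get("catchment", {}).get("travel_time_bands")
    decay = cfg.get("decay", {})

    if not isinstance(bands, list) or not bands or not all(map(_positive_number, bands)):
        errors.append("catchment.travel_time_bands must be a non-empty list of positive minutes")
        bands = []
    elif any(a >= b for a, b in zip(bands, bands[1:])):
        errors.append("catchment.travel_time_bands must be strictly increasing")

    if decay.get("kernel") not in ALLOWED_KERNELS:
        errors.append(f"decay.kernel must be one of {sorted(ALLOWED_KERNELS)}")
    if not _positive_number(decay.get("bandwidth_minutes")):
        errors.append("decay.bandwidth_minutes must be a positive number")

    d0 = decay.get("d0_minutes")
    if not _positive_number(d0):
        errors.append("decay.d0_minutes must be a positive number")
    elif bands and d0 > max(bands):
        # Assumed rule: the widest band should reach d0
        errors.append("decay.d0_minutes exceeds the widest travel-time band")
    return errors

good = {
    "catchment": {"travel_time_bands": [5, 10, 15, 20, 30], "mode": "drive"},
    "decay": {"kernel": "gaussian", "bandwidth_minutes": 15.0, "d0_minutes": 30.0},
}
bad = {
    "catchment": {"travel_time_bands": [10, 5]},
    "decay": {"kernel": "linear", "bandwidth_minutes": 0, "d0_minutes": 30.0},
}
good_errors = validate_config(good)
bad_errors = validate_config(bad)
```

Collecting every problem instead of raising on the first one gives a compliance reviewer the full list in a single pass.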

Production Implementation: Vectorized E2SFCA Scoring

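The computational backbone is the distance-decay weighted supply-to-demand ratio. For step-by-step derivation of the two-step mechanics, see Calculating the Two-Step Floating Catchment Area Index. A production implementation avoids iterrows()/apply(), which collapse past ~100k census blocks; instead use a scipy.spatial KD-tree for neighborhood queries and vectorized NumPy for decay weighting. The implementation below is deterministic and logs structured audit events. It takes $d_0$ and the bandwidth as arguments in CRS units (metres) and measures straight-line distance, so convert the validated config's minute-based values before calling it and treat the result as a planar baseline; where network travel cost is required, feed the precomputed travel-time matrix through the same two steps.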

# numpy==1.26.4, scipy==1.13.1, geopandas==0.14.4
import logging
import numpy as np
from scipy.spatial import cKDTree

log = logging.getLogger("sei.e2sfca")

def gaussian_decay(d, bandwidth):
    """Gaussian distance-decay; d and bandwidth in the same units (metres)."""
    return np.exp(-(d ** 2) / (2.0 * bandwidth ** 2))

def compute_e2sfca(pop_gdf, fac_gdf, *, max_dist, bandwidth, decay_fn=gaussian_decay):
    """
    Enhanced Two-Step Floating Catchment Area (E2SFCA).
      pop_gdf: GeoDataFrame, point geometry + 'population' (demand)
      fac_gdf: GeoDataFrame, point geometry + 'capacity'   (clinically-weighted supply)
      max_dist: catchment radius d0 in CRS units (metres for a projected CRS)
      bandwidth: decay bandwidth in the same CRS units as max_dist
    Returns SEI per population unit in pop_gdf.sort_index() order, which matches
    the caller's row order only if pop_gdf was already sorted by index.
    """
    # stable ordering => reproducible output regardless of input shuffling
    pop_gdf = pop_gdf.sort_index()
    fac_gdf = fac_gdf.sort_index()

    pop_xy = np.column_stack([pop_gdf.geometry.x, pop_gdf.geometry.y])
    fac_xy = np.column_stack([fac_gdf.geometry.x, fac_gdf.geometry.y])
    pop = pop_gdf["population"].to_numpy(float)
    cap = fac_gdf["capacity"].to_numpy(float)

    pop_tree = cKDTree(pop_xy)
    fac_tree = cKDTree(fac_xy)

    # ---- Step 1: facility supply-to-demand ratio R_j ----
    ratios = np.zeros(len(fac_gdf))
    empty_catchments = 0
    for j, fxy in enumerate(fac_xy):
        idx = pop_tree.query_ball_point(fxy, r=max_dist)
        if not idx:
            empty_catchments += 1
            continue
        d = np.linalg.norm(pop_xy[idx] - fxy, axis=1)
        weighted_demand = float(np.sum(pop[idx] * decay_fn(d, bandwidth)))
        ratios[j] = cap[j] / weighted_demand if weighted_demand > 0 else 0.0

    # ---- Step 2: accumulate ratios back onto population units ----
    sei = np.zeros(len(pop_gdf))
    no_access = 0
    for i, pxy in enumerate(pop_xy):
        idx = fac_tree.query_ball_point(pxy, r=max_dist)
        if not idx:
            no_access += 1
            continue
        d = np.linalg.norm(fac_xy[idx] - pxy, axis=1)
        sei[i] = float(np.sum(ratios[idx] * decay_fn(d, bandwidth)))

    log.info(
        "e2sfca complete: facilities=%d empty_catchments=%d "
        "pop_units=%d zero_access=%d",
        len(fac_gdf), empty_catchments, len(pop_gdf), no_access,
    )
    return sei
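
The same two steps can also run directly on a precomputed travel-time matrix with nested bands. The sketch below is one possible shape, assuming the matrix arrives as a `{(pop_id, facility_id): minutes}` mapping and that each band carries a fixed weight; `e2sfca_from_matrix`, `band_weight`, and the weight values are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

def band_weight(minutes, bands, weights):
    """Weight of the first band whose upper edge covers `minutes`, else 0."""
    k = bisect_left(bands, minutes)
    return weights[k] if k < len(bands) else 0.0

def e2sfca_from_matrix(travel, population, supply, bands, weights):
    """travel: {(pop_id, fac_id): minutes}. Returns {pop_id: score or None}.

    None marks a unit with no facility inside the outermost band, which keeps
    it distinguishable from a unit whose score is merely low.
    """
    w = {pair: band_weight(m, bands, weights) for pair, m in travel.items()}

    # Step 1: decay-weighted demand per facility, then supply-to-demand ratio
    demand = defaultdict(float)
    for (p, f), wt in sorted(w.items()):
        demand[f] += population[p] * wt
    ratio = {f: (supply[f] / demand[f] if demand[f] > 0 else 0.0) for f in supply}

    # Step 2: sum the weighted ratios back onto each population unit
    score = {p: None for p in population}
    for (p, f), wt in sorted(w.items()):      # sorted => stable float summation
        if wt > 0:
            score[p] = (score[p] or 0.0) + ratio[f] * wt
    return score

# Hypothetical inputs; bands follow config/catchment.yml, weights are made up
bands = [5, 10, 15, 20, 30]
weights = [1.0, 0.8, 0.6, 0.4, 0.2]
travel = {("a", "x"): 4, ("b", "x"): 12, ("b", "y"): 25, ("c", "y"): 45}
population = {"a": 100, "b": 200, "c": 50}
supply = {"x": 2.0, "y": 1.0}
score = e2sfca_from_matrix(travel, population, supply, bands, weights)
```

Unit `c` sits 45 minutes from its only facility, beyond the 30-minute outer band, so it comes back as `None` rather than as a score.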

Capacity normalization must account for service type, staffing ratios, and operational hours. Reference Facility Capacity Allocation Models for the weighting matrices that convert raw bed counts or provider FTEs into clinically meaningful supply units before they enter cap.
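
As a rough illustration of what such a normalization could look like, the sketch below scales provider FTEs by a service-type weight and by opening hours. The weights, the 40-hour reference week, and the function name are placeholders invented for this example; real values belong to the capacity allocation models referenced above.

```python
# Hypothetical weights: NOT taken from any published weighting matrix
SERVICE_WEIGHTS = {"primary_care": 1.0, "urgent_care": 0.8}
REFERENCE_WEEK_HOURS = 40.0   # assumed full-service week

def weighted_supply(fte, service_type, weekly_open_hours):
    """Convert raw FTEs into a supply figure comparable across facilities."""
    hours_factor = weekly_open_hours / REFERENCE_WEEK_HOURS
    return fte * SERVICE_WEIGHTS[service_type] * hours_factor

# A 5-FTE urgent-care site open 20 hours a week
cap_example = weighted_supply(5.0, "urgent_care", 20.0)
```

An unknown service type raises `KeyError` here, so an unweighted facility cannot slip into the supply column unnoticed.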

Parameter Selection & Tuning

Three parameters dominate the SEI and must each be justified in writing:

Catchment threshold $d_0$. Set from service type and jurisdictional standards (e.g. a 30-minute emergency guarantee, a 15-minute primary-care target), not convenience. Use nested bands so the decay is piecewise rather than a single cliff.
Decay kernel and bandwidth. Gaussian is the default for primary care; exponential suits sparse rural networks; a stepwise kernel encodes a hard response-time guarantee. The bandwidth controls how fast access falls off: set it to the distance at which care-seeking behavior is observed to drop, not an arbitrary half of $d_0$.
Significance / instability thresholds. Because the index has no analytic standard error, defensibility comes from sensitivity analysis rather than a p-value. Run the pipeline across a grid of plausible kernels and bandwidths and compute the coefficient of variation (standard deviation divided by mean) per geography. Flag any unit whose coefficient of variation exceeds 0.15 across the grid as parameter-sensitive, and surface that flag in the output so downstream allocation does not treat an unstable score as firm.

# itertools for the parameter grid; numpy for CV
import itertools
import numpy as np

def sensitivity_grid(pop_gdf, fac_gdf, *, d0_set, bandwidth_set, kernels):
    """Return per-unit coefficient of variation across the parameter grid."""
    runs = []
    for d0, bw, kfn in itertools.product(d0_set, bandwidth_set, kernels):
        runs.append(compute_e2sfca(pop_gdf, fac_gdf, max_dist=d0,
                                    bandwidth=bw, decay_fn=kfn))
    stack = np.vstack(runs)                       # (n_runs, n_units)
    mean = stack.mean(axis=0)
    cv = np.divide(stack.std(axis=0), mean, out=np.zeros_like(mean), where=mean > 0)
    return cv                                     # flag cv > 0.15

Validate spatial outputs against independent benchmarks — CDC Social Vulnerability Index strata or HRSA shortage-area designations — and check for clustering artifacts with global Moran’s I via Global & Local Moran’s I Implementation; strong residual autocorrelation at jurisdictional edges usually signals an edge effect rather than a real disparity.
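
For orientation, global Moran's I reduces to a few lines once a spatial weights mapping exists. The sketch below is a bare statistic with no permutation test or standardization, assuming weights are supplied as `{(i, j): w_ij}` over unit positions; a production check would normally use a dedicated spatial statistics library.

```python
def morans_i(values, weights):
    """Global Moran's I.

    values:  list of scores, one per unit
    weights: {(i, j): w_ij} for neighboring unit positions i != j
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                 # deviations from the mean
    s0 = sum(weights.values())                     # total weight
    cross = sum(w * z[i] * z[j] for (i, j), w in weights.items())
    return (n / s0) * (cross / sum(zi * zi for zi in z))

# Four units in a row, each neighboring the next (symmetric binary weights)
chain = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}
smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)        # similar neighbors
checker = morans_i([1.0, 4.0, 1.0, 4.0], chain)       # alternating neighbors
```

A smoothly increasing sequence yields a positive value and an alternating one a negative value, which is the sign convention to keep in mind when reading residual autocorrelation.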

Edge Cases & Failure Modes

Population units with zero reachable supply. Report them as a distinct category (access desert), never as SEI = 0 mixed in with genuinely low scores, because the two mean very different things to an allocator. The implementation above only counts these via no_access and still returns 0.0 for them, so tag the affected rows explicitly before publishing.
Island / disconnected geographies. A unit on a network component with no facility can still sit inside the Euclidean KD-tree radius of a facility across the water, so the query will not come back empty. Detect these units at the topology stage, not by inspecting NaNs later.
Zero-inflation in supply. Facilities with capacity = 0 (closed, suspended) must be dropped before step one, or they silently dilute every ratio they touch.
Transboundary CRS drift. Multi-state runs that mix UTM zones distort distances near zone seams; force a single equal-area CRS and re-validate after the reproject.
Memory at $N > 50\text{k}$. Per-point query_ball_point calls hold up to a few hundred thousand units; beyond that, switch to cKDTree.sparse_distance_matrix or tile the study area with a halo buffer so cross-tile catchments are not truncated.
Edge-of-study-area inflation. Residents near the boundary appear to have artificially few competitors because demand outside the frame is invisible. Apply buffer padding (include facilities and population one $d_0$ beyond the reporting boundary) and clip results back to the reporting area only at output.
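
Two of the guards above, dropping closed facilities and separating access deserts from low scores, can be sketched as follows. The record layout and the names `drop_closed` and `label_units` are assumptions for illustration.

```python
def drop_closed(facilities):
    """Remove facilities with no usable capacity before step one."""
    return [f for f in facilities if f["capacity"] > 0]

def label_units(scores, reachable):
    """Attach an access class to every unit.

    scores:    {unit_id: sei}
    reachable: {unit_id: number of open facilities within d0}
    Units with nothing reachable get sei=None, not 0.0.
    """
    out = {}
    for uid in sorted(scores):
        if reachable.get(uid, 0) == 0:
            out[uid] = {"sei": None, "access_class": "access_desert"}
        else:
            out[uid] = {"sei": scores[uid], "access_class": "scored"}
    return out

# Hypothetical inputs
facilities = [{"id": "f1", "capacity": 3.5}, {"id": "f2", "capacity": 0.0}]
open_facilities = drop_closed(facilities)
labels = label_units({"u1": 0.0021, "u2": 0.0}, {"u1": 2, "u2": 0})
```

Carrying the class as its own column means a later filter on `sei` cannot silently fold access deserts into the low-score group.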

Compliance & Audit Controls

A SEI that drives funding must be reconstructable on demand. Enforce deterministic execution by sorting every layer on a stable identifier before computation (done in compute_e2sfca). Bind the exact configuration to each artifact with a SHA-256 hash, and serialize outputs with embedded lineage metadata.

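# pyproj==3.6.1 for CRS authority string; pyarrow (geopandas' Parquet engine); hashlib stdlib
import hashlib, json, datetime as dt
import geopandas as gpd
import pyarrow.parquet as pq

def write_sei_output(pop_gdf, sei, cv, config, path):
    pop_gdf = pop_gdf.sort_index()                 # same order compute_e2sfca returns
    pop_gdf["sei"] = sei
    pop_gdf["sei_cv"] = cv
    pop_gdf["sei_unstable"] = cv > 0.15            # parameter-sensitivity flag

    cfg_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()

    authority = pop_gdf.crs.to_authority()         # e.g. ("EPSG", "5070"), or None
    lineage = {
        "title": "Spatial Equity Index",
        "crs_authority": ":".join(authority) if authority else "unknown",
        "config_sha256": cfg_hash,
        "pipeline_version": "sei-2.1.0",
        "generated_utc": dt.datetime.now(dt.timezone.utc).isoformat(),
        "lineage": "E2SFCA over network-constrained catchments; "
                   "inputs harmonized to single projected CRS",
    }

    # ISO 19115-style lineage embedded in the GeoParquet schema metadata.
    # to_parquet() does not take a schema_metadata argument, so write the file,
    # then merge the lineage keys into the Arrow schema metadata (the existing
    # "geo" key is preserved).
    pop_gdf.to_parquet(path, index=True)
    table = pq.read_table(path)
    merged = {**(table.schema.metadata or {}),
              **{k.encode(): v.encode() for k, v in lineage.items()}}
    pq.write_table(table.replace_schema_metadata(merged), path)
    return cfg_hash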

Output schema (stable column names): geoid (stable ID), population, sei, sei_cv, sei_unstable, plus the embedded config_sha256 and crs_authority. Wrap network-dependent steps (OSM graph download, external router calls) in retry-with-backoff, pin all dependencies, and run inside a versioned container so a re-run two years later reproduces the same scores.
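
The retry-with-backoff wrapper mentioned above could be as small as the sketch below. The defaults, the exception tuple, and the injectable `sleep` are illustrative assumptions; delays are deliberately jitter-free so behavior stays deterministic.

```python
import time

def retry_with_backoff(fn, *, attempts=4, base_delay=1.0, factor=2.0,
                       retry_on=(OSError, TimeoutError), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * factor**n, then retry.

    The last failure is re-raised so the pipeline never continues on a
    partial result.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * factor ** attempt)

# Hypothetical flaky router call: fails twice, then succeeds
calls = {"n": 0}
delays = []

def flaky_router():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("router timeout")
    return "ok"

result = retry_with_backoff(flaky_router, sleep=delays.append)
```

Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.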

Production Implementation Checklist

Frequently Asked Questions

When should I use E2SFCA instead of a simple provider-to-population ratio? Whenever the result drives an allocation decision. Ratios ignore where people and clinics sit and assume residents never cross an administrative line for care. E2SFCA models real travel cost, capacity, and competition, so it can defend why one tract scores worse than its neighbor.

Why must everything run in a projected CRS? Geographic coordinates (EPSG:4326) measure angles, not metres, so catchment radii and decay weights computed on lat/long are distorted and that error compounds through both steps. Enforce one projected — or equal-area for large extents — CRS before any distance computation.

How do I pick the distance-decay bandwidth? Set it from observed care-seeking behavior for the service type, not a default fraction of $d_0$. Then run the sensitivity grid: if scores swing more than ~15% across plausible bandwidths, the result is parameter-sensitive and should be flagged rather than reported as firm.

What breaks first at scale? Per-point query_ball_point over hundreds of thousands of census blocks becomes the bottleneck. Switch to cKDTree.sparse_distance_matrix or tile the study area with a one-$d_0$ halo so cross-tile catchments are not truncated, then reassemble.
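
The halo tiling idea can be sketched as a tile-assignment pass. This assumes square tiles on projected coordinates and a square halo (slightly wider than a circular one at the corners); `assign_tiles` is a hypothetical helper.

```python
import math

def assign_tiles(points, tile_size, halo):
    """points: {id: (x, y)} in projected metres.

    Returns {(tx, ty): {"core": [...], "halo": [...]}}. A point is "core" in
    the tile that contains it and "halo" in every other tile whose bounds,
    expanded by `halo`, still cover it.
    """
    tiles = {}
    for pid, (x, y) in sorted(points.items()):
        home = (math.floor(x / tile_size), math.floor(y / tile_size))
        x_range = range(math.floor((x - halo) / tile_size),
                        math.floor((x + halo) / tile_size) + 1)
        y_range = range(math.floor((y - halo) / tile_size),
                        math.floor((y + halo) / tile_size) + 1)
        for tx in x_range:
            for ty in y_range:
                role = "core" if (tx, ty) == home else "halo"
                tiles.setdefault((tx, ty), {"core": [], "halo": []})[role].append(pid)
    return tiles

# Hypothetical points: "a" sits 50 m from the edge of the next tile
points = {"a": (950.0, 500.0), "b": (500.0, 500.0)}
tiles = assign_tiles(points, tile_size=1000.0, halo=100.0)
```

Each tile is then scored using its core and halo members together, and only the core rows are kept when the tiles are reassembled.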

Healthcare Access & Network Analysis Automation — the parent guide tying catchments, capacity, and equity scoring together.
Calculating the Two-Step Floating Catchment Area Index — the step-by-step 2SFCA derivation this page operationalizes.
Drive-Time Isochrone Generation — produces the network-constrained catchments the SEI consumes.
Facility Capacity Allocation Models — converts raw bed/FTE counts into the clinically-weighted supply the index requires.
Batch Routing Error Handling — timeout, retry, and partial-result handling when building the travel-time matrix at scale.