Getis-Ord Gi* Hotspot Detection
Getis-Ord Gi* is a local spatial statistic engineered to identify statistically significant spatial clusters of high (hot) or low (cold) values at the feature level. Unlike global autocorrelation metrics that collapse regional patterns into a single coefficient, Gi* computes a localized z-score and p-value for every spatial unit, enabling public health agencies to allocate intervention resources with precinct-level precision. Within the broader Disease Clustering & Spatial Statistical Modeling framework, Gi* is deployed when case-weighted incidence must be mapped against a defined spatial neighborhood, particularly during active surveillance of vector-borne outbreaks, opioid overdose surges, or environmental exposure events. Production deployment requires strict adherence to coordinate reference system (CRS) alignment, deterministic spatial weighting, and multiple-testing correction to ensure audit-ready outputs.
Compliance-Driven Preprocessing & CRS Alignment
Raw patient coordinates must never enter the analytical environment. HIPAA Safe Harbor and GDPR spatial de-identification thresholds mandate aggregation to stable areal units such as census tracts, block groups, or zip code tabulation areas (ZCTAs). All geometries must be projected to a distance-preserving coordinate system (e.g., EPSG:32618 for UTM Zone 18N or an equivalent state plane projection) to guarantee that Euclidean or network-based distance calculations remain metrically accurate. Store the preprocessed dataset as a versioned GeoDataFrame with explicit metadata tracking aggregation methodology, temporal window, and CRS transformation logs. This establishes a reproducible baseline for downstream statistical validation and regulatory audits.
Spatial Weights Matrix Configuration
The Gi* statistic is highly sensitive to the spatial weights matrix (W). Row-standardization is mandatory to ensure that the sum of weights for each feature equals 1, stabilizing variance across irregularly shaped polygons. For areal data, contiguity-based weights (Queen or Rook) are standard, but distance-based or K-nearest neighbor (KNN) matrices are preferred when administrative boundaries are fragmented or when analyzing mobile populations. Isolated geometries (spatial islands) must be explicitly handled; libpysal provides remap_ids and w_transform utilities to prevent NaN propagation during matrix inversion. Refer to the official PySAL ESDA documentation for implementation details on weight construction and transformation.
Core Computation & Multiple Testing Correction
The Gi* formula evaluates the ratio of the weighted sum of values in a local neighborhood to the global sum of values, standardized by expected mean and variance under spatial randomness:
Where is the spatial weight, is the attribute value, is the sample mean, and is the sample standard deviation. The resulting z-score indicates cluster directionality (positive = hot, negative = cold). Because local statistics evaluate hundreds or thousands of hypotheses simultaneously, uncorrected p-values yield unacceptable false positive rates. The Benjamini-Hochberg False Discovery Rate (FDR) procedure is the production standard for spatial epidemiology, balancing statistical power with type I error control.
The implementation that follows realizes this five-stage pipeline:
flowchart LR A["Aggregate to areal units & project CRS"] --> B["Build spatial weights W (row-standardized)"] B --> C["Compute Gi* z-score per feature"] C --> D["FDR (Benjamini-Hochberg) correction"] D --> E["Classify hotspots / coldspots"]
Production Python Pipeline
The following implementation enforces row-standardization, handles isolated geometries, applies FDR correction, and outputs a validated GeoDataFrame ready for GIS rendering or API consumption.
import geopandas as gpd
import numpy as np
import libpysal
from libpysal.weights import KNN, Queen
from esda.getisord import G_Local
from statsmodels.stats.multitest import fdrcorrection
def compute_getis_ord_gi_star(
gdf: gpd.GeoDataFrame,
incidence_col: str,
k_neighbors: int = 4,
fdr_alpha: float = 0.05,
permutations: int = 999,
weight_type: str = "knn"
) -> gpd.GeoDataFrame:
"""
Production-ready Getis-Ord Gi* hotspot detection pipeline.
Assumes gdf is projected to a metric CRS and incidence_col contains
aggregated case counts or standardized rates.
"""
# 1. Spatial validation: drop null geometries and verify CRS
gdf = gdf.dropna(subset=["geometry"]).copy()
if gdf.crs.is_geographic:
raise ValueError("Geographic CRS detected. Project to a metric CRS (e.g., UTM/State Plane) before computation.")
# 2. Construct spatial weights matrix
if weight_type.lower() == "knn":
w = KNN.from_dataframe(gdf, k=k_neighbors)
elif weight_type.lower() == "queen":
w = Queen.from_dataframe(gdf)
else:
raise ValueError("Unsupported weight_type. Use 'knn' or 'queen'.")
w.transform = "r" # Row-standardize
w.full() # Ensure full matrix representation for esda compatibility
# 3. Compute Gi* statistic (star=True includes the focal feature in its own neighborhood)
y = gdf[incidence_col].values.astype(float)
gi_star = G_Local(y, w, star=True, permutations=permutations)
# 4. Extract results and apply FDR correction
z_scores = gi_star.Zs
raw_p_values = gi_star.p_sim
# Handle potential NaNs from permutations or isolated features
valid_mask = ~np.isnan(z_scores)
corrected_p = np.full_like(raw_p_values, np.nan)
if valid_mask.any():
_, corrected_p[valid_mask] = fdrcorrection(raw_p_values[valid_mask], alpha=fdr_alpha)
# 5. Assemble output GeoDataFrame
out_gdf = gdf.copy()
out_gdf["gi_z_score"] = z_scores
out_gdf["gi_p_raw"] = raw_p_values
out_gdf["gi_p_fdr"] = corrected_p
out_gdf["gi_significant"] = (corrected_p < fdr_alpha) & valid_mask
out_gdf["gi_cluster_type"] = np.select(
[
(out_gdf["gi_significant"]) & (out_gdf["gi_z_score"] > 0),
(out_gdf["gi_significant"]) & (out_gdf["gi_z_score"] < 0)
],
["hotspot", "coldspot"],
default="not_significant"
)
return out_gdf
Parameter Tuning & Audit Validation
Bandwidth selection and neighborhood definition directly influence Gi* sensitivity. Overly dense neighborhoods dilute localized intensity, while sparse neighborhoods amplify noise and edge effects. For continuous surveillance, implement adaptive bandwidths that scale with population density rather than fixed Euclidean radii. Detailed methodology for radius calibration and kernel decay functions is documented in Optimizing Bandwidth for Getis-Ord Gi* Heatmaps.
Statistical validation requires sensitivity analysis across multiple k values or contiguity thresholds. Log all weight matrix parameters, permutation seeds, and FDR alpha levels in a metadata registry. When integrating with point-level surveillance, cross-validate Gi* outputs against K-Function & Point Pattern Analysis to distinguish true spatial clustering from underlying population heterogeneity. For temporal surveillance, apply rolling-window aggregation and enforce strict version control on input shapefiles to maintain chain-of-custody compliance. All outputs should be exported with explicit projection metadata and accompanied by a reproducible execution script to satisfy federal audit requirements and inter-agency data sharing protocols.