Getis-Ord Gi* Hotspot Detection

This guide is part of Disease Clustering & Spatial Statistical Modeling, and covers the production deployment of the Getis-Ord Gi* local statistic for feature-level hotspot and coldspot detection in public health surveillance. Gi* computes a localized z-score and p-value for every spatial unit, letting an agency rank census tracts or block groups by the statistical intensity of high-incidence clustering rather than collapsing a region into a single global coefficient. Use it during active surveillance of vector-borne outbreaks, opioid overdose surges, or environmental exposure events, where intervention budgets must be steered to specific, defensible geographies.

Concept & Epidemiological Alignment

Gi* measures the degree to which a feature and its defined neighborhood deviate from a spatially random distribution of the analysis variable. The statistic sums the analysis values inside the neighborhood of feature i (including i itself when computed in star form) and compares that local sum to what would be expected if all values were randomly redistributed across the study area. A large positive z-score marks a feature embedded in a high-value neighborhood — a hotspot; a large negative z-score marks a low-value neighborhood — a coldspot.
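To make the local-sum comparison concrete, here is a minimal from-scratch sketch that uses the standardized form of the statistic given in The Gi* Statistic section. It assumes binary weights, a standard deviation computed with divisor n, and a hypothetical chain of six areal units with made-up rates; it is an illustration, not a substitute for a tested library routine.

```python
import math

def gi_star(values, neighbors):
    """Gi* z-scores with binary weights; each neighborhood includes the focal unit."""
    n = len(values)
    mean = sum(values) / n
    # Standard deviation with divisor n (assumption stated in the lead-in)
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    z = []
    for i in range(n):
        hood = set(neighbors[i]) | {i}      # star form: add the focal unit
        w_sum = len(hood)                   # sum of binary weights
        w_sq_sum = len(hood)                # sum of squared binary weights
        local_sum = sum(values[j] for j in hood)
        numerator = local_sum - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        z.append(numerator / denominator)
    return z

# Hypothetical rates for six units arranged in a line: 0-1-2-3-4-5
rates = [2.0, 3.0, 2.0, 4.0, 15.0, 18.0]
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
z_scores = gi_star(rates, chain)
for unit, z in enumerate(z_scores):
    print(unit, round(z, 2))
```

The units at the high end of the chain receive positive z-scores and the units at the low end receive negative ones, which is the hot/cold signing described above.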

Three assumptions must hold before Gi* is epidemiologically valid:

The analysis variable is a rate or count comparable across units. Raw case counts conflate incidence with population. Aggregate to an age-standardized or population-normalized rate, or model counts against an expected baseline, so that a hotspot reflects elevated risk rather than elevated denominator.
Spatial support is consistent. Mixing tracts, ZCTAs, and point geocodes in one run produces a neighborhood structure that is not interpretable. Resolve to a single areal support before building weights.
Underlying population is roughly homogeneous within neighborhoods, or its heterogeneity is captured by the weights. Where population density varies sharply, fixed-distance neighborhoods over- or under-bound the at-risk population; an adaptive neighborhood is required (covered under Optimizing Bandwidth for Getis-Ord Gi* Heatmaps).
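The first assumption can be sketched as a small normalization step. The function name, the per-100,000 scaling, and the `min_population` reliability threshold below are hypothetical choices for illustration; a real program would use its own documented thresholds and, where appropriate, age standardization.

```python
def crude_rate_per_100k(cases, population, min_population=1000):
    """Return (rate, reliable) for one areal unit.

    min_population is a hypothetical reliability threshold, not a standard.
    """
    if population <= 0:
        return None, False
    rate = cases * 100_000 / population
    return rate, population >= min_population

# Hypothetical units: A and B share a case count but differ tenfold in population
units = [
    {"geoid": "A", "cases": 12, "population": 4000},
    {"geoid": "B", "cases": 12, "population": 40000},
    {"geoid": "C", "cases": 1, "population": 150},
]
for u in units:
    u["rate"], u["reliable"] = crude_rate_per_100k(u["cases"], u["population"])
    print(u["geoid"], u["rate"], u["reliable"])
```

Units A and B have identical counts but very different rates, and unit C produces a large rate from a single case, which is why the sketch carries a reliability flag alongside the rate.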

Gi* answers where clustering concentrates. It does not, on its own, distinguish a focal point source from a diffuse gradient, and it does not test for clusters of arbitrary geometry across a time window. Choose the method that matches the surveillance question.

Method Selection

Surveillance question	Preferred method	Why not Gi*
Which areal units are significant high/low intensity clusters?	Getis-Ord Gi*	—
Are units high-but-surrounded-by-low (spatial outliers)?	Global & Local Moran’s I Implementation	Gi* cannot flag HL/LH outliers; it only signs hot vs cold.
Is there clustering at unknown distance bands in raw point geocodes?	K-Function & Point Pattern Analysis	Gi* needs areal aggregation and a fixed neighborhood.
Where and when is the most likely cluster, scanning many windows?	Spatial Scan Statistics Configuration	Gi* tests fixed neighborhoods, not variable-radius scanning windows.

A common production pattern runs a global Moran’s I test first to confirm that any clustering exists, then drills into Gi* only when global autocorrelation is significant — this avoids interpreting local noise as signal.
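The gating pattern can be sketched in pure Python as follows. This is a simplified illustration that assumes binary weights and a one-sided permutation test for positive autocorrelation; the function names and the example values are hypothetical, and a production pipeline would more likely rely on a maintained implementation such as the Moran class in esda.

```python
import random

def morans_i(values, neighbors):
    """Global Moran's I with binary weights given as {unit: [neighbor units]}."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(len(nbrs) for nbrs in neighbors.values())
    cross = sum(dev[i] * dev[j] for i, nbrs in neighbors.items() for j in nbrs)
    return (n / w_total) * cross / sum(d * d for d in dev)

def global_gate(values, neighbors, permutations=999, alpha=0.05, seed=42):
    """Return (I, pseudo p-value, proceed) where proceed gates the Gi* run."""
    rng = random.Random(seed)               # seeded for reproducible reruns
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least_as_extreme = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_extreme += 1
    p_value = (at_least_as_extreme + 1) / (permutations + 1)
    return observed, p_value, p_value <= alpha

# Hypothetical chain of 12 units with a low block followed by a high block
values = [1, 1, 2, 2, 3, 3, 10, 11, 12, 12, 13, 14]
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 12] for i in range(12)}
observed, p_value, proceed = global_gate(values, chain)
print(round(observed, 3), p_value, proceed)
```

Only when `proceed` is true would the pipeline continue to the local statistic.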

The Gi* z-score for each feature places it on a signed significance scale: high values surrounded by high neighbors push the statistic above the upper critical threshold (a hotspot), low values surrounded by low neighbors push it below the lower threshold (a coldspot), and features near zero remain statistically indistinguishable from spatial randomness.

Spatial Data Prerequisites

Gi* operates on a single layer of areal polygons (or polygon centroids) carrying one numeric analysis column. Before computation, satisfy every prerequisite below — each maps to a validation gate enforced in the pipeline that follows.

Geometry type: clean, valid polygons (or their centroids). Run gdf.is_valid.all() and repair with make_valid(); drop or flag null geometries explicitly rather than silently.
Projection: a projected CRS with linear units, never a geographic one. No projection preserves distance everywhere, but within a single zone the distortion is small: project to UTM (e.g., EPSG:32618 for Zone 18N) or the appropriate State Plane zone so that contiguity and distance neighborhoods are metrically accurate. Enforce canonical projections per Coordinate Reference Systems for Public Health before any neighborhood is built.
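Privacy de-identification: raw patient coordinates must never enter the analytical environment. Aggregate to stable areal units (census tracts, block groups, or ZCTAs) and apply the disclosure thresholds documented in Compliance Mapping Frameworks. Aggregation alone does not establish HIPAA Safe Harbor status, because that method restricts geographic identifiers below the state level on individual records; treat small-cell suppression as a separate gate.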
Analysis variable: a population-normalized rate or a count paired with an expected baseline, with no nulls in the analysis column. Impute or exclude missing values with a logged, domain-justified rule.
Minimum sample size: permutation inference is unstable below roughly 30 features; below that, neighborhoods overlap so heavily that z-scores lose discriminating power. Report the feature count in execution metadata.
Stable identifier: a unique, sortable feature ID (FIPS, GEOID) used to order the frame deterministically before computation, so repeated runs produce byte-identical output.
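Several of these prerequisites reduce to simple checks on the attribute table. The sketch below covers only the tabular gates (unique IDs, no missing values, minimum feature count) on plain dictionaries; the function names and record layout are hypothetical, and geometry and CRS checks are left to the pipeline itself.

```python
import math

def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def check_prerequisites(records, id_key, value_key, min_features=30):
    """Return a list of gate failures; an empty list means every gate passed."""
    failures = []
    ids = [r[id_key] for r in records]
    if len(set(ids)) != len(ids):
        failures.append("duplicate feature IDs")
    if any(is_missing(r[value_key]) for r in records):
        failures.append("null values in analysis column")
    if len(records) < min_features:
        failures.append(f"only {len(records)} features; minimum is {min_features}")
    return failures

# Hypothetical records with a duplicated GEOID and a missing rate
sample = [
    {"geoid": "36061000100", "rate": 12.5},
    {"geoid": "36061000100", "rate": 8.1},
    {"geoid": "36061000201", "rate": None},
]
failures = check_prerequisites(sample, "geoid", "rate")
print(failures)
```

Returning every failure at once, rather than raising on the first, makes it easier to log a complete gate report in execution metadata.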

The implementation that follows realizes a five-stage pipeline: spatial validation, weights construction, Gi* computation, FDR correction, and output assembly.

The Gi* Statistic

The Gi* statistic is reported as a z-score: the weighted sum of values in the local neighborhood, minus its expected value under spatial randomness, divided by the standard deviation of that sum:

$$
G_i^* = \frac{\sum_{j=1}^{n} w_{ij} x_j - \bar{X} \sum_{j=1}^{n} w_{ij}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^2 - \left(\sum_{j=1}^{n} w_{ij}\right)^2}{n-1}}}
$$

Where $w_{ij}$ is the spatial weight between features i and j (with $w_{ii} \neq 0$ in the star form), $x_j$ is the attribute value at j, $n$ is the number of features, $\bar{X}$ is the mean of $x$ over all features, and $S$ is its standard deviation computed with divisor $n$, that is $S = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{X}^2}$. The sign of the resulting z-score indicates cluster directionality (positive = hot, negative = cold). Because local statistics evaluate hundreds or thousands of hypotheses simultaneously, uncorrected p-values yield unacceptable false-positive rates. The Benjamini-Hochberg False Discovery Rate (FDR) procedure is a widely used correction in spatial epidemiology: it controls the expected share of false discoveries among flagged features while retaining more power than family-wise corrections such as Bonferroni.
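The Benjamini-Hochberg step-up procedure is short enough to write out. The sketch below is a from-scratch illustration on hypothetical p-values, intended to show what the adjustment does; the pipeline in this guide calls fdrcorrection from statsmodels rather than hand-rolled code.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (reject flags, adjusted p-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    reject = [p <= alpha for p in adjusted]
    return reject, adjusted

# Hypothetical raw p-values for six features
raw = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
reject, adjusted = benjamini_hochberg(raw)
for p, q, r in zip(raw, adjusted, reject):
    print(p, round(q, 4), r)
```

Four of the six raw p-values fall below 0.05, but only the two smallest remain significant after adjustment, which illustrates why the raw and corrected columns are both kept in the output.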

Production Implementation

The following implementation enforces row-standardization, handles isolated geometries, applies FDR correction, and outputs a validated GeoDataFrame ready for GIS rendering or API consumption. Pinned versions: geopandas==0.14.*, libpysal==4.11.*, esda==2.5.*, statsmodels==0.14.*, numpy==1.26.*.

import geopandas as gpd
import numpy as np
import libpysal
from libpysal.weights import KNN, Queen
from esda.getisord import G_Local
from statsmodels.stats.multitest import fdrcorrection

def compute_getis_ord_gi_star(
    gdf: gpd.GeoDataFrame,
    incidence_col: str,
    id_col: str,
    k_neighbors: int = 4,
    fdr_alpha: float = 0.05,
    permutations: int = 999,
    weight_type: str = "knn",
    seed: int = 42
) -> gpd.GeoDataFrame:
    """
    Production-ready Getis-Ord Gi* hotspot detection pipeline.
    Assumes gdf is projected to a metric CRS and incidence_col contains
    aggregated case counts or standardized rates.
    """
    # 0. Drop null or empty geometries, then sort by a stable ID and reset the
    #    index so row positions, weights IDs, and y stay aligned across reruns
    gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty]
    gdf = gdf.sort_values(id_col).reset_index(drop=True)
    np.random.seed(seed)

    # 1. Spatial validation: verify CRS and reject nulls in the analysis column
    if gdf.crs is None or gdf.crs.is_geographic:
        raise ValueError("Geographic/undefined CRS. Project to a metric CRS (UTM/State Plane) before computation.")
    if gdf[incidence_col].isna().any():
        raise ValueError(f"Null values in '{incidence_col}'. Impute or exclude with a logged rule first.")

    # 2. Construct spatial weights matrix
    if weight_type.lower() == "knn":
        w = KNN.from_dataframe(gdf, k=k_neighbors)
    elif weight_type.lower() == "queen":
        w = Queen.from_dataframe(gdf)
        if w.islands:
            # Fall back to KNN when Queen weights produce isolated polygons
            w = KNN.from_dataframe(gdf, k=k_neighbors)
    else:
        raise ValueError("Unsupported weight_type. Use 'knn' or 'queen'.")

    w.transform = "r"  # Row-standardize

    # 3. Compute Gi* statistic (star=True includes the focal feature in its own neighborhood)
    y = gdf[incidence_col].values.astype(float)
    gi_star = G_Local(y, w, star=True, permutations=permutations, seed=seed)

    # 4. Extract results and apply FDR correction
    z_scores = gi_star.Zs
    raw_p_values = gi_star.p_sim

    # Handle potential NaNs from permutations or isolated features
    valid_mask = ~np.isnan(z_scores)
    corrected_p = np.full_like(raw_p_values, np.nan)
    if valid_mask.any():
        _, corrected_p[valid_mask] = fdrcorrection(raw_p_values[valid_mask], alpha=fdr_alpha)

    # 5. Assemble output GeoDataFrame
    out_gdf = gdf.copy()
    out_gdf["gi_z_score"] = z_scores
    out_gdf["gi_p_raw"] = raw_p_values
    out_gdf["gi_p_fdr"] = corrected_p
    out_gdf["gi_significant"] = (corrected_p < fdr_alpha) & valid_mask
    out_gdf["gi_cluster_type"] = np.select(
        [
            (out_gdf["gi_significant"]) & (out_gdf["gi_z_score"] > 0),
            (out_gdf["gi_significant"]) & (out_gdf["gi_z_score"] < 0)
        ],
        ["hotspot", "coldspot"],
        default="not_significant"
    )

    return out_gdf

Parameter Selection & Tuning

Four parameters drive Gi* output; each must be chosen against the surveillance context and recorded for reproducibility.

Weights type. Contiguity-based weights (Queen or Rook) are standard for complete areal coverage, where adjacency encodes a meaningful “neighbor” relationship. K-nearest-neighbor or distance-band weights are preferred when administrative boundaries are fragmented, when polygons vary wildly in size, or when modeling mobile populations whose exposure is not bounded by tract lines.
Row-standardization. Setting w.transform = "r" so each feature’s weights sum to 1 is mandatory; it puts features with different neighbor counts on a common scale and prevents units with many neighbors from dominating the statistic.
Neighborhood size (k or bandwidth). Overly dense neighborhoods dilute localized intensity into the regional mean; sparse neighborhoods amplify noise and edge effects. For continuous surveillance, use adaptive bandwidths that scale with population density rather than a fixed Euclidean radius — radius calibration and kernel decay functions are detailed in Optimizing Bandwidth for Getis-Ord Gi* Heatmaps.
Significance threshold and FDR strategy. Apply Benjamini-Hochberg FDR at a documented alpha (commonly 0.05). Run a sensitivity analysis across several k values or contiguity thresholds and confirm that the set of significant hotspots is stable; clusters that appear only at a single parameter setting are candidates for review, not action.
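The sensitivity sweep can be summarized with a simple overlap measure. The sketch below assumes the Gi* pipeline has already been run at several k values and that the significant hotspot IDs from each run are available; the tract IDs, the k values, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of feature IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def stable_hotspots(runs, min_share=1.0):
    """IDs flagged in at least min_share of the parameter settings."""
    counts = {}
    for ids in runs.values():
        for geoid in set(ids):
            counts[geoid] = counts.get(geoid, 0) + 1
    needed = min_share * len(runs)
    return sorted(g for g, c in counts.items() if c >= needed)

# Hypothetical hotspot IDs returned at three k settings
runs = {
    4: ["T01", "T02", "T03"],
    6: ["T01", "T02", "T03", "T07"],
    8: ["T01", "T02"],
}
stable = stable_hotspots(runs)
overlap_4_6 = jaccard(runs[4], runs[6])
print(stable, overlap_4_6)
```

Here T01 and T02 survive every setting, while T03 and T07 appear only at some values of k and would be routed to review rather than action.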

Edge Cases & Failure Modes

Island polygons. Contiguity weights can produce features with zero neighbors, which can yield NaN z-scores that propagate through downstream joins. Detect them via w.islands and fall back to KNN, or attach a domain-justified nearest neighbor; never drop them silently.
Zero-inflation. Sparse case data with many zero-count units flattens the variance term and can manufacture spurious coldspots. Use a population-offset rate or smooth small-area estimates before running Gi*, and flag units whose denominator falls below a reliability threshold.
Transboundary CRS drift. Multi-jurisdiction studies that span UTM zones or mix State Plane zones introduce distance error at the seams. Reproject every input to one documented CRS and assert a single gdf.crs before building weights; see Coordinate Reference Systems for Public Health.
Memory constraints for N > 50k. Dense permutation inference scales poorly past tens of thousands of features. Use sparse weight representations, reduce permutations only after confirming p-value stability, and consider tiling the study area into overlapping panes with a buffer to preserve edge neighborhoods.
Edge effects. Features on the study boundary have truncated neighborhoods and biased statistics. Buffer the analysis extent or flag boundary units in execution metadata rather than reporting them as confirmed hotspots.
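The nearest-neighbor injection mentioned for island polygons can be sketched on a plain adjacency dictionary. The function name, the centroid coordinates, and the choice to add a symmetric link are assumptions made for illustration; this does not use or mirror any libpysal API.

```python
import math

def attach_islands(neighbors, centroids):
    """Give every zero-neighbor unit a symmetric link to its nearest unit.

    Returns the patched adjacency and the list of islands for the audit log.
    """
    patched = {u: set(nbrs) for u, nbrs in neighbors.items()}
    islands = sorted(u for u, nbrs in neighbors.items() if not nbrs)
    for isl in islands:
        # Iterate candidates in sorted order so ties resolve deterministically
        nearest = min(
            (u for u in sorted(centroids) if u != isl),
            key=lambda u: math.dist(centroids[isl], centroids[u]),
        )
        patched[isl].add(nearest)
        patched[nearest].add(isl)
    return {u: sorted(nbrs) for u, nbrs in patched.items()}, islands

# Hypothetical contiguity: C shares no border with A or B
neighbors = {"A": ["B"], "B": ["A"], "C": []}
centroids = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}
patched, islands = attach_islands(neighbors, centroids)
print(patched, islands)
```

Returning the island list alongside the patched adjacency keeps the intervention visible, so it can be recorded in execution metadata instead of happening silently.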

Compliance & Audit Controls

Federal audit and inter-agency data-sharing protocols require that any published hotspot be reproducible from logged inputs.

Deterministic execution. Sort by a stable feature ID before computation and pin the permutation seed, so every rerun reproduces the same z-scores, p-values, and classifications.
Configuration logging. Record weights type, k/bandwidth, fdr_alpha, permutation count, seed, CRS EPSG code, temporal window, and aggregation method in a metadata registry. Hash the input geometry file and the configuration block with SHA-256 and attach both hashes to every output.
Output schema. Emit explicit, documented columns — gi_z_score, gi_p_raw, gi_p_fdr, gi_significant, gi_cluster_type — alongside the stable ID, and write ISO 19115 lineage metadata (projection, source, processing steps) with the file. Prefer serialization formats covered in Spatial Data Types & Formats so provenance survives interagency handoff.
Validation cross-check. When the surveillance program also holds point-level data, cross-validate Gi* areal hotspots against K-Function & Point Pattern Analysis to confirm that detected clustering is not an artifact of population heterogeneity.
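The hashing step under configuration logging might look like the following minimal sketch, using only the standard library. The configuration keys and values are hypothetical examples of the parameters listed above, and the canonical-JSON convention (sorted keys, compact separators) is one reasonable choice rather than a mandated format.

```python
import hashlib
import json

def config_hash(config):
    """SHA-256 of a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def file_hash(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical run configuration
run_config = {
    "weight_type": "knn",
    "k_neighbors": 4,
    "fdr_alpha": 0.05,
    "permutations": 999,
    "seed": 42,
    "crs_epsg": 32618,
}
print(config_hash(run_config))
```

Both digests would then be written next to the output file so that a reviewer can confirm a published hotspot map came from the logged inputs.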

Production Readiness Checklist

All inputs reprojected to one documented metric CRS; crs.is_geographic asserted false
Patient-level coordinates aggregated to areal units that clear disclosure thresholds
Analysis column is a population-normalized rate or count with a baseline; no nulls
Frame sorted by a stable ID and permutation seed pinned for byte-identical reruns
Weights row-standardized (w.transform = "r"); islands detected and handled, not dropped
FDR (Benjamini-Hochberg) applied at a documented alpha; sensitivity sweep over k logged
Weights type, k/bandwidth, alpha, seed, CRS, and input hash recorded in the metadata registry
Output written with explicit z/p/classification columns and ISO 19115 lineage metadata

Related Guides

Disease Clustering & Spatial Statistical Modeling — parent overview of the end-to-end surveillance pipeline.
Global & Local Moran’s I Implementation — autocorrelation testing and spatial-outlier detection that complements Gi*.
K-Function & Point Pattern Analysis — second-order clustering for raw point geocodes.
Spatial Scan Statistics Configuration — variable-radius space-time cluster scanning.
Optimizing Bandwidth for Getis-Ord Gi* Heatmaps — adaptive neighborhood calibration for this method.