Optimizing Bandwidth for Getis-Ord Gi* Heatmaps

Bandwidth calibration dictates the spatial weights matrix in Getis-Ord Gi* analysis, directly controlling neighbor inclusion, local statistic sensitivity, and the spatial scale of inference. In public health surveillance, arbitrary distance thresholds inflate Type I/II error rates, obscure micro-transmission clusters, and produce intervention maps that lack epidemiological defensibility. Production systems must replace heuristic distance guessing with automated, metric-driven optimization anchored to pathogen dispersal radii, care-access catchments, or vector mobility constraints. Proper calibration within a broader Disease Clustering & Spatial Statistical Modeling architecture ensures that local hotspot outputs remain statistically rigorous and operationally actionable.

Operational Constraints & Epidemiological Alignment

The Getis-Ord Gi* statistic partitions local weighted sums against global expectations, but the weights matrix itself is bandwidth-dependent. Fixed-distance thresholds perform poorly in jurisdictions with heterogeneous population density: urban cores become over-connected, while rural surveillance grids generate isolated islands or artificially sparse neighborhoods. Conversely, k-nearest-neighbor (k-NN) approaches preserve neighbor count but distort spatial continuity, violating the assumption that proximity drives transmission risk.

For automated Getis-Ord Gi* Hotspot Detection, bandwidth must be optimized against a stability metric rather than visual preference. Production pipelines typically sweep a metric distance range, compute Gi* at each step, apply Benjamini-Hochberg False Discovery Rate (FDR) correction, and select the threshold that maximizes statistically significant hotspot yield while minimizing spatial fragmentation. This approach aligns statistical output with operational intervention scales and prevents over-smoothing of localized outbreaks.

Production-Ready Optimization Pipeline

The following pipeline automates bandwidth selection using libpysal, esda, and statsmodels. It enforces projected CRS alignment, validates spatial weights topology, applies FDR stabilization, and returns an audit-ready configuration log.

import geopandas as gpd
import libpysal
import numpy as np
import pandas as pd
from esda import G_Local
from statsmodels.stats.multitest import multipletests
import logging
import warnings
from datetime import datetime

warnings.filterwarnings("ignore")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    handlers=[logging.StreamHandler()]
)

def optimize_gistar_bandwidth(
    gdf: gpd.GeoDataFrame,
    variable: str,
    bw_range: np.ndarray = np.arange(1000, 15000, 1000),
    fdr_alpha: float = 0.05,
    min_neighbors: int = 3,
    row_standardize: bool = True
) -> dict:
    """
    Automated bandwidth optimization for Getis-Ord Gi* using FDR-stabilized 
    hotspot yield and spatial weights validation.
    """
    # 1. CRS Enforcement
    if gdf.crs is None or not gdf.crs.is_projected:
        target_crs = gdf.estimate_utm_crs()
        logging.info(f"Non-projected CRS detected. Auto-projecting to {target_crs}")
        gdf = gdf.to_crs(target_crs)
    else:
        target_crs = gdf.crs

    # 2. Data Preparation
    y = gdf[variable].values.astype(float)
    if np.any(np.isnan(y)) or np.any(np.isinf(y)):
        raise ValueError("Input variable contains NaN/Inf. Impute or filter before execution.")

    results_log = []
    best_bw = None
    best_score = -np.inf
    best_weights = None
    best_result = None

    # 3. Bandwidth Sweep
    for bw in bw_range:
        try:
            # Distance-based weights with topology validation
            w = libpysal.weights.DistanceBand.from_dataframe(
                gdf, threshold=bw, binary=False, silence_warnings=True
            )
            
            # Handle row-standardization and island detection
            if row_standardize:
                w.transform = 'R'
            
            if w.n != len(gdf):
                logging.warning(f"BW {bw}m: Topology mismatch. Skipping.")
                continue
                
            # Enforce minimum neighbor threshold (w.cardinalities is a dict keyed by feature id)
            neighbor_counts = np.array(list(w.cardinalities.values()))
            if np.min(neighbor_counts) < min_neighbors:
                logging.info(f"BW {bw}m: Below min_neighbors threshold. Skipping.")
                continue

            # Gi* computation (weights already row-standardized above; star=True for Gi*)
            gi = G_Local(y, w, star=True)
            
            # FDR correction
            _, pvals_fdr, _, _ = multipletests(gi.p_sim, alpha=fdr_alpha, method='fdr_bh')
            
            # Scoring: Maximize significant hotspots while penalizing fragmentation
            # Fragmentation proxy: count of isolated significant polygons (1-neighbor significant clusters)
            sig_mask = pvals_fdr < fdr_alpha
            hotspot_count = np.sum(sig_mask)
            
            # Stability score: balance yield vs over-segmentation
            score = hotspot_count - (0.1 * np.sum(neighbor_counts[sig_mask] <= min_neighbors))
            
            results_log.append({
                "bandwidth_m": bw,
                "hotspot_count": int(hotspot_count),
                "fdr_alpha": fdr_alpha,
                "score": float(score),
                "min_neighbors": int(np.min(neighbor_counts)),
                "timestamp": datetime.utcnow().isoformat()
            })
            
            if score > best_score:
                best_score = score
                best_bw = bw
                best_weights = w
                best_result = {
                    "z_scores": gi.z_sim,
                    "p_values": gi.p_sim,
                    "p_fdr": pvals_fdr,
                    "hotspot_flag": sig_mask
                }
                
        except Exception as e:
            logging.error(f"BW {bw}m failed: {e}")
            continue

    if best_bw is None:
        raise RuntimeError("No valid bandwidth found in specified range. Adjust bw_range or min_neighbors.")

    logging.info(f"Optimal bandwidth: {best_bw}m (Score: {best_score:.2f})")
    
    return {
        "optimal_bandwidth_m": best_bw,
        "weights_matrix": best_weights,
        "results": best_result,
        "optimization_log": pd.DataFrame(results_log),
        "crs_used": target_crs.to_epsg(),
        "fdr_alpha": fdr_alpha
    }

Edge-Case Handling & Spatial Validation

Production deployments encounter spatial anomalies that break naive implementations. The pipeline above incorporates explicit guards, but additional validation steps are required for high-stakes public health analytics:

  1. Island & Boundary Correction: Distance thresholds near jurisdictional boundaries create artificial edge effects. Apply a spatial buffer equal to the maximum tested bandwidth before weight generation, then clip results post-calculation. Alternatively, use libpysal.weights.Kernel with adaptive bandwidths for continuous density surfaces.
  2. Zero-Inflation & Rate Data: Raw case counts violate Gi* assumptions when denominators vary. Transform inputs to age-adjusted rates, SMRs, or apply a small constant (+1e-6) to prevent division-by-zero during row-standardization. Validate variance homogeneity using Levene’s test before execution.
  3. Memory Constraints: For N > 50,000, dense weight matrices exhaust RAM. Switch to libpysal.weights.KNN with a fixed k derived from the optimal fixed distance, or utilize sparse matrix representations via scipy.sparse. The ESDA documentation details sparse weight handling for large-scale surveillance grids.
  4. CRS Misalignment Drift: Automated UTM estimation (estimate_utm_crs()) can fail for transboundary datasets spanning multiple zones. Enforce explicit EPSG codes in configuration files and validate coordinate precision (gdf.geometry.x.round(2)) to prevent floating-point topology breaks.

Audit & Compliance Enforcement

Regulatory and interagency data sharing requires reproducible, version-pinned spatial workflows. Implement the following compliance controls:

  • Deterministic Execution: Sort the input GeoDataFrame by a stable identifier (gdf = gdf.sort_values("case_id").reset_index(drop=True)) before weight generation. Spatial algorithms are order-sensitive; inconsistent sorting produces non-reproducible p-values.
  • Configuration Logging: Persist all parameters, CRS EPSG codes, FDR method, and library versions to a JSON audit trail. Include SHA-256 hashes of input shapefiles to guarantee data lineage.
  • Output Standardization: Export results as a validated GeoDataFrame with explicit columns: gi_z, gi_p_raw, gi_p_fdr, hotspot_flag, bandwidth_m. Attach a metadata dictionary compliant with ISO 19115 spatial data standards.
  • Validation Gates: Before deployment, run a topology check (assert w.n == len(gdf)) and verify that no row-standardized weights sum to zero. Log warnings if >5% of features fall below min_neighbors, as this indicates scale mismatch requiring epidemiological review.

Automated bandwidth optimization transforms Getis-Ord Gi* from a heuristic visualization tool into a defensible surveillance instrument. By enforcing metric CRS alignment, validating spatial weights topology, and stabilizing outputs via FDR correction, public health teams can deploy hotspot detection pipelines that withstand statistical scrutiny and operational scaling.