Production-Grade Local Moran’s I Pipeline for Infectious Disease Surveillance

Operational deployment of Local Indicators of Spatial Association (LISA) in public health surveillance requires deterministic execution, explicit statistical controls, and strict data governance. Unlike exploratory spatial analysis, outbreak response pipelines must handle fragmented administrative boundaries, enforce metric coordinate systems, and apply multiple testing corrections to prevent false-positive cluster declarations. This architecture implements a validated, audit-ready Local Moran’s I workflow tailored for infectious disease incidence mapping, with explicit gates for geometry validation, spatial weight construction, and regulatory compliance.

1. Data Ingestion, Geometry Validation, and Compliance Enforcement

Surveillance feeds frequently contain topological defects, unprojected coordinates, or protected health information (PHI) that violate HIPAA/GDPR requirements before spatial statistics execute. The ingestion layer must enforce CRS alignment, repair invalid geometries, and strip direct identifiers deterministically.

import geopandas as gpd
import pandas as pd
import numpy as np
import logging
from shapely.validation import make_valid

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")

def ingest_and_validate_spatial_data(
    gdf: gpd.GeoDataFrame,
    target_epsg: int = 32617,
    rate_col: str = "incidence_rate"
) -> gpd.GeoDataFrame:
    """Enforce CRS alignment, repair geometries, and strip PHI per compliance policy."""
    # Geometry validation & repair
    invalid_mask = ~gdf.geometry.is_valid
    if invalid_mask.any():
        gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(make_valid)
        logging.warning(f"Repaired {invalid_mask.sum()} invalid geometries.")
        
    # CRS enforcement for metric distance/contiguity
    if gdf.crs is None or gdf.crs.to_epsg() == 4326:
        gdf = gdf.to_crs(epsg=target_epsg)
        logging.info(f"Projected to EPSG:{target_epsg} for metric spatial operations.")
    elif gdf.crs.to_epsg() != target_epsg:
        gdf = gdf.to_crs(epsg=target_epsg)
        
    # PHI/PII compliance gate
    sensitive_patterns = ["ssn", "name", "phone", "email", "patient_id", "mrn"]
    sensitive_cols = [c for c in gdf.columns if any(p in c.lower() for p in sensitive_patterns)]
    if sensitive_cols:
        gdf.drop(columns=sensitive_cols, inplace=True)
        logging.info("Stripped PII/PHI columns per compliance policy.")
        
    # Zero-variance rate check
    if gdf[rate_col].std() == 0:
        raise ValueError("Incidence rate column has zero variance. Local Moran's I cannot compute.")
        
    return gdf

2. Spatial Weights Matrix Construction with Island Resolution

The spatial weights matrix (W) dictates neighborhood topology and directly controls Local Moran’s I sensitivity. Contiguity-based weights (Queen/Rook) routinely fail in rural jurisdictions, coastal tracts, or fragmented health districts where administrative boundaries produce isolated polygons (“islands”). Distance-based or k-nearest neighbor weights mitigate boundary artifacts but require explicit row-standardization to ensure statistical comparability across heterogeneous population densities.

import libpysal
from libpysal.weights import Queen, KNN, WSP2W

def construct_spatial_weights(gdf: gpd.GeoDataFrame, k: int = 4) -> libpysal.weights.W:
    """Build row-standardized spatial weights with automatic island fallback."""
    try:
        # Primary: Queen contiguity
        W = Queen.from_dataframe(gdf, use_index=True)
        if W.islands:
            logging.warning(f"Queen weights contain {len(W.islands)} islands. Switching to KNN fallback.")
            raise ValueError("Islands detected")
    except Exception:
        # Fallback: KNN with explicit row-standardization
        W = KNN.from_dataframe(gdf, k=k, use_index=True)
        
    # Row-standardization is mandatory for LISA comparability
    W.transform = "r"
    logging.info(f"Spatial weights constructed: {W.n} observations, transform='{W.transform}'.")
    return W

3. Local Moran’s I Computation and Multiple Testing Correction

Local Moran’s I evaluates spatial autocorrelation at the observation level, but uncorrected p-values across hundreds of tracts guarantee false positives. Production pipelines must implement permutation-based inference and apply False Discovery Rate (FDR) control. The esda library provides optimized LISA computation, while statsmodels handles rigorous multiple testing correction.

import esda
from statsmodels.stats.multitest import multipletests
from typing import Tuple

def compute_local_moran_with_fdr(
    gdf: gpd.GeoDataFrame,
    W: libpysal.weights.W,
    rate_col: str = "incidence_rate",
    permutations: int = 999,
    alpha: float = 0.05
) -> Tuple[gpd.GeoDataFrame, dict]:
    """Execute Local Moran's I with permutation inference and FDR correction."""
    y = gdf[rate_col].values.astype(float)
    
    # LISA computation
    lisa = esda.Moran_Local(y, W, permutations=permutations)
    
    # FDR correction for spatial multiple testing
    _, pvals_corrected, _, _ = multipletests(lisa.p_sim, alpha=alpha, method="fdr_bh")
    
    # Attach results to GeoDataFrame
    gdf["lisa_i"] = lisa.Is
    gdf["p_value"] = lisa.p_sim
    gdf["p_fdr_corrected"] = pvals_corrected
    gdf["significant_fdr"] = pvals_corrected < alpha
    
    logging.info(f"LISA computed. Significant clusters (FDR<0.05): {gdf['significant_fdr'].sum()}")

    # Moran_Local exposes per-feature results (Is, p_sim); the single global I/p
    # come from the global Moran statistic, computed separately.
    global_mi = esda.Moran(y, W, permutations=permutations)
    return gdf, {"global_moran": global_mi.I, "global_p": global_mi.p_sim}

4. Cluster Classification, Validation Gates, and Audit Export

Raw LISA statistics require deterministic classification into High-High (HH), Low-Low (LL), High-Low (HL), and Low-High (LH) quadrants. Public health dashboards must filter out statistically insignificant results and export only validated, de-identified outputs for interagency sharing.

def classify_clusters_and_export(
    gdf: gpd.GeoDataFrame,
    rate_col: str = "incidence_rate",
    output_path: str = "lisa_outbreak_clusters.parquet"
) -> gpd.GeoDataFrame:
    """Classify LISA quadrants, apply validation gates, and export audit-ready output."""
    mean_rate = gdf[rate_col].mean()
    
    # Quadrant classification
    conditions = [
        (gdf[rate_col] > mean_rate) & (gdf["lisa_i"] > 0),
        (gdf[rate_col] < mean_rate) & (gdf["lisa_i"] > 0),
        (gdf[rate_col] > mean_rate) & (gdf["lisa_i"] < 0),
        (gdf[rate_col] < mean_rate) & (gdf["lisa_i"] < 0)
    ]
    choices = ["HH", "LL", "HL", "LH"]
    gdf["cluster_type"] = np.select(conditions, choices, default="NS")
    
    # Validation gate: retain only FDR-significant clusters
    gdf["is_significant_cluster"] = gdf["significant_fdr"] & (gdf["cluster_type"] != "NS")
    
    # Export with schema enforcement
    export_cols = ["cluster_type", "lisa_i", "p_fdr_corrected", "is_significant_cluster", "geometry"]
    gdf[export_cols].to_parquet(output_path, index=False)
    logging.info(f"Audit-ready LISA results exported to {output_path}")
    
    return gdf[gdf["is_significant_cluster"]]

Integration Notes for Agency Pipelines

This workflow isolates spatial validation, weight construction, and statistical correction into discrete, testable modules. When integrating into broader Disease Clustering & Spatial Statistical Modeling frameworks, ensure that incidence rates are age-standardized or population-adjusted prior to LISA execution to prevent ecological fallacy. For continuous surveillance deployments, wrap these functions in a scheduled orchestration layer (e.g., Airflow or Prefect) with explicit dependency checks on upstream case reporting feeds. The deterministic structure aligns directly with established Global & Local Moran’s I Implementation standards, enabling reproducible outbreak detection across jurisdictional boundaries while maintaining strict compliance with health data governance requirements.