Global & Local Moran’s I Implementation

This guide is part of Disease Clustering & Spatial Statistical Modeling, and covers the production deployment of Global and Local Moran’s I for spatial autocorrelation testing and spatial-outlier detection across areal public health surveillance data. Global Moran’s I returns a single coefficient summarizing whether high-incidence and low-incidence units are spatially clustered region-wide; Local Moran’s I (LISA) decomposes that signal to the feature level, labelling each unit as a hotspot, a coldspot, or a spatial outlier. Reach for this pair when an agency needs to confirm that clustering exists at all before steering response budgets, and to distinguish genuine high-amid-high transmission from isolated units that diverge from their neighbors.

Concept & Epidemiological Alignment

Moran’s I quantifies the correlation between a unit’s analysis value and the spatially weighted mean of its neighbors’ values. A positive global coefficient means like values sit near like values (clustering of high or low incidence); a coefficient near the expected value indicates a spatially random surface; a negative coefficient indicates a checkerboard pattern where high units are systematically bordered by low ones. The local form, Local Indicators of Spatial Association (LISA), runs the same logic per feature and additionally separates the kind of association: agreement with neighbors (High-High hotspots, Low-Low coldspots) versus disagreement (High-Low and Low-High spatial outliers).

Use the global test as a gate. Compute it first; only when it is significant does feature-level LISA mapping carry interpretive weight, because mapping local clusters on a globally random surface manufactures false positives. Where the surveillance question is purely where is intensity concentrated — hot versus cold, with no interest in outliers — Getis-Ord Gi* Hotspot Detection is the more direct local statistic; Moran’s I earns its place specifically when High-Low and Low-High spatial outliers are operationally meaningful (for example, a single elevated tract embedded in a quiet region that warrants a targeted investigation).

Three assumptions must hold before Moran’s I is epidemiologically valid in surveillance data:

The analysis variable is a population-normalized rate or a count with an explicit expected baseline. Raw case counts encode the population denominator, so what looks like clustered high counts can simply be densely populated census tracts grouped together. Aggregate to an age-standardized or expected-adjusted rate before computing weights, or the autocorrelation you detect is population structure, not disease structure.
A single, consistent areal support. Mixing census tracts, ZIP Code Tabulation Areas, and point geocodes in one run yields a neighborhood graph that cannot be interpreted. Resolve to one areal geometry before building the spatial weights matrix.
Spatial stationarity is plausible, or its violation is acknowledged. The global statistic assumes one autocorrelation regime across the whole region; strong density gradients or administrative discontinuities break that assumption. When global and local results conflict, treat it as evidence of non-stationarity rather than overriding the local signal.
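
The first assumption can be illustrated with a minimal sketch. The frame below is hypothetical (the column names `cases` and `population` and the three tracts are illustrative assumptions, not part of any real surveillance schema), and it uses a crude rate with a region-wide expected baseline rather than full age standardization, which would need stratum-specific counts.

```python
import pandas as pd

# Hypothetical tract-level counts; values and column names are illustrative only.
tracts = pd.DataFrame({
    "geoid": ["A", "B", "C"],
    "cases": [10, 10, 40],
    "population": [1_000, 10_000, 40_000],
})

# Crude rate per 100,000 residents.
tracts["rate_per_100k"] = tracts["cases"] / tracts["population"] * 100_000

# Expected count if every tract experienced the region-wide rate.
regional_rate = tracts["cases"].sum() / tracts["population"].sum()
tracts["expected"] = tracts["population"] * regional_rate

# Observed / expected ratio: values above 1 mean more cases than the baseline predicts.
tracts["obs_exp_ratio"] = tracts["cases"] / tracts["expected"]

print(tracts[["geoid", "cases", "rate_per_100k", "obs_exp_ratio"]])
```

Tract C has the largest raw count, yet its rate and observed/expected ratio match tract B; only tract A is elevated once the denominator is accounted for.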

Method Selection

Surveillance question	Preferred method	Why not Moran’s I
Does any spatial clustering exist region-wide?	Global Moran’s I	—
Which areal units are high/low-but-surrounded-by-the-opposite (spatial outliers)?	Local Moran’s I (LISA)	—
Which units are simply high-intensity vs low-intensity hotspots?	Getis-Ord Gi* Hotspot Detection	LISA reports an outlier sign Gi* lacks, but Gi* gives cleaner hot/cold intensity.
Is there clustering at unknown distance bands in raw point geocodes?	K-Function & Point Pattern Analysis	Moran’s I needs areal aggregation and a neighborhood graph.
Where and when is the single most likely cluster across many windows?	Spatial Scan Statistics Configuration	Moran’s I tests a fixed neighborhood, not variable-radius space-time windows.

A robust workflow runs the global test first, then drills into local clusters only when global autocorrelation is significant.

The Moran’s I Statistic

Global Moran’s I is the cross-product of mean-centered values weighted by the spatial weights matrix, normalized by the variance of the variable and the total of the weights:

$$
I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
$$

Here $n$ is the number of areal units, $w_{ij}$ is the spatial weight between units $i$ and $j$, $x_i$ is the analysis value for unit $i$, and $\bar{x}$ is the sample mean. Under the null of spatial randomness the expected value is $E[I] = -1/(n-1)$, which approaches zero only for large $n$. That detail matters when reporting: compare the observed statistic against this expectation rather than against a naive zero.
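
To make the formula concrete, here is a small worked sketch in plain NumPy. The four-unit chain and its values are assumptions chosen for easy arithmetic, and the weights are left binary (not row-standardized) so the normalizing term is visible.

```python
import numpy as np

# Toy example: four units in a chain 0-1-2-3 with a smooth gradient of values.
x = np.array([1.0, 2.0, 3.0, 4.0])
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)  # binary contiguity, w_ij = 1 for adjacent units

n = x.size
z = x - x.mean()                 # mean-centered values
s0 = W.sum()                     # total of all weights
moran_i = (n / s0) * (z @ W @ z) / (z @ z)
expected_i = -1.0 / (n - 1)      # null expectation, not zero

print(f"I = {moran_i:.4f}, E[I] = {expected_i:.4f}")
```

With only four units the null expectation is far from zero, which is why the observed value should be reported alongside $E[I]$ and $n$.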

The local decomposition assigns each unit its own indicator, and the global statistic is proportional to the mean of the local ones:

$$
I_i = \frac{(x_i - \bar{x})}{m_2} \sum_{j=1}^{n} w_{ij}\,(x_j - \bar{x}), \qquad m_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}
$$

The sign of $I_i$ together with the unit’s own value sets the LISA class: a high unit with a positive $I_i$ is a High-High hotspot, a low unit with a positive $I_i$ is a Low-Low coldspot, and a negative $I_i$ marks a High-Low or Low-High spatial outlier. Because each unit is one of thousands of simultaneous hypotheses, uncorrected pseudo p-values inflate the false-positive rate; the Benjamini-Hochberg False Discovery Rate (FDR) procedure is the production standard for controlling that inflation while preserving power.
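
The link between the local and global forms can be checked numerically. The sketch below is a toy under stated assumptions: a four-unit chain with row-standardized weights, for which the mean of the local indicators equals the global coefficient.

```python
import numpy as np

# Toy chain of four units; values are illustrative.
x = np.array([1.0, 2.0, 3.0, 4.0])
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W = W / W.sum(axis=1, keepdims=True)   # row-standardize: each row sums to 1

n = x.size
z = x - x.mean()
m2 = (z @ z) / n                       # second moment, as defined above

local_i = (z / m2) * (W @ z)           # one indicator per unit
global_i = (n / W.sum()) * (z @ W @ z) / (z @ z)

print("local I_i:", local_i)
print("mean of local:", local_i.mean(), "global:", global_i)
```

Every unit here has a positive $I_i$ because each value agrees in sign with its neighbors' average; a negative entry would mark a candidate spatial outlier.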

Spatial Data Prerequisites

Geometry type: clean, non-overlapping polygons at one areal support (tracts, block groups, or ZCTAs). Validate topology before building weights — a single self-intersecting polygon corrupts the contiguity graph.
CRS projection: reproject to an equal-area system appropriate to the jurisdiction (Albers Equal Area Conic, a national grid, or the correct UTM zone) so adjacency and distance reflect true spatial exposure. Enforce a canonical projection per the Coordinate Reference Systems for Public Health standard; assert crs.is_geographic is false before any neighbor computation.
Minimum sample size: permutation inference is unstable below roughly 30 units, and the null expectation $-1/(n-1)$ is noticeably different from zero at small $n$ (about $-0.034$ at $n = 30$). Report $n$ alongside the statistic.
Covariate requirements: the analysis column must be a population-normalized rate or an expected-adjusted count with no nulls; impute or flag missing units explicitly rather than letting them silently break the weights graph.
Boundary handling: features intersecting the study edge must be buffered with a spatial-join mask or flagged in execution metadata. Silently clipping or dropping boundary polygons introduces spatial bias and breaks reproducibility for agency reporting.

Reproject and validate topology before building weights:

# geopandas 0.14.x, pyproj 3.6.x
import geopandas as gpd

gdf = gdf.to_crs("EPSG:5070")  # jurisdiction-specific equal-area CRS (e.g. US Albers)
assert not gdf.crs.is_geographic, "Project to an equal-area CRS before neighbor computation"
if not gdf.is_valid.all():
    gdf["geometry"] = gdf.geometry.make_valid()  # repair self-intersections, do not drop
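
The non-geometric prerequisites (no nulls, unique IDs, a workable number of units) can be checked in the same spirit. This is a minimal sketch: `preflight` is a hypothetical helper name, and the default of 30 units simply mirrors the rough guidance above rather than any library rule.

```python
import pandas as pd

def preflight(frame: pd.DataFrame, variable: str, id_col: str = "geoid",
              min_units: int = 30) -> list[str]:
    """Return human-readable problems; an empty list means the frame may proceed."""
    problems = []
    if frame[id_col].duplicated().any():
        problems.append(f"duplicate IDs in {id_col!r}")
    n_null = int(frame[variable].isna().sum())
    if n_null:
        problems.append(f"{n_null} null value(s) in {variable!r}; impute or flag explicitly")
    if len(frame) < min_units:
        problems.append(f"only {len(frame)} units; permutation inference is unstable below ~{min_units}")
    return problems

# Hypothetical three-row frame with one missing rate.
demo = pd.DataFrame({"geoid": ["A", "B", "C"], "rate": [1.2, None, 3.4]})
issues = preflight(demo, "rate")
for issue in issues:
    print("PREFLIGHT:", issue)
```

Returning a list instead of raising lets the caller log every problem in one pass before deciding whether to halt.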

Production Implementation

The pipeline builds a row-standardized weights matrix, runs the global gate, and only computes LISA when the global test is significant. Isolated polygons (zero neighbors) are logged and flagged rather than dropped, the permutation seed is pinned for byte-identical reruns, and every parameter is captured for the audit trail.

The weights matrix defines the neighborhood structure and therefore dictates the autocorrelation result. For administrative units, Queen or Rook contiguity is standard; for irregular surveillance grids, distance-based k-nearest-neighbor (KNN) weights are the usual alternative. For vector-borne or airborne pathogens, derive any distance threshold from documented transmission radii rather than an arbitrary default.
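
To see how much the contiguity rule changes the neighborhood, the sketch below builds Queen and Rook adjacency by hand for a regular 3x3 lattice. It is a plain NumPy illustration, not the libpysal construction; `lattice_weights` is a hypothetical helper.

```python
import numpy as np

def lattice_weights(rows, cols, queen=True):
    """Binary adjacency for a regular grid; cells are numbered row by row."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a cell is not its own neighbor
                    if not queen and dr != 0 and dc != 0:
                        continue  # Rook ignores corner-only contact
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        W[r * cols + c, rr * cols + cc] = 1.0
    return W

w_queen = lattice_weights(3, 3, queen=True)
w_rook = lattice_weights(3, 3, queen=False)

queen_counts = w_queen.sum(axis=1).astype(int)
rook_counts = w_rook.sum(axis=1).astype(int)
print("Queen neighbor counts:", queen_counts)
print("Rook neighbor counts: ", rook_counts)

# Row-standardize so every unit's neighbors share a total weight of 1.
w_queen_std = w_queen / w_queen.sum(axis=1, keepdims=True)
```

The center cell has eight Queen neighbors but only four Rook neighbors, so the same data can produce different spatial lags under the two rules.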

# libpysal 4.9.x, esda 2.5.x, numpy 1.26.x
import libpysal
import numpy as np
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def build_weights(gdf, id_col="geoid"):
    """Row-standardized Queen contiguity; isolates flagged, not dropped."""
    gdf = gdf.sort_values(id_col).reset_index(drop=True)  # deterministic ordering
    w = libpysal.weights.Queen.from_dataframe(gdf, ids=gdf[id_col].tolist())
    w.transform = "r"  # row-standardize: each row sums to 1
    if w.islands:
        logging.warning("Isolated polygons (zero neighbors): %s — flagged for audit", w.islands)
    return gdf, w

from esda.moran import Moran, Moran_Local

def compute_global_moran(gdf, variable, w, seed=42, permutations=9999):
    """Global gate. Permutation inference avoids the normality assumption."""
    np.random.seed(seed)
    y = gdf[variable].to_numpy()
    mi = Moran(y, w, permutations=permutations, two_tailed=True)
    return {
        "I_obs": mi.I,
        "I_exp": mi.EI,            # -1 / (n - 1)
        "z_sim": mi.z_sim,
        "p_sim": mi.p_sim,         # permutation-derived pseudo p-value
        "permutations": permutations,
        "is_significant": mi.p_sim < 0.05,
    }

def compute_local_moran(gdf, variable, w, seed=42, permutations=9999, fdr_alpha=0.05):
    """LISA with Benjamini-Hochberg FDR. Run only when the global gate is significant."""
    from statsmodels.stats.multitest import multipletests
    np.random.seed(seed)
    y = gdf[variable].to_numpy()
    lisa = Moran_Local(y, w, permutations=permutations, seed=seed)
    reject, p_fdr, _, _ = multipletests(lisa.p_sim, alpha=fdr_alpha, method="fdr_bh")
    labels = {1: "HH", 2: "LH", 3: "LL", 4: "HL"}  # esda quadrant codes
    out = gdf.copy()
    out["lisa_I"] = lisa.Is
    out["lisa_p_raw"] = lisa.p_sim
    out["lisa_p_fdr"] = p_fdr
    out["lisa_significant"] = reject
    out["lisa_class"] = [labels[q] if r else "ns" for q, r in zip(lisa.q, reject)]
    return out

def run_pipeline(gdf, variable, id_col="geoid", seed=42):
    gdf, w = build_weights(gdf, id_col=id_col)
    g = compute_global_moran(gdf, variable, w, seed=seed)
    logging.info("Global Moran's I=%.4f p=%.4f (n=%d)", g["I_obs"], g["p_sim"], w.n)
    if not g["is_significant"]:
        logging.info("Global autocorrelation not significant — halting before LISA")
        return g, None
    local = compute_local_moran(gdf, variable, w, seed=seed)
    return g, local

If the global statistic is not significant at the documented threshold, halt downstream local analysis to prevent false-positive cluster mapping. Validate the global output against historical surveillance baselines to confirm the expected spatial structure before any resource allocation.
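
For intuition about the pseudo p-value that drives this gate, here is a hand-rolled permutation test in NumPy. It is a sketch under stated assumptions: a one-sided upper-tail rule on a toy 20-unit chain, which may differ in detail from how a given library folds the two tails.

```python
import numpy as np

def morans_i(x, W):
    """Global Moran's I for values x and weights matrix W."""
    z = x - x.mean()
    return (x.size / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_p_value(x, W, permutations=999, seed=42):
    """Shuffle values across units and count simulated I at least as large as observed."""
    rng = np.random.default_rng(seed)      # pinned seed for reproducible reruns
    observed = morans_i(x, W)
    sims = np.array([morans_i(rng.permutation(x), W) for _ in range(permutations)])
    extreme = int((sims >= observed).sum())
    return observed, (extreme + 1) / (permutations + 1)

# Toy chain of 20 units carrying a smooth gradient (strong positive autocorrelation).
n = 20
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
x = np.arange(n, dtype=float)

observed, p_sim = permutation_p_value(x, W)
print(f"I = {observed:.4f}, pseudo p = {p_sim:.4f}")
```

The `+ 1` in numerator and denominator counts the observed arrangement as one of the permutations, which is why the pseudo p-value can never reach zero.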

When the global gate passes, each unit is placed on the Moran scatterplot by its standardized value (z) against the spatially lagged mean of its neighbors (Wz). The quadrant it lands in — read against the global means — determines its LISA class: agreement with neighbors yields hotspots or coldspots, disagreement yields spatial outliers.
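
The quadrant logic can be sketched directly from $z$ and $Wz$. The six-unit chain and its values are assumptions picked so that all four classes appear; the labels describe scatterplot position only and say nothing about significance, which still requires the permutation and FDR steps.

```python
import numpy as np

# Toy chain of six units; values are illustrative.
x = np.array([10.0, 8.0, 1.0, 3.0, 12.0, 2.0])
n = x.size

W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)       # row-standardized weights

z = (x - x.mean()) / x.std()               # standardized values
lag = W @ z                                # spatially lagged mean of neighbors (Wz)

# Quadrant by sign of own value and neighbor average.
classes = [
    ("HH" if l > 0 else "HL") if v > 0 else ("LH" if l > 0 else "LL")
    for v, l in zip(z, lag)
]
print(classes)
```

Units 1 and 4 (zero-based) are high values surrounded by a low neighbor average, the High-Low pattern that would prompt a targeted investigation if it survives correction.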

Pathogen-specific LISA workflows — including integration with temporal incidence curves to separate sustained transmission from transient noise — are detailed in Calculating Local Moran’s I for Infectious Disease Outbreaks.

Parameter Selection & Tuning

Weights type. Queen contiguity captures shared edges and corners and is the safe default for tract-level surveillance; Rook (edges only) suits gridded supports where corner-touching is not a meaningful adjacency. Switch to KNN or distance-band weights only when contiguity produces many islands (coastal or fragmented geographies) or when a pathogen’s transmission radius defines the relevant neighborhood.
Row-standardization. Always apply w.transform = "r". It normalizes influence across units with different neighbor counts so that a tract with twelve neighbors does not dominate one with three.
Permutations. Use permutations=9999. The minimum achievable pseudo p-value is 1 / (permutations + 1), so a run with 999 permutations cannot report anything below 0.001, which is too coarse for FDR ranking across thousands of units; 9999 permutations lowers that floor to 0.0001. More permutations reduce Monte Carlo variance at a compute cost that grows linearly.
Significance threshold and FDR. Set a documented fdr_alpha (typically 0.05) and apply Benjamini-Hochberg across all local p-values. Report both the raw and FDR-adjusted p-values; never map units on raw pseudo p-values alone.
Seed. Pin the permutation seed so z-scores, p-values, and classifications are byte-identical across reruns and compute environments.
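
The interaction between the permutation floor and FDR can be shown with a hand-written Benjamini-Hochberg step-up rule. The scenario is an assumption made for illustration: 1,000 units, of which 10 sit at the smallest pseudo p-value the permutation count allows and the rest are clearly null.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m   # rank-dependent cutoffs
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()            # largest rank meeting its cutoff
        reject[order[: k + 1]] = True
    return reject

def floor_scenario(permutations, n_units=1000, n_extreme=10):
    """Count FDR rejections when n_extreme units sit at the p-value floor."""
    floor = 1.0 / (permutations + 1)
    p = np.full(n_units, 0.5)
    p[:n_extreme] = floor
    return int(benjamini_hochberg(p).sum())

rejected_999 = floor_scenario(999)
rejected_9999 = floor_scenario(9999)
print("999 permutations: ", rejected_999, "units pass FDR")
print("9999 permutations:", rejected_9999, "units pass FDR")
```

In this scenario the same ten units are invisible at 999 permutations and all recovered at 9999, because only the lower floor fits under the rank-10 cutoff of 0.0005.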

Edge Cases & Failure Modes

Island polygons (zero neighbors). Contiguity weights leave coastal or detached units with no neighbors; their LISA statistic is undefined. Log the offending IDs and apply a domain-justified fallback — nearest-neighbor injection or an explicit exclusion flag — rather than dropping them silently, which biases the surface.
Zero-inflation and rare counts. Sparse rare-disease counts violate the variance assumptions behind the statistic, producing unstable z-scores. Aggregate to a coarser support, lengthen the temporal window, or switch to an expected-adjusted rate before computing weights.
Transboundary CRS drift. Multi-jurisdictional studies that concatenate layers in different projections corrupt adjacency. Reproject every input to one documented equal-area CRS and assert agreement before the union.
Memory for N > 50k. Dense weight representations scale quadratically. Use the sparse representation (w.sparse) and cache validated matrices; for real-time surveillance, recompute weights only for modified or newly reported polygons rather than rebuilding the full graph.
Global/local disagreement. A non-significant global result with apparently strong local clusters usually signals non-stationarity or an artifact of population heterogeneity. Cross-validate against K-Function & Point Pattern Analysis at the point level and against Getis-Ord Gi* Hotspot Detection under a different weighting scheme; if LISA and Gi* diverge beyond roughly 15% spatial mismatch, trigger a manual review and log weight-matrix sensitivity diagnostics.
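
The island fallback can be sketched without any spatial library. The adjacency matrix and centroids below are hypothetical, and linking each island to its single nearest centroid is just one of the domain-justified options mentioned above, not a recommended default.

```python
import numpy as np

# Hypothetical binary contiguity: unit 3 touches nothing (for example, an offshore tract).
adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
], dtype=float)
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 4.0]])  # projected units

# Islands are rows with no neighbors at all.
islands = np.flatnonzero(adjacency.sum(axis=1) == 0)

patched = adjacency.copy()
injected = []                                  # audit record of every synthetic link
for i in islands:
    dist = np.linalg.norm(centroids - centroids[i], axis=1)
    dist[i] = np.inf                           # never link a unit to itself
    j = int(np.argmin(dist))
    patched[i, j] = patched[j, i] = 1.0        # symmetric nearest-neighbor link
    injected.append((int(i), j))

row_std = patched / patched.sum(axis=1, keepdims=True)
print("islands:", islands.tolist(), "injected links:", injected)
```

Keeping `injected` alongside the output makes the fallback visible in the audit trail instead of silently altering the graph.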

Compliance & Audit Controls

Government and inter-agency deployments require that any published cluster be reproducible from logged inputs. Aggregate or hash all patient-level identifiers to administrative units that clear disclosure thresholds before any spatial computation.

Deterministic execution. Sort by a stable feature ID before building weights and pin the permutation seed, so every rerun reproduces identical I values, p-values, and classifications.
Configuration logging. Record weights type, transform, permutation count, seed, fdr_alpha, CRS EPSG code, temporal window, and aggregation method in a metadata registry. Hash the input geometry file and the configuration block with SHA-256 and attach both hashes to every output.
Output schema. Emit explicit, documented columns — lisa_I, lisa_p_raw, lisa_p_fdr, lisa_significant, lisa_class — alongside the stable ID, and write ISO 19115 lineage metadata (projection, source, processing steps) with the file. Prefer serialization formats covered in Spatial Data Types & Formats so provenance survives interagency handoff.
Environment pinning. Deploy via containers with pinned geopandas, libpysal, esda, and GDAL versions to prevent silent statistical drift across dependency updates.
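
The hashing step in the configuration-logging control might look like the sketch below, using only the standard library. The helper names and the keys and values in `run_config` are illustrative assumptions; a real registry would carry the agency's own schema.

```python
import hashlib
import json

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large geometry files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def sha256_of_config(config):
    """Hash a canonical JSON form so key order cannot change the digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical configuration block mirroring the parameters listed above.
run_config = {
    "weights": "queen",
    "transform": "r",
    "permutations": 9999,
    "seed": 42,
    "fdr_alpha": 0.05,
    "crs": "EPSG:5070",
}
config_hash = sha256_of_config(run_config)
print("config sha256:", config_hash)
```

Sorting keys before hashing means two runs with identical parameters produce the same digest even if the dictionary was assembled in a different order.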

Production Readiness Checklist

All inputs reprojected to one documented equal-area CRS; crs.is_geographic asserted false
Patient-level coordinates aggregated to areal units that clear disclosure thresholds
Analysis column is a population-normalized rate or expected-adjusted count with no nulls
Frame sorted by a stable ID and permutation seed pinned for byte-identical reruns
Weights row-standardized (w.transform = "r"); islands detected and flagged, not dropped
Global Moran’s I run first as a gate; LISA computed only when it is significant
permutations=9999; FDR (Benjamini-Hochberg) applied at a documented alpha
Weights type, permutations, alpha, seed, CRS, and input hash recorded in the metadata registry
Output written with explicit I/p/classification columns and ISO 19115 lineage metadata

Related Guides

Disease Clustering & Spatial Statistical Modeling — parent overview of the end-to-end surveillance pipeline.
Getis-Ord Gi* Hotspot Detection — intensity-focused local statistic to cross-validate LISA hotspots.
K-Function & Point Pattern Analysis — second-order clustering for raw point geocodes before areal aggregation.
Spatial Scan Statistics Configuration — variable-radius space-time cluster scanning.
Calculating Local Moran’s I for Infectious Disease Outbreaks — pathogen-specific LISA workflow with temporal integration.