What does the K-function tell me that Moran's I does not?

The K-function answers at what distance scale events cluster, computed directly on precise event coordinates. Moran's I and Getis-Ord Gi* operate on counts aggregated to polygons and answer whether neighbouring administrative units are similar. Use the K-function for point events and questions of scale; use Moran's I or Gi* when data arrive pre-aggregated.

Why do I get clustering everywhere even with no real disease signal?

The homogeneous Poisson null assumes constant intensity, but disease cases track population density. When the at-risk population is non-stationary, the homogeneous K-function reports the population structure as clustering. Switch to the inhomogeneous K-function with an intensity surface estimated from a population or at-risk denominator.

How many Monte Carlo simulations do I need?

The smallest attainable pointwise p-value is 1/(n + 1), where n is the number of simulations: 0.001 for 999 simulations and 0.0001 for 9999. Use 999 simulations for routine screening and 9999 when a distance band sits close to the envelope and you need a defensible p below 0.001. Always pin the random seed so the envelope is reproducible.

Why is edge correction mandatory?

Near the study boundary, some true neighbours fall outside the window and go uncounted, which systematically deflates K(d) and produces false-negative clustering signals. Translation or isotropic correction reweights boundary pairs to compensate; the chosen method must be recorded because results are not comparable across methods.

K-Function & Point Pattern Analysis: Production-Ready Implementation for Public Health Surveillance

This guide is part of Disease Clustering & Spatial Statistical Modeling, and covers how to turn geocoded event locations into scale-specific, significance-tested clustering signals using Ripley’s K-function and its companion second-order statistics. The operational purpose is narrow and high-stakes: decide at what distance disease cases, vector habitats, or exposure coordinates depart from complete spatial randomness, so that intervention radii and surveillance windows are set from evidence rather than convention.

Concept & Epidemiological Alignment

Ripley’s K-function, $K(d)$, is a second-order summary of a point process: it measures the expected number of additional events within distance $d$ of a typical event, normalized by the overall intensity $\lambda$ (events per unit area). Where first-order statistics describe how many events occur, $K(d)$ describes how events relate to one another across a continuum of scales. This is what distinguishes point pattern analysis from lattice methods such as Global & Local Moran’s I Implementation or Getis-Ord Gi* Hotspot Detection: those operate on counts aggregated to administrative polygons and inherit the modifiable areal unit problem, whereas $K(d)$ operates directly on event coordinates and preserves the resolution at which transmission actually happens.

The estimator and its variance-stabilizing transform are:

$$\hat{K}(d) = \frac{|A|}{n(n-1)} \sum_{i \neq j} w_{ij}\, \mathbf{1}\!\left(d_{ij} \le d\right), \qquad \hat{L}(d) = \sqrt{\hat{K}(d)/\pi} - d$$

where $|A|$ is the study window area, $n$ the event count, $d_{ij}$ the inter-event distance, $\mathbf{1}(\cdot)$ the indicator function, and $w_{ij}$ an edge-correction weight. Besag’s $L(d)$ centers the homogeneous-Poisson null expectation at zero, so $L(d) > 0$ reads as clustering and $L(d) < 0$ as inhibition (regularity) at scale $d$.
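To make the estimator concrete, here is a minimal NumPy sketch of $\hat{K}(d)$ and $\hat{L}(d)$ with every edge weight $w_{ij}$ set to 1 (that is, no edge correction). The function name `naive_k_and_l` and the toy coordinates are illustrative assumptions, not part of any library; the sketch is meant for reading the formula, not for production use.

```python
import numpy as np


def naive_k_and_l(coords, area, support):
    """Uncorrected Ripley K and Besag L (every edge weight w_ij set to 1)."""
    coords = np.asarray(coords, dtype=float)
    support = np.asarray(support, dtype=float)
    n = len(coords)
    # Full pairwise distance matrix; fine for small n, O(n^2) memory.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    off_diag = ~np.eye(n, dtype=bool)      # exclude i == j
    pair_d = dist[off_diag]                # ordered pairs, i != j
    counts = np.array([(pair_d <= d).sum() for d in support])
    k_hat = area / (n * (n - 1)) * counts
    l_hat = np.sqrt(k_hat / np.pi) - support
    return k_hat, l_hat


# Hypothetical toy pattern: four corners of a unit square in a 10 x 10 window.
pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
k_hat, l_hat = naive_k_and_l(pts, area=100.0, support=[1.0, 1.5])
```

In this toy pattern the four events sit far closer together than four uniformly scattered points in a 100-unit window would, so $\hat{L}(d)$ is positive at both distances.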

When to use it — and when not to. Reach for the K-function when you have precise event points and a genuine spatial question of scale: Is there clustering, and over what distance band? Prefer Moran’s I or Gi* when your data arrive pre-aggregated to tracts or counties. Prefer Spatial Scan Statistics Configuration when where and when must be answered jointly and you need a single most-likely cluster with a location and a window. The K-function answers “at what scale,” not “exactly where” — it is a global, scale-resolved diagnostic that should precede localized cluster mapping.

Assumptions that must hold in surveillance data. The homogeneous-Poisson null assumes constant intensity across the window. Public health point patterns almost never satisfy this: cases track population density. Treating population-driven heterogeneity as disease clustering is the single most common analytical failure. Where intensity is non-stationary, use the inhomogeneous K-function $K_{\text{inhom}}(d)$ with an intensity surface estimated from a population or at-risk denominator, so that the null becomes “clustered no more than the underlying population is.”
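The reweighting behind $K_{\text{inhom}}(d)$ can be sketched in a few lines. The version below assumes the intensity has already been evaluated at each event location (for example from a population surface) and is passed in as `lam`; it omits edge correction, and the name `naive_k_inhom` is a hypothetical helper rather than a library function.

```python
import numpy as np


def naive_k_inhom(coords, lam, area, support):
    """Uncorrected inhomogeneous K: each pair is down-weighted by local intensity."""
    coords = np.asarray(coords, dtype=float)
    lam = np.asarray(lam, dtype=float)     # assumed intensity at each event
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    weight = 1.0 / np.outer(lam, lam)      # 1 / (lambda_i * lambda_j)
    np.fill_diagonal(weight, 0.0)          # drop i == j
    return np.array([(weight * (dist <= d)).sum() / area for d in support])


# Hypothetical toy pattern: same four events, two assumed intensity levels.
pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
flat = naive_k_inhom(pts, lam=[0.04] * 4, area=100.0, support=[1.5])
dense = naive_k_inhom(pts, lam=[0.08] * 4, area=100.0, support=[1.5])
```

Doubling the assumed local intensity quarters the statistic: the same four events look far less remarkable once the denominator says many people live there.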

Method Selection

Situation	Recommended estimator	Why
Precise event points, roughly uniform at-risk population	Homogeneous $K(d)$ / $L(d)$	Null of complete spatial randomness is defensible
Event points, population density varies across window	Inhomogeneous $K_{\text{inhom}}(d)$	Controls for first-order intensity; isolates true clustering
Need clustering vs. regularity at a glance	Besag $L(d)$	Linearizes null at zero, stabilizes variance across scales
Pre-aggregated counts per polygon	Moran’s I / Getis-Ord Gi*	K-function needs point geometry, not areal units
Where and when, single most-likely cluster	Spatial scan statistic	K-function gives scale, not location or onset
Pairwise scale signature without cumulative blur	Pair correlation $g(d)$	Non-cumulative; separates adjacent scales the cumulative $K$ merges

Spatial Data Prerequisites

Point pattern analysis is unforgiving about geometry and projection. Validate every prerequisite before computing a single distance.

Geometry type: a single-part Point layer. Split any mixed polygon/point inputs; resolve multipoints to their constituents.
CRS: a planar, metric projection — UTM for localized investigations, Albers Equal Area or Lambert Conformal Conic for multi-jurisdiction extents. Distances on unprojected WGS84 degrees are geometrically meaningless and distort with latitude. Choose the projection per Coordinate Reference Systems for Public Health and record the authority code.
Study window: the true sampling frame — a vector-control operational boundary, a healthcare catchment, an environmental monitoring extent — not an arbitrary administrative polygon. A misaligned window biases every neighbor count near the boundary.
Topology: drop exact duplicate coordinates, snap GPS drift, and assert all points fall inside the window polygon.
Minimum sample size: roughly $n \ge 30$ events for stable envelopes; below that, Monte Carlo bounds are too wide to be informative.
De-identification: where HIPAA or GDPR restricts residential or facility coordinates, apply geomasking, hexagonal aggregation, or coordinate rounding before analysis, consistent with Precision Standards in Epi-Mapping, and log the disclosure threshold used.
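Some of the checks above can be automated before any distance is computed. The sketch below works on a plain coordinate array and covers only the coordinate-level checks (finite values, a projected-looking range, duplicates, minimum sample size). The degree test is a crude heuristic and the function name is an assumption; a real pipeline would check the CRS authority code and test containment against the window polygon.

```python
import numpy as np


def check_point_prerequisites(coords, min_n=30, decimals=3):
    """Return a list of human-readable problems; an empty list means proceed."""
    coords = np.asarray(coords, dtype=float)
    if coords.ndim != 2 or coords.shape[1] != 2:
        return ["coordinates must be an (n, 2) array of x, y"]
    problems = []
    if not np.isfinite(coords).all():
        problems.append("non-finite coordinates present")
    # Heuristic only: values that fit inside lon/lat bounds suggest degrees.
    if (np.abs(coords[:, 0]) <= 180).all() and (np.abs(coords[:, 1]) <= 90).all():
        problems.append("coordinates look like degrees; project to a metric CRS")
    n_unique = len(np.unique(coords.round(decimals), axis=0))
    if n_unique < len(coords):
        problems.append(f"{len(coords) - n_unique} duplicate coordinate(s)")
    if n_unique < min_n:
        problems.append(f"only {n_unique} distinct events; need >= {min_n}")
    return problems


# Hypothetical inputs: five lon/lat pairs versus forty projected points.
lonlat = np.array([[-87.6, 41.8], [-87.7, 41.9], [-87.6, 41.8],
                   [-87.5, 41.7], [-87.4, 41.6]])
projected = np.column_stack([500000 + 100.0 * np.arange(40),
                             4000000 + 50.0 * np.arange(40)])
lonlat_problems = check_point_prerequisites(lonlat)
projected_problems = check_point_prerequisites(projected)
```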

Core Algorithm Implementation & Edge Correction

The empirical computation flows from projection, through edge-corrected estimation, into the significance step.

For epidemiological applications, distance increments must be calibrated to pathogen or vector ecology rather than chosen for round numbers: 500 m to 5 km steps typically resolve mosquito flight ranges, localized human mobility clusters, and environmental gradient effects. Edge correction is non-negotiable. Near the study boundary, some true neighbors fall outside the window and go uncounted; without correction this systematically deflates $K(d)$ and produces false-negative clustering signals. Translation correction reweights each pair by the proportion of valid translations keeping both points inside the window; isotropic (Ripley) correction reweights by the arc fraction of the neighbor circle inside the window. The reference implementation below uses libpysal and pointpats, with parameters externalized so the same code is portable across jurisdictions.
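Before the full pipeline, the translation weight can be illustrated in isolation for the simplest case of an axis-aligned rectangular window, where the overlap between the window and its copy shifted by the pair's offset has a closed form. This is a sketch under that rectangular-window assumption; `translation_weights_rect` is a hypothetical helper, and irregular windows need a polygon intersection instead. It also assumes no pair is separated by a full window width or height, which would make the overlap zero.

```python
import numpy as np


def translation_weights_rect(coords, width, height):
    """Translation edge-correction weights for an axis-aligned rectangle."""
    coords = np.asarray(coords, dtype=float)
    dx = np.abs(coords[:, None, 0] - coords[None, :, 0])
    dy = np.abs(coords[:, None, 1] - coords[None, :, 1])
    # Area of the window intersected with itself shifted by (dx, dy).
    overlap = (width - dx) * (height - dy)
    return (width * height) / overlap


# Hypothetical example: two events 5 units apart in a 10 x 10 window.
w = translation_weights_rect([(0.0, 0.0), (5.0, 0.0)], width=10.0, height=10.0)
```

Here the pair gets weight 2: only half of the possible shifts of the window keep both events inside, so the pair counts double to compensate.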

# K-function surveillance pipeline
# Pinned: geopandas==0.14.4, libpysal==4.12.1, pointpats==2.5.0,
#         pyproj==3.6.1, shapely==2.0.4, numpy==1.26.4
import hashlib
import json
import logging
from datetime import datetime, timezone

import geopandas as gpd
import numpy as np
from pointpats import PointPattern
from pointpats.distance_statistics import k_test

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("kfunction")


def run_k_function(
    events_path: str,
    window_path: str,
    target_crs: str,        # e.g. "EPSG:32616" (UTM 16N); metric, conformal projection
    d_min: float = 500.0,   # metres; calibrate to transmission/vector ecology
    d_max: float = 5000.0,
    d_step: float = 500.0,
    n_sims: int = 999,      # CSR realizations; raise to 9999 for tighter p
    seed: int = 20260625,   # pin the RNG so significance is reproducible
):
    cfg = dict(
        target_crs=target_crs, d_min=d_min, d_max=d_max,
        d_step=d_step, n_sims=n_sims, seed=seed,
    )
    cfg_hash = hashlib.sha256(
        json.dumps(cfg, sort_keys=True).encode()
    ).hexdigest()
    log.info("config sha256=%s", cfg_hash)

    # --- Load, project, validate -------------------------------------------
    events = gpd.read_file(events_path).to_crs(target_crs)
    window = gpd.read_file(window_path).to_crs(target_crs)

    events = events[events.geometry.type == "Point"].copy()
    # Deterministic order so any downstream tie-breaking is byte-reproducible
    events = events.sort_values("event_id").reset_index(drop=True)
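    # Drop coordinates that coincide to the nearest millimetre (duplicate trap
    # deployments, re-reports), keeping the first occurrence in event_id order
    xy = np.column_stack([events.geometry.x, events.geometry.y]).round(3)
    _, first_idx = np.unique(xy, axis=0, return_index=True)
    events = events.iloc[np.sort(first_idx)].reset_index(drop=True)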

    # union_all() needs geopandas >= 1.0; the pinned 0.14.4 provides unary_union
    win_poly = window.geometry.unary_union
    inside = events[events.within(win_poly)]
    dropped = len(events) - len(inside)
    if dropped:
        log.warning("%d events fell outside the study window", dropped)
    n = len(inside)
    if n < 30:
        raise ValueError(f"n={n} too small for stable CSR envelopes (need >=30)")

    # --- Extract coordinates and report first-order intensity --------------
    coords = np.column_stack([inside.geometry.x, inside.geometry.y])
    # Intensity over the true study window, not the points' bounding box
    intensity = n / win_poly.area
    log.info("n=%d  intensity=%.6f pts/m^2  window_area=%.1f m^2",
             n, intensity, win_poly.area)

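    # --- K-function + Monte Carlo CSR envelope (isotropic/Ripley edge corr.) ----
    # Assumption to verify: that the pinned pointpats release accepts the `hull`,
    # `edge_correction` and `seed` values passed below.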
    support = np.arange(d_min, d_max + d_step, d_step)
    rng = np.random.default_rng(seed)
    k_result = k_test(
        coords,
        support=support,
        keep_simulations=True,
        n_simulations=n_sims,
        hull="bounding_box",   # or pass the actual window for tighter bounds
        edge_correction="ripley",
        seed=rng,
    )

    # Besag L(d) = sqrt(K/pi) - d; envelope from the simulated K surface
    l_obs = np.sqrt(k_result.statistic / np.pi) - support
    sims_l = np.sqrt(k_result.simulations / np.pi) - support
    lo = np.percentile(sims_l, 2.5, axis=0)
    hi = np.percentile(sims_l, 97.5, axis=0)

    flags = []
    for d, lo_d, lobs_d, hi_d in zip(support, lo, l_obs, hi):
        if lobs_d > hi_d:
            verdict = "clustering"
        elif lobs_d < lo_d:
            verdict = "inhibition"
        else:
            verdict = "random"
        flags.append({"d_m": float(d), "L_obs": float(lobs_d),
                      "env_lo": float(lo_d), "env_hi": float(hi_d),
                      "verdict": verdict})

    # Distance of maximum positive deviation == candidate intervention radius
    dev = l_obs - hi
    intervention_radius = float(support[int(np.argmax(dev))]) if dev.max() > 0 else None

    out = {
        "config_sha256": cfg_hash,
        "crs": target_crs,
        "n_events": int(n),
        "events_dropped_outside_window": int(dropped),
        "intervention_radius_m": intervention_radius,
        "executed_utc": datetime.now(timezone.utc).isoformat(),
        "bands": flags,
    }
    log.info("intervention_radius_m=%s", intervention_radius)
    return out

Detailed workflow patterns for arbovirus surveillance and vector habitat mapping — including geomasking-aware ingestion and seeded envelope generation — are documented in Implementing Ripley’s K-Function in Python for Vector-Borne Diseases. Core spatial weight and point pattern routines should follow libpysal documentation and Geopandas documentation for CRS handling and topology preservation.

Statistical Envelopes & Hypothesis Testing

Significance comes from Monte Carlo simulation under complete spatial randomness. Generating 999 to 9999 randomized realizations within the validated window yields per-distance envelopes: an observed $L(d)$ above the upper envelope indicates statistically significant clustering at that scale, and one below the lower envelope indicates regularity. Interpretation therefore hinges on where the observed $L(d)$ curve sits relative to the envelope at each scale.

The distance of maximum positive deviation identifies a candidate intervention radius for resource deployment. Two disciplines keep this honest. First, pre-specify the evaluation range from transmission biology, vector dispersal literature, or operational constraints — post-hoc distance selection inflates the false-positive rate. Second, because you test many distances at once, use simultaneous inference: a maximum-absolute-deviation (global) envelope controls the family-wise error rate across all bands, rather than a pointwise envelope that is valid at only one distance. When the K-function feeds a Spatial Scan Statistics Configuration downstream, the scale of maximum deviation is the natural upper bound for the scan’s maximum window radius, which prevents overfitting to stochastic noise.
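The simultaneous-inference step can be sketched as a maximum-absolute-deviation test. The version below assumes you already hold the observed $L(d)$ vector and a matrix of simulated $L(d)$ curves (one row per CSR realization), and it uses the simulation mean as the null centre, which is a simplification. `mad_global_test` is a hypothetical name, not a pointpats function.

```python
import numpy as np


def mad_global_test(l_obs, sims_l, level=0.95):
    """Maximum-absolute-deviation test across the whole distance support."""
    l_obs = np.asarray(l_obs, dtype=float)
    sims_l = np.asarray(sims_l, dtype=float)      # shape (n_sims, n_distances)
    center = sims_l.mean(axis=0)                  # simulated null expectation
    t_obs = np.abs(l_obs - center).max()
    t_sim = np.abs(sims_l - center).max(axis=1)   # one statistic per simulation
    p_value = (1 + (t_sim >= t_obs).sum()) / (len(t_sim) + 1)
    half_width = np.quantile(t_sim, level)        # constant-width global band
    return p_value, center - half_width, center + half_width


# Hypothetical inputs: 99 bounded "simulated" curves over 5 distance bands.
sims = np.sin(np.arange(99 * 5, dtype=float)).reshape(99, 5)
p_far, lo_band, hi_band = mad_global_test(np.full(5, 10.0), sims)
p_null, _, _ = mad_global_test(sims.mean(axis=0), sims)
```

The single p-value answers "does the curve leave the null anywhere in the pre-specified range", and the constant-width band can be drawn in place of the pointwise percentiles.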

Parameter Selection & Tuning

Distance support (d_min, d_max, d_step): anchor the upper bound to no more than half the shortest window dimension — beyond that, edge correction degrades. Set the step to the finest operationally meaningful scale (e.g. 250–500 m for arbovirus vectors).
Edge correction method: translation is the robust default for irregular windows; isotropic (Ripley) is preferable for convex windows and is more accurate near corners. Record which one was used — results are not comparable across methods.
Number of simulations: the smallest attainable pointwise p-value is $1/(n_{\text{sims}}+1)$. Use 999 for screening, 9999 when a band sits near the envelope and you need a defensible p < 0.001.
Homogeneous vs. inhomogeneous null: if a Poisson dispersion test or a kernel intensity surface shows the at-risk population is non-stationary, switch to $K_{\text{inhom}}(d)$ with a denominator-derived intensity. This is the parameter that most often flips a “cluster” verdict.
Multiplicity control across scales: prefer a global (rank-based) envelope test reporting a single p-value over reading significance off each band independently; the latter amounts to running one test per distance band at an uncorrected α, so the chance of at least one false positive grows with the number of bands.
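Two of the rules above are simple arithmetic and can be encoded directly. The helpers below are an illustrative sketch (the names are assumptions): one returns the p-value floor for a given simulation count, the other builds a distance support capped at half the shortest window dimension.

```python
def min_pointwise_p(n_sims):
    """Smallest p-value a Monte Carlo test with n_sims simulations can report."""
    return 1.0 / (n_sims + 1)


def distance_support(d_min, d_step, window_width, window_height, d_max=None):
    """Evenly spaced distances, capped at half the shortest window dimension."""
    cap = 0.5 * min(window_width, window_height)
    upper = cap if d_max is None else min(d_max, cap)
    n_steps = int((upper - d_min) // d_step) + 1
    return [d_min + i * d_step for i in range(max(n_steps, 0))]


# Hypothetical 8 km x 20 km window: a requested 5 km upper bound is cut to 4 km.
support = distance_support(500.0, 500.0, 8000.0, 20000.0, d_max=5000.0)
p_floor_999 = min_pointwise_p(999)
p_floor_9999 = min_pointwise_p(9999)
```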

Edge Cases & Failure Modes

Population heterogeneity masquerading as clustering. Cases cluster because people cluster. Validate stationarity first; if it fails, the homogeneous envelope will declare clustering everywhere. Switch to $K_{\text{inhom}}$ or restrict the window to a homogeneous sub-region.
Disconnected or island windows. A multipart window (e.g. an archipelago catchment) breaks naive bounding-box edge correction. Pass the actual window polygon to the estimator and correct against true boundaries, or analyze each component separately when inter-component distances are operationally irrelevant.
Sparse data ($n < 30$). Envelopes widen until nothing is significant. Pool surveillance periods, or fall back to a first-order intensity description rather than forcing a second-order test.
Duplicate and snapped coordinates. Stacked points (re-reports at one address, repeated trap visits) inflate short-distance $K(d)$ and manufacture clustering. De-duplicate before estimation; if duplicates are genuine multiplicity, weight rather than drop.
Transboundary CRS drift. Mixing UTM zones or unprojected feeds across a multi-state extent silently corrupts distances. Reproject everything to a single planar, metric CRS suited to the full extent (for example Albers Equal Area or Lambert Conformal Conic) and assert the authority code at ingestion.
Memory for large $n$. The pairwise distance step is $O(n^2)$. Above ~50,000 points, use chunked or KD-tree neighbor enumeration and avoid materializing the full distance matrix; for ~100k+ points, sparse banded computation keeps memory bounded.
Reporting lag. Delayed case confirmation thins recent periods and biases the estimate. Apply temporal thinning or time-weighted K-functions to mitigate staggered laboratory turnaround.
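The chunked approach to the memory problem can be sketched with NumPy alone: process a slab of rows at a time so that only a `chunk x n` block of distances exists at once. The function name and chunk size are assumptions, and the result is the raw ordered-pair count per distance band, without edge weights.

```python
import numpy as np


def chunked_pair_counts(coords, support, chunk=2048):
    """Count ordered pairs (i != j) with d_ij <= d without an n x n matrix."""
    coords = np.asarray(coords, dtype=float)
    support = np.asarray(support, dtype=float)
    counts = np.zeros(len(support), dtype=np.int64)
    for start in range(0, len(coords), chunk):
        block = coords[start:start + chunk]
        diff = block[:, None, :] - coords[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))   # (chunk, n) slab only
        rows = np.arange(len(block))
        dist[rows, start + rows] = np.inf          # mask self-pairs
        # Sorted slab + searchsorted gives cumulative counts for every band.
        flat = np.sort(dist, axis=None)
        counts += np.searchsorted(flat, support, side="right")
    return counts


# Hypothetical toy pattern, with a deliberately tiny chunk to exercise the loop.
pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
pair_counts = chunked_pair_counts(pts, support=[1.0, 1.5], chunk=3)
```

Multiplying these counts by $|A| / (n(n-1))$ recovers the uncorrected $\hat{K}(d)$; peak memory scales with `chunk * n` rather than `n * n`.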

Compliance & Audit Controls

Every run must be reproducible by a reviewer who has only the inputs and the recorded configuration. Sort events by a stable event_id so any tie-breaking is byte-reproducible; pin the RNG seed so the CSR envelope is identical on re-run; and attest the configuration with a SHA-256 hash (as in the code above). Serialize outputs — $K/L$ values, envelope bounds, per-band verdicts, and the candidate intervention radius — alongside the CRS authority code, the edge-correction method, point counts before and after de-duplication, the disclosure threshold applied during geomasking, an execution timestamp, and pinned library versions. Where coordinates are de-identified, the masking parameters belong in the run record per Compliance Mapping Frameworks, so the privacy transformation is itself auditable. Emit the metadata block as GeoParquet or JSON with ISO 19115 lineage fields so the result survives interagency exchange and regulatory review.
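A run record along these lines might be assembled as follows. This is a sketch using only the standard library; the field names, the helper `build_run_record`, and the example values are illustrative assumptions rather than a mandated schema, and ISO 19115 lineage fields would be layered on top.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_run_record(cfg, bands, versions, crs, edge_correction, n_raw, n_dedup):
    """Bundle results with everything a reviewer needs to reproduce the run."""
    cfg_json = json.dumps(cfg, sort_keys=True)     # key order must not matter
    return {
        "config": cfg,
        "config_sha256": hashlib.sha256(cfg_json.encode()).hexdigest(),
        "crs": crs,
        "edge_correction": edge_correction,
        "n_events_raw": n_raw,
        "n_events_deduplicated": n_dedup,
        "library_versions": versions,
        "executed_utc": datetime.now(timezone.utc).isoformat(),
        "bands": bands,
    }


# Hypothetical values for illustration only.
record = build_run_record(
    cfg={"seed": 20260625, "n_sims": 999, "d_step": 500.0},
    bands=[{"d_m": 500.0, "verdict": "random"}],
    versions={"pointpats": "2.5.0", "geopandas": "0.14.4"},
    crs="EPSG:32616",
    edge_correction="ripley",
    n_raw=412,
    n_dedup=398,
)
payload = json.dumps(record, indent=2, sort_keys=True)
```

Hashing the sorted-key JSON means two runs with the same parameters produce the same digest regardless of how the configuration dictionary was built.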

Deploying K-function analysis in public health operations demands rigorous spatial preprocessing, epidemiologically grounded parameterization, and audit-ready pipelines. Combined with complementary spatial statistics and automated validation, it provides a scalable, defensible foundation for outbreak detection, vector control optimization, and environmental risk assessment.

Global & Local Moran’s I Implementation — autocorrelation diagnostics on aggregated administrative units.
Getis-Ord Gi* Hotspot Detection — signed z-score hotspot surfaces for intervention zoning.
Spatial Scan Statistics Configuration — likelihood-ratio space-time outbreak detection that consumes the K-function scale.
Implementing Ripley’s K-Function in Python for Vector-Borne Diseases — geomasking-aware ingestion and seeded envelopes for arbovirus surveillance.
Coordinate Reference Systems for Public Health — projection selection the entire distance computation depends on.