Why must the points be projected before computing Ripley's K?

Ripley's K is a distance-based statistic. In a geographic CRS such as WGS84, coordinates are in degrees, and Euclidean distance between degree coordinates is not a ground distance: a degree of longitude shrinks with latitude while a degree of latitude stays nearly constant. Projecting to a planar, metre-based CRS (UTM or Albers Equal Area) is required so that inter-event distances are correct and the K estimator is meaningful.

Why does edge correction matter for vector-borne disease surveillance?

Events near the boundary of a county or watershed window have neighbors that fall outside the study area and are never counted, which systematically under-estimates K, increasingly so at larger distances. The translation edge-correction weight compensates for this, preventing biased clustering signals at exactly the scales used to set intervention radii.

When should I use the inhomogeneous K-function instead of the homogeneous one?

Use the inhomogeneous K-function when event intensity varies across the window — for example when cases track population density or vector habitat. The homogeneous K assumes constant intensity and will report population structure as disease clustering. The inhomogeneous version weights pairs by an estimated intensity surface so the null becomes 'clustered no more than the underlying population.'

What must be logged for a Ripley's K result to be auditable?

Record the source and target EPSG codes and confirmed CRS unit, the jitter sigma and random seed of any coordinate mask, the edge-correction method, the distance sweep bounds and step, the filtered point count, and the envelope's simulation count, alpha, seed, and resulting p-value vector. Together these reproduce both the de-identification and every significance claim.

Implementing Ripley’s K-Function in Python for Vector-Borne Diseases

This guide solves one narrow problem: computing a defensible, edge-corrected, significance-tested Ripley’s K-function on vector-borne disease (VBD) point data in Python, where the events are mosquito trap catches, geocoded human cases, or larval breeding sites and the answer feeds intervention-radius decisions. It is part of the K-Function & Point Pattern Analysis method family within the broader Disease Clustering & Spatial Statistical Modeling pipeline.

Problem Context & Constraints

The default way most analysts reach for Ripley’s K — feed raw coordinates to a one-line library call, plot the curve, declare “clustering” — fails on VBD surveillance data for four specific, recurring reasons. The estimator computed below is the homogeneous K-function and its variance-stabilizing Besag transform:

$$\hat{K}(d) = \frac{|A|}{n(n-1)} \sum_{i \neq j} w_{ij}\, \mathbf{1}\!\left(d_{ij} \le d\right), \qquad \hat{L}(d) = \sqrt{\hat{K}(d)/\pi} - d$$

where $|A|$ is the study-window area, $n$ the event count, $d_{ij}$ the inter-event distance, $\mathbf{1}(\cdot)$ the indicator, and $w_{ij}$ an edge-correction weight. Each term in that formula maps to a way the naive pipeline breaks:

Unprojected coordinates make $d_{ij}$ meaningless. Trap and case geocodes arrive in WGS84 degrees. Euclidean distance on degrees mixes a longitude unit that shrinks with latitude against a fixed-length latitude unit, so $d_{ij}$ is not a distance at all. The whole curve is computed in the wrong metric. The fix is to project to a planar, metre-based CRS first, following Coordinate Reference Systems for Public Health.
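
The size of that distortion is easy to check. The sketch below is a minimal illustration that assumes a spherical Earth with the mean radius shown; the function name `metres_per_degree` is hypothetical and not part of the pipeline. It prints the approximate ground length of one degree of longitude and of latitude at several latitudes.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)

def metres_per_degree(lat_deg: float) -> Tuple[float, float]:
    """Approximate ground length of one degree of (longitude, latitude) at lat_deg."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    # Meridians converge toward the poles, so a longitude degree scales with cos(latitude).
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lon, per_deg_lat

for lat in (0.0, 30.0, 45.0, 60.0):
    lon_m, lat_m = metres_per_degree(lat)
    print(f"lat {lat:4.1f}: 1 deg lon = {lon_m / 1000:6.1f} km, "
          f"1 deg lat = {lat_m / 1000:6.1f} km")
```

At 60 degrees latitude a degree of longitude covers half the ground it covers at the equator, so a circle of radius $d$ in degree space is an ellipse on the ground and the indicator $\mathbf{1}(d_{ij} \le d)$ counts the wrong pairs.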

Missing the $w_{ij}$ edge weight biases the curve downward. Events near the boundary of a county or watershed window have neighbors that fall outside it and are never counted. Without an edge correction, $\hat{K}(d)$ is systematically under-estimated, increasingly so at large $d$ — exactly the scales where intervention radii are decided.

A homogeneous null treats population structure as disease clustering. The $|A|/(n(n-1))$ normalization assumes constant intensity across the window. VBD events track human density and vector habitat; a homogeneous K will report “significant clustering” that is nothing more than where people and mosquitoes live. This is the single most common false positive in applied point pattern work.

Raw coordinates can be a disclosure. Case geocodes are protected health information. Running K on un-masked residence points, and worse, shipping the points into a plotted figure, is a re-identification risk that no amount of statistical rigor excuses.

This guide produces a pipeline that closes all four gaps deterministically and writes an audit record at each stage.

Prerequisites

Pin these versions; the pdist/squareform API and NumPy random Generator semantics used below are stable across them. Confirm the CRS unit as well, because a mismatched unit is the most common cause of a silently wrong curve.

Library versions (record in requirements.txt): geopandas==0.14.4, shapely==2.0.4, pyproj==3.6.1, scipy==1.13.1, numpy==1.26.4, pandas==2.2.2. For production-tested point pattern statistics and built-in envelope routines, pointpats==2.4.0 is the reference implementation; the hand-rolled version here exists so every term is auditable.
CRS state: the input must already be in a projected, metre-based CRS (UTM for a single investigation, Albers Equal Area for a multi-county extent) before any distance is computed. The guard in step 1 hard-fails on degree or survey-foot units rather than producing a plausible-looking wrong answer.
Input data: a GeoDataFrame of Point geometries representing true event locations — trap catches with at least one positive specimen, confirmed case residences, or surveyed breeding sites — plus a single-polygon GeoDataFrame describing the exact sampling frame. Zero-catch trap centroids and administrative placeholder points must be filtered out upstream; they violate the point-process assumption. Patient coordinates must be de-identified before they reach this stage.

Step-by-Step Solution

1. Validate, project, and privacy-mask the points

Geographic coordinates introduce distance distortion and second-order statistics require a planar metre CRS. This gate also enforces a minimum disclosure count and applies deterministic Gaussian jitter when the points are patient residences.

# geopandas==0.14.4, shapely==2.0.4, pyproj==3.6.1, numpy==1.26.4
import geopandas as gpd
import numpy as np
import logging
from pyproj import CRS
from shapely.geometry import Point
from typing import Tuple

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s | %(levelname)s | %(message)s")

def validate_and_project_points(
    raw_gdf: gpd.GeoDataFrame,
    target_epsg: int = 26918,   # UTM 18N — replace with the zone for your AOI
    min_points: int = 30,
    jitter_sigma: float = 0.0,  # metres; > 0 only for patient residences
    seed: int = 42,
) -> gpd.GeoDataFrame:
    """Validate, project to a metre CRS, and optionally jitter VBD points."""
    if raw_gdf.geometry.isnull().any():
        raise ValueError("Null geometries detected. Drop or impute before analysis.")
    if not raw_gdf.crs:
        raise RuntimeError("Input lacks a CRS. Assign one explicitly before projection.")

    projected = raw_gdf.to_crs(CRS.from_epsg(target_epsg))

    # K-functions fail silently in degrees or survey feet — hard-fail on non-metre units
    unit = projected.crs.axis_info[0].unit_name
    if unit not in ("metre", "meter"):
        raise ValueError(f"Projected CRS unit is '{unit}'; a metre-based CRS is required.")

    # Disclosure control: refuse to analyse sub-threshold jurisdictions
    if len(projected) < min_points:
        raise PermissionError(
            f"Point count ({len(projected)}) below the {min_points}-event disclosure floor."
        )

    if jitter_sigma > 0:
        rng = np.random.default_rng(seed)  # seeded -> reproducible mask
        logging.info("Applying Gaussian jitter (sigma=%.1f m) for disclosure control.",
                     jitter_sigma)
        projected = projected.copy()
        offsets = rng.normal(0, jitter_sigma, size=(len(projected), 2))
        projected["geometry"] = [
            Point(g.x + dx, g.y + dy)
            for g, (dx, dy) in zip(projected.geometry, offsets)
        ]

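    # Log source CRS, target EPSG, unit, and both mask parameters for the audit trail
    logging.info(
        "Projected %d points from %s to EPSG:%d (unit=%s, jitter_sigma=%.1f, seed=%d).",
        len(projected), raw_gdf.crs.to_string(), target_epsg, unit, jitter_sigma, seed)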
    return projected

The jitter is driven by a seeded numpy generator, so the mask is reproducible across agency runs as long as jitter_sigma, seed, and the input row order are unchanged; both parameters belong in the run log. Coordinate masking belongs to the same de-identification discipline as the upstream ingestion layer; trap and environmental coordinates typically skip it, but every transform is logged regardless.
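
The reproducibility property can be illustrated without any geospatial dependencies. This is a minimal sketch, not the pipeline's implementation: it uses the standard-library `random.Random` as a stand-in for `np.random.default_rng`, and the helper name `jitter_offsets` is hypothetical.

```python
import random
from typing import List, Tuple

def jitter_offsets(n: int, sigma_m: float, seed: int) -> List[Tuple[float, float]]:
    """Draw n (dx, dy) Gaussian offsets in metres from a seeded generator."""
    rng = random.Random(seed)  # stdlib stand-in for np.random.default_rng(seed)
    return [(rng.gauss(0.0, sigma_m), rng.gauss(0.0, sigma_m)) for _ in range(n)]

run_a = jitter_offsets(5, sigma_m=50.0, seed=42)
run_b = jitter_offsets(5, sigma_m=50.0, seed=42)
run_c = jitter_offsets(5, sigma_m=50.0, seed=43)

print(run_a == run_b)  # same seed and sigma: the two masks are identical
print(run_a == run_c)  # different seed: the masks differ
```

Offsets are assigned to points by row position, so a rerun only reproduces the same masked coordinates if the input rows arrive in the same order. Sorting on a stable key before masking is one way to guarantee that.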

2. Compute edge-corrected K over a distance sweep

Translation edge correction supplies the $w_{ij}$ weight without per-event polygon intersection during the inner loop, which keeps the distance sweep tractable on surveillance-scale point counts.

# scipy==1.13.1, numpy==1.26.4
from scipy.spatial.distance import pdist, squareform

def compute_ripleys_k(
    points: gpd.GeoDataFrame,
    study_area: gpd.GeoDataFrame,
    distances: np.ndarray,
    correction: str = "translation",
) -> Tuple[np.ndarray, np.ndarray]:
    """Homogeneous Ripley's K with translation edge correction. Returns (d, K)."""
    n = len(points)
    area = float(study_area.geometry.area.iloc[0])
    coords = np.column_stack([points.geometry.x.values, points.geometry.y.values])

    dist_matrix = squareform(pdist(coords))          # n x n metres
    k_values = np.zeros_like(distances, dtype=float)

    for i, d in enumerate(distances):
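        # ordered pairs (i != j) within d, matching the sum in the estimator:
        # the symmetric matrix already holds both orders, so only the n
        # zero self-distances on the diagonal are removed
        pair_count = np.sum(dist_matrix <= d) - n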

        # closed-form stand-in for the translation weight: a single global
        # factor per distance, 1 - 2d/sqrt(area), i.e. the window treated as a
        # square of equal area. For fragmented or coastal frames, swap in
        # exact buffer-intersection area.
        edge_weight = (area - 2.0 * d * np.sqrt(area)) / area
        edge_weight = np.clip(edge_weight, 0.01, 1.0)  # guard against near-zero

        k_values[i] = (area * pair_count) / (n * (n - 1) * edge_weight)

    return distances, k_values

For highly fragmented jurisdictions or coastal surveillance zones, replace the closed-form weight with the exact study_area.intersection(Point(x, y).buffer(d)).area per event. Before the sweep, confirm the window polygon strictly contains every point — boundary leakage biases $\hat{K}$ at exactly the large distances that drive intervention-radius decisions.
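
For comparison with the global closed-form factor, the per-pair translation weight is $w_{ij} = |A| / |A \cap (A + x_j - x_i)|$, and for an axis-aligned rectangular window the overlap area has a simple closed form. The sketch below assumes such a rectangle; the function name `translation_weight` and the 10 km window are illustrative, not taken from the pipeline.

```python
def translation_weight(xi: float, yi: float, xj: float, yj: float,
                       width: float, height: float) -> float:
    """Translation edge weight for one pair in a width x height rectangle.

    The window shifted by the pair's separation vector overlaps the original
    window in a (width - |dx|) x (height - |dy|) rectangle.
    """
    overlap_w = width - abs(xj - xi)
    overlap_h = height - abs(yj - yi)
    if overlap_w <= 0 or overlap_h <= 0:
        raise ValueError("Separation exceeds the window; both points cannot lie inside it.")
    return (width * height) / (overlap_w * overlap_h)

# Hypothetical 10 km x 10 km window, coordinates in metres.
print(translation_weight(0, 0, 1000, 0, 10_000, 10_000))        # 1 km apart along x
print(translation_weight(2000, 2000, 5000, 6000, 10_000, 10_000))  # 5 km apart
```

The weight depends only on the separation vector, not on where the pair sits in the window, which is why translation correction avoids a polygon intersection per event. It grows with separation, so long-range pairs are up-weighted most.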

3. Build the CSR envelope and derive significance

Raw $\hat{K}(d)$ has no statistical context. Simulation under complete spatial randomness (CSR) yields a confidence envelope and per-distance p-values; a fixed seed makes the result reproducible.

# numpy==1.26.4, geopandas==0.14.4
def generate_monte_carlo_envelope(
    points: gpd.GeoDataFrame,
    study_area: gpd.GeoDataFrame,
    distances: np.ndarray,
    n_simulations: int = 999,
    alpha: float = 0.05,
    seed: int = 42,
) -> dict:
    """CSR envelope + per-distance p-values for the K-function."""
    rng = np.random.default_rng(seed)
    n = len(points)
    sim_k = np.zeros((n_simulations, len(distances)))

    minx, miny, maxx, maxy = study_area.total_bounds
    study_geom = study_area.geometry.iloc[0]

    completed, attempts, max_attempts = 0, 0, n_simulations * 20
    while completed < n_simulations and attempts < max_attempts:
        attempts += 1
        # rejection-sample n points uniformly inside the (possibly irregular) window
        cand = gpd.GeoDataFrame(
            geometry=gpd.points_from_xy(
                rng.uniform(minx, maxx, n * 2), rng.uniform(miny, maxy, n * 2)
            ),
            crs=points.crs,
        )
        inside = cand[cand.geometry.within(study_geom)]
        if len(inside) < n:
            continue
        _, sim_k[completed] = compute_ripleys_k(inside.head(n), study_area, distances)
        completed += 1

    if completed < n_simulations:
        raise RuntimeError(
            f"Only {completed}/{n_simulations} simulations completed — check window geometry."
        )

    lower = np.percentile(sim_k, (alpha / 2) * 100, axis=0)
    upper = np.percentile(sim_k, (1 - alpha / 2) * 100, axis=0)

    observed_d, observed_k = compute_ripleys_k(points, study_area, distances)
    # one-sided clustering p-value per distance band
    p_values = (np.sum(sim_k >= observed_k, axis=0) + 1) / (n_simulations + 1)

    logging.info("Envelope built: %d sims, seed=%d, alpha=%.3f.",
                 n_simulations, seed, alpha)
    return {
        "distances": observed_d, "observed_k": observed_k,
        "lower_bound": lower, "upper_bound": upper,
        "p_values": p_values, "alpha": alpha, "seed": seed,
        "n_simulations": n_simulations,
    }

Note the $(r+1)/(m+1)$ p-value form rather than the raw proportion: counting the observed pattern among the simulations prevents an impossible p-value of exactly zero and is the convention for Monte Carlo tests. The envelope is pointwise and two-sided at alpha, while p_values is one-sided for clustering, so read each distance band on its own rather than as a single global test. A statistically significant band, where observed_k exceeds upper_bound, is the evidence that justifies a larvicide radius or targeted outreach at that scale, rather than a convention-chosen buffer.
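
The arithmetic of the $(r+1)/(m+1)$ form can be worked through on toy numbers. In this minimal sketch the simulated values are a made-up sequence rather than real K output, and the helper name `monte_carlo_p` is hypothetical.

```python
from typing import Sequence

def monte_carlo_p(observed: float, simulated: Sequence[float]) -> float:
    """One-sided p-value (r + 1) / (m + 1).

    r = number of simulated values >= observed, m = number of simulations.
    """
    r = sum(1 for s in simulated if s >= observed)
    return (r + 1) / (len(simulated) + 1)

sims = list(range(999))  # hypothetical simulated K values 0, 1, ..., 998

p_extreme = monte_carlo_p(2000.0, sims)  # no simulation reaches the observed value
p_tail = monte_carlo_p(989.0, sims)      # simulations 989..998 reach it, so r = 10

print(p_extreme)  # 0.001
print(p_tail)     # 0.011
```

With 999 simulations the smallest attainable p-value is 1/1000, which is why the simulation count bounds how small a significance level the test can support.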

Validation & Edge Cases

Three failure modes account for nearly every wrong VBD K-curve in production. Each has a diagnostic signature in the logs.

Unit mismatch silently rescales the whole curve. If the CRS is in survey feet or degrees, the distance sweep is interpreted in the wrong unit, so $\hat{K}(d)$ lands on the wrong scale at every distance and neither the curve nor the envelope can be read as evidence of clustering. The step-1 guard catches it before computation:

ValueError: Projected CRS unit is 'US survey foot'; a metre-based CRS is required.

Non-stationary intensity manufactures clustering. When events track population rather than transmission, the homogeneous curve sits far above the CSR envelope at every distance. The tell is an $\hat{L}(d)$ that rises monotonically and never returns toward zero. The fix is the inhomogeneous K-function $K_{\text{inhom}}(d)$, which weights each pair by an estimated intensity surface (a kernel density of the at-risk population) so the null becomes “clustered no more than the population is.” Before trusting any positive result, confirm the homogeneous-intensity assumption holds; this is the same alignment check the parent K-Function & Point Pattern Analysis guide describes.
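
The pair weighting can be sketched in a few lines. This is an illustration under stated assumptions, not the parent guide's implementation: it omits edge correction, takes the intensity at each event as a supplied list (in practice a kernel density estimate of the at-risk population), and uses the hypothetical name `k_inhom_uncorrected`. Each ordered pair within $d$ contributes $1/(\lambda_i \lambda_j)$, and the sum is divided by $|A|$.

```python
import math
from typing import Sequence, Tuple

def k_inhom_uncorrected(points: Sequence[Tuple[float, float]],
                        intensities: Sequence[float],
                        area: float, d: float) -> float:
    """Inhomogeneous K at distance d, without edge correction."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= d:
                total += 1.0 / (intensities[i] * intensities[j])
    return total / area

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]  # toy events
area = 100.0                                            # toy window area

# Constant intensity n / |A|: behaves like the homogeneous estimator.
flat = [len(pts) / area] * len(pts)
k_flat = k_inhom_uncorrected(pts, flat, area, d=1.5)

# Higher intensity where the three close events sit: their pairs are down-weighted.
dense = [0.08, 0.08, 0.08, 0.02]
k_dense = k_inhom_uncorrected(pts, dense, area, d=1.5)

print(k_flat, k_dense)
```

The same three close events produce a much smaller K once the intensity surface says many events were expected there, which is the sense in which the null becomes relative to the population.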

Sparse data and zero-inflation break the point process. Mosquito trap networks return many zero-catch nights; including those centroids injects regularly spaced phantom events that suppress $\hat{K}$ at short distances. Filter to positive-catch points before analysis. In the envelope step, a simulation that lands fewer than n candidates inside the window is skipped without a log entry, so the first visible symptom of a window that is too fragmented, or too small a share of its bounding box, for the requested n is the completion error:

RuntimeError: Only 612/999 simulations completed — check window geometry.

For more than roughly 50,000 events the full $n\times n$ squareform matrix exhausts memory (about 20 GB in float64 at $n = 50{,}000$); switch to a chunked or KD-tree neighbor query, or evaluate pointpats for the job.
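
One way to avoid the dense matrix is to bin points into a uniform grid and compare only neighboring cells. The sketch below uses that grid-hash technique as a dependency-free stand-in for a KD-tree query; the function name `count_ordered_pairs_within` is hypothetical. It counts ordered pairs for a single distance and would be called once per value in the sweep.

```python
import math
from collections import defaultdict
from typing import DefaultDict, List, Sequence, Tuple

def count_ordered_pairs_within(points: Sequence[Tuple[float, float]], d: float) -> int:
    """Count ordered pairs (i, j), i != j, with distance <= d.

    Cells have side d, so any pair within d lies in the same or an adjacent cell.
    """
    if d <= 0:
        raise ValueError("d must be positive.")
    grid: DefaultDict[Tuple[int, int], List[int]] = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / d), math.floor(y / d))].append(idx)

    count = 0
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty cells while iterating
                neighbours = grid.get((cx + dx, cy + dy))
                if not neighbours:
                    continue
                for i in members:
                    xi, yi = points[i]
                    for j in neighbours:
                        xj, yj = points[j]
                        if i != j and math.hypot(xi - xj, yi - yj) <= d:
                            count += 1
    return count

print(count_ordered_pairs_within([(0, 0), (1, 0), (0, 1), (5, 5)], 1.5))  # 6
```

Memory stays proportional to the number of points rather than its square. The work per call grows with the number of points sharing neighboring cells, so the gain is largest when $d$ is small relative to the window.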

Compliance Notes

Spatial statistics that drive resource allocation must be reproducible and traceable. Log and persist, per run:

the source and target EPSG codes and the confirmed CRS unit, so a reviewer can re-project identically;
the jitter_sigma and seed used for any coordinate mask, the parameters that make a de-identification step reproducible and defensible;
the edge-correction method, the distance sweep bounds and step, and the input point count after filtering;
the envelope n_simulations, alpha, and seed, plus the resulting p-value vector — the full provenance of every significance claim.

The minimum disclosure floor (min_points) and the masking parameters tie directly to the de-identification rules in Building a HIPAA-Compliant Spatial Metadata Schema; record them in the same metadata block that travels with the output.
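
The fields listed above could be gathered into a single serializable record. This is one possible shape, not a prescribed schema: the class name `KRunAudit`, every field name, and all values shown are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

@dataclass
class KRunAudit:
    """Per-run provenance for a Ripley's K analysis (illustrative field names)."""
    source_epsg: int
    target_epsg: int
    crs_unit: str
    jitter_sigma_m: float
    jitter_seed: Optional[int]   # None when no mask was applied
    min_points: int
    edge_correction: str
    distance_min_m: float
    distance_max_m: float
    distance_step_m: float
    n_points_after_filter: int
    n_simulations: int
    alpha: float
    envelope_seed: int
    p_values: List[float]

# Placeholder values only; a real run would fill these from the pipeline outputs.
record = KRunAudit(
    source_epsg=4326, target_epsg=26918, crs_unit="metre",
    jitter_sigma_m=50.0, jitter_seed=42, min_points=30,
    edge_correction="translation",
    distance_min_m=100.0, distance_max_m=3000.0, distance_step_m=100.0,
    n_points_after_filter=184,
    n_simulations=999, alpha=0.05, envelope_seed=42,
    p_values=[0.001, 0.004, 0.012],
)

audit_json = json.dumps(asdict(record), sort_keys=True, indent=2)
print(audit_json)
```

Sorting the keys keeps the serialized record stable between runs, so two audit files can be compared with a plain text diff.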

K-Function & Point Pattern Analysis — the parent method family, including the inhomogeneous $K_{\text{inhom}}(d)$ this page defers to for non-stationary intensity.
Coordinate Reference Systems for Public Health — projection selection the entire distance computation depends on.
Building a HIPAA-Compliant Spatial Metadata Schema — where the masking and provenance fields above are recorded.