How do I choose the maximum spatial cluster size?

The maximum window is conventionally capped at 50 percent of the population at risk. For rare diseases or fine-grained units, lower it to 10 to 25 percent to avoid over-smoothing. Setting it too wide lets the maximizing window absorb background and dilute the signal; setting it too tight truncates a genuine cluster below detection.

Which likelihood model should I use?

Use the Discrete Poisson model for case counts against a population denominator, the Bernoulli model for case-control point data, and the space-time permutation model when only case data are available and an expected count must be derived internally. Matching the model to the data structure is what makes the expected count correct.

Why must the Monte Carlo seed be fixed?

Significance is calibrated by reshuffling case labels under the null, so the reported p-value depends on the random sequence. Fixing the RandomSeed makes the simulated envelope and the p-value byte-reproducible on re-run, which is required for regulatory audit. The smallest attainable p-value is one over the number of replications plus one.

Spatial Scan Statistics Configuration: Production-Grade Implementation for Public Health Surveillance

This guide is part of Disease Clustering & Spatial Statistical Modeling, and covers how to configure, run, and audit the spatial and space-time scan statistic (the Kulldorff scan, implemented in SaTScan) so a surveillance team can answer one question with a defensible answer: where — and when — is the single most likely disease cluster, given a null of spatial randomness? The operational purpose is to convert raw case, control, and population-at-risk tables into a reproducible likelihood-ratio screen whose every threshold, seed, and parameter file is version-locked and audit-ready.

Concept & Epidemiological Alignment

The scan statistic moves a variable-radius window across the study region, and for each candidate window tests whether the risk inside the window differs from the risk outside. Unlike fixed-neighborhood local statistics such as Getis-Ord Gi* Hotspot Detection, which evaluate each areal unit against a global mean, the scan dynamically resizes circular (or elliptical) windows to maximize a likelihood ratio and then reports the single window — the most likely cluster — with the strongest evidence of elevated risk. Because that maximization is itself a multiple-comparisons problem, significance is calibrated by Monte Carlo simulation rather than read off a closed-form distribution.

For a Poisson model, each candidate window $Z$ is scored by the log likelihood ratio against a null of constant risk:

$$\text{LLR}(Z) = c_Z \ln\!\left(\frac{c_Z}{e_Z}\right) + (C - c_Z)\ln\!\left(\frac{C - c_Z}{C - e_Z}\right) \quad\text{when}\quad \frac{c_Z}{e_Z} > \frac{C - c_Z}{C - e_Z}$$

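where $c_Z$ is the observed case count inside the window, $e_Z$ the count expected under the null given the population at risk (scaled so that the expected counts sum to $C$), and $C$ the total cases. Windows that fail the condition score zero. The window maximizing $\text{LLR}(Z)$ is the candidate cluster. Its p-value is $R/(n_{\text{sims}}+1)$, where $R$ is the rank of the observed maximum among itself and the maxima from $n_{\text{sims}}$ data sets simulated under the null.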
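
The Poisson score can be sketched in a few lines. This is a minimal illustration of the formula above, not SaTScan's implementation; the function name `poisson_llr` and the example numbers are assumptions made for the sketch.

```python
import math


def poisson_llr(c_z: float, e_z: float, total_cases: float) -> float:
    """Log likelihood ratio for one window under the Poisson model.

    c_z: observed cases inside the window
    e_z: expected cases inside the window (expected counts sum to total_cases)
    total_cases: C, all cases in the study region
    Returns 0.0 when risk inside the window is not higher than outside.
    """
    if not 0 < e_z < total_cases:
        raise ValueError("e_z must lie strictly between 0 and total_cases")
    outside_obs = total_cases - c_z
    outside_exp = total_cases - e_z
    # Only windows with higher risk inside than outside are scored.
    if c_z / e_z <= outside_obs / outside_exp:
        return 0.0
    inside_term = c_z * math.log(c_z / e_z)
    # Convention: 0 * ln(0) = 0 when every case falls inside the window.
    outside_term = (
        outside_obs * math.log(outside_obs / outside_exp) if outside_obs > 0 else 0.0
    )
    return inside_term + outside_term


# Illustrative numbers only: 30 observed vs 15 expected, 200 cases in total.
print(round(poisson_llr(30, 15, 200), 2))
# A window with fewer cases than expected scores zero.
print(poisson_llr(10, 15, 200))
```

The score grows with both the size of the excess and the number of cases supporting it, which is why a small window with a large ratio can lose to a larger window with a more modest ratio.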

Three assumptions must hold before a scan result is epidemiologically defensible:

The baseline encodes risk, not just counts. $e_Z$ must come from a population-at-risk layer (or a covariate-adjusted expectation), or the scan reports population density as “clustering.” Where age or sex strongly modify risk, supply an indirectly standardized expected count rather than crude population.
Spatial support is discrete and topologically sound. The scan needs case, control, and population tabulated against the same location IDs and coordinates. Mixing tract centroids with raw geocodes corrupts both the window geometry and the denominator.
The window shape matches the plausible cluster geometry. Circular windows recover compact clusters; elongated transmission corridors (a river, a road) need elliptical windows or a flexibly-shaped scan, or they are truncated below detection.
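
To make the first assumption concrete, the sketch below computes an indirectly standardized expected count for two locations with the same total population but different age structure. The strata labels, reference rates, and the helper name `expected_cases` are hypothetical values chosen for illustration.

```python
def expected_cases(stratum_population: dict[str, float],
                   reference_rates: dict[str, float]) -> float:
    """Indirectly standardized expected count for one location.

    Multiplies each stratum's population by the reference (region-wide)
    rate for that stratum and sums over strata.
    """
    missing = set(stratum_population) - set(reference_rates)
    if missing:
        raise KeyError(f"no reference rate for strata: {sorted(missing)}")
    return sum(pop * reference_rates[s] for s, pop in stratum_population.items())


# Hypothetical reference rates (cases per person per period) by age band.
rates = {"0-17": 0.001, "18-64": 0.002, "65+": 0.010}

# Two tracts with 10,000 residents each but different age structure.
young_tract = {"0-17": 4000, "18-64": 5000, "65+": 1000}
old_tract = {"0-17": 1000, "18-64": 5000, "65+": 4000}

print(expected_cases(young_tract, rates))
print(expected_cases(old_tract, rates))
```

With a crude population denominator both tracts would receive the same expectation, so the older tract would look like elevated risk purely because of its age mix.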

Method Selection

Surveillance question	Preferred method	Why not the scan statistic
Where and when is the single most likely cluster, scanning many windows?	Spatial / space-time scan statistic	—
Which fixed areal units are significant high/low intensity?	Getis-Ord Gi* Hotspot Detection	The scan reports clusters, not a per-unit z-score surface.
Does any global clustering exist before localizing?	Global & Local Moran’s I Implementation	The scan always returns a most-likely window even when global signal is weak.
At what distance scale do raw point events cluster?	K-Function & Point Pattern Analysis	The scan fixes a maximum window size; it does not profile clustering across a continuum of scales.

A common production pattern confirms global autocorrelation with Moran’s I first, then runs the scan only when global signal is present. This avoids over-interpreting the most-likely window when the region is, in fact, spatially random. For retrospective historical reviews with a temporal baseline, the specialized configuration is covered in Configuring SaTScan for Retrospective Cluster Detection.

The scan operates by centering candidate circular (or elliptical) windows on every location and expanding each radius up to the configured MaxSpatialSize ceiling. For each window the engine computes a likelihood ratio comparing observed-versus-expected counts inside the window against the remainder; the window maximizing this ratio becomes the most likely cluster. Monte Carlo replicates then reshuffle case labels under the null to calibrate a p-value. The MaxSpatialSize ceiling is the single most consequential parameter: set it too wide and the maximizing window engulfs unrelated background, diluting the signal; set it too tight and a genuine cluster is truncated below detection.
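
The mechanics described above can be sketched as a toy purely spatial scan using only the standard library. This is an illustration under simplifying assumptions, not SaTScan's algorithm: windows grow one nearest neighbour at a time (distance ties broken by index), the null redistributes all cases in proportion to population, and every name (`best_window`, `scan`, the grid data) is invented for the example.

```python
import math
import random


def llr(c, e, total):
    """Poisson log likelihood ratio; zero unless risk inside exceeds risk outside."""
    if c <= 0 or e <= 0 or e >= total or c / e <= (total - c) / (total - e):
        return 0.0
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return c * math.log(c / e) + outside


def best_window(coords, pops, cases, max_fraction):
    """Return (max LLR, member indices) over windows centred on each location."""
    total_pop, total_cases = sum(pops), sum(cases)
    best = (0.0, ())
    for cx, cy in coords:
        # Nearest-first ordering defines the nested windows for this centre.
        order = sorted(
            range(len(coords)),
            key=lambda i: (math.hypot(coords[i][0] - cx, coords[i][1] - cy), i),
        )
        pop_in = cases_in = 0
        for k, i in enumerate(order):
            pop_in += pops[i]
            if pop_in > max_fraction * total_pop:
                break  # window would exceed the maximum spatial size
            cases_in += cases[i]
            score = llr(cases_in, total_cases * pop_in / total_pop, total_cases)
            if score > best[0]:
                best = (score, tuple(sorted(order[: k + 1])))
    return best


def scan(coords, pops, cases, max_fraction=0.5, n_sims=999, seed=42):
    """Most likely cluster plus a Monte Carlo p-value from a seeded generator."""
    observed, members = best_window(coords, pops, cases, max_fraction)
    rng = random.Random(seed)
    total_cases = sum(cases)
    exceed = 0
    for _ in range(n_sims):
        sim = [0] * len(coords)
        # Null hypothesis: each case lands in a location with probability
        # proportional to that location's population.
        for i in rng.choices(range(len(coords)), weights=pops, k=total_cases):
            sim[i] += 1
        if best_window(coords, pops, sim, max_fraction)[0] >= observed:
            exceed += 1
    return observed, members, (exceed + 1) / (n_sims + 1)


# Illustrative 5x5 grid with equal populations and a planted corner cluster.
coords = [(x, y) for x in range(5) for y in range(5)]
pops = [1000] * 25
cases = [2] * 25
for idx in (0, 1, 5):  # grid cells (0,0), (0,1), (1,0)
    cases[idx] = 12

observed, members, p_value = scan(coords, pops, cases, n_sims=99)
print(members, round(observed, 2), p_value)
```

Because the generator is seeded, repeated calls return the same p-value, which is the property the fixed seed is meant to guarantee in the production run.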

Spatial Data Prerequisites

Configuration failures frequently originate upstream from topological inconsistencies or misaligned population denominators, so the prerequisites are non-negotiable before a single window is scored:

Geometry type. The scan consumes location points (centroids or geocodes) plus three aligned tables — cases (.cas), population (.pop), and coordinates (.geo) — all keyed on the same stable location ID. Polygon inputs must be reduced to a representative point with a documented rule (population-weighted centroid preferred over geometric centroid).
CRS projection. Project case, control, and population-at-risk coordinates to a single area-preserving or distance-preserving CRS (for example a UTM zone, EPSG:326xx) before centroid extraction, following the canonical Coordinate Reference Systems for Public Health selection rules. Cartesian coordinates feed SaTScan’s CoordinatesType=0; latitude/longitude feed CoordinatesType=1 and trigger great-circle distances.
Topology checks. Reject empty or invalid geometries, assert one coordinate per location ID, and cross-validate population extents against official census boundaries so the denominator covers exactly the case support — no leakage, no gaps.
Minimum sample size. With very few cases the maximizing window is unstable and Monte Carlo cannot resolve small p-values ($p_{\min} = 1/(n_{\text{sims}}+1)$). Pool reporting periods or coarsen the support rather than scanning a near-empty surface.
Covariate / baseline requirements. Supply an expected-count or population denominator that reflects risk. Where age or sex modify incidence, pass an indirectly standardized expected count so a hotspot reflects elevated risk rather than an elevated denominator.
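
The requirement that all three tables share one set of location IDs can be checked before any file is written. The sketch below is one possible pre-flight check using plain Python collections; the function name `check_location_ids` and the problem labels are assumptions, not part of SaTScan or the pipeline above.

```python
from collections import Counter


def check_location_ids(case_ids, pop_ids, geo_ids) -> dict[str, list]:
    """Return alignment problems between case, population, and coordinate IDs.

    An empty dict means every case location has a population row and
    exactly one coordinate row.
    """
    case_set, pop_set = set(case_ids), set(pop_ids)
    geo_counts = Counter(geo_ids)
    problems = {
        "duplicate_coordinates": sorted(i for i, n in geo_counts.items() if n > 1),
        "cases_without_coordinates": sorted(case_set - set(geo_counts)),
        "cases_without_population": sorted(case_set - pop_set),
        "population_without_coordinates": sorted(pop_set - set(geo_counts)),
    }
    # Report only the checks that actually found something.
    return {name: ids for name, ids in problems.items() if ids}


# Hypothetical IDs: tract "c" has cases but no denominator or coordinates,
# and tract "a" appears twice in the coordinates table.
print(check_location_ids(["a", "c"], ["a", "b"], ["a", "a", "b"]))
```

Failing the run on a non-empty result keeps a leaking denominator from ever reaching the likelihood calculation.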

When processing protected health information, geomasking, coordinate jittering, or k-anonymity aggregation must precede statistical ingestion, and the disclosure parameters must be recorded with the run per the site’s Compliance Mapping Frameworks.

Production Implementation

The following pipeline enforces CRS alignment, deterministic seed fixation, configuration hashing, and parameter-file generation. SaTScan reads every input path and setting from a single parameter file (.prm); the executable is invoked with that file path as its sole positional argument. Geometry validation and tabulation use the project’s standard Python GIS stack.

# geopandas==0.14.4  pyproj==3.6.1  shapely==2.0.4  pandas==2.2.2  (Python 3.11)
# SaTScan >= 10.1 CLI on PATH (https://www.satscan.org)
import hashlib
import json
import logging
import subprocess
from pathlib import Path

import geopandas as gpd
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(asctime)s %(message)s")
log = logging.getLogger("scan")

TARGET_CRS = "EPSG:32618"  # UTM 18N — distance-preserving for the study extent


def validate_inputs(cases_gdf: gpd.GeoDataFrame, pop_gdf: gpd.GeoDataFrame,
                    crs: str = TARGET_CRS) -> tuple[gpd.GeoDataFrame, gpd.GeoDataFrame]:
    """Enforce CRS alignment and topological soundness before tabulation."""
    out = []
    for gdf, name in ((cases_gdf, "cases"), (pop_gdf, "pop")):
        if gdf.crs is None:
            raise ValueError(f"{name} layer has no CRS; refusing to assume one.")
        if not gdf.crs.equals(crs):  # semantic comparison; string forms of the same CRS can differ
            log.info("Reprojecting %s from %s to %s", name, gdf.crs, crs)
            gdf = gdf.to_crs(crs)
        if gdf.geometry.is_empty.any() or (~gdf.geometry.is_valid).any():
            raise ValueError(f"Empty or invalid geometries detected in {name} layer.")
        # deterministic ordering by location id; a stable sort keeps duplicate ids in input order
        gdf = gdf.sort_values("id", kind="mergesort").reset_index(drop=True)
        out.append(gdf)
    return tuple(out)


def config_digest(config: dict) -> str:
    """SHA-256 of the canonicalized configuration for the audit trail."""
    payload = json.dumps(config, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


def generate_satscan_prm(config: dict, data_dir: Path, output_dir: Path) -> Path:
    """
    Write a SaTScan-compatible .prm parameter file. SaTScan reads all settings
    from this file; data file paths are declared within it.
    Full parameter reference: https://www.satscan.org/techdoc.html
    """
    prm_file = output_dir / "scan_config.prm"
    cases_file = data_dir / "cases.cas"
    pop_file = data_dir / "pop.pop"
    coords_file = data_dir / "coords.geo"
    results_file = output_dir / "results"

    prm_content = f"""
[Input]
CaseFile={cases_file}
PopulationFile={pop_file}
CoordinatesFile={coords_file}
CoordinatesType=1

[Analysis]
AnalysisType=1
ModelType={config['model_type']}
ScanAreas=1

[Output]
ResultsFile={results_file}

[Spatial]
MaxSpatialSizeInPopulationAtRisk={config['max_spatial_size']}

[Temporal]
;0 = percentage of the study period, 1 = time (this pipeline passes days)
MaxTemporalSizeInterpretation=1
MaxTemporalSize={config.get('max_temporal_size', 0)}

[Inference]
MonteCarloReps={config['permutations']}
RandomSeed={config['seed']}
""".strip()

    prm_file.write_text(prm_content)
    return prm_file


def run_scan(config: dict, data_dir: Path, output_dir: Path) -> str:
    """Execute the SaTScan CLI and capture exit code, with the config hash logged."""
    output_dir.mkdir(parents=True, exist_ok=True)
    digest = config_digest(config)
    log.info("Run config SHA-256: %s", digest)
    (output_dir / "config.audit.json").write_text(
        json.dumps({"config": config, "config_sha256": digest}, indent=2)
    )

    prm_file = generate_satscan_prm(config, data_dir, output_dir)
    cmd = ["satscan", str(prm_file)]  # SaTScan CLI: satscan <parameter_file>
    log.info("Executing: %s", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        log.error("SaTScan failed: %s", result.stderr)
        raise RuntimeError("Statistical engine execution failed.")
    log.info("Scan completed. Parsing outputs (.col / .txt) ...")
    return result.stdout


if __name__ == "__main__":
    CONFIG = {
        "model_type": 0,          # 0 = Discrete Poisson; 1 = Bernoulli; 2 = space-time permutation
        "max_spatial_size": 25,    # percent of population at risk (SaTScan reads this cap as a percentage, at most 50)
        "max_temporal_size": 14,   # days; ignored while AnalysisType is fixed at 1 (purely spatial) in the .prm
        "permutations": 9999,      # p_min = 1/(reps+1); 999 to screen, 9999 for p<0.001
        "seed": 42,                # fixed seed = reproducible Monte Carlo envelope
        "crs": TARGET_CRS,
    }

    cases = gpd.read_file("data/cases.geojson")
    pop = gpd.read_file("data/pop_at_risk.geojson")
    cases, pop = validate_inputs(cases, pop, CONFIG["crs"])

    # Export to SaTScan-compatible tabular formats (space-delimited, no header)
    cases[["id", "case_count", "date"]].to_csv(
        "data/cases.cas", index=False, sep=" ", header=False
    )                                   # .cas: <location_id> <cases> <date>
    pop[["id", "year", "population"]].to_csv(
        "data/pop.pop", index=False, sep=" ", header=False
    )                                   # .pop: <location_id> <year> <population>
    # "lat"/"lon" are attribute columns in decimal degrees, matching CoordinatesType=1
    cases[["id", "lat", "lon"]].drop_duplicates("id").to_csv(
        "data/coords.geo", index=False, sep=" ", header=False
    )                                   # .geo: <location_id> <latitude> <longitude>

    run_scan(CONFIG, Path("data"), Path("output"))

Parameter Selection & Tuning

Five interdependent controls govern statistical power, computational cost, and defensibility. Tune them deliberately and log every value.

Maximum spatial cluster size (MaxSpatialSize). Conventionally capped at 50% of the population at risk to keep clusters epidemiologically plausible and the search tractable. For rare-disease surveillance or fine-grained administrative units, lower it to 10–25% to prevent over-smoothing and preserve localized signal — this is the parameter that most often flips a verdict.
Likelihood model (ModelType). Match the data structure: Discrete Poisson for counts against a population denominator, Bernoulli for case-control point data, space-time permutation when only cases (no denominator) are available and an expected count is derived internally from the case margins. The wrong model silently mis-specifies $e_Z$.
Temporal window. For a space-time scan, the temporal cap (commonly ≤ 50% of the study period, or a fixed operational horizon such as 14 days for prospective surveillance) bounds how long any detected window may persist. A purely spatial scan is selected through AnalysisType, not by zeroing this cap, and ignores the temporal settings.
Monte Carlo replications and significance threshold. Use 999 to screen and 9999 when a candidate sits near the threshold and a defensible $p < 0.001$ is required; the smallest attainable p-value is $1/(n_{\text{sims}}+1)$. Always fix the random seed so the simulated envelope, and therefore the reported p-value, is byte-reproducible.
Multiplicity beyond the primary cluster. The most-likely cluster’s p-value is already corrected for the scan’s multiple-window search. Secondary clusters reported by SaTScan carry their own Gumbel-approximated or Monte Carlo p-values; treat them as a ranked candidate list and apply an explicit false-discovery-rate cut if many secondary windows are acted on operationally, rather than reading each at an uncorrected α.
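
Two of the controls above reduce to short calculations: the p-value floor set by the replication count, and a false-discovery-rate cut over secondary clusters. The sketch below uses the Benjamini-Hochberg step-up procedure as one possible FDR rule; the source does not prescribe a specific procedure, and the procedure assumes independent or positively dependent p-values, which overlapping secondary windows may not satisfy. The function names and example p-values are illustrative.

```python
def min_p_value(n_sims: int) -> float:
    """Smallest p-value a Monte Carlo test with n_sims replications can report."""
    return 1 / (n_sims + 1)


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Flag which p-values pass a Benjamini-Hochberg cut at FDR level q.

    Returns a list aligned with p_values; True means the cluster is retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value clears its threshold.
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    keep = [False] * m
    for i in order[:cutoff_rank]:
        keep[i] = True
    return keep


print(min_p_value(999), min_p_value(9999))

# Hypothetical p-values for five secondary clusters, in reporting order.
secondary = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(secondary, q=0.05))
```

In this example only the first two clusters survive, even though four of the five fall below an uncorrected α of 0.05.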

Edge Cases & Failure Modes

Population heterogeneity masquerading as clustering. A Poisson scan with a flat or stale denominator reports dense population as elevated risk. Refresh the population layer and standardize the expected count before trusting any window.
Island and disconnected support. Locations with no neighbors within the maximum window, or multipart catchments (archipelagos, exclaves), distort circular windows. Use a non-Euclidean neighbor file or analyze components separately when inter-component distance is operationally meaningless.
Zero-inflation and sparse counts. Many zero-count locations widen the simulated maxima until nothing is significant. Pool periods, coarsen support, or move to the space-time permutation model, which tolerates sparse case-only data better than a denominator-hungry Poisson run.
Transboundary CRS drift. Mixing UTM zones or feeding unprojected coordinates into a Cartesian scan silently corrupts distances and window radii. Reproject everything to one CRS and assert the authority code at ingestion.
Memory and runtime for large $N$. Window enumeration over many locations with high replication is costly; above ~50k locations, restrict MaxSpatialSize, reduce candidate centroids by aggregating to a coarser support, or partition the extent, and budget runtime linearly in MonteCarloReps.
Reporting lag in prospective runs. Delayed case confirmation thins recent intervals and biases the temporal window. Apply a guard interval or time-adjusted baseline so clustering is not declared on an artifact of staggered laboratory turnaround.
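
The guard interval mentioned for reporting lag can be as simple as excluding the most recent days from a prospective run. The sketch below shows that filter; the function name `apply_guard_interval`, the five-day guard, and the dates are assumptions for illustration, and the right guard length depends on the observed laboratory turnaround.

```python
from datetime import date, timedelta


def apply_guard_interval(case_dates: list[date], run_date: date,
                         guard_days: int) -> list[date]:
    """Keep only cases old enough that late reports should have arrived.

    Cases dated within guard_days of run_date are dropped, so the most
    recent, still-filling intervals do not enter the temporal window.
    """
    cutoff = run_date - timedelta(days=guard_days)
    return [d for d in case_dates if d <= cutoff]


# Hypothetical run on 15 March with a five-day guard interval.
dates = [date(2024, 3, 1), date(2024, 3, 10), date(2024, 3, 14)]
print(apply_guard_interval(dates, run_date=date(2024, 3, 15), guard_days=5))
```

The study period end date passed to the scan should be moved back by the same interval so the excluded days are not treated as observed zeros.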

Compliance & Audit Controls

Every run must be reproducible by a reviewer who holds only the inputs and the recorded configuration. Sort locations by a stable id so any tie-breaking is byte-reproducible, pin the RandomSeed so the Monte Carlo envelope is identical on re-run, and attest the configuration with a SHA-256 hash (as config_digest does above). Version-control the generated .prm file alongside the code. Serialize outputs — the most-likely and secondary cluster geometries, observed/expected counts, log-likelihood ratios, relative risks, p-values, and the temporal window — into GeoJSON or GeoParquet with the CRS authority code, the model type, MaxSpatialSize, replication count, seed, the geomasking disclosure threshold applied at ingestion, an execution timestamp, and pinned library and SaTScan versions. Emit the metadata block with ISO 19115 lineage fields so the result survives interagency exchange and regulatory review. Real-time deployments add incremental window updates and lag mitigation, but the core configuration stays anchored to these reproducible baselines.
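
A reviewer can confirm that a recorded configuration has not been altered by recomputing its hash. The sketch below assumes an audit record shaped like the `config.audit.json` written by `run_scan` (keys `config` and `config_sha256`) and the same JSON canonicalization; the helper names `verify_record` and `verify_audit_file` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def config_sha256(config: dict) -> str:
    """SHA-256 over the configuration serialized with sorted keys and no padding."""
    payload = json.dumps(config, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


def verify_record(record: dict) -> bool:
    """True when the stored hash matches the stored configuration."""
    return config_sha256(record["config"]) == record["config_sha256"]


def verify_audit_file(path: Path) -> bool:
    """Load an audit record from disk and verify it."""
    return verify_record(json.loads(path.read_text()))


# Illustrative configuration; values are placeholders.
config = {"model_type": 0, "max_spatial_size": 25, "permutations": 9999, "seed": 42}
record = {"config": config, "config_sha256": config_sha256(config)}
print(verify_record(record))

# Changing any value after the fact breaks the match.
tampered = {"config": {**config, "seed": 43}, "config_sha256": record["config_sha256"]}
print(verify_record(tampered))
```

The hash attests only to the configuration dictionary; input files and the generated .prm need their own checksums if the audit must cover them as well.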

Deploying scan statistics in public health operations demands rigorous spatial preprocessing, model-matched parameterization, and audit-ready pipelines. Combined with complementary spatial statistics and automated validation, the configured scan provides a scalable, defensible foundation for prospective outbreak detection and retrospective cluster investigation.

Disease Clustering & Spatial Statistical Modeling — the parent overview tying these methods into one surveillance pipeline.
Getis-Ord Gi* Hotspot Detection — fixed-neighborhood z-score surfaces for intervention zoning.
Global & Local Moran’s I Implementation — global autocorrelation screen to run before localizing.
K-Function & Point Pattern Analysis — scale-resolved clustering on raw event points.
Configuring SaTScan for Retrospective Cluster Detection — historical-baseline parameterization of this same engine.
Coordinate Reference Systems for Public Health — projection selection the window geometry depends on.