Configuring SaTScan for Retrospective Cluster Detection

This guide solves one narrow operational problem: generating a deterministic, audit-defensible SaTScan .prm parameter file for a retrospective space-time scan over historical surveillance data, where a hand-edited parameter file silently drifts out of sync with the input schema. It is part of Spatial Scan Statistics Configuration, within the broader Disease Clustering & Spatial Statistical Modeling section.

Problem Context & Constraints

The naive approach — open the SaTScan GUI, click through the parameter tabs once, export a .prm, and reuse that file for every retrospective run — fails the moment the analysis is expected to be reproducible. A retrospective scan answers “did a statistically significant excess occur somewhere in this fixed historical window?”, so it is overwhelmingly used for after-action review, accreditation evidence, and litigation-adjacent reporting. All three contexts demand that the exact configuration be reconstructable from logs months later, which a GUI-clicked file cannot guarantee.

Three specific failure modes break the GUI-once workflow:

Schema drift. The .cas, .pop, and .geo files are regenerated each reporting cycle by an upstream extract. If a column order changes or the population file is dropped for a space-time permutation run, the GUI happily executes against stale assumptions and emits a likelihood with no error. The likelihood ratio test statistic SaTScan maximizes,
$\Lambda = \max_{Z}\left(\frac{c}{\mu}\right)^{c}\left(\frac{C-c}{C-\mu}\right)^{C-c} \mathbf{1}\!\left(\frac{c}{\mu} > \frac{C-c}{C-\mu}\right),$
where $c$ is the observed case count inside candidate window $Z$, $\mu$ is the expected count under the null, and $C$ is the total case count, is only meaningful when $\mu$ is derived from the correct baseline. A Poisson model run without a matching population file computes $\mu$ from a degenerate denominator and reports a confident, wrong cluster.
Non-deterministic p-values. Without a fixed Monte Carlo random seed, two runs over identical data return different p-values near the significance boundary. For retrospective evidence that has to be defended, this is disqualifying.
Coordinate reference system mismatch. Cases supplied in projected metres while the .geo file is geographic (or vice versa) produce silently distorted window radii. Enforce a canonical Coordinate Reference System for Public Health before the coordinate file is ever written.
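To make the schema-drift statistic concrete, here is a minimal sketch of the logarithm of that ratio for a single candidate window. The name `poisson_llr` is illustrative and not part of SaTScan, and the code evaluates one window only rather than maximizing over every $Z$.

```python
import math


def poisson_llr(c: float, mu: float, C: float) -> float:
    """Log-likelihood ratio for one candidate window under the Poisson model.

    c  : observed cases inside the window
    mu : expected cases inside the window under the null
    C  : total observed cases in the study area
    Returns 0.0 when the window shows no excess (the indicator term is 0).
    """
    if not (0 < mu < C) or not (0 <= c <= C):
        raise ValueError("need 0 < mu < C and 0 <= c <= C")
    # Indicator: only windows with a higher rate inside than outside count.
    if c / mu <= (C - c) / (C - mu):
        return 0.0
    inside = c * math.log(c / mu)
    # (C - c) * log(...) tends to 0 as c approaches C, so guard against log(0).
    outside = (C - c) * math.log((C - c) / (C - mu)) if c < C else 0.0
    return inside + outside


# Same 20 observed cases out of 100; only the baseline changes.
print(round(poisson_llr(20, 10, 100), 3))  # expected count 10
print(round(poisson_llr(20, 18, 100), 3))  # expected count 18
```

The two calls differ only in $\mu$, which is why a wrong baseline changes the apparent strength of a cluster even when the case counts are untouched.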

The fix is to stop treating the .prm as a one-time artifact and instead generate it programmatically from validated inputs, with the likelihood model, the window ceilings, and the random seed all asserted in code.

Prerequisites

# Python 3.11+
# pandas==2.2.2
# SaTScan 10.2 CLI on PATH (the Linux/macOS binary is lowercase 'satscan')
#
# Input state required before this step runs:
#   - Cases aggregated to a stable location_id (census tract / hex cell), NOT raw addresses.
#   - Coordinates already reprojected to EPSG:4326 (geographic) for SaTScan's native
#     spherical distance engine; CoordinatesType=1 selects lat/long.
#   - Day-level date resolution (no exact timestamps) to satisfy re-identification controls.

SaTScan expects fixed-format, whitespace-delimited text files in a defined column order:

File	Extension	Columns (in order)
Case	`.cas`	`location_id cases date`
Population	`.pop`	`location_id year population`
Coordinates	`.geo`	`location_id latitude longitude`
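A coordinate file written in projected metres is easy to catch mechanically, because latitude and longitude have hard bounds. The sketch below is one way to check `.geo` records before a run. `check_geo_lines` is a hypothetical helper, and it assumes the three-column `location_id latitude longitude` layout from the table above.

```python
def check_geo_lines(lines):
    """Return a list of problems found in whitespace-delimited .geo records.

    Each record is expected to be: location_id latitude longitude.
    Values outside the lat/long bounds usually mean projected coordinates
    (for example metres) were written where degrees were expected.
    """
    problems = []
    seen = set()
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            problems.append(f"line {n}: expected 3 fields, got {len(fields)}")
            continue
        loc_id, lat_s, lon_s = fields
        try:
            lat, lon = float(lat_s), float(lon_s)
        except ValueError:
            problems.append(f"line {n}: non-numeric coordinate")
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append(f"line {n}: ({lat}, {lon}) is not a valid lat/long pair")
        if loc_id in seen:
            problems.append(f"line {n}: duplicate location_id {loc_id}")
        seen.add(loc_id)
    return problems


# Made-up records: the second is in metres, the third repeats an ID.
sample = ["36061 40.78 -73.97", "36047 583960.0 4507523.0", "36061 40.70 -73.90"]
for problem in check_geo_lines(sample):
    print(problem)
```

Returning a list of problems, rather than raising on the first one, lets a single pass report everything wrong with an extract.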

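The single most consequential prerequisite is matching the likelihood ModelType to the files you actually have. The matrix below maps each model to its required inputs. A Poisson run without a population file, or a Bernoulli run with one, is the most common silent misconfiguration.

Model	Case (`.cas`)	Population (`.pop`)	Control (`.ctl`)	Coordinates (`.geo`)
Discrete Poisson	required	required	not used	required
Bernoulli	required	not used	required	required
Space-time permutation	required	not used	not used	required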

Retrospective space-time analysis uses AnalysisType=3; a purely spatial retrospective scan uses AnalysisType=1. Monte Carlo replications (MonteCarloReps) of 999 or 9999 give publication-grade p-value resolution, and MaxTemporalSize should reflect the disease’s incubation period and reporting lag rather than a round default. Version-specific parameter names are listed in the official SaTScan User Guide.
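The link between replication count and p-value resolution follows from how a Monte Carlo rank p-value is formed. The sketch below illustrates the general technique with Python's `random` module. It is not SaTScan's internal generator, `rank_p_value` and `simulate_null_maxima` are illustrative names, and the null statistics are plain random draws used only to show the arithmetic.

```python
import random


def rank_p_value(observed: float, simulated: list[float]) -> float:
    """Monte Carlo rank p-value: (1 + #simulated >= observed) / (reps + 1)."""
    at_least = sum(1 for s in simulated if s >= observed)
    return (1 + at_least) / (len(simulated) + 1)


def simulate_null_maxima(reps: int, seed: int) -> list[float]:
    """Stand-in for null-replication statistics, drawn from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0) for _ in range(reps)]


null_a = simulate_null_maxima(999, seed=42)
null_b = simulate_null_maxima(999, seed=42)  # same seed, same draws
null_c = simulate_null_maxima(999, seed=7)   # different seed, different draws

# Identical seeds give identical p-values by construction.
print(rank_p_value(3.0, null_a) == rank_p_value(3.0, null_b))
# Different seeds can shift a p-value that sits near a threshold.
print(rank_p_value(3.0, null_a), rank_p_value(3.0, null_c))
# An observed value above every replicate hits the floor 1 / (reps + 1).
print(rank_p_value(1e9, null_a))
```

With 999 replications the smallest reportable value is 1/1000, and with 9999 it is 1/10000, which is why the replication count belongs in the audit record.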

Step-by-Step Solution

The pipeline below validates the input schemas, asserts that the chosen ModelType has the files it needs, writes a .prm with a fixed random seed, hashes every input for the audit trail, then executes the engine. The alignment gate is the critical step: it raises before any engine call, so a model-to-file mismatch can never reach SaTScan. In its simplest form, the SaTScan CLI is invoked with a single argument, the path to the parameter file, and that is the only form used here.

# pandas==2.2.2 ; SaTScan 10.2 CLI on PATH
import hashlib
import subprocess
import pandas as pd
from pathlib import Path

# Whether each likelihood model requires a population file.
# SaTScan ModelType codes: 0=Discrete Poisson, 1=Bernoulli, 2=Space-Time Permutation.
# (A Bernoulli run needs a control file instead; that case is not handled here.)
MODEL_REQUIRES_POP = {0: True, 1: False, 2: False}


def generate_satscan_prm(case_path: str, coords_path: str, output_dir: str,
                         pop_path: str | None = None, model_type: int = 0,
                         num_sims: int = 999, random_seed: int = 42,
                         max_spatial_pct: float = 50.0,
                         max_temporal_days: int = 30) -> str:
    """Generate an audit-ready retrospective space-time .prm for SaTScan.

    Raises before any engine call if the ModelType / input-file pairing is
    inconsistent (the most common silent retrospective misconfiguration).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # --- 1. Validate case schema (column order is positional in SaTScan) ---
    case_df = pd.read_csv(case_path, sep=r"\s+", header=None,
                          names=["location_id", "cases", "date"])
    if case_df["cases"].lt(0).any():
        raise ValueError("Case file contains negative counts.")

    # --- 2. Assert ModelType <-> population-file alignment ---
    pop_supplied = pop_path is not None
    pop_required = MODEL_REQUIRES_POP[model_type]
    if pop_required and not pop_supplied:
        raise ValueError(f"ModelType={model_type} requires a population file; none supplied.")
    if not pop_required and pop_supplied:
        raise ValueError(f"ModelType={model_type} must NOT receive a population file.")

    pop_line = f"PopulationFile={pop_path}\n" if pop_supplied else ""
    results_stem = str(out / "sat_results")

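    # --- 3. Write a fully specified, deterministic .prm ---
    # The study period is taken from the case file itself; this assumes the date
    # column parses with pandas.to_datetime. Replace it with the planned study
    # window if that is wider than the observed case dates.
    dates = pd.to_datetime(case_df["date"])
    start_date = dates.min().strftime("%Y/%m/%d")
    end_date = dates.max().strftime("%Y/%m/%d")

    # MaxSpatialSizeInPopulationAtRisk is a percentage (at most 50), and
    # MaxTemporalSizeInterpretation=1 makes MaxTemporalSize a length in time
    # aggregation units (days here) rather than a percentage of the period.
    # Confirm the seed key name against the User Guide for your release.
    prm_content = f"""[Input]
CaseFile={case_path}
PrecisionCaseTimes=3
StartDate={start_date}
EndDate={end_date}
{pop_line}CoordinatesFile={coords_path}
CoordinatesType=1

[Analysis]
AnalysisType=3
ModelType={model_type}
ScanAreas=1
TimeAggregationUnits=3
TimeAggregationLength=1

[Output]
ResultsFile={results_stem}

[Spatial Window]
MaxSpatialSizeInPopulationAtRisk={max_spatial_pct}

[Temporal Window]
MaxTemporalSizeInterpretation=1
MaxTemporalSize={max_temporal_days}

[Inference]
MonteCarloReps={num_sims}
RandomSeed={random_seed}
"""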
    prm_path = out / "retrospective_scan.prm"
    prm_path.write_text(prm_content.strip() + "\n")

    # --- 4. Hash every input + the .prm itself for the chain-of-custody log ---
    inputs = [Path(case_path), Path(coords_path), prm_path]
    if pop_supplied:
        inputs.insert(1, Path(pop_path))
    for f in inputs:
        digest = hashlib.sha256(f.read_bytes()).hexdigest()
        print(f"[AUDIT] {f.name}: sha256={digest}")

    return str(prm_path)


def run_sat_scan(prm_path: str, executable: str = "satscan") -> None:
    """Execute SaTScan with subprocess isolation and explicit error trapping."""
    result = subprocess.run([executable, prm_path],
                            capture_output=True, text=True, check=False)
    if result.returncode != 0:
        raise RuntimeError(f"SaTScan failed (exit {result.returncode}):\n{result.stderr}")
    print("[EXECUTION] SaTScan completed; outputs written to the configured directory.")

For robust process management in production, consult the official Python subprocess documentation.
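Since a `.prm` is INI-style text, one cheap guard against drift is to read the generated file back and compare it with what the run is supposed to use. The sketch below relies on the standard-library `configparser`. `assert_prm_matches` is a hypothetical helper, and the section and key names in the example are the ones written by the generator above.

```python
import configparser


def assert_prm_matches(prm_text: str, expected: dict[tuple[str, str], str]) -> None:
    """Raise ValueError if any (section, key) in the .prm differs from expected."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case; the default lowercases keys
    parser.read_string(prm_text)
    for (section, key), want in expected.items():
        # fallback=None also covers a missing section or a missing key.
        got = parser.get(section, key, fallback=None)
        if got != want:
            raise ValueError(f"[{section}] {key}: expected {want!r}, found {got!r}")


prm_text = """[Analysis]
AnalysisType=3

[Inference]
MonteCarloReps=999
"""

assert_prm_matches(prm_text, {
    ("Analysis", "AnalysisType"): "3",
    ("Inference", "MonteCarloReps"): "999",
})
print("parameter file matches the expected configuration")
```

Running a check like this immediately before the engine call means a file edited by hand after generation is rejected instead of silently executed.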

Validation & Edge Cases

Three failure modes account for nearly every broken retrospective run. Each has a cheap diagnostic.

1. Orphaned cases — a location_id in the .cas file has no row in the .geo file. SaTScan drops the case silently and under-reports cluster size. Flag it before execution with a set difference:

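# Read IDs as strings so "01001" and "1001" are not conflated by integer parsing.
case_ids = set(pd.read_csv(case_path, sep=r"\s+", header=None, dtype=str,
                           names=["location_id", "cases", "date"])["location_id"])
geo_ids = set(pd.read_csv(coords_path, sep=r"\s+", header=None, dtype=str,
                          names=["location_id", "lat", "lon"])["location_id"])
orphans = case_ids - geo_ids
# Raise explicitly: a bare assert is skipped when Python runs with -O.
if orphans:
    raise ValueError(f"{len(orphans)} case location_ids have no coordinate: {sorted(orphans)[:5]}")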

2. Zero-population zones (maritime, restricted, or unpopulated tracts) cause a division by zero in the expected-count term $\mu$ for Poisson models. Mask them or assign a documented minimum denominator of 1.0. The symptom in the SaTScan log is a refusal to run with a message resembling:

Error: The population file contains a location with a population of zero for all dates.
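The zero-population condition can be detected before SaTScan is launched. The sketch below assumes the population records have already been parsed into `(location_id, population)` pairs. `zero_population_locations` is an illustrative name, and a location is flagged only when its population is zero on every record, mirroring the wording of the message above.

```python
from collections import defaultdict


def zero_population_locations(records):
    """Return sorted location_ids whose population is zero on every record.

    records: iterable of (location_id, population) pairs, one per
    location and census time.
    """
    totals = defaultdict(float)
    for location_id, population in records:
        if population < 0:
            raise ValueError(f"negative population for {location_id}")
        totals[location_id] += population
    # Populations are non-negative, so a zero total means zero on every record.
    return sorted(loc for loc, total in totals.items() if total == 0)


# Made-up records: T002 is empty at both census times, T003 only at one.
records = [("T001", 1200), ("T001", 1250), ("T002", 0),
           ("T002", 0), ("T003", 0), ("T003", 40)]
print(zero_population_locations(records))
```

The returned IDs are the ones to mask or document before the Poisson run.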

3. No significant cluster returned. This is frequently a configuration artifact, not a true null — an over-tight MaxTemporalSize or MaxSpatialSizeInPopulationAtRisk truncated the maximizing window below detection. Use progressive, logged window relaxation rather than silent re-tuning:

# MaxSpatialSizeInPopulationAtRisk is expressed as a percentage (at most 50).
for spatial_pct in (25, 30, 35, 40, 50):
    print(f"[RETRY] MaxSpatialSizeInPopulationAtRisk={spatial_pct}")
    prm = generate_satscan_prm(case_path, coords_path, out_dir,
                               pop_path=pop_path, max_spatial_pct=spatial_pct)
    run_sat_scan(prm)
    # break on first run that returns a cluster at the configured alpha

Logging every relaxation step is what separates a defensible sensitivity analysis from p-hacking: the audit trail shows the ceiling was widened deliberately, in fixed increments, not cherry-picked.
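One way to make each relaxation step auditable is an append-only log with one timestamped record per attempt. This is a sketch using only the standard library. `log_relaxation_step`, the file name, and the JSON field names are assumptions rather than a SaTScan convention, and the `found` outcomes in the loop are placeholders for a real check of the scan results.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_relaxation_step(log_path: Path, parameter: str, value: float,
                        cluster_found: bool) -> dict:
    """Append one JSON line describing a window-ceiling attempt and return it."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "parameter": parameter,
        "value": value,
        "cluster_found": cluster_found,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


log_file = Path(tempfile.mkdtemp()) / "relaxation_log.jsonl"
# Placeholder outcomes; in practice `found` comes from inspecting the results.
for pct, found in ((25, False), (30, False), (35, True)):
    log_relaxation_step(log_file, "MaxSpatialSizeInPopulationAtRisk", pct, found)
    if found:
        break  # stop at the first ceiling that yields a cluster

print(log_file.read_text(encoding="utf-8"))
```

Because the file is opened in append mode, earlier attempts are never overwritten, including the ones that returned nothing.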

Compliance Notes

For a retrospective result to survive later scrutiny, the following must be persisted alongside the scan output, never just inferred:

The exact .prm file, archived next to the SHA-256 hashes of the .cas, .pop, and .geo inputs it ran against.
The RandomSeed value (here 42) — the p-values are only reproducible with the seed recorded.
The MonteCarloReps count, so reported p-value resolution (e.g. the 1/1000 floor at 999 replications) is verifiable.
The SaTScan binary version string, since likelihood and parameter semantics shift across releases.
Every MaxSpatialSize / MaxTemporalSize relaxation step from the retry loop, with timestamps.

Because the inputs were aggregated to a stable location_id and truncated to day-level dates upstream, the scan engine never touches identifiable attributes; record that de-identification provenance in the same manifest. For the schema that captures this lineage, see Building a HIPAA-Compliant Spatial Metadata Schema.
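A manifest covering the items listed above might be assembled as follows. This is a sketch with assumed field names. `build_manifest` is hypothetical, and the version string is passed in by the caller because how it is obtained depends on the installation.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(files: list[Path], random_seed: int, monte_carlo_reps: int,
                   satscan_version: str) -> dict:
    """Collect the reproducibility facts for one retrospective run."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "satscan_version": satscan_version,
        "random_seed": random_seed,
        "monte_carlo_reps": monte_carlo_reps,
        # Smallest reportable p-value at this replication count.
        "p_value_floor": 1 / (monte_carlo_reps + 1),
        "sha256": {f.name: hashlib.sha256(f.read_bytes()).hexdigest() for f in files},
    }


# Throwaway input so the example runs on its own.
workdir = Path(tempfile.mkdtemp())
case_file = workdir / "cases.cas"
case_file.write_text("T001 3 2023/01/15\n", encoding="utf-8")

manifest = build_manifest([case_file], random_seed=42, monte_carlo_reps=999,
                          satscan_version="10.2")
(workdir / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
print(json.dumps(manifest, indent=2))
```

Keeping the hashes, seed, replication count, and version in one JSON document means a later reviewer can compare a rerun against a single file.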

Frequently Asked Questions

Should a retrospective scan use AnalysisType=3 or AnalysisType=1?

Use AnalysisType=3 (retrospective space-time) when your case file carries dates and you want to detect clustering localized in both space and time. Use AnalysisType=1 (purely spatial retrospective) when you only care about where the excess occurred over the whole window and dates are absent or irrelevant.

Why does my Poisson run report a hot spot but the p-value is exactly 1?

Almost always a baseline problem: the population file is missing, mismatched on location_id, or has a zero denominator in the candidate area, so the expected count $\mu$ is degenerate. The alignment assertion in generate_satscan_prm catches the missing-file case; the orphan and zero-population checks catch the rest.

Can I reuse one .prm across reporting cycles?

Only if you regenerate it from code each cycle so the input paths, hashes, and ModelType are re-validated against that cycle’s extract. Reusing a hand-edited file is exactly the schema-drift trap this guide exists to prevent.

Spatial Scan Statistics Configuration — the parent guide on scan-window, baseline, and significance configuration.
Getis-Ord Gi* Hotspot Detection — an aggregated-polygon alternative when you need a per-unit hotspot surface rather than a single most-likely window.
Calculating Local Moran’s I for Infectious Disease Outbreaks — local autocorrelation as a complementary screen before committing to a scan.