Configuring SaTScan for Retrospective Cluster Detection
Retrospective cluster detection in public health surveillance requires deterministic execution, strict parameterization, and auditable data pipelines. SaTScan’s retrospective space-time scan statistic evaluates historical case distributions against expected baselines to identify statistically significant spatial or spatiotemporal aggregations. Production deployments in government and agency environments must eliminate boundary artifacts, enforce coordinate reference system (CRS) alignment, and complete the configured Monte Carlo replications within defined computational budgets. This guide details the configuration workflow for agency-grade deployments, emphasizing compliance, spatial validation, and automated .prm generation.
Data Schema Alignment and Coordinate Standardization
SaTScan reads delimited text files with a fixed column order. In its native layout, the case file lists a location ID, a case count, and a date; the coordinates file maps each location ID to a coordinate pair; and the population file supplies at-risk denominators per location ID and time period. Source exports that carry latitude, longitude, date, and a case count or binary status field per row must therefore be keyed to location IDs, using coordinate pairs that match the population export exactly. Treating unprojected WGS84 degrees as planar coordinates distorts distances, which biases maximum cluster radius calculations and invalidates spatial window constraints. All inputs must be transformed to an equal-area projection (e.g., EPSG:3035 for pan-European analyses, EPSG:2163 for CONUS) prior to parameter generation. Cap coordinate precision at six decimal places so that the same location always yields the same coordinate key during joins and grid generation.
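To make the distance distortion concrete, the following minimal sketch compares a great-circle distance with the figure obtained by treating decimal degrees as planar units. The function names, the spherical Earth radius, and the 111.32 km-per-degree conversion are illustrative assumptions and are not part of SaTScan.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)
KM_PER_DEGREE = 111.32       # approximate length of one degree at the equator


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def naive_planar_km(lat1, lon1, lat2, lon2):
    """Distance obtained by (incorrectly) treating degrees as planar units."""
    return math.hypot(lat2 - lat1, lon2 - lon1) * KM_PER_DEGREE


# One degree of longitude at 60 degrees north, measured both ways.
true_km = haversine_km(60.0, 10.0, 60.0, 11.0)
naive_km = naive_planar_km(60.0, 10.0, 60.0, 11.0)
print(f"great-circle: {true_km:.1f} km, naive planar: {naive_km:.1f} km")
```

At this latitude the naive figure is roughly double the great-circle distance, so a radius limit expressed in planar degrees would cover very different ground areas in different parts of the study region.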
Compliance frameworks such as HIPAA and GDPR restrict the processing of identifiable location data, so raw addresses should never be ingested into scan engines. Geocoding must occur in isolated environments, with outputs aggregated to census tracts, hexagonal grids, or jittered centroids that satisfy k-anonymity thresholds. Population denominators should derive from official census releases or synthetic population models that exclude identifiable attributes. Temporal fields require truncation to day-level resolution to prevent re-identification through exact timestamp matching. Preprocessing pipelines must generate cryptographic checksums for all input exports to maintain chain-of-custody integrity. For foundational guidance on structuring these inputs within broader spatial workflows, refer to Disease Clustering & Spatial Statistical Modeling.
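As an illustration of the aggregation and truncation steps, the sketch below counts cases per tract and day and withholds small cells. It uses simple small-cell suppression as a stand-in for a full k-anonymity check; the threshold of 5, the function name, and the record layout are assumptions chosen for the example, and the real threshold is a policy decision.

```python
from collections import Counter
from datetime import datetime

K_THRESHOLD = 5  # assumed minimum cell size; set according to agency policy


def aggregate_with_suppression(records, k=K_THRESHOLD):
    """Count cases per (tract, day) and withhold cells with fewer than k cases.

    `records` is an iterable of (tract_id, iso_timestamp) pairs.
    Returns the released cells and the number of suppressed cells.
    """
    counts = Counter()
    for tract_id, timestamp in records:
        # Truncate the timestamp to day-level resolution.
        day = datetime.fromisoformat(timestamp).date().isoformat()
        counts[(tract_id, day)] += 1
    released = {cell: n for cell, n in counts.items() if n >= k}
    suppressed = len(counts) - len(released)
    return released, suppressed


# Six records in tract T001 and a single record in tract T002 on the same day.
records = [("T001", f"2024-03-01T{hour:02d}:15:00") for hour in range(8, 14)]
records.append(("T002", "2024-03-01T09:30:00"))
released, suppressed = aggregate_with_suppression(records)
print(released, suppressed)
```

Suppressing cells changes the case totals the scan sees, so the number of withheld cells is worth recording alongside the input checksums.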
Parameter File Construction and Retrospective Mode
The .prm configuration file dictates SaTScan’s statistical behavior. Retrospective analysis requires an explicit AnalysisType=3, which selects the retrospective space-time scan (AnalysisType=2 selects the purely temporal scan). The likelihood model, identified here by the codes used in recent SaTScan parameter templates, must align precisely with the epidemiological data structure:
- ModelType=0: discrete Poisson count data (cases vs. population at risk)
- ModelType=1: Bernoulli case-control data
- ModelType=3: ordinal outcomes (continuous outcomes use separate codes, such as the normal model)
Misalignment between ModelType and input schema produces invalid likelihood ratios and silent statistical failures. Set the number of Monte Carlo replications (MonteCarloReps) to 999 or 9999 for publication-grade p-values; the smallest attainable p-value is 1/(replications + 1). Temporal settings (TimeAggregationLength, MaxTemporalSize) must reflect disease incubation periods and reporting lags. Spatial constraints (MaxSpatialSizeInDistanceFromCenter, MaxSpatialSizeInPopulationAtRisk) prevent overfitting and computational exhaustion. Parameter names differ between SaTScan releases, so confirm them against the official definitions and version-specific syntax documented in the SaTScan User Guide.
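The link between the replication count and p-value resolution can be shown with a short sketch of the standard rank-based Monte Carlo p-value. The function names are illustrative, and the simulated values are synthetic numbers rather than SaTScan output.

```python
def monte_carlo_p_value(observed_stat, simulated_stats):
    """Rank-based p-value: (1 + simulated values >= observed) / (replications + 1)."""
    exceed = sum(1 for s in simulated_stats if s >= observed_stat)
    return (1 + exceed) / (len(simulated_stats) + 1)


def smallest_p_value(num_replications):
    """Smallest p-value that a given number of replications can report."""
    return 1 / (num_replications + 1)


# Synthetic stand-in for 999 simulated maximum likelihood ratios.
simulated = [0.5 * i for i in range(999)]
p = monte_carlo_p_value(498.6, simulated)
print(p, smallest_p_value(999), smallest_p_value(9999))
```

With 999 replications no cluster can be reported below p = 0.001, which is why the replication count should be chosen with the intended significance threshold in mind.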
Automated Validation and Pipeline Execution
Production deployments require programmatic validation before engine invocation. The following Python pipeline validates schema requirements, caps coordinate precision, generates a .prm file, and executes SaTScan with subprocess isolation. CSV exports carry no CRS metadata, so CRS consistency must be verified upstream, before the exports are written.
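import hashlib
import subprocess
from pathlib import Path

import pandas as pd


def generate_satpram(case_path: str, pop_path: str, output_dir: str,
                     model_type: int = 0, num_sims: int = 999,
                     max_spatial_size: float = 0.5, max_temporal_days: int = 30) -> str:
    """Generate an audit-ready .prm file for retrospective SaTScan execution.

    Parameter names and codes follow the .prm template of recent SaTScan
    releases as understood here; confirm them against a template saved by
    your installed version before production use.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Validate schema (CRS alignment must already be enforced upstream)
    case_df = pd.read_csv(case_path)
    pop_df = pd.read_csv(pop_path)
    required_case_cols = {'Latitude', 'Longitude', 'Date', 'Cases'}
    required_pop_cols = {'Latitude', 'Longitude', 'Population'}
    if not required_case_cols.issubset(case_df.columns):
        raise ValueError("Case file missing required columns.")
    if not required_pop_cols.issubset(pop_df.columns):
        raise ValueError("Population file missing required columns.")

    # Cap coordinate precision
    coord_cols = ['Latitude', 'Longitude']
    case_df[coord_cols] = case_df[coord_cols].round(6)
    pop_df[coord_cols] = pop_df[coord_cols].round(6)

    # Key both tables to location IDs derived from the population grid
    locs = pop_df[coord_cols].drop_duplicates().reset_index(drop=True)
    locs['LocationID'] = [f"loc{i}" for i in range(len(locs))]
    case_df = case_df.merge(locs, on=coord_cols, how='left')
    if case_df['LocationID'].isna().any():
        raise ValueError("Case coordinates missing from the population file.")
    pop_df = pop_df.merge(locs, on=coord_cols, how='left')

    # Truncate dates to day resolution and derive the study period
    dates = pd.to_datetime(case_df['Date'])
    case_df['Date'] = dates.dt.strftime('%Y/%m/%d')
    pop_df['Year'] = dates.min().year  # assumes a single census year

    # Write headerless, space-delimited input files (ID/count/date, ID/x/y, ID/year/population)
    case_file = out_dir / "cases_clean.cas"
    geo_file = out_dir / "coordinates_clean.geo"
    pop_file = out_dir / "population_clean.pop"
    results_file = out_dir / "sat_results.txt"
    to_txt = dict(sep=' ', header=False, index=False)
    case_df[['LocationID', 'Cases', 'Date']].to_csv(case_file, **to_txt)
    # Columns keep the export's names but are expected to hold projected x/y values
    locs[['LocationID', 'Longitude', 'Latitude']].to_csv(geo_file, **to_txt)
    pop_df[['LocationID', 'Year', 'Population']].to_csv(pop_file, **to_txt)

    # Generate .prm (day precision, Cartesian coordinates, retrospective space-time)
    pram_content = f"""[Input]
CaseFile={case_file}
PopulationFile={pop_file}
CoordinatesFile={geo_file}
PrecisionCaseTimes=3
StartDate={dates.min():%Y/%m/%d}
EndDate={dates.max():%Y/%m/%d}
CoordinatesType=0

[Analysis]
AnalysisType=3
ModelType={model_type}
TimeAggregationUnits=3
TimeAggregationLength=1

[Spatial Window]
UseDistanceFromCenterOption=y
MaxSpatialSizeInDistanceFromCenter={max_spatial_size}

[Temporal Window]
MaxTemporalSizeInterpretation=1
MaxTemporalSize={max_temporal_days}

[Inference]
MonteCarloReps={num_sims}

[Run Options]
RandomlyGenerateSeed=n

[Output]
ResultsFile={results_file}
"""
    pram_path = out_dir / "retrospective_scan.prm"
    pram_path.write_text(pram_content)

    # Log checksums for audit trail
    for f in [case_file, geo_file, pop_file, pram_path]:
        sha256 = hashlib.sha256(f.read_bytes()).hexdigest()
        print(f"[AUDIT] {f.name}: SHA256={sha256}")
    return str(pram_path)


def run_sat_scan(pram_path: str, executable: str = "SaTScan"):
    """Execute SaTScan with subprocess isolation and error trapping."""
    # The batch executable takes the parameter file as a positional argument;
    # its file name varies by platform and release, so pass it explicitly.
    cmd = [executable, pram_path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        raise RuntimeError(f"SaTScan execution failed:\n{result.stderr or result.stdout}")
    print("[EXECUTION] SaTScan completed successfully. Outputs written to configured directory.")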
For robust subprocess management in production environments, consult the official Python subprocess documentation.
Spatial Validation and Edge-Case Handling
Boundary artifacts occur when case coordinates fall outside the defined population grid or when spatial windows intersect jurisdictional boundaries. Implement a spatial join validation step prior to execution to flag orphaned cases. Zero-population zones in rural, maritime, or restricted areas cause division-by-zero errors in expected case calculations. These zones must be masked or assigned a minimum denominator (e.g., 1.0) with documented justification in the pipeline metadata.
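The orphan and zero-population checks described above might be sketched as follows. This is a key-based stand-in for a full spatial join: it assumes cases and population rows already share a location key, and the function name, row layouts, and default minimum denominator are hypothetical.

```python
def validate_locations(case_rows, population_rows, min_denominator=1.0):
    """Flag orphaned cases and zero-population zones before a scan.

    case_rows: iterable of (location_key, case_count)
    population_rows: iterable of (location_key, population)
    Returns (orphans, adjusted_population, adjustments).
    """
    adjusted = {}
    adjustments = []
    for key, population in population_rows:
        if population <= 0:
            # Record the substitution so it can be justified in pipeline metadata.
            adjustments.append(
                {"location": key, "original": population, "assigned": min_denominator}
            )
            population = min_denominator
        adjusted[key] = population
    # A case is orphaned when its location has no population record at all.
    orphans = [key for key, _ in case_rows if key not in adjusted]
    return orphans, adjusted, adjustments


cases = [("A", 3), ("B", 1), ("Z", 2)]
population = [("A", 1200.0), ("B", 0.0), ("C", 450.0)]
orphans, adjusted, adjustments = validate_locations(cases, population)
print(orphans, adjustments)
```

Returning the adjustments as data rather than printing them lets the pipeline write the documented justification into its metadata.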
Failed or empty Monte Carlo runs often stem from overly restrictive temporal windows, mismatched coordinate precision, or insufficient population coverage. Implement retry logic with progressive window relaxation: if a run aborts or returns no significant clusters, increase the maximum spatial window size by a fixed step (e.g., 0.1) and re-run, logging all parameter adjustments. Because every re-run is an additional hypothesis test, fix the relaxation schedule before analysis and report every run. For advanced tuning strategies and threshold calibration, review Spatial Scan Statistics Configuration.
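A minimal sketch of such a retry loop is shown below. The `run_scan` callable is hypothetical: it is assumed to take the spatial window size, run the scan, and return the number of significant clusters, raising RuntimeError on failure. The step size and attempt limit are illustrative defaults.

```python
def scan_with_relaxation(run_scan, initial_size, step=0.1, max_attempts=5):
    """Re-run a scan with a progressively larger spatial window.

    Returns (cluster_count, final_size, log), where log records every attempt.
    """
    log = []
    size = initial_size
    for attempt in range(1, max_attempts + 1):
        try:
            clusters = run_scan(size)
            outcome = "ok"
        except RuntimeError as exc:
            clusters = 0
            outcome = f"failed: {exc}"
        log.append(
            {"attempt": attempt, "max_spatial_size": size,
             "clusters": clusters, "outcome": outcome}
        )
        if clusters > 0:
            return clusters, size, log
        # Round to avoid accumulating floating-point noise in the logged sizes.
        size = round(size + step, 10)
    return 0, log[-1]["max_spatial_size"], log


def fake_scan(size):
    """Stand-in runner: pretends clusters appear once the window reaches 0.7."""
    return 2 if size >= 0.7 else 0


clusters, final_size, log = scan_with_relaxation(fake_scan, initial_size=0.5)
print(clusters, final_size, len(log))
```

Capping the attempts and keeping the full log makes the relaxation schedule explicit, so every run can be reported rather than only the one that produced clusters.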
Audit-Ready Execution and Compliance Logging
Production pipelines require deterministic execution tracking. Log the exact .prm configuration, input file hashes, SaTScan binary version, and system environment variables. Store outputs in version-controlled directories with immutable timestamps. Ensure Monte Carlo results are reproducible by fixing the randomization seed in the .prm file rather than letting SaTScan generate one at random.
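One way to capture these items is a single JSON-serializable run record, sketched below. The field names, the `env_keys` allowlist, and the caller-supplied `satscan_version` string are assumptions for illustration; how the binary version is obtained depends on the installation.

```python
import hashlib
import json
import os
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_run_record(prm_path, input_paths, satscan_version, env_keys=("PATH",)):
    """Assemble an audit record for one scan run."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "prm_file": str(prm_path),
        "prm_sha256": sha256_of(prm_path),
        "prm_content": Path(prm_path).read_text(),
        "inputs": {str(p): sha256_of(p) for p in input_paths},
        "satscan_version": satscan_version,  # supplied by the caller
        "python_version": sys.version,
        "platform": platform.platform(),
        # Log only an allowlist of variables to avoid leaking secrets.
        "environment": {key: os.environ.get(key) for key in env_keys},
    }


def write_run_record(record, destination):
    """Write the record as indented, key-sorted JSON."""
    Path(destination).write_text(json.dumps(record, indent=2, sort_keys=True))
```

Restricting the environment capture to an allowlist keeps credentials and tokens out of the audit trail while still recording the settings that affect execution.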
Agencies should implement a post-execution validation routine that:
- Verifies output file existence and non-zero cluster counts
- Cross-references reported p-values against the configured number of Monte Carlo replications (no p-value can fall below 1/(replications + 1))
- Archives the .prm file alongside the raw exports and cleaned input files
- Generates a compliance manifest containing data lineage, transformation steps, and cryptographic hashes
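The first two checks in this list could be sketched as follows. The function name is hypothetical, the p-values are assumed to have been parsed from the results by the caller (no SaTScan output format is parsed here), and the lower bound assumes standard Monte Carlo p-values rather than an approximation.

```python
from pathlib import Path


def validate_run(results_path, p_values, num_replications):
    """Run basic post-execution checks and return a list of problems found."""
    problems = []
    path = Path(results_path)
    if not path.is_file() or path.stat().st_size == 0:
        problems.append(f"missing or empty results file: {path}")
    if not p_values:
        problems.append("no clusters reported")
    # A rank-based Monte Carlo p-value cannot be smaller than 1 / (N + 1).
    floor = 1 / (num_replications + 1)
    for p in p_values:
        if not (floor - 1e-12 <= p <= 1.0):
            problems.append(
                f"p-value {p} outside [{floor}, 1] for {num_replications} replications"
            )
    return problems
```

An empty list means the run passed these checks; archiving and manifest generation would follow as separate steps.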
Retrospective cluster detection is only as reliable as its configuration and validation layers. By enforcing CRS alignment, automating parameter generation, and embedding compliance logging directly into the execution pipeline, public health teams can deploy SaTScan at scale with audit-ready reproducibility and statistical integrity.