Should epidemiology pipelines still use Shapefiles?

Treat Shapefile strictly as a transient exchange format, never as storage of record. Its 2 GB size cap, 10-character field-name truncation, mandatory multi-file distribution, and inconsistent CP1252/ISO-8859-1 encoding cause silent attribute loss. Migrate Shapefile inputs into a topology-preserving columnar format such as GeoParquet at ingestion, normalizing encoding to UTF-8 explicitly.

When should I store case data as GeoParquet versus GeoJSON?

Use GeoParquet for high-volume surveillance feeds and canonical analytical storage: it preserves types strictly, supports predicate pushdown and columnar compression, and avoids JSON serialization overhead. Reserve GeoJSON for web handoff to dashboards, where RFC 7946 compliance and tooling ubiquity outweigh its memory and parsing cost for large polygon sets.

Which resampling method should I use when aligning exposure rasters?

Use bilinear or cubic resampling for continuous fields such as PM2.5 or temperature, where interpolated values are meaningful. Use nearest-neighbor for any categorical raster such as land use or soil class, because interpolating class codes invents fractional categories that corrupt every downstream zonal statistic.

Why does mixing geometry types in one layer break the pipeline?

Vectorized spatial joins and area calculations assume a homogeneous geometry type. A layer that interleaves Polygon and MultiPolygon, or LineString and Point, passes a naive read but produces incorrect areas or failed joins. The ingestion gate rejects on more than one geometry type and requires normalization with a logged record of what was changed.

Spatial Data Types & Formats in Production Epidemiology Pipelines

This guide is part of Spatial Epidemiology Fundamentals & Data Standards. It covers how to choose, validate, and convert the spatial data types and file formats that flow through a public health surveillance pipeline, so that every downstream join, buffer, and statistic operates on geometrically and structurally sound inputs. Format selection is not a storage detail: vector topology, raster cell alignment, and tabular schema enforcement each set hard preconditions for spatial joins, exposure overlays, and cluster detection, and a single malformed input that slips past ingestion may not surface until after maps have been published.

Concept & Epidemiological Alignment

A spatial data type describes the geometric model a record carries — a point, a line, a polygon, or a continuous grid cell — while a format is the on-disk container that serializes those geometries together with their attributes and coordinate reference metadata. The two are coupled: a format either preserves topology, type fidelity, and a declared CRS, or it silently degrades them. In surveillance work the distinction is operational, because the validity of an incidence rate, an exposure overlay, or a hotspot z-score depends on the geometry surviving ingestion without distortion.

Surveillance pipelines routinely combine four input classes. Case line lists arrive as tabular coordinate pairs and become geocoded points (patient residences, testing sites, vector-trap locations). Administrative boundaries arrive as polygons (census tracts, health service areas, reporting jurisdictions). Environmental exposure fields arrive as rasters (interpolated PM2.5, land surface temperature, land cover). Network and corridor data arrive as lines (mobility flows, watershed boundaries, transmission corridors). Each class has a different failure surface: points fail on undeclared CRS and over-precision; polygons fail on invalid topology; rasters fail on extent and resolution mismatch; lines fail on connectivity and direction.

The choice of when to use each format is governed by the downstream operation, not by habit. Use a topology-preserving vector format whenever a record must participate in a spatial join, a buffer, or a Global & Local Moran’s I Implementation where the spatial weights matrix is built from polygon adjacency. Use an aligned raster whenever the analysis samples a continuous field at case locations or computes population-weighted exposure. Use a columnar tabular format (GeoParquet, Feather) for high-volume daily feeds where I/O latency, type preservation, and predicate pushdown matter more than human readability. Reserve text interchange formats (GeoJSON) for web handoff, where RFC 7946 compliance and tooling ubiquity outweigh serialization overhead.

Three assumptions must hold for format handling to remain analytically defensible. First, every layer must declare an authoritative CRS before any metric operation — an unlabeled shapefile or a loosely typed GeoJSON is a defect, not a default, and is the province of Coordinate Reference Systems for Public Health. Second, geometry type must be homogeneous within a layer, because mixed point/polygon collections break vectorized spatial joins and area calculations. Third, coordinate precision must be capped at the input’s true positional accuracy, a constraint enforced under Precision Standards in Epi-Mapping, so that a columnar store does not preserve eight decimal places of false certainty.

Format Selection Matrix

The table below maps each input class to its recommended on-disk format and the validation gate that must pass before the data reaches an analytical routine.

Input class	Geometry	Recommended format	Primary failure mode	Required gate
Case line list	Point	GeoParquet (zstd)	Undeclared CRS, over-precision	Schema + CRS + precision cap
Administrative boundaries	Polygon	GeoParquet / GeoPackage	Invalid topology, self-intersection	is_valid + single geom type
Exposure surface	Raster grid	Cloud-Optimized GeoTIFF	Extent / resolution mismatch	Grid alignment to template
Network / corridor	Line	GeoPackage	Disconnected segments, direction loss	Connectivity check
Web interchange	Any	GeoJSON (RFC 7946)	Serialization bloat, encoding drift	RFC 7946 lint
Legacy exchange	Vector	(migrate off Shapefile)	2 GB cap, field truncation, CP1252	Encoding + attribute audit

The decision rule is to ingest into a strict, topology-preserving columnar format at the analytical boundary and to treat Shapefile and raw CSV strictly as transient inputs that are migrated, never as the storage of record.
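The attribute audit listed for legacy exchange can start before any migration runs, by checking which field names would become indistinguishable under the 10-character limit. The following is a minimal sketch using only the standard library; the function name `truncation_collisions` and the example schema are hypothetical and not part of the pipeline above.

```python
from collections import defaultdict

DBF_FIELD_LIMIT = 10  # Shapefile attribute tables keep at most 10 characters per field name


def truncation_collisions(field_names, limit=DBF_FIELD_LIMIT):
    """Group field names that become identical once cut to `limit` characters."""
    groups = defaultdict(list)
    for name in field_names:
        groups[name[:limit]].append(name)
    # Only groups with more than one member represent a loss of information.
    return {short: names for short, names in groups.items() if len(names) > 1}


# Hypothetical surveillance schema, for illustration only.
fields = ["case_id", "onset_date", "population_2020", "population_2021", "tract_geoid"]
collisions = truncation_collisions(fields)
print(collisions)
```

Any non-empty result means the original names cannot be recovered from the Shapefile alone and must be restored from a data dictionary during migration.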

Spatial Data Prerequisites

Before a format pipeline runs, four prerequisites must be satisfied for every layer. Geometry type must be homogeneous: a layer is either all points, all lines, or all single-part polygons; mixed collections are normalized or rejected. CRS must be declared and resolvable to a recognized EPSG code; geographic EPSG:4326 is acceptable for storage and interchange, but distance and area work must be reprojected to a jurisdiction-appropriate projected CRS (a UTM zone, State Plane, or an Albers Equal Area Conic for multi-state extents). Topology must be valid — no self-intersections, unclosed rings, or duplicate vertices — and rasters must share a common grid origin, cell size, and extent with the template they are sampled against. Minimum metadata must be present: a stable record identifier for deterministic ordering, an onset_date or equivalent temporal field for space-time work, and per-record positional accuracy where the source provides it.

For raster inputs specifically, alignment is a hard prerequisite rather than a convenience. Overlaying case points on an exposure surface requires the raster and any comparison rasters to share resolution and extent; bilinear or cubic resampling is appropriate for continuous variables (PM2.5, temperature), while nearest-neighbor preserves categorical boundaries (land use, soil class). Resampling a categorical raster with bilinear interpolation invents fractional class codes and corrupts every subsequent zonal statistic.
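The effect of the wrong resampling method is easy to see on a tiny grid. The sketch below implements both methods in plain Python on a hypothetical 2x2 land-cover raster (the class codes 1 and 4 are invented for illustration); it is a teaching aid, not a replacement for the rasterio resampling used in the pipeline.

```python
def bilinear(grid, x, y):
    """Bilinear interpolation at fractional column x and row y of a 2D grid."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(grid[0]) - 1)
    y1 = min(y0 + 1, len(grid) - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def nearest(grid, x, y):
    """Nearest-neighbor lookup: always returns a value that exists in the grid."""
    row = min(int(y + 0.5), len(grid) - 1)
    col = min(int(x + 0.5), len(grid[0]) - 1)
    return grid[row][col]


# Hypothetical class codes: 1 = water, 4 = cropland.
land_cover = [[1, 4],
              [1, 4]]
valid_classes = {1, 4}

interpolated = bilinear(land_cover, 0.5, 0.5)  # halfway between the two columns
snapped = nearest(land_cover, 0.5, 0.5)

print(interpolated, interpolated in valid_classes)
print(snapped, snapped in valid_classes)
```

The bilinear result falls between the two class codes and corresponds to no real category, while the nearest-neighbor result is always one of the original codes.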

Production Implementation

The implementation below validates and normalizes a heterogeneous batch of inputs in a fixed, auditable order: it repairs vector topology, enforces a single geometry type, aligns rasters to a template grid, and ingests tabular line lists into a schema-validated columnar store. Compliance annotations mark each step that must be logged for regulatory defensibility.

# Pinned stack: geopandas==0.14.4, shapely==2.0.4, rasterio==1.3.10,
#               pandas==2.2.2, pyarrow==16.1.0, pyproj==3.6.1
import logging
import geopandas as gpd
import pandas as pd
import numpy as np
import rasterio
from rasterio.warp import reproject, Resampling
from shapely.validation import make_valid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("format_gate")

REQUIRED_FIELDS = {"case_id", "latitude", "longitude", "onset_date"}
COORD_PRECISION = 5          # ~1.1 m at the equator — cap at facility-level analysis
ANALYSIS_CRS = "EPSG:32616"  # UTM 16N — jurisdiction projected CRS for distance/area work


def validate_vector_gdf(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Repair invalid geometries and enforce a single geometry type."""
    if gdf.crs is None:
        raise ValueError("Layer has no declared CRS — reject at ingestion gate.")
    invalid = ~gdf.geometry.is_valid
    if invalid.any():
        log.warning("Repairing %d invalid geometries", int(invalid.sum()))  # AUDIT: geometry repair
        gdf = gdf.copy()
        gdf.loc[invalid, "geometry"] = gdf.loc[invalid, "geometry"].apply(make_valid)
    if gdf.geom_type.nunique() > 1:
        raise ValueError(f"Mixed geometry types {set(gdf.geom_type)} — normalize before ingestion.")
    return gdf


def align_raster_to_template(src_path: str, template_path: str, out_path: str,
                             resampling: Resampling = Resampling.bilinear) -> None:
    """Align a raster to a template grid; use nearest-neighbor for categorical data."""
    with rasterio.open(template_path) as tmpl:
        dst_transform, dst_crs, dst_shape = tmpl.transform, tmpl.crs, tmpl.shape
        dst_meta = tmpl.meta.copy()
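    with rasterio.open(src_path) as src:
        src_dtype, src_nodata = src.dtypes[0], src.nodata
        dst = np.empty(dst_shape, dtype=src_dtype)
        reproject(
            source=rasterio.band(src, 1), destination=dst,
            src_transform=src.transform, src_crs=src.crs,
            dst_transform=dst_transform, dst_crs=dst_crs,
            src_nodata=src_nodata, dst_nodata=src_nodata,
            resampling=resampling,
        )
    # The template supplies only the grid; band count, dtype, and nodata follow the source,
    # otherwise the output header can disagree with the array being written.
    dst_meta.update(driver="GTiff", height=dst_shape[0], width=dst_shape[1],
                    transform=dst_transform, crs=dst_crs,
                    count=1, dtype=src_dtype, nodata=src_nodata)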
    with rasterio.open(out_path, "w", **dst_meta) as out:
        out.write(dst, 1)
    log.info("Aligned %s to template grid %dx%d", src_path, dst_shape[1], dst_shape[0])  # AUDIT


def ingest_line_list(csv_path: str, parquet_path: str) -> gpd.GeoDataFrame:
    """Validate a tabular case line list and persist as schema-checked GeoParquet."""
    df = pd.read_csv(csv_path, dtype={"case_id": str, "latitude": float, "longitude": float})
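    missing = REQUIRED_FIELDS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")
    # Reject null or out-of-range coordinates instead of building unusable points
    bad = ~(df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180))
    if bad.any():
        raise ValueError(f"{int(bad.sum())} records have null or out-of-range coordinates.")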
    # Cap precision so the columnar store does not preserve false certainty (AUDIT: precision policy)
    df["latitude"] = df["latitude"].round(COORD_PRECISION)
    df["longitude"] = df["longitude"].round(COORD_PRECISION)
    df = df.sort_values("case_id").reset_index(drop=True)  # deterministic, byte-reproducible order
    gdf = gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df.longitude, df.latitude), crs="EPSG:4326"
    )
    gdf = validate_vector_gdf(gdf).to_crs(ANALYSIS_CRS)  # reproject at the analytical boundary
    gdf.to_parquet(parquet_path, compression="zstd")
    log.info("Ingested %d cases to %s", len(gdf), parquet_path)
    return gdf

The same validate_vector_gdf gate is reusable for administrative-boundary polygons, and the columnar output is the canonical input for boundary-level statistics such as Getis-Ord Gi* Hotspot Detection. Detailed migration of legacy ESRI inputs into this pipeline, including encoding normalization and attribute-name de-truncation, is documented in Converting Shapefiles to GeoJSON for Epi Pipelines.

Parameter Selection & Tuning

The format pipeline exposes a small set of parameters whose defaults must be chosen deliberately and recorded. Resampling method is the most consequential raster choice: Resampling.bilinear or Resampling.cubic for continuous exposure surfaces, Resampling.nearest for any categorical raster — there is no safe default that covers both. Coordinate precision (COORD_PRECISION) should be set from the strictest applicable disclosure rule, not from the geocoder output: 5 decimals (~1.1 m) only for facility-level analysis under a data use agreement, 4 (~11 m) for tract-level controlled access, 3 (~110 m) for ZIP or county release. Columnar compression defaults to zstd for a balance of ratio and decode speed; snappy is faster to write for short-lived intermediates. Target CRS (ANALYSIS_CRS) must be the projected zone that minimizes distortion over the study extent — a single UTM zone is reliable only within its ~6° longitude band, so transboundary studies require an equal-area projection sized to the region.
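Two of these parameters can be sanity-checked with simple arithmetic. The sketch below converts a decimal-place cap into approximate ground distance and derives the standard UTM zone for a longitude. It assumes roughly 111,320 m per degree and the regular 6° zone grid, ignoring the special zones around Norway and Svalbard; the function names are hypothetical.

```python
import math

METERS_PER_DEGREE = 111_320  # approximate length of one degree at the equator


def ground_resolution_m(decimals, latitude_deg=0.0):
    """Return (north-south, east-west) metres represented by the last retained decimal."""
    step_deg = 10 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


def utm_zone(longitude_deg):
    """Standard 6-degree UTM zone number (1-60) for a longitude in [-180, 180]."""
    return min(int((longitude_deg + 180) // 6) + 1, 60)


def utm_epsg(longitude_deg, latitude_deg):
    """WGS 84 / UTM EPSG code: 326xx in the northern hemisphere, 327xx in the southern."""
    base = 32600 if latitude_deg >= 0 else 32700
    return f"EPSG:{base + utm_zone(longitude_deg)}"


def spans_multiple_zones(min_lon, max_lon):
    """True when a study extent crosses a UTM zone boundary."""
    return utm_zone(min_lon) != utm_zone(max_lon)


for decimals in (5, 4, 3):
    ns, ew = ground_resolution_m(decimals, latitude_deg=41.9)
    print(decimals, round(ns, 2), round(ew, 2))

print(utm_epsg(-87.6, 41.9))
print(spans_multiple_zones(-91.0, -86.0))
```

East-west resolution shrinks with latitude, so the equatorial figures are an upper bound. An extent for which `spans_multiple_zones` returns True is a candidate for an equal-area projection rather than a single UTM zone.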

For Parquet specifically, set the row-group size to match the typical query window (a daily reporting cycle), enable dictionary encoding for low-cardinality categorical attributes, and write the partition key on the temporal field so that predicate pushdown can skip irrelevant date ranges during surveillance refreshes.

Edge Cases & Failure Modes

Mixed geometry collections. A boundary file that interleaves Polygon and MultiPolygon, or a network file that mixes LineString and Point waypoints, will pass a naive read but break vectorized joins. The gate rejects on geom_type.nunique() > 1; normalize by exploding multiparts or filtering to the dominant type with an audit log of what was dropped.
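One way to normalize a mixed boundary layer is to explode multiparts and record what changed. The sketch below works on GeoJSON-style feature dictionaries using only the standard library; the function name, the `part_index` property, and the sample features are hypothetical. In geopandas the analogous operation is GeoDataFrame.explode.

```python
import logging

log = logging.getLogger("format_gate.normalize")


def explode_multipolygons(features):
    """Return single-part Polygon features plus an audit summary of what changed."""
    out, audit = [], {"exploded": 0, "dropped": 0}
    for feat in features:
        geom = feat["geometry"]
        if geom["type"] == "Polygon":
            out.append(feat)
        elif geom["type"] == "MultiPolygon":
            audit["exploded"] += 1
            # Each member of a MultiPolygon is the ring list of one Polygon.
            for idx, rings in enumerate(geom["coordinates"]):
                out.append({
                    "type": "Feature",
                    "properties": {**feat["properties"], "part_index": idx},
                    "geometry": {"type": "Polygon", "coordinates": rings},
                })
        else:
            audit["dropped"] += 1
            log.warning("Dropped %s feature %s", geom["type"], feat["properties"].get("unit_id"))
    log.info("Normalization audit: %s", audit)  # AUDIT: geometry normalization
    return out, audit


square = [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]
island = [[[5, 5], [6, 5], [6, 6], [5, 6], [5, 5]]]
features = [
    {"type": "Feature", "properties": {"unit_id": "A"},
     "geometry": {"type": "Polygon", "coordinates": square}},
    {"type": "Feature", "properties": {"unit_id": "B"},
     "geometry": {"type": "MultiPolygon", "coordinates": [square, island]}},
    {"type": "Feature", "properties": {"unit_id": "C"},
     "geometry": {"type": "Point", "coordinates": [2, 2]}},
]
normalized, audit = explode_multipolygons(features)
print(len(normalized), audit)
```

Exploding breaks the one-row-per-unit relationship, so areas and counts need to be re-aggregated by the unit identifier before computing rates.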

Island polygons and disconnected segments. Administrative units with no shared boundary (offshore territories, exclaves) produce empty adjacency rows that propagate as NaN through spatial-weights construction. Detect them before building weights and decide explicitly between k-nearest-neighbor weights and dropping the isolate with a logged justification.
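Isolates can be flagged from the adjacency structure before any weights matrix is built. The following is a minimal sketch, assuming adjacency is already available as a list of unit pairs that share a border; the function names and unit identifiers are hypothetical.

```python
def build_adjacency(unit_ids, shared_borders):
    """Build a symmetric neighbor map from pairs of units that share a boundary."""
    adjacency = {unit: set() for unit in unit_ids}
    for a, b in shared_borders:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def find_isolates(adjacency):
    """Units with no neighbors, in a deterministic order for logging."""
    return sorted(unit for unit, neighbors in adjacency.items() if not neighbors)


# Hypothetical reporting units: "ISL" is an offshore territory with no shared border.
units = ["N1", "N2", "N3", "ISL"]
borders = [("N1", "N2"), ("N2", "N3")]

adjacency = build_adjacency(units, borders)
isolates = find_isolates(adjacency)
print(isolates)
```

A non-empty result is the point at which the choice between k-nearest-neighbor weights and dropping the isolate should be made and logged.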

Raster extent and resolution drift. Two exposure rasters from different vendors may share a CRS yet differ in cell origin by a fraction of a pixel; sampled together they introduce a systematic offset in population-weighted incidence. Always align both to a single template before sampling, and assert that transform and shape match post-alignment.
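The post-alignment assertion can be as simple as comparing grid definitions field by field. The sketch below uses a hypothetical `GridSpec` tuple in place of a rasterio transform and shape, so the field names and sample values are assumptions for illustration.

```python
import math
from typing import NamedTuple


class GridSpec(NamedTuple):
    origin_x: float     # x coordinate of the upper-left corner
    origin_y: float     # y coordinate of the upper-left corner
    cell_width: float
    cell_height: float
    cols: int
    rows: int


def grid_mismatches(a, b, abs_tol=1e-9):
    """List every grid parameter on which two rasters disagree."""
    problems = []
    for name in ("origin_x", "origin_y", "cell_width", "cell_height"):
        va, vb = getattr(a, name), getattr(b, name)
        if not math.isclose(va, vb, rel_tol=0.0, abs_tol=abs_tol):
            problems.append(f"{name}: {va} != {vb}")
    for name in ("cols", "rows"):
        if getattr(a, name) != getattr(b, name):
            problems.append(f"{name}: {getattr(a, name)} != {getattr(b, name)}")
    return problems


def assert_aligned(a, b):
    """Raise if the two grids differ in origin, cell size, or shape."""
    problems = grid_mismatches(a, b)
    if problems:
        raise ValueError("Raster grids are not aligned: " + "; ".join(problems))


template = GridSpec(500000.0, 4650000.0, 30.0, 30.0, 1000, 800)
vendor_b = GridSpec(500007.5, 4650000.0, 30.0, 30.0, 1000, 800)  # quarter-pixel origin shift

print(grid_mismatches(template, vendor_b))
```

A quarter-pixel shift like the one in `vendor_b` is not visible on a map but is caught by the comparison.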

Encoding corruption from legacy inputs. Shapefiles frequently ship as CP1252 or ISO-8859-1; reading them as UTF-8 garbles accented place names and silently changes join keys. Detect the source encoding explicitly and normalize to UTF-8 at conversion; never assume it.
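The failure is easy to reproduce on a single attribute value. The sketch below decodes raw bytes strictly, preferring a declared encoding (for example one read from a .cpg sidecar, where present) and falling back through a fixed candidate list. The function name and the fallback order are assumptions, and the heuristic cannot distinguish every single-byte encoding from another.

```python
from typing import Optional, Tuple


def decode_attribute(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Decode an attribute value strictly; return (text, encoding actually used)."""
    candidates = ([declared] if declared else []) + ["utf-8", "cp1252"]
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding  # strict: fail rather than substitute
        except (UnicodeDecodeError, LookupError):
            continue
    # latin-1 maps every byte to a character, so this last resort never raises.
    return raw.decode("latin-1"), "latin-1"


legacy_bytes = "Québec".encode("cp1252")        # how a legacy attribute table might store it

lenient = legacy_bytes.decode("utf-8", errors="replace")
text, used = decode_attribute(legacy_bytes)

print(repr(lenient), lenient == "Québec")       # lenient decoding alters the join key
print(repr(text), used)
```

Logging the encoding actually used for each layer gives the audit trail a record of where a fallback was needed.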

Memory constraints for N > 50k. Building a full GeoJSON FeatureCollection in memory for tens or hundreds of thousands of polygons holds every geometry as parsed objects at once and can exhaust available memory. For large migrations, stream with GDAL’s ogr2ogr or gdal.VectorTranslate, or write GeoParquet row-group by row-group rather than materializing a single object.
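Where GDAL is not available for a quick export, the same idea of never materializing the whole collection can be sketched with the standard library: serialize one feature at a time into an open file handle. The function name and the generator of sample features below are hypothetical, and this does not replace ogr2ogr for real migrations.

```python
import io
import json


def stream_feature_collection(features, out):
    """Write features one at a time as a valid FeatureCollection; return the count."""
    out.write('{"type":"FeatureCollection","features":[')
    count = 0
    for feature in features:
        if count:
            out.write(",")
        out.write(json.dumps(feature, separators=(",", ":")))
        count += 1
    out.write("]}")
    return count


def sample_features(n):
    """Generator stand-in for a large source layer; yields one feature at a time."""
    for i in range(n):
        yield {
            "type": "Feature",
            "properties": {"case_id": f"C{i:05d}"},
            "geometry": {"type": "Point", "coordinates": [-87.6 + i * 0.001, 41.9]},
        }


buffer = io.StringIO()  # an open file handle would be used in practice
written = stream_feature_collection(sample_features(1000), buffer)
parsed = json.loads(buffer.getvalue())
print(written, len(parsed["features"]))
```

Because the input is a generator, peak memory is bounded by a single feature rather than by the size of the layer.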

Compliance & Audit Controls

Format handling is also a disclosure-control boundary. Before individual-level spatial data reaches an analytical model, the pipeline must apply HIPAA Safe Harbor and GDPR generalization: cap coordinate precision at ingestion, suppress raster or polygon outputs where case counts fall below the statistical-disclosure-control limit (typically n < 5), and geomask point residences by displacement or centroid-snap where required. The regulatory crosswalk that maps each output format and precision band to its statutory obligation is maintained in Compliance Mapping Frameworks.
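Small-count suppression is straightforward to express as an output gate. The following is a minimal sketch, assuming a threshold of 5 and counts keyed by reporting unit. It applies the rule literally (any count below the threshold is withheld, including zero) and does not attempt complementary suppression, both of which are policy decisions; the names are hypothetical.

```python
SUPPRESSION_THRESHOLD = 5  # assumed statistical-disclosure-control limit


def suppress_small_counts(counts, threshold=SUPPRESSION_THRESHOLD):
    """Replace counts below the threshold with None; return (released, number suppressed)."""
    released, suppressed = {}, 0
    for unit, n in counts.items():
        if n < threshold:
            released[unit] = None  # withheld from any published polygon or raster output
            suppressed += 1
        else:
            released[unit] = n
    return released, suppressed


# Hypothetical tract-level case counts.
tract_counts = {"T001": 12, "T002": 3, "T003": 5, "T004": 1}
released, suppressed = suppress_small_counts(tract_counts)
print(released, suppressed)
```

The suppressed count belongs in the per-cycle manifest so reviewers can see how much of the output was withheld.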

Every conversion must be reproducible and reviewable. Persist original inputs in an immutable, read-only staging layer; write all repaired and reprojected geometries to derived layers with explicit lineage. Order records by a stable case_id and use fixed resampling and rounding modes so two runs produce byte-identical output. Emit a per-cycle manifest — input format, source CRS, resampling method, precision cap, geometry-repair count, and a SHA-256 hash of the configuration — and attach ISO 19115 lineage fields (process step, source positional accuracy, transformation parameters) so the format decisions survive interagency handoff. Authoritative format references are maintained by the Open Geospatial Consortium (OGC) and the GDAL/OGR driver specifications.
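A per-cycle manifest with a configuration hash can be assembled with the standard library alone. The sketch below hashes a canonical JSON serialization so that key order does not change the digest; the manifest field names and the example configuration are assumptions, not a prescribed schema.

```python
import hashlib
import json


def config_hash(config):
    """SHA-256 of a canonical (sorted-key, compact) JSON serialization of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(input_format, source_crs, config, repair_count):
    """Assemble the audit record for one processing cycle."""
    return {
        "input_format": input_format,
        "source_crs": source_crs,
        "resampling_method": config["resampling"],
        "precision_cap": config["coord_precision"],
        "geometry_repair_count": repair_count,
        "config_sha256": config_hash(config),
    }


# Hypothetical configuration for one run.
config = {"resampling": "bilinear", "coord_precision": 5, "analysis_crs": "EPSG:32616"}
manifest = build_manifest("CSV", "EPSG:4326", config, repair_count=0)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Two runs with the same configuration produce the same digest regardless of how the dictionary was constructed, which makes the hash usable as a reproducibility check.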

Production Format-Handling Checklist

Coordinate Reference Systems for Public Health — enforcing the CRS gate every format must clear before metric operations.
Precision Standards in Epi-Mapping — capping retained coordinate precision at true positional accuracy.
Compliance Mapping Frameworks — encoding HIPAA/GDPR and disclosure constraints as deterministic output gates.
Converting Shapefiles to GeoJSON for Epi Pipelines — the legacy-format migration workflow in full.
Disease Clustering & Spatial Statistical Modeling — the downstream methods that consume these validated formats.