Converting Shapefiles to GeoJSON for Epi Pipelines

Legacy ESRI Shapefiles persist in public health data exchanges due to entrenched agency workflows, but modern syndromic surveillance dashboards, real-time case tracking feeds, and interoperable web mapping stacks require strict GeoJSON compliance. A naive export pipeline risks silent topology corruption, coordinate drift, and inadvertent exposure of protected health information (PHI). Production-grade conversion for epidemiological analytics demands deterministic geometry repair, explicit coordinate reference system (CRS) alignment, and audit-ready attribute sanitization before ingestion into spatial join or hotspot detection routines.

The conversion pipeline applies each safeguard in a fixed, auditable order:

flowchart LR
  A["Read legacy shapefile (.shp / .prj)"] --> B["Verify datum & align CRS to EPSG:4326"]
  B --> C["Validate & repair geometry (make_valid)"]
  C --> D["Truncate coordinates to 6 decimals"]
  D --> E["Sanitize attributes (drop PHI, snake_case)"]
  E --> F["Serialize compliant GeoJSON"]

CRS Alignment & Datum Verification

The GeoJSON specification (RFC 7946) mandates WGS 84 (EPSG:4326) with strict longitude-latitude ordering. Public health shapefiles frequently arrive in projected systems (UTM zones, State Plane, or custom health district grids). Blind transformation without datum verification introduces boundary drift that compromises spatial joins with census tracts, healthcare facility networks, or environmental exposure layers.

A robust pipeline must inspect the source .prj metadata, apply rigorous datum transformation using pyproj, and validate coordinate bounds before serialization. When working across multi-jurisdictional datasets, understanding how Spatial Epidemiology Fundamentals & Data Standards governs coordinate handling prevents silent misalignment during aggregation, rate calculation, or cluster detection.

Geometry Validation & Topology Repair

Legacy shapefiles routinely contain invalid geometries: self-intersecting polygons, unclosed rings, incorrect winding orders, or collapsed features. These defects cause serialization failures, silent topology breaks in downstream mapping engines, or erroneous spatial weights matrices. The conversion process must enforce shapely validation routines, applying make_valid() and filtering null geometries before serialization. Multi-part geometries should be explicitly handled to prevent unexpected feature duplication during GeoJSON flattening.

Precision Control & De-identification

Raw shapefile coordinates often carry 10–15 decimal places of floating-point noise, which inflates GeoJSON payload sizes, degrades rendering performance, and can inadvertently encode exact residential locations. For epidemiological mapping, truncating to six decimal places (~0.11 meters at the equator) satisfies analytical precision while supporting HIPAA Safe Harbor and GDPR de-identification thresholds. Precision control must occur after CRS transformation and geometry validation to prevent artificial topology errors or coordinate snapping artifacts.

Audit-Ready Attribute Sanitization

Public health datasets frequently contain PHI, sensitive demographic fields, or non-standardized column names. The conversion process must explicitly drop or hash these columns, standardize field names to snake_case, and log all transformations. This creates an immutable audit trail required for compliance reviews and data governance frameworks. Null handling must also be deterministic: either drop records with missing spatial attributes or explicitly flag them for downstream imputation workflows.

Production Pipeline Implementation

The following Python implementation enforces CRS alignment, precision limits, geometry repair, and attribute sanitization. It is designed for headless execution, CI/CD integration, and compliance logging.

import logging
import json
import re
from typing import Optional, List
import geopandas as gpd
import numpy as np
from shapely.geometry import mapping
from shapely.validation import make_valid
from shapely.ops import transform

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s"
)

def _round_coords(geom, precision: int = 6):
    """Recursively round coordinates in a Shapely geometry."""
    def _round_xy(x, y, z=None):
        rounded = (round(x, precision), round(y, precision))
        return rounded if z is None else (*rounded, round(z, precision))
    return transform(_round_xy, geom)

def _sanitize_columns(gdf: gpd.GeoDataFrame, pii_columns: List[str]) -> gpd.GeoDataFrame:
    """Drop PHI/PII columns and standardize field names to snake_case."""
    drop_cols = [c for c in pii_columns if c in gdf.columns]
    if drop_cols:
        logging.info(f"Dropping PHI/PII columns: {drop_cols}")
        gdf = gdf.drop(columns=drop_cols)
    
    # Standardize to snake_case
    gdf.columns = [re.sub(r'[^a-zA-Z0-9_]+', '_', c.lower()) for c in gdf.columns]
    return gdf

def convert_shapefile_to_geojson(
    shp_path: str,
    out_path: str,
    target_crs: str = "EPSG:4326",
    precision: int = 6,
    pii_columns: Optional[List[str]] = None,
    drop_null_geoms: bool = True
) -> dict:
    """
    Production-ready Shapefile to GeoJSON converter for epi pipelines.
    Enforces CRS validation, topology repair, precision limits, and PHI sanitization.
    """
    if pii_columns is None:
        pii_columns = []

    logging.info(f"Loading shapefile: {shp_path}")
    gdf = gpd.read_file(shp_path)

    # 1. CRS Validation & Transformation
    if gdf.crs is None:
        raise ValueError("Source shapefile lacks CRS definition. Assign manually before conversion.")
    
    logging.info(f"Source CRS: {gdf.crs.to_string()} | Target CRS: {target_crs}")
    if str(gdf.crs) != target_crs:
        gdf = gdf.to_crs(target_crs)
        logging.info("CRS transformation applied.")
    
    # Validate WGS84 bounds
    if gdf.total_bounds[0] < -180 or gdf.total_bounds[2] > 180:
        logging.warning("Coordinates exceed valid longitude range. Verify CRS transformation.")

    # 2. Geometry Validation & Repair
    invalid_mask = ~gdf.geometry.is_valid
    if invalid_mask.any():
        logging.warning(f"Repairing {invalid_mask.sum()} invalid geometries.")
        gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(make_valid)
    
    if drop_null_geoms:
        null_mask = gdf.geometry.is_empty | gdf.geometry.isna()
        if null_mask.any():
            logging.info(f"Dropping {null_mask.sum()} null/empty geometries.")
            gdf = gdf[~null_mask].copy()

    # 3. Precision Control (Post-transform, Post-repair)
    logging.info(f"Truncating coordinates to {precision} decimal places.")
    gdf["geometry"] = gdf.geometry.apply(lambda geom: _round_coords(geom, precision))

    # 4. Attribute Sanitization
    gdf = _sanitize_columns(gdf, pii_columns)
    
    # 5. Serialization
    logging.info(f"Serializing to GeoJSON: {out_path}")
    geojson_dict = json.loads(gdf.to_json())
    
    # Strip unnecessary GeoJSON members for epi dashboard performance
    if "bbox" in geojson_dict:
        del geojson_dict["bbox"]
        
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(geojson_dict, f, separators=(",", ":"), ensure_ascii=False)
        
    logging.info(f"Conversion complete. Features exported: {len(geojson_dict['features'])}")
    return geojson_dict

Edge Cases & Pre-Flight Validation

Before deploying this pipeline into automated epi workflows, validate against common edge cases:

  • Multi-Jurisdictional Overlaps: Shapefiles containing overlapping administrative boundaries may produce duplicate features after make_valid(). Apply gdf.explode() and gdf.dissolve() if aggregation is required.
  • Z/M Coordinates: Legacy survey data often includes elevation or measurement dimensions. GeoJSON RFC 7946 supports optional z/m arrays, but most web mapping stacks ignore them. Strip dimensions using gdf.geometry.apply(lambda g: g if g.has_z else g) or explicitly flatten to 2D.
  • Date/Time Fields: Shapefiles store dates as strings or integers. Convert to ISO 8601 format before serialization to ensure consistent temporal joins in syndromic surveillance feeds.
  • Character Encoding: Force UTF-8 encoding during gpd.read_file() to prevent mojibake in non-English jurisdictional names or clinical descriptors.

Integrating this conversion routine into broader Spatial Data Types & Formats workflows ensures deterministic, compliant, and performant spatial data exchange across public health infrastructure.