Handling API Timeouts in Batch OSM Routing

In spatial epidemiology and public health infrastructure planning, calculating drive-time accessibility across large patient cohorts requires deterministic, fault-tolerant network analysis. When origin-destination (OD) queries scale to tens of thousands of facility-patient pairs, API timeouts become a critical engineering constraint. Unhandled timeouts silently drop OD pairs, which introduces spatial sampling bias, compromises healthcare access equity metrics, and violates data integrity requirements for regulatory reporting. Production-grade pipelines must implement stateful retry architectures, topology-aware payload partitioning, and audit-compliant error tracking. This operational rigor aligns directly with established practices in Healthcare Access & Network Analysis Automation, where statistical validity depends on complete spatial coverage and reproducible query execution.

Root Cause Analysis in Routing Engines

OSM-derived routing engines (e.g., OSRM, Valhalla) execute graph traversals whose cost grows non-linearly with coordinate count, edge complexity, and turn restrictions. Timeouts typically arise from three sources: server-side query saturation, client-side serialization overhead, and transient network degradation. In public health workflows, unvalidated coordinate arrays or mismatched coordinate reference systems (CRS) compound the problem: coordinates that are out of range or far from the road network can fail outright or make nearest-road snapping more expensive, inflating latency. Additionally, batch matrices that exceed engine-specific limits can trigger silent queueing, outright rejection, or hard 504 Gateway Timeout responses. Understanding these failure modes is essential for designing resilient Batch Routing & Error Handling architectures that preserve spatial accuracy under sustained computational load.
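A cheap range check before submission can catch the most common input faults, such as projected coordinates passed as degrees. The following is a minimal sketch, assuming coordinates arrive as (lon, lat) pairs in EPSG:4326; the helper name `find_invalid_coords` is hypothetical and not part of any routing library.

```python
import math

def find_invalid_coords(coords):
    """Return indices of (lon, lat) pairs that cannot be valid EPSG:4326 points."""
    bad = []
    for i, pair in enumerate(coords):
        try:
            lon, lat = (float(v) for v in pair)
        except (TypeError, ValueError):
            # Not a two-element numeric pair
            bad.append(i)
            continue
        if not (math.isfinite(lon) and math.isfinite(lat)):
            bad.append(i)
        elif not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            bad.append(i)
    return bad

sample = [
    (-73.9857, 40.7484),       # valid lon/lat
    (40.7484, -173.9857),      # latitude out of range
    (583960.0, 4507523.0),     # projected metres, not degrees
    (float("nan"), 10.0),      # non-finite value
]
invalid = find_invalid_coords(sample)
```

A range check cannot detect every swapped lon/lat pair, since both values may still fall inside valid bounds, so it complements rather than replaces explicit CRS handling.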

Retry Architecture & Compliance Logging

Transient failures require stateful retry mechanisms rather than naive polling loops. Production systems should implement exponential backoff with randomized jitter to prevent thundering-herd effects on shared routing infrastructure. The retry strategy must strictly differentiate between recoverable HTTP status codes (429, 502, 504) and terminal failures (400, 404, invalid geometries). A circuit breaker pattern halts requests when consecutive failures exceed a defined threshold, preventing cascading pipeline degradation.
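The circuit breaker described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not a reference to any particular library: the class name `CircuitBreaker`, its thresholds, and the injectable `clock` parameter are all hypothetical choices made for the example.

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive failures; permit requests again after `reset_after` seconds."""

    def __init__(self, max_failures=5, reset_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Requests are allowed again once the cool-down has elapsed;
        # a further failure re-opens the breaker with a fresh timestamp.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()

# Demonstration with a fake clock
now = [0.0]
breaker = CircuitBreaker(max_failures=3, reset_after=10.0, clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
blocked = breaker.allow_request()        # breaker is open
now[0] = 10.0
probe_allowed = breaker.allow_request()  # cool-down elapsed
```

In a batch loop, the pipeline would call `allow_request()` before each chunk and defer remaining chunks while the breaker is open, rather than spending retry budget against an engine that is already saturated.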

Each attempt must log a deterministic payload hash, UTC timestamp, and spatial bounding box. Raw patient identifiers must never be persisted in retry logs, ensuring alignment with HIPAA minimum necessary standards and GDPR data minimization principles. The Python logging module should be configured with structured JSON formatters to enable downstream audit parsing and compliance verification.
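One possible shape for such a log entry is sketched below using only the standard library. The formatter class `JsonAuditFormatter`, the helper names, and the field names (`payload_hash`, `bbox`, `attempt`, `status`) are assumptions chosen for illustration.

```python
import hashlib
import io
import json
import logging
from datetime import datetime, timezone

class JsonAuditFormatter(logging.Formatter):
    """Render each record as one JSON object with a UTC ISO-8601 timestamp."""

    def format(self, record):
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record
        for key in ("payload_hash", "bbox", "attempt", "status"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry, sort_keys=True)

def bounding_box(coords):
    """Return [min_lon, min_lat, max_lon, max_lat] for (lon, lat) pairs."""
    lons = [c[0] for c in coords]
    lats = [c[1] for c in coords]
    return [min(lons), min(lats), max(lons), max(lats)]

def payload_hash(coords):
    """Short deterministic digest of the coordinate payload."""
    return hashlib.sha256(json.dumps(coords, sort_keys=True).encode()).hexdigest()[:12]

# Demonstration: write one audit line to an in-memory stream
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonAuditFormatter())
audit_log = logging.getLogger("routing_audit_demo")
audit_log.addHandler(handler)
audit_log.setLevel(logging.INFO)
audit_log.propagate = False

coords = [(-73.99, 40.75), (-73.95, 40.78)]
audit_log.warning(
    "RECOVERABLE",
    extra={"payload_hash": payload_hash(coords), "bbox": bounding_box(coords),
           "attempt": 1, "status": 504},
)
line = stream.getvalue().strip()
```

The bounding box of a very small chunk can approximate an individual address, so a pipeline may need to coarsen or suppress it below some chunk size.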

A single OD chunk flows through the retry layer, which classifies failures and re-issues recoverable requests with backoff before returning a validated matrix:

sequenceDiagram
  participant P as Pipeline
  participant R as Retry layer
  participant E as Routing engine
  P->>R: submit OD chunk
  R->>E: Table API request (attempt 1)
  E-->>R: 504 Gateway Timeout
  Note over R: exponential backoff + jitter
  R->>E: Table API request (attempt 2)
  E-->>R: 200 OK, durations matrix
  R-->>P: validated travel-time matrix

Payload Optimization & CRS Alignment

Batch routing efficiency depends on strategic request partitioning. Monolithic coordinate matrices should be replaced with topology-aware chunks based on spatial proximity and network boundaries. Pre-processing must enforce consistent coordinate precision (typically six decimal places, about 0.11 m at the equator) and validate geometries against the routing engine’s expected CRS (usually EPSG:4326, with OSRM expecting longitude before latitude). Spatial indexing via geopandas enables efficient chunk generation that minimizes cross-boundary route fragmentation. These optimizations reduce payload serialization overhead and help keep individual requests within engine size and timeout limits, such as those described in the OSRM Table API documentation.
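As a rough illustration of proximity-based partitioning, the sketch below bins points into a regular lon/lat grid and splits each cell into bounded chunks. It uses a plain grid as a stand-in for a spatial index and does not account for network boundaries; the function name `grid_chunks` and the `cell_deg` and `max_size` parameters are hypothetical.

```python
import math
from collections import defaultdict

def grid_chunks(points, cell_deg=0.05, max_size=25):
    """Group (lon, lat) points by grid cell, then split each cell into chunks of at most max_size."""
    cells = defaultdict(list)
    for lon, lat in points:
        key = (math.floor(lon / cell_deg), math.floor(lat / cell_deg))
        # Round to six decimals so repeated runs produce identical payloads
        cells[key].append((round(lon, 6), round(lat, 6)))

    chunks = []
    for key in sorted(cells):  # sorted cell keys keep the output order deterministic
        members = cells[key]
        for i in range(0, len(members), max_size):
            chunks.append(members[i:i + max_size])
    return chunks

points = [
    (-73.99, 40.75), (-73.98, 40.76),
    (-118.24, 34.05), (-118.25, 34.06),
    (-73.97, 40.77),
]
chunks = grid_chunks(points, cell_deg=1.0, max_size=2)
```

Here the two distant clusters never share a chunk, so no single request spans both regions. The cell size would need tuning to the study area and to the engine's table size limit.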

Production-Ready Python Implementation

The following pipeline sketches bounded retry logic, CRS normalization, and audit logging using tenacity for backoff management and requests for HTTP execution. The JSON request payload is illustrative and must be adapted to the target engine's request format.

import hashlib
import json
import logging
import time
import geopandas as gpd
import requests
from shapely.geometry import Point
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type, retry_if_result
from requests.exceptions import HTTPError, Timeout, ConnectionError

# Configure audit logging with UTC timestamps (asctime defaults to local time)
logging.Formatter.converter = time.gmtime
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(message)s',
    datefmt='%Y-%m-%dT%H:%M:%SZ',
    handlers=[logging.FileHandler("routing_audit.log")]
)
logger = logging.getLogger("osm_batch_router")

def generate_payload_hash(coords):
    """Deterministic hash for audit trails without storing raw coordinates."""
    return hashlib.sha256(json.dumps(coords, sort_keys=True).encode()).hexdigest()[:12]

def validate_response(result):
    """Return True if the result is a response with a recoverable status code.

    tenacity passes every return value to this predicate, including the parsed
    JSON dict returned on success, so non-response results must yield False.
    """
    return getattr(result, "status_code", None) in {429, 502, 504}

# If recoverable statuses persist through the final attempt, tenacity raises
# RetryError; network exceptions are re-raised as-is because reraise=True.
@retry(
    retry=(retry_if_exception_type((Timeout, ConnectionError)) | retry_if_result(validate_response)),
    wait=wait_random_exponential(multiplier=1, max=30),  # exponential backoff with random jitter
    stop=stop_after_attempt(5),
    reraise=True
)
def fetch_route_batch(engine_url, coords_chunk, timeout=15):
    # Illustrative payload: adapt to the target engine's request format
    # (stock OSRM, for example, expects coordinates in the URL path of a GET request)
    payload = {"coordinates": coords_chunk, "sources": "all", "destinations": "all"}
    payload_hash = generate_payload_hash(coords_chunk)

    try:
        resp = requests.post(engine_url, json=payload, timeout=timeout)
        if resp.status_code == 200:
            logger.info(f"SUCCESS | hash={payload_hash} | duration={resp.elapsed.total_seconds():.2f}s")
            return resp.json()
        elif validate_response(resp):
            logger.warning(f"RECOVERABLE | hash={payload_hash} | status={resp.status_code}")
            return resp  # tenacity will trigger retry
        else:
            logger.error(f"TERMINAL | hash={payload_hash} | status={resp.status_code} | body={resp.text[:100]}")
            resp.raise_for_status()
            # raise_for_status() only raises for 4xx/5xx; treat anything else as terminal too
            raise HTTPError(f"Unexpected status {resp.status_code}", response=resp)
    except (Timeout, ConnectionError) as e:
        logger.warning(f"NETWORK_ERROR | hash={payload_hash} | detail={e}")
        raise
def chunk_od_pairs(gdf_origins, gdf_destinations, chunk_size=25):
    """Build all origin-destination coordinate pairs and split them into fixed-size chunks.

    Pairs follow input row order, so sort or group the inputs spatially
    beforehand if chunks should reflect proximity.
    """
    if gdf_origins.crs is None or gdf_destinations.crs is None:
        raise ValueError("Input layers must declare a CRS before reprojection")
    gdf_origins = gdf_origins.to_crs("EPSG:4326")
    gdf_destinations = gdf_destinations.to_crs("EPSG:4326")

    # Round to 6 decimals (~0.11 m at the equator) so payload hashes are reproducible.
    # Rounding at this precision does not de-identify locations.
    origins = [Point(round(g.x, 6), round(g.y, 6)) for g in gdf_origins.geometry]
    destinations = [Point(round(g.x, 6), round(g.y, 6)) for g in gdf_destinations.geometry]

    coords = [[(o.x, o.y), (d.x, d.y)] for o in origins for d in destinations]
    return [coords[i:i + chunk_size] for i in range(0, len(coords), chunk_size)]

# Execution wrapper
def run_batch_routing(engine_url, origins_path, destinations_path):
    origins = gpd.read_file(origins_path)
    destinations = gpd.read_file(destinations_path)
    chunks = chunk_od_pairs(origins, destinations)
    
    results = []
    for i, chunk in enumerate(chunks):
        try:
            results.append(fetch_route_batch(engine_url, chunk))
        except Exception as e:
            logger.critical(f"CHUNK_{i}_FAILED | detail={e}")
            # Keep results aligned with chunks: None flags the chunk for
            # manual review or a secondary engine
            results.append(None)
    return results

Spatial Validation & Audit Readiness

Post-processing must verify spatial completeness before calculating accessibility indices. Missing routes should be explicitly flagged rather than imputed, to prevent bias in spatial equity metrics. Implement validation checks that confirm returned travel times fall within plausible ranges (e.g., greater than 0 and less than 24 hours) and cross-reference route geometries against known facility catchments.
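The plausibility check can be expressed as a small classifier over returned durations. This sketch assumes durations are reported in seconds and that an unroutable pair is represented as `None`; the function name `classify_durations` and the label strings are hypothetical.

```python
MAX_SECONDS = 24 * 60 * 60  # 24-hour upper bound on plausible travel time

def classify_durations(durations, max_seconds=MAX_SECONDS):
    """Label each OD duration (seconds) as 'ok', 'missing', or 'implausible'."""
    labels = []
    for value in durations:
        if value is None:
            labels.append("missing")       # flag explicitly, never impute
        elif 0 < value < max_seconds:
            labels.append("ok")
        else:
            labels.append("implausible")   # zero, negative, NaN, or 24 hours and above
    return labels

labels = classify_durations([612.4, None, 0, 90000.0, 1800])
```

Keeping "missing" and "implausible" as separate labels lets downstream reports distinguish routing failures from suspect values when assessing coverage.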

Audit trails must support full reproducibility for public health reporting and peer review. Store chunk hashes, retry counts, and final status codes in a version-controlled metadata table. This pattern supports compliance with applicable spatial data standards while maintaining the statistical integrity required for epidemiological modeling and resource allocation decisions.
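A metadata table of this kind could be as simple as a CSV file committed alongside the analysis code. The sketch below assumes that layout; the `ChunkAudit` record, its field names, and the sample values are illustrative only.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

@dataclass(frozen=True)
class ChunkAudit:
    chunk_hash: str      # deterministic payload hash, never raw coordinates
    retry_count: int
    final_status: int
    completed_utc: str   # ISO-8601 timestamp in UTC

def write_audit_table(rows, handle):
    """Write audit records as CSV with a header row derived from the dataclass."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(ChunkAudit)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))

# Demonstration with one made-up record written to an in-memory buffer
buffer = io.StringIO()
write_audit_table(
    [ChunkAudit("3f9a1c0b7d2e", 2, 200, "2024-01-01T00:00:00+00:00")],
    buffer,
)
table_text = buffer.getvalue()
```

A text format such as CSV diffs cleanly under version control, which makes changes between pipeline runs easy to review.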