
Data pipeline

← Back to index

The Python pipeline that computes ~170 KPIs per H3 cell across Geneva canton. Runs locally, outputs parquet + KV JSON batches.

Entry point

cd code
conda activate hood-analyzer
python pipeline.py

Runtime: ~1.5 minutes without transit, ~39 minutes with transit (GTFS routing via r5py, which requires Java 21+).

Output: output/geneva_kpis_by_h3.parquet (17,097 rows × ~170 cols) and output/kv_export/kv_batch_{001,002}.json (KV upload batches).

Conda environment

conda env create -f environment.yml    # if it exists
# or
conda create -n hood-analyzer python=3.11
conda activate hood-analyzer
pip install pandas geopandas h3 pandana osmnx rasterio duckdb pyproj
pip install r5py  # optional, for transit

Pipeline steps (code/pipeline.py)

# Step 1: Load canton boundary + generate H3 grid (res 10)
canton_gdf = load_canton_boundary()
h3grid_gdf = load_h3_grid_with_geometry()  # ~17,097 cells

# Step 2: Load all data sources (cached in data/cache/)
tpg_stops_gdf = load_tpg_stops()
schools = load_schools()
reg_gdf = load_business_registry()
municipalities_gdf = load_municipalities()
_, gdf_nodes, gdf_edges = load_osm_network(canton_gdf)  # walk
parks_gdf = load_osm_parks(canton_gdf)
# ... plus playgrounds, post offices, construction, waste, intl schools

# Step 3: Build pandana network + H3 ↔ OSM node mappings
net = build_pandana_network(gdf_nodes, gdf_edges)
h3_to_osmid, osmid_to_h3_dict, _ = build_h3_osm_mappings(h3grid_gdf, gdf_nodes)

# Step 4: Compute KPIs (one function per domain)
tpg_kpis         = compute_tpg_kpis(...)             # public transport
school_kpis      = compute_school_kpis(...)          # primary/cycle/college
vibrancy_kpis    = compute_vibrancy_kpis(...)        # economic/cultural/health
errand_kpis      = compute_errand_kpis(...)          # supermarket/bakery/pharmacy
green_kpis       = compute_green_space_kpis(...)     # parks/playgrounds
daycare_kpis     = compute_daycare_kpis(...)         # crèches
intl_school_kpis = compute_international_school_kpis(...)
commute_kpis     = compute_commute_kpis(...)         # Euclidean to key locations
nuisance_kpis    = compute_nuisance_kpis(...)        # bars/nightclubs/construction

# Multimodal travel times (walk/bike/car) — sequential network loading
walk_kpis = compute_multimodal_times(net, ..., mode='walk')
# Free walk network, load bike, compute, free, load drive, compute, free
bike_kpis = compute_multimodal_times(bike_net, ..., mode='bike')
car_kpis  = compute_multimodal_times(drive_net, ..., mode='car')

# Raster-based KPIs
noise_kpis       = load_noise_data(h3grid_gdf)
air_quality_kpis = load_air_quality_data(h3grid_gdf)

# Transit (optional — requires r5py + Java 21)
transit_kpis = compute_transit_travel_times(h3grid_gdf)

# Step 5: Merge all KPI dataframes on h3_index
kpi_df = h3grid_gdf[['h3_index', 'lat', 'lng']]
for df in [tpg_kpis, school_kpis, ...]:
    kpi_df = pd.merge(kpi_df, df, on='h3_index', how='left')

# Step 6: Compute composite scores (see scoring.md)
kpi_df = compute_all_scores(kpi_df)

# Step 7: Generate insights (fun facts, headlines)
kpi_df = add_insights_to_df(kpi_df)

# Step 8: Save parquet
kpi_df.to_parquet('output/geneva_kpis_by_h3.parquet')

# Step 9: Export KV JSON batches
export_to_kv_json(kpi_df)
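Step 5's left-merge pattern, as a self-contained sketch (the data and column names here are illustrative, not real pipeline values):

```python
import pandas as pd

# Base frame: one row per H3 cell (illustrative values)
base = pd.DataFrame({"h3_index": ["8a1f", "8a2f"],
                     "lat": [46.20, 46.21], "lng": [6.14, 6.15]})

# Per-domain KPI frames, each keyed on h3_index
tpg = pd.DataFrame({"h3_index": ["8a1f", "8a2f"], "tpg_stop_dist_m": [120, 340]})
parks = pd.DataFrame({"h3_index": ["8a1f"], "park_dist_m": [80]})  # missing cells become NaN

kpi_df = base
for df in [tpg, parks]:
    # how="left" keeps every H3 cell even when a domain has no value for it
    kpi_df = pd.merge(kpi_df, df, on="h3_index", how="left")
```

The left join is what guarantees the final frame always has one row per grid cell, with NaN where a domain has no data.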

Data sources

All sources cached locally in data/cache/ so re-runs don't re-download.

SITG (Geneva open geo portal — ge.ch/sitg)

Clear commercial license ("Accès libre" with attribution).

Source             File                           Used for
Canton boundary    CAD_LIMITE_CANTON              H3 grid extent
Lake               GEO_LAC                        H3 cell filtering (exclude water)
Municipalities     CAD_COMMUNE                    Commune assignment, tax rates
TPG stops          TPG_ARRETS                     Nearest transit stops, line names
Primary schools    DIP_ECOLES_PRIMAIRE            School access KPIs
Cycle schools      DIP_CYCLES_ORIENTATION         Middle school (ages 12-15)
College schools    DIP_COLLEGES                   High school (ages 15-19)
Business registry  REG_ENTREPRISE_ETABLISSEMENT   Vibrancy, errands, nuisances, daycares
Tree canopy        SIPV_ICA_MNC_2019              Future: green space metrics

OpenStreetMap (via osmnx)

ODbL license — commercial use is permitted with attribution.

  • Walk network: ~70K nodes, used for most KPIs (pandana shortest paths)
  • Bike network: ~37K nodes, used for bike travel times
  • Drive network: ~10K nodes, used for car travel times (with per-edge travel time weights — see "Car calibration" below)
  • Parks: leisure=park polygons
  • Playgrounds: leisure=playground points
  • Post offices: amenity=post_office points
  • Construction sites: landuse=construction
  • Waste disposal: amenity=waste_disposal + recycling points

OFEV/BAFU (Swiss federal noise maps)

Clear commercial license.

  • StrassenLaerm_Tag_LV95.tif — road noise daytime
  • StrassenLaerm_Nacht_LV95.tif — road noise nighttime
  • Bahnlaerm_Tag_LV95.tif — train noise daytime
  • Bahnlaerm_Nacht_LV95.tif — train noise nighttime

Sampled per H3 centroid via rasterio.
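Conceptually, sampling a raster at an H3 centroid means inverting the grid's affine transform to get a row/column index. The pipeline uses rasterio's sample() for this; the idea reduces to the following numpy sketch, assuming a simplified north-up transform with no rotation terms (rasterio uses the full 6-term affine):

```python
import numpy as np

def sample_raster(arr, transform, x, y):
    """Sample a north-up raster at map coordinates (x, y).

    transform = (x_origin, pixel_width, y_origin, pixel_height),
    a simplification of the full affine transform.
    """
    x0, px, y0, py = transform
    col = int((x - x0) // px)
    row = int((y0 - y) // py)  # y decreases as row index increases
    return arr[row, col]

noise = np.array([[55.0, 60.0],
                  [45.0, 50.0]])  # dB values, illustrative
# Grid origin and 100 m pixels in LV95-style coordinates (illustrative)
t = (2490000.0, 100.0, 1120000.0, 100.0)
val = sample_raster(noise, t, 2490150.0, 1119950.0)  # row 0, col 1
```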

Air quality (SPAIR)

  • NO2 2023, PM10 2020, PM2.5 2020 — .tif rasters
  • License: Restricted. Currently shown in the free teaser only (not behind paywall) until commercial permission is secured.

GTFS transit (opentransportdata.swiss)

  • Used by r5py for multimodal routing (walk + public transport)
  • Target: Tuesday 8:30 AM departure to key destinations (Cornavin, airport, CERN, UN)
  • Optional — pipeline gracefully degrades to Euclidean distances if r5py is unavailable
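The graceful degradation boils down to an import guard plus a straight-line fallback. A sketch of that pattern (the function names and the 20 km/h effective speed are illustrative assumptions, not the pipeline's actual values):

```python
import math

try:
    import r5py  # noqa: F401 -- needs Java 21+ on the PATH
    HAS_R5PY = True
except ImportError:
    HAS_R5PY = False  # fall back to Euclidean-style distances

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km, the fallback distance metric."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lng2 - lng1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def fallback_transit_minutes(o_lat, o_lng, d_lat, d_lng, speed_kmh=20.0):
    # Straight-line distance at an assumed effective speed (illustrative)
    return haversine_km(o_lat, o_lng, d_lat, d_lng) / speed_kmh * 60.0
```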

OCSTAT (Geneva cantonal statistics)

  • T 05.05.1.4.03 — Apartment resale prices 2024, by commune or group
  • Hardcoded in code/config.py::COMMUNE_PRICE_M2 (small dict, low update frequency)

Cantonal tax data

  • Centimes additionnels per commune, 2024 data
  • Hardcoded in code/config.py::COMMUNE_TAX_RATES

International schools (curated list)

  • 8 schools hardcoded in code/data/international_schools.json
  • Small curated list — SITG doesn't distinguish international schools, and OSM tagging is inconsistent
  • See scoring.md for why this matters for the expat/UN segment

Caching

All data sources are cached in data/cache/ after first fetch:

  • SITG GDB → *.gpkg (GeoPackage)
  • OSM networks → *.gpkg with versioned cache tags (osm_drive_nodes_polybuf1km_v2.gpkg)
  • SITG API caches → hashed JSON in code/cache/

Cache keys are versioned: bump the OSM_CACHE_TAG constant in data_sources.py when you change download parameters (buffer distance, network type, etc.) to force a re-download.
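A minimal sketch of the versioned-cache idea (the path scheme mirrors the example filename above; the helper names are illustrative, only OSM_CACHE_TAG exists in data_sources.py):

```python
from pathlib import Path

OSM_CACHE_TAG = "v2"  # bump when download parameters change
CACHE_DIR = Path("data/cache")

def cache_path(layer: str, buffer_km: int = 1) -> Path:
    """e.g. data/cache/osm_drive_nodes_polybuf1km_v2.gpkg"""
    return CACHE_DIR / f"osm_{layer}_polybuf{buffer_km}km_{OSM_CACHE_TAG}.gpkg"

def load_or_fetch(layer: str, fetch):
    """Return the cached file, downloading it only on a cache miss."""
    p = cache_path(layer)
    if p.exists():
        return p  # cached: skip the download
    p.parent.mkdir(parents=True, exist_ok=True)
    fetch(p)  # download and write to the versioned path
    return p
```

Because the tag is baked into the filename, bumping it makes every old cache file a miss, which forces the re-download without manual deletion.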

Car travel time calibration

The drive network KPIs (car_*_min fields) use a per-edge travel time model (not a flat speed). This was a significant iteration — early versions using flat speeds gave wildly optimistic estimates (e.g., Plainpalais → Cornavin in 4 minutes).

What's in the model

  1. Per-edge speed from the OSM maxspeed tag where available (68% coverage), falling back to highway-type defaults:

     Highway        Speed (km/h)
     motorway       80
     trunk          55
     primary        40
     secondary      35
     tertiary       25
     residential    25
     living_street  15
     service        10

  2. Intersection penalty per edge (seconds):

     Highway        Penalty (s)
     motorway       0
     trunk          3
     primary        15
     secondary      15
     tertiary       8
     residential    6
     living_street  5

  3. Hypercenter congestion multiplier (1.5×) applied to minor roads (residential/living_street/unclassified) inside a bounding box roughly covering Plainpalais–Cornavin–Eaux-Vives–Vieille Ville.

Why: OSM shortest paths through the hyper-center are much shorter than real driving routes (one-way detours, pedestrian zones, bus lanes, tram crossings). Pandana treats all edges as undirected — it ignores one-way restrictions — so the multiplier compensates. Applied only to minor roads so arterials (used by transit routes) aren't double-penalized.
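Put together, the per-edge weight is length divided by speed, plus the intersection penalty, times the congestion factor on minor hypercenter roads. A sketch of that model (the real implementation is build_car_travel_time_edges; the HYPERCENTER bounding box below is an illustrative placeholder, not the real one):

```python
DEFAULT_SPEED = {  # km/h, matching the defaults table above
    "motorway": 80, "trunk": 55, "primary": 40, "secondary": 35,
    "tertiary": 25, "residential": 25, "living_street": 15, "service": 10,
}
PENALTY_S = {  # seconds per edge, matching the penalty table above
    "motorway": 0, "trunk": 3, "primary": 15, "secondary": 15,
    "tertiary": 8, "residential": 6, "living_street": 5,
}
MINOR = {"residential", "living_street", "unclassified"}
HYPERCENTER = (6.13, 46.19, 6.17, 46.22)  # (w, s, e, n), illustrative bbox

def edge_minutes(length_m, highway, maxspeed_kmh=None, lng=None, lat=None):
    # 1. Speed: OSM maxspeed if tagged, else the highway-type default
    speed = maxspeed_kmh or DEFAULT_SPEED.get(highway, 25)
    t = length_m / (speed * 1000 / 60)   # driving time in minutes
    # 2. Flat per-edge intersection penalty
    t += PENALTY_S.get(highway, 0) / 60
    # 3. Congestion multiplier for minor roads in the hypercenter
    if highway in MINOR and lng is not None:
        w, s, e, n = HYPERCENTER
        if w <= lng <= e and s <= lat <= n:
            t *= 1.5
    return t
```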

Known limitation: autoroute gaps

The A1 autoroute between Chambésy and Coppet has no accessible on-ramps in the OSM drive network (gap between lat 46.2572 and 46.3370). This means routes from Versoix / Bellevue are forced onto local roads and come out slightly overestimated (26–27 min to Cornavin vs Google's 15–25 min).

This is a data issue, not a model issue. Possible fixes (not done):

  • Manually add the missing on-ramps to the cache
  • Switch to OSRM/Valhalla for proper directed routing

Calibration results (vs Google Maps)

Route                    Model    Google
Plainpalais → Cornavin   8 min    8-14 min
Meyrin → Cornavin        18 min   12-20 min
Carouge → Airport        14 min   15-25 min
Cologny → Cornavin       15 min   8-15 min
Chêne-Bourg → Cornavin   18 min   10-18 min
Versoix → Cornavin       25 min   15-25 min (autoroute gap)
Eaux-Vives → CERN        26 min   15-25 min (autoroute gap)

Implementation: code/kpis.py::build_car_travel_time_edges + build_car_pandana_network.

Outputs

Parquet (output/geneva_kpis_by_h3.parquet)

  • ~170 columns × 17,097 rows
  • snake_case field names matching the frontend CellData type
  • Full source of truth — can be re-exported without rerunning the pipeline

KV JSON batches (output/kv_export/kv_batch_*.json)

  • Two batches of 10,000 and 7,097 entries (Cloudflare KV bulk writes are capped at 10,000 keys per call)
  • Each entry: {"key": "<h3_index>", "value": "<json string>"}
  • H3 index is the key but not stored inside the value (anti-scraping)
  • TEASER_FIELDS subset is returned on GET /api/teaser; full dataset on GET /api/report (auth required)
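The export format can be sketched as follows (the real exporter is export_to_kv_json; field names here are illustrative):

```python
import json

BATCH_SIZE = 10_000  # Cloudflare KV bulk-write cap per call

def to_kv_batches(rows):
    """rows: list of dicts, each with an 'h3_index' plus KPI fields.

    The h3_index becomes the KV key and is stripped from the value,
    so a leaked value alone does not reveal the cell's location.
    """
    entries = []
    for row in rows:
        value = {k: v for k, v in row.items() if k != "h3_index"}
        entries.append({"key": row["h3_index"], "value": json.dumps(value)})
    # Split into bulk-put-sized chunks
    return [entries[i:i + BATCH_SIZE] for i in range(0, len(entries), BATCH_SIZE)]
```

With 17,097 cells this yields exactly the two batches described above.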

See deployment.md for the wrangler kv bulk put runbook.

Common pipeline gotchas

OOM during network build

Symptom: pipeline dies with exit code 137 around "Building pandana network".
Cause: walk + bike + drive networks loaded simultaneously.
Fix: already handled — pipeline.py loads one network at a time, computes, then del + gc.collect() before loading the next.
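The fix follows a load-compute-free pattern. A sketch, where load_network and compute_times are hypothetical stand-ins for the real loaders and KPI functions in pipeline.py:

```python
import gc

def compute_all_modes(load_network, compute_times):
    """Keep peak memory low by holding only one network at a time."""
    results = {}
    for mode in ("walk", "bike", "drive"):
        net = load_network(mode)             # load this mode's network
        results[mode] = compute_times(net, mode)
        del net          # drop the only reference...
        gc.collect()     # ...and force the free before the next load
    return results
```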

r5py ImportError

Symptom: "Skipping transit computation (ImportError)" during the run.
Cause: Java 21+ not installed, or r5py not in the conda env.
Fix: install Java 21+, or accept that transit_* columns will be loaded from the previous parquet (if present) or defaulted.

Cache miss / stale cache

Symptom: fields look wrong after changing a loader.
Fix: either delete the relevant file in data/cache/ or bump the cache tag constant (OSM_CACHE_TAG) in data_sources.py.

Céligny exclave

The canton has a small disconnected exclave (Céligny) to the north. This causes surprises when using canton bounding boxes for OSM downloads (the bbox includes a lot of Vaud). All OSM downloads use the actual canton polygon (+ a small buffer) to avoid this.
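To see why a bounding box misbehaves for an exclave, compare the area a combined bbox covers with the area of the parts themselves. A pure-Python sketch with toy squares (not real canton geometry):

```python
def bbox(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def bbox_area(b):
    x0, y0, x1, y1 = b
    return (x1 - x0) * (y1 - y0)

# Main territory as a 10x10 square, a 1x1 exclave far to the north
main_part = [(0, 0), (10, 0), (10, 10), (0, 10)]
exclave = [(4, 30), (5, 30), (5, 31), (4, 31)]

covered = bbox_area(bbox(main_part + exclave))  # bbox spans the gap
actual = 10 * 10 + 1 * 1                        # area of the two parts
# Most of the bbox is territory outside the two parts, which is why the
# downloads use the actual polygon (+ buffer) instead of the bbox.
```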


Next: Scoring