
Data pipeline

← Back to index

The Python pipeline that computes ~170 KPIs per H3 cell across Geneva canton. Runs locally, outputs parquet + KV JSON batches.

Entry point

cd code
conda activate hood-analyzer
python pipeline.py

Runtime: ~1.5 minutes without transit, ~39 minutes with transit (GTFS routing via r5py, which requires Java 21+).

Output: output/geneva_kpis_by_h3.parquet (17,097 rows × ~170 cols) and output/kv_export/kv_batch_{001,002}.json (KV upload batches).

Conda environment

conda env create -f environment.yml    # if it exists
# or
conda create -n hood-analyzer python=3.11
conda activate hood-analyzer
pip install pandas geopandas h3 pandana osmnx rasterio duckdb pyproj
pip install r5py  # optional, for transit

Pipeline steps (code/pipeline.py)

# Step 1: Load canton boundary + generate H3 grid (res 10)
canton_gdf = load_canton_boundary()
h3grid_gdf = load_h3_grid_with_geometry()  # ~17,097 cells

# Step 2: Load all data sources (cached in data/cache/)
tpg_stops_gdf = load_tpg_stops()
schools = load_schools()
reg_gdf = load_business_registry()
municipalities_gdf = load_municipalities()
_, gdf_nodes, gdf_edges = load_osm_network(canton_gdf)  # walk
parks_gdf = load_osm_parks(canton_gdf)
# ... plus playgrounds, post offices, construction, waste, intl schools

# Step 3: Build pandana network + H3 ↔ OSM node mappings
net = build_pandana_network(gdf_nodes, gdf_edges)
h3_to_osmid, osmid_to_h3_dict, _ = build_h3_osm_mappings(h3grid_gdf, gdf_nodes)

# Step 4: Compute KPIs (one function per domain)
tpg_kpis         = compute_tpg_kpis(...)             # public transport
school_kpis      = compute_school_kpis(...)          # primary/cycle/college
vibrancy_kpis    = compute_vibrancy_kpis(...)        # economic/cultural/health
errand_kpis      = compute_errand_kpis(...)          # supermarket/bakery/pharmacy
green_kpis       = compute_green_space_kpis(...)     # parks/playgrounds
daycare_kpis     = compute_daycare_kpis(...)         # crèches
intl_school_kpis = compute_international_school_kpis(...)
commute_kpis     = compute_commute_kpis(...)         # Euclidean to key locations
nuisance_kpis    = compute_nuisance_kpis(...)        # bars/nightclubs/construction

# Multimodal travel times (walk/bike/car) — sequential network loading
walk_kpis = compute_multimodal_times(net, ..., mode='walk')
# Free walk network, load bike, compute, free, load drive, compute, free
bike_kpis = compute_multimodal_times(bike_net, ..., mode='bike')
car_kpis  = compute_multimodal_times(drive_net, ..., mode='car')

# Raster-based KPIs
noise_kpis       = load_noise_data(h3grid_gdf)
air_quality_kpis = load_air_quality_data(h3grid_gdf)

# Transit (optional — requires r5py + Java 21)
transit_kpis = compute_transit_travel_times(h3grid_gdf)

# Step 5: Merge all KPI dataframes on h3_index
kpi_df = h3grid_gdf[['h3_index', 'lat', 'lng']]
for df in [tpg_kpis, school_kpis, ...]:
    kpi_df = pd.merge(kpi_df, df, on='h3_index', how='left')

# Step 6: Compute composite scores (see scoring.md)
kpi_df = compute_all_scores(kpi_df)

# Step 7: Generate insights (fun facts, headlines)
kpi_df = add_insights_to_df(kpi_df)

# Step 8: Save parquet
kpi_df.to_parquet('output/geneva_kpis_by_h3.parquet')

# Step 9: Export KV JSON batches
export_to_kv_json(kpi_df)
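Step 5's left-merge pattern, as a self-contained sketch (the data and column names here are illustrative, not real pipeline values):

```python
import pandas as pd

# Base frame: one row per H3 cell (illustrative values)
base = pd.DataFrame({"h3_index": ["8a1f", "8a2f"],
                     "lat": [46.20, 46.21], "lng": [6.14, 6.15]})

# Per-domain KPI frames, each keyed on h3_index
tpg = pd.DataFrame({"h3_index": ["8a1f", "8a2f"], "tpg_stop_dist_m": [120, 340]})
parks = pd.DataFrame({"h3_index": ["8a1f"], "park_dist_m": [80]})  # missing cells become NaN

kpi_df = base
for df in [tpg, parks]:
    # how="left" keeps every H3 cell even when a domain has no value for it
    kpi_df = pd.merge(kpi_df, df, on="h3_index", how="left")
```

The left join is what guarantees the final frame always has one row per grid cell, with NaN where a domain has no data.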

Data sources

All sources cached locally in data/cache/ so re-runs don't re-download.

SITG (Geneva open geo portal — ge.ch/sitg)

Clear commercial license ("Accès libre" with attribution).

Source             File                           Used for
Canton boundary    CAD_LIMITE_CANTON              H3 grid extent
Lake               GEO_LAC                        H3 cell filtering (exclude water)
Municipalities     CAD_COMMUNE                    Commune assignment, tax rates
TPG stops          TPG_ARRETS                     Nearest transit stops, line names
Primary schools    DIP_ECOLES_PRIMAIRE            School access KPIs
Cycle schools      DIP_CYCLES_ORIENTATION         Middle school (ages 12-15)
College schools    DIP_COLLEGES                   High school (ages 15-19)
Business registry  REG_ENTREPRISE_ETABLISSEMENT   Vibrancy, errands, nuisances, daycares
Tree canopy        SIPV_ICA_MNC_2019              Future: green space metrics

OpenStreetMap (via osmnx)

ODbL license — commercial use is permitted with attribution.

  • Walk network: ~70K nodes, used for most KPIs (pandana shortest paths)
  • Bike network: ~37K nodes, used for bike travel times
  • Drive network: ~10K nodes, used for car travel times (with per-edge travel time weights — see "Car calibration" below)
  • Parks: leisure=park polygons
  • Playgrounds: leisure=playground points
  • Post offices: amenity=post_office points
  • Construction sites: landuse=construction
  • Waste disposal: amenity=waste_disposal + recycling points

OFEV/BAFU (Swiss federal noise maps)

Clear commercial license.

  • StrassenLaerm_Tag_LV95.tif — road noise daytime
  • StrassenLaerm_Nacht_LV95.tif — road noise nighttime
  • Bahnlaerm_Tag_LV95.tif — train noise daytime
  • Bahnlaerm_Nacht_LV95.tif — train noise nighttime

Sampled per H3 centroid via rasterio.
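Conceptually, sampling a raster at an H3 centroid means inverting the grid's affine transform to get a row/column index. The pipeline uses rasterio's sample() for this; the idea reduces to the following numpy sketch, assuming a simplified north-up transform with no rotation terms (rasterio uses the full 6-term affine):

```python
import numpy as np

def sample_raster(arr, transform, x, y):
    """Sample a north-up raster at map coordinates (x, y).

    transform = (x_origin, pixel_width, y_origin, pixel_height),
    a simplification of the full affine transform.
    """
    x0, px, y0, py = transform
    col = int((x - x0) // px)
    row = int((y0 - y) // py)  # y decreases as row index increases
    return arr[row, col]

noise = np.array([[55.0, 60.0],
                  [45.0, 50.0]])  # dB values, illustrative
# Grid origin and 100 m pixels in LV95-style coordinates (illustrative)
t = (2490000.0, 100.0, 1120000.0, 100.0)
val = sample_raster(noise, t, 2490150.0, 1119950.0)  # row 0, col 1
```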

Air quality (SPAIR)

  • NO2 2023, PM10 2020, PM2.5 2020 — .tif rasters
  • License: Restricted. Currently shown in the free teaser only (not behind paywall) until commercial permission is secured.

GTFS transit (opentransportdata.swiss)

  • Used by r5py for multimodal routing (walk + public transport)
  • Target: Tuesday 8:30 AM departure to key destinations (Cornavin, airport, CERN, UN)
  • Optional — pipeline gracefully degrades to Euclidean distances if r5py is unavailable
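The graceful degradation boils down to an import guard plus a straight-line fallback. A sketch of that pattern (the function names and the 20 km/h effective speed are illustrative assumptions, not the pipeline's actual values):

```python
import math

try:
    import r5py  # noqa: F401 -- needs Java 21+ on the PATH
    HAS_R5PY = True
except ImportError:
    HAS_R5PY = False  # fall back to Euclidean-style distances

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km, the fallback distance metric."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lng2 - lng1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def fallback_transit_minutes(o_lat, o_lng, d_lat, d_lng, speed_kmh=20.0):
    # Straight-line distance at an assumed effective speed (illustrative)
    return haversine_km(o_lat, o_lng, d_lat, d_lng) / speed_kmh * 60.0
```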

OCSTAT (Geneva cantonal statistics)

  • T 05.05.1.4.03 — Apartment resale prices 2024, by commune or group
  • Hardcoded in code/config.py::COMMUNE_PRICE_M2 (small dict, low update frequency)

Cantonal tax data

  • Centimes additionnels per commune, 2024 data
  • Hardcoded in code/config.py::COMMUNE_TAX_RATES

International schools (curated list)

  • 8 schools hardcoded in code/data/international_schools.json
  • Small curated list — SITG doesn't distinguish international schools, and OSM tagging is inconsistent
  • See scoring.md for why this matters for the expat/UN segment

Caching

All data sources are cached in data/cache/ after first fetch:

  • SITG GDB → *.gpkg (GeoPackage)
  • OSM networks → *.gpkg with versioned cache tags (osm_drive_nodes_polybuf1km_v2.gpkg)
  • SITG API caches → hashed JSON in code/cache/

Cache keys are versioned: bump the OSM_CACHE_TAG constant in data_sources.py when you change download parameters (buffer distance, network type, etc.) to force a re-download.
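A minimal sketch of the versioned-cache idea (the path scheme mirrors the example filename above; the helper names are illustrative, only OSM_CACHE_TAG exists in data_sources.py):

```python
from pathlib import Path

OSM_CACHE_TAG = "v2"  # bump when download parameters change
CACHE_DIR = Path("data/cache")

def cache_path(layer: str, buffer_km: int = 1) -> Path:
    """e.g. data/cache/osm_drive_nodes_polybuf1km_v2.gpkg"""
    return CACHE_DIR / f"osm_{layer}_polybuf{buffer_km}km_{OSM_CACHE_TAG}.gpkg"

def load_or_fetch(layer: str, fetch):
    """Return the cached file, downloading it only on a cache miss."""
    p = cache_path(layer)
    if p.exists():
        return p  # cached: skip the download
    p.parent.mkdir(parents=True, exist_ok=True)
    fetch(p)  # download and write to the versioned path
    return p
```

Because the tag is baked into the filename, bumping it makes every old cache file a miss, which forces the re-download without manual deletion.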

Car travel time calibration

The drive network KPIs (car_*_min fields) use a per-edge travel time model (not a flat speed). This was a significant iteration — early versions using flat speeds gave wildly optimistic estimates (e.g., Plainpalais → Cornavin in 4 minutes).

What's in the model

  1. Per-edge speed from the OSM maxspeed tag where available (68% coverage), falling back to highway-type defaults:

     Highway        Speed (km/h)
     motorway       80
     trunk          55
     primary        40
     secondary      35
     tertiary       25
     residential    25
     living_street  15
     service        10

  2. Intersection penalty per edge (seconds):

     Highway        Penalty (s)
     motorway       0
     trunk          3
     primary        15
     secondary      15
     tertiary       8
     residential    6
     living_street  5

  3. Hypercenter congestion multiplier (1.5×) applied to minor roads (residential/living_street/unclassified) inside a bounding box roughly covering Plainpalais–Cornavin–Eaux-Vives–Vieille Ville.

Why: OSM shortest paths through the hyper-center are much shorter than real driving routes (one-way detours, pedestrian zones, bus lanes, tram crossings). Pandana treats all edges as undirected — it ignores one-way restrictions — so the multiplier compensates. Applied only to minor roads so arterials (used by transit routes) aren't double-penalized.
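Put together, the per-edge weight is length divided by speed, plus the intersection penalty, times the congestion factor on minor hypercenter roads. A sketch of that model (the real implementation is build_car_travel_time_edges; the HYPERCENTER bounding box below is an illustrative placeholder, not the real one):

```python
DEFAULT_SPEED = {  # km/h, matching the defaults table above
    "motorway": 80, "trunk": 55, "primary": 40, "secondary": 35,
    "tertiary": 25, "residential": 25, "living_street": 15, "service": 10,
}
PENALTY_S = {  # seconds per edge, matching the penalty table above
    "motorway": 0, "trunk": 3, "primary": 15, "secondary": 15,
    "tertiary": 8, "residential": 6, "living_street": 5,
}
MINOR = {"residential", "living_street", "unclassified"}
HYPERCENTER = (6.13, 46.19, 6.17, 46.22)  # (w, s, e, n), illustrative bbox

def edge_minutes(length_m, highway, maxspeed_kmh=None, lng=None, lat=None):
    # 1. Speed: OSM maxspeed if tagged, else the highway-type default
    speed = maxspeed_kmh or DEFAULT_SPEED.get(highway, 25)
    t = length_m / (speed * 1000 / 60)   # driving time in minutes
    # 2. Flat per-edge intersection penalty
    t += PENALTY_S.get(highway, 0) / 60
    # 3. Congestion multiplier for minor roads in the hypercenter
    if highway in MINOR and lng is not None:
        w, s, e, n = HYPERCENTER
        if w <= lng <= e and s <= lat <= n:
            t *= 1.5
    return t
```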

Known limitation: autoroute gaps

The A1 autoroute between Chambésy and Coppet has no accessible on-ramps in the OSM drive network (gap between lat 46.2572 and 46.3370). This means routes from Versoix / Bellevue are forced onto local roads and come out slightly overestimated (26–27 min to Cornavin vs Google's 15–25 min).

This is a data issue, not a model issue. Possible fixes (not done):

  • Manually add the missing on-ramps to the cache
  • Switch to OSRM/Valhalla for proper directed routing

Calibration results (vs Google Maps)

Route                    Model    Google
Plainpalais → Cornavin   8 min    8-14 min
Meyrin → Cornavin        18 min   12-20 min
Carouge → Airport        14 min   15-25 min
Cologny → Cornavin       15 min   8-15 min
Chêne-Bourg → Cornavin   18 min   10-18 min
Versoix → Cornavin       25 min   15-25 min (autoroute gap)
Eaux-Vives → CERN        26 min   15-25 min (autoroute gap)

Implementation: code/kpis.py::build_car_travel_time_edges + build_car_pandana_network.

Outputs

Parquet (output/geneva_kpis_by_h3.parquet)

  • ~170 columns × 17,097 rows
  • snake_case field names matching the frontend CellData type
  • Full source of truth — can be re-exported without rerunning the pipeline

KV JSON batches (output/kv_export/kv_batch_*.json)

  • Two batches of 10,000 and 7,097 entries (Cloudflare KV bulk writes are capped at 10,000 keys per call)
  • Each entry: {"key": "<h3_index>", "value": "<json string>"}
  • H3 index is the key but not stored inside the value (anti-scraping)
  • TEASER_FIELDS subset is returned on GET /api/teaser; full dataset on GET /api/report (auth required)
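The export format can be sketched as follows (the real exporter is export_to_kv_json; field names here are illustrative):

```python
import json

BATCH_SIZE = 10_000  # Cloudflare KV bulk-write cap per call

def to_kv_batches(rows):
    """rows: list of dicts, each with an 'h3_index' plus KPI fields.

    The h3_index becomes the KV key and is stripped from the value,
    so a leaked value alone does not reveal the cell's location.
    """
    entries = []
    for row in rows:
        value = {k: v for k, v in row.items() if k != "h3_index"}
        entries.append({"key": row["h3_index"], "value": json.dumps(value)})
    # Split into bulk-put-sized chunks
    return [entries[i:i + BATCH_SIZE] for i in range(0, len(entries), BATCH_SIZE)]
```

With 17,097 cells this yields exactly the two batches described above.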

See deployment.md for the wrangler kv bulk put runbook.

Common pipeline gotchas

OOM during network build

Symptom: pipeline dies with exit code 137 around "Building pandana network".
Cause: walk + bike + drive networks loaded simultaneously.
Fix: already handled — pipeline.py loads one network at a time, computes, then del + gc.collect() before loading the next.
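The fix follows a load-compute-free pattern. A sketch, where load_network and compute_times are hypothetical stand-ins for the real loaders and KPI functions in pipeline.py:

```python
import gc

def compute_all_modes(load_network, compute_times):
    """Keep peak memory low by holding only one network at a time."""
    results = {}
    for mode in ("walk", "bike", "drive"):
        net = load_network(mode)             # load this mode's network
        results[mode] = compute_times(net, mode)
        del net          # drop the only reference...
        gc.collect()     # ...and force the free before the next load
    return results
```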

r5py ImportError

Symptom: "Skipping transit computation (ImportError)" during the run.
Cause: Java 21+ not installed, or r5py not in the conda env.
Fix: install Java 21+, or accept that transit_* columns will be loaded from the previous parquet (if present) or defaulted.

Cache miss / stale cache

Symptom: fields look wrong after changing a loader.
Fix: either delete the relevant file in data/cache/ or bump the cache tag constant (OSM_CACHE_TAG) in data_sources.py.

Céligny exclave

The canton has a small disconnected exclave (Céligny) to the north. This causes surprises when using canton bounding boxes for OSM downloads (the bbox includes a lot of Vaud). All OSM downloads use the actual canton polygon (+ a small buffer) to avoid this.
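To see why a bounding box misbehaves for an exclave, compare the area a combined bbox covers with the area of the parts themselves. A pure-Python sketch with toy squares (not real canton geometry):

```python
def bbox(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def bbox_area(b):
    x0, y0, x1, y1 = b
    return (x1 - x0) * (y1 - y0)

# Main territory as a 10x10 square, a 1x1 exclave far to the north
main_part = [(0, 0), (10, 0), (10, 10), (0, 10)]
exclave = [(4, 30), (5, 30), (5, 31), (4, 31)]

covered = bbox_area(bbox(main_part + exclave))  # bbox spans the gap
actual = 10 * 10 + 1 * 1                        # area of the two parts
# Most of the bbox is territory outside the two parts, which is why the
# downloads use the actual polygon (+ buffer) instead of the bbox.
```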


Next: Scoring