Data pipeline¶
The Python pipeline that computes ~170 KPIs per H3 cell across Geneva canton. Runs locally, outputs parquet + KV JSON batches.
Entry point¶
Runtime: ~1.5 minutes without transit (r5py), ~39 minutes with transit (GTFS routing via r5py needs Java 21+).
Output: output/geneva_kpis_by_h3.parquet (17,097 rows × ~170 cols)
and output/kv_export/kv_batch_{001,002}.json (KV upload batches).
Conda environment¶
conda env create -f environment.yml # if it exists
# or
conda create -n hood-analyzer python=3.11
conda activate hood-analyzer
pip install pandas geopandas h3 pandana osmnx rasterio duckdb pyproj
pip install r5py # optional, for transit
Pipeline steps (code/pipeline.py)¶
# Step 1: Load canton boundary + generate H3 grid (res 10)
canton_gdf = load_canton_boundary()
h3grid_gdf = load_h3_grid_with_geometry() # ~17,097 cells
# Step 2: Load all data sources (cached in data/cache/)
tpg_stops_gdf = load_tpg_stops()
schools = load_schools()
reg_gdf = load_business_registry()
municipalities_gdf = load_municipalities()
_, gdf_nodes, gdf_edges = load_osm_network(canton_gdf) # walk
parks_gdf = load_osm_parks(canton_gdf)
# ... plus playgrounds, post offices, construction, waste, intl schools
# Step 3: Build pandana network + H3 ↔ OSM node mappings
net = build_pandana_network(gdf_nodes, gdf_edges)
h3_to_osmid, osmid_to_h3_dict, _ = build_h3_osm_mappings(h3grid_gdf, gdf_nodes)
# Step 4: Compute KPIs (one function per domain)
tpg_kpis = compute_tpg_kpis(...) # public transport
school_kpis = compute_school_kpis(...) # primary/cycle/college
vibrancy_kpis = compute_vibrancy_kpis(...) # economic/cultural/health
errand_kpis = compute_errand_kpis(...) # supermarket/bakery/pharmacy
green_kpis = compute_green_space_kpis(...) # parks/playgrounds
daycare_kpis = compute_daycare_kpis(...) # crèches
intl_school_kpis = compute_international_school_kpis(...)
commute_kpis = compute_commute_kpis(...) # Euclidean to key locations
nuisance_kpis = compute_nuisance_kpis(...) # bars/nightclubs/construction
# Multimodal travel times (walk/bike/car) — sequential network loading
walk_kpis = compute_multimodal_times(net, ..., mode='walk')
# Free walk network, load bike, compute, free, load drive, compute, free
bike_kpis = compute_multimodal_times(bike_net, ..., mode='bike')
car_kpis = compute_multimodal_times(drive_net, ..., mode='car')
# Raster-based KPIs
noise_kpis = load_noise_data(h3grid_gdf)
air_quality_kpis = load_air_quality_data(h3grid_gdf)
# Transit (optional — requires r5py + Java 21)
transit_kpis = compute_transit_travel_times(h3grid_gdf)
# Step 5: Merge all KPI dataframes on h3_index
kpi_df = h3grid_gdf[['h3_index', 'lat', 'lng']]
for df in [tpg_kpis, school_kpis, ...]:
kpi_df = pd.merge(kpi_df, df, on='h3_index', how='left')
# Step 6: Compute composite scores (see scoring.md)
kpi_df = compute_all_scores(kpi_df)
# Step 7: Generate insights (fun facts, headlines)
kpi_df = add_insights_to_df(kpi_df)
# Step 8: Save parquet
kpi_df.to_parquet('output/geneva_kpis_by_h3.parquet')
# Step 9: Export KV JSON batches
export_to_kv_json(kpi_df)
Data sources¶
All sources cached locally in data/cache/ so re-runs don't re-download.
SITG (Geneva open geo portal — ge.ch/sitg)¶
Clear commercial license ("Accès libre" with attribution).
| Source | File | Used for |
|---|---|---|
| Canton boundary | `CAD_LIMITE_CANTON` | H3 grid extent |
| Lake | `GEO_LAC` | H3 cell filtering (exclude water) |
| Municipalities | `CAD_COMMUNE` | Commune assignment, tax rates |
| TPG stops | `TPG_ARRETS` | Nearest transit stops, line names |
| Primary schools | `DIP_ECOLES_PRIMAIRE` | School access KPIs |
| Cycle schools | `DIP_CYCLES_ORIENTATION` | Middle school (ages 12-15) |
| College schools | `DIP_COLLEGES` | High school (ages 15-19) |
| Business registry | `REG_ENTREPRISE_ETABLISSEMENT` | Vibrancy, errands, nuisances, daycares |
| Tree canopy | `SIPV_ICA_MNC_2019` | Future: green space metrics |
OpenStreetMap (via osmnx)¶
ODbL license — commercial use OK with attribution (ODbL's share-alike terms apply to derivative databases).
- Walk network: ~70K nodes, used for most KPIs (pandana shortest paths)
- Bike network: ~37K nodes, used for bike travel times
- Drive network: ~10K nodes, used for car travel times (with per-edge travel time weights — see "Car calibration" below)
- Parks: `leisure=park` polygons
- Playgrounds: `leisure=playground` points
- Post offices: `amenity=post_office` points
- Construction sites: `landuse=construction`
- Waste disposal: `amenity=waste_disposal` + recycling points
OFEV/BAFU (Swiss federal noise maps)¶
Clear commercial license.
- `StrassenLaerm_Tag_LV95.tif` — road noise, daytime
- `StrassenLaerm_Nacht_LV95.tif` — road noise, nighttime
- `Bahnlaerm_Tag_LV95.tif` — train noise, daytime
- `Bahnlaerm_Nacht_LV95.tif` — train noise, nighttime
Sampled per H3 centroid via rasterio.
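The per-centroid sampling can be sketched as follows, assuming WGS84 centroids and the LV95 (EPSG:2056) rasters; the function names are illustrative (not the actual pipeline code), and the heavy dependencies are imported lazily since they are optional at import time:

```python
def sample_raster_at_centroids(raster_path, lonlat_points):
    """Sample a LV95 raster at WGS84 (lng, lat) centroids.

    Illustrative sketch, not the actual pipeline code. Requires
    rasterio and pyproj (imported lazily inside the function).
    """
    import rasterio
    from pyproj import Transformer

    with rasterio.open(raster_path) as src:
        # Reproject centroids from WGS84 into LV95 (EPSG:2056)
        tf = Transformer.from_crs("EPSG:4326", "EPSG:2056", always_xy=True)
        coords = [tf.transform(lng, lat) for lng, lat in lonlat_points]
        # src.sample yields one array per point (one value per band)
        values = [v[0] for v in src.sample(coords)]
        return clean_nodata(values, src.nodata)

def clean_nodata(values, nodata):
    """Replace the raster's nodata sentinel with None."""
    return [None if (nodata is not None and v == nodata) else v for v in values]
```

Cells falling outside the raster footprint come back as the nodata sentinel, so masking them explicitly keeps nodata from leaking into the KPI columns as a huge negative dB value.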
Air quality (SPAIR)¶
- NO2 2023, PM10 2020, PM2.5 2020 — `.tif` rasters
- License: restricted. Currently shown in the free teaser only (not behind the paywall) until commercial permission is secured.
GTFS transit (opentransportdata.swiss)¶
- Used by `r5py` for multimodal routing (walk + public transport)
- Target: Tuesday 8:30 AM departure to key destinations (Cornavin, airport, CERN, UN)
- Optional — the pipeline gracefully degrades to Euclidean distances if `r5py` is unavailable
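The degradation path can be sketched like this; the 20 km/h effective transit speed and the function names are illustrative assumptions, not the pipeline's actual values:

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lng2 - lng1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

try:
    import r5py  # noqa: F401 -- also needs Java 21+ at runtime
    HAVE_R5PY = True
except ImportError:
    HAVE_R5PY = False

def euclidean_transit_minutes(lat, lng, dest_lat, dest_lng, speed_kmh=20):
    """Fallback estimate: straight-line distance at an assumed effective speed.

    The 20 km/h default is an illustrative assumption, not the pipeline's value.
    """
    return haversine_km(lat, lng, dest_lat, dest_lng) / speed_kmh * 60
```

When `HAVE_R5PY` is false, the real GTFS routing step is skipped and the `transit_*` columns fall back to this straight-line estimate.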
OCSTAT (Geneva cantonal statistics)¶
- T 05.05.1.4.03 — apartment resale prices 2024, by commune or group
- Hardcoded in `code/config.py::COMMUNE_PRICE_M2` (small dict, low update frequency)
Cantonal tax data¶
- Centimes additionnels per commune, 2024 data
- Hardcoded in `code/config.py::COMMUNE_TAX_RATES`
International schools (curated list)¶
- 8 schools hardcoded in `code/data/international_schools.json`
- Small curated list — SITG doesn't distinguish international schools, and OSM tagging is inconsistent
- See scoring.md for why this matters for the expat/UN segment
Caching¶
All data sources are cached in data/cache/ after first fetch:
- SITG GDB → `*.gpkg` (GeoPackage)
- OSM networks → `*.gpkg` with versioned cache tags (e.g. `osm_drive_nodes_polybuf1km_v2.gpkg`)
- SITG API caches → hashed JSON in `code/cache/`
Cache keys are versioned: bump the `OSM_CACHE_TAG` constant in `data_sources.py` when you change download parameters (buffer distance, network type, etc.) to force a re-download.
Car travel time calibration¶
The drive network KPIs (car_*_min fields) use a per-edge travel time
model (not a flat speed). This was a significant iteration — early versions
using flat speeds gave wildly optimistic estimates (e.g., Plainpalais →
Cornavin in 4 minutes).
What's in the model¶
- Per-edge speed from the OSM `maxspeed` tag where available (68% coverage), falling back to highway-type defaults:
| Highway | Speed (km/h) |
|---|---|
| motorway | 80 |
| trunk | 55 |
| primary | 40 |
| secondary | 35 |
| tertiary | 25 |
| residential | 25 |
| living_street | 15 |
| service | 10 |
- Intersection penalty per edge (seconds):
| Highway | Penalty |
|---|---|
| motorway | 0 |
| trunk | 3 |
| primary | 15 |
| secondary | 15 |
| tertiary | 8 |
| residential | 6 |
| living_street | 5 |
- Hypercenter congestion multiplier (1.5×) applied to minor roads (residential/living_street/unclassified) inside a bounding box roughly covering Plainpalais–Cornavin–Eaux-Vives–Vieille Ville.
Why: OSM shortest paths through the hyper-center are much shorter than real driving routes (one-way detours, pedestrian zones, bus lanes, tram crossings). Pandana treats all edges as undirected — it ignores one-way restrictions — so the multiplier compensates. Applied only to minor roads so arterials (used by transit routes) aren't double-penalized.
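Put together, the per-edge cost can be sketched as below. The function shape, helper names, and the hypercenter bounding-box coordinates are assumptions for illustration, not the actual `code/kpis.py` implementation:

```python
# Sketch of the per-edge car travel time model described above.
# Helper names and HYPERCENTER_BBOX values are illustrative assumptions.

DEFAULT_SPEED_KMH = {
    "motorway": 80, "trunk": 55, "primary": 40, "secondary": 35,
    "tertiary": 25, "residential": 25, "living_street": 15, "service": 10,
}
INTERSECTION_PENALTY_S = {
    "motorway": 0, "trunk": 3, "primary": 15, "secondary": 15,
    "tertiary": 8, "residential": 6, "living_street": 5,
}
MINOR_ROADS = {"residential", "living_street", "unclassified"}
CONGESTION_MULTIPLIER = 1.5
# Rough hypercenter bbox (lat_min, lat_max, lng_min, lng_max) -- assumed values
HYPERCENTER_BBOX = (46.19, 46.21, 6.13, 6.17)

def edge_travel_time_min(length_m, highway, maxspeed_kmh=None, lat=None, lng=None):
    """Travel time in minutes for one edge."""
    speed = maxspeed_kmh or DEFAULT_SPEED_KMH.get(highway, 25)
    seconds = length_m / (speed / 3.6)                  # driving time at edge speed
    seconds += INTERSECTION_PENALTY_S.get(highway, 0)   # junction cost
    # 1.5x congestion on minor roads inside the hypercenter (penalty included)
    if highway in MINOR_ROADS and lat is not None and lng is not None:
        lat_min, lat_max, lng_min, lng_max = HYPERCENTER_BBOX
        if lat_min <= lat <= lat_max and lng_min <= lng <= lng_max:
            seconds *= CONGESTION_MULTIPLIER
    return seconds / 60
```

The resulting per-edge minutes become the pandana edge weights, so shortest paths minimize modeled driving time rather than raw distance.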
Known limitation: autoroute gaps¶
The A1 autoroute between Chambésy and Coppet has no accessible on-ramps in the OSM drive network (gap between lat 46.2572 and 46.3370). This means routes from Versoix / Bellevue are forced onto local roads and come out slightly overestimated (26–27 min to Cornavin vs Google's 15–25 min).
This is a data issue, not a model issue. Possible fixes (not done): - Manually add the missing on-ramps to the cache - Switch to OSRM/Valhalla for proper directed routing
Calibration results (vs Google Maps)¶
| Route | Model | Google Maps |
|---|---|---|
| Plainpalais → Cornavin | 8 min | 8-14 |
| Meyrin → Cornavin | 18 min | 12-20 |
| Carouge → Airport | 14 min | 15-25 |
| Cologny → Cornavin | 15 min | 8-15 |
| Chêne-Bourg → Cornavin | 18 min | 10-18 |
| Versoix → Cornavin | 25 min | 15-25 (autoroute gap) |
| Eaux-Vives → CERN | 26 min | 15-25 (autoroute gap) |
Implementation: `code/kpis.py::build_car_travel_time_edges` + `build_car_pandana_network`.
Outputs¶
Parquet (output/geneva_kpis_by_h3.parquet)¶
- ~170 columns × 17,097 rows
- Snake_case field names matching the frontend `CellData` type
- Full source of truth — can be re-exported without rerunning the pipeline
KV JSON batches (output/kv_export/kv_batch_*.json)¶
- Two batches of 10,000 and 7,097 entries (the Cloudflare KV bulk put limit is 10,000 per call)
- Each entry: `{"key": "<h3_index>", "value": "<json string>"}`
- The H3 index is the key but is not stored inside the value (anti-scraping)
- The `TEASER_FIELDS` subset is returned on `GET /api/teaser`; the full dataset on `GET /api/report` (auth required)
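A minimal sketch of the batching, assuming the exporter walks the rows as plain dicts; `BATCH_SIZE` and `to_kv_batches` are illustrative names, not the actual `export_to_kv_json` code:

```python
import json

BATCH_SIZE = 10_000  # Cloudflare KV bulk put limit per request

def to_kv_batches(rows):
    """rows: iterable of dicts, each with an 'h3_index' plus KPI fields.

    Returns lists of {"key", "value"} entries, at most BATCH_SIZE each.
    The H3 index is only the key; it is stripped from the value payload.
    """
    entries = []
    for row in rows:
        payload = {k: v for k, v in row.items() if k != "h3_index"}
        entries.append({"key": row["h3_index"], "value": json.dumps(payload)})
    return [entries[i:i + BATCH_SIZE] for i in range(0, len(entries), BATCH_SIZE)]
```

With 17,097 cells this yields exactly the two batch files described above, one full batch of 10,000 plus a remainder.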
See deployment.md for the `wrangler kv bulk put` runbook.
Common pipeline gotchas¶
OOM during network build¶
Symptom: pipeline dies with exit code 137 around "Building pandana network".
Cause: walk + bike + drive networks loaded simultaneously.
Fix: already handled — pipeline.py loads one network at a time, computes,
then del + gc.collect() before loading the next.
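The sequential pattern looks roughly like this (a sketch; the loader and compute callables are placeholders, not the actual pipeline.py functions):

```python
import gc

def compute_all_modes(loaders, compute):
    """Run multimodal KPI computation with at most one network in memory.

    loaders: {mode: zero-argument callable that builds/loads the network}
    compute: callable(network, mode) -> KPI result
    """
    results = {}
    for mode, load in loaders.items():
        net = load()                      # only this network is resident now
        results[mode] = compute(net, mode)
        del net                           # drop the only reference...
        gc.collect()                      # ...and reclaim before the next load
    return results
```

The explicit `gc.collect()` matters because the large network objects can hold reference cycles that would otherwise linger until the next collection, exactly when the following network is being allocated.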
r5py ImportError¶
Symptom: "Skipping transit computation (ImportError)" during run.
Cause: Java 21+ not installed or r5py not in the conda env.
Fix: install Java 21+ (and `r5py` in the conda env), or accept that `transit_*` columns
will be loaded from the previous parquet (if present) or defaulted.
Cache miss / stale cache¶
Symptom: fields look wrong after changing a loader.
Fix: either delete the relevant file in data/cache/ or bump the cache
tag constant in data_sources.py.
Céligny exclave¶
The canton has a small disconnected exclave (Céligny) to the north. This causes surprises when using canton bounding boxes for OSM downloads (the bbox includes a lot of Vaud). All OSM downloads use the actual canton polygon (+ a small buffer) to avoid this.
Next: Scoring