Simple Machine Learning model using Overture Maps data¶
In this notebook, we aim to predict the distance to the nearest bicycle-sharing station in three Spanish cities: Madrid, Seville, and Valencia.
Our goal is to develop a model capable of accurately predicting this distance based on geospatial features, which can assist in urban planning and enhance the accessibility of bicycle-sharing services.
The task involves several key steps:
1. Loading bicycle-sharing stations: We begin by loading geospatial data for bicycle-sharing stations from OpenStreetMap using OSMOnlineLoader for the specified cities. We also visualise the locations of these stations.
2. Regionalization using H3 hexagons: The cities are regionalized into H3 hexagons with H3Regionalizer, dividing them into smaller, manageable regions. Next, we buffer the generated H3 regions and compute the distance from each hexagon to the nearest bicycle-sharing station.
3. Feature extraction and embedding: We load additional geospatial features from Overture Maps using OvertureMapsLoader and join them with the H3 regions (with IntersectionJoiner). We then generate embeddings for these regions using ContextualCountEmbedder, which will be used as input features for our machine learning model.
4. Training and evaluating the model: We prepare the data for training and validation, standardize the features, and train an XGBoost regression model to predict the distance to the nearest bicycle-sharing station. Data from Madrid is used for training, while data from Seville is used for validation.
5. Prediction and visualization: Finally, the trained model predicts the distance to the nearest bicycle-sharing station for all three cities. The predicted distances and prediction errors are visualised on maps to evaluate the model's performance.
Prerequisites
- 12 GB of RAM
- Installed libraries: srai[osm,overturemaps,plotting], contextily, seaborn, scikit-learn, xgboost
# Uncomment the line below to install the required packages
# (e.g. when running in a new environment or Google Colab)
# %pip install srai[osm,overturemaps,plotting] contextily seaborn scikit-learn xgboost
Import required libraries¶
import contextily as cx
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd
import pyarrow as pa
import seaborn as sns
import xgboost as xgb
from h3 import int_to_str, str_to_int
from h3ronpy import grid_disk_aggregate_k
from matplotlib.colors import LinearSegmentedColormap, ListedColormap
from sklearn.preprocessing import StandardScaler
from srai.embedders import ContextualCountEmbedder
from srai.h3 import h3_to_shapely_geometry
from srai.joiners import IntersectionJoiner
from srai.loaders import OSMOnlineLoader, OvertureMapsLoader
from srai.neighbourhoods import H3Neighbourhood
from srai.regionalizers import H3Regionalizer, geocode_to_region_gdf
SEED = 71
Define regions of interest¶
Geocode three Spanish cities using the geocode_to_region_gdf
function.
train_city = "Madrid"
validation_city = "Seville"
test_city = "Valencia"
cities_names = [train_city, validation_city, test_city]
regions = geocode_to_region_gdf(cities_names)
regions.index = cities_names
regions
 | geometry |
---|---|
Madrid | POLYGON ((-3.88895 40.57085, -3.88838 40.56962... |
Seville | MULTIPOLYGON (((-6.03292 37.3021, -6.03289 37.... |
Valencia | MULTIPOLYGON (((-0.43255 39.56118, -0.43251 39... |
Load bicycle-sharing stations data¶
Load the locations of bicycle-sharing stations from OpenStreetMap using OSMOnlineLoader for the defined regions.
We will use the {"amenity": "bicycle_rental"} filter to get only locations where you can rent a bicycle (see the amenity=bicycle_rental definition on the OSM wiki).
bicycle_stations = OSMOnlineLoader().load(area=regions, tags={"amenity": "bicycle_rental"})
bicycle_stations
feature_id | geometry | amenity |
---|---|---|
node/167352492 | POINT (-3.69084 40.42324) | bicycle_rental |
node/203671810 | POINT (-0.37996 39.4732) | bicycle_rental |
node/308961227 | POINT (-5.99848 37.39784) | bicycle_rental |
node/308965974 | POINT (-5.99662 37.39754) | bicycle_rental |
node/308973765 | POINT (-5.99385 37.3969) | bicycle_rental |
... | ... | ... |
way/1260641080 | POLYGON ((-3.71176 40.34171, -3.71179 40.34172... | bicycle_rental |
way/1264576450 | POLYGON ((-3.68527 40.48484, -3.68523 40.48483... | bicycle_rental |
way/1308461422 | POLYGON ((-3.76907 40.38341, -3.76906 40.38338... | bicycle_rental |
way/1308571697 | POLYGON ((-3.78342 40.38359, -3.78341 40.38361... | bicycle_rental |
way/1309902673 | POLYGON ((-3.69516 40.46537, -3.69515 40.46536... | bicycle_rental |
1047 rows × 2 columns
Let's join the locations of the bicycle-sharing stations with the defined city geometries using IntersectionJoiner to group them per city.
bicycle_stations_in_city = IntersectionJoiner().transform(
regions, bicycle_stations, return_geom=True
)
bicycle_stations_per_city = {}
for city_name in cities_names:
bicycle_stations_per_city[city_name] = bicycle_stations_in_city.loc[city_name]
Let's visualise the training data on the map.
bicycle_stations_per_city[train_city].explore(tiles="CartoDB Positron")
Regionalize points data into H3 and create regions dataset¶
Now we will transform the station locations into H3 cells using the H3Regionalizer.
After that, we will buffer those cells with grid disks while aggregating the minimal distance to a station for each buffered H3 cell. We will use the grid_disk_aggregate_k function from the h3ronpy library.
Since h3ronpy operates on Arrow tables and uses integers as H3 indexes, we have to transform the indexes from str to int and back again.
We have to define 3 variables:
- H3_RESOLUTION: What will be the H3 cell resolution?
- H3_PREDICTION_RANGE: What is the maximum distance (in cells) we aim to predict?
- H3_NEIGHBOURS: How many neighbours do we want to have in the embedding context?
We will then buffer the generated cells up to the H3_PREDICTION_RANGE
+ H3_NEIGHBOURS
distance.
H3_RESOLUTION = 11
H3_PREDICTION_RANGE = 10
H3_NEIGHBOURS = 5
def buffer_h3_cells_with_aggregation(h3_regions):
"""Expand H3 regions and calculate minimal distance to origin cells."""
return (
pa.table(
grid_disk_aggregate_k(
h3_regions.index.map(str_to_int),
H3_NEIGHBOURS + H3_PREDICTION_RANGE,
"min",
)
)
.to_pandas()
.rename(columns={"k": "distance_to_station", "cell": "region_id"})
)
h3_regionalizer = H3Regionalizer(resolution=H3_RESOLUTION)
h3_regions_gdfs = []
for city_name, bicycle_stations_data in bicycle_stations_per_city.items():
city_h3_regions = h3_regionalizer.transform(bicycle_stations_data)
expanded_city_h3_regions = buffer_h3_cells_with_aggregation(city_h3_regions)
expanded_city_h3_regions["region_id"] = expanded_city_h3_regions["region_id"].map(int_to_str)
expanded_city_h3_regions = expanded_city_h3_regions.set_index("region_id")
expanded_city_h3_regions["city"] = city_name
expanded_city_h3_regions = gpd.GeoDataFrame(
expanded_city_h3_regions,
geometry=h3_to_shapely_geometry(expanded_city_h3_regions.index),
crs=4326,
)
h3_regions_gdfs.append(expanded_city_h3_regions)
h3_regions = gpd.pd.concat(h3_regions_gdfs)
h3_regions
region_id | distance_to_station | city | geometry |
---|---|---|---|
8b390cb1daa4fff | 6 | Madrid | POLYGON ((-3.70789 40.46501, -3.70815 40.46484... |
8b390ca35086fff | 3 | Madrid | POLYGON ((-3.68528 40.40041, -3.68553 40.40024... |
8b390ca048e1fff | 11 | Madrid | POLYGON ((-3.63797 40.397, -3.63823 40.39683, ... |
8b390ca05870fff | 1 | Madrid | POLYGON ((-3.61024 40.40308, -3.61049 40.40291... |
8b390ca33d0afff | 4 | Madrid | POLYGON ((-3.68561 40.36786, -3.68586 40.36768... |
... | ... | ... | ... |
8b39540ad811fff | 3 | Valencia | POLYGON ((-0.37828 39.48014, -0.37852 39.47997... |
8b395401ad5dfff | 4 | Valencia | POLYGON ((-0.35547 39.48016, -0.35572 39.47999... |
8b39540130e6fff | 2 | Valencia | POLYGON ((-0.39022 39.48913, -0.39047 39.48897... |
8b39540f4794fff | 4 | Valencia | POLYGON ((-0.33164 39.44966, -0.33189 39.44949... |
8b39540ac6c4fff | 4 | Valencia | POLYGON ((-0.39553 39.45943, -0.39578 39.45926... |
129372 rows × 3 columns
Now, we'll visualise the generated H3 regions.
We will colour the cells based on the distance within the range defined by H3_PREDICTION_RANGE, and the additional cells buffering the prediction range with extra neighbours (defined by the H3_NEIGHBOURS variable) will be coloured grey.
temps_cmap = LinearSegmentedColormap.from_list(
name="Temps",
colors=[
"#009392FF",
"#39B185FF",
"#9CCB86FF",
"#E9E29CFF",
"#EEB479FF",
"#E88471FF",
"#CF597EFF",
],
)
ax = h3_regions[
(h3_regions.city == train_city) & (h3_regions.distance_to_station <= H3_PREDICTION_RANGE)
].plot(
column="distance_to_station",
figsize=(20, 20),
cmap=temps_cmap,
alpha=0.6,
legend=True,
legend_kwds={
"shrink": 0.3,
"location": "bottom",
"label": "Distance to station",
"pad": -0.05,
},
)
h3_regions[
(h3_regions.city == train_city) & (h3_regions.distance_to_station > H3_PREDICTION_RANGE)
].plot(ax=ax, color="gray", alpha=0.3)
bicycle_stations_per_city[train_city].representative_point().plot(
ax=ax, color="black", markersize=1
)
cx.add_basemap(ax, crs=h3_regions.crs, source=cx.providers.CartoDB.PositronNoLabels, zoom=13)
ax.set_axis_off()
ax.set_title(f"Distance to the nearest bike station in {train_city}", fontsize=20)
plt.show()
Prepare features for the model¶
Now we will download the geospatial data for all of the H3 cells and generate embeddings for the model.
Download Overture Maps data¶
Now we will use OvertureMapsLoader to download the data for a selected list of datasets, each defined by a theme/type pair. For each pair we will define a hierarchy depth used to aggregate features into columns. For example, a hierarchy depth of 2 for the ("transportation", "segment") pair produces columns such as transportation|segment|road|residential, while a depth of 1 for ("base", "infrastructure") keeps only top-level columns such as base|infrastructure|bridge.
Data is downloaded using the OvertureMaestro library.
Switch to features based on OpenStreetMap
If you want to use OpenStreetMap data instead, you can use the OSMPbfLoader with the GEOFABRIK_LAYERS filter.
from srai.loaders.osm_loaders.filters import GEOFABRIK_LAYERS
from srai.loaders.osm_loaders import OSMPbfLoader
features = OSMPbfLoader().load(area=h3_regions, tags=GEOFABRIK_LAYERS)
OVERTURE_MAPS_HIERARCHY_DEPTH_VALUES = {
("base", "infrastructure"): 1,
("base", "land"): 1,
("base", "land_use"): 1,
("base", "water"): 1,
("transportation", "segment"): 2,
("buildings", "building"): 2,
("places", "place"): 1,
}
features = OvertureMapsLoader(
release="2024-12-18.0",
theme_type_pairs=list(OVERTURE_MAPS_HIERARCHY_DEPTH_VALUES.keys()),
hierarchy_depth=list(OVERTURE_MAPS_HIERARCHY_DEPTH_VALUES.values()),
include_all_possible_columns=False,
).load(area=h3_regions)
features
Finished operation in 0:00:31
geometry | base|infrastructure|aerialway | base|infrastructure|airport | base|infrastructure|barrier | base|infrastructure|bridge | base|infrastructure|communication | base|infrastructure|manhole | base|infrastructure|pedestrian | base|infrastructure|pier | base|infrastructure|power | ... | transportation|segment|road|residential | transportation|segment|road|secondary | transportation|segment|road|service | transportation|segment|road|steps | transportation|segment|road|tertiary | transportation|segment|road|track | transportation|segment|road|trunk | transportation|segment|road|unclassified | transportation|segment|road|unknown | transportation|segment|water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature_id | |||||||||||||||||||||
08b391c635384fff0001bc111d7893b4 | LINESTRING (-5.98693 37.30427, -5.98701 37.305... | False | False | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
08b391c6267abfff0001be3dd76fb36c | LINESTRING (-5.98923 37.33665, -5.98893 37.336... | False | False | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
08b391c62645bfff0001bcac55a89d0f | LINESTRING (-5.97701 37.31921, -5.9776 37.3200... | False | False | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
08b391c62612efff0001b2fbb0e44b48 | LINESTRING (-5.98384 37.33817, -5.98341 37.3389) | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b391c62612efff0001b01d83b6c48c | POLYGON ((-5.98397 37.3382, -5.98393 37.33819,... | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f390cb1d4c089103b364d9e921bbcd | POINT (-3.71512 40.45121) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f390cb1d4f058103ba08825e385da5 | POINT (-3.71733 40.45039) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f390cb1d4f34d4035d698b413ac96f | POINT (-3.71777 40.45061) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f390cb1d4c431603c90d877a230e5d | POINT (-3.71642 40.45073) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f390cb1d4e018c0301aa4d527a2d3e | POINT (-3.71668 40.45153) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
639464 rows × 182 columns
joint = IntersectionJoiner().transform(regions=h3_regions, features=features)
joint
region_id | feature_id |
---|---
8b391c622db1fff | 08b391c635384fff0001bc111d7893b4 |
8b391c622da2fff | 08b391c635384fff0001bc111d7893b4 |
8b391c622da3fff | 08b391c635384fff0001bc111d7893b4 |
8b391c622ce2fff | 08b391c6267abfff0001be3dd76fb36c |
8b391c622ce3fff | 08b391c6267abfff0001be3dd76fb36c |
... | ... |
8b390cb1d4e8fff | 08f390cb1d4c089103b364d9e921bbcd |
8b390cb1d4f3fff | 08f390cb1d4f058103ba08825e385da5 |
8b390cb1d4f0fff | 08f390cb1d4f34d4035d698b413ac96f |
8b390cb1d4c4fff | 08f390cb1d4c431603c90d877a230e5d |
8b390cb1d4e0fff | 08f390cb1d4e018c0301aa4d527a2d3e |
1639419 rows × 0 columns
Create the embedding¶
The simplest possible embedding in a geospatial context is the number of features per category in a given region. You can generate it using CountEmbedder; it is the equivalent of CountVectorizer from the scikit-learn library.
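For reference, this is roughly how the simpler count-based variant could be produced from the regions, features and joint data prepared above. This is a minimal sketch for illustration only; we don't use its output further in this notebook.
from srai.embedders import CountEmbedder
# Plain per-region counts: one column per feature category, no neighbourhood context.
# count_subcategories=False matches the Overture Maps bool columns used in this notebook.
count_embeddings = CountEmbedder(count_subcategories=False).transform(
    regions_gdf=h3_regions, features_gdf=features, joint_gdf=joint
)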
Here we will use a slightly more advanced method that takes close proximity in geographical space into account, so that each region also has information about what is in its neighbourhood. We call this context, and we will use the ContextualCountEmbedder to get this kind of embedding.
This method retrieves information about neighbours at a distance less than or equal to the H3_NEIGHBOURS variable and adds up the number of objects in each ring. Those aggregated rings can either be added as additional columns (concatenate_vectors=True) or folded into the existing columns using a diminishing weight based on the distance from the central region (concatenate_vectors=False, the default).
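To make the second variant more concrete, here is a toy illustration of the folding idea, assuming a 1 / (distance + 1) weight for the averaged ring counts; the exact weighting used internally by srai may differ.
# Toy counts of a single feature category for one central cell and its rings.
central_count = 2.0                 # count inside the region itself
ring_avg_counts = [1.5, 0.75, 0.2]  # average count per cell in rings 1, 2 and 3
# concatenate_vectors=False: fold the ring averages into the central value
# with a weight that diminishes with distance (assumed 1 / (distance + 1)).
embedding_value = central_count + sum(
    avg / (distance + 1) for distance, avg in enumerate(ring_avg_counts, start=1)
)
print(embedding_value)  # 2 + 1.5/2 + 0.75/3 + 0.2/4 = 3.05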
We also have to pass a Neighbourhood object that will be used to calculate adjacency between regions. Because we are working with H3 cells, we can pass an H3Neighbourhood object to utilize fast internal H3 methods for calculating the neighbourhood of an H3 cell, instead of relying on slower geometry-based operations.
Since the data from the OvertureMapsLoader is returned in the form of bool columns (instead of the default str columns with None values), we have to change the value of count_subcategories from the default True to False. All of the subcategories are already present in the final form of the features GeoDataFrame.
Using OpenStreetMap features
If you are using the OpenStreetMap data, then change the count_subcategories parameter to True.
Also, remember to remove bicycle-sharing stations from the dataset (amenity=bicycle_rental), since we don't want them to be present in the final features.
Overture Maps doesn't have any bicycle-sharing station data in the 2024-12-18.0 release, so we don't need to do this when using Overture Maps data.
embeddings = embeddings.drop(
columns=[
c for c in embeddings.columns
if "bicycle_rental" in c
]
)
embeddings = ContextualCountEmbedder(
neighbourhood=H3Neighbourhood(),
neighbourhood_distance=H3_NEIGHBOURS,
concatenate_vectors=False,
count_subcategories=False,
).transform(regions_gdf=h3_regions, features_gdf=features, joint_gdf=joint)
embeddings
base|infrastructure|aerialway | base|infrastructure|airport | base|infrastructure|barrier | base|infrastructure|bridge | base|infrastructure|communication | base|infrastructure|manhole | base|infrastructure|pedestrian | base|infrastructure|pier | base|infrastructure|power | base|infrastructure|tower | ... | transportation|segment|road|residential | transportation|segment|road|secondary | transportation|segment|road|service | transportation|segment|road|steps | transportation|segment|road|tertiary | transportation|segment|road|track | transportation|segment|road|trunk | transportation|segment|road|unclassified | transportation|segment|road|unknown | transportation|segment|water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
region_id | |||||||||||||||||||||
8b390cb1daa4fff | 0.0 | 0.0 | 0.415370 | 0.078472 | 0.0 | 0.0 | 1.239074 | 0.0 | 0.000000 | 0.000000 | ... | 0.258981 | 2.231435 | 0.007037 | 0.021574 | 0.217593 | 0.014537 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca35086fff | 0.0 | 0.0 | 2.453009 | 0.012917 | 0.0 | 0.0 | 1.262083 | 0.0 | 0.003472 | 0.000000 | ... | 1.395185 | 2.564769 | 1.086250 | 0.000000 | 1.238056 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca048e1fff | 0.0 | 0.0 | 0.101157 | 0.021528 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.060880 | 0.000000 | 3.411296 | 0.044259 | 0.100185 | 0.000000 | 0.0 | 0.175417 | 0.0 | 0.0 |
8b390ca05870fff | 0.0 | 0.0 | 0.007731 | 0.000000 | 0.0 | 0.0 | 0.000926 | 0.0 | 0.000000 | 0.000000 | ... | 0.961065 | 0.043009 | 0.168704 | 0.006852 | 0.068148 | 0.001667 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca33d0afff | 0.0 | 0.0 | 7.095694 | 0.006806 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 1.396296 | 0.000000 | 3.454907 | 1.074213 | 0.000000 | 0.091759 | 0.0 | 0.000000 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8b39540ad811fff | 0.0 | 0.0 | 5.049537 | 0.005556 | 0.0 | 0.0 | 0.634074 | 0.0 | 0.000000 | 0.001852 | ... | 0.025324 | 0.000000 | 0.008843 | 0.043519 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b395401ad5dfff | 0.0 | 0.0 | 1.186620 | 0.000000 | 0.0 | 0.0 | 0.042685 | 0.0 | 0.000000 | 0.001667 | ... | 0.930833 | 0.000000 | 2.133750 | 0.001667 | 0.016389 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b39540130e6fff | 0.0 | 0.0 | 0.299630 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.218333 | 1.375000 | 0.005000 | 0.000000 | 0.164398 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b39540f4794fff | 0.0 | 0.0 | 1.807454 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 1.603611 | 0.000000 | 0.004444 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b39540ac6c4fff | 0.0 | 0.0 | 0.056389 | 0.000000 | 0.0 | 0.0 | 0.000926 | 0.0 | 0.000000 | 0.000000 | ... | 4.555185 | 0.000000 | 0.011204 | 0.000000 | 0.076204 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
129372 rows × 181 columns
Training the model¶
After preparing the features for all three cities at once, it is now time to split them, prepare them for training and train the XGBoost
model.
Define the target and cities for training¶
We will now define the target column for the model to predict and select the train and validation cities.
TARGET = "distance_to_station"
Let's join the city datasets with the generated embeddings.
train_data = h3_regions[h3_regions["city"] == train_city].merge(
embeddings, left_index=True, right_index=True
)
validation_data = h3_regions[h3_regions["city"] == validation_city].merge(
embeddings, left_index=True, right_index=True
)
test_data = h3_regions[h3_regions["city"] == test_city].merge(
embeddings, left_index=True, right_index=True
)
train_data.shape, validation_data.shape, test_data.shape
((77114, 184), (26754, 184), (25504, 184))
train_data.head()
distance_to_station | city | geometry | base|infrastructure|aerialway | base|infrastructure|airport | base|infrastructure|barrier | base|infrastructure|bridge | base|infrastructure|communication | base|infrastructure|manhole | base|infrastructure|pedestrian | ... | transportation|segment|road|residential | transportation|segment|road|secondary | transportation|segment|road|service | transportation|segment|road|steps | transportation|segment|road|tertiary | transportation|segment|road|track | transportation|segment|road|trunk | transportation|segment|road|unclassified | transportation|segment|road|unknown | transportation|segment|water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
region_id | |||||||||||||||||||||
8b390cb1daa4fff | 6 | Madrid | POLYGON ((-3.70789 40.46501, -3.70815 40.46484... | 0.0 | 0.0 | 0.415370 | 0.078472 | 0.0 | 0.0 | 1.239074 | ... | 0.258981 | 2.231435 | 0.007037 | 0.021574 | 0.217593 | 0.014537 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca35086fff | 3 | Madrid | POLYGON ((-3.68528 40.40041, -3.68553 40.40024... | 0.0 | 0.0 | 2.453009 | 0.012917 | 0.0 | 0.0 | 1.262083 | ... | 1.395185 | 2.564769 | 1.086250 | 0.000000 | 1.238056 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca048e1fff | 11 | Madrid | POLYGON ((-3.63797 40.397, -3.63823 40.39683, ... | 0.0 | 0.0 | 0.101157 | 0.021528 | 0.0 | 0.0 | 0.000000 | ... | 0.060880 | 0.000000 | 3.411296 | 0.044259 | 0.100185 | 0.000000 | 0.0 | 0.175417 | 0.0 | 0.0 |
8b390ca05870fff | 1 | Madrid | POLYGON ((-3.61024 40.40308, -3.61049 40.40291... | 0.0 | 0.0 | 0.007731 | 0.000000 | 0.0 | 0.0 | 0.000926 | ... | 0.961065 | 0.043009 | 0.168704 | 0.006852 | 0.068148 | 0.001667 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca33d0afff | 4 | Madrid | POLYGON ((-3.68561 40.36786, -3.68586 40.36768... | 0.0 | 0.0 | 7.095694 | 0.006806 | 0.0 | 0.0 | 0.000000 | ... | 1.396296 | 0.000000 | 3.454907 | 1.074213 | 0.000000 | 0.091759 | 0.0 | 0.000000 | 0.0 | 0.0 |
5 rows × 184 columns
Scale the features¶
Each city might have a different density of features, and we don't want the model to be sensitive to those differences.
Let's consider a scenario where city A has a lot of benches and city B has only a few. There would naturally be a notable difference in the number of benches we can expect in a given region in city A vs city B. Scaling the features separately for each city gives us relative information about how many benches there are in a given region with respect to that particular city. That's why we scale the features: to keep different cities in a similar domain, so that the transfer of knowledge between them is easier for the model.
Here we will use a StandardScaler
from the scikit-learn
library.
x_train = StandardScaler().fit_transform(train_data[embeddings.columns])
y_train = train_data[TARGET]
x_validation = StandardScaler().fit_transform(validation_data[embeddings.columns])
y_validation = validation_data[TARGET]
x_test = StandardScaler().fit_transform(test_data[embeddings.columns])
y_test = test_data[TARGET]
x_train.shape, y_train.shape, x_validation.shape, y_validation.shape, x_test.shape, y_test.shape
((77114, 181), (77114,), (26754, 181), (26754,), (25504, 181), (25504,))
Train the model¶
It is finally time to train the model.
We will use the XGBoost library, a popular choice in the data science domain.
We will create two datasets, train and validation, and set some parameters for the booster training.
We will minimise squared error, monitor Root Mean Squared Error (RMSE) on the validation set, and use early stopping to avoid overfitting to a single city.
mask = y_train <= H3_PREDICTION_RANGE
dtrain = xgb.DMatrix(
data=x_train[mask], label=y_train[mask], feature_names=list(embeddings.columns)
)
mask = y_validation <= H3_PREDICTION_RANGE
dval = xgb.DMatrix(
data=x_validation[mask], label=y_validation[mask], feature_names=list(embeddings.columns)
)
# Set parameters for XGBoost
params = {
"objective": "reg:squarederror",
"eta": 0.01,
"max_depth": 8,
"subsample": 0.8,
"colsample_bytree": 0.8,
"seed": SEED,
}
# Train the model
bst = xgb.train(
params,
dtrain,
num_boost_round=2000,
verbose_eval=50,
early_stopping_rounds=100,
evals=[(dtrain, "train"), (dval, "valid")],
)
[0] train-rmse:2.58204 valid-rmse:2.69827
[50] train-rmse:2.19382 valid-rmse:2.47383
[100] train-rmse:1.95987 valid-rmse:2.37758
[150] train-rmse:1.80367 valid-rmse:2.33290
[200] train-rmse:1.69121 valid-rmse:2.31894
[250] train-rmse:1.60988 valid-rmse:2.30175
[300] train-rmse:1.54435 valid-rmse:2.29999
[350] train-rmse:1.49464 valid-rmse:2.29924
[400] train-rmse:1.44988 valid-rmse:2.29890
[423] train-rmse:1.43092 valid-rmse:2.30444
bst.best_iteration, bst.best_score
(323, 2.2939042486453736)
Evaluate the results¶
Now we will evaluate the model and visualise the results on a map.
Predict the distance for all cities¶
First, we need to predict the distance for all three cities so that we have a set of values to compare against.
We will calculate the prediction error from the predicted value and the expected distance.
concatenated_regions = pd.concat([train_data, validation_data, test_data])[
[TARGET, "city", "geometry"]
]
concatenated_regions["predicted_distance_to_station"] = (
np.concatenate(
[
bst.predict(xgb.DMatrix(x_train, feature_names=list(embeddings.columns))),
bst.predict(xgb.DMatrix(x_validation, feature_names=list(embeddings.columns))),
bst.predict(xgb.DMatrix(x_test, feature_names=list(embeddings.columns))),
]
)
.round(2)
.clip(min=0)
)
concatenated_regions = concatenated_regions[concatenated_regions[TARGET] <= H3_PREDICTION_RANGE]
concatenated_regions
region_id | distance_to_station | city | geometry | predicted_distance_to_station |
---|---|---|---|---|
8b390cb1daa4fff | 6 | Madrid | POLYGON ((-3.70789 40.46501, -3.70815 40.46484... | 5.48 |
8b390ca35086fff | 3 | Madrid | POLYGON ((-3.68528 40.40041, -3.68553 40.40024... | 3.66 |
8b390ca05870fff | 1 | Madrid | POLYGON ((-3.61024 40.40308, -3.61049 40.40291... | 5.24 |
8b390ca33d0afff | 4 | Madrid | POLYGON ((-3.68561 40.36786, -3.68586 40.36768... | 5.26 |
8b390ca35325fff | 3 | Madrid | POLYGON ((-3.67583 40.40631, -3.67608 40.40614... | 3.95 |
... | ... | ... | ... | ... |
8b39540ad811fff | 3 | Valencia | POLYGON ((-0.37828 39.48014, -0.37852 39.47997... | 4.42 |
8b395401ad5dfff | 4 | Valencia | POLYGON ((-0.35547 39.48016, -0.35572 39.47999... | 4.36 |
8b39540130e6fff | 2 | Valencia | POLYGON ((-0.39022 39.48913, -0.39047 39.48897... | 5.08 |
8b39540f4794fff | 4 | Valencia | POLYGON ((-0.33164 39.44966, -0.33189 39.44949... | 6.41 |
8b39540ac6c4fff | 4 | Valencia | POLYGON ((-0.39553 39.45943, -0.39578 39.45926... | 5.14 |
102217 rows × 4 columns
Visualise the distribution of predictions¶
First, we will see how the model performed in terms of distance prediction by comparing the predicted values with the expected values.
pal12_cmap = ListedColormap(
name="pal12",
colors=[
"#DE1A1AFF",
"#E1581AFF",
"#E37F1BFF",
"#E2A11BFF",
"#DFC11BFF",
"#D8E01BFF",
"#CEFF1AFF",
"#B2D736FF",
"#97B043FF",
"#7B8948FF",
"#5F654AFF",
"#43444AFF",
"#202547FF",
][: (H3_PREDICTION_RANGE + 1)],
)
_, axs = plt.subplots(2, 3, figsize=(12, 8), sharey=True, sharex=True, dpi=600)
axs[0, 0].set_ylabel("Predicted distance to station")
axs[1, 0].set_ylabel("Predicted distance to station")
for idx, city_name in enumerate(cities_names):
expected_values = concatenated_regions[concatenated_regions["city"] == city_name][TARGET]
mask = expected_values <= H3_PREDICTION_RANGE
expected_values = expected_values[mask]
predicted_values = concatenated_regions[concatenated_regions["city"] == city_name][
"predicted_distance_to_station"
][mask]
sns.regplot(
x=expected_values,
y=predicted_values,
ax=axs[0, idx],
scatter=True,
order=2,
scatter_kws=dict(
alpha=0.02,
color=[pal12_cmap.colors[_y] for _y in expected_values],
),
line_kws=dict(
color="black",
),
x_jitter=0.1,
)
sns.violinplot(
x=expected_values,
y=predicted_values,
ax=axs[1, idx],
fill=True,
palette=pal12_cmap.colors,
hue=expected_values,
legend=False,
)
title = city_name
if city_name == train_city:
title += " (train)"
elif city_name == validation_city:
title += " (validation)"
axs[0, idx].set_title(title)
axs[0, idx].set_xlabel(None)
axs[1, idx].set_xlabel("Distance to station")
axs[0, 0].set_ylim(bottom=0)
axs[1, 0].set_ylim(bottom=0)
plt.tight_layout()
As you can see, it's far from perfect, but the overall trend from the lowest to the highest distance is preserved both for the validation city (Seville) and for the city that didn't take part in the modelling at all (Valencia).
Although the scale isn't preserved and the errors are significant, we can still use these predictions to display hotspots on a map.
Visualise the predictions on a map¶
Here comes the most interesting part - visualising the predictions on a map.
We will change the colour scale for each city to match the range of its predictions, so that we can clearly see the difference between the lowest and highest predicted values.
Existing bicycle stations will be plotted as black dots.
for city_name in cities_names:
city_data = concatenated_regions[concatenated_regions.city == city_name]
ax = city_data.plot(
column="predicted_distance_to_station",
figsize=(20, 20),
cmap=temps_cmap,
alpha=0.8,
legend=True,
legend_kwds={
"shrink": 0.3,
"location": "bottom",
"label": "Predicted distance to station",
"pad": -0.05,
},
vmin=max(0, city_data["predicted_distance_to_station"].min()),
vmax=city_data["predicted_distance_to_station"].max(),
)
bicycle_stations_per_city[city_name].representative_point().plot(
ax=ax, color="black", markersize=3, alpha=0.4
)
cx.add_basemap(ax, crs=h3_regions.crs, source=cx.providers.CartoDB.PositronNoLabels, zoom=13)
ax.set_axis_off()
title = f"Predicted distance to the nearest bike station in {city_name}"
if city_name == train_city:
title += " (train)"
elif city_name == validation_city:
title += " (validation)"
ax.set_title(title, fontsize=20)
plt.show()
Visualise the predictions error on a map¶
And the last part: the prediction error preview.
As in other areas of data science, we could make a boring error plot on X/Y axes or as a histogram. But what would the field of geospatial data be if we didn't also use maps for this?
Since we already know that the ranges of the predictions per city don't match the original domain, we will rescale the predictions to the original domain and recalculate the prediction error. That way we can eliminate the base error resulting from the difference between scales and focus on the spatial part of the error.
We will also normalize the prediction error to a 0-1 scale, so that the biggest negative error maps to 0, the biggest positive error maps to 1, and 0.5 represents no error at all.
That way we can use a divergent colour scale without rescaling it.
Additionally, we will use opacity to hide regions where the error was insignificant.
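To make the mapping concrete, here is a toy example of the normalization applied in the code below, with assumed error values ranging from -4 (largest negative error) to +6 (largest positive error).
toy_errors = pd.Series([-4.0, -2.0, 0.0, 3.0, 6.0])
# Negative errors are divided by the minimum, positive errors by the maximum,
# then the result is shifted from [-1, 1] to [0, 1].
toy_normalized = (
    toy_errors.apply(lambda x: -x / toy_errors.min() if x < 0 else x / toy_errors.max()) + 1
) / 2
print(toy_normalized.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]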
tangerine_blues_cmap = LinearSegmentedColormap.from_list(
name="TangerineBlues",
colors=[
"#552000FF",
"#8A4D00FF",
"#C17D17FF",
"#F8B150FF",
"#F5F5F5FF",
"#93C6E1FF",
"#5F93ACFF",
"#2E627AFF",
"#00344AFF",
],
)
for city_name in cities_names:
city_data = concatenated_regions[concatenated_regions.city == city_name].copy()
# NewValue = (((OldValue - OldMin) * (NewMax - NewMin)) / (OldMax - OldMin)) + NewMin
city_data["scaled_predicted_distance_to_station"] = (
(
(
city_data["predicted_distance_to_station"]
- city_data["predicted_distance_to_station"].min()
)
* (city_data["distance_to_station"].max() - city_data["distance_to_station"].min())
)
/ (
city_data["predicted_distance_to_station"].max()
- city_data["predicted_distance_to_station"].min()
)
) + city_data["distance_to_station"].min()
city_data["scaled_prediction_error"] = (
city_data[TARGET].clip(upper=H3_PREDICTION_RANGE)
- city_data["scaled_predicted_distance_to_station"]
)
city_data["normalized_scaled_prediction_error"] = (
city_data["scaled_prediction_error"].apply(
lambda x, city_data=city_data: (
-x / city_data["scaled_prediction_error"].min()
if x < 0
else x / city_data["scaled_prediction_error"].max()
)
)
+ 1
) / 2
city_data["normalized_prediction_error_alpha"] = (
city_data["normalized_scaled_prediction_error"] - 0.5
).abs() * 2
ax = city_data.plot(
column="normalized_scaled_prediction_error",
figsize=(20, 20),
cmap=tangerine_blues_cmap,
alpha=city_data["normalized_prediction_error_alpha"],
legend=True,
legend_kwds={
"shrink": 0.3,
"location": "bottom",
"label": "Distance prediction error (scaled)",
"pad": -0.05,
"ticks": [0, 0.5, 1],
"format": mticker.FixedFormatter(
[
round(city_data["scaled_prediction_error"].min(), 2),
"0",
round(city_data["scaled_prediction_error"].max(), 2),
]
),
},
)
bicycle_stations_per_city[city_name].representative_point().plot(
ax=ax, color="black", markersize=3, alpha=0.4
)
cx.add_basemap(ax, crs=h3_regions.crs, source=cx.providers.CartoDB.PositronNoLabels, zoom=13)
ax.set_axis_off()
title = f"Distance prediction error (scaled) in {city_name}"
if city_name == train_city:
title += " (train)"
elif city_name == validation_city:
title += " (validation)"
ax.set_title(title, fontsize=20)
plt.show()
Below is an example of the scaling we did in the previous step: the difference between the predicted distances and the predicted distances scaled to the original domain.
sns.displot(
data=pd.melt(
city_data,
value_vars=[
"predicted_distance_to_station",
"scaled_predicted_distance_to_station",
],
),
x="value",
hue="variable",
kde=True,
height=10,
)
<seaborn.axisgrid.FacetGrid at 0x7fa4cb93c580>
See feature importance¶
We can visualise feature importance from the XGBoost Booster object and display the gain per feature. This information can tell us, or the planners, which urban features are important for the model.
This is a basic form of explainability: it doesn't tell us whether a feature has a negative or positive impact on the result. For that we would have to use, for example, the SHAP library (see the sketch after the feature importance plot below).
_, ax = plt.subplots(1, 1, figsize=(10, 10))
feature_importance_df = pd.DataFrame(
bst.get_score(importance_type="gain").items(), columns=["Feature", "Gain"]
)
ax = sns.barplot(y="Feature", x="Gain", data=feature_importance_df.nlargest(20, columns="Gain"))
ax.set_title("Feature importance")
plt.show()
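As mentioned above, signed per-feature contributions would require a tool such as SHAP. Below is a minimal sketch of how that could look; it is not run in this notebook, assumes the shap package is installed, and subsamples the data only to keep the computation fast.
import shap
# TreeExplainer works directly with a trained XGBoost Booster.
explainer = shap.TreeExplainer(bst)
sample = x_train[:1000]  # subsample of the scaled training features
shap_values = explainer.shap_values(sample)
shap.summary_plot(shap_values, sample, feature_names=list(embeddings.columns))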
Summary¶
In summary, looking at the prediction maps and the errors, there is a general trend where the predictions fairly represent the true layout of the cycle stations.
City centres stand out: there the predictions are closer to reality, and the model tends to predict low values. As you move away from the centre, you can see increasing deviations, with noticeable clusters where the model completely failed to predict the existence of a cycle station.
This may be due to a number of factors:
- differences in the planning of the cycling network between towns and cities;
- differences in the urban layout of towns and cities;
- a lack of features or examples representing some types of urban fabric;
- a model unable to capture all the complexities of the dataset;
- an overly simple method of embedding geographic features into a numerical vector.
Even looking at the example of Madrid, which has been reproduced quite well by the model, you can see places where the model thinks a station should not be there, or empty places where it would propose to put one. So even without knowledge transfer between cities, such a model could support local planners.
I hope that the methods presented here have helped you understand the basics of geospatial machine learning and have shown the tangible benefits of using such a model in real-world scenarios.