Simple Machine Learning model using Overture Maps data¶
In this notebook, we aim to predict the distance to the nearest bicycle-sharing station in three Spanish cities: Madrid, Seville, and Valencia.
Our goal is to develop a model capable of accurately predicting this distance based on geospatial features, which can assist in urban planning and enhance the accessibility of bicycle-sharing services.
The task involves several key steps:
1. Loading bicycle-sharing stations: We begin by loading geospatial data for bicycle-sharing stations from OpenStreetMap using OSMOnlineLoader for the specified cities. We also visualise the locations of these stations.
2. Regionalization using H3 hexagons: The cities are regionalized into H3 hexagons with H3Regionalizer, dividing them into smaller, manageable regions. Next, we buffer the generated H3 regions and compute the distance from each hexagon to the nearest bicycle-sharing station.
3. Feature extraction and embedding: We load additional geospatial features from Overture Maps using OvertureMapsLoader and join them with the H3 regions (with IntersectionJoiner). We then generate embeddings for these regions using ContextualCountEmbedder, which will be used as input features for our machine learning model.
4. Training and evaluating the model: We prepare the data for training and validation, standardize the features, and train an XGBoost regression model to predict the distance to the nearest bicycle-sharing station. Data from Madrid is used for training, while data from Seville is used for validation.
5. Prediction and visualization: Finally, the trained model predicts the distance to the nearest bicycle-sharing station for all three cities. The predicted distances and prediction errors are visualised on maps to evaluate the model's performance.
Prerequisites
- 12 GB of RAM
- Installed libraries: srai[osm,overturemaps,plotting], contextily, seaborn, scikit-learn, xgboost
# Uncomment the line below to install the required packages
# (e.g. when running in a new environment or Google Colab)
# %pip install srai[osm,overturemaps,plotting] contextily seaborn scikit-learn xgboost
Import required libraries¶
import contextily as cx
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd
import pyarrow as pa
import seaborn as sns
import xgboost as xgb
from h3 import int_to_str, str_to_int
from h3ronpy import grid_disk_aggregate_k
from matplotlib.colors import LinearSegmentedColormap, ListedColormap
from sklearn.preprocessing import StandardScaler
from srai.embedders import ContextualCountEmbedder
from srai.h3 import h3_to_shapely_geometry
from srai.joiners import IntersectionJoiner
from srai.loaders import OSMOnlineLoader, OvertureMapsLoader
from srai.neighbourhoods import H3Neighbourhood
from srai.regionalizers import H3Regionalizer, geocode_to_region_gdf
SEED = 71
Define regions of interest¶
Geocode three Spanish cities using the geocode_to_region_gdf
function.
train_city = "Madrid"
validation_city = "Seville"
test_city = "Valencia"
cities_names = [train_city, validation_city, test_city]
regions = geocode_to_region_gdf(cities_names)
regions.index = cities_names
regions
 | geometry |
---|---|
Madrid | POLYGON ((-3.88895 40.57085, -3.88838 40.56962... |
Seville | MULTIPOLYGON (((-6.03292 37.3021, -6.03289 37.... |
Valencia | MULTIPOLYGON (((-0.43255 39.56118, -0.43251 39... |
Load bicycle-sharing stations data¶
Load the locations of bicycle-sharing stations from OpenStreetMap using OSMOnlineLoader for the defined regions.
We will use the {"amenity": "bicycle_rental"} filter to get only locations where you can rent a bicycle (see the amenity=bicycle_rental definition on the OSM wiki).
bicycle_stations = OSMOnlineLoader().load(area=regions, tags={"amenity": "bicycle_rental"})
bicycle_stations
feature_id | geometry | amenity |
---|---|---|
node/167352492 | POINT (-3.69084 40.42324) | bicycle_rental |
node/203671810 | POINT (-0.37996 39.4732) | bicycle_rental |
node/308961227 | POINT (-5.99848 37.39784) | bicycle_rental |
node/308965974 | POINT (-5.99662 37.39754) | bicycle_rental |
node/308973765 | POINT (-5.99385 37.3969) | bicycle_rental |
... | ... | ... |
way/1260641080 | POLYGON ((-3.71176 40.34171, -3.71179 40.34172... | bicycle_rental |
way/1264576450 | POLYGON ((-3.68527 40.48484, -3.68523 40.48483... | bicycle_rental |
way/1308461422 | POLYGON ((-3.76907 40.38341, -3.76906 40.38338... | bicycle_rental |
way/1308571697 | POLYGON ((-3.78342 40.38359, -3.78341 40.38361... | bicycle_rental |
way/1309902673 | POLYGON ((-3.69516 40.46537, -3.69515 40.46536... | bicycle_rental |
1047 rows × 2 columns
Let's join the locations of the bicycle-sharing stations with the defined city geometries using IntersectionJoiner to group them per city.
bicycle_stations_in_city = IntersectionJoiner().transform(
regions, bicycle_stations, return_geom=True
)
bicycle_stations_per_city = {}
for city_name in cities_names:
bicycle_stations_per_city[city_name] = bicycle_stations_in_city.loc[city_name]
Let's visualise the training data on the map.
bicycle_stations_per_city[train_city].explore(tiles="CartoDB Positron")
Regionalize points data into H3 and create regions dataset¶
Now we will transform the station locations into H3 cells using the H3Regionalizer.
After that, we will buffer those cells with grid disks while aggregating the minimal distance to a station for each buffered H3 cell. We will use the grid_disk_aggregate_k function from the h3ronpy library.
Since h3ronpy operates on Arrow tables and uses integers as H3 indexes, we have to transform the indexes from str to int and back again.
We have to define 3 variables:
- H3_RESOLUTION: What will be the H3 cell resolution?
- H3_PREDICTION_RANGE: What is the maximum distance (in cells) we aim to predict?
- H3_NEIGHBOURS: How many neighbours do we want to have in the embedding context?
We will then buffer the generated cells up to the H3_PREDICTION_RANGE
+ H3_NEIGHBOURS
distance.
H3_RESOLUTION = 11
H3_PREDICTION_RANGE = 10
H3_NEIGHBOURS = 5
def buffer_h3_cells_with_aggregation(h3_regions):
"""Expand H3 regions and calculate minimal distance to origin cells."""
return (
pa.table(
grid_disk_aggregate_k(
h3_regions.index.map(str_to_int),
H3_NEIGHBOURS + H3_PREDICTION_RANGE,
"min",
)
)
.to_pandas()
.rename(columns={"k": "distance_to_station", "cell": "region_id"})
)
h3_regionalizer = H3Regionalizer(resolution=H3_RESOLUTION)
h3_regions_gdfs = []
for city_name, bicycle_stations_data in bicycle_stations_per_city.items():
city_h3_regions = h3_regionalizer.transform(bicycle_stations_data)
expanded_city_h3_regions = buffer_h3_cells_with_aggregation(city_h3_regions)
expanded_city_h3_regions["region_id"] = expanded_city_h3_regions["region_id"].map(int_to_str)
expanded_city_h3_regions = expanded_city_h3_regions.set_index("region_id")
expanded_city_h3_regions["city"] = city_name
expanded_city_h3_regions = gpd.GeoDataFrame(
expanded_city_h3_regions,
geometry=h3_to_shapely_geometry(expanded_city_h3_regions.index),
crs=4326,
)
h3_regions_gdfs.append(expanded_city_h3_regions)
h3_regions = gpd.pd.concat(h3_regions_gdfs)
h3_regions
region_id | distance_to_station | city | geometry |
---|---|---|---|
8b390cb1daa4fff | 6 | Madrid | POLYGON ((-3.70789 40.46501, -3.70815 40.46484... |
8b390ca35086fff | 3 | Madrid | POLYGON ((-3.68528 40.40041, -3.68553 40.40024... |
8b390ca048e1fff | 11 | Madrid | POLYGON ((-3.63797 40.397, -3.63823 40.39683, ... |
8b390ca05870fff | 1 | Madrid | POLYGON ((-3.61024 40.40308, -3.61049 40.40291... |
8b390ca33d0afff | 4 | Madrid | POLYGON ((-3.68561 40.36786, -3.68586 40.36768... |
... | ... | ... | ... |
8b39540ad811fff | 3 | Valencia | POLYGON ((-0.37828 39.48014, -0.37852 39.47997... |
8b395401ad5dfff | 4 | Valencia | POLYGON ((-0.35547 39.48016, -0.35572 39.47999... |
8b39540130e6fff | 2 | Valencia | POLYGON ((-0.39022 39.48913, -0.39047 39.48897... |
8b39540f4794fff | 4 | Valencia | POLYGON ((-0.33164 39.44966, -0.33189 39.44949... |
8b39540ac6c4fff | 4 | Valencia | POLYGON ((-0.39553 39.45943, -0.39578 39.45926... |
129372 rows × 3 columns
Now, we'll visualise the generated H3 regions.
We will colour the cells based on the distance within the range defined by H3_PREDICTION_RANGE, and the additional cells buffering the prediction range with extra neighbours (defined by the H3_NEIGHBOURS variable) will be coloured grey.
temps_cmap = LinearSegmentedColormap.from_list(
name="Temps",
colors=[
"#009392FF",
"#39B185FF",
"#9CCB86FF",
"#E9E29CFF",
"#EEB479FF",
"#E88471FF",
"#CF597EFF",
],
)
ax = h3_regions[
(h3_regions.city == train_city) & (h3_regions.distance_to_station <= H3_PREDICTION_RANGE)
].plot(
column="distance_to_station",
figsize=(20, 20),
cmap=temps_cmap,
alpha=0.6,
legend=True,
legend_kwds={
"shrink": 0.3,
"location": "bottom",
"label": "Distance to station",
"pad": -0.05,
},
)
h3_regions[
(h3_regions.city == train_city) & (h3_regions.distance_to_station > H3_PREDICTION_RANGE)
].plot(ax=ax, color="gray", alpha=0.3)
bicycle_stations_per_city[train_city].representative_point().plot(
ax=ax, color="black", markersize=1
)
cx.add_basemap(ax, crs=h3_regions.crs, source=cx.providers.CartoDB.PositronNoLabels, zoom=13)
ax.set_axis_off()
ax.set_title(f"Distance to the nearest bike station in {train_city}", fontsize=20)
plt.show()
Prepare features for the model¶
Now we will download the geospatial data for all of the H3 cells and generate embeddings for the model.
Download Overture Maps data¶
Now we will use OvertureMapsLoader to download the data for a selected list of datasets, each defined by a theme/type pair. For each pair we will define a hierarchy depth used to aggregate features into columns. For example, a hierarchy depth of 2 for the ("transportation", "segment") pair produces columns such as transportation|segment|road|residential, while a depth of 1 for ("base", "infrastructure") keeps only top-level columns such as base|infrastructure|bridge.
Data is downloaded using the OvertureMaestro library.
Switch to features based on OpenStreetMap
If you want to use OpenStreetMap data instead, you can use the OSMPbfLoader with the GEOFABRIK_LAYERS filter.
from srai.loaders.osm_loaders.filters import GEOFABRIK_LAYERS
from srai.loaders.osm_loaders import OSMPbfLoader
features = OSMPbfLoader().load(area=h3_regions, tags=GEOFABRIK_LAYERS)
OVERTURE_MAPS_HIERARCHY_DEPTH_VALUES = {
("base", "infrastructure"): 1,
("base", "land"): 1,
("base", "land_use"): 1,
("base", "water"): 1,
("transportation", "segment"): 2,
("buildings", "building"): 2,
("places", "place"): 1,
}
features = OvertureMapsLoader(
release="2024-12-18.0",
theme_type_pairs=list(OVERTURE_MAPS_HIERARCHY_DEPTH_VALUES.keys()),
hierarchy_depth=list(OVERTURE_MAPS_HIERARCHY_DEPTH_VALUES.values()),
include_all_possible_columns=False,
).load(area=h3_regions)
features
Finished operation in 0:00:31
geometry | base|infrastructure|aerialway | base|infrastructure|airport | base|infrastructure|barrier | base|infrastructure|bridge | base|infrastructure|communication | base|infrastructure|manhole | base|infrastructure|pedestrian | base|infrastructure|pier | base|infrastructure|power | ... | transportation|segment|road|residential | transportation|segment|road|secondary | transportation|segment|road|service | transportation|segment|road|steps | transportation|segment|road|tertiary | transportation|segment|road|track | transportation|segment|road|trunk | transportation|segment|road|unclassified | transportation|segment|road|unknown | transportation|segment|water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature_id | |||||||||||||||||||||
08b391c635384fff0001bc111d7893b4 | LINESTRING (-5.98693 37.30427, -5.98701 37.305... | False | False | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
08b391c6267abfff0001be3dd76fb36c | LINESTRING (-5.98923 37.33665, -5.98893 37.336... | False | False | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
08b391c62645bfff0001bcac55a89d0f | LINESTRING (-5.97701 37.31921, -5.9776 37.3200... | False | False | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
08b391c62612efff0001b2fbb0e44b48 | LINESTRING (-5.98384 37.33817, -5.98341 37.3389) | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b391c62612efff0001b01d83b6c48c | POLYGON ((-5.98397 37.3382, -5.98393 37.33819,... | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f390cb1d4c089103b364d9e921bbcd | POINT (-3.71512 40.45121) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f390cb1d4f058103ba08825e385da5 | POINT (-3.71733 40.45039) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f390cb1d4f34d4035d698b413ac96f | POINT (-3.71777 40.45061) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f390cb1d4c431603c90d877a230e5d | POINT (-3.71642 40.45073) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f390cb1d4e018c0301aa4d527a2d3e | POINT (-3.71668 40.45153) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
639464 rows × 182 columns
joint = IntersectionJoiner().transform(regions=h3_regions, features=features)
joint
region_id | feature_id |
---|---
8b391c622db1fff | 08b391c635384fff0001bc111d7893b4 |
8b391c622da2fff | 08b391c635384fff0001bc111d7893b4 |
8b391c622da3fff | 08b391c635384fff0001bc111d7893b4 |
8b391c622ce2fff | 08b391c6267abfff0001be3dd76fb36c |
8b391c622ce3fff | 08b391c6267abfff0001be3dd76fb36c |
... | ... |
8b390cb1d4e8fff | 08f390cb1d4c089103b364d9e921bbcd |
8b390cb1d4f3fff | 08f390cb1d4f058103ba08825e385da5 |
8b390cb1d4f0fff | 08f390cb1d4f34d4035d698b413ac96f |
8b390cb1d4c4fff | 08f390cb1d4c431603c90d877a230e5d |
8b390cb1d4e0fff | 08f390cb1d4e018c0301aa4d527a2d3e |
1639419 rows × 0 columns
Create the embedding¶
The simplest possible embedding in a geospatial context is the number of features per category in a given region. You can generate it using CountEmbedder; it is the equivalent of CountVectorizer from the scikit-learn library.
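For reference, this is roughly how the simpler count-based variant could be produced from the regions, features and joint data prepared above. This is a minimal sketch for illustration only; we don't use its output further in this notebook.
from srai.embedders import CountEmbedder
# Plain per-region counts: one column per feature category, no neighbourhood context.
# count_subcategories=False matches the Overture Maps bool columns used in this notebook.
count_embeddings = CountEmbedder(count_subcategories=False).transform(
    regions_gdf=h3_regions, features_gdf=features, joint_gdf=joint
)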
Here we will use a slightly more advanced method that takes close proximity in geographical space into account, so that each region also has information about what is in its neighbourhood. We call this context, and we will use the ContextualCountEmbedder to get this kind of embedding.
This method retrieves information about neighbours at a distance less than or equal to the H3_NEIGHBOURS variable and adds up the number of objects in each ring. Those aggregated rings can either be added as additional columns (concatenate_vectors=True) or folded into the existing columns using a diminishing weight based on the distance from the central region (concatenate_vectors=False, the default).
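To make the second variant more concrete, here is a toy illustration of the folding idea, assuming a 1 / (distance + 1) weight for the averaged ring counts; the exact weighting used internally by srai may differ.
# Toy counts of a single feature category for one central cell and its rings.
central_count = 2.0                 # count inside the region itself
ring_avg_counts = [1.5, 0.75, 0.2]  # average count per cell in rings 1, 2 and 3
# concatenate_vectors=False: fold the ring averages into the central value
# with a weight that diminishes with distance (assumed 1 / (distance + 1)).
embedding_value = central_count + sum(
    avg / (distance + 1) for distance, avg in enumerate(ring_avg_counts, start=1)
)
print(embedding_value)  # 2 + 1.5/2 + 0.75/3 + 0.2/4 = 3.05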
We also have to pass a Neighbourhood object that will be used to calculate adjacency between regions. Because we are working with H3 cells, we can pass an H3Neighbourhood object to utilize fast internal H3 methods for calculating the neighbourhood of an H3 cell, instead of relying on slower geometry-based operations.
Since the data from the OvertureMapsLoader is returned in the form of bool columns (instead of the default str columns with None values), we have to change the value of count_subcategories from the default True to False. All of the subcategories are already present in the final form of the features GeoDataFrame.
Using OpenStreetMap features
If you are using the OpenStreetMap data, then change the count_subcategories parameter to True.
Also, remember to remove bicycle-sharing stations from the dataset (amenity=bicycle_rental), since we don't want them to be present in the final features.
Overture Maps doesn't have any bicycle-sharing station data in the 2024-12-18.0 release, so we don't need to do this when using Overture Maps data.
embeddings = embeddings.drop(
columns=[
c for c in embeddings.columns
if "bicycle_rental" in c
]
)
embeddings = ContextualCountEmbedder(
neighbourhood=H3Neighbourhood(),
neighbourhood_distance=H3_NEIGHBOURS,
concatenate_vectors=False,
count_subcategories=False,
).transform(regions_gdf=h3_regions, features_gdf=features, joint_gdf=joint)
embeddings
base|infrastructure|aerialway | base|infrastructure|airport | base|infrastructure|barrier | base|infrastructure|bridge | base|infrastructure|communication | base|infrastructure|manhole | base|infrastructure|pedestrian | base|infrastructure|pier | base|infrastructure|power | base|infrastructure|tower | ... | transportation|segment|road|residential | transportation|segment|road|secondary | transportation|segment|road|service | transportation|segment|road|steps | transportation|segment|road|tertiary | transportation|segment|road|track | transportation|segment|road|trunk | transportation|segment|road|unclassified | transportation|segment|road|unknown | transportation|segment|water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
region_id | |||||||||||||||||||||
8b390cb1daa4fff | 0.0 | 0.0 | 0.415370 | 0.078472 | 0.0 | 0.0 | 1.239074 | 0.0 | 0.000000 | 0.000000 | ... | 0.258981 | 2.231435 | 0.007037 | 0.021574 | 0.217593 | 0.014537 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca35086fff | 0.0 | 0.0 | 2.453009 | 0.012917 | 0.0 | 0.0 | 1.262083 | 0.0 | 0.003472 | 0.000000 | ... | 1.395185 | 2.564769 | 1.086250 | 0.000000 | 1.238056 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca048e1fff | 0.0 | 0.0 | 0.101157 | 0.021528 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.060880 | 0.000000 | 3.411296 | 0.044259 | 0.100185 | 0.000000 | 0.0 | 0.175417 | 0.0 | 0.0 |
8b390ca05870fff | 0.0 | 0.0 | 0.007731 | 0.000000 | 0.0 | 0.0 | 0.000926 | 0.0 | 0.000000 | 0.000000 | ... | 0.961065 | 0.043009 | 0.168704 | 0.006852 | 0.068148 | 0.001667 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca33d0afff | 0.0 | 0.0 | 7.095694 | 0.006806 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 1.396296 | 0.000000 | 3.454907 | 1.074213 | 0.000000 | 0.091759 | 0.0 | 0.000000 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8b39540ad811fff | 0.0 | 0.0 | 5.049537 | 0.005556 | 0.0 | 0.0 | 0.634074 | 0.0 | 0.000000 | 0.001852 | ... | 0.025324 | 0.000000 | 0.008843 | 0.043519 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b395401ad5dfff | 0.0 | 0.0 | 1.186620 | 0.000000 | 0.0 | 0.0 | 0.042685 | 0.0 | 0.000000 | 0.001667 | ... | 0.930833 | 0.000000 | 2.133750 | 0.001667 | 0.016389 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b39540130e6fff | 0.0 | 0.0 | 0.299630 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.218333 | 1.375000 | 0.005000 | 0.000000 | 0.164398 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b39540f4794fff | 0.0 | 0.0 | 1.807454 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 1.603611 | 0.000000 | 0.004444 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b39540ac6c4fff | 0.0 | 0.0 | 0.056389 | 0.000000 | 0.0 | 0.0 | 0.000926 | 0.0 | 0.000000 | 0.000000 | ... | 4.555185 | 0.000000 | 0.011204 | 0.000000 | 0.076204 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
129372 rows × 181 columns
Training the model¶
After preparing the features for all three cities at once, it is now time to split them, prepare them for training and train the XGBoost
model.
Define the target and cities for training¶
We will now define the target column for the model to predict and select the train and validation cities.
TARGET = "distance_to_station"
Let's join the city datasets with the generated embeddings.
train_data = h3_regions[h3_regions["city"] == train_city].merge(
embeddings, left_index=True, right_index=True
)
validation_data = h3_regions[h3_regions["city"] == validation_city].merge(
embeddings, left_index=True, right_index=True
)
test_data = h3_regions[h3_regions["city"] == test_city].merge(
embeddings, left_index=True, right_index=True
)
train_data.shape, validation_data.shape, test_data.shape
((77114, 184), (26754, 184), (25504, 184))
train_data.head()
distance_to_station | city | geometry | base|infrastructure|aerialway | base|infrastructure|airport | base|infrastructure|barrier | base|infrastructure|bridge | base|infrastructure|communication | base|infrastructure|manhole | base|infrastructure|pedestrian | ... | transportation|segment|road|residential | transportation|segment|road|secondary | transportation|segment|road|service | transportation|segment|road|steps | transportation|segment|road|tertiary | transportation|segment|road|track | transportation|segment|road|trunk | transportation|segment|road|unclassified | transportation|segment|road|unknown | transportation|segment|water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
region_id | |||||||||||||||||||||
8b390cb1daa4fff | 6 | Madrid | POLYGON ((-3.70789 40.46501, -3.70815 40.46484... | 0.0 | 0.0 | 0.415370 | 0.078472 | 0.0 | 0.0 | 1.239074 | ... | 0.258981 | 2.231435 | 0.007037 | 0.021574 | 0.217593 | 0.014537 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca35086fff | 3 | Madrid | POLYGON ((-3.68528 40.40041, -3.68553 40.40024... | 0.0 | 0.0 | 2.453009 | 0.012917 | 0.0 | 0.0 | 1.262083 | ... | 1.395185 | 2.564769 | 1.086250 | 0.000000 | 1.238056 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca048e1fff | 11 | Madrid | POLYGON ((-3.63797 40.397, -3.63823 40.39683, ... | 0.0 | 0.0 | 0.101157 | 0.021528 | 0.0 | 0.0 | 0.000000 | ... | 0.060880 | 0.000000 | 3.411296 | 0.044259 | 0.100185 | 0.000000 | 0.0 | 0.175417 | 0.0 | 0.0 |
8b390ca05870fff | 1 | Madrid | POLYGON ((-3.61024 40.40308, -3.61049 40.40291... | 0.0 | 0.0 | 0.007731 | 0.000000 | 0.0 | 0.0 | 0.000926 | ... | 0.961065 | 0.043009 | 0.168704 | 0.006852 | 0.068148 | 0.001667 | 0.0 | 0.000000 | 0.0 | 0.0 |
8b390ca33d0afff | 4 | Madrid | POLYGON ((-3.68561 40.36786, -3.68586 40.36768... | 0.0 | 0.0 | 7.095694 | 0.006806 | 0.0 | 0.0 | 0.000000 | ... | 1.396296 | 0.000000 | 3.454907 | 1.074213 | 0.000000 | 0.091759 | 0.0 | 0.000000 | 0.0 | 0.0 |
5 rows × 184 columns
Scale the features¶
Each city might have a different density of features, and we don't want the model to be sensitive to those differences.
Let's consider a scenario where city A has a lot of benches and city B has only a few. There would naturally be a notable difference in the number of benches we can expect in a given region in city A vs city B. Scaling the features separately for each city gives us relative information about how many benches there are in a given region with respect to that particular city. That's why we scale the features: to keep different cities in a similar domain, so that the transfer of knowledge between them is easier for the model.
Here we will use a StandardScaler
from the scikit-learn
library.
x_train = StandardScaler().fit_transform(train_data[embeddings.columns])
y_train = train_data[TARGET]
x_validation = StandardScaler().fit_transform(validation_data[embeddings.columns])
y_validation = validation_data[TARGET]
x_test = StandardScaler().fit_transform(test_data[embeddings.columns])
y_test = test_data[TARGET]
x_train.shape, y_train.shape, x_validation.shape, y_validation.shape, x_test.shape, y_test.shape
((77114, 181), (77114,), (26754, 181), (26754,), (25504, 181), (25504,))
Train the model¶
It is finally time to train the model.
We will use the XGBoost library, a popular choice in the data science domain.
We will create two datasets, train and validation, and set some parameters for the booster training.
We will minimise squared error, monitor Root Mean Squared Error (RMSE) on the validation set, and use early stopping to avoid overfitting to a single city.
mask = y_train <= H3_PREDICTION_RANGE
dtrain = xgb.DMatrix(
data=x_train[mask], label=y_train[mask], feature_names=list(embeddings.columns)
)
mask = y_validation <= H3_PREDICTION_RANGE
dval = xgb.DMatrix(
data=x_validation[mask], label=y_validation[mask], feature_names=list(embeddings.columns)
)
# Set parameters for XGBoost
params = {
"objective": "reg:squarederror",
"eta": 0.01,
"max_depth": 8,
"subsample": 0.8,
"colsample_bytree": 0.8,
"seed": SEED,
}
# Train the model
bst = xgb.train(
params,
dtrain,
num_boost_round=2000,
verbose_eval=50,
early_stopping_rounds=100,
evals=[(dtrain, "train"), (dval, "valid")],
)
[0] train-rmse:2.58204 valid-rmse:2.69827
[50] train-rmse:2.19382 valid-rmse:2.47383
[100] train-rmse:1.95987 valid-rmse:2.37758
[150] train-rmse:1.80367 valid-rmse:2.33290
[200] train-rmse:1.69121 valid-rmse:2.31894
[250] train-rmse:1.60988 valid-rmse:2.30175
[300] train-rmse:1.54435 valid-rmse:2.29999
[350] train-rmse:1.49464 valid-rmse:2.29924
[400] train-rmse:1.44988 valid-rmse:2.29890
[423] train-rmse:1.43092 valid-rmse:2.30444
bst.best_iteration, bst.best_score
(323, 2.2939042486453736)
Evaluate the results¶
Now we will evaluate the model and visualise the results on a map.
Predict the distance for all cities¶
First, we need to predict the distance for all three cities so that we have a set of values to compare against.
We will calculate the prediction error from the predicted value and the expected distance.
concatenated_regions = pd.concat([train_data, validation_data, test_data])[
[TARGET, "city", "geometry"]
]
concatenated_regions["predicted_distance_to_station"] = (
np.concatenate(
[
bst.predict(xgb.DMatrix(x_train, feature_names=list(embeddings.columns))),
bst.predict(xgb.DMatrix(x_validation, feature_names=list(embeddings.columns))),
bst.predict(xgb.DMatrix(x_test, feature_names=list(embeddings.columns))),
]
)
.round(2)
.clip(min=0)
)
concatenated_regions = concatenated_regions[concatenated_regions[TARGET] <= H3_PREDICTION_RANGE]
concatenated_regions
region_id | distance_to_station | city | geometry | predicted_distance_to_station |
---|---|---|---|---|
8b390cb1daa4fff | 6 | Madrid | POLYGON ((-3.70789 40.46501, -3.70815 40.46484... | 5.48 |
8b390ca35086fff | 3 | Madrid | POLYGON ((-3.68528 40.40041, -3.68553 40.40024... | 3.66 |
8b390ca05870fff | 1 | Madrid | POLYGON ((-3.61024 40.40308, -3.61049 40.40291... | 5.24 |
8b390ca33d0afff | 4 | Madrid | POLYGON ((-3.68561 40.36786, -3.68586 40.36768... | 5.26 |
8b390ca35325fff | 3 | Madrid | POLYGON ((-3.67583 40.40631, -3.67608 40.40614... | 3.95 |
... | ... | ... | ... | ... |
8b39540ad811fff | 3 | Valencia | POLYGON ((-0.37828 39.48014, -0.37852 39.47997... | 4.42 |
8b395401ad5dfff | 4 | Valencia | POLYGON ((-0.35547 39.48016, -0.35572 39.47999... | 4.36 |
8b39540130e6fff | 2 | Valencia | POLYGON ((-0.39022 39.48913, -0.39047 39.48897... | 5.08 |
8b39540f4794fff | 4 | Valencia | POLYGON ((-0.33164 39.44966, -0.33189 39.44949... | 6.41 |
8b39540ac6c4fff | 4 | Valencia | POLYGON ((-0.39553 39.45943, -0.39578 39.45926... | 5.14 |
102217 rows × 4 columns
Visualise the distribution of predictions¶
First, we will see how the model performed in terms of distance prediction by comparing the predicted values with the expected values.
pal12_cmap = ListedColormap(
name="pal12",
colors=[
"#DE1A1AFF",
"#E1581AFF",
"#E37F1BFF",
"#E2A11BFF",
"#DFC11BFF",
"#D8E01BFF",
"#CEFF1AFF",
"#B2D736FF",
"#97B043FF",
"#7B8948FF",
"#5F654AFF",
"#43444AFF",
"#202547FF",
][: (H3_PREDICTION_RANGE + 1)],
)
_, axs = plt.subplots(2, 3, figsize=(12, 8), sharey=True, sharex=True, dpi=600)
axs[0, 0].set_ylabel("Predicted distance to station")
axs[1, 0].set_ylabel("Predicted distance to station")
for idx, city_name in enumerate(cities_names):
expected_values = concatenated_regions[concatenated_regions["city"] == city_name][TARGET]
mask = expected_values <= H3_PREDICTION_RANGE
expected_values = expected_values[mask]
predicted_values = concatenated_regions[concatenated_regions["city"] == city_name][
"predicted_distance_to_station"
][mask]
sns.regplot(
x=expected_values,
y=predicted_values,
ax=axs[0, idx],
scatter=True,
order=2,
scatter_kws=dict(
alpha=0.02,
color=[pal12_cmap.colors[_y] for _y in expected_values],
),
line_kws=dict(
color="black",
),
x_jitter=0.1,
)
sns.violinplot(
x=expected_values,
y=predicted_values,
ax=axs[1, idx],
fill=True,
palette=pal12_cmap.colors,
hue=expected_values,
legend=False,
)
title = city_name
if city_name == train_city:
title += " (train)"
elif city_name == validation_city:
title += " (validation)"
axs[0, idx].set_title(title)
axs[0, idx].set_xlabel(None)
axs[1, idx].set_xlabel("Distance to station")
axs[0, 0].set_ylim(bottom=0)
axs[1, 0].set_ylim(bottom=0)
plt.tight_layout()
As you can see, it's far from perfect, but the overall trend from the lowest to the highest distance is preserved both for the validation city (Seville) and for the city that didn't take part in the modelling at all (Valencia).
Although the scale isn't preserved and the errors are significant, we can still use these predictions to display hotspots on a map.
Visualise the predictions on a map¶
Here comes the most interesting part - visualising the predictions on a map.
We will change the colour scale for each city to match the range of its predictions, so that we can clearly see the difference between the lowest and highest predicted values.
Existing bicycle stations will be plotted as black dots.
for city_name in cities_names:
city_data = concatenated_regions[concatenated_regions.city == city_name]
ax = city_data.plot(
column="predicted_distance_to_station",
figsize=(20, 20),
cmap=temps_cmap,
alpha=0.8,
legend=True,
legend_kwds={
"shrink": 0.3,
"location": "bottom",
"label": "Predicted distance to station",
"pad": -0.05,
},
vmin=max(0, city_data["predicted_distance_to_station"].min()),
vmax=city_data["predicted_distance_to_station"].max(),
)
bicycle_stations_per_city[city_name].representative_point().plot(
ax=ax, color="black", markersize=3, alpha=0.4
)
cx.add_basemap(ax, crs=h3_regions.crs, source=cx.providers.CartoDB.PositronNoLabels, zoom=13)
ax.set_axis_off()
title = f"Predicted distance to the nearest bike station in {city_name}"
if city_name == train_city:
title += " (train)"
elif city_name == validation_city:
title += " (validation)"
ax.set_title(title, fontsize=20)
plt.show()
Visualise the predictions error on a map¶
And the last part: the prediction error preview.
As in other areas of data science, we could make a boring error plot on X/Y axes or as a histogram. But what would the field of geospatial data be if we didn't also use maps for this?
Since we already know that the ranges of the predictions per city don't match the original domain, we will rescale the predictions to the original domain and recalculate the prediction error. That way we can eliminate the base error resulting from the difference between scales and focus on the spatial part of the error.
We will also normalize the prediction error to a 0-1 scale, so that the biggest negative error maps to 0, the biggest positive error maps to 1, and 0.5 represents no error at all.
That way we can use a divergent colour scale without rescaling it.
Additionally, we will use opacity to hide regions where the error was insignificant.
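To make the mapping concrete, here is a toy example of the normalization applied in the code below, with assumed error values ranging from -4 (largest negative error) to +6 (largest positive error).
toy_errors = pd.Series([-4.0, -2.0, 0.0, 3.0, 6.0])
# Negative errors are divided by the minimum, positive errors by the maximum,
# then the result is shifted from [-1, 1] to [0, 1].
toy_normalized = (
    toy_errors.apply(lambda x: -x / toy_errors.min() if x < 0 else x / toy_errors.max()) + 1
) / 2
print(toy_normalized.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]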
tangerine_blues_cmap = LinearSegmentedColormap.from_list(
name="TangerineBlues",
colors=[
"#552000FF",
"#8A4D00FF",
"#C17D17FF",
"#F8B150FF",
"#F5F5F5FF",
"#93C6E1FF",
"#5F93ACFF",
"#2E627AFF",
"#00344AFF",
],
)
for city_name in cities_names:
city_data = concatenated_regions[concatenated_regions.city == city_name].copy()
# NewValue = (((OldValue - OldMin) * (NewMax - NewMin)) / (OldMax - OldMin)) + NewMin
city_data["scaled_predicted_distance_to_station"] = (
(
(
city_data["predicted_distance_to_station"]
- city_data["predicted_distance_to_station"].min()
)
* (city_data["distance_to_station"].max() - city_data["distance_to_station"].min())
)
/ (
city_data["predicted_distance_to_station"].max()
- city_data["predicted_distance_to_station"].min()
)
) + city_data["distance_to_station"].min()
city_data["scaled_prediction_error"] = (
city_data[TARGET].clip(upper=H3_PREDICTION_RANGE)
- city_data["scaled_predicted_distance_to_station"]
)
city_data["normalized_scaled_prediction_error"] = (
city_data["scaled_prediction_error"].apply(
lambda x, city_data=city_data: (
-x / city_data["scaled_prediction_error"].min()
if x < 0
else x / city_data["scaled_prediction_error"].max()
)
)
+ 1
) / 2
city_data["normalized_prediction_error_alpha"] = (
city_data["normalized_scaled_prediction_error"] - 0.5
).abs() * 2
ax = city_data.plot(
column="normalized_scaled_prediction_error",
figsize=(20, 20),
cmap=tangerine_blues_cmap,
alpha=city_data["normalized_prediction_error_alpha"],
legend=True,
legend_kwds={
"shrink": 0.3,
"location": "bottom",
"label": "Distance prediction error (scaled)",
"pad": -0.05,
"ticks": [0, 0.5, 1],
"format": mticker.FixedFormatter(
[
round(city_data["scaled_prediction_error"].min(), 2),
"0",
round(city_data["scaled_prediction_error"].max(), 2),
]
),
},
)
bicycle_stations_per_city[city_name].representative_point().plot(
ax=ax, color="black", markersize=3, alpha=0.4
)
cx.add_basemap(ax, crs=h3_regions.crs, source=cx.providers.CartoDB.PositronNoLabels, zoom=13)
ax.set_axis_off()
title = f"Distance prediction error (scaled) in {city_name}"
if city_name == train_city:
title += " (train)"
elif city_name == validation_city:
title += " (validation)"
ax.set_title(title, fontsize=20)
plt.show()
Below is an example of the scaling we did in the previous step: the difference between the predicted distances and the predicted distances scaled to the original domain.
sns.displot(
data=pd.melt(
city_data,
value_vars=[
"predicted_distance_to_station",
"scaled_predicted_distance_to_station",
],
),
x="value",
hue="variable",
kde=True,
height=10,
)
<seaborn.axisgrid.FacetGrid at 0x7fa4cb93c580>
See feature importance¶
We can visualise feature importance from the XGBoost Booster object and display the gain per feature. This information can tell us, or the planners, which urban features are important for the model.
This is a basic form of explainability: it doesn't tell us whether a feature has a negative or positive impact on the result. For that we would have to use, for example, the SHAP library (see the sketch after the feature importance plot below).
_, ax = plt.subplots(1, 1, figsize=(10, 10))
feature_importance_df = pd.DataFrame(
bst.get_score(importance_type="gain").items(), columns=["Feature", "Gain"]
)
ax = sns.barplot(y="Feature", x="Gain", data=feature_importance_df.nlargest(20, columns="Gain"))
ax.set_title("Feature importance")
plt.show()
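As mentioned above, signed per-feature contributions would require a tool such as SHAP. Below is a minimal sketch of how that could look; it is not run in this notebook, assumes the shap package is installed, and subsamples the data only to keep the computation fast.
import shap
# TreeExplainer works directly with a trained XGBoost Booster.
explainer = shap.TreeExplainer(bst)
sample = x_train[:1000]  # subsample of the scaled training features
shap_values = explainer.shap_values(sample)
shap.summary_plot(shap_values, sample, feature_names=list(embeddings.columns))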
Summary¶
In summary, looking at the prediction maps and the errors, there is a general trend where the predictions fairly represent the true layout of the cycle stations.
City centres stand out: there the predictions are closer to reality, and the model tends to predict low values. As you move away from the centre, you can see increasing deviations, with noticeable clusters where the model completely failed to predict the existence of a cycle station.
This may be due to a number of factors:
- differences in the planning of the cycling network between towns and cities;
- differences in the urban layout of towns and cities;
- a lack of features or examples representing some types of urban fabric;
- a model unable to capture all the complexities of the dataset;
- an overly simple method of embedding geographic features into a numerical vector.
Even looking at the example of Madrid, which has been reproduced quite well by the model, you can see places where the model thinks a station should not be there, or empty places where it would propose to put one. So even without knowledge transfer between cities, such a model could support local planners.
I hope that the methods presented here have helped you understand the basics of geospatial machine learning and have shown the tangible benefits of using such a model in real-world scenarios.