Airbnb multicity
Airbnb Dataset
This dataset consists of approximately 3.1 million Airbnb listings collected between June 2022 and May 2023 across 80 cities worldwide. It includes geographic location, property characteristics, host activity, and review metrics. For the benchmark, a cleaned subset covering six cities (Paris, Rome, London, Amsterdam, Melbourne, and New York City) was selected.
In [1]:
# plotting imports
import contextily as cx
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.patches import Patch
# dataset import
from srai.datasets import AirbnbMulticityDataset
In [2]:
airbnb_multicity = AirbnbMulticityDataset()
In [3]:
type(airbnb_multicity.train_gdf), type(airbnb_multicity.test_gdf)
Out[3]:
(NoneType, NoneType)
Loading default version
In [4]:
ds = airbnb_multicity.load()
ds.keys()
Out[4]:
dict_keys(['train', 'test'])
In [5]:
type(airbnb_multicity.train_gdf), type(airbnb_multicity.test_gdf)
Out[5]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
In [6]:
print("Aggregation H3 resolution:", airbnb_multicity.resolution)
Aggregation H3 resolution: 8
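To give a sense of scale for resolution 8: H3 has 2 + 120 · 7^r cells at resolution r, so dividing Earth's surface area by that count gives the average cell area. The sketch below uses only the standard library; the helper names and the rounded surface-area constant are illustrative assumptions and are not part of srai or h3.

```python
# Rough scale of an H3 resolution, computed without the h3 library.
# Assumed constant: approximate surface area of Earth in square kilometres.
EARTH_SURFACE_KM2 = 510_065_622


def h3_cell_count(resolution: int) -> int:
    """Total number of H3 cells at a given resolution (0 to 15)."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions range from 0 to 15")
    return 2 + 120 * 7**resolution


def average_cell_area_km2(resolution: int) -> float:
    """Average cell area, assuming cells tile the whole globe."""
    return EARTH_SURFACE_KM2 / h3_cell_count(resolution)


print(h3_cell_count(8))
print(round(average_cell_area_km2(8), 3))
```

At resolution 8 this works out to roughly 0.74 km² per cell on average, so each aggregated value summarises listings within an area of about that size.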
In [7]:
print("Prediction target:", airbnb_multicity.target)
Prediction target: price
In [8]:
gdf_train, gdf_test = ds["train"], ds["test"]
In [9]:
print("Available cities:", sorted(gdf_train["city"].unique()))
Available cities: ['amsterdam', 'london', 'melbourne', 'new-york-city', 'paris', 'rome']
In [10]:
gdf_train.head()
Out[10]:
 | id | name | host_id | host_name | neighbourhood | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | city | date | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9396150 | Cosy studio sur cour fleurie. Métro Saint-Man... | 5783938 | Alexandra | Ménilmontant | Entire home/apt | 53.0 | 7 | 7 | 2022-06-02 | 0.34 | 1 | 308 | 5 | paris | 2022-06-06 | POINT (2.42054 48.84782) |
1 | 638764995602497687 | JOLI T2 LUMINEUX de 34 m² à 7 min de PARIS | 460853253 | Sawsen | Buttes-Montmartre | Entire home/apt | 59.0 | 1 | 0 | None | NaN | 1 | 13 | 0 | paris | 2022-06-06 | POINT (2.3591 48.90649) |
2 | 23135649 | Charming studio, Pont de Neuilly - Paris | 171938056 | Julien | Passy | Private room | 65.0 | 3 | 31 | 2022-05-31 | 0.65 | 1 | 4 | 6 | paris | 2022-06-06 | POINT (2.25285 48.88173) |
3 | 54220288 | Superbe appartement moderne - Les Docks Saint-... | 141681595 | Théo | Batignolles-Monceau | Entire home/apt | 103.0 | 3 | 4 | 2022-05-22 | 1.08 | 1 | 214 | 4 | paris | 2022-06-06 | POINT (2.32833 48.90859) |
4 | 885179 | Parisian luxury apartment ... | 4176034 | Bernard | Buttes-Montmartre | Entire home/apt | 65.0 | 3 | 84 | 2022-03-19 | 0.99 | 1 | 245 | 21 | paris | 2022-06-06 | POINT (2.33163 48.90531) |
In [11]:
fig, axes = plt.subplots(
    2, 2, sharex=False, sharey=False, figsize=(12, 15), width_ratios=[3, 1]
)
# (city name, marker size) pairs; each city gets one row of plots
cities = [("Amsterdam", 0.05), ("London", 0.01)]
for row_idx, (city_name, marker_size) in enumerate(cities):
    city_train = gdf_train[gdf_train["city"] == city_name.lower()]
    city_test = gdf_test[gdf_test["city"] == city_name.lower()]
    # share of points in each split, shown in the map title
    train_points = len(city_train)
    test_points = len(city_test)
    train_pct = 100 * train_points / (train_points + test_points)
    test_pct = 100 * test_points / (train_points + test_points)
    # left column: listing locations on a basemap
    ax_map = axes[row_idx][0]
    city_train.plot(color="orange", markersize=marker_size, ax=ax_map, label="train")
    city_test.plot(color="royalblue", markersize=marker_size, ax=ax_map, label="test")
    ax_map.set_title(
        f"{city_name} data - points on a map"
        f" (Train: {train_points} ({train_pct:.2f}%),"
        f" Test: {test_points} ({test_pct:.2f}%))"
    )
    ax_map.legend(
        handles=[
            Line2D([], [], marker="o", color="orange", linestyle="None"),
            Line2D([], [], marker="o", color="royalblue", linestyle="None"),
        ],
        labels=["Train", "Test"],
    )
    cx.add_basemap(
        ax_map, source=cx.providers.CartoDB.PositronNoLabels, crs=4326, zoom=12
    )
    ax_map.set_axis_off()
    # right column: distribution of the target in each split
    ax_dist = axes[row_idx][1]
    sns.kdeplot(
        x=city_train[airbnb_multicity.target],
        label="train",
        color="orange",
        ax=ax_dist,
        fill=False,
        cut=0,
    )
    sns.kdeplot(
        x=city_test[airbnb_multicity.target],
        label="test",
        color="royalblue",
        ax=ax_dist,
        fill=False,
        cut=0,
    )
    ax_dist.set_title(f"{city_name} data - target distribution")
    ax_dist.legend()
plt.tight_layout()
plt.show()
Getting aggregated hexagon values
In [12]:
train_h3, _, test_h3 = airbnb_multicity.get_h3_with_labels()
In [13]:
train_h3.head()
Out[13]:
region_id | geometry | price |
---|---|---|
882a100dd9fffff | POLYGON ((-73.90034 40.69846, -73.90671 40.697... | 127.746667 |
88195dace5fffff | POLYGON ((-0.42167 51.54068, -0.42839 51.53947... | 79.500000 |
881fb474a1fffff | POLYGON ((2.33465 48.91954, 2.32833 48.91839, ... | 112.678571 |
882a100a8bfffff | POLYGON ((-73.90344 40.83853, -73.90983 40.837... | 106.071429 |
88be6236d1fffff | POLYGON ((145.44505 -37.93223, 145.44622 -37.9... | 225.230769 |
In [14]:
test_h3.head()
Out[14]:
region_id | geometry | price |
---|---|---|
88195da681fffff | POLYGON ((-0.10646 51.57144, -0.11317 51.57025... | 121.156250 |
881969504dfffff | POLYGON ((4.87764 52.33471, 4.87606 52.33041, ... | 183.818182 |
881e80cd93fffff | POLYGON ((12.60114 41.96974, 12.59926 41.96496... | 200.000000 |
88be622e41fffff | POLYGON ((145.49699 -37.65429, 145.49816 -37.6... | 199.700000 |
88be63c9c7fffff | POLYGON ((145.28791 -37.83716, 145.28909 -37.8... | 35.000000 |
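The `price` values in these tables look like per-cell averages of listing prices. As a minimal sketch of that kind of aggregation, assuming a plain mean per cell (the aggregation actually used inside `get_h3_with_labels` is not shown in this notebook) and using made-up cell IDs and prices:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (cell_id, price) pairs. In the notebook, each listing would be
# assigned to the H3 cell that contains its point geometry.
listings = [
    ("cell_a", 100.0),
    ("cell_a", 140.0),
    ("cell_b", 80.0),
    ("cell_b", 79.0),
    ("cell_c", 35.0),
]


def aggregate_mean_by_cell(rows):
    """Group prices by cell ID and return the mean price per cell."""
    prices_by_cell = defaultdict(list)
    for cell_id, price in rows:
        prices_by_cell[cell_id].append(price)
    return {cell_id: mean(prices) for cell_id, prices in prices_by_cell.items()}


cell_prices = aggregate_mean_by_cell(listings)
print(cell_prices)
```

A cell holding a single listing simply keeps that listing's price, which would explain round values such as 35.0 or 200.0 next to fractional ones.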
In [15]:
aggregated_train_data = train_h3.cx[-1.04:0.65, 51.09:51.84]
aggregated_test_data = test_h3.cx[-1.04:0.65, 51.09:51.84]
with plt.rc_context({"hatch.linewidth": 0.4}):
    ax = aggregated_train_data.plot(
        airbnb_multicity.target,
        cmap="spring_r",
        legend=True,
        legend_kwds=dict(
            location="right", shrink=0.9, pad=0.02, label=airbnb_multicity.target
        ),
        figsize=(15, 9),
        alpha=0.5,
    )
    ax.set_axis_off()
    aggregated_test_data.plot(
        airbnb_multicity.target, cmap="spring_r", alpha=0.5, ax=ax
    )
    aggregated_test_data.plot(
        ax=ax, linewidth=0.4, color=(0, 0, 0, 0), edgecolor=(0, 0, 0, 0.4), hatch="+++"
    )
    ax.set_title("London data aggregated to H3 cells")
    ax.legend(
        handles=[
            Patch(edgecolor=(0, 0, 0, 0.8), linewidth=0.1, facecolor=(0, 0, 0, 0)),
            Patch(
                edgecolor=(0, 0, 0, 0.8),
                linewidth=0.1,
                facecolor=(0, 0, 0, 0),
                hatch="+++",
            ),
        ],
        labels=["Train", "Test"],
        loc=2,
    )
    cx.add_basemap(ax, source=cx.providers.CartoDB.PositronNoLabels, crs=4326, zoom=11)
    ax.set_axis_off()
    plt.show()
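The `.cx` indexer used above keeps geometries that intersect a bounding box written as `[xmin:xmax, ymin:ymax]`, here in longitude/latitude order. For a single point this reduces to a range check. The sketch below is a stand-alone illustration of that check with a hypothetical helper, not geopandas code; the two sample coordinates are the first polygon vertices of a London cell and a Paris cell from the tables above.

```python
# Bounding box used for London above: (xmin, xmax, ymin, ymax) in lon/lat.
LONDON_BBOX = (-1.04, 0.65, 51.09, 51.84)


def in_bbox(lon, lat, bbox):
    """Return True if the point lies inside the box (edges included)."""
    xmin, xmax, ymin, ymax = bbox
    return xmin <= lon <= xmax and ymin <= lat <= ymax


points = {
    "london": (-0.42167, 51.54068),
    "paris": (2.33465, 48.91954),
}
selected = [
    name for name, (lon, lat) in points.items() if in_bbox(lon, lat, LONDON_BBOX)
]
print(selected)
```

Only the London point passes the check, which is why the plot above shows London cells alone even though `train_h3` and `test_h3` contain all six cities.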
Loading raw, full data
In [16]:
ds = airbnb_multicity.load(version="all")
ds.keys()
Out[16]:
dict_keys(['train'])
In [17]:
type(airbnb_multicity.train_gdf), type(airbnb_multicity.test_gdf)
Out[17]:
(geopandas.geodataframe.GeoDataFrame, NoneType)
In [18]:
ds["train"].head()
Out[18]:
 | id | name | host_id | host_name | neighbourhood | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | city | date | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23726706 | Private room 20 minutes from Amsterdam + Break... | 122619127 | Patricia | IJburg - Zeeburgereiland | Private room | 88.0 | 2 | 78 | 2022-05-29 | 1.53 | 1 | 66 | 11 | amsterdam | 2022-06-05 | POINT (4.97879 52.34916) |
1 | 35815036 | Vrijstaand vakantiehuis, privé tuin aan het water | 269425139 | Lydia | Noord-Oost | Entire home/apt | 105.0 | 3 | 95 | 2022-06-02 | 2.65 | 1 | 243 | 36 | amsterdam | 2022-06-05 | POINT (4.95689 52.42419) |
2 | 31553121 | Quiet Guesthouse near Amsterdam | 76806621 | Ralf | Noord-West | Entire home/apt | 152.0 | 2 | 82 | 2022-05-29 | 2.02 | 1 | 3 | 26 | amsterdam | 2022-06-05 | POINT (4.91821 52.43237) |
3 | 34745823 | Apartment ' Landzicht', nearby Amsterdam | 238083700 | Daisy | Gaasperdam - Driemond | Entire home/apt | 87.0 | 2 | 39 | 2022-04-17 | 1.08 | 3 | 290 | 4 | amsterdam | 2022-06-05 | POINT (5.01231 52.2962) |
4 | 44586947 | Weesp, 2 kamers vlakbij Amsterdam | 360838688 | Aranka | Gaasperdam - Driemond | Private room | 160.0 | 2 | 15 | 2022-05-29 | 0.68 | 1 | 152 | 12 | amsterdam | 2022-06-05 | POINT (5.0303 52.31475) |
Create your own train-test split: spatial splitting with bucket stratification
In [19]:
train, test = airbnb_multicity.train_test_split(
    target_column="price", test_size=0.2, resolution=8, n_bins=10, random_state=42
)
Summary of the split:
Train: 33190 H3 cells (2243473 points)
Test: 26285 H3 cells (560865 points)
Expected ratios: {'train': 0.8, 'validation': 0, 'test': 0.2}
Actual ratios: {'train': 0.8, 'test': 0.2}
Actual ratios difference: {'train': 0.0, 'test': 0.0}
bucket | train_ratio | test_ratio | train_ratio_difference | test_ratio_difference | train_points | test_points |
---|---|---|---|---|---|---|
0 | 0.80000 | 0.20000 | 0.00000 | 0.00000 | 226300 | 56575 |
1 | 0.80000 | 0.20000 | 0.00000 | 0.00000 | 231889 | 57972 |
2 | 0.80000 | 0.20000 | 0.00000 | 0.00000 | 218161 | 54541 |
3 | 0.80000 | 0.20000 | 0.00000 | 0.00000 | 243071 | 60766 |
4 | 0.80000 | 0.20000 | 0.00000 | 0.00000 | 206847 | 51712 |
5 | 0.80000 | 0.20000 | 0.00000 | 0.00000 | 222598 | 55650 |
6 | 0.80000 | 0.20000 | 0.00000 | 0.00000 | 235277 | 58819 |
7 | 0.80000 | 0.20000 | 0.00000 | 0.00000 | 212393 | 53099 |
8 | 0.80000 | 0.20000 | 0.00000 | 0.00000 | 222589 | 55646 |
9 | 0.80001 | 0.19999 | -0.00001 | 0.00001 | 224348 | 56085 |
Created new train_gdf and test_gdf. Train len: 2243473,test len: 560865
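The summary reports train/test ratios per bucket, which suggests that the target was binned into `n_bins` buckets and the split was balanced within each one. The sketch below illustrates only that stratification idea on plain values: it forms quantile buckets by rank and samples `test_size` of each bucket. It is an illustration under those assumptions (the function name `bucket_stratified_split` is hypothetical), and it ignores the H3-based spatial grouping that `train_test_split` performs.

```python
import random


def bucket_stratified_split(values, test_size=0.2, n_bins=10, random_state=42):
    """Return (train_idx, test_idx), splitting each quantile bucket by test_size."""
    rng = random.Random(random_state)
    # Rank indices by target value, then cut the ranking into n_bins
    # nearly equal-sized quantile buckets.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    buckets = [order[b * n // n_bins : (b + 1) * n // n_bins] for b in range(n_bins)]
    train_idx, test_idx = [], []
    for bucket in buckets:
        shuffled = bucket[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_size)
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy target: 1000 synthetic prices (not real data).
gen = random.Random(0)
prices = [gen.uniform(20, 500) for _ in range(1000)]
train_idx, test_idx = bucket_stratified_split(prices)
print(len(train_idx), len(test_idx))
```

Because every bucket contributes the same share to the test set, the train and test targets cover the same price range in similar proportions, which is the property the per-bucket ratios in the summary check.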
In [20]:
type(airbnb_multicity.train_gdf), type(airbnb_multicity.test_gdf)
Out[20]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
In [21]:
airbnb_multicity.resolution
Out[21]:
8
In [22]:
train.head()
Out[22]:
 | id | name | host_id | host_name | neighbourhood | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | city | date | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23726706 | Private room 20 minutes from Amsterdam + Break... | 122619127 | Patricia | IJburg - Zeeburgereiland | Private room | 88.0 | 2 | 78 | 2022-05-29 | 1.53 | 1 | 66 | 11 | amsterdam | 2022-06-05 | POINT (4.97879 52.34916) |
5 | 15801253 | Studio with own bathroom & kitchen at East A'dam | 21813940 | Nan | Watergraafsmeer | Private room | 90.0 | 2 | 46 | 2022-05-29 | 0.89 | 1 | 164 | 7 | amsterdam | 2022-06-05 | POINT (4.96413 52.34507) |
11 | 44174770 | quiet Apartment at a Lake | 114132046 | Sven | Watergraafsmeer | Private room | 59.0 | 30 | 15 | 2022-03-18 | 0.66 | 3 | 245 | 10 | amsterdam | 2022-06-05 | POINT (4.95199 52.33715) |
12 | 50888392 | Gezellig appartement aan de rand van centrum w... | 119648218 | Margareth | Gaasperdam - Driemond | Entire home/apt | 150.0 | 2 | 18 | 2022-04-28 | 1.73 | 1 | 153 | 18 | amsterdam | 2022-06-05 | POINT (5.03833 52.30917) |
14 | 48864190 | Amsterdam_Farmyard | 85831122 | Linda & Arjan | IJburg - Zeeburgereiland | Entire home/apt | 350.0 | 1 | 11 | 2022-04-04 | 0.83 | 3 | 300 | 9 | amsterdam | 2022-06-05 | POINT (4.99961 52.33851) |
In [23]:
test.head()
Out[23]:
 | id | name | host_id | host_name | neighbourhood | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | city | date | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 35815036 | Vrijstaand vakantiehuis, privé tuin aan het water | 269425139 | Lydia | Noord-Oost | Entire home/apt | 105.0 | 3 | 95 | 2022-06-02 | 2.65 | 1 | 243 | 36 | amsterdam | 2022-06-05 | POINT (4.95689 52.42419) |
2 | 31553121 | Quiet Guesthouse near Amsterdam | 76806621 | Ralf | Noord-West | Entire home/apt | 152.0 | 2 | 82 | 2022-05-29 | 2.02 | 1 | 3 | 26 | amsterdam | 2022-06-05 | POINT (4.91821 52.43237) |
3 | 34745823 | Apartment ' Landzicht', nearby Amsterdam | 238083700 | Daisy | Gaasperdam - Driemond | Entire home/apt | 87.0 | 2 | 39 | 2022-04-17 | 1.08 | 3 | 290 | 4 | amsterdam | 2022-06-05 | POINT (5.01231 52.2962) |
4 | 44586947 | Weesp, 2 kamers vlakbij Amsterdam | 360838688 | Aranka | Gaasperdam - Driemond | Private room | 160.0 | 2 | 15 | 2022-05-29 | 0.68 | 1 | 152 | 12 | amsterdam | 2022-06-05 | POINT (5.0303 52.31475) |
6 | 19572024 | Coachhouse, in nature only 5 km from Amsterdam | 81955946 | Amber | Watergraafsmeer | Entire home/apt | 279.0 | 3 | 126 | 2022-05-29 | 2.13 | 2 | 298 | 23 | amsterdam | 2022-06-05 | POINT (4.90833 52.30739) |