Philadelphia Crime Dataset
Crime incidents reported by the Philadelphia Police Department. Part I crimes are the more serious offenses, such as aggravated assault, rape, and arson. Part II crimes include simple assault, prostitution, gambling, fraud, and other less serious offenses. Each record provides the date, time, location, and type of crime, allowing for both spatial and temporal analysis. The benchmark uses a subset of crime reports from 2023.
In [1]:
# plotting imports
import contextily as cx
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.patches import Patch
# dataset import
from srai.datasets import PhiladelphiaCrimeDataset
In [2]:
philadelphia_crime = PhiladelphiaCrimeDataset()
In [3]:
type(philadelphia_crime.train_gdf), type(philadelphia_crime.test_gdf)
Out[3]:
(NoneType, NoneType)
Get the data with the .load() method. The default version is 'res_8'.
In [4]:
ds = philadelphia_crime.load()
ds.keys()
Downloading crime data for 2023...
Loading cached Parquet file for 2023... Splitting into train-test subsets ...
Out[4]:
dict_keys(['train', 'test'])
In [5]:
type(philadelphia_crime.train_gdf), type(philadelphia_crime.test_gdf)
Out[5]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
In [6]:
resolution = philadelphia_crime.resolution
resolution
Out[6]:
8
In [7]:
gdf_train, gdf_test = ds["train"], ds["test"]
In [8]:
gdf_train.head()
Out[8]:
| | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 411 | 116 | 01 | 1 | 2023-03-11 18:31:00 | 2023-03-11 | 13:31:00 | 13.0 | 2.023010e+11 | 2400 BLOCK S 28TH ST | 600 | Theft from Vehicle | -75.193618 | 39.922350 | POINT (-75.19362 39.92235) |
| 412 | 119 | 08 | 2 | 2023-03-11 22:13:00 | 2023-03-11 | 17:13:00 | 17.0 | 2.023080e+11 | 9800 BLOCK Roosevelt Blvd | 600 | Thefts | -75.015070 | 40.094525 | POINT (-75.01507 40.09452) |
| 413 | 96 | 15 | 1 | 2023-03-11 12:42:00 | 2023-03-11 | 07:42:00 | 7.0 | 2.023150e+11 | 4700 BLOCK GRISCOM ST | 600 | Thefts | -75.083953 | 40.017896 | POINT (-75.08395 40.0179) |
| 414 | 99 | 14 | 1 | 2023-03-12 00:54:00 | 2023-03-11 | 19:54:00 | 19.0 | 2.023140e+11 | 5500 BLOCK BLOYD ST | 300 | Robbery No Firearm | -75.161898 | 40.044952 | POINT (-75.1619 40.04495) |
| 416 | 102 | 25 | 3 | 2023-03-11 07:03:00 | 2023-03-11 | 02:03:00 | 2.0 | 2.023250e+11 | 200 BLOCK W ONTARIO ST | 400 | Aggravated Assault Firearm | -75.133172 | 40.002221 | POINT (-75.13317 40.00222) |
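In the rows above, dispatch_date_time runs a fixed number of hours ahead of dispatch_time (18:31 versus 13:31 in the first row). This suggests that the former is stored in UTC and the latter in local Philadelphia time. That reading is an inference from the printed values, not something this page confirms. A minimal sketch of the conversion, assuming a fixed UTC-5 offset (Eastern Standard Time), which fits the March 11 rows shown but not dates under daylight saving time; the helper to_local is hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Assumption: dispatch_date_time is UTC and local time on 2023-03-11 is UTC-5 (EST).
EST = timezone(timedelta(hours=-5), name="EST")


def to_local(dispatch_date_time: str, tz: timezone = EST) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as UTC and convert it to tz."""
    utc_dt = datetime.strptime(dispatch_date_time, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    return utc_dt.astimezone(tz)


local = to_local("2023-03-11 18:31:00")
print(local.date(), local.time(), local.hour)  # 2023-03-11 13:31:00 13
```

The same offset also explains the row whose dispatch_date_time falls on 2023-03-12 while its dispatch_date is 2023-03-11.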
Get the H3 cells with their target values
In [9]:
train_h3, _, test_h3 = philadelphia_crime.get_h3_with_labels()
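The target of this dataset is named 'count', which suggests that the labels attached to each H3 cell are derived from the number of incidents falling inside it. The aggregation idea can be sketched without srai or the h3 library. The hypothetical cell_id function below bins points into a plain longitude/latitude grid as a stand-in for H3 indexing, and a Counter tallies the incidents per cell. This illustrates the aggregation step only; it is not srai's implementation.

```python
import math
from collections import Counter

# Hypothetical stand-in for H3 indexing: a square grid measured in degrees.
CELL_SIZE_DEG = 0.01


def cell_id(lon: float, lat: float, size: float = CELL_SIZE_DEG) -> tuple[int, int]:
    """Return the (column, row) index of the grid cell containing the point."""
    return (math.floor(lon / size), math.floor(lat / size))


# (point_x, point_y) pairs taken from the gdf_train.head() output,
# plus one made-up point near the first one so that a cell reaches a count of 2.
points = [
    (-75.193618, 39.922350),
    (-75.015070, 40.094525),
    (-75.083953, 40.017896),
    (-75.193000, 39.922900),  # made-up neighbour of the first point
]

counts = Counter(cell_id(lon, lat) for lon, lat in points)
for cell, count in sorted(counts.items()):
    print(cell, count)
```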
In [10]:
fig, axes = plt.subplots(
    2, 1, sharex=False, sharey=False, figsize=(12, 16), height_ratios=[4, 1]
)
train_h3.plot(
    color="orange",
    markersize=0.1,
    ax=axes[0],
    label="train",
    alpha=np.minimum(np.power(train_h3[philadelphia_crime.target] + 0.4, 2), 1),
)
test_h3.plot(
    color="royalblue",
    markersize=0.1,
    ax=axes[0],
    label="test",
    alpha=np.minimum(np.power(test_h3[philadelphia_crime.target] + 0.4, 2), 1),
)
cx.add_basemap(axes[0], source=cx.providers.CartoDB.PositronNoLabels, crs=4326, zoom=12)
axes[0].set_title("Philadelphia crime data aggregated to H3 cells")
axes[0].legend(
    handles=[Patch(facecolor="orange"), Patch(facecolor="royalblue")],
    labels=["Train", "Test"],
)
axes[0].set_axis_off()
sns.kdeplot(
    x=train_h3[philadelphia_crime.target],
    label="train",
    color="orange",
    ax=axes[1],
    fill=False,
    cut=0,
)
sns.kdeplot(
    x=test_h3[philadelphia_crime.target],
    label="test",
    color="royalblue",
    ax=axes[1],
    fill=False,
    cut=0,
)
axes[1].set_title("Philadelphia crime data - target distribution")
axes[1].legend()
fig.tight_layout()
plt.show()
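The alpha expression in the plotting calls maps each target value t to min((t + 0.4)^2, 1). Cells with low targets are therefore drawn faintly, and the opacity saturates at 1 once t reaches 0.6. The cap also keeps every value inside the 0 to 1 range that matplotlib accepts for alpha. A small sketch in plain Python (the notebook applies np.minimum and np.power to the whole column instead); the helper cell_alpha is hypothetical and the sample targets are made up for illustration:

```python
def cell_alpha(target: float) -> float:
    """Opacity for one cell: (target + 0.4) squared, capped at 1."""
    return min((target + 0.4) ** 2, 1.0)


# Made-up target values, for illustration only.
for t in (0.0, 0.1, 0.3, 0.6, 2.0):
    print(f"target={t:.1f} -> alpha={cell_alpha(t):.2f}")
```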
Get the data for 2013.
In [11]:
ds = philadelphia_crime.load(version="2013")
ds.keys()
Downloading crime data for 2013...
Loading cached Parquet file for 2013...
Out[11]:
dict_keys(['train'])
In [12]:
type(philadelphia_crime.train_gdf), type(philadelphia_crime.test_gdf)
Out[12]:
(geopandas.geodataframe.GeoDataFrame, NoneType)
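As the output above shows, the 2013 version provides only a 'train' key and leaves test_gdf as None, so code written for the default version that reads ds["test"] would raise a KeyError here. One defensive pattern is sketched below, with plain dictionaries standing in for the result of .load(); the helper name unpack_splits is hypothetical.

```python
from typing import Any, Optional


def unpack_splits(ds: dict[str, Any]) -> tuple[Any, Optional[Any]]:
    """Return (train, test); test is None when no test split is provided."""
    return ds["train"], ds.get("test")


# Plain dicts standing in for the dictionaries returned by .load().
default_like = {"train": "train rows", "test": "test rows"}
year_like = {"train": "train rows"}

print(unpack_splits(default_like))
print(unpack_splits(year_like))
```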
In [13]:
ds["train"].head()
Out[13]:
| | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32951063 | 12 | 2 | 2013-04-24 04:00:00 | 2013-04-24 | 00:00:00 | 8 | 2.013120e+11 | 5900 BLOCK SPRINGFIELD AVE | 100 | Homicide - Criminal | -75.231532 | 39.935160 | POINT (-75.23153 39.93516) |
| 2 | 32951104 | 25 | 1 | 2013-04-06 04:00:00 | 2013-04-06 | 00:00:00 | 23 | 2.013250e+11 | 700 BLOCK W VENANGO STREET | 100 | Homicide - Criminal | -75.140959 | 40.006263 | POINT (-75.14096 40.00626) |
| 12 | 32951276 | 24 | 1 | 2013-12-01 05:00:00 | 2013-12-01 | 00:00:00 | 22 | 2.013241e+11 | 900 BLOCK E RUSSELL ST | 100 | Homicide - Criminal | -75.112436 | 40.000069 | POINT (-75.11244 40.00007) |
| 13 | 32951308 | 19 | 1 | 2013-01-03 05:00:00 | 2013-01-03 | 00:00:00 | 23 | 2.013190e+11 | 6000 BLOCK HADDINGTON LANE | 100 | Homicide - Criminal | -75.240554 | 39.977365 | POINT (-75.24055 39.97736) |
| 14 | 32951338 | 24 | 2 | 2013-11-17 05:00:00 | 2013-11-17 | 00:00:00 | 17 | 2.013241e+11 | 2900 BLOCK N KIP ST | 100 | Homicide - Criminal | -75.127046 | 39.993583 | POINT (-75.12705 39.99358) |
Create your own train-test split for bucket regression (spatial regression works similarly)
In [14]:
philadelphia_crime.target
Out[14]:
'count'
In [15]:
train, test = philadelphia_crime.train_test_split(
    test_size=0.2, resolution=8, n_bins=10, random_state=42
)
Summary of the split:
Train: 403 H3 cells (141029 points)
Test: 117 H3 cells (35685 points)
Expected ratios: {'train': 0.8, 'validation': 0, 'test': 0.2}
Actual ratios: {'train': 0.798, 'test': 0.202}
Actual ratios difference: {'train': 0.002, 'test': -0.002}
| | bucket | train_ratio | test_ratio | train_ratio_difference | test_ratio_difference | train_points | test_points |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.81421 | 0.18579 | -0.01421 | 0.01421 | 149 | 34 |
| 1 | 1 | 0.80993 | 0.19007 | -0.00993 | 0.00993 | 669 | 157 |
| 2 | 2 | 0.80693 | 0.19307 | -0.00693 | 0.00693 | 1793 | 429 |
| 3 | 3 | 0.79468 | 0.20532 | 0.00532 | -0.00532 | 3259 | 842 |
| 4 | 4 | 0.79568 | 0.20432 | 0.00432 | -0.00432 | 5156 | 1324 |
| 5 | 5 | 0.79242 | 0.20758 | 0.00758 | -0.00758 | 8532 | 2235 |
| 6 | 6 | 0.79190 | 0.20810 | 0.00810 | -0.00810 | 13380 | 3516 |
| 7 | 7 | 0.80483 | 0.19517 | -0.00483 | 0.00483 | 21876 | 5305 |
| 8 | 8 | 0.79277 | 0.20723 | 0.00723 | -0.00723 | 30715 | 8029 |
| 9 | 9 | 0.80070 | 0.19930 | -0.00070 | 0.00070 | 55500 | 13814 |
Created new train_gdf and test_gdf. Train len: 141029,test len: 35685
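The summary above reports the split per bucket, and the point totals grow from bucket 0 to bucket 9. This suggests that cells are grouped into n_bins buckets by their target value and that each bucket gives roughly test_size of its data to the test set, so that both subsets cover the whole range of crime counts. A minimal sketch of that idea under stated assumptions: buckets are equal-sized quantile groups of cells sorted by count, and the split is made over cells rather than points. The function bucket_split and the synthetic counts are hypothetical, and srai's actual implementation may differ (for example, it reports its ratios over points).

```python
import random


def bucket_split(cell_counts, test_size=0.2, n_bins=10, random_state=42):
    """Split cell ids into train/test so that every count bucket is represented.

    cell_counts maps a cell id to its incident count.
    """
    rng = random.Random(random_state)
    # Sort cells by count and cut the sorted list into n_bins quantile buckets.
    ordered = sorted(cell_counts, key=cell_counts.get)
    train, test = [], []
    for b in range(n_bins):
        start = b * len(ordered) // n_bins
        end = (b + 1) * len(ordered) // n_bins
        bucket = ordered[start:end]
        rng.shuffle(bucket)
        # Send about test_size of each bucket's cells to the test set.
        n_test = round(len(bucket) * test_size)
        test.extend(bucket[:n_test])
        train.extend(bucket[n_test:])
    return train, test


# Synthetic, skewed counts for 520 hypothetical cells.
gen = random.Random(0)
cell_counts = {f"cell_{i}": int(gen.paretovariate(1.2) * 10) for i in range(520)}

train_cells, test_cells = bucket_split(cell_counts)
print(len(train_cells), len(test_cells))  # 420 100
```

Splitting whole cells keeps every point of a cell in the same subset, which is why a split of this kind can hit the requested ratio only approximately.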
In [16]:
type(philadelphia_crime.train_gdf), type(philadelphia_crime.test_gdf)
Out[16]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
In [17]:
philadelphia_crime.resolution
Out[17]:
8
In [18]:
train.head()
Out[18]:
| | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 32951104 | 25 | 1 | 2013-04-06 04:00:00 | 2013-04-06 | 00:00:00 | 23 | 2.013250e+11 | 700 BLOCK W VENANGO STREET | 100 | Homicide - Criminal | -75.140959 | 40.006263 | POINT (-75.14096 40.00626) |
| 12 | 32951276 | 24 | 1 | 2013-12-01 05:00:00 | 2013-12-01 | 00:00:00 | 22 | 2.013241e+11 | 900 BLOCK E RUSSELL ST | 100 | Homicide - Criminal | -75.112436 | 40.000069 | POINT (-75.11244 40.00007) |
| 13 | 32951308 | 19 | 1 | 2013-01-03 05:00:00 | 2013-01-03 | 00:00:00 | 23 | 2.013190e+11 | 6000 BLOCK HADDINGTON LANE | 100 | Homicide - Criminal | -75.240554 | 39.977365 | POINT (-75.24055 39.97736) |
| 14 | 32951338 | 24 | 2 | 2013-11-17 05:00:00 | 2013-11-17 | 00:00:00 | 17 | 2.013241e+11 | 2900 BLOCK N KIP ST | 100 | Homicide - Criminal | -75.127046 | 39.993583 | POINT (-75.12705 39.99358) |
| 15 | 32951344 | 39 | 3 | 2013-07-14 04:00:00 | 2013-07-14 | 00:00:00 | 1 | 2.013390e+11 | 2700 BLOCK N 27TH ST | 100 | Homicide - Criminal | -75.175501 | 39.996899 | POINT (-75.1755 39.9969) |
In [19]:
test.head()
Out[19]:
| | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32951063 | 12 | 2 | 2013-04-24 04:00:00 | 2013-04-24 | 00:00:00 | 8 | 2.013120e+11 | 5900 BLOCK SPRINGFIELD AVE | 100 | Homicide - Criminal | -75.231532 | 39.935160 | POINT (-75.23153 39.93516) |
| 18 | 32951355 | 17 | 2 | 2013-07-14 04:00:00 | 2013-07-14 | 00:00:00 | 21 | 2.013170e+11 | 2700 BLOCK REED ST | 100 | Homicide - Criminal | -75.189423 | 39.935696 | POINT (-75.18942 39.9357) |
| 28 | 32951596 | 24 | 1 | 2013-07-26 04:00:00 | 2013-07-26 | 00:00:00 | 1 | 2.013241e+11 | 2000 BLOCK PICKWICK ST | 100 | Homicide - Criminal | -75.099995 | 39.998954 | POINT (-75.09999 39.99895) |
| 36 | 32951761 | 16 | 1 | 2013-12-22 05:00:00 | 2013-12-22 | 00:00:00 | 21 | 2.013160e+11 | 3800 BLOCK RENO ST | 100 | Homicide - Criminal | -75.199156 | 39.968398 | POINT (-75.19916 39.9684) |
| 38 | 32951708 | 12 | 3 | 2013-10-16 04:00:00 | 2013-10-16 | 00:00:00 | 11 | 2.013121e+11 | 2400 BLOCK S 63RD ST | 100 | Homicide - Criminal | -75.228979 | 39.925023 | POINT (-75.22898 39.92502) |