Philadelphia Crime Dataset
Crime incidents reported by the Philadelphia Police Department. Part I crimes are the more serious offenses, such as aggravated assault, rape, and arson. Part II crimes include simple assault, prostitution, gambling, fraud, and other less serious offenses. Each record provides the date, time, location, and type of crime, which allows both spatial and temporal analysis. The benchmark uses a subset of crime reports from 2023.
In [1]:
# plotting imports
import contextily as cx
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.patches import Patch
# dataset import
from srai.datasets import PhiladelphiaCrimeDataset
In [2]:
philadelphia_crime = PhiladelphiaCrimeDataset()
In [3]:
type(philadelphia_crime.train_gdf), type(philadelphia_crime.test_gdf)
Out[3]:
(NoneType, NoneType)
Load the data with the .load() method. The default version is 'res_8'.
In [4]:
ds = philadelphia_crime.load()
ds.keys()
Downloading crime data for 2023...
Loading cached Parquet file for 2023...
Splitting into train-test subsets ...
Out[4]:
dict_keys(['train', 'test'])
In [5]:
type(philadelphia_crime.train_gdf), type(philadelphia_crime.test_gdf)
Out[5]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
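The cells above show that train_gdf and test_gdf stay None until .load() is called. A minimal sketch of that lazy-loading pattern follows. It uses a hypothetical LazyDataset class and plain lists in place of GeoDataFrames, so it illustrates the observed behaviour only and is not srai's actual implementation.

```python
from typing import Optional


class LazyDataset:
    """Hypothetical stand-in that mimics the attribute behaviour shown above."""

    def __init__(self) -> None:
        # Nothing is downloaded or parsed at construction time.
        self.train_gdf: Optional[list] = None
        self.test_gdf: Optional[list] = None

    def load(self) -> dict:
        # A real loader would download and parse files; here we fake one row each.
        self.train_gdf = [{"objectid": 116}]
        self.test_gdf = [{"objectid": 119}]
        return {"train": self.train_gdf, "test": self.test_gdf}


dataset = LazyDataset()
before = (dataset.train_gdf, dataset.test_gdf)  # both still None
splits = dataset.load()
after = (type(dataset.train_gdf).__name__, type(dataset.test_gdf).__name__)
print(before, sorted(splits), after)
```

Code that reads the attributes directly should therefore call .load() first or check for None.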
In [6]:
resolution = philadelphia_crime.resolution
resolution
Out[6]:
8
In [7]:
gdf_train, gdf_test = ds["train"], ds["test"]
In [8]:
gdf_train.head()
Out[8]:
(index) | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
411 | 116 | 01 | 1 | 2023-03-11 18:31:00 | 2023-03-11 | 13:31:00 | 13.0 | 2.023010e+11 | 2400 BLOCK S 28TH ST | 600 | Theft from Vehicle | -75.193618 | 39.922350 | POINT (-75.19362 39.92235) |
412 | 119 | 08 | 2 | 2023-03-11 22:13:00 | 2023-03-11 | 17:13:00 | 17.0 | 2.023080e+11 | 9800 BLOCK Roosevelt Blvd | 600 | Thefts | -75.015070 | 40.094525 | POINT (-75.01507 40.09452) |
413 | 96 | 15 | 1 | 2023-03-11 12:42:00 | 2023-03-11 | 07:42:00 | 7.0 | 2.023150e+11 | 4700 BLOCK GRISCOM ST | 600 | Thefts | -75.083953 | 40.017896 | POINT (-75.08395 40.0179) |
414 | 99 | 14 | 1 | 2023-03-12 00:54:00 | 2023-03-11 | 19:54:00 | 19.0 | 2.023140e+11 | 5500 BLOCK BLOYD ST | 300 | Robbery No Firearm | -75.161898 | 40.044952 | POINT (-75.1619 40.04495) |
416 | 102 | 25 | 3 | 2023-03-11 07:03:00 | 2023-03-11 | 02:03:00 | 2.0 | 2.023250e+11 | 200 BLOCK W ONTARIO ST | 400 | Aggravated Assault Firearm | -75.133172 | 40.002221 | POINT (-75.13317 40.00222) |
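In the sample rows above, dispatch_date_time is consistently five hours ahead of dispatch_time. That is what you would expect if dispatch_date_time were stored in UTC and dispatch_time in local Philadelphia time. This is an inference from the sample, not something the dataset documents here. The sketch below checks the idea for two of the rows with a fixed UTC-5 offset, which is an assumption that only holds for dates outside daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: dispatch_date_time is UTC, dispatch_time is local Philadelphia time.
# A fixed UTC-5 offset is used because both sample timestamps fall outside DST.
EST = timezone(timedelta(hours=-5), name="EST")

# dispatch_date_time values copied from the rows with index 411 and 414.
utc_stamps = [
    datetime(2023, 3, 11, 18, 31, tzinfo=timezone.utc),
    datetime(2023, 3, 12, 0, 54, tzinfo=timezone.utc),
]

local_stamps = [stamp.astimezone(EST) for stamp in utc_stamps]
for stamp in local_stamps:
    print(stamp.strftime("%Y-%m-%d %H:%M:%S"), stamp.hour)
```

The converted values can be compared with the dispatch_date, dispatch_time and hour columns of the same rows. For dates across the whole year, a named zone such as zoneinfo.ZoneInfo("America/New_York") would be the safer choice.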
Get the H3 cells with target values
In [9]:
train_h3, _, test_h3 = philadelphia_crime.get_h3_with_labels()
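get_h3_with_labels() returns cell-level frames that carry the target column (named 'count' later in this notebook). The source does not spell out how that target is computed or scaled, so the sketch below only illustrates the general idea of aggregating point incidents into cells. It uses a simple square longitude/latitude grid as a stand-in for H3, whose cells are hexagons and need the h3 library. The cell_id helper and the CELL_SIZE_DEG value are hypothetical.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 0.01  # hypothetical grid size; this is a square grid, not H3


def cell_id(lon: float, lat: float) -> tuple[int, int]:
    """Map a point to the integer index of the square grid cell containing it."""
    return (math.floor(lon / CELL_SIZE_DEG), math.floor(lat / CELL_SIZE_DEG))


# Coordinates taken from the sample rows above, plus one hypothetical neighbour.
points = [
    (-75.193618, 39.922350),
    (-75.015070, 40.094525),
    (-75.083953, 40.017896),
    (-75.193700, 39.922400),  # hypothetical point close to the first one
]

# Count how many incidents fall into each cell.
counts = Counter(cell_id(lon, lat) for lon, lat in points)
for cell, count in sorted(counts.items()):
    print(cell, count)
```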
In [10]:
fig, axes = plt.subplots(
    2, 1, sharex=False, sharey=False, figsize=(12, 16), height_ratios=[4, 1]
)
train_h3.plot(
    color="orange",
    markersize=0.1,
    ax=axes[0],
    label="train",
    alpha=np.minimum(np.power(train_h3[philadelphia_crime.target] + 0.4, 2), 1),
)
test_h3.plot(
    color="royalblue",
    markersize=0.1,
    ax=axes[0],
    label="test",
    alpha=np.minimum(np.power(test_h3[philadelphia_crime.target] + 0.4, 2), 1),
)
cx.add_basemap(axes[0], source=cx.providers.CartoDB.PositronNoLabels, crs=4326, zoom=12)
axes[0].set_title("Philadelphia crime data aggregated to H3 cells")
axes[0].legend(
    handles=[Patch(facecolor="orange"), Patch(facecolor="royalblue")],
    labels=["Train", "Test"],
)
axes[0].set_axis_off()
sns.kdeplot(
    x=train_h3[philadelphia_crime.target],
    label="train",
    color="orange",
    ax=axes[1],
    fill=False,
    cut=0,
)
sns.kdeplot(
    x=test_h3[philadelphia_crime.target],
    label="test",
    color="royalblue",
    ax=axes[1],
    fill=False,
    cut=0,
)
axes[1].set_title("Philadelphia crime data - target distribution")
axes[1].legend()
fig.tight_layout()
plt.show()
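The alpha argument in the plotting cell maps each target value t to min((t + 0.4)^2, 1). Assuming non-negative targets, every cell keeps an opacity of at least 0.16, and any cell with t of 0.6 or more is drawn fully opaque. The sketch below is a pure-Python scalar version of the same NumPy expression, evaluated on a few hypothetical target values.

```python
def alpha_for(target: float) -> float:
    """Scalar equivalent of np.minimum(np.power(target + 0.4, 2), 1)."""
    return min((target + 0.4) ** 2, 1.0)


# Hypothetical target values, chosen to show the floor and the cap.
for value in (0.0, 0.1, 0.3, 0.6, 2.0):
    print(value, round(alpha_for(value), 2))
```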
Load the data for the year 2013.
In [11]:
ds = philadelphia_crime.load(version="2013")
ds.keys()
Downloading crime data for 2013...
Loading cached Parquet file for 2013...
Out[11]:
dict_keys(['train'])
In [12]:
type(philadelphia_crime.train_gdf), type(philadelphia_crime.test_gdf)
Out[12]:
(geopandas.geodataframe.GeoDataFrame, NoneType)
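As the outputs above show, the "2013" version returns only a 'train' key and leaves test_gdf as None, whereas the default version returns both splits. Downstream code may want to guard against the missing split. The sketch below shows one way to do that, using plain dicts as stand-ins for the return value of .load(). The get_split helper is hypothetical.

```python
from typing import Optional


def get_split(ds: dict, name: str) -> Optional[object]:
    """Return the requested split, or None when that version does not ship it."""
    return ds.get(name)


# Plain dicts standing in for the return values of .load() shown above.
ds_default = {"train": "train frame", "test": "test frame"}
ds_2013 = {"train": "train frame"}

statuses = {}
for label, ds in (("res_8", ds_default), ("2013", ds_2013)):
    test_split = get_split(ds, "test")
    statuses[label] = (
        "has a test split" if test_split is not None else "needs train_test_split()"
    )
    print(label, statuses[label])
```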
In [13]:
ds["train"].head()
Out[13]:
(index) | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32497922 | 12 | 2 | 2013-04-24 04:00:00 | 2013-04-24 | 00:00:00 | 8 | 2.013120e+11 | 5900 BLOCK SPRINGFIELD AVE | 100 | Homicide - Criminal | -75.231532 | 39.935160 | POINT (-75.23153 39.93516) |
1 | 32497962 | 25 | 1 | 2013-04-06 04:00:00 | 2013-04-06 | 00:00:00 | 23 | 2.013250e+11 | 700 BLOCK W VENANGO STREET | 100 | Homicide - Criminal | -75.140959 | 40.006263 | POINT (-75.14096 40.00626) |
2 | 32498006 | 35 | 3 | 2013-04-10 04:00:00 | 2013-04-10 | 00:00:00 | 10 | 2.013350e+11 | 1600 BLOCK CHELTEN AVENUE | 100 | Homicide - Criminal | -75.146544 | 40.051341 | POINT (-75.14654 40.05134) |
3 | 32498022 | 24 | 2 | 2013-11-18 05:00:00 | 2013-11-18 | 00:00:00 | 13 | 2.013241e+11 | 1800 BLOCK E SOMERSET ST | 100 | Homicide - Criminal | -75.122542 | 39.991391 | POINT (-75.12254 39.99139) |
4 | 32498035 | 02 | 1 | 2013-05-06 04:00:00 | 2013-05-06 | 00:00:00 | 2 | 2.013020e+11 | 6700 BLOCK CASTOR STREET | 100 | Homicide - Criminal | -75.073213 | 40.044279 | POINT (-75.07321 40.04428) |
Create your own train-test split. This example uses bucket regression; spatial regression works similarly.
In [14]:
philadelphia_crime.target
Out[14]:
'count'
In [15]:
train, test = philadelphia_crime.train_test_split(
    test_size=0.2, resolution=8, n_bins=10, random_state=42
)
Summary of the split:
Train: 403 H3 cells (141029 points)
Test: 117 H3 cells (35685 points)
Expected ratios: {'train': 0.8, 'validation': 0, 'test': 0.2}
Actual ratios: {'train': 0.798, 'test': 0.202}
Actual ratios difference: {'train': 0.002, 'test': -0.002}
(index) | bucket | train_ratio | test_ratio | train_ratio_difference | test_ratio_difference | train_points | test_points |
---|---|---|---|---|---|---|---|
0 | 0 | 0.81421 | 0.18579 | -0.01421 | 0.01421 | 149 | 34 |
1 | 1 | 0.80993 | 0.19007 | -0.00993 | 0.00993 | 669 | 157 |
2 | 2 | 0.80693 | 0.19307 | -0.00693 | 0.00693 | 1793 | 429 |
3 | 3 | 0.79468 | 0.20532 | 0.00532 | -0.00532 | 3259 | 842 |
4 | 4 | 0.79568 | 0.20432 | 0.00432 | -0.00432 | 5156 | 1324 |
5 | 5 | 0.79242 | 0.20758 | 0.00758 | -0.00758 | 8532 | 2235 |
6 | 6 | 0.79190 | 0.20810 | 0.00810 | -0.00810 | 13380 | 3516 |
7 | 7 | 0.80483 | 0.19517 | -0.00483 | 0.00483 | 21876 | 5305 |
8 | 8 | 0.79277 | 0.20723 | 0.00723 | -0.00723 | 30715 | 8029 |
9 | 9 | 0.80070 | 0.19930 | -0.00070 | 0.00070 | 55500 | 13814 |
Created new train_gdf and test_gdf. Train len: 141029,test len: 35685
In [16]:
type(philadelphia_crime.train_gdf), type(philadelphia_crime.test_gdf)
Out[16]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
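The ratios printed in the split summary can be re-derived from the point counts it reports. The sketch below recomputes the overall train ratio and the figures for bucket 0. The difference columns appear to be "expected minus actual"; that reading is inferred from the printed numbers rather than from srai's documentation.

```python
EXPECTED_TRAIN = 0.8  # 1 - test_size from the train_test_split() call

# (train_points, test_points) copied from the split summary.
overall = (141029, 35685)
bucket_0 = (149, 34)


def train_ratio(train_points: int, test_points: int) -> float:
    """Share of points that ended up in the train subset."""
    return train_points / (train_points + test_points)


overall_ratio = round(train_ratio(*overall), 3)
bucket_ratio = round(train_ratio(*bucket_0), 5)
# Assumed sign convention: expected ratio minus actual ratio.
bucket_difference = round(EXPECTED_TRAIN - bucket_ratio, 5)

print(overall_ratio, bucket_ratio, bucket_difference)
```

Note that the ratios are computed over points, not over H3 cells: 403 of 520 cells are in the train subset, which is a cell share of 0.775.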
In [17]:
philadelphia_crime.resolution
Out[17]:
8
In [18]:
train.head()
Out[18]:
(index) | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 32497962 | 25 | 1 | 2013-04-06 04:00:00 | 2013-04-06 | 00:00:00 | 23 | 2.013250e+11 | 700 BLOCK W VENANGO STREET | 100 | Homicide - Criminal | -75.140959 | 40.006263 | POINT (-75.14096 40.00626) |
2 | 32498006 | 35 | 3 | 2013-04-10 04:00:00 | 2013-04-10 | 00:00:00 | 10 | 2.013350e+11 | 1600 BLOCK CHELTEN AVENUE | 100 | Homicide - Criminal | -75.146544 | 40.051341 | POINT (-75.14654 40.05134) |
3 | 32498022 | 24 | 2 | 2013-11-18 05:00:00 | 2013-11-18 | 00:00:00 | 13 | 2.013241e+11 | 1800 BLOCK E SOMERSET ST | 100 | Homicide - Criminal | -75.122542 | 39.991391 | POINT (-75.12254 39.99139) |
4 | 32498035 | 02 | 1 | 2013-05-06 04:00:00 | 2013-05-06 | 00:00:00 | 2 | 2.013020e+11 | 6700 BLOCK CASTOR STREET | 100 | Homicide - Criminal | -75.073213 | 40.044279 | POINT (-75.07321 40.04428) |
5 | 32498059 | 22 | 1 | 2013-08-25 04:00:00 | 2013-08-25 | 00:00:00 | 3 | 2.013221e+11 | 2200 BLOCK W HUNTINGDON ST | 100 | Homicide - Criminal | -75.167954 | 39.994024 | POINT (-75.16795 39.99402) |
In [19]:
test.head()
Out[19]:
(index) | objectid | dc_dist | psa | dispatch_date_time | dispatch_date | dispatch_time | hour | dc_key | location_block | ucr_general | text_general_code | point_x | point_y | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32497922 | 12 | 2 | 2013-04-24 04:00:00 | 2013-04-24 | 00:00:00 | 8 | 2.013120e+11 | 5900 BLOCK SPRINGFIELD AVE | 100 | Homicide - Criminal | -75.231532 | 39.935160 | POINT (-75.23153 39.93516) |
23 | 32498332 | 35 | 2 | 2013-12-08 05:00:00 | 2013-12-08 | 00:00:00 | 3 | 2.013351e+11 | 4700 BLOCK N MARSHALL ST | 100 | Homicide - Criminal | -75.136443 | 40.023347 | POINT (-75.13644 40.02335) |
29 | 32498403 | 35 | 2 | 2013-11-11 05:00:00 | 2013-11-11 | 00:00:00 | 2 | 2.013351e+11 | 5200 BLOCK N 5TH ST | 100 | Homicide - Criminal | -75.131439 | 40.031762 | POINT (-75.13144 40.03176) |
34 | 32498469 | 16 | 1 | 2013-11-30 05:00:00 | 2013-11-30 | 00:00:00 | 2 | 2.013160e+11 | 4000 BLOCK W GIRARD AV | 100 | Homicide - Criminal | -75.204550 | 39.974008 | POINT (-75.20455 39.97401) |
42 | 32498693 | 25 | 3 | 2013-03-27 04:00:00 | 2013-03-27 | 00:00:00 | 22 | 2.013250e+11 | 3400 BLOCK WATER STREET | 100 | Homicide - Criminal | -75.127109 | 40.001577 | POINT (-75.12711 40.00158) |
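The train_test_split() call earlier in this notebook takes n_bins and reports per-bucket ratios, which suggests a split stratified by target bucket. The sketch below shows the general idea with the standard library only: cells are ranked by count, cut into equally sized buckets, and a test_size share of each bucket goes to the test set. The bucket_split function and the toy data are hypothetical. This is not srai's algorithm, which (judging by the summary) balances point counts rather than cell counts.

```python
import random


def bucket_split(cell_counts: dict, test_size: float, n_bins: int, seed: int):
    """Split cell ids into train/test so every count bucket contributes to both."""
    rng = random.Random(seed)
    ordered = sorted(cell_counts, key=cell_counts.get)  # cells by ascending count
    train, test = [], []
    for b in range(n_bins):
        # Rank-based bucket: an equal share of the ordered cells.
        start = b * len(ordered) // n_bins
        stop = (b + 1) * len(ordered) // n_bins
        bucket = ordered[start:stop]
        rng.shuffle(bucket)
        n_test = round(test_size * len(bucket))
        test.extend(bucket[:n_test])
        train.extend(bucket[n_test:])
    return train, test


# Hypothetical cells: ids 0..99 with counts 1..100.
cells = {cell: cell + 1 for cell in range(100)}
train_cells, test_cells = bucket_split(cells, test_size=0.2, n_bins=10, seed=42)
print(len(train_cells), len(test_cells))
```

Because each bucket is sampled separately, low-count and high-count cells are both represented in the test set, which a plain random split does not guarantee.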