Police Department Incidents Dataset
The San Francisco Police Department’s (SFPD) Incident Report Dataset is one of the most frequently used datasets on DataSF. It encompasses over 600,000 reported crime incidents filed between January 2018 and March 2024, either by officers or self-reported by the public through SFPD’s online reporting system. It provides detailed temporal and categorical information on crimes, making it comparable to the Chicago and Philadelphia datasets.
In [1]:
# plotting imports
import contextily as cx
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.patches import Patch
# dataset import
from srai.datasets import PoliceDepartmentIncidentsDataset
In [2]:
police_department_incidents = PoliceDepartmentIncidentsDataset()
In [3]:
type(police_department_incidents.train_gdf), type(police_department_incidents.test_gdf)
Out[3]:
(NoneType, NoneType)
Default config
In [4]:
ds = police_department_incidents.load(version=8)
ds.keys()
Out[4]:
dict_keys(['train', 'test'])
In [5]:
type(police_department_incidents.train_gdf), type(police_department_incidents.test_gdf)
Out[5]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
In [6]:
print("Aggregation H3 resolution:", police_department_incidents.resolution)
Aggregation H3 resolution: 8
In [7]:
print("Prediction target:", police_department_incidents.target)
Prediction target: count
In [8]:
gdf_train, gdf_test = ds["train"], ds["test"]
In [9]:
gdf_train.head()
Out[9]:
| | Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Incident Category | Incident Subcategory | Incident Description | Incident Code | Report Datetime | ... | Intersection | CNN | Police District | Analysis Neighborhood | Supervisor District | Supervisor District 2012 | Neighborhoods | Current Supervisor Districts | Current Police Districts | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023/03/11 02:00:00 PM | 2023/03/11 | 14:00 | 2023 | Saturday | Assault | Simple Assault | Battery | 4134 | 2023/03/15 11:21:00 AM | ... | STANYAN ST \ HAYES ST | 26446000.0 | Park | Golden Gate Park | 1.0 | 1.0 | NaN | 4.0 | 7.0 | POINT (-122.45429 37.7729) |
| 1 | 2022/06/27 12:00:00 PM | 2022/06/27 | 12:00 | 2022 | Monday | Lost Property | Lost Property | Lost Property | 71000 | 2023/03/15 05:20:00 PM | ... | GEARY ST \ POWELL ST | 24903000.0 | Central | Financial District/South Beach | 3.0 | 3.0 | 19.0 | 3.0 | 6.0 | POINT (-122.40823 37.78736) |
| 2 | 2023/03/16 05:30:00 PM | 2023/03/16 | 17:30 | 2023 | Thursday | Assault | Simple Assault | Battery | 4134 | 2023/03/16 06:02:00 PM | ... | 18TH ST \ DE HARO ST | 23743000.0 | Bayview | Potrero Hill | 10.0 | 10.0 | 54.0 | 9.0 | 2.0 | POINT (-122.40132 37.76229) |
| 3 | 2023/03/21 03:50:00 PM | 2023/03/21 | 15:50 | 2023 | Tuesday | Non-Criminal | Non-Criminal | Aided Case | 51040 | 2023/03/21 04:01:00 PM | ... | POST ST \ LARKIN ST | 25167000.0 | Northern | Tenderloin | 3.0 | 6.0 | 50.0 | 10.0 | 6.0 | POINT (-122.41827 37.78704) |
| 4 | 2021/08/22 09:40:00 AM | 2021/08/22 | 09:40 | 2021 | Sunday | Warrant | Other | Probation Search | 62071 | 2021/08/22 09:40:00 AM | ... | LAGUNA ST \ PACIFIC AVE | 26569000.0 | Northern | Pacific Heights | 2.0 | 2.0 | 102.0 | 6.0 | 4.0 | POINT (-122.4298 37.79398) |
5 rows × 27 columns
Getting the H3 cells with target values
In [10]:
train_h3, _, test_h3 = police_department_incidents.get_h3_with_labels()
In [11]:
train_h3.head()
Out[11]:
| region_id | geometry | count |
|---|---|---|
| 8828308259fffff | POLYGON ((-122.39209 37.70665, -122.38762 37.7... | 0.012247 |
| 88283095c7fffff | POLYGON ((-122.48521 37.73797, -122.48074 37.7... | 0.028330 |
| 8828309421fffff | POLYGON ((-122.48689 37.70741, -122.48243 37.7... | 0.003166 |
| 8828309519fffff | POLYGON ((-122.46613 37.71221, -122.46166 37.7... | 0.032615 |
| 8828309565fffff | POLYGON ((-122.40248 37.70426, -122.39801 37.7... | 0.034438 |
In [12]:
test_h3.head()
Out[12]:
| region_id | geometry | count |
|---|---|---|
| 8828308257fffff | POLYGON ((-122.38604 37.72313, -122.38157 37.7... | 0.103568 |
| 8828309533fffff | POLYGON ((-122.44668 37.73932, -122.44221 37.7... | 0.022926 |
| 8828308211fffff | POLYGON ((-122.39038 37.73721, -122.38591 37.7... | 0.049082 |
| 8828309563fffff | POLYGON ((-122.42024 37.70771, -122.41577 37.7... | 0.052536 |
| 8828308253fffff | POLYGON ((-122.39643 37.72074, -122.39196 37.7... | 0.064399 |
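The tables above hold one row per H3 cell with a `count` target. As a rough illustration of how such a table could be derived from point data, the sketch below counts incidents per cell and then rescales the counts. The cell assignment of each incident is invented for the example, and the min-max scaling is only an assumption prompted by the fractional values shown above; the transformation actually applied by the dataset is not stated here.

```python
from collections import Counter

# Hypothetical input: each incident is already tagged with the ID of the H3 cell
# that contains it. The IDs are copied from the tables above, but the
# assignment of incidents to cells is made up for this example.
incident_cells = [
    "8828308259fffff",
    "88283095c7fffff",
    "88283095c7fffff",
    "8828309421fffff",
    "88283095c7fffff",
    "8828308259fffff",
]


def count_per_cell(cells):
    """Return the number of incidents observed in each cell."""
    return dict(Counter(cells))


def min_max_scale(values):
    """Scale counts to [0, 1]. ASSUMED scaling, not confirmed for this dataset."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {key: 0.0 for key in values}
    return {key: (value - lo) / (hi - lo) for key, value in values.items()}


counts = count_per_cell(incident_cells)
scaled = min_max_scale(counts)
for region_id, value in scaled.items():
    print(region_id, counts[region_id], round(value, 3))
```

With real data, the cell ID of each point would come from an H3 indexing step at the dataset's resolution rather than from a hand-written list.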
In [13]:
fig, axes = plt.subplots(
    2, 1, sharex=False, sharey=False, figsize=(12, 16), height_ratios=[5, 1]
)
train_h3.plot(
    color="orange",
    markersize=0.1,
    ax=axes[0],
    label="train",
    alpha=np.minimum(
        np.power(train_h3[police_department_incidents.target] + 0.4, 2), 1
    ),
)
test_h3.plot(
    color="royalblue",
    markersize=0.1,
    ax=axes[0],
    label="test",
    alpha=np.minimum(np.power(test_h3[police_department_incidents.target] + 0.4, 2), 1),
)
cx.add_basemap(axes[0], source=cx.providers.CartoDB.PositronNoLabels, crs=4326, zoom=13)
axes[0].set_title("SFPD incidents aggregated to H3 cells")
axes[0].legend(
    handles=[Patch(facecolor="orange"), Patch(facecolor="royalblue")],
    labels=["Train", "Test"],
)
axes[0].set_axis_off()
sns.kdeplot(
    x=train_h3[police_department_incidents.target],
    label="train",
    color="orange",
    ax=axes[1],
    fill=False,
    cut=0,
)
sns.kdeplot(
    x=test_h3[police_department_incidents.target],
    label="test",
    color="royalblue",
    ax=axes[1],
    fill=False,
    cut=0,
)
axes[1].set_title("SFPD incidents - target distribution")
axes[1].legend()
fig.tight_layout()
plt.show()
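The `alpha` expression in the plotting cell maps each cell's target value to an opacity. The sketch below reproduces that mapping for single values in plain Python, so the effect is easier to see. The helper name `cell_alpha` is introduced here for illustration only, and the value 0.7 is made up to show the clipping.

```python
def cell_alpha(target, offset=0.4, power=2):
    """Mirror np.minimum(np.power(target + 0.4, 2), 1) for a single value."""
    return min((target + offset) ** power, 1.0)


# Target values taken from the train_h3 / test_h3 previews, plus one larger
# made-up value (0.7) that exceeds the clipping threshold.
for target in (0.003166, 0.034438, 0.103568, 0.7):
    print(f"target={target:.6f} -> alpha={cell_alpha(target):.3f}")
```

A cell with a target of 0 still gets an opacity of 0.4² = 0.16, so low-count cells stay faintly visible, while any cell whose target is well above 0.6 is drawn fully opaque.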
Loading the raw, full data
In [14]:
ds = police_department_incidents.load(version="all")
ds.keys()
Out[14]:
dict_keys(['train'])
In [15]:
type(police_department_incidents.train_gdf), type(police_department_incidents.test_gdf)
Out[15]:
(geopandas.geodataframe.GeoDataFrame, NoneType)
In [16]:
ds["train"].head()
Out[16]:
| | Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Incident Category | Incident Subcategory | Incident Description | Incident Code | Report Datetime | ... | Intersection | CNN | Police District | Analysis Neighborhood | Supervisor District | Supervisor District 2012 | Neighborhoods | Current Supervisor Districts | Current Police Districts | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023/03/11 02:00:00 PM | 2023/03/11 | 14:00 | 2023 | Saturday | Assault | Simple Assault | Battery | 4134 | 2023/03/15 11:21:00 AM | ... | STANYAN ST \ HAYES ST | 26446000.0 | Park | Golden Gate Park | 1.0 | 1.0 | NaN | 4.0 | 7.0 | POINT (-122.45429 37.7729) |
| 1 | 2022/06/27 12:00:00 PM | 2022/06/27 | 12:00 | 2022 | Monday | Lost Property | Lost Property | Lost Property | 71000 | 2023/03/15 05:20:00 PM | ... | GEARY ST \ POWELL ST | 24903000.0 | Central | Financial District/South Beach | 3.0 | 3.0 | 19.0 | 3.0 | 6.0 | POINT (-122.40823 37.78736) |
| 2 | 2023/03/16 05:30:00 PM | 2023/03/16 | 17:30 | 2023 | Thursday | Assault | Simple Assault | Battery | 4134 | 2023/03/16 06:02:00 PM | ... | 18TH ST \ DE HARO ST | 23743000.0 | Bayview | Potrero Hill | 10.0 | 10.0 | 54.0 | 9.0 | 2.0 | POINT (-122.40132 37.76229) |
| 3 | 2023/03/21 03:50:00 PM | 2023/03/21 | 15:50 | 2023 | Tuesday | Non-Criminal | Non-Criminal | Aided Case | 51040 | 2023/03/21 04:01:00 PM | ... | POST ST \ LARKIN ST | 25167000.0 | Northern | Tenderloin | 3.0 | 6.0 | 50.0 | 10.0 | 6.0 | POINT (-122.41827 37.78704) |
| 4 | 2021/08/22 09:40:00 AM | 2021/08/22 | 09:40 | 2021 | Sunday | Warrant | Other | Probation Search | 62071 | 2021/08/22 09:40:00 AM | ... | LAGUNA ST \ PACIFIC AVE | 26569000.0 | Northern | Pacific Heights | 2.0 | 2.0 | 102.0 | 6.0 | 4.0 | POINT (-122.4298 37.79398) |
5 rows × 27 columns
Creating your own train-test split: spatial splitting with bucket stratification
In [17]:
train, test = police_department_incidents.train_test_split(
    test_size=0.2, resolution=8, n_bins=10, random_state=42
)
Summary of the split:
Train: 135 H3 cells (638803 points)
Test: 36 H3 cells (147271 points)
Expected ratios: {'train': 0.8, 'validation': 0, 'test': 0.2}
Actual ratios: {'train': 0.813, 'test': 0.187}
Actual ratios difference: {'train': -0.013, 'test': 0.013}
bucket train_ratio test_ratio train_ratio_difference \
0 0 0.80563 0.19437 -0.00563
1 1 0.81835 0.18165 -0.01835
2 2 0.77517 0.22483 0.02483
3 3 0.77571 0.22429 0.02429
4 4 0.82568 0.17432 -0.02568
5 5 0.81983 0.18017 -0.01983
6 6 0.82327 0.17673 -0.02327
7 7 0.82568 0.17432 -0.02568
8 8 0.78099 0.21901 0.01901
9 9 0.82152 0.17848 -0.02152
test_ratio_difference train_points test_points
0 0.00563 2031 490
1 0.01835 8794 1952
2 -0.02483 13560 3933
3 -0.02429 19347 5594
4 0.02568 30740 6490
5 0.01983 38487 8458
6 0.02327 49267 10576
7 0.02568 67635 14279
8 -0.01901 105366 29547
9 0.02152 303576 65952
Created new train_gdf and test_gdf. Train len: 638803,test len: 147271
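The ratios in the summary can be re-derived from the point counts alone. The short check below recomputes them for the overall split and for bucket 0; the helper name `split_ratios` is introduced here for illustration. The reported differences match the expected ratio minus the actual one.

```python
def split_ratios(train_points, test_points, expected_train=0.8):
    """Return (train_ratio, test_ratio, train_ratio_difference)."""
    total = train_points + test_points
    train_ratio = train_points / total
    test_ratio = test_points / total
    # The difference reported in the summary equals expected minus actual.
    return train_ratio, test_ratio, expected_train - train_ratio


overall = split_ratios(638803, 147271)  # totals from the summary
bucket_0 = split_ratios(2031, 490)      # first row of the bucket table

print([round(value, 3) for value in overall])   # [0.813, 0.187, -0.013]
print([round(value, 5) for value in bucket_0])  # [0.80563, 0.19437, -0.00563]
```

The split is made on whole H3 cells, so the point-level ratio can only approximate the requested 0.8 / 0.2: here 135 of 171 cells (about 79%) hold about 81% of the points.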
In [18]:
type(police_department_incidents.train_gdf), type(police_department_incidents.test_gdf)
Out[18]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
In [19]:
police_department_incidents.resolution
Out[19]:
8
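To get a feel for what resolution 8 means spatially, the arithmetic below multiplies the number of cells from the split summary by the average hexagon area at that resolution. The area constant is an approximate, rounded figure assumed from the H3 resolution table, so the result is an order-of-magnitude estimate only.

```python
# Approximate average hexagon area at H3 resolution 8, in km^2.
# ASSUMPTION: rounded value taken from the H3 resolution table.
AVG_HEX_AREA_KM2_RES8 = 0.737

train_cells, test_cells = 135, 36  # cell counts from the split summary
total_cells = train_cells + test_cells
covered_km2 = total_cells * AVG_HEX_AREA_KM2_RES8

print(total_cells, round(covered_km2, 1))
```

Under that assumption, the 171 cells cover an area on the order of 126 km², roughly the scale of the city itself.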
In [20]:
train.head()
Out[20]:
| | Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Incident Category | Incident Subcategory | Incident Description | Incident Code | Report Datetime | ... | Intersection | CNN | Police District | Analysis Neighborhood | Supervisor District | Supervisor District 2012 | Neighborhoods | Current Supervisor Districts | Current Police Districts | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023/03/11 02:00:00 PM | 2023/03/11 | 14:00 | 2023 | Saturday | Assault | Simple Assault | Battery | 4134 | 2023/03/15 11:21:00 AM | ... | STANYAN ST \ HAYES ST | 26446000.0 | Park | Golden Gate Park | 1.0 | 1.0 | NaN | 4.0 | 7.0 | POINT (-122.45429 37.7729) |
| 1 | 2022/06/27 12:00:00 PM | 2022/06/27 | 12:00 | 2022 | Monday | Lost Property | Lost Property | Lost Property | 71000 | 2023/03/15 05:20:00 PM | ... | GEARY ST \ POWELL ST | 24903000.0 | Central | Financial District/South Beach | 3.0 | 3.0 | 19.0 | 3.0 | 6.0 | POINT (-122.40823 37.78736) |
| 2 | 2023/03/16 05:30:00 PM | 2023/03/16 | 17:30 | 2023 | Thursday | Assault | Simple Assault | Battery | 4134 | 2023/03/16 06:02:00 PM | ... | 18TH ST \ DE HARO ST | 23743000.0 | Bayview | Potrero Hill | 10.0 | 10.0 | 54.0 | 9.0 | 2.0 | POINT (-122.40132 37.76229) |
| 3 | 2023/03/21 03:50:00 PM | 2023/03/21 | 15:50 | 2023 | Tuesday | Non-Criminal | Non-Criminal | Aided Case | 51040 | 2023/03/21 04:01:00 PM | ... | POST ST \ LARKIN ST | 25167000.0 | Northern | Tenderloin | 3.0 | 6.0 | 50.0 | 10.0 | 6.0 | POINT (-122.41827 37.78704) |
| 4 | 2021/08/22 09:40:00 AM | 2021/08/22 | 09:40 | 2021 | Sunday | Warrant | Other | Probation Search | 62071 | 2021/08/22 09:40:00 AM | ... | LAGUNA ST \ PACIFIC AVE | 26569000.0 | Northern | Pacific Heights | 2.0 | 2.0 | 102.0 | 6.0 | 4.0 | POINT (-122.4298 37.79398) |
5 rows × 27 columns
In [21]:
test.head()
Out[21]:
| | Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Incident Category | Incident Subcategory | Incident Description | Incident Code | Report Datetime | ... | Intersection | CNN | Police District | Analysis Neighborhood | Supervisor District | Supervisor District 2012 | Neighborhoods | Current Supervisor Districts | Current Police Districts | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2022/07/02 10:53:00 PM | 2022/07/02 | 22:53 | 2022 | Saturday | Assault | Simple Assault | Battery | 4134 | 2022/07/02 11:00:00 PM | ... | GILMAN AVE \ HAWES ST | 20438000.0 | Bayview | Bayview Hunters Point | 10.0 | 10.0 | 88.0 | 9.0 | 2.0 | POINT (-122.39002 37.7193) |
| 17 | 2021/10/14 11:00:00 AM | 2021/10/14 | 11:00 | 2021 | Thursday | Suspicious Occ | Suspicious Occ | Suspicious Occurrence | 64070 | 2021/10/14 11:12:00 AM | ... | CLAY ST \ MONTGOMERY ST | 24756000.0 | Central | Chinatown | 3.0 | 3.0 | 104.0 | 3.0 | 6.0 | POINT (-122.40314 37.79467) |
| 18 | 2021/08/29 03:32:00 AM | 2021/08/29 | 03:32 | 2021 | Sunday | Non-Criminal | Other | Mental Health Detention | 64020 | 2021/08/29 03:32:00 AM | ... | 47TH AVE \ SANTIAGO ST | 23508000.0 | Taraval | Sunset/Parkside | 4.0 | 4.0 | 39.0 | 7.0 | 10.0 | POINT (-122.50582 37.7436) |
| 19 | 2021/07/01 09:45:00 AM | 2021/07/01 | 09:45 | 2021 | Thursday | Non-Criminal | Non-Criminal | Found Property | 72000 | 2021/07/01 09:45:00 AM | ... | PAGE ST \ LYON ST | 26325000.0 | Park | Haight Ashbury | 5.0 | 5.0 | 112.0 | 11.0 | 7.0 | POINT (-122.44219 37.77157) |
| 27 | 2021/06/08 08:50:00 PM | 2021/06/08 | 20:50 | 2021 | Tuesday | Assault | Simple Assault | Battery | 4134 | 2021/06/11 09:02:00 AM | ... | 03RD AVE \ GEARY BLVD | 27245000.0 | Richmond | Inner Richmond | 1.0 | 1.0 | 5.0 | 4.0 | 8.0 | POINT (-122.46106 37.78116) |
5 rows × 27 columns
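The idea behind a spatial split with bucket stratification can be sketched in a few lines: whole cells, not individual points, are assigned to train or test, and the assignment is done separately within buckets of cells that have similar counts. The code below is a minimal illustration under stated assumptions, not the library's implementation: the function name, the equal-sized rank buckets, and the synthetic cell counts are all hypothetical.

```python
import random


def stratified_cell_split(cell_counts, test_size=0.2, n_bins=10, seed=42):
    """Split cells into train/test, sampling separately within count buckets."""
    rng = random.Random(seed)
    # Order cells by count and cut the ordering into n_bins contiguous buckets
    # (ASSUMED bucketing rule: equal-sized rank groups).
    ordered = sorted(cell_counts, key=cell_counts.get)
    n_bins = min(n_bins, len(ordered))
    bounds = [round(i * len(ordered) / n_bins) for i in range(n_bins + 1)]
    train, test = [], []
    for start, end in zip(bounds, bounds[1:]):
        bucket = ordered[start:end]
        rng.shuffle(bucket)
        n_test = round(len(bucket) * test_size)
        test.extend(bucket[:n_test])
        train.extend(bucket[n_test:])
    return train, test


# Hypothetical, skewed counts for 100 made-up cells.
cell_counts = {f"cell_{i:03d}": (i + 1) ** 2 for i in range(100)}
train_cells, test_cells = stratified_cell_split(cell_counts)

# Every point follows its cell, so point-level ratios are derived afterwards.
train_points = sum(cell_counts[c] for c in train_cells)
test_points = sum(cell_counts[c] for c in test_cells)
print(len(train_cells), len(test_cells))
print(round(test_points / (train_points + test_points), 3))
```

Because each bucket contributes its own share of test cells, sparse and dense areas both appear in the test set, which is what keeps the per-bucket ratios in the summary close to the requested split.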