Police Department Incidents Dataset
The San Francisco Police Department’s (SFPD) Incident Report Dataset is one of the most frequently used datasets on DataSF. It encompasses over 600,000 reported crime incidents filed between January 2018 and March 2024, either by officers or self-reported by the public through SFPD’s online reporting system. It provides detailed temporal and categorical information on crimes, making it comparable to the Chicago and Philadelphia datasets.
In [1]:
# plotting imports
import contextily as cx
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.patches import Patch
# dataset import
from srai.datasets import PoliceDepartmentIncidentsDataset
In [2]:
police_department_incidents = PoliceDepartmentIncidentsDataset()
In [3]:
type(police_department_incidents.train_gdf), type(police_department_incidents.test_gdf)
Out[3]:
(NoneType, NoneType)
Default config
In [4]:
ds = police_department_incidents.load(version=8)
ds.keys()
Out[4]:
dict_keys(['train', 'test'])
In [5]:
type(police_department_incidents.train_gdf), type(police_department_incidents.test_gdf)
Out[5]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
In [6]:
print("Aggregation H3 resolution:", police_department_incidents.resolution)
Aggregation H3 resolution: 8
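For a sense of scale, H3 partitions the globe into 2 + 120 * 7^r cells at resolution r, so the average cell area follows from dividing Earth's surface area by that count. The sketch below uses only the standard library. The surface-area constant is an approximate value assumed here, and `h3_cell_count` and `h3_average_cell_area_km2` are illustrative helper names, not part of `srai`.

```python
# Approximate surface area of Earth in km^2 (assumed constant for this sketch).
EARTH_SURFACE_KM2 = 510_065_621.7


def h3_cell_count(resolution: int) -> int:
    """Total number of H3 cells at a resolution (12 of them are pentagons)."""
    return 2 + 120 * 7**resolution


def h3_average_cell_area_km2(resolution: int) -> float:
    """Average cell area: Earth's surface divided by the number of cells."""
    return EARTH_SURFACE_KM2 / h3_cell_count(resolution)


resolution = 8
print(h3_cell_count(resolution))
print(round(h3_average_cell_area_km2(resolution), 3))
```

At resolution 8 this works out to roughly 0.74 km² per cell on average; individual cells vary somewhat in area depending on where they sit on the globe.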
In [7]:
print("Prediction target:", police_department_incidents.target)
Prediction target: count
In [8]:
gdf_train, gdf_test = ds["train"], ds["test"]
In [9]:
gdf_train.head()
Out[9]:
| | Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Incident Category | Incident Subcategory | Incident Description | Incident Code | Report Datetime | ... | Intersection | CNN | Police District | Analysis Neighborhood | Supervisor District | Supervisor District 2012 | Neighborhoods | Current Supervisor Districts | Current Police Districts | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023/03/11 02:00:00 PM | 2023/03/11 | 14:00 | 2023 | Saturday | Assault | Simple Assault | Battery | 4134 | 2023/03/15 11:21:00 AM | ... | STANYAN ST \ HAYES ST | 26446000.0 | Park | Golden Gate Park | 1.0 | 1.0 | NaN | 4.0 | 7.0 | POINT (-122.45429 37.7729) |
1 | 2022/06/27 12:00:00 PM | 2022/06/27 | 12:00 | 2022 | Monday | Lost Property | Lost Property | Lost Property | 71000 | 2023/03/15 05:20:00 PM | ... | GEARY ST \ POWELL ST | 24903000.0 | Central | Financial District/South Beach | 3.0 | 3.0 | 19.0 | 3.0 | 6.0 | POINT (-122.40823 37.78736) |
2 | 2023/03/16 05:30:00 PM | 2023/03/16 | 17:30 | 2023 | Thursday | Assault | Simple Assault | Battery | 4134 | 2023/03/16 06:02:00 PM | ... | 18TH ST \ DE HARO ST | 23743000.0 | Bayview | Potrero Hill | 10.0 | 10.0 | 54.0 | 9.0 | 2.0 | POINT (-122.40132 37.76229) |
3 | 2023/03/21 03:50:00 PM | 2023/03/21 | 15:50 | 2023 | Tuesday | Non-Criminal | Non-Criminal | Aided Case | 51040 | 2023/03/21 04:01:00 PM | ... | POST ST \ LARKIN ST | 25167000.0 | Northern | Tenderloin | 3.0 | 6.0 | 50.0 | 10.0 | 6.0 | POINT (-122.41827 37.78704) |
4 | 2021/08/22 09:40:00 AM | 2021/08/22 | 09:40 | 2021 | Sunday | Warrant | Other | Probation Search | 62071 | 2021/08/22 09:40:00 AM | ... | LAGUNA ST \ PACIFIC AVE | 26569000.0 | Northern | Pacific Heights | 2.0 | 2.0 | 102.0 | 6.0 | 4.0 | POINT (-122.4298 37.79398) |
5 rows × 27 columns
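The `Incident Datetime` and `Report Datetime` columns are shown as strings with a 12-hour clock and an AM/PM suffix. A minimal standard-library sketch of parsing them, using the values from the first row above, might look like the following. `parse_sfpd_datetime` is an illustrative helper name, and the format string is inferred from the preview rather than taken from any dataset documentation.

```python
from datetime import datetime

# Format inferred from strings such as "2023/03/11 02:00:00 PM":
# %I is the 12-hour clock hour, %p the AM/PM marker (locale-dependent).
SFPD_DATETIME_FORMAT = "%Y/%m/%d %I:%M:%S %p"


def parse_sfpd_datetime(value: str) -> datetime:
    """Parse one timestamp string in the format shown in the preview."""
    return datetime.strptime(value, SFPD_DATETIME_FORMAT)


# Values copied from row 0 of the preview above.
incident = parse_sfpd_datetime("2023/03/11 02:00:00 PM")
report = parse_sfpd_datetime("2023/03/15 11:21:00 AM")

# Time elapsed between the incident and the filing of the report.
reporting_delay = report - incident
print(incident.isoformat())
print(reporting_delay)
```

For a whole column, passing the same format string to `pandas.to_datetime` should give the equivalent result without a Python-level loop.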
Getting H3 cells with target values
In [10]:
train_h3, _, test_h3 = police_department_incidents.get_h3_with_labels()
In [11]:
train_h3.head()
Out[11]:
region_id | geometry | count |
---|---|---|
88283095a3fffff | POLYGON ((-122.47313 37.77091, -122.46866 37.7... | 0.036564 |
882830952bfffff | POLYGON ((-122.42458 37.7218, -122.42011 37.72... | 0.038051 |
88283095a7fffff | POLYGON ((-122.46274 37.77331, -122.45827 37.7... | 0.079683 |
8828309505fffff | POLYGON ((-122.4276 37.71356, -122.42313 37.71... | 0.035589 |
88283082d3fffff | POLYGON ((-122.448 37.76163, -122.44353 37.765... | 0.117206 |
In [12]:
test_h3.head()
Out[12]:
region_id | geometry | count |
---|---|---|
8828308289fffff | POLYGON ((-122.41985 37.76059, -122.41538 37.7... | 0.393570 |
88283082c5fffff | POLYGON ((-122.40511 37.7489, -122.40064 37.75... | 0.115863 |
8828309533fffff | POLYGON ((-122.44668 37.73932, -122.44221 37.7... | 0.022926 |
88283082e3fffff | POLYGON ((-122.39906 37.76538, -122.39458 37.7... | 0.064063 |
88283082abfffff | POLYGON ((-122.4034 37.77945, -122.39893 37.78... | 0.626191 |
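The `count` column holds one value per H3 cell. Conceptually, such a label comes from grouping incident points by the cell that contains them. The sketch below shows only that grouping step with the standard library. The incident-to-cell assignments are hypothetical (hard-coded rather than derived from coordinates), and since the values printed above are fractional, the library appears to rescale the raw counts in a way this page does not describe, so the sketch stops at raw counts.

```python
from collections import Counter

# Hypothetical (incident_id, h3_cell_id) pairs; in practice the cell id would
# be derived from each incident's point geometry.
incident_cells = [
    (0, "88283095a7fffff"),
    (1, "88283082abfffff"),
    (2, "88283082e3fffff"),
    (3, "88283082abfffff"),
    (4, "88283082abfffff"),
]

# Count incidents per cell: one entry per region_id.
counts_per_cell = Counter(cell for _, cell in incident_cells)

# Print cells from the most to the least populated.
for region_id, count in counts_per_cell.most_common():
    print(region_id, count)
```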
In [13]:
fig, axes = plt.subplots(
    2, 1, sharex=False, sharey=False, figsize=(12, 16), height_ratios=[5, 1]
)
train_h3.plot(
    color="orange",
    markersize=0.1,
    ax=axes[0],
    label="train",
    alpha=np.minimum(
        np.power(train_h3[police_department_incidents.target] + 0.4, 2), 1
    ),
)
test_h3.plot(
    color="royalblue",
    markersize=0.1,
    ax=axes[0],
    label="test",
    alpha=np.minimum(np.power(test_h3[police_department_incidents.target] + 0.4, 2), 1),
)
cx.add_basemap(axes[0], source=cx.providers.CartoDB.PositronNoLabels, crs=4326, zoom=13)
axes[0].set_title("SFPD incidents aggregated to H3 cells")
axes[0].legend(
    handles=[Patch(facecolor="orange"), Patch(facecolor="royalblue")],
    labels=["Train", "Test"],
)
axes[0].set_axis_off()
sns.kdeplot(
    x=train_h3[police_department_incidents.target],
    label="train",
    color="orange",
    ax=axes[1],
    fill=False,
    cut=0,
)
sns.kdeplot(
    x=test_h3[police_department_incidents.target],
    label="test",
    color="royalblue",
    ax=axes[1],
    fill=False,
    cut=0,
)
axes[1].set_title("SFPD incidents - target distribution")
axes[1].legend()
fig.tight_layout()
plt.show()
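The `alpha` argument in the plotting cell maps each cell's target value t to min((t + 0.4)^2, 1), so sparse cells stay faintly visible while dense cells become fully opaque. The sketch below evaluates that expression in plain Python on a few target values taken from the previews above; `cell_alpha` is an illustrative helper name, not something defined in the notebook.

```python
def cell_alpha(target: float, offset: float = 0.4) -> float:
    """Opacity used in the plot: squared shifted target, capped at 1."""
    return min((target + offset) ** 2, 1.0)


# Target values taken from the train_h3 / test_h3 previews.
for target in (0.022926, 0.117206, 0.393570, 0.626191):
    print(f"{target:.6f} -> {cell_alpha(target):.3f}")
```

With this mapping a cell with a target of 0 is drawn at an opacity of 0.16, and any cell with a target of 0.6 or more is drawn fully opaque.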
Loading raw, full data
In [14]:
ds = police_department_incidents.load(version="all")
ds.keys()
Out[14]:
dict_keys(['train'])
In [15]:
type(police_department_incidents.train_gdf), type(police_department_incidents.test_gdf)
Out[15]:
(geopandas.geodataframe.GeoDataFrame, NoneType)
In [16]:
ds["train"].head()
Out[16]:
| | Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Incident Category | Incident Subcategory | Incident Description | Incident Code | Report Datetime | ... | Intersection | CNN | Police District | Analysis Neighborhood | Supervisor District | Supervisor District 2012 | Neighborhoods | Current Supervisor Districts | Current Police Districts | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023/03/11 02:00:00 PM | 2023/03/11 | 14:00 | 2023 | Saturday | Assault | Simple Assault | Battery | 4134 | 2023/03/15 11:21:00 AM | ... | STANYAN ST \ HAYES ST | 26446000.0 | Park | Golden Gate Park | 1.0 | 1.0 | NaN | 4.0 | 7.0 | POINT (-122.45429 37.7729) |
1 | 2022/06/27 12:00:00 PM | 2022/06/27 | 12:00 | 2022 | Monday | Lost Property | Lost Property | Lost Property | 71000 | 2023/03/15 05:20:00 PM | ... | GEARY ST \ POWELL ST | 24903000.0 | Central | Financial District/South Beach | 3.0 | 3.0 | 19.0 | 3.0 | 6.0 | POINT (-122.40823 37.78736) |
2 | 2023/03/16 05:30:00 PM | 2023/03/16 | 17:30 | 2023 | Thursday | Assault | Simple Assault | Battery | 4134 | 2023/03/16 06:02:00 PM | ... | 18TH ST \ DE HARO ST | 23743000.0 | Bayview | Potrero Hill | 10.0 | 10.0 | 54.0 | 9.0 | 2.0 | POINT (-122.40132 37.76229) |
3 | 2023/03/21 03:50:00 PM | 2023/03/21 | 15:50 | 2023 | Tuesday | Non-Criminal | Non-Criminal | Aided Case | 51040 | 2023/03/21 04:01:00 PM | ... | POST ST \ LARKIN ST | 25167000.0 | Northern | Tenderloin | 3.0 | 6.0 | 50.0 | 10.0 | 6.0 | POINT (-122.41827 37.78704) |
4 | 2021/08/22 09:40:00 AM | 2021/08/22 | 09:40 | 2021 | Sunday | Warrant | Other | Probation Search | 62071 | 2021/08/22 09:40:00 AM | ... | LAGUNA ST \ PACIFIC AVE | 26569000.0 | Northern | Pacific Heights | 2.0 | 2.0 | 102.0 | 6.0 | 4.0 | POINT (-122.4298 37.79398) |
5 rows × 27 columns
Create your own train-test split: spatial splitting with bucket stratification
In [17]:
train, test = police_department_incidents.train_test_split(
    test_size=0.2, resolution=8, n_bins=10, random_state=42
)
Summary of the split:
Train: 135 H3 cells (638803 points)
Test: 36 H3 cells (147271 points)
Expected ratios: {'train': 0.8, 'validation': 0, 'test': 0.2}
Actual ratios: {'train': 0.813, 'test': 0.187}
Actual ratios difference: {'train': -0.013, 'test': 0.013}

bucket | train_ratio | test_ratio | train_ratio_difference | test_ratio_difference | train_points | test_points |
---|---|---|---|---|---|---|
0 | 0.80563 | 0.19437 | -0.00563 | 0.00563 | 2031 | 490 |
1 | 0.81835 | 0.18165 | -0.01835 | 0.01835 | 8794 | 1952 |
2 | 0.77517 | 0.22483 | 0.02483 | -0.02483 | 13560 | 3933 |
3 | 0.77571 | 0.22429 | 0.02429 | -0.02429 | 19347 | 5594 |
4 | 0.82568 | 0.17432 | -0.02568 | 0.02568 | 30740 | 6490 |
5 | 0.81983 | 0.18017 | -0.01983 | 0.01983 | 38487 | 8458 |
6 | 0.82327 | 0.17673 | -0.02327 | 0.02327 | 49267 | 10576 |
7 | 0.82568 | 0.17432 | -0.02568 | 0.02568 | 67635 | 14279 |
8 | 0.78099 | 0.21901 | 0.01901 | -0.01901 | 105366 | 29547 |
9 | 0.82152 | 0.17848 | -0.02152 | 0.02152 | 303576 | 65952 |
Created new train_gdf and test_gdf. Train len: 638803,test len: 147271
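The split above holds out whole H3 cells rather than individual points, and the per-bucket summary suggests that cells are grouped into `n_bins` buckets by point count so that sparse and dense cells both appear on each side. The following is a minimal standard-library sketch of that idea on synthetic data. It is not the `srai` implementation: `stratified_cell_split`, its equal-sized rank buckets, and the synthetic cell ids and counts are all assumptions made for illustration.

```python
import random


def stratified_cell_split(cell_counts, test_size=0.2, n_bins=10, random_state=42):
    """Split cell ids into train/test, sampling test cells within count buckets."""
    rng = random.Random(random_state)
    # Rank cells by point count and cut the ranking into n_bins buckets.
    ranked = sorted(cell_counts, key=cell_counts.get)
    bucket_size = -(-len(ranked) // n_bins)  # ceiling division
    train_cells, test_cells = [], []
    for start in range(0, len(ranked), bucket_size):
        bucket = ranked[start : start + bucket_size]
        rng.shuffle(bucket)
        # Hold out roughly test_size of the cells in this bucket.
        n_test = round(len(bucket) * test_size)
        test_cells.extend(bucket[:n_test])
        train_cells.extend(bucket[n_test:])
    return train_cells, test_cells


# Synthetic point counts per cell (hypothetical ids and values).
data_rng = random.Random(0)
cell_counts = {f"cell_{i:03d}": data_rng.randint(1, 5000) for i in range(170)}

train_cells, test_cells = stratified_cell_split(cell_counts)
train_points = sum(cell_counts[c] for c in train_cells)
test_points = sum(cell_counts[c] for c in test_cells)
total = train_points + test_points

# Cell-level sizes, then point-level ratios of the resulting split.
print(len(train_cells), len(test_cells))
print(round(train_points / total, 3), round(test_points / total, 3))
```

Because entire cells go to one side, the point-level ratio only approximates `test_size`. The summary above shows the same effect: the requested 0.8/0.2 split came out as 0.813/0.187.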
In [18]:
type(police_department_incidents.train_gdf), type(police_department_incidents.test_gdf)
Out[18]:
(geopandas.geodataframe.GeoDataFrame, geopandas.geodataframe.GeoDataFrame)
In [19]:
police_department_incidents.resolution
Out[19]:
8
In [20]:
train.head()
Out[20]:
| | Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Incident Category | Incident Subcategory | Incident Description | Incident Code | Report Datetime | ... | Intersection | CNN | Police District | Analysis Neighborhood | Supervisor District | Supervisor District 2012 | Neighborhoods | Current Supervisor Districts | Current Police Districts | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023/03/11 02:00:00 PM | 2023/03/11 | 14:00 | 2023 | Saturday | Assault | Simple Assault | Battery | 4134 | 2023/03/15 11:21:00 AM | ... | STANYAN ST \ HAYES ST | 26446000.0 | Park | Golden Gate Park | 1.0 | 1.0 | NaN | 4.0 | 7.0 | POINT (-122.45429 37.7729) |
1 | 2022/06/27 12:00:00 PM | 2022/06/27 | 12:00 | 2022 | Monday | Lost Property | Lost Property | Lost Property | 71000 | 2023/03/15 05:20:00 PM | ... | GEARY ST \ POWELL ST | 24903000.0 | Central | Financial District/South Beach | 3.0 | 3.0 | 19.0 | 3.0 | 6.0 | POINT (-122.40823 37.78736) |
2 | 2023/03/16 05:30:00 PM | 2023/03/16 | 17:30 | 2023 | Thursday | Assault | Simple Assault | Battery | 4134 | 2023/03/16 06:02:00 PM | ... | 18TH ST \ DE HARO ST | 23743000.0 | Bayview | Potrero Hill | 10.0 | 10.0 | 54.0 | 9.0 | 2.0 | POINT (-122.40132 37.76229) |
3 | 2023/03/21 03:50:00 PM | 2023/03/21 | 15:50 | 2023 | Tuesday | Non-Criminal | Non-Criminal | Aided Case | 51040 | 2023/03/21 04:01:00 PM | ... | POST ST \ LARKIN ST | 25167000.0 | Northern | Tenderloin | 3.0 | 6.0 | 50.0 | 10.0 | 6.0 | POINT (-122.41827 37.78704) |
4 | 2021/08/22 09:40:00 AM | 2021/08/22 | 09:40 | 2021 | Sunday | Warrant | Other | Probation Search | 62071 | 2021/08/22 09:40:00 AM | ... | LAGUNA ST \ PACIFIC AVE | 26569000.0 | Northern | Pacific Heights | 2.0 | 2.0 | 102.0 | 6.0 | 4.0 | POINT (-122.4298 37.79398) |
5 rows × 27 columns
In [21]:
test.head()
Out[21]:
| | Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Incident Category | Incident Subcategory | Incident Description | Incident Code | Report Datetime | ... | Intersection | CNN | Police District | Analysis Neighborhood | Supervisor District | Supervisor District 2012 | Neighborhoods | Current Supervisor Districts | Current Police Districts | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2022/07/02 10:53:00 PM | 2022/07/02 | 22:53 | 2022 | Saturday | Assault | Simple Assault | Battery | 4134 | 2022/07/02 11:00:00 PM | ... | GILMAN AVE \ HAWES ST | 20438000.0 | Bayview | Bayview Hunters Point | 10.0 | 10.0 | 88.0 | 9.0 | 2.0 | POINT (-122.39002 37.7193) |
17 | 2021/10/14 11:00:00 AM | 2021/10/14 | 11:00 | 2021 | Thursday | Suspicious Occ | Suspicious Occ | Suspicious Occurrence | 64070 | 2021/10/14 11:12:00 AM | ... | CLAY ST \ MONTGOMERY ST | 24756000.0 | Central | Chinatown | 3.0 | 3.0 | 104.0 | 3.0 | 6.0 | POINT (-122.40314 37.79467) |
18 | 2021/08/29 03:32:00 AM | 2021/08/29 | 03:32 | 2021 | Sunday | Non-Criminal | Other | Mental Health Detention | 64020 | 2021/08/29 03:32:00 AM | ... | 47TH AVE \ SANTIAGO ST | 23508000.0 | Taraval | Sunset/Parkside | 4.0 | 4.0 | 39.0 | 7.0 | 10.0 | POINT (-122.50582 37.7436) |
19 | 2021/07/01 09:45:00 AM | 2021/07/01 | 09:45 | 2021 | Thursday | Non-Criminal | Non-Criminal | Found Property | 72000 | 2021/07/01 09:45:00 AM | ... | PAGE ST \ LYON ST | 26325000.0 | Park | Haight Ashbury | 5.0 | 5.0 | 112.0 | 11.0 | 7.0 | POINT (-122.44219 37.77157) |
27 | 2021/06/08 08:50:00 PM | 2021/06/08 | 20:50 | 2021 | Tuesday | Assault | Simple Assault | Battery | 4134 | 2021/06/11 09:02:00 AM | ... | 03RD AVE \ GEARY BLVD | 27245000.0 | Richmond | Inner Richmond | 1.0 | 1.0 | 5.0 | 4.0 | 8.0 | POINT (-122.46106 37.78116) |
5 rows × 27 columns