OBSR Datasets¶

Examples showcasing the available datasets from the Open Benchmark for Spatial Representations (OBSR).

Available prediction tasks¶

The benchmark tasks are split into two main categories: point-based 📍 and trajectory-based 🛣️.

Below is the list of example tasks ⤵️

Point-based tasks¶

Price prediction¶

Short-term rental price

Datasets: Airbnb Multicity

Predict the average short-term rental price for a given hexagonal region (at a specified H3 resolution). The task uses region embeddings generated by the evaluated model, optionally combined with additional features from the provided dataset. The benchmark is based on Airbnb data, enabling evaluation of how well spatial representations capture economic patterns in urban areas.

House sale price

Datasets: House Sales in King County

Predict the average house sale price for a given hexagonal region (at a specified H3 resolution). The task uses region embeddings generated by the evaluated model, optionally combined with property-related features from the provided dataset. The benchmark is based on the House Sales in King County dataset, enabling assessment of how well spatial representations capture real estate market patterns.
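Both price tasks use a per-region average as the regression target. The sketch below shows that aggregation step in plain Python. It is an illustration only, not the SRAI implementation: the `listings` rows and the `hex_*` region IDs are hypothetical stand-ins for real H3 cell indexes, which would normally be derived from each record's coordinates at the chosen resolution.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (region_id, price). The region IDs stand in for
# H3 cell indexes computed from each record's coordinates.
listings = [
    ("hex_a", 120.0),
    ("hex_a", 80.0),
    ("hex_b", 200.0),
    ("hex_c", 50.0),
    ("hex_c", 70.0),
    ("hex_c", 90.0),
]


def average_price_per_region(rows):
    """Group prices by region ID and return the mean price of each region."""
    prices_by_region = defaultdict(list)
    for region_id, price in rows:
        prices_by_region[region_id].append(price)
    return {
        region_id: mean(prices)
        for region_id, prices in prices_by_region.items()
    }


# One target value per region; a model would predict it from the region embedding.
region_targets = average_price_per_region(listings)
print(region_targets)
```

Repeating the aggregation at a coarser or finer resolution changes only how records are assigned to region IDs; the averaging step stays the same.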

Activity prediction¶

Crime activity prediction

Datasets: Chicago Crime, Philadelphia Crime, SFPD Incident Report Dataset

Estimate the crime intensity level within a given geographic region. Each sample represents a specific area with associated attributes. The target variable is the crime intensity, represented as a continuous value in the range (0, 1), indicating relative risk or danger level. This is formulated as a regression problem. The task uses region embeddings generated by the evaluated model.

Trajectory-based tasks¶

Human Mobility Prediction (HMC)¶

Auto-regressive classification of user trajectories

Datasets: Geolife, Porto Taxi

Predict the next location a user visits based on their recent movement trajectory, represented as a sequence of spatial regions. Each input sample consists of an ordered sequence of hexagon IDs capturing the user’s past trajectory, and the target is the next hexagon in the sequence. This is formulated as an autoregressive sequence classification problem, where the model learns spatio-temporal mobility patterns to forecast the most probable next location among neighboring hexagons. The task uses region embeddings generated by the evaluated model.
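To make the sample layout concrete, here is a minimal sketch of how one trajectory could be cut into (context, next hexagon) pairs with a sliding window. The function name `next_location_samples`, the `context_length` parameter, and the `hex_*` IDs are assumptions made for illustration; the benchmark's actual windowing may differ.

```python
def next_location_samples(trajectory, context_length):
    """Split one trajectory into (context, target) pairs with a sliding window.

    Each context holds `context_length` consecutive hexagon IDs and the
    target is the hexagon that immediately follows them.
    """
    samples = []
    for end in range(context_length, len(trajectory)):
        context = trajectory[end - context_length:end]
        target = trajectory[end]
        samples.append((context, target))
    return samples


# Hypothetical trajectory; real inputs would be H3 cell indexes.
trajectory = ["hex_a", "hex_b", "hex_c", "hex_d", "hex_e"]
samples = next_location_samples(trajectory, context_length=3)
for context, target in samples:
    print(context, "->", target)
```

Trajectories no longer than `context_length` yield no samples, so very short trips would have to be dropped or handled with a shorter context.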

Travel Time Estimation (TTE)¶

Predict estimated travel time

Datasets: Geolife, Porto Taxi

Predict the total travel time for a trip based on a sequence of spatial regions representing the trajectory. Each sample consists of an ordered sequence of hexagon IDs corresponding to GPS positions along a route. The target variable is the actual travel time. This is formulated as a sequence-to-value regression problem, where the model learns temporal and spatial dependencies within the route to estimate travel duration. The task uses region embeddings generated by the evaluated model.
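As a rough sketch of the sequence-to-value setup, the snippet below turns a list of timestamped GPS fixes into one sample: an ordered sequence of region IDs plus the trip duration in seconds. The `gps_fixes` data and the `travel_time_sample` helper are hypothetical, and collapsing consecutive repeats of the same hexagon is an assumption of this sketch rather than something the benchmark description states.

```python
from datetime import datetime

# Hypothetical GPS fixes already mapped to region IDs: (timestamp, region_id).
gps_fixes = [
    (datetime(2024, 1, 1, 8, 0, 0), "hex_a"),
    (datetime(2024, 1, 1, 8, 0, 30), "hex_a"),
    (datetime(2024, 1, 1, 8, 2, 0), "hex_b"),
    (datetime(2024, 1, 1, 8, 5, 0), "hex_c"),
    (datetime(2024, 1, 1, 8, 5, 45), "hex_c"),
]


def travel_time_sample(fixes):
    """Return (region sequence, travel time in seconds) for one trip."""
    if not fixes:
        raise ValueError("a trip needs at least one GPS fix")
    ordered = sorted(fixes, key=lambda fix: fix[0])
    regions = []
    for _, region_id in ordered:
        # Assumption: keep each hexagon once per consecutive visit.
        if not regions or regions[-1] != region_id:
            regions.append(region_id)
    duration = (ordered[-1][0] - ordered[0][0]).total_seconds()
    return regions, duration


regions, duration_seconds = travel_time_sample(gps_fixes)
print(regions, duration_seconds)
```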

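| Dataset | Price prediction | Crime activity prediction | Human Mobility Prediction (HMC) | Travel Time Estimation (TTE) |
|---|---|---|---|---|
| Airbnb Multicity | ✅ | | | |
| Chicago Crime | | ✅ | | |
| Geolife | | | ✅ | ✅ |
| House Sales in King County | ✅ | | | |
| Philadelphia Crime | | ✅ | | |
| Porto Taxi | | | ✅ | ✅ |
| SFPD Incident Report Dataset | | ✅ | | |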

Benchmark and datasets paper¶

DOI: TBA

ArXiv pre-print: https://arxiv.org/abs/2510.05879

Contributing new datasets¶

The datasets codebase in SRAI is based mainly on the HuggingFace backend. The library offers preprocessed datasets with predefined train and test splits, as well as a "raw" version that users can split manually.

Although most of the datasets are based on data from the Kraina organisation, we are not limited to it. Each dataset class can point to any public dataset on HuggingFace.

We want to collect and distribute vector-based datasets for evaluating geospatial embeddings in a variety of tasks. We use the H3 grid system to generate data at different resolutions for the same dataset.

Requirements for a new spatial dataset:

Publicly available.
Has a spatial component: points with a target, a collection of points that will be aggregated, trajectories, regions for classification, etc.
Available train and test splits should be split spatially (there is a train_test_spatial_split function available in the srai.spatial_split module), so that the two splits don't cover the same regions. If the dataset covers multiple cities (like Airbnb), consider splitting the data for each city / region individually and combining the results at the end. This requirement can be omitted for trajectory-based tasks, where proper separation is hard to achieve.
Should have a default H3 aggregation resolution with an optional list of available other resolutions to test different levels.
Doesn't violate any licenses (Non-Commercial licenses are acceptable). For example, the GeoLife and Philadelphia Crime datasets are downloaded from their original sources, and only the splits with IDs are saved on HuggingFace, because their licenses don't allow redistribution. The required processing of raw data is done automatically on the user's side with the attached code.
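The spatial-split requirement above can be illustrated with a simplified stand-in written in plain Python. It does not call or reproduce train_test_spatial_split from srai.spatial_split; the helper names, the `test_fraction` and `seed` parameters, and the toy data are all hypothetical. The sketch only shows the core idea: whole regions are assigned to one split, each city is split on its own, and the results are combined at the end.

```python
import random


def spatial_split_by_region(rows, test_fraction, seed=0):
    """Assign whole regions to train or test so that no region is in both."""
    region_ids = sorted({region_id for region_id, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(region_ids)
    # Keep at least one region in the test split.
    n_test = max(1, round(len(region_ids) * test_fraction))
    test_regions = set(region_ids[:n_test])
    train = [row for row in rows if row[0] not in test_regions]
    test = [row for row in rows if row[0] in test_regions]
    return train, test


def split_per_city(rows_by_city, test_fraction, seed=0):
    """Split each city independently, then concatenate the results."""
    train_all, test_all = [], []
    for city in sorted(rows_by_city):
        train, test = spatial_split_by_region(
            rows_by_city[city], test_fraction, seed
        )
        train_all.extend(train)
        test_all.extend(test)
    return train_all, test_all


# Hypothetical data: (region_id, target) rows grouped by city.
rows_by_city = {
    "city_1": [("c1_hex_%d" % i, float(i)) for i in range(10) for _ in range(2)],
    "city_2": [("c2_hex_%d" % i, float(i)) for i in range(5) for _ in range(3)],
}
train_rows, test_rows = split_per_city(rows_by_city, test_fraction=0.2)
train_regions = {region_id for region_id, _ in train_rows}
test_regions = {region_id for region_id, _ in test_rows}
print(len(train_regions), len(test_regions))
```

Splitting per city keeps every city represented in both splits, which a single global shuffle over all regions would not guarantee.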