Wide format¶
OvertureMaestro can transform downloaded data into a wide format. This format is intended for geospatial machine learning, where selected datasets are pivoted by category into columns.
This notebook explains what this format is and how to work with it.
New functions¶
The new module contains the same set of functions as the basic API, with wide_form added to the name:
- convert_geometry_to_parquet → convert_geometry_to_wide_form_parquet
- convert_geometry_to_geodataframe → convert_geometry_to_wide_form_geodataframe
- other functions ...
Additionally, there are dedicated functions for downloading all available datasets at once:
- convert_geometry_to_wide_form_parquet_for_all_types
- convert_geometry_to_wide_form_geodataframe_for_all_types
- convert_bounding_box_to_wide_form_parquet_for_all_types
- convert_bounding_box_to_wide_form_geodataframe_for_all_types
You can import them from the overturemaestro.advanced_functions module.
from overturemaestro import convert_geometry_to_geodataframe, geocode_to_geometry
from overturemaestro.advanced_functions import convert_geometry_to_wide_form_geodataframe
What is the wide format?¶
In this section we compare the original data format with the wide format, using water data as an example.
Let's start by looking at the official Overture Maps schema for the base water data type:
import requests
import yaml
response = requests.get(
"https://raw.githubusercontent.com/OvertureMaps/schema/refs/tags/v1.4.0/schema/base/water.yaml",
allow_redirects=True,
)
water_schema = yaml.safe_load(response.content.decode("utf-8"))
water_schema
{'$schema': 'https://json-schema.org/draft/2020-12/schema',
'title': 'water',
'description': 'Physical representations of inland and ocean marine surfaces. Translates `natural` and `waterway` tags from OpenStreetMap.',
'type': 'object',
'properties': {'id': {'$ref': '../defs.yaml#/$defs/propertyDefinitions/id'},
'geometry': {'unevaluatedProperties': False,
'oneOf': [{'$ref': 'https://geojson.org/schema/Point.json'},
{'$ref': 'https://geojson.org/schema/LineString.json'},
{'$ref': 'https://geojson.org/schema/Polygon.json'},
{'$ref': 'https://geojson.org/schema/MultiPolygon.json'}]},
'properties': {'unevaluatedProperties': False,
'allOf': [{'$ref': '../defs.yaml#/$defs/propertyContainers/overtureFeaturePropertiesContainer'},
{'$ref': '../defs.yaml#/$defs/propertyContainers/levelContainer'},
{'$ref': '../defs.yaml#/$defs/propertyContainers/namesContainer'},
{'$ref': './defs.yaml#/$defs/propertyContainers/osmPropertiesContainer'}],
'required': ['subtype', 'class'],
'properties': {'subtype': {'description': 'The type of water body such as an river, ocean or lake.',
'default': ['water'],
'type': 'string',
'enum': ['canal',
'human_made',
'lake',
'ocean',
'physical',
'pond',
'reservoir',
'river',
'spring',
'stream',
'wastewater',
'water']},
'class': {'description': 'Further description of the type of water',
'default': ['water'],
'enum': ['basin',
'bay',
'blowhole',
'canal',
'cape',
'ditch',
'dock',
'drain',
'fairway',
'fish_pass',
'fishpond',
'geyser',
'hot_spring',
'lagoon',
'lake',
'moat',
'ocean',
'oxbow',
'pond',
'reflecting_pool',
'reservoir',
'river',
'salt_pond',
'sea',
'sewage',
'shoal',
'spring',
'strait',
'stream',
'swimming_pool',
'tidal_channel',
'wastewater',
'water',
'water_storage',
'waterfall']},
'is_salt': {'description': 'Is it salt water or not', 'type': 'boolean'},
'is_intermittent': {'description': 'Is it intermittent water or not',
'type': 'boolean'}}}}}
The specification defines two required fields, subtype and class, and enumerates the possible values of each.
Both values describe what the feature is. Together with the theme and type, they form a path:
theme (base) → type (water) → subtype (e.g. reservoir) → class (e.g. basin).
Based on this hierarchy, all available values can be determined and mapped to columns.
The result is data in a wide format, where boolean flags describe what each feature is.
amsterdam = geocode_to_geometry("Amsterdam")
original_data = convert_geometry_to_geodataframe("base", "water", amsterdam)
wide_data = convert_geometry_to_wide_form_geodataframe("base", "water", amsterdam)
Finished operation in 0:00:07
Finished operation in 0:00:09
original_data
| id | names | subtype | class | sources | source_tags | level | wikidata | is_intermittent | is_salt | geometry | version | bbox |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| f3ff6909-6205-369d-9da8-f9aac25e43db | None | ocean | ocean | [{'property': '', 'dataset': 'OpenStreetMap', ... | None | NaN | None | None | True | POLYGON ((-72.99934 40.68825, -72.99934 40.747... | 1 | {'xmin': -74.00066375732422, 'xmax': -72.99933... |
| 7198db90-6b71-3792-8947-b581d6c5b5c4 | None | human_made | swimming_pool | [{'property': '', 'dataset': 'OpenStreetMap', ... | [(access, private), (leisure, swimming_pool)] | NaN | None | None | None | POLYGON ((-73.72785 40.66063, -73.72784 40.660... | 0 | {'xmin': -73.7278823852539, 'xmax': -73.727821... |
| d3a3c77d-8078-3d98-b7b7-323323bacf44 | {'primary': 'Hook Creek', 'common': None, 'rul... | canal | drain | [{'property': '', 'dataset': 'OpenStreetMap', ... | [(tunnel, culvert), (waterway, drain)] | -1.0 | None | None | None | LINESTRING (-73.72846 40.66632, -73.72851 40.6... | 0 | {'xmin': -73.728515625, 'xmax': -73.7254257202... |
| 722d3468-4f50-37a7-b809-1f0c1e7814ab | None | human_made | swimming_pool | [{'property': '', 'dataset': 'OpenStreetMap', ... | [(access, private), (leisure, swimming_pool)] | NaN | None | None | None | POLYGON ((-73.72678 40.65868, -73.72682 40.658... | 0 | {'xmin': -73.7268295288086, 'xmax': -73.726760... |
| 61f8df08-22b7-309b-a39d-1e27ffdea4c1 | None | human_made | swimming_pool | [{'property': '', 'dataset': 'OpenStreetMap', ... | [(access, private), (leisure, swimming_pool)] | NaN | None | None | None | POLYGON ((-73.72903 40.66007, -73.72902 40.660... | 0 | {'xmin': -73.72909545898438, 'xmax': -73.72901... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1b55cc6a-4875-35bb-973c-7f78107e9349 | None | human_made | swimming_pool | [{'property': '', 'dataset': 'OpenStreetMap', ... | [(access, private), (leisure, swimming_pool), ... | NaN | None | None | None | POLYGON ((-73.81194 40.88809, -73.81196 40.888... | 0 | {'xmin': -73.81196594238281, 'xmax': -73.81179... |
| f6710717-836e-3537-8dc4-b1d301691427 | None | water | water | [{'property': '', 'dataset': 'OpenStreetMap', ... | [(area, yes), (intermittent, no), (natural, wa... | NaN | None | False | None | POLYGON ((-73.91054 40.91505, -73.91045 40.915... | 1 | {'xmin': -73.94308471679688, 'xmax': -73.88537... |
| 79dced11-730d-34b5-bb9c-803ddfeb40f0 | {'primary': 'Hudson River', 'common': [('es', ... | river | river | [{'property': '', 'dataset': 'OpenStreetMap', ... | [(boat, yes), (canoe, yes), (canoe:description... | NaN | None | False | None | LINESTRING (-73.88916 41.04345, -73.89008 41.0... | 0 | {'xmin': -73.93392944335938, 'xmax': -73.88914... |
| 04b2eb53-8a9d-3c89-8c42-2afe32559c7d | None | ocean | ocean | [{'property': '', 'dataset': 'OpenStreetMap', ... | None | NaN | None | None | True | POLYGON ((-73.64509 41.0005, -73.64611 41.0005... | 1 | {'xmin': -74.00066375732422, 'xmax': -72.99933... |
| 6b67cfae-059c-387b-9391-14e8d4946a7b | {'primary': 'Long Island Sound', 'common': [('... | physical | bay | [{'property': '', 'dataset': 'OpenStreetMap', ... | [(natural, bay), (gnis:feature_id, 977427), (w... | NaN | Q867460 | None | None | POLYGON ((-72.23354 41.16044, -72.23316 41.160... | 2 | {'xmin': -73.80953979492188, 'xmax': -71.85736... |
49260 rows × 12 columns
wide_data
| id | geometry | base|water|canal|canal | base|water|canal|ditch | base|water|canal|drain | base|water|canal|moat | base|water|human_made|fish_pass | base|water|human_made|reflecting_pool | base|water|human_made|salt_pond | base|water|human_made|swimming_pool | base|water|lake|lagoon | ... | base|water|spring|geyser | base|water|spring|hot_spring | base|water|spring|spring | base|water|stream|stream | base|water|wastewater|sewage | base|water|water|dock | base|water|water|fairway | base|water|water|tidal_channel | base|water|water|wastewater | base|water|water|water |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| f3ff6909-6205-369d-9da8-f9aac25e43db | POLYGON ((-72.99934 40.68825, -72.99934 40.747... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 7198db90-6b71-3792-8947-b581d6c5b5c4 | POLYGON ((-73.72785 40.66063, -73.72784 40.660... | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| d3a3c77d-8078-3d98-b7b7-323323bacf44 | LINESTRING (-73.72846 40.66632, -73.72851 40.6... | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 722d3468-4f50-37a7-b809-1f0c1e7814ab | POLYGON ((-73.72678 40.65868, -73.72682 40.658... | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 61f8df08-22b7-309b-a39d-1e27ffdea4c1 | POLYGON ((-73.72903 40.66007, -73.72902 40.660... | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1b55cc6a-4875-35bb-973c-7f78107e9349 | POLYGON ((-73.81194 40.88809, -73.81196 40.888... | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| f6710717-836e-3537-8dc4-b1d301691427 | POLYGON ((-73.91054 40.91505, -73.91045 40.915... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 79dced11-730d-34b5-bb9c-803ddfeb40f0 | LINESTRING (-73.88916 41.04345, -73.89008 41.0... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 04b2eb53-8a9d-3c89-8c42-2afe32559c7d | POLYGON ((-73.64509 41.0005, -73.64611 41.0005... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 6b67cfae-059c-387b-9391-14e8d4946a7b | POLYGON ((-72.23354 41.16044, -72.23316 41.160... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
49260 rows × 37 columns
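The pivot from the original table to the wide table can be approximated in a few lines of plain Python. This is only an illustrative sketch, not OvertureMaestro's implementation: the `to_wide_form` helper and the `records` layout are assumptions made for this example, and the sample rows reuse subtype/class values from the first table above with shortened ids.

```python
# Illustrative sketch only; OvertureMaestro's real implementation differs.
# `to_wide_form` and the `records` layout are hypothetical names.


def to_wide_form(records, theme, type_, hierarchy_columns):
    """Pivot category values into boolean columns named theme|type|value|..."""
    # Build the column name of every record from its hierarchy values.
    names = [
        "|".join([theme, type_, *(rec[col] for col in hierarchy_columns)])
        for rec in records
    ]
    columns = sorted(set(names))
    # One row per feature; exactly one flag is True for each record here.
    return {
        rec["id"]: {col: col == name for col in columns}
        for rec, name in zip(records, names)
    }


# Shortened ids; subtype/class values taken from the original table.
records = [
    {"id": "f3ff6909", "subtype": "ocean", "class": "ocean"},
    {"id": "7198db90", "subtype": "human_made", "class": "swimming_pool"},
    {"id": "d3a3c77d", "subtype": "canal", "class": "drain"},
]

wide = to_wide_form(records, "base", "water", ["subtype", "class"])
for feature_id, flags in wide.items():
    print(feature_id, [col for col, flag in flags.items() if flag])
```

Unlike this sketch, which only creates columns for values it has seen, the wide table above also contains columns for categories that have no features in the selected area.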
Using this format, we can quickly filter the data or count the number of features per category.
wide_data.drop(columns="geometry").sum().sort_values(ascending=False)
base|water|human_made|swimming_pool 46846
base|water|stream|stream 677
base|water|water|water 500
base|water|pond|pond 353
base|water|river|river 210
base|water|water|wastewater 209
base|water|canal|ditch 109
base|water|reservoir|basin 74
base|water|physical|bay 54
base|water|water|tidal_channel 53
base|water|canal|drain 39
base|water|water|fairway 33
base|water|physical|cape 30
base|water|canal|canal 17
base|water|physical|waterfall 13
base|water|ocean|ocean 10
base|water|physical|shoal 9
base|water|lake|lake 6
base|water|reservoir|reservoir 6
base|water|physical|strait 5
base|water|human_made|reflecting_pool 4
base|water|spring|spring 2
base|water|human_made|fish_pass 1
base|water|canal|moat 0
base|water|water|dock 0
base|water|wastewater|sewage 0
base|water|human_made|salt_pond 0
base|water|physical|ocean 0
base|water|spring|hot_spring 0
base|water|spring|geyser 0
base|water|physical|sea 0
base|water|lake|lagoon 0
base|water|reservoir|water_storage 0
base|water|lake|oxbow 0
base|water|pond|fishpond 0
base|water|spring|blowhole 0
dtype: int64
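Because every category is a boolean column, filtering and counting reduce to ordinary pandas operations. The sketch below uses a tiny hand-made DataFrame as a stand-in for `wide_data` (the values are invented and the geometry column is omitted), so it runs without downloading anything.

```python
import pandas as pd

# Hand-made stand-in for `wide_data`; values are invented for illustration.
wide = pd.DataFrame(
    {
        "base|water|human_made|swimming_pool": [True, False, True, False],
        "base|water|canal|drain": [False, True, False, False],
        "base|water|river|river": [False, False, False, True],
    },
    index=pd.Index(["a", "b", "c", "d"], name="id"),
)

# Keep only the features flagged as swimming pools.
pools = wide[wide["base|water|human_made|swimming_pool"]]

# Count features per category: True sums as 1, False as 0.
counts = wide.sum().sort_values(ascending=False)

# Select every feature of one subtype by matching the column name prefix.
canal_columns = [col for col in wide.columns if col.startswith("base|water|canal|")]
canals = wide[wide[canal_columns].any(axis=1)]

print(list(pools.index))
print(counts.to_dict())
print(list(canals.index))
```

Matching on the column name prefix is a convenient way to select a whole subtype without listing each class by hand.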
Each theme/type pair has a defined list of hierarchy columns that is used to generate the final list of columns.
Most datasets use two columns (subtype and class), with three exceptions:
- base|land_cover → subtype only
- transportation|segment → subtype, class and subclass
- places|place → 1, 2, 3, ... (this one is described in detail below)
from overturemaestro.advanced_functions.wide_form import THEME_TYPE_CLASSIFICATION
for (theme_value, type_value), definition in sorted(THEME_TYPE_CLASSIFICATION.items()):
print(theme_value, type_value, definition.hierachy_columns)
base infrastructure ['subtype', 'class']
base land ['subtype', 'class']
base land_cover ['subtype']
base land_use ['subtype', 'class']
base water ['subtype', 'class']
buildings building ['subtype', 'class']
places place ['1', '2', '3', '4', '5', '6']
transportation segment ['subtype', 'class', 'subclass']
Multiple data types¶
You can also download data for multiple theme/type pairs at once, or even for all of them.
If some datasets were already downloaded during previous executions, only the missing data is downloaded.
Here we will look at the 10 most common features in both cases.
from overturemaestro.advanced_functions import (
convert_geometry_to_wide_form_geodataframe_for_all_types,
convert_geometry_to_wide_form_geodataframe_for_multiple_types,
)
two_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_multiple_types(
[("base", "water"), ("base", "land_cover")], amsterdam
)
two_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)
Finished operation in 0:00:07
base|water|human_made|swimming_pool 46846
base|water|stream|stream 677
base|land_cover|shrub 505
base|water|water|water 500
base|land_cover|barren 475
base|land_cover|forest 442
base|water|pond|pond 353
base|water|river|river 210
base|water|water|wastewater 209
base|land_cover|wetland 129
dtype: int64
len(two_datasets_gdf.columns)
47
all_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(
amsterdam, sort_result=False # we skip sorting the result here for faster execution
)
all_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)
Finished operation in 0:01:47
buildings|building 802916
base|infrastructure|barrier|kerb 128766
base|infrastructure|transportation|crossing 119176
buildings|building|residential|garage 104633
base|infrastructure|transit|parking_space 91669
buildings|building|residential|detached 76228
transportation|segment|road|footway 75404
transportation|segment|road|residential 59594
base|land|tree|tree 56081
transportation|segment|road|footway|sidewalk 53024
dtype: int64
len(all_datasets_gdf.columns)
2645
Limiting hierarchy depth¶
If you only need a higher-level aggregation of the data, you can limit the hierarchy depth.
By default, the full hierarchy is used to generate the columns.
Note
If you pass a value that is too high, it is automatically capped to the maximum depth available for the given theme/type pair.
limited_depth_water_gdf = convert_geometry_to_wide_form_geodataframe(
"base", "water", amsterdam, hierarchy_depth=1
)
limited_depth_water_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:11
base|water|canal 165
base|water|human_made 46851
base|water|lake 6
base|water|ocean 10
base|water|physical 111
base|water|pond 353
base|water|reservoir 80
base|water|river 210
base|water|spring 2
base|water|stream 677
base|water|wastewater 0
base|water|water 795
dtype: int64
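For a dataset where each feature has exactly one category, a depth-limited column is the union of the deeper columns that share its prefix, so its count is the sum of their counts. The sketch below checks this arithmetic for two subtypes, using the full-depth counts printed earlier in this notebook. `limit_depth` is a hypothetical helper written for this example, not part of OvertureMaestro.

```python
from collections import Counter

# Full-depth counts copied from the earlier output (canal and reservoir only).
full_depth_counts = {
    "base|water|canal|canal": 17,
    "base|water|canal|ditch": 109,
    "base|water|canal|drain": 39,
    "base|water|canal|moat": 0,
    "base|water|reservoir|basin": 74,
    "base|water|reservoir|reservoir": 6,
    "base|water|reservoir|water_storage": 0,
}


def limit_depth(counts, hierarchy_depth, prefix_parts=2):
    """Sum counts over columns sharing the first `hierarchy_depth` levels.

    `prefix_parts` is the number of leading name parts (theme and type)
    that are always kept. Hypothetical helper for illustration only.
    """
    limited = Counter()
    for column, count in counts.items():
        parts = column.split("|")
        limited["|".join(parts[: prefix_parts + hierarchy_depth])] += count
    return dict(limited)


print(limit_depth(full_depth_counts, hierarchy_depth=1))
# {'base|water|canal': 165, 'base|water|reservoir': 80}
print(limit_depth(full_depth_counts, hierarchy_depth=0))
# {'base|water': 245}
```

The depth 1 totals match the canal and reservoir rows of the output above. Summing only works when a feature carries a single True flag; for a dataset with several flags per feature, the limited columns would have to be recomputed per feature instead.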
Using a value of 0 produces just one column per theme/type pair.
limited_depth_all_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(
amsterdam, hierarchy_depth=0
)
limited_depth_all_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:43
base|infrastructure 529817
base|land 62758
base|land_cover 1714
base|land_use 44652
base|water 49260
buildings|building 1091617
places|place 205915
transportation|segment 309737
dtype: int64
You can also pass a list when downloading data for multiple datasets at once. The list of values must have the same length as the list of theme_type_pairs.
limited_depth_multiple_gdf = convert_geometry_to_wide_form_geodataframe_for_multiple_types(
[("places", "place"), ("base", "land_cover"), ("base", "water")],
amsterdam,
hierarchy_depth=[1, None, 0],
)
limited_depth_multiple_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:08
base|land_cover|barren 475
base|land_cover|crop 58
base|land_cover|forest 442
base|land_cover|grass 0
base|land_cover|mangrove 0
base|land_cover|moss 0
base|land_cover|shrub 505
base|land_cover|snow 0
base|land_cover|urban 105
base|land_cover|wetland 129
base|water 49260
places|place|accommodation 3316
places|place|active_life 8100
places|place|arts_and_entertainment 6226
places|place|attractions_and_activities 9351
places|place|automotive 5561
places|place|beauty_and_spa 13633
places|place|business_to_business 8536
places|place|eat_and_drink 42480
places|place|education 10132
places|place|financial_service 9706
places|place|health_and_medical 25909
places|place|home_service 7772
places|place|mass_media 1930
places|place|pets 1166
places|place|private_establishments_and_corporates 3076
places|place|professional_services 24072
places|place|public_service_and_government 9907
places|place|real_estate 5458
places|place|religious_organization 6778
places|place|retail 43700
places|place|structure_and_geography 776
places|place|travel 6119
dtype: int64
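How a per-dataset depth list lines up with the theme/type pairs can be sketched as follows. The `resolve_depths` helper is hypothetical, and its rules are inferred from this notebook rather than taken from the library: None is assumed to mean full depth, too-high values are capped as the note above describes, and raising a ValueError on a length mismatch is an assumption. The maximum depths are the lengths of the hierarchy column lists printed earlier.

```python
# Maximum depth per theme/type pair = number of hierarchy columns listed earlier.
MAX_DEPTH = {
    ("places", "place"): 6,
    ("base", "land_cover"): 1,
    ("base", "water"): 2,
}


def resolve_depths(theme_type_pairs, hierarchy_depth):
    """Return the effective depth for each pair (hypothetical helper)."""
    # A single value (or None) is applied to every pair.
    if hierarchy_depth is None or isinstance(hierarchy_depth, int):
        hierarchy_depth = [hierarchy_depth] * len(theme_type_pairs)
    # Assumption: a length mismatch is treated as an error.
    if len(hierarchy_depth) != len(theme_type_pairs):
        raise ValueError("hierarchy_depth must match theme_type_pairs in length")
    resolved = {}
    for pair, depth in zip(theme_type_pairs, hierarchy_depth):
        max_depth = MAX_DEPTH[pair]
        # Assumption: None means full depth; too-high values are capped.
        resolved[pair] = max_depth if depth is None else min(depth, max_depth)
    return resolved


pairs = [("places", "place"), ("base", "land_cover"), ("base", "water")]
print(resolve_depths(pairs, [1, None, 0]))
# {('places', 'place'): 1, ('base', 'land_cover'): 1, ('base', 'water'): 0}
print(resolve_depths(pairs, 10))
# {('places', 'place'): 6, ('base', 'land_cover'): 1, ('base', 'water'): 2}
```

The first call mirrors the example above: places are grouped at the first level, land cover keeps its single subtype level, and water collapses to one column.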
Places¶
Places data has a different schema than the other datasets, and it is the only one where a feature can have multiple categories at once: one primary category and, optionally, several alternate ones.
This structure is preserved in the wide format, which makes it the only dataset where a single feature can have multiple True values across the columns.
OvertureMaestro uses the categories column, with its primary and alternate sub-fields, to categorize features. The hierarchy depth of 6 is based on the official taxonomy of possible categories.
Two pyarrow filters are applied automatically when downloading places data for the wide format: the confidence value must be >= 0.75 and categories cannot be empty.
import pyarrow.compute as pc
category_not_null_filter = pc.invert(pc.field("categories").is_null())
minimal_confidence_filter = pc.field("confidence") >= pc.scalar(0.75)
combined_filter = category_not_null_filter & minimal_confidence_filter
original_places_data = convert_geometry_to_geodataframe(
"places",
"place",
amsterdam,
pyarrow_filter=combined_filter,
columns_to_download=["id", "geometry", "categories", "confidence"],
)
original_places_data
Finished operation in 0:00:10
| id | geometry | categories | confidence |
|---|---|---|---|
| d1d9bdd1-c030-40b1-9ae9-84935d0fd1fb | POINT (-74.25304 40.48667) | {'primary': 'lighthouse', 'alternate': ['landm... | 0.770604 |
| 3a528566-cbfb-410f-ac23-07cf529fdf43 | POINT (-74.23893 40.49926) | {'primary': 'seafood_restaurant', 'alternate':... | 0.905618 |
| 20f52a3a-c772-4e00-aa88-b59738b596e4 | POINT (-74.24071 40.49981) | {'primary': 'department_store', 'alternate': N... | 0.790186 |
| 25526275-c5a8-46ba-bc83-f071b8551d19 | POINT (-74.24232 40.49959) | {'primary': 'garbage_collection_service', 'alt... | 0.770000 |
| 7290c346-53f6-494c-bd47-536260cd0356 | POINT (-74.24486 40.49938) | {'primary': 'playground', 'alternate': ['park']} | 0.946945 |
| ... | ... | ... | ... |
| 9c6ffc77-830a-484b-9779-1a7351e16f71 | POINT (-73.75866 40.59265) | {'primary': 'beach', 'alternate': ['park', 'sp... | 0.805213 |
| e0b67409-7469-477d-9485-caae3b4b7a55 | POINT (-73.75863 40.59271) | {'primary': 'landmark_and_historical_building'... | 0.962183 |
| c1fe438b-20d5-45a6-bbaa-7ff00f9e75eb | POINT (-73.75774 40.59271) | {'primary': 'garbage_collection_service', 'alt... | 0.770000 |
| e995402e-fdc5-4f53-820c-ef1b70b22453 | POINT (-73.75387 40.59336) | {'primary': 'hospital', 'alternate': ['health_... | 0.976375 |
| a45bac2c-e6cb-4f4f-bfce-aff65ad931d1 | POINT (-73.75333 40.59327) | {'primary': 'senior_citizen_services', 'altern... | 0.951227 |
205915 rows × 3 columns
first_index = (
# Find the first object with at least two alternate categories (length > 1)
original_places_data[original_places_data.categories.str.get("alternate").str.len() > 1]
.iloc[0]
.name
)
first_index, original_places_data.loc[first_index].categories
('d1d9bdd1-c030-40b1-9ae9-84935d0fd1fb',
{'primary': 'lighthouse',
'alternate': array(['landmark_and_historical_building', 'monument'], dtype=object)})
wide_form_places_data = convert_geometry_to_wide_form_geodataframe("places", "place", amsterdam)
wide_form_places_data
Finished operation in 0:01:02
| id | geometry | places|place|accommodation | places|place|accommodation|bed_and_breakfast | places|place|accommodation|cabin | places|place|accommodation|campground | places|place|accommodation|cottage | places|place|accommodation|guest_house | places|place|accommodation|health_retreats | places|place|accommodation|holiday_rental_home | places|place|accommodation|hostel | ... | places|place|travel|transportation|transport_interchange | places|place|travel|transportation|water_taxi | places|place|travel|travel_services | places|place|travel|travel_services|luggage_storage | places|place|travel|travel_services|passport_and_visa_services | places|place|travel|travel_services|passport_and_visa_services|visa_agent | places|place|travel|travel_services|travel_agents | places|place|travel|travel_services|travel_agents|sightseeing_tour_agency | places|place|travel|travel_services|visitor_center | places|place|travel|vacation_rental_agents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d1d9bdd1-c030-40b1-9ae9-84935d0fd1fb | POINT (-74.25304 40.48667) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3a528566-cbfb-410f-ac23-07cf529fdf43 | POINT (-74.23893 40.49926) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 20f52a3a-c772-4e00-aa88-b59738b596e4 | POINT (-74.24071 40.49981) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 25526275-c5a8-46ba-bc83-f071b8551d19 | POINT (-74.24232 40.49959) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 7290c346-53f6-494c-bd47-536260cd0356 | POINT (-74.24486 40.49938) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9c6ffc77-830a-484b-9779-1a7351e16f71 | POINT (-73.75866 40.59265) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| e0b67409-7469-477d-9485-caae3b4b7a55 | POINT (-73.75863 40.59271) | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| c1fe438b-20d5-45a6-bbaa-7ff00f9e75eb | POINT (-73.75774 40.59271) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| e995402e-fdc5-4f53-820c-ef1b70b22453 | POINT (-73.75387 40.59336) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| a45bac2c-e6cb-4f4f-bfce-aff65ad931d1 | POINT (-73.75333 40.59327) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
205915 rows × 2117 columns
As you can see, only the columns for the categories listed in the categories column are True; the rest are False.
wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)
places|place|attractions_and_activities|landmark_and_historical_building True
places|place|attractions_and_activities|monument True
places|place|attractions_and_activities|lighthouse True
places|place|pets|pet_services|emergency_pet_hospital False
places|place|professional_services|emergency_service False
...
places|place|eat_and_drink|bar|milkshake_bar False
places|place|eat_and_drink|bar|milk_bar False
places|place|eat_and_drink|bar|lounge False
places|place|eat_and_drink|bar|kombucha False
places|place|travel|vacation_rental_agents False
Name: d1d9bdd1-c030-40b1-9ae9-84935d0fd1fb, Length: 2116, dtype: object
You can use places_use_primary_category_only to keep only the single primary category per feature, without the alternate categories.
primary_only_wide_form_places_data = convert_geometry_to_wide_form_geodataframe(
"places",
"place",
amsterdam,
places_use_primary_category_only=True,
)
primary_only_wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)
Finished operation in 0:00:12
places|place|attractions_and_activities|lighthouse True
places|place|professional_services|construction_services|stone_and_masonry|masonry_contractors False
places|place|professional_services|electrical_consultant False
places|place|professional_services|elder_care_planning False
places|place|professional_services|editorial_services False
...
places|place|eat_and_drink|bar|lounge False
places|place|eat_and_drink|bar|kombucha False
places|place|eat_and_drink|bar|irish_pub False
places|place|eat_and_drink|bar|hotel_bar False
places|place|travel|vacation_rental_agents False
Name: d1d9bdd1-c030-40b1-9ae9-84935d0fd1fb, Length: 2116, dtype: object
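The way the categories of a place turn into True columns can be sketched in plain Python. This is an assumption-laden illustration, not the library's code: `TAXONOMY_PATHS` is a hypothetical three-entry excerpt of the taxonomy reconstructed from the column names in the outputs above, and `true_columns` with its `primary_only` argument is a made-up helper that mimics the effect of places_use_primary_category_only.

```python
# Hypothetical excerpt of the places taxonomy: category -> path from the root.
# Only the three categories of the example feature above are included.
TAXONOMY_PATHS = {
    "lighthouse": ["attractions_and_activities", "lighthouse"],
    "landmark_and_historical_building": [
        "attractions_and_activities",
        "landmark_and_historical_building",
    ],
    "monument": ["attractions_and_activities", "monument"],
}


def true_columns(categories, primary_only=False):
    """Return the wide-form columns that would be True for one feature."""
    selected = [categories["primary"]]
    alternates = categories.get("alternate")
    # `alternate` can be None when a feature has no alternate categories.
    if not primary_only and alternates is not None:
        selected.extend(alternates)
    return sorted(
        "|".join(["places", "place", *TAXONOMY_PATHS[category]])
        for category in selected
    )


categories = {
    "primary": "lighthouse",
    "alternate": ["landmark_and_historical_building", "monument"],
}
print(true_columns(categories))
print(true_columns(categories, primary_only=True))
```

With all categories, the example feature gets the same three columns that are True in the first output above; with `primary_only=True`, only the lighthouse column remains.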
Below you can see how the counts of True values per column differ between the two variants.
wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)
places|place|eat_and_drink|restaurant 16863
places|place|health_and_medical 12485
places|place|health_and_medical|doctor 10826
places|place|retail 7904
places|place|beauty_and_spa|beauty_salon 7079
...
places|place|retail|food|bakery|flatbread 0
places|place|active_life|sports_and_recreation_venue|diving_center 0
places|place|active_life|sports_and_recreation_venue|diving_center|free_diving_center 0
places|place|retail|food|back_shop 0
places|place|attractions_and_activities|stargazing_area 0
Length: 2116, dtype: int64
primary_only_wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)
places|place|religious_organization|church_cathedral 3679
places|place|health_and_medical|doctor 3616
places|place|eat_and_drink|restaurant|pizza_restaurant 2895
places|place|beauty_and_spa|hair_salon 2780
places|place|private_establishments_and_corporates|corporate_office 2578
...
places|place|health_and_medical|traditional_chinese_medicine 0
places|place|business_to_business|business_manufacturing_and_supply|b2b_rubber_and_plastics|plastic_company 0
places|place|eat_and_drink|restaurant|meatball_restaurant 0
places|place|public_service_and_government|railway_service 0
places|place|professional_services|metal_detector_services 0
Length: 2116, dtype: int64
You can also change the minimal confidence value with the places_minimal_confidence parameter.
convert_geometry_to_wide_form_geodataframe(
"places", "place", amsterdam, places_minimal_confidence=0.95
)
Finished operation in 0:01:04
| id | geometry | places|place|accommodation | places|place|accommodation|bed_and_breakfast | places|place|accommodation|cabin | places|place|accommodation|campground | places|place|accommodation|cottage | places|place|accommodation|guest_house | places|place|accommodation|health_retreats | places|place|accommodation|holiday_rental_home | places|place|accommodation|hostel | ... | places|place|travel|transportation|transport_interchange | places|place|travel|transportation|water_taxi | places|place|travel|travel_services | places|place|travel|travel_services|luggage_storage | places|place|travel|travel_services|passport_and_visa_services | places|place|travel|travel_services|passport_and_visa_services|visa_agent | places|place|travel|travel_services|travel_agents | places|place|travel|travel_services|travel_agents|sightseeing_tour_agency | places|place|travel|travel_services|visitor_center | places|place|travel|vacation_rental_agents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8f12e63e-1c93-4c60-a7bc-555d801104c3 | POINT (-74.25125 40.50058) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| dd5c25f1-e8d5-4379-a0d2-3eb116ac82e4 | POINT (-74.25322 40.5032) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 799b7a13-60e6-48eb-9397-e770ddfe23af | POINT (-74.25421 40.50553) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2bab82d8-c3f4-45e3-8c39-98b07c064d4e | POINT (-74.24495 40.50832) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 547e2bed-376a-438c-a34f-9440ba4de7a7 | POINT (-74.24448 40.50991) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18f9d264-a2bb-4145-b4ec-4a770d35e978 | POINT (-73.74513 40.59795) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| a76f8690-d0ec-481d-a688-070ea8606c66 | POINT (-73.74031 40.59847) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 7f421a62-b7d1-467d-813f-1c6bc9b56d28 | POINT (-73.73994 40.59916) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| a14f3cd0-961c-4c19-b4dd-ca802ebae349 | POINT (-73.74217 40.59955) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| f44ebbb1-bff7-41a3-b5b1-06a67944602a | POINT (-73.74159 40.59542) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
66505 rows × 2117 columns
The full hierarchy of the places dataset is derived from the official taxonomy.
You can limit its depth to get fewer columns, with categories grouped at a higher level.
convert_geometry_to_wide_form_geodataframe("places", "place", amsterdam, hierarchy_depth=1)
Finished operation in 0:00:07
| id | geometry | places|place|accommodation | places|place|active_life | places|place|arts_and_entertainment | places|place|attractions_and_activities | places|place|automotive | places|place|beauty_and_spa | places|place|business_to_business | places|place|eat_and_drink | places|place|education | ... | places|place|mass_media | places|place|pets | places|place|private_establishments_and_corporates | places|place|professional_services | places|place|public_service_and_government | places|place|real_estate | places|place|religious_organization | places|place|retail | places|place|structure_and_geography | places|place|travel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d1d9bdd1-c030-40b1-9ae9-84935d0fd1fb | POINT (-74.25304 40.48667) | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3a528566-cbfb-410f-ac23-07cf529fdf43 | POINT (-74.23893 40.49926) | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 20f52a3a-c772-4e00-aa88-b59738b596e4 | POINT (-74.24071 40.49981) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 25526275-c5a8-46ba-bc83-f071b8551d19 | POINT (-74.24232 40.49959) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| 7290c346-53f6-494c-bd47-536260cd0356 | POINT (-74.24486 40.49938) | False | True | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9c6ffc77-830a-484b-9779-1a7351e16f71 | POINT (-73.75866 40.59265) | False | True | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| e0b67409-7469-477d-9485-caae3b4b7a55 | POINT (-73.75863 40.59271) | True | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| c1fe438b-20d5-45a6-bbaa-7ff00f9e75eb | POINT (-73.75774 40.59271) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| e995402e-fdc5-4f53-820c-ef1b70b22453 | POINT (-73.75387 40.59336) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | False | False |
| a45bac2c-e6cb-4f4f-bfce-aff65ad931d1 | POINT (-73.75333 40.59327) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
205915 rows × 23 columns
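To make the effect of hierarchy_depth more concrete, here is a minimal sketch of the grouping idea on a toy table. It is not OvertureMaestro's internal implementation: the collapse_columns helper, its prefix_parts argument, and the sample rows are hypothetical, and only the theme|type|category naming pattern is taken from the outputs above. Columns that share the same name prefix are merged with a logical OR.

```python
import pandas as pd

# Toy wide-form table. The rows are invented for illustration; only the
# "theme|type|category|..." naming pattern mirrors the outputs above.
wide = pd.DataFrame(
    {
        "places|place|eat_and_drink|restaurant": [True, False, False],
        "places|place|eat_and_drink|cafe": [False, True, False],
        "places|place|retail|shopping": [False, False, True],
    }
)


def collapse_columns(df: pd.DataFrame, depth: int, prefix_parts: int = 2) -> pd.DataFrame:
    """Hypothetical helper: OR together columns sharing the first
    `prefix_parts + depth` parts of their name."""
    grouped: dict[str, pd.Series] = {}
    for column in df.columns:
        short_name = "|".join(column.split("|")[: prefix_parts + depth])
        if short_name in grouped:
            # A feature belongs to the group if it belongs to any member.
            grouped[short_name] = grouped[short_name] | df[column]
        else:
            grouped[short_name] = df[column]
    return pd.DataFrame(grouped)


collapsed = collapse_columns(wide, depth=1)
print(collapsed.columns.tolist())
```

In this sketch the two eat_and_drink columns end up in a single places|place|eat_and_drink column, which mirrors how the depth-1 table above has far fewer columns than the full one.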
Pruning final list of columns¶
By default, OvertureMaestro includes all possible columns regardless of whether any features of a given category exist.
This keeps the overall schema consistent across geographical regions and simplifies the feature engineering process.
However, you can set the dedicated include_all_possible_columns parameter to False to keep only the columns for categories that actually occur in the downloaded features.
convert_geometry_to_wide_form_geodataframe(
"base", "infrastructure", amsterdam, include_all_possible_columns=True # default value
)
Finished operation in 0:00:04
| geometry | base|infrastructure|aerialway|aerialway_station | base|infrastructure|aerialway|cable_car | base|infrastructure|aerialway|chair_lift | base|infrastructure|aerialway|drag_lift | base|infrastructure|aerialway|gondola | base|infrastructure|aerialway|goods | base|infrastructure|aerialway|j-bar | base|infrastructure|aerialway|magic_carpet | base|infrastructure|aerialway|mixed_lift | ... | base|infrastructure|utility|utility_pole | base|infrastructure|utility|water_tower | base|infrastructure|waste_management|recycling | base|infrastructure|waste_management|waste_basket | base|infrastructure|waste_management|waste_disposal | base|infrastructure|water|breakwater | base|infrastructure|water|dam | base|infrastructure|water|drinking_water | base|infrastructure|water|fountain | base|infrastructure|water|weir | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 25627f2a-664d-35ca-a7c5-90c5fea051e7 | POINT (-74.25192 40.50054) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| 957f3e16-47bd-3a44-87cb-fd90531b2e2a | LINESTRING (-74.25409 40.5025, -74.25407 40.50... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 8274f299-be66-3bf2-aea0-ad4069582a12 | LINESTRING (-74.25379 40.50246, -74.25378 40.5... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| aa1a50f1-2a37-34f0-bc72-62b32cf116f5 | LINESTRING (-74.25388 40.50239, -74.25386 40.5... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 517cf5aa-6097-3195-9a0c-5bdaa3fafc5d | LINESTRING (-74.2538 40.50239, -74.25382 40.50... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| c2f6482b-2976-3841-98bc-30953ba98092 | POLYGON ((-73.93049 40.55266, -73.93047 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3bb1a4c5-8e49-357b-9831-42694aecc364 | POLYGON ((-73.93052 40.55265, -73.9305 40.5526... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| a3794d61-4b03-353e-89a6-e84ca36de853 | POLYGON ((-73.93043 40.55267, -73.93041 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 7d7fc443-8f5f-3efc-9d96-f35607c87c78 | POLYGON ((-73.9304 40.55268, -73.93038 40.5526... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 089a3ed4-6829-3838-9b94-18583e45542a | POLYGON ((-73.93037 40.55269, -73.93035 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
529817 rows × 164 columns
convert_geometry_to_wide_form_geodataframe(
"base", "infrastructure", amsterdam, include_all_possible_columns=False
)
Finished operation in 0:00:04
| geometry | base|infrastructure|aerialway|aerialway_station | base|infrastructure|aerialway|cable_car | base|infrastructure|aerialway|pylon | base|infrastructure|airport|airport_gate | base|infrastructure|airport|apron | base|infrastructure|airport|helipad | base|infrastructure|airport|heliport | base|infrastructure|airport|international_airport | base|infrastructure|airport|runway | ... | base|infrastructure|utility|utility_pole | base|infrastructure|utility|water_tower | base|infrastructure|waste_management|recycling | base|infrastructure|waste_management|waste_basket | base|infrastructure|waste_management|waste_disposal | base|infrastructure|water|breakwater | base|infrastructure|water|dam | base|infrastructure|water|drinking_water | base|infrastructure|water|fountain | base|infrastructure|water|weir | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 25627f2a-664d-35ca-a7c5-90c5fea051e7 | POINT (-74.25192 40.50054) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| 957f3e16-47bd-3a44-87cb-fd90531b2e2a | LINESTRING (-74.25409 40.5025, -74.25407 40.50... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 8274f299-be66-3bf2-aea0-ad4069582a12 | LINESTRING (-74.25379 40.50246, -74.25378 40.5... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| aa1a50f1-2a37-34f0-bc72-62b32cf116f5 | LINESTRING (-74.25388 40.50239, -74.25386 40.5... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 517cf5aa-6097-3195-9a0c-5bdaa3fafc5d | LINESTRING (-74.2538 40.50239, -74.25382 40.50... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| c2f6482b-2976-3841-98bc-30953ba98092 | POLYGON ((-73.93049 40.55266, -73.93047 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3bb1a4c5-8e49-357b-9831-42694aecc364 | POLYGON ((-73.93052 40.55265, -73.9305 40.5526... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| a3794d61-4b03-353e-89a6-e84ca36de853 | POLYGON ((-73.93043 40.55267, -73.93041 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 7d7fc443-8f5f-3efc-9d96-f35607c87c78 | POLYGON ((-73.9304 40.55268, -73.93038 40.5526... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 089a3ed4-6829-3838-9b94-18583e45542a | POLYGON ((-73.93037 40.55269, -73.93035 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
529817 rows × 121 columns
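The difference between the two outputs above (164 versus 121 columns for the same rows) can be illustrated on a toy table with plain pandas. This is only a sketch of the idea, assuming boolean category columns; the sample data is invented, it does not show how the library prunes columns internally, and a real wide-form GeoDataFrame would also carry a geometry column that should be excluded from the check.

```python
import pandas as pd

# Toy boolean table; the chair_lift category has no features in this "region".
wide = pd.DataFrame(
    {
        "base|infrastructure|aerialway|cable_car": [False, True, False],
        "base|infrastructure|aerialway|chair_lift": [False, False, False],
        "base|infrastructure|water|fountain": [True, False, False],
    }
)

# Keep only the columns where at least one feature is True.
non_empty = wide.loc[:, wide.any(axis=0)]
print(non_empty.columns.tolist())

# The full schema can be restored later, for example to align two regions
# before training a model on both.
restored = non_empty.reindex(columns=wide.columns, fill_value=False)
print(restored.columns.tolist())
```

Pruning saves memory for a single region, while the full column list is the safer choice when tables from several regions need identical columns.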
Getting a full list of possible column names¶
You can also preview the final list of columns before downloading the data by using the get_all_possible_column_names function.
You can specify the release, theme and type, as well as hierarchy_depth.
from overturemaestro.advanced_functions.wide_form import get_all_possible_column_names
get_all_possible_column_names(theme="base", type="water")
['base|water|canal|canal', 'base|water|canal|ditch', 'base|water|canal|drain', 'base|water|canal|moat', 'base|water|human_made|fish_pass', 'base|water|human_made|reflecting_pool', 'base|water|human_made|salt_pond', 'base|water|human_made|swimming_pool', 'base|water|lake|lagoon', 'base|water|lake|lake', 'base|water|lake|oxbow', 'base|water|ocean|ocean', 'base|water|physical|bay', 'base|water|physical|cape', 'base|water|physical|ocean', 'base|water|physical|sea', 'base|water|physical|shoal', 'base|water|physical|strait', 'base|water|physical|waterfall', 'base|water|pond|fishpond', 'base|water|pond|pond', 'base|water|reservoir|basin', 'base|water|reservoir|reservoir', 'base|water|reservoir|water_storage', 'base|water|river|river', 'base|water|spring|blowhole', 'base|water|spring|geyser', 'base|water|spring|hot_spring', 'base|water|spring|spring', 'base|water|stream|stream', 'base|water|wastewater|sewage', 'base|water|water|dock', 'base|water|water|fairway', 'base|water|water|tidal_channel', 'base|water|water|wastewater', 'base|water|water|water']
With all parameters left empty, the function returns the full list of all possible columns at the maximum depth.
columns = get_all_possible_column_names()
len(columns)
2644
columns[:10]
['base|infrastructure|aerialway|aerialway_station', 'base|infrastructure|aerialway|cable_car', 'base|infrastructure|aerialway|chair_lift', 'base|infrastructure|aerialway|drag_lift', 'base|infrastructure|aerialway|gondola', 'base|infrastructure|aerialway|goods', 'base|infrastructure|aerialway|j-bar', 'base|infrastructure|aerialway|magic_carpet', 'base|infrastructure|aerialway|mixed_lift', 'base|infrastructure|aerialway|platter']
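Because every column name starts with the theme and type separated by the | character, a list like this one can be summarised with the standard library alone. The sketch below uses a small hand-picked sample of names that appear in this notebook's outputs rather than the full list, and the theme_and_type helper is hypothetical.

```python
from collections import Counter

# A small sample of column names in the "theme|type|..." pattern.
column_names = [
    "base|infrastructure|aerialway|cable_car",
    "base|infrastructure|aerialway|gondola",
    "base|water|canal|canal",
    "buildings|building|civic",
]


def theme_and_type(name: str) -> tuple[str, str]:
    """Hypothetical helper: return the first two parts of a column name."""
    theme, feature_type, *_ = name.split("|")
    return theme, feature_type


# Count how many columns each (theme, type) pair contributes.
counts = Counter(theme_and_type(name) for name in column_names)
for (theme, feature_type), count in counts.items():
    print(theme, feature_type, count)
```

Applied to the full list returned by get_all_possible_column_names(), the same counting would show which datasets contribute the most columns.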
You can also specify different hierarchy_depth values.
get_all_possible_column_names(theme="buildings", type="building", hierarchy_depth=1)
['buildings|building', 'buildings|building|agricultural', 'buildings|building|civic', 'buildings|building|commercial', 'buildings|building|education', 'buildings|building|entertainment', 'buildings|building|industrial', 'buildings|building|medical', 'buildings|building|military', 'buildings|building|outbuilding', 'buildings|building|religious', 'buildings|building|residential', 'buildings|building|service', 'buildings|building|transportation']