Wide format¶
OvertureMaestro implements logic for transforming downloaded data into a wide
format. This format is intended for geospatial machine learning, where selected datasets are pivoted by category into a columnar format.
This notebook explores what this format is and how to work with it.
New functions¶
The new module contains the same set of functions as the basic API, with wide_form added to the name:
- convert_geometry_to_parquet → convert_geometry_to_wide_form_parquet
- convert_geometry_to_geodataframe → convert_geometry_to_wide_form_geodataframe
- other functions ...
Additionally, there are special functions for downloading all available datasets at once:
convert_geometry_to_wide_form_parquet_for_all_types
convert_geometry_to_wide_form_geodataframe_for_all_types
convert_bounding_box_to_wide_form_parquet_for_all_types
convert_bounding_box_to_wide_form_geodataframe_for_all_types
You can import them from the overturemaestro.advanced_functions module.
from overturemaestro import convert_geometry_to_geodataframe, geocode_to_geometry
from overturemaestro.advanced_functions import convert_geometry_to_wide_form_geodataframe
What is the wide format?¶
In this section we will compare how the original data format differs from the wide format based on water data.
Let's start by looking at the official Overture Maps schema for the base water data type:
import requests
import yaml
response = requests.get(
"https://raw.githubusercontent.com/OvertureMaps/schema/refs/tags/v1.4.0/schema/base/water.yaml",
allow_redirects=True,
)
water_schema = yaml.safe_load(response.content.decode("utf-8"))
water_schema
{'$schema': 'https://json-schema.org/draft/2020-12/schema', 'title': 'water', 'description': 'Physical representations of inland and ocean marine surfaces. Translates `natural` and `waterway` tags from OpenStreetMap.', 'type': 'object', 'properties': {'id': {'$ref': '../defs.yaml#/$defs/propertyDefinitions/id'}, 'geometry': {'unevaluatedProperties': False, 'oneOf': [{'$ref': 'https://geojson.org/schema/Point.json'}, {'$ref': 'https://geojson.org/schema/LineString.json'}, {'$ref': 'https://geojson.org/schema/Polygon.json'}, {'$ref': 'https://geojson.org/schema/MultiPolygon.json'}]}, 'properties': {'unevaluatedProperties': False, 'allOf': [{'$ref': '../defs.yaml#/$defs/propertyContainers/overtureFeaturePropertiesContainer'}, {'$ref': '../defs.yaml#/$defs/propertyContainers/levelContainer'}, {'$ref': '../defs.yaml#/$defs/propertyContainers/namesContainer'}, {'$ref': './defs.yaml#/$defs/propertyContainers/osmPropertiesContainer'}], 'required': ['subtype', 'class'], 'properties': {'subtype': {'description': 'The type of water body such as an river, ocean or lake.', 'default': ['water'], 'type': 'string', 'enum': ['canal', 'human_made', 'lake', 'ocean', 'physical', 'pond', 'reservoir', 'river', 'spring', 'stream', 'wastewater', 'water']}, 'class': {'description': 'Further description of the type of water', 'default': ['water'], 'enum': ['basin', 'bay', 'blowhole', 'canal', 'cape', 'ditch', 'dock', 'drain', 'fairway', 'fish_pass', 'fishpond', 'geyser', 'hot_spring', 'lagoon', 'lake', 'moat', 'ocean', 'oxbow', 'pond', 'reflecting_pool', 'reservoir', 'river', 'salt_pond', 'sea', 'sewage', 'shoal', 'spring', 'strait', 'stream', 'swimming_pool', 'tidal_channel', 'wastewater', 'water', 'water_storage', 'waterfall']}, 'is_salt': {'description': 'Is it salt water or not', 'type': 'boolean'}, 'is_intermittent': {'description': 'Is it intermittent water or not', 'type': 'boolean'}}}}}
Two required fields are defined in the specification: subtype and class. The schema also lists the possible values for each of them.
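The allowed values can be read straight from the parsed schema. The sketch below uses a hand-trimmed copy of the dictionary printed above, so it runs without a network call. The name trimmed_schema is introduced only for this illustration, and the class list is shortened.

```python
# Hand-trimmed copy of the parsed schema shown above; the real "class"
# enum is longer and is shortened here for illustration only.
trimmed_schema = {
    "properties": {
        "properties": {
            "required": ["subtype", "class"],
            "properties": {
                "subtype": {
                    "enum": [
                        "canal", "human_made", "lake", "ocean", "physical", "pond",
                        "reservoir", "river", "spring", "stream", "wastewater", "water",
                    ]
                },
                "class": {"enum": ["basin", "bay", "canal", "ditch", "drain"]},
            },
        }
    }
}

feature_properties = trimmed_schema["properties"]["properties"]
required_fields = feature_properties["required"]

# Map each required field to its list of allowed values.
allowed_values = {
    field: feature_properties["properties"][field]["enum"] for field in required_fields
}

for field, values in allowed_values.items():
    print(field, len(values), values[:3])
```

The same key path should work on the full water_schema dictionary, since it follows the nesting visible in the printed output.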
Both of these values describe what the feature is. Together they form the path: theme (base) → type (water) → subtype (e.g. reservoir) → class (e.g. basin).
Based on this hierarchy, all available values can be determined and mapped to columns.
In this way, you will obtain data in a wide format, where each feature defines what it is with boolean flags.
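To make the idea concrete, here is a minimal sketch of the pivot on a toy table, assuming pandas is available. It is not how OvertureMaestro implements the transformation, and the three rows are made up. It only shows how a subtype/class pair turns into a boolean column.

```python
import pandas as pd

# Toy stand-in for the original (long) format; not real Overture data.
original = pd.DataFrame(
    {"subtype": ["canal", "water", "river"], "class": ["drain", "water", "river"]},
    index=pd.Index(["a", "b", "c"], name="id"),
)

# Build one column name per feature: theme|type|subtype|class.
column_names = "base|water|" + original["subtype"] + "|" + original["class"]

# One-hot encode the names so each feature gets boolean flags.
wide = pd.get_dummies(column_names, dtype=bool)
print(wide)
```

Unlike this sketch, the library functions also emit columns for categories that do not occur in the downloaded data.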
amsterdam = geocode_to_geometry("Amsterdam")
original_data = convert_geometry_to_geodataframe("base", "water", amsterdam)
wide_data = convert_geometry_to_wide_form_geodataframe("base", "water", amsterdam)
Finished operation in 0:00:15
Finished operation in 0:00:06
original_data
geometry | bbox | version | sources | level | subtype | class | names | source_tags | wikidata | is_salt | is_intermittent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||
08b196952bad6fff0004b3d6ac192a67 | LINESTRING (4.95544 52.27891, 4.95537 52.27889) | {'xmin': 4.955374240875244, 'xmax': 4.95543622... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | canal | drain | None | [(waterway, drain)] | None | None | None |
08b196952ba89fff0004bfa69663858f | LINESTRING (4.95582 52.27935, 4.95583 52.27921) | {'xmin': 4.955817699432373, 'xmax': 4.95583009... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | canal | drain | None | [(waterway, drain)] | None | None | None |
08b196952ba81fff0004b897f2a10807 | POLYGON ((4.95466 52.28024, 4.95465 52.28023, ... | {'xmin': 4.954648494720459, 'xmax': 4.95503425... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
08b196952bab3fff0004baa256fea3ee | POLYGON ((4.95451 52.28086, 4.95447 52.28091, ... | {'xmin': 4.953765392303467, 'xmax': 4.95451402... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
08b196952b0c3fff0004ba4f87feef2a | LINESTRING (4.96358 52.27727, 4.96162 52.27799... | {'xmin': 4.92629337310791, 'xmax': 4.963577270... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | river | river | {'primary': 'Holendrecht', 'common': None, 'ru... | [(waterway, river)] | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b1969cda4c6fff0004b4ac01121b5f | POLYGON ((4.99712 52.28935, 4.99711 52.28933, ... | {'xmin': 4.997106552124023, 'xmax': 5.00078153... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
08b1969cda4dcfff0004bd544fd4ac37 | LINESTRING (4.98152 52.2741, 4.98177 52.2745, ... | {'xmin': 4.9815239906311035, 'xmax': 5.0154547... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | river | river | {'primary': 'Gein', 'common': None, 'rules': N... | [(waterway, river)] | None | None | None |
08b1969cda4d0fff0004b7c6329a4e5a | POLYGON ((4.9933 52.28946, 4.99328 52.28941, 4... | {'xmin': 4.993284702301025, 'xmax': 5.00190782... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
08b1969cda4dcfff0004b91ade6aebc1 | POLYGON ((4.98114 52.2738, 4.98118 52.27382, 4... | {'xmin': 4.981075286865234, 'xmax': 5.01563167... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | river | river | None | [(natural, water), (water, river)] | None | None | None |
08b196954115cfff0004c5e0f5272b7a | POLYGON ((5.07713 52.10181, 5.07713 52.10183, ... | {'xmin': 4.993065357208252, 'xmax': 5.08212852... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | canal | canal | None | [(natural, water), (water, canal)] | None | None | None |
8803 rows × 12 columns
wide_data
geometry | base|water|canal|canal | base|water|canal|ditch | base|water|canal|drain | base|water|canal|moat | base|water|human_made|fish_pass | base|water|human_made|reflecting_pool | base|water|human_made|salt_pond | base|water|human_made|swimming_pool | base|water|lake|lagoon | ... | base|water|spring|geyser | base|water|spring|hot_spring | base|water|spring|spring | base|water|stream|stream | base|water|wastewater|sewage | base|water|water|dock | base|water|water|fairway | base|water|water|tidal_channel | base|water|water|wastewater | base|water|water|water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08b196952bad6fff0004b3d6ac192a67 | LINESTRING (4.95544 52.27891, 4.95537 52.27889) | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196952ba89fff0004bfa69663858f | LINESTRING (4.95582 52.27935, 4.95583 52.27921) | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196952ba81fff0004b897f2a10807 | POLYGON ((4.95466 52.28024, 4.95465 52.28023, ... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
08b196952bab3fff0004baa256fea3ee | POLYGON ((4.95451 52.28086, 4.95447 52.28091, ... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
08b196952b0c3fff0004ba4f87feef2a | LINESTRING (4.96358 52.27727, 4.96162 52.27799... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b1969cda4c6fff0004b4ac01121b5f | POLYGON ((4.99712 52.28935, 4.99711 52.28933, ... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
08b1969cda4dcfff0004bd544fd4ac37 | LINESTRING (4.98152 52.2741, 4.98177 52.2745, ... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969cda4d0fff0004b7c6329a4e5a | POLYGON ((4.9933 52.28946, 4.99328 52.28941, 4... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
08b1969cda4dcfff0004b91ade6aebc1 | POLYGON ((4.98114 52.2738, 4.98118 52.27382, 4... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196954115cfff0004c5e0f5272b7a | POLYGON ((5.07713 52.10181, 5.07713 52.10183, ... | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
8803 rows × 37 columns
Using this format, we can quickly filter the data or calculate the number of features per category.
wide_data.drop(columns="geometry").sum().sort_values(ascending=False)
base|water|water|water 5150
base|water|canal|canal 1560
base|water|canal|ditch 1029
base|water|canal|drain 736
base|water|pond|pond 82
base|water|water|fairway 64
base|water|human_made|swimming_pool 47
base|water|river|river 45
base|water|stream|stream 25
base|water|reservoir|basin 24
base|water|water|wastewater 23
base|water|lake|lake 6
base|water|canal|moat 5
base|water|reservoir|reservoir 3
base|water|physical|bay 2
base|water|water|dock 1
base|water|human_made|reflecting_pool 1
base|water|spring|spring 0
base|water|spring|hot_spring 0
base|water|spring|geyser 0
base|water|spring|blowhole 0
base|water|wastewater|sewage 0
base|water|water|tidal_channel 0
base|water|pond|fishpond 0
base|water|reservoir|water_storage 0
base|water|physical|strait 0
base|water|physical|shoal 0
base|water|physical|sea 0
base|water|physical|ocean 0
base|water|physical|cape 0
base|water|ocean|ocean 0
base|water|lake|oxbow 0
base|water|lake|lagoon 0
base|water|human_made|salt_pond 0
base|water|human_made|fish_pass 0
base|water|physical|waterfall 0
dtype: int64
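Filtering works with ordinary boolean masks. The sketch below uses a made-up toy table (geometry omitted) rather than wide_data, and assumes pandas is available.

```python
import pandas as pd

# Toy wide-form table (geometry column omitted); values are made up.
toy_wide = pd.DataFrame(
    {
        "base|water|canal|canal": [True, False, False, True],
        "base|water|canal|drain": [False, True, False, False],
        "base|water|water|water": [False, False, True, False],
    },
    index=pd.Index(["a", "b", "c", "d"], name="id"),
)

# Keep only features flagged as a given category.
canals = toy_wide[toy_wide["base|water|canal|canal"]]

# Keep features matching any column under the "canal" subtype.
canal_columns = [c for c in toy_wide.columns if c.startswith("base|water|canal|")]
any_canal = toy_wide[toy_wide[canal_columns].any(axis=1)]

print(list(canals.index))
print(list(any_canal.index))
```

Because the column names encode the whole hierarchy, selecting a subtree is just a prefix match on the column names.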
Each theme/type pair has a defined list of hierarchy columns used to generate the final list of columns.
Most datasets use two columns (subtype and class), with three exceptions:
- base|land_cover → subtype only
- transportation|segment → subtype, class and subclass
- places|place → 1, 2, 3, ... (this one is described in detail below)
from overturemaestro.advanced_functions.wide_form import THEME_TYPE_CLASSIFICATION
for (theme_value, type_value), definition in sorted(THEME_TYPE_CLASSIFICATION.items()):
print(theme_value, type_value, definition.hierachy_columns)
base infrastructure ['subtype', 'class']
base land ['subtype', 'class']
base land_cover ['subtype']
base land_use ['subtype', 'class']
base water ['subtype', 'class']
buildings building ['subtype', 'class']
places place ['1', '2', '3', '4', '5', '6']
transportation segment ['subtype', 'class', 'subclass']
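The hierarchy columns determine how many segments a wide-form column name has. The following sketch shows the naming pattern seen in the outputs. build_column_name is a hypothetical helper written for this illustration, not a library function.

```python
# Hierarchy columns copied from the printed output above (subset).
hierarchy_columns = {
    ("base", "land_cover"): ["subtype"],
    ("base", "water"): ["subtype", "class"],
    ("transportation", "segment"): ["subtype", "class", "subclass"],
}


def build_column_name(theme, type_, feature):
    """Hypothetical helper: join theme, type and hierarchy values with '|'."""
    values = [feature[column] for column in hierarchy_columns[(theme, type_)]]
    return "|".join([theme, type_, *values])


land_cover_name = build_column_name("base", "land_cover", {"subtype": "forest"})
water_name = build_column_name("base", "water", {"subtype": "reservoir", "class": "basin"})
print(land_cover_name)
print(water_name)
```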
Multiple data types¶
You can also download data for multiple theme/type pairs at once, or even for all of them.
If some datasets were already downloaded during previous executions, only the missing data is downloaded.
Here we will look at the top 10 most common features for both examples.
from overturemaestro.advanced_functions import (
convert_geometry_to_wide_form_geodataframe_for_all_types,
convert_geometry_to_wide_form_geodataframe_for_multiple_types,
)
two_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_multiple_types(
[("base", "water"), ("base", "land_cover")], amsterdam
)
two_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)
Finished operation in 0:00:04
base|water|water|water 5150
base|water|canal|canal 1560
base|water|canal|ditch 1029
base|water|canal|drain 736
base|land_cover|shrub 308
base|land_cover|forest 169
base|land_cover|urban 89
base|water|pond|pond 82
base|land_cover|barren 82
base|water|water|fairway 64
dtype: int64
len(two_datasets_gdf.columns)
47
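The 47 columns are the geometry column plus 36 water columns and 10 land cover columns. To see how many columns each dataset contributes, you can group the column names by their theme|type prefix. The sketch below uses only a handful of names from the outputs above, and dataset_of is a hypothetical helper.

```python
from collections import Counter

# A few column names taken from the outputs above; the real frame has 47.
columns = [
    "geometry",
    "base|water|water|water",
    "base|water|canal|canal",
    "base|water|canal|ditch",
    "base|land_cover|shrub",
    "base|land_cover|forest",
]


def dataset_of(column):
    """Return the 'theme|type' prefix of a wide-form column name."""
    theme, type_ = column.split("|")[:2]
    return f"{theme}|{type_}"


columns_per_dataset = Counter(dataset_of(c) for c in columns if c != "geometry")
print(columns_per_dataset)
```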
all_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(amsterdam)
all_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)
Finished operation in 0:01:05
buildings|building 72607
buildings|building|residential|house 54853
buildings|building|residential|apartments 53995
base|land|tree|tree 37170
base|infrastructure|transit|parking_space 21925
transportation|segment|road|footway 16707
base|land_use|managed|grass 13810
transportation|segment|road|residential 10853
base|infrastructure|barrier|fence 7619
base|infrastructure|transportation|crossing 6710
dtype: int64
len(all_datasets_gdf.columns)
2633
Limiting hierarchy depth¶
If you only need a higher-level aggregation of the data, you can limit the hierarchy depth.
By default, the full hierarchy is used to generate the columns.
Note
If you pass a value that is too high, it is automatically capped at the maximum depth available for the given theme/type pair.
limited_depth_water_gdf = convert_geometry_to_wide_form_geodataframe(
"base", "water", amsterdam, hierarchy_depth=1
)
limited_depth_water_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:04
base|water|canal 3330
base|water|human_made 48
base|water|lake 6
base|water|ocean 0
base|water|physical 2
base|water|pond 82
base|water|reservoir 27
base|water|river 45
base|water|spring 0
base|water|stream 25
base|water|wastewater 0
base|water|water 5238
dtype: int64
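The depth-limited counts can be cross-checked against the full-depth counts shown earlier. This sketch truncates column names to a given depth and sums the counts; truncate_column is a hypothetical helper, not part of the library. Summing is valid here only because each water feature has exactly one True value. For places, the merge would presumably be a logical OR per feature.

```python
from collections import defaultdict

# Full-depth counts for the canal subtype, copied from the earlier output.
full_depth_counts = {
    "base|water|canal|canal": 1560,
    "base|water|canal|ditch": 1029,
    "base|water|canal|drain": 736,
    "base|water|canal|moat": 5,
}


def truncate_column(column, hierarchy_depth):
    """Keep theme, type and the first `hierarchy_depth` hierarchy levels."""
    parts = column.split("|")
    # Slicing past the end is harmless, so an overly large depth is capped.
    return "|".join(parts[: 2 + hierarchy_depth])


grouped = defaultdict(int)
for column, count in full_depth_counts.items():
    grouped[truncate_column(column, 1)] += count

print(dict(grouped))
```

The result, 3330 for base|water|canal, matches the value in the depth-limited output above.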
Using a value of 0 results in just a list of theme/type pairs.
limited_depth_all_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(
amsterdam, hierarchy_depth=0
)
limited_depth_all_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:14
base|infrastructure 92990
base|land 49510
base|land_cover 702
base|land_use 24816
base|water 8803
buildings|building 199958
places|place 28196
transportation|segment 61855
dtype: int64
You can also pass a list when downloading data for multiple datasets at once. The list must have the same length as the list of theme_type_pairs.
limited_depth_multiple_gdf = convert_geometry_to_wide_form_geodataframe_for_multiple_types(
[("places", "place"), ("base", "land_cover"), ("base", "water")],
amsterdam,
hierarchy_depth=[1, None, 0],
)
limited_depth_multiple_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:02
base|land_cover|barren 82
base|land_cover|crop 24
base|land_cover|forest 169
base|land_cover|grass 1
base|land_cover|mangrove 0
base|land_cover|moss 0
base|land_cover|shrub 308
base|land_cover|snow 0
base|land_cover|urban 89
base|land_cover|wetland 29
base|water 8803
places|place|accommodation 1090
places|place|active_life 2042
places|place|arts_and_entertainment 2003
places|place|attractions_and_activities 1671
places|place|automotive 560
places|place|beauty_and_spa 2039
places|place|business_to_business 2362
places|place|eat_and_drink 5538
places|place|education 1633
places|place|financial_service 831
places|place|health_and_medical 2361
places|place|home_service 1081
places|place|mass_media 811
places|place|pets 150
places|place|private_establishments_and_corporates 117
places|place|professional_services 7308
places|place|public_service_and_government 1982
places|place|real_estate 821
places|place|religious_organization 342
places|place|retail 7192
places|place|structure_and_geography 371
places|place|travel 1353
dtype: int64
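The depths are matched to the theme/type pairs by position. The sketch below pairs them up and checks the lengths before any call is made. It treats None as "use the full depth", which is an assumption based on the default behaviour described above.

```python
theme_type_pairs = [("places", "place"), ("base", "land_cover"), ("base", "water")]
hierarchy_depth = [1, None, 0]

# Mirror the documented requirement: one depth per theme/type pair.
if len(hierarchy_depth) != len(theme_type_pairs):
    raise ValueError("hierarchy_depth must have one value per theme/type pair")

depth_per_pair = dict(zip(theme_type_pairs, hierarchy_depth))
for pair, depth in depth_per_pair.items():
    # None is assumed to mean the full hierarchy for that pair.
    label = "full depth (assumed meaning of None)" if depth is None else f"depth {depth}"
    print(pair, "->", label)
```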
Places¶
Places data has a different schema than the other datasets, and it is the only one where a feature can have multiple categories at once: a primary category and, optionally, multiple alternate ones.
This structure is preserved in the wide format, making it the only dataset where a single feature can have multiple True values in the columns.
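A quick way to spot such features is to count the True values per row. The sketch below uses a made-up toy table instead of real places data and assumes pandas is available.

```python
import pandas as pd

# Toy wide-form places table; the flags are made up for illustration.
toy_places = pd.DataFrame(
    {
        "places|place|eat_and_drink|restaurant": [True, True, False],
        "places|place|eat_and_drink|restaurant|diner": [True, False, False],
        "places|place|accommodation|hotel": [False, False, True],
    },
    index=pd.Index(["a", "b", "c"], name="id"),
)

# Number of categories flagged for each feature.
categories_per_feature = toy_places.sum(axis=1)

# Features that carry more than one category.
multi_category = toy_places[categories_per_feature > 1]
print(categories_per_feature.to_dict())
print(list(multi_category.index))
```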
OvertureMaestro uses the categories column with its primary and alternate sub-fields to categorize features. The hierarchy depth of 6 is based on the official taxonomy of possible categories.
Two pyarrow filters are applied automatically when downloading the data for the wide format: the confidence value must be >= 0.75 and categories cannot be empty.
import pyarrow.compute as pc
category_not_null_filter = pc.invert(pc.field("categories").is_null())
minimal_confidence_filter = pc.field("confidence") >= pc.scalar(0.75)
combined_filter = category_not_null_filter & minimal_confidence_filter
original_places_data = convert_geometry_to_geodataframe(
"places",
"place",
amsterdam,
pyarrow_filter=combined_filter,
columns_to_download=["id", "geometry", "categories", "confidence"],
)
original_places_data
Finished operation in 0:00:03
geometry | categories | confidence | |
---|---|---|---|
id | |||
08f19695336d340003f6fdf3a790a782 | POINT (4.82069 52.32627) | {'primary': 'diner', 'alternate': ['restaurant... | 0.979399 |
08f1969514a14c28034b4eefc83a719b | POINT (4.81088 52.33052) | {'primary': 'restaurant', 'alternate': ['frenc... | 0.979399 |
08f1969514a0636b03b4e801caa4077c | POINT (4.81242 52.33008) | {'primary': 'fashion', 'alternate': ['engineer... | 0.877056 |
08f1969514a14a24031d524c97baacd2 | POINT (4.81126 52.33039) | {'primary': 'arts_and_entertainment', 'alterna... | 0.757967 |
08f1969514869a08031e47174edb17be | POINT (4.81395 52.33421) | {'primary': 'park', 'alternate': ['lake', 'act... | 0.757967 |
... | ... | ... | ... |
08f196952d76b67303ce11872e91b54b | POINT (4.99107 52.2995) | {'primary': 'restaurant', 'alternate': ['veget... | 0.979399 |
08f196952d2d3413037e883b570aacc4 | POINT (4.9934 52.29578) | {'primary': 'hotel', 'alternate': ['bed_and_br... | 0.935964 |
08f196952d2f2323037dd4ead1c9f7f9 | POINT (4.9946 52.29682) | {'primary': 'graphic_designer', 'alternate': [... | 0.804933 |
08f196952d2e225a03caf05ea4067ee7 | POINT (4.99605 52.29664) | {'primary': 'psychotherapist', 'alternate': ['... | 0.864553 |
08f196952995b47403eb6a97bdf35d2a | POINT (4.99254 52.29225) | {'primary': 'painting', 'alternate': ['contrac... | 0.864553 |
28196 rows × 3 columns
first_index = (
# Find first object with more than one alternate category
original_places_data[original_places_data.categories.str.get("alternate").str.len() > 1]
.iloc[0]
.name
)
first_index, original_places_data.loc[first_index].categories
('08f19695336d340003f6fdf3a790a782', {'primary': 'diner', 'alternate': array(['restaurant', 'urban_farm'], dtype=object)})
wide_form_places_data = convert_geometry_to_wide_form_geodataframe("places", "place", amsterdam)
wide_form_places_data
Finished operation in 0:00:18
geometry | places|place|accommodation | places|place|accommodation|bed_and_breakfast | places|place|accommodation|cabin | places|place|accommodation|campground | places|place|accommodation|cottage | places|place|accommodation|guest_house | places|place|accommodation|health_retreats | places|place|accommodation|holiday_rental_home | places|place|accommodation|hostel | ... | places|place|travel|transportation|transport_interchange | places|place|travel|transportation|water_taxi | places|place|travel|travel_services | places|place|travel|travel_services|luggage_storage | places|place|travel|travel_services|passport_and_visa_services | places|place|travel|travel_services|passport_and_visa_services|visa_agent | places|place|travel|travel_services|travel_agents | places|place|travel|travel_services|travel_agents|sightseeing_tour_agency | places|place|travel|travel_services|visitor_center | places|place|travel|vacation_rental_agents | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08f19695336d340003f6fdf3a790a782 | POINT (4.82069 52.32627) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969514a14c28034b4eefc83a719b | POINT (4.81088 52.33052) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969514a0636b03b4e801caa4077c | POINT (4.81242 52.33008) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969514a14a24031d524c97baacd2 | POINT (4.81126 52.33039) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969514869a08031e47174edb17be | POINT (4.81395 52.33421) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f196952d76b67303ce11872e91b54b | POINT (4.99107 52.2995) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196952d2d3413037e883b570aacc4 | POINT (4.9934 52.29578) | True | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196952d2f2323037dd4ead1c9f7f9 | POINT (4.9946 52.29682) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196952d2e225a03caf05ea4067ee7 | POINT (4.99605 52.29664) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196952995b47403eb6a97bdf35d2a | POINT (4.99254 52.29225) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
28196 rows × 2117 columns
As you can see, only the columns for categories listed in the categories column are True; all others are False.
wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)
places|place|eat_and_drink|restaurant|diner True
places|place|eat_and_drink|restaurant True
places|place|business_to_business|b2b_agriculture_and_food|b2b_farming|b2b_farms|urban_farm True
places|place|professional_services|courier_and_delivery_services False
places|place|professional_services|crane_services False
...
places|place|eat_and_drink|bar|lounge False
places|place|eat_and_drink|bar|kombucha False
places|place|eat_and_drink|bar|irish_pub False
places|place|eat_and_drink|bar|hotel_bar False
places|place|travel|vacation_rental_agents False
Name: 08f19695336d340003f6fdf3a790a782, Length: 2116, dtype: object
You can use places_use_primary_category_only to keep only a single category per feature, without alternates.
primary_only_wide_form_places_data = convert_geometry_to_wide_form_geodataframe(
"places",
"place",
amsterdam,
places_use_primary_category_only=True,
)
primary_only_wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)
Finished operation in 0:00:02
places|place|eat_and_drink|restaurant|diner True
places|place|professional_services|digitizing_services False
places|place|professional_services|copywriting_service False
places|place|professional_services|courier_and_delivery_services False
places|place|professional_services|crane_services False
...
places|place|eat_and_drink|bar|kombucha False
places|place|eat_and_drink|bar|irish_pub False
places|place|eat_and_drink|bar|hotel_bar False
places|place|eat_and_drink|bar|hookah_bar False
places|place|travel|vacation_rental_agents False
Name: 08f19695336d340003f6fdf3a790a782, Length: 2116, dtype: object
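The two results above can be reproduced conceptually with a small lookup from category name to its taxonomy path. This is only a sketch. The taxonomy_paths dictionary is a tiny hand-written excerpt matching the columns shown above, and true_columns is a hypothetical helper, not the library's implementation.

```python
# Toy excerpt of the category taxonomy; the paths match the columns shown above.
taxonomy_paths = {
    "diner": "eat_and_drink|restaurant|diner",
    "restaurant": "eat_and_drink|restaurant",
    "urban_farm": "business_to_business|b2b_agriculture_and_food|b2b_farming|b2b_farms|urban_farm",
}


def true_columns(categories, primary_only=False):
    """Hypothetical helper: return the wide-form columns flagged for a feature."""
    names = [categories["primary"]]
    if not primary_only:
        # Alternates are optional, so treat a missing or empty list as no alternates.
        names.extend(categories.get("alternate") or [])
    return {f"places|place|{taxonomy_paths[name]}" for name in names}


categories = {"primary": "diner", "alternate": ["restaurant", "urban_farm"]}
print(sorted(true_columns(categories)))
print(sorted(true_columns(categories, primary_only=True)))
```

With alternates included, three columns are flagged. With the primary category only, just the diner column remains.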
Below you can see the difference in the counts of True values across all columns.
wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)
places|place|professional_services 3740
places|place|eat_and_drink|restaurant 3137
places|place|retail|shopping 2240
places|place|eat_and_drink|cafe 1553
places|place|beauty_and_spa 1380
...
places|place|health_and_medical|psychomotor_therapist 0
places|place|health_and_medical|placenta_encapsulation_service 0
places|place|attractions_and_activities|paddleboard_rental 0
places|place|attractions_and_activities|parasailing_ride_service 0
places|place|travel|vacation_rental_agents 0
Length: 2116, dtype: int64
primary_only_wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)
places|place|professional_services 1187
places|place|eat_and_drink|restaurant 959
places|place|accommodation|hotel 672
places|place|public_service_and_government|community_services 534
places|place|beauty_and_spa|beauty_salon 534
...
places|place|health_and_medical|doctor|pediatrician|pediatric_orthopedic_surgery 0
places|place|health_and_medical|doctor|pediatrician|pediatric_oncology 0
places|place|health_and_medical|doctor|pediatrician|pediatric_neurology 0
places|place|health_and_medical|doctor|pediatrician|pediatric_nephrology 0
places|place|travel|vacation_rental_agents 0
Length: 2116, dtype: int64
You can also change the minimal confidence value with the places_minimal_confidence parameter.
convert_geometry_to_wide_form_geodataframe(
"places", "place", amsterdam, places_minimal_confidence=0.95
)
Finished operation in 0:00:20
geometry | places|place|accommodation | places|place|accommodation|bed_and_breakfast | places|place|accommodation|cabin | places|place|accommodation|campground | places|place|accommodation|cottage | places|place|accommodation|guest_house | places|place|accommodation|health_retreats | places|place|accommodation|holiday_rental_home | places|place|accommodation|hostel | ... | places|place|travel|transportation|transport_interchange | places|place|travel|transportation|water_taxi | places|place|travel|travel_services | places|place|travel|travel_services|luggage_storage | places|place|travel|travel_services|passport_and_visa_services | places|place|travel|travel_services|passport_and_visa_services|visa_agent | places|place|travel|travel_services|travel_agents | places|place|travel|travel_services|travel_agents|sightseeing_tour_agency | places|place|travel|travel_services|visitor_center | places|place|travel|vacation_rental_agents | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08f19695336d340003f6fdf3a790a782 | POINT (4.82069 52.32627) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196953349e0220332fbab7c0624cc | POINT (4.81461 52.3348) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196953349e63403aee60021ce4eeb | POINT (4.81443 52.33493) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196953359c25b03e57a7b8aeb4046 | POINT (4.81732 52.33739) | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196953358ec19034e9ee0feebed60 | POINT (4.81793 52.33744) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f1969cde71b75103d79e5ba0dbeaad | POINT (5.01877 52.30536) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969cde7a8859030034fd5260c13f | POINT (5.01768 52.30598) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969cde71408a0309167ec87891cf | POINT (5.01872 52.3071) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196952d0488b00335a268db7ecf1e | POINT (4.99733 52.30271) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196952d76b67303ce11872e91b54b | POINT (4.99107 52.2995) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
11142 rows × 2117 columns
The full hierarchy of the places dataset is derived from the official taxonomy.
You can limit it to get fewer columns, with grouped categories.
convert_geometry_to_wide_form_geodataframe("places", "place", amsterdam, hierarchy_depth=1)
Finished operation in 0:00:01
geometry | places|place|accommodation | places|place|active_life | places|place|arts_and_entertainment | places|place|attractions_and_activities | places|place|automotive | places|place|beauty_and_spa | places|place|business_to_business | places|place|eat_and_drink | places|place|education | ... | places|place|mass_media | places|place|pets | places|place|private_establishments_and_corporates | places|place|professional_services | places|place|public_service_and_government | places|place|real_estate | places|place|religious_organization | places|place|retail | places|place|structure_and_geography | places|place|travel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08f19695336d340003f6fdf3a790a782 | POINT (4.82069 52.32627) | False | False | False | False | False | False | True | True | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969514a14c28034b4eefc83a719b | POINT (4.81088 52.33052) | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969514a0636b03b4e801caa4077c | POINT (4.81242 52.33008) | False | False | False | False | False | False | True | False | False | ... | False | False | False | True | False | False | False | True | False | False |
08f1969514a14a24031d524c97baacd2 | POINT (4.81126 52.33039) | False | False | True | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969514869a08031e47174edb17be | POINT (4.81395 52.33421) | False | True | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f196952d76b67303ce11872e91b54b | POINT (4.99107 52.2995) | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196952d2d3413037e883b570aacc4 | POINT (4.9934 52.29578) | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196952d2f2323037dd4ead1c9f7f9 | POINT (4.9946 52.29682) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08f196952d2e225a03caf05ea4067ee7 | POINT (4.99605 52.29664) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08f196952995b47403eb6a97bdf35d2a | POINT (4.99254 52.29225) | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
28196 rows × 23 columns
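One way to picture what hierarchy_depth does to the column names is to truncate each full-depth name to fewer levels and merge the columns that end up sharing a name. The sketch below does this for a single row in plain Python. It is only an illustration, not OvertureMaestro's implementation: the collapse_columns helper and the deep category names are hypothetical, and it assumes that a grouped column is True whenever any of its child columns is True.

```python
def collapse_columns(row: dict[str, bool], depth: int) -> dict[str, bool]:
    """Truncate column names to `depth` hierarchy levels and OR the merged values."""
    collapsed: dict[str, bool] = {}
    for name, value in row.items():
        # The first two parts are the theme and type; the rest is the hierarchy.
        theme, feature_type, *hierarchy = name.split("|")
        key = "|".join([theme, feature_type, *hierarchy[:depth]])
        collapsed[key] = collapsed.get(key, False) or value
    return collapsed


# Hypothetical full-depth columns for one feature (names are illustrative only).
row = {
    "places|place|eat_and_drink|restaurant|pizza_restaurant": True,
    "places|place|eat_and_drink|bar": False,
    "places|place|retail|shopping|bookstore": False,
}

depth_1 = collapse_columns(row, depth=1)
print(depth_1)
# {'places|place|eat_and_drink': True, 'places|place|retail': False}
```

Under this assumption, a feature is flagged in a top-level group as soon as any of its more specific categories falls inside that group.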
Pruning final list of columns¶
By default, OvertureMaestro includes all possible columns, regardless of whether any features of a given category exist.
This keeps the overall schema consistent across geographical regions and simplifies the feature engineering process.
However, the dedicated include_all_possible_columns parameter can be set to False to keep only the columns for categories that actually occur in the data.
convert_geometry_to_wide_form_geodataframe(
    "base", "infrastructure", amsterdam, include_all_possible_columns=True  # default value
)
Finished operation in 0:00:01
geometry | base|infrastructure|aerialway|aerialway_station | base|infrastructure|aerialway|cable_car | base|infrastructure|aerialway|chair_lift | base|infrastructure|aerialway|drag_lift | base|infrastructure|aerialway|gondola | base|infrastructure|aerialway|goods | base|infrastructure|aerialway|j-bar | base|infrastructure|aerialway|magic_carpet | base|infrastructure|aerialway|mixed_lift | ... | base|infrastructure|utility|utility_pole | base|infrastructure|utility|water_tower | base|infrastructure|waste_management|recycling | base|infrastructure|waste_management|waste_basket | base|infrastructure|waste_management|waste_disposal | base|infrastructure|water|breakwater | base|infrastructure|water|dam | base|infrastructure|water|drinking_water | base|infrastructure|water|fountain | base|infrastructure|water|weir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08b1969514b5afff0001be5e2ab17485 | LINESTRING (4.81495 52.32946, 4.81513 52.32939) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969514a75fff0001bd00984951d2 | LINESTRING (4.81556 52.32902, 4.81553 52.32892) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969514a75fff0001b429e9e28dcc | LINESTRING (4.81564 52.32901, 4.81561 52.32891) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969514a60fff0001bb84af296d79 | LINESTRING (4.81716 52.32861, 4.81745 52.32878... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969514a66fff0001b2355851584d | LINESTRING (4.81625 52.32894, 4.81623 52.32886... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b196953320dfff0001a24cde081e55 | POINT (4.83817 52.32758) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b19695008e8fff0001b27dc45f09a6 | LINESTRING (4.88267 52.29164, 4.88272 52.29173... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196952a78bfff0001ba6f0f6414bf | LINESTRING (4.88311 52.29154, 4.88395 52.29371... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196952bd2afff0001be80f27d7aea | LINESTRING (4.9468 52.29099, 4.9466 52.29092, ... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969570acbfff0001b71ff706ca6a | LINESTRING (5.01114 52.33393, 5.00887 52.33055... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
92990 rows × 161 columns
convert_geometry_to_wide_form_geodataframe(
    "base", "infrastructure", amsterdam, include_all_possible_columns=False
)
Finished operation in 0:00:01
geometry | base|infrastructure|airport|apron | base|infrastructure|airport|helipad | base|infrastructure|airport|heliport | base|infrastructure|barrier|barrier | base|infrastructure|barrier|block | base|infrastructure|barrier|bollard | base|infrastructure|barrier|border_control | base|infrastructure|barrier|bump_gate | base|infrastructure|barrier|bus_trap | ... | base|infrastructure|utility|utility_pole | base|infrastructure|utility|water_tower | base|infrastructure|waste_management|recycling | base|infrastructure|waste_management|waste_basket | base|infrastructure|waste_management|waste_disposal | base|infrastructure|water|breakwater | base|infrastructure|water|dam | base|infrastructure|water|drinking_water | base|infrastructure|water|fountain | base|infrastructure|water|weir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08b1969514b5afff0001be5e2ab17485 | LINESTRING (4.81495 52.32946, 4.81513 52.32939) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969514a75fff0001bd00984951d2 | LINESTRING (4.81556 52.32902, 4.81553 52.32892) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969514a75fff0001b429e9e28dcc | LINESTRING (4.81564 52.32901, 4.81561 52.32891) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969514a60fff0001bb84af296d79 | LINESTRING (4.81716 52.32861, 4.81745 52.32878... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969514a66fff0001b2355851584d | LINESTRING (4.81625 52.32894, 4.81623 52.32886... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b196953320dfff0001a24cde081e55 | POINT (4.83817 52.32758) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b19695008e8fff0001b27dc45f09a6 | LINESTRING (4.88267 52.29164, 4.88272 52.29173... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196952a78bfff0001ba6f0f6414bf | LINESTRING (4.88311 52.29154, 4.88395 52.29371... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196952bd2afff0001be80f27d7aea | LINESTRING (4.9468 52.29099, 4.9466 52.29092, ... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969570acbfff0001b71ff706ca6a | LINESTRING (5.01114 52.33393, 5.00887 52.33055... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
92990 rows × 105 columns
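The reason for keeping every column by default can be illustrated without downloading anything: a pruned result from one region can lack columns that another region has, so the two cannot be stacked directly. A minimal sketch, assuming a row is represented as a plain dict (the align_to_schema helper is hypothetical and not part of OvertureMaestro), pads a pruned row back to a shared column list:

```python
def align_to_schema(row: dict[str, bool], all_columns: list[str]) -> dict[str, bool]:
    """Return the row with every column from `all_columns`, filling gaps with False."""
    return {column: row.get(column, False) for column in all_columns}


# A few column names taken from the tables above.
all_columns = [
    "base|infrastructure|aerialway|cable_car",
    "base|infrastructure|airport|helipad",
    "base|infrastructure|barrier|bollard",
]

# A pruned row that has no aerialway column at all.
pruned_row = {
    "base|infrastructure|airport|helipad": False,
    "base|infrastructure|barrier|bollard": True,
}

aligned = align_to_schema(pruned_row, all_columns)
missing = [column for column in all_columns if column not in pruned_row]
print(missing)  # ['base|infrastructure|aerialway|cable_car']
print(aligned)
```

With the default setting this padding step is unnecessary, because every region already shares the same columns.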
Getting a full list of possible column names¶
You can also preview the final list of columns before downloading the data using the get_all_possible_column_names function.
You can specify the release, theme and type, as well as hierarchy_depth.
from overturemaestro.advanced_functions.wide_form import get_all_possible_column_names
get_all_possible_column_names(theme="base", type="water")
['base|water|canal|canal', 'base|water|canal|ditch', 'base|water|canal|drain', 'base|water|canal|moat', 'base|water|human_made|fish_pass', 'base|water|human_made|reflecting_pool', 'base|water|human_made|salt_pond', 'base|water|human_made|swimming_pool', 'base|water|lake|lagoon', 'base|water|lake|lake', 'base|water|lake|oxbow', 'base|water|ocean|ocean', 'base|water|physical|bay', 'base|water|physical|cape', 'base|water|physical|ocean', 'base|water|physical|sea', 'base|water|physical|shoal', 'base|water|physical|strait', 'base|water|physical|waterfall', 'base|water|pond|fishpond', 'base|water|pond|pond', 'base|water|reservoir|basin', 'base|water|reservoir|reservoir', 'base|water|reservoir|water_storage', 'base|water|river|river', 'base|water|spring|blowhole', 'base|water|spring|geyser', 'base|water|spring|hot_spring', 'base|water|spring|spring', 'base|water|stream|stream', 'base|water|wastewater|sewage', 'base|water|water|dock', 'base|water|water|fairway', 'base|water|water|tidal_channel', 'base|water|water|wastewater', 'base|water|water|water']
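Each returned name joins the theme, the type and the category hierarchy with the | character; for base water the hierarchy is the subtype followed by the class from the schema shown earlier. A small sketch of splitting such names back into parts in plain Python (the parse_column_name helper is hypothetical, not part of the library):

```python
from collections import defaultdict


def parse_column_name(name: str) -> tuple[str, str, tuple[str, ...]]:
    """Split 'theme|type|level_1|...' into (theme, type, hierarchy levels)."""
    theme, feature_type, *hierarchy = name.split("|")
    return theme, feature_type, tuple(hierarchy)


# A subset of the names returned above.
columns = [
    "base|water|lake|lagoon",
    "base|water|lake|lake",
    "base|water|lake|oxbow",
    "base|water|river|river",
]

# Group the classes by their subtype.
classes_by_subtype: dict[str, list[str]] = defaultdict(list)
for column in columns:
    _, _, (subtype, water_class) = parse_column_name(column)
    classes_by_subtype[subtype].append(water_class)

print(dict(classes_by_subtype))
# {'lake': ['lagoon', 'lake', 'oxbow'], 'river': ['river']}
```

The two-level unpacking only works for names with exactly two hierarchy levels; names from other themes or other hierarchy_depth values may have a different number of levels.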
With all parameters left empty, the function returns the full list of all possible columns at the maximal depth.
columns = get_all_possible_column_names()
len(columns)
2632
columns[:10]
['base|infrastructure|aerialway|aerialway_station', 'base|infrastructure|aerialway|cable_car', 'base|infrastructure|aerialway|chair_lift', 'base|infrastructure|aerialway|drag_lift', 'base|infrastructure|aerialway|gondola', 'base|infrastructure|aerialway|goods', 'base|infrastructure|aerialway|j-bar', 'base|infrastructure|aerialway|magic_carpet', 'base|infrastructure|aerialway|mixed_lift', 'base|infrastructure|aerialway|platter']
You can also specify different hierarchy_depth values.