Wide format¶
OvertureMaestro implements logic for transforming downloaded data into a wide format. This format is intended for geospatial machine learning, where selected datasets are pivoted on their categories into a columnar format.
This notebook explores what this format is and how to work with it.
New functions¶
The new module contains the same set of functions as the basic API, just with the wide_form part inside their names:
- convert_geometry_to_parquet → convert_geometry_to_wide_form_parquet
- convert_geometry_to_geodataframe → convert_geometry_to_wide_form_geodataframe
- other functions ...
Additionally, there are dedicated functions for downloading all available datasets at once:
- convert_geometry_to_wide_form_parquet_for_all_types
- convert_geometry_to_wide_form_geodataframe_for_all_types
- convert_bounding_box_to_wide_form_parquet_for_all_types
- convert_bounding_box_to_wide_form_geodataframe_for_all_types
You can import them from the overturemaestro.advanced_functions module.
from overturemaestro import convert_geometry_to_geodataframe, geocode_to_geometry
from overturemaestro.advanced_functions import convert_geometry_to_wide_form_geodataframe
What is the wide format?¶
In this section we will compare how the original data format differs from the wide format based on water data.
Let's start by looking at the official Overture Maps schema for the base water data type:
import requests
import yaml
response = requests.get(
    "https://raw.githubusercontent.com/OvertureMaps/schema/refs/tags/v1.4.0/schema/base/water.yaml",
    allow_redirects=True,
)
water_schema = yaml.safe_load(response.content.decode("utf-8"))
water_schema
{'$schema': 'https://json-schema.org/draft/2020-12/schema', 'title': 'water', 'description': 'Physical representations of inland and ocean marine surfaces. Translates `natural` and `waterway` tags from OpenStreetMap.', 'type': 'object', 'properties': {'id': {'$ref': '../defs.yaml#/$defs/propertyDefinitions/id'}, 'geometry': {'unevaluatedProperties': False, 'oneOf': [{'$ref': 'https://geojson.org/schema/Point.json'}, {'$ref': 'https://geojson.org/schema/LineString.json'}, {'$ref': 'https://geojson.org/schema/Polygon.json'}, {'$ref': 'https://geojson.org/schema/MultiPolygon.json'}]}, 'properties': {'unevaluatedProperties': False, 'allOf': [{'$ref': '../defs.yaml#/$defs/propertyContainers/overtureFeaturePropertiesContainer'}, {'$ref': '../defs.yaml#/$defs/propertyContainers/levelContainer'}, {'$ref': '../defs.yaml#/$defs/propertyContainers/namesContainer'}, {'$ref': './defs.yaml#/$defs/propertyContainers/osmPropertiesContainer'}], 'required': ['subtype', 'class'], 'properties': {'subtype': {'description': 'The type of water body such as an river, ocean or lake.', 'default': ['water'], 'type': 'string', 'enum': ['canal', 'human_made', 'lake', 'ocean', 'physical', 'pond', 'reservoir', 'river', 'spring', 'stream', 'wastewater', 'water']}, 'class': {'description': 'Further description of the type of water', 'default': ['water'], 'enum': ['basin', 'bay', 'blowhole', 'canal', 'cape', 'ditch', 'dock', 'drain', 'fairway', 'fish_pass', 'fishpond', 'geyser', 'hot_spring', 'lagoon', 'lake', 'moat', 'ocean', 'oxbow', 'pond', 'reflecting_pool', 'reservoir', 'river', 'salt_pond', 'sea', 'sewage', 'shoal', 'spring', 'strait', 'stream', 'swimming_pool', 'tidal_channel', 'wastewater', 'water', 'water_storage', 'waterfall']}, 'is_salt': {'description': 'Is it salt water or not', 'type': 'boolean'}, 'is_intermittent': {'description': 'Is it intermittent water or not', 'type': 'boolean'}}}}}
Two required fields are defined in the specification: subtype and class. The schema also lists the possible values for each of them.
Both of these values detail the meaning of the feature. Together, everything maps to the path: theme (base) → type (water) → subtype (e.g. reservoir) → class (e.g. basin).
Based on this hierarchy, all available values can be determined and mapped to columns.
In this way, you will obtain data in a wide format, where each feature defines what it is with boolean flags.
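The mapping from hierarchy values to boolean columns can be illustrated with a small, library-independent sketch. The `to_wide_form` helper, the sample rows and the shortened list of allowed pairs are hypothetical; only the `theme|type|subtype|class` column naming follows the output shown in this notebook, and this is not how OvertureMaestro itself is implemented.

```python
# Hypothetical sample of long-form features: (id, subtype, class).
features = [
    ("a", "canal", "ditch"),
    ("b", "water", "water"),
    ("c", "canal", "canal"),
]

# Assumed, shortened list of allowed subtype/class pairs for base/water.
allowed_pairs = [
    ("canal", "canal"),
    ("canal", "ditch"),
    ("river", "river"),
    ("water", "water"),
]


def to_wide_form(rows, pairs, theme="base", type_="water"):
    """Pivot (id, subtype, class) rows into one boolean flag per allowed pair."""
    columns = ["|".join((theme, type_, subtype, class_)) for subtype, class_ in pairs]
    wide = {}
    for feature_id, subtype, class_ in rows:
        own_column = "|".join((theme, type_, subtype, class_))
        # Every feature gets every column, so the schema stays the same for all rows.
        wide[feature_id] = {column: column == own_column for column in columns}
    return wide


wide = to_wide_form(features, allowed_pairs)
print(wide["a"])
```

Each feature ends up with exactly one True flag here, which matches datasets such as base/water where a feature has a single subtype and class.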
amsterdam = geocode_to_geometry("Amsterdam")
original_data = convert_geometry_to_geodataframe("base", "water", amsterdam)
wide_data = convert_geometry_to_wide_form_geodataframe("base", "water", amsterdam)
Finished operation in 0:00:06
Finished operation in 0:00:06
original_data
geometry | bbox | version | sources | level | subtype | class | names | source_tags | wikidata | is_salt | is_intermittent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||
08b196950368afff0004b76ea0d2b5f2 | POLYGON ((4.87003 52.25277, 4.87057 52.25316, ... | {'xmin': 4.78419303894043, 'xmax': 4.909895896... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | river | river | None | [(natural, water), (water, river)] | None | None | None |
08b1969506046fff0004b5caa8b55bfb | POLYGON ((4.85637 52.318, 4.85626 52.31768, 4.... | {'xmin': 4.855721950531006, 'xmax': 4.85673761... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
08b196950472efff0004bf753c5d888b | POLYGON ((4.87813 52.32199, 4.87997 52.32202, ... | {'xmin': 4.8781208992004395, 'xmax': 4.8855113... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
08b1969504721fff0004bebc279a0d3b | POLYGON ((4.87896 52.32234, 4.87837 52.32232, ... | {'xmin': 4.878284454345703, 'xmax': 4.88528919... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
08b19695040d2fff0004b097f2f8e0b2 | POLYGON ((4.88137 52.32268, 4.88131 52.32268, ... | {'xmin': 4.881159782409668, 'xmax': 4.88363122... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b1969c8116efff0004bec2c28ac64e | POLYGON ((4.99504 52.42547, 4.99491 52.42547, ... | {'xmin': 4.991415977478027, 'xmax': 4.99525833... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
08b1969c81b9cfff0004b1ebb8eb72c0 | POLYGON ((4.99542 52.42549, 4.99554 52.42548, ... | {'xmin': 4.995415210723877, 'xmax': 4.99554443... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | water | water | None | [(natural, water)] | None | None | None |
08b1969c8154cfff0004be2200d88ef7 | LINESTRING (4.98013 52.42564, 4.98006 52.4257) | {'xmin': 4.980057239532471, 'xmax': 4.98012685... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | canal | ditch | None | [(waterway, ditch)] | None | None | None |
08b1969c8156dfff0004b02f7ca42ab8 | LINESTRING (4.98203 52.42577, 4.98143 52.42628) | {'xmin': 4.98143196105957, 'xmax': 4.982034206... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | canal | ditch | None | [(waterway, ditch)] | None | None | None |
08b1969c86708fff0004c431e62e2b9d | POLYGON ((4.94925 52.48682, 4.94933 52.48715, ... | {'xmin': 4.929588794708252, 'xmax': 4.95170974... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | canal | canal | None | [(natural, water), (water, canal)] | None | None | None |
8677 rows × 12 columns
wide_data
geometry | base|water|canal|canal | base|water|canal|ditch | base|water|canal|drain | base|water|canal|moat | base|water|human_made|fish_pass | base|water|human_made|reflecting_pool | base|water|human_made|salt_pond | base|water|human_made|swimming_pool | base|water|lake|lagoon | ... | base|water|spring|geyser | base|water|spring|hot_spring | base|water|spring|spring | base|water|stream|stream | base|water|wastewater|sewage | base|water|water|dock | base|water|water|fairway | base|water|water|tidal_channel | base|water|water|wastewater | base|water|water|water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08b1969cc6851fff0004b203ae3edec4 | LINESTRING (5.13591 52.32805, 5.12618 52.33539... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | False | False |
08b1969ca9634fff0004b2424f99ea0c | POLYGON ((5.06485 52.41428, 5.06501 52.41461, ... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
08b1969ca9626fff0004b2f8a453eeb9 | POLYGON ((5.06681 52.41444, 5.06687 52.41454, ... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
08b1969cab2a1fff0004b24022ea3a2c | LINESTRING (5.0404 52.40745, 5.04045 52.40754,... | False | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1969cab214fff0004b7ae00b68305 | POLYGON ((5.0429 52.40837, 5.04287 52.40837, 5... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b1968244380fff0004bc1f6f95da05 | LINESTRING (4.74932 52.41254, 4.74821 52.41633... | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968244549fff0004b54b0a20f0d0 | POLYGON ((4.73971 52.4272, 4.73969 52.42722, 4... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
08b19682440b6fff0004bbab6e125168 | POLYGON ((4.74042 52.42689, 4.74032 52.42681, ... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
08b1968244cdcfff0004bee9d4c2f99a | LINESTRING (4.73952 52.42956, 4.73955 52.42965... | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968271b51fff0004ca1611b25ecd | POLYGON ((4.76481 52.4264, 4.77143 52.42566, 4... | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
8677 rows × 37 columns
Using this format, we can quickly filter the data or calculate the number of features per category.
wide_data.drop(columns="geometry").sum().sort_values(ascending=False)
base|water|water|water    5106
base|water|canal|canal    1567
base|water|canal|ditch    970
base|water|canal|drain    710
base|water|pond|pond    83
base|water|water|fairway    58
base|water|human_made|swimming_pool    47
base|water|river|river    45
base|water|stream|stream    25
base|water|reservoir|basin    24
base|water|water|wastewater    23
base|water|lake|lake    6
base|water|canal|moat    5
base|water|reservoir|reservoir    3
base|water|water|dock    2
base|water|physical|bay    2
base|water|human_made|reflecting_pool    1
base|water|spring|spring    0
base|water|spring|hot_spring    0
base|water|spring|geyser    0
base|water|spring|blowhole    0
base|water|wastewater|sewage    0
base|water|water|tidal_channel    0
base|water|pond|fishpond    0
base|water|reservoir|water_storage    0
base|water|physical|strait    0
base|water|physical|shoal    0
base|water|physical|sea    0
base|water|physical|ocean    0
base|water|physical|cape    0
base|water|ocean|ocean    0
base|water|lake|oxbow    0
base|water|lake|lagoon    0
base|water|human_made|salt_pond    0
base|water|human_made|fish_pass    0
base|water|physical|waterfall    0
dtype: int64
Each theme/type pair has a defined list of hierarchy columns used for generating the final list of columns.
Most of the datasets have two columns (subtype and class), with three exceptions:
- base|land_cover → subtype only
- transportation|segment → subtype, class and subclass
- places|place → 1, 2, 3, ... (this one is described in detail below)
from overturemaestro.advanced_functions.wide_form import THEME_TYPE_CLASSIFICATION

# Print the hierarchy columns defined for every theme/type pair
for (theme_value, type_value), definition in sorted(THEME_TYPE_CLASSIFICATION.items()):
    print(theme_value, type_value, definition.hierachy_columns)
base infrastructure ['subtype', 'class']
base land ['subtype', 'class']
base land_cover ['subtype']
base land_use ['subtype', 'class']
base water ['subtype', 'class']
buildings building ['subtype', 'class']
places place ['1', '2', '3', '4', '5', '6']
transportation segment ['subtype', 'class', 'subclass']
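To make the role of the hierarchy columns concrete, here is a minimal sketch that builds a wide-form column name from a feature's attributes. `HIERARCHY_COLUMNS` copies three entries from the output above, while `column_name` and the sample features are hypothetical and not part of the OvertureMaestro API. Stopping at the first missing value is an assumption, made because columns such as `transportation|segment|road|footway` in this notebook are shorter than the full hierarchy.

```python
# Hierarchy columns for three theme/type pairs, copied from the printed output.
HIERARCHY_COLUMNS = {
    ("base", "water"): ["subtype", "class"],
    ("base", "land_cover"): ["subtype"],
    ("transportation", "segment"): ["subtype", "class", "subclass"],
}


def column_name(theme, type_, feature):
    """Join theme, type and the available hierarchy values with '|'."""
    parts = [theme, type_]
    for column in HIERARCHY_COLUMNS[(theme, type_)]:
        value = feature.get(column)
        if value is None:
            # Assumption: a missing level ends the path early.
            break
        parts.append(value)
    return "|".join(parts)


water_name = column_name("base", "water", {"subtype": "canal", "class": "ditch"})
land_cover_name = column_name("base", "land_cover", {"subtype": "forest"})
segment_name = column_name(
    "transportation", "segment", {"subtype": "road", "class": "footway", "subclass": None}
)
print(water_name, land_cover_name, segment_name, sep="\n")
```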
Multiple data types¶
You can also download data for multiple theme/type pairs at once, or even for all of them at once.
If some datasets were downloaded during previous executions, only the missing data is downloaded.
Here we will look at the top 10 most common features for both examples.
from overturemaestro.advanced_functions import (
    convert_geometry_to_wide_form_geodataframe_for_all_types,
    convert_geometry_to_wide_form_geodataframe_for_multiple_types,
)

two_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_multiple_types(
    [("base", "water"), ("base", "land_cover")], amsterdam
)
two_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)
Finished operation in 0:00:06
base|water|water|water    5106
base|water|canal|canal    1567
base|water|canal|ditch    970
base|water|canal|drain    710
base|land_cover|shrub    308
base|land_cover|forest    169
base|land_cover|urban    89
base|water|pond|pond    83
base|land_cover|barren    82
base|water|water|fairway    58
dtype: int64
len(two_datasets_gdf.columns)
47
all_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(amsterdam)
all_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)
Finished operation in 0:00:50
buildings|building    72877
buildings|building|residential|house    54844
buildings|building|residential|apartments    53839
base|land|tree|tree    36807
base|infrastructure|transit|parking_space    21925
transportation|segment|road|footway    16406
base|land_use|managed|grass    14396
transportation|segment|road|residential    9958
base|infrastructure|barrier|fence    7579
base|land|forest|forest    6256
dtype: int64
len(all_datasets_gdf.columns)
2623
Limiting hierarchy depth¶
If you only need a higher-level aggregation of the data, you can limit the hierarchy depth.
By default, the full hierarchy is used to generate the columns.
Note
If you pass a value that is too high, it is automatically capped at the maximum depth available for the given theme/type pair.
limited_depth_water_gdf = convert_geometry_to_wide_form_geodataframe(
    "base", "water", amsterdam, hierarchy_depth=1
)
limited_depth_water_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:06
base|water|canal    3252
base|water|human_made    48
base|water|lake    6
base|water|ocean    0
base|water|physical    2
base|water|pond    83
base|water|reservoir    27
base|water|river    45
base|water|spring    0
base|water|stream    25
base|water|wastewater    0
base|water|water    5189
dtype: int64
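The effect of limiting the depth can be reproduced by hand: columns that share the same truncated name are merged. The sketch below uses a few full-depth counts copied from the base/water output earlier in this notebook. `truncate_counts` and `MAX_DEPTH` are hypothetical names, and summing counts is only valid here because each water feature has exactly one True flag.

```python
from collections import Counter

# Full-depth counts copied from the base/water output earlier in this notebook.
full_depth_counts = {
    "base|water|canal|canal": 1567,
    "base|water|canal|ditch": 970,
    "base|water|canal|drain": 710,
    "base|water|canal|moat": 5,
    "base|water|pond|pond": 83,
    "base|water|pond|fishpond": 0,
}

MAX_DEPTH = 2  # base/water has two hierarchy columns: subtype and class


def truncate_counts(counts, hierarchy_depth=None, prefix_length=2):
    """Sum the counts of columns that share the same truncated name."""
    # None means full depth; values that are too high are capped at MAX_DEPTH.
    depth = MAX_DEPTH if hierarchy_depth is None else min(hierarchy_depth, MAX_DEPTH)
    result = Counter()
    for column, count in counts.items():
        parts = column.split("|")
        # Keep theme and type (prefix_length parts) plus `depth` hierarchy levels.
        result["|".join(parts[: prefix_length + depth])] += count
    return dict(result)


depth_one = truncate_counts(full_depth_counts, hierarchy_depth=1)
capped = truncate_counts(full_depth_counts, hierarchy_depth=5)
print(depth_one)
```

The value for `base|water|canal` equals 1567 + 970 + 710 + 5 = 3252, which agrees with the output above.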
Using a value of 0 results in just a list of theme/type pairs.
limited_depth_all_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(
    amsterdam, hierarchy_depth=0
)
limited_depth_all_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:19
base|infrastructure    80559
base|land    49137
base|land_cover    702
base|land_use    24677
base|water    8677
buildings|building    199926
places|place    25112
transportation|segment    61358
dtype: int64
You can also pass a list if you are downloading data for multiple datasets at once. The list of values must have the same length as the list of theme_type_pairs.
limited_depth_multiple_gdf = convert_geometry_to_wide_form_geodataframe_for_multiple_types(
    [("places", "place"), ("base", "land_cover"), ("base", "water")],
    amsterdam,
    hierarchy_depth=[1, None, 0],
)
limited_depth_multiple_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:01
base|land_cover|barren    82
base|land_cover|crop    24
base|land_cover|forest    169
base|land_cover|grass    1
base|land_cover|mangrove    0
base|land_cover|moss    0
base|land_cover|shrub    308
base|land_cover|snow    0
base|land_cover|urban    89
base|land_cover|wetland    29
base|water    8677
places|place|accommodation    898
places|place|active_life    1768
places|place|arts_and_entertainment    1669
places|place|attractions_and_activities    1254
places|place|automotive    455
places|place|beauty_and_spa    1830
places|place|business_to_business    2083
places|place|cafetaria    110
places|place|criminal_deense_law    0
places|place|eat_and_drink    4615
places|place|education    1462
places|place|financial_service    673
places|place|health_and_medical    2084
places|place|home_service    951
places|place|mass_media    815
places|place|pets    126
places|place|private_establishments_and_corporates    47
places|place|professional_services    7003
places|place|public_service_and_government    1832
places|place|real_estate    685
places|place|religious_organization    292
places|place|retail    6228
places|place|structure_and_geography    385
places|place|travel    1190
dtype: int64
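The pairing between the list of depths and the list of theme/type pairs is positional. A minimal sketch of that pairing is shown below; `normalize_depths` is a hypothetical helper, and raising `ValueError` on a length mismatch is an assumption about reasonable behaviour rather than a documented guarantee of the library.

```python
def normalize_depths(theme_type_pairs, hierarchy_depth=None):
    """Return a mapping of theme/type pair -> depth (None means full depth)."""
    if isinstance(hierarchy_depth, (list, tuple)):
        if len(hierarchy_depth) != len(theme_type_pairs):
            raise ValueError("hierarchy_depth must have one value per theme/type pair")
        # Depths are matched to pairs by position.
        return dict(zip(theme_type_pairs, hierarchy_depth))
    # A single value (or None) applies to every pair.
    return {pair: hierarchy_depth for pair in theme_type_pairs}


pairs = [("places", "place"), ("base", "land_cover"), ("base", "water")]
per_pair_depths = normalize_depths(pairs, [1, None, 0])
print(per_pair_depths)
```

With the values from the example above, places are limited to the first level, land cover keeps its full hierarchy and water collapses to a single `base|water` column.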
Places¶
Places data has a different schema than other datasets, and it is the only one where a feature can have multiple categories at once: a primary category and, optionally, multiple alternate categories.
This structure is preserved in the wide format, and it is the only dataset where a single feature can have multiple True values in the columns.
OvertureMaestro uses the categories column with its primary and alternate sub-fields to get the feature categorization. The hierarchy depth of 6 is based on the official taxonomy of the possible categories.
Two pyarrow filters are applied automatically when downloading the data for the wide format: the confidence value must be >= 0.75 and categories cannot be empty.
import pyarrow.compute as pc
category_not_null_filter = pc.invert(pc.field("categories").is_null())
minimal_confidence_filter = pc.field("confidence") >= pc.scalar(0.75)
combined_filter = category_not_null_filter & minimal_confidence_filter
original_places_data = convert_geometry_to_geodataframe(
    "places",
    "place",
    amsterdam,
    pyarrow_filter=combined_filter,
    columns_to_download=["id", "geometry", "categories", "confidence"],
)
original_places_data
Finished operation in 0:00:04
geometry | categories | confidence | |
---|---|---|---|
id | |||
08f196950615415c03c9d5eaea80f5b0 | POINT (4.85689 52.32198) | {'primary': 'wholesaler', 'alternate': ['busin... | 0.770000 |
08f1969506154b06036b64670831a388 | POINT (4.85723 52.32199) | {'primary': 'cannabis_clinic', 'alternate': ['... | 0.953741 |
08f1969506172b8c0351e7859ad35363 | POINT (4.85724 52.3223) | {'primary': 'cafe', 'alternate': ['restaurant'... | 0.957325 |
08f196950617484b03ea3523cbedf6fa | POINT (4.85839 52.32263) | {'primary': 'lawyer', 'alternate': ['professio... | 0.990185 |
08f1969506d20c4003158d17dad82734 | POINT (4.85364 52.33093) | {'primary': 'professional_services', 'alternat... | 0.957325 |
... | ... | ... | ... |
08f1969c91604200031f347887ebd972 | POINT (4.89178 52.42417) | {'primary': 'cinema', 'alternate': ['arts_and_... | 0.911932 |
08f1969c9169dd8603018024b6c8451b | POINT (4.88606 52.42261) | {'primary': 'professional_services', 'alternat... | 0.953741 |
08f1969c91681620036c2fc0a52d4b23 | POINT (4.88697 52.42281) | {'primary': 'electrician', 'alternate': ['prof... | 0.953741 |
08f1969c916aa0a6032ccdbc71069126 | POINT (4.88774 52.42298) | {'primary': 'bed_and_breakfast', 'alternate': ... | 0.953741 |
08f1969c916846b503784c5b1abbe6ce | POINT (4.88671 52.42347) | {'primary': 'animal_shelter', 'alternate': None} | 0.957325 |
25112 rows × 3 columns
first_index = (
    # Find the first object with more than one alternate category
    original_places_data[original_places_data.categories.str.get("alternate").str.len() > 1]
    .iloc[0]
    .name
)
first_index, original_places_data.loc[first_index].categories
('08f196950615415c03c9d5eaea80f5b0', {'primary': 'wholesaler', 'alternate': array(['business_equipment_and_supply', 'computer_hardware_company'], dtype=object)})
wide_form_places_data = convert_geometry_to_wide_form_geodataframe("places", "place", amsterdam)
wide_form_places_data
Finished operation in 0:00:17
geometry | places|place|accommodation | places|place|accommodation|bed_and_breakfast | places|place|accommodation|cabin | places|place|accommodation|campground | places|place|accommodation|cottage | places|place|accommodation|guest_house | places|place|accommodation|health_retreats | places|place|accommodation|holiday_rental_home | places|place|accommodation|hostel | ... | places|place|travel|transportation|transport_interchange | places|place|travel|transportation|water_taxi | places|place|travel|travel_services | places|place|travel|travel_services|luggage_storage | places|place|travel|travel_services|passport_and_visa_services | places|place|travel|travel_services|passport_and_visa_services|visa_agent | places|place|travel|travel_services|travel_agents | places|place|travel|travel_services|travel_agents|sightseeing_tour_agency | places|place|travel|travel_services|visitor_center | places|place|travel|vacation_rental_agents | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08f196950615415c03c9d5eaea80f5b0 | POINT (4.85689 52.32198) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969506154b06036b64670831a388 | POINT (4.85723 52.32199) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969506172b8c0351e7859ad35363 | POINT (4.85724 52.3223) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196950617484b03ea3523cbedf6fa | POINT (4.85839 52.32263) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969506d20c4003158d17dad82734 | POINT (4.85364 52.33093) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f1969c91604200031f347887ebd972 | POINT (4.89178 52.42417) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c9169dd8603018024b6c8451b | POINT (4.88606 52.42261) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c91681620036c2fc0a52d4b23 | POINT (4.88697 52.42281) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c916aa0a6032ccdbc71069126 | POINT (4.88774 52.42298) | True | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c916846b503784c5b1abbe6ce | POINT (4.88671 52.42347) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
25112 rows × 2119 columns
As you can see, only the columns matching the values in the categories column are True, and the rest are False.
wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)
places|place|professional_services|computer_hardware_company    True
places|place|business_to_business|business_equipment_and_supply|wholesaler    True
places|place|business_to_business|business_equipment_and_supply    True
places|place|professional_services|digitizing_services    False
places|place|professional_services|courier_and_delivery_services    False
...
places|place|eat_and_drink|bar|kombucha    False
places|place|eat_and_drink|bar|irish_pub    False
places|place|eat_and_drink|bar|hotel_bar    False
places|place|eat_and_drink|bar|hookah_bar    False
places|place|travel|vacation_rental_agents    False
Name: 08f196950615415c03c9d5eaea80f5b0, Length: 2118, dtype: object
You can use places_use_primary_category_only to use only a single category per feature, without alternates.
primary_only_wide_form_places_data = convert_geometry_to_wide_form_geodataframe(
    "places",
    "place",
    amsterdam,
    places_use_primary_category_only=True,
)
primary_only_wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)
Finished operation in 0:00:00
places|place|business_to_business|business_equipment_and_supply|wholesaler    True
places|place|accommodation    False
places|place|professional_services|diamond_dealer    False
places|place|professional_services|copywriting_service    False
places|place|professional_services|courier_and_delivery_services    False
...
places|place|eat_and_drink|bar|kombucha    False
places|place|eat_and_drink|bar|irish_pub    False
places|place|eat_and_drink|bar|hotel_bar    False
places|place|eat_and_drink|bar|hookah_bar    False
places|place|travel|vacation_rental_agents    False
Name: 08f196950615415c03c9d5eaea80f5b0, Length: 2118, dtype: object
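The two outputs for this feature can be mimicked with a tiny lookup from category name to taxonomy path. In the sketch below, `TAXONOMY_PATHS` is a hypothetical three-entry excerpt whose paths are taken from the True columns shown above, and `true_columns` is an illustrative helper rather than a function of the library.

```python
# Hypothetical excerpt of the places taxonomy: category name -> path below places|place.
TAXONOMY_PATHS = {
    "wholesaler": "business_to_business|business_equipment_and_supply|wholesaler",
    "business_equipment_and_supply": "business_to_business|business_equipment_and_supply",
    "computer_hardware_company": "professional_services|computer_hardware_company",
}


def true_columns(categories, primary_only=False):
    """Return the set of wide-form columns that are True for one feature."""
    names = [categories["primary"]]
    if not primary_only:
        # The alternate field can be None when a feature has no alternates.
        names.extend(categories.get("alternate") or [])
    return {f"places|place|{TAXONOMY_PATHS[name]}" for name in names}


example = {
    "primary": "wholesaler",
    "alternate": ["business_equipment_and_supply", "computer_hardware_company"],
}
all_categories = true_columns(example)
primary_category = true_columns(example, primary_only=True)
print(sorted(all_categories))
print(sorted(primary_category))
```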
Below you can see the difference in the counts of True values across all columns.
wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)
places|place|professional_services    4221
places|place|eat_and_drink|restaurant    2155
places|place|retail|shopping    2103
places|place|eat_and_drink|cafe    1310
places|place|beauty_and_spa    1138
...
places|place|home_service|wallpaper_installers    0
places|place|home_service|washer_and_dryer_repair_service    0
places|place|home_service|water_purification_services    0
places|place|home_service|waterproofing    0
places|place|travel|vacation_rental_agents    0
Length: 2118, dtype: int64
primary_only_wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)
places|place|professional_services    1283
places|place|eat_and_drink|restaurant    898
places|place|beauty_and_spa    713
places|place|retail|shopping    596
places|place|eat_and_drink|cafe    465
...
places|place|health_and_medical|doctor|surgeon|cardiovascular_and_thoracic_surgeon    0
places|place|health_and_medical|doctor|sports_medicine    0
places|place|health_and_medical|doctor|spine_surgeon    0
places|place|health_and_medical|doctor|rheumatologist    0
places|place|travel|vacation_rental_agents    0
Length: 2118, dtype: int64
You can also change the minimal confidence value with the places_minimal_confidence parameter.
convert_geometry_to_wide_form_geodataframe(
    "places", "place", amsterdam, places_minimal_confidence=0.95
)
Finished operation in 0:00:22
geometry | places|place|accommodation | places|place|accommodation|bed_and_breakfast | places|place|accommodation|cabin | places|place|accommodation|campground | places|place|accommodation|cottage | places|place|accommodation|guest_house | places|place|accommodation|health_retreats | places|place|accommodation|holiday_rental_home | places|place|accommodation|hostel | ... | places|place|travel|transportation|transport_interchange | places|place|travel|transportation|water_taxi | places|place|travel|travel_services | places|place|travel|travel_services|luggage_storage | places|place|travel|travel_services|passport_and_visa_services | places|place|travel|travel_services|passport_and_visa_services|visa_agent | places|place|travel|travel_services|travel_agents | places|place|travel|travel_services|travel_agents|sightseeing_tour_agency | places|place|travel|travel_services|visitor_center | places|place|travel|vacation_rental_agents | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08f196950449bd5d03fb0cb2e16f3b5f | POINT (4.8682 52.32415) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196950448a18a0372817c5a62b894 | POINT (4.86935 52.32449) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196950449ab0403db029686784c68 | POINT (4.86821 52.32457) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196950449a9ac0305e592b2e3d158 | POINT (4.86813 52.32469) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196950449e6440307b53f68b6924a | POINT (4.86812 52.32479) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f1969c9e129430031c2e83e61e1e97 | POINT (4.89056 52.41333) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c9e99b00503b204f90a5544c9 | POINT (4.89101 52.41787) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c916e42a903d8a3843c7a94a5 | POINT (4.89096 52.42229) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c9160e5930329a780e3575159 | POINT (4.89123 52.42331) | True | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c930c4396031167f6dc33878f | POINT (4.86289 52.42271) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
19824 rows × 2119 columns
The full hierarchy of the places dataset is derived from the official taxonomy.
You can limit the hierarchy depth to get fewer columns with grouped categories.
convert_geometry_to_wide_form_geodataframe("places", "place", amsterdam, hierarchy_depth=1)
Finished operation in 0:00:00
geometry | places|place|accommodation | places|place|active_life | places|place|arts_and_entertainment | places|place|attractions_and_activities | places|place|automotive | places|place|beauty_and_spa | places|place|business_to_business | places|place|cafetaria | places|place|criminal_deense_law | ... | places|place|mass_media | places|place|pets | places|place|private_establishments_and_corporates | places|place|professional_services | places|place|public_service_and_government | places|place|real_estate | places|place|religious_organization | places|place|retail | places|place|structure_and_geography | places|place|travel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08f196950615415c03c9d5eaea80f5b0 | POINT (4.85689 52.32198) | False | False | False | False | False | False | True | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08f1969506154b06036b64670831a388 | POINT (4.85723 52.32199) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969506172b8c0351e7859ad35363 | POINT (4.85724 52.3223) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f196950617484b03ea3523cbedf6fa | POINT (4.85839 52.32263) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08f1969506d20c4003158d17dad82734 | POINT (4.85364 52.33093) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f1969c91604200031f347887ebd972 | POINT (4.89178 52.42417) | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c9169dd8603018024b6c8451b | POINT (4.88606 52.42261) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08f1969c91681620036c2fc0a52d4b23 | POINT (4.88697 52.42281) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08f1969c916aa0a6032ccdbc71069126 | POINT (4.88774 52.42298) | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f1969c916846b503784c5b1abbe6ce | POINT (4.88671 52.42347) | False | False | False | False | False | False | False | False | False | ... | False | True | False | False | False | False | False | False | False | False |
25112 rows × 25 columns
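To make the effect of hierarchy_depth concrete, here is a minimal sketch in plain Python. It is not OvertureMaestro's implementation. It assumes, based only on the column names shown in this notebook, that the first two |-separated parts are the theme and type, that hierarchy_depth keeps that many category levels after them, and that grouped boolean flags are combined with a logical OR. The helper names truncate_column_name and group_row are hypothetical.

```python
def truncate_column_name(name: str, hierarchy_depth: int) -> str:
    """Keep theme, type and the first `hierarchy_depth` category levels."""
    parts = name.split("|")
    return "|".join(parts[: 2 + hierarchy_depth])


def group_row(row: dict[str, bool], hierarchy_depth: int) -> dict[str, bool]:
    """Merge boolean columns that share the same truncated name (logical OR)."""
    grouped: dict[str, bool] = {}
    for name, value in row.items():
        key = truncate_column_name(name, hierarchy_depth)
        grouped[key] = grouped.get(key, False) or value
    return grouped


# Toy full-depth row built from water column names that appear in this notebook.
row = {
    "base|water|canal|canal": False,
    "base|water|canal|ditch": True,
    "base|water|lake|lake": False,
}
grouped = group_row(row, hierarchy_depth=1)
print(grouped)
```

In this toy row the two canal columns collapse into a single base|water|canal flag, which is the kind of grouping that reduces the number of columns in the table above.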
Pruning final list of columns¶
By default, OvertureMaestro includes all possible columns, regardless of whether any features of a given category exist. This keeps the overall schema consistent across different geographical regions and simplifies the feature engineering process.
However, there is a dedicated parameter, include_all_possible_columns, that can be set to False to keep only the columns based on features that actually exist.
convert_geometry_to_wide_form_geodataframe(
"base", "infrastructure", amsterdam, include_all_possible_columns=True # default value
)
Finished operation in 0:00:00
geometry | base|infrastructure|aerialway|aerialway_station | base|infrastructure|aerialway|cable_car | base|infrastructure|aerialway|chair_lift | base|infrastructure|aerialway|drag_lift | base|infrastructure|aerialway|gondola | base|infrastructure|aerialway|goods | base|infrastructure|aerialway|j-bar | base|infrastructure|aerialway|magic_carpet | base|infrastructure|aerialway|mixed_lift | ... | base|infrastructure|utility|storage_tank | base|infrastructure|utility|utility_pole | base|infrastructure|utility|water_tower | base|infrastructure|waste_management|recycling | base|infrastructure|waste_management|waste_basket | base|infrastructure|waste_management|waste_disposal | base|infrastructure|water|dam | base|infrastructure|water|drinking_water | base|infrastructure|water|fountain | base|infrastructure|water|weir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08b19695a049dfff0001b57079076c1e | LINESTRING (4.58299 52.2431, 4.58547 52.24486,... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196824815bfff0001bf88700422f2 | LINESTRING (4.76327 52.35875, 4.7635 52.35868) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196824815bfff0001b20d7e7879cb | LINESTRING (4.76352 52.3587, 4.76329 52.35877) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b19682483aafff0001a7da78096ff0 | POINT (4.76176 52.35452) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b19682483aafff0001a1e79450b382 | POINT (4.76173 52.35453) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b1968244ccafff0001aed6065f6f0d | POINT (4.74009 52.42913) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968244ccafff0001aa3ce70050fa | POINT (4.74007 52.42917) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968244ccafff0001a0f044545755 | POINT (4.74012 52.42918) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968244cd8fff0001b682e7e639f1 | LINESTRING (4.73934 52.42933, 4.73955 52.42931) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968244cccfff0001b43fca44c1fd | LINESTRING (4.74075 52.42951, 4.74182 52.42937) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
80559 rows × 150 columns
convert_geometry_to_wide_form_geodataframe(
"base", "infrastructure", amsterdam, include_all_possible_columns=False
)
Finished operation in 0:00:00
geometry | base|infrastructure|airport|apron | base|infrastructure|airport|helipad | base|infrastructure|airport|heliport | base|infrastructure|barrier|barrier | base|infrastructure|barrier|block | base|infrastructure|barrier|bollard | base|infrastructure|barrier|border_control | base|infrastructure|barrier|bump_gate | base|infrastructure|barrier|bus_trap | ... | base|infrastructure|utility|storage_tank | base|infrastructure|utility|utility_pole | base|infrastructure|utility|water_tower | base|infrastructure|waste_management|recycling | base|infrastructure|waste_management|waste_basket | base|infrastructure|waste_management|waste_disposal | base|infrastructure|water|dam | base|infrastructure|water|drinking_water | base|infrastructure|water|fountain | base|infrastructure|water|weir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08b19695a049dfff0001b57079076c1e | LINESTRING (4.58299 52.2431, 4.58547 52.24486,... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196824815bfff0001bf88700422f2 | LINESTRING (4.76327 52.35875, 4.7635 52.35868) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b196824815bfff0001b20d7e7879cb | LINESTRING (4.76352 52.3587, 4.76329 52.35877) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b19682483aafff0001a7da78096ff0 | POINT (4.76176 52.35452) | False | False | False | False | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b19682483aafff0001a1e79450b382 | POINT (4.76173 52.35453) | False | False | False | False | False | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b1968244ccafff0001aed6065f6f0d | POINT (4.74009 52.42913) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968244ccafff0001aa3ce70050fa | POINT (4.74007 52.42917) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968244ccafff0001a0f044545755 | POINT (4.74012 52.42918) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968244cd8fff0001b682e7e639f1 | LINESTRING (4.73934 52.42933, 4.73955 52.42931) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b1968244cccfff0001b43fca44c1fd | LINESTRING (4.74075 52.42951, 4.74182 52.42937) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
80559 rows × 91 columns
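The same pruning can be approximated after the fact on a table that has the full schema. The sketch below uses plain Python dictionaries in place of a GeoDataFrame and assumes boolean columns. drop_empty_columns is a hypothetical helper written for illustration, not part of OvertureMaestro.

```python
def drop_empty_columns(rows: list[dict[str, bool]]) -> list[dict[str, bool]]:
    """Remove every column whose value is False in all rows."""
    # Collect the names of columns that are True for at least one row.
    used = {name for row in rows for name, value in row.items() if value}
    # Rebuild each row, keeping the original column order.
    return [
        {name: value for name, value in row.items() if name in used}
        for row in rows
    ]


# Toy rows with three infrastructure columns; no row is a gondola.
rows = [
    {
        "base|infrastructure|aerialway|gondola": False,
        "base|infrastructure|barrier|bollard": True,
        "base|infrastructure|water|fountain": False,
    },
    {
        "base|infrastructure|aerialway|gondola": False,
        "base|infrastructure|barrier|bollard": False,
        "base|infrastructure|water|fountain": True,
    },
]
pruned = drop_empty_columns(rows)
print(list(pruned[0]))
```

Keeping the full schema is usually the safer choice when tables from several regions are later stacked together, since pruned tables may end up with different column sets.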
Getting a full list of possible column names¶
You can also preview the final list of columns before downloading the data using the get_all_possible_column_names function.
You can specify the release, theme and type, as well as hierarchy_depth.
from overturemaestro.advanced_functions.wide_form import get_all_possible_column_names
get_all_possible_column_names(theme="base", type="water")
['base|water|canal|canal', 'base|water|canal|ditch', 'base|water|canal|drain', 'base|water|canal|moat', 'base|water|human_made|fish_pass', 'base|water|human_made|reflecting_pool', 'base|water|human_made|salt_pond', 'base|water|human_made|swimming_pool', 'base|water|lake|lagoon', 'base|water|lake|lake', 'base|water|lake|oxbow', 'base|water|ocean|ocean', 'base|water|physical|bay', 'base|water|physical|cape', 'base|water|physical|ocean', 'base|water|physical|sea', 'base|water|physical|shoal', 'base|water|physical|strait', 'base|water|physical|waterfall', 'base|water|pond|fishpond', 'base|water|pond|pond', 'base|water|reservoir|basin', 'base|water|reservoir|reservoir', 'base|water|reservoir|water_storage', 'base|water|river|river', 'base|water|spring|blowhole', 'base|water|spring|geyser', 'base|water|spring|hot_spring', 'base|water|spring|spring', 'base|water|stream|stream', 'base|water|wastewater|sewage', 'base|water|water|dock', 'base|water|water|fairway', 'base|water|water|tidal_channel', 'base|water|water|wastewater', 'base|water|water|water']
With all parameters empty, the function returns the full list of all possible columns at maximal depth.
columns = get_all_possible_column_names()
len(columns)
2622
columns[:10]
['base|infrastructure|aerialway|aerialway_station', 'base|infrastructure|aerialway|cable_car', 'base|infrastructure|aerialway|chair_lift', 'base|infrastructure|aerialway|drag_lift', 'base|infrastructure|aerialway|gondola', 'base|infrastructure|aerialway|goods', 'base|infrastructure|aerialway|j-bar', 'base|infrastructure|aerialway|magic_carpet', 'base|infrastructure|aerialway|mixed_lift', 'base|infrastructure|aerialway|platter']
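Since a list of this size is hard to inspect by eye, it can help to summarise it per theme and type. The following is a small sketch that works on any list of column names in the theme|type|... form shown above. The helper count_columns_per_type is hypothetical, and the sample list is a hand-picked subset of names from this notebook rather than real function output.

```python
from collections import Counter


def count_columns_per_type(columns: list[str]) -> Counter:
    """Count column names per 'theme|type' prefix."""
    return Counter("|".join(name.split("|")[:2]) for name in columns)


# Hand-picked sample; in practice pass the list returned by
# get_all_possible_column_names().
sample = [
    "base|infrastructure|aerialway|cable_car",
    "base|infrastructure|aerialway|gondola",
    "base|water|canal|canal",
    "buildings|building|civic",
]
counts = count_columns_per_type(sample)
for prefix, n in counts.most_common():
    print(prefix, n)
```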
You can also specify different hierarchy_depth values.
get_all_possible_column_names(theme="buildings", type="building", hierarchy_depth=1)
['buildings|building', 'buildings|building|agricultural', 'buildings|building|civic', 'buildings|building|commercial', 'buildings|building|education', 'buildings|building|entertainment', 'buildings|building|industrial', 'buildings|building|medical', 'buildings|building|military', 'buildings|building|outbuilding', 'buildings|building|religious', 'buildings|building|residential', 'buildings|building|service', 'buildings|building|transportation']