Wide format¶
OvertureMaestro implements logic for transforming downloaded data into a wide
format. This format is intended for geospatial machine learning, where selected datasets are pivoted on their categories into a columnar format.
This notebook explores what this format is and how to work with it.
New functions¶
The new module contains the same set of functions as the basic API, just with the wide_form part in their names:
- convert_geometry_to_parquet → convert_geometry_to_wide_form_parquet
- convert_geometry_to_geodataframe → convert_geometry_to_wide_form_geodataframe
- other functions ...
Additionally, there are dedicated functions for downloading all available datasets at once:
convert_geometry_to_wide_form_parquet_for_all_types
convert_geometry_to_wide_form_geodataframe_for_all_types
convert_bounding_box_to_wide_form_parquet_for_all_types
convert_bounding_box_to_wide_form_geodataframe_for_all_types
You can import them from the overturemaestro.advanced_functions module.
from overturemaestro import convert_geometry_to_geodataframe, geocode_to_geometry
from overturemaestro.advanced_functions import convert_geometry_to_wide_form_geodataframe
What is the wide format?¶
In this section we will compare the original data format with the wide format, using water data as an example.
Let's start by looking at the official Overture Maps schema for the base water data type:
import requests
import yaml
response = requests.get(
"https://raw.githubusercontent.com/OvertureMaps/schema/refs/tags/v1.4.0/schema/base/water.yaml",
allow_redirects=True,
)
water_schema = yaml.safe_load(response.content.decode("utf-8"))
water_schema
{'$schema': 'https://json-schema.org/draft/2020-12/schema', 'title': 'water', 'description': 'Physical representations of inland and ocean marine surfaces. Translates `natural` and `waterway` tags from OpenStreetMap.', 'type': 'object', 'properties': {'id': {'$ref': '../defs.yaml#/$defs/propertyDefinitions/id'}, 'geometry': {'unevaluatedProperties': False, 'oneOf': [{'$ref': 'https://geojson.org/schema/Point.json'}, {'$ref': 'https://geojson.org/schema/LineString.json'}, {'$ref': 'https://geojson.org/schema/Polygon.json'}, {'$ref': 'https://geojson.org/schema/MultiPolygon.json'}]}, 'properties': {'unevaluatedProperties': False, 'allOf': [{'$ref': '../defs.yaml#/$defs/propertyContainers/overtureFeaturePropertiesContainer'}, {'$ref': '../defs.yaml#/$defs/propertyContainers/levelContainer'}, {'$ref': '../defs.yaml#/$defs/propertyContainers/namesContainer'}, {'$ref': './defs.yaml#/$defs/propertyContainers/osmPropertiesContainer'}], 'required': ['subtype', 'class'], 'properties': {'subtype': {'description': 'The type of water body such as an river, ocean or lake.', 'default': ['water'], 'type': 'string', 'enum': ['canal', 'human_made', 'lake', 'ocean', 'physical', 'pond', 'reservoir', 'river', 'spring', 'stream', 'wastewater', 'water']}, 'class': {'description': 'Further description of the type of water', 'default': ['water'], 'enum': ['basin', 'bay', 'blowhole', 'canal', 'cape', 'ditch', 'dock', 'drain', 'fairway', 'fish_pass', 'fishpond', 'geyser', 'hot_spring', 'lagoon', 'lake', 'moat', 'ocean', 'oxbow', 'pond', 'reflecting_pool', 'reservoir', 'river', 'salt_pond', 'sea', 'sewage', 'shoal', 'spring', 'strait', 'stream', 'swimming_pool', 'tidal_channel', 'wastewater', 'water', 'water_storage', 'waterfall']}, 'is_salt': {'description': 'Is it salt water or not', 'type': 'boolean'}, 'is_intermittent': {'description': 'Is it intermittent water or not', 'type': 'boolean'}}}}}
Two required fields are defined in the specification: subtype and class. The schema also lists the possible values for each of them.
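As a small sketch, the required fields and their allowed values can be read straight from the parsed schema. To keep the snippet self-contained, `schema_excerpt` is a hand-trimmed copy of the structure printed above (the enum lists are shortened); with the real data you would index `water_schema` the same way.

```python
# Hand-trimmed excerpt of the parsed water schema shown above
# (enum lists are shortened for brevity; this is not the full schema).
schema_excerpt = {
    "properties": {
        "properties": {
            "required": ["subtype", "class"],
            "properties": {
                "subtype": {"enum": ["canal", "human_made", "lake"]},
                "class": {"enum": ["basin", "bay", "blowhole"]},
            },
        }
    }
}

feature_properties = schema_excerpt["properties"]["properties"]
required_fields = feature_properties["required"]

# Map each required field to its list of allowed values.
allowed_values = {
    field: feature_properties["properties"][field]["enum"] for field in required_fields
}
print(allowed_values)
```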
Both of these values describe what the feature is. Together, they map to the path: theme (base) → type (water) → subtype (e.g. reservoir) → class (e.g. basin).
Based on this hierarchy, all available values can be determined and mapped to columns.
In this way, you will obtain data in a wide format, where each feature defines what it is with boolean flags.
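To make the idea concrete, here is a minimal pandas sketch of the pivot on a toy table. It is not the library's implementation, only an illustration of the resulting shape; the three rows and the `paths` variable are made up for the example.

```python
import pandas as pd

# Toy stand-in for the original (long) data: one row per feature.
features = pd.DataFrame(
    {
        "subtype": ["human_made", "canal", "river"],
        "class": ["swimming_pool", "drain", "river"],
    },
    index=pd.Index(["a", "b", "c"], name="id"),
)

# Build the full hierarchy path for each feature: theme|type|subtype|class.
paths = "base|water|" + features["subtype"] + "|" + features["class"]

# One boolean column per distinct path; each row has exactly one True here.
wide = pd.get_dummies(paths, dtype=bool)
print(wide)
```

Unlike this toy version, which only creates columns for categories present in the data, the wide-form output in this notebook also contains columns for categories that have no features in the selected area.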
amsterdam = geocode_to_geometry("Amsterdam")
original_data = convert_geometry_to_geodataframe("base", "water", amsterdam)
wide_data = convert_geometry_to_wide_form_geodataframe("base", "water", amsterdam)
Finished operation in 0:00:14
Finished operation in 0:00:11
original_data
geometry | bbox | version | sources | level | subtype | class | names | source_tags | wikidata | is_salt | is_intermittent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||
08b2a1026039afff0004d4096c81973b | POLYGON ((-72.99967 40.68802, -72.99934 40.688... | {'xmin': -74.00066375732422, 'xmax': -72.99933... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | 0.0 | ocean | ocean | None | [] | None | True | False |
08b2a103a689afff0004b3f9b19b4f12 | POLYGON ((-73.72785 40.66063, -73.72784 40.660... | {'xmin': -73.7278823852539, 'xmax': -73.727821... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | human_made | swimming_pool | None | [(leisure, swimming_pool)] | None | None | None |
08b2a103a6898fff0004b39d61f5f310 | LINESTRING (-73.72846 40.66632, -73.72851 40.6... | {'xmin': -73.728515625, 'xmax': -73.7254257202... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | -1.0 | canal | drain | {'primary': 'Hook Creek', 'common': None, 'rul... | [(waterway, drain)] | None | None | None |
08b2a103a612efff0004b5780315a321 | POLYGON ((-73.72678 40.65868, -73.72682 40.658... | {'xmin': -73.7268295288086, 'xmax': -73.726760... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | human_made | swimming_pool | None | [(leisure, swimming_pool)] | None | None | None |
08b2a103a6132fff0004b4bb7a50d9bd | POLYGON ((-73.72903 40.66007, -73.72902 40.660... | {'xmin': -73.72909545898438, 'xmax': -73.72901... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | human_made | swimming_pool | None | [(leisure, swimming_pool)] | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b2a10033232fff0004b72598dbf45a | POLYGON ((-73.81194 40.88809, -73.81196 40.888... | {'xmin': -73.81196594238281, 'xmax': -73.81179... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | human_made | swimming_pool | None | [(leisure, swimming_pool), (location, outdoor)] | None | None | None |
08b2a1019d261fff0004c13b6867746d | POLYGON ((-73.91045 40.91526, -73.91018 40.915... | {'xmin': -73.94308471679688, 'xmax': -73.88542... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | river | river | None | [(intermittent, no), (natural, water), (water,... | None | None | False |
08b2a10183754fff0004b6517ca52878 | LINESTRING (-73.88916 41.04345, -73.89008 41.0... | {'xmin': -73.93392944335938, 'xmax': -73.88914... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | river | river | {'primary': 'Hudson River', 'common': [('es', ... | [(intermittent, no), (waterway, river)] | None | None | False |
08b2a101680a2fff0004de729001d686 | POLYGON ((-73.64511 41.00048, -73.64509 41.000... | {'xmin': -74.00066375732422, 'xmax': -72.99933... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | 0.0 | ocean | ocean | None | [] | None | True | False |
08b2a154318acfff0004c3ef7d67cb62 | POLYGON ((-72.23316 41.16052, -72.2331 41.1604... | {'xmin': -73.80953979492188, 'xmax': -71.85736... | 0 | [{'property': '', 'dataset': 'OpenStreetMap', ... | NaN | physical | bay | {'primary': 'Long Island Sound', 'common': Non... | [(natural, bay)] | Q867460 | None | None |
48992 rows × 12 columns
wide_data
geometry | base|water|canal|canal | base|water|canal|ditch | base|water|canal|drain | base|water|canal|moat | base|water|human_made|fish_pass | base|water|human_made|reflecting_pool | base|water|human_made|salt_pond | base|water|human_made|swimming_pool | base|water|lake|lagoon | ... | base|water|spring|geyser | base|water|spring|hot_spring | base|water|spring|spring | base|water|stream|stream | base|water|wastewater|sewage | base|water|water|dock | base|water|water|fairway | base|water|water|tidal_channel | base|water|water|wastewater | base|water|water|water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08b2a1026039afff0004d4096c81973b | POLYGON ((-72.99967 40.68802, -72.99934 40.688... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a103a689afff0004b3f9b19b4f12 | POLYGON ((-73.72785 40.66063, -73.72784 40.660... | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a103a6898fff0004b39d61f5f310 | LINESTRING (-73.72846 40.66632, -73.72851 40.6... | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a103a612efff0004b5780315a321 | POLYGON ((-73.72678 40.65868, -73.72682 40.658... | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a103a6132fff0004b4bb7a50d9bd | POLYGON ((-73.72903 40.66007, -73.72902 40.660... | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b2a10033232fff0004b72598dbf45a | POLYGON ((-73.81194 40.88809, -73.81196 40.888... | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a1019d261fff0004c13b6867746d | POLYGON ((-73.91045 40.91526, -73.91018 40.915... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a10183754fff0004b6517ca52878 | LINESTRING (-73.88916 41.04345, -73.89008 41.0... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a101680a2fff0004de729001d686 | POLYGON ((-73.64511 41.00048, -73.64509 41.000... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a154318acfff0004c3ef7d67cb62 | POLYGON ((-72.23316 41.16052, -72.2331 41.1604... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
48992 rows × 37 columns
Using this format, we can quickly filter the data or calculate the number of features per category.
wide_data.drop(columns="geometry").sum().sort_values(ascending=False)
base|water|human_made|swimming_pool    46839
base|water|stream|stream    639
base|water|water|water    488
base|water|pond|pond    331
base|water|river|river    157
base|water|canal|ditch    110
base|water|water|wastewater    100
base|water|reservoir|basin    74
base|water|water|tidal_channel    65
base|water|physical|bay    53
base|water|canal|drain    32
base|water|physical|cape    30
base|water|water|fairway    14
base|water|canal|canal    14
base|water|physical|waterfall    10
base|water|physical|shoal    9
base|water|ocean|ocean    9
base|water|lake|lake    6
base|water|reservoir|reservoir    6
base|water|human_made|reflecting_pool    4
base|water|physical|strait    1
base|water|human_made|fish_pass    1
base|water|spring|spring    0
base|water|canal|moat    0
base|water|water|dock    0
base|water|wastewater|sewage    0
base|water|human_made|salt_pond    0
base|water|physical|ocean    0
base|water|spring|hot_spring    0
base|water|spring|geyser    0
base|water|physical|sea    0
base|water|lake|lagoon    0
base|water|reservoir|water_storage    0
base|water|lake|oxbow    0
base|water|pond|fishpond    0
base|water|spring|blowhole    0
dtype: int64
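Filtering works the same way as counting: a boolean column can be used directly as a row mask. The sketch below uses a small hand-made frame instead of the downloaded data, so the values are illustrative only.

```python
import pandas as pd

# Toy wide-form frame with two category columns (no geometry column here).
wide = pd.DataFrame(
    {
        "base|water|human_made|swimming_pool": [True, False, True],
        "base|water|canal|drain": [False, True, False],
    },
    index=pd.Index(["a", "b", "c"], name="id"),
)

# Keep only the features flagged as swimming pools.
pools = wide[wide["base|water|human_made|swimming_pool"]]

# Summing boolean columns gives the number of features per category.
counts = wide.sum()
print(pools.index.tolist(), counts.to_dict())
```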
Each theme/type pair has a defined list of hierarchy columns used to generate the final list of columns.
Most datasets use two columns (subtype and class), with three exceptions:
- base|land_cover → subtype only
- transportation|segment → subtype, class and subclass
- places|place → 1, 2, 3, ... (this one is described in detail below)
from overturemaestro.advanced_functions.wide_form import THEME_TYPE_CLASSIFICATION
for (theme_value, type_value), definition in sorted(THEME_TYPE_CLASSIFICATION.items()):
print(theme_value, type_value, definition.hierachy_columns)
base infrastructure ['subtype', 'class']
base land ['subtype', 'class']
base land_cover ['subtype']
base land_use ['subtype', 'class']
base water ['subtype', 'class']
buildings building ['subtype', 'class']
places place ['1', '2', '3', '4', '5', '6']
transportation segment ['subtype', 'class', 'subclass']
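Because the column names encode the hierarchy, they can be split back into their parts. The helper below is hypothetical (it is not part of OvertureMaestro) and assumes the `|` separator seen in the outputs of this notebook.

```python
def split_wide_column(column_name):
    """Hypothetical helper: split a wide-form column name into its parts."""
    theme, type_, *hierarchy = column_name.split("|")
    return {"theme": theme, "type": type_, "hierarchy": hierarchy}


# The number of hierarchy values depends on the theme/type pair and the depth used.
print(split_wide_column("base|land_cover|shrub"))
print(split_wide_column("transportation|segment|road|footway|sidewalk"))
```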
Multiple data types¶
You can also download data for multiple theme/type pairs at once, or even for all of them at once.
If some datasets were downloaded during previous executions, only the missing data is downloaded.
Here we will look at the 10 most common features for both examples.
from overturemaestro.advanced_functions import (
convert_geometry_to_wide_form_geodataframe_for_all_types,
convert_geometry_to_wide_form_geodataframe_for_multiple_types,
)
two_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_multiple_types(
[("base", "water"), ("base", "land_cover")], amsterdam
)
two_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)
Finished operation in 0:00:11
base|water|human_made|swimming_pool    46839
base|water|stream|stream    639
base|land_cover|shrub    505
base|water|water|water    488
base|land_cover|barren    475
base|land_cover|forest    442
base|water|pond|pond    331
base|water|river|river    157
base|land_cover|wetland    129
base|water|canal|ditch    110
dtype: int64
len(two_datasets_gdf.columns)
47
all_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(
amsterdam, sort_result=False # we skip sorting the result here for faster execution
)
all_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)
Finished operation in 0:01:48
buildings|building    824803
base|infrastructure|barrier|kerb    116746
buildings|building|residential|garage    103500
base|infrastructure|transportation|crossing    102550
base|infrastructure|transit|parking_space    77881
transportation|segment|road|footway    66077
buildings|building|residential|detached    64719
transportation|segment|road|residential    59703
base|land|tree|tree    49311
transportation|segment|road|footway|sidewalk    48815
dtype: int64
len(all_datasets_gdf.columns)
2633
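With thousands of columns in one frame, it is often handy to group them by theme/type pair. The following sketch works on a plain list of column names (a few taken from the output above) and assumes the `|` separator; with a real frame you would pass its column names, minus `geometry`, instead.

```python
from collections import Counter

# A handful of column names copied from the combined output above.
columns = [
    "buildings|building",
    "base|infrastructure|barrier|kerb",
    "buildings|building|residential|garage",
    "transportation|segment|road|footway",
    "base|land|tree|tree",
]

# Count how many columns belong to each (theme, type) pair.
columns_per_pair = Counter(tuple(name.split("|")[:2]) for name in columns)

# Select only the columns of a single pair, e.g. buildings/building.
building_columns = [name for name in columns if name.split("|")[:2] == ["buildings", "building"]]
print(columns_per_pair, building_columns)
```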
Limiting hierarchy depth¶
If you only need a higher-level aggregation of the data, you can limit the hierarchy depth.
By default, the full hierarchy is used to generate the columns.
Note
If you pass a value that is too high, it is automatically capped at the maximum depth available for the given theme/type pair.
limited_depth_water_gdf = convert_geometry_to_wide_form_geodataframe(
"base", "water", amsterdam, hierarchy_depth=1
)
limited_depth_water_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:17
base|water|canal    156
base|water|human_made    46844
base|water|lake    6
base|water|ocean    9
base|water|physical    103
base|water|pond    331
base|water|reservoir    80
base|water|river    157
base|water|spring    0
base|water|stream    639
base|water|wastewater    0
base|water|water    667
dtype: int64
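The depth-limited counts line up with the full-depth ones: for example, base|water|canal (156) equals the sum of the canal, ditch, drain and moat columns (14 + 110 + 32 + 0). The sketch below reproduces that relationship on a toy frame by truncating column names and combining the child columns. It illustrates the result only and makes no claim about how the library computes it internally.

```python
import pandas as pd

# Toy full-depth frame; column names follow the theme|type|subtype|class pattern.
full = pd.DataFrame(
    {
        "base|water|canal|canal": [True, False, False],
        "base|water|canal|drain": [False, True, False],
        "base|water|river|river": [False, False, True],
    }
)

depth = 1  # number of hierarchy levels to keep after theme and type

# Truncate every column name to theme, type and the first `depth` levels.
parents = ["|".join(name.split("|")[: 2 + depth]) for name in full.columns]

# A truncated column is True when any of its child columns is True.
limited = pd.DataFrame(
    {
        parent: full[[c for c, p in zip(full.columns, parents) if p == parent]].any(axis=1)
        for parent in dict.fromkeys(parents)
    }
)
print(limited.sum())
```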
Using a value of 0 will result in just a list of theme/type pairs.
limited_depth_all_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(
amsterdam, hierarchy_depth=0
)
limited_depth_all_gdf.drop(columns="geometry").sum()
Finished operation in 0:01:12
base|infrastructure    454866
base|land    55309
base|land_cover    1714
base|land_use    40165
base|water    48992
buildings|building    1091698
places|place    196640
transportation|segment    289825
dtype: int64
You can also pass a list if you are downloading data for multiple datasets at once. The list must have the same length as the list of theme_type_pairs.
limited_depth_multiple_gdf = convert_geometry_to_wide_form_geodataframe_for_multiple_types(
[("places", "place"), ("base", "land_cover"), ("base", "water")],
amsterdam,
hierarchy_depth=[1, None, 0],
)
limited_depth_multiple_gdf.drop(columns="geometry").sum()
Finished operation in 0:00:08
base|land_cover|barren    475
base|land_cover|crop    58
base|land_cover|forest    442
base|land_cover|grass    0
base|land_cover|mangrove    0
base|land_cover|moss    0
base|land_cover|shrub    505
base|land_cover|snow    0
base|land_cover|urban    105
base|land_cover|wetland    129
base|water    48992
places|place|accommodation    3335
places|place|active_life    7495
places|place|arts_and_entertainment    7938
places|place|attractions_and_activities    9846
places|place|automotive    5934
places|place|beauty_and_spa    14829
places|place|business_to_business    11730
places|place|eat_and_drink    34244
places|place|education    10321
places|place|financial_service    10838
places|place|health_and_medical    28118
places|place|home_service    11383
places|place|mass_media    3131
places|place|pets    1416
places|place|private_establishments_and_corporates    758
places|place|professional_services    44393
places|place|public_service_and_government    14856
places|place|real_estate    8766
places|place|religious_organization    6680
places|place|retail    48376
places|place|structure_and_geography    702
places|place|travel    6028
dtype: int64
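The pairing and capping rules can be sketched with a small hypothetical helper (`resolve_depths` is not a library function). The maximum depths come from the hierarchy columns printed earlier, and treating `None` as "use the full hierarchy" is an assumption based on the land_cover columns in the output above.

```python
# Maximum depths taken from the hierarchy columns printed earlier
# (e.g. base/water has ['subtype', 'class'], so its maximum depth is 2).
MAX_DEPTH = {("places", "place"): 6, ("base", "land_cover"): 1, ("base", "water"): 2}


def resolve_depths(theme_type_pairs, hierarchy_depth):
    """Hypothetical helper: pair each theme/type with its effective depth."""
    if len(hierarchy_depth) != len(theme_type_pairs):
        raise ValueError("hierarchy_depth must have one value per theme/type pair")
    resolved = {}
    for pair, depth in zip(theme_type_pairs, hierarchy_depth):
        max_depth = MAX_DEPTH[pair]
        # None is treated here as "use the full hierarchy"; larger values are capped.
        resolved[pair] = max_depth if depth is None else min(depth, max_depth)
    return resolved


pairs = [("places", "place"), ("base", "land_cover"), ("base", "water")]
print(resolve_depths(pairs, [1, None, 0]))
```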
Places¶
Places data has a different schema than the other datasets, and it is the only one where a feature can have multiple categories at once: a primary category and, optionally, multiple alternate ones.
This structure is preserved in the wide format, making it the only dataset where a single feature can have multiple True values in the columns.
OvertureMaestro uses the categories column, with its primary and alternate sub-fields, to categorize features. The hierarchy depth of 6 is based on the official taxonomy of the possible categories.
Two pyarrow filters are applied automatically when downloading the data for the wide format: the confidence value must be >= 0.75 and categories cannot be empty.
import pyarrow.compute as pc
category_not_null_filter = pc.invert(pc.field("categories").is_null())
minimal_confidence_filter = pc.field("confidence") >= pc.scalar(0.75)
combined_filter = category_not_null_filter & minimal_confidence_filter
original_places_data = convert_geometry_to_geodataframe(
"places",
"place",
amsterdam,
pyarrow_filter=combined_filter,
columns_to_download=["id", "geometry", "categories", "confidence"],
)
original_places_data
Finished operation in 0:00:09
geometry | categories | confidence | |
---|---|---|---|
id | |||
08f2a106e23617b203dc15e14c759373 | POINT (-74.25304 40.48667) | {'primary': 'lighthouse', 'alternate': ['landm... | 0.935964 |
08f2a106e052b583031e2cb58e77a4dc | POINT (-74.23893 40.49926) | {'primary': 'italian_restaurant', 'alternate':... | 0.935964 |
08f2a106e058380e03ae29302d77c8c9 | POINT (-74.24486 40.49938) | {'primary': 'park', 'alternate': ['amusement_p... | 0.979399 |
08f2a106e280c8330333ba5758be10d5 | POINT (-74.25182 40.499) | {'primary': 'landmark_and_historical_building'... | 0.927987 |
08f2a106e29a4b46032abbdd36c49fd4 | POINT (-74.25322 40.5032) | {'primary': 'history_museum', 'alternate': ['m... | 0.979399 |
... | ... | ... | ... |
08f2a10381d14b0803cfd64b12d55c0a | POINT (-73.75876 40.59279) | {'primary': 'landmark_and_historical_building'... | 0.979399 |
08f2a10381d3121d031fa7e099fe6a88 | POINT (-73.75774 40.59271) | {'primary': 'garbage_collection_service', 'alt... | 0.770000 |
08f2a1038568b87403f1b15f711386b5 | POINT (-73.75544 40.59341) | {'primary': 'public_relations', 'alternate': [... | 0.865000 |
08f2a103856f001303ad05a994a73b5c | POINT (-73.75387 40.59336) | {'primary': 'hospital', 'alternate': ['health_... | 0.757967 |
08f2a103856a828103e6c667c0a76915 | POINT (-73.75336 40.59324) | {'primary': 'retirement_home', 'alternate': ['... | 0.935964 |
196640 rows × 3 columns
first_index = (
# Find the first object with more than one alternate category
original_places_data[original_places_data.categories.str.get("alternate").str.len() > 1]
.iloc[0]
.name
)
first_index, original_places_data.loc[first_index].categories
('08f2a106e23617b203dc15e14c759373', {'primary': 'lighthouse', 'alternate': array(['landmark_and_historical_building', 'museum'], dtype=object)})
wide_form_places_data = convert_geometry_to_wide_form_geodataframe("places", "place", amsterdam)
wide_form_places_data
Finished operation in 0:00:35
geometry | places|place|accommodation | places|place|accommodation|bed_and_breakfast | places|place|accommodation|cabin | places|place|accommodation|campground | places|place|accommodation|cottage | places|place|accommodation|guest_house | places|place|accommodation|health_retreats | places|place|accommodation|holiday_rental_home | places|place|accommodation|hostel | ... | places|place|travel|transportation|transport_interchange | places|place|travel|transportation|water_taxi | places|place|travel|travel_services | places|place|travel|travel_services|luggage_storage | places|place|travel|travel_services|passport_and_visa_services | places|place|travel|travel_services|passport_and_visa_services|visa_agent | places|place|travel|travel_services|travel_agents | places|place|travel|travel_services|travel_agents|sightseeing_tour_agency | places|place|travel|travel_services|visitor_center | places|place|travel|vacation_rental_agents | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08f2a106e23617b203dc15e14c759373 | POINT (-74.25304 40.48667) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e052b583031e2cb58e77a4dc | POINT (-74.23893 40.49926) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e058380e03ae29302d77c8c9 | POINT (-74.24486 40.49938) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e280c8330333ba5758be10d5 | POINT (-74.25182 40.499) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e29a4b46032abbdd36c49fd4 | POINT (-74.25322 40.5032) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f2a10381d14b0803cfd64b12d55c0a | POINT (-73.75876 40.59279) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a10381d3121d031fa7e099fe6a88 | POINT (-73.75774 40.59271) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a1038568b87403f1b15f711386b5 | POINT (-73.75544 40.59341) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a103856f001303ad05a994a73b5c | POINT (-73.75387 40.59336) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a103856a828103e6c667c0a76915 | POINT (-73.75336 40.59324) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
196640 rows × 2117 columns
As you can see, only the columns matching the values in the categories column are True, and the rest are False.
wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)
places|place|attractions_and_activities|landmark_and_historical_building    True
places|place|attractions_and_activities|museum    True
places|place|attractions_and_activities|lighthouse    True
places|place|pets|pet_services|farrier_services    False
places|place|professional_services|emergency_service    False
...
places|place|eat_and_drink|bar|milkshake_bar    False
places|place|eat_and_drink|bar|milk_bar    False
places|place|eat_and_drink|bar|lounge    False
places|place|eat_and_drink|bar|kombucha    False
places|place|travel|vacation_rental_agents    False
Name: 08f2a106e23617b203dc15e14c759373, Length: 2116, dtype: object
You can use places_use_primary_category_only to keep only a single category per feature, without the alternates.
primary_only_wide_form_places_data = convert_geometry_to_wide_form_geodataframe(
"places",
"place",
amsterdam,
places_use_primary_category_only=True,
)
primary_only_wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)
Finished operation in 0:00:19
places|place|attractions_and_activities|lighthouse    True
places|place|professional_services|construction_services|stone_and_masonry|masonry_contractors    False
places|place|professional_services|electrical_consultant    False
places|place|professional_services|elder_care_planning    False
places|place|professional_services|editorial_services    False
...
places|place|eat_and_drink|bar|lounge    False
places|place|eat_and_drink|bar|kombucha    False
places|place|eat_and_drink|bar|irish_pub    False
places|place|eat_and_drink|bar|hotel_bar    False
places|place|travel|vacation_rental_agents    False
Name: 08f2a106e23617b203dc15e14c759373, Length: 2116, dtype: object
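The two outputs for the lighthouse feature can be summarized with a small sketch. `true_columns` and `TAXONOMY_EXCERPT` are hypothetical names introduced for illustration; the category paths are copied from the True columns shown above, and the real taxonomy is much larger.

```python
# Category -> hierarchy path, copied from the True columns in the output above.
# A real taxonomy has many more entries; this excerpt is for illustration only.
TAXONOMY_EXCERPT = {
    "lighthouse": "attractions_and_activities|lighthouse",
    "museum": "attractions_and_activities|museum",
    "landmark_and_historical_building": (
        "attractions_and_activities|landmark_and_historical_building"
    ),
}


def true_columns(categories, primary_only=False):
    """Hypothetical helper: names of the wide-form columns that would be True."""
    names = [categories["primary"]]
    if not primary_only:
        names.extend(categories.get("alternate") or [])
    return {f"places|place|{TAXONOMY_EXCERPT[name]}" for name in names}


lighthouse = {
    "primary": "lighthouse",
    "alternate": ["landmark_and_historical_building", "museum"],
}
print(sorted(true_columns(lighthouse)))
print(sorted(true_columns(lighthouse, primary_only=True)))
```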
Below you can see the difference in the counts of True values across all columns.
wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)
places|place|eat_and_drink|restaurant    20419
places|place|professional_services    19678
places|place|health_and_medical    16623
places|place|health_and_medical|doctor    9923
places|place|beauty_and_spa|beauty_salon    9626
...
places|place|business_to_business|business_to_business_services|tower_communication_service    0
places|place|active_life|sports_and_recreation_venue|disc_golf_course    0
places|place|arts_and_entertainment|carousel    0
places|place|eat_and_drink|bar|beach_bar    0
places|place|retail|food|back_shop    0
Length: 2116, dtype: int64
primary_only_wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)
places|place|professional_services    4513
places|place|religious_organization|church_cathedral    3390
places|place|beauty_and_spa|beauty_salon    3057
places|place|public_service_and_government|community_services    2839
places|place|health_and_medical|dentist    2449
...
places|place|real_estate|holiday_park    0
places|place|education|specialty_school|massage_school    0
places|place|real_estate|homeowner_association    0
places|place|real_estate|housing_cooperative    0
places|place|public_service_and_government|organization|social_service_organizations|gay_and_lesbian_services_organization    0
Length: 2116, dtype: int64
You can also change the minimal confidence value with the places_minimal_confidence parameter.
convert_geometry_to_wide_form_geodataframe(
"places", "place", amsterdam, places_minimal_confidence=0.95
)
Finished operation in 0:00:35
geometry | places|place|accommodation | places|place|accommodation|bed_and_breakfast | places|place|accommodation|cabin | places|place|accommodation|campground | places|place|accommodation|cottage | places|place|accommodation|guest_house | places|place|accommodation|health_retreats | places|place|accommodation|holiday_rental_home | places|place|accommodation|hostel | ... | places|place|travel|transportation|transport_interchange | places|place|travel|transportation|water_taxi | places|place|travel|travel_services | places|place|travel|travel_services|luggage_storage | places|place|travel|travel_services|passport_and_visa_services | places|place|travel|travel_services|passport_and_visa_services|visa_agent | places|place|travel|travel_services|travel_agents | places|place|travel|travel_services|travel_agents|sightseeing_tour_agency | places|place|travel|travel_services|visitor_center | places|place|travel|vacation_rental_agents | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08f2a106e2914b4103c2049f3ea40d93 | POINT (-74.25125 40.50058) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e29a4b46032abbdd36c49fd4 | POINT (-74.25322 40.5032) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e66548d003d64aa97032834d | POINT (-74.25416 40.50543) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e058380e03ae29302d77c8c9 | POINT (-74.24486 40.49938) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e0d93cea0343dfeb9e12ff86 | POINT (-74.23788 40.50544) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f2a1038572146a0311dcb9897a707f | POINT (-73.74805 40.59802) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a103850622ea0369eabad6956832 | POINT (-73.74031 40.59847) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a103851596810325a359c9ef83cc | POINT (-73.73994 40.59916) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a1038500cbab03a0f5680783fc35 | POINT (-73.74217 40.59955) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a103853aedac03d0c867a8dc40e7 | POINT (-73.74159 40.59542) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
73450 rows × 2117 columns
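Raising the threshold simply keeps fewer rows, which is why the result shrinks from 196640 to 73450 features. The toy example below mimics that effect with plain pandas filtering on a few confidence values copied from the places table shown earlier; it only illustrates the threshold and does not reproduce the download step.

```python
import pandas as pd

# Toy stand-in for the places data; confidence values copied from the earlier table.
places = pd.DataFrame(
    {"confidence": [0.935964, 0.979399, 0.770000, 0.757967]},
    index=pd.Index(["p1", "p2", "p3", "p4"], name="id"),
)

default_threshold = 0.75
stricter_threshold = 0.95

# Rows at or above the threshold are kept; everything else is dropped.
kept_default = places[places["confidence"] >= default_threshold]
kept_strict = places[places["confidence"] >= stricter_threshold]
print(len(kept_default), len(kept_strict))
```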
The full hierarchy of the places dataset is derived from the official taxonomy.
You can limit the depth to get fewer columns, with grouped categories.
convert_geometry_to_wide_form_geodataframe("places", "place", amsterdam, hierarchy_depth=1)
Finished operation in 0:00:06
geometry | places|place|accommodation | places|place|active_life | places|place|arts_and_entertainment | places|place|attractions_and_activities | places|place|automotive | places|place|beauty_and_spa | places|place|business_to_business | places|place|eat_and_drink | places|place|education | ... | places|place|mass_media | places|place|pets | places|place|private_establishments_and_corporates | places|place|professional_services | places|place|public_service_and_government | places|place|real_estate | places|place|religious_organization | places|place|retail | places|place|structure_and_geography | places|place|travel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08f2a106e23617b203dc15e14c759373 | POINT (-74.25304 40.48667) | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e052b583031e2cb58e77a4dc | POINT (-74.23893 40.49926) | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e058380e03ae29302d77c8c9 | POINT (-74.24486 40.49938) | False | False | True | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e280c8330333ba5758be10d5 | POINT (-74.25182 40.499) | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08f2a106e29a4b46032abbdd36c49fd4 | POINT (-74.25322 40.5032) | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08f2a10381d14b0803cfd64b12d55c0a | POINT (-73.75876 40.59279) | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | True | False | False | False | False |
08f2a10381d3121d031fa7e099fe6a88 | POINT (-73.75774 40.59271) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08f2a1038568b87403f1b15f711386b5 | POINT (-73.75544 40.59341) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08f2a103856f001303ad05a994a73b5c | POINT (-73.75387 40.59336) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | False | False |
08f2a103856a828103e6c667c0a76915 | POINT (-73.75336 40.59324) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
196640 rows × 23 columns
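To see how names at different depths relate to each other, here is a minimal sketch that truncates full-depth column names and groups them under their shallower parent. It only manipulates strings and does not call OvertureMaestro. The helper truncate_column_name is hypothetical, and the sketch assumes that names follow the theme|type|level 1|level 2|... pattern visible in the outputs on this page.

```python
from collections import defaultdict


def truncate_column_name(column_name: str, hierarchy_depth: int) -> str:
    """Keep the theme, the type and the first `hierarchy_depth` category levels."""
    parts = column_name.split("|")
    # The first two parts are the theme and the type; the rest is the hierarchy.
    return "|".join(parts[: 2 + hierarchy_depth])


# Full-depth names copied from the outputs on this page.
full_depth_columns = [
    "base|infrastructure|aerialway|cable_car",
    "base|infrastructure|aerialway|chair_lift",
    "base|infrastructure|water|dam",
    "base|infrastructure|water|fountain",
]

# Group every full-depth name under its depth-1 parent.
grouped = defaultdict(list)
for name in full_depth_columns:
    grouped[truncate_column_name(name, hierarchy_depth=1)].append(name)

for parent, children in grouped.items():
    print(parent, "<-", children)
```

Whether the library fills a grouped column by combining the deeper categories in exactly this way is an assumption; the sketch only illustrates the naming scheme.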
Pruning final list of columns¶
By default, OvertureMaestro includes all possible columns, regardless of whether any features of a given category exist.
This keeps the overall schema consistent across geographical regions and simplifies the feature engineering process.
However, the dedicated parameter include_all_possible_columns can be set to False to keep only the columns for categories that actually occur in the data.
convert_geometry_to_wide_form_geodataframe(
"base", "infrastructure", amsterdam, include_all_possible_columns=True # default value
)
Finished operation in 0:00:06
geometry | base|infrastructure|aerialway|aerialway_station | base|infrastructure|aerialway|cable_car | base|infrastructure|aerialway|chair_lift | base|infrastructure|aerialway|drag_lift | base|infrastructure|aerialway|gondola | base|infrastructure|aerialway|goods | base|infrastructure|aerialway|j-bar | base|infrastructure|aerialway|magic_carpet | base|infrastructure|aerialway|mixed_lift | ... | base|infrastructure|utility|utility_pole | base|infrastructure|utility|water_tower | base|infrastructure|waste_management|recycling | base|infrastructure|waste_management|waste_basket | base|infrastructure|waste_management|waste_disposal | base|infrastructure|water|breakwater | base|infrastructure|water|dam | base|infrastructure|water|drinking_water | base|infrastructure|water|fountain | base|infrastructure|water|weir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08b2a106e2826fff0001a5677bde3f76 | POINT (-74.25192 40.50054) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08b2a106e2903fff0001a51cc65b0f94 | POINT (-74.25112 40.50196) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a106e2902fff0001b624fcf55baf | POLYGON ((-74.2511 40.5021, -74.25108 40.50208... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a106e2902fff0001b4649b8302d3 | POLYGON ((-74.25108 40.50208, -74.25106 40.502... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a106e2900fff0001b3fe003c3acc | POLYGON ((-74.25106 40.50206, -74.25105 40.502... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b2a1076b618fff0001b6dcf8743f84 | POLYGON ((-73.93049 40.55266, -73.93047 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a1076b618fff0001bde1c7f08c81 | POLYGON ((-73.93052 40.55265, -73.9305 40.5526... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a1076b618fff0001b3d68343c406 | POLYGON ((-73.93043 40.55267, -73.93041 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a1076b618fff0001ba7d0f5ee98c | POLYGON ((-73.9304 40.55268, -73.93038 40.5526... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a1076b618fff0001b91c5eca984c | POLYGON ((-73.93037 40.55269, -73.93035 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
454866 rows × 161 columns
convert_geometry_to_wide_form_geodataframe(
"base", "infrastructure", amsterdam, include_all_possible_columns=False
)
Finished operation in 0:00:06
geometry | base|infrastructure|aerialway|aerialway_station | base|infrastructure|aerialway|cable_car | base|infrastructure|aerialway|pylon | base|infrastructure|aerialway|zip_line | base|infrastructure|airport|airport_gate | base|infrastructure|airport|apron | base|infrastructure|airport|helipad | base|infrastructure|airport|heliport | base|infrastructure|airport|international_airport | ... | base|infrastructure|utility|utility_pole | base|infrastructure|utility|water_tower | base|infrastructure|waste_management|recycling | base|infrastructure|waste_management|waste_basket | base|infrastructure|waste_management|waste_disposal | base|infrastructure|water|breakwater | base|infrastructure|water|dam | base|infrastructure|water|drinking_water | base|infrastructure|water|fountain | base|infrastructure|water|weir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
08b2a106e2826fff0001a5677bde3f76 | POINT (-74.25192 40.50054) | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
08b2a106e2903fff0001a51cc65b0f94 | POINT (-74.25112 40.50196) | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a106e2902fff0001b624fcf55baf | POLYGON ((-74.2511 40.5021, -74.25108 40.50208... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a106e2902fff0001b4649b8302d3 | POLYGON ((-74.25108 40.50208, -74.25106 40.502... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a106e2900fff0001b3fe003c3acc | POLYGON ((-74.25106 40.50206, -74.25105 40.502... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
08b2a1076b618fff0001b6dcf8743f84 | POLYGON ((-73.93049 40.55266, -73.93047 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a1076b618fff0001bde1c7f08c81 | POLYGON ((-73.93052 40.55265, -73.9305 40.5526... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a1076b618fff0001b3d68343c406 | POLYGON ((-73.93043 40.55267, -73.93041 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a1076b618fff0001ba7d0f5ee98c | POLYGON ((-73.9304 40.55268, -73.93038 40.5526... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
08b2a1076b618fff0001b91c5eca984c | POLYGON ((-73.93037 40.55269, -73.93035 40.552... | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
454866 rows × 119 columns
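The difference between the two results can be imitated on a toy table. This is a minimal sketch in plain Python, assuming that pruning simply drops every category column without a single True value. The data and the helper prune_empty_columns are made up for illustration and are not part of OvertureMaestro.

```python
def prune_empty_columns(columns: dict[str, list[bool]]) -> dict[str, list[bool]]:
    """Drop every column in which no feature is marked True."""
    return {name: values for name, values in columns.items() if any(values)}


# Toy wide-form table: one list of flags per category column, one flag per feature.
toy_table = {
    "base|infrastructure|aerialway|chair_lift": [False, False, False],
    "base|infrastructure|waste_management|waste_basket": [True, False, False],
    "base|infrastructure|water|fountain": [False, False, True],
}

pruned = prune_empty_columns(toy_table)
print(list(pruned))  # the chair_lift column is gone, the other two remain
```

Keeping the empty columns is what makes tables from different regions line up column by column, so pruning is mostly useful for one-off exploration of a single area.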
Getting a full list of possible column names¶
You can also preview the final list of columns before downloading the data using the get_all_possible_column_names function.
You can specify the release, theme and type, as well as hierarchy_depth.
from overturemaestro.advanced_functions.wide_form import get_all_possible_column_names
get_all_possible_column_names(theme="base", type="water")
['base|water|canal|canal', 'base|water|canal|ditch', 'base|water|canal|drain', 'base|water|canal|moat', 'base|water|human_made|fish_pass', 'base|water|human_made|reflecting_pool', 'base|water|human_made|salt_pond', 'base|water|human_made|swimming_pool', 'base|water|lake|lagoon', 'base|water|lake|lake', 'base|water|lake|oxbow', 'base|water|ocean|ocean', 'base|water|physical|bay', 'base|water|physical|cape', 'base|water|physical|ocean', 'base|water|physical|sea', 'base|water|physical|shoal', 'base|water|physical|strait', 'base|water|physical|waterfall', 'base|water|pond|fishpond', 'base|water|pond|pond', 'base|water|reservoir|basin', 'base|water|reservoir|reservoir', 'base|water|reservoir|water_storage', 'base|water|river|river', 'base|water|spring|blowhole', 'base|water|spring|geyser', 'base|water|spring|hot_spring', 'base|water|spring|spring', 'base|water|stream|stream', 'base|water|wastewater|sewage', 'base|water|water|dock', 'base|water|water|fairway', 'base|water|water|tidal_channel', 'base|water|water|wastewater', 'base|water|water|water']
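The returned names can be split back into their parts. Below is a minimal sketch, assuming the four-part names above follow the theme|type|subtype|class pattern of the water schema; it works on a hand-copied subset of the list and does not call the library.

```python
from collections import defaultdict

# A few names copied from the output above.
water_columns = [
    "base|water|canal|canal",
    "base|water|canal|ditch",
    "base|water|lake|lagoon",
    "base|water|lake|lake",
    "base|water|lake|oxbow",
    "base|water|river|river",
]

# Map every subtype to the classes that appear under it.
classes_by_subtype = defaultdict(list)
for column in water_columns:
    theme, feature_type, subtype, water_class = column.split("|")
    classes_by_subtype[subtype].append(water_class)

print(dict(classes_by_subtype))
```

A mapping like this is handy for selecting all columns of one subtype, for example every lake-related column, when building a feature set.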
With all parameters left empty, the function returns the full list of all possible columns at the maximal depth.
columns = get_all_possible_column_names()
len(columns)
2632
columns[:10]
['base|infrastructure|aerialway|aerialway_station', 'base|infrastructure|aerialway|cable_car', 'base|infrastructure|aerialway|chair_lift', 'base|infrastructure|aerialway|drag_lift', 'base|infrastructure|aerialway|gondola', 'base|infrastructure|aerialway|goods', 'base|infrastructure|aerialway|j-bar', 'base|infrastructure|aerialway|magic_carpet', 'base|infrastructure|aerialway|mixed_lift', 'base|infrastructure|aerialway|platter']
You can also specify different hierarchy_depth values.