PbfFileReader

quackosm.pbf_file_reader.PbfFileReader(
    tags_filter=None,
    geometry_filter=None,
    working_directory="files",
    osm_way_polygon_features_config=None,
)

PbfFileReader.

The PBF (Protocolbuffer Binary Format)[1] file reader is a class dedicated to reading *.osm.pbf files, built on DuckDB[2] and its spatial extension[3].

The handler can limit the result to OSM features that match a tags filter and intersect a geometry filter.

References
  1. https://wiki.openstreetmap.org/wiki/PBF_Format
  2. https://duckdb.org/
  3. https://github.com/duckdb/duckdb_spatial
PARAMETER DESCRIPTION
tags_filter

A dictionary specifying which tags to parse. The keys should be OSM tag keys (e.g. building, amenity). Each value should be either True to retrieve all objects with the tag, a string to retrieve a single tag-value pair, or a list of strings to retrieve all values specified in the list. tags_filter={'leisure': 'park'} would return parks from the area. tags_filter={'leisure': 'park', 'amenity': True, 'shop': ['bakery', 'bicycle']} would return parks, all amenity types, bakeries and bicycle shops. If None, the handler will allow all of the tags to be parsed. Defaults to None.

TYPE: Union[OsmTagsFilter, GroupedOsmTagsFilter] DEFAULT: None

geometry_filter

Region which can be used to filter only intersecting OSM objects. Defaults to None.

TYPE: BaseGeometry DEFAULT: None

working_directory

Directory where to save the parsed *.parquet files. Defaults to "files".

TYPE: Union[str, Path] DEFAULT: 'files'

osm_way_polygon_features_config

Config used to determine which closed way features are polygons. Modifications to this config are left to experienced OSM users. Defaults to the predefined "osm_way_polygon_features.json".

TYPE: Union[OsmWayPolygonConfig, dict[str, Any]] DEFAULT: None
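
The value semantics of tags_filter described above can be illustrated with a small stand-alone sketch. The helper `matches_tags_filter` and the `TagsFilter` alias are hypothetical and not part of QuackOSM; they only mirror the documented rules for a flat (ungrouped) filter: True matches any value, a string matches one value, and a list matches any listed value.

```python
from typing import Union

# Hypothetical alias mirroring the documented shape of a flat (ungrouped) tags filter.
TagsFilter = dict[str, Union[bool, str, list[str]]]


def matches_tags_filter(feature_tags: dict[str, str], tags_filter: TagsFilter) -> bool:
    """Return True if at least one tag of the feature satisfies the filter."""
    for key, expected in tags_filter.items():
        value = feature_tags.get(key)
        if value is None:
            continue  # the feature does not carry this tag key at all
        if expected is True:
            return True  # any value of this key is accepted
        if isinstance(expected, str) and value == expected:
            return True  # exact tag-value pair
        if isinstance(expected, list) and value in expected:
            return True  # one of the listed values
    return False


tags_filter: TagsFilter = {
    "leisure": "park",
    "amenity": True,
    "shop": ["bakery", "bicycle"],
}

for tags in ({"leisure": "park"}, {"amenity": "cafe"}, {"shop": "butcher"}):
    print(tags, matches_tags_filter(tags, tags_filter))
```

This is only a reading aid for the filter format; the reader itself applies the filter while parsing the file, and its internal logic is not shown here.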

Source code in quackosm/pbf_file_reader.py
def __init__(
    self,
    tags_filter: Optional[Union[OsmTagsFilter, GroupedOsmTagsFilter]] = None,
    geometry_filter: Optional[BaseGeometry] = None,
    working_directory: Union[str, Path] = "files",
    osm_way_polygon_features_config: Optional[
        Union[OsmWayPolygonConfig, dict[str, Any]]
    ] = None,
) -> None:
    """
    Initialize PbfFileReader.

    Args:
        tags_filter (Union[OsmTagsFilter, GroupedOsmTagsFilter], optional): A dictionary
            specifying which tags to download.
            The keys should be OSM tags (e.g. `building`, `amenity`).
            The values should either be `True` for retrieving all objects with the tag,
            string for retrieving a single tag-value pair
            or list of strings for retrieving all values specified in the list.
            `tags_filter={'leisure': 'park'}` would return parks from the area.
            `tags_filter={'leisure': 'park', 'amenity': True, 'shop': ['bakery', 'bicycle']}`
            would return parks, all amenity types, bakeries and bicycle shops.
            If `None`, handler will allow all of the tags to be parsed. Defaults to `None`.
        geometry_filter (BaseGeometry, optional): Region which can be used to filter only
            intersecting OSM objects. Defaults to `None`.
        working_directory (Union[str, Path], optional): Directory where to save
            the parsed `*.parquet` files. Defaults to "files".
        osm_way_polygon_features_config (Union[OsmWayPolygonConfig, dict[str, Any]], optional):
            Config used to determine which closed way features are polygons.
            Modifications to this config are left to experienced OSM users.
            Defaults to predefined "osm_way_polygon_features.json".
    """
    self.tags_filter = tags_filter
    self.merged_tags_filter = merge_osm_tags_filter(tags_filter) if tags_filter else None
    self.geometry_filter = geometry_filter
    self.working_directory = Path(working_directory)
    self.working_directory.mkdir(parents=True, exist_ok=True)
    self.connection: Optional[duckdb.DuckDBPyConnection] = None
    self.rows_per_bucket = 1_000_000
    if osm_way_polygon_features_config is None:
        # Config based on two sources + manual OSM wiki check
        # 1. https://github.com/tyrasd/osm-polygon-features/blob/v0.9.2/polygon-features.json
        # 2. https://github.com/ideditor/id-area-keys/blob/v5.0.1/areaKeys.json
        osm_way_polygon_features_config = json.loads(
            (Path(__file__).parent / "osm_way_polygon_features.json").read_text()
        )

    self.osm_way_polygon_features_config: OsmWayPolygonConfig = (
        osm_way_polygon_features_config
        if isinstance(osm_way_polygon_features_config, OsmWayPolygonConfig)
        else parse_dict_to_config_object(osm_way_polygon_features_config)
    )
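
A minimal construction sketch, assuming the `quackosm` package is installed. The import path and keyword arguments come from the signature above; the `create_reader` wrapper, the `TAGS_FILTER` constant and the `"osm_cache"` directory name are illustrative choices and not part of the library.

```python
from pathlib import Path
from typing import Any, Optional

# Filter values follow the documented forms:
# a single value, True for any value, or a list of accepted values.
TAGS_FILTER: dict[str, Any] = {
    "leisure": "park",
    "amenity": True,
    "shop": ["bakery", "bicycle"],
}


def create_reader(geometry_filter: Optional[Any] = None, cache_dir: str = "osm_cache") -> Any:
    """Build a PbfFileReader; cache_dir is a hypothetical directory name."""
    # Imported lazily so this snippet can be loaded without quackosm installed.
    from quackosm.pbf_file_reader import PbfFileReader

    return PbfFileReader(
        tags_filter=TAGS_FILTER,
        geometry_filter=geometry_filter,  # e.g. a shapely geometry, or None for no spatial filter
        working_directory=Path(cache_dir),  # created on construction if it does not exist
    )
```

For example, passing `shapely.geometry.box(7.41, 43.72, 7.44, 43.75)` as `geometry_filter` would limit the result to features intersecting that bounding box; the coordinates are arbitrary sample values.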

ConvertedOSMParquetFiles

Bases: NamedTuple

List of parquet files read from the *.osm.pbf file.

ParsedOSMFeatures

Bases: NamedTuple

Final list of parsed features from the *.osm.pbf file.

get_features_gdf(
    file_paths,
    explode_tags=None,
    ignore_cache=False,
    filter_osm_ids=None,
)

Get features GeoDataFrame from a list of PBF files.

Function parses multiple PBF files and returns a single GeoDataFrame with parsed OSM objects.

PARAMETER DESCRIPTION
file_paths

Path or list of paths of *.osm.pbf files to be parsed.

TYPE: Union[str, Path, Iterable[Union[str, Path]]]

explode_tags

Whether to split tags into columns based on OSM tag keys. If None, the value is derived from the tags_filter parameter: if no tags filter is provided, explode_tags is set to False; if a tags filter is provided, it is set to True. Defaults to None.

TYPE: bool DEFAULT: None

ignore_cache

Whether to ignore precalculated geoparquet files or not. Defaults to False.

TYPE: bool DEFAULT: False

filter_osm_ids

List of OSM features ids to read from the file. Have to be in the form of 'node/<id>', 'way/<id>' or 'relation/<id>'. Defaults to None, which is treated as an empty list.

TYPE: Optional[list[str]] DEFAULT: None

RETURNS DESCRIPTION
GeoDataFrame

gpd.GeoDataFrame: GeoDataFrame with OSM features.
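
The two defaulting rules described above (the explode_tags fallback and the expected form of filter_osm_ids) can be sketched with the standard library only. Both helpers are hypothetical. Whether the library validates ids this way is not shown in the source, and the regular expression assumes numeric ids.

```python
import re
from typing import Optional

# Documented form: 'node/<id>', 'way/<id>' or 'relation/<id>'.
# Assumption: <id> is a positive integer written in decimal digits.
OSM_ID_PATTERN = re.compile(r"(node|way|relation)/\d+")


def validate_osm_ids(filter_osm_ids: Optional[list[str]]) -> list[str]:
    """Normalize None to an empty list and reject ids in an unexpected form."""
    ids = filter_osm_ids or []
    invalid = [osm_id for osm_id in ids if not OSM_ID_PATTERN.fullmatch(osm_id)]
    if invalid:
        raise ValueError(f"Unexpected OSM id format: {invalid}")
    return ids


def resolve_explode_tags(explode_tags: Optional[bool], tags_filter: Optional[dict]) -> bool:
    """Mirror the documented default: explode tags only when a tags filter is set."""
    if explode_tags is None:
        return tags_filter is not None
    return explode_tags
```

An explicit `explode_tags=False` therefore always wins over the fallback, even when a tags filter is present.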

Source code in quackosm/pbf_file_reader.py
def get_features_gdf(
    self,
    file_paths: Union[str, Path, Iterable[Union[str, Path]]],
    explode_tags: Optional[bool] = None,
    ignore_cache: bool = False,
    filter_osm_ids: Optional[list[str]] = None,
) -> gpd.GeoDataFrame:
    """
    Get features GeoDataFrame from a list of PBF files.

    Function parses multiple PBF files and returns a single GeoDataFrame with parsed
    OSM objects.

    Args:
        file_paths (Union[str, Path, Iterable[Union[str, Path]]]):
            Path or list of paths of `*.osm.pbf` files to be parsed.
        explode_tags (bool, optional): Whether to split tags into columns based on OSM tag keys.
            If `None`, will be set based on `tags_filter` parameter.
            If no tags filter is provided, then `explode_tags` is set to `False`;
            if a tags filter is provided, it is set to `True`. Defaults to `None`.
        ignore_cache (bool, optional): Whether to ignore precalculated geoparquet files or not.
            Defaults to False.
        filter_osm_ids (list[str], optional): List of OSM features ids to read from the file.
            Have to be in the form of 'node/<id>', 'way/<id>' or 'relation/<id>'.
            Defaults to `None`, which is treated as an empty list.

    Returns:
        gpd.GeoDataFrame: GeoDataFrame with OSM features.
    """
    if isinstance(file_paths, (str, Path)):
        file_paths = [file_paths]

    if filter_osm_ids is None:
        filter_osm_ids = []

    if explode_tags is None:
        explode_tags = self.tags_filter is not None

    parsed_geoparquet_files = []
    for file_path in file_paths:
        parsed_geoparquet_file = self.convert_pbf_to_gpq(
            file_path,
            explode_tags=explode_tags,
            ignore_cache=ignore_cache,
            filter_osm_ids=filter_osm_ids,
        )
        parsed_geoparquet_files.append(parsed_geoparquet_file)

    parquet_tables = [
        io.read_geoparquet_table(parsed_parquet_file)  # type: ignore
        for parsed_parquet_file in parsed_geoparquet_files
    ]
    joined_parquet_table: pa.Table = pa.concat_tables(parquet_tables)
    gdf_parquet = gpd.GeoDataFrame(
        data=joined_parquet_table.drop(GEOMETRY_COLUMN).to_pandas(maps_as_pydicts="strict"),
        geometry=ga.to_geopandas(joined_parquet_table.column(GEOMETRY_COLUMN)),
    ).set_index(FEATURES_INDEX)

    return gdf_parquet
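
The first step of the method wraps a single path in a list before iterating. A stand-alone sketch of why that check matters, using a hypothetical helper `normalize_paths` (not part of QuackOSM): a plain string is itself iterable, so looping over it directly would yield single characters instead of one path.

```python
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def normalize_paths(file_paths: Union[PathLike, Iterable[PathLike]]) -> list[Path]:
    """Accept one path or many and always return a list of Path objects."""
    if isinstance(file_paths, (str, Path)):
        # A lone str or Path is treated as a single file, not as an iterable.
        return [Path(file_paths)]
    return [Path(file_path) for file_path in file_paths]


# "monaco.osm.pbf" and "andorra.osm.pbf" are placeholder file names.
print(normalize_paths("monaco.osm.pbf"))
print(normalize_paths(["monaco.osm.pbf", Path("andorra.osm.pbf")]))
```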

convert_pbf_to_gpq(
    pbf_path,
    result_file_path=None,
    explode_tags=None,
    ignore_cache=False,
    filter_osm_ids=None,
)

Convert PBF file to GeoParquet file.

PARAMETER DESCRIPTION
pbf_path

PBF file to be converted to GeoParquet.

TYPE: Union[str, Path]

result_file_path

Where to save the geoparquet file. If not provided, will be generated based on hashes from provided tags filter and geometry filter. Defaults to None.

TYPE: Union[str, Path] DEFAULT: None

explode_tags

Whether to split tags into columns based on OSM tag keys. If None, the value is derived from the tags_filter parameter: if no tags filter is provided, explode_tags is set to False; if a tags filter is provided, it is set to True. Defaults to None.

TYPE: bool DEFAULT: None

ignore_cache

Whether to ignore precalculated geoparquet files or not. Defaults to False.

TYPE: bool DEFAULT: False

filter_osm_ids

List of OSM features ids to read from the file. Have to be in the form of 'node/<id>', 'way/<id>' or 'relation/<id>'. Defaults to None, which is treated as an empty list.

TYPE: Optional[list[str]] DEFAULT: None

RETURNS DESCRIPTION
Path

Path to the generated GeoParquet file.

TYPE: Path
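
The description of result_file_path says the default name is derived from hashes of the tags filter and the geometry filter. The following is only a sketch of how such a deterministic name could be built; the function `sketch_result_path`, the naming scheme, the hash length and the `.geoparquet` suffix are all assumptions, and the library's own private helper is not shown in this page.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Optional


def sketch_result_path(
    pbf_path: str,
    tags_filter: Optional[dict[str, Any]],
    geometry_wkt: Optional[str],
    working_directory: str = "files",
) -> Path:
    """Build a deterministic file name from the inputs (illustrative scheme only)."""

    def short_hash(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]

    # sort_keys makes the hash independent of dictionary insertion order.
    tags_part = short_hash(json.dumps(tags_filter, sort_keys=True)) if tags_filter else "nofilter"
    geometry_part = short_hash(geometry_wkt) if geometry_wkt else "noclip"
    stem = Path(pbf_path).name.removesuffix(".osm.pbf")
    return Path(working_directory) / f"{stem}_{tags_part}_{geometry_part}.geoparquet"
```

A name derived this way stays the same for identical inputs, which is what allows a later call with the same filters to find a previously written file unless ignore_cache is set.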

Source code in quackosm/pbf_file_reader.py
def convert_pbf_to_gpq(
    self,
    pbf_path: Union[str, Path],
    result_file_path: Optional[Union[str, Path]] = None,
    explode_tags: Optional[bool] = None,
    ignore_cache: bool = False,
    filter_osm_ids: Optional[list[str]] = None,
) -> Path:
    """
    Convert PBF file to GeoParquet file.

    Args:
        pbf_path (Union[str, Path]): Pbf file to be parsed to GeoParquet.
        result_file_path (Union[str, Path], optional): Where to save
            the geoparquet file. If not provided, will be generated based on hashes
            from provided tags filter and geometry filter. Defaults to `None`.
        explode_tags (bool, optional): Whether to split tags into columns based on OSM tag keys.
            If `None`, will be set based on `tags_filter` parameter.
            If no tags filter is provided, then `explode_tags` is set to `False`;
            if a tags filter is provided, it is set to `True`. Defaults to `None`.
        ignore_cache (bool, optional): Whether to ignore precalculated geoparquet files or not.
            Defaults to False.
        filter_osm_ids (list[str], optional): List of OSM features ids to read from the file.
            Have to be in the form of 'node/<id>', 'way/<id>' or 'relation/<id>'.
            Defaults to `None`, which is treated as an empty list.

    Returns:
        Path: Path to the generated GeoParquet file.
    """
    if filter_osm_ids is None:
        filter_osm_ids = []

    if explode_tags is None:
        explode_tags = self.tags_filter is not None

    with tempfile.TemporaryDirectory(dir=self.working_directory.resolve()) as tmp_dir_name:
        try:
            self._set_up_duckdb_connection(tmp_dir_name)
            result_file_path = result_file_path or self._generate_geoparquet_result_file_path(
                pbf_path,
                filter_osm_ids=filter_osm_ids,
                explode_tags=explode_tags,
            )
            parsed_geoparquet_file = self._parse_pbf_file(
                pbf_path=pbf_path,
                tmp_dir_name=tmp_dir_name,
                result_file_path=Path(result_file_path),
                filter_osm_ids=filter_osm_ids,
                explode_tags=explode_tags,
                ignore_cache=ignore_cache,
            )
            return parsed_geoparquet_file
        finally:
            if self.connection is not None:
                self.connection.close()
                self.connection = None
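
The method above combines a temporary directory with a try/finally block so that intermediate files and the database connection are released even when parsing fails. A stand-alone sketch of that pattern follows; `FakeConnection` and `run_with_cleanup` are hypothetical stand-ins, and no DuckDB calls are involved.

```python
import tempfile
from pathlib import Path


class FakeConnection:
    """Stand-in for a database connection; it only records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def run_with_cleanup(connection: FakeConnection, working_directory: Path, fail: bool = False) -> Path:
    """Run a step inside a temporary directory and always close the connection."""
    working_directory.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=working_directory.resolve()) as tmp_dir_name:
        try:
            if fail:
                raise RuntimeError("simulated parsing failure")
            return Path(tmp_dir_name)
        finally:
            # Runs on success and on failure, before the temporary directory is removed.
            connection.close()
```

Because the finally block sits inside the with statement, the connection is closed first and the temporary directory is deleted afterwards, in both the success and the failure case.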