Performance

Read throughput

Count all 59M elements in the Denmark extract (461 MB), best of 3 runs, fat LTO (commit 90df51f):

| Tool | Mode | Time | Notes |
|---|---|---|---|
| pbfhogg | parallel | 0.31s | par_map_reduce on all cores |
| osmpbf 0.3 | parallel | 0.53s | upstream crate, same API |
| pbfhogg | pipelined | 1.3s | for_each_pipelined, preserves file order |
| Planetiler 0.10 | parallel | 2.0s | Java, OsmInputFile + thread pool |
| pbfhogg | sequential | 2.8s | for_each |
| pbfhogg | blobreader | 2.9s | BlobReader sequential decode |
| osmpbf 0.3 | sequential | 5.6s | upstream for_each |
| osmium 1.19 | cat to opl | 5.7s | osmium cat -f opl -o /dev/null |
| Planetiler 0.10 | sequential | 8.7s | Java, OsmInputFile single-threaded |

par_map_reduce is fastest when order does not matter. for_each_pipelined is the fastest ordered read and the production hot path.
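Since the table lists osmpbf as "upstream crate, same API", a counting loop looks essentially the same against either crate. The sketch below uses the osmpbf 0.3 signatures; for_each_pipelined is pbfhogg-specific, and treating it as a drop-in, order-preserving replacement for for_each is an assumption based on the name:

```rust
use osmpbf::ElementReader;

fn main() -> Result<(), osmpbf::Error> {
    // Unordered count: map each element to a partial count, then reduce
    // the partial sums across all cores.
    let reader = ElementReader::from_path("denmark-latest.osm.pbf")?;
    let count = reader.par_map_reduce(
        |_element| 1u64, // map: one per element
        || 0u64,         // identity for the reduction
        |a, b| a + b,    // combine partial counts
    )?;
    println!("{count} elements (unordered)");

    // Ordered count: elements arrive in file order. With pbfhogg, swapping
    // for_each for for_each_pipelined keeps this order while decoding
    // blobs in parallel.
    let reader = ElementReader::from_path("denmark-latest.osm.pbf")?;
    let mut n = 0u64;
    reader.for_each(|_element| n += 1)?;
    println!("{n} elements (ordered)");
    Ok(())
}
```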

Write throughput

Decode all 59M elements then write through BlockBuilder + PbfWriter to /dev/null (commit def80d9):

| Compression | Sync | Pipelined | Notes |
|---|---|---|---|
| none | 6.2s | 6.2s | decode + wire-format serialization floor |
| zstd:3 | 8.1s | 6.2s | pipelined hides compression cost |
| zlib:6 | 14.5s | 6.3s | 2.3x speedup from parallel compression |

With pipelined writes, all compression modes converge to ~6.2s, the decode + wire-format serialization floor. All element types are encoded directly to protobuf wire format using reusable scratch buffers (no per-element allocation, no external protobuf dependencies).
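The convergence is the usual producer/consumer result: serialization stays on one thread while compression fans out to rayon workers (see Compression choice below). A minimal, self-contained illustration of that pattern using rayon and flate2, not pbfhogg's actual internals:

```rust
use flate2::{write::ZlibEncoder, Compression};
use rayon::prelude::*;
use std::io::Write;

/// Compress serialized blocks in parallel while preserving file order.
/// collect() on an indexed parallel iterator keeps the input ordering,
/// so a writer thread can emit the results sequentially.
fn compress_blocks(blocks: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    blocks
        .into_par_iter() // each block is compressed on a worker thread
        .map(|raw| {
            let mut enc = ZlibEncoder::new(Vec::new(), Compression::new(6));
            enc.write_all(&raw).expect("writing to a Vec cannot fail");
            enc.finish().expect("finishing into a Vec cannot fail")
        })
        .collect()
}
```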

CLI command benchmarks

Denmark (487 MB, 59M elements, commit 6fc1283, osmium from 23862d1):

| Command | pbfhogg | osmium | speedup |
|---|---|---|---|
| inspect (indexdata) | 0.1s | – | --index-only fast path |
| sort (sorted, indexdata) | 0.7s | 11.6s | 17x |
| apply-changes (indexdata + zlib) | 0.6s | 7.2s | 12x |
| tags-filter w/highway=primary -R | 0.2s | 0.56s | 2.8x |
| tags-filter amenity=restaurant -R | 0.5s | 1.19s | 2.4x |
| cat --type way (raw passthrough) | 0.24s | 2.22s | 9.3x |
| inspect tags --type way (indexdata) | 0.4s | 0.59s | 1.5x |
| getid (9 elements) | 0.6s | 0.83s | 1.4x |
| add-locations-to-ways (dense) | 9.9s | 12.1s | 1.2x |
| add-locations-to-ways (external) | 9.7s | 12.1s | 1.2x |

The largest speedups come from blob passthrough (sort, apply-changes, cat --type), where pbfhogg avoids decompressing and re-compressing unmodified blobs entirely.

Extract benchmarks

Japan (2.4 GB, 344M elements, Tokyo bbox):

| Strategy | pbfhogg | osmium | ratio |
|---|---|---|---|
| simple | 4.4s | 7.2s | 1.6x faster |
| complete-ways | 4.4s | 11.0s | 2.5x faster |
| smart | 5.2s | 13.4s | 2.6x faster |

Simple extract uses a 3-phase barrier pipeline with parallel classification and raw frame passthrough. Complete-ways and smart use multi-pass parallel pread classification. Spatial blob filtering skips decompression of node blobs outside the extract region when indexdata is present.
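A conceptual sketch of the spatial skip, assuming indexdata records a bounding box per node blob (the actual index layout is not documented on this page): any blob whose box is disjoint from the extract region is skipped without being decompressed.

```rust
// Conceptual sketch, not pbfhogg's actual types: decide per node blob
// whether it must be decompressed for an extract.
#[derive(Clone, Copy)]
struct BBox {
    min_lon: f64,
    min_lat: f64,
    max_lon: f64,
    max_lat: f64,
}

impl BBox {
    /// True if the two boxes overlap (inclusive edges).
    fn intersects(&self, other: &BBox) -> bool {
        self.min_lon <= other.max_lon
            && self.max_lon >= other.min_lon
            && self.min_lat <= other.max_lat
            && self.max_lat >= other.min_lat
    }
}

/// Assumes indexdata stores a bbox per node blob; without an index entry
/// the blob must be decoded to stay correct.
fn needs_decode(blob_bbox: Option<BBox>, region: BBox) -> bool {
    match blob_bbox {
        Some(b) => b.intersects(&region), // disjoint box: skip decompression
        None => true,
    }
}
```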

Apply-changes at scale

Single-pass 4-phase batch pipeline with O(log n) inline upsert assignment, reader thread read-ahead, and passthrough coalescing (commit a6ebbfe):

| Dataset | Config | Time | vs osmium |
|---|---|---|---|
| Japan (2.4 GB, 43K diff) | indexdata + zlib | 3.0s | 15x faster |
| Germany (4.5 GB, 146K diff) | buffered + zlib | 5.3s | – |
| Germany (4.5 GB, 146K diff) | buffered + none | 3.4s | – |
| N. America (18.8 GB, 645K diff) | buffered + zlib | 17.3s | – |
| N. America (18.8 GB, 645K diff) | buffered + none | 14.9s | – |
| N. America (18.8 GB, 645K diff) | io_uring + zlib | 15.2s | – |
| N. America (18.8 GB, 645K diff) | io_uring + none | 11.9s | – |

At Japan scale, osmium takes 36.6s for the same operation. pbfhogg passes ~92% of blobs through as raw bytes without decompression, using blob-level indexdata for O(1) classification. RSS stays under 600 MB even at North America scale (18.8 GB).
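The per-blob decision can be pictured as follows. This is an illustrative O(log n) binary-search variant; the O(1) classification above presumably relies on a precomputed structure, but the verdict is the same either way: if no changed id falls inside a blob's id range, its compressed bytes are copied through untouched.

```rust
// Illustrative types, not pbfhogg's: classify one blob of the input file
// against a sorted slice of element ids touched by the diff.
enum BlobAction {
    Passthrough, // copy the compressed bytes verbatim, no decode
    Rewrite,     // decode, merge changed elements, re-encode
}

fn classify(blob_id_range: (i64, i64), changed_ids: &[i64]) -> BlobAction {
    let (lo, hi) = blob_id_range;
    // Binary search for the first changed id >= lo; if it also lies <= hi,
    // at least one change lands inside this blob.
    let i = changed_ids.partition_point(|&id| id < lo);
    if i < changed_ids.len() && changed_ids[i] <= hi {
        BlobAction::Rewrite
    } else {
        BlobAction::Passthrough
    }
}
```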

Apply-changes with LocationsOnWays

Denmark (501 MB with LocationsOnWays, daily diff, commit e7bbfa2):

| Pipeline | pbfhogg | osmium | speedup |
|---|---|---|---|
| apply-changes --locations-on-ways | 3.9s | 8.3s | 2.1x |
| apply-changes + ALTW (separate) | 2.7s + 6.5s = 9.2s | 4.3s + 9.5s = 13.8s | – |

The --locations-on-ways flag replaces the two-step pipeline, apply-changes followed by a separate add-locations-to-ways (ALTW) run, with a single command.

Planet-scale pipeline

The full production pipeline on the planet (87 GB, ~3.4B elements) runs on a 30 GB machine:

| Step | Command | Time | Peak memory |
|---|---|---|---|
| Generate indexdata | cat | ~8 min | minimal |
| Add way-node coordinates | add-locations-to-ways --index-type external | ~24 min | ~17 GB |
| Build geocode index | build-geocode-index | ~22 min | ~18 GB |
| Apply daily diff | apply-changes | ~13 min | ~1.8 GB |

Every command runs with bounded memory. No 128 GB server required.

Performance tips

Use indexed PBFs

Generate an indexed PBF once with pbfhogg cat, then use it for all subsequent operations. Commands on indexed PBFs skip decompression of irrelevant blobs entirely, which is where the largest speedups come from.

Choose the right ALTW index type

add-locations-to-ways supports three index strategies:

| Type | Best for | Trade-off |
|---|---|---|
| dense | Country-scale where the working set fits in RAM | Fastest when it fits; thrashes at planet scale |
| sparse | Memory-constrained hosts, no temp disk | ~540 MB RAM; ~1.85x slower than dense when everything fits |
| external | Planet-scale, memory-constrained hosts | Bounded memory, all sequential I/O; needs sorted input + temp disk |

At planet scale on a 30 GB machine, external is 3.9x faster than dense (24 min vs 96 min) because dense causes page cache thrashing.
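The "needs sorted input" trade-off is what makes external all-sequential: two id-sorted streams (node locations spilled to a temp file, and the node ids that ways reference) can be merge-joined in a single scan instead of random probes. A simplified sketch of that join; pbfhogg's actual pass structure is not documented here:

```rust
/// Merge-join two id-sorted streams. Both advance monotonically, so all
/// I/O behind these iterators stays sequential. Illustrative only.
fn merge_join(
    node_locs: impl Iterator<Item = (i64, (f64, f64))>, // sorted by node id
    mut wanted: impl Iterator<Item = i64>,               // sorted, may repeat
    mut emit: impl FnMut(i64, (f64, f64)),
) {
    let mut next = wanted.next();
    for (id, loc) in node_locs {
        while let Some(w) = next {
            if w < id {
                next = wanted.next(); // referenced node missing from input
            } else if w == id {
                emit(id, loc); // repeated ids resolve from the same entry
                next = wanted.next();
            } else {
                break; // w > id: advance the node stream
            }
        }
    }
}
```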

O_DIRECT for planet-scale I/O

Planet-scale operations read and write 80 GB+, polluting the entire page cache. The --direct-io flag bypasses the page cache entirely. Wall time is typically unchanged at country scale, where the work is CPU-bound; the benefit is cache hygiene at planet scale and not evicting useful data from co-resident processes.

O_DIRECT wins for concurrent read/write patterns (merge). For sequential single-file passthrough (cat), buffered I/O is actually faster because page-cache readahead helps.
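At the syscall level, --direct-io presumably corresponds to opening files with the O_DIRECT flag, which on Linux also requires aligned buffers, lengths, and offsets (typically 512 B or 4 KiB). A minimal sketch using the libc crate:

```rust
use std::fs::{File, OpenOptions};
use std::os::unix::fs::OpenOptionsExt;

/// Open a file for reading with O_DIRECT, bypassing the page cache.
/// Subsequent reads must use suitably aligned buffers and offsets,
/// which is why direct I/O is opt-in rather than the default.
fn open_direct(path: &str) -> std::io::Result<File> {
    OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)
}
```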

io_uring for large writes

The --io-uring flag replaces the synchronous writer thread with io_uring WriteFixed. At North America scale (18.8 GB), io_uring + --compression none is 20% faster than buffered writes (11.9s vs 14.9s). Below ~4 GB input size, buffered writes keep up.
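For orientation, a minimal write through the io-uring crate. The WriteFixed opcode named above additionally requires registering buffers with the kernel up front, so this sketch uses the plain Write opcode to stay short:

```rust
use io_uring::{opcode, types, IoUring};
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?; // submission/completion queues, 8 entries
    let file = std::fs::File::create("out.bin")?;
    let buf = b"hello, io_uring";

    // Queue one write at offset 0, then submit and wait for its completion.
    let entry = opcode::Write::new(types::Fd(file.as_raw_fd()), buf.as_ptr(), buf.len() as _)
        .build()
        .user_data(0x42);
    unsafe { ring.submission().push(&entry).expect("submission queue full") };
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("missing completion");
    assert!(cqe.result() >= 0, "write failed: errno {}", -cqe.result());
    Ok(())
}
```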

Compression choice

With pipelined writes (the production path), compression is dispatched to rayon and all modes converge to the decode + serialization floor. The choice mainly affects file size and downstream read speed:

  • none - fastest writes, largest files, ideal for intermediate files or erofs storage
  • zlib - standard PBF compression, compatible with all tools
  • zstd - better ratio and faster decompression, but not all consumers support it yet

System

All benchmarks measured on plantasjen: AMD Ryzen 9 5900X (12c/24t), 32 GB DDR4, NVMe SSD (input/output) + HDD (build artifacts), Linux 6.18. Measured with brokkr bench, cross-validated with brokkr verify.
