Extract BIM element data from a Revit PDF

Every element in a Revit-exported PDF carries a marked-content block whose tag encodes its element identifier. bim-cli extracts those tags and the accompanying text, including schedule rows that standard PDF extractors miss, as structured JSON.

Prompt

Copy this into your AI coding tool:

I have a Revit-exported PDF drawing set. Use bim-cli to extract structured element and text data from it.
Step 1 — install bim-cli if not already installed:
powershell.exe -NoProfile -ExecutionPolicy Bypass -Command "iwr -useb https://bimcli.com/install.ps1 | iex"
Step 2 — refresh PATH in the current session:
bash:  export PATH="$PATH:$LOCALAPPDATA/bim-cli"
pwsh:  $env:Path += ";$env:LOCALAPPDATA\bim-cli"
Step 3 — confirm the pdf driver is ready:
bim doctor --json
Step 4 — inspect the PDF:
bim pdf info <path-to-drawings.pdf>
Step 5 — extract all BIM element tags (element identifier + page + bounds per element):
bim pdf marked <path-to-drawings.pdf>
Step 6 — extract all text content (handles Revit's Type0 Identity-H fonts):
bim pdf text <path-to-drawings.pdf>
Each bim pdf command prints a single JSON object to stdout. Pipe it to jq or write it to a file.

The commands

inspect a Revit PDF

bim pdf info drawings.pdf

{"ok":true,"result":{"file":"drawings.pdf","sizeBytes":98304000,"pages":142,"pdfVersion":"1.6"}}

extract BIM element tags — element identifier + page + bounds per element

bim pdf marked drawings.pdf

{"ok":true,"result":[{"page":1,"tag":"Element_1234567","pathCount":3,"bounds":{"x0":120.4,"y0":340.2,"x1":180.1,"y1":380.5}},{"page":1,"tag":"Element_1234568","pathCount":2,"bounds":{"x0":200.0,"y0":340.2,"x1":260.0,"y1":380.5}},...]}

tag is the marked-content identifier from the Revit PDF content stream. pathCount is the number of graphic path operations inside the block; use it to filter out dimension strings and annotation elements. bounds is the bounding box in PDF user-space units (1/72 inch by default).
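
As a rough illustration of working with these three fields, the sketch below filters records by pathCount and converts bounds to millimetres on the printed sheet. Several things here are assumptions rather than documented behaviour: the Element_<digits> tag pattern is inferred from the sample output, the min_paths threshold and its direction (keep records at or above it) are placeholders to tune against your own drawings, and the conversion assumes the default user-space unit of 1/72 inch.

```python
import json
import re

# First two records from the `bim pdf marked` example above
# (the trailing "..." elision is dropped so the sample parses).
MARKED_STDOUT = (
    '{"ok":true,"result":['
    '{"page":1,"tag":"Element_1234567","pathCount":3,'
    '"bounds":{"x0":120.4,"y0":340.2,"x1":180.1,"y1":380.5}},'
    '{"page":1,"tag":"Element_1234568","pathCount":2,'
    '"bounds":{"x0":200.0,"y0":340.2,"x1":260.0,"y1":380.5}}]}'
)

POINTS_TO_MM = 25.4 / 72  # assumes the default PDF user-space unit (1/72 inch)
TAG_PATTERN = re.compile(r"^Element_(\d+)$")  # assumed from the sample tags


def summarize(records, min_paths):
    """Keep records with pathCount >= min_paths and add paper size in mm.

    The threshold and its direction are assumptions for illustration.
    """
    rows = []
    for rec in records:
        if rec["pathCount"] < min_paths:
            continue
        match = TAG_PATTERN.match(rec["tag"])
        b = rec["bounds"]
        rows.append({
            # None when a tag does not follow the assumed pattern.
            "elementId": int(match.group(1)) if match else None,
            "page": rec["page"],
            "widthMm": round((b["x1"] - b["x0"]) * POINTS_TO_MM, 1),
            "heightMm": round((b["y1"] - b["y0"]) * POINTS_TO_MM, 1),
        })
    return rows


records = json.loads(MARKED_STDOUT)["result"]
kept = summarize(records, min_paths=3)
print(kept)
```

The millimetre values describe the footprint on the printed sheet, not the size of the element in the model; recovering model dimensions would also require the view scale, which this output does not include.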

extract text — recovers schedule rows and title block labels

bim pdf text drawings.pdf

{"ok":true,"result":[{"page":1,"text":"Room 101","x":245.0,"y":620.3,"fontSize":10.0,"font":"ArialMT"},{"page":1,"text":"47.5 m²","x":320.0,"y":620.3,"fontSize":10.0,"font":"ArialMT"},...]}

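Revit-exported PDFs set text in Type0 (composite) fonts with Identity-H encoding, so the content stream holds 2-byte character IDs rather than readable character codes. Extractors that do not apply each font's ToUnicode CMap return empty strings or garbled characters. bim pdf text decodes through the ToUnicode CMap streams and recovers the original text.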
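
The sample output is a flat list of positioned text fragments, so schedule rows have to be reassembled from coordinates. One possible approach, sketched below, groups fragments on the same page whose y values lie within a tolerance and orders each group by x. The function name group_rows and the 2.0-unit tolerance are hypothetical, and the top-to-bottom ordering assumes y follows the usual PDF convention of increasing upward.

```python
import json
from collections import defaultdict

# Records from the `bim pdf text` example above (the "..." elision is dropped).
TEXT_STDOUT = (
    '{"ok":true,"result":['
    '{"page":1,"text":"Room 101","x":245.0,"y":620.3,"fontSize":10.0,"font":"ArialMT"},'
    '{"page":1,"text":"47.5 m\u00b2","x":320.0,"y":620.3,"fontSize":10.0,"font":"ArialMT"}]}'
)


def group_rows(fragments, y_tolerance=2.0):
    """Group fragments into (page, [cell texts]) rows.

    A fragment joins the current row when its y is within y_tolerance of
    the first fragment in that row. Cells are ordered left to right.
    """
    by_page = defaultdict(list)
    for frag in fragments:
        by_page[frag["page"]].append(frag)

    rows = []
    for page in sorted(by_page):
        # Assumed: y grows upward, so descending y reads top to bottom.
        ordered = sorted(by_page[page], key=lambda f: -f["y"])
        current, anchor_y = [], None
        for frag in ordered:
            if anchor_y is not None and abs(frag["y"] - anchor_y) > y_tolerance:
                rows.append((page, current))
                current, anchor_y = [], None
            if anchor_y is None:
                anchor_y = frag["y"]
            current.append(frag)
        if current:
            rows.append((page, current))

    return [
        (page, [f["text"] for f in sorted(cells, key=lambda f: f["x"])])
        for page, cells in rows
    ]


fragments = json.loads(TEXT_STDOUT)["result"]
rows = group_rows(fragments)
print(rows)
```

A fixed tolerance is the simplest choice; scaling it by fontSize may work better on sheets that mix small schedule text with large title block labels.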

What the test harness checks

This scenario is verified end-to-end by the bim-cli test harness. Run it yourself to confirm the installed binary matches this page:

bim scenario run tests/scenarios/pdf-extract-tables.yaml --strict

The YAML defines the prompt above, the commands, and JSONPath assertions over the actual stdout. All three bim pdf commands (info, marked, and text) must exit with code 0 and return "ok":true.
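
The pass condition described above (exit code 0 plus "ok":true) can be approximated outside the harness. The sketch below is not how bim scenario run is implemented and does not evaluate JSONPath assertions; step_passes and check_pdf are hypothetical helper names, and decoding stdout as UTF-8 is an assumption about the binary's output encoding.

```python
import json
import subprocess

# The three commands shown on this page; the PDF path is appended per call.
COMMANDS = [
    ["bim", "pdf", "info"],
    ["bim", "pdf", "marked"],
    ["bim", "pdf", "text"],
]


def step_passes(returncode: int, stdout: str) -> bool:
    """True when the process exited 0 and stdout is a JSON object with ok == true."""
    if returncode != 0:
        return False
    try:
        envelope = json.loads(stdout)
    except json.JSONDecodeError:
        return False
    return isinstance(envelope, dict) and envelope.get("ok") is True


def check_pdf(pdf_path: str, run=subprocess.run) -> dict:
    """Run each command against pdf_path and report pass/fail per command.

    `run` is injectable so the logic can be exercised without bim installed.
    """
    results = {}
    for cmd in COMMANDS:
        # Assumed: bim-cli writes UTF-8 to stdout.
        proc = run(cmd + [pdf_path], capture_output=True, encoding="utf-8")
        results[" ".join(cmd)] = step_passes(proc.returncode, proc.stdout)
    return results
```

With bim on PATH, check_pdf("drawings.pdf") returns a dictionary keyed by command, which makes it easy to see which of the three steps failed before running the full scenario.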

Scenario source: tests/scenarios/pdf-extract-tables.yaml

Why this exists

Revit sheet sets often contain door schedules, room finish schedules, and window schedules embedded as tables on their sheets. Standard tools like camelot-py extract tables but add a Python dependency and have no BIM context. bim pdf marked and bim pdf text return element tags and positioned text as structured JSON from a single offline Windows binary with no runtime dependency.