pdftojson
pdftojson is a pdftotext wrapper that generates JSON with bounding box data. It removes overlapping duplicate characters, which often exist in MS-Word-generated PDF files that mix floating images and text.
Why bother with a wrapper for pdftotext?
Consider this PDF file:
pdftotext -bbox theFile.pdf
would generate this:
...(6)綠線G01G站延伸伸至大溪、龍潭先進進公共運輸輸系統發展展委託可行行性研究...
pdftotext does a great job of "undoing" the physical layout (columns, hyphenation, etc.) of a PDF document. However, its output here contains overlapping, duplicated characters (伸伸, 進進, 輸輸, 展展, 行行). PDF layout engines sometimes generate these quirks when images and text are mixed within a page.
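One way to picture the cleanup is as a filter over character bounding boxes: a character is dropped when an already-kept character has the same text and sits almost on top of it. The sketch below is only an illustration of that idea, not pdftojson's actual algorithm; the function names, the sample boxes, and the 0.5 overlap threshold are all assumptions made for the example.

```javascript
// Area of the intersection of two boxes (0 when they do not overlap).
function overlapArea(a, b) {
  const w = Math.min(a.xMax, b.xMax) - Math.max(a.xMin, b.xMin);
  const h = Math.min(a.yMax, b.yMax) - Math.max(a.yMin, b.yMin);
  return w > 0 && h > 0 ? w * h : 0;
}

function area(box) {
  return (box.xMax - box.xMin) * (box.yMax - box.yMin);
}

// Keep a box unless an already-kept box has the same text and overlaps it
// by more than `threshold` of the smaller box's area.
// The default threshold of 0.5 is an arbitrary choice for this sketch.
function dropOverlappingDuplicates(boxes, threshold = 0.5) {
  const kept = [];
  for (const box of boxes) {
    const isDuplicate = kept.some(
      (k) =>
        k.text === box.text &&
        overlapArea(k, box) > threshold * Math.min(area(k), area(box))
    );
    if (!isDuplicate) kept.push(box);
  }
  return kept;
}

// Hypothetical sample data: the second "伸" is drawn almost on top of the first.
const boxes = [
  { text: "延", xMin: 10, xMax: 20, yMin: 0, yMax: 10 },
  { text: "伸", xMin: 20, xMax: 30, yMin: 0, yMax: 10 },
  { text: "伸", xMin: 21, xMax: 31, yMin: 0, yMax: 10 },
  { text: "至", xMin: 30, xMax: 40, yMin: 0, yMax: 10 },
];

console.log(dropOverlappingDuplicates(boxes).map((b) => b.text).join(""));
// prints "延伸至"
```

Comparing text as well as geometry keeps legitimately repeated characters that sit side by side, since those boxes barely overlap.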
On the other hand, pdftojson theFile.pdf
could generate this:
...
"xMin": 1032 "xMax": 34829439 "yMin": 5473557 "yMax": 56132172 "text": "(6)綠線 G01 站延伸至大溪、龍潭先進公"
"xMin": 12468 "xMax": 320813062 "yMin": 5723757 "yMax": 58634172 "text": "共運輸系統發展委託可行性研究"
...
Install
$ npm install pdftojson
pdftojson uses pdftotext. Please make sure pdftotext is available in PATH.
Usage
pdftojson is available as a command line tool and as a Node.js library.
CLI
# outputs some.json
$ pdftojson some.pdf
# converts pages 3 to 6 of some.pdf and outputs some.json
$ pdftojson -c "-f 3 -l 6" some.pdf
NodeJS Library
The library exposes a single function that takes the name of a PDF file and returns a promise.
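A minimal sketch of calling such a function might look like the following. It assumes the package's export is the function itself (for example `const convert = require("pdftojson")`) and that the promise resolves to an array of pages as described under "Output format"; neither detail is confirmed here. The helper `countWordsPerPage` and the stand-in `fakeConvert` are hypothetical names, used so the sketch runs without pdftotext installed.

```javascript
// Assumption: `convert` is the function exported by the package, e.g.
//   const convert = require("pdftojson");
// and it resolves to an array of pages, each with a `words` array.
async function countWordsPerPage(convert, pdfPath) {
  const pages = await convert(pdfPath);
  return pages.map((page) => page.words.length);
}

// Stand-in converter (hypothetical) that mimics the assumed output shape.
const fakeConvert = async () => [
  { width: 595, height: 842, words: [{ text: "a" }, { text: "b" }] },
  { width: 595, height: 842, words: [] },
];

countWordsPerPage(fakeConvert, "some.pdf")
  .then((counts) => console.log(counts)) // prints [ 2, 0 ]
  .catch((err) => console.error("conversion failed:", err));
```

Passing the converter in as an argument keeps the helper easy to test; in real use the stand-in would be replaced by the function the package exports.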
Output format
All numeric values are in pt.
// Page
width: Number // page width
height: Number // page height
words:
  text: String // the text enclosed in the bounding box
  // All coordinates calculated from top-left corner of the page
  xMin: Number // left edge of the bounding box
  xMax: Number // right edge of the bounding box
  yMin: Number // top edge of the bounding box
  yMax: Number // bottom edge of the bounding box
  // ...
// ...
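Because the values are in points measured from the top-left corner, a bounding box can be mapped onto a rendered page image by scaling with the render resolution (1 pt is 1/72 inch). The sketch below illustrates that conversion; `boxToPixels` and the sample word are hypothetical and not part of pdftojson.

```javascript
const POINTS_PER_INCH = 72;

// Convert one word's bounding box from pt to pixel coordinates for a page
// rendered at `dpi`. Both systems start at the top-left corner, so only
// scaling is needed (no axis flip).
function boxToPixels(word, dpi) {
  const scale = dpi / POINTS_PER_INCH;
  return {
    left: Math.round(word.xMin * scale),
    top: Math.round(word.yMin * scale),
    width: Math.round((word.xMax - word.xMin) * scale),
    height: Math.round((word.yMax - word.yMin) * scale),
  };
}

// Hypothetical word entry in the documented shape.
const word = { text: "example", xMin: 72, xMax: 144, yMin: 36, yMax: 48 };

console.log(boxToPixels(word, 144));
// prints { left: 144, top: 72, width: 144, height: 24 }
```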