pdf-stream
Creates a stream from PDF
Node.js module for streaming PDF text content.
Based on the PDF.js library.
Install
npm i pdf-stream --save
Usage
Basic
Text stream from PDF file
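    'use strict';
    const fs = require('fs');
    const text = require('pdf-stream').text;

    // Load file contents synchronously into a Uint8Array
    let file = './example.pdf';
    let pdf = new Uint8Array(fs.readFileSync(file));

    // Stream PDF text to stdout
    text(pdf)
      .pipe(process.stdout);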
Text stream from PDF link
PDF.js needs XMLHttpRequest as a global variable to download the file.
Install the xhr2 library locally:
npm i xhr2 --save
    'use strict';
    const text = require('pdf-stream').text;
    global.XMLHttpRequest = require('xhr2'); // for PDF.js

    let pdf = 'https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf';

    // Stream PDF text to stdout
    text(pdf)
      .pipe(process.stdout);
Text stream from PDF link with metadata as XML string
If you get the error:
ReferenceError: DOMParser is not defined
you need DOMParser as a global variable, because PDF.js uses it to parse XML metadata. Install the xmldom library locally:
npm i xmldom --save
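    'use strict';
    const text = require('pdf-stream').text;
    global.XMLHttpRequest = require('xhr2'); // File download
    global.DOMParser = require('xmldom').DOMParser; // XML metadata parsing

    let pdf = 'https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf';

    // Stream PDF text to stdout
    text(pdf)
      .pipe(process.stdout);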
Advanced
Create a transform class for replacing strings
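    'use strict';
    const Transform = require('stream').Transform;
    const pdf_stream = require('pdf-stream');
    const PDFReadable = pdf_stream.PDFReadable;
    const PDFStringifyTransform = pdf_stream.PDFStringifyTransform;
    global.XMLHttpRequest = require('xhr2'); // File download

    let url = 'https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf';

    // Transform class for replacing strings
    class ReplaceTransform extends Transform {
      constructor(options) {
        super({
          writableObjectMode: true,
          readableObjectMode: true
        });
        this.from = options.from;
        this.to = options.to;
      }

      // For every object
      _transform(obj, encoding, cb) {
        // Get text content items
        if (typeof obj.textContent !== 'undefined'
          && Array.isArray(obj.textContent.items)) {
          // Replace every occurrence of `from` with `to` in each item's text
          obj.textContent.items.forEach(item => {
            if (typeof item.str === 'string') {
              item.str = item.str.split(this.from).join(this.to);
            }
          });
        }
        this.push(obj);
        cb();
      }
    }

    // Pipeline
    new PDFReadable(url)
      .pipe(new ReplaceTransform({from: 'trace', to: 'test'})) // illustrative values
      // Convert stream from object to string
      .pipe(new PDFStringifyTransform())
      .pipe(process.stdout);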
API
All methods return streams; use them with .pipe().
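Because everything here is a regular Node.js stream, the standard stream utilities should apply as well. The following is a minimal sketch, not part of pdf-stream, that gathers a text stream into a single string and uses `stream.pipeline` so that errors from any stage are reported. The `collect` helper is hypothetical, and `Readable.from` stands in for `text(pdf)` so that the snippet runs without a PDF file.

```javascript
'use strict';
const { Readable, Writable, pipeline } = require('stream');

// Hypothetical helper (not part of pdf-stream):
// collects every chunk of a readable stream into one string.
function collect(readable) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    const sink = new Writable({
      write(chunk, encoding, callback) {
        chunks.push(chunk.toString());
        callback();
      }
    });
    // pipeline() forwards an error from any stage to the final callback
    pipeline(readable, sink, err => {
      if (err) {
        reject(err);
      } else {
        resolve(chunks.join(''));
      }
    });
  });
}

// Stand-in for text(pdf): any readable stream of text chunks works here
const source = Readable.from(['Hello, ', 'PDF']);

collect(source)
  .then(content => console.log(content))
  .catch(err => console.error(err));
```

With the real module, `source` would presumably be replaced by the stream returned from `text(pdf)`.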
text(options)
alternative usage:
text(pdf, whitespace)
Gets text stream from PDF.
Converts a PDF to text and can optionally replace whitespace.
Options:
- pdf: URL or ArrayBuffer;
- whitespace: the string that replaces whitespace (␣). Replacement is disabled by default.

In the PDF.js viewer, whitespace is an empty string. To make the output comparable with the viewer, use:
text(pdf, '')
Return: {stream.Readable}
new PDFReadable(options)
alternative usage:
new PDFReadable(pdf)
Creates a Readable stream in object mode from PDF text content.
Options:
- pdf: URL or ArrayBuffer;
- other options are inherited from stream.Readable options.
Return: {stream.Readable}
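Since the stream is in object mode, each chunk is a JavaScript object rather than a string or Buffer. The sketch below shows one way such a stream could be inspected with `for await`. It assumes the `textContent.items` shape used in the Advanced example and that each item carries a `str` property, as PDF.js text items do. The `countItems` helper and the sample data are hypothetical; `Readable.from` stands in for `new PDFReadable(pdf)`.

```javascript
'use strict';
const { Readable } = require('stream');

// Hypothetical helper: counts the text items in one emitted object.
// Assumes the obj.textContent.items shape from the Advanced example.
function countItems(obj) {
  const items = obj && obj.textContent && obj.textContent.items;
  return Array.isArray(items) ? items.length : 0;
}

// Stand-in for new PDFReadable(pdf): an object-mode stream with sample data
const source = Readable.from([
  { textContent: { items: [{ str: 'Hello' }, { str: 'PDF' }] } },
  { textContent: { items: [{ str: 'Second object' }] } }
]);

async function main() {
  let total = 0;
  // Readable streams are async iterable, so objects arrive one at a time
  for await (const obj of source) {
    total += countItems(obj);
  }
  console.log(`text items: ${total}`);
}

main().catch(console.error);
```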
new PDFStringifyTransform(options)
alternative usage:
new PDFStringifyTransform(whitespace)
Transforms PDF text content objects into strings.
Options:
- whitespace: the string that replaces whitespace (␣). Replacement is disabled by default;
- other options are inherited from stream.Transform options.
Return: {stream.Transform}
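To illustrate the object-to-string step, here is a rough sketch of a transform that accepts objects on its writable side and emits plain strings on its readable side. It is not the library's implementation: `JoinItemsTransform` and `joinItems` are hypothetical names, the `textContent.items[].str` shape is an assumption based on the Advanced example and PDF.js text items, and the separator handling is a simplification that does not reproduce the `whitespace` option.

```javascript
'use strict';
const { Readable, Transform } = require('stream');

// Hypothetical helper: joins the text of all items in one object.
function joinItems(obj, separator) {
  const items = obj && obj.textContent && obj.textContent.items;
  if (!Array.isArray(items)) {
    return '';
  }
  return items.map(item => item.str).join(separator);
}

// Hypothetical stand-in for an object-to-string transform:
// objects in (writableObjectMode), strings out.
class JoinItemsTransform extends Transform {
  constructor(separator) {
    super({ writableObjectMode: true });
    this.separator = separator;
  }

  _transform(obj, encoding, callback) {
    callback(null, joinItems(obj, this.separator));
  }
}

// Stand-in for new PDFReadable(pdf), using sample data
Readable.from([
  { textContent: { items: [{ str: 'Hello' }, { str: 'PDF' }] } }
])
  .pipe(new JoinItemsTransform(' '))
  .pipe(process.stdout);
```

Only the writable side is in object mode here, which is why the result can be piped straight into `process.stdout`.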
Contribute
Contributors are welcome. Open an issue or submit a pull request.
Small note: If editing the README, please conform to the standard-readme specification.
License
Apache 2.0
© Sergey N