pdf-ocr-ts

pdf-ocr-ts creates searchable PDF files out of PDF files that only contain images of scanned documents. It is javascript-only and hence works without the need to install further tools. Under the hood it uses pdf.js to render the pages within a pdf to png files, Jimp to create compressed jpeg images, tesseract.js to perform ocr and pdf-lib to merge the single page pdfs tesseract.js is creating into a final searchable output PDF file.

Usage

To create a searchable PDF with filename outputFilename from inputFilename use:

const { default: PdfOcr } = require('pdf-ocr-ts');

const inputFilename = './input/scan_test.pdf';
const outputFilename = './output/scan_test-searchable.pdf';

PdfOcr.createSearchablePdf(inputFilename, outputFilename);

In certain contexts it might be more handy to read the input file in some other function and also output the searchable PDF in another component. In these cases pdf-ocr-ts offers the function getSearchablePdfBufferBased(Uint8Array) that takes a Uint8Array (e.g. created by fs.readFile()), performs ocr and returns the searchable PDF file as Uint8Array. Which can than be used again in fs.writeFile().

const { default: PdfOcr } = require('pdf-ocr-ts');
const fs = require('fs');
const path = require('path');

const inputFilename = './input/scan_test.pdf';
const outputFilename = './output/scan_test-searchable.pdf';

(async () => {
    const pdf = new Uint8Array(fs.readFileSync(path.resolve(__dirname, inputFilename)));
    const { pdfBuffer, text } = await PdfOcr.getSearchablePdfBufferBased(pdf);
    fs.writeFile(path.resolve(__dirname, outputFilename), pdfBuffer, (error) => {
      if (error) {
          console.error(`Error: ${error}`);
      } else {
          console.log(`Finished merging PDFs into ${outputFilename}.`);
      }
    });
})();

To generate log output, pdf-ocr-ts supports logging frameworks. It ships with the most simple logger simpleLog and supports any logger with the call signature (level: string, message: string) => void; (see ./utils/Logger.ts).

const { default: PdfOcr } = require('pdf-ocr-ts');
const { simpleLog } = require("pdf-ocr-ts/build/utils/Logger");

const inputFilename = './input/scan_test.pdf';
const outputFilename = './output/scan_test-searchable.pdf';

PdfOcr.createSearchablePdf(inputFilename, outputFilename, simpleLog);

Here's an example for the log library winston.js via a simple wrapper like logHelper(level: string, message: string). Internally pdf-ocr-ts uses the log levels: info, error and debug.

const { default: PdfOcr } = require('pdf-ocr-ts');
const { createLogger, transports, format } = require("winston");

// create the winston logger
const logger = createLogger({
  transports: [new transports.Console()],
  format: format.combine(
    format.colorize(),
    format.timestamp(),
    format.printf(({ timestamp, level, message }) => {
      return `[${timestamp}] ${level}: ${message}`;
    })
  ),
});

// wrap winston logger in logHelper to comply with the call signature 
// (level: string, message: string) => void;
function logHelper(level: string, message: string) {
  logger.log(level, message);
}

// pass the logHelper function
PdfOcr.createSearchablePdf(inputFilename, outputFilename, logHelper);

To build the module from source run npm run build.

pdf-ocr-ts

pdf-ocr-ts

Usage

Readme

Keywords

Package Sidebar

Install

Repository

Homepage

Weekly Downloads

Version

License

Unpacked Size

Total Files

Last publish

Collaborators

pdf-ocr-ts

pdf-ocr-ts

Usage

Readme

Keywords

Package Sidebar

Install

Repository

Homepage

DownloadsWeekly Downloads

Version

License

Unpacked Size

Total Files

Last publish

Collaborators

Weekly Downloads