arm-pdf-scrape

0.0.1 • Public • Published

An npm module that scrapes the assembly instructions from an ARM manual.

This is intended for the ARMv7-M Architecture Reference Manual. You must provide your own copy of the manual; this package is only a scraper.

Install

npm install --save arm-pdf-scrape

Usage

// Load the manual, extract the instruction entries, and print each one as text.
const {loadPdfFromPath, generateInstructions, instructionToText} = require("arm-pdf-scrape");

const filepath = "/path/to/manual.pdf"; // your own copy of the manual
loadPdfFromPath(filepath)
  .then(manual => generateInstructions(manual))
  .then(instructions => {
    instructions.forEach(i => console.log(instructionToText(i)));
  })
  .catch(e => console.error(`Something went wrong: ${e}`));

Fluff

Scraping is imprecise, so the scraper relies on patterns it expects to find in the manual. For example:

  • The beginning of each entry has A7.7.[0-9]+ near the start of the page text.
  • The syntax follows the words "Assembler syntax" in bold font.
  • Encodings are marked by "Encoding 1", etc., in bold font.
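
As a rough illustration of the first pattern, the sketch below checks whether a page's text starts an instruction entry. It is not taken from the package: the helper name `findEntryNumber` and the 200-character window used to interpret "near the start" are assumptions made for this example.

```javascript
// Hypothetical helper, not part of arm-pdf-scrape's API.
// Matches an entry heading such as "A7.7.3" and captures the final number.
const ENTRY_HEADING = /A7\.7\.([0-9]+)/;

// Returns the entry number if a heading appears within the first
// `window` characters of the page text, otherwise null.
// The window size is an assumed stand-in for "near the start".
function findEntryNumber(pageText, window = 200) {
  const match = ENTRY_HEADING.exec(pageText.slice(0, window));
  return match ? Number(match[1]) : null;
}
```

Limiting the search to a prefix of the page is one way to avoid treating cross-references in body text as new entries.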

Steps:

  • Get text chunks of each page
  • Strip the runners (headers and footers)
  • Sort chunks and combine same-line items when possible
  • Extract regions of section-body
  • Merge all regions into one array
  • Separate regions into instructions
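
The runner-stripping and line-combining steps above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the package's implementation: the chunk shape `{ text, x, y }`, a `y` coordinate that grows down the page, the helper names `stripRunners` and `combineLines`, and the margin and tolerance values are all hypothetical.

```javascript
// Hypothetical chunk shape: { text, x, y }, with y growing down the page.
// None of these helpers are part of arm-pdf-scrape's API.

// Drop chunks that fall in the header and footer bands (the "runners").
function stripRunners(chunks, pageHeight, margin = 50) {
  return chunks.filter(c => c.y > margin && c.y < pageHeight - margin);
}

// Sort chunks top to bottom, then join chunks whose y values are close
// enough to count as the same line, ordered left to right.
function combineLines(chunks, tolerance = 2) {
  const sorted = [...chunks].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines = [];
  for (const chunk of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - chunk.y) <= tolerance) {
      last.parts.push(chunk);
    } else {
      lines.push({ y: chunk.y, parts: [chunk] });
    }
  }
  return lines.map(line =>
    line.parts.sort((a, b) => a.x - b.x).map(p => p.text).join(" ")
  );
}
```

A small tolerance is used because text on one visual line in a PDF does not always share an identical baseline coordinate.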

TODO:

  • Nested bullets in SSBB, PSSBB
  • Math in QADD
  • Spacing of bold, italic, verbatim

License

MIT

Collaborators

  • aerijo