- Current Hash
b981d7197f06710027bf7a92446830fd917e6586
- Current Version:
0.16.11
https://github.com/Unstructured-IO/unstructured/commit/b981d7197f06710027bf7a92446830fd917e6586
To release a new version
- Replace all
b981d7197f06710027bf7a92446830fd917e6586
with the new hash - Replace all
0.16.11
with the new version tag - Replace all relevant
2.0.9
with the new @epilogo/unstructured-io-node version tag - Delete
./python/unstructured-io
dir - Run
./scripts/install.sh
from root - Run tests
pnpm run test
This library provides Node.js bindings to the unstructured.io
Python module. It enables Node.js applications to utilize the document parsing capabilities of the unstructured
library.
The unstructured
Python library excels at extracting structured data from various document formats, including PDFs, HTML, Word documents, and more. However, there are situations where utilizing this functionality within a Node.js environment directly is desirable. This library bridges that gap, allowing Node.js applications to leverage the power of unstructured
without relying solely on a Python environment.
You can install the library using npm, yarn, or pnpm:
npm install @epilogo/unstructured-io-node
yarn add @epilogo/unstructured-io-node
pnpm add @epilogo/unstructured-io-node
The post-installation script (install.sh
) will execute the following:
-
Clone the
unstructured-io
repository: It clones a specific commitb981d7197f06710027bf7a92446830fd917e6586
repository into apython/unstructured-io
directory within the library's folder. -
Install system dependencies: Based on your operating system (currently supporting Linux and macOS), it installs the necessary system packages for
unstructured
to function. These packages include tools for image processing, OCR, and handling various document formats. -
Create and activate a Python virtual environment: It creates a virtual environment within the
python
directory to isolate the Python dependencies of this library. -
Install Python dependencies: It installs the
unstructured
Python package along with its dependencies within the activated virtual environment. - Install Node.js and pnpm:
Here's a basic example demonstrating how to use the library:
import { UnstructuredIO } from '@epilogo/unstructured-io-node';
import * as path from 'path';
// This must be called at least once in your container or local environment
// It takes care of installing the neccesary dependencies.
// Only macOS and Linux is supported
await UnstructuredIO.ensureEnvironmentSetup();
const partitioned = await UnstructuredIO.partition({
filename: path.join(__dirname, '../__tests__/data/your-document.pdf'),
strategy: 'hi_res',
languages: ['eng'],
});
console.log(partitioned);
Explanation:
-
Import: Import the
UnstructuredIO
object from the library. -
Call
partition
: Invoke thepartition
function on theUnstructuredIO
object, passing an options object as an argument.-
filename
: Specify the path to the document you want to process. -
strategy
,languages
, etc.: Adjust other options as needed based on theunstructured
Python library's API.
-
-
Process the result: The
partition
function returns the extracted structured data, which you can then process according to your application's needs.
Refer to the unstructured
library's documentation (https://docs.unstructured.io/) for details on available options and their usage within the partition
function.