The Auto Scan DOCX and Image node for n8n extracts text from DOCX files and from image files (using Optical Character Recognition, OCR), then classifies the extracted content. Handling both input types in a single node keeps document automation workflows flexible.
- OCR for Image Files: Automatically scan and extract text from image files using OCR.
- DOCX Extraction: Extract text content from DOCX files for further processing.
- Document Classification: Classify extracted content into predefined categories such as department and priority.
- Notification Support: Optionally send notifications (e.g., via email, SMS, or print) once processing is complete.
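How these features might combine in a single run can be sketched as a small planning function. This is only an illustration: the step names, the `PipelineOptions` type, and `planSteps` are hypothetical and are not taken from the node's source.

```typescript
// Hypothetical step names; the node's actual internals are not documented here.
type Step = "ocr" | "docx-extract" | "classify" | "notify";

interface PipelineOptions {
  inputType: "image" | "docx";
  classify: boolean; // whether to run document classification
  notify: boolean; // whether to send a notification afterwards
}

// Return the ordered list of steps that a run would perform.
function planSteps(options: PipelineOptions): Step[] {
  const steps: Step[] = [];
  // Images go through OCR; DOCX files go through text extraction.
  steps.push(options.inputType === "image" ? "ocr" : "docx-extract");
  if (options.classify) {
    steps.push("classify");
  }
  if (options.notify) {
    steps.push("notify");
  }
  return steps;
}

// Example: an image scan with classification but no notification.
console.log(planSteps({ inputType: "image", classify: true, notify: false }));
```

Extraction always comes first in this sketch because classification and notification both depend on the extracted text.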
To install this custom node, follow these steps:
- Clone or download this repository.
- Follow the official n8n custom node installation guide.
- Install the necessary dependencies for OCR and DOCX extraction:
  - For OCR: Ensure Tesseract.js is installed and properly configured.
  - For DOCX extraction: Install Mammoth.js.
Once you add the Auto Scan DOCX and Image node to your workflow, you can configure the following parameters:
- Input Type (options):
  - Choose between `Image` (for OCR) or `DOCX` (for document extraction).
  - Default: `docx`
  - Description: Select the input type for scanning (Image or DOCX file).
- File URL or Path (string):
  - Provide the URL or local file path to the document or image file you wish to process.
  - Default: empty string
  - Description: The path or URL to the file.
- Language (options):
  - English (`eng`) or Vietnamese (`vie`).
  - Default: `eng`
  - Description: The language to use for OCR processing on image files.
- Send Notification (boolean):
  - Determines whether to send a notification after the document processing is complete.
  - Default: `false`
  - Description: If set to `true`, a notification will be sent after processing is finished.
- Output Format (options):
  - Choose between JSON or Plain Text for the output format of the extracted data.
  - Default: `json`
  - Description: Choose the output format for the extracted data.
- Department Routing (boolean):
  - Automatically route the document to the correct department based on the extracted content.
  - Default: `true`
  - Description: If set to `true`, the node will classify the document and route it to the appropriate department.
- Notification Method (options):
  - Choose the method to notify users or departments about the document status after processing.
  - Options: Email, SMS, or Print.
  - Default: `email`
  - Description: Select the notification method for alerting users or departments.
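The parameters and their documented defaults can be summarized as a typed object. The sketch below is an assumption-laden illustration: the field names, the `AutoScanParams` type, and `resolveParams` are hypothetical, and the option values `image`, `text`, `sms`, and `print` are guesses, since only `docx`, `eng`, `vie`, `json`, and `email` appear as literal values in this README.

```typescript
// Hypothetical field names mirroring the documented parameters;
// the node's real internal property names may differ.
interface AutoScanParams {
  inputType: "image" | "docx";
  filePath: string;
  language: "eng" | "vie";
  sendNotification: boolean;
  outputFormat: "json" | "text"; // "text" is an assumed value for Plain Text
  departmentRouting: boolean;
  notificationMethod: "email" | "sms" | "print"; // "sms" and "print" are assumed values
}

// Defaults as documented in the parameter list.
const DEFAULT_PARAMS: AutoScanParams = {
  inputType: "docx",
  filePath: "",
  language: "eng",
  sendNotification: false,
  outputFormat: "json",
  departmentRouting: true,
  notificationMethod: "email",
};

// Merge user-supplied values over the defaults and reject an empty file path.
function resolveParams(overrides: Partial<AutoScanParams>): AutoScanParams {
  const params: AutoScanParams = { ...DEFAULT_PARAMS, ...overrides };
  if (params.filePath.trim() === "") {
    throw new Error("File URL or Path must not be empty");
  }
  return params;
}

// Example: OCR on a Vietnamese-language image, other settings left at defaults.
const exampleParams = resolveParams({
  inputType: "image",
  filePath: "/tmp/scan.png",
  language: "vie",
});
console.log(exampleParams);
```

Rejecting an empty path is a design choice of this sketch; the README does not say how the node itself treats a blank File URL or Path.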
```json
{
  "documentUrl": "https://example.com/document.docx",
  "documentType": "docx",
  "outputFormat": "json",
  "departmentRouting": true,
  "notificationMethod": "email"
}
```
- Document URL: https://example.com/document.docx
- Document Type: docx
- Output Format: json
- Department Routing: true
- Notification Method: email
If the document is a DOCX file, the output might look like this:
```json
{
  "extractedText": "This document contains financial data that needs to be routed to the finance department.",
  "classifiedData": {
    "department": "Finance",
    "priority": "High",
    "summary": "Extracted key financial information."
  },
  "notification": "Notification sent via email."
}
```
In this example:
- extractedText: The raw text extracted from the document or image.
- classifiedData: A summary of the classification (e.g., department, priority).
- notification: A message indicating that a notification was sent.
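The output fields above can be expressed as a type, together with one possible way to fill them in. This README does not document how classification actually works, so the keyword rules, the `HR` and `General` departments, the priority rule, and the function names below are all hypothetical placeholders. In particular, the placeholder priority rule will not reproduce the `High` priority shown in the sample output.

```typescript
interface ClassifiedData {
  department: string;
  priority: string;
  summary: string;
}

interface ScanOutput {
  extractedText: string;
  classifiedData: ClassifiedData;
  notification?: string; // present only when a notification was sent
}

// Hypothetical keyword rules; the real classification logic is not documented.
const DEPARTMENT_RULES: { department: string; keywords: string[] }[] = [
  { department: "Finance", keywords: ["financial", "invoice", "budget"] },
  { department: "HR", keywords: ["employee", "recruitment", "leave"] },
];

// Assign the first department whose keywords appear in the text.
function classify(text: string): ClassifiedData {
  const lower = text.toLowerCase();
  let department = "General"; // fallback when no rule matches
  for (const rule of DEPARTMENT_RULES) {
    if (rule.keywords.some((keyword) => lower.indexOf(keyword) !== -1)) {
      department = rule.department;
      break;
    }
  }
  // Placeholder priority rule: flag text that mentions "urgent".
  const priority = lower.indexOf("urgent") !== -1 ? "High" : "Normal";
  return { department, priority, summary: `Routed to ${department}.` };
}

// Assemble the output object, adding the notification message if a method is given.
function buildOutput(text: string, notificationMethod?: string): ScanOutput {
  const output: ScanOutput = {
    extractedText: text,
    classifiedData: classify(text),
  };
  if (notificationMethod) {
    output.notification = `Notification sent via ${notificationMethod}.`;
  }
  return output;
}

console.log(buildOutput("Urgent: please review the attached invoice.", "email"));
```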