This is a little set of JS natural language processing tools.
npm install --save @jrc03c/js-nlp-tools
import { Corpus, Document } from "@jrc03c/js-nlp-tools"
import fs from "node:fs"
// Wrap each raw text in a Document
const doc1 = new Document({
  name: "Frankenstein",
  raw: fs.readFileSync("path/to/frankenstein.txt", "utf8"),
})
const doc2 = new Document({
  name: "Pride & Prejudice",
  raw: fs.readFileSync("path/to/pride-and-prejudice.txt", "utf8"),
})
const doc3 = new Document({
  name: "Moby Dick",
  raw: fs.readFileSync("path/to/moby-dick.txt", "utf8"),
})
// Collect the documents into a Corpus
const corpus = new Corpus({ docs: [doc1, doc2, doc3] })
// Process (index) every document, then compute a tf-idf score
corpus.process().then(() => {
  console.log(corpus.computeTFIDFScore("Frankenstein", doc1))
})
Returns a new `Corpus` instance. Can optionally take a `data` argument, which is an object with properties corresponding to `Corpus` instance properties (e.g., `docs`).
Returns the inverse document frequency score for a given word. Is computed as:
$$\text{IDF} = \log(N / n_t)$$
Where:
- $N$ = the total number of documents in the corpus
- $n_t$ = the number of documents in which the word appears
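As a rough illustration of the formula above, the sketch below computes the score from the two counts directly. The function name `inverseDocumentFrequency` is hypothetical and not part of the library, and the natural logarithm is an assumption, since the formula does not fix a base.

```javascript
// Hypothetical helper, not part of the library's API.
// Assumes the natural logarithm; the formula above does not specify a base.
function inverseDocumentFrequency(totalDocs, docsContainingWord) {
  return Math.log(totalDocs / docsContainingWord)
}

// A word found in 1 of 3 documents is more distinctive than one found in all 3.
const rare = inverseDocumentFrequency(3, 1) // Math.log(3), roughly 1.0986
const common = inverseDocumentFrequency(3, 3) // Math.log(1), which is 0
console.log(rare, common)
```

In this sketch, a word that appears in no documents yields `Infinity`; how the library treats that case is not stated here.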
Returns the term frequency score for a given word and document. Is computed as:
$$\text{TF} = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$$
Where:
- $f_{t,d}$ = the number of times the word appears in the document
- $\max_{t' \in d} f_{t',d}$ = the number of times the most frequently occurring word appears in the document
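The term frequency formula can be sketched over a plain word-count dictionary. The helper name `termFrequency` and the sample counts are hypothetical, and treating a missing word as a count of 0 is an assumption rather than documented library behavior.

```javascript
// Hypothetical helper, not part of the library's API.
// `wordCounts` maps each word to the number of times it appears in a document.
function termFrequency(word, wordCounts) {
  const hasWord = Object.prototype.hasOwnProperty.call(wordCounts, word)
  const count = hasWord ? wordCounts[word] : 0 // assumption: missing words count as 0
  const maxCount = Math.max(...Object.values(wordCounts))
  return 0.5 + 0.5 * (count / maxCount)
}

const counts = { whale: 40, ship: 20, sea: 10 }
console.log(termFrequency("whale", counts)) // 1
console.log(termFrequency("ship", counts)) // 0.75
console.log(termFrequency("harpoon", counts)) // 0.5
```

Because of the constant 0.5 term, scores in this sketch always fall between 0.5 and 1, which dampens the advantage that long documents would otherwise get from raw counts.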
Returns the tf-idf score for a given word and document. Is computed as the term frequency score multiplied by the inverse document frequency score.
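To show how the two scores combine, here is a minimal sketch over three tiny, made-up word-count dictionaries. The names `docs`, `has`, and `tfidf` are hypothetical, the natural logarithm is assumed, and none of this is the library's own implementation.

```javascript
// Hypothetical stand-ins for three processed documents (word -> count).
const docs = [
  { monster: 4, ice: 2, the: 8 },
  { ball: 3, the: 6 },
  { whale: 5, the: 10 },
]

// True only if `word` is an own key of the dictionary.
const has = (doc, word) => Object.prototype.hasOwnProperty.call(doc, word)

function tfidf(word, doc, allDocs) {
  const maxCount = Math.max(...Object.values(doc))
  const tf = 0.5 + 0.5 * ((has(doc, word) ? doc[word] : 0) / maxCount)
  const docsWithWord = allDocs.filter(d => has(d, word)).length
  const idf = Math.log(allDocs.length / docsWithWord) // natural log assumed
  return tf * idf
}

console.log(tfidf("monster", docs[0], docs)) // 0.75 * Math.log(3)
console.log(tfidf("the", docs[0], docs)) // 0, because "the" is in every document
```

The second result shows why the product is useful: a word can be the most frequent one in a document and still score 0 if it appears everywhere in the corpus.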
Returns a `Promise` that resolves once all documents in the corpus have been processed. Can optionally take a callback function that is passed the progress through the documents as a value between 0 and 1.
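The progress-callback pattern described above might look something like the following sketch. The function `processAll` and its parameters are hypothetical and only illustrate the idea of reporting a fraction after each item; the library's actual reporting granularity is not documented here.

```javascript
// Hypothetical sketch, not the library's implementation.
// Runs `worker` on each item in turn and reports the fraction completed.
async function processAll(items, worker, onProgress) {
  for (let i = 0; i < items.length; i++) {
    await worker(items[i])
    if (onProgress) onProgress((i + 1) / items.length)
  }
}

const reported = []
const done = processAll(
  ["a", "b", "c", "d"],
  async () => {}, // stand-in for per-document processing
  progress => reported.push(progress)
)
done.then(() => console.log(reported)) // [ 0.25, 0.5, 0.75, 1 ]
```

With the library itself, the description above suggests a call along the lines of `corpus.process(progress => console.log(progress))`.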
An array of `Document` instances.
A boolean indicating whether or not the instance's `process` method has been invoked (and completed).
Returns a new `Document` instance. Can optionally take a `data` object with properties corresponding to `Document` instance properties (e.g., `wordCounts`).
Returns the number of times `word` (a string) appears in the document.
Returns a `Promise` that resolves once the document has been processed (indexed).
A boolean representing whether or not the instance's `process` method has been invoked (and completed).
A boolean representing whether or not case should matter when indexing words.
A string representing the word that appears most frequently in the document.
A string representing the name of the document. If no name is assigned via the data object passed into the constructor, then a random string will be assigned as the document's name.
A string representing the raw text on which the document is based.
A non-negative integer representing the total number of words in the document.
A dictionary that maps words (as strings) to the numbers of times those words appear in the document (as non-negative integers).
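The indexed properties described above (the word-count dictionary, the total word count, and the most frequent word) could be derived in a single pass, roughly as sketched below. The function `countWords` is hypothetical, and it assumes the text has already been cleaned so that words are separated by single spaces.

```javascript
// Hypothetical sketch, not the library's implementation.
// Assumes `text` is already cleaned: no punctuation, words separated by spaces.
function countWords(text) {
  const wordCounts = Object.create(null) // no prototype, so any word is a safe key
  let totalWordCount = 0
  let mostFrequentWord = null

  for (const word of text.split(" ")) {
    if (word.length === 0) continue // skip gaps left by repeated spaces
    wordCounts[word] = (wordCounts[word] ?? 0) + 1
    totalWordCount++
    if (
      mostFrequentWord === null ||
      wordCounts[word] > wordCounts[mostFrequentWord]
    ) {
      mostFrequentWord = word
    }
  }

  return { wordCounts, totalWordCount, mostFrequentWord }
}

const index = countWords("the whale and the sea and the ship")
console.log(index.wordCounts.the) // 3
console.log(index.totalWordCount) // 8
console.log(index.mostFrequentWord) // "the"
```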
Given `raw` (a string) and optionally `shouldPreserveCase` (a boolean), returns a copy of `raw` in which all punctuation has been removed and all whitespace characters have been replaced with spaces. By default, `shouldPreserveCase` is `false`.
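A cleaning step like the one described above might be sketched as follows. The function name `clean` is hypothetical, and two details are assumptions: that "punctuation" means any character that is not a letter, digit, or whitespace, and that text is lowercased when case is not preserved.

```javascript
// Hypothetical sketch, not the library's implementation.
// Assumes "punctuation" means anything that is not a letter, digit, or whitespace.
function clean(raw, shouldPreserveCase = false) {
  const out = raw
    .replace(/[^\p{L}\p{N}\s]/gu, "") // drop punctuation and symbols
    .replace(/\s/g, " ") // turn each whitespace character into a space
  // Assumption: when case is not preserved, the text is lowercased.
  return shouldPreserveCase ? out : out.toLowerCase()
}

console.log(clean("Call me Ishmael.\nSome years ago..."))
// "call me ishmael some years ago"
console.log(clean("Call me Ishmael.", true))
// "Call me Ishmael"
```

Note that this sketch also strips apostrophes, so "don't" becomes "dont"; whether the library does the same is not stated here.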
Defines a read-only property called `name` on `object` with the value `value`. Returns `object`.
Note that assigning a new value to a read-only property defined this way fails silently. In other words, the property keeps its original value and you won't be notified that the assignment attempt failed.
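One way to get this silent-failure behavior is a getter paired with a no-op setter, as in the sketch below. The function name `defineReadOnly` and the argument order `(object, name, value)` are assumptions, and this is not necessarily how the library implements it.

```javascript
// Hypothetical sketch, not the library's implementation.
// The argument order (object, name, value) is an assumption.
function defineReadOnly(object, name, value) {
  Object.defineProperty(object, name, {
    configurable: false,
    enumerable: true,
    get: () => value,
    set: () => {}, // ignore writes without throwing, even in strict mode
  })
  return object
}

const doc = defineReadOnly({}, "name", "Moby Dick")
doc.name = "Frankenstein" // silently ignored
console.log(doc.name) // "Moby Dick"
```

By contrast, a data property defined with `writable: false` throws a `TypeError` on assignment in strict mode (which includes ES modules), so the no-op setter is what keeps failures quiet in this sketch.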