TFIDF
TF-IDF (term frequency, inverse document frequency) is a type of word-analysis that finds the most characteristic, or unique, words in a text.
It combines how rare a word is in general with how often it appears in the document.
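To make the idea concrete, here is a minimal sketch in plain JavaScript. It is not the plugin's internal code: the helper names (`tokenize`, `buildModel`, `tfidf`), the smoothed idf formula, and the tiny reference corpus are all assumptions chosen for illustration.

```javascript
// Hypothetical helpers, not part of the plugin.
// Split a text into lowercase word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || []
}

// Count, for every word, how many reference documents contain it.
function buildModel(docs) {
  const docCounts = new Map()
  for (const text of docs) {
    for (const word of new Set(tokenize(text))) {
      docCounts.set(word, (docCounts.get(word) || 0) + 1)
    }
  }
  return { docCounts, total: docs.length }
}

// Score each word as: count in this text (tf) * rarity in the model (idf).
// Assumed idf: ln((N + 1) / (n + 1)), so a word found in every
// reference document scores 0 and an unseen word scores highest.
function tfidf(text, model) {
  const counts = new Map()
  for (const word of tokenize(text)) {
    counts.set(word, (counts.get(word) || 0) + 1)
  }
  const scores = []
  for (const [word, tf] of counts) {
    const seen = model.docCounts.get(word) || 0
    const idf = Math.log((model.total + 1) / (seen + 1))
    scores.push([word, tf * idf])
  }
  // highest score first
  return scores.sort((a, b) => b[1] - a[1])
}

const model = buildModel(['the cat sat', 'the dog ran', 'the bird sang', 'the cat ran'])
const scores = tfidf('the zebra saw the cat; the zebra ran', model)
console.log(scores)
```

In this sketch 'zebra' ranks first, because it is repeated and never appears in the reference corpus, while 'the' scores 0 because every reference document contains it.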
This plugin comes pre-built with a standard English model, so you can fingerprint an arbitrary text with .tfidf()
- .tfidf(opts, model?) -
alternatively, you can build your own model, from a compromise document:
- .buildIDF() -
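// build an idf model from your own reference text
let model = nlp(shakespeareWords).buildIDF()
let doc = nlp('thou art so sus.')
// pass the custom model as the second argument
doc.tfidf({}, model)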
// [ [ 'sus', 5.78 ], [ 'thou', 2.3 ], [ 'art', 1.75 ], [ 'so', 0.44 ] ]
if you want to combine tfidf with other analysis, you can add numbers to individual terms, like this:
let doc = nlp('no, my son is also named Bort')
doc.compute('tfidf')
let json = doc.json()
json[0].terms[6]
// {"text":"Bort", "tags":[], "tfidf":5.78, ... }
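Once each term carries a number, it can be filtered alongside its other properties. The sketch below is a rough illustration only: it works on a hand-written object shaped like the JSON above, `nounsAbove` is a hypothetical helper, and the words, tags, and scores are made-up values rather than real plugin output.

```javascript
// Hand-written stand-in for doc.json() after doc.compute('tfidf').
// All tags and scores here are invented for illustration.
const json = [
  {
    terms: [
      { text: 'the', tags: ['Determiner'], tfidf: 0.1 },
      { text: 'tired', tags: ['Adjective'], tfidf: 0.8 },
      { text: 'captain', tags: ['Noun'], tfidf: 3.1 },
      { text: 'steered', tags: ['Verb'], tfidf: 2.2 },
      { text: 'the', tags: ['Determiner'], tfidf: 0.1 },
      { text: 'ship', tags: ['Noun'], tfidf: 1.9 },
    ],
  },
]

// Hypothetical helper: keep nouns whose tfidf score reaches a threshold.
function nounsAbove(json, minScore) {
  const found = []
  for (const sentence of json) {
    for (const term of sentence.terms) {
      const isNoun = (term.tags || []).includes('Noun')
      if (isNoun && term.tfidf >= minScore) {
        found.push(term.text)
      }
    }
  }
  return found
}

console.log(nounsAbove(json, 2)) // [ 'captain' ]
```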
TF-IDF values are scaled, but have an unbounded maximum. The result for 'foo foo foo foo' would increase with every repetition.
Ngrams
- .ngrams({}) - list all repeating sub-phrases, by word-count
- .unigrams() - n-grams with one word
- .bigrams() - n-grams with two words
- .trigrams() - n-grams with three words
- .startgrams() - n-grams including the first term of a phrase
- .endgrams() - n-grams including the last term of a phrase
- .edgegrams() - n-grams including the first or last term of a phrase
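As a rough picture of what the start, end, and edge variants describe, here is a plain-JavaScript sketch that does not use the plugin. The function names are hypothetical, and whether the plugin counts, de-duplicates, or orders results this way is an assumption.

```javascript
// Hypothetical helpers, not the plugin's implementation.
// Every gram that begins at the first word of the phrase.
function startGrams(words) {
  const grams = []
  for (let end = 1; end <= words.length; end++) {
    grams.push(words.slice(0, end).join(' '))
  }
  return grams
}

// Every gram that finishes at the last word of the phrase.
function endGrams(words) {
  const grams = []
  for (let start = words.length - 1; start >= 0; start--) {
    grams.push(words.slice(start).join(' '))
  }
  return grams
}

// Grams touching either edge, with the shared full phrase listed once.
function edgeGrams(words) {
  return [...new Set([...startGrams(words), ...endGrams(words)])]
}

const words = ['one', 'two', 'three']
console.log(startGrams(words)) // [ 'one', 'one two', 'one two three' ]
console.log(endGrams(words)) // [ 'three', 'two three', 'one two three' ]
console.log(edgeGrams(words).length) // 5
```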
all methods support the same option params:
let doc = nlp('one two three. one two foo.')
doc.ngrams({ size: 2 }) // only two-word grams
/*[
{ size: 2, count: 2, normal: 'one two' },
{ size: 2, count: 1, normal: 'two three' },
{ size: 2, count: 1, normal: 'two foo' }
]
*/
or all gram-sizes under/over a limit:
let doc = nlp('one two three. one two foo.')
let res = doc.ngrams({ min: 3 }) // or max:2
/*[
{ size: 3, count: 1, normal: 'one two three' },
{ size: 3, count: 1, normal: 'one two foo' }
]
*/
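For readers curious how such counts can be produced, the following is a minimal plain-JavaScript sketch, not the plugin's code. `countNgrams` is a hypothetical function; the sentence splitting, lowercasing, and sort-by-count behaviour are assumptions made to mirror the outputs shown above.

```javascript
// Hypothetical re-implementation for illustration only.
// opts.size pins one gram length; opts.min / opts.max set a range.
function countNgrams(text, opts = {}) {
  const min = opts.size || opts.min || 1
  const max = opts.size || opts.max || Infinity
  const counts = new Map()
  // grams never cross a sentence boundary
  const sentences = text
    .toLowerCase()
    .split(/[.!?]+/)
    .map((s) => s.trim())
    .filter(Boolean)
  for (const sentence of sentences) {
    const words = sentence.split(/\s+/)
    const longest = Math.min(max, words.length)
    for (let size = min; size <= longest; size++) {
      // slide a window of `size` words across the sentence
      for (let i = 0; i + size <= words.length; i++) {
        const normal = words.slice(i, i + size).join(' ')
        const entry = counts.get(normal) || { size, count: 0, normal }
        entry.count += 1
        counts.set(normal, entry)
      }
    }
  }
  // most frequent first; ties keep first-seen order
  return [...counts.values()].sort((a, b) => b.count - a.count)
}

console.log(countNgrams('one two three. one two foo.', { size: 2 }))
console.log(countNgrams('one two three. one two foo.', { min: 3 }))
```

With the example text, this sketch yields the same three bigrams and two trigrams listed above.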
MIT