tokenize-text
JavaScript text tokenizer that is easy to use and compose.
Installation
$ npm install tokenize-text
Usage
var Tokenizer = require('tokenize-text');
var tokenize = new Tokenizer();
tokenize.split(fn)
This is the main method of this module; all other methods are built on it.
fn is called with 4 arguments:
text: text value of the token (text == currentToken.value)
currentToken: current token object
prevToken: previous token (or null)
nextToken: next token (or null)
fn should return a string, an array of strings, a token, or an array of tokens.
tokenize.split(fn) returns a tokenizer function that accepts either a list of tokens or a string (a string is converted to a single token).
The tokenizer function returns an array of tokens with the following properties:
value: text content of the token
index: absolute position in the original text
offset: length of the token (equivalent to value.length)
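// Simple tokenizer that splits the text into 2 sections
var splitIn2 = tokenize.split(function(text, currentToken, prevToken, nextToken) {
    var middle = Math.floor(text.length / 2);
    return [text.slice(0, middle), text.slice(middle)];
});

var tokens = splitIn2('hello');
/*
[
    { value: 'he', index: 0, offset: 2 },
    { value: 'llo', index: 2, offset: 3 }
]
*/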
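To make the relationship between the strings returned by fn and the resulting token positions more concrete, here is a minimal standalone sketch. It does not use tokenize-text and is not its implementation: makeSplitter is a hypothetical helper that only handles string return values and assumes each returned string can be found, in order, inside the parent token.

```javascript
// Hypothetical sketch, not the tokenize-text source: a split-style helper
// that turns the strings returned by fn into tokens with positions.
function makeSplitter(fn) {
    return function (input) {
        // Accept a plain string by wrapping it into a single token.
        var tokens = typeof input === 'string'
            ? [{ value: input, index: 0, offset: input.length }]
            : input;
        var result = [];

        tokens.forEach(function (token, i) {
            var prev = i > 0 ? tokens[i - 1] : null;
            var next = i < tokens.length - 1 ? tokens[i + 1] : null;
            var parts = fn(token.value, token, prev, next);
            if (!Array.isArray(parts)) parts = [parts];

            // Locate each returned string inside the parent token, scanning forward.
            var cursor = 0;
            parts.forEach(function (part) {
                var at = token.value.indexOf(part, cursor);
                if (at < 0) return; // skip strings that are not found in the token
                result.push({
                    value: part,
                    index: token.index + at, // position relative to the original text
                    offset: part.length
                });
                cursor = at + part.length;
            });
        });
        return result;
    };
}

var splitIn2 = makeSplitter(function (text) {
    var middle = Math.floor(text.length / 2);
    return [text.slice(0, middle), text.slice(middle)];
});

var halves = splitIn2('hello');
console.log(halves);
```

The forward-moving cursor is one simple way to keep positions unambiguous when the same substring occurs more than once in a token; the library may compute positions differently.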
tokenize.re(re)
Tokenize using a regular expression:
var extractUppercase = tokenize.re(/[A-Z]/);
var tokens = extractUppercase('aBcD');
/*
[
    { value: 'B', index: 1, offset: 1 },
    { value: 'D', index: 3, offset: 1 }
]
*/
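As a rough illustration of what regex-based tokenization produces, the following standalone sketch builds the same token shape from every match of a regular expression. tokenizeWithRegex is a hypothetical helper, not part of tokenize-text, and the input string is chosen only for illustration.

```javascript
// Hypothetical helper, not part of tokenize-text.
function tokenizeWithRegex(text, re) {
    // matchAll requires a global regular expression, so add the "g" flag if missing.
    var flags = re.flags.indexOf('g') === -1 ? re.flags + 'g' : re.flags;
    var globalRe = new RegExp(re.source, flags);

    // Each match becomes one token: matched text, start position, length.
    return Array.from(text.matchAll(globalRe), function (match) {
        return {
            value: match[0],
            index: match.index,
            offset: match[0].length
        };
    });
}

var uppercase = tokenizeWithRegex('aBcD', /[A-Z]/);
console.log(uppercase);
```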
tokenize.characters()
Split the text into characters. tokenize.characters() is equivalent to tokenize.re(/[^\s]/), so whitespace is skipped.
var tokens = tokenize.characters()('abc');
/*
[
    { value: 'a', index: 0, offset: 1 },
    { value: 'b', index: 1, offset: 1 },
    { value: 'c', index: 2, offset: 1 }
]
*/
tokenize.sections()
Split the text into sections. Sections are delimited by \n . , ; ! ?
var tokens = tokenize.sections()('this is sentence 1. this is sentence 2');
/*
[
    { value: 'this is sentence 1', index: 0, offset: 18 },
    { value: ' this is sentence 2', index: 19, offset: 19 }
]
*/
tokenize.words()
Split the text into words:
var tokens = tokenize.words()('hello, how are you?');
/*
[
    { value: 'hello', index: 0, offset: 5 },
    { value: 'how', index: 7, offset: 3 },
    { value: 'are', index: 11, offset: 3 },
    { value: 'you', index: 15, offset: 3 }
]
*/
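The section and word outputs shown in this README can be reproduced with two simple patterns. The sketch below is standalone and does not use tokenize-text: matchTokens is a hypothetical helper, and both patterns are assumptions chosen to match the documented outputs (the library's actual patterns may differ, for example for accented characters or apostrophes).

```javascript
// Hypothetical helper, not part of tokenize-text.
function matchTokens(text, re) {
    return Array.from(text.matchAll(re), function (match) {
        return { value: match[0], index: match.index, offset: match[0].length };
    });
}

// Assumed pattern: a section is a run of characters without \n . , ; ! ?
var SECTION_PATTERN = /[^\n.,;!?]+/g;
// Assumed pattern: a word is a run of letters, digits, or underscores.
var WORD_PATTERN = /\w+/g;

var sectionTokens = matchTokens('this is sentence 1. this is sentence 2', SECTION_PATTERN);
var wordTokens = matchTokens('hello, how are you?', WORD_PATTERN);

console.log(sectionTokens);
console.log(wordTokens);
```

Note that the second section keeps its leading space, which is why its index is 19 and its length is 19.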
tokenize.filter(fn)
Filter the list of tokens by calling fn(token):
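// Filter the words to extract the ones that start with an uppercase letter
// (the first word of the text is skipped because it has no previous token)
var extractNames = tokenize.filter(function(word, current, prev) {
    return (prev && /[A-Z]/.test(word[0]));
});

// Split the text into words
var words = tokenize.words()('My name is Samy.');

// Apply the filter
var tokens = extractNames(words);
/*
[
    { value: 'Samy', index: 11, offset: 4 }
]
*/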
tokenize.flow(fn1, fn2, [...])
Creates a tokenizer that returns the result of invoking the provided tokenizers for each input token.
var extractNames = tokenize;
var tokens = ;
To execute all tokenizers in series, use tokenize.serie(fn1, fn2, [...]) instead.
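The difference between the two composition styles can be sketched without the library. The code below is one reading of the descriptions above, not the tokenize-text implementation: serie, flow, words, and dropFirst are hypothetical stand-ins. In this reading, a series passes the whole token list through each tokenizer in turn, while a flow runs the same chain separately for each input token and concatenates the results.

```javascript
// Hypothetical stand-ins; none of these functions come from tokenize-text.

// Split every input token into word tokens, keeping absolute positions.
function words(tokens) {
    var out = [];
    tokens.forEach(function (token) {
        Array.from(token.value.matchAll(/\w+/g)).forEach(function (match) {
            out.push({
                value: match[0],
                index: token.index + match.index,
                offset: match[0].length
            });
        });
    });
    return out;
}

// Drop the first token of whatever list is received.
function dropFirst(tokens) {
    return tokens.slice(1);
}

// Series: each tokenizer receives the full output of the previous one.
function serie() {
    var tokenizers = Array.prototype.slice.call(arguments);
    return function (tokens) {
        return tokenizers.reduce(function (current, tokenizer) {
            return tokenizer(current);
        }, tokens);
    };
}

// Flow: the chain is applied to each input token on its own.
function flow() {
    var run = serie.apply(null, arguments);
    return function (tokens) {
        return tokens.reduce(function (out, token) {
            return out.concat(run([token]));
        }, []);
    };
}

var sentences = [
    { value: 'one two', index: 0, offset: 7 },
    { value: ' three four', index: 8, offset: 11 }
];

function valueOf(token) { return token.value; }

var inSeries = serie(words, dropFirst)(sentences).map(valueOf);
var perToken = flow(words, dropFirst)(sentences).map(valueOf);

console.log(inSeries); // first word of the whole list is dropped
console.log(perToken); // first word of each sentence is dropped
```

Under this reading, a flow is the natural choice when a later step depends on the neighbours of a token within a single sentence rather than across the whole text.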
Examples
Extract repeated words in sentences
Example to extract all repeated words in sentences:
var repeatedWords = tokenize;
var tokens = ;
/*
[
    { value: 'great', index: 14, offset: 5 },
    { value: 'an', index: 33, offset: 2 }
]
*/