For when tokenizers fail!

Have you tried to tokenize a sentence with combined words like 'tokenizerFail'? Well, that is easy because the words use camel case. But how about 'tokenizerfail'? I'm sure you see the trouble you run into when tokenizing these!
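To see why the camel-case form is the easy one, here is a minimal sketch of a splitter that relies only on letter case. The helper name splitCamelCase is hypothetical and is not part of wordize; it only illustrates that case changes give a tokenizer an explicit boundary, while an all-lowercase compound gives it nothing to work with.

```javascript
// Hypothetical helper for illustration only; not part of the wordize API.
function splitCamelCase(word) {
  return word
    // lower-case letter or digit followed by a capital: "tokenizerFail" -> "tokenizer Fail"
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
    // run of capitals followed by a capitalised word: "HTMLParser" -> "HTML Parser"
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
    .split(' ');
}

console.log(splitCamelCase('tokenizerFail')); // [ 'tokenizer', 'Fail' ]
console.log(splitCamelCase('tokenizerfail')); // [ 'tokenizerfail' ]
```

The second call returns the input unchanged, which is exactly the case this package is meant to handle.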

Unfortunately, with the advent of social media, these kinds of 'compounded' words are much more common (especially in hashtags).

This package uses the concept of known consonant blends to try to discover, and hence tokenize/humanize, such words. It is not perfect (I'm looking for other methods to enhance it), but it gets you closer to perfect tokenization.
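One way to read the consonant-blend idea is sketched below. This is a guess at the general approach, not wordize's actual implementation: the blend list is a small assumed sample (not the contents of ./lang/en.json), and the names BLENDS, allowed, and splitOnBlends are hypothetical. The sketch treats a consonant cluster between two vowels as suspicious when it is neither a single consonant nor a known blend, and cuts it at the first point where both halves are acceptable.

```javascript
// Assumed sample of English consonant blends; NOT the real ./lang/en.json data.
const BLENDS = new Set([
  'bl', 'br', 'ch', 'ck', 'cl', 'cr', 'dr', 'fl', 'fr', 'gl', 'gr', 'll',
  'nd', 'ng', 'nk', 'pl', 'pr', 'sh', 'sk', 'sl', 'sm', 'sn', 'sp', 'st',
  'sw', 'th', 'tr', 'tw', 'wh',
]);

const isVowel = (ch) => 'aeiou'.includes(ch);

// A cluster is acceptable if it is a single consonant or a known blend.
const allowed = (cluster) => cluster.length === 1 || BLENDS.has(cluster);

// Hypothetical helper: split one compounded word at unlikely consonant clusters.
function splitOnBlends(word) {
  const lower = word.toLowerCase();
  const parts = [];
  let start = 0; // start index of the word currently being collected
  let i = 0;
  while (i < lower.length) {
    if (isVowel(lower[i])) { i++; continue; }
    // Collect the run of consonants occupying [i, j).
    let j = i;
    while (j < lower.length && !isVowel(lower[j])) j++;
    const cluster = lower.slice(i, j);
    const internal = i > 0 && j < lower.length; // has a vowel on both sides
    if (internal && !allowed(cluster)) {
      // Cut at the first position where both halves are acceptable.
      for (let k = 1; k < cluster.length; k++) {
        if (allowed(cluster.slice(0, k)) && allowed(cluster.slice(k))) {
          parts.push(lower.slice(start, i + k));
          start = i + k;
          break;
        }
      }
    }
    i = j;
  }
  parts.push(lower.slice(start));
  return parts;
}

console.log(splitOnBlends('tokenizerfail')); // [ 'tokenizer', 'fail' ]
console.log(splitOnBlends('freakingpope'));  // [ 'freaking', 'pope' ]
console.log(splitOnBlends('yellow'));        // [ 'yellow' ]
```

A heuristic like this will mis-split real words whose internal clusters are missing from the list, which is in line with the caveat above that the approach is not perfect.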

Adapt for your language

Don't speak English? Go to the ./lang folder and add the consonant blends for your language, using ./lang/en.json as a reference.

const wordize = require('wordize');

const str = 'there is this bigmanInYellowSUIT who thinks he is the freakingpope & our rainmaker';

//humanize
wordize.humanize(str, 'en'); //There is this big man in yellow suit who thinks he is the freaking pope & our rain maker

//get words from the sentence
//Note: The second parameter is the appropriate language code. Defaults to 'en'
wordize.words(str); //[ 'There', 'is', 'this', 'big', 'man', 'in', 'yellow', 'suit', 'who', 'thinks', 'he', 'is', 'the', 'freaking', 'pope', 'our', 'rain', 'maker' ]

Got ideas on how we can enhance this module? Please share!
