@shelf/text-normalizer

2.0.1 • Public • Published

text-normalizer

Originally taken from openai/whisperer and rewritten in TypeScript.

A TypeScript library for normalizing English text. It provides a utility class, EnglishTextNormalizer, with methods for normalizing features of text such as contractions, abbreviations, and spacing. EnglishTextNormalizer is built from other classes that you can reuse independently:

  • EnglishSpellingNormalizer - maps English words to their American spelling using a dictionary stored in a JSON file named english.json
  • EnglishNumberNormalizer - converts numbers written as English words into digits
  • BasicTextNormalizer - provides methods for removing special characters and diacritics from text, as well as splitting words into separate letters.
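
To make the dictionary-based approach behind EnglishSpellingNormalizer more concrete, here is a minimal sketch of a word-level lookup. It does not reproduce the package's code: the two-entry map, the normalizeSpellingSketch function, and the choice to split on single spaces are all assumptions made for illustration, and the real english.json is much larger.

```typescript
// Hypothetical two-entry dictionary; stands in for the much larger english.json.
const spellingMap = new Map<string, string>([
  ['colour', 'color'],
  ['organise', 'organize'],
]);

// Replace each space-separated word with its mapped spelling when one exists.
// Words missing from the dictionary pass through unchanged.
function normalizeSpellingSketch(text: string, mapping: Map<string, string>): string {
  return text
    .split(' ')
    .map(word => mapping.get(word) ?? word)
    .join(' ');
}

console.log(normalizeSpellingSketch('colour and organise', spellingMap));
```

A Map is used instead of a plain object so that words such as "constructor" are not accidentally resolved against Object.prototype.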

Install

$ yarn add @shelf/text-normalizer

Usage

Node.js

import {EnglishTextNormalizer} from '@shelf/text-normalizer';

const normalizer = new EnglishTextNormalizer();

console.log(normalizer.normalize("Let's")); // Output: let us
console.log(normalizer.normalize("he's like")); // Output: he is like
console.log(normalizer.normalize("she's been like")); // Output: she has been like
console.log(normalizer.normalize('10km')); // Output: 10 km
console.log(normalizer.normalize('10mm')); // Output: 10 mm
console.log(normalizer.normalize('RC232')); // Output: rc 232
console.log(normalizer.normalize('Mr. Park visited Assoc. Prof. Kim Jr.')); // Output: mister park visited associate professor kim junior

Browser

import {EnglishTextNormalizer} from 'https://esm.sh/@shelf/text-normalizer';

const normalizer = new EnglishTextNormalizer();

console.log(normalizer.normalize("Let's")); // Output: let us
console.log(normalizer.normalize("he's like!")); // Output: he is like

Advanced Usage

Using EnglishNumberNormalizer

import {EnglishNumberNormalizer} from '@shelf/text-normalizer';

const numberNormalizer = new EnglishNumberNormalizer();

console.log(numberNormalizer.normalize('twenty-five')); // Output: 25
console.log(numberNormalizer.normalize('three million')); // Output: 3000000
console.log(numberNormalizer.normalize('two and a half')); // Output: 2.5
console.log(numberNormalizer.normalize('fifty percent')); // Output: 50%

Using EnglishSpellingNormalizer

import {EnglishSpellingNormalizer} from '@shelf/text-normalizer';

const spellingNormalizer = new EnglishSpellingNormalizer();

console.log(spellingNormalizer.normalize('colour')); // Output: color
console.log(spellingNormalizer.normalize('organise')); // Output: organize

Using BasicTextNormalizer

import {BasicTextNormalizer} from '@shelf/text-normalizer';

const basicNormalizer = new BasicTextNormalizer(true, true);

console.log(basicNormalizer.normalize('Café!')); // Output: c a f e
console.log(basicNormalizer.normalize('Hello [World]')); // Output: h e l l o

Configuration

BasicTextNormalizer

The BasicTextNormalizer constructor accepts two optional boolean parameters:

  • removeDiacritics (default: false): If set to true, diacritics will be removed from the text.
  • splitLetters (default: false): If set to true, letters will be split into individual characters.

Example:

const normalizer = new BasicTextNormalizer(true, true);
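
For readers who want a feel for what the two flags do, the following sketch shows one common way to implement each step in plain TypeScript. It is an illustration only, not the package's implementation: the function names are hypothetical, diacritics are stripped via Unicode NFKD decomposition, and the letter-splitting step here drops whitespace before spacing out the characters, which may differ from what BasicTextNormalizer does.

```typescript
// Strip diacritics: decompose characters (e.g. "é" becomes "e" + combining accent),
// then delete the combining marks in the U+0300..U+036F range.
function removeDiacriticsSketch(text: string): string {
  return text.normalize('NFKD').replace(/[\u0300-\u036f]/g, '');
}

// Split letters: remove whitespace, then join the remaining characters with spaces.
function splitLettersSketch(text: string): string {
  return Array.from(text.replace(/\s+/g, '')).join(' ');
}

// Apply the optional steps in the same order as the constructor parameters.
function basicNormalizeSketch(
  text: string,
  removeDiacritics = false,
  splitLetters = false
): string {
  let result = text;
  if (removeDiacritics) {
    result = removeDiacriticsSketch(result);
  }
  if (splitLetters) {
    result = splitLettersSketch(result);
  }
  return result;
}

console.log(basicNormalizeSketch('Caf\u00e9', true, true));
```

With both flags left at their defaults, the sketch returns its input unchanged.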

Performance Considerations

  • The EnglishTextNormalizer combines multiple normalization techniques and may be slower for very large texts. Consider using individual normalizers (EnglishNumberNormalizer, EnglishSpellingNormalizer, or BasicTextNormalizer) if you only need specific functionality.
  • For repeated normalization of large amounts of text, consider initializing the normalizer once and reusing it to avoid unnecessary setup time.
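
The reuse advice above can be sketched as follows. To keep the example self-contained, it uses a hypothetical TextNormalizer interface and a stand-in LowercaseNormalizer class rather than the package's classes; the only assumption about a real normalizer is that it exposes a normalize(text) method, as in the usage examples.

```typescript
// Hypothetical minimal shape shared by the normalizers in this sketch.
interface TextNormalizer {
  normalize(text: string): string;
}

// Stand-in normalizer that counts how many times it is constructed.
class LowercaseNormalizer implements TextNormalizer {
  static instances = 0;

  constructor() {
    LowercaseNormalizer.instances += 1;
  }

  normalize(text: string): string {
    return text.trim().toLowerCase();
  }
}

// Normalize a batch of texts with one normalizer instance passed in by the caller.
function normalizeAll(normalizer: TextNormalizer, texts: string[]): string[] {
  return texts.map(text => normalizer.normalize(text));
}

// Create the normalizer once, outside any loop, and reuse it for every input.
const sharedNormalizer = new LowercaseNormalizer();
const results = normalizeAll(sharedNormalizer, ['  Hello ', 'WORLD']);

console.log(results, LowercaseNormalizer.instances);
```

Passing the instance in as a parameter keeps construction cost out of the per-text path and makes it easy to swap in a narrower normalizer when only one kind of normalization is needed.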

Related Projects

  • compromise - Natural language processing in JavaScript

Publish

$ git checkout master
$ yarn version
$ yarn publish
$ git push origin master --tags

License

MIT © Shelf
