Light-weight sentence tokenizer for Japanese.
published version 1.0.2, 3 years ago
Light-weight sentence tokenizer for Chinese languages.
published version 1.0.1, 3 years ago

Light-weight sentence tokenizer for Korean. Supports both full-width and half-width punctuation marks.
published version 1.0.1, 3 years ago
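The packages' names and exports are not shown in this listing, so the sketch below is illustrative only: a single-pass sentence splitter of the kind these three tokenizers describe, cutting after sentence-final punctuation and covering both full-width and half-width marks. `splitSentences` and the terminator set are assumptions, not the packages' actual APIs.

```ts
// Illustrative sketch, not the actual package source. Splits text into
// sentences by scanning once and cutting after sentence-final punctuation.
// Covers the full-width marks (。！？) used in Japanese, Chinese, and Korean
// text as well as their half-width counterparts (.!?).
const TERMINATORS = new Set(["。", "！", "？", ".", "!", "?"]);

// Hypothetical name; the real packages' exports are not shown in this listing.
export function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let current = "";
  for (const ch of text) {
    current += ch;
    if (TERMINATORS.has(ch)) {
      const sentence = current.trim();
      if (sentence.length > 0) sentences.push(sentence);
      current = "";
    }
  }
  const rest = current.trim();
  if (rest.length > 0) sentences.push(rest); // keep a trailing unterminated fragment
  return sentences;
}

// splitSentences("今日は晴れ。明日は雨か？") → ["今日は晴れ。", "明日は雨か？"]
```

A production tokenizer needs more care around half-width periods (decimal points, abbreviations); this only shows the basic scanning shape.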
Light-weight tool for normalizing whitespace, splitting lines, and accurately tokenizing words (no regex). Multiple natural languages supported.
published version 1.0.3, 3 years ago
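A sketch of the no-regex approach this description implies (the function names here are hypothetical, not the package's API): one character-by-character pass that treats any run of whitespace as a single word boundary, from which whitespace normalization falls out for free.

```ts
// Illustrative sketch; names are assumed, not the package's API.
// Whitespace characters treated as word boundaries, including the
// no-break space and the ideographic space used in CJK text.
const WHITESPACE = new Set([" ", "\t", "\n", "\r", "\f", "\v", "\u00a0", "\u3000"]);

// Tokenize on whitespace in a single pass, without regular expressions.
export function splitIntoWords(text: string): string[] {
  const words: string[] = [];
  let current = "";
  for (const ch of text) {
    if (WHITESPACE.has(ch)) {
      if (current.length > 0) {
        words.push(current);
        current = "";
      }
    } else {
      current += ch;
    }
  }
  if (current.length > 0) words.push(current);
  return words;
}

// Normalizing whitespace reuses the same pass:
// collapse every whitespace run to a single space.
export function normalizeWhitespace(text: string): string {
  return splitIntoWords(text).join(" ");
}

// splitIntoWords(" foo \t bar\nbaz ")      → ["foo", "bar", "baz"]
// normalizeWhitespace(" foo \t bar\nbaz ") → "foo bar baz"
```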
Tool for stripping and normalizing punctuation and other non-alphanumeric characters. Supports multiple natural languages. Useful for scraping, machine learning, and data analysis.
published version 1.0.2, 3 years ago
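A minimal sketch of what such a normalizer does (`stripPunctuation` is an assumed name): keep letters and digits, turn everything else into a space, and collapse the result. Unicode property escapes keep the check language-agnostic, matching the multi-language claim.

```ts
// Illustrative sketch; stripPunctuation is an assumed name, not the
// package's API. \p{L} (letters) and \p{N} (numbers) make the check
// Unicode-aware, so text in non-Latin scripts survives the stripping.
function isWordChar(ch: string): boolean {
  return /[\p{L}\p{N}]/u.test(ch);
}

export function stripPunctuation(text: string): string {
  let out = "";
  for (const ch of text) {
    out += isWordChar(ch) ? ch : " ";
  }
  // collapse the spaces introduced where punctuation was removed
  return out.split(" ").filter((w) => w.length > 0).join(" ");
}

// stripPunctuation('He said: "hi!"') → "He said hi"
```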
Light-weight tool for converting characters in a string into common HTML entities (without regex).
published version 1.0.2, 3 years ago
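A sketch of the lookup-table approach such a tool typically takes (`escapeHtml` and the entity map are illustrative): one pass over the string, each character either mapped to its entity or copied through unchanged, no regex involved.

```ts
// Illustrative sketch; escapeHtml is an assumed name. The map covers the
// characters most commonly converted to entities; a fuller tool would map more.
const ENTITIES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

export function escapeHtml(text: string): string {
  let out = "";
  for (const ch of text) {
    out += ENTITIES[ch] ?? ch; // copy the character through when unmapped
  }
  return out;
}

// escapeHtml('<a href="x">') → '&lt;a href=&quot;x&quot;&gt;'
```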
Tool for escaping script tags using backslashes (no regex).
published version 1.0.4, 3 years ago
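The usual reason to backslash-escape closing script tags is embedding a string inside an inline <script> block, where a literal </script> would end the block early. A sketch under that assumption (`escapeScriptTags` is an illustrative name; the package's exact behavior may differ):

```ts
// Illustrative sketch; escapeScriptTags is an assumed name. Turns every
// closing script tag "</script" into "<\/script" so the string can sit
// safely inside an inline <script> block. Plain index scanning, no regex.
export function escapeScriptTags(text: string): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    if (
      text[i] === "<" &&
      text[i + 1] === "/" &&
      text.slice(i + 2, i + 8).toLowerCase() === "script" // case-insensitive tag
    ) {
      out += "<\\/"; // backslash-escape the slash of the closing tag
      i += 2;        // the "script…" part is copied by later iterations
    } else {
      out += text[i];
      i += 1;
    }
  }
  return out;
}

// escapeScriptTags("</script>") returns the text  <\/script>
```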