unicode-substring

Unicode-aware substring for JavaScript. Surrogate pairs are counted as a single character.

What?

Characters in JavaScript strings are exposed as 16-bit code points, also known as UCS-2 encoding. This usually good enough, but since there are more than 2^16 characters in Unicode, 16 bits is not enough to represent all characters. To overcome this limitation, characters with scalar value over 0x10FFFF need to be encoded as surrogate pairs. This encoding is known as UTF-16.

The purpose of this library is to treat surrogate pairs as one character when extracting substrings from a string. This might be preferable if indices are returned from an Unicode-compatible environment.

Usage

 
var unicodeSubstring = require('unicode-substring')
// unicodeSubstring(string, start, end)
unicodeSubstring("💥Emoji Rule💥", 0, 6)
// => "💥Emoji"

The start and end parameters behave similarly as String.prototype.substring.