@neopass/wordlist

0.5.2 • Public • Published

wordlist

Generate a word list from various sources, including system dictionaries and SCOWL.

Includes a default list of ~86,000 English words.

Additional dictionary/wordlist paths can be configured via the options. System dictionaries exist at locations such as /usr/share/dict/words, /usr/share/dict/british-english, etc.

Installation

npm install @neopass/wordlist

Usage

There are three functions available for creating word lists: wordList, wordListSync, and listBuilder. The default list is used when no options are given, so no configuration is required.

wordList builds and returns the list asynchronously:

const { wordList } = require('@neopass/wordlist')

wordList().then(list => console.log(list.length)) // 86748

wordListSync builds and returns the list synchronously:

const { wordListSync } = require('@neopass/wordlist')

const list = wordListSync()
console.log(list.length) // 86748

listBuilder returns a builder function that asynchronously invokes a callback for each word:

const { listBuilder } = require('@neopass/wordlist')

const builder = listBuilder()
const list = []

builder(word => list.push(word))
  .then(() => console.log(list.length)) // 86748

Options

export interface IListOptions {
  /**
   * Word list paths to search for in order. Only the first
   * one found is used. This option is ignored if 'combine'
   * is a non-empty array.
   *
   * default: [
   *  '$default',
   * ]
   */
  paths?: string[]
  /**
   * Word list paths to combine. All found files are used.
   */
  combine?: string[]
  /**
   * Mutate the list by filtering on lower-case words, converting to
   * lower case, or applying a custom mutator function.
   */
  mutator?: 'only-lower'|'to-lower'|Mutator
}

paths: Allows alternate, fallback lists to be used.

combine: Allows multiple lists to be combined into one.

mutator: Transforms or filters the words in the list, depending on the value provided.

  • only-lower: Keep only words composed entirely of the characters [a-z].
  • to-lower: Convert words to lower case.
  • Mutator: (word: string) => string|string[]|void: a custom function that receives a word and returns one or more words, or undefined. Used for custom transformation/exclusion of words in the list.

Return values:

  • string: the returned string is added to the list.
  • string[]: all returned strings are added to the list.
  • For any other return value the word is not added.

const assert = require('assert')
const { wordList } = require('@neopass/wordlist')

/**
 * Create a custom mutator for splitting hyphenated words
 * and converting them to lower case.
 */
function customMutator(word) {
  // Will return ['west', 'ender'] for an input of 'West-ender'.
  return word.split('-').map(part => part.toLowerCase())
}

const options = {
  paths: ['/some/list/path/words.txt'],
  mutator: customMutator,
}

// Assumes the list at the path above contains 'West-ender'.
wordList(options).then((list) => {
  assert(list.includes('west'))
  assert(list.includes('ender'))
})
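
The return-value rules above can be tried out without loading a word list. The following is a minimal sketch, not the library's implementation: applyMutator is a hypothetical helper that follows the documented rules (a string is added, an array adds every element, anything else drops the word).

```javascript
// Hypothetical helper, not part of @neopass/wordlist: apply a mutator
// to each word following the documented return-value rules.
function applyMutator(words, mutator) {
  const result = []
  for (const word of words) {
    const value = mutator(word)
    if (typeof value === 'string') {
      result.push(value)     // string: the returned string is added
    } else if (Array.isArray(value)) {
      result.push(...value)  // string[]: every returned string is added
    }
    // Any other return value: the word is not added.
  }
  return result
}

// Split hyphenated words, lower-case the parts, and drop words with digits.
function customMutator(word) {
  if (/\d/.test(word)) return undefined
  return word.split('-').map(part => part.toLowerCase())
}

console.log(applyMutator(['West-ender', 'B2B', 'Apple'], customMutator))
// → [ 'west', 'ender', 'apple' ]
```

Testing a mutator this way against a handful of known words is cheaper than building a full list each time.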

Specify Alternate Word Lists

The paths specified in options are searched in order and the first list found is used. This allows for the use of system word lists with different names and/or locations on various platforms. A common location for the system word list is /usr/share/dict/words.

const { wordList } = require('@neopass/wordlist')

// Prefer british-english list.
const options = {
  paths: [
    '/usr/share/dict/british-english',  // if found, use this one
    '/usr/share/dict/american-english', // else if found, use this one
    '/usr/share/dict/words',            // else if found, use this one
    '$default',  // else use this one
  ]
}

wordList(options)
  .then(list => console.log(list.length)) // 101825
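
To see which fallback would be chosen on a given machine, the search order can be approximated outside the library. This is a sketch under assumptions: firstExisting is a hypothetical helper, and it only checks that a path exists, which may differ from how the package itself resolves paths and aliases.

```javascript
const fs = require('fs')

// Hypothetical helper, not part of @neopass/wordlist: return the first
// candidate path that exists on disk, or the fallback when none do.
function firstExisting(candidates, fallback) {
  const found = candidates.find(candidate => fs.existsSync(candidate))
  return found === undefined ? fallback : found
}

const chosen = firstExisting(
  [
    '/usr/share/dict/british-english',
    '/usr/share/dict/american-english',
    '/usr/share/dict/words',
  ],
  '$default'
)

// Prints the path (or alias) that would be used on this system.
console.log(chosen)
```

Logging the chosen path can help explain why list lengths differ between platforms.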

Combine Lists

Lists can be combined into one with the combine option:

const { wordList } = require('@neopass/wordlist')

// Combine multiple dictionaries.
const options = {
  combine: [
    // System dictionary.
    '/usr/share/dict/words', // use this one
    '$default',              // and use this one
  ]
}

wordList(options)
  .then(list => console.log(list.length)) // 335427

Important: Using combine with wordList/wordListSync will result in duplicates if the lists overlap. It is recommended to use combine with listBuilder to control how words are added. For example, a Set can be used to eliminate duplicates from combined lists:

const { listBuilder } = require('@neopass/wordlist')

// Combine multiple lists.
const options = {
  combine: [
    // System dictionary.
    '/usr/share/dict/words',
    // Default list.
    '$default',
  ]
}

// Create a list builder.
const builder = listBuilder(options)

// Create a set to avoid duplicate words.
const set = new Set()

// Run the builder.
builder(word => set.add(word))
  .then(() => console.log(set.size)) // 299569
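
If a plain array is needed after de-duplication, the Set pattern can be wrapped in a small helper. This is a sketch: collectUnique is a hypothetical name, and fakeBuilder stands in for the function returned by listBuilder(options) so that the example runs on its own. It relies only on the builder contract shown above: it takes a per-word callback and returns a promise.

```javascript
// Hypothetical helper: run a builder and resolve to a sorted array
// of unique words.
function collectUnique(builder) {
  const set = new Set()
  return builder(word => set.add(word)).then(() => Array.from(set).sort())
}

// Stand-in for listBuilder(options); emits overlapping words the way
// two combined lists might.
function fakeBuilder(onWord) {
  for (const word of ['pear', 'apple', 'pear', 'fig']) {
    onWord(word)
  }
  return Promise.resolve()
}

collectUnique(fakeBuilder).then(words => console.log(words))
// → [ 'apple', 'fig', 'pear' ]
```

With the real package, collectUnique(listBuilder(options)) would be the equivalent call.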

The Default List

The default list is a ~86,000-word, PG-13, lower-case list taken from English SCOWL sources, with some other additions including slang.

Suggestions for additions to the default list are welcome; please submit an issue. Whole lists are definitely preferred to single-word suggestions, e.g., "notable extraterrestrials in history", "insects of upper polish honduras", or "names of horses in modern literature". Suggestions for removing inappropriate words are also welcome (curse words, coarse words/slang, racial slurs, etc.).

By default the list alias, $default, is included in the options. This allows wordlist to create a largish list without any additional configuration.

export const defaultOptions: IListOptions = {
  paths: [
    '$default'
  ]
}

/**
 * We don't need to specify a config because the `$default` alias
 * is part of the default configuration.
 */
const { wordListSync } = require('@neopass/wordlist')

const list = wordListSync()

The $default alias (along with other aliases) resolves to a path at run time.

Generate a List From SCOWL Sources

SCOWL word lists are included as aliases, and can be used to generate custom lists:

const { listBuilder } = require('@neopass/wordlist')

// Combine multiple lists from scowl.
const options = {
  combine: [
    '$english-words.10',
    '$english-words.20',
    '$english-words.35',
    '$special-hacker.50',
  ]
}

// Create a list builder.
const builder = listBuilder(options)

// We'll add the words to a set.
const set = new Set()

// Run the builder.
builder(word => set.add(word))
  .then(() => console.log(set.size)) // 49130

Warning: Some SCOWL sources contain words not appropriate for all audiences, including swear words, racial slurs, and words of a sexual nature. You'll most likely want to scrutinize these sources depending on your use case and intended audience.

SCOWL is primarily intended as a source for spell checkers. From the SCOWL website:

SCOWL (Spell Checker Oriented Word Lists) and Friends is a database of information on English words useful for creating high-quality word lists suitable for use in spell checkers of most dialects of English. The database primary contains information on how common a word is, differences in spelling between the dialects if English, spelling variant information, and (basic) part-of-speech and inflection information.

Note: SCOWL sources contain some words with apostrophes (such as a trailing 's) and some with non-ASCII Unicode characters. Care should be taken to deal with these depending on your needs. For example, we can transform words to remove any trailing 's characters and then only accept words that contain the letters a-z:

const { listBuilder } = require('@neopass/wordlist')

/**
 * Remove trailing `'s` from words.
 */
function transform(word) {
  if (word.endsWith(`'s`)) {
    return word.slice(0, -2)
  }
  return word
}

/**
 * Determine if a word should be added.
 */
function accept(word) {
  // Only accept words with characters a-z (case insensitive).
  return (/^[a-z]+$/i).test(word)
}

// Combine multiple lists from scowl.
const options = {
  combine: [
    '$english-words.10',
    '$english-words.20',
    '$english-words.35',
    '$special-hacker.50',
  ]
}

// Create a list builder.
const builder = listBuilder(options)

// Create a set to avoid duplicate words.
const set = new Set()

// Run the builder, transforming and filtering each word.
const done = builder((word) => {
  word = transform(word)

  if (accept(word)) {
    set.add(word)
  }
})

done.then(() => console.log(set.size)) // 38714
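
The a-z filter above discards accented words entirely. Where keeping them in unaccented form is preferable (an assumption about your use case, not something the package does), one option is to fold diacritics before filtering. stripDiacritics and accept below are illustrative names, not part of the package. The sketch uses the standard String.prototype.normalize to decompose characters and then removes the combining marks.

```javascript
// Illustrative helper: decompose accented characters (NFD) and remove
// the combining marks in the range U+0300 to U+036F.
function stripDiacritics(word) {
  return word.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}

// Only accept words with characters a-z (case insensitive).
function accept(word) {
  return (/^[a-z]+$/i).test(word)
}

const word = 'caf\u00e9'                     // 'café'
console.log(accept(word))                    // → false
console.log(stripDiacritics(word))           // → cafe
console.log(accept(stripDiacritics(word)))   // → true
```

Characters that do not decompose into a base letter plus a combining mark are left unchanged and are still rejected by the a-z test.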

SCOWL Aliases

A path alias is defined for every SCOWL source list. SCOWL aliases consist of the $ character followed by the source file name. The following is a representative sample of the available source aliases.

$american-abbreviations.70
$american-abbreviations.95
$american-proper-names.80
$american-proper-names.95
$american-upper.50
$american-upper.80
$american-upper.95
$american-words.35
$american-words.80
$australian-abbreviations.35
$australian-abbreviations.80
$australian-contractions.35
$australian-proper-names.35
$australian-proper-names.80
$australian-proper-names.95
$australian-upper.60
$australian-upper.95
$australian-words.35
$australian-words.80
$australian_variant_1-abbreviations.95
$australian_variant_1-contractions.60
$australian_variant_1-proper-names.80
$australian_variant_1-proper-names.95
$australian_variant_1-upper.80
$australian_variant_1-upper.95
$australian_variant_1-words.80
$australian_variant_1-words.95
$australian_variant_2-abbreviations.80
$australian_variant_2-abbreviations.95
$australian_variant_2-contractions.50
$australian_variant_2-contractions.70
$australian_variant_2-proper-names.95
$australian_variant_2-upper.80
$australian_variant_2-words.55
$australian_variant_2-words.95
$british-abbreviations.35
$british-abbreviations.80
$british-proper-names.80
$british-proper-names.95
$british-upper.50
$british-upper.95
$british-words.10
$british-words.20
$british-words.35
$british-words.95
$british_variant_1-abbreviations.55
$british_variant_1-contractions.35
$british_variant_1-contractions.60
$british_variant_1-upper.95
$british_variant_1-words.10
$british_variant_1-words.95
$british_variant_2-abbreviations.70
$british_variant_2-contractions.50
$british_variant_2-upper.35
$british_variant_2-upper.95
$british_variant_2-words.80
$british_variant_2-words.95
$british_z-abbreviations.80
$british_z-abbreviations.95
$british_z-proper-names.80
$british_z-proper-names.95
$british_z-upper.50
$british_z-upper.95
$british_z-words.10
$british_z-words.95
$canadian-abbreviations.55
$canadian-proper-names.80
$canadian-proper-names.95
$canadian-upper.50
$canadian-upper.95
$canadian-words.10
$canadian-words.95
$canadian_variant_1-abbreviations.55
$canadian_variant_1-contractions.35
$canadian_variant_1-proper-names.95
$canadian_variant_1-upper.35
$canadian_variant_1-upper.80
$canadian_variant_1-words.35
$canadian_variant_1-words.95
$canadian_variant_2-abbreviations.70
$canadian_variant_2-contractions.50
$canadian_variant_2-upper.35
$canadian_variant_2-upper.80
$canadian_variant_2-words.35
$canadian_variant_2-words.80
$english-abbreviations.20
$english-abbreviations.80
$english-contractions.35
$english-contractions.80
$english-contractions.95
$english-proper-names.35
$english-proper-names.80
$english-upper.35
$english-upper.80
$english-words.80
$english-words.95
$special-hacker.50
$special-roman-numerals.35
$variant_1-abbreviations.55
$variant_1-abbreviations.95
$variant_1-contractions.35
$variant_1-proper-names.80
$variant_1-proper-names.95
$variant_1-upper.35
$variant_1-upper.80
$variant_1-words.20
$variant_1-words.80
$variant_2-abbreviations.70
$variant_2-abbreviations.95
$variant_2-contractions.50
$variant_2-contractions.70
$variant_2-upper.35
$variant_2-upper.95
$variant_2-words.35
$variant_2-words.95
$variant_3-abbreviations.40
$variant_3-abbreviations.95
$variant_3-words.35
$variant_3-words.95

See the SCOWL Readme for a description of SCOWL sources.
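
Because each alias is $ followed by a file name of the form name.number, a combine array can be assembled programmatically. The sketch below assumes the numeric suffix is SCOWL's size level, with lower numbers holding more common words; check the SCOWL Readme to confirm. parseAlias and aliasesUpTo are hypothetical helpers, not part of the package.

```javascript
// Hypothetical helper: split an alias such as '$english-words.35' into
// its name and numeric suffix. Returns null for aliases like '$default'.
function parseAlias(alias) {
  const match = /^\$(.+)\.(\d+)$/.exec(alias)
  if (!match) return null
  return { name: match[1], level: Number(match[2]) }
}

// Hypothetical helper: keep aliases with the given name whose numeric
// suffix does not exceed maxLevel.
function aliasesUpTo(aliases, name, maxLevel) {
  return aliases.filter((alias) => {
    const parsed = parseAlias(alias)
    return parsed !== null && parsed.name === name && parsed.level <= maxLevel
  })
}

const sample = [
  '$british-words.10',
  '$british-words.20',
  '$british-words.35',
  '$british-words.95',
  '$british-upper.50',
]

console.log(aliasesUpTo(sample, 'british-words', 35))
// → [ '$british-words.10', '$british-words.20', '$british-words.35' ]
```

The resulting array can be passed as the combine option.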

Create a Custom Word List File

A custom word list file from miscellaneous sources can be assembled with the wordlist-gen binary, or the word-gen utility in the wordlist repo.

From the @neopass/wordlist package:

npx wordlist-gen --sources <path1 path2 ...> [options]

From the wordlist repo:

git clone git@github.com:neopass/wordlist.git
cd wordlist
node bin/word-gen --sources <path1 path2 ...> [options]

First, set up a directory of book and/or word list files, for example:

root
  +-- data
    +-- books
    | -- modern steam engine design.txt
    | -- how to skin a rabbit.txt
    +-- lists
    | -- names.txt
    | -- animals.txt
    | -- slang.txt
    +-- scowl
    | -- english-words.10
    | -- english-words.20
    | -- english-words.35
    | -- special-hacker.50
    +-- exclusions
    | -- patterns.txt

The directory structure is arbitrary. Files should be UTF-8 text and can contain one or more words per line. The exclusions directory is optional.

npx wordlist-gen --sources data/books data/lists data/scowl --out my-words.txt

The --sources option accepts multiple files and/or directories.

Note: only words consisting of letters a-z are added, and they're all lower-cased.
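
To preview roughly which words a source line would contribute, the a-z rule can be approximated as below. This is only a sketch: extractWords is a hypothetical helper, and the generator's real tokenization (for example, how it treats punctuation attached to a word) is not documented here and may differ.

```javascript
// Hypothetical approximation of the rule above: split a line on
// whitespace, keep tokens made only of letters a-z, and lower-case them.
function extractWords(line) {
  return line
    .split(/\s+/)
    .filter(token => /^[a-z]+$/i.test(token))
    .map(token => token.toLowerCase())
}

console.log(extractWords('Names  Animals slang x-ray'))
// → [ 'names', 'animals', 'slang' ]
```

In this approximation a token such as x-ray is dropped whole rather than split into parts.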

Exclusions

Words can be scrubbed by specifying exclusions:

node bin/word-gen <...> --exclude data/exclusions

Much like the sources, exclusions can consist of multiple files and/or directories in the following format:

# Exclude whole words (case insensitive):
spoon
fork
Tongs

# Exclude patterns (as regular expressions):
/^fudge/i   # words starting with 'fudge'
/crikey/i   # words containing 'crikey'
/shazam$/   # words ending in lowercase 'shazam'
/^BLASTED$/ # exact match for uppercase 'BLASTED'
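
The exclusion format can be checked before running the generator. The following is a minimal sketch of a parser for the format shown above, not the tool's own code. parseExclusions is a hypothetical name, and it assumes that # starts a comment only at the beginning of a line or after whitespace.

```javascript
// Hypothetical parser for the exclusion format: returns a function that
// reports whether a word would be excluded.
function parseExclusions(text) {
  const words = new Set()
  const patterns = []

  for (const raw of text.split(/\r?\n/)) {
    // Drop full-line and trailing comments, then surrounding whitespace.
    const line = raw.replace(/(^|\s)#.*$/, '').trim()
    if (line === '') continue

    // Lines of the form /pattern/flags are regular expressions.
    const match = /^\/(.+)\/([a-z]*)$/.exec(line)
    if (match) {
      patterns.push(new RegExp(match[1], match[2]))
    } else {
      words.add(line.toLowerCase()) // whole words are case insensitive
    }
  }

  return word =>
    words.has(word.toLowerCase()) || patterns.some(p => p.test(word))
}

const isExcluded = parseExclusions([
  '# Exclude whole words (case insensitive):',
  'spoon',
  'Tongs',
  '/^fudge/i   # words starting with \'fudge\'',
  '/shazam$/   # words ending in lowercase \'shazam\'',
].join('\n'))

console.log(isExcluded('Spoon'))      // → true
console.log(isExcluded('fudgesicle')) // → true
console.log(isExcluded('SHAZAM'))     // → false
```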

Using the Custom List

Use path.resolve or path.join to create an absolute path to your custom word list file:

const path = require('path')
const { wordList } = require('@neopass/wordlist')

const options = {
  paths: [
    // Use a path relative to the location of this module.
    path.resolve(__dirname, '../my-words.txt')
  ]
}

wordList(options)
  .then(list => console.log(list.length)) // 124030

SCOWL License

Copyright 2000-2016 by Kevin Atkinson

Permission to use, copy, modify, distribute and sell these word
lists, the associated scripts, the output created from the scripts,
and its documentation for any purpose is hereby granted without fee,
provided that the above copyright notice appears in all copies and
that both that copyright notice and this permission notice appear in
supporting documentation. Kevin Atkinson makes no representations
about the suitability of this array for any purpose. It is provided
"as is" without express or implied warranty.

Full License | SCOWL


License

MIT
