dbay-cmudict

0.3.0 • Public • Published

𓆤DBay 𓅗CMUdict

The 𓆤DBay 𓅗CMUdict takes the ARPAbet phonetic notations for the 126,689 (non-genitive) entries of The CMU Pronouncing Dictionary (CMUdict) and rewrites them in a number of ways:

  • Whereas the CMUdict originally used upper-case ARPAbet notation, we convert it to lower case and correct a few details.
  • From the rewritten ARPAbet, we derive a notation using the International Phonetic Alphabet (IPA), which is much more common and more readable. The usual IPA stress marks are placed before the stressed syllable, but the CMUdict dataset does not contain any indicators for syllabification. We therefore indicate stress by underlining the stressed vowels themselves, with a double line for primary stress and a single line for secondary stress.
  • By substituting IPA symbols with those of the X-SAMPA transliteration scheme, we get a notation that should be easier to type on most keyboards.
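The three rewrites listed above can be illustrated with a minimal sketch. It is not the package's actual implementation: the function name `transliterate`, the field names `spaced` and `compact`, and the use of the combining characters U+0333 (double low line) and U+0332 (low line) for the stress underlines are assumptions made for this example. The phone table covers only the phones that appear in the sample rows further down in this README.

```javascript
// Hypothetical phone table, restricted to phones shown in this README's examples.
const PHONES = new Map([
  ['aa', { ipa: 'ɑ', xsampa: 'A' }],
  ['ah', { ipa: 'ʌ', xsampa: 'V' }],
  ['er', { ipa: 'ɝ', xsampa: '3`' }],
  ['ih', { ipa: 'ɪ', xsampa: 'I' }],
  ['uh', { ipa: 'ʊ', xsampa: 'U' }],
  ['k', { ipa: 'k', xsampa: 'k' }],
  ['r', { ipa: 'ɹ', xsampa: 'r\\' }],
  ['s', { ipa: 's', xsampa: 's' }],
  ['t', { ipa: 't', xsampa: 't' }],
]);

// Assumption: primary stress (1) is a combining double low line,
// secondary stress (2) a combining low line; unstressed (0) gets no mark.
const STRESS_MARKS = { 1: '\u0333', 2: '\u0332' };

function transliterate(cmudictArpabet) {
  // Step 1: lower-case the original upper-case ARPAbet.
  const spaced = cmudictArpabet.trim().toLowerCase();
  let ipa = '';
  let xsampa = '';
  for (const token of spaced.split(/\s+/)) {
    // A token is a phone name, optionally followed by a stress digit.
    const match = /^([a-z]+)([012])?$/.exec(token);
    const phone = match && PHONES.get(match[1]);
    if (!phone) throw new Error(`unknown phone: ${token}`);
    // Step 2: IPA symbol, with the stress underline attached to the vowel.
    ipa += phone.ipa + (STRESS_MARKS[match[2]] || '');
    // Step 3: X-SAMPA symbol (stress is not marked in this sketch).
    xsampa += phone.xsampa;
  }
  // `compact` is the lower-case ARPAbet with spaces removed.
  return { spaced, compact: spaced.replace(/\s+/g, ''), ipa, xsampa };
}
```

For example, `transliterate('AA1 R K T IH0 K')` returns the compact form `aa1rktih0k` and the X-SAMPA form `Ar\ktIk`, together with an IPA string in which the first vowel carries the double underline.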

Data Sources

To Do

  • [–] add table with phone occurrences
  • [+] make entries lower case?
  • [+] add column arpabet with spaces removed
  • [+] add X-SAMPA
  • [+] remove ambiguity (using stress marks?):
    • @` → ɚ, r-coloured schwa, as in American English color ["kVl@`]

    • 3` → ɝ, rhotic open-mid central unrounded vowel, as in English [n3`s] (Gen.Am.)

    • however, looking at the treatment of rhotic sounds in arcturus (aa2 r k t er1 ah0 s, ɑɹktɝʌs), it would seem that the special symbol ɝ is not warranted: the first vowel in AmE arctic /ɑɹktɪk/ is very much a rhotic vowel, yet it is written with two consecutive symbols. So why would you write, say, urge as /ɝdʒ/ with a single symbol instead of as /ɜrdʒ/?

      arctic      │ aa1 r k t ih0 k         │ aa1rktih0k       │ Ar\ktIk    │ ɑɹktɪk
      arcturus    │ aa2 r k t uh1 r ah0 s   │ aa2rktuh1rah0s   │ Ar\ktUr\Vs │ ɑɹktʊɹʌs
      arcturus(1) │ aa2 r k t er1 ah0 s     │ aa2rkter1ah0s    │ Ar\kt3`Vs  │ ɑɹktɝʌs
      
    • therefore, rewrite arpabet_s er(\d) as ah$1 r

  • [–] list all changes made to the original notation.
  • [–] apply transliteration to IPA first, keeping spaces and digits, then do replacements using IPA (should be much clearer)
  • [–] keep all transliterations in single table trlits so adding new schemes can be done w/out migration.
  • [–] keep transliterations with vs transliterations without stress marking in two separate tables? Or better use a flag field.
  • [–] remove / translate (into a field value) counter that indicates variants.
  • [–] replace remaining underscores with spaces
  • [–] recognize acronyms and remove spaces, correct case, as in i_p_a -> IPA, d_c -> DC &c.
  • [+] ensure that running tests does not affect contents of cmudict.sqlite
  • [+] rename cfg.create to cfg.rebuild
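
The `er` rewrite proposed in the list above (`er(\d)` becomes `ah$1 r`) could look roughly like the following sketch. The function name `rewriteEr` is hypothetical, and the sketch assumes the rule is applied to the space-separated, lower-case ARPAbet, since the replacement itself introduces a space. Word boundaries are added so that only a whole `er` phone is matched.

```javascript
// Replace the single rhotic vowel phone `er` plus its stress digit
// with the two-phone sequence `ah` (carrying the same digit) followed by `r`.
function rewriteEr(spacedArpabet) {
  return spacedArpabet.replace(/\ber(\d)\b/g, 'ah$1 r');
}
```

Applied to the `arcturus(1)` entry shown above, `rewriteEr('aa2 r k t er1 ah0 s')` gives `aa2 r k t ah1 r ah0 s`. Entries without an `er` phone are returned unchanged.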

Package Sidebar

Install

npm i dbay-cmudict

Version

0.3.0

License

MIT

Collaborators

  • loveencounterflow