Speech Markdown grammar, parser, and formatters for use with JavaScript.
Supported platforms:
- microsoft-azure
Partial / no support:
- amazon-alexa
- amazon-polly
- amazon-polly-neural
- google-assistant
- samsung-bixby
import { SpeechMarkdown } from '@davi-ai/speechmarkdown-davi-js'
const options = {
  platform: 'microsoft-azure',
  includeSpeakTag: false,
  globalVoiceAndLang: {
    voice: 'en-US-JennyMultilingualNeural',
    lang: 'fr-FR'
  }
}
const speechMarkdownParser = new SpeechMarkdown(options)
Several options are available; the most useful ones are :
- platform : 'microsoft-azure' to generate SSML for Azure neural voices
- includeSpeakTag : whether to add a <speak> tag at the beginning and a </speak> tag at the end
- globalVoiceAndLang : { voice?: string, lang?: string } : added for Microsoft voices and the retorik-framework architecture. To use a specific voice as the main voice, put it in the 'voice' field, formatted language-CULTURE-VoiceName (ex: en-US-GuyNeural, en-US-JennyNeural). When using a multilingual voice (ex: JennyMultilingualNeural), if the text has to be spoken in a language other than the voice's own, add the 'lang' field with the desired language, formatted language-CULTURE (ex: fr-FR, en-US, de-DE, ...)
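The format rules for 'voice' and 'lang' can be checked before the options object is handed to the parser. The block below is a minimal sketch, not part of the library: `buildAzureOptions` and both regular expressions are hypothetical, and the patterns only approximate the language-CULTURE and language-CULTURE-VoiceName formats (some Azure locales and voice names may not match them).

```javascript
// Hypothetical helper (not part of the library): builds the options object
// described above and rejects values that do not match the documented formats.
// Both patterns are assumptions that approximate the naming conventions.
const LANG_PATTERN = /^[a-z]{2,3}-[A-Z]{2}$/ // e.g. fr-FR, en-US
const VOICE_PATTERN = /^[a-z]{2,3}-[A-Z]{2}-[A-Za-z]+$/ // e.g. en-US-GuyNeural

function buildAzureOptions(voice, lang) {
  const globalVoiceAndLang = {}
  if (voice !== undefined) {
    if (!VOICE_PATTERN.test(voice)) {
      throw new Error(`Unexpected voice format: ${voice}`)
    }
    globalVoiceAndLang.voice = voice
  }
  if (lang !== undefined) {
    if (!LANG_PATTERN.test(lang)) {
      throw new Error(`Unexpected lang format: ${lang}`)
    }
    globalVoiceAndLang.lang = lang
  }
  return { platform: 'microsoft-azure', includeSpeakTag: false, globalVoiceAndLang }
}

// A multilingual voice that should speak French.
const exampleOptions = buildAzureOptions('en-US-JennyMultilingualNeural', 'fr-FR')
console.log(exampleOptions)
```

Both fields stay optional in this sketch, mirroring the `voice?` / `lang?` signature: an omitted field is simply left out of `globalVoiceAndLang`.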
With these parameters, you will receive a complete SSML string, except for the <speak> tag, which has to be added manually around it. We don't use includeSpeakTag = true
because it only adds a bare <speak> tag, whereas Microsoft voices need a complete tag as follows :
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xml:lang="fr-FR">
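Adding the tag by hand can be done with a small wrapper. This is only a sketch under stated assumptions: `wrapWithSpeakTag` is a hypothetical helper, not something the library provides, and the SSML body passed to it here is a hand-written placeholder standing in for the string produced by the parser.

```javascript
// Hypothetical helper (not part of the library): wraps an SSML body in the
// complete <speak> tag shown above, with the language passed as a parameter.
function wrapWithSpeakTag(ssmlBody, lang) {
  const attributes = [
    'version="1.0"',
    'xmlns="http://www.w3.org/2001/10/synthesis"',
    'xmlns:mstts="https://www.w3.org/2001/mstts"',
    'xmlns:emo="http://www.w3.org/2009/10/emotionml"',
    `xml:lang="${lang}"`
  ].join(' ')
  return `<speak ${attributes}>${ssmlBody}</speak>`
}

// Placeholder body so the sketch runs on its own; in practice this would be
// the SSML string generated with includeSpeakTag set to false.
const placeholderBody = '<voice name="fr-FR-DeniseNeural">Bonjour</voice>'
const fullSsml = wrapWithSpeakTag(placeholderBody, 'fr-FR')
console.log(fullSsml)
```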
There are many different tags and most of them have restrictions. For the current documentation, see docs.microsoft.com
As of 2023/07/28, the available tags are :
- voice :
- (text to be read with that voice)[voice:"voice name"]
- the text can contain other tags except 'voice'
- the voice name can be as follows :
- language-CULTURE-VoiceName (ex: en-US-GuyNeural, en-US-JennyNeural)
- full Microsoft name (ex: Microsoft Server Speech Text to Speech Voice (en-US, JennyMultilingualNeural))
- example :
(Bonjour, comment ça va ?)[voice:"fr-FR-DeniseNeural"]
- lang :
- (text to be read in this language)[lang:"language name"]
- the text can contain other tags except 'voice' and 'lang'
- the lang name must be formatted as language-CULTURE (ex: fr-FR, en-US)
- example :
(Bonjour, comment ça va ?)[lang:"en-US"]
- break :
- [break time in seconds / milliseconds] or [break:"strength value"]
- strength values :
- none
- x-weak
- weak
- medium
- strong
- x-strong
- example :
[break:"strong"] / [1s] / [250ms]
- silence :
- [silence:"type value"]
- type and value are required
- type can be :
- Leading : beginning of text
- Tailing : end of text
- SentenceBoundary : between adjacent sentences
- value is an integer giving time in seconds or milliseconds, lower than 5000ms
- example :
[silence:"Leading 1s"]
- prosody :
- (text for which the prosody will be adjusted)[pitch:"value";contour:"value";range:"value";rate:"value";volume:"value"]
- you can use any of the modifiers below, from one to all of them
- modifiers :
- pitch
- contour
- range
- rate
- volume
- example :
(this will be spoken slow and high)[rate:"slow";pitch:"high"]
- emphasis :
- (text)[emphasis:"value"] or ++text will be strong++
- value can be / corresponding symbols around text :
- reduced / -text reduced-
- none / ~text without change~
- moderate / +text stronger+
- strong / ++text much stronger++
- example :
(bonjour)[emphasis:"moderate"] / +bonjour+
- say-as :
- (text to be said as)[modifier]
- modifier can be :
- address
- number
- characters
- fraction
- ordinal
- telephone
- time
- date
- example :
I need this answer (ASAP)[characters] / My phone number is (0386300000)[telephone]
- ipa :
- the International Phonetic Alphabet (ipa) allows you to force the pronunciation of a word / sentence
- example :
I love (paintball)[ipa:"peɪntbɔːl"]
- emotions :
- [emotion:"style role/styledegree"]
- the style is mandatory, and depends on the voice speaking at that time (ex: fr-FR-DeniseNeural can only use 'sad' and 'cheerful' while ja-JP-NanamiNeural can use 'chat', 'cheerful' and 'customerservice')
- role and styledegree are optional. Role is a string, while styledegree is a number. Note that 'role' is restricted to very few voices
- example :
(It's so cool ! We are going to a great park today !)[voice:"en-US-JennyNeural";emotion:"excited 2"]
- audio :
- !["src"]
- example : !["https://cdn.retorik.ai/retorik-framework/audiofiles/audiotest.mp3"]
- backgroundaudio :
- [backgroundaudio:"src volume fadein fadeout"]
- src is mandatory; the other fields are optional, but all fields on the left must be provided before using one on the right (ex: to use fadein, you must have provided a value for src and volume)
- only one backgroundaudio tag is possible
- example :
[backgroundaudio:"https://cdn.retorik.ai/retorik-framework/audiofiles/audiotest.mp3 0.5 2000 1500"]
- lexicon :
- [lexicon:"url to the lexicon xml file"]
- the lexicon file is restricted to one language (en-US, fr-FR, ...) so it won't be used if the voice uses another language
- it does nothing when using a multilingual voice (ex: JennyMultilingualNeural), even if the lang tag of this voice is the same as the one in the lexicon file
- the lexicon inputs are case-sensitive, for example 'hello' and 'Hello' must be treated separately
- example :
[lexicon:"https://cdn.retorik.ai/retorik-framework/lexicon-en-US.xml"] Hi everybody ! BTW how are you today ?
- bookmark :
- [bookmark:"bookmark text"]
- example :
Bookmark after city name : first Paris [bookmark:"city1"], then Berlin [bookmark:"city2"]
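Several of the tags above have constraints that are easy to get wrong when the markup is assembled by hand: break takes either a strength or a duration, silence needs a known type and a bounded duration, prosody accepts any subset of its modifiers, and backgroundaudio fields must be filled from left to right. The sketch below only builds Speech Markdown strings following the example syntax shown above. All function and constant names are hypothetical and not part of the library, and treating 5000 ms as an inclusive upper bound for silence is an assumption.

```javascript
// Hypothetical helpers (not part of the library): they only assemble
// Speech Markdown strings following the rules listed above.
const BREAK_STRENGTHS = ['none', 'x-weak', 'weak', 'medium', 'strong', 'x-strong']
const SILENCE_TYPES = ['Leading', 'Tailing', 'SentenceBoundary']
const PROSODY_MODIFIERS = ['pitch', 'contour', 'range', 'rate', 'volume']

// Converts '1s' or '250ms' to milliseconds, or returns null if malformed.
function toMilliseconds(duration) {
  const match = /^(\d+)(ms|s)$/.exec(duration)
  if (!match) return null
  return match[2] === 's' ? Number(match[1]) * 1000 : Number(match[1])
}

// [break:"strong"] for a strength, [1s] or [250ms] for a duration.
function breakTag(value) {
  if (BREAK_STRENGTHS.includes(value)) return `[break:"${value}"]`
  if (toMilliseconds(value) !== null) return `[${value}]`
  throw new Error(`Invalid break value: ${value}`)
}

// [silence:"Leading 1s"]; type and value are both required.
function silenceTag(type, duration) {
  if (!SILENCE_TYPES.includes(type)) throw new Error(`Invalid silence type: ${type}`)
  const ms = toMilliseconds(duration)
  // Assumption: the 5000 ms limit is treated as inclusive.
  if (ms === null || ms > 5000) throw new Error(`Invalid silence value: ${duration}`)
  return `[silence:"${type} ${duration}"]`
}

// (text)[rate:"slow";pitch:"high"]; any non-empty subset of the modifiers.
function prosody(text, modifiers) {
  const entries = Object.entries(modifiers)
  if (entries.length === 0) throw new Error('At least one modifier is required')
  for (const [name] of entries) {
    if (!PROSODY_MODIFIERS.includes(name)) throw new Error(`Unknown modifier: ${name}`)
  }
  const body = entries.map(([name, value]) => `${name}:"${value}"`).join(';')
  return `(${text})[${body}]`
}

// [backgroundaudio:"src volume fadein fadeout"]; a field can only be given
// when every field to its left is also given.
function backgroundAudioTag(src, volume, fadein, fadeout) {
  if (!src) throw new Error('src is required')
  const optional = [volume, fadein, fadeout]
  const firstMissing = optional.findIndex((v) => v === undefined)
  const provided = firstMissing === -1 ? optional : optional.slice(0, firstMissing)
  if (optional.slice(provided.length).some((v) => v !== undefined)) {
    throw new Error('Fields must be provided from left to right')
  }
  return `[backgroundaudio:"${[src, ...provided].join(' ')}"]`
}

console.log(breakTag('strong')) // → [break:"strong"]
console.log(breakTag('250ms')) // → [250ms]
console.log(silenceTag('Leading', '1s')) // → [silence:"Leading 1s"]
console.log(prosody('this will be spoken slow and high', { rate: 'slow', pitch: 'high' }))
```

Keeping the checks in plain string builders means the markup can be validated before it ever reaches the parser, at the cost of duplicating rules that may change on the Microsoft side.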
Licensed under the MIT License. See the LICENSE file for details.