Provides a convenient abstraction layer over the Microsoft Cognitive Services Speech SDK, simplifying the integration of speech-to-text (STT) and text-to-speech (TTS) functionality into client/browser applications. With this package, developers can quickly add basic STT and TTS capabilities to their applications without writing intricate SDK code.
- Perform a single (one-shot) speech recognition operation.
- Enable continuous speech recognition for real-time applications.
- Recognize speech in multiple languages (multilingual recognition).
- Synthesize speech from text.
- Accept plain text or SSML input for TTS.
Using npm:
npm install azure-speech-utilities
Creates a new speech recognizer instance.
Parameter | Type | Default Value | Description |
---|---|---|---|
cogSvcSubKey | string | "" | The Cognitive Services subscription key for Speech Services. (Required) |
cogSvcRegion | string | "" | The region of the Cognitive Services subscription. (Required) |
recognitionLang | string[] | ["en-US"] | An array of language codes to recognize. (Optional) |
Used for single-shot recognition, which recognizes a single utterance. The end of a single utterance is determined by listening for silence at the end or until a maximum of 15 seconds of audio is processed.
Parameter | Type | Default Value | Description |
---|---|---|---|
recognizer | sdk.SpeechRecognizer \| undefined | undefined | The speech recognizer instance to use. |
The previous function performs single-shot recognition, which recognizes a single utterance. In contrast, continuous recognition gives you a real-time stream of recognized text. Call `StopContinuousRecognitionAsync()` at some point to stop recognition.
Parameter | Type | Default Value | Description |
---|---|---|---|
recognizer | sdk.SpeechRecognizer \| undefined | undefined | The speech recognizer instance to use. |
callbackRecognized | (value: string) => void | (value) => console.log(value) | A callback function called with each finalized recognized text. |
callbackRecognizing | (value: string) => void | (value) => console.log(value) | A callback function called with interim results while speech is being recognized. |
Stops ongoing continuous speech recognition.
Parameter | Type | Default Value | Description |
---|---|---|---|
recognizer | sdk.SpeechRecognizer \| undefined | undefined | The speech recognizer instance to use. |
Note: Pass the same recognizer instance that you use for `ContinuousRecognitionAsync()` as the argument to this function.
Creates a new speech synthesizer instance.
Parameter | Type | Default Value | Description |
---|---|---|---|
cogSvcSubKey | string | "" | The Cognitive Services subscription key for Speech Services. (Required) |
cogSvcRegion | string | "" | The region of the Cognitive Services subscription. (Required) |
synthesisLang | string | "" | The language code for the speech synthesizer. (Required) |
synthesisVoiceName | string | "" | The name of the voice to use for speech synthesis. (Optional) |
createAudioConfig | boolean | false | Whether to create an audio config for speech output. (Optional) |
Note: The voice that speaks is determined in order of priority as follows (see the sketch after this list):
- Passing `false` for `createAudioConfig` means the audio does not play by default on the currently active output device.
- If you only set `synthesisLang`, the default voice for the specified locale speaks.
- If both `synthesisVoiceName` and `synthesisLang` are set, the `synthesisLang` setting is ignored. The voice that you specify by using `synthesisVoiceName` speaks.
- If the voice element is set by using Speech Synthesis Markup Language (SSML), the `synthesisVoiceName` and `synthesisLang` settings are ignored.
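For example, a minimal sketch of the second and third rules, assuming the optional `synthesisVoiceName` parameter can simply be omitted as the table above suggests (the key, region, and locale values are placeholders):

```js
import { CreateSynthesizer } from "azure-speech-utilities"

const CGV_KEY = "AZURE_SPEECH_SERVICE_KEY"
const CGV_REGION = "AZURE_SPEECH_SERVICE_REGION"

// Only synthesisLang is set: the default voice for the fr-FR locale speaks.
const defaultVoice = CreateSynthesizer(CGV_KEY, CGV_REGION, "fr-FR")

// Both are set: "en-US-JennyNeural" speaks and synthesisLang ("fr-FR") is ignored.
const namedVoice = CreateSynthesizer(CGV_KEY, CGV_REGION, "fr-FR", "en-US-JennyNeural")
```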
Performs speech synthesis and returns the result (the synthesized audio) in the form of an ArrayBuffer.
Parameter | Type | Default Value | Description |
---|---|---|---|
synthesizer | sdk.SpeechSynthesizer \| undefined | undefined | The speech synthesizer instance to use. |
inputString | string | "I'm excited to try text to speech" | The text to be synthesized. |
inputType | string | "text" | The format of the input string: "text" or "ssml". (Optional) |
callback | (result: sdk.SynthesisResult, error?: Error) => void | (result, error) => {} | A callback function called with the synthesis result or an error. |
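A minimal sketch of an SSML call, assuming the parameters behave as documented above (key, region, and voice values are placeholders). Note how the voice element in the SSML takes priority over `synthesisVoiceName` and `synthesisLang`:

```js
import { CreateSynthesizer, SpeakAsync } from "azure-speech-utilities"

const synthesizer = CreateSynthesizer("AZURE_SPEECH_SERVICE_KEY", "AZURE_SPEECH_SERVICE_REGION", "en-US", "en-US-JennyNeural", false)

// The voice element below overrides synthesisVoiceName and synthesisLang.
const ssml = `
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-GuyNeural">Hello from SSML.</voice>
</speak>`

SpeakAsync(synthesizer, ssml, "ssml", (result, error) => {
  if (error) console.error(error)
  else console.log(result) // result.audioData holds the synthesized audio (ArrayBuffer)
})
```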
Recognize Once
import { CreateRecognizer, RecognizeOnceAsync } from "azure-speech-utilities"
const CGV_KEY = "AZURE_SPEECH_SERVICE_KEY"
const CGV_REGION = "AZURE_SPEECH_SERVICE_REGION"
async function recognizeSpeech() {
const recognizer = CreateRecognizer(CGV_KEY, CGV_REGION, ["hi-IN"])
try {
const recognizedText = await RecognizeOnceAsync(recognizer)
if (recognizedText.type === "text") {
console.log(recognizedText.message)
} else {
// Not a text result; the message describes why recognition did not succeed.
console.error(recognizedText.message)
}
} catch (error) {
console.error(error)
}
}
Continuous Recognition
import { CreateRecognizer, ContinuousRecognitionAsync, StopContinuousRecognitionAsync } from "azure-speech-utilities"
const CGV_KEY = "AZURE_SPEECH_SERVICE_KEY"
const CGV_REGION = "AZURE_SPEECH_SERVICE_REGION"
// With two or more recognition languages ("hi-IN" and "en-US" here), recognition is multilingual.
const recognizer = CreateRecognizer(CGV_KEY, CGV_REGION, ["hi-IN", "en-US"])
function callbackRecognized(text) {
console.log("RECOGNIZED: ", text)
}
function callbackRecognizing(text) {
console.log("RECOGNIZING: ", text)
}
async function recognizeSpeech() {
try {
const response = await ContinuousRecognitionAsync(recognizer, callbackRecognized, callbackRecognizing)
if (response.type === "success") {
console.log(response.message)
} else {
console.error(response.message)
}
} catch (error) {
console.error(error)
}
}
function stopContinuousRecognition() {
StopContinuousRecognitionAsync(recognizer)
}
Speak Async
import { CreateSynthesizer, SpeakAsync } from "azure-speech-utilities"
const CGV_KEY = "AZURE_SPEECH_SERVICE_KEY"
const CGV_REGION = "AZURE_SPEECH_SERVICE_REGION"
const SYNTHESIS_LANGUAGE = "en-US"
const SYNTHESIS_VOICE_NAME = "en-US-JennyNeural"
function handleSpeak() {
// By default, the input type is 'text'. If you change the input type to 'ssml', the input string should be in the following SSML format.
// const ssml = `
// <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="${SYNTHESIS_LANGUAGE}">
// <voice name="${SYNTHESIS_VOICE_NAME}">
// When you're on the freeway, it's a good idea to use a GPS.
// </voice>
// </speak>
// `
const text = "When you're on the freeway, it's a good idea to use a GPS."
// Note that 'createAudioConfig' is set to false, meaning audio will not play by default on the currently active output device.
const synthesizer = CreateSynthesizer(CGV_KEY, CGV_REGION, SYNTHESIS_LANGUAGE, SYNTHESIS_VOICE_NAME, false)
SpeakAsync(synthesizer, text, "text", (result, error) => {
if (error) {
console.error(error)
} else {
console.log(result)
const audioBlob = new Blob([result.audioData], { type: "audio/wav" })
// You can use this URL as an audio source, which allows easy user control such as starting, stopping, resetting, etc.
console.log(URL.createObjectURL(audioBlob))
}
})
}
const stopSpeaking = () => {
// audioRef is assumed to be a reference (e.g., a React ref) to an <audio> element whose src is the object URL created in handleSpeak; see the sketch below.
audioRef.current.pause()
}
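For completeness, one way to wire the object URL from the `SpeakAsync` callback to an audio element is sketched below; the variable and function names here are illustrative, not part of the package:

```js
// Illustrative wiring; not part of azure-speech-utilities.
let audio // keep a reference so playback can be controlled later

function playSynthesizedAudio(audioData) {
  // audioData is result.audioData from the SpeakAsync callback.
  const audioBlob = new Blob([audioData], { type: "audio/wav" })
  audio = new Audio(URL.createObjectURL(audioBlob))
  audio.play()
}

function pauseSynthesizedAudio() {
  if (audio) audio.pause()
}
```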
Note: If you do not wish to play audio through an audio source, you can set `createAudioConfig` to `true`. The audio will then play on the currently active output device by default. However, this method does not give the user the ability to reset, play, or pause the audio.
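A minimal sketch of that direct-playback path (key, region, and voice values are placeholders):

```js
import { CreateSynthesizer, SpeakAsync } from "azure-speech-utilities"

// createAudioConfig = true: audio plays directly on the default output device.
const synthesizer = CreateSynthesizer("AZURE_SPEECH_SERVICE_KEY", "AZURE_SPEECH_SERVICE_REGION", "en-US", "en-US-JennyNeural", true)

SpeakAsync(synthesizer, "Playing directly on the default output device.", "text", (result, error) => {
  if (error) console.error(error)
  else console.log(result) // no audio element is needed; playback already happened
})
```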
This project welcomes contributions and suggestions.