Technical Details

What’s This About

This page gives a quick technical overview of synSinger and explains some implementation details.

The inspiration for synSinger is Software Automatic Mouth, a program originally published in 1982 by Don’t Ask Software (now SoftVoice, Inc.).

The book From Text to Speech: The MITalk System (Allen, Hunnicutt and Klatt) proved helpful in explaining how formant synthesis works.

A Broad Overview

synSinger reads in a MusicXML file, and generates a .wav file as output. It does this by applying the following steps:

  • Extract information from the MusicXML file.
  • Create a list of notes, with the associated lyrics.
  • Convert the lyrics into phonemes.
  • Calculate the timing of the phonemes.
  • Convert phonemes into audio information.
  • Create a .wav file as output.
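As a rough illustration of the first two steps, here is a minimal sketch that pulls notes and their lyrics out of a MusicXML fragment with Python's standard xml.etree.ElementTree module. The sample fragment, the function name extract_notes and the tuple layout are made up for this example; this is not synSinger's actual code.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written MusicXML fragment, used only for illustration.
SAMPLE_XML = """<score-partwise version="3.1">
  <part id="P1">
    <measure number="1">
      <attributes><divisions>1</divisions></attributes>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>1</duration>
        <lyric><text>Hel</text></lyric>
      </note>
      <note>
        <pitch><step>D</step><alter>1</alter><octave>4</octave></pitch>
        <duration>2</duration>
        <lyric><text>lo</text></lyric>
      </note>
      <note>
        <rest/>
        <duration>1</duration>
      </note>
    </measure>
  </part>
</score-partwise>"""


def extract_notes(xml_text):
    """Return a list of (step, alter, octave, duration, lyric) tuples.

    Rests are returned with step and octave set to None.
    """
    root = ET.fromstring(xml_text)
    notes = []
    for note in root.iter("note"):
        duration = int(note.findtext("duration", default="0"))
        pitch = note.find("pitch")
        if pitch is None:
            # No <pitch> element: treat the note as a rest.
            notes.append((None, 0, None, duration, None))
            continue
        step = pitch.findtext("step")
        alter = int(pitch.findtext("alter", default="0"))
        octave = int(pitch.findtext("octave"))
        lyric = note.findtext("lyric/text")  # None when the note has no lyric
        notes.append((step, alter, octave, duration, lyric))
    return notes


notes = extract_notes(SAMPLE_XML)
```

Durations here are in MusicXML "divisions", so turning them into seconds still needs the tempo and the divisions value from the score.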

synSinger uses a local copy of the CMU Pronouncing Dictionary to look up words and convert them to phonemes.

For the few times that a word isn’t found in the CMU Dictionary, it falls back on a public-domain Reciter program to guess what the word should sound like. Reciter has been augmented to handle hyphenation and stress markers.
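The lookup-then-fallback idea can be sketched like this. The sketch assumes the classic plain-text dictionary layout (an uppercase word, whitespace, then the phonemes, with ";;;" comment lines and "(1)" suffixes for alternate pronunciations). The function names and the fallback hook are hypothetical; the Reciter rules themselves are not shown.

```python
# A few lines in the assumed dictionary layout, for illustration only.
SAMPLE_DICT = """;;; comment lines start with three semicolons
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""


def parse_cmudict(text):
    """Build a dict mapping each word to a list of pronunciations."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        head, *phonemes = line.split()
        # "HELLO(1)" is an alternate pronunciation of "HELLO".
        word = head.split("(")[0].upper()
        entries.setdefault(word, []).append(phonemes)
    return entries


def to_phonemes(word, entries, fallback=None):
    """Return the first listed pronunciation, or ask the fallback.

    `fallback` stands in for a letter-to-sound guesser such as Reciter;
    it is a caller-supplied function, not part of this sketch.
    """
    variants = entries.get(word.upper())
    if variants:
        return variants[0]
    if fallback is not None:
        return fallback(word)
    return None


entries = parse_cmudict(SAMPLE_DICT)
hello = to_phonemes("hello", entries)
```

The digits on the vowels are the dictionary's stress markers (0 for unstressed, 1 for primary stress).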

Converting Phonemes into Audio

synSinger uses a technique called formant synthesis to synthesize an artificial voice. Phonemes are turned into sound by simulating the human vocal tract electronically.

Voiced sound begins with air passing through the glottal folds, causing them to vibrate. Changing the length of the folds’ opening controls the pitch, called the fundamental frequency.
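In a digital synthesizer, the simplest stand-in for this vibration is a train of pulses spaced one pitch period apart. The sketch below assumes unit impulses and a 22050 Hz sample rate; a real glottal pulse has a smoother shape, and the function name is made up for this example.

```python
def impulse_train(f0_hz, duration_s, sample_rate=22050):
    """Return a list of samples with one unit pulse per pitch period."""
    n_samples = int(duration_s * sample_rate)
    period = sample_rate / f0_hz  # samples between pulses (may be fractional)
    samples = [0.0] * n_samples
    position = 0.0
    while position < n_samples:
        samples[int(position)] = 1.0
        position += period
    return samples


# Half a second of pulses at 100 Hz, sampled at 8000 Hz.
pulses = impulse_train(100, 0.5, sample_rate=8000)
```

Raising f0_hz shortens the gap between pulses, which is heard as a higher pitch.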

This harmonically rich glottal pulse passes through the mouth, where the tongue creates resonating chambers that reinforce specific frequencies in the pulse. These frequencies – called “resonances” or “formants” – are what distinguish phonemes.

For example, for an average English speaker, the first three resonances of the /IY/ sound (as in “bEEt”) are at 270 Hz, 2300 Hz and 3000 Hz.

In contrast, the first three resonances for the /AE/ sound (as in “bAt”) are at 660 Hz, 1700 Hz and 2400 Hz.

synSinger simulates this by passing pulses through a series of filters set to the desired formant frequencies. By changing the frequencies of the filters, different vowel sounds can be created.

Aspiration – the breathy sound in the /H/ – is created by passing white noise through the digital filters.
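The filtering described above can be sketched with a chain of two-pole digital resonators, one per formant, of the kind described in Klatt's work on formant synthesizers. The formant frequencies come from the examples above; the bandwidths, the sample rate and all names are assumptions made for this sketch, not synSinger's actual settings.

```python
import math
import random

SAMPLE_RATE = 10000  # Hz; assumed for this sketch


def resonator(samples, freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(samples, formants, sample_rate=SAMPLE_RATE):
    """Pass the signal through one resonator per (frequency, bandwidth)."""
    for freq_hz, bandwidth_hz in formants:
        samples = resonator(samples, freq_hz, bandwidth_hz, sample_rate)
    return samples


# Formant frequencies from the text; the bandwidths are assumed values.
IY = [(270, 60), (2300, 90), (3000, 150)]
AE = [(660, 60), (1700, 90), (2400, 150)]

# Voiced source: one unit pulse every 100 samples, i.e. a 100 Hz pitch.
pulses = [1.0 if n % 100 == 0 else 0.0 for n in range(SAMPLE_RATE // 2)]

# Aspiration source: uniform white noise (seeded so runs are repeatable).
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE // 2)]

vowel_iy = cascade(pulses, IY)    # pulses shaped like /IY/
vowel_ae = cascade(pulses, AE)    # same pulses, different vowel
breathy_iy = cascade(noise, IY)   # noise through the same filters
```

The source and the filters are independent: swapping the formant list changes the vowel, and swapping pulses for noise changes a voiced sound into a breathy one.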

The “burst” before plosives like /T/ or /K/ and the frication sound found in /F/ or /SH/ are created using digital samples.
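One simple way to combine a stored sample with synthesized audio is to add it into the output buffer at the right offset and then write the buffer out as 16-bit PCM. The sketch below uses Python's standard wave module; the function names, the mono 16-bit format and the made-up three-sample "burst" are all assumptions for illustration, not a description of how synSinger does it.

```python
import io
import struct
import wave


def mix_sample(buffer, sample, start):
    """Add a stored sample into buffer in place, starting at index start."""
    for i, value in enumerate(sample):
        if 0 <= start + i < len(buffer):
            buffer[start + i] += value
    return buffer


def to_wav_bytes(samples, sample_rate=22050):
    """Encode floats in [-1, 1] as 16-bit mono PCM and return WAV bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    frames = struct.pack("<%dh" % len(clipped),
                         *(int(s * 32767) for s in clipped))
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return out.getvalue()


# A silent buffer with a tiny made-up "burst" mixed in near the start.
audio = mix_sample([0.0] * 1000, [0.5, -0.5, 0.25], 10)
wav_data = to_wav_bytes(audio, sample_rate=8000)
```

Writing wav_data to a file opened in binary mode gives a playable .wav file.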

Tell Me More!

The classic texts are by Dennis Klatt. The first text provides a good general overview, while the second goes into considerably more detail:

 
