What’s This About
This page is intended to give a quick technical overview of synSinger, and explain some implementation details.
The book From Text to Speech: The MITalk System (Allen, Hunnicutt and Klatt) proved helpful in explaining how formant synthesis works.
A Broad Overview
synSinger reads in a MusicXML file, and generates a .wav file as output. It does this by applying the following steps:
- Extract information from the MusicXML file.
- Create a list of notes, with the associated lyrics.
- Convert the lyrics into phonemes.
- Calculate the timing of the phonemes.
- Convert phonemes into audio information.
- Create a .wav file as output.
synSinger uses a local copy of the CMU Pronouncing Dictionary to look up words and convert them to phonemes.
In the rare cases where a word isn’t found in the CMU Dictionary, synSinger falls back on a public-domain Reciter program to guess what the word should sound like. Reciter has been augmented to handle hyphenation and stress markers.
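The lookup-with-fallback idea can be sketched roughly as follows. This is a minimal illustration, not synSinger's actual code: the function names (parse_cmudict, to_phonemes, spell_out) are hypothetical, the sample lines only mimic the CMU dictionary's "WORD  PHONEME PHONEME ..." layout, and spell_out is a trivial stand-in for the Reciter rules rather than a real letter-to-sound system.

```python
from typing import Callable, Dict, Iterable, List


def parse_cmudict(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse CMU-dictionary-style lines such as 'HELLO  HH AH0 L OW1'."""
    entries: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and ';;;' comment lines.
        if not line or line.startswith(";;;"):
            continue
        word, *phonemes = line.split()
        # Alternate pronunciations look like 'HELLO(1)'; this sketch
        # keeps only the primary entry.
        if "(" in word:
            continue
        entries[word.upper()] = phonemes
    return entries


def to_phonemes(
    word: str,
    dictionary: Dict[str, List[str]],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """Return the dictionary pronunciation, or ask the fallback to guess."""
    key = word.upper()
    if key in dictionary:
        return dictionary[key]
    return fallback(word)


def spell_out(word: str) -> List[str]:
    """Hypothetical stand-in for Reciter: just returns the letters."""
    return [ch.upper() for ch in word if ch.isalpha()]


SAMPLE_LINES = [
    ";;; sample entries in CMU dictionary layout",
    "HELLO  HH AH0 L OW1",
    "HELLO(1)  HH EH0 L OW1",
    "WORLD  W ER1 L D",
]

cmu = parse_cmudict(SAMPLE_LINES)
known = to_phonemes("hello", cmu, spell_out)    # found in the dictionary
unknown = to_phonemes("xyzzy", cmu, spell_out)  # handled by the fallback
```

Passing the fallback in as a function keeps the dictionary lookup independent of whichever letter-to-sound rules are used for unknown words.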
Converting Phonemes into Audio
synSinger uses a technique called formant synthesis to synthesize an artificial voice. Phonemes are turned into sound by simulating the human vocal tract electronically.
Voiced sound begins with air passing through the glottal folds, causing them to vibrate. By modifying the length of the folds’ opening, the pitch, called the fundamental frequency, can be controlled.
This harmonically rich glottal pulse passes through the mouth, where the tongue creates resonating chambers that reinforce specific frequencies in the glottal pulse. These frequencies, called “resonances” or “formants”, are what distinguish phonemes.
For example, for an average English speaker, the first three resonances of the /IY/ sound (as in “bEEt”) are at 270 Hz, 2300 Hz and 3000 Hz.
In contrast, the first three resonances for the /AE/ sound (as in “bAt”) are at 660 Hz, 1700 Hz and 2400 Hz.
synSinger simulates this by passing pulses through a series of filters set to the desired formant frequencies. By changing the frequencies of the filters, different vowel sounds can be created.
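As a rough sketch of the idea, the code below passes an impulse train through three second-order digital resonators in series, using the resonator difference equation described in Klatt's work. It is not synSinger's actual implementation: the sample rate, the formant bandwidths, the 110 Hz pitch and all function names are assumptions chosen for illustration. Only the formant frequencies come from the text above.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz):
    """Run the samples through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz)
    y1 = y2 = 0.0  # the previous two output samples
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s):
    """A crude glottal source: one unit pulse per pitch period."""
    period = round(SAMPLE_RATE / f0_hz)
    count = int(duration_s * SAMPLE_RATE)
    return [1.0 if i % period == 0 else 0.0 for i in range(count)]


def synthesize_vowel(formants, f0_hz=110.0, duration_s=0.25):
    """Pass the pulse train through each resonator in series."""
    signal = impulse_train(f0_hz, duration_s)
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz)
    return signal


# (frequency, bandwidth) pairs in Hz; the bandwidths are illustrative guesses.
IY = [(270, 60), (2300, 90), (3000, 150)]
AE = [(660, 70), (1700, 100), (2400, 150)]

iy_samples = synthesize_vowel(IY)
ae_samples = synthesize_vowel(AE)
```

The two vowels share the same source signal and differ only in the resonator settings, which is the point the paragraph above makes. A more realistic source would use a shaped glottal pulse rather than single-sample impulses.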
Aspiration, the breathy sound in /H/, is created by passing white noise through the digital filters.
The “burst” before plosives like /T/ or /K/ and the frication sound found in /F/ or /SH/ are created by using digital samples.
Tell Me More!
The classic texts are by Dennis Klatt. The first text provides a good general overview, while the second goes into considerably more detail: