What’s This About
This page gives a quick technical overview of synSinger and explains some implementation details.
The book From Text to Speech: The MITalk System (Allen, Hunnicutt and Klatt) proved helpful in explaining how formant synthesis works, and it inspired the initial versions of synSinger.
A Broad Overview
synSinger reads in a MusicXML file and generates a .wav file as output. It does this by applying the following steps (a small sketch of the data flow follows the list):
- Extract information from the MusicXML file.
- Create a list of notes, with the associated lyrics.
- Convert the lyrics into phonemes.
- Calculate the timing of the phonemes.
- Convert phonemes into audio information.
- Create a .wav file as output.
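To make the data flow concrete, here is a minimal sketch of the first few steps. This is not synSinger's actual code: the Note class, the tiny lookup table, and the even timing split are placeholders chosen only to keep the example self-contained.

```python
# Placeholder sketch of the note -> phoneme -> timing flow; not synSinger's API.
from dataclasses import dataclass

@dataclass
class Note:
    pitch: float      # fundamental frequency in Hz
    duration: float   # length in seconds
    lyric: str        # syllable or word sung on this note

def lyrics_to_phonemes(lyric):
    # Stand-in for the CMU dictionary / Reciter lookup described below.
    tiny_dict = {"hel": ["HH", "EH1"], "lo": ["L", "OW1"]}
    return tiny_dict.get(lyric.lower(), [])

def assign_timing(note, phonemes):
    # Naive even split of the note's duration across its phonemes.
    per_phoneme = note.duration / max(len(phonemes), 1)
    return [(p, per_phoneme) for p in phonemes]

notes = [Note(440.0, 0.5, "Hel"), Note(493.9, 1.0, "lo")]
for note in notes:
    print(note.pitch, assign_timing(note, lyrics_to_phonemes(note.lyric)))
```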
synSinger uses a local copy of the CMU Pronouncing Dictionary to look up words and convert them to phonemes.
For the few times a word isn’t found in the CMU Dictionary, it falls back on a public-domain Reciter program to guess what the word should sound like. Reciter has been augmented to handle hyphenation and stress markers.
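As an illustration, the lookup-with-fallback step might look something like the sketch below. The file name, the encoding, and the guess_phonemes stub are assumptions; synSinger's actual fallback is Reciter's letter-to-sound rules, not a placeholder like this.

```python
# Minimal sketch of a CMU dictionary lookup with a fallback.
# Assumes a local copy of the dictionary in its standard plain-text form,
# e.g. lines like "HELLO  HH AH0 L OW1", with comment lines starting ";;;".
def load_cmudict(path="cmudict-0.7b"):           # file name is an assumption
    entries = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;"):
                continue                          # skip comment lines
            word, _, phones = line.strip().partition("  ")
            entries.setdefault(word, phones.split())
    return entries

def guess_phonemes(word):
    # Placeholder for Reciter's letter-to-sound rules.
    return ["?"] * len(word)

def word_to_phonemes(word, entries):
    # Prefer the dictionary entry; fall back to a guess when the word is missing.
    return entries.get(word.upper()) or guess_phonemes(word)
```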
Converting Phonemes into Audio
synSinger started out using a technique called formant synthesis to produce an artificial voice. However, it has recently been rewritten to use a technique called spectral modelling synthesis. This is still under heavy development, but in general (a rough sketch follows the list):
- A database of phoneme targets is created by performing harmonic analysis using a DFT (Discrete Fourier Transform) and storing the resulting spectral envelope.
- A spectral envelope is simply a function that, when passed the frequency of a harmonic, returns the phase and amplitude of that harmonic.
- Phonemes are reconstructed by performing an iDFT (inverse Discrete Fourier Transform).
- Speech is approximated by morphing from one phoneme to the next.
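To make the analysis/morph/resynthesis loop concrete, here is a rough numpy sketch. It is not synSinger's code: the fixed frame length, nearest-bin harmonic sampling, and plain linear morph of amplitudes and phases are simplifying assumptions made only for this example.

```python
import numpy as np

SR = 44100          # sample rate in Hz
N  = 4096           # analysis/synthesis frame length in samples

def analyze(frame, f0):
    """Return (amplitude, phase) at each harmonic of f0 via a DFT."""
    spectrum = np.fft.rfft(frame * np.hanning(N))
    harmonics = []
    for k in range(1, int((SR / 2) // f0) + 1):
        bin_ = int(round(k * f0 * N / SR))       # nearest DFT bin to harmonic k
        harmonics.append((np.abs(spectrum[bin_]), np.angle(spectrum[bin_])))
    return harmonics

def synthesize(harmonics, f0):
    """Rebuild one frame by filling DFT bins and taking the inverse DFT."""
    spectrum = np.zeros(N // 2 + 1, dtype=complex)
    for k, (amp, phase) in enumerate(harmonics, start=1):
        bin_ = int(round(k * f0 * N / SR))
        spectrum[bin_] = amp * np.exp(1j * phase)
    return np.fft.irfft(spectrum, N)

def morph(a, b, t):
    """Linearly interpolate two harmonic sets (0 <= t <= 1); a deliberate simplification."""
    return [((1 - t) * a0 + t * a1, (1 - t) * p0 + t * p1)
            for (a0, p0), (a1, p1) in zip(a, b)]

# Toy usage: analyze two synthetic "phoneme" frames and resynthesize a frame halfway between.
t = np.arange(N) / SR
frame_a = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
frame_b = np.sin(2 * np.pi * 220 * t) + 0.2 * np.sin(2 * np.pi * 660 * t)
halfway = synthesize(morph(analyze(frame_a, 220), analyze(frame_b, 220), 0.5), 220)
```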