I haven’t had a chance to get much coding done over the last couple months. However, I’ve been doing a lot of reading on various vocal synthesis technologies.
I’d read quite a bit about spectral modeling synthesis (SMS) before, and decided to put together some quick tests.
Essentially, SMS uses the short-time Fourier transform (computed with an FFT) to capture a “spectral envelope”: a curve giving the amplitude at each frequency, which in turn tells you how strong each harmonic of the voice should be.
Praat can supply this information. It has an option to get a “spectral slice” that lists the level (in decibels) at a series of frequencies:
    freq(Hz)         pow(dB/Hz)
    0                -17.171071546036153
    21.533203125     20.388397279697497
    43.06640625      26.97257113598713
    64.599609375     30.64585953310637
    86.1328125       32.25338096741935
    107.666015625    31.708646066090367
    129.19921875     29.056762741248782
    150.732421875    28.923892178490085
    172.265625       32.719805559695565
... and so on
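A minimal sketch of reading that listing into Lua, assuming the text has been saved with one "frequency decibels" pair per line under a header row. The function name `parseSlice` and the `freq`/`db` field names are made up for illustration; they are not part of Praat.

```lua
-- Parse a "freq(Hz) pow(dB/Hz)" listing into a list of {freq, db} points.
-- Assumption: one whitespace-separated pair per line; non-numeric rows
-- (such as the header) are skipped.
function parseSlice(text)
  local points = {}
  for line in text:gmatch("[^\r\n]+") do
    local f, d = line:match("^%s*(%S+)%s+(%S+)%s*$")
    local freq, db = tonumber(f), tonumber(d)
    if freq and db then
      points[#points + 1] = { freq = freq, db = db }
    end
  end
  -- Keep the points ordered by frequency so later lookups can scan them.
  table.sort(points, function(a, b) return a.freq < b.freq end)
  return points
end

local sample = [[
freq(Hz) pow(dB/Hz)
0 -17.171071546036153
21.533203125 20.388397279697497
43.06640625 26.97257113598713
]]
local slice = parseSlice(sample)
print(#slice, slice[2].freq, slice[2].db)
```

Sorting is only a safeguard here, since the listing is already in ascending frequency order.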
Converting the decibels to a linear value looks like this:
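    -- Convert a decibel level to a linear amplitude ratio (20 dB = 10x).
    -- The ^ operator is used because math.pow is not available in newer Lua releases.
    function dbToLinear( x )
        return 10 ^ (x / 20)
    end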
To convert the spectral envelope back into sound, the process is reversed. You can generate a set of sine waves at integer multiples of the fundamental frequency, and use the spectral envelope to look up the amplitude of each one.
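The sine-wave approach can be sketched roughly as follows. This assumes the envelope is a list of `{freq, db}` points sorted by frequency, and that values between points are found by linear interpolation in decibels. Both are choices made for this example, and `envelopeAt` and `synthesize` are hypothetical names.

```lua
local function dbToLinear(db) return 10 ^ (db / 20) end

-- Look up the envelope level (dB) at freq by linear interpolation.
-- Frequencies outside the envelope are clamped to the nearest end point.
function envelopeAt(points, freq)
  if freq <= points[1].freq then return points[1].db end
  for i = 2, #points do
    local a, b = points[i - 1], points[i]
    if freq <= b.freq then
      local t = (freq - a.freq) / (b.freq - a.freq)
      return a.db + t * (b.db - a.db)
    end
  end
  return points[#points].db
end

-- Sum every harmonic of f0 that lies below the Nyquist frequency.
function synthesize(points, f0, sampleRate, numSamples)
  local out = {}
  for n = 1, numSamples do out[n] = 0 end
  local h = 1
  while h * f0 < sampleRate / 2 do
    local amp = dbToLinear(envelopeAt(points, h * f0))
    local step = 2 * math.pi * h * f0 / sampleRate
    for n = 1, numSamples do
      out[n] = out[n] + amp * math.sin(step * (n - 1))
    end
    h = h + 1
  end
  return out
end
```

The output is not normalized, so it would need scaling before being written to a sound file.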
Or if you’re clever, you can do an inverse FFT and stitch the frames together.
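The stitching step might look something like the sketch below. It assumes each frame has already been produced by an inverse FFT (not shown), that all frames have the same length, and that a Hann window with a hop of half the frame length is used. Those choices and the name `overlapAdd` are assumptions for the example.

```lua
-- Periodic Hann window: copies shifted by half the frame length sum to 1.
local function hann(size)
  local w = {}
  for n = 0, size - 1 do
    w[n + 1] = 0.5 - 0.5 * math.cos(2 * math.pi * n / size)
  end
  return w
end

-- Window each frame and add it into the output at multiples of hop samples.
function overlapAdd(frames, hop)
  local size = #frames[1]
  local w = hann(size)
  local out = {}
  for i = 1, (#frames - 1) * hop + size do out[i] = 0 end
  for f, frame in ipairs(frames) do
    local offset = (f - 1) * hop
    for n = 1, size do
      out[offset + n] = out[offset + n] + w[n] * frame[n]
    end
  end
  return out
end
```

With a hop of half the frame length, the overlapping windows add up to a constant gain, so the frames crossfade smoothly instead of clicking at the boundaries.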
I’m exploring the idea of using a single spectral slice to represent each phoneme target, and morphing from one target to the next. For the morphs to work, key features need to be marked in each slice so they can be lined up with each other. Conveniently, these features correspond to the peaks at the formant frequencies, which Praat also calculates.
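One possible way to do such a morph, sketched under heavy assumptions: move each formant linearly from its position in the first slice to its position in the second, stretch the frequency axis piecewise-linearly so the peaks line up, and blend the two envelopes in decibels. This is just one technique that fits the description, not necessarily the one used in these tests. The names `warp`, `anchors`, and `morph` are hypothetical, and both slices must list the same number of formants.

```lua
-- Piecewise-linear map taking the anchor frequencies in `from` to those in `to`.
-- Both lists must be strictly increasing and of equal length.
function warp(x, from, to)
  for i = 2, #from do
    if x <= from[i] then
      local t = (x - from[i - 1]) / (from[i] - from[i - 1])
      return to[i - 1] + t * (to[i] - to[i - 1])
    end
  end
  return to[#to]
end

-- Anchor list: 0 Hz, each formant frequency, then the top of the range.
function anchors(formants, maxFreq)
  local a = { 0 }
  for _, f in ipairs(formants) do a[#a + 1] = f end
  a[#a + 1] = maxFreq
  return a
end

-- envA and envB are functions mapping a frequency in Hz to a level in dB.
-- t runs from 0 (all A) to 1 (all B). Returns the morphed envelope function.
function morph(envA, formantsA, envB, formantsB, t, maxFreq)
  local fa, fb = anchors(formantsA, maxFreq), anchors(formantsB, maxFreq)
  local ft = {}
  for i = 1, #fa do ft[i] = (1 - t) * fa[i] + t * fb[i] end
  return function(freq)
    local dbA = envA(warp(freq, ft, fa))
    local dbB = envB(warp(freq, ft, fb))
    return (1 - t) * dbA + t * dbB
  end
end
```

Without the warp, blending a peak at 500 Hz with one at 1000 Hz would give two fading peaks rather than a single peak sliding through 750 Hz, which is why the formant positions need to be specified.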
In initial tests of the morphs, the formants seem to move fairly naturally.
However, using a single spectral slice to represent the sound creates a mechanical-sounding voice, much like a door buzzer. Varying the amplitude and fundamental frequency over time may help solve that problem.
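As a rough illustration of that kind of variation, the sketch below applies a slow vibrato to the fundamental frequency and a matching wobble to the amplitude. The 5 Hz rate and the depth values are illustrative guesses rather than measured figures, and `vibratoTone` is a hypothetical name. Only the fundamental is generated here; harmonic h would use h times the accumulated phase.

```lua
-- Generate a sine tone whose pitch and loudness drift slowly over time.
-- Returns the samples and the per-sample f0 track.
function vibratoTone(baseF0, sampleRate, numSamples, opts)
  opts = opts or {}
  local vibratoRate  = opts.vibratoRate  or 5     -- Hz (illustrative guess)
  local vibratoDepth = opts.vibratoDepth or 0.01  -- fraction of f0
  local tremoloDepth = opts.tremoloDepth or 0.05  -- fraction of amplitude
  local out, f0Track = {}, {}
  local phase = 0
  for n = 1, numSamples do
    local time = (n - 1) / sampleRate
    local lfo = math.sin(2 * math.pi * vibratoRate * time)
    local f0 = baseF0 * (1 + vibratoDepth * lfo)
    local amp = 1 - tremoloDepth * (0.5 + 0.5 * lfo)
    out[n] = amp * math.sin(phase)
    -- Accumulate phase so that changes in f0 do not cause clicks.
    phase = phase + 2 * math.pi * f0 / sampleRate
    f0Track[n] = f0
  end
  return out, f0Track
end
```

Because the harmonics shift as f0 moves, each one samples a slightly different point on the spectral envelope from moment to moment, which is part of what breaks up the buzzer-like steadiness.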
SMS generally also models the residual of the voice – the part of the sound that’s not represented by the harmonic portion. One option I’m looking into is using formant synthesis to generate the residual portion by passing white noise through filters.
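A minimal sketch of the noise-through-filters idea, assuming a bank of two-pole resonators (the kind used in classic formant synthesizers) wired in parallel, one per formant. The names `makeResonator` and `noiseThroughFormants` and the `freq`/`bandwidth` fields are hypothetical, and the formant values in the usage lines are placeholders rather than measurements.

```lua
-- Two-pole resonator with unity gain at 0 Hz:
--   y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
function makeResonator(centerHz, bandwidthHz, sampleRate)
  local r = math.exp(-math.pi * bandwidthHz / sampleRate)
  local b = 2 * r * math.cos(2 * math.pi * centerHz / sampleRate)
  local c = -r * r
  local a = 1 - b - c
  local y1, y2 = 0, 0
  return function(x)
    local y = a * x + b * y1 + c * y2
    y2, y1 = y1, y
    return y
  end
end

-- Feed the same white noise through every resonator and sum the outputs.
function noiseThroughFormants(formants, sampleRate, numSamples)
  local filters = {}
  for i, f in ipairs(formants) do
    filters[i] = makeResonator(f.freq, f.bandwidth, sampleRate)
  end
  local out = {}
  for n = 1, numSamples do
    local x = math.random() * 2 - 1   -- uniform white noise in [-1, 1)
    local y = 0
    for _, filter in ipairs(filters) do y = y + filter(x) end
    out[n] = y
  end
  return out
end

-- Placeholder formant values, for illustration only.
local residual = noiseThroughFormants(
  { { freq = 700, bandwidth = 100 }, { freq = 1200, bandwidth = 120 } },
  44100, 1024)
print(#residual)
```

The resulting noise would still need to be scaled and mixed with the harmonic part at a suitable level.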