I haven’t had a chance to get much coding done over the last couple months. However, I’ve been doing a lot of reading on various vocal synthesis technologies.

I’d read quite a bit about spectral modeling synthesis (SMS) before, and decided to put together some quick tests.

Essentially, SMS uses the short-time Fourier transform (STFT) to capture a “spectral envelope” – a curve describing the amplitude of the sound at each frequency.

Getting this information out of *Praat* is straightforward – there’s an option to get a “spectral slice” that lists the amplitude (in decibels) at given frequencies:

```
freq(Hz)       pow(dB/Hz)
0              -17.171071546036153
21.533203125   20.388397279697497
43.06640625    26.97257113598713
64.599609375   30.64585953310637
86.1328125     32.25338096741935
107.666015625  31.708646066090367
129.19921875   29.056762741248782
150.732421875  28.923892178490085
172.265625     32.719805559695565
... and so on
```

Converting the decibels to a linear amplitude looks like this:

```lua
-- dB to linear amplitude: every 20 dB is a factor of 10
function dbToLinear(x)
  return 10 ^ (x / 20)
end
```

To convert the spectral envelope back into sound, the process is reversed. You can generate a bunch of sine waves at multiples of the fundamental frequency, and use the spectral envelope to look up the amplitude of each harmonic.
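A minimal sketch of that additive resynthesis in Lua might look like the following. The representation here is an assumption: the envelope is a sorted list of `{ freq, amp }` points (linear amplitude), and the `lookupAmp`/`synthesize` names are mine, not anything from *Praat* or a particular SMS implementation.

```lua
-- Look up the envelope's amplitude at an arbitrary frequency by
-- linearly interpolating between the two nearest envelope points.
local function lookupAmp(envelope, freq)
  if freq <= envelope[1].freq then return envelope[1].amp end
  for i = 2, #envelope do
    local a, b = envelope[i - 1], envelope[i]
    if freq <= b.freq then
      local t = (freq - a.freq) / (b.freq - a.freq)
      return a.amp + t * (b.amp - a.amp)
    end
  end
  return 0  -- above the highest envelope point
end

-- Sum sine waves at multiples of f0, each scaled by the envelope,
-- stopping below the Nyquist frequency.
local function synthesize(envelope, f0, sampleRate, numSamples)
  local out = {}
  local nyquist = sampleRate / 2
  for n = 0, numSamples - 1 do
    local t, sample = n / sampleRate, 0
    local h = 1
    while h * f0 < nyquist do
      sample = sample + lookupAmp(envelope, h * f0)
                      * math.sin(2 * math.pi * h * f0 * t)
      h = h + 1
    end
    out[n + 1] = sample
  end
  return out
end
```

For long sounds this per-sample loop is slow – which is exactly why the inverse-FFT route mentioned below is attractive – but it’s easy to experiment with.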

Or if you’re clever, you can do an inverse FFT and stitch the frames together.

I’m exploring the idea of using a single spectral slice to represent each phoneme target, and morphing from one target to the next. For the morphs to work, key features need to be specified. Conveniently, this corresponds to peaks at the formant frequencies – something that *Praat* also calculates.

In initial tests of the morphs, the formants seem to move fairly naturally.
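One way to sketch such a morph, assuming the key features have been reduced to a list of formant peaks: interpolate the matched peaks’ frequencies and amplitudes rather than crossfading the raw spectral bins (bin-wise crossfading leaves the formants parked in place instead of gliding between targets). The peak representation and the `morphPeaks` name are my own illustration, not the post’s actual code.

```lua
-- Morph between two spectral slices by interpolating matched formant
-- peaks. Each slice is a list of peaks { freq = Hz, amp = linear
-- amplitude }, assumed matched one-to-one (peak i in `a` corresponds
-- to peak i in `b`). t runs from 0 (all a) to 1 (all b).
local function morphPeaks(a, b, t)
  local out = {}
  for i = 1, math.min(#a, #b) do
    out[i] = {
      freq = a[i].freq + t * (b[i].freq - a[i].freq),
      amp  = a[i].amp  + t * (b[i].amp  - a[i].amp),
    }
  end
  return out
end
```

Sweeping `t` from 0 to 1 over the course of a transition moves each formant smoothly from one target frequency to the next.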

However, using a single spectral slice to represent the sound creates a mechanical-sounding voice, much like a door buzzer. Varying the amplitude and fundamental frequency over time may help solve that problem.

SMS generally also models the *residual* of the voice – the part of the sound that’s not represented by the harmonic portion. One option I’m looking into is using formant synthesis to generate the residual portion by passing white noise through filters.


Spectral morphing is tricky – I tried building something like this once and never could get it right. To morph between two spectra, you basically need to map the frequency weightings onto a grid, with one spectrum on the X axis and the other on the Y axis, then walk from one corner to the opposite corner to find a best-fit path (you can use dynamic programming or a pathfinding algorithm to find it). I found a paper that describes the process very well, if you can wait until next week when I can dig it out again.

I have a half-finished iOS app that does spectral synthesis on a range of phonemes (no morphing, just fading), let me know if you’d like to try it out?

When I started out, I was more focused on a UTAU sort of approach, and tried to get spectral morphing to work when cross-fading samples. I never really had a lot of success with it.

My approach is obviously not nearly as sophisticated as what you’re suggesting. Still, I’d love to see the paper if you run across it.

I’ll pass on the iOS application, though… mostly because I don’t have any iOS devices handy!