Looking at the FFT of a vowel, you can see that each formant has its own frequency and amplitude. These show up as “blobs”, some stretching for the full duration of the wave, others for only a portion:
For example, F1 lasts the full duration of the wave, while F4 lasts only about half that. F2 and F3 clearly drop in amplitude at the low point of the glottal pulse, then rise again at the start of the new pulse.
To perform direct synthesis, I’m recreating these formant amplitudes at various points in time along the wave. To get this information, I use the Spectral Slice feature in Praat:
The maximum amplitude of each formant is typically found at the onset of the glottal pulse, and the minimum a few milliseconds before that.
Using these minimum and maximum amplitudes for each formant, I can create an amplitude envelope to apply to each formant. Summing the formant waves creates a fairly good approximation of the original sound.
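As a rough illustration of the envelope-and-sum step, here is a minimal Python sketch. It is not the actual synthesis code: the function names (`pulse_envelope`, `synth_pulse`), the piecewise-linear envelope shape, the `min_pos` parameter, and the use of plain sine waves for the formants are all assumptions made for the example. The formant values in the usage lines are made up, not measured.

```python
import math

def pulse_envelope(n_samples, a_max, a_min, min_pos=0.7):
    """Piecewise-linear envelope over one glottal pulse (assumed shape).

    Starts at a_max at the pulse onset, falls to a_min at the fraction
    min_pos of the pulse, then rises back toward a_max for the next onset.
    """
    split = min(max(1, int(n_samples * min_pos)), n_samples - 1)
    env = []
    for i in range(n_samples):
        if i < split:
            t = i / split                      # 0 at onset, approaching 1 at the minimum
            env.append(a_max + (a_min - a_max) * t)
        else:
            t = (i - split) / (n_samples - split)
            env.append(a_min + (a_max - a_min) * t)
    return env

def synth_pulse(formants, n_samples, sample_rate):
    """Sum enveloped sine waves for one pulse.

    formants: list of (freq_hz, a_max, a_min) tuples, one per formant.
    """
    out = [0.0] * n_samples
    for freq, a_max, a_min in formants:
        env = pulse_envelope(n_samples, a_max, a_min)
        for i in range(n_samples):
            out[i] += env[i] * math.sin(2 * math.pi * freq * i / sample_rate)
    return out

# Hypothetical formant values, for illustration only.
pulse = synth_pulse([(700, 1.0, 0.2), (1200, 0.5, 0.1)], 400, 44100)
```

In practice the minimum and maximum amplitudes would come from the spectral slices, and the position of the minimum would be set per formant rather than fixed by a single `min_pos`.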
The main problem with this approach arises when the formant frequencies are reset at the beginning of the pulse. As I described in the prior post, I’m applying an envelope to the waveform which drops the amplitude to zero at the reset point. This hides the “click” that would otherwise appear, but the drop in amplitude makes the output waveform buzzy.
To avoid this, I’m going to instead apply a crossfade to each formant when the frequency is reset. There will likely still be a drop in amplitude at the crossfade, but I’m hoping it won’t be as noticeable as in the current implementation.
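One way the crossfade could look, as a sketch only: the tail of the outgoing formant wave (continuing with its old phase) is blended into the start of the incoming wave (with its phase reset) over a short window. The helper names `crossfade` and `formant_with_reset`, the linear fade curve, and the fade length are assumptions for illustration, not a settled design.

```python
import math

def crossfade(outgoing, incoming):
    """Linearly blend two equal-length sample lists, outgoing -> incoming."""
    if len(outgoing) != len(incoming):
        raise ValueError("crossfade inputs must have the same length")
    n = len(outgoing)
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)   # weight of the incoming signal, strictly between 0 and 1
        mixed.append((1 - w) * outgoing[i] + w * incoming[i])
    return mixed

def formant_with_reset(freq, n_pulse, n_fade, sample_rate):
    """Two pulses of one formant sine, with the phase reset crossfaded."""
    def tone(start, count):
        return [math.sin(2 * math.pi * freq * i / sample_rate)
                for i in range(start, start + count)]

    first = tone(0, n_pulse)          # pulse 1, phase starts at zero
    tail = tone(n_pulse, n_fade)      # pulse 1 continued past the reset point
    second = tone(0, n_pulse)         # pulse 2, phase reset to zero
    return first + crossfade(tail, second[:n_fade]) + second[n_fade:]

# Hypothetical values: a 700 Hz formant, 400-sample pulses, 40-sample fade.
wave = formant_with_reset(700, 400, 40, 44100)
```

Because the old and new waves are generally out of phase during the overlap, a linear blend can still dip in level partway through the fade. The fade length and curve are the obvious things to experiment with.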