The current process of resynthesis in synSinger is to take a single cycle of the waveform and find its spectral envelope with a DFT. The amplitude at the center frequency of each Bark band is then stored in a lookup table.
The Bark scale divides the frequency range into bands based on just-noticeable differences:
*(Table of Bark bands: number, center frequency, cut-off frequency, and bandwidth.)*
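For those who want to see the shape of it, here's a rough sketch of the analysis step in Python. The band centers below are the standard Zwicker values, which may not exactly match the table above, and the nearest-bin lookup is a simplification:

```python
import numpy as np

# Standard Zwicker band centers (Hz) for the first 20 Bark bands;
# treat these as illustrative stand-ins for the actual table.
BARK_CENTERS = [50, 150, 250, 350, 450, 570, 700, 840, 1000, 1170,
                1370, 1600, 1850, 2150, 2500, 2900, 3400, 4000,
                4800, 5800]

def envelope_at_bark_centers(cycle, sample_rate):
    """DFT a single pitch cycle and sample its magnitude at each band center."""
    spectrum = np.abs(np.fft.rfft(cycle))
    bin_width = sample_rate / len(cycle)           # Hz per DFT bin
    return [spectrum[int(round(fc / bin_width))]   # nearest-bin lookup
            for fc in BARK_CENTERS]
```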
The spectral envelope is recomputed from these 20 points using a cubic spline.
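A minimal sketch of that step, using SciPy's CubicSpline as a stand-in for synSinger's own spline code:

```python
from scipy.interpolate import CubicSpline

def rebuild_envelope(center_freqs, amplitudes):
    """Fit a cubic spline through the stored (frequency, amplitude) points."""
    # Frequencies above the last stored point are extrapolated,
    # so very high harmonics should be treated with care.
    return CubicSpline(center_freqs, amplitudes)

# Example: amplitude of the 3rd harmonic of a 220 Hz note.
# envelope = rebuild_envelope(BARK_CENTERS, stored_amps)
# amp = envelope(3 * 220.0)
```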
I’d been reconstructing the vocal waveform by summing sine waves at each harmonic of the fundamental, scaling them by the amplitude given by the spectral envelope.
That worked well, but there were still artifacts. It turns out that you need to include the phase of each harmonic as well. While this is eminently doable (the phase can be calculated along with the amplitude), the results were less than impressive.
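In code, the additive scheme looks roughly like this (illustrative Python; `envelope` is the spline from above, and `phases` would hold the per-harmonic phases measured during analysis):

```python
import numpy as np

def additive_resynth(f0, envelope, phases, duration, sample_rate=44100):
    """Sum one sine per harmonic of f0, scaled by the spectral envelope."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    out = np.zeros_like(t)
    n_harmonics = int((sample_rate / 2) // f0)   # stop below Nyquist
    for n in range(1, n_harmonics + 1):
        amp = envelope(n * f0)                   # amplitude from the envelope
        phi = phases.get(n, 0.0)                 # measured phase, if available
        out += amp * np.sin(2 * np.pi * n * f0 * t + phi)
    return out
```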
I did some experiments estimating formants, with the plan of morphing the transitions smoothly together. But that turned out to be fairly tedious, and more lossy than using the spectral envelope.
So I turned back to some experiments that I had done before, where I created a simple vocoder bank from the FFT estimate, and used a simulated glottal pulse as the carrier signal.
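I won't reproduce the actual pulse here, but a Rosenberg-style pulse is a common way to simulate the glottal source, and gives a feel for what the carrier looks like (the open and speed quotients below are illustrative knobs, not synSinger's values):

```python
import numpy as np

def rosenberg_pulse_train(f0, duration, sample_rate=44100,
                          open_quotient=0.6, speed_quotient=2.0):
    """Periodic glottal-flow pulse: slow opening phase, faster closing phase."""
    period = int(sample_rate / f0)
    t_open = int(period * open_quotient)               # samples the glottis is open
    t_rise = int(t_open * speed_quotient / (1 + speed_quotient))
    t_fall = t_open - t_rise
    pulse = np.zeros(period)
    rise = np.arange(t_rise) / t_rise
    pulse[:t_rise] = 0.5 * (1 - np.cos(np.pi * rise))  # opening phase
    fall = np.arange(t_fall) / t_fall
    pulse[t_rise:t_open] = np.cos(0.5 * np.pi * fall)  # faster closing phase
    n_periods = int(duration * sample_rate / period) + 1
    return np.tile(pulse, n_periods)[:int(duration * sample_rate)]
```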
The initial results had been promising, but I wasn’t able to get it to the level of quality I wanted.
This time, instead of spacing the filters at fixed positions, I moved their center frequencies (and bandwidth) along with the harmonics. The results were significantly better, especially after tweaking the glottal pulse to match the voice.
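As a sketch of the idea (assuming a constant pitch for simplicity; the filter order and relative bandwidth are placeholder values, not tuned ones):

```python
import numpy as np
from scipy.signal import butter, lfilter

def vocode(carrier, f0, envelope, sample_rate=44100, rel_bw=0.8):
    """One bandpass per harmonic, centered on n * f0, weighted by the envelope."""
    nyquist = sample_rate / 2
    out = np.zeros(len(carrier))
    n = 1
    while (n + 0.5 * rel_bw) * f0 < nyquist:        # stay below Nyquist
        band = [(n - 0.5 * rel_bw) * f0 / nyquist,  # normalized band edges
                (n + 0.5 * rel_bw) * f0 / nyquist]
        b, a = butter(2, band, btype="bandpass")
        out += envelope(n * f0) * lfilter(b, a, carrier)
        n += 1
    return out

# e.g. voiced = vocode(rosenberg_pulse_train(220.0, 1.0), 220.0, envelope)
```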
In addition to my own (relatively low) voice, I’ve tried this out with high-pitched female voices, and the results are still quite good.
While it’s fairly free from artifacts, the process is lossy, and the audio has a sort of lo-fi, grainy quality to it. So there’s still room for improvement.