Vocoder + Glottal Pulse

The current process of resynthesis in synSinger is to take a single cycle of the waveform and find its spectral envelope with a DFT. The amplitude at the center frequency of each Bark band is then stored in a lookup table.

The Bark scale divides the frequency range into bands based on just-noticeable differences:

Number  Center Freq (Hz)  Cut-off Freq (Hz)  Bandwidth (Hz)
1       60                100                80
2       150               200                100
3       250               300                100
4       350               400                100
5       450               510                110
6       570               630                120
7       700               770                140
8       840               920                150
9       1000              1080               160
10      1170              1270               190
11      1370              1480               210
12      1600              1720               240
13      1850              2000               280
14      2150              2320               320
15      2500              2700               380
16      2900              3150               450
17      3400              3700               550
18      4000              4400               700
19      4800              5300               900
20      5800              6400               1100

The spectral envelope is re-computed from these 20 points by using a cubic spline.
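Sketched in Lua, the sampling step looks something like this (the band centers are from the table above; the function names are illustrative, and I've used a Catmull-Rom curve as one simple flavor of cubic interpolation, not necessarily the exact one synSinger uses):

-- center frequencies of the 20 Bark bands (Hz), from the table above
local barkCenters = { 60, 150, 250, 350, 450, 570, 700, 840, 1000, 1170,
    1370, 1600, 1850, 2150, 2500, 2900, 3400, 4000, 4800, 5800 }

-- sample the DFT magnitudes at each band center, giving 20 control points
local function sampleEnvelope( magnitudes, binWidth )
    local points = {}
    for i, freq in ipairs( barkCenters ) do
        local bin = math.max( 1, math.floor( freq / binWidth + 0.5 ) )
        points[i] = magnitudes[bin]
    end
    return points
end

-- Catmull-Rom cubic between control points p1 and p2 (p0 and p3 are
-- their neighbors); t runs from 0 to 1
local function cubic( p0, p1, p2, p3, t )
    return p1 + 0.5 * t * ( (p2 - p0)
        + t * ( (2*p0 - 5*p1 + 4*p2 - p3)
        + t * ( 3*(p1 - p2) + p3 - p0 ) ) )
end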

I’d been reconstructing the vocal waveform by summing sine waves at each harmonic of the fundamental, scaling each by the amplitude given by the spectral envelope.

That worked well, but there were still artifacts. It turns out that you need to include the phase of each harmonic as well. While this is eminently doable (the phase can be calculated along with the amplitude), the results were less than impressive.
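For reference, the harmonic summation (with phase included) is roughly this – ampAt() and phaseAt() are illustrative stand-ins for the envelope and phase lookups:

-- reconstruct n samples of one cycle at fundamental f0;
-- ampAt( f ) and phaseAt( f ) return the measured amplitude and
-- phase at frequency f (names are illustrative)
local function resynthesizeCycle( n, f0, sampleRate, ampAt, phaseAt )
    local out = {}
    local harmonics = math.floor( (sampleRate / 2) / f0 )
    for i = 1, n do
        local t = (i - 1) / sampleRate
        local sample = 0
        for h = 1, harmonics do
            local freq = h * f0
            sample = sample + ampAt( freq )
                * math.sin( 2 * math.pi * freq * t + phaseAt( freq ) )
        end
        out[i] = sample
    end
    return out
end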

I did some experiments estimating formants, with the plan of morphing the transitions smoothly together. But that turned out to be fairly tedious, and more lossy than using the spectral envelope.

So I turned back to some experiments that I had done before, where I created a simple vocoder bank from the FFT estimate, and used a simulated glottal pulse as the carrier signal.

The initial results had been promising, but I wasn’t able to get it to a level of quality that I wanted.

This time, instead of spacing the filters at fixed positions, I moved their center frequencies (and bandwidth) along with the harmonics. The results were significantly better, especially after tweaking the glottal pulse to match the voice.
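One band of such a filter bank can be sketched with a standard two-pole resonator (a textbook recipe, not necessarily synSinger's exact filter); re-tuning it every frame is what lets the band follow its harmonic:

-- one band of the vocoder: a two-pole resonator tuned to a harmonic
local function makeResonator( sampleRate )
    local y1, y2 = 0, 0
    local a1, a2, gain = 0, 0, 0
    local band = {}

    -- re-tune to center frequency fc (Hz) and bandwidth bw (Hz);
    -- called each frame so the band tracks its harmonic
    function band.tune( fc, bw )
        local r = math.exp( -math.pi * bw / sampleRate )
        a1 = -2 * r * math.cos( 2 * math.pi * fc / sampleRate )
        a2 = r * r
        gain = 1 - r  -- rough gain normalization
    end

    -- filter one sample of the carrier (the glottal pulse train)
    function band.process( x )
        local y = gain * x - a1 * y1 - a2 * y2
        y2, y1 = y1, y
        return y
    end

    return band
end

-- to track the kth harmonic each frame (bandwidth choice illustrative):
-- band.tune( k * f0, k * f0 * 0.1 )

Each band filters the glottal pulse train, and the band outputs are summed to produce the voiced signal.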

In addition to my own (relatively low) voice, I’ve tried this out with high-pitched female voices, and the results are still quite good.

While it’s fairly free of artifacts, the process is lossy, and the audio has a sort of lo-fi, grainy quality to it. So there’s still room for improvement.


Displaying the FFT

I decided to have another swing at rendering a spectrogram:

FFT of /DH/ with a colorful gradient applied.

Getting the magnitudes to display without being washed out requires running them through a log() function, since their dynamic range is far too wide for a linear mapping.
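The mapping is just a dB conversion clamped to a floor – something along these lines (the -60 dB floor is an arbitrary choice):

-- map a magnitude to a 0..1 brightness for the gradient
local function magnitudeToBrightness( mag, maxMag )
    -- convert to decibels relative to the loudest magnitude
    local db = 20 * math.log( math.max( mag, 1e-9 ) / maxMag ) / math.log( 10 )
    -- clamp to a -60 dB floor, then map 0 dB -> 1 and -60 dB -> 0
    local floorDb = -60
    if db < floorDb then db = floorDb end
    return 1 - db / floorDb
end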

The gradient code was adapted from code by Gytis Šk.

It probably won’t replace the Praat code, but it’s pretty.


Real and Imaginary: Arguments for Phase

Phase is calculated in the FFT (or DFT) as:

phase = atan2( y, x )

That is, by knowing the real and imaginary values, it’s possible to determine the phase.

I’ve always had trouble keeping these straight, and remembering which of the components is real, and which is imaginary. This usually leads to coding errors caused by getting the arguments backwards, and hilarity ensues.

OK, not really hilarity. More like lots of puzzlement and debugging.

It doesn’t help that the arguments to the atan2() function are in (y, x) order rather than (x, y), which also leads to coding errors.

As a sort of mnemonic, someone helpfully pointed out that cos() is an even function, so cos(x) = cos(-x), and the cosine component is the real part. On the other hand, sin() is odd, so sin(y) != sin(-y), and the sine component is the imaginary part.
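Putting that together, one DFT bin looks like this – the cosine sum is the real part, the sine sum is the imaginary part, and atan2() wants them in (imaginary, real) order (in Lua 5.3 and later, the two-argument math.atan( y, x ) replaces math.atan2):

-- amplitude and phase of bin k of a DFT over the given samples
local function dftBin( samples, k )
    local n = #samples
    local real, imag = 0, 0
    for i = 1, n do
        local angle = 2 * math.pi * k * (i - 1) / n
        real = real + samples[i] * math.cos( angle )  -- cosine: real part
        imag = imag - samples[i] * math.sin( angle )  -- sine: imaginary part
    end
    local amplitude = math.sqrt( real * real + imag * imag )
    local phase = math.atan2( imag, real )  -- (y, x) order!
    return amplitude, phase
end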

So I went back and reviewed my code. And, of course, the DFT code had the phase arguments backwards.

It probably won’t be the last time that happens. 🙄


Handling Unvoiced Speech

I’ve decided to have synSinger handle unvoiced speech the same way most other voice synthesis applications do:

  • Analyze the harmonic content via FFT/DFT;
  • Put the results into Bark Bands; and
  • Re-synthesize using low-passed white noise run through band-pass filters associated with the bands

Basically, it’s a vocoder, toggling between voiced and unvoiced mode.
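In sketch form, assuming each band carries its measured amplitude and a band-pass process function (like the resonator sketched in the vocoder post above):

-- resynthesize n samples of one unvoiced frame; bands is a list of
-- { amplitude = a, filter = f } where filter is a band-pass process
-- function, one per Bark band
local function unvoicedFrame( n, bands )
    local out = {}
    for i = 1, n do
        local noise = math.random() * 2 - 1  -- white noise carrier
        local sample = 0
        for _, band in ipairs( bands ) do
            sample = sample + band.amplitude * band.filter( noise )
        end
        out[i] = sample
    end
    return out
end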

I’ve put together some code to test this out, and the results have been good. However, an /S/ requires capturing frequency information around 5500 Hz, which means using 20 bands.

That seems a bit excessive – the frequency resolution doesn’t need to be that high. I’ve doubled the width of each band so I’m only using 10 bands, and that seems to give good results without much loss of quality. I’ll play with it some more to see if I can reduce the band count a bit further.

I’ve currently got the synSinger phoneme editor slicing phonemes into fixed-width frames of about 1/200th of a second and storing a list of numbers for each. Adding unvoiced speech would mean either adding parameters for the unvoiced band amplitudes, or adding a voiced/unvoiced flag that indicates whether the frame data represents voiced or unvoiced speech.
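With the flag approach, a frame might look something like this (the layout is illustrative, not the actual file format):

-- one analysis frame: the voiced flag selects how the band values
-- are interpreted during re-synthesis
local frame = {
    voiced = false,   -- true: harmonic data; false: noise band data
    amplitudes = { 0.12, 0.30, 0.22, 0.08, 0.05,
                   0.04, 0.02, 0.01, 0.01, 0.0 },  -- 10 band amplitudes
}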


Phoneme Editor – Now with Oto Goodness!

I’ve started adding support for unvoiced sounds in the editor. I’ve taken some inspiration from the UTAU Oto editor, which defines features such as:

  1. Consonant: The fixed-duration portion.
  2. Vowel: The part of the wave that can be stretched (or truncated).
  3. Overlap: The transition from the prior phoneme.
  4. Pre-Utterance: The portion before the note’s downbeat.

These don’t necessarily have a direct correlation in synSinger, but it’s a place to start:

The phoneme editor now allows marking up the wave by function

In addition, I’ve added a fricative band, used to indicate what the unvoiced portion of the audio is doing. In theory, I could perform an FFT and break the noise into Bark bands.

Strictly speaking, the unvoiced function needs to handle aspiration as well – for example, a prefixed /P/ has a strongly aspirated release, and its spectral content is determined by the following vowel (the formant tracks are visible in the example above).

One option is to use the formant guide tracks. synSinger has a “whisper” module that works in parallel with the voiced synthesis engine, running low-passed white noise through a formant synthesizer. Praat doesn’t supply formant estimates for unvoiced sounds (and the editor uses its own DFT for harmonic analysis), but the “whisper” module uses a simple bandwidth estimate that works well:

local function estimateBandwidth( frequency )
    -- bandwidth grows linearly with frequency:
    -- 50 Hz at 0 Hz, rising to 100 Hz at 1000 Hz
    return 50 * (1 + (frequency / 1000))
end

Prior versions of synSinger simply looked ahead to the next vowel for correct aspiration formant frequencies, so that’s probably the approach I’ll use here as well.


Phoneme Editor, continued

The phoneme editor is now capable of using the information from Praat to automatically generate “guide” tracks of the formants. Praat isn’t always perfect, but it’s a lot easier than doing it by hand. If you look closely at the image, you can see that Praat has mistakenly assigned the ending of most of the formants to the formants above them:

Phoneme Editor in Spectrograph mode

The green “guide” tracks are used to help the editor estimate the location of formant peaks.

The hollow purple dots represent formant peak locations selected by the peak picker, guided by the green guide tracks.

Clicking a peak puts the editor into Spectral Envelope mode, and displays the spectral envelope and the formants assigned to the estimated peaks:

Phoneme Editor in Spectral Envelope mode

The orange lines under the peaks represent the estimated formant width. These values aren’t quite the same as the traditional bandwidth, and the width estimation code is still buggy. Still, the ear is relatively insensitive to bandwidth, so for the moment the estimate is close enough to be usable.

I’m using VB (voice bar) as the lowest formant. I’d initially not used it, but found that the synthesis is “tinny” without it.

Because the data is often noisy, I’ve implemented a “smooth” function that iteratively blends each point with a weighted average of its neighbors, along the lines of:

local function smooth( t, weight )
    -- holds the smoothed data
    local smoothed = {}

    -- calculate smoothed values
    for i = 2, #t - 1 do
        -- apply averaging
        smoothed[i] = ((t[i-1] + t[i+1]) / 2) * weight
            + t[i] * (1 - weight)
    end

    -- replace prior values with smoothed
    for i = 2, #t - 1 do
        t[i] = smoothed[i]
    end
end

I’m currently using a weight of .001, and repeating the process 1000 times. This is enough to move most of the outliers into place and clean up the re-synthesis.
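In use, that’s just:

-- many passes with a tiny weight nudge the outliers into place
for pass = 1, 1000 do
    smooth( track, 0.001 )
end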

The editor is capable of copy synthesis. Some of the phonemes come remarkably close to the original audio. Others, not so close.

Since the editor can only currently handle voiced phonemes, the next step is to extend the editor to handle unvoiced data.


More Complex Phoneme Targets

I rewrote synSinger to use DFT spectral envelopes, but I’m not happy with the results – they still aren’t close to the target sound.

I suspect one of the problems is that synSinger only uses a small number of targets per phoneme, which just isn’t sufficient to capture the character of the voice.

So I’m putting together some tools that will allow me to build more complex phonemes, capturing more targets per phoneme.

I had extended my existing tools to handle multiple targets, but it’s quite tedious to edit that way. So I needed to code a new phoneme editor.

However, instead of re-inventing the wheel, I’ve decided to leverage Praat to do the “heavy lifting.”

So the new tool uses Praat to render the spectrogram, estimate the formants, calculate the pulse locations, and so on. I can then click and drag to specify the actual formants:

Preview of new Phoneme Builder tool for synSinger.

There’s still a lot of work left to do, but I think it’s a good step toward more accurate voice synthesis.

