Progress

Those baffling “clicks” returned. It appeared that the cause was an error in the interpolation, but I couldn’t find any issue.

So I dug deeper, and finally found the issue in the spectral envelope code: the upper and lower segment calculations contained a copy-and-paste bug.

I had started to wonder if there was something fundamentally wrong with the algorithm, so finding the bugs (there were several) was a relief.

I’ve not had a lot of success calculating the correct formant values, so I’ve decided to keep using Praat for the moment. The Formant listing… command returns information in the form:

Time_s   F1_Hz      F2_Hz       F3_Hz       F4_Hz
1.829342 842.260899 1445.984336 2580.714580 3505.377026

So I created a file in the form:

AA AA.wav
1.381190 756.655654 1542.162825 2509.759067 3337.066884
2.013578 759.139332 1347.519086 2585.987167 3491.345007

AH AH.wav
2.604888 648.105264 1225.678508 2619.991352 3316.054148
3.164389 606.682747 1244.227382 2610.945204 3372.232002

AO AO.wav
1.382510 711.846056 1076.859149 2658.969965 3280.754187
1.985170 671.614832 1091.053508 2681.490292 3497.401647

...and so on...

This is then processed by some Lua code, which opens up the specified .wav file, calculates the frequency at the given position, and analyzes n cycles of the waveform. That’s married together with the formant information, and the targets are written out to a data file along the lines of:

TARGET['AA1'] = {
 f1=728, f2=1457, f3=2450,
 spectralBands=21,
 sampledFrequency=745,
 sampleSize=6,
 amplitude=2,
 cents=-6.2384,
 [1] = {frq=60, amp=0, phase=0 },
 [2] = {frq=150, amp=0, phase=0, },
 [3] = {frq=250, amp=0, phase=0, },
 [4] = {frq=350, amp=0, phase=0, },
 [5] = {frq=450, amp=0, phase=0, },
 [6] = {frq=612.5, amp=10.3521, phase=-2.9875, },
 [7] = {frq=728.9754, amp=4.8925, phase=0.3326, },
 [8] = {frq=816.6666, amp=6.9042, phase=0.4086, },
 [9] = {frq=1000, amp=0, phase=0, lowFrq=0, },
 [10] = {frq=1225, amp=3.4769, phase=-2.618, },
 [11] = {frq=1457.9508, amp=2.2372, phase=-2.4495, },
 [12] = {frq=1633.3333, amp=3.1306, phase=1.2742, },
 [13] = {frq=1837.5, amp=2.3573, phase=-2.8282, },
 [14] = {frq=2186.9262, amp=1.0527, phase=-2.5782, },
 [15] = {frq=2450, amp=2.7065, phase=-0.7922, },
 [16] = {frq=2964.7677, amp=1.3912, phase=0.9176, },
 [17] = {frq=3460.7923, amp=1.3707, phase=0.4431, },
 [18] = {frq=4175.0409, amp=1.8125, phase=-0.564, },
 [19] = {frq=4967.6092, amp=1.3487, phase=-1.0912, },
 [20] = {frq=5823.101, amp=1.3087, phase=0.6529, },
 [21] = {frq=6971.2317, amp=1.5128, phase=0.3313, },
}
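The Lua processing isn’t shown here, but the parsing half of it can be sketched roughly like this. The function name, field names, and match patterns are illustrative, not my actual code:

```lua
-- Minimal sketch of parsing the formant listing file described above.
-- Each block starts with "PHONEME filename.wav", followed by lines of
-- "time F1 F2 F3 F4"; blocks are separated by blank lines.
local function parseFormantFile( path )
    local targets = {}
    local phoneme, wavFile

    for line in io.lines( path ) do
        -- a header line: phoneme name and .wav file
        local name, file = line:match( "^(%S+)%s+(%S+%.wav)$" )
        if name then
            phoneme, wavFile = name, file
        else
            -- a data line: time followed by four formant frequencies
            local t, f1, f2, f3, f4 = line:match(
                "^([%d%.]+)%s+([%d%.]+)%s+([%d%.]+)%s+([%d%.]+)%s+([%d%.]+)" )
            if t then
                targets[#targets+1] = {
                    phoneme = phoneme, wavFile = wavFile,
                    time = tonumber(t),
                    f1 = tonumber(f1), f2 = tonumber(f2),
                    f3 = tonumber(f3), f4 = tonumber(f4),
                }
            end
        end
    end

    return targets
end
```

Each parsed entry then drives the waveform analysis of the named .wav file.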

It’s not very good at estimating the frequencies of some of the phonemes, which leads to more rendering errors. I’ll have to resample the phonemes, taking care to keep the amplitude up at the beginning and end of the sounds. Otherwise, the analysis is going to include the fade, which messes things up.

Since I don’t have anything other than vowels in the database, I can currently only render inarticulate babbling. Now that I’ve fixed the rendering problem, hopefully I can focus on consonants and start rendering some words.

Posted in Development

Grinding Away

I’ve been working the last couple of days rebuilding the synthesis engine around spectral morphing.

It’s still too early to tell how well this will turn out. I’ve done plenty of experiments that held a lot of promise, only to fail for one reason or another. But I’m still hoping this will pan out.

I’m using Klatt’s flutter() method of summing three sine waves together:

local TWO_PI = 2 * math.pi

function flutter(t)
    return math.sin(TWO_PI*12.7*t) + 
        math.sin(TWO_PI*7.1*t) +
        math.sin(TWO_PI*4.7*t)
end

However, I think I got better results using Gaussian probabilities, so I may end up revisiting that decision. The nice thing about summing sine waves is that it ensures values don’t wander too far away.
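For reference, here’s how that flutter value might be applied to a base pitch. The FL/50 · F0/100 scaling follows Klatt’s published formulation as I recall it, so treat the constants as an assumption; flutter() is repeated so the sketch stands alone:

```lua
local TWO_PI = 2 * math.pi

-- repeated from above so this sketch stands alone
local function flutter(t)
    return math.sin(TWO_PI*12.7*t) +
        math.sin(TWO_PI*7.1*t) +
        math.sin(TWO_PI*4.7*t)
end

-- apply flutter to a base pitch f0 at time t (in seconds);
-- flutterPercent is the amount of flutter, nominally 0..100
local function flutteredPitch( f0, t, flutterPercent )
    return f0 + (flutterPercent/50) * (f0/100) * flutter(t)
end
```

Since flutter() is bounded by ±3, the pitch deviation stays within a predictable range, which is exactly the property mentioned above.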

I had some problems with wave alignment. As pitches changed, adjacent waves were off by a small amount, which was adding artifacts.

To fix this, I currently have a hack that connects waves using spline interpolation. While it works well enough, I’d like to revisit that again some time later.

For obvious reasons, choosing the right formant values is key. I spent a lot of time hunting down bugs that were caused by bad formant estimations. The values my code comes up with aren’t very reliable, so I’m still using Praat to get them.

I also spent far too much time tracking down a glitch in the morphing code that wasn’t producing smooth joins, only to find that the root cause was a bad estimate of the t value.

While it’s nice to find a bug has a simple solution, it’s irritating to have something so minor (and obvious in retrospect!) take such a lot of time.

I’m hoping I can get synSinger to start singing simple words by next week, but right now that seems a long way off.

Posted in Development

More work

I spent a lot of time in the last week putting together tools to get the mean and standard deviation of the parameters – and deltas – that make up a phonetic sound.

After all was said and done, it seems that was pretty much a waste of time. Simply modifying the pitch and amplitude of the wave seems to be all that’s necessary to make it seem less robotic.

For the most part, the mapping of harmonics onto Bark bands seems to be successful. There are some differences with some of the sounds, but I doubt that most people will notice. For the most part, the resynthesized sounds are pretty convincing.

So now it’s time to start rewriting the synthesis engine so synSinger will be able to start singing again.

The biggest challenge is likely to be morphing from unvoiced consonants such as /K/ or /T/ into vowels. I’ve got some ideas I’d like to explore on this. Specifically, I think I’ll need to warp the vowels backwards into the consonants.

Posted in Uncategorized

Adjusting the Spectral Envelope

I’ve made a small change in the spectral envelope to give a slightly better representation.

In addition to storing the largest amplitude (and associated frequency) with each Bark band, I also store the lowest amplitude (and associated frequency) between each high amplitude.

This provides a more accurate shape to the spectral envelope without adding much additional overhead.

The sound is still quite robotic – and buzzy – but hopefully that will diminish as expressive features such as vibrato are added in.

Here’s a spectrogram of the morph in action:

[Image: morphed formants displayed in Praat]

The next step will probably be to do some statistical analysis of a recorded phoneme. If I store the mean and standard deviation of values I should be able to use them for more accurate resynthesis. Currently, I’m just jittering values with a small amount of pink noise.

Posted in Uncategorized

More Spectral Morphing

I’ve got spectral morphing working, more or less.

Recall that synSinger is only using a single cycle of a waveform to create a target spectral envelope. This is obviously much simpler than dynamically determining the spectral envelope for a long phrase.

Currently, a spectral envelope is mostly just a list of the sinusoidal frequency, amplitude and phase at each Bark band. I’ve used Praat to determine the formant frequencies.

local EE = { f1=476, f2=1945, f3=2508,
 spectralBands=21,
 [1] = {frq=0, amp=0, phase=0},
 [2] = {frq=135, amp=0.0442, phase=-0.9769},
 [3] = {frq=270, amp=0.0669, phase=-1.1058},
 [4] = {frq=0, amp=0, phase=0},
 [5] = {frq=405, amp=0.1155, phase=-0.8488},
 [6] = {frq=541, amp=0.0629, phase=3.0309},
 [7] = {frq=676, amp=0.0224, phase=-3.0915},
 [8] = {frq=811, amp=0.0093, phase=-2.1223},
 [9] = {frq=946, amp=0.0049, phase=-2.0646},
 [10] = {frq=1082, amp=0.0068, phase=-1.8839},
 [11] = {frq=1352, amp=0.0042, phase=-1.6601},
 [12] = {frq=1623, amp=0.0063, phase=-1.4562},
 [13] = {frq=1893, amp=0.0361, phase=-2.2912},
 [14] = {frq=2029, amp=0.0244, phase=2.334},
 [15] = {frq=2434, amp=0.0212, phase=0.427},
 [16] = {frq=2705, amp=0.0093, phase=-1.3662},
 [17] = {frq=3246, amp=0.015, phase=-2.5886},
 [18] = {frq=3787, amp=0.0025, phase=-2.0012},
 [19] = {frq=5275, amp=0.0025, phase=-2.7058},
 [20] = {frq=5816, amp=0.0038, phase=1.4143},
 [21] = {frq=7034, amp=0.0028, phase=-0.5983},
}

Getting the values from the spectral envelope is straightforward. A given frequency is converted into a Bark band index i through a simple lookup table. If the frequency is less than the frequency stored in that band, the values are interpolated between bands [i-1] and [i]; otherwise, between bands [i] and [i+1].
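As a sketch, with barkIndex() standing in for the lookup table, and naive linear phase interpolation that ignores wrapping:

```lua
-- Sketch of the envelope lookup described above. barkIndex() stands in
-- for the lookup table mapping a frequency to a Bark band index; the
-- names here are illustrative, not the actual synSinger code.
local function envelopeAt( envelope, f, barkIndex )
    local i = barkIndex( f )

    -- pick the pair of band peaks that bracket the frequency
    local lo, hi
    if f < envelope[i].frq then
        lo, hi = envelope[i-1], envelope[i]
    else
        lo, hi = envelope[i], envelope[i+1]
    end

    -- interpolate amplitude and phase between the two peaks
    local t = (f - lo.frq) / (hi.frq - lo.frq)
    local amp = lo.amp + (hi.amp - lo.amp) * t
    local phase = lo.phase + (hi.phase - lo.phase) * t
    return amp, phase
end
```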

To morph from one spectral envelope to another, synSinger first interpolates the position of the formants F1, F2 and F3 from the current spectral envelope to the target spectral envelope. That’s just a simple lerp:

 -- morphed targets
 warpedF1 = lerp( spectralEnvelope1.f1, spectralEnvelope2.f1, t )
 warpedF2 = lerp( spectralEnvelope1.f2, spectralEnvelope2.f2, t )
 warpedF3 = lerp( spectralEnvelope1.f3, spectralEnvelope2.f3, t )
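Here lerp is just the standard linear interpolation:

```lua
-- standard linear interpolation: returns a at t=0, b at t=1
local function lerp( a, b, t )
    return a + ( b - a ) * t
end
```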

The frequencies for the harmonics are then warped to get the frequency in the spectral envelope:

local function warpFrequency( f, spectralEnvelope )

    -- where is the frequency in the morph?
    if f <= warpedF1 then
       return lerpFrequency( f, 0, warpedF1, 0, spectralEnvelope.f1 )

    elseif f <= warpedF2 then
       return lerpFrequency( f, warpedF1, warpedF2, spectralEnvelope.f1, spectralEnvelope.f2 )

    elseif f <= warpedF3 then
       return lerpFrequency( f, warpedF2, warpedF3, spectralEnvelope.f2, spectralEnvelope.f3 )

    elseif f <= 3200 then
       return lerpFrequency( f, warpedF3, 3200, spectralEnvelope.f3, 3200 )

    else
       -- don't morph
      return f

    end

end

The lerpFrequency function maps a frequency between two formant frequencies in the warped envelope to a frequency between two formant frequencies in the unwarped envelope:

local function lerpFrequency( f, start1, end1, start2, end2 )

    -- calculate the ratio
    local t = (f-start1) / (end1-start1)

    -- lerp between the new target points
    return lerp( start2, end2, t )

end

That frequency is used to query the spectral envelope and get the amplitude and phase to use. These values are then passed to an inverse DFT, and the waveform is constructed.
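A minimal sketch of that inverse step, matching the conventions of the dft() routine from the earlier “Synthesis with Sinusoids” post (only the first n/2 bins carry independent information for a real signal):

```lua
-- Reconstruct one pitch period from per-harmonic magnitude and phase.
-- Assumes the analysis convention phase[k] = atan2(cos, sin), so each
-- component contributes magnitude[k] * sin(angle + phase[k]).
local function idft( magnitude, phase, n )
    local PI_2 = 2 * math.pi
    local buffer = {}

    for t = 1, n do
        local sum = 0
        -- only the first half of the bins is needed for a real signal
        for k = 1, math.floor(n/2) do
            local angle = PI_2 * t * k / n
            sum = sum + magnitude[k] * math.sin( angle + phase[k] )
        end
        buffer[t] = sum
    end

    return buffer
end
```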

This process translates the position of the formants smoothly, while preserving the bandwidths. There’s some distortion between the bandwidths because of the stretching, but in theory they aren’t that critical.

The main problem is that the resulting voice is pretty buzzy. That’s a function of the synthesis, not of the morphing. I need to look into that to see where the buzz is coming from, and how it could be alleviated.

Posted in Uncategorized

Perceptual Dynamic Model

I recently ran across the paper A Fixed Dimension and Perceptually Based Dynamic Sinusoidal Model of Speech. In it, the authors propose dividing the spectral envelope into 21 bands based on the Bark scale critical bands.

Instead of storing all frequencies and amplitudes of the spectral envelope, only one maximum value and slope are stored per critical band.

The authors report that a high-quality spectral envelope can then be reconstructed from this model.

I’ve been looking for a good model for storing the spectral envelope, and the PDM looks like it could be a good fit. I put together a test that puts the largest DFT amplitudes into bins based on the Bark scale, and compared the synthesized output against the actual DFT values.

Rather than use the center frequency of the Bark band, I also store the frequency and phase of the highest sinusoid in the band, and use that when interpolating.
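The binning test can be sketched like this. bandEdges stands in for the 21 Bark critical band edges, and the names are illustrative rather than the actual code:

```lua
-- Pick the peak sinusoid in each Bark band. bandEdges is a list of
-- upper band-edge frequencies; harmonics is a list of {frq, amp, phase}
-- entries from the DFT. Harmonics above the last edge are ignored.
local function peakPerBand( harmonics, bandEdges )
    local bands = {}
    for i = 1, #bandEdges do
        bands[i] = { frq = 0, amp = 0, phase = 0 }
    end

    for _, h in ipairs( harmonics ) do
        -- find the band this harmonic falls into
        for i, edge in ipairs( bandEdges ) do
            if h.frq <= edge then
                -- keep the loudest sinusoid in the band
                if h.amp > bands[i].amp then
                    bands[i] = { frq = h.frq, amp = h.amp, phase = h.phase }
                end
                break
            end
        end
    end

    return bands
end
```

Empty bands simply keep zero amplitude, which matches the amp=0 placeholder entries in the target tables above.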

While the result isn’t exactly the same as the original pulse, I think the results are good enough to move to the next step – morphing of spectral envelope targets.

Posted in Development

Synthesis with Sinusoids

Analysis and re-synthesis of a wave using the DFT (Discrete Fourier Transform) or FFT (Fast Fourier Transform) is fairly direct. Here’s a DFT routine in Lua that, given a buffer containing a single pitch period, will return the magnitude and phase of each frequency bin:

-- 2 * pi, used when computing the angle for each bin
local PI_2 = 2 * math.pi

function dft( buffer )

  local n = #buffer

  -- holds the output results
  local magnitude = {}
  local phase = {}

  -- iterate through each frequency bin
  for k=1, n do
    -- clear the sums
    local sumSin = 0
    local sumCos = 0

    -- iterate through each sample in the buffer
    for t = 1,n do
      -- calculate angle
      local angle = PI_2 * t * k / n

      -- calculate the sine and cosine contributions
      sumSin = sumSin + math.sin(angle) * buffer[t]
      sumCos = sumCos + math.cos(angle) * buffer[t]

    end

    -- use the Pythagorean theorem to calculate the magnitude
    magnitude[k] = math.sqrt( sumSin * sumSin + sumCos * sumCos )

    phase[k] = math.atan2( sumCos, sumSin )

    -- normalize the value
    magnitude[k] = magnitude[k] / (n/2)

  end

  return magnitude, phase

end

So the Fourier transform can be seen as an operation that deconstructs audio into its component waves, and allows it to be reconstructed back again through an inverse of the process.

The information provided by the transform is a mapping of the frequency response of the vocal tract – some frequencies are emphasized, and others are de-emphasized. This mapping is referred to as the spectral envelope.

Once the spectral envelope is captured, it can be used to resynthesize the sound at a different pitch.
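As a sketch, resynthesis at a new pitch just samples the envelope at each harmonic of the new fundamental and sums the sinusoids. Here envelopeAmp is a stand-in for whatever envelope lookup is used, and the stored phases are ignored for simplicity:

```lua
-- Resynthesize numSamples samples at a new fundamental f0, sampling the
-- spectral envelope at each harmonic. envelopeAmp(f) is assumed to
-- return the envelope amplitude at frequency f (e.g. by interpolating
-- between stored band peaks).
local function resynthesize( envelopeAmp, f0, sampleRate, numSamples )
    local PI_2 = 2 * math.pi
    local out = {}

    -- only keep harmonics below the Nyquist frequency
    local numHarmonics = math.floor( (sampleRate/2) / f0 )

    for t = 1, numSamples do
        local sum = 0
        local time = (t-1) / sampleRate
        for k = 1, numHarmonics do
            local f = k * f0
            sum = sum + envelopeAmp( f ) * math.sin( PI_2 * f * time )
        end
        out[t] = sum
    end

    return out
end
```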

Things get a bit more complex when morphing from one spectral envelope to the next.

The first solution – linear morphing between the spectral envelopes – won’t do the trick.

Imagine having two pictures – one with a ball on the left side, and one with a ball on the right. What you want is an interpolation where the ball moves from the left to the right of the picture.

What you get if you do a linear morph is the ball “teleporting” Star Trek style from the left side to the right.

For the morph to perform correctly, you have to identify how features on one spectral envelope map to features on another envelope. Otherwise – to use another analogy – it’s like an ear morphing into a nose.

I’d worked on this problem earlier this year, but the results were pretty awful.

However, I’ve got some new ideas to try out this time. Hope springs eternal…

Posted in Uncategorized