Identifying Formants

Praat generally does an excellent job of identifying formants, in contrast to my own feeble attempts.

Nevertheless, I continue to get sidetracked on the task of correctly identifying formants. My current method is fairly simple: I identify all the peaks in a fairly coarse DFT and sort them by frequency. A “peak” is simply a local maximum: a point where both the prior and next points are below the current point.
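As a rough sketch, that peak pass might look like the following Lua (the `findPeaks` name and the `spectrum` magnitude array are illustrative, not the actual code):

```lua
-- Collect local maxima from an array of DFT magnitudes.
-- A bin is a peak when both neighbors are strictly lower.
local function findPeaks( spectrum )
  local peaks = {}
  for i = 2, #spectrum - 1 do
    if spectrum[i-1] < spectrum[i] and spectrum[i+1] < spectrum[i] then
      peaks[#peaks+1] = { bin = i, magnitude = spectrum[i] }
    end
  end
  -- bins are already in frequency order, but sorting makes the intent explicit
  table.sort( peaks, function( a, b ) return a.bin < b.bin end )
  return peaks
end
```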

The results are fairly good, but there are occasional spots where formants “jump around”, so there’s a need to add a tracking mechanism of some sort, which I haven’t gotten around to writing.

For fun, I decided to overlay the formants on a display of the FFT, to see how accurate it was.

I’d played around with rendering FFTs before, but could never render them in a way that resembled the sonograms in other programs.

After convincing myself that the code actually was working, I finally settled on using the square root of the magnitude, along with a fudge factor that reduced the amplitude of lower frequencies. The algorithm for pseudo-color was reverse engineered from an image I found on Wikipedia:

-- lerp(a, b, t) = a + (b - a) * t
r = math.floor(lerp( 19, 252, n ))
b = math.floor(lerp( 108, 52, n ))
if n > .5 then
  g = math.floor(lerp( 19, 252, (n*2)-1 ))
else
  g = 0
end

The result was slow, but pretty:


Estimated formants overlaid onto an FFT of the wave.

I’ve added cosine windowing, and that cleaned up the results.

This version of the code merely identifies peaks. Adding peak tracking seems like something that’s fairly doable.

So I think it’s worth taking a bit more time to see if I can roll this code myself, instead of relying on Praat.

Posted in Development | Leave a comment

Interpolation to the Rescue

I’d been using linear interpolation to estimate the spectral envelope between sample points. With the addition of code to estimate “missing” peaks, it worked rather well.

However, I did some experimentation and found that using a Hermite spline seemed to give good results as well:

Estimating the spectral envelope with a Hermite spline.

It didn’t seem to make much difference audibly, but the implementation was simpler than the “missing” peak code, so synSinger now uses this for estimating the spectral envelope amplitudes. Phase is still linearly interpolated.
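For reference, a cubic Hermite interpolator looks something like this. The Catmull-Rom tangent choice here is an assumption for illustration, not necessarily what synSinger uses:

```lua
-- Cubic Hermite interpolation between p1 and p2 at parameter t in [0, 1],
-- using Catmull-Rom tangents derived from the neighbors p0 and p3.
local function hermite( p0, p1, p2, p3, t )
  local m1 = (p2 - p0) * 0.5   -- tangent at p1
  local m2 = (p3 - p1) * 0.5   -- tangent at p2
  local t2 = t * t
  local t3 = t2 * t
  return (2*t3 - 3*t2 + 1) * p1
       + (t3 - 2*t2 + t)    * m1
       + (-2*t3 + 3*t2)     * p2
       + (t3 - t2)          * m2
end
```

At t = 0 this returns exactly p1, and at t = 1 exactly p2, so adjacent spans join without discontinuities.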

During the wave reconstruction, the amplitude and phase are now linearly interpolated across the wave. Prior versions were static, calculating the initial value, and using that single value to generate the wave. While it was fast, there was no guarantee that the wave would line up with the following wave.

By interpolating the parameters from the start to the end of the wave, it’s guaranteed that the waves will connect smoothly. While more expensive computationally, the resulting audio is dramatically clearer, eliminating much of the “buzzy” sound that had been plaguing the output.
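A sketch of that per-wave parameter ramp, for a single harmonic (the function and parameter names are illustrative; in practice there would be one amplitude/phase pair per harmonic):

```lua
-- Render one cycle of a single harmonic, linearly interpolating
-- amplitude and phase from the start to the end of the cycle so
-- that consecutive waves connect smoothly.
local function renderHarmonic( buffer, offset, samplesPerCycle,
                               startAmp, endAmp, startPhase, endPhase )
  for k = 1, samplesPerCycle do
    local t = (k - 1) / samplesPerCycle
    -- interpolate the parameters across the cycle
    local amp   = startAmp   + (endAmp   - startAmp)   * t
    local phase = startPhase + (endPhase - startPhase) * t
    local angle = 2 * math.pi * (k - 1) / samplesPerCycle + phase
    buffer[offset + k] = (buffer[offset + k] or 0) + amp * math.sin( angle )
  end
end
```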

Posted in Uncategorized | Leave a comment

Silent Letter Revisited

I finally implemented the /H/ sound using the new rendering engine. Since the aspiration was already part of the code, it was “just” a matter of turning off the vocal sound, and raising the aspiration value.

Of course, nothing is as simple as that, and the /H/ is an interesting phoneme. It has to:

  • Look ahead and set the formants based on the following vowel, since /H/ takes the formants of the vowel that follows it. If no vowel is found, a neutral vowel is used.
  • Turn off the voicing, if needed.
  • Turn up the aspiration.
  • Render the target.
  • Turn down the aspiration.
  • Turn on the voicing if a voiced phoneme follows.
  • Set the pitch without a transition.
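Sketched as code, the expansion might look like this. The instruction names are modeled on synSinger's rendering instructions; the helper itself, and the "AX" label for the neutral vowel, are hypothetical:

```lua
-- Hypothetical sketch of expanding /H/ into rendering instructions.
local function instructionsForH( followingVowel, followingIsVoiced )
  -- /H/ takes the formants of the vowel that follows;
  -- fall back to a neutral vowel if none is found
  local vowel = followingVowel or "AX"
  local program = {
    { "SET", vowel },            -- adopt the following vowel's formants
    { "VOICING", "off" },        -- turn off the voicing
    { "ASPIRATION", "up" },      -- turn up the aspiration
    { "HOLD" },                  -- render the target
    { "ASPIRATION", "down" },    -- turn down the aspiration
  }
  if followingIsVoiced then
    program[#program+1] = { "VOICING", "on" }
  end
  program[#program+1] = { "PITCH", "no-transition" }
  return program
end
```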

While I was working on that, I also fixed the amplitude, vibrato and pitch change code, since they all pretty much shared the same mechanism.

I also corrected a bunch of errors in the timing code, so notes are now rendered with the correct duration.

I’ve still got a number of fricatives to add, as well as r-colored vowels.

The output is currently a bit warbly because the phonemes aren’t balanced. Once I work out exactly what needs to be recorded, I’ll re-record everything in a single session.

I also need to work on the prosody code, since it currently sounds very mechanical, not smooth and legato.

Posted in Uncategorized | Leave a comment


One of the problems with generating output wave by wave is that there’s a good chance of a discontinuity with the prior wave. I’ve got some “smoothing” code that renders a small Hermite spline to ease the transition.

This came to the forefront again as I was attempting to fine-tune the “whisper” routine. In parallel with the wave renderer, synSinger runs white noise through a series of filters to simulate breath. This is then low-passed and added to the output.

Unfortunately, I had added the routine after the glue code. There were small discontinuities in the noise, which were at the same frequency as the wave.

Moving the “whisper” code prior to the “smoothing” code corrected the problem.

Since synSinger also calculates the maximum amplitude of the samples when they are collected, I added an additional parameter to the renderer that would normalize the amplitude. It doesn’t do the job 100%, but it helps and is a cheap fix.

Since the waves are being rendered one at a time, it’s actually possible to have complete control over the amplitude, normalizing the wave after it’s been generated:


 -- holds the high and low points
 local maxAmp = 0
 local minAmp = 0
 -- find the high and low points
 for k = 1, samplesPerCycle do
   local sample = buffer[offset+k]
   maxAmp = math.max( maxAmp, sample )
   minAmp = math.min( minAmp, sample )
 end
 -- flip the sign of the negative value
 minAmp = -minAmp
 -- normalize
 for k = 1, samplesPerCycle do
   -- get a sample
   local sample = buffer[offset+k]
   if sample > 0 then
     -- normalize in the positive direction
     sample = sample / maxAmp
   else
     -- normalize in the negative direction
     sample = sample / minAmp
   end
   -- scale by the expected cycle amplitude
   buffer[offset+k] = sample * cycleAmplitude
 end

But it turns out that amplitude only makes a small difference perceptually, especially compared to changes in timbre, which normalizing the amplitude does nothing to soften.

So this “super compressor” code won’t be added in.

Posted in Uncategorized | Leave a comment

Interpolating Peaks

I’ve added some simple logic that checks to see if a peak can be interpolated by looking at the intersection of the slopes between points. Here’s an example of it in action:

Missing peaks interpolated.

The results look very good – peaks move smoothly across frequencies as the sound changes.

However, I haven’t had a chance to test how this solution sounds yet.

There’s a fairly common case where the peak is interpolated incorrectly:

Peak incorrectly interpolated.

The solution to this is trivial – when checking points (n0, n1, n2, n3), a check is also made against points (n1, n2, n3, n4). If they both have an interpolated peak, that peak is skipped.

Here’s another case. Having a minimum slope might solve this:

Another incorrectly interpolated peak.

The core of the routine is a simple line intersection algorithm:

-- Return the intersection x, y between the two lines defined
-- by (x1, y1) - (x2, y2), (x3, y3) - (x4, y4), or nil if
-- there is no intersection

function findIntersection( x1, y1, x2, y2, x3, y3, x4, y4 )

 -- slope is change in y over change in x
 local m1 = (y2-y1) / (x2-x1)
 local m2 = (y4-y3) / (x4-x3)

 -- lines are parallel
 if m1 == m2 then
   return nil, nil
 end

 -- y-intercept (value at x = 0)
 local b1 = y1 - (m1 * x1)
 local b2 = y3 - (m2 * x3)

 local x = (b2 - b1) / (m1-m2)
 local y = m1 * x + b1

 -- intersection
 return x, y
end

Posted in Development | Leave a comment

A Sampling Problem

I modified my GUI so that I could scroll through wave files and see how the spectral envelope changed over time. As I was clicking through the waves, I noticed that a number of peaks were “missing”, like so:

Graphic showing missing peak.

The peak is missing because the waveform is only being sampled at the harmonics, and the peak lies between them.

For that matter, the first peak is also sliced off.

This isn’t an unsolvable problem, but it set off an “a-ha” moment. I’m splitting the frequency space into Bark bands, and storing the highest peak in the band. Obviously, missing peaks will cause aliasing problems.

By doing this, my (perhaps overly clever) solution of using Bark bands is actually throwing away information and making things worse. If I’d just stored the core frequency, along with the amplitudes and phases of the harmonics, I’d have a simpler solution, and better results.

I had originally used the Bark band compression because I was going to base my morphing algorithm around it. But the method that I’m using morphs the harmonics, and then queries the spectral envelope for the values at the morphed harmonic. As long as I can pass a harmonic frequency in and get an amplitude and phase out, it’ll work.

So I’m going to look into how to resolve the missing peaks. The simplest thing to do is to determine where the slopes intersect.

Once that’s resolved, I’ll see about simplifying the code to remove the Bark band logic.

Posted in Uncategorized | Leave a comment

New Rendering Logic

The core of the new rendering engine is an inverse DFT (Discrete Fourier Transform).

Each voiced phoneme target is described as a piecewise spectral envelope, which, when given a frequency, returns the amplitude and phase of that frequency.

Rendering is done on a wave-by-wave basis. Given a fundamental frequency f0, the amplitudes for all the harmonics up to about 4000 Hz are calculated. One or more waves can then be generated using the inverse DFT.
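A sketch of that wave-by-wave synthesis, assuming `envelope(freq)` returns an amplitude and phase as described above (the names here are illustrative):

```lua
-- Generate one cycle of a voiced wave by summing harmonics of f0,
-- taking each harmonic's amplitude and phase from the spectral envelope.
local function renderCycle( f0, sampleRate, envelope )
  local samplesPerCycle = math.floor( sampleRate / f0 + 0.5 )
  local harmonics = math.floor( 4000 / f0 )   -- harmonics up to ~4000 Hz
  local wave = {}
  for k = 1, samplesPerCycle do
    local sum = 0
    local t = (k - 1) / samplesPerCycle
    for h = 1, harmonics do
      local amp, phase = envelope( h * f0 )
      sum = sum + amp * math.sin( 2 * math.pi * h * t + phase )
    end
    wave[k] = sum
  end
  return wave
end
```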

Transitions between one phoneme target and another are accomplished via morphing, described in an earlier post.

The rendering is done by converting the phonemes into a series of instructions. Currently, the following instructions are implemented, but the list is subject to change:

  • SET sets a spectral envelope as active, but does not render anything.
  • ATTACK starts the amplitude ramping up again, if not already at 100%.
  • PLOSIVE_ATTACK is like ATTACK, but starts at 40% amplitude instead of 0%.
  • SILENCE renders a section of silence and sets the amplitude to zero.
  • SAMPLE renders a sample into the buffer. This is used for plosives and fricatives.
  • HOLD renders n cycles of the current target phoneme.
  • RELEASE renders n cycles of the current phoneme, fading the amplitude to zero.
  • MORPH renders n cycles of the current phoneme morphing into a new phoneme.

The core rendering is relatively simple, but converting the phonemes into the proper instructions requires a lot of attention to detail. For example, creating the instructions for the phonemes “T AA” looks something like this:

SILENCE           -- start of the stop consonant
SAMPLE T          -- plosive portion of the /T/
PLOSIVE_ATTACK    -- start of voicing after plosive
SET T_AA1         -- start of transition
MORPH AA1         -- transition from /T/ to /AA1/
SET AA1           -- held portion starts on /AA1/ target...
MORPH AA2         -- ... and morphs into /AA2/
RELEASE           -- fade out using /AA2/ target

There are a number of “special” phonemes here.

Each transition from a plosive to a vowel has a different shape, so I create a phoneme for each plosive/vowel transition. So while the plosive attack of the /T/ has no spectral envelope target (it’s simply a digital sample), the beginning of the /T->AA/ transition is captured as the phoneme /T_AA1/.

Rather than capture every plosive/vowel transition, I only sample these “core” vowels:

/AA/ as in odd
/AE/ as in bat
/AH/ as in hut
/AO/ as in talk
/EH/ as in bet
/IH/ as in bit
/OH/ as in cone
/UH/ as in book
/IY/ as in beet
/ER/ as in bird

(I pronounce /AA/ and /AO/ the same, so I’ll have to go back and correct the /AO/ phonemes at some point).

When a vowel transition is needed, synSinger uses a lookup table to determine which “core” vowel the first part of the vowel corresponds to. Not having to sample every transition is obviously a timesaver.

The phoneme /AA1/ is the sound at the beginning of an /AA/ sound, and /AA2/ is the sound at the end. Strictly speaking, /AA/ is a monophthong and only has one “sound”, but this allows a natural sound transition during the sustained portion of vowels.

To generate the target’s spectral envelope information, I use Praat. I select a position in the .wav file that contains the target, and choose the Formant Listing, which returns the position (in seconds) and the formant information. This is then copied into a file in a format like this:

EY EY.wav 112  16
1.903082 572.273148 1790.221828 2551.646472 3458.211829
2.269429 501.891011 1808.743538 2468.468514 3475.364566
2.587991 353.950727 2079.433597 2677.219221 3318.279920

The first line (starting with an alpha character) contains the phoneme name, the file name, the (optional) approximate frequency, and the (optional) maximum number of cycles to consider for analysis. These last two hints greatly increase the quality of the analysis:

  • The frequency helps extract the cycles correctly;
  • Short cycles are selected for short transitions (such as plosive/vowel);
  • Longer cycles are used for sustained sounds.

The next three lines are the Formant Listing values from Praat; they will automatically be converted into three targets (EY1, EY2, and EY3) and married up to the phoneme data table. Phonemes with three targets are assumed to be diphthongs; two or fewer are monophthongs.
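A hedged sketch of parsing that file format; the field handling is inferred from the description above, and synSinger's actual loader may differ:

```lua
-- Parse a target-definition file: a header line with phoneme name,
-- wave file name, and optional frequency and cycle-count hints,
-- followed by Praat Formant Listing lines (time, then formants).
local function parseTargets( text )
  local targets = {}
  local current
  for line in text:gmatch( "[^\r\n]+" ) do
    if line:match( "^%a" ) then
      -- header line: name, file, optional hints
      local name, file, freq, cycles =
        line:match( "^(%S+)%s+(%S+)%s*(%S*)%s*(%S*)" )
      current = { name = name, file = file,
                  frequency = tonumber( freq ),
                  maxCycles = tonumber( cycles ),
                  listings = {} }
      targets[#targets+1] = current
    elseif current then
      -- formant listing line: time followed by formant frequencies
      local values = {}
      for n in line:gmatch( "%S+" ) do
        values[#values+1] = tonumber( n )
      end
      current.listings[#current.listings+1] = values
    end
  end
  return targets
end
```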

With Praat doing the heavy lifting, the remaining analysis is pretty straightforward. Cycles are identified using autocorrelation and then harmonically analyzed using a DFT. The analyzed targets are then written out as spectral envelopes in Lua code.

As the beginning of the post mentions, rendering is basically a reversal of the analysis process.

Posted in Development | Leave a comment