Direct Synthesis – continued

Looking at the FFT of a vowel, you can see that each formant has its own frequency and amplitude. These show up as “blobs” – some stretching for the duration of the wave, others for only a portion:


Waveform of “OO” sounds and FFT

For example, F1 lasts for the duration of the wave, while F4 lasts only about half that. F2 and F3 clearly drop in amplitude at the low point of the glottal pulse, then rise again at the start of the new pulse.

To perform direct synthesis, I’m recreating these amplitudes for the formants at various points in time along the wave. To get this information, I use the Spectral Slice feature in Praat:


Spectral slice at glottal pulse maximum of “OO” sound

The maximum amplitude of each formant is typically found at the onset of the glottal pulse, and the minimum a few milliseconds before that.

Using these minimum and maximum amplitudes for each formant, I can create an amplitude envelope to apply to each formant. Summing the formant waves creates a fairly good approximation of the original sound.
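Here’s roughly what I mean, as a Python sketch (this isn’t the actual synSinger code – the formant table, the linear envelope shape, and the sample rate are all made up for illustration):

```python
import math

SAMPLE_RATE = 44100

def formant_wave(freq, amp_min, amp_max, period_samples):
    """One glottal period of a single formant: a sine at the formant
    frequency whose amplitude falls from the maximum (at pulse onset)
    to the minimum (just before the next pulse)."""
    out = []
    for n in range(period_samples):
        t = n / period_samples
        amp = amp_max + (amp_min - amp_max) * t
        out.append(amp * math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return out

def sum_formants(formants, period_samples):
    """Sum the individual formant waves into a single output pulse."""
    waves = [formant_wave(f, lo, hi, period_samples) for f, lo, hi in formants]
    return [sum(samples) for samples in zip(*waves)]

# made-up formant table for an "OO"-like vowel: (freq, min amp, max amp)
formants = [(300, 0.4, 1.0), (870, 0.1, 0.5), (2240, 0.05, 0.2)]
pulse = sum_formants(formants, 441)  # one glottal period at 100 Hz
```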

The main problem with this approach appears when the formant frequencies are reset at the beginning of the pulse. As I described in the prior post, I’m applying an envelope to the waveform which drops the amplitude to zero at the reset point. It hides the “click” which would otherwise appear, but the drop in amplitude makes the output waveform buzzy.

To avoid this, I’m going to instead apply a crossfade to each formant when the frequency is reset. There will likely still be a drop in amplitude at the crossfade, but I’m hoping it’s not as noticeable as the current implementation.
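Something like this is what I have in mind (a sketch only – the equal-gain linear fade is an assumption, not necessarily what I’ll end up using):

```python
def crossfade(old_tail, new_head):
    """Equal-gain linear crossfade from the end of the old formant wave
    into the start of the reset one."""
    n = len(old_tail)
    out = []
    for i in range(n):
        t = (i + 1) / n  # fade position, ending fully on the new wave
        out.append(old_tail[i] * (1 - t) + new_head[i] * t)
    return out
```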

Posted in Development | Tagged , , | Leave a comment

Direct Synthesis

One of the problems that I’ve been encountering with the use of digital filters is a “squeal” when parameters rapidly change. Dennis Klatt published a solution for this which recalculated the stored filter coefficients, but I’ve never been able to get the code to work.

The algorithm that S.A.M. (Software Automatic Mouth) uses to generate its output doesn’t use filters. Rather, it directly generates the formants, but resets the angle of each generator back to zero at the start of each glottal pulse.

Here’s an example of a wave that’s the summation of only the formant frequencies:

formants only

Signal that is the sum of the formant frequencies

This does not sound like a vocal sound. The next step is to reset the frequency generators at the start of each glottal pulse:

resetting formants

Resetting the formant waves at the start of each glottal pulse.

This gives the wave a “vocal” sound. However, instantly resetting the formant frequency generators to zero causes discontinuities, which I’ve marked in red.
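In sketch form, the reset logic looks something like this (illustrative Python, not the actual generator code – the frequencies and amplitudes are placeholders):

```python
import math

SAMPLE_RATE = 44100

def phase_reset_synth(formants, f0, num_samples):
    """Sum sine generators at the formant frequencies, resetting every
    generator's phase to zero at the start of each glottal pulse."""
    period = SAMPLE_RATE / f0
    phases = [0.0] * len(formants)
    next_pulse = 0.0
    out = []
    for n in range(num_samples):
        if n >= next_pulse:
            # start of a new glottal pulse: reset all generator phases
            phases = [0.0] * len(formants)
            next_pulse += period
        out.append(sum(amp * math.sin(ph)
                       for (freq, amp), ph in zip(formants, phases)))
        # advance each generator by its per-sample angle step
        phases = [ph + 2 * math.pi * freq / SAMPLE_RATE
                  for (freq, amp), ph in zip(formants, phases)]
    return out

out = phase_reset_synth([(500, 1.0), (900, 0.5)], 100, 441)
```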

One solution is to put an amplitude envelope around the pulse, so the envelope falls to zero at the start of the pulse, at the same point the formant frequencies are reset:

formants with amplitude envelopes

Using an amplitude envelope to smooth the pulse transitions.

This waveform still has a very mechanical quality, but that can be mitigated (somewhat) by jittering the parameters.

Another factor that helps realism is to add vocal noise. This is typically generated by running noise through filters set to the formant frequencies, and mixing that in with the vocal signal.

But it got me wondering whether the filtered noise could also be directly generated.

I’ve got a method that works fairly well, but I’m sure improvements could be made to it. The basic idea is very much like that of the example above – the formants are generated directly, for a pulse of random length. At the start of the next pulse, the starting angle for each formant frequency generator is set to a random value.

If the duration of the pulse is too long, the output sounds like a signal interrupted by noise. If the pulse is too short, the output sounds like plain noise. But there’s a “sweet spot” between the two where the output resembles pitched noise. Setting the formant frequencies to those of a vowel gives the sound of a “whispered” vowel.
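A sketch of the random-pulse idea (illustrative Python, not my actual implementation – the pulse-length range and formant values are guesses):

```python
import math
import random

SAMPLE_RATE = 44100

def whisper(formants, num_samples, min_len=30, max_len=90, seed=1):
    """Directly generated "filtered noise": run the formant oscillators
    for a pulse of random length, then restart them with random phases."""
    rng = random.Random(seed)
    out = []
    n = 0
    while n < num_samples:
        # random pulse length - the "sweet spot" lies between too short
        # (plain noise) and too long (interrupted tone)
        pulse_len = rng.randint(min_len, max_len)
        # random starting angle for each formant generator
        phases = [rng.uniform(0, 2 * math.pi) for _ in formants]
        for i in range(min(pulse_len, num_samples - n)):
            t = (n + i) / SAMPLE_RATE
            out.append(sum(amp * math.sin(2 * math.pi * freq * t + ph)
                           for (freq, amp), ph in zip(formants, phases)))
        n += pulse_len
    return out

noise = whisper([(300, 1.0), (870, 0.5)], 2000)
```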

I haven’t had time to generate consonants using this method.


Posted in Development | Tagged , , | Leave a comment

Click and Buzz

While the prototype voice synthesis using spectral envelopes is encouraging, there are still a lot of issues left to resolve.

I tracked down a “click” in the output to sudden transitions in amplitude between frames. Amplitude and frequency are now interpolated over 128 samples, so the transition is smooth.
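The interpolation itself is simple enough to sketch (illustrative Python – the 128-sample ramp length is the only detail taken from the actual code):

```python
RAMP = 128  # samples over which to interpolate a parameter change

def ramp_param(old, new, num_samples):
    """Move a parameter from its old value to its new one over the first
    RAMP samples of a frame instead of jumping instantly."""
    out = []
    for n in range(num_samples):
        if n < RAMP:
            out.append(old + (new - old) * n / RAMP)
        else:
            out.append(new)
    return out
```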

I was getting some very wrong results with some of the morphs, which I eventually tracked down to a number of bugs in the warping code.

I’ve also added some linear interpolation to the spectral envelope amplitude estimates. If it made any difference, I can’t hear it. But I’ll leave it in anyway.

The output is also still quite “buzzy” and robotic. I’m hoping that adding some jitter to the fundamental frequency will help take care of that. I tried averaging the spectral envelope to smooth it, but that just made the output sound muffled.

I’d experimented with resetting the phase of all the generators back to zero at the start of the pulse (synchronized with the F0), which had the result of making each wave pulse essentially symmetric. That code got tossed out.

The current plan is to continue adding the vowel phonemes and look for issues, and then start working on voiced consonants and see how well they work.

Posted in Development | Tagged | Leave a comment

Spectral Synthesis Revisited

I haven’t had a chance to get much coding done over the last couple months. However, I’ve been doing a lot of reading on various vocal synthesis technologies.

I’d read quite a bit about spectral modeling synthesis (SMS) before, and decided to put together some quick tests.

Essentially, SMS uses the short-term FFT to capture a “spectral envelope” – a graph of the amplitude of each harmonic at each frequency.

To get this information in Praat, there’s an option to get a “spectral slice” that lists the amplitude (in decibels) at given frequencies:

freq(Hz)    pow(dB/Hz)
0    -17.171071546036153
21.533203125    20.388397279697497
43.06640625    26.97257113598713
64.599609375    30.64585953310637
86.1328125    32.25338096741935
107.666015625    31.708646066090367
129.19921875    29.056762741248782
150.732421875    28.923892178490085
172.265625    32.719805559695565
... and so on

Converting the decibels to a linear value looks like this:

function dbToLinear( x )
  return math.pow(10, x/20)
end

To convert the spectral envelope back into sound, the process is reversed. You can generate a bunch of sine waves that are multiples of the fundamental frequency, and use the spectral envelope to look up the amplitude of each frequency.
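As a rough Python sketch of that additive resynthesis (not my actual code – the nearest-bin lookup and the bin spacing are simplifying assumptions):

```python
import math

def db_to_linear(db):
    """Convert a decibel value from the spectral slice to linear amplitude."""
    return 10 ** (db / 20)

def resynthesize(slice_db, bin_hz, f0, num_samples, sample_rate=44100):
    """Additive resynthesis: one sine per harmonic of f0, with each
    amplitude looked up from the spectral slice (nearest bin)."""
    nyquist = bin_hz * (len(slice_db) - 1)
    harmonics = []
    h = 1
    while h * f0 <= nyquist:
        amp = db_to_linear(slice_db[round(h * f0 / bin_hz)])
        harmonics.append((h * f0, amp))
        h += 1
    return [sum(amp * math.sin(2 * math.pi * freq * n / sample_rate)
                for freq, amp in harmonics)
            for n in range(num_samples)]
```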

Or if you’re clever, you can do an inverse FFT and stitch the frames together.

I’m exploring the idea of using a single spectral slice to represent each phoneme target, and morphing from one target to the next. For the morphs to work, key features need to be specified. Conveniently, this corresponds to peaks at the formant frequencies – something that Praat also calculates.
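The morph boils down to interpolating matched peaks (a sketch – the formant values below are invented, and real slices would carry many more points):

```python
def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t

def morph_peaks(source, target, t):
    """Morph between two phoneme targets, each given as a list of
    (formant frequency, amplitude) peaks, by interpolating matched peaks."""
    return [(lerp(f0, f1, t), lerp(a0, a1, t))
            for (f0, a0), (f1, a1) in zip(source, target)]

# hypothetical "OO" -> "AH" formant targets: (frequency, amplitude)
oo = [(300, 1.0), (870, 0.5), (2240, 0.2)]
ah = [(710, 1.0), (1100, 0.6), (2540, 0.2)]
mid = morph_peaks(oo, ah, 0.5)
```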

In initial tests of the morphs, the formants seem to move fairly naturally.

However, using a single spectral slice to represent the sound creates a mechanical sounding voice, much like a door buzzer. Altering the amplitude and fundamental frequency may help solve that problem.

SMS generally also models the residual of the voice – the part of the sound that’s not represented by the harmonic portion. One option I’m looking into is using formant synthesis to generate the residual portion by passing white noise through filters.

Posted in Uncategorized | Tagged , , | 2 Comments

Low-Budget Spectral Envelopes

Some time back I’d written some Java code that attempted to create a rough spectral envelope by picking the n highest points in an FFT. It was moderately successful, but I eventually decided to take the project down the synthesis route.

There have been some requests to see the code. Since WordPress won’t let me attach the source code, I’ve posted the code below. I’ve made a half-hearted attempt to format it, but since WordPress loves to screw up my formatting… Ah, well…

Enjoy! I make no claims that this works, and you’ll obviously need an FFT class to make it usable.

Essentially, the code copies the magnitudes to an array, and then finds the highest point. It then zeros out points around the picked point so no points near that peak will be selected, and then picks another point. It can then linearly interpolate the spectral envelope from the peaks.

It’s not elegant, and there are other well documented algorithms.

import java.util.Arrays;
import java.util.Hashtable;

public class SpectralEnvelope {

  // sorted list of peaks
  int[] peakBin;

  // values corresponding to the peaks
  double[] peakValue;

  // holds all the envelope values
  double[] envelope;

  /**
   * Create a spectral envelope for the buffer. The buffer
   * is assumed to be FFT data, and only the first half of the
   * data is used. The zeroWidth determines the precision of
   * the envelope. Exact precision (1) creates artifacts.
   * @param buffer FFT buffer containing data.
   * @param wide Number of sample points to use
   * @param peakCount Number of peaks to gather
   */
  public void build(double[] buffer, int wide, int peakCount) {
    // clear the envelope
    envelope = new double[wide];

    // copy of the data, because used peaks will have to be flagged
    double[] tmp = new double[wide];
    System.arraycopy(buffer, 0, tmp, 0, wide);

    // list of peak indexes
    peakBin = new int[peakCount];

    // add the first and last points
    peakBin[0] = 0;
    peakBin[1] = wide - 1;

    int peaksFound = 2;

    // estimate width needed to get requested peak count
    int zeroWidth = wide / peakCount / 2;

    // identify peaks, starting from the highest
    while (peaksFound < peakCount) {
      // clear the max
      double max = 0;
      int maxAt = 0;

      // search for the highest value
      for (int j = 0; j < wide; j++) {
        if (tmp[j] > max) {
          max = tmp[j];
          maxAt = j;
        }
      }

      // all peaks are found, exit
      if (max == 0) break;

      // store peak and increment peak count
      peakBin[peaksFound++] = maxAt;

      // zero area around peak so nearby points will be ignored
      int zeroStart = Math.max(maxAt - zeroWidth, 0);
      int zeroEnd = Math.min(maxAt + zeroWidth + 1, tmp.length);
      for (int index = zeroStart; index < zeroEnd; index++) {
        tmp[index] = 0;
      }
    }

    // sort the peaks
    Arrays.sort(peakBin);

    // store the values of the peaks
    peakValue = new double[peakCount];
    for (int i = 0; i < peakCount; i++) {
      // get the value from the array and store it
      peakValue[i] = buffer[peakBin[i]];
    }

    // connect the peaks
    double delta = 0;
    for (int i = 1; i < peakBin.length; i++) {
      // get the next peak
      int thisBinIndex = peakBin[i];
      int priorBinIndex = peakBin[i - 1];

      // determine the slope
      int dy = thisBinIndex - priorBinIndex;

      // handle division by zero
      if (dy == 0) {
        delta = 0;
      } else {
        delta = (peakValue[i] - peakValue[i - 1]) / dy;
      }

      // set the magnitudes
      double m = peakValue[i - 1];
      for (int j = priorBinIndex; j < thisBinIndex; j++) {
        envelope[j] = m;
        m += delta;
      }
    }
  }

  /**
   * Linearly interpolates a spectral envelope between the source and
   * target envelopes.
   * @param source Initial envelope (amount = 0)
   * @param target Target envelope (amount = 1)
   * @param amount Interpolation amount (0..1)
   */
  void interpolateEnvelope(SpectralEnvelope source, SpectralEnvelope target, double amount) {

    // number of peaks
    int peakCount = target.peakBin.length;

    // create an array to hold the interpolated peaks
    int[] lerpBin = new int[peakCount];

    // create a hashtable to hold the interpolated peak values
    Hashtable<Integer, Double> lerpValue = new Hashtable<Integer, Double>();

    // interpolate the values
    for (int i = 0; i < peakCount; i++) {
      // interpolated bin number
      lerpBin[i] = (int) Util.lerp(source.peakBin[i], target.peakBin[i], amount);
      // interpolated value
      lerpValue.put(lerpBin[i], Util.lerp(source.peakValue[i], target.peakValue[i], amount));
    }

    // clear the envelope
    envelope = new double[source.envelope.length];

    // sort the bins
    Arrays.sort(lerpBin);

    // create the spectral envelope using the interpolated values
    double priorValue = lerpValue.get(lerpBin[0]);
    double delta = 0;
    for (int i = 1; i < peakCount; i++) {
      // get the next peak
      int thisBinIndex = lerpBin[i];
      int priorBinIndex = lerpBin[i - 1];

      // get this value
      double thisValue = lerpValue.get(lerpBin[i]);

      // determine the slope
      int dy = thisBinIndex - priorBinIndex;

      // handle division by zero
      if (dy == 0) {
        delta = 0;
      } else {
        // calculate slope
        delta = (thisValue - priorValue) / dy;
      }

      // set the magnitudes
      double m = priorValue;
      for (int j = priorBinIndex; j < thisBinIndex; j++) {
        envelope[j] = m;
        m += delta;
      }

      // save current as prior value
      priorValue = thisValue;
    }
  }


  /**
   * Since a single sample isn't likely to create an accurate spectral
   * envelope, this function is used to accumulate the maximum values
   * over a number of samples.
   * @param fft The fft
   * @param buffer Buffer with the data to sample
   * @param pos Starting position
   * @param steps Number of samples
   * @param stepSize Distance between the samples
   * @return Accumulated maximum magnitudes
   * @throws Exception
   */
  final static double[] accumulateSteps(ProcessFFT fft, double[] buffer, int pos, int steps, int stepSize) throws Exception {
    // buffer holding the results
    double[] accum = new double[fft.frameSize2];
    // iterate through the buffer, decrementing the count
    for (; steps > 0; steps--, pos += stepSize) {
      // process the fft
      fft.analyze(buffer, pos);

      // save the high values to the accumulator
      for (int i = 0; i < fft.frameSize2; i++) {
        accum[i] = Math.max(accum[i], fft.gAnaMagn[i]);
      }
    }
    return accum;
  }
}

The call to the pitch shifting looks something like this:


    /**
     * Pitch shifting with formant preservation
     */
    public void pitchShiftWithFormantPreservation() {
        // create a spectral envelope
        this.spectralEnvelope = new SpectralEnvelope();
        // build the spectral envelope
        spectralEnvelope.build(this.gAnaMagn, this.frameSize2, 100);
        // get the envelope
        double[] envelope = spectralEnvelope.envelope;
        // transpose the pitch
        for (int i = 0; i < this.frameSize2; i++) {
            // get the index of the target
            int target = (int)(i * this.pitchShift);
            // exit loop if past the end
            if (target >= this.frameSize2) break;
            // scale relative to the envelope
            // need to check for division by zero
            if (envelope[i] == 0 || envelope[target] == 0) {
                gSynMagn[target] = 0;
            } else {
                gSynMagn[target] = this.gAnaMagn[i] / envelope[i] * envelope[target];
            }
            gSynFreq[target] = gAnaFreq[i] * this.pitchShift;
        }
    }


Posted in Development | Tagged | 1 Comment

Speech Synthesis and Propeller

I ran across a 2006 article by Chip Gracey called “Synthesizing Speech with the Propeller”,  which describes how to perform speech synthesis on the Propeller chip.

Something that caught my attention was the resonator algorithm – it was based on a method I hadn’t run across before. Each resonator holds its state as a two-dimensional point (x, y). The resonators are placed in a cascade, and resonator n adds the point from resonator n-1 to its own point, and rotates the new point around the origin (0,0). The final output is the y value of the last resonator. Here’s what my version of the code looks like:

-- run through resonators
f1x, f1y = rotateAroundOrigin( f1Angle, pulse+f1x, f1y )
f2x, f2y = rotateAroundOrigin( f2Angle, f1x+f2x, f1y+f2y )
f3x, f3y = rotateAroundOrigin( f3Angle, f2x+f3x, f2y+f3y )
f4x, f4y = rotateAroundOrigin( f4Angle, f3x+f4x, f3y+f4y )
out = f4y

The angle is calculated as proportional to the frequency:

-- calculate angle steps for resonators
local f1Angle = f1 * 2 * math.pi / SAMPLE_RATE
local f2Angle = f2 * 2 * math.pi / SAMPLE_RATE
local f3Angle = f3 * 2 * math.pi / SAMPLE_RATE
local f4Angle = f4 * 2 * math.pi / SAMPLE_RATE

Each point is also scaled by a small decay.
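For reference, the rotate-with-decay helper the code above assumes might look like this (a Python sketch of my Lua rotateAroundOrigin – the decay constant here is just a guess):

```python
import math

def rotate_around_origin(angle, x, y, decay=0.999):
    """Rotate the point (x, y) around the origin by the given angle,
    scaling the result by a small decay. (The decay value here is a
    placeholder; the Lua snippet doesn't show it.)"""
    c, s = math.cos(angle), math.sin(angle)
    return decay * (x * c - y * s), decay * (x * s + y * c)
```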

I was able to get these resonators running in synSinger, but because there’s no control of the bandwidths, the output sounds tinny. So there doesn’t seem to be much value in exploring this option, unless someone can point out a flaw in my code.

In his article, Chip mentions that one alternative he considered was running an oscillator at each resonant frequency, and resetting the resonators at the start of the glottal pulse – exactly the same algorithm that Software Automatic Mouth initially used. (It’s better known as VOSIM.) Chip ended up not using it because it had discontinuities and was low quality.

Here’s an example of Chip’s code singing: Singing Voice

Posted in Development | Tagged , | Leave a comment

Putting Things in the Right Place

I found some places in the output where volume, instead of slowly ramping up at the beginning of a word, suddenly jumped to a maximum value.

Clearly, something wasn’t right.

In the process of tracking down that bug, I again uncovered a bug where the list of voicing amplitude values was being placed out of order. I had stuck that on my “To Do” list, and it seemed a likely culprit.

Sadly, that turned out not to be the case. That issue was caused by miscalculating the transitions, and solving it didn’t fix the glitch I was after.

I ended up re-writing the logic for the voicing amplitude, but mysteriously, it just wasn’t working out.

I finally tracked it down to the code where the targets were generated. Back when I initially wrote the code, I had “temporarily” put the code that calculates the duration of the targets somewhere I’d since forgotten about. I’d even commented that the code altered the values there, but it simply wasn’t where you’d expect that to be taking place.

So I took the opportunity to move the code to where it should have gone. In the process, I decided to clean up another bit of confusing code that had been scattered about. That “cleanup” made a lot of code easier to understand… but also broke a number of routines, which I’m currently trying to repair.

Posted in Development | Leave a comment