GUI Programming in Lua

Most of the tools I’ve been writing run from the command line with no GUI. That’s generally not a problem because I can use tools like Praat to check the output.

But some of the more recent things I’ve been doing would work a lot better if they were interactive.

In the past, I’d been using Java for development, but for various reasons I’ve moved over to Lua. I didn’t want to split the code base, and I’d lost a lot of code in a hard drive crash a number of months ago.

Back up your hard drives, folks!

There were basically two contenders for the Lua GUI library – wxLua and IUP.

I’ve done a lot of coding in wxWidgets, but I found the simplicity of IUP more attractive. Neither toolkit had support for audio.

While I was at it, I had a look at some of the other IDEs for Lua, but decided to stick with ZeroBrane Studio. Getting ZeroBrane Studio to work with IUP was just a matter of going to the user settings and setting the default Lua path to the IUP library:

path.lua = path to IUP library

Although I confess it took longer than I’d like to figure that part out.

Then came the issue of audio playback. There really isn’t much for Lua in terms of audio libraries. Apparently most of the libraries use binary bindings, but I wanted something that was just plug and play.

I finally found proteaAudio, which looked perfect. But it was no longer supported, didn’t work with the version of Lua that I was using, and the maintainer’s website had disappeared.

I found a pre-compiled binary on the Internet Archive Wayback Machine. Following the advice of some helpful Lua folks online, I moved to Lua 5.1. Fortunately, there are builds of IUP for older versions, so that didn’t cause a problem.

I’d already written the tools for loading, saving, and manipulating audio, so adding them into a GUI wasn’t terribly difficult. I added a function to find the pitch period using autocorrelation, another to calculate the rough spectral envelope using DFT, and yet another to resynthesize the waveform:

Screenshot of audio analysis tool showing original waveform and resynthesized version.
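The pitch-period function mentioned above can be sketched with a simple time-domain autocorrelation: try every candidate lag, and pick the one where the signal best matches a shifted copy of itself. This is my own illustration, not the actual tool code, and the class and parameter names are made up for the example:

```java
public class PitchPeriod {
  // Returns the lag (in samples) with the highest autocorrelation
  // within [minLag, maxLag] - an estimate of the pitch period.
  static int estimatePeriod(double[] x, int minLag, int maxLag) {
    int bestLag = minLag;
    double best = Double.NEGATIVE_INFINITY;
    for (int lag = minLag; lag <= maxLag; lag++) {
      double sum = 0;
      // correlate the signal with a copy of itself shifted by 'lag'
      for (int i = 0; i + lag < x.length; i++) {
        sum += x[i] * x[i + lag];
      }
      if (sum > best) { best = sum; bestLag = lag; }
    }
    return bestLag;
  }

  public static void main(String[] args) {
    // 100 Hz sine sampled at 8000 Hz has a period of 80 samples
    double[] x = new double[1600];
    for (int i = 0; i < x.length; i++) {
      x[i] = Math.sin(2 * Math.PI * 100 * i / 8000.0);
    }
    System.out.println(estimatePeriod(x, 40, 400)); // expect 80
  }
}
```

For voiced speech the search range would be limited to plausible pitch periods (here 40–400 samples), which keeps the brute-force search cheap enough for small tools.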

And that got me back to playing around with synthesis using sinusoids.


TubeTalker and Pink Trombone

I haven’t been having a lot of luck with the direct synthesis experiments, and for some reason the idea of playing with articulatory synthesis again seemed like a good idea.

This is a familiar delusion. After hitting a brick wall with something, I’ll reluctantly let it go. After enough time has passed, I’ll decide that I’ve magically acquired enough knowledge and skill to successfully solve the problem, and have another go at it.

Pink Trombone is an interactive physical model of the vocal tract written in JavaScript by Neil Thapen. It runs in a browser window in real time, and it’s very cool. I’d played around with it before, and been fairly impressed by how it sounded. I’m not sure that it sounds better than formant synthesis, but I decided to have a go at writing a toy version in Lua and see how it sounded.

I ran across the paper by Brad Story titled Vocal tract area functions from magnetic resonance imaging, which included a table of vocal tract measurements for various phonemes. It seemed like a good place to start, so I cobbled together a vocal tract simulation.

I eventually got something that resembled different vowel sounds. It didn’t sound particularly good, but it was a reasonable start.

I started digging through the Pink Trombone code. There’s a lot of code in there, and some parts are a bit puzzling.

One thing that makes Pink Trombone different from the Tube Resonance Model is that it doesn’t explicitly use waveguides. The “classic” tube model is a series of digital delays connected together with simulated scattering junctions. Pink Trombone only uses the scattering junctions portion of the model. I believe that since the junctions are placed at 5 cm intervals, they also act as (very short) waveguides.
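A single scattering junction of the Kelly-Lochbaum kind can be sketched in a few lines. This is a generic textbook formulation, not Pink Trombone’s actual code (Pink Trombone uses a slightly different sign/update convention), and the names are my own:

```java
public class Junction {
  // Reflection coefficient between two tube sections with
  // cross-sectional areas a1 and a2.
  static double reflection(double a1, double a2) {
    return (a1 - a2) / (a1 + a2);
  }

  // fIn: right-going wave arriving from section 1
  // bIn: left-going wave arriving from section 2
  // Returns {fOut, bOut}: right-going wave sent into section 2,
  // left-going wave sent back into section 1.
  static double[] scatter(double k, double fIn, double bIn) {
    double fOut = (1 + k) * fIn - k * bIn;
    double bOut = k * fIn + (1 - k) * bIn;
    return new double[] { fOut, bOut };
  }
}
```

When the two areas are equal, the coefficient is zero and the waves pass through unchanged; as the area change grows, more of each wave is reflected back where it came from.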

Fortunately, there are at least two C rewrites of Pink Trombone, both coincidentally named Voc. The version of Voc by Paul Batchelor is documented using Literate Programming, which is helpful in working out how Pink Trombone works its magic.

Paul includes a quite nice explanation of the implementation of the LF model of glottal flow. While my current glottal model is probably good enough, it’s about time synSinger had an implementation of the LF model as well.

EDIT: I’ve implemented the LF glottal flow model, but the code in Pink Trombone is simplified to be driven off the single tenseness parameter. At some point, I’ll get around to implementing a fuller version.
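For reference, the LF flow derivative itself is short if you take the shape parameters as given. The sketch below is my own, with the implicit continuity equations left unsolved (a full implementation has to solve for alpha and epsilon from the timing parameters):

```java
public class GlottalLF {
  // LF glottal flow derivative.
  // te: end of open phase, tc: end of cycle, ta: return-phase time constant,
  // wg: open-phase sinusoid frequency, ee: excitation strength at te.
  // For a continuous waveform, epsilon must satisfy
  //   epsilon * ta = 1 - exp(-epsilon * (tc - te)),
  // which is NOT solved here.
  static double derivative(double t, double te, double tc, double ta,
                           double alpha, double wg, double epsilon, double ee) {
    if (t < te) {
      // open phase: exponentially growing sinusoid, reaching -ee at te
      return -ee * Math.exp(alpha * (t - te)) * Math.sin(wg * t) / Math.sin(wg * te);
    } else if (t < tc) {
      // return phase: exponential recovery toward zero at closure
      return -(ee / (epsilon * ta))
          * (Math.exp(-epsilon * (t - te)) - Math.exp(-epsilon * (tc - te)));
    }
    return 0;
  }
}
```

Pink Trombone collapses all of this down to the single tenseness parameter, which is part of what makes its version so compact.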



Direct Synthesis – continued

Looking at the FFT of a vowel, you can see that each formant has its own frequency and amplitude. These show up as “blobs” – some stretching for the duration of the wave, others for only a portion:


Waveform of “OO” sound and FFT

For example, F1 lasts for the duration of the wave, while F4 lasts only about half that. F2 and F3 clearly drop in amplitude at the low point of the glottal pulse, then rise again at the start of the new pulse.

To perform direct synthesis, I’m recreating these amplitudes for the formants at various points in time along the wave. To get this information, I use the Spectral Slice feature in Praat:


Spectral slice at glottal pulse maximum of “OO” sound

The maximum amplitude of each formant is typically found at the onset of the glottal pulse, and the minimum a few milliseconds before that.

Using these minimum and maximum amplitudes for each formant, I can create an amplitude envelope to apply to each formant. Summing the formant waves creates a fairly good approximation of the original sound.

The main problem with this approach shows up when the formant frequencies are reset at the beginning of the pulse. As I described in the prior post, I’m applying an envelope to the waveform which drops the amplitude to zero at the reset point. It hides the “click” which would otherwise appear, but the drop in amplitude makes the output waveform buzzy.

To avoid this, I’m going to instead apply a crossfade to each formant when the frequency is reset. There will likely still be a drop in amplitude at the crossfade, but I’m hoping it’s not as noticeable as the current implementation.


Direct Synthesis

One of the problems that I’ve been encountering with the use of digital filters is a “squeal” when parameters rapidly change. Dennis Klatt published a solution for this which recalculated the stored filter coefficients, but I’ve never been able to get the code to work.

The algorithm that S.A.M. (Software Automatic Mouth) uses to generate its output doesn’t use filters. Rather, it directly generates the formants, but resets the angle of each generator back to zero at the start of each glottal pulse.

Here’s an example of a wave that’s the summation of only the formant frequencies:

formants only

Signal that is the sum of the formant frequencies

This does not sound like a vocal sound. The next step is to reset the frequency generators back to zero at the start of each glottal pulse:

resetting formants

Resetting the formant waves at the start of each glottal pulse.

This gives the wave a “vocal” sound. However, instantly resetting the formant frequency generators to zero causes discontinuities, which I’ve marked in red.

One solution is to put an amplitude envelope around the pulse, so the envelope falls to zero at the start of the pulse, at the same point the formant frequencies are reset:

formants with amplitude envelopes

Using an amplitude envelope to smooth the pulse transitions.
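The whole pipeline up to this point fits in a few lines. The sketch below is my own illustration of the idea (not synSinger’s or S.A.M.’s actual code): sum sine generators at the formant frequencies, restart their phases at the top of every glottal pulse, and apply a raised-cosine envelope that falls to zero at the reset point. The formant values in the test are illustrative only:

```java
public class DirectSynth {
  // Direct formant synthesis: formants in Hz, f0 in Hz.
  static double[] synthesize(double sampleRate, double f0,
                             double[] formants, int nSamples) {
    double[] out = new double[nSamples];
    int period = (int) (sampleRate / f0);  // samples per glottal pulse
    for (int i = 0; i < nSamples; i++) {
      int t = i % period;                  // sample index within the pulse
      // envelope falls to zero at the phase reset, hiding the click
      double env = 0.5 - 0.5 * Math.cos(2 * Math.PI * t / period);
      double s = 0;
      for (double f : formants) {
        // phase restarts from zero at the start of every pulse
        s += Math.sin(2 * Math.PI * f * t / sampleRate);
      }
      out[i] = env * s / formants.length;
    }
    return out;
  }
}
```

Because both the envelope and the sine phases hit zero at the same sample, the discontinuity at the reset is masked, at the cost of the amplitude dip discussed above.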

This waveform still has a very mechanical quality, but that can be mitigated (somewhat) by jittering the parameters.

Another factor that helps realism is to add vocal noise. This is typically generated by running noise through filters set to the formant frequencies, and mixing that in with the vocal signal.

But it got me wondering whether the filtered noise could also be directly generated.

I’ve got a method that works fairly well, but I’m sure improvements could be made to it. The basic idea is very much like that of the example above – the formants are generated directly, for a pulse of random length. At the start of the next pulse, the starting angle for each formant frequency generator is set to a random value.

If the duration of the pulse is too long, the output sounds like a signal interrupted by noise. If the pulse is too short, the output sounds like plain noise. But there’s a “sweet spot” between the two where the output resembles pitched noise. Setting the formant frequencies to vowel values gives the sound of a “whispered” vowel.
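Here’s a rough sketch of that idea, with made-up constants for the pulse-length range (the real “sweet spot” has to be tuned by ear):

```java
import java.util.Random;

public class WhisperSynth {
  // Directly generated "whispered" formant noise: each pulse has a
  // random length, and each formant oscillator restarts at a random
  // phase at the start of the next pulse.
  static double[] generate(double sampleRate, double[] formants,
                           int nSamples, long seed) {
    Random rng = new Random(seed);
    double[] out = new double[nSamples];
    double[] phase = new double[formants.length];
    int i = 0;
    while (i < nSamples) {
      // random pulse length (40..119 samples here, purely illustrative)
      int pulseLen = 40 + rng.nextInt(80);
      // restart each formant generator at a random angle
      for (int f = 0; f < phase.length; f++) {
        phase[f] = rng.nextDouble() * 2 * Math.PI;
      }
      for (int t = 0; t < pulseLen && i < nSamples; t++, i++) {
        double s = 0;
        for (int f = 0; f < formants.length; f++) {
          s += Math.sin(phase[f] + 2 * Math.PI * formants[f] * t / sampleRate);
        }
        out[i] = s / formants.length;
      }
    }
    return out;
  }
}
```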

I haven’t had time to generate consonants using this method.



Click and Buzz

While the prototype voice synthesis using spectral envelopes is encouraging, there are still a lot of issues left to resolve.

I tracked down a “click” in the output to sudden transitions in amplitude between frames. Amplitude and frequency are now interpolated over 128 samples, so the transition is smooth.
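The interpolation itself is nothing fancy; a sketch of the amplitude half (names and the 128-sample constant from the text, the rest is my own illustration):

```java
public class FrameInterp {
  // Ramp the amplitude linearly from 'from' to 'to' over up to 128
  // samples starting at 'start', to smooth a frame transition.
  static void rampAmplitude(double[] buf, int start, double from, double to) {
    int n = Math.min(128, buf.length - start);
    if (n < 2) return; // nothing to ramp
    for (int i = 0; i < n; i++) {
      double a = from + (to - from) * i / (n - 1.0);
      buf[start + i] *= a;
    }
  }
}
```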

I was getting some very wrong results with some of the morphs, which I eventually tracked down to a number of bugs in the warping code.

I’ve also added some linear interpolation to the spectral envelope amplitude estimates. If it made any difference, I can’t hear it. But I’ll leave it in anyway.

The output is also still quite “buzzy” and robotic. I’m hoping that adding some jitter to the fundamental frequency will help take care of that. I tried averaging the spectral envelope to smooth it, but that just made the output sound muffled.

I’d experimented with resetting the phase of all the generators back to zero at the start of the pulse (synchronized with the F0), which had the result of making each wave pulse essentially symmetric. That code got tossed out.

The current plan is to continue adding the vowel phonemes and look for issues, and then start working on voiced consonants and see how well they work.


Spectral Synthesis Revisited

I haven’t had a chance to get much coding done over the last couple months. However, I’ve been doing a lot of reading on various vocal synthesis technologies.

I’d read quite a bit about spectral modeling synthesis (SMS) before, and decided to put together some quick tests.

Essentially, SMS uses the short-term FFT to capture a “spectral envelope” – a graph of the amplitude of each harmonic at each frequency.

Getting this information is easy in Praat – there’s an option to get a “spectral slice” that lists the amplitude (in decibels) at given frequencies:

freq(Hz)    pow(dB/Hz)
0    -17.171071546036153
21.533203125    20.388397279697497
43.06640625    26.97257113598713
64.599609375    30.64585953310637
86.1328125    32.25338096741935
107.666015625    31.708646066090367
129.19921875    29.056762741248782
150.732421875    28.923892178490085
172.265625    32.719805559695565
... and so on

Converting the decibels to a linear value looks like this:

function dbToLinear( x )
  return math.pow(10, x/20)
end

To convert the spectral envelope back into sound, the process is reversed. You can generate a bunch of sine waves that are multiples of the fundamental frequency, and use the spectral envelope to look up the amplitude of each frequency.

Or if you’re clever, you can do an inverse FFT and stitch the frames together.
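The first (non-clever) approach can be sketched like this. It’s an illustration with my own names, assuming the envelope holds linear amplitudes spaced binWidth Hz apart:

```java
public class Additive {
  // Additive resynthesis of one frame: sum harmonics of f0, each
  // with an amplitude looked up from the spectral envelope.
  static double[] resynthesize(double[] envelope, double binWidth,
                               double f0, double sampleRate, int nSamples) {
    double[] out = new double[nSamples];
    int nHarmonics = (int) ((envelope.length - 1) * binWidth / f0);
    for (int h = 1; h <= nHarmonics; h++) {
      double freq = h * f0;
      // nearest-bin lookup; linear interpolation is a possible refinement
      double amp = envelope[(int) Math.round(freq / binWidth)];
      for (int i = 0; i < nSamples; i++) {
        out[i] += amp * Math.sin(2 * Math.PI * freq * i / sampleRate);
      }
    }
    return out;
  }
}
```

All the sine waves start at phase zero here, which is one of the simplifications that can make the result sound mechanical.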

I’m exploring the idea of using a single spectral slice to represent each phoneme target, and morphing from one target to the next. For the morphs to work, key features need to be specified. Conveniently, this corresponds to peaks at the formant frequencies – something that Praat also calculates.

In initial tests of the morphs, the formants seem to move fairly naturally.

However, using a single spectral slice to represent the sound creates a mechanical sounding voice, much like a door buzzer. Altering the amplitude and fundamental frequency may help solve that problem.

SMS generally also models the residual of the voice – the part of the sound that’s not represented by the harmonic portion. One option I’m looking into is using formant synthesis to generate the residual portion by passing white noise through filters.


Low-Budget Spectral Envelopes

Some time back I’d written some Java code that attempted to create a rough spectral envelope by picking the n highest points in an FFT. It was moderately successful, but I eventually decided to take the project down the synthesis route.

There have been some requests to see the code. Since WordPress won’t let me attach the source code, I’ve posted the code below. I’ve made a half-hearted attempt to format it, but since WordPress loves to screw up my formatting… Ah, well…

Enjoy! I make no claims that this works, and you’ll obviously need an FFT class to make it usable.

Essentially, the code copies the magnitudes to an array, and then finds the highest point. It then zeros out points around the picked point so no points near that peak will be selected, and then picks another point. It can then linearly interpolate the spectral envelope from the peaks.

It’s not elegant, and there are other well-documented algorithms.

import java.util.Arrays;
import java.util.Hashtable;

public class SpectralEnvelope {

  // sorted list of peaks
  int[] peakBin;

  // values corresponding to the peaks
  double[] peakValue;

  // holds all the envelope values
  double[] envelope;

  /**
   * Create a spectral envelope for the buffer. The buffer
   * is assumed to be FFT data, and only the first half of the
   * data is used. The zeroWidth determines the precision of
   * the envelope. Exact precision (1) creates artifacts.
   * @param buffer FFT buffer containing data.
   * @param wide Number of sample points to use
   * @param peakCount Number of peaks to gather
   */
  public void build(double[] buffer, int wide, int peakCount) {
    // clear the envelope
    envelope = new double[wide];

    // copy of the data, because used peaks will have to be flagged
    double[] tmp = new double[wide];
    System.arraycopy(buffer, 0, tmp, 0, wide);

    // list of peak indexes
    peakBin = new int[peakCount];

    // add the first and last points
    peakBin[0] = 0;
    peakBin[1] = wide - 1;

    int peaksFound = 2;

    // estimate width needed to get requested peak count
    int zeroWidth = wide / peakCount / 2;

    // identify peaks, starting from the highest
    while (peaksFound < peakCount) {
      // clear the max
      double max = 0;
      int maxAt = 0;

      // search for the highest value
      for (int j = 0; j < wide; j++) {
        if (tmp[j] > max) {
          max = tmp[j];
          maxAt = j;
        }
      }

      // all peaks are found, exit
      if (max == 0) break;

      // store peak and increment peak count
      peakBin[peaksFound++] = maxAt;

      // zero area around peak so it will be ignored
      int zeroStart = Math.max(maxAt - zeroWidth, 0);
      int zeroEnd = Math.min(maxAt + zeroWidth + 1, tmp.length);
      for (int index = zeroStart; index < zeroEnd; index++) {
        tmp[index] = 0;
      }
    }

    // sort the peaks
    Arrays.sort(peakBin);

    // store the values of the peaks
    peakValue = new double[peakCount];
    for (int i = 0; i < peakCount; i++) {
      // get the value from the array and store it
      peakValue[i] = buffer[peakBin[i]];
    }

    // connect the peaks
    double delta = 0;
    for (int i = 1; i < peakBin.length; i++) {
      // get the next peak
      int thisBinIndex = peakBin[i];
      int priorBinIndex = peakBin[i - 1];

      // determine the slope
      int dy = thisBinIndex - priorBinIndex;

      // handle division by zero
      if (dy == 0) {
        delta = 0;
      } else {
        delta = (peakValue[i] - peakValue[i - 1]) / dy;
      }

      // set the magnitudes
      double m = peakValue[i - 1];
      for (int j = priorBinIndex; j < thisBinIndex; j++) {
        envelope[j] = m;
        m += delta;
      }
    }
  }

  /**
   * Linearly interpolates a spectral envelope between the source and
   * target envelopes.
   * @param source Initial envelope (amount = 0)
   * @param target Target envelope (amount = 1)
   * @param amount Amount to interpolate by (0..1)
   */
  void interpolateEnvelope(SpectralEnvelope source, SpectralEnvelope target, double amount) {

    // number of peaks
    int peakCount = target.peakBin.length;

    // create an array to hold the interpolated peaks
    int[] lerpBin = new int[peakCount];

    // create a hashtable to hold the interpolated peak values
    Hashtable<Integer, Double> lerpValue = new Hashtable<Integer, Double>();

    // interpolate the values
    for (int i = 0; i < peakCount; i++) {
      // interpolated bin number
      lerpBin[i] = (int) Util.lerp(source.peakBin[i], target.peakBin[i], amount);
      // interpolated value
      lerpValue.put(lerpBin[i], Util.lerp(source.peakValue[i], target.peakValue[i], amount));
    }

    // clear the envelope
    envelope = new double[source.envelope.length];

    // sort the bins
    Arrays.sort(lerpBin);

    // create the spectral envelope using the interpolated values
    double priorValue = lerpValue.get(lerpBin[0]);
    double delta = 0;
    for (int i = 1; i < peakCount; i++) {
      // get the next peak
      int thisBinIndex = lerpBin[i];
      int priorBinIndex = lerpBin[i - 1];

      // get this value
      double thisValue = lerpValue.get(lerpBin[i]);

      // determine the slope
      int dy = thisBinIndex - priorBinIndex;

      // handle division by zero
      if (dy == 0) {
        delta = 0;
      } else {
        // calculate slope
        delta = (thisValue - priorValue) / dy;
      }

      // set the magnitudes
      double m = priorValue;
      for (int j = priorBinIndex; j < thisBinIndex; j++) {
        envelope[j] = m;
        m += delta;
      }

      // save current as prior value
      priorValue = thisValue;
    }
  }

  /**
   * Since a single sample isn't likely to create an accurate spectral
   * envelope, this function is used to accumulate the maximum values
   * over a number of samples.
   * @param fft The FFT processor
   * @param buffer Buffer with the data to sample
   * @param pos Starting position
   * @param steps Number of samples
   * @param stepSize Distance between the samples
   * @return Accumulated maximum magnitudes
   * @throws Exception
   */
  final static double[] accumulateSteps(ProcessFFT fft, double[] buffer, int pos, int steps, int stepSize) throws Exception {
    // buffer holding the results
    double[] accum = new double[fft.frameSize2];
    // iterate through the buffer, decrementing the count
    for (; steps > 0; steps--, pos += stepSize) {
      // process the fft
      fft.analyze(buffer, pos);

      // save the high values to the accumulator
      for (int i = 0; i < fft.frameSize2; i++) {
        accum[i] = Math.max(accum[i], fft.gAnaMagn[i]);
      }
    }
    return accum;
  }
}

The call to the pitch shifting looks something like this:


    /**
     * Pitch shifting with formant preservation
     */
    public void pitchShiftWithFormantPreservation() {
        // create a spectral envelope
        this.spectralEnvelope = new SpectralEnvelope();
        // build the spectral envelope from the analysis magnitudes
        spectralEnvelope.build(this.gAnaMagn, this.frameSize2, 100);
        // get the envelope
        double[] envelope = spectralEnvelope.envelope;
        // transpose the pitch
        for (int i = 0; i < this.frameSize2; i++) {
            // get the index of the target
            int target = (int)(i * this.pitchShift);
            // exit loop if past the end
            if (target >= this.frameSize2) break;
            // scale relative to the envelope
            // need to check for division by zero
            if (envelope[i] == 0 || envelope[target] == 0) {
                gSynMagn[target] = 0;
            } else {
                gSynMagn[target] = this.gAnaMagn[i] / envelope[i] * envelope[target];
            }
            gSynFreq[target] = gAnaFreq[i] * this.pitchShift;
        }
    }