Building Tools

I’m continuing to work on the phoneme “editor” for synSinger. Since the automatic detection of formants is still unreliable, I’ve added a feature that allows selecting a formant with the mouse and clicking on the proper position to correct it. By default, the formant is “snapped” to the most likely candidate at that position, although the “snap” option can be turned off:

Editor with formants overlaid on FFT display

Since the code is interpreted, it’s a bit sluggish. But even a relatively coarse FFT display is sufficient to identify which formants need to be corrected.

My end goal is to build something along the lines of UTAU’s OTO tool, where all the phoneme data that synSinger needs can be created via this tool.

Posted in Uncategorized


Here’s a demo of synSinger with the old standard Twinkle, Twinkle, Little Star:

This version uses sampled consonants.

There are a number of problems with this, but it does show off the clarity of the new spectral synthesis rendering engine.

Posted in Development

Synthesizing Frication

synSinger has been using sampled phonemes for voiced and unvoiced fricatives such as /V/ and /SH/. This works well, but I’ve been looking into synthesizing the consonants so they are better integrated into the rendering framework.

Frication noise is made by creating a constriction in the airflow. In formant synthesis, this can be simulated by passing noise through parallel filters.

In spectral synthesis, this can be simulated by creating harmonics, but resetting the phase to zero at random intervals.

For sounds like /V/ and /Z/, the first 8 or so harmonics are flagged as “harmonic”, and processed normally. For the higher harmonics, the phase is reset randomly.

For sounds like /S/ and /F/, there are no “real” harmonics, so all the harmonics have their phases reset randomly.
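As a rough illustration of the idea, here is a minimal sketch in Lua. It is not synSinger's actual code: the function name, the flat harmonic amplitudes, and the per-sample reset probability are all assumptions made for the example.

```lua
-- Sketch only: additive synthesis where "noisy" harmonics have their
-- phase reset to zero at random intervals. All names and constants here
-- are assumptions, not taken from synSinger.
local function renderFricative(f0, sampleRate, numSamples, numHarmonics, firstNoisy, resetChance)
  local phase = {}
  for h = 1, numHarmonics do phase[h] = 0 end
  local out = {}
  for i = 1, numSamples do
    local sum = 0
    for h = 1, numHarmonics do
      -- harmonics at or above firstNoisy are treated as frication noise
      if h >= firstNoisy and math.random() < resetChance then
        phase[h] = 0
      end
      -- every harmonic gets the same amplitude to keep the sketch short
      sum = sum + math.sin(phase[h]) / numHarmonics
      phase[h] = phase[h] + 2 * math.pi * f0 * h / sampleRate
    end
    out[i] = sum
  end
  return out
end

-- voiced fricative such as /V/: the first 8 harmonics stay periodic
local voiced = renderFricative(110, 44100, 4410, 40, 9, 0.01)
-- unvoiced fricative such as /S/: every harmonic has its phase reset randomly
local unvoiced = renderFricative(110, 44100, 4410, 40, 1, 0.01)
```

Raising the reset probability makes the upper harmonics sound noisier; lowering it lets them drift back toward a pitched tone.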

The result is a reasonably good facsimile of the fricative sound, both voiced and unvoiced.

Implementing this required rewriting the rendering engine. Where the old code crossfaded parameters, the new code crossfades harmonics when a fricative is encountered.

Unfortunately, the method doesn’t work that well for adding a “whisper” to the vowels, as it’s a bit too rough.

I’ll look into replacing the attack portion of plosive phonemes such as /B/ and /T/ next. If successful, I’ll be able to remove samples from synSinger entirely, and the output will be completely synthetic.

Posted in Development

Formant Tracking

I’ve implemented basic formant tracking. It works fairly well, as long as the frequency detection code works properly:


However, it’s very slow, and terribly inefficient. I’m re-using code that hasn’t been optimized for the task at all.

It can also fail terribly, so there’s still a lot of work left to do.

For each cycle, the top n peaks are selected. A peak is defined as a point surrounded by lower points. These peaks are then sorted by amplitude.
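A minimal sketch of that peak selection might look like the following. The `mags` array (one magnitude per frequency bin) and the function name are hypothetical, and the real code may differ.

```lua
-- Sketch only: pick the top n local maxima from one cycle's spectrum.
-- `mags` is a hypothetical array of magnitudes indexed by frequency bin.
local function topPeaks(mags, n)
  local peaks = {}
  for i = 2, #mags - 1 do
    -- a peak is a point whose two neighbours are both lower
    if mags[i] > mags[i - 1] and mags[i] > mags[i + 1] then
      peaks[#peaks + 1] = { bin = i, amp = mags[i] }
    end
  end
  -- loudest first
  table.sort(peaks, function(a, b) return a.amp > b.amp end)
  -- drop everything past the first n entries
  for i = #peaks, n + 1, -1 do peaks[i] = nil end
  return peaks
end

local peaks = topPeaks({ 0, 3, 1, 5, 2, 2, 4, 1 }, 2)
```

Note that a flat-topped peak (two equal neighbouring bins) is skipped by this strict comparison, which may or may not be desirable.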

These peaks are tracked by associating peaks to the track with the closest prior peak. Each time a peak is added to a track, the track’s score is increased. In addition to tracking the frequency, the amplitude is also kept.

Tracks are processed in order of highest amplitude, which helps prevent peaks from being grabbed by nearby tracks.

If no peak is near a track, the track continues re-using the last frequency, but does not accrue a score. If a peak is not associated with a track, then a new track is created with a score of zero, back-filling prior values with the current frequency.
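One step of the tracking described above could be sketched as follows. The table layout, the function name, and the `maxJump` threshold that decides what counts as "near" are all assumptions made for illustration.

```lua
-- Sketch only: advance all tracks by one frame.
-- maxJump (in Hz) is an assumed threshold for what counts as "near".
local function trackStep(tracks, peaks, frame, maxJump)
  -- the loudest tracks go first so they get first pick of the peaks
  table.sort(tracks, function(a, b) return a.amp > b.amp end)
  local used = {}
  for _, t in ipairs(tracks) do
    local last = t.freq[frame - 1]
    local best, bestDist
    for i, p in ipairs(peaks) do
      local d = math.abs(p.freq - last)
      if not used[i] and d <= maxJump and (not bestDist or d < bestDist) then
        best, bestDist = i, d
      end
    end
    if best then
      used[best] = true
      t.freq[frame] = peaks[best].freq
      t.amp = peaks[best].amp
      t.score = t.score + 1
    else
      t.freq[frame] = last -- coast on the last frequency, no score added
    end
  end
  -- unclaimed peaks start new tracks, back-filled with the current frequency
  for i, p in ipairs(peaks) do
    if not used[i] then
      local t = { freq = {}, amp = p.amp, score = 0 }
      for f = 1, frame do t.freq[f] = p.freq end
      tracks[#tracks + 1] = t
    end
  end
end

local tracks = {}
trackStep(tracks, { { freq = 500, amp = 1.0 }, { freq = 1500, amp = 0.5 } }, 1, 200)
trackStep(tracks, { { freq = 520, amp = 0.9 }, { freq = 2500, amp = 0.3 } }, 2, 200)
```

In the second frame the 500 Hz track claims the 520 Hz peak, the 1500 Hz track coasts, and the 2500 Hz peak starts a new back-filled track.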

When the tracking is completed, the top 4 tracks are selected, and sorted by frequency.

Finally, the tracks are smoothed by averaging each point with the surrounding points.
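The smoothing pass can be sketched as a simple moving average. The three-point window used here is an assumption; the post doesn't say how many surrounding points are averaged.

```lua
-- Sketch only: smooth a track by averaging each point with its neighbours.
-- The 3-point window is an assumption.
local function smooth(values)
  local out = {}
  for i = 1, #values do
    local sum, count = 0, 0
    -- the window is clipped at the ends of the track
    for j = math.max(1, i - 1), math.min(#values, i + 1) do
      sum, count = sum + values[j], count + 1
    end
    out[i] = sum / count
  end
  return out
end

local smoothed = smooth({ 1, 4, 1, 4 })
```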

While I’ll continue working on this, for the immediate future it makes sense to continue using Praat to calculate formants.

Posted in Uncategorized

Identifying Formants

Praat generally does an excellent job of identifying formants, in contrast to my own feeble attempts.

Nevertheless, I continue to get sidetracked on the task of correctly identifying formants. My current method is fairly simple: I identify all the peaks in a fairly coarse DFT, and sort them by frequency. A “peak” is simply a local maximum, where the prior and next points are below the current point.

The results are fairly good, but there are occasional spots where formants “jump around”, so there’s a need to add a tracking mechanism of some sort, which I haven’t gotten around to writing.

For fun, I decided to overlay the formants on a display of the FFT, to see how accurate they were.

I’d played around with rendering FFTs before, but could never render them in a way that resembled the sonograms in other programs.

After convincing myself that the code actually was working, I finally settled on using the square root of the magnitude, along with a fudge factor that reduced the amplitude of lower frequencies. The algorithm for pseudo-color was reverse engineered from an image I found on Wikipedia:

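-- lerp is assumed here to be plain linear interpolation from a to b
local function lerp( a, b, t ) return a + (b - a) * t end
r = math.floor(lerp( 19, 252, n ))
b = math.floor(lerp( 108, 52, n ))
if n > .5 then
   g = math.floor(lerp( 19, 252, (n*2)-1 ))
else
   g = 0
end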

The result was slow, but pretty:


Estimated formants overlaid onto an FFT of the wave.

I’ve added cosine windowing, and that cleaned up the results.
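For reference, a cosine window can be sketched as below. This assumes the Hann (raised cosine) window, which is one common choice; the post doesn't say which cosine window is actually used, and the function names are made up for the example.

```lua
-- Sketch only: a Hann (raised cosine) window, assumed here as the
-- "cosine window". Expects n >= 2.
local function hann(n)
  local w = {}
  for i = 0, n - 1 do
    w[i + 1] = 0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
  end
  return w
end

-- multiply a block of samples by the window before taking the FFT
local function applyWindow(samples)
  local w = hann(#samples)
  local out = {}
  for i = 1, #samples do out[i] = samples[i] * w[i] end
  return out
end

local windowed = applyWindow({ 1, 1, 1, 1, 1 })
```

Tapering the block to zero at both ends reduces the spectral leakage caused by the abrupt edges of an unwindowed block.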

This version of the code merely identifies peaks. Adding peak tracking seems like something that’s fairly doable.

So I think it’s worth taking a bit more time to see if I can roll this code myself, instead of relying on Praat.

Posted in Development

Interpolation to the Rescue

I’d been using linear interpolation to estimate the spectral envelope between sample points. With the addition of code to estimate “missing” peaks, it worked rather well.

However, I did some experimentation, and found that using a Hermite spline seemed to give good results as well:

Estimating the spectral envelope with a Hermite spline.

It didn’t seem to make much difference audibly, but the implementation of the spline was simpler than the “missing” peak code, so synSinger now uses this for estimating the spectral envelope amplitudes. Phase is still linearly interpolated.
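A cubic Hermite interpolation of the envelope might look like the sketch below. It assumes Catmull-Rom style tangents (half the difference between the neighbouring points) and evenly spaced sample points; the post doesn't specify either, and both function names are hypothetical.

```lua
-- Sketch only: cubic Hermite interpolation between p1 and p2, with
-- Catmull-Rom style tangents (an assumption). t runs from 0 to 1.
local function hermite(p0, p1, p2, p3, t)
  local m1 = (p2 - p0) / 2 -- tangent at p1
  local m2 = (p3 - p1) / 2 -- tangent at p2
  local t2, t3 = t * t, t * t * t
  return (2 * t3 - 3 * t2 + 1) * p1
       + (t3 - 2 * t2 + t) * m1
       + (-2 * t3 + 3 * t2) * p2
       + (t3 - t2) * m2
end

-- Estimate the envelope at a fractional index into `amps`,
-- e.g. 2.5 is halfway between amps[2] and amps[3].
local function envelopeAt(amps, pos)
  local i = math.floor(pos)
  if i < 1 then return amps[1] end
  if i >= #amps then return amps[#amps] end
  -- repeat the end points when a neighbour falls outside the array
  local function at(k) return amps[math.max(1, math.min(#amps, k))] end
  return hermite(at(i - 1), at(i), at(i + 1), at(i + 2), pos - i)
end
```

Because the curve passes through every sample point, the measured amplitudes are preserved exactly and only the values in between change.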

During the wave reconstruction, the amplitude and phase are now linearly interpolated across the wave. Prior versions were static, calculating the initial value, and using that single value to generate the wave. While it was fast, there was no guarantee that the wave would line up with the following wave.

By interpolating the parameters from the start to the end of the wave, it’s guaranteed that the waves will connect smoothly. While more expensive computationally, the resulting audio is dramatically clearer, eliminating much of the “buzzy” sound that had been plaguing the output.
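The difference from the static approach can be sketched like this. The field names (`amp0`, `amp1`, `phase0`, `phase1`) and the function names are hypothetical; the point is only that each harmonic's amplitude and phase ramp from their start values to their end values across the wave.

```lua
-- Sketch only: render one wave with per-harmonic amplitude and phase
-- linearly interpolated from start to end. Field names are hypothetical.
local function lerp(a, b, t) return a + (b - a) * t end

local function renderCycle(harmonics, cycleLen)
  local out = {}
  for i = 1, cycleLen do
    local t = (i - 1) / cycleLen -- 0 at the start, approaching 1 at the end
    local sum = 0
    for h, p in ipairs(harmonics) do
      local amp = lerp(p.amp0, p.amp1, t)
      local phase = lerp(p.phase0, p.phase1, t)
      sum = sum + amp * math.sin(2 * math.pi * h * t + phase)
    end
    out[i] = sum
  end
  return out
end

-- one harmonic fading in from silence to full amplitude over the wave
local wave = renderCycle({ { amp0 = 0, amp1 = 1, phase0 = 0, phase1 = 0 } }, 4)
```

If the next wave uses this wave's end values as its own start values, the two join without a jump in amplitude or phase.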

Posted in Uncategorized

Silent Letter Revisited

I finally implemented the /H/ sound using the new rendering engine. Since the aspiration was already part of the code, it was “just” a matter of turning off the vocal sound, and raising the aspiration value.

Of course, nothing is as simple as that, and the /H/ is an interesting phoneme. It has to:

  • Look ahead and set the formants based on the following vowel, since /H/ takes the formants of the vowel that follows. If none is found, a neutral vowel is used.
  • Turn off the voicing, if needed.
  • Turn up the aspiration.
  • Render the target.
  • Turn down the aspiration.
  • Turn on the voicing if a voiced phoneme follows.
  • Set the pitch without a transition.
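The look-ahead in the first step could be sketched as follows. The phoneme names, the vowel table, the choice of neutral vowel, and the decision to search past intervening consonants are all assumptions made for illustration.

```lua
-- Sketch only: find the vowel whose formants /H/ should borrow.
-- The vowel set and the neutral vowel below are hypothetical.
local VOWELS = { AA = true, IY = true, UW = true, AX = true }
local NEUTRAL = "AX"

local function formantSourceForH(phonemes, i)
  -- scan forward from the /H/ at index i for the next vowel
  for j = i + 1, #phonemes do
    if VOWELS[phonemes[j]] then return phonemes[j] end
  end
  -- nothing found: fall back to a neutral vowel
  return NEUTRAL
end

local source = formantSourceForH({ "HH", "AA", "T" }, 1)
```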

While I was working on that, I also fixed the amplitude, vibrato and pitch change code, since they all pretty much shared the same mechanism.

I also corrected a bunch of errors in the timing code, so notes are now rendered with the correct duration.

I’ve still got a number of fricatives to add, as well as r-colored vowels.

The output is currently a bit warbly because the phonemes aren’t balanced. Once I work out exactly what needs to be recorded, I’ll re-record everything in a single session.

I also need to work on the prosody code, since it currently sounds very mechanical, not smooth and legato.

Posted in Uncategorized