More Work on the Phoneme Editor

While I’m pretty pleased with the overall progress of the phoneme editor, a few phonemes in the preview mode have sounded significantly worse than earlier versions, even though the waveforms and base code are essentially the same.

After some code tweaks, the phonemes are now sounding somewhat better, which is reassuring. It won’t be until I finish the phonemes and start connecting them that I’ll know how much the quality has changed.

The unvoiced and semi-voiced phonemes such as /Z/ and /ZH/ don’t sound as good as the sampled versions, but they’re passable, so I don’t expect to make any immediate changes there. I do expect the code handling unvoiced sounds to be modified once I get things up and running again.

Because the phonemes now derive timing information from the sounds themselves – rather than from hard-coded values – I’ll likely need to re-record all the samples being analyzed.

Because of this, I’m reviewing the recording list and looking for places where additional specialized phonemes should be added.


Posted in Uncategorized | Leave a comment

Phoneme Target Voicing

In the current design of synSinger, phonemes can have one or more distinct “targets”.

For example, the vowel /AH/ has two targets, one at the beginning of the vowel, and one at the end, reflecting the subtle changes of the mouth as the vowel is pronounced.

A more complex sound, such as /OR/ may have 5 or 6 targets:

Spectrogram of /OR/ with targets marked

Other phonemes not only have multiple targets, but have different voicing types for each target. For example, the initial stop consonant /P/ includes:

  1. A short silence
  2. An unvoiced burst
  3. A semi-voiced aspiration
  4. A voiced placement target, dependent on the following vowel

Unfortunately, this isn’t how synSinger is currently built, so code is going to have to be rewritten to handle this.

Each target will have one of the following types of voicing assigned to it:

  1. Silent: No sound at all.
  2. Voiced: For example, a vowel
  3. Unvoiced: For example, a plosive burst or frication
  4. Semi-voiced: For example, aspiration that has a small voiced component
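
These voicing types could be represented roughly as follows. This is just an illustrative sketch, not synSinger’s actual code – the `Voicing` and `Target` names, and the formant values in the /P/ example, are all hypothetical:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Voicing(Enum):
    SILENT = auto()       # no sound at all
    VOICED = auto()       # e.g. a vowel
    UNVOICED = auto()     # e.g. a plosive burst or frication
    SEMI_VOICED = auto()  # e.g. aspiration with a small voiced component

@dataclass
class Target:
    time: float                       # position within the phoneme, in seconds
    voicing: Voicing
    formants: list = field(default_factory=list)  # formant frequencies in Hz

# An initial /P/ might then be described as a sequence of targets
# (times and formants here are made-up illustrative values):
p_targets = [
    Target(0.00, Voicing.SILENT),                        # short silence
    Target(0.02, Voicing.UNVOICED),                      # unvoiced burst
    Target(0.04, Voicing.SEMI_VOICED),                   # aspiration
    Target(0.07, Voicing.VOICED, [700, 1220, 2600]),     # placement, vowel-dependent
]
```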

Currently, /H/ is being generated by special aspiration code. Since the sound of /H/ varies depending on the vowel that follows, I’ll probably need to create a unique /H/ target for each vowel.

Posted in Uncategorized | Leave a comment

Refining Tools

I’ve continued to enhance the phoneme editor. The phoneme list has been integrated into the editor:

Phoneme list

Selecting a phoneme on the list automatically loads the associated file. The targets are displayed with the wave, and an FFT is automatically performed on the selected portion of the wave:

synSinger phoneme target editor.

Phoneme targets are aligned to cycle boundaries. A target formant is repositioned by clicking on it, and then clicking on the new frequency. There are plenty of refinements that could be made, but it works well enough.

I’ve still got work to do integrating the consonants into the editor. Once that’s done, I should be able to go back to building the remaining stop consonants.

Posted in Uncategorized | Leave a comment

Building Tools

I’m continuing to work on the phoneme “editor” for synSinger. Since the automatic detection of formants is still unreliable, I’ve added a feature that allows selecting a formant with the mouse and clicking on its proper position to correct it. By default, the formant is “snapped” to the most likely candidate near the clicked position, although the “snap” option can be turned off:

Editor with formants overlaid on FFT display

Since the code is interpreted, it’s a bit sluggish. But even a relatively coarse FFT display is sufficient to identify which formants need to be corrected.
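
The “snap” behavior can be sketched as finding the nearest local peak in the FFT magnitudes around the clicked position. This is just an illustration of the idea, not the editor’s actual code; the function name and window size are assumptions:

```python
def snap_to_peak(mags, clicked_bin, window=10):
    """Snap a clicked FFT bin to the nearest local peak within +/- window bins.

    A peak is a bin whose magnitude exceeds both of its neighbors. If no
    peak is found in range, the clicked bin is returned unchanged (i.e.
    the behavior with "snap" turned off).
    """
    lo = max(1, clicked_bin - window)
    hi = min(len(mags) - 1, clicked_bin + window + 1)
    peaks = [i for i in range(lo, hi)
             if mags[i] > mags[i - 1] and mags[i] > mags[i + 1]]
    if not peaks:
        return clicked_bin
    return min(peaks, key=lambda i: abs(i - clicked_bin))
```

Even a fairly coarse FFT gives clear enough peaks for this kind of snapping to work.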

My end goal is to build something along the lines of UTAU’s OTO tool, where all the phoneme data that synSinger needs can be created via this tool.

Posted in Uncategorized | 2 Comments


Here’s a demo of synSinger with the old standard Twinkle, Twinkle, Little Star:

This version uses sampled consonants.

There are a number of problems with this, but it does show off the clarity of the new spectral synthesis rendering engine.

Posted in Development | Tagged , | Leave a comment

Synthesizing Frication

synSinger has been using sampled phonemes for voiced and unvoiced fricatives such as /V/ and /SH/. This works well, but I’ve been looking into synthesizing the consonants so they are better integrated into the rendering framework.

Frication noise is made by creating a constriction in the airflow. In formant synthesis, this can be simulated by passing noise through parallel filters.

In spectral synthesis, this can be simulated by creating harmonics, but resetting the phase to zero at random intervals.

For sounds like /V/ and /Z/, the first 8 or so harmonics are flagged as “harmonic”, and processed normally. For the higher harmonics, the phase is reset randomly.

For sounds like /S/ and /F/, there are no “real” harmonics, so all of the harmonics have their phases reset randomly.

The result is a reasonably good facsimile of the fricative sound, both voiced and unvoiced.
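
The phase-reset idea can be sketched in a few lines. This is an illustrative reimplementation, not synSinger’s actual rendering code, and the function name, harmonic count, and reset probability are all assumptions:

```python
import math
import random

def fricative_frame(f0, amps, sr=44100, n=1024, n_harmonic=8,
                    reset_prob=0.01, rng=None):
    """Render n samples of a fricative-like signal by summing partials of f0.

    The first n_harmonic partials keep a continuous phase ("real" harmonics,
    as for voiced fricatives like /V/ or /Z/); higher partials have their
    phase reset to zero at random intervals, which turns them into
    band-limited noise. Setting n_harmonic=0 makes every partial noisy,
    approximating an unvoiced fricative like /S/ or /F/.
    """
    rng = rng or random.Random(0)
    phases = [0.0] * len(amps)
    out = []
    for _ in range(n):
        sample = 0.0
        for k, a in enumerate(amps, start=1):
            phases[k - 1] += 2 * math.pi * k * f0 / sr
            if k > n_harmonic and rng.random() < reset_prob:
                phases[k - 1] = 0.0  # random phase reset -> noisy partial
            sample += a * math.sin(phases[k - 1])
        out.append(sample)
    return out
```

For a voiced fricative, something like `fricative_frame(110.0, amps, n_harmonic=8)` keeps the low harmonics periodic; for /S/ or /F/, `n_harmonic=0` randomizes everything.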

Implementing this required rewriting the rendering engine. Where the old code crossfaded parameters, the new code crossfades harmonics when a fricative is encountered.

Unfortunately, the method doesn’t work that well for adding a “whisper” to the vowels, as it’s a bit too rough.

I’ll look into replacing the attack portion of plosive phonemes such as /B/ and /T/ next. If successful, I’ll be able to completely remove samples from synSinger and the output will be completely synthetic.

Posted in Development | Tagged , , , | Leave a comment

Formant Tracking

I’ve implemented basic formant tracking. It works fairly well, as long as the frequency detection code works properly:


However, it’s very slow and terribly inefficient – I’m re-using code that hasn’t been optimized for the task at all.

It also fails badly at times, so there’s still a lot of work left to do.

For each cycle, the top n peaks are selected. A peak is defined as a point surrounded by lower points. These peaks are then sorted by amplitude.

Peaks are tracked by associating each peak with the track whose most recent peak is closest in frequency. Each time a peak is added to a track, the track’s score is increased. In addition to the frequency, each peak’s amplitude is also kept.

Tracks are processed in order of highest amplitude, which helps prevent peaks from being grabbed by nearby tracks.

If no peak is near a track, the track continues re-using the last frequency, but does not accrue a score. If a peak is not associated with a track, a new track is created with a score of zero, back-filling prior values with the current frequency.

When the tracking is completed, the top 4 tracks are selected, and sorted by frequency.

Finally, the tracks are smoothed by averaging each point with the surrounding points.
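
Putting the steps above together, the tracker can be sketched roughly as follows. This is an illustrative reimplementation working on FFT bin indices rather than Hz; the function names, peak counts, and jump threshold are all assumptions, not synSinger’s actual values:

```python
def pick_peaks(mags, top_n):
    """Top-n spectral peaks (a bin higher than both neighbors), loudest first."""
    peaks = [(mags[i], i) for i in range(1, len(mags) - 1)
             if mags[i] > mags[i - 1] and mags[i] > mags[i + 1]]
    return sorted(peaks, reverse=True)[:top_n]

def track_formants(frames, top_n=8, max_jump=3, n_formants=4):
    """Greedy formant tracking over a list of per-cycle magnitude spectra.

    Peaks are serviced loudest-first and attached to the free track whose
    last frequency is closest (within max_jump bins), which helps keep
    nearby tracks from grabbing each other's peaks. Unmatched tracks coast
    on their last frequency without scoring; unmatched peaks start new
    tracks back-filled with the current frequency. Returns the n_formants
    best-scoring tracks, sorted by frequency and 3-point smoothed.
    """
    tracks = []  # each: {'freqs': [...], 'score': int}
    for t, mags in enumerate(frames):
        matched = set()
        for amp, bin_ in pick_peaks(mags, top_n):
            best = None
            for i, tr in enumerate(tracks):
                if i in matched:
                    continue
                d = abs(tr['freqs'][-1] - bin_)
                if d <= max_jump and (best is None or d < best[0]):
                    best = (d, i)
            if best is not None:
                tracks[best[1]]['freqs'].append(bin_)
                tracks[best[1]]['score'] += 1
                matched.add(best[1])
            else:
                tracks.append({'freqs': [bin_] * (t + 1), 'score': 0})
                matched.add(len(tracks) - 1)
        for i, tr in enumerate(tracks):
            if i not in matched:
                tr['freqs'].append(tr['freqs'][-1])  # coast, no score
    best = sorted(tracks, key=lambda tr: tr['score'], reverse=True)[:n_formants]
    best.sort(key=lambda tr: tr['freqs'][-1])
    return [[(f[max(0, i - 1)] + f[i] + f[min(len(f) - 1, i + 1)]) / 3
             for i in range(len(f))]
            for tr in best for f in [tr['freqs']]]
```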

While I’ll continue working on this, for the immediate future it makes sense to continue using Praat to calculate formants.

Posted in Uncategorized | Leave a comment