Articulation

One of the goals of synSinger is clear articulation. I’d always assumed that interpolating formant transitions with non-linear curves would give better results than simple linear interpolation.

But while the results may be more “natural”, they aren’t as clear and articulated. I’ve been making a number of adjustments to the code over the last couple of days, without any really large change in the output.

So I was disappointed when I entered in a new song, and found that the output was mushy and hard to make out.

There were also a number of other artifacts, most of which had to do with adjusting the timing of phonemes. For example, I’d given a much too short transition time for vowel-to-vowel transitions, so there were “bumps” during the rapid transition. Allowing a longer transition fixed the problem.

I also tweaked the /YU/ phoneme, and added an allophonic replacement rule for /ERL/.

But the biggest issue was using a “more curved” ease function instead of simple linear interpolation. Changing the interpolation back to linear vastly increased the comprehensibility.

So for now, I’ve switched to using a cubic ease function for interpolating the formants. It’s smoothly curved at the start and stop, but looks quite linear for the remainder, which gives a “snappier” articulation.
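
A curve along those lines is a smoothstep-style cubic; the following is only a sketch of the idea, and the exact curve synSinger uses may differ:

-- smoothstep-style cubic ease: flat at t=0 and t=1, nearly linear in between
local function cubicEase( t )
    return t * t * (3 - 2 * t)
end

-- interpolate a formant frequency from "from" to "to" at time t in [0,1]
local function easeFormant( from, to, t )
    return from + (to - from) * cubicEase( t )
end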


Quasi-Sinusoidal Voicing Revisited

Voiced stops, such as /B/ and /G/, are essentially the same as their unvoiced counterparts /P/ and /K/, but with the addition of voicing.

However, the voicing is a “quasi-sinusoidal” (QS) voicing, in that it resembles a sinusoidal wave more than “normal” glottal voicing.

This is simulated in synSinger by setting the pulse width to 100%, and the rising portion to 50%.

I’d done a “clever” hack of using a negative amplitude to indicate quasi-sinusoidal voicing, which I’d hoped would allow for a smooth interpolation between “normal” and QS voicing. Unfortunately, as the value passed from positive to negative, the amplitude would pass through zero, which created odd drops in amplitude.

The penny finally dropped when it occurred to me that QS and “normal” voicing are simply different ends of the spectrum. QS voicing occurs when there’s not much energy to drive the glottal folds. As pressure to the folds increases, the voicing becomes more “normal”.

This led to the replacement of the if/else logic with an interpolation of parameters driven by the amplitude of voicing (av):

-- interpolate between "smoothed voicing" to regular voicing
local t = math.min( 1, av )
pulseWide = lerp( 1, pulsePercent, t )
risingWidth = lerp( .5, pulseWide * risingPercent, t )
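
For reference, lerp here is just a standard linear interpolation helper; a minimal version (not necessarily synSinger’s exact code) looks like:

-- standard linear interpolation: returns a at t=0, b at t=1
local function lerp( a, b, t )
    return a + (b - a) * t
end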

Honestly, it doesn’t make much difference in the output. But it gets rid of an artificial distinction between the two voicing types, and makes the code a bit more elegant.


Packaging Lua Code

Although the code isn’t quite ready for prime time, I’m trying to get it to the point where I can hand it off to some beta testers.

Optimally, I’d like to be able to deliver synSinger as a single executable, rather than having the end user install Lua on their machine.

I’d written utilities like this before in Euphoria, so I figured I’d try rolling my own. My code simply looked for require statements and, if it had not encountered that file before, replaced the require with the source code of the file that was being requested. It was rather brute force… and came to an abrupt halt when the Lua interpreter complained that the resulting code exceeded the upper limit of local variables.

The next step was to play around with Squish, a utility for packing multiple Lua scripts into a single file.  Squish is very cool, and has a bunch of neat options.

Unfortunately, I had trouble getting it to work properly. But the gist is that

require "myPackage"

causes the required source code to be inserted into the packed file, along the lines of:

package.preload['myPackage'] = (function (...)
    ...package source code...
end)

Once I’d figured that out, it was fairly straightforward to create my own low-budget version of Squish.
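
The core of the low-budget version is just wrapping each module’s source in a package.preload entry; a rough sketch of the idea, with illustrative file names:

-- wrap a module's source in a package.preload entry, so a later
-- require() finds it without touching the filesystem
local function packModule( name, path )
    local f = assert( io.open( path, "r" ) )
    local source = f:read( "*a" )
    f:close()
    return string.format( "package.preload[%q] = (function (...)\n%s\nend)\n",
        name, source )
end

-- the packed modules are concatenated ahead of the main script's source
local packed = packModule( "myPackage", "myPackage.lua" )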

The next order of business was to convert the CMU Pronouncing Dictionary data file into Lua source code, so it could be embedded, along the lines of:

CMU_DICTIONARY["1:HAGENS"]={"HEY1-GAH0NZ"}
CMU_DICTIONARY["1:HAEGELE"]={"HEH1-GAH0L"}
CMU_DICTIONARY["1:HENDEE"]={"HEH1N-DIY0"}
CMU_DICTIONARY["1:HELMUTH"]={"HEH1L-MUW2TH"}
CMU_DICTIONARY["2:HILLBILLY"]={"HIH1L-BIH0-LIY0"}

This is where I hit another snag – Lua only allows a limited number of literals per module – and being about 129,000 lines long, the dictionary exceeded the limit.

Fortunately, splitting the dictionary up into a series of modules fixed that problem.
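
Something along these lines, where the chunk count and module names are purely illustrative:

-- the shared dictionary table; each chunk module fills in its own entries
CMU_DICTIONARY = {}

-- load every chunk (e.g. cmu_dict_01.lua, cmu_dict_02.lua, ...)
for i = 1, 8 do
    require( string.format( "cmu_dict_%02d", i ) )
end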

I wrote a similar data-to-code routine to embed the .wav files into the source code, and synSinger was finally a single file.
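
That routine amounts to escaping each byte so the sample data survives as a Lua string literal; a hedged sketch, with illustrative names:

-- convert a .wav file into a line of Lua source; every byte is written
-- as a \ddd decimal escape so binary data survives as a string literal
local function wavToLua( name, path )
    local f = assert( io.open( path, "rb" ) )
    local data = f:read( "*a" )
    f:close()
    local escaped = data:gsub( ".", function( c )
        return string.format( "\\%03d", c:byte() )
    end )
    return string.format( 'WAV_SAMPLES["%s"] = "%s"\n', name, escaped )
end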

There’s a version of Lua called srlua by one of the authors of Lua that consists of a Lua executable and a glue program. Only the source code is provided, but a bit of searching will turn up pre-built binaries. Creating a stand-alone executable in srlua is just a matter of typing:

glue srlua.exe myProgram.lua myProgram.exe

This binds myProgram.lua to the srlua interpreter, and creates a stand-alone executable myProgram.exe.

That’s exactly what I was looking for. In a perfect world, this is where the story would end.

However, synSinger runs slowly under “plain” Lua – it takes 1/2 second to generate 1 second of output. I haven’t even tried optimizing it yet, but that’s painfully slow!

But there’s a faster Lua interpreter called luajit – and it runs synSinger about 10 times faster than plain Lua.

luapower looked very promising, but the MinGW toolchain installer won’t work on my machine.

I was able to create a stand-alone .exe using luvit. It seems to have some small issues, but it doesn’t look like they’ll be insurmountable.

So it looks like I might be using luvit to build synSinger distributables.


Vibrato

I’ve added vibrato back into synSinger.

Vibrato is applied when one or more voiced nuclei are longer than some minimum duration, which can be selected by the user. An envelope controlling the vibrato amplitude is constructed and passed to the synthesizer. A delay between the start of the note and the onset of vibrato can also be given.

While there are fairly complex models for the vibrato wave, the literature agrees that a simple sin() wave will suffice. At the beginning of synthesis, an “angle step rate” is calculated as:

vibratoStep = ((vibratoRate*TWO_PI) / OVERSAMPLED_SAMPLE_RATE)

This is added to the vibratoAngle for each sample. The calculation of the vibrato wave is simply:

-- calculate vibrato amplitude
local vibrato = math.sin(vibratoAngle) * vibratoAmp

This is then multiplied by a scale value (in cents) to determine how much to alter the glottal pitch. The routine to add a given number of cents to a frequency is:

-- return the frequency plus the given number of cents
local function addCents(frequency, cents)
  return frequency * math.pow(2, cents/1200)
end

The vibrato frequency is fairly low – from 4 to 8 cycles per second. The literature claims that a rate of 6.1 cycles per second is ideal, but I’ve currently got the values around 4 cps.

The scaled absolute value of the vibrato is also added to the glottal wave’s amplitude, so amplitude modulation is applied in addition to frequency modulation.
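
Putting the FM and AM pieces together, per sample it works out to something like this sketch; vibratoCents and AM_SCALE are illustrative names rather than synSinger’s actual parameters:

-- frequency modulation: shift the glottal pitch by the vibrato, in cents
local pitch = addCents( basePitch, vibrato * vibratoCents )

-- amplitude modulation: the scaled absolute value is added to the amplitude
local amp = baseAmp + math.abs( vibrato ) * AM_SCALE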

While not making synSinger sound human, it certainly helps it sound less robotic.


Men, Women and Children

Despite knowing that formant synthesis really doesn’t do it all that well, I decided to have a go at trying to add a female and child voice to synSinger.

Rather than do it properly, I decided to take a cue from programs like Software Automatic Mouth, and tweak various properties.

For example, this page lays out a number of things that can be done:

  • Modify the shape of the glottal pulse;
  • Raise the fundamental frequency;
  • Modify the bandwidths based on a shorter oral tract length;
  • Increase the aspiration noise.

Additionally, Klatt suggests deactivating the high-frequency filters. This can be done by setting a=1, b=0, c=0, so that the filters have no effect.
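
To see why, here’s a sketch of the standard Klatt resonator difference equation (the state handling here is illustrative, not synSinger’s actual code); with a=1, b=0, c=0 the output is simply the input:

-- Klatt-style two-pole resonator: y(n) = a*x(n) + b*y(n-1) + c*y(n-2)
local function resonate( x, state, a, b, c )
    local y = a * x + b * state.y1 + c * state.y2
    state.y2 = state.y1
    state.y1 = y
    return y
end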

These were all pretty easy to implement. However, they didn’t really transform the quality of the voice in a convincing way. For that to happen, the formants would have to be shifted to the positions appropriate for a woman or child.

The mapping of the formants isn’t linear, so synSinger uses an approximation instead.  I used the following table, which lists the average formant frequencies of various vowels for men, women and children:

Average formant frequencies for men, women and children for various vowels.

Given a tuple of (F1, F2, F3), synSinger first searches for the vowel in the table that best matches the target, using a squared error term:

error = (targetF1 - F1)^2 + (targetF2-F2)^2 + (targetF3-F3)^2
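
A sketch of that search, assuming averageMan is an array of entries with f1, f2 and f3 fields, as in the snippets below:

-- find the male vowel whose formants are closest to the target
local bestIndex, bestError = nil, math.huge
for index, vowel in ipairs( averageMan ) do
    local err = (targetF1 - vowel.f1)^2 + (targetF2 - vowel.f2)^2 + (targetF3 - vowel.f3)^2
    if err < bestError then
        bestIndex, bestError = index, err
    end
end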

Once the vowel with the best match (lowest error value) is found, synSinger calculates a delta that transforms each formant of that vowel into the target:

if f1 then tF1 = f1 / averageMan[index]["f1"] end
if f2 then tF2 = f2 / averageMan[index]["f2"] end
if f3 then tF3 = f3 / averageMan[index]["f3"] end

Finally, it finds the equivalent vowel for the woman or child, and scales those formant values to find where the tuple lies in the target space:

if f1 then targets["f1"] = averageWoman[index]["f1"] * tF1 end
if f2 then targets["f2"] = averageWoman[index]["f2"] * tF2 end
if f3 then targets["f3"] = averageWoman[index]["f3"] * tF3 end

The if f1… bit of code allows for matching against targets that might only contain one or two formant values.

This does an acceptable job of transforming the formants.

Here’s what it sounds like:  Twinkle, Twinkle Little Star (female voice)

The end result isn’t entirely convincing, but it’s a step in the right direction.


Silent Letter

For a while, the pop group America had a pattern of starting the names of their albums with the letter “H”… a “silent letter.”

However, the /H/ in synSinger turned out to be less than silent.

The phoneme /H/ is an interesting phoneme – it’s aspirated, not voiced, and it takes on the formant parameters of the vowel that follows it. But I had been in a hurry to get some output from synSinger, so I’d done some quick measurements in Praat that were “good enough” to have it sing:

… up above the world so high…

and then I moved on to other phonemes, with the intent of fixing it later.

Fast forward to last night, when I imported a different song into synSinger, and there were “burps” all over the place. This was particularly distressing because they hadn’t been showing up in other songs.

The main commonality was that this song had a slew of /H/ sounds, and where the voicing on the prior phoneme faded, there was a distinct low “burp” sound.

It’s a sign that something in the code has gone wrong. Typically, it’s something in the filters.

My first guess was that perhaps there was a numeric precision issue as the voicing amplitude dropped to zero. A couple tests scratched that idea off the list.

The next guess was that it was a rapid change in one of the formants, or bandwidths. Setting the bandwidths to fixed values didn’t change anything, so the most likely culprit was a rapid change in the filters.

I (once again) tried implementing the code that Klatt suggested for modifying the filter history variables whenever the filter values were changed, but I still couldn’t figure out how to make that code work. Instead, it inserted more discontinuities whenever the filter parameters changed.

If the issue was a rapid change in the filter values, another solution is oversampling. That basically means increasing the sampling rate by a factor of n – so, relative to the new sampling rate, every frequency is lowered by a factor of n. The upside of oversampling is better precision. The downside is having to generate n times more data, which (obviously) takes roughly n times longer.
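
A rough sketch of the idea, with an illustrative factor and names; a production version would low-pass filter before decimating:

-- synthesize at OVERSAMPLE * SAMPLE_RATE, then keep every n-th sample
local OVERSAMPLE = 4

local function downsample( samples, n )
    local out = {}
    for i = 1, #samples, n do
        out[#out + 1] = samples[i]
    end
    return out
end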

Oversampling turned out to be not terribly difficult to implement… But it still didn’t get rid of the problem.

My next thought was that perhaps the frame resolution was too small. But more testing showed that wasn’t the problem, either.

So I was forced to consider the old programming adage: always suspect the data.

It turned out that the first formant of the /H/, which is generally in the range of 200-400 Hz, had a value of 800 Hz. This created a rapid rise in the F1 frequency at the transition, which created the “burp” in the first place.

Changing the F1 to a more sensible value of 300 Hz fixed the problem.

Plus, I added in the code to adjust the formants of the /H/ to those of the following vowel.


Reworking Fricatives

I’m making a second pass through the consonants, starting with the fricatives.

Voiced fricatives are essentially the same as unvoiced fricatives, with the addition of a voice bar. I’d accomplished this by using a vb (voice bar) parameter track, so I could fade out the av (amplitude of voicing) and bring in the voice bar.

For various reasons, this arrangement was rather clunky. Additionally, there was a need for another track – the voice source. That’s because the voice bar during voiced fricatives switched from normal voicing to “quasi-sinusoidal” pulses.

Since the “normal” and “quasi-sinusoidal” voicing were mutually exclusive, and the “normal” voicing generally drops in volume before the “quasi-sinusoidal” voicing comes in, I decided to be clever and have the sign of the av represent the voicing type. So dropping the av down into the negative values changes the voicing to “quasi-sinusoidal” in a smooth fashion, and the vb parameter and voicing source parameters go away – for now, anyway.

The prior version of the code had simulated “quasi-sinusoidal” voicing by passing the glottal pulse through only the F1 resonator, which wasn’t very effective.

The change to the glottal pulse code was actually quite small, since the “quasi-sinusoidal” voicing uses the normal voicing code, but with 100% of the wave as the pulse, and 50% of the pulse rising.

The voiced fricatives were relatively easy to tune. Surprisingly, the /CH/ proved quite difficult because I’d let the frication continue through part of the voicing, and the result was a  /G/.

Since the unvoiced fricatives use sampled sounds, getting them to work was pretty much just a matter of shaping the envelope for the af (amplitude of frication).
