Still Doing Research

Not much to report – I’ve been spending a lot of time learning about neural networks, especially with regard to speech synthesis.

And… there’s a lot to learn.

Research Mode, Again

To recap: this summer, I decided to revisit the resynthesis routines that synSinger was using in hopes of creating better output. I ended up getting fairly good results with one method, only to have it fail badly when I started testing it with other data. Sinusoidal synthesis still gives excellent results, but I need the flexibility of being able to modify vocal attributes such as tension and the glottal pulse – something that sinusoidal synthesis doesn’t automatically supply.

Additionally, I’m unhappy with how robotic the output sounds, and much of that seems to stem from the method of recording the phonemes.

While I’ve been mulling these issues over, I’ve been doing some research into neural networks. There’s a lot of research into successful TTS (Text to Speech) synthesis, but singing synthesis is a bit of a different animal.

Many of the features that are desirable in TTS are undesired in singing synthesis.

For example, the prosody of speech – the pitch line, emphasis, and phoneme durations – is automatically baked into TTS.

In singing synthesis, these are manually specified.

One solution – adopted by Sinsy – is to have TTS initially generate the speech, and then pitch and time-shift the results to the pitch and timing constraints of the song.

In some ways, the constraints of singing synthesis are simpler than speech synthesis. For example, since the user needs the ability to specify down to the phoneme level, there’s no need for a language model to be created, as a phonetic dictionary will do. It’s up to the end user to handle homographs – words that are written the same, but sound different.
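
As a toy illustration (this isn’t synSinger’s actual data format), a phonetic dictionary is just a lookup from words to phoneme strings, and homographs are simply entries with more than one pronunciation for the user to choose from:

    -- Hypothetical phonetic dictionary (ARPABET-style symbols).
    -- Homographs have more than one pronunciation; the user picks which.
    local dictionary = {
      sing = { "S IH NG" },
      read = { "R IY D", "R EH D" },  -- present tense vs. past tense
      lead = { "L IY D", "L EH D" },  -- the verb vs. the metal
    }

    -- Return the requested pronunciation, defaulting to the first one.
    local function lookup(word, choice)
      local entries = dictionary[string.lower(word)]
      if not entries then return nil end
      return entries[choice or 1]
    end

    print(lookup("read"))      --> R IY D
    print(lookup("read", 2))   --> R EH D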

Concatenation synthesis programs like Vocaloid and UTAU typically have a large set of pre-recorded phoneme pairs that can be assembled to create singing.

This requires a lot of recordings. Because errors can creep into the process, it’s helpful to have multiple recordings of phonemes. The non-AI version of SynthesizerV has three different takes of each phoneme to choose from, which can be very useful when the primary recording is wrong.

The promise of neural networks is that they can “learn” from examples, and so can potentially handle atypical phoneme pairings more robustly. But this isn’t guaranteed, by any means.

The process of training a neural network to recognize speech isn’t the same as the process of training it to generate speech. But both are generally required: training speech generation takes a lot of training data, and that data generally has to be labeled automatically.

And while the neural network generated speech can often be very good, because of the constraints of singing synthesis, it’s not necessarily better than concatenative synthesis.

I’ve been playing around with the idea of using a neural network to do singing synthesis for some time, and have more recently been looking at it more seriously.

There are a lot of questions I still need to get answered, such as how to handle time in a controlled manner, and whether the output would be better.

In the meantime, I’ll continue to see if I can get better synthesis results.

Deja Vu, All Over Again

I worked out the kinks in the sinusoidal rendering code, with everything sounding relatively smooth. So I figured that perhaps I should backport it to my prior version of synSinger.

That’s the point where I got a bit of a surprise: synSinger already has an option for sinusoidal synthesis. I’d managed to reinvent the wheel again.

Which is a bit distressing, to be honest. It feels a bit like running in circles.

That certainly explains why it only took a couple days to put together.

So I’ve gone back and had a close listen to the prior sine wave based synthesis, to see why I moved away from it in the first place.

Spoiler: Being able to adjust the tension in the voice is important.

In general, sinusoidal synthesis sounds very good – especially with copy synthesis. The differences between the original and the copy are slight.

Perhaps the biggest issue is dealing with phonemes like /v/ and /z/.

This is because these use mixed voicing, and synSinger doesn’t support that.

MELP determines if a frame uses mixed voicing by checking the following passbands:

  • 0-1000Hz,
  • 1000-2000Hz,
  • 2000-4000Hz,
  • 4000-6000Hz, and
  • 6000-8000Hz

By “checking”, I mean it runs auto-correlation with pitch-sized windows to see if there’s voiced content in the band. (Each band is separated out using a band-pass filter.)

If there’s a strong correlation, the band is flagged as voiced. Otherwise, it’s unvoiced.
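
A minimal sketch of that per-band check, assuming the band has already been isolated by its band-pass filter and that the pitch period is known in samples (the names and the 0.5 threshold here are my own placeholders, not MELP’s actual constants):

    -- Normalized autocorrelation of one band at the pitch lag.
    -- band: samples for one passband, start: first sample of the window,
    -- pitchPeriod: pitch period in samples.
    -- (The band needs at least two pitch periods of samples past 'start'.)
    local function bandVoicing(band, start, pitchPeriod)
      local corr, energy1, energy2 = 0, 0, 0
      for i = start, start + pitchPeriod - 1 do
        local a, b = band[i], band[i + pitchPeriod]
        corr    = corr + a * b
        energy1 = energy1 + a * a
        energy2 = energy2 + b * b
      end
      if energy1 == 0 or energy2 == 0 then return 0 end
      return corr / math.sqrt(energy1 * energy2)
    end

    -- Flag the band as voiced if the correlation is strong enough.
    local function isBandVoiced(band, start, pitchPeriod, threshold)
      return bandVoicing(band, start, pitchPeriod) > (threshold or 0.5)
    end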

For cases where mixed voicing is found, I might be able to just insert a low-frequency sine wave into the buffer – along the lines of the quasi-sinusoidal voicing that Klatt used in MITalk – and render the other bands as noise. I’ll have to see what works.
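
If I go that route, rendering a frame might look something like the sketch below – a sine for the bands flagged as voiced and white noise for the rest, each pushed through its band-pass filter. This is untested, and every name in it is a placeholder:

    -- Untested sketch: one frame of mixed excitation. Bands flagged as voiced
    -- get a low-frequency sine; the rest get white noise. bandFilters[b] is
    -- assumed to be a stateful band-pass filter function for band b.
    local function mixedExcitation(frameLength, f0, sampleRate, bandIsVoiced, bandFilters)
      local out = {}
      for n = 1, frameLength do
        out[n] = 0
        for b, voiced in ipairs(bandIsVoiced) do
          local s
          if voiced then
            s = math.sin(2 * math.pi * f0 * (n - 1) / sampleRate)
          else
            s = math.random() * 2 - 1          -- white noise in [-1, 1]
          end
          out[n] = out[n] + bandFilters[b](s)  -- keep only this band's portion
        end
      end
      return out
    end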

There are a few more experiments that I’d like to run, but at this point, it looks like I’ll be moving forward with sinusoidal synthesis, with a long-term goal of replacing it with vocoding so I can take advantage of glottal effects.

Got Sinusoidal Synthesis Working

After running into issues with my DFT and FFT routines returning mismatched amplitude values, I decided that it would be a good thing to have another look at my routines and make sure they really were working – especially in light of what I’d found with the band-pass filters.

Additionally, I’ve been reading up on various codecs such as CELP (Code-Excited Linear Prediction), MELP (Mixed-Excitation Linear Prediction) and Codec 2 (a low-bitrate speech codec), with an eye on improving synSinger’s synthesis routines.

But after reading through David Grant Rowe’s blog, where he stated that “The speech quality of the basic harmonic sinusoidal model is pretty good, close to transparent”, I figured it might be a good thing to revisit sinusoidal synthesis.

I’d already done quite a bit with sinusoidal synthesis. The main thing that was missing was phase information, because I had issues with getting it to work with the DFT code.

So I revisited the DFT routine, fixed a number of bugs, and got it to return proper phase information.
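
For reference, the core of a DFT that returns both magnitude and phase is pretty small. This is a bare-bones sketch, not synSinger’s actual routine:

    -- Naive DFT of a real signal: returns magnitude and phase for bins 0..N-1.
    -- x is a 1-based array of N samples.
    local atan2 = math.atan2 or math.atan  -- Lua 5.3 folded atan2 into math.atan

    local function dft(x)
      local N = #x
      local mag, phase = {}, {}
      for k = 0, N - 1 do
        local re, im = 0, 0
        for n = 1, N do
          local angle = 2 * math.pi * k * (n - 1) / N
          re = re + x[n] * math.cos(angle)
          im = im - x[n] * math.sin(angle)
        end
        mag[k]   = math.sqrt(re * re + im * im)  -- divide by N/2 to get sine amplitude
        phase[k] = atan2(im, re)
      end
      return mag, phase
    end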

With the phase information, the voiced portion reconstruction of the audio is impressively good. There’s some background clicking, but it seems to be an issue with wave mis-alignment – something that cross-fading the waveforms should resolve.
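
The cross-fade itself should be simple enough – something like this linear ramp (equal-power fades are another option):

    -- Linear cross-fade over 'overlap' samples, blending the tail of 'a'
    -- into the head of 'b' and writing the result into 'out' at 'pos'.
    local function crossfade(out, pos, a, b, overlap)
      for i = 1, overlap do
        local t = (i - 1) / overlap            -- ramps from 0 toward 1
        out[pos + i - 1] = a[#a - overlap + i] * (1 - t) + b[i] * t
      end
    end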

Time-stretching seems to work as well, although I spent way too long tracking down some boneheaded bugs.

I’m going to hold off final judgement until I can clear out some of these bugs, but it’s pretty promising.

Running Bandpass Filters In Series

I searched a bit for some better implementations of bandpass filters in Lua, and didn’t find anything that I could get to work.

So instead I tried running the output of the bandpass filters through another set of bandpass filters, and the results were pretty good:

Bandpass filters running in series
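
For anyone wondering what “in series” means in practice: the signal just runs through the same band-pass filter twice, which squares the filter’s response and steepens its skirts. Here’s a sketch using a standard biquad band-pass (RBJ cookbook coefficients) – a stand-in, not the actual filter code I’m using:

    -- Make a stateful biquad band-pass filter (RBJ cookbook, constant
    -- 0 dB peak gain) centered at f0 with quality factor q.
    local function makeBandpass(f0, q, sampleRate)
      local w0    = 2 * math.pi * f0 / sampleRate
      local alpha = math.sin(w0) / (2 * q)
      local a0    = 1 + alpha
      local b0, b2 = alpha / a0, -alpha / a0          -- b1 is zero for a band-pass
      local a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
      local x1, x2, y1, y2 = 0, 0, 0, 0
      return function(x)
        local y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        return y
      end
    end

    -- Two identical filters in series: the second stage sharpens the band edges.
    local stage1 = makeBandpass(1000, 4, 44100)
    local stage2 = makeBandpass(1000, 4, 44100)
    local function bandpassInSeries(x)
      return stage2(stage1(x))
    end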

The vocal gets a bit crispy past the 5kHz range, but that’s a known problem stemming from not producing many actual harmonics up past that point. There are a number of different ways I can address this.

The fact that this works indicates there is an issue with the filters I’m using.

But until I find another library that works, I’m going to call this a win and move on to some of the many other issues that need to be addressed.

Back to DFTs

I’ve moved the code back from FFTs (Fast Fourier Transforms) to DFTs (Discrete Fourier Transforms). Additionally, voiced frames now correspond exactly with the voiced pulses.

This has taken care of the problem of noise bursts preceding voiced sounds, as well as the “pre-echo” issue with the voice.
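
In other words, the analysis is now pitch-synchronous: each voiced frame runs from one glottal pulse to the next instead of sitting on a fixed grid. A sketch of that framing (assuming a sorted list of pulse positions in samples):

    -- Pitch-synchronous framing: one frame per glottal pulse, spanning from
    -- that pulse up to (but not including) the next one.
    -- 'pulses' is a sorted array of sample indices into 'samples'.
    local function framesFromPulses(samples, pulses)
      local frames = {}
      for p = 1, #pulses - 1 do
        local frame = {}
        for n = pulses[p], pulses[p + 1] - 1 do
          frame[#frame + 1] = samples[n]
        end
        frames[#frames + 1] = frame
      end
      return frames
    end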

The sound is closer to acceptable, but still has issues. For example, here’s the original speech:

Original speech and spectrogram

Here’s the reconstructed speech:

Resynthesized speech and spectrogram

Simply looking at the spectrogram, you can see that while the formants are present, they aren’t as strong as they were in the original speech.

More problematic are parts of the speech such as the /th/ sounds, which come out sounding more like an /s/ than a /th/.

My guess was that the mismatched amplitude of the upper frequencies was related to the spectral makeup of the glottal pulse.

But this shouldn’t be the case with the unvoiced sounds, as they use white noise, so there shouldn’t be any spectral dropoff.

So perhaps I need to have a closer look at my bandpass filters.

Starting to Re-Think Moving to FFT

Having implemented a fixed-frame width version of the synthesis code, I’m starting to have a lot of second thoughts about this.

There are all sorts of issues – such as pre-echo and bursts of noise – which I hadn’t encountered when using the DFT.

The FFT made things conceptually simpler, but dealing with the side-effects means adding little hacks into the code that just make things messy and harder in the long run.

I’ll still use the FFT to perform analysis on the noise.

But I’m thinking of adding some code to make sure it doesn’t look too far forward – or, at least not into any voiced frames.

So I’ve made a copy of the current code, and I’m now in the process of rewriting it yet again.

Fortunately, it shouldn’t be that much of a change.

Changed Glottal Pulse Model

I was reading through the details of how Praat implements its glottal pulse, and noticed that not only does Praat have a simple algorithm for generating a glottal pulse, it’s also got code for generating the derivative.

So I’ve replaced the glottal pulse I had been using with the derivative glottal flow, and the results are pretty good.
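
For reference, the flow model Praat describes is just a polynomial over the open phase, which makes the derivative easy to write analytically. Here’s a sketch of that idea – the exponents and the open-phase handling are my assumptions, not necessarily Praat’s defaults:

    -- Polynomial glottal flow over the open phase, in the style Praat
    -- describes: flow(x) = x^p1 - x^p2 for x in [0, 1], plus its derivative.
    -- p1 = 3 and p2 = 4 are assumed values.
    local p1, p2 = 3, 4

    local function glottalFlow(x)            -- x: position within the open phase, 0..1
      return x ^ p1 - x ^ p2
    end

    local function glottalFlowDerivative(x)  -- d(flow)/dx
      return p1 * x ^ (p1 - 1) - p2 * x ^ (p2 - 1)
    end

    -- Fill one pitch period with the derivative waveform; 'openPhase' is the
    -- fraction of the period that is open (the remainder stays at zero).
    local function pulseDerivative(periodSamples, openPhase)
      local pulse = {}
      local openSamples = math.floor(periodSamples * openPhase)
      for n = 1, periodSamples do
        if n <= openSamples then
          pulse[n] = glottalFlowDerivative((n - 1) / openSamples)
        else
          pulse[n] = 0
        end
      end
      return pulse
    end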

However, when I changed over to another vocal sample, I noticed there was a noise burst at the beginning of each vowel.

At first, I thought that perhaps I was overloading the band-pass filters, and so added some logic which slowly brought up the pulse amplitude over the course of several pulses. But I was still getting the burst.

So I disabled the bandpass filters, and it was there at the front of the pulse train:

Noise burst preceding the pulse train.

So I had a closer look at the beginning, and sure enough – the code thought that the very beginning of the wave wasn’t voiced, so it was handed off to the code that renders the unvoiced frames:

Wave starts before voicing is detected.

Checking in Praat, it properly identifies the start of the waveform:

Where Praat shows the wave as starting

The most probable culprit is that my code simply isn’t detecting the frames that fall within the pulses. And that’s exactly what showed up when I added the pulse locations to the display. Here’s the output with the corrected frames:

Dots show the glottal pulses.

There are still some frames which don’t show up, which need to be accounted for.

The simplest thing to do would be to add some code to look for frames that fade to silence, and flag them as voiced.
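
A sketch of that heuristic – each frame is assumed to carry an amplitude (RMS) value and a voiced flag, and the threshold is a placeholder:

    -- Flag frames that trail off to silence after a voiced frame as voiced,
    -- so the tail of a vowel isn't handed to the unvoiced renderer.
    -- Each frame is assumed to have .rms and .voiced fields.
    local function flagFadingFrames(frames, silenceThreshold)
      silenceThreshold = silenceThreshold or 0.01
      for i = 2, #frames do
        local f, prev = frames[i], frames[i - 1]
        if not f.voiced and prev.voiced
           and f.rms < prev.rms and f.rms > silenceThreshold then
          f.voiced = true
        end
      end
    end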

But I’m also looking at some other options.

Fiddling with Parameters

I hooked up all the useful parameters to sliders, and played around with them. I didn’t find anything that made the vocal sound great, but I was able to find some that made the output really bad.

One of the other problems with the voice is that it’s currently at an absolutely flat pitch, which makes it sound really robotic. I’ll fix that eventually, but for now it helps me focus on hearing what’s wrong with the voice.

The main problem is that it’s still fuzzy and buzzy – a bit too “out of focus”.

Anyway, here’s the list of parameters I’ve been playing with:

FFT Resolution

This is a balancing act. Too low (256 samples) and it’s too rough. Too high (4096 samples) and it loses too much detail: what it gains in spectral fidelity, it loses in temporal fidelity.

Frequency Range

This is the range of frequency data that’s being captured. Too low (3000 Hz) and it’s boxy. Too high and it’s capturing a lot of information that isn’t really audible – at least, not to me with my mild hearing loss.

Mel Bands

This is the number of bands the frequency range is divided into (see the sketch at the end of this list). Fewer than 20 is too low resolution. Too many bands and the audio gets “ringy”.

Bandwidth

Each Mel band has a single bandpass filter that the glottal pulse goes through. Too narrow, and the output is shrill. Too wide, and the audio is muffled. I’m still working on figuring this value out.

Percent Pulse

The glottal pulse has two basic shape parameters. The first is what percent of the period is “the pulse”, as opposed to flat. This corresponds to the “tension” of the voice.

Up/Down Percent

Simple enough – the second parameter is what percent of the pulse duration is spent going up vs. going down. This doesn’t seem to have a large impact on the voice.

Aspiration

This is how “breathy” the voice is. It’s more of a “mix to taste” sort of thing.
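
Since the Frequency Range, Mel Bands, and Bandwidth values all interact, here’s the bookkeeping I have in mind as a sketch: convert the range to mels, space the band centers evenly there, and convert back. The conversion formulas are the standard mel formulas; the function names are mine.

    -- Spread band centers evenly on the mel scale across the analysis range.
    -- frequencyRange and melBands correspond to the parameters above.
    local function hzToMel(hz)
      return 2595 * math.log(1 + hz / 700) / math.log(10)
    end

    local function melToHz(mel)
      return 700 * (10 ^ (mel / 2595) - 1)
    end

    local function bandCenters(frequencyRange, melBands)
      local maxMel  = hzToMel(frequencyRange)
      local centers = {}
      for b = 1, melBands do
        centers[b] = melToHz((b - 0.5) * maxMel / melBands)  -- center of band b
      end
      return centers
    end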

I suspect that one of the main problems is a mismatch between the frequency content of the synthetic glottal pulse and that of the original signal.

That is, running the glottal pulse through the bandpass filter ends up with a copy that has less power at the high frequencies than the original signal.

I’ve tried various ways to compensate for this, with varying degrees of success.

In any case, I’m not quite ready to move on to the next step until I’ve run out of ideas for making the voice better. I keep hoping that I’ll find a useful hint in the next research paper I read.

Moving To FFT, and Slider Widgets

I noticed that when I accidentally analyzed one of the vocals with a frequency value that was half the actual value, I was getting better results. It turned out that was because the analysis was using multiple cycles.

Based on that, I’ve decided to move away from using the DFT and use the FFT instead.

I also moved to fixed frame sizes, just to make things more consistent. To make it easier for me to test things out, I’ve added a Motif-style scale widget to the UI, so I can just pick new values and see how it changes the results:

UI with sliders for adjusting the analysis values.

All this has perhaps given slightly better results from the synthesis, but I’m hoping for higher quality than what I’m currently hearing.

Although the rendering time is fairly fast, the quality of the results leaves much to be desired.
