I’ve already got data on the phase of harmonics in the synSinger database, so I figured that I’d apply it to the new sine() synthesis code and see how much difference it made.
Here’s a screenshot from Praat showing the output of the harmonics with all of the sine wave phases locked together, creating a very symmetric waveform:
Synthesized wave created without phase offsets.
In contrast, here’s the wave generated by including the offsets of the harmonics:
Synthesized wave created using phase offset from vowel data.
In theory, because the human ear is largely insensitive to phase, these should sound the same. And while they are essentially the same harmonically, the wave with phase information comes across as louder. And it certainly looks more correct.
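For illustration, the additive synthesis being compared here can be sketched in a few lines of Python. This is a hypothetical example, not synSinger's actual code; the harmonic amplitudes and phase offsets are made-up values:

```python
import math

def synth_frame(harmonics, f0, sample_rate, n_samples):
    """Sum sine harmonics; each entry is (amplitude, phase_offset_radians)."""
    out = []
    for n in range(n_samples):
        t = n / sample_rate
        s = 0.0
        for k, (amp, phase) in enumerate(harmonics, start=1):
            s += amp * math.sin(2 * math.pi * k * f0 * t + phase)
        out.append(s)
    return out

# Phase-locked: all offsets zero, producing the symmetric waveform
locked = synth_frame([(1.0, 0.0), (0.5, 0.0), (0.25, 0.0)], 220.0, 44100, 512)

# With per-harmonic phase offsets (values here are invented, not measured data)
offset = synth_frame([(1.0, 0.0), (0.5, 1.3), (0.25, -0.7)], 220.0, 44100, 512)
```

The harmonic content of both outputs is identical; only the relative phases differ.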
Interestingly, using phase offsets from different vowels doesn’t seem to have much impact on the sound.
Obviously, analysis of the phases for patterns is the next step.
I haven’t had as much time to work on synSinger as I’d like, but I’ve continued trying to get the bugs out of the vocoder.
To help with debugging, I created a version of the vocoder code that constructed a spectral curve from the frequency/amplitude pairs in each bark band using 4-point Hermite interpolation. It then used the curve to calculate the contribution of the first n overtones of the fundamental frequency to render the vocal.
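For reference, a typical 4-point Hermite (Catmull-Rom style) interpolation kernel looks like this in Python. This is a sketch of the general technique; synSinger's actual interpolation may use a different variant or tension:

```python
def hermite4(y0, y1, y2, y3, t):
    """4-point Hermite interpolation between y1 and y2, with t in [0, 1].

    y0 and y3 are the neighboring points, used to estimate the slopes
    at y1 and y2 (Catmull-Rom style tangents)."""
    c0 = y1
    c1 = 0.5 * (y2 - y0)
    c2 = y0 - 2.5 * y1 + 2.0 * y2 - 0.5 * y3
    c3 = 0.5 * (y3 - y0) + 1.5 * (y1 - y2)
    return ((c3 * t + c2) * t + c1) * t + c0
```

At t = 0 this returns y1 exactly, and at t = 1 it returns y2, so stepping t across each pair of band points traces a smooth spectral curve through the measured amplitudes.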
This helped me track down several problems with the bark band assignment code, which had been leaving out frequencies below 60Hz.
This led to better results from the vocoder, but the output wasn’t stellar.
On the other hand, the sine wave reconstruction was fairly good. One of the main problems was a metallic ringing. That turned out to be fairly trivial to fix once it was identified – the sine wave generators of the overtones needed to be in phase with each other.
Once that was corrected, there was still a problem with the output being excessively “buzzy” – it lacks the smoothness of the original material. It’s also not as high fidelity as I’d like.
Still, it seems to be worth more exploration.
I’m still playing around with sending glottal pulses through the vocoder.
Praat revealed that the biggest problem was that pitch tracking was still all over the map. Testing revealed that the pitch tracking code was basically broken and needed to be rewritten. Fortunately, that was fairly straightforward.
The next problem was variance. The values for the formant frequencies were in the general neighborhood of where they should be, but there was far too much variance from value to value.
Averaging the calculated values helped steady the variance:
target[i].frequency = (target[i].frequency * .01 + prior[i].frequency * .99)
However, higher frequencies take longer to steady, which means the weight needs to be dependent on the frequency. The mapTo function maps a value from one range to another:
local weight = mapTo( 0, 5000, .99, .7, target[i].frequency )
target[i].frequency = (target[i].frequency * (1-weight) + prior[i].frequency * weight)
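The snippets above are in synSinger’s own scripting language; here’s a Python equivalent of the smoothing step, with a guessed linear implementation of mapTo (the actual function isn’t shown here):

```python
def map_to(in_lo, in_hi, out_lo, out_hi, x):
    """Linearly map x from the range [in_lo, in_hi] onto [out_lo, out_hi]."""
    return out_lo + (x - in_lo) * (out_hi - out_lo) / (in_hi - in_lo)

def smooth_frequency(target_freq, prior_freq):
    # Low frequencies get a heavy prior weight (slow, steady tracking);
    # high frequencies get a lighter weight so they converge faster.
    weight = map_to(0.0, 5000.0, 0.99, 0.7, target_freq)
    return target_freq * (1.0 - weight) + prior_freq * weight
```

With equal target and prior values the output is unchanged, so the filter only acts on frame-to-frame variance.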
This fixed a lot of the problems with frequency tracking.
So… is the end result any good?
At this point, it’s only OK. It sounds like a low-fidelity phone call – not awful, but not especially encouraging that the quality is going to get that much better.
I haven’t been entirely happy with using sine waves to simulate noise, so I decided to spend some time trying an alternate approach: creating a simple FFT vocoder.
The vocoder consists of a series of bandpass filters, one for each bark band.
Audio is chunked into non-overlapping frames of some sufficiently small size, and an FFT is performed on each frame. The bin magnitudes for each bark band are then summed, with the highest bin in the band used to set that band filter’s frequency and bandwidth. A frame’s worth of output is then generated by sending white noise through each filter.
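The per-band analysis step can be sketched as follows. This is a hypothetical Python illustration of just the summing and peak-picking; the FFT itself, the bark band edges, and the filters are assumed to come from elsewhere:

```python
def analyze_bands(magnitudes, bin_hz, band_edges):
    """For each band [lo, hi) in Hz, sum the FFT bin magnitudes and find
    the peak bin.  Returns a (band_energy, peak_frequency_hz) pair per band,
    which would then drive that band's filter."""
    results = []
    for lo, hi in zip(band_edges, band_edges[1:]):
        first = int(lo / bin_hz)
        last = int(hi / bin_hz)
        band = list(range(first, min(last, len(magnitudes))))
        if not band:
            results.append((0.0, lo))
            continue
        energy = sum(magnitudes[i] for i in band)
        peak = max(band, key=lambda i: magnitudes[i])
        results.append((energy, peak * bin_hz))
    return results

# Toy frame: 8 bins at 100 Hz spacing, two bands covering 0-400 and 400-800 Hz
bands = analyze_bands([0.0, 1.0, 5.0, 2.0, 0.0, 3.0, 1.0, 0.0],
                      100.0, [0.0, 400.0, 800.0])
```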
The result was fairly good – audio in, whispered audio out.
But what if, instead of sending white noise to each filter, I passed through a glottal pulse? In theory, I should get a reconstructed voice.
In fact, that’s exactly what I got. The output was a bit muffled, but modifying the glottal pulse brightened it up a bit.
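For illustration, here’s one common textbook glottal pulse model – the Rosenberg pulse – sketched in Python. This isn’t synSinger’s actual pulse code, and the open/close fractions are arbitrary placeholder values:

```python
import math

def rosenberg_pulse(n_samples, open_fraction=0.6, close_fraction=0.3):
    """One period of a Rosenberg-style glottal pulse.

    The period splits into a rising (glottis opening) phase, a faster
    falling (closing) phase, and a closed phase of zero output."""
    n_open = int(n_samples * open_fraction)
    n_close = int(n_samples * close_fraction)
    pulse = []
    for n in range(n_samples):
        if n < n_open:              # opening: raised half-cosine rise
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / n_open)))
        elif n < n_open + n_close:  # closing: quarter-cosine fall
            pulse.append(math.cos(math.pi * (n - n_open) / (2 * n_close)))
        else:                       # closed phase
            pulse.append(0.0)
    return pulse
```

Sharpening the closing phase (a smaller close_fraction) adds high-frequency energy, which is the kind of adjustment that brightens the otherwise muffled output.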
However, there are plenty of glitches. Trying various approaches sometimes made things better, and often made things worse.
I’m still playing around with this because it’s rather neat. But I don’t think it’ll replace my prior approach – except perhaps for non-vocalized consonants.
While I’m pretty pleased with the overall progress of the phoneme editor, there have been a few phonemes in the preview mode that have been significantly worse than earlier versions, even though the waveforms and base code are essentially the same.
After some code tweaks, the phonemes are now sounding somewhat better, which is reassuring. It won’t be until I get the phonemes finished up and start connecting them that I’ll know how much the quality has changed.
The unvoiced and semi-voiced phonemes such as /Z/ and /ZH/ don’t sound as good as the sampled versions, but they are passable, so I don’t expect to make any immediate changes there. But I expect that the code handling unvoiced sounds will be modified once I get things up and running again.
Because the phonemes now derive timing information from the sounds – rather than being hard coded values – I’ll likely need to re-record all the samples being analyzed.
Because of this, I’m reviewing the recording list and looking for places where additional specialized phonemes should be added.
In the current design of synSinger, phonemes can have one or more distinct “targets”.
For example, the vowel /AH/ has two targets, one at the beginning of the vowel, and one at the end, reflecting the subtle changes of the mouth as the vowel is pronounced.
A more complex sound, such as /OR/ may have 5 or 6 targets:
Spectrogram of /OR/ with targets marked
Other phonemes not only have multiple targets, but have different voicing types for each target. For example, the initial stop consonant /P/ includes:
- A short silence
- An unvoiced burst
- A semi-voiced aspiration
- A voiced placement target, dependent on the following vowel.
Unfortunately, this isn’t how synSinger is currently built, so code is going to have to be rewritten to handle this.
Each target will have one of the following types of voicing assigned to it:
- Silent: No sound at all.
- Voiced: For example, a vowel
- Unvoiced: For example, a plosive burst or frication
- Semi-voiced: For example, aspiration that has a small voiced component
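One way this data model could be laid out, sketched in Python. The field names, timings, and structure here are invented for illustration; synSinger’s actual representation may well differ:

```python
from dataclasses import dataclass
from enum import Enum

class Voicing(Enum):
    SILENT = "silent"
    VOICED = "voiced"
    UNVOICED = "unvoiced"
    SEMI_VOICED = "semi-voiced"

@dataclass
class Target:
    time: float        # position within the phoneme, in seconds (illustrative)
    voicing: Voicing
    formants: list     # (frequency, amplitude) pairs, filled in by analysis

@dataclass
class Phoneme:
    symbol: str
    targets: list

# /P/ as described above: silence, unvoiced burst, semi-voiced aspiration,
# then a voiced placement target (timings are made-up values)
p = Phoneme("P", [
    Target(0.00, Voicing.SILENT, []),
    Target(0.02, Voicing.UNVOICED, []),
    Target(0.04, Voicing.SEMI_VOICED, []),
    Target(0.07, Voicing.VOICED, []),
])
```

The point of the structure is that the renderer can switch excitation source per target rather than per phoneme.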
Currently, /H/ is being generated by special aspiration code. Since the sound of the /H/ varies depending on the vowel that follows, I’ll probably need to create unique /H/ targets for each vowel.
I’ve continued to enhance the phoneme editor. The phoneme list has been integrated into the editor:
Selecting a phoneme on the list automatically loads the associated file. The targets are displayed with the wave, and the FFT is automatically performed when a portion of the wave is selected:
synSinger phoneme target editor.
Phoneme targets are aligned to the cycles. Target formants are repositioned by clicking on them, and then clicking the new frequency. There are plenty of refinements that could be made, but it works well enough.
I’ve still got work to do integrating the consonants into the editor. Once that’s done, I should be able to go back to building the remaining stop consonants.