Reading…

Normally, when I request books from the library through the inter-library loan system, they trickle in a few at a time. This time, all 28 books showed up at the same time.

I was able to whittle them down to about half that number and bring a small pile of reference books home. As usual, a number of the most promising texts didn’t deliver the content I’d been hoping for, and were left at the library.

I first read through Principles of Computer Speech (Witten). It’s a bit out of date, but made for a pleasant skim.

Electronic Speech Synthesis (Bristow) was next. It looks like my current approach is akin to that of the Votrax SC-01 chip, in that both appear to interpolate between phonemes. This doesn’t bode well, as the SC-01 wasn’t known for quality.
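
For comparison, here’s a minimal sketch (in Python) of what that kind of interpolation amounts to in its simplest form: a straight linear cross-fade between formant targets. The phoneme names and formant values are rough textbook figures plugged in for illustration, not anything taken from the SC-01 or from my own code.

# Hypothetical formant targets (F1, F2, F3) in Hz for two vowels.
PHONEME_FORMANTS = {
    "AA": (730, 1090, 2440),   # rough textbook values for /a/
    "IY": (270, 2290, 3010),   # rough textbook values for /i/
}

def interpolate_formants(a, b, t):
    """Linearly interpolate formant targets between phonemes a and b.

    t runs from 0.0 (all a) to 1.0 (all b).
    """
    fa, fb = PHONEME_FORMANTS[a], PHONEME_FORMANTS[b]
    return tuple((1 - t) * x + t * y for x, y in zip(fa, fb))

# Halfway through an /a/ -> /i/ transition:
print(interpolate_formants("AA", "IY", 0.5))  # (500.0, 1690.0, 2725.0)

Real transitions don’t follow straight lines between targets, which is presumably part of why this approach sounds mechanical.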

Interestingly, this page on the SC-01 has a link to MAME code which includes an emulation of the SC-01 chip. However, it just plays back audio samples captured from the chip, so it doesn’t actually emulate it.

The chapter Using LPC for Non-speech Sounds has a section entitled Synthesis for Singing Voices, with a number of pertinent notes worth citing in detail:

The differences appear primarily in the sustained parts of vowels. First, the formant positions for spoken and sung vowels are different, especially for professional singers (Sundberg, 1970). This implies that the spectral shapes to be used must be obtained by the analysis of vowels actually sung.

I’d seen references to Sundberg’s work identifying a singer’s formant, but I hadn’t really noticed the idea that singers alter the formants along with the fundamental frequency. The singer’s formant doesn’t appear in the spoken voice, and sits between 2000 and 3000 Hz.
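
For reference, in a formant synthesizer that kind of extra peak could presumably be approximated by one more second-order resonator centered in that band. Here’s a minimal sketch using the standard Klatt-style resonator difference equation; the 2800 Hz center frequency and 300 Hz bandwidth are assumptions for illustration, not measured values.

import math

def resonator_coeffs(freq, bandwidth, sample_rate):
    """Coefficients for the resonator y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    r = math.exp(-math.pi * bandwidth / sample_rate)
    theta = 2.0 * math.pi * freq / sample_rate
    C = -r * r
    B = 2.0 * r * math.cos(theta)
    A = 1.0 - B - C  # normalize for unity gain at DC
    return A, B, C

def apply_resonator(samples, freq, bandwidth, sample_rate):
    """Run samples through a single formant-style resonance."""
    A, B, C = resonator_coeffs(freq, bandwidth, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = A * x + B * y1 + C * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# Boost the band around an assumed 2800 Hz singer's formant:
# shaped = apply_resonator(voice_samples, 2800.0, 300.0, 44100)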

In addition, a given vowel may need to be synthesized on a very large pitch range. As a consequence it is not possible to consider that the spectral envelope is independent of the fundamental frequency, as is the case in speech synthesis. Otherwise, for certain pitch periods, the timbre of the voice can be drastically changed and become totally inhuman. In fact, an analysis of professional singers’ voices (Sundberg, 1975) has shown that the formants are consistently modified as a function of the fundamental frequency, particularly at high pitch levels where most vowels seem to be merged into a single unique sound.

In practice, if the synthesizer filter is controlled by resonant frequencies, these can automatically be modified as a function of the pitch, for each vowel, according to some rules. Alternatively, different patterns can be stored for each sound, each at a different pitch.

So the singer alters the position of the formants, although it’s not clear what the relationship to the fundamental frequency is. Singers also make other vowel choices (for example, to deemphasize nasality), and it’s not entirely clear which of these is being referred to here. I’ll have to study the sonagraph output and look for patterns.
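
To make the first strategy from the quote concrete, here’s a sketch of a pitch-dependent formant rule. The specific rule, keeping F1 slightly above F0 at high pitches, is one commonly cited for sopranos, but both the rule and the margin are assumptions on my part; working out the actual rules is exactly what the sonagraph study is for.

def adjust_formants_for_pitch(formants, f0):
    """Return pitch-adjusted formant frequencies for a sung vowel.

    formants: (F1, F2, F3, ...) in Hz at speaking pitch.
    f0: fundamental frequency in Hz.
    """
    # Assumed rule: never let F1 fall below F0 (the 10% margin is a guess).
    f1 = max(formants[0], f0 * 1.1)
    return (f1,) + tuple(formants[1:])

# /a/ sung at A5 (880 Hz): spoken F1 of ~730 Hz moves up to ~968 Hz.
print(adjust_formants_for_pitch((730, 1090, 2440), 880.0))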

Another aspect that requires specific attention in the sustained part of the vowels is related to small variations of pitch. If a steady tone is synthesized with a constant pitch value, the resulting sound is very dry and does not resemble any sound produced by a human being. It is even difficult to recognize the vowel from which the envelope was extracted. An analysis of real speech shows that whenever the speaker tries to maintain a constant pitch, he effectively produces small variations around the desired value. They are probably due to the non-instantaneous psycho-acoustic feedback control of vocal-chord tension, and air pressure in the lungs which must be adjusted to produce a given frequency of vibration in the vocal chords. These variations can be modeled by the addition of two or three random variations at different lag times (Rodet, 1981). The total maximum variation is on the order of 3-5%, but it is important that it be smooth. The difference between two consecutive pitch values should not exceed 0.5%; otherwise the variation could be perceived as being performed in steps.

This addresses to some extent the “mechanical” sound of a sustained synthesized note.
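The recipe in that quote is straightforward to sketch: a slow random drift around the target pitch, with the per-frame step capped at 0.5% and the total excursion capped at a few percent. The single clamped random walk below stands in for Rodet’s “two or three random variations at different lag times”, which would really need separate components at different rates.

import random

def jittered_pitch(f0, n_frames, max_dev=0.04, max_step=0.005):
    """Yield n_frames pitch values wandering smoothly around f0.

    max_dev: total deviation cap (4%, within the cited 3-5% range).
    max_step: frame-to-frame change cap (0.5%, per the quote).
    """
    dev = 0.0
    for _ in range(n_frames):
        dev += random.uniform(-max_step, max_step)
        dev = max(-max_dev, min(max_dev, dev))  # stay within the overall bound
        yield f0 * (1.0 + dev)

# 100 frames of pitch wandering around 220 Hz:
pitches = list(jittered_pitch(220.0, 100))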

Text to Speech (Taylor) is the most current text of the batch, and it’s quite good. I only skimmed it lightly, but there’s a lot of material in it. Written after the (over-enthusiastic) books I’d just read, it explains why getting high-quality results from formant synthesis is difficult, and why the approach has been essentially abandoned since the 1980s:

That said, the main criticism of formant synthesis is that it doesn’t sound natural. It can in fact produce quite intelligible speech, and is still competitive in that regard with some of the best of today’s systems. The reason for this goes back to our initial discussion on the nature of communication in Chapter 2. There we showed that a primary factor in successful communication was that of contrast between symbols (phonemes in our case). Hence the success of the message-communicating aspect of speech (i.e. the intelligibility) can be achieved so long as the signal identifies the units in a contrasting fashion such that they can successfully be decoded. In the target-transition model, the targets are of course based on typical or canonical values of the formants for each phoneme; and as such should be clear and distinct from one another. The fact that the transitions between them don’t accurately follow natural patterns only slightly detracts from the ability to decode the signal. In fact, it could be argued that good formant synthesis could produce more intelligible speech than slurred natural speech, as the targets are harder to identify in the latter.

I thought this was particularly interesting. So while formant synthesis may not sound “natural”, it may still be “good enough” to use for synSinger.
