The core of the new rendering engine is an inverse DFT (Discrete Fourier Transform).
Each voiced phoneme target is described as a piecewise spectral envelope which, given a frequency, returns the amplitude and phase at that frequency.
Rendering is done on a wave-by-wave basis. Given a fundamental frequency f0, the amplitude for all the harmonics up to about 4000 Hz is calculated. One or more waves can then be generated using the inverse DFT (Discrete Fourier Transform).
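As a sketch of the idea (not synSinger's actual code — the function and parameter names here are illustrative), one cycle can be rendered by summing a sinusoid for each harmonic of f0, with amplitude and phase looked up from the spectral envelope:

```python
import math

def render_wave(f0, envelope, sample_rate=44100, max_freq=4000.0):
    """Render one cycle of a voiced waveform by summing harmonics of f0.

    `envelope` is assumed to be a callable mapping a frequency in Hz to an
    (amplitude, phase) pair -- the piecewise spectral envelope described
    above.
    """
    n_harmonics = int(max_freq // f0)           # harmonics up to ~4000 Hz
    samples_per_cycle = int(round(sample_rate / f0))
    wave = []
    for i in range(samples_per_cycle):
        t = i / samples_per_cycle               # position within cycle, 0..1
        value = 0.0
        for h in range(1, n_harmonics + 1):
            amp, phase = envelope(h * f0)
            value += amp * math.sin(2 * math.pi * h * t + phase)
        wave.append(value)
    return wave

# Toy envelope: amplitude falls off as 1/harmonic, zero phase.
cycle = render_wave(220.0, lambda freq: (220.0 / freq, 0.0))
```

Repeating this cycle (or re-rendering it as the envelope morphs) produces the voiced output buffer.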
Transitions between one phoneme target and another are accomplished via morphing, described in an earlier post.
The rendering is done by converting the phonemes into a series of instructions. Currently, the following instructions are implemented, but the list is subject to change:
- SET sets a spectral envelope as active, but does not render anything.
- ATTACK starts the amplitude ramping up again, if not already at 100%.
- PLOSIVE_ATTACK is like ATTACK, but starts at 40% amplitude instead of 0%.
- SILENCE renders a section of silence and sets the amplitude to zero.
- SAMPLE renders a sample into the buffer. This is used for plosives and fricatives.
- HOLD renders n cycles of the current target phoneme.
- RELEASE renders n cycles of the current phoneme, fading the amplitude to zero.
- MORPH renders n cycles of the current phoneme morphing into a new phoneme.
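To make the dispatch concrete, here is a minimal sketch of the instruction loop in Python (the opcode names follow the list above, but the handler names and argument conventions are my assumptions, not synSinger's actual internals):

```python
def run(program, engine):
    """Dispatch each (opcode, argument) pair to a hypothetical engine."""
    for opcode, arg in program:
        if opcode == "SET":
            engine.set_target(arg)          # make a spectral envelope active
        elif opcode == "ATTACK":
            engine.ramp_amplitude(0.0)      # ramp back up toward 100%
        elif opcode == "PLOSIVE_ATTACK":
            engine.ramp_amplitude(0.4)      # start at 40% instead of 0%
        elif opcode == "SILENCE":
            engine.render_silence(arg)      # render silence, amplitude to 0
        elif opcode == "SAMPLE":
            engine.render_sample(arg)       # plosives and fricatives
        elif opcode == "HOLD":
            engine.hold(arg)                # n cycles of the current target
        elif opcode == "RELEASE":
            engine.release(arg)             # fade amplitude to zero
        elif opcode == "MORPH":
            engine.morph(arg)               # morph into the new target
        else:
            raise ValueError(f"unknown opcode: {opcode}")
```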
The core rendering is relatively simple, but converting the phonemes into the proper instructions requires a lot of attention to detail. For example, the instructions for the phonemes “T AA” look something like this:
```
SILENCE          -- start of the stop consonant
SAMPLE T         -- plosive portion of the /T/
PLOSIVE_ATTACK   -- start of voicing after plosive
SET T_AA1        -- start of transition
MORPH AA1        -- transition from /T/ to /AA1/
SET AA1          -- held portion starts on /AA1/ target...
MORPH AA2        -- ... and morphs into /AA2/
RELEASE          -- fade out using /AA2/ target
```
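Generating that sequence for an arbitrary plosive + vowel pair might look like the sketch below (hypothetical — synSinger's actual conversion handles many more cases and details than this):

```python
def compile_plosive_vowel(plosive, vowel):
    """Build the instruction list for a plosive followed by a vowel,
    following the /T AA/ example above. The (opcode, argument) encoding
    is an assumption for illustration."""
    return [
        ("SILENCE", None),                 # start of the stop consonant
        ("SAMPLE", plosive),               # plosive portion
        ("PLOSIVE_ATTACK", None),          # voicing starts at 40% amplitude
        ("SET", f"{plosive}_{vowel}1"),    # special transition phoneme
        ("MORPH", f"{vowel}1"),            # transition into the vowel
        ("SET", f"{vowel}1"),              # held portion starts here...
        ("MORPH", f"{vowel}2"),            # ...and morphs to the end target
        ("RELEASE", None),                 # fade out
    ]
```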
There are a number of “special” phonemes here.
Each transition from a plosive to a vowel has a different shape, so I create a phoneme for each plosive/vowel transition. So while the plosive attack of the /T/ has no spectral envelope target (it’s simply a digital sample), the beginning of the /T->AA/ transition is captured as the phoneme /T_AA1/.
Rather than capture every plosive/vowel transition, I only sample these “core” vowels:
/AA/ as in odd
/AE/ as in bat
/AH/ as in hut
/AO/ as in talk
/EH/ as in bet
/IH/ as in bit
/OH/ as in cone
/UH/ as in book
/IY/ as in beet
/ER/ as in bird
(I pronounce /AA/ and /AO/ the same, so I’ll have to go back and correct the /AO/ phonemes at some point).
When a vowel transition is needed, synSinger uses a lookup table to determine which “core” vowel the first part of the vowel corresponds to. Not having to sample every transition is obviously a timesaver.
The phoneme /AA1/ is the sound at the beginning of an /AA/ sound, and /AA2/ is the sound at the end. Strictly speaking, /AA/ is a monophthong and has only one “sound”, but this split allows a natural-sounding transition during the sustained portion of vowels.
To generate the target’s spectral envelope information, I use Praat. I select a position in the .wav file that contains the target, and choose the Formant Listing, which returns the position (in seconds) and the formant information. This is then copied into a file in a format like this:
```
EY EY.wav 112 16
1.903082 572.273148 1790.221828 2551.646472 3458.211829
2.269429 501.891011 1808.743538 2468.468514 3475.364566
2.587991 353.950727 2079.433597 2677.219221 3318.279920
```
The first line (starting with an alphabetic character) contains the phoneme name, the file name, the (optional) approximate frequency, and the (optional) maximum number of cycles to consider for analysis. These last two hints greatly improve the quality of the analysis:
- The frequency helps extract the cycles correctly;
- Short cycles are selected for short transitions (such as plosive/vowel);
- Longer cycles are used for sustained sounds.
The next three lines are the Formant Listing values from Praat; they are automatically converted into three targets — EY1, EY2, and EY3 — and married up to the phoneme data table. Phonemes with three targets are assumed to be diphthongs; those with two or fewer are monophthongs.
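A parser for this file format could look like the following sketch (my reading of the format as described above, not synSinger's actual loader):

```python
def parse_targets(text):
    """Parse one phoneme entry: a header line (name, wav file, optional
    frequency, optional max cycles) followed by numeric lines of time (s)
    plus formant frequencies, each becoming a numbered target."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0]
    name, wav = header[0], header[1]
    freq = float(header[2]) if len(header) > 2 else None
    max_cycles = int(header[3]) if len(header) > 3 else None
    targets = {}
    for i, fields in enumerate(lines[1:], start=1):
        targets[f"{name}{i}"] = {
            "time": float(fields[0]),
            "formants": [float(f) for f in fields[1:]],
        }
    return {"name": name, "wav": wav, "freq": freq,
            "max_cycles": max_cycles, "targets": targets,
            "diphthong": len(targets) == 3}   # 3 targets => diphthong

sample = """EY EY.wav 112 16
1.903082 572.273148 1790.221828 2551.646472 3458.211829
2.269429 501.891011 1808.743538 2468.468514 3475.364566
2.587991 353.950727 2079.433597 2677.219221 3318.279920"""
info = parse_targets(sample)
```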
With Praat doing the heavy lifting, the remaining analysis is pretty straightforward. Cycles are identified using autocorrelation, and then harmonically analyzed using a DFT. The analyzed targets are then written out as spectral envelopes in Lua code.
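In outline, those two steps look like this (a bare-bones sketch of the standard techniques — synSinger's actual analysis surely adds windowing, peak refinement, and so on):

```python
import cmath
import math

def estimate_period(signal, min_lag, max_lag):
    """Estimate cycle length in samples via autocorrelation: the lag with
    the highest correlation against the shifted signal is taken as the
    period."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(signal[i] * signal[i + lag]
                    for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def analyze_cycle(cycle):
    """DFT of one extracted cycle: (amplitude, phase) for each harmonic --
    exactly the data the inverse DFT needs at render time."""
    n = len(cycle)
    harmonics = []
    for h in range(1, n // 2):
        c = sum(cycle[i] * cmath.exp(-2j * math.pi * h * i / n)
                for i in range(n)) * 2 / n
        harmonics.append((abs(c), cmath.phase(c)))
    return harmonics
```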
As the beginning of the post mentions, rendering is basically a reversal of the analysis process.