Although the current method of building diphones works well, I’m finding that there’s probably a better way of doing things.
For example, consider a recording of “maa-maam” (there’s no gap at the hyphen; it’s just there for readability).
This single recording would be cut up into:

sil_m, m_aa, aa_m, m_aa, aa_m, m_sil
As a rule, I’m not using the “pure” vowels, only the transitions. So there’s no
aa diphone. But I could also include longer n-phones, such as aa_m_aa.
Using long strings of phonemes makes for smoother connections, so having more transitions to choose from is a good thing. In the current implementation, the diphones aa_m and m_aa would be used instead of aa_m_aa, so the result would be much the same.
But back to the original recording: “maa-maam”.
In the current phoneme editor, the same phoneme would have to be marked up 5 different times to create 5 diphones. Since these diphones share common split points, that’s a bit redundant.
At first glance, you might think you could just divide up the recording into phonemes, like so:
SILENCE | M | AA | M | AA | M | SILENCE
But it’s not quite that simple. Consonants need to be divided in half, so it’s more like:
SILENCE M | M AA | AA | AA M | M AA | AA | AA M | M SILENCE
Because stop consonants (such as /t/) don’t divide in half nicely, they are split just before the stop consonant.
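The pairing step above can be sketched in a few lines of Python. This is only an illustration: in practice the split points are positions in the recorded audio, not just labels, and the stop-consonant rule affects where the cut lands, not the diphone names.

```python
def diphones(phonemes):
    """Pair each phoneme with its neighbor to name the transitions.

    The sustained vowel portions (the lone AA segments above) are
    skipped, since only the transitions are kept.
    """
    return [f"{a}_{b}" for a, b in zip(phonemes, phonemes[1:])]

print(diphones("sil m aa m aa m sil".split()))
# ['sil_m', 'm_aa', 'aa_m', 'm_aa', 'aa_m', 'm_sil']
```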
The current system of splitting recordings starts to work less well when I want to use longer phoneme strings.
For example, here’s a string from the Arpasing reclist:
"surd worlds verse" (s er d - w er l d z - v er s)
Consider the string “er l d z v er” between “worlds” and “verse”. It can be divided into the following 15 groupings:
er l d z v er
er l d z v
er l d z
er l d
er l
l d z v er
l d z v
l d z
l d
d z v er
d z v
d z
z v er
z v
v er
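The enumeration above is just every contiguous run of two or more phonemes. A quick sketch (a hypothetical helper, not synSinger’s actual code):

```python
def n_phone_groups(phonemes):
    """Return every contiguous run of two or more phonemes, joined with '_'."""
    n = len(phonemes)
    return ["_".join(phonemes[i:j]) for i in range(n) for j in range(i + 2, n + 1)]

print(len(n_phone_groups("er l d z v er".split())))  # 15
```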
Using my current system, I’d have to create 15 different n-phones, manually editing them each time. Plus, they’d each have their own data instead of sharing the same audio.
The reason for creating each distinct n-phone is that it greatly simplifies the process of building smoothly connected phonemes at runtime.
But it would obviously be a great time-saver if the system were to automatically split out the phonemes once they were tagged, instead of having to repeat the process for each phoneme.
Note that synSinger doesn’t use the sustained portion of the vowels when building most of its n-phones.
While phonemes only need their split points declared, vowels need to have their start and end tagged.
For example, the string “maa-maam” (m aa m aa m) would require the markup:
+sil m aa+ +aa m aa+ +aa m sil+
and result in:

sil_m, m_aa, sil_m_aa
aa_m, m_aa, aa_m_aa
aa_m, m_sil, aa_m_sil

m_aa and aa_m now appear twice.
This isn’t a huge savings, but it certainly is better than having to mark up the same audio multiple times.
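Under this scheme, expanding the tagged markup into n-phones could look something like the following sketch. The names parse_markup and expand are hypothetical, and real tags would also carry timing information for the split points.

```python
import re

def parse_markup(markup):
    """Split the '+ ... +' groups out of the markup string into phoneme lists."""
    return [seg.split() for seg in re.findall(r"\+([^+]+)\+", markup)]

def expand(segments):
    """Expand each tagged segment into every contiguous run of 2+ phonemes."""
    result = []
    for seg in segments:
        n = len(seg)
        result += ["_".join(seg[i:j]) for i in range(n) for j in range(i + 2, n + 1)]
    return result

phones = expand(parse_markup("+sil m aa+ +aa m aa+ +aa m sil+"))
print(phones.count("m_aa"), phones.count("aa_m"))  # 2 2
```

Shared n-phones such as m_aa and aa_m fall out as duplicate names pointing at different spots in the same audio, rather than separately marked-up copies.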
Back to the example of “surd worlds verse”. The markup would require 16 tags:
+sil s er+ +er d w er+ +er l d z v er+ +er s sil+
And the result is 26 n-phones.
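The 16-tag count is easy to sanity-check by stripping the “+” markers and counting the remaining phoneme labels (assuming this markup format):

```python
markup = "+sil s er+ +er d w er+ +er l d z v er+ +er s sil+"
tags = markup.replace("+", "").split()
print(len(tags))  # 16
```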
This will require a lot of changes to the phoneme editor, but hopefully fewer on the rendering side of the code.